Create Comparison
To create a new comparison, whether within or across databases, start by navigating to Comparisons via the left-hand sidebar. Click on Create New Comparison to begin the setup.
1. Select Comparison Type:
After clicking on Create New Comparison, you'll have options to choose the Comparison Type either Shallow Comparison or Deep Comparison.
- Shallow Comparison
This comparison stops after finding the first value difference, optimizing runtime for quick difference checks. - Deep Comparison
This comparison performs a thorough scan of both datasets to identify all differences, generating detailed reports, which is perfect for comprehensive validations and data quality checks. - Schema Comparison
This comparison focuses on comparing the structure of the source and target datasets. It checks if the schemas match—such as table names, column names, data types, and constraints—while ignoring row-level data. It is especially useful for verifying structural consistency during migrations or schema updates.
2. After selecting the Comparison Type, give your comparison a name.
3. Choose Data Sources:
Select your source and target data source types.
4. Next, indicate whether you want to compare tables or queries:
Table:
You can select this option to compare data directly from database tables.
-
Choose a dataset from the available options.
-
Filter: Add a filter clause to narrow down your dataset, helping you focus only on relevant data points for analysis.
For example, if you want to filter products with a price above a certain value, you can use
product.price > 100
.
Query:
Select this option to compare datasets using custom SQL queries.
-
Write a custom SQL query to retrieve specific data based on your requirements.
-
For example, if you want to compare sales data for a specific year where the transaction amount is greater than 500, you can use:
SELECT transaction_id, customer_id, amount, transaction_date FROM sales WHERE YEAR(transaction_date) = 2023 AND amount > 500;
5. Primary Keys
Define the matching keys between your Source and Target datasets. These keys are used to align records for accurate comparison. Select the relevant fields using the dropdowns, and click Add Primary Key to include more keys if needed. This step ensures precise row-level matching.
6. Click Next to continue.
7. Column Selection:
On the next screen, you’ll see all the columns available for comparison. You can select all columns or choose specific ones. However, you cannot select columns that are exclusive to one dataset. To enforce case sensitivity during comparison, use the Case Sensitive toggle — this ensures that differences in uppercase and lowercase characters (e.g., name
vs Name
) are treated as mismatches.
8. Add Column Mapping
Column Mapping allows you to define how fields from the source dataset should be matched with fields in the target dataset. This ensures accurate data comparison between two datasets.
To add column mapping, choose the column from the source dataset and the corresponding column in the target dataset.
Click Add Column Mapping Pair to map additional columns
9 Attach Rules (Optional)
Rules help you manage variations in comparison results by controlling how specific differences are handled. For example, you might want to ignore differences in column lengths, treat NULL and empty strings as equivalent, or bypass case mismatches.
You can configure rules from Settings > Workspaces > Your Workspace > Comparison Rules. For more information, see Comparison Rules.
To add a rule, click + Add Rule and select a predefined rule from the dropdown. You can add multiple rules to customize the comparison behavior as needed.
9. Configuring Semantic Similarity for Comparison:
Semantic Similarity is only available for Shallow Comparisons.
Semantic Similarity helps identify data variations by comparing text-based columns using different similarity functions and pre-processing techniques. To configure semantic similarity:
- Toggle the switch to activate semantic similarity for this comparison
- Choose a text analysis model that best suits your comparison needs
- Apply Pre-Processing Functions like lowercasing and punctuation removal to standardize text. For more details, see Pre-Processing Functions.
- Choose a Similarity Function, such as Levenshtein Distance, to measure text closeness. For more info, refer to Similarity Functions.
- Set a Match Threshold: Define a similarity score (0 to 1) to determine matching accuracy. For instance, setting 0.8 means values with at least 80% similarity will be considered a match.
10. Advanced Configurations
Customize the performance and accuracy of your comparisons with the following advanced settings:
Bisection Threshold – Defines the minimum record count threshold above which the dataset is split into smaller segments for comparison. For example, if your dataset has 120,000
records and the threshold is set at 50,000
, datachecks will divide the dataset into smaller parts (e.g., three segments) to compare them efficiently. Segments smaller than this size are downloaded and compared locally to optimize performance.
Bisection Factor – Sets the division factor used when breaking down large datasets during bisection. For instance, if the factor is set to 10
, the dataset will be split into 10 smaller segments
for each comparison iteration. If set to 2
, it performs a binary search by dividing the dataset into 2 segments
per iteration, making the comparison process more focused and efficient.
Max Threadpool Size – Controls the maximum number of parallel threads used for comparison, optimizing processing speed for larger datasets. For example, if set to 10
, datachecks can run up to 10
simultaneous comparison tasks, speeding up the overall process by leveraging multiple workers. This setting balances performance with resource usage, preventing overload on your database or server.
Egress Limit - Defines the maximum number of differing rows that triggers an automatic stop in the diffing process. This prevents excessive load on the database and reduces unnecessary data transfer when large discrepancies occur.
For example, if you set it to 50,000
, datachecks will halt comparisons as soon as it identifies 50,000 differences
, minimizing network and database usage.
Per Column Diff Limit - Specifies the limit on the number of differences detected per column, including mismatches, exclusive values, and duplicate primary keys. Once this threshold is reached for a column, checking for differences in that column stops. If all columns hit their limits, the entire comparison ends.
If set to 100
, column comparison stops after 100 differences are found, speeding up the process by avoiding over-analysis of noisy columns.
Timeout Limit - Sets the maximum allowed duration (in minutes) for a comparison task before it is automatically cancelled. For example, if set to 300
, any comparison running longer than 5 hours
will be timed out and stopped, and you will receive the report generated up to that point. This ensures long-running or stuck jobs don’t block the processing queue.
Use these settings to fine-tune the balance between speed, resource consumption, and accuracy to best fit your environment and dataset size.
11. Finally, click Submit to successfully create your comparison.
Updated 12 days ago