Create Comparison
To create a new comparison, whether within or across databases, start by navigating to Comparisons via the left-hand sidebar. Click on Create New Comparison to begin the setup.
1. Select Comparison Type:
After clicking on Create New Comparison, you'll have options to choose the Comparison Type either Shallow Comparison or Deep Comparison.
- Shallow Comparison
Stops at the first difference, providing a faster check. Suitable for basic validation where only the presence of discrepancies is important. - Deep Comparison
Thoroughly scans both datasets to identify all differences and generate detailed reports. Ideal for comprehensive validations and data quality assessments.
2. After selecting the Comparison Type, give your comparison a name.
3. Choose Data Sources:
Select your source and target data source types.
4. Next, indicate whether you want to compare tables or queries:
Table:
You can select this option to compare data directly from database tables.
-
Choose a dataset from the available options.
-
Filter: Add a filter clause to narrow down your dataset, helping you focus only on relevant data points for analysis.
For example, if you want to filter products with a price above a certain value, you can use
product.price > 100
.
Query:
Select this option to compare datasets using custom SQL queries.
-
Write a custom SQL query to retrieve specific data based on your requirements.
-
For example, if you want to compare sales data for a specific year where the transaction amount is greater than 500, you can use:
SELECT transaction_id, customer_id, amount, transaction_date FROM sales WHERE YEAR(transaction_date) = 2023 AND amount > 500;
5. Primary Keys
Define the matching keys between your Source and Target datasets. These keys are used to align records for accurate comparison. Select the relevant fields using the dropdowns, and click Add Primary Key to include more keys if needed. This step ensures precise row-level matching.
6. Click Next to continue.
7. Column Selection:
On the next screen, you’ll see all the columns available for comparison. You can select all columns or choose specific ones. However, you cannot select columns that are exclusive to one dataset. To enforce case sensitivity during comparison, use the Case Sensitive toggle — this ensures that differences in uppercase and lowercase characters (e.g., name
vs Name
) are treated as mismatches.
8. Add Column Mapping
Column Mapping allows you to define how fields from the source dataset should be matched with fields in the target dataset. This ensures accurate data comparison between two datasets.
To add column mapping, choose the column from the source dataset and the corresponding column in the target dataset.
Click Add Column Mapping Pair to map additional columns
9. Configuring Semantic Similarity for Comparison:
Semantic Similarity is only available for Shallow Comparisons.
Semantic Similarity helps identify data variations by comparing text-based columns using different similarity functions and pre-processing techniques. To configure semantic similarity:
- Toggle the switch to activate semantic similarity for this comparison
- Choose a text analysis model that best suits your comparison needs
- Apply Pre-Processing Functions like lowercasing and punctuation removal to standardize text. For more details, see Pre-Processing Functions.
- Choose a Similarity Function, such as Levenshtein Distance, to measure text closeness. For more info, refer to Similarity Functions.
- Set a Match Threshold: Define a similarity score (0 to 1) to determine matching accuracy. For instance, setting 0.8 means values with at least 80% similarity will be considered a match.
10. Advanced Configurations
Customize the performance and accuracy of your comparisons with the following advanced settings:
Bisection Threshold – Defines the minimum record count threshold above which the dataset is split into smaller segments for comparison. For example, if your dataset has 120,000 records and the threshold is set at 50,000, the system will divide the dataset into smaller parts (e.g., three segments) to compare them efficiently. Segments smaller than this size are downloaded and compared locally to optimize performance.
Bisection Factor – Sets the division factor used when breaking down large datasets during bisection. For instance, if the factor is set to 10, the dataset will be split into 10 smaller segments for each comparison iteration. If set to 2, it performs a binary search by dividing the dataset into 2 segments per iteration, making the comparison process more focused and efficient.
Max Threadpool Size – Controls the maximum number of parallel threads used for comparison, optimizing processing speed for larger datasets. For example, if set to 10, the system can run up to 10 simultaneous comparison tasks, speeding up the overall process by leveraging multiple workers. This setting balances performance with resource usage, preventing overload on your database or server.
Use these settings to fine-tune the balance between speed, resource consumption, and accuracy to best fit your environment and dataset size.
11. Finally, click Submit to successfully create your comparison.
Updated 4 days ago