During a blue-green deployment, discrepancies can arise between environments, and we need to address them to ensure data integrity. This calls for an effective data diff solution that complements existing frameworks and improves the data validation process. If we can do that with zero budget, that is even better.
In this article, we are going to learn what data diffing is and explore such a solution that can be implemented with dbt and Snowflake.
What is data diff?
Data diff is the process of comparing data entities between two environments, covering both differences in data structure and variances at the value level.
Data diffing can help us find potentially critical issues that traditional testing methodologies, such as unit tests and integration tests, tend to miss.
Simply put, diffing is powerful when it comes to detecting discrepancies that might otherwise go unnoticed.
The practice of data diffing involves the following steps:
- Establishing the base definition (synonymous with data reconciliation);
- Settling on three diffing levels:
- Key diff: Compare the key values only;
- Schema diff: Compare the data structure (column names and data types);
- Content diff: Compare all cell values;
- Creating an initial version with manual guidance that serves the following requirements:
- Entity configuration for comparison;
- Result view for summary and detailed drill-down;
- Query logging for observability;
- Progressing toward automation and optimisation:
- Packaging all functionality in a dbt package;
- Optimising runtime speed through parallelism.
Understanding data diffing
The term data diff still lacks widespread recognition within the modern data stack. To align with more widely accepted terminology, we can consider it synonymous with data reconciliation.
Our data diffing effort aims at:
- Establishing the base definition;
- Developing a first version that delivers a good-enough result with a manual guideline;
- Adding automation and flexibility in a dbt package in the next version.
The good-enough result should show us the diffing at the row and column level.
We are looking to implement a robust, efficient method that complements existing data quality frameworks and enhances the overall accuracy and reliability of data validation processes during deployment, user acceptance testing and code refactoring.
We borrow what was implemented in Datafold’s data diff solution and combine it with the concept of a 3-level medallion architecture (described by Databricks) in a diffing depth-level context. Hence, we will work with 3 categories of diffing, namely, key diff, schema diff and content diff.
Moreover, we will need a prior configuration of the entities to be compared as well as logging which underlying queries run with corresponding timestamps for observability usage. Ultimately, the solution will produce the result view of summary and will also enable us to drill down into the details.
The data stack we work with here is Snowflake and dbt. So, we are going to use Snowflake Scripting stored procedures (one stored procedure per diff level) to build the solution and then create the trigger as a dbt macro.
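As a rough sketch, each diff level can live in a Snowflake Scripting stored procedure shaped like the one below. The procedure name, argument and return value are our assumptions, not a final interface:

```sql
-- Hypothetical skeleton: one stored procedure per diff level.
CREATE OR REPLACE PROCEDURE key_diff(config_id VARCHAR)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
BEGIN
    -- 1. Read the Source/Target FQNs and the pk from the entity configuration;
    -- 2. Log the underlying query with a timestamp for observability;
    -- 3. Run the diffing query and persist the result for the summary view.
    RETURN 'KEY_DIFF completed for config ' || config_id;
END;
$$;
```

The same shape repeats for the schema diff and content diff procedures, with only the diffing query changing.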
Now, without further ado, let’s explore the implementation.
Entity configuration
Before any diffing runs, we need to configure the entities that we are going to compare. Each entity contains the required information about the two environments; let's call them Source and Target.
Assuming the Source and Target are located in the same accessible storage, essentially the configuration looks like this:
- FQN of the Source table or view: source_database.source_schema.source_table_name
- FQN of the Target table or view: target_database.target_schema.target_table_name
- Primary key, or the unique combination of columns: pk
- Optionally, we will need:
- exclude_columns/include_columns: The ability to select columns by exclusion or inclusion;
- where: The ability to filter a subset of the data.
Here is the sample config in code:
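The exact shape depends on where you keep the configuration; assuming a simple config table in the warehouse (the table name, columns and sample values below are illustrative), it could look like this:

```sql
-- Illustrative configuration table; all names here are assumptions.
CREATE TABLE IF NOT EXISTS data_diff_config (
    source_fqn      VARCHAR,  -- source_database.source_schema.source_table_name
    target_fqn      VARCHAR,  -- target_database.target_schema.target_table_name
    pk              VARCHAR,  -- primary key or unique combination of columns
    include_columns VARCHAR,  -- optional comma-separated inclusion list
    exclude_columns VARCHAR,  -- optional comma-separated exclusion list
    where_clause    VARCHAR   -- optional row filter
);

INSERT INTO data_diff_config VALUES (
    'blue_db.analytics.orders',
    'green_db.analytics.orders',
    'order_id',
    NULL,
    '_loaded_at',
    'order_date >= ''2023-01-01'''
);
```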
Level 1: Key diff
At this level, everything is fairly straightforward. We will perform a full join of Source and Target, mainly to find out which keys are exclusive to Source and which keys are exclusive to Target.
Behind the scenes, the SQL query looks like this:
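A minimal sketch of that query, assuming a single-column key called pk:

```sql
-- Classify every key as exclusive to Source, exclusive to Target, or matched.
SELECT
    COALESCE(s.pk, t.pk) AS pk,
    CASE
        WHEN t.pk IS NULL THEN 'SOURCE_ONLY'
        WHEN s.pk IS NULL THEN 'TARGET_ONLY'
        ELSE 'MATCHED'
    END AS key_status
FROM source_database.source_schema.source_table_name AS s
FULL OUTER JOIN target_database.target_schema.target_table_name AS t
    ON s.pk = t.pk;
-- Add: WHERE s.pk IS NULL OR t.pk IS NULL to keep only the exclusive keys.
```

For a composite key, the join condition simply expands to one equality per key column.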
Level 2: Schema diff
Now, we are going to look at INFORMATION_SCHEMA.COLUMNS in order to query the underlying table’s schema, including column names and data types.
We expect to find out:
- Which columns are exclusive in Source based on the column name;
- Which columns are exclusive in Target based on the column name;
- Which columns are mutual, with a check that their data types match.
Here, the SQL query could look like this:
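A sketch of the idea, with the schema and table filters as placeholders:

```sql
-- Full join on column name to find exclusive and mutual columns.
WITH src AS (
    SELECT column_name, data_type
    FROM source_database.INFORMATION_SCHEMA.COLUMNS
    WHERE table_schema = 'SOURCE_SCHEMA'
      AND table_name   = 'SOURCE_TABLE_NAME'
),
tgt AS (
    SELECT column_name, data_type
    FROM target_database.INFORMATION_SCHEMA.COLUMNS
    WHERE table_schema = 'TARGET_SCHEMA'
      AND table_name   = 'TARGET_TABLE_NAME'
)
SELECT
    COALESCE(src.column_name, tgt.column_name) AS column_name,
    src.data_type AS source_data_type,
    tgt.data_type AS target_data_type,
    CASE
        WHEN tgt.column_name IS NULL THEN 'SOURCE_ONLY'
        WHEN src.column_name IS NULL THEN 'TARGET_ONLY'
        WHEN src.data_type = tgt.data_type THEN 'MATCHED'
        ELSE 'TYPE_MISMATCH'
    END AS schema_status
FROM src
FULL OUTER JOIN tgt
    ON src.column_name = tgt.column_name;
```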
Level 3: Content diff
Using the matched keys and the mutual columns, we are going to leverage the MINUS/EXCEPT operator to build a query that compares every cell value between Source and Target.
Alternatively, we can use HASH_AGG, which is a much faster option (see this article on data validation after refactoring in Snowflake), but this is a trade-off between speed and the drill-down capability that points us to the columns that differ.
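For illustration, a HASH_AGG-based check boils down to comparing two scalars, since the aggregate hashes a whole relation into a single value:

```sql
-- Fast whole-table comparison: equal hashes mean equal content,
-- but a mismatch does not tell us which rows or columns differ.
SELECT
    (SELECT HASH_AGG(*) FROM source_database.source_schema.source_table_name)
    =
    (SELECT HASH_AGG(*) FROM target_database.target_schema.target_table_name)
    AS tables_match;
```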
We are going to start with MINUS and consider adopting the HASH method later. The expected outcome is:
- A summary report of column-level diffing with the corresponding match percentage;
- The ability to drill down into which rows are different.
The SQL query could look like this:
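A sketch of the MINUS-based comparison; pk, col_a and col_b stand in for the matched key and the mutual columns discovered by the schema diff:

```sql
-- Rows present in Source but not in Target, and vice versa,
-- restricted to the mutual columns.
(
    SELECT 'SOURCE_ONLY' AS row_status, pk, col_a, col_b
    FROM source_database.source_schema.source_table_name
    MINUS
    SELECT 'SOURCE_ONLY', pk, col_a, col_b
    FROM target_database.target_schema.target_table_name
)
UNION ALL
(
    SELECT 'TARGET_ONLY' AS row_status, pk, col_a, col_b
    FROM target_database.target_schema.target_table_name
    MINUS
    SELECT 'TARGET_ONLY', pk, col_a, col_b
    FROM source_database.source_schema.source_table_name
);
```

The parentheses matter: UNION ALL and MINUS share the same precedence in Snowflake, so each MINUS pair must be grouped explicitly.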
Finally, we bring them all together in a dbt package with a dedicated macro where we call all the stored procedures:
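A minimal version of such a macro could look like the sketch below; the procedure names and the config_id argument are our assumptions:

```sql
{% macro data_diff(config_id) %}
    {# Hypothetical sketch: run the three diff levels in order.
       Each stored procedure logs its own queries and writes
       its output to the result tables behind the summary view. #}
    {% for proc in ['KEY_DIFF', 'SCHEMA_DIFF', 'CONTENT_DIFF'] %}
        {% do run_query("CALL " ~ proc ~ "('" ~ config_id ~ "')") %}
    {% endfor %}
{% endmacro %}
```

From there, a single `dbt run-operation data_diff --args '{config_id: orders}'` style invocation can drive the whole comparison.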
We first came across the concept of the data diff during a blue-green deployment. In the context of data quality assurance before swapping, unit tests and integration tests may overlook critical issues that arise from discrepancies in data entities between different environments (blue vs green).
To address this gap, we need an effective data diff solution that can comprehensively compare data structures and values across environments, providing insights into hidden issues that impact data integrity.
Maintaining data integrity is crucial for the entire organisation with all the data consumers. If you need a hand building a robust, trustworthy system that makes ensuring data quality a breeze, give us a shout. We are here to help.
This article was inspired by:
- Data Validation After Refactoring in Snowflake | The Infinite Lambda Blog
- Performing a blue/green deploy of your dbt project on Snowflake | dbt Labs
- Data Diff | Datafold Blog
- Data validation and reconciliation | Wiki