
Data Diff Validation in Blue-Green Deployments

Dat Nguyen
January 31, 2024
Read: 4 min

During a blue-green deployment, discrepancies can arise between environments, and we need to address them to ensure data integrity. This calls for an effective data diff solution that complements existing frameworks and improves the data validation process. If we can do that with zero budget, even better.

In this article, we are going to learn what data diffing is and explore such a solution that can be implemented with dbt and Snowflake.

What is data diff?

Data diff is the process of comparing data entities between two environments, covering both differences in data structure and variances at the value level.

Data diffing can help us find potentially critical issues that traditional testing methodologies, such as unit tests and integration tests, fail to catch.

Simply put, diffing is powerful when it comes to detecting discrepancies that might otherwise go unnoticed.


The practice of data diffing involves the following steps:

  1. Starting with the base definition:
    1. Treating data diff as synonymous with data reconciliation;
    2. Finalising three diffing levels:
      1. Key diff: Compare the key values only;
      2. Schema diff: Compare the data structure (column names and data types);
      3. Content diff: Compare all cell values;
  2. Creating an initial version with manual guidance, serving the following requirements:
    1. Entity configuration for comparison;
    2. Result view for summary and detailed drill-down;
    3. Query logging for observability;
  3. Progressing toward automation and optimisation:
    1. Packaging all functionalities in a dbt package;
    2. Optimising runtime speed through parallelism.

Understanding data diffing

The term data diff still lacks widespread recognition within the current modern data stack. To align with more widely accepted terminology, we can consider it synonymous with data reconciliation.

Data diffing aims at:

  • Having the base definition;
  • Developing a first version that delivers a good-enough result with a manual guideline;
  • Adding automation and flexibility in a dbt package in the next version.

The good-enough result should show us the diffs at both the row and the column level.

Implementation

We are looking to implement a robust, efficient method that complements existing data quality frameworks and enhances the overall accuracy and reliability of data validation processes during deployment, user acceptance testing and code refactoring.

We borrow from what was implemented in Datafold’s data diff solution and combine it with the concept of a 3-level medallion architecture (described by Databricks), applied here to the depth of diffing. Hence, we will work with 3 categories of diffing, namely key diff, schema diff and content diff.

Moreover, we will need a prior configuration of the entities to be compared, as well as logging of the underlying queries that run, with corresponding timestamps, for observability purposes. Ultimately, the solution will produce a summary result view and will also enable us to drill down into the details.

The data stack we work with here is Snowflake and dbt. So, we are going to use Snowflake Scripting stored procedures (one stored procedure per diff level) to build the solution and then create the trigger as a dbt macro.
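
To make the observability requirement concrete, a minimal sketch of a query log table could look like the one below; the table and column names are assumptions for illustration, not a fixed specification.

-- Hypothetical log table capturing every diff query that runs, with timestamps
CREATE TABLE IF NOT EXISTS data_diff_query_log (
    run_id        VARCHAR,        -- identifier of a single diff run
    diff_level    VARCHAR,        -- 'key', 'schema' or 'content'
    entity_name   VARCHAR,        -- configured entity being compared
    query_text    VARCHAR,        -- the underlying SQL that was executed
    started_at    TIMESTAMP_NTZ,  -- when the query started
    finished_at   TIMESTAMP_NTZ   -- when the query finished
);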

Now, without further ado, let’s explore the implementation.

Configuration

Before any diffing runs, we need to configure the entities that we are going to compare. Each entity contains the required information of the two environments; let’s call them Source and Target.

Assuming the Source and Target are located in the same accessible storage, essentially the configuration looks like this:

  • FQN of the Source table or view: source_database.source_schema.source_table_name
  • FQN of the Target table or view: target_database.target_schema.target_table_name
  • Primary key, or the unique combination of columns: pk
  • Optionally, we will need:
    • exclude_columns/include_columns: Ability to select columns by exclusion or inclusion
    • where: Ability to filter down to a subset of data

Here is what a sample config could look like in code:
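
The sketch below assumes the configuration is stored in a small Snowflake table; the table name data_diff_config and its columns are illustrative rather than the exact implementation.

-- Illustrative configuration table: one row per entity to compare
CREATE TABLE IF NOT EXISTS data_diff_config (
    entity_name      VARCHAR,  -- logical name of the comparison
    source_fqn       VARCHAR,  -- source_database.source_schema.source_table_name
    target_fqn       VARCHAR,  -- target_database.target_schema.target_table_name
    pk               VARCHAR,  -- primary key or unique combination of columns
    include_columns  VARCHAR,  -- optional: comma-separated columns to include
    exclude_columns  VARCHAR,  -- optional: comma-separated columns to exclude
    where_condition  VARCHAR   -- optional: filter applied to both sides
);

INSERT INTO data_diff_config
    (entity_name, source_fqn, target_fqn, pk, exclude_columns, where_condition)
VALUES
    ('orders',
     'blue_db.sales.orders',
     'green_db.sales.orders',
     'order_id',
     'loaded_at',
     'order_date >= dateadd(day, -30, current_date)');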

 

Level 1: Key diff

At this level, everything is fairly straightforward. We will perform a full join of Source and Target, mainly to find out which keys are exclusive to Source and which keys are exclusive to Target.

Behind the scenes, the SQL query could look like this:
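
Here is a simplified sketch for a single entity, using the illustrative table names and primary key from the configuration sketch above; the real query would be generated from the config.

-- Key diff: full outer join on the primary key to find keys exclusive to one side
WITH src AS (
    SELECT order_id AS pk FROM blue_db.sales.orders
),
tgt AS (
    SELECT order_id AS pk FROM green_db.sales.orders
)
SELECT
    COALESCE(src.pk, tgt.pk) AS pk,
    CASE
        WHEN tgt.pk IS NULL THEN 'EXCLUSIVE_IN_SOURCE'
        WHEN src.pk IS NULL THEN 'EXCLUSIVE_IN_TARGET'
        ELSE 'MATCHED'
    END AS key_status
FROM src
FULL OUTER JOIN tgt
    ON src.pk = tgt.pk;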

 

Level 2: Schema diff

Now, we are going to look at INFORMATION_SCHEMA.COLUMNS in order to query the underlying table’s schema, including column names and data types.

We expect to find out:

  • Which columns are exclusive in Source based on the column name;
  • Which columns are exclusive in Target based on the column name;
  • The mutual columns, with a check that their data types match.

Here, the SQL query could look like this:
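
A simplified sketch, again with illustrative database, schema and table names:

-- Schema diff: compare column names and data types via INFORMATION_SCHEMA.COLUMNS
WITH src AS (
    SELECT column_name, data_type
    FROM blue_db.information_schema.columns
    WHERE table_schema = 'SALES' AND table_name = 'ORDERS'
),
tgt AS (
    SELECT column_name, data_type
    FROM green_db.information_schema.columns
    WHERE table_schema = 'SALES' AND table_name = 'ORDERS'
)
SELECT
    COALESCE(src.column_name, tgt.column_name) AS column_name,
    src.data_type AS source_data_type,
    tgt.data_type AS target_data_type,
    CASE
        WHEN tgt.column_name IS NULL THEN 'EXCLUSIVE_IN_SOURCE'
        WHEN src.column_name IS NULL THEN 'EXCLUSIVE_IN_TARGET'
        WHEN src.data_type = tgt.data_type THEN 'MATCHED'
        ELSE 'TYPE_MISMATCH'
    END AS schema_status
FROM src
FULL OUTER JOIN tgt
    ON src.column_name = tgt.column_name;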

 

Level 3: Content diff

Using the matched keys and the mutual columns, we are going to leverage the MINUS/EXCEPT operator to build a query that compares every cell value between Source and Target.

Alternatively, we can use HASH_AGG, which is a much faster option (see this article on data validation after refactoring in Snowflake), but this is a trade-off between speed and the drill-down capability that points us to the columns that differ.

We are going to try MINUS first and see whether we can bring in the HASH method later.
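
For reference, the HASH_AGG variant only takes a couple of lines. It gives a quick verdict on whether the two sides are identical, but no drill-down into which columns or rows differ (the table names below are illustrative):

-- Fast equality check: identical hashes indicate identical content
SELECT
    (SELECT HASH_AGG(*) FROM blue_db.sales.orders)
    = (SELECT HASH_AGG(*) FROM green_db.sales.orders) AS is_identical;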

We expect:

  • Summary reports on column-level diffing with the corresponding matching percentage;
  • Being able to drill down into which rows are different.

The SQL query could look like this:
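
Below is a simplified MINUS-based sketch for a handful of mutual columns; in practice, the column list would be derived from the configuration and the schema diff rather than hard-coded.

-- Content diff: rows present on one side but not identical on the other
SELECT 'ONLY_IN_SOURCE' AS diff_side, s.*
FROM (
    SELECT order_id, customer_id, amount, status
    FROM blue_db.sales.orders
    MINUS
    SELECT order_id, customer_id, amount, status
    FROM green_db.sales.orders
) AS s
UNION ALL
SELECT 'ONLY_IN_TARGET' AS diff_side, t.*
FROM (
    SELECT order_id, customer_id, amount, status
    FROM green_db.sales.orders
    MINUS
    SELECT order_id, customer_id, amount, status
    FROM blue_db.sales.orders
) AS t;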

Finally, we bring them all together in a dbt package with a dedicated macro where we call all the stored procedures:
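
A minimal sketch of such a macro, assuming the three stored procedures are named key_diff, schema_diff and content_diff, live in a utils schema and accept the entity name as their only argument (all of these names are illustrative):

{% macro run_data_diff(entity_name) %}
    {# Call one stored procedure per diff level for the configured entity #}
    {% set levels = ['key_diff', 'schema_diff', 'content_diff'] %}
    {% for level in levels %}
        {% set call_sql %}
            call {{ target.database }}.utils.{{ level }}('{{ entity_name }}')
        {% endset %}
        {% do run_query(call_sql) %}
    {% endfor %}
{% endmacro %}

It could then be triggered with something like dbt run-operation run_data_diff --args '{entity_name: orders}', for example as a step before swapping blue and green.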

We first came across the concept of data diff during a blue-green deployment. In the context of data quality assurance before swapping, unit tests and integration tests may overlook critical issues that arise from discrepancies in data entities between the different environments (blue vs green).

To address this gap, we need an effective data diff solution that can comprehensively compare data structures and values across environments, providing insights into hidden issues that impact data integrity.

Maintaining data integrity is crucial for the entire organisation with all the data consumers. If you need a hand building a robust, trustworthy system that makes ensuring data quality a breeze, give us a shout. We are here to help.

 

