
Data Diff Validation in Blue-Green Deployments

Dat Nguyen
January 31, 2024
Read: 4 min

During a blue-green deployment, discrepancies can arise between environments, and we need to address them to ensure data integrity. This calls for an effective data diff solution that complements existing frameworks and improves the data validation process. If we can do that with zero budget, that is even better.

In this article, we are going to learn what data diffing is and explore such a solution that can be implemented with dbt and Snowflake.

What is data diff?

Data diff is the process of comparing data entities between two environments, covering both differences in data structure and variances at the value level.

Data diffing can help us find potentially critical issues that traditional testing methodologies, such as unit tests and integration tests, tend to miss.

Simply put, diffing is powerful when it comes to detecting discrepancies that might otherwise go unnoticed.


The practice of data diffing involves the following steps:

  1. Starting with the base definition:
    1. It is synonymous with data reconciliation;
    2. It settles on three diffing levels:
      1. Key diff: Compare the key values only;
      2. Schema diff: Compare the data structure (column names and data types);
      3. Content diff: Compare all cell values;
  2. Creating an initial version with manual guidance serving the following requirements:
    1. Entity configuration for comparison;
    2. Result view for summary and detailed drill-down;
    3. Query logging for observability;
  3. Progressing toward automation and optimisation:
    1. Packaging all functionalities in a dbt package;
    2. Optimising runtime speed through parallelism.

Understanding data diffing

The term data diff still lacks widespread recognition within the current modern data stack. To align with more widely accepted terminology, we can consider it synonymous with data reconciliation.

Our data diffing effort aims at:

  • Establishing the base definition;
  • Developing a first version that delivers a good-enough result with a manual guideline;
  • Adding automation and flexibility in a dbt package in the next version.

The good-enough result should show us the diffs at both the row and column level.

Implementation

We are looking to implement a robust, efficient method that complements existing data quality frameworks and enhances the overall accuracy and reliability of data validation processes during deployment, user acceptance testing and code refactoring.

We borrow what was implemented in Datafold’s data diff solution and combine it with the concept of the three-level medallion architecture (described by Databricks), applied here to diffing depth levels. Hence, we will work with three categories of diffing, namely, key diff, schema diff and content diff.

Moreover, we will need a prior configuration of the entities to be compared, as well as logging of which underlying queries run, with corresponding timestamps, for observability. Ultimately, the solution will produce a summary result view and also enable us to drill down into the details.

The data stack we work with here is Snowflake and dbt. So, we are going to build the solution with Snowflake Scripting stored procedures (one stored procedure per diff level) and then create the trigger as a dbt macro.

Now, without further ado, let’s explore the implementation.

Configuration

Before any diffing runs, we need to configure the entities that we are going to compare. Each entity contains the required information of the two environments; let’s call them Source and Target.

Assuming the Source and Target are located in the same accessible storage, essentially the configuration looks like this:

  • FQN of the Source table or view: source_database.source_schema.source_table_name
  • FQN of the Target table or view: target_database.target_schema.target_table_name
  • Primary key, or the unique combination of columns: pk
  • Optionally, we will need:
    • exclude_columns/include_columns: Ability to select columns via exclusion or inclusion
    • where: Ability to filter a subset of the data

Here is the sample config in code:
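A minimal sketch of such a configuration, assuming a hypothetical DATA_DIFF_CONFIG table in Snowflake; the table name, column names and sample values below are illustrative only:

-- Hypothetical config table; names and values are illustrative
create table if not exists data_diff_config (
    source_fqn      varchar, -- source_database.source_schema.source_table_name
    target_fqn      varchar, -- target_database.target_schema.target_table_name
    pk              varchar, -- primary key or unique combination of columns
    include_columns varchar, -- optional: comma-separated columns to include
    exclude_columns varchar, -- optional: comma-separated columns to exclude
    where_condition varchar  -- optional: filter for the subset of data
);

insert into data_diff_config values (
    'blue_db.analytics.dim_customer',
    'green_db.analytics.dim_customer',
    'customer_id',
    null,
    'loaded_at',          -- exclude a volatile audit column
    'is_deleted = false'  -- compare only the active subset
);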

 

Level 1: Key diff

At this level, everything is fairly straightforward. We will perform a full join of Source and Target, mainly to find out which keys are exclusive to Source and which keys are exclusive to Target.

Behind the scenes, the SQL query looks like this:
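A minimal sketch of that query, reusing the illustrative entities and pk from the configuration above:

-- Key diff: which keys exist only in Source, only in Target, or in both
with src as (
    select customer_id from blue_db.analytics.dim_customer
),
tgt as (
    select customer_id from green_db.analytics.dim_customer
)
select
    coalesce(src.customer_id, tgt.customer_id) as pk_value,
    case
        when tgt.customer_id is null then 'SOURCE_ONLY'
        when src.customer_id is null then 'TARGET_ONLY'
        else 'MATCHED'
    end as key_diff_status
from src
full outer join tgt
    on src.customer_id = tgt.customer_id;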

 

Level 2: Schema diff

Now, we are going to look at INFORMATION_SCHEMA.COLUMNS in order to query the underlying table’s schema, including column names and data types.

We expect to find out:

  • Which columns are exclusive to Source, based on the column name;
  • Which columns are exclusive to Target, based on the column name;
  • The mutual columns, with a check on whether their data types match.

Here, the SQL query could look like this:
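A minimal sketch, again with illustrative database, schema and table names:

-- Schema diff: compare column names and data types via INFORMATION_SCHEMA
with src_cols as (
    select column_name, data_type
    from blue_db.information_schema.columns
    where table_schema = 'ANALYTICS'
      and table_name   = 'DIM_CUSTOMER'
),
tgt_cols as (
    select column_name, data_type
    from green_db.information_schema.columns
    where table_schema = 'ANALYTICS'
      and table_name   = 'DIM_CUSTOMER'
)
select
    coalesce(s.column_name, t.column_name) as column_name,
    s.data_type as source_data_type,
    t.data_type as target_data_type,
    case
        when t.column_name is null then 'SOURCE_ONLY'
        when s.column_name is null then 'TARGET_ONLY'
        when s.data_type = t.data_type then 'MATCHED'
        else 'TYPE_MISMATCH'
    end as schema_diff_status
from src_cols s
full outer join tgt_cols t
    on s.column_name = t.column_name;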

 

Level 3: Content diff

Using the matched keys and the mutual columns, we are going to leverage the MINUS/EXCEPT operator to build a query that compares every cell value between Source and Target.

Alternatively, we can use HASH_AGG, which is a much faster option (see this article on data validation after refactoring in Snowflake), but this is a trade-off between speed and the drill-down capability that points us to the columns that differ.

We are going to try MINUS first and see if we can adopt the HASH_AGG method later.
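For reference, such a HASH_AGG comparison could be as simple as the sketch below (table and column names are illustrative); matching hashes suggest identical content but give no pointer to the differing rows or columns:

-- Coarse but fast: one aggregate hash per side over the mutual columns
select
    (select hash_agg(customer_id, name, country)
     from blue_db.analytics.dim_customer)  as source_hash,
    (select hash_agg(customer_id, name, country)
     from green_db.analytics.dim_customer) as target_hash,
    source_hash = target_hash              as is_matched;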

We expect:

  • Summary reports of the column-level diffing with the corresponding percentage of matching values;
  • The ability to drill down into which rows are different.

The SQL query could look like this:
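A minimal sketch over illustrative mutual columns, comparing both directions with MINUS:

-- Content diff over the mutual columns (customer_id, name, country are illustrative)
select 'SOURCE_ONLY' as content_diff_status, src_only.*
from (
    select customer_id, name, country from blue_db.analytics.dim_customer
    minus
    select customer_id, name, country from green_db.analytics.dim_customer
) src_only
union all
select 'TARGET_ONLY' as content_diff_status, tgt_only.*
from (
    select customer_id, name, country from green_db.analytics.dim_customer
    minus
    select customer_id, name, country from blue_db.analytics.dim_customer
) tgt_only;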

Finally, we bring them all together in a dbt package with a dedicated macro where we call all the stored procedures:
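A rough sketch of what such a macro could look like, assuming hypothetical stored procedures named key_diff, schema_diff and content_diff; the names, signatures and arguments are illustrative, not an actual package API:

-- macros/run_data_diff.sql (illustrative)
{% macro run_data_diff(source_fqn, target_fqn, pk) %}
    {% set statements = [
        "call key_diff('" ~ source_fqn ~ "', '" ~ target_fqn ~ "', '" ~ pk ~ "')",
        "call schema_diff('" ~ source_fqn ~ "', '" ~ target_fqn ~ "')",
        "call content_diff('" ~ source_fqn ~ "', '" ~ target_fqn ~ "', '" ~ pk ~ "')"
    ] %}
    {% for stmt in statements %}
        {% do log("Running: " ~ stmt, info=True) %}  {# query logging for observability #}
        {% do run_query(stmt) %}
    {% endfor %}
{% endmacro %}

It could then be triggered with something like dbt run-operation run_data_diff --args '{source_fqn: blue_db.analytics.dim_customer, target_fqn: green_db.analytics.dim_customer, pk: customer_id}'.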

We first came across the concept of data diff during a blue-green deployment. In the context of data quality assurance before swapping, unit tests and integration tests may overlook critical issues that arise from discrepancies in data entities between the different environments (blue vs green).

To address this gap, we need an effective data diff solution that can comprehensively compare data structures and values across environments, providing insights into hidden issues that impact data integrity.

Maintaining data integrity is crucial for the entire organisation and all its data consumers. If you need a hand building a robust, trustworthy system that makes ensuring data quality a breeze, give us a shout. We are here to help.

 

