Oh, well. Your current model is working as expected. The resulting table or query for reporting has good quality data that has been already validated by stakeholders and auditors, and is ready to go to production. If only the transformations could be more performant.
Your team looks under the hood and identifies a few steps in the upstream models that would make the process faster and budget-friendly.
There is now one small final task to complete: validate that the new table you have created has exactly the same values as the old one, i.e. run a regression test.
At this point, you are asking yourself these key questions:
- Are both tables exactly the same?
- If they are not, which rows are different?
- Ok, I know they are not exactly the same tables but how similar are they?
Fear not, we are here to help you answer all three questions.
1. Are the tables the same?
There are various ways you can check that, among which INTERSECT, MINUS/EXCEPT, massive joins with all the columns, dbt-audit-helper, HASH and more. Let’s take a look at some of them.
My favourite is using HASH_AGG in Snowflake which will give you a quick yes/no answer but will not identify the rows that are different.
- Set operators: MINUS/EXCEPT, dbt-audit, massive joins and the like are essentially variations of the same flavour using set operators, which essentially allow queries to be combined.
The best-known example of a set operator is UNION (ALL) with several more that are common in many databases. One of the alternatives is INTERSECT, which "returns rows from one query's result set which also appear in another query's result set with duplicate elimination". Another one is MINUS/EXCEPT, which "removes rows from one query's result set which appear in another query's result set, with duplicate elimination". The monospace above has been taken straight out of Snowflake docs;
- dbt_audit_helper: this generates a query using INTERSECT (via dbt-utils.intersect);
- HASH_AGG: it computes an "aggregate signed 64-bit hash value over the unordered set of input rows. HASH_AGG never returns NULL, even if no input is provided". Simply put, it is a fingerprint of the whole table where the order of the rows does not matter.
Let’s see each of these in action. We are going to use Snowflake's sample data to see how these approaches differ.
Step 1: Create some tables to compare
Step 2: Check if the tables are identical
With the HASH_AGG function, we get a long number for each of the tables and we can easily compare if they are the same. For this dataset, using an extra-small (XS) warehouse, the comparison takes around 45 s regardless of whether the tables are identical (like in the first example) or different (like in the second one).
With the MINUS function, we do not get a straightforward answer. We remove the rows in one query that are also present in the other query but we need to run the query twice, once in each direction, to assert whether the two tables are equal. If that is the case, no row should be returned in either query.
You can have a look at the query plans using Snowlake’s Query Profile. Keep in mind that you will need to run the queries first and make sure to take good note of the query id. The most interesting parts are that:
- HASH_AGG is a lot faster;
- MINUS is just a more convenient way to set up a join (an anti join, to be precise). With all the columns selected, check the details of the most expensive node.
So, if what you need is a quick yes/no answer, HASH_AGG is a lot more performant than MINUS.
But you must now be wondering what if actually as a result of the modifications upstream now the two tables should have mostly the same values except for some timestamp that keeps track of when the query has been run (inserted_at, updated_at). Or what if, as a result of the changes, we have decided to change the order of the columns (e.g. moving PK to the front)?
Worry not. HASH_AGG can be used with * but also accepts a list of columns. You could change the queries above with something like this:
This is an easy way to get the column list:
Caveats around HASH/HASH_AGG as pseudo key/obfuscation, etc.
Sometimes, we may be hashing a column upstream of this HASH_AGG (e.g. when creating a surrogate key, to hide PII, etc). In these cases, care should be taken to ensure that everything is equivalent. If it was not (e.g. the filename of an S3 file with a timestamp embedded was used to hash a key in the first model and a new S3 file with the same contents but a different timestamp was used in the second model), the hashes would not match. An alternative would be to isolate the common parts for the purpose of the HASH_AGG comparison.
When working with strings, HASH or HASH_AGG calculate the hash of the actual string, regardless of the collation used. Set operations, on the other hand, would either error if different collations were specified or use the specified collation for the comparison when compatible. Hence, there could be differences in the output of the set operations described above and the hash value obtained.
If using HASH_AGG to compare two tables returns different hashes but no differences are found using MINUS/INTERSECT, you are utilising HASH_AGG’s bonus capability to detect duplicate rows, which is not that straightforward with set operations.
HASH_AGG is specific to Snowflake. In other tools, either this is not available, or you will need to come up with alternative solutions. For instance, for BigQuery, instead of HASH_AGG you can use something like this:
-- for BigQuery: SELECT BIT_XOR(FARM_FINGERPRINT(TO_JSON_STRING(t))) FROM MyTable AS t
(Credit: Stack Overflow)
NB: All of these approaches yield similar results but this does not make them interchangeable. For example, you cannot compare HASH_AGG results in Snowflake with the above query in BigQuery.
2. Which rows are different?
We know the two tables are not the same but now we want to know what rows are different. Let’s see what the best way to do this is.
We have already seen that we can do a MINUS operation that will retrieve all the rows in one dataset that are not in the other. This is easy, straightforward and likely sufficient. But there are other ways as well.
Meet HASH a cousin of HASH_AGG that returns a signed 64-bit hash value:
The output of this query is the same as the MINUS one: it returns rows in one table that are not present in another but it runs ~60% of the time.
With this particular dataset, the number of columns will impact the difference. Have a go at running it and then look at its query profile.
3. How similar are the two tables?
Going back to the scenario where we know two tables are not the same, you may only be interested in how similar they are. In other words, you are looking to explore the similarity in the whole dataset.
Neither HASH_AGG nor HASH would not be very helpful here. Neither would be INTERSECT/MINUS as that would only return the rows that are different.
Enter MinHash & APPROXIMATE_SIMILARITY. MinHash creates a number of hashes defined as a parameter passed to the function. APPROXIMATE_SIMILARITY takes that array of hashes and computes a similarity score as a percentage.
If we take a rough look at the numbers, we see there are 600M rows in each dataset and they only differ in 17487 rows (all the changes happen in the QUANTITY column). This is basically a 0.003% difference, so essentially these are 100% similar.
It takes a smaller example to see the differences more easily:
Here, we have 10k suppliers and only 3190 are different. This is 51% actual similarity, fairly close to the 55% approximation.
There are actually ways to use HASH to compute those differences (possibly in a more performant way) but that will be left for the reader to explore.
A final word
Consider using HASH or HASH_AGG to compare two datasets for equality and to see which rows differ, if any.
There are commercial tools out there that can help you implement the data regression tests (e.g. Datafold, Datagaps), which may even be suitable for developing CI/CD pipelines.
Often, however, having a good set of tests, such as dbt singular & generic tests, combined with minimal effort for the regression test can provide enough comprehensive coverage for data validation.
Make sure to visit the Infinite Lambda blog for more data analytics tips and tricks.