...

Multitable SCD2 Joins: How to Unify Historical Changes

Minh Ngo
September 25, 2023
Read: 6 min

In the realm of data management, historical changes are conventionally stored in separate Slowly Changing Dimension Type 2 (SCD2) tables. However, extracting point-in-time insights from these dispersed sources requires merging them into a single, unified entity.

This guide offers a succinct walkthrough of the process for performing multitable SCD2 joins, presenting two distinct approaches, Direct Join and Unified Timeline, evaluating their respective advantages and drawbacks through practical examples.

Let’s see how to effectively unify historical data and derive valuable insights.

Preparing source SCD2 tables

A quick review of SCD2

SCD2 is a type of slowly changing dimension where instead of updating existing records, you add new ones to represent changes. In an SCD2 dimension table, there is an effective datetime column and an expiration datetime column to show the time range for which the row represents the data's state.

Creating our sample SCD2 tables

As this is a hands-on guide, we will start by creating three SCD2 tables – scd2_table1, scd2_table2 and scd2_table – that we are going to use as input to demonstrate different approache to multitable SCD2 joins later on.

To keep this simple, each table contains only one primary key column (pk) and 1 dimension column (dim1, dim2, dim3). The joining logic in the next sections should be the same for tables with multiple keys and dimensions. However, you will still need to modify select statements and introduce more joins.

NB: I will be using Snowflake to demonstrate all SQL examples in this blogpost, so if you are using other SQL syntax, you might need to modify the scripts.

To create the tables, run the script below:

SQL script to create source SCD2 tables

 

Output source tables

SCD2_Table1

PK DIM1 VALID_FROM VALID_TO
pk dim1-a 2023-01-01 0:00:00 2023-01-03 0:00:00
pk dim1-b 2023-01-03 0:00:00 2023-01-04 0:00:00
pk dim1-c 2023-01-04 0:00:00 2023-01-07 0:00:00
pk dim1-d 2023-01-07 0:00:00 2023-01-09 0:00:00
pk dim1-e 2023-01-09 0:00:00 9999-12-31 0:00:00

 

SCD2_Table2

PK DIM2 VALID_FROM VALID_TO
pk dim2-a 2022-12-31 0:00:00 2023-01-02 0:00:00
pk dim2-b 2023-01-02 0:00:00 2023-01-04 0:00:00
pk dim2-c 2023-01-04 0:00:00 2023-01-06 0:00:00
pk dim2-d 2023-01-06 0:00:00 9999-12-31 0:00:00

 

SCD2_Table3

PK DIM3 VALID_FROM VALID_TO
pk dim3-a 2023-01-01 0:00:00 2023-01-03 0:00:00
pk dim3-b 2023-01-05 0:00:00 2023-01-07 0:00:00
pk dim3-c 2023-01-07 0:00:00 9999-12-31 0:00:00

 

Notes on validity intervals

Bear in mind that for the sake of simplicity, the time granularity for these examples is ‘Day' even though the columns are in timestamp format.

Inbound and out-of-bound intervals

The validity intervals of these tables are LEFT CLOSED - RIGHT OPENED, which means valid_to is out of bound (in this case we set it as '9999-12-31') for current records. On the other hand, valid_from is inbound and the earliest value in the 3 tables is ‘2022-12-31 0:00:00’ (Table 2).

In one of the approaches to multitable SCD2 joins, a LEFT CLOSED - RIGHT OPENED interval might cause missing rows in the final merged table. To prevent this, LEFT OPENED - RIGHT OPENED validity intervals (valid_from '1900-01-01') should be used instead. You will see an example of this in the Pros & Cons of Direct Join section further in this article.

Continuous & discontinuous intervals

The validity periods in Table 1 and Table 2 are continuous, which means the valid_from of the next record is equal to the valid_to of the previous one and there are no gaps. On the other hand, there is a gap in Table 3 with ‘dim3-b’ (highlighted above).

Gaps in your source tables will result in NULLs for the corresponding dimensions in the final merged SCD2. Rest assured, we will discuss some examples of this later on.

The existence of an interval gap usually indicates missing data: either the SCD2 was not loaded correctly or the raw sources did not have the corresponding data for that period. Unless it is intended to be so, it is recommended to check your loading process and raw data before proceeding with the next steps.

Recalculating valid_to based on valid_from

If:

  • Your source tables only have valid_from column and don’t have valid_to;
  • There are gaps in your validity intervals, which are supposed to be continuous,

you can always calculate or recalculate valid_to based on valid_from.

See the example below on Table3:

 

SCD2_Table3_continuous_intervals

PK DIM3 VALID_FROM VALID_TO
pk dim3-a 2023-01-01 0:00:00 2023-01-05 0:00:00
pk dim3-b 2023-01-05 0:00:00 2023-01-07 0:00:00
pk dim3-c 2023-01-07 0:00:00 9999-12-31 0:00:00

 

First approach to multitable SCD2 joins: Direct Join

Direct Join is a common practice when it comes to joining 02 SCD2 tables and many how-to articles refer to it. This method involves 3 steps: timeline join condition, validity period recalculation, and deduplication. As the name suggests, we will be joining the validity range directly from the source SCD2 tables without any scaffolding or timeline modification.

In this section, we will apply this method to 2 cases: joining 02 SCD2 tables and joining 03 or more SCD2 tables using the samples above.

Direct Join with 02 SCD2 tables

1. Perform this join condition that enables merging 2 rows that are in the same validity period

 

2. Recalculate valid_from & valid_to

  • valid_from should be the latest one of the 2 valid_from of the 2 tables
  • valid_to should be the earliest one of the 2 valid_to of the 2 tables

 

3. Filter out rows with valid_from = valid_to after join to avoid duplicated validity intervals

In most of real-life examples, valid_from and valid_to are usually in lower time granularity like timestamp (with the differences down to seconds), hence there might not be cases where these column value “overlapped” across multiple SCD2 tables and we might skip this step. However, for an overall solution, this should be a part of your joining condition.

 

4. Combine them all together

 

Direct Join with 03 or more SCD2 tables

In this part, we simply repeat the process of joining 02 SCD2 tables as mentioned above, while re-calculating valid_from and valid_to along the way. It is worth noting that there is only 1 join in each step between the next SCD2 validity range (e.g. Table 3) and the previously combined validity range (e.g. Table 2 + Table 1).

If we break it down into simple CTE steps, it should look like this:

  1. CTE1 = SCD2-table1 JOIN SCD2-table2;
  2. Recalculate valid_from & valid_to in CTE1;
  3. CTE1 join SCD2-table3;
  4. Repeat.

Take a look at the combined script:

 

Direct join cons

Even though Direct Join is a straightforward and frequently used approach to multitable SCD2 joins, it does come with certain limitations. Let’s go over these now.

Missing row for out-of-bound valid_from

The validity interval of 2022-12-31 to 2023-01-01 is not populated. I would expect this row for this period:

PK DIM1 DIM2 DIM3 VALID_FROM
pk NULL dim2-a NULL 2022-12-31 0:00:00

 

Due to the logic of recalculating valid_to and valid_from, this row will not show up even if you use a full outer join instead of an inner join.

To make this work, all SCD2s should be in a LEFT OPENED - RIGHT OPENED state. In other words, the first period of all tables should have valid_from = '1900-01-01' or the same starting points.

Missing rows due to interval gaps

A gap in Table 3 led to the missing of these 2 rows:

PK DIM1 DIM2 DIM3 VALID_FROM
pk dim1-b dim2-b NULL 2023-01-03 0:00:00
pk dim1-c dim2-c NULL 2023-01-04 0:00:00

 

To make this work, all SCD2s should be continuous. You can check it out using the script above with the scd2_table3_continuous_intervals we created earlier instead of scd2_table3 (see the commented part of the join).

Second approach to multitable SCD2 joins: unified timeline

This approach to multitable SCD2 joins first creates a unified timeline based on all the valid_from from referenced SCD2s. This timeline will be used as a scaffold for joining back all the SCD2s later on. Unlike Direct Join, the deduplication & valid_to recalculation steps are done during timeline unification. This allows us to execute all SCD2 joins in 1 CTE instead.

Note that the process is quite similar when joining 02, 03 or more SCD2s:

  1. Union PKs, valid_from from subsequent SCD2s;
  2. Deduplicate the unioned table to make the following calculation lighter;
  3. Recalculate valid_to from the unioned data to create the timeline;
  4. Join subsequent SCD2s back to the timeline using this condition below. Repeat for all SCD2s.

 

Unified timeline – combine all steps

 

Unified timeline pros

As you can see from the results of the above scripts, the Unified Timeline approach covers all of the rows previously missed by the Direct Join approach. It has fewer window functions and is generally shorter in terms of query length when it comes to joining more than 02 tables.

On the other hand, if you have only 02 SCD2s without any out-of-bound valid_from row, Direct Join seems to be an easier, more straightforward method compared to Unified Timeline.

There is still some missing data (nulls) due to the mentioned GAP in Table 3 but this is an expected outcome with that raw data input. If possible, I would recommend using scd2_table3_continuous_intervals instead.

Query performance comparison

The Direct Join approach recalculates valid_to in every single CTE, while Unified Timeline 2 only calculates it once after the union section. This process is a window function, hence we can expect it to affect query performance. There is little to no difference if you are joining 02 SCD2 only (only 01 window function).

However, Direct Join’s performance declines significantly with 03 SCD2s and up. On a Snowflake x-small warehouse, the above Direct Join query for  03 SCD2s took 1.2s, while the similar one for Unified Timeline finished in only 650ms.

Conclusion: start doing multitable SCD2 joins effectively

Unifying SCD2 tables is a crucial step in accessing point-in-time historical data. There are two approaches you can choose from for multitable SCD2 joins: Direct Join and Unified Timeline.

While Direct Join is a widely used and straightforward method, it carries the risk of missing data. On the other hand, the Unified Timeline approach comes in as a more comprehensive solution by addressing the limitations of Direct Join while improving query performance.

Now that you are familiar with these approaches to multitable SCD2 joins and understand how to apply them, you can navigate the landscape of historical data integration effectively to unlock valuable insights.

Explore the Inifnite Lambda blog for other insightful articles on data and analytics engineering.

More on the topic

Everything we know, we are happy to share. Head to the blog to see how we leverage the tech.

row-level metadata for enhanced data observability on dbt Cloud
Enhancing dbt Observability with Row-Level Metadata
Data observability is vital in every dbt project, especially in large and complex ones with multiple concurrent executions. It enables data teams to swiftly detect...
November 6, 2023
dbt Labs Partner of the Year 2023 EMEA
Infinite Lambda named dbt Labs Partner of the Year for EMEA
Infinite Lambda has been named dbt Labs Partner of the Year for the EMEA region in the Services category. This award is a testament not...
October 17, 2023
data observability dbt packages
Data Observability dbt Packages
What are dbt packages? Software engineers often use modularised code libraries, empowering them to focus on business logic while leveraging preexisting, perfected code for efficiency....
October 13, 2023
VaultSpeed Sapphire Partner
Infinite Lambda Becomes VaultSpeed Sapphire Partner
We are thrilled to share the exciting news that Infinite Lambda is now recognised as a VaultSpeed Sapphire partner. This development is extremely important to...
August 9, 2023
dbt Labs Premier Consulting Partner
Infinite Lambda Becomes dbt Labs Premier Consulting Partner
We are delighted to announce our new Premier Consulting Partner status with dbt Labs, marking a significant step in our successful collaboration. As two pioneers...
August 8, 2023
data contracts explained for non-tech audiences
Data Contracts for Non-Tech Readers: a Restaurant Analogy
Introducing the data kitchen Data contracts have been steadily gaining in popularity within the data community, providing a structured approach to manage and regulate the...
July 17, 2023

Everything we know, we are happy to share. Head to the blog to see how we leverage the tech.