Deleting Records with dbt

Petyo Pahunchev
January 27, 2020
Read: 4 min

dbt is a great tool for managing data transformations. Built from the ground up to support version control and peer review, it enables and encourages best practices such as automated testing, environment separation and modularity.

One of the most powerful features of dbt is the ability to only transform and load new and updated records. This is implemented through incremental models.


A comparison of an incremental load vs. a full dbt load of 40 models running on Snowflake. The incremental load processes only the new records. Please note that performance will vary in your case.

The above chart shows the performance advantage of an incrementally loaded dbt model vs. a fully refreshed one. This benefit translates into real-life savings (e.g. on Snowflake compute) and allows faster decision making off the back of analytics data. It is easy to see why this approach is so appealing to data engineers.

However, it is often necessary to update a record after its initial ingestion, for instance when an order item is substituted due to unavailability of the original product. In cases like this dbt helps us out: when defining an incremental model, we also define a unique key, which is then used to generate insert and update statements (or a merge where supported).
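As a minimal sketch (the model, source and column names here are illustrative, not from the original project), an incremental model with a unique key looks like this:

```sql
-- models/orders.sql (illustrative)
{{ config(
    materialized='incremental',
    unique_key='order_id'
) }}

SELECT
    order_id,
    total,
    updated_at
FROM {{ source('shop', 'raw_orders') }}

{% if is_incremental() %}
-- only pick up records that changed since the last run
WHERE updated_at > (SELECT max(updated_at) FROM {{ this }})
{% endif %}
```

On warehouses that support it, such as Snowflake, dbt compiles this into a merge on the unique key; on others it falls back to a delete-and-insert strategy.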

Another approach, often taken by data architects, is to store every version of the record along with a set of timestamp columns that capture the bi-temporal nature of the data.

Modelled this way, the data can tell us:

  • What the total for order_id 1 is;
  • What the total between the 16th and 17th of January 2020 was.
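Assuming a pair of bi-temporal columns named valid_from and valid_to (names chosen purely for illustration), both questions become simple filters:

```sql
-- current total for order_id 1: the open-ended version
SELECT total
FROM orders
WHERE order_id = 1
  AND valid_to IS NULL;

-- total as it stood on 16th January 2020
SELECT total
FROM orders
WHERE order_id = 1
  AND valid_from <= '2020-01-16'
  AND (valid_to > '2020-01-16' OR valid_to IS NULL);
```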


While the above approaches solve most cases, we recently stumbled upon one that does not quite fit. Our client required a data model that deletes records from a reporting table. The reasons why we needed to implement an actual delete are long and complicated, but they boil down to:

  • Lots of existing code which queries this table and changing this code to incorporate new flags/columns would be difficult and lengthy
  • A bespoke library was in use which generates requests against this table based on a UI tool – changing that would also be tedious!
  • A view would be impractical due to the sheer number of records (billions! and growing)

As you all know, we can only write SELECT statements in our SQL model files, so how can one delete records with a dbt model if all you can write is a SELECT statement? Thankfully, the folks at dbt Labs provide various hooks, including ones that run after model execution.

The idea is pretty simple: add a column that flags a record as outdated and due for deletion, then add a post-model hook that deletes any record with that flag set!

Our input data contained a JSON message with a top-level order and a set of order items belonging to it. When a customer returns an item, an updated message is received which contains one fewer record in the order item set, as in the example below.

You can see that a customer has made one order with two items for a total of 11.50. They then decide to return one of the items:

A new, updated record is sent, but now it has one fewer order item!
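The two messages might look roughly like this (field names are illustrative): the first carries two items, while the second, sent after the return, carries only one.

```json
{"order_id": 1, "version": 1, "total": 11.50,
 "order_items": [{"sub_id": 1, "price": 5.00},
                 {"sub_id": 2, "price": 6.50}]}

{"order_id": 1, "version": 2, "total": 5.00,
 "order_items": [{"sub_id": 1, "price": 5.00}]}
```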

It is rather difficult to update a record that is not in your input, so we ended up first unioning in any existing records matching the input on order_id with their deletion flag set to true, i.e. clearing any existing records for those orders, and then inserting the new records (which are now fewer) with their deletion flag set to false.

The post-model hook then takes care of cleaning up the outdated records:

post_hook='delete from {{ this }} where to_delete = true'

The model itself unions the existing records for the affected orders, flagged for deletion, with the latest batch:

{% if is_incremental() %}
SELECT
    order_pk,
    order_id,
    total,
    version,
    true as to_delete,
    current_timestamp(4) as ts
FROM {{ this }}
WHERE order_id IN (SELECT DISTINCT order_id FROM latest_batch)

UNION ALL
{% endif %}

SELECT
    order_pk,
    order_id,
    total,
    version,
    false as to_delete,
    current_timestamp(4) as ts
FROM latest_batch

A few points to note from the above code:

  • We only do this when performing an incremental load; otherwise it is not needed.
  • order_pk is a surrogate key which we use to still update existing records.
  • latest_batch is the result of a CTE that contains only the latest set of records (with filters for incremental loading etc.).

We complemented the model with a set of tests! ✅

Ensure no records are left to be deleted:

SELECT 1
FROM {{ ref('ORDERS') }}
WHERE to_delete = true

Ensure there are no records for the same order with different versions:

SELECT order_id, sub_id, count(*) as n_records
FROM {{ ref('ORDERS') }}
GROUP BY order_id, sub_id
HAVING n_records > 1

Read the rest of the posts on the Infinite Lambda blog.
