Using dbt deferral to simplify development

Dan Bleaken
May 11, 2023
Read: 12 min

As data inexorably grows in volume, complexity and value, so do the demands from business stakeholders, who need timely visibility whilst keeping costs down.

This means that as a developer, you are constantly looking for ways to reduce development time, save money and simplify every iteration in your dbt project. If you are not doing that already, you will need to start soon.

dbt is a popular and highly effective tool for managing data transformation efficiently. dbt deferral, one of dbt's most powerful features, is a great way to reduce both cost and run time, and one you will want to consider.

Making changes to a single model

Imagine a dbt project where you want to make a change to a single model, model_a, which is dependent on, say, 20 upstream models.

After you have made the required modifications to the model in your development branch, you need to validate your changes. Here, you might execute a command such as:

dbt run --select +model_a

This would instruct your chosen data warehouse (say, Snowflake) to compute and materialise 21 models in your development environment: 20 upstream dependencies plus model_a itself.

You then validate model_a, tweak the code and iterate until you are certain that it is correct. Finally, you would commit and push your changes, create a pull request, seek a peer review and merge to the production branch. Job done.

Hang on.

You have just run 20 other models, incurring added expense in terms of time, computation and storage, just to look at the results of your (possibly very trivial) changes to model_a.

Each of these 20 models exists in production and will have been computed and materialised fairly recently, assuming models in the production branch are built on a regular schedule. So, why not use them?

This is where dbt deferral becomes your friend.

What is dbt deferral?

dbt deferral is a powerful feature which allows the developer to run modified models without needing to also run all of their upstream dependencies.

Following on from the example above, imagine I start with no models materialised in my development environment but this time I provide the selector --select model_a in a dbt run command as well as the option to defer to my production environment. dbt would instruct Snowflake to compute and materialise model_a, and get any upstream data required from production.

This makes a huge difference:

  • Without deferral, 21 models are computed and materialised, to validate changes to 1 model;
  • With deferral, 1 model is computed and materialised, to validate changes to 1 model.

As a result of using deferral, the number of models computed and materialised is dramatically reduced.

Why use dbt deferral?

Optimising resources

Using dbt deferral simplifies every iteration in your project. As shown above, it helps to minimise the number of models that are materialised in your development environment. This reduces compute and storage costs, and saves time. The larger your project, the bigger the savings.

Keeping a tidy environment

Using deferral also helps you avoid clutter and stale tables in your development environment.


Validating code changes

dbt deferral is also great for validation and measuring the impact of code changes. Let’s assume that you modify a single model. If you defer to production, then at the moment you materialise that model, the resulting table will be based upon precisely the same data as the corresponding table in production.

For those two tables, any differences between production and development will exist solely due to code changes and not differences in data freshness. This saves more time as the impact of your changes is easier to understand and demonstrate.


A step-by-step example of using deferral

Let’s go through a more detailed example to see how you would practically apply deferral as part of your development process.



Consider this purely illustrative dbt project, which takes a set of raw ingredients and assembles them into a product: a delicious burger in our case.

Here is dbt’s Directed Acyclic Graph (DAG) for the project:

[Image: the project’s DAG]

I have 12 configured sources (the ingredients) and 17 models that either reference those sources or reference each other. To create the final product, the model burger, I need to compute and materialise 16 models (the remaining one, ing_box, is not a dependency of burger, not yet anyway).

My chosen data warehouse is Snowflake and my source data (green in the DAG above) resides in a schema named RAW. In my production environment, the 17 models downstream of my sources are materialised in a schema named PROD on a regular schedule. I also have a schema named DEV, which is empty at the moment.

This is the state of things:

[Image: the state of RAW, PROD and DEV, with DEV showing no objects found]


We need a change

Now, it is development time. I want to make some changes to the model burger. Instead of using ing_wrapper, I want to use ing_box.

First, I create a new branch from the production (main) branch of the dbt project. Then, I checkout the branch and make the required changes to burger, replacing the use of ing_wrapper with ing_box. Next, I need to run burger to validate my changes. I am working locally, using the dbt Core command-line interface (CLI), and I execute the following:

dbt run --select +burger

This command, with the selector --select +burger, instructs Snowflake to compute and materialise all 15 upstream dependencies of burger, plus burger itself, in my default environment, which is development.

And sure enough, we get 16 models materialised in 22.14 seconds:

[Image: dbt run output showing 16 models materialised in 22.14 seconds]

After running that, RAW, PROD and DEV look like this:

[Image: RAW, PROD and DEV after the run]

Save time and budget

Above, I only modified one model. It would be a waste of time, compute cost and storage cost to run 15 other models that have not changed and already exist in my production environment.

If I run this using dbt deferral (and I am going to explain the syntax of the command later), here is what happens, starting again from an empty DEV schema:

dbt run --select burger --defer --state prod-run-artifacts

This time, I am kindly asking dbt: please run burger, and if you cannot find anything you need, defer to the relevant data in my production environment, namely the four models directly referenced by burger: sa_layer_1, sa_layer_2, ing_top_bun and ing_box.

Here it is:

[Image: dbt run output with deferral, showing a single model materialised]

This time, only the burger model is run and it takes 9.39 seconds, which is much quicker.

In the previous, non-deferred run, we ran 16 models, so you might expect the single-model run to take about one sixteenth of that time. However, I have dbt configured to use 8 threads, meaning much of the work in the 16-model run was done in parallel.
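To see why, consider a back-of-the-envelope model of threaded execution. This is illustrative only: it ignores the DAG's ordering constraints, which also limit parallelism, and reuses the run times quoted above.

```python
import math

# With N threads, M similar models run in roughly ceil(M / N) parallel
# "waves" (ignoring DAG ordering constraints, which add further serialisation).
models, threads = 16, 8
waves = math.ceil(models / threads)

# So the 16-model run costs about 2 model-durations of wall time, not 16,
# which is why the deferred single-model run is only a couple of times faster.
speedup = 22.14 / 9.39
```

In other words, parallelism already compresses the full run, so deferral's wall-clock saving is smaller than the 16x reduction in models, while the compute and storage savings still scale with the number of models skipped.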

In addition, I have saved the compute and storage cost on 15 models. The more numerous and complicated those models (and their dependencies), the more costs we avoid, and the more time we save.

This is now the state of RAW, PROD and DEV. The arrows show you where DEV.BURGER has sourced its data from:

[Image: RAW, PROD and DEV, with arrows showing where DEV.BURGER sourced its data]

What is not to like? By applying just a couple of extra arguments to the dbt run command, I have made the development environment much tidier, avoided unnecessary cost and enabled faster materialisation and validation for the modified models.

How does dbt deferral work?

We have seen what deferral can do but how does it actually work? Let’s go back to the burger project as an example. To understand deferral, you need to be familiar with a couple of other dbt features: the ref function and artifacts.


The ref function

Arguably the most important function in dbt is ref, and there are two main reasons for that. In a model's SQL, a typical call looks like this:

select * from {{ ref('ing_lettuce') }}

Firstly, it allows the dbt dependency graph to be built, enabling dbt to run models in the correct order and leverage parallel processing (multiple threads) when using the dbt run command, for example.

Secondly, and this is important in relation to deferral, it allows details of the schema to be set via configuration and not hardcoded into each model’s SQL. Requesting deferral changes the way that the ref function is resolved.
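To make this concrete, here is a minimal Python sketch of the idea, with hypothetical names rather than dbt's actual code: a ref resolves to a fully qualified relation name built from environment configuration, not hardcoded SQL.

```python
# Minimal sketch, not dbt's implementation: a ref() resolves to a fully
# qualified relation name built from the environment's configuration.
def resolve_ref(model_name, database, schema):
    """Build database.schema.OBJECT from configuration, not hardcoded SQL."""
    return f"{database}.{schema}.{model_name.upper()}"

# The same ref('ing_lettuce') resolves differently per environment:
dev_relation = resolve_ref("ing_lettuce", "DAN_DEFER_DEMO", "DEV")
prod_relation = resolve_ref("ing_lettuce", "DAN_DEFER_DEMO", "PROD")
```

Deferral works by changing which configuration feeds this resolution for upstream models that are not part of the current run.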

Artifacts and manifest.json

You may be aware that each time dbt run is executed, a number of artifacts are generated. These are files created as a product of the run and contain a wealth of useful information.

In our burger project, where we are working locally using the dbt Core CLI, artifacts are written to the target folder by default.

This is what it looks like:

[Image: the contents of the target folder]

You can see artifacts such as compiled models; executed (run) models; some JSON files: catalog.json, run_results.json and manifest.json; and a few other things. Most relevant in relation to deferral is manifest.json.

The ‘dbt manifest’ (manifest.json) is a file describing most of your dbt project, from configurations and properties of the project as a whole, to individual resources, such as models, tests and macros.

The manifest serves as a useful snapshot of the state of the project at the moment dbt was last run. It can also be used for countless analyses of the project, and provide input for subsequent transformations or automations. Last but not least, manifest.json is what enables deferral.

NB: In production, where dbt run will be called by some sort of scheduled ‘Job’, dbt artifacts, including manifest.json, will be written to a different location, depending on how you handle orchestration. For the purposes of this demo, let’s just assume that manifest.json is available somewhere for your production environment.

manifest.json can be large and a bit difficult to navigate in its raw form, but let’s focus on part of the manifest from the latest production run of the burger project and see what it looks like.

In the json, there is a top-level key named nodes, which itself is a dictionary (key/value pairs) of all analyses, models, seeds, snapshots and tests.

This is the raw manifest.json:

[Image: an excerpt of the raw manifest.json]

The important parts in relation to deferral are three properties of the models – database, schema and relation_name, as shown here in a more readable layout for the ing_lettuce model:
[Image: manifest.json properties (database, schema and relation_name) for the ing_lettuce model]


The highlighted areas describe exactly where ing_lettuce is materialised, in this case DAN_DEFER_DEMO.PROD. The same information is available for all of the other models. It is this information that informs dbt where to defer to, if deferral is requested.
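As a quick illustration of pulling these properties out programmatically, here is a small Python sketch. The node key and trimmed structure below are hypothetical stand-ins for a real manifest.json, which holds many more keys per node.

```python
import json

# A trimmed, inline stand-in for manifest.json; the project name
# "burger_project" in the node key is hypothetical.
manifest = json.loads("""
{
  "nodes": {
    "model.burger_project.ing_lettuce": {
      "database": "DAN_DEFER_DEMO",
      "schema": "PROD",
      "relation_name": "DAN_DEFER_DEMO.PROD.ING_LETTUCE"
    }
  }
}
""")

# Map each model to the relation it is materialised as:
locations = {
    name: node["relation_name"]
    for name, node in manifest["nodes"].items()
    if name.startswith("model.")
}
```

This mapping, model name to materialised relation, is precisely the information dbt consults when deciding where a deferred ref should point.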


How to request deferral

In order for your dbt run to defer to data in some other environment, you need to specify where that data is. As demonstrated above, manifest.json contains all of the information needed.

Here is a command to run the model burger, in my development environment, deferring to production for any dependent models:

dbt run --select burger --defer --state prod-run-artifacts

First, you need to add an argument to your dbt run command: --defer. According to dbt’s own documentation:

“When the --defer flag is provided, dbt will resolve ref calls differently depending on two criteria:

  1. Is the referenced node included in the model selection criteria of the current run?
  2. Does the referenced node exist as a database object in the current environment?

If the answer to both is no - a node is not included and it does not exist as a database object in the current environment - references to it will use the other namespace instead, provided by the state manifest.”

This is where the --state argument comes in. In this argument, you need to provide the relative path to a folder containing the manifest.json you would like to ‘defer’ to.

In my burger project, I created a folder called prod-run-artifacts at the top level of the project and then copied the manifest.json from the latest production run to prod-run-artifacts.

It looks like this:

[Image: the prod-run-artifacts folder containing manifest.json]

As mentioned previously, my burger model contains four ref calls to the following models: sa_layer_1, sa_layer_2, ing_top_bun and ing_box.

For those four models, let’s consider the two criteria:

  1. Is the referenced node included in the model selection criteria of the current run? The selector used in my command is --select burger, and so the answer is no. (If the selector were --select +burger, the answer would have been yes.)
  2. Does the referenced node exist as a database object in the current environment? Let’s assume this is the first time I run my modified burger model. I’m starting with a clean environment and there are no tables or views in my DEV schema, so the answer is no.

If the answer to both is no, the references to sa_layer_1, sa_layer_2, ing_top_bun and ing_box will use the namespaces defined in the location specified in my --state argument, namely prod-run-artifacts/manifest.json. And that namespace, as we have seen, is DAN_DEFER_DEMO.PROD.
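The two criteria can be summarised as a short decision function. This is a simplified Python sketch of the behaviour described above, with illustrative names; the real logic lives in dbt-core.

```python
# Simplified sketch of dbt's two deferral criteria (illustrative, not
# dbt's actual implementation).
def resolve_namespace(node, selected, in_current_env, current_ns, state_ns):
    """Decide which namespace a ref to `node` resolves to under --defer."""
    if node in selected:          # 1. included in this run's selection?
        return current_ns
    if node in in_current_env:    # 2. already a database object here?
        return current_ns
    return state_ns               # both no: defer to the state manifest

# burger's four refs, with `--select burger` and an empty DEV schema:
refs = ["sa_layer_1", "sa_layer_2", "ing_top_bun", "ing_box"]
resolved = {r: resolve_namespace(r, {"burger"}, set(),
                                 "DAN_DEFER_DEMO.DEV", "DAN_DEFER_DEMO.PROD")
            for r in refs}
```

With the selection limited to burger and nothing materialised in DEV, all four upstream refs fall through to the production namespace.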

A few additional points of interest

You can use environment variables

Deferral requires both --defer and --state to be set, either by passing the flags explicitly in your dbt command, as I showed above, or by setting the environment variables DBT_DEFER and DBT_STATE respectively.

Not just dbt run

Above, I focussed on the use of deferral with dbt run. However, keep in mind that deferral may also be used with other commands for which the concept of deferral makes sense, like dbt test, dbt build and dbt show.

Slim CI uses dbt deferral

In dbt Cloud, Slim CI, which by definition runs only modified models, uses deferral. It is used for testing the contents of pull requests before allowing a merge; there is some dbt documentation about this (see Useful resources below). You can also see it being configured, including some discussion of dbt deferral, in the learning video from dbt, Set Up Slim CI in dbt Cloud (you will need to register first and it is free).

Watch out for stale data

Remember, criterion #2 asks if the referenced node exists as a database object in the current environment. What if the answer is yes?

This is one of the big ‘gotchas’ with deferral. By default, if a model or source upstream of the selection in your dbt run command already exists in your target environment (e.g. development), then that table/view will be used.

The danger is that such data is ‘stale’, in other words not materialised recently, and possibly not as part of the work done in the current branch.

You need to be careful here to avoid confusion. To eliminate the risk of unexpectedly using stale data, I recommend that as soon as you create a new branch from production (main), you clear down your development environment, manually or using a script.

Ultimately, doing this before starting work on a new branch is an excellent practice, regardless of whether or not you plan to use dbt deferral.

The other way to manage this risk is to add an extra argument to your dbt command, --favor-state, or to set an environment variable called DBT_FAVOR_STATE, so that if a given model exists in both the current environment and the environment defined by --state, dbt will always use the latter.
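Here is a simplified Python sketch of the effect of --favor-state, with illustrative names rather than dbt's actual code: when the flag is set, a node that exists in both the current environment and the state manifest resolves to the state namespace.

```python
# Simplified sketch of ref resolution under --defer, with optional
# --favor-state behaviour (illustrative, not dbt's implementation).
def resolve_ref(node, selected, in_current_env, in_state,
                current_ns, state_ns, favor_state=False):
    if node in selected:
        return current_ns            # selected nodes always build locally
    if favor_state and node in in_state:
        return state_ns              # prefer the state manifest's object
    if node in in_current_env:
        return current_ns            # default: a local (possibly stale) copy wins
    return state_ns

# A stale copy of ing_top_bun is lying around in DEV:
args = ("ing_top_bun", {"burger"}, {"ing_top_bun"}, {"ing_top_bun"},
        "DAN_DEFER_DEMO.DEV", "DAN_DEFER_DEMO.PROD")
default_ns = resolve_ref(*args)                    # stale DEV copy is used
favored_ns = resolve_ref(*args, favor_state=True)  # PROD is used instead
```

Note that selected nodes are unaffected: the models you are actually building are always materialised in your own environment.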

Slim CI does not have this problem: each time it is triggered, a brand new, empty temporary schema is created, so there is no chance of stale data lying around.

Environment setup

There are many different ways of setting up production and development environments. Here are some common examples.

  1. One organisation. One account. One database:
    • One schema for production;
    • One or many schemas for development work.
  2. One organisation. One account:
    • One database for production (with any number of schemas);
    • One or many databases for development work (each with any number of schemas).
  3. One organisation. Two different accounts:
    • One account for production databases/schemas;
    • One account for development databases/schemas.

In this demo, I worked with Snowflake, and environment setup number 1.

In manifest.json, there is no reference to the account; only databases, schemas and table/view names. Therefore, dbt deferral only works if the chosen state references databases on the same account. So, of the environment setups above, 1 and 2 would work for deferral, while 3 would require some additional overhead.

If your environment setup is like 3, i.e. production and development environments span multiple accounts, you could still use dbt deferral by making data from the production account available in the development account, for example using Snowflake Data Sharing.

Furthermore (and this concerns all three setups), the user and the role you use in the development environment must have permission to read from whatever database(s) are referenced in the chosen state (let’s say they are in production).

If role security is such that development users cannot read from production, there may be some extra overhead to enable this. Or, you could consider creating a copy of production with different permissions, for example a clone in Snowflake.

Automatic replication of the production manifest

I have described an approach where the manifest.json from the latest production run is copied manually to a folder of your choice, in order to enable dbt deferral during development.

If you use deferral frequently, I’d recommend automating this, so it’s always available and up to date, if you need it. For instance, you could copy the production manifest to a configured location on a regular schedule, or when certain events occur, like a merge or the creation of a new branch.
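The copy step itself is trivial to script. Here is a Python sketch; the function name and paths are illustrative, and the source directory would be wherever your orchestrator stores production run artifacts.

```python
import shutil
from pathlib import Path

# Sketch of the step you might automate: copy the manifest.json from the
# production run's artifacts into the folder that --state points at.
# Paths and the function name are illustrative.
def refresh_state_manifest(prod_artifacts_dir, state_dir):
    """Copy the production manifest.json into the local state folder."""
    destination = Path(state_dir)
    destination.mkdir(parents=True, exist_ok=True)
    copied = shutil.copy2(Path(prod_artifacts_dir) / "manifest.json",
                          destination / "manifest.json")
    return Path(copied)
```

Hooked up to a schedule or a merge event, this keeps prod-run-artifacts fresh so that --defer --state prod-run-artifacts always reflects the latest production run.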

dbt Labs demonstrate how you might do this using dbt Cloud in a short demo. They use GitHub Actions to trigger a dbt job whenever a change is merged to production. Once that has completed, the resulting manifest.json is copied into a specified folder in the production code base. This way, all new branches created for development automatically have access to the latest production manifest, making dbt deferral a piece of cake.



dbt deferral is a powerful and easy-to-implement feature that reduces development time, saves money, and simplifies every iteration in your dbt project.

Head to Infinite Lambda’s blog for more tips and tricks on making your life in analytics easier.

Infinite Lambda is a Platinum dbt Partner. We will be happy to answer your questions about leveraging the tool, so feel free to reach out.

Useful resources

For further reading, I have provided some useful references below:

  1. dbt documentation on Deferral 
  2. dbt documentation on Artifacts 
  3. dbt documentation on manifest.json 
  4. dbt documentation: The ref function 
  5. dbt documentation on Directed Acyclic Graph (DAG)
  6. Configuring a Slim CI Job
  7. dbt learning: Set Up Slim CI in dbt Cloud 
  8. Snowflake documentation on Data Sharing 
  9. Snowflake documentation on Clone 
  10. Upgrading to dbt v1.5 (latest)
  11. JSON Crack: navigate your json files
