...

Data Observability dbt Packages

Adrien Boutreau
October 13, 2023
Read: 9 min

What are dbt packages?

Software engineers often use modularised code libraries, empowering them to focus on business logic while leveraging preexisting, perfected code for efficiency. In dbt, these powerful code libraries are referred to as 'packages’. They enable efficient problem-solving as shared analytic challenges are common across organisations.

Today, we are going to focus on the dbt packages for data observability.

Fostering reusability

dbt packages have been inspired by software engineers' good practices and applied within dbt. They offer a number of advantages, but the one that warrants our attention first is reusability.

Reusability comes in two ways:

  1. Internal packages
    When you work on various projects and find yourself re-using the same macros, you can create your own packages to feature common macros. The package you create will be saved into your private git repository; using git versioning, you can easily share it to all of your dbt projects.

    Useful tip: If you are trying to extend what a dbt product can offer, for instance by adding new features not specifically related to your data, first have a look at existing packages. You might not find what you are looking for on the first try, but do take another, closer look as chances are it is there already. If you did not find it the second time, go ahead and start developing your own solution.

  2. External packages
    You can reuse or get inspired by external packages available on open source. This way, you will not be reinventing the wheel, you will be saving time without compromising quality and you can draw inspiration from what has already been implemented.

One caveat is that open-source dbt packages are maintained by the community. For packages that are not handled by dbt Labs, you run the risk of using outdated solutions. Even if this is the case, you get a chance to be the one to pick them up and start maintaining them in the spirit of open source.

In a nutshell, you should associate dbt packages with the principles of DRY (Don't Repeat Yourself) and KISS (Keep It Simple, Stupid), leveraging them to avoid unnecessary duplication and draw inspiration from others' work

Now that we know what dbt packages are, let’s go on and explore the way we experience them compared to the way our clients do.

POC phase / dbt discovery

In the beginning of a dbt journey, engineers usually want to get familiar with the tool and learn how to make the most of it. They will take a limited scope of existing transformations that they have already achieved without dbt and try to replicate them using dbt this time.

Without leveraging packages, it takes between 4 and 8 weeks on average before any conclusive result can be seen. Let’s see how to make that process more productive.

Generic dbt packages

Packages are very easy to set up. Here are some of the most generic ones that are usually the first to be installed.

dbt-codegen

During the POC phase, engineers have discovered how dbt works and also how verbose it could be — creating a model's YAML file manually with many tables and many columns tends to be very repetitive and painful.

This is where dbt code gen will be really useful. By connecting to your database and retrieving metadata, dbt-codegen will be able to generate for you:

  • Source models
  • Base models
  • Models

The macro will generate the YAML file content in your console; you then just need to manually create a YAML file and copy / paste the content into it.

 

dbt-utils

This package contains:

  • Advanced generic tests: for example, on dbt (without packages) you can easily create accepted_values unit tests. The opposite, however, is not available by default, so this package allows you to use not_accepted_values to define values which will not be accepted;
  • SQL generator: a good example is the macro date_spine, which can help you create a data range from one specific date to another;
  • Introspective macros: you can retrieve column values, for example, and use Jinja to loop on the results;
  • Web macros: use it when you want to extract information from a URL, such as parameters, domains or subdomains. This one is a perfect example of not reinventing the wheel;. If you have ever used SPLIT / EXTRACT functions in your code and created macros in your project only to discover that what you have done had already been achieved in a better way by more experienced engineers, this one is for you;
  • Jinja helper, a functionality that we will come back to later: When you are about to start using dbt in an extensive way, this package will be a cornerstone of your project. Many other packages depend on this one, but we will go back to it in a second.

The packages above are essential for your writing dbt code in an easier and quicker way. Now you are ready to start experimenting with more advanced ones.

dbt packages for testing

dbt_expectations

If this sounds familiar, it is because dbt_expectations is actually the famous Python library adapted for dbt. This is your ultimate testing buddy; while dbt-utils can give more advanced tests, which are not available with the basic product, dbt_expectations offers more than 40 tests to choose from.

This dbt package is not only about testing data values but also allows you to test table structure.

Have a look at expect_table_column_count_to_equal for example. It “expects the number of columns in a model to be equal to expected_number_of_columns” and you can use it like this:

 

dq_tools by Infinite Lambda

Here are two tough questions for you:

  1. What do you know about the quality of your data?
  2. Do you know how to measure it?

These questions could be really difficult to answer when you depend on multiple data sources you do not maintain yourself.

Let’s have a look at a simple example. Say there is a marketing source that your dashboards depend on; you use it to extract email information that is critical to calculate the number of new customers. If the data you extract is not trustworthy every single day, your dashboards face a number of serious issues around validity.

Naturally, your first thought is to add a test in dbt that raises a warning; yet, you cannot block the dbt run because all of the other KPIs are still relevant. Instead, the dq_tools package allows you to monitor data quality and assign a score to it that is based on the 6 pillars of data quality as shown in analytics engineer Dat Nguyen's article about implementing data quality testing in dbt:

data quality KPIs

 

Here is an example of a Snowflake dashboard that shows what this package does:

Snowflake dashboard that shows data observability dbt package dq_tools capabilities

 

NB: Stay tuned for an article that explores dq_tools in detail.

Audit packages

Auditing is a major stage in any dbt project. By that time, your customers will have discovered how to leverage packages to create advanced tests or to produce YAML files.

At such an advanced level, engineers start to use dbt in production or sometimes rewrite it from scratch after the customer gets a good understanding on how dbt actually works or open it to data analysts and scientists.

Your next challenges involve preserving code consistency in the project and ensuring that naming conventions are followed.

dbt-project-evaluator

The dbt-project-evaluator package was released in 2022 to help you monitor each graph object and create KPIs on top of it (the mart layer). All you need to do is to define the rules to raise warnings or errors.

Work with your data team and reiterate the rules that you are going to put in place to make sure that rules do not become blockers for engineers but rather help them understand why they need to follow a certain logic.

dbt-project-evaluator logic

Here is an introduction to the dbt-project-evaluator put together by Grace Cohen at dbt Labs.

A quick overview of this package:

  • Modelling
    During development, engineers can stay on top of the number of poor practices via an intermediate, mart layer having a direct relationship with the data source or a table being hardcoded for example.
  • Testing
    You can make sure that the primary key for each model does exist and measure test coverage for your models and even. You can also set up a warning or an error if test coverage goes under a certain percentage.
  • Documentation
    Similar to testing, documentation coverage will be created and trigger a warning / error. You will also be able to compile a list of models which do not have documentation.
  • Structure
    You can ensure naming conventions are followed and are the same for test, model or source directories.
  • Performance
    Engineers that are just getting started with dbt could feel tempted to create numerous transformation steps as a view, but this can impact the warehouse performance. This package allows you to set up a steps limit where views are linked together as well as raise errors upon reaching the limit. Engineers will be able to fix the errors by creating a materialised table or reviewing the code.

It is worth noting that you can customise this package easily depending on the project and the needs. When you feel comfortable using it, add it to your CI pipelines.

dbt-audit-helper

dbt-audit-helper provides a list of macros useful for auditing data. This comes especially handy when you are migrating your old pipeline to your dbt project and comparing mart tables.

Check out this example of compare_column_values:

compare_column_values dbt packages data observability

This macro will summarise the differences between two tables. Furthermore, it can be useful for data analysts working on analyses on tables that contain different data. The summary will point out the differences and enable you to debug quickly.

dbt packages for data observability

dbt packages could be a good starting point for data observability.
This is a neat approach that will be extremely useful in the future if you decide to work with an observability tool, which will provide far more features than you can get from dbt packages.

dbt_profiler

Let’s admit it, we have all been dreaming about picking the right table out of three that are all named a variation of “Order” and fetching information about the number of rows, the min / max / avg values, the number of distinct values and more. Now, you can do that and get all KPIs available in dbt docs with dbt_profiler.

Just make sure not to run it unless you really need to (e.g. not every single day) as it can impact data warehouse costs. Running it every day is definitely an overkill; in our experience, running it once a week is enough and once a month is the minimum.

dbt_artifacts

Running a dbt operation generates a results.json file that contains all operations executed by dbt. Any metadata that is created in the process is extremely valuable, especially as it is not available on the cloud data warehouse.

dbt_artifacts creates a macro that will parse the file, store the information into your data warehouse and create a mart layer with information about each of the nodes.

dbt_artifacts nodes

The package provides the following information about the models:

  • dim_dbt__current_models
  • dim_dbt__exposures
  • dim_dbt__models
  • dim_dbt__seeds
  • dim_dbt__snapshots
  • dim_dbt__sources
  • dim_dbt__tests
  • fct_dbt__invocations
  • fct_dbt__model_executions
  • fct_dbt__seed_executions
  • fct_dbt__snapshot_executions
  • fct_dbt__test_executions

dbt-fin-ops by Infinite Lambda

This is a package Infinite Lambda is going to release in the upcoming weeks.

The dbt-fin-ops package provides an overview of running pipelines relates to:

  1. Time spent to run your models
  2. Credit costs to run your models

dbt-fin-ops package

This package enables you to monitor your dbt models in terms of both time and cost. Moreover, you can define limits and raise alerts to get notified in case your pipeline is taking more time than average or if costs are increasing.

It looks like this:

dbt-fin-ops in action

Follow Infinite Lambda on GitHub to get a notification when dbt-fin-ops is released.

Let’s wrap up

There are over 200 open-source packages available on the dbt Hub.

In this article, we have reviewed a handful that will enable you to:

  • Take your first steps in data observability;
  • Monitor the resources that your dbt pipelines consume.

For a data team starting from scratch or migrating pipelines, it takes 8 to 10 months on average to implement all of the packages we talked about. This is if all goes according to plan.

Additional resources:

  1. What is a dbt package?
  2. dbt Data Quality Tests Implementation
  3. Deleting Records With dbt
  4. Using dbt Deferral To Simplify Development

If you could use some help, the Infinite Lambda team is ready to jump in and drive your project forward. We are are dbt Labs Premier Consulting Partner and a Platinum Partner holding the 2022 award for the largest number of certified engineers.

More on the topic

Everything we know, we are happy to share. Head to the blog to see how we leverage the tech.

Tag-Based Masking in Snowflake
Tag-Based Masking in Snowflake: Practical Guide with Scalable Implementation
As data continues to be a critical asset for organisations across industries, safeguarding sensitive information while enabling data access for authorised users is a constant...
June 11, 2024
Cloud sustainability
Cloud Sustainability
This article on cloud sustainability is a part of a series on carbon analytics published on the Infinite Lambda Blog. Appreciating the cloud is becoming...
June 5, 2024
How to measure happiness and safety in tech teams
How to Measure Happiness and Safety in Tech Teams
Software product development initiatives can go awry for a whole range of reasons. However, the main ones tend not to be technical at all. Rather,...
May 30, 2024
why sustainability analytics
Why Sustainability Analytics
We all like a sunny day. Kicking back in the garden with the shades on, cool drink in hand and hopefully a liberal amount of...
May 8, 2024
Data diff validation in a blue green deployment: how to guide
Data Diff Validation in Blue-Green Deployments
During a blue-green deployment, there are discrepancies between environments that we need to address to ensure data integrity. This calls for an effective data diff...
January 31, 2024
GDPR & Data Governance in Tech
GDPR & Data Governance in Tech
The increasing focus on data protection and privacy in the digital age is a response to the rapid advancements in technology and the widespread collection,...
January 18, 2024

Everything we know, we are happy to share. Head to the blog to see how we leverage the tech.