
dbt Data Quality Tests Implementation

Dat Nguyen
June 20, 2023
Read: 3 min

Data quality definition

In this blog post, I am going to show you how to leverage dbt data quality tests to build robust pipelines that work for your organisation and use case. To do that, we first need to understand what data quality is.

Data quality (DQ) is the extent to which the data is fit for its intended purpose, depending on the business requirements and environment.

DQ refers to the accuracy, completeness, consistency, timeliness, validity and uniqueness of data. These KPIs are used to measure different aspects of data quality, and by monitoring them, organisations can ensure that their data is accurate, reliable and effective in supporting decision-making processes.

[Figure: the six data quality KPIs]

Let us look at each of these six KPIs (a short schema.yml sketch mapping them to dbt tests follows the list):

  1. Accuracy: The degree to which data correctly reflects the real-world object or event: e.g. a phone number field must contain a valid number;
  2. Consistency: Data across all systems reflects the same information and stays in sync across the enterprise: e.g. a foreign key check;
  3. Completeness: The degree to which all expected data is present: e.g. no missing first name for a customer;
  4. Timeliness: Information is available when expected and needed: e.g. data is never older than 1 day;
  5. Validity: Information conforms to a specific format and follows business rules: e.g. an email address must contain '@';
  6. Uniqueness: Only one instance of the information appears in the database: e.g. CustomerID must be unique.
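
To make this concrete, here is a minimal schema.yml sketch showing how these KPIs can map onto dbt's built-in generic tests. It assumes the Jaffle Shop customers and orders models used later in this post; adapt the model and column names to your own project.

```yaml
# models/schema.yml -- a minimal sketch based on the Jaffle Shop example project
version: 2

models:
  - name: customers
    columns:
      - name: customer_id
        tests:
          - unique      # uniqueness
          - not_null    # completeness
      - name: first_name
        tests:
          - not_null    # completeness: no missing first names
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - relationships:      # consistency: foreign key check against customers
              to: ref('customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:    # validity: only known order states are allowed
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
```

Timeliness is typically covered with a package test such as dbt_utils.recency on a load timestamp, and stricter format or accuracy checks can come from a package like dbt_expectations.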

Scope

In dbt, we can cover all of the KPIs above with its two types of tests:

  • Singular tests: one-off SQL queries, stored in the tests directory, that return the rows failing an assertion;
  • Generic tests: parameterised, reusable tests (such as unique, not_null, accepted_values and relationships) that you attach to models and columns in YAML.
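
For completeness, here is a minimal sketch of a singular test; the file name and the amount column are assumptions based on the Jaffle Shop orders model.

```sql
-- tests/assert_order_amount_not_negative.sql
-- A singular test: dbt marks the test as failed if this query returns any rows.
select
    order_id,
    amount
from {{ ref('orders') }}
where amount < 0
```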

Implementing these dbt data quality tests helps us maintain healthy, high-quality data pipelines. At this point, you might be wondering whether there is a native way to monitor these DQ KPIs.

Exposing the test results against the defined KPIs and monitoring how data quality changes over time might otherwise require third-party vendors. This article shows you how to avoid that, so you can self-manage test results and easily visualise them with the BI tools you are already using.

A hands-on demonstration

The solution is very simple: use a dbt package which:

  • Supports the automatic collection of test results’ metadata;
  • Supports building the downstream data mart;
  • Provides guidelines/standard visualisations to follow.

In this article, we are going to use the dq-tools package developed by Infinite Lambda and pair it with Streamlit to demonstrate what your first dashboard should look like.

For the purposes of this article, we are going to use Jaffle Shop, a self-contained dbt project for testing.

Step 1: Clone the demo repo

We are using Snowflake as our data warehouse and we are going to set up the local environment. Let’s start by cloning the demo repo locally:
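
The repository URL below is a placeholder rather than the actual demo repo, so substitute the URL referenced in the original post or your own fork:

```bash
# Placeholder URL -- replace with the actual demo repository
git clone https://github.com/<your-org>/dq-tools-demo.git
cd dq-tools-demo
```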

Assuming that Python 3 is available in your shell as python3, let's create and activate a virtual environment:
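
A minimal sketch of that setup; installing the dbt-snowflake adapter is an assumption based on the Snowflake warehouse we use in this demo:

```bash
# Create and activate an isolated environment for the project
python3 -m venv .venv
source .venv/bin/activate

# Install dbt with the Snowflake adapter
pip install --upgrade pip
pip install dbt-snowflake
```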

You might notice that we have already registered dq-tools in the packages.yml file:
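
The entry looks roughly like the sketch below; the repository URL and revision shown are assumptions, so check the dq-tools README for the recommended way to pin the package:

```yaml
# packages.yml -- the exact entry may differ; see the dq-tools documentation
packages:
  - git: "https://github.com/infinitelambda/dq-tools.git"
    revision: main   # in practice, pin to a tagged release
```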

Now, let’s install dbt dependencies: dbt deps

Done: Your local environment is live.

 

Step 2: Run the pipeline and build the mart of the test results

  1. Build the Jaffle Shop models and test them (the command sketch after this list shows roughly what these steps look like);
  2. Enable the dq-tools models;
  3. Build the DQ Metric views (DQ Mart).
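
Roughly, those three steps translate into commands like the sketch below. The dq-tools specifics, in particular the variable that enables storing test results and the selector for the mart models, are assumptions, so check the package README for the exact names.

```bash
# 1. Build the Jaffle Shop models and run their tests
dbt build

# 2. Enable dq-tools: typically a var in dbt_project.yml that tells the package's
#    on-run-end hook to store test results (the exact var name is an assumption):
# vars:
#   dq_tools_enable_store_test_results: true

# 3. Build the DQ Metric views (DQ Mart) shipped with the package
dbt run --select package:dq_tools
```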

At this point, we should see data in the BI_DQ_METRICS view:
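
A quick way to sanity-check this is to query the view directly in Snowflake; the database and schema below are placeholders for wherever your dbt target writes the dq-tools models:

```sql
-- <your_database> and <your_schema> are placeholders for your dbt target
select *
from <your_database>.<your_schema>.bi_dq_metrics
limit 10;
```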

[Figure: sample rows from the BI_DQ_METRICS view]

 

Step 3: Draft your first dashboard

In this demo, we are using Streamlit (with Plotly charts) to build our Data Quality Score dashboard. Feel free to use your preferred BI tool (e.g. Looker, Tableau, Power BI).

To optimise costs, we are going to export our DQ Metric view into DuckDB and point the dashboard's database connection at it.

  1. (Optional) Export dq_mart.duckdb: we can skip this step as the file is already available at dashboard/dq_mart.duckdb, but the sketch after this list shows what the export could look like;
  2. Run Streamlit (see the command after this list).
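
Here is a minimal sketch of the optional export, assuming the snowflake-connector-python and duckdb packages are installed; the connection parameters and object names are placeholders:

```python
# export_dq_mart.py -- illustrative sketch; credentials and object names are placeholders
import duckdb
import snowflake.connector

# Pull the DQ Metric view from Snowflake into a pandas DataFrame
conn = snowflake.connector.connect(
    account="<account>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>",
)
df = conn.cursor().execute("select * from bi_dq_metrics").fetch_pandas_all()
conn.close()

# Write it into a local DuckDB file for the dashboard to read
duck = duckdb.connect("dashboard/dq_mart.duckdb")
duck.execute("create or replace table bi_dq_metrics as select * from df")
duck.close()
```

Then start the dashboard; the app path below is an assumption, so use whatever entry point the demo repo provides:

```bash
streamlit run dashboard/app.py
```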

[Figure: a simple dbt data quality dashboard built with Streamlit]

 

Links and resources on dbt data quality tests

Ensuring high data quality is critical for businesses, as poor quality can lead to incorrect decisions, missed opportunities and reduced efficiency. Therefore, maintaining good data quality should be a priority for any organisation that relies on data to operate.

If you have any questions about the solution outlined above, reach out to the Infinite Lambda team and we would be more than happy to help.

Make sure to explore more technical content on the Infinite Lambda blog.

 
