...

dbt Data Quality Tests Implementation

Dat Nguyen
June 20, 2023

Data quality definition

In this blog post, I am going to show you how to leverage dbt data quality tests to build robust pipelines that work for your organisation and use case. To do that, we first need to understand what data quality is.

Data quality (DQ) is the extent to which the data is fit for its intended purpose, depending on the business requirements and environment.

DQ refers to the accuracy, completeness, consistency, timeliness, validity and uniqueness of data. These KPIs are used to measure different aspects of data quality, and by monitoring them, organisations can ensure that their data is accurate, reliable and effective in supporting decision-making processes.

[Image: the six data quality KPIs]

Let us look at each of these 6 KPIs:

  1. Accuracy: The degree to which data correctly reflects the real-world object or event: e.g. a phone number field must contain a valid number;
  2. Consistency: Data across all systems reflects the same information and stays in sync across the enterprise: e.g. a foreign key check;
  3. Completeness: All expected data is present: e.g. a customer record with a missing first name is incomplete;
  4. Timeliness: Information is available when expected and needed: e.g. data is never older than 1 day;
  5. Validity: Information conforms to a specific format and follows business rules: e.g. an email address must contain '@';
  6. Uniqueness: Only one instance of the information appears in the database: e.g. CustomerID must be unique.

Scope

In dbt, we can cover all of the KPIs above with two types of tests:

  • Generic tests: reusable, parameterised tests such as unique, not_null, accepted_values and relationships, attached to models in YAML;
  • Singular tests: bespoke SQL queries that return the rows violating a business rule.
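For illustration, here is a minimal sketch of both flavours against a hypothetical customers model; the model and column names are made up, but the test syntax is standard dbt:

    # models/schema.yml - generic tests on a hypothetical customers model
    version: 2
    models:
      - name: customers
        columns:
          - name: customer_id
            tests:
              - unique      # uniqueness
              - not_null    # completeness
          - name: email
            tests:
              - not_null    # completeness

A singular test is simply a SQL file that returns the offending rows:

    -- tests/assert_email_contains_at.sql - singular test for the validity KPI
    select *
    from {{ ref('customers') }}
    where email not like '%@%'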

Implementing these dbt data quality tests will help us maintain healthy, high-quality data pipelines. At this point, you might be wondering whether there is a native solution to monitor these DQ KPIs.

Exposing the test results against the defined KPIs and monitoring how data quality changes over time would typically require a third-party vendor. This article shows you how to avoid that, so you can self-manage test results and easily visualise them with the BI tools you are already using.

A hands-on demonstration

The solution is very simple: use a dbt package which:

  • Supports the automatic collection of test results’ metadata;
  • Supports building the downstream data mart;
  • Provides guidelines/standard visualisations to follow.

In this article, we are going to use the dq-tools package developed by Infinite Lambda and pair it with Streamlit to demonstrate what your first dashboard should look like.

For the purposes of this article, we are going to use Jaffle Shop, a self-contained dbt project for testing.

Step 1: Clone the demo repo

We are using Snowflake as our data warehouse and we are going to set up the local environment. Let’s start by cloning the demo repo locally:
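The clone itself is a one-liner; the repository URL below is illustrative, so use the one referenced in the original post:

    # clone the demo project and step into it (URL is illustrative)
    git clone https://github.com/infinitelambda/dq-tools-demo.git
    cd dq-tools-demo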

Assuming that you have Python 3 available as python3, let's create and activate a virtual environment:
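A typical setup looks like this; if the repo ships a requirements.txt, install from that instead of installing dbt-snowflake directly:

    # create and activate a virtual environment, then install dbt for Snowflake
    python3 -m venv .venv
    source .venv/bin/activate
    pip install dbt-snowflake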

You might notice that we have already registered dq-tools in the packages.yml file:
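Whether it is pulled from the dbt Hub or straight from GitHub, the entry looks roughly like this; the package handle and version shown here are illustrative, so check the dq-tools documentation for the current ones:

    # packages.yml (handle and version are illustrative)
    packages:
      - package: infinite_lambda/dq_tools
        version: 1.1.0
      # or, installed directly from the repository:
      # - git: "https://github.com/infinitelambda/dq-tools.git"
      #   revision: main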

Now, let's install the dbt dependencies:

    dbt deps

Done: Your local environment is live.

 

Step 2: Run the pipeline and build the mart of the test results

  1. Build Jaffle Shop models and test them
  2. Enable dq-tools’s models
  3. Build DQ Metric views (DQ Mart)
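Sketched as dbt CLI commands, those three steps look roughly like this. How exactly the dq-tools models are enabled depends on the package version (typically a variable in dbt_project.yml), so the variable name below is illustrative:

    # 1. Build the Jaffle Shop models and run their tests
    dbt build

    # 2. Enable dq-tools' models, e.g. via a variable in dbt_project.yml
    #    (the variable name is illustrative - check the dq-tools docs)
    # vars:
    #   dq_tools_enable_store_test_results: true

    # 3. Build the DQ Metric views (DQ Mart) that ship with the package
    dbt run --select package:dq_tools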

At this point, we should see data in the BI_DQ_METRICS view:

[Image: the BI_DQ_METRICS view populated with dbt data quality test results]
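You can sanity-check it from any SQL client by querying the schema in which the DQ mart was built:

    -- quick look at the DQ mart
    select * from bi_dq_metrics limit 10;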

 

Step 3: Draft your first dashboard

In this demo, we are using Streamlit (with Plotly charts) to build our Data Quality Score dashboard. Feel free to use your preferred BI tools (e.g. Looker, Tableau, Power BI).

To optimise costs, we are going to export our DQ Metric view into DuckDB and point the dashboard's database connection at that local file.

  1. (Optional) Export dq_mart.duckdb: We can skip this step as it is already available at dashboard/dq_mart.duckdb but here is what it would look like:
  2. Run Streamlit:
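For the optional export, a small Python script along these lines would work; the connection parameters and view qualification are placeholders rather than the script shipped with the demo repo:

    # export_dq_mart.py - illustrative sketch only
    import duckdb
    import snowflake.connector

    # connect to Snowflake (all identifiers below are placeholders)
    sf = snowflake.connector.connect(
        account="<account>",
        user="<user>",
        password="<password>",
        warehouse="<warehouse>",
        database="<database>",
        schema="<schema>",
    )

    # pull the DQ metrics view into a pandas DataFrame
    df = sf.cursor().execute("select * from BI_DQ_METRICS").fetch_pandas_all()

    # write it to a local DuckDB file that the dashboard will read
    con = duckdb.connect("dashboard/dq_mart.duckdb")
    con.execute("create or replace table bi_dq_metrics as select * from df")
    con.close()

Running the dashboard is then a single command; the entry-point path below follows the demo layout and may differ in your copy:

    # launch the Streamlit app against the DuckDB file
    streamlit run dashboard/app.py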

[Image: a simple dbt data quality tests dashboard in Streamlit]

 

Links and resources on dbt data quality tests

Ensuring high data quality is critical for businesses, as poor quality can lead to incorrect decisions, missed opportunities and reduced efficiency. Therefore, maintaining good data quality should be a priority for any organisation that relies on data to operate.

If you have any questions about the solution outlined above, reach out to the Infinite Lambda team and we would be more than happy to help.

Make sure to explore more technical content on the Infinite Lambda blog.

 
