Data quality definition
In this blog post, I am going to show you how to leverage dbt data quality tests to build robust pipelines that work for your organisation and use case. To do that, we first need to understand what data quality is.
Data quality (DQ) is the extent to which the data is fit for its intended purpose, depending on the business requirements and environment.
DQ refers to the accuracy, completeness, consistency, timeliness, validity and uniqueness of data. These six KPIs measure different aspects of data quality, and by monitoring them, organisations can ensure that their data is accurate, reliable and effective in supporting decision-making processes.
Let us look at each of these six KPIs:
- Accuracy: The degree to which data correctly reflects the real-world object or event: e.g. a phone number field must contain a valid number;
- Consistency: Data across all systems reflects the same information and is in sync across the enterprise: e.g. a foreign key check;
- Completeness: The degree to which all expected data is present: e.g. a customer record must not be missing the first name;
- Timeliness: Information is available when expected and needed: e.g. data is never older than 1 day;
- Validity: Information conforms to a specific format and follows business rules: e.g. an email address must contain '@';
- Uniqueness: Only one instance of the information appears in the database: e.g. CustomerID must be unique.
Scope
In dbt, all of the KPIs above can be covered by two types of tests:
- Generic tests (see the dbt documentation);
- Singular tests (see the dbt documentation).
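As an illustration, here is roughly what a few generic tests could look like in a model's YAML file. The model and column names below are made up for this example; the point is that dbt's built-in unique, not_null, relationships and accepted_values tests map naturally onto the KPIs above. Singular tests, by contrast, are one-off SQL queries saved in the tests/ folder.

```yaml
# models/schema.yml (illustrative model and column names)
version: 2

models:
  - name: customers
    columns:
      - name: customer_id
        tests:
          - unique      # uniqueness
          - not_null    # completeness
  - name: orders
    columns:
      - name: customer_id
        tests:
          - relationships:    # consistency: every order points to a known customer
              to: ref('customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:  # validity
              values: ['placed', 'shipped', 'completed', 'returned']
```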
Implementing these dbt data quality tests helps us maintain healthy, high-quality data pipelines. At this point, you might be wondering whether there is a native solution to monitor this DQ.
Exposing the test results against the defined KPIs and monitoring how data quality changes over time typically requires third-party vendors. This article shows you how to avoid that, so you can self-manage test results and easily visualise them with the BI tools you are already using.
A hands-on demonstration
The solution is very simple: use a dbt package which:
- Supports the automatic collection of test results’ metadata;
- Supports building the downstream data mart;
- Provides guidelines/standard visualisations to follow.
In this article, we are going to use the dq-tools package developed by Infinite Lambda and pair it with Streamlit to demonstrate what your first dashboard should look like.
For the purposes of this article, we are going to use Jaffle Shop, a self-contained dbt project for testing.
Step 1: Clone the demo repo
We are using Snowflake as our data warehouse and we are going to set up the local environment. Let’s start by cloning the demo repo locally:
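A minimal sketch of the clone step, assuming the demo repository lives at the GitHub path listed in the resources section at the end of this post:

```bash
# Clone the demo project and step into it
git clone https://github.com/infinitelambda/dq-demo.git
cd dq-demo
```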
Assuming you are using Python 3 (available in your shell as python3), let's create and activate a virtual environment:
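An indicative sequence (the adapter package is an assumption based on Snowflake being our warehouse; the demo repo may pin its own requirements):

```bash
# Create and activate a virtual environment, then install dbt for Snowflake
python3 -m venv .venv
source .venv/bin/activate
pip install dbt-snowflake
```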
You might notice that we have already registered dq-tools in the packages.yml file:
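The entry looks roughly like the one below; treat the package name and version pin as illustrative and check dbt Hub for the current release:

```yaml
# packages.yml
packages:
  - package: infinitelambda/dq_tools
    version: [">=1.0.0", "<2.0.0"]  # illustrative range; see dbt Hub for the latest version
```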
Now, let's install the dbt dependencies:
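```bash
# Pull dq-tools (and any other declared packages) into the project
dbt deps
```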
Done: Your local environment is live.
Step 2: Run the pipeline and build the mart of the test results
- Build the Jaffle Shop models and test them (see the command sketch after this list);
- Enable the dq-tools models;
- Build the DQ Metric views (the DQ Mart).
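Below is an indicative command sequence for these three steps. The package selector is an assumption about how dq-tools is wired up, so double-check it against the demo repo and the package documentation:

```bash
# 1. Build the Jaffle Shop models and run their tests
dbt build

# 2. Enable the dq-tools models
#    (the demo repo already configures this in dbt_project.yml; if not, enable the
#     package models there before running the next command)

# 3. Build the DQ Metric views (the DQ Mart) by selecting only the package models
dbt run --select package:dq_tools
```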
At this point, we should see data in the BI_DQ_METRICS view:
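A quick way to verify is to query the view directly in Snowflake (the database and schema depend on your dbt profile):

```sql
-- Sanity check: the DQ Mart should now contain test result rows
select *
from bi_dq_metrics
limit 10;
```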
Step 3: Draft your first dashboard
In this demo, we are using Streamlit (with Plotly for the charts) to build our Data Quality Score dashboard. Feel free to use your preferred BI tool instead (e.g. Looker, Tableau, Power BI).
To optimise costs, we are going to export our DQ Metric view into DuckDB and point the dashboard's database connection at it.
- (Optional) Export dq_mart.duckdb: we can skip this step as the file is already available at dashboard/dq_mart.duckdb, but the sketch after this list shows what the export could look like;
- Run Streamlit (see the command below).
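A minimal sketch of the optional export, assuming snowflake-connector-python (with pandas support) and duckdb are installed; the connection parameters are placeholders for your own account details:

```python
# Optional: export the BI_DQ_METRICS view from Snowflake into a local DuckDB file
import duckdb
import snowflake.connector

# Placeholder credentials: replace with your own account details
conn = snowflake.connector.connect(
    account="<account>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>",
)

# Pull the DQ Mart view into a pandas DataFrame
df = conn.cursor().execute("select * from bi_dq_metrics").fetch_pandas_all()

# Write it to the DuckDB file the dashboard reads from
ddb = duckdb.connect("dashboard/dq_mart.duckdb")
ddb.execute("create or replace table bi_dq_metrics as select * from df")
ddb.close()
conn.close()
```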
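Then start the dashboard; the script name below is a placeholder, so use whatever the demo repo actually ships in the dashboard folder:

```bash
cd dashboard
streamlit run app.py  # replace app.py with the dashboard script in the repo
```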
Links and resources on dbt data quality tests
- dbt package available at infinitelambda/dq-tools and at dbt Hub;
- Demo repository: infinitelambda/dq-demo;
- Streamlit live dashboard: Data Quality Score (Streamlit)
Ensuring high data quality is critical for businesses, as poor quality can lead to incorrect decisions, missed opportunities and reduced efficiency. Therefore, maintaining good data quality should be a priority for any organisation that relies on data to operate.
If you have any questions about the solution outlined above, reach out to the Infinite Lambda team and we would be more than happy to help.
Make sure to explore more technical content on the Infinite Lambda blog.