Data observability is vital in every dbt project, especially in large and complex ones with multiple concurrent executions. It enables data teams to swiftly detect issues, analyse their root causes and optimise pipelines efficiently.
The most common approach to implementing monitoring in dbt is by utilising dbt artifacts. These files store essential metadata from dbt invocations, such as execution times, test results and row counts. However, dbt artifacts focus solely on the model level, lacking row-level metadata.
Row-level metadata is crucial for understanding when, where and how each row enters tables during each run. This information is invaluable for root cause analysis and optimising incremental models.
In this article, we are going to explore the significance of different types of row-level metadata, their use cases and how to implement them using dbt Cloud & Snowflake to enhance data observability.
Why record row-level metadata?
Consider this scenario: You have a scheduled dbt job that inserts new data rows into a specific table daily. However, during the job's execution, a colleague mistakenly runs an SQL script that performs the same operation manually. This leads to duplicate rows in the destination table.
Without row-level metadata detailing which job or who inserted which row, you would face the arduous task of evaluating incremental logic, comparing execution times and checking SQL history logs to pinpoint the root cause.
A more complex use case for row-level metadata involves analysing incremental logic. Faulty incremental logic can result in missing, duplicated or outdated rows. In most cases, reproducing the exact incremental load that caused the issue is impossible. In such situations, you would have to compare the target table with a snapshot or use any available time-travel features. Row-level batch metadata, combined with other information from dbt artifacts, can significantly simplify this bug-chasing process.
Defining row-level metadata
In the context of dbt, row-level metadata provides information about each row, answering key questions:
When?
- When was this row first loaded into the table?
- When was the last time this row was updated?
Where?
- From which source tables, parent models or Common Table Expressions (CTEs) did this row originate?
How?
- Which query inserted or modified this row?
- Which scheduled jobs triggered the mentioned query?
- Who triggered this query?
- And more.
Types of metadata
There are several types of metadata to consider:
dbt invocation metadata
The dbt invocation metadata originates from dbt artifacts, primarily from run_results.json. While these are not strictly row-level data, they are pivotal in enhancing our metadata ecosystem. These metadata elements should be extracted and systematically stored in a separate table (let’s call it dbt_run_results) for future reference and analysis.
Moreover, there are many observability-related packages within the dbt community (such as "Elementary") that offer ready-made solutions. These can save you the effort of constructing everything from scratch. However, if you prefer to build a custom solution for greater control over the output, you can explore this comprehensive guide: “dbt Observability 101: How to Monitor dbt Run and Test Results”.
Some notable dbt invocation metadata include: the destination database and schema of each model run, run and test statuses, execution time, rows affected, and query_id. For a comprehensive list of attributes, please refer to the dbt Run Results JSON Documentation.
IDs and mapping
Metadata comes from various sources in your dbt project, with some generated during dbt execution and others imported from external tools. The latter type is often unavailable during model execution, necessitating a mapping process to establish connections.
For instance, consider the relationship between dbt and Snowflake. To track which dbt run inserted which rows, along with the run's status and corresponding Snowflake query, a linkage must be forged. It is worth noting that run statuses are recorded only at the end of a run, while data rows (along with their metadata) are inserted or modified during execution. On the other hand, dbt_run_results includes query_id, which can be used to trace the corresponding query in Snowflake's query history.
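As a sketch of that linkage, assuming a dbt_run_results table with model_run_id and query_id columns (the table and column names are this article's conventions, not a dbt built-in), the join against Snowflake's query log might look like this:

```sql
-- Sketch: join dbt run results to Snowflake's query history via query_id.
-- SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY is Snowflake's built-in query log;
-- note it can lag behind real time by up to ~45 minutes.
select
    r.model_run_id,
    r.status,
    q.query_text,
    q.user_name,
    q.total_elapsed_time
from dbt_run_results as r
join snowflake.account_usage.query_history as q
    on r.query_id = q.query_id
```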
To streamline this process, we need a unique identifier for each model run, referred to as model_run_id. This identifier combines two key IDs from dbt:
- invocation_id: This ID distinguishes which dbt run command affected specific rows. It is unique for each dbt command within each dbt project. In cases where multiple projects are involved, consider an additional dbt_project column;
- unique_id: This provides a unique identifier for each model or test within your dbt project.
The implementation of model_run_id is essential in both the dbt_run_results table and within each row-level metadata column to facilitate future joins. In the case of dbt_run_results, this column should be generated during the parsing and insertion process following each run.
For row-level metadata, dbt provides a convenient means to import the mentioned IDs as variables. Here is how you can implement model_run_id in each record:
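A minimal sketch, using the `invocation_id` and `model.unique_id` variables available in the dbt model context (the upstream model name and the pipe delimiter are illustrative assumptions):

```sql
select
    src.*,
    -- invocation_id is unique per dbt command; model.unique_id is unique
    -- per model within the project. Concatenated, they identify one model run.
    '{{ invocation_id }}' || '|' || '{{ model.unique_id }}' as model_run_id
from {{ ref('my_upstream_model') }} as src
```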
Orchestration tool metadata (dbt Cloud)
Orchestration tool metadata plays a crucial role in debugging data pipelines, especially when dealing with potentially overlapping scheduled jobs. While this blog post primarily focuses on dbt Cloud, it is essential to consider that these metadata concepts are generally consistent across various orchestration tools.
In most cases, you would want to know which run of which job caused a particular issue. To do that, you need to record the following IDs in your metadata system: PROJECT_ID, JOB_ID, and RUN_ID.
Note that dbt Cloud's RUN_ID refers to the scheduled run of each dbt Cloud job and is not to be confused with invocation_id, as one run may encompass multiple commands and, subsequently, multiple invocation_ids.
These IDs are accessible as environment variables. You can retrieve them using the following syntax: {{ env_var("DBT_CLOUD_RUN_ID", "manual") }}. For details, refer to the Special environment variables section of dbt's “Environment variables” documentation.
These columns can be implemented either as invocation metadata in the dbt_run_results table or directly within each row as row-level metadata (during dbt execution). In this blog post, we will opt for the former approach to reduce the size of metadata columns.
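Whichever approach you choose, the three IDs can be read the same way; here is a sketch (the output column names are illustrative, and the "manual" fallback covers runs triggered outside dbt Cloud):

```sql
-- Sketch: capturing dbt Cloud job metadata via special environment variables.
select
    '{{ env_var("DBT_CLOUD_PROJECT_ID", "manual") }}' as dbt_cloud_project_id,
    '{{ env_var("DBT_CLOUD_JOB_ID", "manual") }}'     as dbt_cloud_job_id,
    '{{ env_var("DBT_CLOUD_RUN_ID", "manual") }}'     as dbt_cloud_run_id
```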
Batch metadata
Batch metadata provides information about the source of data within each row and the timestamps of its creation or modification. These metadata are essential in debugging faulty incremental logic which potentially results in missing, duplicated or incorrectly updated rows.
- created_at: Timestamp indicating when a row was first loaded into the table;
- updated_at: Timestamp indicating when a row was last updated;
- record_source: The source from which the row originated, usually the raw table name;
- cte_source: Useful for tracing records in multi-CTE models.
Some prebuilt dbt packages and macros have already implemented these batch metadata columns under different names. For example, the AutomateDV package (formerly dbtvault) provides load_datetime and record_source if you are working on a Data Vault project.
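Here is a sketch of these columns in an incremental model; the source, column names and watermark logic are illustrative assumptions (merge_exclude_columns, available in recent dbt versions on merge-capable adapters such as Snowflake, keeps created_at from being overwritten on updates):

```sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    merge_exclude_columns=['created_at']  -- preserve created_at on updates
) }}

select
    src.order_id,
    src.amount,
    -- batch metadata (illustrative names)
    current_timestamp() as created_at,      -- set on first load only
    current_timestamp() as updated_at,      -- refreshed on every merge
    'raw.shop.orders'   as record_source
from {{ source('shop', 'orders') }} as src
{% if is_incremental() %}
where src.loaded_at > (select max(updated_at) from {{ this }})
{% endif %}
```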
Data warehouse metadata
Data warehouses typically provide their own set of metadata for each query, with a particular focus on query cost and performance. In the case of Snowflake, this valuable information resides within the query history, where each query is uniquely identified by a query_id.
These query_ids are also recorded in dbt_run_results after each invocation completes, as mentioned in the previous section. Recording them in both places enables granular analysis, not only at the model level but also at the level of individual related rows.
Alternatively, there is another method to establish a backward mapping between dbt model-run(s) and Snowflake, and it involves the use of the following:
- query-comment: This feature injects dbt metadata into SQL text comments, either as headers or footers. It may include details like the model name and execution timestamp. This facilitates traceability, showing us which dbt job triggered or who executed a particular query;
- query-tag: This approach adds tags to each model, making them queryable within Snowflake's query history. Unlike query-comment, this method offers the advantage of grouping multiple related queries for performance analysis. However, it comes with a slightly slower execution time and higher metadata granularity.
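As a sketch, both can be configured in dbt_project.yml; the comment template and tag value below are assumptions, and query-comment supports Jinja with access to the current node:

```yaml
# dbt_project.yml (sketch)
query-comment:
  comment: "dbt node: {{ node.unique_id if node else 'unknown' }} | invocation: {{ invocation_id }}"
  append: true   # place the comment as a footer rather than a header

models:
  my_project:                 # hypothetical project name
    +query_tag: dbt_models    # filterable in Snowflake's query history
```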
Metadata implementation
To effectively implement the diverse range of metadata discussed earlier in a dbt Cloud and Snowflake environment, follow these steps:
- Parse and Record: Begin by parsing the run_results.json file and recording the run results in a dedicated model, which we are going to call dbt_run_results. This model should also incorporate metadata from dbt Cloud jobs;
- Calculate Model Run IDs: Calculate the model_run_id for each row within every dbt model. This unique identifier will serve as the link between individual rows and dbt_run_results;
- Batch Metadata: For batch metadata and other row-level metadata accessible during dbt execution, add or calculate them in each row of every dbt model.
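The first step can be sketched as an on-run-end macro that reads dbt's results object; the table schema and inserted columns below are minimal assumptions (a production version would add the dbt Cloud IDs, query_id and more):

```sql
-- macros/log_run_results.sql (sketch)
-- Wired up in dbt_project.yml via: on-run-end: "{{ log_run_results(results) }}"
{% macro log_run_results(results) %}
    {% if execute %}
        {% for res in results %}
            {% set insert_sql %}
                insert into {{ target.schema }}.dbt_run_results
                    (model_run_id, status, execution_time)
                values (
                    '{{ invocation_id }}' || '|' || '{{ res.node.unique_id }}',
                    '{{ res.status }}',
                    {{ res.execution_time }}
                )
            {% endset %}
            {% do run_query(insert_sql) %}
        {% endfor %}
    {% endif %}
{% endmacro %}
```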
Handling excessive metadata
While each piece of metadata usually occupies its own column, accumulating too much metadata can widen tables and potentially impact query performance. To mitigate this, consider storing only the relevant metadata based on your system's requirements.
If even that becomes overwhelming, think about consolidating metadata into a Snowflake VARIANT column using JSON formatting. This reduces the number of columns but can make accessing said metadata more challenging due to the flattening process.
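A sketch of that consolidation using Snowflake's OBJECT_CONSTRUCT (the metadata keys and upstream model are illustrative); individual values can later be extracted with path syntax such as _metadata:model_run_id::string:

```sql
-- Sketch: packing row-level metadata into one VARIANT column.
select
    src.*,
    object_construct(
        'model_run_id',  '{{ invocation_id }}' || '|' || '{{ model.unique_id }}',
        'record_source', 'raw.shop.orders',
        'loaded_at',     current_timestamp()
    ) as _metadata
from {{ ref('stg_orders') }} as src
```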
Metadata inheritance
Row-level metadata columns should not be inherited from upstream models; they should be recalculated in every downstream model. Therefore, if you use SELECT * in one of your upstream CTEs instead of explicitly listing the relevant columns, remember to exclude the inherited metadata columns and re-add or recalculate them in the final CTE.
For example, see the code block below:
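A sketch of this pattern using the dbt_utils.star macro to drop the inherited metadata columns (the dbt_utils package must be installed, and the model and column names are illustrative):

```sql
with upstream as (

    -- select everything except the upstream model's metadata columns
    select
        {{ dbt_utils.star(from=ref('stg_orders'),
                          except=['model_run_id', 'created_at', 'updated_at']) }}
    from {{ ref('stg_orders') }}

)

select
    upstream.*,
    -- recalculate the metadata for this model
    '{{ invocation_id }}' || '|' || '{{ model.unique_id }}' as model_run_id,
    current_timestamp() as created_at,
    current_timestamp() as updated_at
from upstream
```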
Wrapping up the row-level metadata talk
Enhancing your dbt project with row-level metadata not only simplifies troubleshooting and optimisation but also empowers you to gain deeper insights into your data. With the right metadata in place, you can confidently navigate the complexities of your data pipelines and ensure data integrity at every step.
Make sure to visit the Infinite Lambda blog for more insights on leveraging dbt efficiently.