Data Vault is a data warehousing methodology that provides a standardised and scalable approach to managing enterprise data. At its core, it is designed to be a flexible and adaptable framework that can store and manage large amounts of structured and unstructured data.
The three main Data Vault components are the Hub, the Link and the Satellite. They work together to create a flexible and scalable data model that can be easily extended and modified as data requirements change over time.
Data Vault also offers features such as built-in data lineage and auditing capabilities, making it well-suited to compliance and regulation-heavy industries.
Let us take a look at the main Data Vault components first.
Main Data Vault components
Hubs: Unique list of business keys
A hub stores the business keys for one business object, collected from every source system in which those keys share the same semantic meaning and granularity. Hubs are the central entities in the model and are essential for ensuring referential integrity and avoiding redundancy in the Data Vault.
They have the following characteristics:
- Uniqueness: Each business key is stored only once in a hub, so every business object can be identified as a distinct and separate entity;
- Stability: Hubs provide stability in the model by maintaining a consistent list of business keys, even as source systems change or evolve over time;
- Simplicity: The structure of a hub is simple, typically containing only the business key, the record source, the load date and a surrogate key (a hash key in Data Vault 2.0). This level of simplicity aids in maintaining and scaling the data warehouse;
- Minimal dependency: Hubs have no dependencies on other components within the Data Vault model, which promotes modularity and facilitates parallel loading processes.
The idea behind hubs is to create a central repository for each type of business object, which makes it easier to manage and maintain the data. By separating data into hubs, you can also ensure that each piece of data is as accurate and consistent as possible.
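To make the hub structure more concrete, here is a minimal Python sketch of loading a hub. It is an illustration under stated assumptions rather than a reference implementation: the names `HubRow`, `load_hub` and `hash_key`, the choice of SHA-256 and the trim-and-uppercase normalisation are all hypothetical and would be agreed per project.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


def hash_key(business_key: str) -> str:
    """Hash a normalised business key (SHA-256 is an illustrative choice)."""
    normalised = business_key.strip().upper()  # assumed normalisation rule
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class HubRow:
    hub_hash_key: str      # surrogate (hash) key
    business_key: str      # key as delivered by the source
    load_date: datetime    # when the key was first seen
    record_source: str     # where the key was first seen


def load_hub(hub: dict[str, HubRow], business_keys: list[str], record_source: str) -> int:
    """Insert business keys that are not in the hub yet; return how many were added."""
    added = 0
    for key in business_keys:
        hk = hash_key(key)
        if hk not in hub:  # existing keys are never updated or duplicated
            hub[hk] = HubRow(hk, key.strip(), datetime.now(timezone.utc), record_source)
            added += 1
    return added


hub_customer: dict[str, HubRow] = {}
first = load_hub(hub_customer, ["C-001", "C-002"], "crm")
second = load_hub(hub_customer, ["c-002 ", "C-003"], "billing")
```

In this sketch, the second load adds only one row, because "c-002 " normalises to a key the hub already holds. The row for that key keeps the record source and load date of its first arrival.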
Links: Unique list of relationships
Simply put, links connect hubs. Instead of adding the foreign key to a table, the relationship is stored as data, which makes it very flexible to change. You can think of links as the relationships between business entities in Data Vault 2.0.
Here is what you need to know about links:
- Structure: A link consists of a link ID (the primary key), which is usually a system-generated hash key. Additionally, it contains the foreign keys from the hubs it connects. These foreign keys are the surrogate keys of the respective hubs;
- Grain: Links maintain a record at the lowest level of granularity for the relationships between the hubs they connect. This ensures that the data remains highly detailed and accurate;
- Minimal dependency: Similarly to hubs, links also have minimal dependencies on other Data Vault components. This modularity simplifies the maintenance process and facilitates parallel loading when integrating new data sources.
In Data Vault, you can connect multiple hubs and links together to create a chain. This can be useful for analysing complex relationships between different pieces of data. Moreover, in most cases, links are bidirectional, meaning you can easily navigate back and forth between the two pieces of information they connect.
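The link structure described above can be sketched in Python as follows. Treat it as a hedged example: the customer and order hubs, the `build_link_row` helper, the `||` delimiter and SHA-256 are assumptions made for illustration, not something the methodology prescribes in this exact form.

```python
import hashlib
from datetime import datetime, timezone

DELIMITER = "||"  # hypothetical separator between business keys


def hash_key(*business_keys: str) -> str:
    """Hash one or more normalised business keys into a single key."""
    normalised = DELIMITER.join(k.strip().upper() for k in business_keys)
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def build_link_row(customer_key: str, order_key: str, record_source: str) -> dict:
    """Build one link row connecting a customer hub and an order hub."""
    return {
        "link_hash_key": hash_key(customer_key, order_key),  # primary key of the link
        "customer_hash_key": hash_key(customer_key),         # points to the customer hub
        "order_hash_key": hash_key(order_key),               # points to the order hub
        "load_date": datetime.now(timezone.utc),
        "record_source": record_source,
    }


link_customer_order: dict[str, dict] = {}
pairs = [("C-001", "O-100"), ("C-001", "O-101"), ("C-001", "O-100")]
for customer, order in pairs:
    row = build_link_row(customer, order, "shop")
    # Each relationship is stored once; a repeated pair is ignored.
    link_customer_order.setdefault(row["link_hash_key"], row)
```

Because the relationship is stored as data, the link can be read from either side: filter on `customer_hash_key` to find a customer's orders, or on `order_hash_key` to find the customer behind an order.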
Satellites: Descriptive data with change history
Satellites store contextual, descriptive and historical information about the hubs and links they are attached to, depending on whether the data is related to a business object or a relationship.
Satellites contribute to:
- Flexible data storage: Satellites contain attributes, such as descriptions, statuses or dates, that describe the hubs or links they are connected to. This flexibility enables easy addition or modification of attributes as business needs change;
- Historical tracking: Satellites maintain a history of changes to the data they represent, allowing for accurate analysis of historical trends and business changes;
- Isolation of data: By separating descriptive information from the hubs and links, satellites help to isolate data and prevent potential data integrity issues. This isolation simplifies data management and ensures data quality;
- Reduced dependencies: Satellites have minimal dependencies on other Data Vault components, contributing to the overall modularity and maintainability of the data warehouse.
Each satellite in Data Vault provides additional, valuable information about its parent entity. A satellite has exactly one parent (either one hub or one link), while a single hub or link can have many satellites.
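One way to picture the historical tracking in a satellite is an insert-only table that receives a new row only when the descriptive attributes change. The Python sketch below assumes a hash over the attributes (often called a hash diff) for change detection. That technique, along with the names `hash_diff` and `load_satellite`, is an assumption for this example and is not described in the text above.

```python
import hashlib
from datetime import datetime


def hash_diff(attributes: dict[str, str]) -> str:
    """Hash all attributes in a fixed order so changes are cheap to detect."""
    payload = "||".join(f"{name}={attributes[name]}" for name in sorted(attributes))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load_satellite(satellite: list[dict], parent_hash_key: str,
                   attributes: dict[str, str], load_date: datetime) -> bool:
    """Append a new version only if the attributes differ from the latest one."""
    versions = [r for r in satellite if r["parent_hash_key"] == parent_hash_key]
    latest = max(versions, key=lambda r: r["load_date"], default=None)
    new_diff = hash_diff(attributes)
    if latest is not None and latest["hash_diff"] == new_diff:
        return False  # nothing changed, so no new row
    satellite.append({
        "parent_hash_key": parent_hash_key,  # the single parent hub or link
        "load_date": load_date,
        "hash_diff": new_diff,
        "attributes": dict(attributes),
    })
    return True


sat_customer_details: list[dict] = []
results = [
    load_satellite(sat_customer_details, "hk1", {"name": "Ada", "status": "active"}, datetime(2024, 1, 1)),
    load_satellite(sat_customer_details, "hk1", {"name": "Ada", "status": "active"}, datetime(2024, 1, 2)),
    load_satellite(sat_customer_details, "hk1", {"name": "Ada", "status": "inactive"}, datetime(2024, 1, 3)),
]
```

Existing rows are never updated here. The unchanged delivery on the second day is skipped, and the status change on the third day becomes a second version alongside the first.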
Additional Data Vault features
Built-in data lineage feature
Data Vault provides a powerful and flexible way to manage enterprise data, which makes it well-suited to compliance and regulation-heavy industries.
The methodology offers a way to track the data as it moves through the system, from its source to its final destination. This enables users to get insights into where the data came from, how it was processed and where it is currently being used. This helps in establishing trust in the data and ensures that the data is aligned with business requirements.
You can think about data lineage as a family tree, with the hub tables representing the ancestors, the link tables representing the relationships between the ancestors and the satellite tables representing additional information about the ancestors and their relationships.
By tracking the data lineage, organisations can ensure that their data is accurate, complete and trustworthy. This is particularly important in industries with strict compliance regulations, such as finance, insurance and healthcare.
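As a rough illustration of how lineage can be read back, the sketch below assumes that every row in every table carries a load date and a record source, as the hub description earlier suggests. The `vault` dictionary, its table names and the `trace_lineage` function are hypothetical and exist only for this example.

```python
from datetime import datetime

# A toy vault: table name -> rows, each with a hash key, load date and record source.
vault = {
    "hub_customer": [
        {"hash_key": "hk1", "load_date": datetime(2024, 1, 1), "record_source": "crm"},
    ],
    "sat_customer_details": [
        {"hash_key": "hk1", "load_date": datetime(2024, 1, 1), "record_source": "crm"},
        {"hash_key": "hk1", "load_date": datetime(2024, 3, 5), "record_source": "billing"},
        {"hash_key": "hk2", "load_date": datetime(2024, 2, 1), "record_source": "crm"},
    ],
}


def trace_lineage(vault: dict[str, list[dict]], hash_key: str) -> list[tuple]:
    """Return (load_date, table, record_source) for every row about one key, oldest first."""
    events = [
        (row["load_date"], table, row["record_source"])
        for table, rows in vault.items()
        for row in rows
        if row["hash_key"] == hash_key
    ]
    return sorted(events)


lineage = trace_lineage(vault, "hk1")
```

For the key `hk1`, the result lists when each piece of data arrived and which source system delivered it.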
Data auditability
Data Vault provides a detailed audit trail of all changes made to the data model, capturing changes to data elements, relationships and business rules.
The audit trail can be used to identify who made the changes, when they were made and why they were made.
The ability to track changes provides a clear history of the data, from its origin to its current state. This helps organisations maintain integrity and accuracy of their data, which is important for compliance purposes, especially in heavily regulated industries.
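Because history is kept rather than overwritten, the state of a record at any past moment can be reconstructed. The following Python sketch shows the idea on a small, hypothetical satellite. It covers only the "when" and "from which source" part of an audit trail, and the function name `state_as_of` and the row layout are assumptions.

```python
from datetime import datetime
from typing import Optional

# Hypothetical satellite rows for one customer, oldest first.
sat_customer_details = [
    {"parent_hash_key": "hk1", "load_date": datetime(2024, 1, 1),
     "record_source": "crm", "attributes": {"status": "active"}},
    {"parent_hash_key": "hk1", "load_date": datetime(2024, 3, 5),
     "record_source": "billing", "attributes": {"status": "inactive"}},
]


def state_as_of(satellite: list[dict], parent_hash_key: str, as_of: datetime) -> Optional[dict]:
    """Return the latest version loaded on or before `as_of`, or None if there is none."""
    candidates = [
        row for row in satellite
        if row["parent_hash_key"] == parent_hash_key and row["load_date"] <= as_of
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row["load_date"])


february = state_as_of(sat_customer_details, "hk1", datetime(2024, 2, 1))
june = state_as_of(sat_customer_details, "hk1", datetime(2024, 6, 1))
before_load = state_as_of(sat_customer_details, "hk1", datetime(2023, 12, 31))
```

An auditor asking what the warehouse knew on a given date gets the version that was valid then, together with the source that delivered it.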
Modelling Data Vault components
The single most important thing about modelling Data Vault components is to model business processes rather than the data.
To model the components in your Data Vault, start by thinking about the way hubs, links and satellites will be used in your model. Remember that hubs represent the unique business keys within your dataset, the links capture the relationships between hubs and the satellites store context and historical information about the other two Data Vault components.
Now, start modelling:
- Model the hubs: Each hub represents an entity that is relevant in the context of the business process you are trying to model. These entities are uniquely identified by business keys, which you will need to look for in various source systems. On top of the business key, ensure that each hub has a load date attribute, a record source attribute and a hash key, which is a hashed representation of the business key;
- Model the links: Think about the relationships between the hubs and create your links accordingly. Each link should have a hashed primary key, along with the foreign keys from the hubs it connects;
- Model the satellites: Determine the contextual, descriptive, and historical information associated with hubs and links, and create satellites to store this data.
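The three modelling steps above can be sketched as a function that splits one flat source record into hub, link and satellite rows. This is a simplified illustration: the order-placement process, the column names (`customer_id`, `order_id`, `customer_name`, `order_status`), the table names and the use of SHA-256 are all assumptions.

```python
import hashlib
from datetime import datetime


def hash_key(*business_keys: str) -> str:
    """Hash one or more normalised business keys (illustrative convention)."""
    normalised = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def split_source_row(row: dict, record_source: str, load_date: datetime) -> dict:
    """Turn one flat source row into rows for two hubs, one link and two satellites."""
    meta = {"load_date": load_date, "record_source": record_source}
    customer_hk = hash_key(row["customer_id"])
    order_hk = hash_key(row["order_id"])
    return {
        # Step 1: hubs hold the business keys.
        "hub_customer": {"hash_key": customer_hk, "business_key": row["customer_id"], **meta},
        "hub_order": {"hash_key": order_hk, "business_key": row["order_id"], **meta},
        # Step 2: the link records that this customer placed this order.
        "link_customer_order": {
            "hash_key": hash_key(row["customer_id"], row["order_id"]),
            "customer_hash_key": customer_hk,
            "order_hash_key": order_hk,
            **meta,
        },
        # Step 3: satellites hold the descriptive attributes of each parent.
        "sat_customer_details": {"parent_hash_key": customer_hk,
                                 "customer_name": row["customer_name"], **meta},
        "sat_order_details": {"parent_hash_key": order_hk,
                              "order_status": row["order_status"], **meta},
    }


source_row = {"customer_id": "C-001", "customer_name": "Ada",
              "order_id": "O-100", "order_status": "shipped"}
rows = split_source_row(source_row, "shop", datetime(2024, 1, 1))
```

Each resulting row would then be loaded into its own table, with hubs and links deduplicated on their hash keys and satellites appended only when the attributes change.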
Wrapping up
There are three main Data Vault components: hubs, links and satellites.
Hubs represent business concepts, links connect hubs, and satellites contain descriptive and historical information about the hubs and links.
Infinite Lambda’s Data Vault experts have cutting-edge expertise based on extensive project experience. Get in touch to talk about the specifics of your case and take your enterprise data warehouse to the next level.