Introducing the data kitchen
Data contracts have been steadily gaining in popularity within the data community, providing a structured approach to manage and regulate the complex relationships between data producers and consumers. There are plenty of great articles out there explaining the technical aspects of data contracts, yet it remains a challenge for the non-technical audience to understand their role within the data platform. That is where this article comes in.
By following an analogy with an actual service contract between a fresh produce supplier and a restaurant, we will explain the key ideas behind data contracts in simple terms to help you understand their benefits and pave the way for further technical reading on how to implement them in your own work.
Let's tuck in.
A much needed legal framework for busy kitchens
Consider a farm supplying fruit, veggies and meat to an increasing number of restaurants. At first, a single relationship is easy enough to maintain. But as more clients come on board, each using the produce in different ways (think a sushi restaurant vs a kebab shop), and eventually many suppliers ship to various kitchens, the farm reaches a point where it needs to manage and regulate the growing complexity to ensure kitchens get exactly what they need to keep delighting their hungry patrons.
Data producers and consumers have a similar relationship of growing complexity that they need to normalise and regulate. This is what data contracts were originally designed to help with.
A typical supply chain service contract negotiation would result in a producer and their customer agreeing on:
- Quality, freshness, ripeness: a definition of standards and targets to be met, such as “no more than x% of produce failing to meet the standard”;
- Shipment contents: what the shipment should contain (meat, fruit, veggies) and their origin;
- Quantity & timeliness: the volume of produce to be delivered and the frequency of deliveries (daily, weekly);
- Procedures to check quality: sampling or checking every item of food;
- Handling of damaged products: what to do with items that do not pass the quality checks;
- Compliance and certification: describe processes for handling allergens or toxic products (think puffer fish);
- Dispute resolution and penalties in case of a breach of contract.
For each of the points above, we can easily draw a comparison to data contracts. Data contracts:
- Define data quality, completeness and freshness expectations (e.g. no more than x% of nulls or missing data, batch should provide data records that are no older than x days or hours);
- Define the schema (field names, data types), semantics (field definition, relationships, business rules, data being PII) and origin (contact details) of the data delivered by the producer;
- Define the expected volume and frequency of data to be received by the consumer;
- Can describe the tests to perform on the incoming data (count number of nulls, calculate average value based on a sample);
- Describe the tooling used to handle records in breach of the contract (sending to dead letter queue, diverting to a specific storage container for later inspection) and notify the responsible parties through monitoring and alerting mechanisms;
- Can be used to indicate which fields contain sensitive data, such as PII, and specify encryption requirements;
- Can describe the courses of action to take depending on failure rates, from notifying the team responsible for the data producer to stopping the producer entirely in extreme cases.
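To make the list above more concrete, here is a minimal sketch of how such a contract could be written down in Python. It is an illustration only: the class names, the field names, the example dataset and the thresholds are all hypothetical, and real contracts are often stored as YAML or JSON rather than code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FieldSpec:
    """One field of the schema (hypothetical structure)."""
    name: str
    dtype: type
    description: str = ""      # semantics: what the field means
    is_pii: bool = False       # sensitive data that may need encryption
    nullable: bool = True


@dataclass(frozen=True)
class DataContract:
    """The clauses of the contract (hypothetical structure)."""
    name: str
    version: int
    owner: str                              # origin: who to contact
    fields: Tuple[FieldSpec, ...]           # schema
    max_null_fraction: float                # quality: "no more than x% of nulls"
    max_age_hours: int                      # freshness
    expected_rows_per_batch: Tuple[int, int]  # volume: (min, max)
    delivery_frequency: str                 # e.g. "daily"

    def pii_fields(self) -> List[str]:
        """Names of the fields flagged as sensitive."""
        return [f.name for f in self.fields if f.is_pii]


# Example contract for an imaginary "orders" dataset.
orders_contract = DataContract(
    name="orders",
    version=1,
    owner="orders-team@example.com",
    fields=(
        FieldSpec("order_id", str, "Unique order identifier", nullable=False),
        FieldSpec("customer_email", str, "Contact email", is_pii=True),
        FieldSpec("amount", float, "Order total"),
    ),
    max_null_fraction=0.05,
    max_age_hours=24,
    expected_rows_per_batch=(100, 10_000),
    delivery_frequency="daily",
)

print(orders_contract.pii_fields())
```

Keeping all clauses in one object means both producer and consumer can read the same definition, much like both parties holding a copy of the signed supply agreement.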
The Michelin inspector: enforcing the contract
In a professional kitchen, line cooks are in charge of inspecting incoming ingredients to spot potential quality issues so that only food items of the required quality make it down the chain to the chefs.
If a steak or a piece of fish is found defective, it is disposed of and the incident is logged and reported to the supplier, who may have to issue a discount on future orders if the incident caused quality levels to drop below the contractual agreement between the restaurant and supplier.
The same process applies in the world of data, except that data is inspected through automated tests within the data pipeline, ETL or streaming framework moving the data from source to destination.
And just as some ingredients, such as rice or peas, can only be sampled rather than checked individually, data contract enforcement can rely on checking a sample of the data rather than every record. Packages like great_expectations help here by validating batches of data against declared expectations, while Infinite Lambda’s dq_tools takes it a step further to provide a high-level overview of key quality metrics and trends over time.
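As a rough illustration of the inspection step described above, the sketch below checks every record of a batch in plain Python, sets aside the defective ones and reports whether the batch as a whole meets the agreed threshold. It does not use great_expectations or dq_tools; the function name, field names and threshold are assumptions made for the example.

```python
def validate_batch(records, required_fields, max_null_fraction):
    """Split a batch into valid records and a dead-letter list.

    A record is treated as defective when any required field is missing
    or None. The batch passes when the share of defective records does
    not exceed max_null_fraction.
    """
    valid, dead_letter = [], []
    for record in records:
        missing = [f for f in required_fields if record.get(f) is None]
        if missing:
            # Keep the reason alongside the record for later inspection.
            dead_letter.append({"record": record, "missing": missing})
        else:
            valid.append(record)

    null_fraction = len(dead_letter) / len(records) if records else 0.0
    return {
        "valid": valid,
        "dead_letter": dead_letter,
        "null_fraction": null_fraction,
        "passed": null_fraction <= max_null_fraction,
    }


batch = [
    {"order_id": "A1", "amount": 12.5},
    {"order_id": None, "amount": 7.0},
    {"order_id": "A3", "amount": 3.2},
    {"order_id": "A4", "amount": None},
]
result = validate_batch(batch, ["order_id", "amount"], max_null_fraction=0.25)
print(result["null_fraction"], result["passed"])
```

For very large batches, the same function could be run on a random subset, for instance one drawn with `random.sample(records, k)` from the standard library, which mirrors tasting a spoonful of rice instead of every grain.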
Going further, where the line cook would fetch a document detailing quality expectations for the incoming shipment from a specified folder or shelf, data pipelines would fetch the data contract that applies to the corresponding batch from a central repository, such as the Confluent Schema Registry.
Finally, if incidents keep happening and penalties start piling up, the restaurant might terminate the contract, stop receiving food from the supplier and start sourcing ingredients from an alternative supplier in order to continue serving dishes to customers.
In the data world, termination of data contracts due to excessive failure rates can similarly result in the pipeline being stopped entirely. Depending on the failure's root cause, the data contract might be renegotiated to stipulate more permissive clauses, or it may be necessary to switch to a different data producer that delivers data from an alternative source.
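The escalation path described above, from doing nothing to notifying the producer to stopping the pipeline, can be sketched as a simple threshold rule. The threshold values and names below are hypothetical; in practice they would be negotiated and written into the contract.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    NOTIFY_PRODUCER = "notify_producer"
    STOP_PIPELINE = "stop_pipeline"


def decide_action(failure_rate, notify_above=0.01, stop_above=0.20):
    """Map a batch failure rate (0.0 to 1.0) to a course of action.

    notify_above and stop_above are illustrative defaults, not
    recommended values.
    """
    if failure_rate > stop_above:
        return Action.STOP_PIPELINE
    if failure_rate > notify_above:
        return Action.NOTIFY_PRODUCER
    return Action.CONTINUE


for rate in (0.0, 0.05, 0.5):
    print(rate, decide_action(rate).value)
```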
Data contract best practices: keep recipes flexible and plan for menu changes
Flexibility and substitutions
Next, we head to the kitchen of an Italian restaurant and look at the contract with the vegetable supplier. If the contract is overly specific about the exact type of ingredients expected in the shipment, issues can arise when the availability of certain key ingredients varies seasonally.
For instance, the contract may name a particular type of mushroom to go in their risotto, say porcini. However, if porcini mushrooms become unavailable due to a regional shortage, the restaurant should be able to use an assortment of wild mushrooms without compromising on taste for their signature dish.
In the world of data contracts, it can also be tempting to be strict and define very specific contracts to ensure optimal quality. Yet this can cause brittleness and excessive error rates, so it is usually better to allow for some flexibility in your contracts, avoiding, for instance, strict limits on text field length or precise ranges for numeric values unless absolutely necessary.
Such flexibility can help better accommodate changes in business operations, user behaviours or data sources that could impact the nature of your data.
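To illustrate the difference, here is a small sketch in which length and range limits are optional clauses rather than defaults. The function and its parameters are made up for this example.

```python
def check_field(value, dtype, max_length=None, value_range=None):
    """Check a single value against a field definition.

    Only the type is always enforced. max_length and value_range are
    optional clauses, applied only when the contract specifies them.
    """
    if not isinstance(value, dtype):
        return False
    if max_length is not None and len(value) > max_length:
        return False
    if value_range is not None:
        low, high = value_range
        if not (low <= value <= high):
            return False
    return True


comment = "x" * 300  # a longer-than-usual free-text comment

# A strict contract rejects the record because of an arbitrary length limit.
strict_ok = check_field(comment, str, max_length=255)

# A flexible contract only insists on the type and accepts it.
flexible_ok = check_field(comment, str)

print(strict_ok, flexible_ok)
```

The strict version would start failing the day users write longer comments, even though nothing is actually wrong with the data.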
Contract evolution & backward compatible changes
Now consider a scenario where a restaurant decides to expand their menu to cater to different tastes. They need to be able to update the contract fairly easily to start receiving new ingredients, whilst still receiving the original list of ingredients to cook the dishes on the core menu.
Data contracts allow for such evolution through versioning: if version 2 of a contract contains additional fields on top of the set of fields from version 1, the changes do not break the original data contract. It is good practice to avoid introducing changes that break backward compatibility. As a rule of thumb, adding a new field is fine, removing one is not.
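The rule of thumb can be expressed as a small check between two versions of a schema. This is a simplified sketch with schemas represented as plain dictionaries of field name to type name; it also treats a changed field type as a breaking change, which is an assumption added here and not something the rule above spells out.

```python
def is_backward_compatible(old_schema, new_schema):
    """Compare two schema versions (dicts of field name -> type name).

    Returns (ok, problems). Added fields are allowed; removed fields
    and changed types are reported as problems.
    """
    problems = []
    for name, dtype in old_schema.items():
        if name not in new_schema:
            problems.append(f"removed field: {name}")
        elif new_schema[name] != dtype:
            problems.append(
                f"changed type of {name}: {dtype} -> {new_schema[name]}"
            )
    return (not problems, problems)


v1 = {"order_id": "string", "amount": "double"}
v2 = {"order_id": "string", "amount": "double", "currency": "string"}
v3 = {"order_id": "string"}

print(is_backward_compatible(v1, v2))  # adding "currency" is fine
print(is_backward_compatible(v1, v3))  # dropping "amount" is not
```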
Avoid excessive packaging/nesting
Back to the line cook inspecting incoming ingredients. For them to complete their inspection in reasonable time, ingredients should be easy to see and reach after opening the main package. Could you imagine how much more painful that job would be if they had to open boxes within the main box and more boxes within these? The inspection would take far more time and resources than it is reasonable to spend on the task.
The same goes for data contracts: nested schemas with, for instance, arrays of objects that themselves contain objects are computationally expensive to unpack in order to validate whether the data respects the contract. So keep the schemas simple to reduce compute costs and move the nested fields into a separate schema if possible.
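As an illustration of moving nested fields into a separate schema, the sketch below splits one nested order into a flat order row and a list of flat line-item rows, each of which could then be covered by its own simple contract. The record layout and field names are hypothetical.

```python
def split_order(order):
    """Split a nested order into one flat order row and flat item rows.

    Each item row carries the parent's order_id so the two flat
    datasets can still be joined later.
    """
    order_row = {k: v for k, v in order.items() if k != "items"}
    item_rows = [
        {"order_id": order["order_id"], **item}
        for item in order.get("items", [])
    ]
    return order_row, item_rows


nested = {
    "order_id": "A1",
    "customer": "c-42",
    "items": [
        {"sku": "porcini", "qty": 2},
        {"sku": "rice", "qty": 1},
    ],
}

order_row, item_rows = split_order(nested)
print(order_row)
print(item_rows)
```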
Key takeaways
I would say the analogy between kitchens and data teams works surprisingly well: they both benefit from clear and reliable contracts that define the expected quality standards and procedures to follow when things do not go according to plan.
Just like in the food business, contracts in the modern data platform help hold producers accountable for quality and timeliness standards. Ingredients need to be of top quality to delight customers, yet seasonal variation and menu changes call for contracts that are flexible enough to evolve.
As data teams grow even further, data contracts may become a permanent and fundamental fixture of the modern data stack alongside data catalogs, orchestrators, ingestion tools and warehouses. Perhaps we are not so far off needing data lawyers specialised in drafting and reviewing these precious contracts.
But for now, understanding what data contracts are, how they benefit the data stack and where to introduce them in your data platform is what is most important. I hope that you have found this analogy helpful and it has given you food for thought.
Bon appétit.
Read up:
If you are looking to explore data contracts from a technical perspective, we recommend:
- Chad Sanderson's seminal article “The Rise of Data Contracts”;
- Maggie Hays’s article “The What, Why, and How of Data Contracts” for a simple introduction to key aspects of data contracts;
- Atlan’s comprehensive overview “Data Contracts: The Key to Scaling Distributed Data Architecture and Reducing Data Chaos” for a more in-depth introduction to core concepts.