
Towards Greater Composability in Data Platforms

Nina Anderson
June 15, 2023

This post is based on a talk of the same name that the author delivered at Data Mash #7 – London Edition in March 2023.

If you have been working in the data industry over the last couple of years, you will have noticed an explosion in available tooling that is allowing us to build ever more complex, composable data platforms. We have increasingly embraced a code-first approach and, through a combination of imperative and declarative options, we now have more power at our fingertips than ever before.

While this has paved the way for much innovation over the past few years, I argue that this new breed of architecture gives rise to an opportunity we have not yet capitalised on.

Let's start by defining the pivotal word, ‘composability’:

Composability is “the ability of different components to be easily integrated or combined with each other to form a cohesive and functional platform”.

If you pay attention, you will notice this theme reappearing more and more often. In this article, I am going to trace it from the first time I was introduced to it to the present and take a look at where I think (and hope) it is all going.

Beginnings: templated analytic logic

I was first introduced to the modern data world when I took a job at a fintech startup back in 2019. We intrepidly decided to build a data platform, and with that came my first introduction to what I'll call reusable analytic logic. We were using a BI tool called Looker to develop various dashboards and reporting, such as 360° views of the customer and so on. In support of this, we would develop dimensions and measures, Looker’s atomic units of meaning, using its proprietary language, LookML.

It felt new and exciting to have these pieces of reusable logic, such as the definition of a customer, or a high-risk transaction, that were shared across the platform, and that others could use however they wanted, building incrementally on top to define new concepts, repurposing (for better or worse), filtering, etc.
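To make the idea concrete, here is a minimal sketch in Python rather than LookML. It is not how Looker works internally; the `Transaction` fields, the thresholds and the helper names are all hypothetical. It only shows shared, named definitions being combined into a new concept.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


# Hypothetical record type; the field names are illustrative only.
@dataclass(frozen=True)
class Transaction:
    customer_id: str
    amount: float
    country: str


Predicate = Callable[[Transaction], bool]


def both(a: Predicate, b: Predicate) -> Predicate:
    """Combine two reusable definitions into a new one."""
    return lambda t: a(t) and b(t)


# Shared, named definitions (the threshold and home country are made up).
def is_large(t: Transaction) -> bool:
    return t.amount >= 10_000


def is_cross_border(t: Transaction) -> bool:
    return t.country != "GB"


# A new concept built incrementally on top of the shared ones.
is_high_risk = both(is_large, is_cross_border)


def count_where(rows: Iterable[Transaction], predicate: Predicate) -> int:
    """A simple 'measure': how many rows satisfy a definition."""
    return sum(1 for row in rows if predicate(row))


transactions = [
    Transaction("c1", 12_500.0, "FR"),
    Transaction("c2", 40.0, "FR"),
    Transaction("c3", 15_000.0, "GB"),
]
print(count_where(transactions, is_high_risk))
```

Anyone who wants a stricter or looser notion of "high risk" can reuse `is_large` and `is_cross_border` without redefining them, which is the property that made this kind of logic feel so useful.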

It might sound like I am just talking about a semantic layer, but there is a slight distinction here. As David J puts it, “semantic layers provide a mapping between the real world and data.” Templated analytic logic, on the other hand, is a more general concept. It need not be exposed to the world or to data consumers; it can cover the intermediary transformation steps too.

I started to wonder: if we could share definitions of metrics within an organisation, could we push this further and share templated analytic logic across different organisations? While business logic will always differ, we could still speculate that there would be plenty of common ground too.

 

from templated analytic logic to greater composability

A natural evolution: reusable analytic logic across different companies

Turns out I was far from the only person wondering about this. The growing analytics engineering community was expanding on the concept of templated analytic logic and beginning to share its benefits across a nascent global network.

There were Looker Blocks, swiftly followed by the rise of dbt, a tool with transformations-as-code at its core, and its associated open source package hub. Finally, one person could figure out how to turn Google Analytics data into a usable format, and we could all benefit from it.

Yet dbt, and by extension the package hub, lets us template not just semantic logic but also more utilitarian, less domain-specific concepts. This is not a new idea in software engineering, but analytics engineers began learning how to apply it to data.

Crucially, it is not all or nothing: any modifications are possible and a package can act as a launch-pad without being restrictive. If I am building a marketing attribution model, instead of starting from a blank slate, I can start from someone else’s v1. That ability to choose precisely what you need and customise the rest is part of the power of it being expressed in code.
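The "launch-pad" idea can be sketched as a set of defaults that ships with a package and that you override selectively. This is a Python illustration of the principle, not dbt's actual mechanism; `AttributionConfig`, its fields and `credited_channel` are hypothetical names invented for the example.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


# Hypothetical settings that someone else's v1 package might ship with.
@dataclass(frozen=True)
class AttributionConfig:
    model: str = "last_touch"
    channels: Tuple[str, ...] = ("email", "paid_search", "organic")


PACKAGE_DEFAULTS = AttributionConfig()

# Start from the packaged defaults and change only what you need.
my_config = replace(PACKAGE_DEFAULTS, model="first_touch")


def credited_channel(touches: List[str], config: AttributionConfig) -> Optional[str]:
    """Return the channel that gets credit for a conversion, if any."""
    eligible = [c for c in touches if c in config.channels]
    if not eligible:
        return None
    # first_touch credits the earliest eligible touch, anything else the latest.
    return eligible[0] if config.model == "first_touch" else eligible[-1]


journey = ["email", "referral", "organic"]
print(credited_channel(journey, PACKAGE_DEFAULTS))
print(credited_channel(journey, my_config))
```

The packaged logic stays untouched; only the one setting that differs for my business is overridden.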

The commercial benefit to using templated transformations is clear: it can help you piece together a data platform from tessellating blocks rather than from raw materials, thereby increasing data team productivity and advancing collective knowledge on common foundations. But I think the impact can go a lot further than that.

An application of templated analytic logic

Last year I decided to explore whether we could apply this approach to the domain of sustainability. It felt like an urgent topic, but one that businesses, especially small ones, typically do not have the time or resources to focus on. I have laid out the reasoning in a previous post you can refer to.

Inspired by the possibilities introduced by innovations such as the dbt package hub, together with colleagues, I attempted to codify the analytic logic of corporate carbon emissions.

The idea was to help organisations start creating basic emissions metrics by providing them with pre-packaged code that had been verified to match international standards for estimating emissions. Our reasoning was that doing this from scratch is too painful for most, but people might try if a template existed.

But the dream was also to be able to benchmark emissions across different organisations. A global semantic layer if you will. So rather than just allowing different companies to follow a templated approach, this would actually mean making their metrics available in a uniform way so that the public could easily understand and, more importantly, compare consistently.

During this project, it was hard to define the right level of abstraction: specific enough to be useful but general enough to apply to a wide range of companies. We have had some early successes, including our project with the World Health Organization, but there is much more to do, and getting that level of abstraction right remains one of the most interesting open problems.

I am still hopeful that providing people with the right-size building blocks – making things more composable while retaining flexibility – will help us solve problems like quantifying corporate environmental impact, and stop solving the same analytical problems over and over again.

But it is not just analytic logic that can benefit from these composable, reusable blocks; it is also infrastructure itself. As data platforms become more complex and ambitious, data teams are solving new problems that are not only analytical but also relate to the platforms and environments in which we create and collaborate. This is where I see the opportunity that has been least capitalised on so far, and it is coming into focus as the line between software work and data work becomes increasingly blurred.

 

From templated analytic logic to templated infrastructure for greater composability

From templated analytic logic to templated infrastructure and more composability

Complex and distributed architectures are becoming more commonplace in data platforms. For example, data mesh is something we are hearing about more and more, even if we are not seeing much of it (yet).

Cloud technologies are also allowing for more complexity and more collaboration across different organisations and teams, with a true explosion of data volumes and objects. That is great, but I believe we now need new ways of managing this increased complexity, as DDL scripts saved in a repo and copy-pasted will not cut it any more.

Colleagues of mine recently worked with the Francis Crick Institute, a biomedical research organisation, on a data transformation project. The Crick funds a multitude of programmes around tropical disease, drug development and other medical innovation. The success of their work and the research teams that they fund depends on many researchers across the globe being able to easily share and access each other’s experimental data.

We used the data warehousing platform Snowflake and proposed a modular architecture where each researcher or team gets their own area in the platform, and where roles, permissions, databases, schemas and resource monitors are synced with budgets. (If you know the concept of data mesh, this modular, domain-driven approach will probably sound familiar.)

The vision was that you could just develop the pattern once, in code, for these research environments and use that as the blueprint for new initiatives of any shape and size. A sort of recipe for a workspace.
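A "recipe for a workspace" could look something like the sketch below. It is not the code from the Crick project: the `Workspace` fields, the naming convention and the exact set of statements are assumptions made for illustration. It only shows how one pattern, written once, can render the Snowflake objects for any number of teams.

```python
from dataclasses import dataclass
from typing import List, Tuple


# Hypothetical blueprint inputs: one instance per researcher or team.
@dataclass(frozen=True)
class Workspace:
    team: str
    schemas: Tuple[str, ...]
    credit_quota: int  # assumed to be derived from the team's budget


def render_workspace(ws: Workspace) -> List[str]:
    """Render illustrative Snowflake statements for one workspace."""
    name = ws.team.upper()
    # Naming convention is made up for this example.
    db, role, monitor = f"{name}_DB", f"{name}_ROLE", f"{name}_MONITOR"
    statements = [
        f"CREATE DATABASE IF NOT EXISTS {db};",
        f"CREATE ROLE IF NOT EXISTS {role};",
        f"GRANT USAGE ON DATABASE {db} TO ROLE {role};",
        f"CREATE RESOURCE MONITOR {monitor} WITH CREDIT_QUOTA = {ws.credit_quota};",
    ]
    for schema in ws.schemas:
        qualified = f"{db}.{schema.upper()}"
        statements.append(f"CREATE SCHEMA IF NOT EXISTS {qualified};")
        statements.append(f"GRANT USAGE ON SCHEMA {qualified} TO ROLE {role};")
    return statements


# The same recipe serves teams of any shape and size.
genomics = Workspace(team="genomics", schemas=("raw", "shared"), credit_quota=100)
for statement in render_workspace(genomics):
    print(statement)
```

In practice the rendered statements would be applied by whatever deployment tooling the platform team already uses; the point here is that a new environment becomes a few lines of configuration rather than a hand-written script.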

Then, not only could you roll out at scale, you could also get better insight into how things were being used through usage tracking, and integrate with IT management systems to allow individuals to request new research environments in an automated workflow, with no more waiting for a platform team to set things up for you. Read the full story in the white paper we put together in collaboration with Snowflake.

This blueprint has since been used for a COVID project among others. The most impressive part for me is that in theory other organisations could use a pattern like this too because it is templated.

Now you might be thinking: infrastructure as code (IaC) has been around for a while. You would be right, but it is still focused on the backend and it often stays in the hands of DevOps teams and separate tooling. There is a degree of separation that means that data teams are still not fully benefiting from the IaC way of thinking.

I also do not feel that composable, templated thinking has been fully exploited in IaC yet, although some companies are exploring it. If you will allow a heavy-handed metaphor, IaC is like cooking, and the templates, such as the one we developed for the Crick Institute, are the recipes. Recipes that, crucially, analytics engineers are comfortable using. And I think this is where a treasure trove of value is yet to be discovered.

It works on a smaller scale too. I can now go online and find a code package for integrating 8 different marketing data sources into a unified model. Why can’t I find the same for a data development environment in a particular data warehouse, or for an entire data product as a template? I wish there were a marketplace of such recipes.

This kind of composable approach is already evident in some exciting products out there. Shipyard are using this approach for integration/data transfer workflows. For data ingestion, Stitch and Singer have been leveraging the value of composability for a while. We need something in the same vein for data environments.

Preparing for the future

We have an amazing range of flexible tooling now, but we will need to get even better at composability to address challenges and prepare for the opportunities that are coming.

We have started to do it for analytic code, but not so much for infrastructure. Composability ultimately democratises not access to data but access to building data tooling. That should be really exciting, and it is achievable through a templated IaC approach (recipes). In the longer term, I can see a more sophisticated version of this emerging where people do not need to know IaC and can just plug and play.

I am excited about a world where these recipes enable faster collaboration, where we spend less time figuring out what the product/marketing team’s data environment should look like in dev and prod, or figuring out how medical researchers can collaborate and share data securely with one another in the cloud.

I want to see a world where people share those solutions after they have solved them, and where we have the platforms and frameworks for them to do so.

If this concept of composability resonates with you and you have been solving similar problems, reach out. Nina Anderson and the Infinite Lambda team would be thrilled to hear from you.
