I have been doing Data Engineering in some shape or form for the past 10 years.
I have failed, many times. I have also succeeded a few times. The good news is that, more recently, I’ve mostly been succeeding. I attribute this increase in success rate to two things: I work with good people who know their s**t, and I’ve acquired some experience in what it takes to win with Data Engineering. Here are 5 tips:
Tip #1: Never release before you validate data
Typically a company’s ‘data maturity’ evolves like this: in the beginning, data consumers go to various systems to obtain a report. E.g. a digital marketing specialist will go to Facebook Ad Center or a CRM tool; customer support supervisors and analysts will go to Zendesk dashboards; financial analysts will go to a plethora of systems to pull Excel files for analysis; and so on.
At some point, it all becomes a bit too much. There are so many different systems people need to go to in order to get information. It’s too laborious. Furthermore, some (the clever ones) start thinking about combining data to do something more sophisticated. E.g. what does the app usage behaviour look like for customers who raise a high number of customer support tickets? If we figure this out, maybe we can tweak the app to fix a problem and reduce the burden on support reps.
At that point, the data engineers get to work. Connectors get built, a data warehouse gets deployed and various tables that contain data from the aforementioned systems are made available to data consumers. ‘Here, data consumer, you can find all the data you so laboriously worked to obtain from all these systems. All automated, joined up and ready for analysis in one single place!’
In the words of Donald Trump: “WRONG!” As in, your data is wrong. Or at least, it doesn’t look like what the source system shows. Why? Because source systems rarely present ‘raw’ data. They apply filters, hide operational details, deduplicate, clean, and so on. And, helpfully, when you ask them for the ‘raw’ data, they give you everything: no filters, no cleaning, just pure data. It’s up to you to turn this into information.
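To make ‘validate before you release’ concrete, here is a minimal sketch of a reconciliation check, assuming a Zendesk-style ticket export with made-up table and column names (raw_zendesk_tickets, is_deleted, status): re-apply the filtering and deduplication the source dashboard does, then compare the count with the number the dashboard itself shows for the same period.

```sql
-- Minimal pre-release sanity check (all names illustrative):
-- re-apply the source dashboard's filters and deduplication, then compare
-- the resulting count with what the dashboard reports for the same period.
WITH cleaned AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY ticket_id
            ORDER BY updated_at DESC
        ) AS row_num
    FROM raw_zendesk_tickets
    WHERE is_deleted = FALSE     -- dashboards typically hide soft-deleted rows
      AND status <> 'spam'       -- ...and spam
)
SELECT COUNT(*) AS warehouse_ticket_count
FROM cleaned
WHERE row_num = 1;               -- keep only the latest version of each ticket
```

If the number doesn’t match the source system, don’t ship the table yet.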
Exacerbating the problem further, source systems don’t much care about other source systems. If you want to join all of a customer’s support tickets to their transaction data, you have to rely on a unique ID for that customer. What if that is an email address? Well, email addresses change. The moment you start joining the two data sets on email address, you might be chopping the data in half. "Why, data engineers, do I only see half the transactions of that user?" – a data consumer might ask. "We don’t know, probably because the email address changed over time; we need to look into it," we would answer. Had we validated these data joins properly, we would have spotted the problem early on, and perhaps done something to solve it, such as creating a surrogate mapping table that links email addresses (which can change over time) to canonical customer IDs (which do not).
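A rough sketch of that mapping-table fix, with hypothetical names (customer_email_map, support_tickets, transactions): both data sets resolve whatever email address they carry to a canonical customer ID before joining.

```sql
-- Hypothetical mapping table: every email address ever seen for a customer
-- points at one canonical customer_id that never changes.
CREATE TABLE customer_email_map (
    email       VARCHAR   NOT NULL PRIMARY KEY,
    customer_id BIGINT    NOT NULL,
    valid_from  TIMESTAMP NOT NULL
);

-- Join tickets to transactions via the canonical ID rather than the raw email,
-- so a changed address no longer chops the data in half.
SELECT
    tm.customer_id,
    t.ticket_id,
    x.transaction_id
FROM support_tickets    t
JOIN customer_email_map tm ON tm.email = t.requester_email
JOIN customer_email_map xm ON xm.customer_id = tm.customer_id
JOIN transactions       x  ON x.customer_email = xm.email;
```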
Some such data consumers will tolerate errors and bugs for a while, but most will just give up on you after one or two hiccups and go back to using their trusted, laborious manual process of using the source systems directly. To data engineers, this is commonly known as ‘failing’.
Tip #2: Become important enough that others push to you
If you always have to pull, you are always playing catch up and are always a second-class citizen. As a nasty side-effect, the whole company becomes less data-driven.
What is Pull? Pull means you have to go and fetch the data, hoping it a) looks like what you expect it to look like and b) is sitting there waiting for you. Pull means you have very little control and rely on processes ‘upstream’ to define what data should look like and what it will be used for. If the powers that be decide to change data schemas, your processes break. If data doesn’t get captured correctly, you don’t find out until you try to pull it, and by that time it’s usually impossible to recover what was lost. Furthermore, pull usually means batch, which usually means delay, which means you are missing out on a lot of good use cases for real-time analytics like anomaly detection, disaster/fraud prevention, adaptive content serving, customer retention, etc.
So what’s the alternative? Push, obviously. Designing a data platform to be an integral part (or even the backbone) of how information flows within an organisation is paramount. While core transactional processes will always be king - you can’t afford to miss a customer order, for example - aim to architect for the same datum to be sent down the data stream simultaneously, so that the various data warehousing, analytics and ML processes can take place as the data arrives.
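One way to make the ‘push’ side concrete without touching the transactional path is a transactional outbox: the operational service writes the business record and a corresponding event row in the same transaction, and a CDC or streaming process forwards the events to the data platform as they are committed. The sketch below assumes Postgres and invented table names; it’s one option, not the only way to architect this.

```sql
-- Transactional outbox sketch (assumes Postgres; table names are invented).
-- The order and its event commit together, so the data stream never misses a
-- datum, and downstream analytics/ML consume order_events in near real time.
BEGIN;

INSERT INTO orders (order_id, customer_id, amount, created_at)
VALUES (1001, 42, 59.90, now());

INSERT INTO order_events (event_id, event_type, payload, created_at)
VALUES (
    gen_random_uuid(),   -- built in from Postgres 13 onwards
    'order_created',
    '{"order_id": 1001, "customer_id": 42, "amount": 59.90}'::jsonb,
    now()
);

COMMIT;
```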
Tip #3: Don’t model the world
In more data-mature environments, we tend to see that there is ‘too much data’ to know what to do with. Yet our natural drive is to make all this data available for analytics, usually by modelling it and applying some business logic on top. Huge data warehouses with multiple layers and strict MDM (Master Data Management) rules emerge, or perhaps complex Knowledge Graphs in RDF that try to piece it all together. You are now in modelling purgatory, where you are modelling data for the sake of modelling data. (I know, I’ve done it too many times. It’s weirdly fun working on the most comprehensive data model. Sadly, it’s also the biggest time black hole, with very little productivity to show for it.)
What should you do instead? Remember that software got a lot better once we started doing this agile thing. Why not apply it to data? Focus on the most obvious wins with data analytics and model for those. E.g. make it easy to serve BI tools with the principal facts and dimensions for core organisational KPIs, or enable analysts and data scientists to extract information with a minimal number of table joins. Pick a data warehousing technology that allows you to change your underlying data model quickly and just crack on. Yes, it will change over time. Yes, you will break someone’s work eventually (and you will fix it). What you won’t be is stuck in modelling purgatory for months, failing to deliver the one thing you are responsible for delivering: information.
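To illustrate what ‘model for the obvious wins’ can look like, here is a deliberately minimal star schema sketch; every table and column name is made up for the example, and your real KPIs will dictate the actual shape.

```sql
-- Just enough modelling to answer today's KPI questions (illustrative names).
CREATE TABLE dim_customer (
    customer_id BIGINT PRIMARY KEY,
    country     VARCHAR,
    signup_date DATE
);

CREATE TABLE fct_orders (
    order_id    BIGINT PRIMARY KEY,
    customer_id BIGINT REFERENCES dim_customer (customer_id),
    order_date  DATE,
    revenue     NUMERIC(12, 2)
);

-- Analysts get monthly revenue by country with a single join:
SELECT
    DATE_TRUNC('month', f.order_date) AS month,
    c.country,
    SUM(f.revenue)                    AS revenue
FROM fct_orders   f
JOIN dim_customer c ON c.customer_id = f.customer_id
GROUP BY 1, 2;
```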
Tip #4: Do test-driven
Speaking of borrowing best practices from the world of Software Engineering: make sure you are using a test-driven approach. Here at Infinite Lambda we quite like technologies like DBT and DataForm because they help us build tests and sanity checks into our data modelling processes. This way, if a change in data schemas, formats or transformations breaks something, we’ll know immediately.
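To make that tangible: in dbt, a ‘singular’ test is just a SQL file under tests/ that selects the rows violating an assumption; if the query returns anything, the test fails and the run flags it. The model and column names below are made up for illustration.

```sql
-- tests/assert_no_orphan_tickets.sql (dbt singular test; names illustrative).
-- If any ticket references a customer that doesn't exist in the customers
-- model, this query returns rows and dbt marks the test as failed.
SELECT t.ticket_id
FROM {{ ref('support_tickets') }} AS t
LEFT JOIN {{ ref('customers') }}  AS c
       ON c.customer_id = t.customer_id
WHERE c.customer_id IS NULL;
```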
Tip #5: Bring everyone to the party
It is unfair to expect a Data Analyst or Data Scientist to know much about Software Engineering. Some will know Python or Spark, but more as a utility for hacking together a simple(-ish) process that gets the job done; they won’t have real in-depth knowledge of the language. Most will just know SQL.
Don’t be selfish. Don’t do data modelling in obscure languages just because it’s more efficient. Nobody cares about how elegant or performant it is. People care about understanding what’s going on. As a result, we are big fans of using SQL to process, transform and model data (hence our love for DBT). Sure, SQL is ugly compared to Python, Scala or Go. Sure, it needs a database technology, or the not-so-performant Presto or Spark SQL, to run. But it does bring everyone to the party. And this matters, because data consumers can now understand how data is processed in the pipeline, help you identify issues and errors in business logic, or self-serve by building their own data processes.
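As a small, made-up example of the kind of transformation everyone at the party can read, here is a hypothetical dbt model computing weekly active users; the source table and the definition of ‘active’ are assumptions an analyst could challenge just by reading the SQL.

```sql
-- models/weekly_active_users.sql (illustrative): plain SQL that an analyst,
-- a data scientist or a support lead can read, question and improve.
SELECT
    DATE_TRUNC('week', event_timestamp) AS week,
    COUNT(DISTINCT user_id)             AS weekly_active_users
FROM {{ ref('app_events') }}
WHERE event_type = 'session_start'
GROUP BY 1;
```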
In Summary: There is no silver bullet for getting data platforms right. They are complex, and they take a lot of cross-departmental buy-in and a lot of experience to build. Ensure that you don't lose the initial momentum (validate before release), that data processes are 'first-class citizens' within the company's tech stack, and that you stay agile and bring people to the party. Good luck!