Why your entire data infrastructure should be in code

Infrastructure as Code (IaC) is a very popular practice in modern DevOps. There are plenty of resources on the internet on why you should do it. In this post, I’ll focus on why you should do it for your data infrastructure.

Contents

  1. Data Governance
  2. Consistency across environments
  3. Maintainability

1. Data Governance

Most medium- to large-scale organisations have several teams that deal with data. Some teams deal mainly with capturing or producing data (e.g. storing transactions, pushing to a Kafka stream, creating nightly ETL processes to put data into a lake or a warehouse). Others deal with consuming that data – BI running transformations and reports; advanced analytics and data science doing exploratory queries and mining for business insights; AI running model training on large data sets or real-time inference, etc.

All of these activities *will* produce a chaotic jungle of disparate bits of data infrastructure if you let them get out of control. Ever seen a database on your cloud console and had no idea who’s using it, even though it’s costing you thousands every month? Or a bunch of SNS topics just lying there with no subscribers? Dozens of cloud storage buckets nobody’s using? Can you delete them? Would it impact anyone?

The best way to avoid this situation is to never, ever add any infrastructure through your cloud console. Sure, it may only take 5 clicks versus writing 50 lines of Terraform code, but here’s the deal: Code gives you lineage. Clicks don’t.

Code gets checked in. It gets commented, documented, versioned, contributed to and peer-reviewed (do that). All this gives it context. It gives it traceability. It serves as the medium that glues together infrastructure components, people, history and business needs.
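To make the lineage point concrete, here is a minimal sketch of the kind of audit that becomes possible once every resource is declared in a repository. Everything in it is an assumption for illustration: the `Declared` schema, the resource IDs and the team names are hypothetical, and in practice the list of deployed resources would come from your cloud provider's inventory rather than a hard-coded list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Declared:
    """A resource as declared in the repository (hypothetical schema)."""
    resource_id: str
    owner: str
    purpose: str


def find_unmanaged(declared: list[Declared], deployed_ids: list[str]) -> list[str]:
    """Return deployed resource IDs that no declaration in code accounts for."""
    known = {d.resource_id for d in declared}
    return sorted(rid for rid in deployed_ids if rid not in known)


# Hypothetical declarations: each one carries an owner and a reason to exist.
declared = [
    Declared("s3://acme-raw-events", owner="ingestion-team", purpose="landing zone"),
    Declared("sns:order-created", owner="orders-team", purpose="order events"),
]

# Assumption: in practice this would be fetched from the cloud account.
deployed_ids = ["s3://acme-raw-events", "sns:order-created", "s3://tmp-export-2019"]

unmanaged = find_unmanaged(declared, deployed_ids)
print(unmanaged)
```

Anything that shows up in `unmanaged` was created outside the codebase, which is exactly the "who is using this and can I delete it?" resource described above.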

Don’t do it through the console. Do it through code.

2. Consistency across environments

Imagine this – you just spent 3 months running a PoC for your new data platform. It works perfectly on your test environment – event-driven, auto-scaling, minimal latency from data generation to data reporting, etc. You are ready – it’s Production time.

How do you make sure you reproduce all of this in your entirely new Prod environment? Did you remember all your Redshift settings? Did you remember to point all your Glue metastore tables to the right S3 buckets (the prod ones, not the test ones)? What’s that Firehose stream doing here, are we putting it into prod?

All these questions become irrelevant when you have IaC for your data platform. Everything lives in one source code repository, and nothing gets ‘left out’. You just press ‘deploy’ on your CI/CD tool for a different environment, your tests run, your CD gets a green light, and voilà – your Production environment is ready.
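As a rough illustration of why nothing gets left out, here is a sketch in plain Python (not Terraform, and not any real provisioning API) of a single platform definition rendered once per environment. The bucket naming convention, resource names and node counts are all hypothetical; the point is only that environment-specific values are derived from one input instead of being remembered by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    """Per-environment settings; everything else is shared."""
    name: str            # e.g. "test" or "prod"
    redshift_nodes: int  # cluster size is allowed to differ per environment


def render_platform(env: Environment) -> dict:
    """Build the full (hypothetical) platform definition for one environment."""
    bucket = f"acme-data-lake-{env.name}"  # hypothetical naming convention
    return {
        "s3_bucket": {"name": bucket},
        "glue_table": {
            "name": "transactions",
            # Derived from the bucket above, so it cannot point at another environment.
            "location": f"s3://{bucket}/transactions/",
        },
        "redshift_cluster": {
            "identifier": f"analytics-{env.name}",
            "number_of_nodes": env.redshift_nodes,
        },
    }


test = render_platform(Environment(name="test", redshift_nodes=2))
prod = render_platform(Environment(name="prod", redshift_nodes=8))
print(prod["glue_table"]["location"])
```

Both environments contain the same set of resources by construction, and the prod table location can only ever point at the prod bucket.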

3. Maintainability

I recently had to ask a SysOps colleague to make changes to a Redshift cluster that must have been created 5 years ago (through the AWS console). I needed to modify the WLM queues so that data scientists could get a bit more of the cluster's resources for their heavier queries, taking some away from BI, who had a very disproportionate share. Here are some of the questions I had to answer (I failed): ‘Who owns this cluster?’, ‘Why was it set up like this in the first place? Are you sure we are OK to modify it? Who can sign off?’, ‘What processes feed data into this cluster? What if there is downtime?’, ‘What if this cluster ID is referenced by something else, and we end up having to restore the cluster from backup – do we need to give it the same ID?’ (there were a few more good questions, I was there for a while). The truth is, nobody could answer these questions – it was an impasse. If the client had had infrastructure as code, we would have been able to answer most of these questions on the spot by opening a git repo and reading through the code, readme and commit history.
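For a feel of how quickly those questions can be answered from code, here is a small sketch. It assumes the repository's declarations have been parsed into a dictionary with `owner` and `depends_on` fields; that structure and every resource name in it are hypothetical, not the output of any particular tool.

```python
def referenced_by(resources: dict[str, dict], target: str) -> list[str]:
    """Names of resources whose 'depends_on' list includes the target."""
    return sorted(
        name for name, spec in resources.items()
        if target in spec.get("depends_on", [])
    )


# Hypothetical declarations, as they might be parsed from a repository.
resources = {
    "redshift.analytics": {"owner": "data-platform"},
    "etl.nightly_load": {"owner": "ingestion-team", "depends_on": ["redshift.analytics"]},
    "bi.sales_dashboard": {"owner": "bi-team", "depends_on": ["redshift.analytics"]},
    "s3.raw_events": {"owner": "ingestion-team"},
}

# 'Who owns this cluster?' and 'What references it?' become lookups.
owner = resources["redshift.analytics"]["owner"]
dependants = referenced_by(resources, "redshift.analytics")
print(owner, dependants)
```

The ‘why was it set up like this?’ question is not answered by the data structure itself; that part comes from the commit history and review comments around it.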

Summary: Get into the habit of writing your data infrastructure as code. It helps you tie things together and keep changes traceable, and it will save you a lot of headaches down the line.

Sales spiel: Want your data infrastructure analysed and re-written as code? We’ll also throw in free advice on how to optimise and reduce cost. Reach out to Infinite Lambda via the contact form below.
