Why Your Entire Data Infrastructure Should Be in Code

Nas Radev
April 14, 2019

Infrastructure as Code (IaC) is a very popular practice in modern DevOps. There are plenty of resources on the internet on why you should do it. In this post, I’ll focus on why you should do it for your data infrastructure.


  1. Data Governance
  2. Consistency across environments
  3. Maintainability

1. Data Governance

Most medium- to large-scale organisations have several teams that deal with data. Some teams mainly capture or produce data (e.g. storing transactions, pushing to a Kafka stream, creating nightly ETL processes to load data into a lake or a warehouse). Others consume that data: BI running transformations and reports; advanced analytics and data science doing exploratory queries and mining for business insights; AI running model training on large data sets or real-time inference. All of these activities *will* produce a chaotic jungle of disparate bits of data infrastructure if you let them get out of control. Ever seen a database on your cloud console and had no idea who was using it, even though it was costing you thousands every month? Or a bunch of SNS topics just lying there with no subscribers? Dozens of cloud storage buckets nobody’s using? Can you delete them? Would it impact anyone?

The best way to avoid this situation is to never, ever add any infrastructure through your cloud console. Sure, it may only take 5 clicks versus writing 50 lines of Terraform code, but here’s the deal: Code gives you lineage. Clicks don’t.

Code gets checked in. It gets commented, documented, versioned, contributed to and peer-reviewed (do that). All this gives it context. It gives it traceability. It serves as the medium that glues together infrastructure components, people, history and business needs.

Don’t do it through the console. Do it through code.
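To make this concrete, here is a minimal sketch of what "doing it through code" can look like in Terraform (the resource name, bucket name and tag values are hypothetical):

```hcl
# Hypothetical example: a data lake bucket defined in Terraform.
# Because this lives in version control, ownership and purpose are
# documented, reviewable and traceable instead of lost in the console.
resource "aws_s3_bucket" "raw_events" {
  bucket = "acme-data-lake-raw-events" # hypothetical name

  tags = {
    Owner     = "data-platform-team"
    Purpose   = "Raw event landing zone for the nightly ETL"
    ManagedBy = "terraform"
  }
}
```

Fifty lines like these may feel slower than five clicks, but every one of them answers the "who owns this and why does it exist?" questions up front.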

2. Consistency across environments

Imagine this – you just spent 3 months running a PoC for your new data platform. It works perfectly on your test environment – event-driven, auto-scaling, minimal latency from data generation to data reporting, etc. You are ready – it’s Production time.

How do you make sure you reproduce all of this in your entirely new Prod environment? Did you remember all your Redshift settings? Did you remember to point all your Glue metastore tables to the right S3 buckets (the prod ones, not the test ones)? What’s that Firehose stream doing here? Are we putting it into prod?

All these questions are irrelevant when you have IaC for your data platform. Everything lives under one source code repository, and nothing gets ‘left out’. You just press ‘deploy’ on your CI/CD tool for a different environment, your tests run, your CD gets a green light, and voila – your Production environment is ready.
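As an illustration, here is a hedged sketch of how a single Terraform definition can be parameterised per environment, so test and prod cannot drift apart (the database, table and bucket names are hypothetical):

```hcl
variable "environment" {
  description = "Deployment environment, e.g. test or prod"
  type        = string
}

# Hypothetical: one Glue table definition deploys to every environment.
# Only the environment-specific parts (database suffix, bucket prefix)
# change, so nothing gets 'left out' when promoting to Production.
resource "aws_glue_catalog_table" "transactions" {
  name          = "transactions"
  database_name = "analytics_${var.environment}"

  storage_descriptor {
    location = "s3://acme-${var.environment}-data-lake/transactions/"
  }
}
```

Deploying to prod then becomes a matter of running the same pipeline with `environment = "prod"` rather than re-clicking everything from memory.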

3. Maintainability

I recently had to ask a SysOps colleague to make changes to a Redshift cluster that must have been created five years ago through the AWS console. I needed to modify the WLM queues so that data scientists could get more resources on the cluster for their heavier queries, taking some away from BI, which had a disproportionate share dedicated to it. Here are some of the questions I was asked (and failed to answer): ‘Who owns this cluster?’, ‘Why was it set up like this in the first place? Are you sure we are OK to modify it? Who can sign off?’, ‘What processes feed data into this cluster? What if there is downtime?’, ‘What if this cluster ID is referenced by something else and we end up having to restore the cluster from backup – do we need to give it the same ID?’ (there were a few more good questions; I was there for a while). The truth is, nobody could answer them – it was an impasse. If the client had had infrastructure as code, we would have been able to answer most of these questions on the spot by opening a git repo and reading through the code, the readme and the commit history.
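For contrast, here is a rough sketch of what that WLM configuration could have looked like had it been captured in Terraform (queue names and memory percentages are hypothetical):

```hcl
# Hypothetical: WLM queues defined in code, so the resource split
# between BI and data science is documented, peer-reviewed and
# adjustable via a pull request instead of an archaeology session.
resource "aws_redshift_parameter_group" "analytics" {
  name   = "analytics-wlm"
  family = "redshift-1.0"

  parameter {
    name = "wlm_json_configuration"
    value = jsonencode([
      { query_group = ["bi"],           memory_percent_to_use = 40 },
      { query_group = ["data-science"], memory_percent_to_use = 60 }
    ])
  }
}
```

With a definition like this in git, rebalancing queues is a small, reviewable diff, and the commit history answers who changed what and why.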

Summary: Get into the habit of writing your data infrastructure as code. It helps you tie things together and track how they evolve, and it will save you a lot of headaches down the line.

Sales spiel: Want your data infrastructure analysed and re-written as code? We’ll also throw in free advice on how to optimise and reduce cost. Reach out to Infinite Lambda via the contact form below.
