...

Why Your Entire Data Infrastructure Should Be in Code

Nas Radev
April 14, 2019
Read: 3 min

Infrastructure as Code (IaC) is a very popular practice in modern DevOps. There are plenty of resources on the internet on why you should do it. In this post, I’ll focus on why you should do it for your data infrastructure.

Contents

  1. Data Governance
  2. Consistency across environments
  3. Maintainability

1. Data Governance

Most medium- to large-scale organisations have several teams that deal with data. Some deal mainly with capturing or producing it: storing transactions, pushing events to a Kafka stream, running nightly ETL processes that load a lake or a warehouse. Others deal with consuming it - BI running transformations and reports; advanced analytics and data science doing exploratory queries and mining for business insights; AI training models on large data sets or running real-time inference.

All of these activities *will* produce a chaotic jungle of disparate bits of data infrastructure if you let them get out of control. Ever seen a database on your cloud console and had no idea who's using it, even though it's costing you thousands every month? Or a bunch of SNS topics just lying there with no subscribers? Dozens of cloud storage buckets nobody's using? Can you delete them? Would it impact anyone?

The best way to avoid this situation is to never, ever add any infrastructure through your cloud console. Sure, it may only take 5 clicks versus writing 50 lines of Terraform code, but here's the deal: Code gives you lineage. Clicks don't.

Code gets checked in. It gets commented, documented, versioned, contributed to and peer-reviewed (do that). All this gives it context. It gives it traceability. It serves as the medium that glues together infrastructure components, people, history and business needs.

Don't do it through the console. Do it through code.
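
Here's a taste of what those 50 lines buy you. A minimal Terraform sketch - the bucket name, tags and team names are hypothetical, not from any real project:

```
# Hypothetical example: the bucket name and tag values are illustrative.
resource "aws_s3_bucket" "events_lake" {
  bucket = "acme-events-lake"

  # Ownership and purpose live right next to the resource itself,
  # checked in, versioned and peer-reviewed with everything else.
  tags = {
    Owner      = "data-platform-team"
    Purpose    = "Raw event landing zone for the nightly ETL"
    CostCentre = "analytics"
  }
}
```

A year from now, `git log` and `git blame` on this file tell you exactly who created that bucket, when, and why - the lineage the console can never give you.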

2. Consistency across environments

Imagine this - you just spent 3 months running a PoC for your new data platform. It works perfectly on your test environment - event-driven, auto-scaling, minimal latency from data generation to data reporting, etc. You are ready - it's Production time.

How do you make sure you reproduce all of this in your entirely new Prod environment? Did you remember all your Redshift settings? Did you remember to point all your Glue metastore tables to the right S3 buckets (the prod ones, not the test ones)? What's that Firehose stream doing here, are we putting it into prod?

All these questions are irrelevant when you have IaC for your data platform. Everything lives under one source code repository, and nothing gets 'left out'. You just press 'deploy' on your CI/CD tool for a different environment, your tests run, your CD gets a green light, and voila - your Production environment is ready.
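
As a sketch of how that works, assume the environment is a single Terraform variable threaded through every resource (all names here are hypothetical):

```
variable "environment" {
  description = "Target environment, e.g. test or prod"
}

# The same code produces the test bucket and the prod bucket;
# only the variable changes between deployments.
resource "aws_s3_bucket" "curated" {
  bucket = "acme-curated-${var.environment}"
}

resource "aws_glue_catalog_database" "analytics" {
  name = "analytics_${var.environment}"
}

# The Glue table can only ever point at the bucket belonging to
# the environment it was deployed into - no test/prod mix-ups.
resource "aws_glue_catalog_table" "events" {
  name          = "events"
  database_name = aws_glue_catalog_database.analytics.name

  storage_descriptor {
    location = "s3://${aws_s3_bucket.curated.bucket}/events/"
  }
}
```

Your CI/CD tool runs the same plan with a different value for the variable, and the buckets, metastore tables and streams all come along for the ride.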

3. Maintainability

I recently had to ask a SysOps colleague to make changes to a Redshift cluster that must have been created 5 years ago, through the AWS console. I needed to modify the WLM queues so that data scientists could get a bit more resource on the cluster for their heavier queries, taking some away from BI, which had a disproportionate share dedicated to it. Here are some of the questions I had to answer (I failed): 'Who owns this cluster?', 'Why was it set up like this in the first place? Are you sure we are OK to modify it? Who can sign off?', 'What processes feed data into this cluster? What if there is downtime?', 'What if this cluster ID is referenced by something else, and we end up having to restore the cluster from backup - do we need to give it the same ID?' (there were a few more good questions; I was there for a while). The truth is, nobody could answer them - it was an impasse. Had the client used infrastructure as code, we could have answered most of these questions on the spot by opening a git repo and reading through the code, the readme and the commit history.
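
For contrast, here is roughly what that WLM change looks like when the cluster's configuration lives in Terraform. A hypothetical sketch - the queue groups and percentages are made up, not the client's actual setup:

```
# Hypothetical sketch: queue groups and memory splits are illustrative.
resource "aws_redshift_parameter_group" "analytics" {
  name   = "analytics-wlm"
  family = "redshift-1.0"

  # WLM queues defined as data: who gets how much memory and
  # concurrency is explicit, versioned and reviewable.
  parameter {
    name = "wlm_json_configuration"
    value = jsonencode([
      { query_group = ["datascience"], query_concurrency = 5, memory_percent_to_use = 40 },
      { query_group = ["bi"], query_concurrency = 10, memory_percent_to_use = 35 },
      { query_concurrency = 5, memory_percent_to_use = 25 }
    ])
  }
}
```

Shifting memory from BI to data science becomes a small diff that goes through peer review, and the commit history answers 'why was it set up like this?' for the next person.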

Summary: Get into the habit of writing your data infrastructure as code. It helps you tie things together and track how everything changes, and it will save you a lot of headaches down the line.

Sales spiel: Want your data infrastructure analysed and re-written as code? We'll also throw in free advice on how to optimise and reduce cost. Reach out to Infinite Lambda via the contact form below.
