- When is your data too big?
- The more cost-effective option – Spectrum & Athena
- The less cost-effective option – cluster scaling
1. When is your data too big?
AWS Redshift is a data warehouse made for big data – up to petabyte-scale datasets. Most users operate at the terabyte scale and will therefore never hit Redshift’s theoretical scaling limits. However, theoretical limits are most likely not what you are concerned about. What matters is the size of your cluster, your budget, and what cluster size it would make the most sense to pre-pay for in order to benefit optimally from reserved instance cost savings.
In this article, I concentrate on what to do when your data is getting too big for the Redshift cluster you have already put in place. I am assuming that you have already applied best practices for storing data in your Redshift cluster – if not, read our guide on how to optimise storage space here.
As a rule of thumb, if your cluster is at over 65% of its storage capacity, your data is getting too big for it.
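The rule of thumb above is easy to turn into a quick sanity check. The sketch below is purely illustrative – the function name is made up, and the 2 TB-per-node figure used in the example is the advertised storage of a ds2.xlarge node (verify against current AWS specs for your node type):

```python
# Illustrative helper: flag a cluster that has crossed the 65% rule of thumb.
# Names and threshold are from the article, not from any AWS API.

def storage_too_full(used_tb: float, capacity_tb: float, threshold: float = 0.65) -> bool:
    """Return True if storage usage exceeds the rule-of-thumb threshold."""
    if capacity_tb <= 0:
        raise ValueError("capacity must be positive")
    return used_tb / capacity_tb > threshold

# e.g. a 4-node ds2.xlarge cluster has roughly 4 x 2 TB = 8 TB of storage
print(storage_too_full(5.6, 8.0))  # 70% used -> True
print(storage_too_full(4.0, 8.0))  # 50% used -> False
```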
2. The more cost-effective option – Spectrum & Athena
Redshift clusters consist of nodes that allow you to store and run queries over your data (you probably knew that). When you add a node to a cluster, it adds more storage capacity and more compute power. There is an inherent assumption here – if you put data in your Redshift cluster, you are going to use the compute to query it. If you are not going to query it, or you are going to query it very rarely, it really should not be on Redshift. It should just be on your data lake. (Barring cases where you need to apply a lot of update/delete type logic on your data as you store it.)
I always advise people to keep as much data on their data lake as possible – storage is really quite cheap, and your data scientists will thank you a lot. Luckily, the good folks at AWS also seem to agree with this strategy, as they have introduced two services that allow you to keep your data on your lake (S3), and still be able to query it when needed – namely AWS Redshift Spectrum and AWS Athena.
Spectrum allows you to create tables in your Redshift that are ‘external’ – the data isn’t really on Redshift, but it can be queried using normal SQL, can be joined with other internal or external data, etc. (but it cannot be updated/deleted directly). Athena allows you to run ‘serverless’ SQL queries on data in your lake – via the Athena console, or via an API call (i.e. outside of a Redshift environment altogether). In both cases, the data sits on S3.
I advise people to consider which data sets are 1) quite big and 2) not really queried very often (up to a dozen times per day max). Leave those data sets in S3, accessing them only via Spectrum (or Athena). Note that you should store the data in a columnar format where possible (e.g. Parquet) so as to ensure optimal speed of access + cost savings. Spectrum charges $5 per terabyte of data scanned, but you can get it to scan just the minimum amount of data by selecting only specific columns.
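To get a feel for the numbers, here is a rough cost sketch using the $5-per-terabyte-scanned figure quoted above. It assumes column pruning reduces the bytes scanned linearly with the fraction of columns read – a simplification, since real savings depend on the width and compression of each column:

```python
# Back-of-the-envelope Spectrum scan-cost estimate, using the $5/TB figure
# from the text. Function names are illustrative, not an AWS API.

PRICE_PER_TB_USD = 5.0

def spectrum_scan_cost(scanned_tb: float) -> float:
    """Cost of a query that scans the given number of terabytes."""
    return scanned_tb * PRICE_PER_TB_USD

def columnar_cost(total_tb: float, columns_read: int, total_columns: int) -> float:
    """Estimated cost when Parquet lets Spectrum read only the selected columns.
    Assumes all columns contribute equally to table size (a simplification)."""
    return spectrum_scan_cost(total_tb * columns_read / total_columns)

# Scanning a 2 TB table in full vs. reading 3 of its 30 columns:
print(spectrum_scan_cost(2.0))    # 10.0 (USD)
print(columnar_cost(2.0, 3, 30))  # 1.0 (USD)
```

Even this crude model shows why storing rarely-queried data as Parquet on S3 tends to beat paying for extra Redshift nodes.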
In the past, Infinite Lambda has saved clients from having to effectively double their Redshift clusters before peak sales periods simply by offloading large volumes of rarely-accessed data onto S3. That way, they can still run reporting and analytics over this data (via Spectrum) without it eating up precious storage. We saved them around $72,000 per year with just two weeks’ worth of work.
You can really save a lot of money this way and also reduce Redshift maintenance complexity significantly (although there is a bit of data engineering overhead to store data in the right way).
3. The less cost-effective option – cluster scaling
What if you do frequently access all the data on your cluster, absolutely need more storage/compute, and can’t use Spectrum? If I were answering this question a year ago, I’d begin with ‘well, brace yourself…’, because the classic resize method used to take many hours (sometimes over a day), and that usually means ELT panic. Luckily, Redshift now has Elastic Resize, which allows you to add or remove nodes within minutes and with virtually no downtime. However, the cluster will feel slower in the hours after a resize, because Redshift is re-distributing your data to fully optimise for the new node count.
It is worth pointing out that a cluster can only have nodes of the same type, i.e. you can’t add a ds2.xlarge node to a ds2.8xlarge cluster just because you need a tiny bit more disk space. There are also hard limits on how many nodes you can add or remove elastically. For example, a ds2.xlarge cluster with 4 nodes can only be downsized to 2 nodes or upsized to 8 nodes.
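The halve-or-double limit in that example can be sketched as a small helper. This is a simplification of the real rules – elastic resize limits vary by node type, so check the AWS documentation for your cluster before relying on anything like this:

```python
# Sketch of the elastic-resize limits described above, using the
# halve-or-double rule the article gives for ds2.xlarge. Illustrative only:
# real limits depend on node type and are defined by AWS, not this function.

def elastic_resize_targets(current_nodes: int) -> list[int]:
    """Valid target node counts under a simple halve-or-double rule."""
    targets = []
    if current_nodes % 2 == 0:
        targets.append(current_nodes // 2)  # downsize to half
    targets.append(current_nodes * 2)       # upsize to double
    return targets

print(elastic_resize_targets(4))  # [2, 8] -- matches the ds2.xlarge example
```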
Sometimes, elastic resize won’t be what you are after (e.g. when you want to change the node type of your cluster). In that case, you have to use Classic Resize, which comes with the issue I described above – it freezes your cluster for a while. You can still run queries on it, but you can’t modify the data, which means no ELT.
In this post, I highlighted two strategies for expanding the capacity of your Redshift data warehouse. The first strategy is to apply a ‘hybrid’ approach, whereby your data is split between your data lake (S3) and Redshift itself. Using Redshift Spectrum, you can then happily run queries across your data, while sparing yourself the cost of under-utilised Redshift nodes.
The second strategy is to just scale up your cluster by adding nodes, where I highlighted the benefits of Elastic Resize over Classic Resize.
Sales spiel: Infinite Lambda helps you optimise your data warehouse while reducing your operational costs significantly. Not only can we help you save tens of thousands of pounds on storage and compute costs, we can also offer you ongoing maintenance and data engineering support at very affordable rates. Want to try setting up a hybrid data warehouse? Reach out via the contact form below.