Optimising Redshift Tables for Storage

Nas Radev
April 12, 2019
Read: 3 min

In this post, I’ll share some tried and tested techniques for optimising your Redshift tables, with a focus on storage.


  1. Vacuum
  2. Encode
  3. Partition
  4. Unload
  5. (Bonus) Monitor performance

1. Vacuum

Redshift’s documentation says that the vacuum command “Resorts rows and reclaims space in either a specified table or all tables in the current database.” That’s exactly right, and you really need to run Vacuums regularly.

When you update or delete rows in a Redshift table, you don’t actually physically remove the old/deleted record until a ‘vacuum’ (or a ‘vacuum delete only’) command is run. Redshift tries to perform this operation automatically, but in my experience, this doesn’t always work. I have seen tables with billions of undeleted records – chances are, Redshift’s automated ‘vacuum delete’ fails to complete before new records are inserted/deleted/updated due to the sheer volume of the data being processed.

Run this command on your larger tables to see if they have many deleted rows waiting to be vacuumed (changing the parameters with your own schema and table name)

[php]SELECT tbl_rows – (SELECT count(*) FROM public.users) FROM svv_table_info WHERE “schema” = ” AND “table” = ”;[/php]

A further reason you want to run vacuum regularly is so that your table gets re-sorted. Any insertion or modification of records causes your table to become imperfectly sorted. The less well sorted a table is, the slower it will be to query it. This is super important for query optimisation, but this post is about storage, so I’ll leave this bit for another post.

Lastly, I’d like to mention that VACUUM does take compute resource from your cluster, and it can take a long time to run on some bigger tables. That’s why I’d recommend you schedule vacuum commands to run during off-peak times.

2. Encode

Encode. Every. Single. Column. (Apart from your sort-key columns, don’t encode those.)

This is the single greatest advice I can give for reducing table storage sizes. Picking the correct compressions for your columns can render up to 70% reduction in size (and with Redshift, storage size equals cost).

There are many articles written around how to pick your best column encodings but I recommend starting simple – https://docs.aws.amazon.com/redshift/latest/dg/t_Compressing_data_on_disk.html

3. Partition

Redshift will handle tables with billions of records easily. That is until you need to change a dist/sort key or alter an encoding. At that point, you need to create a new table with the added modifications and move your data from the old table to the new one. And this isn’t a pleasant operation at all, especially if you have tons of ELT running on your platform at the same time, and even more so if your table isn’t optimised in the first place. Similar story with running full VACUUM commands on a really big table – it just takes forever and takes away a valuable compute resource.

The solution – partition your data into multiple tables. Take for example daily clickstream data from a website. We might be talking hundreds of millions of click records per day. The most optimal approach here would be to have one table per day’s worth of data (e.g. clickstream_data_2019_01_01, then clickstream_data_2019_01_02 etc). This makes you a lot more flexible to:

  1. schema changes over time,
  2. ability to do changes on your tables without risk of all your commands timing out or taking forever and
  3. vacuum more easily (you’d only need to vacuum once each daily table).

This partitioning does require you to join multiple tables together, however, you can easily create a VIEW that unions all the tables, reducing the complexity of your queries.

Note: if your daily data isn’t that big, you can partition it monthly, or weekly, depending on what makes sense.

4. Unload

Do your reports and analytics really only query the last month of data in a table? Then just unload the rest of the data to S3! In general, there is no point keeping data in Redshift that you don’t query very often – that’s what data lakes are for. You can always bring it back into Redshift later or, preferably, use Redshift Spectrum to query it directly on S3.

5. (Bonus) Monitor performance

If you don’t already, you should make sure to have these Redshift Admin views set up on your Redshift cluster. They are very, very valuable for monitoring tables that need vacuum’ing, checking for transaction locks, and obtaining various other handy details about your cluster. In fact, I’d dedicate a single WLM slot to a process that periodically selects from some of these views and presents the results via a dashboard, raising an alert on various thresholds (e.g. a lock on my main reporting tables, a non-vacuumed table, etc.)

Sales spiel: Infinite Lambda helps you optimise your data warehouse while reducing your operational costs significantly. Not only can we help you save dozens of thousands of pounds on storage and compute costs, we can also offer you ongoing maintenance and data engineering support at very affordable rates. Reach out via the contact form below.

More on the topic

Everything we know, we are happy to share. Head to the blog to see how we leverage the tech.

speed up Terraform
How to Speed Up Terraform in CI/CD Pipelines
In this blog post, we are going to take a look at where Terraform providers are installed locally. We are then going to use what...
January 20, 2023
Infinite Lambda is now a Snowflake Elite Services Partner
Infinite Lambda Named Snowflake Elite Services Partner
Infinite Lambda has been accredited with Snowflake Elite Services Partnership. We are beyond thrilled to share the news as this recognition attests to our extensive...
December 8, 2022
How to Provide Platform-Specific Interface for Kotlin Multiplatform Code
This blog post looks at interfacing with platform-specific code using Kotlin Multiplatform. This is Part 2 of a series, so make sure to check out...
December 2, 2022
event-driven architectures for data-driven apps
Make Data-Driven Apps with Event-Driven Architectures
The rise of cloud computing and cloud-native technologies enabled the emergence of new age companies. A digital-native breed of businesses that truly operate 24/7 across...
November 23, 2022
Fivetran Regional Innovation Partner of the Year for EMEA 2022
Infinite Lambda Named Fivetran Regional Innovation Partner of the Year for EMEA
We are thrilled to announce that we have been named Fivetran Regional Innovation Partner of the Year for EMEA. We are twice as happy to...
October 20, 2022
dbt Labs Platinum Partnership and Certification Award
Infinite Lambda Named dbt Labs Platinum Partner
We are thrilled to announce that Infinite Lambda has been named a Platinum partner to dbt Labs. We have been using dbt since the very...
October 18, 2022

Everything we know, we are happy to share. Head to the blog to see how we leverage the tech.