Optimising Redshift Tables for Storage

Nas Radev
April 12, 2019
Read: 3 min

In this post, I’ll share some tried and tested techniques for optimising your Redshift tables, with a focus on storage.


  1. Vacuum
  2. Encode
  3. Partition
  4. Unload
  5. (Bonus) Monitor performance

1. Vacuum

Redshift’s documentation says that the vacuum command “Resorts rows and reclaims space in either a specified table or all tables in the current database.” That’s exactly right, and you really need to run Vacuums regularly.

When you update or delete rows in a Redshift table, you don’t actually physically remove the old/deleted record until a ‘vacuum’ (or a ‘vacuum delete only’) command is run. Redshift tries to perform this operation automatically, but in my experience, this doesn’t always work. I have seen tables with billions of undeleted records – chances are, Redshift’s automated ‘vacuum delete’ fails to complete before new records are inserted/deleted/updated due to the sheer volume of the data being processed.

Run this command on your larger tables to see if they have many deleted rows waiting to be vacuumed (changing the parameters with your own schema and table name)

[php]SELECT tbl_rows – (SELECT count(*) FROM public.users) FROM svv_table_info WHERE “schema” = ” AND “table” = ”;[/php]

A further reason you want to run vacuum regularly is so that your table gets re-sorted. Any insertion or modification of records causes your table to become imperfectly sorted. The less well sorted a table is, the slower it will be to query it. This is super important for query optimisation, but this post is about storage, so I’ll leave this bit for another post.

Lastly, I’d like to mention that VACUUM does take compute resource from your cluster, and it can take a long time to run on some bigger tables. That’s why I’d recommend you schedule vacuum commands to run during off-peak times.

2. Encode

Encode. Every. Single. Column. (Apart from your sort-key columns, don’t encode those.)

This is the single greatest advice I can give for reducing table storage sizes. Picking the correct compressions for your columns can render up to 70% reduction in size (and with Redshift, storage size equals cost).

There are many articles written around how to pick your best column encodings but I recommend starting simple – https://docs.aws.amazon.com/redshift/latest/dg/t_Compressing_data_on_disk.html

3. Partition

Redshift will handle tables with billions of records easily. That is until you need to change a dist/sort key or alter an encoding. At that point, you need to create a new table with the added modifications and move your data from the old table to the new one. And this isn’t a pleasant operation at all, especially if you have tons of ELT running on your platform at the same time, and even more so if your table isn’t optimised in the first place. Similar story with running full VACUUM commands on a really big table – it just takes forever and takes away a valuable compute resource.

The solution – partition your data into multiple tables. Take for example daily clickstream data from a website. We might be talking hundreds of millions of click records per day. The most optimal approach here would be to have one table per day’s worth of data (e.g. clickstream_data_2019_01_01, then clickstream_data_2019_01_02 etc). This makes you a lot more flexible to:

  1. schema changes over time,
  2. ability to do changes on your tables without risk of all your commands timing out or taking forever and
  3. vacuum more easily (you’d only need to vacuum once each daily table).

This partitioning does require you to join multiple tables together, however, you can easily create a VIEW that unions all the tables, reducing the complexity of your queries.

Note: if your daily data isn’t that big, you can partition it monthly, or weekly, depending on what makes sense.

4. Unload

Do your reports and analytics really only query the last month of data in a table? Then just unload the rest of the data to S3! In general, there is no point keeping data in Redshift that you don’t query very often – that’s what data lakes are for. You can always bring it back into Redshift later or, preferably, use Redshift Spectrum to query it directly on S3.

5. (Bonus) Monitor performance

If you don’t already, you should make sure to have these Redshift Admin views set up on your Redshift cluster. They are very, very valuable for monitoring tables that need vacuum’ing, checking for transaction locks, and obtaining various other handy details about your cluster. In fact, I’d dedicate a single WLM slot to a process that periodically selects from some of these views and presents the results via a dashboard, raising an alert on various thresholds (e.g. a lock on my main reporting tables, a non-vacuumed table, etc.)

Sales spiel: Infinite Lambda helps you optimise your data warehouse while reducing your operational costs significantly. Not only can we help you save dozens of thousands of pounds on storage and compute costs, we can also offer you ongoing maintenance and data engineering support at very affordable rates. Reach out via the contact form below.

More on the topic

Everything we know, we are happy to share. Head to the blog to see how we leverage the tech.

Apache Airflow start_date and execution_date explained
Airflow start_date and execution_date Explained
Despite Airflow’s popularity in data engineering, the start_date and execution_date concepts remain confusing among many new developers today. This article aims to demystify them. Basic...
June 15, 2022
Breaking Some Myths about the Use of Dual-Track Agile
Bringing both flexibility and transparency, the Dual-Track Agile methodology is increasingly popular. With a growing number of teams that decide to try it out, it...
June 10, 2022
Creating a PostgreSQL to BigQuery Sync Pipeline Using Debezium and Kafka
Many companies today use different database technologies for their application and their data platform. This creates the challenge of enabling analytics on application data without...
June 1, 2022
How to Apply Dual-Track Agile in Practice
This article is a part of a series on the Dual-Track model. Here, I am going to share with you 5 rules on how to...
May 17, 2022
Challenges of Using Dual-Track Agile and How to Handle Them
Welcome to Part II of the Infinite Lambda’s blog series on Dual-Track Agile. You might want to check Part I that explains what this model...
April 15, 2022
Sustainability: the Last Frontier in Business Intelligence
The power of the modern data stack in generating actionable insights out of disparate data is well documented. It’s time to apply this to sustainability....
April 1, 2022

Everything we know, we are happy to share. Head to the blog to see how we leverage the tech.

Optimized by Seraphinite Accelerator
Turns on site high speed to be attractive for people and search engines.