
Optimising Redshift Tables for Storage

Nas Radev
April 12, 2019
Read: 3 min

In this post, I'll share some tried and tested techniques for optimising your Redshift tables, with a focus on storage.

Contents:

  1. Vacuum
  2. Encode
  3. Partition
  4. Unload
  5. (Bonus) Monitor performance

1. Vacuum

Redshift's documentation says that the VACUUM command "Re-sorts rows and reclaims space in either a specified table or all tables in the current database." That's exactly right, and you really do need to run VACUUM regularly.

When you update or delete rows in a Redshift table, the old or deleted records are not physically removed until a VACUUM (or VACUUM DELETE ONLY) command is run. Redshift tries to perform this operation automatically, but in my experience this doesn't always work. I have seen tables with billions of deleted records still taking up space - chances are, Redshift's automated vacuum delete fails to complete before new records are inserted, deleted or updated, due to the sheer volume of data being processed.

Run this query on your larger tables to see whether they have many deleted rows waiting to be vacuumed (substituting your own schema and table name):

SELECT tbl_rows - (SELECT count(*) FROM public.users) AS deleted_rows_awaiting_vacuum
FROM svv_table_info
WHERE "schema" = 'public' AND "table" = 'users';

A further reason you want to run vacuum regularly is so that your table gets re-sorted. Any insertion or modification of records causes your table to become imperfectly sorted. The less well sorted a table is, the slower it will be to query it. This is super important for query optimisation, but this post is about storage, so I'll leave this bit for another post.

Lastly, I'd like to mention that VACUUM does consume compute resources on your cluster, and it can take a long time to run on bigger tables. That's why I'd recommend scheduling vacuum commands to run during off-peak times.
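For reference, here is a minimal sketch of the VACUUM variants mentioned above, run against the same example table (public.users is just a placeholder):

-- Reclaim space from deleted rows AND re-sort the table (the default behaviour):
VACUUM FULL public.users;

-- Only reclaim space from deleted rows, without re-sorting (cheaper and faster):
VACUUM DELETE ONLY public.users;

-- Skip the work entirely if the table is already at least 99 percent sorted and reclaimed:
VACUUM FULL public.users TO 99 PERCENT;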

2. Encode

Encode. Every. Single. Column. (Apart from your sort-key columns, don't encode those.)

This is the single best piece of advice I can give for reducing table storage sizes. Picking the correct compression encodings for your columns can yield up to a 70% reduction in size (and with Redshift, storage size equals cost).

There are many articles on how to pick the best column encodings, but I recommend starting simple with the official documentation - https://docs.aws.amazon.com/redshift/latest/dg/t_Compressing_data_on_disk.html
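As a quick sketch (the table and column definitions below are made up purely for illustration, and zstd/raw are just example encodings), you can let Redshift recommend encodings with ANALYZE COMPRESSION and then set them explicitly in your DDL:

-- Ask Redshift to sample the table and suggest a compression encoding per column:
ANALYZE COMPRESSION public.users;

-- Apply encodings explicitly when (re)creating a table; the sort-key column is
-- deliberately left unencoded (RAW):
CREATE TABLE public.clickstream_data_2019_01_01 (
    event_time  TIMESTAMP     ENCODE raw,   -- sort key, leave unencoded
    user_id     BIGINT        ENCODE zstd,
    session_id  VARCHAR(64)   ENCODE zstd,
    page_url    VARCHAR(2048) ENCODE zstd
)
DISTKEY (user_id)
SORTKEY (event_time);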

3. Partition

Redshift will handle tables with billions of records easily - that is, until you need to change a dist/sort key or alter an encoding. At that point, you need to create a new table with the desired modifications and move your data from the old table to the new one. This isn't a pleasant operation at all, especially if you have a lot of ELT running on your platform at the same time, and even more so if your table isn't optimised in the first place. It's a similar story with running full VACUUM commands on a really big table - it just takes forever and ties up valuable compute resources.

The solution: partition your data into multiple tables. Take, for example, daily clickstream data from a website. We might be talking hundreds of millions of click records per day. The optimal approach here is to have one table per day's worth of data (e.g. clickstream_data_2019_01_01, then clickstream_data_2019_01_02, etc.). This gives you a lot more flexibility:

  1. you can accommodate schema changes over time,
  2. you can alter your tables without the risk of commands timing out or taking forever, and
  3. you can vacuum more easily (each daily table only needs vacuuming once).

This partitioning does mean that queries need to read from multiple tables. However, you can easily create a VIEW that unions all the tables, keeping your queries simple.
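A minimal sketch of such a view, assuming the daily clickstream tables from the example above (you would extend or regenerate it as new daily tables appear):

CREATE OR REPLACE VIEW public.clickstream_data AS
SELECT * FROM public.clickstream_data_2019_01_01
UNION ALL
SELECT * FROM public.clickstream_data_2019_01_02
UNION ALL
SELECT * FROM public.clickstream_data_2019_01_03;

Note that UNION ALL, unlike UNION, doesn't add a de-duplication step, so the view itself adds very little overhead.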

Note: if your daily data isn't that big, you can partition it monthly or weekly instead, depending on what makes sense.

4. Unload

Do your reports and analytics really only query the last month of data in a table? Then just unload the rest of the data to S3! In general, there is no point keeping data in Redshift that you don't query very often - that's what data lakes are for. You can always bring it back into Redshift later or, preferably, use Redshift Spectrum to query it directly on S3.
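A hedged sketch of the UNLOAD command (the bucket path and IAM role ARN are placeholders you would replace with your own):

UNLOAD ('SELECT * FROM public.clickstream_data_2019_01_01')
TO 's3://my-data-lake/clickstream/2019/01/01/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-unload-role'
GZIP
DELIMITER '|'
ALLOWOVERWRITE;

Once you've verified the unloaded files, you can drop the old daily table (or delete the rows and vacuum) to actually free the space in Redshift.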

5. (Bonus) Monitor performance

If you don't already, you should make sure to have the Redshift admin views (from AWS's amazon-redshift-utils repository on GitHub) set up on your Redshift cluster. They are very, very valuable for monitoring tables that need vacuuming, checking for transaction locks, and obtaining various other handy details about your cluster. In fact, I'd dedicate a single WLM slot to a process that periodically selects from some of these views and presents the results via a dashboard, raising an alert on various thresholds (e.g. a lock on my main reporting tables, a non-vacuumed table, etc.)
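Even before setting up the full set of admin views, the built-in SVV_TABLE_INFO system view gives you a quick storage health check. A simple sketch:

-- Largest tables first, with how unsorted they are and how stale their stats are -
-- large, poorly sorted tables are prime candidates for a scheduled VACUUM:
SELECT "schema", "table", size AS size_mb, tbl_rows, unsorted, stats_off
FROM svv_table_info
ORDER BY size DESC
LIMIT 20;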

Sales spiel: Infinite Lambda helps you optimise your data warehouse while reducing your operational costs significantly. Not only can we help you save tens of thousands of pounds on storage and compute costs, we can also offer ongoing maintenance and data engineering support at very affordable rates. Reach out via the contact form below.
