- Need to separate compute and storage?
- Need less maintenance?
- Ease of use
Need to separate compute and storage?
This is one of my favourite features of Snowflake – the separation of storage from compute. In Snowflake, data is stored in databases. These are just like any other big data warehouse’s databases – collections of tables and views with certain metadata and permissions settings.
Compute, however, is somewhat differently done in Snowflake. Compute deals with querying data, inserting/deleting data, doing complicated joins, running window functions, etc., etc. These processes run within a Virtual Warehouse – an (in my opinion) oddly named abstraction of a compute cluster. Many such compute clusters can run queries over the same database. And one cluster can run queries across multiple databases.
Note that this is different from how traditional data warehouses (like Redshift) work. There, you’d have a pre-allocated amount of compute resources bundled together with storage, and if you wish to scale one, you’d have to scale the other. Importantly, this goes both ways – if you wished to de-scale one (e.g. reduce compute as you are not using it), you’d have to reduce the storage i.e. reduce your data!
What does all that mean for you? It means that you can be extremely flexible in how you utilise your data platform. For example, if you had a really resource-intensive process that needs to run in the morning (e.g. to generate your daily reports), you could tell Snowflake to bring up a beefy compute cluster to do the processing quickly, and then take it down. You only pay for a couple of hours of compute usage. With Redshift, you’d need to have a Redshift cluster running all day, and the more under-utilised your cluster is for the rest of the day, the more wasteful that would be.
Another flexibility delivered by the compute-storage separation is very obvious once you start having multiple tenants on your data warehouse. For example, you might have pixel-perfect automated reporting, BI dashboards, advanced analytics/data scientists running heavy exploratory analysis, AI team training models, etc. etc. Sooner or later, these tenants start stepping on each-other’s toes because you’ll hit the compute resource limits. Unless, that is, you have separated compute from storage, and can provision computational resources elastically as and when needed (with precise controls in place around cost), and bring them back down when they are not. Isn’t Snowflake wonderful?
Need less maintenance?
Just like Redshift, Snowflake utilises columnar data storage. (Columnar data storage is good for analytics as it enables you to more efficiently access particular fields – columns – in your data, which is more important for analytics than accessing individual records).
However, unlike Redshift, Snowflake requires minimal setup to ensure your data is stored in the most efficient way. This is done via Snowflake’s automated micro-partitioning, which effectively splits out your data into millions of tiny clusters and keeps metadata about each cluster to enable quick retrieval. The best part is, this is done on-the-fly for you, as you insert data – and you don’t even need to tell Snowflake how to cluster your data. It just knows.
Why is this better?
Before I create a Redshift table, I need to carefully consider what sort of queries will be running on it. What columns will be used for joins? What data skew might I get? Then I would define the distribution and sort keys accordingly. After I’ve put some data into it, I’d need to do a VACUUM SORT, to ensure that the data is neatly re-partitioned. Ever worked on a table of 1B+ records that had no sort keys, and has never been vacuumed? It’s slow. Ever tried to change the keys of such a table? It is very slow. The point is, Redshift requires a fair amount of maintenance. If you wish to enable your data engineers to focus on building towards future requirements, you need to worry about the amount of maintenance they need to perform on ‘legacy stuff’. Snowflake solves this problem for you, out of the box.
Ease of use
I’ll just randomly list a few of my favourite features about Snowflake that I can’t really do in Redshift (or not as easily):
– Query data in my cloud storage bucket (not on the database). With Snowflake, I just create a stage that points to my S3 bucket / Azure blob, specify the file type (it can read JSON, Parquet, CSV, etc. out of the box), and start querying. In Redshift, I’d need to use a metastore to know the schema of my data (Glue), crawl my data before I can use it, setup external Spectrum tables, etc.
– Horizontal scaling. Do your data scientists complain that they can’t get any work done on the data warehouse in the morning while other reports and analytics are running on it? Their queries always seem to get stuck on a queue. With Snowflake, you can setup rules for auto-scaling that provide more compute at a certain time during the day, but only when needed.
– Vertical scaling. When 90% of your queries need a small cluster, but the last 10% are full of beefy billion-record joins, what do you do? Over-provision a cluster so that your queries don’t blow up? With Snowflake, you can just say ‘run those 9 queries on a small compute cluster, and the last 1 on an extra-large one’, or, alternatively ‘run those 9 queries on the cluster as it is, but before the last one, do a cluster upgrade to extra-large, run the query, then scale back down’. Magic.
There are several other things I love about Snowflake but I shall limit myself to what we have so far, and continue in a different post.
Summary: Snowflake gives you elasticity and power while keeping maintenance to a minimum. While you might be able to squeeze more out of a fully optimised Redshift cluster, you will need to have a dedicated team looking after it and keeping tables/processes optimised. If you don’t want to dedicate that sort of resource to data warehouse management, use Snowflake.
Sales spiel: Infinite Lambda has a team of data engineers ready to set you up with Snowflake immediately. Work with us to build out a cutting edge data warehouse, improve an existing one, or find ways to optimise operational costs.