Leveraging Snowflake Data Shares: an E-commerce Case Study

A primer on data sharing

Data shares are a powerful feature developed by Snowflake, allowing any organisation with a Snowflake account to seamlessly and instantly provide data access to other organisations anywhere in the world. Snowflake itself leverages this technology to run a data marketplace, which Starschema recently used to provide public access to COVID-19 epidemiological data.

Data sharing is an efficient way to provide third party access to large amounts of data without having to architect a data transfer process. This means it’s well suited to scenarios in which an organisation has preexisting raw data available in its Snowflake data warehouse. It generally also doesn’t require much infrastructure setup, and usually enables automated data refresh, so the recipient can move directly to applying quality checks, transformations and reporting.
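For context, setting up a share on the provider side takes only a few statements. The sketch below uses illustrative object names (marketing_share, campaign_db, email_events_v, consumer_account, provider_account), not ones from this project:

```sql
-- Provider side: create a share and expose a secure view through it.
create share marketing_share;
grant usage on database campaign_db to share marketing_share;
grant usage on schema campaign_db.public to share marketing_share;
grant select on view campaign_db.public.email_events_v to share marketing_share;
alter share marketing_share add accounts = consumer_account;

-- Consumer side: mount the share as a read-only database.
create database shared_email_data from share provider_account.marketing_share;
```

No data is copied at any point: the consumer queries the provider's storage directly, which is why no data transfer process needs to be architected.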

An e-commerce case study

At Infinite Lambda, we were recently involved in a data sharing proof-of-concept project. We worked with an e-commerce company whose email marketing data resided with a third party that was also managing their email campaigns. This third party was already providing some insights on campaign performance, but since the data were pre-aggregated, our client was not able to arbitrarily slice and dice the data, perform detailed cohort analysis or integrate the insights into their own attribution model. To remedy this, the third party offered to provide raw data via a Snowflake share.

Our role was to take the data received via Snowflake data shares, then model and transform it into a state that would facilitate reporting and business intelligence, specifically exposing it to Looker, the client's reporting tool.

Using dbt to apply quality checks & transform raw data

We received the data in the form of several secure views via Snowflake data shares. Since these secure views were querying a much larger underlying table, they were slow to query (the secure views bypass certain optimisations used in standard views). The first step was therefore to clone the data into a RAW layer before applying any transformations, removing these performance issues for subsequent nodes in the DAG. This is represented as Layer 1 in the diagram below. dbt is first and foremost a data transformation tool, but it is just as well suited to an ingestion task like this.

Working with event stream data meant that we were able to make use of an incremental materialisation strategy in dbt – only refreshing new or changed table records after the initial build. We also masked PII (personally identifiable information) in the RAW layer, but applied no other transformations.
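A minimal sketch of such a Layer 1 model is shown below; all source, table and column names (email_share, email_events, event_id, email_address and so on) are illustrative, not taken from the project:

```sql
-- models/1_RAW/raw_email_events.sql (illustrative sketch)
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    campaign_id,
    event_type,
    event_timestamp,
    -- mask PII on ingestion: persist a hash rather than the raw address
    sha2(email_address) as email_address_hash
from {{ source('email_share', 'email_events') }}

{% if is_incremental() %}
  -- on subsequent runs, only pick up events newer than those already loaded
  where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```

The initial run builds the full table; every run after that processes only the new event records, keeping refreshes cheap even on large event streams.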

Quality controls and formatting

Before modelling the data, various quality controls and formatting needed to be applied. We began by generating surrogate keys (the dbt_utils package provides a neat way of doing so) and deduplicating the event data in Layers 2-3.
Layer 3 also handled column renaming and the casting of timestamps, booleans and so on.
Finally, Layer 4 housed our normalised fact and dimension tables, as well as some pre-aggregated data ready for the reporting layer.
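As an illustration of the staging work, the sketch below combines the surrogate key macro from dbt_utils (called generate_surrogate_key in recent versions of the package) with window-function deduplication; the column and model names are assumptions, not the project's actual ones:

```sql
-- models/2_STAGING_DATA_QUALITY/stg_email_events.sql (illustrative sketch)
select
    -- deterministic surrogate key built from the event's natural identifiers
    {{ dbt_utils.generate_surrogate_key(
        ['campaign_id', 'recipient_id', 'event_timestamp']
    ) }} as event_key,
    *
from {{ ref('raw_email_events') }}
-- keep a single row per natural key, preferring the latest-loaded record
qualify row_number() over (
    partition by campaign_id, recipient_id, event_timestamp
    order by loaded_at desc
) = 1
```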

Project infrastructure

Here’s a high-level diagram of the project’s structure:

[Diagram: leveraging Snowflake data shares]

Layering successive transformations like this makes the models easy to read and follow, so that everyone involved in the data pipeline can see what is happening, where and how. This structure also makes it much more straightforward to debug and modify. Again, most of the models we created were incrementally materialised, limiting the amount of transformation that happens in each dbt run.

Setting up the data warehouse

We set up the data warehouse, in this case Snowflake, in such a way as to separate different ingestion and transformation tasks. The specifics will depend on the use case, but in this case it was important to minimise the number of users with access to PII and to be able to easily monitor the resource usage of different workflows. The diagram below represents the structure we settled on.

[Diagram: Snowflake data shares warehouse setup]

Here we have dedicated users for ingestion, transformation and reporting. We're showing the structure of our development database here, but it is essentially mirrored in the production environment. We also created a macro that applies a custom row limit in the development environment, to limit the resources required for building out and testing our models.
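A macro of that kind can be as simple as the sketch below (the macro name and default row cap are illustrative); appending {{ dev_limit() }} to a model's final select leaves production builds untouched while capping development ones:

```sql
-- macros/dev_limit.sql (illustrative sketch)
{% macro dev_limit(n=10000) %}
  {# only emit a LIMIT clause for development targets #}
  {% if target.name.endswith('_dev') %}
    limit {{ n }}
  {% endif %}
{% endmacro %}
```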

Running the models

In this section, we’ll lay out the details of how we were executing dbt. If you don’t need the technical details, feel free to skip ahead.

Running our first layer with dbt:

dbt run --target ingest_dev --models 1_RAW

This executes all .sql files inside the 1_RAW folder of our dbt project, cloning the raw data from the secure views into our ingest warehouse.

Let’s move to our second layer – staging. Again, executing this layer is as follows:

dbt run --target transform_dev --models 2_STAGING_DATA_QUALITY

Notice the change of targets. Here we need to take data from the RAW_DEV database and write new relations in ANALYTICS_DEV.

Note that it would be possible to execute different layers with different targets in a single dbt run, but to do this you would need to refer to upstream models with the {{ source }} function instead of the usual {{ ref }}. This has the unfortunate consequence of breaking the lineage in the auto-generated dbt documentation, so we preferred not to do it.

Since the transform_dev target's default database is ANALYTICS_DEV, but it needs to {{ ref }} models that live in a different database (RAW_DEV), we needed to include the following database override in our dbt_project.yml:

database: "{{ 'RAW_DEV' if target.name in ('ingest_dev', 'transform_dev') else 'RAW_PROD' }}"

We can then run layers 3 and 4 analogously, and no further change of target is required.

In this proof of concept, the models were being run manually. In a production environment, we can schedule the setup we described above in Airflow and run it regularly with a Python wrapper.

The output

The final reporting layer results in normalised fact and dimension tables, as well as some pre-aggregations. Connecting this reporting layer to a BI tool then allowed us to integrate and enrich this email engagement information with the other customer data already available via that BI tool. This then paves the way for the client to delve much deeper into marketing campaigns, cohort and persona analysis and run much more powerful A/B tests, ultimately driving more value for the business.

Check out the rest of the posts on our blog for more insights into data and business intelligence.

To discuss how Infinite Lambda could help your business, reach out to contact@infinitelambda.com.
