Automated Data Masking on Snowflake Using Data Contracts

Zdravko Yanakiev
January 17, 2024
Read: 7 min

As digital data is growing exponentially, safeguarding sensitive information is more important than ever.

Compliance with strict regulatory frameworks, such as the European Union’s General Data Protection Regulation (GDPR), is paramount, and data masking is a crucial technique to protect users’ personally identifiable information (PII) against data breaches and unauthorised access.

Data masking is a security technique that hides private information, allowing only authorised individuals to access the genuine data. It enables organisations to derive valuable insights without compromising individual privacy. However, identifying and manually masking personal data is time-consuming and error-prone.

Leveraging data contracts to mask personal data

Data contracts are specifications that outline the structure, format and rules governing the data exchange between systems.

Within a data contract definition, PII fields can be explicitly marked, allowing for the automated application of data masking across the data lake. This post details an approach for automated data masking using data contracts, illustrated through an example use case.

Use case: protecting user privacy in event-driven data platforms



Consider the following scenario: an online retailer has implemented an event-driven architecture. They use several different microservices in their daily operations to register and manage users, handle customer orders and track and fulfil product return requests.

These services, each developed by a separate team, produce many events and publish them to topics in Confluent Kafka, which acts as the central nervous system of the overall architecture.

Every event type conforms to a carefully designed data contract to facilitate communication between different systems. The events contain data which is essential to business operations, such as customer names and contact information or order details.

The data is also useful for analytics purposes, such as customer segmentation or aggregation queries. To serve this purpose, all events from Confluent Kafka are also ingested into the retailer’s Snowflake data warehouse. However, warehouse users, such as analysts, must not have access to any PII in order for the retailer to remain GDPR-compliant.

This is where data masking comes into play to replace sensitive information with pseudonymous or fictitious data.

The key to effective data masking lies in maintaining some correlation between masked data and its original state, rather than creating entirely new information. In the context of an analytical warehouse, this approach not only ensures compliance but also facilitates analytics by replacing or transforming sensitive data while preserving statistical patterns. By applying this technique, we can retain the usefulness of the event data we have collected while protecting individual privacy.

To make the whole process more accurate and less time-consuming, we can opt for an automated data masking approach that leverages Snowflake’s built-in functionalities paired with the descriptive capabilities of data contracts.

Implementing automatic data masking

Tagging sensitive fields

First things first: how do we know what data is sensitive and needs to be masked, and what can be kept as is? Engineers often find this difficult to decide, as it depends on the semantics of the data fields and the applicable regulations, among other things.

More often than not, the security or privacy officers working on an analytical data platform do not have full visibility into the specifics of the ingested data. This is a cause for concern, as these are the people who would be responsible for applying masking to PII if it were done manually.

Ideally, a clear description of PII data would be readily available for all entities ingested. This description must be both human-readable to inform other people of the specifics of sensitive data, and machine-readable to enable automated handling of PII data.

This is an ideal use case for leveraging data contracts to mark personal data, as they fulfil both of the requirements above, allowing us to add the metadata effortlessly.

As we explained in a recent article on getting started with data contracts, the data producer contributes the data contract for their data as they know said data best. In the context of our microservice-based retail platform, this would mean that the customer team would define the contract for customer events, while the team that handles orders would define the contract for order events, and so on.

Note that, ideally, contracts are managed in version-controlled repositories, and producers can tag PII fields as sensitive when defining the data contract.

A JSON Schema contract for customer data could look like this:
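A minimal sketch of such a contract (the field names and the custom `tags` keyword are illustrative assumptions; JSON Schema permits custom annotation keywords like this):

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "CustomerRegistered",
  "type": "object",
  "properties": {
    "name": {
      "type": "string",
      "tags": ["PII"]
    },
    "email": {
      "type": "string",
      "format": "email",
      "tags": ["PII"]
    },
    "marketing_consent": {
      "type": "boolean"
    },
    "id": {
      "type": "string",
      "format": "uuid"
    }
  },
  "required": ["name", "email", "marketing_consent", "id"]
}
```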

Here, we have a simple customer registration event that only has four fields:

  1. Customer name;
  2. Email address;
  3. Marketing communications consent;
  4. Unique ID.

Out of these four, only the first two constitute sensitive data and should be marked as such by adding the PII tag to the field definition. The tags in the data contract are there to alert other stakeholders of the sensitivity, but we can also leverage them to automate processing.

Data masking in Snowflake

Snowflake’s Enterprise Edition provides out-of-the-box data masking capabilities. The feature employs masking policies to selectively obscure plain-text data within the table and view columns during query execution.

At query runtime, the masking policy is enforced on the column at every instance of its occurrence. Depending on the conditions set out in the masking policy, the SQL execution context and the role hierarchy, Snowflake query operators may see the plain-text value, a partially masked value or a fully masked value.

The masking policy can consist of a simple SQL expression or call a complex user-defined function written in any of Snowflake’s supported languages. The custom logic contained therein may mask data differently depending on the role or the environment used for the query, for example leaving data unmasked in development environments or only masking it for specific user types.

Snowflake masking policies are schema-level objects and, as such, are managed using SQL. We can create a basic policy to replace a string value with its SHA2 hash value and apply it to sensitive fields in the customer event table. This is what the SQL code would look like:
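For example (the policy, table and column names here are illustrative):

```sql
-- A basic policy that replaces a string value with its SHA2 hash
CREATE OR REPLACE MASKING POLICY pii_string_mask AS (val STRING)
  RETURNS STRING -> SHA2(val);

-- Apply the policy to the sensitive columns of the customer event table
ALTER TABLE customer_events MODIFY COLUMN name  SET MASKING POLICY pii_string_mask;
ALTER TABLE customer_events MODIFY COLUMN email SET MASKING POLICY pii_string_mask;
```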

At query time, this will replace all occurrences of the name and email fields with their hashed values. Due to the deterministic nature of the hash function, analytical queries on these columns will remain possible. At the same time, no actual customer data will be revealed.

Automated data masking

We now have an exhaustive description of all sensitive fields in event entities, thanks to the tags in the data contracts, as well as a mechanism to mask these fields using Snowflake’s native dynamic data masking capabilities.

Yet, keeping track of all PII fields in data contracts and manually defining and applying masking policies to their occurrences in Snowflake remains a cumbersome, error-prone task. We want to automatically use the metadata in contracts for Kafka events to create and apply appropriate masking policies on the Snowflake data.

To make this possible, there are two requirements for us to meet.

First, we must use data contracts for all Kafka topics ingested into Snowflake. A prerequisite for this is publishing data contracts to the Confluent Schema Registry. After that, schema usage can be enforced in different ways. The common practice is to ask producers to use the contract’s schema to serialise their payloads, resulting in implicit validation. Another option is enabling server-side schema validation on the topics concerned.

Second, there must be a clear one-to-one mapping between a topic and its corresponding Snowflake table. This mapping can be configured with the optional snowflake.topic2table.map property in the Kafka Sink Connector, used to ingest event data into Snowflake.
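The property takes comma-separated topic:table pairs, so a configuration for two hypothetical topics could include:

```
snowflake.topic2table.map=customer.events:CUSTOMER_EVENTS,order.events:ORDER_EVENTS
```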

At Infinite Lambda, we developed a Python tool that automatically manages masking policies on Snowflake data ingested from Confluent. It makes use of the Confluent APIs to fetch all necessary data for creating appropriate masking policies.

The tool can be used to generate and execute code for creating and applying masking policies – either as plain SQL statements or in the format used by your database object management automation system.

It can be integrated into a CI/CD workflow so that appropriate masking policies are applied to Snowflake data whenever a data contract is created or updated. Using this approach, PII fields on Snowflake are masked as soon as a new data contract is released.

The following diagram illustrates the automated masking workflow. We assume that all data contracts are managed in a version control repository with CI/CD enabled.

Automated data masking to protect user privacy

The process works as follows, and you can use these steps as guidelines for building your own PII data masking tool:

  1. A data contract author commits a new contract or an updated version of an existing contract to their version control repository. This triggers a CI/CD workflow run;
  2. The workflow publishes the contract to the Confluent Schema Registry;
  3. The workflow calls the custom masking tool;
  4. The masking tool fetches the latest data contract from the Confluent Schema Registry;
  5. The masking tool queries the Snowflake Sink Connector’s configuration and extracts topic-to-table mappings;
  6. The masking tool generates appropriate masking code based on the information obtained and applies it to the Snowflake environment.
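The core of steps 4 to 6 can be sketched in Python. This is a simplified illustration rather than the actual tool: the `tags` convention, policy name and table name are assumptions, and fetching from the Schema Registry is replaced here with an inline contract:

```python
import json

PII_TAG = "PII"  # assumed tag convention used in the data contracts

def extract_pii_fields(schema: dict) -> list[str]:
    """Return the names of properties tagged as PII in a JSON Schema contract."""
    return [
        name
        for name, spec in schema.get("properties", {}).items()
        if PII_TAG in spec.get("tags", [])
    ]

def generate_masking_sql(schema: dict, table: str,
                         policy: str = "pii_string_mask") -> list[str]:
    """Generate ALTER TABLE statements applying a masking policy to PII columns."""
    return [
        f"ALTER TABLE {table} MODIFY COLUMN {column} SET MASKING POLICY {policy};"
        for column in extract_pii_fields(schema)
    ]

# In the real workflow, this contract would be fetched from the Schema Registry
contract = json.loads("""
{
  "properties": {
    "name":              {"type": "string",  "tags": ["PII"]},
    "email":             {"type": "string",  "tags": ["PII"]},
    "marketing_consent": {"type": "boolean"},
    "id":                {"type": "string"}
  }
}
""")

for statement in generate_masking_sql(contract, "CUSTOMER_EVENTS"):
    print(statement)
```

The generated statements can then be executed against Snowflake directly or handed to your database object management tooling, as described above.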

Next steps

There are several additional steps that organisations can take to enhance this approach. First, they can define additional tags within data contracts to specify the type of masking to be applied, employing specific masking techniques for different data types (for example, partially masking date fields). This improvement allows for a more nuanced data masking strategy.

Furthermore, establishing a set of globally defined PII fields to serve as a fallback provides an extra layer of protection, ensuring that masking is consistently applied even if a contract lacks explicit tags.

Finally, it is also worth incorporating a data governance tool into the company’s data strategy. Such a tool will be able to run scans on the Snowflake data warehouse, helping to identify any potentially missed PII columns and ensuring comprehensive coverage in data protection measures.

In a nutshell

In this article, we explored a streamlined solution for safeguarding sensitive information amid exponential data growth.

By tagging sensitive fields in data contracts and utilising Snowflake's dynamic data masking capabilities, you can efficiently protect PII in analytical data warehouses. The key lies in automating data masking to reduce complexity, accomplished through version-controlled contracts, schema governance in Confluent Kafka and a Python tool for automated masking.

The powerful combination of data contracts and Snowflake's data masking feature provides the foundation for scalable, effective data protection, continuous compliance with regulations as well as insightful analytics that do not compromise individual privacy.

If you are looking to optimise the way you approach PII data masking on the cloud, reach out to our team. We will help you ensure full compliance while making the most of the analytic capabilities of your platform.
