Solving a Partial Data Retention Challenge with Cryptography and Java UDFs

Singular Engineering Blog
Apr 21, 2022

Originally published by Snowflake: https://www.snowflake.com/blog/solving-data-retention-challenge-cryptography-java-udf/

At Singular, we process a lot of sensitive information for our customers, such as terabytes of usage logs, each attached to all sorts of confidential information. Sometimes our customers ask us to apply selective retention policies to this data for privacy reasons. One such requirement cropped up while we were migrating to Snowflake: we were asked to retain certain information for no longer than 30 days. However, this information was interspersed among columns that we had to keep for more than 30 days. To complicate matters, the specific values with 30-day retention limits sometimes appeared in the middle of strings that themselves needed to be retained for longer than 30 days.

This posed an interesting challenge. How do we selectively delete columns and parts of strings 30 days after their ingestion, while keeping query speeds and costs to a minimum?

A brief introduction to Snowflake

The usage logs we work with need to be accessed reliably and without extra cost overheads, which is why we use Snowflake. In our work with Snowflake, we found that it’s essential to store each log de-normalized — that is, with all its attached information — to ensure queries perform at the level we require. This creates a conundrum when we want to store most of the data in each log for several months while retaining parts of it for only 30 days.

Snowflake stores data in micro-partitions, each containing between 50MB and 500MB of data. The rows in a micro-partition are stored in a columnar format, meaning only queried columns are read — highly useful when there are lots of fields. For each micro-partition, Snowflake also keeps a range index (the maximum and minimum values of each column within the micro-partition). This is used to filter which micro-partitions a query should look at, potentially reducing the amount of data to scan. To run queries at target performance levels over large data sets, we learned to use these indices judiciously, but we will not delve into that subject here. For the purposes of this blog post, what matters is that you can create excellent filters using steadily increasing fields — in our case, timestamp fields.

Snowflake queries, whether initiated by external API calls or internal periodic jobs, all use resources allocated as virtual warehouses. Each virtual warehouse is an allocation of memory and processing power, completely independent from the storage of the data being queried. This allows the allocation to be completely dynamic. It is just as easy to allocate an extra small virtual warehouse as it is to allocate a 4XL one.

For every operation, be it an aggregated query or a simple modification, scanning large amounts of data requires a large investment of resources. Keep this in mind as we move ahead with our story.

How Singular uses Snowflake

At Singular, we use Snowflake to analyze massive amounts of continuously updated data. We ingest dozens of terabytes of logs every day. We upload files continuously to an S3 bucket, get them ingested into Snowflake using Snowpipe, and finally insert them into long-term storage tables using Streams and Tasks. For a more detailed dive into our ingestion architecture, see our previous blog: Migrating our Events Warehouse from Athena to Snowflake.

In summary, our data is stored for the long term in several tables, where retention is handled through the deletion of rows. By partitioning our data correctly upon ingestion, these DELETE commands mostly drop whole micro-partitions, greatly improving their efficiency.
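
To make that concrete, a retention job of this kind boils down to a date-bounded DELETE. The sketch below is illustrative only; the logs table, the event_date column, and the 180-day window are hypothetical names and numbers:

```sql
-- Hypothetical retention job. Because rows are partitioned by event_date on
-- ingestion, a filter on that column lets Snowflake drop whole
-- micro-partitions instead of rewriting them row by row.
DELETE FROM logs
WHERE event_date < DATEADD(day, -180, CURRENT_DATE());
```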

None of this helps us deal with the challenge of having to drop specific columns after a set amount of time, and it certainly doesn’t make it easy to drop partial information contained in string or JSON columns. So let’s talk about solutions.

Approaches to our data retention challenge

We first tried a naive approach: modify the relevant data once a day using SQL queries. This was not feasible. In Snowflake, modifying a single column in a single row involves rewriting the micro-partition in which the row resides. In our case, this means rewriting our entire data set over and over again.

The second approach was to separate sensitive, retention-limited data from the main tables into an auxiliary table and use a JOIN clause to retrieve it when querying the data. This way we could set up a simple DELETE job to limit the auxiliary table to 30 days of retention. One problem with this approach was how to handle sensitive values embedded in strings alongside other data. But the main problem was the sheer number of unique sensitive values: any JOIN against the auxiliary table would make queries significantly slower.

A bit of cryptography

The solution we came up with was hiding sensitive values from the start. We would search our data before uploading it to Snowflake and encrypt each sensitive value we found. Encrypted values are unreadable without the encryption key.

However, cryptographic methods can be very slow when applied to a large number of short values. For example, deciphering dozens of short text messages takes much longer than deciphering a single file of the same total size. So we looked for a faster way to encrypt and decipher a value, and found it in the most basic method of encryption: the exclusive or (XOR) binary operator, applied to the plain value (what we wish to hide) and the encryption key.

XOR is an operator combining two bits. Its result is 1 if the bits are different, and 0 otherwise. Like other bitwise operators, it can be applied to two binary inputs of the same length in bits. Used this way, it is a simple (and thus very fast) operator with a very useful property: if we take a plain message P and a random key K, then XOR(P, K) produces a ciphertext that is unreadable without the key K, while applying XOR(cipher, K) gives us back the plain P. As long as each key is used only once and kept hidden, the encrypted data cannot be read without it. (For more information about why the key must be unique for each value we encrypt, read this explanation by Cryptosmith.)

Figure 1: The XOR operator can be used as a simple and fast way to do basic encryption.
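
As a tiny illustration of the round-trip property, here is the same idea expressed with Snowflake’s built-in BITXOR, which works on numbers (it does not work on the binary strings we actually store, which is why we needed a UDF later on):

```sql
-- 42 plays the role of the plain value P and 1234 the key K:
-- XOR-ing with the same key twice returns the original value.
SELECT BITXOR(BITXOR(42, 1234), 1234);  -- returns 42
```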

Now we had to find a way to generate a unique encryption key per value, at least as long as the value itself, while using only a constant amount of memory (that is, without using JOINs). To do that we needed one of the building blocks of encryption: cryptographic hash functions, or CHFs.

CHFs are an extremely useful tool, well worth learning about if you’re unfamiliar with them. They are fast, one-way functions that map arbitrarily long strings to a constant-size output. The output is fully determined by the input, yet remains hard to guess even if you know parts of the input. CHFs let us create encryption keys based on multiple pieces of information, all of which are required to decrypt the data: if just one piece is missing, you cannot recreate the encryption key.

Because a CHF’s output has a constant size, and our XOR key must be at least as long as the value it hides, this approach limits the length of the data we can encrypt. Fortunately, we only needed to encrypt short strings, such as geo-level information or network details.

As shown in Figure 2, a single random value per day(1), deleted after 30 days, was enough to create a unique encryption key for every piece of data we wanted to expire, as long as each piece was accompanied by its own unique value. For each value, we used a random data generator to produce a randomized salt(2), and combined that salt with the daily value to derive the actual unique key(3) used for the encryption. The salt, together with the date of the daily value, was then attached to the encrypted value to form the result that is written to Snowflake(4). This enabled us to recreate the encryption key at query time.

Figure 2: With a single daily key, deleted after 30 days, we were able to create a unique encryption key for every piece of data we wanted to expire, as long as each piece was accompanied by its own unique salt.
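
To make the derivation in Figure 2 concrete, here is a rough sketch in Snowflake SQL; the daily secret and salt literals are made up, and in practice both are generated by our ingestion pipeline:

```sql
-- Derive a per-value key by hashing the daily secret together with the
-- per-value salt. Once the daily secret is deleted (after 30 days), this
-- key can no longer be reproduced, so the XOR-ed value stays unreadable.
SELECT SHA2_BINARY('daily-secret-2022-04-21' || 'a1b2c3d4', 256) AS per_value_key;
```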

Implementing columnar and intra-string retention policies

To read the encrypted values, each query needs:

  • The daily keys used to encrypt the specific values
  • A method to find and decipher the encrypted values hidden in the data

For the first part, we use Snowflake’s session variables, which let us make the keys available to the query using the GETVARIABLE function. Session variables are simple and have equivalents in other databases, so we will not delve into them here. For the second part, we use masking policies, which alter the SQL of every query on the fly when it accesses certain columns.
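
For context in the examples that follow, the querying application simply SETs the keys when the session starts, and any later expression, including a masking policy body, can read them back; the DAILY_KEYS name and its format are hypothetical:

```sql
-- Set once per session by the querying application.
SET daily_keys = '2022-04-21:aabbccdd;2022-04-20:eeff0011';

-- Readable from any expression in the session.
SELECT GETVARIABLE('DAILY_KEYS');
```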

Masking policies are worth an example and a bit of an explanation. A masking policy is defined on a column, transforming a simple query like:
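
Both snippets below are illustrative sketches rather than our exact SQL; the logs table, the event_time column, and the decipher_value UDF are hypothetical stand-ins for the real names.

```sql
SELECT event_time, customer_sensitive_info
FROM logs
WHERE event_time >= '2022-04-01';
```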

into
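
```sql
-- The masking policy rewrites the reference to the protected column on the
-- fly; decipher_value and DAILY_KEYS are the hypothetical names from above.
SELECT event_time,
       decipher_value(customer_sensitive_info, GETVARIABLE('DAILY_KEYS'))
         AS customer_sensitive_info
FROM logs
WHERE event_time >= '2022-04-01';
```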

This moves the entire deciphering process under the hood, allowing us to continue constructing queries without regard to the encryption, as if all sensitive data were instead stored plainly.

Schematically, we define the masking policy as follows:
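
(The sketch below uses the same hypothetical decipher_value UDF and DAILY_KEYS session variable as in the query above.)

```sql
-- decipher_value recreates the per-value key from the session-provided daily
-- keys and the salt stored alongside the value, then undoes the XOR.
CREATE OR REPLACE MASKING POLICY decipher_sensitive_info AS (val STRING)
RETURNS STRING ->
  decipher_value(val, GETVARIABLE('DAILY_KEYS'));
```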

As you can see, this policy is much like a function, operating on the values of a column. We can now apply it to a column of our choice:
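
(With logs again standing in for our log table.)

```sql
ALTER TABLE logs MODIFY COLUMN customer_sensitive_info
  SET MASKING POLICY decipher_sensitive_info;
```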

This way, queries to the log table should seamlessly decipher data in the customer_sensitive_info column without any modification.

Implementing Java UDFs

Another essential tool for implementing our solution was JavaScript user-defined functions (UDFs). We’ll start with a simple example and then demonstrate the flexibility of this tool with a more complex one.

In the course of implementing the masking policy, we ran into a snag: Snowflake doesn’t offer a built-in method to XOR two binary inputs. We tried to work around it with built-in SQL functions and were not satisfied with the results. However, Snowflake lets us define our own UDFs and execute them inside the Snowflake engine, and that mechanism provided a quick and simple solution. Take a look at js_binxor:
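
(The version below is a reconstruction of the idea, not our production code.)

```sql
CREATE OR REPLACE FUNCTION js_binxor(a BINARY, b BINARY)
RETURNS BINARY
LANGUAGE JAVASCRIPT
AS
$$
  // Snowflake hands BINARY arguments to JavaScript as Uint8Arrays, exposed
  // under their uppercase names (A, B). XOR them byte by byte; in our scheme
  // the key is always at least as long as the value it hides.
  var out = new Uint8Array(A.length);
  for (var i = 0; i < A.length; i++) {
    out[i] = A[i] ^ B[i];
  }
  return out;
$$;
```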

This simple example illustrates the usefulness of UDFs. However, their power extends far beyond it, because UDFs have global state. This lets us rely on expensive initialization logic without re-running it for every row, and therefore without sacrificing performance. In our case, we used a JavaScript library that is not imported into the UDF environment by default:
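
(The snippet below shows the pattern rather than the actual library we embedded; js_decipher, MYLIB, and the stubbed LIB_SOURCE are hypothetical names. The library’s normally minified source travels inside the UDF as a string and is evaluated only while the global is still undefined, so the expensive setup does not repeat for every row.)

```sql
CREATE OR REPLACE FUNCTION js_decipher(ciphertext BINARY, enc_key BINARY)
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
  // "Import" a library by embedding its source and evaluating it once.
  // LIB_SOURCE is a tiny stub standing in for a real minified library that
  // attaches itself to a global object.
  if (typeof MYLIB === "undefined") {
    var LIB_SOURCE =
      "MYLIB = { bytesToText: function (bytes) {" +
      "  return String.fromCharCode.apply(null, bytes); } };";
    eval(LIB_SOURCE);   // MYLIB is now global state, kept across rows
  }

  // Undo the XOR with the key, then decode the bytes with the "imported"
  // helper. In our scheme the key is at least as long as the ciphertext.
  var plain = new Uint8Array(CIPHERTEXT.length);
  for (var i = 0; i < CIPHERTEXT.length; i++) {
    plain[i] = CIPHERTEXT[i] ^ ENC_KEY[i % ENC_KEY.length];
  }
  return MYLIB.bytesToText(plain);
$$;
```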

As you can see, we can “import” a library not contained in Snowflake’s JavaScript dependencies without compromising runtime efficiency. If we wanted to, we could also define helper methods, or run any other setup that would normally be too expensive to repeat for every row. This demonstrates the true versatility of UDFs, which we hope will prove useful to you as well.

Partial data retention: Lessons learned

  • Sometimes a problem can be solved by processing individual column values. Look for that option before resorting to massive JOINs or DELETEs.
  • When using cryptography in a performance-sensitive environment, it pays to understand the theory behind the functions. The standard, off-the-shelf solution may not be ideal (this is probably true in general when fine-tuning performance).
  • When privacy or security is a concern, be extra careful with the implementation. Cryptographic functions often make assumptions about their parameters and usage; adhere to them or risk creating vulnerabilities.
  • In Snowflake, don’t be afraid of creating your own methods! They can speed up development significantly.

