Handling sensitive data with LLMs

A picture that illustrates the challenges of using sensitive data with LLMs

During the past six months I have spoken with numerous customers about LLMs and Gen AI. In the beginning, much of the discussion was about what use cases they enable; lately it has been more about how to use their own data in combination with LLMs.

Restricting the sharing of data with LLM vendors

When it comes to using internal data, many organizations today are aware of the risks of using third-party hosted LLMs with their own data. Before giving internal users access, they often make sure the models run in environments they control, where no data is shared with the vendor.

This is also what Snowflake enables you to do with Snowpark Container Services (in Public Preview) and the Cortex LLM functions (in Public Preview). Both give you the possibility to use either your own LLMs or LLMs hosted by Snowflake within your Snowflake account, without exposing any of the data used with them.
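As a minimal sketch of what this can look like, here is a call to the Cortex COMPLETE function through Snowpark for Python. The connection parameters are placeholders, and the model name is just an example of one that may be available to your account; the prompt and the response never leave Snowflake.

    # Minimal sketch: calling a Snowflake-hosted LLM from Snowpark for Python.
    # Connection parameters below are placeholders for your own account details.
    from snowflake.snowpark import Session

    connection_parameters = {
        "account": "<account_identifier>",
        "user": "<user>",
        "password": "<password>",
        "role": "<role>",
        "warehouse": "<warehouse>",
        "database": "<database>",
        "schema": "<schema>",
    }
    session = Session.builder.configs(connection_parameters).create()

    # SNOWFLAKE.CORTEX.COMPLETE runs a Snowflake-hosted model inside your account.
    row = session.sql(
        "SELECT SNOWFLAKE.CORTEX.COMPLETE('mistral-7b', "
        "'Summarize our parental leave policy in two sentences') AS response"
    ).collect()[0]
    print(row["RESPONSE"])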

Many organizations think that this is enough, and that by applying this pattern they have solved the challenges of using internal and sensitive data with LLMs.

However, this only solves one part of the problem.

Even if you are now running your models in a restricted environment, you still risk exposing sensitive data to users who are not allowed to see it. All data used to train an LLM is visible to everyone who has access to the model, as long as they can figure out how to ask for it. An LLM does not, today, have any built-in security; access to the data in it is binary: either you have access to everything, or you have no access to the model at all.

Let me give a very simple example.

Let’s say that you want to provide your employees with an LLM-driven chatbot where they can ask questions about their employment, benefits, salary and so on. To do that you might use the HR data for all employees to train it, or more likely fine-tune an existing LLM. This opens up the possibility for an employee to ask questions about other employees’ salaries and benefits, which you, hopefully, do not want to be possible.

To support this kind of restriction you would have to train multiple models, basically one per employee, and if you wanted to give managers the possibility to ask questions about their own employees you would need additional models as well. This is not a suitable solution and would probably be a nightmare to manage.

Another way could be to add a layer between the user and the LLM, think of it as an application or API that all queries and answers have to pass through in order to reach the model. In this layer we would add logic to filter out data that a specific user is not allowed to see. It would require us to build a security framework that keeps track of what data each user is allowed to see, and it would make us very dependent on being able to figure that out from the question asked or from the answer returned by the model, as in the sketch below.
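A deliberately naive sketch of such a layer (all names here are hypothetical) shows the problem: it has to guess from free text whether a question or an answer touches data the user is not allowed to see.

    # Hypothetical application-layer filter between the user and the model.
    # It has to infer from plain text whether a question or an answer touches
    # data the user is not allowed to see, which is inherently fragile.
    from typing import Callable, Set

    def mentions_other_employee(text: str, allowed_names: Set[str],
                                all_names: Set[str]) -> bool:
        # Naive check: look for employee names outside the allowed set.
        # It misses nicknames, misspellings, indirect references and so on.
        return any(name in text for name in all_names - allowed_names)

    def ask_hr_chatbot(question: str, allowed_names: Set[str], all_names: Set[str],
                       query_llm: Callable[[str], str]) -> str:
        if mentions_other_employee(question, allowed_names, all_names):
            return "You are not allowed to ask about other employees."
        answer = query_llm(question)
        # The model can still reveal data the question never mentioned,
        # so the same fragile inspection has to be repeated on the answer.
        if mentions_other_employee(answer, allowed_names, all_names):
            return "The answer contained information you are not allowed to see."
        return answer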

Does this mean we cannot use data that is restricted per user with our LLMs?

It does not, but it means we need another approach.

Using sensitive data with LLMs

First, we should only train, or fine-tune, a model with data that everyone who will use the model is allowed to see. In the HR case that would be data about which benefits exist, policies and so on, and nothing that contains information about a specific employee.

Second, sensitive data needs to be provided at the time the model is queried, so that access to it can be restricted per user, and one way to do that is to use RAG.

RAG, Retrieval Augmented Generation, is a way to provide a model with additional context to use when generating an answer, on top of the data it has been trained on.

With RAG we take the question a user asks and use it to select data that is similar, usually with a function that calculates the similarity between the two. A popular way to do this is to generate embeddings for the data ahead of time and then, when a user asks a question, turn the question into an embedding as well and use it to find the most similar data. My colleague Tom Christian has written a nice blog post about how this can be done in Snowflake, and my other colleague Carlos Carrero has written a post about how you can automate it.
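To make the retrieval step concrete, here is a minimal sketch in Python. The embeddings are assumed to come from whatever embedding model you have chosen; any model that produces fixed-size vectors works the same way.

    # Minimal sketch of embedding-based retrieval: embed the documents once,
    # embed the question at query time and pick the most similar documents.
    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def retrieve(question_embedding, doc_embeddings, documents, top_k=3):
        # Rank documents by similarity to the question and return the top ones.
        scores = [cosine_similarity(question_embedding, e) for e in doc_embeddings]
        ranked = sorted(zip(scores, documents), key=lambda x: x[0], reverse=True)
        return [doc for _, doc in ranked[:top_k]]

    # The retrieved documents are then passed to the LLM as context,
    # together with the original question, to generate the answer.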

Since we now keep the sensitive data separate from the model, we should be able to limit access based on who is asking for the data. If I ask our HR chatbot about my benefits it should retrieve data related only to me, but if my manager asks about the employees in his or her team it should retrieve data about all employees in that team.

When using embeddings it is very common to store them in a vector database and use it to search for data related to the question asked by the user. For our HR example to work we need to be able to limit the data returned based on which user is asking, and we might even need to mask some of the data before using it.

Most vector databases do not, today, have any security and data governance functionality beyond controlling which users have access and whether they can read, update or delete data. A common suggestion is to handle this in the application layer, but that brings us back to the same challenges as putting an application layer in front of a model trained on sensitive data.

Another way is of course to use Snowflake.

How Snowflake enables you to use sensitive data with LLMs

As mentioned earlier in this post, Snowflake enables you to use LLMs within your Snowflake account, and you can do that in combination with Snowflake’s built-in security and data governance features.

For our HR example this means we can apply row access policies on the data, so that users only get the rows they are allowed to see, and combine them with dynamic data masking policies on columns with sensitive data, so that even if a user is allowed to get a row, he or she might not be allowed to see all or parts of some of the column values.
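A minimal sketch of what that could look like, run through the Snowpark session created earlier. The table, column and policy names are hypothetical, and a real row access policy would also map managers to the members of their team, for example via a mapping table.

    # Hypothetical policies for an HR table, executed through a Snowpark session.
    # Row access policy: employees see their own row, the HR_ADMIN role sees all.
    session.sql("""
        CREATE OR REPLACE ROW ACCESS POLICY hr_employee_policy
          AS (employee_user_name STRING) RETURNS BOOLEAN ->
            CURRENT_ROLE() = 'HR_ADMIN'
            OR employee_user_name = CURRENT_USER()
    """).collect()

    session.sql("""
        ALTER TABLE employee_benefits
          ADD ROW ACCESS POLICY hr_employee_policy ON (employee_user_name)
    """).collect()

    # Masking policy: only the HR_ADMIN role sees the actual salary values.
    session.sql("""
        CREATE OR REPLACE MASKING POLICY salary_mask
          AS (val NUMBER) RETURNS NUMBER ->
            CASE WHEN CURRENT_ROLE() = 'HR_ADMIN' THEN val ELSE NULL END
    """).collect()

    session.sql("""
        ALTER TABLE employee_benefits
          MODIFY COLUMN salary SET MASKING POLICY salary_mask
    """).collect()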

Since Snowflake requires you to log in as a user in order to access data, those policies are always applied, no matter how you access Snowflake.

This means that we can store the embeddings, using the vector data type that is in Private Preview, in the same table as the actual data and thereby leverage the built-in security and governance, or store them in separate tables where we can apply the same policies or create new ones.
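Continuing the sketch above, the retrieval step could then be a single query against a table that holds both the text and its embedding in a vector column. This assumes the vector data type, the VECTOR_COSINE_SIMILARITY function and the Cortex EMBED_TEXT_768 function are available to your account; the table and column names are hypothetical.

    # Hypothetical retrieval query against a table with the text chunks and their
    # embeddings stored in a VECTOR column. The row access and masking policies on
    # the table are applied automatically to whatever the query returns.
    question = "What benefits do I have?"
    context_rows = session.sql("""
        SELECT chunk_text
        FROM hr_documents
        ORDER BY VECTOR_COSINE_SIMILARITY(
                   chunk_embedding,
                   SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', ?)) DESC
        LIMIT 3
    """, params=[question]).collect()

    # The returned chunks only ever contain rows the current user may see,
    # and are then passed to the LLM as context for answering the question.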

So when our HR chatbot retrieves data related to the question a user is asking, the data returned will always be data that the user is allowed to use and see.

Another benefit of using Snowflake for handling the data is that even if our chatbot were using an LLM running outside Snowflake, the data returned would still be limited per user.

In a future post I will provide an end-to-end example of how to set this up using Snowflake, so stay tuned.
