Data Science at Microsoft

Lessons learned in the practice of data science at Microsoft.

Using differential privacy to understand usage patterns in AI applications

--

by Dhruv Joshi, Chris Parnin, and Bhavish Pahwa

Image created with Microsoft Designer.

With the introduction of Generative AI tools, the way users interact with AI systems has fundamentally changed: from searching to conversing, and from simple actions to complex workflows. Large Language Models (LLMs) and their interfaces are evolving rapidly, and learning from user interactions with LLMs is critical to building the right products to solve user needs. For example, by knowing that users frequently request help with math problems, the product team behind a Generative AI product can prioritize investments to better support that need.

Privacy remains a critical consideration. In some classes of applications, user actions may contain sensitive, confidential, or identifying information; users might enter details related to healthcare, finance, or legal matters, and AI tools are poised to be especially useful for exactly these types of applications. Running traditional natural language pattern extraction techniques on such datasets risks violating user privacy, yet the value of learning from user actions and understanding behavior remains too important to abandon. As previously reported by Microsoft [1], differential privacy (DP) can unlock this value in a privacy-preserving and responsible manner.

In this article, we demonstrate how the versatile Differentially Private n-gram Extraction (DPNE) [2] approach previously published by Microsoft Research can be applied to this problem with the same privacy standards, using an example on a public dataset.

User intent analysis

We demonstrate this technique on the problem of analyzing user intent (i.e., the task the user would like to perform) in conversations with AI. To do this, we use WildChat [3], a public corpus of one million real-world user conversations with ChatGPT.

Data preparation

We extracted a random sample of 50,000 English-language messages from the WildChat-1M dataset (using random seed 42 while sampling). Each selected message is assigned a unique identifier within the sampled dataset.
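As a sketch of this step (using a toy stand-in list rather than the actual WildChat-1M records, which could be loaded via the Hugging Face datasets library; the helper name is an assumption), the language filtering, seeded sampling, and ID assignment might look like:

```python
import random

def sample_messages(messages, k, seed=42):
    """Filter to English and draw a reproducible random sample.

    `messages` stands in for records loaded from WildChat-1M; each
    sampled message is assigned a unique identifier within the sample.
    """
    english = [m for m in messages if m["language"] == "English"]
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    sample = rng.sample(english, min(k, len(english)))
    return [{"id": i, "text": m["text"]} for i, m in enumerate(sample)]

# Toy stand-in for the real corpus:
corpus = [
    {"text": "Solve this math problem", "language": "English"},
    {"text": "Bonjour, écris un poème", "language": "French"},
    {"text": "Write a poem about rain", "language": "English"},
]
print(sample_messages(corpus, 2))
```

Fixing the seed means the same 50,000-message sample can be regenerated exactly for later comparison runs.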

Intent rephrasing and extraction

The key step is to use an LLM to extract the information we want to learn (in this case, the user intent) by instructing it to rephrase and normalize the private conversation data. This normalization step also vastly increases the subsequent yield of DP n-grams. We used GPT-4o to rephrase the first turn of each sampled conversation (see the appendix for the instruction prompt). Here are a few examples of what that looks like:
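As an illustration of how the appendix prompt could be applied to each conversation's first turn (the helper name is an assumption, and the message shape follows the common chat-completion format rather than any specific client library), assembling the rephrasing request might look like:

```python
def build_rephrase_request(system_prompt, first_turn, recent_rephrases):
    """Fill the {recent_rephrases} slot in the rephrasing prompt and pair
    it with the user's first turn, in the usual chat-message format."""
    filled = system_prompt.replace("{recent_rephrases}", "\n".join(recent_rephrases))
    return [
        {"role": "system", "content": filled},
        {"role": "user", "content": first_turn},
    ]

# Abbreviated stand-in for the full appendix prompt:
prompt = ("Rewrite user task descriptions into a general task description.\n"
          "Example Rephrased Descriptions:\n{recent_rephrases}")
msgs = build_rephrase_request(prompt, "Can you fix my Python bug?",
                              ["Debug code", "Write a poem"])
print(msgs[0]["content"])
```

The resulting message list would then be sent to GPT-4o, with only the short rephrased description retained for the next stage.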

Running DPNE to extract top intents

The next step is to run DPNE on these rephrased intents. We used the implementation open-sourced by Microsoft in 2021 [4], setting the DP hyperparameter ε to 3. The table below shows DPNE n-grams (with n ≥ 4), clustered into the themes that emerged. The intent and task patterns learned are at a fairly deep level of granularity, providing valuable insights into usage while inheriting the privacy assurances that differential privacy provides. (Code to generate these results is available in the same GitHub repository.)
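The open-source DPNE implementation handles the full algorithm (iterative extraction from shorter to longer n-grams, per-user contribution bounding, and budget allocation across levels). As a much-simplified sketch of the core idea only, with made-up function names and a toy corpus, one round of noisy counting and thresholded release might look like:

```python
import random
from collections import Counter

def dp_release_ngrams(docs, n, epsilon, threshold, seed=0):
    """Much-simplified sketch of DP n-gram release (NOT full DPNE):
    count each n-gram at most once per document (sensitivity 1), add
    Laplace noise with scale 1/epsilon, and release only n-grams whose
    noisy count clears a threshold."""
    counts = Counter()
    for doc in docs:
        toks = doc.lower().split()
        # A set bounds each document's contribution to 1 per n-gram.
        counts.update({tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)})
    rng = random.Random(seed)
    scale = 1.0 / epsilon
    released = {}
    for gram, c in counts.items():
        # Laplace noise as the difference of two exponential draws.
        noisy = c + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        if noisy >= threshold:
            released[" ".join(gram)] = round(noisy, 1)
    return released

intents = ["summarize a long document"] * 50 + ["translate this rare phrase"] * 2
top = dp_release_ngrams(intents, n=4, epsilon=3.0, threshold=10)
```

Common intents survive the noise and threshold; rare (and therefore potentially identifying) phrases do not, which is the intuition behind DP-protected pattern mining.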

*Themes were created by using agglomerative clustering to cluster DP patterns with n-gram length above 4, with GPT-4o assigning a theme to each cluster.

Comparison with traditional techniques

To compare and contrast with more traditional techniques, we ran the same sampled dataset through clustering and LLM-based intent detection from those clusters using the following steps:

  • Start by applying agglomerative clustering to embeddings from the Sentence Transformer model all-MiniLM-L6-v2. This yielded 26,891 clusters, with a minimum cluster size of 1 and a maximum of 1,345.
  • From each cluster, randomly sample up to five messages (to manage LLM context length, cost, and so on) and prompt GPT-4o to extract the theme using the same prompt template used above.
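The per-cluster sampling step above can be sketched as follows (the helper name is an assumption; the cluster labels would come from scikit-learn's AgglomerativeClustering run over the sentence embeddings):

```python
import random
from collections import defaultdict

def sample_per_cluster(labels, messages, k=5, seed=42):
    """Group messages by cluster label and draw up to k from each,
    bounding the context sent to the LLM for theme extraction."""
    clusters = defaultdict(list)
    for label, msg in zip(labels, messages):
        clusters[label].append(msg)
    rng = random.Random(seed)
    return {label: rng.sample(msgs, min(k, len(msgs)))
            for label, msgs in clusters.items()}

# Toy example: six messages in cluster 0, one in cluster 1.
labels = [0, 0, 0, 0, 0, 0, 1]
messages = [f"msg{i}" for i in range(7)]
samples = sample_per_cluster(labels, messages)
```

Capping at five messages per cluster keeps the GPT-4o prompt short regardless of how skewed the cluster sizes are (here, up to 1,345 messages).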

We display a table of intent versus frequency for the top 10 topics extracted. (Here, "frequency" is the number of occurrences of the topic in the sampled messages, which is different from coverage of the full dataset.)

From this it's clear that the fidelity of these extracted intents is vastly different: the DP approach yields higher-granularity intents and potentially greater coverage of the dataset. Additionally, the results above have no DP safeguards, so the extracted themes run an increased risk of leaking private or sensitive information.

Conclusion

Differential privacy is a powerful tool to mine usage insights from private datasets at a fidelity that could otherwise be far too risky to derive using traditional techniques. This versatile technique can easily be replicated and customized to various applications using Generative AI.

When implementing this technique, one must be mindful of tracking the privacy budget being spent: running multiple DP processes on the same dataset consumes budget additively, weakening the overall privacy guarantee. When implementing at scale, we recommend a robust governance mechanism as well as training and education resources for data scientists to prevent unintentional overuse of the privacy budget. The effectiveness of the algorithm at protecting privacy can be validated using an auditing approach based on membership inference, such as previously reported methods [5]. As an additional precaution, we recommend limiting access to the resultant data to only the in-house product teams who need it as part of their day-to-day work.
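As a minimal sketch of such a governance mechanism (assuming basic sequential composition, in which each DP release adds its ε to the total spent; the class and method names are illustrative), a budget ledger might look like:

```python
class PrivacyBudget:
    """Minimal ledger tracking cumulative epsilon under basic
    sequential composition: each DP release adds its epsilon."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0
        self.log = []  # audit trail of (purpose, epsilon) pairs

    def charge(self, epsilon, purpose=""):
        """Record a DP release; refuse it if it would exceed the budget."""
        if self.spent + epsilon > self.total:
            raise RuntimeError(f"Privacy budget exceeded for: {purpose!r}")
        self.spent += epsilon
        self.log.append((purpose, epsilon))
        return self.total - self.spent  # remaining budget

budget = PrivacyBudget(total_epsilon=6.0)
remaining = budget.charge(3.0, "DPNE intent extraction")
```

In practice, a shared ledger like this (backed by durable storage and access controls) lets a governance team see exactly which analyses consumed budget and refuse runs that would exceed the agreed total.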

Additionally, the LLM prompt for the initial rephrasing and intent extraction step may need some tweaking for each specific use case.

Appendix

Rephrasing prompt

You are a helpful assistant that rewrites user task descriptions given to an AI Assistant into a general task description that can be reused in an application menu item or analysis of user behaviors.
- If asked about current information, remember you do not need to answer the question, just attempt to rephrase it.
- Avoid references to specific columns, person names, or entities. Instead, generalize or abstract away (e.g., "Write a bio for a famous person" instead of "Satya Nadella")
- Try to keep rephrased task description to under ten words.
- Focus on the primary part of task.
- Always summarize the description into English.
- Users can occasionally provide followup responses or give feedback as tasks, such as "Thank you or That didn't work" in that case, summarize the followup request, don't answer it (e.g., don't say "You're welcome", response with "Express gratitude")
- Avoid overly formal phrasing, these should be accessible to everyday users.
Example Rephrased Descriptions:
{recent_rephrases}
Finally,
- Reuse wording from Example Rephrased Descriptions, when possible.
- Only return the rephrased description.

Theme extraction prompt

Given the following clustered list of ngrams, try to create a common theme for the whole cluster. Keep the themes to 1–2 words at max. If possible, try to extract the theme from the cluster of ngram sentences itself.
`{row_cluster_sentences}`
If the cluster is just made up of ngram sentences that contain just prepositions/connector words and has no Verb, Noun, or Adjective, then the theme is General.
Return only the final theme as a string. Provide no explanation for the themes you provided and add no prefix or suffix text to the result.

References

  1. Putting differential privacy into practice to use data responsibly | AI for Business (microsoft.com)
  2. Differentially Private n-gram Extraction (DPNE): https://arxiv.org/abs/2108.02831
  3. WildChat: 1M ChatGPT Interaction Logs in the Wild: https://arxiv.org/abs/2405.01470
  4. GitHub — microsoft/differentially-private-ngram-extraction: Implementation of Differentially Private n-gram Extraction (DPNE) paper
  5. https://arxiv.org/abs/2307.05608
