Balancing Data Privacy and Machine Learning Performance: Strategies and Challenges

Understanding the challenges and solutions of training AI models with sensitive data without compromising performance or privacy

Thomas Wood
Fast Data Science
2 min read · Oct 13, 2023


Does protecting sensitive information compromise machine learning models?

Fast Data Science, a consultancy specialising in natural language processing, explores this challenging trade-off.

Students of machine learning, whether at university or through online courses, commonly work with publicly available datasets such as the Titanic dataset, Fisher’s Iris flower dataset, or the Labelled Faces in the Wild dataset. In a commercial setting, however, machine learning models are often trained on sensitive or private data, which presents a number of challenges for data scientists.
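All three teaching datasets are a download away. Here is a minimal sketch, assuming scikit-learn and seaborn are installed (the library choices are ours, not the article’s):

```python
from sklearn.datasets import load_iris, fetch_lfw_people
import seaborn as sns

iris = load_iris(as_frame=True)        # Fisher's Iris flower dataset
titanic = sns.load_dataset("titanic")  # Titanic passenger dataset
faces = fetch_lfw_people(min_faces_per_person=20)  # Labelled Faces in the Wild
                                                   # (downloads ~200 MB on first use)

print(iris.frame.shape, titanic.shape, faces.images.shape)
```

No such convenience exists for sensitive commercial data, which is exactly where the difficulties below begin.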

Recent models developed by Fast Data Science, such as an NLP model for email triage, have raised questions about data protection legislation, user consent for data storage, the GDPR’s right to be forgotten, and the risk of the model reproducing sensitive data.

An example of an AI email triage system

Research from institutions such as Cornell University has also shown that machine learning models can inadvertently memorise sensitive parts of their training data, allowing anyone with access to the trained model to potentially reconstruct portions of that data.
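To make the memorisation risk concrete, here is a toy sketch of our own (not taken from the cited research): a character-level Markov chain trained on text containing a fabricated card number will regurgitate that number verbatim when prompted with a likely prefix.

```python
from collections import defaultdict

training_text = (
    "Dear support team, my card number is 4111-1111-1111-1111 and "
    "I would like a refund. "
) * 5  # the "sensitive" string appears repeatedly in the corpus

ORDER = 8  # context length in characters
transitions = defaultdict(list)
for i in range(len(training_text) - ORDER):
    context = training_text[i : i + ORDER]
    transitions[context].append(training_text[i + ORDER])

# An attacker with access to the trained model supplies a plausible prefix...
prompt = "card number is 4"
context = prompt[-ORDER:]
output = prompt
for _ in range(25):
    candidates = transitions.get(context)
    if not candidates:
        break
    output += candidates[0]  # deterministic choice, for the demo
    context = output[-ORDER:]

print(output)  # ...and recovers the memorised card number verbatim
```

Large neural models are far more complex than this toy, but the same failure mode has been demonstrated against them.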

Given these challenges and risks, there are several strategies that data scientists can pursue:

  1. Delete the dataset. Once the model has been trained, the entire dataset, including annotations, can be deleted. This guarantees the data cannot leak later, but restarting the project would mean annotating a new dataset from scratch.
  2. Anonymise (mask) the data. Sensitive information can be removed or masked before the data is used for machine learning (a minimal masking sketch follows this list). The risk is failing to catch every piece of sensitive information, and an anonymised dataset may also yield a less accurate model.
  3. Store only IDs. Data can be annotated and then deleted, storing only a hash or ID of the original record, so the training data can be rebuilt from the source system without the machine learning system ever storing it (see the hashing sketch below). However, a hacker could still reconstruct hashed values by cross-checking your hashed database against another database.
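For option 2, here is a minimal masking sketch of our own: redact email addresses and phone-like numbers with regexes before the text is used for training. Real projects typically add named-entity recognition (for example with spaCy), because regexes alone will miss names, addresses and other identifiers.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def mask(text: str) -> str:
    """Replace emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(mask("Contact Jane on jane.doe@example.com or +44 20 7946 0958."))
# -> "Contact Jane on [EMAIL] or [PHONE]."
# Note that the name "Jane" slips through: exactly the residual risk above.
```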
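And for option 3, a sketch of the store-only-IDs approach (the design details here are our assumptions, not the article’s): keep a salted hash of each record instead of the raw text, so training data can be re-fetched from the source system by ID when needed. An unsalted hash of low-entropy data such as names or emails could be reversed by cross-checking another database, so a secret salt (here via an HMAC key) is essential.

```python
import hashlib
import hmac

SECRET_SALT = b"keep-this-out-of-the-ml-system"  # hypothetical secret key

def record_id(raw_text: str) -> str:
    """Return a salted, irreversible ID for a sensitive record."""
    return hmac.new(SECRET_SALT, raw_text.encode("utf-8"), hashlib.sha256).hexdigest()

# Only hashes and labels are stored in the ML system; the raw emails stay in
# the source system and can be re-joined on the same ID to rebuild the data.
annotations = {
    record_id("Dear support, my order never arrived..."): "complaint",
    record_id("Please unsubscribe me from all emails."): "unsubscribe",
}
```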

In the end, training machine learning models on sensitive data is a balancing act between building the most accurate model possible and giving that data the protection it is due.

To learn more about working with sensitive data in Machine Learning, visit Fast Data Science’s page.
