Maintaining data confidentiality in Trusted Research Environments

Why Statistical Disclosure Control is a vital part of operating a Trusted Research Environment

Richard Welpton
The Health Foundation Data Analytics
Nov 19, 2020

Image via Hurca by Pixabay

For a number of years, data providers have allowed researchers to download anonymised sources of individual-level data. For example, the UK Data Service gives researchers access to anonymised versions of data including the Labour Force Survey, which they can download and use under an End User Licence or Special Licence. But the limited detail of these data makes them unsuitable for some types of research.

Now, more Trusted Research Environments (TREs) are being built to allow researchers to analyse sensitive sources of data in a secure setting. Data cannot be downloaded; instead, analysis takes place inside the TRE. TREs are technically advanced computing environments with built-in security features, and many are accredited to national and/or international information security management standards. As a result, their creation has focused on robust IT security processes and procedures that closely monitor who has access to the data and the projects they undertake.

But to concentrate solely on IT security ignores the question of their purpose: specifically, what is it that researchers want to release from these environments? The general principle of a TRE is to allow researchers to have statistical results of their analyses released to them, which can include syntax used to create these results. The release of such outputs is a critical component of a TRE’s operation.

The Five Safes framework, established by Professor Felix Ritchie while he worked at the Office for National Statistics in 2006, helps us understand how best to manage TREs. The key elements to consider are:

  • Safe People — can we allow this person to access the data?
  • Safe Project — is the purpose of the project a ‘good’ one, in the sense that the motivation behind the project is to serve the ‘public good’?
  • Safe Setting — do we have the appropriate technical security controls in place?
  • Safe Data — what are the characteristics of the data that imply they should only be accessed in a TRE?
  • Safe Outputs — what is going to be released and published from the TRE?

As a data steward at the Health Foundation, with experience of managing all aspects of the Five Safes to establish and operate TREs here and in other organisations, I believe that the Safe Outputs principle is just as critical as the other four components of the framework.

Why Safe Outputs matter

Screening researchers and their project proposals, and creating a secure environment for them to access sensitive data, makes no difference if the released outputs can then be used to re-identify data subjects and/or reveal confidential information. The consequences are the same as for distributing sensitive data without proper controls:

  • data protection laws might be broken
  • individuals might be harmed or distressed as information about them is revealed or exploited by data intruders
  • people might stop using critical public services, such as their GP surgery, because they worry about how their data will be kept safe.

For these reasons, it is critical to build a strong and robust process that ensures outputs leaving a TRE do not go on to breach data confidentiality. The tool and process to achieve this is Statistical Disclosure Control (SDC).

What is SDC?

SDC is the process of ensuring that an individual or organisation cannot be identified from a set of data or a set of statistics. It also ensures that confidential information isn’t released.

This is how anonymised versions of data, including the Labour Force Survey, are produced. SDC is applied to the raw data to make sure that it is almost impossible for somebody to be identified and associated with confidential information. It is necessary to take this action because the data cannot be controlled once they are downloaded.

But this process is applied only minimally to data that will be accessed within a TRE, where the extra detail about individuals is needed for research. Names, addresses and IDs are usually removed, but the remaining level of detail is what researchers need for their analyses. Even this could present a risk to confidentiality, which is why the data can only be accessed in a TRE and can’t be downloaded, and why only statistical results are released instead.

When creating statistical results, researchers could, without realising, produce outputs that still contain enough detail to identify someone. Once such an output is released from the safe setting of the TRE, a data intruder could use it to work out who somebody is and perhaps blackmail them, which recently happened after a data breach at a psychotherapy clinic in Finland. So, SDC needs to be applied to these statistical outputs before they are released to ensure that individuals are not identifiable and no confidential information is released.
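
To make this concrete, here is a minimal Python sketch of one check an output checker might run on a frequency table: hiding any cell with a small count. The threshold of 10, the function name and the example counts are all assumptions made for illustration, not rules taken from any particular TRE. Real output checking covers more than this, for example whether a hidden value could be worked out from a published total.

```python
from typing import Dict, Optional

# Hypothetical minimum cell count; each TRE sets its own rule.
MIN_CELL_COUNT = 10


def suppress_small_cells(
    table: Dict[str, int], threshold: int = MIN_CELL_COUNT
) -> Dict[str, Optional[int]]:
    """Return a copy of the table with counts below the threshold set to None."""
    return {
        category: (count if count >= threshold else None)
        for category, count in table.items()
    }


# Illustrative counts only (not real data): patients by age band.
counts = {"18-39": 120, "40-64": 85, "65+": 3}
safe_counts = suppress_small_cells(counts)
# The "65+" cell is below the threshold, so it is suppressed (None).
```

In this sketch the two larger cells pass through unchanged and the small cell is withheld, which is the sort of adjustment a researcher and output checker might agree on before release.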

Where to turn to for help

While a lot has been written about SDC for anonymising data and making them safe, less guidance is available about applying this process to statistical outputs generated by researchers in a TRE. If I were coming to this topic from scratch, I would want to know:

  • What is SDC?
  • Why does it matter?
  • How do I apply it to statistical results?
  • How do I assess types of statistics I’m not familiar with?
  • How can I put a system in place that means I don’t spend all my time doing this?
  • How can I work with researchers to apply SDC efficiently?

Fortunately, we have some great resources that can help operational staff responsible for releasing outputs. My recommended reading list includes:

Further developments — training staff

Simon Parker (Data Liaison Manager at Cancer Research UK) and I believe that, as well as making reading material available to staff with SDC responsibilities at TREs, specific training should be provided, along with a way of assessing staff understanding of concepts and techniques.

Therefore, we have developed a training course which we will deliver to NHS Digital staff working in their TRE and at the University of Manchester. The course will cover:

  • what statistical disclosure is
  • how to assess outputs for potential disclosure of confidential information
  • how to work with researchers to ensure only safe statistical outputs are released from the TRE
  • how to implement systems for efficiently managing requests from researchers to release their statistical results safely.

After a bit of finessing and testing, we’ll be making the course contents available through the Safe Data Access Professionals working group.

Professor Felix Ritchie is developing similar training, which he has delivered to staff at NHS Digital, UK Data Service and Office for National Statistics.

Recommendations

In this blog, I’ve provided an introduction to the world of TREs and focused on the need for thorough knowledge of SDC. If you’re working in a TRE, or responsible for setting one up, here are my key recommendations:

  1. work through the reading list above
  2. speak to other organisations about how they have implemented SDC processes for their TREs
  3. make sure staff with responsibility for SDC are knowledgeable and well-trained
  4. put in a system that incentivises both researchers and staff to work efficiently — this will enable operations to be scaled up without requiring further resources.

Of course, it’s very important to have high levels of IT security to protect data. However, none of it will work if outputs can be used to re-identify individuals and/or they contain confidential data. That’s why SDC is vital.

With increasing interest and demand for data that can be used in research to benefit the public, TREs need to look with renewed vigour at their approach to SDC, as they will only operate safely if they ensure the data remain confidential.

Acknowledgments

I would like to thank Dr Hannah Knight, Senior Analytics Manager at The Health Foundation, and Christine Garrington for their help and support in developing this article.

Richard Welpton
Head of Data Services Infrastructure, Economic and Social Research Council. Access to data for research, data confidentiality. Runner. @rwelpton