Large-scale anonymization of medical imaging data in the DICOM format

Mateusz Grzyb
Published in ResponsibleML
4 min read · Sep 7, 2022

This blog is the sixth in our xLungs series about the Responsible Artificial Intelligence for Lung Diseases project we are working on at MI2DataLab. You can check out our previous blogs here.

In the first blog, we wrote about the acquisition process of a large number of computed tomography scans (CT for short). The images result from a screening study conducted by the Polish Association on Lung Cancer (Polska Grupa Raka Płuca in Polish).

We are relieved that this time-consuming undertaking of manually copying data from CDs has been completed, resulting in a database comprising examinations of nearly 45,000 patients. Since its data format follows the Digital Imaging and Communications in Medicine standard (DICOM for short, more on which in a moment), this translates to roughly 25,000,000 files. This is because, under the standard, individual CT slices are stored as separate files. In addition, each file contains not only a bitmap but also comprehensive metadata.

Screenshot of a few DICOM attributes extracted using RadiAnt, a DICOM viewer software. https://www.radiantviewer.com/dicom-viewer-manual/dicom_tags.html

The mentioned DICOM standard is the most widely used specification for storing and transmitting digital medical imaging data worldwide. It was developed jointly by the American College of Radiology and the National Electrical Manufacturers Association, and work on it dates back to the 1980s. Files conforming to its rules can be viewed as ordinary dictionaries (in the sense of an abstract data type), where the keys are group-element pairs in numeric form and the values are the stored data. There are hundreds of predefined attributes, describing aspects such as the place and circumstances of the study, the equipment used and its configuration, the parameters of the obtained image and, of course, the patient data itself. In addition, for each attribute the standard specifies its requirement type (e.g. required, required but possibly empty, or optional), the type of data stored (almost 30 types called Value Representations are defined), and its multiplicity. This wealth of potentially carried information and the need to remain compliant with the extensive standard make robust anonymization a challenging task.
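To make the dictionary analogy concrete, below is a minimal sketch of reading such attributes with pydicom, a popular open-source DICOM library (the library and the file name are illustrative choices, not necessarily the tooling used in the project):

    from pydicom import dcmread

    # Read a single CT slice; "slice_0001.dcm" is a placeholder file name.
    ds = dcmread("slice_0001.dcm")

    # Attributes can be accessed by keyword...
    print(ds.PatientName, ds.Modality)

    # ...or by their numeric (group, element) tag.
    print(ds[0x0010, 0x0010].value)  # Patient's Name
    print(ds[0x0008, 0x0060].value)  # Modality

    # Each data element also carries its Value Representation (VR).
    print(ds[0x0010, 0x0010].VR)     # 'PN' (Person Name)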

But why do we anonymize the acquired data at all?

Our server without which the described anonymization would not be possible — Akira.

One of the goals of our project is to make the collected data available to a broader audience to support the research and development of AI solutions in the medical imaging and diagnostics field. However, we must not forget that personal health information is sensitive data, the use and sharing of which can have wide-ranging consequences and are governed by numerous legal acts. Behind each study is a real person with a fundamental right to privacy, which we believe is more than enough to justify every effort to preserve it. And if this argument is not convincing enough, history shows that penalties for causing a privacy breach through negligence can be severe.

All of the above creates the need for anonymization procedures that, on the one hand, do not distort the data in ways that could hinder its beneficial use and, on the other hand, safeguard the interests of the patients and ourselves. The process we have developed, which we believe addresses this need, is characterized by the following features.

  • The first stage takes place at the hospital facility, on a computer without physical access to the internet. Blacklisting secures the most sensitive information and allows the data to be moved off-site (see the code sketch after this list).
  • The second stage takes place on our secure server with strictly managed access. With whitelisting, only explicitly approved attributes end up in the final files, and any attributes not yet covered by predefined rules are manually re-verified.
  • So-called Private Attributes (i.e., those not included in the official specification) are always removed due to the high uncertainty about the data they contain.
  • Some attributes undergo additional transformations. For example, Unique Identifiers are re-mapped using hash functions to prevent tracing files back to their origin while still preserving essential relationships (e.g. between slices of the same study).
  • Processed data is not transferred further before undergoing independent verification and quality control by another team member. At this stage, among other things, the unique values of all attributes are reviewed.
  • The course of the entire process is documented in the form of extensive logs, making it easier and faster to identify problems should they occur.
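To give a flavour of how the blacklisting, whitelisting, Private Attribute removal and identifier hashing steps fit together, here is a minimal sketch using pydicom. It is not our actual implementation: the tag sets, the salt and the file paths are illustrative placeholders, the real lists are far longer, and attributes not yet covered by rules are re-verified manually rather than silently dropped.

    import hashlib

    from pydicom import dcmread
    from pydicom.uid import UID

    # Illustrative, heavily shortened tag sets.
    BLACKLIST = {(0x0010, 0x0010), (0x0010, 0x0030)}  # Patient's Name, Patient's Birth Date
    WHITELIST = {
        (0x0008, 0x0016),  # SOP Class UID
        (0x0008, 0x0018),  # SOP Instance UID
        (0x0008, 0x0060),  # Modality
        (0x0020, 0x000D),  # Study Instance UID
        (0x0020, 0x000E),  # Series Instance UID
        (0x0028, 0x0030),  # Pixel Spacing
        (0x7FE0, 0x0010),  # Pixel Data
    }

    SALT = b"project-secret"  # hypothetical secret making the hashes hard to reverse


    def hash_uid(uid: str) -> str:
        """Map a UID to a new one deterministically, so that files from the same
        study or series still reference the same (new) identifiers."""
        digest = hashlib.sha256(SALT + uid.encode()).hexdigest()
        return "2.25." + str(int(digest[:32], 16))  # syntactically valid UID


    def anonymize(path_in: str, path_out: str) -> None:
        ds = dcmread(path_in)

        # Stage 1 (on-site): blacklisting of the most sensitive attributes.
        for tag in BLACKLIST:
            if tag in ds:
                del ds[tag]

        # Private Attributes are always removed.
        ds.remove_private_tags()

        # Stage 2 (on the server): whitelisting -- keep only approved attributes.
        for elem in list(ds):
            if (elem.tag.group, elem.tag.element) not in WHITELIST:
                del ds[elem.tag]

        # Unique Identifiers are re-mapped with a hash, preserving relationships.
        for keyword in ("StudyInstanceUID", "SeriesInstanceUID", "SOPInstanceUID"):
            if keyword in ds:
                setattr(ds, keyword, UID(hash_uid(getattr(ds, keyword))))

        ds.save_as(path_out)

Because hash_uid is deterministic, every file that referenced a given Study or Series Instance UID before anonymization still points at the same (new) identifier afterwards, which is what keeps studies and series intact.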

At this point, we are still in the process of anonymizing batches of data using the procedure described. The core of the process has been tested and works well. However, we are still making improvements — the latest is the dynamic use of limited space on SSDs to accelerate read and write operations significantly. We hope the techniques developed will allow us to complete the work quickly and effectively prevent any sensitive data leakage in the future.

If you are interested in other posts about explainable, fair, and responsible ML, follow #ResponsibleML on Medium.
