DICOM de-identification at scale in Visual NLP — Part 3.

Mykola Melnyk
John Snow Labs
Sep 29, 2023

This post shows how to use Visual NLP to manipulate pixel and overlay data within DICOM images.

In the following examples, we will work with two transformers: DicomToImageV3, which extracts frame images, and DicomDrawRegions, which draws rectangular regions on the frames and is useful for building de-identification pipelines.


DicomToImageV3 extracts images from the pixel and overlay data into a Spark DataFrame as an Image structure.

It supports the following Photometric Interpretations:

  • MONOCHROME2: This Photometric Interpretation represents monochrome images, which are grayscale images with varying shades of gray. It is often used for medical images like X-rays and grayscale photographs.
  • RGB: RGB stands for Red, Green, Blue. This Photometric Interpretation is used for full-color images, where each pixel is represented by three color channels: red, green, and blue. By combining these three channels in varying intensities, it creates a wide range of colors, making it suitable for standard color images.
  • YBR: YBR stands for YCbCr (Luminance, Chrominance Blue, Chrominance Red). It is a color space used to represent color images in a way that separates the luminance (brightness) information from the chrominance (color) information. It’s often used in medical imaging and JPEG compression.
  • YBR FULL: This is an extension of the YBR color space, providing full color information. It still separates luminance and chrominance but includes all color information needed for accurate color representation.
  • YBR FULL 422: This is a variation of YBR FULL that uses 4:2:2 chroma subsampling. It reduces the amount of chrominance data while preserving good color quality, making it useful for compression without significant loss of image quality.
  • PALETTE COLOR: This Photometric Interpretation uses a color palette to represent images. Instead of storing individual color values for each pixel, it indexes a color palette to represent the colors in the image. It’s an efficient way to store and transmit color images with a limited color set, such as in GIF images.
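
To make the luminance/chrominance split concrete, here is a minimal pure-Python sketch that converts a single 8-bit pixel between RGB and full-range YCbCr. It assumes the common ITU-R BT.601 full-range coefficients, which is how we understand YBR FULL to be defined. The function names are ours and are not part of Visual NLP, which performs this conversion internally.

```python
def _clamp(value):
    """Round to the nearest integer and clamp to the 8-bit range 0..255."""
    return max(0, min(255, int(round(value))))


def rgb_to_ybr_full(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr (BT.601 coefficients)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128
    cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 128
    return _clamp(y), _clamp(cb), _clamp(cr)


def ybr_full_to_rgb(y, cb, cr):
    """Convert one full-range YCbCr pixel back to 8-bit RGB."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return _clamp(r), _clamp(g), _clamp(b)
```

A neutral gray pixel such as (128, 128, 128) keeps its brightness in Y and has both chroma components at the midpoint 128, which is why grayscale content costs almost nothing in the chroma channels. Because of rounding, a round trip through both functions can differ from the original by a level or two.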

Let’s extract frames from one of the test DICOM files:

dicom_to_image = DicomToImageV3() \
    .setInputCols(["content"]) \
    .setOutputCol("image")

result = dicom_to_image.transform(dicom_df)

+--------------------+---------+-------+--------------------+-------------------+------+
|               image|exception|pagenum|                path|   modificationTime|length|
+--------------------+---------+-------+--------------------+-------------------+------+
|{file:/Users/nmel...|         |      0|file:/Users/nmeln...|2023-08-20 14:17:23|426776|
|{file:/Users/nmel...|         |      1|file:/Users/nmeln...|2023-08-20 14:17:23|426776|
|{file:/Users/nmel...|         |      2|file:/Users/nmeln...|2023-08-20 14:17:23|426776|
|{file:/Users/nmel...|         |      3|file:/Users/nmeln...|2023-08-20 14:17:23|426776|
|{file:/Users/nmel...|         |      4|file:/Users/nmeln...|2023-08-20 14:17:23|426776|
|{file:/Users/nmel...|         |      5|file:/Users/nmeln...|2023-08-20 14:17:23|426776|
|{file:/Users/nmel...|         |      6|file:/Users/nmeln...|2023-08-20 14:17:23|426776|
|{file:/Users/nmel...|         |      7|file:/Users/nmeln...|2023-08-20 14:17:23|426776|
|{file:/Users/nmel...|         |      8|file:/Users/nmeln...|2023-08-20 14:17:23|426776|
|{file:/Users/nmel...|         |      9|file:/Users/nmeln...|2023-08-20 14:17:23|426776|
|{file:/Users/nmel...|         |     10|file:/Users/nmeln...|2023-08-20 14:17:23|426776|
+--------------------+---------+-------+--------------------+-------------------+------+

Each frame gets its own row with an image, and the pagenum column contains the frame number. Let’s display the frames as images using the display_images function:

display_images(result, limit=2)

DicomToImageV3 can handle files with 1000 frames or more, depending on available memory. When debugging a pipeline on files with many frames, it is useful to extract only a few of them. To do this, set the frameLimit parameter:

dicom_to_image = DicomToImageV3() \
    .setInputCols(["content"]) \
    .setOutputCol("image") \
    .setFrameLimit(1)

result = dicom_to_image.transform(dicom_df)

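+--------------------+---------+-------+--------------------+-------------------+------+
|               image|exception|pagenum|                path|   modificationTime|length|
+--------------------+---------+-------+--------------------+-------------------+------+
|{file:/Users/nmel...|         |      0|file:/Users/nmeln...|2023-08-20 14:17:23|426776|
+--------------------+---------+-------+--------------------+-------------------+------+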

To handle large files (2 GB and more), use path as the input column instead of content. This makes the transformer load the file directly from the file system instead of loading it into the DataFrame.

dicom_to_image = DicomToImageV3() \
    .setInputCols(["path"]) \
    .setOutputCol("image")
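
When a dataset mixes small and large files, one option is to pick the input column from the file size. The sketch below is only an illustration: the helper names and the exact 2 GiB cut-off are our assumptions (the threshold mirrors the 2 GB rule of thumb above), and none of this is part of the Visual NLP API.

```python
import os

# Assumed cut-off, mirroring the "2 GB and more" rule of thumb.
LARGE_FILE_THRESHOLD = 2 * 1024 ** 3


def choose_input_col(size_in_bytes, threshold=LARGE_FILE_THRESHOLD):
    """Return "path" for large files and "content" for everything else."""
    return "path" if size_in_bytes >= threshold else "content"


def choose_input_col_for_file(file_path):
    """Same decision, based on the size of a file on the local file system."""
    return choose_input_col(os.path.getsize(file_path))
```

In practice you would probably split the DataFrame on its length column and run each part through a transformer configured with the matching input column.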


DicomDrawRegions draws regions on the frames of a DICOM file. It updates both pixel and overlay data.

It supports the same Photometric Interpretations as DicomToImageV3.

Let’s build the simplest de-identification pipeline: detect text on the image and hide it. We can already extract frame images using DicomToImageV3. We need to set keepInput to True to be able to compare the results with the original images.

dicom_to_image = DicomToImageV3() \
    .setInputCols(["content"]) \
    .setOutputCol("image") \
    .setKeepInput(True)

Next we need to detect text. We can use ImageTextDetectorV2 here:

text_detector = ImageTextDetectorV2 \
    .pretrained("image_text_detector_v2", "en", "clinical/ocr") \
    .setInputCol("image") \
    .setOutputCol("regions") \
    .setScoreThreshold(0.5) \
    .setTextThreshold(0.2)

As the final step, we draw filled rectangles using DicomDrawRegions:

draw_regions = DicomDrawRegions() \
    .setInputCol("path") \
    .setInputRegionsCol("regions") \
    .setOutputCol("dicom_cleaned") \
    .setRotated(True)
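
Conceptually, hiding a region means overwriting every pixel inside its bounding box. The following pure-Python sketch shows the idea on a tiny grayscale frame stored as a list of rows. The Region tuple and the fill_regions function are hypothetical and only handle axis-aligned boxes; DicomDrawRegions itself works on real DICOM pixel and overlay data and, judging by its rotated option, presumably on rotated boxes as well.

```python
from collections import namedtuple

# Hypothetical axis-aligned box: top-left corner plus width and height, in pixels.
Region = namedtuple("Region", ["x", "y", "width", "height"])


def fill_regions(frame, regions, fill_value=0):
    """Return a copy of `frame` with every region painted with `fill_value`."""
    height = len(frame)
    width = len(frame[0]) if frame else 0
    cleaned = [row[:] for row in frame]  # leave the original frame untouched
    for region in regions:
        # Clip the box to the frame so out-of-range regions cannot raise.
        x0, y0 = max(0, region.x), max(0, region.y)
        x1 = min(width, region.x + region.width)
        y1 = min(height, region.y + region.height)
        for y in range(y0, y1):
            for x in range(x0, x1):
                cleaned[y][x] = fill_value
    return cleaned


frame = [[255] * 6 for _ in range(4)]  # 6x4 all-white frame
cleaned = fill_regions(frame, [Region(x=1, y=1, width=3, height=2)])
```

Working on a copy keeps the original frame available for a side-by-side comparison, which is the same reason the pipeline keeps its input column.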

To run this, we define a Spark ML Pipeline and call it:

pipeline = PipelineModel(stages=[dicom_to_image, text_detector, draw_regions])

result = pipeline.transform(dicom_df)

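+--------------------+----------+--------------------+--------------------+
|       dicom_cleaned| exception|                path|             content|
+--------------------+----------+--------------------+--------------------+
|[52 75 62 6F 20 4...|          |file:/Users/nmeln...|[52 75 62 6F 20 4...|
+--------------------+----------+--------------------+--------------------+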

Let’s display the original and cleaned DICOM files using the display_dicom function:

display_dicom(result, "content,dicom_cleaned", show_meta=False, limit_frame=2)

DicomDrawRegions also supports the following compressions:

  • RLELossless: A lossless compression method based on run-length encoding. It encodes consecutive runs of identical pixel values as a count followed by the pixel value itself, so the original image can be reconstructed exactly from the compressed data.
  • JPEGBaseline8Bit: A variant of JPEG (Joint Photographic Experts Group) compression that uses the baseline process with 8 bits per color channel, which gives a 24-bit color depth for RGB images. It is inherently lossy: it reduces file size by discarding some image data while aiming to maintain diagnostic image quality.
  • JPEGLSLossless: A lossless compression method based on the JPEG-LS standard, which is a separate standard from the original JPEG and its lossless mode. It uses predictive coding, context modeling, and entropy coding, which makes it suitable for medical imaging applications where preserving diagnostic image quality is paramount.
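
To illustrate why run-length encoding is lossless, here is a simplified sketch that stores each run as a [count, value] pair and then restores the original values. It is not the exact DICOM RLE byte layout, which as far as we know is a PackBits-style scheme applied to byte segments, and the function names are ours.

```python
def rle_encode(values):
    """Collapse consecutive equal values into [count, value] runs."""
    runs = []
    for value in values:
        if runs and runs[-1][1] == value:
            runs[-1][0] += 1
        else:
            runs.append([1, value])
    return runs


def rle_decode(runs):
    """Expand [count, value] runs back into the original sequence."""
    decoded = []
    for count, value in runs:
        decoded.extend([value] * count)
    return decoded


row = [0, 0, 0, 0, 255, 255, 17, 0, 0]
runs = rle_encode(row)          # [[4, 0], [2, 255], [1, 17], [2, 0]]
assert rle_decode(runs) == row  # the round trip restores every pixel
```

Rows with long uniform stretches, such as black borders or the filled rectangles drawn during de-identification, shrink the most under this kind of encoding.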

We can choose the compression by setting the compression parameter. To also compress the pixel data of files that were stored uncompressed, set the forceCompress parameter to True:

draw_regions = DicomDrawRegions() \
    .setInputCol("content") \
    .setInputRegionsCol("regions") \
    .setOutputCol("dicom_cleaned") \
    .setRotated(True) \
    .setCompression(DicomCompression.RLELossless) \
    .setForceCompress(True)

The last step in today’s post is storing the results back to files. To get the name of the original file from the path column, let’s define a UDF:

def get_name(path, keep_subfolder_level=0):
    # Drop the extension and keep the file name plus the requested number of parent folders.
    path = path.split("/")
    path[-1] = path[-1].split('.')[0]
    return "/".join(path[-keep_subfolder_level-1:])
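
A quick check of get_name on sample paths shows how keep_subfolder_level controls how many parent folders are kept. The paths below are made up for illustration, and the function is repeated so the snippet runs on its own.

```python
def get_name(path, keep_subfolder_level=0):
    # Drop the extension and keep the last `keep_subfolder_level` parent folders.
    parts = path.split("/")
    parts[-1] = parts[-1].split(".")[0]
    return "/".join(parts[-keep_subfolder_level - 1:])


sample = "file:/data/dicom/patient_01/scan.dcm"   # hypothetical path
print(get_name(sample))                           # scan
print(get_name(sample, keep_subfolder_level=1))   # patient_01/scan
print(get_name("file:/data/scan.v2.dcm"))         # scan
```

Note that everything after the first dot in the file name is dropped, so scan.dcm and scan.v2.dcm in the same folder would map to the same output name. Keeping a subfolder level is one way to reduce such collisions.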

To save the DataFrame with the cleaned DICOM files using the binaryFormat datasource to the output_path, we need to specify a few options:

  1. The ‘type’ of the file.
  2. The ‘field’ that contains the DICOM file.
  3. A ‘prefix’ for the files.
  4. The ‘nameField’ column, which contains the name of the file.

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

output_path = "./deidentified/"

result.withColumn("fileName", udf(get_name, StringType())(col("path"))) \
    .write \
    .format("binaryFormat") \
    .option("type", "dicom") \
    .option("field", "dicom_cleaned") \
    .option("prefix", "ocr_") \
    .option("nameField", "fileName") \
    .mode("overwrite") \
    .save(output_path)

You can find the Jupyter notebook with the full code here.

In this post, we have constructed the simplest de-identification pipeline to conceal all text in DICOM images. In the next post, we will create a more complex pipeline using NER de-identification models with Spark NLP and Spark NLP for Healthcare.


