Extract DICOM Images Only for Deep Learning

Nawaf Alageel
Published in Analytics Vidhya
4 min read · Sep 10, 2020

What is DICOM?

DICOM (Digital Imaging and Communications in Medicine) is the international standard for transmitting, storing, retrieving, printing, processing, and displaying medical imaging information, and DICOM files are produced by many different imaging modalities. A DICOM file groups its information into a data set, which means that a single image file also contains the patient ID, date of birth, age, sex, and other information about the diagnosis. The main components of a medical image are listed below.

Medical Image Components
  • Pixel Depth is the number of bits used to encode the information of each pixel. For example, an 8-bit raster can have 256 unique values that range from 0 to 255.
  • Photometric Interpretation specifies how the pixel data should be interpreted so that the image is displayed correctly as a monochrome or color image. Whether color information is stored in the pixel values is given by the samples per pixel, also known as the number of channels.
  • Metadata is the information that describes the image (e.g. patient ID, date of the image).
  • Pixel Data is the section where the numerical values of the pixels are stored.

All the components are essential, but the ones that matter for our purposes are the pixel depth and the pixel data. To my knowledge, converting ultrasound images to another format is not a problem, but we have to take the depth of the image into consideration: converting a 16-bit DICOM image to an 8-bit JPEG or PNG discards intensity levels, which can degrade the image quality and the image features. The pixel data is what we are going to feed to the network.
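
To make the bit-depth concern concrete, here is a minimal sketch of what a naive min-max rescale from 16-bit to 8-bit does to the number of distinct intensity levels. It assumes NumPy is available; the array is synthetic rather than a real DICOM image, and `to_uint8` is a hypothetical helper written only for this illustration.

```python
import numpy as np


def to_uint8(image):
    """Min-max rescale an integer image to the 0..255 range (lossy)."""
    image = image.astype(np.float64)
    lo, hi = image.min(), image.max()
    if hi == lo:
        # A constant image carries no contrast to preserve.
        return np.zeros(image.shape, dtype=np.uint8)
    scaled = (image - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)


# Synthetic 16-bit "image" holding 4096 distinct grey levels.
image16 = np.arange(4096, dtype=np.uint16).reshape(64, 64)
image8 = to_uint8(image16)

print(np.unique(image16).size)  # 4096 distinct levels before
print(np.unique(image8).size)   # 256 distinct levels after
```

Sixteen neighbouring grey levels collapse into each 8-bit value here, which is why keeping the original pixel array is the safer choice when fine intensity differences matter.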

For more information about DICOM format visit:

Why Extract Only the Image?

As we saw, the DICOM format contains a lot of information, and sometimes we only need the images. One reason is the patient's private information; another is reducing the file size, even though the metadata is small compared with the image itself. We had to remove the metadata because, as engineers rather than medical staff, we are not allowed to look at the patient's data, and we do not want to risk exposing it to anyone by mistake. In short, we need only the image part (the pixel array) and nothing else.

How to Extract the Images Only?

Most of the time we have to prepare the dataset by reading all the images and storing them in one list to feed to the network. The details of this process vary from one dataset to another, for example in how the files are organized.

Start collecting the data from the directory

NOTE: The files must be organized as follows:

my_directory/
├── Unnamed_154159/
│   ├── IM-0005-0001.dcm
│   └── IM-0005-0002.dcm
├── Unnamed_136281/
│   └── IM-0001-0001.dcm
├── Unnamed_190381/
│   └── IM-0002-0001.dcm
├── Unnamed_102430/
│   └── IM-0001-0002.dcm
.
.
.

import os

import pydicom as di  # used in the next step to read the files

# Root folder that contains the DICOM sub-folders
PathDicom = "The/path/to/DICOM/folders"

# Walk the directory tree and collect the path of every .dcm file
DCMFiles = []
for dirName, subdirList, fileList in os.walk(PathDicom):
    for filename in fileList:
        if filename.lower().endswith(".dcm"):
            DCMFiles.append(os.path.join(dirName, filename))

print("Number of (.dcm) files =", len(DCMFiles))
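
The same collection step can also be sketched with `pathlib`, which may be easier to reuse as a function. This is only an illustration: `find_dicom_files` is a hypothetical helper and the file names in the demo are made up. It compares the file extension, because a substring test such as `".dcm" in name` would also accept a file like `notes.dcm.txt`.

```python
import tempfile
from pathlib import Path


def find_dicom_files(root):
    """Return the sorted paths of all files under root whose extension is .dcm."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".dcm"
    )


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    study = Path(tmp) / "Unnamed_000001"
    study.mkdir()
    for name in ("IM-0001-0001.dcm", "IM-0001-0002.DCM", "notes.dcm.txt"):
        (study / name).touch()
    found = [p.name for p in find_dicom_files(tmp)]

print(found)  # ['IM-0001-0001.dcm', 'IM-0001-0002.DCM']
```

Sorting the result keeps the order of the images stable between runs, which helps if labels are matched to the images by position later on.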

Now that we have a list of the .dcm files, we need to drop everything else and extract only the pixel array (the image itself).

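import pydicom as di

Images1 = []
for k in DCMFiles:
    # force=True also reads files that are missing the standard DICOM file header
    Images = di.dcmread(k, force=True)  # dcmread replaces the deprecated read_file
    Images1.append(Images.pixel_array)  # keep only the pixel data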
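
A loop like this stops at the first file whose pixel data cannot be decoded. As a hedged sketch (not part of the original workflow), the hypothetical helper below records such files and carries on. The `reader` parameter is an assumption added here so that the example can be run without any DICOM files; by default it falls back to `pydicom.dcmread`.

```python
def extract_pixel_arrays(paths, reader=None):
    """Return (images, skipped): decoded pixel arrays and the files that failed."""
    if reader is None:
        import pydicom  # imported lazily so the demo below runs without it

        def reader(path):
            return pydicom.dcmread(path, force=True)

    images, skipped = [], []
    for path in paths:
        try:
            images.append(reader(path).pixel_array)
        except Exception as exc:  # deliberately broad: report the file and keep going
            skipped.append((path, repr(exc)))
    return images, skipped


class _FakeDataset:
    """Stand-in for a DICOM dataset, used only for this demo."""

    def __init__(self, pixels):
        self._pixels = pixels

    @property
    def pixel_array(self):
        if self._pixels is None:
            raise AttributeError("no pixel data")
        return self._pixels


fake_files = {"a.dcm": [[0, 1], [2, 3]], "report.dcm": None}
images, skipped = extract_pixel_arrays(
    fake_files, reader=lambda path: _FakeDataset(fake_files[path])
)
print(len(images), [path for path, _ in skipped])  # 1 ['report.dcm']
```

With real data the call would simply be `Images1, skipped = extract_pixel_arrays(DCMFiles)`, after which `skipped` lists the files worth inspecting by hand.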

Images1 is the list that contains only the images. We can now store the images, for example by pickling the list.
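
As a minimal sketch of the pickling step, the snippet below saves a list of arrays and loads it back. It assumes NumPy is available; the two arrays are placeholders standing in for the extracted images, and `images.pkl` is a hypothetical file name.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Placeholder arrays standing in for the extracted pixel arrays.
images = [
    np.zeros((4, 4), dtype=np.uint16),
    np.full((2, 3), 4095, dtype=np.uint16),
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "images.pkl"  # hypothetical file name
    with open(out_path, "wb") as f:
        pickle.dump(images, f, protocol=pickle.HIGHEST_PROTOCOL)
    with open(out_path, "rb") as f:
        restored = pickle.load(f)

# The round trip preserves both the values and the 16-bit dtype.
same = all(
    a.dtype == b.dtype and np.array_equal(a, b)
    for a, b in zip(images, restored)
)
print(len(restored), same)  # 2 True
```

A list is used rather than a single stacked array because the images may differ in size. Pickle files should only be loaded when they come from a trusted source, since unpickling can execute arbitrary code.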

The previous technique is suitable for feeding images to the network, but it might not be efficient if each image already has a label. To my knowledge, though, images in .dcm format are usually not labeled yet, because the .dcm format is the initial stage of collecting medical images, before experts (e.g. doctors, radiologists) begin to label them. The method keeps the image size unchanged, and it is a good way to make sure that the pixel data does not change.

Alternatively, XnView is a program for viewing, organizing, and converting (.dcm) files, and it is much easier to use for people who only need to view the images. Somebody might also need to convert the images to PNG format for whatever reason. When I converted one of the .dcm images to PNG, the file went from 3.3 MB (.dcm) to only 630 KB (.png), which is a substantial reduction in size. XnView can also export images in batches, which makes it easy to convert a whole dataset to PNG and then work with the PNG images only, without the Python method described above.

In the end, the dataset, which in our case consists of DICOM images, is one of the most important parts of building a robust classification model. It is therefore really important when dealing with the DICOM format to extract the images properly, without losing images or altering their features, especially when dealing with medical images.
