Neuromation Research: Pediatric Bone Age Assessment with Convolutional Neural Networks

Over time, the NeuroNuggets and Neuromation Research series will introduce all the AI researchers we have gathered in our wonderful research team. Today, we are presenting our very own Kaggle master, Alexander Rakhlin! Alexander is a deep learning guru specializing in problems related to medical imaging, which usually means segmentation, object detection, and, generally speaking, convolutional neural networks, although medical images are often in 3D and are not necessarily RGB images, as we have seen when we discussed imaging mass-spectrometry.

You may have already met Alexander Rakhlin here in our research blog: he has authored a recent post with a general survey of AI applications for healthcare. But today we have great news: Alexander’s paper, Pediatric Bone Age Assessment Using Deep Convolutional Neural Networks (a joint work with Vladimir Iglovikov, Alexander Kalinin, and Alexey Shvets), has been accepted for publication at the 4th Workshop on Deep Learning in Medical Image Analysis (DLMIA 2018)! This is not the first paper on medical imaging published under the Neuromation banner, and it is a great occasion to dive into some details of this work. Similar to our previous post on medical concept normalization, this will be a serious and rather involved affair, so get some coffee and join us!

You Are as Old as Your Bones: Bone Age Assessment

Skeletal age, or bone age, is basically how old your bones look. As a child develops, the bones in his/her skeleton grow and mature; this means that by looking at a child’s bones, you can estimate the average age at which a child should have this kind of skeleton and hence learn how old the child is. At this point you’re probably wondering whether this will be a post about archaeology: it’s not often that a living child gets an X-ray while nobody knows when he or she was born.

Well, yes and no. If the child is developing normally, bone age should indeed be roughly within 10% of the chronological age. But there can be exceptions. Some exceptions are harmless but still good to know: e.g., your kid’s growth spurt in adolescence is related to bone age. So if it’s a couple of years more than the chronological age the kid will stop growing earlier, and if the bones are a couple of years “younger” than the rest you can expect a delayed growth spurt. Moreover, given the current height and bone age you can predict the final adult height of a child rather accurately, which can also come in handy: if your kid loves basketball you might be interested whether he’ll grow to be a 7-footer.

Other exceptions are more serious: a significant mismatch between bone age and chronological age can signal all kinds of problems, including growth disorders and endocrine problems. A single reading of skeletal age informs the clinician of the relative maturity of a patient at a particular time, and, integrated with other clinical findings, separates the normal from the relatively advanced or retarded. Successive skeletal age readings indicate the direction of the child’s development and/or show his or her progress under treatment. By assessing skeletal age, a pediatrician can diagnose endocrine and metabolic disorders in child development such as bone dysplasia, or growth deficiency related to nutritional, metabolic, or unknown factors that impair epiphyseal or osseous maturation. In this form of growth retardation, skeletal age and height may be delayed to nearly the same degree, but, with treatment, the potential exists for reaching normal adult height.

Due to all of the above, it is very common for pediatricians to order an X-ray of a child’s hand to estimate his/her bone age… so naturally it’s a great problem to try to automate.

Palm Reading: Assessing Bone Age from the Hand and Wrist

Skeletal maturity is mainly assessed by the degree of development and ossification of secondary ossification centers in the epiphysis. For decades, bone maturity has usually been determined by visual evaluation of the skeletal development of the hand and wrist. Here is what a radiologist looks for when she examines an X-ray of your hand:

The two most common techniques for estimating skeletal age today are Greulich and Pyle and Tanner-Whitehouse (TW2). Both methods use radiographs of the left hand and wrist to assess skeletal maturity based on recognizing maturity indicators, i.e., changes in the radiographic appearance of the epiphyses of tubular bones from the earliest stages of ossification until they fuse with the diaphysis, or changes in flat bones until they reach adult shape… don’t worry, we hadn’t heard these words before either. Let’s show them on a picture:

Conventional techniques for assessing skeletal maturity, such as GP or TW2, are tedious, time-consuming, and to a certain extent subjective: even senior radiologists don’t always agree on the results. Therefore, it is very tempting to use computer-aided diagnostic systems to improve the accuracy and reproducibility of bone age assessment and to make clinicians more efficient.

Recently, approaches based on deep learning have demonstrated performance improvements over conventional machine learning methods for many problems in biomedicine. In the domain of medical imaging, convolutional neural networks (CNN) have been successfully used for diabetic retinopathy screening, breast cancer histology image analysis, bone disease prediction, and many other problems; see our previous post for a survey of these and other applications.

So naturally we tried to apply modern deep neural architectures to bone age assessment as well. Below we describe a fully automated deep learning approach to the problem of bone age assessment using the data from the Pediatric Bone Age Challenge organized by the Radiological Society of North America (RSNA). While achieving as high accuracy as possible is a primary goal, our system was also designed to stay robust against insufficient quality and diversity of radiographs produced on different hardware by various medical centers.


The dataset was made available by the Radiological Society of North America (RSNA), who organized the Pediatric Bone Age Challenge 2017. The radiographs have been obtained from Stanford Children’s Hospital and Colorado Children’s Hospital; they have been taken on different hardware at different times and under different conditions. These images had been interpreted by professional pediatric radiologists who documented skeletal age in the radiology report based on a visual comparison to Greulich and Pyle’s Radiographic Atlas of Skeletal Development of the Hand and Wrist. Bone age designations were extracted by the organizing committee automatically from radiology reports and were used as the ground truth for training the model.

Radiographs vary in scale, orientation, exposure, and often feature specific markings. The entire RSNA dataset contained 12,611 training, 1,425 validation, and 200 test images. Since the test dataset is obviously too small, and its labels were unknown at development time, we tested the model on 1000 radiographs from the training set which we withheld from training.
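The hold-out described above can be sketched in a few lines. This is a minimal illustration, not the split actually used: the function name `split_holdout`, the fixed seed, and the use of integer image ids are all assumptions made for the example.

```python
import random

def split_holdout(image_ids, holdout_size=1000, seed=0):
    """Withhold `holdout_size` ids for testing; return (train_ids, holdout_ids)."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(image_ids)
    rng.shuffle(shuffled)
    return shuffled[holdout_size:], shuffled[:holdout_size]

# 12,611 training radiographs, identified here by hypothetical integer ids.
train_ids, holdout_ids = split_holdout(range(12611))
```

Fixing the seed matters in practice: every model in a comparison should be evaluated on the same withheld radiographs.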

The training data contained 5,778 female and 6,833 male radiographs. The age varied from 1 to 228 months, and the subjects were mostly children from 5 to 15 years old:

Preprocessing I: Segmentation and Contrast

One of the key contributions of our work is a rigorous preprocessing pipeline. To prevent the model from learning false associations with image artifacts, we first remove the background by segmenting the hand.

For image segmentation we use the U-Net deep architecture. Since its development in 2015, U-Net has become a staple of segmentation tasks. It consists of a contracting path to capture context and a symmetric expanding path that enables precise localization; since this is not the main topic of this post, we will just show the architecture and refer to the original paper for details:

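We also used batch normalization to improve convergence during training. In our algorithms, we use the generalized loss function

$$L = H - \log J,$$

where $H$ is the standard binary cross entropy loss function

$$H = -\frac{1}{n}\sum_{i=1}^{n}\left(y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right),$$

$y_i$ is the true value of pixel $i$, $\hat{y}_i$ is the predicted probability for the pixel, and $J$ is a differentiable generalization of the Jaccard index:

$$J = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i \hat{y}_i}{y_i + \hat{y}_i - y_i \hat{y}_i}.$$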
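To make the loss more concrete, here is a minimal plain-Python sketch. It assumes that the two terms are combined as H − log J, with H the mean binary cross entropy and J the mean per-pixel soft intersection over union. The function names and the `eps` guards are illustrative choices, and a real training loop would express the same thing with tensors in a deep learning framework.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross entropy H over all pixels."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / len(y_true)

def soft_jaccard(y_true, y_pred, eps=1e-7):
    """Differentiable Jaccard J: mean of y*p / (y + p - y*p) over pixels."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # eps only guards the denominator when both y and p are zero
        total += (y * p) / (y + p - y * p + eps)
    return total / len(y_true)

def segmentation_loss(y_true, y_pred, eps=1e-7):
    """Assumed combination of the two terms: L = H - log J."""
    j = max(soft_jaccard(y_true, y_pred), eps)  # guard against log(0)
    return binary_cross_entropy(y_true, y_pred) - math.log(j)

# A sharper prediction of the same mask should receive a lower loss.
mask = [1, 0, 1, 1]
loss_sharp = segmentation_loss(mask, [0.9, 0.1, 0.8, 0.9])
loss_vague = segmentation_loss(mask, [0.6, 0.4, 0.5, 0.6])
```

The cross entropy term rewards per-pixel calibration, while the −log J term grows quickly when the predicted mask overlaps poorly with the true one.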

We finalize the segmentation step by removing small extraneous connected components and equalizing the contrast. Here is how our preprocessing pipeline works:

As you can see, the quality and contrast of the radiograph do indeed improve significantly. One could stop here and train a standard convolutional neural network for classification/regression, augmenting the training set with our preprocessing and standard techniques such as scaling and rotations. We gave this approach a try, and the result, although not as accurate as our final model, was quite satisfactory.
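The clean-up of small connected components mentioned above can be illustrated with a short sketch. This is not the pipeline's actual code: it assumes a binary mask stored as nested lists, 4-connectivity, and a hypothetical `min_size` threshold, and it leaves out contrast equalization entirely. In practice one would typically use an image processing library for both steps.

```python
from collections import deque

def remove_small_components(mask, min_size):
    """Zero out 4-connected foreground components smaller than min_size pixels."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    cleaned = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search collects one connected component.
            component, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Keep the component only if it is large enough.
            if len(component) >= min_size:
                for y, x in component:
                    cleaned[y][x] = 1
    return cleaned

# Toy mask: a 6-pixel "hand" on the left and a 1-pixel speck on the right.
mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
]
cleaned = remove_small_components(mask, min_size=3)
```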

However, original GP and TW methods focus on specific hand bones, including phalanges, metacarpal and carpal bones. We decided to try to use this information and train separate models on several specific regions in high resolution to numerically evaluate and compare their performance. To correctly locate these regions, we have to transform all images to the same size and position, i.e., to bring them all to the same coordinate space, a process known as image registration.

Preprocessing II: Image Registration with Key Points

Our plan for image registration is simple: we need to detect the coordinates of several characteristic points in the hand. Then we will be able to compute affine transformation parameters (zoom, rotation, translation, and mirroring) to fit the image into the standard position.

To create a training set for the key points model, we manually labelled 800 radiographs using VGG Image Annotator (VIA). We chose three characteristic points: the tip of the distal phalanx of the third finger, tip of the distal phalanx of the thumb, and center of the capitate. Pixel coordinates of key points serve as training targets for our regression model.

The key points model is, again, implemented as a deep convolutional neural network, inspired by the popular VGG family of models but with regression output. The VGG module consists of two convolutional layers with Exponential Linear Unit activation, batch normalization, and max pooling. Here is the architecture:

The model is trained with Mean Squared Error loss (MSE) and Adam optimizer:

To improve generalization, we applied standard augmentations to the input, including rotation, translation, and zoom. The model outputs 6 coordinates, 2 for each of the 3 key points.

Having found the key points, we calculate affine transformations (zoom, rotation, translation) for all radiographs. Our goal is to keep the aspect ratio of an image but fit it into a uniform position such that for every radiograph:

  1. the tip of the middle finger is aligned horizontally and positioned approximately 100 pixels below the top edge of the image;
  2. the capitate is aligned horizontally and positioned approximately 480 pixels above the bottom edge of the image.
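One way to obtain such a transformation from the two key points is sketched below. It treats pixel coordinates as complex numbers, so that a zoom, a rotation, and a translation that keep the aspect ratio reduce to z → a·z + b. The canvas size, the choice to center both points horizontally, and the sample key points are assumptions made for illustration; they are not taken from the paper.

```python
import cmath

def similarity_from_two_points(src_a, src_b, dst_a, dst_b):
    """Return (a, b) such that z -> a*z + b maps src_a to dst_a and src_b to dst_b.

    Points are (x, y) pixel pairs treated as complex numbers x + iy;
    abs(a) is the uniform zoom and cmath.phase(a) is the rotation angle.
    """
    p1, p2 = complex(*src_a), complex(*src_b)
    q1, q2 = complex(*dst_a), complex(*dst_b)
    a = (q2 - q1) / (p2 - p1)
    b = q1 - a * p1
    return a, b

def apply_transform(transform, point):
    """Map a single (x, y) point through the similarity transform."""
    a, b = transform
    z = a * complex(*point) + b
    return (z.real, z.imag)

# Hypothetical canvas size; both targets are centered horizontally (an assumption).
WIDTH, HEIGHT = 1600, 2080
fingertip_target = (WIDTH / 2, 100)           # about 100 px below the top edge
capitate_target = (WIDTH / 2, HEIGHT - 480)   # about 480 px above the bottom edge

# Made-up key point predictions for a tilted hand.
fingertip, capitate = (900.0, 250.0), (700.0, 1300.0)
transform = similarity_from_two_points(
    fingertip, capitate, fingertip_target, capitate_target
)
zoom, angle = abs(transform[0]), cmath.phase(transform[0])
```

Two point correspondences determine a similarity transform exactly, which is why the third key point (the thumb) can be reserved for detecting mirrored images.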

By convention, bone age assessment uses radiographs of the left hand, but sometimes images in the dataset get mirrored. To detect these images and adjust them appropriately, we used the key point for the thumb.
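A simple way to use the thumb key point for this check is to ask on which side of the capitate-to-fingertip axis the thumb lies. The sketch below does this with a 2D cross product. The expected side (thumb to the right of the axis when the fingers point up, in image coordinates with y growing downward) is an assumption about how left-hand radiographs are displayed, and all names and sample points are hypothetical.

```python
# Assumption: in a correctly oriented left-hand radiograph the thumb lies
# to the right of the capitate -> fingertip axis (image y axis points down).
EXPECTED_THUMB_SIDE = +1

def thumb_side(fingertip, capitate, thumb):
    """Return +1 or -1 for the side of the hand axis the thumb is on (0 if on it)."""
    ax, ay = fingertip[0] - capitate[0], fingertip[1] - capitate[1]
    tx, ty = thumb[0] - capitate[0], thumb[1] - capitate[1]
    cross = ax * ty - ay * tx   # sign of the 2D cross product
    return (cross > 0) - (cross < 0)

def needs_mirroring(fingertip, capitate, thumb):
    """True if the thumb is on the opposite side from the expected one."""
    return thumb_side(fingertip, capitate, thumb) == -EXPECTED_THUMB_SIDE

def mirror_point(point, width):
    """Flip a point horizontally within an image of the given width."""
    x, y = point
    return (width - 1 - x, y)

# Made-up key points with the thumb on the left of the hand axis.
WIDTH = 1000
fingertip, capitate, thumb = (500, 100), (500, 800), (200, 600)
flipped = needs_mirroring(fingertip, capitate, thumb)
```

Because the test depends only on the sign of a cross product, it gives the same answer however the hand is rotated, so it can run before or after registration.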

Let’s see a sample of how our image registration model works. As you can see, the hand has been successfully rotated into our preferred standard position:

And here are some more examples of the entire preprocessing pipeline. Results of segmentation, normalization and registration are shown in the fourth row:

Bone age assessment models

Following Gilsanz and Ratib’s Hand Bone Age: a Digital Atlas of Skeletal Maturity, we have selected three specific regions from registered radiographs and trained an individual model for each region:

  1. whole hand;
  2. carpal bones;
  3. metacarpals and proximal phalanges.

Here are the regions and some sample corresponding segments of real radiographs:

Convolutional neural networks are typically used for classification tasks, but bone age assessment is a regression problem by nature: we have to predict age, a continuous variable. Therefore, we wanted to compare two settings of the CNN architecture, regression and classification, so we implemented both. The models share similar parameters and training protocols, and only differ in the two final layers.

Our first model is a custom VGG-style architecture with regression output. The network consists of a stack of six convolutional blocks with 32, 64, 128, 128, 256, 384 filters followed by two fully connected layers of 2048 neurons each and a single output (we will show the picture below). The input size varies depending on the considered region of an image. For better generalization, we apply dropout layers before fully connected layers. We rescale the regression target, i.e., bone age, to the range [−1, 1]. To avoid overfitting, we use train time augmentation with zoom, rotation and shift. The network is trained with the Adam optimizer by minimizing the Mean Absolute Error (MAE):
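The target rescaling and the error metric can be sketched as follows. The post only states that bone age is rescaled to [−1, 1]; the linear mapping and the upper bound of 240 months used here are assumptions, as are the function names.

```python
MAX_AGE_MONTHS = 240.0  # assumed upper bound of the age range

def scale_age(months):
    """Map an age in [0, MAX_AGE_MONTHS] linearly onto [-1, 1]."""
    return 2.0 * months / MAX_AGE_MONTHS - 1.0

def unscale_age(value):
    """Inverse of scale_age: map a network output back to months."""
    return (value + 1.0) * MAX_AGE_MONTHS / 2.0

def mean_absolute_error(y_true, y_pred):
    """MAE = (1/n) * sum(|y_i - y_hat_i|)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up ages in months, used only to exercise the functions.
true_months = [100.0, 150.0]
pred_months = [106.0, 146.0]
mae_months = mean_absolute_error(true_months, pred_months)
mae_scaled = mean_absolute_error(
    [scale_age(m) for m in true_months],
    [scale_age(m) for m in pred_months],
)
```

With this mapping an error measured on the scaled target converts back to months by multiplying by `MAX_AGE_MONTHS / 2`.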

The second model, for classification, is very similar to the regression one except for the two final layers. One major difference is that a distinct class is assigned to each bone age. In the dataset, bone age is expressed in months, so we considered all 240 classes, and the penultimate layer becomes a softmax layer with 240 outputs. This layer outputs a vector of probabilities, where the probability of each class takes a real value in the range [0, 1]. In the final layer, the probability vector is multiplied by the vector of distinct bone ages [1, …, 239, 240]. Thereby, the model outputs a single expected value of the bone age. We train this model using the same protocol as the regression model.
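The two final layers described above amount to taking the expectation of the age under the softmax distribution. A minimal plain-Python sketch, with illustrative names and made-up logits:

```python
import math

AGES = list(range(1, 241))  # one class per month: 1, ..., 240

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_age(logits):
    """Dot product of class probabilities with the vector of ages."""
    probs = softmax(logits)
    return sum(p * age for p, age in zip(probs, AGES))

# Made-up logits: equal evidence for 100 and 110 months, none elsewhere.
logits = [-1000.0] * 240
logits[99] = 0.0    # class for 100 months
logits[109] = 0.0   # class for 110 months
age = expected_age(logits)
```

Since the output is an expectation rather than an argmax, it stays continuous and differentiable, which is what lets the classification variant be trained with the same protocol as the regression model.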

Here is the model architecture for classification; the regression model is the same except for the lack of softmax and binning layers:


We evaluated the models on a validation set of 1000 radiographs withheld from training. Following GP and TW methods that account for sex, for each spatial zone we trained gender-specific models separately for females and males, and compared them to a gender-agnostic model trained on the entire population. Here is a summary of our results which we will then discuss:

It turns out that adding gender to the input significantly improves accuracy, by 1.4 months on average. The leftmost column represents the performance of a regression model for both genders. The region of metacarpals and proximal phalanges (region C) has Mean Absolute Error (MAE) 8.42 months, while MAE of the whole hand (region A) is 8.08 months. A linear ensemble of the three zones improves overall accuracy to 7.52 months (bottom row in the table).
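A linear ensemble of regional models is simply a weighted sum of their predictions. The sketch below uses equal weights and entirely synthetic numbers that are unrelated to the results in this post. How the actual ensemble weights were chosen is not described here, so the weights, names, and data are all hypothetical.

```python
def linear_ensemble(region_predictions, weights, bias=0.0):
    """Combine per-region age predictions (in months) with a weighted sum."""
    n = len(region_predictions[0])
    return [
        bias + sum(w * preds[i] for w, preds in zip(weights, region_predictions))
        for i in range(n)
    ]

def mean_absolute_error(y_true, y_pred):
    """MAE = (1/n) * sum(|y_i - y_hat_i|)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Synthetic data for illustration only (not from the paper).
truth = [100.0, 150.0, 60.0]
whole_hand = [108.0, 144.0, 66.0]
carpal = [94.0, 158.0, 52.0]
metacarpal = [97.0, 151.0, 65.0]

regions = [whole_hand, carpal, metacarpal]
combined = linear_ensemble(regions, weights=[1 / 3, 1 / 3, 1 / 3])
mae_combined = mean_absolute_error(truth, combined)
mae_single = [mean_absolute_error(truth, r) for r in regions]
```

The toy numbers are chosen so that the regional errors partly cancel, which is the usual motivation for ensembling models that look at different parts of the hand.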

Gender-specific regression models (second and third columns) improved MAE to 6.30 months for males and to 6.49 months for females. Note that for the female cohort, the region of metacarpals and proximal phalanges (region C) has an MAE of 6.79 months, which is even more accurate than the whole hand with its MAE of 7.12 months!

Gender-specific classification models (fourth and fifth columns) perform slightly better than regression models and demonstrate a MAE of 6.16 and 6.39 months respectively (bottom row).

Finally, in the sixth column we show an ensemble of all gender-specific models (classification and regression). On the validation dataset it achieved state of the art accuracy of 6.10 months, which is a great result both in terms of the bone age assessment challenge and from the point of view of real applications.


Let’s wrap up: in this post, we have shown how to develop an automated bone age assessment system that can assess skeletal maturity with remarkable accuracy, similar to or better than an expert radiologist. We have numerically evaluated different zones of a hand and found that bone age assessment could be done just for metacarpals and proximal phalanges without significant loss of accuracy. To overcome the widely ranging quality and diversity of the radiographs, we introduced rigorous cleaning and standardization procedures that significantly increased robustness and accuracy of the model.

Our model has great potential for deployment in clinical settings, where it could help clinicians make bone age assessment decisions accurately and in real time, even in hard-to-reach areas. This would support timely diagnosis and treatment of growth disorders in their young patients. And this is, again, just one example of what the Neuromation team is capable of. Join us later for more installments of Neuromation Research!

Alexander Rakhlin
Researcher, Neuromation

Sergey Nikolenko
Chief Research Officer, Neuromation