GSOC 2021 with ML4SCI | Deep Regression for Exploring Dark Matter

Yurii Halychanskyi
9 min readAug 21, 2021


Welcome! I got an opportunity to participate in Google Summer of Code (GSoC) 2021 in the Machine Learning For Science (ML4SCI) organization.

In this blog post, I will briefly describe my project and show the results I got.

Let’s go!

Image by: Google Summer of Code

What is DeepLense?

DeepLense is a deep learning pipeline for particle dark matter searches with strong gravitational lensing.

What is the project about?

Applying deep regression techniques to explore the properties of dark matter. We approximate the mass density of vortex substructure of dark matter condensates on simulated strong lensing images.

What is strong gravitational lensing?

A phenomenon when two galaxies are perfectly aligned down the line-of-sight to Earth. The background galaxy’s light is bent by the mass of the foreground galaxy, therefore the background galaxy can be seen multiple times from a telescope. In our case, the background galaxy appears as a ring around the foreground galaxy, which is called an Einstein ring. A picture that describes this process is shown below.

Image by: F. Courbin, S. G. Djorgovski, G. Meylan, et al., Caltech / EPFL / WMKO.


For simulating strong lensing images, we use PyAutoLense, which is an open-source software for the analysis and modeling of strong gravitational lenses.

You can find more information about strong gravitational lensing with PyAutoLense here.


All the code for this project can be found in this repository.

About Me

My name is Yurii Halychankyi, a sophomore Computer Science student at the University of Washington. I participated in FellowshipAI in the summer of 2020, and now I am working as a Computer Science intern at Shin-Etsu.

The Image Data

Examples of strong lenses with vortex substructure.

The dataset consists of strong lensing images with vortex substructure generated by PyAutoLense. The parameters for the generator were taken from the following paper: “Decoding Dark Matter Substructure without Supervision.

The dataset contains 25k grayscale images with the size of 150x150.

Images are stored in a .npy file with the following dimensions: (250000,1,150,150).

Labels are stored in a separate .npy file and have the following dimensions: (250000,1).

You can find the dataset and the weights for the model here.

Lensing signatures from substructures are used to distinguish different models of dark matter. Vortex substructure signature resembles a line.

Let’s draw the coordinates of vortices on the images to understand the data better.

On the left are original images; on the right are the same images but with lines wherever there are vortices.

In places where vortices exist, we can observe slight distortions. The larger the mass density of the vortex, the stronger the distortion.

Mass Density Distribution

Mass density distribution of the regression dataset.

The mass density distribution is skewed to 1.5. The distortions for such a low mass density will be hard to see on the images, but it shouldn’t be a problem for our regression model.

Splitting the data set

In total, there are 25k strong lensing images in our dataset. We are going to split the data in the following way:

  • train — 21250 images
  • validation — 2500 images
  • test — 1250 images


Transformations would help us to expand the dataset without generating new images. As I mentioned earlier, the lensing signature for vortex looks like a line, so we should come up with transforms that would not change the properties of that line.

Resizing, cropping, and stretching would distort the line, so we will go only with flips and rotations.

Let’s plot the transformations:

Examples of strong lensing images after applying flips and rotations.

Here is a list of Pytorch transforms that flip and rotate the images:

rotation_image_transofrms = [

Create a Dataset

We are going to use FastAI to perform training, so we need to create a custom dataset to be able to load the data from numpy arrays.

train_dataset = RegressionNumpyArrayDataset(images, labels, train_indx,                                         transforms.Compose(base_image_transforms+rotation_image_transofrms))valid_dataset = RegressionNumpyArrayDataset(images, labels, valid_indx,                                            transforms.Compose(base_image_transforms))test_dataset = RegressionNumpyArrayDataset(images, labels, test_indx,                                 transforms.Compose(base_image_transforms))

The code for RegressionNumpyArrayDataset can be found in the project’s repository.

Data Loaders

To pass the datasets to the learner, we need to put them in a data loader. Before doing that, we need to choose a batch size.

Usually, the batch size is a number of a power of 2 for the sake of computation simplicity on GPU. The more GPU memory you have, the bigger batch size you can choose.

I have run the model with different batch sizes to see how it affects the loss.

Loss via batch size graph.

The rule seems to be working although the loss does not go much down after 128. I am going to train the model on Kaggle, so I will go with the batch of size 64.

device = 'cuda' if torch.cuda.is_available() else 'cpu'
batch_size = 64
dls = DataLoaders.from_dsets(train_dataset,valid_dataset,batch_size=batch_size, device=device, num_workers=2)

Model Architecture

We are going to use Resnet with a linear layer in the end, which would output the mass density. But instead of using a pure Resnet, we use a modified version that is called XResnetHybrid.

What is XResnet?

XResnet is a Resnet that applies some tricks from the following publication: “Bag of Tricks for Image Classification with Convolutional Neural Networks”.

Here are some of the tricks the model implements:

  • Increases computational speed by improving parallelism and using mixed-precision training
  • Modifies the parameters of some layers to improve the performance
  • Uses label smoothing and knowledge distillation

The full list of tricks can be found in the paper.

What is XResnetHybrid?

XResnetHybrid is a XResnet that has a deconvolution layer instead of batch normalization in the first layer of the CNN.

Image by: John F. Wu via this poster.

The tests show that such an architecture outperforms XResnet.

The implementation of XResnetHybrid is taken from this repository.

What is Deconvolution?

Network deconvolution is a procedure that removes pixel-wise and channel-wise correlations before the data is fed into each layer. It results in a sparse representation of features, which helps the model to perform better without using batch normalization.

We create an instance of XResnetHybrid101 that takes grayscale images as input:

model = xresnet_hybrid101(n_out=1, sa=True, act_cls=Mish_layer, c_in = 1,device=device)

n_out is the number of neurons in the output linear layer. We need only one neuron to predict mass density because it is a scalar.

c_in is the number of channels of an image. Since we are working with grayscale images, the number of channels is 1.

Mish Activation Function

Mish is a novel self-regularized non-monotonic activation function that can be mathematically defined as:

Mish provides much better accuracy, overall lower loss, smoother and well-conditioned easy-to-optimize loss landscape as compared to both Swish and ReLU.

Here is how it looks like:

Mish activation function.

This is a Pytorch implementation of Mish:

class Mish_layer(nn.Module):
The class represents Mish activation function.
def __init__(self):
def forward(self,x):
return x*torch.tanh(F.softplus(x))

Refer to this paper to learn more about Mish.

Self Attention

The problem of vanilla CNN is that it considers only the local information of a pixel when calculating the convolution. Self-attention allows using global information as well. This approach is based on covariance between the predicted pixel and every other pixel, in which each pixel is considered as a random variable.

You can find more information about self-attention in this paper.

Ranger: RAdam + LookAhead

Ranger is a combination of two optimizers: RAdam and LookAhead. While RAdam stabilizes training at the start, LookAhead stabilizes the convergence during the rest of the training.

Such a combination outperforms other optimizers. You can find more information about ranger in this blog.

Loss Function

Choosing a loss function is an important step since training is just a process of finding the global minimum of the loss function. Our dataset is unbalanced, so we need a function that would punish outliers.

Although Mean Square Error (MSE) seems to be a good choice, we are going to use Root Mean Square Error (RMSE) to compare our model with existing regression benchmarks.


Metrics allow us to measure the performance of the model. In our case, we use Mean Absolute Error (MAE) because we had already used it in our previous experiments.


Learner is a FastAI object that contains dataloaders, model, optimizer,loss, and a lot of other things we need for training.

learn = Learner(
loss_func= root_mean_squared_error,
model_dir = ''

Finding the Learning Rate

There is a function in FastAI that helps us to choose the learning rate. It runs the model a few times with different learning rates and plots that graph.

Learning rate graph.

The best choice would be the steepest negative slope on the graph. FastAI suggests choosing 1e-3, but I found out that 1e-2 works better with our model.


We train the model for 120 epochs. After epoch 100, the training loss becomes much lower than the validation one, which indicates that this is a good time to stop training before the model starts to overfit.

learn.fit_one_cycle(120,1e-2,cbs=[                                SaveModelCallback(monitor='mae_loss_wgtd',fname='best_model')])

SaveModelCallback automatically saves the best model during training.

Losses for the first 10 epochs.

Fit_one_cycle allows us to train the model with Leaslie 1cycle policy.

The policy makes the learning rate follow the following pattern during the training:

Image by: FastAI docs.


Before testing, we need to load the best saved model.


Next, we create a data loader for the test dataset and predict mass densities for it.

test_dl = DataLoader(test_dataset, batch_size=1,shuffle=False,device=device)m_pred,m_true = learn.get_preds(dl=test_dl,reorder=False)

And now it is time to plot the results.

test_mae = mae_loss_wgtd(m_pred,m_true)
plt.scatter(m_true, m_pred, color='black')
line = np.linspace(0, 6, 100)
plt.plot(line, line)
plt.xlabel('Observed mass')
plt.ylabel('Predicted mass')
plt.text(1,4, 'MAE: {:.4f}'.format(test_mae))
Results of xResnsetHybrid101 trained for 120 epochs with batch_size=64, learning_rate=1e-2 .

The results look quite good. The points form a line, which means that the predicted masses are close to the observed ones.

Final Thoughts

From my perspective, the results are reassuring, but there is still room for improvement. I have spent some time testing how the model behaves with error maps (the concept is described in this paper), and the results were worse in comparison to the real images. But the concept seems to be promising, so it is definitely worth exploring.

Also, we used rotations and flip as the transformations for our data. But we could eliminate them by using an equivariant neural network for regression. Such a network might reduce the training time and help the model to converge.

Overall, I enjoyed working on the project this summer; I want to thank Sergei Gleyzer, Pranath Reddy, Michael Toomey, and Brendan Ames for helping and directing me during the program.

Thank GSoC for such an opportunity!

Also, I would like to express my gratitude to John F. Wu. The regression pipeline was heavily inspired by his blog.

Important links: