Image-to-image constellation outlining
LTAT.02.001 Neural Networks project, University of Tartu
Ingvar Baranin, Perseverance Ngoy, Martin Masaba, Marilin Moor
When we were little we used to love connecting dots to get images like in Figure 1. Now, x years later, we have trained several neural network models (pix2pix and CycleGAN) to let the machine do the job for us (though without the numbered ordering of the dots, of course). This project employs multiple image-to-image translation networks to convert constellation images into their corresponding object outline images (Figure 2).
The dataset was provided by Tarun Khajuria (https://github.com/tarunkhajuria42/Constellations-Dataset).
In order to train and test our models, the objects in Constellations_TOP were used for the test set, and the objects in Constellations_All (excluding those in Constellations_TOP) were used for training.
To perform this split, a script was written that traverses both datasets and computes the set difference of their filenames. Since the two models require different folder structures, two similar scripts were used, each handling the restructuring with its model's requirements in mind. In total, out of the 3533 images in Constellations_All, the 481 images it shares with Constellations_TOP were used as the test set, and the remaining 3052 images were split 90–10 into a training set of 2746 and a validation set of 306 images (the latter only for the pix2pix model).
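For illustration, here is a minimal Python sketch of the segregation logic; the paths and seed are hypothetical, and the actual scripts additionally restructured the folders for each model:

```python
import random
from pathlib import Path

# Hypothetical paths; the real scripts also restructured the folders
# to match each model's expected layout.
all_dir = Path("Constellations_All")
top_dir = Path("Constellations_TOP")

all_names = {p.name for p in all_dir.rglob("*.png")}
top_names = {p.name for p in top_dir.rglob("*.png")}

test_names = all_names & top_names           # the 481 images shared with Constellations_TOP
train_pool = sorted(all_names - top_names)   # the remaining 3052 images

random.seed(0)  # any fixed seed keeps the split reproducible
random.shuffle(train_pool)
cut = int(0.9 * len(train_pool))
train_names, val_names = train_pool[:cut], train_pool[cut:]  # 2746 / 306
```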
It should also be pointed out that every object folder contains images at several difficulty levels (for reference, see Figure 3). The difficulty level denotes the distance between outline points in the constellation: the smaller the level, the closer the points (and the easier the task).
Furthermore, the dataset came at two noise levels, meaning the amount of disturbance applied to the constellations was either 0.002 or 0.003.
As the goal of our project was to evaluate how well the models can create outline images from the respective constellations, only the constellation and outline data were used.
The original GAN
Generative adversarial nets provide a way of training two models: the discriminator D and the generator G. The former is trained to correctly classify whether an image is real or fake (binary classification), while the latter is trained so that the fake images it generates fool D into making a wrong prediction.
The generator model is based on the encoder-decoder architecture, while the discriminator is a deep convolutional neural network for image classification. Essentially, D learns to distinguish real images from generated fakes, while G learns to fool D (an adversarial game). The objective function is a min-max problem.
In the maximization part, the discriminator pushes its predicted probability as close as possible to 1 for real images and to 0 for fake images; gradient ascent is used to update D. The generator, in turn, minimizes the objective function and is updated with gradient descent, as it wants D to mistake fake images for real ones (a predicted probability close to 1).
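Written out, the min-max objective from the original GAN paper is

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

where z is the latent noise vector fed to the generator.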
In addition to the latent-vector input of the original GAN architecture above, pix2pix uses paired input and target images, in our case the constellations and their corresponding outlines. Pix2pix is a conditional GAN (cGAN) with a U-Net generator and a PatchGAN discriminator. The idea of PatchGAN is that the classifier focuses on small image patches of a defined size rather than on the whole image. The objective function for pix2pix combines the GAN loss with an L1 loss.
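In the notation of the pix2pix paper, with input x, target y, and noise z, the final objective is

```latex
G^* = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G),
\qquad
\mathcal{L}_{L1}(G) = \mathbb{E}_{x, y, z}\big[\lVert y - G(x, z) \rVert_1\big]
```

so the generator is rewarded both for fooling the discriminator and for staying close to the ground-truth outline pixel-wise.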
CycleGAN is suited to this task given our two domains of constellations and corresponding outlines. The goal is to learn a mapping G: X → Y from constellation images in the input domain (X) to the outlines (Y), such that the mapped images are as close as possible to real outlines.
To achieve this, the model also learns an inverse mapping F: Y → X, which takes the output domain back to the input domain. Both mappings are trained simultaneously, and forward and backward consistency is enforced by a cycle-consistency loss.
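The cycle-consistency term from the CycleGAN paper penalizes round trips F(G(x)) and G(F(y)) that fail to reconstruct the original image:

```latex
\mathcal{L}_{cyc}(G, F) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\lVert F(G(x)) - x \rVert_1\big]
  + \mathbb{E}_{y \sim p_{\text{data}}(y)}\big[\lVert G(F(y)) - y \rVert_1\big]
```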
Methods and results
All of the work for this project was done on the University of Tartu's HPC GPU (Tesla V100-SXM2-32GB) in the Jupyter environment, and the model implementations originated from this GitHub repository.
Firstly, in order to determine the optimal number of epochs for training the pix2pix models, multiple models were trained for varying numbers of epochs on constellations of the average difficulty level (11).
It can be seen from Figure 7 that the model trained for 40 epochs (plus another 40 during which the learning rate linearly decays to zero) produced images most similar in detail to the real one. This is why all of the pix2pix models (for both the 002 and 003 noise levels) were trained for 40+40 epochs.
Due to the performance-heavy behavior of CycleGAN, it was decided to train the model for 10+10 epochs and only at noise level 003. One model took approximately 4 hours to train on the provided GPU.
It can be seen from Figure 8 that all of the models are far from perfect.
When comparing the pix2pix models at different noise levels, the noise level does indeed seem to affect model performance, as input constellations with noise level 002 yield relatively accordion-like outputs across all of the difficulties.
The CycleGAN models, on the other hand, seem to lose their way from difficulty level 11 onward, ultimately becoming inspired by bubbles instead. Truthfully, we ourselves have difficulty connecting the dots from level 13 onward and could certainly not produce an accordion with the same level of detail as in the real image.
Validation of model performance
While visual comparison is one approach to assess the goodness of models, a quantitative method was also employed for validation.
The idea is as follows: we train a classifier on real constellation outlines with labels naming the outlined object. We also keep a holdout test dataset of real constellation outlines in order to get a baseline of how well the classifier performs on the domain it was trained on. Next, we let the trained classifier predict what it sees in fake constellation outlines. Using the baseline as comparison, we see how bad (or good) the classifier is at understanding constellation outlines generated by the image-to-image models.
However, this idea needs labels, which we did not have. We made use of the fact that the constellations dataset is a subset of the THINGS dataset (https://osf.io/jum2f/), which categorizes some (but unfortunately not all) of its images into 27 high-level categories. We found the intersection of our constellation dataset images and the THINGS dataset images for which we could find labels, and thus created a mapping from most of our constellation images to a category. Since some images had multiple high-level categories, for example “food” and “vegetable”, we kept only the more general category, which in this example was “food”. This process left out the categories “bird”, “fruit”, “insect” and “vegetable”, which were replaced by “animal” or “food”, leaving us with 23 possible categories (listed in Figure 11).
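A minimal sketch of this label-mapping step; the lookup table and helper below are hypothetical stand-ins for the THINGS metadata, not the actual project code:

```python
# Illustrative entries; the full lookup comes from the THINGS metadata
# and maps each concept name to its high-level categories.
concept_to_categories = {
    "accordion": ["musical instrument"],
    "banana": ["food", "fruit"],
}

# Sub-categories collapsed into their more general parents.
COLLAPSE = {"fruit": "food", "vegetable": "food",
            "bird": "animal", "insect": "animal"}

def label_for(concept):
    cats = concept_to_categories.get(concept)
    if not cats:
        return None  # no high-level THINGS label: the image is dropped
    collapsed = {COLLAPSE.get(c, c) for c in cats}
    # After collapsing, parent/child pairs have merged; if several distinct
    # categories still remain, pick one deterministically (an assumption).
    return sorted(collapsed)[0]

print(label_for("banana"))  # -> "food"
```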
With the mapping in place, the intersection left us with 2160 training images (10% of which the classifier split off for validation) and 406 test images. To reiterate, the image-count mismatch compared to our initial datasets stems from the fact that not all THINGS dataset images have a high-level category label.
The image classifier used for validation in this project was a vision transformer (ViT) implemented with TensorFlow. The ViT model applies the transformer architecture with self-attention to sequences of image patches, without using convolution layers.
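The patching step looks roughly like the sketch below, which follows the common Keras ViT example; that this resembles our implementation is an assumption on our part, based on the default patch size of 6 mentioned further down:

```python
import tensorflow as tf

class Patches(tf.keras.layers.Layer):
    """Splits an image into flattened square patches (a sketch, not
    the exact project code)."""

    def __init__(self, patch_size):
        super().__init__()
        self.patch_size = patch_size

    def call(self, images):
        batch_size = tf.shape(images)[0]
        # Extract non-overlapping patch_size x patch_size patches.
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        patch_dims = patches.shape[-1]
        # Flatten the spatial grid into a sequence of patch vectors.
        return tf.reshape(patches, [batch_size, -1, patch_dims])

# With 256x256 inputs and 16x16 patches, each image becomes a
# sequence of (256 / 16) ** 2 = 256 patch tokens.
```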
For training the constellation outline classifier, the outline images were resized to 256x256 to match the output dimensions of the fakes generated by the image-to-image models. We assumed that this would help generalize across the real and fake domains, in addition to keeping the input size consistent. Additionally, we converted all of the classifier input to grayscale, because the generated outlines used for evaluation often had colorful artifacts, which were detrimental to the classifier's performance, since it had been trained to see the world in black and white.
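A minimal sketch of this preprocessing (the [0, 1] scaling at the end is our assumption, not confirmed project code):

```python
import tensorflow as tf

def preprocess(image):
    """Resize to the generators' 256x256 output size and drop color."""
    image = tf.image.resize(image, (256, 256))   # match fake-image dimensions
    image = tf.image.rgb_to_grayscale(image)     # shape becomes (256, 256, 1)
    return tf.cast(image, tf.float32) / 255.0    # assumed [0, 1] scaling
```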
As the original transformer implementation worked on smaller images, we increased the default image patch size from 6 to 16 to lessen the computational demand. On a trial run of 100 epochs we saw massive overfitting, so we decreased the epoch count to 50. With training done, we had a baseline of 45% classification accuracy and 70% top-5 accuracy on our test set of real constellation outlines. Top-5 accuracy describes the percentage of predictions for which the correct category was among the 5 most likely predictions.
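For clarity, top-5 accuracy is available as a stock Keras metric; the tensors below are purely illustrative:

```python
import tensorflow as tf

# A prediction counts as top-5 correct if the true label is among the
# five highest-scoring of the 23 categories.
top5 = tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5)
y_true = tf.constant([3, 7])          # integer category labels
y_pred = tf.random.uniform((2, 23))   # illustrative scores over 23 categories
top5.update_state(y_true, y_pred)
print(float(top5.result()))
```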
To avoid confusion with the previous test set of real constellation outlines, we will refer to the test set of fake constellation outlines as the evaluation dataset. The evaluation dataset consisted of 100 objects altogether, and we attempted to keep the distribution of objects per category as uniform as possible (unlike the training set, whose objects were fixed), while also not making the dataset unreasonably tiny (refer back to Figure 11).
Figure 12 shows the final classification accuracies.
Despite the prior visual observation that the pix2pix model with noise level 002 seemed more akin to the real outline, the classifier outputs similar accuracies for both noise levels.
Generally, the accuracies are around 4–5x lower than the baseline, and the top-5 accuracies about half as good, implying that these out-of-the-box models are not performing well enough for the task at hand.
This project managed to robustly train two separate models, pix2pix and CycleGAN, on constellations of different difficulty and noise levels, and to obtain the corresponding fake images of the objects in the test set. Additionally, model performance was validated through classification of image categories by a vision transformer model.
Recommendations for future improvements:
- to improve the accuracy of the validation classifier: use contrastive learning in the transformer part and train it on human sketches (general properties and feature learning), then train the classification part of the same network on outlines
- test out the new, improved version of the CycleGAN model, Contrastive Unpaired Translation, by the same authors (Taesung Park and Jun-Yan Zhu), which is supposedly faster and less memory-intensive
The link to the project repository can be found here: https://github.com/Martin-Msb/Constellation-II-Project