Train a DL model for synthetic data generation for model optimization with OpenVINO — Part 1

Athanasios Masouris
Published in OpenVINO-toolkit
Sep 7, 2022 · 14 min read

Generative Adversarial Networks (GANs) have revolutionized the field of Deep Learning. From the adversarial training to the generation of synthetic data from nothing (random noise), GANs are an impressive framework with several applications, one of which is model optimization using GAN-generated data.

This blog is the first in a two-part series (Blog #2) on the Google Summer of Code 2022 project “Train a DL model for synthetic data generation for model optimization”, developed under the auspices of Intel’s OpenVINO Toolkit. The project consists of two parts. In the first part, covered in this blog, the goal was to train a lightweight Deep Learning model to generate synthetic images. In the second part, the model pre-trained here will be used to generate a dataset of synthetic CIFAR-10 images. This dataset will then be used for model optimization with OpenVINO’s Post-training Optimization Tool, and we will evaluate the performance of the 8-bit post-training quantization method on a range of Computer Vision models.

Google Summer of Code page: Program Project | Google Summer of Code
GitHub repository: ThanosM97/gsoc2022-openvino: Development of a DL model for synthetic data generation for model optimization using OpenVINO’s Post-training Optimization Toolkit. (github.com)

Introduction

Image Generation

Image Generation is a Computer Vision task that has long been researched in the literature. Studies leverage Generative Adversarial Networks (GANs) [1], which can produce synthetic images using only a random noise vector as input (Figure 1). Recent models [2] are able to generate images that are indistinguishable from real ones, even on complex datasets such as ImageNet [3]. Although this is an interesting task, a more practical, and also more complex, one is Conditional Image Generation: the task in which a generative model synthesizes realistic-looking images based on input conditions (Figure 2). The conditions can be attributes, text descriptions, or class labels, among others. Recent advances in this topic present models [4][5] that are able to generate high-quality and high-fidelity images, but at the expense of millions of parameters that require substantial computational resources.

Figure 1: Generative Adversarial Network (GAN) — Source: Image by author
Figure 2: Conditional Generative Adversarial Network (cGAN) — Source: Image by author

Knowledge Distillation

Training a GAN from scratch is an intricate procedure [6], especially on complex datasets. In addition, the current state-of-the-art models [4][5] showcase a trend of scaling up to achieve better performance. The question, then, is whether it is possible to generate quality images using a smaller model. In Romero et al. (2014) [7], the authors propose a knowledge distillation framework for network compression, in which the knowledge of a pre-trained teacher network (big model) is used to train a student network (small model) that achieves results comparable to the teacher’s while requiring significantly fewer parameters. In Chang et al. (2020) [8], the authors adopted this framework and proposed a black-box knowledge distillation method designed for GANs. Their proposed model, TinyGAN, successfully distils BigGAN, achieving competitive performance while reducing the number of parameters by a factor of 16.

Project

Although TinyGAN successfully reduces the number of parameters of BigGAN while maintaining competitive performance, it still requires a respectable amount of computational resources, both for training and for generating images. The purpose of the first part of this project is to investigate whether it is possible to generate quality images with an even shallower network than TinyGAN by leveraging the same distillation framework. Due to our limited computational resources, we opted for the CIFAR-10 [9] dataset, which, despite its small spatial size (32x32), is still complex enough to require a large model to generate quality images. In particular, the current state-of-the-art model for class-conditional image generation on CIFAR-10 is StyleGAN2-ADA [11], which contains more than 20 million trainable parameters. In the following sections we describe the distillation technique used, the training objectives, the dataset, and the architecture of the selected network.

Black-Box Distillation

We adopt the knowledge distillation framework proposed by Chang et al. (2020) [8], in which the teacher network is treated as a black box, requiring access only to its input-output pairs. In particular, the teacher model is utilized by collecting its outputs (generated images) given a random noise vector and a class label as input. The resulting collection consists not only of the generated images, but also of the input noise vectors and the class labels. This collection, or dataset, is then used to train the student network in a supervised way. Following this approach, no knowledge of the internal states or intermediate features of the teacher is required. In addition, once the dataset has been created, the teacher network can be discarded, since it no longer participates in the training of the student network.

Figure 3: Black-Box distillation framework proposed in Chang et al. (2020) [8]

Objectives

Figure 4: Illustration of the training objectives in Chang et al. (2020) [8].

The main goal of the knowledge distillation framework is for the student’s generator to approximate the distribution of the teacher’s generator, or mimic its functionality. With this goal in mind, we leverage the objectives described in [8]. In particular, the authors propose the following objective for the generator:
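A reconstruction of this objective from the four terms described below (the λ coefficients denote weighting hyperparameters; the exact notation in [8] may differ) is:

\mathcal{L}_G = \mathcal{L}_{\text{feat}} + \lambda_1\,\mathcal{L}_{\text{pix}} + \lambda_2\,\mathcal{L}_{\text{dist}}^{\text{adv}} + \lambda_3\,\mathcal{L}_{\text{GAN}}^{\text{adv}}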

where,

  • The Feature-Level Distillation Loss is calculated from feature maps extracted with the discriminator network for both the images generated by the student and those generated by the teacher. In particular, it is the weighted sum of the L1 distances between feature maps of different levels for each pair of student-teacher generated images. It is used to alleviate the blurriness of the generated images that occurs when only the pixel-level distillation loss described below is used.
  • The Pixel-Level Distillation Loss is calculated as the pixel-level L1 distance between the images generated by the teacher and the student networks. This loss is particularly important in the early stages of training, as it provides supervision while the student network tries to mimic the functionality of the teacher.
  • The Adversarial Distillation Loss is used to help the student’s generator better approximate the teacher’s distribution. In particular, the discriminator provides an adversarial loss for the images generated by the student network from the same inputs that were fed to the teacher network.
  • The Adversarial GAN Loss is used to ensure that the model also learns to produce images similar to those of the selected dataset. Thus, the discriminator provides an adversarial loss for images generated by the student network from random inputs. The generator loss is calculated in the same way as for the Adversarial Distillation Loss.

As for the discriminator, we use the following objective:
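A reconstruction of this objective from the two terms described below (λ again denotes a weighting hyperparameter; the exact notation in [8] may differ) is:

\mathcal{L}_D = \mathcal{L}_{\text{dist}}^{\text{adv}} + \lambda\,\mathcal{L}_{\text{GAN}}^{\text{adv}}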

where,

  • The Adversarial Distillation Loss encourages the discriminator to distinguish between images generated by the student network and those generated by the teacher network. In this case, the student images are treated as fake, while the teacher images are treated as real.
  • The Adversarial GAN Loss encourages the discriminator to distinguish between student-generated images and images from the real data distribution.

All of the aforementioned adversarial losses are calculated using the Hinge version [10] of the adversarial loss, instead of binary cross-entropy.
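For reference, the hinge formulation [10] of the adversarial losses, for a discriminator D, real samples x, and generated samples x̂, is:

\mathcal{L}_D^{\text{hinge}} = \mathbb{E}_{x \sim p_{\text{real}}}\big[\max(0,\, 1 - D(x))\big] + \mathbb{E}_{\hat{x} \sim p_{\text{fake}}}\big[\max(0,\, 1 + D(\hat{x}))\big], \qquad \mathcal{L}_G^{\text{hinge}} = -\,\mathbb{E}_{\hat{x} \sim p_{\text{fake}}}\big[D(\hat{x})\big]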

Dataset

Figure 5: The classes of CIFAR-10, along with 10 random images from each. (source)

The CIFAR-10 [9] dataset consists of 60,000 colour images from 10 different class categories, with 6,000 images per class. Although the spatial size of each image is small (32x32), the dataset is still complex enough to require a large model to generate quality images. More information about the dataset can be found on the official website.

Teacher Network

For the CIFAR-10 dataset, we opted to distil the StyleGAN2-ADA [11] model, which achieves state-of-the-art performance on the task of conditional image generation for CIFAR-10 [9]. The main reason behind this success is its adaptive discriminator augmentation mechanism, which significantly stabilizes training when limited data are available. In our case, however, the model is used purely as a black-box image generator, so regardless of the training procedure and techniques followed in that study, we only need access to the input-output pairs of the model’s generator. In particular, we use the official PyTorch implementation of StyleGAN2-ADA by NVIDIA Research Projects on GitHub, along with the provided weights pre-trained on CIFAR-10 for conditional image generation. StyleGAN2-ADA is thus used to create a FakeCIFAR10 dataset, consisting of images generated by the model, along with their corresponding input noise vectors and class labels. Subsequently, this dataset is used to train the student network to mimic the functionality of the teacher network (i.e. StyleGAN2-ADA), using the previously described objectives. Once the dataset has been created, the StyleGAN2-ADA model no longer takes part in the training procedure of the student network and is discarded.
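The collection step itself boils down to a simple loop. The sketch below illustrates the idea in Python; the pickle-loading pattern follows the usage documented in NVIDIA’s repository, but the file names and the way the triplets are stored are assumptions made for illustration, not the project’s create_dataset.py script.

# Hedged sketch: black-box sample collection from a pre-trained StyleGAN2-ADA generator.
# Requires NVIDIA's stylegan2-ada-pytorch repository on the Python path so the pickle can resolve.
import pickle
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

with open("cifar10.pkl", "rb") as f:                  # assumed path to the pre-trained CIFAR-10 weights
    G = pickle.load(f)["G_ema"].to(device).eval()     # exponential-moving-average generator

collection = []                                       # (noise, label, image) triplets
for class_idx in range(10):                           # 10 CIFAR-10 classes
    for _ in range(5000):                             # 5,000 samples per class
        z = torch.randn(1, G.z_dim, device=device)    # random noise vector
        c = torch.zeros(1, G.c_dim, device=device)    # one-hot class condition
        c[0, class_idx] = 1
        with torch.no_grad():
            img = G(z, c)                             # generated image, values in [-1, 1]
        collection.append((z.cpu(), class_idx, img.cpu()))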

FakeCIFAR10

The FakeCIFAR10 dataset consists of 50,000 synthetic images, generated by the StyleGAN2-ADA model. There are 5,000 images for each of the 10 classes, along with the noise vectors that were used as input to StyleGAN’s Generator. The dataset can be downloaded from here, or can be recreated following these instructions.

Student Network

Figure 6: Overview of DiStyleGAN’s architecture — Source: Image by author

Generator
Initially, the Gaussian random noise vector is projected to 128 dimensions using a fully connected layer. Subsequently, the condition embedding and the projected noise vector are concatenated and passed through another fully connected layer, which is followed by 3 consecutive upsampling blocks. Each upsampling block consists of an upsample layer (scale_factor=2, mode=’nearest’), a 3x3 convolution with padding, a Batch Normalization layer, and a Gated Linear Unit (GLU). Finally, a convolutional block, consisting of a 3x3 convolution with padding and a hyperbolic tangent (tanh) activation function, produces the fake image.
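A minimal PyTorch sketch of this generator is shown below. The default dimensions (z_dim=512, c_dim=10, project_dim=128, ngf=256) are taken from the training options listed later, while the 4x4 starting resolution and the exact layer wiring are assumptions made for illustration; the repository contains the actual implementation.

# Simplified sketch of DiStyleGAN's generator (not the exact repository code).
import torch
import torch.nn as nn


class UpsampleBlock(nn.Module):
    """Upsample x2 -> 3x3 conv -> BatchNorm -> GLU (the GLU halves the channels)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(in_ch, out_ch * 2, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch * 2),
            nn.GLU(dim=1),
        )

    def forward(self, x):
        return self.block(x)


class Generator(nn.Module):
    def __init__(self, z_dim=512, c_dim=10, project_dim=128, ngf=256):
        super().__init__()
        self.ngf = ngf
        self.project_noise = nn.Linear(z_dim, project_dim)   # noise -> 128 dims
        self.project_cond = nn.Linear(c_dim, project_dim)    # condition embedding
        self.fc = nn.Linear(2 * project_dim, ngf * 4 * 4)    # joint projection to a 4x4 map
        self.upsample = nn.Sequential(                       # 4x4 -> 8x8 -> 16x16 -> 32x32
            UpsampleBlock(ngf, ngf // 2),
            UpsampleBlock(ngf // 2, ngf // 4),
            UpsampleBlock(ngf // 4, ngf // 8),
        )
        self.to_image = nn.Sequential(                       # final convolutional block
            nn.Conv2d(ngf // 8, 3, kernel_size=3, padding=1),
            nn.Tanh(),
        )

    def forward(self, noise, condition):
        z = self.project_noise(noise)
        c = self.project_cond(condition)
        x = self.fc(torch.cat([z, c], dim=1))
        x = x.view(x.size(0), self.ngf, 4, 4)
        return self.to_image(self.upsample(x))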

Discriminator
DiStyleGAN’s discriminator consists of 4 consecutive downsampling blocks (4x4 strided convolution, Spectral Normalization, and a LeakyReLU), each of which reduces the spatial size of the input image by a factor of 2. These four blocks also produce the feature maps used for the calculation of the feature-level distillation loss. Subsequently, the output feature map is flattened, projected to 128 dimensions, and concatenated with the class condition embedding, before being passed through a final fully connected layer to produce the class-conditional discriminator output.
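A corresponding PyTorch sketch of the discriminator is given below, again with assumed default dimensions (ndf=128, c_dim=10, project_dim=128) and simplified wiring; the feature maps returned by the four blocks are the ones used for the feature-level loss.

# Simplified sketch of DiStyleGAN's discriminator (not the exact repository code).
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm


class DownsampleBlock(nn.Module):
    """4x4 strided convolution with Spectral Normalization, followed by LeakyReLU (halves H and W)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class Discriminator(nn.Module):
    def __init__(self, ndf=128, c_dim=10, project_dim=128):
        super().__init__()
        self.blocks = nn.ModuleList([                 # 32x32 -> 16x16 -> 8x8 -> 4x4 -> 2x2
            DownsampleBlock(3, ndf),
            DownsampleBlock(ndf, ndf * 2),
            DownsampleBlock(ndf * 2, ndf * 4),
            DownsampleBlock(ndf * 4, ndf * 8),
        ])
        self.project = nn.Linear(ndf * 8 * 2 * 2, project_dim)   # flatten and project to 128 dims
        self.embed_cond = nn.Linear(c_dim, project_dim)          # class condition embedding
        self.fc_out = nn.Linear(2 * project_dim, 1)              # class-conditional output

    def forward(self, image, condition):
        feature_maps = []                             # intermediate maps for the feature-level loss
        x = image
        for block in self.blocks:
            x = block(x)
            feature_maps.append(x)
        x = self.project(torch.flatten(x, start_dim=1))
        out = self.fc_out(torch.cat([x, self.embed_cond(condition)], dim=1))
        return out, feature_maps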

Training

  1. Clone the GitHub repository.
  2. Install the python packages in the requirements file. (Python 3.10)
  3. Download the FakeCIFAR10 dataset from here and extract the zip file, or recreate it using the create_dataset.py script, following the instructions here.
  4. Train the model using one of the following options:

     - Using the default configurations with the example below in Python

Train DiStyleGAN in Python
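A rough sketch of such a call is shown below; the DiStyleGAN class name and the train() keyword arguments are assumptions that mirror the command-line options, so refer to the repository’s README for the canonical snippet.

# Hypothetical Python usage mirroring the CLI defaults (see the repository for the actual example).
from distylegan import DiStyleGAN

model = DiStyleGAN()          # default network configuration
model.train(
    dataset="FakeCIFAR10/",   # dataset generated by the teacher network
    save="checkpoints/",      # where checkpoints and results will be stored
)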

     - Using the command line options for the corresponding Python script

$ python distylegan.py train -h
usage: distylegan.py train [-h] --dataset DATASET --save SAVE [--c_dim C_DIM]
       [--lambda_ganD LAMBDA_GAND] [--lambda_ganG LAMBDA_GANG]
       [--lambda_pixel LAMBDA_PIXEL] [--nc NC] [--ndf NDF] [--ngf NGF]
       [--project_dim PROJECT_DIM] [--transform TRANSFORM] [--z_dim Z_DIM]
       [--adam_momentum ADAM_MOMENTUM] [--batch_size BATCH_SIZE]
       [--checkpoint_interval CHECKPOINT_INTERVAL]
       [--checkpoint_path CHECKPOINT_PATH] [--device DEVICE] [--epochs EPOCHS]
       [--gstep GSTEP] [--lr_D LR_D] [--lr_G LR_G] [--lr_decay LR_DECAY]
       [--num_test NUM_TEST] [--num_workers NUM_WORKERS]
       [--real_dataset REAL_DATASET]
options:
-h, --help show this help message and exit
Required arguments for the training procedure:
--dataset DATASET Path to the dataset directory of the fake CIFAR10 data generated by the teacher network
--save SAVE Path to save checkpoints and results
Optional arguments about the network configuration:
--c_dim C_DIM Condition dimension (Default: 10)
--lambda_ganD LAMBDA_GAND
Weight for the adversarial GAN loss of the
Discriminator (Default: 0.2)
--lambda_ganG LAMBDA_GANG
Weight for the adversarial distillation loss
of the Generator (Default: 0.01)
--lambda_pixel LAMBDA_PIXEL
Weight for the pixel loss of the Generator
(Default: 0.2)
--nc NC Number of channels for the images
(Default: 3)
--ndf NDF Number of discriminator filters in the first
convolutional layer (Default: 128)
--ngf NGF Number of generator filters in the first
convolutional layer (Default: 256)
--project_dim PROJECT_DIM
Dimension to project the input condition
(Default: 128)
--transform TRANSFORM
Optional transform to be applied on a sample
image (Default: None)
--z_dim Z_DIM Noise dimension (Default: 512)
Optional arguments about the training procedure:
--adam_momentum ADAM_MOMENTUM
Momentum value for the Adam optimizers'
betas (Default: 0.5)
--batch_size BATCH_SIZE
Number of samples per batch (Default: 128)
--checkpoint_interval CHECKPOINT_INTERVAL
Checkpoints will be saved every `
checkpoint_interval` epochs (Default: 20)
--checkpoint_path CHECKPOINT_PATH
Path to previous checkpoint
--device DEVICE Device to use for training ('cpu' or 'cuda')
(Default: If there is a CUDA device
available, it will be used for training)
--epochs EPOCHS Number of training epochs (Default: 150)
--gstep GSTEP The number of discriminator updates after
which the generator is updated using the
full loss (Default: 10)
--lr_D LR_D Learning rate for the discriminator's Adam
optimizer (Default: 0.0002)
--lr_G LR_G Learning rate for the generator's Adam
optimizer (Default: 0.0002)
--lr_decay LR_DECAY Iteration to start decaying the learning
rates for the Generator and the
Discriminator (Default: 350000)
--num_test NUM_TEST Number of generated images for evaluation
(Default: 30)
--num_workers NUM_WORKERS
Number of subprocesses to use for data
loading (Default: 0, which means that the
data will be loaded in the main process)
--real_dataset REAL_DATASET
Path to the dataset directory of the real
CIFAR10 data. (Default: None, it will be
downloaded and saved in the parent directory
of input `dataset` path)

Image Generation

  1. Clone the GitHub repository.
  2. Install the python packages in the requirements file. (Python 3.10)
  3. Download the checkpoint for our pre-trained model and extract the zip file in the root directory of our repository.
  4. Generate synthetic samples using one of the following options:

     - Example in Python

Generate samples with DiStyleGAN in Python
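A rough sketch of such a call is shown below; again, the class and method names are assumptions mirroring the command-line options, and the repository’s README contains the canonical snippet.

# Hypothetical Python usage mirroring the CLI options (see the repository for the actual example).
from distylegan import DiStyleGAN

model = DiStyleGAN()
model.generate(
    checkpoint_path="checkpoint/",   # directory containing generator.pt and config.json
    nsamples=10,                     # number of samples per label
    save="samples/",                 # where the generated images will be saved
    label=[0, 3, 7],                 # optional class labels (None for random labels)
)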

     - Using the command line options for the corresponding Python script

$ python distylegan.py generate -h
usage: distylegan.py generate [-h] --checkpoint_path CHECKPOINT_PATH --nsamples NSAMPLES --save SAVE [--label [{0,1,2,3,4,5,6,7,8,9} ...]] [--batch_size BATCH_SIZE]
options:
-h, --help show this help message and exit
Required arguments for the generation procedure:
--checkpoint_path CHECKPOINT_PATH
Path to previous checkpoint (the directory
must contain the generator.pt and
config.json files)
--nsamples NSAMPLES Number of samples to generate per label
--save SAVE Path to save the generated images to
Optional arguments about the generation procedure:
--label [{0,1,2,3,4,5,6,7,8,9} ...]
Class label(s) for the samples (Default:
None, random labels) --> e.g. --label 0 3 7
--batch_size BATCH_SIZE
Number of samples per batch (Default: 32)

     - Using the Flask webapp, by running the command flask run inside the webapp/ directory of our repository. Then, following the link displayed in the command line (e.g. http://127.0.0.1:5000), you will be presented with the interface shown in Figure 7.

Figure 7: Flask webapp for synthetic sample generation using DiStyleGAN — Source: Image by author

Evaluation

The performance of the model was evaluated both qualitatively and quantitatively.

Qualitative Evaluation
For the qualitative evaluation, synthetic samples are generated at each epoch of training, using the same noise vectors as inputs. Then, using the gifmaker.py script, which produces a GIF of the evolution of the synthetic samples throughout training, we can inspect the training progress in terms of the image quality of the generated samples. Figure 8 showcases the GIF created for the training of the DiStyleGAN model.
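Assembling such a GIF takes only a few lines with Pillow; the snippet below is a generic illustration of the idea (with an assumed file-naming pattern), not the repository’s gifmaker.py script.

# Generic illustration: combine per-epoch sample grids into an animated GIF with Pillow.
from pathlib import Path
from PIL import Image

frames = [Image.open(p) for p in sorted(Path("results").glob("epoch_*.png"))]  # assumed file pattern
frames[0].save(
    "training_progress.gif",
    save_all=True,
    append_images=frames[1:],
    duration=200,   # milliseconds per frame
    loop=0,         # loop forever
)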

Figure 8: Evolution of the synthetic samples of DiStyleGAN throughout training — Source: Image by author

Additionally, inspired by Zhang et al. [12], the t-SNE algorithm [13] was utilized to investigate whether the trained model suffers from mode collapse, a common GAN failure in which the generator produces only a small set of outputs that manage to fool the discriminator. In particular, having generated 5,000 samples (500 per class), we used the tsne.py script to create a t-SNE visualization of the synthetic samples on a 2D grid: a pre-trained VGG19 model extracts 4096-dimensional features, which are compressed to 300 dimensions with PCA and then mapped to cartesian coordinates (2D) with t-SNE. Following this procedure, similar-looking images are placed in neighbouring tiles of the grid, allowing us to check for mode collapse. Figure 9 illustrates the resulting grid for our pre-trained model. We did not observe any noticeable mode collapse patterns.
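The feature-extraction pipeline behind this visualization can be sketched as follows; this is a simplified rendition of the procedure (layer choice and preprocessing are assumptions), not the repository’s tsne.py script.

# Simplified sketch of the t-SNE visualization pipeline: VGG19 features -> PCA(300) -> t-SNE(2D).
import numpy as np
import torch
from torchvision.models import vgg19
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def embed_samples(images: torch.Tensor) -> np.ndarray:
    """images: (N, 3, H, W) generated samples, normalized as expected by VGG19."""
    model = vgg19(weights="IMAGENET1K_V1").eval()
    # Take the 4096-dimensional activations of the first fully connected layer
    extractor = torch.nn.Sequential(
        model.features,
        model.avgpool,
        torch.nn.Flatten(),
        *list(model.classifier[:2]),
    )
    with torch.no_grad():
        feats = extractor(images).numpy()                 # (N, 4096)
    feats = PCA(n_components=300).fit_transform(feats)    # compress to 300 dimensions
    return TSNE(n_components=2).fit_transform(feats)      # 2D coordinates for the grid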

Figure 9: t-SNE visualization of DiStyleGAN’s synthetic samples — Source: Image by author

Quantitative Evaluation

For the quantitative evaluation, we calculate the Inception Score (IS) [14] and the Fréchet Inception Distance (FID) [15]. We used the TensorFlow implementation of the two metrics by Junho Kim and Ahmed Fares on GitHub. The results presented below were calculated on 50,000 synthetic images (5,000 per class). For the calculation of the Inception Score, we used 10 splits.
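For reference, the Inception Score is defined as IS = exp( E_x[ KL( p(y|x) || p(y) ) ] ), computed over splits of the generated images. A minimal NumPy sketch of this calculation, assuming the Inception-v3 softmax probabilities have already been computed, is shown below; it is an illustration of the formula, not the TensorFlow implementation used for the reported results.

# Minimal sketch: Inception Score from precomputed Inception-v3 softmax probabilities.
import numpy as np

def inception_score(probs: np.ndarray, splits: int = 10):
    """probs: array of shape (N, 1000) with the class probabilities of the generated images."""
    scores = []
    for chunk in np.array_split(probs, splits):
        p_y = chunk.mean(axis=0, keepdims=True)                              # marginal class distribution
        kl = (chunk * (np.log(chunk + 1e-12) - np.log(p_y + 1e-12))).sum(axis=1)
        scores.append(np.exp(kl.mean()))
    return float(np.mean(scores)), float(np.std(scores))                     # mean and std over splits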

Figure 10: Inception Score and Fréchet Inception Distance for DiStyleGAN on CIFAR-10

Conclusion

Developing a model for conditional image generation constitutes a challenging task. Even on seemingly simple datasets, such as CIFAR-10, only huge models with millions of parameters are able to generate quality images. By adopting the knowledge distillation framework of Romero et al. (2014) [7], as adapted for GANs by Chang et al. (2020) [8], while using a network architecture similar to DCGAN’s [16], we were able to improve on DCGAN’s performance on this task with respect to the Inception Score metric. In the second part of the project, we will use the synthetic data generated by DiStyleGAN to optimize the performance of a range of computer vision models.

References

[1] Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in neural information processing systems 27 (2014).

[2] Sauer, Axel, Katja Schwarz, and Andreas Geiger. “Stylegan-xl: Scaling stylegan to large diverse datasets.” arXiv preprint arXiv:2202.00273 (2022).

[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, ImageNet: A Large-Scale Hierarchical Image Database. IEEE Computer Vision and Pattern Recognition (CVPR), 2009.

[4] Kang, Minguk, et al. “Rebooting acgan: Auxiliary classifier gans with stable training.” Advances in Neural Information Processing Systems 34 (2021): 23505–23518.

[5] Brock, Andrew, Jeff Donahue, and Karen Simonyan. “Large scale GAN training for high fidelity natural image synthesis.” arXiv preprint arXiv:1809.11096 (2018).

[6] Salimans, Tim, et al. “Improved techniques for training gans.” Advances in neural information processing systems 29 (2016).

[7] Romero, Adriana, et al. “Fitnets: Hints for thin deep nets.” arXiv preprint arXiv:1412.6550 (2014).

[8] Chang, Ting-Yun, and Chi-Jen Lu. “Tinygan: Distilling biggan for conditional image generation.” Proceedings of the Asian Conference on Computer Vision. 2020.

[9] Krizhevsky, Alex, and Geoffrey Hinton. “Learning multiple layers of features from tiny images.” (2009): 7.

[10] Lim, Jae Hyun, and Jong Chul Ye. “Geometric gan.” arXiv preprint arXiv:1705.02894 (2017).

[11] Karras, Tero, et al. “Training generative adversarial networks with limited data.” Advances in Neural Information Processing Systems 33 (2020): 12104–12114.

[12] Zhang, Han, et al. “Stackgan++: Realistic image synthesis with stacked generative adversarial networks.” IEEE transactions on pattern analysis and machine intelligence 41.8 (2018): 1947–1962.

[13] Laurens van der Maaten and Geoffrey Hinton, “Visualizing Data using t-SNE”, Journal of Machine Learning Research, vol. 9, no. 86, pp. 2579–2605, 2008.

[14] Salimans, Tim, et al. “Improved techniques for training gans.” Advances in neural information processing systems 29 (2016).

[15] Heusel, Martin, et al. “Gans trained by a two time-scale update rule converge to a local nash equilibrium.” Advances in neural information processing systems 30 (2017).

[16] Radford, Alec, Luke Metz, and Soumith Chintala. “Unsupervised representation learning with deep convolutional generative adversarial networks.” arXiv preprint arXiv:1511.06434 (2015).
