Synthetic Images Anomaly Detection with CLIP

Eyal Betzalel
3 min read · Jun 24, 2022


TL;DR

You have just generated a bunch of synthetic images with your favorite generative model. Most of them look great, but some look really bad. These are outliers. Since GANs, the most popular family of generative models, do not produce a likelihood score for their samples, you cannot tell which of the generated images are outliers.

With the following method, you can inspect your synthetic dataset more efficiently than by just looking at all images.

A realistic puppy generated by StyleGAN2-ADA. The kind of sample you want to keep in your dataset.

Intro

I co-wrote an article[1] about generative model evaluation methods such as the Fréchet Inception Distance (FID) and the Inception Score.

One of the tests we ran to support our hypotheses was to show that the latent features of the CLIP[2] encoder are approximately Gaussian and capture the essence of different kinds of image datasets.

CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on the task of matching images to captions. It was trained on 400M image-text pairs from a wide variety of domains and has been shown in multiple works to produce strong representations that are useful for image generation.
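If you want to try the encoder yourself, it is straightforward to use. Here is a minimal sketch, assuming the openai/CLIP package and its ViT-B/32 checkpoint; "sample.png" is just a placeholder path:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a pretrained CLIP encoder; ViT-B/32 is one of the published checkpoints
model, preprocess = clip.load("ViT-B/32", device=device)

# "sample.png" is a placeholder for any image you want to embed
image = preprocess(Image.open("sample.png")).unsqueeze(0).to(device)
with torch.no_grad():
    features = model.encode_image(image)  # a 512-dimensional feature vector for ViT-B/32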

The CLIP encoder acts as a powerful anomaly detector for synthetic images generated by GANs and VAEs. The following method lets you clean your synthetic dataset of bad outliers such as these:

Outliers detected by CLIP among synthetic images generated by StyleGAN2-ADA.

We relied on the basic assumption from the FID paper that the latent features at the deeper layers of the network are approximately Gaussian, and computed a multivariate Gaussian model of each dataset using CLIP instead of the original Inception network.

We propagated a set of synthetic images created by StyleGAN2-ADA[3] through the encoder and, for each image, used its latent representation to measure the probability that it belongs to the Gaussian model fitted on the original data.
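In symbols, if μ and Σ are the mean vector and covariance matrix estimated from the real-data CLIP features, each synthetic feature vector x of dimension d is scored with the standard multivariate Gaussian density:

p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right)

A low p(x) means the sample lies far from the bulk of the real data in CLIP feature space, i.e. it is a likely outlier.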

CLIP detects the outlier images very well and outperforms Inception as an anomaly detector.

Part of the CLIP architecture. In red: the CLIP encoder.

I chose to write this post because I believe that this is a practical tool that many can find useful.

Method

1. Randomly select 20K images from the original dataset used to train your generative model and compute their feature vectors with the CLIP encoder. If you have fewer samples, use your entire training set.

feature_tensor_source_arr = []
for i, batch_source in enumerate(tqdm(dl_source)):
    with torch.no_grad():
        # Encode each batch of real images into CLIP feature vectors
        batch_feature_tensor_source = model.encode_image(batch_source.to(device))
    feature_tensor_source_arr.append(batch_feature_tensor_source.cpu().numpy())
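The loop above assumes dl_source, a DataLoader that yields batches of real images already preprocessed for CLIP; building it is not shown in the post. One plausible setup, assuming the preprocess transform returned by clip.load and a folder of PNG files (both names are illustrative):

from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class ImageOnlyDataset(Dataset):
    # Yields preprocessed image tensors only, matching the loop above
    def __init__(self, root, transform):
        self.paths = sorted(Path(root).glob("*.png"))
        self.transform = transform
    def __len__(self):
        return len(self.paths)
    def __getitem__(self, idx):
        return self.transform(Image.open(self.paths[idx]).convert("RGB"))

dl_source = DataLoader(ImageOnlyDataset("real_images/", preprocess),
                       batch_size=64, shuffle=True, num_workers=2)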

2. Fit a Gaussian model to these feature vectors.

# Stack the batch features and fit a multivariate Gaussian to the real-data features
batch_feature_source_np = np.concatenate(feature_tensor_source_arr, axis=0)
mu1 = np.mean(batch_feature_source_np, axis=0)
sigma1 = np.cov(batch_feature_source_np, rowvar=False)
m = scipy.stats.multivariate_normal(mu1, sigma1)

3. Sample images from a generative model.
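Step 4 below scores a variable called batch_feature_test_np, which is not defined in the snippets above. It is simply the CLIP features of these synthetic samples, computed exactly as in step 1. A minimal sketch, where dl_test is a hypothetical DataLoader over the generated images:

feature_tensor_test_arr = []
for batch_test in tqdm(dl_test):
    with torch.no_grad():
        feats = model.encode_image(batch_test.to(device))  # CLIP features of synthetic images
    feature_tensor_test_arr.append(feats.cpu().numpy())
batch_feature_test_np = np.concatenate(feature_tensor_test_arr, axis=0)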

4. Calculate the probability that each synthetic sample belongs to the corresponding Gaussian model and rank the samples by this score.

prob = m.pdf(batch_feature_test_np)

5. Inspect outliers with low probabilities.
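One way to do the ranking and inspection, sketched with illustrative variable names: with 512-dimensional features the density from step 4 can underflow to zero, so the log-density of the same fitted model is a numerically safer score.

log_prob = m.logpdf(batch_feature_test_np)   # log-density under the real-data Gaussian
ranked_idx = np.argsort(log_prob)            # lowest scores first, i.e. most anomalous
worst_k = ranked_idx[:50]                    # inspect, e.g., the 50 most suspicious images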

The full code, including an installation and usage guide, can be found here.

More Examples

Outliers from CelebA synthetic dataset, generated by DCGAN.
Outliers from AFHQ (dogs class) synthetic dataset, generated by StyleGAN2-ADA.

You are welcome to ask me anything here.

Thanks to Ethan Fetaya, who guided me through the process.

Citations

[1] A Study on the Evaluation of Generative Models

[2] Learning Transferable Visual Models From Natural Language Supervision

[3] Training Generative Adversarial Networks with Limited Data

