Express Lane to Content Safety: Zero-Shot NSFW Filter with Stable Diffusion and CLIP

Esteve Segura
8 min read · Aug 21, 2023


CLIP’s Multimodal Mastery

This multi-modal model is a bridge between the realms of Natural Language Processing and Computer Vision, allowing for a seamless association of text with images. But what does this mean in real terms?

For a clearer picture, take a brief look at the accompanying animation. You’ll see the word ‘apple’ gradually intertwining with images of apples, illustrating CLIP’s adeptness at understanding this relationship. The same goes for images of bananas connecting with their textual counterpart. This isn’t just a dance of words and images; it’s a testament to the future potential of AI.

The significance? By establishing such connections, we’re not only witnessing a technological marvel but also paving the way for revolutionary applications, from enhancing search capabilities to refining AI interactions. In essence, with models like CLIP, the digital world becomes more integrated, intuitive, and ready for groundbreaking applications.

Animation showing how CLIP learns concepts with text/images.
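To make this concrete, here is a minimal sketch of zero-shot labelling with CLIP via Hugging Face’s transformers library. The checkpoint name and the image path are assumptions chosen for illustration, not something prescribed by the article:

# Minimal zero-shot labelling sketch with CLIP.
# The checkpoint and the image path are placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("fruit.jpg")  # hypothetical local image
labels = ["a photo of an apple", "a photo of a banana"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores;
# softmax turns them into probabilities over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))

No apple- or banana-specific training is involved: the labels are just text, which is exactly the zero-shot behaviour the rest of this article builds on.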

Leveraging Pre-trained Power: The Marriage of Stable Diffusion and CLIP

In the expansive realm of AI, having access to a model pre-trained on a vast dataset is a significant advantage. Enter Stable Diffusion: a model trained on an immense collection of images, bestowing upon it a robustness that’s hard to replicate. Its depth of training makes it a top choice for our purposes.

But there’s another layer to Stable Diffusion’s prowess. Nested within it is CLIP, a model renowned for its zero-shot capabilities. Zero-shot learning, for the uninitiated, is the ability of an AI to generalize and act on unseen data without specific prior training.

To sum it up, this integration of CLIP + Stable Diffusion provides a dual advantage: the expansive image training of Stable Diffusion and the adaptable zero-shot learning of CLIP. Together, they form a comprehensive solution, ready to be deployed without further ado.
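To see how the two ship together, here is a hedged sketch of loading a Stable Diffusion pipeline with diffusers and peeking at the CLIP-based safety checker bundled inside it. The checkpoint name is an assumption; any Stable Diffusion 1.x checkpoint exposes the same components:

# Sketch: the safety checker and its CLIP preprocessor ride along
# with the Stable Diffusion pipeline (checkpoint name is an assumption).
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# The CLIP-based NSFW classifier and the preprocessor that feeds it.
print(type(pipe.safety_checker).__name__)     # StableDiffusionSafetyChecker
print(type(pipe.feature_extractor).__name__)  # CLIPImageProcessor (CLIPFeatureExtractor in older versions)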

I have created an interactive visualisation that lets you navigate in three dimensions through a vector space simulating the embeddings of images trained with CLIP: https://girlazo.com/clip-image-latent-vector-space/

Simulation of image embeddings in a 3D representation, mimicking a space trained with CLIP. https://girlazo.com/clip-image-latent-vector-space/

Beyond Image Generation: The Invisible Shield of Stable Diffusion’s Safety Checker

Stable Diffusion is not simply an image generator; it’s a multi-functional AI tool with an array of capabilities. One such discreet functionality, the “Safety Checker”, plays a pivotal role in content moderation, especially in detecting NSFW (Not Safe for Work) content. But it’s the hidden nature of this feature that truly sets it apart.

The Safety Checker operates by keeping its NSFW detection concepts concealed. By doing so, it ensures a higher level of security, making it less vulnerable to being bypassed. At the heart of this system lies the creation of embeddings for both images and text, essentially converting them into numerical representations in a latent space.

So, how does this work in practice?

Imagine we feed the system an image of an apple. The embedded representation of this apple will naturally have a short distance to other apples within this latent space, given their similarity. Think of it as a clustering effect, where similar items gravitate close to each other in this representation space.

Now, let’s consider a more sensitive scenario. If an explicit or NSFW image is introduced to the system (e.g., nudity), its embedding will be much closer to the hidden NSFW concepts within Stable Diffusion’s list. The shorter this distance, the higher the probability that the content falls under the NSFW category.

Essentially, by measuring the proximity of embeddings, the Safety Checker can determine the nature of the content, acting as an invisible shield against inappropriate content. It’s a blend of discretion and efficiency, ensuring the right content gets through while safeguarding digital platforms against potentially harmful material.

Determining what an image depicts by measuring its distance in vector space from other elements.
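As a toy illustration of this “shortest distance wins” idea, the sketch below scores an image embedding against two concept embeddings. The 3-dimensional vectors are made up; real CLIP embeddings have hundreds of dimensions, but the logic is identical:

# Toy example: nearest-concept matching with cosine similarity.
# The vectors are invented placeholders, not real CLIP embeddings.
import torch
from torch.nn import functional as F

image_embed = torch.tensor([[0.9, 0.1, 0.2]])
concept_embeds = torch.tensor([
    [0.8, 0.2, 0.1],   # stand-in for "apple"
    [0.1, 0.9, 0.3],   # stand-in for "nsfw"
])

# Higher cosine similarity = shorter distance in the latent space.
similarity = F.cosine_similarity(image_embed, concept_embeds)
print(similarity)
print(f"Closest concept index: {similarity.argmax().item()}")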

For a deeper dive into how embeddings operate, check out our article on the inner workings of GloVe.

From Black Boxes to Transparency: Decoding Stable Diffusion’s NSFW Detection

Stable Diffusion’s Safety Checker is an astute module dedicated to spotting potential NSFW content using the power of embeddings. Here is the safety_checker function in its original repository. Let's decode the crucial sections:

pooled_output = self.vision_model(clip_input)[1]
image_embeds = self.visual_projection(pooled_output)

Here, images undergo a transformation process. The vision model extracts embeddings, which are then projected into CLIP’s shared image-text space using visual_projection.

special_cos_dist = cosine_distance(image_embeds, self.special_care_embeds).cpu().float().numpy()
cos_dist = cosine_distance(image_embeds, self.concept_embeds).cpu().float().numpy()

The cosine_distance function measures the proximity of image embeddings to the predefined ‘special care’ and generic ‘concept’ embeddings. Despite its name, it returns cosine similarity (the dot product of normalized embeddings), so the higher the value, the more similar the embeddings are and the more likely the content is NSFW.

if result_img["special_scores"][concept_idx] > 0:
    result_img["special_care"].append({concept_idx, result_img["special_scores"][concept_idx]})
    adjustment = 0.01

For every image, the code checks if the calculated cosine distance surpasses certain thresholds. If it does, the image is flagged.

images[idx] = torch.zeros_like(images[idx])  # black image

Images deemed NSFW are replaced with a black placeholder, preventing explicit content from being shown.

However, the code keeps the exact NSFW concepts it checks against concealed. This opacity, although offering protection against manipulation, limits customizability. Our goal? Reimplement this function with our own NSFW concept embeddings, allowing us to have clearer criteria for flagging content.

In upcoming sections, we’ll develop our custom embeddings and integrate them into this safety checker for a more personalized approach.
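Before re-implementing it, it can be instructive to confirm that the shipped checker really does store only opaque vectors. A small sketch that loads the checker and prints the shapes of its hidden embedding matrices; the CompVis/stable-diffusion-safety-checker checkpoint name is an assumption here:

# Sketch: the shipped checker exposes only anonymous embedding matrices,
# not the text of the concepts they encode.
from diffusers.pipelines.stable_diffusion.safety_checker import StableDiffusionSafetyChecker

checker = StableDiffusionSafetyChecker.from_pretrained("CompVis/stable-diffusion-safety-checker")

print(checker.concept_embeds.shape)        # one row per hidden NSFW concept
print(checker.special_care_embeds.shape)   # one row per hidden "special care" concept
print(checker.concept_embeds_weights)      # per-concept thresholds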

Customizing Stable Diffusion’s Safety Checker for Precise NSFW Filtering

One key component of content moderation is ensuring that images follow a defined set of standards. Stable Diffusion offers a robust safety checker, but like any machine learning tool, it requires customization to better align with specific requirements.

For our purpose, we’ve identified two arrays: concepts and special_concepts.

  • concepts: This is our primary NSFW filter. It comprises terms like 'sexual', 'nude', and 'explicit content'. This array can be easily extended to include other potential NSFW terms.
concepts = ['sexual', 'nude', 'sex', '18+', 'naked', 'nsfw', 'porn', 'explicit content', 'uncensored']
  • special_concepts: This set addresses a more delicate aspect. Images of children can be sensitive in many contexts, so we want to be particularly cautious here. To achieve this, we've added terms such as 'little girl' and 'young child' to our list.
special_concepts = ["little girl", "young child", "young girl"]

In the code, we utilize the cosine_distance function to determine the similarity between image embeddings and text embeddings. The closer the value to 1, the higher the similarity. If a significant match is found against any of the NSFW or special concepts, the image is flagged accordingly.

To determine matches:

  • We loop over each image embedding and calculate the cosine distance with both our concepts and special concepts.
  • If any image has a strong match with an NSFW term, it is flagged.
  • If the image matches any special concept, it’s flagged under the ‘special’ category.

Our custom function, forward_inspect, returns the matches and indicates if any image contains NSFW concepts.

This customization allows us to have a granular approach towards image filtering. By defining our own NSFW and special terms, we can tailor Stable Diffusion’s robust model to our specific needs, ensuring the content we host meets our stringent safety standards.
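One detail the full implementation below glosses over is how the plain-text concepts and special_concepts lists become the concept_embeds and special_care_embeds tensors the checker compares against. Here is a hedged sketch of one way to do it, using CLIP’s text encoder and projection so the text lands in the same space as the projected image embeddings. The checkpoint names, the 0.2 thresholds, and the idea of overwriting the checker’s parameters are assumptions for illustration, not the only possible approach:

# Sketch: turn the concept strings into CLIP text embeddings and
# overwrite the checker's hidden matrices with them.
import torch
from torch import nn
from diffusers.pipelines.stable_diffusion.safety_checker import StableDiffusionSafetyChecker
from transformers import CLIPModel, CLIPTokenizer

concepts = ['sexual', 'nude', 'sex', '18+', 'naked', 'nsfw', 'porn', 'explicit content', 'uncensored']
special_concepts = ["little girl", "young child", "young girl"]

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def embed_texts(texts):
    tokens = tokenizer(texts, padding=True, return_tensors="pt")
    # get_text_features runs the text encoder plus CLIP's text projection,
    # producing vectors comparable to the projected image embeddings.
    return clip.get_text_features(**tokens)

checker = StableDiffusionSafetyChecker.from_pretrained("CompVis/stable-diffusion-safety-checker")
checker.concept_embeds = nn.Parameter(embed_texts(concepts), requires_grad=False)
checker.special_care_embeds = nn.Parameter(embed_texts(special_concepts), requires_grad=False)
# Per-concept thresholds; 0.2 is a placeholder to be tuned on real data.
checker.concept_embeds_weights = nn.Parameter(torch.full((len(concepts),), 0.2), requires_grad=False)
checker.special_care_embeds_weights = nn.Parameter(torch.full((len(special_concepts),), 0.2), requires_grad=False)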

This is the complete implementation of the safety checker:

# Import necessary libraries
import torch
from torch import nn

# List of standard NSFW concepts to filter out
concepts = ['sexual', 'nude', 'sex', '18+', 'naked', 'nsfw', 'porn', 'explicit content', 'uncensored']

# List of special concepts, focusing on protecting images of minors
special_concepts = ["little girl", "young child", "young girl"]

# Function to compute the cosine similarity between image and text embeddings
def cosine_distance(image_embeds, text_embeds):
    # Normalize the image embeddings so the dot product equals cosine similarity
    normalized_image_embeds = nn.functional.normalize(image_embeds)
    # Normalize the text embeddings for the same reason
    normalized_text_embeds = nn.functional.normalize(text_embeds)
    # Compute the dot product between the normalized embeddings
    return torch.mm(normalized_image_embeds, normalized_text_embeds.t())

# Decorator to ensure no gradients are computed during the forward pass, for efficiency
@torch.no_grad()
def forward_inspect(self, clip_input, images):
    # Extract the pooled output from the vision model, which gives embeddings for the images
    pooled_output = self.vision_model(clip_input)[1]
    # Project the embeddings into the space shared with the text embeddings
    image_embeds = self.visual_projection(pooled_output)

    # Cosine similarity between image embeddings and special concepts
    special_cos_dist = cosine_distance(
        image_embeds, self.special_care_embeds
    ).cpu().numpy()

    # Cosine similarity between image embeddings and standard NSFW concepts
    cos_dist = cosine_distance(image_embeds, self.concept_embeds).cpu().numpy()

    # Dictionary to store matched NSFW and special terms for each image
    matches = {"nsfw": [], "special": []}
    batch_size = image_embeds.shape[0]

    # Iterate over each image in the batch
    for i in range(batch_size):
        # Dictionary to store matching scores and concepts for the current image
        result_img = {
            "special_scores": {}, "special_care": [], "concept_scores": {}, "bad_concepts": []
        }

        adjustment = 0.0

        # Check the current image against the list of special concepts
        for concept_idx in range(len(special_cos_dist[0])):
            concept_cos = special_cos_dist[i][concept_idx]
            concept_threshold = self.special_care_embeds_weights[concept_idx].item()

            # Compute how strongly the current image matches the special concept
            result_img["special_scores"][concept_idx] = round(
                concept_cos - concept_threshold + adjustment, 3
            )
            # If there's a strong match, flag the image and lower the bar for the remaining checks
            if result_img["special_scores"][concept_idx] > 0:
                result_img["special_care"].append(
                    {concept_idx, result_img["special_scores"][concept_idx]}
                )
                adjustment = 0.01
                matches["special"].append(special_concepts[concept_idx])

        # Check the current image against the list of standard NSFW concepts
        for concept_idx in range(len(cos_dist[0])):
            concept_cos = cos_dist[i][concept_idx]
            concept_threshold = self.concept_embeds_weights[concept_idx].item()

            # Compute how strongly the current image matches the NSFW concept
            result_img["concept_scores"][concept_idx] = round(
                concept_cos - concept_threshold + adjustment, 3
            )

            # If there's a strong match, flag the image
            if result_img["concept_scores"][concept_idx] > 0:
                result_img["bad_concepts"].append(concept_idx)
                matches["nsfw"].append(concepts[concept_idx])

    # Check if any images have been flagged as NSFW
    has_nsfw_concepts = len(matches["nsfw"]) > 0

    # Return the matches and the flag status
    return matches, has_nsfw_concepts
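Finally, a sketch of how one might attach forward_inspect to a checker whose concept embeddings have already been swapped for our own text lists (as sketched earlier, so the matrix rows line up with concepts and special_concepts) and classify a single image. The monkey-patching approach, the checkpoint name, and the image path are assumptions used for illustration:

# Sketch: bind forward_inspect to the customized checker and classify one image.
# Assumes `checker`, `concepts`, and `special_concepts` from the earlier sketches.
from types import MethodType
from PIL import Image
from transformers import CLIPImageProcessor

feature_extractor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Attach our standalone function as a bound method of this checker instance.
checker.forward_inspect = MethodType(forward_inspect, checker)

image = Image.open("test.jpg")  # placeholder input image
clip_input = feature_extractor(images=image, return_tensors="pt").pixel_values

matches, has_nsfw = checker.forward_inspect(clip_input=clip_input, images=[image])
print(matches)    # e.g. {"nsfw": [...], "special": [...]}
print(has_nsfw)   # True if any NSFW concept matched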

Not Just for NSFW: The Expansive Horizon of Stable Diffusion

At first glance, this might seem like just another image moderation system, albeit with added perks like zero-shot learning and concept customization. But its power stretches far beyond that:

  • Evolution and Open Source: Being open-source, Stable Diffusion is in perpetual growth. With the community’s contributions, its proficiency in image detection will only improve over time.
  • Fine-Tuning with Minimal Data: Consider customizing Stable Diffusion to identify Swastikas. With a handful of images, we can finely tailor its detection capabilities.
  • Versatile Detection: This isn’t just an NSFW filter. The sky’s the limit! For instance, you could configure it to only allow floral images.

Stable Diffusion redefines the horizon of what’s possible with image filtering and detection.

Code

References: Red-Teaming the Stable Diffusion Safety Filter

- And would this work with audio?
Yes! There is “CLAP”, the counterpart of “CLIP” for audio. Coming soon in a new article.
