Faster Image Captioning and VQA Using fastdup

Guy Singer
Visual Layer
Published Sep 21, 2023 · 6 min read

Introduction to Image Captioning

Image Captioning is the process of using a deep learning model to describe the content of an image. Most captioning architectures use an encoder-decoder framework, where a convolutional neural network (CNN) encodes the visual features of an image, and a recurrent neural network (RNN) decodes the features into a descriptive text sequence.

VQA

Visual Question Answering (VQA) is the process of asking a question about the contents of an image, and outputting an answer. VQA uses similar architectures to image captioning, except that a text input is also encoded into the same vector space as the image input.
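The "same vector space" idea can be sketched with a toy example. Nothing below is fastdup or a real VQA model; the vectors are hand-picked purely to show how, once an image and candidate answers are encoded into one embedding space, an answer can be scored by similarity:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend encoders mapped an image and two candidate answers into
# the same 4-d space (values are made up for illustration).
image_emb = np.array([0.9, 0.1, 0.0, 0.3])
answer_embs = {
    "indoors":  np.array([0.1, 0.9, 0.2, 0.0]),
    "outdoors": np.array([0.8, 0.2, 0.1, 0.4]),
}

# A VQA head can score answers by similarity to the joint embedding.
scores = {ans: cosine(image_emb, emb) for ans, emb in answer_embs.items()}
best = max(scores, key=scores.get)
print(best)  # the answer vector closest to the image embedding wins
```

Real VQA models learn these embeddings jointly from image and text encoders; the comparison step above is just the intuition.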

Image captioning and VQA are used in a wide array of applications:
• Image search and retrieval based on a given query
• Content Recommendation based on scene analysis
• News and Journalism with automated image captions
• Improving product search by recognizing items in photos
• Aiding visually impaired people by providing textual descriptions

📣 Today, we are happy to announce that fastdup now supports image captioning and VQA as part of its public API!

Why Captioning With fastdup?

Image captioning can be a computationally expensive task, requiring many processor-hours to complete. Recent experiments have shown that the free fastdup tool can reduce dataset size without losing training accuracy. By filtering out duplicate data and unnecessary inputs before generating captions and VQA answers with fastdup, you can save expensive compute hours.

About fastdup

If you’re new to fastdup, we recommend reading through our documentation. As a brief recap, fastdup is an unsupervised and free tool for image and video data analysis that can clean massive visual datasets without the need for expensive GPUs.

fastdup has over 280,000 downloads on PyPI and has been used to analyze over 50 billion images. You can try fastdup yourself and get started with the official GitHub repo.

Example Jupyter Notebook

If you’d like to skip ahead to an example of how to use fastdup for image captioning and VQA, you can download and run the example notebook hosted in the fastdup GitHub Repo. Otherwise, continue reading for a step-by-step overview of these new fastdup features.

https://github.com/visual-layer/fastdup/blob/main/examples/caption_generation.ipynb

Getting Started With Captioning in fastdup

To start generating captions with fastdup, you’ll first need to install and import fastdup in your computing environment.

pip install fastdup
import fastdup

Then, proceed to run fastdup on your visual dataset. fastdup is lightweight enough to run without a GPU and can process millions of images in hours. Simply point fastdup to the directory containing your data.

fd = fastdup.create(input_dir='./coco_minitrain_25k')
fd.run(ccthreshold=0.9, overwrite=True)

After running fastdup on your dataset, you can filter out outliers, duplicates, and invalid images. Then, use the fd.caption() method to generate captions or answer visual questions.
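A minimal sketch of that filtering step, using pandas only. The column names (`component_id`, `filename`) mirror the shape of fastdup's connected-components output, but the DataFrame here is hand-built, so treat the names as assumptions and check the output of your own fastdup run:

```python
import pandas as pd

# Stand-in for fastdup's connected-components output: each row is an
# image, and rows sharing a component_id were flagged as near-duplicates.
# (Hand-built data; column names may differ in your fastdup version.)
cc_df = pd.DataFrame({
    "filename": ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"],
    "component_id": [0, 0, 1, 2, 2],
})

# Keep one representative image per duplicate cluster, so the captioning
# model is not run repeatedly on near-identical inputs.
keep = cc_df.drop_duplicates(subset="component_id", keep="first")
subset = keep["filename"].tolist()
print(subset)  # ['a.jpg', 'c.jpg', 'd.jpg']

# The surviving files can then be passed on to captioning, e.g.:
# captions_df = fd.caption(subset=subset)
```

Dropping near-duplicates before captioning directly reduces the number of model invocations, which is where the compute savings come from.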

Available Captioning and VQA Models

fastdup supports a number of popular captioning and VQA models, with more models being added in the future. These include lightweight models such as ViT-GPT2, which can be run on a CPU, with no GPU needed. More heavyweight models, such as BLIP-2, are also included. A GPU runtime is recommended for using the heavier models on large datasets.

Currently, the available models for captioning are:

  • ViT-GPT2 (model info) : 'vitgpt2' : a lightweight and fast model trained on COCO images. This model takes about 0.5s per image caption (on a CPU), but may provide less useful results for images that are very different from COCO-like images.
  • BLIP-2 (model info) : 'blip2' : a more heavyweight model. This model may provide more robust answers for images different than COCO images, but can take upwards of 10s per image caption.
  • BLIP (model info) : 'blip' : a middleweight model that strikes a balance between ViT-GPT2 and BLIP-2 in speed and robustness.
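Using the rough per-image timings above (about 0.5 s for ViT-GPT2 on CPU, upwards of 10 s for BLIP-2), a quick back-of-envelope calculation shows why filtering the dataset first matters. The 25,000-image figure matches the COCO minitrain directory used in this article; the timings are the ballpark numbers above, not benchmarks:

```python
# Rough per-image caption times from the model notes above (seconds).
SECONDS_PER_IMAGE = {"vitgpt2": 0.5, "blip2": 10.0}

def estimated_hours(num_images: int, model: str) -> float:
    # Total wall-clock estimate, ignoring batching and model warm-up.
    return num_images * SECONDS_PER_IMAGE[model] / 3600

n = 25_000  # e.g. the coco_minitrain_25k dataset used in this article
print(f"vitgpt2: ~{estimated_hours(n, 'vitgpt2'):.1f} h")  # ~3.5 h
print(f"blip2:   ~{estimated_hours(n, 'blip2'):.1f} h")    # ~69.4 h
```

Cutting even a modest fraction of duplicates before running BLIP-2 can save hours of GPU time.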

Available models for VQA are:

  • ViLT-b32 (model info): 'vqa' : used for general question answering.
  • ViT-Age (model info): 'age' : used to classify the age of humans in a photo.

Processor Selection and Batching

The captioning method in fastdup lets you select either a GPU or CPU for computation and choose your preferred batch size. By default, CPU computation is selected and the batch size is set to 8. On high-memory GPUs (40 GB), a batch size of 256 enables captioning at under 0.05 seconds per image.

To select a model, processing device, and batch size, use the syntax shown below. If no parameters are passed, the fd.caption() method defaults to ViT-GPT2, CPU processing, and a batch size of 8.

Note: you can generate captions for a specified subset of images in your data. To do so, pass a list of file paths in the subset argument of the fd.caption() method.

See full documentation here: https://visual-layer.readme.io/docs/v1-api#data-enrichments

captions_df = fd.caption(model_name='vitgpt2', device = 'cpu', batch_size=8)
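To caption only a subset of your images, as described in the note above, pass a list of file paths via the subset argument. A sketch, where the throwaway folder and file names are placeholders for your own data:

```python
from pathlib import Path
import tempfile

# For illustration, build a throwaway folder with a few empty "images".
# (Placeholder data; point the glob at your real dataset directory.)
tmp = Path(tempfile.mkdtemp())
for name in ("cat.jpg", "dog.jpg", "notes.txt"):
    (tmp / name).touch()

# Collect only the image paths you want captioned.
subset = sorted(str(p) for p in tmp.glob("*.jpg"))
print(len(subset))  # 2 -- the .txt file is skipped

# Pass them to fastdup; other parameters keep their defaults:
# captions_df = fd.caption(model_name='vitgpt2', subset=subset)
```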

Visualizing Outlier Images’ Captions With fastdup Galleries

Using fastdup’s built-in galleries methods, you can visualize the captions you have generated.

import pandas as pd

captions_to_show = captions_df.sample(20)
visualization_df = pd.DataFrame({'from': captions_to_show['filename'],
                                 'to': captions_to_show['filename'],
                                 'label': captions_to_show['caption'],
                                 'distance': [0] * len(captions_to_show)})
fastdup.create_outliers_gallery(visualization_df, save_path='.', num_images=10)
from IPython.display import HTML
HTML('outliers.html')

VQA with fastdup

To answer visual questions about your dataset, use the same fd.caption() method with a prompt added.

vqa_df = fd.caption(model_name='vqa', vqa_prompt='is this photo taken indoors or outdoors?')

fastdup will automatically create a DataFrame of answers to your question, and you can visualize the answers for outliers using the previously mentioned galleries method.

vqa_to_show = vqa_df.sample(20)
vis_vqa_df = pd.DataFrame({'from': vqa_to_show['filename'],
                           'to': vqa_to_show['filename'],
                           'label': vqa_to_show['caption'],
                           'distance': [0] * len(vqa_to_show)})
fastdup.create_outliers_gallery(vis_vqa_df, save_path='.', num_images=10)
from IPython.display import HTML
HTML('outliers.html')
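Beyond galleries, the returned DataFrame makes quick dataset-level summaries easy, e.g. tallying answers to the indoors/outdoors question. A sketch with a hand-built DataFrame standing in for vqa_df (the answer lives in the 'caption' column, as the gallery code above assumes):

```python
import pandas as pd

# Stand-in for vqa_df as returned by fd.caption(model_name='vqa', ...):
# one row per image, with the model's answer in the 'caption' column.
# (Hand-built data for illustration.)
vqa_df = pd.DataFrame({
    "filename": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    "caption": ["outdoors", "indoors", "outdoors", "outdoors"],
})

# How is the dataset split between the possible answers?
answer_counts = vqa_df["caption"].value_counts()
print(answer_counts.to_dict())  # {'outdoors': 3, 'indoors': 1}
```

Summaries like this are a cheap sanity check that the VQA model is producing the answer vocabulary you expect before you act on the results.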

Wrap Up

Now that you’ve learned how to use fastdup for captioning and VQA, you can use the tool on any of your datasets.

Next, feel free to check out some useful fastdup tutorials:

  • Quickstart: Learn how to install fastdup, load a dataset, and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you’re new, start here!
  • 🧹 Clean Image Folder: Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.
  • 🖼 Analyze Image Classification Dataset: Learn how to load a labeled image classification dataset and analyze for potential issues. If you have labeled ImageNet-style folder structure, have a go!
  • 🎁 Analyze Object Detection Dataset: Learn how to load bounding box annotations for object detection and analyze for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try.

VL Profiler — A faster and easier way to diagnose and visualize dataset issues

If you prefer a no-code platform to inspect and visualize your dataset, try our free cloud product VL Profiler, our first no-code commercial product. It lets you visualize and inspect your dataset right in your browser.

VL Profiler is free to get started. Upload up to 1,000,000 images for analysis at zero cost!

Sign up now.

As usual, feedback is welcome! Questions? Drop by our Slack channel or open an issue on GitHub.

