A picture is worth a thousand words — leveraging AI to generate and enrich image descriptions

Cesar Diniz Maciel
IBM Data Science in Practice
13 min read · Jul 2, 2022

Humans have an incredible ability to interpret images and absorb their contextual features in a matter of seconds. The same capability is a daunting task for a computer. Although advances in computer vision algorithms have been stellar over the last few years, and the ability to classify images and detect objects in them has surpassed human accuracy on the same tasks, it is still a challenge for an algorithm to put into words what an image is showing. The tide may be changing, however, with new advances in multimodal AI and natural language processing.

Why does this task present such a challenge to algorithms? Well, an image may have multiple descriptions, depending on the context in which it appears. For example, the following picture is most often described as “children playing at the beach”, but it can also be correctly described as “family members enjoying vacation”, or “an example of childhood friendship”.

Children at the beach, or children drowning? — CC0 Public Domain
Figure 1 — Children at the beach, or family enjoying vacation?

And why does this matter? Well, from the perspective of automatically describing an image, context is key to conveying the message intended by the picture. This is especially important when the picture is being used to promote an idea, product or service. If you have a travel website, you want the above picture to be seen as “children having fun”, not “children drowning at sea”, for obvious reasons…

Search Engine Optimization

Another area where image captioning is very important is Search Engine Optimization (SEO). When a search engine such as Google indexes a website, it indexes text and images, and a part of image indexing is the alt text attribute. Alt text is the written copy that appears in place of an image on a web page if the image fails to load on a user’s screen. This text helps screen-reading tools describe images to visually impaired readers and allows search engines to better crawl and rank your website. The SEO market is expected to reach 86 billion dollars in 2023, according to The Business Research Company, which demonstrates the importance of the activity.

SEO market size — Copyright The Business Research Company
Figure 2 — SEO market size

Naturally, the better the alt text aligns with the intended meaning of the image, the better the search result will be. Therefore, writing the perfect sentence is key to improving the chances that the page ranks higher in the search engine.

Companies specialized in SEO employ creative professionals who understand the intricacies of search engines and craft the best alt text captions for the images on a website, contextualizing the text to the focus of the site. For example, the beach picture from this article (Figure 1) can be described as “children playing at the beach”, but it can also be described as “children enjoying vacation in the Caribbean” if the image is used on a travel website, or even “exposure to sunlight early in the morning is vital for vitamin D synthesis” if the image is used on a medical website.

Because of this, creating an algorithm that automatically captions an image is a complex task. Not only does the algorithm have to correctly identify what the image depicts but, as demonstrated, there are multiple correct descriptions for the same image, and which description is most adequate depends on the context in which the image appears.

Therefore, the process of optimizing a description for an image is time consuming: the specialist manually writes the description and tags for each image. It is a tedious and repetitive process.

The hunt for an effective model

Over the years, several algorithms have been developed to describe photos, and the results are quite impressive. They are part of what is called multimodal AI, meaning algorithms that can learn concepts in several modalities, such as the textual and visual domains. They incorporate abilities such as image classification and object detection, as well as Natural Language Processing (NLP) and Natural Language Generation (NLG).

These models are large and require significant compute capacity for training, since model performance depends on the amount of data used for training. The type of data used also determines the performance of the model: a general-purpose model requires a large amount of data from many different subjects. Fortunately, a few companies provide these models as a service, consumed via API calls. This allows developers to focus on solving the business problem and leave the complex task of developing and enhancing the algorithms to the expert researchers in that field.

One article by Microsoft Research claims better-than-human performance in image captioning for their implementation, and a quick evaluation of the technology indeed shows great capabilities, as seen in the following example:

People walking — CC0 Public Domain
Figure 3 — 'a group of people walking down a sidewalk' with confidence 62.00%

When running this image through the Microsoft Azure Image Description service, it generates the following caption: “a group of people walking down a sidewalk”, which is quite accurate and contextualized.
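For reference, here is a minimal sketch of how such a call could look, assuming the Azure Computer Vision v3.2 Describe Image REST endpoint with a placeholder resource name and key (the exact API version, parameters and response format should be checked against the current Azure documentation):

```python
import requests

# Placeholder Azure Computer Vision resource; replace with your own endpoint and key.
AZURE_ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
AZURE_KEY = "<your-subscription-key>"

def describe_image(image_path, max_candidates=1):
    """Send a local image to the (assumed) v3.2 Describe Image endpoint and return its captions."""
    headers = {
        "Ocp-Apim-Subscription-Key": AZURE_KEY,
        "Content-Type": "application/octet-stream",
    }
    with open(image_path, "rb") as f:
        response = requests.post(
            f"{AZURE_ENDPOINT}/vision/v3.2/describe",
            headers=headers,
            params={"maxCandidates": max_candidates},
            data=f.read(),
        )
    response.raise_for_status()
    # Expected response shape (may vary by API version):
    # {"description": {"tags": [...], "captions": [{"text": "...", "confidence": 0.62}]}}
    return response.json()["description"]["captions"]

# Example usage:
# for caption in describe_image("sidewalk.jpg"):
#     print(f"{caption['text']} ({caption['confidence']:.2%})")
```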

However, for some other images, although the description is correct, it is also very generic.

Living room with fireplace — CC0 Public Domain
Figure 4 — 'a living room with a fireplace' with confidence 60.73%
A living room with a fireplace — CC0 Public Domain
Figure 5 — 'a living room with a fireplace' with confidence 57.96%

While both descriptions are technically correct, they do not capture the differences between the two living rooms and treat them equally. From an SEO perspective, such a description does not add much value. It is, however, impossible for the algorithm to figure out on its own which enrichments to the description are appropriate in each case.

How can the algorithm “enhance” the description to make it more attractive to search engines, and assist the SEO specialist so that they can be more efficient?

This is where we join forces with AI: we leverage AI to automate operations and leverage human capabilities to fine-tune the process, resulting in efficient teamwork and a streamlined business process.

Combining model capabilities

As mentioned before, contextualization is key to delivering a good result when captioning an image. To address the shortcomings of the previous method, some guidance needs to be injected into the process by a specialist who understands the context where the image will be used. For example, in Figure 5, if the website is for a furniture company, it is beneficial for the alt text to contain information about the sofa and the coffee table. If it is for a real estate website, it may be useful to highlight that the living room is modern and bright.

The existing AI models for describing an image, while sometimes generic, serve an important role in automating the generation of an SEO description. They provide information about the subject of the picture. The description for Figure 6, for example, not only identifies the 'chair' objects but also places them in an area with grass.

Chairs in a grassy lawn — CC0 Public Domain
Figure 6 — 'a couple of wooden chairs in a grassy area'

However, the currently available model APIs offer no option to provide input that guides the inference. The only input allowed is the image file (injection of keywords could be implemented in a model trained specifically for the task, but that would require model management, curating data for training, keeping the model up to date, and all the other activities associated with building models, which is not a task for SEO specialists).

We can, nonetheless, combine different models in a pipeline, so that we can not only leverage each model’s capabilities but also insert, or fine-tune, input parameters between models. Each model contributes to enhancing the overall understanding of the image and to generating a contextualized description.

One combination of models that generated richer image descriptions was achieved by combining three models. The first one is the above-mentioned model that describes an image. A second model takes the image as input and outputs “tags” associated with the image. These tags are objects detected in the image, but also include some context description. For example, when using Figure 4 as input, the model outputs the following tags associated with the image:

AI model pipelining for image description
Figure 7 — model pipelining enhances the description of the image

Several of the tags are objects in the image (“table”, “chair”, “window”), but some are also contextualized descriptions of the image (“interior design”, “indoor”, “dining room”). These tags improve the understanding of the image, and provide some additional input to formulate a richer image description.
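As a rough sketch of how the first two models could be chained, the snippet below calls an image tagging endpoint and merges the resulting tags with the caption from the describe step into a single keyword list. It assumes the same Azure Computer Vision resource and its v3.2 Tag Image endpoint; the merge logic and the example output are purely illustrative:

```python
import requests

# Placeholder Azure Computer Vision resource (same assumptions as the earlier sketch).
AZURE_ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
AZURE_KEY = "<your-subscription-key>"
HEADERS = {
    "Ocp-Apim-Subscription-Key": AZURE_KEY,
    "Content-Type": "application/octet-stream",
}

def tag_image(image_path, min_confidence=0.5):
    """Return tag names from the (assumed) v3.2 Tag Image endpoint, filtered by confidence."""
    with open(image_path, "rb") as f:
        response = requests.post(f"{AZURE_ENDPOINT}/vision/v3.2/tag", headers=HEADERS, data=f.read())
    response.raise_for_status()
    return [t["name"] for t in response.json()["tags"] if t["confidence"] >= min_confidence]

def build_keywords(caption, image_path):
    """Merge the caption from the describe model with the image tags into one keyword list."""
    tags = tag_image(image_path)
    # Keep the caption first so the NLG model sees the main subject up front, then de-duplicate.
    return [caption] + [t for t in tags if t.lower() not in caption.lower()]

# Illustrative example, with the caption obtained from the describe sketch shown earlier:
# build_keywords("a living room with a fireplace", "living_room.jpg")
# -> ["a living room with a fireplace", "table", "chair", "window", "interior design", "indoor"]
```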

Now that we have a description of the image and some tags associated with it, we leverage another type of model to build the final description. We use an NLG model to build a sentence based on the output from the two previous models.

Transformers — More than meets the eye

Recent years have seen a lot of improvement in NLP algorithms, with Transformers becoming a powerful and capable neural network architecture. Transformers overcome limitations of traditional NLP networks such as RNNs and LSTMs in handling longer sequences of text while maintaining context. Transformer-based models, namely GPT, BERT, and XLNet, are becoming the dominant architecture for NLP, with applications in several areas, from language translation to text summarization to programming code generation.

The Transformer architecture — source https://arxiv.org/pdf/2102.08036.pdf
Figure 8 — The Transformer architecture — Source https://arxiv.org/pdf/2102.08036.pdf

Transformers are semi-supervised models, involving unsupervised pre-training and an optional supervised fine-tuning. Leveraging large amounts of unlabeled data makes them effective at learning general representations, which can then be fine-tuned for downstream tasks with great success.

However, training these models can be costly from both an economic and an environmental standpoint. The BERT model was trained on the English Wikipedia and the BookCorpus dataset (a collection of 11,038 free novels written by unpublished authors). GPT-3 was trained on about 45 TB of text data from different sources. It becomes clear that training these models requires significant compute capacity, with a high cost in hardware, electricity and cooling infrastructure. Fortunately, several of these pre-trained models are available for consumption, either by downloading the model and building the application around it, or through services providing REST APIs for model consumption.

Trend of sizes of state-of-the-art NLP models over time — source Microsoft
Figure 9 — Trend of sizes of state-of-the-art NLP models over time. Source Microsoft

The largest Transformer model available for use today is GPT-3 from OpenAI, with 175 billion parameters (for comparison, the original BERT-Base model, built on the Transformer architecture from Google’s paper “Attention Is All You Need”, has 110 million parameters). Microsoft announced Megatron-Turing NLG 530B, which has roughly 3 times the number of parameters of GPT-3, but at the time of writing this model is not available for use (at least not outside of Microsoft or Nvidia, its partner in the development of the model).

Initially, GPT-3 was only available for some selected developers. Because of that, some alternatives were developed, including a GPT-like model called GPT-J from EleutherAI that is free and open source. Several companies embraced this model to provide GPT capabilities as a service.

GPT-3 is now available as an API from OpenAI, and it remains their proprietary implementation. Microsoft has licensed the OpenAI models and is previewing their OpenAI services on Azure.

The creative model

As previously mentioned, Transformers are trained on vast amounts of text data in an unsupervised manner, which gives the models the ability to perform many tasks without any additional training on a specific context.

The following example uses an implementation of GPT-J provided as a service by NLP Cloud, trained for product description and ad generation. It expects a list of keywords as input and generates an ad as the output text, as shown in Figure 10.

Example of using GPT-J for ad generation
Figure 10 — “Luxury, speed, comfort and performance of Ferrari can be rented, for the price of just $100. Add travel insurance for just $4.”

As we can see, with no formatting guidance and just the keywords as input, the model generated an ad for renting a Ferrari. And because of the model’s ability to generalize, it even suggested prices for the rental (although not realistic ones for a Ferrari).
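As a rough illustration of this step, the sketch below joins a keyword list into a prompt and sends it to a GPT-J text generation service. It is modeled on NLP Cloud’s Python client, but the model name, method and parameter names are assumptions that may differ from the current API, and the keywords are simply inferred from Figure 10:

```python
# Assumes the `nlpcloud` Python client (pip install nlpcloud); the model, method and
# parameter names below are illustrative and may have changed in the current API.
import nlpcloud

client = nlpcloud.Client("fast-gpt-j", "<your-api-token>", gpu=True)

# Keywords inferred from the ad in Figure 10.
keywords = ["Ferrari", "rent", "luxury", "speed", "comfort", "performance"]

# Zero-shot: no examples, just a short instruction built from the keywords.
prompt = "Write a short ad based on these keywords: " + ", ".join(keywords) + "\nAd:"

result = client.generation(
    prompt,
    max_length=60,    # keep the output short
    temperature=0.8,  # higher temperature means more creative output
)
print(result["generated_text"])
```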

The ability of large language models to generate relevant content without specific training is called zero-shot learning. It stems from the fact that the model, being trained in an unsupervised way on a very large dataset, is capable of generalizing and producing reasonable output even in a domain it has not been trained on.

An extension of zero-shot learning is few-shot learning, where the idea is to provide a few examples to the model so that it can still generalize, but with some guidance on the expected output. It is much faster and easier to perform than fine-tuning the model (which essentially means performing additional training in a supervised way, using transfer learning from the unsupervised model), and it provides a significant gain in accuracy on the desired output.

accuracy improvement with few-shot learning — Source https://arxiv.org/pdf/2005.14165.pdf
Figure 11 — accuracy improvement with few-shot learning — Source https://arxiv.org/pdf/2005.14165.pdf

When focusing on alt text, the desired output is a single sentence: concise, objective and relatively short (although the HTML specification does not impose a hard limit on alt text, keeping it to 125 characters or less is recommended for accessibility reasons, e.g. screen readers). As mentioned before, GPT can be very creative when generating text, and will happily write paragraphs instead of single sentences. The length of the text can be parametrized when the model is called, but a good way to generate output in the desired format is to leverage few-shot learning.
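Since the model output still needs to fit these constraints, a simple post-processing step can discard candidates that are too long or run into multiple sentences. Here is a minimal sketch; the 125-character threshold is the accessibility guideline mentioned above, not a hard HTML limit:

```python
def is_valid_alt_text(candidate: str, max_chars: int = 125) -> bool:
    """Keep only single-sentence candidates within the recommended alt text length."""
    text = candidate.strip()
    # Reject empty output, output over the recommended length,
    # and output that runs into a second sentence or a new line.
    if not text or len(text) > max_chars:
        return False
    return text.count(".") <= 1 and "\n" not in text

candidates = [
    "A cozy living room with a fireplace, two armchairs and a rug.",
    "This living room is spacious. It has a fireplace and large windows that let in plenty of light.",
]
print([c for c in candidates if is_valid_alt_text(c)])
# Only the first, single-sentence candidate is kept.
```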

If we feed the GPT model the image description and image tags, it generates text based on the input words. For example, if we combine the image description and tags from Figure 4 and feed them to GPT, we get the following output:

sentences generated with zero-shot training on GPT-J
Figure 12 — sentences generated with zero-shot training

The sentences are well formed and a good description of the environment, and they would be great for a brochure describing the living room. They are not, however, in a format that could be readily used as alt text.

Now, if we take the input from other images and manually create the “ideal” alt text for each, we can use these examples as few-shot learning input for the model.

There is no right number of examples to use: one is better than zero, and a few are better than one, but the optimal number is more art than science. A few-shot learning example for the GPT model is similar to the following one:

few-shot learning examples
Figure 13 — few-shot learning examples
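In practice, few-shot learning with a text generation API often amounts to prepending the hand-written examples to the prompt. The sketch below shows one way such a prompt could be assembled; the example pairs are placeholders standing in for the curated examples in Figure 13:

```python
# Hand-crafted (keywords -> ideal alt text) pairs, standing in for the examples in Figure 13.
FEW_SHOT_EXAMPLES = [
    (["living room", "fireplace", "sofa", "window", "indoor"],
     "A bright living room with a comfortable sofa and a wood-burning fireplace."),
    (["kitchen", "island", "stools", "pendant lights", "indoor"],
     "A modern kitchen with a large island, bar stools and elegant pendant lighting."),
]

def build_few_shot_prompt(keywords):
    """Prepend the example pairs to the new keywords so the model imitates their format."""
    lines = []
    for example_keywords, alt_text in FEW_SHOT_EXAMPLES:
        lines.append("Keywords: " + ", ".join(example_keywords))
        lines.append("Alt text: " + alt_text)
        lines.append("")  # blank line between examples
    lines.append("Keywords: " + ", ".join(keywords))
    lines.append("Alt text:")
    return "\n".join(lines)

# The resulting prompt is then sent to the text generation service
# in the same way as the zero-shot prompt shown earlier.
print(build_few_shot_prompt(
    ["a living room with a fireplace", "table", "chair", "window", "interior design", "indoor"]))
```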

When submitting the keywords for text generation, the model also accepts the examples as input and then generates text using them as a guideline. For the same example from Figure 5, here are some of the outputs generated when using few-shot learning:

few-shot learning outputs
Figure 14 — few-shot learning outputs

As can be seen, the output text is more concise and limited to a single sentence, and it also incorporates adjectives highlighting the quality of the environment, which was emphasized in the examples provided to the model.

A living room with a fireplace — CC0 Public Domain
Figure 15 — “A cozy living room with fireplace, two armchairs and a rug in front of the fireplace”

These generated sentences are much more suitable for use as alt text. They add more detail to the image being described and incorporate relevant adjectives into the sentence. Someone is more likely to become interested in “a cozy living room with soft furnishings, wood burning fireplace and large windows” than in simply “a living room with a fireplace”.

There are many parameters that can be tuned on an NLG model to change the length, creativity, and variation of the output. Words can be added to or removed from the text (for example, removing offensive words, or words that would be detrimental to the description), and there is the ability to fine-tune a model for a specific subject. As mentioned before, this is more time consuming and less flexible than few-shot learning, since you retrain the model, but if the subject is always the same (always generating texts about cars, for example), fine-tuning is a great option for better results.

nlpcloud.io GPT-J configuration options
Figure 16 — nlpcloud.io GPT-J configuration options

Also, it is important to note that, due to the “creativity” of the model, it will always generate text output, but that does not mean the text is always a completely accurate description. There are parameters that control how deterministic the sentence will be: the more deterministic, the less creative. It is a balance between enriching the description and staying accurate. Sometimes you achieve both; sometimes the model favors creativity over accuracy. One example of a sentence generated for Figure 15 reads “a rustic living room with a fireplace, exposed beams and hardwood floors”.
If we look at the picture again, the living room does not really look rustic, nor are exposed beams or hardwood floors visible. While the sentence can still be edited and used, it demonstrates the non-deterministic nature of language generation from large models. It is a beautifully crafted sentence, but in this case the model took its creative vibe too far.

Useful tools, but not a human replacement

It is clear that leveraging AI models for image description and text generation provides an incredible opportunity to explore and augment the options available for translating images into words, and for optimizing images for search engines. It is not, at least with the options available today, a replacement for an expert. It is, however, another set of tools that experts in SEO or image description in general can leverage to automate the process, generate new ideas and optimize their work.

Moreover, these AI models can be incorporated into a business process to automate the collection, tagging, description and alt text generation for images in a repository or on a website, in a way that streamlines the process for SEO specialists: saving time, keeping the business process organized, and integrating the different flows involved in the project. It is an innovative approach that some clients are evaluating, hoping it will increase the productivity of their teams and reduce the repetitive, manual steps involved in the process.

Cesar Diniz Maciel
IBM Data Science in Practice

Electrical engineer, IBM employee for longer than I want to admit, with experience in hardware infrastructure, electronics, and machine learning.