Automatic Image Captioning Using Deep Learning

Manthan Bhikadiya 💡
Published in The Startup · 7 min read · Oct 5, 2020

Overview of Deep Learning:

Deep learning and machine learning are among the most rapidly advancing technologies of this era. Artificial intelligence is now often compared with the human mind, and in some fields AI already does a better job than humans. New research appears in this area every day, and the field is growing quickly because we finally have enough computational power for these tasks. Deep learning is a branch of machine learning that uses neural networks with many layers.

In traditional machine learning, the algorithm is given a set of relevant features to analyze. However, in deep learning, the algorithm is given raw data and decides for itself what features are relevant. Deep learning networks will often improve as you increase the amount of data being used to train them.

Read more about a simple explanation of deep learning here.

Some very interesting deep learning applications are shown below.

Source: https://techvidvan.com/tutorials/deep-learning-applications/

Now we will look at one of its applications: the photo description, or image caption, generator.

Image Captions Generator :

An image caption generator (or photo description generator) is one of the applications of deep learning. We pass an image to the model, and the model does some processing and generates a caption or description based on its training. The predictions are sometimes not very accurate and can produce meaningless sentences. For better results, we need a lot of computational power and a very large dataset. Next, we will look at the dataset and the neural network architecture of the image caption generator.

Pre-requisites :

This project requires good knowledge of deep learning, Python, working with Jupyter notebooks, the Keras library, NumPy, and natural language processing.

Make sure you have installed all the following necessary libraries:

  • TensorFlow
  • Keras
  • Pandas
  • NumPy
  • NLTK (Natural Language Toolkit)
  • Jupyter (notebook environment)

Dataset :

In this project, we are using the Flickr30k dataset. It contains about 30,000 images, each with an image ID, and each image ID has five captions.

Here is the link to the dataset so that you can download it yourself.

Flickr30k: https://www.kaggle.com/hsankesara/flickr-image-dataset

One of the images in the dataset has the ID 1000092795 (file 1000092795.jpg), and the dataset stores five captions for this particular image.
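
As a minimal sketch of loading these captions into a dictionary keyed by image ID (the file name results.csv, the pipe separator, and the column names are assumptions based on the Kaggle version of the dataset, so adjust them to your download; wrapping each caption in startseq/endseq markers is a common training convention, not part of the dataset itself):

```python
# Sketch: build a dictionary mapping image ID -> list of its five captions.
# File name, separator, and column names are assumptions for the Kaggle download.
import pandas as pd

captions = pd.read_csv(
    "flickr30k_images/results.csv",
    sep="|",
    names=["image_name", "comment_number", "comment"],
    skiprows=1,  # skip the header row
)

captions_by_image = {}
for _, row in captions.iterrows():
    image_id = str(row["image_name"]).strip().split(".")[0]   # e.g. "1000092795"
    caption = str(row["comment"]).strip().lower()
    # Add start/end markers so the model knows where captions begin and end.
    captions_by_image.setdefault(image_id, []).append("startseq " + caption + " endseq")

print(captions_by_image["1000092795"])   # the five captions for the example image above
```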

The Architecture of Network :

1. Image Feature Extraction :

To extract image features, we use a pre-trained model, VGG16, which is available out of the box in the Keras library. VGG16 was proposed by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group at the University of Oxford in the 2014 paper "Very Deep Convolutional Networks for Large-Scale Image Recognition".

The model took first place in the localization task and second place in the classification task of the ILSVRC 2014 challenge.

Here is the model represented in 3-D and in 2-D.

Source: Very Deep Convolutional Networks for Large-Scale Image Recognition

Source: VGG16 | CNN Model (GeeksforGeeks)

Overview :

The input to conv1 layer is of fixed size 224 x 224 RGB image. The image is passed through a stack of convolutional (conv.) layers, where the filters were used with a very small receptive field: 3×3 (which is the smallest size to capture the notion of left/right, up/down, center). In one of the configurations, it also utilizes 1×1 convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity). The convolution stride is fixed to 1 pixel; the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1-pixel for 3×3 conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all the conv. layers are followed by max-pooling). Max-pooling is performed over a 2×2 pixel window, with stride 2.

Three Fully-Connected (FC) layers follow a stack of convolutional layers (which has a different depth in different architectures): the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all networks.

Read More Here
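
As a rough sketch of this step (the fc2 layer name comes from the standard Keras VGG16 implementation; the file paths are placeholders), the 4096-dimensional image features can be extracted and saved like this:

```python
# Sketch: extract 4096-dim image features from the fc2 layer of pre-trained VGG16.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet")
# Drop the final 1000-way classification layer and keep the fc2 activations.
feature_extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def extract_features(image_path):
    image = load_img(image_path, target_size=(224, 224))     # VGG16 expects 224x224 RGB
    array = img_to_array(image)
    array = preprocess_input(np.expand_dims(array, axis=0))  # add batch dimension
    return feature_extractor.predict(array, verbose=0)       # shape: (1, 4096)

# Placeholder paths: extract features for one image and save them for later reuse.
features = extract_features("flickr30k_images/1000092795.jpg")
np.save("1000092795_features.npy", features)
```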

2. Text Generation using LSTM

Long Short Term Memory networks — usually just called “LSTMs” — are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997) and were refined and popularized by many people in the following work. They work tremendously well on a large variety of problems and are now widely used.

Source: LSTM Networks

Overview :

A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell.

LSTM networks are well-suited to classifying, processing, and making predictions based on time series data since there can be lags of unknown duration between important events in a time series. LSTMs were developed to deal with the vanishing gradient problem that can be encountered when training traditional RNNs. Relative insensitivity to gap length is an advantage of LSTM over RNNs, hidden Markov models, and other sequence learning methods in numerous applications.

The advantage of an LSTM cell compared to a common recurrent unit is its cell memory unit. The cell vector has the ability to encapsulate the notion of forgetting part of its previously-stored memory, as well as to add part of the new information. To illustrate this, one has to inspect the equations of the cell and the way it processes sequences under the hood.
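
For reference, the standard LSTM cell equations (with σ the sigmoid function and ⊙ element-wise multiplication) are:

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Here f_t, i_t, and o_t are the forget, input, and output gates, c_t is the cell state that carries the long-term memory, and h_t is the hidden state passed to the next time step.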

Now we combine these two architectures into one model: our final model, which will generate captions from images.

Read more on Wikipedia: LSTM networks explained in full.

Main Model Architecture:

The final model is a combination of the CNN and RNN models. To train it, we give the model two inputs: (1) the images and (2) their corresponding captions. On the text side, we feed the caption into the LSTM one word at a time, and the LSTM learns to predict the next word; that is how it optimizes itself by learning from the captions. On the image side, we obtain the feature arrays of all images from the pre-trained VGG16 model and save them to a file, so that we can reuse these features directly to correlate captions and image features with each other. Finally, we feed the image features and the output of the last LSTM layer into a decoder model that adds the two together, so the model learns to generate captions from images; the output layer generates captions whose length is at most the maximum caption length in the dataset.

The last layer has the size of the vocabulary. For this model we use categorical cross-entropy as the loss, because the last layer predicts a probability for each word and we keep only the highest-probability word. We use the Adam optimizer to update the weights of the network.
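
A minimal sketch of this combined model in Keras is shown below; the 256-unit layer sizes and the vocab_size and max_length values are illustrative placeholders, not values taken from the original code:

```python
# Sketch: merge architecture combining VGG16 image features with an LSTM over captions.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 7500   # placeholder: number of words in the caption vocabulary
max_length = 35     # placeholder: length of the longest caption in the dataset

# Image-feature branch: the 4096-dim vector from the VGG16 fc2 layer.
inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation="relu")(fe1)

# Text branch: the partial caption, embedded and run through an LSTM.
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# Decoder: add the two branches and predict the next word over the vocabulary.
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation="relu")(decoder1)
outputs = Dense(vocab_size, activation="softmax")(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```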

BLEU Score :

To evaluate the generated captions we use the BLEU score over n-grams; with it we can check which n-gram order works best for generating captions on this dataset. The BLEU score lies between 0 and 1. BLEU, or the Bilingual Evaluation Understudy, is a score for comparing a candidate translation of text to one or more reference translations. Although developed for translation, it can be used to evaluate text generated for a suite of natural language processing tasks.
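
A small sketch of computing BLEU-1 through BLEU-4 with NLTK is shown below; the captions used here are toy placeholders, not dataset examples:

```python
# Sketch: score generated captions against reference captions with corpus BLEU.
from nltk.translate.bleu_score import corpus_bleu

# Toy placeholders: in practice, `references` holds the five ground-truth captions
# per image (tokenized) and `candidates` holds the model-generated caption per image.
references = [[
    "a dog runs across the grass".split(),
    "a brown dog is running through a grassy field".split(),
]]
candidates = ["a dog is running in the grass".split()]

print("BLEU-1:", corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0)))
print("BLEU-2:", corpus_bleu(references, candidates, weights=(0.5, 0.5, 0, 0)))
print("BLEU-3:", corpus_bleu(references, candidates, weights=(0.33, 0.33, 0.34, 0)))
print("BLEU-4:", corpus_bleu(references, candidates, weights=(0.25, 0.25, 0.25, 0.25)))
```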

Model Summary :

Code :
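
As a hedged, minimal sketch of the greedy decoding loop implied by the description above (assuming the merge model defined earlier, a fitted Keras Tokenizer, and startseq/endseq markers on the training captions), caption generation could look like this:

```python
# Sketch: generate a caption word by word, always picking the most probable next word.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_features, max_length):
    # Reverse lookup from word index back to the word itself.
    index_to_word = {index: word for word, index in tokenizer.word_index.items()}
    caption = "startseq"
    for _ in range(max_length):
        sequence = tokenizer.texts_to_sequences([caption])[0]
        sequence = pad_sequences([sequence], maxlen=max_length)
        probabilities = model.predict([photo_features, sequence], verbose=0)
        word = index_to_word.get(int(np.argmax(probabilities)))
        if word is None or word == "endseq":   # stop at the end marker
            break
        caption += " " + word
    return caption.replace("startseq", "").strip()
```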

Conclusion :

To conclude: in this project we covered the VGG16 model architecture, Long Short-Term Memory networks, how to combine the two models, the BLEU score, how the LSTM network generates captions, how the pre-trained VGG16 model can be used in our project, and how to generate captions from images using deep learning.

More Blogs On Image Captions:

Image Captioning in Python with Keras

Image Caption Using Attention Mechanism

Automatic Image Captioning With CNN and RNN

Connect with me :

Github :

Medium :

LinkedIn :

Final Note :

Thanks for reading! If you enjoyed this article, please hit the clap👏 button as many times as you can. It would mean a lot and encourage me to keep sharing my knowledge.
