Technical Portfolio

Feb 14, 2016

About Me

I’m a Computer Science undergraduate passionate about Artificial Intelligence.
Andrew Ng’s introductory course on Machine Learning introduced me to a field at the intersection of Math and Computer Science, my favorite subjects since high school.
This article (https://aeon.co/essays/how-close-are-we-to-creating-artificial-intelligence) inspired me to read about the fundamentals of AI, including Turing’s paper “Computing Machinery and Intelligence”.
I’ve been performing research on open problems at the nexus between Computer Vision and Natural Language Processing, such as Image Captioning and Visual Question Answering, leveraging the power of Deep Learning to solve these problems.
I’d love to be a part of the AI Community, and work as a Deep Learning Research Scientist along with the top minds and PhD researchers in this field.

Things that I believe I need to work on:

1. Open source contribution, either by publishing open source code or contributing to open source deep learning frameworks such as Keras, Lasagne, Chainer, Caffe, Theano.

2. Experimenting with Generative Adversarial Networks, Neural Turing Machines and Dynamic Memory Networks.

3. I’m currently at the level of understanding state of the art research work. Someday I’d like to make a contribution to the research community by publishing my own research work.

Projects

Playing Pong, Breakout Using Deep Reinforcement Learning

Timeline: February 1st — February 10th

Motivation:

Demis Hassabis’ video about the Theory Of Everything and his presentation of DeepMind’s work

Technical Details:

Developed the Pong game using the pygame Python library.
Used the Arcade Learning Environment to simulate the Breakout game.
Implemented the Deep Q-Learning algorithm using Keras (also experimented with TensorFlow).
Used an experience replay memory of 500,000 timesteps and an ε-greedy exploration scheme, with ε annealed gradually from 1.0 to 0.05 over 2 million steps.
The AI learns to win Pong with highly skewed scores of 20–4 after 2.5 million timesteps of training.
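The exploration schedule and replay memory described above can be sketched in a few lines. This is only an illustration under stated assumptions: the write-up says ε was "annealed gradually", so the linear schedule here is my assumption, and the names `epsilon_at`, `ReplayMemory` and `choose_action` are hypothetical rather than taken from the original code.

```python
import random
from collections import deque

# Values quoted in the write-up above.
REPLAY_CAPACITY = 500_000
EPS_START, EPS_END = 1.0, 0.05
ANNEAL_STEPS = 2_000_000


def epsilon_at(step):
    """Assumed linear schedule: EPS_START down to EPS_END, then held constant."""
    fraction = min(step / ANNEAL_STEPS, 1.0)
    return EPS_START + fraction * (EPS_END - EPS_START)


class ReplayMemory:
    """Fixed-size buffer; the oldest transitions are dropped once it is full."""

    def __init__(self, capacity=REPLAY_CAPACITY):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random minibatch of stored transitions.
        return random.sample(list(self.buffer), batch_size)


def choose_action(q_values, step, rng=random):
    """Epsilon-greedy: a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon_at(step):
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Early in training almost every action is random, which fills the replay memory with varied transitions; after 2 million steps the agent acts greedily 95% of the time.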

Object Recognition and Detection, Visual Product Recommendation for Fashion Domain

Timeline: January 8th — January 29th

Motivation:

Object recognition and detection techniques are required in the fashion domain to identify the different articles of clothing a person is wearing and to provide visual product recommendations.

Technical Details:

Performed Object Recognition using a standard Convolutional Neural Network (Caffe AlexNet Model)
Performed Object detection using the concept of Bounding Box Regression.
Performed Visual Product Recommendation by ensembling a visual similarity model and a color similarity model.
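One way to picture the ensembling step is as a weighted blend of two similarity scores. The write-up does not say how the two models were defined or combined, so everything here is an assumption made for illustration: cosine similarity over CNN-style feature vectors, histogram intersection over normalized color histograms, and a hypothetical `visual_weight` parameter.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def histogram_intersection(h1, h2):
    """Overlap between two normalized color histograms, in [0, 1]."""
    return sum(min(x, y) for x, y in zip(h1, h2))


def recommend(query, catalog, visual_weight=0.5, top_k=3):
    """Rank catalog items by a weighted blend of the two similarities (assumed scheme)."""
    scored = []
    for item in catalog:
        visual = cosine_similarity(query["features"], item["features"])
        color = histogram_intersection(query["hist"], item["hist"])
        score = visual_weight * visual + (1.0 - visual_weight) * color
        scored.append((score, item["id"]))
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:top_k]]
```

Raising `visual_weight` favors items with a similar shape or texture, while lowering it favors items with a similar color.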

Experiments:

Tried Caffe implementations of R-CNN, Fast R-CNN and Faster R-CNN to perform Object Detection. However, these didn’t produce good results, possibly because of the lack of a large dataset.
Performed Visual Product Recommendation using distance metric learning techniques such as Siamese Networks (LeCun et al.) and Deep Ranking. Neither technique produced satisfactory results.
Used Saliency Detection (DeepDetect, CVPR 2015) to remove background noise from in-the-wild images. However, this was computationally expensive and too slow for real-time use.

Yelp Restaurant Photo Classification

Timeline: January 15th — January 18th

Motivation:

Identifying 9 independent attributes of restaurants from the large set of user submitted photos.

Technical Details:

Extracted Image features from the pool5 layer of the Caffe GoogLeNet model.
Individually classified attributes of each image using a Gradient Boosted Tree (XGBoost Model).
Averaged the set of 9 attributes identified in each image, across all the images belonging to the same restaurant.
Currently ranked 8th on the Kaggle leaderboard.
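The per-restaurant averaging step can be sketched as follows. This is a minimal illustration, not the competition code: the function name, the dictionary layout and the 0.5 decision threshold are all assumptions of mine.

```python
from collections import defaultdict


def aggregate_by_restaurant(image_probs, image_to_restaurant, threshold=0.5):
    """Average per-image attribute probabilities over each restaurant's photos.

    image_probs maps image id -> list of attribute probabilities.
    The threshold used to turn means into labels is an assumed value.
    """
    sums = {}
    counts = defaultdict(int)
    for image_id, probs in image_probs.items():
        restaurant = image_to_restaurant[image_id]
        if restaurant not in sums:
            sums[restaurant] = [0.0] * len(probs)
        sums[restaurant] = [s + p for s, p in zip(sums[restaurant], probs)]
        counts[restaurant] += 1

    result = {}
    for restaurant, total in sums.items():
        means = [s / counts[restaurant] for s in total]
        labels = [i for i, m in enumerate(means) if m >= threshold]
        result[restaurant] = (means, labels)
    return result
```

Averaging lets a restaurant with many photos build a stable attribute estimate even when individual photos are ambiguous.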

Experiments:

Gradient Boosted Tree on top of CNN image features performed better than a model trained to individually classify each one of the 9 attributes.
Max tree depth of 2 gave the best results with early stopping criteria.
Used 5-fold cross validation to tune the hyperparameters of the XGBoost model.
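The tuning loop behind 5-fold cross validation looks roughly like this. It is a library-free sketch: `fit_and_score` is a hypothetical callback standing in for "train a model on the training fold and return its validation score", and the folds are taken in order without shuffling to keep the example short.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def tune(candidates, n_samples, fit_and_score, k=5):
    """Return the candidate setting with the best mean validation score."""
    best, best_score = None, float("-inf")
    for params in candidates:
        scores = [fit_and_score(params, train, val)
                  for train, val in k_fold_indices(n_samples, k)]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best, best_score = params, mean
    return best, best_score
```

A call such as `tune([{"max_depth": d} for d in (1, 2, 4, 6)], n_samples, fit_and_score)` would compare tree depths on the same five folds.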

Generating New Rap Lyrics From Eminem’s Songs

Timeline: December 8th — December 17th

Motivation:

I wanted to experiment with character-level language models, and pulling this off would be really cool!

Technical Details:

Used a character-level LSTM network to generate the rap lyrics character by character.
The model failed to converge in the end, and didn’t produce meaningful rap lyrics.
However, it was interesting to see the model piecing characters together out of nowhere and forming (almost) meaningful words.
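Two pieces of a character-level setup are independent of the network itself: turning raw text into (window, next character) training pairs, and sampling the next character from the model's output distribution. The sketch below covers only those two pieces. The window length and the temperature parameter are assumptions of mine, not settings from this project.

```python
import math
import random


def build_vocab(text):
    """Map every distinct character to an integer index."""
    chars = sorted(set(text))
    return chars, {c: i for i, c in enumerate(chars)}


def make_training_pairs(text, char_to_idx, seq_len):
    """Slide a window over the text: seq_len input characters, then the next one."""
    pairs = []
    for i in range(len(text) - seq_len):
        window = [char_to_idx[c] for c in text[i:i + seq_len]]
        target = char_to_idx[text[i + seq_len]]
        pairs.append((window, target))
    return pairs


def sample_next(probs, temperature=1.0, rng=random):
    """Sample a character index after temperature re-scaling of the distribution."""
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]
```

Lower temperatures make the sampler stick to the most likely character, while higher ones produce more varied (and more error-prone) text.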

Question Answering of Simple Toy Tasks: Towards AI-Complete Question Answering

Timeline: December 5th — December 8th

Motivation:

Experimenting with the new concept of Memory Networks, introduced by FAIR.

Problem Definition: An example toy task, as defined by the Facebook bAbI project, is answering questions about the locations of objects and characters from passages that simulate them moving around and interacting in different locations.

Technical Details:

Converted each sentence in the passage to a vector representation and stored it in the memory.
Used a neural network (Memory network) to score the match between the question vector and each vector stored in the memory.
Applied the same process to extract supporting facts from the memory and used each supporting fact to identify the next.
Performed the same toy task using End-To-End Memory Networks, which produced (almost) the same results without being as strongly supervised as the Memory Network model.
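The retrieval steps above (store sentence vectors, score each against the question, then use the first supporting fact to find the second) can be illustrated with a toy example. Note the big simplification: a real Memory Network learns its embeddings and scoring function, whereas this sketch uses fixed bag-of-words counts and a plain dot product, so it demonstrates only the control flow. All function names are hypothetical.

```python
def tokenize(sentence):
    return sentence.lower().strip(".?").split()


def bag_of_words(sentence, vocab):
    words = tokenize(sentence)
    return [words.count(w) for w in vocab]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def find_supporting_facts(story, question, hops=2):
    """Pick one memory per hop; each chosen fact is added to the next query."""
    vocab = sorted({w for s in story + [question] for w in tokenize(s)})
    memory = [bag_of_words(s, vocab) for s in story]  # one vector per sentence
    query = bag_of_words(question, vocab)
    chosen = []
    for _ in range(hops):
        # Score every memory against the query, skipping facts already used.
        scores = [float("-inf") if i in chosen else dot(query, m)
                  for i, m in enumerate(memory)]
        best = max(range(len(memory)), key=lambda i: scores[i])
        chosen.append(best)
        # Fold the retrieved fact into the query for the next hop.
        query = [q + m for q, m in zip(query, memory[best])]
    return [story[i] for i in chosen]


story = [
    "Mary picked up the football.",
    "Mary went to the kitchen.",
    "John went to the garden.",
]
facts = find_supporting_facts(story, "Where is the football?")
```

Here the first hop matches the question on "football", and the second hop, now carrying "Mary" in the query, retrieves where Mary went.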

Visual Question Answering

Timeline: December 1st — December 5th

Motivation and Scope:

Visual question answering definitely had to follow Image Captioning! Also, performing VQA to a satisfactory level would be one of my long-term research goals.

VQA is a very open research problem. To narrow the scope and make a positive stride towards solving it, I restricted the question types to Object, Color, Number and Location.

Technical Details:

Extracted Image features from fc7 layer of the Caffe VGG-16 Model, pretrained with ImageNet weights.
Used Bag of Words representation for the question sentence.
Fed the concatenated image and BOW question features to a Multilayer Perceptron (MLP) that predicts the answer word. (Restricted the answer to a single word and fixed the size of the answer vocabulary.)
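The data flow of this baseline (bag-of-words question vector, concatenation with image features, MLP, softmax over a fixed answer vocabulary) can be sketched without any deep learning library. The weights below are random and untrained, the single hidden layer with ReLU is an assumption, and the tiny dimensions are for illustration only.

```python
import math
import random


def bag_of_words(question, vocab):
    words = question.lower().strip("?").split()
    return [float(words.count(w)) for w in vocab]


def dense(x, weights, biases):
    """One fully connected layer: W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(n_in, n_hidden, n_out, seed=0):
    """Random stand-ins for trained weights (illustrative only)."""
    rng = random.Random(seed)

    def layer(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {"W1": layer(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "W2": layer(n_out, n_hidden), "b2": [0.0] * n_out}


def predict_answer(image_features, question, vocab, answers, params):
    # Concatenate image features with the bag-of-words question vector.
    x = list(image_features) + bag_of_words(question, vocab)
    hidden = [max(0.0, h) for h in dense(x, params["W1"], params["b1"])]
    probs = softmax(dense(hidden, params["W2"], params["b2"]))
    best = max(range(len(answers)), key=lambda i: probs[i])
    return answers[best], probs
```

Because the output layer has one unit per entry in the fixed answer vocabulary, the model can only ever produce single-word answers from that list.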

Experiments and Intuitions:

Tweaked the number of hidden layers, number of hidden nodes to squeeze the last bit of accuracy out of the model.
Used a Recurrent Neural Network to generate the answer.
It was interesting to note that an RNN trained to generate the answer from the question alone, without the image features, was only ~3% less accurate than the model that also received the image!
This gave an interesting insight: the RNN language model learns to guess answers intelligently from the question alone, something it empirically does much better than humans!

Image Captioning

Timeline: November 24th — November 30th

Motivation:

Creating an algorithm that automatically captions an image, producing a full-length English sentence describing the scene.

Technical Details:

Produced Image captions using multimodal Recurrent Neural Networks.
Extracted Image features from fc7 layer of the Caffe VGG-16 Model, pretrained with ImageNet weights.
Used LSTM Language Model for the text generation.
Initialized the hidden state of the LSTM network with image features linearly mapped to a lower (512) dimension.
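The initialization step amounts to a single linear map from the 4096-dimensional fc7 vector to the 512-dimensional hidden state. Below is a minimal sketch with a random matrix standing in for the learned projection. The function names are hypothetical, and how the LSTM cell state was initialized is not stated in the write-up, so it is left out.

```python
import random

FC7_DIM = 4096    # size of the VGG-16 fc7 activation
HIDDEN_DIM = 512  # LSTM hidden size quoted above


def make_projection(in_dim, out_dim, seed=0):
    """Random stand-in for a learned projection matrix (illustrative only)."""
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    biases = [0.0] * out_dim
    return weights, biases


def initial_hidden_state(image_features, weights, biases):
    """h0 = W x + b: a purely linear map with no activation function."""
    return [sum(w * x for w, x in zip(row, image_features)) + b
            for row, b in zip(weights, biases)]
```

In the full-size case, `make_projection(FC7_DIM, HIDDEN_DIM)` would give a 512 x 4096 matrix, and `initial_hidden_state` would turn one fc7 vector into the LSTM's starting hidden state.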

Experiments:

Initialized the word vectors with pretrained GloVe Vectors.
Used a multi-layer LSTM for the language model, and analyzed the performance benefits of using a GRU instead of an LSTM.
Passed the image features through the LSTM gates before initializing the hidden states. (The intuition is that the gates may reject unnecessary image features and allow only useful features to pass through.)

Currently ranked 14th on the BLEU-1 C5 metric and 19th on the CIDEr-D C40 metric on the CodaLab Image Captioning leaderboard.
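For readers unfamiliar with BLEU-1, it is clipped unigram precision multiplied by a brevity penalty, computed against several reference captions (C5 refers to five references per image). The sketch below scores a single caption for illustration. The leaderboard numbers come from the official evaluation server, which aggregates over the whole test set, so this is not a reimplementation of it.

```python
import math
from collections import Counter


def bleu1(candidate, references):
    """Sentence-level BLEU-1: clipped unigram precision times brevity penalty."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand:
        return 0.0

    # For each word, the most times it appears in any single reference.
    max_ref = Counter()
    for ref in refs:
        for word, count in Counter(ref).items():
            max_ref[word] = max(max_ref[word], count)

    # Clip each candidate word count by that maximum.
    clipped = sum(min(count, max_ref[word])
                  for word, count in Counter(cand).items())
    precision = clipped / len(cand)

    # Brevity penalty against the reference length closest to the candidate.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    penalty = 1.0 if len(cand) > ref_len else math.exp(1.0 - ref_len / len(cand))
    return penalty * precision
```

Clipping is what stops a degenerate caption such as "the the the the" from scoring well just because "the" appears in every reference.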

Image + Text Sentiment Analysis

Timeline: November 17th — November 24th

Motivation:

Leveraging the sentiment content of images for 5-class sentiment analysis, relevant to platforms such as Twitter, Snapchat and Instagram.

Technical Details:

Calculated Visual sentiment by identifying ANPs (Adjective Noun Pairs) associated with each image using Convolutional Neural Networks.
Used a pool of 3244 recognized Adjective Noun Pairs. (Example: Misty Night, Colorful Clouds)
Used Caffe AlexNet model (pretrained with ImageNet weights) to classify the images into 1 of 3244 classes, representing an ANP.
Maintained a SentiBank that scores each ANP with a sentiment value between -2 and 2.
Calculated text sentiment using a Recurrent Neural Network.
Maintained a vocabulary of 8000 words, each mapped to a word vector representation (initialized with pretrained GloVe word vectors).
Mean Pooled the hidden states of the Recurrent Neural Network for a given sentence and classified using a softmax layer.
The 5 sentiment classes represent successive integer values from -2 to 2, ranging from strongly negative to strongly positive sentiment.
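The text branch's final stage (mean pooling over the RNN hidden states, then a softmax layer over five classes) can be sketched as below. The hidden states are passed in as plain lists, the weights are placeholders, and the function names are hypothetical; the RNN itself is not shown.

```python
import math

# Class index 0..4 maps to sentiment values -2..2, as described above.
SENTIMENT_VALUES = [-2, -1, 0, 1, 2]


def mean_pool(hidden_states):
    """Average the hidden state vectors over all timesteps."""
    n = len(hidden_states)
    return [sum(column) / n for column in zip(*hidden_states)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(hidden_states, weights, biases):
    """Softmax layer on the mean-pooled state; returns (sentiment value, probabilities)."""
    pooled = mean_pool(hidden_states)
    logits = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return SENTIMENT_VALUES[best], probs
```

Mean pooling gives every word in the sentence an equal say in the final representation, rather than relying only on the last hidden state.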

Data Science Capstone

Timeline: March 2015 — May 2015

Completed the Data Science Capstone offered by Johns Hopkins University on Coursera, in partnership with SwiftKey Corporation.
Designed a Next-Word Prediction App that is fast, accurate and responsive, deployed online at shinyapps.io.
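The write-up does not say which model powers the app, so the following is purely an illustrative assumption: a bigram counter, one of the simplest ways to predict the next word. It is written in Python for consistency with the other sketches here and is not the app's actual code.

```python
from collections import Counter, defaultdict


def train_bigrams(corpus):
    """Count which words follow each word across the training sentences."""
    following = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following


def predict_next(following, text, top_k=3):
    """Suggest the most frequent followers of the last word typed."""
    words = text.lower().split()
    if not words or words[-1] not in following:
        return []
    return [word for word, _ in following[words[-1]].most_common(top_k)]


model = train_bigrams(["i love deep learning", "i love pizza", "i like tea"])
suggestions = predict_next(model, "I")
```

A lookup like this is fast because prediction is a single dictionary access, although it only ever looks at the last word typed.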

Skills Learnt: