Technical Portfolio
--
About Me
- I’m a Computer Science undergraduate passionate about Artificial Intelligence.
- Andrew Ng’s introductory course on Machine Learning introduced me to a field at the intersection of Math and Computer Science, my favorite subjects since high school.
- This article (https://aeon.co/essays/how-close-are-we-to-creating-artificial-intelligence) inspired me to read about the fundamentals of AI, including Turing’s paper “Computing Machinery and Intelligence”.
- I’ve been researching open problems at the intersection of Computer Vision and Natural Language Processing, such as Image Captioning and Visual Question Answering, using Deep Learning to tackle them.
- I’d love to be a part of the AI Community, and work as a Deep Learning Research Scientist along with the top minds and PhD researchers in this field.
Things that I believe I need to work on:
1. Open source contribution, either by publishing open source code or contributing to open source deep learning frameworks such as Keras, Lasagne, Chainer, Caffe, Theano.
2. Experimenting with Generative Adversarial Networks, Neural Turing Machines, and Dynamic Memory Networks.
3. I’m currently at the level of understanding state-of-the-art research work. Someday I’d like to contribute to the research community by publishing my own research.
Projects
Playing Pong, Breakout Using Deep Reinforcement Learning
Timeline: February 1st — February 10th
Motivation:
Demis Hassabis’ video about the Theory Of Everything and his presentation of DeepMind’s work
Technical Details:
- Developed the Pong game using the pygame Python library.
- Used the Arcade Learning Environment to simulate the Breakout game.
- Implemented the Deep Q-Learning algorithm as is, using Keras (also experimented with TensorFlow).
- Used an Experience Replay memory of 500,000 timesteps and an ε-greedy exploration scheme, with ε annealed gradually from 1.0 to 0.05 over 2 million steps.
- The AI learns to win Pong with lopsided scores such as 20–4 after 2.5 million timesteps of training.
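The exploration schedule and replay memory described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the linear shape of the annealing and the names `epsilon_at`, `ReplayMemory`, and `choose_action` are assumptions, and only the numbers (500,000 transitions, 2 million steps, 1.0 to 0.05) come from the list above.

```python
import random
from collections import deque


def epsilon_at(step, start=1.0, end=0.05, anneal_steps=2_000_000):
    """Anneal epsilon from `start` to `end` (linear shape is an assumption)."""
    fraction = min(step / anneal_steps, 1.0)
    return start + fraction * (end - start)


class ReplayMemory:
    """Fixed-size buffer of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity=500_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive frames.
        indices = random.sample(range(len(self.buffer)), batch_size)
        return [self.buffer[i] for i in indices]

    def __len__(self):
        return len(self.buffer)


def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: a random action with probability epsilon, else argmax Q."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Early in training `epsilon_at(step)` is close to 1.0, so almost every action is random; after 2 million steps it stays at 0.05 and the agent mostly follows its Q-values.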
Object Recognition and Detection, Visual Product Recommendation for Fashion Domain
Timeline: January 8th — January 29th
Motivation:
Object Recognition and Detection techniques are required in the fashion domain to identify the different articles of clothing a person is wearing and to provide visual product recommendations.
Technical Details:
- Performed Object Recognition using a standard Convolutional Neural Network (Caffe AlexNet Model)
- Performed Object detection using the concept of Bounding Box Regression.
- Performed Visual Product Recommendation by ensembling a visual similarity model and a color similarity model.
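One way the ensembling step could look is sketched below. The source does not say how the two models score similarity or how they are combined, so everything here is an assumption for illustration: cosine similarity over CNN features, histogram intersection over color histograms, a 0.7/0.3 weighting, and the toy catalog itself.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def histogram_intersection(h1, h2):
    """Overlap of two normalized color histograms (1.0 means identical)."""
    return sum(min(x, y) for x, y in zip(h1, h2))


def recommend(query, catalog, visual_weight=0.7, top_k=2):
    """Rank catalog items by a weighted sum of the two similarity scores."""
    scored = []
    for name, item in catalog.items():
        visual = cosine_similarity(query["cnn"], item["cnn"])
        color = histogram_intersection(query["hist"], item["hist"])
        scored.append((visual_weight * visual + (1 - visual_weight) * color, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical items with made-up 3-dimensional features and histograms.
catalog = {
    "red_dress": {"cnn": [0.9, 0.1, 0.0], "hist": [0.8, 0.1, 0.1]},
    "blue_dress": {"cnn": [0.8, 0.2, 0.0], "hist": [0.1, 0.1, 0.8]},
    "red_shoe": {"cnn": [0.0, 0.1, 0.9], "hist": [0.8, 0.1, 0.1]},
}
query = {"cnn": [1.0, 0.0, 0.0], "hist": [0.7, 0.2, 0.1]}
print(recommend(query, catalog))
```

With these toy numbers the red dress ranks first because it matches on both shape and color, while the blue dress still outranks the red shoe because the visual score carries more weight.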
Experiments:
- Tried Caffe implementations of R-CNN, Fast R-CNN and Faster R-CNN to perform Object Detection. However, these did not produce good results, possibly due to the lack of a large amount of training data.
- Performed Visual Product Recommendation using distance metric learning techniques such as Siamese Networks (LeCun et al.) and Deep Ranking. Neither technique produced satisfactory results.
- Used Saliency Detection (DeepDetect, CVPR 2015) to remove background noise from in-the-wild images. However, this was computationally expensive and too slow for real-time use.
Yelp Restaurant Photo Classification
Timeline: January 15th — January 18th
Motivation:
Identifying 9 independent attributes of restaurants from a large set of user-submitted photos.
Technical Details:
- Extracted Image features from the pool5 layer of the Caffe GoogLeNet model.
- Individually classified attributes of each image using a Gradient Boosted Tree (XGBoost Model).
- Averaged the set of 9 attributes identified in each image, across all the images belonging to the same restaurant.
- Currently ranked 8th on the Kaggle leaderboard.
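The averaging step can be sketched as below. This assumes the per-image classifier outputs one probability per attribute; the function name, the dictionary layout, and the final 0.5 threshold are assumptions for illustration (the source only states that per-image attributes were averaged per restaurant).

```python
from collections import defaultdict


def average_by_restaurant(image_probs, image_to_restaurant, threshold=0.5):
    """Average per-image attribute probabilities within each restaurant.

    image_probs: {image_id: [p_0, p_1, ...]} from the per-image classifier.
    image_to_restaurant: {image_id: restaurant_id}.
    Returns {restaurant_id: attribute indices whose mean is >= threshold}.
    """
    sums = {}
    counts = defaultdict(int)
    for image_id, probs in image_probs.items():
        rid = image_to_restaurant[image_id]
        if rid not in sums:
            sums[rid] = [0.0] * len(probs)
        sums[rid] = [s + p for s, p in zip(sums[rid], probs)]
        counts[rid] += 1

    labels = {}
    for rid, total in sums.items():
        means = [s / counts[rid] for s in total]
        labels[rid] = [i for i, m in enumerate(means) if m >= threshold]
    return labels
```

Averaging lets a few confident photos outvote noisy ones: a restaurant with photo scores of 0.9 and 0.7 for an attribute keeps it, while scores of 0.6 and 0.2 drop it.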
Experiments:
- A Gradient Boosted Tree on top of CNN image features performed better than a model trained to individually classify each of the 9 attributes.
- A maximum tree depth of 2, combined with early stopping, gave the best results.
- Used 5-fold cross validation to tune the hyperparameters of the XGBoost model.
Generating New Rap Lyrics From Eminem’s Songs
Timeline: December 8th — December 17th
Motivation:
Experimenting with Character Level Language Models and pulling this off would be really cool!
Technical Details:
- Used a character-level LSTM network to generate the rap lyrics character by character.
- The model failed to converge in the end, and didn’t produce meaningful rap lyrics.
- However, it was interesting to see the model trying to piece together characters out of nowhere and making (almost) meaningful words out of them.
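The data preparation and sampling around a character-level model can be sketched as follows. The LSTM itself is omitted; the helper names, the sliding-window setup, and temperature sampling are assumptions in the style of char-rnn, not details confirmed by the project.

```python
import math
import random


def build_vocab(text):
    """Map each distinct character to an integer index and back."""
    chars = sorted(set(text))
    return {c: i for i, c in enumerate(chars)}, chars


def make_training_pairs(text, char_to_idx, seq_len):
    """Slide a window over the text: seq_len input chars, next char as target."""
    pairs = []
    for start in range(len(text) - seq_len):
        window = text[start:start + seq_len]
        target = text[start + seq_len]
        pairs.append(([char_to_idx[c] for c in window], char_to_idx[target]))
    return pairs


def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```

A lower temperature makes sampling nearly greedy, while a higher one gives more varied but noisier text, which is one knob worth trying when the output looks like almost-words.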
Question Answering on Simple Toy Tasks: Towards AI-Complete Question Answering
Timeline: December 5th — December 8th
Motivation:
Experimenting with the new concept of Memory Networks, introduced by FAIR.
Problem Definition: An example toy task, as defined by the Facebook bAbI project, is answering questions about the locations of objects or characters from passages that simulate them moving around and interacting in different locations.
Technical Details:
- Converted each sentence in the passage to a vector representation and stored it in the memory.
- Used a neural network (Memory network) to score the match between the question vector and each vector stored in the memory.
- Applied the same process to extract supporting facts from the memory and used each supporting fact to identify the next.
- Performed the same toy task using End-To-End Memory Networks, which produced (almost) the same results but required less supervision than the Memory Network model.
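The store-then-score idea above can be illustrated with a toy example. This sketch swaps in plain bag-of-words counts and a dot product where the real model uses learned embeddings and a trained scoring network, and the three-sentence story is made up in the style of bAbI, so treat it as an illustration of the mechanism only.

```python
import math


def tokenize(sentence):
    return sentence.lower().replace("?", "").replace(".", "").split()


def bag_of_words(sentence, vocab):
    """Count vector over a fixed vocabulary (stand-in for a learned embedding)."""
    words = tokenize(sentence)
    return [words.count(w) for w in vocab]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(question_vec, memory):
    """Score every memory slot against the question and normalize the scores."""
    scores = [sum(q * m for q, m in zip(question_vec, slot)) for slot in memory]
    return softmax(scores)


# Hypothetical story and question, not taken from the dataset.
story = [
    "Mary went to the kitchen.",
    "John went to the garden.",
    "Mary picked up the milk.",
]
question = "Where is John?"
vocab = sorted({w for s in story + [question] for w in tokenize(s)})
memory = [bag_of_words(s, vocab) for s in story]
weights = attend(bag_of_words(question, vocab), memory)
best = max(range(len(story)), key=lambda i: weights[i])
print(story[best])
```

Here the second sentence gets the highest weight because it shares the word "john" with the question; a trained model learns embeddings so that matches like this also work when the wording differs.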
Visual Question Answering
Timeline: December 1st — December 5th
Motivation and Scope:
Visual Question Answering definitely had to follow Image Captioning! Also, performing VQA to a satisfactory level is one of my long-term research goals.
VQA is a very open research problem. To narrow the scope and make a positive stride towards solving it, the question types were restricted to four: Object, Color, Number, and Location.
Technical Details:
- Extracted Image features from fc7 layer of the Caffe VGG-16 Model, pretrained with ImageNet weights.
- Used Bag of Words representation for the question sentence.
- Fed the concatenated image and BOW question features to a Multilayer Perceptron (MLP) that predicts the answer word. (Restricted the answer to a single word, and fixed the vocabulary size of the answer word.)
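The input construction for the MLP can be sketched as below. The MLP itself is left out; the function names, the binary (rather than count-based) bag of words, and the tiny vocabularies are assumptions for illustration.

```python
def bow_vector(question, vocab):
    """Binary bag-of-words vector for a question over a fixed vocabulary."""
    tokens = set(question.lower().rstrip("?").split())
    return [1.0 if word in tokens else 0.0 for word in vocab]


def build_mlp_input(image_features, question, vocab):
    """Concatenate image features with the question's bag-of-words vector."""
    return list(image_features) + bow_vector(question, vocab)


def predict_answer(scores, answer_vocab):
    """Pick the single answer word with the highest output score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return answer_vocab[best]
```

With 4096 fc7 values and a question vocabulary of size V, the MLP input has 4096 + V entries, and its output layer has one unit per word in the fixed answer vocabulary.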
Experiments and Intuitions:
- Tweaked the number of hidden layers, number of hidden nodes to squeeze the last bit of accuracy out of the model.
- Used a Recurrent Neural Network to generate the answer.
- It was interesting to note that an RNN trained to generate the answer from the question alone, without the image features, was only ~3% less accurate than the model that also received the image!
- This suggests that the RNN language model learns to intelligently guess answers just from the question, which empirically turns out to be much better than humans do when guessing from the question alone!
Image Captioning
Timeline: November 24th — November 30th
Motivation:
Creating an algorithm that automatically captions an image, producing a full-length English sentence describing the scene.
Technical Details:
- Produced Image captions using multimodal Recurrent Neural Networks.
- Extracted Image features from fc7 layer of the Caffe VGG-16 Model, pretrained with ImageNet weights.
- Used LSTM Language Model for the text generation.
- Initialized the hidden state of the LSTM network with image features linearly mapped to a lower-dimensional (512) space.
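The initialization step amounts to one linear layer applied to the image features. The sketch below uses random weights in place of learned ones and tiny dimensions so it runs quickly; the function names are hypothetical, and in the setup described above the mapping would be 4096 (fc7) to 512.

```python
import random


def make_projection(in_dim, out_dim, seed=0):
    """Random weights standing in for a learned linear layer (in_dim -> out_dim)."""
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def initial_hidden_state(image_features, weights, bias):
    """h0 = W x + b: map image features to the LSTM's hidden size."""
    return [sum(w * x for w, x in zip(row, image_features)) + b
            for row, b in zip(weights, bias)]


# fc7 of VGG-16 yields 4096 values; small sizes keep this demo fast.
weights, bias = make_projection(in_dim=8, out_dim=4)
h0 = initial_hidden_state([1.0] * 8, weights, bias)
print(len(h0))
```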
Experiments:
- Initialized the word vectors with pretrained GloVe Vectors.
- Used a multi-layer LSTM for the language model, and analyzed the performance benefits of using a GRU instead of an LSTM.
- Passed the image features through the LSTM gates before initializing the hidden states. (The intuition being that the gates may reject unnecessary image features and let only useful ones pass through.)
Currently ranked 14th on the BLEU-1 C5 metric on the CodaLab Image Captioning leaderboard, and 19th on the CIDEr-D C40 metric.
Image + Text Sentiment Analysis
Timeline: November 17th — November 24th
Motivation:
Leveraging the sentiment content of images for 5-class sentiment analysis, relevant to media such as Twitter, Snapchat, and Instagram.
Technical Details:
- Calculated visual sentiment by identifying ANPs (Adjective Noun Pairs) associated with each image using Convolutional Neural Networks.
- Used a pool of 3244 recognized Adjective Noun Pairs. (Example: Misty Night, Colorful Clouds)
- Used Caffe AlexNet model (pretrained with ImageNet weights) to classify the images into 1 of 3244 classes, representing an ANP.
- Maintained a SentiBank that scores each ANP with a sentiment value between -2 and 2.
- Calculated text sentiment using a Recurrent Neural Network.
- Maintained a vocabulary of 8000 words, each mapped to a word vector representation (initialized with GloVe pretrained word vectors).
- Mean Pooled the hidden states of the Recurrent Neural Network for a given sentence and classified using a softmax layer.
- The 5 sentiment classes represent consecutive integer values from -2 to 2, ranging from strongly negative to strongly positive sentiment.
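The text-sentiment head described above (mean pooling followed by a softmax layer) can be sketched as follows. The RNN that produces the hidden states is omitted, and the function names and the hand-set weights in the usage example are assumptions for illustration.

```python
import math


def mean_pool(hidden_states):
    """Average the per-timestep hidden states into one sentence vector."""
    steps = len(hidden_states)
    return [sum(column) / steps for column in zip(*hidden_states)]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(hidden_states, weights, bias):
    """Return (sentiment in -2..2, class probabilities) from pooled states."""
    pooled = mean_pool(hidden_states)
    logits = [sum(w * h for w, h in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best - 2, probs  # class index 0..4 maps to sentiment -2..2
```

For example, with two timesteps `[[1.0, 0.0], [3.0, 2.0]]` the pooled vector is `[2.0, 1.0]`, and a weight matrix whose last row is `[1, 1]` (all others zero) yields the strongly positive class.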
Data Science Capstone
Timeline: March 2015 — May 2015
- Completed the Data Science Capstone, offered by Johns Hopkins University on Coursera in partnership with SwiftKey Corporation.
- Designed a Next-Word Prediction App that is fast, accurate, responsive and deployed online at shinyapps.io
Skills Learnt:
- Text Preprocessing (Data Cleaning, Tokenization, Sentence Segmentation)
- Text Mining, Exploratory Analysis and Data Visualization
- Language Modelling
Tools Used:
- Shell Scripts, R and RStudio, Shiny (Creating web-apps)
- Project Presentation : Data Science Capstone Project
- Project Report : Data Science Capstone Milestone Report
- Project Display : Next-Word Prediction App
Frameworks, Tools Learnt:
- R, Python
- Caffe
- Theano, Keras, TensorFlow, Lasagne
- Graphlab, Scikit-Learn
References
Visual Sentiment Ontology: http://www.ee.columbia.edu/ln/dvmm/vso/download/flickr_dataset.html
Robust Image Sentiment Analysis: https://www.cs.rochester.edu/u/qyou/papers/sentiment_analysis_final.pdf
Show and Tell: A Neural Image Caption Generator: http://arxiv.org/abs/1411.4555v2.pdf
Towards AI-Complete Question Answering — A Set of Prerequisite Toy Tasks: http://arxiv.org/abs/1502.05698
Memory Networks:
End-To-End Memory Networks: http://arxiv.org/abs/1503.08895
Char-RNN: https://github.com/karpathy/char-rnn
Toronto-COCO-QA Dataset: http://www.cs.toronto.edu/~mren/imageqa/data/cocoqa/
Visual Question Answering: http://www.visualqa.org/
R-CNN: http://www.cs.berkeley.edu/~rbg/papers/r-cnn-cvpr.pdf
Learning Visual Similarity For Product Design: http://www.cs.cornell.edu/~kb/publications/SIG15ProductNet.pdf
Saliency Detection: http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Zhao_Saliency_Detection_by_2015_CVPR_paper.pdf
Deep Reinforcement Learning: http://www.nervanasys.com/demystifying-deep-reinforcement-learning/
Technical Background
1. CS231n — Convolutional Neural Networks For Visual Recognition
Timeline: November 2015 — January 2016
Stanford University — OCW
2. CS224d — Deep Learning For Natural Language Processing
Timeline: July 2015 — October 2015
Stanford University — OCW
3. Data Science Specialization
Timeline: May 2015 — May 2016
Johns Hopkins University — Coursera
4. Machine Learning
Timeline : September 2013 — November 2013
Stanford University — Coursera