My Internship Experience at Couture.ai

Rahul Shevade
Published in cverse-ai · 6 min read · Oct 1, 2019

An internship is one of the most crucial phases in one’s professional career. It sets the stage for the transition from student life to a more work-oriented life. Couture.ai was my first internship, so I was very excited to finally get to experience it. It began on 21st May and ended on 13th July. Everything here, from the people and the work to the office itself, was wonderful.

I wasn’t a coding geek, nor did I know many languages, but that was okay: the idea was not to learn any particular language, but to be able to implement algorithms and logic through any one of them.

In the first week, I had to learn some basics of machine learning (from the popular Coursera course by Andrew Ng). I had already done a related course in college (Neural Networks and Fuzzy Logic), so the material was not very difficult to understand.

After a brief review of machine learning basics, I was given my first real task: learning about evaluation metrics in information retrieval. I learned about Mean Average Precision (MAP), Discounted Cumulative Gain (DCG), and a few others, and I had to implement Discounted Cumulative Gain on the data given to me.

Given below is a picture that shows how MAP is calculated. Here, recall asks: out of all the relevant documents, how many have we retrieved? Precision asks: out of all the documents retrieved, how many are relevant?

Calculating the MAP
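
To make the idea concrete, here is a rough Python sketch of average precision, the per-query quantity that MAP averages over queries. The relevance judgments and rankings below are made up purely for illustration, not the actual internship data.

```python
def average_precision(relevant, retrieved):
    """Average precision for one query.

    relevant:  set of relevant document ids
    retrieved: ranked list of retrieved document ids
    """
    hits, precisions = 0, []
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """MAP = mean of average precision over all queries."""
    return sum(average_precision(rel, ret) for rel, ret in queries) / len(queries)

# toy example: two queries, each a (relevant set, ranked retrieval) pair
queries = [
    ({"d1", "d3"}, ["d1", "d2", "d3"]),  # AP = (1/1 + 2/3) / 2
    ({"d2"},       ["d1", "d2"]),        # AP = (1/2) / 1
]
print(mean_average_precision(queries))   # ~0.667
```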

Given below is the formula and example to calculate Discounted Cumulative Gain.

Discounted Cumulative Gain
Normalized DCG

At rank 1, in Ranking 1, we get the discounted gain (DG) as 23/log₂(1+1) = 23/1 = 23

At rank 2, in Ranking 1, we get DG as 10/log₂(2+1) = 10/1.585 = 6.309

At rank 3, in Ranking 1, we get DG as 17/log₂(3+1) = 17/2 = 8.5

DCG1 = 23 + 6.309 + 8.5 = 37.809

Ranking 1, Ranking 2 and Ideal Rank

Similarly, for Ranking 2, we get DCG2 as 17/log₂(2) + 23/log₂(3) + 10/log₂(4) = 17 + 14.51 + 5 = 36.51. We can see that Ranking 1 is better than Ranking 2. To compare different ranking algorithms, we divide the DCG by the DCG of the ideal ranking (IDCG) to get Normalised Discounted Cumulative Gain (nDCG).
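
As a quick sanity check on the numbers above, here is a short Python sketch of DCG and nDCG using log base 2, with the relevance lists from the worked example:

```python
from math import log2

def dcg(relevances):
    """Discounted Cumulative Gain: rel_i / log2(i + 1), summed over ranks i = 1..n."""
    return sum(rel / log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """Normalised DCG: DCG divided by the DCG of the ideal (descending) ordering."""
    return dcg(relevances) / dcg(sorted(relevances, reverse=True))

ranking_1 = [23, 10, 17]   # relevances at ranks 1, 2, 3
ranking_2 = [17, 23, 10]

print(dcg(ranking_1))                   # ~37.81
print(dcg(ranking_2))                   # ~36.51
print(ndcg(ranking_1), ndcg(ranking_2)) # Ranking 1 scores higher
```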

Since I wasn’t very fluent in Python, it was a great opportunity for me to learn about Python’s usability and its ability to do so much in so few lines of code. From generator expressions to its many libraries and their functionality, it never ceased to amaze me. I encountered a few hiccups in the beginning and had to look up the errors many times to see if other people had hit them too (and, most importantly, to jump straight to the solutions posted on different websites). But in the end, when the code finally did what I wanted it to, it was all well and good. I had to create a few output files to feed into the dashboard, both for visualization and to evaluate what had been done so far.

My second task was to learn about topic modeling. It seemed like a very interesting and useful technique. I began by looking at Latent Dirichlet Allocation (LDA), probably the most popular topic model. It was pretty amazing what LDA could achieve: it could discover hidden topics in the documents, and we could then group the documents based on their topic distributions. We don’t need to explicitly name the topics; they can remain abstract and will still be discovered. Below is an image of how a topic model works, along with a plate diagram.

Topic Model
A typical Plate Diagram

The most important part was cleaning the text before feeding it to the model. Cleaning involved operations such as tokenizing, stemming, lemmatizing, splitting words, forming n-grams and, last but not least, spelling correction. The order of these steps had to be chosen carefully. After running the cleaned text through a vectorizer, it was finally ready for the actual LDA algorithm. I used grid search with cross-validation to choose the hyperparameters (number of topics and learning decay), and for the first time I saw a (relatively) modern machine struggling to produce an output that, I reckoned, should only take a few seconds. This made me understand the importance of optimizing code and the need for faster hardware. There was nothing I could do about that, so I just had to wait until the code produced an output. It also gave me an idea of the vastness of the data that people work with. I created more files for the dashboard.
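
For a sense of what such a pipeline can look like, here is a minimal sketch using scikit-learn’s CountVectorizer, LatentDirichletAllocation and GridSearchCV. The toy documents and the parameter grid are illustrative assumptions, not the actual data or settings I worked with.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

# toy corpus, standing in for the cleaned (tokenized, stemmed, spell-corrected) text
docs = [
    "cotton shirt slim fit",
    "denim jeans regular fit",
    "leather wallet brown",
    "canvas shoes white sneaker",
]

# bag-of-words counts; LDA works on term counts
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# search over the two hyperparameters mentioned above
param_grid = {"n_components": [2, 3, 4], "learning_decay": [0.5, 0.7, 0.9]}
search = GridSearchCV(LatentDirichletAllocation(random_state=0), param_grid, cv=2)
search.fit(X)

best_lda = search.best_estimator_
print(search.best_params_)

# per-document topic distribution, which later feeds the clustering step
doc_topics = best_lda.transform(X)
print(doc_topics.round(2))
```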

Finally, I had to use a clustering algorithm to cluster the documents based on their topic distributions. K-Means clustering is the simplest and most popular clustering algorithm, so I used that. We first need to decide how many clusters to construct. If the data has only two or three components (dimensions), we can just plot the raw data and pick the number visually. But often there are more than three components, so plotting is not an option at all. The next option is to run the algorithm multiple times with different values of K and plot a graph of error versus K. We then look at the graph, locate an ‘elbow’-like bend, and choose the value near the joint (aptly called the ‘elbow method’).

Using Elbow Method to choose K

In this image, WCSS is the error (within-cluster sum of squares), which is simply the sum of squared distances between the points and their cluster centers. It tells us whether the points in a cluster are dispersed far from the cluster center or compact and well defined. The X-axis is the number of clusters; here ‘3’ is chosen as the optimal number. What we try to do is choose a point where the decrease in error still justifies the increase in the number of clusters. If we have too many clusters, the model becomes unnecessarily complex and won’t generalize to new data points.
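
Here is a small sketch of how such an elbow plot can be produced with scikit-learn’s KMeans, where WCSS comes from the model’s inertia_ attribute. The input matrix is random stand-in data, not the real document-topic matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# stand-in data: rows are documents, columns are topic weights
rng = np.random.default_rng(0)
X = rng.random((200, 5))

ks = range(1, 11)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```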

But for our work (and any other, for that matter), automation beats manual intervention any day, so we wanted to automate the choice of K. A helpful article, titled How to Automatically Determine the Number of Clusters in your Data — and more, was pretty simple to follow and easy to implement.
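
As one possible way to automate the choice (a common heuristic, not necessarily the method from that article), you can pick the K whose point on the WCSS curve lies farthest from the straight line joining the curve’s first and last points:

```python
import numpy as np

def elbow_k(ks, wcss):
    """Pick K at the point of the WCSS curve farthest from the line
    joining its first and last points (a common elbow heuristic)."""
    ks = np.asarray(list(ks), dtype=float)
    wcss = np.asarray(wcss, dtype=float)
    # unit vector along the line from the first to the last point
    p1 = np.array([ks[0], wcss[0]])
    p2 = np.array([ks[-1], wcss[-1]])
    line = (p2 - p1) / np.linalg.norm(p2 - p1)
    # perpendicular distance of each curve point from that line
    vecs = np.column_stack([ks, wcss]) - p1
    dists = np.abs(vecs[:, 0] * line[1] - vecs[:, 1] * line[0])
    return int(ks[np.argmax(dists)])

# reusing ks and wcss from the elbow plot above:
# print(elbow_k(ks, wcss))
```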

I wrapped up my work by compiling and merging files that held different information for different parts of the dashboard. Along the way, I learned basic table operations such as joins, pivoting, melting, sorting and group-by. I also picked up a lot of small details and became much better at preventing and recognizing common errors in Python. I tried to be more efficient where I could and to keep the code as short as possible.
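
For illustration, here are a few of those pandas operations on a made-up toy frame (the column names are hypothetical, not the actual dashboard files):

```python
import pandas as pd

# toy data standing in for the dashboard files
docs = pd.DataFrame({
    "doc_id":  [1, 2, 3, 4],
    "cluster": ["A", "A", "B", "B"],
    "score":   [0.9, 0.7, 0.4, 0.8],
})
labels = pd.DataFrame({"cluster": ["A", "B"], "label": ["apparel", "footwear"]})

merged = docs.merge(labels, on="cluster", how="left")           # join
by_cluster = merged.groupby("label")["score"].mean()            # group by
pivoted = merged.pivot_table(index="label", values="score",
                             aggfunc="mean")                    # pivot
melted = pd.melt(merged, id_vars="doc_id",
                 value_vars=["cluster", "label"])                # melt
sorted_docs = merged.sort_values("score", ascending=False)      # sort

print(merged, by_cluster, sep="\n\n")
```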

In the end, this was a new and fun couple of months, something I had wanted to experience for a long time. I had been looking forward to it, and I was very keen to observe how people operate and communicate in an organization, and how the different pieces of work done by different employees all come together to become part of something bigger and whole. This will remain one of the best learning experiences of my life.
