Summer Internship Experience: Part 3

Shivam Sahu
Jul 19, 2019

Learning from the Internship Experience @ Startup

I spent 2 months as an intern at Inturiyam, a company that focuses on NLP in Indian languages. My main project was about detecting text in images and extracting it. If you’re interested, I briefly explained how the overall project works in Part 2 of this blog.

I want to sum up the 9 main things I learned. First, 3 technical lessons about the realities of machine learning and its deployment. Second, the 3 main skills every data scientist should have. And lastly, 3 ways to learn data science that helped me.

Three machine learning and production-level deployment lessons

There is no “best model ever”. Some models perform better on some problems, yes. For instance, neural networks have shown their superiority in computer vision and natural language processing. But in general, no one can assert that one model will always perform better than the others. Model performance metrics are the only way to get at the truth. That’s why we shouldn’t reason like “neural networks are awesome at complex problems like computer vision, so let’s use them for our sales predictions”. That’s a bad way of thinking. To find the best learning algorithm, a practical approach is to run a first training pass with several different algorithms on a smaller subset of the data, cross-validate each one quickly, and compare the results. Repeat that process on different subsets and pick the algorithm that performs best on average to use on the whole dataset.
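The selection loop described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code used during the internship: the data is synthetic, the three candidate "models" (a mean baseline, a one-feature least-squares line, and a 1-nearest-neighbour regressor) are toy stand-ins, and every function name here is hypothetical. In a real project a library such as scikit-learn would normally do this work.

```python
import random
import statistics


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_mean(xs, ys):
    """Baseline: always predict the mean of the training targets."""
    mean = statistics.fmean(ys)
    return lambda x: mean


def fit_linear(xs, ys):
    """Least-squares line for a single feature."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    return lambda x: my + slope * (x - mx)


def fit_nearest(xs, ys):
    """1-nearest-neighbour regressor: copy the target of the closest point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def cv_mse(fit, xs, ys, k=5):
    """Mean squared error averaged over k cross-validation folds."""
    fold_errors = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        fold_errors.append(
            statistics.fmean([(model(xs[i]) - ys[i]) ** 2 for i in fold])
        )
    return statistics.fmean(fold_errors)


def compare_models(models, xs, ys, n_subsets=3, subset_size=60, k=5, seed=0):
    """Cross-validate every model on several random subsets, then average."""
    rng = random.Random(seed)
    scores = {name: [] for name in models}
    for _ in range(n_subsets):
        idx = rng.sample(range(len(xs)), subset_size)  # random, so also shuffled
        sub_x = [xs[i] for i in idx]
        sub_y = [ys[i] for i in idx]
        for name, fit in models.items():
            scores[name].append(cv_mse(fit, sub_x, sub_y, k))
    return {name: statistics.fmean(vals) for name, vals in scores.items()}


# Synthetic data (an assumption for the demo): y is roughly linear in x.
data_rng = random.Random(42)
xs = [data_rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 * x + 1.0 + data_rng.gauss(0, 1.0) for x in xs]

models = {"mean": fit_mean, "linear": fit_linear, "nearest": fit_nearest}
results = compare_models(models, xs, ys)
best = min(results, key=results.get)  # lowest average error wins
print(results)
print("best on average:", best)
```

Because the synthetic target is linear, the line wins here; on a different dataset another candidate could come out ahead, which is exactly why the comparison is worth running instead of assuming a winner.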

“Data preprocessing is everything.” Usually, we don’t spend much time implementing the learning algorithms. Plenty of frameworks and libraries have already done that work, and they are well optimized, with time and memory performance that is sufficient for our use case. The main task in the learning phase is to tune hyperparameters and run cross-validation to measure improvements. That is not the most demanding part, though. Preprocessing the data is. By preprocessing I mean adding complementary features from external datasets, cleaning the data (removing “bad” rows), extracting new features from existing ones (feature combination or decomposition), selecting the best ones (feature selection or dimensionality reduction), and so on. The only way to judge whether a preprocessing step has helped is, once again, to measure the improvement with cross-validation or by comparing learning curves.
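To make the “measure the improvement with cross-validation” point concrete, here is a minimal sketch that scores the same simple model before and after one feature-combination step. Everything in it is an assumption made for illustration: the data is synthetic (a hypothetical price that depends on width times height), the model is a one-feature least-squares line, and the function names are made up.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares line for a single feature."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    return lambda x: my + slope * (x - mx)


def cv_mse(features, targets, k=5):
    """k-fold cross-validated mean squared error (row i goes to fold i % k)."""
    n = len(features)
    fold_errors = []
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        model = fit_line(
            [features[i] for i in train_idx],
            [targets[i] for i in train_idx],
        )
        fold_errors.append(
            statistics.fmean([(model(features[i]) - targets[i]) ** 2 for i in test_idx])
        )
    return statistics.fmean(fold_errors)


# Synthetic rows (assumption): the target depends on the product of two columns.
rng = random.Random(7)
rows = [(rng.uniform(1, 10), rng.uniform(1, 10)) for _ in range(150)]
prices = [2.0 * w * h + rng.gauss(0, 1.0) for w, h in rows]

# Before preprocessing: use one raw column as the feature.
raw_feature = [w for w, h in rows]
# After preprocessing: combine the two columns into a single "area" feature.
area_feature = [w * h for w, h in rows]

mse_raw = cv_mse(raw_feature, prices)
mse_area = cv_mse(area_feature, prices)
print(f"CV MSE with raw width:    {mse_raw:.2f}")
print(f"CV MSE with area feature: {mse_area:.2f}")
```

The model and the folds are identical in both runs, so any change in the score comes from the preprocessing step alone. That is the comparison to make before keeping a new feature.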

“Deploying the machine learning model” is yet another task. A machine learning model is of no use if it doesn’t go into production. To take the system engineering burden off developers, many companies offer platforms that make it easy to deploy a model. We chose Google Cloud Platform, because there the model can be scaled easily and security measures can be implemented properly. Things get ugly once scalability comes into play.

Scalability matters the most

Three data scientist skills

“Maths and statistics” To my mind, one can get started in machine learning without advanced maths or statistics knowledge. With programming skills, frameworks, and a basic understanding of machine learning principles, a lot can still be achieved. That said, we can’t just throw some model at our data, and we can’t experiment endlessly either: developing a model takes effort, and training it requires a lot of computation, which is very costly. So a data scientist should preferably have a deep understanding of their model. For instance, they should be able to answer “why did the model produce output Y given input X?”, mainly because users may legitimately want to know why the service made a decision.

“Computer science and programming” Yes, it is important, mainly for manipulating data (files, databases) and operating on it (by programming or by using high-level software). During the analysis phase, scripting is useful for handling data and making some visualizations before training. But later in the project, when it comes to building a full-service application, software engineering skills are required. While a machine learning engineer will keep their focus on the learning phase, a data scientist should work on the whole pipeline, from data manipulation to delivering the predictions to the application that uses them.

“Patience and Communication” Being patient while developing the model is very important. A model can be built in a short span of time, but it may not give the desired result. Not being a researcher or a math expert, I was often unable to figure out why the model was not working, and it was quite frustrating to see the efforts of the previous days go in vain. But I also realised that this is the learning curve everyone has to go through. Learning from a production-level project is altogether different from learning from an academic one.

Communicating results and discoveries within the data is the main point for a data scientist. They should get people to understand what’s hidden in the data and what can be useful for their company. It can be a talk, a written report, or even better, a visual report built with data visualization techniques. When dealing with people who are not in the field, business vocabulary and business metrics should be used to help them understand, instead of lower-level statistical metrics that mean less from their point of view.

Three ways to learn data science

“Online Courses” The obvious starting point, at least for machine learning, is Coursera. Its machine learning and deep learning classes are considered among the best ways to start. Overall, online course websites can be a great way to learn complex theories that take a long time to study. If you are interested in the math behind Deep Learning, then you should consider going over this post for resources to understand it.

“Hands-on projects” The best way to learn is to do projects. Seeing things work in reality makes learning more fun and productive. One faces a lot of difficulties while putting things into deployment, and the experience gained makes us appreciate why it was necessary to learn those things in academic courses.

“Meetups” Bangalore is the Silicon Valley of India, and Indiranagar is where many big companies started out, so meetup culture is strong in Indiranagar. In the past two months, every time I visited a cafe or restaurant, I found people sitting with laptops and discussing their ideas. We know that a lot of ideas fail, but if even one in a thousand succeeds, it is worth it. Seeing such passion for startups among young people was altogether new to me. I know it is rude to listen in on someone’s conversation, but I listened to see how they function, how they convince VCs to invest in their startups, and how they hope to make India a better place to live, because not all the ideas were about making money.

Overall it was a great experience at Inturiyam: not only did they equip me with a lot of knowledge in the field of computer science and machine learning, they also gave me rewarding responsibilities right away.

Thanks for reading till the end. :)

Part 1 link : here

Part 2 link : here

Shivam Sahu

Senior Undergraduate Student @ Madan Mohan Malaviya University of Technology