Unlike the latest installment in the Avengers franchise, it was not all gloom and doom in the first 11 months. I made a ‘next word’ predictor in R, trained on Twitter data, which could predict the next word almost instantaneously by scanning 29 million n-grams. Its accuracy was lower than that of the state-of-the-art SwiftKey, but good enough. I went on to build a model for a large infrastructure company to predict the probability that a customer would default on the next payment and, more importantly, why.
But there was a lull after the initial 11 months of midnight oil burning. R was good and I could wrangle data like nobody’s business, but I still had no idea how to train predictive models. Then a colleague suggested that I switch to Python. Complaining to my aforementioned cat (she’s a good listener) that I had to learn a new syntax, I enrolled in Applied Data Science with Python on coursera.org. Although the instructors were mediocre, I enjoyed the course.
Pandas was conceptually the same as R, and the sklearn lessons helped me understand the preprocess → train → evaluate → predict → repeat workflow. That was a breakthrough for me: I could finally connect theory to practice. But sklearn has so many algorithms that I had no clue how to approach a problem and, more importantly, which method to use. I tried some online competitions on Kaggle and Analytics Vidhya, but made little progress. Data science as a discipline was growing exponentially, and I was being left behind.
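That workflow can be sketched in a few lines of scikit-learn. This is a minimal illustration on a built-in toy dataset, not the actual competition code; the dataset and model choices here are mine, picked purely for the example:

```python
# Minimal preprocess -> train -> evaluate -> predict workflow in scikit-learn.
# Illustrative only: the iris dataset stands in for real project data.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold out data so evaluation is on examples the model has never seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Preprocess and train chained into one pipeline.
model = Pipeline([
    ("scale", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
model.fit(X_train, y_train)

# Evaluate, then predict; if the score is poor, adjust and repeat.
preds = model.predict(X_test)
print(f"accuracy: {accuracy_score(y_test, preds):.3f}")
```

Once this loop is second nature, swapping in a different model or preprocessing step is a one-line change, which is exactly what makes the workflow so reusable.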
Then, in an obscure article in Flipboard, I found a reference to a revolutionary site called fast.ai. I decided to give it a try and I was hooked. Their motto is ‘Democratize AI’. Their goal is to wrest AI from the labs and classrooms of elite academicians and disseminate it to everybody. And they are really doing it.
They primarily recommend just two methods, RandomForest/GBM and neural nets, claiming that these two can solve more than 95% of ML problems. That’s good enough for me. Their teaching method is ‘top down’: they solve the problem first and then, layer by layer, reveal the concepts and details of how it was solved. It’s like Mother Nature, who gives you the test first and the lesson later. Imagine how effective this method could be if it were applied widely. If I had decided at 6 years old that I wanted to be a data scientist, and thereafter learnt only what a data scientist is required to know, I would not have wasted time on World History and Chemistry.
The first course I did was Machine Learning Course Part 1. It took me around 6 months to go through and understand the videos, but at the end I could score in the top 5% of a Kaggle competition. More than anything else, it gave me the confidence and the strategy to tackle any machine learning problem. I am in the midst of their Deep Learning Course Part 1 - v3, and after just 3 lessons I built my first image classifier to help me at my part-time job as a real estate photographer. I also got to within 0.01 of the top score in the Intel Scene Classification contest at Analytics Vidhya. Recently I submitted an entry to the Berkeley AI Summit challenge.
I can now predict more than just house prices. I have become relevant again.
Like software programming skills, data science is also becoming a commodity. You don’t need a PhD to understand it.
Some advice for people on the fence:
- Anyone can do ML and DL. All you need is an open mind, perseverance, domain knowledge, ability to identify use cases — and stackoverflow.com. Advanced math is optional, unless you want to do research, and Python is easy.
- There are many sub-fields in data science — regression, classification, NLP, computer vision, etc. When starting out, focus on one category of problems, for example regression on structured tabular data. Move on to another only after you are comfortable and getting reasonably good results. Find your niche.
- After you understand the basic methods and processes for that sub-field, choose a problem that you would like to solve and learn as you solve it. It may take weeks or months, but at the end you will feel great and you will have learnt a lot.
- Nowadays, a huge amount of knowledge is available freely. If you do not carefully pick and choose what to learn, it is easy to go off on tangents and lose the main thread. Therefore, learn only what is necessary to solve your chosen problem.
- Don’t chase after the latest developments in the field. Every day there is new research, with new papers, new approaches, and new solutions. Do read a few that are relevant to your current problems, but if you try to keep up with every new paper and blog, you will lose focus.
- Find the method of coding that suits you. I code the way a spider weaves a web: line by line. Code a line -> run it -> if it errors out, check stackoverflow -> correct the code and run again -> repeat till the line works -> add the line to the main code body. Over time, you will need to check stackoverflow less and less.
- An excellent IDE which supports my style is (not surprisingly) called Spyder. Other than the standard conveniences, it lets you run code line by line, and shows details of the variables you are using in a Variable Explorer. Very useful.
- Even a couple of years back, getting the necessary infrastructure to run DL was expensive. Now, you have many online virtual machines which charge pennies per hour and provide nearly all the power you need.
- Dedicate some time every day. It can be a small amount of time like 30 minutes, but it will help you keep moving forward.
- People from diverse and non-mathematical backgrounds are becoming data scientists. There are a whole host of tools and libraries which you can use with a smattering of technical knowledge. It is easier for a domain expert to learn data science than vice versa.
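As a concrete starting point for the “regression on structured tabular data” suggestion above, here is a minimal sketch. The dataset (scikit-learn’s built-in diabetes data) and the model are my own choices for illustration; GBM is one of the two workhorse methods fast.ai recommends:

```python
# A first regression problem on structured tabular data.
# The diabetes dataset is a stand-in for whatever tabular data you pick.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Gradient boosting needs no feature scaling, so it is a forgiving
# first model for a beginner's tabular-data project.
reg = GradientBoostingRegressor(random_state=0)
reg.fit(X_train, y_train)

score = r2_score(y_test, reg.predict(X_test))
print(f"R^2 on held-out data: {score:.3f}")
```

Get something like this running end to end on a problem you care about, then iterate on features and hyperparameters; that tight loop is where the learning happens.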
I am always on the lookout for interesting problems to solve. Do you have one?
If you have not read Part 1 you can find it here.