Beginning machine learning as a software engineer in a hurry

Conrado Quilles Gomes
Published in TOTVS Labs
Nov 15, 2017 · 5 min read

If you are like me and right now don’t have enough time to dive into Python, R, statistics, and algorithms, but still want to play with data and make some predictions using machine learning, I think I can give you a shortcut to accomplish that.

Overwhelmed

First, let me quickly summarize how I got to this point: during my work in Silicon Valley at TOTVS Labs, I had the opportunity to spend some time with Mario Filho and learn from one of the best people in the field. It was a good experience: I came to understand the foundation of his work and his thought process, and, more importantly, that I don’t need to understand absolutely everything before I can start playing with it. He helped me break the barrier I had around data science.

He insisted that I should start learning data science by competing in a Kaggle competition. At first, I was a little reluctant because I knew what I was getting myself into, and I knew I wasn’t ready to go deeper. Here is why:

Here is how my journey into data science started:

✔️ After spending three weeks studying on Dataquest (very good, but boring), I realized that I needed a stronger statistics and probability base;

✔️ One more week reading Head First Statistics, and then I understood what I was getting myself into: Math ➡️ Statistics ➡️ Probability ➡️ Distributions ➡️ Regression ➡️ Bayes’ theorem ➡️ Implementing a machine learning algorithm ➡️ 💥 No, no, no, not yet!

Here we go… I can see my future… and it doesn’t look good: six months to a year of “deep learning” (pun intended) before risking anything…

“Not this time, Conrado!”

A quick journey

If you are like me and want to achieve something before committing a huge amount of time, then you should do this:

Prepare yourself

  1. Watch a video about statistics (you don’t need to watch it all; I’m sure you’ll come back to it later):

  2. Watch this series from Google about machine learning (ignore the coding part and stick to the concepts):

  3. Watch this series from Data Science Dojo:

And then practice

  1. Play around with this Kaggle Titanic sample in Azure ML. If you want to watch the experiment being created, check this out;
  2. Submit some predictions to Kaggle (see the sketch after this list);
  3. Check others’ work and forum discussions on Kaggle and the Cortana Gallery;
  4. Change and improve your experiment;
  5. Try new models and play around with the tools provided; the Azure ML documentation will help you understand the parameters and components, as well as provide new sample experiments;
  6. Repeat step 4, and soon you’ll lose track of time;
  7. When you’ve had enough, move on to the next Kaggle competition.
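
If you end up scripting step 2 instead of exporting predictions from Azure ML, here’s a minimal sketch of what a Kaggle submission looks like. It’s my own illustration in Python with pandas and scikit-learn, not part of the Azure ML workflow; the file and column names come from the Titanic competition page.

```python
# Hypothetical sketch: train a simple model on the Kaggle Titanic data and
# write a submission file. Assumes train.csv and test.csv were downloaded
# from the competition page (https://www.kaggle.com/c/titanic).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Minimal cleaning: encode sex as 0/1 and fill missing values with medians.
features = ["Pclass", "Sex", "Age", "Fare"]
for df in (train, test):
    df["Sex"] = (df["Sex"] == "female").astype(int)
    df["Age"] = df["Age"].fillna(train["Age"].median())
    df["Fare"] = df["Fare"].fillna(train["Fare"].median())

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train[features], train["Survived"])

# The Titanic competition expects exactly these two columns.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": model.predict(test[features]),
})
submission.to_csv("submission.csv", index=False)
```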

With those simple steps, you’ll find yourself reading about each algorithm and technical detail, looking into others’ solutions to learn how to extract and infer better features, experimenting with hyperparameters, stacking models, and creating custom Python/R scripts just to see whether that new feature improves your prediction or not.
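
To make “experimenting with hyperparameters” concrete, here’s a minimal sketch using scikit-learn’s GridSearchCV. It assumes the train DataFrame and features list from the earlier snippet; the parameter grid is just an example, not a recommendation.

```python
# Hypothetical sketch: search a small hyperparameter grid with 5-fold
# cross-validation instead of guessing. Reuses train/features from the
# earlier snippet.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = train[features], train["Survived"]

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, None],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```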

This is what I’ve done so far:

My first two competitions. It’s looking good for me! I’m not that lost after all!

My best model for the Titanic competition is only a 3% improvement over the template from Data Science Dojo, but it took me 24 tries (models) to find it. I tested different algorithms, training strategies, and a lot of useless new features.

What I’ve learned so far

After all of this, concepts like supervised learning, especially classification and regression, make more sense to me, and now I’m close to trying some unsupervised problems.

Another important thing I wasn’t aware of is how much data quality matters. In most of my exercises so far, I got pretty clean and well-prepared data. I know this made my life easier, and I can imagine the kinds of problems people run into in real-world cases.

The organization and roles in data science are also clearer to me now: I can tell the difference between a data analyst, a data engineer, and a data scientist, each with their own part in the data flow.

Infrastructure ➡️ Get ➡️ Clean ➡️ Explore ➡️ Create Model ➡️ Check ➡️ Predict ➡️ Deploy

  • Infrastructure (Data Engineer): The pipeline to get data from systems and the structure to enable everything below;
  • Get (Data Engineer): Bring data from a wide range of sources;
  • Clean (Data Engineer): Fill the gaps, remove unusable or useless data;
  • Explore (Data Analyst): Infer new features, extract business information and create better representations of the data;
  • Create Model (Data Scientist): Find the best algorithm, parameters, and performance;
  • Check (Data Scientist): Validate your results;
  • Predict (Data Scientist): Run the model against a real-world, unknown case;
  • Deploy (ML Engineer): Make the prediction available to your users;
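
As a software engineer, it helped me to see those stages as code. Below is a minimal, hypothetical sketch of the Clean ➡️ Create Model ➡️ Check ➡️ Predict steps as a scikit-learn Pipeline, using a bundled toy dataset so it runs on its own; the mapping of stages to pipeline steps is my own illustration, not an official one.

```python
# Hypothetical sketch: Clean -> Create Model -> Check -> Predict as a single
# scikit-learn Pipeline, on a bundled toy dataset so the example is
# self-contained.
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ("clean", SimpleImputer(strategy="median")),   # Clean: fill the gaps
    ("scale", StandardScaler()),                   # prepare the features
    ("model", LogisticRegression(max_iter=1000)),  # Create Model
])

# Check: validate with cross-validation before trusting the model.
scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print("CV accuracy:", scores.mean())

# Predict: run the fitted model against data it has never seen.
pipeline.fit(X_train, y_train)
print("Held-out accuracy:", pipeline.score(X_test, y_test))
```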

Now I’m ready!

Both you and I are ready to go further. I want to reimplement all my Azure ML experiments in Python, try new algorithms that aren’t available there, explore the data I’m using more deeply, and finally start the Coursera course that all my data scientist colleagues always suggest to me:

I know I’m still far from being even a data analyst, but this knowledge gave me the foundation to better understand my coworkers’ roles, what I can do as a software engineer to improve their workflow, the importance of good data, and which tools they use to achieve their goals. After all, that’s what I really need to know.

Hey, you! Data scientist! What path did you take? What do you think about the one I took?
