TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

machine learning != .fit()


Like all beginners, I made plenty of mistakes when I started my machine learning journey. It's the same story: a lot to learn, a lot of hype, tons of technical jargon, tons of prerequisites to cover, and whatnot.

Sounds familiar? I know.


For me, machine learning at that time was all about complex math and nothing else. It was all about more and more complex models. I was learning all the mathematical details and fooling myself into thinking I was learning, but in reality, that wasn't the case.

There was one de facto rule in my mind back then: if you want to increase the performance or reduce the error metric, just switch to a more complex model. And keep doing that until performance improves or the loss decreases.

The results? Frustration!

Things were not going as I expected. I kept waiting for the performance to shoot up after switching from linear regression to a support vector machine regressor with an RBF kernel.

But this is not how things work in the real world.

After doing this many times, I realized something was going wrong, and I needed to find out what.

And I learned:

Garbage in => Machine => Garbage out

machine learning != .fit()

Don’t make the same mistake.

There’s a lot more to machine learning.

It’s not only about complex models. It’s more than that.

Below are three main steps in the machine learning pipeline that you should also focus on:

Data cleaning

In real-world projects, you don't get a clean .csv file every time. More often, you have to collect the data from multiple sources, clean it, join it, and then check for consistency and duplication.

Data cleaning doesn't only involve filling in missing values. There's a lot more to it. A few of the tasks are listed below:

  • Checking the consistency throughout the data
  • Selection of a subset of the data that is relevant to your problem
  • Outlier handling

Just removing the outliers isn't the solution. Sometimes you have to dig deep and find out why a record behaves differently. Is that value possible in real cases, given your problem statement, or not? Then handle it accordingly.

  • Renaming columns for interpretability
  • Removing duplicate records
  • Handling missing values
  • Storing the scraped data in the proper format

and a lot more…
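The steps above can be sketched in a few lines of pandas. The `amt` column and the toy values here are entirely made up for illustration; real pipelines differ, but the shape of the work is the same:

```python
import numpy as np
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """A minimal cleaning pass: rename, deduplicate, impute, cap outliers."""
    df = df.rename(columns={"amt": "amount"})  # readable column names
    df = df.drop_duplicates()                  # remove duplicate records
    # Impute missing values with the median (robust to outliers)
    df["amount"] = df["amount"].fillna(df["amount"].median())
    # Cap extreme values at the 1st/99th percentiles instead of deleting them
    lo, hi = df["amount"].quantile([0.01, 0.99])
    df["amount"] = df["amount"].clip(lo, hi)
    return df.reset_index(drop=True)

raw = pd.DataFrame({"amt": [10.0, 10.0, np.nan, 12.0, 5000.0]})
cleaned = clean(raw)
```

Note the outlier handling here caps values rather than dropping rows; as discussed above, whether that is the right choice depends on why the record behaves differently.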

Data analysis and visualization

If you don't know your data well, you can't do anything with it. Plain and simple.

Ask questions relevant to the business problem and answer them with code.
Sometimes, in the real world, the task is not to build a state-of-the-art model to predict something. It's to analyze the data, find the hidden insights that can benefit the business, and present those insights in simple language.

Example: Suppose you work at Amazon and you have most of the details about each transaction customers make on the platform.

Questions you can ask:

  1. What is the peak time of day, when most transactions take place?
  2. Which types of products sell most on Amazon?
  3. Does the number of ratings affect purchases?

And many more…
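With a transactions table in hand, questions like these reduce to a few lines of pandas. The table below (column names and values) is entirely hypothetical, just to show the pattern:

```python
import pandas as pd

# Hypothetical transactions table for illustration
tx = pd.DataFrame({
    "hour":      [9, 14, 14, 20, 20, 20],
    "category":  ["books", "toys", "books", "books", "grocery", "books"],
    "n_ratings": [120, 5, 300, 80, 10, 450],
})

# Q1: at which hour do most transactions happen?
peak_hour = tx["hour"].value_counts().idxmax()

# Q2: which product category sells most?
top_category = tx["category"].value_counts().idxmax()
```

Each business question becomes a small aggregation; the hard part is asking the right question, not writing the code.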

Got the point? Your goal is to find insights that can help Amazon get more sales.

Your questions change as your problem statement changes.

Even when your goal is to predict something, you cannot do it well without knowing your data.

And it doesn't stop there: data analysis and visualization also help you in the next step of the pipeline, feature engineering.

There is a reason why everyone says: know thyself.

Feature engineering

It's the process of extracting new features from the original feature set, or transforming existing features, to make them work better for the machine learning model.

Why feature engineering?

You need to understand: a simple model with a good feature set outperforms a complex model with a bad one.

You cannot do feature engineering without understanding the data and your problem statement.

Domain knowledge plays a very important role in feature engineering. Find out what previous work has been done in the space, read the literature, and incorporate your findings to construct new features. Even better, talk to domain experts if you have that luxury.
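A common example of extracting new features: a raw timestamp is nearly useless to most models, but features derived from it often carry the real signal (the timestamps below are made up for illustration):

```python
import pandas as pd

# Hypothetical transaction timestamps
tx = pd.DataFrame({"timestamp": pd.to_datetime([
    "2021-01-04 09:15",  # a Monday morning
    "2021-01-09 21:40",  # a Saturday evening
    "2021-01-10 13:05",  # a Sunday afternoon
])})

# Derive features a model can actually use
tx["hour"] = tx["timestamp"].dt.hour
tx["is_weekend"] = tx["timestamp"].dt.dayofweek >= 5  # Saturday=5, Sunday=6
```

Domain knowledge tells you *which* derived features matter: for retail transactions, hour of day and weekend/weekday are obvious candidates.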

Feature selection, i.e., selecting a subset of important features from the original set, also comes under feature engineering, and sklearn has great documentation about it. Read it here.
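As one rough sketch of what sklearn offers here, `SelectKBest` scores each feature against the target and keeps the top k; shown below on sklearn's bundled iris dataset (the choice of `f_classif` and `k=2` is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Score features with a one-way ANOVA F-test and keep the best 2
selector = SelectKBest(f_classif, k=2).fit(X, y)
X_selected = selector.transform(X)
```

sklearn's feature selection module also includes variance thresholds, recursive feature elimination, and model-based selection; which one fits depends on your data and model.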

Create a simple baseline model so that you can compare your results after incorporating new features, and see if they’re helpful or not. Remember: It’s an iterative process.
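A minimal sketch of that workflow, using synthetic data and sklearn's `DummyRegressor` (which always predicts the training mean) as the baseline:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: the target depends only on the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Baseline: always predict the mean. Any real model must beat this.
baseline = cross_val_score(DummyRegressor(), X, y, cv=5, scoring="r2").mean()
model = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()
```

Each time you add a feature, rerun the comparison; if the cross-validated score doesn't move, the feature isn't pulling its weight.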

Investing your time in constructing new features is far better than waiting for a complex model to start working.

Key lesson: Don't fall for complex models; fall in love with data. Focus on every aspect. It's all connected.

Hope you enjoyed this article and learned something new. Incorporate these lessons into your next project, and share the article with someone who is just starting out!

Follow me for more such articles. Peace and Power.




Written by Rishabh garg

Machine Learning Practitioner and life long learner. Twitter: @rishabh_grg
