Hi! So you want to get started with data science and ML and don't know where to start? What concepts to learn? Which algorithms to read about first? Well, you have come to the right place. I am starting a new series of blogs on ML, and this is Part 1.
In ML there are certain things that are absolutely necessary to know. You will use them in almost every data science and ML task. Good news for you all!! You will learn all these concepts in this blog. If you are not familiar with stats, you can view one of my blogs on stats by clicking on the link below.
Descriptive Stats (Concepts + Code)
Descriptive statistics provide simple summaries about the sample. Such summaries may be either quantitative (summary…
So without wasting time, let's skip straight to the cool stuff. Some common concepts of ML and data science are:
- Problem Statement (type of learning)
- Metrics (How good is our algorithm)
- Feature Engineering
- Feature Scaling
- Algorithms (based on type of learning problem)
- Model Parameters and HyperParameters
- Cross Validation
There are a lot more, but the above will work fine for starters.
Trust me, I could write a 20-page chapter to explain each concept and it still wouldn't be enough, but I will try to give an overview of everything in just 5 minutes.
A problem statement is simply the problem that you are trying to solve. LOL!! Well, what I mean is that the problem statement tells us what type of learning technique we should use to solve a problem: whether we have a supervised learning, unsupervised learning, reinforcement learning, or deep learning problem. (I will explain each of these in some other blog.) You will only get a feel for this after solving, or reading about, many different types of data science problems.
- Supervised learning: you are required to predict whether an email is spam or not (provided you have some data containing spam and non-spam emails).
- Unsupervised learning: you want to recommend a video from certain categories to the users of YouTube (provided you have user data, and by that I mean the videos they have watched before).
- Reinforcement learning: you want to create an agent / computer player to play chess against the world's best chess player. Oh!! Fun fact: Google actually did that.
- Deep learning: you want to build a face recognition application, a chatbot, or an Alexa-like device.
Remember: a learning problem can also be categorised into sub-categories like regression, classification, time series, clustering, etc., but more on that some other day.
A metric is used to judge the performance of an algorithm, or of a model to be precise. The key to being a great data scientist or ML engineer is to pick the right metric and optimize the algorithm for it in the best possible way. Unlike online data science competitions, in the real world we are required to choose the metric that best describes our model's performance. There are fairly standard metrics for almost every problem, though!! Let's look at some of them.
For regression we have metrics like MAE (mean absolute error) and RMSE (root mean squared error), and for classification we have metrics like ROC-AUC, log loss, or accuracy, and sometimes we just use precision and recall.
Don’t worry I will write a separate blog on this too .
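Just to give you a taste, here is a tiny sketch of the two regression metrics mentioned above, written from their definitions in plain Python (the numbers are made up for illustration):

```python
import math

def mae(y_true, y_pred):
    # mean absolute error: average size of the mistakes
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # root mean squared error: like MAE, but punishes big mistakes more
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [3.0, 5.0, 2.0]   # actual values (made up)
y_pred = [2.5, 5.0, 4.0]   # what our model predicted (made up)
print(mae(y_true, y_pred))   # about 0.83
print(rmse(y_true, y_pred))  # about 1.19
```

Notice how RMSE comes out larger than MAE here: the one big miss (predicting 4.0 instead of 2.0) gets squared, so it dominates.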
Data science problems are solved in a vectorized way (I hope you know what that means), and a vector is something which has a magnitude and a direction. Right!! Well, dimensions can simply be thought of as the number of features in our given data (that's how we visualize a problem: in a space with many dimensions).
(Every column in your data table is a vector and adds to the dimension of the data, i.e. if you have data with 5 columns of independent variables, then the dimension of the data at the beginning is 5. Remember: this count keeps growing as we do feature engineering.)
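To make that concrete, here is a quick sketch with pandas, using a made-up table of 5 house-related columns; the dimension of this data is simply the number of columns:

```python
import pandas as pd

# hypothetical data with 5 independent variables (columns)
data = pd.DataFrame({
    "area_sqft": [1200, 950],
    "bedrooms": [3, 2],
    "bathrooms": [2, 1],
    "age_years": [10, 25],
    "distance_to_city": [5.0, 12.0],
})
print(data.shape[1])  # 5 -> the dimension of our data
```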
Feature engineering, in layman's terms, is just the set of features that you as a data scientist / ML engineer use and create from the given data to build an algorithm or a model. Remember, I am talking about feature engineering and not feature learning. Let's look at an example:
Suppose you want to predict the price of a house (a regression problem from supervised learning, with metric = RMSE).
What features in general do you think are important when predicting the price of a house?? Correct!! If your answer is the location or area of the house, or you say nearby facilities, then you are on the right track, and obviously you are right too. You have your features, but remember to test your hypothesis too: what you say or think does not mean you are right. See and verify with the data.
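Here is a small sketch of feature engineering on made-up house data: we start with 3 columns and create 2 new ones from them, so the dimension of our data grows from 3 to 5 (the column names and numbers are purely illustrative):

```python
import pandas as pd

houses = pd.DataFrame({
    "area_sqft": [1200, 950, 2000],
    "bedrooms": [3, 2, 4],
    "bathrooms": [2, 1, 3],
})

# engineered features: built from the existing columns
houses["total_rooms"] = houses["bedrooms"] + houses["bathrooms"]
houses["area_per_room"] = houses["area_sqft"] / houses["total_rooms"]

print(houses.shape[1])  # 5 -> we went from 3 features to 5
```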
Feature scaling is important because of the way features are used inside an algorithm: you need a way to quantify different variables on the same scale. Let's look at it:
Suppose you are trying to develop a model which will predict how rich a person is (a regression problem from supervised learning), you create 2 features (number of currency notes and number of currency coins), and you build a simple model like
Model = number of notes + number of coins
Do you think this is an accurate model to predict how rich a person is ??
If your answer is no, then you are on the right track!! A note is worth far more than a coin, so adding the two raw counts treats them as equally important when they are not. You need feature scaling to put such variables on a comparable scale and improve the efficiency of your model/algorithm.
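One common way to do this is standardization, which rescales each column to have mean 0 and unit variance. A minimal sketch with scikit-learn, using made-up notes-and-coins counts:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical data: each row is a person, columns are [notes, coins]
X = np.array([[10, 500],
              [200, 20],
              [50, 100]], dtype=float)

# after scaling, each column has mean 0 and unit variance,
# so neither feature dominates just because of its raw magnitude
scaled = StandardScaler().fit_transform(X)
print(scaled.mean(axis=0).round(6))  # ~[0, 0]
```

There are other options too (min-max scaling, robust scaling, etc.); which one to use depends on the algorithm and the data.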
After understanding the above concepts we can learn about ML algorithms. Well, in general an algorithm is just program logic to do a task, so an ML algorithm is just a program that automatically creates the program logic to solve a problem or do a task. And what is a task?? The problem statement, obviously!! Let's look at some examples of machine learning algorithms.
Linear regression, polynomial regression, neural networks, SVM, decision trees, random forest, MuseNet, AlexNet, LSTM, k-means, single-linkage clustering, KNN, XGBoost, etc. If I missed any, don't worry: it takes about 4 lines of Python code to use these algorithms and develop a model (don't judge me, but it's true), and I can't explain all these algorithms in 5 minutes (but I will explain them some day, provided you read all my blogs :) ).
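And I really do mean about 4 lines. Here is linear regression on made-up house data with scikit-learn (area in square feet, price in dollars, both invented for the example):

```python
from sklearn.linear_model import LinearRegression

X = [[1200], [950], [2000], [1500]]       # house area (made-up)
y = [240000, 190000, 410000, 300000]      # house price (made-up)
model = LinearRegression().fit(X, y)      # that's the whole training step
print(model.predict([[1300]]))            # predicted price for a new house
```

Swapping in a different algorithm (a decision tree, a random forest, …) is usually just a matter of changing the first two lines.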
Model parameters and Hyperparameters
We create a model by fitting an ML algorithm to the data, and then the machine learns to solve the problem statement by learning values called model parameters (for example, the coefficients in a linear regression). If you are wondering what training and testing data are, just wait a minute. Then there are other kinds of parameters that cannot be learned directly from the data; we call these hyperparameters, and they are higher-level settings that control how the algorithm learns and affect its efficiency.
A. The depth of the tree or the number of sub-samples, etc., when creating a model using a decision tree algorithm.
B. The learning rate used in many regression algorithms.
C. The number of clusters in the k-means algorithm.
And many more. We need to tune them, i.e. repeatedly try different values to find a good model.
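Tuning can even be automated. A minimal sketch with scikit-learn's GridSearchCV, which tries every value of the hyperparameter `max_depth` for a decision tree and keeps the best one (the data is made up):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X = [[1200], [950], [2000], [1500], [800], [1700]]  # house area (made-up)
y = [240, 190, 410, 300, 160, 340]                  # price in thousands (made-up)

# max_depth is a hyperparameter: WE choose candidate values,
# the model cannot learn it from the data by itself
search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid={"max_depth": [1, 2, 3]}, cv=3)
search.fit(X, y)
print(search.best_params_)  # the depth that worked best in cross validation
```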
Cross validation is basically a technique to improve the efficiency of our model, or rather to create a generalised model that performs better on new data. We split the train data into chunks, generally of 5%, 10%, or 20% each depending on the size of the train data, and test the performance of the model on each of these chunks individually by treating one at a time as the test set and the remaining ones as the train set. This in turn gives us a more generalised model that works well on both train and test data.
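In scikit-learn this whole split-train-evaluate loop is one call. A sketch on made-up, perfectly linear data with 5 folds (so each chunk is 20% of the train data):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = [[i] for i in range(20)]          # toy feature (made-up)
y = [2 * i + 1 for i in range(20)]    # toy target: perfectly linear

# cv=5: split into 5 chunks, hold out each chunk once, train on the rest
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores)  # one score per held-out chunk
```

If the scores are all high and similar, that is good evidence the model generalises rather than just memorising one particular split.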
Summary in 100 words:
A data science model is developed by working on data. We divide this initial data into 2 parts: train data and test data. In general, the training data will have independent variables called features and an output variable which depends on the features (independent variables). We always work on the training data: we create a model by fitting an ML algorithm to the training data, then apply the model to the test data and calculate its performance against a pre-defined metric. We expect our model to be generalised and to perform well on the test data (data that is completely new, which our model has not seen before). But one common rule of data science is to never look at or use the test data while building the model. So how can we say, by working only on the training data, that our model will perform well on the test data too? The answer is simple: by doing cross validation.
I hope you liked my blog. Thanks for reading. You can also view my blog on the Pythonic way of doing things by clicking on the link below.
Pythonic Way Of Doing Things
Disclaimer: This blog is not about the zen of python. This blog is about simple and easy ways of solving problems in…
Don't forget to clap, share, and follow!