The Basic Essence of Data Science and Machine Learning

Swarnendu Sengupta
Published in CodeX · May 8, 2021

“The sexiest job of the 21st century”… Well, it isn’t “sexy” anymore.

The job so many declared the “sexiest” is no longer sexy, because many people and professionals have failed to understand the basic skills that a data scientist or a machine learning developer should have.

People are of the opinion that getting the data and applying heavy algorithms will solve their problems. However, most of the time the solution to a huge problem lies in the basics. If the basics of the problem are identified, the solution presents itself.

So what is data science actually?

Is it just fetching the data and applying algorithms heuristically? Is it about tuning parameters on the raw data, getting outputs in various forms, and then presenting them to the customer? Is it all about taking the raw, uncooked, “filthy” data and assuming it is the right data?

Well, I am afraid the answer to all these questions is a big NO. Data science, like cooking or poetry or dance or sculpture, is an art form. There are certain steps and processes that need to be followed to get valuable insights from the data at hand, though there is always room for improvement.

The process or steps could be as follows:

See the data

Most of the time, this important step is simply skipped. Unless the data is seen and understood, no matter what algorithm is applied, the output will always be substandard.
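As a minimal sketch of this first look, assuming Python with pandas and a hypothetical CSV file called data.csv:

```python
import pandas as pd

# Hypothetical path; substitute your own file
df = pd.read_csv("data.csv")

print(df.shape)   # number of rows and columns
print(df.head())  # the first five rows
df.info()         # column names, dtypes, and non-null counts
```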

Pre-processing the data

This crucial step gets to the heart of many problems that can then be solved easily. It follows from the previous step. Here, a few sub-steps are involved (a code sketch follows the list):

  1. NULL or NA values are identified and analyzed
  2. The class or category of each attribute is checked, i.e. whether it is numeric, character, or categorical, and information about it is gathered
  3. The distribution of the output is examined: is it evenly distributed or not (in the case of classification), or what is the range and what values can the output take (in the case of regression)?
  4. Correlation between the attributes
  5. Covariance
  6. Skewness of the data
  7. Standard deviation; for outlier values, some processing is needed to check their validity
  8. Other statistical tools can be applied to understand how to proceed further
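Here is a rough sketch of these checks in Python with pandas; the file path and the "target" column name are hypothetical placeholders:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical path

# 1. NULL / NA values per column
print(df.isna().sum())

# 2. Class or category of each attribute
print(df.dtypes)

# 3. Distribution of the output ('target' is a hypothetical column name)
print(df["target"].value_counts(normalize=True))  # classification: class balance
# print(df["target"].describe())                  # regression: range and spread

numeric = df.select_dtypes(include="number")

# 4-7. Correlation, covariance, skewness, standard deviation
print(numeric.corr())
print(numeric.cov())
print(numeric.skew())
print(numeric.std())

# 7. Flag values more than three standard deviations from the mean as
# potential outliers whose validity needs checking
z = (numeric - numeric.mean()) / numeric.std()
print((z.abs() > 3).sum())
```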

Visualization

Statistical pre-processing produces numeric values that don’t portray much insight on their own. A graph gives far better insight into the distribution, frequencies, and other parameters of interest. Useful graphs include bar plots, box plots, line graphs, and correlation plots.
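As an illustrative sketch with pandas and matplotlib (again assuming a hypothetical data.csv), a few of these plots can be drawn side by side:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")  # hypothetical path
numeric = df.select_dtypes(include="number")

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram of the first numeric attribute to see its distribution
numeric.iloc[:, 0].plot.hist(ax=axes[0], title="Distribution")

# Box plots to spot spread and outliers at a glance
numeric.plot.box(ax=axes[1], title="Box plots")

# Correlation matrix rendered as a heat map
im = axes[2].imshow(numeric.corr(), cmap="coolwarm", vmin=-1, vmax=1)
axes[2].set_title("Correlation")
fig.colorbar(im, ax=axes[2])

plt.tight_layout()
plt.show()
```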

Data transformation

Sometimes an attribute’s data is very shabby, and a transformation such as normalization, a Box-Cox transformation, or another power transformation can bring out more insight. This step’s place in the sequence is not cemented: it can come after the pre-processing step or after the visualization step, depending on the person. If one is sure of the transformation needed right after pre-processing, the transformation is done before visualization; otherwise it is done after it.

After the transformation is done, the data needs to be visualized again to see if any more insights can be drawn from it.
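Here is a small sketch of the transformations mentioned, using scipy and scikit-learn on synthetic skewed data rather than any particular dataset:

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import MinMaxScaler

# Synthetic, strongly right-skewed attribute standing in for a real column
x = np.random.lognormal(mean=0.0, sigma=1.0, size=1000)

# Normalization: rescale to the [0, 1] range
x_norm = MinMaxScaler().fit_transform(x.reshape(-1, 1))

# Box-Cox power transformation (requires strictly positive values)
x_boxcox, lam = stats.boxcox(x)

print(f"skewness before: {stats.skew(x):.2f}, after Box-Cox: {stats.skew(x_boxcox):.2f}")
print(f"fitted lambda: {lam:.2f}")
```

One caveat worth knowing: Box-Cox only works on strictly positive values; for data containing zeros or negatives, the Yeo-Johnson transformation is the usual alternative.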

Algorithms

Now that all the cutting and snipping of the data is done, algorithms are used to gain insight from it. Here, it is always advisable to build a base model against which further tuning or other algorithms can be compared. Many well-documented algorithms exist whose validity and viability across various applications have been proven many times.
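To make the base-model idea concrete, here is a sketch using scikit-learn and the Iris dataset mentioned at the end of this article; the specific models are my own illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Base model: always predicts the most frequent class; this is the floor
# any real model must beat to be worth keeping
baseline = DummyClassifier(strategy="most_frequent")
print("baseline:", cross_val_score(baseline, X, y, cv=5).mean())

# A documented, well-proven algorithm as a first serious attempt
model = LogisticRegression(max_iter=1000)
print("logistic regression:", cross_val_score(model, X, y, cv=5).mean())
```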

Tuning of parameters

Each algorithm has certain parameters that can be tuned to finer levels to improve the output. However, not too much time should be spent on this, because it is a decisive step: if the parameters are over-tuned, the model often overfits the data, i.e. it works well on one set of data but does not generalize to other, unseen data.

The tuning should be done keeping in mind the effect it has not only on the metric we choose to maximize or minimize, but also on a few other dependent parameters. A graphical approach is the best one here.
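As a sketch of restrained tuning with scikit-learn’s GridSearchCV (the model and the deliberately small grid are illustrative choices, not prescriptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# A small, coarse grid: exhaustive fine-grained grids invite overfitting
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1]}

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated score:", search.best_score_)

# search.cv_results_ holds the score for every grid point; plotting it
# is one way to take the graphical approach mentioned above
```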

Results

One important step forgotten by many is communication. The customer is interested in seeing the results and doesn’t care about the code. Communicating the results in proper documentation is therefore essential for making a profit and attracting clients.

Now, we are all set to launch the data science journey. Plenty of theory and theoretical approaches are already available on the web. What I intend to do instead is a series of case studies, each carrying certain insights a data scientist should have. To start, I will work on the Iris dataset in both R and Python, as these are the two most widely used programming languages for machine learning.
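For readers who want to follow along in Python, the Iris dataset ships with scikit-learn, so a minimal starting point might be:

```python
from sklearn.datasets import load_iris

# Load Iris as a pandas DataFrame; no download needed
iris = load_iris(as_frame=True)
df = iris.frame

print(df.head())                      # see the data, as step one advises
print(df["target"].value_counts())   # check the class balance
```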

Feel free to follow along, and if you prefer watching videos to reading, I have covered these topics and many more in R here. Feel free to like, share, and subscribe.
