Jump into Data Science

MONIKA RAJPUT
Published in Analytics Vidhya
Apr 12, 2020

After reading this article you will have a basic understanding of the Data Science process. 🤩

The first and most important thing you need to get started in the Data Science field is the ability to understand data: how to turn data into information, and information into insight. The best data scientists never underestimate the data; the whole of data science depends on that big “D”, and things get done only if the data we gather can inform and inspire those in a position to make a difference.

The data science process works much like the way magicians manipulate the human mind. Magicians have tested human perception for thousands of years, and only after that practice can they do magic. Magic is not really about the mechanics of your senses; it is about understanding, and then manipulating, how viewers digest sensory information. Now you have a pretty good idea of what you are going to do with data. Excited? I know you are.

In this article I will discuss the three main steps of the Data Science process and the tools we need to apply to data:

1. Cleaning Data

It is most important to be able to deal with messy data. Data cleaning is the process of detecting and correcting inaccurate records in a dataset. A data scientist mostly deals with:

  • Missing Data
  • Outliers
  • Inconsistent Data
  • Unnecessary Data

Cleaning data may be time-consuming, but plenty of tools have cropped up to make this crucial duty a little more bearable. Python offers many libraries for making data orderly and legible, such as Dora, datacleaner, PrettyPandas, tabulate, scrubadub, Arrow, ftfy and many others.
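As a quick illustration, here is a minimal cleaning sketch using plain pandas (not one of the libraries listed above); the file name sales.csv and the price and city columns are hypothetical, used only to show the four kinds of problems above.

```python
import pandas as pd

# Hypothetical file and column names, used only for illustration.
df = pd.read_csv("sales.csv")

# Missing data: drop rows that are entirely empty, fill the rest.
df = df.dropna(how="all")
df["price"] = df["price"].fillna(df["price"].median())

# Outliers: keep prices within 3 standard deviations of the mean.
mean, std = df["price"].mean(), df["price"].std()
df = df[(df["price"] - mean).abs() <= 3 * std]

# Inconsistent data: normalise text so "Delhi", "delhi " and "DELHI" match.
df["city"] = df["city"].str.strip().str.title()

# Unnecessary data: drop exact duplicate rows.
df = df.drop_duplicates()
```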

2. Visualisation of Data

This is the data analysis stage, where you make lots of plots of the data in an attempt to understand it (plotting is also another place where spreadsheets start lagging behind). Through this process, a data scientist tries to craft a story, explaining the data in a way that is easy to communicate and easy to act on.

Top 3 Python visualisation libraries you should know

1. Matplotlib

Matplotlib is the most popular Python data visualisation library. It is a 2D plotting library that provides an interactive environment across platforms and is very versatile: with just a few lines of code, you can generate line plots, bar charts, histograms, scatter plots, error charts, pie charts and many other types.
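For example, a minimal Matplotlib sketch that draws a scatter plot and a histogram side by side from randomly generated data:

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data only: 100 random points.
x = np.random.rand(100)
y = np.random.rand(100)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x, y)       # scatter plot
ax1.set_title("Scatter plot")
ax2.hist(x, bins=20)    # histogram
ax2.set_title("Histogram")
plt.tight_layout()
plt.show()
```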

2. Plotly

Plotly is a web-based toolkit for building data visualisations. It has a great API, including one for Python, with unique functionality such as contour plots, dendrograms and 3D charts, as well as visualisations like scatter plots, line charts, bar charts, multiple axes and many others.
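A minimal sketch with Plotly's Python API (the plotly.express module), using the iris sample dataset that ships with the library:

```python
import plotly.express as px

# Sample dataset bundled with Plotly, used here only for illustration.
df = px.data.iris()

# Interactive scatter plot, coloured by species.
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.show()
```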

3. Seaborn

Seaborn provides an API on top of Matplotlib that offers sane choices for plot style and colour defaults, defines simple high-level functions for common statistical plot types, and integrates with the functionality provided by pandas DataFrames.
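A minimal Seaborn sketch using the tips sample dataset that ships with the library; one high-level call produces a styled statistical plot:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Sample dataset bundled with seaborn, used here only for illustration.
tips = sns.load_dataset("tips")

# One high-level call draws a scatter plot split by a categorical column.
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="smoker")
plt.show()
```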

3. Model Building

This is the part where the magic happens. In this step the actual model building process starts. Here, the data scientist splits the dataset into training and testing sets. Techniques like association, classification and clustering are applied to the training set, and the prepared model is then evaluated against the “testing” set.
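A minimal sketch of the train/test split using scikit-learn; the iris dataset and the 80/20 split are assumptions chosen for illustration, not part of the original article:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Built-in sample dataset; hold out 20% of the rows for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```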

There are three main types of ML algorithms that will help you build a model for your data: supervised learning, unsupervised learning and reinforcement learning.

1. Supervised learning

Supervised learning is learning in which we teach or train the machine using well-labelled data, meaning each example is already tagged with the correct answer. The algorithm analyses that labelled training data and, when given a new set of data, uses what it has learned to produce the correct outcome.
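A minimal supervised-learning sketch with scikit-learn; the decision tree classifier and the iris dataset are my own choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labelled data: each flower comes with its correct species (the "answer").
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train on the labelled examples, then predict answers for unseen data.
model = DecisionTreeClassifier().fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```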

2. Unsupervised learning

Unsupervised learning is the training of a machine using information that is neither classified nor labelled, allowing the algorithm to act on that information without guidance. It is like teaching a newborn baby: the task of the machine is to group unsorted information according to similarities, patterns and differences, without any prior labels.
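A minimal unsupervised-learning sketch using k-means clustering from scikit-learn; note that no labels are passed to the algorithm, and the choice of three clusters is an assumption for the example:

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

# Only the raw measurements are used; the labels are deliberately ignored.
X, _ = load_iris(return_X_y=True)

# Group the samples into 3 clusters based on similarity alone.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])  # cluster assigned to the first 10 samples
```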

3. Reinforcement learning

Reinforcement learning is an area of Machine Learning. It is about taking suitable action to maximise reward in a particular situation. It is employed by various software and machines to find the best possible behaviour or path it should take in a specific situation.
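As a rough illustration of the reward-maximising idea, here is a minimal epsilon-greedy multi-armed bandit sketch in plain Python; the arm probabilities and the exploration rate are made up for the example:

```python
import random

# Toy multi-armed bandit: each "arm" pays a reward with a hidden probability.
true_win_prob = [0.2, 0.5, 0.8]  # made-up values for illustration
estimates = [0.0] * 3
counts = [0] * 3
epsilon = 0.1  # explore 10% of the time

for _ in range(1000):
    # Epsilon-greedy: usually pick the arm that currently looks best,
    # occasionally explore a random arm.
    if random.random() < epsilon:
        arm = random.randrange(3)
    else:
        arm = max(range(3), key=lambda a: estimates[a])
    reward = 1.0 if random.random() < true_win_prob[arm] else 0.0
    counts[arm] += 1
    # Incremental average update of the reward estimate for this arm.
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print("estimated win rates:", [round(e, 2) for e in estimates])
```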

Closure

You can best learn data mining and data science by doing, so start analysing data as soon as you can! However, don't forget to learn the theory, since you need a good statistical and machine learning foundation to understand what you are doing and to find real nuggets of value in the noise of data.

Feel free to send me a message if you have any feedback.

Good Luck ; )
