Introduction to Data Science

Published in FACE | Amrita Bangalore, Feb 20, 2021 (7 min read)

What is Data Science?

The textbook answer goes something like: “Data Science is a blend of various tools, algorithms, and machine learning principles with the goal of discovering hidden patterns in raw data.” But we’re not writing an exam right now, so let’s try to understand what it actually is rather than memorize a definition.

So here is a question to ponder: “How does the YouTube recommendation system work?”

The answer is data science. YouTube records what each user watches and runs that viewing data through complex algorithms to learn the user’s tastes, which it then uses to recommend content.
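
To make the idea concrete, here is a deliberately tiny sketch of "learn the taste, then recommend". It is not how YouTube actually works; the catalogue, the `recommend` function, and the counting rule are all made up for illustration.

```python
from collections import Counter

# Hypothetical catalogue: video title -> category (made-up data).
CATALOGUE = {
    "Intro to Pandas": "data science",
    "NumPy in 10 Minutes": "data science",
    "Data Cleaning Tips": "data science",
    "Guitar Basics": "music",
    "Piano for Beginners": "music",
    "Street Food Tour": "travel",
}

def recommend(watch_history, catalogue, k=2):
    """Suggest up to k unwatched videos from the viewer's most-watched categories."""
    # "Taste" here is simply how often each category appears in the history.
    taste = Counter(catalogue[title] for title in watch_history)
    unwatched = [title for title in catalogue if title not in watch_history]
    # Rank unwatched videos by how often the viewer watched their category.
    ranked = sorted(unwatched, key=lambda title: taste[catalogue[title]], reverse=True)
    return ranked[:k]

history = ["Intro to Pandas", "Guitar Basics", "NumPy in 10 Minutes"]
suggestions = recommend(history, CATALOGUE)
print(suggestions)
```

Real recommendation systems use far richer signals and models, but the shape is the same: past behaviour goes in, a ranked list comes out.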

In simple terms, data science is the art of using large amounts of data to extract insights, predictions, and patterns.

A data scientist is a cross between a statistician and a software engineer. Blending the two lets us automate what a statistician does and apply it to vast amounts of data.

Why is it so important?

This is the era of Artificial Intelligence and Big Data. A massive data explosion has led to the emergence of new technologies and smarter products. With an estimated 2.5 quintillion bytes of data produced every day, Data Science is in ever-greater demand. Its principal purpose is to find patterns within data, and with this much data available, analyzing it and drawing insights from it has become essential.

Studying raw data in such depth has opened up applications across a wide range of fields, and Data Science has changed almost every industry one can think of. A few of its practical applications are identifying and predicting diseases, recommendation systems, natural language processing, fraud and risk detection, speech recognition, advanced image recognition and processing, targeted advertising, and airline route planning. The list goes on.

As we live in a data-driven world, the data science industry is very much on the rise. The simulation of human intelligence in machines is booming, and with data science as its backbone, the importance of the field is clear.

Common Programming languages used:

Python, R, SQL, Julia, Java, and Scala are among the most widely used languages for complex data science projects involving high-volume data sets. MATLAB is used for statistical analysis and numerical computation, as is TensorFlow, which is a library rather than a language. Python and R are the most popular of the lot, and being well versed in either gives you a solid foundation as a data scientist.

The basic procedure of data science:

1. Discovery:

The Discovery step involves acquiring data from all the identified internal & external sources.

The data can be:

Logs from web servers
Data gathered from social media
Census datasets
Data streamed from online sources using APIs

2. Preparation:

Data can have many inconsistencies, such as missing values, blank columns, and incorrect data formats, that need to be cleaned. You need to process, explore, and condition the data before modeling.
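
A minimal sketch of what this cleaning can look like with pandas. The table and its column names (`age`, `signup_date`, `notes`) are made up for illustration, and filling gaps with the median is just one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Made-up raw data with typical problems: a missing value,
# a completely blank column, and numbers/dates stored as text.
raw = pd.DataFrame({
    "age": ["25", "31", None, "42"],
    "signup_date": ["2021-01-05", "2021-02-10", "2021-02-11", "2021-03-01"],
    "notes": [np.nan, np.nan, np.nan, np.nan],
})

# Drop columns that contain no data at all.
clean = raw.dropna(axis="columns", how="all")

# Fix incorrect formats: text -> numbers, text -> dates.
clean = clean.assign(
    age=pd.to_numeric(clean["age"]),
    signup_date=pd.to_datetime(clean["signup_date"]),
)

# Fill the remaining missing age with the median of the known ages.
clean["age"] = clean["age"].fillna(clean["age"].median())
print(clean)
```

Whether to fill, drop, or flag missing values depends on the dataset; the point is that these decisions are made before any model sees the data.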

3. Model Planning:

In this stage, you determine the methods and techniques for modeling the relationships between input variables. Model planning relies on statistical formulas and visualization tools. SQL Server Analysis Services, R, and SAS/ACCESS are some of the tools used for this purpose.
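
One simple way to explore relationships between variables at this stage is a correlation matrix. The sketch below uses pandas (not one of the tools named above) and a made-up table of study habits, purely as an illustration.

```python
import pandas as pd

# Made-up data: hours studied, hours slept, and exam score.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "hours_slept":   [8, 7, 7, 6, 6, 5],
    "exam_score":    [52, 58, 65, 70, 78, 85],
})

# Pearson correlation between every pair of columns (values lie in [-1, 1]).
corr = df.corr()
print(corr.round(2))
```

Values near 1 or -1 suggest a strong linear relationship worth modeling; values near 0 suggest the variable may add little on its own.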

4. Model Building:

In this step, the actual model building starts. The data scientist splits the dataset into training and testing sets. Techniques such as association, classification, and clustering are applied to the training set, and the resulting model is then evaluated against the testing set.
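
Here is a small sketch of the train/test workflow using scikit-learn. The choice of the built-in iris dataset, a decision tree classifier, and a 75/25 split are assumptions made for the example, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows for testing; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a classifier on the training set only.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Evaluate on the held-out testing set.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Scoring on data the model has not seen gives a more honest estimate of how it will behave once deployed.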

5. Operationalize:

In this stage, you deliver the final baselined model along with reports, code, and technical documents. The model is deployed into a real-time production environment after thorough testing.

6. Communicate Results:

In this stage, the key findings are communicated to all stakeholders. This helps you decide whether the project is a success or a failure, based on the results from the model.

Popular libraries used in Data Science

Pandas:

Pandas provides data structures, mainly Series and DataFrames, along with data analysis tools that are easy to understand and beginner-friendly. It is an ideal library for quick and easy data manipulation, aggregation, and visualization. Its operations range from indexing, sorting, and merging DataFrames to plotting data as a histogram or bar plot.
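
A short sketch of these operations on made-up fruit sales data; the names `prices`, `sales`, and `report` are illustrative only.

```python
import pandas as pd

# A Series is a single labelled column; a DataFrame is a table of them.
prices = pd.Series([3.0, 1.5, 2.0], index=["apple", "banana", "cherry"], name="price")

sales = pd.DataFrame({
    "fruit": ["apple", "banana", "apple", "cherry", "banana"],
    "units": [10, 4, 6, 8, 5],
})

# Aggregation: total units sold per fruit.
totals = sales.groupby("fruit", as_index=False)["units"].sum()

# Merging: attach the price of each fruit, then compute revenue.
report = totals.merge(prices, left_on="fruit", right_index=True)
report["revenue"] = report["units"] * report["price"]

# Sorting: highest revenue first.
report = report.sort_values("revenue", ascending=False).reset_index(drop=True)
print(report)
```

Calling `report.plot.bar(x="fruit", y="revenue")` would turn the same table into a bar plot, provided Matplotlib is installed.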

Matplotlib:

Matplotlib is a very popular, quintessential library for creating 2-dimensional visualizations. It provides an object-oriented API for embedding plots into applications.

Matplotlib offers a wide variety of visualizations, such as scatter plots, histograms, bar plots, and pie plots, with custom formatting for each kind of plot.
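
As a rough illustration of the object-oriented style, the sketch below draws a scatter plot and a histogram side by side from randomly generated data. The output file name and the styling choices are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to open plot windows instead
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: y depends on x plus some noise.
rng = np.random.default_rng(seed=0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

# Object-oriented style: work with explicit Figure and Axes objects.
fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(9, 4))

ax_scatter.scatter(x, y, s=12, color="tab:blue", alpha=0.7)
ax_scatter.set_title("Scatter plot")
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")

ax_hist.hist(x, bins=20, color="tab:orange", edgecolor="black")
ax_hist.set_title("Histogram of x")

fig.tight_layout()
fig.savefig("matplotlib_demo.png")
```

Each `Axes` object carries its own title, labels, and colors, which is what makes per-plot formatting straightforward.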

NumPy:

NumPy is one of the most fundamental array-processing libraries. It provides efficient tools for working with arrays and homogeneous multidimensional data in general, which means every element of an array shares the same datatype. NumPy makes mathematical operations on arrays, and their vectorization, simpler. It supports basic array operations such as add, multiply, slice, and reshape, as well as advanced ones such as stacking and splitting.
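
The operations named above look roughly like this in practice; the variable names are illustrative.

```python
import numpy as np

a = np.arange(6)                # one-dimensional array of 0..5, all the same dtype
m = a.reshape(2, 3)             # reshape into 2 rows x 3 columns

doubled = m * 2                 # vectorized multiply: no explicit loop
shifted = m + 10                # vectorized add
first_col = m[:, 0]             # slice: first column of every row

stacked = np.vstack([m, doubled])    # stacking: 4 rows x 3 columns
top, bottom = np.split(stacked, 2)   # splitting: back into two 2 x 3 arrays

print(stacked)
```

Because the arithmetic is applied to whole arrays at once, the same code works unchanged on arrays with millions of elements.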

Scikit-Learn:

Scikit-learn is a robust machine learning library for Python. It provides ML algorithms such as SVMs, random forests, k-means clustering, spectral clustering, and mean shift, along with model-evaluation tools such as cross-validation. Scikit-learn is built on NumPy and SciPy and works well with the rest of the SciPy stack. Through a clean, consistent Python interface, it offers a variety of supervised and unsupervised learning algorithms.
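
A brief sketch showing one supervised and one unsupervised algorithm through the same interface. The iris dataset, the default SVM settings, and the choice of three clusters are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Supervised: 5-fold cross-validated accuracy of a support vector machine.
scores = cross_val_score(SVC(), X, y, cv=5)
print(f"SVM mean accuracy: {scores.mean():.2f}")

# Unsupervised: group the same rows into 3 clusters without using the labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)
print("Cluster sizes:", np.bincount(cluster_ids))
```

Most scikit-learn estimators follow this pattern of `fit`, `predict`, and `score`, so swapping one algorithm for another usually takes a one-line change.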

TensorFlow:

TensorFlow is an AI library that lets developers use data flow graphs to build large-scale neural networks with many layers. It supports the creation of Deep Learning models, helps push the state of the art in ML/AI, and enables ML-powered applications to be deployed quickly. TensorFlow is well suited to classifying, interpreting, learning from, discovering, forecasting, and generating data. It is also one of the most actively developed open-source libraries, and it is used by giants such as Google, Coca-Cola, Airbnb, Twitter, Intel, and DeepMind.

Popular Visualizations used in Data Science

Data Visualization is one of the key aspects of Data Science because it helps the audience understand the data. Charts and graphs paint a clearer picture of the data than raw Excel sheets or data frames.

The most popular visualizations used in Data Science are:

Bar Chart

A Bar Chart consists of a series of bars whose lengths represent the value of each category. It works well for variables that take a fixed number of categories. It is the go-to visualization for comparing one or two variables across categories or over a stipulated period.

Pie Chart

A Pie Chart is a circular graph divided into slices. It is generally used to display the proportion that each category contributes to the whole. Like the bar chart, it works well for variables with a fixed number of categories.
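
The bar chart and pie chart described above can be sketched with Matplotlib as follows. The survey numbers are made up, and the same data is drawn both ways to show the difference in emphasis.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to open plot windows instead
import matplotlib.pyplot as plt

# Made-up survey: favourite language among 100 respondents.
languages = ["Python", "R", "SQL", "Julia"]
votes = [55, 25, 15, 5]

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(9, 4))

# Bar chart: one bar per category, height equals the value.
ax_bar.bar(languages, votes, color="tab:green")
ax_bar.set_ylabel("Votes")
ax_bar.set_title("Bar chart: value per category")

# Pie chart: each slice is a category's share of the total.
ax_pie.pie(votes, labels=languages, autopct="%1.0f%%")
ax_pie.set_title("Pie chart: share of the whole")

fig.tight_layout()
fig.savefig("bar_and_pie.png")
```

The bar chart makes it easy to compare absolute values, while the pie chart emphasizes each category's proportion.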

Line Graph

A Line Graph consists of one or more lines in a single plot. Each line shows how a particular variable develops, or how one variable grows (or falls) with respect to another. This is a continuous plot.

Scatter Plot

A Scatter Plot consists of multiple data points plotted in a 2-D frame. It works only with numerical data. It can be thought of as a discontinuous form of the line graph.

Timeseries Plot

A Timeseries Plot behaves like a line graph or a scatter plot in which the x-axis is always time (in years, months, or any other unit of time). It is very useful in analyses where the development of a variable with respect to time must be observed.
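
A minimal time series sketch using pandas and Matplotlib. The monthly visitor counts are synthetic, generated from a simple upward trend plus random noise.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to open plot windows instead
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic monthly data for one year: upward trend plus noise.
months = pd.date_range("2020-01-01", periods=12, freq="MS")
rng = np.random.default_rng(seed=1)
values = 1000 + 50 * np.arange(12) + rng.normal(0, 30, size=12)
visitors = pd.Series(values, index=months)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(visitors.index, visitors.values, marker="o")  # line with a marker at each month
ax.set_xlabel("Month")
ax.set_ylabel("Visitors")
ax.set_title("Monthly visitors (synthetic data)")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
fig.savefig("timeseries.png")
```

Using a date index on the x-axis lets Matplotlib space and label the points by actual time rather than by position.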