Introduction to Data Science in Python

Sanjeev Lingam-Nattamai
Deep Dives with Data
Jul 29, 2020 · 9 min read

What is data science? From Wikipedia, data science is an “interdisciplinary field focused on extracting knowledge from data sets, which are typically large. The field encompasses analysis, preparing data for analysis, and presenting findings to inform high-level decisions in an organization.”

Given the continued accelerated growth of the Internet of Things (IoT) in recent years, there has been a consistent increase in the amount of data generated on the internet. Not only that, but the number of internet users in the world at the end of June 2020 was more than 4.8 billion, up from 2.4 billion in 2014.

This means that it is becoming increasingly important to retrieve, analyze, and utilize big data for solving problems. In this introduction to data science, I want to highlight how to solve problems using the data science process of:

  • Obtaining data
  • Scrubbing data
  • Exploring data
  • Modeling data
  • Interpreting data

I’m going to introduce different data cleaning, analysis, and visualization libraries and packages in Python. If you are new to Python, I recommend reading the article I previously wrote on basic Python concepts and syntax.

It’s important to note that Python isn’t the only language that can solve data science tasks. However, it is one of the easiest and most effective programming languages for all of the tasks I have laid out above.

Step 0: Developing a Question

As with any problem, it’s important to first pose a question. For example, types of data science questions include:

  • Predicting optimal flight ticket prices for an airline
  • Projecting the impact of the COVID-19 pandemic on a bakery’s sales next month
  • Determining the effect of advertising with coupons and discounts for retail companies

Each of these questions requires analysis of applicable big datasets in order to produce a final output. After a question is developed, the next task is creating the framework of the problem and identifying the different parts of the question that need to be solved. But before any of that can begin, the first step is obtaining the data.

Step 1: Obtain data

So, if we already have an idea of what data science is, what is data?

At its core, data is quite simply a collection of entities and their attributes. Typically, data comes in two forms: structured and unstructured.

Structured data includes any form of tabular data as well as relational databases, with formats such as CSV, JSON, HTML, and SQL. In contrast, unstructured data is text, videos, photos, audio, etc. About 80% of the world’s data is unstructured, and this type of data is harder to work with because it is harder to find patterns in it. The image below summarizes a couple more differences between structured and unstructured data.

[Image: structured vs. unstructured data comparison, from Lawtomated]

In terms of collecting data for solving a problem, there are a couple of different ways to do so.

The first is to download pre-made data files. If you’re looking for datasets to explore, Kaggle is a community of data scientists and machine learning practitioners that hosts thousands of public datasets, so there will definitely be a dataset that interests you.

The second is to generate your own data, whether by creating a file and entering values yourself or by collecting data programmatically. An example of the latter is scraping data from web pages using the Python web scraping library Beautiful Soup.
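
As a rough sketch of what that might look like, the snippet below fetches a page and pulls out its headings (the URL and the tag being extracted are placeholders, not a real data source):

```python
import requests
from bs4 import BeautifulSoup

# Fetch a page and parse its HTML (the URL is a placeholder).
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Extract the text of every <h2> heading on the page.
headings = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(headings)
```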

Step 2: Scrub Data

Once you have collected your data, more often than not, the data needs to be cleaned so that it is easier to work with. Examples of data cleaning are:

  • Handling missing values
  • Fixing format issues
  • Removing unnecessary words or punctuation

Pandas

Pandas is a Python data science library that provides essential data cleaning capabilities. For the example below, I have pre-downloaded the Iris dataset. While the Iris dataset is already clean, I wanted to demonstrate the different data cleaning methods available on a Pandas data frame.

When dealing with missing values in a data set, Pandas handles those cases by either replacing those values or removing the respective rows/columns.
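
A minimal sketch of both approaches, assuming the Iris dataset has been saved locally as iris.csv:

```python
import pandas as pd

# Load the pre-downloaded Iris dataset (the file name is an assumption).
df = pd.read_csv("iris.csv")

# Count missing values in each column.
print(df.isnull().sum())

# Option 1: remove any rows that contain missing values.
cleaned = df.dropna()

# Option 2: replace missing numeric values with each column's mean instead.
filled = df.fillna(df.mean(numeric_only=True))
```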

Regular Expressions

A really good way to fix string format issues and remove unnecessary information is to use regular expressions. A regular expression is a sequence of characters that defines a search pattern. Here is an example which lists out different regular expression functions and their functionality:
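
As a sketch of a few of the core functions from Python’s re module, ending with the Pandas replace() call discussed next (the sample string is made up for illustration):

```python
import re
import pandas as pd

text = "sepal_length: 5.1, sepal_width: 3.5"

# re.search finds the first match of a pattern in a string.
print(re.search(r"\d\.\d", text).group())  # '5.1'

# re.findall returns every match as a list.
print(re.findall(r"\d\.\d", text))         # ['5.1', '3.5']

# re.sub replaces every match with a new substring.
print(re.sub(r"_", " ", text))             # 'sepal length: 5.1, sepal width: 3.5'

# Pandas replace() (not a regular expression) swaps values in a data frame.
df = pd.DataFrame({"species": ["setosa", "versicolor", "setosa"]})
df["species"] = df["species"].replace("setosa", "s")
print(df)
```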

The last example isn’t a regular expression; it uses the Pandas replace() function, which replaces values in a data frame. In the example, I used replace() to replace the ‘setosa’ species with ‘s’.

Overall, regular expressions are typically very useful in text processing tasks like web scraping and simple parsing. A useful tool that will help find specific patterns to match a string is this website. You can type in your regular expression and test string in order to identify if your regular expression matches any part of that string or not. Another useful tool that will help is this cheat sheet which contains all of the different symbols, ranges, grouping, and assertions that are used when constructing a regular expression.

As a whole, data scrubbing is an important part of the data science process because the data can then be properly analyzed in the next step.

Step 3: Explore Data

After the data is cleaned, preprocessed, and easy to use, it is time for computation and analysis. The exploration stage first creates a broad picture of the important trends and major features in the dataset, then digs deeper for key nuggets of information. Here are three of the more common data exploration packages in Python.

Numpy

Numpy is a linear algebra library that is widely used because of Numpy arrays, which come in two forms:

  • Vectors: 1-dimensional arrays
  • Matrices: 2-dimensional arrays

In the example below, I displayed the basic Numpy matrix functionality.
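
A minimal sketch of that basic functionality:

```python
import numpy as np

# A vector is a 1-dimensional array; a matrix is a 2-dimensional array.
vector = np.array([1, 2, 3])
matrix = np.array([[1, 2], [3, 4]])

print(matrix.shape)           # (2, 2)
print(matrix.T)               # transpose
print(matrix @ matrix)        # matrix multiplication
print(np.linalg.inv(matrix))  # matrix inverse
```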

However, Numpy plays a much bigger role as a data science tool than just basic calculations. Matrices are often used to solve the complex computations needed to retain meaningful properties of the data, as in principal component analysis (PCA) and other dimensionality reduction techniques. Numpy is most useful to learn in order to understand the basic structure of vectors and matrices as Numpy arrays and then use them to solve more complex problems.

Pandas

While Pandas is great for data cleaning, it is an even better tool for data exploration. Here is an example:
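
As a sketch of that versatility, here are some common exploration calls, again assuming the data lives in iris.csv and that the columns include sepal_length and species (the column names are an assumption):

```python
import pandas as pd

df = pd.read_csv("iris.csv")

# Summary statistics for every numeric column.
print(df.describe())

# Group by species and compare average measurements across classes.
print(df.groupby("species").mean())

# Filter rows: only the flowers with a sepal length above 5 cm.
print(df[df["sepal_length"] > 5.0].head())
```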

Out of all of the Python data science libraries and packages, Pandas is one of the most versatile for both data cleaning and data exploration. The simplicity of the data frame structure allows users to easily access values in order to make further calculations.

Scikit-Learn

Scikit-Learn is a machine learning library that has efficient tools for predictive data analysis. Some of the types of problems that it can help solve are:

  • Classification: identifying which category or class label an object belongs to
  • Regression: predicting a continuous-valued attribute associated with an object
  • Clustering: grouping similar objects into sets

Classification and regression problems are different types of supervised learning, whereas clustering problems are a type of unsupervised learning.

Here is an example of classification, regression, and clustering problems:
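
What that might look like in code, using the Iris dataset and a few illustrative model choices (logistic regression, linear regression, and k-means are my picks here, not the only options):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Classification: predict the species label of each flower.
clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Regression: predict petal width from the other three measurements.
reg = LinearRegression().fit(X[:, :3], X[:, 3])
print("regression R^2:", reg.score(X[:, :3], X[:, 3]))

# Clustering: group the flowers into three clusters without using labels.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("first ten cluster labels:", km.labels_[:10])
```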

If your problem requires predictive functionality, then Scikit-Learn is a great library to utilize. Furthermore, along with the different problem types that Scikit-Learn can help solve, it can also be used for dimensionality reduction and model selection.

Overall, data exploration plays a pivotal role in the data science process because, whatever the complexity of the problem at hand, there is a wide range of libraries and other tools that can help solve it.

Step 4: Model Data

While the data analysis is important, it is just as important to communicate that analysis through a visualization.

So, what are the qualities of a good visualization?

In The Truthful Art, Alberto Cairo introduces the pillars of a good data visualization, describing five qualities that should be the foundation of every visualization:

  1. Truthful: Is this visualization based on honest research that minimizes bias as much as possible?
  2. Functional: Is the information and data in the visualization accurate so that other people can take action based on it?
  3. Beautiful: Is it aesthetically pleasing?
  4. Insightful: Does the visualization allow us to see a trend or pattern that we might not otherwise have seen?
  5. Enlightening: Do people learn from the visualization, or does it change people’s viewpoints on the issue at hand?

When creating a visualization, it is really important to check that these five qualities apply to your visualization so that it can be as effective as possible. Python has many different plotting packages that allow you to make great visualizations; here are two of the most effective.

Matplotlib

Matplotlib is one of the standard plotting libraries in Python, allowing users to plot simple line graphs, scatter plots, bar graphs, histograms, etc. Here is an example of Matplotlib functionality:
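
For instance, a simple line graph with an overlaid scatter plot (the data below is synthetic, generated just for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# Plot a line graph and overlay a scatter plot of sampled points.
plt.plot(x, np.sin(x), label="sin(x)")
plt.scatter(x[::10], np.cos(x[::10]), label="cos(x) samples")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Basic Matplotlib functionality")
plt.legend()
plt.show()
```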

While Matplotlib is easy to use, its default plots are rarely aesthetically pleasing or engaging. The next library addresses that issue.

Plotnine

Plotnine is an implementation of the grammar of graphics in Python and it is based on ggplot2 in R. The grammar allows users to make plots by mapping data to visual objects.
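
A minimal sketch of an overlaid density plot on the Iris data, assuming the same iris.csv file and column names as in the earlier examples:

```python
import pandas as pd
from plotnine import ggplot, aes, geom_density, labs

df = pd.read_csv("iris.csv")

# Map sepal length to the x-axis and species to the fill color,
# then overlay one semi-transparent density curve per species.
plot = (
    ggplot(df, aes(x="sepal_length", fill="species"))
    + geom_density(alpha=0.5)
    + labs(title="Sepal length by species")
)
plot.save("iris_density.png")
```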

Plotnine is a more effective plotting library than Matplotlib because the quality of the plots is much higher, making people far more likely to spend time looking at the visualization, like the overlay of the density plots in the example above.

While Matplotlib and Plotnine are great, there are many other data visualization packages out there as well!

  • Seaborn is a statistical graphics library that is built on top of Matplotlib and works closely with Pandas data structures
  • Plotly is a graphic library that makes interactive plots and can produce unique functionalities that are visually appealing
  • Altair is a declarative statistical visualization library where you are declaring links between data columns and visual encoding channels like the x-axis, y-axis, color, etc.

Ultimately, choose whatever plotting library works best for you and presents your data effectively. However, it’s important to remember that data visualizations need to convey a lot of information in a small amount of space. The more accurate, eye-catching, and colorful a visualization is, the more intrigue it will generate and the more likely people are to study its fine details.

Step 5: Interpret Data

After completing the first four steps of the process, the final step is interpreting the data, which involves taking all of the analysis and visualizations and identifying a solution to the problem.

In the corporate setting, this could represent a company realizing that spending more on marketing could lead to more consumers and higher profits. In the research world, this step could be the process of writing a publication on how a research team found a new discovery in their respective field.

However, solving a problem is rarely as simple as it first seems. Questions and challenges often emerge along the way that may require switching to a different approach.

You may also need to pivot and collect more data, which means going back to Step 1. In other words, the entire data science process is a cycle that may be iterated through multiple times in order to reach a final answer to the overall problem.

Summary

Here is what we learned about the five steps of the data science process:

  • Obtain data: Collecting or generating raw data
  • Scrub data: Cleaning and preprocessing the data for analysis
  • Explore data: Computing and analyzing the different attributes of the data
  • Model data: Visualizing the data using informative models and graphs
  • Interpret data: Taking action with the data that is represented in the model

As I mentioned earlier, while these steps seem like a simple linear flow, solving complex questions might require multiple passes through this data science cycle.

If you are interested in learning more about data science, I highly recommend checking out this Introduction to Data Science specialization from IBM on Coursera. It provides a deeper dive into data science, along with lectures on other data science languages like R and SQL.
