What Should You Know in Python for Exploratory Data Analysis?

Rajas Khokle
Data Science Concise
4 min read · Dec 16, 2018

Python is hands down the most popular language in Data Science. While R is often used by statisticians and is known for its great visualisations, Python is more general purpose and is a must-have tool in the arsenal of a data scientist. Here is the short version of what you should know in Python to conduct basic Exploratory Data Analysis. The very convenient Pandas library is used for this purpose. Simple one-dimensional data is encapsulated in an object of type Pandas Series, whereas multidimensional (multi-column) data is stored as a Pandas DataFrame.

Loading data in Python: An external data file can be readily loaded in Python using functions like read_csv or read_excel from the Pandas library.

import pandas as pd
df = pd.read_csv('Bank_churns.csv')

Apart from the filename, it can take many other arguments, like skiprows to start reading at a particular row (useful for skipping junk above the header) and encoding to specify encodings like utf-8 or latin1. Instead of CSV, one can also use other formats like JSON, HTML, HDF, Excel, Stata, SAS, SQL and even the clipboard. The full documentation on file IO can be found here.
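As a minimal sketch of these arguments, the snippet below reads a small in-memory CSV instead of a real file (the column names here are made up for illustration):

```python
import io
import pandas as pd

# A small in-memory CSV standing in for a real file; note the two
# junk lines that appear before the actual header row.
raw = "report generated 2018\nbank data\nid,balance\n1,100\n2,250\n"

# skiprows=2 jumps over the two junk lines so 'id,balance' is the header.
# When reading from disk, encoding='latin1' (or 'utf-8') would also be
# passed if the file is not in the default encoding.
df = pd.read_csv(io.StringIO(raw), skiprows=2)

print(df.columns.tolist())   # the header was picked up correctly
```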

First Data View: Once the data is loaded in the memory, one needs to see what it contains. This is accomplished by one of the following methods.

df.info()                  # Displays information about every column
df.head(10)                # Displays first 10 rows of the data
df.tail(10)                # Displays last 10 rows of the data
df.describe()              # Displays statistics of each numeric column
df.dtypes                  # Displays the data type of each column
df.column_name.nunique()   # Displays the number of unique elements
df.column_name.unique()    # Displays the unique elements in the column
df.shape                   # Displays the rows and columns of df (an attribute, not a method)

From this information, one needs to find and verify the following things.

  1. Which columns are present in the file? In short, what is the data about. Ideally, one should create a data dictionary from this information as a separate file for future reference and auditing.
  2. Are the data types correct according to the column names? For example, if currency is shown as object (string) instead of int or float, then we may need to take a closer look. Either the data may have an actual string in one of the rows, or Python is treating the column as strings because of currency symbols like $. One needs to correct this and ensure that columns have the correct data types.
  3. From the output of the describe() method, note if there is a very high skew in a particular numeric column. It may indicate outliers or anomalous data which may need filtering before analysis.
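The currency case in point 2 can be sketched as follows (the column name and values are hypothetical):

```python
import pandas as pd

# Hypothetical frame where a currency column arrived as strings like '$1,200'.
df = pd.DataFrame({'salary': ['$1,200', '$950', '$2,000']})

# The column starts out as object (string) because of the $ and comma.
# Strip the symbol and the thousands separator, then cast to float.
df['salary'] = (df['salary']
                .str.replace('$', '', regex=False)
                .str.replace(',', '', regex=False)
                .astype(float))
```

After this cleanup, describe() and arithmetic on the column behave as expected.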

Filtering and Cleaning: The data often needs to be filtered by some specific rule set. In SQL, this is achieved by the WHERE clause. In Python, this can be achieved in many ways. One popular way is to define a mask, a Series of Boolean values where True corresponds to the rows for which the provided condition holds. This mask is then passed to the original DataFrame, which returns the rows corresponding to the True values in the mask. Other ways include using regular expressions, and the isnull() and fillna() methods for finding and filling NaN values. Filtering using regular expressions will be discussed in another post.
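A short sketch of the mask technique and of isnull()/fillna(), on a made-up frame:

```python
import numpy as np
import pandas as pd

# Tiny hypothetical dataset with one missing Age value.
df = pd.DataFrame({'Age': [25, np.nan, 40, 31],
                   'Churn': ['Yes', 'No', 'Yes', 'No']})

# The mask is a Boolean Series: True where the condition evaluates True.
mask = df['Age'] > 30

# Passing the mask to the frame keeps only the True rows (WHERE-style filter).
over_30 = df[mask]

# isnull() flags missing entries; fillna() replaces them, here with the mean.
n_missing = df['Age'].isnull().sum()
df['Age'] = df['Age'].fillna(df['Age'].mean())
```

Note that NaN comparisons evaluate to False, so rows with missing Age drop out of the mask automatically.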


Aggregation and pivot tables: To make any basic explorations on data that contains categorical variables, one needs to know how to aggregate the numerical results by category. This is also known as a pivot table in Excel. In SQL, aggregation is done with the GROUP BY clause. In Python, the groupby() method is used. It returns an iterable object, and to access the actual data one can loop through it with a for loop. However, this is not the standard use of groupby(). After calling groupby() on a particular column holding categorical data, one or more numerical columns are accessed through [] (indexing brackets) and an operation like sum() is called. This can be seen in the dummy code below.

x = df.groupby('Location')['Sales'].sum()

First, we group by location, then we access the Sales column, and then perform the summing operation. Thus, groupby() works on the split-apply-combine principle: it first splits the data, then applies the given operation (like sum, mean, etc.) and then combines the results into a single DataFrame. Pandas can group multilevel data and can group along columns as well as rows.
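The single-level and multilevel cases can be sketched on a toy frame (the Location/Product/Sales columns are invented for illustration):

```python
import pandas as pd

# Toy sales data: two locations, two products each.
df = pd.DataFrame({'Location': ['NY', 'NY', 'LA', 'LA'],
                   'Product':  ['A',  'B',  'A',  'B'],
                   'Sales':    [100,  200,  50,   150]})

# Single-level grouping: total sales per location.
per_location = df.groupby('Location')['Sales'].sum()

# Multilevel grouping: Location and Product together form a MultiIndex.
per_loc_prod = df.groupby(['Location', 'Product'])['Sales'].sum()
```

Passing a list of columns to groupby() is what produces the multilevel (MultiIndex) result.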

In short, one needs to know the Data Quality Check, Data Cleaning, Data Filtering and Data Aggregation functions in Python for exploratory data analysis.

An excellent 20 minute read on all these functionalities can be found here.

Happy Analysing Folks!
