Understanding the Data with Pandas

One step towards Data Science

Bhanu
Analytics Vidhya
5 min read · Aug 19, 2020


The main objective of this blog is to help you see how simple Pandas is. It is one of the most popular Python libraries for Machine Learning and AI.

Photo by Markus Spiske on Unsplash

If you want to read, process, and quickly understand tabular data, I would say Pandas is the one-stop solution.

Pandas provides one of the most important data structures, the DataFrame, which represents data in tabular format (rows and columns). It can handle heterogeneous two-dimensional data (strings, numbers, etc.).

To appreciate pandas better, let’s take a simple data set and walk through the basic commands for understanding the given data and drawing conclusions, so that the next step, deciding how to approach a solution, becomes much clearer. It is all about how well you understand the data. Intuitively, we can treat this as one of the data pre-processing steps performed before applying any machine learning model (many intermediate steps are skipped here).

Command to install pandas for Python 3:

pip3 install pandas

Reading the data:

Once pandas is installed, let’s understand how to load the data into a DataFrame. I have downloaded Haberman’s Survival data set from Kaggle and stored it at the path: Users\HP\Desktop\Data Science\EDA\

We shall use this data set throughout the blog. The data is in CSV format (comma-separated values), so let’s see how to load CSV data with the code below.

Loading the data using read_csv
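A minimal sketch, assuming the file is saved as haberman.csv in the working directory and ships without a header row (the column names below are therefore supplied explicitly):

import pandas as pd

# Supply column names explicitly, assuming the CSV has no header row
columns = ["age", "year", "nodes", "status"]
haberman = pd.read_csv("haberman.csv", names=columns)

print(haberman.shape)  # (306, 4): 306 data points, 4 features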

The above code snippet gives us a brief idea of what our data set looks like. The data set is about cancer survival status, given the features age, year, and number of nodes affected by cancer. The status column represents whether the patient survived (1: survived, 2: did not survive). It has 306 data points with 4 features; this is the shape of the data set, which can also be obtained with haberman.shape.

To see which columns or features are present in the data, we can use haberman.columns, and with haberman.head() we get the top 5 rows of the data by default; passing a number such as head(10) returns that many rows instead.
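For instance, continuing with the haberman DataFrame loaded above:

# List the columns/features present in the data
print(haberman.columns)

# First 5 rows by default; head(10) returns the first 10
print(haberman.head())
print(haberman.head(10))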

With this, we have an idea of the size of the data set, and we know it has two classes (survived, not survived), which can also be obtained with .unique(). So the task is a binary classification problem.
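A quick check, assuming the status column from the sketch above:

# Distinct class labels in the target column
print(haberman["status"].unique())  # array([1, 2]): two classes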

Missing values:

In order to apply any plotting technique or draw conclusions accurately from the data, there shouldn’t be any missing values. If there are, we must fill them, as missing values impact model performance. So how do we do it? One technique is imputation: replace each missing value with the mean of its feature/column. We can similarly use the median or the mode. We can also impute values based on the class label: if a missing value belongs to class 1, take the mean of that feature over class 1 and use it as the replacement.
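A minimal sketch of these imputation strategies, assuming the nodes column contains NaN values:

# Mean imputation: replace NaNs with the column mean
haberman["nodes"] = haberman["nodes"].fillna(haberman["nodes"].mean())

# Median and mode imputation work the same way:
# haberman["nodes"].fillna(haberman["nodes"].median())
# haberman["nodes"].fillna(haberman["nodes"].mode()[0])

# Class-conditional imputation: fill each NaN with the mean of its class
haberman["nodes"] = haberman.groupby("status")["nodes"].transform(
    lambda col: col.fillna(col.mean())
)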

Another technique is model-based imputation, where we treat all rows with missing values as test data, train a model on the rows without missing values, and let the model predict the missing values at test time.

If the data set is sufficiently large and there are only 2 to 3 missing values, we can simply drop those rows. It ultimately depends on the kind of data we are dealing with.
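For example, dropping the handful of incomplete rows is a one-liner:

# Remove every row that contains at least one missing value
haberman = haberman.dropna()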

Let’s go to the code part now.
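A sketch of the missing-value check, assuming the haberman DataFrame from earlier:

# Count non-null entries per column; 306 everywhere means nothing is missing
print(haberman.count())

# Equivalently, count the missing entries directly
print(haberman.isnull().sum())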

The output shows 306 non-null values in every feature, hence there are no missing values.

As the CSV file we have chosen doesn’t have any missing values, I manually removed some values and stored the result in an Excel sheet to help you understand better. In the code below, let’s see how to read an Excel file (.xlsx) and check for missing values.
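A minimal sketch, assuming the modified file is saved as haberman_missing.xlsx (pandas reads .xlsx files through the openpyxl engine, which must be installed):

import pandas as pd

# Load the Excel sheet and count missing values per column
haberman_missing = pd.read_excel("haberman_missing.xlsx")
print(haberman_missing.isnull().sum())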

The nodes column has 3 missing values; we can use the techniques above to replace them.

Class imbalance:

It is very important to check whether the data is well balanced. If it is not, it is usually necessary to balance it: with a strong class imbalance, a model can achieve high accuracy simply by predicting the majority class for every query point, yet it is a dumb model.

We can balance the data with oversampling, where we duplicate minority class points until they equal the majority class in number. This is often the preferred approach (see the code sketch after this list of techniques).

Instead of just repeating minority class points, we can create new artificial points within the region of the minority class. This is the idea behind synthetic oversampling techniques such as SMOTE, which interpolate between existing minority points.

Undersampling: we randomly sample majority class points so that the sample is the same size as the minority class. The drawback is that we do not use the entire data set, and this loss of information can keep the model from working reasonably well.

Weighting approach: we give more weight to minority class points and less weight to majority class points during training.
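Here is a minimal sketch of random over- and undersampling with plain pandas, assuming the class labels 1 (majority) and 2 (minority) from this data set; dedicated libraries such as imbalanced-learn offer more principled implementations:

import pandas as pd

majority = haberman[haberman["status"] == 1]
minority = haberman[haberman["status"] == 2]

# Oversampling: duplicate minority points (sampling with replacement)
# until both classes have the same size
oversampled = pd.concat(
    [majority, minority.sample(len(majority), replace=True, random_state=0)]
)

# Undersampling: randomly shrink the majority class to the minority size
undersampled = pd.concat(
    [majority.sample(len(minority), random_state=0), minority]
)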

The code below explains it: we check the column ‘status’ for the number of data points in each class.
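A sketch, assuming the haberman DataFrame from earlier:

# Number of data points per class
print(haberman["status"].value_counts())
# 1    225
# 2     81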

Clearly the data set is imbalanced (225 vs. 81); we can use the techniques above to balance it.

Deeper insight:

To get deeper insight into the data, we need to know the minimum, maximum, and average values of each feature. To get such a summary, we can use the code below.
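A one-line sketch, assuming the haberman DataFrame from earlier:

# Count, mean, std, min, quartiles, and max for every numeric column
print(haberman.describe())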

This gives information such as the range of patients’ ages, the range of years, and so on, which is clear from the output.

If we do not want the whole summary and only need specific statistics such as count, mean, and so on, we can simply use DataFrame.count(), DataFrame.max(), DataFrame.min(), etc., as required. In this blog we read the data into a DataFrame named haberman, so we would use haberman[“age”].min() and so on. Please modify the code accordingly to get the desired output.
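For example:

# Individual statistics on demand
print(haberman["age"].min())    # youngest patient age
print(haberman["age"].max())    # oldest patient age
print(haberman["age"].mean())   # average patient age
print(haberman["age"].count())  # number of non-null entries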

That covers some of the important commands we used to understand the data and to overcome common problems along the way. If you want to dive deeper, there are many more such commands; please follow the reference links below for a deeper understanding.

Feel free to provide any feedback. Much appreciated. Thank you :)

References:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

https://www.dataquest.io/blog/pandas-cheat-sheet/
