Python Data Analysis — Value counts
Python is a great language for data analysis, primarily because of its fantastic ecosystem of data-centric packages. pandas is one of those packages, and it makes importing and analysing data much easier. The pandas Series.value_counts()
function returns a Series containing counts of unique values. The resulting object is sorted in descending order, so the first element is the most frequently occurring value. NA values are excluded by default.
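As a quick illustration, here is value_counts() applied to a small synthetic Series (not the churn dataset used below):

```python
import pandas as pd

# A small synthetic Series to illustrate value_counts()
s = pd.Series(["a", "b", "a", "c", "a", "b", None])

counts = s.value_counts()
print(counts)
# "a" appears 3 times, "b" twice, "c" once;
# the None/NA value is excluded from the counts by default
```

Note that the result is itself a Series, indexed by the unique values, with the most frequent value first.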
Dataset
We will be using the Kaggle Telecom Customer Churn Prediction dataset to understand value_counts().
In the code below, I import the libraries and read the dataset:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Loading the CSV with pandas
data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
data.head(5)
General usage
The value_counts() function can be used to get the count of unique values for a given column in a dataset. The code below counts the unique values in the gender column:
data.gender.value_counts()
The counts are sorted in descending order by default (sort=True); to sort in ascending order instead, pass ascending=True. In the code below, the counts for the tenure column are returned in descending order:
data.tenure.value_counts(sort=True)
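To make the ordering behaviour concrete, here is a minimal sketch on a synthetic tenure Series (not the churn dataset):

```python
import pandas as pd

# Synthetic stand-in for a tenure column
tenure = pd.Series([1, 12, 12, 24, 24, 24])

# Default: counts sorted descending (most frequent value first)
print(tenure.value_counts())

# ascending=True reverses the order (least frequent value first)
asc = tenure.value_counts(ascending=True)
print(asc)
```

Here 24 occurs three times, so it appears first by default and last with ascending=True.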
Group by with value_counts()
The value_counts() function can be combined with other useful pandas functions to enhance the data analysis. In this case, we group by the Contract column and apply value_counts() to the gender column:
data.gender.groupby(data['Contract']).value_counts()
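The same pattern works on any DataFrame; a small sketch with made-up Contract/gender data (not the actual churn dataset) shows the shape of the result:

```python
import pandas as pd

# Made-up data mimicking the Contract and gender columns
df = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "One year"],
    "gender":   ["Female", "Male", "Female", "Female"],
})

# Count gender values within each Contract group;
# the result is a Series with a (Contract, gender) MultiIndex
grouped = df["gender"].groupby(df["Contract"]).value_counts()
print(grouped)

# An equivalent spelling: df.groupby("Contract")["gender"].value_counts()
```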
Normalize
At times the absolute counts do not make things clear. A better solution is to show the relative frequencies of the unique values in each group. Setting normalize=True makes the returned object contain relative frequencies instead of counts; the normalize parameter is False by default.
data.gender.groupby(data['Contract']).value_counts(normalize=True)
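A hedged sketch on synthetic data shows that the frequencies are relative to each group, summing to 1 within it:

```python
import pandas as pd

# Made-up data mimicking the Contract and gender columns
df = pd.DataFrame({
    "Contract": ["One year", "One year", "One year", "Two year"],
    "gender":   ["Female", "Female", "Male", "Male"],
})

# Relative frequency of each gender within each Contract group
freq = df["gender"].groupby(df["Contract"]).value_counts(normalize=True)
print(freq)
# Within "One year": Female 2/3, Male 1/3; within "Two year": Male 1.0
```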
Binning
For columns with a lot of unique values, the default output is not helpful. In this dataset, tenure is one such column. value_counts() has a bins argument that lets us specify, as an integer, the number of groups (bins) to split the data into. In the example below I have added bins=5 to split the tenure counts into 5 groups.
data.tenure.value_counts(bins=5)
When a percentage is a better criterion than the count, use the normalize=True argument:
data.tenure.value_counts(bins=5, normalize=True)
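Binning behaves as follows on a small synthetic tenure Series: the value range is split into equal-width intervals and each row is counted into one of them.

```python
import pandas as pd

# Synthetic stand-in for a tenure column (months)
tenure = pd.Series([1, 5, 10, 20, 30, 40, 50, 60, 70, 72])

# Split the value range into 5 equal-width intervals and count per interval;
# sort=False keeps the intervals in numeric order rather than by count
binned = tenure.value_counts(bins=5, sort=False)
print(binned)

# With normalize=True the same intervals show relative frequencies
binned_pct = tenure.value_counts(bins=5, normalize=True, sort=False)
print(binned_pct)
```

Every row falls into exactly one interval, so the counts sum to the Series length and the relative frequencies sum to 1.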
nlargest() & nsmallest()
A third type of column has so many unique values that even the bins argument does not help the analysis. In this dataset, MonthlyCharges is one such column. If we use value_counts() on it directly, we get an output that is not particularly insightful. We can use another pandas function, nlargest(), in combination with value_counts() to show the 10 most frequent values:
data.MonthlyCharges.value_counts(sort=True).nlargest(10)
Similarly, we can use the nsmallest() function to display the 10 least frequent MonthlyCharges values.
data.MonthlyCharges.value_counts(sort=True).nsmallest(10)
Plots
We can also use various Python plotting libraries in conjunction with value_counts() to display the data in a more insightful manner.
Monthly_Charges_Count = data.MonthlyCharges.value_counts(sort=True).nlargest(10)
plt.figure(figsize=(10,5))
sns.barplot(x=Monthly_Charges_Count.index, y=Monthly_Charges_Count.values, alpha=0.8)
plt.show()
NaN values with value_counts()
By default, the count of null values is excluded from the result, but it can easily be displayed by setting the dropna parameter to False.
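A minimal sketch on a synthetic Series; with the churn dataset the equivalent call would simply pass dropna=False to value_counts() on the column of interest:

```python
import numpy as np
import pandas as pd

# Synthetic Series containing missing values
s = pd.Series(["Yes", "No", np.nan, "Yes", np.nan])

# Default: NaN values are excluded from the counts
print(s.value_counts())

# dropna=False adds a row counting the NaN values
with_na = s.value_counts(dropna=False)
print(with_na)
```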
Since I started using Python for data analysis, I have used the value_counts() function extensively to understand data from various angles.
If you have any questions, let me know; I am happy to help. Follow me on Medium or LinkedIn if you want to receive updates on my blog posts!