Python Data Analysis — Value counts
Python is a great language for data analysis, primarily because of its fantastic ecosystem of data-centric packages. pandas is one of those packages, and it makes importing and analysing data much easier. The pandas Series.value_counts()
function returns a Series containing counts of unique values. The resulting object is sorted in descending order, so the first element is the most frequently occurring value. NA values are excluded by default.
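As a quick illustration, here is value_counts() applied to a small synthetic Series (not the churn dataset used below):

```python
import pandas as pd

# A small synthetic Series to illustrate value_counts()
s = pd.Series(["a", "b", "a", "c", "a", "b", None])

counts = s.value_counts()
print(counts)
# "a" appears 3 times, "b" twice, "c" once;
# the None/NA value is excluded from the counts by default
```

Note that the result is itself a Series, indexed by the unique values, with the most frequent value first.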
Dataset
We will be using the Kaggle Telecom Customer Churn Prediction dataset to understand value_counts().
In the code below, I import the libraries and read the dataset:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Loading the CSV with pandas
data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
data.head(5)
General usage
The value_counts() function can be used to get the count of unique values for a given column in a dataset. The code below counts the unique values in the gender column:
data.gender.value_counts()
The counts are sorted in descending order by default (sort=True); to sort in ascending order instead, pass ascending=True. In the code below, the counts for the tenure column are returned in descending order:
data.tenure.value_counts(sort=True)
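To make the ordering behaviour concrete, here is a minimal sketch on a synthetic tenure Series (not the churn dataset):

```python
import pandas as pd

# Synthetic stand-in for a tenure column
tenure = pd.Series([1, 12, 12, 24, 24, 24])

# Default: counts sorted descending (most frequent value first)
print(tenure.value_counts())

# ascending=True reverses the order (least frequent value first)
asc = tenure.value_counts(ascending=True)
print(asc)
```

Here 24 occurs three times, so it appears first by default and last with ascending=True.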
Group by with value_counts()
The value_counts() function can be combined with other useful pandas functions to enhance the data analysis. In this case, we group by the Contract column and apply value_counts() to the gender column:
data.gender.groupby(data['Contract']).value_counts()
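The same pattern works on any DataFrame; a small sketch with made-up Contract/gender data (not the actual churn dataset) shows the shape of the result:

```python
import pandas as pd

# Made-up data mimicking the Contract and gender columns
df = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "One year"],
    "gender":   ["Female", "Male", "Female", "Female"],
})

# Count gender values within each Contract group;
# the result is a Series with a (Contract, gender) MultiIndex
grouped = df["gender"].groupby(df["Contract"]).value_counts()
print(grouped)

# An equivalent spelling: df.groupby("Contract")["gender"].value_counts()
```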
Normalize
At times the absolute counts do not make things clear. A better solution is to show the relative frequencies of the unique values in each group. Setting normalize=True makes the returned object contain relative frequencies instead of counts; the normalize parameter is False by default.
data.gender.groupby(data['Contract']).value_counts(normalize=True)
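A hedged sketch on synthetic data shows that the frequencies are relative to each group, summing to 1 within it:

```python
import pandas as pd

# Made-up data mimicking the Contract and gender columns
df = pd.DataFrame({
    "Contract": ["One year", "One year", "One year", "Two year"],
    "gender":   ["Female", "Female", "Male", "Male"],
})

# Relative frequency of each gender within each Contract group
freq = df["gender"].groupby(df["Contract"]).value_counts(normalize=True)
print(freq)
# Within "One year": Female 2/3, Male 1/3; within "Two year": Male 1.0
```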
Binning
For columns with a lot of unique values, the default output is not helpful. In this dataset, tenure is one such column. value_counts() has a bins argument that lets us specify, as an integer, the number of groups (bins) to split the data into. In the example below I have added bins=5 to split the tenure counts into 5 groups.
data.tenure.value_counts(bins=5)
When a percentage is a better criterion than the count, use the normalize=True argument:
data.tenure.value_counts(bins=5, normalize=True)
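Binning behaves as follows on a small synthetic tenure Series: the value range is split into equal-width intervals and each row is counted into one of them.

```python
import pandas as pd

# Synthetic stand-in for a tenure column (months)
tenure = pd.Series([1, 5, 10, 20, 30, 40, 50, 60, 70, 72])

# Split the value range into 5 equal-width intervals and count per interval;
# sort=False keeps the intervals in numeric order rather than by count
binned = tenure.value_counts(bins=5, sort=False)
print(binned)

# With normalize=True the same intervals show relative frequencies
binned_pct = tenure.value_counts(bins=5, normalize=True, sort=False)
print(binned_pct)
```

Every row falls into exactly one interval, so the counts sum to the Series length and the relative frequencies sum to 1.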
nlargest() & nsmallest()
A third type of column has so many unique values that even the bins argument does not help the analysis. In this dataset, MonthlyCharges is one such column. If we use value_counts() on it directly, we get an output that is not particularly insightful. We can use another pandas function, nlargest(), in combination with value_counts() to show the 10 most frequent values:
data.MonthlyCharges.value_counts(sort=True).nlargest(10)
Similarly, we can use the nsmallest() function to display the 10 least frequent MonthlyCharges values.
data.MonthlyCharges.value_counts(sort=True).nsmallest(10)
Plots
We can also use various Python plotting libraries in conjunction with value_counts() to display the data in a more insightful manner.
Monthly_Charges_Count = data.MonthlyCharges.value_counts(sort=True).nlargest(10)
plt.figure(figsize=(10,5))
sns.barplot(x=Monthly_Charges_Count.index, y=Monthly_Charges_Count.values, alpha=0.8)
plt.show()
NaN values with value_counts()
By default, the count of null values is excluded from the result, but it can easily be displayed by setting the dropna parameter to False.
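A minimal sketch on a synthetic Series; with the churn dataset the equivalent call would simply pass dropna=False to value_counts() on the column of interest:

```python
import numpy as np
import pandas as pd

# Synthetic Series containing missing values
s = pd.Series(["Yes", "No", np.nan, "Yes", np.nan])

# Default: NaN values are excluded from the counts
print(s.value_counts())

# dropna=False adds a row counting the NaN values
with_na = s.value_counts(dropna=False)
print(with_na)
```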
Since I started using Python for data analysis, I have used the value_counts() function extensively to understand data from various angles.
If you have any questions, let me know; I am happy to help. Follow me on Medium or LinkedIn if you want to receive updates on my blog posts!