Implementing Exploratory Data Analysis to Understand Real-Life Issues Better

Mahak Agarwal
Jun 7, 2020 · 9 min read

So you watch the news and there is yet another headline where a woman becomes the victim of a crime so terrible that it fills your heart with rage. A lot of us might wonder about the thought processes of the criminals who commit such heinous crimes. Weren’t they taught to respect women, the way we were taught in school and at home? What makes them forget everything and go down this path?

Does education even play a role in deciding if a person can commit such a crime? There are a lot of other factors that can come into play while trying to figure out the answers to the above questions. In India, for example, it also depends on the location, whether it’s rural or urban, although a lot of the crimes in the rural areas do not get reported.

The question still looms: what makes a person commit these terrible crimes, and does it have any relation to their education?

What do you think of the above problem? Is there a way we can find answers to it with some real data?

Read further to find out.

Before that, let’s understand what the term Exploratory Data Analysis means.

Exploratory data analysis is an approach to understanding the data at hand: examining it, drawing inferences from observations, and identifying characteristics or possible patterns, usually visually.

The problem I have picked will likely require two kinds of data: one recording the crimes against women in India, and the other covering the education level of people in different areas. I started my search for datasets on Kaggle, which is one of the best sources for finding datasets and starting off learning to deal with data. There are a couple of other resources too, like Google Cloud or the UCI Machine Learning Repository.

I found the 2011 Census of India data, which has statewise literacy rates. Step one: check. Even for this, I did quite a bit of research to identify the measure that best captures the education level of a region, and the literacy rate seemed to be the most uniform. The next dataset I took up was Crimes in India.

Let’s import Pandas, NumPy, and Matplotlib to see what the data looks like.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook

# Load the crime data and peek at the first five rows
b = pd.read_csv('crimes.csv')
b.head()

The picture above displays only a few columns; do check out the data from this link to see all of them.

Also, b.head() displays the first five rows of the data we have loaded, while b.tail() displays the last five.

So, we see there are different groups and subgroups of crimes against women and also different categories for people caught. We need to choose the ones we want for our problem. Let’s see how.

The unique() function displays all the categories into which a column’s data is divided. ‘Total Crime Against Women’ seems like the category to keep, since the problem statement wants the overall count of crimes and not their subdivisions. As for the criminals, ‘Total persons under trial’ is the column to work with, since these are the people charged with the crimes. We should keep in mind that it is only a rough estimate of culprits, since many people wrongfully undergo trial and many, many more are never even caught.
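For instance, here is a minimal sketch of that inspection, using the Group_Name column that the filtering step below relies on:

# List every crime category recorded in the Group_Name column
print(b['Group_Name'].unique())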

One more thing to note is that the crime data is from 2001 to 2010 whereas the literacy rate is for 2001 and 2011 census. For the sake of uniformity, we shall only deal with the year 2010 since that is common to both datasets.

Let’s clean up the data a bit.

# Keep only the overall crimes-against-women rows, and only for 2010
b = b.where(b['Group_Name'] == 'Total Crime Against Women').dropna(how='all')
b = b.where(b['Year'] == 2010).dropna(how='all')

# Keep just the columns we need
b = b[['Area_Name', 'Year', 'Group_Name', 'Total_Persons_under_Trial']]
b.head()

.where() replaces every value with NaN except those for which the condition inside the parentheses is satisfied. It is a handy method for cleaning data: followed by .dropna(how='all'), which drops the rows that have become all NaN, it returns exactly the rows we want. Using double square brackets, we can specify the columns we want to keep in the cleaned dataset and discard the rest.
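As an aside, the same cleanup is often written with boolean indexing, which selects the matching rows directly instead of masking and dropping; a sketch equivalent to the steps above:

# Select rows matching both conditions, then keep only the needed columns
b = b[(b['Group_Name'] == 'Total Crime Against Women') & (b['Year'] == 2010)]
b = b[['Area_Name', 'Year', 'Group_Name', 'Total_Persons_under_Trial']]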

Let’s look at the second dataset.

# Load the census data and peek at it
l = pd.read_excel('popdata.xlsx')
l.head()

read_excel() is used for Excel files, which have extensions like .xlsx, while read_csv(), as the name suggests, is used for files with the .csv extension.

We have a couple of other important features in this data, like the population, area, and sex ratio, along with the literacy rate. We might be able to use these later in our analysis.

After an initial glance, I realized that the two datasets spell certain state names differently. I have resolved the differences below. This step is important because we are going to merge the datasets, and they need a common dimension to merge on.

Since the dataset is small, it was easy to do here by hand. This problem is common when we take data from different sources.
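One quick way to surface such mismatches programmatically (a sketch, assuming the State and Area_Name columns that the merge below uses) is to compare the two sets of names:

# Names present in one dataset but not the other, in both directions
print(set(l['State']) - set(b['Area_Name']))
print(set(b['Area_Name']) - set(l['State']))

With the mismatches identified, the fixes are straightforward: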

# Align the state names with the spellings used in the crime data
l = l.replace('Andaman and Nicobar Islands', 'Andaman & Nicobar Islands')
l = l.replace('Dadra and Nagar Haveli', 'Dadra & Nagar Haveli')
l = l.replace('Daman and Diu', 'Daman & Diu')
l = l.replace('Orissa', 'Odisha')
l = l.replace('Jammu and Kashmir', 'Jammu & Kashmir')

# Drop the extra row at index 14
l = l.drop([14])

I have also merged the data for easy plotting later on.

# Merge the two datasets on the state name
l = l.merge(b, left_on='State', right_on='Area_Name')

# Normalize crime counts by each state's population
l['Total crimes per capita'] = l['Total_Persons_under_Trial'] / l['Population']
l.head()

One new column I have added here is ‘Total crimes per capita’. Since states with a bigger area, and consequently a bigger population, might have a higher number of cases, I have divided the total crimes by the population of each state to get its crimes per capita.

Now, we have the data, let’s move ahead.

To understand the problem statement through the data we have processed, it is important to draw preliminary visualizations and get to know the nature of the features we are working with. I have used Matplotlib for the visualizations in this problem, although there are Python libraries like Seaborn and Plotly that produce beautiful visualizations without a fuss. I particularly like Matplotlib, though, because it gives a lot of control over the parameters that influence the figure, visually and otherwise. We shall see how.

First, let’s plot a scatter plot between the literacy rate and total crimes per capita to see if there is a trend.

# Scatter plot of literacy rate against crimes per capita
fig = plt.figure()
plt.scatter(l['Literacy'], l['Total crimes per capita'])
plt.gca().set_xlabel('Literacy rate in %')
plt.gca().set_ylabel('Per capita crimes against women')
plt.gca().set_title('Relationship between literacy rate and crimes against women')
plt.ylim(-0.0025, 0.0025)
plt.tight_layout()

plt.gca() is just a function that gets the current axes. Since the x- and y-axis labels and the figure title are set on the axes object, we have to get the current axes first and then set them.
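Equivalently, Matplotlib’s object-oriented interface lets you hold on to the axes object explicitly instead of calling plt.gca() each time; the same scatter plot as a sketch:

# Object-oriented version of the scatter plot above
fig, ax = plt.subplots()
ax.scatter(l['Literacy'], l['Total crimes per capita'])
ax.set_xlabel('Literacy rate in %')
ax.set_ylabel('Per capita crimes against women')
ax.set_title('Relationship between literacy rate and crimes against women')
ax.set_ylim(-0.0025, 0.0025)
fig.tight_layout()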

This is a basic plot for analyzing the relationship between two variables. We see that it gives us no special information about them, so we need to try some other form of plot, possibly one highlighting which state each data point belongs to.
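One way to add that state information (a sketch using Matplotlib’s annotate; the small font size is just to reduce clutter):

# Label each point with its state name
fig = plt.figure()
plt.scatter(l['Literacy'], l['Total crimes per capita'])
for _, row in l.iterrows():
    plt.annotate(row['State'], (row['Literacy'], row['Total crimes per capita']), fontsize=7)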

One disadvantage of a scatter plot is that it requires both values to be numerical to be eligible for plotting.

For this problem statement, maybe we could use a line plot or bar chart for both literacy rates and crimes in the same figure for easy comparisons.

fig = plt.figure(figsize=(10, 7))
x = list(l['State'])

# Literacy rate on the left y-axis
color = 'tab:red'
plt.plot(x, l['Literacy'], color=color, marker='^')
plt.gca().tick_params(axis='y', labelcolor=color)
plt.gca().set_ylabel('Literacy rate in %', color=color)
plt.xticks(rotation=70)
plt.gca().set_xticklabels(x, ha='right')

# Crimes per capita on a second y-axis sharing the same x-axis
ax2 = plt.gca().twinx()
color = 'tab:blue'
ax2.plot(x, l['Total crimes per capita'], color=color, marker='o', ls='--')
ax2.set_ylabel('Total crimes per capita', color=color)
ax2.tick_params(axis='y', labelcolor=color)
ax2.set_title('Statewise comparison between literacy rate and crimes against women in India')
plt.tight_layout()
plt.show()

I ended up using a line plot because it visually helps in spotting haphazard values more easily than a bar chart (though this is only a personal opinion; there are a lot of ways to make bar charts visually attractive). I also used plt.gca().twinx() to get two y-axes in the same figure. This was needed because the literacy rate, being a percentage, and the total crimes per capita, being a small decimal, need different scales.

The figure is below.

This is how the final figure looked. I had to rotate the x-axis labels to prevent them from overlapping each other, and I color-coded the y-axis tick labels so that a legend is not needed. While doing this, I had in mind what is called ‘chartjunk’.

Chartjunk can be defined as anything present in a figure that adds no value to reading the graphic.

Edward Tufte has comprehensively catalogued the kinds of elements that count as chartjunk. Before reading that paper, my idea of a great visual honestly consisted of anything that looked fancy, regardless of whether the data was readable. The paper changed my perception, and I feel that everyone who wants to learn data visualization should give it a look.

Coming back to the figure, we see that a state like Kerala, with the highest literacy rate, does have fewer crimes per capita. Even though each y-axis is scaled to the maximum value of the series it plots, we can still compare the states with each other, since both series are drawn over the same states on a shared x-axis.

On the other hand, there are states like Maharashtra where the crimes per capita are very high alongside a relatively lower literacy rate. The same goes for Andhra Pradesh and Arunachal Pradesh.

So we do get an answer to the question we asked in the beginning: education does appear to play a role in the number of crimes against women.

The analysis above considered only the literacy rate, which does not guarantee that a person has imbibed the values that could prevent them from committing such crimes. Still, it is a good place to start for getting a rough estimate or spotting a trend in the data.

I am in no way claiming that lack of education is the driving force behind such crimes, but as we observed above, it could be one of the contributing factors.

So, in conclusion, the visual above shows a very basic analysis, and my main aim was to highlight how to get started when you have a problem statement in mind. There is a lot more that can be done when working with a large dataset, like computing the mean, variance, and other summary statistics, and understanding the distribution of various features to identify possible outliers, if any. All of these procedures depend on the question we want to answer.
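For example, here is a minimal sketch of those checks on the merged DataFrame l from above:

# Summary statistics for the two features of interest
print(l[['Literacy', 'Total crimes per capita']].describe())

# A single number summarizing the linear relationship between them
print(l['Literacy'].corr(l['Total crimes per capita']))

# A histogram to eyeball the distribution and spot potential outliers
l['Total crimes per capita'].plot(kind='hist', bins=15)
plt.xlabel('Total crimes per capita')
plt.show()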

I am sure everyone wondering where to start learning data science has at least one question like this that they would enjoy exploring. You could search for relevant data and then perform an analysis to identify possible correlations or to test your hypothesis. It is a great way to start learning, and far better than running code from existing projects online where you don’t understand half the lines.

The reason I have put pictures of what the data looks like after every block of code is that I wanted to highlight that during exploratory data analysis we make decisions as we go, rather than writing one long block of code.

This is my first post on Medium. Please do mention in the comments if you have any feedback or there is anything that can be added here.
