Which state in India has the highest number of missing women from 2016–2018?
In this post, I look at which states in India have the most missing women. We will analyze and visualize the data to see which states are the most unsafe for women.
Although Maharashtra and Madhya Pradesh report more missing women in absolute terms, Delhi has the highest rate of women going missing once the count is divided by the number of women living in each state. In Delhi, roughly thirteen thousand women went missing in 2018.
Data visualization and analysis can lead to false interpretations if not done carefully. This post shows why we need to think about the data before concluding that a larger raw count means a worse problem. Proper analysis gives a more accurate picture.
Incorrect analysis is common with population data. This article sheds light on bad practices in analyzing it.
Getting the Data
I got the data from the NCRB (National Crime Records Bureau).
The ranking suggested by the image above turns out to be misleading once the numbers are analyzed as rates. Delhi is the state with the highest rate of missing women. Let's see how.
The data is published as a table of rows and columns inside a PDF. For data visualization, we need it in a machine-readable format such as CSV. How do we do that? Let's see in the next section.
Processing the Data
I created the CSV data from the PDF with the help of an AI tool, ChatGPT.
I copied the contents of the PDF table and pasted them into the ChatGPT prompt.
I was amazed at how easily it produced a CSV file. Before AI tools, I created CSV files from PDFs by typing the values in by hand, one by one.
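A table converted this way is worth checking before it is used. The following is a minimal sketch of such a check in R, not part of the original workflow. The column names follow the ones used later in this post, and the state names and counts are placeholders rather than NCRB figures.

```r
# Hypothetical excerpt standing in for the converted table;
# the state names and counts are placeholders, not NCRB figures.
csv_text <- "Year,States,Cases
2016,State A,1200
2017,State A,1350
2018,State A,1500
2016,State B,800
2017,State B,790
2018,State B,910"

converted <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Counts should parse as numbers; stray commas or footnote marks
# copied from a PDF would turn the column into text.
cases_numeric <- is.numeric(converted$Cases)

# No value should be missing.
no_missing <- !anyNA(converted)

# No state-year pair should be repeated, and every state should have
# as many rows as there are years.
no_duplicates <- !any(duplicated(converted[, c("States", "Year")]))
complete_years <- all(table(converted$States) == length(unique(converted$Year)))

print(c(cases_numeric = cases_numeric,
        no_missing = no_missing,
        no_duplicates = no_duplicates,
        complete_years = complete_years))
```

If any of the four flags is FALSE, the converted rows should be compared with the PDF again before plotting.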
Accessing the data for data visualization
Now, we need to access the data to analyze and visualize it. GitHub Gist is a great way to host small data, and it can be accessed anytime, anywhere.
The Gist that holds the data is public:
https://gist.github.com/nitanagdeote/5f1984062f433644d0588d082064c66a
To use the data, we need the raw file. Click Raw on the Gist page and copy the URL of the page that opens.
What type of data visualization should we do?
For now, I plan to create a line chart with time on the x-axis and the number of cases on the y-axis.
Description of the dataset
The dataset consists of the names of the states, the years, and the total number of cases of missing women. From a data analysis point of view, we first need to see what types of variables we have: States, which is a category (a string); Year, which is a time variable; and the total number of missing women, which is a number. This gives us a few options for analyzing the dataset. We can pick any two variables and visualize them, we can plot all three together, or we can try every combination and visualize each one.
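The variable types can also be checked directly in R. This is a small sketch on placeholder rows. It assumes the columns are named Year, States, and Cases and that read.csv reads the years as whole numbers; the values are made up for illustration.

```r
# Placeholder rows with the same three columns as the dataset;
# the values are made up for illustration.
example <- data.frame(
  Year   = c(2016L, 2017L, 2018L),
  States = c("State A", "State A", "State A"),
  Cases  = c(1200L, 1350L, 1500L),
  stringsAsFactors = FALSE
)

# sapply(..., class) lists the type R assigned to each column.
column_types <- sapply(example, class)
print(column_types)

# Optional: turning Year into a factor makes ggplot2 treat each year
# as a discrete category instead of a continuous number.
example$Year <- factor(example$Year)
print(levels(example$Year))
```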
Analyzing data
I will be analyzing the data using the R programming language.
Description of the chart
The chart above was created using the R programming language and a library called ggplot2.
# Visualizing the number of cases of missing women
# in different states of India from 2016-2018
library(ggplot2)
# Read the CSV straight from the raw Gist URL
data <- read.csv(
  'https://gist.githubusercontent.com/nitanagdeote/5f1984062f433644d0588d082064c66a/raw/abc4ddc11df9001698df8e2d223bd6b74a9cd133/.csv'
)
##################################
# Cleaning the data
# Rename the second column to 'States'
colnames(data)[2] <- 'States'
# Print the data to check it
data
#######################################
# One line per state: year on the x-axis, cases on the y-axis
ggplot(data, aes(
  x = Year,
  y = Cases,
  group = States,
  color = States
)) +
  geom_line() +
  ggtitle("Missing women from year 2016-2018 for top 10 States in India")
######################################################################
The chart above is very basic. The colors are hard to tell apart, and the visualization is not interactive. It needs a better color palette.
Let’s change the color and types of the lines
# Visualizing the number of cases of missing women
# in different states of India from 2016-2018
library(ggplot2)
# Read the CSV straight from the raw Gist URL
data <- read.csv(
  'https://gist.githubusercontent.com/nitanagdeote/5f1984062f433644d0588d082064c66a/raw/abc4ddc11df9001698df8e2d223bd6b74a9cd133/.csv'
)
##################################
# Cleaning the data
# Rename the second column to 'States'
colnames(data)[2] <- 'States'
data
# One color per state (11 states); the last entry was changed from a
# repeated "#00f2ff" so that no two states share a color
cols <- c("#998891", "#EEA236", "#00f2ff", "#3cff00", "#9632B8", "#ff0000", "#fff700", "#5CB85C", "#5CDDDC", "#ff00d4", "#1B4F72")
#######################################
# Color and line type both vary by state
ggplot(data, aes(
  x = Year,
  y = Cases,
  group = States,
  color = States,
  linetype = States
)) +
  geom_line() +
  scale_color_manual(values = cols)
######################################################################
Now we can clearly see that Maharashtra has the highest number of cases. But wait: these are raw counts, and they need to be normalized by population before states can be compared.
What is data Normalization?
In the data above, a state with a large population will naturally report more missing women than a state with a small one. The raw count therefore does not show the real scale of the problem.
Let's analyze it with an example.
Case 1: If the total population is 1,000 and 100 women are missing, 10% are missing.
Case 2: If the total population is 10,000 and 500 women are missing, 5% are missing.
Looking only at the raw counts, 500 is much greater than 100. Yet the rate in case 1 (10%) is twice the rate in case 2 (5%). A fair analysis calculates the rate for each state and then compares the rates.
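The two cases above can be worked through in a few lines of R. This is only an illustrative sketch using the example numbers, not real data.

```r
# The two cases from the example above.
missing    <- c(case1 = 100, case2 = 500)
population <- c(case1 = 1000, case2 = 10000)

# Rate expressed as a percentage of the population.
rate_percent <- 100 * missing / population
print(rate_percent)

# Ranking by raw count and ranking by rate give opposite answers.
highest_by_count <- names(which.max(missing))
highest_by_rate  <- names(which.max(rate_percent))
print(c(by_count = highest_by_count, by_rate = highest_by_rate))
```

Case 2 comes out on top by count, while case 1 comes out on top by rate.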
Getting data for the Projected Total Population of different states
Now we need the population of each state for 2016 to 2018 so that we can normalize the case counts.
I have manually created the CSV file at https://gist.githubusercontent.com/nitanagdeote/69fea3eb11e4758fa9c71ec22b085155/raw/8b32d1f04b82f14b7b32b6802fcdfb8353826674/women-missing-rate.csv
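The combined file was built by hand, but the same join could be sketched in R if the cases and the population were kept in two separate tables. The example below is a hypothetical illustration: the table layout, state names, and numbers are all invented, and only the column names mirror the ones used in this post.

```r
# Placeholder tables; the state names and numbers are invented for illustration.
cases <- data.frame(
  Year   = c(2016, 2017, 2016, 2017),
  States = c("State A", "State A", "State B", "State B"),
  Cases  = c(1200, 1350, 800, 790)
)
population <- data.frame(
  Year       = c(2016, 2017, 2016, 2017),
  States     = c("State A", "State A", "State B", "State B"),
  Population = c(1000000, 1010000, 400000, 405000)
)

# merge() keeps only the state-year pairs present in both tables,
# so a row count below nrow(cases) signals a mismatched name or year.
combined <- merge(cases, population, by = c("States", "Year"))

# Cases divided by population for each state and year.
combined$rate <- combined$Cases / combined$Population
print(combined)
```

Comparing nrow(combined) with nrow(cases) is a quick way to catch a state whose name is spelled differently in the two sources.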
We calculate the rate by dividing the number of cases by the population for that year for that state.
#######################################
# Visualizing the rate of cases of missing women
# in different states of India from 2016-2018
library(ggplot2)
# Read the CSV that holds both cases and total population
data <- read.csv(
  'https://gist.githubusercontent.com/nitanagdeote/69fea3eb11e4758fa9c71ec22b085155/raw/8b32d1f04b82f14b7b32b6802fcdfb8353826674/women-missing-rate.csv'
)
##################################
# Cleaning the data
# Rename the second column to 'States'
colnames(data)[2] <- 'States'
data
# One color per state (11 states); the last entry was changed from a
# repeated "#00f2ff" so that no two states share a color
cols <- c("#998891", "#EEA236", "#00f2ff", "#3cff00", "#9632B8", "#ff0000", "#fff700", "#5CB85C", "#5CDDDC", "#ff00d4", "#1B4F72")
# New column: cases divided by the state's total population
data$rate <- data$Cases / data$Population
data
#######################################
ggplot(data, aes(
  x = Year,
  y = rate,
  group = States,
  color = States,
  linetype = States
)) +
  geom_line() +
  scale_color_manual(values = cols)
######################################################################
In the R file, we created a new column, 'rate'. The final line chart uses time on the x-axis, the rate on the y-axis, and the third variable, state, as the line color.
Conclusion 1: Incorrect
With the new data, plotting rate against year, the chart shows that the rate of missing women is highest in Madhya Pradesh, and the upward slope of the top line shows that it keeps increasing.
This analysis is incorrect because the rate is calculated against the state's total population rather than its female population.
Analysis 2: Correct Analysis
In analysis 2, we calculate the rate against the female population of each state for the particular year. The female population data for 2016–2018 comes from https://main.mohfw.gov.in/sites/default/files/Population%20Projection%20Report%202011-2036%20-%20upload_compressed_0.pdf.
I created the CSV file and hosted it in a GitHub Gist.
Let's calculate the rate of missing women per state by dividing the cases by the total female population.
# Visualizing the number of cases of missing women per total women population
# in different states of India from 2016-2018
library(ggplot2)
# Read the CSV that holds cases and total women population
# (the earlier cases-only file is not reloaded here, because it would
# overwrite this data with a table that lacks the population column)
data <- read.csv(
  'https://gist.githubusercontent.com/nitanagdeote/92ab25e2efe3bbde24fd87f87014bdec/raw/d1a3d45b2109fb2bfeb4b6f6c95d19d335d27408/women-missing-per-total-women-population.csv'
)
##################################
# Cleaning the data
# Rename the second column to 'States'
colnames(data)[2] <- 'States'
# Rate: cases divided by the female population of the state
data$Rate <- data$Cases / data$Total.Women.Population
# The same rate expressed per 10,000 women
data$per10000women <- data$Rate * 10000
data
# One color per state (11 states); the last entry was changed from a
# repeated "#00f2ff" so that no two states share a color
cols <- c("#998891", "#EEA236", "#00f2ff", "#3cff00", "#9632B8", "#ff0000", "#fff700", "#5CB85C", "#5CDDDC", "#ff00d4", "#1B4F72")
#######################################
# y-axis: missing women per 10,000 women, as described in the text
ggplot(data, aes(
  x = Year,
  y = per10000women,
  group = States,
  color = States,
  linetype = States
)) +
  geom_line() +
  scale_color_manual(values = cols)
######################################################################
We use R to create a new column for the rate.
Then we calculate the number of missing women per 10,000 women.
The chart shows the year on the x-axis and the number of missing women per 10,000 women on the y-axis, and the color of each line represents a state.
Conclusion 2: Correct
Delhi has the highest rate of missing women, with more than 12 missing women per 10,000 women. In 2018, Delhi's female population was approximately 8,697,000, and the total number of cases of missing women was 13,272. That gives a rate of 13272/8697000 ≈ 0.0015, or about 15 per 10,000 women.
Madhya Pradesh, the second-largest state, has 29,761 women missing. Even though that count is greater than Delhi's roughly thirteen thousand, the rate for Madhya Pradesh is lower. Data analysis helps us answer questions like this, and a rate gives a fairer answer than a raw count.
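The Delhi figure can be checked with a few lines of R. This is a small sketch that uses only the two numbers quoted above; the variable names are my own.

```r
# Figures quoted in the text for Delhi in 2018.
delhi_cases <- 13272
delhi_women <- 8697000

# Rate as a fraction, then scaled to "per 10,000 women".
rate      <- delhi_cases / delhi_women
per_10000 <- rate * 10000

print(round(rate, 4))
print(round(per_10000, 2))
```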
What do you think?
Does a high rate of missing women mean a place is unsafe for women in general?
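For fun, let's create a bar chart of the state and rate variables:
# Bar chart of the rate per state; the column created above is 'Rate'
ggplot(data, aes(
  x = States,
  y = Rate,
  fill = States
)) +
  geom_bar(stat = "identity", position = "stack") +
  # The bars use the fill aesthetic, so the palette goes on the fill scale
  scale_fill_manual(values = cols) +
  # Tilt the state names so they do not overlap
  theme(axis.text.x = element_text(angle = 60, hjust = 1))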
##########################################
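Because the data has one row per state per year, the stacked bars add up each state's yearly values. A ranking for a single year can be easier to read. Below is a minimal sketch of that idea on a placeholder table. The column names match the ones computed earlier, and the states and values are invented for illustration.

```r
# Placeholder table with the columns computed earlier;
# the states and values are invented for illustration.
example <- data.frame(
  Year          = c(2017, 2018, 2017, 2018, 2017, 2018),
  States        = c("State A", "State A", "State B", "State B", "State C", "State C"),
  per10000women = c(4.1, 4.5, 12.0, 13.2, 7.3, 6.9)
)

# Keep a single year so each state contributes one value.
latest <- example[example$Year == 2018, ]

# Sort from the highest rate to the lowest.
ranked <- latest[order(latest$per10000women, decreasing = TRUE), ]
print(ranked[, c("States", "per10000women")])
```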
What we learned
- With population data, the rate of a variable, not its raw count, gives the real picture.
- How to host data with GitHub gist
- How to process and calculate data
- How to use ChatGPT for data processing
Resources :
- https://ncrb.gov.in/crime-in-india-year-wise.html?year=2022&keyword=
- https://ncrb.gov.in/crime-in-india-table-content?year=2022
- Population estimate data link in pdf for India 2011–2036 https://main.mohfw.gov.in/sites/default/files/Population%20Projection%20Report%202011-2036%20-%20upload_compressed_0.pdf
- Data file in GitHub gist for Women missing, state, year for 2016–2018
- Projected Total Population for India
Notes
https://ncrb.gov.in/uploads/2022/May/18/1652868490_missingpage-merged.pdf
It looks like there is a mistake in the table. The data covers 11 states. However, the Telangana data is missing for 2018, and its place is taken by a new entry, Odisha. In the data visualization, though, the Telangana value for 2018 is approximately 10,000.
Note: This is a work in progress.