How Can Data Science Fight Social Injustice?

Most of this was originally written for Ethics, Privacy and Social Justice in Data Science, taught by Professor Ksenia Polson, Ph.D., at Regis University. The project I reference was done by Shivam Bansal.

Published in ILLUMINATION · 16 min read · May 2, 2020

“A hypothetical image of AI fighting social injustice” by Monash Tech talks on Youtube

Today there is a huge gap between law enforcement and the communities they are entrusted to protect and serve. This divide is spiraling out of control and needs to be fixed immediately. We need law enforcement that can be trusted to uphold the oath they took, and we need civilians who actually trust them. How can we do this? One solution is Data Science.

For those of you who haven’t read my last two articles, I am a Data Scientist and a big advocate for the field’s ability to help us in our everyday lives. Many people may be skeptical about the technology behind it and don’t believe or know how it will actually help us. I am here to show how it can help and, I hope, dispel some of that disbelief.

“Core traits of what Data Science is,” from DataQuest

The power of Data Science

Our world is constantly making advancements in technology, and these advancements come with great power and responsibility. It is important that we use them for the greater good rather than exploit them for selfish and greedy ends. Data Science is a hot topic that spans many technologies in use today. It can be used to forecast the weather, climate change, and stock market prices; it can even be used to classify images and video. However, I bet not everyone knows that it can also be used to fight social injustice.

Data Science can be applied to social justice issues the same way it is applied to any other problem. First the issue must be defined; then the data can be collected, cleaned, explored, prepped for learning, modeled, visualized, and analyzed for insight. In this case, a group hosting a competition on Kaggle provided the dataset so that Data Scientists could come up with a solution to the problem.

“An image of Kaggle’s logo” from Wikipedia

What’s Kaggle?

Kaggle is a great online Data Science community owned by Google. It is mainly used by Data Scientists, who enter their projects in competitions and show off their skills for the rest of the world to see.

But how would a competition or a personal project help solve social justice issues?

“An ordinary Data Scientist working on a project” from datascientists.org by Troy Sadkowsky

Kaggle allows outside organizations to bring forward their datasets and see what a community of four million Data Scientists can solve for them. These organizations are generally looking for a solution to a problem that they don’t know how to solve or would rather have someone more experienced solve for them. The competitions provide an incentive, usually a cash prize for the best overall solution. That incentive is what draws the best minds, or teams of minds, to bring forward the best answer. The Kaggle competition I’ll be covering offered $5,000 to the first-place winner.

“Two competitors racing to claim the top” from MarkiT

The Competition/Project

“Data Science for Good: Center for Policing Equity” was a project/competition set up to find a way to measure justice and address racism within policing. The Center for Policing Equity (CPE) is made up of research scientists, race and equity experts, data virtuosos, and community trainers working together to build more fair and just systems (Center for Policing Equity, 2018). They use Data Science tools and partner with law enforcement and communities to rebuild trust that has been broken by communication problems and generational distrust. The data CPE provided contains documented field interviews, educational attainment, occupation, income, housing status, and level of poverty. The data had already been mined and cleaned and was ready for the steps that follow data exploration. All that remained was for willing contestants to come forward and put their skills and expertise to the test.

The Winner with the Solution

One Data Scientist, Shivam Bansal, approached the problem in his Kaggle kernel (IPython notebook) by first describing the dataset,

“An example of a visualization Shivam Bansal used,” from Shivam’s Kaggle Project

then used data visualization to look at the data in a way that a plain datasheet cannot show.

“A visualization of GIS data on the different races of people within the dataset” from Shivam’s Kaggle Project

He also used GIS (Geographic Information System) data to show on a map where the data samples were collected.

Addressing the Challenges

In another IPython notebook, Shivam mentioned that the dataset itself is completely unstructured. This is not to be confused with the formal category of unstructured data, for those of you who know what that is. What Shivam meant was three things. One, the data comes from too many sources, ranging from the multiple sources collected by the police themselves, rather than a single one, to external sources of additional data on use of force, vehicle stops, and other police incidents. Two, it lacked the standardization or normalization needed to be considered clean data. Three, there were many data issues such as missing files, missing values for department-level data, incorrect values, and incomplete records. Shivam would need to do a lot of data wrangling in order to provide a promising solution.
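To make the second and third problems more concrete, here is a minimal sketch of what this kind of wrangling can look like. It is not taken from Shivam’s notebooks: the column aliases, the list of "missing" tokens, and the two sample rows are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical mapping from department-specific column names to one shared schema.
COLUMN_ALIASES = {
    "SUBJECT_RACE": "subject_race",
    "Subject Race": "subject_race",
    "INCIDENT_DATE": "incident_date",
    "Date": "incident_date",
    "LOCATION_DISTRICT": "district",
    "Precinct": "district",
}

# Assumed spellings that different departments might use for "no value".
MISSING_TOKENS = {"", "na", "n/a", "null", "none"}


def standardize_record(record):
    """Rename known columns to the shared schema and turn missing values into None."""
    clean = {}
    for column, value in record.items():
        name = COLUMN_ALIASES.get(column, column.strip().lower().replace(" ", "_"))
        text = str(value).strip() if value is not None else ""
        clean[name] = None if text.lower() in MISSING_TOKENS else text
    return clean


def count_missing(records):
    """Count, per column, how many records have no usable value."""
    columns = {name for record in records for name in record}
    missing = Counter()
    for record in records:
        for name in columns:
            if record.get(name) is None:
                missing[name] += 1
    return dict(missing)


# Two made-up rows, as if they came from two departments with different layouts.
raw_rows = [
    {"SUBJECT_RACE": "Black", "INCIDENT_DATE": "2015-03-01", "LOCATION_DISTRICT": "1"},
    {"Subject Race": "N/A", "Date": "2016-07-12", "Precinct": ""},
]
standardized = [standardize_record(row) for row in raw_rows]
missing_report = count_missing(standardized)
print(standardized)
print(missing_report)
```

Once every source shares the same column names, a per-column count of missing values shows quickly which departments need the most cleanup.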

The Solution Pitch

Shivam explained how he would solve the issue using Data Science by first outlining the challenging requirements for the solution, which were:

Hassle-free calculations
Scalability
The least amount of human intervention
Fast insights

Shivam’s solution was to make “An end to end automation data processing and analysis pipeline which takes care of both the points: Data Processing and Data Analysis. The complete pipeline has three major components or parts denoted as Component A, Component B, and Component C:” (Bansal, 2018).

Shivam’s Component A: Data Integration and Standardization

“The architectural design of Component A” from Shivam’s Kaggle Project

The first component’s goals were to integrate multiple datasets, create a structured repository, process police department shapefiles, and process ACS (American Community Survey) data. The code for this can be found in his part two notebook, but here is a snippet of the trigger function that initializes the data integration and processing, followed by the status messages it prints.

def _run_standardization_pipeline():
    _create_repository_structure()
    _standardize_shapefiles()
    _standardize_acs()

_run_standardization_pipeline()

# Output:
# Status : Directory Structured Created
# Status : Shapefile Standardization Complete
# Status : Standardization of Metrics complete
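The helper functions called by the trigger are not shown in this article. As a rough idea of what a "create a structured repository" step can involve, here is a minimal sketch; the function name, the subfolder names, and the department names are hypothetical, and the real notebook may organize its repository quite differently.

```python
import tempfile
from pathlib import Path

# Hypothetical subfolder names; not taken from Shivam's notebook.
SUBFOLDERS = ("shapefiles", "acs", "police_incidents")


def create_repository_structure(root, departments):
    """Create one folder per department, each with the same set of subfolders."""
    created = []
    for dept in departments:
        for sub in SUBFOLDERS:
            folder = Path(root) / dept / sub
            folder.mkdir(parents=True, exist_ok=True)  # safe to run more than once
            created.append(folder)
    print("Status : Directory Structure Created")
    return created


# Build the layout inside a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    folders = create_repository_structure(tmp, ["Dept_A", "Dept_B"])
    folder_count = len(folders)
    all_exist = all(folder.is_dir() for folder in folders)
```

Giving every department the same layout is what lets the later steps loop over departments without special cases.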

His Second Component, Component B: Department Level Processing

“The architecture for Component B” from Shivam’s Kaggle Project

In the same notebook, Shivam outlines the tasks needed to complete Component B. The first task is to find the overlap percentages between the census tracts (CTs) and the police district shapefiles. Next, enrich the ACS metrics using the overlap data. Then standardize all the data for police incidents. After that, standardize and integrate any external police incident data. Finally, piece all the datasets together into one combined dataset. Here is the snippet of code that triggers the pipeline; once again, the whole code can be found in the second notebook.

def _execute_district_pipeline(_dept, _police_config1, _police_config2=None):
    print ("Selected Department: ", _dept)
    
    ## department shape file
    print (". Loading Shape File Data")
    dept_shape_gdf = _read_shape_gdf(_dept)
    base_plot = _plot_shapefile_base(dept_shape_gdf, _dept, overlapped_cts = {})    

    ## finding overlapped CTs percentages
    print (".. Finding Overlapping CTs")
    _identifier = depts_config[_dept]["_rowid"]
    state_cts = _read_ctfile(_dept)
    overlapped_cts, olaps_percentages = find_overlapping_cts(dept_shape_gdf, state_cts, _identifier)
    overlapped_plot = _plot_shapefile_base(dept_shape_gdf, _dept, overlapped_cts)
    
    ## Adding the Metrics Data
    print ("... Loading ACS Metrics Data")
    metrics_df = _cleanup_metrics_data(_dept)

    ## Add Metrics to the dept df
    print (".... Enrichment of ACS Metrics with Overlapped Data")
    dept_enriched_gdf = dept_shape_gdf.copy(deep=True)
    for metric_name in metrics_config.keys():
        dept_enriched_gdf = _process_metric(metrics_df, dept_enriched_gdf, _identifier, 
                                            olaps_percentages, metric_name=metric_name)
    
    ## Find Enriched DF
    enriched_df = _flatten_gdf(dept_enriched_gdf, _identifier)
    enriched_df = enriched_df.rename(columns={_identifier : "LOCATION_DISTRICT"})
    
    ## Processing Police DF
    if _police_config1 is not None:
        print ("..... Standardizing the Police Events")
        police_file1 = _standardize_filename(_dept)
        _police_config1["police_file"] = police_file1
        police_df, events_df = _process_events(_police_config1)
    else:
        police_df, events_df = pd.DataFrame(), pd.DataFrame()
    
    ## Adding any other external Police Data 
    if _police_config2 is not None:
        print ("..... Standardizing the External Data")
        external_df = _load_external_dataset(_police_config2)
        police_df = police_df.merge(external_df, on="LOCATION_DISTRICT")
    
    ## Save Final Data
    print ("...... Saving the Final Data in New Repository")
    _save_final_data(enriched_df, police_df, events_df)
    
    response = {
                "dept_shape_gdf" : dept_shape_gdf,
                "base_plot" : base_plot,
                "olaps_percentages" : _prepare_olaps_df(olaps_percentages),
                "overlapped_plot" : overlapped_plot,
                "dept_enriched_gdf" : dept_enriched_gdf,
                "enriched_df" : enriched_df,
                "police_df" : police_df,
                "events_df" : events_df
                }
    return response
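The overlap and enrichment steps in the pipeline above rely on helpers that are not shown here. The idea behind them can be sketched with plain rectangles instead of real shapefiles: work out what fraction of each census tract falls inside a police district, then weight the tract's ACS value by that fraction. Everything below (the function names, the rectangle geometry, the population numbers) is an assumption for illustration, not Shivam's implementation, which works on actual polygon data.

```python
def rect_area(rect):
    """Area of an axis-aligned rectangle given as (min_x, min_y, max_x, max_y)."""
    min_x, min_y, max_x, max_y = rect
    return max(0.0, max_x - min_x) * max(0.0, max_y - min_y)


def overlap_fraction(tract, district):
    """Fraction of the tract's area that falls inside the district."""
    intersection = (
        max(tract[0], district[0]),
        max(tract[1], district[1]),
        min(tract[2], district[2]),
        min(tract[3], district[3]),
    )
    tract_area = rect_area(tract)
    return rect_area(intersection) / tract_area if tract_area else 0.0


def estimate_district_population(tracts, district):
    """Area-weighted estimate: each tract contributes population * overlap fraction."""
    return sum(t["population"] * overlap_fraction(t["bounds"], district) for t in tracts)


# Made-up geometry: one district and three census tracts.
district = (0.0, 0.0, 4.0, 4.0)
tracts = [
    {"id": "CT-1", "bounds": (0.0, 0.0, 2.0, 2.0), "population": 1000},  # fully inside
    {"id": "CT-2", "bounds": (3.0, 0.0, 5.0, 2.0), "population": 800},   # half inside
    {"id": "CT-3", "bounds": (6.0, 6.0, 8.0, 8.0), "population": 500},   # outside
]
fractions = {t["id"]: overlap_fraction(t["bounds"], district) for t in tracts}
estimated_population = estimate_district_population(tracts, district)
print(fractions)
print(estimated_population)
```

Area weighting assumes people are spread evenly across each tract, which is a simplification, but it is what allows census figures to be reported per police district rather than per census tract.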

After running the code and pipelining multiple datasets from multiple police departments to generate examples comes Component C, the framework for analyzing the data generated by the automated pipelines.

The Last Component, Component C: Analysis of Framework

This component provides the overall analysis of a dataset that has been integrated, standardized, and processed through the pipeline, and offers multiple templates for analysis. Shivam created these templates to be customizable and interactive so that they can be used to create department-level reports. The analysis covers these topics: “Overview of Police Activity”, “Does Racial Bias Exist?”, “What is the Extent of Racial Bias?”, and “Different templates: What can Explain Racial Bias?, Statistical Analysis, Officer level Analysis”. To give an idea of what the analysis looks like, I picked his first example, an analysis report on a Minneapolis Police Department dataset.

Example of Component C at Work

Component C generates a report based on the dataset created by the first two components. The report is broken up into three content sections: the “Key Highlights” of the report, a “Deep Exploration of Police Activity”, and an explanation of “What can account for Racial Disparities?”. I’ll display a few examples from each section below.

Everything from here down to “End of the example” is Shivam’s work.

Key Highlights

Different districts have witnessed different numbers of use-of-force incidents by police. Use-of-force incidents are concentrated in District 1, where approximately 4,000 incidents occurred in the last 5 years, while in District 2 and District 3 only about 1000–1200 incidents occurred.
This data suggests that use of force on Blacks was about 3 times that on Whites. The percentage of Blacks being targeted is 60% on average, which is much higher than for other races, especially Whites.
About 71% of the police incidents are targeted on Blacks (population: 20%), while only 27% of incidents are targeted on Whites (population: 72%). Similarly, the Asian population proportion is very low (about 1.5%) in all districts, but the use of force on them is higher (6.5% of incidents).
In three districts (1, 3, and 4), the proportion of Black subjects being injured is very high (about 50% in Districts 1 and 3, and about 80% in District 4), yet their population proportion in these districts is only 26%, 17%, and 40% respectively.
Though the aggregated Black population is only 18% across all the districts, use of force by police is on average about 60% on Blacks.
In Districts 2 and 5, surprisingly, a higher percentage of the white population is stopped for vehicle checks. This also aligns with the subject-injury insights, which were higher among whites in Districts 2 and 5.

All insights are obtained from the analysis framework, and the detailed explanations are provided in section 2.

2. Deep Exploration of Police Activity : Use of Force

All the analysis is first done at a high level and then by controlling for the socio-economic or demographic factors of the area which the department serves. The main idea of this analysis is to measure the following points:

Are there racial disparities?
To what extent do racial disparities exist in the department?
What are the key factors that can explain racial disparities?

As the first step, we select the department and load the relevant datasets (that were produced from the data processing and analysis pipeline). There are two main datasets : Enriched district level data and processed Police Incidents Data.

2.1 Overview : Use-of-Force (2012–2015)

Let’s look at a high-level overview of the police activity (use of force) in different districts. The following plot shows the aggregated police activity in the different districts served by the department. The legend of the graph is as follows:

Department Districts : blue polygons
Aggregated Total incidents : green circles
2015 Use-of-Force incidents : red points

Graph Interpretation

The size of the green circles represents the total count of police activity in a district. The larger the circle, the more police activity incidents in that district, and vice versa. Hover over the circles to view the actual numbers.

Inferences

The Minneapolis police department serves a total of five districts in the city. The first look at the police activity shows that different districts have witnessed different numbers of use-of-force incidents.
In the years 2012–2017, the maximum number of use-of-force incidents occurred in District 1, with approximately 4,000 incidents. Districts 4 and 3 witnessed about 3500 and 2500 police incidents in the last 5 years, while only about 1100–1300 incidents occurred in Districts 2 and 5.
The possible cause of this distribution may be one or more of the following factors: high crime rates, police characteristics, and community and police relationships. In the later parts of this report, analysis is performed that attempts to uncover these insights and the possible causes.

Among these incidents:

Police used force most often in the month of January, followed by June, with about 1500 police incidents each.
The most used type of force was “Body Weight to Pin”, with about 3700 occurrences.
The most common reason stated by police for using force was “Tensed Subject”, with about 3900 incidents.

Inferences

In the first plot, the largest bubble marks the area where the maximum number of vehicle stops were made.
It is interesting to note that in that particular area (District 4), the median income of whites is almost three times that of blacks, i.e. the median income ratio of whites to blacks is 75,000 USD : 25,000 USD. This may indicate that police make more vehicle stops in areas where the black population earns less.
Similarly, in the second plot we can observe that the maximum vehicle stops (largest bubble) were in District 5, in which the Employment to Population Ratio of blacks is about 60%, while it is about 80% for whites.

2.8. Vehicle Stops: Blacks vs. Whites as a Proportion of Population

Let’s compare the percentage of blacks and whites as a proportion of their population at the overall department level.

Graph Interpretation

In these radial charts, one color (BLUE) represents the population proportion in the 5 districts, and the other (ORANGE) represents the vehicle stops by district. Basically, these two pieces of information, population and vehicle stops, are overlaid on one another in order to measure the extent of bias. The 5 lines that form the radii of these radial charts each account for 100%, so the blue and orange points on these lines are below 100%.

Inferences

We can observe that in all five districts served by this department, the population proportion of blacks is much lower (orange area) while their vehicle stops proportion is higher (blue area).
In the second plot, a reverse scenario can be observed in the case of whites: the population proportion (represented as blue) is much higher and the vehicle stops proportion (represented as orange) is much lower.
Again, this confirms that there exists some level of racial disparity.

3. What can account for Racial Disparities ?

In this section, we explore what can account for the racial disparities observed in different districts. For this purpose, we will explore two main metrics:

High Crime Rates
Low Poverty Ratios

3.1 High Crime Rates

Key questions to ponder are:

Can police behaviour be explained by high crime rates?
Are police more active in areas where crime is actually high?
Or, more precisely, are police more active in areas where the black population is higher?

Let’s explore the crime data of the Minneapolis districts in order to get these answers. We will plot the crime incidents that occurred in the city of Minneapolis and its 5 districts along with the police activities.

In the following graph:

The “blue” polygons represent the police districts.
The “red” circles represent the aggregated crime statistics.
The “gray” circles represent the “Use-of-Force” by police.
The “green” circles represent the “vehicle stops” by police.

Inferences

Ideal Scenario: More Crime Means More Vehicle Stops and More Use of Force
i.e. Radius: RedCircle (crime) < GreenCircle (VehicleStops), GrayCircle (Use of Force)

However, from this graph we can note the following points:

In District 1 and District 4, the use-of-force incidents are much higher than the aggregated crime in the districts, though the vehicle stops are fewer.
Districts 2, 3, and 5 tell a different story: the relative proportion of crime incidents is very high, but the use-of-force and vehicle stops are much lower in these districts. Also, we established earlier that in District 3 there exists racial disparity.

End of the example

The analysis report is broken up into three key areas, and I showed only a few highlights here: content section 2 contains nine different parts worth looking at, and content section 3 contains two. If you would like to see the rest of the analysis, it is in Shivam’s Minneapolis PD notebook (shivamb, 2018c).

Shivam Bansal’s Kaggle project won first place in the competition because it was the most in-depth and comprehensive approach to the issue. That doesn’t take away from the other contenders’ efforts: there were second, third, fourth, and fifth place winners whose notebooks and code also contributed to the overall solution.
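The comparison of incident share against population share that runs through the key highlights can be worked through in a few lines. This sketch is not part of Shivam's report; the function names and the simple ratio used here are assumptions chosen for illustration, and only the four percentages come from the highlights quoted above.

```python
def per_capita_rate(incident_share, population_share):
    """Incident share divided by population share (1.0 would mean proportional)."""
    if population_share <= 0:
        raise ValueError("population_share must be positive")
    return incident_share / population_share


def disparity_ratio(group, reference):
    """How over-represented the group is in incidents relative to the reference group."""
    return per_capita_rate(*group) / per_capita_rate(*reference)


# (incident share, population share) as quoted in the key highlights.
black = (0.71, 0.20)
white = (0.27, 0.72)

black_rate = per_capita_rate(*black)
white_rate = per_capita_rate(*white)
ratio = disparity_ratio(black, white)
print(round(black_rate, 3), round(white_rate, 3), round(ratio, 3))
```

With these figures, Blacks appear in incidents at about 3.55 times their population share and Whites at about 0.375 times theirs, a ratio of roughly 9.5 between the two. A single ratio like this ignores the district-level factors the report goes on to control for, so it is a starting point rather than a conclusion.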

Kat from the CPE complimented those who competed by saying,

“We’ve been particularly excited about the results of this challenge because of how much faster and easier it will be to give the chiefs recommendations that are more actionable than ever before. With incident and demographic data mapped to precinct boundaries instead of just census tracts, we can actually let the science drive real-world protocol reconsiderations since police understandably think about their deployment decisions in terms of police districts, not census areas” (Center for Policing Equity, 2018).

“A funny meme on justice being served” from Youtube by VOA Learning English

My Conclusion

Shivam did an excellent job of displaying what Data Science can do for the social good. His data integration, processing, and analysis pipeline provides the CPE with a great tool that allows others to dig deeper into the root issues and develop actionable insight.

So has justice been served since then?

Justice may not have been served upon the completion of this competition. Nothing has been posted from the competition that says whether the solutions together have succeeded in mending the gap between law enforcement and communities, but that doesn’t mean social activism stops there.

Social injustice can be fought with other data analyses as well. Last week I wrote an article, “What Do People Think About Trump’s Immigration Suspension?”, using sentiment analysis and topic modeling. Another idea is to collect data on Planned Parenthood, show how birthrates have increased since Trump’s decision to cut funding, and forecast with machine learning how they will continue to increase. Data Science may not be the pen to the paper for change, but it is definitely a great start toward solving social injustice.

References

Center for Policing Equity. (2018). Data Science for Good: Center for Policing Equity. Retrieved April 29, 2020, from Kaggle.com website: https://www.kaggle.com/center-for-policing-equity/data-science-for-good

Kaggle Competitions. (2020). Retrieved April 30, 2020, from Kaggle.com website: https://www.kaggle.com/c/about/host/

Monash Information Technology. (2020). AI fighting social injustice | Monash Tech Talks [YouTube Video]. Retrieved from https://www.youtube.com/watch?v=R1Te9AIXlCw

Python scripts and modules — AMath 483/583, Spring 2013 1.0 documentation. (2013). Retrieved April 29, 2020, from Washington.edu website: https://faculty.washington.edu/rjl/classes/am583s2014/notes/python_scripts_modules.html

Sadkowsky, T. (2017). DataScientists. Retrieved April 30, 2020, from DataScientists website: http://www.datascientists.org/datascientistsblog/citizen-data-scientists

service. (2019). New Markets: The Value of a Competitive Analysis | MARKiT. Retrieved April 30, 2020, from Masmarkit.com website: https://masmarkit.com/2019/08/01/new-markets-the-value-of-a-competitive-analysis/

Shivam Bansal | Kaggle. (2020). Retrieved April 29, 2020, from Kaggle.com website: https://www.kaggle.com/shivamb

shivamb. (2018a, December 2). 2: Automation Pipeline — Integration & Processing. Retrieved April 30, 2020, from Kaggle.com website: https://www.kaggle.com/shivamb/2-automation-pipeline-integration-processing

shivamb. (2018b, December 2). 3: Example Runs of Automation Pipeline. Retrieved April 30, 2020, from Kaggle.com website: https://www.kaggle.com/shivamb/3-example-runs-of-automation-pipeline

shivamb. (2018c, December 3). 4.1 Analysis : Measuring Equity — Minneapolis PD. Retrieved April 30, 2020, from Kaggle.com website: https://www.kaggle.com/shivamb/4-1-analysis-measuring-equity-minneapolis-pd

shivamb. (2018d, December 4). 0. CPE — Getting Familier with Problem and Dataset. Retrieved April 29, 2020, from Kaggle.com website: https://www.kaggle.com/shivamb/0-cpe-getting-familier-with-problem-and-dataset

shivamb. (2018e, December 4). 1: Solution Workflow — Science of Policing Equity. Retrieved April 29, 2020, from Kaggle.com website: https://www.kaggle.com/shivamb/1-solution-workflow-science-of-policing-equity

The Jupyter Notebook — IPython. (2020). Retrieved April 29, 2020, from Ipython.org website: https://ipython.org/notebook.html

Understanding Tensor Processing Units — GeeksforGeeks. (2018, June 6). Retrieved April 29, 2020, from GeeksforGeeks website: https://www.geeksforgeeks.org/understanding-tensor-processing-units/

US Census Bureau. (2020, April 21). American Community Survey (ACS). Retrieved April 29, 2020, from The United States Census Bureau website: https://www.census.gov/programs-surveys/acs

VOA Learning English. (2020). English @ the Movies: “Justice is About to be Served” [YouTube Video]. Retrieved from https://www.youtube.com/watch?v=ciWAIY_piEw

What is GIS? | Geographic Information System Mapping Technology. (2020). Retrieved April 29, 2020, from Esri.com website: https://www.esri.com/en-us/what-is-gis/overview

Wikipedia Contributors. (2020, April 17). Kaggle. Retrieved April 30, 2020, from Wikipedia website: https://en.wikipedia.org/wiki/Kaggle



Written by James Nelson