Gun Violence: Using Perl to Analyze Publicly Available Data

James E Keenan
Published in Cultured Perl · Mar 23, 2017
Header image: https://commons.wikimedia.org/wiki/File:Ushomicidesbyweapon.svg
Self-published work by Aude; licensed under the Creative Commons Attribution-Share Alike 2.5 Generic license.

On March 20, the Guardian and the Gun Violence Archive (GVA) released "a new set of nationwide data for 2015 that maps gun murders at the micro level – down to the local census tract. You can use this data to do analysis of how gun murder clusters within neighborhoods in your city or state."

The Guardian and GVA are to be commended for making this data available to the public. In this article we'll explore some tools which open-source software communities have developed that can be used to analyze this data. But we'll also see that when we use publicly available data we must, at the outset, address discrepancies and imperfections in that data.

In the Guardian article linked to above you can find links to five files named as follows:

  • guardian-gva-shr-data-dictionary.csv
  • gva_release_2015_grouped_by_city_and_ranked.csv
  • gva_release_2015_grouped_by_tract.csv
  • gva_release_2015_raw_incidents.csv
  • UCR-1985-2015.csv

The fact that the files appear in CSV (Comma-Separated Values) format means that they can be opened in spreadsheet programs such as Microsoft Excel or LibreOffice Calc. But that also means that the files can be processed by dynamic programming languages such as Perl 5.

In what follows we'll assume that you have access to software tools commonly found in a Unix programming environment. You have that environment in the Linux or FreeBSD operating systems or — under the hood — in Mac OS X (Darwin). (Similar tools probably exist in contemporary Windows environments, but I don't have personal experience with them.) In particular, we'll assume that you have a modest familiarity with Perl 5 and that you know how to install extensions to Perl from the Comprehensive Perl Archive Network (CPAN). We'll also assume you have some familiarity with GitHub.

Why would we want to translate CSV data into a Perl data structure?

You could use a spreadsheet program to compute new columns in the Guardian/GVA data. For example, the file gva_release_2015_grouped_by_tract.csv lists, on a per-census-tract basis, the number of incidents of gun violence and the number of fatalities from such incidents. To determine the rate of such incidents or fatalities, we might want to compute a column which holds the rate of incidents per 100,000 population. We could certainly do that in a spreadsheet program.
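
The arithmetic is simple: divide the count by the tract's population and multiply by 100,000. In Perl, using made-up figures for a single hypothetical tract:

use strict;
use warnings;

# Hypothetical figures for one census tract.
my $incidents  = 12;        # gun-violence incidents in the tract
my $population = 4_800;     # tract population
my $rate = ($incidents / $population) * 100_000;
printf "%.1f incidents per 100,000 population\n", $rate;    # prints 250.0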

However, if we wanted to combine data from two separate CSV files, or if we wanted to combine some of this Guardian/GVA data with data from another source, we would want data munging facilities beyond what a spreadsheet could offer. Perl is an ideal choice for such data processing.
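
To make that concrete, here is a hypothetical sketch of combining per-tract counts from one source with per-tract populations from another; the tract identifiers and field names below are invented for illustration:

use strict;
use warnings;

# Hypothetical per-tract figures from two different sources, both
# keyed on a census-tract identifier.
my %incidents  = ( '36047000100' => 12,    '36047000200' => 3     );
my %population = ( '36047000100' => 4_800, '36047000200' => 6_200 );

my %combined;
for my $tract ( keys %incidents ) {
    next unless exists $population{$tract};
    $combined{$tract} = {
        incidents  => $incidents{$tract},
        population => $population{$tract},
        per_100k   => ( $incidents{$tract} / $population{$tract} ) * 100_000,
    };
}

for my $tract ( sort keys %combined ) {
    printf "%s: %.1f per 100,000\n", $tract, $combined{$tract}{per_100k};
}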

What kind of files have we downloaded?

You can download the files simply by right-clicking on the links in the Guardian page cited above. (In this article, we'll be discussing those files in the state in which the author downloaded them on March 21, 2017. It's quite possible that the content of those files may change when small corrections or updates are made. Hence, we will store those files as they appeared on March 21 in our GitHub repository.)

During downloading, the files will be saved to whatever directory you have designated for that purpose in your browser. Alternatively, you could open your system's terminal application and issue commands at the command prompt ('$' in the examples below) to create a directory tree specifically to receive these files.

$ mkdir -p gva/inputs
$ cd gva/inputs
$ wget https://interactive.guim.co.uk/2017/feb/09/gva-data/gva_release_2015_raw_incidents.csv \
    https://interactive.guim.co.uk/2017/feb/09/gva-data/gva_release_2015_grouped_by_city_and_ranked.csv \
    https://interactive.guim.co.uk/2017/feb/09/gva-data/gva_release_2015_grouped_by_tract.csv \
    https://interactive.guim.co.uk/2017/feb/09/gva-data/UCR-1985-2015.csv \
    https://uploads.guim.co.uk/2017/03/20/guardian-gva-shr-data-dictionary.csv

Here's what the output of wget will look like in your terminal:

Output of ‘wget’

To see a list of all the files, you can use the ls -l command. Its output will look like this:

Output of ‘ls -l’

To get an idea as to how each of these files has been constructed, we can use the file utility inside a small invocation of the shell on the command-line.

$ for f in `ls *.csv`; do file $f; done

Translating that into English, we get: For each CSV file found in this directory, call the file utility on the file. The file utility's output goes to "standard output" (STDOUT), i.e., to your terminal. There it will look like this:

Output of shell command using ‘file’

At this point we know we have a problem. The files have different line terminators. One ends in the 'CRLF' (Carriage Return/Line Feed) terminators commonly found in DOS and Windows environments. Three end in the 'CR' terminators found in pre-OS X Macintosh operating systems. One file listed has no mention of specific line terminators — which suggests it has the 'LF' line terminators typical of Unix programming environments.

These problems are not major. They mainly affect how your text editor of choice will display the files once opened. You may, at this point, wish to convert all the files to a standard format.

$ ls *.csv | \
xargs perl -pi'.orig' -e 's{(\015?\012$|\015)}{\012}g'

Translated into English, that says: Get each CSV file in this directory and process it with a Perl "one-liner" program which (a) creates a backup of the original file with the .orig extension; and (b) rewrites the file with all line terminators converted to the Unix 'LF' standard. The output of the ls command becomes the input of (or, is "piped into") the xargs command. The xargs command, in turn, says: Perform the immediately following command — the perl one-liner — on each of the files that have been passed to xargs.

If we were to list the files in our inputs/ directory, we would now get this:

Output of ‘ls -l --reverse’

Note that so far we have only looked at the overall format of each file. We haven't yet looked inside to see what the data looks like.

How do we translate a CSV file into a Perl data structure?

The true "killer app" for the Perl programming language is CPAN, the archive of publicly available libraries and extensions mentioned above. The CSV-parsing libraries on CPAN generally have "Text-CSV" in their names. Perhaps the most commonly used is simply called "Text-CSV"; we'll use the Text::CSV module from this distribution.

01-load.pl

We'll start by writing a Perl 5 program which, given the basename of one of the Guardian/GVA-supplied data files, reads that file's data into an array of arrays: one record for each row in the CSV file. (You can download each of the programs that follow from our GitHub repository.)

01-load.pl
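
Since the full listing is in our GitHub repository, what follows is only a minimal sketch of the approach (not necessarily the author's exact code), using the Text::CSV and Data::Dump modules:

use strict;
use warnings;
use Text::CSV;
use Data::Dump qw(dd);

my $file = shift @ARGV or die "Usage: perl 01-load.pl <file.csv>";
my $csv  = Text::CSV->new({ binary => 1, auto_diag => 1 });

open my $fh, '<:encoding(utf8)', $file or die "Unable to open $file: $!";
my $header = $csv->getline($fh);        # first row: column names
my @rows;
while (my $row = $csv->getline($fh)) {  # each subsequent row: one record
    push @rows, $row;
}
close $fh or die "Unable to close $file: $!";

dd($header);
dd(\@rows);
printf "%d rows in %s (excluding header row)\n", scalar(@rows), $file;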

To run this program and the ones that follow, please install the 'Text-CSV' and 'Data-Dump' extensions from CPAN. Once you have read the code for each program, you can test your own understanding of the code by taking a quiz on that program located on GitHub.
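
If your cpan client is already configured, one way to install both distributions is:

$ cpan Text::CSV Data::Dump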

The first file we will try to process will be UCR-1985-2015.csv. Since 01-load.pl prints all its output to STDOUT, we will invoke the program with the filename, then redirect STDOUT to a file we'll call out1.

$ perl 01-load.pl UCR-1985-2015.csv > out1

out1 contains this:

Excerpt of ‘out1’

02-load.pl

Wouldn't our data structure be more useful if each record (other than the header) had a name by which we could refer to it? And wouldn't our data structure be still more useful if, within each record, we labelled each field with the column name?

Yes, both features would be useful. For this we want to process the CSV data into a different Perl data structure: a multilevel hash, or hash of hashes. We do that in our next version of the program.

02-load.pl
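
Again, the real program is in the GitHub repository; a sketch of this hash-of-hashes construction might look like this:

use strict;
use warnings;
use Text::CSV;
use Data::Dump qw(dd);

my $file = shift @ARGV or die "Usage: perl 02-load.pl <file.csv>";
my $csv  = Text::CSV->new({ binary => 1, auto_diag => 1 });

open my $fh, '<:encoding(utf8)', $file or die "Unable to open $file: $!";
my $header = $csv->getline($fh);
my %hoh;
while (my $row = $csv->getline($fh)) {
    my %record;
    @record{@$header} = @$row;       # label each field with its column name
    $hoh{ $row->[0] } = \%record;    # key each record on its first column
}
close $fh or die "Unable to close $file: $!";

dd(\%hoh);
printf "%d rows extracted from %s\n", scalar(keys %hoh), $file;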

In 02-load.pl, we assume that we can use the value of the first column in each row — the 'Agency' column from the header row — as the name (or key or unique identifier) for each row.

We invoke this program and redirect its STDOUT to a file called out2.

$ perl 02-load.pl UCR-1985-2015.csv > out2

out2 contains this:

Excerpt of ‘out2’

But wait a minute! 01-load.pl told us that we had:

291 rows in UCR-1985-2015.csv (excluding header row)

But 02-load.pl told us something different:

280 rows extracted from UCR-1985-2015.csv

How can we explain this discrepancy?

03-load.pl

When we built a hash out of the CSV data keyed on 'Agency' we got 11 fewer elements than expected. Perhaps in the CSV file there is more than one row with the same value for 'Agency'. In other words, perhaps the keys are not truly unique! Let's see if that is the case.

In 03-load.pl, we re-read the CSV data, but now for the purpose of constructing a small table which enables us to look up those values for 'Agency' which appear multiple times in the CSV data.

03-load.pl
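
A sketch of that lookup table (the author's real program is in the repository) might simply count how many rows share each 'Agency' value:

use strict;
use warnings;
use Text::CSV;

my $file = 'UCR-1985-2015.csv';      # basename now hard-coded
my $csv  = Text::CSV->new({ binary => 1, auto_diag => 1 });

open my $fh, '<:encoding(utf8)', $file or die "Unable to open $file: $!";
my $header = $csv->getline($fh);
my %seen;
while (my $row = $csv->getline($fh)) {
    $seen{ $row->[0] }++;            # count occurrences of each 'Agency'
}
close $fh or die "Unable to close $file: $!";

# Report only those 'Agency' values which appear more than once.
for my $agency (sort keys %seen) {
    printf "%-40s %d\n", $agency, $seen{$agency} if $seen{$agency} > 1;
}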

From this point forward, we're going to hard-code the basename of the file we're processing directly into the Perl program. We do that because our program is now focused on identifying and correcting the "dirty data" in one particular CSV file, rather than simply loading the content of any given CSV file into a Perl array or hash. So we can invoke the program as:

$ perl 03-load.pl > out3

out3 contains this:

Excerpt of ‘out3’

This report confirms our hypothesis — mostly. For example, it's perfectly plausible to have one record for the Kansas City, Missouri, police department and another record for the Kansas City, Kansas, police department. The best known Peoria is in Illinois — but there's another one in Arizona!

The one obvious ringer in this data is the "Honolulu Police Dept". Surely there cannot be two cities in the United States each named Honolulu!

04-load.pl

In the next version of our program, 04-load.pl, we'll allow for the possibility that a given string can be the 'Agency' in more than one city. We'll print out the agency name and the corresponding city and state — but only where there are two rows with the same agency name.

04-load.pl
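
A sketch of that report follows; note that the 'city' and 'state' column names are assumptions made for this illustration, so consult guardian-gva-shr-data-dictionary.csv for the actual headers:

use strict;
use warnings;
use Text::CSV;

# 'city' and 'state' below are assumed column names; check the data
# dictionary for the real ones.
my $file = 'UCR-1985-2015.csv';
my $csv  = Text::CSV->new({ binary => 1, auto_diag => 1 });

open my $fh, '<:encoding(utf8)', $file or die "Unable to open $file: $!";
my $header = $csv->getline($fh);
my (%seen, @records);
while (my $row = $csv->getline($fh)) {
    my %record;
    @record{@$header} = @$row;
    $seen{ $record{Agency} }++;
    push @records, \%record;
}
close $fh or die "Unable to close $file: $!";

# Print city and state only for agency names appearing in more than one row.
for my $record (@records) {
    printf "%-40s %-20s %s\n", @{$record}{qw(Agency city state)}
        if $seen{ $record{Agency} } > 1;
}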

This is invoked as:

$ perl 04-load.pl > out4

out4 contains this:

Excerpt of ‘out4’

It does indeed appear that we have two records for the Honolulu, Hawaii, police department. We next should ask: Are these two records exact duplicates of one another, or do they have some columns with different data?

At this point the easiest solution is to use the common Unix grep utility to look at the Honolulu lines in the input file.

$ grep -n Honolulu inputs/UCR-1985-2015.csv
Output of ‘grep’ for ‘Honolulu’

There is clearly a difference in the data in the two records starting in the 31st field ('2011_raw_murder_num') in each record. Ideally, we should report this discrepancy to the data provider. In the meantime, for the sake of expediency, we will simply have the second entry overwrite the first.

05-load.pl

05-load.pl represents the final version of the program we'll use to parse the UCR-1985-2015.csv file. In this program we'll create a separate Perl data structure to hold the one record — the first Honolulu record — that we're going to toss out of our dataset. Then we'll create a very simple report which lists all the agencies by descending order of the 2014 murder rate (per 100,000 population) in their cities.

05-load.pl
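
A sketch of that final pass might look like this; the '2014_murder_rate' column name is an assumption patterned on the '2011_raw_murder_num' column we saw above, so again consult the data dictionary for the real name:

use strict;
use warnings;
use Text::CSV;
use Data::Dump qw(dd);

my $file     = 'UCR-1985-2015.csv';
my $rate_col = '2014_murder_rate';   # assumed column name
my $csv      = Text::CSV->new({ binary => 1, auto_diag => 1 });

open my $fh, '<:encoding(utf8)', $file or die "Unable to open $file: $!";
my $header = $csv->getline($fh);
my (%hoh, %discarded);
while (my $row = $csv->getline($fh)) {
    my %record;
    @record{@$header} = @$row;
    my $key = $record{ $header->[0] };
    # Stash any earlier record with this key; the later row (the second
    # Honolulu entry) overwrites the first.
    $discarded{$key} = $hoh{$key} if exists $hoh{$key};
    $hoh{$key} = \%record;
}
close $fh or die "Unable to close $file: $!";

dd(\%discarded);    # the record(s) tossed out of the dataset

# Agencies in descending order of 2014 murder rate per 100,000 population.
for my $key ( sort { ($hoh{$b}{$rate_col} || 0) <=> ($hoh{$a}{$rate_col} || 0) } keys %hoh ) {
    printf "%-40s %6.2f\n", $key, $hoh{$key}{$rate_col} || 0;
}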

This is invoked as:

$ perl 05-load.pl > out5

out5 contains this:

out5 — first excerpt
out5 — second excerpt

Farther down in out5 we have the table listing agencies in the descending order of their 2014 murder rates:

out5 — third excerpt

We know that certain cities have more gun violence than others, so we wouldn't be surprised at the highest ranking cities on this list. But is it really plausible that, say, Wichita, Kansas, would go an entire year without a homicide committed with a gun?

No, it's not plausible. If we look at our report more closely, we see that there are many large cities where the 2014 Murder Rate was listed as zero. Here are some of them:

Implausible data

One can only infer that either some police departments failed to report their 2014 gun violence data to the FBI — the source of the data in UCR-1985-2015.csv — or that between the FBI and the Guardian/GVA there was some tabulation error.

What have we learned?

  1. We've learned how to use common open-source utility programs to identify characteristics of publicly available data files. Those programs include: file; grep; ls; perl; wget; xargs.
  2. We've learned how to write Perl 5 programs which, with the help of Perl extensions available from CPAN, enable us to parse CSV data into data structures that we can then use to explore new relationships among the data.
  3. We've learned that the way publicly available data is structured within files is not necessarily optimal. We may have to construct our own unique identifiers for each record. We may have to detect partial or full duplication of records.
  4. We've learned that incomplete records in publicly available data can distort our understanding of the social realities the data claim to describe.

So kudos to the Guardian and the GVA for making this data available — but a caution as to the limits of the data's usefulness as is.

Next Steps

  1. You are encouraged to write Perl programs similar to the ones presented in this article to parse the CSV data in the other files supplied by the Guardian and GVA.
  2. You are encouraged to take the quizzes, available on GitHub, corresponding to each of the Perl programs presented herein. If you would like a response from the author of this article, please store your quiz results in plain-text files and attach them to an email sent to jkeenan at cpan dot org.
  3. You are encouraged to use publicly available data to understand social and political problems, but you must understand the potential for limitations and discrepancies in that data.
