Text Mining in R: Death Row Prior Occupations by David Lettier
Found in a list of 100+ interesting data sets, the Texas Department of Criminal Justice provides different collections of information on death row inmates both present and past. Prior occupations for each current offender are listed here.
We’ll need some packages to scrape, cluster, and chart our findings.
- rvest: “Wrappers around the ‘xml2’ and ‘httr’ packages to make it easy to download, then manipulate, HTML and XML.”
- wordcloud: “Pretty word clouds.”
- tm: “A framework for text mining applications within R.”
- apcluster: “The ‘apcluster’ package implements Frey’s and Dueck’s Affinity Propagation clustering in R.”
The data we wish to look at is not in the form we need. We’ll need to capture the HTML and parse it, looking for the prior occupation information. Using rvest, we will grep for all <p> tags that contain the occupation information. Not every page is consistent, so we will need some logic to hunt around for the prior occupation field. Once we’ve parsed every page, all of the raw information will be saved to a CSV file for later pre-processing. Each line of the CSV corresponds to one death row inmate’s collected prior occupations.
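The post does this scraping with rvest in R. As an illustration of the same hunt-for-the-field logic, here is a minimal Python sketch; the helper name, sample HTML, and label text are assumptions, not the actual TDCJ markup:

```python
import re

def prior_occupations(html):
    """Scan every <p> tag and return the text following a
    'Prior Occupation' label, or None if the field is absent.
    The label text and markup here are assumed; real pages vary."""
    for p in re.findall(r"<p[^>]*>(.*?)</p>", html, re.S):
        text = re.sub(r"<[^>]+>", " ", p)  # strip any inner tags
        if "prior occupation" in text.lower():
            # keep everything after the label, trimming ': ' debris
            return text.lower().split("occupation", 1)[1].strip(" :\t\n")
    return None

sample = "<p>Name: John Doe</p><p><b>Prior Occupation</b>: laborer, welder</p>"
```

A page without the field simply yields `None`, which is where the extra hunting logic would kick in.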
With the CSV file opened, we need to clean each entry for unwanted white-space and punctuation. For those with multiple prior occupations, we’ll split on the commas. Once cleaned, we’ll add each prior occupation per inmate to the inmates_occupations list.
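A sketch of that cleaning step in Python (the exact regular expressions the post used are assumptions):

```python
import re

def clean_entry(raw):
    """Split a raw CSV entry on commas, then strip stray
    white-space and punctuation from each prior occupation."""
    parts = [p.strip().lower() for p in raw.split(",")]
    parts = [re.sub(r"[^\w\s-]", "", p).strip() for p in parts]
    return [p for p in parts if p]

inmates_occupations = [clean_entry(row)
                       for row in ["Laborer, Auto Mechanic.", "  none "]]
```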
Looking over the dataset, you’ll notice some redundancies such as mechanic versus auto mechanic. With care, we will condense the prior occupations down to the sub-strings shared by the redundant entries. We’ll also need to hard-code some corrections for irregularities such as fork lift versus forklift. Once the duplicates have been merged, we’ll generate a unique list of the prior occupations for our axis labels.
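The merge can be sketched as substring matching against a canonical list plus a table of hard-coded fixes; the actual lists used in the post are not shown, so these entries are illustrative assumptions:

```python
# Hand-picked canonical sub-strings and hard-coded fixes; the
# real merge lists used in the post are assumptions here.
CANONICAL = ["laborer", "mechanic", "construction"]
FIXES = {"fork lift": "forklift"}

def condense(occupation):
    occupation = FIXES.get(occupation, occupation)
    for canon in CANONICAL:
        if canon in occupation:  # e.g. 'auto mechanic' -> 'mechanic'
            return canon
    return occupation

unique_occupations = sorted({condense(o) for o in
    ["general laborer", "auto mechanic", "fork lift", "welder"]})
```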
We have collected, cleaned, and removed erroneous variations in our dataset. Now we’ll move on to mining and plotting/charting our data.
Total Prior Occupation Relative Frequency Distribution
Let us begin by charting the relative frequency of prior occupations listed across all sampled inmates.
The unique prior occupations are in alphabetical order along the x-axis. The y-axis is the relative frequency distribution. laborer accounts for nearly 41% of the 291 listed prior occupations among the 182 death row inmates sampled. Contributing to its large percentage was the merging of duplicates, for example, general laborer and assembly line laborer into just laborer. One could argue that they are distinct labels. construction and none are the second and third largest respectively.
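The relative frequency computation itself is a simple count-and-normalize; a Python sketch on toy data:

```python
from collections import Counter

def relative_frequencies(occupations):
    """Relative frequency of each prior occupation across all
    listings (a flat list, one entry per listed occupation)."""
    counts = Counter(occupations)
    total = sum(counts.values())
    return {occ: counts[occ] / total for occ in sorted(counts)}

freqs = relative_frequencies(["laborer", "laborer", "construction", "none"])
```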
We can explore the dataset further by clustering the sampled inmates by the prior occupations they had. We’ll treat each inmate as a document with their prior occupations making up the document. By clustering the inmates via their prior occupations, we can partition the dataset into different prior occupation profiles.
Inmate Prior Occupations Matrix
To begin clustering, we will need to vectorize each inmate’s prior occupations. We’ll assemble these vectors into the inmate_matrix_count, where [i][j] is >= 1 if the i-th inmate had the j-th occupation and zero otherwise. No inmate had the same prior occupation listed more than once, so no count will be greater than one.
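Building that matrix amounts to a membership test per occupation; a minimal Python sketch:

```python
def build_count_matrix(inmates, vocab):
    """Binary inmate-by-occupation matrix: entry [i][j] is 1 if
    the i-th inmate listed the j-th occupation, else 0."""
    return [[1 if occ in occs else 0 for occ in vocab]
            for occs in inmates]

vocab = ["construction", "laborer", "welder"]
inmate_matrix_count = build_count_matrix(
    [["laborer", "welder"], ["construction"]], vocab)
```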
Prior Occupation Weighting (TF-IDF)
Since we are dealing with text, and since nearly half of all sampled inmates were a laborer at some point, we’ll apply a term-weighting technique known as TF-IDF. Given an inmate who was both a laborer and a welder at some point, TF-IDF will weight laborer less, since welder appears less frequently in the text corpus. Knowing that an inmate was a laborer does not say as much as knowing they were also a welder.
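The post uses tm’s TF-IDF weighting; here is a sketch of the standard tf × log(N/df) formulation in Python (tm’s exact log base and normalization may differ):

```python
import math

def tfidf(docs):
    """tf * log(N / df) per term per document. With binary term
    counts, a term that appears in every document weighs zero."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return [{t: doc.count(t) * math.log(n_docs / df[t]) for t in set(doc)}
            for doc in docs]

weights = tfidf([["laborer", "welder"], ["laborer"], ["laborer", "mechanic"]])
```

In the first toy document, laborer (present in every document) drops to zero while welder keeps a positive weight, matching the intuition above.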
Multidimensional Scaling (MDS)
To visualize the inmate vectors in two dimensions we’ll employ multidimensional scaling.
An MDS algorithm aims to place each object in N-dimensional space such that the between-object distances are preserved as well as possible. — Multidimensional Scaling, Wikipedia, the free encyclopedia
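Classical MDS recovers coordinates by double-centering the squared distance matrix and eigendecomposing the result. The eigendecomposition is omitted here, but the centering step can be sketched in Python on toy 1-D distances (not the inmate data):

```python
def double_center(D):
    """B = -1/2 * J D^2 J, the Gram matrix that classical MDS
    eigendecomposes to recover low-dimensional coordinates."""
    n = len(D)
    D2 = [[d * d for d in row] for row in D]
    row_m = [sum(row) / n for row in D2]
    col_m = [sum(D2[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_m) / n
    return [[-0.5 * (D2[i][j] - row_m[i] - col_m[j] + grand)
             for j in range(n)] for i in range(n)]

# Distances between the 1-D points 0, 1, 3; B should equal the
# outer product of the mean-centred coordinates (-4/3, -1/3, 5/3).
B = double_center([[0, 1, 3], [1, 0, 2], [3, 2, 0]])
```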
We can see a large mass around the origin. Another distinct mass is seen at (-0.8951273, -0.702121).
Affinity Propagation (AP) Clustering
Due to the concave shapes and variable density of the MDS scatter plot, we’ll employ Affinity Propagation Clustering. This has the added benefit of not having to specify the number of clusters (K) ahead of time.
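Frey and Dueck’s algorithm passes “responsibility” and “availability” messages between points until a set of exemplars emerges, which is why K never needs to be fixed up front. A compact, damped Python sketch on toy 1-D data (the preference value is hand-picked; the post itself uses the apcluster R package):

```python
def affinity_propagation(S, preference, damping=0.9, iters=200):
    """Toy Affinity Propagation. S is an n x n similarity matrix
    (larger = more similar); the diagonal is set to `preference`,
    which controls how readily points become exemplars."""
    n = len(S)
    S = [row[:] for row in S]
    for i in range(n):
        S[i][i] = preference
    R = [[0.0] * n for _ in range(n)]  # responsibilities
    A = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(iters):
        for i in range(n):
            vals = [A[i][k] + S[i][k] for k in range(n)]
            m1 = max(vals)
            k1 = vals.index(m1)
            m2 = max(v for k, v in enumerate(vals) if k != k1)
            for k in range(n):
                best = m2 if k == k1 else m1
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best)
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(pos)
            for i in range(n):
                if i == k:
                    new = total - pos[k]
                else:
                    new = min(0.0, R[k][k] + total - pos[i] - pos[k])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    labels = [max(exemplars, key=lambda k: S[i][k]) for i in range(n)]
    return exemplars, labels

points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
S = [[-(a - b) ** 2 for b in points] for a in points]
exemplars, labels = affinity_propagation(S, preference=-20.0)
```

On these two well-separated groups the message passing settles on one exemplar per group.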
With the inmate_matrix_tfidf_mds clustered, we can now plot the clusters.
In total, the algorithm generated 17 clusters. Cluster membership sizes range from one to 92 inmates.
Inmate Clusters, Prior Occupations Relative Frequency Distributions
Now let us generate a relative frequency distribution bar-chart for each TF-IDF MDS AP cluster.
Cluster 1, seen at (-0.8951273, -0.702121), consists solely of none, which agrees with the dataset. Intuitively, if an inmate had no prior occupations, you wouldn’t expect to see other prior occupation terms clustered with none.
The following bar-charts are the clusters making up the large mass centered around the origin.
Cluster 5 accounts for the inmates that were mainly laborers, with only a few having one other prior occupation. This cluster also contains a large portion of the unique prior occupations found in the raw dataset such as computer software.
For cluster 10, there is a large mixture of both laborer and warehouse.
Cluster 17 is more difficult to interpret and could likely have been clustered with cluster 5; however, it is the only cluster that accounts for the inmates that had unknown prior occupations. unknown did not collocate with any other prior occupation, so it is surprising to find it clustered with other prior occupations. This is likely due to the MDS.
Automatic Cluster Labels
- Cluster 1: none
- Cluster 2: mechanic
- Cluster 3: mechanic
- Cluster 4: construction
- Cluster 5: laborer
- Cluster 6: cook
- Cluster 7: welder
- Cluster 8: welder
- Cluster 9: ac
- Cluster 10: warehouse
- Cluster 11: construction
- Cluster 12: clerk
- Cluster 13: shipping
- Cluster 14: clerk
- Cluster 15: welder
- Cluster 16: mechanic
- Cluster 17: unknown
Condensing these even further:
- none: cluster 1
- mechanic: cluster 2, 3, 16
- construction: cluster 4, 11
- laborer: cluster 5
- cook: cluster 6
- welder: cluster 7, 8, 15
- ac: cluster 9
- warehouse: cluster 10
- clerk: cluster 12, 14
- shipping: cluster 13
- unknown: cluster 17
This effectively condenses the original 60 prior occupations down to the 11 most prominent in the sampled data. These 11 prior occupations could all be described as blue-collar occupations with the exception of clerk.
TF-IDF AP Clustering Only
Comparing against the TF-IDF MDS AP clusters: using only the inmate_matrix_tfidf, the Affinity Propagation clustering algorithm generated 48 clusters from the original 60 dimensions, versus the 17 clusters generated from the two dimensions found by MDS.
These 48 clusters are more granular with a clearer partitioning of inmates with unique prior occupations. 30 out of the 48 clusters contain only one inmate due to their unique combination of prior occupations. Cluster membership ranges from one to 71. Like with the MDS clusters, none is its own cluster. However, unlike the MDS clusters, all of the unknown prior occupations are clustered together alone.
Latent Semantic Indexing (LSI), Singular Value Decomposition (SVD)
We can take the TF-IDF normalized inmate-prior-occupation matrix and perform SVD on it.
LSI can be viewed as soft clustering by interpreting each dimension of the reduced space as a cluster and the value that a document has on that dimension as its fractional membership in that cluster. — Introduction to Information Retrieval
This will give us three matrices U, D, and V. U contains our inmate-concept vectors in the new orthogonal basis.
We’ll need to choose a K <= R, where R is the rank of the matrix; here R = 60. At K = 38, 90% of the variability of the original matrix is explained.
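One common convention measures “variability explained” by the squared singular values; whether the post used exactly this convention is an assumption. A Python sketch of choosing K that way:

```python
def choose_k(singular_values, threshold=0.90):
    """Smallest K whose leading squared singular values account
    for at least `threshold` of the total variability."""
    total = sum(d * d for d in singular_values)
    cumulative = 0.0
    for k, d in enumerate(singular_values, start=1):
        cumulative += d * d
        if cumulative >= threshold * total:
            return k
    return len(singular_values)

k = choose_k([3.0, 1.0])
```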
Now let us cluster and plot the prior occupation relative frequency distribution for each resulting cluster.
After clustering, 40 clusters were found. Contrast this with the 48 found by TF-IDF alone, which used 60 features instead of the 38 used after applying SVD. Cluster membership ranges from one to 71.
Below are the cluster labels found by PMI:
- Cluster 1: computer operator
- Cluster 2: none
- Cluster 3: kitchen worker
- Cluster 4: computer programmer
- Cluster 5: construction
- Cluster 6: warehouse
- Cluster 7: landscaper
- Cluster 8: cabinet maker
- Cluster 9: sales
- Cluster 10: pipe fitter
- Cluster 11: laborer
- Cluster 12: barber
- Cluster 13: machine operator
- Cluster 14: truck driver
- Cluster 15: oil field
- Cluster 16: food service
- Cluster 17: roofer
- Cluster 18: fabricator
- Cluster 19: cook
- Cluster 20: computer technician
- Cluster 21: janitor
- Cluster 22: wrecker driver
- Cluster 23: painter
- Cluster 24: hydro-water blaster
- Cluster 25: ranch hand
- Cluster 26: heavy equipment operator
- Cluster 27: iron worker
- Cluster 28: shipping
- Cluster 29: computer software
- Cluster 30: clerk
- Cluster 31: plumber’s helper
- Cluster 32: welder
- Cluster 33: mechanic
- Cluster 34: carpenter
- Cluster 35: forklift
- Cluster 36: heating
- Cluster 37: unknown
- Cluster 38: food service
- Cluster 39: jewelry designer
- Cluster 40: ac
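The labeling approach scores each term by its pointwise mutual information with the cluster and picks the top-scoring term. A sketch under the standard PMI definition (the post’s exact scoring code is not shown):

```python
import math
from collections import Counter

def pmi_labels(clusters):
    """Label each cluster with the term whose pointwise mutual
    information with the cluster, log(p(t|c) / p(t)), is highest."""
    overall = Counter(t for cluster in clusters for t in cluster)
    n_total = sum(overall.values())
    labels = []
    for cluster in clusters:
        counts = Counter(cluster)
        n_c = sum(counts.values())
        labels.append(max(counts, key=lambda t:
            math.log((counts[t] / n_c) / (overall[t] / n_total))))
    return labels

labels = pmi_labels([["laborer", "laborer", "welder"],
                     ["laborer", "cook", "cook"]])
```

Note how PMI favors terms that are over-represented in a cluster relative to the whole corpus, so ubiquitous terms like laborer rarely win the label.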
We collected, parsed, and mined the prior occupations of current death row inmates. Interesting patterns discovered were the large proportion of blue-collar occupations (most notably laborer) and rarer prior occupations such as computer operator and jewelry designer. An interesting hypothesis to test would be whether being on death row correlates with having been a laborer. The SVD computation could be used for information retrieval, allowing one to search for similar current inmates by some prior occupation query. Further analysis could include plotting the number of each prior occupation seen per year.
Full SVD AP Clustering
Full Source Code
Originally published at lettier.github.io on February 17, 2016.
(C) David Lettier 2016.