Country Crime Analysis for Policy and Practice Adjustment

Ty Barker
INST414: Data Science Techniques
8 min read · Dec 18, 2023

Over the years, countries with high crime rates have usually been associated with violent crime and homicide. These countries have struggled to correct the problem because each one essentially has to figure out what works best for itself; there is no universal guideline for ridding every country of crime. Trying many different policies is the main method of discovering what works, but this trial-and-error approach does not appear to be working, as no significant change in rates has been seen across the board.

To assist the legislative bodies and law enforcement of countries that have had a hard time significantly reducing their crime and creating safer environments for their citizens, I am committing this project to finding a clue toward that goal. With my dataset, I attempt to analyze crime data, specifically international homicide data, to see which countries have had the most significant drop in crime and/or simply have the fewest criminal homicides. With that analysis, lawmakers and law enforcement in the worst-off countries can try to apply the policies and laws of the best-off countries, which they could identify with some peripheral research into the policies those countries rely on most to reduce their crime or keep it low (though that research is outside the scope of this project).

Data

The data that I explored comes from the UNODC (United Nations Office on Drugs and Crime), a global leader in the fight against illicit drugs and transnational organized crime. It was established in 1997 and is headquartered in Vienna, Austria, and its mission is to assist Member States in their struggle against illicit drugs, crime, and terrorism.

They provide an abundance of data that coincides with these efforts, and for my project specifically I use the dataset they provide on international homicide rates. I picked this dataset because it was their most robust one, spanning back to 1990, and it also fit the essence of this project to a T.

To start, this dataset contained about 105,000 rows and thirteen columns and was roughly 17,000 KB. The raw data contained several columns whose values were classifiers for the value at the end of each row; for example, the column "age" held ranges of ages like "18–24" or "68+". For the purposes of this project, I wanted the rows that represented the total homicide value for each country and year, so I filtered this and similar columns down to the rows marked "Total" for a given (year, country) pair.
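As a rough sketch of that filtering step (the file name and the breakdown column names like "Age" and "Sex" are my assumptions about the export's layout, not exact):

```python
import pandas as pd

# Load the raw UNODC homicide export; the file name here is hypothetical.
df = pd.read_csv("unodc_homicide.csv")

# Keep only the rows representing the total for each (country, year) pair,
# dropping breakdowns such as the "18-24" or "68+" age ranges.
totals = df[(df["Age"] == "Total") & (df["Sex"] == "Total")].copy()
```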

Beyond this, I filtered the columns down to the ones pertinent to the analysis I was attempting: "Iso3_code", "Country", "Year", "Value", and "Source". I did this mainly so that I could focus on the year and value of each data point. After filtering, I ran into a problem: the entries in the "Value" column, the main focus of this project, were of mixed type. I used a lambda function applied across each row to make sure every value was of type float for ease of future manipulation.
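A minimal sketch of the column filtering and type coercion, continuing from the hypothetical frame above:

```python
# Narrow to the columns pertinent to the analysis.
totals = totals[["Iso3_code", "Country", "Year", "Value", "Source"]]

# "Value" arrived with mixed types, so coerce every entry to float row by
# row; pd.to_numeric would also work here.
totals["Value"] = totals["Value"].apply(lambda v: float(v))
```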

The last manipulation was to create a data frame that would allow for a more in-depth analysis. There were many steps involved, but the most important were getting the min, max, and mean for each country, as well as grouping each country's records into five-year intervals and determining each country's best and worst five-year average homicide value. I then turned this into a data frame in which each country is a single data point, ultimately giving me 205 points of data.
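Roughly, the aggregation looked something like this (the feature names for the five-year statistics are my own placeholders):

```python
# One row of summary features per country: min, max, and mean homicide value.
per_country = totals.groupby("Country")["Value"].agg(["min", "max", "mean"])

# Bin each country's history into five-year intervals, average within each
# bin, and record the best (lowest) and worst (highest) five-year average.
totals["Period"] = (totals["Year"] // 5) * 5
five_year = totals.groupby(["Country", "Period"])["Value"].mean()
per_country["best_5yr"] = five_year.groupby("Country").min()
per_country["worst_5yr"] = five_year.groupby("Country").max()
```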

Key Ideas and Rationale

The main ideas my project is built around are unsupervised learning and preprocessing, which for me came in the form of the KMeans algorithm and PCA dimensionality reduction. I was aiming to build a model that would learn which countries are most alike, in the hope that it would reveal model countries for struggling countries to research, identify which countries are in serious need of help, and even show which countries are most similar to a target country if used to see where your own country stands.

Given that there are many countries to analyze and I wanted to group them based on the features created for each country, I opted for KMeans rather than something like KNN, because I am not trying to classify these data points. And because the data frame I created after cleaning the raw data had more than two features, I used principal component analysis (PCA), which reduced the multiple features down to two for plotting and analysis.
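A sketch of that reduction with scikit-learn, assuming the per-country feature frame from above; standardizing before PCA is my assumption, since it is a common preprocessing step:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the per-country features so no single statistic dominates,
# then project the feature matrix down to two components for plotting.
X = StandardScaler().fit_transform(per_country.values)
X_2d = PCA(n_components=2).fit_transform(X)
```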

Model

For this project, the analysis came mostly through clustering with KMeans. Before anything else, I used principal component analysis, as stated above, to reduce the number of features down to two for graphing. Once that was done, I had to figure out the most optimal value for k; to do this, I computed the silhouette score for a series of KMeans runs but interpreted the resulting plot as if I were using the elbow method.

Using this graph as intended, I am practicing the elbow method: repeated KMeans calculations compared in terms of the sum of squared distances from each point to its assigned centroid. The graph shows a bend, or elbow, in the score, which indicates the value of k (the number of clusters) that would produce the best result from the KMeans algorithm.
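In code, that search over k might look like the following sketch, scoring each candidate with the silhouette score (the range of k values is my assumption):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Run KMeans for a range of candidate k values and record the silhouette
# score for each, so the curve can be read the way an elbow plot would be.
scores = {}
for k in range(2, 13):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_2d)
    scores[k] = silhouette_score(X_2d, labels)
```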

As shown in the graph above, the optimal number of clusters for these data points is k = 8. The end of the graph does show the score dipping a bit lower after 8, but what the elbow method tells us is that, after the steep decrease and plateau at k = 8, the small changes beyond that point are not worth increasing k any further.

Using this value of k = 8, I then clustered the data to find the insight I was looking to gain when I started this project. The KMeans clustering is shown in the graph below, which depicts distinct, well-scaled groups with little room for outliers.
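A sketch of the final clustering and plot, reusing the two PCA components from earlier:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit the final model with k = 8 and record each country's cluster label.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)
per_country["cluster"] = kmeans.fit_predict(X_2d)

# Plot the two PCA components, coloring each country by its cluster.
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=per_country["cluster"], cmap="tab10")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("KMeans clustering of countries (k = 8)")
plt.show()
```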

Stakeholder Support

With this model, I was able to identify one cluster that, no matter how many times the KMeans algorithm is run, always looks like an outlier. Even with a relatively small number of clusters, this data point never finds its way into a group with other countries. The country is El Salvador, and the stats below show how outrageously high its average rate of recorded homicides is over the range 1990–2023.

This number is nearly double the rate of the next-highest group average and several times the average of the other groups. To me, this is an indication of the country in the worst situation according to the available data. It is countries like this that might be able to help themselves by looking at the policies and laws of countries in the cluster with the very low overall average homicide rate, where every point is small, like those in the following third cluster:

Looking at this, you can see, for example, a country like Poland in this cluster, which I have dubbed the model-country (for a low homicide rate) cluster. No matter how many samples I took from this cluster, the average never rose above 5, and none of the countries in it ever had a high spike in its homicide rate. These countries are effectively the insight I was attempting to glean from analyzing this dataset.
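As a rough illustration, the model-country cluster could be pulled out of the clustered frame like this (continuing the placeholder names from earlier):

```python
# Find the cluster with the lowest average homicide mean and inspect it.
cluster_avgs = per_country.groupby("cluster")["mean"].mean().sort_values()
model_cluster = cluster_avgs.index[0]
model_countries = per_country[per_country["cluster"] == model_cluster]
print(model_countries.index.tolist())
print(model_countries["mean"].describe())
```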

Should a country want to escape an increasingly bad situation that might otherwise lead it to join El Salvador's cluster (for instance, the countries in the 2nd and 7th clusters pictured below), it should take an example from, or even do in-depth research on, the policies of sample countries in the clusters that have vastly better homicide rates than its own.

In the case of a country like El Salvador, doing in-depth research on the policies of those same 2nd and 7th clusters could also be very beneficial, because they may have the most similar situation to its own and thus the best policies for it to make use of.

Conclusions and Limitations

For me, this project turned out to be a success, but there are some things I wish had gone better. To start, I struggled to find data that would support this project, which resulted in my using what I think was a slightly limited dataset. Even so, I believe that with my preparation the data turned out better than I initially thought it would.

The biggest limitation of this project was my failure to also obtain the drug data from the UNODC website, mainly because I was unable to figure out how to operate their API, as I discussed with my professor, Cody Buntain. I think that including the drug crime dataset would have led to better insight, but their website only allows exporting that data back to 2017, which did not fit the way I wanted to analyze the data, since it left out a lot of time and a lot of countries.

I also wish I had had the time to do the practical research side of this project and show the policies that could be taken from the model countries in the clusters with lower average homicide rates. To conclude, I believe that, though this is an elementary demonstration of practical data science skills, this project can properly be used to extract meaningful insight for the lawmakers of a country that needs it.

GitHub Link
