Crime Data Pattern Analysis and Visualization using K-means Clustering

Published in Analytics Vidhya · 5 min read · Jan 18, 2021

Crime pattern analysis uncovers the underlying interactive process between crime events by discovering where, when, and why particular crimes are likely to occur. The outcomes improve our understanding of the dynamics of unlawful activities and can enhance predictive policing.

For more on K-means Clustering: Everything you need to know about K-Means Clustering

Download the required data with wget (the leading ! runs the command from a notebook cell):

!wget https://raw.githubusercontent.com/namanvashistha/doctor_strange/master/crime.csv

Import libraries:

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

Read and Display data:

data = pd.read_csv("crime.csv")
data

K-means from Scratch:

np.random.seed(42)

def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2)**2))

class KMeans():

    def __init__(self, K=5, max_iters=100, plot_steps=False):
        self.K = K
        self.max_iters = max_iters
        self.plot_steps = plot_steps

        # list of sample indices for each cluster
        self.clusters = [[] for _ in range(self.K)]
        # the centers (mean feature vector) for each cluster
        self.centroids = []

    def predict(self, X):
        self.X = X
        self.n_samples, self.n_features = X.shape
        
        # initialize 
        random_sample_idxs = np.random.choice(self.n_samples, self.K, replace=False)
        self.centroids = [self.X[idx] for idx in random_sample_idxs]

        # Optimize clusters
        for _ in range(self.max_iters):
            # Assign samples to closest centroids (create clusters)
            self.clusters = self._create_clusters(self.centroids)
            if self.plot_steps:
                self.plot()

            # Calculate new centroids from the clusters
            centroids_old = self.centroids
            self.centroids = self._get_centroids(self.clusters)
            
            # check if clusters have changed
            if self._is_converged(centroids_old, self.centroids):
                break

            if self.plot_steps:
                self.plot()

        # Classify samples as the index of their clusters
        return self._get_cluster_labels(self.clusters)


    def _get_cluster_labels(self, clusters):
        # each sample will get the label of the cluster it was assigned to
        labels = np.empty(self.n_samples)

        for cluster_idx, cluster in enumerate(clusters):
            for sample_index in cluster:
                labels[sample_index] = cluster_idx
        return labels

    def _create_clusters(self, centroids):
        # Assign the samples to the closest centroids to create clusters
        clusters = [[] for _ in range(self.K)]
        for idx, sample in enumerate(self.X):
            centroid_idx = self._closest_centroid(sample, centroids)
            clusters[centroid_idx].append(idx)
        return clusters

    def _closest_centroid(self, sample, centroids):
        # distance of the current sample to each centroid
        distances = [euclidean_distance(sample, point) for point in centroids]
        closest_index = np.argmin(distances)
        return closest_index

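    def _get_centroids(self, clusters):
        # assign mean value of clusters to centroids
        centroids = np.zeros((self.K, self.n_features))
        for cluster_idx, cluster in enumerate(clusters):
            if len(cluster) == 0:
                # an empty cluster has no mean (np.mean would return NaN), so keep its previous centroid
                centroids[cluster_idx] = self.centroids[cluster_idx]
                continue
            cluster_mean = np.mean(self.X[cluster], axis=0)
            centroids[cluster_idx] = cluster_mean
        return centroids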

    def _is_converged(self, centroids_old, centroids):
        # distances between each old and new centroid, for all centroids
        distances = [euclidean_distance(centroids_old[i], centroids[i]) for i in range(self.K)]
        return sum(distances) == 0

    def plot(self):
        fig, ax = plt.subplots(figsize=(12, 8))

        for i, index in enumerate(self.clusters):
            point = self.X[index].T
            ax.scatter(*point)

        for point in self.centroids:
            ax.scatter(*point, marker="x", color='black', linewidth=2)

        plt.show()

    def cent(self):
        return self.centroids
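To see the assign-and-update loop in isolation, here is a minimal vectorized sketch on made-up 2-D points. The helper name kmeans_step and the toy data are assumptions for illustration only; this is not the class above, just the same two steps written with NumPy broadcasting.

```python
import numpy as np

def kmeans_step(X, centroids):
    """One K-means iteration: assign samples to the nearest centroid, then recompute the means."""
    # pairwise Euclidean distances, shape (n_samples, K)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    new_centroids = centroids.copy()
    for k in range(len(centroids)):
        members = X[labels == k]
        if len(members) > 0:  # leave the centroid in place if its cluster is empty
            new_centroids[k] = members.mean(axis=0)
    return labels, new_centroids

# toy data: two well-separated groups (hypothetical values)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids = X_toy[[0, 3]].copy()  # start from one point in each group

for _ in range(10):
    labels, new_centroids = kmeans_step(X_toy, centroids)
    if np.allclose(new_centroids, centroids):
        break  # centroids stopped moving
    centroids = new_centroids

print(labels)
print(centroids)
```

The loop stops as soon as the centroids stop moving, which is the same convergence idea as _is_converged in the class.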

Checking the unique values of the Magnitude column:

data.Magnitude.unique()

Out:
array(['4', '6', '16', '12', '8', '10', '2', 'ARSON', '14'], dtype=object)

The column contains one non-numeric value: 'ARSON'.

Arson is the crime of willfully and maliciously setting fire to or charring property. Source: Wikipedia
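Reading the array by eye works for nine unique values. For a longer column, one possible way to find the non-numeric entries is pd.to_numeric with errors='coerce'. The Series below is a hypothetical stand-in for the Magnitude column, not the real data.

```python
import pandas as pd

# hypothetical stand-in for the Magnitude column
magnitude = pd.Series(['4', '6', 'ARSON', '12', '6'])

# entries that cannot be parsed as numbers become NaN
as_numbers = pd.to_numeric(magnitude, errors='coerce')

# keep the original strings wherever parsing failed
non_numeric = magnitude[as_numbers.isna()].unique()
print(non_numeric)
```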

We shall consider only Latitude and Longitude for plotting:

plt.scatter(data.Latitude,data.Longitude)
plt.xlabel('Latitude')
plt.ylabel('Longitude')

Checking for NaN values:

data.isna().sum()

Out:
Date         0
Latitude     0
Longitude    0
Magnitude    0
dtype: int64

Building the feature matrix X from Latitude and Longitude:

X = data[['Latitude', 'Longitude']]
X = np.array(X)

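Importing scikit-learn's built-in KMeans to compute the WSS (within-cluster sum of squares). It is imported under an alias so that it does not overwrite the KMeans class we defined above:

from sklearn.cluster import KMeans as SklearnKMeans
import plotly.graph_objects as go

Computing WSS for K values from 1 to 49:

wss = []
K = []
k_rng = range(1, 50)
for k in k_rng:
    km = SklearnKMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    wss.append(km.inertia_)  # inertia_ is the WSS of the fitted model
    K.append(k)

# draw everything on one Axes so the labels, ticks and curve share the same plot
fig, axes = plt.subplots()
axes.plot(K, wss)
axes.set_xlabel('K Values')
axes.set_ylabel('WSS')
axes.set_xticks(K)
axes.grid()
plt.show()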
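For reference, WSS is the sum of squared distances from each sample to the centroid of its assigned cluster, which is what scikit-learn reports as inertia_. A minimal sketch of that computation follows; the function name within_cluster_ss and the toy arrays are assumptions for illustration.

```python
import numpy as np

def within_cluster_ss(X, labels, centroids):
    """Sum of squared Euclidean distances from each sample to its assigned centroid."""
    diffs = X - centroids[labels]  # centroids[labels] picks the centroid for every sample
    return float(np.sum(diffs ** 2))

# toy data: two clusters on a line (hypothetical values)
X_toy = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0, 0.0], [11.0, 0.0]])

wss_toy = within_cluster_ss(X_toy, labels, centroids)
print(wss_toy)  # every point is 1 unit from its centroid, so the total is 4.0
```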

According to the Elbow Technique, the optimal value of K is 5, so we shall implement KMeans for the same value of K:

k = KMeans(K=5, max_iters=150, plot_steps=True)
y_pred = k.predict(X)
k.plot()
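The elbow above was read off the plot by eye. If you would like a programmatic cross-check, one common heuristic is to pick the point on the normalised WSS curve that lies farthest from the straight line joining its two endpoints. The sketch below is an assumption-laden illustration on made-up WSS values, not the method used in this article, and other heuristics can give different answers.

```python
import numpy as np

def elbow_by_chord_distance(k_values, wss_values):
    """Return the K whose (K, WSS) point lies farthest from the chord joining the curve's endpoints."""
    k = np.asarray(k_values, dtype=float)
    w = np.asarray(wss_values, dtype=float)
    # normalise both axes to [0, 1] so that neither dominates the distance
    x = (k - k.min()) / (k.max() - k.min())
    y = (w - w.min()) / (w.max() - w.min())
    x0, y0, x1, y1 = x[0], y[0], x[-1], y[-1]
    # perpendicular distance of each point to the line through (x0, y0) and (x1, y1)
    numerator = np.abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)
    distance = numerator / np.hypot(y1 - y0, x1 - x0)
    return int(k[int(np.argmax(distance))])

# made-up WSS values that flatten out after K=5
toy_k = list(range(1, 9))
toy_wss = [100, 70, 45, 25, 10, 9, 8, 7]
best_k = elbow_by_chord_distance(toy_k, toy_wss)
print(best_k)
```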

Now we shall replace the non-numeric value 'ARSON' with the average of the other unique Magnitude values:

arr = data.Magnitude.unique()
arr = arr[arr != 'ARSON']  # drop the non-numeric entry by value instead of a hard-coded index
arr = arr.astype(int)
avg = np.average(arr)
avg

Out:
9.0

data.Magnitude = data.Magnitude.replace(to_replace="ARSON", value="9")
data.Magnitude.unique()

Out:
array(['4', '6', '16', '12', '8', '10', '2', '9', '14'], dtype=object)

data['Magnitude'] = data['Magnitude'].astype(int)
data.Magnitude.unique()

Out:
array([ 4,  6, 16, 12,  8, 10,  2,  9, 14])

Adding Magnitude to X as a third feature:

X = data[['Latitude', 'Longitude', 'Magnitude']]
X = np.array(X)
X
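One thing worth keeping in mind here: K-means relies on Euclidean distance, so a feature with a wide numeric range (Magnitude runs from 2 to 16) can outweigh features that vary only slightly. This article clusters the raw values. As an optional variation, not the approach used here, a column-wise z-score puts the three features on a comparable scale. The function name standardize and the sample rows are hypothetical.

```python
import numpy as np

def standardize(X):
    """Column-wise z-score: subtract each column's mean and divide by its standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (X - mean) / std

# hypothetical rows: latitude, longitude, magnitude
X_toy = np.array([[38.55, -121.39, 4],
                  [38.47, -121.49, 16],
                  [38.65, -121.45, 8]])
X_scaled = standardize(X_toy)
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

Whether scaling helps depends on how much weight Magnitude should carry relative to location, so treat it as something to experiment with rather than a required step.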

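# scikit-learn's KMeans, aliased so that it does not shadow the KMeans class defined above
from sklearn.cluster import KMeans as SklearnKMeans

wss = []
K = []
k_rng = range(1, 50)
for k in k_rng:
    km = SklearnKMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    wss.append(km.inertia_)
    K.append(k)

# draw everything on one Axes so the labels, ticks and curve share the same plot
fig, axes = plt.subplots()
axes.plot(K, wss)
axes.set_xlabel('K Values')
axes.set_ylabel('WSS')
axes.set_xticks(K)
axes.grid()
plt.show()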

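Looks like K=5 is again our optimal K-value. Note that plot() passes the third column to ax.scatter as the marker size, so Magnitude shows up as the size of each point: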

k = KMeans(K=5, max_iters=150, plot_steps=True)
y_pred = k.predict(X)
k.plot()
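Once every sample has a label, a natural next step is to summarise each cluster. The sketch below assumes a small made-up DataFrame and label array in place of data and y_pred. The Cluster column name and the summary layout are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# made-up stand-ins for the crime DataFrame and the labels returned by predict()
toy = pd.DataFrame({
    'Latitude':  [1.0, 1.2, 5.0, 5.2],
    'Longitude': [2.0, 2.2, 6.0, 6.2],
    'Magnitude': [4, 6, 12, 16],
})
toy_labels = np.array([0.0, 0.0, 1.0, 1.0])  # predict() returns labels as floats

# attach the labels, then aggregate per cluster
toy['Cluster'] = toy_labels.astype(int)
summary = toy.groupby('Cluster').agg(
    count=('Magnitude', 'size'),
    mean_magnitude=('Magnitude', 'mean'),
    mean_latitude=('Latitude', 'mean'),
    mean_longitude=('Longitude', 'mean'),
)
print(summary)
```

With the real data, the same pattern would show how many incidents fall in each cluster and where each cluster is centred.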

I know that’s a lot to take in at once! But you made it until the end! Kudos on that! Do not forget to check out my upcoming articles!

Additional Resources and References

Everything you need to know about K-Means Clustering

You’re at the right place if you’re wondering what K-means Clustering is all about! Let’s quickly get started without…

medium.com

Extracting Dominant Colours in an Image using K-means Clustering from Scratch

Extract the dominant colours from any image of your choice in less than 5 minutes from scratch!

medium.com

Image Segmentation using K-means Clustering from Scratch

In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into…

medium.com

For complete code implementation:

tanvipenumudy/Winter-Internship-Internity

Repository to keep track of work assigned on a daily basis - tanvipenumudy/Winter-Internship-Internity

github.com

Hope you enjoyed and made the most out of this article! Stay tuned for my upcoming blogs! Make sure to CLAP and FOLLOW if you find my content helpful/informative!

Written by Tanvi Penumudy