AI Challenge Partner: SafeCity India

Can Artificial Intelligence Help to Prevent Sexual Harassment?

Using convolutional neural networks and LSTM to predict places at a high risk of sexual harassment incidents.

--

Dandelion — Jardim Botânico in Curitiba

It is hard not to be moved the first time you visit Safecity’s website and watch the opening video with testimonials from women who were raped in India.

You will find the statistics about crimes against women on the official police department website.

The question that arises for a woman walking in the streets, sitting in a quiet place for a cup of coffee, or getting on the bus to go home is,

Is this place safe?

To answer this question, Omdena and SafeCity India organized a two-month AI challenge where I was one of 30 collaborators building an AI solution.

Part of the challenge proposed to us was “Clustering geography into heatmaps of safe and unsafe”. Simply put: given past information about crimes against women that happened on a certain date in a certain place, how can we predict which places have a high chance of an incident happening and which places are safe?

Among all the reports available from a variety of cities, Delhi and Mumbai, as seen in figure 1, were chosen as the selected cities since the majority of incidents come from there.

Figure 1 — Delhi and Mumbai: the selected cities for the challenge.

Problem definition

According to the Oxford dictionary, a heatmap is “a representation of data in the form of a map or diagram in which data values are represented as colors”. One of the most effective ways to express a heatmap that can be used in mathematical models is through matrices, where each cell represents a square portion of space in a given distance measurement system and the colors represent the intensity of the studied event in each mapped cell.

In our case, this intensity represents the number of crimes women suffered in that area at a specific time.

In figure 2 we can see an example of a heatmap predicted by a machine learning model, plotted on a grid divided into 30 by 30 cells, where each cell has a color representing the crime intensity in Delhi on August 13. The more reddish the cell, the riskier the place. The black dots represent the actual crimes reported through the Safecity application on that date.

Now, how can we use past aggregated maps to gain future insights about a specific region?

Time for machine learning. Heatmap prediction is one of the many fields studied by people who want to predict human behavior over time.

Figure 2 — Heatmap of crime intensity against women in Delhi

Heatmap prediction is a three-dimensional problem, also called a spatio-temporal problem, as seen in figure 3. It involves a spatial component, since a heatmap in our case is a two-dimensional matrix, and a temporal component that varies over time and depends on the granularity we decide to observe it at. This granularity, which aggregates the events in each cell, can be expressed in hours, days, months, and so on.

Figure 3 — Spatial and Temporal Dimensions of a Heatmap
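To make this structure concrete, here is a minimal sketch (my own illustration, not the project’s code) of how such a stack of heatmaps can be held in a single three-dimensional array:

```python
import numpy as np

# Spatio-temporal data as a 3D array of shape (time_steps, grid_rows, grid_cols):
# each slice along the first axis is the heatmap for one time step.
days, rows, cols = 365, 32, 32               # e.g. one year of daily 32x32 heatmaps
heatmaps = np.zeros((days, rows, cols), dtype=np.float32)

# Registering one incident in cell (5, 20) on day 10:
heatmaps[10, 5, 20] += 1
```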

Selecting the right model

Being passionate about artificial neural networks, I decided to search for articles on the use of deep learning for crime forecasting that generates heatmaps.

The first paper found for the study brought the use of Spatio-Temporal Residual Networks (ST-ResNet) for crime forecasting. Wang et al. used this technique as a regression problem to predict the number of crimes hourly on a grid-divided map over a region of Los Angeles, where their great contribution was to apply a spatial and temporal regularization technique to the dataset.

This ANN model aggregates heatmaps according to the concepts of trend, period, and closeness, where each of these terms defines a temporal distance between heatmaps and has its own branch inside the model that learns features related to that distance, as we can see in figure 4. Residual convolutional units are used to learn the spatial dimension of each of those maps.

Figure 4 — Structure of ST-ResNet

A valuable lesson learned — Fail quickly and move forward

Grabbing a dataset available on Kaggle containing data about crimes in Los Angeles, I tried for two weeks to replicate their study with all the techniques they recommended, without getting results that resembled the ones shown in their article with that level of perfection. I got in contact with the authors asking for clarification but only received some vague answers.

Finally, I gave up on this approach when I talked with an author of another article that cites them, and he said that even he had failed to reproduce their results.

A valuable lesson to share here is that if you are not progressing with an idea, set a deadline to end it and move on to the next one. I searched for related articles that could cast some light on my problem and found alternative approaches.

Finding the best-fit model

Another deep learning model found during the study uses a more complex combination of convolutional and LSTM layers to predict the number and category of crimes, called SFTT and proposed by Stalidis et al.

The SFTT model, presented in figure 5, receives as input a sequence of heatmaps aggregated in time by class and outputs a future heatmap with predicted hotspots, which according to the article’s definition are places where at least a single crime happened, together with category probabilities.

Figure 5 — Structure of SFTT

Even though the model showed good results for the article’s authors, I couldn’t manage to get good results myself. In my implementations either the model overfitted terribly, or the number of parameters to train was astonishing, reaching around 32 million, and my machine was not capable of processing that in a reasonable time.

Unfortunately, I still had the “perfect score” mentality from the ST-ResNet article when working on this model and was frustrated most of the time with what I was achieving. Only later, when I read Udo Schlegel’s master thesis, Towards Crime Forecasting Using Deep Learning, where he makes use of encoder-decoders and GANs to predict heatmaps and shows results that looked more similar to the ones I had been finding, did I change my mind about crime prediction.

Crime predictability, as I would discover near the challenge’s end, is about predicting trends and regions, not lone criminal minds. You can’t play Minority Report, like in the Tom Cruise movie, in this field.

Even though I failed at using this model, I would consider retaking its study and implementation in the future, since it can help immensely in cases where we would like to make a clear distinction between predicting ordinary crimes and rape against women. In my opinion, the kind of intervention or warning needed in these two situations should be treated differently, since the impact of the latter can be life-changing.

Finally, after a lot of study and research, I found the article that best fit the solution I was looking for. Holm and Plynning proposed a convolutional LSTM model (figure 6) to predict burglaries in Sweden. They turned a regression problem into a classification one: instead of predicting the number of crimes that could potentially occur in the future, they tried to classify each cell with a risk probability.

Figure 6 — Structure of Conv + LSTM.

The model’s input is a sequence of binary maps in time, where each cell contains one if at least one incident happened there (burglary in the article’s case) and zero otherwise, and the output is a risk probability map for the next period. Since we had little data to work with, it was easier to turn our samples into binary maps than into aggregated sum maps like the ones used for regression problems. Being loose on aggregation let the spatial dimension be explored even further, enabling us to amplify the resolution of the heatmaps, since accumulation was no longer a necessity but just the existence of a single incident inside the delimited cell square. In the next part, I will describe the use of this chosen model for the solution.

The solution

The model that best fit our case was the Conv + LSTM.

The dataset provided was stored in a spreadsheet containing around 11,000 rows, where each row was a reported incident that happened to a woman at a certain place on the globe, with information like incident date, description, latitude, longitude, place, category, and so on.

From this point the following pipeline was adopted in order to build a solution:

  • Data Selection
  • Data Pre Processing
  • Modeling
  • Evaluation
  • Prediction

Data Selection

Most of the open datasets used for crime studies contain lots of detailed samples, reaching numbers that vary from 100,000 to 10,000,000 reports, since they are generally provided by the local police department, which has a good computerized system capable of collecting, organizing, and storing this valuable information over a delimited region.

Since it is not feasible to build global-scale heatmaps for crime prediction with a useful resolution, we selected the cities that concentrate most of the data, Delhi and Mumbai, with approximately 3,000 samples for each of those places.

After the selection, a spatial and temporal boundary was defined that could encompass the majority of useful data for training the algorithm.

For the spatial part, an algorithm that selects the best latitude and longitude boundaries based on a threshold number of events and a grid division was used. The goal was to cut out the places with an irrelevant number of occurrences. As in figure 7, the first dataset that encompassed Delhi had a lot of zeros near the latitude and longitude boundaries, and after applying the algorithm we reduced the space to where most of the relevant data was located, with minimal loss.

Figure 7 — Boundaries selection for spatial dimension on Delhi dataset. Fig. (a) is the original dataset with all reports while fig. (b) is the dataset which encompasses the most valuable data.
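The exact boundary-selection code is not reproduced here, but the idea can be sketched as follows, assuming a pandas DataFrame with latitude and longitude columns (the column names and the threshold value are illustrative, not the project’s own):

```python
import numpy as np

def select_spatial_bounds(df, grid=32, threshold=5):
    """Trim outer rows/columns of the grid whose event counts fall below a
    threshold, and return the tightened latitude/longitude bounding box."""
    counts, lat_edges, lon_edges = np.histogram2d(
        df["latitude"], df["longitude"], bins=grid)

    rows = np.where(counts.sum(axis=1) >= threshold)[0]   # latitude bins worth keeping
    cols = np.where(counts.sum(axis=0) >= threshold)[0]   # longitude bins worth keeping

    lat_bounds = (lat_edges[rows.min()], lat_edges[rows.max() + 1])
    lon_bounds = (lon_edges[cols.min()], lon_edges[cols.max() + 1])
    return lat_bounds, lon_bounds

# Usage: keep only the reports inside the trimmed bounding box.
# lat_bounds, lon_bounds = select_spatial_bounds(delhi_df)
# delhi_df = delhi_df[delhi_df["latitude"].between(*lat_bounds)
#                     & delhi_df["longitude"].between(*lon_bounds)]
```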

On the temporal dimension we had data from 2002 until 2019. Grouping the reports by month, it was possible to verify that not all years had a relevant or useful amount of information, so the strategy was to select just the reports within the concentrated range, as can be seen in figure 8.

Figure 8 — Boundaries selection for temporal dimension on Delhi dataset. Fig. (a) is the original dataset with all reports ranging from 2002 to 2019 while fig. (b) shows data just selected from 2013 to 2017.
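A minimal sketch of this temporal selection, assuming a pandas DataFrame with an incident date column (the column name is illustrative; the 2013 to 2017 range follows figure 8):

```python
import pandas as pd

# Inspect how reports are distributed per month, then keep only the dense range.
df["incident_date"] = pd.to_datetime(df["incident_date"])
monthly_counts = df.groupby(df["incident_date"].dt.to_period("M")).size()
print(monthly_counts)                                    # most data falls in 2013-2017

df = df[(df["incident_date"] >= "2013-01-01") & (df["incident_date"] < "2018-01-01")]
```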

Once the useful spatial and temporal data was selected, the next step was to create the heatmaps.

Data pre-processing

In the data pre-processing step, we defined the spatial and temporal granularity of our aggregated data in order to produce the heatmaps.

For this challenge, a 32 by 32 grid was used on a daily basis. These values seem aggressive for a small dataset but, as stated earlier, binary maps allow a finer grid since each cell needs just one incident, and daily granularity could be used, even with lots of missing days, because with the data augmentation technique described below we still got good results.

Heatmaps were made using an aggregation technique where we first converted the latitude and longitude of our data samples into a matrix coordinate and then summed all the samples that fell in the same coordinate within the same temporal granularity, as demonstrated in figure 9.

Figure 9 — Conversion between latitude and longitude into a heatmap.
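A minimal sketch of this conversion, assuming the spatial bounds selected earlier (the helper names are my own, not the project’s code):

```python
import numpy as np

def to_cell(lat, lon, lat_bounds, lon_bounds, grid=32):
    """Map a latitude/longitude pair to a (row, col) cell of a grid x grid matrix."""
    row = int((lat - lat_bounds[0]) / (lat_bounds[1] - lat_bounds[0]) * grid)
    col = int((lon - lon_bounds[0]) / (lon_bounds[1] - lon_bounds[0]) * grid)
    return min(row, grid - 1), min(col, grid - 1)        # clamp points on the upper edge

def daily_heatmap(day_reports, lat_bounds, lon_bounds, grid=32):
    """Sum all reports of one day that fall into the same cell.
    day_reports is an iterable of (lat, lon) pairs."""
    heat = np.zeros((grid, grid), dtype=np.float32)
    for lat, lon in day_reports:
        r, c = to_cell(lat, lon, lat_bounds, lon_bounds, grid)
        heat[r, c] += 1
    return heat
```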

After creating the heatmaps, we still had many missing daily maps, which could potentially represent a problem when stacking them into sequences as input data for the ConvLSTM model. One solution could be to assume zero-valued heatmaps for these cases, but since we already had too many sparse matrices, this strategy would just push the model towards overfitting on zeros.

The strategy used to fill this gap was to upsample the missing heatmaps using a linear interpolation between the previous and next heatmaps in the time sequence, using the total missing period as the division factor. With this data augmentation method it was possible to raise the dataset from 586 binary maps to 1,546 daily maps.

After synthetically creating the missing heatmaps, a threshold value was arbitrarily chosen to convert them into binary maps (figure 10), since, as stated, the model works with risky and non-risky hotspot prediction.

Figure 10 — Example of conversion from a heatmap into a binary map using a 0.5 threshold.
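A hedged sketch of these two steps, the linear interpolation of missing days and the conversion back into binary maps (function names are illustrative; the 0.5 threshold follows figure 10):

```python
import numpy as np

def fill_missing(prev_map, next_map, n_missing):
    """Linearly interpolate the maps for the n_missing days between two observed
    heatmaps, using the length of the gap as the division factor."""
    filled = []
    for k in range(1, n_missing + 1):
        alpha = k / (n_missing + 1)
        filled.append((1 - alpha) * prev_map + alpha * next_map)
    return filled

def to_binary(heatmap, threshold=0.5):
    """Convert an (interpolated) heatmap into a binary map: 1 if the cell value
    exceeds the threshold, 0 otherwise."""
    return (heatmap > threshold).astype(np.uint8)
```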

Modeling

A deep learning model composed of a convolutional network and an LSTM, as in figure 6, was used as the main algorithm for heatmap prediction. The model’s input is a sequence of temporal binary maps, and the output is the next map of the sequence, containing the risk probability of each cell.

Convolutional networks are great at learning the relations between nearby cells in a map, while LSTM networks learn the relations between maps in a time sequence. Combine them and you have a powerful algorithm that learns the spatial and temporal properties of our dataset.

The training and target sets were made by grouping sequences of 16 daily binary maps as input and using the next map as the target, creating 576 unique sequences. It is important to state that artificially created maps were skipped when selected as targets, since they don’t translate into a real scenario.
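A minimal sketch of this windowing step, assuming the daily binary maps are kept in chronological order together with a flag marking the artificially interpolated ones (both names are illustrative):

```python
import numpy as np

def build_sequences(maps, is_synthetic, seq_len=16):
    """Each sample is a run of seq_len consecutive daily binary maps; its target
    is the following map. Windows whose target is a synthetic (interpolated)
    map are skipped."""
    X, y = [], []
    for start in range(len(maps) - seq_len):
        target_idx = start + seq_len
        if is_synthetic[target_idx]:          # don't learn to predict synthetic days
            continue
        X.append(np.stack(maps[start:target_idx]))
        y.append(maps[target_idx])
    # Shapes: (samples, seq_len, rows, cols, 1) and (samples, rows, cols, 1),
    # the layout expected by a ConvLSTM2D-based model.
    return np.expand_dims(np.array(X), -1), np.expand_dims(np.array(y), -1)
```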

For the train and test split, the decision boundary was based on the relevant temporal range of 2013 to 2017, where samples before the second half of 2016 were selected to train the model and the later ones to validate it.

A total of 100 epochs was used to train the model, with a learning rate of 0.0001 and batches of size 8. 20% of the training data was used to validate the model during the process, and the Adam optimizer was used to correct the learning direction.

Heatmaps are very large sparse matrices (matrices with lots of zeros), so to balance a loss that naturally leans towards the easy way out, learning to predict only zeros, we used a weighted cross-entropy function for backpropagation that puts more emphasis on missing the ones.
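A hedged sketch of such a weighted binary cross-entropy in TensorFlow/Keras (the weight value is illustrative, not the one used in the challenge):

```python
import tensorflow as tf

def weighted_bce(pos_weight=10.0):
    """Binary cross-entropy where errors on cells containing a one (an incident)
    are penalized pos_weight times more than errors on the abundant zero cells."""
    def loss(y_true, y_pred):
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        bce = -(pos_weight * y_true * tf.math.log(y_pred)
                + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
        return tf.reduce_mean(bce)
    return loss
```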

It is interesting to mention that in this model we are not encoding the heatmaps into a higher-dimensional space, but squashing them into a lower dimension, learning the most significant features with Max Pooling 3D layers placed after the LSTM layers that learn the sequence features.
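As a rough Keras sketch of the overall architecture, inspired by figure 6 but not a faithful reproduction of it (layer sizes and the final reshape are my own choices):

```python
from tensorflow.keras import layers, models, optimizers

def build_model(seq_len=16, grid=32):
    inputs = layers.Input(shape=(seq_len, grid, grid, 1))
    x = layers.ConvLSTM2D(32, (3, 3), padding="same", return_sequences=True)(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.ConvLSTM2D(32, (3, 3), padding="same", return_sequences=True)(x)
    # Squash the temporal dimension while keeping the spatial grid intact.
    x = layers.MaxPooling3D(pool_size=(seq_len, 1, 1))(x)
    x = layers.Reshape((grid, grid, 32))(x)
    outputs = layers.Conv2D(1, (1, 1), activation="sigmoid", padding="same")(x)

    model = models.Model(inputs, outputs)
    # The challenge used a weighted cross-entropy, as sketched above; the plain
    # binary cross-entropy here keeps this snippet self-contained.
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
                  loss="binary_crossentropy")
    return model

# Training setup following the article: 100 epochs, batch size 8, 20% validation.
# model = build_model()
# model.fit(X_train, y_train, epochs=100, batch_size=8, validation_split=0.2)
```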

Evaluation

To evaluate the model we used a separate test set to validate its behavior on unseen data and avoid bias.

Despite the fact that we are dealing with a classification problem, we cannot pick an arbitrary threshold like 0.5 and assume that all values above this line are risky cells and all below are neutral cells. Our goal is to predict regions that have a higher chance of an incident occurring, not the exact spot where a crime will happen. Besides that, we trained on large sparse matrices, so it’s natural that all values will lean towards zero.

A more practical way is to define a percentile among all cell values and decide that everything above this threshold will be classified as risky. For example, if we define the risky threshold as the 96th percentile, we first count the number of cells for each predicted value, group them, and take the value that represents the 96th percentile. In NumPy there is a function called np.percentile that is useful for this case. Holm and Plynning recommend taking the percentile over the entire test prediction instead of per sample, to average the value across different seasons.

Figure 11 — Binary predicted maps using different percentiles to define risky areas.
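A minimal sketch of this thresholding step (function and variable names are my own, not the project’s code):

```python
import numpy as np

def binarize_by_percentile(pred_maps, percentile=96):
    """pred_maps: predicted risk probabilities, shape (samples, rows, cols).
    The threshold is taken over the whole test prediction, following Holm and
    Plynning's recommendation, and cells at or above it are marked as risky."""
    threshold = np.percentile(pred_maps, percentile)
    return (pred_maps >= threshold).astype(np.uint8), threshold
```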

After deciding on the percentile value and converting the analog results into a binary map, we measure the score against the true map. To avoid penalizing the prediction so hard for near misses that we end up tuning the model into a totally overfitted algorithm, we give some score to predicted cells in the neighboring areas around hotspots, since we didn’t miss by much.

Instead of the usual four classification labels, we have six, as represented in the following table:

Table 1 — Labels for each prediction

With this table in mind we count each of those labels for each predicted cell as seen in figure 12.

Figure 12 — Label prediction according to defined percentile threshold.

For the example above we have the following results depending on the selected percentile value:

Table 2 — Predictions count for different percentiles thresholds.

The lower the percentile threshold, the more cells we classify as risky and the more actual hotspots we hit, but is that good?

We have to find the perfect balance between risk classification and model accuracy to get the best end-user experience.

A conservative view with a lower threshold would lead to a high number of correct classifications, but at the same time it would create the wrong perception for users that the entire map is a dangerous zone. On the other side, if we raise the percentile too much, we would barely classify any place as a hotspot, incurring a serious error: the FALSE NEGATIVE.

There is hardly anything worse for a person’s experience than trusting an application that tells them they are stepping into a safe zone and then ending up the victim of a crime. In fact, depending on the proportions it takes in the media, it could be the end of the entire system we are building to help people.

So, among all those labels, which are the most important ones to look at? Holm and Plynning state in the article that Correct, False Negative Neighbor, and False Negative should be taken into consideration when evaluating the model. The goal is to find the balance between the False Negative vs. Correct + False Negative Neighbor rate and accuracy.

Figure 13 — Model evaluation against different percentile thresholds. It is possible to note that the higher the percentile threshold value, the higher the accuracy (black line), but at the same time the more false negatives we have.

In figure 13 we can check the model evaluation for percentiles ranging from the 85th to the 99th. The higher the value, the more false negatives we have, despite the increase in the model’s accuracy. That happens because more neutral cells (true negatives) are evaluated as correct. Those cells are also important to identify correctly, since we want to guide the user to safe spaces while avoiding flagging too many places as risky when they are not.

The prediction

There are some guidelines suggested in Holm and Plynning to present the prediction in a way people can easily understand. The first is to display the cells classified as risky in one color while muting the others; the second is to define a group of threshold values to classify each cell.

So, a general guideline for model prediction is as follows (see the sketch after this list):

  • Make a prediction for a single sample or a whole batch.
  • Select a percentile over all cells of the output prediction; it will be used as a threshold to classify every cell value that falls above it as risky and every value below as neutral (or another defined label).
  • Convert the analog values into discrete ones using the threshold list from above.
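A hedged sketch of the second display option, bucketing each cell into discrete risk levels using several percentile thresholds (the helper name and the percentile values, which mirror the ones mentioned for figure 14, are illustrative):

```python
import numpy as np

def discretize(pred_map, all_preds, percentiles=(25, 75, 91, 96)):
    """Bucket each cell of one predicted map into risk levels, using percentile
    thresholds computed over the whole batch of predictions."""
    thresholds = [np.percentile(all_preds, p) for p in percentiles]
    # np.digitize returns 0..len(thresholds); a higher bucket means a riskier cell.
    return np.digitize(pred_map, thresholds)

# preds = model.predict(X_test)[..., 0]        # shape (samples, rows, cols)
# levels = discretize(preds[0], preds)
```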

In figure 14 there are two examples using the 96th and 99th percentiles as the risky threshold, colored in red. The other colors are produced using lower threshold selections like the 91st, 75th, and 25th percentiles.

The two black dots on the red area represent the crimes that occurred on that particular day, and both were captured by the model. It is easy to see that the 99th percentile prediction gives a more precise risky area, but at the same time it has a greater chance of missing important cells, since it aggressively tries to flag only areas where we are almost certain that something can happen. Testing should be done using different percentiles in order to find the one that best adapts to the user’s needs.

Figure 14 — Heatmaps predictions using different thresholds. Figure (a) uses a 96th percentile while figure (b) uses 99th.

An important problem to mention is the border effect caused by the padding in the convolutional network. Zero padding is not the best way to deal with the borders, especially on small images like our heatmaps, and the border cells should be ignored in production since they hardly translate into reality.

One way to improve the resolution of the output is to use bilinear interpolation to increase the heatmap size, providing a finer resolution when presenting it to the final user, like the one shown in figure 15.

Figure 15 — High-resolution heatmap over a region of Delhi. The redder the area, the riskier the algorithm considers it. The black dot represents the crime that occurred in that region on the specific date.
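A minimal sketch of this upscaling step with SciPy (the upscale factor is illustrative):

```python
from scipy.ndimage import zoom

def upscale(risk_map, factor=8):
    """Bilinearly interpolate (order=1) a coarse risk map to a finer grid before
    overlaying it on the city map, e.g. 32x32 -> 256x256."""
    return zoom(risk_map, factor, order=1)

# high_res = upscale(pred_maps[0])
```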

Final considerations

There are many solutions out there that use deep learning models for crime forecasting with heatmaps. This article tried to contribute by showing the approach that best fit the resources we had for this challenge, while explaining which other solutions didn’t work for us and why.

There is room for improvements such as,

  • Testing different spatial dimensions (grid size)
  • Testing different temporal dimensions (time granularity) like aggregating heatmaps in weeks, months and so on.
  • Testing different upsampling techniques for filling the gaps when no data is present.
  • Trying to implement another padding technique for removing border effect on the convolutional network.
  • Trying to convert this problem from classification to regression and use the other proposed deep learning approaches.

After spending two months of hard work on this challenge, the great question that bothers me the most as a Machine Learning Engineer is: will the product of all my effort help a woman in India feel safe?

That is the only question that matters in the end for someone who feels responsible for their actions.

Doing machine learning for good is not about trying to reach 99.9% accuracy with your model to look good to your friends on social media, but about making the world a place where people suffer less, and I can’t imagine the motto “do good” detached from the word “responsibility”.

I end this article by thanking Omdena for letting me participate in the Safecity challenge, and all the other great companies I met while collaborating.

Want to become an Omdena Collaborator and join one of our tough AI for Good challenges? Apply here.

If you want to receive updates on our AI Challenges, get expert interviews, and practical tips to boost your AI skills, subscribe to our monthly newsletter.

We are also on LinkedIn, Instagram, Facebook, and Twitter.

Bibliography

Holm, N., & Plynning, E. (2018). Spatio-temporal prediction of residential burglaries using convolutional LSTM neural networks.

Schlegel, U. (2018). Towards Crime Forecasting Using Deep Learning. Master thesis, Department of Computer Science, Universität Konstanz.

Stalidis, P., Semertzidis, T., & Daras, P. (2018). Examining Deep Learning Architectures for Crime Classification and Prediction, 1–12. Retrieved from http://arxiv.org/abs/1812.00602

Wang, B., Zhang, D., Zhang, D., Brantingham, P. J., & Bertozzi, A. L. (2017). Deep Learning for Real Time Crime Forecasting, 33–36. Retrieved from http://arxiv.org/abs/1707.03340

Zhang, J., Zheng, Y., & Qi, D. (2017). Deep spatio-temporal residual networks for citywide crowd flows prediction. 31st AAAI Conference on Artificial Intelligence, AAAI 2017, 1655–1661.
