Forced displacement in Somalia

Using Data Analysis and Machine Learning to Identify Violence Zones in Somalia

SVM classifiers with an average accuracy of 99% monitor critical conflict areas, including the types of incidents and their impact.

Bruno Paixão
6 min read · Oct 24, 2019


The conflicts in Somalia have reached alarming levels: year after year, people are victimized by disputes over territory and control of space. The international community considers the situation intolerable.

This report aims to inform intervention actions through insights that serve as strategic tools for confronting the problems presented. The work is part of Omdena’s AI challenge in partnership with the UNHCR — The UN Refugee Agency.

The data is derived from a wide variety of local, regional and national sources and the information is collected by trained data experts around the world.

The data set

The first step in discovering knowledge is data exploration. Many data scientists underestimate the power of good data visualization. For this step, I used pandas profiling. The library streamlines variable analysis, providing in a single call information that would otherwise take development time to assemble.
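As a sketch of that single call (the file path and columns below are placeholders, not the actual challenge data), the profiling step might look like the commented lines; a lightweight plain-pandas stand-in follows:

```python
import pandas as pd

# Placeholder data standing in for the ACLED-style conflict dataset
df = pd.DataFrame({
    "event_type": ["Battles", "Explosions/Remote violence", "Violence against civilians"],
    "fatalities": [12, 3, 5],
    "latitude": [2.05, 2.35, 1.98],
})

# One line generates the full exploratory report (pandas-profiling,
# later renamed ydata-profiling); uncomment if the library is installed:
# from pandas_profiling import ProfileReport
# ProfileReport(df, title="Conflict data profile").to_file("report.html")

# A minimal stand-in using plain pandas:
print(df.dtypes)
print(df.describe(include="all"))
```

The profiling report adds per-variable type detection, missing-value counts, and correlation warnings on top of what `describe` shows.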

Dataset

A single line of code provides a real picture of the variables, and this is critical when building the data model. The dataset has 31 variables: 10 numerical, 14 categorical, 6 automatically rejected by the profiler, and one text field. From this we can select which variables will compose the data model.

Data Insights

Building good visuals follows an unbeatable formula: less is more. Graphics and plots need to be as concise as possible, delivering value to those who need to make a decision based on the data presented.

Graph 1 — Total Cases according to type over the years

This graph shows the evolution of cases over the years.

Insights

It is possible to observe that “battles” consistently accounted for about 50% of incident cases, but from 2017 onward this type lost ground proportionally. As battles declined, the “Explosions/Remote violence” category grew, indicating that technologies such as drones and remotely detonated explosives are gaining ground in violent conflict. Finally, the chart shows a consistent increase in violence against civilians over the years.

The following graph shows the lethality level by “sub-event” type. In absolute numbers, “armed clash” has the highest lethality, followed by “attack” and, in third place, “air/drone strike”. These data indicate which case types most impact the number of fatalities.

Graph 2 — Fatality Level by Sub Event

Relative measures have great power in decision-making. In Graph 3, an aggregation of two variables — total cases by “actor1” and average fatalities per actor — provides a performance panel for decision makers. Divided by colors and markers, we can easily spot the serious cases that need immediate intervention.

Graph 3 — Actor Lethality Monitor

Data visualization plays a fundamental and decisive role in structuring and generating insights for decision making; exploring this information gives the user a way to build solutions for the cases presented.

The SVM tool

Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for classification or regression. Despite the slightly confusing name, you will see that it makes perfect sense. Support vectors are the basis of how the algorithm works, because its architecture depends on these vectors. — Felipe Santana

Having mined the data and derived insights from visualization, we have the elements needed to build machine learning models. The purpose of this data model is a “hot zone” solution that predicts the most dangerous locations and the highest fatality levels. This kind of prediction is extremely practical because it optimizes the deployment of security personnel to handle the cases presented.

To build this model it was necessary to create a new feature called “mild_of_fatalities”, a categorical variable labeled “high” or “low”. To establish the categories we used the mean, standard deviation, and variance of the “fatalities” column. Rows with more than six fatalities were labeled “high”, and rows with fewer were labeled “low”. This new variable separates very dangerous regions from less dangerous ones.

Figure 1 — Creating a new feature
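The figure itself isn’t reproduced in text, but the thresholding step it shows might be sketched as follows (toy fatality values; the threshold of six is the one derived above from the column’s statistics):

```python
import pandas as pd

df = pd.DataFrame({"fatalities": [0, 2, 8, 15, 4, 7]})  # toy values

# Threshold derived by the author from the mean/std/variance of "fatalities"
threshold = 6
df["mild_of_fatalities"] = df["fatalities"].apply(
    lambda f: "high" if f > threshold else "low"
)
print(df)
```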

From this point, we can establish the target variable of the data model. The next step was to define the predictor variables; for this, a grid search (“GridSearch”) was applied, which made it possible to establish the degree of importance of the predictors. The following variables were selected: time_precision, event_type, latitude, longitude, location, fatalities.

Having defined the target and predictor variables, the next step is the data transformation process. At this stage, “StandardScaler” and “LabelEncoder” were applied to normalize the data.

The fact is that SVM only works with numeric data; the algorithm does not accept categorical values. Therefore, we need to convert the “Location” column to numeric.

One-hot encoding converts categorical values into binary vectors. The result of this technique is a presence matrix where the columns are the categories and the rows indicate presence.

Figure 2
Figure 3 — One Hot Encoding
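A minimal illustration of the presence matrix, using toy location names rather than the real data:

```python
import pandas as pd

locations = pd.DataFrame(
    {"location": ["Mogadishu", "Kismayo", "Mogadishu", "Baidoa"]}
)

# get_dummies builds the presence matrix: one binary column per category,
# with a 1 in the row where that category occurs
onehot = pd.get_dummies(locations["location"], prefix="loc")
print(onehot)
```

Each row sums to 1, since every record belongs to exactly one location.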

Let’s now test SVM with this setting, to do this let’s use the pipelines:

Figure 4 — Pipeline
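The pipeline figure isn’t shown in text form; a hedged sketch of what comparing two kernel pipelines might look like (synthetic data standing in for the real predictors and the high/low target):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins: 4 numeric predictors, binary high/low target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# One pipeline per kernel: scale features, then fit the SVM
pipelines = {
    "linear": Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="linear"))]),
    "rbf":    Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="rbf"))]),
}

for name, pipe in pipelines.items():
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```

Wrapping the scaler inside the pipeline ensures it is refit on each cross-validation fold, avoiding leakage from the held-out split.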

With just LabelEncoder the model was simpler, with fewer features, and this was a considerable change in the performance of the SVM.

The Results

We can see that the RBF kernel (pipeline 2) performed best, with an average accuracy of 99%. The other kernels underperformed.

This result can change dramatically if the data changes. Kernels perform data transformations, which means that for each situation a different kernel can be more suitable.

Another point to note is that kernels make the model more complex, which can also be a problem.

This gives us a more detailed view of classifier performance, broken down by class, so we can see how the model does on each specific class.
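The per-class view described here is typically produced with scikit-learn’s `classification_report`; a minimal sketch with toy labels standing in for the “high”/“low” classes:

```python
from sklearn.metrics import classification_report

# Toy true/predicted labels, not the article's actual results
y_true = ["low", "low", "high", "high", "low", "high"]
y_pred = ["low", "low", "high", "low",  "low", "high"]

# Prints precision, recall, F1, and support for each class
print(classification_report(y_true, y_pred))
```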

With an average accuracy of 99%, I believe we have a good SVM classification result.

Conclusion

This model attempts to predict the most dangerous locations. Its applications are diverse: based on the number of deaths, the system seeks to point the user to intervention zones for security forces, and to support the creation of programs and public policies aimed at reducing violence.

Want to become an Omdena Collaborator and join one of our tough AI for Good challenges? Apply here.
