Cities should use machine learning to detect buildings at risk of fire (with Python code)

Published in

The Startup

15 min readAug 21, 2019

(The full code for this project is on GitHub which you may access here or here).

Combining different datasets and applying machine learning algorithms will allow fire departments and city officials to more efficiently target their resources, prevent disasters and save lives.

Or at the very least it could help us make better predictions where fires will occur.

This summer I worked with the city of Sioux Falls on this very challenge, as part of a summer fellowship made possible by the Harvard Bloomberg City Leadership Initiative, which aims to nudge local governments towards data-driven decision making. Along with the departments of innovation, GIS, and fire, we set out to build a rough first version of a predictive model that would demostrate the value of bringing together property, parcel, tax and demographic information.

Predictive analytics and fire risks

The fire department from Sioux Falls first proposed a challenge.

Every year they not only respond to calls, but also attempt to prevent. They give out smoke alarms, educate the community on risky behaviors to avoid,, inspecting commercial buildings, among other preemptive actions.

The problem is that, as always, resources are limited and so far they’ve only been able to take a scattershot approach: try to do as much prevention as the budget and human hours will allow, but without targeting those resources in any way. Could there be any way to use knowledge that the city already has to make smarter decisions?

Our efforts were illuminated by the phenomenal work from Jon Jay, post-doctoral researcher at Harvard, which you can read about here and here. This gave us some initial confidence that data that the city collects may hold some predictive power.

After some initial assessment of our data sources we figured we could reliably use the following:

Fire incidents from 2008 to 2018
General parcel information
Units in the rent registry
List of all foreclosed properties
Utility (water service) disconnections
Code violations
Crime reported to specific addresses
Value assessments from county authorities (Sioux Falls is part of the counties of Minnehaha and Lincoln)

Luckily, all of the above sources include an “Address” field which allowed us to easily join them. Prior to this project, the GIS team at Sioux Falls spent considerable effort in geo-coding different datasets — making sure that the way the addresses were written was consistent (i.e., “123 N Cleveland Street” wouldn’t appear as “0123 North Cleveland St” in another one).

We spent some time brain-storming other relevant data points that may be collected by the fire department at different stages of their response to each incident.

Next, a detailed step-by-step explanation for building the machine learning model used (and the corresponding Python commands). Feel free to skip to the results section if you are not interested in the technical aspects.

Building a machine learning model for fire risk assessment

Package and data importing

First, we import all the handy packages that we will use for this exercise.

# Data analysis and visualization
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt# Interactive maps 
import folium
from folium.plugins import HeatMap# Machine learning
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score, auc
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline

Then, we load all of our data into the Python environment with the read_csv() command in the pandas library:

parcels_df = pd.read_excel('data/parcels2.xlsx', parse_dates=True)

Normally, loading data from a single file is quick, but since we will work with multiple dataframes, some of which are tens of thousands of rows long, its good to create copies. This way, if we delete a useful column by mistake when cleaning the data we can just re-run the code for a new copy instead of loading everything from csv again.

parcels = parcels_df.copy()

Data description

Once we are done loading everything and before we begin cleaning the data, we should spend some time inspecting our datasets to know what information we can actually use and which columns have missing values. Some useful commands are:

parcels.columns
foreclosures.describe()
building_fires.info()

Since we want to understand and predict a phenomenon that is geographical in nature, a neat approach is creating an interactive map for our main variable of interest. As long as we have columns with the latitude and longitude of the addresses, we can create an interactive heat map of past fire incidents with the folium library and a few lines of code:

# Visualizing the fire incidents in a map
fire_map = folium.Map(location=[43.54, -96.72], zoom_start=14, tiles='Stamen Terrain')
heat_df = building_fires[['lat', 'lon']]
heat_df = heat_df.dropna(axis=0, subset=['lat','lon'])
heat_data = [[row['lat'],row['lon']] for index, row in heat_df.iterrows()]
HeatMap(heat_data).add_to(fire_map)
fire_map

This creates an intuitive visualization:

Heat map of fire incidents in Sioux Falls. (Not interactive in Medium, unfortunately).

Data wrangling

We must now wrangle (clean) our datasets. This means preparing the data for analysis by joining the appropriate fields, dropping useless features, dealing with missing numbers, converting categorical variables, among other things. The result will look like a single matrix where each row represents an address and each column represents a numerical feature of that address.

For example, the “activity” column in our parcel dataset has codes for lots that are assigned to single-family housing, different types of commercial use, open spaces, among other things. This is useful information for prediction but we need to tweak it with a little custom made function:

def cleanActivity(x):
    if x == 11 or x == 12 or x == 13:
        x = 'SINGLE OR TWO RESIDENTIAL'
    elif x == 21 or x == 22 or x == 23 or x == 24 or x == 25:
        x = 'MULTIFAMILY'
    elif x in list(range(31,39)):
        x = 'OFFICE AND PUBLIC SERVICE'
    elif x in list(range(40,50)):
        x = 'INSTITUTIONAL'
    elif x in list(range(51, 54)):
        x = 'COMMERCIAL'
    elif x == 61 or x in list(range(63,70)):
        x = 'INDUSTRIAL'
    elif x == 62:
        x = 'AIRPORT'
    elif x in range(70,98) or x in range(0, 9):
        x = 'OPEN SPACES/NA'
    return xparcels['ACTIVITY'] = parcels.ACTIVITY.apply(cleanActivity)

This creates a more managable list of activities that we can then easily turn into dummy variables by using the pd.get_dummies(parcels.ACTIVITY, prefix='ACT_') command.

The main dataframe that we will use for the analysis, df, will contain the information that we extract from the other datasets. We begin with parcels and the appropriate dummies from some of its columns.

df = pd.concat([parcels, pd.get_dummies(parcels.ACTIVITY, prefix='ACT_'), pd.get_dummies(parcels.PARPR), pd.get_dummies(parcels.PARCELTYPE), pd.get_dummies(parcels.PARTYPE), pd.get_dummies(parcels.COUNTY)], axis = 1)

We can also drop entirely those rows that are not relevant to us.

df = df[df.ACTIVITY != 'AIRPORT']
df = df[df.ACTIVITY != 'OPEN SPACES/NA']

For our prediction variable of building fires, we have a list of unique incidents that we need to join with our list of parcels. We want every address to have a “INCIDENT” column with value 1 if it has had a fire in the last decade, and 0 if it hasn’t. We can do this by grouping together the incidents by their address and then creating a new column:

building_fires = building_fires.groupby('ADDRESS').max()
building_fires['INCIDENT'] = 1

We can now merge datasets based on the parcel address with pd.merge() and then replace missing values for zeros with the fillna() method of pandas dataframes, for parcels that weren’t in the building fires list.

df = pd.merge(df, building_fires, how = 'left', on='ADDRESS')
df.INCIDENT.fillna(0, inplace = True)

For code violations, we may want to look at both the presence of any violation and the total number per address. We use a similar procedure:

code_cases_grouped = code_cases.groupby(['ADDRESS']).sum()
code_cases_df = pd.DataFrame(code_cases_grouped)
code_cases_df['TOTAL_VIOLATIONS'] = code_cases_df.sum(axis=1)
code_cases_df['ANY_VIOLATIONS'] = 1
df = pd.merge(df, code_cases_df, how = 'left', on = 'ADDRESS')
df.ANY_VIOLATIONS.fillna(0, inplace = True)
df.TOTAL_VIOLATIONS.fillna(0, inplace = True)

Above are just some examples of steps necessary for data wrangling. This is usually the most tedious and time-consuming step in the process.

Once we are done, we can split our dataframe into our prediction (y) and predictors (X) variables:

# Split predictor and prediction variables
X = df.drop(columns = ['INCIDENT', 'date'])
y = df.INCIDENT

Parameter tuning and generating predictions

Our machine learning algorithm will be based on a random forest classifier, a powerful and popular tool that fits a number of decision trees to the data and then follows a majority rule, where every tree in the forest “votes” for every classification decision that must be made.

The technical aspects that explain the high accuracy of random forest algorithms are beyond the scope of this column and are better explained elsewhere. For now, we must know that there are several parameters that we should specify— concretely, we are concerned with choosing the number of trees in the forest, the depth of each tree (how many times each tree branches) and how many variables will be considered. Choosing the right parameters is the key to a robust model.

The sklearn library is an intuitive way to run machine learning projects in Python. It also comes equipped with several tools for parameter tuning. In this case we use GridSearchCV() and Pipeline() to try out every single combination of number of trees, depth and number of features and return the best possible values. To compare each combination, sklearn uses cross-validation, a powerful statistical concept that relies on dividing the data in two and training the algorithm with one portion and testing its accuracy with the other (usually several times).

The code is fairly intuitive, and here we just select a reasonable amount of values to test. Since we are fitting the model an exponentially growing number of times, this may take too long if we choose too many combinations.

# Create pipeline with feature selector and random forest classifier
pipe = Pipeline([
    ('feature_selection', SelectKBest(f_classif)),
    ('clf', RandomForestClassifier(random_state=2))])# Create a parameter grid to test
params = {
   'feature_selection__k':[3, 5, 10, 20, 50],
    'clf__n_estimators':[2, 5, 10, 100],
    'clf__max_depth' : [3, 5, 10, 20, 50]}# Initialize the grid search object
grid_search = GridSearchCV(pipe, param_grid=params)# Fit it to the data and print the best value combination
print(grid_search.fit(X, y).best_params_)

To which we get the following output:

{'clf__max_depth': 10, 'clf__n_estimators': 100, 'feature_selection__k': 20}

We then use SelectKBest() and RandomForestClassifier() to fit our model using our dataframe and the optimized parameters obtained above.

# Select the best features based on the optimized parameters
sk = SelectKBest(f_classif, k=grid_search.best_params_['feature_selection__k'])
which_selected = sk.fit(X, y).get_support()
X = X[X.columns[which_selected]]# Fit the classifier with the optimized parameters
random_forest = RandomForestClassifier(n_estimators=grid_search.best_params_['clf__n_estimators'], max_depth = grid_search.best_params_['clf__max_depth'])
random_forest.fit(X, y)

In other words, we have now fitted a random forest classifier built with 100 decision trees of depth 10, using the 20 most useful variables.

At this point it is useful to visualize our best predictors and their correlation, which can be done with the seaborn library.

# Correlation heat map with our predictive variables
comb = pd.concat([X, y], axis = 1)
corr = comb.corr()plt.figure(figsize = (10,10))dropSelf = np.zeros_like(corr)
dropSelf[np.triu_indices_from(dropSelf)] = Trueax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True, annot=True, annot_kws={"size": 7},
    mask=dropSelf
)ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
)

Correlation matrix for our 20 most relevant features and our target variable.

We can see from this graph, for example, that fire incidents are more strongly correlated with crime incidents than with any other variable or that being in the city’s rent registry has a positive correlation with property maintenance code violations. But because we are not building a linear model, we should not assume that a low correlation with fire incidents implies a useless feature.

Scoring and interpreting our results

Let us turn now to assessing the value, or predictive power, of our model.

One common tool to validate models is cross-validation, which we’ve described earlier. In sklearn we can get the cross-validation score using the cross_val_score command:

Score = cross_val_score(random_forest, X, y, cv=3)
round(np.mean(Score)*100, ndigits=4)

Which gives us a score of 98.16 out of 100.

Although this seems high, it does not really help us that much. The reason for that is that we are trying to classify a phenomenon that is infrequent. If we used a trivial model that simply predicted no fires for every address in the city, then it would get comparable accuracy.

Instead, let’s try to evaluate the model by assigning risk scores (probabilities of fire) and comparing that to an approach to targeting resources based on randomness.

To do this we will train the model using fire incidents from every year except 2018, and then giving a risk score to every address. Many 2018 fires at high risk would indicate that our model is actually pointing us in the right direction. We can then take a number of hypothetical interventions that we could conduct in a year and see how many fires were catched by focusing only on the high risk scores as opposed to intervening at random.

To hide the 2018 fire incidents we go back to the data-wrangling section and use the datetime functionality of pandas.

# First create a new column for parcels with incidents in 2018
df['INCIDENT_2018'] = [1 if date > pd.to_datetime('2018') else 0 for date in df['date']] 
# Now turn all 2018 fire incidents off
df.loc[(df.date > '2018') & (df.INCIDENT == 1), 'INCIDENT'] = 0

Next, let’s create a new column with risk scores based on the predict_proba method of our classifier. This will give us a percentage rather than a 1 or 0 prediction.

df['PREDICTION'] =  random_forest.predict_proba(X)[:,1]

Let’s say that the city is going to intervene 2000 properties in a year. We can use the random method in the numpy library to assign random numbers and then assign the 2000 largest to an intervention. We will then use the nlargest method of pandas dataframes to order our addresses based on the risk scores from our model.

np.random.seed(0)
df['RANDOM'] = np.random.rand(df.shape[0])
df_random = df.nlargest(2000, columns = 'RANDOM')rf_results_search = df.nlargest(2000, columns = 'PREDICTION')
print('Total 2018 predicted fires (2000 searches): ' + str(rf_results_search.INCIDENT_2018.sum()))
print('Total random predicted fires (2000 searches): ' + str(df_random.INCIDENT_2018.sum()))

By looking at how many properties were found by both our model and random chance, the preceding code gives us this output:

Total 2018 predicted fires (2000 interventions): 42
Total random predicted fires (2000 interventions): 6

What would happen if we picked a different number of interventions? We can plot all possible results with this code:

model_comparison = []for searches in range(0, 55000, 100):
    prediction = df.nlargest(searches, columns = 'PREDICTION').INCIDENT_2018.sum()
    random = df.nlargest(searches, columns = 'RANDOM').INCIDENT_2018.sum()
    model_comparison.append([prediction, random])
    
model_comparison_df = pd.DataFrame(model_comparison, columns=['PREDICTION', 'RANDOM'])model_comparison_df.plot()
plt.xlabel('Properties inspected or intervened')
plt.ylabel('Fire occurance in properties inspected')
plt.xticks([])

This creates the following visualization:

Measuring the model’s robustness against randomness.

Intuitively, if we intervened all addresses in the city we would find every single fire and if we did 0 interventions we would find none. The preceding graph showcases the advantage of our model over randomness at different number of interventions.

This looks very similar to our model’s AUC-ROC curve, which compares the true positive and false positive classifications under several thresholds.

y_test = df.INCIDENT_2018
preds = df.PREDICTIONfpr, tpr, threshold = roc_curve(y_test, preds)
roc_auc = auc(fpr, tpr)plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Which yields:

The previous two curves are helpful for model comparison. If we create a new model and it yields a curve that is taller (closer to the top of the graph), then we can assume that the new model will be more useful at any number of interventions.

When it comes to explaining our model, random forests do not offer as much interpretability as other classification methods (such as linear regression). This is a trade-off for the accuracy given by using multiple decision trees. We can, however, visualize which features are more important towards making a classification decision.

importances = random_forest.feature_importances_
indices = np.argsort(importances)
ax = plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [X.columns[i] for i in indices])
plt.xlabel('Relative Importance')

Which prints out this graph:

Relative importance of the features of our model.

Exporting our results

Finally, we can export our results to a .csv file, which can then serve as an input to build a more concrete tool for the fire department (such as a map of the city with risk scores for every address).

df['PREDICTION'].to_csv('Results - risk predictions.csv', header = True)

Results

As described above, we have used the fire incidents from 2008 to 2017 to train our model, and those from 2018 to validate it.

A way to think about our results is this hypothethical scenario:

If the city had built this model at then end of 2017, and then conducted interventions to help mitigate risk (such as giving away fire detectors or doing educational campaigns) in the 2000 properties with the highest chance of fire, then they would have “caught” over 40 of 2018 fires, considerably more than the approximately 6 that would be found at random.

Comparing our model results versus randomness.

Big quotation marks should be placed upon the word catched. Just knowing which properties are high risk and doing an intervention there does not mean that the underlying risk factors and behaviors have been reduced.

Ethical, privacy and equity considerations

I believe that it is important for all public sector projects that leverage advanced analytics to make explicit how they deal with ethical, privacy and equity questions. Complex algorithms with strong predictive power may simultaneously change operational or resource allocation decisions, while being very hard to scrutinize.

When it comes to privacy, the model built above uses only data that is already collected at the municipal or county level. Most of this technically counts as public information. The one exception comes from the crime registered to specific addresses that cannot be made public unless it is generalized.

As it stands, only city officials that have requested access can use the mapping tools built with this model.

A less straightforward question is that of the inequity risk. Are there groups that may be unfairly punished because of this model? Could the data be biased? The use of machine learning for policing has been a topic of much debate and these are not easy questions to answer. But because the objective of our algorithm above is to be preventive rather than punitive (the city should provide support to building owners to reduce their risk assessment) the adverse effects of potential bias are somewhat diminished.

That does not mean that we are completely in the clear. What if, because the data collection is imperfect, we target resources to those who need it the least?

Socioeconomic and cultural differences in citizens who complain in 311 systems has been recently studied. If we were to target resources based on models built on data with these biases we would be investing more in resources who complain more, which also tend to be of higher income.

Some of the relevant data points used in our random forest classifier are very hard to incorrectly measure (the square footage of a property or whether or not a house has been foreclosed) but others could be biased (code violations, if they are mostly complaint-driven).

For this particular case, we are trying to build a model that drives resources towards buildings at the highest risk of fire, that improves upon randomness. There is little evidence that the results of this exercise could systemically point away from high-risk parcels to low-risk parcels, but biases in data collection should always be kept in mind when constructing and applying a model such as this.

Operational implications, and room for improvement

What started out as an exercise meant to prove the value of combining data from multiple city departments quickly became a (rough) first version of a concrete tool for helping the fire prevention team best target their resources.

The challenging part comes now, though.

First, with a better understanding of the buildings at fire the question then becomes how to effectively prevent incidents. A comprehensive strategy that identifies best practices to approach units by segmenting them based on their activity (commercial, single family, multi family, etc.), linking behaviours with risks and then testing the effectiveness of interventions in improving those behaviours. This would necessarily have to include going back to fire-fighters, who are the ones with the most domain knowledge.

Second, the machine learning model itself could be vastly expanded. We built this model quickly because we wanted to showcase that even an imperfect analysis had worth. But there are other data points that could be included that may have some predictive power. For example, we didn’t consider things like sub-types of commercial activity or time since last inspection.

Any improvements to the model can be measured using cross validation techniques or by following a strategy similar to the one used above: training the model with data from one period and then testing it with the incidents from the excluded time frame.

As cities race to get “smarter”, they would be wise to start with concrete challenges that they already face, figure out what they know and measure, and iteratively using that information to improve citizens lives.