CSI: SF — The Data Scientist

Connor Mitchell
Published in Analytics Vidhya
11 min read · Dec 21, 2019

Follow along with the code here: GitHub

Welcome to the first episode of CSI: San Francisco. You are an aspiring detective on the show and you’ve been given the opportunity to interview for a data science position with the SFPD. The moment you step into the interviewer’s office, you notice a distraught look on her face. She hands you a laptop and slams you with a seemingly impossible question:

“A hacker broke into our database and removed the descriptions of half of all our crime records before our cybersecurity team shored up our defenses. If I told you the time and date the crime occurred, the address and GPS coordinates of its location, and which police department district it occurred in, could you tell me what type of crime was committed? Our team is in the process of recovering as many details as we can, but we’re unsure how many we’ll get.”

With your dream job on the line, would you be able to do it?

Using nothing more than a basic data science toolkit and a computer, the answer is yes. This blog post will teach you how to become as big a star as Sherlock Holmes in four steps. As a nice bonus, you’ll also be able to submit your results to this Kaggle challenge. Full code can be found in this GitHub repo.

Step 1: Explore

After completing the requisite NDAs, you’re able to log in to the laptop. It gives you access to a database containing historical crime details, including the type of crime committed, when it was committed, and where it was committed along with a description provided by the responsible police department. To get started, download all that data. You’ll need as much as you can get.

Now open an iPython notebook (ensure the data files live in the same folder!) and take a look at the data you’re working with. This is the first step to solving any data science problem, as exploring your data tells you what (if any) data cleaning (“pre-processing”) is needed before you can train your model and highlights any weaknesses in your dataset that may hurt your performance.

You should read the datasets using pandas, as the pandas DataFrame data structure is flexible, intuitive, and closely intertwined with common NumPy functions. Additionally, pandas functions like head() and describe() can provide a solid overview of the dataset you’re working with. It’s a good habit to develop.
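A minimal sketch of that first look, assuming the Kaggle challenge’s file names (train.csv and test.csv) sitting next to your notebook:

```python
import pandas as pd

# Load the historical crime records (file names assumed from the Kaggle challenge)
train = pd.read_csv("train.csv", parse_dates=["Dates"])
test = pd.read_csv("test.csv", parse_dates=["Dates"])

# A first look at the shapes, sample rows, and summary statistics
print(train.shape, test.shape)
print(train.head())
print(train.describe())
```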

It looks like you’ve got nine columns in the training dataset, including the temporal (Dates) and spatial (Address, X, Y) columns that contain the information your interviewer told you she would provide. Check the test set too to confirm that no reformatting is necessary (such as converting a tuple to separate columns).

Uh oh. While the temporal and spatial columns are consistent, the hacker did their work well. The test dataset is missing Category, Descript, and Resolution. You can’t use them to train your model because you won’t be able to use them to make predictions on the hacked dataset (“test”).

Speaking of which, you should examine the distribution of your target variable (“Category”) in your training dataset. The more clean records of a crime category you have, the more samples your model has to learn from in order to recognize future samples like them.
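One quick way to see that distribution, assuming the training data loaded above:

```python
import matplotlib.pyplot as plt

# How many examples of each crime category does the model have to learn from?
category_counts = train["Category"].value_counts()
print(category_counts)

# A quick horizontal bar chart of the class distribution
category_counts.sort_values().plot(kind="barh", figsize=(8, 10))
plt.xlabel("Number of records")
plt.show()
```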

In order to get a better sense of where these crimes took place and to make use of the X and Y feature columns, you should also take a look at a crime density distribution map. This is important for understanding the spatial distribution of the training dataset and the possibility of location being a poor predictor. To create one, import your reliable data visualization library Folium and use Google Maps to identify the geographical center of San Francisco to center the map (SF_Coordinates).

Now, to visualize the city’s districts, you need their geographical boundaries, which you can import directly into your environment by downloading the JSON data.
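A sketch of one way to build such a map, assuming the boundary file is saved locally as sfpd_districts.geojson and exposes a DISTRICT property holding the district names (adjust both to match whatever file you actually downloaded):

```python
import folium

# Approximate geographical center of San Francisco (looked up on Google Maps)
SF_Coordinates = (37.76, -122.45)

# Crime counts per police district, used to color the choropleth
district_counts = train["PdDistrict"].value_counts().reset_index()
district_counts.columns = ["district", "count"]

crime_map = folium.Map(location=SF_Coordinates, zoom_start=12)

folium.Choropleth(
    geo_data="sfpd_districts.geojson",     # assumed file name
    data=district_counts,
    columns=["district", "count"],
    key_on="feature.properties.DISTRICT",  # assumed property name in the GeoJSON
    fill_color="YlOrRd",
    legend_name="Number of recorded crimes",
).add_to(crime_map)

crime_map  # renders inline in the notebook
```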

Yes, yes. You’ve made a pretty map, but don’t get distracted. Your job’s on the line, and time is of the essence. So what does this map tell you? The wide range of colors indicates a wide spread in crime density, which means the sample sizes of training data for crimes in some districts will be low. The map doesn’t tell you which types of crime those are likely to be, but that’s what you’re modeling for. The diversity in crime density suggests district-level differences, which supports using location as a feature in your predictive model.

But before you can move to the feature engineering stage, where you transform location into something useful, you need to explore null values and outliers: both are types of data points that can skew your results or break your code. Pandas’ describe() function is a great way to discover outliers, as it provides summary statistics for the non-categorical variables in a DataFrame.

A 90 degree latitude? You know that’s an outlier based on the SF coordinates you identified earlier. You have to deal with these points before proceeding to modeling, otherwise you risk skewing the weighting of the Y column feature. Hopefully there aren’t many. You can check by looking at Y coordinate values greater than a reasonable upper bound on the San Francisco area.
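A minimal check, plus one possible fix (imputing per-district median coordinates, which is an assumption on my part rather than the only sensible strategy):

```python
# Latitude 90 is effectively a missing-location placeholder; anything far
# north of San Francisco (~37.8 latitude) is suspect.
bad = train["Y"] > 40
print(f"{bad.sum()} rows have out-of-range coordinates")

# Replace bad coordinates with the median location of their police district
medians = train.loc[~bad].groupby("PdDistrict")[["X", "Y"]].median()
train.loc[bad, "X"] = train.loc[bad, "PdDistrict"].map(medians["X"])
train.loc[bad, "Y"] = train.loc[bad, "PdDistrict"].map(medians["Y"])
```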

Phew, that’s lucky. Whether you choose to impute their location values or not, it shouldn’t affect your model performance too significantly. But what about null values?
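Checking takes one line per dataset:

```python
# Count missing values per column in both datasets
print(train.isnull().sum())
print(test.isnull().sum())
```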

Wow! No null values in the training dataset? Almost unheard of in your experience as a data scientist. One of your interviewer’s colleagues must have cleaned the dataset beforehand… hopefully leaving the important pieces intact. Data science is a team sport–remember that.

Step 2: Engineer

Now that you’ve built up your intuition around the data with some exploration and dealt with null values and outliers you encountered along the way, it’s time to engineer some more explanatory variables for your model to chew on. But before you dive into the weeds, take a step back and consider the type of model you’ll need to build.

The problem as presented to you by your interviewer was one in which the “Category” of a crime needed to be predicted. The number of unique categories being greater than two means it’s a multi-class classification problem. This means you will be asking your model to estimate a function that can separate your data into distinct regions of an N-dimensional space, where N is the number of features in your dataset, such that each region can be labelled as a specific category with good accuracy. The more dimensions or features you have, the more degrees of freedom your classifier has to identify an accurate separator between crime categories, but that comes at the cost of computational complexity and runs the risk of overfitting to the dataset.

Remembering that nugget of wisdom, you’re ready to begin.

Some lazy novice data scientists might simply convert the “Dates” timestamp into a numerical date column, but for the savvy feature engineer, it contains other useful information. With some time-efficient vectorized calculations (see here for a quick recap on the merits of vectorization), you can add columns to the training and test sets that identify the time of day the crime was committed, whether the day was a holiday, and the season. The day of the week on which the crime occurred was already given to you, which saves you a bit of work, but it has a string datatype. Is that a problem?
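Here’s a sketch of those vectorized temporal features, using pandas’ built-in US federal holiday calendar as a stand-in for whichever holiday definition you prefer:

```python
from pandas.tseries.holiday import USFederalHolidayCalendar

def add_time_features(df):
    """Derive temporal features from the Dates timestamp (vectorized)."""
    df = df.copy()
    df["Hour"] = df["Dates"].dt.hour
    df["Month"] = df["Dates"].dt.month
    # Meteorological seasons: 0 = Dec-Feb, 1 = Mar-May, 2 = Jun-Aug, 3 = Sep-Nov
    df["Season"] = df["Month"] % 12 // 3
    holidays = USFederalHolidayCalendar().holidays(
        start=df["Dates"].min(), end=df["Dates"].max()
    )
    df["IsHoliday"] = df["Dates"].dt.normalize().isin(holidays).astype(int)
    return df

train = add_time_features(train)
test = add_time_features(test)
```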

Not for pandas! You can efficiently convert it to 7 dummy variable columns (one for each day of the week), where each column is a binary indicator of whether the crime occurred on that day (value = 1) or not (value = 0). You should convert the police department district feature to dummy variables as well, but don’t forget about the curse of dimensionality! You want to make sure your model converges.
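With pandas, both conversions are one call:

```python
# One-hot encode the day of week and police district columns in both sets
train = pd.get_dummies(train, columns=["DayOfWeek", "PdDistrict"])
test = pd.get_dummies(test, columns=["DayOfWeek", "PdDistrict"])
```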

Spatial features offer another avenue for feature engineering, via mapping publicly available census data to a zip code. Population density, socioeconomic information, and age distribution within a zip code would be ideal features to include in your model. But each crime’s zip code wasn’t provided, only its location coordinates. Fortunately, the uszipcode library offers a simple way of obtaining the closest zip code to a coordinate pair and allows access to census data! Two birds with one scone. By using the documentation as a guide, you can write a function to perform zipcode lookup within a 5 mile radius and another to select the features you want to include in your model.
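A sketch of that lookup using uszipcode’s SearchEngine; the attribute names follow the library’s documentation at the time of writing, so double-check them against the version you install (and note that X is longitude and Y is latitude in this dataset):

```python
from uszipcode import SearchEngine

search = SearchEngine()  # the simple zip code database covers these fields

def lookup_zip_features(lat, lon):
    """Return census-style features for the closest zip code within 5 miles."""
    results = search.by_coordinates(lat, lon, radius=5, returns=1)
    if not results:  # no zip code found within the radius
        return {"zipcode": None, "population_density": None,
                "median_household_income": None}
    z = results[0]
    return {"zipcode": z.zipcode,
            "population_density": z.population_density,
            "median_household_income": z.median_household_income}

# Example: features for the first crime record (Y = latitude, X = longitude)
print(lookup_zip_features(train.loc[0, "Y"], train.loc[0, "X"]))
```

Since the training set contains hundreds of thousands of rows but only a handful of distinct zip codes, it’s worth caching lookups (for example, keyed on rounded coordinates) rather than calling the search once per row.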

The feature selection function is where you’ll need to apply your data structures knowledge, as homebrewed external libraries like uszipcode often leave the data munging to the user. In this case, it’s a matter of substantial trial and error and plenty of print statements for you to filter through the nested dictionaries and pull out clean data.

After temporal and spatial feature engineering, you should be feeling confident going into the modeling stage of this task. While features like population density and household income are traditionally predictive, they are limited by zip-code-level granularity, meaning there will only be about 25 unique values for each.

Step 3: Model

With a feature engineered training and test set that look identical aside from the missing “Category” label column from the test set, you’re now ready to build your multi-class classifiers. Another important data science practice to implement is the use of model comparison, especially when you’re uncertain of which will perform the best at the given task. Every data set has different characteristics (linear vs. nonlinear, high/low variance, etc.) which different modeling methods specialize in handling. A good first question is whether you think your data set is linearly separable. If not, then linear parameter models (LPMs) like support vector machines will not produce the best results unless you transform the data with a kernel.

Due to time constraints on the task, you decide to test three common models of reasonable diversity, trusting the default hyperparameters: multinomial logistic regression, a random forest classifier, and k-nearest neighbors.

The multinomial logistic regression fits a linear combination of the input features to produce a probability estimate of a sample belonging to each class. The feature coefficients are estimated jointly by maximizing the likelihood of the observed classes, with a softmax (multinomial logit) link mapping the linear scores to class probabilities.

A random forest classifier builds a randomly generated set of decision trees (each trained on a subsample of the full dataset, drawn with replacement (“bootstrapping”) by default) and averages their predicted class probabilities to classify each sample. Unlike logistic regression, each tree considers only a random subset of the features when choosing its splits, which also makes the ensemble more robust to overfitting. Since decision trees are nonlinear, the random forest provides a nice complement to logistic regression in case the relationships between the features and the category are nonlinear.

K-nearest neighbors, or KNN, is a simple algorithm that plots all of the training samples in an N-dimensional space (where N is the number of features in the dataset) along with their known classes. When asked to classify an unknown datapoint, it finds the K points nearest to it and aggregates their classes to produce a prediction. If K = 1, for example, the unknown point simply takes the class of its single nearest neighbor.
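Instantiating the three candidates with scikit-learn’s defaults (aside from bumping max_iter so the logistic regression converges, an assumption on my part) might look like this:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Default hyperparameters, per the time constraints mentioned above
models = {
    "logistic_regression": LogisticRegression(multi_class="multinomial",
                                               solver="lbfgs", max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, n_jobs=-1),
    "knn": KNeighborsClassifier(),
}
```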

The metric(s) you select for evaluating your models are also important. Your first thought may be to maximize accuracy (the count of correctly classified points divided by the total number of points), but accuracy ignores how confident your model is in each prediction and can be dominated by the most common categories in an imbalanced multi-class problem like this one. Instead, you should aim to minimize the log-loss score, which accounts for your models’ uncertainty in their predictions. (Conveniently, this is also the metric used by the Kaggle competition.)

Now that you have your models and metrics, it’s time to train the models. Make sure you split the training set into training and validation sets so that you withhold some known samples from the model to test it with. This may seem obvious since the hacked test set provided to you is missing its category labels anyway, but even if it had them, it’s better to split your training set and accept the reduction in training samples than to overfit your model to the test set and reduce its generalizability.
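A sketch of that split and the comparison loop, assuming a hypothetical `features` list naming the engineered feature columns built above:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

X = train[features]          # `features` = your engineered feature columns
y = train["Category"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

for name, model in models.items():
    model.fit(X_train, y_train)
    val_probs = model.predict_proba(X_val)
    score = log_loss(y_val, val_probs, labels=model.classes_)
    print(f"{name}: validation log-loss = {score:.4f}")
```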

While your random forest classifier and your KNN model each ran much more efficiently than the multinomial logistic regression, they had worse loss performance. If you needed to run your model frequently (rather than the one-off data generation task provided by your interviewer) then perhaps you’d be willing to accept the worse performance in favor of the faster runtime, but that is not the case here. So why did logistic regression perform better even though it was a linear model?

Some googling allows you to discover that logistic regression performs better than random forest when the number of noise variables is less than the number of explanatory variables (Kirasich et al., 2018). If true, this suggests that the majority of features you engineered were predictive; great job!

Step 4: Predict

Once you’ve selected your best performing model (logistic regression), you’re ready to generate the missing classes. You retrain it using the same hyperparameters on the entire training dataset (including the validation set this time) and estimate category predictions on the test set.
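A sketch of that final pass, reusing the X, y, and `features` placeholders from above and assuming the Kaggle test file’s Id column:

```python
# Refit the best model on the full training set, then predict class
# probabilities for the hacked ("test") records
best_model = LogisticRegression(multi_class="multinomial",
                                solver="lbfgs", max_iter=1000)
best_model.fit(X, y)
test_probs = best_model.predict_proba(test[features])

# Kaggle's submission format: one probability column per crime category
submission = pd.DataFrame(test_probs, columns=best_model.classes_)
submission.insert(0, "Id", test["Id"])
submission.to_csv("submission.csv", index=False)
```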

After exporting this DataFrame containing your predictions, you’re ready to submit your results to your interviewer (or simply upload the csv file to Kaggle in order to calculate your final log-loss error). In the real world, you’d have no way to tell if your categorizations were accurate, but the Kaggle Gods can tell you. Assuming the log loss metric on upload approximates your validation set log loss metric, you can rest easy knowing it was not overfitted. If, on the other hand, the test set log loss is much greater than your validation set log loss, then you need to re-examine your hyperparameters or model selection to reduce overfitting.

In any case, you were able to produce justifiable results with reasonable ease under time pressure for an organization you hadn’t even been hired to join yet. Regardless of whether your interviewer replaces the missing historical crime details with your model’s predictions after she compares them with whatever data her team can recover, she’s impressed with your data science skills. Perhaps you aren’t yet at Sherlock Holmes’ level, but you got the job… and there’s always the next episode.

References

Kirasich, K., Smith, T., & Sadler, B. (2018). Random Forest vs Logistic Regression: Binary Classification for Heterogeneous Datasets. SMU Data Science Review, 1(3), 9.

Hu, S. (2019). uszipcode. Retrieved November 6, 2019, from https://pypi.org/project/uszipcode/
