Everyone Poops

Everyone Poops is the timeless tale of humans and animals eating food and pooping. What Taro Gomi didn’t prepare us for is the fact that some people poop on the street. Here in San Francisco, human waste is a growing issue, both for the people who run into it and for the people who have no option but to relieve themselves on public streets.

We can see the growth of human waste on the street in the reports made to San Francisco’s 311, a city service that routes residents’ requests for infrastructure and maintenance fixes. Callers can report anything from a broken street lamp to an overflowing garbage can. Human waste cleanup is the 3rd most requested service on the platform and the fastest-growing issue.

Reported incidents of human waste

This is a multi-faceted problem, with many potential solutions that are best solved by social scientists. However, I think there is a place for data science in this conversation. I wanted to contribute by building a model that predicts where and when human waste will show up. This model could be used to better inform resource allocation for programs like San Francisco’s Pitstop, which brings portable bathrooms to areas with large homeless populations. These pit stops provide a dignified option for people who don’t have regular access to bathrooms while avoiding a public health hazard.

Whatever solution the city ends up pursuing, its decisions should be informed by the data collected by 311. Decision makers can use this information to create more targeted solutions.


The Data

311 provides the data with a few basic pieces of information: category, latitude, longitude, and a timestamp.

First 5 rows of the data set
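If you want to follow along, here is a minimal sketch of loading the 311 export with pandas. The file name and column names are assumptions based on the fields described above, not the exact schema of the DataSF export.

```python
import pandas as pd

# Load the 311 export (the file name and column names here are
# assumptions; the real DataSF export may differ).
df = pd.read_csv("311_cases.csv", parse_dates=["timestamp"])

# Keep only the human-waste cleanup requests.
waste = df[df["category"] == "Human Waste"].copy()

print(waste[["category", "latitude", "longitude", "timestamp"]].head())
```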

When we start analyzing the data, we can see that particular neighborhoods are affected more than others. Areas like Downtown and the Mission District have much higher incidence rates of human waste than areas like the Richmond District. We also see patterns emerge across time of day and day of week. This leads me to believe that there is an opportunity to improve the current rollout of Pitstop stations, which is more or less static.

Incidence rates by different time aggregations
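As a rough sketch of how those time-of-day and day-of-week patterns can be pulled out of the data (continuing with the hypothetical waste dataframe from the loading sketch above):

```python
# Count reports by hour of day and by day of week.
waste["hour"] = waste["timestamp"].dt.hour
waste["weekday"] = waste["timestamp"].dt.day_name()

print(waste.groupby("hour").size())     # time-of-day pattern
print(waste.groupby("weekday").size())  # day-of-week pattern
```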

How to Model the Problem

The challenge with this modeling problem was defining the scope of the predictions. Since cleanup requests are recorded as latitude/longitude points and a timestamp, the problem could be made infinitely complex or trivial. Herein lies the tradeoff between model utility and model accuracy that I balanced while defining the scope of this project.

On one end of the spectrum, I could have tried to make predictions for every block in San Francisco for every minute. This design would be very useful for city planners, but nearly impossible to model. To know, down to the minute, that a particular block in San Francisco will have human waste would be very powerful. However, given how granular this design is, the incidence rate for human waste (of all the square feet and minutes in our analysis, how many had human waste present?) is ~0.0001%.

Training a model on classes this imbalanced would be nearly impossible. In situations like this, the model tends to predict 0 for every observation (0 meaning no poop in our example). The model would be 99.9999% accurate by guessing 0 every time, but 100% useless, because we don’t need data science to do that.
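To make that concrete, here is a small illustration with simulated labels (not the real data) of how a majority-class baseline earns near-perfect accuracy while being completely useless:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Simulated labels: 1 = human waste present. An incidence rate of
# ~0.0001% works out to roughly one positive per million observations.
y = np.zeros(1_000_000, dtype=int)
y[0] = 1
X = np.zeros((len(y), 1))  # the features don't matter for this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
preds = baseline.predict(X)

print("accuracy:", accuracy_score(y, preds))       # 0.999999
print("f1:", f1_score(y, preds, zero_division=0))  # 0.0 — never finds waste
```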

On the other end of the spectrum, we could break San Francisco up into its 12 major neighborhoods and only look at the data hour by hour. This design would produce very accurate predictions, but they wouldn’t be very useful. The model might predict that there will be human waste at 9 AM in the Mission, but that wouldn’t tell us anything we didn’t already know.

I ended up handling the temporal aspect of this problem by asking: ‘Did human waste occur before or after 12 PM on any given day?’ For the spatial aspect, I used the k-means clustering algorithm to group blocks together iteratively, testing many different levels of aggregation. I was satisfied with the accuracy/utility trade-off when I grouped roughly four blocks together.

Examples of block clusters for the Mission District. I ended up using 1,809 clusters for the entire city, ~4 blocks per cluster.
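Here is a sketch of that aggregation step, continuing with the cleaned 311 dataframe from earlier; the cluster count matches what I used, but the rest of the details are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Group report locations into clusters with k-means; 1,809 clusters
# worked out to roughly four blocks per cluster citywide.
kmeans = KMeans(n_clusters=1809, n_init=10, random_state=0)
waste["cluster"] = kmeans.fit_predict(waste[["latitude", "longitude"]])

# Collapse each day into two windows: before and after 12 PM.
waste["date"] = waste["timestamp"].dt.date
waste["half_day"] = (waste["timestamp"].dt.hour >= 12).astype(int)

# Build the full grid of (cluster, date, half-day) observations and
# mark combinations with at least one report as positives.
grid = pd.MultiIndex.from_product(
    [range(1809), waste["date"].unique(), [0, 1]],
    names=["cluster", "date", "half_day"],
).to_frame(index=False)

positives = waste[["cluster", "date", "half_day"]].drop_duplicates()
positives["target"] = 1
dataset = grid.merge(positives, how="left").fillna({"target": 0})
```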

After finalizing the design of this problem, the next step was to bring in outside data to provide context on why human waste was showing up in some areas, but not others.


Outside Data

There were several outside data sources that brought a lot of predictive power to this model. The first was survey results from the American Community Survey. This survey is a bit like the U.S. Census, except it reaches fewer people and asks more detailed questions. From it I used features like unemployment levels, income levels, and average rent.

Sample question from the ACS

The second outside data source was the land use data set provided by DataSF Open Data. This data set described what types of buildings were on each block, answering questions like: how much square footage is residential vs. commercial?

The third data source was a survey of pedestrian volume for each block, also from DataSF. It recorded the median foot traffic for the year and provided descriptive information about each block, like the presence of a traffic light or a stop sign.

Residential Block vs a Commercial Block

After combining all of these data sets, I had a rich view of the blocks of San Francisco, and together they provided enough signal to model the problem effectively.
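As a sketch, the combined feature table comes together with a few joins, assuming each outside source has already been aggregated to the same cluster IDs used above (the column names below are hypothetical):

```python
import pandas as pd

# Toy stand-ins for the three outside data sources, keyed by the same
# cluster IDs used above (all column names are assumptions).
acs = pd.DataFrame({"cluster": [0, 1],
                    "unemployment_rate": [0.08, 0.04],
                    "median_rent": [2400, 3100]})
land_use = pd.DataFrame({"cluster": [0, 1],
                         "residential_sqft": [120_000, 40_000],
                         "commercial_sqft": [30_000, 90_000]})
foot_traffic = pd.DataFrame({"cluster": [0, 1],
                             "median_pedestrian_volume": [850, 4200]})

# One left join per source yields a single feature row per cluster.
features = (acs.merge(land_use, on="cluster", how="left")
               .merge(foot_traffic, on="cluster", how="left"))
print(features)
```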


The Model

I tried several classification algorithms but ended up using a random forest for the final predictions. Scikit-learn makes working with imbalanced classes easy by providing the class_weight parameter.

After grid searching hyperparameters like max_depth, n_estimators, and class_weight, my model had an F1 score of 0.43. While this might not be a mouth-watering upper-90s score, I think it is still helpful in addressing the problem at hand. The model identifies hot spots and surfaces predictions that change with the time of day, season, etc.
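The search itself looked roughly like the sketch below; the grid values are illustrative, and the synthetic data stands in for the real feature table:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the real (features, target) table.
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [10, 20, None],
    "class_weight": ["balanced", {0: 1, 1: 10}],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",  # optimize F1 rather than accuracy on imbalanced classes
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```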

Here is a static image of the interactive visualization I created to display the output of my model. The darker the block (or rather, cluster of blocks), the more likely there is to be human waste at that point in time.
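For reference, a minimal version of the static map can be drawn with matplotlib; the coordinates and probabilities below are random stand-ins for the k-means cluster centers and the random forest’s predict_proba output:

```python
import matplotlib.pyplot as plt
import numpy as np

# Random stand-ins for the 1,809 cluster centroids and their predicted
# probabilities (the real values come from k-means and the fitted model).
rng = np.random.default_rng(0)
lons = -122.51 + 0.12 * rng.random(1809)
lats = 37.70 + 0.10 * rng.random(1809)
probs = rng.random(1809)

plt.scatter(lons, lats, c=probs, cmap="Reds", s=8)
plt.colorbar(label="Predicted probability of human waste")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Predicted hot spots (darker = more likely)")
plt.show()
```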

How This Can Help

I believe this model adds to our current understanding by identifying the geographic and temporal underpinnings of the problem. That means when neighborhoods change, as they inevitably do, the model will be able to continue to provide accurate predictions. I hope that this can help advance efforts to keep San Francisco’s streets clean and provide citizens with the services they deserve.

Check out the source code on my GitHub and run it to see the interactive elements of this chart!