The Battle of Neighborhoods
Picking a location in San Francisco for opening a new restaurant
An investor is looking to open a new restaurant in San Francisco, but he is not sure about the best location for his new venue. So you get a call from him asking for your input to help him choose the location. San Francisco is a very busy city, best known for tourist attractions and business innovation. Strolling around the city blocks, it is pretty easy to notice that the city already has a lot of restaurants. How should we proceed and decide on a location?
In The Art of War, Sun Tzu said: “Know the enemy and know yourself; in a hundred battles you will never be in peril.” Following this line of thinking, the basic strategy is to identify the most critical factors that contribute to a restaurant’s profitability. According to a report by Tom Larkin published in FSR magazine, these components stand out as the most important: visibility, parking, space size, crime rates, surrounding businesses and competitor analysis, accessibility, affordability, and safety. Using public datasets, we can address some of these considerations quite directly. The city of San Francisco maintains a large data repository hosted on the website DataSF. From there, we can access the city’s crime rate, housing price, and public parking information. Sounds good! So let’s proceed and see what insights can be extracted from the data. We will be working with two datasets: Police Department Incident Reports: 2018 to Present and San Francisco Historical Secured Property Tax Rolls 2007–2015. The code described in this post can be found here or here.
First, the crime rate. The Incident Report dataset is in .csv format, which can be easily imported into pandas:
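A minimal sketch of this step (the rows below are made up, standing in for the DataSF export, which in practice you would load from the downloaded CSV file):

```python
import io
import pandas as pd

# Tiny stand-in for the DataSF export; the real file has many more columns
# and rows. In practice: df = pd.read_csv("path/to/incident_reports.csv")
csv_text = """Incident ID,Incident Category,Police District,Latitude,Longitude
1001,Larceny Theft,Central,37.8080,-122.4177
1002,Assault,Mission,37.7599,-122.4148
1003,Burglary,Tenderloin,,
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # -> (3, 5)
```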
There are lots of columns, but for our purpose we only need ‘Incident Category,’ ‘Police District,’ ‘Latitude,’ and ‘Longitude.’ One notices that multiple rows can share the same Incident ID. This is because a single report can have multiple incidents associated with it, each appearing as its own row in the dataframe. However, we’ll skip this detail for the moment. After dropping unwanted columns and cleaning up NaNs, the number of crimes can be plotted by category:
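The cleanup and counting might look like the following (the sample rows are made up; the column names are the ones from the dataset):

```python
import io
import pandas as pd

# Made-up incident rows with the four columns we keep
csv_text = """Incident Category,Police District,Latitude,Longitude
Larceny Theft,Central,37.8080,-122.4177
Larceny Theft,Mission,37.7599,-122.4148
Assault,Mission,37.7599,-122.4148
Burglary,Tenderloin,,
"""
df = pd.read_csv(io.StringIO(csv_text))

cols = ["Incident Category", "Police District", "Latitude", "Longitude"]
crimes = df[cols].dropna()  # drop rows with missing coordinates
counts = crimes["Incident Category"].value_counts()
# counts.plot(kind="barh")  # horizontal bar chart (requires matplotlib)
```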
Apparently, the number one category is larceny theft: a bit more than 30k incidents in the span of approximately 9 months (up to the time the dataset was downloaded), or over 3,000 incidents per month across the entire city. The second most frequent category is assault, followed by burglary (we ignore ‘Other Miscellaneous’ and ‘Miscellaneous Mischief’), but the frequency of these two categories is far lower than that of theft. Alright, now we know the type of crime committed most often in the city. What about its distribution? We can take advantage of the ‘Police District’ column. Using the value_counts function, the total number of incidents grouped by Police District is plotted as follows:
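The per-district tally is a one-liner with value_counts (toy rows again):

```python
import pandas as pd

# Toy cleaned-up incident table (made-up rows)
crimes = pd.DataFrame({
    "Incident Category": ["Larceny Theft", "Larceny Theft", "Assault"],
    "Police District": ["Central", "Mission", "Mission"],
})
district_counts = crimes["Police District"].value_counts()  # sorted descending
print(district_counts.to_dict())  # -> {'Mission': 2, 'Central': 1}
```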
It looks like the Central, Mission, Northern, Southern, and Tenderloin Districts combined account for the majority of the incidents in the report. This is perhaps not surprising, because these are the busiest districts in San Francisco. For example, the Central District covers Fisherman’s Wharf and the Embarcadero, which have many waterfront attractions.
While the above bar charts are useful for comparing categories, it would be nice to put these data on a map using their GPS coordinates. This can be achieved with the GeoPandas and folium packages. We first form Shapely geometry objects by combining the ‘Latitude’ and ‘Longitude’ columns and merge them with the original dataframe to form a GeoDataFrame:
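A sketch of the GeoDataFrame construction, using GeoPandas’ points_from_xy helper (the coordinates below are made up):

```python
import pandas as pd
import geopandas as gpd

crimes = pd.DataFrame({
    "Incident Category": ["Larceny Theft", "Assault"],
    "Latitude": [37.8080, 37.7599],
    "Longitude": [-122.4177, -122.4148],
})
# points_from_xy takes x (longitude) first, then y (latitude)
gdf = gpd.GeoDataFrame(
    crimes,
    geometry=gpd.points_from_xy(crimes["Longitude"], crimes["Latitude"]),
    crs="EPSG:4326",  # WGS84 lat/lon
)
```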
The next step is to put each incident report (row) into a geographical unit and compute statistics for mapping. There are three levels of unit we could work with: (1) police districts, (2) census tracts, and (3) neighborhoods. The first is a bit coarse-grained; census tracts, on the other hand, are the finest units; the neighborhood unit sits somewhere between a district and a census tract. The method provided by GeoPandas works with any of them. Here we choose to work with the neighborhoods defined by the San Francisco Association of Realtors. The shape files can be downloaded from DataSF and imported easily using geopandas.read_file():
Each row is a neighborhood defined by a Shapely polygon object. We can plot the shape file easily using the object’s plot method:
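For illustration, here is a toy stand-in for the loaded shapefile, built with two made-up rectangular neighborhoods (the real data comes from gpd.read_file on the DataSF download):

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Stand-in for the Realtor neighborhood shapefile; in practice:
# nbrhoods = gpd.read_file("path/to/realtor_neighborhoods.shp")
nbrhoods = gpd.GeoDataFrame(
    {"nbrhood": ["North Beach", "Mission"]},
    geometry=[
        Polygon([(-122.42, 37.79), (-122.40, 37.79),
                 (-122.40, 37.81), (-122.42, 37.81)]),
        Polygon([(-122.43, 37.74), (-122.40, 37.74),
                 (-122.40, 37.77), (-122.43, 37.77)]),
    ],
    crs="EPSG:4326",
)
# nbrhoods.plot()  # draws the polygons (requires matplotlib)
```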
How do we bin the crime reports by neighborhood? Similar to SQL’s JOIN, which lets us selectively combine two tables, GeoPandas’ sjoin provides a way to spatially join two GeoDataFrames. Because we want to count every incident that falls within each neighborhood, we set the parameter op='intersects' when calling sjoin. The results are then grouped by neighborhood:
We can then merge the result with the nbrhoods GeoDataFrame using the ‘nbrhood’ column as the merge key:
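Taken together, the sjoin, groupby, and merge steps might look like this (toy polygons and incident points, all made up; note that recent GeoPandas releases use predicate= where older versions used op=):

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Toy data: two rectangular neighborhoods and three incident points
nbrhoods = gpd.GeoDataFrame(
    {"nbrhood": ["North Beach", "Mission"]},
    geometry=[
        Polygon([(-122.42, 37.79), (-122.40, 37.79),
                 (-122.40, 37.81), (-122.42, 37.81)]),
        Polygon([(-122.43, 37.74), (-122.40, 37.74),
                 (-122.40, 37.77), (-122.43, 37.77)]),
    ],
    crs="EPSG:4326",
)
incidents = gpd.GeoDataFrame(
    {"Incident Category": ["Larceny Theft", "Assault", "Burglary"]},
    geometry=[Point(-122.41, 37.80), Point(-122.41, 37.75), Point(-122.42, 37.76)],
    crs="EPSG:4326",
)
# Spatial join: tag each incident with the neighborhood polygon it falls in
joined = gpd.sjoin(incidents, nbrhoods, predicate="intersects")
crime_counts = joined.groupby("nbrhood").size().reset_index(name="count")
# Merge the counts back into the neighborhood GeoDataFrame
nbrhoods = nbrhoods.merge(crime_counts, on="nbrhood", how="left")
```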
We can now build a map from the result, which includes the geospatial information as well as the statistics. There are several Python packages that allow us to plot geospatial data; here we choose folium. In particular, we are going to generate a choropleth map:
There! A San Francisco neighborhood-level crime map. We also annotate the map with popups (green dots) that display detailed crime data for each neighborhood.
Following almost exactly the same procedure, we can visualize the housing data. The dataset (San Francisco Historical Secured Property Tax Rolls) contains assessed fixtures, improvements, land, and personal property values. The housing price is estimated as the sum of these four assessed values. There is only one subtlety: instead of calculating the total price per neighborhood, we want the average price. So after we sjoin the housing and nbrhoods dataframes, we group the results by neighborhood and call the mean() function. The results are merged back into nbrhoods again:
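The aggregation step can be sketched as follows; the rows and column names below are illustrative stand-ins for tax-roll records that have already been tagged with a neighborhood by the spatial join:

```python
import pandas as pd

# Toy tax-roll rows (values made up, in millions of dollars)
housing = pd.DataFrame({
    "nbrhood": ["North Beach", "North Beach", "Mission"],
    "fixtures": [0.1, 0.2, 0.1],
    "improvements": [0.5, 0.7, 0.4],
    "land": [0.6, 0.8, 0.3],
    "personal_prop": [0.0, 0.1, 0.1],
})
# Price estimate = sum of the four assessed values per property,
# then the mean (not the total) per neighborhood
housing["price"] = housing[["fixtures", "improvements",
                            "land", "personal_prop"]].sum(axis=1)
avg_price = housing.groupby("nbrhood")["price"].mean()
print(avg_price.to_dict())
```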
Note that the average price is in units of millions of dollars. The housing choropleth map looks like this:
Now that we have the crime and housing price maps, we are going to study these neighborhoods using the Foursquare API. For each neighborhood, the Foursquare search engine returns a list of the top ten most common venues. Based on the venue data, the neighborhoods can then be clustered according to some similarity measure. The results of the Foursquare Explore API call look like the following:
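For reference, here is how such an explore request might be assembled (the credentials are placeholders and no request is actually sent; a real call would pass the URL to something like requests.get(url).json()):

```python
from urllib.parse import urlencode

# Build the Foursquare v2 explore request URL.
# client_id/client_secret are placeholder values.
params = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "v": "20180605",            # API version date
    "ll": "37.8037,-122.4107",  # a neighborhood's representative point
    "radius": 500,              # search radius in meters
    "limit": 10,                # top ten venues
}
url = "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)
```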
Here we have used a pipeline to channel the JSON response from the search into a pandas DataFrame. Note the last column, ‘Venue Category.’ The API requires the GPS coordinates of each neighborhood. The GeoDataFrame nbrhoods only defines polygons, but we can use GeoPandas’ representative_point() function to extract a location that is guaranteed to lie inside each neighborhood. (This is also how the locations of the popups were generated.) The coordinates are recorded in the ‘nbrhood Latitude’ and ‘nbrhood Longitude’ columns of the dataframe above. To cluster the neighborhoods, we simply apply the k-means algorithm to the one-hot encoded venue dataset. Assuming there are 5 different clusters, the map looks like this:
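The one-hot encoding and clustering step can be sketched like this (the venue rows are made up; the post clusters the full dataset into 5 groups, but with only a handful of toy rows we use 2 here so the sketch runs):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy venue table (made-up rows)
venues = pd.DataFrame({
    "nbrhood": ["North Beach", "North Beach", "Mission",
                "Mission", "Marina", "Marina"],
    "Venue Category": ["Italian Restaurant", "Cafe", "Taco Place",
                       "Bar", "Cafe", "Bar"],
})
# One-hot encode the categories, then average per neighborhood so each
# neighborhood becomes one feature vector of category frequencies
onehot = pd.get_dummies(venues["Venue Category"])
grouped = pd.concat([venues[["nbrhood"]], onehot], axis=1) \
            .groupby("nbrhood").mean()
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(grouped)
```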
Apparently a lot of the neighborhoods fall into the red cluster. When we actually look at the red cluster, it becomes clear that the most common venues in these neighborhoods are restaurants: bars, cafes/coffee shops, and Chinese/Japanese/Korean/Italian/American restaurants. So yes, the clustering results are consistent with our impression: San Francisco indeed has lots of restaurants already! So where should we put our new restaurant? Checking the crime and housing price maps, the North Beach neighborhood looks like a good candidate. The area is close to the city’s waterfront attractions, so we expect it to see a lot of foot and car traffic, i.e., good visibility. The maps also indicate that North Beach had a relatively low crime rate in 2018 and a reasonable housing cost. By calling the Foursquare API again and narrowing our search to the food sector, we can obtain a detailed map of the restaurants in North Beach. Most of the venues are located in the blocks along Columbus Avenue, with a few sitting along Broadway.
So this is a first-order answer to the question “Where should we open a new restaurant in San Francisco?” Using public datasets, we are able to, at least partially, address a few of the factors mentioned at the beginning: crime rate and housing cost. We also carry out a very simple competitor analysis based on the distribution of restaurants in the chosen neighborhood. There is certainly a lot of room for improvement. For example, we have yet to answer the question about parking space. DataSF has public parking datasets, which could help us understand the landscape of parking availability. We might also use ParkWhiz to locate public as well as private parking spaces. The competitor analysis could be refined by segregating the restaurants in North Beach into categories; this information is extremely useful because we certainly don’t want too many competitors in the same sector. While the approach discussed here is primitive, it nevertheless showcases the usefulness of data analysis!