The City, Through its People

I recently had the opportunity to explore the city of San Francisco through data collected by its denizens. Well, some of the denizens — those who bother to rate/review businesses on Yelp.

The Yelp dataset turns out to encode a surprising amount of insight into the city, from information about business quality (as perceived by Yelpers, of course) and density to neighborhood boundaries and distribution of services across the city.

The following is a collection of data sketches: initial prototypes that informed our investigation into local data journalism. Stamen undertook this investigation on behalf of a client, to help identify stories of local interest using the data generated by the neighbors of each hood.

Step One: Dots on a map

Rating (double-encoded)

Simply size and color each business in a 10k-entry sample of the dataset by its rating, and put each on the map. Not a whole lot of new information here, though we can see overall density. The second iteration encodes higher ratings as brighter spots, generating a faux-heatmap effect.

Step Two: Ratings + number of reviews

Rating as color, number of reviews as size

Adding another variable starts to tell a new story, not just about business quality but also how frequently businesses are patronized, and possible correlations between the two. Filtering and zooming in highlights some interesting findings:

Left: parks; Right: nightlife

On the left, parks in San Francisco. People like parks! Very few poor ratings for parks. On the right, nightlife clusters and corridors: SoMa and the Mission brightly lit, and high density in the Tenderloin and the Polk St. corridor. Moving outside the city center, Upper Market, Upper Haight, Divisadero, and other hotspots become apparent.
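The rating-as-color, reviews-as-size encoding can be sketched roughly as follows. This is an illustration only, not the code behind the prototypes: the color ramp, the `max_radius` and `max_count` defaults, and the function names are all assumptions made for the example.

```python
import math

# Hypothetical five-step ramp from red (1 star) to green (5 stars);
# the colors used in the actual prototypes are not known.
RATING_COLORS = ["#d73027", "#fc8d59", "#fee08b", "#91cf60", "#1a9850"]


def rating_color(rating: float) -> str:
    """Map a 1-5 star rating to one of five colors (half stars round down)."""
    clamped = min(max(rating, 1.0), 5.0)
    index = min(int(clamped) - 1, len(RATING_COLORS) - 1)
    return RATING_COLORS[index]


def review_radius(review_count: int, max_radius: float = 20.0,
                  max_count: int = 5000) -> float:
    """Scale the radius with the square root of the review count,
    so that circle area (not radius) is proportional to the count."""
    capped = min(max(review_count, 0), max_count)
    return max_radius * math.sqrt(capped / max_count)
```

The square-root scale is a deliberate choice here: sizing circles by radius directly would make heavily reviewed businesses look far more dominant than their review counts justify.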

Step Three: Add a dimension

Number of reviews as height

This was a fun one. The concept is simple: encode number of reviews as height on a 3D map. The obvious metaphor of skyscrapers holds if we think of number of reviews as a proxy for density — these towers are not full of people living or working, but people passing through.

Restaurants reliably get many reviews

One particularly interesting insight came from this prototype: San Franciscans are devoted to local businesses and despise chains.

Left: chains as tall red gashes; Right: there are exceptions…

Almost without exception, any tall red bar indicates a chain of some sort. The good people of SF love to hate on Best Buy, McDonald’s and their ilk. (Oh yeah, and property management companies.) One notable exception appeared on investigation, however: a cluster of heavily reviewed and highly despised Chinatown restaurants. Steer clear of Grant & Sacramento.

Step Four: More basic analysis

Left: top categories by neighborhood; Right: examining relationship between rating and number of reviews

Some folks might have opted to start with this kind of analysis. I decided to focus on geospatial exploration first, then circle back to bar charts and scatterplots. The Yelp dataset is deep enough to reveal interesting stories almost any way you slice it.

The bar chart above shows the top three categories per neighborhood; the scatterplot was an attempt to find a correlation between rating and number of reviews. More heavily reviewed businesses tend to fall in the 3–4 star range: below that, the negative energy pulls down the will to review, and above that, there are always enough haters to pull you back down from 5 stars. Or at least, that was the narrative we told ourselves.
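Both summaries are simple aggregations. Here is a minimal sketch, assuming each business is a dict with hypothetical `neighborhoods`, `categories`, `rating`, and `review_count` fields (these names are made up for the example and may not match the fields Yelp actually returns):

```python
from collections import Counter, defaultdict


def top_categories(businesses, n=3):
    """Return the n most common categories for each neighborhood."""
    counts = defaultdict(Counter)
    for b in businesses:
        # A business counts toward every neighborhood it is assigned to.
        for hood in b["neighborhoods"]:
            counts[hood].update(b["categories"])
    return {hood: [cat for cat, _ in c.most_common(n)]
            for hood, c in counts.items()}


def mean_reviews_by_rating(businesses):
    """Average review count per star rating, ordered by rating."""
    totals = defaultdict(lambda: [0, 0])  # rating -> [review sum, business count]
    for b in businesses:
        totals[b["rating"]][0] += b["review_count"]
        totals[b["rating"]][1] += 1
    return {rating: total / count
            for rating, (total, count) in sorted(totals.items())}
```

Averaging review counts per star rating is one way to check the "3–4 star" story without relying on a single correlation coefficient, which would miss a relationship that rises and then falls.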

We did more simple analysis, but it doesn’t screenshot as well as the maps. So…back to the maps!

Step Five: Experiments

K-means geographical clustering. Say what?

We then ran some quick experiments to see what else we could learn about our city. Above is a demonstration of K-means clustering applied to San Francisco businesses; the algorithm assigns each point to the nearest of k cluster centers, moves each center to the mean of its assigned points, and repeats until the assignments settle. (The resulting cluster regions are basically a Voronoi diagram of the centers.) We’d hoped to identify commercial corridors smaller than Yelp’s neighborhood designations, but would need to put more work into this to get past the inherent limitations of the K-means algorithm. Specifically, commercial corridors tend to be linear, while K-means tends to generate more round, convex shapes.
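For readers who haven’t met K-means before, here is a minimal sketch of the standard (Lloyd’s) version in plain Python. It is not the code used for the map above: it treats longitude/latitude pairs as flat 2D points, which is only a rough approximation, and the function name, `iterations` cap, and `seed` parameter are assumptions for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two (x, y) points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(points, k, iterations=50, seed=0):
    """Cluster (x, y) tuples into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return centroids, labels
```

Because every point simply goes to its nearest center, the boundaries between clusters are straight lines and each cluster is convex, which is exactly why long, thin corridors get chopped into blobs.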

Neighborhood overlaps

Each business on Yelp can be assigned to as many neighborhoods as are relevant. Most businesses live in a single hood, but things get interesting on neighborhood boundaries. This map shows a number of things about how we define our environment: 

  • where neighborhoods overlap
  • where there is confusion / disagreement about neighborhood boundaries
  • which neighborhoods are known by more than one name

EmbarcaFiDiSoMaSoBeach: hot new real estate!

Zooming in a bit, we can make out outlines of Hayes Valley, Nob Hill, Tenderloin, and Union Square; we see the Financial District creeping across Market St. into SoMa; and it’s apparent that once you hit the water, all bets are off and you can just make up whatever names you want.
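One simple way to quantify these overlaps is to count how often two neighborhood names are attached to the same business. A minimal sketch, assuming each business carries a hypothetical `neighborhoods` list (the field name is made up for the example):

```python
from collections import Counter
from itertools import combinations


def neighborhood_overlaps(businesses):
    """Count businesses shared by each pair of neighborhoods."""
    pairs = Counter()
    for b in businesses:
        # Sort so ("SoMa", "Financial District") and the reverse
        # are counted as the same pair.
        hoods = sorted(set(b["neighborhoods"]))
        for pair in combinations(hoods, 2):
            pairs[pair] += 1
    return pairs
```

Calling `neighborhood_overlaps(businesses).most_common(10)` would then list the ten most entangled pairs, a quick way to spot both contested boundaries and places known by more than one name.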

This project was a lot of fun, largely because the data tell stories we’re familiar with: stories about how we perceive and use our city. Yelp data are, obviously, biased by the perspectives of Yelp users; even so, treated as a rough proxy for the population of San Francisco as a whole, they give us the opportunity to both reaffirm some existing assumptions about our home and break down others.

On the tech side, these prototypes are built on Mapbox GL JS and D3, with all the data pulled from Yelp’s public API. We won’t be releasing these to the public in their current form because of the large amount of data each prototype pulls down.