The Calendar of Billions at Your Fingertips

Kristof Boghe
Published in Analytics Vidhya
Sep 28, 2020

The unexplored opportunities of Google popular times data within academia and business

In short:

[1] I explain why Google popular times data constitutes a hitherto unexplored data goldmine for researchers.

[2] I build my own Google maps scraper and collect popular times data on more than 13.000 locations across Europe.

[3] I show how you can geocode and map all these locations at no cost, and how you can leverage these digital traces to gain insight into intercultural differences in time budgets (i.e. how people in a particular culture tend to spend their time).

[4] I focus on sources of bias and other potential pitfalls when using Google popular times data that future researchers should grapple with when setting up their research.

During the last decade or so, researchers in academia have lauded the possibilities granted by the ever-increasing production of digital traces to revolutionize the way we study social reality. By studying patterns in search results, pictures, tweets and the like, we could study the offline world through an online lens, or so the argument goes. The always charismatic professor Richard Rogers is one of the most well-known advocates for this new methodological approach and is more or less the godfather of what we nowadays call ‘digital methods’.

Some of Rogers’ ideas were instantly adopted and are by now part of the data analyst’s toolbox. Think for example about so-called ‘Google trends’ research, used by journalists, bloggers and academics to discern the rise and fall of ‘hot topics’ in the public sphere. Or what about predicting the surge of the flu season using the very same data? Notwithstanding some of its successes, the methodological paradigm of Rogers is — exactly because of its infectious enthusiasm and optimism — also hopelessly naïve, risking to obscure some of the pitfalls accompanying the analysis of digital traces. All things considered, the online world is riddled with biases, from skewed and self-selected populations to structural platform limitations and algorithmic calculations that skew exposure. I’m reminded of this whenever I see a newscaster raging about what the folks on Twitter think about recent events; a platform that represents a highly educated, left-leaning, relatively young, self-selected subsample of the population being politically polarized by their own algorithmic filter bubbles. Translating knowledge from the online to the offline world should trigger all kinds of alarm bells, and should in any case be accompanied by alternative readings and interpretations by the researcher.

Still, I’m writing this piece to convince you that there is an overlooked goldmine of digital trace data readily available, namely Google popular times data. You’ve probably noticed the popular times graph when searching for a particular place on Google maps. The graph displays how busy a place usually is (0%-100%) during any given hour of the week, relative to its busiest hour (which represents the upper limit of the scale and thus ‘100% busy’). The graph looks something like this:

Figure 1. The new digital methods goldmine in all its glory

I’ll argue throughout this blog post (a) why this type of data is useful for research in academia and business and (b) how you can collect, explore and analyze popular times data. I’ll write my very own Google maps scraper to collect data on restaurants and supermarkets in 30 European cities and perform some exploratory analyses in Python. I’ll also show you how you can automate forward geocoding at no cost, and how to acquire GeoJSON maps in bulk to visualize geographic data with ease. The end goal here is to inspire future researchers to think about new pathways that could leverage this kind of data in new and exciting ways to gain insight into our social world. I’ll minimize the amount of Python code presented here. This is not a step-by-step guide, but a relatively short and snappy demonstration.

Popular times data as uncharted territory

If you’re an Android user, chances are high that you’re already contributing to Google’s popular times database without even realizing it. Using your device’s geolocation data, Google matches your coordinates with a specific Google maps place (e.g. restaurant, shop, etc.) and estimates the average number of visitors in said place for a given hour in a particular day across a few months. If there’s a sufficient amount of data available, Google will produce the popular times graph for all to see. This metric gives you a rather robust estimate of how busy a location usually is, meaning that popular times data is not susceptible to exceptional and unexpected peaks and troughs in popularity. This makes it ideal for discerning general trends, which is exactly the kind of thing most academics and some long-term research projects in the private sector are interested in. However, those interested in temporary sudden shifts in traffic can still leverage the ‘live’ estimate of the popular times graph, although this is not the focus of this article.

One central and peculiar characteristic of Google popular times data is that it’s already standardized: the scale (0%-100%) represents how busy a location is when compared with the busiest time of that particular place. Depending on the research question at hand, this can be either an advantage or a disadvantage. When you’re interested in raw figures, such as which fast food joint in the country has the highest number of visitors, you’re out of luck. However, most of the time you want to filter out sources of bias stemming from the variability in population density. Otherwise, all McDonalds franchises in, let’s say, New York and L.A. will continuously be ‘busy’, while a McDonalds in the countryside will never reach any ‘peak popularity’ at all. This kind of bias is damaging because it doesn’t correspond with the actual perception of ‘busyness’ in reality. If 50 people order a Big Mac menu on Times Square at the same time, it will look like any ordinary day. The interior of the restaurant, the size of the staff and their workflow are all adapted to serve a large number of customers. Drop these 50 people in my local McDonalds, however, and it will look like a straight up invasion. This makes this type of data ideal for anyone interested in the ‘relative’ flow of people in and out of places.
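To make the scale concrete, here is a toy illustration of how raw hourly visit estimates could translate into the 0%-100% popular times values. This is an assumption on my part; Google’s actual formula is not public (more on that below).

```python
# Toy example (assumed logic, not Google's actual formula): each location's raw
# hourly visit estimates are rescaled so that its own busiest hour equals 100.
raw_visits = {"Mon 12h": 40, "Mon 13h": 80, "Sat 19h": 200}   # hypothetical averages
peak = max(raw_visits.values())
popular_times = {hour: round(100 * visits / peak) for hour, visits in raw_visits.items()}
print(popular_times)  # {'Mon 12h': 20, 'Mon 13h': 40, 'Sat 19h': 100}
```

Because the rescaling happens per location, a quiet countryside franchise and a Times Square flagship can both reach ‘100% busy’, which is exactly the relative reading described above.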

Let’s review just some of the possible applications within both academia and the private sector:

  • Academic applications:
    1. Cross-cultural ‘time budget’ analysis: how people tend to spend their time across cultures (e.g. the differences in dining culture in Northern versus Southern Europe)
    2. Analyzing the impact of a particular event on traffic patterns: how did the global COVID pandemic in 2020 change the time budget patterns? This assumes a longitudinal lens, which means that there should be at least two waves of data collection present for analysis.
    3. ‘Mapping’ a region for specific categories of places and correlating it with external data sources (e.g. income distribution, race,…): when do essential (social) services reach a peak in popularity throughout the week? Does this possibly have an impact on certain disadvantaged groups or workers (e.g. Workers in the service industry who are pushed to work during the weekend, nights, etc.)?
    4. Research within Urban Studies in general: When are certain areas (over)crowded? What is the impact of this kind of traffic (of people and/or vehicles) on the infrastructure and people of these neighborhoods?
  • Business applications:
    1. Improving customer service: A public transport company could use popular times data to estimate the flow of passengers in their train stations, ultimately resulting in a more reliable estimate of the number of passengers experiencing a delay. They could use these estimates to pinpoint hot spots in need of new infrastructure.
    2. Keeping an eye on the competition and the customer: a restaurant could ‘map’ their particular city and estimate when people tend to eat at home (i.e. when all restaurants tend to experience low popular times values). They could increase their advertising budget for promoting their food delivery service on social media around these ‘stay at home’ hotspots.

Of course, the question remains what kind of biases are present here. Are we talking about the movement of only a handful of Google maps users, who explicitly volunteered to take part in some sort of Google panopticon? Well, not exactly. In theory, Android users have to link their Google account with their device and have to ‘explicitly’ agree to opt in to the ‘location history’ service, but one should be aware that Android phones are designed in such a way that users are easily funneled into this opt-in. Not only does Google maps come pre-installed with every Android device, but end users are also heavily encouraged to couple their Google account with their Android device. When these two components are matched, collecting location data is done fully automatically and on the go, without any required intervention on the user’s side. For this reason, these data suffer less from a couple of nefarious and well-known sources of bias that are prevalent among most digital traces, such as social media activity (e.g. [1,2]). Unlike social media posts, for example, all it takes for these digital traces to be created is being present; the data is recorded and collected without having to use Google maps as such in the present moment. Of course, there are several caveats here:

  • The device’s GPS function should be enabled.
    However, I do have reason to believe that this is the case for a large majority of smartphone users. In 2019, I explored an extensive dataset of around 15.000 smartphone users in the context of an academic study. We followed our sample for about 18 months and collected all sorts of data, from the apps they opened to — if available — their precise location whenever they used their smartphone. These data revealed that around 80% of our sample had turned on their smartphone’s GPS function all of the time, while roughly 90% had it enabled most of the time (users with 10% to 40% missing geolocation data).
  • The user has opted into Google’s location history registration.
    This sounds like a pretty big impediment. Who in their right mind would consciously allow Google to track their detailed location history, right? Of course, Google is extremely skillful in nudging their users into accepting the tracking of one’s device, selling it as a nifty feature that affords the user all kinds of additional benefits. Moreover, people are getting used to being bombarded with ‘privacy notices’, especially since the European GDPR forced companies to ask their users for permission for every type of personal data being collected. This could increase what researchers call ‘privacy fatigue’: the feeling that managing one’s own privacy online is a tiresome, almost unmanageable endeavor. Instead of taking action, people react to this cognitive overload by becoming complacent and cynical about their own agency to manage their online privacy. Google counts on that cynicism, combined with a healthy dose of pure laziness: when you’re opening Google maps for the first time, it gives you a long list of permissions (such as the location history service), with an attractive-looking ‘agree all’ button at the bottom of said list. Instead of going through all the trouble of reading and contemplating which service you want to enable, it’s just easier to accept all services at once.
    To demonstrate the power of privacy fatigue, I conducted a small experiment among ten of my friends (not exactly a true random sample, I know) and asked them what their location setting was on the corresponding Google settings page. All of them, without a single exception, reported that location history is enabled. Some of them were surprised, but most of them shrugged it off as ‘just another privacy invasion’ by the tech giants, indicative of the cynical attitude characteristic of privacy fatigue. The results of my little survey correspond well with (somewhat outdated) research from the Pew Research Center from 2012 (cited here), which reported that only 19% of US smartphone users turned off their location tracking.

So, as with many online privacy behaviors, the majority of Android users adhere to what experts call the privacy paradox: while everyone claims their privacy is of the utmost importance, most people don’t bother to actually do something about it. Even the recent change in Google’s privacy policy, which entails that the company will only keep your location history for 18 months, doesn’t hold it back from reliably updating its popular times data. With at least 2.5 billion active Android smartphones around the world, it’s not too far-fetched to assume that Google tracks more than 2 billion devices around the globe at any moment. After all, plenty of iPhone devices are tracked as well, even though their owners have to jump through a couple more hoops before the geo-tracking is in place.

This might sound like a disaster for the privacy of these users and, sure, in theory it does constitute a possibly worrisome bulk collection of geodata. However, in its current form, the popular times data constitute one of the more benign forms of bulk data collection: it’s impossible to link the data to any individual or even groups of individuals and it’s indicative of broad, general and aggregate trends, updated every couple of weeks, without giving access to fine-grained moment-to-moment traffic in specific locations. Moreover, if a location has an insufficient amount of data available, the popular times data won’t be released at all. So even if you know that — for example — a specific place can only be accessed by a couple of people (e.g. imagine that you created a Google place for your shared apartment), it’s not as if you can track any particular visit to said place through Google maps. So, all in all and from an ethical point of view, there is little concern that this type of data collection invades the privacy of individuals. Instead, this type of data is somewhat comparable to other aggregate trend data such as recorded traffic flows on highway networks, collected and sold by both private companies and governments worldwide.

Still, this doesn’t mean that there are absolutely no biases present here. For example:

  • The location needs to have a Google page in the first place. Although a lot of places do have a Google page, it is possible that less popular locations, or locations run by less digitally savvy business owners (possibly overrepresented among the immigrant population), are less likely to have a Google place at all.
  • If an insufficient amount of data is available for a particular location, the data won’t be displayed (these data points can be categorized as MNAR: Missing Not At Random). This effectively means that the aforementioned bias towards popular places is aggravated.
  • People with smartphones do not constitute a representative sample of the population. Although smartphone penetration rates are high in most developed countries, with around 3 out of 4 residents owning a smartphone, senior citizens keep lagging behind. Moreover, not all regions are alike. Pakistan (15%), The Philippines (34%) and even India (37%) have relatively low adoption rates; so researchers interested in surveying and/or comparing regions should be aware that different sample biases are present within specific regions under study.
  • Finally, the structural limitations of the Google platform itself mean that the creation of the popular times data lacks transparency. It is unknown how the popular times value is calculated, how regularly it’s updated, when someone is considered a ‘visitor’ of a particular place, and so on. For example, it’s likely that Google implemented a built-in ‘lag’ in their updates to prevent sudden changes in visiting patterns over time. This means we’re working with data already treated by an unknown proprietary algorithm, which exposes us to all sorts of (unknown) sources of bias.

That being said, these are still minor issues compared to the massive biases present in most other types of digital traces. Tracing happens on the go, without any human intervention, and billions of users are actively nudged into accepting this constant and ubiquitous data collection by Google. Although some researchers have already recognized the potential ([1],[2],[3]), to this day only a handful of papers have been published using this promising and vast database of visiting patterns.

This might have something to do with the fact that Google doesn’t grant access to the popular times data through their API. This poses a significant hurdle for data collection, requiring the researcher to have web scraping skills. So this is where we start off with our little Google maps project: developing a scraper.

[Part 1] Scraping Google maps

I’m not the first one interested in scraping popular times data. There are already Python packages available that will do just that. However, these packages require either (1) an API key and/or (2) a priori knowledge of the specific places you’d like to scrape (such as Google place IDs). Moreover, these tools only scrape the popular times graph, while I wanted to scrape everything from the number of reviews and place categories (e.g. informal, cozy) to the popular times figures. Crucially, and unlike these available tools, I wanted my scraper to contain a ‘free roaming’ option where the user can simply define a particular search in a pre-determined area, scraping all places (and all their available info) that match the given category. If Google wanders too far off (e.g. scraping restaurants in a neighboring but different city), the scrape session stops and moves on to the next region * search term combo. This means that the scraper is able to map regions all by itself, without requiring any prior knowledge of the places of interest as such. For example, giving the scraper a list of 10 cities (e.g. New York, Tokyo) and place types (e.g. restaurant) should return all place info — including their popular times — without any further input.

Gif 1. Free roaming option of my Google maps scraper

This is interesting for academics — such as sociologists — and businesses alike. For example, a fast food chain looking for new promising regions to invest in could easily create a ‘fast food map’ of a specific state or country, where underserved regions will quickly reveal themselves. For sociologists, then, the popularity of fast food joints in particular neighborhoods can be coupled with socio-economic variables such as income and level of education to lay bare disparities in health and available food options. Combining this kind of info with the popular times data only supercharges these kinds of analyses even further.

I used a combination of Selenium and Beautifulsoup in Python to construct the scraper. An earlier version of the scraper automatically scrolled through all days of a specific location (see Gif 2.). This method proved to be too slow and was replaced by directly accessing the popular times data in the page’s source code.

Gif 2. Early version of the google maps scraper

The final version of the scraper includes around a thousand lines of code and is, honestly, a huge mess. In hindsight, I was ill-prepared for the task and too ambitious. True, it incorporated all sorts of fancy options and was able to deal with a ton of exceptions. For example, it double-checked whether Google zoomed in on a specific sub-region or not, whether it changed the search area or general coordinates when scrolling through the results, and whether a ‘nearby’ search would lead to a more fine-grained search area. This all sounds fine and dandy until you realize that Google is continuously changing their source code ever so slightly, breaking the script here and there if you wait a couple of weeks. So if you run the script now, you’re bound to run into some errors. However, I was able to scrape around 13.000 places, targeting restaurants and supermarkets, across 30 cities in 6 West-European countries: Belgium, the Netherlands, Germany, Italy, Spain and France. All the specific region * place category combos used for this research can be accessed here on my Github.

If you’re interested in how the final scraper works, check out my short demo video here. In essence, I connect to my MySQL database set up on a different Linux machine before scraping, start up a Chrome browser using Selenium, perform a Google search and scrape the data location by location. Next, I upload the data automatically to the MySQL database after scraping all places in a particular region, and switch to a different server (using a NordVPN switcher I wrote) after every session to avoid bot detection. In the end, the database contains two tables. One table contains info on particular places (one line = one location), such as the place category, address, the review scores, etc. The other table is used to store the popular times data and is a bit more bulky, containing about one million lines where each and every line represents a single hour in a particular location (so one line = one hour). The Google maps URL is used as an identifier in both tables, which means we can link these data sources in a matter of seconds.

Figure 2. The inner workings of my Google maps scraper
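The scraper itself is too long (and, as noted above, too brittle) to reproduce in full, but a heavily stripped-down sketch of the Selenium/BeautifulSoup loop described above could look like the snippet below. The search URL pattern is the public Google maps search URL; the place-by-place parsing, the MySQL upload and the NordVPN switching are only hinted at in comments, since Google’s markup changes too often to hard-code reliable selectors here.

```python
import time
from bs4 import BeautifulSoup
from selenium import webdriver

def scrape_region(search_term, region):
    """Open a Google maps search for `search_term` in `region` and return the raw
    page source of the result list. The actual extraction of place details and
    popular times (plus the MySQL upload / VPN switch) is left out on purpose."""
    driver = webdriver.Chrome()
    try:
        driver.get(f"https://www.google.com/maps/search/{search_term}+in+{region}")
        time.sleep(5)  # crude wait for the results panel to render
        soup = BeautifulSoup(driver.page_source, "html.parser")
        # Placeholder: in the real scraper, every result is opened one by one and
        # its page source is parsed for the name, category, reviews and the
        # popular times figures, which are then written to the MySQL tables.
        return soup
    finally:
        driver.quit()

soup = scrape_region("restaurants", "Ghent")
```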

I export both tables to CSV and, et voilà, we’ve got our data! This may sound as if it was all done during a single lazy Sunday afternoon, but that’s only because I’m repressing the trauma induced by this entire data collection ordeal. Developing the scraper took me about two weeks, dealing constantly with new exceptions and unexpected behavior of Google maps, and collecting the data took me another week or so (scraping 24/7 without any interruption). I aggregated place categories into more meaningful and general levels (e.g. ‘Greek restaurant’ and ‘Indian restaurant’ both become ‘restaurant’ for our intents and purposes) and created an additional ‘category aggregated’ variable.

The actual scraping of the data took place at the beginning of April and is — as far as I know — not impacted by the rather severe lockdown measures enacted across Europe. It seems as if Google delayed the update of their popular times data, evidenced by the fact that local restaurants in Belgium still retained their popular times graphs despite being closed. This may be a conscious one-off decision from Google, but it’s entirely possible that the popular times update has a rather extensive built-in delay in any case.

[Part 2] Retrieving detailed geolocation data.

With our data loaded into our Python environment, we’re ready for some exploratory visual analysis, i.e. mapping the locations onto some sort of map (GeoJSON, shapefile, etc.). As far as I know, it’s impossible to extract the precise geolocation of a particular place from Google maps. Sure, you’ll find some blog posts detailing how you can scrape specific place coordinates from the page’s code, but these methods are (a) unreliable or (b) broken because of recent changes in the source code. Moreover, don’t mistake the coordinates embedded in the URL of a Google place page for the actual place coordinates. These only represent the latitude and longitude of the center of the map, which neither represents nor is unique to a specific location!

This means we’ll have to take the Google maps addresses and perform some forward geocoding through one of the following routes:

  • Scraping some other geocoding platform, such as latlong.net
  • Making use of the open-source OpenStreetMap data by hosting my own OSRM server. I don’t want to use the freely available OSM API, which will surely block me if I’m shoving 13.000+ requests down its throat in a short time period.
  • Making use of a commercial geocoding API, such as the Google maps API.

However, since I’d rather spend my money on books and video games than on some API service, I don’t want to open up my wallet. It took me a day or so to figure out the most foolproof method to forward geocode 13.000 locations without spending a single dollar.

As it turns out, platforms such as latlong.net are keenly aware of their attractiveness to bots and have therefore implemented several security measures to prevent automatic data collection. They blocked my IP after scraping only a few locations, and switching to a new server through a VPN would slow the process down to a considerable (and unacceptable) degree. The same goes for similar platforms. The next thing I tried was hosting my own OpenStreetMap server. It did work, but I quickly realized that a substantial proportion of my requests came back ambiguous, with OSM returning a faulty location or no address at all. Maybe the OSM API isn’t flexible enough to deal with the inconsistent format of some of the Google maps addresses? Whatever it may be, this meant I had to look for a commercial API service with… absolutely no budget whatsoever. Ouch.

Google maps was out of the question: their pricing policy would result in me spending about 70 dollars for the geocoding alone. After some research and testing several commercial APIs with a small batch of addresses, the Positionstack API came out as the clear winner. It deals with ambiguous addresses without too much hassle and — here’s the best part — it allows you to perform 25.000 free requests (forward/reverse geocoding) every month! Interestingly, your budget is renewed at the start of each calendar month. This means that you’re able to perform 50.000 free requests in two days (e.g. August 31 & September 1) if you plan accordingly.

I took the address and country of each location as parameters for my Positionstack requests and used the ID of the location as the identifier to link the geocoded dataset with my Google maps dataset. I also recorded whether a particular request had failed (which it sometimes did…) and implemented a limited retry-loop (which you can read all about in my blog post here) if this was the case. Crucially, I paused my script for 3 seconds after every request to avoid triggering any overload protection. If the script throws an error, we pause the requests for an entire minute before retrying.

The code for performing all 13.000+ Positionstack requests looks like this:
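(The original gist isn’t preserved in this version of the post; the snippet below is a minimal reconstruction of the loop just described. The endpoint and parameters follow Positionstack’s forward-geocoding API, but the input file, column names and retry limit are illustrative.)

```python
import time
import requests
import pandas as pd

API_KEY = "YOUR KEY HERE"
URL = "http://api.positionstack.com/v1/forward"

def geocode(address, country, retries=3):
    """Forward-geocode one address; returns (lat, lon) or (None, None)."""
    for attempt in range(retries):
        try:
            resp = requests.get(URL, params={"access_key": API_KEY,
                                             "query": address,
                                             "country": country,
                                             "limit": 1},
                                timeout=30)
            resp.raise_for_status()
            data = resp.json().get("data", [])
            if data:
                return data[0].get("latitude"), data[0].get("longitude")
            return None, None                      # address not recognized
        except Exception:
            time.sleep(60)                         # back off for a minute, then retry
    return None, None

places = pd.read_csv("places.csv")                 # illustrative: id, address, country
coords = []
for _, row in places.iterrows():
    lat, lon = geocode(row["address"], row["country"])
    coords.append({"id": row["id"], "lat": lat, "lon": lon})
    time.sleep(3)                                  # pause between requests
pd.DataFrame(coords).to_csv("geocoded.csv", index=False)
```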

Make sure you’re replacing the ‘YOUR KEY HERE’ in the code with your own API key.

Although Positionstack proved to be a free alternative to Google’s geocoding service, the performance of said API was somewhat inconsistent. If you’re lucky, Positionstack accepts thousands upon thousands of requests without any trouble, but it can equally throw you one error after another without any reasonable explanation. I recorded the number of failed requests during two sessions, spread across two days. Results are displayed below (see Figure 4).

Figure 4. Number of errors over time while performing requests on Positionstack

On day 1, Positionstack returned the coordinates of about 8000 locations without any noticeable hiccups. The first two hours it hardly threw any errors at all, and even after 7 hours the script only had to retry 15 requests. The API did experience some trouble after that, but it still only threw 30 errors after performing 8000+ requests in the span of 9 hours. Great! On day 2, however, the script continuously ran into errors, with 80 retries after only 2 hours. Importantly, this has nothing to do with the number of requests I had already spent of my ‘monthly budget’, since my budget was refreshed between day 1 (August 31) and day 2 (September 1). Still, this is only a minor drawback: you simply let the script pause for a minute and retry the request; the script never really got ‘stuck’ on a particular address.

Crucially, a ‘successful’ request does not necessarily imply that the location was successfully geocoded. Positionstack could also return:

  • Imprecise coordinates, e.g. the coordinates of a street or neighborhood instead of a specific address. In that case, all restaurants in a single street get the very same street-coordinate, which isn’t really useful for mapping purposes.
  • No coordinates at all. This is the case if the address wasn’t found for some reason (e.g. unexpected formatting of the address on Google maps).

Let’s check out how many missings are returned by country and city (see Figure 5).

Figure 5. Failed requests from Positionstack — by country

The blue portion of every bar represents addresses that were geocoded on a less fine-grained level (usually street level). The red portion represents the percentage of ‘true’ missings, i.e. Positionstack simply doesn’t recognize the address and returns an empty JSON.

It looks like there are stark differences in performance between regions. Addresses in Spain and Italy tend to return more missings, with addresses in Valencia returning no specific address location in about one out of every two cases! The other countries in our sample tend to experience fewer issues. 96% of all addresses in Rotterdam were successfully geocoded on address level. Ghent (95%), Brussels (90%), Munich (94%), Nice (93%) and the like have similar success rates. The fault for the high error rate among some of the Spanish and Italian cities does not necessarily lie with Positionstack. Maybe we’re dealing with different formatting conventions between Google and Positionstack for addresses in Spain and/or Italy. If that’s the case, some basic data wrangling could fix this issue. However, this does not explain the relatively low error rate of, let’s say, Milan and Zaragoza.

Let’s plot all locations to get a bird’s eye view of the data we obtained here. I used the Geopandas package to plot the coordinates. Luckily, Geopandas comes preloaded with a couple of maps, such as one of Europe. After only a couple of lines of code, the hot spots we scraped reveal themselves instantly (see Figure 6).

Figure 6. Around 13.000 locations mapped around Europe
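(The plotting code isn’t included in this version of the post; a minimal GeoPandas sketch that produces this kind of overview, assuming the geocoded coordinates from the previous step, could look like this.)

```python
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd

geocoded = pd.read_csv("geocoded.csv")              # illustrative: id, lat, lon
points = gpd.GeoDataFrame(
    geocoded,
    geometry=gpd.points_from_xy(geocoded["lon"], geocoded["lat"]),
    crs="EPSG:4326",
)

# GeoPandas ships with a low-resolution world map that we can restrict to Europe
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
europe = world[world["continent"] == "Europe"]

fig, ax = plt.subplots(figsize=(10, 10))
europe.plot(ax=ax, color="lightgrey", edgecolor="white")
points.plot(ax=ax, markersize=2, color="crimson")
ax.set_xlim(-12, 25)                                # crop to Western Europe
ax.set_ylim(35, 60)
ax.set_axis_off()
plt.show()
```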

You might notice the high concentration of locations in Belgium and the Netherlands, but this is simply the result of the relatively small area these countries cover. Remember that every country is represented by its five biggest cities, so naturally these cities (and their restaurants/supermarkets) are rather close to one another. Scraped cities in the other countries stand out more, such as Paris and Toulouse in France.

It’s important to note that this kind of geocoded data allows us to ask all kinds of new and interesting research questions and remains valuable beyond mere visualization purposes. For example, one could measure the distance from a particular location to the city center and correlate this with other interesting variables: are restaurants close to the city center more or less expensive (the ‘€€’ categorization in Google maps) than restaurants in the suburbs? Do certain types of restaurants ‘cluster’ together in the same area (e.g. ‘Chinatown’ or streets with a high concentration of kebab restaurants)?

[Part 3] Downloading geojson maps in bulk

While mapping our locations on a map of Europe provides a neat overview, it doesn’t give us much insight into micro-level patterns on — for example — city level. Maybe your research isn’t focused on visualizing detailed geolocation data, and if that’s the case, you don’t need to worry about obtaining detailed maps at all. Nonetheless, visualizing your data on a more granular level could reveal all sorts of interesting patterns that are otherwise indiscernible.

Luckily, there’s a great Python package available called osmnx to retrieve detailed maps using the OpenStreetMap database. Below you can find the code to obtain the street networks of the 30 cities in our database. I wrote every map to my hard drive (in the ‘maps’ folder) using the GeoJSON format. I only need the actual coordinates of the network (i.e. the position of the lines), so we can discard the other attributes in the graph object.
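(The code itself isn’t preserved in this version of the post; a sketch under stated assumptions — a shortened city list, a ‘maps’ output folder, only the edge geometries kept — might look like this, with a second part that reloads the saved files and plots them as subplots.)

```python
import os
import osmnx as ox
import geopandas as gpd
import matplotlib.pyplot as plt

cities = ["Paris, France", "Ghent, Belgium", "Rotterdam, Netherlands"]  # shortened list
os.makedirs("maps", exist_ok=True)

# Part 1: download each street network and keep only the line geometries
for city in cities:
    graph = ox.graph_from_place(city, network_type="drive")
    edges = ox.graph_to_gdfs(graph, nodes=False, edges=True)
    edges = edges[["geometry"]].reset_index(drop=True)   # discard other attributes
    edges.to_file(f"maps/{city.split(',')[0]}.geojson", driver="GeoJSON")

# Part 2: reload the saved GeoJSON files and plot them side by side
fig, axes = plt.subplots(1, len(cities), figsize=(15, 5))
for ax, city in zip(axes, cities):
    name = city.split(",")[0]
    streets = gpd.read_file(f"maps/{name}.geojson")
    streets.plot(ax=ax, linewidth=0.3, color="black")
    ax.set_title(name)
    ax.set_axis_off()
plt.show()
```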

I included an example of how you can map these GeoJson files using the GeoPandas package I mentioned earlier (check out the second part of the code snippet above). Loading every saved GeoJson file into my Python environment and plotting them using subplots gives us a neat visual overview of the 30 cities under study (see Figure 7).

Figure 7. Maps of the 30 cities scraped from Google

The GeoJson files just contain the coordinates of the network, which will be drawn every time you plot the network according to the settings provided by your graphics library (such as matplotlib). This means that no matter how much you zoom in, the streets will be plotted in crisp detail. Below you can find the end result of the second part of the code snippet displayed above for the Netherlands (see Figure 8).

Figure 8. Shapemaps 5 biggest cities — The Netherlands

Let’s combine the info from Google, Positionstack and OpenStreetMap and map all scraped restaurants and supermarkets in Paris. You can find the code to obtain Figure 9 below. In this case, supermarkets are indicated by green dots and restaurants by orange dots, but you can change the color scheme to your liking of course.
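(As with the earlier snippets, the embedded code is missing here; a reconstruction under assumed file and column names — city, lat, lon, category_aggregated — could look like this.)

```python
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd

places = pd.read_csv("places_geocoded.csv")          # illustrative file / column names
paris = places[places["city"] == "Paris"]
points = gpd.GeoDataFrame(
    paris,
    geometry=gpd.points_from_xy(paris["lon"], paris["lat"]),
    crs="EPSG:4326",
)
streets = gpd.read_file("maps/Paris.geojson")         # saved in the previous step

fig, ax = plt.subplots(figsize=(12, 12))
streets.plot(ax=ax, linewidth=0.3, color="lightgrey")
colors = {"supermarket": "green", "restaurant": "orange"}
for category, group in points.groupby("category_aggregated"):
    group.plot(ax=ax, markersize=8, color=colors.get(category, "blue"), label=category)
ax.legend()
ax.set_axis_off()
plt.show()
```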

Obtaining Figure 9
Figure 9. Restaurants and supermarkets in Paris (602 locations in total)

Simply plotting the data could also reveal some potential sources of bias stemming from structural limitations of the Google maps platform. This is clearly the case here. As you can see, it looks like we’ve collected more data on the city center of Paris. Now, I have visited Paris multiple times and it’s obvious that not all restaurants are represented on this map. There’s more to Paris than the tourist center, but it’s all that really seems to matter according to Google. This might have something to do with the way I’ve built the scraper. I simply told Google to ‘find restaurants in Paris’, without any further instruction. Given this, it makes sense that Google will (a) focus on the city center and (b) stop making suggestions after X number of pages, even if there are more restaurants available in the region. Indeed, no matter the region scraped, Google suggestions tend to stop after about 320 places. These two algorithmic tendencies combined mean that the data obtained is biased towards locations in the city center, which might have implications for your research. This is especially worrisome if you’re interested in different neighborhoods within the same region or city. If that’s the case, you should write a scraper that explicitly targets several sub-regions, such as different districts in the same city.

Notwithstanding this shortcoming, we still have an impressive number of locations at our disposal, with info on around 600 restaurants and supermarkets in a single city! The richness of the data is especially evident when you zoom in (see Figure 10), making the individual restaurants visually discernible on street level.

Figure 10. Zooming into just a small neighborhood of Paris

[Part 4] Wrangling, exploring & cleaning data

OK, now we have our Google data together with detailed geocoded locations and maps on city-level. This already allows you to answer all kinds of research questions, like plotting and clustering place categories, correlating the availability of certain place categories with other external data sources such as socio-economic variables, and so on.

Now it’s time to actually explore the popular times data. Remember: for every location and if available, we have a value between 0 and 100 for every hour of the week. A value of 100 indicates that this particular hour tends to be the busiest hour of the week for this particular location. However, and just to recap, locations with an insufficient amount of visitor data simply don’t have popular times figures. The question remains how many missings we are actually talking about. I created the following graph to estimate how severe this kind of bias is in our database (see Figure 11).

Figure 11. Missing popular times data — by place type and region
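(For reference, the kind of breakdown shown in Figure 11 could be computed along these lines, assuming the two exported tables and the column names used here, with the Google maps URL as the shared identifier.)

```python
import pandas as pd

places = pd.read_csv("places.csv")            # one row per location (assumed columns)
poptimes = pd.read_csv("popular_times.csv")   # one row per location-hour

# A location counts as 'missing' when it has no popular times rows at all
has_data = poptimes.groupby("url")["popularity"].count().gt(0)
places["has_popular_times"] = places["url"].map(has_data).fillna(False)

# Share of missing locations by city and aggregated place category
missing_share = 1 - places.groupby(["city", "category_aggregated"])["has_popular_times"].mean()
print((missing_share.unstack() * 100).round(1))
```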

It’s evident that we’re talking about a substantial proportion of missings. Around 40% of all places we’ve scraped from Google maps have no popular times data at their disposal. These locations are simply not popular enough to generate a reliable popularity estimate. This is the case for about 4 out of 10 restaurants and 3 out of 10 supermarkets. Interestingly, there are substantially more missings in Spanish and Italian cities such as Barcelona and Zaragoza (50%) when compared with cities further north such as Eindhoven and Munich (30%). Maybe this is a result of the dining culture prevalent in more Southern cultures, where there are more (local) bars that serve a relatively small number of loyal customers. Indeed, when you walk around Barcelona, you are overwhelmed by the number of small (and charming) tapas bars, while restaurants in cities such as Utrecht or Antwerpen need to ‘go big or go bust’. This means that — whatever analyses we’ll perform in this section — we’re only talking about the most popular 60% of restaurants in our sample. Let’s keep that in mind whenever we’re interpreting the results below.

Just for exploratory purposes, I’d like to perform three analyses.

  • Clustering popular times data of restaurants: can we identify regions with similar visitation patterns? I only take restaurants here because I suspect dining habits are more culturally defined than grocery shopping. Ideally, we’d see that each country forms its own typical ‘popular times’ cluster, with restaurants adhering to different popularity patterns across countries (and cultures).
  • On which day of the week do restaurants or supermarkets reach their peak popularity? Does this differ across countries? For example: does the famous Italian dining culture mean that each day of the week a substantial proportion of restaurants experience some kind of peak? And do the disciplined Germans all go out to eat on a Saturday?
  • On which hour do restaurants reach their peak popularity? Are the Spanish going out late in the evening (let’s say around 9 PM), when the restaurants in the Netherlands are all but empty?

Clustering popular times data

Before looking at different popularity patterns across countries: does it even make sense to cluster all these restaurants into their corresponding regional category? In other words: are the popular times trends within a single country so similar that they would cluster together based on their popularity trends? To find out, I performed a hierarchical cluster analysis (using the very straightforward Euclidean distance and Ward’s method) and sliced the dataset into 2 to 6 clusters. I used 6 as the upper limit here since we are dealing with 6 different countries, assuming that each of these regions possibly forms a cluster driven by differences in dining cultures.

I wrote the following code to do just that:
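(The gist isn’t reproduced in this version of the post; a condensed version of what this could look like with SciPy — Ward linkage on Euclidean distances, cut into 2 to 6 clusters — is sketched below, including a cluster_membership helper that crosstabs cluster labels against country. The wide popular times matrix and its column names are assumptions.)

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Assumed input: one row per restaurant, one column per hour of the week (0-100),
# plus a 'country' column. The restricted run keeps only the noon-midnight columns.
wide = pd.read_csv("restaurants_wide.csv")
hour_cols = [c for c in wide.columns if c.startswith("h_")]   # e.g. h_mon_12 ... h_sun_23
X = wide[hour_cols].values

# Ward linkage on Euclidean distances
Z = linkage(X, method="ward")

def cluster_membership(Z, countries, k):
    """Cut the dendrogram into k clusters and return the % of each country per cluster."""
    labels = fcluster(Z, t=k, criterion="maxclust")
    return (pd.crosstab(labels, countries, normalize="columns") * 100).round(1)

for k in range(2, 7):
    print(f"\n--- {k} clusters ---")
    print(cluster_membership(Z, wide["country"], k))
```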

I ran this script twice: once using all available popular times data (left panel, Figure 12), and once using a more restricted time frame, using only the popular times figures from noon till midnight (right panel, Figure 12). I did this to avoid an excess of zeroes, which is something that might be detrimental to finding a meaningful cluster solution. Indeed, it makes sense that restaurants are usually empty between midnight and noon, with the occasional exception, and that the interesting differences in popularity patterns are predominantly manifested somewhere between noon and midnight. The cluster membership by country is produced by the cluster_membership function. I color-coded these percentages in Excel, which gives us the figure below.

Figure 12 — Clustering popular times data. Left: using the entire time range | Right: using restricted time range

Ideally, the solution with 6 clusters would show a clear division by country. This is not the case, however. No matter the time range selected, region seems to matter little in explaining cluster membership. When we use the restricted time range, the first and also biggest cluster (N: 1000) is dominated by French and Italian restaurants though, so a substantial subset of restaurants in these regions does seem to exhibit similar popularity patterns. The same goes for Dutch and Belgian restaurants (cluster 3, N: 942). The remainder of the clusters contain a mix of different regions, though, and there’s not a clear pattern to be discerned either. A couple of cluster solutions hint at a North-South divide: locations in France, Spain and Italy do tend to concentrate in the same clusters, with Dutch, German and Belgian restaurants usually absent. That’s about it. So what does this mean? This can be explained by:

  • The structure is more complex than it seems.
    In this case, region may play a pivotal role in explaining differences in popularity trends, but we are attempting to cluster locations (i.e. restaurants) that adhere to vastly different visitation peaks and troughs. In other words: not all restaurants within the same region are alike, and their different representation at the regional level thwarts our attempt to create meaningful clusters. It’s true that the ‘restaurants’ Google comes up with cover a broad range, including everything from fast food to haute cuisine. Maybe we need to perform some dimension reduction (e.g. PCA) on these restaurants before clustering?
  • The locations being scraped are not representative of restaurants that exhibit ‘typical’ (i.e. culturally determined) popularity patterns of the region.
    As I’ve already mentioned, our data harbors several sources of bias. The scraped restaurants tend to be popular (because they have popular times figures in the first place), are usually located in the city center (see [part 3]) and — more broadly — in one of the five biggest cities of a given country. It’s not unreasonable to argue that restaurants in the city center of Paris and Naples are different, on many accounts, from restaurants located in a small rural French or Italian village. For one thing, these are all tourist hot spots, meaning that a substantial number of tracked devices in these cities actually belong to tourists, who may exhibit different dining habits. Scraping less tourist-friendly cities or more rural regions may lay bare the more nuanced differences between national dining cultures.
  • The sample is biased.
    This harks back to the comments made in the introductory section of this article. If the tracked devices are more likely to represent a particular part of the population (e.g. younger, more cosmopolitan, higher educated, less traditional), the data may be unable to capture the dining habits of, let’s say, the ‘average Italian’.
  • I haven’t used a proper clustering technique.
    Cluster analysis is a somewhat subjective, trial-and-error kind of ordeal. Different distance measures and linkage methods underperform depending on the true cluster shape and size, which is something the analyst doesn’t know beforehand. The high proportion of zero-values might warrant a sparse clustering solution, although the more restricted time range (noon-midnight) does fix this issue for the most part.
  • Region (and thus national dining culture) is not a proper variable to explain differences in popular time trends.
    Finally, it’s entirely possible that region just isn’t a proper explanatory variable, which would mean that dining cultures in the Netherlands and Italy actually do not differ. Or maybe they do differ on some grounds, but these differences are subtle and do not immediately pop up as a separate ‘cluster’.

Popularity by day of the week: trends and peaks

Maybe we need to group restaurants by region and look at more specific differences. For example, when do restaurants and supermarkets tend to experience a peak in their popularity? This means we’ll need to transform the longitudinal popular times data into some sort of summary statistic on day level. There is a plethora of ways to go about this; let’s take a look at a couple of options.

Figure 13. Popularity of restaurants by day of the week

One possibility is to calculate a ‘mean’ popularity by day for each and every restaurant. Given this, one could plot the distribution of the mean popularity — by day and region — using kernel density estimate (KDE) plots (basically a smoothed histogram on an interval scale). Figure 13 displays the end result of such an exercise. Notably, I only retained non-zero values here to avoid heavily skewed distributions with high concentrations around zero. In other words: the KDE plots represent the distribution of mean popularity values for hours during which there is at least some number of visitors on location. This mostly rules out the closing hours of the restaurant in question (and restaurants simply closed on that particular day), which is a form of bias we’d like to exclude from our analysis anyway.
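(A sketch of this aggregation and the corresponding density plots, assuming the long-format popular times table described earlier — one row per location-hour — and seaborn for the KDE plots:)

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed long format: url, country, category_aggregated, day, hour, popularity (0-100)
pt = pd.read_csv("popular_times.csv")
restaurants = pt[(pt["category_aggregated"] == "restaurant") & (pt["popularity"] > 0)]

# Mean popularity per restaurant per day, using non-zero hours only
daily = (restaurants.groupby(["country", "url", "day"])["popularity"]
                    .mean()
                    .reset_index())

# One panel per country, one density curve per weekday
sns.displot(daily, x="popularity", hue="day", col="country",
            kind="kde", col_wrap=3, common_norm=False)
plt.show()
```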

It looks like there’s little insight to be gained from such an analysis. Sure, the distributions on Saturday (the brown KDE plots) tend to be concentrated around higher popularity figures in all countries, but the other weekdays show somewhat similar normal-looking distributions, with a couple of minor exceptions. There are two ways forward from here: either we summarize the data even more, or we break the sample down into smaller coherent clusters of locations (e.g. after performing some dimensionality reduction technique).

Figure 14 represents a similar analysis where we follow the first route; we aggregate the data even more by (a) only looking at the most popular hour of the week for a particular restaurant (i.e. on which day of the week does a location reach 100% popularity?) and (b) adding up all popularity percentages for a specific day and taking the day with the highest total figure. Since restaurants obviously can’t reach peak popularity when the place is closed, we’d like to express the percentage of locations reaching peak popularity among restaurants that are actually open. This means that adding all percentages over the week gives you a figure higher than 100%, as the comparison group shrinks and expands throughout the week. To ease interpretation, I also want to plot the percentage of locations closed on a specific weekday (the red bars).

To avoid a big chunk of (repeated) code, I wrote two functions to summarize the data and filled a dictionary with a total of eight datasets. One dataset analyzes peak popularity of restaurants, by hour and country. Another one measures peak popularity of supermarkets, by day and city, etc. The code snippet looks something like this:
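(The snippet itself isn’t preserved in this version of the post; a condensed reconstruction of these helpers, under the assumed long-format column names, could look like this. The closed_share helper produces the red bars mentioned above.)

```python
import pandas as pd

# Assumed long format: url, country, city, category_aggregated, day, hour, popularity
pt = pd.read_csv("popular_times.csv")

def peak_share(df, group, unit):
    """% of locations whose single busiest hour of the week falls on each `unit`
    (weekday or hour of day), per `group`, among locations open at that unit."""
    peak_rows = df.loc[df.groupby("url")["popularity"].idxmax()]
    peaks = peak_rows.groupby([group, unit])["url"].nunique()
    open_counts = df[df["popularity"] > 0].groupby([group, unit])["url"].nunique()
    return (100 * peaks / open_counts).unstack().round(1)

def closed_share(df, group):
    """% of locations with zero popularity across an entire weekday
    (a proxy for being closed that day), per `group`."""
    open_any = (df["popularity"] > 0).groupby([df[group], df["day"], df["url"]]).any()
    return (100 * (1 - open_any.groupby(level=[0, 1]).mean())).unstack().round(1)

# Eight summary tables: 2 place types x 2 time units x 2 region levels
summaries = {}
for place_type in ["restaurant", "supermarket"]:
    sub = pt[pt["category_aggregated"] == place_type]
    for group in ["country", "city"]:
        for unit in ["day", "hour"]:
            summaries[(place_type, unit, group)] = peak_share(sub, group, unit)
```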

Now let’s finally take a look at Figure 14 (you can find the code to reproduce this graph here). Unlike our previous analysis, this one delivers a couple of interesting insights:

  • It doesn’t matter whether we measure peak popularity by the most popular hour or overall day popularity. What this effectively means is that the busiest hour of the week is not an isolated incident: when a peak is bound to happen, the hours surrounding this peak will tend to be relatively popular as well.
  • Saturday is — by far — the most popular day of the week (with around 3 out of 4 restaurants reaching their peak). Monday till Wednesday are slow days for restaurants across Europe. The trendline increases on Thursday, if only slightly, in most countries. Only around 1 out of 5 restaurants experience a peak on Friday. Again, notice how remarkably stable this trend is across countries.
  • It doesn’t really pay off to open your restaurant on a Sunday or Monday. In most countries, Sunday and — to a lesser degree — Monday are popular closing days. One might expect that a substantial proportion of restaurants open for business on these days would experience a boost in their popularity simply because plenty of competitors are closed. This is clearly not the case, however. This suggests that the balance between supply and demand is just right and that these two traditional closing days are cemented into cultural norms on when people should and can eat out.
Figure 14. Share of restaurants experiencing peak popularity

I repeated this exercise for supermarkets, which revealed some different though equally interesting patterns.

Figure 15. Popularity of supermarkets by day of the week

Again, visualizing the distribution of mean popularity across days gives us some neat-looking but ambiguous KDE plots. There seems to be something off with Belgian supermarkets during the weekend. Some supermarkets are surprisingly calm on a Saturday, while a different subset of locations experiences extremely high peaks (> 80% mean popularity) on that very same day. This even results in an almost bimodal distribution. Still, the overall trend throughout the week seems to be remarkably similar across nations.

Again, the analysis by peak popularity is more revealing (see Figure 16). As one might expect, Saturday takes the lead as the most popular day to do grocery shopping. But the pattern throughout the weekend is more nuanced here. In essence, countries seem to adhere to three different patterns on Saturday and Sunday: a conservative, a liberal or a conservative-to-liberal trend.

Let’s start with the conservative-to-liberal cluster, of which Belgian supermarkets constitute a prime example. In Belgium, a large share of supermarkets (around half) are closed on Sunday, but those that are open experience an enormous boost in visitors. If we measure peak popularity on an hour level, three out of four supermarkets that are open for business on a Sunday experience their peak on this holy day. This means that these locations even outperform their own visitor figures from Saturday! A similar trend is visible in Spain. So, although many Belgian and Spanish supermarkets uphold the tradition of Sunday as a holy day of rest, both populations seem to have moved past this: they expect that (at least some) supermarkets are open on Sunday, which proves to be a lucrative decision for these businesses in both countries. One could expect that it’s only a matter of time before market forces push the legislative branch to pass more liberal work regulations that would stimulate these wider opening hours even further. This liberalization has already happened in the Netherlands, the only true liberal country in our sample, where most supermarkets are open every day of the week (only about one out of ten businesses is closed on Sunday). However, the high proportion of locations experiencing peak popularity is absent here. This does not necessarily mean that these locations have low visitor figures on this particular day; it just might be the case that Sundays tend to attract a ‘moderate’ or ‘normal’ number of customers and therefore do not stand out when we merely look at peaks. This is supported by the KDE plot in Figure 15. It makes sense from a supply and demand standpoint as well: since the ‘7/7 supermarket’ has been normalized, grocery shoppers on Sunday are probably distributed over more supermarkets. On the more conservative side, we can see that basically all German supermarkets, with a few exceptions, are closed on Sunday. As it turns out, Germany is known for its extremely restrictive shopping hours. France and Italy fall somewhere in between, with relatively relaxed opening hours (leaning towards the liberal side of the Netherlands), but with around 25% of supermarkets closed on Sunday, they do seem to follow some of the Belgian or Spanish trends as well.

I should note that this conservative-versus-liberal reading constitutes only one of the many ways you could read the stats in Figure 16. Since all popularity measures are relative (on location level), we can’t know for sure whether there are a lot of Belgians doing grocery shopping on a Sunday. Perhaps — even though this seems somewhat unlikely — there is a substantial subset of independently owned supermarkets (e.g. Turkish supermarkets in urban areas) that attract a rather small number of customers on a Sunday in absolute terms. However, these supermarkets might still experience a peak in popularity from their own standpoint. As I’ve already argued in the introduction, it’s hard to pinpoint exactly what’s going on when working with algorithmically calculated data.

Figure 16. Share of supermarkets experiencing peak popularity — by day of the week

Popularity by hour of the day: trends and peaks

When you think about differences in dining culture, the first thing that might pop into your mind is that Italians go out for dinner when the Dutch are already going to bed. Indeed, more southern cultures are known for delaying their dinner time until darkness fills the cozy alleyways of, let’s say, Naples and Barcelona. Can we find some evidence of these cultural norms in the popular times data?

Figure 17 displays (a) the mean popularity by hour (from 5 PM to 2 AM) and (b) the percentage of restaurants experiencing peak popularity in a specific hour (from 5 PM to 2 AM) by country (check out the code to reproduce this graph here).

Figure 17. Peak popularity of restaurants by hour of the day

It looks like there’s a clear north-south divide here. More specifically:

  • Dutch restaurants already experience a peak between 7 and 8 PM. Around that time, there’s not much going on in restaurants in the southern European countries (France, Spain and Italy).
  • Belgian restaurants are already packed between 7 and 8 PM, but 35% of them experience their peak one hour later (between 8 and 9 PM).
  • French and Italian restaurants only reach their peak between 9 and 10, but the true night owls are the Spanish. More than 30% of Spanish restaurants experience a peak between 10 and 11 PM, when restaurants in Belgium and the Netherlands are all but empty.

Conclusion & suggestions for future research

The goal of this article was to inspire future researchers and give some relevant pointers when it comes to retrieving, wrangling and analyzing Google popular times data. Although the exploratory analyses reported here might seem to state the obvious or point to rather banal conclusions, these figures might point to highly valuable patterns within the right research context.

One important caveat when leveraging popular times data is the numerous sources of bias reported throughout this article. We’re not just talking about mere sampling bias (who is being tracked, and is this a representative sample of the population?), but there’s a clear algorithmic bias at play as well. For example, as we’ve seen when plotting the collected locations on a map, Google maps tends to focus its search on a small area when looking for locations, with a preference for the city center. These locations might not be representative of the ‘prototypical’ location type for a number of reasons (e.g. tourists might skew the popularity trends). Future researchers should be aware of this and write their scraper in such a way as to diversify the geographic spread of the scraped locations within the same local region. For example, one could explicitly target different neighborhoods within the same locality/city.

Moreover, the researcher has to deal with missing data which are probably not missing at random. Our analyses estimate that around 40% of Google maps locations (for restaurants and supermarkets) have missing popular times data due to an insufficient number of visitors. It’s reasonable to assume that these locations share some common characteristics. For example, so-called ‘ethnic grocery stores’ (e.g. Turkish supermarkets) are probably more likely to end up in the missing category, as they tend to serve a rather small community — often concentrated within the same neighborhood. Depending on the research question at hand, this might have severe implications for the reliability and validity of the researcher’s conclusions.

Finally, mapping a particular region by location type still involves some manual labor. First, Google will inevitably return some places that are not relevant to your search term and their categorization scheme is often too detailed and murky. This meant I had to convert all unique Google categories to a more general aggregate category. Second, and depending on your research question, the returned locations might encompass multiple clusters of places that each adhere to different visitation trends. This was evident when looking at our cluster analysis. Some form of dimensionality reduction, such as PCA, might therefore be warranted to expose relevant subclusters of locations. But researchers can get more creative and opt for more straightforward solutions, such as including locations based on particular keywords. For example, Turkish supermarkets could be extracted by matching the names of reviewers with a list of popular Turkish names (e.g. Yusuf, Eymen, Ömer,…).

Still, and notwithstanding these potential pitfalls, Google popular times data remains a vastly unexplored and possibly valuable source of information for both academics and businesses. We were able to scrape more than 13.000 Google places, geocode their specific locations, plot these locations on any specific city map and look for patterns that might be indicative of cultural norms on dining and grocery shopping. What’s more, we did all this while spending exactly zero dollars. One interesting avenue for academic researchers might be to look for longitudinal change in popular times figures. Given the current COVID pandemic, one might look for significant changes in our time budgets that might even linger after the pandemic is over. This presupposes three waves of data collection: one before, one during and one after the pandemic. In this light, continuously monitoring and scraping popular times data (every few months or so) might prove to be a valuable undertaking within communication and sociology research departments.
