Can You Estimate the Number of People Who Visit a Place by the Number of People Who Review It Online?

Dtg
Introduction to Cultural Analytics
14 min read · May 21, 2021

No. But the Answer is More Interesting than that.

The Colosseum in Rome has been reviewed on Google by more than a quarter million people. Screenshot from Google Maps.

Online reviews are ubiquitous. Almost everything has been reviewed by someone, at some point, somewhere on the internet. There are easily accessible reviews online for everything — from books and restaurants to college professors and bespoke haberdashers. It would not be unreasonable to say that this massive cache of online reviews is, taken as a whole, one of the largest datasets ever assembled by humans — and one that grows daily.

It is very, very tempting to attempt to tap into this data. Surely, such a massive log of human thoughts and interactions can and will yield equally massive insights. Even if online reviews, as a format, have flaws (which anyone who has attempted to use them can enumerate), their potential would seem to vastly outweigh their risks: this is free* data, on an unprecedented scale, with an almost universal scope and range. The data exists; the only job is to identify its uses.

I have long been interested in quantitatively studying tourism — specifically historical and cultural tourism. This is, really, only tangentially related to my question about online review data, which could be applied to any number of fields, but to see how I stumbled upon this research topic, it helps to understand the context I was working in.

When doing research on the impact of tourism, it is often crucial to have access to data on visitation numbers (the total number of tourists who visit a place), but these data are often very difficult or impossible to find, and even when they do exist, they are frequently woefully inaccurate. People can be counted fairly easily at choke points — the ticket counters of museums, the passport control lines at airports — but they are much harder to enumerate when visiting public spaces, like churches or parks or bridges or cities or towns. The most precise data source governments have traditionally been able to collect is nightly hotel stays by city or municipality (France does this, for instance), but clearly this has its limits: where are people going in the hours that they are not asleep?

The most high-tech solution to this problem is mobile phone tracking data, which can give you a pretty good idea of how many people visit certain pre-determined points over a set period. These data, sold by several private companies, go mainly to those interested in evaluating commercial real estate, at a price that puts them outside the budget of many researchers, and even of many municipalities that might be interested in tracking their tourists. Continual concerns about privacy make the use of these datasets controversial at best, but there are also enough concerns about the accuracy of mobile phone tracking data — heavily reliant on precise GPS logging from a statistically representative sample of the population — that it is hardly a slam-dunk solution.

If only there were a simpler way to get visitation data. If only people would tell you when they had been places, generating an online ledger of their presence. Better yet, if only this data were public, easily collectable, and universal. You probably see where I am heading with this: it sure seems like we should be able to use the online reviews people leave about the places they visit to estimate tourist visitation numbers. If not perfectly, maybe at least with fairly good accuracy.

The number of reviews for a location is prominently displayed on most online review sites. Here, as an example, we can see that the Uffizi Gallery in Florence has 44,715 reviews. This museum was among the cultural sites in the dataset I used for my analysis. Screenshot from Google Maps.

The key data point for us is the number of reviews, which is almost always published alongside the aggregated average review value to help the viewer interpret the data. Clearly, we don’t expect the number of reviews to be identical to the number of visitors: most people, it is safe to assume, leave no reviews at all. However, it is not unreasonable to think that there may be a strong relation between the number of reviews and the number of visitors, especially over a wide timeframe. If the relationship is strong enough, it would be possible to calculate an accurate estimate of total tourist visitation from this humble source: humanity’s absolute glut of online reviews.

The data needed to test this relationship is relatively simple in theory: all that is needed are 1) the actual visitation numbers and 2) the number of online reviews for a list of places. However, the difficulties of finding this data quickly become apparent. Ideally, the actual visitation data would be uniform — all collected by the same agency; all formatted and verified in the same way; all accessible in the same spot. To draw any far-reaching conclusions, it is also key that the sample size be large enough to run statistical tests. The places in the dataset also have to be public and permanent (a fairly easy bar to pass), since otherwise there would be no way to get online reviews for them.

After a bit of looking around, I found an ideal candidate in a dataset published by the Italian Ministry of Culture (MiC). This organization, an official office of the Italian Government, is in charge of a significant number of Italy’s most famous museums and archeological sites, as well as numerous palaces, churches, monasteries, castles, and other attractions across the country. In total, about 500 sites are under their jurisdiction, and they publish visitation numbers for all of them yearly. These data are based on tallied ticket sales as well as observed entrances, and range widely from properties seeing over 6 million visitors in the average year (e.g. the Colosseum and Pantheon in Rome) to those which receive none. The size and range of the dataset were good, and it was to be hoped that the MiC had high enough standards that the data would be accurate and consistent as well.

All that remained was to pair these data with online review data. I chose to get my online review data from Google Maps. There were several reasons for this: 1) their platform is used and known by pretty much everyone, 2) more niche tourism sites, like Trip Advisor, do not allow academic researchers to access their APIs (the backdoor into their data), 3) Google Maps is very complete and should have reviews and entries for even the smallest attractions run by the MiC, and 4) the data is posted publicly and there are few ethical concerns in using it, especially when aggregated into the number of reviews and separated from individual account names.
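For those curious what the collection step can look like, here is a minimal sketch of pulling a review count from the Places “Text Search” web service. This is an assumption-laden illustration rather than my exact pipeline: the place name is just an example query, the API key is a placeholder, and user_ratings_total is the review-count field Google returns alongside each listing.

```python
import requests

API_KEY = "YOUR_GOOGLE_MAPS_API_KEY"  # placeholder: a valid Places API key is required
TEXT_SEARCH_URL = "https://maps.googleapis.com/maps/api/place/textsearch/json"

def review_count(place_name):
    """Return the review count of the top text-search hit for a place name, or None."""
    resp = requests.get(TEXT_SEARCH_URL, params={"query": place_name, "key": API_KEY})
    resp.raise_for_status()
    results = resp.json().get("results", [])
    if not results:
        return None
    # Take the first (most relevant) result; misidentification is possible at this step.
    return results[0].get("user_ratings_total")

# Example usage (illustrative): review_count("Galleria degli Uffizi, Firenze")
```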

The Archeological Site of Pompeii is part of the dataset I found published by the Italian Government. It is also a good example of the problems I faced collecting and cleaning the data: not only are there multiple Google results that come up for Pompeii, but there are also many Google results for places within the site. In addition, Pompeii is part of several group ticket arrangements offered by the Italian Government, further complicating my data collection. Screenshot from Google Maps.

With these two data sources identified, all that remained was marrying them into a single dataset — a process more difficult than it first appeared. For one thing, the names used by the Italian government did not correspond universally with the proper listing in Google Maps for each attraction. This was exacerbated by Google Maps’ frequent use of multiple tags for the same place, only one of which is actively used (with website links, reviews, etc.). Since my process for linking each Italian attraction to its number of Google reviews was automated, it is possible some misidentifications have slipped through, although I have hand-corrected many of the most egregious ones. Another issue that absorbed an inordinate amount of time (and may contribute to errors) is the fact that the data published by the MiC include counts of combined tickets, which gave admission to multiple attractions. I added the totals for these combined tickets to each property the ticket was valid for, even though it is probable that some group ticket holders did not go to every site on the group ticket. I figured that this inaccuracy was better than the alternative — excluding the group tickets — since there are several sites which were primarily visited on group tickets, and these places would have had much more inaccurate counts if the group tickets had been excluded.
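To make these two cleaning steps concrete, here is a minimal sketch of how they might be implemented: fuzzy name matching with Python’s standard difflib, and the crediting of combined tickets to every site they cover. The function names and the similarity cutoff are my own illustrative choices, not a description of the exact code I ran.

```python
import difflib

def match_google_listing(mic_name, google_names, cutoff=0.6):
    """Fuzzy-match an MiC site name to its closest Google Maps listing name, if any.

    Pairs that fall below the similarity cutoff get no automatic match and
    have to be resolved by hand (as many of mine were).
    """
    hits = difflib.get_close_matches(mic_name, google_names, n=1, cutoff=cutoff)
    return hits[0] if hits else None

def add_combined_tickets(visits_by_site, combined_tickets):
    """Credit each combined ticket's total to every site the ticket covers.

    `combined_tickets` maps a tuple of site names to a ticket count. This
    over-counts holders who skipped a site, but avoids dropping them entirely.
    """
    for sites, count in combined_tickets.items():
        for site in sites:
            if site in visits_by_site:
                visits_by_site[site] += count
    return visits_by_site
```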

One final detail to mention before my analysis is that my museum visitation data are an average of the values reported in 2010, 2015, and 2019. I needed an average of multiple years because the number of Google reviews reported by the website is a running total: the total since the origins of the Google Reviews program in 2007. Since the website does not report individual years of data, I thought it was important for my actual visitation numbers to be averaged over years when Google Maps reviews have been most active, so as to account for any changes over time. Ideally, I would have tallied the total visitation from every year, but time constraints during data collection meant that, in this analysis, I decided to work with an average. Clearly, my arbitrary selection of years may induce further error in the data: for instance, if a museum or site happened to be closed for renovations during one of the years I chose to sample.
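The averaging step itself is trivial; a sketch of what it might look like with pandas is below, assuming (purely for illustration) a table with one row per site and one visitation column per reporting year.

```python
import pandas as pd

def average_visitation(yearly, years=("2010", "2015", "2019")):
    """Average the sampled years into a single 'avg_visits' column.

    `yearly` is assumed to be a DataFrame with one row per MiC site and
    one visitation column per reporting year (an illustrative layout).
    """
    yearly = yearly.copy()
    yearly["avg_visits"] = yearly[list(years)].mean(axis=1)
    return yearly
```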

What did I expect when looking at this data? If I am to be honest, my expectation was for a very close relationship between visitation and reviews. I guessed — whether reasonably or not — that there would be a fairly set percentage of people who would leave reviews. Certainly there are myriad reasons why you may or may not leave a review, from time to interest to other random inducements. But I thought it not unreasonable that these random influences would average out across the population. On a large scale, it seemed entirely possible that the number of reviews for a place would be related to that site’s visitor figures through a fairly consistent percentage.

If this were the case, we would expect to be able to find that percentage fairly easily. A scatter plot of the two values — average yearly visitation numbers and total Google reviews — would show a pretty clear trend, a linear grouping of the data points. However, if you look at the actual scatter plot produced by my data, shown below, you will see that this is not obviously the case.
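(For reference, a scatter plot of this kind takes only a few lines; the sketch below assumes matplotlib and two parallel lists built from the merged dataset, and the function name is mine rather than part of any library.)

```python
import matplotlib.pyplot as plt

def plot_reviews_vs_visits(avg_visits, review_counts):
    """Scatter lifetime Google review counts against average yearly visitation."""
    plt.scatter(avg_visits, review_counts, alpha=0.6)
    plt.xlabel("Average yearly visitation")
    plt.ylabel("Total Google reviews")
    plt.title("Google reviews vs. visitation, MiC sites")
    plt.tight_layout()
    plt.show()
```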

Although there is a definite upwards trend in my data, the relationship between total visitation and the number of online reviews does not leap off the page. Indeed, much of the relationship between the data is obscured by the giant scale. The scale for total visitation runs from 0 to 7 million, meaning that most of the data points have grouped together in a tight blob at the small values close to zero. For these values, this large-scale scatter plot tells us almost nothing.

A linear regression of these two values provides clearer results. The relationship between the two values is statistically significant and positive, which means that our hypothesized relationship is certainly present. We can expect, on average, about 2 in 1,000 visitors to leave a review for a place they have visited. Statistically, this model, using just visitation and review data, has an R² of .83, and thus allows us, with just one variable, to explain 83% of the variation in the other. This may seem like a promising result; however, we have reasons to be dissatisfied with it.
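A minimal sketch of this fit, using SciPy’s ordinary least-squares helper, is below. The wrapper function is my own, and the slope and R² quoted in the comments are the values reported above for my full dataset, not something this illustrative code produces by itself.

```python
from scipy import stats

def fit_reviews_vs_visits(avg_visits, review_counts):
    """Fit reviews ~ visits by ordinary least squares; return slope and R^2."""
    fit = stats.linregress(avg_visits, review_counts)
    return fit.slope, fit.rvalue ** 2

# On the full MiC dataset, this kind of fit gave a slope of roughly 0.002
# (about 2 reviews per 1,000 visitors) and an R^2 of about 0.83.
```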

Although it is true that, for most uses, a model that explains 83% of the variation in a variable would be incredibly impressive, in the context of our data this number is not high enough. Most of the sites in our dataset have relatively small values, and the level of accuracy achieved by the model is simply not high enough to ensure accurate predictions at small scales. Essentially, the inclusion of several sites with values which are orders of magnitude larger than the others has skewed the model’s calculation of overall error. As an example of this, consider that predicting within, say, 1,000 reviews of the true value means something very different for a site with 100,000 total reviews than for one with only 50. To get a sense of what this means at a practical level, let’s zoom in on that dense cluster of attractions which is obscured by the large scale of our main scatter plot:

A close view of the main scatter plot, showing only the sites with visitation less than 1 million.
An even closer view, showing only the data points with visitation less than 100,000. In this case, the lack of a clear linear relationship is even more evident.
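(These zoomed views come from nothing more exotic than filtering the data before re-plotting; a small sketch, again with assumed list names and a matplotlib setup, is below.)

```python
import matplotlib.pyplot as plt

def plot_zoomed(avg_visits, review_counts, max_visits):
    """Re-plot only the sites whose average yearly visitation is below a threshold."""
    kept = [(v, r) for v, r in zip(avg_visits, review_counts) if v < max_visits]
    xs, ys = zip(*kept) if kept else ([], [])
    plt.scatter(xs, ys, alpha=0.6)
    plt.xlabel("Average yearly visitation")
    plt.ylabel("Total Google reviews")
    plt.title(f"Sites with fewer than {max_visits:,} yearly visitors")
    plt.show()

# e.g. plot_zoomed(avg_visits, review_counts, 1_000_000)
#      plot_zoomed(avg_visits, review_counts, 100_000)
```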

As you can see in these samples, although we have been able to fit a “best fit” line to our data, the data themselves do not suggest a linear relationship. The spread appears, for lack of a better word, largely random. Although it is clear that having a higher number of visitors allows for more reviews (there is an upward trend, after all), it also seems pretty clear that not every place is experiencing this increase in reviews. What the statistical model seems to be picking up on is more a large-scale, structural relationship (to have a large number of reviews, you need a large number of visitors) and not something which is actually predictive.

Indeed, if we rerun the linear regression excluding just the top four attractions in terms of visitation numbers, we find that visitation explains significantly less of the variation in our data: a much more prosaic 67.7%. This strongly suggests that our initial analysis was biased by outliers. This level of prediction is nowhere near strong enough to make Google place reviews a reliable substitute for actual visitation numbers.
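The robustness check itself is straightforward; a minimal sketch is below, with my own function name and the caveat that the 67.7% figure comes from my actual data, not from this illustrative code.

```python
from scipy import stats

def fit_without_top_sites(avg_visits, review_counts, n_drop=4):
    """Drop the n_drop most-visited sites, then re-fit the regression and return R^2."""
    ranked = sorted(zip(avg_visits, review_counts), key=lambda pair: pair[0], reverse=True)
    visits, reviews = zip(*ranked[n_drop:])
    fit = stats.linregress(visits, reviews)
    return fit.rvalue ** 2
```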

Our prediction that the relationship would be very closely linear does not hold up to scrutiny. But why? What’s going on here? The short answer is that the number of reviews written for a particular place must depend on much more than simply the number of people who have visited. This is reasonable enough in hindsight. The other potential factors are fairly obvious: from the amount someone has paid for entry to their surprise at the quality of the attraction. Without accounting for these (which is often impossible), the analysis is doomed.

The long answer involves more digging. Using a sorting tool, I went looking for the places with the highest and lowest ratios of reviews to visitors. These should be the places where our hypothesis — that there is a set percentage of people who will leave reviews — fails most spectacularly. Looking at these places and observing reasons why they might be outliers should help us understand why the overall pattern did not hold.
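The “sorting tool” here is nothing more than a ranking by the reviews-per-visitor ratio. A sketch is below, with an assumed dictionary layout (each site name mapped to its review and visit counts) rather than my actual data structure.

```python
def rank_by_review_rate(sites):
    """Sort site names by lifetime reviews per average yearly visitor, lowest first.

    `sites` is assumed to map each site name to a dict with 'reviews' and 'visits' keys.
    """
    def rate(name):
        # Guard against division by zero for sites reporting no visitors at all.
        return sites[name]["reviews"] / max(sites[name]["visits"], 1)
    return sorted(sites, key=rate)

# The head of this ranking surfaces review-shy places (like the Antiquarium di Torre
# Cimalonga); the tail surfaces places reviewed far more often than they are counted.
```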

The Antiquarium of the Torre Cimalonga — This small archeological museum in a historic tower has the lowest rate of reviews to visitors for any place in my study, with 2 lifetime reviews and 5000 annual visitors. The linear model would predict 112 reviews for this level of visitation. Screenshot from Google Maps.

Looking first at the list of places where this ratio was the lowest (in other words, with the fewest reviews per visitor), I happened upon the Antiquarium di Torre Cimalonga. This is a small archeological museum in a remote village in Calabria (the toe of the boot), a region of Southern Italy that often ranks near the bottom in nationwide measures for poverty and corruption. This museum has an extremely low number of reviews (2 lifetime reviews) given the visitation numbers reported by the government, almost 110 reviews fewer than what would be predicted by our regression. In addition, one of those two reviews simply complains that the museum is closed.

From what I can tell, there are several plausible explanations for this low review-to-visitor ratio: 1) The museum seems very small — perhaps it is simply not impressive enough that many people think to review it, 2) The museum is in a very remote and infrequently touristed part of Italy — it is probable that most of the visits are from locals who either do not think to review something they view as a small local museum or are less familiar with online reviews, 3) It is also possible that the admissions numbers have been falsely inflated, especially since they may determine the amount of funding the museum receives. Similar explanations can be made for other museums on the list of low review-to-visitor ratios, although these explanations are not identical and there seems to be no discernible pattern in them — further evidence that a statistical analysis using online reviews seems untenable.

The Aragonese Castle in the town of Le Castella is extremely picturesque from the outside, and therefore likely to be reviewed; however, it is rarely open for interior visits, so many of the reviewers were simply never counted by the government, since they did not officially “visit”. Screenshot from Google Maps.

On the other side, among places which seem to have received an inordinate number of reviews per visitor, we have more of a pattern. Prominent among them are places which seem to have mistakenly low reported visitation numbers (mostly because they appear to have been closed for repairs in one or more of the years I averaged to get my visitation figures). Others, like the Aragonese Castle in Le Castella, are places which it is very possible to review without ever “visiting”. Attractions like this are very picturesque from the outside, but are only rarely open for official tours. Therefore, although only a small number of visitors are counted, a much higher number interact with the site from the outside only. In these cases, faulty data seem the most reasonable and consistent explanation: the fault lies either in the way I selected my data or in the inherent difficulty of counting “visitors” who never enter a building.

This marks a good place to transition back to my original question: Is there a better way to measure visitation than with the patchy, faulty, and selective data that it is traditionally possible to find? To this I give a grudging answer: not that I have found.

This analysis has certainly shown that online reviews are not a simple proxy for total visitation numbers; they are influenced by many other factors. Especially when looking at places where people seem strangely review-shy, I could find no general pattern which would explain the reticence: basically, some places just seem to produce a smaller review impulse than others.

However, I would say that not all is lost. Even if the number of online reviews is not perfectly correlated with total visits, that does not mean that it is a useless measure — just that it is a different one. It could be, in many ways, just as useful a variable as the total number of visitors, measuring the amount of interaction between visitors and an attraction: their level of engagement with it, even after the fact. Places that are reviewed more may simply be better tourist attractions, and it is entirely conceivable that using online review number data — simply for its own sake — may have as much use as the visitation data I was looking to predict.

Who knows for certain. All I know is that it is a possibility. This investigation may be over, but my adventures with online review data are just beginning.

*Although almost all reviews can be found publicly on websites, not all online reviewing companies maintain free APIs (interfaces which allow your computer to request data from websites on a larger scale). The Google Maps API, which I used in writing this piece, has fairly steep rates for its data collection, at least once your free trial has expired.
