Places and their names — observations from 11 million place names
Quite often in my weekend data visualization projects, I first find an interesting data set and only afterwards might (or might not) get an idea of what to actually do with it. That was also the case with this story.
A few weeks ago I ended up downloading the whole GeoNames dump of over 11 million place names and their coordinates. At the time I didn’t have any idea what to do with it. I like to think of this kind of thing as a data challenge: an interesting opportunity to teach myself some new tech/dataviz skills and to find interesting or funny insights in the data.
The tools used in this project were QGIS and PostGIS. PostGIS allowed me to store and query the data easily, and QGIS was the tool I used for visualizing it. For all of the visualizations you see in this blog post I didn’t need to write any code, only a few simple SQL statements. I’m not going to go through all the QGIS visualization methods here, but all of them are “out-of-the-box” stuff too. Ask me if you want to know something specific.
I wanted to find patterns in the names, so I explored whether they started or ended in a certain way or just contained a certain word. In SQL this means using the % wildcard, which matches any sequence of characters: a pattern like bad% finds a prefix, %bad finds a suffix, and %bad% finds the string anywhere in the name. ILIKE makes the match case-insensitive. So for instance the following query would return every name containing bad anywhere in it:
SELECT * FROM geonames WHERE name ILIKE '%bad%';
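To make the three pattern shapes concrete, here is a minimal sketch that runs them against a handful of made-up rows. It is only an illustration and not the setup used for the maps: it uses Python's built-in sqlite3 module instead of PostGIS, and SQLite's LIKE (which ignores case for ASCII letters) stands in for PostgreSQL's ILIKE. The sample names and the helper names_like are assumptions for the example.

```python
import sqlite3

# Hypothetical sample rows; the real table holds over 11 million names.
SAMPLE_NAMES = [
    "Bad Homburg", "Hyderabad", "Badajoz",
    "Santa Cruz", "San Pedro", "Whitby",
]

# In-memory SQLite database standing in for the PostGIS table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE geonames (name TEXT)")
conn.executemany(
    "INSERT INTO geonames (name) VALUES (?)",
    [(name,) for name in SAMPLE_NAMES],
)


def names_like(pattern):
    """Return all names matching a LIKE pattern, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM geonames WHERE name LIKE ? ORDER BY name",
        (pattern,),
    )
    return [row[0] for row in rows]


prefix_matches = names_like("bad%")      # names starting with "bad"
suffix_matches = names_like("%bad")      # names ending with "bad"
anywhere_matches = names_like("%bad%")   # names containing "bad"
san_matches = names_like("san %")        # "San" plus a space, so "Santa" is excluded

print(prefix_matches, suffix_matches, anywhere_matches, san_matches)
```

The trailing space in the last pattern is what keeps "San" and "Santa" apart when two prefixes overlap.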
After downloading the data, I unzipped it, loaded the .txt file into my local PostGIS database using QGIS DB Manager and threw the data onto the QGIS canvas to see what it looked like. After I did some filtering and crowd-sourced ideas on Twitter (big thanks to everyone who contributed!), I decided that this data could very well contain some blog post-worthy material about places, their names, cultures and geography in general!
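For readers who would rather peek at the file before loading it anywhere, here is a rough sketch of how one line of the dump can be parsed. This is not how the data was loaded for this post (that was done with QGIS DB Manager). It assumes the tab-separated, 19-column layout described in the GeoNames download readme; the column names below follow that readme rather than the column names in my database, and the sample row is invented.

```python
# Column order as documented in the GeoNames download readme (assumed here).
GEONAMES_COLUMNS = [
    "geonameid", "name", "asciiname", "alternatenames",
    "latitude", "longitude", "feature_class", "feature_code",
    "country_code", "cc2", "admin1_code", "admin2_code",
    "admin3_code", "admin4_code", "population", "elevation",
    "dem", "timezone", "modification_date",
]


def parse_geonames_line(line):
    """Split one tab-separated dump line into a dict keyed by column name."""
    # Strip only the line ending; trailing empty fields must be preserved.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(GEONAMES_COLUMNS):
        raise ValueError(
            f"expected {len(GEONAMES_COLUMNS)} fields, got {len(fields)}"
        )
    record = dict(zip(GEONAMES_COLUMNS, fields))
    # Convert the fields needed for mapping and label sizing.
    record["latitude"] = float(record["latitude"])
    record["longitude"] = float(record["longitude"])
    record["population"] = int(record["population"] or 0)
    return record


# Invented row for illustration only; not a real GeoNames record.
sample_line = "\t".join([
    "1", "Exampleville", "Exampleville", "", "60.5", "24.5", "P", "PPL",
    "FI", "", "", "", "", "", "1200", "", "10", "Europe/Helsinki",
    "2018-01-01",
]) + "\n"

record = parse_geonames_line(sample_line)
print(record["name"], record["latitude"], record["longitude"])
```

The feature_code field is what a filter like "populated places only" works on, and population is the attribute that can drive label sizes.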
What is GeoNames and what is it good for?
Before we dive into the data in the form of maps, a few words about the data source itself. On their site the data is described as follows:
The GeoNames geographical database is available for download free of charge under a creative commons attribution license. It contains over 25 million geographical names and consists of over 11 million unique features whereof 4.8 million populated places and 13 million alternate names. All features are categorized into one out of nine feature classes and further subcategorized into one out of 645 feature codes.
So it is a CC 3.0 licensed gazetteer data set. This means that it is a huge list of names with additional attributes and coordinates. The detail in the GeoNames data ranges from continent names down to individual rocks or hotels. Another very ambitious gazetteer project is Who’s on First by the late Mapzen. This 26 million record data set is also very much worth looking into.
One important thing to notice with GeoNames is that although the data is global, the names are not equally distributed across the globe. At all. This is partly because users can add and edit names themselves. GeoNames.org has a long list of data sources that are the original source of the GeoNames data.
As can be seen from the graph below, the United States is the clear number one in GeoNames coverage with more than 2 million names. After the U.S., the most covered countries are China, India, Norway, Mexico, Russia, Canada and Thailand. Norway in particular is a slightly strange entry on that list.
You might have already spotted this from the first image in this blog post. The map clearly shows that some countries (e.g. Norway, Morocco) have much brighter colors than others and thus contain many more names. I would assume that an optimal global data set would look a lot like a population map or a map of the human footprint when visualized that way.
The uneven distribution makes the data a bit difficult for cartographic use. Including a lot of veeeery small villages and a lot of Best Western hotels is probably also slightly annoying for some users. But despite its faults, GeoNames is a very good generic source for global place names.
Rājekumāravenkataperumālrāzumbahadūrvāripeta and other places I’d like to visit
Before making any maps, I wanted to do some basic SQL analysis. One of the first things I checked was the 5 longest place names from the data. I excluded from the SQL statement all names with spaces or dashes in them. Here’s the top 5:
- Taumatawhakatangihangakoauauotamateapokaiwhenuakitanatahu
- Rājekumāravenkataperumālrāzumbahadūrvāripeta
- Hangukhwangyeongjeongchaekpyeonggayeonguwon
- Jainnonghyeopjeontongjangnyugagonggongjang
- Hangukdambaeinsamgonsasuwonjejochangsawon
Isn’t language just beautiful? So at least according to GeoNames the longest name in the world isn’t the famous Welsh Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch.
I also found out that there are 21 names in the data with a single character and 3 072 with two characters. On average the names are 13.09 characters long.
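The same kind of checks are easy to reproduce outside the database. Below is a small sketch in plain Python, assuming the names are already available as a list of strings; the function names and the sample list are made up for the example, and the real numbers above came from SQL, not from this code. Lengths are counted in Unicode code points, so a letter written with a separate combining mark would count as two.

```python
def longest_single_word_names(names, limit=5):
    """Return the longest distinct names that contain no spaces or dashes."""
    candidates = {n for n in names if " " not in n and "-" not in n}
    # Longest first; ties are broken alphabetically to keep the order stable.
    return sorted(candidates, key=lambda n: (-len(n), n))[:limit]


def count_with_length(names, length):
    """Count the names that are exactly `length` characters long."""
    return sum(1 for n in names if len(n) == length)


def average_length(names):
    """Return the mean name length in characters."""
    return sum(len(n) for n in names) / len(names)


# Small illustrative sample, not the real data set.
sample = [
    "Å", "Ii", "Oulu", "Rio de Janeiro",
    "Aix-en-Provence", "Llanfairpwllgwyngyll",
]

top_two = longest_single_word_names(sample, limit=2)
single_char_count = count_with_length(sample, 1)
mean_short = average_length(["Ii", "Oulu"])

print(top_two, single_char_count, mean_short)
```

Excluding names with spaces or dashes is what keeps multi-word names like "Rio de Janeiro" out of the longest-name ranking.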
Enough text. On with the maps!
Time for some name mapping!
After the data was loaded and a few basic searches were done, it was time to visualize it. In this section you can see a collection of maps on different themes.
Below is a new version of one of the maps I already published on Twitter, which is how the whole blog post got started. There are a lot of places in South America named after a saint. The map compares two different prefixes (Santa & San). San (with a space after the word!) appears to be slightly more common overall, but Santa is a common prefix in Brazil. It also looks like someone in Paraguay has been really active with adding their data to GeoNames…
There are 18 082 populated places in the data where the name is three characters or less. Below is a map of those places in Europe. The labels are dynamically sized, so if a place has a high number in its population attribute, the label is slightly bigger (e.g. see that Gay in Russia). A few interesting observations can be made from the map, like the difference between short names in Northern and Southern Spain 🤔
I visited Denmark recently to give a conference talk and I wanted to do something related to Denmark for my presentation. The gif animation below shows names which end with the suffix by, havn, hus or borg. I was enlightened on Twitter that the by’s in Eastern England are due to Viking heritage.
The following map was one of the most interesting IMHO. Cheers Stephen R. Smith for the great idea! The idea was to visualize how words related to nature are distributed. This covers not only populated places, but all names in the data. One reason that this worked out so well was the high coverage of names in the U.S.
Another great idea from Twitter was to explore the words north, south, east and west in the names. I did this for Great Britain and Ireland, but the results aren’t nearly as clear as for the nature words above. This is of course due to the fact that ‘south of something’ can be on a block level, city level or country level. As these get mixed up at this scale, any patterns are hard to spot on a national level, as you can see in the animation below. The only thing that makes some sense here are the “West-” names in western Belgium. Nothing to see here!
What could be a common pattern between German-language names and names in the Middle East? A BAD suffix! In German this means a spa, whereas about the other ‘bads’ Wikipedia tells us the following:
-abad is a suffix that forms part of many west, central and south Asian city names originally derived from the Persian language term ābād (آباد), meaning “cultivated place” (village, city, region), and commonly attached to the name of the city’s founder or patron.
As I already wrote earlier, there are a lot of weird and long names in the data. I wanted to find out which country has the longest place names on average. To find that out, I used the following query:
SELECT country, avg(length(name)) AS avg_length, count(name) AS name_count
FROM geonames
WHERE name NOT LIKE '% %' AND featcode = 'PPL'
GROUP BY country
ORDER BY avg_length DESC;
I left out countries with very few (<1000) names in total and eventually the winner was Sri Lanka! To celebrate this great finding, I made this map with a collection of some of the longer names there. This was also an excuse to try out this kind of antique cartographic style in QGIS.
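As a rough sketch of how that ranking and the "leave out small countries" step could be combined into one query, here is a version with a HAVING clause, run on a few invented rows. The assumptions are loud ones: SQLite (through Python's sqlite3) stands in for PostGIS, the rows are made up, the threshold is applied to the filtered single-word populated places rather than to all names in a country, and the function name and min_names parameter are hypothetical. I am not claiming this is how the filtering was done for the map.

```python
import sqlite3

# Invented rows: (name, country, featcode). Not real GeoNames records.
ROWS = [
    ("Alphabetagamma", "AA", "PPL"),         # 14 characters
    ("Deltaepsilon", "AA", "PPL"),           # 12 characters
    ("Hotelname", "AA", "HTL"),              # excluded: not a populated place
    ("Ab", "BB", "PPL"),                     # 2 characters
    ("Abcd", "BB", "PPL"),                   # 4 characters
    ("Two Words", "BB", "PPL"),              # excluded: contains a space
    ("Averyveryverylongname", "CC", "PPL"),  # dropped by the minimum count
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE geonames (name TEXT, country TEXT, featcode TEXT)")
conn.executemany("INSERT INTO geonames VALUES (?, ?, ?)", ROWS)


def average_name_length_by_country(min_names):
    """Rank countries by average single-word PPL name length, longest first."""
    query = """
        SELECT country,
               avg(length(name)) AS avg_length,
               count(name) AS name_count
        FROM geonames
        WHERE name NOT LIKE '% %' AND featcode = 'PPL'
        GROUP BY country
        HAVING count(name) >= ?
        ORDER BY avg_length DESC
    """
    return conn.execute(query, (min_names,)).fetchall()


# With the real data the threshold would be something like 1000.
ranking = average_name_length_by_country(min_names=2)
print(ranking)
```

Sorting in descending order puts the country with the longest average names on the first row, and the minimum count keeps a single very long name from winning on its own.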
Ville (apart from being a Finnish name) means a city or a town in French. As you might know, there are several villes around the world, not only in French-speaking countries but also in the U.S. (e.g. Knoxville, Nashville). Here’s how they look in France on a heatmap, where a brighter color means more names with the suffix ville. Large parts of France turn out to have no villes at all. There’s probably again a historic reason for this that somebody can tell me?
According to good old Wikipedia, the most common place name is Washington. Here are the ones that were in GeoNames. I bet someone has visited all of them and if not, there’s a good holiday plan for someone.
As a bonus, here’s a thing I did already a while back. It’s a collection of place names in Finland containing the word ‘paska’ (=shit). There are 569 of them in total. Data from National Land Survey, not GeoNames.
Hope you had nearly as much fun reading this as I had exploring and visualizing the data! 🙌