Finding Vandals and Language Hotspots on OpenStreetMap
I’ve long wanted to see a true map of the world’s languages. We know where languages are supposed to be spoken, but where are the real borders, where are the little enclaves? This week I finally got the server space and ran the scripts to make it happen!
For this project, I am collecting data from OpenStreetMap, a user-edited map with free, open data. I take the primary ‘name’ tag of every point and, using Jan Lelis’s unicode-blocks Ruby gem, determine which Unicode blocks its characters fall in. This blog post itself is written in the blocks most familiar to English speakers, “Basic Latin”, “Latin-1 Supplement”, and “General Punctuation”, which are common enough that I’ve filtered them out of the map.
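A minimal sketch of that classification step, in Python rather than Ruby, and with a hand-copied subset of block ranges standing in for the unicode-blocks gem. The function names, the short table, and the NFC normalization step are assumptions for illustration; a real run would load the full block list from the Unicode Character Database.

```python
import unicodedata
from bisect import bisect_right

# Subset of Unicode blocks as (start, end, name), sorted by start code point.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0600, 0x06FF, "Arabic"),
    (0x07C0, 0x07FF, "NKo"),
    (0x0900, 0x097F, "Devanagari"),
    (0x0E80, 0x0EFF, "Lao"),
    (0x1400, 0x167F, "Unified Canadian Aboriginal Syllabics"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
    (0x2000, 0x206F, "General Punctuation"),
    (0x2C00, 0x2C5F, "Glagolitic"),
    (0x2D30, 0x2D7F, "Tifinagh"),
    (0xA720, 0xA7FF, "Latin Extended-D"),
]
STARTS = [start for start, _, _ in BLOCKS]

# Blocks filtered out of the map because they are too common to be interesting.
COMMON = {"Basic Latin", "Latin-1 Supplement", "General Punctuation"}


def block_of(ch):
    """Return the block name for one character, or None if it is not in the table."""
    cp = ord(ch)
    i = bisect_right(STARTS, cp) - 1
    if i >= 0 and cp <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return None


def interesting_blocks(name):
    """Return the sorted, de-duplicated uncommon blocks used in a name."""
    # NFC first, so a letter typed as base + combining mark is seen as one character.
    name = unicodedata.normalize("NFC", name)
    found = {block_of(ch) for ch in name}
    return sorted(b for b in found if b is not None and b not in COMMON)


# "\u1e5f" is the letter ṟ, as in Uluṟu.
print(interesting_blocks("Ulu\u1e5fu"))  # ['Latin Extended Additional']
print(interesting_blocks("Paris"))       # []
```

Characters outside the table simply return None here, so anything not listed is silently ignored; with the complete block list that case would not arise.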
Local Language Hotspots
Tifinagh (ⵜⵉⴼⵉⵏⴰⵖ) is used alongside Arabic across North Africa. In the past five years, OpenStreetMap users started labeling all Moroccan cities in Latin, Arabic, and Tifinagh script. You can see a handful of other locations in Algeria and Libya.
The letters in “Latin Extended-B” and “IPA Extensions” are common on a small section of the Guyana-Brazil border. This probably overlaps with a local language known as Wayampi. There is another cluster in the Tizi Ouzou region of Algeria. That doesn’t mean that these languages are related at all, only that when letters for new sounds and symbols were added to the Unicode standard, the ones both places use ended up in the same blocks.
Similarly, “Latin Extended-D” appears only on Easter Island.
I was delighted to see that N’Ko and Canadian Aboriginal Syllabics have their own clusters, even if they are small.
A single dot using a character from the “Latin Extended Additional” block reveals that Australia’s Uluṟu is spelled with the letter ṟ.
I found India’s two Antarctic bases because they are labeled in Devanagari script.
A Greek hostel in Vila Velha, Brazil? A Chinese bank in the Bahamas? There were several unusual outliers which I couldn’t fully identify or verify on Google Street View.
Cyrillic (Bulgarian) vs. Latin Extended (UK) in the South Shetland Islands.
A “Canadian Aboriginal Syllabics” point in Colombia caught my attention. This user stylized the shop name as ᗰI ᑕᗩᔕᗩ, but Mapnik had some issues rendering it.
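One way to separate this kind of stylized lettering from real syllabics text is to look for words that mix syllabics with plain ASCII letters. The sketch below is a rough heuristic, not something used to build the map; the function names are made up for illustration, and a name written entirely in lookalike characters would slip past it.

```python
def is_syllabic(ch):
    # Unified Canadian Aboriginal Syllabics block: U+1400 to U+167F.
    return 0x1400 <= ord(ch) <= 0x167F


def is_ascii_letter(ch):
    return ch.isascii() and ch.isalpha()


def looks_stylized(name):
    """True if any single word mixes syllabics with plain ASCII letters."""
    for word in name.split():
        if any(map(is_syllabic, word)) and any(map(is_ascii_letter, word)):
            return True
    return False


# The shop name from the point in Colombia; its first word mixes the two.
print(looks_stylized("ᗰI ᑕᗩᔕᗩ"))
```

The check works per word rather than per name because a shop sign that pairs an English word with a separate word in another script is perfectly normal, while a single word that switches alphabets halfway through rarely is.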
A seemingly harmless extra bus stop named in Lao script in a neighborhood outside of Adelaide, Australia, was scraped into the content generator OpeningHoursAU.com.
This point in New Zealand was labeled “Czech Republic” with an attempt to add an emoji flag.
I found a handful of points in Tenerife with names in Glagolitic (an old Croatian script that is no longer in everyday use). The reason for this type of vandalism is unclear, but the names should be removed.
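Outliers like these were found by eye, but the same idea can be sketched as a check: record roughly where each block is expected and flag named points that fall outside. The bounding boxes below are illustrative guesses, and the table and function name are hypothetical.

```python
# Rough boxes as (south, west, north, east) in degrees. These are hand-drawn
# approximations for illustration only, not authoritative boundaries.
EXPECTED = {
    "Lao": [(13.5, 100.0, 22.6, 107.8)],
    "Unified Canadian Aboriginal Syllabics": [(41.0, -141.0, 84.0, -52.0)],
    "Glagolitic": [],  # no everyday modern use, so every hit is worth a look
}


def is_outlier(block, lat, lon):
    """True if a point using this block lies outside all of its expected boxes."""
    boxes = EXPECTED.get(block)
    if boxes is None:
        return False  # no expectation recorded for this block
    return not any(s <= lat <= n and w <= lon <= e for s, w, n, e in boxes)


print(is_outlier("Lao", -34.9, 138.6))    # True: near Adelaide
print(is_outlier("Lao", 17.97, 102.63))   # False: inside the Lao box
print(is_outlier("Glagolitic", 28.3, -16.5))  # True: Tenerife
```

A flag is only a prompt to go and look. Embassies, restaurants, and diaspora communities produce plenty of legitimate names far from a script's home region.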
- OpenStreetMap data is © OpenStreetMap and contributors, and was downloaded from https://download.geofabrik.de/
- By grouping names by Unicode block, I miss the distinction between languages that share a script, such as Russian and Ukrainian.
- By using node names, I missed names used on lines and polygons, particularly roads, rivers, and buildings where I’ve seen local languages used in the past.
- I haven’t checked if Guyana and Easter Island have the ‘correct’ Latin extended letters for their names, just noting a common pattern.
- I checked only the primary name=__ tag and not the alternate names (name:en=__, name:es=__), so I’m leaving out a lot of multilingual names. This was OK with me because cities often have dozens of alternate Wikidata names, and the main name tag is the primary, most-seen one.
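To make the last few caveats concrete, here is a small sketch of what "primary name tag on nodes only" means in OSM XML terms. The sample document is invented for illustration, and the function name is hypothetical. Geofabrik extracts usually come as .osm.pbf files, which need a dedicated reader; the XML form is used here only because Python's standard library can parse it.

```python
import xml.etree.ElementTree as ET

# Invented sample in OSM XML form: one named node, one unnamed node, one named way.
SAMPLE = """<osm version="0.6">
  <node id="1" lat="-25.345" lon="131.036">
    <tag k="name" v="Uluṟu"/>
    <tag k="name:en" v="Ayers Rock"/>
  </node>
  <node id="2" lat="0.0" lon="0.0"/>
  <way id="3">
    <nd ref="1"/>
    <tag k="name" v="Some Road"/>
  </way>
</osm>"""


def primary_node_names(xml_text):
    """Return (lat, lon, name) for every node that carries a plain name tag."""
    root = ET.fromstring(xml_text)
    results = []
    for node in root.iter("node"):       # ways and relations are skipped
        for tag in node.findall("tag"):
            if tag.get("k") == "name":   # name:en, name:es, etc. are ignored
                lat = float(node.get("lat"))
                lon = float(node.get("lon"))
                results.append((lat, lon, tag.get("v")))
    return results


for lat, lon, name in primary_node_names(SAMPLE):
    print(lat, lon, name)
```

Only the first node is returned: the road's name lives on a way, and the English name lives under name:en, which are exactly the two kinds of data the notes above say were left out.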
Finding gaps in local language coverage
There are many areas which aren’t highlighted by the map because they were Anglicized by map editors for various reasons. As an example, places in the Marshall Islands aren’t labeled in their Marshallese names (e.g. Mājro, Arņo). Unicode may still add new codepoints for these letters (ņ here is repurposed from the Latvian alphabet).
It would be interesting to reach out to the editors who have added ~5 places in the N’Ko alphabet in Guinea, or to the brothers who created the Adlam alphabet, to expand the number of local scripts used on OSM.