Finding Vandals and Language Hotspots on OpenStreetMap

Nick Doiron
Mar 11, 2019

I’ve long wanted to see a true map of the world’s languages. We know where languages are supposed to be spoken, but where are the real borders, where are the little enclaves? This week I finally got the server space and ran the scripts to make it happen!

For this project, I am collecting data from OpenStreetMap, a user-edited map with free, open data. I take the primary ‘name’ tag of every point (node) and, using Jan Lelis’s unicode-blocks Ruby gem, determine which Unicode blocks its characters fall into. This blog post is written in the blocks most familiar to English speakers: “Basic Latin”, “Latin-1 Supplement”, and “General Punctuation”. These are common enough that I’ve filtered them out of the map.
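To make the block lookup concrete, here is a minimal Python sketch of the idea. It is not the unicode-blocks gem's API and not the script used for the map: the names BLOCKS, IGNORED, block_of, and notable_blocks are hypothetical, and the table is only a small hand-copied subset of the Unicode block ranges (the standard's official name for the syllabics block is "Unified Canadian Aboriginal Syllabics").

```python
# Small, hand-picked subset of Unicode blocks (start, end, official name).
# A real run would need the full block table; this subset is an assumption
# made to keep the sketch short.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0600, 0x06FF, "Arabic"),
    (0x07C0, 0x07FF, "NKo"),
    (0x0900, 0x097F, "Devanagari"),
    (0x0E80, 0x0EFF, "Lao"),
    (0x1400, 0x167F, "Unified Canadian Aboriginal Syllabics"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
    (0x2000, 0x206F, "General Punctuation"),
    (0x2C00, 0x2C5F, "Glagolitic"),
    (0x2D30, 0x2D7F, "Tifinagh"),
    (0xA720, 0xA7FF, "Latin Extended-D"),
]

# Blocks the post says are too common to be interesting on the map.
IGNORED = {"Basic Latin", "Latin-1 Supplement", "General Punctuation"}


def block_of(char):
    """Return the block name for one character, or None if not in BLOCKS."""
    code_point = ord(char)
    for start, end, name in BLOCKS:
        if start <= code_point <= end:
            return name
    return None


def notable_blocks(place_name):
    """Return the sorted, non-ignored blocks used by a place name."""
    found = {block_of(char) for char in place_name}
    return sorted(b for b in found if b is not None and b not in IGNORED)


print(notable_blocks("Ulu\u1e5fu"))   # name containing U+1E5F
print(notable_blocks("Main Street"))  # plain ASCII name
```

A name made up entirely of ignored blocks returns an empty list, so it would not appear on the map. A name with even one character outside them is reported under that character's block.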


You can view the map online, and the data and source code are on GitHub. I’ll make GeoJSON files later.

Local Language Hotspots

Tifinagh (ⵜⵉⴼⵉⵏⴰⵖ) is used alongside Arabic across North Africa. In the past five years, OpenStreetMap users started labeling all Moroccan cities in Latin, Arabic, and Tifinagh script. You can see a handful of other locations in Algeria and Libya.


Letters from “Latin Extended-B” and “IPA Extensions” are common along a small section of the Guyana-Brazil border. This probably corresponds to a local language known as Wayampi. There is another cluster in the Tizi Ouzou region of Algeria. That doesn’t mean that these languages are related at all; it just means that when new sounds and symbols were added to the Unicode standard, both were included in the same update.


Similarly, “Latin Extended-D” appears only on Easter Island.

I was delighted to see that N’Ko and Canadian Aboriginal Syllabics have their own clusters, even if they are small.


A single dot in the “Latin Extended Additional” block reveals that Australia’s Uluṟu is spelled with the letter ṟ.

I found India’s two Antarctic bases because they are labeled in Devanagari script.


A Greek hostel in Vila Velha, Brazil? A Chinese bank in the Bahamas? There were several unusual outliers which I couldn’t fully identify or verify on Google Street View.


Cyrillic (Bulgarian) vs. Latin Extended (UK) in the South Shetland Islands.



A “Canadian Aboriginal Syllabics” point in Colombia caught my attention. This user stylized the shop name as ᗰI ᑕᗩᔕᗩ, using syllabics characters that resemble the Latin letters of “MI CASA”, but Mapnik had some issues rendering it.


A seemingly harmless extra bus stop named in Lao script in a neighborhood outside of Adelaide, Australia, was scraped into the content generator


This point in New Zealand was labeled “Czech Republic” with an attempt to add an emoji flag.


I found a handful of points in Tenerife with Glagolitic names (an old script once used in Croatia and no longer in everyday use). The reason for this type of vandalism is unclear, but the points should be removed.


  • OpenStreetMap data is © OpenStreetMap and contributors, and was downloaded from
  • By grouping names by Unicode block, I miss the distinction between languages that share a script, such as Russian and Ukrainian.
  • By using node names, I missed names used on lines and polygons, particularly roads, rivers, and buildings where I’ve seen local languages used in the past.
  • I haven’t checked if Guyana and Easter Island have the ‘correct’ Latin extended letters for their names, just noting a common pattern.
  • Because I checked only the primary name=__ tag and not the alternate names (name:en=__, name:es=__), I’m leaving out a lot of multilingual names. This was OK with me because cities often have dozens of alternate Wikidata names, and the main name tag is the primary, most-seen one.
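The node-only, primary-name-only scope described in these notes can be sketched in a few lines of Python. This is an illustration under assumptions, not the script behind the map: it reads OSM XML with the standard library, the function name primary_node_names is hypothetical, and the sample data (including its coordinates and the name:en value) is invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample in OSM XML form: one named node, one unnamed node, and one
# named way. Coordinates and tag values are made up for illustration.
SAMPLE_OSM = """<osm version="0.6">
  <node id="1" lat="-25.0" lon="131.0">
    <tag k="name" v="Ulu&#x1E5F;u"/>
    <tag k="name:en" v="Example English Name"/>
  </node>
  <node id="2" lat="0.0" lon="0.0"/>
  <way id="3">
    <nd ref="1"/>
    <tag k="name" v="Example Road"/>
  </way>
</osm>"""


def primary_node_names(osm_xml):
    """Return (lat, lon, name) for every node that has a primary name tag.

    Ways and relations are skipped, and so are alternate tags such as
    name:en, which mirrors the limitations listed in the notes above.
    """
    root = ET.fromstring(osm_xml)
    results = []
    for node in root.iter("node"):
        for tag in node.findall("tag"):
            if tag.get("k") == "name":
                lat = float(node.get("lat"))
                lon = float(node.get("lon"))
                results.append((lat, lon, tag.get("v")))
                break  # at most one primary name per node
    return results


for lat, lon, name in primary_node_names(SAMPLE_OSM):
    print(lat, lon, name)
```

Only the first node is returned here: the second has no name, and the road is a way rather than a node. A full planet extract is far too large to load into memory like this, so a real run would need a streaming parser.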

Finding gaps in local language coverage

There are many areas which aren’t highlighted by the map because they were Anglicized by map editors for various reasons. As an example, places in the Marshall Islands aren’t labeled with their Marshallese names (e.g. Mājro, Arņo). Unicode may still add new codepoints for these letters (ņ here is repurposed from the Latvian alphabet).

It would be interesting to reach out to the editors who have added ~5 places in the N’Ko alphabet in Guinea, or to the brothers who created the Adlam alphabet, to expand the number of local scripts used on OSM.

Place names including Arabic in Africa, via OpenStreetMap
