X-SAMPA: An Alphabet That Changed the Arts of Voice Technologies

Delving into the Speech Assessment Methods Phonetic Alphabet

3 min read · Aug 17, 2022

For a long time, computer users were restricted in how they could express themselves in writing, because machines understood only 7-bit ASCII characters (which cover only the basic Latin letters, digits, and some punctuation and symbols). Unable to carry out linguistic analysis or work with phonetic notations such as IPA and other language-specific graphemes, linguists designed their own solutions.

At first, linguists redefined a standard character encoding by mapping custom-made fonts to specific code points. Very soon, however, it became clear that text encoded this way did not render properly when sent from one device to another, which made collaboration almost impossible.

The most successful solution to this incompatibility problem was SAMPA, the Speech Assessment Methods Phonetic Alphabet.

SAMPA

SAMPA was meant to be a bridge between ASCII characters and IPA symbols. Designed by a group of speech scientists from nine countries, it provided ASCII equivalents for the IPA symbols needed for phonetic transcription of the principal European languages.

SAMPA was devised as a hack to work around the inability of text encodings to represent IPA symbols.

Thus, for example, SAMPA notation uses [@] for the English schwa (IPA [ə]), [2] for the vowel in French deux (IPA [ø]), and so on.

However, SAMPA had some drawbacks that became more apparent with the development of speech-enabled technologies. Among them were:

a partial encoding of the IPA chart;
an inefficient way of encoding different languages: instead of using one universal alphabet like IPA, it stored each language's entries in separate data tables.

SAMPA was essentially a collection of tables to be compared, instead of a large universal table representing all languages.
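To make the "collection of tables" idea concrete, here is a minimal Python sketch. The tables are tiny illustrative subsets that I picked for the example, and the names `SAMPA_TABLES` and `sampa_symbol_to_ipa` are hypothetical; this is not an official SAMPA data file.

```python
# Per-language SAMPA tables. These are small illustrative subsets,
# not the complete SAMPA inventories for English or French.
SAMPA_TABLES = {
    "en": {"@": "ə", "{": "æ", "T": "θ", "S": "ʃ", "N": "ŋ"},
    "fr": {"2": "ø", "9": "œ", "S": "ʃ", "R": "ʁ"},
}


def sampa_symbol_to_ipa(symbol: str, language: str) -> str:
    """Look up one SAMPA symbol in the table of the given language."""
    table = SAMPA_TABLES[language]
    if symbol not in table:
        raise KeyError(f"{symbol!r} is not in the {language!r} subset")
    return table[symbol]


print(sampa_symbol_to_ipa("@", "en"))  # English schwa
print(sampa_symbol_to_ipa("2", "fr"))  # vowel of French "deux"
```

Note that every lookup needs to know the language first, which is exactly the limitation described above.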

X-SAMPA

X-SAMPA is an alphabet that changed voice technologies by making it possible to encode every symbol in the IPA chart, including all diacritics. In principle, this makes it possible to produce a machine-readable phonetic transcription for every known human language.

As an extended version of SAMPA, this notation is more universally applicable because it encodes all IPA characters within a single table. X-SAMPA is also easy for machines to parse, since it uses nothing but plain ASCII characters. Below, you can see an example of X-SAMPA to IPA translation:

An interactive example of X-SAMPA to IPA translation.
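For readers who prefer code, here is a rough Python sketch of such a translation. It assumes a small hand-picked subset of the X-SAMPA table (the real one is much larger), and `XSAMPA_TO_IPA` and `xsampa_to_ipa` are names I made up for this illustration. Because some X-SAMPA symbols are longer than one character (for example `r\` or `_h`), the sketch tries the longest possible match first. Characters missing from the subset are passed through unchanged, which is a simplification.

```python
# A small subset of the X-SAMPA table, for illustration only.
XSAMPA_TO_IPA = {
    "@": "ə", "2": "ø", "{": "æ", "E": "ɛ", "I": "ɪ", "U": "ʊ",
    "T": "θ", "S": "ʃ", "N": "ŋ", "g": "ɡ",
    "r\\": "ɹ", "R": "ʁ", "R\\": "ʀ",
    '"': "ˈ", "%": "ˌ", ":": "ː", "_h": "ʰ",
}
MAX_LEN = max(len(key) for key in XSAMPA_TO_IPA)


def xsampa_to_ipa(text: str) -> str:
    """Translate X-SAMPA to IPA with greedy longest-match lookup."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so "r\" wins over plain "r".
        for size in range(MAX_LEN, 0, -1):
            chunk = text[i:i + size]
            if chunk in XSAMPA_TO_IPA:
                out.append(XSAMPA_TO_IPA[chunk])
                i += len(chunk)
                break
        else:
            # Not in the subset: copy the character as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


print(xsampa_to_ipa("TIN"))      # θɪŋ
print(xsampa_to_ipa('h@"l@U'))   # həˈləʊ
print(xsampa_to_ipa("r\\Ed"))    # ɹɛd
```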

Industry Applications

Both SAMPA and X-SAMPA have been widely used in speech technology and as encoding systems in computational linguistics. Today, you can easily run into the need for IPA to X-SAMPA translation while developing virtual assistants and chatbots, since X-SAMPA is plain ASCII and therefore easier for computers to parse.

Where Unicode (ISO/IEC 10646) is unavailable or inappropriate, X-SAMPA offers a robust basis for international collaboration on a standard machine-readable encoding of phonetic notation. It is therefore well suited to building multilingual dictionaries, automatic language identification, and multilingual speech recognition and synthesis. Moreover, it is still used with popular software packages that require ASCII input, e.g. RuG/L04 and SplitsTree4.
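As a rough illustration of preparing input for an ASCII-only tool, the sketch below converts IPA to X-SAMPA and then checks that nothing outside ASCII is left. The table is again a small subset I chose for the example, the names `IPA_TO_XSAMPA` and `ipa_to_xsampa` are hypothetical, and the character-by-character mapping is an assumption that ignores combining diacritics.

```python
# A small IPA -> X-SAMPA subset, for illustration only.
IPA_TO_XSAMPA = {
    "ə": "@", "ø": "2", "ɛ": "E", "ɪ": "I", "ʊ": "U",
    "θ": "T", "ʃ": "S", "ŋ": "N", "ɹ": "r\\",
    "ˈ": '"', "ˌ": "%", "ː": ":",
}


def ipa_to_xsampa(text: str) -> str:
    """Convert IPA to X-SAMPA and verify the result is pure ASCII."""
    converted = "".join(IPA_TO_XSAMPA.get(ch, ch) for ch in text)
    if not converted.isascii():
        # Report the symbols the subset table could not translate.
        missing = sorted({ch for ch in converted if not ch.isascii()})
        raise ValueError(f"no X-SAMPA mapping in this subset for: {missing}")
    return converted


print(ipa_to_xsampa("θɪŋ"))     # TIN
print(ipa_to_xsampa("həˈləʊ"))  # h@"l@U
```

Failing loudly on an unmapped symbol seems safer here than silently passing non-ASCII text to a tool that cannot read it.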

I plan to work on this topic further and to publish an article about the principles of X-SAMPA translation in Python. In the meantime, you can have a look at my previous article, where I discuss how to handle English to IPA translation efficiently for your speech projects:

How to Teach Your Virtual Assistant English Spelling

Building a converter to the International Phonetic Alphabet with Python

medium.com

Who Stays Behind Your New Voice Assistant, or What I Learned as a Language Annotator for Big Tech

Have you ever caught yourself wondering how your bright new Alexa understands you right off the bat?