Airbnb’s vision is to allow people to “Belong Anywhere” by helping travelers feel at home anywhere in the world. This vision is inspiring, yet it presents challenges on numerous fronts, one of which is overcoming the communication barrier. The diversity and richness of our cultures make us unique as humans, but can sometimes get in the way of relating to one another, given the varying forms in which we express ourselves through language. Thus, bridging the language gap between people is fundamental in helping to create a world where we can all feel belonging no matter where we are.
In this blog post, we will discuss how we built Airbnb’s Internationalization* (I18n) Platform in support of that vision, by serving content and its translations across product lines to our global community in an efficient, robust, and scalable manner.
* Internationalization is the process of adapting software to accommodate different languages, cultures, and regions (i.e. locales), while minimizing additional engineering changes required for localization.
Content in the Airbnb UI ends up being aggregated from hundreds of microservices, and displayed in the language specified by user preference or locality. Content is composed of phrases (content units), which are uniquely identified, stored in a central data repository, and dispatched after creation (or modification) for translation to all supported production languages. Once the translations are ready, they are propagated to client apps to be displayed to users.
There were several main system requirements we wanted to uphold in our design:
- Performant: translate calls are served with very low latency, without the need for further application-layer optimizations (ex: batching, deferred execution, parallelization).
- Scalable: the system should efficiently scale with the increase in client apps, supported languages, content units, and traffic growth.
- Available: the system is resilient to failures, and limits possible downtime for dependent clients.
- Cross-Language: apps across multiple platforms and programming languages are supported.
- Integration ease: onboarding clients is easy, seamless, and results in minimal churn to development.
A Content Management System (CMS) allows engineers and content strategists to create, access, and modify content. It also supports other management features such as submitting content for translation, adding relevant metadata (ex: description, screenshots) to improve translation quality, and tracking translation progress.
Content units (phrases) and associated metadata are stored and managed by a Content Service. Each phrase has a unique string key to identify it, along with a timestamp indicating when it was last updated. Phrases also belong to collections, which offer a logical grouping to provide context on the product or app domain in which the content is served.
Newly added or modified phrases that are marked as ready to translate are sent at a regular interval to an external Translation Vendor for translation into the set of target locales. Once done, the vendor notifies an External Callback Service of the new batch of translations, which is then packed and sent as translation events to our Event Bus. Following that, an event consumer listens for new translation events, parses the translation units, and writes them to the Translation Service, where they are persisted and sent for delivery to client apps.
The Translation Service stores all the translation versions for each phrase. Translations are keyed by [phrase key, locale, timestamp], and are immutable, in order to offer a historical audit trail. Only the latest translation for each phrase is served to clients.
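As a rough illustration of the keying scheme described above, the sketch below models an append-only store in memory. The class and method names (`Translation`, `TranslationStore`, `put`, `latest`, `history`) are hypothetical and are not the platform's actual API; the timestamp is assumed to be an integer such as epoch seconds.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)  # frozen: a stored version is never mutated in place
class Translation:
    phrase_key: str
    locale: str
    timestamp: int  # assumed: epoch seconds of when this version was created
    text: str


class TranslationStore:
    """Append-only store keyed by (phrase key, locale, timestamp)."""

    def __init__(self) -> None:
        self._versions: Dict[Tuple[str, str, int], Translation] = {}

    def put(self, t: Translation) -> None:
        key = (t.phrase_key, t.locale, t.timestamp)
        if key in self._versions:
            # Immutability: an existing version can never be overwritten.
            raise ValueError(f"version already exists: {key}")
        self._versions[key] = t

    def history(self, phrase_key: str, locale: str) -> List[Translation]:
        """All versions for a phrase and locale, oldest first (the audit trail)."""
        matches = [
            t for (k, l, _), t in self._versions.items()
            if k == phrase_key and l == locale
        ]
        return sorted(matches, key=lambda t: t.timestamp)

    def latest(self, phrase_key: str, locale: str) -> Optional[Translation]:
        """The only version that would be served to clients."""
        versions = self.history(phrase_key, locale)
        return versions[-1] if versions else None
```

A production store would index by phrase key and locale rather than scanning every version; the linear scan here only keeps the sketch short.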
A Snapshotter component periodically loads the latest translations for each locale, creates a JSON blob snapshot, and stores it with the associated timestamp to an object store. The snapshots offer a deterministic view of phrase translations at a specific point-in-time, and help with populating the local client-side translation cache (as we will see later).
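A minimal sketch of what a snapshot build could look like, assuming translation rows arrive as `(phrase_key, locale, timestamp, text)` tuples. The function name `build_snapshot` and the JSON layout are illustrative assumptions, not the actual snapshot format.

```python
import json
from typing import Dict, Iterable, Tuple

# Assumed row shape: (phrase_key, locale, timestamp, text)
Row = Tuple[str, str, int, str]


def build_snapshot(rows: Iterable[Row], locale: str, as_of: int) -> str:
    """Return a JSON blob with the latest translation per phrase at time `as_of`."""
    latest: Dict[str, Tuple[int, str]] = {}
    for phrase_key, row_locale, ts, text in rows:
        if row_locale != locale or ts > as_of:
            continue  # other locale, or created after the snapshot time
        if phrase_key not in latest or ts > latest[phrase_key][0]:
            latest[phrase_key] = (ts, text)
    snapshot = {
        "locale": locale,
        "timestamp": as_of,
        "translations": {key: text for key, (_, text) in latest.items()},
    }
    # sort_keys makes the blob independent of the order in which rows were read.
    return json.dumps(snapshot, sort_keys=True, ensure_ascii=False)
```

Filtering on `as_of` is what makes the view deterministic: rebuilding the snapshot for the same timestamp yields the same blob even if newer translations have arrived since.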
On each client instance (can range from a microservice to a web server), translation data is downloaded and stored in a Local Store, which acts as a persisted in-node on-disk key-value cache of all translations accessed by the app. This allows for resolving client translate requests locally and avoiding network calls to the server. There are several benefits to this approach, mainly in improving availability, reliability, and request latency. It also provides loose coupling between the Translation Service and clients, promoting resilience in case of service downtime.
An I18n agent is deployed on each client app instance as a separate process, and is responsible for keeping the Local Store in-sync with the server-side store. Responsibilities include: fetching the latest translations, performing pre/post processing, and managing on-disk storage. This helps encapsulate data access patterns and synchronization operations to the local cache, allowing easy integration with apps implemented in different languages (Java, JS, Ruby).
The main operations the agent performs are:
- Initialize: bootstrap new client app instances with the latest translations snapshot.
- Sync: continuously pull in new incoming translations and update the Local Store.
Figure: Sequence diagram of translate calls to the Translator library.
The Translator library in the I18n client is used by the app to translate content, given a phrase key and locale. The corresponding translations are fetched from the Local Store, with no fallback to the server if a translation does not exist. Missing content or translations are detected and remediated asynchronously via introspection (explained later). Serving translation requests locally allows us to achieve low latency (sub-millisecond), while ensuring deterministic load and scaling for our server fleet.
The client library has other features as well to support client apps, such as:
- Fallbacks: If a translation is not found in the requested locale, we fall back to a parent locale translation according to a predetermined fallback chain. If no translations are found in any fallback locale, we return the phrase in its original source language.
- Pluralization: Languages (ex: Russian) may have different plural rules, which translations should accommodate when numerical qualifiers are present in a phrase.
- Interpolation: Phrases can have embedded variables that are resolved at runtime. The client helps replace them with their associated value before returning the translation.
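The three features above can be sketched together in a single translate function. Everything here is an assumption made for illustration: the `%{name}` placeholder syntax, the `FALLBACKS` table, the nested-dict shape of the Local Store, and the function names. The Russian plural rule follows the standard one/few/many categories for integers.

```python
import re
from typing import Dict, List, Mapping, Optional, Union

# An entry is a plain string, or a mapping of plural category -> string.
Entry = Union[str, Mapping[str, str]]

SOURCE_LOCALE = "en"  # assumed source language of all phrases
# Hypothetical fallback chains: locales to try, in order.
FALLBACKS: Dict[str, List[str]] = {"fr-CA": ["fr-CA", "fr"], "ru": ["ru"]}


def plural_category(locale: str, n: int) -> str:
    """Plural category for an integer count (only Russian and a default rule)."""
    if locale.split("-")[0] == "ru":
        if n % 10 == 1 and n % 100 != 11:
            return "one"
        if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
            return "few"
        return "many"
    return "one" if n == 1 else "other"


def translate(
    local_store: Mapping[str, Mapping[str, Entry]],
    source_phrases: Mapping[str, Entry],
    key: str,
    locale: str,
    variables: Optional[Mapping[str, object]] = None,
    count: Optional[int] = None,
) -> str:
    # Fallbacks: walk the chain, then default to the source-language phrase.
    resolved_locale, entry = SOURCE_LOCALE, source_phrases[key]
    for candidate in FALLBACKS.get(locale, [locale]):
        found = local_store.get(candidate, {}).get(key)
        if found is not None:
            resolved_locale, entry = candidate, found
            break

    # Pluralization: pick the form using the rules of the locale that matched.
    if isinstance(entry, str):
        text = entry
    else:
        category = plural_category(resolved_locale, count if count is not None else 0)
        text = entry.get(category, entry.get("other", ""))

    # Interpolation: replace %{name}; unknown placeholders are left untouched.
    values = dict(variables or {})
    if count is not None:
        values.setdefault("count", count)
    return re.sub(r"%\{(\w+)\}", lambda m: str(values.get(m.group(1), m.group(0))), text)
```

Note that the plural rule is chosen by the locale that actually supplied the text, not the requested one, so a phrase that falls back to the source language is pluralized with source-language rules.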
As mentioned earlier, the I18n Agent keeps the Local Store up-to-date by pulling in new phrases and translations when available.
A basic approach is to periodically poll the Translation Service and retrieve the latest translations since the last time the store was updated. This is done by retaining a “last updated” timestamp and requesting new translations from the server created between that timestamp and the current time. This approach requires very little logic in the client libraries and puts all the heavy lifting work in the Translation Service, which can be made even more efficient with caches on the server side.
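The polling approach can be sketched as follows. The `fetch` callable stands in for a request to the Translation Service and is hypothetical, as are the `SyncAgent` name and the tuple-based row format; the current time is passed in explicitly to keep the example deterministic.

```python
from typing import Callable, Dict, List, Tuple

# Assumed row shape: (phrase_key, locale, timestamp, text)
Row = Tuple[str, str, int, str]
# Hypothetical server call: rows created in the window (since, until].
FetchFn = Callable[[int, int], List[Row]]


class SyncAgent:
    def __init__(self, fetch: FetchFn, last_updated: int = 0) -> None:
        self._fetch = fetch
        self.last_updated = last_updated
        # Local Store stand-in: (phrase_key, locale) -> (timestamp, text)
        self.local_store: Dict[Tuple[str, str], Tuple[int, str]] = {}

    def sync_once(self, now: int) -> int:
        """Pull translations created since the last sync; return how many arrived."""
        rows = self._fetch(self.last_updated, now)
        for phrase_key, locale, ts, text in rows:
            current = self.local_store.get((phrase_key, locale))
            # Keep only the newest version per (phrase key, locale).
            if current is None or ts >= current[0]:
                self.local_store[(phrase_key, locale)] = (ts, text)
        # Advance the watermark only after the batch was applied, so a failed
        # fetch is retried over the same window on the next run.
        self.last_updated = now
        return len(rows)
```

In a real agent, `sync_once` would run on a timer at the chosen polling cadence, and the watermark would be persisted alongside the Local Store so that a restarted instance resumes from where it left off.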
A few shortcomings and optimizations of this approach are:
- New translations are not available immediately, and can arrive with a delay equal to the polling cadence, which may be a matter of minutes or hours. This is a minor concern if some degree of staleness can be tolerated. To put things into perspective, the human translator SLA for delivering translated content is typically 1–3 days.
- Client app instances will batch fetch the latest translations on every sync run, which can add substantial load to our server database. As we scaled in the number of client apps and corresponding service fleet size, we adopted a NoSQL database as a derived storage system for our Translation Store in order to avoid performance bottlenecks and scale more effectively.
Note: An optimization would be to re-initialize the Local Store with the latest snapshot on every sync, rather than incrementally retrieving new translations within the time delta. This alleviates load from our online database, at the expense of higher data transfer and processing time. The cost can be mitigated by rebasing from snapshots less frequently.
Most applications serve only a small subset of content in their workflows. To avoid downloading all translations, only those pertaining to the set of phrase keys associated with an app are fetched. This drastically improves resource utilization (CPU, memory), request latency, and initialization operations.
The app phrase key set can be derived in several manners, for example:
- Configuration: Developers can specify information in service configuration about the phrases used by the app, such as phrase tenant/collections or key prefix matching.
- Static Analysis: Code can be examined during build time to extract all phrase keys used to make translation requests from the i18n client.
- Access Patterns: Phrase keys issued on translate calls can be aggregated dynamically during runtime and persisted on the server-side. The phrase key set is fetched on each app instance deployment and used to filter downloaded translations. The next section describes how this can be done in an efficient manner.
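To make the configuration and static-analysis options more concrete, here is a small sketch. It assumes translate calls look like `t("some.key")` in application source, which is a made-up convention for this example; real extraction would depend on each client library's call syntax and would likely use a proper parser rather than a regular expression.

```python
import re
from typing import AbstractSet, Dict, Iterable, Mapping, Sequence, Set

# Assumed call shape: t("phrase.key") or t('phrase.key', ...)
TRANSLATE_CALL = re.compile(r"""\bt\(\s*["']([^"']+)["']""")


def extract_phrase_keys(sources: Iterable[str]) -> Set[str]:
    """Static analysis: collect literal phrase keys passed to t(...)."""
    keys: Set[str] = set()
    for source in sources:
        keys.update(TRANSLATE_CALL.findall(source))
    return keys


def filter_translations(
    translations: Mapping[str, str],
    key_set: AbstractSet[str],
    prefixes: Sequence[str] = (),
) -> Dict[str, str]:
    """Keep only translations in the app's key set or under a configured prefix."""
    prefix_tuple = tuple(prefixes)
    return {
        key: text
        for key, text in translations.items()
        if key in key_set or key.startswith(prefix_tuple)
    }
```

Keys built dynamically at runtime cannot be found this way, which is one reason to combine static extraction with configured prefixes or observed access patterns.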
The i18n client collects information on the phrases used by the app on each request, such as application name, usage frequency, translate hit/miss results, and last access time. This info is stored on the client-side in a Metadata Cache, implemented as a concurrent bounded in-memory buffer (based on the producer-consumer pattern). The metadata entries are periodically flushed, dispatched to the server, and persisted in an App Metadata Store.
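A minimal sketch of such a buffer, using Python's standard `queue.Queue` as the bounded, thread-safe structure. The class name, the record fields, and the `send` callback (standing in for the dispatch to the App Metadata Store) are assumptions for illustration. Dropping events when the buffer is full is also an assumed policy, chosen here so that translate calls never block.

```python
import queue
from typing import Callable, Dict, List


class MetadataCache:
    def __init__(self, app_name: str, capacity: int,
                 send: Callable[[List[dict]], None]) -> None:
        self._app_name = app_name
        self._events: "queue.Queue[tuple]" = queue.Queue(maxsize=capacity)
        self._send = send  # hypothetical dispatch to the server

    def record(self, phrase_key: str, hit: bool, accessed_at: int) -> bool:
        """Producer side, called on each translate request. Never blocks."""
        try:
            self._events.put_nowait((phrase_key, hit, accessed_at))
            return True
        except queue.Full:
            return False  # buffer full: drop the event

    def flush(self) -> int:
        """Consumer side: drain the buffer, aggregate per phrase, and dispatch."""
        aggregated: Dict[str, dict] = {}
        while True:
            try:
                phrase_key, hit, accessed_at = self._events.get_nowait()
            except queue.Empty:
                break
            entry = aggregated.setdefault(phrase_key, {
                "app": self._app_name,
                "phrase_key": phrase_key,
                "hits": 0,
                "misses": 0,
                "last_access": accessed_at,
            })
            entry["hits" if hit else "misses"] += 1
            entry["last_access"] = max(entry["last_access"], accessed_at)
        if aggregated:
            self._send(list(aggregated.values()))
        return len(aggregated)
```

In practice `flush` would be called periodically from a background thread, while request threads only ever call `record`.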
Collecting metadata on content access has several benefits:
- Identifies app content ownership and phrase liveness.
- Optimizes translation fetching via app-based phrase filtering (discussed earlier).
- Prioritizes content to translate and avoids over-translating. This is specifically beneficial when adding new languages or importing new content (ex: product launch).
- Improves robustness of translation delivery pipeline, specifically by detecting missing translations from the Local Store and remediating accordingly.
Apps access a subset of phrases in their workflow during a deployment lifetime, which is usually small enough to fit in memory. Based on this, we implemented a second-layer in-memory cache, with a fixed capacity and a cache-aside strategy, to reduce disk access on translate requests. The cache also stores fallback resolutions to further reduce disk load frequency.
The cache can be pre-populated on app instance bootstrap or via a warm-up script, to improve the hit/miss ratio. A further optimization is to pin translation records in the cache (i.e. no time expiration policy). Cache entries are then refreshed in-place by periodically reloading from disk after write. An alternative to that would be to use a capture-and-replace mechanism for the entire cache to reduce locking contention (if space permits).
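The cache-aside layer can be sketched as below. The source only states that the cache has a fixed capacity; the least-recently-used eviction used here is an assumption, as are the class name and the `load_from_disk` callback, which stands in for a Local Store lookup that has already applied fallback resolution.

```python
from collections import OrderedDict
from typing import Callable, Optional, Tuple


class TranslationCache:
    def __init__(self, load_from_disk: Callable[[str, str], Optional[str]],
                 capacity: int) -> None:
        self._load = load_from_disk  # hypothetical Local Store read
        self._capacity = capacity
        self._entries: "OrderedDict[Tuple[str, str], Optional[str]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, phrase_key: str, locale: str) -> Optional[str]:
        key = (phrase_key, locale)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        # Cache-aside: on a miss, read from disk and populate the cache.
        self.misses += 1
        value = self._load(phrase_key, locale)
        self._entries[key] = value  # the resolved result is stored under the requested locale
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return value

    def warm_up(self, keys) -> None:
        """Pre-populate on bootstrap from an iterable of (phrase_key, locale)."""
        for phrase_key, locale in keys:
            self.get(phrase_key, locale)
```

This sketch is not thread-safe and never refreshes entries; an implementation along the lines described above would also reload pinned entries after the Local Store is written, or swap in a freshly built cache.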
The I18n Platform has been a critical piece in supporting our organic growth globally, and addressing the concerns of internationalization within our microservices architecture. Our system design is still evolving as we continue to improve on cost, scalability, and the performance of our pipelines, even as we’re presented with new product requirements.
As a result, the platform today serves more than 1 million pieces of content in 62 languages, and handles 100+ billion translate requests daily with microsecond-level latency.
The Internationalization Platform Team
Special thanks to Jad Abi-Samra, who co-authored this article.