Microservices Travel Journal

Paweł Zajączkowski
Omio Engineering
Aug 22, 2019 · 10 min read

As a travel platform, Omio’s success is built upon providing comprehensive coverage of routes and providers to our customers. Today, the routes we serve cover more than 100,000 destinations across more than 800 service providers. However, integrating hundreds of service providers is challenging, as their technology varies widely and some providers have limited technical capabilities. In this blog post, we will present how we scaled up provider integration at Omio by transitioning from monolithic backend systems to a microservice architecture.

Serving so many routes with various means of transportation, a myriad of offers, promotions, rules, dynamic timetables, shifting grounds and a seemingly infinite number of journey segment combinations is no easy engineering feat. At first, when the scale of the traffic, data, codebase and the number of external integrations was much smaller, the magic that happened between pushing the search button on the landing page and displaying a list of the most interesting connections was handled by a single monolithic application. But it became increasingly difficult to develop and operate as everything continued to grow rapidly in several dimensions.

If you are an IT professional who follows current software architecture and technology trends, you probably know what happened next. We followed in the footsteps of giants like Netflix and Amazon and jumped aboard the microservices train. Before we dive into technical details, let’s have a closer look at the business challenges we had to face. After all, however beautiful the technology is, it has to serve a particular purpose. We will then explore the problems we encountered with our previous architecture, how we envisioned dividing the system into manageable parts, the gradual process of implementing that division, the new challenges that arose along the way, and our solutions to them.

The Polymorphous Inventory

For a travel platform like ours, a large and accurate inventory, or coverage, is crucial to success.

Omio evolved from being a metasearch engine for ground and air connections into a full-fledged travel platform that allows our customers not only to find the best routes but also to buy tickets directly on the Omio website. Coverage is a commonly heard notion within our company. Carriers, or providers, like Deutsche Bahn, run a variety of their own search and booking systems. One core mission at Omio is to integrate with external providers to expand our route coverage, or inventory.

Such integration happens in very different ways. In many cases, providers expose a REST or SOAP API that we connect to for both searches and bookings. In some cases, we can only search, and to perform the actual booking we need to redirect the customer to the provider’s website. Some providers do not have APIs at all, and then there are several ways we can proceed. We can scrape HTML from their web pages, process it and serve it in our search results; not an ideal solution, but better than nothing. We can fetch, process and serve a timetable in a widely used format, one of them being GTFS (General Transit Feed Specification), created by Google. Another format is HAFAS, created by HaCon, a logistics and transportation software company, originally for Deutsche Bahn but later adopted by over 60 carriers across Europe. Sometimes all we get is a colorful custom Excel file that may look wonderful on the wall of a small station somewhere in the Alps but is not exactly what a computer system would like to parse. We also have the option to put a connector on the provider’s premises so that we can talk to it with our standard protocol. And finally, for very small providers without an IT system, we take care of everything for them, including the actual ticket generation.
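For the timetable-based route, here is a minimal sketch (not our production code) of what reading the stops.txt file of a GTFS feed could look like in Java. The field names come from the GTFS specification; the class and method names are invented for this example, and a real integration would use a proper CSV parser.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of loading stops from a GTFS feed's stops.txt.
// GTFS is plain CSV; the header row tells us which column is which.
// Production code would use a real CSV parser (quoted fields, BOMs, etc.).
public class GtfsStopsReader {

    record Stop(String id, String name, double lat, double lon) {}

    static List<Stop> readStops(Path stopsFile) throws IOException {
        List<String> lines = Files.readAllLines(stopsFile);
        List<String> header = Arrays.asList(lines.get(0).split(","));
        int idCol = header.indexOf("stop_id");
        int nameCol = header.indexOf("stop_name");
        int latCol = header.indexOf("stop_lat");
        int lonCol = header.indexOf("stop_lon");

        List<Stop> stops = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            String[] f = line.split(",");
            stops.add(new Stop(f[idCol], f[nameCol],
                    Double.parseDouble(f[latCol]), Double.parseDouble(f[lonCol])));
        }
        return stops;
    }
}
```

The parsed stops would then be mapped onto our common station model before they are served in search results.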

The Monolith

Five years ago, Omio started by serving just a few service providers, such as Deutsche Bahn and Renfe. Back then, integrating with only two providers was straightforward and manageable. As a result, a simple architecture was good enough, and everything related to provider integration was implemented as a single monolithic service.

Integrating providers is only one piece of the puzzle of preparing a search result. Each provider is a snowflake of its own and may have a lot of technical and business nuances: limits on the number of calls we can perform or the number of results we can get, various ticket types, elaborate systems of loyalty cards, passenger data restrictions, cancellation and refund rules, various disclaimers, and so on. On top of that, we add our own rules to prioritize and adjust results on our website, due to a complicated network of business deals, carriers sharing the same routes, and the desire to deliver the best results possible. Some of those nuances can be unified and extracted into common parts, but many are too tied to provider-specific formats, objects, and quirks.
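To make these nuances a bit more tangible, here is a purely illustrative sketch of how such provider-specific constraints could be captured in code. All names and values are invented for the example and do not come from our codebase.

```java
import java.time.Duration;
import java.util.Set;

// Purely illustrative: a descriptor of provider-specific quirks that the
// surrounding integration code has to respect. All names are hypothetical.
public record ProviderCapabilities(
        String providerId,
        int maxConcurrentSearchCalls,   // some providers throttle us hard
        int maxResultsPerSearch,        // others cap the number of offers returned
        boolean supportsDirectBooking,  // or do we have to redirect to their site?
        Set<String> acceptedLoyaltyCards,
        Duration cancellationWindow) {  // how long a booking stays refundable

    static final ProviderCapabilities EXAMPLE = new ProviderCapabilities(
            "example-rail", 5, 50, true,
            Set.of("BAHNCARD_25", "BAHNCARD_50"),
            Duration.ofDays(3));
}
```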

In the beginning, all of that code used to sit in a single monolithic application. Teams working on the same codebase often ran into conflicts. Changes in one place caused unexpected side effects in other places. Releases had to be coordinated between teams, which was difficult and time-consuming. A lot of changes released at the same time also increased the risk of failure. When a failure happened, it was difficult to pinpoint the exact commit that caused it, and even if we managed to do that, it was not always easy to revert, as further changes might already have been stacked on top of it. To counter that, we started to use feature toggles: wrapping new code paths and functionality in blocks that could be quickly enabled or disabled based on a database entry. This was, however, often cumbersome and error-prone, as we frequently had to branch the logic in several places. Aside from the development complexity, the application became slower and slower to start and required more and more resources. As a result, scaling the number of replicas in production was difficult and inefficient in terms of the CPU and memory required.
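As a rough illustration of that toggle pattern (not our actual implementation), a database-backed check might look roughly like this; the class, table and feature names are hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

// Hypothetical sketch of a database-backed feature toggle. In the monolith,
// new code paths were wrapped in checks like this so they could be switched
// on or off without a redeploy.
public class FeatureToggles {

    private final DataSource dataSource;

    public FeatureToggles(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public boolean isEnabled(String featureName) {
        String sql = "SELECT enabled FROM feature_toggles WHERE name = ?";
        try (Connection conn = dataSource.getConnection();
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, featureName);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next() && rs.getBoolean("enabled");
            }
        } catch (SQLException e) {
            return false; // fail closed: unknown or unreachable toggles stay off
        }
    }
}

// Usage at a branching point, one of the many places such checks had to be repeated:
//
// if (featureToggles.isEnabled("new-pricing-pipeline")) {
//     return newPricingPipeline.price(offer);
// } else {
//     return legacyPricing.price(offer);
// }
```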

Brave New Deployment

To mitigate all these horrors of the monolith, we started leaning towards a microservice architecture.

In order to leverage it, we needed three main pillars: rapid provisioning, rapid deployment, and good monitoring. Tons of magnificent work was done by the DevOps team to move our deployment to Kubernetes and create the DevOps contract. We can now bootstrap a new service with all the necessary infrastructure, from the repository and development pipelines, through monitoring, dashboards and alerting, up to production deployment, within an hour. Deploying further changes to an existing service can be done within minutes.

In order to divide the system into separate services appropriately, we needed to identify business contexts, or domains. Historically, the two oldest domains were Search and Booking: Search is responsible for finding and presenting the best offers for travel between two places, and Booking implements the booking process if it happens on our platform. Search was further divided into two parts. One part is Search Core, which starts the entire process and decides which providers might have offers for a given route. The second part implements the provider integrations. We decided to extract each provider’s specific code into a separate application, and put them all behind a façade that accepts Search Core requests for a particular provider’s inventory and redirects them to the appropriate application, taking care of common metrics, circuit breaking, retries, and so on.
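A minimal sketch of the façade idea, assuming a plain HTTP call per provider application; the real service also handles metrics, circuit breaking and retries, and none of the names below come from our actual codebase.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;

// Hypothetical sketch of the provider façade: Search Core asks for a given
// provider's inventory, and the façade forwards the request to the matching
// provider application. Metrics, circuit breaking and retries are omitted here.
public class ProviderFacade {

    private final HttpClient httpClient = HttpClient.newHttpClient();

    // provider id -> base URL of the provider-specific application
    private final Map<String, String> providerEndpoints;

    public ProviderFacade(Map<String, String> providerEndpoints) {
        this.providerEndpoints = providerEndpoints;
    }

    public String search(String providerId, String searchQueryJson) throws Exception {
        String baseUrl = providerEndpoints.get(providerId);
        if (baseUrl == null) {
            throw new IllegalArgumentException("Unknown provider: " + providerId);
        }
        HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(searchQueryJson))
                .build();
        HttpResponse<String> response =
                httpClient.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```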

Sharding Responsibility

Migrating functionality from the monolith to separate services, while making sure that everything goes smoothly, takes several steps and some planning. The goal was to get rid of painful releases, conflicts and dependencies.

As the first step in drawing service boundaries, provider code was refactored and simply put into separate Java packages. Then it was moved into separate Maven modules. This way, each provider could be versioned separately and was given just a little bit of independence. Still, the code was executed in the same runtime as the other providers, which could lead to problems. The next step was creating separate skeleton applications that used the Maven modules for a subset of providers as dependencies, and handing them over to the country group teams to take care of. Along with that, the organizational structure of a single coverage-platform team and several country group teams became clearly visible in our runtime architecture; as Conway’s law predicts, the architecture mirrored the communication patterns in the organization. The country group teams then proceeded to move particular providers into their own, smaller skeleton applications. Usually, several major providers were moved that way, and the rest of the smaller ones were left to sit together in a common runtime. We could now scale the number of replicas very precisely and isolate failures in one provider so that they would not affect others. The final step was to move the actual code from the common repository to a separate one, gaining build-time independence.
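The seam that makes this extraction possible is a shared contract: every provider module implements the same interface, so it can run inside the common runtime or inside its own skeleton application. Here is a hedged sketch with invented names; the real contract is, of course, richer.

```java
import java.time.LocalDate;
import java.util.List;

// Hypothetical sketch of the seam between a skeleton application and a
// provider module: each module implements the same search contract and can
// therefore run inside the shared runtime or inside its own application.
public interface ProviderSearchService {

    record SearchQuery(String fromStationId, String toStationId, LocalDate travelDate) {}

    record Offer(String offerId, String description, long priceInCents) {}

    record SearchResult(String providerId, List<Offer> offers) {}

    /** The provider this module integrates, e.g. "deutsche-bahn". */
    String providerId();

    /** Runs a search against the provider and maps the response to the common model. */
    SearchResult search(SearchQuery query);
}
```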

The production rollout itself usually involved routing a small percentage of traffic to the new, separate application and the majority of it to the old setup. This percentage was gradually increased over one or two days while we closely monitored both low-level metrics (e.g., memory, CPU, network traffic and connection pools) and high-level metrics (e.g., search error rate and search-to-booking ratio). If everything was still stable after a few weeks, we could remove the code from the common repositories and celebrate another successful migration. Each commit to the master branch of the provider-specific code is now a production deployment. There are basically no conflicts, no waiting for a release, no need for coordination. We do more than 600 such deployments per week. If there is any problem, we get an alert from Sentry, and after a quick assessment we can decide to hit a button on the Jenkins pipeline; Kubernetes then takes care of replacing the faulty pods with the previous version, while we calmly investigate the nature of the problem and fix it at another time. There is no fire extinguishing, no witch-hunting and no managers running frantically around because the technical issues of a single provider brought down the entire platform. There are currently over 300 different services running in production, and the number keeps growing.
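The gradual traffic shift can be pictured with a toy, percentage-based router. In reality this kind of splitting lives in the routing and infrastructure layer rather than in application code, so treat the snippet below purely as an illustration.

```java
import java.util.concurrent.ThreadLocalRandom;

// Toy illustration of a percentage-based traffic split between the old setup
// and a newly extracted provider application. In practice this is handled by
// the routing/infrastructure layer, not by application code like this.
public class CanaryRouter {

    private volatile int percentToNewService; // 0..100, raised gradually

    public CanaryRouter(int initialPercent) {
        this.percentToNewService = initialPercent;
    }

    public void setPercentToNewService(int percent) {
        this.percentToNewService = percent;
    }

    public String chooseBackend() {
        int roll = ThreadLocalRandom.current().nextInt(100);
        return roll < percentToNewService ? "new-provider-service" : "legacy-monolith";
    }
}
```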

The Mirror Context

Having a separate service for each provider has many benefits, but the solution to one set of problems comes with a brand-new set of challenges.

Besides the integrations with particular providers, other business domains of the system were identified and extracted. Some of them were: the routing engine that calculates the choice of stations and routes for a given query; A/B testing and various SEO gimmicks; generating tickets, vouchers, and invoices; the customer support back office. The list goes on. But you might have already noticed that those are quite different things, while the domain of integrating with one particular provider, despite all its specific details, is rather similar to the domain of integrating with another provider. Suddenly, while solving many problems, we created a big new one: keeping consistency and avoiding code duplication across the vast number of new repositories that started living their own lives and diverging in various directions. With a single codebase, despite all its shortcomings, it is easier to introduce common changes that benefit all the providers, and it soon turns out that a feature introduced for one provider could also be reused elsewhere. With great freedom apparently comes a great chaos of widely duplicated pieces of code and configuration mirroring what already exists.

So, how do we handle it? First of all, we have a common model that holds the contract for how standard search results, bookings, travel modes and so on look. The model includes the structure of search results with travel legs, segments, stations, trains, stops, offers and more. Second, we have a bootstrap application that is a runtime skeleton around the code for a particular provider’s search integration and exposes all the API endpoints that the contract with Search Core requires. Third, when creating a new integration, we use a code-generation tool based on Rhythm templates that creates all the glue code for us. Fourth, we have an SDK library which contains lots of reusable functionality: various postprocessors, result filters, utilities and common building blocks. If some common functionality grows out of control, it is possible to rip it out of the library and move it to a separate service with a clear API. And finally, well, sometimes we cut corners here and there and copy pieces of code around if it makes sense. Too many dependencies can be a greater evil than code duplication. It’s not dirty, it’s calculated.
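To give an idea of what those reusable building blocks could look like, here is a hedged sketch of an SDK-style postprocessor; the interface and the example filters are invented for illustration, not taken from our SDK.

```java
import java.util.List;

// Hypothetical sketch of SDK-style building blocks: a postprocessor is a small
// transformation of the common search result that any provider integration can
// reuse. The names and the example filters are invented for illustration.
public interface ResultPostProcessor {

    record Offer(String offerId, long priceInCents, boolean refundable) {}

    List<Offer> process(List<Offer> offers);

    /** Example reusable block: drop offers above a configured price cap. */
    static ResultPostProcessor maxPrice(long maxPriceInCents) {
        return offers -> offers.stream()
                .filter(o -> o.priceInCents() <= maxPriceInCents)
                .toList();
    }

    /** Example reusable block: keep only refundable offers. */
    static ResultPostProcessor refundableOnly() {
        return offers -> offers.stream()
                .filter(Offer::refundable)
                .toList();
    }
}
```

An integration can then compose a handful of such blocks instead of re-implementing the same filtering and adjustment logic in every repository.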

The Future

Even after we finish all our migrations, there is still a lot of room for improvement. While working on making the integration process even easier, we are aware that rapid expansion and growth of the coverage require shifting to generic integration solutions and developing APIs and tools that would enable others to integrate with us, not the other way around. Aside from increasing coverage, we keep adding new features to our platform, so that we accompany travelers from the very beginning, before they even know that they want to go on a journey, through planning, searching and getting tickets, the actual travel, and even after it ends. The more complete and the smarter the platform becomes, the more sophisticated the technology behind it has to be.

Microservice architecture certainly solves a lot of problems, but it also creates new ones, as we have seen. It requires a lot of initial investment in infrastructure that might be difficult to pull off in smaller projects. But as the number of developers, the size of the system and the amount of traffic grow, the approach really shines.

Paweł Zajączkowski

Java Developer at PGS Software, Blogger at https://howtotrainyourjava.com/, Speaker, Aikidoist, Gamer, Lego Fan, Dreamer.