The name of the great explorer Marco Polo is very well known around DAZN because the system we built to QA the world bears his name. DAZN has launched globally today and without Marco Polo we could never have done it.
Fortunately for us all, things have moved on since 1271 when Marco Polo first departed Venice to embark on his gap year travelling round the world. While the travel bug has been very much dampened in 2020, we’re all hoping that things will return to normal because, overall, people do enjoy travelling.
With its worldwide roll-out today, DAZN is unique in its ambition to build a truly global OTT sports offering. If Marco Polo had also been a sports fan, he would have been glad to be a DAZN subscriber, as he would not have been limited to viewing only the sports we have in his home country (that’s Serie A and Serie B, for our Italian readers). DAZN allows you to travel and watch the sports we have rights for in any of our countries. EU portability is the one slight complication: it mandates that, while travelling in the EU, EU residents get to watch the same catalogue that’s available in their home country. However, once out of Europe, Marco Polo would have enjoyed any of the sports we have rights for wherever he was.
SERVICE TRAFFIC IS NOT THE ONLY CAPACITY PROBLEM
DAZN was already operating in 4 AWS regions and a handful of countries, so we had some experience of the complexities of balancing load and resilience across AWS regions while simultaneously supporting a plethora of different rights, languages and payment systems across many different countries. The global release will not change the number of AWS regions we operate in but will hugely increase the number of countries.
Other than the usual bread-and-butter problems, such as resilience and dealing with increased load, this might not sound like anything much out of the ordinary to have to think about. However, somewhat from left field, the architecture team was presented with an unusual problem of scale. We’ve all seen those situations where a system designed to solve a particular problem starts stuttering because of traffic volume. There comes a time when a totally new approach is needed in order to overcome the limitations.
Service traffic is not the only capacity-related problem. Manual operations such as setup or configuration soon start to bottleneck as the amount of work increases. We needed to test the app globally and, short of significantly upping the travel budget of our QA engineers, we needed a better solution.
VPN — THE COMMON SOLUTION
The technique we were originally using is the one adopted by a lot of companies: use a VPN to make your device appear as if it were in another country. All network traffic is routed through an IP address allocated by the VPN from the range of the desired country, so it appears as if the device is in that country. While this sounds simple enough, it has several drawbacks. It’s not easily possible on TV devices; it’s a pain to set up and administer; all users require access to the VPN; and typically there’s no programmatic control, so it doesn’t lend itself well to automation. Plus, we had to whitelist VPN-allocated IP addresses to prevent the DAZN security systems from blocking them. Finally, covering all of our teams and giving access to every country on the planet would have been inordinately expensive.
INTRODUCING MARCO POLO, BETTER THAN VPN
While brainstorming this problem with our platform team, we had the idea that, rather than trying to fool the app behind the scenes into thinking it was somewhere else, we could build awareness into the system so that a running instance of the app could itself simulate being somewhere else. Marco Polo was born. It lets the app pretend it’s in another country. In fact, Marco Polo was built as a completely generic system to inject any kind of override config into a specific running instance of the DAZN app and all its backend services. It needed 3 main ingredients:
- A way to identify an instance of the running app
- A way to get and set overrides for that instance
- A way to propagate the override to all the backend services involved in running the app
Here’s an outline of the system:
MARCO POLO — MAIN DESIGN ELEMENTS
The key components of this design are:
- The Marco Polo service, which manages a database of device overrides through get and set operations. By including the override type field in the key (in this case “country”), we can easily extend the system to cover other sorts of overrides. More on that later.
- A unique device id, which the app generates if one is not already present (usually on first boot) and stores in local storage.
- Backend service modifications to pass the Marco Polo header through to any dependent services that may be affected.
- Finally, the actual logic changes for those services required to take account of the override. In our case this was the geofencing service, which normally used the IP address to return a country.
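To make those pieces a little more concrete, here is a minimal Python sketch of how they might fit together. It is not our production code: the `OverrideStore` class, the `"deviceId"` storage key, the function names and the injected `lookup_country_by_ip` callback are all hypothetical, and an in-memory dictionary stands in for the real database.

```python
import uuid
from typing import Callable, Dict, MutableMapping, Optional, Tuple


class OverrideStore:
    """In-memory stand-in for the Marco Polo database (illustrative only)."""

    def __init__(self) -> None:
        self._items: Dict[Tuple[str, str], str] = {}

    def set(self, device_id: str, override_type: str, value: str) -> None:
        # The override type is part of the key, so a new kind of override
        # needs no change to the store itself.
        self._items[(device_id, override_type)] = value

    def get(self, device_id: str, override_type: str) -> Optional[str]:
        return self._items.get((device_id, override_type))


def get_or_create_device_id(local_storage: MutableMapping[str, str]) -> str:
    """Return the stored device id, generating one on first use."""
    # "deviceId" is a hypothetical storage key.
    if "deviceId" not in local_storage:
        local_storage["deviceId"] = str(uuid.uuid4())
    return local_storage["deviceId"]


def resolve_country(
    store: OverrideStore,
    device_id: Optional[str],
    ip_address: str,
    lookup_country_by_ip: Callable[[str], str],
) -> str:
    """Geofencing decision: a Marco Polo override wins, else fall back to IP."""
    if device_id is not None:
        override = store.get(device_id, "country")
        if override is not None:
            return override
    return lookup_country_by_ip(ip_address)
```

In this sketch, setting a `"country"` override of `"JP"` for a device makes `resolve_country` return `"JP"` for that device only. Every other device, and any request without a device id, still gets whatever the IP lookup returns.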
Sound simple? It proved harder than expected. In a big system such as ours, with a lot of different teams supporting it, there’s a lot of explaining to do and it’s quite hard coordinating everything across the organisation. Not unexpectedly, several wrinkles surfaced during our journey, resulting in design changes which all had to be communicated and rolled out to a large group. At the risk of losing my audience I could go deeply into all of these, but it is perhaps better to mention a few in passing and, if there’s interest, follow up with another post.
- Marco Polo can’t support 3rd party systems. For example, Apple Pay will only ever work for the country you’re in. In most cases the VPN solution does not get around this anyway. There may be workarounds: PayPal, for example, allows programmatic access to specific countries.
- Does the header mess up caching? Yes — we solved this by adding a toggle switch so that the header would only be sent for those cases where Marco Polo was enabled.
- Security. In supporting automation for our QA engineers we had to ensure the set endpoint was properly secured, as the gleaming pot of gold hiding behind it gives an attacker access to anything anywhere on our platform.
- In the case of geofencing, some CDNs provide their own measures, which we were required to support by our rights holders. Compute is often available at the edge, allowing direct support of Marco Polo, although we accommodated these requirements using certificates, an alternative validation mechanism supported by some CDNs.
- Backend service reconfiguration. If you are using CORS, then OPTIONS (preflight) calls need to accept the Marco Polo headers even if the service is not directly using them.
- Some practical issues. It should be easy to capture the device id and enter it into the admin console. QR codes came to our rescue, saving tedious UUID transcription.
- Which environments should you run the system in? We’re currently running it in both stage and production, where it’s convenient for demos and VIP access.
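Two of the wrinkles above, the caching toggle and the CORS preflight, are easier to picture with a little code. The Python sketch below is only an illustration under assumptions: the header name `X-Marco-Polo-Device-Id` and all three function names are invented for this post, and the real services may well do this differently. The one name taken from a standard is `Access-Control-Allow-Headers`, the CORS response header that lists the request headers a preflight will accept.

```python
from typing import Dict, Mapping, Sequence

# Hypothetical header name, chosen for illustration only.
MARCO_POLO_HEADER = "X-Marco-Polo-Device-Id"


def client_headers(device_id: str, marco_polo_enabled: bool) -> Dict[str, str]:
    """Extra headers the app adds to its requests."""
    # The header is only sent when the toggle is on, so ordinary traffic
    # carries no per-device header at all.
    return {MARCO_POLO_HEADER: device_id} if marco_polo_enabled else {}


def downstream_headers(incoming: Mapping[str, str]) -> Dict[str, str]:
    """Headers a backend service forwards to the services it depends on."""
    # HTTP header names are case-insensitive, so compare in lower case.
    for name, value in incoming.items():
        if name.lower() == MARCO_POLO_HEADER.lower():
            return {MARCO_POLO_HEADER: value}
    return {}


def preflight_response_headers(allowed: Sequence[str]) -> Dict[str, str]:
    """CORS preflight (OPTIONS) response headers that accept the override header."""
    names = list(allowed)
    if MARCO_POLO_HEADER.lower() not in [n.lower() for n in names]:
        names.append(MARCO_POLO_HEADER)
    return {"Access-Control-Allow-Headers": ", ".join(names)}
```

The point of `downstream_headers` is that a service forwards the header even when it has no use for it itself, because a service further down the chain (geofencing, in our case) might.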
Finally, I mentioned above that we took the decision to make this solution generic by including the override name in the key. There are overrides other than country that have proved useful. At startup we call an endpoint to retrieve a service dictionary with a list of endpoints used by the app. Marco Polo can easily be used to override this endpoint such that a particular running instance of the app could have a different service dictionary, with some of the services mapped to a test version. With some care, this makes it possible to test a new service during development in the production environment on specific devices, thus alleviating the problem of obtaining realistic test data.
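As a rough illustration of the service dictionary idea, the Python sketch below assumes the override is stored as a partial mapping of service name to URL and merged over the default dictionary. That storage shape, the function name and the example.com URLs are assumptions made for this example, not a description of the real system.

```python
from typing import Dict, Mapping, Optional


def effective_service_dictionary(
    default: Mapping[str, str],
    override: Optional[Mapping[str, str]],
) -> Dict[str, str]:
    """Return the service dictionary a given device should use."""
    # Start from a copy so the shared default is never mutated.
    merged = dict(default)
    if override:
        # Only the services named in the override are remapped;
        # everything else keeps its default endpoint.
        merged.update(override)
    return merged


# Hypothetical dictionaries: one service is pointed at a test version.
DEFAULT_SERVICES = {
    "playback": "https://playback.example.com",
    "catalogue": "https://catalogue.example.com",
}
TEST_OVERRIDE = {"playback": "https://playback-test.example.com"}
```

A device with `TEST_OVERRIDE` set would talk to the test playback service while still using the default catalogue, and a device with no override gets `DEFAULT_SERVICES` unchanged.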