One Manifest To Rule Them All
How the DAZN Manifest evolved into one of the most important pieces of the DAZN Development Ecosystem. This post describes its origins and how it became central to everything we develop at DAZN today.
What is the DAZN Manifest?
As mentioned in a previous blog post about logging, at DAZN we tend to raise a Request For Change (RFC) document when a new process or standard is being created. One of the very first of these was RFC-002 DAZN Manifest.
The original document, as seen in the screenshot above, was created as early as 2018, but it wasn’t until mid to late 2019 that the DAZN Manifest really started to take off.
The DAZN Manifest is a file included in every DAZN repository that describes what the repository contains and who owns it. It is primarily used to identify owners and to declare services, systems (groups of services) and other metadata around those entities. It plays a major role in linking services across our large estate of tools and enables automation and supportive tooling to be created. The DAZN Manifest is considered the source of truth for this information, and over many iterations it has grown into a very useful tool.
Early Days Of The DAZN Manifest
In the beginning, the DAZN Manifest had a very basic minimum requirement, as described by the RFC document at the time. There was little to no automation around the manifest, and including it in a repository was usually a best-effort task for our developers. As DAZN grew and more services were deployed, it became increasingly difficult to identify the owners of services and repositories. A person’s GitHub handle didn’t always help with finding out who that was within the company.
DAZN Linter
To tackle the problem of DAZN Manifest files not being included in repositories, the DAZN Linter was created. This linter (amongst other things) checks each commit on all repositories for the existence of the DAZN Manifest. If a manifest was not found, a warning was issued, but the check was non-blocking.
Although teams were now encouraged to include a manifest file, it was quite some time before anything was done with those files. We had to come up with some way of capturing the manifest data and storing it somewhere.
Manifest API
Shortly after the introduction of the DAZN Linter, the Manifest API (MAPI) was created. A MAPI ingestion Lambda function is subscribed to webhook notifications from GitHub. Whenever a change includes a DAZN Manifest file, that file is ingested into MAPI. MAPI uses MongoDB as the storage backend, which allows us to run powerful queries with specific filters.
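To make the ingestion trigger more concrete, here is a minimal sketch of how a handler might decide whether a GitHub push event touched a manifest. It relies on the `added`, `modified` and `removed` path lists that GitHub includes for each commit in a push payload. The file name `manifest.yml` and the function name are assumptions made for illustration; the real MAPI ingestion Lambda may work quite differently.

```python
from typing import Dict, List

# Hypothetical file name: the article does not say what the manifest file is called.
MANIFEST_FILENAME = "manifest.yml"


def manifest_changes(push_event: Dict) -> Dict[str, List[str]]:
    """Collect manifest paths touched by a GitHub push event, grouped by change type."""
    changes: Dict[str, List[str]] = {"added": [], "modified": [], "removed": []}
    for commit in push_event.get("commits", []):
        for kind in changes:
            for path in commit.get(kind, []):
                # Compare only the last path segment so nested manifests also match.
                is_manifest = path.rsplit("/", 1)[-1] == MANIFEST_FILENAME
                if is_manifest and path not in changes[kind]:
                    changes[kind].append(path)
    return changes


# A trimmed-down push payload with two commits.
event = {
    "commits": [
        {"added": ["manifest.yml"], "modified": ["src/app.py"], "removed": []},
        {"added": [], "modified": ["manifest.yml", "README.md"], "removed": []},
    ]
}
print(manifest_changes(event))
```

An ingestion function built along these lines would only need to fetch and store a manifest when one of the three lists is non-empty.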
As seen in the above architecture diagram, consumers of MAPI are subscribing to an after-hook SNS topic. When a manifest is inserted, updated, deleted or renamed, the consumers of MAPI can decide how they want to react, paving the way for how the DAZN Manifest is being used today.
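As an illustration of a consumer choosing how to react, the sketch below handles the event shape that AWS Lambda receives from an SNS subscription (`Records[].Sns.Message`). The message body format, with `action` and `entity` fields, is an assumption, as are the handler names; the article does not describe the real after-hook payload.

```python
import json
from typing import Callable, Dict, List


def sync_entity(entity: str) -> str:
    # Placeholder reaction: a real consumer might create or update a resource here.
    return f"sync {entity}"


def remove_entity(entity: str) -> str:
    # Placeholder reaction: a real consumer might tear a resource down here.
    return f"remove {entity}"


# This consumer reacts to three of the four actions and deliberately ignores "renamed".
HANDLERS: Dict[str, Callable[[str], str]] = {
    "inserted": sync_entity,
    "updated": sync_entity,
    "deleted": remove_entity,
}


def handle_sns_event(event: Dict) -> List[str]:
    """Dispatch each SNS record to the handler registered for its action, if any."""
    results: List[str] = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])  # assumed JSON body
        handler = HANDLERS.get(message.get("action"))
        if handler is not None:
            results.append(handler(message["entity"]))
    return results


def sns_record(action: str, entity: str) -> Dict:
    """Build a minimal SNS record for the example below."""
    return {"Sns": {"Message": json.dumps({"action": action, "entity": entity})}}


sample_event = {
    "Records": [
        sns_record("inserted", "service-a"),
        sns_record("renamed", "service-b"),
        sns_record("deleted", "service-c"),
    ]
}
print(handle_sns_event(sample_event))
```

Keeping the reaction table inside each consumer means MAPI itself does not need to know what any downstream tool does with a change.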
The Manifest API handles uniqueness of entities so that there can be no ambiguity in identifying them. The DAZN Linter helped identify uniqueness issues during each linter execution by calling a MAPI validation endpoint.
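The kind of check a validation endpoint could perform might look like the sketch below. It assumes an entity is identified by its type and name, and that a conflict means the same identifier is already declared by a different repository. Both assumptions, along with the function name and sample repository names, are illustrative rather than taken from MAPI.

```python
from typing import Dict, List, Tuple

EntityKey = Tuple[str, str]  # (entity type, entity name), an assumed identity scheme


def find_conflicts(
    existing: Dict[EntityKey, str], repo: str, entities: List[EntityKey]
) -> List[str]:
    """Report entities in `repo`'s manifest that another repository already declares."""
    conflicts: List[str] = []
    for kind, name in entities:
        owner = existing.get((kind, name))
        if owner is not None and owner != repo:
            conflicts.append(f"{kind} '{name}' is already declared by {owner}")
    return conflicts


# Entities already known to the registry, mapped to the declaring repository.
registry = {
    ("service", "checkout"): "org/checkout-repo",
    ("system", "payments"): "org/payments-repo",
}
incoming = [("service", "checkout"), ("service", "refunds")]

print(find_conflicts(registry, "org/refunds-repo", incoming))
```

Re-validating a manifest from the repository that already owns an entity yields no conflicts, so updates to an existing manifest are not rejected.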
Manifest Driven Approach
This period sparked a new era for the DAZN Manifest. There was now a way to ingest the data and use it in any way a consumer of the Manifest API might want. Having uniquely identifiable entities, such as services, meant that the same unique identifier could be used to link these services across our vast estate of tooling.
Djed
Djed is the internal name for our Service Level Indicator (SLI), Service Level Objective (SLO) and Error Budget (EB) observability tool. Djed is the Egyptian Symbol for stability.
Djed was one of the first manifest-driven tools. It is used to define service level objectives for services declared in the manifest, and alerting and dashboards are then created through automation. Each service declared in the manifest can define an slo section.
Each SLO definition defines error-budgets, which configure the alerting conditions under which an alert is triggered. Error Budgets operate on a window, which must also be defined. The threshold determines the percentage of the error budget that is allowed to remain before triggering an alert.
The above example defines our error rate SLO at 99% successful HTTP responses, using two different alert thresholds and windows. A 24h window is allowed to drop to 50% remaining error budget and a 6h window is allowed to drop to 0% remaining.
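The arithmetic behind these thresholds can be sketched as follows. With a 99% target, 1% of requests in a window may fail; the remaining budget is the share of those allowed failures not yet used. The function and class names are illustrative, and treating "remaining budget at or below the threshold" as the alert condition is an assumption about how Djed evaluates it.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudgetRule:
    window: str        # e.g. "24h" or "6h"; only used as a label in this sketch
    threshold: float   # percentage of the error budget allowed to remain before alerting


def remaining_budget_pct(slo_target_pct: float, total: int, failed: int) -> float:
    """Percentage of the error budget still unspent for one window of traffic."""
    allowed_failures = total * (100.0 - slo_target_pct) / 100.0
    if allowed_failures == 0:
        # No traffic (or a 100% target): the budget is intact only if nothing failed.
        return 100.0 if failed == 0 else 0.0
    # May go negative when the budget has been overspent.
    return 100.0 * (allowed_failures - failed) / allowed_failures


def should_alert(rule: ErrorBudgetRule, slo_target_pct: float, total: int, failed: int) -> bool:
    """Assumed condition: alert once the remaining budget reaches the threshold."""
    return remaining_budget_pct(slo_target_pct, total, failed) <= rule.threshold


rule_24h = ErrorBudgetRule(window="24h", threshold=50.0)
rule_6h = ErrorBudgetRule(window="6h", threshold=0.0)

# 10,000 requests at a 99% target allow 100 failures; 50 failures leave half the budget.
print(remaining_budget_pct(99.0, 10_000, 50))
print(should_alert(rule_24h, 99.0, 10_000, 50))
print(should_alert(rule_6h, 99.0, 10_000, 50))
```

Pairing a long window with a high threshold and a short window with a low one is a common way to catch both slow budget erosion and sudden outages.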
Each service with a defined slo has a dashboard with the above details automatically generated in Djed. If a DAZN Manifest includes a links section, those links are also added to the dashboard. They are usually run-books that help an engineer in an incident response situation, or any other documentation that would be useful.
Backstage
One of the main problems the Manifest API had in the early stages was that there was no frontend to visualise all of the entities we had ingested into it. This is where Backstage, an open source software catalog, comes in.
DAZN’s DX team documented the journey taken with bringing Backstage to DAZN.
As can be seen in the above screenshot, Backstage offered everything we needed to visualise the data in the Manifest API, as well as to create new plugins that enhanced the DAZN Manifest capabilities. This was all done with minimal changes to our DAZN Manifest schema, making it an excellent addition to our toolset.
Manifest Driven PagerDuty
At DAZN our primary paging tool is PagerDuty. In the past, when teams wanted to register a new service in PagerDuty, they would do so using a Terraform module. This meant that teams had to choose their own service name, with no guarantee that it matched what was in the manifest.
Since we want the manifest to be the source of truth for service definitions and other entity types, we built a tool called Manifest Driven PagerDuty (MDPD). It allows the PagerDuty service to be defined in the DAZN Manifest.
As seen in the above DAZN Manifest snippet, a support section can be defined. For this particular service the support hours are defined, and the service is supported during weekdays.
MDPD creates PagerDuty services with a defined name format that is driven by the definition of the service inside the manifest. This allows for other automations to leverage PagerDuty to page teams if they need to be notified about their service, since the name is unique and can be looked up via MAPI.
Service Dependencies
The DAZN Manifest also supports defining service dependencies. Defining a dependency makes it easier to understand the blast radius of incidents and allows for better risk assessment of large deployments.
The Manifest API only allows teams to create a dependency on a service that exists in the API.
As seen above, if an incident were to occur on a service that your service depends on, it may be worth first checking whether that service is indeed having a negative impact on yours. Time to resolution could be decreased with this information, as the investigation could start at the source of the problem immediately.
With this view it becomes much clearer which services may be impacted. It also gives new team members an overarching view of the system that they may soon be working on, and this context can help them understand such a system.
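To show how declared dependencies translate into a blast radius, here is a small sketch that walks the dependency graph in reverse, from a failing service to everything that depends on it directly or transitively. It also includes a check for dependencies on unknown services, in the spirit of the rule that a dependency must point at a service that exists. The service names and data layout are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def unknown_dependencies(deps: Dict[str, List[str]]) -> Set[str]:
    """Dependencies that point at services not declared anywhere in the graph."""
    return {target for targets in deps.values() for target in targets if target not in deps}


def blast_radius(deps: Dict[str, List[str]], failing: str) -> Set[str]:
    """All services that depend, directly or transitively, on the failing service."""
    # Invert the graph: for each service, record who depends on it.
    dependents: Dict[str, Set[str]] = {}
    for service, targets in deps.items():
        for target in targets:
            dependents.setdefault(target, set()).add(service)

    impacted: Set[str] = set()
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent != failing and dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted


# Hypothetical services: each key lists the services it depends on.
deps = {
    "playback-api": ["auth-service", "catalog-service"],
    "catalog-service": ["auth-service"],
    "auth-service": [],
    "billing-service": [],
}
print(sorted(blast_radius(deps, "auth-service")))
print(unknown_dependencies(deps))
```

The breadth-first walk tracks visited services, so circular dependencies do not cause it to loop.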
Alexandria
Alexandria is an internal tool we use for recording deployments. Each pipeline is required to use a specific workflow step to gather information about a deployment and send that information to a central database. This deployment record is enriched with information from the DAZN Manifest, primarily the service name. Internally we have built a GitHub Action that takes care of this for our developers.
Since we use the DAZN Manifest to determine the service name, we know that if the service is using Manifest Driven PagerDuty, we can determine its service name in PagerDuty. Hence, we can also forward the deployment information to PagerDuty.
Since Djed is also driven by the DAZN Manifest, the same thing can be done for Djed Dashboards. An annotation gets created at the moment the deployment is sent. We are able to match this deployment with the service because the service name is taken from the DAZN Manifest.
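The idea of using the manifest's service name as the join key can be sketched as below. The record fields, the `name` key and the target labels are all assumptions made for illustration; only the `support` and `slo` section names come from the manifest sections described earlier, and no real Alexandria, PagerDuty or Djed API is shown.

```python
from typing import Dict, List


def enrich_deployment(deployment: Dict, manifest_service: Dict) -> Dict:
    """Return a copy of the deployment record tagged with the manifest's service name."""
    enriched = dict(deployment)
    enriched["service"] = manifest_service["name"]  # "name" is an assumed manifest key
    return enriched


def forwarding_targets(manifest_service: Dict) -> List[str]:
    """Decide which tools should hear about a deployment of this service."""
    targets = ["alexandria"]  # every deployment is recorded centrally
    if "support" in manifest_service:
        targets.append("pagerduty")  # service is assumed to be managed through MDPD
    if "slo" in manifest_service:
        targets.append("djed")  # a deployment annotation can be added to the dashboard
    return targets


# Hypothetical manifest entry and deployment record.
service = {"name": "playback-api", "slo": {}, "support": {}}
deployment = {"commit": "abc1234", "environment": "prod"}

print(enrich_deployment(deployment, service))
print(forwarding_targets(service))
```

Because every target looks the service up by the same name, no per-tool mapping table is needed.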
Conclusion
Although in the early days the DAZN Manifest had no real purpose, over time it grew to become a really important tool that makes life easier for everyone. It is really important to create a source of truth for identifying entities uniquely and linking them using the same name across any and all tooling where possible. If done correctly, integrations can be created which add a lot of value and save time.