How we evolved our logging stack to sustain a growing Service Oriented Architecture organization
This article describes how BlaBlaCar changed its log stack from a single-tenant to a multi-tenant platform, ensuring strong data consistency across 20+ tech teams.
The logging stack
At BlaBlaCar, we manage our observability in-house for 20+ tech teams in a service-oriented architecture, with about 650 services deployed on Google Cloud.
The stack is composed of the following services in production:
- Filebeat to collect logs, as a Kubernetes DaemonSet on 70 nodes
- Elasticsearch to ingest logs: 30 pods with 4 CPUs and 20 GB of RAM each, and 1.5 TB of storage distributed across 3 zones
- Kibana for the UI: 2 pods for availability
We handled 10,000 logs/sec in a single Elasticsearch index, with 10 days of retention and a daily rollover.
Applications push JSON-formatted logs.
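A minimal example of such a JSON log line (field names and values are illustrative, not an exact record from our stack):

```json
{
  "@timestamp": "2020-09-01T12:00:00.000Z",
  "level": "info",
  "message": "booking created",
  "userId": "123",
  "kubernetes": {
    "namespace": "search",
    "pod": { "name": "search-api-5d4f" }
  }
}
```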
In a Service-Oriented Architecture (SOA) with a “you build it, you run it” approach, developers manage their microservices within their own team’s scope and tend to be autonomous for their observability, as they are on-call for their scope only.
This results in:
- 4,200 indexed fields in one Elasticsearch index, while Kibana can only handle 1,000 of them.
- Many mapping conflicts. With dynamic mapping (Elasticsearch’s default behaviour), a field’s type is determined by the first data sent to it. A mapping conflict happens when a field’s type is already defined and you send data of another type to that field: for example, one application first pushes userId: “123” (string), and later another pushes userId: 123 (integer). As Filebeat sends logs in batches, if one log in a batch contains a mapping conflict, the entire batch sent by Filebeat is rejected.
- Many data consistency issues for keys like userId, user-id, and user_id.
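To make the mapping conflict concrete, here is a hedged illustration (hypothetical index and documents; which direction fails depends on the types Elasticsearch infers first):

```json
// First document indexed: dynamic mapping infers type "long" for userId
{ "userId": 123 }

// A later document: this value cannot be parsed into a long,
// so Elasticsearch rejects it with a mapper_parsing_exception
{ "userId": "a-non-numeric-id" }
```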
With only one index, the design is not multi-tenant. As a consequence:
- We couldn’t easily mitigate low-disk-space issues by deleting an index in case of a spike of unneeded logs;
- We couldn’t tweak the retention time according to the criticality or business need of the service;
- We couldn’t easily have accountability by tenant.
The previous architecture was limited, therefore we had to change it.
How we solved those problems
Making the logging stack a multi-tenant platform
For cost control and easier operability, the revamped design ruled out running multiple Elasticsearch instances.
To make the logging stack a multi-tenant platform in a simple way, we decided to split the index. We had to figure out how: that is, find the best trade-off between a very fine retention granularity and not having too many indices (to reduce overhead, as recommended in the Elasticsearch blog). For us, the best split was by Kubernetes namespace, since at BlaBlaCar most Kubernetes namespaces contain services owned by a single team.
From a tech perspective:
It was easy to target an index named after the Kubernetes namespace in the Filebeat configuration:
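A sketch of what this looks like in filebeat.yml (index naming details are illustrative, not our exact configuration):

```yaml
output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  # One index per Kubernetes namespace, rolled over daily
  index: "logs-%{[kubernetes.namespace]}-%{+yyyy.MM.dd}"
```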
However, we were managing Elasticsearch templates with the Index Lifecycle Management (ILM) configuration in Filebeat.
Filebeat pushes Elasticsearch templates at startup. As the [kubernetes.namespace] variable is only populated when logs are parsed by Filebeat (the information is not available at startup), it was not possible to keep managing our Elasticsearch template configuration with Filebeat.
We solved this by generating one Elasticsearch template per namespace with a Bash script, instead of relying on the ILM configuration in Filebeat, so the configuration is generated whenever a new namespace is created.
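For each new namespace, the script renders a template of roughly this shape and PUTs it to _template/logs-search (a sketch with illustrative names, here for a hypothetical "search" namespace):

```json
{
  "index_patterns": ["logs-search-*"],
  "settings": {
    "index.lifecycle.name": "logs-search-retention",
    "index.lifecycle.rollover_alias": "logs-search"
  }
}
```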
Now that we had improved the multi-tenancy of the platform, we focused on data consistency.
As logs are formatted as JSON, they can be considered as key/value pairs.
To improve data consistency, we decided to work, at the engineering level, on:
- A dictionary of keys that enables consistent cross-team application searches. For example, to search for a user ID, query userId:"123" instead of user_Id:"123" OR userId:"123" OR user-id:"123". This lets everyone know which key to use and avoids query mistakes that would otherwise return partial data.
- A single value type per key, to avoid mapping conflicts in Elasticsearch.
To implement this dictionary of keys across teams with the corresponding value type for each key, we needed a common schema.
In Elasticsearch, mapping is “the process of defining how a document, and the fields it contains, are stored and indexed.” As the mapping describes the keys and value types stored in the database, we decided to use the mapping as the source of truth: a common schema defined in a Git repository.
In order to ease adoption, we communicated through the SRE Guild, a recurring meeting where each team has a representative and can bring up cross-team SRE topics and projects.
First, we worked together to define the schema containing a set of predefined keys to enable search based on widely used Kubernetes attributes (e.g., pod name, container name, node) and most common attributes in the BlaBlaCar ecosystem (e.g. userId).
Then we defined the workflow: when developers add a log line to application code, they must check whether the key is defined in the common schema and, if not, add it. In technical terms, this means adding the field definition to the Elasticsearch mapping template via a pull request.
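Concretely, a developer introducing a new key would add an entry like this to the mapping template in the Git repository (field names here are illustrative):

```json
{
  "mappings": {
    "properties": {
      "userId": { "type": "keyword" },
      "rideId": { "type": "keyword" }
    }
  }
}
```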
This required a change in developer habits: they now have to take extra steps when adding logs. In order to improve adoption, we decided to provide a tool that generates the corresponding Elasticsearch mapping from a Protobuf schema.
This enables developers to use the Protobuf schema in their code to autocomplete possible fields. Applications still log in JSON; the Protobuf schema is just there for developers’ convenience.
The tool is a Protoc plugin that generates an Elasticsearch template mapping from a Protobuf schema and a Go template.
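As an illustration (not our actual schema), the input side looks like a Protobuf message, where nested messages become dot-notation prefixes in the generated mapping:

```protobuf
syntax = "proto3";

// Hypothetical common log schema: each field becomes an entry in
// the Elasticsearch mapping, e.g. kubernetes.namespace.
message Log {
  string userId = 1;
  Kubernetes kubernetes = 2;

  message Kubernetes {
    string namespace = 1;
    string pod_name = 2;
  }
}
```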
Going into more detail on the Go template, we can see that the tool generates template.settings.query.default_field. This setting specifies the fields to query when a request has no explicitly specified field. For example, the query "myuserId" will search for this string across all default fields.
The tool constructs dot-notation field names from the fields defined in the schema, using a recursive algorithm.
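The recursion can be sketched in Go as follows. This is a simplified illustration, not the plugin's actual code: Field is a hypothetical, flattened view of a Protobuf message field, and the real tool walks protoc descriptors instead.

```go
package main

import (
	"fmt"
	"sort"
)

// Field is a hypothetical, simplified view of a Protobuf field:
// either a leaf with an Elasticsearch type, or a nested message.
type Field struct {
	Name     string
	Type     string  // leaf Elasticsearch type, e.g. "keyword"
	Children []Field // non-empty for nested messages
}

// flatten recursively builds dot-notation field names,
// mirroring how the plugin walks nested messages.
func flatten(prefix string, fields []Field) map[string]string {
	out := map[string]string{}
	for _, f := range fields {
		name := f.Name
		if prefix != "" {
			name = prefix + "." + f.Name
		}
		if len(f.Children) == 0 {
			out[name] = f.Type
			continue
		}
		for k, v := range flatten(name, f.Children) {
			out[k] = v
		}
	}
	return out
}

func main() {
	schema := []Field{
		{Name: "userId", Type: "keyword"},
		{Name: "kubernetes", Children: []Field{
			{Name: "namespace", Type: "keyword"},
			{Name: "pod", Children: []Field{{Name: "name", Type: "keyword"}}},
		}},
	}
	flat := flatten("", schema)
	keys := make([]string, 0, len(flat))
	for k := range flat {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s: %s\n", k, flat[k]) // e.g. kubernetes.pod.name: keyword
	}
}
```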
The dynamic_template prevents all other fields from being indexed.
Finally, the tool can add the @timestamp field to the output.
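Putting those pieces together, the generated template looks roughly like this (a hedged sketch, not the tool's exact output; the "do not index other fields" behaviour is simplified here to "dynamic": false, whereas the tool uses a dynamic_template to the same effect):

```json
{
  "index_patterns": ["logs-*"],
  "settings": {
    "index.query.default_field": ["message", "userId", "kubernetes.namespace"]
  },
  "mappings": {
    "dynamic": false,
    "properties": {
      "@timestamp": { "type": "date" },
      "message":    { "type": "text" },
      "userId":     { "type": "keyword" },
      "kubernetes": {
        "properties": {
          "namespace": { "type": "keyword" }
        }
      }
    }
  }
}
```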
As Go variable names do not allow special characters such as “/”, it is currently not possible to define keys with special characters in the schema. The workaround is to define the field without the special character and transform it with a Bash script.
For example, a field called “my/fieldname” would be added to the Protobuf file as “my_fieldname” and transformed by the Bash script in the resulting template.
To sum up: the tool pushes the Elasticsearch template mapping; applications use the schema to pick their keys and then push logs into Elasticsearch.
This project began in August 2020 and, after only two bi-monthly SRE Guild meetings, the platform had shifted to this multi-tenant approach. It enabled us to tune retention and have better accountability. It also improved data coherence: the tool reduced the number of indexed fields from 4,500 to 150. This increased the stack’s performance and fixed the issue of having too many fields in Kibana, solving at the same time many data consistency issues and mapping conflicts.
Even if the cultural change of adopting the new workflow of adding keys to the Elasticsearch mapping took some time, it enabled further discussions on refactoring logging in applications and achieving better observability.
In January 2021, we decided to migrate our observability stack to a SaaS solution: Datadog. We are now decommissioning our historical logging stack, but we kept the tenant model by team (Kubernetes namespace) in Datadog, as well as the accountability pattern.
An article on the Datadog migration process will come later this year.