Jaeger and multitenancy

Published in JaegerTracing, Oct 27, 2017

When deploying your application to a cloud environment, you’d want to make sure your data is seen only by you, and not by your neighbors. At the same time, having only one installation of Jaeger taking care of all tenants in a system is a noble goal and might potentially save a lot of resources.

But what aspects are important when designing a Jaeger deployment in a multitenant scenario? How about the security implications? What can be done today in Jaeger and what are the options for the future?

What is multitenancy anyway?

Multitenancy is the ability of a single instance of an application to be used by multiple “tenants”. A tenant could be a user or an organization, like the GitHub model, where repositories are managed and stored either under the user or organization’s account.

That said, each project and company may have their own definition of multitenancy and their specific requirements. It’s common to have tenant information as a way to calculate the costs of the services used by a certain department (usually called “chargeback”). In such a scenario, each department is a tenant and security is not a big concern.

Then we have the typical “Software as a Service” scenario, where each account (user or organization) is a tenant. Ideally, the tenancy here would be tightly connected to the security: we want to authenticate the user and perform the authorization using the organization as part of the context. Or we might have a simpler, flat model, where there’s only the notion of “organization”, being authenticated via “API Keys”.

At the other end of the complexity scale, we have big corporations with strict regulation regarding data access, usually implemented with Access Control Lists (ACLs). This means that a user might belong to tenant “A” as “admin” and to tenant “B” as “developer”. So, this user might be able to query data related to tenant “A”, but access to production data related to tenant “B” is forbidden.
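To make the ACL model above concrete, here is a minimal sketch of a per-tenant role lookup. The user name, the role names and the split into "staging" and "production" environments are all assumptions made for illustration; none of this is something Jaeger provides.

```python
# Hypothetical ACL: (user, tenant) -> role, plus the environments each role may query.
ACL = {
    ("alice", "A"): "admin",
    ("alice", "B"): "developer",
}
ROLE_ENVIRONMENTS = {
    "admin": {"staging", "production"},
    "developer": {"staging"},
}


def can_query(user: str, tenant: str, environment: str) -> bool:
    """Return True if the user's role in the tenant covers the environment."""
    role = ACL.get((user, tenant))
    if role is None:
        return False  # the user is not a member of this tenant at all
    return environment in ROLE_ENVIRONMENTS.get(role, set())
```

With this data, the same user may read production data for tenant "A" but only staging data for tenant "B".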

Tenancy options in Jaeger

Now that we have revisited some of the possible varieties of multitenancy, let’s see how it can be conceptually applied to Jaeger based on different deployment requirements and tenancy models. Note that Jaeger was built with a single-tenant scenario in mind and has no explicit support for multitenancy.

Tenant information at the span level

This is the simplest of the scenarios and is suitable where the tenant name is only used for informative purposes, like when an Operations department has to support several instances of the business application, each belonging to a different department. Here, the tenant information is added to each span as a tag. This approach works without changes to the Agent/Collector and could be made backwards compatible if the source of the information changes in the future.

This is ideal for scenarios where each instance of the target application stack is executed on its own hosts (bare metal, VM, agent as sidecar in pod), or where tenancy is a business concern on the target application stack. For instance, when your application is multitenant but tracing data is not shared with the tenants.

The easiest way to accomplish that today is by setting the JAEGER_TAGS environment variable to contain a value like tenant=tenant-1. This way, no change is required to the tracer or to the target application. As the tenant data is unique per tracer instance, all spans reported by this tracer will have the same tenant data.

A possible adjustment could be made to the agent, so that it reads the tenant information from a similar environment variable, augmenting the span data before dispatching it to the collector. The tracer, in this case, could be kept free of tenant information, but this scenario implies that all spans reaching this agent will belong to the same tenant.

Using the OpenTracing java-cdi example as base, we could accomplish that by setting the following environment variable before the mvn wildfly:run command:

export JAEGER_TAGS="tenant=tenant-1"

It should be enough to give us a tag called tenant with the value tenant-1 under the Process tags:

Tenant process tag added via environment variable
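For readers curious about what happens to that variable, the following is a simplified sketch of how a comma-separated list of key=value pairs could be turned into process tags. It is not the Jaeger client's actual parser, and the function names are hypothetical. Real clients may accept additional syntax that this sketch ignores.

```python
import os


def parse_tracer_tags(raw: str) -> dict:
    """Parse a comma-separated list of key=value pairs into a tag dict."""
    tags = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue  # tolerate empty segments such as a trailing comma
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        tags[key.strip()] = value.strip()
    return tags


def process_tags_from_env(environ=os.environ) -> dict:
    """Read tracer-level tags from JAEGER_TAGS, defaulting to no tags."""
    return parse_tracer_tags(environ.get("JAEGER_TAGS", ""))
```

Because the tags are read once per tracer instance, every span reported by that tracer carries the same tenant value, which is exactly the limitation described above.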

Multitenancy at the storage layer

Considering that the storage is the heaviest part of our setup in terms of memory, startup time and CPU consumption, a practical solution to multitenancy would be to have all tenants share a single storage cluster, while having the Agents/Collectors segmented on a per-tenant basis. On Cassandra, this would mean using one keyspace per tenant.

An advantage of this scenario is that it’s hard for one tenant to influence the performance of the data collection from another tenant. In other words: as each tenant has its own agents and collectors, a rogue or misbehaving tenant will not cause UDP packets to be dropped or the collector to slow down.

This can be done today, like this:

## One Cassandra for all tenants
$ docker run --rm --name cassandra -p 9042:9042 cassandra

## The tenant 'tenant-1'
$ docker run \
  --rm \
  --name jaeger-cassandra-schema \
  --link cassandra:cassandra \
  -e MODE=test \
  -e KEYSPACE=jaeger_tenant_1 \
  jaegertracing/jaeger-cassandra-schema

$ docker run \
  --rm \
  --link cassandra:cassandra \
  -e CASSANDRA_SERVERS=cassandra \
  -e CASSANDRA_KEYSPACE=jaeger_tenant_1 \
  jaegertracing/jaeger-collector

## The tenant 'tenant-2'
$ docker run \
  --rm \
  --name jaeger-cassandra-schema \
  --link cassandra:cassandra \
  -e MODE=test \
  -e KEYSPACE=jaeger_tenant_2 \
  jaegertracing/jaeger-cassandra-schema

$ docker run \
  --rm \
  --link cassandra:cassandra \
  -e CASSANDRA_SERVERS=cassandra \
  -e CASSANDRA_KEYSPACE=jaeger_tenant_2 \
  jaegertracing/jaeger-collector

At this point, it’s just a matter of pointing the tracer or the agent to the right collector, which could be done by linking containers like we did between Cassandra and the Collector, or by passing the parameter --collector.host-port to the agent.
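With more than a couple of tenants, the commands above become repetitive. The sketch below derives the keyspace name and the collector command line for a given tenant, using the same image name and environment variables as the commands above. The helper names are hypothetical, and the 48-character check reflects my understanding of Cassandra's keyspace name limit, so treat it as an assumption.

```python
import re


def keyspace_for(tenant: str) -> str:
    """Derive a keyspace name such as 'jaeger_tenant_1' from a tenant name."""
    # Unquoted keyspace names are limited to alphanumerics and underscores,
    # so every other character is replaced with an underscore.
    safe = re.sub(r"[^0-9a-zA-Z]", "_", tenant).lower()
    name = f"jaeger_{safe}"
    if len(name) > 48:  # assumed Cassandra limit for keyspace names
        raise ValueError(f"keyspace name too long: {name}")
    return name


def collector_command(tenant: str) -> list:
    """Build the docker argv for a collector dedicated to one tenant."""
    return [
        "docker", "run", "--rm",
        "--link", "cassandra:cassandra",
        "-e", "CASSANDRA_SERVERS=cassandra",
        "-e", f"CASSANDRA_KEYSPACE={keyspace_for(tenant)}",
        "jaegertracing/jaeger-collector",
    ]
```

The returned list can be passed to subprocess.run as-is. Note that two tenant names differing only in punctuation would map to the same keyspace, so a real setup would need to reject such collisions.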

Multitenancy at agent/collector level

A discussion about this scenario was started in the community some time ago. It is more complicated than the others and involves not only changes to the agent and collector, but also to the data schema being used. Even though this scenario cannot be achieved today, it should be discussed as well.

Leaving the security aspect out of the discussion for a while, this would work by having the tracer understand the notion of tenant as a first-class citizen, like the service name currently is. In other words: each instance of the tracer would know which tenant it relates to, possibly having a default tenant to keep the default configuration easy. A possible feature could be to propagate the tenant information from the root span to child spans via baggage items, effectively making each trace belong to a single tenant.
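The baggage idea can be sketched with a toy tracer. The Span and Tracer classes here are hypothetical stand-ins, not the OpenTracing or Jaeger client API. The point is only that a root span takes the tracer's tenant, while a child span inherits whatever tenant its parent carries.

```python
from dataclasses import dataclass, field
from typing import Optional

DEFAULT_TENANT = "default"  # assumed fallback when no tenant is configured


@dataclass
class Span:
    operation: str
    baggage: dict = field(default_factory=dict)


class Tracer:
    def __init__(self, tenant: str = DEFAULT_TENANT):
        self.tenant = tenant

    def start_span(self, operation: str, parent: Optional[Span] = None) -> Span:
        if parent is None:
            # Root span: the tenant comes from this tracer's configuration.
            baggage = {"tenant": self.tenant}
        else:
            # Child span: copy the parent's baggage, tenant included.
            baggage = dict(parent.baggage)
        return Span(operation, baggage)
```

A child span started by a tracer configured for a different tenant still ends up with the root's tenant, so the whole trace belongs to one tenant.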

A chargeback feature is harder to implement in this scenario, as it’s not easy to calculate the exact computing usage of the agent/collector on a per-tenant basis. Similarly, it might be hard to calculate the storage requirements, as everything is in the same keyspace. Arguably, though, scenarios where this feature is required could still use one instance per tenant, ignoring Jaeger’s native multitenancy capabilities.

Once the tracer has the tenant information, it would add that to the payload sent to the agent/collector on every communication. The backing storage also needs to account for this tenant information, probably making it a primary key field. For Cassandra, this means that all queries have to provide the tenant information, making it relatively safe to assume that one tenant won’t be able to see data from another tenant. For Elasticsearch and other possible future storage engines, there’s no such guarantee and extra care needs to be taken to ensure all queries have the appropriate constraints.
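The effect of making the tenant part of the key can be shown with a toy in-memory store. This is a hypothetical illustration, not Jaeger's storage interface. Since every read and write must name a tenant, a query for one tenant cannot return another tenant's spans, even when trace IDs collide.

```python
from collections import defaultdict


class TenantScopedStore:
    """Toy in-memory span store where the tenant is part of every key."""

    def __init__(self):
        # (tenant, trace_id) -> list of span dicts
        self._spans = defaultdict(list)

    def write(self, tenant: str, trace_id: str, span: dict) -> None:
        self._spans[(tenant, trace_id)].append(span)

    def get_trace(self, tenant: str, trace_id: str) -> list:
        # .get avoids creating empty entries when a trace does not exist
        return list(self._spans.get((tenant, trace_id), []))
```

A storage engine without such a key constraint would have to add the tenant filter to every query by hand, which is where the extra care mentioned above comes in.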

And then, there’s the security part…

No matter which approach is taken, the security aspect has to be discussed as well. How important is it to ensure that the tenant information we received within the span or by the agent/collector is accurate? Should we just trust the tracer, which is under the control of what we consider our users?

In the end, we can’t just add multitenancy support and assume that the tracer will send only sane data. We have to integrate that with a broader security concept, and security is hard. Or rather, it’s easy to get it wrong.

When talking about cloud-native development and microservices, concerns like authentication and authorization are better handled by other services, like security proxies. Basic HTTP authentication can be done by simply adding an NGINX/Apache httpd proxy in front of the Jaeger components. More complex solutions like Keycloak can be added in a similar way, adding transparent support for single sign-on, Kerberos and SAML authentication, brute force attack detection, an authorization server and so on. There are also frameworks proposing to do fancy stuff like Mutual Service Authentication, in a way that is completely transparent to “our” services and using state-of-the-art security practices.

A common approach to authentication and authorization among microservices is to pass tokens, like a JSON Web Token (JWT). In this scenario, an authentication server such as Keycloak would provide a JWT to a client (the target application), which in turn would give this token to the tracer. The tracer sends this token to the agent/collector, which then decodes it, verifying its signature to make sure it hasn’t been tampered with mid-flight. Once the verification is done, the collector can be sure that the data from the JWT can be trusted. A common field within a JWT payload is sub (subject), which could be a good fit for our tenant name. For a simpler use case, a JWT can act as “username/password” or “API Key”, as JWTs can have long expiration dates.

Where to go from here

It’s clear by now that multitenancy is a matter of perspective and that there are several use cases to support, especially for a tool that is shared among several microservices in a distributed architecture. The use cases listed here are only a small subset of the ways in which we see Jaeger being used.

In our context, several scenarios of multitenancy could be satisfied by applying it at the agent/collector level, but the benefits might not justify the complexity of doing that, especially if we consider that each component has only a few tens of megabytes of overhead “per tenant”. And if most of Jaeger’s users out there can be satisfied with multitenancy at the data storage level, even better, as it means less complex “core” code.

It’s now your chance to influence the development of Jaeger! What’s important to you? Do you need multitenancy at all? If so, what would be the ideal feature set for you? Are your tenancy requirements listed here, or would you have special cases? Let us know via the mailing list or Gitter!