Published in Getsafe

Solving a distributed authorisation problem with a local library

Background

At Getsafe, we use a service-oriented architecture. One key service (which is actually a client of other services) is our Admin Panel. Admin Panel provides a high-level view of the whole system by aggregating data from different services, making it possible to perform administrative tasks without needing to interface directly with each service. It’s used by internal users (such as customer support agents), whom we call operators, to distinguish them from customers.

Admin Panel has an extensive role and permission system that it uses to decide what an operator can access or do. For example, an operator could have the user_manager role, which grants them the user.edit permission, allowing them to edit a user’s details. Since our other services are accessed via Admin Panel, they leave the authorisation work up to it.

But our story begins with a new requirement: we need to dynamically change the behaviour of an action in a service in a way that Admin Panel doesn’t know about. This dynamic behaviour is only determined when the action is attempted, so we can’t tell Admin Panel ahead of time whether extra permissions should be enforced.

As an example, to cancel a contract, an operator needs the contracts.cancel permission. However, cancelling a contract might fail sometimes because of certain restrictions. In special situations, we may want to bypass any restrictions and definitely cancel it. Ideally, only certain privileged operators should be able to do this.

In this example, only when someone tries to cancel the contract does the Insurance Service check if the contract has any restrictions. So, Admin Panel has no way of knowing about the restrictions and checking for such permissions before making the cancellation request.

So this is our challenge: we need to find a way for Insurance Service to do this authorisation itself, and fit that into our existing authorisation setup.

Existing authorisation setup

To identify itself to other services, Admin Panel uses a static token that each service generates once and shares with Admin Panel out of band. This token is attached to all of Admin Panel's requests to that service.

The main benefit of this approach is that it’s local to each service. Each service defines its own tokens, so there’s no central auth server to communicate with. This keeps our maintenance workload small and avoids having a central point of failure. In addition, a static token means Admin Panel saves time by not generating a new token on every request.
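As a minimal sketch of what a per-service static token check could look like (the `CLIENT_TOKENS` table, the placeholder token value, and both method names are hypothetical, not taken from our services):

```ruby
# Hypothetical sketch: each service keeps its own table of static client tokens.
# In practice these values would come from configuration or a secret store.
CLIENT_TOKENS = {
  'admin_panel' => 'static-token-for-admin-panel' # placeholder value
}.freeze

# Compare two strings without returning early on the first differing byte.
def secure_compare(a, b)
  return false unless a.bytesize == b.bytesize

  a.bytes.zip(b.bytes).reduce(0) { |diff, (x, y)| diff | (x ^ y) }.zero?
end

# Returns the name of the client that owns the presented token, or nil.
def authenticate_client(presented_token)
  return nil if presented_token.nil? || presented_token.empty?

  match = CLIENT_TOKENS.find { |_name, token| secure_compare(token, presented_token) }
  match&.first
end
```

Note that the result only tells the service which client is calling. Nothing in the token says who is using that client.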

However, this approach only authorises the calling service (Admin Panel). Now we also need to authorise the specific operator who is acting through Admin Panel.

Needed: operator authorisation

Obviously, our existing setup was inadequate for this new scenario. For Insurance Service to decide if an Admin Panel operator could bypass a restriction, Admin Panel needed to include details about the operator.

One fix would be for Admin Panel, as a trusted client, to communicate the operator details, such as permissions, in its auth token. But the static nature of the token meant we couldn’t dynamically add additional authorisation metadata to the token.

This meant we had to make a few decisions:

  1. How to communicate: Do we replace the static service token with a new, dynamic token that contains extra information about the operator? Or do we send an extra token or field containing the operator authorisation metadata?
  2. What to communicate: Do we communicate the full operator details, or a token that can be used to retrieve those details?

How to communicate

We could switch to a dynamic token, but we ruled that out for now. We’d need to update every service to handle dynamic tokens from Admin Panel while still supporting static tokens for other services. We had other priorities, so we didn’t want to spend time reworking our auth setups for a specific use case.

So ideally, we’d go with the second option: sending the operator auth data in a separate field, probably as a different header.

What to communicate

The next part was what to communicate. The simple answer would be “everything”. Since we were treating Admin Panel as a trusted client, it could simply send us a list of all the current operator’s permissions as a separate header, perhaps as a JWT.

That turned out to be unworkable, however. We have a very fine-grained permissions system, so there are many different permissions. The most common role had a permissions list of 5 kB, which was more than double the rest of the request payload, making our API calls slower. Additionally, some webservers have limits on the size of HTTP header values.
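To get a feel for the size difference, here is a rough sketch using made-up data. The permission names, the count of 150, and the `contracts_manager` role are invented for illustration and are not our real figures:

```ruby
require 'json'

# Size in bytes of a list once serialised as JSON for a header value.
def header_size(values)
  JSON.generate(values).bytesize
end

# Made-up data: many fine-grained permissions versus the few roles granting them.
permissions = (1..150).map { |i| "resource_#{i}.action_#{i}" }
roles = %w[user_manager contracts_manager]

puts "permissions: #{header_size(permissions)} bytes"
puts "roles:       #{header_size(roles)} bytes"
```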

This gave rise to the next idea: an auth service.

Auth service

Since we couldn’t send all the operator’s permissions, why not send some identifier that the receiving service could use to look up that operator and retrieve its permissions?

Essentially:

  • Admin Panel sends the operator ID (or a token) to Insurance Service.
  • Insurance Service calls auth service with that token to get the operator’s permissions and make authorisation decisions.

It makes sense. Plus, a new service! What engineer doesn’t like working on a new project?

Unfortunately, there were a few big problems for us:

  1. Complexity: A new service is a new project to set up, new pipelines to configure, new infrastructure to maintain and monitor, new code (and new bugs) to write and document, and so on. We would also need to migrate operator data from Admin Panel to the auth service without breaking anything.
  2. Latency: Making an external call to verify the operator permissions is an extra round trip or more on every request. If the auth service was down or slow, Admin Panel would be affected. And if we propagated operator authorisation to other services, it could become a big bottleneck.
  3. More complexity: To mitigate the latency issues, we could add things like caching and replication. But again, that’s adding more things to help manage all the things we added.

By the way, these aren’t necessarily dealbreakers by themselves. For some companies or teams at certain stages, they might be worth taking on. But not for us at the moment.

Shared library

So we backtracked a bit, and ended up with a shared library solution, as suggested by C Prebble. It’s similar to the previous idea of having Admin Panel send the operator permissions, but this time:

  • Admin Panel sends only the roles.
  • The role-permission mapping is stored in a shared library (a private gem). Insurance Service only needs to install that library, and it can retrieve the permissions corresponding to the roles it was sent. Neat.

This works where the others didn’t, because:

  • The roles list is much smaller than the permissions list, so we can easily pass it in a header.
  • Having the role-permission mapping local to each service (installed as a gem) meant they could instantly verify an operator’s permissions without any extra latency.
  • Having the mapping stored in a shared library meant it could be updated in one place, and all consuming services would have access to it (as long as they stayed updated).
  • There’s very little complexity. Nearly everything stays as is, and the implementation was a matter of days.
  • As an extra benefit, we could put the tools to generate and verify tokens right in the library, so consumers only need to call one or two methods.

One downside here is that we’d need to remember to update the library in every consuming service whenever permissions are added or changed. But since we have automated dependency upgrade notifications every week, that’s less of a concern.

A second problem is that storing the data statically in a shared library means that editing the role-permission mappings needs a developer. But our role-permission mappings don’t change often, so that’s another reasonable tradeoff for us at this point.

It’s a bit of an unconventional solution, but it’s quite practical for our use case.
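The core idea, resolving roles to permissions locally, can be sketched in a few lines. This is an illustration only: the mapping is inlined rather than loaded from the gem's files, and every role and permission name except `user_manager` and `user.edit` is hypothetical.

```ruby
require 'yaml'

# Hypothetical mapping. In a shared library this would live in a YAML file
# shipped with the gem, so every consuming service reads the same data.
ROLE_PERMISSIONS = YAML.safe_load(<<~MAPPING).freeze
  user_manager:
    - user.edit
  contracts_manager:
    - contracts.cancel
  contracts_supervisor:
    - contracts.cancel
    - contracts.bypass_restrictions
MAPPING

# Expand a list of role names into the permissions they grant.
def permissions_for(roles)
  roles.flat_map { |role| ROLE_PERMISSIONS.fetch(role.to_s, []) }.uniq
end

# True if any of the given roles grants the permission.
def permitted?(roles, permission)
  permissions_for(roles).include?(permission.to_s)
end
```

Because the lookup is an in-memory hash access, checking a permission costs no network round trip. Unknown roles simply grant nothing.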

Implementation

So we created a library, nxt_auth_registry. The library itself is pretty plain:

  • roles, permissions, and role-permission mappings stored in flat YAML files and managed via ActiveYaml,
  • a Fetcher class that checks for an existing token in a specified store, or generates a new one, if there’s none, and
  • a Verifier class that verifies an operator token. If the token is valid, it returns an operator object with helpful methods, like operator.roles.permit? for checking if an action is permitted.
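To make the fetch-and-verify flow concrete, here is a stdlib-only sketch of issuing and checking an HS256-signed token that carries role names. It is not the nxt_auth_registry code: the module name, the `SECRET` placeholder, the payload fields, and the one-hour default lifetime are all assumptions made for illustration.

```ruby
require 'openssl'
require 'json'

# Hypothetical sketch of an operator token issuer and verifier.
module OperatorTokenSketch
  SECRET = 'shared-secret-placeholder' # assumption: known to both services

  # URL-safe Base64 without padding, as used by JWTs.
  def self.encode(data)
    [data].pack('m0').tr('+/', '-_').delete('=')
  end

  def self.decode(text)
    text.tr('-_', '+/').unpack1('m')
  end

  def self.sign(data)
    encode(OpenSSL::HMAC.digest('SHA256', SECRET, data))
  end

  HEADER = encode(JSON.generate({ alg: 'HS256', typ: 'JWT' }))

  # Build a signed token carrying the operator's id and role names.
  def self.issue(operator_id:, roles:, ttl: 3600, now: Time.now)
    payload = { sub: operator_id, roles: roles, exp: now.to_i + ttl }
    signing_input = "#{HEADER}.#{encode(JSON.generate(payload))}"
    "#{signing_input}.#{sign(signing_input)}"
  end

  # Return the payload hash if the signature matches and the token has not
  # expired; otherwise return nil.
  def self.verify(token, now: Time.now)
    header, body, signature = token.to_s.split('.', 3)
    return nil unless header == HEADER && body && signature
    return nil unless secure_equal?(sign("#{header}.#{body}"), signature)

    payload = JSON.parse(decode(body))
    payload['exp'].to_i > now.to_i ? payload : nil
  rescue JSON::ParserError
    nil
  end

  # Reuse a still-valid token from the store (e.g. a session hash), or issue one.
  def self.fetch(store, operator_id:, roles:)
    cached = store[:s2s_operator_token]
    return cached if cached && verify(cached)

    store[:s2s_operator_token] = issue(operator_id: operator_id, roles: roles)
  end

  # Constant-time string comparison.
  def self.secure_equal?(a, b)
    a.bytesize == b.bytesize &&
      a.bytes.zip(b.bytes).reduce(0) { |diff, (x, y)| diff | (x ^ y) }.zero?
  end
end
```

A production setup would more likely lean on an established JWT library than on hand-rolled signing. The sketch is only meant to show that the token needs nothing beyond the role names and an expiry.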

Here’s what usage looks like in our code:

In Admin Panel:

  • When the user logs in, we create a new JWT containing their roles and store this in the session. We do this at login so we don’t have to generate a new token every time we call other services.
    session[:s2s_operator_token] =
      NxtAuthRegistry::OperatorToken::Fetcher.new(current_operator).call
  • When creating a client to talk to Insurance Service, send the token along.
    client = Faraday.new(url: insurance_service_api_url) do |faraday|
      faraday.request :json
      auth_options = {
        token: insurance_service_auth_token,
        'operator-token': session[:s2s_operator_token]
      }
      faraday.request(:authorization, 'Token', auth_options.compact)
    end
    client.post(...)

This gives us an Authorization header that looks like this:

Authorization: Token token="<token>" operator-token="<operator-token>"

It might look a little strange at first, but it’s valid HTTP: the Authorization header allows a scheme name (here, Token) followed by extra fields in key=value syntax.
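On the receiving side, a header of that shape has to be split back into its fields. A minimal sketch, assuming a hypothetical helper named `parse_token_header` (not part of our library) that accepts parameters separated by spaces or commas:

```ruby
# Split an Authorization header such as
#   Token token="abc" operator-token="def"
# into a hash of its key="value" parameters.
# Returns nil when the scheme is not Token or no parameters are present.
def parse_token_header(header)
  scheme, params = header.to_s.strip.split(/\s+/, 2)
  return nil unless scheme&.casecmp?('Token') && params

  params.scan(/([\w-]+)="([^"]*)"/).to_h
end
```

The operator token is optional, so a caller would check for the `'operator-token'` key rather than assume it is there.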

In Insurance Service:

  • After authorising the service (Admin Panel), we fetch the operator details from the operator token, if there’s one.
    def get_operator(fields)
      return nil unless fields['operator-token'].present?

      NxtAuthRegistry::OperatorToken::Verifier.new(fields['operator-token']).call
    end

    current_operator = get_operator(parsed_authorization_header)
  • Now we can check if the operator is authorised to do something with operator.roles.permit?.
    def operator_can?(power)
      return false unless current_operator

      current_operator.roles.permit?(power)
    end

    def cancel_contract(contract)
      if contract.is_restricted?
        raise ContractRestrictedError unless operator_can?(:bypass_restrictions)
      end

      contract.cancel
    end

And that’s it.

Future developments

We haven’t ruled out creating an auth service or something else in future, if our needs evolve (such as changing roles and permissions more frequently). But we’re quite happy with nxt_auth_registry at the moment. It’s a pragmatic approach that allows us to move fast. And if we do decide to switch to something else in future, having a shared library with common authorisation utilities will likely be useful.

Shalvah


Not on Medium anymore🤢. I write about interesting software engineering stuff @ blog.shalvah.me