Moving On from OAuth 2: A Proposal

This past summer, I gave a talk at Identiverse 2018 called What’s Wrong with OAuth 2. While I do think there’s a lot to talk about on this topic, I’m not going to go into more depth here about what’s broken. I do think it’s worthwhile, though, to take a moment here to address one question that I got at the end of my presentation: what’s next?

The short answer is that I still don’t know exactly what’s next, but in the last few years I’ve been noticing a few trends in the extensions and applications of OAuth2 out in the wild. These extensions and applications are all trying to use OAuth as a tool to solve very real problems, often problems that it wasn’t really meant to solve. Coupled with some ideas of what we could do to address the things in OAuth that are causing problems today, I want to present some vague hand-waving about where we could go as an industry.

Defining Data Models

The first thing that I think the OAuth world would benefit from is a consistent set of data models. The original OAuth2 protocol didn’t really define what the attributes of clients, authorization servers, resources, or tokens were. Several things were hinted at, like client IDs and token types, but the actual concrete data models came later. Today, a variety of add-on specifications have effectively defined what each of these is. Clients are defined in Dynamic Client Registration, authorization servers are defined in Authorization Server Metadata (discovery), and tokens are defined in Token Introspection and JSON Web Tokens. Even if you’re not using these extensions directly, the data models encoded in all of them are fundamental to how an OAuth ecosystem functions. Any future efforts within this space, and any protocols that might eventually replace OAuth2, would do well to have all of these well defined in one place. Don’t just say what the parties are; list out what you’re expecting them to have access to and know about each other.
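To make "well defined in one place" a little more concrete, here is a rough sketch of what a single shared set of models could look like. The field names are borrowed from Dynamic Client Registration (RFC 7591), Authorization Server Metadata (RFC 8414), and Token Introspection (RFC 7662); the class names and the choice of which fields to keep are my own assumptions, not something any of those specifications mandates.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Client:
    # Field names borrowed from Dynamic Client Registration (RFC 7591).
    client_id: str
    client_name: Optional[str] = None
    redirect_uris: List[str] = field(default_factory=list)
    grant_types: List[str] = field(default_factory=list)


@dataclass
class AuthorizationServer:
    # Field names borrowed from Authorization Server Metadata (RFC 8414).
    issuer: str
    authorization_endpoint: str
    token_endpoint: str


@dataclass
class Token:
    # Field names borrowed from a Token Introspection response (RFC 7662).
    active: bool
    client_id: str
    scope: str = ""
    exp: Optional[int] = None

    def scopes(self) -> List[str]:
        # OAuth2 scopes travel as one space-separated string.
        return self.scope.split()
```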

Minimizing the Front Channel

Many of OAuth2’s biggest problems in the wild come from dealing with the front channel and sending security information across the HTTP redirects there. Not only does information leak to the browser, there’s also no guarantee that things haven’t been tampered with in transit. OpenID Connect invented Request Objects and Hybrid ID Tokens to combat this, and the FAPI working group recently published the JWT-Secured Authorization Response Mode (JARM) draft as a general mechanism to address this outside of OIDC.

While these methods are effective from a security standpoint, they wreak havoc with OAuth2’s simplicity for developers. One of the biggest draws of the protocol was that it was easy to implement from the client side: just add a few parameters to a URL and off you go. By wrapping everything in JOSE objects, you’re not only making it more complex for a developer, you’re also avoiding using the full power of HTTP in the protocol. To be honest, the latter part almost makes sense here because HTTP was never meant to carry information securely across redirects like this. The front channel is ultimately a hack on top of a system that was designed to solve a very different problem. It’s an effective and powerful hack, to be sure, but a hack nonetheless.

To solve this, I propose that we limit what needs to go over the front channel to the absolute bare minimum: only send reference handles and never send values. Actual values would be sent to back-channel endpoints, regardless of client type, and handles would be returned to the client for future use. Developers would still be able to send these handles by simply adding query parameters to a URL, and clients would be able to read them by plucking query parameters off the incoming request. In addition, we need to drastically simplify the various responses and modes that have grown onto OAuth and its kin. Instead of form-posts and URI fragments and JWTs and whatever other options are there today, we should go back to simple query parameters, since what we’re sending around is much smaller and simpler.

Furthermore, the AS boils down to two basic endpoints for authorization transactions across the different grant types. The back channel calls are served by the equivalent of the token endpoint, and the front channel calls, which require user interaction, are handled by an endpoint dedicated to user interaction. For most things, that’s pretty much it. The server is uniquely identified by its back channel endpoint URL. The interaction URL is sent to the client on an as-needed basis in a manner that is contextual to the client’s request, which means it can vary depending on the client’s context.
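As a rough illustration of how little would need to cross the front channel, the sketch below sends and reads a single handle as a plain query parameter. The parameter name `handle`, the function names, and the example URLs are all assumptions made up for this sketch.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_interaction_url(interaction_endpoint: str, handle: str) -> str:
    # The handle is the only thing sent over the front channel;
    # everything it refers to was already sent over the back channel.
    return interaction_endpoint + "?" + urlencode({"handle": handle})


def read_handle(incoming_url: str) -> str:
    # Pluck the handle off the incoming request's query string.
    query = parse_qs(urlparse(incoming_url).query)
    return query["handle"][0]
```

A client would redirect the user to something like `build_interaction_url("https://as.example/interact", handle)`, and the receiving side would recover the value with `read_handle`.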

Transactions and Interactions

You may be thinking that this is crazy talk. After all, the front channel today is used to send a wide variety of information, including what’s being asked for and how it’s getting there, and the response modes were added because different clients behaved differently and we didn’t want information to leak. But when you take a step back and look at how things function in today’s browsers, you can see that our intended protections are wishful thinking at best, and the complexity they add for client developers is detrimental to the ecosystem.

The thing is, a number of different protocols built on and around OAuth2 have noticed this same problem and have set up a way to do a pre-flight request before involving the end user. UMA 2.0 has the client call the token endpoint with a ticket before it sends the user to the claims endpoint. The OAuth Device Grant has the client speak to a special endpoint to get its codes before having the user type them in someplace else. CIBA has the client talk to an authentication endpoint before the user gets involved in a completely out-of-band way to validate the request. Both the Open Banking initiative and the FAPI working group have ways to send information to the AS before getting the user. Dynamic Client Registration even lives in a similar space because the client is talking to the authorization server in an automated way before the user is even contacted. Looking back a bit, OAuth 1 had its request tokens.

Let’s start to clean this up by having the client always start by talking to the AS over the back channel, regardless of the kind of transaction. The AS is going to want to know the answer to a few key questions: what do you want, who are you, and what can you do?

What do you want?

In OAuth 1, the answer was always “to get to the API”. OAuth 2 added scopes which let a client specify which parts of the API it wanted to access. The limitations of the front channel make it prohibitively difficult for the client to say in any detail what it wants to do. As a consequence, scopes are simple strings and their combinatorial value is dependent on the authorization server’s implementation whims. UMA added permissions and resource sets on top of scopes to allow a greater granularity and wider distribution of the system, and there are a number of extensions to OAuth such as the resource indicators and various audience parameters to allow similar things, but these are all still limited. By having the client go to the back channel first, we can make use of a JSON structure that would let a client describe its request. Perhaps something that describes what we want to do, where we want to do it, and what we want to do it to:

{
  "actions": ["update", "read", "delete"],
  "locations": ["http://example.com/api"],
  "data": ["images", "location", "metadata"]
}

The client could even send a list of these structures to indicate it wants to access multiple resources at once, potentially across multiple systems. This is great for complex cases, but the best thing about scopes in OAuth2 is that, from a client’s perspective, requesting scopes is really simple: it’s a list of strings that get sent to the server. For most clients, this rich set of information described above could be easily loaded into a template and that template sent to the server. An AS could even bundle all of this into an easily referenced resource identifier that the client could use instead of the transaction values themselves. In that way, we could allow a system as simple as scopes are today but with a more concrete definition of what each scope applies to and how they fit together.
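The template idea could be sketched roughly like this: the client keeps scope-like names in its code, and each name expands into the richer structure before being sent over the back channel. The template name `photo-editing` and the `resources` key in the request body are hypothetical, chosen only for illustration.

```python
import json
from typing import Dict, List

# Hypothetical templates: a short, scope-like name maps to the full
# description of what, where, and on which data the client wants to act.
TEMPLATES: Dict[str, dict] = {
    "photo-editing": {
        "actions": ["update", "read", "delete"],
        "locations": ["http://example.com/api"],
        "data": ["images", "location", "metadata"],
    },
}


def build_resource_request(names: List[str]) -> List[dict]:
    # One structure per requested name, so a single request can span
    # several resources at once.
    return [TEMPLATES[name] for name in names]


# Body the client would send to the back channel endpoint.
body = json.dumps({"resources": build_resource_request(["photo-editing"])})
```

From the client developer’s side this stays about as simple as a list of scope strings, while the AS receives an explicit description of what each name means.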

This same data model can be applied to the access tokens that are issued by the server. Instead of just returning the list of scopes, the AS can return a data structure that tells the client unambiguously what the token is supposed to be good for.

Who are you?

The AS needs to know a few things about the client software that’s asking for access to the API. In OpenID 2, the RP identified itself fully by its URL. In OAuth, not every client has a unique URL and so client IDs are used to solve that. And since users are involved in making trust and security decisions, the trust and identity of a client in a particular transaction is vital in leading the user toward making a sound decision.

But with OAuth, the client ID is sent through the front channel and is effectively public, allowing any attacker to use it in a phishing attempt. To mitigate this, clients were given a client secret that they could present in the back channel to prove who they were. However, not every OAuth client was capable of keeping this kind of secret — in-browser and native apps in particular. So what can we do?

I propose that we always allow a client to declare its attributes at the start of a transaction. When making the transaction request, the client simply tells the authorization server who it is. This can include display names, home pages, and anything else that would identify a piece of software to a user. When you start off a transaction to ask for a token, tell the AS who you are. This could be coupled with some form of proof that you’re allowed to make those particular claims, but I’m not sure how that would fully pan out. The AS can then, as always, decide how it wants to convey that to the user. If it’s seen the client a bunch before, or if the client is able to present proof of a validation from a trusted third party, maybe the approval display should be different from a client that’s just shown up out of nowhere. This model gets rid of the assumption of registration, making it dynamic-first, like much of the rest of the internet.
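Here is a minimal sketch of both halves of that idea: the client describing itself inline, and the AS picking a display treatment based on what it knows. The field names (`name`, `uri`, `logo_uri`), the treatment labels, and the decision rule are all invented for this example; a real AS would apply its own policy.

```python
from typing import Dict, Optional


def describe_client(name: str, uri: str, logo_uri: Optional[str] = None) -> Dict[str, str]:
    # What the client says about itself at the start of a transaction.
    client = {"name": name, "uri": uri}
    if logo_uri is not None:
        client["logo_uri"] = logo_uri
    return client


def approval_treatment(times_seen: int, attested: bool) -> str:
    # Hypothetical AS policy for how to present the client to the user.
    if attested:
        return "trusted"    # a trusted third party vouches for the client
    if times_seen > 0:
        return "familiar"   # the AS has seen this client before
    return "unknown"        # just shown up out of nowhere
```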

What can you do?

From a code-path perspective, who a client is matters much less than what a client is capable of doing. In fact, much of the client model in OAuth2 is a stand-in for client capabilities. We add fields like redirect_uris and client_type to the client model to let the authorization server decide how to handle that client. What if, instead, we just had the client declare how the transaction should go as part of the transaction at the start?

This approach could replace the entire OAuth2 concept of different grant types and response types, commoditizing the entire process of interaction within the protocol. Can you open a web browser? Say so! Can you listen for a front channel response at a URI? Tell the server what URI that is. Can you give the user a code? Maybe even tell the AS what kind of codes you can handle. A fancy client could even give the AS a few different kinds of interactions to choose from, with the AS declaring what will be used this time. And the client can tell the AS all of these things based on its current state and programmed features.

User interaction could then be handled on an as-needed basis, and be reactive to the client’s capabilities. When the AS sees that a client is making a request that requires interaction, it can tailor its response to what the client is able to do within this transaction. After all, the AS just needs to have the user approve the transaction interactively somehow; everything else, from redirect URIs to user codes, is just scaffolding to get the user in front of the AS in the right context.
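One way this negotiation might look, as a sketch: the client lists the interaction modes it supports, and the AS picks one for this transaction. The mode names `redirect` and `user_code`, the `callback` field, and the AS preferring redirects are assumptions for illustration only.

```python
from typing import Dict


def plan_interaction(capabilities: Dict[str, dict]) -> dict:
    # The AS picks one of the modes the client declared for this
    # transaction. Preference order here is an arbitrary assumption.
    if "redirect" in capabilities:
        # The client can open a browser and listen at a callback URI.
        return {"mode": "redirect", "callback": capabilities["redirect"]["callback"]}
    if "user_code" in capabilities:
        # The client can show the user a code to enter elsewhere.
        return {"mode": "user_code"}
    raise ValueError("client declared no interaction mode the AS can use")
```

A fancy client might send `{"redirect": {"callback": "https://client.example/cb"}, "user_code": {}}` and let the AS choose, while a constrained device would send only `{"user_code": {}}`.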

Importantly, by declaring its capabilities at the start of a transaction and choosing a specific path, a client narrows the attack surface for the overall experience. Many OAuth breaches have come from a client or AS being too flexible about what it accepts, but the myriad of options that have been added to OAuth almost dictate that kind of flexibility for a piece of software to survive.

Haven’t We Met?

The first time a client ever talks to an AS, I think it makes a lot of sense to allow a client to introduce itself like we’re talking about here. But when the same instance of client software is talking to the same AS, using the same capabilities, asking for the same things, it would be wasteful and redundant to re-declare all of this every time. OAuth2 tried to address this by bundling client capabilities in the client’s registration, but then the client still had to declare its scope every time. This led to clients having hardcoded parameter strings to send the same options to the server every time, a waste of bandwidth and a source of complexity whose power was never used.

While the transactional system I’m proposing here would, on the surface, make that problem worse, a simple optimization could make all of this reasonable to deal with. Once a client has sent its transaction request, the AS can return a handle that represents part or all of the incoming request. This idea is based in part on the Persisted Claims Token from UMA. The PCT represents user claims, but the same concept could easily be applied to the transactional input that we’ve been discussing all along.

In this way, the transaction mechanism works very similarly to the FAPI request object registration endpoint: the client sends in a set of parameters and gets back a handle that it can use to interact with the AS in the future. The next time the client makes a request, instead of sending all of its data about its own capabilities it sends the handle representing those capabilities. The same thing could be used for the client’s data about itself as well as what it was asking for. As long as the client is asking for the same kind of thing and its own parameters haven’t changed, it can keep using the same handle for different requests and even different users.

This handle mechanism extends into the rest of the authorization transaction. Much like the UMA permission ticket, and separate from the handle given back representing the details of the incoming request, there is a handle given back that represents the transaction request itself. This handle can be used when sending the user to the interaction endpoint, and in fact this would be the only parameter sent to the interaction endpoint, with the rest being looked up by the AS. The handle would also be used when calling the back channel endpoint when polling for a token (like with the device flow), or returning from the interaction endpoint, or really used whenever the client interacts with the AS and needs to move things forward another step. Additionally, since this handle is meant to be ephemeral and scoped within the transaction, every time the handle is used it can be rotated by the AS. This would prevent replay and other attacks.
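The rotation described above could be sketched as follows, assuming a stateful AS with an in-memory store. The class and method names are hypothetical; the only real API used is Python’s `secrets.token_urlsafe` for generating unguessable values.

```python
import secrets
from typing import Dict, Tuple


class TransactionStore:
    """In-memory sketch of an AS tracking transactions by rotating handles."""

    def __init__(self) -> None:
        self._by_handle: Dict[str, dict] = {}

    def begin(self, request: dict) -> str:
        # Start a transaction and hand back its first handle.
        handle = secrets.token_urlsafe(32)
        self._by_handle[handle] = request
        return handle

    def use(self, handle: str) -> Tuple[dict, str]:
        # Consume the presented handle and issue a fresh one.
        # pop() raises KeyError for an unknown or already-used handle,
        # which is what stops a replay of an old value.
        transaction = self._by_handle.pop(handle)
        new_handle = secrets.token_urlsafe(32)
        self._by_handle[new_handle] = transaction
        return transaction, new_handle
```

Each step of the transaction (polling, returning from interaction, and so on) would call `use` with the latest handle and carry the returned one forward.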

Statefulness and Secrets

The transactional handle model as presented here makes a lot of assumptions about the ability of the AS to validate and dereference these handles during a transaction. With a stateful server, this would be pretty simple: you store the handle (or its hash) alongside the data that the handle represents. When the handle comes in, you look it up.
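A minimal sketch of the "store its hash" variant, again assuming an in-memory store and made-up class and method names. Keeping only the SHA-256 digest means that a leak of the server’s table does not directly reveal usable handle values.

```python
import hashlib
import secrets
from typing import Dict, Optional


class HandleStore:
    """Stateful lookup keyed by the hash of the handle, not the handle itself."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}

    @staticmethod
    def _digest(handle: str) -> str:
        return hashlib.sha256(handle.encode("utf-8")).hexdigest()

    def issue(self, data: dict) -> str:
        # Generate a random handle, store the data under its hash,
        # and return the handle value to the caller exactly once.
        handle = secrets.token_urlsafe(32)
        self._data[self._digest(handle)] = data
        return handle

    def lookup(self, handle: str) -> Optional[dict]:
        # Hash the incoming handle and look it up; None if unknown.
        return self._data.get(self._digest(handle))
```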

The problem with this statefulness assumption is that it doesn’t scale particularly well, and at the scale of large internet service companies it would require significant global synchronization of data across clusters, clouds, and data centers. The traditional way to address this issue is through making all the handle values be self-contained. This comes with its own problems of cryptographic protection, privacy, and key-management, but these kinds of scaling problems are hardly new. I don’t know what the right way to address this is, yet, and so I propose we start with a stateful system and move from there.

In a related space, the simplest way to use a handle within a protocol like this is as a bearer token. You generate a value, hand it over to the next party, and they replay it back under whatever other circumstances you need to recognize that party in. This is really simple to build and use, but the security is awful. These tokens (or secrets or passwords or API keys, whatever you want to call them) can be copied and stolen trivially. This is a particularly bad problem if you end up sending things over the front channel, which we plan to do.

I think we can solve this using a bit of cryptography. Again, if we’re assuming a stateful system, we can operate such that the real handle value is never sent over the wire, only proof of possession of that handle. The simplest way to do this would be a simple hash of the value, such that the AS sends the value, the client hashes the value and sends the hash. This is roughly what the PKCE specification does with the code challenge. That hash could still be intercepted and replayed by an attacker, so it doesn’t solve everything, but there are other cryptographic techniques like zero-knowledge proofs that might apply.
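For reference, the hashing step itself is small. The sketch below uses the same transform as PKCE’s S256 method (base64url of the SHA-256 digest, without padding) as a stand-in for whatever a real design would pick; the function names are my own, and as noted above a bare hash like this can still be captured and replayed.

```python
import base64
import hashlib
import hmac


def s256(value: str) -> str:
    # Same transform as PKCE's S256 challenge method:
    # base64url(SHA-256(value)) with the trailing padding removed.
    digest = hashlib.sha256(value.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def proof_matches(stored_value: str, presented_hash: str) -> bool:
    # The party that knows the real value recomputes the hash and
    # compares in constant time.
    return hmac.compare_digest(s256(stored_value), presented_hash)
```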

There are other means for protecting the various messages in this system, including several forms of binding to TLS and message-level signatures for HTTP messages. In an ideal world, we’d be able to apply any number of different proofing mechanisms to the various secrets: transaction handles, access tokens, client secrets (or their equivalent), state values, etc.

Begin Transaction?

The transactional model is flexible enough to fit a variety of circumstances and use cases. A quick back-of-the-napkin sketch shows that you can represent all of the existing OAuth flows as well as UMA, the device flow, the OpenID Connect extensions, FAPI/OB, and even CIBA using this model. The devil is, as always, in the details, and I’d like to see conversation on this. It could be that this completely falls apart for other reasons that I haven’t thought of — but I want to know what those reasons are.

Or it could be that people just don’t care enough right now. Even if this idea takes off overnight, which it won’t, OAuth2 isn’t going away any time soon. OAuth2 solves a lot of use cases very well and it will continue to do so. But even so, I firmly believe that just because something is good, one should not be dissuaded from trying to do even better. There’s immense value in getting things “good enough” with basic tools, and maybe that’s where we are today. But I also think that we can make better tools now that we’ve had some experience with this.

I’d like to explore this topic more with the OAuth, OpenID, and wider community, with implementations, specs, and discussion. And by explore, what I really mean is build: let’s go make something and see what sticks.