This article is part of my very loosely connected series that explores the challenges of designing a large distributed system. While working on this system, I have had a lot of opportunities to learn about system design, security, and Artificial Intelligence.
You can read the previous articles here:
- Building a Backend System for Artificial Intelligence
- Fault Tolerance in Asynchronous, Choreographed, Distributed System
Our primary use case is the following:
When users of our app decide to share content with their friends, one of two things can happen:
1. The friend already has our app.
In that case, content is shared with this friend in-app, and he or she gets notified via push notification.
2. The friend does not have the app.
The friend will receive a link in a text message to view the shared content online, on our server, in a browser.
The main goal is to protect the privacy of the content. Even though the shared link is technically public, and the recipient may choose to forward or otherwise re-share the link from the text message, it is truly intended only for the recipient to see. However, we do not want to be too restrictive. We want to give our users (or non-users, I should say…) the flexibility to open the link on multiple devices or share the link with someone else, should they decide to do so. We only want to prevent (or limit) scenarios in which the link is leaked and the target resource becomes indefinitely accessible to anyone.
We decided to limit the lifetime of the link to 2 days. After 2 days, the link becomes inaccessible. Furthermore, the target resource becomes unavailable too. It’s not just about securing the link, it’s about protecting the resources as well.
This work was motivated by a problem reported by users participating in our closed-beta test. Before the new system was implemented, we had been using bit.ly as our URL shortener of choice. Some users on one particular US carrier would not receive the aforementioned text message, unlike users on other networks. A short investigation revealed that this issue has been reported by others as well and started appearing about 6 months ago (and it does not appear to be getting resolved any time soon).
What seems to be happening is this particular carrier deployed a new filter that checks all text messages and automatically blocks spam messages from being delivered. Which, of course, is a good thing. But not so great when it blocks your legitimate text message. And it would appear that a bit.ly link in a text message from a phone number unknown to the recipient appears too “spammy”.
That set things in motion. We knew we needed another solution. Our current system was not only getting blocked, but also did not truly offer the resource access expiration capabilities that we required.
First, I considered not using links in the text messages at all. If links are being blocked, let’s avoid putting links in the text messages. Instead, the user would receive a 4-digit PIN that she enters along with her phone number on our website to authenticate.
After reviewing the requirements, I quickly came to the realization that this looked an awful lot like a plain authentication flow, one that could be solved by using the OpenID Connect (OIDC) protocol to authenticate the user and grant access to that authenticated user. I also realized that, while it offers a lot of advantages to developers by providing a fully authenticated identity, it makes for a poor user experience.
We want to make sure our non-users can quickly see the content shared to them. Switching from text messaging app to their browser, navigating to our website and manually entering a code is significantly more difficult than just clicking a link in the text message.
Sometimes the best engineering choices are not the best product choices. That is something I keep telling myself every day (and I am struggling with it every day, too!).
I scrapped that idea and went back to the drawing board. But the idea of authentication stuck. From there it was a quick mental leap to keep part of the design but strip it to the bare bones: a system that satisfies our requirements and is secure, but is also extensible, should we decide to go back to the idea of full user authentication with a user name (phone number) and password (PIN).
For better understanding, let’s review the architecture of the system:
Let me describe the flow.
First, the user decides to share the content. This triggers a processing pipeline in one of our microservices. Our backend service checks whether the recipient is a user or a non-user of our app. When it’s a non-user, we call our internal API to issue a new JWT token.
JSON Web Token (JWT) is an Internet standard for creating JSON-based access tokens that assert some number of claims.
For example, a server could generate a token that has the claim “logged in as admin” and provide that to a client.
The request to issue the new token looks like this:
“audiences”: [ … ],
Uid parameter identifies recipient of the token.
Audiences limit where the token can be consumed. This is one of the concepts we are borrowing from OIDC.
ValidUntil configures how long the access will last. Even though, right now, we always set it to 2 days from the time token was issued, this provides flexibility to change it in the future.
AdminAccess gives administrative access to the resource. This eventually translates to a role that is carried in the token. More on that later.
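To make the four parameters concrete, here is a minimal sketch of how such a request body could be assembled. The exact JSON key casing (`uid`, `validUntil`, `adminAccess`), the example recipient id and the example audience URL are assumptions for illustration, not our actual API contract.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class IssueTokenRequest:
    # Field names are assumed; the article only names the four parameters.
    uid: str                   # identifies the recipient of the token
    audiences: List[str]       # services allowed to consume the token
    validUntil: str            # ISO 8601 timestamp after which access ends
    adminAccess: bool = False  # later surfaces as a role in the token


def build_request(uid, audiences, issued_at,
                  lifetime=timedelta(days=2), admin_access=False):
    """Build a request whose validUntil is issued_at + lifetime (2 days by default)."""
    valid_until = (issued_at + lifetime).isoformat()
    return IssueTokenRequest(uid, audiences, valid_until, admin_access)


# Hypothetical values, for illustration only.
issued = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
req = build_request("recipient-123", ["https://go.example.com"], issued)
print(json.dumps(asdict(req), indent=2))
```

Keeping the lifetime as a parameter mirrors the point above: today it is always 2 days, but nothing in the request shape forces that.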
(API communication between our services internally is secured through a shared secret to authenticate our internal API calls and the endpoints are not exposed to public internet).
Once token is created, it is saved internally to our database and API responds back with a “shortcode”. Shortcode is essentially a unique identifier of the token that we can use to look it up later but it’s designed to be short in order to be able to use it in a short URL that we share to user.
Each shortcode is 6 characters long and is case insensitive. That gives us roughly 1.8 billion (35^6 = 1,838,265,625) combinations. It is unlikely that we would run out of unique shortcodes in the foreseeable future. But, in the (likely, of course!) event that our app becomes tremendously successful and acquires more users than, say, TikTok, we can start reusing shortcodes from long-expired tokens, or make shortcodes case sensitive to increase the number of potential combinations.
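As a rough sketch of the idea, shortcodes like these could be generated as below. The figure of 1,838,265,625 equals 35^6, so the sketch assumes a 35-symbol alphabet (digits plus lowercase letters with one look-alike letter dropped); which character is actually excluded is a guess, and the function names are hypothetical.

```python
import secrets
import string

# Assumed alphabet: 10 digits + 25 lowercase letters ("l" dropped) = 35 symbols.
ALPHABET = string.digits + string.ascii_lowercase.replace("l", "")
CODE_LENGTH = 6


def new_shortcode(existing):
    """Draw random codes until one is not already in use."""
    while True:
        code = "".join(secrets.choice(ALPHABET) for _ in range(CODE_LENGTH))
        if code not in existing:
            return code


def normalize(code):
    """Case-insensitive lookup: fold user input to the stored lowercase form."""
    return code.strip().lower()


print(len(ALPHABET) ** CODE_LENGTH)  # 1838265625
```

Under the same assumption, making the codes case sensitive would grow the alphabet to 60 symbols, or 60^6 = 46,656,000,000 combinations.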
This is how the token in our database looks:
"aud": [ "https://go.<redacted>.com", "https://albums.<redacted>.com" ],
"roles": [ "admin" ]
Now that we have the shortcode, we can make a link out of it and send it via text message to the recipient.
https://go.<our app domain>.com/fxi60i
As you can see in the URL above, I named the service that takes the short link and “translates” it to the long URL of the target resource “Go”.
When the user clicks the link, we navigate to our “Go” service. This service looks up the token using an API call. If the token has already expired, or if the shortcode is not correct, we terminate here and return an error page. And of course, we offer an upsell to our users: download our app and you won’t have to worry about expiration dates. If, on the other hand, a valid token is found, we continue. First, we verify the token. We want to make sure that our system cannot be compromised by presenting spoofed tokens to users, so we verify the audience, issuer and signature.
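The lookup and claim checks described above can be sketched as follows. This is a simplified illustration, not our service code: the in-memory `token_store` stands in for the API call, the two URLs are placeholders, and signature verification is deliberately left out of this snippet.

```python
import time

# Placeholder values; the real audience and issuer are not shown in this article.
GO_AUDIENCE = "https://go.example.com"
EXPECTED_ISSUER = "https://issuer.example.com"


def resolve(shortcode, token_store, now=None):
    """Return the token claims for a shortcode, or None if the link must be rejected."""
    now = time.time() if now is None else now
    claims = token_store.get(shortcode.lower())  # shortcodes are case insensitive
    if claims is None:
        return None  # unknown shortcode -> error page
    if claims["exp"] <= now:
        return None  # token expired -> error page (and upsell)
    if GO_AUDIENCE not in claims["aud"]:
        return None  # token was not issued for this service
    if claims["iss"] != EXPECTED_ISSUER:
        return None  # token came from an unexpected issuer
    return claims
```

A caller would render the error page whenever `resolve` returns `None` and continue with the redirect otherwise.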
I opted to use a public-private key pair: the issuer signs the tokens with the private key, and the Go service verifies the signature using the public key. Because this is service-to-service communication (between our API and our Go service), we could have used a simpler shared secret for both signing and verification. That would spare us from having to manage certificates down the line. Initially, I thought I would verify the signature on the client as well, and that would have prevented me from using a shared secret: if I did use one, I would inadvertently expose the secret to anyone who cared to look for it. But there is very little benefit in verifying the token on the client anyway, because an attacker could easily circumvent the verification. More on that later!
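For illustration, here is a minimal sketch of the simpler shared-secret alternative (JWT with HS256), written with only the Python standard library. It is not the key-pair scheme we actually chose; that would need an RSA or ECDSA library and is not shown. The function names are made up for this example.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url without padding, as used by JWT."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign(claims: dict, secret: bytes) -> str:
    """Produce header.payload.signature using HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    parts = [b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
             for part in (header, claims)]
    signing_input = ".".join(parts)
    sig = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)


def verify(token: str, secret: bytes):
    """Return the claims if the signature matches, otherwise None."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
        signature = b64url_decode(sig_b64)
        signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
        # Pin the algorithm instead of trusting whatever the header claims.
        if not isinstance(header, dict) or header.get("alg") != "HS256":
            return None
        expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, signature):
            return None
        return json.loads(b64url_decode(payload_b64))
    except ValueError:  # malformed token, bad base64 or bad JSON
        return None
```

The same secret both signs and verifies here, which is exactly why this variant cannot be used if verification ever has to happen on the client.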
Now that we have our token verified, we can pass it to the client to open access to the service.
I chose to use a cookie to store the token. To provide additional security to the cookie, it’s marked as secure (only distributable over HTTPS). It is set to expire at the same time as the token. That offers a very nice “out-of-the box” mechanism to prevent expired cookies from lingering around.
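A minimal sketch of building such a cookie with Python's standard `http.cookies` module follows. The cookie name `token`, the parent-domain value and the helper name are assumptions; `HttpOnly` is not set in this sketch because the client app is assumed to read the token itself.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from http.cookies import SimpleCookie


def token_cookie(token: str, exp: int, domain: str) -> str:
    """Build a Set-Cookie value that expires at the token's own expiry (Unix seconds)."""
    cookie = SimpleCookie()
    cookie["token"] = token
    morsel = cookie["token"]
    morsel["secure"] = True   # only sent over HTTPS
    morsel["domain"] = domain  # parent domain, so sibling subdomains receive it
    morsel["path"] = "/"
    # Cookie expiry mirrors the token's exp claim, formatted as an HTTP date.
    expires_at = datetime.fromtimestamp(exp, tz=timezone.utc)
    morsel["expires"] = format_datetime(expires_at, usegmt=True)
    return morsel.OutputString()


print(token_cookie("aaa.bbb.ccc", 1577836800, "example.com"))
```

Because the browser drops the cookie at the same moment the token expires, no separate cleanup is needed on the client.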
Both our APIs and the “Go” service are subdomains of <our app domain>.com, and so our cookies are readable by all other subdomains. This may pose a security vulnerability if an attacker got hold of one of our subdomains and thereby gained the ability to read cookies issued by the go subdomain. We deem that to be low risk at this point. Should we re-evaluate that later, or if we want to get rid of cookies, other options are passing the token from the Go service to the other service during redirection via a query parameter in the URL, or using localStorage.
To continue, user is redirected from the Go service to the web app using a simple HTTP 302 Redirect.
The web app loads https://albums.<our app domain>.com/<resource uuid>. This loads an Angular app. The client application uses Angular concept of route guards to check (and prevent) access to the url route.
Each route maps a URL path to an Angular component.
You add guards to route configuration to restrict access to a route.
As I briefly mentioned above, the presence of a token on a client (or lack thereof) is not enough to secure our resources. It is trivial for an attacker to modify the Angular app in-flight to skip the validation of tokens and present fake tokens. And for the same reason validating token signatures in the client app offers no real advantage.
Two things are implemented here to combat this.
1.) Resource ids are non-guessable, which is a very basic security technique.
The basic thinking is: because resources are identified by UUIDs, they are not “guessable” (in contrast to incremental integer ids, for example, which would be easy to guess). The only way for an attacker to find a resource would be to randomly generate massive amounts of UUIDs until a valid identifier is discovered. However, by employing simple traffic throttling at the infrastructure level, we can prevent that from happening.
This of course, in itself, does not provide any real security. The other piece of the puzzle is the token.
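The throttling mentioned under point 1 happens at the infrastructure level in our case, so there is no application code for it. Purely to illustrate the idea, here is a small sliding-window limiter; the class name, the limits and the use of a client identifier as the key are all assumptions.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within the last window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self._hits = defaultdict(deque)  # client id -> timestamps of recent requests

    def allow(self, client_id, now):
        hits = self._hits[client_id]
        # Forget requests that have fallen out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject (e.g. HTTP 429)
        hits.append(now)
        return True
```

With a limit like this in place, brute-force enumeration of random UUIDs becomes impractically slow, since each client only gets a handful of guesses per window.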
2.) The resources themselves are secured too.
At its core, our client app just displays images stored in Azure Storage Blobs.
Ideally, we would want to pass our JWT token to Azure Storage when reading the blob (“Binary Large OBject”) content and have the service validate the token the same way our Go service does.
Azure Blob Service supports the OpenID Connect (OIDC) protocol via Azure AD client impersonation. But the traditional method of securing access to Azure Storage resources uses so-called “SAS tokens”. A SAS (Shared Access Signature) token is, as a concept, similar to a JWT. Instead of a JSON object representing the identity, it consists of a set of query parameters that specify the operation (read/write), the expiration date of the token, and a signature to prevent tampering with the parameters.
To generate a SAS token, one can use Azure API, cli or various Azure client tools. And this is how the signature looks:
The glaring question in front of us is: how do we marry these two concepts to prevent an attacker from bypassing the client JWT validation and accessing the resource? We have already seen that, because resource ids are non-guessable, an attacker would mostly be limited to “hacking” a resource that was previously shared with him or her anyway. The solution I chose was to use custom claims in the JWT token. Going back to our Issuer API: at the time we issue a new token, we also reach out to Azure to generate a SAS token with the same expiration as our new token. Then we include the SAS token in the generated JWT as a custom claim.
That way, when the token gets passed down to our client app, app can use the token to access the blob resource. Even if the attacker modifies the client to skip JWT validation, there is no way she can pass custom SAS token because that is validated server-side.
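Here is a small sketch of that idea: the issuer embeds the SAS query string as a custom claim, and the client appends it to the blob URL. The claim name `sas`, the use of `sub` for the recipient, the storage URL and the SAS value are all placeholders; the real SAS token is generated by Azure and its signature is not reproduced here.

```python
from urllib.parse import urlsplit, urlunsplit


def issue_claims(uid, audiences, exp, roles, sas_token):
    """Assemble JWT claims, carrying the SAS query string as a custom claim."""
    return {
        "sub": uid,        # assumed mapping of the Uid parameter
        "aud": audiences,
        "exp": exp,        # the SAS token is generated with the same expiry
        "roles": roles,
        "sas": sas_token,  # hypothetical custom claim name
    }


def blob_url(resource_url, claims):
    """Append the SAS query string from the token to a blob URL."""
    parts = urlsplit(resource_url)
    query = claims["sas"] if not parts.query else parts.query + "&" + claims["sas"]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


# Placeholder SAS string: sp = permissions, se = expiry, sig = signature.
claims = issue_claims("recipient-123", ["https://albums.example.com"], 1578052800,
                      ["user"], "sp=r&se=2020-01-03T12%3A00%3A00Z&sig=PLACEHOLDER")
print(blob_url("https://account.blob.core.windows.net/albums/album.json", claims))
```

Since Azure Storage checks the `sig` parameter on every request, a client that skips JWT validation still cannot read a blob without a SAS token that Azure itself accepts.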
Last, but not least, let’s talk about the roles. It is well established that your tokens should not contain permissions (e.g. https://leastprivilege.com/2016/12/16/identity-vs-permissions). Roles, on the other hand, are usually fairly static and ok to be represented in the token.
In our system, we have two roles.
- User role — the use case that we described in this article.
- Admin role, that gives person an unlimited access to the resource. That means viewing full album. In future, it will also include delete capability, or some other operations.
To get a shortcode and a token with full admin privilege, our issuer API must be called out-of-band with “fullAccess: true”. Of course, the call to the API is authenticated to prevent malicious users from granting themselves access. And the API endpoint is only available inside our network, not exposed to the public.
A flaw in the system is that, at the moment, malicious user can modify local JWT token by simply adding the right role locally to gain full access.
This is an acceptable risk for us because user in possession of the original token is considered “rightful owner” of the resource. Not giving him full access in our case is not a security measure. It is merely a marketing tool to upsell installation of our app (which of course has a lot of functionality beyond mere viewing of the album…)
The right solution, of course, is to validate the token server-side before deciding to serve a full-access version of the album. This is, unfortunately, not possible for us at the moment.
As a startup, we must look for ways to move fast and prove functionality quickly. To do that, we are using Azure blobs to represent our album resources. Instead of saving items in the album to a database, I chose to write (append) data to JSON blob resource. Doing so offers a very effective system with inherited properties of Azure Storage — geo replication, fast access, CDN capabilities, …
But that brings back the problem described in previous section — Azure Storage does not understand our JWT tokens.
In the end, I believe I designed a fairly flexible system that follows security best practices and industry patterns without introducing so much complexity that it would prevent us from being agile.
End to end, including the proof of concept with phone and PIN code input, it took under a week to implement, which I think is a decent amount of time. In the process, I learned a lot about the internals of OIDC protocol, as I was investigating how (or if) I can implement it here.
In the future, if we wanted to take privacy one step further, we could limit the lifetime of the link to, say, 15 minutes, and use the OIDC concept of refresh tokens to keep getting new tokens for as long as the resource owner hasn’t decided to revoke access.