The Curious Problem of Rate Limiting an Unauthenticated Endpoint
Rate Limit Endpoints with Blockchain! (kind of)
As I worked on my pet project Pogify API, I ran into the question of how to rate limit an unauthenticated endpoint.
The ethos of Pogify is that basic functionality is available without any kind of signup or onboarding. This has the unfortunate consequence that we have to operate and secure our endpoints without any explicit authentication.
The solution that we, the Pogify team, came up with was to issue a JSON Web Token (JWT) to each client that started a listening session, then only allow session-related methods to whoever holds that JWT. We could then implement rate limiting schemes on a per-user basis, since only one client was expected to have access to any given session.
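As a rough sketch of what such a gate might look like (this is illustrative, not the actual Pogify middleware; the "session" claim name and bare-token header are assumptions):

```go
// A minimal sketch: Gin middleware that only lets the holder of a session's
// JWT call that session's methods. Assumes an HMAC-signed token carrying the
// session ID in a "session" claim (github.com/golang-jwt/jwt/v5).
package middleware

import (
	"net/http"

	"github.com/gin-gonic/gin"
	"github.com/golang-jwt/jwt/v5"
)

var secret = []byte("replace-with-a-real-secret") // assumption: HMAC signing key

// RequireSession rejects requests whose JWT doesn't name the session in the
// URL. For brevity the token is read bare from the Authorization header.
func RequireSession() gin.HandlerFunc {
	return func(c *gin.Context) {
		token, err := jwt.Parse(c.GetHeader("Authorization"), func(t *jwt.Token) (interface{}, error) {
			return secret, nil
		})
		if err != nil || !token.Valid {
			c.AbortWithStatus(http.StatusUnauthorized)
			return
		}
		claims, ok := token.Claims.(jwt.MapClaims)
		if !ok || claims["session"] != c.Param("session") {
			c.AbortWithStatus(http.StatusForbidden)
			return
		}
		c.Next()
	}
}
```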
That solved one problem but left another open: how do you rate limit issuing the JWTs themselves? Endpoints were rate limited by an issued JWT, but the endpoint that issues the JWT can't be. This might be fine in general, but a design choice of Pogify was to use short session identifiers for the sake of user experience. That limited us to a relatively small pool of about 60 million keys. Since the API caches issued session IDs to avoid issuing a JWT for an already active session, a malicious actor could in theory run a script that exhausts the session ID pool while triggering methods that unnecessarily burden our servers.
Some “back-of-the-napkin” math shows that at 200 requests per second we’d run out in about 84 hours: 60 million keys ÷ 200 requests per second ≈ 300,000 seconds, or roughly three and a half days.
So how do you rate limit an unauthenticated endpoint?
What we wanted was to be able to limit requests on site (i.e. baked into the application) without relying on services like Cloudflare, CAPTCHA, or an API gateway.
Solution 1:
We don’t. Find a way to scale out instead.
Just grow the pool of available IDs. The current implementation uses a five-character, case-insensitive ID string drawn from letters, digits, hyphens, and underscores. Adding just one more character would raise the ceiling to about two billion, making the pool significantly harder to exhaust. But it still leaves the problem that a malicious actor could use those thousands of claimed keys to make requests that load the servers deeper in the stack.
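For a sense of how the pool scales, here's an illustrative generator for IDs from that charset (the function and its name are hypothetical, not Pogify's implementation):

```go
// Illustrative only: generate a session ID from the case-insensitive
// letters/digits/hyphen/underscore charset described above. Each extra
// character multiplies the pool size by the charset size.
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

const charset = "abcdefghijklmnopqrstuvwxyz0123456789-_"

func newSessionID(length int) (string, error) {
	id := make([]byte, length)
	for i := range id {
		n, err := rand.Int(rand.Reader, big.NewInt(int64(len(charset))))
		if err != nil {
			return "", err
		}
		id[i] = charset[n.Int64()]
	}
	return string(id), nil
}

func main() {
	id, _ := newSessionID(6) // six characters instead of five
	fmt.Println(id)
}
```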
Solution 2:
Rate limit by IP address. A simple solution, but it poses problems when user IPs are masked by NATs: limits that stop one abusive client also lock out everyone else behind the same address.
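For comparison, here's what a per-IP limiter might look like (illustrative only, using golang.org/x/time/rate; not something Pogify shipped):

```go
// Illustrative per-IP token bucket. The drawback is visible in the code:
// everyone behind one NAT shares a single bucket keyed by their public IP.
package middleware

import (
	"net/http"
	"sync"

	"github.com/gin-gonic/gin"
	"golang.org/x/time/rate"
)

var (
	mu       sync.Mutex
	limiters = map[string]*rate.Limiter{}
)

func limiterFor(ip string) *rate.Limiter {
	mu.Lock()
	defer mu.Unlock()
	l, ok := limiters[ip]
	if !ok {
		l = rate.NewLimiter(rate.Limit(1), 5) // 1 request/second, burst of 5
		limiters[ip] = l
	}
	return l
}

// RateLimitByIP rejects requests once their source IP exceeds its bucket.
func RateLimitByIP() gin.HandlerFunc {
	return func(c *gin.Context) {
		if !limiterFor(c.ClientIP()).Allow() {
			c.AbortWithStatus(http.StatusTooManyRequests)
			return
		}
		c.Next()
	}
}
```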
Solution 3:
Throttle the endpoint. Limit the number of session tokens that can be issued per some interval of time.
This solution, however, creates a new problem elsewhere: queuing and timeouts. A malicious actor could in theory shut down the creation of new, well-intentioned sessions by queuing arbitrarily many new session requests, filling the queue of waiting requests to the point where any new request times out.
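A sketch of that failure mode (illustrative, not Pogify code): a global throttle is effectively a shared queue, and anyone willing to keep it full starves everyone else into the timeout path.

```go
// Illustrative global throttle: a semaphore channel bounds concurrent
// session creation, and waiters give up after a timeout. An attacker who
// keeps the semaphore saturated pushes legitimate users into the timeout.
package middleware

import (
	"net/http"
	"time"

	"github.com/gin-gonic/gin"
)

var slots = make(chan struct{}, 10) // at most 10 in-flight session creations

func Throttle() gin.HandlerFunc {
	return func(c *gin.Context) {
		select {
		case slots <- struct{}{}: // acquired a slot
			defer func() { <-slots }()
			c.Next()
		case <-time.After(2 * time.Second): // queue stayed full; give up
			c.AbortWithStatus(http.StatusServiceUnavailable)
		}
	}
}
```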
The Solution
Use blockchain! Kinda-sorta.
The solution we settled on after much Googling and searching was inspired by a Stack Exchange answer and a blog post:
We would use proof of work, like the schemes used in blockchains, to artificially limit the number of sessions any one client can create. Getting a new session token would require the client to perform a partial hash inversion to claim it. We can then raise or lower the difficulty of getting a new token by requiring more or fewer leading zeros on a solution hash. This allows the API to run as it had, without rate limiting by IP address or throttling, while still limiting the rate of session creation.
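The core primitive is just hashing and counting leading zeros. A minimal sketch, assuming SHA-256 and difficulty measured in leading zero hex digits (the real package may define both differently):

```go
// Minimal proof-of-work primitive: hash the problem fields plus a counter
// and check for the required number of leading zero hex digits. Each extra
// zero digit multiplies the expected work by 16.
package pow

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// Hash combines a counter with the problem's session ID and timestamp.
func Hash(counter uint64, id string, ts int64) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%d:%s:%d", counter, id, ts)))
	return hex.EncodeToString(sum[:])
}

// MeetsDifficulty reports whether the hash has enough leading zeros.
func MeetsDifficulty(hash string, zeros int) bool {
	return strings.HasPrefix(hash, strings.Repeat("0", zeros))
}
```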
The Pogify Implementation
We developed a Go package and Gin-Gonic middleware that would handle the problem generation and verification steps of session creation.
The session start flow was modified to include this proof-of-work requirement (the server-side checks are sketched after the list):
1. Client retrieves a problem from the /issue endpoint. The response includes a problem payload: a JSON object with the session ID, an issued timestamp, the target difficulty (i.e. the number of leading zeros), and a checksum for the payload.
2. Client starts a counter at 0 and calculates a hash: h(counter, id, timestamp).
3. If the hash fulfills the required number of leading zeros, the client sends a request to the /claim endpoint with the session ID, issued timestamp, checksum, counter, and the found hash. If not, the client increments the counter and goes back to step 2.
4. Server verifies against the checksum that the session ID and timestamp haven't been tampered with.
5. Server checks that the problem has not expired. This check makes it difficult for a malicious agent to precalculate solutions and then flood the endpoint with valid hashes.
6. Server then checks that the client's calculated hash fulfills the difficulty requirement and matches the hash the server calculates from the same components.
7. If the hash is valid, the server returns a new JWT for the session ID of the problem.
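Here is a hedged sketch of the server-side checks (steps 4 through 7); the field names are assumptions, not the actual Pogify API, and the checksum is modeled as an HMAC over the session ID and timestamp:

```go
// Illustrative verification of a /claim request. Not the actual Pogify
// implementation: names and the HMAC-based checksum are assumptions.
package session

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
	"time"
)

var serverKey = []byte("replace-with-a-real-key") // assumption: server-side HMAC key

// Claim mirrors what the client sends to /claim (field names assumed).
type Claim struct {
	SessionID string
	IssuedAt  int64  // unix seconds, echoed from the /issue payload
	Checksum  string // echoed from the /issue payload
	Counter   uint64
	Hash      string
}

func checksum(id string, ts int64) string {
	mac := hmac.New(sha256.New, serverKey)
	fmt.Fprintf(mac, "%s:%d", id, ts)
	return hex.EncodeToString(mac.Sum(nil))
}

func verify(c Claim, zeros int, ttl time.Duration) error {
	// Step 4: the session ID and timestamp must not have been tampered with.
	if !hmac.Equal([]byte(checksum(c.SessionID, c.IssuedAt)), []byte(c.Checksum)) {
		return fmt.Errorf("bad checksum")
	}
	// Step 5: the problem must not have expired.
	if time.Since(time.Unix(c.IssuedAt, 0)) > ttl {
		return fmt.Errorf("problem expired")
	}
	// Step 6: recompute the hash and check the difficulty target.
	sum := sha256.Sum256([]byte(fmt.Sprintf("%d:%s:%d", c.Counter, c.SessionID, c.IssuedAt)))
	digest := hex.EncodeToString(sum[:])
	if digest != c.Hash || !strings.HasPrefix(digest, strings.Repeat("0", zeros)) {
		return fmt.Errorf("invalid solution")
	}
	// Step 7: the caller can now issue a JWT for c.SessionID.
	return nil
}
```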
Depending on the difficulty of the hash, we can now adjust how often a single session can be created. If issued problems expire in 5 seconds and take on average 0.5 seconds to solve, we've essentially restricted a malicious actor's ability to create new sessions to 2 per second with one thread, and limited any burst to 10 at once with one thread. I say one thread because nothing stops a malicious actor from using many threads and many clients to claim many sessions. Extending the example: it would take on average about 347 days of CPU time to exhaust the whole key space.
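Choosing a difficulty that lands near a target solve time is a matter of working backwards from the client's hash rate. A rough illustration, with an assumed hash rate:

```go
// Illustrative difficulty tuning: with d leading zero hex digits, a solve
// takes 16^d hashes on average, so d follows from the client's hash rate
// and the average solve time you want.
package main

import (
	"fmt"
	"math"
)

func main() {
	hashRate := 2_000_000.0 // hashes per second, assumed for a typical client
	target := 0.5           // desired average solve time in seconds

	d := math.Log(hashRate*target) / math.Log(16)
	fmt.Printf("difficulty ≈ %.1f leading zero hex digits\n", d) // ≈ 5.0
}
```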
Proof of work is not the Holy Grail
While proof of work is a good solution, it's not a panacea. As with any system, there are some drawbacks:
- End users would need to wait a little longer to start a session.
- Proof of work requires CPU clock cycles that could’ve been used elsewhere.
- As mentioned above: nothing here stops a malicious actor from using a network of clients to attack the server, though it does make it much harder. Taking the number from before (about 8,333 hours of CPU time), it would take a network of 347 threads working for 24 hours to render the API inoperable.
Proof of work is not a solution to a DDoS or related attack. Nothing stops anyone from reaching the server, or any particular endpoint. However, proof of work is a really simple way to limit access to resources or expensive tasks on an unauthenticated endpoint. And since it's much cheaper to verify hashes than to scale up an expensive endpoint, it can make resource management easier.
Check out the implementation in the Pogify API GitHub repo, or the packages that we used to implement proof of work: