The naughtiness score

A simple algorithm to prevent jerks from ruining rawgithub.com for everyone

Ryan Grove
Mar 22, 2014

Update: RawGit (formerly rawgithub) no longer behaves quite the same way as this post describes, but the core concept of the naughtiness score is still used to throttle and blacklist excessive traffic.

Last week, the developer of a popular Chrome extension released an update that requested a JavaScript file from rawgithub.com every time a user of the extension visited any web page in Chrome.

This kind of sucked for rawgithub.

I spent most of my Sunday working to mitigate the ensuing flood of HTTP requests. The problem wasn’t the load (it turns out Node and Nginx were more than up to the task of handling the flood) but the sheer bandwidth consumed by all those incoming requests.

At the time, rawgithub was hosted on Amazon EC2. In a typical month it cost me a paltry $30 in AWS fees, roughly half of that being bandwidth costs. This flood was well on the way to ballooning those costs to over $1,000 a month, which was an unhappy prospect for my bank account.

After trying several things, including temporarily null routing rawgithub.com’s DNS (suboptimal, since it meant discarding legitimate traffic) and moving the server to DigitalOcean (which has more generous bandwidth pricing than EC2), the most effective solution ended up being to respond to the abusive requests with a simple JavaScript file:

alert('Stop it.');while(1){}

For users of the abusive extension, this made Chrome completely unusable by freezing it on startup and on any attempt to load a page. Those users either uninstalled the extension or stopped using Chrome altogether, instantly reducing the flood to manageable levels.

Turns out being really annoying is the best way to prevent abuse of rawgithub.com

As an added bonus, those angry users also complained loudly to the extension’s author and began leaving one-star reviews of the extension. The author quickly came to his senses and released an update.

The flood abated and things went back to normal, but I wasn’t happy with the prospect of having to fight a fire every time some doofus decided to use rawgithub irresponsibly. Clearly, the process of manually blacklisting abusers wasn’t going to scale.

Thus, the naughtiness score was born.

Starting today, rawgithub.com tracks detailed metrics on every file requested and every referrer that sends these requests. Requests are harvested in real time from Nginx’s upstream log file and metrics are tallied in an in-memory LRU cache in the rawgithub Node app. Infrequent requests and requests that haven’t been seen for 30 minutes fall out of the cache and their metrics are reset.
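To make the caching behavior concrete, here is a minimal sketch of an in-memory LRU cache whose entries expire after 30 minutes of silence. It is not rawgithub's actual code: the `MetricsCache` class, its field names, and the default capacity of 1000 entries are all assumptions made up for illustration.

```javascript
// Hypothetical sketch: class name, field names, and capacity are assumptions.
const THIRTY_MINUTES_MS = 30 * 60 * 1000;

class MetricsCache {
  constructor(maxEntries = 1000, maxAgeMs = THIRTY_MINUTES_MS) {
    this.maxEntries = maxEntries;
    this.maxAgeMs = maxAgeMs;
    // A Map iterates in insertion order, so the first key is the
    // least recently used one as long as we re-insert on every hit.
    this.entries = new Map();
  }

  // Record one request for `key` (a file path or referrer) at time `now` (ms).
  record(key, kilobytes, now = Date.now()) {
    let entry = this.entries.get(key);

    // A key that has not been seen for maxAgeMs starts over from zero.
    if (entry && now - entry.lastSeen >= this.maxAgeMs) {
      entry = undefined;
    }
    if (!entry) {
      entry = { totalRequests: 0, totalKilobytes: 0, firstSeen: now, lastSeen: now };
    }

    entry.totalRequests += 1;
    entry.totalKilobytes += kilobytes;
    entry.lastSeen = now;

    // Delete and re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);

    // Over capacity: evict the least recently used key.
    if (this.entries.size > this.maxEntries) {
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
    return entry;
  }
}
```

Expiry is checked lazily here, when a key is next seen, which keeps the sketch free of timers. A real implementation might sweep stale entries periodically instead.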

A naughtiness score is calculated for each request using a simple formula:

totalRequests * requestsPerSecond * totalKilobytes * multiplier

The multiplier, currently 0.0000001, is calibrated such that the vast majority of requests have a score under 0.5. These requests are considered legitimate traffic, and we like them.

This formula is designed to be very generous toward small files requested infrequently, and gets less forgiving as file size and request frequency increase. In other words, large files are more likely to get penalized than small files and frequently requested files are more likely to get penalized than infrequently requested files, but we try pretty hard to avoid penalizing a request unless it’s abusive on several levels.
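As a rough illustration of how the formula treats different traffic, the sketch below computes the score for two made-up requests. The multiplier is the one quoted above; the `naughtinessScore` function name and the traffic figures are hypothetical, not measurements from rawgithub.

```javascript
const MULTIPLIER = 0.0000001; // value quoted in this post

// Hypothetical helper that applies the formula to one request's metrics.
function naughtinessScore({ totalRequests, requestsPerSecond, totalKilobytes }) {
  return totalRequests * requestsPerSecond * totalKilobytes * MULTIPLIER;
}

// A small file requested now and then (made-up numbers): about 0.0005.
const quiet = naughtinessScore({
  totalRequests: 100,
  requestsPerSecond: 0.1,
  totalKilobytes: 500,
});

// A flood (made-up numbers): about 500, far past every threshold.
const flood = naughtinessScore({
  totalRequests: 10000,
  requestsPerSecond: 10,
  totalKilobytes: 50000,
});

console.log(quiet, flood);
```

Because the three metrics are multiplied rather than added, a request has to be large on several of them at once before its score climbs, which matches the intent of only penalizing traffic that is abusive on several levels.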

When a request’s naughtiness score reaches 0.5, responses are throttled and a RawGitHub-Message response header is added containing a warning. The hope is that a developer investigating why their site is loading slowly will see this warning and back off a bit.

If a request’s naughtiness score reaches 1.0, it is blacklisted. Response bodies will contain a simple message indicating that the request has been blacklisted, and will be served with a Cache-Control header that instructs browsers to cache the response for one day.

If a request’s naughtiness score reaches 3.0, it is shitlisted, and will receive evil.css or evil.js in response (or a simple message if the request wasn’t for a CSS or JS file). The response will be served with a Cache-Control header that instructs browsers to cache it for six months.
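The three thresholds can be sketched as a simple lookup from score to mitigation. The thresholds come from the paragraphs above; the `mitigationFor` function, the action labels, the exact `max-age` header format, and treating six months as 180 days are assumptions for illustration only.

```javascript
const ONE_DAY_SECONDS = 24 * 60 * 60;
// Assumption: "six months" approximated as 180 days.
const SIX_MONTHS_SECONDS = 180 * ONE_DAY_SECONDS;

// Hypothetical helper: maps a naughtiness score to an escalation tier.
// Checks run from the harshest tier down so the highest match wins.
function mitigationFor(score) {
  if (score >= 3.0) {
    return { action: 'shitlist', cacheControl: `max-age=${SIX_MONTHS_SECONDS}` };
  }
  if (score >= 1.0) {
    return { action: 'blacklist', cacheControl: `max-age=${ONE_DAY_SECONDS}` };
  }
  if (score >= 0.5) {
    // Throttled responses also carry the RawGitHub-Message warning header.
    return { action: 'throttle', cacheControl: null };
  }
  return { action: 'allow', cacheControl: null };
}
```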

Each of these escalating mitigation methods attempts to give rawgithub users ample warning and time to back off before any real damage is done. Aggressive cache headers are used to reduce bandwidth usage by encouraging browsers to stop sending requests. Annoyance is used as a last resort to get someone’s attention when all else has failed.

Since naughtiness is determined in real time, all it takes to de-escalate a request’s naughtiness is for that request to become less frequent. As fewer requests come in, the naughtiness score will drop. If no requests are received for 30 minutes or more, the naughtiness score will be reset to 0.

If you suspect your own requests are being throttled or blacklisted, take a look at the response headers or the live stats on rawgithub.com.

My hope is that with less manual intervention required to prevent abuse, rawgithub will remain reliable and available for legitimate use and economically viable for me to continue running. Jerks will be dealt with swiftly and automatically, without ruining things for everyone else.

Cross your fingers!
