
How I Built Emojitracker

Adventures in Unicode, Real-time Streaming, and Media Culture

Emojitracker was one of those projects that was supposed to be a quick weekend hack but turned into an all-consuming project that ate up my nights for months. Since its launch in early July, Emojitracker has processed over 1.8 billion tweets, and has been mentioned in approximately a gajillion online publications.

Prologue: Why Emoji?

These fingers wrote a lot of emoji code, they earned it.

Background Understanding: Emoji and Unicode

The history of Emoji has been written about in many places, so I’m going to keep it brief here and concentrate more on the technical aspects.

For some background music, watch Katy Perry demonstrate why she should have been a primary delegate on the Unicode Consortium Subcommittee on Emoji at http://youtu.be/e9SeJIgWRPk.

Now, you may be thinking: “Wait, standards are good, right? And why do you say ‘mostly’ standardized, that sounds suspicious…”

Of course, you’d be correct in your suspicions. Standardization is almost never that simple. For example, take flags. When the time came to standardize Emoji codepoints, everyone wanted their country’s flag added to the original 10 in the SoftBank/DoCoMo emoji sets. This had the potential to get messy fast, so instead what we ended up with were 26 diplomatically-safe “Regional indicator symbols” set aside in the Unicode standard. This avoided polluting the standard with potentially hundreds of codepoints that could become quickly outdated with the evolving geopolitical climate, while preserving Canada’s need to assert their flag’s importance to the Emoji standardization process:

https://twitter.com/withloveclaudia/statuses/351744535291887616

This of course makes the life of someone writing Emoji-handling code more difficult, as pretty much all the boilerplate you’ll find out there assumes a single Unicode code point per character glyph (since after all, this was the problem that Unicode was supposed to solve to begin with).

For example, say you want to parse and decode an emoji character from a UTF-8 string to identify its unified codepoint identifier. Conventional wisdom would be that this is a simple operation, and you’ll find lots of sample code that looks like this:

# return unified codepoint for a character, in hexadecimal
def char_to_unified(c)
  c.unpack("U*").first.to_s(16)
end
Figure 1: Not the American Flag.
# return unified codepoint for a character, in hexadecimal.
#  - account for multibyte characters, represent with dash.
#  - pad values to uniform length.
def char_to_unified(c)
  c.codepoints.to_a.map { |i| i.to_s(16).rjust(4, '0') }.join('-')
end
Figure 2: The land of the free, and the home of the brave.
>> EmojiData.all.select(&:doublebyte?).map(&:short_name)
=> ["hash", "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "cn", "de", "es", "fr", "gb", "it", "jp", "kr", "ru", "us"]

Emojitracker Backend Architecture

Here’s the overall architecture for Emojitracker in a nutshell: a feeder server receives data from the Twitter Streaming API, which it processes and collates. It stores that data in Redis, and also publishes a realtime stream of activity via Redis pubsub. A number of web streamer servers then subscribe to those pubsub streams, handle client connections, and multiplex subsets of that data out to clients via SSE streaming.


Feeding the Machine: Riding the Twitter Streaming API

If you’re doing anything even remotely high volume with Twitter, you need to be using the Streaming APIs instead of polling. The Streaming APIs allow you to create a set of criteria to monitor, and then Twitter handles the work of pushing updates to you whenever they occur over a single long-lived socket connection.
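
For illustration, here’s a minimal sketch of what consuming the Streaming API might look like in Ruby, assuming the tweetstream gem (a popular choice at the time; the actual Emojitracker feeder may be wired up differently, and handle_tweet is just a stand-in for the processing described below):

require 'tweetstream'
require 'emoji_data'

TweetStream.configure do |config|
  config.consumer_key       = ENV['TWITTER_CONSUMER_KEY']
  config.consumer_secret    = ENV['TWITTER_CONSUMER_SECRET']
  config.oauth_token        = ENV['TWITTER_ACCESS_TOKEN']
  config.oauth_token_secret = ENV['TWITTER_ACCESS_TOKEN_SECRET']
  config.auth_method        = :oauth
end

# track every emoji character; Twitter pushes matching tweets to us
# over a single long-lived connection
TweetStream::Client.new.track(*EmojiData.chars) do |status|
  handle_tweet(status) # hypothetical handler, defined elsewhere
end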

The JSON blob that the Twitter API sends for each tweet is pretty massive, and at a high rate this will get bandwidth intensive. The feeder process for Emojitracker is typically receiving a full 1MB/second of JSON data from Twitter’s servers.

Since in our case we’re going to be re-broadcasting this out at an extremely high rate to all the streaming servers, we want to trim this down to conserve bandwidth. Thus we create a new JSON blob from a hash containing just the bare minimum to construct a tweet: tweet ID, text, and author info (permalink URLs are predictable and can be recreated with this info). This reduces the size by 10-20x.
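
As a rough sketch of the idea (the tiny_json helper and field names here are illustrative, not necessarily the exact ones Emojitracker uses):

require 'json'

# build a minimal JSON blob from a full tweet object: just the ID, text,
# and author info needed to reconstruct a permalink later
def tiny_json(status)
  {
    'id'          => status.id.to_s,
    'text'        => status.text,
    'screen_name' => status.user.screen_name,
    'name'        => status.user.name
  }.to_json
end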

Data Storage: Redis sorted sets, FIFO, and Pubsub streams

Redis is an obvious data-storage layer for rapidly-changing and streaming data. It’s super fast, has a number of data structures that are ideally suited for this sort of application, and additionally its built-in support for pubsub streaming enables some really impressive ways of shuffling data around.

  1. Tweet streams. 842 different active streams for these (one for each emoji symbol). This sounds more complex than it is: in Redis, streams are lightweight and you don’t have to do any work to set them up, just publish to a unique name. For any matching Tweet, we just publish our “small-ified” JSON blob to the equivalent ID stream. For example, a tweet matching both the dolphin and pistol emoji symbols would get published to the stream.tweet_updates.1f42c and stream.tweet_updates.1f52b streams.
Illustration: crossing 842 pubsub streams with a single PSUBSCRIBE statement in Redis.
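
In redis-rb, a single pattern subscription really is all it takes. A hedged sketch of what a streamer’s subscriber loop might look like:

require 'redis'

redis = Redis.new

# one PSUBSCRIBE covers all 842 per-emoji tweet streams
redis.psubscribe('stream.tweet_updates.*') do |on|
  on.pmessage do |_pattern, channel, message|
    emoji_id = channel.split('.').last
    # fan the tweet JSON out to any SSE clients following this emoji_id
  end
end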
A totally ridiculous illustration of a FIFO queue I found on the web. I decided it required some emojification.
matches = EmojiData.chars.select { |c| status.text.include? c }
matches.each do |matched_emoji_char|
  # get the unified codepoint ID for the matched emoji char
  cp = EmojiData.char_to_unified(matched_emoji_char)
  REDIS.pipelined do
    # increment the score in a sorted set
    REDIS.zincrby 'emojitrack_score', 1, cp
    # stream the fact that the score was updated
    REDIS.publish 'stream.score_updates', cp
    # for each emoji char, store most recent 10 tweets in a list
    REDIS.lpush "emojitrack_tweets_#{cp}", status_json
    REDIS.ltrim "emojitrack_tweets_#{cp}", 0, 9
    # also stream all tweet updates to named streams by char
    REDIS.publish "stream.tweet_updates.#{cp}", status_json
  end
end

Pushing to Web Clients: Utilizing SSE Streams

  1. The tweet detail updates queue is more complex. We use a connection wrapper that maintains some state information for each client connected to the stream. All web clients receiving tweet detail updates from the streaming server are actually in the same connection pool, but when they connect they pass along as a parameter the ID of the emoji character they want updates on, which gets added to their wrapper object as tagged metadata. We later use this to determine which updates they will receive.
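
A minimal sketch of that kind of wrapper (class and field names are illustrative, not the actual Emojitracker code):

class StreamConnection
  attr_reader :out, :emoji_id, :opened_at

  def initialize(out, emoji_id)
    @out       = out       # the open SSE response stream
    @emoji_id  = emoji_id  # which emoji codepoint this client asked to follow
    @opened_at = Time.now
  end
end

# fanning a tweet update for codepoint cp out to only the interested clients:
# connection_pool.select { |c| c.emoji_id == cp }
#                .each   { |c| c.out << "data:#{status_json}\n\n" }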

Performance Optimizations for High Frequency SSE Streams

SSE is great, but when you start to approach hundreds of events per second, raw bandwidth becomes a concern. For Emojitracker, a number of performance enhancements were necessary to reduce the bandwidth of the stream updates so that people without super-fat pipes could play along.

data: 2665  \n\n
data: 1F44C \n\n
data: 1F44F \n\n
data: 1F602 \n\n
data: 2665 \n\n
data: 1F60B \n\n
data: 1F602 \n\n
data:{"2665":2,"1F44C":1,"1F44F":1,"1F602":2,"1F60B":1}\n\n

Gotcha: Many “cloud” environments don’t properly support this (and a workaround)

The crux: after building all this in a development environment, I realized it wasn’t quite working correctly in production when doing load testing. The stream queue was filling up, getting bigger and bigger, never reducing in size. After much spelunking, it turned out that the routing layer used by many cloud server providers prevents the web server from properly seeing a stream disconnection on its end. In an environment where we are manually handling a connection pool, this is obviously no good.
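
One way to cope (a sketch assuming a manually managed pool like the one described above, not necessarily the exact production fix) is to periodically write an SSE comment to every pooled connection and evict the ones whose sockets error out:

# reap dead connections the router never told us about: writes to a
# closed socket raise, which lets us drop that connection from the pool
EM::PeriodicTimer.new(10) do
  connection_pool.reject! do |conn|
    begin
      conn.out << ":keepalive\n\n"
      false # write succeeded, keep the connection
    rescue StandardError
      true  # write failed, client is gone
    end
  end
end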


Not crossing the streams: The admin interface

When attempting to debug things, I quickly realized that tailing a traditional log format is a really terrible way to attempt to understand what’s going on with long-lived streams. I hacked up a quick web interface showing me the essential information for the connection pools on a given web server: how many open connections and to whom, what information they were streaming, and how long those connections had been open:

Part of the stream admin interface for one of the web dynos, on launch day.
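
Something as simple as a JSON endpoint over the connection pool gets you most of the way there. A hypothetical Sinatra-style sketch (connection_pool and the field names are illustrative):

require 'sinatra'
require 'json'

get '/admin/connections.json' do
  content_type :json
  connection_pool.map { |conn|
    {
      emoji_id:    conn.emoji_id,
      opened_at:   conn.opened_at,
      age_seconds: (Time.now - conn.opened_at).to_i
    }
  }.to_json
end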

Frontend Architecture

For the most part, there is nothing that surprising here. Consuming an SSE stream is a fairly simple endeavor in JavaScript, with widespread browser support. However, there were a number of “gotchas” with secondary functionality that ended up being somewhat complex.

Rendering Emoji Glyphs

Spoiler alert: sadly, most web browsers don’t support emoji display natively (Google, get on this! Forget Google+, we want emoji in Chrome!). Thankfully, you can utilize Cal Henderson’s js-emoji project to sniff the browser and either serve native emoji unicode or substitute in images via JS for the other browsers.

Image via emojinal art gallery.
Native-rendering of an English tweet containing Emoji in Safari 7.0 on MacOSX 10.9.

To get around this problem, I stumbled upon the technique of creating a Unicode-range restricted font-family in CSS, which lets us instruct the browser to only use the AppleColorEmoji font for those particular 842 emoji characters.

Listing out all 842 codepoints would work, but would result in a bulky and inefficient CSS file. Unfortunately, a simple unicode-range won’t work either, as Emoji symbols are strewn haphazardly across multiple locations in the Unicode spec. Thus, to generate the appropriate ranges in an efficient manner for emojistatic, we turn again to our EmojiData library, using it to find all sequential blocks of Emoji characters greater than 3 in size and compressing them to a range. Go here to examine the relevant code (it’s a bit large to paste into Medium), or just check out the results:

>> @emoji_unicode_range = Emojistatic.generate_css_map
=> "U+00A9,U+00AE,U+203C,U+2049,U+2122,U+2139,U+2194-2199,U+21A9-21AA,U+231A-231B,U+23E9-23EC,U+23F0,U+23F3,U+24C2,U+25AA-25AB,U+25B6,U+25C0,U+25FB-25FE,U+2600-2601,U+260E,U+2611,U+2614-2615,U+261D,U+263A,U+2648-2653,U+2660,U+2663,U+2665-2666,U+2668,U+267B,U+267F,U+2693,U+26A0-26A1,U+26AA-26AB,U+26BD-26BE,U+26C4-26C5,U+26CE,U+26D4,U+26EA,U+26F2-26F3,U+26F5,U+26FA,U+26FD,U+2702,U+2705,U+2708-270C,U+270F,U+2712,U+2714,U+2716,U+2728,U+2733-2734,U+2744,U+2747,U+274C,U+274E,U+2753-2755,U+2757,U+2764,U+2795-2797,U+27A1,U+27B0,U+27BF,U+2934-2935,U+2B05-2B07,U+2B1B-2B1C,U+2B50,U+2B55,U+3030,U+303D,U+3297,U+3299,U+1F004,U+1F0CF,U+1F170-1F171,U+1F17E-1F17F,U+1F18E,U+1F191-1F19A,U+1F201-1F202,U+1F21A,U+1F22F,U+1F232-1F23A,U+1F250-1F251,U+1F300-1F31F,U+1F330-1F335,U+1F337-1F37C,U+1F380-1F393,U+1F3A0-1F3C4,U+1F3C6-1F3CA,U+1F3E0-1F3F0,U+1F400-1F43E,U+1F440,U+1F442-1F4F7,U+1F4F9-1F4FC,U+1F500-1F507,U+1F509-1F53D,U+1F550-1F567,U+1F5FB-1F640,U+1F645-1F64F,U+1F680-1F68A,U+1F68C-1F6C5"
@font-face {
  font-family: 'AppleColorEmojiRestricted';
  src: local('AppleColorEmoji');
  unicode-range: <%= @emoji_unicode_range %>;
}
.emojifont-restricted {
  font-family: AppleColorEmojiRestricted, Helvetica;
}
Same example, but custom font range saves the day. (try demo in your own browser: http://codepen.io/mroth/pen/cpLyK)

Frontend Performance

Image from http://ftw.usatoday.com/2013/09/emoji-sports-art-is-the-best-kind-of-art
  • Different methods of force-triggering the transition animation to display: replacing an entire element vs. forcing a reflow vs. using a zero-length timeout.
  • Maintaining an in-memory cache of DOM elements as a hash, avoiding repeated selection.
Menu bar during a benchmark performance test, showing the current animation method and FPS.
Benchmark results as JSON blob.

Deploying and Scaling

The first “soft launch” for Emojitracker was on the Fourth of July, 2013. I had been working on emojitracker for months, getting it to work had consumed far more effort than I had ever anticipated, and I just wanted to be done with it. So I bailed on a party in Red Hook, cabbed it back up to North Brooklyn, and removed the authentication layer keeping it hidden from the public pretty much exactly as the fireworks displays began.

https://twitter.com/mroth/status/352975279897067520

One crazy day

Fast forward about a month. I had just finished up getting a fairly large forearm tattoo the previous night, and I was trying to avoid using my wrist much to aid in the healing (e.g. ideally, avoiding the computer).

https://twitter.com/emojitracker/status/360807967232229376

Even better, you pay per-minute for web dyno use, which is really helpful for someone on a small budget. I was able to have a massive workforce of 16 web servers during the absolute peaks of launch craziness, but drop it down when demand was lower, saving $$$.

By carefully monitoring and adjusting the amount of web dynos to meet demand, I was able to serve tens of millions of realtime streams in under 24hrs while spending less money than I do on coffee in an average week.

Riding the Wave: Monitoring and Scaling

I primarily used two tools to monitor and scale emojitracker during the initial wave of crazy.

log2viz
Graphite charts during launch day.
require 'socket'

# configure logging to graphite in production
def graphite_log(metric, count)
  if is_production?
    sock = UDPSocket.new
    sock.send @hostedgraphite_apikey + ".#{metric} #{count}\n", 0, "carbon.hostedgraphite.com", 2003
  end
end

# same as above but include heroku dyno hostname
def graphite_dyno_log(metric, count)
  dyno = ENV['DYNO'] || 'unknown-host'
  metric_name = "#{dyno}.#{metric}"
  graphite_log metric_name, count
end

Things I’d still like to do

There are a few obvious things I’d still love to add to Emojitracker.

Reception and conclusions

Fan Art, via @UnbornOrochi

ENJOYED THIS ARTICLE?: You might also enjoy the followup post enumerating all the changes involved in scaling over the next 1.5 years here: “How I Kept Building Emojitracker”

Epilogue: Emoji Art Show!

The Eyebeam Art and Technology Center gallery.
https://twitter.com/kittehmien/status/405528153151397888

