How To Parse Hashtags and Mentions with a Tokenizer in NodeJS

Sonny Trujillo Jr.
Jul 23, 2018 · 6 min read

In my final semester of grad school, I took a course on Compiler Construction. We each built a compiler from scratch for a programming language our Professor defined. This turned out to be the most programming intensive course I took throughout my entire academic work. It was also the most rewarding and I left with a new appreciation for compilers and programming languages.

One of the concepts we discussed in detail was tokenization. During the compilation of a program, your code is parsed and then tokenized. This converts the textual representation of your code into simple blocks with information such as the file it came from, the line number, the starting and ending byte, and anything else we might need. These tokens are then used to build up your abstract syntax tree.

There seemed to be a million different ways to use tokenization but it wasn’t clear how could use it in my work. Until I had to read a response from the Twitter API and noticed how they identified elements in a tweet. Every tweet returned an entities object containing information on hashtags, mentions, links, and retweets in a Tweet. This included the location of these elements in the text and the objects they were referring too. On our side, we were using a handful of regular expressions to find and highlight these elements every time we displayed a post. This was slow and loading a bunch of posts often caused the main thread to block and users sometimes had a choppy user experience. If we use tokenization to parse hashtags, mentions, and links, then we only need to parse and cache this data once and wouldn’t need to rely on regular expressions.

I’ve created an NPM package for this called TokenizerJS out of this work. You can find the Github repo at https://github.com/strujillojr/TokenizerJS and the package at https://www.npmjs.com/package/@sonnytrujillojr/tokenizerjs.

A Bit of Background

We used regular expressions in a few different ways. First, when a user uploaded a post, we used a regex to find any mentions. If those mentions were referencing valid users, we created a relationship between the new post and the mentioned user, and then alerted them.

On the front end (iOS, Android, and Web) we used a regex a to find and highlight all the hashtags, mentions, and links in a post. When a user taps on a hashtag, it navigates to the correct list of posts. When a user taps on the mention, it checks the list of mentioned users (returned in the API for that post) for that username and opens their profile.

This system worked and got the job done, but the overuse of regular expressions in our apps was was slowing down the main thread. As our application was growing, rendering hundreds of posts became really slow. Also, regular expressions are really powerful but often cryptic. Ones that work in Javascript won’t directly translate to Swift or Java.

What is a Tokenizer

How Does a Tokenizer Work

For example, imagine we built a tokenizer that only identifies Numbers. We will define a number as one or more characters of 0–9. Given the input Hello 123 and m3our tokenizer would return an array containing two tokens:

It obviously finds 123 but it also matches the 3 in m3 because even though m is not a digit, our definition states that a number is any digit between 0–9 and makes no restrictions on where that digit can appear. This definition wouldn’t work in many scenarios though, and we would need to extend it to only match numbers surrounded by white spaces and maybe allow for commas or decimals.

You have the freedom to define a token. The more tokens you choose to recognize the mightier your Tokenizer becomes.

Building a Tokenizer in Javascript

Token

The Token function will be the base that all other tokens will inherit from.

Hashtag Token

Mention Token

Parsing

We know that a hashtag begins with a # therefore in our loop we can check if the ith character is a # and attempt to parse a hashtag. Similarly, if we find an @ we can attempt to parse a mention token.

Next, we create a static method in Hashtag and Mention token that will attempt to parse that token type. It keeps track of the starting and ending position, checks each character to verify it is a legal part of the token and returns a new valid token if successfully parsed. If it fails to parse that token, it returns the position where the tokenizer should continue from. Here’s the code.

Finally, we update our Tokenizer to use these methods.

Why Tokenize Instead of …

You could also split your string by whitespaces and walk through each word to see if they’re the elements you’re interested in. This is basically the same as tokenization, except you’re using words instead of characters, which is fine as long as you’re sure your elements are separated by whitespace.

Overview

Next Steps

Sonny Trujillo Jr.

Written by

Hi, my name is Sonny and I am a software engineer in New York City. I am currently working at uSTADIUM as the Lead Engineer and CTO.