How To Parse Hashtags and Mentions with a Tokenizer in NodeJS
In my final semester of grad school, I took a course on Compiler Construction. We each built a compiler from scratch for a programming language our professor defined. It turned out to be the most programming-intensive course of my entire academic career. It was also the most rewarding, and I left with a new appreciation for compilers and programming languages.
One of the concepts we discussed in detail was tokenization. During the compilation of a program, your code is tokenized and then parsed. Tokenization converts the textual representation of your code into simple blocks carrying information such as the file it came from, the line number, the starting and ending byte, and anything else we might need. These tokens are then used to build up your abstract syntax tree.
There seemed to be a million different ways to use tokenization, but it wasn't clear how I could use it in my own work. That changed when I had to read a response from the Twitter API and noticed how they identified elements in a tweet. Every tweet returned an entities object containing information on the hashtags, mentions, links, and retweets in that tweet, including the location of these elements in the text and the objects they were referring to. On our side, we were using a handful of regular expressions to find and highlight these elements every time we displayed a post. This was slow: loading a bunch of posts often blocked the main thread, and users sometimes had a choppy experience. If we used tokenization to parse hashtags, mentions, and links, we would only need to parse and cache this data once and wouldn't need to rely on regular expressions.
Out of this work, I've created an NPM package called TokenizerJS. You can find the GitHub repo at https://github.com/strujillojr/TokenizerJS and the package at https://www.npmjs.com/package/@sonnytrujillojr/tokenizerjs.
A Bit of Background
As with most social networks, we needed a way for users to mention other users and add hashtags to posts and comments. The most obvious solution we came up with was to use a regular expression to find hashtags and mentions in text.
We used regular expressions in a few different ways. First, when a user uploaded a post, we used a regex to find any mentions. If those mentions were referencing valid users, we created a relationship between the new post and the mentioned user, and then alerted them.
On the front end (iOS, Android, and Web) we used a regex to find and highlight all the hashtags, mentions, and links in a post. When a user taps on a hashtag, it navigates to the correct list of posts. When a user taps on a mention, it checks the list of mentioned users (returned in the API for that post) for that username and opens their profile.
What is a Tokenizer
A Tokenizer takes a string input and returns an array of Tokens. Tokens can be defined as Numbers, Identifiers, Usernames, Hashtags, or anything you want. Wikipedia defines tokenization as "the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning)".
How Does a Tokenizer Work
A Tokenizer starts by peeking at the first character in the input and tries to predict which token to parse. For example, if the current character is a number, it may try to parse a Number token; if it's a #, it may try to parse a Hashtag token. If there is any chance the character is part of a token, it will attempt to parse it as such. If it turns out not to be part of that token, it checks whether it could be a different type of token, and if so, attempts to parse that one. If it fails to parse any token, it moves on to the next character in the string and starts the process all over again.
For example, imagine we built a tokenizer that only identifies Numbers. We will define a number as one or more characters of 0–9. Given the input "Hello 123 and m3", our tokenizer would return an array containing two tokens.
It obviously finds 123, but it also matches the 3 in m3 because, even though m is not a digit, our definition states that a number is any digit between 0–9 and places no restriction on where that digit can appear. This definition wouldn't work in many scenarios, though; we would need to extend it to only match numbers surrounded by whitespace, and maybe allow for commas or decimals.
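Under that simple digit-run definition, a minimal number-only tokenizer could be sketched like this (a plain object stands in for a full Token here; the function name is illustrative):

```javascript
// A minimal number-only tokenizer: a number is one or more consecutive
// digits, with no restriction on what surrounds them.
function tokenizeNumbers(input) {
  const tokens = [];
  let i = 0;
  while (i < input.length) {
    if (input[i] >= '0' && input[i] <= '9') {
      const start = i;
      // Consume every consecutive digit to build one Number token.
      while (i < input.length && input[i] >= '0' && input[i] <= '9') {
        i += 1;
      }
      tokens.push({ start, end: i - 1, value: input.slice(start, i) });
    } else {
      i += 1;
    }
  }
  return tokens;
}

console.log(tokenizeNumbers('Hello 123 and m3'));
// two tokens: '123' at indices 6–8, and '3' at index 15
```

Note that the 3 inside m3 is matched on its own, exactly as described above, because the definition never looks at the character before a digit.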
You have the freedom to define your own tokens, and the more tokens you choose to recognize, the mightier your Tokenizer becomes.
In this post, we want to build a tokenizer that will identify hashtags and mentions. Let's start by creating the token functions for each type; I'll discuss the parsing process in the next section.
A token contains three elements: a starting position, an ending position, and a value. When instantiating a new Token, we need to provide all of this information. The Token function will be the base that all other tokens inherit from.
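A minimal sketch of that base, written as an ES6 class (the published package may structure this differently):

```javascript
// Base Token: every token records where it starts, where it ends,
// and the text it matched.
class Token {
  constructor(start, end, value) {
    this.start = start; // index of the token's first character
    this.end = end;     // index of the token's last character
    this.value = value; // the matched substring, e.g. '#nodejs'
  }
}

const t = new Token(0, 6, '#nodejs');
console.log(t.value); // '#nodejs'
```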
A hashtag begins with "#" and is followed by 1 or more alphanumeric characters. We start by defining a Hashtag function that inherits from the Token function.
A mention begins with the @ symbol and is followed by a minimum of three alphanumeric or underscore characters.
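One way to sketch both token types, with the base class repeated so the snippet stands on its own (the actual package may differ):

```javascript
class Token {
  constructor(start, end, value) {
    this.start = start;
    this.end = end;
    this.value = value;
  }
}

// A hashtag is '#' followed by one or more alphanumeric characters.
class Hashtag extends Token {}

// A mention is '@' followed by three or more alphanumeric
// or underscore characters.
class Mention extends Token {}

const h = new Hashtag(0, 2, '#js');
console.log(h instanceof Token); // true
```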
We set up our parsing function to accept a string as input and move through each character, attempting to parse a token. It starts at the 0th character and works its way up.
We know that a hashtag begins with a #, so in our loop we can check whether the ith character is a # and attempt to parse a hashtag. Similarly, if we find an @, we can attempt to parse a mention token.
Next, we create a static method on the Hashtag and Mention tokens that attempts to parse that token type. It keeps track of the starting and ending positions, checks each character to verify it is a legal part of the token, and returns a new valid token if successfully parsed. If it fails to parse that token, it returns the position where the tokenizer should continue from. Here's the code.
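A sketch of what those static parse methods could look like, with the Token base repeated so the snippet runs standalone (the published package may differ):

```javascript
class Token {
  constructor(start, end, value) {
    this.start = start;
    this.end = end;
    this.value = value;
  }
}

const ALPHANUMERIC = /[A-Za-z0-9]/;
const MENTION_CHAR = /[A-Za-z0-9_]/;

class Hashtag extends Token {
  // Attempt to parse a hashtag at `start` (the index of the '#').
  // Returns a Hashtag token on success; on failure, returns the index
  // where the tokenizer should resume scanning.
  static parse(input, start) {
    let i = start + 1;
    while (i < input.length && ALPHANUMERIC.test(input[i])) i += 1;
    if (i > start + 1) {
      return new Hashtag(start, i - 1, input.slice(start, i));
    }
    return i; // no alphanumeric characters after '#': not a hashtag
  }
}

class Mention extends Token {
  // Same contract, but requires at least three name characters after '@'.
  static parse(input, start) {
    let i = start + 1;
    while (i < input.length && MENTION_CHAR.test(input[i])) i += 1;
    if (i - (start + 1) >= 3) {
      return new Mention(start, i - 1, input.slice(start, i));
    }
    return i;
  }
}

console.log(Hashtag.parse('#nodejs rocks', 0)); // a Hashtag covering indices 0–6
```

Returning the resume position on failure guarantees the main loop always makes progress, since the parse attempt has already examined those characters.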
Finally, we update our Tokenizer to use these methods.
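Putting it together, one possible shape for the final tokenizer (the classes are repeated from above so the snippet runs standalone; names are illustrative and the published package may differ):

```javascript
class Token {
  constructor(start, end, value) {
    this.start = start;
    this.end = end;
    this.value = value;
  }
}

const ALPHANUMERIC = /[A-Za-z0-9]/;
const MENTION_CHAR = /[A-Za-z0-9_]/;

class Hashtag extends Token {
  static parse(input, start) {
    let i = start + 1;
    while (i < input.length && ALPHANUMERIC.test(input[i])) i += 1;
    return i > start + 1 ? new Hashtag(start, i - 1, input.slice(start, i)) : i;
  }
}

class Mention extends Token {
  static parse(input, start) {
    let i = start + 1;
    while (i < input.length && MENTION_CHAR.test(input[i])) i += 1;
    return i - (start + 1) >= 3 ? new Mention(start, i - 1, input.slice(start, i)) : i;
  }
}

function tokenize(input) {
  const tokens = [];
  let i = 0;
  while (i < input.length) {
    let result = null;
    if (input[i] === '#') result = Hashtag.parse(input, i);
    else if (input[i] === '@') result = Mention.parse(input, i);

    if (result instanceof Token) {
      tokens.push(result);  // parsed a token: jump past it
      i = result.end + 1;
    } else if (typeof result === 'number') {
      i = result;           // failed parse: resume where it stopped
    } else {
      i += 1;               // ordinary character: move on
    }
  }
  return tokens;
}

const tokens = tokenize('Loving #nodejs, thanks @sonny!');
console.log(tokens.map((t) => t.value)); // [ '#nodejs', '@sonny' ]
```

Because each token carries its start and end positions, the front end can highlight and link these ranges directly instead of re-running regexes on every render.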
Why Tokenize Instead of …
There are many alternatives to tokenization. As I've mentioned, the most popular is regular expressions, but the way we used them didn't return information about where each element was found. We needed to highlight and make links out of these elements, so knowing their location was important. Our regexes only told us whether the elements existed in the text; we would still need to walk through and find them. That's fine if it's all you need, but tokenization is simply a different way of doing it.
You could also split your string on whitespace and walk through each word to see if it's one of the elements you're interested in. This is essentially tokenization at the word level instead of the character level, which works as long as you're sure your elements are separated by whitespace.
Included in this write-up is a simple way of tokenizing hashtags and mentions. You could easily extend it to tokenize links, phone numbers, emails, and much more.