# Calculating 1 + 1 in JavaScript — Part 3

Published in
12 min readMar 11, 2021

I’m a compiler enthusiast who has been learning how the V8 JavaScript Engine works. Of course, the best way to learn something is to write about it, so that’s why I’m sharing my experiences here. I hope this might be interesting to others too.

This is the third part of a multi-part series describing how the V8 JavaScript Engine calculates `1 + 1`. If you haven’t read the previous posts in this series, you might like to start with Part 1 (storing the source code string in the JavaScript heap), and Part 2 (checking if the byte codes are already cached). However, since this blog post is fairly independent from the previous two, you can likely understand it in isolation.

In this part of our story of how V8 calculates `1 + 1` , we’ll learn how the input characters are scanned into tokens, which are then used as input to JavaScript’s parser. This concept will be familiar to anyone who has read an introductory compiler text book.

# The Scanning Process

With our `1 + 1` example, the output from the scanner we’re expecting to see is the following sequence of tokens:

`Token::SMI (value 1)Token::ADDToken::SMI (value 1)`

Where `Token::SMI` is a special variant of `Token::NUMBER` representing small integer values, and `Token::ADD` unsurprisingly represents addition. Note also, the white space characters between `1`, `+`, and `1` have been ignored, since they don’t provide any further value.

Here’s the overall flow of V8, as it scans and parses the input stream:

The first step is for the `v8::internal::Utf16CharacterStream` class to read the individual characters from the JavaScript heap (as seen in Part 1, the string is stored as a `SeqOneByteString` object). Next, the `v8::internal:Scanner` class converts sequences of characters into tokens (of type `v8::internal::Token`). Finally, the `v8::internal::Parser` class (which we’ll examine in a later blog post) uses these tokens to validate the input stream and build an in-memory Abstract Syntax Tree (AST).

What’s important to understand is that all these activities happen in a streaming fashion. It all starts when the `Parser` requests the next token, which causes the `Scanner` to request the next character (or characters) from the `Utf16CharacterStream`, which in turn reads the input string from the JavaScript heap. As with any stream-based solution, the temporary storage space requirement is minimal, as tokens are not scanned until they’re actually required downstream.

Let’s look at this process in more detail.

# The v8::internal::Utf16CharacterStream Class

The `Utf16CharacterStream` class is responsible for reading Unicode characters from the input stream, then providing them one-by-one to the `Scanner` class. This seems like a trivial exercise, but as we’ll see there are interesting edge-cases to consider. First, the scanner might need to look ahead into the input stream before deciding on what the current token should be. Second, the scanner doesn’t know (or care) where the characters come from, or how they’re stored in memory.

## Utf16CharacterStream Methods

Let’s look at some of the `Utf16CharacterStream` methods, to understand how this class is used.

• `stream->Advance()` — This method returns the next Unicode character in the input, fully removing that character from the input stream. This is the standard behaviour you expect when reading characters in a sequence.
• `stream->Peek()` — This method returns the next Unicode character in the stream, but without actually consuming that character. Therefore, it’ll still be available when `Peek()` or `Advanced()` is next called. This is useful for looking ahead into the input, without committing to consume the character until you know for sure it’s part of the current token.
• `stream->AdvanceUntil(func)` — Continually read (aka “use up”) characters until the `func` function returns true. This is useful for consuming characters until a certain point is reached, such as the end of line.
• `stream->Back()` — Essentially the opposite of `Advance()`. Returns the character back to the input stream so it’s available for future `Peek()` or `Advance()` calls. This is useful when the scanner tried to read ahead, yet decided the next character wasn’t actually part of the current token.

There are several more methods available, but these are the most important. As we’ll see when we examine some of the methods in the `Scanner` class, the ability to read ahead, yet also push back characters is vital for correctly scanning input tokens.

## Utf16CharacterStream is Abstract

A second interesting discussion is how the characters are retrieved from memory. It turns out that `Utf16CharacterStream` is an abstract class, with a range of different implementations available, each focusing on a specific storage layout of the source string. In our case, the `1 + 1` string was stored on JavaScript’s heap using one byte to store each character. Other options include reading from two-byte strings, as well as from strings that are stored externally from the JavaScript heap.

Selection of the appropriate `Utf16CharacterStream` sub-class is performed within the `ParseProgram` method (see `src/parsing/parsing.cc)` . `ParseProgram` does many different things, but as far as scanning is concerned the most relevant line of code is:

`std::unique_ptr<Utf16CharacterStream> stream(    ScannerStream::For(isolate, source));`

Using our `1 + 1` example, the`ScannerStream::For` method examines the source string and determines it has type `SeqOneByteString` (one-byte string, stored in the JavaScript heap). The code then returns an instance of the `BufferedCharacterStream` class, which is the particular sub-class of `Utf16CharacterStream` capable of reading `SeqOneByteString` objects.

The most interesting part of the `BufferedCharacterStream` sub-class is the `ReadBlock()` method. This method is called upon by higher-level methods such as `Peek()` or `Advance()` to fetch the next block of characters from the input stream, however it may be stored.

Let’s now move forward in the scanner pipeline to learn how tokens are represented in V8.

# The v8::internal::Token Class

Similar to other compilers, the `v8::internal::Token` class provides an enumeration of all token values recognized by the scanner. The definition of these tokens (see `src/parsing/token.h`) is managed by a clever C++ macro (`TOKEN_LIST`) containing a list of all tokens, combined with a second macro (`T`) which extracts the name portion:

Here’s the definition of the token enumeration, using C++ macros (warning: it’s not very easy to read):

`#define T(name, string, precedence) name,     enum Value : uint8_t { TOKEN_LIST(T, T) NUM_TOKENS };#undef T`

And here is the definition of `TOKEN_LIST`, abbreviated from code in `src/parsing/token.h`.

`#define TOKEN_LIST(T, K) \    T(TEMPLATE_SPAN, nullptr, 0) \    T(TEMPLATE_TAIL, nullptr, 0) \    T(PERIOD, ".", 0) \    T(LBRACK, "[", 0) \    T(QUESTION_PERIOD, "?.", 0) \    T(LPAREN, "(", 0) \    T(RPAREN, ")", 0) \    T(RBRACK, "]", 0) \    ...`

When macros are expanded, this becomes a more readable enumeration.

`enum Value : uint8_t {   TEMPLATE_SPAN,   TEMPLATE_TAIL,   PERIOD,   LBRACK,   QUESTION_PERIOD,   LPAREN,   RPAREN,   ...   ADD,   ...   SMI,   ...   WHITESPACE,   UNINITIALIZED,   REGEXP_LITERAL,   NUM_TOKENS}`

As we’ll see later, token values are referenced in the `Scanner` and `Parser` classes using the syntax `Token::SMI` or `Token::ADD`.

# The v8::internal::Scanner Class

Let’s now dive into the internals of the `Scanner` class, responsible for reading characters from the input, and generating tokens for the output. We’ll see examples of scanning operators (such as `+`), as well as scanning numbers (such as `1`).

## Scanner Methods

First, here are some interesting methods from the `Scanner` class. They are somewhat similar to the methods on the `Utf16CharacterStream` class, but operate on whole tokens, rather than individual characters.

• `scanner->Next()` —Return the next token from the input stream, and advanced the input pointer. Naturally, this calls the `stream->Advance()` method to fetch multiple characters from the `Utf16CharacterStream`, but will only return a single token value. In our `1 + 1` example, each token is only one character long, but normally this isn’t the case.
• `scanner->peek()` — Peek ahead to see what the next token will be (beyond what `Next()` returns) without advancing the input. This is used by the parser to check the upcoming tokens to decide whether or not the current parser rule is matched by the input.
• `scanner->PeekAhead()`— Peek even further ahead, necessary for parsing some of JavaScript’s more complicated syntax.
• `scanner->location()` — Return the location of the current token. This provides the start and end character positions within the source string.
• `scanner->smi_value()` — Returns the Smi (small integer) value of the current token, if any. In our example, this will return the integer 1 for both of our `Token::SMI` tokens.

And as you might expect, there are many other `Scanner` methods, mostly focused on error handling, but also for fetching the token’s associated value. As a sneak peek, here’s a small section of the `Parser` code using these methods:

`...if (peek() == Token::PERIOD && PeekAhead() == Token::PRIVATE_NAME) {    Consume(Token::PERIOD);    Consume(Token::PRIVATE_NAME);    ...}...`

We’ll learn more about the `Parser` class in the next blog post, but this code snippet gives you a sense of how `Parser` calls upon `Scanner` to return the upcoming token values.

## The TokenDesc Structure

Although we’ve already discussed the token enumerated values, allowing us to write `Token::SMI` (for the number `1`) or `Token::ADD` (for the `+` symbol) that’s only part of the what’s required for the scanner to represent a token. In addition, the scanner cares about the token’s location, the literal characters, any possible error cases, and of course, the actual numeric value of the token.

To store all this extra information, the scanner uses a `TokenDesc` structure:

`struct TokenDesc {  Location location = {0, 0};  LiteralBuffer literal_chars;  LiteralBuffer raw_literal_chars;  Token::Value token = Token::UNINITIALIZED;  MessageTemplate invalid_template_escape_message =       MessageTemplate::kNone;  Location invalid_template_escape_location;  uint32_t smi_value_ = 0;  bool after_line_terminator = false;}`

The fields are:

• `location` — The numeric start and end positions within the source string. For example, our first `Token::SMI` is at position 0, and our `Token::ADD` is at position 2.
• `literal_chars` — These are the actual characters that make up the token, whether it be a number, a string, an identifier, or something else. This is important, since knowing that `total_cost` is a `Token::IDENTIFIER` is only part of the story. In addition, we also need the identifier’s name (`total_cost`) to distinguish it from other `Token::IDENTIFIER` values.
• `raw_literal_chars` — The is similar to `literal_chars` but is used for template literals. In this case, we don’t want escape sequences (e.g. `\064`) to be replaced by the corresponding character (e.g. `4`), but instead require that the raw literal characters be passed to the template function.
• `token` — The token’s enumerated value, as before.
• `invalid_template_escape_message` / `invalid_template_escape_location` — If there’s an error discovered while scanning a token, these fields store the error code and location.
• `smi_value_` — In the case of `Token::SMI`, this field stores the number’s actual integer value. This is returned by `scanner->smi_value()`.
• `after_line_terminator` — Indicates whether the token appears as the first token of a new line. This is useful for automatically inserting semicolons.

Now we understand all the building blocks, let’s continue along the scanner pipeline, to see how characters are actually converted into token values.

## Example: Scanning for Operator Tokens

At this point, we’re ready to trace through the operation of scanning `1 + 1`. Since scanning the operator is easier than scanning the numbers, let’s start by seeing what `scanner->Next()` does when the upcoming input character is the `+` sign.

The bulk of the scanning mechanism is in the private `ScanSingleToken()` method. The scanning starts with a very simple lookup into the `one_char_tokens[128]` array, which contains a direct mapping from the first 128 Unicode characters (aka ASCII characters) to a corresponding “guess” of what the token might be. Here’s the `GetOneCharToken()` method, used to populate the `one_char_tokens[128]` array:

`constexpr Token::Value GetOneCharToken(char c) {   return      c == '(' ? Token::LPAREN :      c == ')' ? Token::RPAREN :      c == '{' ? Token::LBRACE :      c == '}' ? Token::RBRACE :      c == '[' ? Token::LBRACK :      c == ']' ? Token::RBRACK :      c == '?' ? Token::CONDITIONAL :      c == ':' ? Token::COLON :      ...      c == '+' ? Token::ADD :      ...}`

In our example, the character `+` is mapped to `Token::ADD`. However, this is only a guess. What if the actual input was `++` or `+=`, which are both legal tokens in JavaScript? To handle this, the code looks ahead to see if the following character is also a `+` (in which case `Token::INC` is returned), or perhaps an `=` (returning `Token::ASSIGN_ADD)`. If neither case is true, the original `Token::ADD` is returned.

Here’s the code in question:

`case Token::ADD:  // + ++ +=  Advance();  if (c0_ == '+') return Select(Token::INC);  if (c0_ == '=') return Select(Token::ASSIGN_ADD);  return Token::ADD;`

Note that the `Advance()` method reads the next character into local variable `c0_`, and `Select` is short-hand for consuming that character, and returning the specified token.

Let’s now see the more complicated case of scanning numbers.

## Example: Scanning for Number Tokens

When scanning the string `1`, the process starts the same way as before. We look up the character in the `one_char_tokens[128]` array, which provides `Token::NUMBER` as the initial guess. The code then immediately calls the `ScanNumber` method to look more deeply at the characters appearing in the input stream.

`case Token::NUMBER:    return ScanNumber(false);`

Here’s a breakdown of how `ScanNumber` works:

• Line 752 — A decision is made about whether the number starts with a decimal point (the `.` character). If this is true, the number contains only a fractional portion (such as `.123`), so further scanning is delegated to the `ScanDecimalDigits()` method. In our `1 + 1` example, we don’t take this code path.
• Line 762—Next, we check whether the first character in the number is a `0`. If so, we check the following character for `x`, `o`, or `b`, delegating to `ScanHexDigits()`, `ScanOctalDigits()`, or `ScanBinaryDigits()` respectively. However, if the next character was actually an octal digit (from `0` to `7`), then `ScanImplicitOctalDigits()` is called to handle cases such as `077` (as opposed to the more explicit `0o77`). Finally, a number like `088` (out of range for an octal) is treated as a regular decimal `88`.
• Line 797 — Since our input string (`1 + 1`) does not start with a `0` digit, we consider the number to be a decimal (base 10), possibly containing underscores (e.g. `1_000`)
• Line 803 — Given that most number literals are small, we take our chances and scan the number as a Smi (small integer), delegating the work to `ScanDecimalAsSmi()`, returning the value as a C++ `uint64_t`.
• Line 807 — A Smi must be small enough to fit into 31-bits. If this is possible, we set the `smi_value_` field of the token to the integer’s value, then return `Token::SMI`. This is the path taken in our `1 + 1` example.
• Line 819 — If the number wasn’t a Smi, continue to parse the decimal number, calling `ScanDecimalDigits()` to do so. Additionally, we check for a trailing decimal point (the `.` character), followed by another decimal number (the fractional part). Note that this code simply validates the number is well-formed, rather than extracting the actual value itself (as we did in the Smi case).
• Line 833 — This is where we handle the BigInt scenario, where the number has a trailing `n`. For example, `12345678n`.
• Line 848 — Finally, we handle the exponent case, where the number has a trailing `e` following by the exponent. For example: `123e5`. This is not relevant in our `1 + 1` case, since a Smi can’t have an exponent.

And then ends the scanning process for operators and numbers in JavaScript, due to the complexity of multi-character operators, different bases (hex, octal, binary), underscores used as digit separators, fractional portions, BigInts, and exponents, the whole process is quite complicated.

# Next Time…

In the next blog post, we’ll continue our story of calculating `1 + 1`. Now that we have a sequence of tokens (`Token::SMI`, `Token::ADD`, and `TOKEN:SMI`) we can see how the `Parser` class validates the input against the JavaScript language definition. Finally, we’ll see how an AST (Abstract Syntax Tree) is created as an in-memory representation of our program.

--

--