Calculating 1 + 1 in JavaScript — Part 3

Peter Smith
Compilers
Published in
12 min readMar 11, 2021

I’m a compiler enthusiast who has been learning how the V8 JavaScript Engine works. Of course, the best way to learn something is to write about it, so that’s why I’m sharing my experiences here. I hope this might be interesting to others too.

This is the third part of a multi-part series describing how the V8 JavaScript Engine calculates 1 + 1. If you haven’t read the previous posts in this series, you might like to start with Part 1 (storing the source code string in the JavaScript heap), and Part 2 (checking if the byte codes are already cached). However, since this blog post is fairly independent from the previous two, you can likely understand it in isolation.

In this part of our story of how V8 calculates 1 + 1 , we’ll learn how the input characters are scanned into tokens, which are then used as input to JavaScript’s parser. This concept will be familiar to anyone who has read an introductory compiler text book.

The Scanning Process

With our 1 + 1 example, the output from the scanner we’re expecting to see is the following sequence of tokens:

Token::SMI (value 1)
Token::ADD
Token::SMI (value 1)

Where Token::SMI is a special variant of Token::NUMBER representing small integer values, and Token::ADD unsurprisingly represents addition. Note also, the white space characters between 1, +, and 1 have been ignored, since they don’t provide any further value.

Here’s the overall flow of V8, as it scans and parses the input stream:

The first step is for the v8::internal::Utf16CharacterStream class to read the individual characters from the JavaScript heap (as seen in Part 1, the string is stored as a SeqOneByteString object). Next, the v8::internal:Scanner class converts sequences of characters into tokens (of type v8::internal::Token). Finally, the v8::internal::Parser class (which we’ll examine in a later blog post) uses these tokens to validate the input stream and build an in-memory Abstract Syntax Tree (AST).

What’s important to understand is that all these activities happen in a streaming fashion. It all starts when the Parser requests the next token, which causes the Scanner to request the next character (or characters) from the Utf16CharacterStream, which in turn reads the input string from the JavaScript heap. As with any stream-based solution, the temporary storage space requirement is minimal, as tokens are not scanned until they’re actually required downstream.

Let’s look at this process in more detail.

The v8::internal::Utf16CharacterStream Class

The Utf16CharacterStream class is responsible for reading Unicode characters from the input stream, then providing them one-by-one to the Scanner class. This seems like a trivial exercise, but as we’ll see there are interesting edge-cases to consider. First, the scanner might need to look ahead into the input stream before deciding on what the current token should be. Second, the scanner doesn’t know (or care) where the characters come from, or how they’re stored in memory.

Utf16CharacterStream Methods

Let’s look at some of the Utf16CharacterStream methods, to understand how this class is used.

  • stream->Advance() — This method returns the next Unicode character in the input, fully removing that character from the input stream. This is the standard behaviour you expect when reading characters in a sequence.
  • stream->Peek() — This method returns the next Unicode character in the stream, but without actually consuming that character. Therefore, it’ll still be available when Peek() or Advanced() is next called. This is useful for looking ahead into the input, without committing to consume the character until you know for sure it’s part of the current token.
  • stream->AdvanceUntil(func) — Continually read (aka “use up”) characters until the func function returns true. This is useful for consuming characters until a certain point is reached, such as the end of line.
  • stream->Back() — Essentially the opposite of Advance(). Returns the character back to the input stream so it’s available for future Peek() or Advance() calls. This is useful when the scanner tried to read ahead, yet decided the next character wasn’t actually part of the current token.

There are several more methods available, but these are the most important. As we’ll see when we examine some of the methods in the Scanner class, the ability to read ahead, yet also push back characters is vital for correctly scanning input tokens.

Utf16CharacterStream is Abstract

A second interesting discussion is how the characters are retrieved from memory. It turns out that Utf16CharacterStream is an abstract class, with a range of different implementations available, each focusing on a specific storage layout of the source string. In our case, the 1 + 1 string was stored on JavaScript’s heap using one byte to store each character. Other options include reading from two-byte strings, as well as from strings that are stored externally from the JavaScript heap.

Selection of the appropriate Utf16CharacterStream sub-class is performed within the ParseProgram method (see src/parsing/parsing.cc) . ParseProgram does many different things, but as far as scanning is concerned the most relevant line of code is:

std::unique_ptr<Utf16CharacterStream> stream(
ScannerStream::For(isolate, source));

Using our 1 + 1 example, theScannerStream::For method examines the source string and determines it has type SeqOneByteString (one-byte string, stored in the JavaScript heap). The code then returns an instance of the BufferedCharacterStream class, which is the particular sub-class of Utf16CharacterStream capable of reading SeqOneByteString objects.

The most interesting part of the BufferedCharacterStream sub-class is the ReadBlock() method. This method is called upon by higher-level methods such as Peek() or Advance() to fetch the next block of characters from the input stream, however it may be stored.

Let’s now move forward in the scanner pipeline to learn how tokens are represented in V8.

The v8::internal::Token Class

Similar to other compilers, the v8::internal::Token class provides an enumeration of all token values recognized by the scanner. The definition of these tokens (see src/parsing/token.h) is managed by a clever C++ macro (TOKEN_LIST) containing a list of all tokens, combined with a second macro (T) which extracts the name portion:

Here’s the definition of the token enumeration, using C++ macros (warning: it’s not very easy to read):

#define T(name, string, precedence) name,  
enum Value : uint8_t { TOKEN_LIST(T, T) NUM_TOKENS };
#undef T

And here is the definition of TOKEN_LIST, abbreviated from code in src/parsing/token.h.

#define TOKEN_LIST(T, K) \
T(TEMPLATE_SPAN, nullptr, 0) \
T(TEMPLATE_TAIL, nullptr, 0) \
T(PERIOD, ".", 0) \
T(LBRACK, "[", 0) \
T(QUESTION_PERIOD, "?.", 0) \
T(LPAREN, "(", 0) \
T(RPAREN, ")", 0) \
T(RBRACK, "]", 0) \
...

When macros are expanded, this becomes a more readable enumeration.

enum Value : uint8_t {
TEMPLATE_SPAN,
TEMPLATE_TAIL,
PERIOD,
LBRACK,
QUESTION_PERIOD,
LPAREN,
RPAREN,
...
ADD,
...
SMI,
...
WHITESPACE,
UNINITIALIZED,
REGEXP_LITERAL,
NUM_TOKENS
}

As we’ll see later, token values are referenced in the Scanner and Parser classes using the syntax Token::SMI or Token::ADD.

The v8::internal::Scanner Class

Let’s now dive into the internals of the Scanner class, responsible for reading characters from the input, and generating tokens for the output. We’ll see examples of scanning operators (such as +), as well as scanning numbers (such as 1).

Scanner Methods

First, here are some interesting methods from the Scanner class. They are somewhat similar to the methods on the Utf16CharacterStream class, but operate on whole tokens, rather than individual characters.

  • scanner->Next() —Return the next token from the input stream, and advanced the input pointer. Naturally, this calls the stream->Advance() method to fetch multiple characters from the Utf16CharacterStream, but will only return a single token value. In our 1 + 1 example, each token is only one character long, but normally this isn’t the case.
  • scanner->peek() — Peek ahead to see what the next token will be (beyond what Next() returns) without advancing the input. This is used by the parser to check the upcoming tokens to decide whether or not the current parser rule is matched by the input.
  • scanner->PeekAhead()— Peek even further ahead, necessary for parsing some of JavaScript’s more complicated syntax.
  • scanner->location() — Return the location of the current token. This provides the start and end character positions within the source string.
  • scanner->smi_value() — Returns the Smi (small integer) value of the current token, if any. In our example, this will return the integer 1 for both of our Token::SMI tokens.

And as you might expect, there are many other Scanner methods, mostly focused on error handling, but also for fetching the token’s associated value. As a sneak peek, here’s a small section of the Parser code using these methods:

...if (peek() == Token::PERIOD && PeekAhead() == Token::PRIVATE_NAME) {
Consume(Token::PERIOD);
Consume(Token::PRIVATE_NAME);
...
}
...

We’ll learn more about the Parser class in the next blog post, but this code snippet gives you a sense of how Parser calls upon Scanner to return the upcoming token values.

The TokenDesc Structure

Although we’ve already discussed the token enumerated values, allowing us to write Token::SMI (for the number 1) or Token::ADD (for the + symbol) that’s only part of the what’s required for the scanner to represent a token. In addition, the scanner cares about the token’s location, the literal characters, any possible error cases, and of course, the actual numeric value of the token.

To store all this extra information, the scanner uses a TokenDesc structure:

struct TokenDesc {
Location location = {0, 0};
LiteralBuffer literal_chars;
LiteralBuffer raw_literal_chars;
Token::Value token = Token::UNINITIALIZED;
MessageTemplate invalid_template_escape_message =
MessageTemplate::kNone;
Location invalid_template_escape_location;
uint32_t smi_value_ = 0;
bool after_line_terminator = false;
}

The fields are:

  • location — The numeric start and end positions within the source string. For example, our first Token::SMI is at position 0, and our Token::ADD is at position 2.
  • literal_chars — These are the actual characters that make up the token, whether it be a number, a string, an identifier, or something else. This is important, since knowing that total_cost is a Token::IDENTIFIER is only part of the story. In addition, we also need the identifier’s name (total_cost) to distinguish it from other Token::IDENTIFIER values.
  • raw_literal_chars — The is similar to literal_chars but is used for template literals. In this case, we don’t want escape sequences (e.g. \064) to be replaced by the corresponding character (e.g. 4), but instead require that the raw literal characters be passed to the template function.
  • token — The token’s enumerated value, as before.
  • invalid_template_escape_message / invalid_template_escape_location — If there’s an error discovered while scanning a token, these fields store the error code and location.
  • smi_value_ — In the case of Token::SMI, this field stores the number’s actual integer value. This is returned by scanner->smi_value().
  • after_line_terminator — Indicates whether the token appears as the first token of a new line. This is useful for automatically inserting semicolons.

Now we understand all the building blocks, let’s continue along the scanner pipeline, to see how characters are actually converted into token values.

Example: Scanning for Operator Tokens

At this point, we’re ready to trace through the operation of scanning 1 + 1. Since scanning the operator is easier than scanning the numbers, let’s start by seeing what scanner->Next() does when the upcoming input character is the + sign.

The bulk of the scanning mechanism is in the private ScanSingleToken() method. The scanning starts with a very simple lookup into the one_char_tokens[128] array, which contains a direct mapping from the first 128 Unicode characters (aka ASCII characters) to a corresponding “guess” of what the token might be. Here’s the GetOneCharToken() method, used to populate the one_char_tokens[128] array:

constexpr Token::Value GetOneCharToken(char c) {
return
c == '(' ? Token::LPAREN :
c == ')' ? Token::RPAREN :
c == '{' ? Token::LBRACE :
c == '}' ? Token::RBRACE :
c == '[' ? Token::LBRACK :
c == ']' ? Token::RBRACK :
c == '?' ? Token::CONDITIONAL :
c == ':' ? Token::COLON :
...
c == '+' ? Token::ADD :
...
}

In our example, the character + is mapped to Token::ADD. However, this is only a guess. What if the actual input was ++ or +=, which are both legal tokens in JavaScript? To handle this, the code looks ahead to see if the following character is also a + (in which case Token::INC is returned), or perhaps an = (returning Token::ASSIGN_ADD). If neither case is true, the original Token::ADD is returned.

Here’s the code in question:

case Token::ADD:
// + ++ +=
Advance();
if (c0_ == '+') return Select(Token::INC);
if (c0_ == '=') return Select(Token::ASSIGN_ADD);
return Token::ADD;

Note that the Advance() method reads the next character into local variable c0_, and Select is short-hand for consuming that character, and returning the specified token.

Let’s now see the more complicated case of scanning numbers.

Example: Scanning for Number Tokens

When scanning the string 1, the process starts the same way as before. We look up the character in the one_char_tokens[128] array, which provides Token::NUMBER as the initial guess. The code then immediately calls the ScanNumber method to look more deeply at the characters appearing in the input stream.

case Token::NUMBER:
return ScanNumber(false);

Here’s a breakdown of how ScanNumber works:

  • Line 752 — A decision is made about whether the number starts with a decimal point (the . character). If this is true, the number contains only a fractional portion (such as .123), so further scanning is delegated to the ScanDecimalDigits() method. In our 1 + 1 example, we don’t take this code path.
  • Line 762—Next, we check whether the first character in the number is a 0. If so, we check the following character for x, o, or b, delegating to ScanHexDigits(), ScanOctalDigits(), or ScanBinaryDigits() respectively. However, if the next character was actually an octal digit (from 0 to 7), then ScanImplicitOctalDigits() is called to handle cases such as 077 (as opposed to the more explicit 0o77). Finally, a number like 088 (out of range for an octal) is treated as a regular decimal 88.
  • Line 797 — Since our input string (1 + 1) does not start with a 0 digit, we consider the number to be a decimal (base 10), possibly containing underscores (e.g. 1_000)
  • Line 803 — Given that most number literals are small, we take our chances and scan the number as a Smi (small integer), delegating the work to ScanDecimalAsSmi(), returning the value as a C++ uint64_t.
  • Line 807 — A Smi must be small enough to fit into 31-bits. If this is possible, we set the smi_value_ field of the token to the integer’s value, then return Token::SMI. This is the path taken in our 1 + 1 example.
  • Line 819 — If the number wasn’t a Smi, continue to parse the decimal number, calling ScanDecimalDigits() to do so. Additionally, we check for a trailing decimal point (the . character), followed by another decimal number (the fractional part). Note that this code simply validates the number is well-formed, rather than extracting the actual value itself (as we did in the Smi case).
  • Line 833 — This is where we handle the BigInt scenario, where the number has a trailing n. For example, 12345678n.
  • Line 848 — Finally, we handle the exponent case, where the number has a trailing e following by the exponent. For example: 123e5. This is not relevant in our 1 + 1 case, since a Smi can’t have an exponent.

And then ends the scanning process for operators and numbers in JavaScript, due to the complexity of multi-character operators, different bases (hex, octal, binary), underscores used as digit separators, fractional portions, BigInts, and exponents, the whole process is quite complicated.

Next Time…

In the next blog post, we’ll continue our story of calculating 1 + 1. Now that we have a sequence of tokens (Token::SMI, Token::ADD, and TOKEN:SMI) we can see how the Parser class validates the input against the JavaScript language definition. Finally, we’ll see how an AST (Abstract Syntax Tree) is created as an in-memory representation of our program.

--

--