Yes, obviously the answer is that
Stepping back for a moment… one of my favourite interview questions is:
“When I type a URL into a web browser, and then press enter, what are all the things that happen to load the web page? Go into as much detail as you like”.
It’s a great question, because it shows the depth and breadth of a person’s understanding, allowing them to illustrate which parts of the process are the most interesting for them.
This is the first in a series of blog posts that will look at everything V8 does when
1 + 1 is entered. To start with, we’ll focus on how V8 stores the
1 + 1 string in its heap memory. It sounds simple, but it’s an entire blog post on its own!
The Client Application
1 + 1, the most likely approach you’d take is to fire up NodeJS, or the Chrome developer console, and simply enter
1 + 1. That will work, but in order to show the V8 internals, I’ve decided to modify
hello-world.cc, one of the standard sample applications in the V8 source code.
I took the original code, which printed “Hello World”, and replaced it with the expression
1 + 1:
Local<String> source = String::NewFromUtf8Literal(isolate, "1 + 1");// Compile the source code.
Local<Script> script =
Script::Compile(context, source).ToLocalChecked();// Run the script to get the result.
Local<Value> result = script->Run(context).ToLocalChecked();// Convert the result to Number and print it.
Local<Number> number = Local<Number>::Cast(result);
Take a quick look at this code to get an idea of what it does. The lines of C++ might be fairly cryptic, but the comments will help. For this blog post we’ll focus exclusively on the first statement, allocating a new
1 + 1 string in the V8 heap:
Local<String> source = String::NewFromUtf8Literal(isolate, "1 + 1");
To understand this code, let’s start with the high-level sequence of V8 modules involved. In this diagram, the flow of execution is from left-to-right, with the return value being passed back from right-to-left to be inserted into the
- Application — This represents the client of V8. In our example, it’s the
- Heap Factory — Internal to the V8 Engine (not exposed via the API) is a “factory” for creating various data objects on the heap. Quite surprisingly, the set of factory methods available is very different to what the external API provides, so a lot of translation is done within the API layer.
- New Space — V8’s heap is very complicated, but newly-allocated objects are usually stored in the New Space, often known as the Young Generation. We won’t cover the detail here, but the new space is managed using Cheney’s Algorithm, a well-known algorithm for performing garbage collection.
Let’s now go into more detail on this flow, focusing on:
- How the API layer decides what type of string to create, and where it should be stored within the heap.
- What the string’s internal memory layout will be. This depends on the range of characters appearing in the string.
- How the storage is allocated from the heap. In our example, 20 bytes are required.
- Finally, how the pointer to the string is returned to the application, with the goal of garbage collecting in the future.
Determining How and Where to Store the String
As mentioned above, there’s a fair amount of translation that must happen between the client application and the heap factory, where the object is actually created. The bulk of this work is performed in
Let’s start with the client application’s call:
String::NewFromUtf8Literal(isolate, "1 + 1");
The first argument is for an “Isolate” which is V8’s main internal data structure representing the state of the run-time system, isolated from other possible V8 instances. To understand this, think about having multiple browser windows open, where each window has a completely separate instance of V8 running, each with its own isolated heap. We won’t talk about the
isolate argument much, other than noting that a very large number of API calls expect this parameter.
String::NewFromUtf8Literal method (see
src/api/api.cc) starts with basic string length checking, but also decides how to store the string in memory. Given that we only provided two arguments to the call, the third
type argument defaults to
NewStringType::kNormal, indicating the string should be allocated as a regular object on the heap. The alternative would have been to pass
NewStringType::kInternalized indicating that de-duplication of the string is desired. This feature is very useful to avoid storing multiple copies of the same constant string.
The next step is to call the
NewString method (see
src/api/api.cc) which invokes
factory->NewStringFromUtf8(string). Note that
string here has been mapped into an internal
Vector data structure, instead of a regular C++ string, because the heap factory has quite a different set of methods than the external API. This difference will become more apparent later when the return value is passed back to the client application.
src/heap/factory.cc), a decision is made on the optimal storage format for the string. Naturally, UTF-8 is a convenient format for storing a wide range of Unicode characters, but when only basic ASCII characters are used (such as
1 + 1) V8 stores the string in “one byte” format. To make this decision, the string’s characters are passed into
Utf8Decoder decoder(utf8_data) (declared in
Now that we’ve decided to allocate a one-byte string, using the normal (not internalized) approach, the next step is to invoke
src/heap/factory-base.cc), where the heap memory is allocated, and the string’s content is written into that memory.
The String’s In-Memory Structure
Inside V8, our
1 + 1 string is represented as an instance of the
v8::internal::SeqOneByteString class (see
src/objects/string.h). If you’re like most object-oriented developers, you’d expect that
SeqOneByteString would have a number of public methods, as well as several private members, such as an array of characters or an integer storing the string’s length. However, that’s not the case! Instead, all internal object classes are actually just pointers to the heap where that data is stored.
As you can see from the code comment in
src/objects/objects.h, there are roughly 150 internal classes that have the common parent class of
v8::internal::Object. Each of these classes consists solely of a single 8-byte value (on a 64-bit machine) referring to the object’s heap location.
Keeping this in mind, here’s what our string looks like in memory:
There are many interesting parts here:
The SeqOneByteString Object
As mentioned, this is not a fully-fledged string class, but is instead a pointer to the on-heap location of the string’s actual content. On a 64-bit machine, this “pointer” will be an 8-byte
unsigned long, with the type alias of
Address. Note that the data on the heap (on the right side of the diagram) is not actually a real C++ object, so there’s no point in treating this
Address as if it was a pointer to something strongly-typed (such as
But, you might be wondering why the additional level of indirection exists in the first place. Why not simply access the heap block directly? This approach makes sense when you consider that garbage collection can result in objects being moved around the heap. It’s important that data can move, without the client application getting confused.
To clarify, in Generational Garbage Collection, objects are first allocated in the Young Generation (New Space), and if they survive long enough, they’ll be moved to the Old Generation (Old Space). To make this work, the garbage collector copies the heap block to the new heap space, then updates the
Address value to point to the new memory location. Given that the
SeqOneByteString object itself is still at exactly the same memory address as before, the client software won’t notice the change.
Compressed Pointer To Map (Bytes 0–3 of Heap Block)
We won’t go into much detail about the map for our
1 + 1 string, other than mentioning that maps are also a type of heap object, stored in the Read Only Space. Maps (also known as Shapes or Hidden Classes) can become very complex, although our constant string uses a pre-defined map by calling
Interestingly, although this map field is a pointer to another heap object, it cleverly uses Pointer Compression to store a 64-bit pointer value in a 32-bit field.
Object Hash Value (Bytes 4–7 of Heap Block)
Every object has an internal hash value, but in this example it defaults to
kEmptyHashField (value of 3) to indicate the hash is not yet computed.
String Length (Bytes 8–11 of Heap Block)
This is the number of bytes in the string (5).
The Characters and the Padding (Bytes 12–19 of Heap Block)
As you’d expect, the five single-byte characters are stored next. Additionally, to ensure that future heap objects are aligned based on the CPU’s architecture requirements, an additional three bytes of padding are added (aligning the object to a 4-byte boundary).
Allocating Memory From the Heap
We briefly mentioned that the factory class allocates a block of memory from the heap (20 bytes in our case), then fills that block with the object’s data. One remaining question is how that 20 bytes is allocated.
In Cheney’s algorithm for garbage collection, the Young Generation (New Space) is divided into two semi-spaces. To allocate a block of memory in the heap, the allocator determines if there are enough free bytes between the
limit of the current semi-space, and the current
top of that semi-space. If there’s enough room, the algorithm returns the address of the next block, then increments the
top pointer by the requested number of bytes.
This basic case is shown here, showing the before and after states of the current semi-space:
If the current semi-space were to run out of free memory (
limit get too close), then the collection portion of Cheney’s algorithm starts. Once collection is complete, all the live objects will have been copied to the beginning of the second semi-space, and all dead objects (remaining in the first semi-space) will be discarded. No matter what, a semi-space is guaranteed to have all its used space at the bottom, and all its free space at the top, so it’ll always look like the above diagram.
In our case though, there’s plenty of free memory in the current semi-space, so we carve off 20 bytes, then increase the
top pointer. There’s no need for garbage collection, and the second semi-space isn’t involved. In the V8 code, there are numerous special cases to consider, but the final allocation of 20 bytes is handled by the
NewSpace::AllocateFastUnaligned method in
Returning a Handle
Now that we have a pointer to a heap block, fully populated with the string’s content (including length, hash, and map) the pointer must be returned to the client application. If you recall, the client invoked this line of code:
Local<String> source = String::NewFromUtf8Literal(isolate, "1 + 1");
But, what exactly is the type of
source, and what does
Local<String> actually mean? There are two key observations here:
Translating Internal to External Classes
First, it’s interesting to recall that V8 stored our string object using the
v8::internal::SeqOneByteString class, which is simply a pointer to the data on the heap. However, the client application expects the data to be of type
v8::String, which is part of the V8 API.
What may surprise you is that
v8::internal::SeqOneByteString (a subclass of
v8::internal::String) is in a completely different class hierarchy than
v8::String. In fact, all of the internal classes are defined in the
src/objects directory using the
v8::internal namespace, whereas the external classes are defined in
include/v8.h using the
NewFromUtf8Literal method we discussed earlier (see
src/api/api.cc), the very last step before returning the object pointer to the client application is to cast the result from a
v8::internal::String to a
This conversion magic is done using macros defined in
Managing the “Roots” for Garbage Collection
Second, let’s discuss what
Local<String> means (which incidentally is an abbreviation for
v8::Local<v8::String>). The Local concept is how we deal with garbage collection of the string object when it’s no longer needed.
In the case of the
hello-world.cc, or Chrome, or NodeJS). For example:
Local<String> source = ...
In this case,
source is a reference to a heap object, although now with an additional level of indirection. This diagram will explain:
On the left side is the C++ stack, which grows from top to bottom as the program executes, and the right side is the heap memory we saw earlier. As the client application executes, it pushes a
HandleScope object onto the local C++ stack (see
src/samples/hello-world.cc). Next, the return value from calling
String::NewFromUtf8Literal() is stored on the C++ stack as a
It looks like we’ve add yet another level of indirection, but there are benefits of doing this:
- Finding roots is easier — The
HandleScopeobject is a place to store “Handles” (aka pointers) to heap objects. As you recall, this is exactly what our
SeqOneByteStringobject was, an 8-byte pointer to the underlying heap data. When garbage collection is initiated, V8 quickly scans the
HandleScopeobject to find all the root pointers. It can then update those pointers if the underlying heap data is moved.
- Local pointers are easy to manage — In contrast to
HandleScopewhich is quite large, the
Local<String>object is an 8-byte value on the C++ stack, which can be used in the same context as any other 8-byte value, such as pointers or integers. In particular, it can be stored in CPU registers, be passed to functions, or provided as return values. What’s notable is that the garbage collector is not required to locate or update these values when garbage collection occurs.
- Eliminating scopes is easy — Finally, when the C++ function in the client application finishes, the
Localobjects on the C++ stack are removed, but only after their C++ object destructors have been called. These destructors remove all the handles from the garbage collector’s list of roots. They’re no longer in scope, so the underlying heap objects may have become garbage.
To conclude the story, the
source variable, referring to our
1 + 1 string is now ready to be passed to the next line in our client application:
Local<Script> script =
There was clearly a lot of work to allocate the
1 + 1 string on the heap. Hopefully it illustrated some parts of V8’s internal architecture, as well as how data is represented in different parts of the system. In future blog posts, I’ll look more into how our simple expression is parsed and executed, which will expose a lot more about how V8 operates.
In Part 2 of this blog post series, I’ll dig into how the Compilation Cache works, to avoid compiling code more than necessary.