The Beauty and the Bits

What is happening when you open a website in your browser?

19 min read · Apr 2, 2019

Life. One of the most complicated concepts ever known to humanity. When we try to describe life, we usually start with living organisms, such as us, humans (I apologize to the robots reading this article for the technoracial discrimination). We are composed of cells, and the intercommunication of those cells produces results that we call “decisions”, “breathing”, “thinking”, “sleeping”, etc. Life itself is based upon a very basic but fundamental concept called “communication”; cells, organs, humans, animals, nature, communities, cities, countries (a hundred years later planets, galaxies, world), all communicate and keep the balance between order and chaos. It won’t be much of an exaggeration to take bits into account when describing communication. The signal is the key concept in communication. You communicate with the light by turning the switch on or off, thus giving a signal to the light bulb. The light communicates back by turning on or off. It might just ignore the request by signaling its death, or an absence of electricity, or a cutoff in wires connecting the switch with the light, and so on.

Think about it next time you turn the switch on in your room. Think about the communication of the toothbrush bristle with your teeth while you brush them (assuming it will happen during the day). Think about your teeth and their communication with the food while you eat. A hard piece of french fries might destroy a little part of your tooth enamel, the result of the communication, which will eventually lead to a new communication between you and the dentist. Think about your brain cells that communicate together to process these words trying to deduce a general meaning that you will “understand” while reading this sentence and [probably] nodding as a result of a complex communication of the cells.

Communication plays a key role in computers as much as in life. Even the simplest electronic device is based on the communication of electrons. We are going to dive into the beauty of the bits behind the simple request the browser makes to render a webpage on your computer. Programmers are often asked this question in technical interviews, so if you need a good reference on the topic, this might be the one you’ve been looking for.

The browser and the OS

The web browser is a program used to access websites worldwide. Before you start typing the address of the website you wish to visit, you have to launch the browser application (obviously). When you hit the icon of the browser using your mouse/trackpad to launch it, the operating system recognizes the mouse click on the specific area of the screen. The OS also has the coordinates of all the rectangles of icons placed on your desktop. A double click (or a single click, as set up in your OS settings) tells the OS to search for the program in the file system. Programs are usually located in a specific directory (/usr/bin, for example). The program the OS finds there is just a file that differs from other files, such as acdc-hellsbells.mp3, by its format, extension, and contents. A file that can be run is called an executable file; that’s why in some systems (Windows) its extension is .exe. The browser is a file that can be executed by the OS, so it contains instructions that command the OS to do what the browser wants it to do.

The OS loads the contents of the browser program instructions into the main memory, called RAM (Random Access Memory), or DRAM (Dynamic RAM). It’s called the main memory because it serves as the place where instructions and data are located and can be accessed by the CPU, which in turn, is the only guy who can execute anything. The OS itself (that loads the contents of the browser into the memory) is run by the CPU.

The CPU (Central Processing Unit) is a small device that is able to do one simple instruction at a time: reading data from the memory, writing data to memory, performing arithmetic and logic operations, and so on. The CPU has its own little memory blocks called registers, each storing a fixed amount of data (4 or 8 bytes, for example), called a word. The word is the name of the data unit that the CPU can process at a time. “Processing at a time” is called a CPU cycle. It’s similar to chewing gum. You move your mouth such that your upper and lower teeth come close together and eventually press the gum between them to reveal the sweet juice that your tongue feels with its receptors and signals your brain that it’s tasty and you feel a little bit better. Each time your upper and lower teeth meet together and kick the gum between them, your mouth completes a cycle of chewing. You are done with the gum when you totally squeeze all the sweet taste that it has by chewing it a hundred or a thousand times. In this simplified picture, the CPU performs a simple action (usually involving data) and stores the results in a register in one CPU cycle. A program consists of tons of such actions, which the CPU may need millions or billions of cycles to complete.

The operating system loads the contents of the program using a “tool” called the loader. The loader copies the contents of the executable file into the main memory. When it’s ready, the CPU instruction pointer (a register that stores the address of the instruction to be executed next) points to the first instruction of the program. The program, i.e. the browser, is now starting. It renders itself to the screen using the API provided by the operating system. The window frame, the buttons, the menu, the colors, all are rendered on the screen using the OS. The operating system serves as an environment in which programs may run without interfering with each other. The OS also monitors the actions of the program to prevent its destructive intentions (if any). To achieve that, the operating system incorporates the concept of a process, i.e. the running program. The program that resides in the file system is “asleep” before the loader copies its content to memory. After that, the program awakens and starts running as a process.
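To see a sleeping program wake up as a process, here is a minimal Python sketch (Python is my choice for the runnable sketches in this article, the original blah-code stays as it is). It asks the OS to load and run a second copy of the Python interpreter, which is just an executable file like the browser. The one-line child program is an assumption made up for illustration.

```python
import os
import subprocess
import sys

# The child program: report the process id the OS gave it.
child_code = "import os; print(os.getpid())"

# Ask the OS to load the interpreter executable and run it as a new process.
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
    check=True,
)

parent_pid = os.getpid()
child_pid = int(result.stdout.strip())
print("parent process:", parent_pid)
print("child process:", child_pid)
```

The same file on disk ran twice here (once as the parent, once as the child), and the OS gave each running copy its own process id.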

Under the hood of an executable

Side note: For a good reference to learn this fundamental stuff from the C++ perspective and understand how the program works under the hood, I personally recommend this book.

The executable file consists of the code segment and the data segment. When you write a program and compile it, the generated file has at least those two segments. The loader copies the contents of the executable file considering each segment and copying them to the proper memory segment. Instructions of the browser are located in the code segment (called a text segment as well). Functions responsible for requesting and rendering website content are examples of instructions. These instructions might use data which is located in the data segment.

memory segments and the executable sections

The OS controls the execution of the program by incorporating virtual memory, a mapping onto the physical memory that lets each process behave as if it were the only running program. The OS also divides the virtual memory into blocks that control the execution of the program to some extent. For example, to call functions properly, the program uses the stack, which automatically allocates/deallocates memory space for the function arguments and local variables.

We write programs in high-level programming languages, such as C++. But the program, in the end, should be translated into machine code to be run by the CPU. The translation process (known as compilation) is simplified by using intermediate stages of translation, i.e. the program is translated to an assembly language and then to machine code. Let’s take a look at the low-level representation of the logic mentioned above.

Let’s suppose we have the function request() which makes a Google search request or makes an HTTP request based on the result of the parsing of the text typed in the address bar. We introduce the following simplified blah-code (a blah-code is a mix of pseudocode and whatever the author wants). Let’s say we set the url property to 1 if the address is a valid URL and set it 0 if it’s not (a search term).

void request(address) {
  if (address.url == 1) { 
    makeHTTPRequest(address); 
  }
  else { // url == 0
    makeGoogleSearch(address);
  }
}

(Obviously, the program also defines makeHTTPRequest and makeGoogleSearch functions somewhere in the code.)

The CPU executes instructions sequentially one by one, and instructions are simple commands doing exactly one thing. We can use complex expressions in a single line in a high-level programming language such as [obviously] C++, while the assembly instructions are simple commands that can do only one simple operation in one CPU cycle (remember the chewing?): move, add, subtract, XOR, and so on. The CPU fetches the instruction from the code segment of the memory, decodes it to find out what exactly it should do (move data, add numbers, subtract them, etc.), and executes the command. In order to run at its fastest, the CPU stores the operands and the result of the execution in registers (think of registers as temporary variables of the CPU). Registers are physical memory units that are located within the CPU, so the access is much faster compared to the RAM. To access the registers from an assembly language program, we use their specified names such as rax, rbx, rdx, etc. The CPU commands operate on registers rather than the RAM cells. That’s why the CPU has to copy the contents of the variable from the memory to registers, execute operations, store the results in a register, and then copy the value of the register back to the memory cell.

For example the high-level expression:

a = b + 2 * c - 1;

that takes just a single line of code will have the following assembly representation (a blah-code, again). Comments follow after semicolons.

mov rax, b; copy the contents of "b" located in the memory to the register rax
mov rbx, c; the same for the "c" to be able to calculate 2 * c
mul rbx, 2; multiply the value of the rbx register with immediate value 2 (2 * c)
add rax, rbx; add rax (b) with rbx (2*c) and store back in the rax
sub rax, 1; subtract 1 from rax
mov a, rax; copy the contents of rax to the "a" located in the memory
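To watch the fetch, decode, execute routine at work, here is a small Python sketch of a toy CPU that runs the six instructions above. Everything in it is an assumption made for illustration: registers and memory are plain dictionaries, and the starting values of b and c are made up.

```python
def run(program, memory):
    """Execute a list of (opcode, destination, source) tuples on a toy CPU."""
    registers = {"rax": 0, "rbx": 0}

    def value_of(operand):
        # An operand is an immediate number, a register name, or a memory variable.
        if isinstance(operand, int):
            return operand
        if operand in registers:
            return registers[operand]
        return memory[operand]

    for opcode, dest, src in program:      # fetch the next instruction
        if opcode == "mov":                # decode it, then execute it
            if dest in registers:
                registers[dest] = value_of(src)
            else:
                memory[dest] = value_of(src)
        elif opcode == "add":
            registers[dest] += value_of(src)
        elif opcode == "sub":
            registers[dest] -= value_of(src)
        elif opcode == "mul":
            registers[dest] *= value_of(src)
        else:
            raise ValueError(f"unknown instruction: {opcode}")
    return memory


program = [
    ("mov", "rax", "b"),    # copy "b" from memory into rax
    ("mov", "rbx", "c"),    # copy "c" from memory into rbx
    ("mul", "rbx", 2),      # rbx = 2 * c
    ("add", "rax", "rbx"),  # rax = b + 2 * c
    ("sub", "rax", 1),      # rax = b + 2 * c - 1
    ("mov", "a", "rax"),    # copy rax back to "a" in memory
]

memory = run(program, {"a": 0, "b": 5, "c": 10})
print(memory["a"])  # 5 + 2 * 10 - 1 = 24
```

One line of high-level code became six tiny steps, and each step touches at most one register and one other value.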

A conditional statement suggests that a portion of the code should be “skipped”. For example, calling request(“world better place how”); means the if block will be omitted. To express this in the assembly language, the idea of jumps is used. We compare two values and, based on the result, we jump to a specified portion of the code. We label the portion to make it possible to “find” the set of instructions. For example, to skip adding 99 to the register rbx, we can “jump” to the portion labeled “MEH” using the unconditional jump instruction jmp.

mov rax, 2
mov rbx, 0
jmp MEH
add rbx, 99; will be skipped
MEH:
add rax, 1
...

The jmp instruction performs an unconditional jump, i.e. it continues the execution from the first instruction at the specified label without any condition check. The good news is that the CPU provides conditional jumps as well.
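The jump above can be tried out with a tiny Python interpreter, again a sketch and not a real assembler. It assumes that operands are either register names or immediate numbers, and it keeps a program counter (pc) that plays the role of the instruction pointer.

```python
def run(lines):
    """Run blah-assembly lines that use only mov, add, jmp and labels."""
    registers = {"rax": 0, "rbx": 0}
    # First pass: remember which line each label points to.
    labels = {line[:-1]: index for index, line in enumerate(lines) if line.endswith(":")}

    pc = 0  # index of the next line to execute
    while pc < len(lines):
        line = lines[pc]
        pc += 1
        if line.endswith(":"):
            continue  # a label is only a marker, there is nothing to execute
        opcode, _, rest = line.partition(" ")
        operands = [part.strip() for part in rest.split(",")]
        if opcode == "jmp":
            pc = labels[operands[0]]  # continue from the labeled line
        elif opcode == "mov":
            registers[operands[0]] = int(operands[1])
        elif opcode == "add":
            registers[operands[0]] += int(operands[1])
        else:
            raise ValueError(f"unknown instruction: {opcode}")
    return registers


program = [
    "mov rax, 2",
    "mov rbx, 0",
    "jmp MEH",
    "add rbx, 99",  # never executed
    "MEH:",
    "add rax, 1",
]

registers = run(program)
print(registers)  # {'rax': 3, 'rbx': 0}
```

The register rbx stays 0 because the jump moved the program counter past the add instruction.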

The body of the request() function will translate into the following assembly code (simplified), where the je is interpreted as “jump if equal to” and jne as “jump if is not equal to” (based on the results of the comparison using the cmp instruction):

mov rax, address.url; copy the "address.url" into the rax register
cmp rax, 1; compare the url flag with 1
je IS_URL; jump if url is 1
jne IS_SEARCH_TERM; jump if url is 0
IS_URL:
  call makeHTTPRequest
  jmp DONE; skip the search branch
IS_SEARCH_TERM:
  call makeGoogleSearch
DONE:

The browser executable file consists of millions of similar instructions, but in the form of zeroes and ones, where each combination specifies a command or data (say, 10001 means add, 01101 means move, etc.).
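Here is a rough Python sketch of what “in the form of zeroes and ones” could look like. The instruction format is entirely made up for this article: 5 bits for the command (using the two example codes from the text), 3 bits for the register, and 8 bits for an immediate number. Real CPUs use far more elaborate encodings.

```python
# Hypothetical encodings, chosen only for illustration.
OPCODES = {"add": 0b10001, "mov": 0b01101}
REGISTERS = {"rax": 0b000, "rbx": 0b001}


def encode(opcode, register, immediate):
    """Pack one instruction into a 16-character string of bits."""
    if not 0 <= immediate < 256:
        raise ValueError("the immediate value must fit into 8 bits")
    word = (OPCODES[opcode] << 11) | (REGISTERS[register] << 8) | immediate
    return format(word, "016b")


def decode(bits):
    """Unpack a 16-bit string back into (opcode, register, immediate)."""
    word = int(bits, 2)
    opcode_names = {code: name for name, code in OPCODES.items()}
    register_names = {code: name for name, code in REGISTERS.items()}
    return opcode_names[word >> 11], register_names[(word >> 8) & 0b111], word & 0xFF


print(encode("mov", "rax", 2))     # 0110100000000010
print(encode("add", "rbx", 99))    # 1000100101100011
print(decode("1000100101100011"))  # ('add', 'rbx', 99)
```

Reading the first output from left to right: 01101 is the move command, 000 is rax, and 00000010 is the number 2.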

Typing the address and pressing the “Enter”!

The OS handles user events such as mouse clicks or keyboard key presses and passes them to the running application. When you click on the address bar of the browser, type the address of the website and hit the “enter” button, all of those events are passed to the browser by the OS. The OS, in turn, learns about them through interrupts: the hardware signals the CPU, which pauses what it is doing and runs the OS code that handles the event. An interrupt is the “hey” of the digital world. Imagine you take a walk and think about the pitch of your startup, the aim of which is to convince the investors that all you want to do is to make the world a better place (instead of making a sh*tload of money and the opportunity to sign on your fans’ boobs (which are hot chicks, obviously)); and someone interrupts you with a “hey!”. You take a pause from your obviously fantastic thoughts and turn to that someone with a “why C++ is better than C#” face, and that someone asks you how to get to the nearest station blah blah. You answer them and get back to your obviously fantastic thoughts. The “hey!” interrupted you from your “default” routine the same way the CPU is interrupted when the user clicks or types something. These events are then passed to the browser, which properly reacts to them by rendering letters on the address bar while you are typing, for example.

An interrupt is a signal to the processor emitted by hardware or software indicating an event that needs immediate attention (Wikipedia).

After you type the address, for example, “facebook.com”, and hit “enter”, signaling the browser to start loading the contents of the website, the browser starts parsing the address. Most browsers allow you to search for a term rather than enter the address completely. For example, in Chrome (I use Chrome), typing “facebook” without a “.com” performs a Google search and results in a list of websites that match the query the most, and obviously the first result would be the link to facebook.com. To achieve this and also to properly request the website, the browser parses the contents of the address bar to find out what exactly it is. Is it an “asdf”, or is it an IP address (138.201.20.123), or is it a fully typed address of the website (“http://facebook.com”)? In any situation, the browser “must” perform a network request to the target server to retrieve the contents of the website to be rendered.
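The “what exactly is it?” question can be sketched in Python as follows. The rules here are my own simplified guesses (real browsers apply many more checks), and the function name classify is hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def classify(text):
    """Guess whether the address bar holds an IP address, a URL, or a search term."""
    text = text.strip()
    try:
        ipaddress.ip_address(text)
        return "ip"
    except ValueError:
        pass  # not an IP address, keep looking

    try:
        parts = urlsplit(text)
    except ValueError:
        return "search"
    if parts.scheme in ("http", "https") and parts.netloc:
        return "url"

    # A bare host name such as facebook.com: no spaces, a dot somewhere inside.
    if " " not in text and "." in text.strip("."):
        return "url"
    return "search"


for typed in ["asdf", "138.201.20.123", "http://facebook.com", "facebook.com", "facebook"]:
    print(typed, "->", classify(typed))
```

With these rules “facebook” ends up as a search term while “facebook.com” is treated as an address, which matches the Chrome behavior described above.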

Name lookup

Let’s suppose that you’ve typed “http://facebook.com” and now you are waiting for the browser’s response. The browser has to somehow find the one and only server that contains the contents of the website and ask that server to send the contents over. It’s the same as if you look for the guy named Valod that lives on the planet Earth and has a tattoo of a mermaid on his left shoulder.

This might be our Valod. (image source: TNW)

To do so, the browser has to look up the actual IP address of the server that is mapped to the name “facebook.com”. The browser first performs a DNS lookup. DNS stands for Domain Name System, a network of servers that hold the names mapped to the IP addresses. The lookup goes from your local internet provider’s resolver to the so-called root name servers, to get the name servers first for the “.com” top-level domain and then for “facebook.com”.

The browser caches the response so that it can access the same website faster next time by skipping the DNS lookup phase. It usually takes milliseconds to retrieve the IP address of the website. The browser then makes an HTTP request to the server. This is where the hard stuff begins.
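The caching idea fits into a few lines of Python. This is a sketch under assumptions: the class name CachingResolver is made up, real caches also expire their entries after a while, and the demo uses a fake lookup function with a documentation-only address so that it runs without touching the network.

```python
import socket


class CachingResolver:
    """Remember the IP address of every host name that was already looked up."""

    def __init__(self, lookup=socket.gethostbyname):
        self._lookup = lookup  # by default, the real lookup done by the OS
        self._cache = {}

    def resolve(self, hostname):
        if hostname not in self._cache:
            self._cache[hostname] = self._lookup(hostname)  # the slow path
        return self._cache[hostname]


calls = []


def fake_lookup(hostname):
    # Stand-in for a real DNS lookup; 192.0.2.1 is reserved for documentation.
    calls.append(hostname)
    return "192.0.2.1"


resolver = CachingResolver(lookup=fake_lookup)
first = resolver.resolve("facebook.com")
second = resolver.resolve("facebook.com")
print(first, second, "lookups made:", len(calls))
```

The second call returns the same address without asking anyone, which is exactly the skipped DNS lookup phase.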

Network requests

Both the DNS lookup and the HTTP request to the server have to be made over the internet. To be able to do so, the browser uses a concept called sockets. Sockets are abstractions provided by the operating system that allow a program to access a remote computer, even one on the other side of the planet.

Sockets are a way of accessing the other world. In order to send data to or receive data from the other world, a stable connection should be established between the worlds using sockets. On many systems, sockets look like files that are treated differently by the OS. The program tells its intention to access the internet by telling the OS to create a socket for it, a special file the program could write data to, and that data would be transferred via the network (could be a local network as well). Whenever the program writes something to the socket, the OS transfers it to the specified endpoint (another socket in another computer). So the browser creates a socket (asks the OS to create one) after the user types the address of the website, and fills the socket with data regarding the request. It doesn’t matter if the request is a DNS lookup or an HTTP request to Facebook’s server, it happens using a socket. The type of the socket the browser creates is a client socket because it is used to “ask” for something. The one that serves the results (the Facebook server) creates its own “server” socket to listen for incoming network requests, handle them, and serve data.

How would the OS cope with several programs that need network access? For example, suppose you are opening facebook.com in your browser and chatting with your friend at the same time using the Skype desktop application. Both applications require a network connection and both ask the OS to create sockets for them. Eventually, the OS creates two sockets for two different applications. To know the difference between these sockets, it uses port numbers. Each socket has its unique port number which cannot be used by any other application. For example, the socket created for Skype uses the port 5678, and the socket created for the browser uses the port 8765. This way the OS can distinguish the data sent by the browser from the data sent by Skype and also can correctly pass the response to the proper application. For client sockets like these, the OS usually picks a free port on its own; a server’s port, on the other hand, is specified by the program developer and should be chosen wisely.
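You can watch the OS hand out port numbers with a short Python sketch. It relies on one assumption worth spelling out: binding a socket to port 0 asks the OS to pick any free port, so the two numbers printed will differ from run to run (and from the 5678 and 8765 used in the example above).

```python
import socket


def open_socket():
    """Ask the OS for a TCP socket bound to a free port on this machine."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))  # port 0 means "OS, please choose a free port"
    return sock


browser_socket = open_socket()  # pretend this one belongs to the browser
chat_socket = open_socket()     # and this one to the chat application

browser_port = browser_socket.getsockname()[1]
chat_port = chat_socket.getsockname()[1]
print("browser port:", browser_port)
print("chat port:", chat_port)

browser_socket.close()
chat_socket.close()
```

While both sockets are open, the OS never gives them the same port, and that is how it knows which application a piece of incoming data belongs to.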

Why you can’t access the internet from your pocket calculator

It doesn’t have a network adapter, a device that makes it possible for bits to flow from one computer to another by wires (or radio signals). This is the lowest level of network communication, called the physical layer. Data is sent via the wires as a sequence of bits which are then collected into something meaningful by the receiver (the beauty of the bits). The form of communication is known as a protocol. The protocol defines the form of the data, the exact location of the data, and the headers describing the data in the packet. The packet is the unit of information passed through the network.

To transfer the packet, the sender should mark it somehow for the receiver, the same way the mail is marked with stamps, the delivery address, and the sender address. The protocols exist for that purpose. Each layer of the network model adds level-specific metadata to the packet. The packets formed by the browser are HTTP (HyperText Transfer Protocol) documents sent over TCP (Transmission Control Protocol).

An HTTP document consists of two parts, the header, and the body. The header of a document contains the metadata related to the request and the document body. For example, it may contain the details of the browser that made the request, the size of the data located in the body of the document, and so on.

When you request the contents of facebook.com, the request-document doesn’t contain a body; it only contains a header describing what the browser needs from the server. The server responds with a document that contains the full contents of the requested web page in the body of the HTTP-document.
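Here is what such documents look like as plain text, sketched in Python. The helper names build_request and split_message are made up, and so is the “blah-browser” name in the User-Agent header; the layout itself (a start line, header lines, an empty line, then the body) is how HTTP/1.1 messages are written.

```python
def build_request(host, path="/"):
    """Compose a body-less HTTP/1.1 GET request as text."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "User-Agent: blah-browser/0.1",  # hypothetical browser name
        "Connection: close",
        "",  # the empty line ends the header
        "",
    ]
    return "\r\n".join(lines)


def split_message(message):
    """Split an HTTP message into its start line, header fields and body."""
    head, _, body = message.partition("\r\n\r\n")
    start_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


request = build_request("facebook.com")
print(split_message(request))

response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "<html></html>"
)
print(split_message(response))
```

The request has an empty body, while the response carries the page after the empty line that ends its header.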

So the following happens during a network communication:

the browser creates a socket (by specifying the port number, the transmission protocol);
the browser creates an HTTP request document;
the browser writes the HTTP-document into the socket and commands the OS to make the request;
the OS transforms the data into TCP packets and passes them to the network adapter;
the network adapter sends the bits of the packets to the network;
the packet(s) are received by the server’s adapter, which then passes them to the higher levels of the network model;
the OS of the server fetches the data from the received packets and passes it to the web server;
the web server parses the contents as an HTTP-document and creates the response HTTP-document;
the web server sends the response the same way the client sent the request;
the client receives the response and renders the contents as a web page.

Serving the request

A single computer that runs special software called a web server can be considered a server. The web server creates a server socket and listens for incoming connections on port 80 (the default HTTP port). You can create your own web server by designing an application that creates a socket, listens on port 80, accepts incoming connections, and serves data following the rules of the HTTP protocol.
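Such a do-it-yourself web server, together with a client that talks to it, might look like the Python sketch below. It makes a few assumptions: both sides run on the same machine, the server answers exactly one request and ignores what was actually asked, and it listens on a free port picked by the OS because port 80 usually requires administrator privileges.

```python
import socket
import threading


def serve_once(server_socket):
    """Accept one connection and answer it with a tiny HTML page."""
    connection, _ = server_socket.accept()
    with connection:
        connection.recv(4096)  # read the request; a real server would parse it
        body = b"<h1>A heading</h1>"
        header = (
            b"HTTP/1.1 200 OK\r\n"
            b"Content-Type: text/html\r\n"
            b"Content-Length: " + str(len(body)).encode() + b"\r\n"
            b"Connection: close\r\n\r\n"
        )
        connection.sendall(header + body)


def fetch(port):
    """Play the browser: connect, send a GET request, read the whole response."""
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(b"GET / HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n")
        chunks = []
        while True:
            chunk = client.recv(4096)
            if not chunk:  # the server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks).decode()


server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_socket.bind(("127.0.0.1", 0))  # a free port instead of 80
server_socket.listen()
port = server_socket.getsockname()[1]

worker = threading.Thread(target=serve_once, args=(server_socket,))
worker.start()
response = fetch(port)
worker.join()
server_socket.close()

print(response)
```

The client side walks through the browser’s steps from the list above, and the server side walks through the server’s, only without a planet between them.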

Facebook (and lookalikes) consists of thousands of servers in a complex architecture built to serve millions of users as fast as possible.

The request comes to one of the front servers, which decides which processing server should serve it: one that has the minimum load and is at the nearest location [geographically] to the user.
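How a front server weighs load against distance is not something this article knows, so the Python sketch below is purely hypothetical: it skips servers that are too busy and then picks the nearest of the rest. The server names, loads, distances and the 0.8 threshold are all invented.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    load: float         # fraction of capacity in use, from 0.0 to 1.0
    distance_km: float  # distance to the user


def pick_server(servers, max_load=0.8):
    """Choose the nearest server that is not overloaded."""
    # If every server is overloaded, fall back to considering all of them.
    candidates = [server for server in servers if server.load < max_load] or servers
    return min(candidates, key=lambda server: (server.distance_km, server.load))


servers = [
    Server("frankfurt", load=0.95, distance_km=300),
    Server("dublin", load=0.40, distance_km=1400),
    Server("virginia", load=0.10, distance_km=6500),
]

chosen = pick_server(servers)
print(chosen.name)  # dublin
```

Frankfurt is the closest but too busy, so the request goes to Dublin, the nearest server that still has room.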

Processing the request most of the time requires accessing databases. Let’s suppose you’ve opened facebook.com in your browser and didn’t log out the last time you visited it. So the browser stored the information regarding your last authorization in its “memory” and now requests facebook.com by passing that information (usually a request key in a form like “kjhlkjhjhl1234324WEASDASF34534FDFGDfhhf877”) along with the HTTP-request (in the HTTP header). The Facebook servers authorize the request and fetch the user data that is mapped to the request key specified in the HTTP request.

Authorization involves accessing in-memory databases in order to efficiently retrieve user information.
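A dictionary is the simplest in-memory database there is, so it can stand in for the real thing in a Python sketch. The names log_in and authorize, as well as the “session=” cookie format, are assumptions for illustration and not Facebook’s actual scheme.

```python
import secrets

# Request key -> user id, kept in memory (a stand-in for an in-memory database).
sessions = {}


def log_in(user_id):
    """Create a hard-to-guess request key and remember whom it belongs to."""
    key = secrets.token_urlsafe(32)
    sessions[key] = user_id
    return key


def authorize(headers):
    """Return the user behind the request key in the header, or None."""
    name, _, key = headers.get("Cookie", "").partition("=")
    if name != "session":
        return None
    return sessions.get(key)


key = log_in("valod")
print(authorize({"Cookie": f"session={key}"}))  # valod
print(authorize({"Cookie": "session=bogus"}))   # None
```

The lookup is a single dictionary access, which is why keeping this mapping in memory pays off when millions of requests need to be checked.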

Finally, having the user information (authorizing the request at the same time), now the news feed (the wall, Mr. Snow) can be fetched from the database. To make this retrieval faster, a database index is used, which is usually implemented as a B-tree. The final content of the wall and other user-related data (which didn’t fit in this simplified version of the story) is then sent as an HTTP-response document.
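A real B-tree takes more than a few lines, so the Python sketch below swaps in a simpler relative: a sorted list of keys searched with binary search. It is not a B-tree, but it shows the same idea of an index, i.e. finding the matching rows without scanning the whole table. The posts and the SortedIndex class are made up.

```python
import bisect


class SortedIndex:
    """An index over rows: sorted keys plus the positions of the rows they came from."""

    def __init__(self, rows, key):
        pairs = sorted((key(row), position) for position, row in enumerate(rows))
        self._keys = [k for k, _ in pairs]
        self._positions = [p for _, p in pairs]

    def find(self, value):
        """Return the positions of all rows whose key equals the value."""
        low = bisect.bisect_left(self._keys, value)
        high = bisect.bisect_right(self._keys, value)
        return self._positions[low:high]


posts = [
    {"author": "arya", "text": "A girl has no name."},
    {"author": "valod", "text": "Got a new tattoo."},
    {"author": "jon", "text": "I know nothing."},
    {"author": "valod", "text": "It is a mermaid."},
]

by_author = SortedIndex(posts, key=lambda post: post["author"])
print(by_author.find("valod"))   # [1, 3]
print(by_author.find("nobody"))  # []
```

Binary search needs only a handful of comparisons even for millions of keys; a B-tree gives a similar guarantee while also staying cheap to update and friendly to disk storage.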

Rendering the server response

When the browser receives the HTTP-response from the server, it starts by looking at the header of the document. The header specifies the type of the content, whether it’s HTML text, binary data to download, an image to render on the screen, and so on. Usually, the response contains the HTML text of the web page. The browser discovers this by examining the HTTP-document for the Content-Type header.

Content-Type: text/html
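The decision based on this header can be sketched in Python like this. The three outcomes are a simplification of my own; real browsers know many more content types and have more options than these.

```python
def decide(content_type_header):
    """Pick what to do with a response based on its Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" and normalize the case.
    media_type = content_type_header.split(";")[0].strip().lower()
    if media_type == "text/html":
        return "parse the HTML and render the page"
    if media_type.startswith("image/"):
        return "decode the image and draw it"
    return "offer the data as a download"


for header in ["text/html", "text/html; charset=utf-8", "image/png", "application/octet-stream"]:
    print(header, "->", decide(header))
```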

Now the browser has to parse the HTML and show the buttons, input boxes, tables, links, images, etc. on the page. The browser constructs the so-called DOM to conveniently manage and render the web page. The DOM (Document Object Model) defines the model for the document, which can be pictured as a tree consisting of HTML elements.

The image above illustrates the DOM constructed from the following HTML code:

<html>
  <head>
    <title>My title</title>
  </head>
  <body>
    <h1>A heading</h1>
    <a href="">Link text</a>
  </body>
</html>
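To get a feeling for how a tree grows out of that text, here is a Python sketch built on the standard html.parser module. It is nowhere near a real DOM: the Node and TreeBuilder classes are made up, attributes are ignored, and the code assumes every opening tag has its closing tag, as in the HTML above.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Build a tree of nodes while the parser walks through the HTML text."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self._current)
        self._current.children.append(node)
        self._current = node  # go one level down

    def handle_endtag(self, tag):
        if self._current.parent is not None:
            self._current = self._current.parent  # go one level up

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip the whitespace between the tags
            self._current.children.append(Node(f'"{text}"', self._current))


def outline(node, depth=0):
    """Return the tree as a list of indented lines."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


html = """<html>
  <head>
    <title>My title</title>
  </head>
  <body>
    <h1>A heading</h1>
    <a href="">Link text</a>
  </body>
</html>"""

builder = TreeBuilder()
builder.feed(html)
tree = outline(builder.root)
print("\n".join(tree))
```

The printed outline has the document at the top, html under it, head and body under html, and the texts hanging off their elements as leaves.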

The HTML is the declaration of how the content should be rendered by the browser. Besides HTML, the browser deals with CSS and JavaScript as well. The CSS specifies the styling of the elements.

JavaScript

JavaScript adds some dynamicity to elements. The browser represents a mini version of the OS that runs a JavaScript program inside of it. The OS doesn’t actually know about the JavaScript and even about the browser’s capability to run a “program”. It’s like renting an apartment from someone and renting a room out to someone else without the knowledge of the landlord.

For example, Facebook’s main page won’t make a login request unless you type the login and the password in the input fields. It must have a JS-code that checks if the fields are not empty before enabling the “Submit” button. Take a look at the following JS blah-code (a blah-code can be ill-formed, never mind)

function checkFields() {
  let login = document.getElementById("login");
  let pass = document.getElementById("password");
  // enable the button only when both fields are filled in
  if (login.value !== "" && pass.value !== "") {
    document.getElementById("submit_button").disabled = false;
  } else {
    document.getElementById("submit_button").disabled = true;
  }
}

The browser takes care of the code. It creates an environment for the JS to be executed and provides access to the elements rendered from the HTML text fetched from the server (the JS code is fetched from the server as well). To achieve the finest result, the browser contains a fully functional JavaScript virtual machine, such as Google’s V8. It starts running the JS by interpreting it, which means that instructions are executed one by one by the virtual machine, and it compiles the code that runs often into machine code on the fly (so-called just-in-time compilation). Either way, the final executor of every command must be the CPU, so the translation is done by the virtual machine, and the browser runs the translated code as its own. Similar to our apartment renting example, it’s like asking the landlord for a spare key and giving it to our guest who rents the room from us.
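The “program inside a program” idea can be imitated in Python, with the Python interpreter playing the role of the virtual machine. This is an analogy under stated assumptions, not how V8 works: the page script is written in Python instead of JS, and the document is a plain dictionary prepared by the host program.

```python
# The "page script", received as plain text just like JS fetched from a server.
script = '''
def check_fields():
    login = document["login"]
    password = document["password"]
    filled = bool(login["value"] and password["value"])
    document["submit_button"]["disabled"] = not filled
'''

# The host (our "browser") prepares the elements the script may touch.
document = {
    "login": {"value": ""},
    "password": {"value": ""},
    "submit_button": {"disabled": False},
}
environment = {"document": document}  # what the script gets to see

code = compile(script, "<page script>", "exec")  # translate the text into bytecode
exec(code, environment)                          # let the virtual machine run it

environment["check_fields"]()
print(document["submit_button"]["disabled"])  # True, the fields are empty

document["login"]["value"] = "valod"
document["password"]["value"] = "mermaid"
environment["check_fields"]()
print(document["submit_button"]["disabled"])  # False, the button is enabled
```

The operating system sees only one process here, the host program, and has no idea that a second program arrived as text and ran inside it.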

In the end, I hope I managed to express most of the beauty of the bits. If you find something missing (don’t), or any issues/errors (just, don’t), let me know in the comments.