How did I get here? — A behind the scenes look at the internet

Do you ever wonder how the internet works?

Sure, we essentially type in a web address and are then transported to that website, but getting you from there to here is actually quite an involved process, and it all happens within a few seconds.

Let’s take a look at how you got here!

When you type ‘https://www.holbertonschool.com’ in a browser and hit enter

Whenever you enter a website name, like “google.com” or “reddit.com”, you’re actually looking for the IP address that is associated with that website. Computers and other devices communicate using IP addresses to identify each other on the internet.
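Before any lookup can happen, the host name has to be separated from the rest of the address. Here is a minimal sketch of that step using Python's standard `urllib.parse` module. A real browser has its own URL parser, so treat this only as an illustration.

```python
from urllib.parse import urlsplit

url = "https://www.holbertonschool.com"
parts = urlsplit(url)

# The scheme tells the browser which protocol to speak.
print(parts.scheme)    # https
# The host name is the part that has to be translated into an IP address.
print(parts.hostname)  # www.holbertonschool.com
```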

But all those numbers are difficult to remember, so websites are given domain names that are easy for humans to remember. Linking domain names to IP addresses is the job of the Domain Name System (DNS). So whenever we request a website, the domain name gets sent to the DNS, which has to find the matching IP address and return it to the user. If a website has been visited before, the matching IP address is usually cached by the user’s web browser or operating system. If that’s the case, the cached IP address is returned right away.

However, if it’s the first time visiting a website, a few more steps need to be taken before the website can be accessed. When both the browser and operating system are unable to find a match for the domain name, they send the request over to the resolver (typically run by your internet service provider), which checks its own cache of domain names and IP addresses. If nothing is found, it passes the requested domain name over to what is known as the root server.

A good way to imagine this scenario:

You’re looking for a phone number. You’ve got the name of the person you’re trying to contact, but don’t know how to find their number, so you consult a phone book, which has all the names organized alphabetically by last name.
You can think of the phone book as the root server. It knows where to point the resolver when it’s looking for a specific domain name.

Say we’re trying to find a phone number for someone with the last name “James”. In DNS terms, we’re trying to find a website that ends with “.com”.

So, we consult the index of the phone book to find where the entries starting with “J” begin. We then flip to that page and begin looking for the matching name and phone number.
In DNS terms, the root server sends the resolver to the Top-Level Domain (TLD) server for “.com”. The TLD server knows which authoritative name server holds the records for the requested domain, and sends the resolver there.

Once the resolver arrives at the appropriate server, the authoritative name server finds the matching IP address and returns it to the resolver. The resolver hands the address back to the operating system, which passes it to the browser, and then the website can be accessed.
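The whole lookup chain can be sketched as a toy model. Everything here is an assumption made for illustration: the dictionaries stand in for real servers, the names `tld-com` and `ns-example` are hypothetical, and the address comes from a range reserved for documentation. Real caches also expire their entries after a while, which this sketch ignores.

```python
# Toy model of the lookup chain. Nothing here touches the network.
browser_cache = {}
os_cache = {}
resolver_cache = {}

ROOT_SERVER = {"com": "tld-com"}                           # TLD -> TLD server
TLD_SERVERS = {"tld-com": {"example.com": "ns-example"}}   # domain -> name server
AUTHORITATIVE = {"ns-example": {"www.example.com": "192.0.2.10"}}


def resolve(hostname):
    """Return (ip, steps), where steps lists every place that was consulted."""
    steps = []
    for name, cache in (("browser cache", browser_cache),
                        ("OS cache", os_cache),
                        ("resolver cache", resolver_cache)):
        steps.append(name)
        if hostname in cache:
            return cache[hostname], steps

    labels = hostname.split(".")
    tld = labels[-1]
    domain = ".".join(labels[-2:])

    steps.append("root server")
    tld_server = ROOT_SERVER[tld]
    steps.append("TLD server")
    name_server = TLD_SERVERS[tld_server][domain]
    steps.append("authoritative name server")
    ip = AUTHORITATIVE[name_server][hostname]

    # Every layer remembers the answer for next time.
    for cache in (resolver_cache, os_cache, browser_cache):
        cache[hostname] = ip
    return ip, steps


first_ip, first_steps = resolve("www.example.com")
second_ip, second_steps = resolve("www.example.com")
print(first_ip, first_steps)
print(second_ip, second_steps)
```

The first call walks all the way to the authoritative name server. The second call stops at the browser cache, which is why a site you have visited before tends to resolve faster.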

Actually, not just yet...

Now that the browser has found the proper IP address for the website, it has to send a request to the web server that’s hosting the website we want to access. To do so, the browser and web server have to follow a specific set of protocols to establish a proper connection. This set of protocols is known as the Transmission Control Protocol / Internet Protocol (TCP/IP) suite, and TCP opens every connection with a three-step handshake.
It works like this:

  1. The client (your computer) sends a packet (SYN) to the server over the internet, asking whether it’s open to receive connections.
  2. The host receives the packet and sends back its own packet (SYN-ACK) to acknowledge that it received the client’s packet.
  3. The client receives this new packet and sends one more packet (ACK) back to the host, acknowledging that it received the host’s reply.

Now the connection has been established and data can be transferred between the host and the client.
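Here is a small sketch of the three packets, assuming made-up starting sequence numbers (real ones are chosen unpredictably). In practice your operating system does all of this for you the moment a program opens a TCP connection, so the `Segment` class and the `three_way_handshake` function are purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    flags: str   # "SYN", "SYN-ACK" or "ACK"
    seq: int     # the sender's sequence number
    ack: int     # next sequence number the sender expects (0 if unused)


def three_way_handshake(client_isn, server_isn):
    """Return the three segments exchanged to open a connection."""
    syn = Segment("SYN", seq=client_isn, ack=0)
    # The server acknowledges the client's SYN and sends its own SYN.
    syn_ack = Segment("SYN-ACK", seq=server_isn, ack=syn.seq + 1)
    # The client acknowledges the server's SYN.
    ack = Segment("ACK", seq=syn.seq + 1, ack=syn_ack.seq + 1)
    return [syn, syn_ack, ack]


segments = three_way_handshake(client_isn=1000, server_isn=5000)
for segment in segments:
    print(segment)
```

Each side acknowledges the other's starting number plus one, which is how both ends know the other side really received their packet.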

Are we there yet?

Well, no.

The next thing that happens is that we have to pass through the firewall before we can access any of the data. Firewalls are there to monitor incoming and outgoing traffic, filtering out unwanted connections and letting others pass through. When we want to connect to a website, we will typically try to connect via HTTP (port 80) or HTTPS (port 443).
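As a rough sketch, the simplest possible firewall rule is an allow list of ports. Real firewalls look at much more than the port (source address, protocol, connection state), so the rule set and the `is_allowed` helper below are assumptions for illustration only.

```python
ALLOWED_INBOUND_PORTS = {80, 443}  # HTTP and HTTPS


def is_allowed(destination_port):
    """Return True if a connection to this port should be let through."""
    return destination_port in ALLOWED_INBOUND_PORTS


# Port 22 (commonly used for SSH) is not on the list, so it gets blocked.
for port in (80, 443, 22):
    verdict = "allow" if is_allowed(port) else "block"
    print(port, verdict)
```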

HTTP stands for Hypertext Transfer Protocol, and is the format in which data is sent between the client computer and the host. When this type of data is being sent back and forth, it’s sent in plain text. For example, if you were to type in your password and send it over to the server, that data would literally contain your password. It’s not very safe, as anyone who intercepts that traffic can read your username and password straight out of it.
HTTPS is essentially a secure version of HTTP. When you send your password with HTTPS, it is encrypted before it leaves your computer, and is only decrypted when it reaches the server.
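To make the plain-text point concrete, here is a sketch that builds the text of a login request by hand. The path `/login`, the host `www.example.com`, and the form fields are all hypothetical. Nothing is actually sent anywhere.

```python
from urllib.parse import urlencode

# Hypothetical form fields, used only for illustration.
form = {"username": "alice", "password": "hunter2"}
body = urlencode(form)

request = (
    "POST /login HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    f"Content-Length: {len(body)}\r\n"
    "\r\n"
    f"{body}"
)

# These are the bytes that would travel over a plain HTTP connection.
wire_bytes = request.encode("ascii")
print(wire_bytes.decode("ascii"))
print(b"hunter2" in wire_bytes)  # True
```

Over plain HTTP the password sits right there in the bytes on the wire. Over HTTPS the same request is encrypted first, so an eavesdropper only sees scrambled data.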

But to successfully encrypt and decrypt information, another protocol has to come into play. The Secure Sockets Layer (SSL) protocol, since succeeded by Transport Layer Security (TLS), is essentially a way of encrypting communication between the server and the computer by using “keys”. The private key is retained by the host, and the public key is distributed to computers that want to access that server, packaged in a certificate. The client uses the certificate to confirm it is talking to the right server, and the two sides then agree on a shared session key. Once that is done, a secure connection has been established.
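On the client side, Python's standard `ssl` module (which, despite its name, speaks TLS, the successor to SSL) shows what a careful client insists on. This is only a sketch of the client's settings. It does not open a connection.

```python
import ssl

# A client-side context with the standard library's recommended defaults.
context = ssl.create_default_context()

# The client requires the server to present a certificate...
print(context.verify_mode == ssl.CERT_REQUIRED)  # True
# ...and checks that the certificate matches the host name it asked for.
print(context.check_hostname)                    # True
```

To actually secure a connection, a program would wrap an already connected socket with `context.wrap_socket(sock, server_hostname=...)`, and the key exchange happens during that call.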

Alright, how about now?

Sorry, not just yet!

Now that we’ve got a connection established, we will probably have to pass through what’s known as the load balancer. The job of the load balancer is to direct the traffic coming into a website to the appropriate web server.
Think of it like this:

If a website receives a high volume of traffic each day, there’s a good chance that the website will be hosted across multiple servers. Hosting a website on multiple servers is a common practice for large websites like Google and YouTube because it ensures that if something catastrophic were to happen to one of their servers, the other servers would still be able to handle the incoming requests and provide service. If all the information and data were stored on one single server, and that server crashed, then the website would be offline, leading to potential loss of revenue.

So, that’s where the load balancer comes in. It essentially checks which servers are active, what volume of traffic those servers are experiencing, and then makes the decision to send the incoming request to the server that is best suited to handle that request.
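One way to make that decision is to pick the active server with the fewest open connections. That is just one strategy among several (round robin is another common one), and the server names and numbers below are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    healthy: bool
    active_connections: int


def pick_server(servers):
    """Pick the healthy server with the fewest active connections."""
    candidates = [s for s in servers if s.healthy]
    if not candidates:
        raise RuntimeError("no healthy servers available")
    return min(candidates, key=lambda s: s.active_connections)


pool = [
    Server("web-01", healthy=True, active_connections=12),
    Server("web-02", healthy=False, active_connections=0),
    Server("web-03", healthy=True, active_connections=4),
]

chosen = pick_server(pool)
chosen.active_connections += 1  # the new request now counts against it
print(chosen.name)  # web-03
```

Note that `web-02` is skipped even though it has zero connections, because it failed the health check.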

Alright, we have successfully reached the web server, and now the website’s “static” content will be sent to the web browser, which will then display it to the user. What a journey!

Finally! Wait, what do you mean static content…?

Yeah, so…

We all know that we use websites to do specific things. We want to log in to our account, purchase things, talk to our friends, etc. Well, static content can’t exactly do any of that. Static content is like the welcome screen of a video game. To start the game, you need to hit start. Same thing with websites. If we want to access our accounts, we need to enter our information and submit it, and the website will then retrieve our information and populate our screen with content that is related to our account.

This type of content is known as dynamic content and to retrieve and generate dynamic content a few more steps need to be taken.

For example, if we wanted to sign up for a website, we would fill out the information form and hit the submit button. By hitting submit, you are most likely sending an HTTP request with the POST method. This tells the web server that you are sending information that you want to store on the website. But you’re not exactly storing this information on the website; you’re actually storing it in a database that is connected to the website.

To access the database, the information gets passed to the application server, which determines that it needs to access the database and stores the information there. Now that we’ve signed up for the website, we will be presented with new content that is relevant to the information we signed up with. To generate this content, the application server reaches into the database, pulls out any relevant information, converts it into HTML, and sends it back to the web server, which then sends it back to the browser.
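Here is a tiny sketch of that round trip, using an in-memory SQLite database in place of a real database server. The table layout, the form fields, and the `handle_signup` and `render_profile` helpers are all hypothetical. A real application server would sit behind a web framework rather than being called directly.

```python
import sqlite3
from html import escape

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (username TEXT PRIMARY KEY, city TEXT)")


def handle_signup(form):
    """Store the submitted form, as a POST handler might."""
    db.execute(
        "INSERT INTO users (username, city) VALUES (?, ?)",
        (form["username"], form["city"]),
    )
    db.commit()


def render_profile(username):
    """Read the user back out of the database and turn it into HTML."""
    row = db.execute(
        "SELECT username, city FROM users WHERE username = ?", (username,)
    ).fetchone()
    if row is None:
        return "<p>Unknown user</p>"
    name, city = row
    # escape() keeps user-supplied text from being treated as HTML.
    return f"<h1>Welcome, {escape(name)}!</h1><p>City: {escape(city)}</p>"


handle_signup({"username": "alice", "city": "Tulsa"})
page = render_profile("alice")
print(page)
```

The HTML that comes back depends on what is in the database, which is exactly what makes the content dynamic instead of static.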

That’s pretty much how it all works!

I hope you learned a bit about how this all works. Here’s a sketch I made to illustrate this concept. Thanks for reading!
