How the Internet Works
A behind-the-screens look at what happens when you type a URL into a web browser.
Holberton School instills in its students that full-stack software engineers should understand every aspect of the stack, including how it works on top of the internet. So let us dive now into the web world!
While researching this topic, I listened to an instructor from LearnCode.academy admit that it took him at least five years, if not more, to figure out, along the way, how web infrastructure and the internet work.
Now, that being said, I don’t have the years of wisdom and on-the-job training that some of these individuals do. But I can tell you that I have spent hours researching, self-teaching, implementing, and learning from those who have spent years doing this stuff. From that, I am now able to deliver to y’all a very high-level rendition of what’s happening under the hood when we type https://www.holbertonschool.com into our web browsers!
Links will be provided throughout the post to assist in offering everyone an even deeper knowledge and understanding of topics I may not be able to go into in this post ;)
We gotta start somewhere though, so let’s start with the information we do have, which is a URL (Uniform Resource Locator)…
This URL leads us into our discussion about Internet addresses.
To put it simply, if you want to roll around with everyone else on the internet, you need to have an IP address. See below the proper form of an IP address:
May I remind you that anyone who wants to roll around the internet must have an IP address. This means the user’s computer, as well as the destination computer (the web server that’s hosting the website), must have IP addresses in order to make a connection and exchange data.
If a user is connecting to the internet via an ISP (internet service provider), they will be assigned a temporary IP address that will last the duration of the user’s session. If the user is connecting through a LAN (local area network), the IP address could be a permanent one or a temporary one from a DHCP (Dynamic Host Configuration Protocol).
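To make the "proper form" of an IP address concrete, here's a small sketch using Python's standard `ipaddress` module. The two addresses shown are illustrative examples, not addresses from this post:

```python
# IPv4 addresses are four dot-separated numbers (0-255);
# IPv6 addresses are up to eight colon-separated hex groups.
import ipaddress

v4 = ipaddress.ip_address("192.168.1.10")  # a typical private LAN address
v6 = ipaddress.ip_address("2001:db8::1")   # an IPv6 documentation address

print(v4.version, v4.is_private)  # 4 True  (private = assigned inside a LAN)
print(v6.version)                 # 6
```

Notice that the module also tells you whether an address is private, like the temporary LAN/DHCP addresses described above, or public and routable on the internet.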
So here’s the real question: how can we type a URL into our web browser and still reach a website? This, my friends, is where DNS (the Domain Name System) steps in to take over.
The DNS is a distributed database that keeps track of computers’ names and their corresponding IP addresses on the Internet.
Old-school analogies like to compare DNS to a giant phone book. Nowadays, it’s probably more comparable to a smartphone’s contact list. A contact list matches names to phone numbers and emails, the way DNS matches domain names to IP addresses, except DNS does it for the whole world.
Any time a domain name is entered into a browser, it checks a few caches locally and/or in routers before a recursive DNS query must be made, starting at a root server and ending at the authoritative DNS server, which holds the IP address the searched domain name points to.
Below is the DNS server hierarchy which shows how DNS requests get mapped to their final destination server that is storing the correct IP address where the domain name should point to:
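You can watch this resolution happen from code by asking the operating system's resolver, which walks the caches and DNS hierarchy described above. A minimal sketch using Python's `socket` module (real results for any public domain vary and change over time, so the live lookup is left commented out):

```python
# Ask the OS resolver for the IPv4 addresses behind a domain name.
import socket

def resolve(domain):
    """Return the sorted IPv4 addresses a domain name currently points to."""
    infos = socket.getaddrinfo(domain, 80, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Each entry's sockaddr is (ip, port); collect the unique IPs.
    return sorted({info[4][0] for info in infos})

print(resolve("localhost"))  # ['127.0.0.1'] — resolved without leaving the machine
# print(resolve("www.holbertonschool.com"))  # requires network access
```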
FUN FACT: In the 1970s and early ’80s, before DNS was pioneered, one individual by the name of Elizabeth Feinler, at the Stanford Research Institute, maintained a master list of every Internet-connected computer in a text file called HOSTS.TXT. However, because Elizabeth only handled requests before 6:00 pm (Pacific) and didn’t work on Christmas, a new system was conceived, which is the DNS we refer to today.
It’s still possible to type the specific IP address into a web browser and have it reach the same website.
Below I provided an example of this since I happened to know the IP address my brerickner.tech domain name is registered to point to. As you can see, they both serve up the same static HTML page that displays “Holberton School”.
In terms of networking, a protocol corresponds to a set of rules which govern how systems communicate with each other. Every computer needs a protocol stack in order to communicate on the internet. Usually, they are built into a computer’s operating system.
Communication starts off as alphabetic text, which must be translated into an electronic signal before it can be transmitted across the internet. Once it arrives at its destination, this electronic signal must be translated back into alphabetic text. This communication is regulated on the internet by TCP/IP (Transmission Control Protocol / Internet Protocol).
Although there are other communication protocols in the TCP/IP protocol suite, TCP and IP are the two major ones in use, hence the name. TCP/IP is non-proprietary and easily modified; it’s compatible with all operating systems, hardware, and networks, and it’s highly scalable.
Imagining TCP/IP in action would be like watching an entire document start as a whole, get broken down into what appear to be several incomplete packets of data, poof across the network as electrical signals, then recompile itself back into its original form. The entire functionality of TCP/IP is broken down into four different layers, each imposing its own set of protocols.
Below you can see how the different layers break down the functionality of TCP/IP and impose their own set of protocols.
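Here's a toy illustration (not real TCP, just the idea from the paragraph above): a document is split into numbered segments, the segments may arrive out of order, and the receiver reassembles them by sequence number:

```python
def segment(data: bytes, size: int):
    """Break a document into (sequence_number, chunk) packets."""
    return [(seq, data[i:i + size])
            for seq, i in enumerate(range(0, len(data), size))]

def reassemble(segments):
    """Sort packets by sequence number and glue the chunks back together."""
    return b"".join(chunk for _, chunk in sorted(segments))

doc = b"Hello, Holberton! This document crosses the network."
packets = segment(doc, 8)
packets.reverse()                  # simulate out-of-order arrival
assert reassemble(packets) == doc  # the original document is recovered
```

Real TCP adds much more (acknowledgments, retransmission, checksums), but sequencing and reassembly are the part that makes the "poof across the network, then recompile" trick possible.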
Connections must be mutual and require a TCP handshake. That involves a client making a request through a specified port to have a website served up using the HTTP protocol. A public server opens the connection for the client, and once the connection is established, the content is served up for the client to download into their browser. The keep-alive header value indicates how long the connection will remain open before it times out or is terminated by the server. There is one important factor to consider when requesting to connect to other devices over the internet, and that is, of course, security.
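For a sense of what actually crosses the wire once the TCP handshake completes, here's the plain-text HTTP/1.1 request a browser sends, including the keep-alive hint mentioned above (the header values shown are illustrative):

```python
# An HTTP/1.1 request is just text: a request line, headers, and a blank line.
request = (
    "GET / HTTP/1.1\r\n"
    "Host: www.holbertonschool.com\r\n"
    "Connection: keep-alive\r\n"   # ask the server to keep the TCP connection open
    "\r\n"                         # blank line marks the end of the headers
)
print(request)
# A server might answer with a header like "Keep-Alive: timeout=5, max=100",
# meaning the connection closes after 5 idle seconds or 100 requests.
```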
Earlier, when I was discussing the TCP handshake, I mentioned clients requesting access through specified ports. I’m going to dive a little bit further into this and how relevant it is in the discussion of secure and not secure. When the request for an HTTP connection is made, if no port is specified, the default port used is port 80. To put it simply, this is just a plain ole HTTP request.
Have fun being hacked.
If you want to communicate securely, it is done on port 443, which allows the transmission of data over a secure connection. This security involves encrypting data in a top-secret language that can only be translated with access to a private key held by the server. Look up the TLS handshake to learn more.
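A hedged sketch of opening that secure port-443 connection with Python's `ssl` module: the default context verifies the server's certificate, and the TLS handshake negotiates the keys used for encryption. The domain in the commented-out call is just an example:

```python
import socket
import ssl

def fetch_cert_subject(host, port=443, timeout=5):
    """Connect over TLS and return the server certificate's subject field."""
    context = ssl.create_default_context()  # verifies certificates by default
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # wrap_socket performs the TLS handshake; server_hostname enables
        # hostname checking against the certificate.
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["subject"]

# fetch_cert_subject("www.holbertonschool.com")  # requires network access
```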
Below are different browsers and how they look with secure and not secure URL requests:
Having a secure connection cuts down the risk of a man-in-the-middle attack, ensuring users are safe sharing their personal information without fears of it being compromised along the way.
Google actually goes as far as to say:
“Users should expect that the web is safe by default, and they’ll be warned when there’s an issue.”
Deploying a website in HTTPS is remarkably easy. So easy, in fact, it can be done without installing anything or writing a lick of code. Just do a quick Google search or even follow the instructions provided in the resource I linked at the beginning of this paragraph!
On the server-side of things, there are different ways a programmer can go about securing the websites that are being served up by the web server. Not all of them are the right way of doing it though. The biggest benefit comes when a developer is able to be strategic about where and on which servers to implement the SSL certificate. This decision could help optimize an entire network’s security and efficiency.
Firewalls have been a staple in network security for quite some time, and they don’t appear to be going anywhere anytime soon. However, they should be thought of as nothing more than the bare minimum/standard for network security. A firewall, put simply, is a preconfigured set of rules put into place to dictate which requests will be allowed through to the network servers. Nowadays, monitoring incoming and outgoing traffic against previous network traffic patterns also serves as a means of deciding whether requests will be allowed or denied.
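That "preconfigured set of rules" can be sketched in a few lines — each incoming request is checked against the rules in order, and the first match decides whether it gets through (ports and actions here are made-up examples, not a real firewall config):

```python
# Ordered firewall rules: first match wins; a port of None matches anything.
RULES = [
    {"port": 443, "action": "allow"},   # HTTPS traffic
    {"port": 80,  "action": "allow"},   # plain HTTP traffic
    {"port": None, "action": "deny"},   # default: deny everything else
]

def filter_request(port):
    """Return the action of the first rule matching the request's port."""
    for rule in RULES:
        if rule["port"] in (port, None):
            return rule["action"]

assert filter_request(443) == "allow"   # HTTPS gets through
assert filter_request(22) == "deny"     # SSH from the outside does not
```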
A web server can be a physical machine or a virtual machine, but what makes a web server different from a regular server is the software installed. At a minimum, this software controls how web users are able to access any files being hosted. Another important feature is being able to understand URLs and HTTP (the protocol browsers use to view webpages).
Imagine if you had to have your own computer up and running in order for clients to successfully request content from your website. Eww. Hosting a website on a dedicated web server is optimal for several reasons, starting with the fact that these servers are dedicated to providing web content to users. Every setting is optimized to perform this function, which could mean faster response times, better website performance, and all-around more reliability.
When there’s a website generating a lot of traffic, like Facebook or Twitter, and an excessive number of requests being made at any given time, web servers become even more essential. In fact, this is a prime example of when a website would benefit the most from being hosted on a dedicated web server. And why stop at just one?
If performance is being compromised due to an influx of requests to a website, that is the best indicator that more than one web server would be optimal for handling them. Especially when these requests are more cumbersome and involve on-the-fly processing to dynamically update content before sending a customized response back to the client who made the request. The best way to handle multiple requests being made to a website is through a load-balancing mechanism.
Another thing that high traffic sites like Facebook and Twitter have in common is their implementations of application servers that allow dynamic content to be served in responses to client or application requests. This means that when a request comes through, it will be updated on the fly using HTML templates that get filled using data stored about the client making the request. When websites store user data to actively retrieve later, it must be stored in an application database, as opposed to a static database that web servers use.
Even though web servers are only capable of serving up static web pages, the same does not go for application servers. Application servers are not limited to static content; they are set up to serve both static content and business logic interchangeably, whenever and wherever they are needed.
Application servers are also equipped to handle multiple requests in parallel, known as multi-threading, which, for instance, would allow a user’s name and password to be saved so they can stay logged in to a website or application. This is essential for the communication exchange occurring between web and mobile applications.
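To make "filling an HTML template on the fly" concrete, here's a tiny sketch of what an application server does before sending a customized response — merging stored user data into a template. The template, user data, and function name are all made up for illustration:

```python
# Merge per-user data into an HTML template to produce a dynamic response.
from string import Template

PAGE = Template("<h1>Welcome back, $name!</h1><p>Last visit: $last_visit</p>")

def render_profile(user):
    """Fill the page template with this user's stored data."""
    return PAGE.substitute(name=user["name"], last_visit=user["last_visit"])

html = render_profile({"name": "Brennan", "last_visit": "2019-05-01"})
print(html)
# <h1>Welcome back, Brennan!</h1><p>Last visit: 2019-05-01</p>
```

Real application servers use far richer template engines, but the principle is the same: the HTML skeleton is static, and the data pulled from the application database customizes it per request.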
Load Balancers are exactly what they claim to be, a resource meant for distributing the workload(requests) amongst the web servers available to handle HTTP requests. In my experience, the public address associated with the domain name registered through DNS points to the load balancer. This cuts off almost any interaction client requests have with back-end servers.
This provides security and promotes efficiency amongst the back-end servers, due to the CPU power saved by the load balancer catching and distributing requests so that no one server works harder than the others. This allows for consistent performance quality for each and every user, because each request will be redirected to the server most optimized to handle it.
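One common distribution strategy — and a typical default for load balancers — is round-robin, which hands requests to each server in turn so the workload spreads evenly. A toy sketch (the server names are placeholders):

```python
# Round-robin load balancing: rotate through the pool of back-end servers.
from itertools import cycle

SERVERS = ["web-01", "web-02", "web-03"]
next_server = cycle(SERVERS)

def route(request):
    """Hand the incoming request to the next server in the rotation."""
    return next(next_server)

assignments = [route(f"req-{i}") for i in range(6)]
print(assignments)
# ['web-01', 'web-02', 'web-03', 'web-01', 'web-02', 'web-03']
```

Other algorithms (least-connections, IP hash, weighted round-robin) pick servers differently, but the goal is the same: no single back-end server shoulders the whole load.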
Load balancers are a great place to implement network security since they sit on the very edge of the network and can be used as the first line of defense when letting requests through. Configuring the SSL certificate on the load balancer allows requests to be encrypted and decrypted at the edge of the network, before reaching the back-end servers. This improves the efficiency of the web servers by freeing up CPU power to keep retrieving and serving responses to requests.