Unveil the magic behind web surfing
“What happens when you type [a website] in your browser and press Enter”: a recurring question in job interviews that sums up a fair part of a Software Engineer’s knowledge.
Everybody browses the Internet, but few bother to wonder what actually happens behind the scenes.
Before we begin our journey into web infrastructure, I’d like to present the client-server model, since communication on the Internet is based on this model.
How does the model work?
- Client: any connected device (laptop, smartphone …) using a service from the service providers (Servers).
- Server: a connected device which provides information or access to some services. A Server can be either hardware or software. Usually, a Web Server hosts the requested website.
- Protocol: a general term for a system of rules, or method, for transmitting data between two devices. The Transmission Control Protocol/Internet Protocol (TCP/IP) suite is the de facto standard on the Internet.
The Client establishes a connection to the Server over a local area network (LAN) or wide-area network (WAN), such as the Internet. Once the Server has fulfilled the Client’s request, the connection is terminated.
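To make the request/response cycle above concrete, here is a minimal sketch in Python. It assumes a pair of already-connected sockets (`socket.socketpair()`) as a stand-in for a real network connection, and the function names `serve_once`, `request_once` and `demo` are made up for this illustration.

```python
import socket
import threading


def serve_once(conn: socket.socket) -> None:
    """Server side: read one request, send one response, then close."""
    with conn:
        request = conn.recv(1024).decode()
        conn.sendall(f"Hello, you asked for: {request}".encode())


def request_once(conn: socket.socket, message: str) -> str:
    """Client side: send one request, then read until the server closes."""
    with conn:
        conn.sendall(message.encode())
        chunks = []
        while True:
            data = conn.recv(1024)
            if not data:  # empty read: the server terminated the connection
                break
            chunks.append(data)
        return b"".join(chunks).decode()


def demo(message: str = "index.html") -> str:
    # socketpair() returns two connected sockets; no real network is involved.
    client_sock, server_sock = socket.socketpair()
    server = threading.Thread(target=serve_once, args=(server_sock,))
    server.start()
    reply = request_once(client_sock, message)
    server.join()
    return reply
```

Calling `demo("index.html")` runs one full exchange: the Client sends a request, the Server answers, and the connection is closed.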
Let’s say you want to visit holbertonschool.com. You enter the Uniform Resource Locator (URL) https://www.holbertonschool.com into your web browser.
In the blink of an eye you land on the requested site. Isn’t it magical? As you’ll see, it takes more time to go through some of the underlying concepts than to get to the requested website!
Let’s unveil the magic step by step.
1. Say hello to DNS.
Computers and other devices on the Internet communicate with each other using a unique set of numbers called an IP address. Every device with an active Internet connection has its own IP address, including your smartphone and computer.
If you happen to know the IP address of the requested website, you can type it directly into the browser. But we, as humans, use words. As an analogy, do you remember every phone number you have in your contact list? Therefore, there is a need for a translation process from domain name to IP address.
If you have already visited the requested site, then its IP address may still be stored in your browser’s or your Operating System (OS)’s cache.
“In computing, a cache is a hardware or software component that stores data so future requests for that data can be served faster” — Wikipedia
If the browser cache comes up empty, the OS is consulted next, including its hosts file: /etc/hosts on Linux and macOS machines and C:\Windows\System32\Drivers\etc\hosts on Windows.
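The local lookup order described above can be sketched as follows. This is an illustration only: the sample hosts content, the addresses and the helper names `parse_hosts` and `local_lookup` are assumptions, and real browsers and operating systems apply more rules (expiry times, for instance).

```python
from typing import Dict, Optional

# Made-up hosts-file content, in the usual "IP name [alias ...]" layout.
SAMPLE_HOSTS = """
127.0.0.1   localhost
# a local override
192.0.2.10  intranet.example  intranet
"""


def parse_hosts(text: str) -> Dict[str, str]:
    """Map every hostname found in hosts-file text to its IP address."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name.lower(), ip)  # the first entry wins
    return mapping


def local_lookup(name: str, browser_cache: Dict[str, str],
                 hosts: Dict[str, str]) -> Optional[str]:
    """Check the browser cache, then the hosts mapping; None means 'ask DNS'."""
    return browser_cache.get(name) or hosts.get(name)
```

A `None` result is the case covered next: nothing is known locally, so the question goes to the Domain Name System.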
What if neither the browser nor the OS holds the IP address of a domain name in cache memory? That means that you’ve never been to the site or your cache has been cleared. Enter the Domain Name System.
“The Domain Name System (DNS) is a hierarchical and decentralized naming system for computers, services, or other resources connected to the Internet or a private network … Most prominently, it translates more readily memorized domain names to the numerical IP addresses needed for locating and identifying computer services and devices with the underlying network protocols.” — Wikipedia
DNS is like an address book for websites. Here’s a workflow of a DNS lookup:
a) The OS will ask a recursive DNS server, called a resolver, a server dedicated to resolving domain names, to find the IP address. The resolver is usually operated by your Internet Service Provider (ISP). The resolver checks its cache, and if found the resolution process ends. If not, it will query the next DNS server …
b) … known as the Root DNS Server. Root DNS servers know information about Top Level Domain (TLD) servers.
.com is an example of a TLD. Others include .net and many more. Root servers sit at the top of the DNS hierarchy.
Root DNS servers are scattered around the globe (decentralized) and operated by independent organisations. In our example, the root server will guide us to …
c) … the .com TLD server. When a domain is purchased, the domain registrar reserves the name and communicates to the TLD registry the Authoritative Name Servers (ANS). With the help of the domain registrar, the .com TLD finds the ANS for the domain.
d) The ANS is the one that actually holds a copy of the address book that matches IP addresses with domain names. It will inspect holbertonschool.com’s DNS record for the www subdomain.
That’s the end of the address quest: the ANS provides the A record (the actual IP address) assigned to the requested domain name.
e) If instead of an A record, the ANS returns a Canonical Name record (CNAME record), the DNS resolver will restart the query using the canonical name instead of the original name.
For instance, when querying “holbertonschool.com” the CNAME record of “hbtnweb-prod.us-east-1.elasticbeanstalk.com.” is returned. The DNS resolver will then send a query for that canonical name and will get back three A records (IP addresses of 22.214.171.124, 126.96.36.199 and 188.8.131.52).
“A Canonical Name or CNAME record is a type of DNS record that maps an alias name to a true or canonical domain name … This can prove convenient when running multiple services (like an FTP server and a web server, each running on different ports) from a single IP address. One can, for example, point ftp.example.com and www.example.com to the DNS entry for example.com, which in turn has an A record which points to the IP address. Then, if the IP address ever changes, one only has to record the change in one place within the network: in the DNS A record for example.com.” — Wikipedia
After getting the answer, the recursive DNS resolver sends that information back to the computer (and browser) that requested it.
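Steps a) to e) can be modelled with a toy resolver. Everything in this sketch is invented for illustration: the server names, the `example.com` records, the `192.0.2.10` address and the `resolve` function. Real DNS servers answer over the network and handle many more record types.

```python
from typing import Dict

# Toy data standing in for the three tiers of servers.
ROOT = {"com": "tld-com"}                          # root knows the TLD servers
TLD = {"tld-com": {"example.com": "ns-example"}}   # TLD knows each domain's ANS
ANS = {
    "ns-example": {
        "www.example.com": ("CNAME", "web.example.com"),
        "web.example.com": ("A", "192.0.2.10"),
    }
}


def resolve(name: str, cache: Dict[str, str], max_cname_hops: int = 5) -> str:
    """Return the IP address for name, following CNAME records if needed."""
    if name in cache:                               # a) resolver cache first
        return cache[name]
    current = name
    for _ in range(max_cname_hops):
        tld = current.rsplit(".", 1)[-1]
        tld_server = ROOT[tld]                      # b) root points to the TLD server
        domain = ".".join(current.split(".")[-2:])
        authoritative = TLD[tld_server][domain]     # c) TLD points to the ANS
        record_type, value = ANS[authoritative][current]  # d) ANS holds the record
        if record_type == "A":
            cache[name] = value                     # remember the answer
            return value
        current = value                             # e) CNAME: restart the query
    raise RuntimeError("too many CNAME hops")
```

With an empty cache, `resolve("www.example.com", {})` walks root, TLD and ANS twice (once for the alias, once for the canonical name) before returning the address; a second call with the same cache stops at step a).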
And from there how does the Client communicate with the Server?
Remember the term protocol? There is a need for a standardized protocol that can be implemented by both Client and Server to establish a connection between them.
The Open Systems Interconnection (OSI) model is one such standardized model.
The Open Systems Interconnection model (OSI model) is a conceptual model that characterises and standardises the communication functions of a telecommunication or computing system without regard to its underlying internal structure and technology. Its goal is the interoperability of diverse communication systems with standard communication protocols. — Wikipedia
When it comes to web infrastructure, TCP/IP is the standard, although not mandatory, for communication between a Client and a Server.
2. The TCP/IP protocol
IP specifies the technical format of packets and the addressing scheme for computers to communicate over a network. Most networks combine IP with TCP, a higher-level protocol, which establishes a virtual connection between a destination and a source, so that they can send messages back and forth for a period of time.
TCP, the transport-layer protocol, is defined by its reliability: it acknowledges and retransmits packets (i.e. request/response data) so that they get delivered, even if it takes more time. An alternative transport-layer protocol, User Datagram Protocol (UDP), is faster but less reliable: packet delivery is not double-checked.
UDP is typical of streaming services, where instant content takes priority; TCP is used almost everywhere else.
When you request a web page in your browser, your computer sends TCP packets to the web server’s IP address, asking it to send the web page back. The web server responds by sending a stream of TCP packets, which your web browser stitches together to form the web page.
First, TCP orders packets by numbering them. Second, it error-checks by having the recipient send a response back to the sender saying that it has received the message. If the sender doesn’t get a correct response, it can resend the packets to ensure the recipient receives them correctly.
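The two mechanisms just described, numbering and acknowledgement with resending, can be simulated in a few lines. This is a simplified model, not real TCP: it assumes whole segments are numbered (TCP actually numbers bytes), and the function name, `loss_rate` and `seed` parameters are invented for the sketch.

```python
import random


def send_reliably(message: bytes, segment_size: int = 4,
                  loss_rate: float = 0.3, seed: int = 1) -> bytes:
    """Deliver message over a simulated lossy, reordering network."""
    rng = random.Random(seed)
    # Sender: cut the message into numbered segments.
    segments = {
        seq: message[start:start + segment_size]
        for seq, start in enumerate(range(0, len(message), segment_size))
    }
    received = {}             # receiver side: sequence number -> data
    unacked = set(segments)   # sender side: segments not yet acknowledged

    while unacked:
        batch = list(unacked)
        rng.shuffle(batch)                 # the network may reorder segments
        for seq in batch:
            if rng.random() < loss_rate:
                continue                   # lost: no ack, so it is resent later
            received[seq] = segments[seq]
            unacked.discard(seq)           # ack received by the sender

    # Receiver: stitch the segments back together in sequence order.
    return b"".join(received[seq] for seq in sorted(received))
```

However many segments get lost or shuffled along the way, the receiver ends up with the original bytes in the original order.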
Summary, round 1
So far you enter a URL, holbertonschool.com, and there is a whole DNS lookup process to get the corresponding IP address, and with that address, the Client is able to establish a connection with the Server thanks to the TCP/IP protocol.
So far so good, but how about security issues? Can both hosts exchange sensitive information (such as passwords, credit card numbers etc …) over the connection?
Well, do you see a padlock icon in the address bar? That would indicate that an HTTPS connection using an SSL (Secure Sockets Layer) certificate is in effect. You can click on the padlock to see the details of the certificate, including the issuing authority and the corporate name of the website owner.
With HTTPS, the data carried over the TCP/IP connection is encrypted, so everything exchanged during the connection is protected.
How does it work?
3. HyperText Transfer Protocol Secure / Secure Sockets Layer (HTTPS / SSL)
Let’s first define HyperText Transfer Protocol (HTTP).
HTTP is an application-level protocol of the TCP/IP suite, used to deliver data (HTML files, image files, query results, etc…) on the World Wide Web (WWW).
HTTP messages are of two types: request and response. HTTP specifies how client requests (such as GET, POST, DELETE, PUT) are constructed and sent to the server, how servers respond to these requests, and how messages are interpreted.
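Both message types are plain text with a fixed layout: a first line, header lines, a blank line, then an optional body. The sketch below builds a request and parses a response by hand to show that layout; the helper names are made up, and real clients rely on a library such as Python’s `http.client` instead.

```python
from typing import Dict, Optional, Tuple


def build_request(method: str, host: str, path: str = "/",
                  headers: Optional[Dict[str, str]] = None,
                  body: bytes = b"") -> bytes:
    """Serialize an HTTP/1.1 request: request line, headers, blank line, body."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    if body:
        lines.append(f"Content-Length: {len(body)}")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii") + body


def parse_response(raw: bytes) -> Tuple[int, str, Dict[str, str], bytes]:
    """Split a raw HTTP response into status code, reason, headers and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body
```

For example, `build_request("GET", "www.holbertonschool.com")` yields the bytes of a two-line GET request followed by the blank line that ends the headers.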
HTTP transfers data in clear text with no encryption or protection, so it does not protect against a man-in-the-middle attack.
This is where HTTPS comes in (with the “S” for Secure). It adds an SSL / Transport Layer Security (TLS) protocol to HTTP. This protocol uses asymmetric encryption during the handshake to agree on symmetric session keys, which then encrypt the data, making the information exchanged unreadable for a third party and securing the connection. It also proves the identity of the holder of the corresponding SSL / TLS certificate.
The activation of the HTTPS protocol causes a padlock to appear next to the URL in the address bar, which Internet users are now used to. It appears when a website is protected by an SSL certificate or a TLS certificate.
To display a website in HTTPS, a company must first obtain an SSL certificate. SSL is the technology that secures data exchanges between the browser and the server. Obtaining an SSL certificate leads to the activation of the SSL protocol, authorizing the site to open a connection in HTTPS.
The TLS certificate is the successor of the SSL certificate. TLS is a more secure version of SSL that works on the same principle. By convention, the term “SSL certificate” is still used rather than “TLS certificate”, even though the protocol is indeed TLS.
The TLS handshake
TLS handshakes occur whenever a Client queries a website over HTTPS, after a TCP connection.
The steps within a TLS handshake will vary depending upon the kind of key exchange algorithm used and the cipher suites supported by both sides. The RSA key exchange, common in TLS 1.2 and earlier (TLS 1.3 replaced it with Diffie-Hellman-based key exchange), goes as follows:
a) The ‘client hello’ message: The client initiates the handshake by sending a “hello” message to the server. The message includes the TLS version the client supports, the cipher suites supported, and a string of random bytes known as the “client random.”
b) The ‘server hello’ message: In reply to the client ‘hello message’, the server sends a message containing the server’s SSL certificate, the server’s chosen cipher suite, and the “server random,” another random string of bytes that’s generated by the server.
c) Authentication: The client verifies the server’s SSL certificate with the certificate authority that issued it. This confirms that the server is who it says it is, and that the client is interacting with the actual owner of the domain.
d) The premaster secret: The client sends one more random string of bytes, the “premaster secret.” The premaster secret is encrypted with the public key and can only be decrypted with the private key by the server. (The client gets the public key from the server’s SSL certificate.)
e) Private key used: The server decrypts the premaster secret.
f) Session keys created: Both client and server generate session keys from the client random, the server random, and the premaster secret. They should arrive at the same results.
g) Client is ready: The client sends a “finished” message that is encrypted with a session key.
h) Server is ready: The server sends a “finished” message encrypted with a session key.
i) Secure symmetric encryption achieved: the handshake is completed, and communication continues using the session keys.
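Step f) is the part that is easy to show in code: both sides feed the same three values into the same function and therefore end up with the same secret. The sketch below is modelled on the TLS 1.2 pseudo-random function built on HMAC-SHA256. It is an illustration under assumptions, not a TLS implementation: the RSA encryption of the premaster secret (steps d and e) is skipped, and the random values are generated locally with `os.urandom`.

```python
import hashlib
import hmac
import os


def p_sha256(secret: bytes, seed: bytes, length: int) -> bytes:
    """Expand secret and seed into `length` bytes by chaining HMAC-SHA256."""
    output = b""
    a = seed
    while len(output) < length:
        a = hmac.new(secret, a, hashlib.sha256).digest()
        output += hmac.new(secret, a + seed, hashlib.sha256).digest()
    return output[:length]


def derive_master_secret(premaster: bytes, client_random: bytes,
                         server_random: bytes) -> bytes:
    """Derive a 48-byte master secret from the three handshake values."""
    seed = b"master secret" + client_random + server_random
    return p_sha256(premaster, seed, 48)


# Stand-ins for the values exchanged during the handshake.
client_random = os.urandom(32)   # sent in the client hello
server_random = os.urandom(32)   # sent in the server hello
premaster = os.urandom(48)       # sent encrypted with the server's public key

client_side = derive_master_secret(premaster, client_random, server_random)
server_side = derive_master_secret(premaster, client_random, server_random)
print("same secret on both sides:", hmac.compare_digest(client_side, server_side))
```

An eavesdropper sees both random values in the clear but never the premaster secret, so the derived secret cannot be recomputed from the traffic alone.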
This entire security process is responsible for that padlock displayed in the browser whenever you connect to a website through HTTPS.
Great! Is that all regarding the security aspect? Are we safe to communicate?
Well, not quite yet.
TLS achieves three security purposes: privacy, integrity, and identification, yet it assumes that the data is coming from a trusted source. Indeed, TLS encrypts whatever is sent between both ends. It doesn’t protect the hosts from the execution of malicious scripts/payloads.
4. We need a firewall!
A firewall is a division between a private network and an outer network, often the internet, that manages traffic passing between the two networks. It’s implemented through either hardware or software. Firewalls allow, limit, and block network traffic based on preconfigured rules in the hardware or software, analyzing data packets that request entry to the network. — Webopedia
Firewalls act like a checkpoint where data passed to or from an outer network is reviewed before being accepted or rejected. There are several types of security functions used by firewall programs. For example:
a) Packet Filtering: looks at each packet entering or leaving the network and accepts or rejects it based on user-defined rules. Packet filtering is fairly effective and transparent to users, but it is difficult to configure. In addition, it is susceptible to IP spoofing.
b) Circuit-Level: applies security checks when a new TCP or UDP connection is established. Once the connection has been made, packets can flow between the hosts without further checking.
c) Proxy Server: serves as the gateway from one network to another for a specific application. Proxy servers can provide additional functionality by preventing direct connections from outside the network.
Server administrators can configure a firewall to accept or deny incoming traffic from certain IP addresses or on certain ports.
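Rule-based packet filtering on source address and port can be sketched like this. The rule set, the addresses (taken from documentation ranges) and the names `Rule` and `filter_packet` are all assumptions for the example; real firewalls look at many more packet fields.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    action: str                 # "accept" or "deny"
    source: str                 # source network in CIDR notation
    port: Optional[int] = None  # destination port; None matches any port


# Example rule set, evaluated top to bottom; the first match wins.
RULES: List[Rule] = [
    Rule("deny", "203.0.113.0/24"),        # block a misbehaving network
    Rule("accept", "0.0.0.0/0", 443),      # HTTPS from anywhere
    Rule("accept", "0.0.0.0/0", 80),       # HTTP from anywhere
    Rule("accept", "198.51.100.7/32", 22), # SSH from one admin machine only
]


def filter_packet(src_ip: str, dst_port: int,
                  rules: List[Rule] = RULES, default: str = "deny") -> str:
    """Return the action of the first matching rule, or the default."""
    for rule in rules:
        from_source = ip_address(src_ip) in ip_network(rule.source)
        on_port = rule.port in (None, dst_port)
        if from_source and on_port:
            return rule.action
    return default
```

The default of "deny" means that anything not explicitly allowed is rejected, which is a common starting point for server firewalls.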
That’s about it, for now, security wise.
5. Load Balancer
Let’s switch sides and think about the server load.
As a rule of thumb, to avoid a Single Point of Failure (SPOF), websites have more than one server. They need a load balancer to distribute the volume of incoming requests across multiple servers.
The load balancer spreads the workload across those servers, which in turn increases the reliability, efficiency and availability of the website.
“A load balancer, or server load balancer (SLB), is a hardware or software-based device that efficiently distributes network or application traffic across a number of servers. With a load balancer, if a server’s performance suffers from excessive traffic or if it stops responding to requests, the load-balancing capabilities will automatically switch the requests to a different server.” — Webopedia
A software load balancer can be configured either on the same server as that hosting web content or on a server all its own.
There are various types of load balancing algorithms, each with their own advantages and disadvantages:
a) Round-robin: distributes requests evenly by handing them to each server in turn.
b) Least connections: sends each new connection to the server that has the fewest current connections.
c) Random: randomly distributes requests to a server.
d) Many more load balancing algorithms exist.
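The three algorithms listed above fit in a few lines each. In this sketch the class names and the `web-01` style server names are invented, and the bookkeeping is kept in memory; a real load balancer also tracks server health and timeouts.

```python
import itertools
import random
from typing import Dict, List, Optional


class RoundRobin:
    """Hand requests to each server in turn."""

    def __init__(self, servers: List[str]) -> None:
        self._cycle = itertools.cycle(servers)

    def pick(self) -> str:
        return next(self._cycle)


class LeastConnections:
    """Send each new connection to the server with the fewest open ones."""

    def __init__(self, servers: List[str]) -> None:
        self.active: Dict[str, int] = {server: 0 for server in servers}

    def pick(self) -> str:
        server = min(self.active, key=self.active.get)
        self.active[server] += 1   # a connection was opened
        return server

    def release(self, server: str) -> None:
        self.active[server] -= 1   # a connection was closed


class RandomChoice:
    """Pick any server at random."""

    def __init__(self, servers: List[str], seed: Optional[int] = None) -> None:
        self.servers = list(servers)
        self._rng = random.Random(seed)

    def pick(self) -> str:
        return self._rng.choice(self.servers)
```

Round-robin ignores how busy each server is, while least connections adapts to long-lived connections at the cost of keeping a counter per server.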
Ideally, a website will be configured with multiple load balancers, in order to avoid a SPOF.
Redirecting from HTTP to HTTPS
When you type google.com or youtube.com into your browser’s address bar, do you prefix it with https? If you’re like most people, the answer is no. The result is that you’re sent to the non-secure, HTTP version of the site.
Some load balancers, like HAProxy, can be set to reroute users from HTTP (port 80) to HTTPS (port 443) automatically.
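Whatever tool performs it, such a reroute boils down to answering the plain HTTP request with a redirect response that points to the HTTPS address. The function below is a hypothetical illustration of that response in Python; it is not HAProxy configuration.

```python
def redirect_to_https(host: str, path: str = "/") -> bytes:
    """Build a 301 response sending the client to the HTTPS version of a URL."""
    location = f"https://{host}{path}"
    return (
        "HTTP/1.1 301 Moved Permanently\r\n"
        f"Location: {location}\r\n"
        "Content-Length: 0\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
```

On receiving this response, the browser opens a new connection on port 443 and repeats its request there.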
Remember the TLS handshake with the private and public keys? With the RSA algorithm, those keys are typically 2048 bits long, and the session key supports up to 256-bit encryption. This provides robust protection that is reassuring for both users and service providers.
The downside is that encryption and decryption are relatively slow and consume a large share of the server’s resources. Thus, the server becomes slower.
SSL offloading moves the encryption and decryption work from the web servers to the load balancer. There are two types of SSL offloading: SSL termination and SSL bridging.
SSL termination relieves the servers of decryption. The client connects to the load balancer via the secure, encrypted HTTPS connection, and encryption ends there: the HTTP messages are passed in the clear, so both the incoming and outgoing data transmission between the load balancer and the server remain unencrypted.
The advantages for the website owner are:
- certificates are maintained in fewer places,
- the servers are not exposed to the Internet for certificate renewal purposes,
- servers are unburdened from the task of processing encrypted messages, freeing up CPU time.
SSL termination reduces the workload and saves on computational overhead.
SSL termination is an effective method for websites that don’t deal with users’ sensitive information: blogs, informative sites (such as Wikipedia), media sharing websites (such as YouTube).
The cons are:
- It can mislead clients into believing that their data is safe and secure throughout the communication, although encryption is lost midway without their knowledge.
- As the load balancer handles all the data in the clear, it is harder to trust that all the information is still secure.
Summary, round 2
- The browser receives the URL.
- There is a DNS lookup process to get the IP address of the domain in the URL.
- The browser completes a TLS handshake with the load balancer over TCP/IP.
- The browser sends the load balancer an HTTPS GET request over the encrypted connection.
- The HTTPS GET request is passed through a firewall on the load balancer.
- The load balancer terminates TLS and decrypts the request.
- The load balancer distributes the HTTP GET request to the next available host server.
- The HTTP GET request is passed through a firewall on the host server.
6. Web Servers
A web server is software that uses HTTP and other protocols to respond to client requests and deliver web pages. The main job of a web server is to fulfill requests from clients for static content from a website (HTML pages, images, plain text files, and so on).
So, the Web Server has received the Client GET request. It processes the request and sends back an HTTP response. The response includes a status code. Some common status codes are:
- 200 OK: everything went well.
- 301 Moved Permanently: the requested resource has permanently moved to another URL.
- 404 Not Found: the server didn’t find anything at the location being requested.
Static content refers to any content that doesn’t need to be processed before being sent to the client, where no interaction is happening.
In our case, we’ve reached our destination, here’s the requested page.
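How a web server maps a GET request for static content to one of those status codes can be sketched as a single function. The paths, the page content and the `handle_get` name are invented here; a real server reads files from disk and sets many more headers.

```python
from http import HTTPStatus
from typing import Dict, Tuple

# Hypothetical site content, kept in memory for the example.
STATIC_FILES: Dict[str, bytes] = {"/index.html": b"<h1>Welcome</h1>"}
MOVED: Dict[str, str] = {"/home": "/index.html"}   # old path -> new path


def handle_get(path: str) -> Tuple[HTTPStatus, Dict[str, str], bytes]:
    """Return (status, headers, body) for a GET request on path."""
    if path in MOVED:
        return HTTPStatus.MOVED_PERMANENTLY, {"Location": MOVED[path]}, b""
    if path in STATIC_FILES:
        return HTTPStatus.OK, {"Content-Type": "text/html"}, STATIC_FILES[path]
    return HTTPStatus.NOT_FOUND, {"Content-Type": "text/plain"}, b"Not Found"
```

The content is served exactly as stored, with no processing in between, which is what makes it static.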
Dynamic content, on the other hand, is generated or updated on request. This is handled by …
7. … Application Servers
Application Servers let websites become more active and dynamic. Users are able to interact with the website by logging in, posting to a forum etc...
In a typical deployment, a website that provides both static and dynamic content runs web servers for the static content and application servers to generate content dynamically.
The web server passes the information to the application server and then the application server queries the database for the information needed, transforms it and sends it back to the web server.
An application server runs behind a web server and in front of a database, often an SQL database.
A database is a collection of information that is organized so that it can be easily accessed, managed and updated. Computer databases typically contain aggregations of data records or files, containing information about sales transactions or interactions with specific customers.
In a relational database, digital information about a specific customer is organized into rows, columns and tables which are indexed to make it easier to find relevant information through SQL queries.
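The chain described above, where the application server queries the database, transforms the result and hands it back, can be sketched with Python’s built-in `sqlite3` module. The `customers` table, its rows and the function names are assumptions made for the example, and an in-memory database stands in for a real database server.

```python
import html
import sqlite3


def make_database() -> sqlite3.Connection:
    """Create an in-memory database with a small customers table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    db.executemany("INSERT INTO customers (name) VALUES (?)", [("Ada",), ("Linus",)])
    db.commit()
    return db


def render_customer_page(db: sqlite3.Connection, customer_id: int) -> str:
    """Application-server step: query the database, then build the HTML."""
    row = db.execute(
        "SELECT name FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    if row is None:
        return "<p>Unknown customer</p>"
    # Escape the value so stored data cannot inject markup into the page.
    return f"<p>Welcome back, {html.escape(row[0])}!</p>"
```

The page differs for every customer id, which is exactly what separates dynamic content from the static files a web server can return as-is.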
Here’s a diagram that illustrates the flow of processes happening from the client to the server.
What a journey, and it happens in the blink of an eye, millions of times every day! Thank you to those who have borne with me until now.