What happens when you type a URL in your browser and press enter?

Andrew Birnberg
23 min read · Nov 27, 2017


The internet

With the free and open internet on the chopping block despite widespread public opposition, it seems like an especially important time to dig deeply into how the internet works and what actually happens when you open a URL in your browser of choice. It is no small thing to type www.google.com and have the familiar logo and search bar appear in your browser window.

General Overview

Before going into each step of the journey, I want to give a general overview of what will happen; once the larger picture is clear, I will walk through each leg of the journey in greater detail. Also, as this article was written for The Holberton School, I will use their website as the example destination; however, the process applies equally to every website on the internet.

The internet runs on the Internet Protocol Suite, the set of standard networking and data-formatting protocols described and maintained by the Internet Engineering Task Force. Although a number of connection protocols exist, TCP/IP (Transmission Control Protocol and Internet Protocol) is the primary one used to transfer information about websites between web servers and browsers. The data itself is handled by the HTTP protocol, but TCP and HTTP serve different purposes and occupy distinct layers of the OSI communication model (the application layer for HTTP, the transport layer for TCP).

  1. In general, when you type a URL into the browser location bar (and press enter), the browser needs to find the server you are trying to communicate with based on whatever you typed.
  2. To communicate with the web server hosting the website you’ve requested, the browser needs to discover its IP address, which allows a connection to be established across the internet. The website name, or domain name, is shorthand for us humans so we don’t have to remember IP addresses (e.g. 54.192.146.3). To get the IP address, your browser consults the Domain Name System (DNS): dedicated name servers with well-known IP addresses that maintain tables mapping domain names to the actual IP addresses of the servers hosting their content. That way, your browser can first ask a DNS server for the actual IP address of, e.g., www.holbertonschool.com and then make a second request directly to the correct IP address.
  3. Getting a valid IP address is only the beginning. A browser is a multicomponent piece of software with several important and distinct jobs, one of which is correctly formatting requests according to the HTTP protocol and interpreting the server’s responses. Rendering HTML and handling GUI operations are separate jobs the browser performs once the content has been received. With the correct IP address in hand, the browser makes a GET request to the web server using HTTP and waits for a response, ultimately setting up a bidirectional connection between the computer running the browser and the one running the web server. If secure communication is desired, it is negotiated between the server and the client before any significant data is sent.
  4. On the server side, unless it is serving purely static HTML content, the web server will probably need to query some kind of application server or run scripts based on the user’s request or some external information (e.g. the time of day). The result is used to generate custom HTML content before it is returned to the client.
  5. After all the data needed to render the web page has been transferred, the connection may be closed, and the browser’s rendering engine parses the HTML and stylesheet content and runs any JavaScript code. The connection can also be kept open, and ongoing communication between the server and the browser may continue, especially in the case of a web application, which updates its view through continued communication with an application server.

Hopefully this gives a simple picture of what’s going on. This basic overview has skipped over many details, and I’ll go into as many as I can in the following sections.

DNS request

When you initially type www.holbertonschool.com into your browser and press enter, your browser doesn’t know whether this is a valid domain name, or how to contact the server hosting its content if it is. It must first find the IP address of www.holbertonschool.com before it can request a connection from its web server.

DNS (Domain Name System) lookup is the method by which browsers determine the IP addresses of websites users visit. DNS encompasses the totality of how domain names, such as xyz.com, are translated into the IP addresses that directly identify the server hosting the content.

DNS is a hierarchical network of servers that maintain information about domains that have been registered with an internet authority, such as ICANN (Internet Corporation for Assigned Names and Numbers). At the top of the hierarchy are 13 root name servers (the root is referred to by a single period, ‘.’) that maintain lists of all the current TLDs (top-level domains), such as .com, .info, and .org, along with the name servers for each. Each TLD has servers that keep track of the domain names registered under that extension. Since www.holbertonschool.com is registered with a .com extension, a .com name server will have a reference to its location.

Domain names are typically purchased from a domain name registrar, such as www.gandi.net, which communicates to the TLD registry that the domain has been registered and is associated with a particular IP address. Most often, the record on the TLD name server actually points to another name server lower in the hierarchy, such as one operated by the registrar itself or by the owner of the website.

Using the Linux whois command, which queries TLD registries for information about a domain name, you can learn more about holbertonschool.com:

Domain Name: HOLBERTONSCHOOL.COM
Registry Domain ID: 1950068353_DOMAIN_COM-VRSN
Registrar WHOIS Server: whois.gandi.net
Registrar URL: http://www.gandi.net
Updated Date: 2017-06-29T06:59:20Z
Creation Date: 2015-07-30T09:53:51Z
Registry Expiry Date: 2018-07-30T09:53:51Z
Registrar: Gandi SAS
Registrar IANA ID: 81
Registrar Abuse Contact Email: abuse@support.gandi.net
Registrar Abuse Contact Phone: +33.170377661
Domain Status: clientTransferProhibited https://icann.org/epp#clientTransferProhibited
Name Server: NS-1455.AWSDNS-53.ORG
Name Server: NS-1619.AWSDNS-10.CO.UK
Name Server: NS-176.AWSDNS-22.COM
Name Server: NS-792.AWSDNS-35.NET
DNSSEC: unsigned
URL of the ICANN Whois Inaccuracy Complaint Form: https://www.icann.org/wicf/
>>> Last update of whois database: 2017-11-27T00:26:58Z <<<

Here, you can see that the name servers for holbertonschool.com are most likely Amazon Web Services virtual servers (e.g. NS-1455.AWSDNS-53.ORG) and that the registrar was Gandi. These name servers, like any DNS servers, maintain lists of the domains they are responsible for in a standardized format. The .com TLD server will have an NS record that lists the name servers in use for the domain in question. The name servers themselves will have A records (among others) that list the actual IP addresses of the web servers hosting content for www.holbertonschool.com. A domain may have a number of A records in order to direct traffic for subdomains, such as www. or blog., to different physical (or virtual) servers.

Each DNS record contains several pieces of information: 1) the record type (e.g. A or NS); 2) the domain associated with the record; 3) the subdomain attached to the record; 4) the TTL, or time-to-live, which specifies how long to consider the record valid before returning to the name server to check for a new IP address and/or TTL value.

You may be wondering at this point when we’re going to get back to the question of how a browser translates a request for a domain into an actual HTML page. The DNS record’s TTL value is of particular importance here: it tells the browser and operating system how long they may cache the returned IP address after making a DNS query before having to initiate a new one.

Putting it all together

When you hit enter after typing www.holbertonschool.com, the browser will check its cache of previously requested domains and look for a match. If it finds a match it will use that IP address directly and construct an HTTP request for that site, which we will see more of shortly.

Otherwise, a recursive process begins. The browser checks with the OS, which keeps its own separate cache, to see whether another application has requested this domain recently. Assuming it’s not there, the OS checks with the gateway, often a modem or a router, to see if the IP has been stored there. The gateway will usually point to a default DNS server hosted by the ISP, which checks its own cache in turn. If the address is still not found, the ISP’s server issues a DNS request, possibly to other DNS servers caching a larger region, and finally queries a root name server if the answer was not cached anywhere (or the TTL was exceeded everywhere). The root name server delegates to the .com TLD name server, which delegates to one of the AWS name servers shown in the whois output. If a record is found (an A record in this case, since we are looking for the IP address serving a website), the name server responds with the necessary information, such as the IP address and TTL value, and each node along the way back to the browser caches the response so that another query within the TTL won’t have to go all the way back to the name server.

If the name is resolved by the DNS query, the browser now has the IP address and can use it to initiate a connection with the web server at that address.
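
To see what a resolver actually returns, here is a minimal sketch using the third-party dnspython package (an assumption on my part; browsers use their own resolvers), which exposes the record type, addresses, and TTL that the standard socket library hides:

import dns.resolver  # third-party: pip install dnspython

answer = dns.resolver.resolve("holbertonschool.com", "A")
print("TTL:", answer.rrset.ttl)          # seconds this answer may be cached
for record in answer:
    print("A record:", record.address)   # an IP address hosting the site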

HTTP request and TCP/IP

Once the browser has the IP address of the host it is trying to reach, it will send a connection request to that address so it can begin to download content, such as HTML and other resources. It communicates with the server using TCP (Transmission Control Protocol), a protocol for exchanging data between two computers that strives to be reliable and error-free. It is only one of many protocols for communicating over networks, but it is the main one used for HTTP requests. Other protocols, such as UDP, do not guarantee a reliable connection and do not attempt to resend lost or corrupted data, but they have uses in streaming content and other areas requiring low network latency where errors can be tolerated.

Connection request over TCP/IP

Before an HTTP request goes out to www.holbertonschool.com (or 54.192.146.3), the browser must establish a connection. This is done through what is known as the three-way handshake. First, the browser sends a synchronize request (SYN) to port 80 of the IP address it is trying to connect to. Port 80 is the well-known port associated with the HTTP protocol. If the server is listening on port 80 and accepting requests, it responds with a synchronize-acknowledge response (SYN-ACK). For the connection to be established, the browser must then send an acknowledgement (ACK), binding a port of its own to port 80 on the host.

Typical connection pattern between browser and web server.

This all takes place within the OS, and the result is finally exposed to the browser: either the connection was successfully made and a socket file handle is available for reading and writing, or a failure code is returned.
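
You can watch the OS do this work using Python’s standard socket module. This is only a sketch of the idea, not how any browser actually implements it; the three-way handshake happens inside the connect call:

import socket

try:
    # create_connection() performs the SYN / SYN-ACK / ACK exchange in the OS
    sock = socket.create_connection(("www.holbertonschool.com", 80), timeout=5)
    print("connected from local port", sock.getsockname()[1])
    sock.close()
except OSError as exc:
    print("connection failed:", exc)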

The OSI model

Now is a good time to bring up the OSI (Open Systems Interconnection) model of network communication. Every communication between two computers ultimately has to be translated from its representation in an application or process into the series of electrical impulses carried by wire or fiber optics over great distances. While each application may achieve this somewhat differently, there are certain commonalities and requirements that can be summarized in this diagram:

OSI model

The layers represent different levels of abstraction. In the application layer, where web browsers live, the data begins in a human-readable form. As you go down the layers, the data is progressively converted into the form that will be transported over the physical medium. One important thing to keep in mind is that this is a model, not a protocol. Web browsers often handle several of the upper layers together, such as formulating the HTTP request, encrypting data, and maintaining persistent sessions over the otherwise stateless HTTP protocol. The transport layer is typically handled by the OS and includes the TCP and UDP protocols, which address data to its destination and check that the received data is uncorrupted and in the correct order.

All of the layers transform the data by encapsulating it in their own headers, as illustrated below:

Headers are added at each level.

At the network layer, the IP protocol fragments TCP/UDP messages into smaller chunks and converts them into packets, which are encapsulated further into Ethernet frames and finally transmitted as 1’s and 0’s. When a frame arrives at an interface, such as a router or the destination server, the data is decapsulated as far as necessary for the hardware or software to perform whatever operation is needed, such as routing, and is then sent back down to the lowest layer to be transmitted to the next destination. At the final destination, frames are completely decapsulated, recombined, and turned back into application data.
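
As a toy illustration of encapsulation (using made-up placeholder headers, not real packet formats), each layer simply prepends its own header to whatever the layer above handed it:

def encapsulate(payload: bytes, header: bytes) -> bytes:
    # each layer wraps the data from the layer above in its own header
    return header + payload

http_data = b"GET / HTTP/1.1\r\nHost: www.holbertonschool.com\r\n\r\n"
tcp_segment = encapsulate(http_data, b"[TCP ports/sequence]")
ip_packet = encapsulate(tcp_segment, b"[IP source/destination]")
ethernet_frame = encapsulate(ip_packet, b"[MAC addresses]")
print(ethernet_frame)   # decapsulation peels these layers off in reverse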

IP addresses and port numbers

Currently, most IP addresses are IPv4 addresses: 32-bit numbers written as four octets. As this leaves room for only 2³² possible IP addresses on the internet, various systems have been devised to make judicious use of them. More recently, 128-bit IPv6 addresses have come into use, permitting a vastly larger number of addresses for the 1,200,000,000+ websites on the internet. Of course, websites are outnumbered by users, who need IP addresses as well, so we would be well beyond the limits of IPv4 if everyone had their own IP address all the time.
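
Python’s standard ipaddress module makes the 32-bit nature of an IPv4 address easy to see:

import ipaddress

addr = ipaddress.ip_address("54.192.146.3")
print(int(addr))       # the same address as a single 32-bit number: 918589955
print(addr.packed)     # its four octets as raw bytes
print(ipaddress.ip_address("2001:db8::1").max_prefixlen)   # 128 bits for IPv6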

Besides an IP address, computers need another piece of information to communicate with each other: a port number. Ports allow the network stack to route different kinds of traffic arriving at the same IP address to different applications. For example, I can have a web server running on port 80 while also communicating with an application server on port 8080 and a database server on port 8888. This way, traffic coming in on port 80 is always seen by my web server, and traffic to port 22 is seen by my SSH daemon.

What if I want traffic to my IP address to be able to reach only port 80? That’s what a firewall accomplishes. A firewall is a piece of software that blocks incoming and outgoing traffic based on rules that you specify. You could say: allow only incoming traffic on port 80 and outgoing traffic on ports above 1000; any traffic outside those rules is blocked. This is an important safety precaution to take on any device connected to the internet and can greatly reduce vulnerability to hacking.

HTTP requests

Web servers exist to serve content to clients over the internet. The way they do so is through HTTP (Hypertext Transfer Protocol). HTTP is a means of both requesting and sharing data over the internet. When a browser requests HTML from a web server, it must make the request using one of several HTTP verbs along with whatever content is necessary to communicate its identity and intent. HTTP verbs are the methods a client can use to retrieve or modify data on a server. They include GET, POST, HEAD, PUT, PATCH, DELETE, OPTIONS, CONNECT, and TRACE. Among these, GET is the most used, as it is issued whenever a client wants to download the content of a website.

HTTP requests include a number of header key-value pairs, which help the web server perform its job and maintain session data between requests when a user will be making multiple requests over the same connection. The first line of the request states the HTTP method, the resource being requested, and the protocol version that will be used. Currently, HTTP version 1.1 is used most often, although version 2.0 is now supported by the majority of browsers and web servers. Subsequent lines of the header include information such as the domain name the client is trying to reach and any other required details, such as the type of browser. Header keys are separated from their values by a colon ‘:’, and the header section ends with a blank line, after which any longform data is appended. Here is an example of a GET request:

GET / HTTP/1.1
User-Agent: curl/7.35.0
Host: www.holbertonschool.com
Accept: */*

Here I have used the Linux utility curl to generate a GET request for www.holbertonschool.com. The first line shows the method, followed by /, which indicates that I want the homepage rather than a resource in a subfolder, and then the version of HTTP I am using. The User-Agent header is of little consequence here, but a server might serve different versions of its website depending on whether I’m browsing with IE9 or Firefox. Host is the domain I am requesting: even though the DNS lookup described earlier already produced the IP address, if other websites are hosted at this location, the server needs to know which one I want. The Accept header indicates that I will accept data in any encoding whatsoever.
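
The request above is simple enough to hand-roll over a raw TCP socket. Here is a minimal sketch that sends the same kind of GET request and prints whatever comes back (error handling omitted for brevity):

import socket

# header lines end with \r\n; a blank line terminates the header block
request = (
    "GET / HTTP/1.1\r\n"
    "Host: www.holbertonschool.com\r\n"
    "Accept: */*\r\n"
    "Connection: close\r\n"
    "\r\n"
)

with socket.create_connection(("www.holbertonschool.com", 80)) as sock:
    sock.sendall(request.encode("ascii"))
    response = b""
    chunk = sock.recv(4096)
    while chunk:
        response += chunk
        chunk = sock.recv(4096)

print(response.decode("utf-8", errors="replace"))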

It is also informative to look at the response:

HTTP/1.1 301 Moved Permanently
Cache-Control: no-cache
Content-Type: text/html; charset=utf-8
Date: Mon, 27 Nov 2017 06:30:50 GMT
Location: https://www.holbertonschool.com/
Server: nginx/1.10.2
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Request-Id: 2b1bb157-fa7a-4643-bcbe-e526abee3c59
X-Runtime: 0.003631
X-XSS-Protection: 1; mode=block
Content-Length: 98
Connection: keep-alive
<html><body>You are being <a href="https://www.holbertonschool.com/">redirected</a>.</body></html>

The response is similar, but has some important differences on the first line. It starts with the HTTP version in use, which must match what I used in my request if we’re going to communicate effectively. The next piece of information is the status code, which curl conveniently describes in the text right after it. The possible status codes are all three-digit numbers:

  • 1xx Informational responses.
  • 2xx Success.
  • 3xx Redirection.
  • 4xx Client errors.
  • 5xx Server errors.

Everyone has most likely seen the 404 Not Found error when trying to access a web page; it indicates that the client has requested a resource that is not available on the server. Otherwise, when things go as expected and no errors or redirections occur, a status code of 200 is returned, indicating that everything is okay. The 301 redirection seen above tells us that instead of requesting http://www.holbertonschool.com, we should request https://www.holbertonschool.com/, as specified in the Location header. This directs us to a secure version of the site that allows private, encrypted communication between the client and the server. By returning a 301, or permanent redirection, the browser is being told that the resource will always be found at this new location and is no longer available at the old one. Since browsers and the curl utility default to a regular port 80 HTTP request unless told otherwise, web servers need to be able to handle traffic to the unsecured version of the website and turn it around toward the secure version.

At the bottom of the response, after a blank line, the body of the response begins. In this case, it is a simple HTML snippet that tells us we are being redirected.
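
You can reproduce this exchange with Python’s standard http.client module, which, unlike a browser, does not follow the redirect on its own:

import http.client

conn = http.client.HTTPConnection("www.holbertonschool.com", 80, timeout=10)
conn.request("GET", "/")
response = conn.getresponse()
print(response.status, response.reason)    # 301 Moved Permanently
print(response.getheader("Location"))      # https://www.holbertonschool.com/
conn.close()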

SSL/TLS and HTTPS

There are several reasons to use HTTPS over HTTP. First, by encrypting your traffic, you are less likely to have your private data stolen. Second, you can verify the identity of the party you are communicating with, using a certificate issued by a trusted source.

While normal, unsecured HTTP requests are handled on port 80 and are visible to anyone on the network, secure HTTP, a.k.a. HTTPS, is usually handled on port 443 and encrypted using a shared key that is negotiated after the initial connection is established. The protocols used for establishing secure communication are called SSL and TLS. They encrypt HTTP-based communication so that once the data leaves the application layers (5–7) of the OSI model, it is hidden from everyone except the destination server. Even though the message content is encrypted, the source and destination addresses must still be made available to the transport layer, such as TCP, so the data can be properly encapsulated and addressed to its destination. Since each layer is independent, this poses no problem to TCP/IP, which merely wraps its payload in a new header and passes it along.

In order to initiate secure communication using SSL/TLS, four steps are taken, known as the four-way handshake.

  1. The client (browser) sends a “Hello” message to the server requesting secure communication over TLS.
  2. The server sends back a certificate from a trusted Certificate Authority that has been bundled with a public key, which can be used to encrypt data before it is sent back to the server.
  3. The client receives the key and verifies the authenticity of the certificate by checking its list of trusted Certificate Authorities and using the correct public key from that authority to verify the signature on the certificate. It then uses the key sent by the web server to encrypt a symmetric key, which will be used for the ensuing communication, and sends it back to the host along with a message saying that encryption has begun.
  4. The host then uses its private key to decrypt the symmetric key that was encrypted using its public key. It then sends back to the client an encrypted message that encryption has begun from its side as well.

There are several additional steps not described here, such as the browser declaring which methods of encryption it supports and the server choosing the most secure among them. There are also further checks built into the encrypted communication that prevent tampering with the messages, via message authentication codes (MACs).

Once secure communication is established, it is maintained throughout the session. At the end of the session, the symmetric key is destroyed and a new one is generated for future sessions.
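
In practice, all of these steps happen inside your TLS library. Here is a minimal sketch using Python’s standard ssl module, which performs the entire handshake inside wrap_socket():

import socket
import ssl

context = ssl.create_default_context()   # loads the system's trusted CA list
with socket.create_connection(("www.holbertonschool.com", 443)) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname="www.holbertonschool.com") as tls:
        print(tls.version())    # negotiated protocol, e.g. TLSv1.2
        print(tls.cipher())     # negotiated cipher suite
        print(tls.getpeercert()["subject"])   # the verified certificate subject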

The following output from curl shows the process in action:

* Connected to www.holbertonschool.com (52.70.237.84) port 443 (#1)
* successfully set certificate verify locations:
* CAfile: none
CApath: /etc/ssl/certs
* SSLv3, TLS handshake, Client hello (1):
* SSLv3, TLS handshake, Server hello (2):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS handshake, Server key exchange (12):
* SSLv3, TLS handshake, Server finished (14):
* SSLv3, TLS handshake, Client key exchange (16):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSL connection using ECDHE-RSA-AES128-GCM-SHA256
* Server certificate:
* subject: CN=*.holbertonschool.com
* start date: 2017-03-01 00:00:00 GMT
* expire date: 2018-04-01 12:00:00 GMT
* subjectAltName: www.holbertonschool.com matched
* issuer: C=US; O=Amazon; OU=Server CA 1B; CN=Amazon
* SSL certificate verify ok.

The line SSL connection using ECDHE-RSA-AES128-GCM-SHA256 indicates the type of encryption being used for each phase of the exchange. ECDHE-RSA is a secure asymmetric key-exchange method, AES128-GCM is the symmetric cipher used for the rest of the communication, and SHA256 is the hash used for the MAC signatures of individual messages to prevent tampering along the path. The last chunk of the handshake, beginning with Server certificate:, shows that the certificate was deemed valid and that the browser was able to verify that it was indeed communicating with www.holbertonschool.com. As an aside, this does not demonstrate that www.holbertonschool.com isn’t some sketchy website that will securely encrypt its transactions while stealing your identity, only that they are who they say they are. Fortunately, you can take it on my own authority that they won’t try to do that :)

Putting it all together

After the IP address is found via DNS lookup, the browser sends a GET request using HTTP to port 80 at the IP address that was returned. The web server responds with a 301 status code to indicate that the browser should repeat the request over port 443, using SSL/TLS to encrypt the transmission of data. Once the four-way handshake with the server is complete, another GET request is issued (this time encrypted), and the full contents of the resource are returned.

Server Architecture

When you look up the IP address for www.holbertonschool.com using a DNS query and try to connect to that server, you may someday find that you get no response. With a simple 1:1 correspondence between a domain name and an IP address, if the server goes down for any reason, or requires maintenance and has to be shut down, users won’t be able to reach the website. The web server thus represents a single point of failure (SPOF). One way of dealing with this problem is to have more servers, so that if one goes down or is overwhelmed with traffic, another server can take over.

Load Balancers

A load balancer is a type of server that distributes client requests across an array of web servers to lessen the burden on any individual server. Like a web server, it listens on port 80 (and 443, if TLS is enabled), but instead of returning any content itself, it rewrites each request to direct it to a web server hosting the requested content. As an example of a domain hosted on multiple servers, you can see all of the IP addresses connected to holbertonschool.com using the Linux dig utility, which performs a DNS lookup and returns the DNS records associated with the domain, their record types, and TTL values. Running dig holbertonschool.com returns a list of 8 A records pointing to different IP addresses:

; <<>> DiG 9.9.5-3ubuntu0.15-Ubuntu <<>> holbertonschool.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48871
;; flags: qr rd ra; QUERY: 1, ANSWER: 8, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;holbertonschool.com. IN A
;; ANSWER SECTION:
holbertonschool.com. 60 IN A 54.192.146.199
holbertonschool.com. 60 IN A 54.192.146.21
holbertonschool.com. 60 IN A 54.192.146.3
holbertonschool.com. 60 IN A 54.192.146.136
holbertonschool.com. 60 IN A 54.192.146.13
holbertonschool.com. 60 IN A 54.192.146.213
holbertonschool.com. 60 IN A 54.192.146.192
holbertonschool.com. 60 IN A 54.192.146.134
;; Query time: 28 msec
;; SERVER: 10.0.2.3#53(10.0.2.3)
;; WHEN: Mon Nov 27 12:10:59 UTC 2017
;; MSG SIZE rcvd: 176

A number of different algorithms exist for load balancers to decide where to send traffic; a minimal sketch of two of them follows the list.

  1. Round robin: new traffic is forwarded to each available server in sequence, wrapping back around when it reaches the end of the list. This works well when all that is needed is to spread the server load out over a larger number of servers.
  2. Weighted round robin: like round robin, but certain servers get a greater share of the traffic in proportion to their weights.
  3. Source IP hash: the origin IP of the traffic is hashed modulo the number of available web servers, so all servers have roughly the same likelihood of being chosen, but any individual client will usually be sent to the same server based on their IP.
  4. Destination URL hash: like IP hash, but the requested resource is hashed so users requesting a particular resource will get it from the same server. This is particularly useful for caching schemes so that caches on multiple servers don’t have to cache the same resource in multiple places.
  5. Least connection: the load balancer keeps track of the number of connections to each server and selects the one with the lowest number.
  6. Least traffic: like least connection, but based on total traffic, not just connections. Since certain content generates more traffic, like images as compared to text, the total number of connections might not be reflective of the server load.
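
Here is a minimal sketch of the first and third strategies, with a hypothetical pool of backend addresses; real load balancers implement these far more robustly:

import itertools
from hashlib import sha256

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # hypothetical backend pool

pool = itertools.cycle(servers)

def round_robin():
    # hand out servers in order, wrapping around at the end of the list
    return next(pool)

def source_ip_hash(client_ip):
    # the same client IP always hashes to the same backend
    digest = int(sha256(client_ip.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

print([round_robin() for _ in range(4)])   # wraps back around to 10.0.0.1
print(source_ip_hash("203.0.113.7"))       # stable for this client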

One of the critical functions of a load balancer is to monitor the health of the servers it relays requests to. By periodically checking key metrics on each server, it can easily remove one from rotation if it starts acting up or becomes unresponsive.
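
A health check can be as simple as periodically probing each server and dropping any that fail to answer. This sketch assumes a hypothetical /health endpoint, a common convention rather than any standard:

import urllib.request

def healthy(server, timeout=2.0):
    # treat any connection error or non-200 answer as unhealthy
    try:
        with urllib.request.urlopen("http://%s/health" % server, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

active_pool = [s for s in ["10.0.0.1", "10.0.0.2", "10.0.0.3"] if healthy(s)]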

Since the DNS server is only directing traffic to the load balancer, the load balancer itself has to keep track of all the IP addresses of the web servers it manages. Depending on the setup, it’s possible that the servers have only private IP addresses and can be reached only through the load balancer. This is particularly useful from a security perspective, as it limits the opportunities for hackers to get into your servers.

While a load balancer like the one I’ve described removes the SPOF from your web servers, the load balancer itself becomes the SPOF. To avoid that, you can run multiple load balancers, where one monitors the health of the other and takes over its duties if it fails.

Web Servers

We’ve been talking about web servers this whole time, but I haven’t yet adequately described what one is. First and foremost, it is an application capable of responding to HTTP requests on, at least, port 80. To do its job, it must continuously listen for traffic and respond to requests as they come in.
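
In fact, a minimal web server fits in a few lines of Python using the standard http.server module. This is fine for demonstration, though nothing like production-grade Nginx or Apache:

from http.server import HTTPServer, SimpleHTTPRequestHandler

# serve files from the current directory; port 8080 is used here because
# binding port 80 normally requires administrator privileges
server = HTTPServer(("0.0.0.0", 8080), SimpleHTTPRequestHandler)
server.serve_forever()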

Because web servers are the interface between the outside world and the data being served (as well as all the other files on the computer), many precautions must be taken with their use, such as setting up a firewall to block unexpected traffic, restricting the web server’s permissions so it cannot access unauthorized material, and limiting the number of connections it can handle at once to mitigate DDoS attacks and faults due to overloading. To facilitate all of these things, web servers are highly configurable.

Two popular open source web servers are Apache and Nginx. Both have their strengths and weaknesses, but together they serve the majority of web pages on the internet. One limitation of Nginx is that it has no built-in capability to generate dynamic content through scripts, such as PHP or Python; Apache, on the other hand, can do this to a limited degree through various modules, but it tends to struggle under heavy load sooner than Nginx does.

When dynamic content needs to be created on the server, however, a separate application server will often do a better job.

Application Servers

There are many application frameworks that can be linked to a web server to generate content in response to user input or other factors. Ultimately, all of these frameworks communicate with the web server through either sockets or IP address/port combinations, and handle requests by taking parameters from the request and using them to compute some new content. To protect against attacks, an application server might be designed to answer requests only from a specific web server, so rather than listening on all addresses for input, it might only listen for requests from a small range of IP addresses.
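
Here is a minimal sketch of that idea using Python’s standard wsgiref module: the application listens only on localhost, computes a response from request parameters, and leaves everything else to the web server in front of it:

from wsgiref.simple_server import make_server

def app(environ, start_response):
    # build a dynamic response from a request parameter (the query string)
    body = ("You asked for: " + environ.get("QUERY_STRING", "")).encode()
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]

# bind to localhost only, so just the web server on this machine can reach us
make_server("127.0.0.1", 8080, app).serve_forever()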

These days, applications tend to be written in scripting languages, such as JavaScript/Node.js, Python, Ruby, and many others. These languages are good choices in part because they are not compiled, so changes to the codebase are immediately available the next time the code runs. Even though compiled languages run more quickly, latency over the internet usually overshadows the time lost in computation.

Although application servers take requests from web servers, they usually need additional data to do their jobs, such as a database of users or sales. The application may exist simply to issue commands to a database to update its records or record a transaction.

Database Servers

There are many ways to store data. A text file is one common way, spreadsheets are another. Databases provide a flexible means to both store data and query it.

A number of different database technologies exist, but most of them rely at some level on relational algebra: they store data in a form that makes clear how the data is connected, both within and between separate tables.

The most common kind of relational database management system (RDBMS) is based around SQL (Structured Query Language). This simple language allows users to construct very specific queries that return only the rows of their database that match some pattern. From the perspective of a web application, it is often the application server that programmatically creates the query based on some request from the web server and issues it to the database server.

Like application and web servers, database servers connect with other servers over a network and listen for requests on some preordained port, possibly limiting the IP addresses that can contact them. Because of the sensitive nature of their contents, databases must take many precautions against unauthorized access. In addition to firewalls, using a database requires having been granted permission to access its data. Since the ability to read data is very different from the ability to modify it, users can be given any combination of read and write privileges on any of the databases or tables stored on the server, as required by their roles.

Applications that interact with the database must be written with security in mind, as without proper controls they can be subject to SQL injection attacks. In these attacks, malicious users enter specially formatted SQL fragments that trick poorly designed applications into executing arbitrary queries in the database.
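
The standard defense is a parameterized query, which keeps user input strictly in the data channel. Here is a minimal sketch using Python’s built-in sqlite3 module:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'alice@example.com')")

user_input = "alice' OR '1'='1"   # a classic injection attempt

# Unsafe: string formatting splices the attacker's text into the SQL itself:
#   query = "SELECT email FROM users WHERE name = '%s'" % user_input
# Safe: the ? placeholder treats the input strictly as data, never as SQL.
rows = conn.execute("SELECT email FROM users WHERE name = ?", (user_input,)).fetchall()
print(rows)   # [] -- the injection attempt matches no rows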

Putting it all together

Once you’ve made it to The Holberton School homepage (and to here, if you’re still with me), you might want to interact with some of its dynamic features. For example, if you wanted to apply to the school, there are various forms you would fill out that would be sent to the server using the POST method. Depending on the values filled in, the application server might deliver different content or ask for a correction. In particular, The Holberton School has a very rich online application that goes over basic command line usage and checks your answers to the questions. These are all handled by the application and database servers.

Final Thoughts

The internet was not built in a day. At every network node and routing decision there are numerous factors to consider. I have learned a great deal from writing this and I hope you did too.
