Computers Are Hard: networking with Rita Kozlov
The internet is a living thing.
It’s easy to visualize how the internet works. We type an address into the browser and then our computer connects to a server and that server shows us a website. Simple enough. But then you start thinking about what’s really going on and it turns out just the number of acronyms involved is enough to make you dizzy. ISP looks up a DNS and finds an IP. Your computer connects to the IP over HTTP. On its way, it hits a CDN, likely hosted by AWS or GCP.
The internet is an enormous ecosystem of technologies and organizations. They make sure electrons zipping through thousands of kilometers of cables laid across the world end up as pixels on your screen, whether you’re fragging enemies in Fortnite or sending holiday pics to friends. Both these scenarios generate traffic handled by multiple companies and communication protocols: some fairly new and others decades old. And as you’re reading these words, engineers at Medium are probably making changes to the website’s code. All the time, the internet is stirring and bursting and bubbling. It never stops.
To untangle this quagmire, I talked to Rita Kozlov. She’s a Product Manager at Cloudflare, a popular provider of web infrastructure. I asked Rita what really happens when you try to open YouTube and it led us all the way down to those pipes laid across the ocean floor.
Wojtek Borowicz: I launch the browser, type in an address — say, YouTube.com — and hit enter. What happens under the hood before I’m able to watch videos?
Rita Kozlov: First is a DNS lookup. When you go to a website like youtube.com, you need to open communication with a server, which is basically another computer somewhere else in the world. Your computer finds that server by an IP address. But people are bad at remembering numbers and pretty decent at remembering names, so we came up with DNS: Domain Name System. It’s basically a large phonebook that makes the translation from the name to the IP address.
Once the browser has the IP address, it can open the connection to the server. Before we can start talking, we need to establish a channel of communication. That’s the handshake. The computer initiates the connection by sending a SYN, a synchronization request. The server replies with a SYN/ACK, which says ‘I acknowledge you want to connect’, and the computer confirms with an ACK of its own. We’ve established a channel of communication. On an encrypted connection, a TLS handshake follows, and that one starts with a message called a client hello.
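To make the handshake a little more concrete, here is a minimal sketch in Python that opens a TCP connection on the local machine. The function name `loopback_exchange` is made up for this illustration. The SYN, SYN/ACK and ACK packets themselves are exchanged by the operating system, so the code never sees them; it only observes that the connection is ready once the handshake is done.

```python
import socket


def loopback_exchange(message: bytes) -> bytes:
    """Open a TCP connection to ourselves and pass one small message over it."""
    # Listen on the local machine; port 0 asks the OS for any free port.
    with socket.create_server(("127.0.0.1", 0)) as server:
        port = server.getsockname()[1]
        # create_connection() returns only after the operating system has
        # completed the handshake with the listening socket.
        with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
            conn, _addr = server.accept()  # pick up the established connection
            with conn:
                conn.settimeout(5)
                client.sendall(message)
                # One recv() is enough here only because the message is tiny.
                return conn.recv(1024)


if __name__ == "__main__":
    print(loopback_exchange(b"hello over TCP"))
```

Everything here stays on one computer, so there is no real network delay. Across the internet, the same handshake costs a full round trip before any data can flow.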
Now that we’re talking to each other, the browser will send a request over the HTTP protocol. If it was a phone call, HTTP would be the language being spoken. Then the server will send an HTTP response that the browser will take and render. And that’s how the YouTube web page shows up in your browser.
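If HTTP is the language being spoken, it helps to see what a sentence in that language looks like. The sketch below builds the text of a simple HTTP/1.1 request and parses a hand-written sample response. The helper names `build_get_request` and `parse_response` are invented for this example, and the parser is deliberately simplified: it assumes a well-formed response and ignores features such as chunked bodies.

```python
def build_get_request(host: str, path: str = "/") -> bytes:
    """Return the bytes a client would send to ask for `path` on `host`."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",  # the empty line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response(raw: bytes):
    """Split a raw HTTP response into (status code, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    status = int(status_line.split(" ", 2)[1])  # e.g. "HTTP/1.1 200 OK" -> 200
    headers = {}
    for line in header_lines:
        key, _, value = line.partition(":")
        headers[key.strip().lower()] = value.strip()
    return status, headers, body


if __name__ == "__main__":
    print(build_get_request("example.com").decode("ascii"))
    sample = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<h1>Hello</h1>"
    print(parse_response(sample))
```

In practice a browser or an HTTP library does this work for you, and sites like YouTube use encrypted and newer versions of the protocol, but the request and response pattern is the same.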
Going back to that first step, you said DNS translates names into IP addresses. Is DNS something that you, the website developer, control? Or is that a global directory?
There are two parties here, similar to the client-server situation. There is an authoritative DNS provider and there is a DNS resolver. If I want to publish my website, I’m going to need an authoritative DNS provider. That could be Cloudflare, for example, which is where I work. The resolver is usually provided through your ISP, or if you’re working at a company, there is often an internal DNS resolver.
When you type example.com into the browser, DNS does a recursive lookup. So we’re going to start from the very end, which is .com. Generally, the idea of the internet is that it’s not controlled by any single entity. This requires a few organizations, like IANA or ICANN, to run root servers that contain information about the TLD registries. A TLD stands for top-level domain. Examples of that are .com, .org, or .net.
The recursive resolver will first go to the root and ask ‘hey, do you know where .com is?’ and the root server will respond: ‘yes, this server will know all about .com websites’. Then you go to the .com registry and check where example.com is and it will return an IP. Or you ask where is www.example.com, which is called a subdomain, and the .com registry will send you to the authoritative DNS provider for example.com and then you will connect to that name server and ask where www.example.com is. At that point you will get back an IP.
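The walk from the root to the registry to the authoritative server can be sketched as a toy model in Python. Nothing here touches the network: the three dictionaries stand in for the three kinds of servers, and every server name and IP address in them is made up for illustration (the addresses come from a range reserved for documentation).

```python
# Root servers know which registry handles each top-level domain.
ROOT = {"com": "registry-com"}
# Each registry knows which authoritative server handles each domain.
REGISTRIES = {"registry-com": {"example.com": "ns-example"}}
# Authoritative servers hold the actual name-to-IP records.
AUTHORITATIVE = {
    "ns-example": {
        "example.com": "192.0.2.10",
        "www.example.com": "192.0.2.11",
    }
}


def resolve(name: str):
    """Return (ip, steps) for `name`, following root -> registry -> authoritative."""
    name = name.lower().rstrip(".")
    labels = name.split(".")
    if len(labels) < 2:
        raise LookupError(f"{name!r} is not a full domain name")
    tld, domain = labels[-1], ".".join(labels[-2:])
    steps = []

    registry = ROOT.get(tld)
    if registry is None:
        raise LookupError(f"root does not know .{tld}")
    steps.append(f"root: ask {registry} about .{tld}")

    name_server = REGISTRIES[registry].get(domain)
    if name_server is None:
        raise LookupError(f"{registry} does not know {domain}")
    steps.append(f"{registry}: ask {name_server} about {domain}")

    ip = AUTHORITATIVE[name_server].get(name)
    if ip is None:
        raise LookupError(f"{name_server} has no record for {name}")
    steps.append(f"{name_server}: {name} is at {ip}")
    return ip, steps


if __name__ == "__main__":
    ip, steps = resolve("www.example.com")
    print("\n".join(steps))
```

A real resolver also caches the answers it gets, so most lookups skip the first steps entirely.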
All those servers are physical computers. Does their location matter? If I’m connecting to a service based in the UK, is it going to be slower if I’m in Australia compared to Germany?
us-east-1 is the codename of the first data center Amazon launched as part of Amazon Web Services. It went online in 2006 and became the cornerstone of AWS, which is now the largest cloud services provider in the world. Whether you’re watching Netflix or signing up for health insurance, it’s all happening on Amazon’s infrastructure.
So how fast you can load websites is limited by the laws of physics?
You’re literally limited by the speed of light.
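A back-of-the-envelope calculation shows how much the speed of light matters. The sketch below assumes that light in optical fiber travels at roughly two thirds of its speed in a vacuum and uses rough straight-line distances to London (about 17,000 km from Sydney, about 930 km from Berlin). Both the distances and the function name `min_round_trip_ms` are assumptions made for this example. Real cables do not run in straight lines, so actual delays are higher.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # in a vacuum


def min_round_trip_ms(distance_km: float, fiber_fraction: float = 2 / 3) -> float:
    """Lower bound on the round-trip time over fiber, in milliseconds."""
    speed_in_fiber = SPEED_OF_LIGHT_KM_S * fiber_fraction
    return 2 * distance_km / speed_in_fiber * 1000


if __name__ == "__main__":
    for city, km in [("Sydney", 17_000), ("Berlin", 930)]:
        print(f"{city} to London and back: at least {min_round_trip_ms(km):.0f} ms")
```

With these assumptions the floor is about 170 ms from Sydney and about 9 ms from Berlin, and that is for a single round trip. Loading a page usually takes several.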
What can engineers do about that?
Until we figure a way to beat the speed of light, the best thing you can do is make sure the assets are as close to the end user as possible. One technology that enables you to do that is called CDN: content delivery network. Without a CDN, if I’m in Australia and I visit your website, I’m gonna travel all the way to the US for every single asset. And if another person then visits that website, they will have to make those journeys all over again. A CDN basically places servers at many different locations around the world, so when you send your request, it can leverage those locations for caching, which is the ability to make a local copy of an asset. For static content, which doesn’t change between users, once I’ve gone to a shopping site and downloaded all those images, the next user in Australia can be served those images from a data center much closer to them.
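The caching idea can be sketched in a few lines of Python. This is a toy model of a single CDN location, not how any particular CDN is implemented: the class name `EdgeCache` and the `fetch_from_origin` callback are invented for the example, and a real cache would also expire entries and limit how much it stores.

```python
class EdgeCache:
    """A data center near the users that keeps local copies of static assets."""

    def __init__(self, fetch_from_origin):
        self._fetch_from_origin = fetch_from_origin  # the slow, faraway trip
        self._store = {}
        self.origin_trips = 0

    def get(self, path: str) -> bytes:
        if path not in self._store:
            # Cache miss: travel to the origin once and keep a copy.
            self.origin_trips += 1
            self._store[path] = self._fetch_from_origin(path)
        # Cache hit (or freshly stored copy): served locally.
        return self._store[path]


if __name__ == "__main__":
    def origin(path):
        return f"contents of {path}".encode()

    sydney = EdgeCache(origin)
    sydney.get("/images/shoe.jpg")  # first visitor pays for the long trip
    sydney.get("/images/shoe.jpg")  # second visitor is served from the copy
    print("trips to origin:", sydney.origin_trips)
```

The second request never leaves the nearby data center, which is the whole point.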
What about scale? Is a website going to run slower if a million people are using it at the same time versus a few thousand?
Yes. Servers, like anything else, have limited capacity. You can get too much traffic, whether it’s intentional (your company is growing) or in the form of an attack called Distributed Denial of Service, or DDoS, where an attacker sends so much traffic to your server that it can’t handle the legitimate requests.
A CDN can definitely help you scale by caching things so that not everything is hitting your server. Another way to scale is by adding more servers and load balancing between them, which is determining how much traffic goes to each one.
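Load balancing can be as simple as taking turns. The sketch below shows round-robin, one common strategy among several; the interview does not say which one any given provider uses. The function name `distribute` and the server names are made up for illustration.

```python
import itertools
from collections import Counter


def distribute(request_count: int, servers: list) -> Counter:
    """Hand out requests to servers in turn and count how many each one got."""
    if not servers:
        raise ValueError("need at least one server")
    next_server = itertools.cycle(servers)  # a, b, c, a, b, c, ...
    return Counter(next(next_server) for _ in range(request_count))


if __name__ == "__main__":
    print(distribute(9, ["server-a", "server-b", "server-c"]))
```

Real load balancers often weigh servers by capacity or current load instead of treating them all the same.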
So when you want to scale your infrastructure, you just add more servers?
It depends on how you chose to architect and build your application in the first place. If you went and bought a server, you’re probably going to need to go and buy another server and buy a box that balances the load between them.
Luckily, there’s been the emergence of the cloud and the idea is that you, as the developer, won’t need to worry about buying the boxes anymore. A provider like AWS will be able to purchase those on your behalf and you can specify how much load you’re expecting. Typically, you would buy something called a VPS or virtual private server. AWS would spin up a server for you, and no matter how much traffic you get, even if it’s very little, it’s on standby for you.
The next iteration of that is called serverless, where you should only worry about the code you’re writing and your cloud provider will take care of scaling for you. And you only pay for as much as you’re using, rather than having this thing standing by the whole time even if it’s not getting traffic. Or inversely, when you get a ton of traffic, you shouldn’t have to do anything because the cloud provider will figure out a way to scale.
Does the type of service you build determine the infrastructure? How would this differ for a game store versus a video calling app?
Definitely the type of thing you’re building would influence the type of infrastructure you end up using. For a game store, the primary functions you need are the ability to serve large files and the user database. And you need to be able to authenticate these users and authorize them based on whether or not they bought the game. These are all very well served by the HTTP protocol and can be easily built with a simple storage solution. You can have Amazon S3 for storing the games and then you can have a serverless function that does the authentication and maintains the user database.
For a video conferencing app, you would definitely want to consider a different stack and you would want to think about how to connect users from a point that’s the nearest to both of them, so they are able to communicate with as little delay as possible. HTTP might not be the best choice. At a lower level, HTTP is based on something called TCP and the idea is that you have a consistent connection you’re communicating over. This is built for services that require all of the data to be always accounted for and transferred. If I buy a game and I have a blip in the connection, I still need to be able to get that game later. When I’m calling you and I had a blip in the connection, it would be very disruptive if you suddenly heard words I said five minutes ago. So there is a different protocol, called UDP, which is better optimized for performance but not so much for consistency.
User Datagram Protocol is one of the standard communication protocols used for data transmission over the internet. It’s prone to dropping data packets but it’s also very fast, making it a common choice for streaming or online gaming.
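For contrast with TCP, here is a minimal Python sketch of sending a single UDP datagram on the local machine. The function name `send_datagram` is made up for this example. Notice what is missing: there is no connection and no handshake, the sender just fires the packet at an address. On one machine the packet reliably arrives; over a real network it might not, and nothing would resend it.

```python
import socket


def send_datagram(payload: bytes) -> bytes:
    """Send one UDP packet to a local receiver and return what arrived."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
            socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        receiver.settimeout(2)  # give up instead of waiting forever for a lost packet
        # No connect() and no handshake: just address the packet and send it.
        sender.sendto(payload, receiver.getsockname())
        data, _sender_address = receiver.recvfrom(2048)
        return data


if __name__ == "__main__":
    print(send_datagram(b"one frame of audio"))
```

A calling app built this way would simply skip a lost packet and carry on with the next one, which is usually better than replaying old audio.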
Fast and stable internet connection is not the default for millions of people around the world, especially in developing countries. When building a service, do you take into account that some of your users will be on spotty connections or will be using cellular networks instead of Wi-Fi?
Around 50 or 60% of internet connections today are coming from mobile devices and as places like India, China and sub-Saharan Africa become more connected, you want to think about your users there. Generally, when countries connect to the internet, the beginning of that journey will be on pretty cheap mobile devices that don’t have the latest hardware and that might be running relatively old software that can be easily overwhelmed.
As a developer, especially when you’re building something intended for an international audience, you will want first of all to consider how much logic you’re cramming into the client side. On one hand, you want quite a bit of logic to live on the server side in case the device is not able to handle all of the computation that you’re trying to ship to it. On the other hand, you try to place as much of the logic as possible with the end user, so their devices can get connected to the server quickly.
Is that difficult to test for when your team is based in a modern office in San Francisco, with everyone using brand-new MacBooks connected to a superfast network?
There are lots of tools that can help you test for things like that. One of my favorites is Google Lighthouse, which is available with any Chrome browser right now. If you’re a developer, especially a front-end developer, you’re probably already familiar with Chrome dev tools but there is a new tab now called Audits. Under those Audits, you can select what kind of device you want to run tests for. You can choose whether it’s mobile or desktop, you can test for performance, you can choose artificial network throttling to help you understand the experience of someone on a slower internet connection. But it can also help you improve your business with testing for things like SEO and accessibility.
All these elements we’ve talked about (DNS, CDN, hosting providers) create a complicated map of middlemen between the user and the service. How much of this is under the engineer’s control? If your CDN fails, is there anything you can do?
There are a few approaches you can take pre-emptively. One that’s becoming more prominent is the idea of multi-cloud. You know, AWS has failures and GCP has failures too, so if you have a layer in front of them that balances the load or you set up health checks, then you can make sure your service stays online. You can also have primary and secondary DNS providers. We encourage people to have a very robust infrastructure. At Cloudflare, we have many data centers around the world, so even if one of them goes down you will get routed to the next one and you’re up and available at all times. But ultimately it’s just a bunch of wires hacked together. People are writing code that has bugs, as anyone’s code does. Sometimes there is not much you can do about outages.
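The health check idea can be sketched as follows. This is a simplified illustration, not a description of how Cloudflare or any other provider does it: the function name `pick_provider`, the provider labels and the `is_healthy` callback are all hypothetical. In a real setup the health check would be something like a periodic request to each provider.

```python
def pick_provider(providers, is_healthy):
    """Return the first provider that passes its health check, or None."""
    for provider in providers:  # listed in order of preference
        if is_healthy(provider):
            return provider
    return None  # everything is down; sometimes there is not much you can do


if __name__ == "__main__":
    # Pretend results of the latest health checks.
    status = {"primary-cloud": False, "secondary-cloud": True}
    print(pick_provider(["primary-cloud", "secondary-cloud"], status.get))
```

When the primary fails its check, traffic goes to the secondary without anyone having to step in.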
We opened with a hypothetical so let’s close with one, too. Why can the same app, used from the same network and the same device, feel slower on one day and faster on another?
The internet is a living thing. It can be something as basic as a snowstorm that knocked over some of the wires in your connection. Or someone at the ISP fat-fingered something and now the connection isn’t going through. Another possibility: if there is an emergency happening and everyone is getting online at the same time, the internet routes get congested, which causes slowness. Just like water pipes, the internet pipes can get clogged up too.
Computers Are Hard
- Bugs and incidents with Charity Majors
- Networking with Rita Kozlov
- Hardware with Greg Kroah-Hartman
- Security and cryptography with Anastasiia Voitova
- App performance with Jeff Fritz
- Accessibility with Sina Bahram
- Representing alphabets with Bianca Berning
- Building software with David Heinemeier Hansson