Why we built our own DNS server from the ground up
DNS is a venerable protocol, one that has served a core function for the internet since the early 1980s. There is plenty of great, battle-tested open source authoritative DNS server software out there, from the original and still nearly ubiquitous BIND to newer but widely used implementations like PowerDNS, Knot, and more. So why on earth, when we started NS1, did we build our own authoritative DNS server software from scratch, unlike nearly every other DNS service provider in the market?
A global architecture
Most authoritative DNS software that’s out there today was built to do one thing above all else: deliver statically configured DNS zones and records via the network, with deep coverage of the protocol specifications, reliably and fast, from individual servers. And the server software produced by the open source community is exceedingly good at meeting this use case.
When we were starting NS1, there was just one problem: that wasn’t our use case. And we’d seen the ceilings we would hit with this model before, when we used heavily modified open source DNS software to solve traffic management problems in a CDN we’d built. It got the job done, but not without a lot of hackery and many architectural backflips, because we were using the software as part of a highly distributed, highly dynamic global system driven by an intensive data pipeline and in need of frequent reconfiguration as zones and records changed dynamically, new service endpoints spun up and down, and infrastructure conditions in the application shifted. We had learned from our experience that for this use case, we’d need to build something new and different.
So from the beginning, we built our DNS server to be a small (but obviously important) part of a global architecture — ingesting a constant stream of configuration changes and data, generating a steady flow of telemetry and analytics, and automated and managed not with zone files and SIGHUPs but with high velocity globally replicated databases and message queueing.
A single-minded purpose
When we built our new DNS platform, we started with the core that enables our game-changing traffic management capabilities: the Filter Chain. To experienced engineers, that might sound surprising: why not start by experimenting with and selecting an optimized server framework, or building rock-solid DNS protocol logic? Simple: because that’s not actually what NS1 is about. To us, DNS is the substrate we use to deliver intelligence. And unless we got that intelligence just right, none of the rest mattered.
So we focused from the very start on how we’d combine flexibility, power, and simplicity to invent a new way to manage traffic with DNS, and iterated and iterated and iterated until we’d built the Filter Chain: a super simple idea, entirely new in the industry, for combining high performance traffic routing algorithms in bespoke application-specific setups, driven in real-time by telemetry about the infrastructure of the application and the state of the internet.
The Filter Chain is still the core of our DNS server software, and it still executes in real time for every single DNS query that hits NS1’s servers, because we built all the other scaffolding of our software and architecture with that single-minded purpose.
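To make the idea concrete, here is a minimal sketch of a filter chain in Python. It is an illustration under our own assumptions, not NS1's actual implementation: the `Answer` type, the filter names (`up`, `geotarget`, `sort_by_load`, `select_first_n`), and the metadata keys are all hypothetical. Each filter takes the list of candidate answers plus some query context and returns a narrowed or reordered list, and the chain runs once per query.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Answer:
    """One potential answer for a record, plus metadata filters can inspect."""
    rdata: str
    meta: Dict[str, object] = field(default_factory=dict)


# A filter maps (candidate answers, query context) -> new candidate list.
Filter = Callable[[List[Answer], Dict[str, object]], List[Answer]]


def up(answers: List[Answer], ctx: Dict[str, object]) -> List[Answer]:
    """Drop answers marked down; a missing 'up' flag is treated as up."""
    return [a for a in answers if a.meta.get("up", True)]


def geotarget(answers: List[Answer], ctx: Dict[str, object]) -> List[Answer]:
    """Prefer answers in the requester's region; keep all if none match."""
    local = [a for a in answers if a.meta.get("region") == ctx.get("region")]
    return local or answers


def sort_by_load(answers: List[Answer], ctx: Dict[str, object]) -> List[Answer]:
    """Order answers from least to most loaded (hypothetical 'load' metric)."""
    return sorted(answers, key=lambda a: a.meta.get("load", 0.0))


def select_first_n(n: int) -> Filter:
    """Build a filter that keeps only the first n answers."""
    def _select(answers: List[Answer], ctx: Dict[str, object]) -> List[Answer]:
        return answers[:n]
    return _select


def run_chain(chain: List[Filter], answers: List[Answer],
              ctx: Dict[str, object]) -> List[Answer]:
    """Apply each filter in order; the survivors become the DNS response."""
    for f in chain:
        answers = f(answers, ctx)
    return answers


answers = [
    Answer("192.0.2.1", {"up": True, "region": "us-east", "load": 0.7}),
    Answer("192.0.2.2", {"up": False, "region": "us-east", "load": 0.1}),
    Answer("198.51.100.1", {"up": True, "region": "eu-west", "load": 0.2}),
    Answer("192.0.2.3", {"up": True, "region": "us-east", "load": 0.3}),
]
chain = [up, geotarget, sort_by_load, select_first_n(1)]

result = run_chain(chain, answers, {"region": "us-east"})
print([a.rdata for a in result])  # ['192.0.2.3']
```

The appeal of this shape is that each filter stays tiny and independent, so an application-specific setup is just a different ordering of the same building blocks.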
Breaking the rules
We also knew from the start that we didn’t want to confine ourselves to the traditional RFCs that govern the agreed-upon rules for DNS systems. The DNS protocol itself obviously constrains the way NS1’s systems interact with the rest of the internet. But within those limitations, what more could we accomplish?
Rethinking the traditional DNS data model was where we started. By reimagining DNS records as collections of potential answers, to be manipulated by an algorithmic traffic management pipeline, we avoided many of the configuration management backflips needed to make use of other platforms and enabled an elegant approach to managing complex DNS setups with many-faceted traffic management rulesets. Our approach has enabled us to deliver not just the most powerful DNS platform on the planet, but also the simplest and most approachable.
And at the same time, by building our own DNS server, we’ve enabled ourselves to explore ideas for application performance optimization, ease of use, and visibility that push the envelope of the protocol. For example, linked records and zones (essentially symlinks within NS1’s platform) are a super simple idea, not unlike CNAME and DNAME records, but they reduce recursive DNS lookups, solve CNAME-at-the-apex issues (where we’ve also long supported ALIAS records), enable deep performance improvements for our customers, and make complex name-aliasing issues easy to solve.
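A rough sketch of the symlink analogy, again in Python and again hypothetical: the `Record` class, the `resolve` function, and the in-memory `store` are our own illustrative names, not NS1's data model. The point it shows is that the link is followed inside the server, so the client receives final answers in a single response instead of chasing a CNAME chain.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (fully qualified name, record type)


class Record:
    """Either holds answers directly or links to another record's name."""

    def __init__(self, answers: Optional[List[str]] = None,
                 link: Optional[str] = None) -> None:
        self.answers = answers or []
        self.link = link


def resolve(store: Dict[Key, Record], name: str, rtype: str,
            max_hops: int = 8) -> List[str]:
    """Follow links server-side and return the final answers.

    Raises ValueError on a link loop or an overly long chain.
    """
    seen = set()
    key = (name, rtype)
    for _ in range(max_hops):
        if key in seen:
            raise ValueError(f"link loop at {key[0]}")
        seen.add(key)
        rec = store.get(key)
        if rec is None:
            return []  # nothing configured at the link target
        if rec.link is None:
            return list(rec.answers)
        key = (rec.link, rtype)
    raise ValueError("too many link hops")


store = {
    ("www.example.com", "A"): Record(answers=["192.0.2.10", "192.0.2.11"]),
    # The zone apex mirrors www without needing a CNAME at the apex.
    ("example.com", "A"): Record(link="www.example.com"),
}

print(resolve(store, "example.com", "A"))  # ['192.0.2.10', '192.0.2.11']
```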
Milliseconds matter in more than delivery
Traditional Managed DNS players have differentiated mainly on network characteristics: DNS lookup uptime and response time. Our view as we were starting NS1 was that 100% uptime and super-fast global lookup performance had become table stakes, not differentiators. It is not that hard to build a reliable, low latency global anycasted DNS network these days. But while we’ve focused our efforts on delivering the most advanced traffic management capabilities and most usable, flexible tools for managing DNS in the industry, performance still matters, and we have stretched the expectations for Managed DNS performance in new dimensions.
For example, we knew change propagation was a huge issue for companies making dynamic updates to their DNS configurations. Traditional DNS servers weren’t built to ensure configuration changes pushed to a “central” API can propagate quickly across a global network of DNS delivery POPs, including potentially thousands of individual DNS servers. But that’s exactly what NS1’s systems do: they propagate changes globally at close to the speed of light, and that’s only possible because of the configuration and telemetry pipelines we built directly into our DNS server software from the beginning. These pipelines plug into a global messaging topology to ensure that the instant DNS records are added or updated through the NS1 API, those updates are pushed to our edge locations, into every DNS server instance, and through every layer of internal caching, so on the next DNS query we respond with the newest configuration.
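The push-and-invalidate pattern described here can be sketched in a few lines of Python. This is a single-process toy under stated assumptions, not a description of NS1's messaging topology: `Bus` stands in for whatever message queueing system carries updates, `EdgeServer` stands in for one DNS server instance, and the per-record version number is our own device for discarding stale or duplicate messages.

```python
from typing import Dict, List, Tuple


class EdgeServer:
    """One DNS server instance with a record store and an answer cache."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.records: Dict[str, Tuple[int, List[str]]] = {}  # fqdn -> (version, answers)
        self.cache: Dict[str, List[str]] = {}                # fqdn -> cached answers

    def apply(self, fqdn: str, version: int, answers: List[str]) -> bool:
        """Apply a pushed update; ignore it if it is stale or a duplicate."""
        current = self.records.get(fqdn)
        if current is not None and current[0] >= version:
            return False
        self.records[fqdn] = (version, list(answers))
        self.cache.pop(fqdn, None)  # invalidate so the next query sees the change
        return True

    def query(self, fqdn: str) -> List[str]:
        """Answer from cache, filling it from the record store on a miss."""
        if fqdn not in self.cache:
            _, answers = self.records.get(fqdn, (0, []))
            self.cache[fqdn] = list(answers)
        return self.cache[fqdn]


class Bus:
    """Stand-in for a message queue that fans updates out to every edge."""

    def __init__(self) -> None:
        self.subscribers: List[EdgeServer] = []

    def subscribe(self, edge: EdgeServer) -> None:
        self.subscribers.append(edge)

    def publish(self, fqdn: str, version: int, answers: List[str]) -> None:
        for edge in self.subscribers:
            edge.apply(fqdn, version, answers)


bus = Bus()
edges = [EdgeServer("edge-1"), EdgeServer("edge-2")]
for e in edges:
    bus.subscribe(e)

bus.publish("www.example.com", 1, ["192.0.2.1"])
print(edges[0].query("www.example.com"))  # ['192.0.2.1']

bus.publish("www.example.com", 2, ["192.0.2.2"])
print(edges[0].query("www.example.com"))  # ['192.0.2.2']
```

The detail worth noticing is that the update path, not a timer, clears the cache entry, which is what lets the very next query reflect the new configuration.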
The same mentality drives our approach to telemetry. Observability is critical in any distributed system, and ours is about as distributed as you can get: it spans dozens of datacenters, hundreds of servers, and thousands of instances, all in the critical path for our customers. That means we need to notice potentially concerning issues instantly. Our DNS server generates internal telemetry across dozens of dimensions and pumps it at granularities as low as 1 second to our operations dashboards and analytics and alerting tools, so we can react quickly as conditions shift.
We treat DNS analytics no differently, and we present our customers with real-time query metrics for every single DNS record in our platform, using a lightweight but powerful query analytics aggregation engine built right into our DNS server itself. Four years later, NS1 is still the only industry player with such granular analytics delivered in real-time. That high velocity visibility matters to our customers.
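As an illustration of what per-record aggregation inside the server might look like, here is a small Python sketch. It is an assumption-laden toy, not NS1's engine: the `QueryCounter` class and its methods are hypothetical names, and a real server would do this with far more attention to memory and concurrency. Queries are counted per record in fixed time buckets (1 second by default), and completed buckets can be flushed to a downstream analytics system.

```python
import time
from collections import defaultdict
from typing import Dict, Optional, Tuple

RecordKey = Tuple[str, str]           # (fully qualified name, record type)
BucketKey = Tuple[RecordKey, int]     # (record, bucket start in seconds)


class QueryCounter:
    """Counts queries per record in fixed-width time buckets."""

    def __init__(self, bucket_seconds: int = 1) -> None:
        self.bucket_seconds = bucket_seconds
        self.counts: Dict[BucketKey, int] = defaultdict(int)

    def record_query(self, fqdn: str, rtype: str,
                     now: Optional[float] = None) -> None:
        """Increment the counter for this record in the current bucket."""
        ts = time.time() if now is None else now
        bucket = int(ts // self.bucket_seconds) * self.bucket_seconds
        self.counts[((fqdn, rtype), bucket)] += 1

    def qps(self, fqdn: str, rtype: str, bucket: int) -> float:
        """Average queries per second for one record in one bucket."""
        return self.counts.get(((fqdn, rtype), bucket), 0) / self.bucket_seconds

    def flush(self, before: int) -> Dict[BucketKey, int]:
        """Remove and return all buckets that started before `before`."""
        done = {k: v for k, v in self.counts.items() if k[1] < before}
        for k in done:
            del self.counts[k]
        return done


counter = QueryCounter()
# Explicit timestamps keep the example deterministic.
counter.record_query("www.example.com", "A", now=100.1)
counter.record_query("www.example.com", "A", now=100.9)
counter.record_query("www.example.com", "A", now=101.2)

print(counter.qps("www.example.com", "A", 100))  # 2.0
```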
Control = velocity
One of the most important motivations behind building our own DNS server was the desire to move fast. That might sound counterintuitive: isn’t building a DNS server that handles all the wacky exigencies of a decades-old protocol kind of a laborious process? Yep, it is — and there’s certainly a lot of up-front complexity and cost in going down this path. But it’s hard to overestimate the value of having complete control over the architecture of your most critical service delivery asset.
That control has enabled us to deliver at high velocity new features like linked records and zones, ALIAS records, new kinds of Filter Chain algorithms, support for new DNS record types, and more.
And just as importantly, it’s enabled us to react quickly to changing workloads and scaling challenges. For example, we’ve been able to rapidly iterate on our techniques for mitigating certain kinds of DDoS attacks, quickly start gathering new types of telemetry, introduce new caching and data architecture strategies, and re-work internal load balancing and load shedding approaches.
Building our own DNS server was a more intensive up-front effort, but the control it’s afforded us has helped us be the most innovative company in our space ever since.
Where to from here?
Building and iterating on NS1’s own in-house DNS server has served us well. So well, in fact, that we’re doing it again. After years of scaling, data gathering, operational experience, and optimization, we’ve learned an incredible amount about the demands of operating global authoritative DNS systems at scale. This year, we’ve embarked on a from-scratch rewrite of our edge DNS server software based on those lessons, with an eye toward truly massive scale, future threats, and bleeding edge functionality for our customers.
So stay tuned — the fun is just getting started.