An idea: npm should use DNS for optimization

Now that Node.js v4 is released, I was thinking about the number of people who will try this new version on their projects, and about the traffic, resource usage and IOPS the npm, Inc. staff will face during the first days. Of course, they have a great architecture (http://blog.npmjs.org/post/75707294465/new-npm-registry-architecture) and everyone can agree that they do a wonderful job, with almost 100% uptime.

However, as every time I think about an architecture, I was wondering how this system could be optimized, and I ended up with this idea: what if npm, Inc. used DNS for all the read operations?

This would be only partial

First, it’s important to understand that this switch would be only for the READ operations. DNS is, by nature, a client-reads-from-server protocol, while npm is a full CRUD system. The idea would be to switch all READ operations like “What is the latest version of this package?” or “What are the dependencies of this package?”. Basically, this means the view operation of the npm API.

Even if this is only a quarter of the CRUD stack and only one of the more than 40 operations, it’s the main one in terms of use cases and traffic. Every time someone runs npm update, npm install or npm outdated, for example, a view operation is done on each of their dependencies (and on each of the dependencies of those dependencies, and so on…). The Create, Update and Delete operations are more occasional.

So optimizing this one operation is where the biggest gain would be.

How it works now (briefly) and the key to the optimization

Let’s say I have a project which depends on lodash.

When I run npm install, npm parses my package.json file and tries to install every dependency, by name.

That means, for each dependency, it’ll make an HTTP request like http://registry.npmjs.com/lodash :

Two things are important here :

  1. The hostname we made the request to is registry.npmjs.com; so yes, it’s a data registry, and so is a DNS server :)
  2. The Server header is “CouchDB/1.5.0”; so yes, we’re making a (cached by Fastly) HTTP query to the npm database. Basically, registry.npmjs.com is a CouchDB database.

As you can see, the body of the answer is a JSON document which contains ALL the package’s information (name, description, versions, repository URL, authors, etc.), so that npm can check whether the package is installed or needs to be updated according to the dependency declarations.

For lodash, it’s (at the time of writing this post) 109kB; and if, let’s say, we run npm outdated on express, it’s a total of 27 HTTP requests and 700kB of data… Damn, we just wanted to check our package versions!

Well, let’s be honest, this is NOT what happens each time npm checks the version of a package; to get these numbers, I cheated a bit (sorry). npm keeps a cache of the answers (in ~/.npm/registry.npmjs.org/<package>) with the full JSON document plus a property named “_etag”, which lets it use the If-None-Match / 304 Not Modified part of the HTTP protocol. So yes, that part is already optimized, but we still have to query their server for each package and parse a big JSON document every time.
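To make the caching step concrete, here is a minimal sketch of a conditional request built around the cached “_etag” property. It is not npm’s actual code: the function names conditionalHeaders and pickDocument are hypothetical, and the HTTP call itself is left out so that the logic stays self-contained.

```javascript
// Build the request headers from a cached registry document (may be undefined).
// If the cache holds an "_etag", send it back so the server can answer 304.
function conditionalHeaders(cached) {
  const headers = { accept: 'application/json' };
  if (cached && cached._etag) {
    headers['if-none-match'] = cached._etag;
  }
  return headers;
}

// Decide which document to use once the response has arrived.
// 304: nothing changed, reuse the cached document.
// 200: parse the new body and remember its ETag for next time.
function pickDocument(status, body, etag, cached) {
  if (status === 304 && cached) {
    return cached;
  }
  if (status === 200) {
    return { ...JSON.parse(body), _etag: etag };
  }
  throw new Error(`unexpected registry status ${status}`);
}

console.log(conditionalHeaders({ name: 'lodash', _etag: '"abc"' }));
```

Even in the 304 case, a full HTTP round trip is still needed for every package, which is the cost the rest of this post tries to reduce.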

The concept

(If you’re not comfortable with DNS, it would be good to read a bit about it first.)

Imagine that, besides this wonderful architecture, npm, Inc. provided a DNS server for simple queries like the latest version of a package or the tarball URL of a specific version. What we need here is a protocol that returns a piece of information from a registry when requested with a specific key. DNS is perfect for that, specifically the TXT record.

Querying a piece of information from the registry would be a DNS question for <package_name>.<property_path>.<domain>

So let’s say this DNS server has the delegation for dns.npmjs.com.
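The naming scheme can be sketched as a small helper. This is only an illustration: toQueryName is a hypothetical name, and the dns.npmjs.com zone is the one assumed in this post, not an existing service. The length checks reflect the usual DNS limits (63 bytes per label, 253 characters per name).

```javascript
// Hypothetical zone from this post; no such DNS service exists today.
const ZONE = 'dns.npmjs.com';

// Build the DNS name for one property of a package.
// `propertyPath` is an array of keys into the registry document,
// e.g. ['dist-tags', 'latest'] or ['versions', '0.3.0', 'dist'].
function toQueryName(packageName, propertyPath, zone = ZONE) {
  const name = [packageName, ...propertyPath, zone].join('.');
  // DNS limits: each label is at most 63 bytes, the whole name at most 253.
  for (const label of name.split('.')) {
    if (label.length === 0 || Buffer.byteLength(label) > 63) {
      throw new RangeError(`invalid DNS label in "${name}"`);
    }
  }
  if (Buffer.byteLength(name) > 253) {
    throw new RangeError(`DNS name too long: "${name}"`);
  }
  return name;
}

console.log(toQueryName('lodash', ['dist-tags', 'latest']));
console.log(toQueryName('lodash', ['versions', '0.3.0', 'dist']));
```

Note that a key such as 0.3.0 contains dots, so it ends up spread over three DNS labels; the server has to take that into account when it maps the name back to a property.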

You want the latest version of the lodash package?
$> dig +short -t TXT lodash.dist-tags.latest.dns.npmjs.com
3.10.1

You want the tarball URL of version 0.3.0 of lodash?
$> dig +short -t TXT lodash.versions.0.3.0.dist.dns.npmjs.com
{shasum: "e7db413e8b50d48834e86f82c52fd64f0a9e5d6e",tarball: "http://registry.npmjs.org/lodash/-/lodash-0.3.0.tgz"}

You want the description of lodash?
$> dig +short -t TXT lodash.description.dns.npmjs.com
The modern build of lodash modular utilities.

Well, I think you got the concept.
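From Node.js, the same lookups could be made with the built-in dns module. The sketch below assumes the hypothetical dns.npmjs.com zone; queryTxt and joinTxtRecords are illustrative names, and the resolver is injectable so the function can be tried without a real server.

```javascript
const dns = require('dns');

// Node returns each TXT record as an array of chunks (a single TXT string
// is limited to 255 bytes); join the chunks to recover the original value.
function joinTxtRecords(records) {
  return records.map((chunks) => chunks.join(''));
}

// Look up the TXT records for `name`.
// `resolveTxt` defaults to Node's promise-based resolver.
async function queryTxt(name, resolveTxt = (n) => dns.promises.resolveTxt(n)) {
  return joinTxtRecords(await resolveTxt(name));
}

// Usage, if the hypothetical zone existed:
// queryTxt('lodash.dist-tags.latest.dns.npmjs.com').then(console.log);
```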

The advantages

Less traffic

An average HTTP query is 250B plus the package name length; a DNS question message is 16B plus the domain length. Counting that fixed overhead alone, that’s more than 90% less.

The average response (assuming a 304 Not Modified answer) is also about 250B; a DNS resource record is 10B plus the domain length plus the data, so we still get a huge traffic reduction.
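To check the question size on a concrete name, here is a sketch that encodes a DNS question by hand, following the RFC 1035 wire format (12-byte header, length-prefixed labels, then type and class). encodeQuestion is an illustrative helper, and the HTTP figure is simply the estimate used above, not a measurement. The exact fixed cost works out to 18 bytes plus the name length, close to the 16B estimate.

```javascript
// Encode a DNS question message (RFC 1035): header, QNAME, QTYPE, QCLASS.
function encodeQuestion(name, qtype = 16 /* TXT */) {
  const header = Buffer.alloc(12);
  header.writeUInt16BE(0x1234, 0); // ID (arbitrary for this sketch)
  header.writeUInt16BE(0x0100, 2); // flags: recursion desired
  header.writeUInt16BE(1, 4);      // QDCOUNT = 1 question

  // QNAME: each label is prefixed with its length in one byte.
  const labels = name.split('.').map((label) => {
    const bytes = Buffer.from(label, 'ascii');
    return Buffer.concat([Buffer.from([bytes.length]), bytes]);
  });

  const tail = Buffer.alloc(5);    // root label (0), QTYPE, QCLASS
  tail.writeUInt16BE(qtype, 1);
  tail.writeUInt16BE(1, 3);        // QCLASS = IN

  return Buffer.concat([header, ...labels, tail]);
}

const question = encodeQuestion('lodash.dist-tags.latest.dns.npmjs.com');
const httpEstimate = 250 + 'lodash'.length; // estimate from this post

console.log(`DNS question: ${question.length} bytes`);
console.log(`HTTP request (estimate): ${httpEstimate} bytes`);
```

For this particular name the question is 55 bytes, against roughly 256 bytes for the HTTP estimate.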

Yeah, UDP!

Yep, UDP traffic is faster and easier to manage, and it reduces latency: there is no connection handshake before the query is sent.

Less parsing

Avoiding parsing the whole JSON document every time you only need to check a package version is a huge win: you can update, deploy and install faster!

Like a free and bigger CDN

Every DNS record has a TTL. This means that every user’s internet provider would offer npm, Inc. a free and optimized CDN, while npm, Inc. could still decide and control how long the answer is cached. Google’s public DNS would become a damned fast mirror!

Easier implementation

Third-party projects that monitor your package versions (like david) would be able to perform a version check without the overhead of the whole npm dependency and the parsing.

Even better, it gives other languages (Go, PHP, Python, etc.) the ability to query the npm, Inc. registry easily, for interoperability.

Less firewall hassle

Allowing your production server to query registry.npmjs.com is often a nightmare in the enterprise world: you have to negotiate hard with your sysadmin or use a proxy. DNS, on the other hand, is almost never blocked, because it goes through a local server. No more configuration!

Implementation

Another piece of good news is that it doesn’t seem very hard to implement. It would take two steps:

A custom DNS server

This part is the easier one: with the help of node-dns, it’s just about parsing the queried domain, parsing the JSON from CouchDB (and probably caching it in Redis or similar) and replying with the requested property.

The dns.npmjs.com zone would be delegated to this server, and that’s it.
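The core of such a server, mapping a queried name back to a property of the registry document, could look like the sketch below. It deliberately leaves out node-dns and CouchDB: resolveProperty is a hypothetical function, and getDocument stands in for whatever fetches (and caches) the package’s JSON document. The sample data reuses the lodash values shown earlier in this post.

```javascript
// Map a queried name such as "lodash.versions.0.3.0.dist.dns.npmjs.com"
// to a value in the package's registry document. Returns the TXT payload
// as a string, or null when nothing matches.
// Limitation of this sketch: the package name is taken to be the first
// label only, so names containing dots would need an escaping rule.
function resolveProperty(queryName, zone, getDocument) {
  const suffix = '.' + zone;
  if (!queryName.endsWith(suffix)) return null;

  const labels = queryName.slice(0, -suffix.length).split('.');
  let node = getDocument(labels[0]);
  if (node == null) return null;

  let i = 1;
  while (i < labels.length) {
    if (node === null || typeof node !== 'object') return null;
    let matched = false;
    // Keys such as "0.3.0" span several labels: try the longest candidate first.
    for (let j = labels.length; j > i; j--) {
      const key = labels.slice(i, j).join('.');
      if (Object.prototype.hasOwnProperty.call(node, key)) {
        node = node[key];
        i = j;
        matched = true;
        break;
      }
    }
    if (!matched) return null;
  }
  return typeof node === 'string' ? node : JSON.stringify(node);
}

// Sample document, trimmed to the fields used in this post.
const sample = {
  name: 'lodash',
  description: 'The modern build of lodash modular utilities.',
  'dist-tags': { latest: '3.10.1' },
  versions: {
    '0.3.0': {
      dist: {
        shasum: 'e7db413e8b50d48834e86f82c52fd64f0a9e5d6e',
        tarball: 'http://registry.npmjs.org/lodash/-/lodash-0.3.0.tgz',
      },
    },
  },
};
const getSample = (pkg) => (pkg === 'lodash' ? sample : null);

console.log(resolveProperty('lodash.dist-tags.latest.dns.npmjs.com', 'dns.npmjs.com', getSample));
```

A real server would also have to split answers longer than 255 bytes into several TXT strings, which this sketch does not do.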

Update npm/npm

The goal here is to detect these READ operations and migrate them to DNS queries, while keeping the classic CouchDB query where needed.

I will make a POC in the next few days… if it sounds good to you. Does it?
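On the client side, one way to keep the classic query “where needed” is a DNS-first lookup with a fallback to the registry. This is only a sketch of the idea, not a patch for npm: latestVersion, viaDns and viaRegistry are hypothetical names, and both transports are passed in as functions.

```javascript
// Try the DNS lookup first; if it fails or returns nothing, fall back to
// the classic HTTP registry query. Both transports are injected:
//   viaDns(pkg)      -> Promise<string | undefined>
//   viaRegistry(pkg) -> Promise<string>
async function latestVersion(pkg, { viaDns, viaRegistry }) {
  try {
    const version = await viaDns(pkg);
    if (version) return version;
  } catch (err) {
    // DNS unavailable or record missing: fall through to the registry.
  }
  return viaRegistry(pkg);
}

// Example with stubbed transports: DNS fails, the registry answers.
latestVersion('lodash', {
  viaDns: async () => { throw new Error('SERVFAIL'); },
  viaRegistry: async () => '3.10.1',
}).then((version) => console.log(version));
```

Write operations (create, update, delete) would not go through this path at all and would keep using HTTP.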

CTO, Senior Full-Stack Web Developer, DevOps, hacker. Father. #NodeJS, #Typescript, #Golang
