My node_modules are in git again


Update (2018)

It’s been a few years since I wrote this tool, and it has since been superseded by Yarn and its offline mirror feature.
Yarn provides a better way to manage node_modules, so I urge everyone to use it.
I’ll leave the post below unchanged, but obviously it is a bit outdated.

For some time now I have not been able to imagine building commercial websites without Continuous Integration and node.js (io.js).

This post could have been a short note about https://drone.io, https://codeship.com or https://travis-ci.org: how to set up your deployment branch, webpack and gulp and be happy.

But then real life happened and, for various reasons, I did not get reliable and consistent node.js builds on every deployment.

Most of those red badges were caused by problems installing packages from the npm repository.

Why was cb() never called when phantom.js was compiled? Why does it happen one time in five on a CI server?

Running a fresh npm install for every build/deployment has a few common problems:

  1. It takes time to download and compile all packages
  2. The build may fail because of problems on the npm server or client (there is a great talk by an npm lead about what they did to sort out npm’s scaling problems)
  3. Node.js packages by design depend on each other loosely. One library may specify “1.*” as the required version of some other library. Today that can resolve to 1.0.1 and in a month to 1.9.3. Things break often.
  4. Packages may be removed from the npm repository, or (until npm stopped allowing it) force-updated without a version change, because that is what maintainers do.
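
To make point 3 concrete, here is a minimal sketch of why a range like “1.*” lets the installed version drift. The `matchesMajorWildcard` helper is hypothetical and only understands the `MAJOR.*` form; npm itself resolves ranges with far richer rules than this.

```javascript
// Hypothetical helper: checks a version against a "MAJOR.*" range only.
// Real npm range resolution supports many more range forms than this sketch.
function matchesMajorWildcard(range, version) {
  const rangeMatch = /^(\d+)\.\*$/.exec(range);
  const versionMatch = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (!rangeMatch || !versionMatch) {
    throw new Error("unsupported range or version: " + range + " / " + version);
  }
  // Only the major number has to agree; minor and patch are free to move.
  return rangeMatch[1] === versionMatch[1];
}

// Both releases satisfy "1.*", so two installs a month apart can differ.
console.log(matchesMajorWildcard("1.*", "1.0.1")); // true
console.log(matchesMajorWildcard("1.*", "1.9.3")); // true
console.log(matchesMajorWildcard("1.*", "2.0.0")); // false
```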

There are a few common practices to make your builds reproducible and (more) reliable that the community uses:

Committing node_modules folder to your source code repository

This was officially recommended by Node/npm before 2015.

Though reliable, this practice was disliked by many people:

  1. node_modules is large: 1500 files in hundreds of folders is common for a medium-sized web application. That is a big distraction during code reviews, branch merges and other team code management.

  2. Binary packages are compiled for a specific architecture. If you develop on a Windows PC and your CI runs on an AWS Linux box, you’ll have to keep your dev PC artefacts from being committed to the production folder.

  3. Binary dependencies are usually large. My current project’s node_modules folder is 130MB. If it goes into source control, every update may add another 130MB to the repository size, and over time that slows down the whole team’s workflow.

Using npm shrinkwrap

Shrinkwrapping is the recommended option to “lock down” the dependency tree of your application.
I used it throughout 2014 and found that too many inconveniences accompany this technique:

  1. Dependency on npm server availability at every CI build. npm availability is quite good in 2015, but you don’t want another moving part when doing an urgent production deployment.

  2. Managing npm-shrinkwrap.json is not straightforward. As of npm@2.4 you have to remove the folder of the dependency that you want to update, run npm update --save and then npm shrinkwrap. The strict order of those commands caused me some pain, and I had to manually review changes in a 6000-line generated file “just in case”.

  3. Even though npm no longer allows force-updating a package without changing its version, packages can still be removed from the repository. What are the chances that a package/version you use gets removed from npm? It happened to me.

  4. There are “optionalDependencies” that will be installed on a Windows PC but not on Linux, for example. This causes different systems to generate different shrinkwrap files.

  5. When npm@3.0 is released, node_modules will be flat. Will that break existing shrinkwraps?

The optional dependency fsevents does not get installed on Linux but does get installed on Mac. Make sure you generate a CI-compatible npm-shrinkwrap.json.

Retaining node_modules between CI builds

It is quite possible that your CI provider (Codeship does it) or Node.js web service infrastructure caches the node_modules folder between builds because of some of the issues outlined above.

Though it is a big help, there are a few problems:

  1. Unfortunately npm does not have a “native” way to verify that the node_modules folder satisfies the version requirements in package.json or npm-shrinkwrap.json. There are some third-party tools available, but at the time of writing none is perfect.
  2. Retaining node_modules between CI runs may be useless for actively developed codebases because the cache will be invalidated often.
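
To illustrate what such a verification involves, here is a minimal sketch of a check that compares installed packages with package.json. It is not one of the third-party tools mentioned above: `findMismatches` is a hypothetical helper, and it assumes every dependency is pinned to an exact version (a range such as “1.*” is simply reported as a mismatch).

```javascript
const { readFileSync, existsSync } = require("fs");
const { join } = require("path");

// Hypothetical helper: returns the names of top-level dependencies that are
// missing from node_modules or installed at a different version than the
// exact one pinned in package.json. Version ranges are not understood.
function findMismatches(projectDir) {
  const manifest = JSON.parse(readFileSync(join(projectDir, "package.json"), "utf8"));
  const wanted = manifest.dependencies || {};
  return Object.keys(wanted).filter((name) => {
    const installedManifest = join(projectDir, "node_modules", name, "package.json");
    if (!existsSync(installedManifest)) {
      return true; // not installed at all
    }
    const installed = JSON.parse(readFileSync(installedManifest, "utf8"));
    return installed.version !== wanted[name];
  });
}
```

Calling `findMismatches(process.cwd())` from the project root returns an empty array when every pinned top-level dependency is present at the expected version. Nested dependencies are not inspected, which is one reason a check like this is far from perfect.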

What I ended up with

I decided to go back to the old recommended way of committing node_modules to a git repo, but with some extra effort to get rid of the disadvantages outlined above.

The idea is not mine; I evolved it from one of the Mozilla repositories.

The implementation commits the node_modules folder to a standalone git repository. It benefits from node_modules being retained between builds, but that is not required.

The source code is here: https://github.com/bestander/npm-git-lock

To use it:

sudo npm install -g npm-git-lock  
cd [your work directory]
npm-git-lock --repo [git@bitbucket.org:your/dedicated/node_modules/git/repository.git]

If you run this command at the start of your build, it will check out node_modules from the remote repo when this package.json has been used with the command before; otherwise it will run npm install.

How it works

The algorithm is simple. When doing a build:

  1. Check if a node_modules folder is present in the working directory
  2. If node_modules exists, check if there is a separate git repository in it
  3. Calculate the sha1 hash of package.json in base64 format
  4. If the remote repo from [2] has a commit tagged with the sha1 from [3], check it out clean; no npm install is required
  5. Otherwise remove everything from node_modules, do a clean npm install, commit, tag with the sha1 from [3] and push to the remote repo
  6. Next time you build with the same package.json, it is guaranteed that you get node_modules from [5]
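
The core of steps 3 to 5 can be sketched as a pure decision, leaving out the git and npm calls. This is an illustration under assumptions, not the tool’s actual source: `packageJsonTag`, `planBuild` and the action names are hypothetical, and how the base64 string (which can contain characters such as `/`) is turned into a valid tag name is not covered here.

```javascript
const { createHash } = require("crypto");

// Step 3: sha1 of the package.json contents, base64-encoded.
function packageJsonTag(contents) {
  return createHash("sha1").update(contents).digest("base64");
}

// Steps 4 and 5 as a pure decision: reuse a tagged commit if one exists,
// otherwise fall back to a clean install that gets committed and tagged.
// The action names are made up for this sketch.
function planBuild(contents, remoteTags) {
  const tag = packageJsonTag(contents);
  return remoteTags.includes(tag)
    ? { action: "checkout", tag }
    : { action: "install-commit-tag-push", tag };
}

const manifest = JSON.stringify({ name: "demo", dependencies: { left: "1.0.0" } });
const firstBuild = planBuild(manifest, []);
const secondBuild = planBuild(manifest, [firstBuild.tag]);
console.log(firstBuild.action);  // install-commit-tag-push
console.log(secondBuild.action); // checkout
```

Because the tag depends only on the bytes of package.json, any edit to that file, even whitespace, produces a new tag and triggers a fresh install.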

Amazing features

With this tool you get:

  1. Minimal dependency on npm server availability for builds of projects whose package.json has not changed
  2. No noise in your main project’s Pull Requests: all packages are committed to a separate git repository that does not need to be reviewed or maintained
  3. If the separate git repository for packages gets too large and slows down your builds after a few years, you can just create a new one and keep the old one for patches if you need it
  4. It does not interfere with the recommended npm workflow: you can use it only on your CI system with no side effects for your dev environment, or you can mix it with shrinkwrapping
  5. You can have different node_modules repositories for different operating systems. Your CI is likely to be Linux while your dev machines may be Mac or Windows. You can set up three repositories for them and use them independently.
  6. And it is blazing fast:
     - A fresh npm install for a package.json that the script sees for the first time takes 1 min 8 secs
     - A build with cached node_modules and a package.json that was built before takes 6 seconds
     - A build with cached node_modules already pointing to the sha1 of the current package.json takes just 3 seconds
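
For point 5 in the list above, one way to pick a repository per operating system is a small lookup keyed by Node’s platform name. This is only a sketch: the repository URLs are placeholders and `repoFor` and `lockCommand` are hypothetical helpers, not part of npm-git-lock.

```javascript
// Placeholder repository URLs, one per operating system. The keys are the
// values Node's os.platform() reports for Linux, Mac and Windows.
const REPOS = new Map([
  ["linux", "git@example.com:team/node_modules-linux.git"],
  ["darwin", "git@example.com:team/node_modules-mac.git"],
  ["win32", "git@example.com:team/node_modules-windows.git"],
]);

// Hypothetical helper: look up the repository for a platform name.
function repoFor(osName) {
  const repo = REPOS.get(osName);
  if (!repo) {
    throw new Error("no node_modules repository configured for " + osName);
  }
  return repo;
}

// Hypothetical helper: arguments for the npm-git-lock command shown earlier.
function lockCommand(osName) {
  return ["npm-git-lock", "--repo", repoFor(osName)];
}

console.log(lockCommand("linux").join(" "));
```

In a build script you would pass `require("os").platform()` instead of a hard-coded name, so each machine pulls the node_modules that were compiled for its own system.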

I feel quite comfortable with this way of “sealing” my packages in CI. Please provide feedback and point out any errors that I have made.