The perfect npm storm — a tale of stale node_modules and incidents

Or how, given enough time, incidents will always create an accident.


Deploying is the best part of being a developer. Releasing freshly-baked, peer-reviewed, test-passing code into the world and making your product better is the crack of the startup world. At Tictail, we love launching every hour, every day.

Except when it takes two days and your code only fails in production, huh?

Why you no work, production code? Biggie wonders…

It all starts with a harmless pull request…

With more than 100,000 stores from hundreds of countries, the Tictail Marketplace is already global. Of course, we want to make sure it also feels truly local to everyone, so we're working hard to bring it to our eleven supported languages. And so, this story starts:

Sounds great, right?! This will be such a major win!

After a long review and lots of improvements, we're ready to launch! The first tidbit of UI is translated into my native Brazilian Portuguese, and it looks gorgeous. 😻

Everything ready, let's go!

And it breaks in production. Rollback!

Time to investigate what's happening. We use the excellent Opbeat, but unfortunately the first information from the stack trace is somewhat unhelpful…

invariant (dist_server/node_modules/react-intl/node_modules/react/lib/invariant.js)
Occurred at:
Error: Minified exception occurred; use the non-minified dev environment for the full error message and additional helpful warnings.

Darn you, React! Darn your excellent production environment optimisations! Anyhow, we managed to run it on the production machine in development mode, and made some progress:

invariant (server.js)
Occurred at:
Invariant Violation: FormattedMessage.render(): A valid ReactComponent must be returned. You may have returned undefined, an array or some other invalid object.

But, wait, what? It works on my machine!!!

I swear :(

What now, my love?

The next couple of hours were spent, as you might expect, cluelessly pushing buttons trying to reproduce the problem in our development environment:

  • Clean ALL the caches.
  • Reinstall ALL the dependencies.
  • Pray to ALL the Gods.

To no avail, of course. Then, after too much coffee and tears, while poking around in the production environment, we noticed something…something Really Strange ®.

> npm ls react
marketplace@0.1.0 /Users/guilherme/Projects/Tictail/marketplace
└─┬ react-intl@2.0.0-beta-2
  └── react@0.13.3 extraneous

Wait, what? We were using react@0.14.7. What is this react@0.13.3 doing there, inside react-intl?!
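A check along these lines could surface a duplicate like this without anyone having to spot it by eye. It is only a minimal sketch: the helper names `findInstalledCopies` and `distinctVersions` are hypothetical, and it assumes a plain npm layout with real directories (no symlinked packages, which could make the walk loop).

```javascript
const fs = require("fs");
const path = require("path");

// Recursively collect every installed copy of `name` under `root/node_modules`.
// Assumption: plain npm layout with real directories, no symlink cycles.
function findInstalledCopies(root, name, found = []) {
  const modulesDir = path.join(root, "node_modules");
  if (!fs.existsSync(modulesDir)) return found;

  for (const entry of fs.readdirSync(modulesDir)) {
    if (entry.startsWith(".")) continue; // skip .bin and other dot-folders

    // Scoped packages live one level deeper: node_modules/@scope/pkg
    const pkgDirs = entry.startsWith("@")
      ? fs
          .readdirSync(path.join(modulesDir, entry))
          .map((p) => path.join(modulesDir, entry, p))
      : [path.join(modulesDir, entry)];

    for (const pkgDir of pkgDirs) {
      const manifest = path.join(pkgDir, "package.json");
      if (!fs.existsSync(manifest)) continue;

      const pkg = JSON.parse(fs.readFileSync(manifest, "utf8"));
      if (pkg.name === name) {
        found.push({ version: pkg.version, dir: pkgDir });
      }
      // Descend into this package's own nested node_modules, if any.
      findInstalledCopies(pkgDir, name, found);
    }
  }
  return found;
}

// Return the sorted list of distinct installed versions of `name`.
function distinctVersions(root, name) {
  const versions = findInstalledCopies(root, name).map((copy) => copy.version);
  return [...new Set(versions)].sort();
}
```

Calling `distinctVersions(process.cwd(), "react")` on the deployed folder and failing loudly when it returns more than one entry would have flagged the stray react@0.13.3 right away.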

And suddenly it all makes sense

What actually happened was a combination of two separate problems:

Unused dependencies declared on package.json

When this project was started, internationalisation was already part of the "spec", of course. That's why react-intl was already included as a dependency. But back in those days, the most recent available version was…


Version 1.2.2, which depends on react@0.13.3. 😭

"name": "react-intl",
"version": "1.2.2",
"dependencies": {
  // (...)
  "react": ">=0.11.2 <0.14.0"
}
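To see why that range leads to a nested copy, it helps to test our two React versions against it. The sketch below is a deliberately tiny, hand-rolled range check that only understands space-separated comparators like the one above; `satisfies` here is an illustrative helper, not npm's actual resolver, and a real project would more likely reach for the `semver` package.

```javascript
// Parse "0.14.7" into [0, 14, 7]. Pre-release tags are out of scope for this sketch.
function parseVersion(version) {
  return version.split(".").map(Number);
}

// Return -1, 0 or 1 depending on how version `a` orders against version `b`.
function compareVersions(a, b) {
  const pa = parseVersion(a);
  const pb = parseVersion(b);
  for (let i = 0; i < 3; i++) {
    if (pa[i] !== pb[i]) return pa[i] < pb[i] ? -1 : 1;
  }
  return 0;
}

// Check a version against a range made of space-separated comparators,
// e.g. ">=0.11.2 <0.14.0". Every comparator has to hold.
function satisfies(version, range) {
  return range
    .trim()
    .split(/\s+/)
    .every((comparator) => {
      const match = /^(>=|<=|>|<|=)?(\d+\.\d+\.\d+)$/.exec(comparator);
      if (!match) throw new Error(`Unsupported comparator: ${comparator}`);
      const [, op = "=", target] = match;
      const cmp = compareVersions(version, target);
      switch (op) {
        case ">=": return cmp >= 0;
        case "<=": return cmp <= 0;
        case ">": return cmp > 0;
        case "<": return cmp < 0;
        default: return cmp === 0;
      }
    });
}

const range = ">=0.11.2 <0.14.0";
console.log(satisfies("0.14.7", range)); // false: our top-level React is too new
console.log(satisfies("0.13.3", range)); // true: so react-intl 1.2.2 gets its own copy
```

Since the React we actually use falls outside the range, the old react-intl could not share it and ended up with a private react@0.13.3 inside its own node_modules.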

But how come this stale nested dependency was still there after we deployed a new package that updates react-intl?

We updated react-intl! What's up?

Our deploy process didn't delete old node_modules

In super simple terms, our deploy process basically consists of using CircleCI to test our code, build production bundles and upload them to S3.

Then, as a separate step, we download this bundle and uncompress it into a specific folder, which is where we run our awesome code from.

Did you miss something there? Yep… We didn't clean the folder. So each deploy only wrote the new files over the existing node_modules directory, and never got rid of stale nested folders. 😱
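The missing step is small. Here is a rough sketch of what "clean, then unpack" could look like in Node; it is not our actual deploy script. `resetReleaseDir` and `deploy` are hypothetical names, `extract` stands in for whatever unpacks the bundle, and `fs.rmSync` assumes Node 14.14 or newer.

```javascript
const fs = require("fs");

// Delete the release directory and recreate it empty, so nothing
// from an earlier deploy can survive into the next one.
function resetReleaseDir(releaseDir) {
  fs.rmSync(releaseDir, { recursive: true, force: true });
  fs.mkdirSync(releaseDir, { recursive: true });
}

// `extract` is a hypothetical callback that unpacks the downloaded
// bundle into the directory it is given.
function deploy(releaseDir, extract) {
  resetReleaseDir(releaseDir);
  extract(releaseDir);
}
```

With this in place, a folder like react-intl/node_modules/react only exists after a deploy if the new bundle actually contains it.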


On accidents and incidents

Granted — it was a pretty gigantic coincidence that we happened to end up with two simultaneous React versions, which basically destroys the React context. However, there's a valuable lesson here:

Enough safety incidents over time will always create an accident.

  • Not cleaning up our dependencies was an incident.
  • Not cleaning up the deploy folder was an incident.
  • Not having the staging environment match the production environment completely was an incident.

And sure enough, they were all bound to conspire and create an accident sooner or later. We have learned our lesson and fixed what we can.

Which raises the question: what incidents are you letting live long and prosper in your codebase? And in your deployment procedure? Maybe it's worth taking a few minutes to review this with your team. Do it now, or do it later, in production.

It's your choice. 😁