Why we stopped using ‘npm start’ for running our blockchain core’s child processes
You shouldn’t start applications through npm when your project relies on the child processes natively supported by Node.js. In this article, we provide a list of best practices for Node.js applications, along with a code snippet that outlines the core problem and shows you how to reproduce the issue in 3 steps. In short, we stopped using npm start to run our blockchain’s core and instead opted for the native node command.
Introduction to npm and its most well-known command ‘npm start’.
The most well-known npm command is npm start, which acts as a wrapper for node app.js.
Our challenge: npm runs app.js files as a child process of npm.
However, what many don’t know is that when you use npm start to trigger node app.js, npm actually runs your app.js file as a child process that npm itself manages. In 99% of cases you don’t need to care about this, but things can get tricky when you work with child processes in your own project. Can you feel the inception happening here? #child-process-inception
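One way to see this parent/child relationship for yourself is to print the process IDs from inside a script. The following is a minimal sketch for illustration only; the file name whoami.js and the helper describeProcess are hypothetical and do not come from Lisk Core.

```javascript
// whoami.js (hypothetical file name): report this process and its parent.
function describeProcess(proc = process) {
  return {
    pid: proc.pid,   // ID of the Node.js process running this script
    ppid: proc.ppid, // ID of the process that launched it
  };
}

const info = describeProcess();
console.log(`pid=${info.pid} parent pid=${info.ppid}`);
```

Started with node whoami.js, the parent is your shell. Started through an npm script, the parent should instead be npm, or a shell that npm spawned to run the script, depending on the platform.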
To give you a better understanding of how this is relevant to our “npm vs node” problem, let’s talk about how we are running Lisk Core. For those who don’t know what Lisk Core is, essentially, it is a program that implements the Lisk Protocol which includes consensus, block creation, transaction handling, peer communication, etc. Every machine must set it up to run a node that allows for participation in the network.
Intro to PM2, a production process manager for Node.js apps.
In our case, we use PM2 to restart the application upon failure. PM2 is a production process manager for Node.js applications with a built-in load balancer. It allows you to keep applications alive forever, to reload them without downtime and to facilitate common system admin tasks.
A few weeks ago, we decided to provide the ability to run the http_api module as a child process to improve the overall efficiency of the Lisk Core application while using the same allocated resources.
Rationale behind the decision to run http_api module as a child process.
The idea behind this decision was mainly founded on the fact that functionally isolated components can form the basis of a multi-process application, which can utilize multiple hardware cores of the physical processor when they are available. We also wanted to design each component in a resilient way to counter the brittleness of multi-processing. This means that a failure of one component has minimal impact on the other components and that components can recover individually. More information about child processes can be found in our proposal to introduce a new flexible, resilient and modular architecture for Lisk Core.
We were not able to gracefully exit Lisk Core with npm.
While implementing child processes for the http_api module, Lightcurve Backend Developer Lucas Silvestre discovered that Lisk Core was not exiting gracefully while running the http_api module as a child process using PM2. This resulted in a tricky situation where http_api kept on running in the background whenever the main process (Lisk Core) crashed.
Whenever this happened, PM2 attempted to recover the Lisk Core process. However, this spawned a new http_api process, which failed because the port was still in use: the cleanup process had never been called. This resulted in PM2 not being able to restore the application, which is a big issue when running a blockchain node that is part of the network. In this case, the user has to manually restart the blockchain node, which we absolutely want to avoid.
Running Lisk Core with node command
This issue made us aware of the difference between npm and node and made us reconsider the way we were running Lisk Core. Previously, we just accepted the npm start industry standard as the go-to way of running an application.
Later, we found the best practices provided by the docker-node GitHub repository dedicated to Dockerizing Node.js applications. Here, a clear warning message can be found about the usage of npm inside of a Dockerfile or any other higher-level application management tool like PM2.
“When creating an image, you can bypass the package.json’s start command and bake it directly into the image itself. First off this reduces the number of processes running inside of your container. Secondly, it causes exit signals such as SIGTERM and SIGINT to be received by the Node.js process instead of npm swallowing them.”
Whenever we tried to exit Lisk Core or the application crashed, a SIGINT signal was sent to the application. In Node.js, you can listen for this signal and execute a cleanup function in order to exit the application gracefully. In our case, we remove various listeners and pass the SIGINT signal on to the child process so that it exits gracefully as well.
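To make the idea of listening for a signal and passing it on more concrete, here is a minimal sketch. It is not the Lisk Core implementation: forwardSignals and the cleanup callback are hypothetical names, and the sketch only assumes a child object with a kill(signal) method, such as the ChildProcess returned by Node's child_process module.

```javascript
// Sketch: run a cleanup function and forward termination signals to a child.
// `forwardSignals` and `cleanup` are hypothetical names, not Lisk Core APIs.
function forwardSignals(child, cleanup, signals = ['SIGINT', 'SIGTERM']) {
  const registered = signals.map((signal) => {
    const handler = () => {
      cleanup();          // e.g. remove listeners, close sockets
      child.kill(signal); // pass the same signal on to the child process
    };
    process.once(signal, handler);
    return { signal, handler };
  });

  // Return a function that unregisters the handlers again.
  return () => {
    registered.forEach(({ signal, handler }) => {
      process.removeListener(signal, handler);
    });
  };
}
```

With a child created through child_process.fork(), a call such as forwardSignals(child, removeListeners) would run the cleanup and relay SIGINT or SIGTERM as soon as the parent receives it. This only works if the signal actually reaches the Node.js process.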
As stated by docker-node, npm swallows this signal and does not trigger our listeners for the SIGINT signal, leaving the application unable to clean up gracefully. That’s also the reason why the http_api module kept running inside of PM2.
Nick Parsons, an expert when it comes to running Node applications with PM2, also mentions that it is important to gracefully shut down your application in order to maximize robustness and enable fast startup (no downtime) when using PM2.
Termination signals: what are SIGKILL, SIGTERM and SIGINT?
We have to dive quite deep to find out what these signals are about. They belong to a collection of signals that tell a process to terminate. Many more exist, and they can be found in the documentation provided by gnu.org under section 24.2.2 Termination Signals.
- SIGKILL: “The SIGKILL signal is used to cause immediate program termination. It cannot be handled or ignored, and is therefore always fatal. It is also not possible to block this signal.”
- SIGTERM: “The SIGTERM signal is a generic signal used to cause program termination. Unlike SIGKILL, this signal can be blocked, handled, and ignored. It is the normal way to politely ask a program to terminate.” It is interesting to know that the shell command kill generates SIGTERM by default.
- SIGINT: “The SIGINT (‘program interrupt’) signal is sent when the user types the INTR character (normally C-c).” Developers will probably be more familiar with the CTRL/CMD+C command to interrupt a running process in the shell.
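If you want to inspect these signals from Node.js, the built-in os module exposes the numeric value the operating system assigns to each signal name. The helper signalNumber below is a hypothetical name used only for this small sketch.

```javascript
const os = require('os');

// Look up the numeric value the operating system assigns to a signal name.
function signalNumber(name) {
  return os.constants.signals[name];
}

for (const name of ['SIGINT', 'SIGTERM', 'SIGKILL']) {
  console.log(`${name} has signal number ${signalNumber(name)}`);
}
```

On Linux and macOS this reports 2 for SIGINT, 15 for SIGTERM and 9 for SIGKILL, which is why a forced kill is often written as kill -9.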
Moving Docker and PM2 to Node.
This made us decide to get rid of npm start and replace it with the node command. The start command was being used in both the Dockerfile and the PM2 run file.
The Dockerfile contains the typical ENTRYPOINT for Docker. Previously, this was ENTRYPOINT ["npm", "start"]. The file can now be found in our new Lisk Core repository, which was extracted from the Lisk-SDK Monorepo.
The same applies to the pm2-lisk.json file, which contains the PM2 configuration for starting Lisk Core. The
script property now contains the relative path to the
Learn how to reproduce the issue in 3 steps.
We can find a cool snippet created by GitHub user EvanTahler addressing the above-mentioned issue. Let’s reproduce this!
Step 1. Create package.json and app.js
To emulate this issue, you need to create two files (package.json and app.js) in the same directory. Make sure you have Node.js version 10.x or higher installed on your machine to run the snippet with the node command. As we don’t need any code dependencies, we don’t have to install anything else.
Snippet clarification: the snippet prints a dot every 0.5 seconds and listens for the SIGINT and SIGTERM signals. Once one of the two termination signals is received, it delays the shutdown by 5 seconds (5 * 1000ms) and then prints “bye!”.
Before running this snippet, I want to show you how a killed process is indicated in your terminal when hitting CTRL/CMD+C. You can notice it by the ^C characters.
Step 2. Run the snippet with node
Now that we know how SIGINT is represented in our terminal, let’s start the snippet with node app.js. Let it run for 5 seconds and hit CTRL/CMD+C. You will see that the kill signal is properly handled by Node, which waits 5 more seconds before shutting down.
Step 3. Run the snippet with npm start
However, when we run the snippet with npm start, you will notice two kill signals being received. As we now know, the start command runs node app.js as a child process. So, when receiving ^C, npm tries to exit its own process and passes the termination signal on to the child. This causes the problem: the main process exits, but the child stays active for 5 more seconds.
As explained before, this causes all sorts of problems when you try to listen for termination signals while running applications with npm start, especially when operating child processes.
Interested in learning how to set up and run your own Lisk node? More information can be found in the Lisk Core documentation on the website. You can choose the binary setup, which is the default (and simplest) installation technique. Other options include running Lisk Core with Docker to support other platforms or, for more advanced users, building Lisk Core from source.
Because of this “child process inception”, the http_api module could not exit gracefully and kept on running. The only way to stop this process is by using a shell command that kills all Node processes: sudo killall node (or by targeting the specific process ID to be killed). Luckily, this could be easily resolved by using node to start the application.
Best Practices for Handling Node.js Applications
Felix Geisendörfer, an early contributor of Node.js, makes it very clear how to handle crashed applications:
What does the above teach us? Avoid spinning up your application through npm start and use node instead. Also, if something goes wrong, exit the process gracefully and accept it. Felix recommends using higher-level tools like PM2 to deal with recovering and restarting the application.
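As a rough illustration of "exit gracefully and accept it", the sketch below logs an unexpected error, runs a best-effort cleanup and exits with a non-zero code, so that a supervisor such as PM2 can restart the process. The helper handleFatalError and its options are hypothetical names and are not taken from Lisk Core or from Felix's material.

```javascript
// Sketch: on an unexpected error, clean up, exit with a non-zero code and
// leave restarting to a supervisor such as PM2.
function handleFatalError(err, options = {}) {
  const {
    cleanup = () => {},                // e.g. release ports, stop children
    log = console.error,
    exit = (code) => process.exit(code),
  } = options;

  log('fatal error, shutting down:', err);
  try {
    cleanup(); // best effort only
  } finally {
    exit(1);   // a non-zero exit code signals the failure to the supervisor
  }
}

// Do not try to keep a broken process alive; hand over to the handler.
process.on('uncaughtException', (err) => handleFatalError(err));
```

The point of the design is that the process never tries to continue in an unknown state: it releases what it can and leaves recovery to the tool built for it.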
We learned from this that you should not always take standards for granted. It is sometimes better to keep things simple and run your application with a plain node command.
To conclude, at Lisk we decided to solve the issue by changing the npm start command to node src/index in both the PM2 run configuration and the Dockerfile. Now, upon a SIGINT signal, the node process receives it directly and can pass it on to its child processes, so every process can exit gracefully.
Therefore, PM2 can easily restart the application without any downtime. Running our application with this setup allows us to deploy a more stable application, which is extremely important for creating a stable blockchain network.
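For readers who keep their PM2 configuration in a JavaScript ecosystem file rather than JSON, a comparable setup could look roughly like the sketch below. This is not the contents of pm2-lisk.json: the app name, the src/index.js path and the timeout value are assumptions chosen for illustration.

```javascript
// ecosystem.config.js (hypothetical): let PM2 run the entry file with node
// directly instead of going through npm start.
const pm2Config = {
  apps: [
    {
      name: 'lisk-core',      // assumed app name
      script: 'src/index.js', // assumed relative path to the entry file
      kill_timeout: 10000,    // ms PM2 waits after its stop signal before SIGKILL
    },
  ],
};

module.exports = pm2Config;
```

Giving kill_timeout a value larger than the time your cleanup needs leaves the signal handlers room to finish before PM2 forces the process to stop.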