Superfast JVM Startup with CRaC — Understanding The Tech behind the AWS SnapStart

Samarendra Kandala
Published in Fission Labs
8 min read · Feb 24, 2023

The term cold start is often associated with serverless functions in general and with Java Lambda functions in particular. AWS recently announced the Lambda SnapStart feature for Java functions, which aims to largely eliminate the cold-start problem. That prompted me to do some digging, and that is when I came across the term CRaC: Coordinated Restore at Checkpoint.

Issues that Modern JVM Applications Face

  • JVM startup is slow compared to many other language runtimes, which can be an issue for short-lived workloads such as AWS Lambda functions. It takes a lot of time, as seen in the image below.
  • The JVM initially runs everything in bytecode-interpreted mode. Only after some profiling does the JIT compiler kick in and compile hot bytecode to native code. In the world of microservices, services need to scale elastically, yet every time you launch a new instance, all the profiling and JIT compilation starts from scratch, which is a waste of resources.

Now think about it: if we could save the state of the virtual machine, including the profiling information, class-loading information, and JIT-compiled native code, we could use that image to launch every new instance. This would mitigate the problems mentioned above:

  • It would avoid the long startup time that the JVM suffers from.
  • Because the profiling information and JIT-compiled code are reused, every new instance launched from that snapshot starts at the same optimized performance level.
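To make the warm-up cost concrete, here is a minimal sketch that times the same workload in successive batches. The class and method names are hypothetical and not from any library. On a typical HotSpot JVM the first batches tend to be slower because they run interpreted before the JIT compiles the hot method, but the exact numbers depend on the machine and JVM, so no output is shown.

```java
// Hypothetical illustration: times identical batches of work to make JIT warm-up visible.
final class WarmupDemo {

    // A small, deterministic workload: the sum of squares of 0..n-1.
    static long sumOfSquares(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += (long) i * i;
        }
        return sum;
    }

    // Runs the workload `calls` times and returns the elapsed time in nanoseconds.
    static long timeBatch(int calls, int n) {
        long start = System.nanoTime();
        long sink = 0;
        for (int c = 0; c < calls; c++) {
            sink += sumOfSquares(n);
        }
        long elapsed = System.nanoTime() - start;
        // Use the result so the loop is not trivially dead code.
        if (sink == 42) {
            System.out.println("unexpected sink value");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Each batch does the same work; only the JVM's warm-up state differs.
        for (int batch = 1; batch <= 5; batch++) {
            long nanos = timeBatch(2_000, 1_000);
            System.out.println("batch " + batch + ": " + nanos / 1_000 + " us");
        }
    }
}
```

A process restored from a checkpoint taken after the later batches would resume with the compiled code already in place, which is the effect the bullets above describe.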

Coordinated Restore at Checkpoint (CRaC)

This is where the CRaC specification comes into the picture. It was first proposed by Azul engineer Anton Kozlov. Checkpointing a Java application saves the state of the virtual machine, including profiling information and JIT-compiled native code.

It uses the CRIU (Checkpoint/Restore In Userspace) project on Linux to checkpoint the running Java application, so CRaC is only supported on Linux as of now. AWS runs Lambda on Firecracker microVMs, which are also Linux based, and SnapStart applies the same checkpoint-and-restore idea there by snapshotting the initialized microVM.

You can learn more about CRaC in the video here.

Checkpoint Creation

Creating a checkpoint requires your application to release its external resources, such as database connections, sockets, and open files. Otherwise, the checkpoint may fail or the restored application may throw exceptions.

For example, say a DB connection is captured in the snapshot. When the application is restored and reuses that connection, it will run into issues because the connection may have timed out or the DB may already have closed it.

The application must close all file descriptors and sockets to be checkpointed successfully, and it is responsible for validating or reopening those files and sockets after restoration.

A special interface, Resource, makes the application aware that it is being checkpointed and restored. Every class that needs these events must implement the Resource interface.

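import jdk.crac.Context;
import jdk.crac.Core;
import jdk.crac.Resource;

public class NamedResource implements Resource {

    public NamedResource() {
        // Register with the global context so this instance receives checkpoint/restore events.
        Core.getGlobalContext().register(this);
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        // Called just before the checkpoint is taken; release resources here.
        System.out.println("Before Checkpointing");
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        // Called right after the process is restored; reopen resources here.
        System.out.println("After Checkpointing");
    }
}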

First of all, you need to implement the Resource interface from the jdk.crac package.

This interface provides two methods:

  • beforeCheckpoint(): called before the checkpoint is taken.
  • afterRestore(): called after the application is restored.

To make it work, you also need to register your class with the global context by calling Core.getGlobalContext().register() in the constructor, as shown above.
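The close-then-reopen lifecycle described above can be sketched without a CRaC build. To keep the example runnable on any JDK, it declares tiny stand-in Resource and Context types that only mimic the shape of the jdk.crac ones. These stand-ins, the ConnectionHolder class, and the simulateCheckpointAndRestore method are all assumptions made for illustration. With a CRaC JDK you would import the real types and let the runtime drive the callbacks.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in mimicking the shape of jdk.crac.Resource (assumption, for plain JDKs only).
interface Resource {
    void beforeCheckpoint(Context<? extends Resource> context) throws Exception;
    void afterRestore(Context<? extends Resource> context) throws Exception;
}

// Stand-in for a CRaC context: keeps registered resources and notifies them.
final class Context<R extends Resource> {
    private final List<R> resources = new ArrayList<>();

    void register(R resource) {
        resources.add(resource);
    }

    // Simulates the notifications a runtime would send around a checkpoint and a restore.
    void simulateCheckpointAndRestore() {
        try {
            for (R resource : resources) {
                resource.beforeCheckpoint(this);
            }
            for (R resource : resources) {
                resource.afterRestore(this);
            }
        } catch (Exception e) {
            throw new IllegalStateException("resource hook failed", e);
        }
    }
}

// Hypothetical resource holding a "connection" that must not survive a checkpoint.
final class ConnectionHolder implements Resource {
    private boolean open;
    private int timesOpened;

    ConnectionHolder(Context<Resource> context) {
        context.register(this);
        openConnection();
    }

    private void openConnection() {
        open = true;
        timesOpened++;
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) {
        open = false; // release the connection so nothing stale is captured
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) {
        openConnection(); // the old connection would be stale, so open a fresh one
    }

    boolean isOpen() {
        return open;
    }

    int timesOpened() {
        return timesOpened;
    }
}

final class LifecycleDemo {
    public static void main(String[] args) {
        Context<Resource> context = new Context<>();
        ConnectionHolder holder = new ConnectionHolder(context);
        context.simulateCheckpointAndRestore();
        System.out.println("open=" + holder.isOpen() + ", opened " + holder.timesOpened() + " times");
    }
}
```

The point of the design is that the holder never hands out a connection captured before the checkpoint: it is closed in beforeCheckpoint and a new one is created in afterRestore.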

CRaC is not part of mainline OpenJDK yet, but it is available in a fork with prebuilt binaries at https://github.com/CRaC/openjdk-builds/releases. You can download this Java build to test the feature.
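Since the jdk.crac package only exists in CRaC-enabled builds, it can help to check which kind of JDK you are actually running. The following is a minimal sketch with a hypothetical class name, CracProbe. It only tries to load the jdk.crac.Core class reflectively and reports the result, so it compiles and runs on a standard JDK as well.

```java
// Hypothetical helper: reports whether the running JDK exposes the jdk.crac API.
final class CracProbe {

    // Returns true when jdk.crac.Core can be loaded, false otherwise.
    static boolean isCracAvailable() {
        try {
            Class.forName("jdk.crac.Core");
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Builds a one-line description of the running JVM and its CRaC support.
    static String describe() {
        String vm = System.getProperty("java.vm.name") + " " + System.getProperty("java.version");
        return vm + (isCracAvailable() ? " (jdk.crac present)" : " (jdk.crac not found)");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```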

Demo Application

Let's create a class NamedResource that implements the Resource interface of the jdk.crac package, with a method printNumber that prints the numbers from 0 to 99999, waiting one second after each.

import jdk.crac.Context;
import jdk.crac.Core;
import jdk.crac.Resource;

public class NamedResource implements Resource {

    public NamedResource() {
        // Register with the global context so this instance receives checkpoint/restore events.
        Core.getGlobalContext().register(this);
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        System.out.println("Before Checkpointing");
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        System.out.println("After Checkpointing");
    }

    // Prints 0..99999, one number per second, so there is running state to checkpoint.
    public void printNumber() throws InterruptedException {
        for (int i = 0; i < 100000; i++) {
            System.out.println(i);
            Thread.sleep(1000);
        }
    }
}

public class Executor {
    public static void main(String[] args) throws InterruptedException {
        NamedResource r = new NamedResource();
        r.printNumber();
    }
}

When the checkpoint is created, the saved application state includes the call stack of the running printNumber method, so after a restore the loop continues from where it left off instead of starting again from 0.

Demo Using Docker

A Docker container is used for this demonstration because CRaC works only on Linux. I have built a JAR file from the classes above and added it to the Docker image.

Create the docker image

FROM ubuntu:20.04
ENV JAVA_HOME /opt/jdk
ENV PATH $JAVA_HOME/bin:$PATH
RUN apt-get update -y
ADD "https://github.com/CRaC/openjdk-builds/releases/download/17-crac%2B3/openjdk-17-crac+3_linux-x64.tar.gz" $JAVA_HOME/openjdk.tar.gz
RUN tar --extract --file $JAVA_HOME/openjdk.tar.gz --directory "$JAVA_HOME" --strip-components 1; rm $JAVA_HOME/openjdk.tar.gz;
RUN mkdir -p /opt/crac-files
COPY ./build/libs/crac4.jar /opt/app/crac4-17.0.0.jar
CMD ["/bin/bash"]

I have built the JAR file from the code and copied it into the image under /opt/app. If your JAR is in a different location, update the COPY line in the Dockerfile accordingly.

sudo docker build -t crac4 .

Start your application in a docker container

  • Execute the below command to start a container from the image.
sudo docker run -it --privileged --name crac4_demo crac4
  • Once you are in, navigate to the folder that contains the JAR file and execute it. In the command below, the flag -XX:CRaCCheckpointTo specifies the location where the checkpointed state is saved.
cd /opt/app/
java -XX:CRaCCheckpointTo=/opt/crac-files -jar crac4-17.0.0.jar
  • In terminal 1, you should see the output below if you are using the same logic as above.
Terminal 1 output

Checkpoint Creation

  • Then log in to the same container from a separate terminal; let's call it terminal 2.
sudo docker exec -it crac4_demo /bin/bash
  • To checkpoint the Java application from terminal 2, you can identify it either by its PID or by its name. To get the PID, run the below command.
ps -ef | grep java
  • Then trigger the checkpoint with jcmd, using either the PID (115 in this run)

jcmd 115 JDK.checkpoint

OR the JAR name

jcmd crac4-17.0.0.jar JDK.checkpoint
Terminal 2 output
  • Terminal 1, where the application is printing numbers, should then show something like the image below.
Terminal 1 output after the checkpoint
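Besides jcmd, the jdk.crac API also offers Core.checkpointRestore(), which lets the application request its own checkpoint. The sketch below calls it reflectively so that it still compiles and runs on a standard JDK, where it simply reports that the API is missing. The class name SelfCheckpoint is hypothetical, and the sketch assumes the JVM was started with -XX:CRaCCheckpointTo when run on a CRaC build.

```java
import java.lang.reflect.InvocationTargetException;

// Hypothetical helper: asks the runtime for a checkpoint from inside the application.
final class SelfCheckpoint {

    // Tries to call jdk.crac.Core.checkpointRestore() reflectively.
    // Returns false when the API is missing or the checkpoint request fails.
    static boolean tryCheckpoint() {
        try {
            Class<?> core = Class.forName("jdk.crac.Core");
            core.getMethod("checkpointRestore").invoke(null);
            return true; // the call returned normally
        } catch (ClassNotFoundException e) {
            System.out.println("jdk.crac is not available on this JDK");
        } catch (InvocationTargetException e) {
            System.out.println("Checkpoint failed: " + e.getCause());
        } catch (ReflectiveOperationException e) {
            System.out.println("Could not invoke checkpointRestore: " + e);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("checkpoint requested successfully: " + tryCheckpoint());
    }
}
```

In an application that compiles directly against a CRaC JDK, the reflective lookup would not be needed and the call could be written as Core.checkpointRestore().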

The application running in terminal 1 exits with a message like the one in the above image. The print statement in the beforeCheckpoint method is executed, which means the application receives the trigger to clean up its resources, e.g. closing the DB connections in a connection pool. The /opt/crac-files folder now contains the snapshot files.

@Override
public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
    System.out.println("Before Checkpointing");
}

Checkpoint Restoring

To restore, run the below command in either terminal 1 or 2. Once the application is restored from the snapshot, it continues printing from where it was checkpointed (68 in this run) instead of starting from 0.

ps -ef | grep java

java -XX:CRaCRestoreFrom=/opt/crac-files
Terminal 1 output

The print statement in the afterRestore method is executed, which means the application receives the restoration trigger. This gives the application an opportunity to validate or recreate any required resources and so avoid exceptions, e.g. validating or recreating connections in the DB connection pool.

@Override
public void afterRestore(Context<? extends Resource> context) throws Exception {
    System.out.println("After Checkpointing");
}

Commit the current state of the docker container

If you want to reuse this application snapshot at a later point in time, you can commit the docker container to an image and reuse it later.

  • Exit from the container in terminal 2 using the exit command. To get the CONTAINER_ID, run the first command below in terminal 2. Then commit the container to an image and tag it.
sudo docker ps
sudo docker commit CONTAINER_ID crac4:checkpoint
Terminal 2 output

As we have committed the container, we can exit from the docker container in terminal 1 as well.

Run the docker container from the checkpoint

You can use this committed Docker image later to start the application from the checkpointed state. Whenever you want to start the application again, just run the committed image and start the application with the restore command instead of starting it from the JAR.

  • Run the committed image. Once you are in that container, run the below commands to restore the application. This works the same way as described in the Checkpoint Restoring section above.
sudo docker run -it --privileged --name crac4_demo1 crac4:checkpoint

// In the container, run the below commands

ps -ef | grep java

java -XX:CRaCRestoreFrom=/opt/crac-files
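Before issuing the restore command in a fresh container, it can be useful to confirm that the checkpoint directory survived the commit and is not empty. The following is a minimal sketch. The class name CheckpointDirCheck is hypothetical, and the default path is simply the /opt/crac-files directory used in this walkthrough. It only checks that the directory has some content and does not validate the snapshot itself.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical pre-flight check for a checkpoint directory.
final class CheckpointDirCheck {

    // Returns true when `dir` exists, is a directory, and contains at least one entry.
    static boolean looksRestorable(Path dir) {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.findAny().isPresent();
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The directory can be passed as an argument; the default matches the demo above.
        Path dir = Path.of(args.length > 0 ? args[0] : "/opt/crac-files");
        if (looksRestorable(dir)) {
            System.out.println("Checkpoint files found, try: java -XX:CRaCRestoreFrom=" + dir);
        } else {
            System.out.println("No checkpoint files in " + dir + "; create a checkpoint first.");
        }
    }
}
```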

Check out this GitHub repository for the detailed code and Dockerfile. If you face any issues running this on your local machine, please raise a GitHub issue and we will check and resolve it.

AWS Lambda

AWS Lambda uses a similar technique for the SnapStart feature. It runs on Firecracker microVMs, which are Linux based and can be snapshotted and restored.

Whenever a Java function's zip file is uploaded and a version is created, AWS starts the Lambda function internally up to the point of the actual handler code and then takes the checkpoint. It stores the snapshot, and when a new request comes in, it just restores the application from the snapshot. As the snapshot already contains everything needed to run, the function executes without paying the class-loading and initialization time again, which reduces startup time significantly.

You can check more on this in an AWS blog post here.
