Key Takeaways
- Fast startup time is essential to the success of your cloud-native strategy.
- While there are different solutions in the JVM space to improve startup time, InstantOn is the only one that provides fast startup without compromising your applications.
- InstantOn is based on checkpoint/restore technology, which has some challenges with regard to the checkpoint and restore environment. InstantOn addresses these challenges, offering a seamless experience for developers deploying new and existing applications.
- InstantOn integrates with container-based technologies.
A shift to cloud-native computing has been on the minds of many developers in recent years so that their business applications can benefit from reduced IT infrastructure costs, increased scalability, and more. Scale-to-zero is the standard provisioning policy when it comes to deploying applications on the cloud in order to save costs when demand is low. As demand increases, more instances of the application, and runtime, are provisioned; this scaling out must happen very quickly so that end users don’t experience a lag in response times. The startup time of your runtime can play a big part in scale-out performance.
Open Liberty is a cloud-native Java runtime and, like other Java runtimes, is built on JVM technology. The performance, debugging capabilities, and class libraries that the JVM (more broadly, the whole JDK) offers make it a compelling technology to base your applications on. Although JVMs are known for excellent throughput, their startup time lags behind statically compiled languages like Go and C++. Given the requirements of scale-to-zero, significantly improving startup time has been a key area of innovation for all JVM implementations for many years. Metadata caching techniques such as AppCDS (HotSpot) and Shared Classes Cache (Eclipse OpenJ9) have shown impressive improvements in startup time but don’t achieve the order-of-magnitude start time reduction required in scale-out scenarios such as serverless computing.
Compiling to a native image to reduce startup time
Graal Native Image gained attention when it was announced that it can achieve sub-100ms startup times with its compile-to-native approach. This was a significant shift in the JVM landscape because, for the first time, Java applications were competing with C++ in startup time. While Graal Native Image significantly reduced startup time, this came with several trade-offs.
Firstly, static compilation requires a global view of the application at build-time. This imposes some limitations on the use of dynamic capabilities that developers building applications on the JVM have traditionally relied on. For example, operations such as reflection, dynamic class loading, and invokedynamic need special treatment because they interfere with the requirements of the static analysis needed to produce a native image. This, in turn, means you might need to modify your applications significantly for the native image to work and, worse, your dependencies might also need updating.
Secondly, debugging becomes challenging because you are no longer debugging a JVM application but rather a native executable. You’ll need to trade your familiar Java debugger for a native debugger like gdb to investigate issues. One way to work around this is by using a JVM in development and a native image in production. However, this means that your production environment will not match your development environment, and you could end up having to fix bugs on two different runtimes!
Finally, one of the great things that JVMs offer is excellent throughput, with just-in-time compilers optimizing the application at runtime based on live data to achieve optimal performance. This, too, must be sacrificed in a native image because developers get only one shot at compilation: at build time. Several frameworks, such as Spring Native, have built up capabilities to help Java developers work within the native image constraints, but there is no getting away from the fact that the developer does have to give up something to obtain the startup time benefits of native image.
Skipping startup with checkpoint/restore
The Liberty runtime has taken a different approach to improve startup. Liberty aims to offer fast startup without compromise with a feature called Liberty InstantOn. This feature offers all the capabilities with which Java developers are familiar while improving runtime startup times by up to 10 times in comparison to JVM runs without InstantOn.
Liberty InstantOn is fundamentally based on checkpoint/restore technology. You start an application, then pause it and persist the state — checkpoint — at some well-defined points offered by Liberty. This checkpoint then becomes your application image, and when you deploy your application, you just resume the image from the saved state — restore — so that the application skips the startup and initialization process that it would normally go through (as those steps have already run).
Liberty uses OpenJ9 CRIU support, a technology based on Linux CRIU which enables any application to be checkpointed and resumed. Because you are still running with a JVM in the Liberty InstantOn approach, there is no loss to throughput performance. Java debugging works as expected, and all the libraries that depend on dynamic JVM capabilities will also work.
Resolving the limitations of checkpoint/restore
While the concept of checkpoint/restore sounds very simple, in reality, there are some constraints (arising out of how CRIU works) that need to be addressed by the runtime and JVM working together for an application to experience these benefits. When a checkpoint is taken (when building the image), CRIU takes the environment and “freezes” it within the checkpointed state: environment variables, knowledge of computing resources (CPU, memory), and time itself are all baked into the image. Any of those things can be different in the restored environment, causing inconsistencies in the application that can be difficult to track. Additionally, there can be some data captured in the checkpoint image that would not be ideal if, for example, images are to be shipped across public networks via container registries to deployment environments. This data can include external connections to endpoints that do not exist in the restore environment and security tokens that you don’t want to embed in the checkpoint image.
For these reasons, OpenJ9 CRIU support has built-in compensations to ensure that a checkpointed application behaves correctly and safely when restored. Time-sensitive APIs are modified to compensate for the downtime between the checkpoint and restore. Random APIs like SecureRandom are re-seeded upon restore to ensure that each time a checkpoint is restored, it is restored as a unique instance.
The JVM can address the things it knows about, but there might be application code that needs similar treatment. The Liberty runtime helps to shield developers from the complexities of checkpoint/restore by working with the JVM to address any remaining issues that the JVM cannot deal with on its own. To facilitate this, OpenJ9 offers a hook mechanism that developers can use to register the methods that will run before and after a checkpoint. This mechanism is used by Liberty extensively, for example, to re-parse the configuration at deployment time to ensure that the correct configuration is used for the environment.
So, while OpenJ9 offers the tools to leverage checkpoint/restore technology effectively, the straightforward way to enhance the startup time of an existing application is to run it on Liberty with Liberty InstantOn. Liberty InstantOn abstracts the checkpoint/restore process, simplifying the developer's choices to only a few, such as determining whether a checkpoint should be before or after the application starts.
Ultimately, the end goal is to improve the cloud-native experience of Java applications, which means that whatever technology you use must work effectively in a cloud environment. Liberty InstantOn integrates seamlessly with container engines like Docker and Podman. It also works with Kubernetes-based platforms like Knative and OpenShift Container Platform. We have done work to ensure that Liberty InstantOn runs in unprivileged modes because this is essential for the security of production environments. This work is being contributed back to the CRIU project.
Trying out Liberty InstantOn with your own app
Liberty InstantOn is publicly available as a beta, and developers can try it with their existing applications to see the improvements (up to 10 times faster) in startup time. You just need to create a container image of your application using the Liberty InstantOn tools. Open Liberty publishes production-ready container images that make it easy to containerize your applications to run in a container engine such as Docker or Podman, or in Kubernetes environments like Red Hat OpenShift.
The Open Liberty container images contain all the necessary dependencies for running an application with the Open Liberty runtime. The following instructions for developers describe how to create a base application container image with their application on top of the provided Open Liberty beta-instanton image (icr.io/appcafe/open-liberty:beta-instanton) and then how to create and add a layer on top that contains the checkpoint process state. The beta-instanton image contains all the prerequisites needed to checkpoint an Open Liberty process and store that checkpoint process in a container image layer. This includes an early access build of OpenJ9 CRIU support and Linux CRIU.
How to containerize your app to start faster with Liberty InstantOn
The following instructions use Podman to build and run the application container and use the application from the Open Liberty getting started guide. Developers can substitute their own application if they have one to hand.
The completed getting started application contains a Dockerfile that looks like this:
FROM icr.io/appcafe/open-liberty:full-java11-openj9-ubi
ARG VERSION=1.0
ARG REVISION=SNAPSHOT
COPY --chown=1001:0 src/main/liberty/config/ /config/
COPY --chown=1001:0 target/*.war /config/apps/
RUN configure.sh
First, the developer needs to update the FROM instruction to utilize the beta-instanton image:
FROM icr.io/appcafe/open-liberty:beta-instanton
After that, the application container image can be built with the updated Dockerfile using the following command:
podman build -t getting-started .
The command creates the application container image, but no checkpoint process has been created yet. The checkpoint process for the application is created by running the application container image with some additional options using the following command:
podman run \
--name getting-started-checkpoint-container \
--privileged \
--env WLP_CHECKPOINT=applications \
getting-started
Setting the WLP_CHECKPOINT variable to applications specifies that the Open Liberty runtime checkpoints the application process after the configured applications have been started but before any ports are opened to take incoming requests for the applications. When the application process checkpoint has been completed, the running container stops. This results in a stopped container that contains the checkpoint process state.
The final step is to layer this checkpoint process state on top of the original application process image. This is done by committing the stopped application container called getting-started-checkpoint-container to a new container image with the following command:
podman commit \
getting-started-checkpoint-container \
getting-started-instanton
The final result is the getting-started-instanton container image, ready to run.
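The three steps above (build, checkpoint, commit) can be collected into one script. The following is a minimal sketch rather than an official tool: the function names, the DRY_RUN switch, and the way the container and image names are derived from the application image name are assumptions made for illustration, while the podman commands themselves are the ones shown above.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the build, checkpoint, and commit commands from this article.
# run_cmd, build_instanton_image, and DRY_RUN are illustrative names, not part of
# Open Liberty or Podman.

# Print the command when DRY_RUN=1; otherwise execute it.
run_cmd() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

build_instanton_image() {
  local app_image="${1:-getting-started}"
  local checkpoint_container="${app_image}-checkpoint-container"
  local instanton_image="${app_image}-instanton"

  # Step 1: build the application image from the Dockerfile in the current directory.
  run_cmd podman build -t "$app_image" . || return 1

  # Step 2: run the image in a privileged container so that the checkpoint can be taken.
  # The container stops by itself once the checkpoint is complete. Its exit status is
  # deliberately not checked here, because the article does not say what it is.
  run_cmd podman run \
    --name "$checkpoint_container" \
    --privileged \
    --env WLP_CHECKPOINT=applications \
    "$app_image" || true

  # Step 3: commit the stopped container, including the checkpoint state, to a new image.
  run_cmd podman commit "$checkpoint_container" "$instanton_image" || return 1
}
```

Calling the function with DRY_RUN=1 set prints the three commands without executing them, which is a cheap way to review them before a real run such as build_instanton_image getting-started.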
Running the container with privileged Linux capabilities
When running the getting-started-instanton container, developers must grant it a set of Linux capabilities so that the CRIU binary in the container image can perform the restore process:
cap_checkpoint_restore
cap_net_admin
cap_sys_ptrace
When you created the checkpoint process, a privileged container was used, which granted the CRIU binary in the container image the Linux capabilities required.
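Of the three capabilities listed above, CAP_CHECKPOINT_RESTORE is the most recent addition to Linux: it was introduced in the mainline kernel in version 5.9. As a hedged sketch, the check below compares a kernel release string against that minimum. The function name is an assumption for illustration, and the result is a hint rather than a verdict, because distribution kernels may backport the capability and the check says nothing about whether the installed container engine recognizes it.

```shell
#!/usr/bin/env bash
# Sketch only: kernel_has_checkpoint_restore is an illustrative name.
# CAP_CHECKPOINT_RESTORE was added to the mainline Linux kernel in version 5.9.

kernel_has_checkpoint_restore() {
  # Use the release string passed as $1, or the running kernel's release by default.
  local release="${1:-$(uname -r)}"
  local major="${release%%.*}"      # text before the first dot, e.g. "5"
  local rest="${release#*.}"        # text after the first dot, e.g. "14.0-284"
  local minor="${rest%%[!0-9]*}"    # leading digits of that, e.g. "14"

  # Treat an unparsable major version as unsupported; default a missing minor to 0.
  case "$major" in ''|*[!0-9]*) return 1 ;; esac
  case "$minor" in '') minor=0 ;; esac

  # Supported when the version is 5.9 or later.
  [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 9 ]; }
}
```

A restore host could run kernel_has_checkpoint_restore with no argument before starting the container and print a warning when the function returns a non-zero status.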
Run the following Podman command to run the container with the three required capabilities:
podman run \
--rm \
--cap-add=CHECKPOINT_RESTORE \
--cap-add=NET_ADMIN \
--cap-add=SYS_PTRACE \
-p 9080:9080 \
getting-started-instanton
The getting-started-instanton container runs with the necessary privileges to perform the restore process, and the application starts up to 10 times faster than the original getting-started application.
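One way to check the startup improvement on your own hardware is to compare the startup time that each container reports in its log. The sketch below assumes that the Liberty startup message (CWWKF0011I, which ends with "server started in N seconds") appears in the container output. The function names are illustrative, and the log format is an assumption that may differ between Liberty versions.

```shell
#!/usr/bin/env bash
# Sketch only: startup_seconds and speedup are illustrative names.
# Assumes the log contains a line ending in "server started in N seconds."

# Read a server log on stdin and print the last reported startup time in seconds.
startup_seconds() {
  sed -n 's/.*server started in \([0-9][0-9.]*\) seconds.*/\1/p' | tail -n 1
}

# Print how many times larger the first startup time is than the second.
# Fails with a non-zero status if the second value is missing or not positive.
speedup() {
  awk -v baseline="$1" -v instanton="$2" \
    'BEGIN { if (instanton > 0) printf "%.1f\n", baseline / instanton; else exit 1 }'
}
```

For example, the output of podman logs for a running container can be piped into startup_seconds. Collecting that value once for the original getting-started image and once for getting-started-instanton, then passing both values to speedup, gives the ratio for that machine.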
Future Improvements
The Open Liberty beta releases publish regular updates to Liberty InstantOn. Some improvements are planned for future releases to make the process of building and running an application image with Liberty InstantOn easier. For example, additional work has been done to remove the need for the NET_ADMIN Linux capability. There is also a plan to remove the requirement for SYS_PTRACE when restoring the application process. This would reduce the required capability list to only the CHECKPOINT_RESTORE capability when running the application.
Other plans include performing the application process checkpoint during the application container build step without requiring a container run and container commit command to store the application process state into an application container image layer.
Let us know what you think
While cloud native requires many changes to how organizations approach their businesses, with Liberty InstantOn, developers won’t have to worry about altering their application development approach.
Developers are encouraged to try Liberty InstantOn in beta using Open Liberty 22.0.0.11-beta or a later version. Feedback is welcome and can be shared through the project's mailing list. Developers who encounter an issue can post a question on Stack Overflow, and those who discover a bug are welcome to raise an issue.
Background notes
Open Liberty and Eclipse OpenJ9 are open-source projects. IBM builds its commercial WebSphere Liberty Java runtime and IBM Semeru Runtimes Java distributions from these projects. Liberty InstantOn uses the checkpoint/restore technology made available by the Linux Checkpoint/Restore In Userspace (CRIU) project, and its developers contribute code back to that project.