
Solving Fat JAR Woes at HubSpot


Spring Boot 1.4 and Dropwizard 1.0 were both released at the end of July, and both are based on fat JARs. As adoption of such frameworks and of microservices increases, fat JARs are becoming a more common deployment mechanism.

A fat JAR is a technique for packaging all dependencies of a Java application into a single bundle for execution, and it is used by many Java microservice frameworks, including Spring Boot and Dropwizard. There's even a Fat JAR Eclipse Plug-In.

For organizations with a few microservices, the bandwidth used by fat JARs is hardly noticeable. However, when you get into thousands of microservices, bandwidth usage can become an issue.

Earlier this summer, HubSpot cited issues with fat JARs as a deployment technique: problems with the maven-shade-plugin, and inefficiency when packaging 100,000 tiny files into a single JAR. They also mentioned heavy duplication of dependency JARs stemming from their 1,000-plus applications constantly building and deploying.

They experimented with the maven-dependency-plugin to reduce the bloat, but their efforts didn't reduce the size of the generated build artifacts.

To cure their fat JAR pain, HubSpot created SlimFast, a Maven plugin that creates a build artifact containing only the project's own classes. It piggybacks on the deploy phase and uploads each of the application's dependencies to Amazon Simple Storage Service (S3) individually. Using this plugin, HubSpot reportedly achieved 60% faster build times and a 99% reduction in artifact storage.
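The storage savings come from deduplication: because each dependency jar is uploaded as its own object, a library shared by hundreds of applications is stored once rather than being repackaged into every fat JAR. A minimal sketch of that idea in Python (not SlimFast's actual code — the in-memory dict stands in for an S3 bucket, and keying objects by content hash is an assumption of this sketch):

```python
import hashlib

# Stand-in for an S3 bucket: maps object key -> bytes.
bucket = {}

def upload_dependency(jar_bytes: bytes) -> str:
    """Upload a dependency jar, keyed by its SHA-256 digest.

    If another application already uploaded an identical jar,
    the existing object is reused instead of stored again.
    """
    key = hashlib.sha256(jar_bytes).hexdigest()
    if key not in bucket:          # pay the upload cost only once
        bucket[key] = jar_bytes
    return key

# Two applications deploy with the same copy of a shared library:
app_a_key = upload_dependency(b"guava-19.0.jar contents")
app_b_key = upload_dependency(b"guava-19.0.jar contents")

assert app_a_key == app_b_key      # same content, same object
assert len(bucket) == 1            # stored once, not twice
```

With this scheme, a deploy only needs to ship the thin application artifact plus a manifest of dependency keys; anything already in the bucket is skipped.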

The graph below shows how much faster builds have been for them after using SlimFast.

InfoQ: Are the fat JAR problems you experienced mostly due to continuous integration and deployment?

Jonathan Haber: Yes, I think the issues we ran into are largely caused by our style of development. We have lots of small teams pushing code, building, and deploying hundreds of times per day. Because of our small units of build, it would often take longer to create and upload the fat JAR than to actually compile and test the code. On the other hand, if you have a monolith that takes 20+ minutes to build, then the overhead of a fat JAR probably isn't very noticeable. But I think more companies are moving to this faster, lighter style of development and may run into the same challenges.

InfoQ: Do you think alternate packaging techniques like the one SlimFast provides should be native to frameworks like Spring Boot and Dropwizard?

Haber: Because this approach requires integration with the build and deploy system, my feeling is that it's too opinionated to include in something like Spring Boot or Dropwizard. However, one way to handle this would be to put the SlimFast plugin in a Maven profile activated by an environment variable. That way the build system could indicate that it supports this feature, otherwise it falls back to using a fat JAR.
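Maven supports exactly this kind of opt-in via profile activation on an environment variable. A configuration sketch (the profile id and variable name are illustrative placeholders, and the SlimFast plugin binding is left as a comment rather than guessed at):

```xml
<!-- pom.xml: activate thin-jar packaging only when the build system
     advertises support via an environment variable. -->
<profile>
  <id>slimfast</id>
  <activation>
    <property>
      <!-- Maven exposes environment variables as env.* properties -->
      <name>env.SLIMFAST_ENABLED</name>
      <value>true</value>
    </property>
  </activation>
  <build>
    <plugins>
      <!-- bind the SlimFast plugin here in place of maven-shade-plugin -->
    </plugins>
  </build>
</profile>
```

When the variable is absent, the default build runs unchanged and produces a fat JAR, which is the fallback behavior Haber describes.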

InfoQ: If cloud providers (e.g. Heroku, CloudFoundry, etc.) adopted a similar technique to reduce duplication of JARs among applications, could they save a lot of money on bandwidth?

Haber: I'm not sure what savings are achievable, but I think it would be possible to use a similar strategy. However, we have the advantage of all of our apps using the same versions for 3rd party libraries and having a huge amount of overlap in terms of libraries used. For cloud providers, their users will depend on a much wider array of libraries across all different versions, so if you wanted to cache dependencies on the application servers it would take up a huge amount of space. But if you didn't, you'd lose out on a lot of the speed/bandwidth savings. That's not to say there aren't savings to be had, I just think the implementation would need to be more sophisticated than our approach. Another issue you run into is that usually these cloud providers are just running Maven with the POM supplied by the user so they don't have much control over the build lifecycle to add these types of optimizations.

InfoQ: Are there any additional improvements you'd like to see in fat JAR applications?

Haber: I'm not sure if this is on the list for Java 9, but if Java could handle nested JARs it would make it a lot easier to build and run a fat JAR. Tools like Spring Boot and One-JAR do a good job of working around this limitation, but they add complexity and can never be completely transparent.



• Close, but no cigar

by Will Hartung /


They're doing this the wrong way.

Out of the box, maven bundles the pom in the META-INF of a Java jar.

Maven already has code to "run" an application based on its pom, standing up the class path before executing the prescribed main routine.

That code needs to "simply" be ported as a stand-alone class loader, invocable by the main method. It will essentially do what Maven does: it will reify the dependencies, use the classic local Maven repository, build up an internal class path based on those dependencies, and then fire off whatever method the developer desires.

MavenClassloader.run(MyAppClassWithMainMethod.class, args);

Worst case, when the jar starts up, it downloads the internet into a local repository.

But, that happens once, and rarely. As the jars are deployed over and over, the repository normalizes with the code and eventually there's little to do but start the code up.

You can stage pre-loaded repositories in containers that are brought up to date automatically by the jars as necessary.

Now, the jar you upload is wafer thin, and it leverages all of the tooling everyone is already using.
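The repository layout the commenter is relying on is mechanical: Maven stores every artifact at a path derived from its coordinates, so a thin launcher can compute the class path without shipping any jars. A sketch of that mapping in Python (illustrative only — a real launcher of the kind proposed would be written in Java and would also resolve transitive dependencies):

```python
import os

def local_repo_path(group_id: str, artifact_id: str, version: str,
                    repo_root: str = "~/.m2/repository") -> str:
    """Map Maven coordinates to the jar's path in the local repository.

    Maven's standard layout: groupId dots become directories, followed
    by artifactId/version/artifactId-version.jar.
    """
    group_dirs = group_id.replace(".", "/")
    jar_name = f"{artifact_id}-{version}.jar"
    return os.path.join(os.path.expanduser(repo_root),
                        group_dirs, artifact_id, version, jar_name)

def build_classpath(coordinates) -> str:
    """Join resolved jar paths into a Java class path string."""
    return os.pathsep.join(local_repo_path(*c) for c in coordinates)

deps = [("com.google.guava", "guava", "19.0"),
        ("org.slf4j", "slf4j-api", "1.7.21")]
print(build_classpath(deps))
```

Given a pre-populated local repository, the launcher only has to hand this class path to a `URLClassLoader` (or to `java -cp`) and invoke the application's main method.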

• Re: Close, but no cigar

by Bruce Wayne /


It's perhaps reasonable to want to avoid running an assembly (sort of) job in production, where "build once, deploy many" (same artefact) is the philosophy.

• Questions!

by Bruce Wayne /


What I fail to understand is the thousands-of-microservices bit: why are there so many microservices?
What is the bandwidth gain that's being talked about? If there is an artefact server mirroring some internet Maven repos within the same environment, that could potentially reduce bandwidth dramatically, couldn't it? Unless I've misunderstood the bandwidth concerns.
Looks like an invented problem to me!

• Re: Close, but no cigar

by Will Hartung /


Well then you're at an impasse, aren't you? The artifacts need to get to the machines somehow. So they either have to already be out there, or you have to drag them with you each time. It is straightforward to add a flag to the command that simply tells the process to satisfy its requirements before "running it for real", but for whatever reason, folks itch at that too.

This is no different than pre-staging all of the jars, setting the class path up, and firing off the code. But we did that back in 1999, and had to manage all of that. It was awful and brittle and nasty back then.

This is a natural extension to what Maven can already do, using tools and infrastructure everyone already has.

Of course, you don't have to do anything at all, you can simply run mvn directly using the exec plugin, but folks cry and whine about that as well.

• Re: Close, but no cigar

by Jonathan Haber /


I also have some more pragmatic concerns, such as Maven not using locking or atomic operations when interacting with the local repository; if we had concurrent deploys accessing the local repository, we would run into issues.

Also, we use snapshot versions extensively for our internal dependencies, and our Nexus instance is configured to only keep the latest snapshot to keep disk space under control. But to make sure our dependencies aren't shifting at deploy time, SlimFast always uses resolved snapshot versions. So if we were fetching artifacts via Maven at runtime, these resolved snapshots would quickly point at purged artifacts and the app would fail to start up (i.e., my app depends on artifact A snapshot version 1, artifact A builds again and publishes snapshot version 2 to Nexus which causes snapshot version 1 to get purged, and now my app can't start because it can't fetch artifact A snapshot version 1 from Nexus).

Additionally, for dependencies that come from 3rd party repositories, there are also concerns that those repositories could disappear at any time. This is acceptable when it causes a build-time failure, but if we fetched via Maven at runtime it could cause a serious outage.

• Re: Questions!

by Jonathan Haber /


Hey Bruce, thanks for the questions! We have lots of microservices mainly because of the breadth of our product and our team structure. The HubSpot product is extremely broad: our customers can manage their leads, segment them into lists, build automated marketing workflows, score them with predictive lead scoring, build their website, run their blog, send emails, publish to various social media networks, view analytics across all of these channels, and much much more. Each of these pieces of the product is owned by a different product team, and each product team owns the potentially dozens of microservices that power that piece of the product. For example on the email side, there's an API to send an email, there's an API to record email opens and clicks, there are jobs to handle bounces and spam reports, there's an API to fetch statistics about an email send for display to the customer (open percentage, click percentage, etc.), and many more. These aren't integrated into a single service because they have very different performance characteristics and reliability needs. For example, the click tracking service needs to respond within a few ms and emphasizes availability and eventual consistency because if it's down the email links won't work. This is very different from the API that sends emails, which may take a few seconds to complete all of the SMTP operations and favors consistency over availability to make sure we never double-send the same email.
