
Javier Soltero Discusses Management and Monitoring of Complex Java Applications


1. What are some of the challenges faced by an operations team managing a complex Java application?

There are a lot. The first one is that operations teams generally don't know a lot about the internals of any of the apps that they tend to support, or even if they do, there are, especially at scale, so many apps that the challenge of becoming an expert in managing potentially dozens if not hundreds of apps becomes quite daunting. Part of it starts from the fact that generally operations teams have a very OS-centric view of the world and they say, "Well, OK, there is an operating system and a bunch of processes and I need to make sure these things are still running and I need to make sure the web servers are responding at a certain speed. And if I get those two things right, then I'm in business." But, of course, Java applications, especially complicated ones, have multiple layers of services and other places where things can go wrong, frankly. It comes down to this: if you don't know enough about the application architecture as an operator and you don't know enough about the overall behavioral characteristics, or you don't have tools that will help you answer either of those two questions, you're going to have a very difficult time planning for scale, addressing problems before they become fires, etc.

   

2. What are some of the tools that Hyperic offers and what features do they provide?

Hyperic HQ is essentially built as a management platform, a product designed to provide all the features, all the functionality that an operations person supporting a complex web application, either a Java one or a non-Java one, would need. In that sense it's actually kind of unique. People know us a lot because one of the things we do is monitoring, but in reality we do quite a few others. The first one is what we call inventory auto-discovery. There is kind of a general rule that I always talk about, which is basically that monitoring and management tools cannot manage things they don't know about. You need a strong sense of inventory and then an automatic way in which to populate that, so that you have the best possible reflection at any given point in time of the infrastructure that you're being tasked to manage.

That involves going not only from the physical layer of saying "This many servers" or "These many CPUs" or "These many operating systems" but all the way up the stack into "this is Apache and it has 200 vhosts [virtual hosts]," or "this is a Tomcat server and it has 30 web apps on it" or whatever, and providing an inventory model that not only is automatically populated but is also expressed in a manner that is simple enough for your average operations person to be able to manage at scale.
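To make the layered inventory idea concrete, here is a minimal sketch of a resource tree that runs from a host down to the web apps on a Tomcat server. It is only an illustration under stated assumptions: `InventorySketch`, `Resource`, `exampleHost` and the resource names are hypothetical and are not taken from Hyperic's actual inventory model.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical inventory tree; the class and field names are illustrative
// assumptions, not Hyperic's actual model.
final class InventorySketch {

    static final class Resource {
        final String kind;   // e.g. "host", "tomcat", "webapp"
        final String name;
        final List<Resource> children = new ArrayList<>();

        Resource(String kind, String name) {
            this.kind = kind;
            this.name = name;
        }

        // Attach a child and return this node so calls can be chained.
        Resource add(Resource child) {
            children.add(child);
            return this;
        }

        // Count resources of the given kind in this subtree, including this node.
        int count(String wantedKind) {
            int total = kind.equals(wantedKind) ? 1 : 0;
            for (Resource child : children) {
                total += child.count(wantedKind);
            }
            return total;
        }
    }

    // Build a tiny example: one host running one Tomcat server
    // with the given number of web apps.
    static Resource exampleHost(int webApps) {
        Resource tomcat = new Resource("tomcat", "tomcat-1");
        for (int i = 1; i <= webApps; i++) {
            tomcat.add(new Resource("webapp", "app-" + i));
        }
        return new Resource("host", "server-1").add(tomcat);
    }
}
```

Calling `InventorySketch.exampleHost(30).count("webapp")` walks the tree and answers the "this Tomcat server has 30 web apps" kind of question from the top-level host.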

The second feature category is monitoring, which to us means the sort of art and science of collecting any kind of performance or utilization or capacity -- any number, any metric that is exposed by any of the things that we have within our inventory that programmatically can be retrieved from that product, whether it's a JMX-based operation where you retrieve a statistic or it's an SQL query to pull out a counter or whatever.

You store the metric and you store the time stamp with the associated metric and you essentially – in the case of Hyperic – not only use those for real time analysis, but also for historical perspectives. Third is what we call log and configuration tracking. In addition to being able to measure performance and collect all these numeric values for metrics, you need to complement that with alphanumeric information that gets submitted in some cases in a similar fashion either through notifications like in JMX or log files that you write to, etc.
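As a rough illustration of the JMX case mentioned above (and not Hyperic's actual collector), the sketch below reads the used-heap figure from the JVM's standard `java.lang:type=Memory` MBean and stores it together with a timestamp. The class `HeapSampler`, the `Sample` type and the metric label `heap.used.bytes` are hypothetical names chosen for this example.

```java
import java.lang.management.ManagementFactory;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;

// Hypothetical collector sketch; only the JMX calls and the
// java.lang:type=Memory MBean are standard JDK facilities.
final class HeapSampler {

    // One stored data point: the metric label, its value, and when it was read.
    static final class Sample {
        final String metric;
        final long value;
        final long timestampMillis;

        Sample(String metric, long value, long timestampMillis) {
            this.metric = metric;
            this.value = value;
            this.timestampMillis = timestampMillis;
        }
    }

    // Read the "used" field of the HeapMemoryUsage attribute from the
    // platform MBean server and pair it with the current time.
    static Sample sampleHeapUsed() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName memory = new ObjectName("java.lang:type=Memory");
            CompositeData usage =
                    (CompositeData) server.getAttribute(memory, "HeapMemoryUsage");
            long usedBytes = (Long) usage.get("used");
            return new Sample("heap.used.bytes", usedBytes, System.currentTimeMillis());
        } catch (JMException e) {
            throw new IllegalStateException("Could not read heap usage over JMX", e);
        }
    }
}
```

A real collector would call something like `sampleHeapUsed()` at a configured interval and append each `Sample` to a store, which is what makes both real-time and historical views possible.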

By collecting information from there you get to see application-level or resource-specific faults that are not just numbers. They might be messages that say "Hey, kernel panic". If you got one of those you are pretty much toast. Or you get any number of other potentially developer added specific messages that might help you understand why a performance counter or why some other characteristic of your app is behaving in the way that it is.

Fourth is what we call control and this is where it starts getting into much more of a management tool. For us, the idea of simply identifying problems wasn't enough. We recognized that if you are supporting a complex application, especially as an operator, you might get into a situation where you identify a problem like a memory leak or something of that sort. You go and you tell the developers about it and you say "This application, after running for 10 days or 20 days or whatever amount, basically runs out of memory and essentially has to be restarted".

The developer says "OK, I'm going to take a look at it and I'll get back to you". In the meantime, as an operator, you are still responsible to the business for this application, which might have a time to die of 7 or 8 days. That's one example of why you put control into the product. Control is really about enabling a user of HQ to act on the things that are being managed: start, stop, vacuum.

Any number of different product- or layer-specific actions that a product intrinsically exposes, you can perform securely and in an auditable fashion directly through the Hyperic UI. And you get the added benefit of being able to do that not only on demand: you can also use alerting and other things to trigger control operations that might go on when things potentially go wrong or when certain conditions are met.

To go back to the example of the web app with the memory leak, with Hyperic you can actually very easily define an alert that says "If the amount of free heap memory goes below a certain threshold, I want you to restart the application and only send me an e-mail if that restart didn't work." Otherwise I would prefer to continue to either go about my day or sleep, or whatever it might be that you're doing. In addition, you can combine that capability with grouping, and for reasons that might not be related to trying to band-aid an issue, you might say "I need to perform a sequential graceful restart on 200 Apache servers".

We have a grouping mechanism in the product that allows you to bucket 200 Apache servers into a group and you could say "Go do a graceful restart on these servers in this order and only tell me which ones didn't work, if any." So, that's control.
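The two control scenarios just described can be sketched in a few lines. This is an illustration of the logic only, not Hyperic's alerting or grouping API: `ControlSketch`, `onFreeHeap` and `restartInOrder` are hypothetical names, and the restart and e-mail actions are passed in as plain callbacks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Illustrative only: every name here is hypothetical, not part of Hyperic HQ.
final class ControlSketch {

    // Alert-triggered control: if free heap drops below the threshold, attempt
    // a restart and notify only when that restart fails.
    // Returns true if the rule fired (that is, a restart was attempted).
    static boolean onFreeHeap(long freeBytes, long minFreeBytes,
                              BooleanSupplier restart, Consumer<String> notify) {
        if (freeBytes >= minFreeBytes) {
            return false; // healthy, nothing to do
        }
        if (!restart.getAsBoolean()) {
            notify.accept("Restart failed with free heap at " + freeBytes + " bytes");
        }
        return true;
    }

    // Group control: run the action on each server in the given order and
    // report back only the servers for which it did not work.
    static List<String> restartInOrder(List<String> servers,
                                       Predicate<String> gracefulRestart) {
        List<String> failed = new ArrayList<>();
        for (String server : servers) {
            if (!gracefulRestart.test(server)) {
                failed.add(server);
            }
        }
        return failed;
    }
}
```

Keeping the restart and notification actions as callbacks keeps the rule itself independent of how a given product is actually restarted, which mirrors the idea of triggering product-specific control operations from a generic alert.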

Finally, there is the feature that is sort of a wrapper for what our UI and our analytics do, which is analysis. It's basically combining performance data, inventory data, logs, control operations, alerts that have happened, etc., into a UI and into a presentation layer that allows you to answer questions, do planning, understand the behavior of certain pieces of your application, etc.

That's basically the core feature set of Hyperic. It's offered as both an open source product as well as an Enterprise subscription product. Then, there is also a newer product called IQ, which is a separate thing that is used as a business intelligence platform or a high end reporting engine that plugs into HQ and allows you to define custom reports and things that business people tend to really like.

   

3. Some types of problems that can be encountered in the running of a Java application can be detected with off the shelf tools, such as things like memory utilization and heap utilization, but other classes of problems require some kind of introspection into the application's internal state. What kind of things can Hyperic do out of the box without any developer interaction, and what kind of things do developers have to work with operations to create the correct hooks for?

The philosophy around monitoring and data collection in Hyperic is by design meant to push the responsibility of exposing relevant information to the product that we're actually managing. What that means is we don't actually ask you to drop all these libraries into, say, your Tomcat server or any other product that we manage in order for it to expose more information about itself. What we do instead is hook into whatever usually rich set of capabilities are built into that product and give the operations person the freedom to choose to collect whatever number of metrics and other things are exposed by that product at an interval of their choosing.

So you could say "I want to collect…" I always had this example of WebLogic has this metric called heuristic hazard exception and it's a counter. I still to this day am not exactly sure what it means, but it's usually zero and when it's not, there is a problem. If you are an operations person and you've been trained in WebLogic or you support a large scale WebLogic infrastructure, then you might care about heuristic hazard exceptions.

You want the ability to watch that metric along with other things, and then not only to see metrics at the JVM level, but also (and this is one of the things that's really good about managing Java, because JMX is actually extremely useful in this regard) to collect metrics and events and other things for higher-level components that might be running inside of a given container, for example. There are metrics for Tomcat and events and logs and configuration tracking that are completely separate from the metrics, events, logs and configuration tracking and control actions that you can perform on a web app or on a connection pool.

Our inventory model supports that and that, in a sense, is taking management capabilities that are already there but that most tools tend to collapse rather poorly, and actually exposing those to an operator and giving them as much information as the product is able to convey. The other part of your question is what can developers do to complement that and actually provide even more context in terms of how you manage an app.

The short answer is you build in JMX hooks -- if it's a Java app -- where you are exposing performance counters, attributes, other things, management operations and events that might be domain specific, so they are specific to your apps. Now I get not only information about Tomcat, its web apps and Spring through our instrumentation of the Spring Framework, but also potentially a custom MBean or a set of custom MBeans that speak to the overall health and state of the application from that application's domain perspective.

I can say "Number of orders placed" (I don't know, I'm trying to think of some good examples) or any type of business metric is the term that we've used internally at Hyperic a few times to express what there actually might be going on there. And the same way with notifications. You have custom MBeans and custom components that can send messages about the state of the application that are at a higher level than "This servlet container is angry" - that kind of thing. Usually that won't happen, developers won't get signed up to start building manageability and instrumentation into the app unless a) ops asks for it and b) there is tooling and an infrastructure that actually makes that worthwhile. We've done this actually so many times.

We go talk to customers. I spend a lot of time talking to customers, and you get the operations team on one side of the table and the developer team on the other and everybody generally thinks the problem is on the other side. One way to sort that out -- we'll never necessarily get to this perfect nirvana of developers and ops really just being 100% comfortable with each other -- is to say "Look, developers, you hate it when ops people pull you out of whatever you are doing in your fun feature writing and ask you to debug a production problem on a production application. How about you enable the existing tools that they are using to provide more information about the actual application, so they don't have to become domain experts and come back and ask you every time that something else happens?" Usually, the reason why that doesn't exist today is because there aren't tools that can do this correctly or nobody's actually got both groups in the same room and said "Here is how we could make everybody happy." Ops people hate having to go to developers and developers hate having ops people come to them.
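For readers who have not built such a hook, the following is a minimal sketch of the kind of domain-specific MBean discussed in this answer, exposing a count of orders placed through the JVM's platform MBean server. It uses only the standard `javax.management` API; the class names, the `com.example.shop` ObjectName and the `registerAndRead` helper are assumptions made for illustration and do not come from Hyperic or any real application.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical domain MBean; all names and the ObjectName are illustrative.
final class OrderMetricsSketch {

    // Standard MBean convention: the management interface is public and is
    // named after the implementing class with an "MBean" suffix.
    public interface OrderStatsMBean {
        long getOrdersPlaced();
    }

    public static final class OrderStats implements OrderStatsMBean {
        private final AtomicLong ordersPlaced = new AtomicLong();

        // Called by application code each time an order goes through.
        void recordOrder() {
            ordersPlaced.incrementAndGet();
        }

        @Override
        public long getOrdersPlaced() {
            return ordersPlaced.get();
        }
    }

    // Register the MBean, record some orders, then read the counter back
    // through JMX the way an external management tool would.
    static long registerAndRead(int orders) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName name = new ObjectName("com.example.shop:type=OrderStats");
            if (server.isRegistered(name)) {
                server.unregisterMBean(name); // keep the sketch re-runnable
            }
            OrderStats stats = new OrderStats();
            server.registerMBean(stats, name);
            for (int i = 0; i < orders; i++) {
                stats.recordOrder();
            }
            return (Long) server.getAttribute(name, "OrdersPlaced");
        } catch (JMException e) {
            throw new IllegalStateException("JMX registration or read failed", e);
        }
    }
}
```

Once an attribute like `OrdersPlaced` is registered, any JMX-aware tool can discover and collect it alongside container-level metrics, without the operator needing to know the application's internals.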

   

4. What effect has the SpringSource acquisition had on Hyperic development?

When we joined up with SpringSource earlier this year, they were an OEM partner of Hyperic that had license to our technology to build both their AMS product and their Tomcat server product, that have been very successful. So there was a lot of familiarity on the SpringSource side about the Hyperic technology stack and actually they built some extremely cool add-on features on top of the Hyperic platform that were specific to these two products that were great.

In the case of Hyperic, the Hyperic product was built before Spring existed, so you could say that we have an architecture and some elements of our product that don't follow some of the things that have really made Spring the powerhouse that it is today in terms of Java development frameworks. We've gotten access into enormous amounts of expertise in new programming models and technologies that we were very interested in getting involved in and that were going to help us chart, and we are actively charting, a path for the evolution of our platform that's going to enable us to do some very cool things.

The other interesting fact is that we selected Groovy as our scripting language of choice for the HQ platform before G2One was acquired by SpringSource, so we came into SpringSource with a strong amount of Groovy advocacy. We weren't using Grails inside our product for some technical reasons, but we were using Groovy extensively, so now we were basically at the source. It's really a marriage of expertise: on one end developer frameworks, development tools and languages, and on the other end we're bringing a lot of operational knowledge and, frankly, an application that scales and has some pretty demanding requirements.

We are the last line of defense for our customers. When Hyperic breaks and your monitoring isn't working correctly, certainly that's not a good thing. When we talk about evolving the product, for example, and say "What can you expect to see over the next year or so?", you're going to see an updating of the Hyperic platform. But at the same time, the reason for that update isn't merely for the sake of us trying to cozy up to our brothers and sisters at Spring, but because, like most people, we are convinced of the fact that the Spring framework is a better way to architect complex applications, make them more testable, etc.

You're going to see a Spring-enabled version of our product as one of the things that we're doing. We're developing this out in the open as well, so that's another great thing about being open source. Then, you are also going to see a lot of cross-pollination of different technologies. Groovy is a great example of something we did before that, but now the guys who build Groovy are colleagues of ours in the same company and they're very interested in seeing how both us and our customers employ Groovy to do a lot of customization things inside of Hyperic.

   

5. With the acquisition by VMware, how will Hyperic be integrated into the VMware product line?

Hyperic already established a really good position in the marketplace by managing not only VMware but other hypervisors and other virtualization technologies. We had, coming into this, pretty strong experience and expertise around managing high-scale virtualized environments. One of the unique things that we brought to the table was visibility not only into the VMware infrastructure but also into the applications that were running on top of the ESX and vSphere environments.

That capability is a huge strategic priority for us and part of the new products that we're building. Similarly to the way that we're able to leverage a lot of the Spring technologies because we're part of the same team, our ability to tap into the extensive resources that VMware has, and the expertise they have around their own hypervisor and virtualization technologies, will now really allow us to create (which was our vision all along) a new class of management product.

One that is designed for highly dynamic environments where some of the traditional processes that people have put in place in order to manage operations more reliably no longer apply, because you are moving workloads left and right, and because you have potentially given people, through vCloud and other efforts, direct access to spin up virtual machines programmatically. These are things that are highly disruptive.

They are very powerful and productive in one sense, but in the other sense it's kind of a game of Whack-A-Mole so to speak. Like you are trying to chase after a problem and it's showing up in one place and then moving around and getting shifted to the other. It's truly a marriage of deep visibility in the applications with the understanding of what a virtual infrastructure actually looks like and how does that scale out and how do things move around. There is a whole series of Hyperic products that are going to be very closely tailored towards that audience and it's something that we're really excited about.

   

6. What do you see as the result of the combining of the variety of SpringSource offerings with the VMware offerings?

I've been asked that question a lot lately and it's actually good, because at first glance you see a company that is traditionally an infrastructure software company -- and a very powerful one at that -- in VMware, acquiring a company that has obviously a lot of developer-oriented offerings and really works on the absolute other end of the spectrum, so to speak. At first glance most people tell me that they don't get it. Obviously, it's a great outcome for all the SpringSource employees, but what is it really that we're getting here.

The truth is we're getting a new class of Enterprise software company and I think this is the most important part. VMware can combine now their infrastructure leadership and the ability that they've had to fundamentally change the way people look at physical data centers and servers, etc. with a development model and a programming technology and all of the great assets of Spring (and this is SpringSource and I'll talk about Hyperic in a second because we're sort of in the middle of that).

They say here is this huge audience; here are people that frankly are huge fans of Spring and related technologies. Those folks are now going to be able to build different types of applications: applications that are designed to work in environments where physical server capacity is not really the operating concern, where you are able to build (and some of this was demoed at the conference) an application whose overall blueprint is used by the infrastructure layer to decide how to scale and how to manage the application through its various stages.

I think that that is not only going to push Spring into a new terrain very quickly, but it's also going to ultimately change the way people are building applications -- Java applications at first, but other types of applications as we go further into this; and that's very exciting. Where Hyperic fits into this is we’re sort of – as I mentioned – in the middle in the sense that Hyperic actually, if anything, had more affinity to infrastructure and operations. So we know a lot more about VMware than we arguably knew about SpringSource.

We sort of act as the bridge between those two worlds, because we know the application development and the application management world and at the same time we're extremely familiar with what it takes to build and support rock-solid infrastructure and manage it. I really believe that the combination of those three assets and certainly the capabilities that VMware has in the marketplace to continue its leadership role are substantial. And it's going to create, within this Cloud Computing marketplace that's very much an emerging opportunity, a new class of software company -- one that is embracing, obviously, open source through their acquisition of SpringSource and Hyperic, but also a level of innovation that is fundamentally different. We just look at the world differently; we helped change the world, or I should say VMware changed the world. You could argue certainly that SpringSource has changed the Java world as well. It's going to be a very interesting year, I think, while we complete the integration, and the products and the things that you're going to see are hopefully going to be very impressive.

The other thing to keep in mind, though, is that contrary to what other people might like to think, VMware is very keen to see what Spring has made successful continue to be successful and grow. That's why Rod [Johnson, SpringSource founder and general manager of the SpringSource division of VMware] and myself and everybody from Spring are a part of the VMware family now, and we are being given a substantial amount of room, I would say, to continue doing what we do really well. That's one of the hallmarks of what makes a merger or an acquisition successful. So just check back in a year and see how it's going.

May 13, 2010
