Paul Cowan Discusses the Resin Application Server and Cloud


1. My name is Charles Humble and I’m here with Paul Cowan, Senior Software Engineer at Caucho Technologies. Can you tell us a bit about yourself and your work at Caucho?

I got started about 11 years ago. I was weaned on Java and started with NetDynamics, which some people may remember as one of the first JEE servers to come out. I’m primarily a backend software developer and have been doing threading, concurrency and caching for the last few years. Recently at Caucho I’m mostly working on our health monitoring system, and I get involved in some of the CDI implementation or the web server: pretty much anything that we need to work on at Caucho in terms of the Resin application server.


2. Where do you see Resin in terms of the wider Cloud landscape?

We see Resin as an elastic JEE application server layer. We are not a Platform-as-a-Service and we’re not a Software-as-a-Service; we’re an infrastructure vendor for people building a Platform-as-a-Service or Software-as-a-Service. We don’t sell the service and we don’t provide it to you. We just sell you the software and you build your Cloud on it.


3. Can you tell us a little bit about how your Cloud support is architected?

We have what we call a Triad hub-and-spoke architecture, the Triad being three servers that are your primary servers and that are in constant communication with each other. That forms the hub, and then there are n elastic spoke servers that you can add and remove, as needed, to handle the load.


4. Why did you go for a hub-and-spoke rather than say a peer-to-peer cluster?

What we found was that in most cases it ends up being easier to have these three primary servers, with IP addresses that are known, without this auto-discovery, because really that creates a lot of problems in EC2 environments. The hub architecture of the three primaries, that are responsible for caching the data and maintaining the knowledge of the architecture of the Cloud, eliminates a lot of issues that you get with a true peer-to-peer network.


5. How do you keep the network traffic to a reasonable level as the number of spokes grows?

Our messaging is based on what we call HMTP, which is our Hessian Network Protocol over the WebSocket standard. For example, in the distributed sessions we will usually only update your session when the data changes and, even then, we’re hashing the session, and the Triad master will have a hash of what the current data is. The spoke servers will say, "Here is my hash. Do I have the latest data?" If the hashes match, then you don’t need to send very much data.
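The hash check described above can be illustrated with a minimal Java sketch. This is not Resin's API or the HMTP wire format: the class `SessionHub`, its methods, and the choice of SHA-256 are all assumptions made for illustration. It only shows the idea that a spoke sends a small hash and receives the full session payload only when that hash is out of date.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Resin's API: models a Triad server that keeps the
// authoritative session bytes together with a hash of them.
class SessionHub {
    private final Map<String, byte[]> data = new HashMap<>();
    private final Map<String, byte[]> hashes = new HashMap<>();

    // SHA-256 is an assumption; the interview does not name the hash function.
    static byte[] hash(byte[] payload) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(payload);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is available on every JVM
        }
    }

    // A spoke pushes changed session data; the hub stores it with its hash.
    void store(String sessionId, byte[] payload) {
        data.put(sessionId, payload.clone());
        hashes.put(sessionId, hash(payload));
    }

    // A spoke sends only its hash. Returns null when the spoke is already
    // current (or the session is unknown), otherwise the full payload.
    byte[] fetchIfStale(String sessionId, byte[] spokeHash) {
        byte[] current = hashes.get(sessionId);
        if (current == null || Arrays.equals(current, spokeHash)) {
            return null; // nothing newer to send
        }
        return data.get(sessionId).clone();
    }
}
```

In this model the common case, where nothing has changed, costs one 32-byte hash on the wire instead of the whole serialized session.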


6. How do I configure the topology of the cluster?

Resin is really simple and it’s primarily just a single JAR file with a configuration file. It’s the same configuration file for the Triad as it is for the web tier, any of the cluster tiers, the spokes. The three Triad servers are spun up and they are usually static on different machines, and the elastic servers are spun up just with a single command line. It’s really easy to bring them up and take them down as you need to.


7. How does Resin support Cloud deployment and distributed versioning?

It’s based on our HMTP messaging system, which I mentioned before. That forms the basis for our distributed cache, and then our distributed versioning is based on Git. So we have an internal version of the Git library and when you push a WAR, an application, to one of the Triad servers, we use Git internally to push it out to all the other servers. Sessions in our Cloud are usually tied to the Git version of the application, so you can have a number of sessions running off of one version of the application and push a new one; Resin will keep the old version of the application up until all of its sessions are over, and new sessions that are coming in will get the new version of the application.
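The session-to-version pinning described above can be modelled in a few lines of Java. This is a hypothetical sketch, not Resin's implementation: the class `VersionedDeployment` and its methods are invented for illustration, and plain version strings stand in for Git revisions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model, not Resin's API: sessions stay pinned to the application
// version that was current when they started.
class VersionedDeployment {
    private String currentVersion;
    private final Map<String, String> sessionVersion = new HashMap<>();  // sessionId -> version
    private final Map<String, Integer> activeSessions = new HashMap<>(); // version -> live sessions

    VersionedDeployment(String initialVersion) {
        currentVersion = initialVersion;
        activeSessions.put(initialVersion, 0);
    }

    // Deploying makes the new version current for future sessions only.
    void deploy(String version) {
        String previous = currentVersion;
        currentVersion = version;
        activeSessions.putIfAbsent(version, 0);
        retireIfDrained(previous);
    }

    // Returns the version that should serve this session; new sessions are
    // pinned to whatever version is current at their first request.
    String route(String sessionId) {
        return sessionVersion.computeIfAbsent(sessionId, id -> {
            activeSessions.merge(currentVersion, 1, Integer::sum);
            return currentVersion;
        });
    }

    void endSession(String sessionId) {
        String version = sessionVersion.remove(sessionId);
        if (version == null) {
            return; // unknown or already ended
        }
        activeSessions.merge(version, -1, Integer::sum);
        retireIfDrained(version);
    }

    boolean isLoaded(String version) {
        return activeSessions.containsKey(version);
    }

    // An old version is unloaded once its last session has ended.
    private void retireIfDrained(String version) {
        if (!version.equals(currentVersion) && activeSessions.getOrDefault(version, 1) == 0) {
            activeSessions.remove(version);
        }
    }
}
```

For example, a session routed before `deploy("v2")` keeps being served by `"v1"`, a session that starts afterwards gets `"v2"`, and `"v1"` is dropped when its last session ends.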


8. So that’s the graceful version upgrade feature, isn’t it?

Yes, that’s how you upgrade. Alternatively, you can bring a server down, put the application on there, and bring it back up. That’s an option because the Cloud is auto-sensing and will move the load around. The other alternative is to have a set of servers that you upgrade and then basically flick a switch on your load balancer and move your load to the other servers.


9. You’ve got replicated session state across the machines? What’s the overhead like for that; for actually replicating the session?

Yes. It’s pretty low, because you only need to replicate it to the server that is handling that session. We’re using sticky sessions, with a lease. So the same session will go to the same server and it will push its session update to the Triad. Now, that’s configurable whether you want it to push after every update or if you want to push it after a time period. The hashing of the sessions and comparison with the master and the spoke servers is what cuts down on the network traffic.


10. What’s the Watchdog system and why is that important for Cloud deployment?

Resin actually always runs as two processes, and the Watchdog is the secondary process that controls Resin. When you start up a Resin server, you are actually starting up the Watchdog and the Watchdog starts up Resin for you. It’s a peer process, but it’s actually the parent. The Watchdog will monitor the health of the Resin system and automatically restart it if there is an issue. One of the advantages here, particularly in the Cloud environment, is that as you are spinning up elastic servers and bringing them back down, sometimes you lose track of your servers as you have lots coming up and going down. The Watchdog system is maintaining the health of the servers, bringing them back up, noticing if there is an issue, and then reporting on that. It’s really part of our health system, so it triggers health reports because it’s an external process and can detect when Resin is having issues.
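The parent-supervises-child pattern described above can be sketched as follows. This is not Resin's Watchdog: `MiniWatchdog`, the `maxRestarts` limit, and the use of an `IntSupplier` to stand in for "run the child and return its exit code" are assumptions for illustration. A real parent would launch a separate JVM, for example with `ProcessBuilder`, and block in `Process.waitFor()`.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch, not Resin's Watchdog: a parent loop that relaunches a
// child whenever it exits abnormally.
class MiniWatchdog {
    private final IntSupplier child; // runs the child to completion, returns its exit code
    private final int maxRestarts;   // assumed safety limit to avoid endless restart loops
    private int restarts;

    MiniWatchdog(IntSupplier child, int maxRestarts) {
        this.child = child;
        this.maxRestarts = maxRestarts;
    }

    // Returns 0 after a clean shutdown, or the last failure code once the
    // restart limit has been reached.
    int supervise() {
        int exit = child.getAsInt();
        while (exit != 0 && restarts < maxRestarts) {
            restarts++;
            // A real watchdog would trigger a health report at this point.
            System.out.println("child exited with " + exit + ", restart #" + restarts);
            exit = child.getAsInt();
        }
        return exit;
    }

    int restarts() {
        return restarts;
    }
}
```

Because the supervising loop lives outside the child, it keeps working when the child has crashed or hung badly enough that it could not report on itself.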


11. Can you describe some of the other reports I can get out of the health system? You have post-mortem reports, for instance.

Primarily we have two kinds of reports: a Watchdog report is triggered when the server restarts. So, in say the 10 minutes - actually it’s configurable, but in the 10 minutes prior or leading up to a restart, we will produce a report from the data that we were tracking at that time. The second report that we have is similar and it’s a Watchdog report also. It’s produced as a PDF, it contains a thread dump, a heap dump, a stack dump, a JMX dump. So we go through all of JMX and dump out the attributes and the values. If you want, we can do profiling - we have our own internal native profiling library that can profile for a period of time. Then, we are monitoring lots of attributes in your system, like CPU usage, memory usage, file descriptors and we graph that for you. We produce what we call a snapshot report and it’s a PDF report containing a snapshot of your system at the time.


12. And you do JIT profiling as well, I think I’m right in saying?

That’s right, we do. The health system is very configurable and we can configure it to do profiling for any period of time in response to a certain issue that it notices.


13. How does what you offer compare to some of the other Java EE vendors? WebLogic has some similar features, for instance.

It’s similar, and really in the market we’re positioned between Tomcat and WebLogic. So we’re giving you more of an enterprise-quality application server, elastic and with a nice health system and a nice administration system, but it’s much more lightweight. Our download is only 23 MB and we have about a 6-second startup time, so we’re well-positioned in the market to take advantage of customers that don’t need the heavyweight full JEE stack, but want something a little more enterprise-quality than Tomcat.


15. You also have an anomaly detection feature. Could you tell us a bit about that and where that idea came from?

A lot of our health system, and the anomaly detection is a good example, arose organically out of our need to support customers. The anomaly detection feature came about as the result of a customer who was having an issue with thread spikes. We knew that there was something happening at that time; we could see it in the graphs that we were producing, but we couldn’t tell what was happening. We needed to have a snapshot at the time. The anomaly detection feature will monitor virtually any statistic you want in your JMX, number of threads being an easy one to track. If it sees an unusual spike in the number of threads, for example, it will trigger one of our snapshot reports, and that’s really invaluable for support, because we’re actually detecting that something unusual is happening at that time, and getting a snapshot of the system that you can use for debugging.
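The interview does not say how Resin decides that a value is an "unusual spike". One common, simple approach is to compare each new sample with the mean and standard deviation of a recent window; the sketch below uses that technique as a stand-in. The class `SpikeDetector` and its parameters (`size`, `threshold`, `minDeviation`) are assumptions for illustration, not Resin configuration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch, not Resin's algorithm: flags a sample as a spike when
// it exceeds the recent mean by more than `threshold` standard deviations.
class SpikeDetector {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int size;            // number of recent samples forming the baseline
    private final double threshold;    // how many deviations above the mean count as a spike
    private final double minDeviation; // floor so a perfectly flat baseline does not flag tiny changes

    SpikeDetector(int size, double threshold, double minDeviation) {
        this.size = size;
        this.threshold = threshold;
        this.minDeviation = minDeviation;
    }

    // Returns true when the sample is a spike relative to the recent window.
    // No sample is flagged until the window has filled up.
    boolean observe(double sample) {
        boolean spike = false;
        if (window.size() == size) {
            double mean = window.stream().mapToDouble(Double::doubleValue).average().orElse(0);
            double variance = window.stream()
                    .mapToDouble(v -> (v - mean) * (v - mean))
                    .average().orElse(0);
            double deviation = Math.max(Math.sqrt(variance), minDeviation);
            spike = sample > mean + threshold * deviation;
            window.removeFirst();
        }
        // The sample joins the baseline even when it is a spike; a stricter
        // detector might exclude flagged samples.
        window.addLast(sample);
        return spike;
    }
}
```

In a health system like the one described, a `true` result for a statistic such as the JMX thread count would be the trigger for capturing a snapshot report.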

Dec 21, 2011