Interview with Kevin Nilson on Cloud Monitoring and Mobile Testing
Managing a cloud environment is very different from managing servers inside the wall. At JavaOne Shanghai 2013, Kevin Nilson, the VP of Engineering at just.me, gave a talk on what needs to be done after deploying to the cloud, and he covered a lot of what an ops team needs to take care of in an AWS environment.
InfoQ did an interview with Kevin about how monitoring in the cloud is different from monitoring on-site servers, how we should deal with this difference, and what we need to know about monitoring mobile apps and mobile testing.
InfoQ: Hi Kevin, good to have you here! You talked about cloud monitoring this morning. What's the main difference in monitoring when you move from your own service to AWS?
Kevin: Being in the cloud has a lot of challenges, and one of those challenges is knowing how many servers are up at any given time. You're auto scaling up and down. You don't know how many servers you're actually running. You don't know the performance of those servers. That can be challenging.
And one of the other more unique challenges that the cloud offers is running on commodity hardware. In the cloud, things are shared, and not all servers are the same. So you may spin up another instance and that instance is much slower than the one that you had used previously.
So you need the ability to break your API's performance down by server. Sometimes you need to look at it as a whole, and sometimes you need to break it down, so sometimes aggregated and sometimes not. You look at how an API call is performing in general, but also at whether it is performing really poorly on only one server. In that case, you can delete that instance and get another one, because you really don't want to be paying for something that has much worse performance than what you should be getting. It's not that common but it definitely does happen.
And lots of people see this. I talked to the head of engineering at Pinterest and he talked about the same problems that they have. I know the folks at Netflix have done quite a bit in this area as well, and it is something that we try to keep an eye on too.
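To make the "aggregated versus per server" idea concrete, here is a minimal Python sketch. It assumes you already collect (server id, latency) samples for one API call; the function names, the instance ids, and the "twice the fleet median" rule are illustrative assumptions, not something described in the talk.

```python
from collections import defaultdict
from statistics import mean, median


def latency_by_server(samples):
    """Group (server_id, latency_ms) samples and return the mean latency per server."""
    grouped = defaultdict(list)
    for server_id, latency_ms in samples:
        grouped[server_id].append(latency_ms)
    return {server: mean(values) for server, values in grouped.items()}


def slow_servers(samples, factor=2.0):
    """Return servers whose mean latency exceeds `factor` times the fleet median."""
    per_server = latency_by_server(samples)
    fleet_median = median(per_server.values())
    return sorted(s for s, avg in per_server.items() if avg > factor * fleet_median)


# Made-up samples for one API call across three hypothetical instances.
samples = [
    ("i-aaa", 40), ("i-aaa", 44),
    ("i-bbb", 38), ("i-bbb", 42),
    ("i-ccc", 180), ("i-ccc", 220),
]

print(latency_by_server(samples))  # the per-server view
print(slow_servers(samples))       # candidates to replace
```

The aggregate number alone would hide that only one instance is slow; the per-server view points at the instance worth replacing.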
Another interesting thing is, when you're using the cloud, one of the things that you'll often do is terminate instances. And when you terminate an instance, it's gone completely, you've given up everything, any kind of logging and things like that are kind of gone. So that's another challenge.
Another challenge that comes from being in the cloud is monitoring servers. At just.me we use Graphite, which is similar to Ganglia. It's really easy to set things up with a fixed number of server instances: you can have Graphite point to your servers and have it pull to gather metrics. That's probably the most common way those tools are used. But when you're in the cloud, if you used a pull setup, every time an additional instance is added you may have to change the configuration and restart Graphite. And obviously that's not going to work.
So one of the things that I think you need to do is reverse the direction. Instead of the monitoring server knowing about all your instances, which is what you would do traditionally, all your instances need to know about your monitoring server. So the instances themselves, the WAR (the web archive in Java), should be pushing data into the reporting service rather than the reporting service pulling from your server.
I think that's really a shift that comes with being in the cloud. When you make that shift of direction and figure that out, it simplifies a lot of what you're trying to do.
InfoQ: In your talk, you also mentioned AWS CloudWatch being 5 minutes delayed.
Kevin: Let's talk about that, sure. AWS provides a basic mechanism for monitoring, CloudWatch, and CloudWatch is really nice. It provides some basic alerts and notifications and lets you look at a lot of really basic attributes, like CPU and network I/O. The CPU is the big one I care about.
The problem with CloudWatch is, with that basic monitoring that Amazon provides, it doesn't let you get in-depth details about your application. So CloudWatch only lets you look at things like the server's health and network in and out.
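A minimal sketch of the push direction, written in Python for brevity. It uses Graphite's plaintext protocol, where each data point is a line of the form `path value timestamp` sent to the Carbon listener (port 2003 by default). The host name and metric path in the usage note are placeholders, not just.me's actual setup.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point in Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def push_metric(host, port, path, value, timestamp=None):
    """Open a TCP connection to Carbon and send a single data point."""
    line = graphite_line(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

Each instance would call something like `push_metric("graphite.example.com", 2003, "api.feed.get.latency_ms", 42)` on its own schedule. New instances start reporting as soon as they boot, and the monitoring server needs no configuration change or restart.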
And that's a good starting point, but it really doesn't help you – ask yourself this simple question. If the CPU is 0%, what does that mean? Does that mean that everything is performing well? Or does that mean your app server crashed? It could mean the best case and it could mean the worst case. You really have no idea.
So CloudWatch has some limitations. To overcome them, you really want to look at your particular services and how they are doing, right? One of the tools that I recommend is Yammer Metrics. Yammer Metrics is really nice because it gives you timers, gauges and histograms. Basically, it allows you to annotate your code and look at any API: how it is performing, how many times it has been called, and the rate at which it's being called, per minute or per second. You can get that information really easily. That's kind of the first step in providing data points about your services, but the problem is that you really need to look at how the service is trending. Knowing that the service is taking 400 milliseconds doesn't really mean much. Was it only taking 20 milliseconds the day before, or was it taking 800 milliseconds the day before? Is it now significantly worse, or is it running faster than it's ever been? You really don't know.
And so one of the things that I've done at just.me is use Graphite so that I can get charts and graphs of all our performance and do trend analysis of how things perform today compared to yesterday or to last month. And then if I do a large version upgrade of our software, I can look at whether it affected performance or whether it was effective in driving more traffic, so looking at the business side as well as the operational side. Hosted Graphite has a really easy-to-set-up service that lets you get up and running with Graphite really quickly. At just.me we’ve been very happy with their service.
Then from there, the next big challenge is looking at the performance of individual services and how you set up alerts and notifications so that you get an e-mail or an SMS when you have an outage. I've been using a tool called Nagios for this and it's really nice. Nagios provides a simple configuration file where you can define when you want to be warned that, "Hey, there's a problem coming soon," and when you actually have a problem.
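Yammer Metrics is a Java library, and the snippet below does not use or imitate its API. It is a small Python sketch of the underlying idea only: wrap an API handler so that every call records how long it took, then read back the call count and timings. The `timed` decorator, the `TIMINGS` store, and the `get_feed` handler are all hypothetical names.

```python
import time
from collections import defaultdict
from functools import wraps

TIMINGS = defaultdict(list)  # metric name -> list of call durations in seconds


def timed(name):
    """Decorator that records the duration of every call under `name`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even when the call raises, so failures are counted too.
                TIMINGS[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


def summary(name):
    """Return the call count, mean and max duration recorded for `name`."""
    durations = TIMINGS[name]
    count = len(durations)
    return {
        "count": count,
        "mean_s": sum(durations) / count if count else 0.0,
        "max_s": max(durations, default=0.0),
    }


@timed("api.feed.get")
def get_feed(user_id):
    """Stand-in for a real API handler."""
    return {"user": user_id, "messages": []}
```

After a few calls to `get_feed`, `summary("api.feed.get")` gives the kind of per-API data point described above. On its own it is still a snapshot; it becomes useful once it is stored over time.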
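The point about trends can be reduced to a one-line calculation: a reading only means something relative to a baseline. The sketch below is a generic Python illustration using the 400, 20 and 800 millisecond figures from the interview; it is not tied to Graphite.

```python
def percent_change(current, baseline):
    """Percentage change of `current` relative to `baseline` (positive means slower)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline * 100


# The same 400 ms reading against two different "yesterdays".
print(percent_change(400, 20))   # far slower than a 20 ms baseline
print(percent_change(400, 800))  # half the latency of an 800 ms baseline
```

The same reading is an alarm in the first case and good news in the second, which is why the raw number needs to sit next to its history.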
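The warning-versus-problem distinction maps onto the exit codes that Nagios plugins conventionally return: 0 for OK, 1 for WARNING and 2 for CRITICAL. The Python sketch below shows only that classification step; the 400 ms and 800 ms thresholds are arbitrary illustrative values, and this is not a complete Nagios plugin or configuration.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def check_latency(latency_ms, warn_ms=400.0, crit_ms=800.0):
    """Classify a latency reading against warning and critical thresholds."""
    if latency_ms >= crit_ms:
        return CRITICAL, f"CRITICAL - latency {latency_ms:.0f} ms"
    if latency_ms >= warn_ms:
        return WARNING, f"WARNING - latency {latency_ms:.0f} ms"
    return OK, f"OK - latency {latency_ms:.0f} ms"
```

Setting the warning threshold well below the critical one is what buys the time to investigate before customers notice anything.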
I think the key is, you want to get to a point where you know a problem is coming before it's happened. Often analysis is much easier if you can find that problem before it happens. And definitely, it's better for the customer if there never is an outage.
The last thing I did at just.me around trend analysis was about looking at things as a whole. A lot of the tools provided by Graphite let you look at things one service at a time very easily. Maybe you can look at five services at a time, but not at all your services at once, and what I really wanted was to be able to see how everything is behaving and look at the trends.
So the folks at Square built something called Cubism. It's open source. It's really great. It lets me get a tremendous amount of data on the screen at once and see how things are performing. And one of the big advantages that gives me is, a lot of times when you have a problem with one service, it will cause lots of problems everywhere in the app. But there'll be a period of time where maybe you'll get a warning where one service starts performing badly for 10 minutes and then everything performs poorly. So that helps you isolate your problem, because looking into the first service that went bad is more than likely going to lead you to the cause of the problem faster. I think the key here is finding the cause as quickly as possible so that you can get things recovering quickly.
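The "which service went bad first" reasoning can also be done programmatically. The Python sketch below assumes you have one latency series per service, sampled at the same interval, and orders services by the first sample that crosses a threshold. The service names, numbers and threshold are made up, and this is unrelated to how Cubism itself works.

```python
def first_breach(series, threshold):
    """Return the index of the first sample above `threshold`, or None."""
    for index, value in enumerate(series):
        if value > threshold:
            return index
    return None


def degradation_order(metrics, threshold):
    """List services that breached the threshold, earliest breach first."""
    breaches = []
    for name, series in metrics.items():
        index = first_breach(series, threshold)
        if index is not None:
            breaches.append((index, name))
    return [name for index, name in sorted(breaches)]


# Made-up per-minute latencies (ms) for three hypothetical services.
metrics = {
    "feed":   [40, 42, 41, 45, 700, 900],
    "auth":   [50, 55, 650, 800, 900, 950],
    "search": [30, 31, 29, 30, 32, 31],
}

print(degradation_order(metrics, threshold=500))
```

The service at the head of the list is the one to investigate first.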
InfoQ: There are so many metrics to watch. How do you determine which are the ones that you should look at?
Kevin: I do think there are some core metrics that you want to look at. You want to look at your CPU. If you have thousands of things to view, you want to build a dashboard, something that's visible at a glance; we would often call it at-a-glance reporting. CPU is definitely there, and instance health is also there.
One of the hardest things to get monitoring for, but one I really need, is active requests. Knowing which APIs have active requests and what the count of active requests is turns out to be one of the most challenging things to get, but I think it's also one of the most valuable. If you have a lot of traffic, when a service starts to have problems, the number of active requests will increase because each one is taking longer to finish. Being able to look at that as a whole, I found very, very interesting. In Java especially, you have a fixed thread pool and generally one thread per request, so one API call can take down your entire service if it is hanging, which is very scary. So having this ability to protect one service from taking down the site is really important.
There are a lot of services whose performance I don't care about. The difference is whether you are getting data or posting or putting data. When you're sending data to the server, that's almost guaranteed to be asynchronous, to be in the background, and if the customer doesn't know it's happening, he doesn't know how long it took. When you're getting data, or when you're fetching things from the server, that's where the customer is waiting, almost always. So as far as performance is concerned, the metrics of the gets are much more important for the users.
But then the puts, what people are publishing into your system, are often more important from the business side. With the stuff that we do in the social space, we want to know when people post messages, how many messages they posted, or likes and comments and different reactions, following people. We want to know how many of these events have happened. On the business side, we don't really care how many times users refresh the screen.
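One way to obtain an active-request count is to increment a counter when a request enters a handler and decrement it when the request leaves, whatever the outcome. The sketch below shows that pattern in Python with a context manager; the names are hypothetical and the interview does not say how just.me implemented it.

```python
import threading
from collections import Counter
from contextlib import contextmanager

_lock = threading.Lock()
ACTIVE = Counter()  # API name -> number of requests currently in flight


@contextmanager
def track_active(api_name):
    """Count a request as active for as long as the `with` block runs."""
    with _lock:
        ACTIVE[api_name] += 1
    try:
        yield
    finally:
        # Always decrement, even if the handler raised.
        with _lock:
            ACTIVE[api_name] -= 1


def active_count(api_name):
    """Return how many requests for `api_name` are in flight right now."""
    with _lock:
        return ACTIVE[api_name]
```

A handler body would be wrapped as `with track_active("feed.get"): ...`. Publishing `active_count` per API at a regular interval gives the gauge described above: a hanging call shows up as a count that climbs and does not come back down.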
So there's sort of this tradeoff because you really need the performance when you're getting data, but then your business is most concerned about the put and post.
The other thing that greatly helped me was using Cubism. You've got this decision where you're trying to determine which metrics you want to see at a glance. And with Cubism, the way they layer three colors on top of each other and the way you can see it with minimal real estate almost gives me the ability to see everything at a glance. I can fit as many metrics as possible on one screen and have it visible from a distance.
That's more from an operational monitoring side, not from a business side. The business, I think you're just looking at what are your core goals that you're trying to achieve.
I think my dashboards have probably changed about 30% since the day I launched, because there are things that I thought we would have problems with and we have never had problems so those metrics got removed. Things that we've had big problems with that I never could have predicted or maybe I should have predicted but I didn’t, have gotten moved into the at-a-glance dashboards that we show in the office.
InfoQ: So what's left on that dashboard?
Kevin: So I look at our CPUs. The CPU data can be done on top of each other - I can take all the MySQL CPUs, EC2 servers CPU, Lucene CPU, the Neo4j CPU - I can put them all on top of each other. They all show a number between zero and 100 and we know that once you get above 75% to 80%, you start to get a little panicky. So I can look up at-a-glance and say, "Okay, nothing's over 50%. I don't really care what the individual numbers are." This has really allowed me to compress more info onto one screen.
I look at my load balancer, the number of instances that are healthy, making sure that we don't hit zero. I also look at active requests and make sure that no API spikes.
I look at DynamoDB. There are two things I want to say about DynamoDB. I don’t like the way you pay for DynamoDB read units. In my opinion, it’s anti-cloud. It doesn't auto scale. Basically, you say, "I'm willing to pay for 100 read units." And once you get to 100 read units, they throttle you: they throw an exception and you stop getting results. You pay for a hundred even if you're only using 10. And if you have a site that's active during the day and quiet at night, you end up having to constantly reconfigure. I'm really nervous about that because if any one of those hits the threshold, you're in really big trouble because it will stop returning data. So that's really bad.
Also, they have CloudWatch for Dynamo, but it’s not easy to use. At just.me, we have many, many tables, and you would have to open up a few dozen browsers at once to see everything in CloudWatch. So it's nearly impossible. That's why I built my tools with Graphite on top of it, so that I can aggregate all that data - it would be on a hundred browsers - and put it in one chart just like I did for CPU. I can look at my read and write units. I can draw a line across and say this is my threshold. Anything that goes above or below is problematic.
What else is on our dashboard? For a while, I put a lot of things around our security layer. Each request needs to go through JAAS security and check, "Are you authenticated?", and we had some problems in our authentication layer in the early days. So that used to be top, front and center of my dashboard, because if it takes 10 seconds to make it through the security layer, then it really doesn't matter how the rest of the site is performing. You have to get that fixed first. But those problems have all been resolved now, so those metrics went away and disappeared from the at-a-glance dashboard.
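The "one chart with a threshold line" idea boils down to normalizing every table to a utilization ratio and flagging the ones close to their provisioned limit. The Python sketch below assumes the consumed and provisioned read units per table have already been collected from somewhere (for example CloudWatch). It makes no AWS calls, and the table names, numbers and 80% threshold are invented for illustration.

```python
def tables_near_limit(tables, threshold=0.8):
    """Return {table: utilization} for tables at or above `threshold`, highest first.

    `tables` maps a table name to (consumed_read_units, provisioned_read_units).
    """
    flagged = {}
    for name, (consumed, provisioned) in tables.items():
        if provisioned <= 0:
            continue  # skip tables with no provisioned capacity recorded
        ratio = consumed / provisioned
        if ratio >= threshold:
            flagged[name] = round(ratio, 2)
    return dict(sorted(flagged.items(), key=lambda item: item[1], reverse=True))


# Made-up readings for three hypothetical tables.
tables = {
    "messages": (95, 100),
    "likes": (10, 100),
    "follows": (45, 50),
}

print(tables_near_limit(tables))
```

Because every table is expressed as a fraction of its own limit, dozens of tables can share one chart and one threshold line, in the same way that CPU percentages can be overlaid.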
Then I also put the number of open support tickets on the at-a-glance report, because if a large volume of support tickets gets opened, you probably have a problem that maybe your monitoring hasn’t picked up on yet. So I put that right in the at-a-glance dashboard where I can see the open support tickets.
And then for morale boosting within the company, I also put some simple marketing metrics on there. So we can see how many new devices have registered, how many messages were sent in our system, and the number of replies. That's more for the team. I'm hoping it draws their eyes toward the dashboard, that they're curious about these, and then they might notice some of the other things. So we keep it in the office, put it on the big screens, and then all the developers can see it. We take breaks. People stand under the monitors and talk about them; they're curious and interested.
InfoQ: What metrics do you look into specifically for the performance of your iOS and Android apps?
Kevin: When you're trying to work with Android and iOS and the HTML5 mobile version as well - we're a responsive app, we basically use HTML5 so that when users first get introduced to the app, they may not have it installed. Someone may share a message to them and they get a nice native-feeling experience, and they say, "Wow, I want to install the real app so I can do more."
And so you have these three clients that you're very concerned about, and you want to be able to know the differences between them, how they are performing from a business perspective, and which one is driving more traffic. And one of the big challenges there is knowing what the source of the installation was. So when someone goes to the app store and does an installation, where do they come from? What's the referrer? Maybe we did a big advertising push, spent lots of money on advertising, was it effective? What's our return on investment there? Maybe we had a big article with some blogger and it drove a lot of traffic, and maybe we need to work more closely with that person in the future to drive more business and really try to figure out that targeting.
So one of the things that we do at just.me is we don't drive people towards the app store or play store directly. We drive them to an HTML5 website. So it's just.me/gettheapp. So we tell people to go to gettheapp. And when we send them to gettheapp, it tells us what the referrer is. Google Analytics will do that for us. And then we'll often add a query string at the end, just.me/gettheapp?src=TechCrunchArticle, or just.me/gettheapp?src=emailcampaign. And then we can look at the metrics to determine which advertising was successful. HTTP headers give the referrer URL, which is very useful as well. So we try to route as much of the traffic to the app store through gettheapp as possible. That's definitely very helpful.
As for the other tools that I use for mobile monitoring, we started by using Flurry quite a bit. It's built from the ground up for mobile, and we use it on iPhone and Android. It tells us what devices people have. If you look at just.me on Android right now, we support over 1,500 devices. I can't tell my QA guy, "Go test on 1,500 devices," or "go randomly pick one." I want to look at what my customers are really using, and buy those targeted devices.
So although we support 1,500 devices, in the company we own about 10 to 15 unique devices. And then I definitely put a lot of pressure on the developers to choose personal phones that correlate with what people are using out in the market. It is important for us to know which devices are popular so we can drive our QA towards what people actually use.
We had a problem a few weeks ago with Sony devices. So we had just launched and we found that no one using a Sony device could publish a message. We had never bought one. And sure enough, we went through a small beta with a few hundred people. None of them had Sony devices - kind of surprising. None of them reported problems anyway. And then once we launched, the first day that we launched, I was on the hotline at Sony, trying to get them to overnight me an Xperia phone. And sure enough, we got the device in our hands and within 20 minutes we had a fix in and put it in the play store.
The fragmentation problem on Android is really annoying, but one thing that's really great about Android is that I can have a fix in the play store within an hour, which is amazing. The Apple App Store requires each release to be reviewed before it goes out, which often can take about five .
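Reading the `src` tag back out of landing-page URLs is straightforward with a standard URL parser. The Python sketch below counts installs by source. The `src` parameter name and the two example values come from the interview, while the `http://` scheme, the `"direct"` fallback label and the function names are assumptions for illustration.

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse


def install_source(url, default="direct"):
    """Return the value of the `src` query parameter, or `default` if absent."""
    query = parse_qs(urlparse(url).query)
    return query.get("src", [default])[0]


def count_sources(urls):
    """Tally landing-page hits by their `src` tag."""
    return Counter(install_source(url) for url in urls)


hits = [
    "http://just.me/gettheapp?src=TechCrunchArticle",
    "http://just.me/gettheapp?src=emailcampaign",
    "http://just.me/gettheapp?src=TechCrunchArticle",
    "http://just.me/gettheapp",
]

print(count_sources(hits))
```

In practice an analytics tool does this aggregation for you; the sketch only shows why tagging each campaign link with its own `src` value makes the sources separable.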
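Choosing which devices to buy from usage data can be treated as a coverage problem: sort devices by how many customers use them and keep adding devices until a target share of users is covered. The Python sketch below is one such heuristic, not the process just.me followed; the device names, counts and 80% target are placeholders.

```python
def devices_to_buy(usage, coverage=0.8):
    """Pick the most-used devices until `coverage` of all users is reached.

    `usage` maps a device model to the number of users seen on it.
    """
    total = sum(usage.values())
    if total == 0:
        return []
    chosen, covered = [], 0
    for device, users in sorted(usage.items(), key=lambda item: item[1], reverse=True):
        if covered / total >= coverage:
            break
        chosen.append(device)
        covered += users
    return chosen


# Placeholder usage counts, as an analytics tool might report them.
usage = {"device-a": 500, "device-b": 300, "device-c": 150, "device-d": 50}

print(devices_to_buy(usage))
```

A short list like this is what makes owning 10 to 15 devices workable when the app nominally supports more than a thousand.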
Another really nice thing that Android offers is risk mitigation. Google Play store announced at I/O this year the alpha and beta program. At just.me we have three tiers: alpha, beta, and production. In the alpha, it's people I know very well, often people I've drunk a beer with. And those people also are able to plug their phone into a debugger. And so if I have a new build I want to put out, if I think it's risky, I'll give it to the alpha store for a few days. Let those people try it and see what happens. If something goes terribly wrong, nothing to worry about.
The next step is beta. So these are people who went to the website and said, "Hey, I'm really interested in your product. I want to see it the day it comes out." And so we've got several hundred of those, and we control that through a Google+ group. So I next give the build to them. If something goes wrong, it's probably not good but it's not catastrophic, because builds that are in the beta can't be rated in the app store, and you don’t want one-star reviews because you released something too early. So this is where Android really has done a lot of great work that is really developer focused.
So for the Sony case I mentioned earlier, one thing we did was, we immediately went to the play store and disabled all Sony devices. So if someone comes to the play store with a Sony device, it says, "Your device is not supported. Come back soon." We took Sony devices offline for about four days while we waited for “overnight shipping”, and when we got a fix ready, we re-enabled Sony. The play store really gives you some great advantages.
Another thing that was interesting when working with Apple was the tips around getting featured. A lot of what Apple wanted in making their decision to feature things is a large block of features. So they really don't want you pushing code every single week. The best way not to get featured is to push your features to your customers every week. Releasing three to four times a year was their preference.
So what that means is, when you're completing things, you're really anxious to get them out there, you think this is going to save your company, it's going to drive business, it's going to close the viral loop and your viral coefficient is going to go up…but we're going to need to hold it because it needs to go out with a few other things so that we have a better chance of getting featured when the editors. And that's painful. But understanding that is definitely key to your success in the mobile area.
And so we use Google Analytics as well as Flurry to gather the metrics, but I think you have to really understand Google Analytics to get data out of it. An average employee is not going to go and find data in Google Analytics. So someone in marketing who's really, really driven to find that data can get it out, but I think an average developer is not going to just poke around and go, "Oh, I found an insight." It's not going to happen. And you probably don't want them spending that much time, because it’s not that user-friendly; the way that they have labels, actions, and values can seem very contrived.
So another tool that we started using is Mixpanel. It's a startup. It's a paid tool. It's nowhere in my presentation for JavaOne. What Mixpanel does is a lot more friendly. The APIs are a little bit more similar to Flurry, but they're really good at cohort analysis and funnels. You can look at the people who have done one thing and ask whether they have also done another. You can look at who is becoming a power user, people who have been using your product for a week. Basically, instead of just having counters, it's more about looking at tiers of customer counters over a long timeframe.
A simple example: if someone registers and logs into the app once a week or once a day or five times a day, whatever it is, you have a number indicating who is still using it. In the first week, that's going to be 100% or 90%. The second week, maybe it's 80%. And then the next week, it's 70%. And then you hope that at 50%, or whatever your target is, it stays consistent, where 50% of the people are still there, and you want to know whether that number goes to zero. Are you holding and maintaining your users? Normally, when looking at counters, if you have new people coming in, it's really hard to distinguish between the new people coming in and the old customers, and Mixpanel is very nice for that. Also, it allows you to look at things that build on top of each other, such as replying to messages and things like that.
InfoQ: Good. So last question. When you have choices of tools - building your own tool, taking an open source project and making your own tool, or taking a 3rd party service (you used Graphite, Nagios, JMeter, Yammer Metrics, New Relic, and there are a lot more) that seems useful and reliable - what are your major considerations on which ones to opt for?
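The retention example can be written down as a small calculation: fix the set of users who registered in a given week (the cohort) and, for each later week, compute what fraction of that same set was active. The Python sketch below does this with plain sets. It illustrates the concept only and has nothing to do with Mixpanel's API; the user ids and activity data are invented.

```python
def retention_curve(cohort, activity, weeks):
    """Fraction of `cohort` active in each of the first `weeks` weeks.

    `cohort` is the set of user ids who registered in week 0.
    `activity` maps a week index to the set of user ids active that week.
    """
    if not cohort:
        return [0.0] * weeks
    return [len(cohort & activity.get(week, set())) / len(cohort) for week in range(weeks)]


# Ten users registered in week 0; users 11 and 12 joined later.
cohort = set(range(1, 11))
activity = {
    0: set(range(1, 11)),
    1: set(range(1, 9)) | {11},
    2: set(range(1, 8)) | {11, 12},
}

print(retention_curve(cohort, activity, weeks=3))
```

A plain counter of weekly active users would read 10, 9, 9 here and look almost flat, because the newcomers mask the drop. Intersecting with the original cohort is what separates old customers from new ones.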
Kevin: I always try to see if there's something already there. I start with open source when I'm looking for tooling because I think a lot of the advantage for open source is, if there is a pain point, it's going to be somebody else's pain point, and they're going to go and fix it. And if not, I'll fix it. It happened many times where either I'll fix it or I'll have somebody who works for me fix it and contribute the code back. That's really like a stepping-stone - you may seed it and then somebody else is fixing it for you.
Picasso, from Square, is a library I really like. At just.me we use Picasso for image loading. I've put in a few feature requests and often they will get. You can also vote on features and things like that. This was very useful for us at just.me. I really didn’t want to write all the code necessary for caching and image loading.
So I always look to open source. But open source doesn't always work. Sometimes there are not enough people interested in that problem. You may have to start your own open source project. For example, when I worked at E*TRADE, I wanted to be able to test how my app was performing on all browsers, and I wanted to be able to do that in continuous integration with Jenkins. Nobody really had done that. So I went ahead and made my own plug-ins for Jenkins to make that possible. And then a lot of people started contributing patches and fixes and different feature requests, and that's definitely really nice.
A lot of times, an open source solution does not fit your case. What I did with Graphite was, I wanted a dashboard, I looked at theirs, they have invested a tremendous amount of effort in this dashboard and I could not easily mold it to be usable for me. I didn’t see how it would be possible to enhance their dashboards to meet the needs I had. So I built my own dashboard on top of their graphs. The graphing flexibility in Graphite is amazing, but I found the dashboards limiting. So I rolled my own tool for making a more flexible dashboard.
The third place I would go is looking for something commercial. Commercial products can save you time and money, if they meet your needs and you don’t think you will need additional customization. Building your own from scratch is the right choice when it's core to your business. You want to be able to distinguish yourself from others, you want to be able to have full control. You should leverage some open source when you can afford this investment. If it is something that's really core to your business, then you shouldn't necessarily always start at open source.
When you look at something like a JVM profiler, it's a very advanced thing. There's nothing in open source that really analyzes and profiles all at once. People don't have the time and energy to make that possible in open source, and I think New Relic has done a good job. And so we do a little bit with New Relic.
About the Interviewee