The Challenge of Monitoring Containers at Scale

Mar 24, 2016 18 min read

The open source release of Docker in March 2013 triggered a major shift in the way in which the software development industry is aspiring to package and deploy modern applications. The creation of many competing, complementary and supporting container technologies has followed in the wake of Docker, and this has led to much hype, and some disillusion, around this space. This article series aims to cut through some of this confusion, and explains how containers are actually being used within the enterprise.

This article series begins with a look into the core technology behind containers and how this is currently being used by developers, and then examines core challenges with deploying containers in the enterprise, such as integrating containerisation into continuous integration and continuous delivery pipelines, and enhancing monitoring to support a changing workload and potential transience. The series concludes with a look to the future of containerisation, and discusses the role unikernels are currently playing within leading-edge organisations.

This InfoQ article is part of the series "Containers in the Real World - Stepping Off the Hype Curve". You can subscribe to receive notifications via RSS.

The adoption of containers, and the associated desire to build microservices, is causing a paradigm shift within the monitoring space. Application functionality is becoming more granular and more independently scalable and resilient, which is a challenge for traditional monitoring solutions. If a single component within a microservice architecture fails, there may be no business impact, and so the severity of the alert should match this fact. The traditional monitoring tool approach of testing whether something is 'up' or 'down' falls short, and accordingly some organisations are building their own monitoring systems.

The transient nature of containers also presents new challenges with monitoring, especially when combined with the emerging popularity of scheduling and orchestration systems, such as Kubernetes, Mesos and AWS ECS. The expert panel argue below that modern monitoring solutions must be capable of integrating with these platforms, and also providing data at both the container and aggregated service view. This combination of views is the only way that issues can be identified and resolved effectively.

With the increase in the number of services running within an application system, and also additional underlying infrastructure components, a large amount of data is now being generated. So much so that monitoring may now be a 'big data' problem. Accordingly, the next generation of monitoring tools must provide some level of artificial intelligence that is capable of providing understandable insight to the data generated, and ultimately be capable of providing recommended actions (even if this includes taking no action).

InfoQ recently sat down with a series of container monitoring experts and explored these challenges and more. Topics discussed included the design of modern monitoring systems, approaches to providing actionable insight for human operators and sysadmins, and the future of monitoring tooling.

InfoQ: Hi, thanks for taking the time to talk to InfoQ today. Could you briefly introduce yourself please?

Chris Crane: Hi, thanks for having me! My name is Chris Crane and I’m the VP Product at Sysdig, the container-native visibility company.

Kevin McGuire: Hi, my name is Kevin McGuire. I'm Principal Product Manager, Operations Analytics for New Relic which means I have responsibility for our strategy for operations and infrastructure monitoring. Recently I led our product and engineering efforts around cloud monitoring which includes developing and launching our monitoring solutions for Docker and our Amazon EC2 Beta.

Ilan Rabinovitch: Happy to join you! My name is Ilan Rabinovitch, I am the Director of Technical Community and Evangelism at Datadog. Prior to joining Datadog, I spent a number of years leading infrastructure and reliability engineering teams at large web organizations such as Ooyala and Edmunds.com.

Alois Reitbauer: Hello, my name is Alois Reitbauer. I am chief technical strategist of Dynatrace and Ruxit. I have been working in the monitoring and performance space for the greater part of my career. Right now my main area of work is how we can better monitor large scale environments – think microservices, containers, IoT – and how to make it easier for people to interact with monitoring systems.

InfoQ: Can you share your thoughts on what the biggest challenge currently is within the container monitoring space?

Crane: Containers - and the microservices they enable - represent a paradigm shift in the development and deployment of software applications. As with previous new paradigms, we’re seeing an ecosystem develop of supporting technologies which have been designed from the ground up for this new platform. Everything from security to networking to storage and, of course, monitoring and visibility.

Now this “fresh start” approach is particularly important for monitoring, and this gets to the central challenge in container visibility: legacy monitoring tools were designed for the world of VMs, where an agent can be deployed in the same space as the applications being monitored. This is no longer the case with containers.

The core principles of containers demand that they remain lightweight, reproducible and unburdened by agents. So any monitoring solution that really wants to be container-native is going to have to be grounded in an approach that respects the sanctity, portability, and replicability of each container. This turns out to be really hard, if - and this is the key part - you still want to achieve 100% application visibility inside containers.

McGuire: What we're seeing is that containers are being used for smaller and shorter lived workloads as the benefits around microservices are being explored. At the same time, containers are also being used for larger, long running services, a sort of optimization from VM usage. The former requires a different way of monitoring containers, and the combination of the two means that a single approach to monitoring isn't sufficient to meet the needs of these vastly different uses. For example, monitoring should provide visibility of application and resource metrics per container, and also the aggregated resource usage at the image level that is more applicable to short lived containers.

Within each container is an application, but presently the monitoring APIs are focused at the resource level. This is fine from the host's point of view, but ultimately you need to be able to interpret application performance both within and across containers.

Finally, as containers become smaller and the number increases, understanding their interdependencies, maintaining operational health at scale, and doing root cause analysis becomes challenging.
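The two views McGuire describes, per-container readings and usage aggregated at the image level, can be illustrated with a minimal sketch. The `ContainerSample` schema, the image names and the numbers below are hypothetical and are not taken from any vendor's API; the point is only that short-lived containers are easier to reason about once their readings are rolled up under the image they were started from.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerSample:
    """One resource reading for one container (hypothetical schema)."""
    container_id: str
    image: str
    cpu_percent: float
    memory_mb: float


def aggregate_by_image(samples):
    """Roll per-container readings up to one summary per image."""
    totals = defaultdict(
        lambda: {"containers": set(), "cpu_percent": 0.0, "memory_mb": 0.0}
    )
    for sample in samples:
        entry = totals[sample.image]
        entry["containers"].add(sample.container_id)
        entry["cpu_percent"] += sample.cpu_percent
        entry["memory_mb"] += sample.memory_mb
    # Replace the set of container ids with a simple count for reporting.
    return {
        image: {
            "containers": len(entry["containers"]),
            "cpu_percent": entry["cpu_percent"],
            "memory_mb": entry["memory_mb"],
        }
        for image, entry in totals.items()
    }


# Illustrative data: two containers from one image, one from another.
samples = [
    ContainerSample("c1", "web:1.4", 12.5, 256.0),
    ContainerSample("c2", "web:1.4", 7.5, 128.0),
    ContainerSample("c3", "worker:2.0", 40.0, 512.0),
]
by_image = aggregate_by_image(samples)
```

The per-container readings remain available in `samples`, while `by_image` gives the aggregated view that survives individual containers coming and going.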

Rabinovitch: The biggest challenge many experience with container monitoring is the frequency of change, making it difficult to define and alert on "normal". While this has been true for many years with cloud environments such as AWS, OpenStack and other virtualized solutions, the short lifetime of containers exacerbates it.

To give you a sense of scale and frequency here, we recently ran a study on Docker usage and adoption where we found that most hosts tend to run about 4 containers concurrently, and that each of those containers lived less than 1/4 the lifetime of the host they ran on. This means that our monitoring tools can no longer focus on hosts as they once did as the primary unit of measure.

We now ask “where is that Redis cluster or NGINX server running? Did Kubernetes just move it to another host because of resource availability, or was this a failure event?”

If your monitoring is centered around hosts, your world looks like Ptolemaic astronomy - complicated. It's pretty hard to account for the movement of planets when you try to measure them in relation to the Earth. Flip it around with the Sun as the center of the solar system, and the math gets much simpler. That may mean relying on data from your scheduler's service discovery, as well as the tags or labels from your containers to define queries and questions that will ring true regardless of what host or port those services happen to be running on at any given second.
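A small sketch may make the label-centred approach concrete. It assumes a hypothetical in-memory `Container` record with a `labels` dictionary (loosely modelled on the labels that schedulers attach to containers) and is not the query language of any particular monitoring product. The query is written against labels only, so its answer does not change when the scheduler moves a container to a different host.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """A running container as seen by the monitoring system (hypothetical)."""
    container_id: str
    host: str
    labels: dict = field(default_factory=dict)
    requests_per_sec: float = 0.0


def select(containers, **wanted):
    """Return containers whose labels match every key/value in `wanted`."""
    return [
        c for c in containers
        if all(c.labels.get(key) == value for key, value in wanted.items())
    ]


def service_throughput(containers, **wanted):
    """Sum request throughput over all containers matching the labels."""
    return sum(c.requests_per_sec for c in select(containers, **wanted))


fleet = [
    Container("a1", "host-1", {"service": "redis", "env": "prod"}, 120.0),
    Container("b2", "host-2", {"service": "redis", "env": "prod"}, 80.0),
    Container("c3", "host-2", {"service": "nginx", "env": "prod"}, 300.0),
]
before = service_throughput(fleet, service="redis", env="prod")

# Simulate the scheduler moving one Redis container to another host.
fleet[0].host = "host-3"
after = service_throughput(fleet, service="redis", env="prod")
```

Because `host` never appears in the query, `before` and `after` are the same value; a host-centred dashboard would have needed updating after the move.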

Reitbauer: From our experience working with large accounts it is an actual paradigm shift in the way we build software, which leads to a number of challenges. One is obviously scale: companies adopting containers in most cases also move to microservice architectures. Usually once people break down their monolithic applications into services, the number of entities involved increases greatly. From our experience in the Java space, these environments run 10 to 20 times more JVMs than before. This means that a reasonably sized system suddenly requires the tools to manage ‘web-scale’ systems.

The other change we can see is the adoption of orchestration layers building on Mesos, Marathon or Kubernetes. The effect is that application landscapes become much more dynamic, and scale up and down pretty quickly. Understanding this dynamism is key, and many traditional monitoring tools are not well equipped to handle this. Besides understanding the dynamics of these environments, a lot of other questions come up, like how we do log management in these environments and what the overall role and approach to infrastructure monitoring looks like.

Which challenge is the most critical depends on where a company is in their adoption of containers. Dynamic orchestration is still an advanced area and we are seeing people moving gradually towards it. Dealing with web-scale environments, however, is something that is on everybody’s agenda once they move to containers.

InfoQ: Adrian Cockcroft has been quoted as saying that monitoring systems need to be more available and scalable than the systems being monitored. What is your opinion on how practical this is for current designers/implementers of container-based microservice applications?

Crane: Adrian Cockcroft is right. The worst case scenario is you’ve got your new microservice oriented application and you’re trying to monitor it with a monolithic legacy tool that can’t keep up. Here at Sysdig we call the solution to this challenge “monitoring as a microservice”. The idea is that your monitoring solution should be just as easily deployed and scaled as all your microservices. Your monitoring solution has got to be as automated, self-serving, and self-healing as possible, with concepts like automatic service discovery baked in. If your monitoring itself is built on a microservice-native, container-native architecture, then you’ve got a good shot at keeping up. Without this approach though, Adrian’s ideal is probably, unfortunately, pretty impractical.

McGuire: High availability of monitoring systems is definitely important. As DevOps becomes more reliant on monitoring for alerting on problems, your monitoring system becomes an integral safety net. This is why we believe a SaaS based monitoring solution is the right one, since having to run and maintain both a monitoring system and the system to be monitored increases the likelihood of a problem impacting both and leaving you blind.

Microservices increase the scale and complexity of the monitoring problem. However, they also bring with them practices that increase availability, such as the ability to deploy the same containers across a variety of hosts, or the ability to deploy to the same host a variety of containers with complementary usage patterns.

Rabinovitch: Adrian hits the nail on the head here. You need to know that the systems responsible for detecting your downtime are going to be online and available when you need them most.

However, while this is an accurate requirement, it is challenging, which is why we believe a SaaS solution is the best approach. On average we find a given Linux host generates about 100 key OS level metrics, with an additional 50 from your application. With higher density of containers running on each host, these numbers quickly increase in volume.

This leaves the question, how do you store the billions of metrics a day from your applications and infrastructure? In many cases the answer is likely to be as complex, if not more so, than the infrastructure behind your core applications. Modern monitoring platforms are effectively a ‘big data’ problem, which means in turn managing the complex distributed data stores and large compute infrastructures that power them.
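A rough back-of-the-envelope calculation shows how quickly the figures quoted above reach "billions a day". The 100 OS-level and 50 application metrics per host come from the answer above; the 15-second reporting interval and the fleet size of 5,000 hosts are assumptions chosen purely for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def datapoints_per_day(hosts, os_metrics=100, app_metrics=50, interval_seconds=15):
    """Estimate how many metric data points a fleet emits per day.

    os_metrics and app_metrics default to the per-host figures quoted in
    the interview; interval_seconds is an assumed reporting interval.
    """
    metrics_per_host = os_metrics + app_metrics
    reports_per_day = SECONDS_PER_DAY // interval_seconds
    return hosts * metrics_per_host * reports_per_day


per_host_daily = datapoints_per_day(hosts=1)   # 864,000 data points
fleet_daily = datapoints_per_day(hosts=5000)   # 4,320,000,000 data points
```

Under these assumptions a single host produces 864,000 data points a day, and a 5,000-host fleet produces about 4.3 billion, before accounting for any additional per-container metrics.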

Reitbauer: Well, this is nothing new. The key goal of a monitoring system is to inform you when your applications are down, which requires them to be basically up all the time. Accordingly, we have built an entirely new architecture to ensure maximum uptime. All components are built to work with real-time failover and we built an automated management layer that detects faulty components and replaces them automatically. For real-time failover and the ability to cope with peak loads in real time we also have reserved one third of excess capacity in our clusters.

Obviously this is totally different from traditional single-server based monitoring tools.

InfoQ: How can monitoring systems provide insight to operators, rather than simple metrics/numbers?

Crane: It’s all about the insights, right? I think the key here is providing the proper context around any metrics and numbers surfaced. For example, one of the big trends in the container ecosystem right now is the adoption of scheduling, orchestration, and management tools - tools like Kubernetes, Mesos, and Docker’s own Swarm. These tools provide an extra layer of abstraction above your underlying containers, and they really enable the move towards microservices run on containers.

Orchestration systems also tend to add a fair bit of complexity in terms of monitoring and visibility. Instead of a neatly organized cluster of VMs or containers, you end up with a highly scaled, distributed, and scattered collection of containers intermingled across a shared pool of resources. If your monitoring solution starts spitting out metrics from this environment - even if they are container-aware metrics - it’s not going to be very helpful. You need your monitoring solution to understand the actual semantic context imposed on the containers by the orchestration system. In other words, you want to see performance at the application or microservice level, even if the containers serving that microservice have been arbitrarily scheduled across a variety of underlying nodes. That’s insight.

McGuire: Essentially this is the difference between data and information. Raw metrics are not enough, they need to be provided in context to aid interpretation. This is why we focus on actionable data with a UI designed around the information we hear our customers need. That focus on design and usability helps reduce the noise so that operators can quickly orient themselves and make important timely decisions.

Rabinovitch: Monitoring systems generally rely on the operator to define ‘normal’. With the rate of change in today’s dynamic environments being driven by auto-scaling and scheduled infrastructures, defining normality becomes a challenge. So far the monitoring community has done a great job of focusing on automating metrics collection and alerting on those predefined thresholds. We now need to focus on algorithmically detecting faults or anomalies and alerting on them.
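The contrast between a predefined threshold and algorithmic detection can be sketched with a deliberately simple technique: a trailing-window z-score. This is only one of many possible approaches and is not a description of how Datadog or any other vendor detects anomalies; the window size, threshold and latency values are assumptions made for the example.

```python
from statistics import mean, stdev


def find_anomalies(values, window=10, threshold=3.0):
    """Return indices whose value deviates from the trailing window's mean
    by more than `threshold` sample standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history: any change at all is unusual.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Illustrative latency series (milliseconds) with one obvious spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 99, 250, 100, 102]
spikes = find_anomalies(latencies_ms)
```

Here "normal" is learned from the recent history rather than configured by an operator. One limitation of such a naive version is that the spike itself inflates the standard deviation of the following windows, so a production system would typically need something more robust.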

Reitbauer: Let me rephrase this question to “How can monitoring tools provide insight to everybody in the company who needs this information”. DevOps and microservices have led to end-to-end teams that cover the whole lifecycle from development to monitoring. So everybody must be able to understand monitoring data. This is why we invested a lot of time into building self explanatory infographics everybody can understand.

Another key requirement is anomaly detection. Due to the massive scale nobody can look at all these numbers manually. So monitoring systems have to learn normal behaviour and indicate when system behaviour is not normal any more.

The last aspect is contextual semantic information. An example is that a monitoring system needs to “understand” what a metric means and how it is related to other metrics. We need to learn all the dependencies in an application landscape, and this information can then be used in problem analysis.
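One way to picture how dependency information can feed problem analysis is the following minimal sketch. The service names and the `dependencies` map are hypothetical, and the heuristic (an unhealthy service is a likely root cause if none of its direct dependencies are unhealthy) is a simplification chosen for illustration, not a description of any vendor's algorithm.

```python
def root_causes(dependencies, unhealthy):
    """Return unhealthy services none of whose direct dependencies are unhealthy.

    dependencies maps each service to the services it calls.
    unhealthy is the set of services currently showing problems.
    """
    return sorted(
        service
        for service in unhealthy
        if not any(dep in unhealthy for dep in dependencies.get(service, ()))
    )


# Hypothetical application landscape: who calls whom.
dependencies = {
    "frontend": ["checkout", "catalog"],
    "checkout": ["payments", "redis"],
    "catalog": ["redis"],
    "payments": [],
    "redis": [],
}
unhealthy = {"frontend", "checkout", "catalog", "redis"}
suspects = root_causes(dependencies, unhealthy)
```

With four services alerting at once, the dependency map narrows attention to the single service, `redis`, on which the other three depend directly or indirectly.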

InfoQ: What will the future of container-based systems look like (IaaS vs PaaS & bare metal vs VMs etc), and how will this impact monitoring?

Crane: The method of deploying containers on top of VMs, whether in public/private cloud or data center, seems to be the most popular so far. However, whatever methods the market adopts at scale, I think it’s safe to say that there will be plenty of use cases where any given method is ideal. I wouldn’t be surprised by a result similar to the hybrid clouds we’ve seen evolve in the enterprise over the past few years - with combinations of public cloud, private cloud, OpenStack, virtualized datacenters, bare metal, etc. The implication for monitoring though is that your visibility solution has got to be technology stack agnostic. No one wants to have to switch between different panes of glass for different environments, and no one wants to have to switch vendors in order to experiment with (much less migrate to) new environments.

McGuire: I think where containers and PaaS have commonality is their potential in liberating the developer from the mechanics of infrastructure management so you can focus on the thing that adds value - the application. That's what you're in the business of.

Managing infrastructure is a cost and the best you can do is cost optimization through increased efficiency and higher reliability. IaaS reduces some of that cost because you don't need to maintain the physical infrastructure, but you still need to manage the service to ensure you're using it optimally and are responding well to demand changes. This is where monitoring steps in, to help provide deeper visibility into service usage as it relates to application performance so that you can accurately turn those efficiency, reliability, and availability dials. In the cloud, lack of efficiency increases your cloud bill, so cost is the ultimate performance metric.

Ideally, monitoring systems should provide visibility expressed at the same conceptual level as the problem you're solving. It's not even clear what an "application" is when you move to microservices. This means you need elevated views where you get above the level of the individual container, and can look at their behavior as a whole.

Above IaaS and above PaaS, what we're seeing is the emergence of "Computation as a Service" with Docker containers dabbling in it via microservices and AWS Lambda at the leading edge conceptually. Ultimately then the challenge is to understand what it means to monitor computation.

Rabinovitch: The frequency of change will continue to increase, both in terms of the changes coming from our schedulers and auto-scaling tooling and from the velocity of deployments that these technologies enable. This will continue to drive monitoring systems to better integrate with platform and scheduling solutions to be able to more dynamically respond to those changes as they occur. Similarly I think we will start to see common patterns develop on how to have our monitoring systems send feedback directly into these IaaS, PaaS and scheduling solutions so that we can build more automated responses and tighter feedback loops. This will in turn continue to force more real time constraints on monitoring and alerting systems.

Reitbauer: Infrastructure has become ephemeral. With containers we care less and less what the underlying infrastructure looks like. In many cases the same applications might be partially running on IaaS and bare metal for example. The key impact on monitoring is that the focus is moving up to the application. Failing nodes get replaced automatically so infrastructure failures play less and less of a role. The key focus of monitoring tools therefore becomes the actual services and their quality. I care about response times and failure rates. Frankly speaking, if your monitoring is primarily focussed on infrastructure, you won’t get the information you need to run container-based services effectively.
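The two service-level measures named above, response times and failure rates, can be computed from request records without reference to any host or container. The sketch below assumes a hypothetical record format with `duration_ms` and `status` fields, treats HTTP 5xx responses as failures, and uses a nearest-rank percentile; all of these are illustrative choices.

```python
import math


def failure_rate(requests):
    """Fraction of requests that ended in a server error (status >= 500)."""
    if not requests:
        return 0.0
    failed = sum(1 for request in requests if request["status"] >= 500)
    return failed / len(requests)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("percentile() requires at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Illustrative request log for one service, whichever containers served it.
durations = [20, 25, 30, 35, 40, 45, 50, 60, 80, 400]
requests = [{"duration_ms": d, "status": 200} for d in durations]
requests[-1]["status"] = 503  # the slowest request also failed

error_rate = failure_rate(requests)
median_ms = percentile([r["duration_ms"] for r in requests], 50)
p95_ms = percentile([r["duration_ms"] for r in requests], 95)
```

For this sample the failure rate is 10%, the median response time is 40 ms and the 95th percentile is 400 ms, which describes the service's quality regardless of where its containers happen to be running.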

About the Interviewees

Chris Crane is VP Product at Sysdig where he spends his time building monitoring and visibility tools for the new container and microservices ecosystem. He is passionate about creating powerful technology that solves real world problems. Before Sysdig he worked on product, marketing, and BD at other startups including Aardvark and Compass, as well as spending time at Bain&Co and Bain Capital Ventures. A long time ago, in a galaxy far far away, he was actually a web developer. Chris majored in electrical engineering at Yale University.

Kevin McGuire is Principal Product Manager, Operations Analytics for New Relic where he has responsibility for the company's strategy for operations and infrastructure monitoring. Previously as the Director of Engineering and the Product Manager for the New Relic Infrastructure team Kevin was responsible for the Server and Plugins products, which recently delivered Docker monitoring and led the charge for re-imagining AWS monitoring.

Prior to New Relic, he was an architect and a product manager at Microsoft. He also held various technical and management roles in IBM, in particular as one of the first leads on the Eclipse Project.

Ilan Rabinovitch is Director of Technical Community at Datadog. Prior to joining Datadog, he spent a number of years leading infrastructure and reliability engineering teams at organizations such as Ooyala and Edmunds.com. In addition to his work at Datadog, he is active in the open-source and DevOps communities, where he is a co-organizer of events such as SCALE, Texas Linux Fest, DevOpsDay LA and DevOpsDays Silicon Valley.

Alois Reitbauer is Chief Technical Strategist and Head of Innovation Lab at Dynatrace. He has spent most of his career building monitoring tools and fine tuning application performance. A regular conference speaker, blogger, author, and sushi maniac. Alois currently shares his professional time between Linz, Boston, and San Francisco.
