Key Takeaways
- The popularity of Kubernetes continues to explode, but with this growth come challenges and gaps, ranging from the cultural shift it requires to technology trends and advancements.
- The software development lifecycle (SDLC) for Kubernetes and microservices-based applications is still evolving, and this is where significant evolution is likely over the next few years.
- The adoption of DevOps practices on Kubernetes platforms is relatively more mature than the associated SDLC. However, with emerging patterns like GitOps, this is also an anticipated area of growth.
- As the next wave of microservices and more stateful applications is deployed on Kubernetes-based platforms, operations teams need more visibility, as well as tools for self-defense and self-healing against malicious applications (intentional or inadvertent).
- For users running Kubernetes in production, homegrown tools, based for example on the Kubernetes Operator pattern, have emerged and will continue to evolve to address these tooling gaps.
The lead headline in the recent Cloud Native Computing Foundation (CNCF) survey states that the "use of containers in production has increased by 300% since 2016." With this hyper-growth come challenges that users are grappling with, not only individually but also as a community.
InfoQ recently caught up with several Kubernetes experts to discuss the top trends and most important challenges that users of the platform are facing.
The panelists:
- Katie Gamanji - Ecosystem Advocate at CNCF
- Brian Gracely - Sr. Director of Product Strategy for OpenShift at Red Hat
- William Jimenez - Technical Product Manager at Rancher/SUSE
- Qi Ke - Engineering Director at Azure, Microsoft
- Suhail Patel - Staff Engineer at Monzo Bank
- Sunil Shah - Engineering Manager at Airbnb
InfoQ: In two or three sentences, can you please talk about your first encounter with Kubernetes, your first reactions, and how you’ve been involved with it since?
Katie Gamanji: The first exploration of Kubernetes functionalities was an attempt to introduce better resource management and self-healing capabilities to existing applications. At the time, I had to install Kubernetes "the hard way," including the manual configuration of systemd units for the core components. Since then, I have configured, managed, and maintained tens of clusters hosted on multiple infrastructure providers and integrated with a varied set of cloud-native technologies.
Brian Gracely: Docker Swarm and Mesos/Marathon already existed, and then Google announced this new project with the strange name. A few months later, KubeCon was announced (pre-CNCF) and I thought, "do we seriously need an entire event for a container scheduler?" Only after I heard about a large Financial Services company using it in the v1.0 days did I think it might be real. I’ve led Product Strategy for Red Hat’s Kubernetes platform, OpenShift, since 2016. We work with thousands of companies deploying Kubernetes across private and public clouds.
William Jimenez: My first encounter with Kubernetes was when I was at a digital education company (Chegg) leading a DevOps modernization project. We were coming from a world of complex dependency management that was increasing in weight and fragility. Containers were immediately showing results, and we were investing heavily in AWS ECS for our orchestration, so I was quite preoccupied when Kubernetes came on the scene. The basics of security, CI/CD, and developer tooling to just make containers usable in a fast-moving business were large enough challenges at the time that the concerns Kubernetes sought to address seemed too lofty and out of touch with our reality. So in hindsight, I didn’t really pay attention until much later, when I started working at Rancher Labs and realized that Kubernetes could be made approachable in such a way that companies outside the engineering weight-class of Google could see real value from it.
Qi Ke: I was working at Google when Kubernetes and GKE were announced. My first encounter with GKE was to evaluate tracing for applications running on GKE. There was no service mesh at all at that time, hence it was impossible to carry trace context across calls without modifying application code. And I didn’t know I would be working on Kubernetes every day a few years later.
Suhail Patel: Kubernetes was introduced to Monzo quite early on in the v1.0 days. We might be the financial services company that Brian mentions! My initial exposure came from joining Monzo and seeing this wonderful orchestrator fit so well within a microservices architecture. We use Kubernetes to run over 1500 services encompassing all aspects of running a bank.
As an end-user, we’re constantly working with the cloud-native community on improving projects and providing public posts and feedback on running systems like Kubernetes, Prometheus, Envoy, and more at scale in production.
Sunil Shah: This was way back in 2014, a week or two after I had started as an engineer at Mesosphere (the open-core startup aligned with the Apache Mesos project). My first reaction was a mixture of fear and excitement -- it’s always a little scary when an industry giant like Google announces a free and well-funded competitor to your company’s core product! On the other hand, the fact they were getting into the space was a good sign that we were onto something!
Since then, Kubernetes quite obviously won out over Mesos, and it’s been astonishing to see the community grow so rapidly. At Yelp, our team began exploring replacing Mesos with Kubernetes for new workloads. And now at Airbnb, I support a team that manages dozens of Kubernetes clusters to run almost all of our online workloads, serving traffic from around the world.
InfoQ: What are the three top Kubernetes trends that developers/architects should pay attention to?
Gamanji: Over the past few years, Kubernetes adoption has been in constant growth, reaching 83% use in production based on the 2020 CNCF survey. With a stable suite of core features, the community’s focus is shifting towards enhancing the developer experience and the lightweight execution of workloads. As such, technologies to keep exploring are GitOps, cloud-native portals and IDEs, and Kubernetes as an edge platform.
GitOps patterns have been on the horizon for a while. Within the cloud-native ecosystem, tools such as ArgoCD and Flux enable the declarative representation of the application state using git repositories. These tools redefine the automation levels of a CI/CD pipeline.
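To make the GitOps pattern concrete, here is a minimal sketch of an Argo CD Application manifest that points a cluster at a Git repository and keeps the two in sync; the repository URL, path, and namespaces are hypothetical placeholders.

```yaml
# Hypothetical Argo CD Application: the controller continuously reconciles the
# cluster against the manifests stored in the referenced Git repository.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: demo-app                 # placeholder application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/demo-app-config   # placeholder repo
    targetRevision: main
    path: deploy/overlays/production                           # placeholder path
  destination:
    server: https://kubernetes.default.svc
    namespace: demo-app
  syncPolicy:
    automated:
      prune: true      # remove resources that were deleted from Git
      selfHeal: true   # revert manual drift back to the Git-declared state
```

With automated sync enabled, a merged pull request effectively becomes the deployment event, which is what makes the pipeline declarative rather than imperative.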
Developer experience should always be at the forefront of any application release process, including code development, deployment, observability, and troubleshooting stages. Within the cloud-native space, a plethora of tools have focused on DX improvement, including GitPod (development), Octant (visualization and troubleshooting), and operator boilerplates (release).
For many organizations, customer proximity and reduced latency are core metrics for evaluating the success of an application. Therefore, organizations might choose to operate clusters physically closer to their users to improve the quality of service and experience. These capabilities are covered by edge-focused projects, such as k3s, KubeEdge, OpenYurt, and many more.
Gracely: Let’s start with the tools that let developers not have to care about containers or Kubernetes and just stay focused on writing code (Buildpacks, s2i, etc.). Then there are a great set of tools that make it simple to get started (VSCode plugins, Red Hat Code-Ready Containers, minikube, Katacoda). And Knative is making it simpler for developers to not have to care about Ops.
Five plus years into the lifecycle of Kubernetes, we have to remember that most developers don’t think about pods or CRDs, so tools need to meet them where they are today in their learning curve and business objectives.
Tools such as buildpacks and s2i allow developers to just write code, automatically turning that code and its dependencies into a container.
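As a sketch of the source-to-image (s2i) flow on OpenShift, the build configuration below turns a Git repository into a container image without the developer writing a Dockerfile; the repository, builder image, and output names are illustrative assumptions.

```yaml
# Illustrative OpenShift BuildConfig using the source-to-image (s2i) strategy:
# the builder image assembles the source and produces an application image.
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: demo-service              # placeholder name
spec:
  source:
    git:
      uri: https://github.com/example-org/demo-service   # placeholder repository
  strategy:
    type: Source
    sourceStrategy:
      from:
        kind: ImageStreamTag
        name: python:3.9          # assumed builder image available in the cluster
        namespace: openshift
  output:
    to:
      kind: ImageStreamTag
      name: demo-service:latest   # resulting application image
```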
IDE plugins and web-based IDEs (VSCode, Code-Ready Containers, etc.) allow them to not only get started quickly but give them context about how Kubernetes can augment their deployments.
And new functionality like Knative allows them to take those containers and apply them to new, event-driven applications.
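As an illustration of that last point, a minimal Knative Service sketch is shown below; the service name and image are hypothetical, and Knative handles revisioning, routing, and scale-to-zero behind this single resource.

```yaml
# Hypothetical Knative Service: Knative manages revisions, routing, and
# autoscaling (including scale-to-zero) for the container image below.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: event-processor            # placeholder name
spec:
  template:
    spec:
      containers:
        - image: registry.example.com/event-processor:1.0.0   # placeholder image
          env:
            - name: LOG_LEVEL
              value: "info"
```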
Jimenez: First, developer advocacy. Most of the engineering world is actually developers. Although Kubernetes often starts as an IT modernization story, that’s just the tip of the iceberg. It’s when developers start to perceive the value of it that adoption really takes off. We need more tools and language to invite developers into the story.
Second, cloud-native storage. Kubernetes started off as a solution for stateless applications because state is hard in a distributed system. As Kubernetes’ usage matures, users will increasingly feel comfortable with this next order of challenge, and the ROI will be too appealing to pass up.
Lastly, we’re going to see Kubernetes everywhere. Kubernetes will become a commodity that we simply expect to exist in a computing environment (datacenter, edge, embedded, and IoT). So hopefully as a technology community, we free up a lot of resources that have been traditionally spent adapting software to bespoke environments. And we’ll also need tools to interface with this scale and scope of clusters that treat all Kubernetes environments equally.
Ke:
- 1. Multi-cluster management, cross-cluster networking, and storage
Kubernetes customers usually manage more than one cluster for their production environment in the cloud. There are many reasons for that: some for a smaller blast radius, some for regionality and locality, some for security boundaries, and some to overcome scale limits. And when the number of clusters grows, cluster-lifecycle management, configuration management, app management, and policy enforcement across all clusters become labor-intensive and error-prone.
- 2. Self-defense and self-healing from malicious (maybe unintentional) applications
Not all applications are written in a way that is friendly to Kubernetes infrastructure. As a managed Kubernetes service, we’ve seen many cases where a cluster is beaten into an unhealthy state by an application’s unintentional behavior. Sometimes it is due to queries or watches on the API server from a DaemonSet on a large cluster. Sometimes it is due to aggressive logging that triggers IO throttling on the OS disk and freezes Docker. The challenge is how to configure your Kubernetes cluster, lay out your infrastructure, and set up monitoring to defend against intentional or unintentional attacks from applications running on the cluster, and how to recover quickly.
- 3. DevOps tools
Needless to say, the Kubernetes developer experience needs more love. The manifest file provides flexibility but brings complexity and maintenance pain. VSCode plugins that not only help developers jumpstart but also detect incorrect indentation or typos help boost productivity. In addition, server-side dry-run and kubectl diff would be great features to add to the plugins.
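As a small illustration of those workflow aids, the manifest below is a hypothetical Deployment one might validate before applying; the comments show the kubectl diff and server-side dry-run invocations Ke refers to, while the names and image are placeholders.

```yaml
# Hypothetical Deployment manifest. Before applying it, one might run:
#   kubectl diff -f demo-deployment.yaml                     # show what would change
#   kubectl apply -f demo-deployment.yaml --dry-run=server   # server-side validation only
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-api                  # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo-api
  template:
    metadata:
      labels:
        app: demo-api
    spec:
      containers:
        - name: demo-api
          image: registry.example.com/demo-api:1.2.3   # placeholder image
          ports:
            - containerPort: 8080
```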
Patel: Running Kubernetes reliably requires a large time investment. The ubiquity of managed offerings at many price points and across a variety of providers has made Kubernetes accessible to engineers across the globe. It isn’t reserved for large cloud providers or companies with large platform teams. If you need to host it yourself, there are lots of pre-configured and packaged distributions that are ready to go in a few commands that come with years of experience baked in.
I’m particularly excited about controllers and operators. Controllers allow you to bring your own custom control loops to Kubernetes, and Operators allow you to bring in external systems and tightly integrate them into the Kubernetes lifecycle. Kubernetes can become an SRE for all the other systems that you run.
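To illustrate how an operator extends the Kubernetes API, here is a minimal sketch of a CustomResourceDefinition for a hypothetical Backup resource that a custom controller could reconcile; the API group, fields, and semantics are illustrative assumptions, not an existing project.

```yaml
# Hypothetical CRD: once registered, users can create Backup objects and a
# custom controller (operator) watches them and performs the actual work.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: backups.demo.example.com          # must be <plural>.<group>
spec:
  group: demo.example.com                 # placeholder API group
  scope: Namespaced
  names:
    kind: Backup
    plural: backups
    singular: backup
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                database: { type: string }   # which database to back up
                schedule: { type: string }   # cron-style schedule, e.g. "0 2 * * *"
```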
Kubernetes is here to stay and is a mature base platform to depend on. It is an enabler for bringing in unified practices around deployments, tooling, monitoring, observability, and much more across an entire organization. Hopefully, this also means more companies and engineers getting involved directly in the CNCF ecosystem and contributing their experience back.
Shah: I’m personally excited by three areas of improvement: pod resource optimization, orchestration of stateful services, and operations automation.
First, resource requests (CPU, memory, etc.) are something that most users aren’t used to thinking about. If you’re running Kubernetes well in a multi-tenant environment, it is almost mandatory to set reasonable requests to protect running services under contention. Tooling to improve how users manage and think about resources is something that we’re excited about. Ideally, they don’t need to think about resources at all, and our systems just take care of it! Some startups are making moves in this direction, but there is room for improvement beyond just making resource consumption observable by users.
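For readers new to resource management, the snippet below shows the kind of requests and limits Shah is describing, on a hypothetical container; the values are illustrative and in practice depend on profiling the workload.

```yaml
# Hypothetical container spec: requests are what the scheduler reserves and what
# protects the pod under contention; limits cap what the container may consume.
apiVersion: v1
kind: Pod
metadata:
  name: demo-worker                 # placeholder name
spec:
  containers:
    - name: worker
      image: registry.example.com/worker:2.0.1   # placeholder image
      resources:
        requests:
          cpu: "250m"        # a quarter of a core reserved for scheduling decisions
          memory: "256Mi"
        limits:
          cpu: "1"           # CPU usage above one core is throttled
          memory: "512Mi"    # exceeding this gets the container OOM-killed
```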
Second, the integration of stateful services (think Spark, Flink, etc.) is much better than it was a year or even two years ago but we’re still seeing authors of distributed systems catch up to using Kubernetes primitives like operators as a first-class way to orchestrate their system. Spark now has a mature Kubernetes scheduler, which I’m excited to start experimenting with because it’ll bring some helpful operational efficiency to that infrastructure.
Third, automation of runbooks (or playbooks) through systems like Stackpulse is still in its infancy but has a lot of potential. Every large organization struggles with regularly validating operational protocols and moving these into code allows us to automate testing as APIs change.
InfoQ: What are the three biggest challenges to deploying Kubernetes in production?
Gamanji: The biggest challenge faced by cloud platform/infrastructure teams is keeping Kubernetes up to date with the latest release. Many organizations choose to delegate cluster maintenance to a cloud provider, using a managed service such as EKS, GKE, or AKS. However, in cases where this isn’t a viable solution, teams allocate a sizeable proportion of their resources to performing cluster upgrades. In response, the Kubernetes community has reduced the number of releases per year from four to three, ensuring End User organizations have sufficient time to digest and integrate the latest features.
Gracely: Many companies treat Kubernetes as the new version of their existing processes, most of which aren’t designed for frequently changing software. Security must "shift left" and be part of software supply chains, as well as being integrated into the Kubernetes platform.
Jimenez: Getting to production is a lot easier than it was just a few years ago. Automation technologies such as kubespray, kubeadm, and RKE were brave forays into the error-prone and complicated territory of Kubernetes administration. I think we can safely say, however, that these were "easy" problems, and we’ve done a good job addressing them. The real challenge I wonder about is how we deliver on the promise of Kubernetes, which is about more than just getting it to work.
To be transformative like it was at Google, Kubernetes needs to gain significant adoption within an organization or it risks becoming just another "platform" that exists against a backdrop of many others. The power of Kubernetes is the velocity at which software can be developed and then scaled to vast amounts of users. Developers, security researchers, and SREs need to be empowered to embrace all the tooling and features of Kubernetes for this vision to be realized. Too often I find Kubernetes is siloed or confined to the point where it is only offering limited value even though an organization has spent so much time and energy implementing it.
Ke: Visibility, visibility, visibility. As a provider of managed Kubernetes, we found that most customer escalations are due to a lack of visibility into what’s going on in the cluster. Is the connection reset by the SLB? Is the disk throttled? Did a scale operation fail due to lack of quota? Is the kubelet hung due to memory or CPU overuse? Visibility into performance, QoS, and logs on the cloud infrastructure and the Kubernetes cluster is the key to finding the root cause of an issue. Over time, we baked signals from these metrics and logs into problem detectors and developed auto-remediators for common issues to achieve self-healing. That’s our own way of managing clusters at scale.
Patel: Kubernetes doesn’t let you off the hook for needing to develop your own components to abstract away infrastructure concerns. Many application engineers don’t (or won’t) care about how Kubernetes works, so it’s important to do the integration work within a company to leverage all the benefits. Kubernetes levels up the unit of abstraction that your infrastructure team deals with and provides a rich set of orchestration capabilities that you don’t need to build yourself.
Running your own clusters requires time and energy. One common point of difficulty I see is building support processes and tooling that only handle one or two clusters (i.e., a dev cluster and a production cluster). Instead, treating clusters like cattle and having the ability to spin up new clusters seamlessly is a great upfront time investment that makes testing new features and processes easier.
Shah: In my opinion, the biggest challenges are related to the popularity of Kubernetes. Firstly, Kubernetes moves quickly! It often feels like we are running just to keep up with each Kubernetes upgrade. At an organization the size of Airbnb, we have a number of patches and custom integrations that need meticulous testing with each new version. Automation helps here but it’s still a full-time job in itself.
Secondly, Kubernetes’ reputation amongst the engineering community as the de facto way to manage cloud resources sometimes results in teams underestimating the effort required to run a production-ready cluster. This sometimes backfires, with teams ending up not investing time into automation or building integrations with other infrastructure systems necessary at a large enterprise, like cost attribution or observability. Even with vendor-managed solutions, this doesn’t come for free when you have a complex tagging taxonomy.
Finally, running Kubernetes in production within a large engineering organization necessitates a high support burden for our team, the "Kubernetes experts." We are often paged in to help with basic actions (e.g., cycling a service) that most engineers know how to do in a traditional Linux environment, but aren’t confident or familiar with on Kubernetes. Better developer tooling can help but this needs to be balanced with risk around potentially dangerous commands!
Summary
The panelists talk about their first encounters with Kubernetes and discuss the top trends and challenges that the platform is facing, especially in production. There are a variety of operational challenges, from keeping deployments up to date to needing more end-to-end visibility, especially on managed services. Even the software development lifecycle (SDLC) can stand to be improved on the platform, and it needs to "shift left" to accommodate security in a cloud-native world. The panelists discuss efforts that are underway to bridge some of these gaps.
About the Panelists
Katie Gamanji - Currently the Ecosystem Advocate for CNCF, Katie Gamanji works closely with the End User Community. Gamanji’s main goals are to develop and execute programs to expand the visibility and growth of the End User Community while bridging the gap with other ecosystem units, such as TOCs and SIGs. In past roles, Gamanji contributed to the build-out of platforms that gravitate towards cloud-native principles and open-source tooling, with Kubernetes as the focal point. These projects started with the maintenance and automation of application delivery on OpenStack-based infrastructure, which transitioned into the creation of a centralized, globally distributed platform at Condé Nast. You can find Gamanji on Twitter, LinkedIn or Medium.
Brian Gracely - Sr. Director of Product Strategy for OpenShift at Red Hat, Gracely works closely with large enterprises around the world to bring existing and new applications into production on Kubernetes. He has 20+ years of experience in Strategy, Product Management, Systems Engineering, Marketing, and M&A. He is co-host of The Cloudcast (cloud computing) and PodCTL (Kubernetes) podcasts. You can find Gracely on Twitter, LinkedIn and YouTube.
William Jimenez is a Technical Product Manager at SUSE. He enjoys solving problems with computers, software, and just about any complex system he can get his hands on. In his free time, he likes to tinker with amateur radio, cycle on the open road, and spend time with his family (so they don’t think he forgot about them). You can find Jimenez on LinkedIn.
Qi Ke is the Engineering Director at Azure leading the Managed Kubernetes Service. Prior to that, she worked at Google in various areas across cloud, APM, dev tools, enterprise, social, and search, building performant distributed systems and engineering systems. Before that, she designed, architected, and led the effort to build the Q Build system (a.k.a. CloudBuild) when she worked in Bing. You can find Ke on LinkedIn.
Suhail Patel is a Staff Engineer at Monzo focused on working on the core Platform. His role involves building and maintaining Monzo’s infrastructure which spans over 1500 services and leverages key infrastructure components like Kubernetes, Cassandra, Etcd, Envoy Proxy, and more. You can find Patel on Twitter and LinkedIn.
Sunil Shah is an Engineering Manager at Airbnb. His team builds and maintains the Kubernetes-based platform that powers Airbnb.com. Prior to Airbnb, Sunil managed computing for Yelp, helped commercialize Apache Mesos at Mesosphere, studied robotics at UC Berkeley, and built ingestion pipelines at music recommendations service Last.fm. When he’s not sending emails, Sunil spends his time swimming, biking and running (slowly), and playing Overcooked 2 with his wife. You can find Shah on LinkedIn.