Q&A on Container Scaling with Fargate


Vlad Ionescu, an AWS Container Hero, reported in early April on his experiments with scaling Amazon Fargate for batch processing and background jobs.

Amazon Web Services released Fargate in 2017. It eliminates the need for customers to patch, scale, or secure a cluster of EC2 instances in order to run containerized applications in the cloud.

Fargate automates the provisioning of compute for either Amazon Elastic Container Service (ECS) or Elastic Kubernetes Service (EKS). Customers can define and pay for resources at the task or pod level.
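To make the task-level resource model concrete, here is a minimal sketch of what a Fargate task definition looks like. The family name, image, and sizes are hypothetical placeholders, not values from the article:

```python
# Sketch of a Fargate task definition: CPU and memory are declared at the
# task level, which is also the unit Fargate bills on. All names and sizes
# below are illustrative placeholders.
TASK_DEFINITION = {
    "family": "batch-worker",                  # hypothetical family name
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",                   # required for Fargate tasks
    "cpu": "256",      # 0.25 vCPU for the whole task
    "memory": "512",   # 512 MiB for the whole task
    "containerDefinitions": [
        {"name": "worker", "image": "example/worker:latest", "essential": True}
    ],
}
# Passing this dict to boto3's ecs.register_task_definition(**TASK_DEFINITION)
# would register it; billing then follows the task-level cpu/memory values.
```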

Ionescu explains that the container space is rapidly changing, leaving organizations uncertain about which AWS service to use for container scaling. Meanwhile, identifying a custom metric to scale on is becoming a preferred option.

His experiments revealed that EKS scales faster. He mentions that scaling pipelines can be configured in many ways without worrying about AMI upgrades or host OS updates, as AWS manages all of that. Regarding Availability Zones, he suggests launching tasks in a single zone to keep cross-AZ networking costs to a minimum.

Beyond eliminating Kubernetes and EKS maintenance, AWS also offers Fargate Spot and Compute Savings Plans, making it easier to choose Fargate for the respective needs.

InfoQ: What typical challenges does an organization face when choosing the right AWS service for container scaling?

Vlad Ionescu: Most challenges I see are information-related. The container space is vast and ever-changing, with scale concerns adding even more complexity.

Uncertainty about what service to use is at the top of the list. Somebody read that ECS is obsolete now that AWS has EKS, and so it cannot be used — nothing could be further from the truth, as ECS is an excellent service! Solutions tightly integrated with AWS are a great fit for ECS. On the opposite side, somebody else read that Kubernetes is not very stable or has a specific issue. With the landscape moving so fast, I have had people quote me bugs that had not been valid since the Kubernetes 1.7 days. It is not easy to keep up with it all.

The second most common request I see is the desire to lift and shift, or to move to containers without any work. One example is a particular service that is running soundly on EC2 but definitely needs moving into containers for better cost / scaling / velocity. Getting a legacy Java 8 application to run in a container is... often not worth it.

The last big pattern I see is exotic workloads that are not a great fit for containers. Disk-intensive workloads, for example — EC2 instances backed by NVMe SSDs are blazing fast, but not that easy to manage from EKS. Windows-only workloads are another example. I see a surprising amount of this, and the solutions are not yet there: Fargate does not support Windows at all, leaving only ECS on EC2, with EKS on EC2 also becoming an option last fall.

To address scaling in particular, I see much confusion in choosing the right metric to scale on. Scaling based on CPU or RAM is often not the best fit, and the use of custom metrics is on the rise.

InfoQ: Can you suggest the best configuration for AWS Fargate using ECS from an Availability Zone perspective?

Ionescu: From a reliability and scaling perspective, all Availability Zones should be used, as per the AWS recommendation. Using all the separate Availability Zones ensures the lowest impact if, say, an Availability Zone is having technical issues or, more likely, lacks capacity.

Of course, this depends on the specific use-case. Unlike ECS on EC2, Fargate does not support Task Placement — all tasks are spread across all available Availability Zones, as determined by the chosen subnets. To avoid sizable cross-AZ networking cost surprises, it may make sense to launch tasks in a single Availability Zone. A service that is highly tied to a single-AZ resource would be the perfect example against the general rule.
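Since Fargate spreads tasks across whichever subnets you supply, restricting tasks to one Availability Zone comes down to passing only that zone's subnets. Below is a minimal sketch of that filtering step; the subnet IDs and AZ mapping are hypothetical placeholders:

```python
# Sketch: keep Fargate tasks in a single Availability Zone by passing only
# that AZ's subnets. Subnet IDs and the AZ mapping are hypothetical.

def subnets_in_az(subnet_az_map, target_az):
    """Return the subnet IDs that live in the given Availability Zone."""
    return [subnet for subnet, az in subnet_az_map.items() if az == target_az]

# Simplified example of the data ec2.describe_subnets would return.
SUBNETS = {
    "subnet-aaa": "us-east-1a",
    "subnet-bbb": "us-east-1b",
    "subnet-ccc": "us-east-1a",
}

single_az_subnets = subnets_in_az(SUBNETS, "us-east-1a")
# Supplying only these subnets in the awsvpcConfiguration of run_task or
# create_service keeps every task in us-east-1a, avoiding cross-AZ traffic.
```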

InfoQ: Compared with standalone EKS, which aspects are impacted from a configuration-simplicity standpoint?

Ionescu: EKS and Kubernetes itself are two compelling tools, with a lot of knobs and dials. While powerful, the complexity impact may take a while to manifest itself. The "day 2" operations and issues might not be visible in a proof of concept.

Standalone EKS uses EC2 workers. Those workers require maintenance, from AMI upgrades to ensuring an up-to-date security configuration. Upgrades of the underlying Kubernetes version, which lacks LTS support right now, might also become problematic for companies that do not have high velocity. With Fargate, be it on ECS or EKS, AWS handles most of those problems. AMI updates and worrying about the host OS are gone, which allows engineers to focus on other tasks.

EKS also implements its own abstractions. For scaling on, say, the number of messages in an SQS queue, two tools are needed: the Metrics Server and the CloudWatch metrics adapter. Both of those tools have to be configured, upgraded, and maintained.

Fargate allows the use of scaling policies where scaling on SQS queues is offered as a native service.
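Whichever service runs the tasks, a queue-based policy ultimately boils down to computing a desired task count from the backlog. The following is a minimal sketch of that arithmetic; the per-task throughput, latency target, and bounds are illustrative assumptions, not values from the interview:

```python
import math

def desired_task_count(queue_depth, msgs_per_task_per_minute,
                       target_latency_minutes, min_tasks=1, max_tasks=50):
    """Estimate how many tasks are needed to drain an SQS backlog within
    the target latency. All capacity numbers are illustrative assumptions;
    in practice the result would feed a scaling policy rather than be
    applied directly."""
    if queue_depth <= 0:
        return min_tasks
    # How many messages one task can work through before the deadline.
    capacity_per_task = msgs_per_task_per_minute * target_latency_minutes
    needed = math.ceil(queue_depth / capacity_per_task)
    # Clamp to the configured scaling bounds.
    return max(min_tasks, min(needed, max_tasks))

# Example: 1,200 queued messages, 60 msgs/task/minute, 5-minute target.
tasks = desired_task_count(1200, 60, 5)  # each task handles 300 msgs -> 4 tasks
```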

Want to scale on custom metrics in EKS? That additionally requires Prometheus and Prometheus Adapter, both of which have to be configured extensively to work at scale.

There is no particular aspect that is obviously "hard". It is death by a thousand cuts. A lot of small tools that have to be configured, and more importantly, understood. Having to familiarize yourself with cluster-autoscaler, HPAs, Metrics Server, and a lot of other projects that do one thing well takes time and creates many places where something could go wrong or where a high-scale system could have a bottleneck.

That said, as can be seen from the results in my blog post, EKS is faster at scaling. The scaling pipeline can also be configured extensively for an excellent fit and performance. It's a game of tradeoffs.

InfoQ: What impact do you see on the total cost of ownership when an organization chooses to use Fargate?

Ionescu: At launch, Fargate was a very expensive service. Any cost analysis done at that point was not at all in its favor. Fortunately, new launches by AWS made those calculations obsolete:

  • In early November 2019, Savings Plans for Compute was introduced, offering up to 66% cost savings in exchange for committed usage.
  • Furthermore, in early December 2019, AWS announced Fargate Spot, with up to 70% cost savings.
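To see how much those discounts reshape a cost comparison, here is the arithmetic on a hypothetical monthly on-demand Fargate bill; the $1,000 baseline is an illustrative assumption, and the percentages are the upper bounds quoted above:

```python
def discounted_cost(on_demand_cost, discount_pct):
    """Apply a percentage discount to an on-demand cost."""
    return on_demand_cost * (1 - discount_pct / 100)

# Hypothetical monthly on-demand Fargate bill of $1,000.
base = 1000.0
savings_plan = discounted_cost(base, 66)  # committed usage: about $340/month
spot = discounted_cost(base, 70)          # Fargate Spot: about $300/month
```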

As of May 2020, the differences in pure dollar costs are not large enough to force a decision. The discussion moves from infrastructure and operational cost to the total cost of ownership.

For brevity, let's skip the high-cost consultants and employees that have enough Kubernetes experience to avoid delays and costly mistakes. They are a significant part, with cumbersome recruiting and retention, but that's a whole different subject. What should be mentioned is that with a lower operational burden, DevOps is also easier to implement. Some engineers are rightfully concerned and scared at the sheer scale of knowledge and dark magic that is hidden behind an acronym: "k8s". Upskilling engineers into DevOps practices is easier when the area is smaller.

The most considerable impact I see is with regard to velocity. The team can focus on other business-impactful projects rather than EKS and Kubernetes maintenance: the undifferentiated heavy lifting is eliminated. It's the same reason people move from physical data centers to the cloud, or from EC2 to serverless: offloading that effort to AWS is a very good proposition. Paying just for what you use and easily scaling to zero are the cherry on top!

It is easy to underestimate the level of effort an internal infrastructure team involves, writing it off as a couple of engineers. It is not just that! Using Kubernetes includes staying up to date with all the new launches. It includes upskilling and training both the infrastructure engineers and the application developers. It includes a more extended onboarding and deployment process.

With an easier base to build on, the teams can focus on better and brighter things: faster deployments and lowering the time to market, observability for better understanding the system and supporting the business, or even moving the roadmap up by a couple of quarters. That's a pretty easy sell!
