AWS Data on EKS Provides Opinionated Data Workload Blueprints

AWS has released Data on EKS (DoEKS), an open-source project providing templates, guidance, and best practices for deploying data workloads on Amazon Elastic Kubernetes Service (EKS). While the main focus is on running Apache Spark on Amazon EKS, blueprints also exist for other data workloads such as Ray, Apache Airflow, Argo Workflows, and Kubeflow.

Building on the Amazon EKS Blueprints project, DoEKS provides infrastructure as code (IaC) templates (in both Terraform and AWS CDK), sample jobs, references to AWS resources, and performance benchmark reports. Solutions within DoEKS are categorized into five areas: data analytics, AI/ML, distributed databases, streaming platforms, and scheduler workflow patterns.

Guidance and patterns are provided for configuring observability and logging, handling multi-tenancy, and selecting cluster autoscalers. DoEKS covers several open-source tools, Kubernetes operators, and frameworks, and also integrates with AWS managed services.

Data on EKS Components (credit: AWS)

One of the provided patterns covers deploying EMR on EKS with Karpenter. This pattern creates an EKS cluster control plane and one managed node group. The node group has three instances spread across multiple availability zones and runs system-critical pods, including the Cluster Autoscaler, CoreDNS, observability, and logging components. The pattern then enables EMR on EKS with several opinionated defaults.

EMR on EKS with Karpenter (credit: AWS)

This pattern can be deployed using the provided Terraform template:

git clone https://github.com/awslabs/data-on-eks.git
cd data-on-eks/analytics/terraform/emr-eks-karpenter
terraform init
export AWS_REGION="us-west-2"
terraform plan
terraform apply

Argo Workflows is an open-source, container-native engine for orchestrating parallel jobs on Kubernetes. The Argo Workflows on EKS pattern walks through using it on Amazon EKS, including creating Spark jobs through a Spark operator and via Amazon SQS messages.

The blueprints for streaming platforms and distributed databases are still being developed. The streaming platform blueprints are expected to cover Apache Kafka, Apache Flink, and Apache Pulsar. The distributed database blueprints are expected to include Apache Cassandra, Amazon DynamoDB, and Apache Presto.

Currently, there are details on using CloudNativePG to manage PostgreSQL workloads through Kubernetes. The recommendations cover selecting storage, setting up monitoring, and handling backups and restores. For storage, DoEKS recommends using Amazon Elastic Block Store (EBS) volumes as they "provide high performance and fault tolerance". Specifically, it recommends using either provisioned IOPS SSD (io2 or io1) or general-purpose SSD (gp3 or gp2) volumes. YAML examples are included for both cases; the io2 example is shown below:

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  name: storageclass-io2
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
parameters:
  csi.storage.k8s.io/fstype: xfs
  encrypted: "true"
  type: io2
  iopsPerGB: "50"

This can be provisioned using: kubectl create -f examples/storageclass.yaml.
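The iopsPerGB parameter in the StorageClass above ties the requested IOPS to the size of each volume. The short Python sketch below works through that arithmetic for a few volume sizes. It is an illustration only, not part of DoEKS: the function names are hypothetical, and the minimum and maximum limits passed to clamp_iops are placeholder assumptions that should be checked against the current EBS documentation for the chosen volume type.

```python
def requested_iops(size_gib: int, iops_per_gb: int) -> int:
    """Return the IOPS implied by a volume size and an iopsPerGB ratio."""
    if size_gib <= 0 or iops_per_gb <= 0:
        raise ValueError("size_gib and iops_per_gb must be positive")
    return size_gib * iops_per_gb


def clamp_iops(iops: int, min_iops: int, max_iops: int) -> int:
    """Clamp a requested IOPS value to caller-supplied volume-type limits."""
    if min_iops > max_iops:
        raise ValueError("min_iops must not exceed max_iops")
    return max(min_iops, min(iops, max_iops))


if __name__ == "__main__":
    IOPS_PER_GB = 50  # matches iopsPerGB: "50" in the StorageClass above

    # ASSUMPTION: illustrative limits only; verify the real limits for
    # your volume type in the EBS documentation before relying on them.
    ASSUMED_MIN_IOPS = 100
    ASSUMED_MAX_IOPS = 64_000

    for size_gib in (20, 100, 500):
        raw = requested_iops(size_gib, IOPS_PER_GB)
        bounded = clamp_iops(raw, ASSUMED_MIN_IOPS, ASSUMED_MAX_IOPS)
        print(f"{size_gib} GiB at {IOPS_PER_GB} IOPS/GiB: {raw} requested, {bounded} after limits")
```

With a ratio of 50, a 100 GiB volume requests 5,000 IOPS, so the size given in a PersistentVolumeClaim directly affects both the performance and the cost of the volume.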

The DoEKS library is available under the Apache 2.0 license. It is not a supported AWS service; instead, it is maintained by AWS Solutions Architects and the DoEKS Blueprints community.

About the Author

Matt Campbell
