Running Apache Flink Applications on AWS KDA: Lessons Learnt at Deliveroo

Deliveroo introduced Apache Flink into its technology stack for enriching and merging events consumed from Apache Kafka or Kinesis Streams. The company opted to use AWS Kinesis Data Analytics (KDA) service to manage Apache Flink clusters on AWS and shared its experiences and observations from running Flink applications on AWS KDA.

Deliveroo uses Apache Kafka for inter-service messaging and analytical workloads. Still, in many cases, messages consumed from Kafka need to be augmented with data from other sources or aggregated based on common attributes (for instance, to calculate user interaction sessions). The team turned to Apache Flink to solve these use cases using a well-established solution.

Event aggregation for user session interactions (Source: Deliveroo Engineering Blog)

Apache Flink is a popular framework for stateful computations over unbounded and bounded data streams at any scale. It provides a distributed processing engine and integrates with cluster resource management systems such as Kubernetes, Hadoop YARN, and Apache Mesos.

AWS KDA is another cluster management service for Apache Flink that also supports Apache Beam and Apache Zeppelin. It provides APIs for Java, Scala, Python, and SQL, as well as SDKs for integrating with popular AWS services, such as S3, MSK, Kinesis, DynamoDB, OpenSearch, etc.

Duc Anh Khu, a senior software engineer at Deliveroo, talks about why the team went for AWS KDA:

We chose to use AWS KDA because it abstracts and simplifies the management and operation of Apache Flink cluster. To run on AWS KDA, applications are restricted to use streaming mode, RocksDB for state backend and resources of a cluster such as CPU and memory are abstracted as KPU. These work for us as our use cases meet these requirements. As Apache Flink adoption within the organisation is still low, choosing AWS KDA is a low risk decision for us as we don’t need to rely on other teams or manage Apache Flink clusters ourselves.

The team created build and deployment pipelines using CirceCI and Terraform and used multi-stage Docker image builds. KDA deployments, similarly to Lambda functions, expect application artifacts (jar files for Java and Scala or zip files for Python) to be uploaded to an S3 bucket. KDA provides observability for running applications with metrics originating from Apache Flink, MSK, or Kinesis Streams available in AWS CloudWatch and application-specific dashboard, helpful for troubleshooting.

While working on Flink applications, the team has come across some challenges, particularly around Python applications on AWS KDA. At the time, these applications required using an older version (1.13) of PyFlink library as an adaptation layer between the Python app and the internal Java APIs used by Flink. The dependency on JVM complicates application packaging and some functional areas where Python libraries are piggybacking on Java code for improved performance. Similarly, emitting custom metrics requires a particular approach to avoid Python/Java integration issues.

Lastly, because both JVM and Python runtimes need to be available to the application container, it requires more resources, which, combined with the default resource limit of 32 KPUs per application (1 KPU is the equivalent of 1vCPU and 4GiB or RAM; the default resource limit can be increased via support ticket), can lead to needing more application instances to support the required workload.

AWS KDA overview (Source: AWS KDA Documentation)

Developers have pointed out some areas where AWS KDA still needs work, including the ability to schedule and clean up snapshots (save points), automatic clean-up of old deployment artifacts in S3, changing low-level configs of Flink’s task managers (requires support tickets for now), and more user-friendly resource allocation (KPUs and parallelism settings can be cumbersome to work out).

Despite all these challenges and gaps, the team has observed some noticeable improvements since starting with KDA, including updated documentation and tutorials, a newer version (1.15) of Flink being available, and many bug fixes. They are happy with their choice but acknowledge that AWS KDA is not the best fit for everybody. For large applications, KDA may be tricky to customize resources, and for small ones, sharing an Apache Flink cluster (session mode) may be more cost-effective and flexible.

For more Apache Flink deployment options, see Instacart Creates a Self-Serve Apache Flink Platform on Kubernetes.

About the Author

Rafal Gancarz

Show moreShow less

InfoQ Software Architects' Newsletter

Login with:

Don't have an InfoQ account?

Write for InfoQ

About the Author

Rafal Gancarz

Rate this Article

This content is in the DevOps topic

Related Topics:

Related Editorial

Related Sponsored Content

Popular across InfoQ

The InfoQ Newsletter