AWS Introduces Step Functions Distributed Map for Large-Scale Parallel Data Processing

AWS recently announced a distributed map for Step Functions, a solution for large-scale parallel data processing. Optimized for S3, the new feature of the AWS orchestration service targets interactive and highly parallel serverless data processing workflows.

The new distributed map state lets developers write Step Functions workflows that coordinate large-scale workloads, iterating over millions of S3 objects such as logs, images, or CSV files. AWS already supported the Step Functions map state to run the same processing steps for multiple entries in a dataset, but it was limited to 40 parallel iterations. Sébastien Stormacq, principal developer advocate at AWS, explains:

Step Functions distributed map supports a maximum concurrency of up to 10,000 executions in parallel, which is well above the concurrency supported by many other AWS services. You can use the maximum concurrency feature of the distributed map to ensure that you do not exceed the concurrency of a downstream service. There are two factors to consider when working with other services. First, the maximum concurrency supported by the service for your account. Second, the burst and ramping rates.
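As an illustration of the concurrency cap Stormacq describes, the following is a minimal sketch that builds an Amazon States Language definition for a map state in distributed mode. The bucket name, prefix, Lambda ARN, state name, and the helper function itself are hypothetical placeholders, not taken from the announcement; the 10,000 upper bound is the figure quoted above.

```python
import json

# Upper bound on parallel child executions, as quoted in the article.
MAX_DISTRIBUTED_CONCURRENCY = 10_000


def distributed_map_state(bucket, prefix, lambda_arn, max_concurrency=1000):
    """Build a Map state in distributed mode that iterates over S3 objects.

    All argument values are illustrative assumptions; adjust max_concurrency
    to stay below the limits of whatever downstream service is invoked.
    """
    if not 1 <= max_concurrency <= MAX_DISTRIBUTED_CONCURRENCY:
        raise ValueError("max_concurrency must be between 1 and 10,000")
    return {
        "Type": "Map",
        # Caps the number of child workflow executions running in parallel.
        "MaxConcurrency": max_concurrency,
        # Lists the objects under the given prefix as the items to iterate.
        "ItemReader": {
            "Resource": "arn:aws:states:::s3:listObjectsV2",
            "Parameters": {"Bucket": bucket, "Prefix": prefix},
        },
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
            "StartAt": "ProcessObject",
            "States": {
                "ProcessObject": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::lambda:invoke",
                    "Parameters": {"FunctionName": lambda_arn, "Payload.$": "$"},
                    "End": True,
                }
            },
        },
        "End": True,
    }


# Hypothetical values, for illustration only.
state = distributed_map_state(
    bucket="example-bucket",
    prefix="logs/",
    lambda_arn="arn:aws:lambda:us-east-1:123456789012:function:example",
    max_concurrency=500,
)
print(json.dumps(state, indent=2))
```

Keeping the cap as an explicit parameter makes it easier to tune against the account-level concurrency and ramp-up rate of the downstream service.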

AWS recommends using the map state in distributed mode when orchestrating large-scale parallel workloads, with datasets larger than 256 KB, execution event history greater than 25K entries, or a requirement of more than 40 parallel iterations.

Source: https://aws.amazon.com/blogs/aws/step-functions-distributed-map-a-serverless-solution-for-large-scale-parallel-data-processing/
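The recommendation above can be read as a simple rule of thumb. The sketch below encodes the three thresholds from the article in a helper; the function and parameter names are hypothetical, and interpreting 256 KB as 256 * 1024 bytes is an assumption.

```python
# Thresholds quoted in the article; the byte conversion is an assumption.
MAX_INLINE_DATASET_BYTES = 256 * 1024
MAX_INLINE_HISTORY_EVENTS = 25_000
MAX_INLINE_PARALLEL_ITERATIONS = 40


def should_use_distributed_mode(dataset_bytes, history_events, parallel_iterations):
    """Return True if any of the three thresholds for distributed mode is exceeded."""
    return (
        dataset_bytes > MAX_INLINE_DATASET_BYTES
        or history_events > MAX_INLINE_HISTORY_EVENTS
        or parallel_iterations > MAX_INLINE_PARALLEL_ITERATIONS
    )


# A small job stays within the inline map limits...
small_job = should_use_distributed_mode(10_000, 500, 10)
# ...while a job needing 1,000 parallel iterations does not.
large_job = should_use_distributed_mode(10_000, 500, 1_000)
print(small_job, large_job)
```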

Ben Kehoe, cloud expert and AWS Serverless Hero, tweets:

Step Functions Distributed Map is super helpful. Crawl over giant collections of S3 objects and apply Lambda processing to them! My only complaint is that this brand new syntax is put in the existing Map state, rather than a new state type.

Brian Zambrano, solutions architect at AWS, created a SAM application showing how to process 560K CSV files in 100 seconds. Some users highlight the overlap between the new orchestration option and existing AWS services such as the serverless data integration service Glue, the cluster platform EMR, or S3 Batch Operations. Stormacq differentiates the use cases:

Data scientists and data engineers use AWS Glue and EMR to process large amounts of data, (...) application developers will use Step Functions to add serverless data processing into their applications (...) system administrators and IT operation teams are likely to use Amazon S3 Batch Operations for single-step IT automation operations such as copying, tagging, or changing permissions on billions of S3 objects.

The distributed map stops reading after 100 million items and supports JSON or CSV files of up to 10 GB. Rafal Wilinski, founder of Dynobase, shares a CDK-based PoC of a migration framework that takes advantage of the new feature and comments:

Step Functions Distributed Maps are awesome. Combined with DynamoDB Parallel scans, they enable blazingly fast, whole-table data migrations and transformations.

Pricing is based on state transitions, starting at 0.025 USD per 1,000 transitions. According to AWS, for the same number of iterations, customers will see a cost reduction when combining the distributed map with standard workflows, compared to the existing inline map.
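To make the quoted rate concrete, here is a rough back-of-the-envelope sketch. It assumes the starting rate of 0.025 USD per 1,000 state transitions applies uniformly and ignores regional differences, free-tier allowances, and charges from other services such as Lambda or S3; the function name is hypothetical.

```python
from decimal import Decimal

# Starting rate quoted in the article: 0.025 USD per 1,000 state transitions.
USD_PER_THOUSAND_TRANSITIONS = Decimal("0.025")


def transition_cost_usd(state_transitions):
    """Estimate the state-transition cost in USD at the quoted starting rate."""
    if state_transitions < 0:
        raise ValueError("state_transitions must be non-negative")
    return USD_PER_THOUSAND_TRANSITIONS * Decimal(state_transitions) / Decimal(1000)


# Example: one million state transitions at the quoted starting rate.
estimate = transition_cost_usd(1_000_000)
print(f"Estimated cost: {estimate} USD")
```

Decimal is used instead of float so that the estimate is not affected by binary rounding of the per-transition rate.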

The new feature is generally available in a subset of AWS regions, including Ohio, Northern Virginia, Singapore, Frankfurt, and Ireland.

About the Author

Renato Losio
