
Pfizer Uses Serverless Architecture on AWS to Scale Processing of Digital Biomarkers

Pfizer upgraded its serverless architecture for processing digital biomarker data at scale to make it more flexible and configurable. The company created a framework that combines a file-processing pipeline, built with AWS Step Functions and other serverless services, with a custom Python package for data ingestion and processing.

Lukas Adamowicz, senior data scientist at Pfizer, talks about the importance of digital biomarkers:

Digital biomarkers are quantitative, objective measures of physiological and behavioral data. They are collected and measured using digital devices that better represent free-living activity in contrast to a highly structured in-clinic setting. This approach generates large amounts of data that requires processing.

Pfizer needed a scalable and cost-effective solution for processing digital biomarkers and created a serverless architecture on AWS in 2020. The original architecture combined Amazon S3 for storage, Amazon ECS for executing the data-processing logic, and Lambda functions for S3 event subscriptions and data aggregation. The solution supported a small selection of algorithms and catered to only a few types of biomarkers.
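A minimal sketch of such an S3-triggered Lambda handler is shown below; the ECS cluster, task definition, container name, launch type, and network configuration are hypothetical placeholders, not Pfizer's actual setup.

```python
import urllib.parse

import boto3

ecs = boto3.client("ecs")


def handler(event, context):
    """Triggered by S3 ObjectCreated events; starts one ECS task per new file."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Launch the (hypothetical) processing task, passing the file
        # location through container environment overrides.
        ecs.run_task(
            cluster="biomarker-processing",    # assumed cluster name
            taskDefinition="biomarker-task",   # assumed task definition
            launchType="FARGATE",              # launch type assumed
            networkConfiguration={
                "awsvpcConfiguration": {
                    "subnets": ["subnet-0example"],  # placeholder subnet
                    "assignPublicIp": "ENABLED",
                }
            },
            overrides={
                "containerOverrides": [{
                    "name": "processor",       # assumed container name
                    "environment": [
                        {"name": "INPUT_BUCKET", "value": bucket},
                        {"name": "INPUT_KEY", "value": key},
                    ],
                }]
            },
        )
```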

Recently, the company reworked this original design to support multiple data sources and different algorithms with distinct sets of parameters. It also created SciKit-Digital-Health (SKDH), a specialized Python package that provides algorithms for computing digital biomarkers and code supporting different data-ingestion methods.
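SKDH is available as open source (scikit-digital-health on PyPI). The sketch below shows how a biomarker pipeline might be assembled with it, assuming the Pipeline and gait.Gait names from the SKDH documentation (class names have shifted across releases) and feeding synthetic stand-in data rather than real device output.

```python
import numpy as np
import skdh  # pip install scikit-digital-health

# Assemble a processing pipeline; the Gait process derives gait-related
# digital biomarkers (cadence, step time, etc.) from accelerometer data.
# Note: newer SKDH releases rename this class (e.g., GaitLumbar).
pipeline = skdh.Pipeline()
pipeline.add(skdh.gait.Gait())

# Synthetic stand-in for wearable data: timestamps (s) and tri-axial
# acceleration (in units of g) sampled at 50 Hz for one minute.
fs = 50.0
time = np.arange(0.0, 60.0, 1.0 / fs)
accel = np.random.default_rng(0).normal(0.0, 0.1, (time.size, 3))
accel[:, 2] += 1.0  # add gravity along the vertical axis

# Each process in the pipeline consumes the keyword arguments it needs
# and contributes its results to the returned dictionary.
results = pipeline.run(time=time, accel=accel)
print(results)
```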

The improved solution uses Amazon S3 for storing files, DynamoDB for storing metadata, AWS Step Functions for process orchestration, AWS Batch with Fargate for executing the data-processing logic, and Amazon SQS for messaging.

Digital biomarker catalog and file processing workflow (Source: AWS Architecture Blog)
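To give a feel for the orchestration layer, the sketch below starts an execution of such a Step Functions state machine with Boto3; the state machine ARN and the input document are hypothetical placeholders.

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical state machine and input; execution names must be unique
# within a state machine for 90 days.
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:file-processing",
    name="study-example-run-001",
    input=json.dumps({
        "configBucket": "example-config-bucket",
        "configKey": "configs/study.yaml",
    }),
)
print(response["executionArn"])
```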

The overall architecture consists of two processing workflows. The first scans the S3 bucket for new configuration files that describe a study in YAML format. When a new configuration file is found, the workflow verifies that the bucket specified in the file exists and contains relevant data files, and then triggers file processing for the configured files in the study bucket.
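A minimal sketch of that scanning step, using Boto3 and PyYAML, might look as follows; the configs/ prefix and the study_bucket and data_prefix configuration fields are illustrative assumptions, not the actual schema.

```python
import boto3
import yaml  # PyPI package: PyYAML
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


def find_study_configs(config_bucket: str, prefix: str = "configs/"):
    """Yield (key, parsed config) for YAML study configurations under a prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=config_bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if not obj["Key"].endswith((".yml", ".yaml")):
                continue
            body = s3.get_object(Bucket=config_bucket, Key=obj["Key"])["Body"].read()
            yield obj["Key"], yaml.safe_load(body)


def study_bucket_has_data(config: dict) -> bool:
    """Check that the bucket named in the config exists and holds data files."""
    bucket = config["study_bucket"]  # hypothetical configuration field
    try:
        s3.head_bucket(Bucket=bucket)  # raises ClientError if missing/forbidden
    except ClientError:
        return False
    listing = s3.list_objects_v2(
        Bucket=bucket,
        Prefix=config.get("data_prefix", ""),  # hypothetical configuration field
        MaxKeys=1,
    )
    return listing["KeyCount"] > 0
```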

The second workflow processes study data and metadata files. It first optionally inserts metadata into the DynamoDB table, then checks whether the processing requirements are met, waiting until they are. Once they are, the workflow executes a batch job on AWS Batch with Fargate. The job computes the digital biomarkers using the configured version of the SKDH package and uploads the generated results back to the study S3 bucket.
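The sketch below illustrates those two steps with Boto3; the table name, key attributes, job queue, job definition, and environment variables are all hypothetical placeholders.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
batch = boto3.client("batch")


def process_study_file(study_id: str, file_key: str, metadata: dict | None = None):
    """Optionally record file metadata, then submit the processing batch job."""
    if metadata:
        # Table name and key attributes are hypothetical.
        dynamodb.Table("study-file-metadata").put_item(
            Item={"study_id": study_id, "file_key": file_key, **metadata}
        )
    # Submit a job to an AWS Batch queue backed by Fargate; the job's
    # container picks up the file location from its environment.
    return batch.submit_job(
        jobName=f"skdh-{study_id}",
        jobQueue="biomarker-fargate-queue",  # assumed job queue
        jobDefinition="skdh-processing",     # assumed job definition
        containerOverrides={
            "environment": [
                {"name": "STUDY_ID", "value": study_id},
                {"name": "FILE_KEY", "value": file_key},
            ]
        },
    )
```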

The team used AWS CloudFormation, together with the Boto3 library and the AWS CLI, to manage the infrastructure resources required to support the architecture.
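For illustration, creating such a stack from a local template with Boto3 could look like this; the stack name and template file are hypothetical.

```python
import boto3

cfn = boto3.client("cloudformation")

# Hypothetical template describing the pipeline's serverless resources.
with open("biomarker-pipeline.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="digital-biomarker-pipeline",
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_NAMED_IAM"],  # required if the template creates IAM roles
)

# Block until the stack finishes creating.
cfn.get_waiter("stack_create_complete").wait(StackName="digital-biomarker-pipeline")
```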
