The Broad Institute Migrates Genome Sequencing Pipeline to Google Cloud Platform

Last April, Google Research followed up on a topic covered at the Google Cloud Platform (GCP) Next conference in San Francisco, CA. The Broad Institute of MIT and Harvard announced they’d fully migrated their genome sequencing pipeline to GCP. Dr. Stacey Gabriel, Director of the Genomics Platform at the Broad Institute, detailed the scale of their genomics pipeline, adding to previous coverage by Kris Cibulskis of the Broad Institute during the conference.

Broad manages one of the largest genome sequencing centers in the world and has historically thought of itself as a hub for data generation, but it now plans to expand into offering gene sequencing and data as a service. To give the bioinformatics, data science, and software engineering communities a sense of their data volume and growth rate, Broad noted that their

DNA sequencers produce more than 20 Terabytes (TB) of genomic data per day, and they run 365 days a year…. the output increased more than two-fold last year, and nearly two-fold the previous year.
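To put the quoted daily figure in annual terms, the arithmetic can be sketched as follows. This is a back-of-the-envelope estimate, not a number reported by Broad: it treats "more than 20 TB per day" as exactly 20 TB, so the result is a lower bound, and it assumes decimal units (1 PB = 1,000 TB). The function and constant names are illustrative only.

```python
# Rough annual data volume implied by the quoted figures.
# Assumption: exactly 20 TB/day (the quote says "more than"), decimal units.

TB_PER_DAY = 20
DAYS_PER_YEAR = 365
TB_PER_PB = 1000


def annual_volume_tb(tb_per_day: float, days: int = DAYS_PER_YEAR) -> float:
    """Return the yearly volume in terabytes for a constant daily rate."""
    return tb_per_day * days


annual_tb = annual_volume_tb(TB_PER_DAY)
annual_pb = annual_tb / TB_PER_PB

print(f"At least {annual_tb:,.0f} TB per year (about {annual_pb:.1f} PB)")
```

At a constant 20 TB per day this comes to roughly 7.3 PB per year, before accounting for the year-over-year growth mentioned in the quote.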

Broad uses Nearline as a cost-effective medium for storing infrequently used DNA sequence segments, saving a reported $1.5M, or 50%, over their pre-GCP storage and access architecture. They also noted that their Whole Genome Sequencing Pipeline is completely ported to GCP and that features like preemptible VMs cut costs associated with idle CPU time. As part of plans to fully migrate to cloud services, Broad noted they’re

migrating each of our own pipelines to the cloud to meet our own needs… and plan to make them available to the greater genomics community through a Software-as-a-Service model.

The scalability and data access infrastructure Broad built on GCP is open sourced as the FireCloud platform. Cost savings and runtime optimizations are based on using the Genome Analysis Toolkit (GATK) and on the relative cost of various steps in the pipeline. Broad noted that they parallelized computationally intensive steps, like aligning DNA sequences against a reference genome, to reduce overall wall-clock runtime.
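The parallelization described above follows a general scatter-gather pattern: split the reads into shards, align each shard independently, then merge the results. The sketch below illustrates only that pattern. It is not Broad's pipeline or GATK code; the toy reference string, the exact-match "aligner", and all function names are assumptions made for illustration, and a thread pool stands in for what would in practice be separate processes or VMs.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy reference sequence; real reference genomes are gigabases long.
REFERENCE = "ACGTACGTTAGCCGATAGGCTTAACG"


def align_read(read, reference=REFERENCE):
    """Toy stand-in for an aligner: position of the first exact match, or -1."""
    return reference.find(read)


def align_shard(reads):
    """Align every read in one shard; shards are independent of each other."""
    return [(read, align_read(read)) for read in reads]


def scatter(reads, shard_size):
    """Split the read list into consecutive shards of at most shard_size reads."""
    return [reads[i:i + shard_size] for i in range(0, len(reads), shard_size)]


def parallel_align(reads, shard_size=2, workers=4):
    """Scatter reads into shards, align shards concurrently, gather in order."""
    shards = scatter(reads, shard_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input shards.
        shard_results = list(pool.map(align_shard, shards))
    return [hit for shard in shard_results for hit in shard]


if __name__ == "__main__":
    for read, position in parallel_align(["ACGT", "TAGC", "GGCT", "AAAA"]):
        print(read, position)
```

Because each shard is aligned without reference to any other, wall-clock time shrinks roughly with the number of workers, which is why this kind of step is a natural fit for elastic compute.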

Noteworthy details about the Broad stack highlighted by Kris Cibulskis include:

A focus on using elastic compute for data sharing and processing
Optimizing storage cost based on genome data access patterns using Nearline
Emphasis on moving research computation to where the data is hosted, to avoid network latencies associated with petabyte-scale data transfer
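A rough calculation shows why moving computation to the data matters at this scale. The sketch below estimates how long a bulk transfer would take over a sustained network link. The 10 Gbit/s link speed and the 1 PB dataset size are assumptions chosen for illustration, not figures from the presentation, and the estimate ignores protocol overhead and contention.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: float, link_gbps: float) -> float:
    """Days needed to move size_bytes over a link sustaining link_gbps gigabits/s."""
    bits = size_bytes * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds / SECONDS_PER_DAY


# Assumed values: 1 PB (decimal, 1e15 bytes) over a fully utilized 10 Gbit/s link.
one_petabyte = 1e15
days = transfer_days(one_petabyte, link_gbps=10)
print(f"{days:.1f} days to move 1 PB at 10 Gbit/s")
```

Under these assumptions a single petabyte takes more than nine days to copy, so running the analysis next to the stored data avoids a delay that would otherwise dominate the workflow.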

Broad also open-sourced Cromwell, a workflow management system geared towards scientific workflows, under the BSD 3-Clause license.

Three other research organizations were represented in the original presentation as well and provided details about scaling genomics processing on GCP. The Institute for Systems Biology (ISB) was represented by Ilya Shmulevich, Seven Bridges Genomics by Igor Bogicevic, and Stanford Medicine by Jason Merker.
