The Broad Institute Migrates Genome Sequencing Pipeline to Google Cloud Platform

by Dylan Raithel on May 13, 2016. Estimated reading time: 2 minutes

Last April, Google Research followed up on a topic covered at the Google Cloud Platform (GCP) Next conference in San Francisco, CA. The Broad Institute of MIT and Harvard announced that it had fully migrated its genome sequencing pipeline to GCP. Dr. Stacey Gabriel, Director of the Genomics Platform at the Broad Institute, detailed the scale of the genomics pipeline, adding to earlier coverage by Kris Cibulskis of the Broad Institute during the conference.

Broad manages one of the largest genome sequencing centers in the world and has historically thought of itself as a hub for data generation, but it now plans to expand into offering gene sequencing and data as a service. To give the bioinformatics, data science and software engineering communities a sense of its data volume and growth rate, Broad noted that their

DNA sequencers produce more than 20 Terabytes (TB) of genomic data per day, and they run 365 days a year…. the output increased more than two-fold last year, and nearly two-fold the previous year.
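To put the quoted daily figure in annual terms, here is a back-of-the-envelope sketch. It assumes a constant 20 TB per day and decimal units (1 PB = 1,000 TB); both are simplifications for illustration, and because the quote says "more than 20 TB" with output still growing, the result is best read as a lower bound.

```python
# Rough annual data volume from the quoted daily sequencer output.
# Assumptions (not from the source): output is flat at exactly 20 TB/day,
# and 1 PB = 1,000 TB (decimal units).
DAILY_OUTPUT_TB = 20
DAYS_PER_YEAR = 365
TB_PER_PB = 1_000

annual_output_tb = DAILY_OUTPUT_TB * DAYS_PER_YEAR
annual_output_pb = annual_output_tb / TB_PER_PB

print(f"At least {annual_output_tb} TB (~{annual_output_pb} PB) per year")
```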

Broad uses Nearline as a cost-effective medium for storing infrequently used DNA sequence segments, saving a reported $1.5M, or 50%, compared with its pre-GCP storage and access architecture. Broad also noted that its Whole Genome Sequencing Pipeline is completely ported to GCP and that features like preemptible VMs cut the costs associated with idle CPU time. As part of plans to fully migrate to cloud services, Broad noted they’re

migrating each of our own pipelines to the cloud to meet our own needs… and plan to make them available to the greater genomics community through a Software-as-a-Service model.

The scalability and data access infrastructure Broad built on GCP is open sourced as the FireCloud platform. Cost savings and runtime optimizations are based on using the Genome Analysis Toolkit (GATK) and on the relative cost of the various steps in the pipeline. Broad noted that it parallelized computationally intensive steps, such as aligning DNA sequences against a reference genome, to reduce overall wall-clock runtime.
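The parallelization idea can be illustrated with a minimal scatter-gather sketch. This is not Broad's implementation and does not use GATK or any real aligner: `REFERENCE`, `align_read`, `align_chunk`, and `scatter_gather_align` are hypothetical names, the "alignment" is a toy exact-match lookup, and a local thread pool stands in for the fleet of cloud workers a real pipeline would use.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy reference sequence, for illustration only.
REFERENCE = "ACGTTGCAACGTAGCTAGGCTA"


def align_read(read: str) -> int:
    """Toy stand-in for an aligner: offset of the first exact match, or -1."""
    return REFERENCE.find(read)


def chunked(items, size):
    """Scatter step: split the input into independent chunks."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def align_chunk(reads):
    """Work done by one worker on one chunk."""
    return [(read, align_read(read)) for read in reads]


def scatter_gather_align(reads, chunk_size=2, workers=4):
    """Align chunks concurrently, then gather results in input order."""
    chunks = list(chunked(reads, chunk_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input chunks.
        per_chunk = list(pool.map(align_chunk, chunks))
    return [pair for chunk in per_chunk for pair in chunk]


if __name__ == "__main__":
    for read, offset in scatter_gather_align(["ACGT", "TTGC", "GGCT", "AAAA"]):
        print(read, offset)
```

Because each chunk is independent, wall-clock time is bounded by the slowest chunk rather than by the sum of all chunks, which is the effect the article describes. Threads are used here only to keep the sketch self-contained.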

Noteworthy details about the Broad stack highlighted by Kris Cibulskis include:

  • A focus on using elastic compute for data sharing and processing
  • Optimizing storage cost based on genome data access patterns using Nearline
  • Emphasis on moving research computation to where the data is hosted, to avoid the network latencies associated with petabyte-scale data transfer

Broad also open-sourced Cromwell, a workflow management system geared towards scientific workflows, under the BSD 3-Clause license.

Three other research organizations were also represented in the original presentation and provided details about scaling genomics processing on GCP. The Institute for Systems Biology (ISB) was represented by Ilya Shmulevich, Seven Bridges Genomics by Igor Bogicevic, and Stanford Medicine by Jason Merker.
