Cloudera Announces Partnership with the Broad Institute

by Dylan Raithel on Jun 02, 2016. Estimated reading time: 2 minutes

Cloudera reported last month that it had collaborated with the Broad Institute on running Broad's fourth-generation Genome Analysis Toolkit (GATK4, "Hellbender") pipeline, which InfoQ covered previously.

Cloudera's life sciences industry leader Shawn Dolley cited GATK4 cost savings and reduced R&D time in Broad's nearly simultaneous announcement of its wider collaboration with various cloud IaaS providers, but didn't provide quantitative benchmarks. Dolley described the collaborative work and its merits, stating:

Cloudera's commitment to Spark drove us to be the first Hadoop vendor to ship, support, and offer Spark training in 2014. We are honored to apply our expertise to the downstream multi-omic analysis space, investing in Spark as a bioinformatics standard, and working with Broad to create the next generation of GATK... This lower cost of genome sequencing and advancement in big data technologies means that we can afford to sequence the genome of patients very broadly and produce datasets that have never been available before.

The cloud platform use cases and architecture focus on avoiding duplicate infrastructure and facilitating best practices, so that users can concentrate on deriving insights into disease and treatment rather than on managing infrastructure. Dr. Eric Banks, Broad's senior director of data sciences and data engineering and creator of the GATK software package, noted:

There are currently more than 31,000 registered users of the Broad Institute's GATK. The vast majority set up an extensive local compute and storage infrastructure to process the huge amount of information required to conduct genomic analyses. These collaborations will provide new options that can remove traditional barriers of scale while offering the same high level of data quality.

On the performance gains of the GATK4 pipeline over the previous version, Banks stated that

the Spark computing framework on Cloudera Enterprise gives us the ability to implement tools that were not possible in GATK3 due to their computational complexity...On Cloudera Enterprise, we can now run analysis of genomic data two orders of magnitude faster than in previous versions of GATK, enabling faster iterative analysis for propelling genomic innovation

Broad's goal in collaborating with IaaS providers is to make the next-generation, Spark-based GATK pipeline available through a SaaS model, giving users access to GATK4 on various IaaS platforms without vendor lock-in. GATK4 is expected to be available as early as later this year, and pricing will vary by provider. Licenses will be free for academic research and fee-based for commercial users.
