Cloudera Announces Partnership with the Broad Institute

Cloudera announced last month that it had collaborated with the Broad Institute on running Broad's fourth-generation Genome Analysis Toolkit pipeline, Hellbender (GATK4), which InfoQ covered previously.

Cloudera's life sciences industry leader Shawn Dolley cited GATK4 cost savings and reduced R&D time in Broad's nearly simultaneous announcement of its wider collaboration with various cloud IaaS providers, but did not provide quantitative benchmarks. Dolley described the collaborative work and its merits:

Cloudera's commitment to Spark drove us to be the first Hadoop vendor to ship, support, and offer Spark training in 2014. We are honored to apply our expertise to the downstream multi-omic analysis space, investing in Spark as a bioinformatics standard, and working with Broad to create the next generation of GATK... This lower cost of genome sequencing and advancement in big data technologies means that we can afford to sequence the genome of patients very broadly and produce datasets that have never been available before.

The cloud platform use cases and architecture focus on avoiding duplicate infrastructure and facilitating best practices, so that users can concentrate on deriving insights into disease and treatment rather than on managing infrastructure. Dr. Eric Banks, Broad's senior director of data sciences and data engineering and creator of the GATK software package, noted:

There are currently more than 31,000 registered users of the Broad Institute's GATK. The vast majority set up an extensive local compute and storage infrastructure to process the huge amount of information required to conduct genomic analyses. These collaborations will provide new options that can remove traditional barriers of scale while offering the same high level of data quality.

On the performance gains of the GATK4 pipeline over the previous version, Banks stated that

the Spark computing framework on Cloudera Enterprise gives us the ability to implement tools that were not possible in GATK3 due to their computational complexity... On Cloudera Enterprise, we can now run analysis of genomic data two orders of magnitude faster than in previous versions of GATK, enabling faster iterative analysis for propelling genomic innovation.

Broad's goal in collaborating with IaaS providers is to make the next-generation GATK Spark pipeline available through a SaaS model, giving users access to GATK4 on various IaaS platforms without vendor lock-in. GATK4 is expected to be available as early as later this year, and pricing will vary by provider. Licenses will be free for academic research and fee-based for commercial users.