Big Data Revolution and Genomics Analysis
Curoverse and Tute Genomics each secured $1.5 million in seed funding in the past month, aiming to bring gene sequencing to the masses. Curoverse provides a private cloud platform for the biomedical industry and is backing Arvados, the open source bioinformatics platform. Tute Genomics offers a cloud-based genome analysis solution that helps researchers interpret sequenced data from the human exome and even the whole genome.
Gene sequencing costs have fallen sharply in recent years, making it easier to market the service to a larger audience. Meanwhile, storage and computing power have increased following Moore’s law, making it easier to analyze and store the full genome of a human being.
Even so, a fully sequenced human genome takes up roughly 100-1,000 GB of data. At the upper end of that range, a million customers’ data adds up to an exabyte, or around 1,000,000 TB. Researchers from UC Berkeley have proposed a feasible way to manage such a database using a three-tiered storage approach with tiers of 100 PB, one petabyte and one terabyte, of which only the last would be RDBMS based. The holy grail of this effort is personalized medicine. Humans share 99.9% of their DNA, and the hypothesis is that analyzing the full genome sequences of many patients will reveal what is hiding in the remaining 0.1% that can be used to predict and cure many diseases, including cancer.
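The storage figures quoted above can be checked with a quick back-of-the-envelope calculation. The following Python sketch is only illustrative: it assumes decimal units (1 TB = 1,000 GB and so on), and the function names `total_storage_tb` and `describe` are made up for this example. It shows that the low end of the per-genome range works out to 100 PB for a million customers and the high end to the exabyte mentioned in the text.

```python
# Back-of-the-envelope storage estimate using the figures quoted in the article.
# Assumption: decimal units, i.e. 1 TB = 1,000 GB, 1 PB = 1,000 TB, 1 EB = 1,000 PB.
GB_PER_TB = 1_000
TB_PER_PB = 1_000
PB_PER_EB = 1_000


def total_storage_tb(genomes: int, gb_per_genome: float) -> float:
    """Return the raw storage needed for all genomes, in terabytes."""
    return genomes * gb_per_genome / GB_PER_TB


def describe(tb: float) -> str:
    """Format a terabyte count in the largest convenient unit."""
    pb = tb / TB_PER_PB
    if pb >= PB_PER_EB:
        return f"{pb / PB_PER_EB:g} EB"
    if pb >= 1:
        return f"{pb:g} PB"
    return f"{tb:g} TB"


CUSTOMERS = 1_000_000
low_tb = total_storage_tb(CUSTOMERS, 100)     # 100 GB per genome
high_tb = total_storage_tb(CUSTOMERS, 1_000)  # 1,000 GB per genome

print("Lower bound:", describe(low_tb))
print("Upper bound:", describe(high_tb))
```

The estimate covers raw data only; it does not account for compression, replication or derived files, any of which could shift the totals considerably.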
On the computing side, specialized hardware is being used to analyze genome data faster. The cost of sequencing a human genome has dropped 100,000-fold in the past 10 years, and the time needed to analyze it has fallen from 13 years to less than three days.
In the research world, there are already sequencing centers analyzing and storing data, each from a small number of patients. The real challenge is combining these datasets across different archives and cross-referencing them with patient records, treatments and outcomes.
In recent years, private companies have stepped in and started offering genome analysis for the masses. Organizations like Illumina, Seven Bridges Genomics, Complete Genomics and others are offering researchers and private parties the opportunity to map the full genome sequence for a four-figure price. Illumina recently announced the HiSeq X Ten, promising the long-awaited $1,000 genome sequencing.
Illumina has launched a cloud computing and storage platform called BaseSpace, allowing scientists to sequence, analyze and collaborate on data stored in Amazon Web Services. Bioinformatics applications can also be developed using its API and SDKs.
Seven Bridges Genomics, on the other hand, combines cloud services such as EC2 and S3 with the NoSQL database MongoDB for human genome sequencing and analysis. Glacier is also used to bring down data storage costs. The Seven Bridges PaaS provides a GUI for setting up data pipelines, which can be based on predefined models or modified to fit the task at hand.
For aspiring bioinformatics developers, Crossbow is one of the tools that can be used for whole genome resequencing analysis. By combining several libraries, it can analyze a human genome in under three hours for less than $100 on AWS. Intel offers a step-by-step guide, and the source code can be found on GitHub.
InfoQ Sep 01, 2015