
Lynn Langit on 25% Time and Cloud Adoption within Genomic Research Organizations

Lynn Langit is a consulting cloud architect who holds recognitions from all three major cloud vendors for her contributions to their respective communities. On today's podcast, Wes Reisz talks with Lynn Langit about a concept she calls 25% time, and a project it led her to become involved in within genomic research. 25% time is her own method of learning while collaborating with someone else for a greater good. A recent project led her to become involved with the Commonwealth Scientific and Industrial Research Organisation (CSIRO) in Australia. Through cloud adoption and some lean-startup practices, they were able to cut the run time of a machine learning algorithm against a genomic dataset from 500 hours to 10 minutes.

Key Takeaways

  • 25% time is a way to learn, study, or collaborate with someone else for a greater good. It’s unbilled time in the service of others. Using the idea of 25% time along with some personal events that occurred in her life, Lynn became involved with genomic researchers in Australia.
  • The price of genomic sequencing has dropped. That drop has enabled researchers to create huge repositories of genomic data; however, the data was mostly stored on-prem. The idea of building data pipelines was fairly new in the genomics community. Additionally, the genome itself is 3 billion data points, and as few as 10-15 variants can be statistically significant.
  • The challenge was to leverage cloud resources. To gain a quick win and buy-in for cloud adoption at the Commonwealth Scientific and Industrial Research Organisation (CSIRO, an independent Australian federal government agency), the first step was to capture interest in the idea, so the team stored their reference data in the cloud and enabled access via a Jupyter notebook.
  • They demonstrated a use case against the genomic dataset using a synthetic phenotype (a fake disease) called hipsterdom. The solution became the basis for global discussion that got more people involved in the community.
  • By leveraging cloud resources, CSIRO was able to reduce a run against their dataset from 500 hours on an on-prem Spark cluster to 10 minutes.
  • Learning a new programming language has unseen benefits. For example, Ballerina (a language designed for integration between APIs) interested Lynn because of its live visual diagrams; however, it also benefited her in some of her cloud pipeline work because of its ability to produce YAML files.

Show Notes

How has your role in teaching kids computing changed?

  • 01:50 After 11 years leading an organisation of volunteers creating software for middle-school teachers, my colleague Jessica Ellis has taken over and runs TKP Labs, using TKP Java as part of their curriculum.
  • 02:15 They have also created other courseware around data science and IoT.
  • 02:20 They built upon the foundation and they are continuing the work, with me acting in an advisory capacity.

What kind of things does she do?

  • 02:30 She works with the kids using TKP Java to get them started in an object-oriented programming language.
  • 02:40 She has some unplugged activities - she is also the director of the Boys & Girls Club in North San Diego, where there's an industrial kitchen.
  • 02:50 They did a hummus experiment, where the kids combined different ingredients to make hummus, then performed data science experiments to find which one was best.
  • 03:00 They collected the data and, inspired by David McCandless's Information is Beautiful work, visualised the results of the experiments on the walls of the club.
  • 03:20 They are doing all sorts of really interesting computational concepts, from programming to IoT to data science.
  • 03:35 She works with different organisations globally to bring this into the classroom.

What is 25% time?

  • 04:10 I incorporate 25% time into my time in my own business.
  • 04:15 I use that time to pursue my professional development, and combine that with working on socially good projects.
  • 04:25 A few years ago, a friend of mine had cancer (she’s OK now) but she was unable to get personalised treatment.
  • 04:30 At the same time, my daughter had the opportunity to go to Stanford the summer of her junior year and take cancer biology courses.
  • 04:40 She was learning about personalised therapies that were becoming available because of the dropping cost of sequencing genomes, gene editing, etc.
  • 04:50 I became interested in helping to work on the problem when my daughter came back from school - they weren’t using the cloud.
  • 05:00 I had been working on building cloud pipelines in other verticals, like ad tech.
  • 05:10 I was familiar with the patterns for taking huge amounts of data, and doing complex compute and getting a quick result.
  • 05:20 I was surprised that the bioinformatics community wasn’t using the cloud.
  • 05:25 The price of genomic sequencing coming down has been a relatively recent development; so they hadn’t had huge data repositories to work with for many years.
  • 05:40 I got interested in working with them, and I landed with a group in Australia that I’ve been working with for a couple of years.

What do you mean by large data?

  • 05:50 The physical size is one dimension, but what's interesting is that a genome is 3 billion letters, and even 15 letters being different can be statistically significant.
  • 06:15 You have a matrix calculation to find the genomes of interest.
  • 06:20 You start in the lab (wetware) - and they make mistakes.
  • 06:30 You have to analyse the snippets of the genome over and over, and then you have to look for the statistically significant difference: these are matrix processes.
  • 06:35 With the genome, you have to get the variants out, and then combine with the sample sets.
  • 06:45 Where the work is going now is to get reference data sets around specific disease conditions, like pancreatic or breast cancer for example.
  • 06:55 To further segment those genomic variants, so that when people present with disease conditions, they can reference against the more stratified data sets, and move towards personalised medicine.
  • 07:05 In order to build datasets, you need fast matrix computation, and that’s the area I’m working on.
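The matrix work described above - comparing labelled sample groups column-by-column across millions of variant positions - can be sketched minimally in Python. Everything below is invented for illustration (the function names, the 0/1/2 genotype encoding as a simplification, and the threshold), not taken from any real genomics pipeline:

```python
def column_means(rows):
    """Mean of each column (variant position) across a list of samples.

    Each sample is a list of genotype values, e.g. 0/1/2 copies of the
    alternate allele at each position.
    """
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def significant_variants(cases, controls, threshold=0.5):
    """Indices of variants whose case/control mean difference exceeds
    a made-up threshold - a crude stand-in for a real statistical test."""
    case_means = column_means(cases)
    ctrl_means = column_means(controls)
    return [i for i, (a, b) in enumerate(zip(case_means, ctrl_means))
            if abs(a - b) >= threshold]
```

A real pipeline would use a proper association test and run these comparisons as distributed matrix operations, but the shape of the computation - columns compared across labelled sample groups - is the same.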

Are you helping build data pipelines, or is this a machine learning problem?

  • 07:15 Both - a pipeline with machine learning.
  • 07:20 In the case of CSIRO they had built a library to help identify variants of interest.
  • 07:30 The disease that they’re working with is ALS - the icebucket challenge disease - they’re trying to work with a group of researchers globally to build that reference dataset.
  • 07:35 In order to do that, they’re using various machine learning methods to find the statistically important variants.
  • 07:45 They were using Hail, a logistic regression library put out by the Broad Institute in the United States, but it handles only single variants.
  • 07:50 They wanted to work with an algorithm that would find multiple variants, because that would more closely correlate to disease conditions.
  • 08:00 They experimented with k-means to find clumps in the data, but as they got more reference datasets they wanted to use a supervised algorithm.
  • 08:10 They couldn’t find anything that would scale properly to do with the matrix sizes; they started with Spark to use Spark ML but they couldn’t get big enough compute clusters.
  • 08:25 They took random forests with splits both horizontally and vertically, and made an open-source library called VariantSpark.
  • 08:35 When I met them, they were running it on their internal shared Hadoop Spark cluster, so they had to wait - in addition, it took a really long time to do one run (500 hours).
  • 08:55 They asked what I could do to help get a faster run in the cloud.
  • 09:00 That’s what I’m working on - trying to help them to get this run in the cloud - it’s not the pipeline yet, it’s just this run because it’s so complex.
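The horizontal-and-vertical splitting mentioned for VariantSpark can be pictured as partitioning the samples-by-variants matrix into sub-blocks that can be processed independently. A stdlib-only sketch of that partitioning idea (block sizes and the return shape are arbitrary choices here, not VariantSpark's actual scheme):

```python
def split_blocks(matrix, row_block, col_block):
    """Partition a samples-by-variants matrix into sub-blocks.

    Splits horizontally (by sample rows) and vertically (by variant
    columns) at the same time, returning ((row_offset, col_offset), block)
    pairs so each block could be handed to a separate worker.
    """
    blocks = []
    for r in range(0, len(matrix), row_block):
        for c in range(0, len(matrix[0]), col_block):
            block = [row[c:c + col_block] for row in matrix[r:r + row_block]]
            blocks.append(((r, c), block))
    return blocks
```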

What did getting to the cloud look like?

  • 09:20 The first thing that I did was get a minimum viable product, using lean startup principles.
  • 09:40 We took their open-source repository, and built a reference Jupyter notebook as a sample.
  • 09:50 I asked if they could make it fun or catchy, and they came up with an idea of making a synthetic phenotype - a fake disease, called the Hipster gene.
  • 10:10 Because they are biologists, they took four traits; ability to drink coffee, some characteristic of the retina that would make you prefer checkered shirts, facial hair, and monobrow.
  • 10:25 These traits are genetically related - if you use 23andMe, you can check your own genome for the Hipster gene.
  • 10:30 We had this notebook, but how do you run it?
  • 10:40 I started with a SaaS solution: Databricks on AWS, which is managed Spark.
  • 10:50 We had a small dataset to show how the algorithm worked, and we used that as a basis for the sessions and talks globally.
  • 11:00 We got more people involved in the community, with people getting excited and providing patches to the library.
  • 11:10 That was getting people interested, not the scaling.

What did it look like?

  • 11:25 We made a synthetic data set; we referenced it against 1000 genomes.
  • 11:30 For machine learning, it was like reverse-TDD: we labeled certain data as hipster or not-hipster depending on gene variants.
  • 11:35 We ran the algorithm with its random forests, and the known labels should prove true for those genes.
  • 11:45 It demonstrated that the algorithm worked.
  • 12:00 In finance you have sophisticated machine learning, like TensorFlow or deep neural networks.
  • 12:05 What I found in bioinformatics, was that it was not using that.
  • 12:10 The scientists were more focused on the science, and the technology was in service of the science, so they weren't out studying TensorFlow on the weekends - they were using logistic regression.
  • 12:30 Using random forests in the field was a novel approach - I was surprised, because it’s not a new algorithm.
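That "reverse-TDD" validation can be sketched in a few lines of Python. Everything here is illustrative: the causal positions and the mean-difference score are invented stand-ins for the real labelled 1000-genomes-referenced data and the random-forest importance in VariantSpark:

```python
CAUSAL = {3, 7}  # hypothetical "hipster" variant positions

def label(sample):
    # A sample is "hipster" if it carries any alt allele at a causal site.
    return any(sample[i] > 0 for i in CAUSAL)

def rank_variants(samples, labels):
    """Rank variant positions by how strongly they separate the labels.

    Scores each variant by |mean in positives - mean in negatives|,
    a crude stand-in for random-forest feature importance.
    """
    pos = [s for s, l in zip(samples, labels) if l]
    neg = [s for s, l in zip(samples, labels) if not l]
    def mean(col, rows):
        return sum(r[col] for r in rows) / max(len(rows), 1)
    n = len(samples[0])
    scores = [abs(mean(i, pos) - mean(i, neg)) for i in range(n)]
    return sorted(range(n), key=lambda i: -scores[i])
```

The check is exactly the one described: because the labels were generated from known variants, a working algorithm must rank those same variants at the top.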

What did you do next?

  • 12:55 We worked with the team in Australia and evaluated different cloud vendors.
  • 13:00 They had fewer options at the time we started the project - the data has to be kept in-country because it's health data, and there was no GCP data center in Australia.
  • 13:10 We had to eliminate GCP because they weren't in Australia at the time (though they have since opened a data center there).
  • 13:15 It was between Azure and AWS - so we did an evaluation, and based on the cost of the services, we decided to go for a PaaS.
  • 13:40 We got them up and running, because in this domain, they hadn’t used managed clusters.
  • 14:00 They had a separate project downstream of the data - almost where to cut the line for a CRISPR edit - a green line or a black line mapping to the genome.
  • 14:20 They had a grad student who had gone to an AWS summit, and he built a tool using a serverless pattern with DynamoDB and Lambda.
  • 14:35 They had success relatively quickly on Amazon, so were inclined to use that for the future.
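The student's DynamoDB-and-Lambda pattern might look roughly like the handler below. The event shape, scoring, and table are all invented for illustration (the podcast doesn't cover the tool's internals), and the table object is injected so the sketch runs without AWS - in a real Lambda it would be `boto3.resource("dynamodb").Table(...)`:

```python
def score_site(site):
    # Placeholder scoring: real tools score candidate CRISPR cut sites
    # by sequence context; here we just normalise the position.
    return site["pos"] % 100 / 100.0

def handler(event, context, table=None):
    """Score one candidate genome site and persist the result.

    `event` is assumed to carry a site dict, e.g.
    {"site": {"chrom": "chr7", "pos": 5566778}}.
    """
    site = event["site"]
    item = {"id": f"{site['chrom']}:{site['pos']}", "score": score_site(site)}
    if table is not None:
        table.put_item(Item=item)  # DynamoDB-style put
    return item
```

The appeal of the pattern for a small research team is visible even in the sketch: no cluster to manage, just a function wired to storage.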

What did they do with serverless?

  • 14:45 It's called GT-Scan2; it takes the results out of VariantSpark and visualises them.
  • 15:00 We then added additional data around where the CRISPR cut will be made.
  • 15:10 The first question they asked me was whether variant spark could be serverless, and of course I had to say no.
  • 15:15 It has long-running compute, so we couldn't do it serverless the way the algorithm was working.
  • 15:30 We eliminated IaaS because they didn’t have a devops person.
  • 15:35 We were in the PaaS and it was a process of helping them understand the true economies of the cloud, using batch and spot instances so they could burst.
  • 15:45 Plus some devops work, like click-to-enable templates.
  • 16:15 We used bursting but not serverless; the long-running compute didn’t allow for that.
  • 16:25 We used batch or spot instances.
  • 16:30 I was doing some other work with my daughter and she was having to connect to a mainframe to run a process.
  • 16:50 We made a docker container with the tool to run it locally.
  • 17:00 We applied the tool and a Jupyter notebook for visualisations and built the prototype.
  • 17:15 Spark 2.3 came out, which allowed Kubernetes to be the cluster controller.
  • 17:30 I talked to Amazon, who helped fund this work, and took what we started with EMR and migrated it to a data lake on S3 using containers.
  • 17:50 We shipped it, bringing the 500-hour run down to 10 minutes.
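The Spark-on-Kubernetes move can be pictured as a `spark-submit` invocation against a Kubernetes master, which Spark 2.3 introduced. The image name, jar path, class, and bucket below are placeholders for illustration, not the actual CSIRO artifacts:

```shell
# Illustrative Spark 2.3 submit against a Kubernetes cluster scheduler.
spark-submit \
  --master k8s://https://<cluster-endpoint>:443 \
  --deploy-mode cluster \
  --name variant-run \
  --conf spark.executor.instances=50 \
  --conf spark.kubernetes.container.image=<registry>/spark:2.3 \
  --class org.example.VariantJob \
  local:///opt/jars/variant-job.jar s3a://<bucket>/dataset/
```

Containerised executors against an S3 data lake are what let the team burst to large, short-lived clusters instead of queueing for a shared on-prem one.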

What are you doing next?

  • 18:15 There's a pipelining solution called FireCloud, built on GCP, using a set of tools starting with GATK.
  • 18:25 I had explored using that as a reference to potentially build a pipeline with variant spark.
  • 18:35 I was at an event where I got introduced to one of the designers of FireCloud, and set up some meetings to collaborate.
  • 18:50 We are in discussions now to set up some kind of collaboration, which is incredible.
  • 18:55 They sit at the fulcrum of some of the most important research in the world - in the middle between academia and production in the United States.
  • 19:05 When I went up to Boston in September, there were 15 genomic research companies right along the road - it was a fascinating visit.

Your career started pretty late?

  • 19:55 This career - but I’ve had a lot of careers.
  • 20:00 I’ve done the 25% thing all of my tech career.
  • 20:15 I have seen what can be done by a small group of curious technical people in conjunction with people who could make use of those resources.

What is 25% time?

  • 20:30 On my calendar, I color-code it: green for money, light green for potential money, and orange for my 25% time.
  • 20:50 At one point it was Zambia, teaching kids programming, and now it's bioinformatics.
  • 20:55 It could be learning, working with a client, collaborating - it’s non-billed time in service of someone else.

Do you teach yourself a new language every year?

  • 21:10 I’m non-traditional in so many ways - one of the other ways is my degree in linguistics.
  • 21:20 I stopped taking math in my first year of high school, and took no computer science.
  • 21:30 I have this idea that computer languages are interesting from a linguistic standpoint.
  • 21:40 For example, I got interested in Ballerina because I'm interested in visual dialects, and Ballerina includes live visual diagrams and annotations so you can generate YAML files.
  • 22:15 The year before, I learned Kotlin, an evolution of Java - the language I taught kids with.
  • 22:50 I have wanted to do research - maybe in the next phase of my career.
  • 23:00 I have wanted to find resources - and I haven’t found them.
  • 24:00 I have been working with Jupyter notebooks, with the amount of visualisations, services, and data they bring together.
  • 24:25 As we increase in complexity we need visualisations - that’s why Ballerina was so interesting to me.

What final thought would you leave for our listeners?

  • 25:30 One technique I use when learning new programming languages is to try and bring someone along with me.
  • 25:40 I do a lot of remote pair programming with college students, and it’s been such great fun that I highly recommend it.
  • 26:00 We usually use VSCode, and build something with pair programming; it’s a way of helping the next generation.

More about our podcasts

You can keep up-to-date with the podcasts via our RSS feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and Google Podcasts. From this page you also have access to our recorded show notes, which all have clickable links that will take you directly to that part of the audio.
