Bio: Jeremy Pollack, a senior engineer at the DNA research web site Ancestry.com, was once part of a development team whose site's code endured the ultimate test: withstanding the traffic generated by Salesforce Chatter's successful Super Bowl ad without adverse effects. With successful talks at QCon SF and HBaseCon, Jeremy's views on software engineering are more in demand than ever.
Software is Changing the World. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.
Good to be here.
Well, really well. Despite being up against a very popular talk from someone at Facebook, we got quite a few attendees and people seemed to like it. It was informative; we tried to pack in as much actual information as we could without being boring. So yes, I think it went pretty well.
Yes. So, it all started with an algorithm called Germline. That is Germline with a G. That distinction will become important later. It was an algorithm that a bunch of very smart scientists came up with at Columbia University.
The idea is that you have this pool of DNA and you examine that pool of DNA to find who within that pool is related to each other and how they are related to each other.
So this was great; it is good for an academic setting. Oh, I should mention this: they made an open-source implementation of this algorithm in C++, and it is good for small sets of data.
I think in academia you usually don’t get much more than maybe 10,000 or 20,000 people, and for that, the old implementation was fine. But at Ancestry we have a much bigger pool; right now I think it is at 200,000 people.
The old implementation just did not scale: it was single-core, running on a single machine, and it did not save any of its intermediate calculations. So if you wanted to add 1,000 people to a pool of, say, 200,000, it would not just figure out the results for the new 1,000 people; it would recompute the results for everybody, which was really bad.
Again, there is nothing wrong with the implementation, but for the scale that we are dealing with, it just did not work. So what we did was a more or less clean-room implementation of this algorithm, and we did it in Java. We used Hadoop to get parallelization and fault tolerance, and we used HBase to hold the intermediate results.
It was a success: a 1,700 percent performance improvement. So yes, the project went really well.
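The incremental idea described above, only computing results for new samples instead of redoing the whole pool, can be sketched in plain Java. This is a toy illustration, not Jermline's actual code: a HashMap stands in for HBase as the store of intermediate results, and the match function is a meaningless stand-in, so only the comparison-count bookkeeping matters here.

```java
import java.util.*;

// Toy sketch of incremental matching: pairwise results are cached (a
// HashMap stands in for HBase), so adding a new batch only compares new
// samples against the existing pool; old pairs are never recomputed.
// All names here are illustrative, not taken from Jermline.
public class IncrementalMatcher {
    private final List<String> pool = new ArrayList<>();
    // cached pairwise results, keyed "a|b" with a < b lexicographically;
    // a real system would query this store instead of recomputing
    private final Map<String, Double> matches = new HashMap<>();
    public int comparisons = 0; // instrumentation for the example

    // stand-in similarity function; a real matcher compares DNA segments
    private double matchScore(String a, String b) {
        comparisons++;
        return (a.hashCode() ^ b.hashCode()) % 100 / 100.0;
    }

    public void addBatch(List<String> batch) {
        for (String s : batch) {
            for (String existing : pool) {
                String key = s.compareTo(existing) < 0 ? s + "|" + existing
                                                       : existing + "|" + s;
                matches.put(key, matchScore(s, existing)); // only new pairs
            }
            pool.add(s);
        }
    }

    public static void main(String[] args) {
        IncrementalMatcher m = new IncrementalMatcher();
        m.addBatch(Arrays.asList("s1", "s2", "s3")); // all pairs: 3 comparisons
        int afterFirst = m.comparisons;
        m.addBatch(Arrays.asList("s4"));             // only s4 vs pool: 3 more
        System.out.println(afterFirst + " then " + (m.comparisons - afterFirst));
    }
}
```

The point of the sketch: the second batch costs work proportional to (new × pool), not (pool × pool), which is what made adding 1,000 people to 200,000 feasible.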
No. Yes, it is Linux, but I think with C++ you could compile it anywhere. I mean, I compiled it on my Ubuntu box, but I am sure you could compile it on Windows, too. I do not think the operating system was really a factor.
OK. It’s funny too because at the beginning of my presentation Bill took a poll and said: “Ok, how many people are familiar with Hadoop?” That was so that we knew how much time to spend on it. So, I even had a slide in there that was like the world’s fastest crash course in Hadoop.
But the idea is that it is a way to spread out processing to a cluster and also achieve some fault tolerance and data redundancy. It uses a file system called HDFS, the Hadoop Distributed File System, so it is a distributed file system with built-in replication and fault tolerance.
It also gives you MapReduce, which is a way to spread out the processing of tasks to multiple nodes on a cluster. Then we used HBase, which is a database built on top of Hadoop, so it leverages all the advantages that Hadoop gives you as far as built-in scalability, fault tolerance, etc.
What makes Hadoop really special is the fact that you can run it on commodity hardware. We have a slide where we talked about what we did before we went to Hadoop: we tried to get as much mileage out of the old implementation as we could, and what that requires is scaling vertically.
Basically, you have a big box and you just keep adding memory and adding processors until you cannot do that anymore. The problem with that is that it is not sustainable: after a while you are up in the supercomputer realm, where things get really expensive, and there is just no reason for it.
With Hadoop you can do it on commodity hardware and you could just scale up as you need to.
I never really thought about that. I guess maybe a framework? I mean, it specifies HDFS, which is a distributed file system, so you can think of that part as a file system.
Then MapReduce - you can think of that as an execution framework for parallelization. So it is kind of two different things. Yes, I think that would probably be the best way I could describe that.
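The map/shuffle/reduce model just described can be illustrated in plain Java without any Hadoop dependency. This is only a single-machine toy, the classic word-count example; on a real cluster, Hadoop runs the same phases distributed across many nodes.

```java
import java.util.*;
import java.util.stream.*;

// Toy illustration of the MapReduce model: a map phase emits words, a
// shuffle groups identical keys, and a reduce phase counts each group.
// This runs on one machine; Hadoop distributes the same three phases.
public class ToyMapReduce {
    public static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
            // map: each line -> a stream of words (the emitted keys)
            .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
            .filter(w -> !w.isEmpty())
            // shuffle + reduce: group identical keys, count each group
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts =
            wordCount(Arrays.asList("to be or not", "to be"));
        System.out.println(counts.get("to")); // 2
    }
}
```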
Again, because it lets you do this great parallel computing thing on commodity hardware. It is also pretty standard; there is a great community around it. It is the kind of thing whose time has come.
I mean, we are now at a point where scaling horizontally makes the most sense, and so there was going to be something like it, I am sure. I know that in the plain old parallel computing space there have been other packages.
But I think that by giving you both the distributed file system and MapReduce, Hadoop was sort of ideal, and it came around at a time when something like that was really necessary. I would have to look into the history of it to tell you why Hadoop and not something else.
They are great people who are very smart. What is great for me is that they all came from academia, so they were, at some point, either TAs or professors, and they are really good at explaining things.
My background is purely CS, although I did take a couple of science classes in high school and college. They have been really good about teaching me their areas of research, and I picked up population genetics just from talking with them.
It is interesting because they do not really follow the same model that we do in software. I mean, we have a product-oriented approach: we have test frameworks, we have deployment, we have Agile development, we have Scrum, we have the notion of a Sprint, we have this idea of regular releases, all basically oriented around creating a product, creating a thing.
Their focus is more on research and making discoveries and writing papers and getting published. And so it works out really well because they - one of the scientists is actually really solid on CS. So he tends to write a lot of his own stuff.
But a number of them are just more purely research oriented. So it is kind of the best of both worlds: they can give me their discoveries and I can go and implement them in a way that is scalable and robust, using these new technologies.
It is actually really funny because I was having a talk with somebody about this the other day. This is someone who is a few years younger than me; I think he has been out in the industry only a few years. And he was saying that Agile did not mean anything to him.
Because that is the only way he has ever known. He says he has heard the phrase Waterfall thrown around but he does not know what it is. If you look back in the history of software development, there was a time when the accepted way of doing things was this multistage process with discrete hand-offs.
You would have the requirements team, along with the business analysts. The requirements team would come up with this binder full of specifications that would get thrown over the wall to the developers.
They would work in isolation and that would get thrown over the wall to the UI people. I am not really sure where the UI people would come in. Then on down they would finish it up and then they would go to testing.
Of course, testing would happen last and of course they would find all the bugs and it would get cycled back to the developers. It is really just not a great way to do things.
Martin: Agile is not?
No, no. This is Waterfall. That is the old way.
It was sort of the way of doing things, the accepted way of doing things before Agile caught on. That was the way things were done. Then, of course, somewhere in all that process and all the hand-offs, your project could get killed.
And if your project gets killed somewhere in there you have nothing, you have like a half-done project. The idea behind Agile is to continuously be producing something that works.
So you start small and then you iterate on it. That has a number of benefits. One: it gets your software into the hands of the people who are using it faster. So you get quick feedback, testing is baked into the process.
It is not this thing where, after you have created all this stuff, it finally gets tested and that is when you find the bugs. I am a proponent of test-driven development; I could talk your ear off about that. I am huge on that.
But I think it is kind of an assumed part of Agile. The idea is that if you follow this, then, let's say your project gets killed at some point: since you have made sure that at every step of the way you are delivering a workable, useful thing with business value, you actually do have something.
It is not just some half-baked thing. It might just be still a prototype or it might not be fully featured but at that point you still have a working thing. And I think that is pretty much what everyone does now.
I am sure in some corporate strongholds, Waterfall still rules. But I think it has become accepted in the last few years that Agile is the way to go.
That is a good question. I would say that would apply more to my team. The thing that I talked about, the DNA matching, is what I worked on; that was my big project for the better part of the last year or so.
That is actually part of a larger pipeline. So we do actually a couple of things. One thing we do is the matching, which I talked about. The other thing is we decode people’s family origins.
So we figure out, for example, if they are a third Japanese, or a third Scottish or a third Indian or whatever. That is another thing that we do. So we have this pipeline and it is a pretty standard workflow: DNA comes in, we do a bunch of stuff to it, at some point it diverges into matching versus the family origin calculation.
Then eventually we compile the results. We send it off to the front-end because we have a separation of responsibilities. So I would say that the way that Agile applies to this would be to the team as a whole.
For example, we started out using the old implementation of Germline, with a G. The one that I have made is Jermline with a J, because my name is Jeremy and I like bad puns.
So our original version of the pipeline, which had the old version of Germline, wasn't ideal, and as Bill said in the presentation, we knew pretty early on that it would not be sustainable, that we would run into problems.
But meanwhile we had a business to run. We had samples that were waiting to be processed; we had customers that were waiting for results.
So, it was like “OK. This is not perfect, but we will go with it. And while we are going with this, I will be working on the actual, more sustainable solution”.
That is just one example, I would say. There are a number of other places with this team where things have been improved incrementally. But the idea was that we want to get this business unit up and running. So if it was not perfect at first, that is fine, but it would buy us some time so that we could actually perfect it.
Ah, test-driven development, yes. Anyone who knows me knows that I am obsessive about test-driven development. The idea is that you should have parity between your code and your tests: you should not have any line of code that is not tested.
Basically, what you do is you write this suite of automated tests that you can run every time, any time you want to. But certainly before you check in. Let’s say I write a class, I have some methods in it so I write a test class that tests those methods, right?
So I would say that is my program. That is one class. I check that in and that is fine. So then I add another class and I create a couple of methods in there.
And then I test those methods in a test class, check that in, and that is fine. Let's say now I have to go and change the first class, class A. Let's say there is something internal about it that I need to change, any sort of foreseeable change, a change to the data schema or something.
So the thing is this: since I have this test infrastructure, I can go ahead and make some changes to the first class and then run the tests again, the same tests that I had before.
Those tests should still work and if they do not, that means I broke something. Either that or maybe I changed something about that class’s interface. The bottom line is that, over time, you accrete this test suite and so that way you can spot regressions really easily.
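That loop can be sketched with a toy class and its tests. The class, the method names, and the plain-assertion style are all invented for illustration (the project itself used TestNG); the point is that the accumulated test suite survives internal refactors and flags regressions before check-in.

```java
// Minimal sketch of the test-first loop: a production class plus a test
// suite that runs before every check-in. If an internal change to the
// class breaks a test, a regression was introduced. Names are invented.
public class TddExample {
    // the production code under test ("class A" in the discussion above);
    // its internal representation can change freely as long as tests pass
    static class Temperature {
        private final double celsius;
        Temperature(double celsius) { this.celsius = celsius; }
        double toFahrenheit() { return celsius * 9.0 / 5.0 + 32.0; }
    }

    // the test suite accreted over time; run it any time, and certainly
    // before check-in
    static void runTests() {
        assertClose(new Temperature(0).toFahrenheit(), 32.0);
        assertClose(new Temperature(100).toFahrenheit(), 212.0);
        assertClose(new Temperature(-40).toFahrenheit(), -40.0);
    }

    static void assertClose(double actual, double expected) {
        if (Math.abs(actual - expected) > 1e-9)
            throw new AssertionError(actual + " != " + expected);
    }

    public static void main(String[] args) {
        runTests();
        System.out.println("all tests pass");
    }
}
```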
For some who have never done this, it takes some getting used to. I know some older programmers who are still not completely sold on the concept. But I would never want to develop without it now. In fact, I do not think I would want to work someplace where they did not value testing.
It was really important for this project too because it was so data intensive. For example, you are matching segments of DNA. I think I came up with maybe 30 or 40 test cases for ways that these segments can match up.
And then I wrote a tool to generate basically fake DNA that fulfilled the specifications of these 30 or 40 ways that segments can match up. That is what I used for my unit tests.
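For a flavor of what those segment cases look like, here is an invented classifier for the ways two half-open segments can relate to each other. The case names and the classifier itself are illustrative only, not from the real suite, which covered 30 to 40 such configurations.

```java
// Illustrative only: DNA matching compares chromosome segments, and two
// segments [aStart, aEnd) and [bStart, bEnd) can relate in a handful of
// basic ways. Test data generators enumerate configurations like these.
public class SegmentCases {
    enum Relation { DISJOINT, ABUTTING, PARTIAL_OVERLAP, CONTAINED, IDENTICAL }

    // classify how segment [aStart, aEnd) relates to [bStart, bEnd)
    static Relation classify(int aStart, int aEnd, int bStart, int bEnd) {
        if (aStart == bStart && aEnd == bEnd) return Relation.IDENTICAL;
        if (aEnd == bStart || bEnd == aStart) return Relation.ABUTTING;
        if (aEnd < bStart || bEnd < aStart) return Relation.DISJOINT;
        if ((aStart <= bStart && bEnd <= aEnd) ||
            (bStart <= aStart && aEnd <= bEnd)) return Relation.CONTAINED;
        return Relation.PARTIAL_OVERLAP;
    }

    public static void main(String[] args) {
        System.out.println(classify(0, 10, 5, 15)); // PARTIAL_OVERLAP
        System.out.println(classify(0, 10, 2, 8));  // CONTAINED
        System.out.println(classify(0, 5, 5, 9));   // ABUTTING
    }
}
```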
And then once we were ready, once the actual coding was done and we were satisfied that it was more or less correct, then we moved on to another level of testing which was using actual real data.
There are some sets of DNA that people have generously put in the public domain for academics and people like me, so we used those to test the software.
Another thing we did was an integration test. So I kind of take it a step further. You could say using real data was an integration test. But there is another kind of integration test. And that involved taking the generated test data that I made and running it against the real end point.
Again, this is kind of extra credit. I mean it is more important to have your unit test with your mock data. Because when you are writing a unit test you do not want to hit any real end points. You do not want to be dependent on that.
But I think it is also good to have tests that call a real end point, with set-up and tear-down. Let's say you are hitting a database: the test sets up a fake table in the database, but it actually hits the database.
I find that really helpful too. So you could say you used like three different methods of testing on this. Between the unit test, the integration test and then finally using real data.
That made us confident that our thing worked, and the project has 89% test coverage, which is great. And it has not failed once in production, and it has been in there for a while now. So I would say that the success of TDD speaks for itself.
I remember when Agile first became a thing, and I want to say it was like 2006, there was the Agile Manifesto, and I am trying to remember where testing sat in that. I am pretty sure that it was part of it, but I think it is now accepted.
Again, the old Waterfall approach was that you would test last: after the UI and code were in place, then it would go to the testing teams. I think that would be a horrible way to develop software.
To me, testing should be assumed in Agile, and I think it generally is, but of course you would probably get people arguing with you about that. People will argue about anything.
Oh yes, it was Salesforce. Yes, that was interesting, just because the Super Bowl commercial was for Chatter, which is the social feature of salesforce.com.
So they ran a Super Bowl ad for Chatter, and they had a sign-up mechanism in place for people to sign up for Chatter. But it was not really meant to scale, because it was just like maybe you got an offer in your e-mail saying: “Hey, check this out!” It would handle one person at a time.
It was a synchronous process: you would put your e-mail address in, press the button, it would do a bunch of stuff and then it would say: “All right, here you go! Here is your account.” But for a Super Bowl level of traffic that obviously did not work.
So I wrote an asynchronous, queue-based system: you would sign up, it would go into a queue, and we would say, “OK, we will get back to you via e-mail.” Then we would process it as we could, and get back to them over e-mail to tell them that their account had been created.
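The shape of such a queue-based sign-up flow can be sketched with a BlockingQueue and a worker thread. Everything here, the class names and the stand-in provisioning step, is hypothetical and not Salesforce's actual code: the web tier returns immediately after enqueueing, and a background worker drains the queue at its own pace.

```java
import java.util.concurrent.*;

// Rough sketch of an asynchronous, queue-based sign-up system: requests
// are enqueued immediately and a background worker provisions accounts,
// decoupling user-facing latency from processing capacity. Illustrative
// stand-in code, not Salesforce's implementation.
public class SignupQueue {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    public final ConcurrentLinkedQueue<String> created = new ConcurrentLinkedQueue<>();

    // the web tier calls this and returns to the user right away
    public void enqueue(String email) { queue.add(email); }

    // background worker: provisions accounts at its own pace
    public Thread startWorker() {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    String email = queue.take(); // blocks until work arrives
                    created.add(email);          // stand-in for provisioning;
                    // a real system would now send "your account is ready"
                }
            } catch (InterruptedException e) { /* shutdown */ }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) {
        SignupQueue s = new SignupQueue();
        s.startWorker();
        for (int i = 0; i < 1000; i++) s.enqueue("user" + i + "@example.com");
        long deadline = System.currentTimeMillis() + 5000;
        while (s.created.size() < 1000 && System.currentTimeMillis() < deadline) { }
        System.out.println("accounts created: " + s.created.size());
    }
}
```

Because the queue absorbs bursts, a traffic spike only lengthens the e-mail turnaround instead of overloading the synchronous path.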
Everything went really smoothly. I think the peak load was 900 or 1,000 accounts in, well, this is going back a couple of years; I think it might have been 900 or 1,000 in a minute. It might have been in a second, so I do not want to sell myself short here.
I would actually have to look. But yes, we actually got some peak loads and handled it with grace and aplomb. It actually could have handled a lot more. And I would have to go back into my notes and look.
But we stress tested it to handle a ridiculous amount of traffic, because it is Salesforce. You cannot be a big company like that and have your site go down because of load. So that was a cool project.
Again, this is going back to my days at Salesforce. This was interesting. I worked on the web team that handled content management and worked with the content management system. We had a way of rolling content.
The way it started out was just one stage; you just pushed stuff. You changed some content and then you just push it to the live version of the site, from staging to live.
That was not good, because sometimes there would be errors; there were little embedded codes that they would get wrong, and stuff like that.
We had a suite of tests that were, I believe, Selenium tests; I think they were functional tests run through Selenium. They loved Selenium at Salesforce and they used it well. So the idea was that before people could push content, they would have to test it.
They would have to push it to the staging server, test it on the staging server, then push it to live and test it on live. What I wrote was sort of an integration piece between the content management system and Jenkins, which is where we touched off these Selenium tests.
Actually, I think it was Hudson at that time. The idea was that you would have this little console in the CMS. You would hit a button to push the content to staging; you would hit another button to run the tests, which would touch off the Selenium tests in Hudson, or now Jenkins, and then it would come back to the console with a success or failure message.
If it failed it would not let you move on and push stuff. I am surprised I remember all of this; it has been a couple of years. That was the idea. It was pretty cool because Hudson, now Jenkins, gives you the ability to do pretty much everything that you can do through the web interface through the REST API. So it was actually really easy to interoperate with, and fun. It was a fun project.
This was a case of hybrid development, because Selenium and Jenkins are obviously third-party tools, and so was the content management system.
But the piece that I wrote was custom code, because we were the only ones doing it. It really depends on what you are doing, I guess. I mean now, for testing, I use TestNG, and we use Jenkins and Maven, which makes you run tests before it lets you create a package. I guess I use pretty much entirely standard tools now.
Yes, we have a really great mobile app. I think it has even won some awards; it is really popular, and they did a really good job with it. We also have a Facebook app, which is nice because a lot of people are Facebook friends with their relatives. So it gives everyone a nice way to plug in there.
Just getting to talk to a lot of smart people. QCon covers software development, which is a very broad spectrum of topics, and I saw a very good selection of the latest developments in the industry.
It draws some of the smartest people in the industry and really covers a lot of ground. I was impressed by how much ground was covered, and by how much interesting content and how many interesting people there were. Yes, very well done.
I go back to work. The idea is that scientists are always cooking up ways to improve the service, improve your results. Right now our focus is on improving results, helping people lift the signal out of the noise.
I am not saying that we are noisy, but for example, for fourth-cousin-or-closer matches, we are at least 90 percent confident in those results.
As you get further out from that, we get less and less confident. A sixth or seventh cousin match we are not going to be as confident about. What is happening is we give people results that we are really confident about.
Then we also have a lot of results that we are less confident about. And some of them might actually be false positives or are just very weak connections.
I have been working with the scientists on ways to tighten that up, and they have come up with some really good algorithmic improvements. So my next project is to go and implement those in my code, in Jermline, to make it better and improve the experience for our customers.
Well, you can run your test and if all your tests pass, there is nothing to debug. But if your tests do not pass then you have got to do some debugging.
No, but if you do not do enough testing, your users then become your testers. Then you get angry users and they are like: “This does not work. There is a bug”.
So the idea is to front-load it: do the testing yourself and do not push that onto your users, because that is how companies get a bad name.
I spoke at HBaseCon in June and that was a great experience. The HBase community are amazing, great people. I go to the meetups in town regularly, so if you go, I will probably see you there.
Yes, I would love to do more speaking. And as we continue to improve the service, we will be writing papers. We have a paper that we are working on that is going to be published. And so the rest of the world will get to see the nuts and bolts of what we are doing. And we hope to keep doing that.
My favorite technologies. Well.
Martin: …like favorite programming languages; do you anticipate any new developments in Hadoop or test-driven development?
Yes. I think test-driven development will continue to catch on, when you have people like me who can talk about its success, and just how much stress it removes from development, how much uncertainty it removes, when you know that your stuff works and that you do not have to be constantly putting out fires from customers.
I think the wisdom of TDD is something that will continue to catch on. Hadoop is obviously going to continue to catch on too. I am not saying that no one will ever come up with something better, but I think it has a lot of applications.
I think people will continue to write things that mask some of the complexity of Hadoop. I know Hive is often touted as sort of the killer app for Hadoop, because it lets people query their data using a SQL-like language.
What is the first thing that happens when you give someone something that is NoSQL? They ask: “How do I write SQL on it?” That is actually happening in the HBase space: at HBaseCon there were like six or seven different ways of doing SQL on top of HBase.
HBase being, of course, a NoSQL database. But I think you will see more of that. I think these technologies are here to stay, at least for a while. I think the development you will see is ways of hiding their complexity, making it easier for, say, my science colleagues, who may not want to write MapReduce jobs in Java, so that they can use these tools in their work.
I think we will be seeing more of that. I think HBase will continue to catch on. It is maturing at a very fast pace; it is very stable and incredibly scalable. We have had very good experiences with it, and I think as more and more companies have these experiences, you will see more of HBase out there.
Martin: We have been talking with Jeremy Pollack here at QCon San Francisco. Jeremy, thank you for being with us.
Thank you very much.