Chris Mattmann on Big Data Infrastructure for Scientific Data Processing


1. We are here at QCon London 2014, and I’m sitting here with Chris Mattmann. So Chris, who are you?

Hi, thanks. I kind of wear three hats. One of my hats is Chief Architect at the Jet Propulsion Laboratory: I build data teams for remote sensing missions in Earth science, astronomy and planetary science, and put those teams together to process large amounts of information, Petabytes of it. I’m also an Adjunct Associate Professor at the University of Southern California, where I teach graduate courses in search engines and information retrieval and in software architecture. The third hat, which is related to this, is that I’m a member of the Board of Directors at the Apache Software Foundation, one of the world’s largest open source foundations; their flagship product, the HTTP web server, powers over 53% of the Internet. I mash all that stuff together, and that is who I am.


2. You work for NASA essentially, can we say that?

My primary appointment is at NASA.


3. The first question of course is where do you hide the aliens, I assume you found them already?

You know, that is an interesting question. I could tell you more, but then I’d have to kill you. Suffice it to say, if you ever need to find where the cryotubes that are storing things are, I might be the person to ask.


4. OK, we’ll get back to you off the record. Let’s assume we haven’t found aliens yet and you’re still doing your work. You are involved in many projects; which are some of your favorites, the ones dear to your heart?

Sure. Related to the aliens, there is the next-generation astronomical telescope called the Square Kilometer Array. It’s being built by an international consortium of nations (Europe, the UK), but primarily by South Africa and Australia, so it’s actually being built on two continents. It’s the next-generation ground-based astronomical telescope and it’s going to generate 700 Terabytes of data per second, which is mind-boggling. That type of stream is really important because in astronomy there has been a big shift. The old adage was “the sky is the archive”, which means that if we miss something we’ll see it again, and when we see it again we can image it again. But what they are finding in astronomy, through the search for extraterrestrials or just more generally, is that there are things that, if we miss them, we won’t see again: things like pulsars, radio transients and so on. So of that 700 Terabytes of data generated per second by the Square Kilometer Array, they are going to have to keep a big portion around, potentially to examine it later, and not throw it away. That’s a really cool project in astronomy; let’s pick one in earth science.

In earth science, something that is dear to my heart, living in southern California, is drought and water management. As it turns out, the water flows down from the mountains and comes from the mountain snowpack, especially in the western US, and for us that is in the Sierra Nevadas. So there is a project called the Airborne Snow Observatory, led by Doctor Tom Painter at JPL. That project is a joint LiDAR and spectrometer airborne platform: the LiDAR measures snow depth, how much snow is left in the mountains, and the spectrometer measures surface reflectance, or albedo, which indicates how fast the snow is melting. The combination of how much is left and how fast it’s melting has all sorts of impact on water management: how much water the city of San Francisco and the Hetch Hetchy reservoir keep around, how much they let accumulate, and how much they actually run off and provide to the local, city, state, county and federal governments. So we are providing much better measurements of that snowpack than people could compute by going up to the mountains and sticking a stick in the ground. We are also helping in the types of terrain where that could cost lives; there are some places where we can’t go up and stick a stick in the ground to measure snow because it’s just too dangerous or treacherous, so remote sensing is typically used to solve that. Those are two projects that I think are really important for different reasons, but both sit in the larger context of things like Big Data and Data Science.


5. Let’s start with your Square Kilometer Telescope. What does the name mean? If it’s built on two continents, can it be in one place?

It originally meant that it was going to be a square kilometer of telescopes, covering low frequency, mid frequency and some high frequency, like the larger dishes you see in the movie “Contact”, and then the low-frequency ones had the little square plate types of things. It was literally going to be laid out over a square kilometer, and originally the thought was that it would be either in South Africa only or in Australia. There was a big competition, and some nationalism, between the nations to figure out who was going to host the next “it” science instrument. What they decided around 2011-2012, after some meetings and after an international consortium looked at this, was that the politically correct decision was to host it in both places. So South Africa got pieces of the low frequency, I believe, and the mid frequency, and Australia got other pieces of, I think, the mid frequency and the potential high-frequency dishes. Both nations were actually building precursor instruments: in Australia they have something called ASKAP, the Australian Square Kilometer Array Pathfinder, and in South Africa they have something called MeerKAT, whose first seven dishes are called KAT-7. Those are precursor instruments that they’ll also use as part of the Square Kilometer Array. But it was literally going to be a square kilometer of these various-size dishes in one of these regions, and they ended up picking both.


6. These two arrays are on different sides of the planet. Is that a problem? Do they have to communicate, or is each just a data source that later on gets synchronized with the other?

There could be some communication between them, but they don’t have to synchronize at that level. What will likely happen is that Australia, with their portion of the SKA, is going to take some data and image particular portions of the sky, and South Africa will do the same. If they do want to bring them together, there is a technique called interferometry, where they take separate dishes or instruments all looking at particular common sources, and the combination of these separate dishes, when processed together using interferometry techniques, actually looks like one big dish covering that entire area. So that is an option, and I’m sure they are looking at doing interferometry for those types of targets, but each will likely have a mode where it can operate autonomously.


8. First of all where do you even put that, I mean what kind of storage can you use for that?

The good news is that it’s being built over the next decade, so they’ve got a little bit of time to figure some of that out. But that is the reason it’s not just a science project, not just something no one cares about; it’s the reason that IBM is donating Blue Waters computers to the project and that Amazon is going down to South Africa and building data centers. There are amazing Computer Science and Electrical Engineering challenges in terms of what we use today. A lot of times the best thing we can do is SSD arrays, solid-state devices, and we don’t have ones that are very voluminous in terms of the amount of data they can actually store. I’ve heard of some Petabyte-scale SSD arrays, but even with those, keeping up with that type of data rate is impossible, so we are going to have to make advancements that aren’t there yet. Likely not 100% of that data will be kept around, but not just one percent either; it will be somewhere between 1 and 100 percent. Machine learning approaches that tell us what data to save can also help, hopefully once we have set targets on these fast transients, and in that respect the time domain is also really important. So how do we deal with that? I don’t pretend to have all the answers, because that is what a number of people besides me are working on.
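To put the quoted figures in perspective, the following is a back-of-the-envelope sketch of how much data would accumulate per day at 700 Terabytes per second for a few retention fractions. The decimal unit definitions and the sample fractions are assumptions made for illustration; only the 700 TB/s rate comes from the interview.

```python
# Back-of-the-envelope storage arithmetic for the data rate quoted in the interview.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 EB = 10**18 bytes).

TB = 10**12
EB = 10**18
SECONDS_PER_DAY = 24 * 60 * 60

RATE_BYTES_PER_SECOND = 700 * TB  # figure quoted in the interview


def daily_volume_eb(keep_fraction: float) -> float:
    """Exabytes retained per day if `keep_fraction` of the stream is stored."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    return RATE_BYTES_PER_SECOND * SECONDS_PER_DAY * keep_fraction / EB


if __name__ == "__main__":
    # Hypothetical retention fractions; the interview only says "between 1 and 100 percent".
    for fraction in (0.01, 0.10, 1.00):
        print(f"keep {fraction:>4.0%}: {daily_volume_eb(fraction):.2f} EB/day")
```

Under these assumptions the full stream comes to roughly 60 Exabytes per day, and even a 1% retention policy leaves about 0.6 Exabytes per day to store.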

Werner: So that is data that is coming in continuously, just 24/7, it’s not like at CERN where they have an experiment and you gather a Petabyte and that's it.

That is right. The big difference with the SKA is that you don’t turn it off, and if we turn it off there is a problem. But yes, this data is coming in continuously. I don’t know that they’ve thought through all the plans for this yet, but I imagine that, as with any ground-based observing telescope, there will be proposals for time on the SKA. A number of the ground-based observatories that exist nowadays hold solicitations for researchers to reserve or allocate time: you’ve got a project, you want to reserve n hours of time on the telescope, you write a proposal and you justify the science case. My guess is that at some point the SKA will have a similar model, although early on it will likely be focused on checkout, targets and calibration, just making sure that it works. They’ll also prove things out in a number of these early pathfinders that lead up to it.

Werner: It’ll be interesting to see what you guys figure out because that is more than the whole database of Netflix.

Actually, I’ve heard that someone computed it: it will generate the size of the current Internet in 2 days.

Werner: The sky is big.

The sky is big, that is right.


9. Since we are talking about large data or big data as the buzzword goes, so you mentioned, I think in your talk you also mentioned you use a lot of Apache projects, so what do these projects bring to the table, what projects do you use, Hadoop and others?

Hadoop is part of it. The thing I really like about Apache, besides the fact that I’m associated with the foundation (I’m not compensated by them in any way, so becoming a fan of Apache has really been an organic process), is that they have projects that range from big data technologies to end-user office productivity suites to libraries and web servers. So they are very diverse; it’s not like some other foundations where everything is built around a particular tool or domain. The thing I like about Apache projects, and the ones we use a lot, is first that the license is permissive. It’s the Apache License version 2, and it doesn’t push any constraints down to its consumers. So if I wanted to commercialize my technology, which I don’t do per se in my government role but may do under my University hat or something else, the Apache 2 license won’t place any restrictions on me doing that. At the same time, if I want to continue to make the source code available in all my derivative products and works, Apache is very compatible with that and encourages it, and they don’t care either way.

So the license is nice, and the community models are really nice. Apache has something like between 100 and 150 top-level projects. Compare that with places like SourceForge that have a hundred and fifty thousand; Google Code has similar numbers, and there are over 2 million repos on GitHub. The big difference at Apache is that they actively triage projects that they feel don’t meet particular criteria: for example, projects that are not releasing very frequently, that are not adding committers, or that lack a diverse organizational background, which matters because with that diversity, if any one organization pulls resources out of the project, the project doesn’t die. Apache looks for these metrics and tries to ensure that they are being fulfilled in their projects. In terms of particular Apache projects that we use, there are things like Hadoop, the Apache OODT project for data processing, and Tika, the digital Babel fish. If you are familiar with “The Hitchhiker’s Guide to the Galaxy”, you put the Babel fish in your ear and you can understand any language.

Tika is that for file formats: its goal is to detect and extract text and metadata, and to identify the language, of any digital file format. So we use things like Tika for science data and for classification, we use OODT for data processing, and we use Hadoop for data processing and also for its distributed file system. I’m really excited about the Berkeley AMPLab; they’ve been open sourcing some of their projects through Apache, in particular the Spark and Mesos technologies. Spark is very fast, something like 30 to 50 times faster than Hadoop MapReduce, because it’s basically taking Hadoop and bringing it into memory, doing sort of in-memory Hadoop operations. Mesos is a distributed resource management system that is in use at Twitter to power their 2000+ node cluster. So these are technologies that are really exciting to me, along with things like Solr and Lucene for search; I’m a really big fan of those. For things like building web search engines, Nutch is another really popular project; it’s the one that Hadoop originally spun out of, and one that I was involved with. So those would be the types of technologies that we use.
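As a rough illustration of the "detect" part of that description, here is a minimal sketch of file type detection by leading "magic" bytes. It is not Tika and does not use Tika's API or its type registry; the function names and the MIME-style labels are assumptions chosen for this example, and it covers only a handful of formats, including two common in science data (HDF5 and classic NetCDF).

```python
from typing import Optional

# Leading-byte signatures for a handful of formats. The labels are illustrative
# MIME-style names chosen for this sketch, not taken from any real registry.
MAGIC_SIGNATURES = [
    (b"\x89HDF\r\n\x1a\n", "application/x-hdf5"),
    (b"CDF\x01", "application/x-netcdf"),  # NetCDF classic format
    (b"CDF\x02", "application/x-netcdf"),  # NetCDF 64-bit offset format
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\x1f\x8b", "application/gzip"),
]


def detect_type(header: bytes) -> Optional[str]:
    """Return a type label if the header starts with a known signature, else None."""
    for magic, label in MAGIC_SIGNATURES:
        if header.startswith(magic):
            return label
    return None


def detect_file(path: str) -> Optional[str]:
    """Read only the first few bytes of a file and classify it."""
    with open(path, "rb") as handle:
        return detect_type(handle.read(16))
```

A real detector combines signatures like these with file name hints and deeper parsing; the sketch only shows why detection can be done without reading the whole file.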


10. It seems like those are a lot of Java projects, so is Java something that is commonly used for data analysis?

The thing within those particular technologies and that stack is that a number of them are primarily written in Java, and Apache has a number of Java projects at the foundation, although we’ve had an increasing number of Python projects and C projects lately. Apache did go through this evolution of being very focused early on on C, with the web server, then growing around this whole Java ecosystem, then growing to some Python and some other things, and now it is sort of meeting this big data Java renaissance and coming back. In terms of actually processing the data with Java, we are not loading all the data, like a Petabyte, into Java and then processing it. A lot of these technologies, including Hadoop and OODT, are orchestrating technologies. They orchestrate the underlying algorithms, which do the heavy lifting and number crunching, and technologies like OODT or Hadoop figure out what nodes those particular algorithms should run on, what subset of the data they should process and things like this. So we are not actually loading the entire data set into Java, but we are using Java-based technologies to orchestrate the data processing and data management related to that, hopefully in an efficient way.
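The orchestration idea described above can be sketched in a few lines. This is a toy model under stated assumptions, not how OODT or Hadoop is implemented: the function names are hypothetical, the assignment policy is plain round-robin, and local threads stand in for remote nodes. The point is only that the orchestrator decides which subset of the data each node handles, while the algorithm itself is passed in and does the actual work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def plan_assignments(granules: Sequence[str], nodes: Sequence[str]) -> Dict[str, List[str]]:
    """Round-robin each data granule onto a node; returns node -> granules."""
    if not nodes:
        raise ValueError("at least one node is required")
    plan: Dict[str, List[str]] = {node: [] for node in nodes}
    for index, granule in enumerate(granules):
        plan[nodes[index % len(nodes)]].append(granule)
    return plan


def _run_subset(algorithm: Callable[[str], str], subset: List[str]) -> List[str]:
    """Apply the underlying algorithm to every granule in one node's subset."""
    return [algorithm(granule) for granule in subset]


def orchestrate(plan: Dict[str, List[str]],
                algorithm: Callable[[str], str]) -> Dict[str, List[str]]:
    """Run `algorithm` over each node's subset; threads stand in for remote nodes."""
    with ThreadPoolExecutor(max_workers=max(1, len(plan))) as pool:
        futures = {node: pool.submit(_run_subset, algorithm, subset)
                   for node, subset in plan.items()}
        return {node: future.result() for node, future in futures.items()}
```

For example, `orchestrate(plan_assignments(["g1", "g2", "g3"], ["node-a", "node-b"]), str.upper)` processes two granules on the first node and one on the second; the orchestrator never holds more than the granule names, which mirrors the point that the full data set is not loaded into the orchestrating layer.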

Werner: So it’s essentially infrastructure, I would say.

Yes, it’s definitely infrastructure.


11. You mentioned Python, so I’ve seen Python as a very popular tool in this domain, what is your idea, why is that, what makes Python special in that regard?

Python is becoming really popular in scientific programming. I think the reason behind it is the ecosystem support: PyCon getting out there, PyLadies reaching out to women, and just encouraging STEM education; they’ve done an amazing job there. Beyond the ecosystem and the conferences, the package support for their libraries is amazing. There is the Python Package Index, and the fact that I can type easy_install followed by the name of the software and it magically builds it on my device for me and puts it in the right place. All of that surrounding community, build and tool support is really what I think has gotten it out there, along with the increasing explicit support in Python for scientific constructs like arrays, provided by things like NumPy and newer projects like Blaze. So: treating science as a first-class citizen, package management and ecosystem, and the community, I think.


12. So another popular topic for the InfoQ audience is of course big data and you mentioned that you moved away from MySQL to NoSQL solutions, so what was the motivation, was it scalability, was it the data model, what was the reason?

It’s a great question. The project I was talking about during my talk was this regional climate model project, and we had a little bit of an impedance mismatch: we were trying to take scientific data that was array-oriented and shove it into a relational database. The engineers on our project did even worse and shoved it into a single table within that relational database, or some small number of tables, to the point where in MySQL, which was the database we were using, the table broke at about 6 billion records and we lost data. So we started to figure out how to tune it and did some benchmarking, we loaded some more data into it, and then it broke at about 8 billion records and we lost data again. We played that game a little bit, and then we decided this was a project where we could be a little bit exploratory, so we decided to try some NoSQL technologies. In particular, that was because the impedance mismatch was a little bit smaller when getting the data into a sort of flat structure than when putting it into an entity-relationship table model. The additional thing was that some of the NoSQL databases we were looking at, like MongoDB, actually had scientific support for querying, or geospatial support, which is what some of the data was. MongoDB provided a point-radius query tool and a bounding-box query tool. We also looked at Hadoop and Hive and a surrounding technology called Apache Spatial Information System to provide a quadtree index, point-radius support and things like that.
MySQL, as sort of a business intelligence or relational database, doesn’t have a focus on natively providing some of these scientific capabilities, and you have to pull them down or find them. I’m sure there are some plugins out there, and the MySQL community is going to jump up and say: “Yes we do”; it’s just that it wasn’t as obvious, and it didn’t seem to be a first-class part of the database itself. So yes, we did try out some NoSQL databases and had some good luck with them. We also tried out PostgreSQL a little bit because it had some geospatial support as well. Right now, instead of just having a single MySQL database, we have a number of these NoSQL types of data solutions, and we have a way to pick and choose which one to use based on the actual scenario and use case.
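For readers unfamiliar with the two geospatial query types mentioned in this answer, the following is a minimal pure-Python sketch of what a point-radius query and a bounding-box query compute. It is a brute-force scan written for illustration, not MongoDB's or any other database's implementation; the function names are hypothetical, distances use the haversine formula with an assumed mean Earth radius, and the bounding box does not handle boxes that cross the antimeridian.

```python
import math
from typing import Iterable, List, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def point_radius_query(points: Iterable[Point], center: Point,
                       radius_km: float) -> List[Point]:
    """All points within `radius_km` of `center`."""
    return [p for p in points if haversine_km(p, center) <= radius_km]


def bounding_box_query(points: Iterable[Point], south: float, west: float,
                       north: float, east: float) -> List[Point]:
    """All points inside the latitude/longitude box (no antimeridian crossing)."""
    return [p for p in points
            if south <= p[0] <= north and west <= p[1] <= east]
```

A database with native geospatial support answers the same questions through a spatial index (such as the quadtree mentioned above) rather than scanning every record, which is what makes these queries practical at billions of records.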

Werner: So you’ve just made a lot of enemies in the MySQL space and, of course, in the alien friend space, but thank you for the information, Chris, and we will talk about the aliens.

Awesome, ok, thanks for having me!

Apr 15, 2014