The Larger Purpose of Big Data with Pavlo Baron
Interview with Pavlo Baron by Harry Brumleve on Jan 30, 2013

Bio Pavlo Baron is lead architect with codecentric AG. His passions are distributed systems and large data sets – the infrastructure behind what they call Big Data. Pavlo is a frequent conference speaker and has written three German books: “Erlang/OTP”, “Pragmatic IT Architecture” and “Fragile Agile”.

   

1. This is Harry Brumleve, I’m at QCon San Francisco 2012, I’m sitting with Pavlo Baron, Pavlo how did you get your start as a software architect?

Well, you don’t start as a software architect; you start as a software developer, so as a newbie. And being a software architect, from my point of view, depends on experience. To be able to architect things, to architect complex solutions, you need to go through a whole developer life. You will get yourself some bloody noses and all that, and learn how people work, how companies work, and then you realize: “OK, you can contribute not only through coding.” For me, architects code anyway, but you can also contribute on the soft-skills level: you can contribute by talking to customers, talking even to management and all that, because you have the experience and maybe the confidence to talk about technical things on a different level. And you can help teams improve themselves. This is a very important part, and it’s absolutely not about making decisions for everybody; it’s more like helping people and the team make correct decisions.

This is actually what I think of being an architect. So I went through this whole life. I have been in this industry for over 20 years and I started as a regular developer, and I’m still a regular developer, to be honest.

   

2. You are here at QCon giving a presentation on Big Data. What draws you to Big Data?

First of all, for me personally it’s a huge area where I can use all my experience, all my knowledge, and even knowledge that I don’t have yet and still have to gain, and it’s a huge playground for a technologist. But from the perspective of people who are not technologists, it is like this: we cannot live without computers anymore. It’s not that we have Skynet or something; it is just that computers help us in everyday life, you know, and the Big Data area is something that supports this, that helps this development. I’m very interested in this area because I actually see a chance to help mankind, to maybe change the world a little bit, maybe make the world a better place, so that is actually what draws me to Big Data.

   

3. That is a pretty big goal for Big Data?

Yes it is, but anyway, I mean, you need visions, right? Visions and targets are different things. Targets are just small steps, and maybe you can contribute just a little. But you know, we had this huge information previously, huge data models, and we didn’t have these tools. Now we have the tools, we have this information, we can grow this information, we can collect much more information, we have learned how to deal with information, and that is actually something that, from my point of view, should help mankind. So it is not just helping businesses with small things; these are targets. The vision must be, you know, that people use computers, that computers help them make decisions and things like that.

   

4. The name Big Data seems to imply … offer something that small data or medium-size data can’t really provide. What is that? Why is Big Data important?

So that is a funny story. I’ve been to a conference on the management level where they discussed how to tell Big Data from Medium Data, and out of nothing I started laughing. In my opinion it’s not about numbers. I mean, we are not at a “who eats the most chicken wings here” competition; it’s a little bit different. It’s about dealing with any kind of data, this is important, and it can lead to any amount of data. It just depends on what you do and which information you want to get out of this data. Then you will go on and grow this, because you will get hungry, you will need even more and more and more, and this explains the numbers then. Big, in my eyes, or from my point of view, means you cannot say a terabyte is small and a petabyte is big; it’s not that. It’s an unpredictably large amount of data, so that is the point. You cannot tell where the border is, so you just need to use technologies that keep up whenever the data grows, to get bigger, to have even more data on your storage, to use different storage strategies and so on. So I cannot tell you a number really, I wouldn’t.

   

5. That is a pretty good definition I think of Big Data but it’s still ambiguous, how do you tell when a solution is going to use Big Data or when it will produce Big Data?

Well, in my talks I look at this whole area from different perspectives, so there are different perspectives, like for example storage. I mean, first of all you will have the problem that not everything will fit in your common data store, so you will go for different solutions then, and you need to consider things like redundancy and latency and fault tolerance and things like that. So even when you just start storing one chunk of data three times, your storage will grow. Another perspective on Big Data is, for example, when you have users all over the world and you would like to serve your data. How much you have doesn’t really matter: if you want to serve this data from one data center, you will have to fight distances. It’s still physics, right?
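The remark about storing one chunk three times can be made concrete with a small calculation. This is only a minimal sketch: the function name `raw_storage_bytes` and the 10 TB figure are assumptions chosen for illustration, and the factor of three is the one mentioned in the answer above.

```python
TB = 10**12  # one terabyte in bytes (decimal definition)


def raw_storage_bytes(logical_bytes: int, replication_factor: int = 3) -> int:
    """Raw capacity needed when every chunk is stored replication_factor times."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return logical_bytes * replication_factor


# Hypothetical example: 10 TB of logical data, each chunk kept three times.
needed = raw_storage_bytes(10 * TB, replication_factor=3)
print(needed // TB, "TB of raw storage")
```

The data set itself has not changed, yet the storage footprint triples, which is one reason a fixed size threshold says little about whether a system counts as "big".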

The point here is that you want to reach your customer, your visitor, your user, as fast as possible. Just imagine the situation where you are, for example, in the automobile industry, producing cars, and you really want your car video to look real sexy, right? Because when this video starts jumping around and you can count every single frame, what will happen is that the customer will not buy, because it should be tasty. So what you would do is go to a CDN like Akamai or something and make a big fat contract just to place the content close to your visitor. This is still something that you would have to do in this area, because it’s still streaming and you need this media over there, and you need to use infrastructure provided by somebody else, and they will always be closer to your visitors than you would ever be.
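The "it's still physics" argument for placing content close to the visitor can be sketched as a lower bound on round-trip time. This is a rough illustration under stated assumptions: the factor of two thirds for the speed of light in optical fiber is an approximation, the distances are hypothetical, and real latency is higher because of routing, queuing, and processing.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum
FIBER_FACTOR = 2 / 3               # assumption: signal speed in fiber is roughly 2/3 of c


def min_round_trip_ms(distance_km: float) -> float:
    """Physical lower bound on the round-trip time over a fiber path of this length."""
    one_way_seconds = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_seconds * 1000


# Hypothetical distances: a far-away single data center versus a nearby CDN node.
for label, km in [("single data center", 9000), ("nearby CDN node", 50)]:
    print(f"{label}: at least {min_round_trip_ms(km):.2f} ms per round trip")
```

Under these assumptions a 9,000 km path costs roughly 90 ms per round trip before any server work happens, while a node 50 km away stays well under a millisecond, which is the gap a CDN contract buys.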

Another perspective is something that could be described as close to real time. It’s a different perspective; it’s maybe not about storage, but you need, for example, to process a never-ending stream of sensor data. For example, we can automate a hospital. There is a patient with maybe a thousand different sensors that send data all the time, and there are a thousand patients, so you can imagine the infrastructure issues that you need to consider, which first of all appear somewhere in the network layer, so it’s wires and first-mile stuff like that. And you go even further: you want to do something with this stream. For example, you don’t want the nurse to wake up the patient in the night every fifteen minutes just to ask him if he is OK, because he will not like it, actually I don’t. So you want to analyze this data as it comes. It’s the classic area of CEP, of Complex Event Processing: you have sliding windows of maybe two hours and you need to know within those two hours if something goes wrong, so that you can alert somebody and actively send a nurse in there. And the amount of data that you have to process there is real huge. But maybe you wouldn’t even store it. So for the whole infrastructure, what you would do in this case is, for example, utilize your computing power as well as possible. It’s not always a good thing to have processors working at like 100% maybe, but you will still have to do this here, because you will not distribute it all over hundreds of machines; you will need to do this with several machines, maybe just splitting streams.
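The sliding-window idea from the hospital example can be sketched in a few lines. This is a minimal illustration, not a CEP engine: the class name `SlidingWindowMonitor`, the heart-rate thresholds, and the "three out-of-range readings" rule are all assumptions made up for the example. It keeps only the timestamps of suspicious readings inside the window, which matches the point that the stream itself may never be stored. Timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowMonitor:
    """Raise an alert when too many out-of-range readings fall inside a time window."""

    def __init__(self, window_seconds: float, low: float, high: float, max_violations: int):
        self.window_seconds = window_seconds
        self.low = low
        self.high = high
        self.max_violations = max_violations
        self._violations = deque()  # timestamps of out-of-range readings only

    def observe(self, timestamp: float, value: float) -> bool:
        """Process one reading; return True if an alert should be raised now."""
        # Drop violations that have slid out of the window.
        while self._violations and self._violations[0] <= timestamp - self.window_seconds:
            self._violations.popleft()
        if not (self.low <= value <= self.high):
            self._violations.append(timestamp)
        return len(self._violations) >= self.max_violations


# Hypothetical heart-rate stream as (seconds, beats per minute) with a two-hour window.
monitor = SlidingWindowMonitor(window_seconds=2 * 60 * 60, low=50, high=120, max_violations=3)
readings = [(0, 130), (60, 80), (120, 40), (180, 135), (8000, 70)]
alerts = [monitor.observe(t, bpm) for t, bpm in readings]
print(alerts)  # [False, False, False, True, False]
```

The alert fires on the third out-of-range reading within two hours and clears once those readings age out of the window, so a nurse is sent only when the pattern warrants it.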

Anyway, analytics is another problem, because analytics can be really expensive in terms of CPU power. Just try to analyze text for anomalies or for sentiments or things like that; it costs. So maybe you don’t store a lot of data, in whatever terms you take, petabytes, whatever, but you will have to implement correct algorithms that are fast enough, that are reliable enough, and you need to train the algorithms. This also takes time, this also takes some data that you need to prepare, and so on and so forth. So the message here is that you have different perspectives. Maybe there are not many companies who have all these pains, but when you combine all these perspectives, this will be a huge infrastructure with redundancies and with huge storage and with lots of CPUs, with parallel batch processing, with streaming and stream processing. Just try to find out if two people somewhere in a public place are starting a fight, a knife fight. It will not help the guy who is actually down there dying if you only record the event; you want to prevent it, so you need a fast solution there. The alert must be real fast, so this takes computing power, and this takes real huge network bandwidth to stream all this video data, and then the analytics afterwards.

   

6. And so that is specifically a great example of Big Data helping mankind, saving a knife victim. But what you just described is pretty complex, so do you think you need a lot of training in Big Data? Do you need a PhD from a university in order to solve these huge problems that are all-encompassing and affect mankind?

That is a good question. I wouldn’t say you need a PhD; I myself don’t have a PhD, but you need a solid theoretical base for that. I love this idea of the data scientist. We are raising these new people in the industry; they are getting educated to have solid bases in mathematics and computer science, and they combine those things. What is necessary is that these people have the correct mindset. I’m not the biggest world changer maybe, but still, you need to be aware of how you can use what you learn in order to make the world a better place.

The answer, I would say, is “No”, but you need to learn a lot, because when you are prepared for that, when you have these theoretical bases, you will quickly realize that you don’t know anything. And even if you start knowing things, you will actually realize that it doesn’t matter how much you know: what you don’t know is always much more than what you know. But what you learn through a good education is experimenting, being willing to experiment, trying things out, really doing private research, maybe in the night somewhere, instead of hacking down the ultimate MVC framework number hundred fifty thousand or whatever.

   

7. Then if this is approachable by the average developer with some work, is this something that you can see most development teams and corporations embracing?

This is a real good question. When you are speaking of developer teams, as I mentioned previously, we need people on board who are aware of what they can do with data, how they can gain information, how they can use the scientific, analytical approach. It’s not only developing software; it’s being able to try different algorithms, to know how to train them, to know how to tweak them, to know how to tweak the result as well, and being interested in this area as well. I think that has already started with this profession of the data scientist, so I expect that universities will do more about this. What we actually need is the concept of polyglot, not only polyglot developers but polyglot data experts: people who don’t care what tool they use. They just have a vision, a target that they need to reach, and they use every single possible tool, and this toolset needs to be real huge just to get there. So this is a sort of mindset. You maybe need to change the programs that you use in universities to educate those people; you need to provide the correct mindset.

A very interesting concept is having guests from the outside, from the industry, who will explain how to actually use what these people are learning in the universities. What you don’t know when you leave the university is how to tweak things so they work; that only comes from experience. So yes, to have this in the developer teams it needs to start with the education, it really does, but I think we are on the way there.

   

8. You mention the idea of Big Data helping mankind and how universities and development teams can really improve the quality and acceptance of Big Data by adding educated people to the work force, where else do you see the evolution of Big Data going?

This is my opinion: I think that it’s all about developing systems that help people make qualified decisions, so-called Decision Support Systems or Recommender Systems, you name it, and it’s not about building the Skynet that I mentioned previously. Computers are stupid; they are there just to do the job you tell them to do. But they are very good at doing this fast and repeatedly, hopefully without mistakes, without errors, which is not always true, but anyway they can repeat things, they can do these things very fast, so we will use them. And the final click, the final ‘yes’ or ‘no’, will come from a qualified person who uses the computer to decide based on millions of different factors. Everything has been pre-computed, pre-aggregated, so you only need to make a decision. And when you can make this decision, you will teach your computer to do even more, to analyze more, to go further with that. I think the evolution is in the Recommender Systems and in the Decision Support Systems.

Harry: Thank you very much for your time!

Pavlo: Thanks for having me here!

InfoQ.com and all content copyright © 2006-2014 C4Media Inc.