Bio Nathan Marz is currently working on a new startup. He was the lead engineer at BackType before the company was acquired by Twitter in 2011. At Twitter, he started the streaming compute team, which provides and develops shared infrastructure to support many critical realtime applications throughout the company. Nathan is the creator of many open source projects, including Cascalog and Storm.
Software is Changing the World. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.
I’m a software engineer who lives in San Francisco. I used to work at Twitter, where I started one of their core infrastructure teams. As part of my work I’ve been really involved in blogging and open source, and I’m responsible for a few big open source projects: I created Storm, and before that I did a project called Cascalog.
So Storm is an open source stream processing system. It makes it very easy to process massive streams of data in a scalable way, and it provides mechanisms for doing things like guaranteeing that the data will be processed. That’s pretty important and somewhat tricky to do at that scale, but it’s handled automatically by Storm.
Werner: So Storm is written in Clojure I think.
Well, I love Clojure as a programming language, I just think it’s the best programming language I’ve ever used, so I implemented Storm in Clojure. But I wanted Storm to be usable by a very, very wide variety of people, and unfortunately the Clojure community is small when you compare it to, let’s say, Java. So the way I designed Storm, all the interfaces are actually in Java but the implementation is in Clojure. As a user of Storm you don’t even know that it’s written in Clojure; the Java interface is the thing you program to.
Sure, I can just talk about why I created Storm in the first place. Before I got to Twitter, I was part of a startup called BackType, which was later acquired by Twitter. What we were doing is building a product to help businesses understand the effectiveness of their campaigns on social media, so we had these massive streams of data coming in and we had to perform analytics on them. For example, one really simple thing we did is roll up the number of tweets for a URL over a range of time. The way we did it at first was to build a queues-and-workers system: we used Gearman as our queue and we would write these Python workers that would connect to a queue, consume the stream, and update some database. It worked, and certainly that is the way a lot of people start out with stream processing, but we ran into a lot of issues. Once we got to a certain scale we had to deploy a lot of queues and workers, we had to manage these deploys by hand, and it wasn’t really that fault tolerant; any fault tolerance was, again, implemented manually. It was just a pain to build, and we found that most of our code had to do with where to send messages to, where to read messages from, and how to serialize messages; very little of the code was actually our business logic. So then I came up with Storm, which is essentially a general purpose queues-and-workers system without any of the complexity of queues and workers. That is the origination of Storm.
It’s kind of at a different level of abstraction. So Akka is a, what is the best way to describe it?
Werner: [Akka is] basically infrastructure I guess?
Akka is almost like a library for building infrastructure, for having nodes that pass messages to each other and react to the messages, so Storm is a bit higher level. For example, one of the key abstractions of Storm is called a bolt, and a bolt consumes any number of streams and produces any number of output streams. The bolt abstraction is actually inherently parallel, kind of like mappers and reducers in MapReduce: you write this one piece of logic and then it gets partitioned across many machines to execute.
The core abstraction of Storm is a stream, which is just an infinite list of tuples, and tuples are just named lists of values, so you have tuples which contain URLs, person identifiers, timestamps, and so on. Storm is all about transforming streams of data into new streams of data. You do this by defining what we call a topology, where there are basically two things that go into a topology. The first is called a spout, and a spout is just a source of streams in a topology; for example, we might have a spout which reads from a Kafka queue and emits that as a stream. Then we have bolts, which, like I was saying before, process input streams and produce new output streams. You wire together all your spouts and bolts into this network, and that is how things get processed. It’s pretty typical in Storm to have your bolts talk to a database whenever you need to keep persistent state; that is actually one of the most common applications of Storm, just doing realtime ETL: consuming a stream and updating databases, and doing that in a fault tolerant, scalable way.
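The spout/bolt dataflow described above can be sketched roughly as follows. This is an illustrative single-process Python sketch only; Storm's actual programming interfaces are Java, and real topologies are partitioned across a cluster. The names (`url_spout`, `count_bolt`) are made up for the example.

```python
# Conceptual sketch of a Storm-style topology: a spout is a source of a
# stream of tuples; a bolt consumes streams and produces new streams.

def url_spout():
    """Spout: emits a stream of tuples (here, a small finite sample)."""
    for url in ["a.com", "b.com", "a.com"]:
        yield {"url": url}

def count_bolt(stream):
    """Bolt: consumes the input stream and emits running counts per URL."""
    counts = {}
    for tup in stream:
        counts[tup["url"]] = counts.get(tup["url"], 0) + 1
        yield {"url": tup["url"], "count": counts[tup["url"]]}

# Wiring the spout into the bolt forms a (trivial) topology.
results = list(count_bolt(url_spout()))
print(results[-1])  # {'url': 'a.com', 'count': 2}
```

In real Storm the bolt's logic would be partitioned across many machines, with the framework routing tuples between them.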
There are kind of two aspects to it. One aspect is just making sure your workers keep on running, so Storm does that: it manages a cluster for you, so you have a master node which tracks running workers, and if anything dies it restarts it somewhere else. That kind of thing is handled automatically, and it was difficult to do manually when we were doing queues and workers by hand. The other aspect is making sure that your data gets fully processed. That’s actually one of the big innovations in Storm; coming up with this algorithm is what made Storm possible in the first place. One thing I really, really hated when we were doing queues and workers manually was having to have these queues in between our sets of workers. The queues just contained intermediate data, and the problem was that they were necessary: if there was a failure later on, you needed to replay what you attempted. But I hated the idea of intermediate queues, because you are not sending messages to whoever is going to process them; you have to go through this third party that requires much more infrastructure, it’s complex, and going through a third party makes things slow. So I decided that in Storm I didn’t want any intermediate queues, which meant I had to figure out a way to do this distributed processing where, if anything failed or messages got dropped, you would know that and know how to replay your messages from the source. Storm implements a really cool algorithm to do that, where it tracks this tree of processing and can efficiently detect when it fails and retry if necessary. It’s kind of hard to go into it like this, but it’s documented pretty well in the Storm documentation, and it’s an algorithm that I’m personally very proud of.
Sure. I didn’t actually know about this term before I did it, but I learned later that it’s based on a technique called Zobrist hashing. There’s a lot of hashing involved; it’s actually a probabilistic algorithm, but the probability of it being wrong is so, so low that you can basically ignore it. If you are processing a million tuples per second, the algorithm will incorrectly mark a tuple as processed when it hasn’t been fully processed once every ten thousand years, so we felt that was pretty acceptable.
It’s hard to go into without showing the diagrams. What’s involved is hashing and XORing.
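The core of the XOR idea can be sketched briefly. This is a simplified, single-process illustration of the publicly documented Storm "acker" technique, not Storm's actual implementation: each tuple in the processing tree gets a random 64-bit ID, the tracker XORs the ID in when the tuple is emitted and again when it is acknowledged, and a running value of zero means the whole tree is done.

```python
import random

# Simplified sketch of Storm's XOR-based tuple-tree tracking. The real
# acker is distributed across the cluster and also handles timeouts/replay.
class Acker:
    def __init__(self):
        self.ack_val = {}  # root tuple id -> running XOR of edge ids

    def track(self, root_id, edge_id):
        """Called when a tuple is emitted into the tree."""
        self.ack_val[root_id] = self.ack_val.get(root_id, 0) ^ edge_id

    def ack(self, root_id, edge_id):
        """Called when that tuple finishes; returns True if the tree is done."""
        self.ack_val[root_id] ^= edge_id
        return self.ack_val[root_id] == 0

acker = Acker()
root = 1
e1, e2 = random.getrandbits(64), random.getrandbits(64)
acker.track(root, e1)       # spout emits tuple e1
acker.track(root, e2)       # a bolt emits child tuple e2
print(acker.ack(root, e1))  # e1 done, but e2 still pending
print(acker.ack(root, e2))  # XOR returns to zero: tree fully processed
```

Because each ID is XORed in exactly twice, the running value returns to zero only when every tuple in the tree has been acknowledged (up to the negligible collision probability mentioned above).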
Werner: I think our audience can google that and have some fun.
If you just look at the wiki page it’s pretty clear, it’s explained well, but you really do need the diagrams.
Werner: Absolutely, and everybody loves probabilistic data structures nowadays. Have you read up on Bloom filters?
I love Bloom filters and HyperLogLog is one of my favorite algorithms.
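For readers unfamiliar with them, a Bloom filter can be sketched in a few lines. This is a minimal illustrative implementation, not from any library: k hash functions set bits in an m-bit array, giving set membership with no false negatives and a small, tunable false-positive rate.

```python
import hashlib

# Minimal Bloom filter sketch: "definitely not present" or "probably present".
class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0  # m-bit array packed into an int

    def _positions(self, item):
        # Derive k bit positions from k salted hashes of the item.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))

bf = BloomFilter()
bf.add("storm")
print(bf.might_contain("storm"))   # True: never a false negative
print(bf.might_contain("hadoop"))  # almost certainly False
```

HyperLogLog, mentioned above, applies a similar hashing-based probabilistic trick to cardinality estimation rather than membership.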
It’s a really big misconception, especially because I’m one of the biggest advocates of using Storm and Hadoop together; I’ve been talking about this for years, and it’s a big part of my book. Hadoop is a batch processing system; it’s really good at processing very, very large amounts of data all at once. Storm is a stream processing system; it’s very good at processing large amounts of data as it comes in. Stream processing and batch processing are completely different, and in my view the best architectures make use of both: each has its place, and they don’t really overlap with each other.
Any time you need to look at data historically, that is when you use batch processing; anything you need to do to data as it comes in, that is what you use Storm for.
My book is about how to build Big Data systems end to end and how to architect them. I kind of think of Big Data as the Wild West of software engineering right now. It’s pretty crazy: there are lots of people trying new things, and the average user is pretty bewildered by what’s going on; it’s very, very confusing. I entered this Wild West and I didn’t really know what I was doing at first, but when you deal with these really hard problems for a long enough period of time, you learn certain things, and I started developing these models for how to approach these problems in a general way and actually solve them effectively. For example, one of the core things I learned very fast was this notion of human fault tolerance. The idea is that, and everyone knows this but no one talks about it, people make mistakes. Programmers make mistakes; we deploy bugs to production all the time. You ask the average programmer: “Have you ever accidentally deleted data from the database?” and they will answer: “Yes”.
If they don’t answer “Yes”, they are lying to you or they haven’t been a programmer that long. I think most people don’t design systems to be tolerant of human mistakes. Especially in the Big Data and NoSQL world, people love patting themselves on the back for these super complex algorithms they developed for machine fault tolerance: replication, leader election, active anti-entropy. I think none of that stuff matters if you are not tolerant of human mistakes. Human mistakes are guaranteed, so if you deploy a system that is not tolerant of human mistakes, you might as well not have fault tolerance. That was a big thing I learned, especially when people would make these big mistakes and we would need to correct them. That led me down the path of rejecting a lot of the really, really core principles of data management, especially in the relational database world. One of these principles is the idea of immutability instead of mutability. In a traditional database, the four core operations are create, read, update, and delete. The thing is that if you can update data, then a mistake can also update data, so I think the far superior approach is immutability, where you only ever add data and never modify existing data. That makes your systems much more human fault tolerant, because when you make a mistake you might write some bad data, but at least you won’t destroy existing data that was good. Anyway, this is one of the things I’ve learned, and in my book I explore general ways to approach systems so you get properties like human fault tolerance.
Rich Hickey is the creator of Clojure. We arrived at the importance of immutability independently; I was sold on immutability before I was sold on Clojure, and when I saw Clojure, that made me even more excited about it.
Werner: You were vindicated in a way.
Clojure is amazing. Immutability is not just useful for data persistence and human fault tolerance; when you code programs using immutability as a core technique, not mutating existing data structures, you can really simplify your code. Clojure really embraces that, its standard library really embraces that, and once you are able to understand the mental model of Clojure, it just makes programming such a joy.
Werner: I think immutability is often proposed as a solution, it’s a best practice, but I think many people have the question: “But I do have to change some things, I have to update things.” So if my data is immutable, how do I change anything? What approaches, what solutions do you have for that?
So the trick is that when you model things using immutability, you model them in a different way than you would in a mutable world. The simplest example I can think of for this is, let’s say you’re just modeling someone’s current location, so Sally lives in New York or Bob lives in Chicago. In the mutable world that’s what you store in a database, and when Sally moves to London you would update the cell to say London instead of New York. In the immutable world we do it completely differently: instead of just storing the person and the location, you store the person, the location, and the time as of which they’ve lived in that location. So now, instead of updating that row, you add a new row saying: “Sally lives in London as of this new time”. And to get someone’s current location you just get the location with the latest timestamp. This notion of time is actually a general purpose way to make any data model immutable: as long as you only record facts as of when you know them to be true, anything that happens later doesn’t change the truthfulness of that. Now, in terms of actually doing queries and doing them efficiently, that is essentially what my whole book is about; that is where the Lambda Architecture comes in, the idea of building views on your data, views that are optimized for your queries.
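The Sally/Bob example above can be written out directly. This is an illustrative sketch only: facts are append-only (person, location, timestamp) rows, and the "current" location is derived by query rather than stored.

```python
# Immutable modeling: facts are only ever appended, never updated in place.
facts = []

def record_location(person, location, timestamp):
    """Append a fact: this person was at this location as of this time."""
    facts.append({"person": person, "location": location, "ts": timestamp})

def current_location(person):
    """Derive the current location: the fact with the latest timestamp."""
    rows = [f for f in facts if f["person"] == person]
    return max(rows, key=lambda f: f["ts"])["location"]

record_location("Sally", "New York", 1)
record_location("Bob", "Chicago", 2)
record_location("Sally", "London", 3)  # a new fact, not an update

print(current_location("Sally"))  # London
print(current_location("Bob"))    # Chicago
```

Notice that a buggy write can add a bad fact, but it can never destroy the good facts already recorded, which is the human fault tolerance property discussed above.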
So the Lambda Architecture approaches building data systems from first principles. A question I like to ask people is: “Does a relational database apply to all data problems? Can it be used for all data problems?” When you hear this question it’s kind of hard to answer: do relations and tables and primary keys and all of that fit any data problem into that mold? It’s hard to answer because it’s not clear what a data problem is; it’s not clearly defined, and so the answer is kind of fuzzy. So where I start with the Lambda Architecture is actually defining what a data problem is, what the most general possible formulation for a data problem is, and it’s actually quite simple.
Any data problem can be expressed as a function that takes every piece of data that you have as input: query equals function of all data. Clearly, if you can write a function that literally takes all your data as input, then anything you could ever want to do, you can do in that function. So let’s start from there: the Lambda Architecture is a general purpose way to build those functions of all data and have it all be scalable, up to date, and operating at very low latency.
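As a toy illustration of "query = function(all data)" (names and records invented for the example): counting page views for a URL is just a function that scans every raw record.

```python
# The most general formulation: a query is a function over ALL the raw data.
all_data = [
    {"url": "foo.com", "ts": 100},
    {"url": "bar.com", "ts": 101},
    {"url": "foo.com", "ts": 102},
]

def pageview_count(data, url):
    """query(all data): scan every record, count matches."""
    return sum(1 for d in data if d["url"] == url)

print(pageview_count(all_data, "foo.com"))  # 2
```

This is maximally general but hopelessly slow at scale, which is exactly the tension the Lambda Architecture addresses.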
Werner: Ok, let’s go into sort of the details here. Everybody likes low latency, so how does low latency get in there?
One of the core ideas of the Lambda Architecture is this idea of views. The idea is that you have your master dataset, and that is literally just an unindexed list of immutable records, and all you will do is add to that list. It’s not something that you query directly. Ideally, every time you wanted to do a query, you could literally just write a function that took all your data as input; that would be so easy. Unfortunately you can’t do that because it would take way too long: you can’t run a function on three petabytes of data in ten milliseconds, and it would be so resource intensive it wouldn’t be worth it. So a core idea of the Lambda Architecture is pre-computing views on your master dataset, views that are optimized for your queries. An example I like to use is a web analytics example of computing the number of page views for a URL over a range of time. The idea is that you pre-compute a view which is an index from a URL and an hour bucket to the number of page views for that hour, and then, to actually get the number of page views for a range of time, you get the counts for all the hours in that range and sum them together for the result. Obviously the query will run at low latency because you are querying an indexed database, and basically what the Lambda Architecture is really about is how to produce those views.
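The hour-bucket view just described can be sketched like this (an illustrative in-memory sketch; in practice the view would live in an indexed database):

```python
# Precomputed view: index (url, hour_bucket) -> pageview count. Range
# queries then sum a handful of index entries instead of scanning raw data.
raw_pageviews = [
    ("foo.com", 3600), ("foo.com", 3700), ("foo.com", 7300),
    ("bar.com", 3650),
]

def build_view(pageviews):
    """Pre-compute the view from the master dataset (one pass)."""
    view = {}
    for url, ts in pageviews:
        key = (url, ts // 3600)  # bucket the timestamp into an hour
        view[key] = view.get(key, 0) + 1
    return view

def query(view, url, start_hour, end_hour):
    """Low-latency query: sum the per-hour counts over the range."""
    return sum(view.get((url, h), 0) for h in range(start_hour, end_hour + 1))

view = build_view(raw_pageviews)
print(query(view, "foo.com", 1, 2))  # 3
```

The expensive scan happens once, at view-build time; each query only touches as many entries as there are hours in the range.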
Werner: Let’s deep dive into views, into the idea of views.
The right place to start is to define your views as a function of all your data; that is the most general possible thing you can do. Then you have to think for a second: ok, how do I run a function on all my data to produce this view as output? That should just scream “batch processing” to you. There has been amazing work in batch processing in the past decade and we have some great tools to do it, and I would say the premier one is MapReduce. A lot of people talk about MapReduce in terms of how it works: it has a map step and a shuffle step and a sort step and a reduce step. But that is how it works, not what it is. I would actually say MapReduce is a framework for computing arbitrary functions of arbitrary data; that is the actual power of MapReduce.
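That framing of MapReduce can be illustrated with a toy single-process skeleton (purely pedagogical, not how Hadoop is implemented): you supply the map and reduce functions, and the framework handles grouping.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy MapReduce: map each record to (key, value) pairs, shuffle by key,
    then reduce each key's values. The user supplies arbitrary functions."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map step
            shuffled[key].append(value)     # shuffle/group step
    return {k: reducer(k, vs) for k, vs in shuffled.items()}  # reduce step

# One "function of all data": pageview counts per URL from raw logs.
logs = ["foo.com", "bar.com", "foo.com"]
counts = map_reduce(logs,
                    mapper=lambda url: [(url, 1)],
                    reducer=lambda url, ones: sum(ones))
print(counts)  # {'foo.com': 2, 'bar.com': 1}
```

The point of the interview's framing is that `mapper` and `reducer` are arbitrary: any batch view you can express as a function of the raw records fits this shape.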
So in the Lambda Architecture, the place you start is computing batch views from your data using MapReduce, and it’s actually pretty straightforward to do that. The next place you go in the Lambda Architecture is to look at that and say: “Ok, that is great and I can use my batch views”, but batch processing is a high latency operation, so those views will always be out of date by, say, a few hours, or however long it takes your batch code to run. But when you think about what you have, you can subdivide the problem, because all the data you have up to a few hours ago is already represented in the batch views. All you need to do to make things fully realtime is account for that last few hours of data, that last 0.1% of your data. The second part of the Lambda Architecture is this thing called the speed layer, and all it does is compensate for that last few hours of data. The idea is that you have your batch views, and in parallel you compute realtime views. So for page views over time, the batch views would be all the page view indexes up to a few hours ago, and the realtime view would contain the rest. When you do a query, you query both the batch view and the realtime view and you merge them to get your result. At first glance people say: that seems more complicated than just using a database, where I just have to query, I don’t have to do all this merging. But you have to look at what you actually get from this. First of all it is completely general purpose, it applies to any function, and then it has some really, really nice properties, one of the big ones being human fault tolerance.
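The query-time merge described above can be sketched concretely (illustrative data and a made-up cutoff; real systems decide the cutoff from how far the batch run has progressed):

```python
# Batch/speed split: the batch view covers hours up to a cutoff; the
# realtime view covers the recent tail; a query merges both.
batch_view    = {("foo.com", 1): 2, ("foo.com", 2): 1}  # hours <= cutoff
realtime_view = {("foo.com", 3): 4}                      # hours > cutoff

def query(url, start_hour, end_hour, batch_cutoff=2):
    total = 0
    for h in range(start_hour, end_hour + 1):
        # Each hour is answered by exactly one layer, so nothing is
        # double-counted when the two views are merged.
        view = batch_view if h <= batch_cutoff else realtime_view
        total += view.get((url, h), 0)
    return total

print(query("foo.com", 1, 3))  # 7: hours 1-2 from batch, hour 3 from realtime
```

When the next batch run finishes, its view absorbs the hours the speed layer was covering, and those realtime entries can simply be discarded.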
When you have all your data existing in a batch computation system, that means you can recompute those views whenever you want. The data is immutable, so any mistake you make can always be corrected with recomputation, and that is an extremely important, and I would say non-negotiable, thing to have. You can also do some really cool things with this batch/speed layer split. Sometimes there are things that are really hard to compute in realtime, so the only way to do them incrementally is to do an approximation of some sort, and in my presentation I went through an example of this. What you can do in the Lambda Architecture is do that approximation in realtime, but then in batch take a more accurate approach. Because the batch views are always overriding the realtime views, you get this thing which I call eventual accuracy, where you can make that tradeoff for performance in the realtime layer but it doesn’t cause permanent inaccuracy; it’s only temporarily inaccurate, and only for recent data. That is a really, really powerful technique, something I’ve made use of many times.
Werner: That is very interesting, and one question I have: superficially this sounds similar to CQRS. What do you think of that? Are they completely different, are they overlapping, do they have different purposes?
So CQRS, from what I understand, is a concept for separating reads and writes, essentially, and certainly that is embraced by the Lambda Architecture. The only write you really have is adding a new piece of immutable data, and then the Lambda Architecture portion is how you transform that into views, and at the end of it you do queries, which are obviously just reads. I think it is definitely a good principle for all programs; the Lambda Architecture didn’t set out to implement CQRS, it’s just a good idea.

Werner: So it’s basically the approach to using, the Lambda Architecture is combining immutability with …

Yes, as core principles.

Werner: Ok, so this Lambda Architecture, have you used implementations of it or these concepts in previous work, or is it something that you’ve seen in big applications?

The Lambda Architecture is something I developed by hammering my head on these problems for five years. So I’ve been doing this for a long time: I did it at BackType and I did it at Twitter, when I went to Twitter. Funnily enough, whenever I talk about the Lambda Architecture I always get people who come up to me and say “Wow, we did something so similar”, and then they describe to me this really complex problem they had to deal with.
They didn’t necessarily formulate it in the general way I have, as functions of all data, with the very general purpose nature of it, but I find people have independently stumbled on these techniques, and I believe it’s because once a problem gets hard enough, this is the only thing you can really do. It’s an interesting thing to think about. Somewhat relatedly, and this is total speculation, I actually suspect that our brains use some form of Lambda Architecture. There are a lot of symptoms of it: the fact that we know there is a clear difference between short term and long term memory, that screams Lambda Architecture speed layer and batch layer; and the fact that we know sleep has some effect on how information is indexed in your brain, and that sleeping on something enhances recall, sounds like some sort of batch processing is happening while you are sleeping. I find it very interesting, and unfortunately I don’t think it’s a question we’ll get an answer to for a long time, but I do wonder if nature has evolved some sort of Lambda Architecture.
Yes, basically, or just doing more intense calculation and correlation, the exact kind of things that you do in the batch layer of the Lambda Architecture.
Werner: That's an interesting point of view, I wouldn’t present it at a neurological conference but it’s interesting.
If you think about it, computational limitations are a limitation of nature, so our programs are subject to them but our brains are subject to them too. So yes, it’s an interesting thing to think about.
Werner: Since we are talking about immutability, and I think Storm is built with Clojure to some degree, what is so great about Clojure? We’ve certainly touched on the immutability, but what else? Do you like the parentheses?
I love the parentheses. Here is the thing: most people don’t like the parentheses, and it really just comes down to the fact that they are not used to them. The parentheses stem from the fact that Clojure has a very, very regular syntax; it’s actually the simplest possible syntax you can have in a programming language: everything is a list, and the first element of the list is the operation. There are a lot of reasons why I love Clojure, but we can start with the syntax. The fact that it has such a regular syntax enables this thing Clojure calls macros. A lot of languages have macros, but they’re not the same kind of macros you get in Clojure or other Lisps. The fact that your code is written as data, that it’s just lists, means you can process your code like it’s data, and you can process your code using the exact same code you use to process any other data.
So something you can do in Clojure is write a macro, which is a function that takes in code and spits out other code. This is really, really powerful and enables you to build abstractions like you just can’t in other programming languages. An example of this is my other project, Cascalog. Cascalog is a library for Clojure, but it implements a declarative logic programming language that runs as MapReduce jobs on Hadoop. So it literally implements a completely different programming paradigm within the language, but it’s just a library. That means you can use a different programming paradigm within the language and have it interoperate with the rest of your normal Clojure code, and being able to interoperate between different programming paradigms is immensely powerful; it’s just something that you can’t do in other programming languages. But there is a more general reason that I love Clojure. What I said before is true of all Lisps; the more general reason I love Clojure is that it has a philosophy of how you should program. It’s very opinionated, and the stuff Rich Hickey has done is brilliant; I absolutely love his talks.
He has tons of talks about some of the things we were talking about, immutability and the importance of it, and those things are baked into Clojure, so I just love that about the programming language. It also has a fantastic community; there are people doing some incredibly innovative things with Clojure. One of my favorites is this guy Sam Aaron with a library called Overtone, which is a DSL for making music with Clojure, and he will literally go on stage and just jam, but at a programming level. That is super cool, live-coding music, and you find the Clojure community is filled with people like that, doing really, really cool stuff.
Core.async is another great example of the power of macros. The programming language Go has this really cool thing called goroutines, which is just a way of doing concurrency, and Go has all this special syntax for doing goroutines. Clojure implemented goroutines, but as a library; it didn’t need to extend the language, it’s just a separate library you can use. Because of the power of macros, it’s able to transform the code that you write in this concurrent goroutine style into the way that goroutines execute. That is a really, really cool library.
It’s called Big Data and it has a really long subtitle; it’s published by Manning.
Werner: And otherwise we will just google for Lambda Architecture to get more details about it.
Yes, if you just search Big Data then my name, it will come up.
Werner: So you've given us a lot to read and a lot to think about, thank you Nathan!