00:28:30 video length
Bio: Ilya Grigorik is the founder of AideRSS – a real-time social media engagement monitoring and analytics platform. He has been wrangling with Ruby and cloud computing for over four years, trying to make sense of it all. In the process, he has contributed to many open source projects, blogged about his discoveries (blog: www.igvita.com, twitter: @igrigorik), and has been an active speaker.
FutureRuby isn't a Ruby conference, but a conference for Rubyists. This is a call to order - a congress of the curious characters that drew us to this community in the first place. We have a singular opportunity to express a long-term vision, a future where Ruby drives creativity and prosperity without being dampened by partisan politics.
My name is Ilya Grigorik, I'm with a company called PostRank. FutureRuby is in Toronto - we're actually not far from Toronto, in the city of Waterloo, which is known for the University of Waterloo. There's a great computer science and software engineering crowd there, so we decided to stick around and start the company, PostRank, there. I'm the founder of the company and right now we're about 10 people - we're a Ruby shop.
PostRank is an iteration on an idea that was pioneered by the Google guys when they invented PageRank. To simplify PageRank, you could think of it as treating links as votes. This was a radical idea back in '97 when they came up with it. You take web pages and you look at the links, and every link going to another page is an implicit vote. Then you look at the entire graph and you figure out which are the most popular pages, essentially. I think we can all agree that that worked out to be a pretty good search engine approach.
Of course, they've iterated on this idea quite a bit and it's much more than that now, but one of the things we noticed when we started the company, in early 2007, was that the web had changed since then. Links still exist and they continue to multiply at an amazing rate, but all of a sudden, given all the Web 2.0 activity, people are sharing content on social networks. So, I find a link and I share it on Facebook, I post it on Twitter, I have a discussion on FriendFeed, I save it to my Del.icio.us stream - and there are a dozen other ways that people interact with this content.
Each one of those actions is actually an implicit vote when you think about the quality of that content. If I come to InfoQ, read an article and leave a comment, that's a meaningful interaction. I didn't leave a link, but that also says something about the content. Looking at all this activity, we said "We could actually build a system that aggregates all of these activities from around the Internet and build a very interesting ranking engine", which is exactly what PostRank does. We have a system which gathers, in real time, all the activity going on on the most popular social networks, especially in North America - Digg, Del.icio.us, Twitter, FriendFeed - we track over a dozen of these networks in real time and we basically see activity streams like "Bob just bookmarked this story" or "John just left a comment on a Digg story that's pointing to this link".
We get all that data in real time, then we do analysis on it and say "Here is an iPhone story that is getting a lot of attention from users". On any given day there are hundreds of stories on the iPhone, but only a few are actually getting a lot of this user interaction. We started this as a consumer service. You come to PostRank and you type in any web site or RSS feed - say infoq.com - you get the main feed, and beside each story there will be a PostRank score, from 1 to 10, based on how much users have interacted with that story.
Then you can say "I'm interested in the best articles. Which articles generate a lot of conversation?". A great use case for this - and a personal one for me - is photography. It's a hobby, I like to read about it, but it's not something I want to learn about every single day - it's something I do on weekends. I subscribe to maybe 10 or 20 great photography blogs, but that's a lot of content - anywhere from 200 to 400 stories a week - and I don't want to spend my entire week reading it. What I do is come to the site and say "Give me only the articles from these photography blogs that generate a lot of engagement."
For example, if a blog writes about a routine new camera release from Casio or something like that, it's probably not going to generate a lot of activity, so it will drop below the filter; but if Casio does a huge recall or there is a big kerfuffle about some new camera coming out, I'm going to get that story. It essentially acts as an RSS proxy: you put in a feed and get another, filtered feed that strips out all the articles with low engagement. Alternatively, you can use our APIs and send us a whole list of stories - say you are building your own newsreader and you've aggregated stories on any given topic - you send them to our API and we give you back all the scores. We'll say "Given the 100 articles that you gave us, here is the article with the most engagement.
Maybe that's the one you should read first." Imagine the use case: I've got 600 stories and 5 minutes to read - what do I do? How do I get through all this? That's exactly the problem we solve. We provide an API where you send in the stories and we give you back the rankings - these are the stories that people are paying attention to.
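The actual PostRank model is proprietary, but the "implicit votes" idea can be sketched in a few lines of Ruby. The activity types and weights below are invented for illustration - they are not PostRank's real values.

```ruby
# Toy engagement scorer: each social activity is an implicit vote, with
# heavier weights for higher-effort actions. All weights are made up.
WEIGHTS = {
  click:    1.0,  # low-effort signal
  bookmark: 2.0,  # e.g. saving to Del.icio.us
  tweet:    3.0,  # sharing on Twitter
  comment:  5.0   # high-effort interaction
}.freeze

def engagement_score(activities)
  # activities is a hash like { comment: 4, tweet: 12 }
  activities.sum { |type, count| WEIGHTS.fetch(type, 0.0) * count }
end

# The "600 stories, 5 minutes to read" use case: rank a batch,
# most engaging first.
def rank_stories(stories)
  stories.sort_by { |story| -engagement_score(story[:activities]) }
end

stories = [
  { url: "http://example.com/a", activities: { click: 50 } },
  { url: "http://example.com/b", activities: { comment: 4, tweet: 12 } }
]
ranked = rank_stories(stories)
# ranked.first[:url] => "http://example.com/b" (56.0 points vs 50.0)
```

A real scorer would also normalize by audience size and decay scores over time, but the ranking step stays this simple: sort by score, read from the top.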
3. It sounds very interesting - and very useful in our times - to use the resource of all these social sites and the votes of the people on them. How do you make this actually happen? What runs on your machines?
We are a Ruby shop and we are fully virtualized. We run on Amazon EC2 - we made that decision right from the get-go, so we've been on EC2 since early 2007. We were one of the earliest adopters of the platform, definitely not without its trials, especially at the beginning, because it was a very young platform, but it's proven to be very helpful to us because we can scale out very easily. We were kind of a prototypical case study for a cloud computing launch.
When we launched our website to the public, we had 3 servers, and 2 of those were just for kicks - if everything exploded, we'd have some spillover capacity. A day later, we had 85 servers running on Amazon EC2 - that's a big number - and, to be honest, one reason we needed 85 was that some of the code was just inefficient. But that's not the point: we were actually able to scale to meet the demand, because the product we released really resonated with users. Like you were saying, it's the age of information overload, and here was something that could help people fight it.
That was just a great use case for us, and since then we've condensed that number of servers, so right now we're probably hovering around 30 or 40 servers on any given day. There are 2 big architectural components to our system. One is that we aggregate all of the blog content that users submit to us - since 2007 we've essentially been archiving the blogosphere, and that's a lot of text! At this point, we have over a terabyte of just raw text for all the stories written since 2007. The second component is the real-time metrics gathering. All of that is Ruby as well, and we have essentially a real-time stream - we use RabbitMQ and the AMQP protocol to push the data back and forth across the data pipe to provide the filtering capabilities.
We've iterated through a number of solutions. We started with just MySQL and pushed that for almost a year, at which point it completely broke down, so we started investigating alternatives and ended up with Lucene - or, more specifically, with Solr. Lucene is a Java library for indexing text and performing searches, and Solr adds an additional layer - an HTTP server - which makes it easy to interact with. If you want to store a document, you just POST to a Solr endpoint and it stores the document.
The same thing for searches - it provides an HTTP interface for searching. So we moved to Solr instead of MySQL, and we also sharded our database: we defined an arbitrary function based on our own internal keys for each feed and split the data across many different instances. We have masters which we use to store content - because it's a lot of content: on any given day we see 2-3 million new stories, so imagine storing 3 million new documents a day, every single day of the year. That number keeps increasing as we get more and more feeds into our system. Because the data is sharded, we've been able to add new clusters along the way to help us scale with load.
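Storing a document in Solr really is just an HTTP POST, as described. Here is a minimal sketch with Ruby's Net::HTTP - the host, field names and document content are assumptions for illustration, and the request is only built here, not actually sent:

```ruby
require "net/http"
require "uri"

# Build (but don't send) an update request for Solr's classic XML
# update handler. Host and field names are illustrative only.
solr = URI("http://localhost:8983/solr/update")

doc = <<~XML
  <add>
    <doc>
      <field name="id">feed42-post7</field>
      <field name="title">Big camera recall announced</field>
      <field name="body">Story text goes here.</field>
    </doc>
  </add>
XML

req = Net::HTTP::Post.new(solr.path, "Content-Type" => "text/xml")
req.body = doc

# To actually index it, you would send the request (followed by a <commit/>):
# Net::HTTP.start(solr.host, solr.port) { |http| http.request(req) }
```

Searching works the same way: a GET against the select endpoint with the query in the URL parameters, results back as XML or JSON.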
We migrated away from MySQL for text storage, but we still use it to store all of our metrics. I mentioned the real-time gathering of activities: on any given day we see roughly 15 million - 60 million new activities. We want to store each one of those in a database, because all of those are implicit votes. For the past couple of years, that's the throughput we've been handling - every day we archive another 50 million records, and these are very lightweight. You can think of each one as a URL plus a key - which may be something like "somebody just commented on the story" - and a counter that says "this story has 15 comments". But, as you can imagine, that's a lot of insert capacity that we need to match.
We still use MySQL for that and, likewise, we had to partition the data across many different clusters to get the insert and read capacity we need. It's the same problem that we had with Solr, and one thing we found about Solr, MySQL and all the solutions that we looked at is that the amount of data we're handling pushes the limits of all of them. MySQL, as we found out, starts to degrade nonlinearly at around 100 GB of indexed data, or a billion records. We have way more than a billion records for metrics, so how do you cope with that?
Same thing for Solr - if you search the mailing list, some of the biggest deployments you'll find are around 100 million records. We have an order of magnitude more than that. We ended up finding a lot of really quirky locking issues and performance bottlenecks. We've addressed some of them internally, but it's an ongoing battle.
We've been actively exploring the database space. How can we do this more efficiently? It's one thing to just keep adding hardware, but of course that increases our costs as well, and we're a startup, so we're sensitive to price and all of that. It's been really interesting for us to see all the databases that have popped up in the last year or so - document-oriented storage and all those things. Specifically, one of the solutions we looked at was Tokyo Cabinet, which is, on one hand, a fairly recent project: it was started in 2007 by a single developer who works for Mixi, which is kind of like the Facebook of Japan.
I'm not sure how many users they have, but they're the number one player in Japan, so it's a big site. He started Tokyo Cabinet in 2007, but before that he worked on a database project called QDBM, from 2000 to 2007, and then in 2007 he said "I learnt a lot by building this thing. I'm just going to do the 'rewrite' thing where I take all the mistakes that I made, get rid of them, and write Tokyo Cabinet." So you could say this has been in the works for 9 years or so, and it's a very mature project. One of the things that surprised me when I came across it: a couple of people had mentioned Tokyo Cabinet to me and it just flew by, I never paid attention, but when I started looking at it, I was surprised at how mature the project is.
There is great documentation, the source code is very clean - it's written in C - there's great functionality and it's really fast. Based on that, I started digging into the author's blog posts and documentation, trying to find out whether this is something we could use internally, because it relaxes some of the constraints that MySQL has: there is no concept of, for example, triggers, or other features that we really don't use anymore. I think that's one of the reasons for the current backlash in the database community - there is a lot of stuff that we don't use. We have much more simplified use cases, and those features are a burden on performance.
We started looking at Tokyo Cabinet asking "Can we take our databases - the metrics databases, for example - and move them to Tokyo Cabinet? Is that going to get us an order-of-magnitude performance increase?" We found that, in many cases, it does. Right now we are still investigating these applications - we haven't migrated to Tokyo Cabinet - but it's definitely a very appealing platform for us to think about.
7. If you look at many of the new or modern databases, like CouchDB and all these things, where does Tokyo Cabinet fit in? Is it document-oriented? Is it key/value? Where does it fit in?
When we talk about Tokyo Cabinet, there is an umbrella - it's called Tokyo Products. There are 3 distinct projects, all by the same guy, and Tokyo Cabinet is just one of them. Tokyo Cabinet is a library for managing a database, nothing more than that. Just like Lucene is a Java library for indexing text, Tokyo Cabinet is a C library for a key/value store, plus a few other engines. You can build a great embedded database with it. There are language bindings for Java, Ruby, Perl, Lua - virtually any language out there. If you want a really fast embedded database in your application, that's where Tokyo Cabinet comes in.
Then there is Tokyo Tyrant, which is a network interface on top of Tokyo Cabinet. Basically, it takes Cabinet and wraps it in a network interface, and out of the box it provides 3 different protocols. One is a binary protocol, as you would expect; the second is HTTP, which turns the database into a RESTful endpoint, so you can just PUT, GET and DELETE and it will do all the things you would expect; and the last one is the memcached protocol.
So you can take your memcached client in any language, point it at Tokyo Tyrant, and it will behave just like a memcached server. By default, Tokyo Cabinet uses a key/value store model - a hash table - and the project was started as an alternative to Berkeley DB, so that's where its roots are, but it also provides a lot of other engines. Think of MySQL: there are the MyISAM, InnoDB and BDB engines. It's the same in Tokyo Cabinet - there are many different alternatives you can use for different applications. There is no one right choice; it all depends on the type of data you are storing.
The hash table is the number one engine, but there is also a table engine which basically allows you to treat Tokyo Cabinet as a document store. If you are using the Ruby client, you can just pass in a hash full of keys and values and it will store that hash, then you can query it and get the same hash back - and there is no schema, so you can pass in arbitrary hashes. The really exciting part is that it's a schema-less document store, but it allows you to do queries on it as well.
Say you've created this document store and it holds a list of attendees for a conference, with name, age and sex columns - but maybe not everybody provided their age; that's a good use case for a schema-less data store. Now you can go in and say "Give me all the people over the age of 22" and Tokyo Cabinet will go through the records and find the right ones, not unlike MySQL. It supports indexes, so you can declare arbitrary indexes on integers, strings - all that kind of stuff. So it's hard to place it as "just a key/value store".
When I think about Tokyo Cabinet, it's a key/value store, but more. It really depends on the use case, because the simplest engine it has is called the "fixed-length table", which is best thought of as just a giant array. You can't even access the data in there through string keys - it's all done through index offsets, so you literally say "1, 2, 3, 4" - but you get all the semantics of having a transactional database: you can do transactions, it's persistent on disk, you can do replication, all of these things. It really depends.
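The table-engine behavior from the attendee example can be mimicked in plain Ruby. To be clear, this toy class is not the Tokyo Cabinet API - it only illustrates the semantics: schema-less records under a primary key, queried by an attribute condition.

```ruby
# Toy stand-in for the table engine: records are plain hashes with no
# schema; a query scans them with a condition (Tokyo Cabinet would use
# a declared index instead of a full scan).
class ToyTable
  def initialize
    @records = {}  # primary key => hash of attributes
  end

  def []=(pk, attrs)
    @records[pk] = attrs  # any keys allowed, fields may be missing
  end

  def [](pk)
    @records[pk]
  end

  def query(&condition)
    @records.values.select(&condition)
  end
end

attendees = ToyTable.new
attendees["alice"] = { "name" => "Alice", "age" => "31" }
attendees["bob"]   = { "name" => "Bob" }  # age not provided - that's fine

# "Give me all the people over the age of 22"
adults = attendees.query { |r| r["age"].to_i > 22 }
# adults.map { |r| r["name"] } => ["Alice"]
```

Note how Bob, who never supplied an age, simply fails the condition rather than breaking the query - that is the practical benefit of the schema-less store described above.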
Tokyo Tyrant is a network interface on top of Tokyo Cabinet, and you can use any of the language bindings that are available. If you are a Ruby shop - like we are - there are at least 3 gems. One is based on FFI - for example, if you are on JRuby you can use the FFI gem and get access to Tokyo Tyrant - or, if you are looking for raw speed, the native C extensions are a little faster than the FFI ones, so you can use those. From there, it really is just like a Ruby hash. Once you create a database - say, an embedded database - via one of the Ruby clients, it walks and talks like a Ruby hash: "Is this key here?", "Let me set a key". Then you can do a table query and provide the parameters. It's a very clean interface for doing this kind of stuff.
9. If you get the values with the Ruby interface and you have a lot of values, do you have to iterate over all of them, or is there a way for Tokyo Cabinet or one of the other tools to run an optimized query?
Tokyo Cabinet does have the concept of iterators, so you can either say "Execute this query and give me all the results back", in which case you may get an array - say you were using the document store, you would get an array of hashes and then it's up to you to do whatever you want with it - or you can create an iterator and say "Next, next... I'm going to walk through this until I decide to stop." Alternatively, let's say you wanted to run a giant aggregate function across your database.
Recently, the author added the capability to execute MapReduce jobs within the database, and this is a great use case for that. The native bindings provide some convenience functions for "get me all the keys", and you basically write 2 functions: "here is my mapper, here is my reducer". In the mapper, the database repeatedly passes you the values and you do something with them - say, "if this is a string, I'm going to split it, extract the words and count them". Then the database is responsible for collecting all the intermediate results and passing them through your reducer, and then you execute that. If you want to do big aggregate functions, that's probably the best tool.
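The mapper/reducer shape just described can be mimicked in plain Ruby with the classic word-count example. This sketches the flow only - the driver and lambda signatures here are not the actual binding's API:

```ruby
# Minimal map/reduce driver: pass each value to the mapper, which emits
# (key, value) pairs; group the intermediates by key; hand each group
# to the reducer.
def map_reduce(values, mapper, reducer)
  intermediate = Hash.new { |h, k| h[k] = [] }
  values.each do |value|
    mapper.call(value) { |key, val| intermediate[key] << val }
  end
  intermediate.map { |key, vals| [key, reducer.call(key, vals)] }.to_h
end

docs = ["to be or not to be", "be here now"]

# "If this is a string, I'm going to split it and count the words"
mapper  = ->(text, &emit) { text.split.each { |word| emit.call(word, 1) } }
reducer = ->(_word, counts) { counts.sum }

counts = map_reduce(docs, mapper, reducer)
# counts["be"] => 3, counts["to"] => 2
```

The division of labor matches the description above: you write the two functions, and the framework owns iteration and the intermediate results.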
As for distributing data: when we built the sharding logic at PostRank, we didn't use any tools so much as we abstracted all that logic behind internal APIs. We think of everything as a service layer. From the outside, it looks like one big database; in front of a set of databases we have a Ruby API - we use EventMachine for all of our APIs - and these guys run in front of the databases and expose a very simple HTTP protocol.
We send everything in JSON and get everything back in JSON, but that API is responsible for handling all the logic of "Where does this key belong? Is it shard 5 or shard 6?" When you query for something - if you give me 3 keys, the API figures out that it needs to query shards 2, 4 and 6 to get that data, and returns it to you in a consistent fashion, so you don't have to think about it. In our case, it was fairly simple to shard: we have URLs and we want to assign metrics to them, so each URL is mapped to an MD5 string and then we just apply a sharding function to that string.
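That MD5-based routing can be sketched directly. The shard count and the choice of taking the first 8 hex characters of the digest are assumptions for illustration - the real topology isn't described:

```ruby
require "digest/md5"

NUM_SHARDS = 8  # illustrative; not PostRank's actual shard count

# Map a URL to a shard: MD5 the URL, take a slice of the digest as an
# integer, and mod by the shard count. Deterministic, so reads and
# writes for the same URL always land on the same shard.
def shard_for(url)
  Digest::MD5.hexdigest(url)[0, 8].to_i(16) % NUM_SHARDS
end

# For a multi-key query, group the keys so each shard is hit only once.
def group_by_shard(urls)
  urls.group_by { |url| shard_for(url) }
end

urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
plan = group_by_shard(urls)
# e.g. { 2 => [...], 5 => [...] } - query each shard, merge the results
```

Hiding this function behind the API layer is what makes the sharding invisible to callers: only the router needs to know how keys map to backends.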
We've looked at alternatives - MySQL has a number of proxy layers that allow you to do transparent sharding. At the time we started, none of them were production-ready, but at this point they are actually getting much more stable. If you are looking for an alternative, I would encourage you to look at some of the MySQL proxy solutions, but we built ours internally and found that it gave us the most flexibility.
At PostRank, as I said, we index and archive the blogosphere, and that means we need spiders to actually get all the content - and we want to get that content fast, because this is RSS, this is breaking-news type content. If you write a story, we want to get it out as fast as possible and push it to our readers if it crosses the filtering threshold. It's been a challenge for us. There are a number of open source solutions for writing spiders, and it's a non-trivial problem. If you just want to download 100 web pages, it's not hard: you use something like Ruby's Net::HTTP and you're done in a couple of minutes.
But in our case, we have over a million feeds which we want to update every 15 minutes or so, and you have to sustain that rate while making sure you are a good citizen as well - you respect the robots.txt file and do all the things a crawler should do. We looked at open source projects like Nutch, which is a Java spider. We evaluated it, but realized that to embed our custom logic for indexing RSS feeds, we wanted to do it in Ruby. Hence some of the blog posts about how to tune the performance of doing this kind of network IO in Ruby.
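Sustaining that update rate is largely a scheduling problem: every feed carries a next-due time, and the crawler continuously works off whatever is due. Here is a toy scheduler sketch - the 15-minute interval comes from the quote above, but the class and its batching are invented, and a real crawler would add robots.txt checks and per-host politeness:

```ruby
POLL_INTERVAL = 15 * 60  # seconds - "update every 15 minutes"

# Toy fetch scheduler: track a next-due timestamp per feed; hand out
# the feeds that are due and push their next fetch out by the interval.
class FeedScheduler
  def initialize(feed_urls, now: Time.now)
    @due = feed_urls.to_h { |url| [url, now] }  # url => next due time
  end

  def due_batch(now: Time.now)
    batch = @due.select { |_url, at| at <= now }.keys
    batch.each { |url| @due[url] = now + POLL_INTERVAL }
    batch
  end
end

t0 = Time.now
scheduler = FeedScheduler.new(["http://a/feed", "http://b/feed"], now: t0)
first  = scheduler.due_batch(now: t0)        # both feeds, due immediately
second = scheduler.due_batch(now: t0)        # [] - nothing due yet
third  = scheduler.due_batch(now: t0 + 901)  # both feeds again
```

The fetching itself is a separate concern - at a million feeds, each batch has to be dispatched over many parallel connections, which is where the IO story below comes in.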
One of the things we found: we started with Net::HTTP and that was fine until we got past 100,000 feeds. At that point, when we profiled our code, we saw that it was the bottleneck. It's blocking IO; we were using threads, but thread performance in Ruby is also not the greatest, so we started evaluating other solutions and that's how we got into EventMachine. EventMachine allows you to do non-blocking IO in Ruby, and since then we've built all our API servers on EventMachine. The next step after that was to look at other available libraries for maintaining 1,000 parallel connections in one Ruby process.
We ended up using libcurl, which is a very popular library - if you've ever used the curl command-line client, that's what you were using; libcurl is the underlying library that powers curl. We've contributed quite a bit to a project called Curb, which provides Ruby bindings for libcurl, and we've built our spiders using it. One of the nice things about EC2 is you get a very big data pipe - 250 Mbps to each machine - and for our spidering performance, that's exactly what we needed. The reason we had to go to 80 servers when we launched was that we just couldn't download fast enough, so we had to scale out to that number.
Since then, we've collapsed everything down to 4 servers, which do the active crawling, and those servers sustain a rate of anywhere between 10 to 30 MB/second of downloading content, day in and day out. We are pushing terabytes of data in and out, and we found that libcurl was the best solution for us.
We use 1.8 primarily at this point. We looked at going to Ruby Enterprise Edition - there are a number of good performance patches in there and it allows you to tweak the GC - and we have a couple of very memory-heavy servers using it. Most of our code is still deployed on 1.8.7, but I've been personally pushing a lot for Ruby 1.9. There are a lot of really good performance improvements in 1.9, so slowly but surely we are migrating our libraries to be 1.9-compatible, and hopefully some time this year we'll actually flip over.
One of the dangers has been that we're a startup with a finite number of people - I guess that's true in any company - and there are so many things we could be doing. We'd like to move to 1.9 as soon as possible, but we want to avoid the big rewrite, so we've been taking the boy-scout approach: any time I work in a library, I make sure to spend an extra 10 minutes making it 1.9-compatible. We know that at a certain point we'll reach the magic line where we can make the switch and everything will be fine. We've benchmarked some of our code on 1.9 and saw performance increases anywhere from 5 to 25%, and that's a lot when you think about 80 servers - it means we could cut 5 or 10 servers out of our stack, which is a huge benefit in terms of cost.