Bio: Ken is an engineer at Tumblr, where he works with both the frontend and backend teams to build out the next generation of high-scalability infrastructure. Prior to joining Tumblr, Ken worked as an Engineering Director at Etsy, where he drove the implementation of a sharded data architecture, helping the site achieve new heights in throughput and availability. He blogs at http://61cygni.tumblr.com/
Software is changing the world; QCon aims to empower software development by facilitating the spread of knowledge and innovation in the enterprise software development community; to achieve this, QCon is organized as a practitioner-driven conference designed for people influencing innovation in their teams: team leads, architects, project managers, engineering directors.
At Tumblr I am the Director of Product Engineering. Tumblr's engineering staff is about thirty engineers and operations people, split roughly fifty/fifty between the frontend and the backend. The backend engineers build our data services to help solve the scalability challenges around manipulating the data flow that we have; the frontend engineers, which is the product group, consume those services and actually produce the user experience on top of them.
Yes, so Tumblr does about eighteen billion page views a month; globally we see about a hundred and twenty million unique visitors a month, and we generate about sixty million new posts a day.
Daniel: And that’s growing very rapidly.
And that’s increasing, yes.
At the beginning of 2011 the site was mostly a monolithic application with a single data store behind it, and that had reached its scalability limits: we just couldn't write fast enough, the posts were coming in too fast to write to a single data store. So the majority of the last year was focused on scaling out and sharding the post data model, so that we could actually write the posts as quickly as they were coming in. The challenge that came after that had to do with the way we built the dashboard. The dashboard at Tumblr is where users can come and follow blogs that they like and see, in real time, a globally time-ordered feed of all the posts that the people they follow are making.
And even with the sharded post model, the way we were building out that dashboard required all of our users to go to a single database in order to construct the first few pages of their dashboard feed, so we ended up sharding that as well, and that got Tumblr through the end of 2011 and into early 2012.
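The interview doesn't show Tumblr's actual implementation, but the general pattern it describes, spreading writes across many databases by a shard key, can be sketched as follows. This is a hypothetical illustration: the shard count, the use of the blog id as the shard key, and the hash function are all my assumptions.

```python
# Hypothetical sketch of shard routing (not Tumblr's actual code): each new
# post is written to one of N shards chosen by a stable hash of the blog id,
# so writes fan out across shards instead of hitting a single master.
import hashlib

NUM_SHARDS = 16  # assumed shard count, purely for illustration


def shard_for(blog_id: int) -> int:
    """Map a blog id to a shard index with a stable hash."""
    digest = hashlib.md5(str(blog_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS


def write_post(blog_id: int, post: dict, shards: list) -> None:
    """Append the post to the shard that owns this blog's data."""
    shards[shard_for(blog_id)].append({"blog_id": blog_id, **post})


# Toy usage: in-memory lists stand in for the sharded databases.
shards = [[] for _ in range(NUM_SHARDS)]
write_post(1234, {"body": "hello"}, shards)
write_post(1234, {"body": "again"}, shards)
```

Because the hash is stable, all of a blog's posts land on the same shard, which keeps per-blog reads local; the trade-off is that rebalancing when the shard count changes requires moving data, which is why schemes like consistent hashing are often used instead of a plain modulo.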
Daniel's full question: Is sharding something that you recommend engineers with similar post-type and dashboard-type problems do from day one? I mean dividing the data, partitioning it as much as possible and distributing it over many servers. Or do you not regret the initial architecture of the system?
I don't think it's regrettable. Especially for early-stage startups, it's more important to focus on the product than on the technology underpinning it. I don't think it would have been possible to foresee where the bottlenecks would be in the Tumblr infrastructure before they manifested, probably because the product is changing so rapidly. So I would say: stick with the simplest architecture that you can until you see the problems that are arising, and then find the next technical solution that matches the scale you are seeing.
I wouldn't say most of it, but a significant portion of the technology was fundamentally rewritten, especially when you consider that we were reworking the data model. We still have plenty left to rewrite, though.
Daniel's full question: And looking back at the things you did rewrite, and the sharding, the dividing of the data, those are some pretty major changes indeed. Now that you look back at that experience, would you have done something very differently?
Tumblr has always had a small, dedicated engineering staff; even in 2011 there were only five or six engineers. For the majority of the challenges that we had, the technical solutions were well established and the patterns are out there: Facebook and Twitter and LinkedIn have solved similar data-stream problems, and they are luckily very open about how they are doing so. So the solutions, you know, the patterns had been established, but we didn't have the staff to take on all of the fronts that were opening up and all the bottlenecks that were arising. Which is to say that from a technical perspective the path was sound, but we should have staffed up faster to address all the concerns that were arising simultaneously.
I don't think we've passed Facebook in size yet, but we'll see how things go. We are always looking to see what other people are doing; I think that's sort of the beauty of the open web, that we can all learn from each other, see what patterns are working and what patterns aren't, and then adapt that to the problems that we have. And that helps us keep our focus on the product rather than the technology, which is very important to us. We sort of see building up the software as a means to an end, to actually produce something that we think has social value, and that's what we are really aspiring to.
Daniel's full question: When you look at the numbers, though, what kind of reference numbers do you use in, say, a whiteboard session, or in today's system? How big is the system that you are building today for tomorrow?
Yes, so, we do about two thousand new posts a second, which is an important metric for us; that's how much intake we have to handle. Obviously, if you can fan out your writes broadly enough, scaling up your reads is still a challenge, but one that is more tangible and solvable than not being able to write data as fast as it's coming in. The two thousand posts per second is an important number for us, and as we look forward to the next architecture to store all this data, which we are at least planning to found on HBase, that kind of input produces storage estimates somewhere around 150 terabytes per year at the current rate. So those are the kind of numbers that we try to keep in mind.
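As a rough sanity check, the quoted figures fit together like this. The implied per-post byte footprint is derived from the numbers above, not something stated in the interview.

```python
# Back-of-envelope check of the figures in this answer: 2,000 posts/sec
# sustained for a year, against the quoted ~150 TB/year storage estimate.
posts_per_sec = 2_000
posts_per_year = posts_per_sec * 86_400 * 365             # ~63 billion posts/year
storage_bytes_per_year = 150e12                           # ~150 TB/year (quoted)
bytes_per_post = storage_bytes_per_year / posts_per_year  # implied avg footprint
provisioning_target = posts_per_sec * 2                   # plan for doubling in 12 months
print(f"{posts_per_year:,} posts/year, ~{bytes_per_post:,.0f} bytes per post")
```

That works out to roughly 2.4 KB per post on average, a plausible footprint once indexes and metadata are included, and it shows why the write rate, not raw storage, is the number they watch.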
Daniel: Those are numbers that you already have today, right?
Well, we would like to say that we are going to scale infinitely. With architecture, though, and with hardware provisioning, we are trying to plan twelve months out, assuming that we will double in size. So you can say four thousand posts per second.
As I mentioned, the team is essentially divided into a frontend and a backend team. The backend team is standardizing the services it produces on Scala, and the product team is still fundamentally a LAMP team, and there are no plans to change that. We want the frontend to stay in PHP; it's something that a lot of people are very familiar with, it's the devil we know more than anything else, and scaling out frontends to the size that we need to reach has been done before with PHP, so that's something we are going to stick with. The backends, though, are doing fairly well on Scala. One of the reasons we chose Scala was so that we could use some of the existing open-source libraries out there, such as Finagle, to solve the data-access patterns we see, and that's working out very well for us.
Daniel's full question: I run a ten-person engineering team entirely based on Ruby, a little bit on Ruby on Rails, and I'm thinking I'll be able to scale this to two thousand writes per second, no problem. Am I crazy? Or do you think I will be writing Java-based code soon?
I don't think Ruby is going to be your problem; I think it's what you do with it. We have even seen this with Scala. One of the great things about PHP is that it's a very utilitarian language, and people approach it as such: they solve problems with it and focus on the problem rather than the cleverness of the solution. I think as long as you can carry that mindset into a language that encourages clever solutions, like Ruby or Scala, then you can do whatever you want with the language; the language itself is perfectly functional and good. You've just got to make sure you are keeping it simple.
Daniel's full question: Makes sense. So you talked about some of the interesting problems out there that you can solve with these standard techniques shared across all these companies. There are some amazing speakers at this conference talking about all kinds of technology; what is it that you particularly enjoy about Tumblr as a unique kind of problem, compared to, say, Facebook or Twitter or other large-scale systems?
So far, one of the most unique parts about working at Tumblr is that it is still a small engineering team for a rather large site, so your contribution is impactful. Even someone like me, who is mostly in a managerial position, still gets to be involved in hands-on technology, which is personally gratifying. And for all of our engineers, knowing that their work is so impactful for the business and for our users is probably one of the best things about being there right now. I think while other websites have scalability problems similar to the ones we are working on, ours are still rather unique in the business, and since we are trying to provide a real-time feed of information that you've essentially chosen to see yourself, it's still a lot of fun working on that part of it.
Yes, so obviously I follow some of the tech stuff that's on Tumblr; there's not enough of it, we need more tech bloggers to get on Tumblr. I also personally enjoy the visual arts, so I follow a lot of that. And news: Mother Jones and NPR and those guys have good blogs.