Bio: Dan Farino is the Chief Systems Architect at MySpace.com, where he is responsible for designing and implementing the infrastructure required to maintain the site's thousands of servers. Dan has designed and implemented many of MySpace.com's custom performance-monitoring, profiling and live-debugging tools.
Sure. I am the Chief Systems Architect at MySpace. My job has basically been to develop a lot of the backend custom performance-monitoring and troubleshooting tools that we use. Some of our original problems when I got there were a lot of manual configuration, a lot of manual administration, a lot of administrators creating these one-off text files to reset servers. You had these very simple tasks that took thirty minutes to an hour to accomplish. So a lot of what I came in and did was really focus on automation, from an administrative standpoint, and focus on developing tools to make troubleshooting easier, diagnosing performance problems easier, etc.
Well, a lot of the problem is just finding the one server out of thousands that's having the problem. So something I developed is a performance-monitoring system that gives real-time CPU, requests queued, requests per second, that type of information, live across the entire farm. So it's very easy to look at the screen and see the one red server that is popping up saying "Hey, I've got requests queued". At that point the original problem was "Ok, we know the server is having problems, what do we do now?" Before, we would actually take a memory dump, send it off to Microsoft, wait for a week, and then they would come back and say "Hey, your database server was down". Ok, fine, we figured that out two hours later. At this point we have a lot of very simple right-click tools that enable people who are not developers, aren't programmers, are really lower-level people, to go in and figure out: "Hey, I can see a hundred threads blocked on a database call, I think it might be the database". So it's a real focus on providing higher visibility into bad servers and providing tools that allow your less technical administrators to diagnose problems effectively.
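The "one red server" check he describes amounts to scanning per-server metrics for outliers. A rough, hypothetical sketch of that kind of farm-wide scan (the server names, metric fields and threshold are all illustrative, not MySpace's actual tooling):

```python
QUEUE_THRESHOLD = 50  # hypothetical: requests queued before a server shows "red"

def flag_unhealthy(metrics, threshold=QUEUE_THRESHOLD):
    """Return the servers whose queued-request count exceeds the threshold.

    `metrics` maps server name -> {"cpu": percent busy, "queued": requests queued}.
    """
    return sorted(name for name, m in metrics.items() if m["queued"] > threshold)

if __name__ == "__main__":
    farm = {
        "web-001": {"cpu": 35, "queued": 0},
        "web-002": {"cpu": 98, "queued": 240},  # the one red server
        "web-003": {"cpu": 40, "queued": 2},
    }
    print(flag_unhealthy(farm))  # ['web-002']
```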
Yes, sure. The best place to start is probably on the front end: we have about three thousand Windows 2003 IIS 6 servers. The code runs on .Net 2.0 with some .Net 3.5; there's all kinds of versioning madness going on with .Net, but it's basically .Net 2.0 and 3.5 running on the front end. We used to use just SQL Server as the backend store, but database queries are not performant enough to meet the scalability requirements of the site. So we ended up developing a custom cache tier. It's basically object storage, all done through a custom sockets layer, all stored in unmanaged memory on 64-bit .Net 2.0 boxes. One of our original problems was with the .Net garbage collector.
It is just really not designed, even on a 64-bit platform, to traverse billions of objects and collect and compact without causing at least a noticeable delay. So one of our scaling problems was "Hey, we are going to have to write an unmanaged memory store". So we did that; we have a .Net layer in front of it handling all the communications, and that layer, the cache tier as we call it, has really taken a lot of the pressure off the databases. The databases have since been upgraded from SQL 2000 to 64-bit SQL 2005, so that is a good performance increase: a lot more data can be resident in memory, just a lot better than we used to be. We started out on ColdFusion 5 and Windows 2000 with a handful of databases. One thing we did early on with the databases that has enabled us to scale is that we partitioned them, vertically, horizontally, I am not sure which direction you would say, but one million users per database.
So that enables scaling: adding new hardware is easy, and you get failure isolation. If one database goes down, only a small percentage of the users are getting an error message or an "under maintenance" message while we bring it back up. Let's see, we also have a custom distributed file system that we have built on Linux; we use that to upload and load-balance media content. So all of our videos, MP3s, all of those are served by this custom DFS layer, which is redundant across data centers and serves the content straight off the disk via HTTP.
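The one-million-users-per-database scheme can be sketched as a simple arithmetic mapping from user id to partition. This is a hypothetical illustration; the database naming is invented:

```python
USERS_PER_DB = 1_000_000  # the partition size mentioned in the interview

def database_for_user(user_id):
    """Map a user id to its partition: one million users per database."""
    return f"userdb-{user_id // USERS_PER_DB:04d}"

if __name__ == "__main__":
    print(database_for_user(42))         # userdb-0000
    print(database_for_user(3_500_000))  # userdb-0003
```

Because the mapping is pure arithmetic, adding capacity means standing up the next database in the sequence, and a single database failure only affects its own one-million-user range.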
4. What were some of the challenges around scaling the website? You mentioned you started off with ColdFusion and that now there are somewhere in the neighborhood of several thousand IIS servers. So how do you get from A to B? Do you just add servers?
A lot of it was building the central cache tier; that took a lot of load off the database servers. Second, there is an interesting problem when you have a lot of backend databases: if one of them goes down and it's denying requests very quickly, the web server has no problem. But we have had cases where one of the backend database servers gets a little slow. So if you imagine there is a web server servicing requests from all these different users, well, as soon as one of the slower requests comes in, it kind of gets stuck on that bad database.
And then the others are fine, but sooner or later you get another one come in, get stuck on that bad database. So within a matter of a few seconds, all of your threads on your web server have been sucked into this one database. So part of what I architected when I first got there was a failure-isolation system that runs on each of the web servers and says "Ok, I am only going to allow x number of requests to run against each range". And that basically prevents one bad server from taking down the entire website. So originally, the more we scaled out, the worse off we actually became, because each database you add is one more point of failure that is going to take down the site that much quicker. Once we built that failure-isolation module into IIS, uptime increased a lot, and with the addition of the cache tier, uptime increased even more.
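The failure-isolation idea, capping the in-flight requests a web server will send to any one backend so a slow database fails fast instead of absorbing every thread, is what is now often called a bulkhead. A minimal sketch in Python, with invented names and limits (the real module lives inside IIS):

```python
import threading

class Bulkhead:
    """Cap the number of in-flight requests to one backend; reject the rest.

    Illustrative only: the cap makes a slow backend fail fast instead of
    soaking up every worker thread on the web server.
    """

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def call(self, fn, *args, **kwargs):
        # Fail fast instead of queuing behind a slow backend.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("backend over capacity; failing fast")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()

if __name__ == "__main__":
    guard = Bulkhead(max_in_flight=2)
    print(guard.call(lambda: "fast response"))  # fast response
```

With one such guard per database range, one stuck backend can exhaust at most its own allotment of threads while the rest of the site keeps serving.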
It has actually been great. We use a product from Microsoft called PowerShell. It was originally code-named Monad, and the only reason I mention that is because we actually had it deployed in production while it was still Monad beta 3. So it came along at exactly the right time for us, because at about that time we were beginning to reach the limitations of what you could do with VBScript, command files, just all of the recommended Windows administration technologies in use at that point.
That's great if you are working on a local machine; it's great if you want to go out and see the contents of the registry on one of the servers. But once you get up to a thousand servers that you need to run a command on, you really start hitting bottlenecks: one, because of the fact that you are going one server at a time; second, as soon as you hit the one bad server in there, the whole thing is shot. In order to deal with that, I wrote basically my own remoting layer, and the problem I was struggling with at first was "Well, how do I provide a really versatile interface into this remoting layer?" We wanted to be able to manipulate lists of servers very easily, execute a command, then take the results, manipulate those, maybe run another command; there had to be this whole pipeline.
So PowerShell came along right at the time I was developing that. Now, PowerShell is written entirely in .Net and it runs all of your commands in the same process. The benefit of that is that command A can output more than just a string for command B to chop up; it can emit a full .Net object that can pass all the way through the pipeline. It's a very rich programming model, even for an administrator that just wants to sit down and write a single command. So we came up with commands. When I say the term "VIP" at MySpace I basically mean a virtual IP, but also the set of servers that are servicing that particular function. So if I say the "Profile VIP" I mean the however many hundreds of servers running profiles.myspace.com, the pool processing those. So we came up with a command called "Get-VIP".
We can pipe that into a command called RunAgent, which runs whatever remote command you want on the remote machines, runs them all in parallel, and spits out objects into the pipeline. Then you can use PowerShell syntax to say "Well, give me that output where whatever is true, or where the file existed, or didn't exist, whatever you come up with", pipe that into something else, pipe that into something else, so you get this very rapid yet still ad hoc administrative model. When people are writing scripts, they are not worrying about network failures, they are not worrying about writing code to run multiple threads because they have to get this done now; it abstracts away the network, abstracts away the parallelism. It let us really take jobs that used to take thirty minutes to an hour literally down to about five seconds, with more reliability, more error handling, more accountability and logging.
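The Get-VIP | RunAgent pipeline he describes fans a command out to many servers in parallel and turns each result, success or failure, into an object that later pipeline stages can filter. A rough Python analogue of that fan-out-and-filter model, with invented names:

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(servers, command):
    """Run `command` against every server in parallel; collect result objects.

    A rough analogue of the parallel fan-out described above: one dead or
    slow server becomes an error record instead of stalling the whole run.
    `command` is any callable taking a server name.
    """
    def one(server):
        try:
            return {"server": server, "ok": True, "output": command(server)}
        except Exception as exc:
            return {"server": server, "ok": False, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=min(32, max(1, len(servers)))) as pool:
        return list(pool.map(one, servers))

if __name__ == "__main__":
    def fake_command(server):
        if server == "web-007":
            raise ConnectionError("host unreachable")
        return "uptime 12 days"

    results = run_agent(["web-001", "web-002", "web-007"], fake_command)
    # pipeline-style filtering over the result objects
    print([r["server"] for r in results if not r["ok"]])  # ['web-007']
```

The key property is the same one he credits to PowerShell: downstream stages filter structured records, not strings, so "show me every server where the command failed" is a one-liner.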
Well, that decision was made a little bit before my time. Originally, when I came in, we already had about twenty million users running on ColdFusion. We had very, very good .Net developers coming on board, and I am not sure if we brought people in specifically to do .Net or if we just found some of the best developers we could and they happened to say "Hey, there is this great .Net thing out there, let's take a look". But I think the fact that we were running on Microsoft originally made it very easy to continue. I am not a Java person, I don't really come from that background, but so far the .Net platform and the support we've gotten from Microsoft in dealing with some of the growing pains of newer technology have been great, so no complaints from me.
Well, there is some interesting version-number madness right now with Microsoft. I was just looking at a chart earlier, and 3.5 is basically 2.0 with some extra extensions. So if you install 3.5, you get 2.0 as your running base. It's a little bizarre, but the parts of 3.5 that we're using, like Windows Communication Foundation, provide very easy web services, get us out of the old remoting and web-service model, and provide transactional capabilities. We haven't been using it that long, but it looks like something that is going to be pretty exciting and take away a lot of the burden we have had over the years, where we ended up writing our own network libraries just because of some of the glitches that we found in remoting and web services.
8. So the core website is based on 2.0 and 3.5, and the tools that you are involved with are based on PowerShell and one other thing that is in there. How does this whole system of debugging tools interact?
One of the ones that I am particularly fond of is something called the "Profiler". It is written in C++ and it utilizes the Microsoft CLR profiling interfaces. So basically, whereas our stack-dump interface gives you a slice in time, "here is what everything is doing right now", the profiler interface says: "Ok, I am going to take a thread every ten seconds, I am going to profile it from start to finish, and I am going to show you how long each call took. I am going to tell you about every exception, about every memory allocation, about lock contention", a lot of stuff that you really can't see by going into a debugger, stopping everything and just looking at the static state.
So that is something that gives us, not only during bad conditions, what is going on on the server, but also, during ideal conditions, how our requests actually flow through the system. We have a lot of modules written by a lot of different people, so for someone like me, even someone that knows the site, going in and looking at these traces is sometimes rather enlightening, like "I didn't know we were doing that", but we are. That's probably one of the more technically complicated tools we had to develop, and it's one of the few that is not written in .Net, just because when you are trying to profile managed code in .Net you can't write the profiler itself in .Net. It's a lot of C++. One interesting thing it uses is the Detours library from Microsoft Research. It's almost like the technology used for some kernel rootkits, where they just patch out code, divert it to something else, and sneak back in, and the code has no idea it has been patched. So we use that for getting some information that Microsoft's interfaces don't provide but that happens to be very useful.
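The contrast he draws, between a stop-the-world debugger snapshot and a profiler that observes threads over time, can be illustrated with a toy stack sampler. This is only a sketch of the sampling idea in Python; the real tool hooks the CLR profiling API from C++ and records far richer events:

```python
import collections
import sys
import threading
import time

def sample_stacks(duration=0.5, interval=0.05):
    """Repeatedly sample every thread's current stack frame, tallying
    function names. Sampling over time shows where threads actually spend
    their time, unlike a single stop-the-world snapshot.
    """
    counts = collections.Counter()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        for frame in sys._current_frames().values():
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)
    return counts

if __name__ == "__main__":
    stop = False

    def busy_loop():  # a worker we expect to dominate the samples
        while not stop:
            sum(range(1000))

    worker = threading.Thread(target=busy_loop)
    worker.start()
    counts = sample_stacks()
    stop = True
    worker.join()
    print("busy_loop" in counts)  # True
```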
VBScript, probably about two years ago. The .Net language of choice these days for us is C#.
I would say on the CLR itself it's probably close to a hundred percent C#. We have had some braver souls playing around with stuff like F#, from Microsoft Research, which looks great. I know I embed IronPython into a lot of my utilities, because I think it gives you a lot of control as far as configuration: instead of making a dialog with a hundred checkboxes, I can just let you put in a little IronPython script to accomplish what you want. On the whole, though, it's C# on the front end and the back end, with IronPython and PowerShell peppered here and there.
LINQ is actually something I am more and more impressed with every day. I don't see any reason for us not to upgrade all the servers and let developers use it, because it looks like a really neat technology, and even for someone like me that likes writing SQL queries, there is something really cool about a LINQ query that works across XML, objects in memory, the database, maybe some day a custom provider for us, who knows? Very neat stuff, and I hope to see it on our web servers soon.
13. So with a large website such as MySpace, when you have a problem such as a peak load overloading your servers, is there a way to create a fix which makes this miraculously go away or do you just have to keep adding incremental changes and taking small pieces of the issue one at a time?
Right, that's a good question, because I think there are two parts to it: there is finding the issue, and there is correcting the issue. A lot of what we had originally was "Ok, the site is queued right now, what do we do?" We don't even know why it's queued; we just say, let's throw some more servers at it. Ok, that seems to help, but you don't really know: right when we added the servers, maybe the database on the backend just came back up. Not only did you not know what was wrong, but the fix you will turn to next time didn't even really fix it. So initially it was a matter of how do we start identifying these problems, how do we start giving the front-line guy in the NOC the ability to identify these problems. Once that is done, it's usually a lot simpler for him to say "I need to go talk to this guy, I need to go talk to that guy".
Now, of course, there are cases where it's a genuine code problem, and a lot of that is we try very hard to maintain compatibility between code releases on the front end as far as the middle and back tiers go. So if we do get something out there that for whatever reason scaled in QA, scaled in testing, but hit the live website and didn't scale, it's usually an easy matter of rolling back or disabling the feature. Very rarely have we come across problems where, given the right tool set, we can't find out what the problem is. Originally the tool set was a lot smaller, but eventually, when you start seeing the same problem every day, it's "Ok, I am going to take a look at this with the debugger", next day "I am going to take a look at this with the debugger", day three "Ok, I am going to write my own tool, because this is not working". So as our tool set has grown, our ability to find problems and diagnose them much more quickly has increased.
14. How does this process of troubleshooting, understanding the scaling, adapting to the scaling, and figuring out the difference between theory and production affect the system architecture?
Well, let's see: one thing with our current system architecture is that we are pretty much set in our model right now: front end, cache tier, database. So making a large-scale change to that obviously is going to be painful. So we start with "Ok, we are doing a database query for this, it's not really scaling, let's try this relatively simple fix: move it into the cache". All right, that didn't work; maybe we need to build another system. We are going to chip away; we have all these different systems we can escalate to if performance is not working: get it out of the database, ok, let's get it on disk, get it in memory. When that doesn't work, you develop a new system. It doesn't really happen too often at MySpace, just because I think our original horizontal partitioning works pretty well, but of course, as you try to go for more and more 9s of uptime, those little wins make a lot of difference.
Well, I think it does; it's just a matter of having the right tools to do it, having the expertise in house, and the fact that it's a more mature platform right now I think really helps. Obviously Java has had quite a lead on .Net, but I think we are still the largest .Net website on the Internet. And I am not sure how you would really compare our uptime numbers or our performance numbers to a Java site of equivalent size, but it seems like any problems that we encounter are not with the .Net platform itself; it's bugs that we have put in, it's maybe hardware issues. I haven't really hit anything in .Net, except possibly the garbage collector, where I say "This is just not going to scale".
But one great thing about .Net, of course, is the extensibility. So if your memory management on a 16 GB box doesn't work using the garbage collector, well, you can plug in something like Berkeley DB, or you can plug in your own unmanaged store. You don't have to stay in the .Net box to get all the benefits of using it. I would say it does scale, I would say we are going to keep scaling it, and ask me again in a couple of years and we'll see how we did.
I think the best answer to that is that we try the new stuff when it comes out: say, the enterprise application blocks, remoting, web services, these were all tried. If you want to build a big application, this is what they are giving you today. What typically happens is we end up writing our own, just because we don't need such a generic solution and we really need the performance. So I would say we've tried them, we have definitely learnt from them, and we probably had to put a good bit of them aside just for performance or scale reasons. I don't think that Microsoft can test those applications at the scale we are going to test them for them. So it doesn't surprise me that something that is a good enterprise application for a shop with fewer than a hundred servers is not going to meet the scale needs of MySpace.
Well, I can tell you, with the profiler I wrote, I had to use a lot of stuff that is definitely going to void the warranty on that server if they ever find out: just patching CLR code to get performance metrics. We do a lot of stuff where we are actually looking at the source code for the C++ runtime; we use Reflector a lot to get at the internals; maybe there are some cases where we have to use reflection in our production code to patch something where we don't like how Microsoft did it. We try to stay away from that as much as possible, and I can't really think of a fantastic example of where we just went completely against the grain. But every time we hit those little pain points, we are usually pretty good about saying, instead of "Ok, start over, get a new technology", "Let's just tweak this a little bit and hope it still works".
Yes, that is an ongoing process. Right now, the cache tier is a little interesting in that it's not a write-through cache, nor is it a read-through cache. Basically it's a two-step process from the web server: check the cache; if it's not there, the web server goes to the database, retrieves the object, serializes it, sends the page to the user, and then asynchronously populates the cache. Writes are actually done through a very simple model: drop the cached entry and repeat that same process next time. So I am sure at some point we are going to want to move to a more traditional three-tier model where your web servers don't talk to your database, but right now, just based on the fact that when we started it was basically two-tier, web server and database, the cache server was a very simple addition to add on the side. As we grow more and more, we'll start doing more through the cache, and eventually take those connections away from web-server-to-database and make them cache-server-to-database.
But for now it's really... I don't want to say "dumb object storage", because a lot of thought has gone into it and it works remarkably well, but it's just object storage: it doesn't really know what it's storing or where it came from, which may actually help performance, I'm not sure. But it made it very easy to add as an afterthought to the original design of several years ago, when we needed a cache and we had one really fast server: if things got too slow, we would just upgrade it. Not any more.
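The read and write flow described above is essentially the cache-aside pattern. A toy sketch under that reading (the real tier populates the cache asynchronously after the page is served and stores serialized objects in unmanaged memory over sockets; this toy does neither):

```python
class CacheAside:
    """Check the cache, fall back to the database on a miss, populate the
    cache afterwards, and drop the cached entry on writes.
    """

    def __init__(self, db):
        self._db = db       # stand-in for the partitioned database tier
        self._cache = {}    # stand-in for the object-storage cache tier

    def read(self, key):
        if key in self._cache:          # 1. check the cache
            return self._cache[key]
        value = self._db[key]           # 2. miss: go to the database
        self._cache[key] = value        # 3. populate the cache afterwards
        return value

    def write(self, key, value):
        self._db[key] = value           # the write goes to the database...
        self._cache.pop(key, None)      # ...and the cached entry is dropped

if __name__ == "__main__":
    store = CacheAside({"user:1": "Tom"})
    print(store.read("user:1"))   # Tom  (miss: fetched from the db)
    store.write("user:1", "Tom A.")
    print(store.read("user:1"))   # Tom A.  (re-fetched after the drop)
```

Because the cache never sits in the write path, it stays a "simple addition on the side": the database remains the source of truth, and a dropped entry is simply refilled on the next read.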
MySpace is comparable to Twitter in my mind - the site's really slow, doesn't really do many complicated actions, and has a tendency to break a lot.
So, listen to this and learn what not to do ;)
How do you test it?
listening to you (or similarly Dan Pritchett talking about eBay's architecture), I have the same questions in my mind:
how do you test new features before you roll them out? how do you test that they scale? you don't have a test lab with the same level of scale as your production farm, do you? that makes me think you can't guarantee or knowingly predict the exact level of performance until real users hit the new feature in real time. how do you deal with that, and what did you have to build into your system to support deploying new features, as well as rolling features back quickly if they apparently did not work out as you had expected?
and another question is what did you have to do to support the existence of "practical" development environments that behave like your production system but do not require each developer to work with dozens of servers, partitioned databases, and cache instances. How did this change your system's architecture?
Re: How do you test it?
you might be surprised how many companies don't have proper development/QA setups that replicate the production environment well.
I witnessed a number of scenarios where code passed developer and QA testing with flying colors but utterly failed in the production environment. The failure was caused (as you already guessed) by differences between the development/QA and production environment setups. Developers were writing code for one environment while being completely oblivious (through no fault of their own) to the production environment's pains and hazards.
Now, this might not be the case with MySpace. However, I certainly agree with you that Dan didn't say a word about testing their 'hydra'. But on the other hand, he wasn't asked that question.
NOTE to Ryan: It certainly would be great to hear from MySpace on how they test the code.