Moving Channel9.msdn.com to Windows Azure
A little background
Channel 9 is a developer-focused community site run by Microsoft. It exists as an interesting little pocket team where things are quite different than you might expect. Channel 9 is built by a small team (4 devs), and was originally hosted at a 3rd party ISV. It has no connection to the large development and operation teams that run Microsoft’s many other sites such as http://msdn.microsoft.com and http://www.microsoft.com.
In fact, when it originally launched in 2004 it had to run outside of that world, because it was built using technologies (ASP.NET) that hadn’t even been released at the time.
The state of the world
Fast forward to 2010 and the site had developed as many web sites run by small teams do. Features were added quickly with little to no testing, deployments were done by whoever had the FTP credentials at the time, and maintenance consisted of rushing out site fixes late at night. Over the next few years, we worked to make things better, to stay an agile team while slowing things down enough to plan our development and try to stabilize the site for both the users and for our own team.
It didn’t work. Site stability was terrible, leading to downtime on a weekly basis. If the site went down at 2am, our ISP would call the development team. If a server had to be taken out of rotation, it could take days to build up a new machine to replace it. Even within our web farm, servers would have different hardware and software configurations, leading to bugs isolated to certain machines and terrible to troubleshoot. All these problems were not only hurting our users, but were killing our development team. Even code deployments, done manually using FTP, would often result in downtime if the deployment ran into issues when only partially done. As the site became more popular, the development team spent all of their time dealing with bugs and operations; there was no time at all for new features. Trying to explain this to my boss, I came up with this line I’ve used ever since: Imagine if architects had to be the janitor for every building they designed.
We were in a hole, and since we were so busy keeping the site running, we barely even had time to find a way out. It was around this point, in February of 2010, that Windows Azure launched. To me it seemed like the solution to our problems. Even though we are a Microsoft site, Channel 9’s purpose is to get content out to the public; we are not a demo or a technology showcase. With that in mind, no one was pushing us to move onto a new development and hosting platform, but over on the development team we were convinced Windows Azure would be the way to get ourselves back to building features.
Making the case to move
Channel 9 had been hosted at the same ISP for nearly six years, and we had a great relationship with them; we needed to have compelling reasons to make a change at this point, especially to a relatively new platform. We pushed our case based around three main reasons.
First and foremost was a desire to get out of the operations and server management business pulling the development team away from building new features. Second, we expected having a clean environment with isolated systems that are re-imaged whenever deployed would avoid a lot of the hard-to-diagnose errors impacting Channel 9’s stability over the past few years. Finally, and this was definitely a big part of the desire to move to Azure, we explained how Windows Azure would make it easier for us to scale up to handle more and more traffic.
Channel 9 has been getting steadily more traffic every year, but gradual growth like that is something you can handle at nearly any hosting provider by adding more servers over time. A more pressing issue was the tendency to have spikes of traffic around content that was suddenly popular. The spikes would also occur when we hosted the content from a big event like MIX, PDC or the recent BUILD conference. The chart below shows the traffic to the /Events portion of Channel 9 leading up to, during and immediately after the MIX11 conference earlier this year.
For spikes of traffic like that, you need to be able to scale up your infrastructure quickly, and then scale it back down again. At our existing host, adding a new server was a multi-day process, with a fair bit of manual deployment and configuration work involved. There are definitely better ways to handle this, even hosting at an ISP, but we didn’t have anything in place to help with server provisioning.
With all of our reasons spelled out and with the implied promise of more new features, less downtime and the ability to easily scale up, we were able to get the green light on building a new version of Channel 9 on Windows Azure.
Creating Channel 9 in the Cloud
Building a new version? Yes, you read that right, we leveraged some bits of code here and there, but we essentially architected and built a new version of Channel 9 as part of this move. It wasn’t exactly necessary - our existing code base was ASP.NET and SQL Server and could have run in Windows Azure with some changes. In fact, we did that as a proof of concept early on in the planning stage, but in the end, we believed our code base could be greatly simplified after years of ‘organic’ growth. Normally I would always advocate against a ground-up rewrite. Refactoring the existing production code is usually a better solution, but we seemed to be in a state where more drastic action was required. The original site code had more and more features tacked onto it, and not always in a well-planned fashion; at this point it was an extremely complicated code base running behind a relatively simple site.
We sketched out the basic goals Channel 9 needed to accomplish, with the main purpose of the site summed up as: ‘People can watch videos and comment on them’. There are many more features to the site of course, but by focusing on that singular goal we were able to drive to a new UX design, a new architecture and finally to building out a new set of code.
So what did we build?
Since we were starting with a new project, we made major changes in our technology stack. We built the site using ASP.NET MVC for the pages, Memcached as a distributed cache layer, and NHibernate (the .NET version of the popular Hibernate ORM) over SQL Azure as our data layer. All of our videos are pushed into blob storage and accessed through a content delivery network (CDN) to allow them to be cached at edge servers hosted around the world. While the site data is stored in SQL Azure, we use table storage and queues for various features such as view tracking, diagnostics, and auditing.
While we built this new system, the old site was still running (turning off Channel 9 for months was not really an option), and while we stopped building any new code for the production site, maintenance was still an ongoing issue. This conflict is one of the main reasons why rewriting a production code base often fails, and our solution might not work for everyone. To free up the development team’s time to focus on the new code, I tried my best to act as the sole developer who worked on the live Channel 9 site. It wasn’t a good time for my work/life balance, but I was very motivated to get to the new world order of running in the cloud. More than six months later, we made the DNS change to switch Channel 9 over from our old host to its new home in Windows Azure. Along with the completely new code base came a new UX, which was definitely the most noticeable change for all our users.
Our new world order has definitely had the desired effect in terms of site stability and developer productivity. We still manage the site, fix bugs, deploy new releases and actively monitor site performance, but it takes a fraction of our time. The ease with which we can create new servers has enabled us to do more than just scale up the production site: it makes it easy to create staging, test or beta sites, and deployment to production is a process anyone on the team can do without fear they’ll end up in a broken state.
Our deployment process to production is now:
- Deploy to the staging slot of our production instance,
- Test in staging,
- Do a VIP Swap, making staging into production and vice versa, and
- Delete the instance now in staging (this is an important step, or else you are paying for that staging instance!)
Deployment can take 20 minutes, but it is smooth and can be done by anyone on the team. Given the choice of 20 minutes to deploy versus hours spent troubleshooting deployment errors, I’ll take the 20 minutes every time.
Scaling Channel 9 is relatively easy, assuming you have the right monitoring in place. We watch the CPU usage on our servers and look for sustained levels over 50% or even a large number of spikes pushing the levels past that point. If we need to make a change, we can increase or decrease the number of web nodes and/or Memcached nodes with just a configuration change. If needed, and we haven’t done this yet, we can change the size of the virtual hardware for our nodes with a deployment (changing from a two-core image to a four-core one, for example). Automatic scaling is not provided out of the box in Azure, but there are a few examples out there showing how such a system could be created. We normally run with enough headroom (CPU/memory load down around 20% on average) that we can handle momentary traffic increases without any trouble and still have enough time to react to any large sustained increase. SQL Azure is the main issue for us when scaling up, as we didn’t design our system to support database sharding or to take advantage of the new Federation features. It is not possible to just increase the number of SQL Azure servers working for your site or to increase the size of hardware you are allocated. Most of our site access is read-only (anonymous users just coming to watch the videos), so scaling up the web and caching tiers has worked well for us so far.
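The monitoring rule described above (sustained CPU over 50%, or a large number of spikes past that level) could be expressed roughly as follows. This is a minimal sketch, not our actual tooling: the function name, the length of a "sustained" run, and the spike count are all assumptions chosen for illustration.

```python
from typing import Sequence


def should_scale_up(cpu_samples: Sequence[float],
                    threshold: float = 50.0,
                    sustained_run: int = 5,
                    max_spikes: int = 10) -> bool:
    """Return True if recent CPU readings suggest adding nodes.

    cpu_samples holds CPU percentages, oldest first. sustained_run and
    max_spikes are hypothetical tuning knobs, not values from the article.
    """
    run = 0      # consecutive samples above the threshold
    spikes = 0   # total samples above the threshold
    for sample in cpu_samples:
        if sample > threshold:
            run += 1
            spikes += 1
            if run >= sustained_run:
                return True   # sustained load over the threshold
        else:
            run = 0
    return spikes >= max_spikes   # many separate spikes also count


# Example: steady 20% load with headroom, then a sustained climb.
print(should_scale_up([20, 22, 19, 21]))          # False
print(should_scale_up([20, 55, 60, 62, 58, 57]))  # True
```

A check like this only flags the need; the change itself is still the configuration update described above.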
Lessons learned in moving to Windows Azure
Windows Azure is a new development platform, and even though we work at the same company we didn’t have any special expertise with it when we started building the new version of our site. Over time, both before and after we deployed the new site into production, we learned a lot about how things work and what worked best for our team. Here are the five key pieces of guidance we can offer about living in the Windows Azure world.
#1 Build for the cloud
Not to sound like an advertisement, but you should be ‘all in’ and commit to running your code in the cloud. This doesn’t mean you should code in such a way that you can never consider moving; it is best to build modular code using interfaces and dependency injection, both for testing purposes and to isolate you from very specific platform details. Channel 9 can run locally in IIS, in the emulation environment for Windows Azure, or in real production Windows Azure, for example. What I am saying, though, is don’t just port your code currently running in IIS. Instead, you should revise or build your architecture with Windows Azure in mind, taking advantage of built-in functionality and not trying to force features into Windows Azure just because you had them with your previous host. As one example, we have built our distributed caching layer (Memcached) to use the Windows Azure runtime events when server topology changes, so our distribution of cache keys across the n worker roles running Memcached is dynamic and can handle the addition and removal of instances. We would need a new model to handle this in a regular server farm, but we knew we’d be on Azure so we built accordingly.
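To illustrate what a dynamic distribution of cache keys can look like, here is a minimal sketch using consistent hashing, one common technique for this problem. It is an assumption for illustration only and not necessarily how Channel 9 does it. The class name, node names, and replica count are hypothetical, and `add_node`/`remove_node` stand in for whatever handler reacts to the platform's topology-change events.

```python
import bisect
import hashlib


class HashRing:
    """Maps cache keys to nodes; adding or removing a node moves few keys."""

    def __init__(self, nodes=(), replicas=100):
        self._replicas = replicas   # virtual points per node, for even spread
        self._ring = []             # sorted list of (hash, node) points
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        # Would be called when the topology reports a new cache instance.
        for i in range(self._replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        # Would be called when an instance is taken out of service.
        self._ring = [point for point in self._ring if point[1] != node]

    def node_for(self, key: str) -> str:
        if not self._ring:
            raise LookupError("no cache nodes available")
        # First ring point at or after the key's hash, wrapping around.
        index = bisect.bisect(self._ring, (self._hash(key), ""))
        return self._ring[index % len(self._ring)][1]


ring = HashRing(["cache_0", "cache_1", "cache_2"])
owner = ring.node_for("video:1234")
ring.remove_node("cache_1")   # keys on other nodes keep their owner
```

The point of the design is that when an instance disappears, only the keys it owned need to be re-fetched; everything cached on the remaining instances stays valid.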
#2 Division of Labor
Cloud computing systems work well because they are designed to have many virtual machines running relatively small workloads each, and the ability to take any machine down, re-image it and put it back into service as needed. What does this mean for your code? Assume your machine will be freshly initialized when the user hits it and it might go away and come back completely clean again at any time. Everything you do, especially big tasks, should be designed to be run across many machines. If a given task must run on a single machine and must complete all in one go that could take hours, then you are missing out on a lot of advantages of running in Windows Azure. By designing a large task to break the work up into many small pieces, you can scale up and complete the work faster by just adding machines. Take existing processes and refactor them to be parallel and design any new processes with the idea of parallel processing in mind. Use Windows Azure storage (Queues, Blobs, and Tables) to distribute the work and any assets that need to be shared. You should also make sure work is repeatable, to handle the case where a task is only partially completed and a new worker role has to pick up that task and start again.
In the Channel 9 world, one great example of this concept is the downloading and processing of incoming web log files. We could have a worker role that checks for any incoming files, downloads them, decompresses them, parses them to find all the video requests, determines the number of views per video file and then updates Channel 9. This could all be one block of code, and it would work just fine, but it would be a very inefficient way to code in the Windows Azure world. What if the process, for a single file, managed to get all the way to the last step (updating Channel 9) and then failed? We would have to start again, downloading the file, etc.
Writing this with a Windows Azure mindset, we do things a bit differently. Our code checks for new log files, adds them to a list of files we’ve seen, then queues up a task to download the file. Now our code picks up that queue message, downloads the file, pushes it into blob storage and queues up a message to process it. We continue like this, doing each step as a separate task and using queue messages to distribute the work. Now if a large number of files needed to be processed, we could spin up many worker roles and each one would be kept busy processing chunks of files. Once one file was downloaded, the next could be downloaded while other workers handled the decompressing and processing of the first file. In the case of a failure we would only need to restart a single step.
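The step-per-message approach described above can be sketched as follows. This is a simplified, in-memory illustration under stated assumptions, not the Channel 9 code: a `deque` stands in for a Windows Azure queue, dictionaries stand in for blob storage and the site's view totals, the decompression step is omitted, and every name here is hypothetical.

```python
from collections import deque


class LogPipeline:
    """Each step is its own queue message, so a failure repeats one step only."""

    def __init__(self):
        self.queue = deque()    # stands in for a Windows Azure queue
        self.blobs = {}         # stands in for blob storage
        self.view_counts = {}   # stands in for the site's per-video totals
        self.applied = set()    # log files whose counts were already applied

    def submit(self, name, raw_log):
        # A real worker would fetch the file itself; here the raw text rides
        # along in the message to keep the sketch self-contained.
        self.queue.append({"step": "download", "file": name, "raw": raw_log})

    def _download(self, msg):
        self.blobs[msg["file"]] = msg["raw"]
        self.queue.append({"step": "parse", "file": msg["file"]})

    def _parse(self, msg):
        counts = {}
        for video in self.blobs[msg["file"]].splitlines():
            counts[video] = counts.get(video, 0) + 1
        self.blobs[msg["file"] + ".counts"] = counts
        self.queue.append({"step": "update", "file": msg["file"]})

    def _update(self, msg):
        if msg["file"] in self.applied:
            return  # repeatable: a redelivered message changes nothing
        for video, n in self.blobs[msg["file"] + ".counts"].items():
            self.view_counts[video] = self.view_counts.get(video, 0) + n
        self.applied.add(msg["file"])

    def run(self, max_attempts=3):
        handlers = {"download": self._download, "parse": self._parse,
                    "update": self._update}
        while self.queue:
            msg = self.queue.popleft()
            try:
                handlers[msg["step"]](msg)
            except Exception:
                # Requeue only the failed step, up to max_attempts tries.
                msg["attempts"] = msg.get("attempts", 0) + 1
                if msg["attempts"] < max_attempts:
                    self.queue.append(msg)


pipeline = LogPipeline()
pipeline.submit("log1", "a.wmv\nb.wmv\na.wmv")
pipeline.run()
print(pipeline.view_counts)
```

Because the intermediate results live in shared storage and the final step checks whether it has already run, any worker can pick up any message, and a message delivered twice does no harm. A production version would also need the "already applied" check and the update to happen atomically, which this sketch does not attempt.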
#3 Run it all in the cloud
Earlier in this article I mentioned having the dev team also be an operations team was killing both our productivity and overall job satisfaction. There is more to be maintained than just the web servers themselves. In addition to the public site, we also have background processes (like the log processing example given above), video encoding servers, and reporting systems. All of these started out running on internal servers, which meant we had to maintain them. If a system failed, we had to get it back up and then restore any missing data. If a system account password changes, we are left with a small outage until someone fixes any running processes. All of these things are exactly the type of work I don’t want the development team to be doing.
The solution is to move everything we can into Windows Azure. It can take time to turn a process into something ‘Windows Azure safe’, making it repeatable, stateless and isolated from its environment, but it is time well spent. We aren’t 100% of the way there. We have moved the log downloading and processing task and various other small systems, but our video encoding process is still a complex system running entirely on internal servers. Every time it goes down and requires my manual intervention, I’m reminded of the need to get it moved into Windows Azure, and of the value of getting my team out of the business of running servers.
#4 State kills scalability
These last two points are true for building any large scale system, but they are worth mentioning in the context of Windows Azure as well. State is essential in many scenarios, but state is the enemy of scalability. As I mentioned earlier, you should design your system assuming nodes could be taken down, re-imaged and brought back clean at any time. In practice they tend to run a very long time, but you can’t depend on that fact. Users will move between servers in your role, quite likely ending up at a different server for every request. If you build up state on your web server and depend on it being there on the next request by that user, then you will either have a system that can’t scale past a single instance, or a system that just doesn’t work most of the time. Look at alternate ways to handle your state, using Windows Azure App Fabric Caching or Windows Azure storage like blobs and tables, and work to minimize it as much as you can.
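As a small illustration of the point about per-user state, here is a minimal sketch in which the web nodes hold nothing between requests and all state lives in a shared store. The `SharedStore` class is a hypothetical in-memory stand-in for a distributed cache or table storage; the names and the page-view counter are assumptions made for the example.

```python
class SharedStore:
    """Hypothetical stand-in for a distributed cache or table storage."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value


class WebNode:
    """A web server instance that keeps no per-user state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, session_id):
        # Read, update, and write back state through the shared store, so
        # the next request can land on any other node (or a re-imaged one).
        views = self.store.get(session_id, 0) + 1
        self.store.set(session_id, views)
        return views


store = SharedStore()
nodes = [WebNode("web_0", store), WebNode("web_1", store)]

# The same user bounces between servers on every request.
for request_number in range(4):
    node = nodes[request_number % len(nodes)]
    print(node.name, node.handle("session-abc"))
```

If `handle` kept the counter in an attribute on the node instead, each server would see only its own share of the requests, which is exactly the "doesn't work most of the time" failure described above.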
#5 Push everything you can to the edge
The web, even over high-speed connections, is slow. This is especially true if you and the server you are trying to hit are far away from each other. Given this, no matter how fast we make our pages, the site could still be quite slow for a user on the other side of the world from our data center. There are solutions to this problem for the site itself, but in many situations it is the media your site is using that is the real issue. The solution for this is simple: bring the content closer to the user. For that, a content distribution network (CDN) is your friend.
If you set up a CDN account, either through the Windows Azure built-in features or through another provider like Akamai, you are going to map a source of your content to a specific domain name. By virtue of some DNS rewrites, when the user requests your content the request goes to the CDN instead, and it attempts to serve the user the content from the closest server (and any good CDN will have nodes in hundreds of places around the world). If the CDN doesn’t have the content being requested, it will fetch that content from your original source, making that first request a bit slower but getting the content out to the edge server to be ready for anyone who wants the same file.
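The fetch-on-first-request behavior described above can be modeled in a few lines. This is a toy sketch of a single edge server, not any CDN's actual implementation: the class name, the origin function, and the file path are hypothetical, and real CDNs add expiry, invalidation, and routing to the nearest node.

```python
class EdgeCache:
    """One edge server: serve from cache, fall back to the origin on a miss."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin   # callable: path -> content
        self._cache = {}

    def get(self, path):
        if path not in self._cache:
            # First request for this file at this edge: slower, because the
            # content has to come from the original source.
            self._cache[path] = self._fetch(path)
        return self._cache[path]


origin_hits = []


def origin(path):
    # Stand-in for blob storage acting as the original source.
    origin_hits.append(path)
    return f"bytes of {path}"


edge = EdgeCache(origin)
edge.get("/videos/keynote.wmv")   # miss: goes back to the origin
edge.get("/videos/keynote.wmv")   # hit: served from the edge
print(len(origin_hits))
```

However many viewers near that edge request the same video, the origin is asked for it only once.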
We are back to being developers and we love it
Moving to Windows Azure really worked out for our team. We can focus on what we are actually good at doing, and can leave running servers to the experts. If your development team is bogged down with operations, deployments, and other maintenance, consider a revamp of your hosting platform. Of course, Windows Azure isn’t the only option. Even using a regular ISP and moving from manually configured servers to ones based on images and virtual machines would give you a lot of the benefits we’ve experienced.
However you do it, letting your developers focus on development is a great thing, and if you do end up using Windows Azure, consider the tips I’ve given when you are planning out your own system migration.
About the Author
Duncan Mackenzie is the developer lead for Channel 9, a video/community site run by Microsoft. Before Channel 9, he worked at MSDN building code samples and writing articles, and started out as a consultant working at Microsoft Canada. Outside of his job at Microsoft, he has written a variety of books on Visual Basic, Word VBA, Xbox and even Zune. When he isn't busy coding or writing, Duncan fills his time at home in Redmond, WA with his two kids and a reasonable amount of time spent on video games.
Memcached vs App Fabric
Re: Memcached vs App Fabric
Why did you choose to use Memcached as opposed to using Microsoft App Fabric for your distributed caching layer?
Couple of different reasons Chris... #1 was that App Fabric caching was not released or even in any form of preview when we built and released Channel 9. We had actually assumed we would be using Velocity, the non-Azure form of this technology, as we built Channel 9, but when we found out that it was not available for Azure, we had to go with Memcached.
Since then App Fabric Cache has come out, but while it is interesting to us, it would have to be amazingly interesting to justify making a change to a working production system that has had absolutely zero problems around caching. Having said that, we are talking to that team about future features and releases, and when that cost/benefit tips we will be right over to their product! :)