On December 22nd at 1600 GMT, Skype's services started to become unavailable, at first for a small share of users and then for more and more of them, until the network was down for about 24 hours. A week later, Lars Rabbe, CIO at Skype, explained what happened in a post-mortem analysis of the outage.
At its core, Skype relies on a third-generation P2P network made up of many peer nodes and a smaller number of supernodes, roughly one for every several hundred nodes. Because Skype has no centralized directory for finding routes between two or more nodes that want to communicate, the virtual network uses supernodes as directories. When a client signs in to Skype, it registers with a supernode and supplies its IP address so that other clients can find it. When someone wants to start an IM or voice/video session with another client, the supernodes are queried, the target IP address is returned, and a direct communication link is opened between the two clients. Supernodes are therefore vital elements of Skype’s network.
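The register-then-lookup flow described above can be sketched in a few lines. This is only an illustration of the directory idea, not Skype's actual protocol: the class name Supernode, the function open_session, and the user names and IP address are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Supernode:
    """Hypothetical in-memory directory kept by a single supernode."""
    directory: Dict[str, str] = field(default_factory=dict)

    def register(self, user: str, ip: str) -> None:
        # A client announces its current IP address when it comes online.
        self.directory[user] = ip

    def unregister(self, user: str) -> None:
        # Forget a client that has gone offline (no error if unknown).
        self.directory.pop(user, None)

    def lookup(self, user: str) -> Optional[str]:
        # Another client asks where to reach `user`; None means unknown.
        return self.directory.get(user)


def open_session(supernode: Supernode, caller: str, callee: str) -> str:
    """Resolve the callee through the supernode, then describe the direct link."""
    ip = supernode.lookup(callee)
    if ip is None:
        raise LookupError(f"{callee} is not registered")
    # The supernode only answers the lookup; the session itself is peer to peer.
    return f"{caller} -> {callee} @ {ip}"
```

In this sketch the supernode is involved only in the lookup, which is why losing supernodes stops new sessions from being set up even though the peers themselves are still running.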
Skype also runs a number of support servers responsible for delivering offline messages. Because of an unexpectedly large number of undelivered messages, these servers fell behind and delivered the messages late. A bug in Skype for Windows version 5.0.0.152 made the application crash when it received these late messages. The latest Windows version, 5.0.0.156, earlier Windows versions, and all versions for other platforms were not affected by the bug. The problem was that around 50% of users were running the faulty version, which was the initial release of Skype 5. Approximately 40% of the Skype clients that were online crashed, taking down around 30% of all supernodes.
Clients that were still running, and clients that had restarted after a crash, had their network searches directed to the supernodes that remained, which overloaded them. A supernode runs on an ordinary client's machine, so Skype protects that machine by shutting the supernode down when it is overloaded rather than letting it consume too much of the system's resources. As a result, the supernodes started to shut down automatically one after another, leading to a general failure of the network.
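A toy model makes the feedback loop easier to see. The sketch below is not based on Skype's real numbers or shutdown rule; it assumes that the total lookup load stays constant, that it is spread evenly over the live supernodes, and that one supernode shuts itself down per round whenever the per-node share exceeds its capacity. The function name and all parameters are hypothetical.

```python
from typing import Tuple


def simulate_cascade(supernodes: int, crashed: int,
                     capacity: float, total_load: float) -> Tuple[int, int]:
    """Return (supernodes left alive, protective shutdowns triggered).

    Assumptions (illustrative only): the load is constant and evenly
    shared, and an overloaded network loses one supernode per round.
    """
    alive = supernodes - crashed
    shutdowns = 0
    # Each shutdown raises the share carried by the survivors, so once the
    # threshold is crossed the loop cannot recover on its own.
    while alive > 0 and total_load / alive > capacity:
        alive -= 1
        shutdowns += 1
    return alive, shutdowns
```

With 100 supernodes that can each handle 12 units of load and a total load of 1000 units, losing 10 nodes leaves the network stable, because 1000 / 90 is still below 12. Losing 30 pushes the share to about 14.3, and every protective shutdown after that only raises the share for the remaining nodes until none are left.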
Skype cannot function without supernodes, so the company started first hundreds and then thousands of new supernodes in the hope of restoring the service. Skype did not specify which systems it used for this; they may have been its own machines or instances on Amazon EC2. The network gradually rebuilt itself around these supernodes, and the service was restored after about 24 hours. Skype said it later stopped most of the supernodes it had started, leaving a few running in case of further trouble, since the network is known to be heavily used during Christmas.
One important lesson to be learned is this: many users do not update their software if they don’t have to. Skype had already released a newer version without the triggering bug, but most users were still running the buggy one. Skype is going to review its auto-updating process, perhaps implementing something like the one Google Chrome has:
“We will also be reviewing our processes for providing ‘automatic’ updates to our users so that we can help keep everyone on the latest Skype software. We believe these measures will reduce the possibility of this type of failure occurring again.”
Another lesson is that software should be tested as thoroughly as possible. Skype has decided to review its “testing processes to determine better ways of detecting and avoiding bugs which could affect the system.”
The last lesson, but not the least, concerns the capacity of the Skype servers that support the network, such as those serving offline IM. Rabbe said they “will keep under constant review the capacity of our core systems that support the Skype user base, and continue to invest in both capacity and resilience of these systems.”
The enterprise version, Skype Connect, was not affected by the outage, according to Peter Parkes, Skype’s blogger-in-chief.
Community comments
Skype
by skype user,
I registered on this website just to write this comment.
The article is misleading in that it blames the user for not upgrading skype.
The reality is that it is the company, Skype.com, which is at fault.
The newer versions of skype have an auto-updating feature and the user
has NO CONTROL OVER IT! The update proceeds whether the user wants it or not!
So the system-wide crash is really caused by skype's own updating process, NOT by user error/inaction. Users were running the most current version of skype they were given.
Re: Skype
by Sam Moffatt,
I agree, I found myself the day after with my Skype automatically updated without me doing anything. I remember being offered to upgrade the day before and declining (yes, I said no and the damned thing updated anyway). Skype's policy of silently updating on Windows shot their network in the foot. I managed to get back to the 4.2 version that I was on and was reasonably happy with however the article is logically misleading.
The bug affected a particular version of the client when it had undelivered messages. For anyone who, like me, was silently upgraded to the buggy version (before they fixed it) and then started it, each start will crash Skype. If the user was either automatically upgraded to a buggy version or knowingly updated to the latest version and restarted (potentially not triggering the bug if they got back online before anyone sent them a message), then they aren't particularly at fault. Moreover, after updating, their client crashes, and when it stops crashing the network is offline anyway! The update system can't function properly until the client (that the user just updated to) stops crashing, so the network has to die before those users realise they need ANOTHER update, after the one they did yesterday, to fix the crashing. In this particular case upgrading the version caused the problem; people on the old client were perfectly fine. If anything, people keeping up to date encouraged this bug, in addition to Skype's magical update-without-your-permission system.
This might have been avoided if Skype didn't automatically update silently and without/against permission. They might have had a better chance of getting people on the non-buggy version before the network went down entirely.
Sloppy Programming...
by Randy Zagar,
So, why exactly does a client-side application failure cause a super-node to fail? Sloppy programming is the only explanation. That'd be like having the Apache web server crash any time Firefox died.
The real lesson is this....
by Jim Ngo,
The inherent design of the system is at fault. If the system is designed to prevent overload of a supernode by removing the supernode then it is obvious that the problem will cascade as clients keep querying and overwhelming other supernodes. Duh. I believe this was covered in my second year of computer science at University.
Re: Skype
by Abel Avram,
Hi "skype user",
the article does not try to blame anyone. While I could name a culprit, that was not my intention. I did not blame the users for not having the fixed version. I said "One important lesson to be learned is this: many users do not update their software if they don’t have to." And that is true. And it is not true just for Skype, it is true for all software in general. Microsoft has serious problems convincing some people to move to a newer/safer version of Windows. Some are still using Windows 95.
Regarding the update process: Lars Rabbe, CIO at Skype, said "Since a bug was identified in Skype for Windows (version 5.0.0.152), we had provided a fix to v5.0 of our Windows software prior to the incident." So, they provided the update. On the other hand, some users complain that the update does not work, or that they do not want it, and their reports are a bit confusing, making one wonder what the problem actually is:
CeRBeR:
InuyashaFan:
scott_81::
Well, I am personally using Skype 4.2 and it does not auto-update because I have not enabled the option "Automatically download and install it" under the "Notify me" one. For me, the update process works fine. From the forum posts I quoted it is hard to grasp what is actually going on. Are some users really having a problem, or are they not using Skype's settings correctly? I do not intend to clarify that. With this article I just wanted to find out what lessons software companies can learn from Skype's outage.
Re: Skype
by Bojan Antonovic,
The "problem" is that for Microsoft products you have to pay, while Skype is free.
the newest client doesnt seem to be a miracle cure-all
by mitko didkov,
i registered here only because i disagree with the "fault" the article implies to the end users.
SKYPE ALREADY UPDATES ITSELF super stealthily!
therefore i am using only the following:
-"skype 3.6 portable.exe"
it stores data in %appdata% and works as normal, only no auto update is going on behind the scenes.
it works like a charm, and the protocol has backward compatibility
-the web based imo.imo service
i will never install skype 4.x
or skype 5.x
k10x bai
Re: the newest client doesnt seem to be a miracle cure-all
by Abel Avram,
Hi Mitko,
the article I wrote does not try to find fault with the users or with Skype. The article just presents some lessons to be learned. One of them is: "many users do not update their software if they don’t have to." And that is true. What you said is proof of that. You are still using an older version of Skype, and you said you won't upgrade. This can prove to be very detrimental in some cases, as happened with Skype.
Re: Sloppy Programming...
by CS Goh,
Your analogy describes it perfectly! I wonder how a Skype client could kill a server component. I don't believe automated updates can solve the problem entirely; perhaps they should inspect the supernode programming thoroughly to make it fail-safe.
Re: Sloppy Programming...
by Louis Routhier,
Supernodes are NOT "server components". They are simply clients that have agreed to dedicate some resources to act as a directory. I don't think that, as a client, you would appreciate your home computer consuming all its resources simply to help the network.
But on the other hand, I agree that maybe a throttling mechanism could be implemented to limit usage instead of shutting the supernode down.
Re: the newest client doesnt seem to be a miracle cure-all
by Louis Routhier,
This article may help to understand the supernode concept:
www.disruptivetelephony.com/2010/12/understandi...