Lessons Learned from Skype’s Outage


On December 22nd at 16:00 GMT, Skype's services started to become unavailable, at first for a small number of users and then for more and more of them, until the network was down for about 24 hours. A week later, Lars Rabbe, CIO at Skype, explained what happened in a post-mortem analysis of the outage.

At its core, Skype relies on a third-generation P2P network made up of many peer nodes and a smaller number of supernodes, roughly one for every few hundred nodes. Since Skype does not have a centralized directory for finding routes between two or more nodes that want to communicate, the virtual network uses supernodes as directories. When a client signs in to Skype, it registers itself with a supernode, giving its IP address so that other clients can find it. When someone wants to start an IM or voice/video session with another client, the supernodes are queried, the target IP is obtained, and a direct communication link is opened between the two clients. Supernodes are vital elements of Skype’s network.
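The registration and lookup flow described above can be sketched in a few lines of Python. This is only an illustration of the directory idea, not Skype's actual protocol: the `Supernode` class, the `open_session` helper, the user names, and the IP addresses are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Supernode:
    """Hypothetical supernode: a small directory mapping user names to IPs."""
    directory: Dict[str, str] = field(default_factory=dict)

    def register(self, user: str, ip: str) -> None:
        # A client signing in announces the address it can be reached at.
        self.directory[user] = ip

    def lookup(self, user: str) -> Optional[str]:
        # Returns the registered IP, or None if this supernode does not know the user.
        return self.directory.get(user)


def open_session(supernodes: List[Supernode], caller: str, callee: str) -> Optional[str]:
    """Ask each supernode for the callee's IP, then connect peer to peer."""
    for node in supernodes:
        ip = node.lookup(callee)
        if ip is not None:
            # The supernode only resolves the address; traffic flows directly.
            return f"{caller} -> {callee} at {ip}"
    return None  # No supernode knows the callee, so no session can be set up.


nodes = [Supernode(), Supernode()]
nodes[0].register("alice", "203.0.113.10")
nodes[1].register("bob", "203.0.113.20")
session = open_session(nodes, "alice", "bob")
```

The sketch also shows why supernodes matter so much: if none of them can answer the lookup, `open_session` returns `None` and the two clients never find each other.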

Skype also runs a number of support servers responsible for delivering offline messages. Because of an unexpectedly large number of undelivered messages, these servers delivered the messages with a delay. A bug in Skype for Windows version 5.0.0.152 made the application crash when it received the late messages. The latest Skype version, 5.0.0.156, earlier versions for Windows, and all versions for non-Windows machines were not affected by the bug. The problem was that around 50% of users were running the faulty version, which was the initial release of Skype 5. The clients of approximately 40% of all Skype users who were online crashed, taking down around 30% of all supernodes.

Clients that stayed up, along with clients that restarted the application, had their network searches directed to the supernodes that were still running, which overloaded them. Skype has a protection mechanism that shuts a supernode down when it is overloaded, so that it does not consume too much of the host client's system resources. As a result, the supernodes started to shut down automatically one after another, leading to a general failure of the network.
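The feedback loop described above can be illustrated with a toy simulation. Only the 30% crash figure comes from the post-mortem; the node count, total load, per-node capacity, and the share of nodes that trip their protection in each round are made-up parameters, and the function name `simulate_cascade` is hypothetical.

```python
from typing import List


def simulate_cascade(
    total_supernodes: int = 1000,    # assumed network size
    crashed_fraction: float = 0.3,   # about 30% of supernodes lost (from the post-mortem)
    total_load: float = 800.0,       # assumed lookup load, in arbitrary units
    capacity_per_node: float = 1.0,  # assumed load a supernode tolerates before shutting down
    shutdown_share: float = 0.2,     # assumed share of survivors that trip protection per round
) -> List[int]:
    """Return the number of supernodes still alive after each round."""
    alive = total_supernodes - round(total_supernodes * crashed_fraction)
    history = [alive]
    # The same total load is spread over fewer and fewer nodes, so every
    # shutdown pushes the remaining nodes further past their limit.
    while alive > 0 and total_load / alive > capacity_per_node:
        tripped = max(1, int(alive * shutdown_share))
        alive -= tripped
        history.append(alive)
    return history


healthy = simulate_cascade(crashed_fraction=0.0)
after_crash = simulate_cascade()
```

With these assumed numbers the full network carries the load comfortably, but once 30% of the supernodes disappear the per-node load exceeds capacity and never recovers on its own, which is why the overload protection ends up draining the network instead of stabilizing it.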

Skype cannot function without supernodes, so the company started first hundreds and then thousands of supernodes in the hope of restoring the service. They did not specify which systems they used for that; these may have been their own machines or instances on Amazon EC2. The network started to rebuild itself around these supernodes, and the service was restored after 24 hours. Skype said it then stopped most of the supernodes it had started, leaving a few running in case of further trouble, since the network is heavily used over Christmas.

One important lesson to be learned is this: many users do not update their software if they don’t have to. Skype had already released a newer version without the triggering bug, but a large share of users, around half, were still running the buggy one. Skype is going to review its auto-updating process, perhaps implementing something like what Google Chrome has:

We will also be reviewing our processes for providing ‘automatic’ updates to our users so that we can help keep everyone on the latest Skype software. We believe these measures will reduce the possibility of this type of failure occurring again.

Another lesson is that one should do everything possible to make sure the software is thoroughly tested. Skype has decided to review its “testing processes to determine better ways of detecting and avoiding bugs which could affect the system.”

The last lesson, but not the least, concerns the capacity of the Skype servers that support the network, such as those serving offline IM. Rabbe mentioned that they “will keep under constant review the capacity of our core systems that support the Skype user base, and continue to invest in both capacity and resilience of these systems.”

The enterprise version, Skype Connect, was not affected by the outage, according to Peter Parkes, Skype’s blogger-in-chief.
