Is Amazon EC2 Oversubscribed and Suffering from Internal Network Latency?
Several Amazon EC2 users have reported that their instances are performing poorly as a result of high internal network latency. This has led to speculation that Amazon's cloud might be getting oversubscribed.
Alan Williamson of aw2.0 Ltd has written a report about his experiences with Amazon EC2, in which he claims that Amazon, like every cloud provider his company has tried, seems to scale well at the beginning, but that there is a tipping point:
Amazon in the early days was fantastic. Instances started up within a couple of minutes, they rarely had any problems and even their SMALL INSTANCE was strong enough to power even the moderately used MySQL database. For a good 20 months, all was well in the Amazon world, with really no need for concern or complaint.
However, in the last 8 or so months, the chinks in their armour have begun to show. The first signs of weakness came from the performance of the newly spun up Amazon SMALL instances. According to our monitoring, the newly spun up machines in the server farm, were under performing compared to the original ones. At first we thought these freaks-of-nature, just happened to be beside a "noisy neighbor". A quick termination and a new spin up would usually, through the laws of randomness, have us in a quiet neighborhood where we could do what we needed.
However, in the last month or two, we've even noticed that these "High-CPU Medium Instance" have been suffering a similar fate of the Small instances, in that, new instances coming up don't seem to be performing anywhere near what they should be. After some investigation, we discovered a new problem that has crept into Amazon's world: Internal Network Latency.
A couple of weeks ago we noticed that our ping latency graphs on Cloudkick looked very odd.
...our monitoring node on EC2 is pinging four different servers on Slicehost. The average ping latency is all over the place.
The conclusion? Alan Williamson's post on EC2 oversubscription seems to make a lot of sense. The network behind EC2 appears to be experiencing very sporadic latency issues.
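To make "all over the place" concrete, a set of ping samples can be summarized by how much they spread relative to their average. The following is only a minimal sketch, not the method used by any of the quoted authors: the function names, the sample values, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import statistics


def summarize_latency(samples_ms):
    """Return basic statistics for round-trip times given in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    ordered = sorted(samples_ms)
    mean = statistics.fmean(ordered)
    stdev = statistics.stdev(ordered)  # sample standard deviation
    return {
        "mean_ms": mean,
        "median_ms": statistics.median(ordered),
        "max_ms": ordered[-1],
        "stdev_ms": stdev,
        # Coefficient of variation: spread relative to the mean.
        "cv": stdev / mean if mean > 0 else 0.0,
    }


def looks_sporadic(samples_ms, cv_threshold=0.5):
    """Flag a series whose spread is large relative to its mean.

    The 0.5 threshold is an arbitrary, hypothetical cutoff.
    """
    return summarize_latency(samples_ms)["cv"] > cv_threshold


# Hypothetical sample values, not measurements from EC2.
steady = [20.0, 21.0, 19.5, 20.5, 20.0]
spiky = [20.0, 250.0, 19.0, 480.0, 22.0]

print(looks_sporadic(steady))
print(looks_sporadic(spiky))
```

A relative measure is used here because an absolute threshold would depend on the distance between the two endpoints; any real cutoff would need to be tuned against a known-good baseline.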
EC2 customers experiencing networking issues have also posted about them on the AWS discussion forums:
We have an instance which started to become EXTREMELY unresponsive at 9:15 AM CST today. You could sometimes log into it, sometimes not. While the situation did not resolve itself, another instance was started (assuming there was a hardware problem on that instance) which has the same issue. I'm thinking there may be a network issue.
I've been able to log in once or twice, and once everything was normal for a bit and then it became unresponsive again. Any clue?
Instance IDs are i-c4921fad and i-a0e3d7c8. I am seeing the same network issues when attempting to connect to our machines from machines in other EC2 zones.
Alan reports that during an emergency he tried to cope by rapidly deploying new instances, but it didn't work for him:
In one particular "fire fighting mode", we spent an hour literally spinning up new instances and terminating them until we found ourselves on a node that actually responded to our network traffic.
In virtualized environments, and specifically in the case of "noisy neighbors", where a neighboring instance on the same node is computationally heavy, this does not seem to be good practice, since there is a "tendency for EC2 to assign fresh instances to the same small set of machines" [PDF].