Cloud Foundry Experiences Storage Failure
Although it was nothing like the outage experienced by Amazon customers, VMware’s Cloud Foundry yesterday experienced a widespread failure of its storage infrastructure that left some users wondering why they couldn’t log into their control panels and issue vmc commands. The failure, caused by settings in the cloud controller, prevented users from starting and stopping existing application instances, creating new instances or otherwise interacting with the system through either the vmc command-line tool or the Eclipse IDE. Currently running instances were apparently unaffected.
Launched earlier this month, VMware’s Cloud Foundry combines a hosted Platform as a Service (cloudfoundry.com) running on VMware’s vSphere with an open-source development stack (cloudfoundry.org). Cloud Foundry supports the Spring framework for Java developers, Rails, Sinatra and Grails. The project is hosted and supported by VMware. The complete hosted cloudfoundry.com is currently in beta and is available to users free of charge until the beta period ends.
Unlike Amazon, VMware began issuing hourly updates on the Cloud Foundry support site and posting messages to Twitter to keep its users apprised of the situation. It started yesterday morning when one Twitter user sent out a message: “Anyone else seeing a 404 error with #cloudfoundry vmc commands? Just started a few minutes ago.” Shortly thereafter, Cloud Foundry began tweeting: “We are experiencing an issue right now. Applications should be running fine but connectivity is intermittent.” Within a couple of hours the Cloud Foundry support site posted the following message:
NOTICE: We are continuing to work on resolving a storage issue with the CloudFoundry.com service. The issue affects your ability to log in and manage your applications but should not impact currently running instances. We will continue to provide updates every hour as we work to resolve the issue.
Subsequent updates further identified the problem and indicated that the system would be back up by 12:30 PDT. The response team missed that target, explaining that:
The failure in the storage infrastructure has been identified and corrected, but it is a slow process to safely bring the system back up to full operational status and to ensure and validate that there has been no loss of data.
Later in the day InfoQ spoke with Jerry Chen, Senior Director Applications Platform for VMware, who confirmed the nature of the failure and that users were unable to log into the vmc command-line tool. According to Chen:
We provided updates on an hourly basis to keep our users informed of the situation until it was resolved today just before 4pm PDT. We are happy that our updates and transparency was well received by our users. We will be posting more information on the storage issue in the near future.
Ultimately the explanation involved the cloud controller (https://github.com/cloudfoundry/vcap/tree/master/cloud_controller). According to a blog post by Ezra Zygmuntowicz, “The cloud controller is the main ‘brain’ of the system.” The cloud controller is an Async Rails3 app that exposes a REST interface, which is accessible through a command-line tool called vmc. There’s also an STS plugin that allows developers to access the interface from Eclipse. Sometime yesterday it was discovered that the cloud controller had been set to read-only mode, leaving users unable to log in, start or stop existing applications, create new applications, or otherwise interact with the system through either the vmc command-line tool or the Eclipse IDE. Cloud Foundry also explained that:
Existing applications are not affected by this situation, BUT if these applications crash, the self-healing properties of Cloud Foundry are impacted. The health manager component https://github.com/cloudfoundry/vcap/tree/master/health_manager will not be able to take corrective action.
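To make the effect of a read-only controller concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: the names ToyController, ReadOnlyError and heal are hypothetical and are not taken from the Cloud Foundry code base, and the model ignores login entirely. It shows only why running applications keep running while both user commands and automated restarts of crashed applications are rejected.

```python
class ReadOnlyError(Exception):
    """Raised when a state-changing request reaches a read-only controller."""


class ToyController:
    """Hypothetical stand-in for a controller; NOT Cloud Foundry's actual code."""

    def __init__(self):
        self.read_only = False
        self.apps = {}  # app name -> "running" | "stopped" | "crashed"

    def _write(self, name, state):
        # Every state change funnels through here, so one flag blocks them all.
        if self.read_only:
            raise ReadOnlyError(f"cannot set {name!r} to {state!r}: read-only")
        self.apps[name] = state

    def start(self, name):
        self._write(name, "running")

    def stop(self, name):
        self._write(name, "stopped")

    def status(self, name):
        # Reads are still allowed in read-only mode.
        return self.apps[name]


def heal(controller):
    """Toy health check: try to restart crashed apps.

    Returns the names of crashed apps that could not be restarted.
    """
    unhealed = []
    for name, state in list(controller.apps.items()):
        if state == "crashed":
            try:
                controller.start(name)
            except ReadOnlyError:
                unhealed.append(name)
    return unhealed


controller = ToyController()
controller.start("blog")
controller.start("api")

controller.read_only = True          # the assumed failure condition
controller.apps["api"] = "crashed"   # a crash happens outside the controller

print(controller.status("blog"))     # the untouched app is still running
print(heal(controller))              # crashed apps that could not be restarted
```

In this sketch the "blog" application is unaffected, while the crashed "api" application stays down until the read-only flag is cleared, which mirrors the behavior described in the notice above.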
Clearly the beta project has some bugs to work through. Yesterday’s failure was followed today by the cloudfoundry.com and cloudfoundry.org sites going down for over an hour. That outage coincided with users complaining that all apps were being redirected to a single site. The company initially reported that:
We are currently experiencing an outage in our datacenter. We are working as quickly as we can to resolve this issue. We apologize for the inconvenience.
Shortly thereafter the support site reposted, indicating that Cloud Foundry was currently under maintenance and was expanding capacity “due to high demand.” Ultimately Cloud Foundry is still in beta, and it’s likely that those hosting the project have gained critical insights over the past two days. Nevertheless, there was little users could do but watch and wait, hoping that their data remained intact. For those looking for additional peace of mind, there are several excellent articles, including “Today’s EC2/EBS Outage: Lessons Learned.” As recent events teach us, even a short period of downtime can affect organizations in dramatic ways, and at least for now every IT organization needs to plan for failure.