UniSuper, an Australian superannuation fund manager using Google Cloud under an Infrastructure-as-a-Service (IaaS) contract, found it had no disaster recovery (DR) recourse within the platform when its entire infrastructure subscription was deleted.
UniSuper had previously migrated its VMware-based infrastructure from two data centers to Google Cloud using Google Cloud VMware Engine. As part of its private cloud contract, UniSuper's services and data were duplicated across two Google Cloud regions. However, that regional separation proved to be only nominal: an internal Google error wiped the copies in both regions, and there was no external disaster recovery facility in place.
This led to an outage that affected over 620,000 UniSuper fund members, leaving them unable to access their superannuation accounts for over a week.
Although UniSuper had backup systems in place, the deletion affected both geographies, wiping the data held within Google Cloud. Fortunately, UniSuper kept additional backups with another provider, which minimized data loss and sped up restoration.
Google and UniSuper said in a joint statement:
This is an isolated, ‘one-of-a-kind occurrence’ that has never before occurred with any of Google Cloud’s clients globally. This should not have happened. Google Cloud has identified the events that led to this disruption and taken measures to ensure this does not happen again.
Yet not everyone is convinced that it was a one-of-a-kind issue; Daniel Compton, a software developer focusing on Clojure and large-scale systems, dedicated a blog post to the incident and concluded:
Given how little detail was communicated, it is difficult to make a conclusive statement about what happened, though I personally suspect UniSuper operator error was a large factor. Hopefully, APRA, Australia’s superannuation regulator, will investigate further and release a public report with more details.
And a respondent on a Hacker News thread stated:
I really would like to hear the actual story here since it is basically impossible. It actually was "Google let customer data be completely lost in ~hours/days." This is compounded by the bizarre announcements - UniSuper putting up TK quotes on their website, which Google doesn't publish and also doesn't dispute.
However, Miles Ward, CTO at SADA, explained the incident in a thread of tweets, concluding:
Punchline: this failure mode cannot affect other @googlecloud users.
Everyone on that one service is no longer exposed to this risk, and all other types of Google Cloud services (the vast vast majority) were never exposed.
Furthermore, given the additional backups, a respondent on the same Hacker News thread commented:
Interestingly, the Australian financial services regulator (APRA) requires companies to have a multi-cloud plan for each of their applications. For example, a 'company critical' application needs to be capable of migrating to a secondary cloud service within four weeks.
I'm not sure how common this regulation is across industries in Australia or whether it's common in other countries as well.
And a respondent on a Reddit thread wrote:
Customer support just isn't in Google's DNA. While this could have happened on any provider, this happens far more often on Google.
This story is a classic reminder of the rule of 1: 1 is 0, and 2 is 1. Thank goodness they could recover from a different provider.
Lastly, on the 24th of May, Google released a blog post detailing the course of events.