Facilitating the spread of knowledge and innovation in professional software development



Choose your language

InfoQ Homepage News Google Cloud Incident Root-Cause Analysis and Remediation

Google Cloud Incident Root-Cause Analysis and Remediation


Google disclosed its root-cause analysis of an incident affecting a few of its Cloud services that increased error rates between 33% and 87% for about 32 minutes, along with the steps they will take to improve the platform performance and availability.

The incident affected customers of a number of Google services relying on Google HTTP(S) Load Balancer, including Google Kubernetes Engine, Google App Engine, Google Cloud Functions, Stackdriver’s web UI, Dialogflow and the Cloud Support Portal/API. Customers started to randomly receive 502 error codes or connection resets for about 32 minutes, the time it took Google engineers to deploy a fix from the moment Google monitoring system alerted them to the increased failure rates.

Google HTTP(S) Load Balancing aims to balance HTTP and HTTPS traffic across multiple backend instances and multiple regions. One of the benefits it provides is enabling the use of a single global IP address for a Cloud app, which greatly simplifies DNS setup. To achieve maximum performance during connection setup, the service utilizes a first layer of Google Front Ends (GFE) as close as possible to clients anywhere in the world, which receive requests and relays them to a second tier of GFEs. The second tier constitutes a global network of servers that actually sends the requests to the corresponding backends, regardless of the regions they are located in.

The root cause of the incident turned out to be an undetected bug in a new feature added to improve security and performance of the second GFE tier. The bug was triggered by a configuration change in the production environment enabling that feature and causing GFEs to randomly restart, with the consequent loss of service capacity while the servers where restarting.

Luckily, the feature containing the bug had not yet been put in service, so it was relatively easy for Google engineers to deploy a fix by reverting the configuration change, which led the service to return to its normal behaviour and failure rates after a few minutes of cache warm-up.

On the prevention front, besides improving the GFE test stack and adding more safeguards to prevent disabled features from being mistakenly put in service, Google Cloud team efforts will aim to improve isolation between different shards of GFE pools to reduce the scope of failures, and create a consolidated dashboard of all configuration changes for GFE pools, to make it easier for engineers to identify problematic changes to the system.

Read the full detail in Google’s official statement.

We need your feedback

How might we improve InfoQ for you

Thank you for being an InfoQ reader.

Each year, we seek feedback from our readers to help us improve InfoQ. Would you mind spending 2 minutes to share your feedback in our short survey? Your feedback will directly help us continually evolve how we support you.

Take the Survey

Rate this Article


Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Community comments

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p


Is your profile up-to-date? Please take a moment to review and update.

Note: If updating/changing your email, a validation request will be sent

Company name:
Company role:
Company size:
You will be sent an email to validate the new email address. This pop-up will close itself in a few moments.