Google Cloud Suspends Railway's Production Account, Causing Eight-Hour Platform-Wide Outage

Google Cloud's automated systems suspended Railway's production account on May 19, triggering an eight-hour platform-wide outage that took down the dashboard, API, all deployments, and all databases for the platform's 3 million users.

The suspension was not triggered by anything Railway did. Google applied it as part of a broader automated action affecting multiple accounts, with no advance notice to individual customers. Chandrika Khanduri and Cody De Arkland from Railway's engineering team write:

"We take full responsibility for the architectural decisions that allowed a single upstream provider action to cascade into a platform-wide outage."

The cascade mechanism is the architecturally interesting part. Railway runs a mesh network across Google Cloud, AWS, and its own bare-metal infrastructure (Railway Metal). When GCP suspended the account, workloads on AWS and Metal initially kept running because Railway's edge proxies maintain a cache of routing tables from the network control plane. But that control plane was hosted in Google Cloud. Once the cached routes expired, the edge could no longer resolve routes to active instances, and workloads across all regions, including Metal and AWS, began returning 404 errors. The workloads themselves were still running. They were just unreachable.
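The delayed failure described above can be illustrated with a minimal sketch of a time-limited route cache. This is not Railway's implementation: the `EdgeProxy` class, the 300-second TTL, the hostname, and the target address are all hypothetical, chosen only to show why workloads stayed reachable for a while and then started returning 404s once nothing refreshed the cache.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedRoute:
    target: str        # address of the instance serving the hostname
    expires_at: float  # time (in seconds) after which the entry is stale


class EdgeProxy:
    """Hypothetical edge proxy that resolves hostnames from a TTL cache."""

    def __init__(self, ttl_seconds: float):
        self.ttl_seconds = ttl_seconds
        self.cache: dict[str, CachedRoute] = {}

    def refresh(self, routes: dict[str, str], now: float) -> None:
        """Store routes received from the control plane with a fresh expiry."""
        for host, target in routes.items():
            self.cache[host] = CachedRoute(target, now + self.ttl_seconds)

    def resolve(self, host: str, now: float) -> tuple[int, Optional[str]]:
        """Return (status, target). Missing or expired entries yield 404."""
        entry = self.cache.get(host)
        if entry is None or now >= entry.expires_at:
            return 404, None
        return 200, entry.target


proxy = EdgeProxy(ttl_seconds=300)  # assumed TTL, for illustration only
proxy.refresh({"app.example.com": "10.0.0.7:8080"}, now=0)

# The control plane becomes unreachable after t=0, so refresh() is never called again.
print(proxy.resolve("app.example.com", now=120))  # entry still valid, served from cache
print(proxy.resolve("app.example.com", now=301))  # entry expired, workload unreachable
```

In this sketch the instance at the target address is never touched. Only the proxy's knowledge of it lapses, which matches the article's point that the workloads were running but unreachable.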

Recovery was not instant either. Restoring account access did not restore services. Persistent disks, compute instances, and networking all required separate recovery. Disks were ready by 23:54 UTC, but core networking didn't restore until 01:30 UTC the following day. Then a backlog of queued deployments had to be drained carefully to avoid overwhelming the build systems. In parallel, GitHub began rate-limiting Railway's OAuth and webhook integrations due to the burst of retried requests, temporarily blocking user logins and builds.
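Two generic techniques are commonly used for the problems this paragraph describes: draining a backlog in bounded batches, and spacing out retries with exponential backoff and jitter so that clients do not retry in lockstep. The sketch below shows both under stated assumptions. The function names, the one-second base delay, and the 60-second cap are hypothetical, and nothing here is taken from Railway's or GitHub's actual systems.

```python
import random
from typing import Iterator, List, Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Exponential backoff with full jitter.

    Returns a delay drawn uniformly from [0, min(cap, base * 2**attempt)].
    """
    source = rng if rng is not None else random
    upper = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, upper)


def drain_in_batches(backlog: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield the backlog in fixed-size batches so downstream load stays bounded."""
    for start in range(0, len(backlog), batch_size):
        yield backlog[start:start + batch_size]


# Hypothetical queued deployments, released three at a time.
queued = [f"deploy-{i}" for i in range(7)]
for batch in drain_in_batches(queued, batch_size=3):
    print(batch)

# Upper bound on the retry delay grows with each attempt until it hits the cap.
for attempt in range(8):
    print(attempt, min(60.0, 1.0 * 2 ** attempt))
```

The jitter matters as much as the exponential growth: without it, every client that failed at the same moment retries at the same moment, which recreates the burst that caused the rate limiting.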

Railway's founder Jake Cooper told Cybernews he was "gobsmacked" by the suspension and announced that Railway is demoting GCP to backup-only status. The incident report confirms this: Railway is removing Google Cloud from the data plane's hot path, extending high-availability database shards across AWS and Metal, and redesigning the mesh so that if any interconnect fails, routing tables can still be populated from surviving paths.
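The idea of populating routing tables from surviving paths can be sketched as a simple fallback across independent route sources. This is an illustration of the general pattern, not Railway's design: `fetch_routes`, the three source functions, and the addresses are hypothetical, and a real mesh would also need to reconcile tables that disagree.

```python
from typing import Callable, Dict, List, Optional

RouteTable = Dict[str, str]
RouteSource = Callable[[], RouteTable]


def fetch_routes(sources: List[RouteSource]) -> Optional[RouteTable]:
    """Return the first route table any source can provide, or None if all fail."""
    for source in sources:
        try:
            return source()
        except ConnectionError:
            continue  # this path is down, so try the next surviving one
    return None


def gcp_source() -> RouteTable:
    # Simulates a provider whose account has been suspended.
    raise ConnectionError("provider unreachable")


def aws_source() -> RouteTable:
    return {"app.example.com": "10.1.0.4:8080"}


def metal_source() -> RouteTable:
    return {"app.example.com": "10.2.0.9:8080"}


# With one provider down, the table is still populated from another path.
print(fetch_routes([gcp_source, aws_source, metal_source]))
```

The property that matters is that no single entry in `sources` is required: removing any one of them still leaves a path that can answer.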

On Hacker News, the incident drew more than 150 comments across multiple submissions. One commenter pointed to the unresolved question:

"Put all the timestamps you want in the post mortem about what you observed, but you haven't addressed the root cause. The 'this doesn't make sense' part of the story likely has a real explanation that nobody wants to reveal yet."

Google has not issued a public statement explaining why the account was suspended. Railway's report notes only that it was "incorrectly" flagged "as part of an automated action" affecting many accounts.

Another commenter captured the broader trust implication:

"Building on someone else's platform is always gonna be a risky move, and building a platform on top of someone else's platform is even riskier."

A Railway customer shared their response to the outage directly:

"Unfortunately we had to make emergency migration off to Azure yesterday due to this. As much as we loved the simplicity they provided us, there's just been too many mishaps and shortcomings for us to continue running a B2B enterprise app on their infrastructure."

The incident was not isolated. Northflank reported that developers experienced worker crashes, partial outages, and build delays in the days before the full platform went down, with some noting this was their second or third major outage in a few months. Railway's own February 2026 postmortem acknowledged a pattern of "tightly coupled systems with a large blast radius causing single failures to cascade into broader outages." A specific pain point during the May outage was the inability to access database backups: with the dashboard and API both offline, users had no way to retrieve their own data during the incident window.

The architectural lesson extends beyond Railway. Any platform built on a single hyperscaler account, whether GCP, AWS, or Azure, carries the risk that an automated account-level action can take down everything simultaneously. The traditional multi-AZ and multi-region patterns protect against infrastructure failures within a provider but offer no protection against account-level suspension. Railway's planned remediation, making the mesh truly provider-independent with no single cloud on the hot path, is the pattern that addresses this class of failure.

Railway's status page tracks the ongoing resolution. The company says the incident report "reflects what we know at time of publication and may be updated pending Google Cloud's internal review."

About the Author

Steef-Jan Wiggers
