Airbnb has revealed how it significantly improved its observability practices by rethinking how alerts are developed and validated, concluding that what appeared to be a "culture problem" was actually a tooling and workflow gap. By redesigning its Observability as Code (OaC) approach, the company reduced alert development cycles from weeks to minutes, cut alert noise dramatically, and enabled the migration of hundreds of thousands of alerts to a new platform.
At the core of the issue was a simple insight: engineers were not creating poor alerts due to a lack of discipline, but because they could not see how alerts would behave before deploying them. With around 300,000 alerts supporting thousands of services, Airbnb relied on OaC to bring structure and consistency. However, while code reviews validated syntax and logic, they failed to capture real-world behavior.
This meant engineers could not easily determine whether alerts would generate noise, miss incidents, or unnecessarily wake on-call teams. As a result, production became the testing ground, forcing teams to choose between iterating on alerts at the risk of instability and tolerating poor signal quality. Over time, this led to alert fatigue, reduced trust, and slower iteration.
The root cause was a lack of fast feedback loops. Without the ability to validate alerts against real data before deployment, teams relied on slow, manual processes, often deploying changes, waiting days or weeks, and then iterating. This approach was impractical at scale, especially when managing thousands of services, and existing tools offered limited help because they relied on downsampled data or manual validation.
Airbnb addressed this by rebuilding its observability platform to make alert behavior visible before deployment. The new approach introduced fast feedback loops, allowing engineers to preview alert behavior using real-world data prior to merging changes. Capabilities such as local diffs, pre-deployment validation, and large-scale backtesting enabled teams to test alerts in seconds rather than weeks.
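The core idea behind backtesting is simple: replay an alert's condition over a historical metric series and count how often it would have fired. The sketch below illustrates that idea in Python; the rule structure, names, and thresholds are hypothetical, not Airbnb's actual API.

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    """A simplified alert: fire when the metric exceeds `threshold`
    for at least `for_points` consecutive samples."""
    name: str
    threshold: float
    for_points: int

def backtest(rule: AlertRule, series: list[float]) -> int:
    """Replay the rule over historical samples and count how many
    distinct times it would have fired."""
    firings = 0
    streak = 0
    firing = False
    for value in series:
        if value > rule.threshold:
            streak += 1
            if streak >= rule.for_points and not firing:
                firings += 1
                firing = True
        else:
            streak = 0
            firing = False
    return firings

# Example: a latency series (ms) containing two sustained spikes.
latency = [80, 85, 250, 260, 270, 90, 85, 300, 310, 320, 330, 88]
rule = AlertRule("p99-latency-high", threshold=200, for_points=3)
print(backtest(rule, latency))  # 2 firings over this window
```

Running such a replay over weeks of real data before merging is what collapses the feedback loop from "deploy and wait" to seconds: an engineer can see immediately whether a proposed threshold would have fired twice or two hundred times.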
By shifting validation earlier in the lifecycle, Airbnb moved alert testing out of production and into development workflows, aligning observability with modern software engineering practices. The results were significant: alert development cycles dropped to minutes, and alert noise was reduced by up to 90 percent, restoring trust in the system.
These improvements were critical to enabling Airbnb to complete a large-scale migration of approximately 300,000 alerts to Prometheus, which would have been extremely difficult under the previous approach.
The changes also support Airbnb's broader vision of "zero-touch" observability, in which teams automatically inherit high-quality alerts, dashboards, and service-level objectives when adopting shared platforms. This model allows platform teams to encode best practices into reusable templates, though it depends on having confidence that those templates behave correctly at scale.
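The "zero-touch" model can be pictured as a platform team encoding an alert once as a parameterized template, which service teams then inherit by filling in service-specific values. The following is a minimal illustrative sketch; the template format and metric names are assumptions, not Airbnb's actual configuration.

```python
from string import Template

# A platform-owned template encoding a best-practice error-rate alert.
# The expression syntax is Prometheus-like, but purely illustrative.
ERROR_RATE_ALERT = Template(
    'alert: ${service}_HighErrorRate\n'
    'expr: rate(http_errors_total{service="${service}"}[5m]) > ${threshold}\n'
    'for: ${duration}'
)

def render_standard_alert(service: str, threshold: float = 0.05,
                          duration: str = "10m") -> str:
    """Instantiate the shared template for one service, so teams
    inherit a vetted alert without writing one from scratch."""
    return ERROR_RATE_ALERT.substitute(
        service=service, threshold=threshold, duration=duration
    )

print(render_standard_alert("checkout"))
```

The design point is that the template, not each hand-written alert, is the unit of review, which is why confidence in its behavior at scale matters: a flaw in one template propagates to every service that inherits it.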
Airbnb's experience highlights a broader lesson for engineering organizations: problems that appear cultural are often systemic. In this case, alert fatigue and inconsistent monitoring were driven not by poor practices but by gaps in the development workflow.
By improving tooling and feedback loops, Airbnb not only enhanced technical outcomes but also changed engineering behavior. Developers became more willing to iterate, platform teams could safely evolve standards, and overall observability quality improved.
Ultimately, the story reframes observability as a developer experience challenge. Just as CI/CD pipelines provide rapid feedback for code, observability systems must do the same for monitoring. Airbnb's approach shows that when engineers can validate changes early, they move faster, make better decisions, and build more reliable systems, proving that at scale, fixing the system matters more than fixing the people.
Other large-scale engineering organizations have tackled similar alerting challenges by focusing on shifting validation left and improving signal quality through automation and standardization. At Google, for example, the adoption of Site Reliability Engineering (SRE) practices led to a strong emphasis on Service Level Objectives (SLOs) and error budgets as the foundation for alerting. Rather than creating alerts for every possible failure condition, teams define alerts based on user-impacting signals tied to SLO breaches. This approach reduces noise by ensuring alerts are meaningful and actionable, while also providing a clearer framework for validating alert effectiveness before they impact on-call engineers.
Similarly, Netflix has approached the problem through automation and real-time observability tooling, investing heavily in platforms that allow engineers to simulate and test system behavior under failure conditions. By combining chaos engineering practices with observability, teams can validate whether alerts trigger appropriately during controlled failures, effectively testing alert behavior before real incidents occur.
Other organizations using platforms like Datadog or Prometheus have also introduced features such as alert previews, anomaly detection, and historical backtesting to improve confidence in alert configurations. Across these approaches, the common theme is clear: improving alert quality is less about enforcing stricter processes and more about giving engineers better visibility, faster feedback, and systems that prioritize meaningful signals over volume.