Failure Content on InfoQ

  • A Distributed System is Knowable: an Impossible Thing for Developers

    Failure in distributed systems is normal. A distributed system can provide only two of three guarantees: consistency, availability, and partition tolerance. According to Kevlin Henney, this limits how much you can know about how a distributed system will behave. He gave a keynote about Six Impossible Things at QCon London 2022 and at QCon Plus May 10-20, 2022.

  • Dealing with Thundering Herd at Braintree

    Braintree engineer Anthony Ross explained in a recent article how introducing some random jitter into retry intervals for failed tasks solved a thundering herd issue which was impacting the efficiency of their payment dispute management API.

  • Microsoft Announces Azure Chaos Studio in Public Preview

    At the recent Ignite conference, Microsoft announced the public preview of Azure Chaos Studio, a fully-managed experimentation service to help customers track, measure, and mitigate faults with controlled chaos engineering to improve the resilience of their cloud applications.

  • How a Safe-to-Fail Approach Can Enable Psychological Safety in Teams

    Companies can establish a culture of psychological safety among their employees, a culture in which failing is not frowned upon but rather is accepted as something that can happen to anyone. Safe-to-fail should be part of the corporate culture. A shift in the way we envision success can lead to a better understanding of where failure lies and provide courage to overcome our fears.

  • AWS Announces Chaos Engineering as a Service Offering

    AWS has announced the upcoming release of their chaos engineering as a service offering. The Fault Injection Service (FIS) will provide fully-managed chaos experiments across a number of AWS services. The service includes pre-built templates that generate disruptions mimicking common real-world events. It can be integrated into CI pipelines via API.

  • New LiveRecorder for Java Enables Software Failure Replay

    LiveRecorder for Java is a newly released application for software failure replay. It enables developers to record application failures and then replay them in IntelliJ to find the cause of the failure. It helps reduce debugging time, especially for intermittent failures.

  • Cloudflare’s 27 Minutes Outage Explained

    Cloudflare recently suffered a partial outage that lasted 27 minutes. The outage caused a 50% drop in traffic across the network.

  • Failure Modes and Building Resilient Systems: Adrian Cockcroft at QCon SF

    Adrian Cockcroft recently shared his thoughts on how to produce resilient systems that operate successfully in spite of failures. At the recent QCon San Francisco event, he also shared what he considers to be good cloud resilience patterns for building with a continuous resilience mindset.

  • How Did Things Go Right? Learning More from Incidents at Netflix: Ryan Kitchens at QCon New York

    At QCon New York, Ryan Kitchens presented “How Did Things Go Right? Learning More from Incidents”. Key takeaways from the talk included: recovery is better than prevention; an incident occurs when there is a “perfect storm” of events -- there is no root cause; “stop reporting on the nines”, as user happiness is more important; and there is value in learning how things go right.

  • How to Grow Teams That Can Fail without Fear: QCon London Q&A

    Blameless failure starts with building a culture where failure is acknowledged, shared, investigated, remedied, and prevented, said Emma Button, a DevOps and cloud consultant, at QCon London 2019. Visualising the health and state of your system with CI/CD practices can increase trust and ownership and invite people to help out when things fail.

  • What Resiliency Means at Sportradar

    At this year's QCon London conference, Pablo Jensen, CTO at Sportradar, talked about the practices and procedures in place at Sportradar to ensure its systems meet expected resiliency levels. Jensen explained that reliability is influenced not only by technical concerns but also by organizational structure and governance and by client support, and that it requires ongoing effort to continuously improve.

  • Chaos Engineering at Twilio

    The Twilio team describes their foray into Chaos Engineering where they use Gremlin to inject failures into their homegrown queuing system shards to test for automated recovery.

  • How to Measure Continuous Delivery

    Stability and throughput are the things that you can measure when adopting continuous delivery practices. These metrics can help you reduce uncertainty, make better decisions about which practices to amplify or dampen, and steer your continuous delivery adoption process in the right direction.

  • Public Preview of Azure IaaS Disaster Recovery Announced

    In a recent announcement, Microsoft released details about its public preview for Infrastructure-as-a-Service (IaaS) disaster recovery using Azure Site Recovery (ASR). Using the ASR service, organizations can protect IaaS workloads in one Azure region and have them replicated to a different Azure region within a geographical cluster.

  • A Human Error Took Down AWS S3 US-EAST-1

    A mistake took down more S3 servers than it should have, including two subsystems essential to S3 operation. This caused an S3 failure that affected the S3 service and other services depending on it. Normal functioning was restored in about four hours.
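
The Braintree item above mentions adding random jitter to retry intervals to break up a thundering herd. As a rough illustration only, the sketch below uses exponential backoff with "full jitter", which is one common way to do this and is an assumption here, not necessarily what Braintree implemented. The names `retry_delay` and `delays`, and the `base` and `cap` parameters, are hypothetical.

```python
import random


def retry_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a jittered delay in seconds for a zero-based retry attempt.

    The upper bound grows exponentially (base * 2**attempt) up to `cap`,
    and the actual delay is drawn uniformly from [0, bound]. Spreading
    retries over that window keeps failed tasks from all retrying at
    the same instant.
    """
    bound = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, bound)


def delays(attempts, base=1.0, cap=60.0, seed=None):
    """Return the jittered delays for the first `attempts` retries."""
    rng = random.Random(seed)  # a seed makes the sequence reproducible
    return [retry_delay(a, base, cap, rng) for a in range(attempts)]


if __name__ == "__main__":
    for attempt, delay in enumerate(delays(5)):
        print(f"retry {attempt}: wait {delay:.2f}s")
```

Without jitter, every task that failed at the same moment would retry after exactly the same interval and hit the API together again; drawing each delay at random from a window trades a predictable retry time for a smoother load.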