BT

InfoQ Homepage Resilience Content on InfoQ

  • Chaos Engineering at Twilio

    The Twilio team describes their foray into Chaos Engineering where they use Gremlin to inject failures into their homegrown queuing system shards to test for automated recovery.

  • Werner Vogels on “21st Century [Cloud] Architectures”: Availability, Reliability and Resilience

    At the AWS re:invent 2017 conference, Werner Vogels, CTO of Amazon, presented a keynote that discussed core concepts required for building “21st Century Architectures” on the cloud. Highlights of the talk included discussion of the emerging practices of evolutionary and “cloud native” architectures, the role of security becoming everyone’s responsibility, and the benefits of chaos engineering.

  • Serverless Challenges in Hybrid Environments

    Sam Newman, independent consultant and author of the book "Building Microservices", talked at the Velocity conference in London on the challenges faced when hybrid systems rely on both serverless architectures and traditional infrastructure. In particular, Newman discussed how serverless changes our notion of resiliency and how the two paradigms clash at times of high load in the system.

  • Expedia's Journey toward Site Resiliency: Embracing Chaos Testing in Dev and Production at QCon SF

    At QCon SF, Sahar Samiei and Willie Wheeler presented “Expedia’s Journey Toward Site Resiliency”, and discussed the building of a community of practice around resilience testing within Expedia. The results have generally been positive: Netflix’s Chaos Monkey has been running daily in production since May 15th; and resilience tests have been added to four Tier 1 service pipelines.

  • Adrian Cockcroft Discusses Chaos Architecture: "Four Layers, Two Teams, and an Attitude"

    At QCon San Francisco, Adrian Cockcroft presented “Chaos Architecture”, and discussed the evolution of cloud native architecture, and how chaos engineering can be applied to produce better and safer systems. Effective chaos architecture and engineering was presented as consisting of “four layers, two teams, and an attitude”.

  • Designing Services for Resilience: Nora Jones Discusses Netflix Chaos Engineering at QCon SF

    At QCon SF Nora Jones presented “Designing Services for Resilience Experiments: Lessons from Netflix”. Key takeaways from the talk included: the customer experience is a priority; designing for resiliency testability is a shared responsibility; configuration changes can cause outages; and engineers should have have explicit monitoring in place to detect antipatterns in configuration changes.

  • Choose Your Own Adventure: Chaos Engineering at QCon New York 2017

    Nora Jones, senior chaos engineer at Netflix, talked about chaos engineering at QCon New York 2017. She presents different stages of chaos engineering adoption and gives stories from her previous experiences at Jet and Netflix.

  • Netflix Engineer Lorin Hochstein on Chaos Monkey 2.0

    Netflix made waves when it initially announced Chaos Monkey, a tool that would terminate normally healthy VM instances in production. The goal was to embrace failure and thereby increase resiliency. Rags Srinivas caught up with Lorin Hochstein at Netflix regarding the recent upgrade to Chaos Monkey.

  • Chaos Monkey 2.0 Runs via Spinnaker

    Netflix has recently made available the source code of the Chaos Monkey 2.0. The latest iteration of the resilience tool is fully integrated with Spinnaker and event tracking systems, but the SSH support has been removed.

  • DevOps Days Kiel Day 2

    Round up of the talks at DevOps Days Kiel's second day.

  • Google Kick-Starts Git Ketch: A Fault-Tolerant Git Management System

    Although development has only started, Google has announced their first commits of Git Ketch, a multi-master Git management system that replicates information across multiple Git servers for resilience and scalability. The changes are based on JGit, a Java-based Git server, although other Git servers may be part of the multi-master cluster.

  • Microsoft Makes Available Their Platform for Building Microservices

    Microsoft has announced and made available the preview of Azure Service Fabric (ASF), a cloud platform including a runtime and lifecycle management tools for creating, deploying, running and managing microservices. ASF microservices can be deployed on Azure or on-premises on Windows Server private or hosted clouds. Support for Linux is to come in the future.

  • Anti-patterns for Handling Failure

    Oliver Hankeln shares the anti-patterns he found for handling failure in organizations: hiding mistakes, engaging in blame game, the arc of escalation and cowardice. He then suggests corrective actions for each of them.

  • How Netflix Handled the Reboot of 218 Cassandra Nodes

    Amazon performed a major maintenance update at the end of September in order to patch a security vulnerability in a Xen hypervisor affecting about 10% of their global fleet of cloud servers. This update involved the rebooting of those servers, with consequences for AWS users and the services they provide, including one of their largest clients, Netflix.

  • TypeSafe's Kevin Webber: Actor-based Concurrency for Reactive Systems

    In a recent article on Medium, TypeSafe's Kevin Webber argues that reactive programming "isn’t just another trend but rather the paradigm for modern software developers to learn" since it helps them to build systems that are responsive, resilient, and scalable. He also suggests that actor-based concurrency is the most convenient foundations for a reactive system.

BT

Is your profile up-to-date? Please take a moment to review and update.

Note: If updating/changing your email, a validation request will be sent

Company name:
Company role:
Company size:
Country/Zone:
State/Province/Region:
You will be sent an email to validate the new email address. This pop-up will close itself in a few moments.