InfoQ Homepage Resilience Content on InfoQ
-
From Darwin to DevOps: John Willis and Gene Kim Talk about Life after The Phoenix Project
IT Revolution recently published an audiobook with nearly eight hours of conversation between Gene Kim and John Willis; Beyond the Phoenix Project – the Origins and Evolution of DevOps.
-
What Resiliency Means at Sportradar
Pablo Jensen, CTO at Sportradar, talked about practices and procedures in place at Sportradar to ensure their systems meet expected resiliency levels, at this year's QCon London conference. Jensen mentioned how reliability is influenced not only by technical concerns but also organizational structure and governance, client support, and requires on-going effort to continuously improve.
-
Serverless Challenges in Hybrid Environments
Sam Newman, independent consultant and author of the book "Building Microservices", talked at the Velocity conference in London on the challenges faced when hybrid systems rely on both serverless architectures and traditional infrastructure. In particular, Newman discussed how serverless changes our notion of resiliency and how the two paradigms clash at times of high load in the system.
-
Chaos Monkey 2.0 Runs via Spinnaker
Netflix has recently made available the source code of the Chaos Monkey 2.0. The latest iteration of the resilience tool is fully integrated with Spinnaker and event tracking systems, but the SSH support has been removed.
-
-
Google Kick-Starts Git Ketch: A Fault-Tolerant Git Management System
Although development has only started, Google has announced their first commits of Git Ketch, a multi-master Git management system that replicates information across multiple Git servers for resilience and scalability. The changes are based on JGit, a Java-based Git server, although other Git servers may be part of the multi-master cluster.
-
Microsoft Makes Available Their Platform for Building Microservices
Microsoft has announced and made available the preview of Azure Service Fabric (ASF), a cloud platform including a runtime and lifecycle management tools for creating, deploying, running and managing microservices. ASF microservices can be deployed on Azure or on-premises on Windows Server private or hosted clouds. Support for Linux is to come in the future.
-
Anti-patterns for Handling Failure
Oliver Hankeln shares the anti-patterns he found for handling failure in organizations: hiding mistakes, engaging in blame game, the arc of escalation and cowardice. He then suggests corrective actions for each of them.
-
How Netflix Handled the Reboot of 218 Cassandra Nodes
Amazon performed a major maintenance update at the end of September in order to patch a security vulnerability in a Xen hypervisor affecting about 10% of their global fleet of cloud servers. This update involved the rebooting of those servers, with consequences for AWS users and the services they provide, including one of their largest clients, Netflix.
-
TypeSafe's Kevin Webber: Actor-based Concurrency for Reactive Systems
In a recent article on Medium, TypeSafe's Kevin Webber argues that reactive programming "isn’t just another trend but rather the paradigm for modern software developers to learn" since it helps them to build systems that are responsive, resilient, and scalable. He also suggests that actor-based concurrency is the most convenient foundations for a reactive system.
-
Refreshed AWS Trusted Advisor Offers Several Free Checks
Amazon Web Services (AWS) has recently integrated the AWS Trusted Advisor into the AWS Management Console and made four security and service limit checks available at no charge. Additional checks from the security, performance, fault tolerance and cost optimization categories remain part of their Business and Enterprise support tiers.
-
Testing Resiliency at PagerDuty Without a Simian Army
Doug Barth, from PagerDuty, talked at DevOps Days London about their approach to start resiliency testing their systems without dedicating a lot of automation effort upfront. The goal was to quickly start learning about failure points and openly discuss how to fix them with only one hour per week of effort.
-
Amazon Web Services Stability and the September 13th US East 1 Outage
Amazon Web Services (AWS) suffered another outage of its US East 1 region during the morning of Friday 13th September. A number of popular applications such as Heroku, Github and CMSWire were disrupted along with many other customers in Amazon’s largest, oldest and busiest location.