DevOps

Gremlin Releases Application Level Fault Injection (ALFI) Platform for Targeted Chaos Experiments

by Daniel Bryant on Oct 07, 2018

Gremlin Inc has released its second product offering in the “Failure-as-a-Service” domain: Application-Level Fault Injection (ALFI). Building on its initial platform, which helped engineers create and run chaos experiments at the infrastructure level, ALFI enables failure injection at the application level via a native language library.

Architecture & Design

How to Achieve a Resilient Architecture

by Jan Stenberg on Sep 13, 2018

To manage systems at scale, you must push your system almost to the breaking point yet still be able to recover, and you must embrace failures, Adrian Hornsby writes in two blog posts that share his experiences from more than a decade of working with large-scale systems and the patterns he has found useful.

Culture & Methods

Ben Gracewood on Learning from an Organisational Train Wreck

by Shane Hastie on Jul 16, 2018

At the recent JAFAC conference, Ben Gracewood told the story of how POS developer Vend transformed their development organisation following catastrophic disruption and losses. He explored what happened after they reduced headcount by over 30%, what they had in place that enabled them to survive, and what they did differently as a result of the changes.

DevOps

Learning to Bend But Not Break at Netflix: Haley Tucker Discusses Chaos Engineering at QCon NY

by Daniel Bryant on Jul 05, 2018

At QCon New York, Haley Tucker presented “UNBREAKABLE: Learning to Bend But Not Break at Netflix” and discussed her experience with chaos engineering while working across a number of roles at Netflix. Key takeaways included: use functional sharding for fault isolation; continually tune RPC calls; run chaos experiments with small iterations; and apply the “principles of chaos”.

DevOps

Chaos Engineering at LinkedIn: The “LinkedOut” Failure Injection Testing Framework

by Daniel Bryant on Jun 24, 2018

The LinkedIn Engineering team has recently discussed their “LinkedOut” failure injection testing framework. Hypotheses about service resilience can be formulated and failure triggers injected via the LinkedIn LiX A/B testing framework or via data in a cookie that is passed through the call stack using the Invocation Context (IC) framework. Failure scenarios include errors, delays and timeouts.

DevOps

From Darwin to DevOps: John Willis and Gene Kim Talk about Life after The Phoenix Project

by Helen Beal on May 23, 2018

IT Revolution recently published an audiobook with nearly eight hours of conversation between Gene Kim and John Willis: Beyond the Phoenix Project – the Origins and Evolution of DevOps.

DevOps

Increasing the Resilience of APIs with Chaos Engineering

by Daniel Bryant on May 20, 2018

The Gremlin team has described a simple chaos experiment as a method of validating that an organisation’s APIs are resilient. Using the principles of chaos engineering and techniques like running “game days” (a fire drill for IT systems and people) can provide value, as can the appropriate use of commercial and open source tooling emerging within this space.

DevOps

What Resiliency Means at Sportradar

by Manuel Pais on Apr 06, 2018

Pablo Jensen, CTO at Sportradar, talked at this year's QCon London conference about the practices and procedures in place at Sportradar to ensure their systems meet expected resiliency levels. Jensen mentioned that reliability is influenced not only by technical concerns but also by organizational structure and governance and by client support, and that it requires ongoing effort to continuously improve.

DevOps

Why the World Needs More Resilient Systems: Tammy Butow Discusses Chaos Engineering at QCon London

by Daniel Bryant on Mar 18, 2018

At QCon London, Tammy Butow explained why the world needs more resilient systems, and how this can be achieved with the practice of chaos engineering. Three primary prerequisites for chaos engineering were provided (high severity “SEV” incident management, monitoring, and measuring the impact), and a series of guidelines, tools and practices were presented.

Cloud

Bloomberg Releases Open Source “PowerfulSeal” Kubernetes-Specific Chaos Testing Tool

by Daniel Bryant on Jan 25, 2018

At the recent KubeCon North America conference, Bloomberg presented their new open source “PowerfulSeal” tool, which enables chaos testing within Kubernetes clusters via the termination of targeted pods and underlying node infrastructure.

DevOps

Chaos Engineering at Twilio

by Hrishikesh Barua on Dec 25, 2017

The Twilio team describes their foray into Chaos Engineering where they use Gremlin to inject failures into their homegrown queuing system shards to test for automated recovery.

Cloud

Werner Vogels on “21st Century [Cloud] Architectures”: Availability, Reliability and Resilience

by Daniel Bryant on Dec 03, 2017

At the AWS re:Invent 2017 conference, Werner Vogels, CTO of Amazon, presented a keynote that discussed core concepts required for building “21st Century Architectures” on the cloud. Highlights of the talk included discussion of the emerging practices of evolutionary and “cloud native” architectures, the role of security becoming everyone’s responsibility, and the benefits of chaos engineering.

DevOps

Serverless Challenges in Hybrid Environments

by Manuel Pais on Nov 30, 2017

Sam Newman, independent consultant and author of the book "Building Microservices", talked at the Velocity conference in London on the challenges faced when hybrid systems rely on both serverless architectures and traditional infrastructure. In particular, Newman discussed how serverless changes our notion of resiliency and how the two paradigms clash at times of high load in the system.

DevOps

Expedia's Journey toward Site Resiliency: Embracing Chaos Testing in Dev and Production at QCon SF

by Daniel Bryant on Nov 19, 2017

At QCon SF, Sahar Samiei and Willie Wheeler presented “Expedia’s Journey Toward Site Resiliency”, and discussed the building of a community of practice around resilience testing within Expedia. The results have generally been positive: Netflix’s Chaos Monkey has been running daily in production since May 15th, and resilience tests have been added to four Tier 1 service pipelines.

Architecture & Design

Adrian Cockcroft Discusses Chaos Architecture: "Four Layers, Two Teams, and an Attitude"

by Daniel Bryant on Nov 17, 2017

At QCon San Francisco, Adrian Cockcroft presented “Chaos Architecture”, and discussed the evolution of cloud native architecture and how chaos engineering can be applied to produce better and safer systems. Effective chaos architecture and engineering were presented as consisting of “four layers, two teams, and an attitude”.
