
Filibuster: Automated Fault Injection Tool to Improve DoorDash's Reliability

DoorDash recently revealed how it is using Filibuster, an automated fault injection tool, to identify resilience issues in microservice applications early and improve platform reliability.

Filibuster is the prototype implementation of service-level fault-injection testing. It combines static analysis, concolic-style execution, and a novel dynamic reduction algorithm to extend existing functional test suites so that they cover failure scenarios with minimal development effort.

DoorDash migrated from a monolith to a microservices architecture a couple of years ago. When a monolith is broken into microservices, developers must consider a new type of application complexity: partial failure, where one or more of the services that a single process depends on fail.

Given human error and the reliance on third-party services and infrastructure, outages are inevitable.

Speaking of failure, the article states, "it is inevitable and must be anticipated and planned for."

In recent years, chaos engineering has emerged as the primary discipline addressing reliability issues. In a popular view, chaos engineering should be practised directly in production. DoorDash was not comfortable with this idea: it worried that users who ran into issues would become frustrated and move to another platform.

Today's chaos-engineering approaches and tools rely on manually specified experiment configurations. Developers are therefore responsible for devising and configuring the failure scenarios they want to test, so some possible failure scenarios may go untested.

Using Filibuster, microservices can be automatically tested for resilience. It is based on a process called "resilience-driven development."

First, it identifies possible errors through static code analysis. It then repeatedly executes a functional end-to-end application test, injecting these errors individually and in combinations. Using conditional assertions provided by Filibuster, developers can encode their desired behaviour under fault directly into their functional test.
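The loop described above can be illustrated with a minimal Python sketch. It does not use Filibuster's actual API: `FAULT_POINTS`, `call`, `place_order`, `functional_test`, `fault_scenarios`, and `run_all` are hypothetical names invented for illustration, the list of fault points is hard-coded rather than discovered by static analysis, and the scenarios are enumerated exhaustively rather than pruned by a reduction algorithm.

```python
from itertools import combinations

# Hypothetical fault points. In the approach described above these would be
# found by static analysis; here they are hard-coded for illustration.
FAULT_POINTS = ["payments", "inventory"]


class ServiceUnavailable(Exception):
    """Stand-in for a failed remote call (timeout, connection error, 5xx)."""


def call(service, active_faults):
    # Simulated remote call: fails only if a fault is injected for this service.
    if service in active_faults:
        raise ServiceUnavailable(service)
    return {"service": service, "ok": True}


def place_order(active_faults):
    """Toy application under test with one hard and one soft dependency."""
    try:
        call("payments", active_faults)
    except ServiceUnavailable:
        return {"status": 503, "stock_checked": False}
    try:
        call("inventory", active_faults)
        stock_checked = True
    except ServiceUnavailable:
        stock_checked = False  # degrade gracefully instead of failing
    return {"status": 200, "stock_checked": stock_checked}


def functional_test(active_faults):
    # Conditional assertions: the expected behaviour depends on which
    # faults were injected during this run of the test.
    response = place_order(active_faults)
    if "payments" in active_faults:
        assert response["status"] == 503
    else:
        assert response["status"] == 200
        assert response["stock_checked"] == ("inventory" not in active_faults)


def fault_scenarios(points):
    # No faults first, then each fault alone, then every larger combination.
    for size in range(len(points) + 1):
        for combo in combinations(points, size):
            yield frozenset(combo)


def run_all(test, points):
    # Re-run the same functional test once per fault scenario and collect
    # the scenarios in which an assertion failed.
    executed, failures = 0, []
    for faults in fault_scenarios(points):
        executed += 1
        try:
            test(faults)
        except AssertionError:
            failures.append(faults)
    return executed, failures
```

Calling `run_all(functional_test, FAULT_POINTS)` runs the single functional test once per scenario and reports which fault combinations violated the expected behaviour. The number of combinations grows exponentially with the number of fault points, which is why a real tool needs some way to reduce the set of scenarios it executes.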

Filibuster can identify technical resilience problems in applications early, during development, without testing in production.

Filibuster was initially designed to test only microservices implemented in Python, using Flask, that communicated strictly over HTTP. To work in the DoorDash environment, it had to be extended to support Kotlin and gRPC.

Initially, it also relied on application-level instrumentation of services to support fault injection. This wasn't a viable option at DoorDash, as it would introduce a lot of effort and overhead solely for resilience testing, so the team developed a new strategy to add this instrumentation dynamically at runtime.
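The article does not describe how DoorDash adds instrumentation at runtime for Kotlin and gRPC services. As a rough analogy only, the general idea of injecting faults without changing service code can be sketched in Python by temporarily replacing a client method. `InventoryClient` and `inject_fault` are hypothetical names used for illustration and are not part of Filibuster.

```python
import contextlib
import functools


class InventoryClient:
    """Hypothetical service client; it contains no fault-injection code."""

    def get_stock(self, item_id):
        return {"item_id": item_id, "in_stock": True}


@contextlib.contextmanager
def inject_fault(cls, method_name, exception):
    """Temporarily replace cls.method_name with a version that raises."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def faulty(self, *args, **kwargs):
        # The real method is never reached while the fault is active.
        raise exception

    setattr(cls, method_name, faulty)
    try:
        yield
    finally:
        # Always restore the original method, even if the test body fails.
        setattr(cls, method_name, original)
```

A test can then wrap a call in `with inject_fault(InventoryClient, "get_stock", TimeoutError("injected")):` to observe how calling code reacts to the failure, while the client class itself stays free of testing hooks.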

DoorDash ran Filibuster on two services in the local development environment at the end of summer and plans to expand its scope.

You can read more about Filibuster in this paper.
