Gremlin Releases Application Level Fault Injection (ALFI) Platform for Targeted Chaos Experiments

by Daniel Bryant on Oct 07, 2018. Estimated reading time: 4 minutes

At the inaugural ChaosConf, held in San Francisco, Gremlin Inc. released Application-Level Fault Injection (ALFI), its second product offering in the "Failure-as-a-Service" space. Building on the company's initial SaaS product, which helps engineers create and run chaos experiments at the infrastructure level, ALFI enables failure injection at the application level via a native language library. Currently only the Java/JVM platform is supported, but additional language libraries will be added soon.

The Gremlin documentation states that "Operators Think in Requests". In addition to injecting failure at the infrastructure level (such as rebooting a compute instance, adding latency to a network connection, or consuming large amounts of RAM), operators also want to target application-level requests, for example by adding latency to specific requests or terminating them.

After the ALFI library has been integrated into an application as a dependency, engineers can use the Gremlin web-based UI to run "attacks" and match and restrict the impact of failure injection by targeting specific application attributes reported by the ALFI dependency. This allows an engineer to create a precisely scoped failure experiment that only impacts, for example, particular customer IDs, locations, or device types.

 

[Figure: Gremlin ALFI web-based UI. Selecting traffic for failure injection via ALFI (image courtesy of the Gremlin Blog)]

 

The Gremlin team claims that since ALFI is embedded within an application, it will work in any existing environment, which also includes all serverless platforms like AWS Lambda, Azure Functions, and Google Cloud Functions. Gremlin posits that many incidents that occur within a system built using the microservices or function-as-a-service (FaaS) architecture style are due to a slowdown or failure somewhere in the upstream dependencies. Accordingly, ALFI can simulate delay or full-fledged failure of specific services, specific RPC calls, and external dependencies, which allows an engineer to reproduce outages, proactively find unknown failure modes, and prepare for more complicated scenarios where multiple components fail.

To use ALFI, an engineer must integrate the Gremlin language dependencies into their application and redeploy. The JVM Installation Guide provides a comprehensive walkthrough of the currently supported installation process (at present only a Gradle dependency example is provided, with a Maven example promised soon). Once the application is redeployed, a series of ALFI parameters, such as a Gremlin team identifier and credentials, must be supplied via environment variables or a properties file.
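As a rough illustration of the "environment variables or a properties file" step, the sketch below resolves a setting from the environment first and falls back to a properties file. The class AlfiSettings, the key names GREMLIN_TEAM_ID and gremlin.team.id, and the lookup order are all assumptions made for this example; the real parameter names and precedence are defined in the JVM Installation Guide.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of the Gremlin library.
class AlfiSettings {
    private final Map<String, String> env;
    private final Properties props;

    AlfiSettings(Map<String, String> env, Properties props) {
        this.env = env;
        this.props = props;
    }

    // Looks up the environment variable first, then the properties key.
    Optional<String> resolve(String envName, String propertyKey) {
        String fromEnv = env.get(envName);
        if (fromEnv != null && !fromEnv.isEmpty()) {
            return Optional.of(fromEnv);
        }
        return Optional.ofNullable(props.getProperty(propertyKey));
    }

    // Parses properties-file content; a real application would read a file instead.
    static Properties parse(String text) {
        Properties p = new Properties();
        try (StringReader reader = new StringReader(text)) {
            p.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    public static void main(String[] args) {
        Properties props = parse("gremlin.team.id=team-from-file\n");
        AlfiSettings settings = new AlfiSettings(System.getenv(), props);
        // Assumed key names; check the installation guide for the real ones.
        String teamId = settings.resolve("GREMLIN_TEAM_ID", "gremlin.team.id")
                .orElseThrow(() -> new IllegalStateException("team id not configured"));
        System.out.println("Resolved team id: " + teamId);
    }
}
```

Passing the environment in as a map, rather than calling System.getenv() inside the helper, keeps the lookup easy to exercise in a unit test.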

The main Java class an engineer will interact with is com.gremlin.GremlinService, which abstracts all of the functionality required to register with the Gremlin SaaS platform API, find and cache experiments, and report success back to the Gremlin API. The GremlinService class is designed to be a singleton, and can be managed via a dependency injection framework. Examples are provided in the documentation for integrating fault injection into the Java Apache HTTP Client and the Amazon DynamoDB NoSQL database client. Custom extensions can also be added.

An important concept in ALFI is that each application has a set of identifying attributes. This set of attributes is named ApplicationCoordinates and is used to determine when an application matches an attack request created via the web-based UI. The gremlin-core dependency includes integrations for running on AWS Lambda and Amazon EC2. In the case of AWS Lambda, the attributes "type=AwsLambda, name, and region" will be set by default. In the case of Amazon EC2, the attributes "type=AwsEc2, region, az, and instanceId" will be set. For example, the coordinates of a Lambda function and of a custom service type might look like this:

{"type"="AwsLambda", "region"="us-west-1", "name"="event-handler"} and {"type"="MyServiceType", "region"="us-east-1", "service"="recommendations", "criticality"="2", "userfacing"="true"}
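To make the idea of matching an attack against application attributes concrete, here is a minimal sketch that treats both as string maps. The CoordinateMatcher class and its subset-matching rule (every attribute named by the attack must be present on the application with the same value) are assumptions for illustration, not Gremlin's actual implementation.

```java
import java.util.Map;

// Hypothetical illustration of attribute matching; not Gremlin's implementation.
class CoordinateMatcher {

    // An application matches when every attribute the attack specifies is present
    // on the application with the same value. Extra application attributes are ignored.
    static boolean matches(Map<String, String> attackTarget, Map<String, String> application) {
        for (Map.Entry<String, String> wanted : attackTarget.entrySet()) {
            if (!wanted.getValue().equals(application.get(wanted.getKey()))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The two example coordinate sets from the text above.
        Map<String, String> lambda = Map.of(
                "type", "AwsLambda", "region", "us-west-1", "name", "event-handler");
        Map<String, String> service = Map.of(
                "type", "MyServiceType", "region", "us-east-1",
                "service", "recommendations", "criticality", "2", "userfacing", "true");

        Map<String, String> attack = Map.of("region", "us-east-1", "userfacing", "true");
        System.out.println(matches(attack, lambda));   // prints "false"
        System.out.println(matches(attack, service));  // prints "true"
    }
}
```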

Other facets or "coordinates" of an application that an operator requires for targeting can also be defined by implementing the two methods in the abstract GremlinCoordinatesProvider class. To create custom ApplicationCoordinates, an engineer must override initializeApplicationCoordinates(). The auto-generated ApplicationCoordinates (if any) are supplied as an argument to this method, which means any custom coordinates can be appended to them. An example of custom ApplicationCoordinates is shown below:

 

import java.util.Optional;

import com.gremlin.ApplicationCoordinates;
import com.gremlin.GremlinCoordinatesProvider;

public class MyCoordinatesProvider extends GremlinCoordinatesProvider {

    @Override
    public ApplicationCoordinates initializeApplicationCoordinates(Optional<ApplicationCoordinates> autoDiscoveredCoordinates) {
        // If coordinates were auto-discovered (e.g. on AWS Lambda or EC2),
        // append the custom field to them.
        return autoDiscoveredCoordinates.map(c -> {
            c.putField("userfacing", "true");
            return c;
        // Otherwise, build a full set of custom coordinates from scratch.
        }).orElseGet(() -> new ApplicationCoordinates.Builder()
                .withType("MyServiceType")
                .withField("name", "recommendations")
                .withField("userfacing", "true")
                .build());
    }
}

 

This set of ApplicationCoordinates can then be used to match attacks. For example, if an operator created an attack that matches userfacing=true, then the application specified in the above example will be included in the attack. Currently, an operator can specify the percentage of requests that should be impacted by the failure injection, and can either add latency to a request or cause an Exception to be thrown on the request's execution thread.
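The impact options described above (a percentage of requests, added latency, or a thrown exception) can be sketched as a small wrapper around a request. The FaultInjector class, its constructor parameters, and the choice of RuntimeException are hypothetical and are not taken from the Gremlin library, where these values come from the attack configured in the web-based UI.

```java
import java.util.function.DoubleSupplier;
import java.util.function.Supplier;

// Hypothetical sketch of percentage-based fault injection; not the Gremlin API.
class FaultInjector {
    private final double percentage;      // share of requests to impact, 0 to 100
    private final long latencyMillis;     // delay added to impacted requests
    private final boolean throwException; // whether impacted requests should fail
    private final DoubleSupplier random;  // yields values in [0, 1)

    FaultInjector(double percentage, long latencyMillis,
                  boolean throwException, DoubleSupplier random) {
        this.percentage = percentage;
        this.latencyMillis = latencyMillis;
        this.throwException = throwException;
        this.random = random;
    }

    // Runs the request, first applying the configured impact to the sampled share of calls.
    <T> T call(Supplier<T> request) {
        if (random.getAsDouble() * 100.0 < percentage) {
            if (latencyMillis > 0) {
                try {
                    Thread.sleep(latencyMillis); // delay on the calling thread
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
            if (throwException) {
                throw new RuntimeException("Injected failure");
            }
        }
        return request.get();
    }

    public static void main(String[] args) {
        // Add 50 ms of latency to roughly a quarter of requests.
        FaultInjector injector = new FaultInjector(25.0, 50, false, Math::random);
        String result = injector.call(() -> "recommendations payload");
        System.out.println(result);
    }
}
```

Supplying the random source as a parameter makes the sampling deterministic in tests, which is useful when verifying that the non-impacted share of traffic is left untouched.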

Open source solutions within the failure injection space do exist, such as the (now retired) Simian Army and the Chaos Toolkit, but these must be self-hosted. Running chaos experiments also requires upfront preparation and design, and these topics were covered in a recent full-length article, "Chaos Conf Q&A: The Benefits, Challenges and Practices of Chaos Engineering".

Additional information on Gremlin's ALFI can be found on the ALFI announcement blog post and ALFI help pages.


Java Agent? by Greg Liebowitz

You would have to be nuts to add a dependency like this to your application. A Java agent to intercept the method calls would be far less intrusive than a library.

Re: Java Agent? by Daniel Bryant

Hey Greg,

I'm not sure I agree with you on this one, but I would be keen to better understand.

Part of the Gremlin library allows an engineer to expose "coordinates" for use in targeting requests to fail, and for this to be possible, you would have to use a library in order to access the Gremlin API.

I also think that using a library would be less opaque and easier to debug? I'm taking a few of my cues for this thinking from fault-tolerance libraries like Hystrix, which was implemented as a library rather than an agent?

Generally speaking, I think agents are good for instrumentation and runtime debugging, but not so good for implementing actual functionality?

Best wishes,

Daniel
