At the inaugural ChaosConf, held in San Francisco, Gremlin Inc released Application-Level Fault Injection (ALFI), their second product offering in the "Failure-as-a-Service" space. Building upon their initial SaaS product, which helps engineers create and run chaos experiments at the infrastructure level, ALFI enables failure injection at the application level via a native language library. Currently only the Java/JVM platform is supported, but additional language libraries will be added soon.
The Gremlin documentation states that "Operators Think in Requests". In addition to injecting failure at the infrastructure level (such as rebooting a compute instance, adding latency to a network connection, or consuming large amounts of RAM), operators also want to target application-level requests, for example by adding latency to them or terminating them.
After the ALFI library has been integrated into an application as a dependency, engineers can use the Gremlin web-based UI to run "attacks", restricting the impact of failure injection by matching against specific application attributes reported by the ALFI dependency. This allows an engineer to create a precisely scoped failure experiment that only impacts, for example, particular customer IDs, locations, or device types.
The Gremlin team claims that since ALFI is embedded within an application, it will work in any existing environment, which also includes all serverless platforms like AWS Lambda, Azure Functions, and Google Cloud Functions. Gremlin posits that many incidents that occur within a system built using the microservices or function-as-a-service (FaaS) architecture style are due to a slowdown or failure somewhere in the upstream dependencies. Accordingly, ALFI can simulate delay or full-fledged failure of specific services, specific RPC calls, and external dependencies, which allows an engineer to reproduce outages, proactively find unknown failure modes, and prepare for more complicated scenarios where multiple components fail.
To use ALFI, an engineer must integrate the Gremlin language dependencies into their application and redeploy. The JVM Installation Guide provides a comprehensive walkthrough of the currently supported installation process (at present only a Gradle dependency example is provided, with a Maven example promised soon). Once the application is redeployed, a series of ALFI parameters, such as a Gremlin team identifier and credentials, must be supplied via environment variables or a properties file.
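The "environment variables or a properties file" arrangement can be pictured with a small, generic lookup helper. The following is only a sketch and is not part of the Gremlin library: the class name SettingsResolver, the setting names passed to it, and the precedence rule (environment variable first, properties file as fallback) are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

/** Hypothetical helper: resolves a setting from the environment or a properties file. */
public final class SettingsResolver {
    private final Map<String, String> env;
    private final Properties props;

    public SettingsResolver(Map<String, String> env, Properties props) {
        this.env = env;
        this.props = props;
    }

    /** Loads a properties file if it exists; otherwise returns empty properties. */
    public static Properties loadIfPresent(Path file) throws IOException {
        Properties p = new Properties();
        if (Files.isRegularFile(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                p.load(in);
            }
        }
        return p;
    }

    /** Assumed precedence: a non-empty environment variable wins, the file is the fallback. */
    public Optional<String> get(String envName, String propName) {
        String fromEnv = env.get(envName);
        if (fromEnv != null && !fromEnv.isEmpty()) {
            return Optional.of(fromEnv);
        }
        return Optional.ofNullable(props.getProperty(propName));
    }
}
```

A caller might construct it with `new SettingsResolver(System.getenv(), SettingsResolver.loadIfPresent(Path.of("app.properties")))` and then ask for a placeholder pair such as `get("EXAMPLE_TEAM_ID", "example.team.id")`. The real parameter names are listed in the JVM Installation Guide.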
The main Java class an engineer will interact with is com.gremlin.GremlinService, which abstracts all of the functionality required to register with the Gremlin SaaS platform API, find and cache experiments, and report success back to the Gremlin API. The GremlinService class is designed to be a singleton, and can be managed via a dependency injection framework. Examples are provided in the documentation for integrating fault injection into the Java Apache HTTP Client and the Amazon DynamoDB NoSQL database client. Custom extensions can also be added.
An important concept in ALFI is that each application has a set of identifying attributes. This set of attributes is named ApplicationCoordinates and is used to determine when an application matches an attack request created via the web-based UI. The gremlin-core dependency includes integrations for running on AWS Lambda and Amazon EC2. On AWS Lambda, the attributes type=AwsLambda, name, and region are set by default. On Amazon EC2, the attributes type=AwsEc2, region, az, and instanceId are set. For example:
{"type"="AwsLambda", "region"="us-west-1", "name"="event-handler"}
{"type"="MyServiceType", "region"="us-east-1", "service"="recommendations", "criticality"="2", "userfacing"="true"}
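Coordinates like these are effectively string key/value pairs, so matching an attack against an application can be pictured as a subset check. The sketch below is not Gremlin's implementation: the CoordinateMatcher class is hypothetical, and it assumes that an attack target matches when every field it names is present in the application's coordinates with the same value.

```java
import java.util.Map;

/** Hypothetical illustration of attribute matching; not part of the ALFI library. */
public final class CoordinateMatcher {
    private CoordinateMatcher() {}

    /**
     * Returns true when every key/value pair in the target is present,
     * with the same value, in the application's coordinates.
     * An empty target therefore matches every application.
     */
    public static boolean matches(Map<String, String> target, Map<String, String> coordinates) {
        return target.entrySet().stream()
                .allMatch(e -> e.getValue().equals(coordinates.get(e.getKey())));
    }
}
```

Under this assumption, a target of region=us-east-1 would match the second set of coordinates above but not the first, while a target of type=AwsLambda would match only the first.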
Other facets or "coordinates" of an application that an operator requires for targeting can also be defined by implementing the two methods in the abstract GremlinCoordinatesProvider class. To create custom ApplicationCoordinates, an engineer must override initializeApplicationCoordinates(). The auto-generated ApplicationCoordinates (if any) are supplied as an argument to this method, which means any custom coordinates can be appended to these. An example of custom ApplicationCoordinates is shown below:
import java.util.Optional;

import com.gremlin.ApplicationCoordinates;
import com.gremlin.GremlinCoordinatesProvider;

public class MyCoordinatesProvider extends GremlinCoordinatesProvider {

    @Override
    public ApplicationCoordinates initializeApplicationCoordinates(Optional<ApplicationCoordinates> autoDiscoveredCoordinates) {
        // If coordinates were auto-discovered, append a custom field to them;
        // otherwise build a complete set of coordinates from scratch.
        return autoDiscoveredCoordinates.map(c -> {
            c.putField("userfacing", "true");
            return c;
        }).orElseGet(() -> new ApplicationCoordinates.Builder()
                .withType("MyServiceType")
                .withField("name", "recommendations")
                .withField("userfacing", "true")
                .build());
    }
}
This set of ApplicationCoordinates can then be used to match attacks. For example, if an operator creates an attack that matches userfacing=true, then the application specified in the above example will be included in the attack. Currently an operator can specify the percentage of requests that should be impacted by the failure injection, and can either add latency to a request or cause an Exception to be thrown on the thread executing the request.
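The two impact types and the percentage setting can be illustrated with a small wrapper around a request. This is a sketch under stated assumptions, not the ALFI API: the FaultInjector class, its Kind enum, and the choice of IllegalStateException are all hypothetical, and the random source is passed in so that the behaviour can be made deterministic.

```java
import java.util.concurrent.Callable;
import java.util.function.DoubleSupplier;

/** Hypothetical illustration of percentage-based fault injection; not part of the ALFI library. */
public final class FaultInjector {
    public enum Kind { LATENCY, EXCEPTION }

    private final Kind kind;
    private final double percentage;     // share of requests to impact, 0 to 100
    private final long latencyMillis;    // delay used when kind is LATENCY
    private final DoubleSupplier random; // must return values in [0, 1)

    public FaultInjector(Kind kind, double percentage, long latencyMillis, DoubleSupplier random) {
        this.kind = kind;
        this.percentage = percentage;
        this.latencyMillis = latencyMillis;
        this.random = random;
    }

    /** Runs the request on the calling thread, injecting the fault for the sampled share of calls. */
    public <T> T call(Callable<T> request) throws Exception {
        boolean impacted = random.getAsDouble() * 100.0 < percentage;
        if (impacted) {
            if (kind == Kind.EXCEPTION) {
                // The request never executes; the caller sees a failure instead.
                throw new IllegalStateException("injected failure");
            }
            // LATENCY: delay the calling thread, then let the request proceed.
            Thread.sleep(latencyMillis);
        }
        return request.call();
    }
}
```

For example, `new FaultInjector(FaultInjector.Kind.LATENCY, 10.0, 200L, () -> java.util.concurrent.ThreadLocalRandom.current().nextDouble())` would delay roughly one request in ten by 200 milliseconds and leave the rest untouched.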
Open source solutions within the failure injection space do exist, such as the (now retired) Simian Army and the Chaos Toolkit, but these products must be self-hosted. Running chaos experiments does require upfront preparation and design, and these topics were covered in a recent full-length article, "Chaos Conf Q&A: The Benefits, Challenges and Practices of Chaos Engineering".
Additional information on Gremlin's ALFI can be found on the ALFI announcement blog post and ALFI help pages.