Gremlin Releases State of Chaos Engineering 2021 Report

Gremlin released their State of Chaos Engineering 2021 report based on a community survey and their own product data. The key findings include a positive correlation between running chaos engineering experiments and increased availability.

Ever since Netflix announced their use of Chaos Monkey to randomly shut down VM instances, chaos engineering has developed into a field with many tools and practices. Gremlin’s report is based on 400+ survey respondents and Gremlin’s own product data. The companies surveyed range from the small to the very large, but industry-wise they are "primarily in Software and Services", notes the report. The next largest group is Financial and Banking services (23.2%). Although chaos engineering is still an emerging practice, the majority of respondents (60%) have run at least one Chaos Engineering attack.

One of the report's findings is that running chaos engineering experiments correlates with higher availability and lower mean time to resolution (MTTR). Teams running frequent experiments report greater than 99.9% availability. Network attacks are the most commonly run experiments, and 34% of respondents run Chaos Engineering experiments in production.

Image reused from the report with permission.

According to the report, the top 20% of respondents had services with an availability of "more than four nines". 23% of teams had a mean time to resolution (MTTR) of under an hour, with 60% having an MTTR of under 12 hours. 81.4% of respondents had an average of 1-10 high severity incidents per month. High severity incidents are mostly caused by networking issues (50%), internal dependencies (41%), bad code deployments (39%), and configuration errors (48%). The report also notes that:

Ad-hoc experiments are an important part of the practice, and teams with >99.9% availability are performing more ad-hoc experiments.
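To put the availability figures quoted above into perspective, the following is a minimal sketch of the arithmetic that converts an availability percentage into the downtime it permits over a year. It is not taken from the report; the function name `allowed_downtime_minutes` and the 365-day year are assumptions made for illustration.

```python
# Illustrative arithmetic only: converts an availability percentage into
# the downtime it permits over a period. Not taken from the Gremlin report.

MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year, ignoring leap years


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the maximum downtime (in minutes) permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    # "Three nines" (99.9%) and "four nines" (99.99%), plus 99% for contrast.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} minutes/year")
```

Under these assumptions, 99.9% availability allows roughly 8.8 hours of downtime per year, while "four nines" allows under an hour.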

The surveyed teams reported using a variety of tools to enable service availability. These include autoscaling (65% of respondents who have >99.9% availability), load balancers (77%), active-active sites (38%), circuit breakers (32%), backups (61%), DNS failover/elastic IPs (49%), and controlled deployment rollouts (51%). These categories are not mutually exclusive, as most teams use multiple approaches. Almost 70% of the top-performing teams depend on monitoring with health checks. Monitoring and observability are key components of chaos engineering. Charity Majors, CEO of Honeycomb, noted in her Chaos Community Day talk in 2019 that "without observability, you don't have chaos engineering. You just have chaos".

Image reused from the report with permission.

The most popular way of monitoring is "standard uptime over total time using synthetic monitoring", according to the report, but many organizations use multiple methods and metrics. These include error rates, latency, changes in transaction patterns as measured against historical data, and rate of successful requests. Availability reports are received mainly by Ops teams (61.4%) and Dev teams (54.5%), but also by CTO (33.7%) and VP-level (30.2%) executives. Similar percentages of these groups received reports on performance as well.
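As a rough illustration of two of the metrics mentioned above, the sketch below computes "uptime over total time" from the results of evenly spaced synthetic probes, and a rate of successful requests from HTTP status codes. The function names, the choice to count only 5xx responses as failures, and the input shapes are assumptions for this example; they do not come from the report or from any particular monitoring product.

```python
# Hypothetical sketch: names and input shapes are illustrative assumptions,
# not taken from the report or from any monitoring product.
from typing import Sequence


def uptime_ratio(probe_results: Sequence[bool]) -> float:
    """Share of evenly spaced synthetic probes that succeeded.

    With probes at a fixed interval, this approximates uptime over total time.
    """
    if not probe_results:
        raise ValueError("need at least one probe result")
    return sum(probe_results) / len(probe_results)


def success_rate(status_codes: Sequence[int]) -> float:
    """Share of requests that did not fail with a server error (HTTP 5xx)."""
    if not status_codes:
        raise ValueError("need at least one request")
    ok = sum(1 for code in status_codes if code < 500)
    return ok / len(status_codes)


if __name__ == "__main__":
    # One failed probe out of 1,440 (one probe per minute for a day).
    probes = [True] * 1439 + [False]
    print(f"uptime: {uptime_ratio(probes):.4%}")
    print(f"success rate: {success_rate([200, 200, 503, 404]):.0%}")
```

The two numbers can diverge: synthetic probes measure whether a fixed check passes, while the request success rate reflects what real traffic experienced, which is one reason organizations track more than one metric.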

On adoption, the report notes "nearly 50% of respondents working for companies with more than 1,000 employees". 54% of small organizations (<100 employees) have never performed an attack, compared with 34.5% of very large organizations (>10,000 employees). In between, the figure fluctuates with organization size. The report also found that SREs and application developers were the most involved in carrying out experiments, followed by platform teams. Lack of awareness, lack of experience, and other priorities were reported as the top three barriers to adopting chaos engineering.

CI/CD, containerization, and cloud deployments accounted for significant portions of production workloads among the respondents surveyed. Among the workloads, 38% were deployed on AWS and 11-12% on GCP and Azure. This is also reflected in the most common choice of monitoring tool, Amazon CloudWatch (20%), although Grafana and Prometheus (both 18%) are also popular. It's worth noting here that AWS announced Chaos Engineering as a service last year. There are both commercial and free chaos engineering tools, including Gremlin, Chaos Mesh, Netflix's Chaos Monkey, Litmus, and ChaosBlade.

The report is available online (it requires registration to view).
