Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Resilience Content on InfoQ

  • How LinkedIn Serves over 4.8 Million Member Profiles per Second

    LinkedIn introduced Couchbase as a centralized caching tier to scale member profile reads as traffic outgrew its existing database cluster. The new solution achieved a hit rate of over 99%, helped reduce tail latencies by more than 60%, and cut annual costs by 10%.

  • How Resilience Can Help to Get Better at Resolving Incidents

    Applying resilience throughout the incident lifecycle by taking a holistic look at the sociotechnical system can help to turn incidents into learning opportunities. Resilience can help folks get better at resolving incidents and improve collaboration. It can also give organizations time to realize their plans.

  • Using Code Instrumentation for Fault Injection at the Application Level at eBay

    eBay engineers have been using fault injection techniques to improve the reliability of the notification platform and explore its weaknesses. While fault injection is a common industry practice, eBay attempted a novel approach that uses code instrumentation to bring fault injection to the application level.

  • Resilience4j 2.0.0 Delivers Support for JDK 17

    Resilience4j, a lightweight fault tolerance library designed for functional programming, has released version 2.0 featuring support for Java 17 and dependency upgrades to Kotlin, Spring Boot and Micronaut. This new version also removes the dependency on Vavr, a functional library for Java, in order to become a more lightweight library.

  • Java News Roundup: Major Spring Releases, Resilience4j, Open Liberty, GlassFish, Kotlin 1.8-Beta

    This week's Java roundup for November 21st, 2022, features news from JDK 20, major, point and patch releases for Spring (namely Boot, Web Services, Security, Batch, Authorization Server, REST Docs, Framework, Modulith, GraphQL, Apache Kafka and RabbitMQ), Open Liberty 22.0.0.12, GlassFish 7.0-M10, GraalVM Native Build Tools 0.9.18, Resilience4j 2.0, Apache Tomcat 8.5.84 and Kotlin 1.8-Beta.

  • Dropbox Unplugs Data Center to Test Resilience

    Dropbox has published a detailed account of why and how they unplugged an entire data center to test their disaster readiness. The disaster readiness team began building tools to make frequent failovers possible, and ran their first formalized failover in 2019. Eventually, with new tooling and procedures in place, the data center was unplugged. The effort significantly reduced their recovery time objective (RTO).

  • Building Resiliency into the Twitter Ad Pacing Service

    Twitter’s ad pacing algorithms were initially part of an ad-serving monolith. Later, Twitter’s engineers extracted them into a separate service to make development easier. Because the service is important, it needs to be very reliable. A recently published article describes how they built a reliable service by making economical design choices for handling different failure scenarios.

  • Netflix’s RENO Keeps Experience Consistent across Devices

    Netflix has developed the Rapid Event Notification System (RENO) to create a consistent user experience across various platforms and devices. RENO reacts more quickly and consistently than the traditional request/response model to user-generated actions ranging from watching a title to changing profile information.

  • Failsafe 3.2 Released with New Resilience Policies

    Failsafe, a lightweight fault tolerance library for Java 8+, launched the major 3.0 release in November 2021. More recently, Failsafe announced the availability of version 3.2 which introduced new Rate Limiter and Bulkhead policies. Failsafe also integrates with asynchronous code like Java’s CompletableFuture.

  • AWS US-EAST-1 Outage: Postmortem and Lessons Learned

    On December 7th AWS experienced an hours-long outage that affected many services in its most popular region, Northern Virginia. The cloud provider released an analysis of the incident that started threads in the community about redundancy on AWS and multi-region approaches.

  • Amazon Introduces AWS Resilience Hub to Monitor and Improve RPO and RTO

    Amazon recently announced the availability of AWS Resilience Hub, a service designed to help customers define, measure, and manage the resilience of their applications on the cloud.

  • Real-Time Exactly-Once Event Processing at Uber with Apache Flink, Kafka, and Pinot

    Uber faced some challenges after introducing ads on UberEats. The events the ads generated had to be processed quickly, reliably, and accurately. These requirements were met by a system based on Apache Flink, Kafka, and Pinot that can process streams of ad events in real time with exactly-once semantics. An article describing its architecture was published recently on the Uber Engineering blog.

  • Microsoft Announces Azure Chaos Studio in Public Preview

    At the recent Ignite, Microsoft announced the public preview of Azure Chaos Studio, a fully-managed experimentation service to help customers track, measure, and mitigate faults with controlled chaos engineering to improve the resilience of their cloud applications.

  • Why the Most Resilient Companies Want More Incidents

    According to John Egan, the incident management process is meant to be a cycle of not just the response, but also the account of root cause and the updating of internal processes and practices across the industry. Lowering the barrier to reporting incidents, holding effective incident review meetings using blameless postmortems, and giving everyone access to postmortems is what he advises.

  • Uber Implements Disaster Recovery for Multi-Region Kafka

    In a recent blog post, Uber engineers highlight how they use a replication platform to implement disaster recovery at scale with a multi-region Kafka deployment. Uber has a large deployment of Apache Kafka, processing trillions of messages and multiple petabytes of data per day. Uber's engineers provided business resilience and continuity in the face of natural and human-made disasters.
