Resilience Content on InfoQ

News

DevOps

Disaster Recovery Across a Million Pieces: Michelle Brush at QCon San Francisco

On the second day of QCon San Francisco 2023, Michelle Brush, engineering director for SRE at Google, discussed challenges, patterns, and practices for disaster recovery in massively distributed systems. Her session is part of the "Designing for Resilience" track.

Steef-Jan Wiggers
on Oct 04, 2023
Development

6 Tracks Not to Miss at QCon San Francisco, October 2-6, 2023: ML, Architecture, Resilience & More!

At InfoQ’s international software development conference, QCon San Francisco (October 2-6, 2023), senior software practitioners driving innovation and change in software development will explore real-world architectures, technologies, and techniques to help you solve common challenges.

Artenisa Chatziou
on Sep 05, 2023
Cloud

Microsoft Azure Cross-Region (Global) Load Balancer Now Generally Available

Microsoft recently announced the general availability (GA) of Azure cross-region (Global) Load Balancer in all Azure public and national cloud regions.

Steef-Jan Wiggers
on Jul 19, 2023
Architecture & Design

How LinkedIn Serves over 4.8 Million Member Profiles per Second

LinkedIn introduced Couchbase as a centralized caching tier for scaling member profile reads to handle increasing traffic that has outgrown their existing database cluster. The new solution achieved over 99% hit rate, helped reduce tail latencies by more than 60% and costs by 10% annually.

Rafal Gancarz
on Jul 03, 2023
Culture & Methods

How Resilience Can Help to Get Better at Resolving Incidents

Applying resilience throughout the incident lifecycle by taking a holistic look at the sociotechnical system can help to turn incidents into learning opportunities. Resilience can help folks get better at resolving incidents and improve collaboration. It can also give organizations time to realize their plans.

Ben Linders
on Jun 15, 2023
DevOps

Using Code Instrumentation for Fault Injection at the Application Level at eBay

eBay engineers have been using fault injection techniques to improve the reliability of the notification platform and explore its weaknesses. While fault injection is a common industry practice, eBay attempted a novel approach that leverages code instrumentation to bring fault injection to the application level.

Sergio De Simone
on Dec 31, 2022
Java

Resilience4j 2.0.0 Delivers Support for JDK 17

Resilience4j, a lightweight fault tolerance library designed for functional programming, has released version 2.0 featuring support for Java 17 and dependency upgrades to Kotlin, Spring Boot and Micronaut. This new version also removes the dependency on Vavr, a functional library for Java, in order to become a more lightweight library.

Johan Janssen
on Dec 05, 2022
Java

Java News Roundup: Major Spring Releases, Resilience4j, Open Liberty, GlassFish, Kotlin 1.8-Beta

This week's Java roundup for November 21st, 2022, features news from JDK 20, major, point and patch releases for Spring (namely Boot, Web Services, Security, Batch, Authorization Server, REST Docs, Framework, Modulith, GraphQL, Apache Kafka and RabbitMQ), Open Liberty 22.0.0.12, GlassFish 7.0-M10, GraalVM Native Build Tools 0.9.18, Resilience4j 2.0, Apache Tomcat 8.5.84 and Kotlin 1.8-Beta.

Michael Redlich
on Nov 28, 2022
DevOps

Dropbox Unplugs Data Center to Test Resilience

Dropbox has published a detailed account of why and how they unplugged an entire data center to test their disaster readiness. The disaster readiness team began building tools to make frequent failovers possible, and ran their first formalized failover in 2019. Eventually, with new tooling and procedures in place, the data center was unplugged. The effort resulted in a significantly reduced recovery time objective (RTO).

Matt Saunders
on Jun 30, 2022
Architecture & Design

Building Resiliency into the Twitter Ad Pacing Service

Twitter’s ad pacing algorithms were initially part of an ad-serving monolith. Later, Twitter’s engineering team extracted them into a separate service to facilitate development. Because the service is important, it needs to be very reliable. A recently published article describes how the team built a reliable service by making economical design choices for managing different failure scenarios.

Vasco Veloso
on Apr 20, 2022
Architecture & Design

Netflix’s RENO Keeps Experience Consistent across Devices

Netflix has developed the Rapid Event Notification System (RENO) to create a consistent user experience across various platforms and devices. RENO reacts more quickly and consistently than the traditional request/response model to user-generated actions ranging from watching a title to changing profile information.

Patrick Zhang
on Mar 15, 2022
Java

Failsafe 3.2 Released with New Resilience Policies

Failsafe, a lightweight fault tolerance library for Java 8+, launched the major 3.0 release in November 2021. More recently, Failsafe announced the availability of version 3.2 which introduced new Rate Limiter and Bulkhead policies. Failsafe also integrates with asynchronous code like Java’s CompletableFuture.

Andrea Messetti
on Feb 09, 2022
Cloud

AWS US-EAST-1 Outage: Postmortem and Lessons Learned

On December 7th, AWS experienced an hours-long outage that affected many services in its most popular region, Northern Virginia. The cloud provider released an analysis of the incident, which started discussions in the community about redundancy on AWS and multi-region approaches.

Renato Losio
on Dec 18, 2021
Cloud

Amazon Introduces AWS Resilience Hub to Monitor and Improve RPO and RTO

Amazon recently announced the availability of AWS Resilience Hub, a service designed to help customers define, measure, and manage the resilience of their applications on the cloud.

Renato Losio
on Nov 17, 2021
Architecture & Design

Real-Time Exactly-Once Event Processing at Uber with Apache Flink, Kafka, and Pinot

Uber faced some challenges after introducing ads on UberEats. The events the ads generated had to be processed quickly, reliably, and accurately. These requirements were fulfilled by a system based on Apache Flink, Kafka, and Pinot that can process streams of ad events in real time with exactly-once semantics. An article describing its architecture was published recently on the Uber Engineering blog.

Vasco Veloso
on Nov 12, 2021
