Reliability Content on InfoQ
-
Uber Builds Scalable Chat Using Microservices with GraphQL Subscriptions and Kafka
Uber replaced a legacy architecture built using the WAMP protocol with a new solution that takes advantage of GraphQL subscriptions. The main drivers for creating a new architecture were challenges around reliability, scalability, and observability/debuggability, as well as technical debt impeding the team’s ability to maintain the existing solution.
-
Grab Improves Kafka on Kubernetes Fault Tolerance with Strimzi, AWS AddOns and EBS
Grab updated its Kafka on Kubernetes setup to improve fault tolerance and completely eliminate human intervention in case of unexpected Kafka broker terminations. To address the shortcomings of the initial design, the team integrated with AWS Node Termination Handler (NTH), used the Load Balancer Controller for target group mapping, and switched to EBS volumes for storage.
-
Uber Improves Resiliency of Microservices with Adaptive Load Shedding
Uber created a new load-shedding library for its microservice platform, which serves over 130 million customers and handles aggregated peaks of millions of requests per second (RPS). The company replaced its QALM-based solution with the Cinnamon library, which, in addition to graceful degradation, can dynamically and continuously adjust the capacity of the service and the amount of load shedding.
-
Zonal Autoshift on AWS: Optimizing Infrastructure Reliability
Zonal autoshift, a new capability of Amazon Route 53 Application Recovery Controller, automatically shifts traffic away from an Availability Zone (AZ) when a potential failure is identified by the cloud provider. The service redirects the traffic back once the AZ failure is resolved.
-
Microsoft Refreshes its Well-Architected Framework
Microsoft recently announced a comprehensive refresh of the Well-Architected Framework (WAF) for designing and running optimized workloads on Azure.
-
AWS Restructures and Consolidates Its Well-Architected Framework
AWS published a new set of updates to its Well-Architected Framework, with changes across all six pillars of the framework. The performance efficiency and operational excellence pillars have been restructured and consolidated to reduce the number of best practices. Other pillars received improved implementation guidance, including recommendations and steps on reusable architecture patterns.
-
Google Delivers Comprehensive Cloud Infrastructure Reliability Guide
Google recently delivered a cloud infrastructure reliability guide combining best practices and expertise from its engineers for its customers.
-
Azure Cosmos DB: Low Latency and High Availability at Planet Scale
Mei-Chin Tsai and Vinod Sridharan spoke at QCon San Francisco on Azure Cosmos DB: Low Latency and High Availability at Planet Scale. The talk was part of the "Architectures You've Always Wondered About" track.
-
Adopting Continuous Deployment: Tom Wanielista at QCon San Francisco 2022
At QCon San Francisco 2022, Tom Wanielista, a staff engineer on infrastructure at Lyft, presented on adopting continuous deployment at his company. The talk was part of the editorial track "Architecting Change at Scale."
-
Filibuster: Automated Fault Injection Tool to Improve DoorDash's Reliability
DoorDash recently revealed how it is using Filibuster, an automated fault injection tool, to identify resilience issues in microservice applications early on and improve platform reliability.
-
Google Introduces Cloud Backup and Disaster Recovery
Google recently introduced Cloud Backup and Disaster Recovery (DR), allowing customers to enable centralized backup management directly from the Google Cloud console. The new backup and recovery service is designed to work with cloud storage repositories, databases, and applications.
-
Developing and Evolving SaaS Infrastructures for Enterprises
SaaS companies that are focused on the enterprise market need to evolve their infrastructure to meet the security, reliability, and other IT requirements of their customers. IT admins and large customers are two important sources of requirements to drive development.
-
Building Resiliency into the Twitter Ad Pacing Service
Twitter’s ad pacing algorithms were initially part of an ad-serving monolith. Twitter’s engineering team later extracted them into a separate service to make development easier. Because the service is critical, it needs to be highly reliable. A recently published article describes how the team built a reliable service by making economical design choices for handling different failure scenarios.
-
AWS Increases the Availability and Reliability of Amazon EventBridge with Global Endpoints
Recently, AWS introduced a new capability called global endpoints for its serverless event bus service Amazon EventBridge to improve availability and reliability.
-
Measuring the Environmental Impact of Software and Cloud Services
Software can shorten the service life of hardware or increase its energy consumption. It’s possible to measure the environmental impact caused by cloud services. The design of the software architecture determines how much hardware and electrical power is required, so software can be economical or wasteful with hardware resources.