Distributed Systems Content on InfoQ
-
Improving Distributed System Data Integrity with Amazon S3 Conditional Writes
AWS recently announced support for conditional writes in Amazon S3, allowing users to check for the existence of an object before creating it. This feature prevents uploads from overwriting existing objects, making it easier for applications to manage data.
-
How Amazon Aurora Serverless Manages Resources and Scaling for Fleets of 10K+ Instances
AWS engineers published a paper describing the evolution and latest design of resource management and scaling for the Amazon Aurora Serverless platform. Aurora Serverless uses a combination of components at different levels to create a holistic approach for dynamically scaling and adjusting resources to satisfy the needs of customer workloads.
-
Uber Drives Apache Kafka's Tiered Storage Feature; Sparks Efficiency Debate
Apache Kafka, the popular distributed event streaming platform, has introduced a new tiered storage feature in version 3.6.0, initially proposed by Uber engineers. This feature, currently in early access, aims to address the scalability and efficiency challenges faced by organizations running large Kafka clusters.
-
Apache SkyWalking v10: Application Performance Monitoring Tool for Distributed Systems
The Apache Software Foundation has released version 10 of Apache SkyWalking, an open-source observability platform designed to provide comprehensive monitoring, tracing, and analytics for distributed systems. The release includes many new features and enhancements...
-
ClickHouse Keeper: Efficient Apache ZooKeeper Alternative Created with C++ and Raft
The ClickHouse project team created an in-house replacement for Apache ZooKeeper because it needed a more efficient implementation that would also address some of ZooKeeper's shortcomings. ClickHouse Keeper is now an essential part of the ClickHouse project and a cornerstone of this open-source analytical database, but it can also be used independently for many distributed coordination use cases.
-
Microsoft Releases DeepSpeed-FastGen for High-Throughput Text Generation
Microsoft has announced the alpha release of DeepSpeed-FastGen, a system designed to improve the deployment and serving of large language models (LLMs). DeepSpeed-FastGen combines DeepSpeed-MII and DeepSpeed-Inference and is based on the Dynamic SplitFuse technique. The system currently supports several model architectures.
-
How DoorDash Rearchitected its Cache to Improve Scalability and Performance
DoorDash rearchitected the heterogeneous caching systems used across its microservices, creating a common, multi-layered cache that provides a generic mechanism and solves a number of issues caused by fragmented caching.
-
Disaster Recovery Across a Million Pieces: Michelle Brush at QCon San Francisco
During the second day of QCon San Francisco 2023, Michelle Brush, an engineering director, SRE at Google, discussed challenges, patterns, and practices for disaster recovery in massively distributed systems. The session was part of the "Designing for Resilience" track.
-
LinkedIn's Open-Source "iris-message-processor" Achieves 86.6x Faster Escalation Management Speeds
LinkedIn developed a new open-source service called "iris-message-processor" to enhance the performance and reliability of its existing Iris escalation management system. "iris-message-processor" significantly improves processing speeds, being ~4.6x faster under average loads and ~86.6x faster under high loads than its predecessor.
-
Pinterest Revamps Its Asynchronous Computing Platform with Kubernetes and Apache Helix
Pinterest created Pacer, its next-generation asynchronous computing platform, to replace Pinlater, an older solution that the company had outgrown and that suffered from scalability and reliability challenges. The new architecture leverages Kubernetes for scheduling job-execution workers and Apache Helix for cluster management.
-
Cadence 1.0: Uber Releases Its Scalable Workflow Orchestration Platform
Uber released a major version of its workflow orchestration platform named Cadence after six years in development. Uber and other companies use Cadence to build stateful services at scale using native programming languages.
-
Apache Pulsar 3.0 Delivers a New LTS Version and Efficiency Improvements
The Apache Software Foundation has released version 3.0 of Apache Pulsar, the distributed messaging and streaming platform. Pulsar 3.0 is a Long-Term Support (LTS) release and brings many performance and scalability improvements.
-
Preventing Serverless Vendor Lock-in with Design Patterns
Gregor Hohpe recently published an article proposing a paradigm shift to address vendor lock-in concerns in serverless cloud applications. Designing a solution using well-known patterns decouples its functional characteristics from the underlying cloud implementation, making it easier to avoid lock-in or to go multi-cloud.
-
A Distributed System is Knowable: an Impossible Thing for Developers
Failure in distributed systems is normal. A distributed system can provide only two of the three guarantees of consistency, availability, and partition tolerance. According to Kevlin Henney, this limits how much you can know about how a distributed system will behave. He gave a keynote about Six Impossible Things at QCon London 2022 and at QCon Plus May 10-20, 2022.
-
Cloudflare D1 Provides Distributed SQLite for Cloudflare Workers
Soon to enter beta, D1 is Cloudflare's first step into the cloud-based SQL storage arena. D1 is built on top of SQLite and adds a distributed replication mechanism, batch operation support, embedded compute, automatic backups and redundancy, and more.