InfoQ Homepage Articles
-
Adaptive Responses to Resiliently Handle Hard Problems in Software Operations
As engineers move into more senior positions such as Staff Engineer, Architect, or Sr. Tech Lead, their knowledge and experience are often applied across the system. This expertise is increasingly needed for handling novel problems or designing innovative solutions to complex problems. This article discusses strategies for approaching your role as a senior member of your organization.
-
Taking Advantage of Cell-Based Architectures to Build Resilient and Fault-Tolerant Systems
Cell-based architectures offer a robust approach to building resilient systems. They achieve this through the core principles of isolation, autonomy, and replication. Each cell manages its own resources and makes decisions autonomously. Observability for cell-based architectures requires a tailored approach to address the unique challenges and opportunities presented by this distributed system design.
-
Optimizing Wellhub Autocomplete Service Latency: a Multi-Region Architecture
Every company wants fast, reliable, and low-latency services. Achieving these goals requires significant investment and effort. In this article, I will share how Wellhub invested in a multi-region architecture to achieve a low-latency autocomplete service.
-
Article Series: Cell-Based Architectures: How to Build Scalable and Resilient Systems
In this article series, we take readers on a journey of discovery and provide a comprehensive overview and in-depth analysis of many key aspects of cell-based architectures, as well as practical advice for applying this approach to existing and new architectures.
-
How Cell-Based Architecture Enhances Modern Distributed Systems
Cell-based architecture has emerged as a response to many challenges associated with distributed systems. It employs the bulkhead pattern to isolate failures to a fraction of the affected infrastructure footprint and prevent widespread impact. Cells can also help organize large architectures into domain-bound deployment and delivery units, which provides essential sociotechnical benefits.
-
Building a Global Caching System at Netflix: a Deep Dive to Global Replication
Netflix's EVCache system handles 400M ops/second across 22,000 servers, managing 14.3 PB of data. This infrastructure ensures global availability and resilience through intelligent data routing and flexible replication strategies. By implementing batch compression and switching to DNS-based discovery, Netflix optimizes efficiency, reduces bandwidth usage, and significantly lowers operational costs.
-
Proactive Approaches to Securing Linux Systems and Engineering Applications
Maintaining a strong security posture is challenging, especially with Linux. An effective approach is proactive and includes patch management, optimized resource allocation, and effective alerting.
-
How Functional Programming Can Help You Write Efficient, Elegant Web Applications
Many things can make software more challenging to understand and, consequently, to maintain. One of the most complex and problematic causes is managing internal mutable state. When the internal state is poorly managed, the software behaves unexpectedly, leading to bugs and to fixes that introduce unnecessary complexity. Functional programming (FP) addresses this problem by providing immutability mechanisms and more.
-
Virtual Panel: What to Consider When Adopting Large Language Models
Four experts discuss some issues people should think about when adopting LLMs and how they can make the best choice for their specific use case. Topics include how to choose between an API-based vs. self-hosted LLM, when to fine-tune an LLM, how to mitigate LLM risks, and what non-technical changes organizations need to make when adopting LLMs.
-
How to Minimize Latency and Cost in Distributed Systems
Explore the benefits and challenges of microservices architecture in cloud environments, focusing on achieving resilience and high availability while managing costs and performance issues.
-
Navigating LLM Deployment: Tips, Tricks, and Techniques
This article focuses on self-hosted LLMs and how to get the best performance from them. The author provides best practices on how to overcome challenges due to model size, GPU scarcity, and a rapidly evolving field.
-
Building Better Platforms with Empathy: Case Studies and Counter-Examples
Scaling platform development often means absorbing cognitive burdens, but empathy is key. Understanding users beyond their immediate issues leads to better solutions. Platforms help manage growth's complexity, but a product mindset with user-centricity is vital. In his talk at QCon San Francisco 2023, David Stenglein expanded on cultivating empathy through open communication.