DevOps Content on InfoQ
-
Analyzing Apache Kafka Stretch Clusters: WAN Disruptions, Failure Scenarios, and DR Strategies
This article analyzes the dynamics of Apache Kafka stretch clusters, assessing WAN disruptions and failure scenarios and outlining Disaster Recovery (DR) strategies. It focuses on maintaining high availability and data integrity across multi-region deployments, and on improving operational resilience so that vital services avoid service level agreement violations.
-
We Took Developers out of the Portal: How APIOps and IaC Reshaped Our API Strategy
How a team replaced legacy API management with an APIOps framework built on Infrastructure as Code (IaC). The article covers automating API lifecycles, strengthening security, and improving developer productivity through CI/CD integration, with consistent deployments across environments that enable rapid delivery and innovation.
-
Using Traffic Mirroring to Debug and Test Microservices in Production-Like Environments
Traffic mirroring has evolved from a network security tool to a robust method for debugging and testing microservices using real-world data. By safely duplicating production traffic to a shadow environment, teams can replicate elusive bugs, profile performance under actual load, validate new features, and detect regressions, all while production stays isolated and user experiences remain intact.
-
Designing Resilient Event-Driven Systems at Scale
Learn how to design resilient event-driven systems that scale. Explore key patterns like shuffle sharding and decoupling queues to handle load spikes and failures. Understand common pitfalls like over-relying on retries and neglecting observability for robust, scalable architectures.
-
Inflection Points in Engineering Productivity as Amazon Grew 30x
In this article, Carlos Arguelles explains how the approach to engineering productivity needs to shift as organizations scale. He shares examples from his time at Google and Amazon, explaining how some architectural decisions made at these companies shaped the way they develop software. Engineering productivity investments depend on inflection points, scale, controls, data, and tooling choices.
-
Distributed Cloud Computing: Enhancing Privacy with AI-Driven Solutions
Distributed cloud, privacy-enhancing technologies (PETs), and AI enable secure, private data processing. Their integration improves collaboration, security, and compliance across marketing, finance, and healthcare, addressing the growing need for data protection.
-
DiRMA: Measuring How Your Organization Manages Chaos
Elevate your disaster recovery strategy with DiRMA—an innovative framework for assessing and enhancing Disaster Recovery Testing (DiRT) maturity across people, processes, and tools. As chaos engineering becomes essential for resilience, DiRMA guides organizations through structured improvement, addressing cultural hurdles and ensuring robust recovery readiness in the face of modern challenges.
-
Checklist for Kubernetes in Production: Best Practices for SREs
This article provides SREs with a checklist for managing Kubernetes in production environments. It identifies common challenges including resource management, workload placement, high availability, health probes, storage, monitoring, and cost optimization. By implementing consistent GitOps automation across these areas, teams can significantly reduce complexity and prevent downtime.
-
2025 Article Contest: Win Your Conference Ticket
The InfoQ Team is excited to invite you to participate in our annual article writing competition. Authors of top-rated articles will win complimentary tickets to prominent software development conferences such as QCon and InfoQ Dev Summit.
-
Being Functionless: How to Develop a Serverless Mindset to Write Less Code!
Dynamic cloud services like AWS Lambda have revolutionized computing, leading to rapid deployment and innovation in serverless technology. However, over-reliance on Functions as a Service (FaaS) can create complex architectures and increase costs. Adopting a functionless mindset and leveraging native service integrations fosters simplicity, enhances sustainability, and optimizes efficiency.
-
Taking Advantage of Cell-Based Architectures to Build Resilient and Fault-Tolerant Systems
Cell-based architectures offer a robust approach to building resilient systems. They achieve this through the core principles of isolation, autonomy, and replication. Each cell manages its resources and makes decisions autonomously. Observability for cell-based architecture requires a tailored approach to address the unique challenges and opportunities presented by this distributed system design.
-
Optimizing Wellhub Autocomplete Service Latency: a Multi-Region Architecture
Every company wants fast, reliable, and low-latency services. Achieving these goals requires significant investment and effort. In this article, I will share how Wellhub invested in a multi-region architecture to achieve a low-latency autocomplete service.