Facilitating the Spread of Knowledge and Innovation in Professional Software Development



Site Reliability Engineering Content on InfoQ

  • Building an SLO-Driven Culture at Salesforce

    Salesforce built a platform to monitor Service Level Objectives (SLOs). The platform gave service owners deep, actionable insights into how to improve or maintain the health of their services, helped them find dips in SLIs and dependent services that weren’t meeting their own SLOs, and provided a better overall understanding of customers’ experience with their services.

  • Site Reliability Engineers and the Specialist Mindset

    A site reliability engineer (SRE) can be a generalist or a specialist. Recently, the team at Blameless elaborated on the advantages of a specialized SRE team. The specialist nature of the SRE role is apparent as early as the recruitment process. Depending on an individual's skill set, organizations can engage an SRE in a number of specialist roles.

  • How to Work Asynchronously as a Remote-First SRE

    The core practices for remote work at Netlify are prioritising asynchronous communication, being intentional about remote community building, and encouraging colleagues to protect their work-life balance. Sustainable remote work starts with sustainable working hours, which includes making yourself “almost” unreachable, with clear boundaries and protocols for out-of-hours contact.

  • How External IT Providers Can Adopt DevOps Practices

    IT suppliers can follow the “you build it, you run it” mantra by working in small batches, using an experimental approach to product development, and validating small product increments in production. To work collaboratively, the supplier has to find out what the client’s goal is and adopt that goal as its own.

  • InfoQ Live March 16: Explore Ways of Reducing Uncertainty in Software Delivery

    InfoQ Live, the one-day virtual event for software engineers and architects, returns on March 16th with a new edition, this time focusing on ways to reduce the uncertainty of your software development cycle.

  • Observability Strategies for Distributed Systems - Lessons Learned at InfoQ Live

    A good observability strategy makes it easy for teams to share their data, and uses data from across a distributed system to identify whether business goals are being achieved. These were some of the ideas discussed during the InfoQ Live roundtable on observability patterns for distributed systems, held on August 25.

  • Google Meet’s Scaling Challenges during COVID-19

    Google wrote about the challenges of scaling Google Meet as usage increased during the COVID-19 pandemic. The SRE team at Google used their existing incident management framework, with modifications, to handle the increased traffic that started earlier this year.

  • New Report Shows "Overwhelming" Cloud Usage

    The new Cloud Adoption in 2020 report from O'Reilly Media paints a picture of "overwhelming" usage of cloud computing. The survey results also revealed growing adoption of Site Reliability Engineering, high but flattening usage of microservices, and limited interest in serverless computing.

  • Exploring Costs of Coordination During Outages with Laura Maguire at QCon London

    Laura Maguire talked at QCon London about how coordination efforts during outages impose a high cognitive cost. Maguire found that coordination during anomaly response is difficult, that existing models can undermine speedy resolution, and that the strategies used to control the cost of coordination adapt to the type of incident. Moreover, tooling adds its own costs of coordination.

  • How Twitter Improves Resource Usage with a Deterministic Load Balancing Algorithm

    Twitter recently shared the details of why their RPC framework Finagle implements client-side load balancing using a deterministic aperture algorithm for their microservices architecture. Twitter ran several experiments and confirmed that with a deterministic approach, requests are better distributed, connection counts drop drastically, and less infrastructure is needed.

  • Scaling Infrastructure as Code at Challenger Bank N26

    To launch their banking platform globally in the US, Brazil, and beyond, the challenger bank N26 introduced a new layer for the configuration of regions in their architecture, where product development teams can add application needs. At FlowCon France, Kat Liu presented why and how they introduced this layer, the benefits that it brings, and the things they learned.

  • The Importance of Fun in the Workplace

    Things at work that make us smile or laugh can improve team cohesion, productivity and organisational performance. Fun can’t be forced, but it can be fostered, said Holly Cummins at FlowCon France 2019, where she spoke about the importance of fun in the workplace.

  • How Did Things Go Right? Learning More from Incidents at Netflix: Ryan Kitchens at QCon New York

    At QCon New York, Ryan Kitchens presented “How Did Things Go Right? Learning More from Incidents”. Key takeaways from the talk included: recovery is better than prevention; an incident occurs when there is a “perfect storm” of events -- there is no root cause; “stop reporting on the nines”, as user happiness is more important; and there is value in learning how things go right.

  • The Evolution of Full Cycle Developers at Netflix: Greg Burrell at QCon SF

    At QCon San Francisco, Greg Burrell talked about the journey towards “full cycle developers” within the Netflix edge engineering team. Following the principle of “operate what you build”, developers within this team chose to take on more operational responsibility for their services, and were supported by comprehensive tooling, training and management support.

  • GitHub Incident Analysis Shows How to Improve Service Reliability

    On October 21, 2018, GitHub users experienced degraded service for 24 hours due to an incident caused by routine maintenance work. This led to the display of outdated and inconsistent information and to the unavailability of webhooks and other internal services. GitHub's post-incident report shows where things failed and suggests how to improve site reliability.