Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Site Reliability Engineering Content on InfoQ

  • AWS Introduces Amazon CloudWatch Internet Weather Map

    AWS recently announced the availability of the Internet Weather Map, a new feature of CloudWatch that displays a 24-hour global snapshot of internet latency and availability outages. The map offers a worldwide view of internet conditions, allowing users to zoom in and analyze performance and availability problems in specific cities or with particular service providers.

  • Google Delivers Comprehensive Cloud Infrastructure Reliability Guide

    Google recently released a cloud infrastructure reliability guide for its customers that combines best practices and expertise from its engineers.

  • Log Analytics Feature in Cloud Logging Now Generally Available

    Google recently made its Cloud Logging Log Analytics feature generally available (GA), allowing users to search, aggregate, and transform all log data types, including application, network, and audit logs.

  • Google Production Excellence Program "ProdEx": Christof Leng at DOES 2022

    Christof Leng, SRE lead at Google, presented ProdEx, Google's production excellence review program, which helps manage operational risks and promote best practices. ProdEx is also a community that builds platforms together, establishes standards, and promotes best practices, so that people learn from each other and grow. Today, more than 100 SRE teams have signed up and more than 1000 reviews have been performed.

  • Disney SRE "Proximity Powered Engineering" Culture: Jason Cox at DOES 2022

    Jason Cox, SRE director at Disney, shares how he developed a world-class centralized shared services SRE organization based on “proximity-powered empathy engineering” and three core values: Listen (Know the Business, Know the Mission, Know the Team); Empathize (Shared Mission, Shared Struggles, Shared Wins); and Actually Help (Build Community, Build Trust, Build Magic Together).

  • Platform Engineering, DevOps, and Cognitive Load: a Summary of Community Discussions

    Operations engineering is moving in the direction of platform engineering, according to Charity Majors, CTO at Honeycomb. Majors sees platform teams tending to work higher up the stack than operations, DevOps, and SRE teams do. This shift lets organizations concentrate their limited development resources on their core product to drive maximum business value.

  • Dropbox Unplugs Data Center to Test Resilience

    Dropbox has published a detailed account of why and how they unplugged an entire data center to test their disaster readiness. The disaster readiness team began building tools to make frequent failovers possible, and ran their first formalized failover in 2019. Eventually, with new tooling and procedures in place, the team unplugged the data center. The effort resulted in a significantly reduced recovery time objective (RTO).

  • Building an SLO-Driven Culture at Salesforce

    Salesforce built a platform to monitor Service Level Objectives (SLOs). The platform gave service owners deep and actionable insights into how to improve or maintain the health of their services, helped them find dips in service level indicators (SLIs) and dependent services that weren’t meeting their own SLOs, and provided a better overall understanding of customers’ experience with their services.

  • Site Reliability Engineers and the Specialist Mindset

    A site reliability engineer (SRE) can be a generalist or a specialist. Recently, the team at Blameless elaborated on the advantages of a specialized SRE team. The specialist nature of the SRE role can be emphasized from the recruitment process onward. Depending on an individual's skill set, organizations can engage an SRE in a number of specialist roles.

  • How to Work Asynchronously as a Remote-First SRE

    The core practices for remote work at Netlify are prioritising asynchronous communication, being intentional about remote community building, and encouraging colleagues to protect their work-life balance. Sustainable remote work starts with sustainable working hours, which includes making yourself “almost” unreachable, with clear boundaries and protocols for out-of-hours contact.

  • How External IT Providers Can Adopt DevOps Practices

    IT suppliers can follow the “you build it, you run it” mantra by working in small batches, using an experimental approach to product development, and validating small product increments in production. To work collaboratively, the supplier has to find out what the client’s goal is and adopt it as its own.

  • InfoQ Live March 16: Explore Ways of Reducing Uncertainty in Software Delivery

    InfoQ Live, the one-day virtual event for software engineers and architects, returns on March 16th with a new edition, this time focusing on ways to reduce the uncertainty of your software development cycle.

  • Observability Strategies for Distributed Systems - Lessons Learned at InfoQ Live

    A good observability strategy makes it easy for teams to share their data, and uses data from across a distributed system to identify if business goals are being achieved. These were some of the ideas discussed during the InfoQ Live roundtable discussion on observability patterns for distributed systems, held on August 25.

  • Google Meet’s Scaling Challenges during COVID-19

    Google described the challenges of scaling Google Meet as usage surged during the COVID-19 pandemic. The SRE team at Google used their existing incident management framework, with modifications, to handle the increased traffic that started earlier this year.

  • New Report Shows "Overwhelming" Cloud Usage

    The new Cloud Adoption in 2020 report from O'Reilly Media paints a picture of "overwhelming" usage of cloud computing. The survey results also revealed growing adoption of Site Reliability Engineering, high but flattening usage of microservices, and limited interest in serverless computing.