Site Reliability Engineering Content on InfoQ

News

Cloud

AWS Introduces Amazon CloudWatch Internet Weather Map

AWS recently announced the availability of the Internet Weather Map, a new feature of CloudWatch that displays a 24-hour global snapshot of internet latency and availability outages. The map offers a worldwide perspective on internet conditions, allowing users to zoom in and analyze performance and availability problems in specific cities or with particular service providers.

Renato Losio
on May 01, 2024
Cloud

Google Delivers Comprehensive Cloud Infrastructure Reliability Guide

Google recently delivered a cloud infrastructure reliability guide combining best practices and expertise from its engineers for its customers.

Steef-Jan Wiggers
on Jan 24, 2023
Cloud

Log Analytics Feature in Cloud Logging Now Generally Available

Google recently made its Cloud Logging Log Analytics feature generally available (GA), allowing users to search, aggregate, and transform all log data types, including application, network, and audit logs.

Steef-Jan Wiggers
on Jan 24, 2023
DevOps

Google Production Excellence Program "ProdEx": Christof Leng at DOES 2022

Christof Leng, SRE lead at Google, presented ProdEx, Google's production excellence review program, which helps manage operational risks and promote best practices. ProdEx is a community that builds platforms together, establishes standards, and promotes best practices so that people learn from each other and grow. Today more than 100 SRE teams have signed up, and more than 1,000 reviews have been performed.

Shaaron A Alvares
on Oct 25, 2022
DevOps

Disney SRE "Proximity Powered Engineering" Culture: Jason Cox at DOES 2022

Jason Cox, SRE director at Disney, shares how he developed a world-class centralized shared services SRE organization based on “proximity-powered empathy engineering” and three core values: Listen (Know the Business, Know the Mission, Know the Team); Empathize (Shared Mission, Shared Struggles, Shared Wins); and Actually Help (Build Community, Build Trust, Build Magic Together).

Shaaron A Alvares
on Oct 24, 2022
DevOps

Platform Engineering, DevOps, and Cognitive Load: a Summary of Community Discussions

Operations engineering is moving in the direction of platform engineering, according to Charity Majors, CTO at Honeycomb. Majors sees platform teams tending to work higher up the stack than operations, DevOps, and SRE teams do. This shift enables organizations to focus their limited development resources on their core product to drive maximum business value.

Matt Campbell
on Oct 09, 2022
DevOps

Dropbox Unplugs Data Center to Test Resilience

Dropbox has published a detailed account of why and how it unplugged an entire data center to test its disaster readiness. The disaster readiness team began building tools to make frequent failovers possible and ran its first formalized failover in 2019. Eventually, with new tooling and procedures in place, the data center was unplugged. The work resulted in a significantly reduced recovery time objective (RTO).

Matt Saunders
on Jun 30, 2022
DevOps

Building an SLO-Driven Culture at Salesforce

Salesforce built a platform to monitor Service Level Objectives (SLOs). The platform gave service owners deep and actionable insights into how to improve or maintain the health of their services, helped them find dips in Service Level Indicators (SLIs) and dependent services that weren’t meeting their own SLOs, and provided a better overall understanding of customers’ experience with their services.

Matt Saunders
on Apr 30, 2022
DevOps

Site Reliability Engineers and the Specialist Mindset

A site reliability engineer (SRE) can be a generalist or a specialist. Recently, the team at Blameless elaborated on the advantages of a specialized SRE team. The specialist nature of the SRE role can be emphasized starting from the recruitment process. Depending on an individual's skill set, organizations can engage an SRE in a number of specialist roles.

Aditya Kulkarni
on Mar 15, 2022
Culture & Methods

How to Work Asynchronously as a Remote-First SRE

The core practices for remote work at Netlify are prioritising asynchronous communication, being intentional about remote community building, and encouraging colleagues to protect their work-life balance. Sustainable remote work starts with sustainable working hours, which includes making yourself “almost” unreachable, with clear boundaries and protocols for out-of-hours contact.

Ben Linders
on Dec 02, 2021
Culture & Methods

How External IT Providers Can Adopt DevOps Practices

IT suppliers can follow the “you build it, you run it” mantra by working in small batches, using an experimental approach to product development, and validating small product increments in production. The supplier has to find out what the client’s goal is and adopt it as its own goal in order to work collaboratively.

Ben Linders
on Aug 19, 2021
DevOps

InfoQ Live March 16: Explore Ways of Reducing Uncertainty in Software Delivery

InfoQ Live, the one-day virtual event for software engineers and architects, returns on March 16th with a new edition, this time focusing on ways to reduce the uncertainty of your software development cycle.

Adelina Turcu
on Mar 04, 2021
Development

Observability Strategies for Distributed Systems - Lessons Learned at InfoQ Live

A good observability strategy makes it easy for teams to share their data, and uses data from across a distributed system to determine whether business goals are being achieved. These were some of the ideas discussed during the InfoQ Live roundtable discussion on observability patterns for distributed systems, held on August 25.

Thomas Betts
on Sep 03, 2020
DevOps

Google Meet’s Scaling Challenges during COVID-19

Google wrote about the challenges of scaling Google Meet as usage surged during the COVID-19 pandemic. Google's SRE team used its existing incident management framework, with modifications, to handle the increase in traffic that began earlier this year.

Hrishikesh Barua
on Aug 16, 2020
Cloud

New Report Shows "Overwhelming" Cloud Usage

The new Cloud Adoption in 2020 report from O'Reilly Media paints a picture of "overwhelming" usage of cloud computing. The survey results also revealed growing adoption of Site Reliability Engineering, high but flattening usage of microservices, and limited interest in serverless computing.

Richard Seroter
on Jun 15, 2020
