Site Reliability Engineering Content on InfoQ
-
Azure Advisor Well-Architected Assessment in Public Preview to Optimize Cloud Infrastructure
Microsoft Azure recently announced the public preview of the Advisor Well-Architected assessment. This self-guided questionnaire aims to provide tailored, actionable recommendations to optimize Azure resources while aligning with the Azure Well-Architected Framework (WAF) principles.
-
Advancing System Reliability: Meta's AI-Driven Approach to Root Cause Analysis
Meta recently shared how they are enhancing their system reliability through advanced investigation tools, including the AI-assisted Hawkeye, which aids in debugging machine learning workflows. By integrating Artificial Intelligence, Meta has developed a new investigation system that combines heuristic-based retrieval with large language model (LLM) ranking to assist in root cause analysis.
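The two-stage shape described above (a cheap heuristic retrieval step followed by a ranking step) can be sketched as follows. This is a minimal illustration, not Meta's implementation: the `Change` type, the directory-prefix heuristic, and the keyword-overlap scorer are all hypothetical stand-ins, and the scorer in particular only occupies the slot where an LLM-based ranker would be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence, Tuple


@dataclass(frozen=True)
class Change:
    """Hypothetical record of a code change considered during an investigation."""
    change_id: str
    touched_paths: Tuple[str, ...]
    summary: str


def retrieve_candidates(changes: Iterable[Change],
                        affected_dirs: Sequence[str]) -> List[Change]:
    """Stage 1 (assumed heuristic): keep changes touching an affected directory."""
    prefixes = tuple(affected_dirs)
    return [
        c for c in changes
        if any(path.startswith(prefixes) for path in c.touched_paths)
    ]


def rank_candidates(candidates: Sequence[Change],
                    score: Callable[[Change], float],
                    top_k: int = 3) -> List[Change]:
    """Stage 2: order candidates by a relevance score and keep the top few.

    `score` is pluggable; in a real system it could wrap a model call.
    """
    return sorted(candidates, key=score, reverse=True)[:top_k]


def keyword_overlap_scorer(incident_text: str) -> Callable[[Change], float]:
    """Stand-in scorer: counts words shared by the incident text and a summary."""
    incident_words = set(incident_text.lower().split())

    def score(change: Change) -> float:
        return float(len(incident_words & set(change.summary.lower().split())))

    return score


if __name__ == "__main__":
    changes = [
        Change("c1", ("ranking/model.py",), "update ranking model timeout"),
        Change("c2", ("ads/ui.py",), "tweak button colour"),
        Change("c3", ("ranking/cache.py",), "refactor cache keys"),
    ]
    shortlist = retrieve_candidates(changes, ["ranking/"])
    scorer = keyword_overlap_scorer("ranking timeout errors spike")
    for change in rank_candidates(shortlist, scorer, top_k=2):
        print(change.change_id, change.summary)
```

Keeping retrieval and ranking separate means the expensive ranking step only sees a short candidate list, which is the usual motivation for this kind of pipeline.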
-
AWS Introduces Amazon CloudWatch Internet Weather Map
AWS recently announced the availability of the Internet Weather Map, a new feature of CloudWatch that displays a 24-hour global snapshot of internet latency and availability outages. This new map offers a worldwide perspective on Internet conditions, allowing users to zoom in and analyze performance and availability problems in specific cities or with particular service providers.
-
Google Delivers Comprehensive Cloud Infrastructure Reliability Guide
Google recently published a cloud infrastructure reliability guide that gives its customers best practices and expertise from its engineers.
-
Log Analytics Feature in Cloud Logging Now Generally Available
Google recently made its Cloud Logging Log Analytics feature generally available (GA), allowing users to search, aggregate, and transform all log data types, including application, network, and audit logs.
-
Google Production Excellence Program "ProdEx": Christof Leng at DOES 2022
Christof Leng, SRE lead at Google, presented ProdEx, their production excellence review program that helps manage operational risks and promote best practices. ProdEx is a community that builds platforms together, establishes standards and promotes best practices, so people learn from each other and grow. Today they have more than 100 SRE teams signed up and have performed more than 1000 reviews.
-
Disney SRE "Proximity Powered Engineering" Culture: Jason Cox at DOES 2022
Jason Cox, SRE director at Disney, shares how he developed a world-class centralized shared services SRE organization based on “proximity-powered empathy engineering” and three core values: Listen (Know the Business, Know the Mission, Know the Team); Empathize (Shared Mission, Shared Struggles, Shared Wins); and Actually Help (Build Community, Build Trust, Build Magic Together).
-
Platform Engineering, DevOps, and Cognitive Load: a Summary of Community Discussions
Operations engineering is moving in the direction of platform engineering, according to Charity Majors, CTO at Honeycomb. Majors sees platform teams tending to work higher up the stack than operations, DevOps, and SRE teams do. This shift lets organizations concentrate their limited development resources on their core product to drive maximum business value.
-
Dropbox Unplugs Data Center to Test Resilience
Dropbox has published a detailed account of why and how they unplugged an entire data center to test their disaster readiness. The disaster readiness team began by building tools that made frequent failovers possible, and ran their first formalized failover in 2019. Eventually, with new tooling and procedures in place, the data center was unplugged. The work significantly reduced their recovery time objective (RTO).
-
Building an SLO-Driven Culture at Salesforce
Salesforce built a platform to monitor Service Level Objectives (SLOs). The platform provided service owners with deep and actionable insights into how to improve or maintain the health of their services, to find dips in SLIs, to find dependent services that weren’t meeting their own SLOs, and overall provide a better understanding of customers’ experience with their services.
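As a rough illustration of the SLI and SLO arithmetic behind this kind of monitoring, the sketch below computes an availability SLI, checks it against an SLO target, and reports how much error budget is left. It is not based on the Salesforce platform; the function names, the ratio-of-good-events SLI, and the sample numbers are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SloReport:
    sli: float                     # fraction of good events in the window
    error_budget_remaining: float  # 1.0 = untouched, 0.0 = spent, < 0 = overspent
    meets_slo: bool


def evaluate_slo(good_events: int, total_events: int, slo_target: float) -> SloReport:
    """Evaluate a ratio-based SLI against an SLO target such as 0.999."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good_events / total_events
    allowed_bad = (1.0 - slo_target) * total_events  # the error budget, in events
    actual_bad = total_events - good_events
    remaining = 1.0 - actual_bad / allowed_bad
    return SloReport(sli=sli, error_budget_remaining=remaining,
                     meets_slo=sli >= slo_target)


def services_missing_slo(reports: Dict[str, SloReport]) -> List[str]:
    """Return the names of services whose SLI is below target, sorted by name."""
    return sorted(name for name, report in reports.items() if not report.meets_slo)


if __name__ == "__main__":
    # Hypothetical services and event counts, for illustration only.
    reports = {
        "checkout": evaluate_slo(9_995, 10_000, 0.999),
        "search": evaluate_slo(9_980, 10_000, 0.999),
    }
    for name, report in reports.items():
        print(name, f"sli={report.sli:.4f}",
              f"budget_left={report.error_budget_remaining:.2f}")
    print("missing SLO:", services_missing_slo(reports))
```

A negative remaining budget signals that a service has already exceeded its allowed failures for the window, which is the kind of dip a service owner, or the owner of a dependent service, would want surfaced.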
-
Site Reliability Engineers and the Specialist Mindset
A site reliability engineer (SRE) can be a generalist or a specialist. Recently, the team at Blameless elaborated on the advantages of a specialized SRE team. The specialist nature of the SRE role can be emphasized as early as the recruitment process. Depending on the individual's skillset, organizations can engage an SRE in a number of specialist roles.
-
How to Work Asynchronously as a Remote-First SRE
The core practices for remote work at Netlify are prioritising asynchronous communication, being intentional about remote community building, and encouraging colleagues to protect their work-life balance. Sustainable remote work starts with sustainable working hours, which includes making yourself “almost” unreachable, with clear boundaries and protocols for out-of-hours contact.
-
How External IT Providers Can Adopt DevOps Practices
IT suppliers can follow the “you build it, you run it” mantra by working in small batches, using an experimental approach to product development, and validating small product increments in production. To work collaboratively, the supplier has to find out what the client’s goal is and adopt it as its own.
-
InfoQ Live March 16: Explore Ways of Reducing Uncertainty in Software Delivery
InfoQ Live, the one-day virtual event for software engineers and architects, returns on March 16th with a new edition, this time focusing on ways to reduce the uncertainty of your software development cycle.
-
Observability Strategies for Distributed Systems - Lessons Learned at InfoQ Live
A good observability strategy makes it easy for teams to share their data, and uses data from across a distributed system to identify if business goals are being achieved. These were some of the ideas discussed during the InfoQ Live roundtable discussion on observability patterns for distributed systems, held on August 25.