Operations management Content on InfoQ
-
Netflix Scales "Human Infrastructure" to Manage Global Live Operations
Netflix has introduced a "human infrastructure" layer to manage live broadcasts at scale. Using a low-latency "telemetry hot path" and a Live Operations Centre, the company now balances automated scaling with human oversight. This shift, which mirrors strategies at AWS and Disney+, focuses on maintaining reliability through expert intervention during high-concurrency global events.
-
HashiCorp Vault 2.0 Marks Shift to IBM Lifecycle with New Identity Federation
HashiCorp has released Vault 2.0, moving to the IBM versioning and support model following its acquisition. The update introduces Workload Identity Federation for secret syncing without static credentials, SCIM 2.0 provisioning, and performance gains in the storage engine. It also prioritises identity-based security and certificate automation while removing legacy architectural components.
-
AWS Unveils Independent European Governance and Operations for European Sovereign Cloud
AWS unveils its European Sovereign Cloud, launching in Brandenburg, Germany, by 2025, with strict EU governance and a focus on digital sovereignty. This initiative features an EU-controlled parent company, dedicated Security Operations Center, and customer data residing exclusively in the EU, ensuring compliance and operational autonomy while leveraging AWS's innovative cloud services.
-
AWS Launches Centralized Product Lifecycle Page: Transparency and Consolidating Deprecation Info
AWS has launched its Product Lifecycle page, a centralized hub for tracking service availability changes, deprecations, and end-of-support timelines. This initiative streamlines communication, enhances customer confidence, and brings AWS in line with other hyperscalers such as Microsoft Azure and Google Cloud. The page offers clear rationales and transition plans, ensuring a smooth process for customers.
-
Datadog Employs LLMs for Assisting with Writing Accident Postmortems
Datadog combined structured metadata from its incident management app with Slack messages to create an LLM-driven feature that assists engineers in composing incident postmortems. While working on this solution, the company dealt with the challenges of using LLMs outside of interactive dialog systems and of ensuring that high-quality content was produced.
-
Atlassian Announces Opsgenie Consolidation into JIRA Service Management
Atlassian recently announced that it is consolidating its IT Operations offering and transitioning Opsgenie’s capabilities into JIRA Service Management and Compass.
-
Data Teams Survey: Lag in DataOps and Value Delivered
We report on Jesse Anderson's 2024 Data Teams Survey, which showed a lag in DataOps capabilities, slow LLM adoption, and a concerning decline in perceived value creation by data teams. It called out the importance of teams that span data science, engineering, and operations capabilities. We also cover Petr Janda's recent podcast on the need for more engineering rigour for parity with other teams.
-
The Impact of Cloudflare's Sudden Service Change at an Online Casino
Recently, an online casino website experienced a severe disruption when Cloudflare abruptly disabled its services. Robin Dev, a systems operations engineer at the casino, provided a detailed account of the sequence of events in a blog post, shedding light on the extent of the disruption and its aftermath.
-
Public Preview of Azure Compute Fleet: Streamlining Azure Compute Capacity Management
At the annual Build conference, Microsoft announced the public preview of Azure Compute Fleet, a new service that streamlines the provisioning and management of Azure compute capacity across different virtual machine (VM) types, availability zones, and pricing models to achieve desired scale, performance, and cost.
-
Google Cloud Launches Security Command Center Enterprise
Google Cloud has launched Security Command Center (SCC) Enterprise, a cloud risk management solution that offers proactive cloud security with enterprise security operations. The solution helps customers manage and mitigate risk across multi-cloud environments and is enhanced by Mandiant expertise.
-
Arc-Enabled Servers Run Command Public Preview Feature: Remote Management for Various Environments
Microsoft has recently announced Run Command, a significant public preview feature for Azure Arc-enabled servers. This feature allows customers to manage Azure Arc-enabled servers remotely and securely.
-
Intuitive Application Resource Management with myApplications in the AWS Management Console
AWS recently announced at its re:Invent conference the general availability of myApplications. myApplications in the AWS Management Console can help customers manage and monitor the cost, health, security posture, and performance of their applications on AWS more effectively.
-
OpenTelemetry Logging Marked Stable: Morgan McLean at KubeCon NA
Logging is a core capability of applications today. OpenTelemetry (OTel) has stabilized logging as another available signal within the project. OTel Logging offers improvements to traditional logging.
-
AWS Introduces Amazon Route 53 Resolver on AWS Outposts Rack
AWS recently announced that Amazon Route 53 Resolver is now available on AWS Outposts rack, providing on-premises services and applications with local Domain Name System (DNS) resolution directly from Outposts. In addition, local Route 53 Resolver endpoints enable DNS resolution between Outposts and on-premises DNS servers.
-
Platform Engineering, DevOps, and Cognitive Load: a Summary of Community Discussions
Operations engineering is moving toward platform engineering, according to Charity Majors, CTO at Honeycomb. Majors sees platform teams tending to work higher up the stack than operations, DevOps, and SRE teams do. This shift enables organizations to focus their limited development resources on their core product to drive maximum business value.