John Wilkes shares lessons learned managing clusters at the scale of Google.
Robert Benefield offers a pragmatic overview for discovering operational indicators that provide valuable insight into running and improving online services.
Pedro Canahuati describes how Facebook's operations team maintains its infrastructure, including challenges faced and lessons learned: prioritizing calls, managing technical debt, and incident management.
Ben Christensen describes the Netflix API's evolution into a web service platform serving all devices and users, and the challenges met in operations, deployment, performance, fault tolerance, and innovation.
Joe Sondow presents how Netflix uses Asgard to deploy code updates and manage resources in the Amazon cloud.
Roy Rapoport discusses how Netflix uses metrics to monitor and manage their operating environment along with some notes about their event management system.
Filippos Santas explains how to apply service-orientation principles, patterns, processes and SOA governance precepts to ITIL's service lifecycle stages, key processes and activities.
Phil Toland discusses using Erlang and Ruby to provide backup for 20k network devices running in 8 datacenters across 3 continents for Rackspace’s operations.
Ram C Singh discusses using Big Data for infrastructure telemetry, along with good practices and an autonomic engine, to create an autonomic computing infrastructure that might prevent downtime.
Jez Humble discusses how to deal with risk management, regulatory compliance, ITIL, and audit requirements in a large organization that intends to adopt devops.
Julian Simpson thinks dev and ops should be one team, achieved through collaboration, respecting everyone, having lunch together, co-location, discussing problems, joint retrospectives, etc.