Avleen Vig presents some of the most unexpected, confusing, hilarious and face-palming events during Etsy's outages to show what can be learnt from their problems to build more resilient systems.
Andy Piper describes some fundamentals of communicating reliably in an unreliable world and communication techniques used to build distributed data structures that can tolerate failures.
Viktor Klang keynotes on the imminence of failure and the need to prepare for it, along with several ways of managing failure when it happens.
Evolving Culture and Values. Understanding the Tradeoffs. Growth through Failure. The Importance of Leadership and Open Communication.
Pedram Keyani discusses the importance of evolving the culture and values of an organization, dealing with tradeoffs, learning from failure, proper leadership and open communication.
Jason Little discusses how to avoid an organizational change failure when introducing Agile by leveraging principles of Lean Startup and Customer Development.
Jeremy Edberg discusses how Netflix designs their systems and deployment processes to help the service survive both catastrophic events like zone and regional outages and less catastrophic events like network latency and random instance death.
Ariel Tseitlin discusses Netflix's suite of tools, collectively called the Simian Army, used to improve resiliency and maintain the cloud environment. The tools simulate failure in order to see how the system reacts to it.
Mike Andrews discusses architecting for failure even when you don’t know what might fail.
Poul-Henning Kamp argues that unless developers get better, we are going to repeat many of the major IT project failures. He illustrates the point with major project failures in Denmark.
Michael Brunton-Spall talks about various types of system failure that can happen, sharing the lessons learned at the Guardian and measures taken to prevent and mitigate failure.
John Allspaw discusses pitfalls to avoid while troubleshooting failed systems, comparing web operations at scale with practices in the aviation and nuclear power industries.
Blake Mizerany presents various ways distributed systems can fail and how to recover using Doozer, a highly available, consistent data store.