48:38

Building Resilience: How Outages Shaped Etsy's Systems

Posted by Avleen Vig  on  Jun 17, 2014

Avleen Vig presents some of the most unexpected, confusing, hilarious and face-palming events during Etsy's outages to show what can be learnt from their problems to build more resilient systems.

49:23

Principles of Reliable Communication & Shared State

Posted by Andy Piper  on  May 20, 2014

Andy Piper describes some fundamentals of communicating reliably in an unreliable world and communication techniques used to build distributed data structures that can tolerate failures.

51:21

Failure: The Good Parts

Posted by Viktor Klang  on  May 01, 2014

Viktor Klang keynotes on the imminence of failure and the need to prepare for it, along with several ways of managing failure when it happens.

50:29

Evolving Culture and Values. Understanding the Tradeoffs. Growth through Failure. The Importance of Leadership and Open Communication.

Posted by Pedram Keyani  on  Mar 11, 2014

Pedram Keyani discusses the importance of evolving the culture and values of an organization, dealing with tradeoffs, learning from failure, proper leadership and open communication.

30:21

Running an Agile Transformation using Lean Startup

Posted by Jason Little  on  Feb 01, 2014

Jason Little discusses how to avoid an organizational change failure when introducing Agile by leveraging principles of Lean Startup and Customer Development.

49:18

How Netflix Architects for Survival

Posted by Jeremy Edberg  on  Nov 29, 2013

Jeremy Edberg discusses how Netflix designs their systems and deployment processes to help the service survive both catastrophic events like zone and regional outages and less catastrophic events like network latency and random instance death.

48:52

Resiliency through Failure - Netflix's Approach to Extreme Availability in the Cloud

Posted by Ariel Tseitlin  on  Sep 22, 2013

Ariel Tseitlin discusses Netflix's suite of tools, collectively called the Simian Army, used to improve resiliency and maintain the cloud environment. The tools simulate failure in order to see how the system reacts to it.

Keynote: System, Heal Thyself

Posted by Mike Andrews  on  Oct 03, 2012

Mike Andrews discusses architecting for failure even when you don’t know what might fail.

Entirely Predictable Failures

Posted by Poul-Henning Kamp  on  Sep 26, 2012

Poul-Henning Kamp argues that if developers do not get better, we are going to repeat many of the major IT project failures. He illustrates the point with major project failures in Denmark.

Architecting for Failure at the Guardian.co.uk

Posted by Michael Brunton-Spall  on  Apr 25, 2012

Michael Brunton-Spall talks about various types of system failure that can happen, sharing the lessons learned at the Guardian and measures taken to prevent and mitigate failure.

Resilient Response In Complex Systems

Posted by John Allspaw  on  Apr 19, 2012

John Allspaw discusses pitfalls to be avoided while troubleshooting failed systems, comparing web operations at scale with practices in aviation and nuclear power industries.

On Distributed Failures (and handling them with Doozer)

Posted by Blake Mizerany  on  Dec 27, 2011

Blake Mizerany presents various ways in which distributed systems can fail and how to recover using Doozer, a highly available, consistent data store.

InfoQ.com and all content copyright © 2006-2014 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.