Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Podcasts Caitie McCaffrey on Engineering Effectiveness and Verifying Distributed Systems

Caitie McCaffrey on Engineering Effectiveness and Verifying Distributed Systems


In this week's podcast, QCon chair Wes Reisz and Werner Schuster talk to Caitie McCaffrey. McCaffrey works on distributed systems with the engineering effectiveness team at Twitter, and has experience building the large scale services and systems that power the entertainment industry at 343 Industries, Microsoft Game Studios, and HBO. McCaffrey's presentation at QCon New York was called The Verification of a Distributed System.

Key Takeaways

  • Twitter's engineering effectiveness team aims to help make dev tools better, and make developers happier and more efficient.
  • Asking someone to speak at your conference or join your team solely because of their gender does more harm than people think.
  • There is not one prescriptive way to make good, successful technology.
  • Even when we don't have time for testing, there are options to increase your confidence in your system.
  • The biggest problem when running a unit test is that it is only testing the input you hard code into the unit test.


Engineering Effectiveness

  • 1m:24s - The purpose of the engineering effectiveness team is to help make dev tools better, and to make Twitter's developers happier and more efficient.
  • 2m:44s - The team is trying to make infrastructure so that not every team has to solve the distributed problem on their own, and give developers some APIs and tools so that they can build systems easily.
  • 3m:45s - One of the challenges is that we use Timeouts to detect failures in distributed systems; but with a job that may take an hour you can't have a Timeout, so there are different things you should do. We're also using the saga pattern for jobs with multiple tasks you want to execute, and saga is a way to do distributed transactions and ensure all the pieces of the job run correctly.
  • 6m:01s - The end game is that if you're offline everything will run while you're disconnected, and be able to make your build go much faster when you hook into the VPN or network and have your credentials authenticated, and run verification in the background.

Approaching Diversity

  • 7m:30s - I get asked to speak a lot, and I think they mean well, but if you take a step back and look at it, if I'm being asked to speak because I'm a woman, that's like saying, "You do all of these things, but we really just want you because of your gender."
  • 8m:35s - It has a lot of weird implications that I don't think are meant, but these are all conversations that are happening, and it doesn't feel great. It's one of those well-meaning things that does more harm than people think it does.
  • 11m:05s - A lot of it comes down to a networking problem. People will say, "There aren't XYZ group in tech," but look at the Techies project that Helena Price just did- there are people of all different backgrounds in tech. Price found a lot of them, with very little searching, just in San Francisco.

Verification of Distributed Systems

  • 12m:55s - Testing is not a skill we are great at across the board, for a variety of reasons.
  • 13m:30s - There are methods to increase confidence in your code, and resources that support that doing tests yields larger returns than you might intuitively think.
  • 13m:54s - We have budgets and we have project deadlines, so what time do we have for testing? But here are some options that will increase your confidence that your system is going to do the right thing.
  • 14m:30s - We're all building distributed systems. The things I talk about are unit integration tests and areas to focus on where you can get the most value back, property-based testing that is like the industry version of a model checker, and fault injection testing. In each of these there is a wide range of investment you can make.
  • 16m:30s - Formal verification is very interesting, and Amazon use TLA+ to verify their core infrastructure pieces like S3. Koch, which is less known, in which you can write your specification, generates code in Ocaml, and Diego Ongaro has just released Runway, a formal specification tool.
  • 18m:33s - This area of research is really powerful because it starts allowing us to do things better, without investing too much time in testing, verification and bugs.

Model-Driven Development

  • 20m:40s - In the book that Leslie Lamport wrote about TLA+ he says it's a good idea to design a system before you implement it; you should write a specification before you implement it because it forces you think through all the things.
  • 21m:00s - Writing the specification is step one of formal verification, and you have to run it through a model checker, and by doing that- without even going through the second step of proving its correct- is useful, as a lot of the bugs were found while writing the specification.
  • 21m:30s - There's a lot of value, especially when dealing with a lot of very complicated complex systems and asynchronous messages, and different levels of consistency which can get even harder.

Type Theory

  • 22m:45s - If you want to write code like you're used to writing code, and then do a specification on top- run something that tells you it's correct requires a stronger type theory, or a type system in a language. Languages like Go, JavaScript or Python that don't have types don't live in this new world of running code through a formal method detector, unless someone is doing the work to build a type system.

Property Testing

  • 24m:58s - I think Property Testing is super useful. QuickCheck was written by John Hughes, and the idea is that you define properties about your systems, and you can combine properties to build complex test cases.
  • 25m:38s - The biggest problem when running a unit test is that it is only testing the input you hard code into the unit test.
  • 26m:06s - If you define these things in a property based testing tool, it's like a model checker but it's not going to explore the entire state's base. You can't say it's proven correct, but it explores more of the state's base than a unit test.
  • 27m:05s - You can find some really gnarly bugs because it generates a lot of input, and combinations of input.

Fault Injection

  • 30m:12s - Kyle Kingsbury's Jepsen is a sophisticated class of fault injection and analysis tool.
  • 30m:35s - There are other ways to inject failure; you can test what happens when I take down a service.
  • 31m:09s - I take inspiration from Peter Alvaro's paper on lineage driven fault injection (or his tool Molly).

Books and papers mentioned

Peter Alvaro et al: Lineage-driven Fault Injection
Leslie Lamport: Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers
Caitie McCaffrey: The Verification of a Distributed System

Tools mentioned


More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and the Google Podcast. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.

Previous podcasts

Rate this Article


Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Community comments

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p