Caitie McCaffrey on Engineering Effectiveness and Verifying Distributed Systems

Notes

Engineering Effectiveness

1m:24s - The purpose of the engineering effectiveness team is to help make dev tools better, and to make Twitter's developers happier and more efficient.
2m:44s - The team is trying to make infrastructure so that not every team has to solve the distributed problem on their own, and give developers some APIs and tools so that they can build systems easily.
3m:45s - One of the challenges is that we use Timeouts to detect failures in distributed systems; but with a job that may take an hour you can't have a Timeout, so there are different things you should do. We're also using the saga pattern for jobs with multiple tasks you want to execute, and saga is a way to do distributed transactions and ensure all the pieces of the job run correctly.
6m:01s - The end game is that if you're offline everything will run while you're disconnected, and be able to make your build go much faster when you hook into the VPN or network and have your credentials authenticated, and run verification in the background.

Approaching Diversity

7m:30s - I get asked to speak a lot, and I think they mean well, but if you take a step back and look at it, if I'm being asked to speak because I'm a woman, that's like saying, "You do all of these things, but we really just want you because of your gender."
8m:35s - It has a lot of weird implications that I don't think are meant, but these are all conversations that are happening, and it doesn't feel great. It's one of those well-meaning things that does more harm than people think it does.
11m:05s - A lot of it comes down to a networking problem. People will say, "There aren't XYZ group in tech," but look at the Techies project that Helena Price just did- there are people of all different backgrounds in tech. Price found a lot of them, with very little searching, just in San Francisco.

Verification of Distributed Systems

12m:55s - Testing is not a skill we are great at across the board, for a variety of reasons.
13m:30s - There are methods to increase confidence in your code, and resources that support that doing tests yields larger returns than you might intuitively think.
13m:54s - We have budgets and we have project deadlines, so what time do we have for testing? But here are some options that will increase your confidence that your system is going to do the right thing.
14m:30s - We're all building distributed systems. The things I talk about are unit integration tests and areas to focus on where you can get the most value back, property-based testing that is like the industry version of a model checker, and fault injection testing. In each of these there is a wide range of investment you can make.
16m:30s - Formal verification is very interesting, and Amazon use TLA+ to verify their core infrastructure pieces like S3. Koch, which is less known, in which you can write your specification, generates code in Ocaml, and Diego Ongaro has just released Runway, a formal specification tool.
18m:33s - This area of research is really powerful because it starts allowing us to do things better, without investing too much time in testing, verification and bugs.

Model-Driven Development

20m:40s - In the book that Leslie Lamport wrote about TLA+ he says it's a good idea to design a system before you implement it; you should write a specification before you implement it because it forces you think through all the things.
21m:00s - Writing the specification is step one of formal verification, and you have to run it through a model checker, and by doing that- without even going through the second step of proving its correct- is useful, as a lot of the bugs were found while writing the specification.
21m:30s - There's a lot of value, especially when dealing with a lot of very complicated complex systems and asynchronous messages, and different levels of consistency which can get even harder.

Type Theory

22m:45s - If you want to write code like you're used to writing code, and then do a specification on top- run something that tells you it's correct requires a stronger type theory, or a type system in a language. Languages like Go, JavaScript or Python that don't have types don't live in this new world of running code through a formal method detector, unless someone is doing the work to build a type system.

Property Testing

24m:58s - I think Property Testing is super useful. QuickCheck was written by John Hughes, and the idea is that you define properties about your systems, and you can combine properties to build complex test cases.
25m:38s - The biggest problem when running a unit test is that it is only testing the input you hard code into the unit test.
26m:06s - If you define these things in a property based testing tool, it's like a model checker but it's not going to explore the entire state's base. You can't say it's proven correct, but it explores more of the state's base than a unit test.
27m:05s - You can find some really gnarly bugs because it generates a lot of input, and combinations of input.

Fault Injection

30m:12s - Kyle Kingsbury's Jepsen is a sophisticated class of fault injection and analysis tool.
30m:35s - There are other ways to inject failure; you can test what happens when I take down a service.
31m:09s - I take inspiration from Peter Alvaro's paper on lineage driven fault injection (or his tool Molly).

Books and papers mentioned

Tools mentioned

Jepsen
Molly
QuickCheck