Evan Broder talks about how Stripe has designed its systems to speed up the development process and how the software infrastructure in its API enables the next generation of tech companies to build faster and less painfully. He then examines how Stripe addresses PCI and compliance concerns in a way that allows its engineering teams to develop new features more quickly.
Evan Broder has worked on systems and infrastructure at Stripe for four years, helping them stay online through several orders of magnitude of growth. Previously, he worked on virtualization management and the Linux desktop at MokaFive and helped build XVM at MIT, one of the earliest cloud computing environments.
Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.
- Projects that plan their work and proceed incrementally through small changes, especially when migrating existing or legacy features or code, are often the most successful.
- Once you have written code using the Stripe API you should never have to change it.
- One of the main goals when going through the PCI compliance process is to make as few things as possible in scope for PCI.
- Moving from one AWS data centre to another was probably the single most complex infrastructure project Stripe has ever undertaken, but it went largely without a hitch, mainly because of an incremental approach.
Making changes at Stripe
- 00:25 - Stripe is a software platform for online payments available in 25 countries.
- 01:55 - The ability to move faster than the people you are competing with is often one of the greatest advantages you have.
- 02:18 - When moving money, things shouldn’t break, which means that reliability and stability are much more important for Stripe than for a typical start-up.
- 03:20 - Projects that plan their work and proceed incrementally through small changes, especially when migrating existing or legacy features or code, are often the most successful.
- 04:45 - Three stories in which Stripe says it managed to take large tasks, break them into smaller pieces and work incrementally:
- Evolving the Stripe API over time without affecting users.
- Rewrite of the PCI infrastructure.
- Migrating data centres in AWS.
Evolving the Stripe API over time without affecting users
- 06:22 - Changes to the API are commonly made because of new products or features.
- 06:34 - Some changes to the API are due to past mistakes, e.g. features that did not scale.
- 07:06 - Stripe documents what they consider to be backward-compatible changes, i.e. changes that do not require users to modify their code to support the updated API.
- 08:45 - Stripe strives to make it very easy to sign up, activate an account, and write the first lines of code, all of which they consider very important for early adoption.
- 09:04 - Once you have written code using Stripe you should never have to change it.
- 09:25 - A simple example of a breaking change is when a field name was changed from "type" to "brand".
- 10:50 - To give different users different behaviour, they introduced the concept of gates, which control behaviour for a specific user.
- 11:13 - Each Stripe user has a list of gates, each enabling some legacy behaviour.
- 12:05 - Most changes in the API have been changes in presentation, either in input or output.
- 12:30 - A layer of adapters deals with compatibility, transforming old-style requests to look like modern requests and, on return, transforming the responses back into the old format.
- 13:12 - The API only has to know what it looks like right now; the adapters deal with what the API used to look like.
- 14:00 - Hiding complexity with an abstraction is fundamentally leaky.
- 14:45 - Over time the number of gates grew drastically, so they introduced versioning of the API, using dates as version identifiers.
- 17:00 - The version of the API is a property of the user's account, which preserves the behaviour for that user.
- 17:30 - A quote from Twitter: "Know why I love @Stripe? They update their API practically every day but code from 2 years ago still works"
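The adapter layer and date-based versioning described above can be illustrated with a minimal sketch. This is not Stripe's implementation: the version date, the `VersionChange` class, the `rename` helper, and the use of the "type" to "brand" rename on both requests and responses are all assumptions made for illustration. The idea it shows is that the core API knows only the modern format, while each account's pinned version decides which adapters run around it.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Payload = Dict[str, object]


def rename(old_key: str, new_key: str) -> Callable[[Payload], Payload]:
    """Build an adapter that renames one key, leaving the input untouched."""
    def apply(payload: Payload) -> Payload:
        result = dict(payload)
        if old_key in result:
            result[new_key] = result.pop(old_key)
        return result
    return apply


@dataclass(frozen=True)
class VersionChange:
    """One breaking change, identified by the (hypothetical) date it shipped."""
    introduced: str  # ISO date, so plain string comparison orders versions
    upgrade_request: Callable[[Payload], Payload]
    downgrade_response: Callable[[Payload], Payload]


# Hypothetical change list; the date is made up for the example.
CHANGES = [
    VersionChange(
        introduced="2015-01-01",
        upgrade_request=rename("type", "brand"),
        downgrade_response=rename("brand", "type"),
    ),
]


def current_api(request: Payload) -> Payload:
    """The core API: it only understands the modern field name."""
    return {"id": "card_1", "brand": request.get("brand", "unknown")}


def handle(account_version: str, request: Payload) -> Payload:
    """Serve a request for an account pinned to account_version."""
    # Only changes newer than the account's pinned version need adapting.
    pending = [c for c in CHANGES if account_version < c.introduced]
    # Upgrade the request oldest change first, so it ends up looking modern.
    for change in sorted(pending, key=lambda c: c.introduced):
        request = change.upgrade_request(request)
    response = current_api(request)
    # Undo changes newest first, so the response looks like the old format.
    for change in sorted(pending, key=lambda c: c.introduced, reverse=True):
        response = change.downgrade_response(response)
    return response
```

An account pinned to a date before the change keeps sending and receiving "type", while a newer account sees "brand"; neither needs to know the other format exists.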
PCI Compliance and Go
- 18:41 - Incremental changes have also helped them with their internal infrastructure.
- 20:00 - In general, one of the main goals when going through the PCI compliance process is to make as few things as possible in scope for PCI.
- 20:20 - Just knowing if you have seen a specific credit card in the past and how frequently you see it is an incredibly valuable fraud signal.
- 20:45 - Replacing the credit card number with a unique identifier, a token, satisfies all the needs they have without bringing the systems that handle it into PCI scope.
- 21:15 - Stripe’s solution for tokenization is called Apiori, which works by replacing PCI-sensitive information in a request with a token.
- 22:47 - Stripe is mainly a Ruby shop and Apiori is also written in Ruby.
- 23:07 - EventMachine is a concurrency library used in Apiori.
- 23:45 - Eventually they found that EventMachine is unusually difficult to program against.
- 24:10 - When running load tests and planning for future growth, they found that Apiori was becoming a bottleneck.
- 24:55 - Stripe decided to rewrite Apiori in Go, one reason being the concurrency primitives found in Go.
- 27:55 - Some of the more common problems they experienced in the new implementation were encoding problems.
- 30:13 - For testing they used example-based testing, and called the collection of examples the Zoo.
- 30:20 - Every time they saw some odd request they added it to the zoo. By capturing the request as well as the response they knew exactly how the API should react for the given input.
- 31:02 - Over two months they incrementally rolled out more and more code and routed more and more requests through it. Eventually all traffic went through the Go code, without any major incidents.
- 31:40 - One improvement was latency: about 150 µs for the Go implementation versus more than 500 µs for the old Ruby implementation.
- 31:40 - One key to success was the fact that they were slow and incremental in rolling the new code out to production, with one month for writing the code and two months for validating it.
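The tokenization idea in this section can be sketched in a few lines. This is an illustration under stated assumptions, not how Apiori works internally: the `TokenVault` class, the `scrub` function, the `card_number` field name, and the `tok_` prefix are all hypothetical. It shows the two properties the notes rely on: downstream systems never see the card number, and the same card always maps to the same token, so "have we seen this card before" remains answerable outside PCI scope.

```python
import secrets
from typing import Dict

# Hypothetical list of request fields treated as PCI-sensitive.
SENSITIVE_FIELDS = ("card_number",)


class TokenVault:
    """In-memory stand-in for the only component that stores card numbers."""

    def __init__(self) -> None:
        self._token_by_pan: Dict[str, str] = {}
        self._pan_by_token: Dict[str, str] = {}

    def tokenize(self, pan: str) -> str:
        """Return a stable token for a card number, minting one if needed."""
        token = self._token_by_pan.get(pan)
        if token is None:
            # Random token: it reveals nothing about the card number itself.
            token = "tok_" + secrets.token_hex(12)
            self._token_by_pan[pan] = token
            self._pan_by_token[token] = pan
        return token

    def detokenize(self, token: str) -> str:
        """Recover the card number; only in-scope code should call this."""
        return self._pan_by_token[token]


def scrub(request: Dict[str, object], vault: TokenVault) -> Dict[str, object]:
    """Copy a request, replacing sensitive fields with tokens."""
    scrubbed = dict(request)
    for field in SENSITIVE_FIELDS:
        if field in scrubbed:
            scrubbed[field] = vault.tokenize(str(scrubbed[field]))
    return scrubbed
```

Because the token is stable per card, systems behind the tokenizer can still count how often a card appears, which is the fraud signal mentioned at 20:20.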
Migrating from one set of AWS data centres to another
- 33:00 - Moving to another data centre was probably the single most complex infrastructure project Stripe has ever undertaken.
- 33:10 - The move to another data centre went largely without a hitch, mainly because of an incremental approach.
- 34:20 - When starting Stripe, they believed that their first users would be based in California and selected the AWS Northern California region to have the infrastructure close to the customers.
- 34:50 - One major problem with selecting the Northern California region was that AWS later set up a region in Oregon which was about 10% cheaper, had much more capacity, and received new features sooner.
- 35:34 - Another major problem was that AWS released its second-generation networking stack, VPC, just as Stripe started to run on AWS, so Stripe missed out on all the new features released on the new stack.
- 36:34 - After a couple of years Stripe decided it was time to move to the Oregon region.
- 36:46 - They set three goals for the move.
- 36:51 - Goal one was no planned downtime.
- 37:16 - Goal two was to minimize the time spent in a vulnerable state, with one foot in one region and the other foot in another region.
- 37:52 - Goal three was to minimize the impact on other teams.
- 38:40 - The overall plan was to make the fact that they were running out of two different regions largely transparent to all of their systems.
- 39:05 - The migration was complicated and took about five months.
- 39:05 - Since traffic between two AWS regions potentially goes over the public network, they set up a VPN between the regions.
- 40:30 - Security groups in AWS do not work across regions, which forced them to implement their own version of security groups.
- 43:52 - Making databases transparently replicated across regions is one case where they benefited from both a low-level and a high-level view of the migration.
- 44:37 - They set up the load balancing so that they could incrementally move traffic from one region to another.
- 45:28 - The actual migration took about two hours, with a lot of planning and preparation beforehand.
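The notes do not say how the load balancing shifted traffic, so the following is only one common way to do it, offered as a sketch: hash each request into one of 100 buckets and send the buckets below a configurable percentage to the new region. The function names and region labels are assumptions made for the example.

```python
import hashlib

OLD_REGION = "northern-california"  # hypothetical labels for the two regions
NEW_REGION = "oregon"


def bucket(request_id: str) -> int:
    """Map a request ID to a stable bucket in the range 0..99."""
    # SHA-256 is used for a stable hash; Python's built-in hash() of a
    # string changes between interpreter runs.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_region(request_id: str, percent_to_new: int) -> str:
    """Route a request, sending roughly percent_to_new percent to the new region."""
    if not 0 <= percent_to_new <= 100:
        raise ValueError("percent_to_new must be between 0 and 100")
    return NEW_REGION if bucket(request_id) < percent_to_new else OLD_REGION
```

Raising the percentage only ever moves more buckets to the new region, so a request that was already served from there stays there as the rollout proceeds, and setting it back to 0 returns all traffic to the old region.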
How does the version defaulting affect discoverability of documentation?
Another aspect I am curious about is how that mapping actually works if your users bake the clients into (mobile) applications their users actually use? Assume those applications receive an update, moving to a newer version of the API but the user's users not necessarily updating to the latest version of the application. Doesn't that break the user-to-API-version mapping?