
How to Invest in Technical Infrastructure



Will Larson unpacks the process of picking and prioritizing technical infrastructure work, which is essential to long-term company success but discussed infrequently. Larson shares Stripe's approaches to prioritizing infrastructure as a company scales, justifying a company's spend on technical infrastructure, and exploring the whole range of possible infrastructure areas to invest in.


Will Larson has been an engineering leader and software engineer at a number of technology companies including Digg, Uber, and Stripe. He is also the author of An Elegant Puzzle: Systems of Engineering Management.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Larson: What I want to talk about is prioritizing infrastructure investment, but not in any environment. How do you invest in an environment where people do what they want? I've heard of companies where you can tell people what to do and they'll just do what you've told them to do, but I've never actually worked at a company where you just tell people what to do and they'll actually do that, you really have to convince them to do it. Even then they might not do it anyway. Thinking about this high autonomy environment, and also within a scaling business, everything is a lot harder when the company that you're working in is growing fast. Anything that isn't improving is probably getting worse in a company that's going through a lot of growth. Another way to think about this is how can infrastructure teams be surprisingly impactful without burning out? This is focused a little bit on infrastructure teams, but it's not just infrastructure teams. Really any software team, I think, is going to be able to take away a lot from this.

The first question when we get into this is, what is technical infrastructure? This is a definition that you've probably seen a few times, which is, any problem that people don't like they think might be infrastructure. You can probably think of some examples when someone has come by your desk, tapped you on the shoulder and said, "Maybe you should work on this problem that I particularly dislike." I can think about one company that I worked at where we were sitting there firefighting, and a team came up and asked, could we take on this project for them? It happened to be in Ruby on Rails, which was a little confusing because we were mostly a Python Flask company. They said it happened to be running MongoDB, which was a little bit confusing because we were a Postgres company. Then they said, "It also happens to be on AWS," while we were only in physical data centers. They said, "It's also confidential, running on my personal account, and you can't let anyone know about it." This was their idea of what a good infrastructure project might be for us to take on.

We kindly thanked them for that idea and chose not to take on that piece of infrastructure because the definition that we'd been working with for infrastructure is a little bit different than this one - tools used by three or more teams for business-critical workloads. Three or more teams is essential, because if you just have one team using it, you're really part of that team. You're not infrastructure, you're just part of that team, and you're pretending you're a different team, but you're really on the same team together. If it's not business-critical, then you have a whole range of options that work really well that don't work for business-critical workloads. If it's not business-critical and it's costing too much, you just turn it off. If it's not business-critical and it's down, it's ok, you'll get to it tomorrow, get to it next week. This is the definition that I'm working with.

What are some examples of what technical infrastructure can be in this case? Developer tools: your builds, your tests, your IDEs, your linting. Data infrastructure: Airflow, Hadoop, Presto, Spark. Core libraries and frameworks: some people would think this is product infrastructure or some people would say this is not infrastructure at all. These are really core tools used by a huge amount of your company, a huge amount of your developers, that really define the experience of working as an engineer within your company. I think defining them as not infrastructure because they aren't touching servers directly in some way is really short-sighted. It's really important to have an expansive view of what infrastructure can be if you want to be as impactful as possible, with the full set of options at your fingertips to help the folks around you succeed.

Another one is machine learning: training, model evaluation. I think a decade ago, our vision of ML was wizards would come out of the Stanford CS program, write a perfect algorithm and hand it off to you, but ML today is a technique that almost every team should be using for certain types of problems. It's just another toolkit for a common set of problems. In this case, model training, serving features, evaluating those models in production. This is just another piece of infrastructure that many teams are going to need at your company.

What I want to walk through today is first a framework for how I think about teams, the state of teams and investing, and the different types of problems. Then the classic problem of what do you do when a team is stuck firefighting? How do you actually get out? Then the new, novel problem of innovation. I think for a lot of teams innovation is not a problem, but so many teams get this rare opportunity to innovate, to pick any problem they want to work on, but then fall back into firefighting shortly thereafter. Why does that happen, and how can you prevent it from happening to you? Then finally, if you're increasingly successful, your organization can keep growing, but you can have so many things that people want from you and so many things that you need to do yourself that figuring out which of them to work on can become extremely challenging. You can feel like you're constantly failing because there are so many things that are important that you want to do. How do you navigate that breadth? Then we take all of those threads and talk about what we actually do today within Stripe's Foundation engineering organization.


This is a line. There are two ends on the line, a continuum between them, one end forced, one end discretionary. Forced: if your database is falling over and you simply can't run, scaling that is pretty much something you cannot not do. If you are about to run out of money because your burn rate is too high and your AWS costs are your primary source of spend, lowering those is recommended. Who here has worked on GDPR in the last year? Lots of hands. Who here woke up one morning and said, "GDPR is what I crave to work on?" Very few people have like a hidden passion for GDPR or the California privacy laws, but you did it. You didn't do it because you necessarily thought it was like a fun or exciting technical problem, you did it because it was essential for your users, essential for your business. A project that Dmitry [Petrashko] spoke about yesterday here and that I'll talk about a bit later is that Stripe actually built a static typer for Ruby. You never have to build a static typer for Ruby. You could just not do that, and you can be successful without it, but we chose to do it. We'll talk about why. This was something we didn't have to do, it was discretionary.

A lot of companies move from monoliths to microservices. There are a lot of reasons why this is valuable, but you never have to do this. There are also a lot of downsides. This is a choice that you can make. Moving from your good, trusty friend linear regression to deep learning because you don't want to understand the algorithm and want to be able to blame it when you get a Twitter firestorm, you don't have to do that. You can keep working on something that you understand and can explain to yourself and to others when Congress comes asking questions.

Another line, short-term, long-term. Short-term, if you have an incident, you want to make sure it doesn't happen again tomorrow. A critical remediation, you can do it in a week, a couple of weeks. Scaling for the holidays that are coming up very shortly - hopefully, something you can do in the next couple of weeks, and if not, you might have an interesting set of critical remediations for next year. Supporting a launch that's coming up. These are all things you can do in a couple of weeks, month, quarter max. There are other things that are much longer-term. If you want to roll out a quality of service strategy across all of your endpoints or all your network traffic, that's not just something you turn on, that's going to take a lot of time to get right. If you want to bend the cost curve, one of my favorite phrases, so you get more efficient over time, that's not something you just turn on. That's going to take a lot of rearchitecture, a lot of intent. If you want to rewrite your monolith that's been nagging and bothering you for years, it's not just going to happen, it's going to take quite a bit of time to get through.

If you put these together, you get an interesting quadrant. Starting on the top left, you have the short-term forced work, bottom right, long-term discretionary work. I find this is actually a pretty powerful way to look at the state of teams and the state of projects. Top left, this is the firefighting quadrant. This is where a lot of teams spend a lot of their time, where they have no choice about what to work on. Every project they work on is critical and can't be postponed, and they all have to get done immediately. Then the bottom right, this is like the R&D quadrant. This is a really exciting quadrant to be in because you're only doing work that you pick on a long-term timeframe. Typically, this means you don't have any users, and no one cares about your work. This is a really scary quadrant to be in because you think you finally made it but you're actually in a bit of a trap.

Ask yourself to think about where your team is now, and then think about where you want to be. If you picked anywhere other than here, you're wrong. This is where you want to be. You want to be doing more long-term work than short-term work, more discretionary work than forced work. If you aren't doing some short-term forced work, it means you don't have users that are actually asking for things or you're not talking to your users. This is quite a bad sign for the long-term health of what you're building and the value you're trying to add to the engineers that you're working with, the users that you're working with. If you're not doing any discretionary work, it means that you're really just stuck on a wheel. You're really probably not doing the most impactful work. You're doing what you have to. This isn't the right way to be a successful, healthy team either.

Escaping the Firefight

Escaping the firefight - this is truly where almost all teams start. Any new team is going to start either in this research quadrant or in the firefighting quadrant, where you have just more work coming in than you're possibly able to handle. Even the good company Stripe, which is a great company, has had some problems with this as well. Who here has used or uses MongoDB in production? Who here has tried to blame the technology for a problem you had instead of your integration? Some hands should be going up. There are no bad technologies really. There's just misusing technology. This is not a story of how MongoDB is a terrible database, this is a story of how we misused it a bit and learned through that experience.

Like many companies, we used to have one shared database or one shared replSet of nodes that handled many different types of query patterns: analytics query patterns, batch query patterns, real-time web query patterns, and all the data was strapped in there together. There are some advantages to having these large shared replSets. It's easy to provision, there's just one set to deal with. It's pretty cheap because you just don't need that much capacity, you don't have too much overhead. There are some problems, though. With microservices you want shared-nothing; here you have this shared-everything pattern, which is just a little bit concerning. Joint ownership - no one really owns the database if everything is in one database together. You have lots of partial ownership. Also, limited isolation: any bad query can really start impacting the entire database because the blast radius is so large.

What we've found is we were spending more time on incidents as we added more developers, more complexity. Our business was thankfully growing but, less excitingly, it meant that each of the incidents we were having was more and more impactful both to our users, who are growing, but also to us and to our business, which was growing. With infrastructure in particular, but really software in general, these are not steady-state systems. There is no status quo in the software that you're writing in your codebase. There's no status quo in living systems. If it's not getting better, it's probably getting worse. It might not be obvious day over day it's getting worse, but it's almost certainly either improving or degrading. We were having these incidents, we were spending more time on them, the impact to our business, but more importantly to our users was growing. How do we go about fixing it?

First, the bog-standard fix: shard. Get the different use cases, get the different data into different databases that have a smaller blast radius. Second, who here has used or who here knows how the Mongo Query Planner works? It's actually pretty cool. They have a racing technique, which works really well for many workloads, where they will actually try to race queries against different indices and see which performs better. They'll do that dynamically against the production workload every n queries. It works very well for most workloads, but not for workloads that have a weird distribution. For us, we had a few users that were much larger than the others. Depending on whether the race happened on a small user or on a large user, we would get the wrong index in one of those cases.

What we found is that just periodically, these queries would start going haywire. From a latency perspective, they would still complete, they were just too slow. Originally, someone working on our storage team would have to go figure out what the query was and terminate that query by hand. That has an easy fix: we'll have a script do that, we don't need humans to do this repetitive labor. We can just automate it. We got a Query Hunter in and we automated the termination of these long-running, poorly performing queries. We still had these incidents, but we recovered much more quickly, and all of a sudden, our engineers were spending time trying to understand and fix the larger problem, instead of trying to just fix and mitigate the short-term incident.
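As a rough illustration of the idea (not Stripe's actual Query Hunter, whose implementation the talk does not describe), here is a minimal Python sketch. It assumes each operation is a dict shaped like an entry from MongoDB's `currentOp` output (`opid`, `active`, `op`, `secs_running`); the function names, the threshold, and the injected `kill` callback are all hypothetical.

```python
from typing import Callable, Iterable, List


def find_runaway_ops(ops: Iterable[dict], max_secs: float) -> List[int]:
    """Return the opids of active read operations running longer than max_secs."""
    runaway = []
    for op in ops:
        if not op.get("active"):
            continue  # idle operations are not consuming resources
        if op.get("op") not in ("query", "getmore"):
            continue  # assumption: this sketch only ever terminates reads
        if op.get("secs_running", 0) > max_secs:
            runaway.append(op["opid"])
    return runaway


def hunt(ops: Iterable[dict], max_secs: float, kill: Callable[[int], None]) -> List[int]:
    """Terminate every runaway read via the supplied kill callback; return what was killed."""
    killed = find_runaway_ops(ops, max_secs)
    for opid in killed:
        kill(opid)
    return killed
```

In a real deployment the `ops` list would come from the database's `currentOp` command and `kill` would issue `killOp`; both are injected here so the selection logic can be exercised without a database.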

Really, when things got good was not this faster remediation, although that helped. When things really got good is when we actually added static analysis, where any new query going out had to have a query hint or an index hint. If you didn't have it, you could not deploy. This was our ratchet on poorly behaving queries, and after we did this, we noticed we had no more storage issues for many years, not forever, but for many years.
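A deploy-blocking check of this kind could look roughly like the following Python sketch. It is a deliberately naive, line-based approximation with hypothetical names; the talk does not say how Stripe's check was built, and a production version would more likely inspect the syntax tree of the application code than match regular expressions.

```python
import re
from typing import List, Tuple

# Assumption: queries are issued through calls such as .find(...) and an
# index hint appears on the same line as .hint(...) or hint=...
QUERY_CALL = re.compile(r"\.(find|find_one|count_documents)\(")
HINT = re.compile(r"\.hint\(|hint\s*=")


def unhinted_queries(source: str) -> List[Tuple[int, str]]:
    """Return (line number, stripped line) for each query call lacking an index hint."""
    violations = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if QUERY_CALL.search(line) and not HINT.search(line):
            violations.append((lineno, line.strip()))
    return violations
```

A CI step would run this over changed files and fail the build when the returned list is non-empty, which is what turns the rule into a ratchet: existing behavior can be grandfathered, but no new unhinted query gets through.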

What's the playbook behind all of this? If you have a similar problem, you are hopefully not having MongoDB outages, but you're probably having some sort of scalability issue that's taking a ton of your team's time. It might not be scalability, it might just be a volume of work issue taking up your time.

The first thing and the most important thing whenever you're behind is to finish something. Who here has read "The Phoenix Project?" It's a really great book. I don't know if you like business fables, it turns out the business fable genre of book is polarizing. It's pretty good, though, if you're willing to deal with that. It really emphasizes the importance of finishing something. You only get value from completing work. You can only get faster by finishing projects that are useful. If you don't finish work, you will never get faster. You will never get out. The number one thing you have to find some way to do is just to finish something of value.

One of the best ways to finish things of value is to reduce the level of concurrent work you have going on. If you have too many things working, even though they're all extremely valuable, even if you have a team that doesn't like to work together or a team which rightfully believes that individually the coordination costs of having one person per project is lower, if you don't finish the work, even if you're working as fast and completing as much partial work as possible, you don't get any value from it. Figuring out how to reduce the work in progress, work on fewer tasks and actually complete them.

Who here knows what ITIL is? Who here actually knows what the acronym expands to? I forget it every single time, but I think it's information technology or internet technology information toolkit or library or something like that [Information Technology Infrastructure Library]. Terrible acronym, but actually, the ideas behind it are pretty phenomenal. It's interesting as you look into this: this is a series of practices that have been standardized for decades and answer many of the problems we have today, but because it has a terrible acronym, most people have never heard of it. It's really worth slogging through one of those references at some point in your career.

If you have something like a cookbook where you actually have a full list of everything that people are asking you for, you get the metadata about what the most important tasks are. You can figure out what to automate to save the most time for your team. Figure out how to take a few of these things that are taking up time and automate them, and all of a sudden, you get that time back. Once you get that time back, you all of a sudden have the opportunity to do things like the static analysis of these queries to make sure they have these indices hinted, so you can eliminate entire categories of problems.
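To make the "what should we automate first" question concrete, here is a small Python sketch that ranks request types by total time spent. The request log format (a category and the minutes spent on it) and the function name are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def rank_automation_candidates(requests: Iterable[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Sum minutes spent per request category and sort with the most costly first."""
    totals = defaultdict(float)
    for category, minutes in requests:
        totals[category] += minutes
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

The point of the ranking is that frequency alone can mislead: a rare but slow request can cost more team time than a frequent quick one.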

If you do all of these things, you might find that your team is still not digging out. At that point, there's really only one thing left, and you've got to hire. I think a lot of companies and a lot of teams don't do this last step, but if you've tried to automate, if you've tried to reduce work in progress, if you've finished work that's valuable, if you've gotten really creative and thought about ways you can eliminate categories, if none of that works, the only thing you have left in your toolkit that's going to work is to hire. I think you have to push pretty hard on this one. Sometimes you just have to recognize that you're being put in a spot where you just can't get out if a company is not willing to fund your team's success. If you do all of that and nothing works, you've got to hire.

The other thing, though, and one of the scariest things, is that you often see that your strategy is working. You'll often find that you are finishing something and that you've got a little bit more time, you finish the second thing, a little bit more time, finish the third thing. You're confident you're digging out, but it's not happening fast enough, and people are going to come to you and tell you your strategy sucks. They come to you and tell you that it's not working. They come to you and tell you to rush, to do like a spot fix that will somehow make it better, and to rethink. It's really hard in that moment to know whether your strategy is terrible, because half the time it is, or whether it's actually working and you just have to have the courage to see it through. That's where having some metrics, some visibility, some information helps: you can see if it's actually compounding, just in the very beginning of compounding but compounding nonetheless. If so, stick it out. Have the courage. If you don't, you're going to be stuck firefighting forever.

Finally, a lot of teams love firefighting, it's really motivating. The CEO comes by and says like you're doing a great job. Your manager tells you how essential you are to the company. You know exactly what to work on every day because it's a thing that's the most deeply on fire, and it's flickering light. It's pulling you to it like a moth to the proverbial flame. It feels really good. Teams jell, you know what to do, but it's not actually a healthy state. You're not actually doing a whole lot. It feels productive, but the actual value is quite low, so don't fall in love.

Learning to Innovate

If you don't fall in love, if you have the courage to stick to your plan, if you've finished valuable work, something exciting happens, which is you get to actually innovate, but not quite this far to the bottom right, a little bit up and to the left. This is particularly rare in infrastructure, but also in software development in general: getting to work on greenfield projects, getting to work on new projects with no technical debt other than what you're about to add to it, is a pretty rare opportunity. Unfortunately, rare also means inexperienced. There are so many people who have careers of excellence in firefighting that have never had to set up or pick a new project from scratch. I think when you come into this, you have to be really thoughtful about how you approach it.

The entire summary of this section is, talk to your users more. If you're building something new, talk to your users more. It's not really just talking, it's actually spend as much time as possible listening to your users. If you take nothing else away from this, if you actually have spare time to pick something to work on, talk to your users and spend as much time as possible listening. If you don't just want to take that as advice, here are some other ways that innovation goes wrong and some patterns that I've seen to help correct those.

One of the challenges I see is the most intuitive fix problem, which is also the fixating on your local maxima problem. When you've been working on a piece of technology, you have such a great intuitive sense of what needs to be done that you finally get some time and you're going to go make that incremental improvement that is so obviously valuable to you. The challenge is that it is valuable, but you only have so much context yourself, you don't have the context of your users. How do you expand a little bit more?

You run a discovery process. The single most valuable way to figure out what you should be working on is to spend 2 days and talk to 5 or 10 different peer companies about how they're approaching a similar set of problems. This will give you an amazing way to just steal the brains from hundreds or thousands of other people working on similar problem sets. It will take you very little time, but this can help save you literally years of going down the wrong directions. You shouldn't cargo cult. You don't just take what these other companies tell you and just do exactly that. You don't average them all out and do exactly that. That's the buying IBM of problem selection. It's a great way to figure out what's possible out there. Internally or externally, depending on where you are on the stack, actually sit down and talk to your users: what are their biggest sources of friction? What are they spending time on? What do they actually want from you?

Particularly if you've been firefighting, a lot of users are going to get trained not to bother talking to you, because they've come to you and you've been "No, no." They've gotten that response for a year or six months, and they're just "These people literally hate their users and won't talk to us." They've stopped telling you what you need to know about what you need to build next. You have to go proactively build those relationships, not just assume "Why aren't they telling us what they want?" They stopped telling you because you stopped listening to them ages ago. You have to go intentionally rebuild those connections after you pop out and have time.

Another one that I found surprisingly useful is the SLO discussion. Just having the SLO discussion with your users can surface a bunch of unexpected mismatches between what you're offering and what your users think you're offering. Then closing that gap can be a really valuable way to find work.

Who here sends out a developer survey every year to all the engineers at your company? Ok, some hands are going up. Are those useful? Who finds those useful? There's an interesting pattern that I see in a lot of folks sending these surveys out. The first three are incredibly impactful, and then they normalize and they stop giving much new information. My new strategy is to do two or three of them, let it go fallow for like six months, and then start up again. That way you avoid anchoring on some random point in the data. Doing the surveys is a great way to find your detractors, who have the most valuable information for you, because they're really upset about something, and that's what you want to know when you're picking what to work on next: who's really frustrated.

A problem that many of you have probably heard as you talk to folks is, Ruby is a terrible language. This is not a true or a false statement. This is a feeling that folks have. If you start out with this problem, it's extremely hard to get to Sorbet, the Ruby typechecker, as the answer. I think it's really powerful when you do the benchmarking, when you talk to your users and you talk to other companies and you try to understand the problems in detail. I did not expect when we first started talking about scaling Ruby that the best solution for us was going to be adding a static type checker. The obvious alternatives seemed simpler at the time: migrating to a language with static typing. Maybe migrate to Java, migrate to Scala or whatnot.

As we dug into it, this project turned out to be so impactful, and impressively, the migration path to this was phenomenal: incremental, gradual typing. If it didn't work, all we lost was a couple of engineers building it. If we had tried to move directly to Java or something statically typed, we would have had a huge full rewrite on our hands. With this, our actual risk was almost zero, literally two or three engineers working for six months to get the first version running. It was only through talking with our users, through talking with Facebook about how they work on the Hack project, talking with Shopify and others that this became the obvious choice for us to start with, as opposed to a terrible pie in the sky idea. This is where doing this discovery was so important for us. For me personally, I was confident this was the wrong decision for a long time, but as we got into the data, it became clear this was actually the best path forward, and it has been incredibly successful for us so far.

Another challenge is you can literally do anything. How do you select which things to work on? Prioritization - great word. Ordering by return on investment, I think you don't want to have things that are too far out that have too much risk of not completing. Pick a good mix of things that'll be really impactful but things you can actually complete in a bounded amount of time.

Do it with users in the room. I find many teams are still planning without their users in the room. I understand why teams do that. Users disagree with you, users want things that you don't think make sense. There are all sorts of things that users do that are unhelpful to doing the work that you think is interesting, but really helpful to doing the work that's going to be impactful. Something we've started doing is pulling more and more users into the actual room during planning, which allows us to front-load disagreement about the plan instead of learning about disagreement after we've already started the work.

Finally, one of the highest impact things you can possibly do is if you have a long-term vision of what you want to accomplish over the next two or three years and you can look at your short-term work and fit it into that, you can actually have your work compound. If you don't have a long-term vision, then all of your work is just scattered in this possibility space of what you could do. It's never going to come together into something larger, into something that multiplies, into something that creates true leverage. It's only by having a sense of what you want to accomplish and how you want the pieces to fit over the longer time horizon that you can actually create significant leverage for yourself and for other teams within the company. If you're planning without a sense of the vision of how you want the pieces to come together long-term, you're robbing yourself of the biggest potential to have leverage in the company.

Here is one you've potentially heard "Using a language that I'm excited about is a critical business outcome." Users in the room typically don't care about this unless they like the language too, in which case you might need several users in the room. One that to me, again, I thought didn't make sense at all initially but ended up being a great solution is we ended up moving our machine learning training onto Kubernetes. At first, this sounds like buzzword, buzzword, intersection and not obviously what we should be working on. As we started talking to the machine learning infrastructure team within Stripe, they were spending so much time managing the servers, or the VMs in this case that was just overhead for them that actually, this ended up being extremely impactful for them because we were able to take on all the toil they spent on VMs onto our orchestration teams. Again, originally this sounded like a terrible project. It was only getting the users into the room and digging into their problems in detail that this, which sounded like buzzword soup, actually turned out to be an incredibly impactful, successful project for us.

Another one is you find the right problem, but you actually solve it in the wrong way. I have a couple of examples for this one. Validation is pretty easy. First, you try to disprove your approach as cheaply as possible. If you can do like a one-day prototype that shows that your idea won't work, do that. If you can do a quick design spec that shows the idea won't work, do that. If you can figure out the fundamental costs and try to draw the curve and see how that scales and it's going to be too expensive, do that. Find the cheapest possible way to disprove it. You can also try a very simple implementation of the simplest service.

The second thing you have to do is to try the hardest possible use case next. A lot of times you have a migration metric where your team needs to get to 80% adoption of your new orchestration framework. Obviously, you as the rational person will move the 80% of the easiest services. You'll hit your OKRs, people will think you're smart, your platform is obviously working really well, but it's not actually obviously working very well. You're dodging the important learning, you're dodging the risk that the platform migration might never complete, but you're hitting your metrics. You're hitting your metrics in a way that feels like success but is actually extremely likely to sabotage the real impact that you're trying to have. That's why I think, start with something very easy first, just to prove that the developer experience of your new approach is good. Then do the hardest thing second. If the platform or system you're migrating to won't work for your hardest problems, the worst possible outcome is that you end up having two different solutions. You end up running Mesos and Kubernetes, end up running Consul and etcd, etc. Try the hardest one second.

Then to actually do the integration, embed with the owners. A lot of times infrastructure teams throw or even developer productivity teams [inaudible 00:30:47] throw solutions over the wall. It's interesting, you find every company either infrastructure tells product what to do, and they capture the product teams, and product teams just implement infrastructure migrations, like "Why is this happening?" Or we see the inverse. The easy solution is actually, build empathy, and the best way to build empathy is not to throw something over the wall for people to implement, to actually go spend six months joining the early adopters of your new system, platform, and help them migrate onto it and actually feel the experience of operating it from their perspective, not just reason about their experience of operating it from the comfort of your own chair.

Stripe has a tool called Monster. If I could go back and rename it, would we name it Monster? No, it's a bad name. It's a queuing system and has some really interesting properties. In particular, one of the interesting properties it has is scheduling future retries or future tries. If something fails, you can say, "I want to retry this in 12 hours," or you can say, "I want to send this email 18 hours from now." The actual data structure underneath that is essentially a priority queue. Having a highly available distributed priority queue is a messy problem, and there's not really a great open-source solution that you just download and install and it works out of the box.
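To make the "priority queue keyed by time" idea concrete, here is a minimal single-process sketch. It is not how Monster is implemented (the talk does not describe that), and it ignores the hard parts mentioned above, namely high availability and distribution. The class name `DelayedQueue` and the job names are hypothetical.

```python
import heapq
import itertools


class DelayedQueue:
    """In-memory priority queue ordered by the time each job becomes due.

    Illustrative only: a real system would also need persistence,
    replication, and failure handling.
    """

    def __init__(self):
        self._heap = []
        # Tie-breaker so jobs with the same due time come out in insertion order.
        self._counter = itertools.count()

    def schedule(self, job, now, delay_seconds=0):
        """Enqueue a job that becomes due `delay_seconds` after `now`."""
        due_at = now + delay_seconds
        heapq.heappush(self._heap, (due_at, next(self._counter), job))

    def pop_due(self, now):
        """Remove and return every job whose due time has passed, earliest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, _, job = heapq.heappop(self._heap)
            due.append(job)
        return due


HOUR = 3600
queue = DelayedQueue()
queue.schedule("retry-charge", now=0, delay_seconds=12 * HOUR)  # retry in 12 hours
queue.schedule("send-email", now=0, delay_seconds=18 * HOUR)    # send 18 hours from now
queue.schedule("run-now", now=0)

ready_immediately = queue.pop_due(now=0)
ready_after_13h = queue.pop_due(now=13 * HOUR)
ready_after_18h = queue.pop_due(now=18 * HOUR)
```

Passing `now` explicitly instead of reading the clock keeps the sketch deterministic and easy to test.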

We were having some problems working with it, and so the obvious solution is to just rewrite it, followed by the well-loved sequel, "Let's rewrite it again," and then followed by the "Let's just make it better but not try to do a full rewrite." This was really valuable experience for us because we thought we understood the problem well, and we thought the problem was our codebase was bad and just hard to operate with, and that was right. That was the problem. The solution wasn't starting over from scratch, it was actually fundamentally a really challenging problem to work on. Instead, we found that simply by improving the critical pieces of it, we were able to harden it bit by bit and then make it into something that actually evolved easily, instead of trying to throw it away and build a second system that actually worked out of the box.

A problem that actually went well is this one, which seems like a pretty reasonable problem: I'd love to run my cronjobs, my batch workloads, my services all on the same system. We actually validated this in a really powerful way that worked quite well for us. We used to use something called Chronos. Chronos is a second-tier scheduler on top of Mesos, originally out of Airbnb. It's gotten a little bit less love over time, and we were running a bunch of our critical cronjobs on it. We decided, "Let's see what would happen if we can migrate all those cronjobs to run on Kubernetes cron." That worked incredibly well for us. The great thing that happened is we were able to completely deprecate all of Mesos at the company as a result, turned that off, moved everything over, and we got a foothold into Kubernetes, where even if nothing else worked on Kubernetes, we still maintained just one platform, and we got some experience operating it in a relatively safe way. If crons don't run for a few minutes, it's usually ok. Whereas if production workloads stopped serving for a few minutes, it's less exciting.

Then we found another workload that was extremely valuable and let us deprecate an entire class of orchestration systems we were using, where even if nothing else worked beyond that, it was still valuable. We had our machine learning training. We had pretty much a custom orchestration tier that we had built out for that. We were able to move all of that into ad hoc one-off jobs running on Kubernetes as well. Then we were also finally able to move to running our services on Kubernetes. At each step, we figured out, "What is a discrete thing we can do?" We tested it with the hardest things within that subset and made sure the problem was discretely valuable for us before we moved on. This has been working really well for us for over a year now.

Listen to users more. If you do nothing else for innovation, just listen to users more. The reason why this matters so much is that at the end of the day, you get this gift of time, you finally escape firefighting through your diligence, through your thoughtfulness, through your creativity. You get a moment to pick what to work on. If you pick the right thing, you're going to reinvest it, you're going to get more time, you're going to do more and more innovation, and you'll be doing more and more highly leveraged work for your company and for yourself and for your users. If you don't, you go right back to firefighting. Sometimes even worse, because now you have two systems that you're trying to maintain simultaneously. Or if it's the second time you're doing this, three, and this just gets worse. It's really critical to be thoughtful about this. This isn't a vacation, this is your opportunity either to make this into an amazing job, where you're able to think and learn, or to send yourself right back to the place you just got out of.

Navigating Breadth

I have an interesting problem, though. You're finally in this perfect spot that I've told you to seek, but you're having some interesting problems. There's a common old saying here: fool me once, shame on you, fool me twice, shame on me. What if you fool me the exact same way on the same date every single year? That's probably at some point actually my fault. Like many companies, we have seasonal traffic spikes, and every year, just before them, we'd realize, "We should probably make sure we're scaling up for the seasonal traffic spike." What if, and this is just hypothetical, what if we thought about this earlier because we know every year it's coming? This was a problem we had: just making sure that we could actually handle the load. For us, Black Friday and Cyber Monday are huge retail days. We had this problem for several years, and we said, "Let's stop having this problem."

We built this load generation team that basically worked on the mission of how do we convert this unplanned panic work we were doing into planned predictable work that we could actually schedule? This is pretty straightforward. We started with manual load tests to actually validate, and get that all running. We were originally running in Tsung. Who here has used Tsung? Zero hands? Tragic. A classic Erlang load testing tool that you use on your path to something you know how to debug. Scheduling the manual tests is very important; you run them and get to learn from that. Second, we started automating those, where instead of just running them by hand on a scheduled calendar every couple of weeks, we ran them every Wednesday automatically, and then eventually moved to running them continuously in production 24/7, which is where we've been for several years now. At any given point, we look 90 days in the future, we project the expected load, and we simulate that amount of load in addition to our current load to ensure that we always have at least three months of scalability headroom and a warning if something has regressed.
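The 90-day headroom idea can be sketched as a small calculation. The talk does not say how the projection is done, so the compound daily growth model, the function names, and the numbers below are all assumptions for illustration.

```python
def projected_load(current_rps, daily_growth_rate, days_ahead=90):
    """Project load `days_ahead` days out, assuming compound daily growth.

    The growth model is an assumption for illustration; a real projection
    would also account for seasonality such as holiday spikes.
    """
    return current_rps * (1 + daily_growth_rate) ** days_ahead


def synthetic_load_to_add(current_rps, daily_growth_rate, days_ahead=90):
    """Extra simulated load needed on top of real traffic to reach the projection."""
    extra = projected_load(current_rps, daily_growth_rate, days_ahead) - current_rps
    return max(0.0, extra)  # never "remove" load if traffic is projected to shrink


# Hypothetical numbers: 1,000 requests/second today, growing 1% per day.
extra_rps = synthetic_load_to_add(current_rps=1000, daily_growth_rate=0.01)
```

If the system stays healthy while serving real traffic plus `extra_rps` of synthetic traffic, it has roughly `days_ahead` days of headroom under the assumed growth rate.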

This team solved itself out of a job. We no longer have that team. It doesn't even exist anymore because the systems are simple enough and worked well enough that we were able to move them on to working on a new set of problems. This is a great technology fix. That team was rewarded with a new set of exciting problems, mostly Elasticsearch, which they've been working on for a little bit since. How do you actually do this in a consistent way as an organization? How do you not get surprised every year by the same problem? The tool that we've rolled out is something that I call infrastructure properties. For your company, it's not going to be this exactly, it's going to depend on what you're doing, but these worked really well for us.

The first one is security. I would say if we have a security vulnerability, it's better for us to be off than to be on, because it's such a huge impact for us in the business that we're in, and probably for almost any technology company this is true today. Reliability: are we actually serving successfully to our users? Usability has a dual aspect of both being people integrating with us externally, but also how do engineers within the company get their work done? Are they actually able to get the value and the productivity out of the tools that we're offering to them? Efficiency: this is performance, and what this costs on the cloud to run. Finally, latency. These are lightly ordered but not stack ranked.

One of the classic problems of planning or prioritization is thinking anything can be stack ranked. Nothing is stack rankable. If you go back and think about this, if this is stack ranked, you say security is the most important thing, then you only do security work, and your company is useless. If you only do security work, nothing you're doing is directly useful for your users. Your users will leave. Security is the most important priority, but you can't only do security work. Same with latency at the bottom: you would never do any latency work. Stack ranks do not work for prioritization. Instead, you have to think about it as a portfolio. For the most important thing, security, you do the most investment. Maybe for your top priority that's 30%, and maybe for your bottom priority, latency, that's only 5% of your workload. You have to think of this as a portfolio where you invest into all of them. If you don't invest across all of them, you will regret it. You have to ask yourself, "How do I know when to invest into which of these?" That's where having baselines is really powerful.

You might say, for security, the metric that we're looking at is instance age: if our longest-lived instance is ever over seven days old, then we need to do a major investment in security to make sure that we're rotating those quickly. Why is that an interesting metric? It's a proxy for your ability to roll out patches and critical security fixes as quickly as possible. For reliability, you might have some SLOs you're looking at, you might look at the time to fix critical remediations. You might look at some sort of synthetic score about fundamentals you believe represent your reliability. For each one of these, have a series of baselines, and then when they're triggered, when they're breached, that's your warning to actually invest into them more. By having all of these, you are able to think about what for this quarter or for this half is the right mix between security, reliability, latency and so on. You then invest to maintain those baselines.
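A baseline check of this kind can be sketched in a few lines. The seven-day instance age limit comes from the example above; the availability target, the metric names, and the `Baseline` class are hypothetical and only illustrate the shape of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    prop: str                     # infrastructure property, e.g. "security"
    metric: str                   # name of the observed metric (hypothetical names)
    limit: float                  # threshold that triggers investment
    higher_is_worse: bool = True  # False for metrics like availability

    def breached(self, observed):
        """True when the observed value is on the wrong side of the limit."""
        if self.higher_is_worse:
            return observed > self.limit
        return observed < self.limit


BASELINES = [
    # From the talk: the longest-lived instance should never be over seven days old.
    Baseline("security", "max_instance_age_days", 7),
    # Assumed SLO-style target, purely illustrative.
    Baseline("reliability", "availability_percent", 99.9, higher_is_worse=False),
]


def breached_properties(baselines, observations):
    """Return the sorted set of properties with at least one breached baseline."""
    return sorted({
        b.prop
        for b in baselines
        if b.metric in observations and b.breached(observations[b.metric])
    })


observations = {"max_instance_age_days": 9, "availability_percent": 99.95}
needs_investment = breached_properties(BASELINES, observations)
```

Here `needs_investment` contains only `"security"`, which is the signal to shift more of the portfolio toward that property for the next planning period.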

Something really interesting here is that you're not just thinking about, "What do I need to do to recover from my baselines?" If you only think about recovering from failed baselines, you're probably doing short-term forced work. You actually have to think about, "What do I need to do to avoid triggering these baselines six months out? What do I need to do to avoid failing to meet these baselines two years out?" You run into this really interesting category of work, which is long-term forced work. If you look at your infrastructure costs and they're scaling too fast or they're scaling faster than user usage is, then you know you actually have a fundamental, pretty severe problem that you need to invest into, but you have some time to do it.

If you start working on it now, knowing that two years from now it will be a critical business problem, you can actually do it well, can actually do it thoughtfully, can actually do it without firefighting. If you ignore this thing that is definitely going to happen to you in two years, that you can tell by extrapolating the impact of user growth on your baselines, you can postpone it, but you're just going to end up firefighting again, and quality work does not come out of the firefighting top-right corner. Do it now or firefight later. I think a lot of times we make bad decisions because we've never seen people plan ahead and make good decisions. I believe all of us as professionals can actually just look to the future, think about it, and make better decisions where we're not constantly heading back into the firefight. This isn't fundamental to our work, that we're firefighting, this is bad planning, and it's because we're planning badly, not because planning is impossible.
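Extrapolating a metric toward its baseline can be as simple as the following sketch. It assumes steady compound monthly growth, which is an assumption of this example and not something stated in the talk; the function name and the numbers are hypothetical.

```python
import math


def months_until_breach(current, limit, monthly_growth_rate):
    """Estimate how many whole months until `current` first exceeds or reaches `limit`.

    Assumes steady compound growth (an illustrative simplification).
    Returns 0 if the baseline is already breached, or None if the metric
    is flat or shrinking and so never reaches the limit.
    """
    if current >= limit:
        return 0
    if monthly_growth_rate <= 0:
        return None
    months = math.log(limit / current) / math.log(1 + monthly_growth_rate)
    return math.ceil(months)


# Hypothetical: cost per unit of usage is 50 today, the baseline is 100,
# and it is creeping up by 3% per month.
months_left = months_until_breach(current=50, limit=100, monthly_growth_rate=0.03)
```

With these made-up numbers the baseline is roughly two years away, which is the situation described above: not urgent today, but definitely coming, so it can be planned as long-term forced work.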

Unifying Approach

Taking all of these ideas, how do we actually unify them into like a single cohesive approach? One of the things that happens is there are many teams you're planning for, so there's not just like "oh," like a simple fix and a simple stack rank. How do you actually get like an organization of 200, 300 engineers to execute on these sorts of ideas simultaneously? Here's what we do today. I have an investment strategy. I think a lot of times you talk to people about planning and they have like top-down then bottoms up, and they have really complicated systems, and they get like spreadsheets, and they get like some JIRA planning systems. You spend a lot of time on plans that you know you're going to throw away in a couple of weeks after the quarter starts. Avoiding the false specificity of planning and instead investing on work that is actually useful because it helps the teams understand how to make decisions is incredibly high impact.

Planning the complete details when you know there are unknown unknowns out there that are going to cause the specific details to change is just waste. Try to avoid doing it. Investment theses are a great way to navigate this. What we do is 40% user asks, 30% platform quality, which is KTLO, but I think of it as an elevated KTLO, where you feel better about yourself, because you can also be doing long-term work. If your platform is already in a good state and there's no KTLO, you're "Ok, so we'll just do new features only." As you start extending the timeframes, there is KTLO or platform quality work that you're going to need to do, and you have this budget to actually do it before it gets severe. Then 30% key initiatives.

Anything someone asks for that is one of your users, you're going to do it. Sometimes users ask for things, and you're "That's a terrible idea." What that means is you don't understand what the users are doing. Doing it is actually incredibly powerful, even if you initially think it's a bad idea, because that's forced learning to actually understand the user problem better. Even with the things users have asked for that I initially thought were pretty bad ideas, once I've started working on them, it's been clear to me that I've been missing context to their problems. It's never been that the users have asked for something that simply doesn't make sense, it's always the case that the user is asking for something that makes sense in their world, and that I've just been misunderstanding their needs.

Top 70%, these are basically all driven by teams. Individual teams get to pick what to work on. They have the best context, they understand their problems the best, they know their users the best, they can make the decisions the best. We have this interesting problem, which is senior leadership also needs the ability to actually staff large projects. If you do only bottoms-up planning, what happens is you do bottoms-up planning and they do a ton of work and they start working on it. Then I have some sort of crisis like a GDPR or something that maybe is coming externally where we couldn't foresee it. GDPR - very foreseeable, it turns out – but something pretty important, like maybe we're having a scalability problem that we weren't able to project and see in the future. Having this 30% key initiatives budget that we set from the top down lets us work around that problem, where if we have a huge project that comes up, this is a budget that we're able to use to prioritize that.

Then if, for example, you've done the 70% of planning on your team and figured out all the projects to work on, and something comes in, you don't have to dislodge all of the agreements you've made with your customers. One of the challenges of planning is making these commitments, and relitigating these commitments can be the slowest part of planning. Having only me relitigate these commitments, with myself and with my peers, in this key initiative bucket has been incredibly impactful. The other thing is there are certain types of projects that just can't be done at a team level, that need more people than a single team will ever have, and this bucket gives us the flexibility to invest in those projects in a very powerful way.

40/30/30 is, of course, a completely arbitrary set of numbers, and you really have to think about what the constraints and needs are for your organization. If you are really early on and don't have many people running on your software, you probably don't need to spend that much time on the platform quality aspects. Conversely, if you are getting close to end-of-life in your platform, that's probably going to be much more of your time. Don't just cargo cult, take reasonable numbers and iterate. Again, false specificity is the cause of so much grief in planning. The thing I love about 40/30/30 is it's obviously not right, which is powerful because then you don't get caught up on it, and instead, you just try to move towards that level.
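As a rough sketch of how an investment thesis turns into a budget, the split can be applied to whatever capacity unit a team plans in. The bucket names follow the talk; the `allocate` helper and the 60 engineer-weeks figure are hypothetical.

```python
# Investment thesis from the talk, expressed as whole percentages.
THESIS = {
    "user_asks": 40,
    "platform_quality": 30,
    "key_initiatives": 30,
}


def allocate(capacity, thesis=THESIS):
    """Split a capacity budget (e.g. engineer-weeks) according to the thesis."""
    if sum(thesis.values()) != 100:
        raise ValueError("thesis percentages must sum to 100")
    return {bucket: capacity * pct / 100 for bucket, pct in thesis.items()}


# Hypothetical quarter with 60 engineer-weeks of capacity.
budget = allocate(60)

# The first two buckets are chosen bottom-up by teams; the last is set top-down.
team_driven = budget["user_asks"] + budget["platform_quality"]
top_down = budget["key_initiatives"]
```

The point of the numbers is direction, not precision: a team that lands near these proportions is following the thesis, and the percentages themselves should be revisited as the organization's needs change.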

Pulling these all together: technical infrastructure is tools used by three or more teams for business-critical workloads. Firefighting: if you're doing that, limit work in progress, finish anything, and if you can't get out of it from those two, you've just got to start hiring. Nothing else will work. You can quit too. Innovation: listen to your users and just keep listening to them. Never stop. If they tell you to do something that doesn't make sense, it's because you're missing context, not because they're asking for something that doesn't make sense. Navigating breadth: identify the principles, identify the value that you're bringing to your users, to your company. Set a baseline for each one of these. If you've read "Evolutionary Architecture," it's like fitness functions, it's the same idea there. Then plan across timeframes. Don't just think about what you need to do urgently, think about what you need to do so you're not doing work urgently 6 months out, 12 months out.

Finally, to bring it all together, in planning, create an investment thesis that you can pass down to teams. Give them time to think about what to work on and to actually own that themselves. Don't force them into the nitty-gritty details. It's just a waste of time; you will throw that away every time. Don't do it. Then really think about who are your users? What are their needs? What are the baselines that you need to certify to make sure you're serving those needs? What are the timeframes that you need to serve them over?




Recorded at:

Jan 06, 2020