Architecture Should Model the World as it Really is: a Conversation with Randy Shoup

In this podcast, Michael Stiefel spoke with Randy Shoup about how to evolve your software after a failure, and how to improve its resilience by modeling transient states using events and workflows.

Software failure is inevitable, but learning from failure, including making the necessary changes to organizational culture, can make your software more resilient. One of the most important ways to do this is to search for the truth, rather than trying to seek out the guilty. The real world is asynchronous, which means that transient states are important for resilient systems, because they are where failures often occur or compensation has to take place. Workflows and events are the best way to model these systems.

Key Takeaways

  • Software failures are inevitable, but we can learn from them if we are not satisfied with proximate causes. Improbable things happen even in small-scale systems.
  • The best analyses of large software failures cause organizations to make cultural changes that will help in both avoiding and remediating failure. One of the most important ways to do this is to search for the truth, not to search for guilty parties.
  • The real world is asynchronous, and the best way to model it is with events and workflows. Model the world as it really is, not the way you would like it to be.
  • Exposing transient states makes software more resilient because they are often the places where failure occurs or compensation has to take place.
  • The better your model of the real world is, the better your software will be. There will be less cognitive load for the developers to understand the system. They will be happier, more productive, and more likely to want to stay at the company.

Transcript

Introduction [00:29]

Michael Stiefel: Welcome to the Architects Podcast, where we discuss what it means to be an architect and how architects actually do their job. Today we're going to have our first repeat guest, Randy Shoup, who has spent more than three decades building distributed systems and high performing teams.

He started his architectural saga at Oracle as an architect and tech lead in the 1990s. Then served as Chief Architect at Tumbleweed Communications. He joined eBay in 2004 and was a distinguished architect there until 2011, working mainly on eBay's real-time search engine. After that, he shifted into engineering management and worked as a senior engineering leader at Google and Stitch Fix. He crossed the architecture and leadership streams in 2020 when he returned to eBay as Chief Architect and VP for eBay's platform engineering group. He's currently Senior Vice President of Engineering at Thrive Market, an organic online grocery in the United States.

How Do You Learn From Failure [01:33]

Michael Stiefel: When we had our last discussion about failure, there were several topics we could have gone deeper on, and it was suggested that we actually go and do that. So here we are. And one of the things that came to mind based on our previous conversation is how do we get the information about the failures that occur back into the architecture?

In other words, an SRE does a postmortem and finds the proximate causes. But, like anything, the proximate cause is not necessarily the real cause, because there are many things happening. The classic example, of course, is the airline pilot flips the wrong switch and bad things happen to the airplane. That's not necessarily the fault of the pilot, because he may have gotten the wrong information from the dashboard. The switches may have been poorly designed. There may have been assumptions that the plane could never be in a certain state.

So how do we get this information when we've looked at a software failure back to the architect so they can get it back into the architecture and we can build a more resilient system?

Randy Shoup: Cool. Well, thanks for having me back, Michael. It's exciting to talk about this stuff. I know you and I love it, love to talk about these sorts of things.

Michael Stiefel: Absolutely.

Proximate Causes are Not Real Causes [03:03]

Randy Shoup: Yes. So if I could just jump off of your intro there. Yes, I think you and I are going to talk about lots of different things here, but I guess I'll start in the reverse order, which is something bad happened and how do we extract the learnings from that bad thing. And so the typical practice that most of us use, I hope, is some flavor of retrospective or post-mortem. And to your point, while it is a little bit better than nothing to find the proximate cause, like, oh, pilot clicked button X or database collapsed or whatever, it's much more valuable and more interesting and more informative to not just look at the proximate cause, but look at all the other aspects.

And in fact, there's a lot of research and decades of experience in what's called the resilience engineering community. People may be familiar with the write-up How Complex Systems Fail by Dr. Richard Cook, and then Sidney Dekker, who's also in that community, wrote a wonderful book called Drift Into Failure, which is about how, over a long period of time, things can fail more and more.

Let's stick on the post-mortem thing. So for every discipline in which bad things happen, and that includes software engineering, we have discovered and developed this kind of practice. When military folks go out on a mission, they come back and do an after action report. When emergency response people handle an emergency, they come back and do, I don't think they call it a post-mortem, but some flavor of retrospective about that: how could we have fought that fire better? And then ditto in the airlines, where for many decades, it's hard to remember now because it's so safe to fly, but it was pretty terrifyingly not safe to fly in the middle part of the 1900s. And it took several decades of bad things happening and then going back and doing retrospectives to learn a lot about how things happened.

Michael Stiefel: I'm old enough to remember when it was ... Well, it was not routine, but it was not unusual to hear about an airline crash. In fact, after 9/11, there was an airline crash and they had to reassure everybody that it wasn't terrorists. It was just an airline crash. And we've been very fortunate. We've forgotten about those days.

Randy Shoup: Yes. And you're not saying otherwise, but the "fortunate" part, it's one of those "it took me 20 years to become an overnight success" kind of situations, or "luck is where preparation meets opportunity". We've been fortunate, and I'm using air quotes, in the sense that there have been decades of work to engineer resilience into all aspects of the system.

Yes. So let's talk about a post-mortem. People hear about the "five whys". I think that's useful. So this particular thing happened. Okay, well why did the pilot press the button and then kind of go the why behind that and the why behind that and the why behind that. I think that's a hugely valuable exercise to go through.

The framing that I like to use with my teams is fivefold. First, detect: how could we have detected this thing? Then diagnose: once we detected it, how could we have diagnosed it more quickly? Third, how could we have mitigated it? In other words, something bad is happening, how could we have flipped to something else and stopped the failure from getting any worse? Fourth, how could we have remediated it, meaning how could we really have solved the underlying problem? And then fifth, how could we have prevented it in the first place?

And I find that five-step, five-flavor way of looking at it super useful. 'Cause again, a lot of times you'll look at the proximate cause and you'll say, "Oh, well". You and I were chatting before, like, oh, the database collapsed because the transaction log filled up the disk. Okay, give more disk space to it and fine. Well, no, what was it about the log rotation or whatever that didn't happen, and so on. Yes, so I find that framing super helpful.

Google App Engine Failure [06:47]

We were also chatting earlier about what's an example in practice of where something bad happened and we didn't just fix that one problem, but we looked for whole classes of problems to solve. And I can think of a number of examples in my career, but the one that I use all the time is back in 2012, I was running engineering for Google App Engine. So that's Google's platform as a service like Heroku or Engine Yard or Cloud Foundry. And we had an eight-hour global outage. That was bad. That was not one of my favorite days in my professional life. And obviously it was all hands on deck. And the SREs who were amazing did all the things they possibly could, ran all the playbooks which we had had, and nothing was working, nothing was working.

I'm happy to talk about the actual problem that was there, but I'm going to start by saying that after that thing, we didn't just say, "Okay, well, we fixed the one particular cause", which again, I'm happy to talk about now, 10 years later. But it was such a big deal. Snapchat was totally down 'cause Snapchat, at least at the time, was completely dependent on App Engine. The predecessor to Pokemon Go completely down, 3 million other applications including almost everything internally at Google. 15,000 internal applications at Google ran on App Engine. So it was a mess for lots of people around the world and in the company.

And so we basically took the next six months and we took at least 50% of the team to fix all sorts of related reliability issues. And how did we get there? Once we fixed the actual issue that caused the outage and got everybody back going after those long eight hours, we got a little sleep, and then we came back the next morning or the next week, I forget, and brought all the tech leads and folks into a room. And this was when we were all physically in one room, and we had whiteboards all the way around, and we spent several hours just starting out by brainstorming and enumerating all the possible things that we could think of that caused this or any other reliability-related thing.

The prompt was essentially, "Hey, this kind of catastrophic failure, think of all the things that did cause or could cause or you're worried about causing these kinds of things". So we peppered the whiteboards all over and maybe spent an hour, an hour and a half on that. Then we bucketed those into themes. So they very sort of naturally clustered themselves into, okay, well, here are a set of themes about provisioning. Here are a set of themes that are around authentication and authorization. Here are a set of themes around scalability and load management.

And then people volunteered or were voluntold to take one of those themes. So one person would take a theme. And then for a week that person would synthesize what we talked about, talk to other people, think more about it. And then in a week, all of us came back in essentially that same room and everybody had a list of project ideas.

So not like design docs, but literally a line in a spreadsheet, we should do this, we should have replication for the such and such. We should back up the so-and-so. And those line items were just a couple of words or a sentence about what the idea was. Then an order of magnitude estimate, is this a one hour, one day, one week, one month, one year kind of effort? And then we talked about it and we prioritized them and we got going. And again, like I say, for the next six months, we iterated on those things kind of in priority order.

And I'm proud to say that after those six months, we reduced all the reliability issues that we were having in App Engine by 10x. But even more importantly was the sort of cultural result of that where a lot of people knew about these issues but didn't necessarily believe that we would prioritize them, or there wasn't really maybe a mechanism for them to talk about it and brainstorm about it together.

Anyway, so as much as it was great that we solved those reliability issues, we created more of a like, I don't know, resilience culture or reliability related culture within the team. So I'll just pause there and we can drop off that.

Michael Stiefel: That sounds a little bit like a long time ago when Bill Gates at Microsoft said that we're not doing any more development. We're going to go back and look over our code from a security point of view before we go forward.

So the question for you in this scenario is, when you were working on these things and developing this culture, how much clash was there between the new stuff that people ... managers and business people ... want to get done and doing this resilience and failure-proofing work?

Randy Shoup: Yes. The team is the team. So there's always a trade-off between there's only so many hours in the day. We had the benefit of having had such a disaster that it …

Never Let a Good Crisis Go To Waste [11:36]

Michael Stiefel: Never let a good disaster go to waste.

Randy Shoup: Totally. No, that is absolutely true. I think that's Churchill. And if it isn't, it should be. Yes, never let a good crisis go to waste. And that wasn't even manipulative. It was the biggest problem for App Engine, right? 3 million applications around the world relied on us, and we did not meet that reliance and they were justifiably angry. And we paid SLA money back to people. We gave them credits. So it was a big, big deal.

I like to say reliability and performance are P zero features. As much as it's important to have a really good user experience and a really great feature set, if the thing's not up or you don't trust that it's going to be up, it doesn't matter how fancy your UI is or how great your feature set is. I mean, it's obvious in retrospect, but it was just so clearly the most important thing for us to work on.

And also we had other things we wanted to do. So it was about 50% of the team that worked on this effort for six months, and the other 50% did other stuff. But yes, no, I mean it was a real trade-off. I guess I had a little bit of the benefit that I was running the engineering part of it, and so I got a vote at the table about what we did. But also everybody up the management chain above me, and Google's big, so there were many people above me in the management chain, they all 100% agreed that this was the right thing to do.

Find the Truth – Do Not Look For the Guilty – You Cannot Do Both [12:52]

Michael Stiefel: It would be interesting to know, and maybe you don't want to say, but I presume people understood that these types of failures, and this learning from failure, were not necessarily a failure on your part, but just the nature of the way software is constructed. It's very important in these situations because one of two things can happen. You can either try to punish the guilty, in which case everybody will engage in a CYA exercise and you'll never get to the truth. Or you can do what the airlines did, as you mentioned before, and say, "No, we're not going to find the guilty or innocent. If there's a crash, the airlines will pay a certain amount of money, but we're not going to argue who is exactly at fault, so we can get the truth and fix things".

Randy Shoup: Yes. In general, and also in that specific case, it was entirely blameless. And there's a wonderful thing about, I mean all these resilience cultures have all again co-discovered this idea that you just say, which is when you look for a throat to choke and somebody to blame, that does not make the system safer and makes the system much less safe for exactly the reason you say, which is it doesn't make people work harder at making things correct. It makes people work harder at trying not to be the person that gets blamed, and that's human nature.

So you absolutely need this retrospective or post-mortem after action report to be entirely blameless. We need to be objective and understand what actually happened and think about it as improving the system rather than blaming the people.

So yes, we absolutely had that situation. And when you have a situation like that, when people go into one of those post-mortems, as we all did in that room, and they know that it's going to be blameless, they're almost volunteering that it was their area.

So I'm remembering this very specifically 'cause there was a bunch of services that were involved. So one service is like, "Oh, yes, man, you're right. We weren't provisioned for this. We ran out of not disk space, but we ran out of resourcing and that was a real problem and we kind of knew that, but we didn't take care of it. Oh, bummer". And the other team's like, "You think that was bad? Our application wasn't working and where we were taking back", and the other team is like, "Hold on there".

Michael Stiefel: This is a confessional. It's like the sinners coming forth at the meeting.

Randy Shoup: It really is. And here's the wonderful place in the engineering heart where it comes from, where it's like, finally, all these things that I know about this system, finally we're going to do something. Do you know what I mean? I've known or suspected that there were dragons over here or whatever. This part of the system wasn't very good, or it wasn't very performant or reliable or whatever. And finally, finally, finally, Randy, you're giving us the opportunity to prioritize it. And thank God, because I've been complaining about this for months, or at least in my head.

Again, when you can tap into that part of the engineer-heart, which is we just want to make things better, for the most part. We get into this job 'cause we want to make the world a better place. We like to take broken things and fix them. We like to take things that work and make them work better. Anyway, so you absolutely can tap into that if and only if it's a blameless conversation.

And yes. So I mean, again, we had the benefit of it being so terrible that we could prioritize it. But we also had the opportunity to then have conversations with other teams that we interacted with, where they had been like, "No, we can't give you more resources". And guess what? Remember how we were asking for that? We actually did need it. It wasn't disk space, but we actually did need the resourcing. We actually did need the compute there. Anyway, so all good.

Michael Stiefel: It sounds, if I could summarize, like the blameless culture is important. It's important to give some degree of priority to these things because, as you say, if the system is down, it's not performant, it's the zeroth-order lack of performance. And it's not only a question of being up or down, of course. It's also a question of how responsive it is. I mean, the important thing is sometimes failure is not complete breakage, but failure is degradation to the point where the system is not usable.

What Happened with the Google App Engine [17:11]

Randy Shoup: Yes. Well, actually I promised I would say what actually happened. And okay, here's what actually happened. The way that App Engine was designed at the time, the persistence layer was a ring of five data centers, and a given one of those 3 million applications run by App Engine would have a primary data center that it would be served out of. So Snapchat would be served out of data center A, and, whatever, Ingress, which is the predecessor to the Pokemon Go game, would be served out of B. And Snapchat at the time, this is 2012, was already a household name, and they still had five engineers or some crazy small number, but they were growing like crazy. And they basically were as big as a data center for us. In order to serve the Snapchat traffic, we needed to give them almost exclusively an entire data center. Oh, and again, our limitation, which was entirely architectural and on us, is that we could only serve an application from one data center. We could move them, but they could only be served from one at any given moment.

Something happened, and I forget the details, it's not important, but something like slowed down or took a tax on our resourcing and we had to take one of the data centers out of rotation essentially. And that was normal. Everything would get migrated to their secondaries and everything is fine. But then we had to move Snapchat and it couldn't go anywhere. We would take it from A, and it went over to B and took everybody else out. And then, okay, that's not good. So then it moved to C, this was all automated, took everybody else out. And no fault of Snapchat. It's entirely on us, but it was a cascading failure.

And the reason why I always bring it up A) is 'cause I promised to, but B) you were saying failure and degradation are like a spectrum. And this was absolutely that situation where that eight hour global outage, I mean this is going to sound like I'm splitting hairs, but actually not everybody was down the whole time. Do you know what I mean? Because it was a cascading massive degradation that essentially made everybody unresponsive. But they weren't down. They were just super, super slow 'cause the system was moving everything around.

So for seven and a half of the eight hours, we were trying every trick that we had in our playbook to deal with it, and nothing was really working, 'cause we essentially didn't have enough resourcing to serve what we needed to serve. So finally, what we did in the last half hour, we had a thing that had been built where you could shut the entire world down and spin it back up. And I don't know that it had ever been tried.

Michael Stiefel: Essentially reboot the computer.

Randy Shoup: Reboot the world, reboot all of App Engine globally. And we did that, and within half an hour we were back. So A) I'm super proud of the people on my team, and it was before I got there, who had built that reboot thing, because the reboot thing worked gorgeously. And the funny thing in retrospect, hindsight being 20/20, is that if, as soon as we detected this, we had simply rebooted the whole world, we would've only been down for half an hour as opposed to eight hours. So we tried all these mitigations essentially, and then we were like, "Well, we have no other thing to try. Let's just turn it off and turn it on again".

Michael Stiefel: There's an old saying that very often the right thing gets done when there's nothing else left to do.

Randy Shoup: Yes. That's actually Churchill about the Americans, I believe.

Michael Stiefel: Yes. Right, right, right. The Americans will always do the right thing after they've tried everything else.

Randy Shoup: Yes. Yes.

Improbable Things Happen Even At Non-Google Scale [20:29]

Michael Stiefel: So this kind of thing is completely fascinating to me because A) it's real world and it's not theoretical. And it's not only for big companies, because for small companies it might be something as simple as the Wall Street Journal or the New York Times mentioning your little chocolate shop and all of a sudden it blows up on you. So people should not think, "Oh, this is only Google scale". This can happen anywhere along the continuum where you have, as in this particular case, some sort of resource contention problem, where you just don't have enough resources.

Randy Shoup: Yes, and you have to be resilient to that. And the thing is, we were resilient to, "the normal classes of resource ... " I mean, in retrospect, it's obvious. We had never designed for the case where there was one unit that was so much bigger than everybody else that if you moved it, it would stomp on everybody else like a massive elephant or something.

Michael Stiefel: It reminds me of the old saying, "Idiot-proof software usually fails to postulate a resourceful enough idiot".

Randy Shoup: Yes. I take the point. In this case, as you well know, it wasn't an idiot as much as a thing we hadn't thought through. Yes, that case seemed so insanely unlikely. What do you mean the entire world of App Engine is like one fifth taken up by one application? Like what? How could that happen?

Michael Stiefel: We talked about that in the last podcast, where very often people don't think of those really small-probability things; the "it will never happen" thing happens often enough.

Randy Shoup: Yes. I mean, the real answer, which is what we did implement, is make it so we can serve one application out of multiple data centers, and all good. And there was a lot of complexity associated with that sentence I've just said, but that's the correct way to do it. 'Cause if you have a situation, or when you have a situation like that, where it's kind of a hot key in a key-value store or something like that, to the extent possible you want to spread the load of that thing broadly across the entire system, and that's the correct, way more resilient solution to the situation. Yes.
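
As a hedged illustration of that "spread the hot load broadly" idea, here is a minimal Python sketch of key salting over a fixed number of shards. This is not App Engine's actual mechanism, which was far more involved; the shard count and function names are invented for the example.

```python
import random

NUM_SHARDS = 16  # invented for the example: split one hot tenant across 16 partitions

def write_key(app_id: str) -> str:
    """Write path: salt the key so one huge application's traffic is
    scattered across many partitions instead of landing on a single one."""
    return f"{app_id}#{random.randrange(NUM_SHARDS)}"

def read_keys(app_id: str) -> list[str]:
    """Read path: the reader fans out across every salted key and merges the results."""
    return [f"{app_id}#{i}" for i in range(NUM_SHARDS)]

# Writes for "snapchat" scatter over snapchat#0 .. snapchat#15, so no single
# partition (or data center) has to absorb the whole elephant at once.
print(write_key("snapchat"))
print(read_keys("snapchat")[:3])
```

The trade-off, as Randy notes, is added complexity: reads now have to fan out and merge, which is the price of not having one unit that can stomp on everybody else.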

Michael Stiefel: So to summarize, it's the blameless culture, it's the willingness to actually look for the real causes as opposed to the proximate causes. And there you have to be careful, when you ask questions, not to assume answers in your question. You don't want the software equivalent of "when did you stop beating your wife", where the question implies that this was a failure of some sort. And I've seen that happen very often, where there's an assumption in the question, and it's really the assumption in the question that has to be challenged. And it also needs the willingness of management, both engineering and executive, to take the time to prioritize, in an important way, folding that learning into changing the architecture, because you have to be able to do those three things.

Randy Shoup: Yes, I agree 100%. And let me just add a little more detail. So your second one, about thinking about what we could do better and not limiting yourself to the proximate cause but expanding it. Again, as I was mentioning with that post-mortem-y thing, we brainstormed really broadly and really opened up our minds to all the possible causes and related things. So we opened the aperture up really widely. And then the other thing, which is part of your third point, is that once we decided we were going to do it, we were going to do it in priority order.

So there's this expand, but then you have to contract again. Expand: my prompt to the team was, expand your brains as widely as you can and think of all the things that could possibly go wrong. And then obviously the next thing is, we're not going to work on them all in parallel. They're not all equally valuable or useful. Instead, we have to be really disciplined and really systematic about prioritizing the correct things. And the reality is, we came up with, I'm making this number up and it's too small, but we came up with a hundred things, but then we prioritized the top three, and then we got those out, and then we did the next three or the next 10 or something like that.

And then I guess the other aspect of that third section is being very iterative in solving those things. You're not implying otherwise, but we didn't go into a hole for six months and come out afterward with some perfect system. Instead, we said, "Okay, well here are things that are both really terrible when they go wrong and really easy to fix", and we solved those really quickly. And then we just iteratively or incrementally kept adding improvements to the system, if that makes any sense. So we did it very, very iteratively.

Michael Stiefel: So after this experience, did you feel that the team became much more sensitive to bringing these issues up as they saw them?

Randy Shoup: Oh, absolutely. I mean, it brought us all together and no one should ever manufacture a crisis, and this wasn't manufactured, but the reality is how humans work is shared ...

Michael Stiefel: Trauma.

Randy Shoup: Trauma, I was going to say trauma, but it wasn't traumatic in an emotional sense; it was just not very fun. But we were all in it together. Even when we were in it, we were all in it together. We weren't apart before. It was a pretty close-knit team, but we were really close after it. And again, the cultural thing that you're touching on that it enabled was giving voice to people's spidey sense going, "I don't know that that's going to work", or, "That area over there makes me uncomfortable and I'm not entirely sure why". Now everybody's raising their hands and going, "Hey, guess what? That X, Y, Z service over there that we use for this and that, I'm not comfortable with how that works", and they're open to raising it up. And we also had a pretty decent prioritization framework at that point, of like, "Oh, okay, that does sound like that's important", or, "Yes. You know what? That's kind of a third-level priority after we solve all these other higher-level things".

Cultural Improvements Are Most Important [26:24]

But yes, no, I mean the cultural improvements were way more important in the long term than solving the distribution of applications over our data centers.

Michael Stiefel: And did the SREs sort of feel better, because they could say that these software development people and the architects are really taking what we say seriously?

Randy Shoup: Yes, thanks for saying it that way, because the SREs were always part of the team. People knew, oh yes, Michael, or whatever this guy's name was, was on the SRE side and John is a developer, but we were all in it together. So when I talk about the people being in that room, the SREs were all there, as were all the tech leads and senior folks on the standard software side, if you like, the application-building side, and everybody. So we were always all one team.

One story I love to tell about, again, working on that team, unrelated to the outage, was the Google facilities people, like the people that decide, okay, we have these buildings and where are you going to be in the buildings, came to me at one point and they were like, "Hey, tell me the growth of your team over the next couple of years, 'cause we're trying to figure out space planning over time".

Cool, great. So I told them my thoughts. And then they were like, "Oh, who are these other people that are near your team?" Like, "Oh, they're SRE". "Do they need to be with you?" "Oh, absolutely, SRE, they're part of our team. I know they don't report to me, but they need to be there". "How about these other people?" "Oh, that's product. They absolutely need to be with us". "How about these other people?" "Oh, they're support for our team. They actually don't need to be on our same floor, but they need to also be close in the same building".

So the thing that I came away from that experience was even though I knew the product people didn't report to me and the SREs didn't either, we were all just one team. We didn't think of each other as we reported to separate VPs "because reasons", but we all thought of ourselves as one thing. So when I talk about the App Engine team, yes, that's actually four or five different sub business units in Google, if that makes any sense.

Michael Stiefel: Or, to use the buzzword, cross-cutting concerns.

Randy Shoup: Absolutely. Cross-cutting concerns, cross-functional team. Yes, absolutely.

Happier, Healthier, and Long Lasting Teams [28:26]

Michael Stiefel: I mean, this is a really, really interesting, educational and instructive story, and I hope our listeners take it to heart and see how they can incorporate these things into their teams, because an ounce of prevention is worth a pound of cure. And I will tell a not-quite-related, but somewhat similar, story that makes the point.

Many, many years ago I was in charge of a team that developed a database, and I remember, this was way before it was popular, we put in regression tests. Now everyone talks about regression testing. We're talking the 1980s, and I insisted on having regression tests. So it's not that we didn't have bugs in our software. We found them and we made sure they stayed fixed.

So one time there was this big crisis at the company. I sent my team home and I stayed, because it would have been politically not good if I wasn't there, but everybody else was in a panic and I wasn't. In other words, my team went home at night and they were not spending late hours. So the point that I wanted to make, even though it's not directly related, is if you take these things seriously, they become a quality-of-life issue for your team, and that makes the team happier.

Randy Shoup: Yes, I think it is related, and it's related in this way. We're talking about systems. And I talk about the worst possible thing that could happen to the system, which is the entire thing is down. And then we did feedback. We instituted a feedback mechanism, which was do this post-mortem for six months, make reliability improvements to the system. You're not saying otherwise. You never want to get it to that level. If the only time you ever worry about reliability is when you're down for eight hours, that's not great.

But the thing you're talking about is a systems related thing, but you had set up a feedback loop that was way tighter and therefore healthier. So the feedback loop that you set up was, hey, every time you built a thing, you wrote some tests to make sure that that thing didn't break, and every time you broke it, something escaped your tests, you added a new test and you know where we go back from there.

And you're not saying otherwise, but I think both those things are really two ends of the spectrum. They're all about things happen and we need a feedback loop to keep them effective, essentially, right? To keep them resilient or good or high quality or high performance, whatever word you want to use. And you're not saying this, but I'm going to riff off it anyway, the next, going even farther on the spectrum in the direction that you were saying is, "Okay, well, it's great that we test at the end, but what if we tested at the beginning?"

So now we're in TDD, and we're making sure that, as developers, in our inner loop of just adding a new feature or fixing a bug, we're also adding to our feedback mechanism in the system and getting that super tight loop to make sure we don't break anything.
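
A minimal sketch of the feedback loop being described here, assuming pytest and a purely hypothetical pricing module: the first test pins the behavior the feature was built for, and the second is the regression test added after a bug escaped, so the fix stays fixed.

```python
# test_pricing.py -- run with pytest (module and function names are invented)
from pricing import apply_discount

def test_discount_is_applied():
    # The behavior the feature was originally built for.
    assert apply_discount(price=100.0, percent=10) == 90.0

def test_discount_never_goes_negative():
    # Regression test added after a bug escaped to production:
    # a discount over 100% once produced a negative price.
    # Keeping this test means nobody has to panic late at night again.
    assert apply_discount(price=100.0, percent=150) == 0.0
```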

Michael Stiefel: But I think that's also important, I think, for the quality of life of your team. In other words, if you are proactive about resilience in whatever form it takes, your team will have a better home life and then they'll do better at work and they'll stay with your company because they're happy.

Randy Shoup: Yes, and you're exactly saying this. If the only thing you cared about was their work outcomes and work output, you would want them to be happier and well rested. That's not always people's behavior, but it should be. If you actually objectively want to build the best software and make the most money, the thing that you want to do is make sure that you have designed in, through all these practices, these feedback loops and improvements in quality, and have a happier team. And the Accelerate book and the DORA State of DevOps research, all this stuff, prove this over and over again: happy, well-rested, sustainable teams aren't only the best for the humans, they actually produce the best software and the best business results.

Transactions, Software, and Sagas [32:42]

Michael Stiefel: So another thing that we didn't get enough time to go into last time was transactions, software, and sagas. And the way I sort of want to approach this is by looking at the middle tier. To use the often-cited and perhaps often-abused restaurant analogy, the back end is like the ingredients, and the front end is where the servers present the meal to the customer, but it's in the middle tier where the meal is actually cooked, where you take the simple elements, the single sources of truth and the atomic operations, and put them together in a way where there is resilience, there is orchestration, there is choreography, and all the things that happen with the business logic to actually present this to a user. I'm curious what you think about that analogy and how it relates to things like sagas and transactions in software.

Randy Shoup: Yes. As we were talking earlier, yes, I think the restaurant analogy is a pretty rich one. Depending on when this comes out, the Vibe Coding book from Gene Kim and Steve Yegge will be just about to be on the shelves. And the restaurant and chef analogy is one that they use really extensively for what's an effective way for a developer to use a bunch of AI coding assistant agenty things.

So yes, lots of rich metaphors to be mined in the restaurant analogy. As you're kind of saying, the interesting thing about the system is happening in those, I'll use different words, those domain services, essentially, that execute the actual business logic and do the stuff that the users care about. And to your point, you're mentioning sagas and workflows. One of the things that I have found, and you and I have both been in the industry for a long time, in the whatever, 37, 38 years that I've been doing this kind of stuff, is that I think we miss out when we don't think about things as workflows or events. That's a tool in the architect's toolbox that I don't think is used as much as it should be. And it's a way of reifying a thing that happens and making sure that when that thing happens, other interesting things happen.

So at eBay, when I was there the first time, we finally built ourselves an eventing system. We'd been trying to justify doing it for a while, and I didn't build it personally, but I got the benefit of using it. And as soon as we made available the concept of a new item getting added into the system, or a bid being placed on an item, all of a sudden, boom, within a year there were 10 different consumers all listening to the item.new event, all doing interesting, totally different things, and whatever, 10 other consumers listening to the item.bid event. So I just think we should definitely dig into this more, and go where you want to go with this, but I just feel like events are such a critical tool in our architectural toolbox and really make it a lot easier and more natural to think about how systems actually work and how we would like them to work.
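
As a sketch of the pattern Randy is describing, and not eBay's actual eventing system, here is a toy in-process event bus in Python with several independent consumers subscribed to the item.new and item.bid events; the handlers are invented stand-ins for search indexing, fraud checks, and so on.

```python
from collections import defaultdict
from typing import Callable

# A toy in-process event bus standing in for a real eventing system
# (the production infrastructure would be distributed and asynchronous).
_subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

def subscribe(topic: str, handler: Callable[[dict], None]) -> None:
    _subscribers[topic].append(handler)

def publish(topic: str, event: dict) -> None:
    for handler in _subscribers[topic]:
        handler(event)  # in production each consumer runs independently, off a queue

# Hypothetical consumers, each doing something different with the same event:
subscribe("item.new", lambda e: print("index for search:", e["item_id"]))
subscribe("item.new", lambda e: print("run fraud checks:", e["item_id"]))
subscribe("item.new", lambda e: print("generate thumbnails:", e["item_id"]))
subscribe("item.bid", lambda e: print("notify seller of bid on:", e["item_id"]))

publish("item.new", {"item_id": 12345, "title": "vintage camera"})
```

The point of the pattern is that new consumers can show up later without the publisher knowing or caring, which is exactly how ten different listeners accumulated within a year.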

Why People Have Difficulty Understanding Events and Workflows [35:55]

Michael Stiefel: 'Cause I find that events are things that people sometimes have a great deal of difficulty dealing with, because they're not natural to them. How would you get people to start to think about events? I mean, in the restaurant analogy, you could come up with examples, but how do you get people comfortable with the idea of events?

Randy Shoup: It's a really good question.

Michael Stiefel: And I think it's also related to the difficulty people have with things happening asynchronously.

Randy Shoup: Yes, it absolutely is related to asynchrony. The thing is, and I struggle a little bit with why people struggle with it, to be honest, because it's how the world works. The world isn't synchronous in the software sense of synchronous. Literally, you and I are talking with each other, and our mental model might be that we're doing something synchronous, but actually it's this connection of asynchronous events. I say a thing. It travels across a continent to where you are. You hear it, you react to it, you say words out of your mouth. They travel the, whatever, 3,000 miles, 5,000 kilometers, whatever, across the continent back to me. And so the world is made up of these asynchronous events.

Michael Stiefel: Well, we're constantly waiting for things to happen. The mail to come, reports to come, dinner to get ready. With all these things, you never block waiting on anything in the real world, or rarely do. Rarely do.

The Real World is Asynchronous Whether You Like It or Not [37:16]

Randy Shoup: Yes, exactly. There's no concept of blocking. Yes. So I'm both listening and speaking, and I'm drinking water and my autonomic systems are pumping blood and I'm breathing and some of these things I'm not consciously aware of, but yes.

Anyway, so you were talking about the power of the middle tier, or the power of this sort of business layer, and yes, I think that's where the magic happens. That's where the interesting thing happens. And it's true at every scale, but particularly at larger scale, you just can't hold the whole system in your brain. And if you try to think about the system as one big synchronous set of stuff, A) that's not how it works, but B) that's not how it could possibly work.

Instead, you have to think of, again, to go back to the eBay analogy 'cause it's pretty visceral for people: okay, somebody adds this new item to the site. Then it's not that it immediately shows up in search. We did a lot of work, so it took only a couple of seconds, but it doesn't transactionally end up in the search engine. It doesn't transactionally check the item for fraud. It doesn't transactionally thumbnail all the images. It doesn't transactionally store X and Y and Z in all these places. All those things, really, at scale can only happen asynchronously.

So in a system like that, you want to remember, when somebody clicks the button that says submit this new item 'cause I want to sell it, you want to do the absolute minimum possible at that moment and ack back to them and say, "Yes, great. Your item ID is one, two, three, four, five. Thank you very much". And then all this other stuff that absolutely has to happen happens in the background.

Michael Stiefel: I think to emphasize this in people's minds is, there's a user interaction here, and you can't put a database lock where there's a user interaction in the middle. That's just asking for collisions and conflicts.

Randy Shoup: Yes. Yes. Tell me how I know that. Or ask me how I know that.

Michael Stiefel: How do you know that? Well, we're doing true architectural confessions.

Randy Shoup: Oh, true architectural confessions. Okay, well, I will do it. Yes, so the way I describe it at eBay is how it works now and how it worked 20 years ago, where obviously there was a database transaction, but it was auto-committed, so it was pretty fast. And yes, essentially add a new record to the items table, more or less. And then also, as part of doing that, transactionally queue up an event that gets listened to by all these different consumers, and they all do stuff in parallel.

Well, that's how this stuff should work. And that's how whenever you click place an order on your favorite e-commerce site, that's exactly how it should work where all we are remembering at that moment is that Randy placed this order for X and Y and Z things in his cart, and after that we do all sorts of other fraud checks and execution of things and charging payment methods.
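
One common way to express "insert the record and transactionally queue up the event" is the transactional-outbox idea; here is a minimal sketch using SQLite from the Python standard library, with invented table, column, and event names, not eBay's or anyone's actual schema.

```python
import sqlite3, json

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items  (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, topic TEXT, payload TEXT);
""")

def place_item(title: str) -> int:
    """Do the absolute minimum synchronously: insert the item and the
    'item.new' event in ONE transaction, then return to the user."""
    with conn:  # commits both inserts atomically, or neither
        cur = conn.execute("INSERT INTO items (title) VALUES (?)", (title,))
        item_id = cur.lastrowid
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("item.new", json.dumps({"item_id": item_id, "title": title})),
        )
    return item_id

# A separate background process would drain the outbox table and deliver the
# events to search indexing, fraud checks, thumbnailing, and the rest.
item_id = place_item("vintage camera")
print("acknowledged to the user immediately, item id:", item_id)
```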

Michael Stiefel: And checking whether it's actually really in stock.

Randy Shoup: Yes, there's usually a multi-step for that. So before you show it to me, there's some check whether synchronous or otherwise, but then you really, really reserve it typically.

Michael Stiefel: Yes. You see, my favorite example of that is on Amazon, where they say you have five copies left. Now, that's a cached value. You don't want to put a database lock on that, because what if you get called away to dinner in the midst of your order session? It could be in your basket for hours. So what happens, of course, is that the order gets dispatched on a queue, and then they find out whether they actually have it, and then you get the email. It's only when you get the email that your book order really happens.

Randy Shoup: Yes, that's absolutely true. And so the mental model here, just to use techie words is it's not like one transaction, it's a workflow or a state machine or a saga, and various individual steps of those are transactional and are atomic and happen or don't happen. But there's this whole, yes, state machine and workflow of, okay, whatever, we created the order and then we charge the payment method, and then we reserved the inventory and then we tick, tick, tick, tick, tick. And to your implicit point, any one of those steps could fail along the way. And so like, oh, I failed to reserve the inventory because hey, those five books were already gone. I have to go back and I have to apologize. I have to say, "Hey, sorry Michael, I'll give you a refund", or, "I'm going to get it in stock again in a week", and whatever, or that kind of thing. Yes. So true confessions.
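
A minimal sketch of that order saga, with invented step names: each step is atomic on its own, and when one fails, the steps already completed are compensated in reverse order, for example refunding the payment when the inventory can no longer be reserved.

```python
def create_order(order):
    print("order created")

def charge_payment(order):
    print("payment charged")

def reserve_inventory(order):
    raise RuntimeError("those five books were already gone")

def ship_order(order):
    print("order shipped")

def cancel_order(order):
    print("compensation: order cancelled, customer notified")

def refund_payment(order):
    print("compensation: payment refunded")

# Each saga step is paired with its compensating action (None if there is nothing to undo).
SAGA = [
    (create_order,      cancel_order),
    (charge_payment,    refund_payment),
    (reserve_inventory, None),
    (ship_order,        None),
]

def run_saga(order):
    completed = []
    for step, compensate in SAGA:
        try:
            step(order)
            completed.append(compensate)
        except Exception as err:
            print("step failed:", err)
            for comp in reversed(completed):  # undo what already happened
                if comp:
                    comp(order)
            return "failed"
    return "completed"

print(run_saga({"items": ["book"] * 5}))
```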

One of the things that we're working on at Thrive Market, which is an e-commerce site as we mentioned before, is taking apart a monster database transaction that happens when somebody clicks place order into a workflow. And so that's something that we're actively working on.

So again, the way it should work is you do some preparatory work. When people put stuff in the cart, you make sure that at least it's available at that moment. And to your exact point, when we place an order, or almost immediately "real soon now" after we place the order, we actually reserve the inventory and so on. But all that stuff today, and this is what we're working on, one of the things we're working on, is all that stuff today happens in a monster database transaction, and it really needs to be spread out over time with failure management and retries and the proper workflow and all that.
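
The "failure management and retries" part might be sketched like this: each workflow step gets bounded retries with exponential backoff before the workflow gives up and falls back to compensation. The function and parameter names are invented, not Thrive Market's actual implementation.

```python
import time

def with_retries(step, *args, attempts: int = 3, base_delay: float = 0.5):
    """Run one workflow step with bounded retries and exponential backoff.
    Only after the retries are exhausted does the surrounding workflow fall
    back to compensation (refund, apologize, restock in a week, ...)."""
    for attempt in range(1, attempts + 1):
        try:
            return step(*args)
        except Exception as err:
            if attempt == attempts:
                raise  # let the surrounding saga compensate
            delay = base_delay * 2 ** (attempt - 1)
            print(f"{step.__name__} failed ({err}), retrying in {delay}s")
            time.sleep(delay)

# Hypothetical usage inside the order workflow sketched above:
# with_retries(reserve_inventory, order)
```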

Michael Stiefel: And of course, there's the problem of what happens if in the actual packaging of it, they take the last head of lettuce, drop it on the floor, and it's all dirty and it can't be sent.

Randy Shoup: Yes. Conceptually, this entire thing is a workflow. The e-commerce site itself, all the complexity of what my team builds and maintains, that's only the tiniest bit of this broader workflow of, okay, we agreed that we will send Michael this head of lettuce. Okay, well, now we actually have to go pick that head of lettuce out of a refrigerator or whatever, and then we have to put it in a box, or you have to wrap it. And then we're not even done. Okay, we put it in a box, now it goes onto the shipping carrier, and it goes in a truck and it wends its way to Michael's house. And all these things are, as you know, entirely this very long, logical workflow where there could be failures or delays or whatever all along the way.

And yes, we do in real world situations, we need to design mitigations or fallbacks or resilience mechanisms, whatever you want to say for all these things. Like, hey, what if it arrives all the way to your house and it's damaged? Okay, well, you tell us that and we refund you or we ship you another one, or both, that sort of thing.

Exposing Transient States Makes Software More Resilient [43:46]

Michael Stiefel: Yes. I think this goes back to what you were talking about before when you were saying don't hide transient states.

Randy Shoup: Totally, yes, when we were talking last time. I was thinking the same thing as you were saying this. A huge lesson that I have learned over and over again in these 30-some-odd years, and it sounds so obvious and trivial but it really is deep, is: model the actual world in the software, not how you would like the world to be. I would love it if, as soon as I ordered a thing on Amazon or on Thrive Market, it atomically and immediately showed up on my doorstep. And that's not the world. The world is that it gets reserved and picked and packed and shipped and dropped off. And model that. Pretending that those states don't happen is where software goes wrong, pretending that there's something atomic about an order. No, there are all these things, and any one of them could fail.

And having the outside, whether it's a web page that shows the status of my order or whatever, be able to ask, "Hey, where is Randy's damn order?" "Oh, well, it is with OnTrac", the shipping carrier, "and it is between the Reno warehouse and where Randy lives". That kind of idea of being able to, again, expose those transient states, or be able to see the intermediate steps in the workflow or the intermediate states in the state machine, whatever metaphor you want to use, is absolutely critical to building systems that work. And that makes sense, 'cause it's the world.
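
One way to make those transient states explicit and queryable, as a hedged sketch with invented states and transitions rather than any real system's model, is a small order state machine that the outside world can always ask about.

```python
from enum import Enum, auto

class OrderState(Enum):
    PLACED = auto()
    PAYMENT_CHARGED = auto()
    INVENTORY_RESERVED = auto()
    PICKED = auto()
    PACKED = auto()
    WITH_CARRIER = auto()
    DELIVERED = auto()
    FAILED = auto()

# Legal transitions: the transient states are modeled, not hidden.
TRANSITIONS = {
    OrderState.PLACED:             {OrderState.PAYMENT_CHARGED, OrderState.FAILED},
    OrderState.PAYMENT_CHARGED:    {OrderState.INVENTORY_RESERVED, OrderState.FAILED},
    OrderState.INVENTORY_RESERVED: {OrderState.PICKED, OrderState.FAILED},
    OrderState.PICKED:             {OrderState.PACKED, OrderState.FAILED},
    OrderState.PACKED:             {OrderState.WITH_CARRIER, OrderState.FAILED},
    OrderState.WITH_CARRIER:       {OrderState.DELIVERED, OrderState.FAILED},
}

class Order:
    def __init__(self, order_id: str):
        self.order_id = order_id
        self.state = OrderState.PLACED

    def advance(self, new_state: OrderState) -> None:
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state

    def status(self) -> str:
        # The outside world can always ask "where is Randy's order?"
        return f"order {self.order_id} is currently {self.state.name}"

order = Order("12345")
order.advance(OrderState.PAYMENT_CHARGED)
order.advance(OrderState.INVENTORY_RESERVED)
print(order.status())  # order 12345 is currently INVENTORY_RESERVED
```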

Michael Stiefel: And that is why, just to build on the point that you just made, workflows and sagas are so important: they're resilient to failure because, if those hidden states are exposed, you are in a position to do something, because you know what could happen.

To go back to your Google example from before, this is someone saying, when you're designing the software, here's a point where something could go wrong, and here's a point where something could go wrong. And you can put that in the workflow and you can put that in the saga.

Randy Shoup: Yes, 100%. All these things are additional advantages of this approach. A) You can encode failure in very naturally. B) It models the world. It's the truth, as opposed to what we would like it to be. C) It's abstracted in a useful way, right? It's clean. And I guess the related fourth thing, well, overall it's abstracted, but what it also means is those steps are abstracted, right? Charging the payment, at the slightly higher level of how the workflow works as I'm describing it in words, is like, oh, we charge the payment method. Well, as you well know, particularly you, there's a huge amount of complexity with all these banks and all of that. And at the level we're talking about, I don't care: charge the payment method. It works or it doesn't. And then ship it to me.

At some level somebody goes, "I don't care about the details". But no, there's an actual human and there's an actual warehouse and there's an actual truck and there's an actual ... All this complexity is associated with that, but it's all hidden. Like, ship it.

And so the gorgeousness, I think, of thinking about things as workflows is it gets you all these things. It matches the world, it encodes failure, and it abstracts things at the appropriate level of abstraction.

Before we close, I want to give a plug to what I think is the absolute best programming framework for thinking about workflows, and it's called Temporal.

Michael Stiefel: Actually you mentioned it last time, but you said you were looking into it.

Randy Shoup: I mentioned it last time and I still believe it. Yes. So we're actively using it. Now, again, as I'm sure I mentioned last time, every Snapchat story is a Temporal workflow. Every Coinbase transaction, every Stripe transaction, it's used everywhere in things you do all the time. It's open source. It models what would otherwise be a saga with events in a very clean way. So for those folks that are listening and wonder how one might do this: there are a million workflow engines out in the world, and I have friends that have developed many of them. This is the best one.
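
For the curious, a minimal sketch of what a couple of those order steps might look like with Temporal's Python SDK, to the best of my understanding of that API; the activity names are invented, and a real workflow would add retry policies, compensation, and signals. Treat it as a sketch rather than a verified example.

```python
from datetime import timedelta
from temporalio import activity, workflow

@activity.defn
async def charge_payment(order_id: str) -> None:
    ...  # call the payment provider here (hypothetical)

@activity.defn
async def reserve_inventory(order_id: str) -> None:
    ...  # talk to the warehouse / inventory service here (hypothetical)

@workflow.defn
class PlaceOrderWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> str:
        # Each step runs as an activity: retried on failure, with the
        # workflow's progress durably recorded between steps.
        await workflow.execute_activity(
            charge_payment, order_id,
            start_to_close_timeout=timedelta(seconds=30),
        )
        await workflow.execute_activity(
            reserve_inventory, order_id,
            start_to_close_timeout=timedelta(seconds=30),
        )
        return f"order {order_id} placed"
```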

Michael Stiefel: Okay. But again, the reason why you want to model all these things in workflows, or use an engine like the one you suggest, is because then you have the opportunity to respond to failure. You can't respond to what's hidden away.

Randy Shoup: Totally, totally. Yes. Going back, back all the way to the beginning: you can detect a thing, you can diagnose it, 'cause again, you've got this abstraction. You can mitigate or remediate situations by retrying, or by giving up, or by doing something different. Yes, so it makes it very natural to handle all these kinds of things.

Resilience Means Modeling the World As It Is, Not As You Want It to Be [48:25]

Michael Stiefel: And it goes with the idea that we talked about last time, that resilience is not a castle with a moat and alligators and a drawbridge. You have to assume that things happen inside. Invaders will get inside, and how do you respond to them?

Randy Shoup: Yes, yes, absolutely. Cool.

Michael Stiefel: Well, even to the extent that we've talked about this, we could even go deeper. I hope the people who listen will get a feeling for how difficult and important this is, because you really have to think about what's going on, and it's worth the time and the effort to think about what's going on. Because if you do that and you align, as you were saying, modeling the real world, in the end, your life will actually be easier because the cognitive load will be lower.

Randy Shoup: 100%. Yes. It's very counterintuitive, but yes, absolutely. It seems like, oh, workflow with steps, that somehow feels more complicated than one big old transaction. We're like, "Nope". It's way easier. It's way easier conceptually and then also in practice.

Michael Stiefel: Yes. And you can also break up the problem.

Randy Shoup: It's how the world works.

Michael Stiefel: Yes. I think that's the most important thing that people have to realize. I don't know if it's still fashionable, but when I got started in software design and architecture, we talked about essential and inessential coupling. Inessential coupling we introduce in the software world, but essential coupling, you can't do away with that. If customers are related to orders and it's complicated, there's no way you can get around it, because that's the way the world is. And I think if I had to sum up this conversation in one line it would be: look at the world and see how the world actually operates. Not as we wish it would operate or want it to operate, but model the software as the world actually operates.

Randy Shoup: What a beautiful ending.

Michael Stiefel: Well, thank you very much. I know we could do this forever, because to me, this is interesting because there are so many dimensions to the problem, the human dimension, the software dimension, and it's infinitely interesting. Well, thank you very much.

Randy Shoup: Thanks, Michael. This was great.
