In this podcast, Michael Stiefel spoke to Matthew Liste about building and managing software platforms. Platform services act as the basis for application development, and must always be stable, secure, and scalable. Scaling these systems is particularly difficult because unknown resource contention often causes them to break. Using customer journeys, one can pinpoint the places where the system is particularly at risk. Platform engineering also requires managing limited resources, and making difficult tradeoffs about which functionality should be implemented.
The discussion also highlighted how artificial intelligence can increase the speed of development and thus increase risk, and how it interferes with the development of junior engineers who typically learn from basic tasks that now can be done by artificial intelligence. Nonetheless, platform engineering is still responsible for maintaining the stability, security, and scalability of the platform.
Key Takeaways
- Platform services, which are the basis for application development, must always be stable, secure, and scalable (the three S's). Scaling a system is particularly difficult, as unknown resource contention often breaks these systems when they scale.
- Agentic AI does not change the nature of software development, but it does increase the speed of change. You still have to supervise the changes as before and maintain the three S's, but the degree of risk increases: agents make mistakes faster than humans. Observability and monitoring platforms must therefore also speed up, which means they too must use agentic AI. This is similar to the use of AI to combat cyber threats generated by AI.
- Using customer journeys is an effective means to measure system reliability and functionality. Customers here are both end-users and developers. An example of a customer journey is "Can I pay with my credit card". Evaluating how a system failure affects a journey illustrates where the system is particularly at risk.
- The use of artificial intelligence to do basic coding tasks interferes with the development of junior engineers who typically learn their craft from these tasks. This is an unsolved problem in the apprenticeship of new developers.
- Platform engineering requires managing limited resources and making difficult tradeoffs for end users and developer clients. One critical issue is whether to be an early or late adopter of technology, although the existence of open source software helps this decision. It is necessary to have the discipline to reject narrow custom requests for single clients.
Transcript
Michael Stiefel: Welcome to the Architects Podcast, where we discuss what it means to be an architect and how architects actually do their job. Today's guest is Matthew Liste, who is responsible for American Express's data center, resiliency, and multi-cloud strategies, is responsible for all other infrastructure components, including digital workspace, and has operational oversight of site reliability, application support, and mission control for all lines of business.
Matthew brings more than 30 years of infrastructure engineering expertise. At JPMorgan Chase, he oversaw the design, build, and management of the bank's platform infrastructure. He was responsible for databases, middleware, and critical infrastructure, including identity, observability, reference data, and cloud integration. He has also worked at Goldman Sachs, Throughpoint, and Schlumberger. Away from work, he enjoys spending time with his family, cooking, and traveling.
Becoming a Systems Engineer [01:16]
It's great to have you here on the podcast. You think of yourself as a systems engineer, but it seems your description of systems engineering corresponds to what I like to think of as architecture, and the role of an architect. Were you trained that way? How did you arrive at your current role? It's not something you decided one morning when you woke up and said, "Today I'm going to be a systems engineer".
Matthew Liste: Yes. Well, first of all, thanks for having me and really looking forward to this conversation. I was always a tinkerer, I guess. I grew up in the age where computers were not ubiquitous or common.
And I had an experience as a kid, and it's kind of instrumental in how my career happened.
We were living in Norway, and my parents, they're not Norwegian. We moved there when I was a kid, six years old, and they made a friend who ran the mainframe for the University of Oslo. One day we went over there. I was eight or nine at the time, and he pulled me aside and said, "I want to show you something". He put me in front of a terminal, put the phone in it, and I played chess. He basically had me play chess against the mainframe. And that was when I said, "That magic is something I want to be involved with".
So I was always tinkering, putting together computers, writing code, soldering electronics, even though I didn't even know what it was at the time. I fell in love with this whole concept of being able to think of something and then go make it happen and see it manifest itself. And probably part of it is my father, a carpenter; he worked with his hands and made things manifest. And I knew that wasn't really for me. I didn't have the patience that he had to do that, but I liked building stuff. I started with the electronics; low-level software and electronics really allowed me to be a builder, I guess. So that's probably the way I just fell into it. But there was that moment, playing chess against a mainframe, that set me down my path.
Michael Stiefel: I remember tinkering, a similar path of tinkering with software. Push this, see what happens. Change that, see what happens. But that is far away from solid engineering practice. When you build a platform, and maybe we should talk right now about the nature of a platform, you have to sort of distinguish between that urge to tinker and that urge to make something that's solid. If you build a table and one of the legs is not quite solid, you can still use the table. But in software, if you build a table and one of the legs is not solid, you can get into a lot of trouble.
Systems Engineering as an Apprenticeship [03:57]
Matthew Liste: Systems engineering is an apprenticeship, no different than any other craft. And you get good at a craft by learning from others, from making mistakes, and by gradually understanding what great looks like. But it takes experience, it takes apprenticing, it takes being willing to take risk and learn from the mistakes and work through it. So I don't think we're any different than someone who apprentices to become a cabinet maker, making a really great cabinet. Over time, you learn how to do things in the right way.
And I'll use another sort of early career example of how I learned a really hard lesson, but one that stuck with me. This was a summer job. I was working for Schlumberger. I ended up having a career at Schlumberger, but this was during high school. I had a summer job soldering cables, basically putting together these really complex 40-way, 50-way cables with big connectors on each end and soldering them. I had no idea how to solder when I first started at 16. They gave me two connectors, gave me a cable bundle, and said, "Connect these together. A goes to A, B to B and so on". My boss spent half an hour showing me how to solder and said, "Okay, go build a cable, come back when you're done". It took me probably two weeks to make my first cable. By the time I was done and had gotten good at it, the same thing took half a day, but I was learning.
Anyways, I soldered this thing. I tested it, it works, I bring it to him, and I say, "I'm very proud of my work". And he looks at it, he unscrews the connectors, looks at my soldering, and then he reaches into his drawer, pulls out a bolt cutter, clips my cable in two, gives it back to me and says, "Do it again". And I'm in tears. I'm like, "What do you mean do it again? It worked. I tested it. You didn't even try it". He said, "I looked at your work and it was shoddy. I could see from the way you did it that you didn't do it right". And he said, "Do this better. Pay more attention to what you're doing and do it properly".
And these simple things like, "Are you really soldering the joints? Do they stick, or are they just going to stay around for a couple of days and then break?" That apprenticing and those lessons are really how my career has been: iterating my way through making small mistakes on a continuous basis, hopefully not too big, taking a lesson from them, also apprenticing, learning from others, and gradually building up, I hate to say it, a bit of a gut for what is right and what is wrong, an intuition. And then that begets itself, and over time you build. But it's not something... I don't know that you can read a book for it, for lack of a better word. This is something that you build a corpus of knowledge of over time.
Systems Engineering and the Impact of AI [06:45]
Michael Stiefel: This is very interesting on multiple levels. And eventually we'll get to talk about software platforms, but this apprenticeship and this development of the intuition, I ask myself, "Do we have enough time for that today, given the way we develop software, very often in a hurry?" In other words, the Silicon Valley mantra: break things fast. Or even when we think of, and this is a problem that I have thought about and talked to people about, if AI starts to do the easy coding jobs, where will beginning engineers have their apprenticeship? And I think this is a very important issue.
Matthew Liste: I think that's the most profound question... I mean, at least to me right now it is the most profound. I was lucky I could apprentice. I could learn to do stupid little things to begin with that gradually became more and more complex things over time.
Michael Stiefel: As could I.
Matthew Liste: And I think most people at our age still do that. But then you can imagine if you no longer do the stupid stuff because AI is doing that for you, how do you ever learn to do the more complex stuff? Because you have to learn over time. So it's a great question. I don't know the answer to it.
And I think it's for any knowledge worker, imagine like legal work or any structured work, accounting and so on, if you never do the boring stuff because that's now done by AI, then how do you learn to do more? I don't know, but I mean, I'm also somewhat optimistic that we will pivot over time to do that.
I had someone who worked for me once who was a compiler expert, and he was always frustrated that people don't really understand how the CPU works anymore. They just write code and then it becomes assembly, and they don't know what happens under the hood, and they should really all know that. And it's like, "Yes, but they don't really need to, because the compiler just does it and does it well". And the fact that you don't have to have people write assembly anymore is probably a good thing.
Managing Software Abstractions [08:34]
Michael Stiefel: Yes. But I'll give you a counter example to that. The problem comes where the abstraction breaks.
Matthew Liste: Yes, 100%.
Michael Stiefel: I remember one time very early in my career that we were programming in Fortran and there was a performance hit every now and then. And it turned out that the compiler had placed an instruction across a page boundary. So every so often there was a page fault.
Matthew Liste: Yes. And it had to be loaded... Yes.
Michael Stiefel: And that caused a performance hit. But if you just assume the compiler worked, you never would've found that.
Matthew Liste: It's a great point. I think, to your point about abstraction, you want to abstract sufficiently, but you still need people who understand how the machine works. I mean, it's no different than my car. I cannot pretend to repair my car today. I could have repaired my car, at least the simple stuff, 20 years ago. No longer. It is too obtuse to me, the way that it's built, and today, with modern electronics, there is no way for me to do anything meaningful with my car. And so that abstraction is now beyond me. But luckily there's people at a shop that can do that.
And so I do think that software is no different, but I do think, certainly to your point about AI, it's very new. And I think that the way that we have been apprenticing in our field for the last 40, 50 years is changing abruptly. I don't know what it'll look like on the other side of it.
And I think about, I have two boys, both college age. And five years ago, I always told them, "You need to do computer science, you'll be employed for life. It'll be great". And I'm kind of happy they didn't listen to me. Don't get me wrong, I think there still would be a lot of great engineering work, but it'd be very different. And so what I was telling them, like, "You should be a coder. You should learn Java or learn code". And I'm not so sure anymore that that's going to be the job of the future. There will be jobs in creating software, but they'll be different. I don't know exactly what they'll look like.
Michael Stiefel: At the current point, I've talked to enough people who use Claude or other LLMs to build stuff, and they still have to read the code because the models make mistakes and you have to figure that out.
Matthew Liste: And actually right now, I think we're no different than if I'm a senior developer and I have a team of 10 junior developers.
Michael Stiefel: Right.
Matthew Liste: They will write code and they will make mistakes, and I'm still accountable to make sure it works, and I still need to read it. And this is back to your apprenticing question: how do I become a senior developer if I never was a junior developer? And so do we have a pipeline problem, and do we end up not being able to have that person do that job? And to my point about getting hired, would someone hire my 22-year-old to write code, or is it just, use Claude for that? The senior developer has a job, no doubt. Does a junior developer still have a job?
Michael Stiefel: And it's also, I mean, without going down the rabbit hole, there's also a question of who writes the unit tests, who does the system tests. I think we'll come back to this, but how does the systems engineer contribute to it?
Building Platforms: Stability, Security, and Scalability [11:37]
Matthew Liste: Yes. So just to give some color on myself, I've built my team, and what I've done for the last 20-plus years has been in financial services. And before that, I did similar work in telecom and oil and gas. But it's really about building platforms that are broadly used by developers to build their applications on top of; that is the best way to put it. And so I build platforms for other engineers who use them in turn to deliver business software to whomever they serve.
So in financial services, you can imagine that these platforms, to use your point about the table, I describe it as the three S's. They have to be stable, they have to be secure, and they have to be scalable. Those three are non-negotiable at all times. They always need to hold true.
That's in production; you could compromise in pre-production. But once you're in production and you're trying to support a trading app, a banking app, an ATM app and so on, the expectation is that these three S's are always true, which means that you are now building platforms that are kind of conservative. You're always threading the needle between how much risk can I take and how much change can I make. You don't want to make too little change, because then you don't keep current; make too much change and you're running risk.
To me, the systems engineering concept is being able to think about the whole... This platform has platforms on top of it, and it's sitting on top of other platforms. And so you think of this whole thing as a system, an organism. It's no different than a human body: if one part, say your lungs, doesn't work, well, even if everything else is perfectly fine, your body doesn't work.
Managing Risk and Learning from System Failures [13:15]
And so you have to think about your part in the ecosystem, downstream and upstream dependencies, and how you manage that as a system. And so that systems thinking of, "How do I build and do my thing really, really well, but understand how it fits into the bigger system, my role to play, and what I need to be really good at?" And again, financial services is very unforgiving. Meaning, if you get it wrong, it's very obvious, because you will blow up, and there's very little tolerance for the wrong kind of mistake. There's often tolerance for the right kind of mistake: you can make mistakes as long as you make the bank more money than you lose. If you lose more money than you make, there's very little tolerance.
Michael Stiefel: Many, many years ago, I heard this story. They had a test system that duplicated the trading floor. This was before there was 24-hour-a-day trading. So there was a machine that interacted with the trades and then another one that overnight cleared them. So there was a duplicate system for us to use to test. And of course, these were the days before the internet, so we had to hook these systems up. And the test account was the account of one of the biggest stockholders. We made a copy of it. And it was just because it was so diverse and there were so many things in it, it was good for testing. So somebody made the mistake of connecting the test server to the actual live stock exchange. And it was so much effort to pull back that trade and undo all that they had done.
Matthew Liste: And catastrophic, right?
Michael Stiefel: Yes.
Matthew Liste: Yes. So I've worked in environments where we've had these kinds of things happen and you learn from them, but you have to be careful with mistakes you make because you make the wrong kind of mistake and you are costing the place millions of dollars. You don't make any trades, you don't make money, right? So you have to be willing to take risk. And I think on system engineering, it's risk management. How much risk are you willing to take as you think about evolving a system and how do you get that balance right?
Michael Stiefel: Are you familiar with Barry Boehm’s risk model of software development, the spiral model of software development? At every stage in development, you ask yourself, "What risk am I taking here?" And based on the risk, you decide what your next course of action is. Do I do a prototype? Do I implement something? Do I need to do more research?
Matthew Liste: No, I'm not familiar with that particular model, but it's very much the way we think about it. Like, what do you bring along from prototyping? Now I think of this as: you have engineering candidates, you've got development candidates, you've got production candidates. And you think along those steps: how much more do I need to know to launch this? What validations do I need to be comfortable with this one? And it varies enormously. You will have a business that is very willing to take risk because they're on a more competitive side of it, and other parts that are like, "This is the golden goose part of the business. This is highly profitable. Do not screw with this".
The interesting thing about being in financial services is you get the whole gamut of that. You get high-risk and low-risk environments and everything in between, and you are dialing that up or down. And if I think about SRE, the way we do that is ultimately you really want to think about how much failure you can tolerate. If I'm failing too little, I'm clearly not taking enough risk; if I fail too often, I'm taking too much risk. You dial up and down the changes and the chaos you're introducing basically through thinking about that. And we don't always use formal error budgets and so on, but it's a good way to think about it.
For example, we measure all customer journeys. Can I use my points? Can I pay with my card, and so on? And those journeys are very instrumental for thinking about the risk, because then it's like, "Well, if I'm failing the journey more often than I have a risk tolerance for, I may want to dial down the change rate, or I want to test better, invest more in testing and so on".
So there's a lot of thinking that goes into managing that from that perspective and thinking about systems engineering. And back to my point earlier about gut: that gut is also informed by data. You have to have data that shows you what is working and what is breaking.
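To make that concrete, here is a minimal sketch, not American Express's actual tooling, of how a measured customer-journey failure rate might be compared against a stated risk tolerance, an informal error budget of the kind Liste describes. The function names, thresholds, and numbers are illustrative assumptions only.

```python
# Illustrative sketch (hypothetical names and thresholds, not Amex tooling):
# compare a customer journey's measured failure rate against a risk tolerance.

def journey_failure_rate(attempts: int, failures: int) -> float:
    """Fraction of journey attempts (e.g., 'Can I pay with my card?') that failed."""
    return failures / attempts if attempts else 0.0

def change_guidance(failure_rate: float, tolerance: float) -> str:
    """Suggest dialing the change rate up or down based on remaining tolerance."""
    if failure_rate > tolerance:
        return "over budget: dial down the change rate, invest more in testing"
    if failure_rate > 0.8 * tolerance:
        return "near budget: hold the current change rate"
    return "under budget: room to take more risk"

# Example: one million 'pay with my card' attempts, 0.1% failure tolerance.
rate = journey_failure_rate(attempts=1_000_000, failures=1_500)
print(f"failure rate {rate:.4%} -> {change_guidance(rate, tolerance=0.001)}")
```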
Getting Real World Feedback to the Architects [17:31]
Michael Stiefel: You mentioned SREs and you talked about the customer journey. The question is, how do you take what the SREs find or what the customer experienced and feed it back into the architecture? This I find is a big problem, and nobody really has a good answer for it.
Matthew Liste: Yes. I mean, it's a great question. And it is. So the perennial question is how do you have a perfect feedback loop? You don't. But I will say, and this is pretty recent in my life, being informed by the customer journey, that it has been very helpful, because it focuses the mind. Which is, if you think about ultimately why we are building software, we're building software to support certain business outcomes, which support our customers. And so you put yourself there and say, "The most important thing for us is our customers, who of course pay us money to use our products", right? You always are forcing yourself to say, "When my software doesn't work as anticipated, how did it impact the customer?" That really focuses how to direct that. Because then you say, if something breaks and the customer never noticed it, it's not that important. If it breaks and the customer noticed it, it's very important. And if a hundred thousand customers noticed it, it's even more important.
And so I think that really helps in terms of reinforcing the feedback loop to go to the right places. And we use that a lot in terms of our conversations around reviews or what broke. Of course, we always do postmortems of why a thing broke, and work backward. But then where we focus effort comes down to, is it impacting customer outcomes or not? And that's a very helpful way of tightening that feedback. It's still imperfect, don't get me wrong, but I will say that that orientation around customer outcomes has been very helpful.
Michael Stiefel: That raises two things in my mind. One is, perhaps for the listeners, you might define an example of a customer journey, so it's a little more concrete in their minds. But the other thing that comes to mind is that a lot of these systems are sort of on the edge of breaking, so to speak. In other words, sometimes it's a miracle that it works. And how do you deal with that? When you talk about stability and you talk about reliability, which is very important to you, and you know when to push and when not to, you have to deal with the fact that you are always, so to speak, on the edge of chaos.
Managing Complexity and System Scale [19:52]
Matthew Liste: Yes, that's a great point. So let me start with the journey, just to express it. I'll use a couple of journeys that we have here just to put them in context, but every company will have their own. So one journey is, "Can I pay with my card?" That's the most foundational at American Express, "Can I use my card?" That's a very clear journey, and, for example, one we measure. Another one would be, "Can I look at my statement?" So these are different journeys, and they have different systems under them.
And so to your point about complexity and things on the edge of breaking: in any complex environment, because that's what systems are, a journey like "Can I pay with my card?" has, you can imagine, a very complex ecosystem underneath it. When you go to a store and use your card, it pretty much gets authorized in real time between the merchant you're involved with, the network, the backend and so on. There are a number of parties involved in that transaction and a lot of systems. That transaction flows from the point of sale through a network to a backend that knows everything about you and says, "Yes, you are authorized", or "No, you're not authorized, and this is why", and then all the way back. Usually you get a yes that you can make the purchase, sometimes a no. There's a huge amount of complexity that the customer never sees and never should see.
But to your point around things at the edge of chaos, that example is one where we pay a lot of attention to managing that, because it is clearly a very important journey to us and one where we want to make sure that it always works. So in the cases where we are more rigorous about the journeys, we do pay a ton of attention to testing, chaos testing, scenario planning, all kinds of paranoid activities, as in, "Well, if this breaks and this breaks and that breaks, what then happens?" And then you also anticipate scale, as in, "We work fine now, but what if we got double the volume? Then what happens? Triple the volume? What if it's Black Friday?"
And so there's also a degree of anticipation. You learn over time to think about, as I said, scale, security, stability, those three S's. Scaling is probably the thing that people get wrong the most often, anticipating scale. But the way I think about it is: if your product is very successful, guess what? You're going to get more customers, you're going to drive up scale. And so you have to have built into your systems how they will deal with scaling. And to be honest, in my experience over the many years I've done this, it is usually scaling issues that have broken complex systems, because something that was working fine was, over time, getting closer and closer to some threshold.
Michael Stiefel: Resource contention of some sort?
Matthew Liste: Yes, resource contention. Network contention, CPU contention, memory contention, somewhere downstream where it is not obvious at all until it happens. And once it happens, of course, then everything downstream starts failing. And so really thinking about scaling upfront is one of the most important things to do in complex systems.
And then in other places you are willing to take the risk of saying, "Listen, if this thing fails, I'm going to let it hit the bottleneck. And then once it does, I will go fix that afterward, because I'm okay with this thing failing every so often; it's a rapidly evolving business". We do both of those things. And to your point around things always being at the edge of breaking, I would say that's true where you're either willing to take more risk and/or it's a brand new system where you're still learning the edges.
But for the places where you are core to your business, I would say that you don't want ... I mean, I'm not saying it's ever perfect because these are very complex environments. Often, the complexities of even your own system could be with a third party. So there's a lot to it, but I would say that we, from a system perspective, think about dialing up and down that paranoia and think about where things break. But you see it every so often that companies are incredibly good at engineering, like plenty of great software companies and tech companies that occasionally get it wrong, but they do a deep introspection, they do the postmortems, they look through it and say, "All right, this is what we can learn from". And that's also the other thing is, anticipate scale and learn your lessons. Don't make the same mistake twice.
And so it's also like, when these things do fail, they didn't know there was a bottleneck here, or there was this downstream dependency you didn't anticipate. Well, then learn from all of these failures and think about how you engineer that out of your system so that particular thing doesn't happen again.
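As a back-of-the-envelope illustration of the contention effect described above (my sketch, not something discussed in the episode), a standard single-server M/M/1 queueing model shows why a system can look healthy for a long time and then fall over as load creeps toward a hidden capacity threshold; the 1,000 requests-per-second service rate is an assumed figure.

```python
# Hypothetical illustration: an M/M/1 queue's mean response time stays flat at
# moderate load, then explodes as utilization approaches the capacity threshold.

def mm1_response_time_ms(arrival_per_s: float, service_per_s: float) -> float:
    """Mean response time in milliseconds for an M/M/1 queue (utilization < 1)."""
    if arrival_per_s >= service_per_s:
        raise ValueError("saturated: arrival rate meets or exceeds service rate")
    return 1000.0 / (service_per_s - arrival_per_s)

SERVICE_RATE = 1000.0  # assume a downstream dependency that can serve 1,000 req/s
for load in (500, 800, 950, 990, 999):
    latency = mm1_response_time_ms(load, SERVICE_RATE)
    print(f"{load:>4} req/s (utilization {load / SERVICE_RATE:.1%}): {latency:8.1f} ms")
```

The jump from 950 to 999 requests per second multiplies latency fifty-fold, which is roughly the shape of failure Liste describes: nothing looks obviously wrong until the threshold is reached, and then everything downstream starts failing.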
Trading Off Technical Perfection and Customer Experience [24:33]
Michael Stiefel: Some of it might be just human intervention. For example, you talked about the situation where you may deny a charge. Someone goes out of the country, they're in Afghanistan and they're making a charge, and you have to figure out, "Well, is this really you?"
Matthew Liste: 100%.
Michael Stiefel: And also, you probably have to make concurrency, optimistic concurrency or pessimistic concurrency decisions along the way.
Matthew Liste: Oh yes, 100%. And you are always threading the needle between being mathematically perfect and the customer experience. And you cannot usually have both. You cannot have a perfect customer experience and a perfect technical outcome.
And what I mean by that is, let's use the example of using a card. We could be incredibly fine-grained on every single attribute, examining every part of it. We might deny more charges, and we would annoy you, because sometimes you would get declined even though you're legitimately using the card. And so we say, "There's some risk we're willing to take here to make the customer experience smooth, because do we really want to annoy every customer when they travel and have them check in and say, 'I'm really in Afghanistan or really in Italy' every single time? No, we're going to use heuristics and model it out". And to your point, concurrency. You know what? It's probably legitimate. And we get it right.
Michael Stiefel: Especially if it's a $50 restaurant charge, the risk is not great.
Matthew Liste: But a $1,000 charge, very different thing. And so these are all things that go into it; it's the risk appetite. It is managing that, and managing the customer experience through that, that becomes the right way of thinking about it.
And to your point around resiliency, it is not just technical resiliency. It's also process resiliency, people resiliency, because again, if you think about the system in the true sense of the word and you think about what you've given to your customer, it is not just technologists creating that customer experience. It is everything around it too. And so you have to really think of that part too. A good example with any banking application is, worst case, you can't get to your online statement, and this is true of any financial institution. Well, you can call someone and talk to a human. And therefore that is, to a degree, part of the system and part of how you manage business risk for the worst case.
And then you think about, "Well, how resilient does it need to be?" There are a lot of diminishing returns in how hardened you want to make something. Let's say it has four 9s or five 9s, and making it six 9s will cost you 10 times more than having five 9s. Well, maybe it's just not worth it for the 15 minutes it could maybe go down in a year. You say, "You know what? If it happens in those 15 minutes, someone can call someone". That's an example of how you think about system resiliency and system uptime.
Michael Stiefel: One time I actually did a calculation on how many nines the electric company actually gives my house or business. And it's really not that many if you actually think about it for precisely the reasons that you just outlined.
Matthew Liste: That's an example you could probably live with. Occasionally it's annoying, but it's not life and death. But a plane needs a very high degree of 9s, right? And so that's why we do this all the time in real life. We think about these tradeoffs and we accept them naturally. But then when it comes to software, we expect perfection. Ultimately, we do natural risk management all day long as humans, but in software, because the system is doing it, we expect this degree of perfection, and you have to let that go.
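For readers who want to reproduce the kind of calculation Stiefel alludes to, here is a quick worked example of the downtime each "nines" level allows per year; the roughly 15 minutes a year mentioned above sits between four and five 9s. This is generic availability arithmetic, not a figure supplied by either speaker.

```python
# Worked example: downtime per year implied by each availability "nines" level.

MINUTES_PER_YEAR = 365.25 * 24 * 60

for nines in (2, 3, 4, 5, 6):
    availability = 1 - 10 ** (-nines)               # e.g. five 9s -> 0.99999
    downtime_minutes = (1 - availability) * MINUTES_PER_YEAR
    print(f"{nines} nines ({availability:.6f}): ~{downtime_minutes:,.1f} minutes down per year")
```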
Serving the Developer as a Customer [28:01]
Michael Stiefel: You mentioned customer experience, but platforms also have another customer, and that's the developer that is building on top. And you must have to work with them. That raises all kinds of questions: how do you trade off the long-term versus the short-term when they think, "We need this maybe for a year, but in two years things will change"? And how do you say no to them, or how do you develop a path from what is needed now to what's eventually necessary? This is another type of customer experience.
Matthew Liste: It's a great point. And by the way, we do also measure journeys for our developers, how things work for them. But that trade-off is super important because, like any platform provider, I have limited resources, or I don't have infinite resources, which is probably a better way to put it. And so I cannot build everything everyone wants. It's impossible. And so it is really figuring out what it is that adds the most value...
I mean, if you think about it in a condensed way, what adds the most value to the most developers in the least amount of time and costs me the least to maintain? So I think of it a couple of ways. First of all, I don't want to be too early and I don't want to be too late. And what I mean by that is, in my space there are always new open source projects, a variety of different things that are starting to pop up that look interesting.
And I'll use Kubernetes as a good example of this. I started working on and providing container platforms pre-Kubernetes. And I knew that Kubernetes would come out of Google, but it wasn't really ready. And I invested in other platforms to do it, but I had to pivot over. And so it was one of those things where I made a deliberate decision and said, "You know what? It's okay. I can learn from this, spend a year or two getting value from the containers, and then, given it's such early days, I will pivot". And so it's important to recognize when you're in that stage of innovation and when it's more mature, and when you decide to make that investment and when you just observe and stay away. And that's the real art. There's no science to it. Sometimes I get it right, sometimes I get it wrong. Sometimes I think something needs way more maturing, and before I know it, every single developer is using it and they're all upset that we don't have a platform solution for them. Other times we build things that are robust and no one wants them yet.
What I ask my teams and myself to think about is: have conviction that this is the right thing to do. And I use an analogy of a puppy and a dog. You can fall in love with a puppy, but are you ready to care for it and feed it for years, and walk it, and do all the things? And if the answer's no, well, then it's probably not time to bring it into the house. You only do it when you have the conviction that it will be successful. Platforms are like that: if I decide to do something and people love it and they start using it, I can't get them off it. Not easily. Because now it's work for them that they don't want to do. They're already using, let's say, whatever container platform it is, and I tell them, "Well, I'm going to stop supporting you and you're going to have to migrate". They're pissed off. They're like, "You need to support me". And that is years, not days or weeks, because they want to focus on writing business software and not on migrating off whatever I have.
Michael Stiefel: And they have finite resources too.
Matthew Liste: They have finite resources too. So that's where it's really important to think about, "Do I have conviction around this thing?" And then it comes down to the art: "When is it the right time to pull the trigger?" And because of finite resources, I always have to think, "Well, if I do this, there's something else I'm not doing. Am I making the right trade-off?"
But it is a very particular challenge. My customers are captive. They can't go use someone else's infrastructure. They have to use what we provide to them. I mean, yes, we provide them cloud-based infrastructure, public cloud as well, but it still funnels through my team. Whilst if you are a cloud provider with paying customers, if you do a good job, you have more paying customers. And if you do a bad job, they can go to someone else's cloud. And so that really focuses the mind.
When you are in enterprise IT like I've been in for a long time, you have to really drive the discipline more forcefully into your teams because you don't get natural market signals like you do if you sell your software to someone with a real wallet.
Michael Stiefel: But you must have management that understands this because your developer customers can always appeal over your head and this becomes all political, but you must have a management that understands all these trade-offs and these things that you've just spoken about.
Matthew Liste: Our job is to really demonstrate, as best we can, that we're building what has the best value for the company. We have a sizable budget; are we being good stewards of that and managing it? And then, to your point about saying no: yes, we do say no quite often and say, "This is just not... Only you want this. No one else has asked for this. And so whilst it might be a great idea for you, I don't have the bandwidth to go do this, because it doesn't take precedence over the things that more people are asking for, so go do it yourself".
And then if you do it yourself, you have to do it in the right way. You can't blow the place up. And so we have to sometimes allow for a degree of innovation happening outside of our team and just accept the fact that sometimes someone's going to build something for themselves only. Yes, they might not be database experts, but they need a very peculiar database and no one else is asking for it. I'll let them go build and stand up that database stack, because it's just not the right time for me to do so. But let's say two or three other teams ask for the same thing; then we might take it on.
Michael Stiefel: And you might find that it's a special for one team, but as things evolve, more teams want the same thing.
Matthew Liste: To be fair, open source has really helped with this because it allows people to more democratically build together and collaborate. Because it's open source, if you take standard database technologies like Postgres or MySQL and so on, it is widely available. And so it helps enterprises like us collaborate better, because we all have access to the code base and can iterate on it. And so that has actually helped tremendously with that problem statement of how we can better do that. Because think about it: with closed-source software, a team like mine is the only one that has access to it, because we're licensing it from a vendor and standing it up. And then it becomes a lot more tension-filled, and if we don't have the bandwidth, then it's going to really upset them, because only we can build it and we're standing in their way. Open source allows us a bit more organizational flexibility as to who does what.
The Role of Culture in Platform Engineering [34:28]
Michael Stiefel: The other thing that seems to be implicit in what you're talking about is that everybody understands these sorts of things, which means that there's a cultural element to all of this. And you must have the culture inside your organization to understand these things and also the culture of the users of the platform because it's not just knowing the contracts of the API. It's the implicit contracts, and it's also the understandings that the culture brings.
Matthew Liste: Yes. I think the most important job for me at this stage in my career, leading a big team, is setting the culture, because great culture builds great teams, and great teams build great products. And you can get lucky. I've gotten lucky. I've had teams that have really been successful, but it wasn't because I was good; I happened to get lucky and assemble the right people and build great products. But to do so repeatedly, because the platform I'm building for the enterprise has many components to it and many teams behind those components, you really have to focus on the culture to get that right at scale.
And so a lot of what I described is: figure out how you empower teams to make decisions as autonomously as they can, but do so safely, managing, as I describe, making the right decisions and working through it. So it's just a reinforcement of that on a continuous basis, of setting that culture. And for me, the most important thing really comes down to: how do you set the macro guidelines and then let people run within those guardrails as freely as they want, and give them as much freedom as you safely can?
And again, there's a bit of a hard trick to it, because people will make mistakes. You have to accept the fact that sometimes people are going to do something wrong, and you're going to say, "Well, that wasn't too smart. You blew this thing up". To use your example, you did trades in prod. Well, learn from that. Don't do that again. And set a culture where you're allowed to make mistakes. And again, back to my apprentice example, there are enough other people with experience around you to stop you from making the worst mistakes, but allow you to make the smaller mistakes that allow you to grow.
My role is really to reinforce that behavior as much as I can and allow teams to blossom, because if you build platforms, they need to hang together. It's like a meal. You go to a restaurant to eat a meal, not to eat a carrot and a potato and a steak. It's assembled together, and it's the experience of the meal you're going for. Customers are not looking for the components. They're looking for all of this to work together in a cohesive, coherent way. Which means, like Conway's Law, if you have an organization that's dysfunctional and does not talk to each other, well, your platform is not going to hang together. And as a consumer of that, you're like, "Well, that was not great. I'm trying to use this database with this container, with this messaging broker, and I have to sew it all together. All the observability built in looks different. None of this is easy to troubleshoot". And so that's a very frustrating experience as a consumer.
And so you have to, again, back to your point about putting yourself in the shoes of a developer. I like to create a team, and I have one here too; I call it Developer Zero, the first developer. They're not real developers, but they're critical people who consume our platforms, and they have the freedom to roam where they see fit. Their job is to give constructive criticism to any other platform team, saying, "Well, this one sucked". But they're deliberately outside of those teams, because I want them to consume just like any other developer. So they don't have inside access to documentation or to special APIs. All they see is what the developer sees, but they know the people, and they'll go and say, "Hey, by the way, when I try to use that database with your message broker, the documentation doesn't describe at all how to do it. It was actually like this". So that also helps us find things before my scale customers do, which are the development community.
Testing Documentation [38:41]
Michael Stiefel: That's an interesting point that you mentioned about the documentation, because people who know very well what they're writing about often make implicit assumptions in their writing that they do not realize are not obvious to others. And it's only someone else's eyes that can catch that. So what you really want is a review, and I think you have a good idea there, because you have people who are technically competent yet don't know the insides. So they understand the documentation, they understand what's being explained to them, but they can call out the hidden assumptions being made that are not clear.
Matthew Liste: Yes. And it's very important to your point, like that old expression, "You don't see the forest for the trees". And I find that all the time as our platform teams are building various components, they get all the complaints. They say, "Well, why are they complaining? This works perfectly and it's well documented. It's very well understood". Like, "Well, to be honest, perception's reality and you might think that, but your customers don't". And so to your point around, they just don't see it until it's pointed out.
And I do think it helps to have an explicit function around it. But this function is not what you would call UAT, because it's a lot looser and it's a cultural thing. It's not a formal thing, as in, I'm not going to ask this team to check every single release. I'm just telling them, "Go consume stuff at will. Go figure it out and go direct. You spend your time where you see fit, and where you hear the most noise, that's where you should spend your time". It's not a formal gate, because that would always be slow. If I told them, "Listen, you have to go test everything every single time something releases", we'd never get anything out the door.
Michael Stiefel: And more importantly, they wouldn't get the important things done.
Matthew Liste: Exactly.
Agentic AI and the Changing Speed of Operations [40:21]
Michael Stiefel: So I want to get back to the question that I sort of teased at the beginning. We have now talked a little bit about platform engineering and the importance of it, and we are increasingly coming into the world of agentic AI, where the software could be written by AI or it could be written by humans. How does this change architecture or systems engineering or platform engineering, however you want to conceptualize it, when your customers may not be human, or when the level of platform engineering goes up because the AIs may be writing software that's sort of like a platform that humans use? What does that world look like, and how does it change things?
Matthew Liste: So first of all, I'll come back to what I said earlier: right now, I don't view agentic AI any differently, from a human or cultural perspective, than a senior developer overseeing junior developers, or a senior platform engineer overseeing junior platform engineers. It's still the same problem statement. I'm still accountable to make sure this stuff works. Whether it's built by humans or built by agents, it still needs to function, and I have to make sure the system hangs together and I'm doing the appropriate tests and validations and so on to have confidence around it, no different than before.
So what does change, though, is the speed. And how do I feed this? So, the first order for us is the complexity of operating these platforms. And back to your point earlier around being only one mistake away from chaos, or very close to it: these are, as I said before, very complex ecosystems that hang together, and clearly a place where we have humans operating them today. And you can imagine that agentic systems, because they can look at a vast amount of data way faster than a human can, should be able to triage complex systems quicker than a human can. And so we're feeding these agentic systems all the telemetry, observability, state and so on. And as I said earlier, everything we see shows us a lot of promise that this will be profoundly quicker at finding issues than we can be as humans.
Michael Stiefel: Or making mistakes faster than we can make mistakes.
Matthew Liste: And make mistakes faster. But you have to apply it to both sides of the equation. And that's kind of my point, which is, if you assume you have agents writing code and assume that they will spawn mistakes, you also need agents observing the systems that can go just as fast. And so that's the way I think about it. As long as you think about evolving both sides of that trade, it doesn't really change the dynamics, but you have to keep up with it. It's no different than cyber fraud generated with agentic AI, but equally cyber detection, and how you manage that with agentic AI. It's an arms race.
And so I think of it like, as long as I can, from a systems perspective, observe and manage the system at the same speed when things go wrong, then that equilibrium hopefully holds true. But you're very right, I don't know if it will. So that's something I'm acutely looking at: making sure that operationally we are focused on the same level of speed.
And then the complexity comes in with data, because we provide observability today for humans to look at. So they'll go look at a dashboard, they'll look at some logs, they'll troubleshoot through a system trace, but all of that is built today to feed humans; we did not build it to feed APIs and systems. And so now we have to scale all of that up for agentic reading, because something reading this stuff 100, 1,000, 10,000 times faster means I now need to build and scale the underlying platforms way more than I had anticipated. So that's back to my scaling point. Now I have to go scale those to work at the same speed as software.
Michael Stiefel: And to be stable and to be secure, maybe operate in a zero trust environment. In other words, everything happens faster. So all your 3 S's become even more critical.
Matthew Liste: And this is what makes this job fun, right? Because there's this constant evolution that means that you always have a different dimension to deal with. And you always have pressure in the system to go faster, to continue to be secure, to continue to scale. And so that makes this a never ending challenge.
Michael Stiefel: I often say to people that in my software career of, I guess, over 30 years now, I've only done three things: trade off space and time, insert levels of indirection, and try to get my customers to tell me what they really want.
Matthew Liste: Yes, that's a great way to summarize it. It's not much more complicated than that, although getting it right is very, very difficult.
The Architect's Questionnaire [45:11]
Michael Stiefel: Right. Right. So this is the point in the conversation where I'd like to take a little more human-centric approach and ask you the questionnaire that I ask everybody who appears on the program. I call it the architect's questionnaire, but it's just as much the systems engineer's questionnaire.
What is your favorite part of being a systems engineer?
Matthew Liste: I mean, I love building stuff and seeing it being used, seeing it manifest itself into something in production. And I always use the saying, running code wins. If it's running in production, that's when it's real to me. And I love that satisfaction. It's no different than my father getting satisfaction out of delivering a cabinet to his customer, who puts it on their wall. The thing is actually being used. That to me is my favorite part of what I do.
Michael Stiefel: What is your least favorite part of that role?
Matthew Liste: I guess the least favorite part, and I'll use the term architect here, is that I'm a big believer in thinking about the whole thing end-to-end, as in getting into production. So the distinction between enterprise architecture and building, I don't love that, because I think of it as a continuum: you ideate, you build, you put it in production. So when people call me an architect, I'm like, "Yes, but I don't really think of myself as an architect. I'm a builder who understands how things need to be designed".
Michael Stiefel: Is there anything creatively, spiritually, or emotionally satisfying about systems engineering or being a practitioner?
Matthew Liste: Yes, I think at least I get motivated out of... And I think it is quite human, is to not just be abstract, but be able to see your thoughts come into, for lack of a better word, material form. I always feel like people in general are motivated by... At least a lot of people I know get motivated by the idea, thinking of something and they're seeing it happen. I find it incredibly satisfying and it gives me a real sense of accomplishment that I don't think much else can achieve from a professional perspective. Because just thinking about it but not seeing it happen... And I have had parts of times in my career. I've had those kinds of jobs and I never found them particularly satisfying, but to me, it's both. But equally, I don't enjoy building someone else's vision if I didn't have any part of it. So it's really being on both sides of that that I find very, very satisfying.
Michael Stiefel: What turns you off about system engineering?
Matthew Liste: Your job is to be invisible. And if you do a really good job, no one ever knows you exist.
Michael Stiefel: Who was that masked man?
Matthew Liste: Yes. It's just like your power example. Power is just expected to come out of the wall; it just works. You have no idea who does the job or how they do it. They're not appreciated. You never call them and say, "Thank you for giving me power". You just assume it's going to be there. And to be fair, if they do a great job, you never know they're even there.
And so being an unsung hero, for lack of a better word, or call it silent running, can be very frustrating if the people that you are building for and who are funding you don't appreciate everything that goes into it. They take it for granted. And only when things fail: "So how can you let this fail?" "Well, because you took away all my money, so I couldn't build what you wanted on resiliency". And then they're like, "Well, don't make that happen again". "Well, then you have to give me money". And so it's that kind of situation where you're working for clients who don't appreciate what you do and don't want to fund you.
Now, sometimes I've been in places where maybe they didn't appreciate it, but they said, "I don't care what you do, but I trust you. So just make it work". That's fine. But where it's not fine is where they think they understand: "Well, why do you need so much? It couldn't be that difficult". They trivialize the complexity that goes into building a system and then say, "Well, that cannot be that difficult. It cannot cost that much. You can do this way easier. So you should make all these trade-offs". And then when you try and say, "Well, these are the things that can happen", they don't want to hear it. And then when things blow up, it's not their fault, it's your fault. That is frustrating.
Michael Stiefel: Yes, yes. I've been there. I know exactly what that's like. Do you have any favorite technologies?
Matthew Liste: I don't know about favorite, but I've had profound experiences. So I spoke about the one where I first saw a computer playing chess. I had a similar experience the first time I logged into a computer that was not in the same building I was in, so DARPA Net. And that was the profound sort of aha moment for me as in, this is really possible. And I had a friend who has been doing neural networks and doing AI forever since the '70s. And so it's a bit of like, "Well, this is never going to happen". It's always been an idea that ... And now that it's happening, it's like it's profound.
So what I love about my job, I guess, is not one technology per se, but when you hit these magical moments of saying, "This thing that just seemed like an idea for so long is now actually happening". Even if you anticipate it may be happening, when you just first see that, it's so magical. And then of course, six months later you take it for granted, but those few months of that magic is something I love about being in technology, is you get them every so often. And it's amazing to be able to be part of that small part of that journey.
Michael Stiefel: What about systems engineering do you love?
Matthew Liste: I love the team aspect of it, because by the nature of a system, there are multiple components and multiple teams and multiple people involved. And so it is really like an organism. It is like, "How do I get this organism that is incredibly complex, with all these people and systems and technology and software and hardware, all to interact and deliver a systems outcome?" And it is probably the most complex problem you can imagine. And I love that complexity of all those moving parts and having all of that magically deliver these outcomes that, if you looked under it, you'd be both fascinated and horrified at everything that goes into making it happen. It's kind of like an airport. I'm fascinated looking at how an airport actually operates. All the different things that go into running an airport, having all those thousands of planes take off and land on time, and all the different specialized jobs that go into it.
It's incredibly fascinating how it just works because that's a system. And so I love that when you take an incredibly complex system and make it appear very simple, even though you know of course underneath it is incredibly complicated.
Michael Stiefel: What about systems engineering do you hate?
Matthew Liste: Well, beyond the funding, and it's not just funding, it's the lack of being open-minded or appreciating that complexity, it's probably just unrealistic expectations from clients around how easy or quick it is to do things. And so it's more of a speed problem. And you alluded to it a bit before, which is clients always asking why is this not done and why is it not ready? Doing things right takes time. You can either be scope-bound or you can be time-bound. You can't be both. And so if someone says, "I want this thing and I want it done this way. I want it to be stable, secure, and scalable", well, guess what? It's going to take me time to get it to that point.
And so there's a degree of that that is frustrating, which is, you want it both ways, and I can't give it to you both. And you don't want to make any compromises. And just recognize our job is making trade-offs and what do you want to compromise on. And when people don't appreciate that, it's frustrating because you feel like you cannot be successful.
Michael Stiefel: It's like the old saying, someone says, "I want it fast, cheap, and correct". And you say back, "Well, which two do you want?"
Matthew Liste: Yes, which two? You can only have two.
Michael Stiefel: Yes. What profession other than your current role would you like to attempt?
Matthew Liste: At this point, as I spoke about with apprenticing, I have people who I guess in many ways apprentice with me now, although I wouldn't call it formal. But I like that degree of being able to impart my lessons; I enjoy that. So I can certainly envision spending more time at some point doing mentoring, teaching, maybe even in a more formal structure. That'd be exciting to me.
And one thing I never did, but wanted to do when I was in college, although of course I didn't go down that path, was maybe being an architect, a building architect. I love well-built buildings and the combination of art and science that goes into them. If you look upon a really beautiful building, what it took to both envision it and to build it is something to me. Again, it's too late in my career to go back and do that. And the reason I didn't... Maybe in this day and age I could. I'm horrible at drawing and drafting, and so I would not be able to pass any drafting course.
Michael Stiefel: Software does that now for you.
Matthew Liste: Oh the software does it. Back then, you had to do it by hand. And so that turned me off. I said, "Yes, that's not going to work for me". But yes, if I could redo my life in this day and age, maybe I'd be a building architect.
Michael Stiefel: Do you ever see yourself not doing systems engineering anymore?
Matthew Liste: Not at this point, not as long as I'm working. I love what I do for all the reasons: the complexity, the teamwork, the cultural aspects, all the different things that go into it. I've done this for a very long time, and I do it happily and see no reason not to continue to do so.
Michael Stiefel: We spoke a little bit about this before, but when a project is done, what ideally would you like to hear from the clients or your team?
Matthew Liste: Well, when it's done, I want to hear a couple of things. And again, back to using my dog analogy, we raised a puppy to a dog, but we still have to maintain the feeding, the care, and so on.
And so really what I want to hear is, first of all, that we met the expectations of the client, that we delivered the MVP or the first iteration of it. And then secondly, that what we did is sustainable, that what we did was actually something we can maintain, because building a platform that you can't sustain might feel great when you deliver it, but then it's a nightmare afterwards, because maybe you can't scale it or it's not secure, and now you have to do all that work after the fact. It's so much more difficult than if you did it upfront. So I want to hear that we successfully delivered, but also that we delivered in the right way, because delivering in the wrong way, and I've done that way too many times myself, feels great day one and feels awful day two.
Michael Stiefel: Well, thank you very much for being on the podcast. I enjoyed this conversation very much. It's something we don't explore because as you say, people take platforms for granted, but without platforms, we wouldn't have software.
Matthew Liste: I really enjoyed the conversation too. And to all of the platform engineers, platform architects, platform designers, and systems engineers out there, I appreciate everything you do. I use many of your platforms myself and am very happy, whether you're at a power company or an airline, or you build clouds and so on. We are all building on top of each other's platforms and rely upon them. And I think that's what makes this so exciting. It also allows you to appreciate other people's platforms even more when you are in this job. So I'm really thankful that you let me speak a bit about this and my journey along this path.
Michael Stiefel: Well, thank you very much. And maybe we can have this conversation sometime again.
Matthew Liste: I would love to do that.
Mentioned:
- Barry Boehm's spiral model of software development: Boehm, B. W. (1988). "A Spiral Model of Software Development and Enhancement". Computer. 21 (5): 61–72.
- Conway's Law: Conway, M. E. (1968). "How do Committees Invent?". Datamation. 14 (4): 28–31.