Observability for Speed & Flow


Summary

Jessica Kerr argues that we should look at the software as part of the team, and that observability in the software becomes an asset in organizing teams.

Bio

Jessica Kerr is a Principal Developer Evangelist at Honeycomb.io. After twenty years as a developer, she sees software as a significant force in the world. As software engineers, we change reality--including our own, and that's developer experience.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Kerr: I'm Jessica Kerr. I am a big fan of systems thinking, especially in software. I'm very keen on observability, which is why I work at Honeycomb. Today I get to talk about observability for speed and flow. Flow of what? I can think of a couple things that we care about the flow of. We could look at the flow of requests in software, or the flow of changes in that software. I suspect most of the people here are looking at the flow of work through people, which results in changes in software. It's also interesting to think about changes in those people.

1. Requests through Software

Let's start at the top. Requests through software, specialty of observability. Of course, we all care about requests flowing through software. It's like why we're here. Somebody can decide to do a thing and hit a button and a request flows through software, and it affects the rest of the world, and they find value in that. I like to say that our job as software developers, it's not so much changing code as providing valued capabilities to customers. Whether those customers are internal or external, whether they are people or other software, we make something possible in the world that wasn't possible before. Say, for instance, I'm making an expense report, not the most exciting thing, but I do value that it's possible. We could look at the flow of that request through the software on a distributed trace. Distributed tracing is an essential part of observability. It shows that, ok, I hit submit, and we hit the submit expense report endpoint in the backend. That's calling through to the merchant endpoint to validate the merchant. Maybe we need to check on the reasons, so we're calling that endpoint in the reason service, which is brown here, and checking a couple things. We can see the flow of the information and the request through the different services. Then we can map that to how the services connect. Here, expense service is calling merchant, and it's calling reason-srv a couple times. That one calls through to roles to find out whether you personally are allowed to use this reason for that expense.
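
As a rough sketch of where a trace like that comes from, here is what the instrumentation might look like in Go with the OpenTelemetry API, assuming an OpenTelemetry SDK and exporter are configured elsewhere. The service and span names echo the example above; the function names and attributes are hypothetical, not the actual code.

    package expense

    import (
        "context"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
    )

    var tracer = otel.Tracer("expense-srv")

    // submitExpenseReport handles the submit button: it starts a span for the
    // endpoint and hands the context to the downstream call, so the whole
    // request shows up as one distributed trace.
    func submitExpenseReport(ctx context.Context, merchant, reason string) error {
        ctx, span := tracer.Start(ctx, "submit-expense-report")
        defer span.End()

        span.SetAttributes(
            attribute.String("expense.merchant", merchant),
            attribute.String("expense.reason", reason),
        )

        return checkReason(ctx, reason)
    }

    // checkReason stands in for the call into reason-srv; in a real system this
    // would be an HTTP or gRPC client call that propagates the trace context.
    func checkReason(ctx context.Context, reason string) error {
        _, span := tracer.Start(ctx, "reason-srv.check-reason")
        defer span.End()
        // ... validate the reason, call the roles service, and so on ...
        return nil
    }

Because each downstream call carries the trace context, the flow of the request shows up as one trace, which is what lets us compare it to the lines of communication between the teams.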

When we can map the information flow through the software, then we can compare that to the information flow in our teams. Because there's this property of software development organizations: the code tends to mirror the structure of the organization, specifically the communication flows of the people in the organization. Sociotechnical mirroring, known by some as Conway's Law. The thing is, this is not just a tendency, it's important. The communication structure in your organization needs to match the communication structure in your software, because this is how we make that software communicate smoothly. The way we coordinate changes in the communication between that software is by talking to people, people who then develop the software so it all fits together.

In this example, you've got the expense service and the merchant and the reason service teams, and they all know each other. The roles team is in a totally different part of the organization. This company just happens to be really lucky, and someone on that reason-srv team happens to know someone on that roles team. That's a really valuable person-to-person connection, because when there's trouble in production, and they're looking at that distributed trace together to debug it, the reason service person knows who to call and can get that information. This really makes everything go smoother in our software. That particular connection is not at all supported by the formal organization, but it's very valuable. I like this look at the happy case of sociotechnical mirroring.

We can see some things about this. One thing that's important is that each service has a team behind it. Every piece of running software in this picture is supported by a person who has that software alive in their head and they're connected to it, and they understand it, and they can change it. You might notice that where expense service is calling reason service twice, maybe they could use a little more communication. Maybe they just haven't had time to change that. Maybe it'll evolve over time and get better. What if there is no reason-srv team? What if reason-srv, they wrote it and that team moved on, or they all quit, or we didn't make any changes so we just reallocated all those people? This is a less happy situation, because suddenly reason-srv can't change. There's nobody who can do it. That means expense-srv can't change. Those interfaces have to be what they are, and stay what they are. In fact, maybe that's why they're calling it twice. That would actually make sense because they weren't getting what they needed from one request, but they figured out how to get it in two, and doing that was easier than changing reason-srv.

Even more dangerous here is that the roles service cannot change its interface without breaking something in an entirely different part of the organization. That's very constricting for the people who maintain the roles service. This is unhealthy. It's not ok to have software running in production that you can't change. This is really stultifying to the organization, because if you can't change the software, you can't change your business. You can't make things better for your external customers, but you also can't make things better for the people who work in your organization. Software that doesn't have anyone who can change it, it doesn't have people on its team, it's ossifying. It's ossified. It can't move, and your whole company becomes ossified.

I worked at a large enterprise not too long ago that had an email listserv; if you wanted to make an email list, you used their listserv software. It was probably 20 years old. It ran on a mainframe. Nobody had changed it in ages. It was really hard to use, and there were a lot of things, perfectly ordinary things that you would want to do with a mailing list, like unsubscribe without logging in or clicking a bunch, and it was so painful. People who worked there long enough got used to it. Anyone coming in was like, what is this? Why is it so much easier to do this outside this company than in? We hold ourselves back. We keep ourselves in the past when we run software internally that we can't change. The same thing goes for our customers, if it's external software.

In fact, I think, if you're not going to change it, why write custom software at all? Custom software is incredibly expensive. The value you get out of it is control: you can change it. If you buy off-the-shelf software, fine, adapt to its model for stuff that's not your core business. If you buy Salesforce or something and you call it off-the-shelf, but then you customize it, you've got the worst of everything, like the most expensive thing and you can't change it. Write custom software that's important to your business, because that lets you change your business, and update it and learn as a company. That means continuing to be able to make changes in the software, and that means having a team behind the software.

Obvious solution, let's give reason-srv to the same team as expense-srv and they can just do both services. This is what happens all the time. In fact, why wouldn't we just have one team for all of these services, because then we wouldn't have to worry about the lines of communication between them. It would just be all there. The answer is, there's too much complexity in these services for a team to hold all of them. We have to be able to fit the code in our heads. If this team has expense-srv in their head, and they've worked to get it there, maybe they wrote this service that's the easiest way to get it there, then they can change it. There's a limit to how much we can hold in our individual head and in our collective head. There's a limit to how many people we can have on our team because there's a lot of coordination costs. That limitation depends on the complexity of the software, including the architecture and some incidental complexity. Really, a lot of that complexity comes from the capabilities themselves. The capabilities that we're offering to the customers, the people who use our software. Expense reports are not simple things. They sound simple until you dig into them, and then there's all this weird stuff that goes on. I'm gaining that capability of being able to submit one because the software absorbs that complexity. It's good to have that essential domain complexity in the software, but it limits how much a single team can support. When you try to cram reason service into the same team, they're going to get overloaded and not be able to change either of them safely.

Another thing to notice is that getting your brain around existing code is much harder than getting your brain around code that you have grown from something simple. Then your brain grows with the code, and that's a lot easier. This is why we want team-sized services. In fact, we aim for services instead of giant monoliths because we want each deployable unit of code to fit in a single team's collective cognitive load limit. Our distributed traces help a little bit with this, because they make it easier to transfer knowledge and to onboard to new software, because the software is sitting there saying, "Look what I do? I do this, and then this." It's helpful, but the limit is still there. Distributed traces are like having the software on your team, because when the programmer is saying, do this, by changing the code, the software is saying, ok, this is what I'm doing now. That feedback loop makes it a lot safer to change, and that increases our speed and flow.

2. Changes in Software

When we're thinking about changes in software, that's another thing that we can look at the flow of. We can use observability for that as well. If we make a change in the expense service, and it says what it's doing, then great, we can get an idea that it's working. Another thing we can do is notice when a feature was released, and when people started using it. If, as we make the code, we do observability during development, we add a little emission of a span or an event that says, "I'm being used. I am this new feature." Then we can measure that. Here's a graph: the graph has the deploy, but at that point the feature flag is turned off. Then we decide to roll it out to a few customers, and we turn the feature flag on in a limited fashion. We can see the usage of that feature happen as people start using it. Then we roll it out to everybody. Then more customers start using it. A couple things to notice about this graph. One is that the deploy is separate from the release. That's because the deploy is under the control of the team that maintains the code, and they decide when to push out deploys. The actual release of the feature is influenced by marketing and other business departments. That's important. That's possible when it's behind feature flags. Also, you can do that limited testing, test it with only internal people or whatever.
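
A hedged sketch of that little emission, assuming OpenTelemetry in Go and some feature-flag client; the flag name, the isEnabled helper, and the attribute names are made up for illustration. The shape is that the "I am this new feature" signal only fires when the flag lets the new path run, which is what separates release from deploy on the graph.

    package expense

    import (
        "context"

        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/trace"
    )

    // isEnabled stands in for whatever feature-flag client the team uses.
    func isEnabled(flag, userID string) bool { return false }

    func renderExpenseReport(ctx context.Context, userID string) {
        span := trace.SpanFromContext(ctx)

        if isEnabled("new-report-layout", userID) {
            // Mark the span so usage of the new feature can be graphed once the
            // flag is turned on, separately from when the code was deployed.
            span.SetAttributes(
                attribute.Bool("feature.new_report_layout", true),
                attribute.String("app.user_id", userID),
            )
            // ... new code path ...
            return
        }
        // ... old code path ...
    }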

Another use of this is that you can see which features are actually used and how they're being used. As a developer, that's so important. If there's some error case that I'm quite certain is never going to happen, I can put an event for that and I can find out whether I was right. At Honeycomb, we talk about testing in production. Also, test in tests, that's fine too. That's also good, but why wouldn't you test in production, and find out if what you think is happening is really what's happening? What is the real customer experience? Another thing that helps a lot is when you have these feature flags and your releases are separate from the deploys, you can do faster deploys, or more frequent deploys. That helps a lot. You can track that. I recommend observability in your continuous integration, so your build and test and deploy platform. Then you can see how long it took to build and how long it took to test. Ideally, that's like 15 minutes, and then you've deployed to production. A developer can go and look at what's happening very soon after making the code change while it's still in their head. If there is any noise in that transmission, like a bug, or a performance regression, or something, they can notice it immediately, and filter it out and push a fix. Then everything's quick because it's still in our heads.
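
One hedged sketch of what observability in the build-and-deploy pipeline can look like, in Go with OpenTelemetry: wrap each stage in its own span so every pipeline run becomes a trace with build, test, and deploy durations. The stage names, the make targets, and the runStage helper are assumptions, and many setups lean on existing CI instrumentation instead of hand-rolling this.

    package ci

    import (
        "context"
        "os/exec"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/codes"
    )

    var tracer = otel.Tracer("ci-pipeline")

    // runStage runs one pipeline stage as its own span, so a pipeline run
    // becomes a trace showing how long build, test, and deploy each took.
    func runStage(ctx context.Context, name string, command ...string) error {
        ctx, span := tracer.Start(ctx, name)
        defer span.End()

        err := exec.CommandContext(ctx, command[0], command[1:]...).Run()
        if err != nil {
            span.RecordError(err)
            span.SetStatus(codes.Error, err.Error())
        }
        return err
    }

    func runPipeline(ctx context.Context) error {
        ctx, span := tracer.Start(ctx, "pipeline-run")
        defer span.End()

        if err := runStage(ctx, "build", "make", "build"); err != nil {
            return err
        }
        if err := runStage(ctx, "test", "make", "test"); err != nil {
            return err
        }
        return runStage(ctx, "deploy", "make", "deploy")
    }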

With this, you can count how many deploys happen in a day. Maybe 10 is great because you have a dozen engineers, or maybe you have hundreds of engineers and you hope for hundreds or thousands of deploys. You can look at it; it's useful because this does represent the flow of change in software, and that has a lot to do with the flow we're looking for. This is just a measurement. Every measurement is just a clue to the emergent property, that widespread system property that we're trying to get. We want flow. We want work to move through the system quickly and smoothly with high throughput and low latency. One thing we can measure is how many deploys are happening. Maybe you measure other things in Jira, or other ways that are more disruptive to the work. They're just measurements. They're only clues to something deeper and more valuable that we're trying to get at. There's a lot of properties like this.

3. Work through People

With flow, for instance, we could measure throughput and speed probably with some reporting. Importantly, we could measure queue length, what is blocking us. Because flow is a lot of things that aren't happening, it's a lot of blockages that aren't happening for work to move smoothly through the system, things that we don't measure. That little fish symbol represents things that are helping the flow. It's like people unblocking each other, just answering questions or helping people move tasks along, pairing, sitting down looking at traces, whatever it is. There's a lot that we can't measure that we want to happen. The property of flow is not in the measurements. Those are just clues to everything else that makes up a smooth working style.

The ultimate one of these is we're trying to create value that can be measured in monies, dollars in the U.S. The value is a lot harder to measure but we do want to measure it in money. Here's another one: value isn't always directly connected to cash. There's a lot more to it, like availability. When people want to do the thing, is your software available to them? You can measure that with uptime, if you want. You can measure it very poorly with MTTR, mean time to recovery after an incident. At Honeycomb, we have something slightly better, which is the service level indicator, which is closer to what we mean when we say: when a user wants to do the thing, they can do the thing. This availability, it's really about the software not going down. It's a bunch of bad things that don't happen. You can't measure that perfectly. You have to just aim for it and do things like checking error conditions. In your general course of development, you have to care.
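
To make the contrast with uptime concrete, here is a rough sketch of a service level indicator as a ratio of good events, in Go. The event shape and the "good" criteria are assumptions for illustration; the point is that the SLI is defined in terms of whether the user could do the thing, not whether the process was up.

    package sli

    import "time"

    // Event is one user request as recorded by instrumentation; the field names
    // here are assumptions for the sketch, not an actual data model.
    type Event struct {
        Duration time.Duration
        Errored  bool
    }

    // Indicator returns the fraction of requests where the user could actually
    // do the thing: no error, and fast enough to count as available to them.
    func Indicator(events []Event, threshold time.Duration) float64 {
        if len(events) == 0 {
            return 1.0
        }
        good := 0
        for _, e := range events {
            if !e.Errored && e.Duration <= threshold {
                good++
            }
        }
        return float64(good) / float64(len(events))
    }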

Security, that's another one. You can't measure security. You can measure insecurity, sometimes. You can do some testing, you can notice security incidents, I hope. You can look for libraries that are out of date; out-of-date libraries are the number one source of security vulnerabilities. You can update them and you can work on improving that measurement, and that will help. The number two thing, which is different for every application, is understanding your data, because if you can do good validations on your data, then it'll be more secure. That's not something you can measure. You just have to understand the software and the business domain, and try.

The big, emergent property of a software system that I think is the most important, because it leads to every other one, is malleability. Can you change it? I think, at best, this is what we mean when we say code quality, when we're looking for code that we and other people will be able to understand and therefore change in the future. Sometimes we aim for customer delight. I know design people do. I love it when customers are delighted by our product. Feature deliveries do not measure customer delight. Net Promoter Score makes some attempts, but it's at best a clue. Really, so many things go into customer delight and a lot of them are negatives. Customers don't get stuck. Things don't fail, or they were guided smoothly to making them happier. It's an emergent property in the everything else of the system. Whenever we're looking at an emergent property, if we focus on the measurement, if we laser focus on what's most important to the business, if we're quite certain that feature delivery is the most important thing because that's what our KPI says and that's what's going to get us promoted, then everything else in the system starts to go to pot, gradually over time. If we deliver features as quickly as we can, if we go for speed of feature delivery right now, it's different from flow. Flow is an ongoing process. If you go for speed in the short term, then you'll push off security fixes. You'll let your tests get long or flaky and delete them. Everything else will get worse, and you will not have any of the emergent properties. You will not have delight. You will not have as much value as if you'd had a more balanced system.

Yet we do have KPIs, and we talk about them. We talk like we're going to laser focus on the business, but we don't. We set that as a goal. We set feature deliveries as a goal, but in front of them, we put code review. We put these obstacles in front of ourselves before our goal, you must go through code review. You must have automated tests, and they all need to pass. You have to go to all these meetings, and talk to people, and gain knowledge about the rest of the system. Then we give ourselves some abilities, various tooling and platform, including observability. Other developers would totally swap these and call some of these abilities, obstacles, and some of these obstacles, abilities. That's fine. They're all rules, and we set them up for ourselves, in order to protect the emergent properties from our laser focus on feature delivery.

Game Design

This reminds me a lot of game design. In American football, your goal is to get the ball to the end zone or between the posts, depending. You don't just get it there, you only do it when the clock runs. You have to stay within bounds, and you're not allowed to punch or kick anybody. We have various abilities to do this. We can run with the ball. We can pass the ball. Sometimes you can kick it occasionally. Why do we call it football? Other abilities are like particular plays. There's a sweep and there's a quarterback sneak. These aren't in the game rules, they're in the game community. People have figured this out, so the community becomes part of the game design, in this case. Totally a thing.

I'm reading this really interesting book about game design right now. It's called, "Games: Agency as Art." I'm not super interested in the art part, but the part where he describes game design as designing agency. That's really interesting. When we play a game, we adopt a particular agency composed of a goal, to win, usually, to get points, often. Which is not our life goal or anything, we just choose to adopt it. We don't go for it directly, we choose to adopt the obstacles, the rules that the game designer has set out for us. Then we only have the abilities that the game says we have in a lot of cases. It provides those. In video games, super obvious. You can only do what the keys and the mouse clickers will make your character do. You can jump and you can run and you can turn your head. You can slash your sword or you can kick, whatever it is. As players, we adopt this agency in order to have the experience of striving, or skill development, or whatever actual life goal we have. Some people actually care about winning. The striving play is the most interesting one.

This is totally what we do in Org design too, when we're looking for speed and flow, or whatever it is we're optimizing for. We set the goals for the teams and the individuals. We set the rules that they have to play by, and we provide them with some abilities. Some they bring to the table, but a lot of the abilities exist within the system, or are added over time in the system, like expertise in the software. This is distinct from gamification. Gamification, when you just have leaderboards and points and you try to turn actual work into a competition. No, that's just garbage. Don't do that. That's bad. Do think about principles of game design when we're designing the systems for our teams.

A couple examples, for instance, developers being on-call. We're very in favor of that at Honeycomb, because the goal is to provide capabilities that are valued by people. In order to provide those capabilities, the software has to be running in production. The same people who are changing the software are the ones supporting it, and answering the pager when it goes down. This requires a lot of abilities. Some of them the system can provide, such as observability. Very important. The platform where our software runs needs to be comprehensible, and we need enough knowledge of how the software runs, and where and what to do about it. We need people to call for help, totally counts as an ability.

The result is that when we have the experience of operating our software, when we have the experience of troubleshooting in production, we make it easier to troubleshoot. We spend time building in error handling and building in more observability, whatever it is we need, to make the software easier to operate to make our own job easier. In board games, this is called engine building. Engine building is where you start up the game with certain abilities. I like Wingspan, that's a fun one. In the beginning, you can play birds, you can lay eggs, you can get food, or you can get more bird cards. As you play the birds, you make it easier for yourself to get eggs, or get food, or get bird cards. The game gives you more abilities, you build up your own abilities to play the game as the game goes. In the last turn, you use all of these extra abilities you've given yourself to lay a lot of eggs and win, or something. Engine building. I think putting developers on-call, turns our job more into an engine building game.

Here's another example, blocking pull request reviews. When I at Honeycomb want to make a change to the product code, I make the change locally. Then I push this branch to GitHub, and I create a pull request. That pull request contains the changes and has to be reviewed by another person. Sometimes a specific another person. That code I can't push it to production, I have to wait until someone else looks at it. Which is bad enough, because that can take hours and cause a lot of context switches. Then that person's job is to maintain code quality, keep bugs out of the code, and make sure there are tests, and all these other things. We are suddenly in opposition. They are somehow responsible for several emergent properties and I'm trying to get the bug fix out. I swear the product would be better with this bug fix even if the code isn't formatted right or something. It sets us in opposition. We're teammates, but it makes us in opposition when it's one person's job to get the code into production and another person's job to stop it if it doesn't meet the obstacles that we've set up for ourselves. Do not create antagonism where there was none. These systems have a lot of effect on how we work together, on how we work at all, on what we do, and on who we become.

4. Changing People

Now we get into the changes in people part. You want to do rubrics and interviews and coding problems to bring the right people into the system, and then it'll just work. No. The agency that you give people in the system has way more impact on how productive those people can be, than whatever coding problem you want to come up with, or algorithms you want to check them on. Because in a system where people have a healthy agency with feedback, observability in this case, but other feedback loops of how's my code doing, being on-call is one, this helps them develop greater expertise, which in turn makes the code more malleable. These people are going to be able to get better at the software and do more. Whereas if you don't have those feedback loops, maybe you're limited by architecture, maybe you can't even send logs from production, you just have to make a change and hope it doesn't fail. That's fear. As the software gets bigger, it gets harder to change and you get to that ossification part.

It's not the people's fault. Redesign your game. Redesign the agency that you put these people into. Let it develop expertise, because expertise is the biggest thing that is going to keep your code malleable. Expertise between the people and the software, not just hiring experts at Kubernetes, or whatever, but expertise in this domain and this piece of software. That's really hard to measure. I don't know how to measure it. I do recommend that you can notice it. You can notice people answering each other's questions, jumping into Slack and saying I can help with that. You can look at a lot of things and you can notice expertise. Even more than noticing people have it, notice that they share it, because then they're building the whole organization. Then they're changing your organization into one that can change itself, and that's all your potential everywhere.

That expertise, I've represented it here with more of these watercolor lines between the software and the people. You think the company is made of people. I've argued here that it's definitely made of people and software, because software determines what your company can do and enforces a lot of its rules internally. Even more, the company is these relationships. It is the relationships between the people, and also between the people and the running software. This is what makes a company. Of course, whenever those people leave those relationships disappear too. This is our potential for learning, and growth, and all of our future flow. Observability, it can help you look at requests in software and changes in software. From that, you can learn a lot about how the work flows through the system, and the opportunities that your people have to become even better. Observe some of these things. Appreciate the changes in people.

Summary

What do we leave behind when optimizing for speed and fast flow? I think if you really care about flow, and flow in the long term, not just speed today, then you don't laser focus on any single metric. You care about the everything else in the system. Yes, have a metric, measure it, be a little better, but care about the stuff you can't measure. What do we adopt and prefer when optimizing for speed and fast flow? I think we prefer feedback. We prefer knowing what's going on, noticing what we can't measure, looking for the blockages and smoothing those out. That's where we get flow: from the lack of slowing down.

Questions and Answers

Skelton: I particularly liked the phrase team-sized services. When I had the realization a few years ago that the size and shape of the software systems that we're building actually need to take into consideration the fact that we're human, that was a big realization for me, and obviously influenced a lot of the stuff in "Team Topologies." It's a nice phrase that you use there. Observability can actually be really empowering for teams, and help to make what would otherwise be a service that's too big for that team something that the team can actually manage, because they can use this to help them understand what's happening. I think that's a really key point, and one of the most important things about observability.

Kerr: True. Observability doesn't fix your complexity. It doesn't make your software less complex, but it does give you more ability to navigate within that complexity, so that a single team can support more software, because that software is like their friend. It's helping them understand it.

Skelton: What would you recommend for organizations that have so many areas of code that they cannot have a team always assigned to every piece of software? Isn't part of the solution there using modern tools, like observability tooling and so on, to help navigate the software that they've got? What does that feel like? Have you worked with organizations who have retrofitted some observability tooling into older code so they can look after it better?

Kerr: Yes. I was talking to one the other day, who pointed out that they're trying to integrate some people into some legacy code. This could be either people who are new to the organization or people within the organization who were asked to pick up a service or the monolith that they haven't worked on before. As they go, they're adding instrumentation, so code to emit events that turn into traces. They're adding that to express their understanding of which units of work are important, and which attributes are significant in the decisions the software is making, or in its performance. They're able to do that at the same time.
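
A minimal sketch of that kind of instrumentation, in Go with OpenTelemetry; the unit of work, the span name, and the attributes are hypothetical. The point is that adding them records which decisions and which fields the team has learned are significant, without touching the legacy logic itself.

    package legacy

    import (
        "context"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
    )

    var tracer = otel.Tracer("legacy-monolith")

    // processBatch is a hypothetical unit of work someone is picking up for the
    // first time. Wrapping it in a span, and naming the attributes that drive
    // its decisions, writes down what the team has learned matters here.
    func processBatch(ctx context.Context, batchID string, recordCount int, dryRun bool) {
        _, span := tracer.Start(ctx, "process-batch")
        defer span.End()

        span.SetAttributes(
            attribute.String("batch.id", batchID),
            attribute.Int("batch.record_count", recordCount),
            attribute.Bool("batch.dry_run", dryRun),
        )

        // ... the existing legacy logic stays as it was ...
    }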

Skelton: It's such a key aspect, that they're using this tooling, like logging or metrics or observability, whatever, to express their understanding.

Kerr: Express their understanding.

Skelton: For me, I totally get that. That way of using tools is in a different dimension, different universe from the way that lots of people use tools. There's this idea of using it to express your understanding and therefore explore your understanding even of the software, so you might add some logging or traces, whatever. You're like, I think I'm expecting to get this output, but that's [inaudible 00:35:08], and you go, I had the phrases wrong. You're leaning on the tool supporting you.

Kerr: We do that with refactoring, a lot of the time. I have in the past used refactoring to express my understanding. If I have good test coverage, and I can run the tests, change the code, run the tests and stuff, then I can express my understanding with refactoring, which has the danger of breaking anyone else's understanding of the code if they had the old version in their head. Now I can also do that with observability, which does not change the flow, or the naming. It's a much safer change. If I don't have tests to add to a piece of software, I'm still perfectly fine adding a span or adding attributes to the current span, because that's like adding a log statement. It's not any harder.

Skelton: In some respects, it's safer, because we're not changing the functionality of the code itself, we're actually instrumenting it. Yes, technically speaking, at one level you just change some timings and things. We're being careful not to touch the primary logic, we're instrumenting around it. In some respects, it's safer to do that.

Kerr: We're not changing the flow, we're just adding. Only adding is a much safer way to change a system than trying to remove anything. Amy Tobey tells the story of how when they turned on OpenTelemetry in their Ruby app, it exposed some concurrency bugs.

Skelton: They were already in there, they were just exposed. They were still there.

Kerr: Those bugs were still cropping up, but only occasionally. Anything can happen when you have concurrency bugs, but in general it's a safe way to change the code. Then you can test what you think the code is doing, because you can say, I think I'll get a new span in this place, and it'll have these attributes, and it'll have these values. This is how we test in production in Honeycomb, is because we see what happens, we look at the flow of the code as expressed by the traces. It's much less granular than function calls, and a little more granular than HTTP calls, and database calls, anything over the network. It's a great way to find out if your understanding is correct.

Skelton: Seeing it in those terms really fits with this idea of team-sized services, because we can start to use Honeycomb or similar tools to express and explore our understanding of the code. Therefore, if we then gain additional understanding of the code, it helps us answer this question that we had before about an organization that has so many areas of code that it cannot always have a team assigned to every piece of software. A halfway house or a step in a useful direction is, let's provide modern tooling to help people explore their understanding of that code, and therefore eventually increase their understanding of the code so that their cognitive load is not exceeded. They can go, ok, because I'm leaning on this observability tooling and this logging and metrics and these dashboards, I actually feel like I can support this legacy system, and this one. I'm happy because actually, although the code is old, the tooling around it, the observability tooling, is actually quite new, and it helps make me feel like I'm empowered to look after it. It feels like a really powerful dynamic between these two things, team-sized software, and then modern tooling to help us feel like we can actually look after it.

Kerr: Yes, and evolve it, and see the evolution.

Skelton: If everyone should have a domain in their head, how do you standardize the code and its structure between teams even though there are templates to improve the observability of this? There's a couple of interesting things here that might be behind that question, something around standardization of code and structure between teams.

Kerr: Standardized code is one of those rules, those obstacles that you put in front of delivery. How much do you really need? I'm a big believer in good defaults. Give people a common starting place, but let them vary it as they need to. Standardizing code needs to be voluntary. Everything you ask them to standardize is another thing that they have to hold in their heads, so watch out for it. Observability has a lot to do with readable code. If your code is easy to understand, not just for you but for new people coming to it, then that's a plus, and observability helps with that. Also, you might think that standardizing the code across the organization means you can move developers around. You can't. That's another thing.

Skelton: That was exactly what was going through my head when I was reading this question. Why do you need to standardize? There's going to be some value, because what we want to do is we want to be able to learn across different areas of the organization. We don't want to have completely unrecognizable stuff in here and completely unrecognizable stuff somewhere else. Is there an expectation that you can just move people around from one team to another? That's not usually the case in the modern context, with loads of different technologies, and particularly the domains we're working in these days in 2022 are much more complicated and involved than they were 10, 15, 20 years ago. The idea of moving people around makes almost no sense.

Kerr: We have high level languages. It's not 25 years ago, where if you wanted a shift left function, you had to write it, because all you had were the standard libraries, and so, so much of software was basic stuff. We've abstracted all of that. We have libraries. We have services. Our main source of complexity, besides understanding all of these individual technologies that we're using, and just how many of them there are, is the business domain logic. We've worked hard to achieve that, as an industry. All of these services and abstractions allow business developers to know more about the business and implement changes in the business with less code, but not with less understanding.

Skelton: My feeling on this is that observability tooling, and similar kinds of tools, logging, metrics, and code inspection, and things like this, the tools that sit around the edge, so not the code itself, if there are some recognizable patterns or some fairly standard ways of using these, that could be a really valuable way of agreeing on maybe some loose standards in the organization.

Kerr: That gets into all the technologies around the business logic. The more consistency you have there, the more you can move to a new team and learn the business logic.

Skelton: You're not insisting on uniformity at the domain level, because actually, that might hurt flow. What you need is some variation in these different flows of change, for all sorts of reasons. You might need some variability there. You want to discover new ways of working, new languages, or whatever. You want that discoverability there. If people do move from one area to another, the thing that they're familiar with is how they get to understand the code, using this code scanning and observability tooling.

Kerr: Deployment platform and stuff when that's consistent, it's a lot easier to move to a new area. I like that point that the code is maybe the worst place to apply standards, and the most valuable place to allow variation, whereas there's the scaffolding around it. [inaudible 00:44:27] talks a lot about the scaffolding we build in order to build the software. Some of that is very specific like unit tests. Automated testing is going to be tightly coupled to the code. Observability, there's layers of abstraction there, the deployment platform and stuff like that. If we can have consistency in our scaffolding, then that's a lot of consistency that we're gaining that in turn enables you to dive into code that maybe puts these curly braces in the wrong place.

Skelton: I think the idea of expecting every application team or service team to use Java 10 or something, then you will build like this, and you'll use this framework. Why would you want to have that constraint in place when there's a rich variety of very expressive languages that actually work quite well in specific contexts? We're working on lots of different contexts. Whereas, observability, if we're using a really good tool, and we've found some useful patterns that help us to explore traces and spans and where the code has a cross boundary. Then that can be really good to share and say, "Actually, this is actually useful to use in a way which is more of a standard, more standardized." Not necessarily like you must do it like this, but recognizable as people move across different parts of the organization.

Kerr: It won't be obvious to you what the code is doing, but you will know how to find out.

Skelton: Exactly, so you can onboard quickly, even if you've never programmed in Rust or whatever before. You're using the code analysis tooling, using the deployment pipeline tooling, which is familiar. You've got the observability tooling which is also familiar. You can go, "I can see what's happening here." I can make a contribution pretty quickly.

Kerr: I can get my head around this faster, but there's a limitation.

Skelton: You get your head around it. Exactly, it's about the cognitive load. It works really well together if we take that perspective. If you take this perspective, like you said, that cognitive load effectively is our limit, let's take that as a useful design constraint, and think about how this other tooling, like observability and whatever, can help us work within that constraint, and have it become more human, and address the fact that we have some limitations. It is just a constraint, and we can find creative ways to work.

Kerr: Yes, work within it instead of railing against it.

Skelton: You mentioned code review as an obstacle. Code reviews can also act as a point where knowledge transfers between developers happens.

Kerr: It's an important point because code review has a lot of purposes, some of which are blocking bugs or inconsistencies going into production. It's the blocking part that I'm complaining about, and the conflicting objectives of the two people involved. Looking at other people's code is great. I'd much prefer pair programming or ensemble programming, but you can also do code review that's not blocking. You're free to look at all the commits that go in. You can do that together as well. It doesn't have to be blocking.

Skelton: This idea that you can split the work into individual person-sized chunks and somehow get better throughput is one of the main problems in software development, I think. Sometimes it's the right thing to do; if we've found a task that is individual, or at least simple enough for one person, that's fine. But there's a whole lot of context that we miss by doing that.

Kerr: Tackling it as a team applies all of our knowledge and understanding to it.

 


 

Recorded at: Sep 23, 2022
