Marty Abbott and Tanya Cordrey on Microservices, Availability, and Managing Risk

In this podcast, Marty Abbott and Tanya Cordrey sat down with InfoQ podcast co-host Daniel Bryant. Abbott, CEO and co-founder of AKF Partners, and Cordrey, partner at AKF Partners, discussed topics that included: their learning from working together in the early days of eBay, why and how to avoid creating software systems that are composed of deep call chains of microservices, and how to build effective product teams.

Key Takeaways

  • First introduced in the book “The Art of Scalability”, the AKF Scale Cube is a model for segmenting software components, defining microservices, and scaling products. It also creates a common language for teams to discuss scale-related options in designing solutions.
  • The microservice architectural pattern is best used for implementing the “breadth” of business functionality. Engineers should avoid building deep call chains of services, as this can increase the probability of failure, and can also increase the challenges of locating and diagnosing issues. Code libraries can often be used more effectively to implement “depth” within services.
  • The AKF Availability Cube is a new model to guide discussions on how to achieve high availability. This model can also be used as a mathematical tool to evaluate the theoretical “as designed” availability of existing systems.
  • Building products using cross-functional teams is an effective approach. However, care should be taken not to accidentally create unnecessarily large teams, as this can add communication and coordination friction to the delivery process.
  • Teams should make a conscious choice to adopt new technologies, and understand the benefits and tradeoffs with doing so. Managing risk, and in particular, technology lifespan risk, is an important part of the value engineers provide to the business.

Transcript

Bryant: Hello. Welcome to the "InfoQ Podcast." I'm Daniel Bryant, News Manager at InfoQ, and Product Architect at Datawire. I recently had the pleasure of sitting down with Marty Abbott and Tanya Cordrey. Marty is the CEO and co-founder of AKF Partners, a global technology consulting organization. Tanya is a partner at AKF Partners. Marty and Tanya have both had fascinating careers and worked together in the early days of eBay. I've followed Marty's work for many years now, having been a big fan of the books he co-authored, "The Art of Scalability" and "Scalability Rules." I was keen to learn more about his latest thinking with regard to the models that he and the AKF team have been working on. If you look back through my talks online, you can see that I've referenced the AKF Scale Cube a lot in relation to microservices.

I've also followed Tanya's work for quite some time, and I wanted to explore her experiences in relation to when she worked as the chief digital officer at Guardian News & Media. Tanya was on the senior executive team here as the organization scaled from national news reporting to a global presence. I knew Tanya had a lot of great insights into leading teams, setting objectives, and sharing the vision. This is an area I wanted to focus on and dive a little deeper into.

Hello, Tanya. Hello, Marty. Welcome to the "InfoQ" podcast.

Abbott: Hello, Daniel.

Cordrey: Hi.

Bryant: Could you briefly introduce yourselves for the listeners, please, and share a bit of your background, your career highlights as well? Thanks.

Introductions

Cordrey: I'll start. I'm Tanya Cordrey. I previously led the Guardian's global engineering, product, and data teams. It was a great time to be at "The Guardian" because during that period, "The Guardian" transformed from a UK and print-centric organization to a global digital powerhouse. We laid a lot of the foundations for commercial success. Before that, I ran product at eBay UK. I've worked for a wide range of different organizations from large legacy organizations to production tech-first companies to startups, etc. Today I spend my time on some boards of some companies, and I'm also a partner at AKF Partners, heading up our efforts in Europe.

Abbott: I'm Marty Abbott. Started my career as an officer in the U.S. Army. Moved on after five years to be a software and electrical engineer at Motorola. Then ran components of IT for Gateway when they were still around, specifically within the Asia Pacific region. Then went on to eBay for six years, last two of which I was a CTO. That's how I met Tanya. Then became the CEO for a struggling startup in New York. Turned it around, sold it to AOL, and started AKF Partners with 2 friends 13 years ago, and was lucky enough to hire Tanya recently.

Bryant: As I mentioned off mic, I'm most familiar with AKF work through the Scale Cube. I think Simon Brown tipped me off to that many years ago now, and it's been super helpful throughout my career. I saw recently you have the AKF Availability Cube online, which I thought was very interesting. I love models as a sort of ex-consultant, I really like models. Could you briefly introduce how the Availability Cube came about for me please?

How Availability Cube Came About?

Abbott: Yeah, the Availability Cube, in many regards, is just an extension of the Scale Cube which we developed 13 years ago. The premise for it and how it started was we noticed a number of clients were really struggling with the concept of microservices, specifically that they were chaining them together and very much disregarding the math regarding the availability impact of chained microservices. That's where service A calls service B which may call service C, etc., each of which has a very specific, sometimes difficult to calculate, but nevertheless measurable availability component. When they are chained, when service A calls service B calls service C, the availability lowers because you get the multiplicative effect of failure. Think about electrical circuits: when components are wired in series, any one of them breaking breaks the whole circuit.

Now, especially when applied to services, as compared to, say, a monolith, besides having these individual breakable components, any of which might cause a service to fail, you also have a number of other devices in between these services. Within cloud deployments, it's that much more difficult because you're not in control of the complete infrastructure. Say service A is designed to have four nines of availability, and B and C are each designed to have four nines of availability. Then you have networking components in between them, which may be built to five nines of availability, etc. It's very easy to see that those four nines on an individual basis, coupled with the five nines, very quickly reduce your availability to below 99.9%. And that's roughly 42 minutes of outage or customer impact a month, which most companies really don't want.
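
To make the multiplicative effect concrete, here is a minimal Python sketch of the series calculation. The chain layout (three four-nines services with a five-nines network hop between each pair) and the 30-day month are illustrative assumptions, not figures from AKF; the function names are hypothetical. For reference, 99.9% availability corresponds to about 43 minutes of downtime in a 30-day month.

```python
from math import prod

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes (assumed month length)


def series_availability(availabilities):
    """Availability of components in series: all of them must be up at once."""
    return prod(availabilities)


def monthly_downtime_minutes(availability):
    """Expected minutes of outage in a 30-day month at the given availability."""
    return (1.0 - availability) * MINUTES_PER_30_DAY_MONTH


FOUR_NINES = 0.9999
FIVE_NINES = 0.99999

# Hypothetical chain: service A -> network hop -> service B -> network hop -> service C
chain = [FOUR_NINES, FIVE_NINES, FOUR_NINES, FIVE_NINES, FOUR_NINES]
chained = series_availability(chain)

print(f"single four-nines service: {monthly_downtime_minutes(FOUR_NINES):.1f} min/month")
print(f"chained: {chained:.5%} available, {monthly_downtime_minutes(chained):.1f} min/month")
```

Every additional service or intermediate device appended to `chain` multiplies in another factor below 1, so the longer the call chain, the lower the theoretical "as designed" availability.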

Bryant: Totally makes sense. A question to yourself, Tanya, as you were working with "The Guardian," did you see microservices becoming a thing when you were there? Did you see this need to sort of break things apart to scale out and try to increase availability?

Did you see Microservices Becoming a Thing at "The Guardian?"

Cordrey: Without a shadow of a doubt. Funnily enough, I actually hired AKF to come and help us.

Bryant: What a surprise.

Cordrey: Exactly. Marty sort of rocked up. And we had a sort of rapid learning on things like the Scalability Cube, etc. It radically changed our approach on how we were building our services. Because I think at that time, we were very much focused really on the UK audience. This type of approach allowed us to suddenly really think about things in a much more ambitious, scalable way. Suddenly, we could think about having a global audience, having our own journalists, being able to write content 24 hours a day, being able to expand in different countries, etc. It really did lay the foundations for us to rapidly change our sort of engineering philosophy.

Abbott: I think back to my earlier point around the Availability Cube and microservices, we've always been huge proponents of microservices. The issue is in how they're deployed and architected. Microservices help engender significantly higher levels of organizational scalability. If not applied properly, as in breadth versus depth, depth being the chaining of microservices, if not deployed such that each pool of microservices faces a customer and they're never deep, that's where we may get organizational scale at the expense of the availability of the solution. We like to flip that. We like to present microservices in breadth as components of the product architecture, but never in depth.

Bryant: That's really interesting because I think there's a tendency with folks, when they hear the word microservices, is to decompose. We have this sort of notion of bounded context and so forth. I definitely see people wanting to go to the chaining level. They say, "I'm going to have the cart service that calls the payment service," these kind of things. I think from when we build systems at a code level, that's the kind of natural way we structure our code, the different layers, the different kind of chaining of things. Have you got any advice on how folks should design to minimize this long chain of services?

Have You Got Any Advice on How to Minimize Long Chain of Services?

Abbott: Sure, we often use the phrase, "Services for breadth, and libraries for depth." The thing about libraries, everyone seems to be running away from them. The great thing about them is they eliminate all of these extra hops: the call through the operating system to the NIC, down through the network, to another service, etc. Especially if you use dynamically deployable, shared loadable libraries, they give you many of the same benefits of services in depth without the significant availability impact of all of these other devices standing between service A and service B, and they still allow for independent deployment. Yet, for some reason, folks have just forgotten about them. They think services everywhere, services deep, services broad. As a result, we get these huge service mesh networks that not only have low availability, but are incredibly difficult to troubleshoot because when one of these little fuses or services pops, if everything's connected, it's generally called the N-squared problem, everything fails. When everything fails, you don't know where to start troubleshooting.

We see companies striving to get better organizational scale, where an organization or team of 15 or fewer people own a service. All great. That allows us to have lower overhead, which results in higher velocity. Again, they start stacking services deep. The number of our clients that do this runs into the hundreds at this point. Then they all wonder, "Why is it that my services deployment results in lower availability, even though I get higher velocity than it did before?" The answer is simple. It's the same thing we see, again, in electrical components. Think of Christmas tree lights. The old lights where, if a single bulb burnt out, the entire string went. That's what happens when we chain services together.

Bryant: One thing I think you hinted at there, Marty, which developers do like, is the ownership thing, the whole two-pizza teams, and Jeff Bezos and so forth. How does the ownership work with these libraries? Because I'm totally fond of libraries, I've built my fair share of libraries in the Java world, but again, I was working on the monolith then. I was owning the library as part of the general code base. If we're trying to split services up into different teams to improve velocity, how do the libraries play into this? Is it a centralized team or not, for example?

Libraries

Abbott: You can deploy a library in the same fashion that you deploy a service. As a matter of fact, often when we work with clients, again, we're talking about services that run deep here, or call chains, if you will, we often tell folks that a service is nothing more than a daemonized library with a RESTful interface. Again, remember things like DLLs, or shared dynamically loadable libraries in both Unix and Linux. If you can do that, a team can own it, deploy it separately, and it just requires a restart of the service that was calling it, which, as a side note, often happens when you deploy a service anyways. The parent or calling service will often need to be restarted even if it has circuit breakers, etc. It's just good practice to eliminate all these extra hops, but again, obviously focus on services decomposition, but do so from a broad perspective.

You also brought up, if you don't mind, I think it's an interesting story, the notion of Jeff Bezos and two-pizza teams. There's a great story behind this. When Rick Dalzell was the CTO of Amazon, Amazon had hired him from Walmart long ago. He had a reading club within Amazon and they read Fred Brooks' "The Mythical Man-Month." In there, Brooks references Conway's Law, which at the time was published in about 1969. The paper is called "How Do Committees Invent?" Brooks is the first one to call Conway's Law Conway's Law; before that, it was a little-read paper. In there, Brooks and Conway argue that teams should be small and own components, which is one of the corollaries of Conway's Law. Jeff asked Rick, "How large should teams be?" Conway, by the way, in his paper says teams should be fewer than 8 people, but Brooks sort of increases that to the closest Dunbar number, which is between 12 and 15. Rick tells this to Bezos. Bezos is like, "This makes a lot of sense, but it's hard to follow. Why don't we say no team should be larger than that which two pizzas can feed?"

Bryant: That is an awesome story. I read "The Mythical Man-Month" in college. I thoroughly enjoyed that one. I see the problems talked about in that book reinvented time and time again. I'd be curious, actually, to get both your opinions on why we as an industry seem to repeat history. We don't always learn as much as I think we should. Don't know if you've got opinions on that.

Why Do We as an Industry Seem to Repeat History?

Cordrey: I just think people often just get their heads down and sort of forget to look up and realize what's sort of happening. As an industry, as you quite rightly say, we're very passionate about having small empowered teams. Yet, particularly over the last 12 months, I've seen so many teams that have really got very bloated, because the teams are really keen to have an expert on each thing in the room. For example, there'll be, obviously, the engineers, but then they'll have the product manager, and they'll have a data analyst, and they'll have a user experience researcher, and then they'll have the designer. Then sometimes there's more than one of those people in the room. Suddenly, a team that's meant to be very efficient at sort of 6, 7 people is suddenly becoming 10, 11, 12 people.

What's really frightening about the way some of these teams work is that they're being very much sort of siloed in their responsibilities, so they don't get out of their own swim lane. I think that's particularly worrying with things like the data and the metrics, because nothing makes me sadder than when you go and see a team and you ask them about their success metrics, or the performance, or, for something in particular they've implemented, what impact it has had. Then the team will look a bit blank and go, "We don't have the data person here." You're sort of going, "Come on." Everybody should really be owning that and feeling sort of passionate about that. I suspect you've seen the same sort of thing, Marty.

Abbott: Well put, Tanya. I liken this to two different areas, hopefully they'll resonate with the listeners. One being diet and fitness. A lot of folks will attack only one side of it, right? They'll either diet or they'll exercise, but they don't do both. As a result, the results are suboptimal. The next analogy is the comparison of corporations with the military. Specifically, the military is always fighting the newest war with the newest weapons, but last war's tactics. Corporations do the same thing. It's new battle, new technology, old tactics. We'll deploy microservices, but we'll do it in the same fashion that we used to think about libraries. That creates a problem. New weapons require new tactics, new technology requires new approaches.

Cordrey: I was going to add. I think one of the things I've seen, and we see a lot, actually, now when we work with companies is that you have a lot of teams that are really trying to do the right things. I often refer to them as sort of framework fanatics. This is where these teams are really beginning to make process, rather than outcomes, their North Star. Now, I have to confess, as consultants, as you've quite rightly said, we all love a good model, a good framework, etc. I love a great framework to simplify a problem or identify a potential solution. But software development, product management, these things are not painting by numbers. Frameworks, rituals, processes are guides, not solutions. Too many teams today are really making their role about the execution of the process, rather than the impact of the work.

I've actually even come across teams who come to a product squad. I came across one a few months ago, and they defined their mission as being exemplary in Agile. They even had OKRs about how great they were at doing Agile, and the what and the how they were doing it had become the whole focus, rather than the why, which I think everybody here would agree is the most important question.

Bryant: Yes, the tail-wagging-the-dog type situation. I think it's often easier to focus on process than it is on outcomes. My experience is as an organization gets bigger, it's even harder because there's more bureaucracy introduced by its very nature as an organization scales. I think that's something I've tried to fight against sometimes, and reduce the bureaucracy. You mentioned OKRs there, for example. I think OKRs, when used well, are a really good way to align teams on what are we actually trying to do, and what are our levers to have [inaudible 00:15:24] experience. How do we best communicate that to our teams, have you got any advice on that?

How to Best Communicate OKRs to Teams?

Abbott: I think that a senior technology executive's job, one of their primary jobs anyways, is to create a really good causal roadmap between what an engineer does every day and how it creates value for the company. If OKRs are implemented properly, nested properly, and always reviewed thoroughly and updated frequently, it's a great mechanism to do that. The test there is whether an engineer can look at an OKR tree and understand what he or she does every day and how it creates business value. If we do that, I think we eliminate some of the issues, similar to what Tanya brought up, specifically having approaches run perpendicular or orthogonal to the outcomes of the company. It's unclear how being Agile, in the way that Tanya put it, really starts to create stakeholder wealth and value within a company.

Bryant: Interesting. There's something I want to riff off there and pick up what you said, Tanya. I've definitely seen this where teams get bloated. What do you think is the way to resolve some of that? Is it to have, say, more of these specialists as consultants, even if they're full-time staff, but do they consult more to the individual teams?

How to Solve Bloated Teams?

Cordrey: I would agree with that. I think you don't need everybody in the room all the time. I think teams need to be clear on who is the decision-making body. What's that lovely phrase? A camel is a horse designed by committee. I think sometimes when you get teams that are too big, unfortunately sort of common sense disappears and groupthink takes over. I think teams really need to apply common sense. Put bluntly, if there are too many people in the room, you should kind of say, "Hold on, we need to get some of the people out of the room." Because I think the trouble is that often teams can see when things are going wrong, but actually, we're all very nice. Nobody wants to be the jerk to sort of say, "Actually, hop it, half of you." I think sometimes teams really need to do that.

For example, let me tell you a story of a team last year I came across. So this team was really passionate about discovery and getting user feedback, which obviously we all agree about. Every team had a dedicated user researcher. The team had really landed on this idea that you would test everything in front of real users. They built a testing lab in their building. So far, so good, you think that sounds all great. Over a period of months, the process took over everything and nothing could be decided until it had been in front of some focus groups, and not just one focus group or two focus groups. What is it with the design sprint, the sort of Jake Knapp process, they say you only have to put it in front of five people? This group were putting it in front of 8, 9, 10 focus groups. Because it's a team sport, discovery's a team sport, everybody had to attend every focus group.

The team lost confidence in making any decisions. The team became completely reliant on focus groups. When they got the answer they didn't like, they just did more focus groups. The engineers were utterly tearing their hair out because they had stopped writing any code because they were always in focus groups. The thing is, this team had moved their OKRs to be around how many focus groups they were actually conducting. They were doing great on the OKRs, but actually, common sense had gone out the door. As I say, the engineers were just so frustrated. What had started as a very sensible thing and a very sensible approach, driven by good intentions, had just got warped. Again, I think with size of teams, it doesn't happen overnight where you have a high functioning team of 7, and then one day, there's 15 people. What happens is that, over a matter of weeks or months, suddenly that 7 becomes 8, then 9, 10, etc.

Bryant: That drift into failure mode, isn't it?

Cordrey: Yeah.

Abbott: Size of an organization responsible for getting something done is always highly correlated with bureaucracy: the larger the group necessary to get something done, the higher the level of bureaucracy. Bureaucracy is all non-value added overhead. Trying to get teams to be the right size, own outcomes consistent with their size is how we battle bureaucracy, and then measuring what matters. Specifically within engineering, what matters is how much time an engineer is spending doing the thing you pay him or her to do, and that's code. If they're in meetings and we don't measure that, if we don't know that an engineer spends 50% of his or her time in meetings, we don't know that they're only writing code 50% of the time, which means that they cost us twice as much. We're paying twice as much as we should for what matters, and management has to get involved and eliminate that bureaucracy, push ownership down, properly relate teams to architectural components, which are further directly related with business and product outcomes. That's how we engender empowerment within teams.

Bryant: Well said. I wouldn't mind pivoting a little bit to just look at some of the architectural principles now, if that's okay. It's a nice lead-in there, I think. Microservices were not necessarily a thing when your book was first published now many years ago, but the other principles were there pre what we're now labeling as microservices. What are your thoughts on the evolution from the monolith to microservices? We're even seeing some folks now sort of going back towards the monolith. Have you got any sort of commentary around the journey we've been on?

Thoughts on Evolution from Monolith to Microservices?

Abbott: This whiplash that you describe of teams rapidly adopting microservices and moving back to the monolith is seen time and time again with virtually every other technology evolution. It's not uncommon for folks to latch on to whatever the [inaudible 00:21:04] or approach du jour is, go wholeheartedly, drop their hips and drive into the objective, and then realize, "We didn't plan enough. We didn't understand enough what we should be doing. We didn't architect well enough." Back to our earlier discussion, now we have these deep chains of microservices. Unfortunately, often they say, "Let's just go back to the prior state." When the best solution is, again, to optimize both sides of the equation. It's to both plan properly for availability, calculate the theoretical availability of a solution, broaden your microservices, and in depth use libraries. Your failure rates drop, the probability of failure drops. As a result, your availability increases and you still get all of the other benefits of microservices.

Bryant: I like it a lot. I was actually looking back through your book earlier on today. One other thing I picked up on designing things was using mature technologies. I guess that can apply to programming languages, as well as databases and other things. We've got a fantastic array of technology at our disposal now. As a developer, when I started my career, I couldn't have dreamed what we've got now. What are your thoughts on how we should pick technologies appropriate for the task at hand?

How Should We Pick Appropriate Technologies?

Abbott: We have a wonderful article on the site about the bathtub effect in failure rates of different things, whether they be infrastructure or software, as it relates to their age. Think about a graph that has a bit of a bathtub. Think of it as a U, where the x-axis is time and the y-axis is failure rates. Any new technology, whether it be infrastructure or software, has an incredibly high failure rate as we try to figure out how to make it work, how to reduce defects, etc. Ultimately, we bottom out, and now it's a mature technology. But if you wait too long, as infrastructure ages, or as you take on additional debt and don't pay it down from a software perspective, those failure rates, again, increase on the far right side of the curve. Where we want our clients is in the basin of the bathtub. Solutions that have been tried by many other companies already. Most of the kinks have been worked out, but there's still some competitive advantage in adopting them. Then ride that until such time as the next bathtub comes around and swap before you have significantly higher failure rates.

If you're designing systems predicated on high availability, either to meet a business need, a B2B business, or a consumer need, a B2C business, inherent to that value adoption by the consumer is availability. We overlook it, we just think, "It's a new competitive advantage," and don't properly apply the calculus to understand what the failure rate and associated impact to our availability will be.
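
The U-shaped curve described above can be sketched as the sum of a decaying early-failure term, a constant baseline, and a growing wear-out term. This is only an illustrative model: the functional form, the parameter names, and every number below are assumptions chosen to produce a bathtub shape, not values from the AKF article.

```python
import math


def bathtub_failure_rate(age, early=0.5, early_decay=1.0,
                         baseline=0.05, wear=0.01, wear_power=2.0):
    """Illustrative failure rate at a given age (arbitrary time units)."""
    infant = early * math.exp(-early_decay * age)  # high at first, fades as kinks are worked out
    wear_out = wear * age ** wear_power            # grows as infrastructure ages or debt accrues
    return infant + baseline + wear_out


# Scan ages 0.0 .. 10.0 in steps of 0.1 and find the lowest point (the "basin").
ages = [step / 10 for step in range(0, 101)]
basin_age = min(ages, key=bathtub_failure_rate)

for age in (0.0, basin_age, 10.0):
    print(f"age {age:4.1f}: failure rate {bathtub_failure_rate(age):.3f}")
```

With these made-up parameters the rate is high at age 0, lowest in the middle, and high again at the far right, which is the region where swapping to the next technology would be considered.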

Bryant: Very interesting. Have you got an experience with this one, Tanya, say, picking too new technology and then regretting the choice at all?

Cordrey: I think we did that a couple of times at "The Guardian," but it's really hard because when you have a fantastic enthusiastic team, often the engineers want to try new technology. I'm a big believer in making space for everybody to sort of learn, and grow, and test, and all those sorts of things. I think there's the balance between allowing the team to try out new things. As Marty quite rightly says, it's balancing the risk. You don't want to be trying something really funky and new on something that's a mission-critical system for your service or product, etc. I think there's a case of trying to give teams the room to experiment, but actually, at the end of the day, you have to sort of be taking a kind of fairly cautious, pragmatic, and very sort of business-minded approach on kind of where you do that, and where you don't do that.

Bryant: It makes total sense. Something we talk about at InfoQ a lot is diffusion of innovation. Geoffrey Moore's "Crossing the Chasm," for example. One thing we see is folks don't always recognize that different people's perspectives of it are going to be different. If you're a super hipster startup burning through VC money versus you're an enterprise, you've got money-making software on COBOL that you've got to keep running, your risk tolerance, I guess, is very different depending on where you're coming from. I'm kind of curious, have you both got experiences of that, where perhaps people haven't fully realized where they are in terms of what is risky to them? Is this new technology going to add value? Or is it too risky to experiment with?

Diffusion of Innovation?

Abbott: I think applying Tanya's approach of playing with things in non-mission-critical areas of a product, or even within a development or R&D environment helps teams better understand that and also get up to speed on the technology before it goes into a mission-critical system. As a side note, the diffusion of innovation, do you happen to know where that came from prior to Geoffrey Moore?

Bryant: I've recently re-read the book. Actually, I don't. Go on, Marty, let me know.

Abbott: The technology adoption lifecycle and diffusion of innovation theory comes from Rogers in 1962. But it had nothing to do with technology per se. It was the adoption of hybrid corn seed. This model applies not only to technology, but virtually everything else that we do, any new innovation, even though it's called the diffusion of innovation theory, which resulted in the technology adoption lifecycle and the technology adoption model, but it applies to the adoption of virtually anything that's innovative.

Bryant: Intriguing. There's always lots to learn. When I was an academic, actually, my professor at the time always said to me, "Cross pollinate." I was in, obviously, computer science, but he said, "Go and chat to the biologists. Go and chat to different people." You learn so much more by learning from other sources, I think, than where you're coming from.

Abbott: Absolutely. That's why we are so big, as a firm, on trying to get our clients to adopt the notion of durable cross-functional teams, where it's not just software developers who then pass something along to infrastructure people. Rather, you have business people, product owners in an Agile term, and software and infrastructure DevOps, SREs, etc., all working on the same team to achieve a common outcome within that team.

Bryant: Very nice. You mentioned the SREs there and DevOps. One thing on the conversation I was keen to pick up on is observability, as we're now calling it, sort of monitoring, logging, and so forth. I'm guessing, we all know really, that's super important. Have you got any sort of thoughts on how that relates to scalability and availability too?

How Observability Relates to Scalability and Availability?

Abbott: Absolutely. There's a wonderful notion not widely adopted of test-driven development within engineering. We sort of took that and said, "You also need to be thinking about, at the time of design, how it is you're going to monitor for the desired effect or business outcomes of this solution." Very often we say, within the Agile notion of done, something being complete, it's when something actually achieves the business outcome, because that's where the expenses were predicated. Not when we're done developing something, but rather did it achieve the desired outcome? If we start thinking that way on day zero, besides just the software that we write, and the infrastructure upon which it's hosted, we would also be thinking about all of the associated monitoring capabilities to detect not only that it's not functioning as designed, I think that's a horrible term, but rather, as expected. Expected means achieving the desired business outcome.

Bryant: I wonder if you can talk a little bit about performance testing. Now, I always find this quite tricky. Often it's a big bang thing at the end of a project. We've designed the system, hopefully, we thought about the cross-functional, non-functional requirements. My experience frequently is things like performance testing are done at the end of the project. Any advice from either of you on how to shift that left or shift that forward a bit? Get sort of things like performance into the design and into the testing early on?

Thoughts on Performance Testing?

Abbott: I have two thoughts. One is that the most mature organizations we know perform performance testing within their CI pipeline, such that commits ultimately trigger not only integration and unit tests, but also performance tests to understand the impact or degradation of performance of added functionality. That's point one. Point two, performance testing rarely has the payout we desire because it's rare that we can reproduce the production environment in its entirety. If you were to do that, you've doubled your cost. Suppose instead your production environment is properly fault isolated, back to the Scale Cube, along the z-axis, by customers. Let's say you have 10 of these, each of which serves one tenth of your customers, with one very small exception that might be 30 basis points of your customers. That's your Canary environment. Now we can test live. As long as we can easily roll back, we'll understand with real user traffic what the impact is going to be.
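The customer-based split with a small canary can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not AKF's implementation: the names `route` and `bucket_for`, the hash-based assignment, and the choice to carve the canary out ahead of ten equal shards are all hypothetical. Only the figures (10 partitions, 30 basis points) come from the conversation.

```python
import hashlib

NUM_SHARDS = 10            # customer-based (z-axis) fault-isolated partitions
CANARY_BASIS_POINTS = 30   # 30 basis points = 0.30% of customers
BUCKETS = 10_000           # one bucket per basis point


def bucket_for(customer_id: str) -> int:
    """Map a customer to a stable bucket in [0, BUCKETS) via a hash."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def route(customer_id: str) -> str:
    """Return the environment (hypothetical names) that serves this customer."""
    bucket = bucket_for(customer_id)
    if bucket < CANARY_BASIS_POINTS:
        # The small slice of real users who see new releases first.
        return "canary"
    # Everyone else is spread roughly evenly over the main shards.
    return f"shard-{bucket % NUM_SHARDS}"
```

Because the assignment depends only on the customer identifier, the same customer always lands in the same environment, so a release to the canary affects a fixed, small group of real users and can be rolled back without touching the other partitions.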

One of the biggest issues with performance testing is when we release new functionality, we have an expectation as to user behavior, but very often what happens, and this is from our third book, users will use the new functionality in a way that we didn't understand, so our test is invalid to begin with.

Bryant: Digging into that, Tanya, I've seen that too. I'm guessing you've seen it as well.

Cordrey: I think that you always have to be very aware of how the users are behaving. As Marty quite rightly says, sometimes they don't behave in a way that you're expecting. Or there's the other trap you fall into where you look at the data and you make an assumption. I remember in the early days, when I worked at eBay, there used to be this amazing chart that we had, which showed the number of pages that users looked at. This, as I say, is in the very early days of eBay. It was a bubble chart. When you looked at all the other big websites at the time, the eBay sort of engagement of how many pages people looked at, it was like this big Deathstar bubble chart. This sort of Deathstar in the middle was eBay. We really took it, at least for us in the UK, we looked at it, went, "Aren't our users engaged? Isn't that really great?" We were really proud of this chart. As we were trying to grow the business, we'd go out and sort of show it a lot to talk about how engaged our users were.

However, as we sort of matured a bit and we started doing different types of testing, different ways to engage with our users, we realized that a fair proportion of all those pages people were checking out was because they were finding discoverability really hard on the website. They couldn't actually find what they were looking for. We had this metric that we initially thought was a really fantastic thing. We were very proud of it. Then actually, when we started peeling away the onion, we realized that a fair chunk of it was actually due to sort of user confusion or not being able to do what you wanted to do. Not surprisingly, discoverability and findability became a sort of big priority then for the months ahead, once we discovered that.

Abbott: That is such a great story about making sure you're measuring the right things. As a result of that, we moved away from how many things or items the customer views to the ratio of searches to an Add to Cart or purchase because that's really what's important.
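As a rough illustration of the metric described here, the ratio of searches to conversions can be computed from a stream of user events. The event names (`search`, `add_to_cart`, `purchase`, `page_view`) and the function name are assumptions made for this sketch, not anything from eBay's systems.

```python
from collections import Counter
from typing import Iterable, Optional


def searches_per_conversion(
    events: Iterable[str], conversion_event: str = "add_to_cart"
) -> Optional[float]:
    """Average number of searches per conversion event.

    Lower is better: users find what they want with fewer searches.
    Returns None when there were no conversions, so the ratio is undefined.
    """
    counts = Counter(events)
    conversions = counts[conversion_event]
    if conversions == 0:
        return None
    return counts["search"] / conversions


# Hypothetical session data: many page views say little on their own.
events = ["page_view"] * 20 + ["search"] * 6 + ["add_to_cart"] * 2
ratio = searches_per_conversion(events)
```

Unlike a raw page-view count, this number gets worse, not better, when users are struggling to find an item.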

Bryant: We're coming to the end of our time now. Are there any additional topics you were keen to cover?

Any Additional Topics?

Cordrey: In the middle of our sort of Coronavirus craziness, there are a lot of websites at the moment that are really struggling with scalability, availability, etc. I thought it'd be really interesting, Marty, maybe if you had a few words of sort of, if you're the CTO or engineering head of one of those companies, from a leadership perspective, how do you work through these crazy times? I see Zoom today has published something about what they've been doing. Obviously, you're also one of the people who've lived through this sort of crazy hyper growth, and then having to deal with the consequences of it.

Abbott: I don't know how a company can properly identify black swan events like COVID. I do know that there are companies that have absolutely risen to the occasion, and you both mentioned Zoom. Those folks should be applauded, because all of a sudden, overnight, I have no direct insight into their traffic numbers or interaction numbers, but overnight, everyone's working from home and they're using solutions like Zoom to do this. Those folks emerge from this thing as a shining star in terms of the scale and availability of their solution, as compared to a number of old brick and mortar companies that didn't properly invest in their commerce solutions online that are failing, because they're all seeing Black Friday-like numbers. These things, while they take time to fix, they're almost always easy to fix. It's just technology. As long as it's architected properly using building blocks or scale bricks, as we call them, it's easy, especially with elastic compute, to be able to scale these things nearly on demand and avoid black swan events. Again, as with any significant issue like we're facing today, there are always the heroes like Zoom. Unfortunately, the landscape's always littered with folks who weren't prepared for the black swan.

Bryant: I was chatting with some folks the other day, and they were talking about running local game days, effectively sort of creating disaster scenarios. Not real ones, but as in creating these scenarios, running them, and see how people behave. Is that something either of you've done in the past?

Abbott: Absolutely. We always tell our clients that you need to practice what we call the DID approach. The first being the design of a solution, which is theoretically, or at least intellectually, high cost. In terms of actual software, there's not a lot to do. You need to get that out of the way before these events, such that you would understand how to deploy. High cost intellectually, but the absolute cost is low. Then the implementation of it, which is the writing of software: lower intellectual cost, but higher activity and operating margin impact, because you're paying engineers. Then finally, you don't actually have to deploy, the last D, until you need it, especially in the cloud, infrastructure-as-a-service enabled world. Therefore, you don't have to impact your cost of goods sold. You'll have it developed and ready. Then it's just the elastic expansion when you need it.

Bryant: Interesting. Have you seen anything like that, Tanya? I've heard a bit of it more recently, about folks sort of running these disaster trainings, for example. Like Marty said, you got to have some capabilities already there. It's often interesting from the social side, how do folks react to disaster?

Cordrey: We probably never planned for anything at quite the level of disaster that we're seeing today. We've never planned for some COVID-19-type event, but we have done, though, sort of disaster recovery days where something's not working, or what have you. The one that, funnily enough, teams often neglect is particularly when you have a lot of legacy systems. There's that one engineer who is the only one left in the building who knows something. If you're ever doing the sort of planning around disaster scenarios, you should plan that that engineer is suddenly unavailable, because that's often a really big single point of failure that many teams have.

Bryant: I like that a lot. I don't know if either of you have read "The Unicorn Project" and "The Phoenix Project" by Gene Kim. He talks a lot about this kind of stuff. I recently read "The Unicorn Project," and it's very much that the people can be as big of a bottleneck as the technology. I like it a lot. Super. Thanks both for your time today. I really enjoyed chatting. Thank you very much.

Cordrey: No, thank you. It's great.

Abbott: Thank you, Daniel.
