Engineering Your Organization: Services, Platforms, and Communities

Summary

Randy Shoup discusses the different ways high-performing engineering organizations gain leverage by specialization and sharing.

Bio

Randy Shoup has spent more than two decades building distributed systems and high performing teams, and has worked as a senior technology leader at eBay, Google, and Stitch Fix. He coaches CTOs, advises companies, and generally makes a nuisance of himself wherever possible. He is currently VP Engineering and Chief Architect at eBay.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Shoup: I'm Randy Shoup. We're going to talk about engineering your organization: services, platforms, and communities. At the moment, I'm VP of Engineering and Chief Architect at eBay. I've also been a senior engineering leader at Google, at Stitch Fix, and at WeWork. What we're going to talk about comes from experiences from all those different places. One thing I'd like to say, and it has certainly been my experience in my 30 years in the industry, is that all organizations are wrong, but some are useful. What do I mean by that? I mean, by definition, our interactions with other people in our company are multi-dimensional. Also, by definition, an organization is a single-dimensional slice through that multi-dimensional space. Again, any particular organization is wrong, but some of them are a little bit better than others. Hopefully, we'll learn which ones are better than others.

Organizational Goals

What are some goals of our organization? Obviously, as an engineering organization, we want to sustainably deliver value to business and to our customers. We want to effectively leverage all the things we have at our disposal: our people, our teams, our technology. We also want to continuously improve and adapt to changing conditions.

Specialization and Sharing

One could imagine designing an engineering organization that looks like this, where every individual team was full stack all the way down to the metal. They didn't just build the web part or the mobile part and the application part and the storage part. They were also building the operating system, the data centers, the chips. Not many places are organized this way, because it's pretty inefficient. It doesn't give teams the capabilities to specialize in particular areas where they can make the most impact on the organization. Instead, we tend to organize like this, where we have maybe application teams, at the top, we'll have a bunch of shared services, maybe in the middle, and then we'll have a bunch of platform underneath. Then, in almost every organization I've been at, there are cross-cutting concerns as well, communities around a particular type of practice, a particular type of technology that cross these organizational boundaries. Again, this is one of those examples of where the organization is wrong. It doesn't encode these kinds of communities. Probably the way that we've organized it here is better than the alternative.

Services

I want to talk about all those different things. I want to talk about services, and platforms, and communities. Then I also want to undergird everything with some thoughts about leadership, the things that are going to make any of this possible. First, let's start talking about services. As in a domain driven design approach to the world, we'd like any individual service or any individual team to be aligned around a particular business problem. In particular, as leadership, we'd like to give a particular team a clear set of golden goals and metrics that matter to that team's customers. Those customers might be our external customers, or they might be internal customers that are also part of our organization. You end up having full stack teams, where teams are organized by domain. At eBay, where I work, we have teams that build the shipping part of the site, teams that build the selling part of the site, the search part of the site, all sorts of different areas of a marketplace like eBay. Those individual teams tend to co-locate idea generation, software development, quality assurance, and operations all together. The reason that we do that is we want to keep cycle time very fast and allow those teams to move independently of one another.

Service Organization

The organization is going to look like this. We'll have a domain that corresponds to a single team, which is maybe going to maintain one or several services. Those who are familiar with Conway's Law, about organization and architecture being duals or reflections of one another, are going to see this very directly right here. What we want to have is that a service team or an application team can independently design, develop, deploy, and operate all the things that they are responsible for. Then we also want them to be responsible for those services or applications, end-to-end, cradle to grave. All the way from the beginning of it, all the way through its main lifecycle, maybe into maintenance mode, if we have it, all the way through to deprecation.

Service Provider

As a service provider team, so as a team that provides services to other customer teams around our organization, my main goal is to meet the needs of my customers in terms of functionality, and quality, and performance. Also, in terms of stability and reliability. There's also this implicit expectation that I'm going to be constantly improving the service over time. Also, implicitly, there's a bunch of goals that I have, maybe nonfunctional goals, around doing all of that with minimum cost and effort, because the resources that I have at my disposal are finite. To do the best job that I can, as a service provider, I want to leverage common tools and infrastructure, which we'll call a platform. I want to leverage other shared services. I probably want to put a lot of effort into automating all the building, deploying, and monitoring of my service. I want to optimize the implementation of my service to make the most efficient use of compute resources.

Service Discipline

The discipline that I think is most valuable between service provider teams and service consumer teams is an analogy to the vendor-customer relationship, when we're third parties from each other. By analogy, the service provider team is an internal vendor, and the consumer teams are internal customers. Just as if we were talking about maybe using something from a third party, our service is only useful to the extent that it actively provides value to our customer teams. The wonderful corollary to that in mature service organizations is that teams get to choose whether they're going to use an internal service or not. The wonderful implication of that is that it forces me as a service provider to be strictly better than any of the alternatives that my customer teams might have around building it themselves, buying it from a third party, borrowing it via open source, or from another team at my company. That discipline around the service consumer teams being able to choose to use my service or not, makes the ecosystem run more smoothly. Again, that's exactly analogous to why or when customer organizations choose or don't choose to use particular vendors.

Service Evolution

The services are going to evolve over time. We're not going to have a static set of services, as long as our organization is continuing to evolve, and our external customers are continuing to give us more things they want us to do for them. Over time, we're going to be creating new services when they're needed to solve new problems. Then, services justify their continued existence as long as they are useful and are being used. Then, when services are no longer used, we deprecate them. For individual teams in an overall growing organization, what I've seen is that a team will start off small, and then they'll get bigger. Then just like cellular mitosis, as maybe we learned in high school biology, those teams are going to split into two daughter teams, and those teams might split into several teams as well, just exactly like cells in your body. In a high performing or mature service ecosystem, like the one that I experienced when I worked at Google, you get this wonderful quip, because there's so much change in the ecosystem, every service at Google is either deprecated or not ready yet. That was a quick survey of how to approach services.

Common Platform

Now I want to approach platforms and infrastructure. What is in a common platform? Probably there's a lot of shared infrastructure maybe provided by a third party cloud provider: compute, and storage, and databases, and typically some event system. There's also a common developer platform, typically. We're going to use some small number, hopefully, one source control system, development and testing environments, continuous delivery pipelines. Those things that make the developer experience work. Common capabilities that are necessary across any different service or application, so authentication, secrets management, operational tools like observability, alerting. Then it's also really common to have standard frameworks being provided. Maybe there's a service chassis, as I've called it in a couple of different organizations, that make it real easy for teams to build a new service or build a new application. There's also typically standardization around a small number of communication protocols and data formats. This might be one platform or several. My experience in these larger organizations is that all these things tend to be provided in a general platform-y way for all the teams to consume. Particularly, again, because for a particular service or application team, having to roll your own or choose your own of these is not a very efficient use of your time.

Platform Provider

As a platform provider, what's my goal? My goal is to reduce the cognitive load that the customer teams carry in doing all the kinds of things that we were just talking about. Best is if I provide a paved road, so a consistent set of integrated capabilities, maybe a framework, a set of basic services, that all tend to work together. If I'm doing a good job, then it's the path of least resistance for teams to use those integrated capabilities. Well-known organizations such as Netflix and Google are examples of this, and eBay as well, where a paved road, this common suite of services and frameworks and libraries, tends to be produced by a centralized organization. The key idea of the paved road, just as with services, is that it is permissible for teams to go off and bushwhack through the jungle, if they think that there's a better way to do the thing that they need to do. If they choose the paved road, then it's really easy, and everything comes for free.

The other aspect of being a good platform provider is we do everything self-service. Again, by analogy, maybe to third party cloud providers, we want to make sure that it's really easy to automatically provision things. All monitoring comes out of the box, examples and documentation for how to use the platform or how to use the infrastructure. Then something that we did very effectively at Google and at other mature organizations is, a lot of times teams as they're testing their thing, they want to mock out the core services. Providing basic mocks is a great thing that a platform provider can provide.
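As a rough illustration of the point about mocks, here is a minimal Python sketch of a platform team shipping an in-memory fake next to its real client. The `SecretsStore` interface, the `InMemorySecretsStore` fake, and the `build_db_url` consumer are all hypothetical names invented for this example; nothing here describes an actual eBay or Google API.

```python
from typing import Dict, Protocol


class SecretsStore(Protocol):
    """Hypothetical interface a platform team might publish for secrets management."""

    def get(self, name: str) -> str:
        ...


class InMemorySecretsStore:
    """Fake shipped by the platform team so consumers can test without the real service."""

    def __init__(self, secrets: Dict[str, str]) -> None:
        self._secrets = dict(secrets)

    def get(self, name: str) -> str:
        if name not in self._secrets:
            raise KeyError(f"unknown secret: {name}")
        return self._secrets[name]


def build_db_url(store: SecretsStore, host: str) -> str:
    """Example consumer code: depends only on the interface, not on the real backend."""
    password = store.get("db-password")
    return f"postgres://app:{password}@{host}/orders"


# In a consumer team's test, the fake stands in for the production secrets service.
fake = InMemorySecretsStore({"db-password": "test-only"})
print(build_db_url(fake, "localhost"))
```

Because the consumer code is written against the interface, the same function runs unchanged against the production client, and the platform team, rather than each consumer, owns keeping the fake faithful to the real service.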

Platform Consumer

As a platform consumer, if you're going to use an internal service or internal infrastructure, or if you're going to leverage a third party cloud provider like AWS, or Google Cloud Platform, or Azure, my strong recommendation is that if you're going to use something, embrace it. There's a lot of thinking that maybe I'm going to change my cloud provider, or maybe the internal services are going to change, and so maybe I should put a layer in place where I abstract it away. I understand the place that that comes from in the engineering heart, but every time you try to abstract away one of those things, you make it less likely you can actually leverage the full suite of capabilities that it provides. You end up with a least common denominator. In my personal experience, in 30-some odd years in this industry, replacing one of those platforms is really rare, and so optimizing for that edge case is not typically the right engineering tradeoff. People who fear lock-in of databases, lock-in of cloud providers, lock-in of third party vendors, to the extent that you feel locked in, that is exactly another way of saying that that third party capability is providing actually active value to your organization. Everything is a tradeoff. Lock-in isn't necessarily bad in my personal experience.

Usage Discipline

The last thing I want to say about platforms is we want to make sure that there's a disciplined usage approach. One of the things I found really effective is charging even internal customers for the use of the platforms and infrastructure that are provided internally. Why would I want to do that? Because, in my experience, particularly at larger scale, when customer teams get to use something for free, it gives them no incentive, economic or otherwise, to control their usage of scarce resources, or find more efficient alternatives. If you charge people to use the internal stuff, which is actually a lot easier now in the cloud world than it used to be, it motivates both the provider side and the consumer side to optimize their usage of scarce resources, whether it's compute or storage, or something like that.

I want to tell a quick story of when I was running engineering at Google App Engine. App Engine is Google's platform as a service. When I was there, it served 3 million external applications, but also 15,000 or so internal applications at Google. There was a particularly egregious internal customer, now I think after 10 years, I can probably name that it was the predecessor game to Pokémon Go, that was using globally I think it was 25% of this incredibly expensive, incredibly scarce resource that App Engine had to pay for. We charged external customers, but because this was an internal team, we weren't charging them. I bugged them. We filed bugs. We sent them kind emails. We sent them nasty emails. I set up meetings. Nothing did it. What did do it, however, was handing them a bill for a not insignificant number of millions of dollars, which is what we would have been charging our external customers. Magically, in the next couple of weeks, they were able to prioritize reducing their usage of this thing. They reduced their usage of it by 10x, and it also made their thing faster, because they weren't doing something as inefficiently as they used to be. These economic incentives around charging actually matter. That's why economies work in the way they do in the real world.

Communities of Practice

Now I want to talk about communities. Again, services and platforms are part of that layered infrastructure, but there's a bunch of orthogonal things that we do as a company that we'd like to share knowledge and share experience. At eBay, and lots of other places, we have communities that are organized around language frameworks and ecosystems. We have communities that are organized around particular roles, like security or user experience. We have communities that are organized around being the users of a particular service, or a particular platform, or a particular piece of infrastructure. Then we also have communities of practice around particular techniques and practices that people use. Those communities collaborate in lots of different ways. At eBay, we use a lot of Slack, so every one of those things has its own dedicated Slack channel, typically. In other organizations, I've seen heavy use of groups and mail lists, maybe those communities get together with periodic meetings. Then also, at least at eBay, we do internal conferences, and sometimes the communities will organize around that. These are ways that, orthogonal to the strict organizational structure, we can encourage sharing of knowledge and capabilities across different parts of the organization, and leverage lots of people's expertise, not having to go through the organizational structure.

Internal Open Source

Another thing that I have found super useful, both at eBay, and at Google and other places, is an internal open source model. I'm a platform team, or I'm a shared service team working away on my stuff. Lots of times, I'll get more requests for new capabilities and features than I have time to do. Just like in the external open source world, it's nice to be able to accept contributions from outside the team. Just like in external open source, some proposed committer is going to submit a pull request. We're going to expect that that pull request comes with tests. As a provider team, I will review it, iterate maybe with the contributor to make sure that it all works well. Then, hopefully, I'll merge it in. Then, just like external open source, we follow a bunch of the standard approaches that good external open source projects follow. There's a great open source guide at, https://opensource.guide, that's provided by GitHub. We should document our processes about what does it mean to contribute. We should learn to say no, because not every feature request is something that definitely makes sense and is consistent with the rest of the platform. We want to leverage the community to get new ideas and to evolve whatever we're working on. We also want to embrace automation. As in many situations when I have run shared services teams, or infrastructure teams, again, we get a lot of requests for new capabilities and bug fixes. A wonderful thing you can say is pull requests are accepted.

Leadership

The last thing I want to talk about is leadership. What are some of the capabilities and characteristics of leadership that make these engineered organizations work really well? One of the things that I found, now that I've been in the industry for a bunch of decades, and I'm also a parent, that when I'm thinking about the organizational structure, I try to think like an engineer. How would I refactor this if it were code? When I'm thinking about leading people and leading teams, I often like to think as a parent. I think you're going to see that as we go forward and talk about what I think about leadership.

Technological Maestro

One of the great things I just learned comes from a wonderful Idealcast podcast by Gene Kim, where he was interviewing Ron Westrum, the organizational sociologist who has talked about generative organizations and pathological organizations, and is a key inspiration for a bunch of the DevOps approach to culture. Westrum has this wonderful idea of the technological maestro. You want your leader to be a very competent technologist, even if that leader isn't typing code all the time. You want a maestro to be high energy because you're trying to inspire the team and make sure they're doing really good work. You want to make sure that the maestro asks the right questions. He or she might not know the answer, but they know how to ask the right questions to elicit, is everything going effectively? Are there opportunities for us to maybe be better, more efficient, leverage the resources that we have inside and outside the company in the most effective way? We want the maestro, just like a conductor in a concert, to have high standards for the quality of the things that we produce. Then, that maestro ought to be good on the details. Again, they're not supposed to be running the organization. This is not an argument for micromanagement, but it is an argument for the person at the top knowing what they're doing.

The wonderful quote that Ron Westrum quotes is this, from a guy named Jacob Rabinow, one of his laws of leadership. "If the man at the top is a dope or ignorant, everyone under him will soon be a dope or ignorant because he sets the tone." I have seen this so many times in my career in this industry. A way to understand why that might be true is this other aphorism that A players hire A players. The way I like to refine that is, A players hire and retain A players, whereas B players hire and retain C players. The insight here is that if the leader is really good, they're not threatened by hiring other people that are maybe even smarter than the leader is, or more experienced, or whatever. Somebody who is less confident in his or her own competence is going to hire people that aren't threatening. That's how the B players end up hiring and retaining the C players. That's why if there's a dope at the top, eventually there will be dopes all the way down.

Theory X vs. Theory Y

The other aspect I want to talk about in leadership is leadership's view of what motivates the individual employees. Back in 1960, Dr. Douglas McGregor wrote this wonderful book called, "The Human Side of Enterprise," where he posited these ideas of Theory X and Theory Y. If I'm honest, I don't really like the names of those things, but those are the words that he gave us 61 years ago. It's about, what does leadership think about what motivates employees? In the Theory X model, people are inherently lazy, or this is leadership's impression. People are inherently lazy, they inherently avoid responsibility, and they require extrinsic motivation. As a consequence, the leadership approach, if you believe this, would be micromanagement and lots of pathological and bureaucratic approaches to organizational culture. Whereas Theory Y, where most effective organizations and effective leaders are, is the idea that people are actually intrinsically motivated. They fundamentally want to do a good job, they want to take ownership of things, and they actually want to perform well. The leaders that have this belief about the employees are much more likely to empower them, give them capabilities, train them. It's a much nicer and, frankly, a more effective organization. As the Accelerate and State of DevOps report research has shown, that Theory Y approach tends to yield happier teams and better results.

Maybe people have heard this Hanlon's Razor, which is, "Never ascribe to malice, what can be adequately explained by incompetence." I've thought about this a lot. I think there's another thing to say here, which is, never ascribe to incompetence, what can adequately be explained by perverse incentives. Again, this is where the Theory X and Theory Y come in. Because if you're in a Theory X mode, you're incented to avoid responsibility, and hand it off to somebody else and shift blame, whereas when you're in a Theory Y organization, you're incented to take on more responsibility, take ownership and lead things end-to-end.

Psychological Safety

The last set of things I want to talk about in terms of leadership is driving psychological safety. People have probably heard this term quite a lot. I can't say it enough. People maybe know that Google, being Google, has studied themselves. Project Aristotle, going on a couple of years, was trying to figure out what made good teams at Google, what explained it. It turns out, the number one thing that explained high performing teams at Google was that the team was safe for interpersonal risk taking. Which means that the individuals that are part of the team are able to bring their whole self to work without fear of negative consequences. The insight here is that if I can bring all of myself to work, and I'm not afraid anybody's going to make me feel bad about it, I can bring 100% of my ideas, rather than having 50% of my brain think about how I can keep myself safe.

Inclusive Decision Making

The other related notion is around inclusive decision making. There's this wonderful study here in 2016, which shows that every time we improve diversity among the team or among the set of people that are making decisions, we make that much better decisions. The study looked at geographic diversity, age diversity, and gender diversity. If a team had all those things, it made better business decisions 87% of the time. It made those decisions twice as fast with half the meetings, and it delivered 60% better business results. This is 100% a characteristic of good engineering leadership. The insight here, of course, is, why does that work? Why does it give us better business results? Why do inclusive decision making and psychological safety matter? It's this Japanese proverb as quoted by the Xerox PARC leader, Bob Taylor, "None of us is as smart as all of us."

We talked about services, platforms, communities, and some characteristics of leadership that we need to have to make this all work together. If your organization doesn't look like that, first, please try to change your organization, because organizations, just like people, can change over time. If that doesn't seem to work, then I have this wonderful quote from Martin Fowler, that if you can't change your organization, you can change your organization.

Choosing a Specific Platform Due to Security and Compliance

Security and compliance seem to become mandates that force tying to a specific platform without choice. Talk about that.

There are a lot of good reasons for teams and companies to adopt a common platform. One of the strongest arguments for at least a core part of the platform being common is all aspects of security, compliance, a bunch of those ilities. Actually, at eBay, where I work, that's one of the strong motivators for people to use the common set of platforms that we have. Yes, 100%. If what you're implying is that, because of security and compliance, we can only choose one platform, then there's an architectural opportunity to separate things a little bit, where at the lowest level we build the security and compliance stuff only one time, and that layer provides services or libraries to maybe one or several different platforms.

Charging at Smaller Companies

Does it make sense to charge at smaller companies? I think probably no. Why is that true, though? It's because, what are you trying to achieve with charging? We're trying to use economic incentives to encourage people to be efficient in their behavior, that's why economies work. If you are small, like if your company is the size of a family or a small town, you actually don't need strong, external impersonal economic incentives. You can do things informally. You're in a small town, tragedy of the commons, please don't graze your cow over there as often as you're doing, go somewhere else. You can have those conversations when we're below the Dunbar number.

When you're beyond that, as we absolutely were at Google, it was very hands off between the individual teams. We had to step it up a level. That was where that particular example came from. For larger organizations, you absolutely should, to other people's questions, give people budgets, or start with show-back. You can start by simply saying, selling team, here's how many resources you're using. Maybe you'll be surprised, but I'm usually surprised how motivating that is for engineers. They're like, "Are you kidding me? We're using that many widgets? I don't want to use that many widgets. That's crazy." That typically gets to the 80/20. Then there are some times where you have to actually bring the bill as in the example that I gave.
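The show-back idea is simple arithmetic: metered usage multiplied by a unit price and summed per team. The sketch below is one hedged way to do that in Python. The unit prices, resource names, and usage figures are made-up assumptions for illustration, and only the team names echo the eBay domains mentioned in the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed unit prices in dollars; real numbers would come from the provider's rate card.
UNIT_PRICES: Dict[str, float] = {
    "cpu_hours": 0.05,
    "storage_gb_months": 0.02,
}

UsageRecord = Tuple[str, str, float]  # (team, resource, quantity)


def show_back(records: Iterable[UsageRecord]) -> Dict[str, float]:
    """Sum metered usage into a dollar figure per team."""
    totals: Dict[str, float] = defaultdict(float)
    for team, resource, quantity in records:
        totals[team] += quantity * UNIT_PRICES[resource]
    return {team: round(cost, 2) for team, cost in totals.items()}


# Hypothetical usage for one billing period.
usage = [
    ("selling", "cpu_hours", 12_000),
    ("selling", "storage_gb_months", 5_000),
    ("shipping", "cpu_hours", 800),
]

# Report the most expensive teams first.
for team, cost in sorted(show_back(usage).items(), key=lambda item: -item[1]):
    print(f"{team}: ${cost:,.2f}")
```

Nothing is actually charged here. The report only makes consumption visible, which is the first step described above, before anyone has to bring a bill.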

Questions and Answers

Wells: What's the best way for you to introduce starting to ask people to pay those bills? Because I can imagine doing that, and my company, it would be resisted quite heavily to start with. How do you introduce that?

Shoup: Fully transparently, we don't do budgets at eBay today in the way that I would like, and it's something that we're going to do. It's not my highest priority. At some point, we're going to do it. At Google, in the beginning I'm sure there was nothing. Then there was charging for the raw resources. Again, think Google thinking about itself as a cloud provider, internally. If I were infrastructure, I'm charging the teams for the compute, the storage, the spindles for IOPS, memory, all that stuff, that baseline IaaS type of stuff. Then for a while, and this was the case when I was there, there wasn't a charging model on top of that for leveraging internal services. For App Engine, I had to pay for all the compute and storage resources, but the other teams that I was serving did not have to pay me for using App Engine internally. Externally, of course they did, because that was the whole goal of App Engine.

How do you introduce it? I think you can start small by show-back. Like, here's what you're using. I'll just say openly, this is a lot easier in 2021 than it ever used to be. Anybody that's using a third party cloud provider, they already give you a bill, and they can already break it down by little sub-accounts that you have. That's one of the first things that they started doing. They're world class at metering. That's the easiest way to do it. It's like, "Here's your cloud bill for the XYZ service." That's a great way to start.

Wells: We've had some success with a Slack bot that just says it looks like you aren't using these resources.

Shoup: Highlighting stuff that's on but not being used. That's obvious. Yes, totally. Like, the utilization of this is 1%, is that what you want?

Wells: Are you still using this? Could you downsize it? I think that's quite interesting.
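A bot like the one described can be sketched in a few lines. This Python example assumes utilization figures are already collected somewhere. The `Resource` type, the 5% threshold, and the sample fleet are hypothetical, and actually posting the message to Slack is deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Resource:
    name: str
    owner: str
    avg_utilization: float  # fraction between 0.0 and 1.0 over some window


def find_idle(resources: List[Resource], threshold: float = 0.05) -> List[Resource]:
    """Return resources whose average utilization is below the threshold."""
    return [r for r in resources if r.avg_utilization < threshold]


def nudge(resource: Resource) -> str:
    """Format the reminder; posting it to a chat channel is left out of this sketch."""
    return (
        f"@{resource.owner}: {resource.name} is at "
        f"{resource.avg_utilization:.0%} utilization. Are you still using this?"
    )


# Hypothetical fleet with one nearly idle resource.
fleet = [
    Resource("search-index-staging", "search", 0.01),
    Resource("checkout-api", "selling", 0.62),
]

for idle in find_idle(fleet):
    print(nudge(idle))
```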

Jeffrey asked a question about security and compliance being a mandate that forced you to be on particular platforms, not to go off the paved road. There is a related thing for me, which is, how do you decide? Are there things that people can't go to an external provider for? Is it because of things like security and compliance?

Shoup: As with every good question, it depends. It's about the outcomes. I'm going to hand-wave a little bit, but not so much. This is a real example. eBay builds most of our stuff in Java and some in Node. This is not that case, but one could imagine somebody comes and says, "I want to build this service in Haskell, because it's going to make everything so much easier for me for reasons." My answer would be, I am supportive of that, and you must be this high. In other words, it needs to work with the deployment tooling. It needs to integrate with the monitoring infrastructure. You need to meet all the security requirements, like not having vulnerabilities. An answer could be and maybe should be, at a high level, you need to meet these outcomes. Here are the goals: you'd have to not have these risks. You'd have to mitigate them. Rather than saying, no, the only way that it's possible to do it is by using the Java libraries. It's also legit for me to say, that's a little too high a bar, so let's try to do it this other way.

Again, just to take one more step in my particular example. In the world of 2021 this is so much easier because I could do a bunch of this stuff as a sidecar. In the not too long ago world, it was Haskell or Java, that was like this wall. Where like now, I could build some of this stuff that integrates with various things in a Java sidecar. It's not a binary question. I'm ok with it, as long as you meet these things, and I'm happy to talk with you about how to do it. There's a bunch of different ways we could make that happen.
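One way to picture "you must be this high" is as a check on outcomes rather than on language choice. The Python sketch below is purely illustrative. The `ServiceProfile` fields and the three outcomes are assumptions loosely based on the examples named in the answer (deployment tooling, monitoring, security), not an actual eBay checklist.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceProfile:
    name: str
    language: str
    deploys_via_standard_pipeline: bool
    emits_standard_metrics: bool
    open_critical_vulnerabilities: int


def unmet_outcomes(service: ServiceProfile) -> List[str]:
    """List the outcomes a service still has to meet; language is deliberately not checked."""
    gaps = []
    if not service.deploys_via_standard_pipeline:
        gaps.append("works with the deployment tooling")
    if not service.emits_standard_metrics:
        gaps.append("integrates with the monitoring infrastructure")
    if service.open_critical_vulnerabilities > 0:
        gaps.append("has no open critical vulnerabilities")
    return gaps


# Hypothetical off-road service: deploys fine, but monitoring is not wired up yet.
candidate = ServiceProfile(
    name="pricing-rules",
    language="Haskell",
    deploys_via_standard_pipeline=True,
    emits_standard_metrics=False,
    open_critical_vulnerabilities=0,
)

gaps = unmet_outcomes(candidate)
print("cleared the bar" if not gaps else f"still needed: {', '.join(gaps)}")
```

How each outcome gets met is left open, so a sidecar that handles monitoring integration would satisfy the same check as a native library.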

Wells: Michael has just asked exactly the question I was going to ask as the follow-up question, which is the maintainability. Specifically for us, we find we have a team who make a decision that they're going to start using Scala. Then a year and a half later, the product they built has been moved as part of a reorganization. Now you've got a team where this is not their language, and they're paying for the decision that someone else made.

Shoup: That's why, every time the broader 'we' decide to go off the paved road, it's a big deal. At eBay scale, I don't want to say no. At a small company, I do want to say no. Because there's only so much time and we've got probably three more months of our series A funding. That's not the right time to be playing around. For companies that have a longer time horizon, and a much broader set of developer needs, I think it is legitimate and healthy for us to encourage those things. You must be this high. Everything that you and Michael say is 100% a tradeoff. That is the tradeoff. You have to go in with open eyes. You can't tell me, I'm going to choose Haskell because it saves me and my team 20% of effort. Sorry, not good enough. It has to be like, this basically takes a thing we couldn't even possibly consider and makes it possible now. Then, it's like, ok, and maybe at some level that team is implicitly deciding, yes, and I'm going to support connecting with the monitoring infrastructure and making sure it's secure, and keeping up with the updates in the Haskell language. All those things are like, yes, that's part of what you're signing up to do when you make this choice.

I don't want to be too hand-wavy about it, but in organizations beyond a certain size, I actually don't want to say no. I want to say let's talk about it, and open our eyes to all the tradeoffs that we're making, implicit and otherwise, and make the right decision. Google, they're pretty big. It's one end of the spectrum. At least when I was there, they had four separate language environments. There was Java, C++, Python, and Go, and some sidecars, because not everything is written in the language you happen to want to use for your thing. Actually, even there, there were a bunch of cases, a nonzero number of cases, where the client library was only written in Java. It has this implication where like, I better build my thing in Java, and so on.

Recorded at:

Dec 22, 2021

Randy Shoup
