[Note: please be advised that this transcript contains strong language]
Transcript
Shoup: Welcome to the Architectures Panel. We're fortunate to have all of the speakers in the architectures track up here. I'll quickly introduce each one and then we can get to your questions, and in the surprising case that nobody has any questions to ask I have some of my own, but hopefully we'll get your questions answered instead of mine. Not in the order in which they spoke, but Rob Zuber is CTO – not co-founder I'm told – CTO of CircleCI, and gave a talk about the evolutionary architecture of CircleCI. Anvita Pandit is a software engineer at Google who talked to us about how they made Google's key management system scalable and reliable.
Ben Sigelman is CEO and co-founder of LightStep, and talked to us about deep architectures, how we scale things both widely but also deeply, and what things we might be able to do on the observability side to make that tenable. Justin Ryan from Netflix talked about scaling patterns that Netflix uses on their edge services. Then, finally, Thierry Cruanes from Snowflake talked about how to build Snowflake's data warehouse to take advantage of all the new things that we have in the cloud. I can start by opening up for questions.
Protobufs in Large Architectures
Participant 1: We're currently evaluating potentially using gRPC, or HTTP/2, those types of technologies. Do you have any experience or lessons learned with protobufs in larger architectures, that sort of thing?
Pandit: I work with protobufs all the time. They're really convenient. Definitely they were built to be backwards compatible. Don't break that. That's my advice.
Ryan: We recently moved from a REST infrastructure to gRPC at Netflix. That's the new, cool thing to do, and it's been a huge win. There are productivity gains just around how you define that schema. The protobuf does that for us. It's been really great, and then there's generating the clients. We had a lot of people who were in the habit of making heavyweight clients because it's, "It's REST. There's no way to do it, so I'm going to make my own," and they got really big. gRPC really let us say, "No, that's it. There's one generated client, that's all you use. It's nice and thin."
I think we've just seen great growth actually. I think there were a lot of problems we saw engineers working around because they were so used to our old system, NAWS, which is open-sourced as Ribbon. Then, when we showed them gRPC they were, "All those problems went away." We were, "Yes, that's why we've been telling you for months to move towards it." Not a direct answer, but I would say take the leap. It really did make a big difference for us.
Zuber: I'm going to take a bit of a counterpoint. I think it really depends on your size, for one. You just got two answers from organizations, and the thing that I would say is, you have to invest in tooling to make those kinds of choices successful. It's probably true on both sides, but I'm going to project that you're hearing from two organizations that have entire teams that manage tooling, and we run at a much smaller size. We chose to roll out gRPC and we had two problems: one, not investing in that tooling. Then, specifically, we're a Clojure shop, and there's a big impedance mismatch between how people think about Clojure and how they think about gRPC, or protobufs in particular. We've done some batshit insane things to try to make it feel like Clojure, and in the end, we created a bunch of overhead for ourselves. So if you're going to make a decision like that, just make sure you have the tooling and people really know how to use the tooling to then get the efficiency and the gains.
Sigelman: We have some cautionary tales from gRPC. I'm a Google DNA person for better or worse, so I was pretty excited about it in the early days. I definitely have some burnt fingers, mainly around surface area that's more esoteric for Google. Google doesn't use Clojure, so lo and behold it doesn't work very well in Clojure. Also, the mobile story was very appealing to us but it ended up being a problem. The binary size for adding gRPC to iOS, for instance: we were using it for clients of LightStep, and for a lot of our customers it was an immediate dealbreaker. It increased their binary size by 6 megabytes just to have it in there as a dependency, which was unacceptable to them, for instance.
We definitely have had some burnt fingers. The other thing I would just say briefly is that Google has a very thick client culture, which I think has a lot of huge benefits, but in some situations, and especially with very thin surfaces, impedance mismatch feels like the right term. So I think it depends on how long-lived and how large your processes are, to a certain extent.
Scaling Microservices
Participant 2: I have a question related to microservices. At the earliest stage, nearly all the companies here – Netflix, CircleCI – had a giant monolithic application, and then you got more users, so one big monolithic application couldn't support your customers. What was your experience splitting features into microservices and then scaling them?
Ryan: You did mention Netflix, I guess I should chime in there. It actually goes back to the thick clients. A lot of the migrations that we've made were possible because we already had a thick client library, and we were hiding that transition behind the scenes in that library. What went from a local call to a remote call was invisible to a lot of users. That was the big reason why we could make the changes over time. As far as how we scale it, for us, it was just that you run more of them. Generally, if we're focused on being stateless, and we can horizontally scale, that really got rid of a lot of the problems.
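A minimal sketch of that thick-client pattern, with hypothetical names: the library exposes one interface and decides for itself whether a call stays in-process or crosses the network, so callers never notice the migration.

```python
from abc import ABC, abstractmethod

class RecommendationsClient(ABC):
    """The one interface every caller codes against."""
    @abstractmethod
    def top_picks(self, user_id: str) -> list: ...

class LocalRecommendations(RecommendationsClient):
    """Original in-process implementation inside the monolith."""
    def top_picks(self, user_id):
        return ["local-pick-for-" + user_id]

class RemoteRecommendations(RecommendationsClient):
    """Later implementation that calls the extracted service."""
    def __init__(self, endpoint):
        self.endpoint = endpoint
    def top_picks(self, user_id):
        # A real client would make an RPC/HTTP call to self.endpoint here.
        return ["remote-pick-for-" + user_id]

def make_client(use_remote: bool) -> RecommendationsClient:
    # The library, not the caller, decides local versus remote.
    return RemoteRecommendations("recs.internal:8080") if use_remote else LocalRecommendations()

print(make_client(use_remote=False).top_picks("u42"))
print(make_client(use_remote=True).top_picks("u42"))
```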
Cruanes: Yes, I would say we're coming from a very different background, where a database system is 20 million lines of code in the same process, something like that, and so just splitting this into different layers and scaling those was already a big win, but each of our services is not small, not micro. I think the debate is not about microservices versus not microservices, it's a little bit about stateless versus stateful, and the hard part is the state: how do you scale the different aspects of your system? You have to be very cognizant of that and very precise about where the state is maintained and scaled.
Zuber: I think you were asking about coming from a monolith, where you start to realize in the early days that you have scaling problems and things like that. We did a couple things that were great and a couple things that didn't turn out so great. One is what I retroactively refer to as roles, because I learned that this is how some other application – one I would never model ourselves after – is deployed, but it worked. This is basically taking a monolith and starting it up in different configurations. We always had more than one single instance of our monolith, but we got to places where, in order to meet our load, you had to scale to a certain number of instances. But if you scaled other parts of the work that the monolith was doing to that level, it actually took down the system instead of increasing the capacity of the system.
We would deploy different versions of our monolith by basically passing environment parameters to say, "You run as this, you run as this." Then, we got the differentiated deployment model from an operational perspective before we had to do the work of splitting up microservices. Because I can tell you, when your site is down for 48 hours, saying, "Let's re-architect everything as microservices," doesn't feel like the solution you're looking for. That also gave us the opportunity to learn the domains of our system without having to do all the work of transitioning.
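A minimal sketch of that "roles" idea, assuming hypothetical role names: one monolith artifact, started with different environment parameters so each instance group runs only part of the workload and can be scaled independently.

```python
import os

def serve_web():        print("serving HTTP requests")
def run_workers():      print("processing background jobs")
def run_schedulers():   print("running scheduled tasks")

# One deployable monolith; an environment parameter picks this instance's role,
# so each part of the workload can be scaled (or isolated) independently.
ROLES = {
    "web":       [serve_web],
    "worker":    [run_workers],
    "scheduler": [run_schedulers],
    "all":       [serve_web, run_workers, run_schedulers],  # the original deploy
}

if __name__ == "__main__":
    for task in ROLES[os.environ.get("ROLE", "all")]:
        task()
```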
Then, when we started to transition, the thing that I would say, and this is the answer to every question, we didn't invest enough in tooling and patterns for how we should build microservices, and then we had to go back and undo a bunch of that, so that's more organizational scaling, like how you can make that transition. I would pick one and make sure you really understand this is how we're going to build them, these are the tools we're going to use, etc. Again, I'm very different from Netflix and Google. I'm pretty sure you don't have to build a lot of infrastructure to deploy a service, it's just all there, whereas every time we go to deploy a service it's "We don't have this thing. We don't have that thing."
Sigelman: The only thing I'll add is, there are smaller companies, and then Netflix and Google, but there's also giant enterprises that aren't Netflix or Google and have many businesses with many independent monoliths. In my conversations with those folks, I think one of the largest challenges has nothing to do with software but it's much more about just the way that their engineering or development organizations work.
They probably call it an IT organization, and it's not an IT organization anymore when you make that transition, and making software development and operations integrated practices is a massive re-training effort. I think the leadership in those companies often struggles as much with hiring and re-training their development teams as with the technology piece. They're both hard, but I think that's almost the harder part, or at least a very limiting factor, which doesn't happen at Netflix or Google.
Reliability and Latency in External Calls
Participant 3: Can you tell us about any experience you have with making external calls to other services? Do any of your software applications call things that are owned by other people? Can you talk about the reliability and latency?
Ryan: We do do it. There are a lot of payment processors or other billing partners that we have to call externally. It's exactly what you called out: there's just more latency, there are more failures, there are more retries. I don't think we ever came up with anything special. What we ended up doing, though, was when we enter a billing relationship with somebody, we really push on them to use a push model to us instead of us pulling from them. We usually found us pulling from them didn't work really well, so we try to turn it around.
We actually worked with the business development side of things to say, "There's this other model where they call us, or they post the data somewhere else, or we don't have to call them," and if we could get that into the contract, get it into the deal, we had more success with it. When they're slow, they're just slow. You just have to try it over again.
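A minimal sketch of the defensive client side of those external calls, with a hypothetical charge_partner standing in for the real processor: bounded retries with exponential backoff and jitter.

```python
import random
import time

class TransientError(Exception):
    pass

_attempts = {"n": 0}

def charge_partner(payment_id: str) -> str:
    """Stand-in for the real external call; fails twice, then succeeds."""
    _attempts["n"] += 1
    if _attempts["n"] < 3:
        raise TransientError("payment processor timed out")
    return "charged:" + payment_id

def call_with_retries(payment_id: str, max_attempts: int = 5) -> str:
    for attempt in range(max_attempts):
        try:
            return charge_partner(payment_id)
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of retries; let a reconciliation job sort it out later
            # Exponential backoff with jitter so retries don't hammer the partner.
            time.sleep((2 ** attempt) * 0.1 + random.random() * 0.1)
    raise RuntimeError("unreachable")

print(call_with_retries("pay-123"))
```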
Zuber: CircleCI is a house of cards. We talk to every vendor on the internet, I think, in order to provide our service. We can't do anything without upstream version control. Basically nothing happens without that. In my talk earlier I showed a chart of what happens when GitHub is down for a few hours and then floods us with every hook that's happened in that time. I have tons of horror stories, but the thing that I would pull out as most interesting to me is how you think about your data model.
I don't know if you're familiar with domain-driven design, but there's the concept of an anti-corruption layer: basically, make sure that everything that provider thinks about how their data should be modeled stays at the very boundary of your system, and translate it into your language as soon as it comes inside. If you let it pollute your internal language, which is what happens when you're building a system really quickly and trying to get to market, and all those things that we do as startups, you'll suffer for it for a really long time. Just make sure you have a really clear boundary and understanding. It's less about latency and reliability and more about how you build, especially when you have to switch later and find another payment provider, something like that. If you internalize someone else's language it just makes it a lot harder to work with.
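A minimal sketch of such an anti-corruption layer, with made-up field names: the provider-shaped payload is translated into an internal type right at the boundary, so the provider's vocabulary never leaks inward.

```python
from dataclasses import dataclass

@dataclass
class Payment:
    """Internal domain model: our names, our units, our types."""
    amount_cents: int
    currency: str
    succeeded: bool

def from_provider(raw: dict) -> Payment:
    """Anti-corruption layer: everything provider-specific stops here."""
    return Payment(
        amount_cents=round(float(raw["amt"]) * 100),  # provider sends dollars as a string
        currency=raw.get("cur", "usd").upper(),
        succeeded=raw["status"] == "OK",
    )

# The provider-shaped dict never travels past the boundary.
print(from_provider({"amt": "12.50", "cur": "usd", "status": "OK"}))
```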
Pandit: We have a couple of dependencies which aren't exactly external. Our service doesn't call any external services, but since the company is so large it's almost like they're external. They're different teams, managed in other parts of the world. One of our most important goals is availability of the service, so if we find that some of our dependencies aren't available, we'll enter a lame-duck mode where we try to serve with whatever we have and wait for that dependency to become available.
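A minimal sketch of lame-duck behavior, under assumptions of mine rather than anything Google-specific: when the dependency is down, serve last-known-good values instead of failing outright.

```python
class KeyClient:
    """Wraps a dependency; degrades to last-known-good values when it's down."""
    def __init__(self, fetch):
        self._fetch = fetch      # callable that hits the dependency
        self._cache = {}         # last known good values
        self.lame_duck = False

    def get(self, name: str) -> str:
        try:
            value = self._fetch(name)
            self._cache[name] = value
            self.lame_duck = False
            return value
        except ConnectionError:
            self.lame_duck = True        # dependency down: serve what we have
            if name in self._cache:
                return self._cache[name]
            raise                        # nothing cached; we genuinely can't serve

def make_fetch():
    state = {"down": False, "data": {"k1": "key-material"}}
    def fetch(name):
        if state["down"]:
            raise ConnectionError("dependency unavailable")
        return state["data"][name]
    return fetch, state

fetch, state = make_fetch()
client = KeyClient(fetch)
print(client.get("k1"), client.lame_duck)   # healthy: fetched fresh
state["down"] = True
print(client.get("k1"), client.lame_duck)   # lame duck: served from cache
```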
When to Convert a Monolithic Architecture to Microservices
Participant 4: Coming back to microservices versus monolithic architecture, in your experience, what considerations determine the right point to convert a monolith to microservices? Currently my company is monolithic and we can scale while being on the cloud. We could still do a 10x scaling with some code optimization, some basic optimizations, and obviously through infrastructure, but at what point do we make the decision? For a smaller company, a smaller team, it is also a lot of overhead, so I just wanted to get some insights from your experience on that.
Sigelman: It's interesting that you went straight to scale. In terms of throughput or latency, I don't know what the 10x is for, but I don't usually think of that as the primary driver. I think it's typically driven more by a need to release features and new capabilities faster, and that if you have more than 20 or so developers working on a single service, whether it's a monolith or not, things start breaking; the model doesn't work anymore.
I would say once you have many dozens of developers it's time to start thinking about a path to having higher release velocity and more independence for the developers. Then, the side effects in terms of the system are not all good. There's a lot of visibility problems that emerge when you introduce services, so I don't want to make it seem like it's not better, but at least you can have your developers in a faster cadence. That's my two cents.
Ryan: It's a tech-debt argument. Every day that you push that decision off, the system will be harder and harder to use, and you're wasting time every time you add a feature. Contrary to popular belief, Netflix does have some monoliths, and I have worked on them, and we have hit that inflection point where we said, "Ok, the cost of us adding features is too great, and so we have to re-factor. We have to tell the business that we're going to go away for six months and re-write it." But we have another system where we don't add many features. Yes, it takes a month to get a feature in, but that's an acceptable level, and we'll just keep that system around for a very long time. It's just a matter of doing the math on how long you're willing to take to get a feature out, and how much it's worth to the business. You have to work it out.
Zuber: I'm a huge fan of cost modeling these things, especially at a smaller company. We're a VC-funded company, and I actually put together a model last week – I just made it on the weekend to present in a talk at a different conference. If you're in that series B, C, D timeframe – you indicated that you're smaller, I don't know if that's right – you think about 18-month-ish time horizons between rounds of funding, and then you talk about pulling out 6 or 12 months to go re-write something. If you model out the cost impact of that, even if you make up all of the feature development on the back side – in the model that I built, which assumed very low demand for delivery of value coming from the product team, probably underrated – you were still down a couple million dollars in cash by the time you got to the end of that.
Ask yourself: out of the revenue of your company, is losing a couple million dollars in cash to get a better deployment the right thing right now? It took five minutes to build the model. Build it for yourself, don't take my model, but I think understanding how much time we're losing and what we could be delivering in that time, even if eventually we will deliver faster, is really important. The other thing is, Ben [Sigelman] was talking about it both in his talk and just now: all the things that you have to add to deliver microservices successfully. It's not like you just start dumping in services. You're just moving the complexity somewhere else, and then you have to invest in the tooling, and the training, and the skills to be able to operate like that. Just make sure you include that in how much time you're going to spend to get that first one out.
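A five-minute version of that kind of model might look like this; every number is invented, so plug in your own.

```python
# Back-of-envelope rewrite model; all numbers here are made up – plug in your own.
rewrite_months  = 9          # feature work displaced by the re-write
value_per_month = 250_000    # dollar value of the feature work you aren't shipping
speedup         = 1.25       # how much faster you deliver after the re-write

lost_value   = rewrite_months * value_per_month   # what the re-write costs you up front
monthly_gain = value_per_month * (speedup - 1)    # what it earns you each month after

print(f"value forgone during the re-write: ${lost_value:,}")
print(f"months to break even afterwards:   {lost_value / monthly_gain:.0f}")
```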
When Should Event-Driven Architectures Be Used?
Participant 5: I have a similar question, more in regard to event-driven architecture. There are a lot of talks about it, and theoretically there are all these perfect, wonderful advantages to it. Then, obviously, there's also the operational complexity, and the complexity of reasoning about the ambiguities in an event-driven system. Just a very broad question: when should event-driven architectures be used? Any kind of guidance or perspectives on that?
Ryan: I'm not sure what event-driven architecture would specifically be. I will say, where we're actually seeing a trend is when people are trying to write batch systems, or systems that were event based, and they try to write them as a normal service: there's an impedance mismatch and it feels a little awkward. We're seeing what could be a benefit of serverless architectures, where something else is going to manage spinning things up and spinning them down, and consuming events.
When things are event driven, tying them to a serverless framework that knows how to tie those events to actual executions – I see some promise there, versus the problem of trying to write it as an instance. If request-response is what you're used to writing, and you're "We're just going to listen to a Kafka queue," I've felt that be a little awkward. I'm looking for something better in that space.
Cruanes: I like [inaudible] a lot, because in database systems we have these things called active databases, which are workflows with events being emitted. It creates good decoupling, but then you cannot understand your system anymore. It's very hard to understand end-to-end. It's a little bit the assembly language of workflow.
You might want to have, at least in [inaudible 00:20:06] workflow [inaudible 00:20:07], a higher-level construct that you can push into the backend [inaudible 00:20:13] these different things, instead of relying on events. Maybe I'm talking about something different than what you meant by event-driven architecture.
Zuber: I'm assuming you're asking about a transition, like you have a system and you're thinking about moving to events. I have two answers that just apply to everything. Do something incremental. Find a way to experiment, even if you have to run something in parallel: "We're going to run a small number of customers through this other system," and be prepared to throw it out. The problem with experiments that we run is we're "Yes, that doesn't work that well, but it's serving this one customer over here so we're going to leave it." Now I have this net increase in my complexity until I remove everything else. But the right time is when you really need it, minus the lead time.
When you really need it minus the lead time is very difficult to guess. It totally depends on your business, but I would definitely say find a place where you can try something that's maybe of low consequence, and learn from it, and understand where the problems are going to be in your organization. Rather than, "Cool, we're going all in on event-driven architecture and we're going to re-write everything right now."
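A minimal sketch of that kind of incremental experiment, with hypothetical names: route a small, deterministic slice of customers through the new path, so the experiment is easy to measure and easy to delete.

```python
import hashlib

def in_experiment(customer_id: str, percent: int) -> bool:
    """Deterministic bucketing: a customer always gets the same answer."""
    return hashlib.sha256(customer_id.encode()).digest()[0] % 100 < percent

def handle(customer_id: str) -> str:
    if in_experiment(customer_id, percent=5):   # 5% of customers take the new path
        return customer_id + ": new event-driven pipeline"
    return customer_id + ": existing synchronous pipeline"

for cid in ("cust-1", "cust-2", "cust-3", "cust-4"):
    print(handle(cid))
```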
Shoup: Now I'm going to step in. Since there were two votes against it and one partial for, I feel like I need to do a strong for. I'm a big fan of event-driven architectures, but again, let's start from asking the question, "What problem are you trying to solve?" First order, actually, for both when should I do microservices and when should I do events: what's the biggest problem that I have in my development team? Whatever solves that, whether it's microservices or events, that's fine, but if it isn't, then go solve that other thing.
Just start from, what's the most important problem to solve? Microservices are a tool in your toolbox, event-driven stuff is a tool in your toolbox. On the event-driven question, assuming that that's something that you want to try for reasons, there are some great parts of the system that you probably already have that lend themselves very naturally to events, and there are ones that are more challenging.
For example, common things are like "I need to send emails." You send an event and have that other thing send emails, do that asynchronously as opposed to synchronously right in the request flow, for example. That's a super obvious thing. Interactions with payment systems and other third-party systems tend to lend themselves pretty naturally to event-driven and asynchronous approaches. To the point behind your question, there's not one right architecture that works for everything. It's just, I have a bunch of different tools, and find places where the screwdriver works, versus the hammer, versus the saw.
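A minimal sketch of that email example, with an in-process queue standing in for whatever event bus you would actually use: the request handler only records that an email is wanted, and a separate consumer does the slow work off the critical path.

```python
import queue
import threading

events = queue.Queue()

def handle_signup(user_email: str) -> str:
    # Critical path: record that an email is wanted, return immediately.
    events.put({"type": "send_welcome_email", "to": user_email})
    return "signup complete"

def email_worker():
    # Off the critical path: slowness or failure here doesn't block requests.
    while True:
        event = events.get()
        if event is None:
            break
        print("sending email to", event["to"])

worker = threading.Thread(target=email_worker)
worker.start()
print(handle_signup("user@example.com"))
events.put(None)   # stop the worker for this demo
worker.join()
```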
Splitting a Big Monolith Into Microservices
Participant 6: I'm actually glad that question was asked and to hear that Randy's [Shoup] an expert, because I have a question in the same area. My context is, we have a big monolith, and we need to split it up, and we're going with microservices. We've got a blueprint for how we want to do them, and based on a lot of the reading about the best way to keep the microservices decoupled, and to reduce runtime failures based on downstream dependencies on other microservices, we've drunk the Kool-Aid about making them communicate asynchronously as much as possible by publishing events or sharing snapshots of any shared state that they might need to know about.
Based on a lot of reading that I've done, that seems to be an in-vogue way to do it, for some of those reasons that I mentioned: decoupling, resilience at runtime. I just want to get your feedback: have you actually built systems at a larger scale that operate that way? Looking at the Netflix talk we had earlier this morning, and Ben's [Sigelman] observability talk, it seemed like most of that looks like synchronous interactions between components. That's why I'm wondering, how much have you actually tried to push the asynchronous model?
Zuber: I'm going to reference domain-driven design again just for a second. We haven't done a lot of event-driven architecture work, but we have a domain that lends itself really well to that. We accept events from an outside system, and then we drive a bunch of events, manage a little bit of state around that, and then deliver a bunch of events. That's literally the product that CircleCI offers, so it's easy for me to reason about what that would look like.
In fact, within domain-driven design – not in the original blue book, but something someone wrote after – there's this concept of event storming, which is basically sitting down and looking at how your whole system works in terms of events. If you went through an event-storming process and found there were maybe one or two events in your system, then to Randy's point, I'd probably argue that that's not the problem you're solving for, and therefore doing everything in that way may not make sense.
At the other end of the spectrum is our domain, which is really all event driven; we've just modeled it very synchronously in terms of how it works. Someone else asked earlier about the ability to reason about it as well. Yes, there's always the beautiful end state that you see in books and other writings, but you really have to think about how your domain is structured, the problems that you're trying to solve, and see how it maps to that.
Sigelman: I'll try to be more controversial. I think we're all being too even-handed. I don't really mean everything I'm about to say, but here I go. I'm pretty negative on defaulting to event-driven architectures, mainly for latency-sensitive systems. Maybe I'm biased toward seeing latency-sensitive systems, because that's usually what my company interacts with. Again, if it's not latency sensitive, that's different, but for latency-sensitive workloads, maybe it can be made to work, but you're introducing this pretty high-variance thing between every single communicating process, and I just think it's a risk not worth taking. It's just so much easier to understand nested request-response pairs, to my mind.
It's obviously a really bad idea to have those nested requests doing blocking, like calls to SaaS services or something on the critical path. Don't be a dummy about it, but I think if you're referencing local resources, including making [inaudible 00:27:46] calls and stuff, that's all fair game. Then, it's just literally easier to understand. I think ease of understanding is a significant barrier to velocity, and it may just be a tooling thing but I don't know how we're going to make our tools get around the fundamental multi-tenancy issues in Kafka as it stands today, which is usually what people are actually talking about when they say this. I'm a naysayer for latency-sensitive applications. Maybe that'll get someone mad. If someone's mad, I'm sorry, nothing personal.
Orchestration Versus Choreography
Participant 7: Continuing the theme of event-driven architecture and the general hype around just more message-based communication in general, I've seen two main styles presented when people talk about this topic. I've heard them referred to as orchestration versus choreography, orchestration being something that actually has complete knowledge of the whole process end-to-end and has a long-running state where it knows exactly where messages are being directed to and read from. Then, choreography being the more event-driven side of things where you just fire events off into the universe and business processes happen almost by accident. Do you have any guidance for when you would use one style over the other?
Sigelman: The second one makes me really nervous. The whole idea of, I don't know what's happening, I'll just start doing stuff.
Zuber: It was a slightly biasing form of the question, to be totally honest. Again, we've only looked at this in some cases, as we've looked at places we would apply it, so I'm speaking theoretically. But on the orchestration side, we do have a component in our system that we call a conductor – the workflow conductor. It's really a state machine. This is the actual business process: if you run a workflow, then we're firing off, "This thing needs to happen, then this thing needs to happen," and then we get back events, effectively, that say, "This passed. This failed." Therefore, the next thing that can happen should happen.
I can't actually speak to why we chose that, it just made sense to the people working on it at the time. The complete picture of that state is really important to our customers, being able to understand what is the state of this entire set of whatever, of this graph, and all of its dependencies at any given time. That made sense in terms of how to model that. Again, I'm going to always say, it could be different in your domain, it could be different in terms of what you're trying to achieve.
Ryan: We also have a tool called Conductor, which is a workflow engine, funny that. It orchestrates. It has the top-level knowledge of thing started, thing ended. For traceability it's just the easiest thing to do. In the end their users need to know if it finished. We have a central UI, and usually one team can focus on the orchestrator and the visualizations. That's all coupled together, and that's been successful. I'd say until it falls over, we will probably stick with it.
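A minimal sketch of that orchestration style, with hypothetical step names: the conductor owns the whole graph's state, fires the next step, and advances on pass/fail events, so there is always one place to ask where a workflow stands.

```python
class Conductor:
    """Orchestrator: owns the workflow's state and decides what runs next."""
    def __init__(self, steps):
        self.steps = steps                          # ordered step names
        self.state = {s: "pending" for s in steps}  # one place to ask "where are we?"

    def next_step(self):
        return next((s for s in self.steps if self.state[s] == "pending"), None)

    def on_event(self, step, outcome):
        """Workers report back 'passed' or 'failed'; the conductor advances."""
        self.state[step] = outcome
        return self.next_step() if outcome == "passed" else None

wf = Conductor(["checkout", "build", "test", "deploy"])
step = wf.next_step()
while step:
    outcome = "failed" if step == "deploy" else "passed"  # pretend deploy fails
    print(step, "->", outcome)
    step = wf.on_event(step, outcome)
print("workflow state:", wf.state)
```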
Are We Moving Away From REST?
Participant 8: This is my quick two cents on the event-driven thing: it depends on the application. For example, in a networking application, the state of a router is something that changes very quickly. Obviously, everything is geared towards the customer or user. What does the user need? The user is sitting in front of the browser and something should change immediately. That's where, I think, event driven is a necessity. You could go with a polling model, but some people don't like it: it adds artificial latency, and you don't want to poll everyone, so event driven has to be there.
I have another question. I'm really not sure whether this applies here or not, but I'll just ask it anyway. We hear a lot about the REST guidelines, and we have the front end and the back end. REST meets the API needs, maybe the application needs, but more and more what we are seeing in the front end is, are we going towards something like a GraphQL model? Because the needs keep changing, we have to add a lot of filtering to the REST APIs, because not all the data is needed, or some is needed, or more is needed. Then, it becomes very complex to even design the schema. How do we design the schema? We don't know what the user is going to ask at a later point in time. My short question is, are we moving towards something non-REST going forward? How do we scale to these requirements?
Zuber: I'm going to take a stab at this one because I've failed at everything that anyone can ask a question about, so this is on the list. We've dabbled in GraphQL internally. We didn't really publish it or anything like that, but we were doing some transition in our front end. I would say API design and API management is an organizational problem more than it is a technical problem. Once you have a whole broken-out set of services – I would love to hear how other people manage this – you hear about these API gateways and API services on the front end. At least in REST you get subpaths. With GraphQL you centralize everything, and then it's likely to be everyone working on a single code base, or some team that's responsible and is the bottleneck for everyone else: "I need to ship this feature but I need someone to implement this." Then, the schema has to be consistent. Your REST approach should be consistent too, but GraphQL is also new, so you end up investing a lot of time: "I don't know. Is this the right way to do GraphQL? I'm not sure." Plus, we're a Clojure shop, so we took a GraphQL Clojure library off the shelf. It was riddled with bugs, totally tanked in production, so that was awesome. Yes, I guess we're going this way. I see people doing it, but what we found was that trying to do GraphQL strained the organizational challenges we already had around APIs, and so we've backed down from it.
Ryan: Netflix has a little bit of history with this. We have a library called Falcor, which is similar in nature to GraphQL. For us, there is a technical aspect to it: we have low-end devices, or they're on a slow network. You can imagine someone in Southeast Asia trying to load Netflix on their phone; making lots of REST requests is just not going to work for them, and so having an optimal API matters. For us, Falcor and GraphQL let the client teams be very particular about what data they need, and I don't actually have to get involved necessarily.
Letting our mid-tier services publish a schema and making it available at a generic API gateway level did take some tooling. Amazon does offer a product, an API gateway that you can publish a GraphQL schema to, I think, and then it will expose it for you, so it's not that you need a dedicated team writing those tools; they offer that as a product. To me, that technical advantage is a big piece of it. We've tried versions of it where we have Groovy scripts that will rip apart an endpoint and give you that same experience, and that just wasn't great. In the end, it was the client team that was going to ask for the data, and they knew best, and so having client teams write stuff on the server side was the most effective strategy for us. Letting them write in JavaScript, without types and stuff, worked for them with GraphQL.
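A toy sketch of the underlying idea – not real GraphQL or Falcor – in which the client names exactly the fields it wants and the server returns only those, which is what makes a single round trip workable on a slow network.

```python
# The full record the server holds for one title.
TITLE = {
    "id": "t1",
    "name": "Some Show",
    "synopsis": "a very long description...",
    "cast": ["actor-1", "actor-2"],
    "artwork_url": "https://img.example/t1.jpg",
}

def resolve(record: dict, fields: list) -> dict:
    """Return only the fields the client asked for."""
    return {f: record[f] for f in fields if f in record}

# A low-end device on a slow network asks for the minimum it needs to render.
print(resolve(TITLE, ["id", "name", "artwork_url"]))
```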
Breaking Semantics & Testing Services
Participant 9: A related question to APIs. On the importance of versioning – we can all do some syntactic checking, but how do you make sure you don't break semantics? What are your best approaches there? Also, with that, when we want to do some testing of our services, how do you go about mocking things correctly from a client perspective? Are there particular approaches that you've found work best?
Pandit: It's a big question, but I think, first of all, the service team should always create the mocks, otherwise you don't know what you're testing. The service team should maintain them, too. Especially at a big company with maybe hundreds or thousands of internal users, it can be extremely painful to start updating all instances of the mock if, for example, they've been spawned out over different teams' code bases over the years.
Ryan: I've seen some interesting stuff – I believe I saw Josh Long talk about it, in some Spring projects – where they actually generate a mock implementation that they run their tests against, and ship it as a jar and let other people test against it. It's something I've been wanting to look into more, but generally we've been lazy and we don't do that. "That's the test environment."
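A minimal sketch of a service-owned fake, with hypothetical names: the service team ships one blessed fake next to the real client, so consumers test against shared behavior instead of hand-rolled mocks.

```python
from abc import ABC, abstractmethod

class KeyServiceClient(ABC):
    @abstractmethod
    def wrap(self, plaintext: bytes) -> bytes: ...

class RealKeyServiceClient(KeyServiceClient):
    def wrap(self, plaintext: bytes) -> bytes:
        raise NotImplementedError("would call the real service over RPC")

class FakeKeyServiceClient(KeyServiceClient):
    """Shipped by the service team alongside the real client, so every
    consumer tests against one blessed fake instead of hand-rolled mocks."""
    def wrap(self, plaintext: bytes) -> bytes:
        return b"wrapped:" + plaintext

# In a consumer's unit test:
client: KeyServiceClient = FakeKeyServiceClient()
assert client.wrap(b"secret") == b"wrapped:secret"
print("consumer test passed against the service-owned fake")
```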
Performance Testing
Participant 10: I wanted to ask what you think about load testing and performance testing. Whether you're a startup and you have to scale 5x or 10x in the next year, and you're wondering, "Can our current system get there?" Or you're already running at massive scale, but you need to be sure that as you push out more changes, you're not going to break that. How do you approach load testing or performance testing? How does that factor into your development process, and maybe weigh that against [inaudible 00:38:20] in a controlled manner, so that you're observing them in production versus observing them before they get to production?
Cruanes: I don't know; performance testing for us is at the core of what we do. We have a very strict process for performance that goes from really the atomic operations on the CPU, on a single core, all the way to workload testing. I would say that for performance, the main thing to understand is the flow of data and the numbers between these different systems. If you don't know the numbers, you will not be able to predict, you will not be able to scale, you will not be able to do that, so [inaudible 00:39:13] going to show you where in your architecture you should focus.
Sigelman: I have a ton of thoughts about this, a couple of them off the bat. Performance as a word is a little funny, because I think you probably mean latency, or maybe error rates, or something, but there's also throughput. Sometimes people forget that it's usually quite possible to trade throughput and latency if you want. I had a project, I won't say what, that I was working on a long time ago with some engineers who were saying it was a performance problem that basically couldn't be solved. They just hadn't been willing to add another 20% to the number of workers in this pool, and when they did that the problem went away really quickly. It was gross, but that's an option.
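That throughput-for-latency trade falls out of rough queueing arithmetic; a sketch with invented numbers, using a crude M/M/1-style estimate per worker:

```python
# Invented numbers: a pool of workers each able to serve 10 requests/sec,
# with 90 requests/sec arriving. Rough M/M/1-style intuition per worker.
arrival_rate = 90.0
per_worker   = 10.0

for workers in (10, 12):   # 12 is the "add another 20%" case
    utilization = arrival_rate / (workers * per_worker)
    wait = (1 / per_worker) * utilization / (1 - utilization)  # queueing delay
    print(f"{workers} workers: {utilization:.0%} busy, ~{wait * 1000:.0f} ms queueing")
```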
The other thing I was going to say is that the two types of performance testing are, "I have a new version of software and I need to test it," versus "I know there's going to be a load event in the future that's going to have a different workload for business reasons." I was talking to someone at Ticketmaster, and when Beyonce does a ticket sale that's a big deal, and it's much harder to test, because with a new version you can do blue-green deploys and run a dark copy of all your traffic and actually experience load, but it's really difficult to simulate workloads of organic traffic. Product launches and things like that can be very difficult because the workload is often hard to model with random data.
I think most performance outages that are unpreventable, or hard to prevent, have to do with organic spikes in load. In those cases, I would just encourage people to be really pessimistic and creative about creating the absolute worst possible workload, and at least understand where your system fails so you know that ahead of time. Random data is terrible for that, I'll say that off the bat.
Ryan: I'll just second that. There are two aspects. One is the rolling out of new software in small, incremental changes, so we do include some performance metrics that we choose. It could be latency, could be [inaudible 00:41:06]. You could very easily have a 0.1%, 0.1%, 0.1% regression on every deploy, and you look back and you're, "There goes our performance." One way we counter that is a form of squeeze testing, which is getting real traffic onto a smaller number of instances.
We normally run in three regions. We can shut off a region, or shut off two regions and send it all to one, and that will mimic some of our highs. We do see problems, actually; this is a pretty effective exercise for us. We will hit limits, we will see the problems, and these are the problems that we would otherwise see in six months, nine months, Christmas time or whatnot, which is a big season for us. Squeeze testing – real production traffic on just fewer instances – is how we can best simulate it.
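The arithmetic behind squeeze testing, with made-up numbers: shutting off regions multiplies per-instance load, approximating a future peak.

```python
# Invented numbers: three regions sharing traffic evenly, fixed fleet per region.
total_rps = 300_000
instances_per_region = 100

for regions, label in [(3, "normal"), (2, "one region off"), (1, "squeeze test")]:
    per_instance = total_rps / (regions * instances_per_region)
    print(f"{label}: {per_instance:,.0f} rps per instance")
```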
Managing Evolution in Your Architecture
Participant 11: I wanted to bring up Conway's law, because I think if you have a centralized organization, you're going to build a distributed monolith if you try and do microservices. If you have a distributed organization you just leave them alone, they'll produce microservices anyway. The real question I have is, how do you manage evolution in your architecture? Because that comes naturally in the distributed organization, but it's a bit out of control, and in a centralized thing, then it gets too frozen. How do you think about evolving your architecture and adopting new things, and rolling that out across the organization?
Pandit: Maybe this is more like the endgame of what you stated, but after many years of development in key management there are actually multiple key management systems at Google. One of our efforts has been to try to unify them under one team, and that's been somewhat difficult as well. Our service split off from another slightly related service about eight years ago, but I wasn't around for that so I'm not sure what decisions were involved.
Cruanes: I would say that innovation has to go both ways, meaning sometimes you have a big initiative that you need to push down: we need to get there, it's a flag on the mountain, and then you need to align forces behind that. Then, you have innovation that comes from teams, bottom-up innovation that requires more local changes. Both are needed. You can't say, "I'm going to do only one or the other." I don't believe that.
Ryan: I think it's literally around willingness to change, especially to your point about Conway's law, and having smaller organizations that have some autonomy. They can manage their own evolution as they need, because they have a service boundary and they can change things inside it as they need. They just need to have a willingness to change. I think that's where I've seen success, where a team is saying, "We have that old thing. Let's throw it away and re-write it from scratch." I've been at other organizations where that's just a non-starter, even though it's actually, at least as I've seen, highly effective. Not letting the old architecture hold you back, having a willingness to re-write it.
Assuming you have tests, and you know what regressions are going to look like, that allows things to really move forward. If it's a staring game – "Are you going to push for evolution? Are you?" – no one is going to do it. I think once the skids are greased and people are, "You know what? The better thing to do is just re-write it," you'll see the evolution that needs to happen.
Sigelman: I would actually go back to evolution in the Darwinian sense – I don't mean this in a political way, hopefully. If the organization isn't establishing an environment where it's possible for mutations to occur, it's impossible to expect real evolution to happen either. I know that there are some organizations that have had amazing businesses that allowed them to stick with the model that served them well for a decade or more, and then it stops doing that, but they never developed a muscle around allowing for mutations and letting those things live on. It's bad news from a factoring standpoint.
If you try to factor your organization perfectly, it's actually going to stifle any possibility for future evolution, even if things look good for several years. I think Netflix and Amazon have actually done a great job of making it possible for experiments to live for many years. Some of them die, and that's fine, but if you try to cut them off immediately and say, "We don't want two of these," that, in the long term, is a bit of a death knell for evolution at a structural level.
Zuber: I had just a couple things that came to mind. When you talk about organization, especially as you've split up and have a distributed organization, I think certainly one of the things that we've struggled with that's informed the chaos that you describe, is failing to create really good alignment around overall technical direction, where you're trying to get to. As people make decisions about, "We need to change this, we need to change this," it can diverge from where you're trying to go in the big picture.
It's been said a million times, but one of the keys to creating that autonomous environment where people can be effective and work within their boundaries is that bigger picture alignment. The other thing that I think about a lot, and thinking about it and solving it is very different, is simplicity. The more that we can keep our systems simple, the more we can adapt. As they become more and more complex, it becomes harder to reason about how we would make those evolutionary changes. Then, it becomes, "I guess we just throw it out and build another one," which depending on the size of your organization, or just culture, whatever it is, is a much harder thing to tackle.
Being Open to Changes
Participant 12: Speaking of evolutionary architecture, I didn't hear too much about evolutionary architecture in your presentation, Robert [Zuber], but you named your presentation Evolutionary Architecture as a Product, so I feel like you have to make changes in your product. If you don't, you cannot use some of the services, the [inaudible 00:47:54], the old service container, or the old LXC container [inaudible 00:48:02]. If you don't make a change, you cannot do business anymore. Is evolution in architecture just a mind change – keep your mind open, be willing to make changes, to adapt to changes – or is there specifically a way to design your architecture?
Zuber: I talked about my biases also; I have been doing startups and small, scrappy companies for a long time. I'm very accustomed to having systems that were not designed to evolve. They actually look like a hodgepodge of the four pivots you made before you actually found your business and charged down a path. My perspective is, it's maybe in your second or third pass that you think about, "How do I build this in a way that I can continue to evolve it?" That first gut-wrenching transition is, "How do I get to a place that's even logical and easy to reason about?"
I think there are many drivers for evolving your architecture. You're scaling as an organization, so you're trying to tune your architecture to the structure, referencing Conway's law. You're scaling in terms of your customer base, so you're trying to architect differently for the needs of your customers as you grow, but in our case, we also have this driver that is, customers just use our product totally differently, and our architecture is a little bit exposed to the customers and how they use it. I'll call it little "e" evolutionary architecture. I'm not really sure, I think we capitalize words and then decide there's one way to do this, but ultimately all of us, all of our architectures have to evolve, as you said, in order for us to continue to succeed. It's a question of, what are the driving forces, and what's going to give you the most benefit?