Progressive Delivery


Summary

James Governor talks about Progressive Delivery and includes lessons from Microsoft, Cloudflare, Sumo Logic and Target.

Bio

James Governor is the co-founder and analyst at RedMonk.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

[Note: this transcript might contain strong language]

Transcript

Governor: Good morning, everyone. It's great to be here. QCon is a great conference. I’ve been a few times. It's very much a practitioner conference. I think one of the things I like about it is, it may be run by Americans but it's got a very sort of European flavor. I think that's important. So often we're worshipping at the altar of Silicon Valley. Anyway.

Like many good stories in tech, this one begins with rage. Some of you may have done rage coding, somebody wrote something, said something you didn't like, and you said, “I'm going to prove you wrong.” For me, I was at KubeCon a couple of years ago, and I'm watching all of these keynotes. I'm watching all of these speakers. They keep saying how great Istio is. But nobody says what it's for. It's like, “Istio, Istio, Istio. It's great. Do you know why it's great? Because it makes Kubernetes great. And why is it great? Because it's Istio.”

I sort of got a bit ragey, and a bit of energy. Anyway, there's this question, "What even is Istio for?" Now, I'm not going to guarantee you're going to get an answer for that in this talk. But it was that rage, that question about what the hell is it for? Why doesn't anyone have a use case? That set me off on this journey, and set me off is probably the right word. What Tracy said, I'm so extra. I thought, “Yes.” And I'm starting with this GIF.

Hopefully, we're ready to go. We're going to be talking about what is this Progressive Delivery notion? Where did the idea come from? What are some of the technology underpinnings? And let's look at some early adopters and what they're doing. Now, early adopters is an interesting term because there's nothing new in IT. We just implement, re-implement, rinse and repeat. But I'm looking for those early patterns, and people that are packaging things together to do interesting things, and most importantly, people that have validated my idea, because confirmation bias is the sweetest bias.

I'm @monkchips. I'm one of the founders of a company called RedMonk. We are a research and advisory firm. I won't say we're an industry analyst firm, because then you would hate me. So what we do is quite different from the companies in the space that look at the world through the lens of purchasing. We try and understand technology adoption through the eyes of the practitioner. So it's all about developer-led, operations-led adoption of technology. That used to be weird when it started. We were the kind of outlier people, and everyone said, "Why would anyone care what the practitioners think? We're outsourcing everything and sending it offshore." I guess that shows how old I am. Things have changed. Increasingly, these days, because of open source, social coding, the availability of cloud resources, there's no need to ask for permission. And that's the world that we try and understand. So I'm not really a practitioner, but you are my people.

DevOps

DevOps. Some people are so good at delivering new functionality quickly that it’s almost ludicrous. We ask, "How are they doing this? Why are they so much better?" But what can we learn from them? This is an important thing as an industry. We look to the leaders, we try and package up some of those lessons, and take that on board. You've got these organizations doing ridiculous, ridiculous things. We all talk about Netflix, many production deploys per day, that’s very aspirational.

Looking at Expedia. So they now ship 2,500 production deploys per day. And that is a very different world. I was talking to Alex Blewitt this morning. Just the change and transition in the industry as a whole over the last few years has been pretty staggering. I mean, causation versus correlation, but it seems like we've got a lot better at this stuff since I entered the industry and you entered the industry.

Anyway, so let’s look at DORA, founded by Dr. Nicole Forsgren. She's an amazing force of nature, really good at throwing axes. Anyway, she's built this DevOps Research and Assessment (DORA) company, which was recently acquired by Google. Well, what's interesting about that is it's basically trying to put some numbers onto the organizations that are doing DevOps well. For me, a really key insight, one of the most interesting things, is that this year it has pulled out elite performers.

The key thing here, in terms of this slide, is that moving fast doesn’t mean breaking things. The elite performers are shipping a lot more code and with higher quality.

Why is that? They've got this DevOps mentality. It's small teams, it's product-focused, it's very much a cultural phenomenon. You've got small organizations taking responsibility for the code they ship. These are the aspirations that we've been taking on. DORA has done a good enough job of packaging this knowledge that they were acquired by Google.

CI/CD, I think, is absolutely foundational to how we're able to move quickly and deliver more software at better quality. It's this shift-left testing phenomenon that we've seen in the industry, and it's absolutely foundational. So I'm definitely not here to say, "Oh, continuous delivery is a bad thing," because basically, that's the touchstone for becoming better at delivering new services to your end users, or wherever your customer is.

When we're doing CI/CD and continuous delivery, I think we think it's like this. We're like, "Yes, this is great." Okay, we've got new functionality, we're delivering it. This is a very well-automated process, everything is good, this is great. The problem is, sometimes it's actually a bit more like this [slide of chaotic assembly line]. If something goes wrong, suddenly, you've got galloping herds. You don't actually know what's happening in your systems. That is not what we want.

A brilliant story earlier this year from Lego at an event that New Relic ran. Lego loves to have big, big launches. If they're going to do the Millennium Falcon, more pieces than they've ever had in a Lego set before, they want to sell them all on one day. They're not going to do this as a phased rollout. They're not going to do Progressive Delivery. They just want to go and whack it out there. We're going to have two main events, one in Europe in the morning, one in the U.S. in the afternoon. Needless to say, everything broke. It all backed up, and what turned out to be the problem with the backend? SAP.

There we go. They've invested heavily in monitoring and tracing. We probably still don't know what's going on in the SAP systems. But it's this notion that we can be really good at continuous delivery, yet sometimes there's going to be that edge case, that outlier, more traffic than we'd expected, and stuff's going to get hard. So what are the organizations doing that are really leading? There are some technologies and approaches that are not in the general definitions of continuous delivery, but they're there. For example, canarying. You're going to take a small amount of the traffic and you're going to see, does the canary snuff it? You basically route some of the service to some of the users, and try and identify if there are going to be any problems. So leading organizations are doing canarying.
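A minimal sketch of the canary idea described above, assuming traffic is split per user rather than per request. The function names, the `salt` parameter, and the bucket count are illustrative assumptions, not taken from any tool mentioned in the talk.

```python
import hashlib


def in_canary(user_id: str, percent: float, salt: str = "release-1") -> bool:
    """Return True if this user falls inside the canary slice.

    Hashing makes the decision sticky: the same user always gets the same
    answer for a given salt, so nobody flips between versions mid-session.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # bucket in 0..9999
    return bucket < percent * 100          # percent=5 covers buckets 0..499


def route(user_id: str, percent: float) -> str:
    """Pick the backend that should serve this user."""
    return "canary" if in_canary(user_id, percent) else "stable"
```

Because the bucket depends only on the user and the salt, raising `percent` from 5 to 10 keeps every existing canary user in the canary and only adds new ones.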

They're doing A/B testing. A/B testing came out of the world of content management: if we give users two different experiences, which one will they prefer? So it might be Ethiopian coffee, it might be Costa Rican coffee, but I want to go to the coffee shop, I want to test both. I want to know which is best. It was the Ethiopian, by the way. Standard 20 at Elisabethmarkt in Munich, really good. So if you're in Munich, I'd highly recommend you visit that coffee shop if you like sweet, light, bright coffees.

Blue/Green deployments. You have two target environments that are exactly the same, and you start rolling the service out to one. You move more and more traffic over there and, if everything is okay, you keep going until it shifts completely from one to the other. That's the way that organizations are thinking. It's actually not continuous delivery in the sense of, "I'm going to press the button and everybody is going to experience this new service."

Microsoft Azure. Microsoft has always been on this incredible journey. They invented a lot of the best practices in the last wave of technology. Then to make this change that they made around the cloud has been really staggering. You take something like VSTS, Visual Studio Team Services, and you're going to break that down, recompose it, make it cloud native, turn it into Azure DevOps. That's really impressive.

What is Progressive Delivery?

I was talking to Sam Guckenheimer, one of the lead technologists over there. I'm still full of rage, by the way. I hope you don't think that I've lost my rage. I might have lost that. I was raging still, and I was talking to Sam Guckenheimer. And he said, "Well, what we do …" - because I was talking about this routing, and what could you do with this service mesh and this Istio stuff? What could you do with that? He said, "Well, when we're rolling out services, what we do is progressive experimentation because what really matters is the blast radius. How many people will be affected when we roll that service out and what can we learn from them?" He said, "Progressive experimentation."

Which inspired me. All of this new stuff, we can call it Progressive Delivery. Now, one of the things about my company is we generally do not invent new terms for things. We think that is a bad idea. Don't invent new terms for things. Find terms that are out there in the industry. So what did I do? Invented a new term for things. Basically, it came from the fact that I couldn't find anyone that was able to describe this basket of things with a single term. I was just like, “What's the shorthand that I can use to discuss this with people? Progressive Delivery.”

What is Progressive Delivery? For me, what you want to see if you're going to come up with some ridiculous idea is that other people are also going to follow you down that path. You want some fast followers: Weaveworks, and also practitioners. It's very easy for the hand wavers like me to go, "Oh, yes, that's a good term. We like that." Stefan Prodan over at Weaveworks was like, "I really like this Progressive Delivery notion. And we're going to come up with a product that supports it." See Flagger. It is about this moving of traffic over a Kubernetes cluster. So Weaveworks, the company where they are, it's all about trying to make Kubernetes more consumable, work better for development teams.

See CloudBees. Carlos Sanchez is a little bit more aggressive than me with his definition: Progressive Delivery “is the next step after Continuous Delivery, where new versions are deployed to a subset of users and are evaluated in terms of correctness and performance before rolling them to the totality of the users and rolled back if not matching some key metrics.”

Key insight - new versions are deployed to a subset of users. I thought, “Ah, brilliant.” I've got Weaveworks. They liked the idea. We've got Carlos over at CloudBees. They like the idea. And for CloudBees, it's like, look, Jenkins is a technology that loads of us have used, lots of us have opinions about. I mean, even the CEO, he describes the problem of Jenkinsteins. They're breaking it all down, they're moving to Kubernetes. They're doing all of this good stuff. So they're interested in new ways of thinking about routing services.

A key insight about Progressive Delivery is that deployment is not the same as release. Service activation is not the same as deployment. The developer can deploy a service, you can ship the service, but that doesn't mean you're activating it for all users. These are actually different things. And that's interesting because what it means is businesses get excited. When I started talking about Progressive Delivery, what I didn't really realize was that, yes, businesses would like it.

I was at an event and I was talking to Comcast and a guy called Greg Otto that runs a lot of their open source stuff. They used Cloud Foundry. Anyway, he said, "Ah, yes, because we've got 30,000 customer service agents. If we roll out a new service that breaks their experience, think how many customers that's going to impact. We don't want to do that. Well, let's think about it: could we do training? Could we start rolling out the services to batches of agents as we train them?" Business people, for them, “continuous delivery” is a little bit scary, because they worry about losing control. And businesses, they're used to saying that IT is the boat anchor. IT is an anchor on our business, it's preventing us from moving ahead, it's preventing us from getting all of these lovely new services for our customers. But actually, that's beginning to change.

Now, all of you I'm sure, are already living in the future where the business is not able to deliver new functionality as quickly as you can write that code. But there are others out there, that it's just sort of a transition. Certainly, I grew up with a notion that obviously IT is a brake on the business. But we're all trying to move to a world where it no longer is. And if it isn't, what's the implication for the business, in terms of training, in terms of understanding the user populations, in terms of that service and how it’s managed? It's not just for IT. Progressive Delivery is really thinking about, "Well, yes, what does this look like in practice?" Most people don't really like application changes. We think we do. We're technologists. “Show us the new stuff.” “There's like 280 characters, are you out of your mind?"

Or, Google will change the Gmail interface. Do we like it? Of course, we don't like it. We generally don't really like change. But I think one of the interesting things there is Google put us in some control of when we could adopt the new version. We really like that. Give us something new, but let us choose it at our own pace. The anchor has changed. Now, the thing about Continuous Delivery for businesses, it sounds absolutely freaking terrifying. Take Charity Majors, founder of Honeycomb.io. She's amazing. But it's kind of like, "Yes, do debugging in production or you're no good." When a business hears debugging in production, they are terrified, because it's the last thing they want to hear. Debugging in production sounds insane. So yes, this is what it sounds like. So businesses generally don't like that.

I was thinking about AWS. Obviously, they are amazingly good at avoiding outages, big storms knocking out a whole region notwithstanding. Amazon is really good at this stuff. It's partly because of their whole view of compartmentalization and isolation - encapsulated in a brilliant, brilliant essay. So Werner wrote this, "10 Years of Compartmentalization". This thinking of, when we think about cloud services and we think about the whole cattle versus pets, we think that all of the cattle are the same. We think it's all really pretty homogenous, and a change is going to propagate through the entire network, sort of like wildfire. But of course, you actually want fire breaks. At Amazon, well, there could be some advantages to having one system. But actually, everything needs to be isolated in order that people can move quickly and safely. That's an absolutely key insight. It’s a brilliant, brilliant essay.

I was talking to Sumo Logic, and I mentioned the term, "Progressive Delivery." They said, "What's that?" I explained it and Bruno Kurtic, who runs product, said, "Yes, actually, we do that. We do that." I think one of the key points about Sumo Logic when they do this stuff - okay, they do canarying. It's a small user population first, but they're really focused about it. For example, they do a lot of machine learning and AI across logs to understand basically what had broken the system. But they don't always know when they roll something out whether it’s going to be applicable to all their users. So literally, they can choose particular user populations, build the AI models there, train the AI models there before deciding to roll it out more broadly.

Bruno said "Yes, actually, that really does make sense." I think one of the absolute key things that he was talking about was just the scale of it, because they have a shadow image. They're all in on AWS, so they’ve got the capacity to do this. They can have two production systems that they're testing things on. So whether it's the AI models or just the system as a whole, they're in a position where it's all a question about actually routing traffic with a high degree of sophistication.

Or take Cloudflare. I think we've been through the era of cattle and pets. Now we're moving into dogs, canaries, and pigs. I was talking to John Graham-Cumming, Cloudflare’s CTO, about how they roll out new services. They've got dogs, they have loyal customers, and you want to treat your dog right. Actually, pets are good. Pet customers are great. They pay you money, you want to treat them well. So you make sure you don't do any breaking changes. You don't roll out the new functionality straight away to them. That's bad news. They have canaries. At any given point, they'll have the canary city. So I don't know, it could be in London today, and maybe we're getting the slightly janky stuff, or it might be Munich, or it might be Oslo, it might be wherever, but they'll have a canary city.

I think my favorite is the pigs. Cloudflare has an interesting approach to internet security. Their key thing is wanting to understand the world as best they can, and to not get in the way. They're not going to censor things, and they take a bit of flack for that. But they're all about, “Let's understand the system.” So they've got pigs - people that have signed up for Cloudflare services with stolen credit card information. They get all the crap rolled out to them as soon as possible. Needless to say, they all start complaining about it. It's like, “You've used stolen credit card information for sign up." But there's a lot of signal in those complaints, even though they're all thieves. So they're the pigs. So we've got dogs, canaries, and pigs. And I think that's what makes for a great slide, even if it isn't the future.
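The dogs, canaries, and pigs story amounts to rolling out in ordered waves by cohort. The sketch below illustrates that ordering only; it is not Cloudflare's system. The cohort labels come from the talk, while the `general` cohort, the placement of dogs in the last wave, and all names in the code are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical cohort labels borrowed from the talk; lower number = earlier wave.
# "general" and the exact ordering are assumptions for this sketch.
WAVE = {"pig": 0, "canary": 1, "general": 2, "dog": 3}


@dataclass(frozen=True)
class Customer:
    name: str
    cohort: str  # one of the WAVE keys


def rollout_waves(customers: list[Customer]) -> list[list[str]]:
    """Group customer names into ordered waves, earliest wave first."""
    waves: dict[int, list[str]] = {}
    for c in customers:
        waves.setdefault(WAVE[c.cohort], []).append(c.name)
    # Empty waves are simply skipped; names are sorted for a stable result.
    return [sorted(waves[k]) for k in sorted(waves)]
```

The point of the ordering is that the cohorts you can least afford to upset see a change only after the others have already exercised it.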

Datadog. God, you see what I did there? Anyway, Datadog, I think it's really interesting. It's one of those questions, again, as I was researching this: who's doing this stuff? There's a guy called Jason Yee. He's doing some fantastic work, talking about how they're using Istio. Honestly, I would not be doing this talk today, if I had seen his talk at that KubeCon event. Because he described how they're using Istio for canaries. It's a great talk. I'll be sharing the slides so you can look at the link later. But their whole canarying A/B testing infrastructure is all based on Istio and Kubernetes. So they're actually implementing on this sort of vision that I'm trying to describe.

They take it really far. Let's think about geographically who we're rolling out the service to. Because it's not just about availability and those sorts of things. It might be that people in different geographies use services differently. Jason Yee said the Japanese might have a slightly different approach to using the application. You need to think about that geographical separation. When are you rolling it out? I mean, obviously, there's no point rolling out in the wrong geo if everybody's asleep. You may find that there are some outliers that are using the network at that time. But you're not going to get a broad-based understanding of what's going on.

GrubHub, they're a food delivery company. They roll out all of their services to small cities first. I think it's this sort of thinking where you start thinking, “Well, actually, that question about what small and/or limited community I should roll out a digital service to, that does feel kind of new, and kind of different.” And finally, Google's SRE Book: what are the signals we should be learning from when we do these rollouts? Latency, errors, traffic, saturation, those are the key signals that are described in Google's SRE Book.
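As a rough illustration of using those four signals during a rollout, the sketch below compares a canary against the current baseline. The `Signals` type, the function name, and every limit in it are assumptions chosen for the example, not recommended values from the SRE Book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    latency_ms: float   # e.g. a high-percentile request latency
    traffic_rps: float  # requests per second served
    error_rate: float   # fraction of failed requests, 0..1
    saturation: float   # fraction of capacity in use, 0..1


def canary_problems(canary: Signals, baseline: Signals) -> list[str]:
    """Return the signals on which the canary looks worse than baseline.

    All limits are illustrative assumptions, not recommended values.
    """
    problems = []
    if canary.latency_ms > baseline.latency_ms * 1.2:
        problems.append("latency")
    if canary.traffic_rps <= 0:
        problems.append("traffic")  # a canary with no traffic tells us nothing
    if canary.error_rate > baseline.error_rate + 0.01:
        problems.append("errors")
    if canary.saturation > 0.8:
        problems.append("saturation")
    return problems
```

An empty list means none of the four checks objected, which in this sketch would be the condition for widening the rollout.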

Developer Experience

Now, developer experience. The thing is at the moment, it's like roll your own. Everything is like, "God, what am I doing? There are no tools for doing this. Why would I want to do this? I'm not Amazon, I'm not Google, I'm not Sumo Logic. I don't have a cast of thousands, I've got a day job. And you're telling me about Progressive Delivery, like really?" So Weaveworks went and actually built a tool. They called it Flagger, to think about, well, actually, why don't we have a traffic shifting tool? When I'm deploying something, I can set thresholds. I'm going to move 5% of the traffic over, I'm going to move 10% of the traffic over. And then when we know that the experience is okay, we're going to let Kubernetes roll it out across the fleet.
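The stepped traffic shift described here can be sketched as a small loop. This is a plain-Python simulation of the pattern, not Flagger's API or configuration; `healthy_at` is a hypothetical hook standing in for whatever metric check a real tool would run at each step.

```python
from typing import Callable


def progressive_rollout(
    steps: list[int],
    healthy_at: Callable[[int], bool],
) -> tuple[str, list[int]]:
    """Shift traffic to the new version one step at a time.

    steps      -- increasing traffic percentages, e.g. [5, 10, 50]
    healthy_at -- hypothetical hook: given the current percentage, report
                  whether the new version's metrics are within thresholds
    Returns the outcome and the list of percentages actually applied.
    """
    applied: list[int] = []
    for percent in steps:
        applied.append(percent)      # move this share of traffic over
        if not healthy_at(percent):
            applied.append(0)        # roll back: all traffic to the old version
            return "rolled_back", applied
    applied.append(100)              # every gate passed: finish the rollout
    return "promoted", applied
```

For example, `progressive_rollout([5, 10, 50], lambda p: p < 10)` stops at the 10% step and returns traffic to the old version, so the blast radius never exceeds that step.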

And I think when people start to design products to support a pattern, the pattern’s real, whether or not you like the name. A product that describes a pattern, that's really interesting. I think that's when markets begin to move forward, because it no longer depends on having all of the best technologists on the planet. So it's a pretty simple model. It is basically about moving traffic from one service to another. They're obviously going to be investing in the tool and delivering more functionality.

This stuff is not new. There is a brilliant piece by Pete Hodgson, who wrote it on the ThoughtWorks blog. Who here has never read anything from ThoughtWorks? That's QCon. So he called them Feature Toggles. But I think the sophistication here is that you might be experimenting with Feature Flags. You might take one subset of customers, roll out to them, and experiment. It might be from an availability perspective. There are different reasons why you might want to roll things out. Some are in that A/B testing sense. Some are about ensuring you won't have bugs at scale.

Feature Flags is a core and key approach here. LaunchDarkly is a cool company. The CEO Edith Harbaugh is amazing. She's building this really incredible technology for doing Feature Flags. They're not the only ones out there. There's a company called Split, split.io. They look at understanding patterns, helping people to decide what technology to move forward with. Wix has a thing called Petri. But I am leading with LaunchDarkly. Why? A couple of things. A, its stuff is good. I've talked to some customers that are using it. And, B, Adam Zimman likes the term Progressive Delivery. Yes, you may be getting the hang of why and how I put this presentation together. He talks about if statements for features - fancy if-then statements. And it is basically this idea that just within your code, you can have a very simple - if this is a beta user, give them this experience. And that's what they're moving to.
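The "fancy if statement" can be shown in a few lines. The sketch below uses a hypothetical in-memory `FlagStore`; it does not reflect the LaunchDarkly, Split, or Petri APIs, and the flag and group names are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FlagStore:
    """Hypothetical in-memory flag store: flag name -> groups it is on for."""
    rules: dict[str, set[str]] = field(default_factory=dict)

    def enable(self, flag: str, group: str) -> None:
        self.rules.setdefault(flag, set()).add(group)

    def is_on(self, flag: str, user_groups: set[str]) -> bool:
        # Unknown flags are off, so deployed-but-unreleased code stays dark.
        return bool(self.rules.get(flag, set()) & user_groups)


def render_dashboard(flags: FlagStore, user_groups: set[str]) -> str:
    if flags.is_on("new-dashboard", user_groups):  # the "fancy if statement"
        return "new dashboard"
    return "classic dashboard"
```

The new code path ships to production with the flag off for everyone; calling `flags.enable("new-dashboard", "beta")` later releases it to beta users without another deployment, which is the deployment-versus-release distinction made earlier in the talk.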

What's interesting was that LaunchDarkly began their journey thinking it was really about the development and software delivery experience. But over time, they've realized it's actually about the service management experience. Adam is much better at slides than me, and seemingly storytelling too. The tale he tells is of his daughter and his grandmother. His daughter loves all the new technology. She's amazing. She's built this Lego robot that can solve a Rubik's Cube. Grandmother, she likes a bit less change in her technology environment. Both of them need to be satisfied. The way that LaunchDarkly is beginning to think about this is, can we put the user or the business in charge of that application change?

Release Progression and Delegation

There are two notions in Progressive Delivery as LaunchDarkly sees it - release progression and delegation. Let's better understand our user populations. Google Cloud invented a lot of this stuff. The perpetual beta. Google Cloud - one of the things about Google just as a whole is just the network they have; the degree of control over the network they have is so incredible. It leads to so much of the effectiveness of the organization, and that's the stuff they can't open source. Borg is so built into the network that that's not something they could have easily open sourced. But they did open source Kubernetes.

I think it was interesting that we were doing containers. Docker is great on the laptop, and then you're trying to roll it out in production and it's like, “What the fuck?” So you're in an environment where actually we needed an orchestration platform. Google hit that perfectly. I think it's a great story there because Docker made a slight blunder, which is that Google offered to give them Kubernetes and Docker said, "No, we'll build our own." It's a bit like saying, "No, we don't want to sign the Beatles." It's one of those probably not a good idea things.

Kubernetes

When you have Kubernetes, it's like, "Okay, you get all the goodness we can have. We're going to have these containers, microservices, immutability, cattle versus pets. It's all going to be fantastic." But here's the thing. Google's network is a homogeneous network that is well understood, where everything is the same. "Oh, when it breaks, we just throw it out. It doesn't matter, plug in another one." Definitely there are no pets here. But actually, that's not like other people's networks. Not everybody is in the Borg.

Target. So they're doing Kubernetes. They're doing some ridiculous things with Kubernetes. They're rolling out Kubernetes in their stores. What are you going to have? You're going to have an SRE in every store? You're going to need, I don't know, the GDP of the UK, in order to hire enough people to support their Kubernetes rollout doing that. So if you can have Kubernetes in store, what does that look like? When are you ready to actually roll out a new service? Are they all going to be appliances? How does that work? They've had to do all sorts of janky engineering in order to do Progressive Delivery of their Kubernetes rollouts.

IBM Cloud. IBM actually does a lot of Kubernetes at scale. At first, I was like, "Really? You got IBM Cloud Private and IBM Cloud Public? How does this make sense? Why don't you just have IBM Cloud, that you can have on-prem or off-prem?" Well, the thing is, it just doesn't work like that. They've got customers that are doing stuff with state that needs to be managed. They're using Kubernetes in lots of different ways. And it turns out that they've got this ridiculous real estate, and customers have all sorts of different operating systems and servers and all sorts of stuff they need to support from a services perspective.

The reason they have IBM Cloud Private is just that they don't have homogeneity of target environment. So IBM has built a polling system, so they can do Progressive Delivery with Kubernetes where it's like, "Are we ready?" You can't just roll it out across the fleet. Are we ready to roll this out? The pace of Kubernetes is really impressive/frightening. Increasingly, today, we feel like we've got to stay current because otherwise there are security vulnerabilities. It's become more and more obvious that we can't just be, "Oh, yes, but we were running that 14-year-old version of Struts. And unfortunately, there was a vulnerability," right? We're moving to a world where currency is much more important. Kubernetes is moving incredibly quickly and we're supposed to consume that.

Continuous Delivery isn't just about delivering new services to end users. It's also about consuming new services from the Kubernetes team that are moving at an incredible rate of knots. This is the challenge. Basically, the long-term support model that we grew up with is, it's a parrot. I don't know, I'm feeling old now. When you put in a Monty Python reference and nobody chuckles, you're getting your cultural references wrong, mate. Anyway. So long-term support is sort of dying. What does that mean? How are we going to manage this stuff? We're going to need some new approaches.

Service meshes

Service meshes do give you the ability to do rollbacks, failovers, and service routing, which is where you can move some of the traffic over. And there's very much the sense that the metrics, logging, and everything else is going to be built into the system. So observability is really key. I mean, we talk about service mesh as an ability to orchestrate services. But I think it's the instrumentation of those services that is just as valuable.

Expedia does, like I said, 2,500 production deploys a day. Subbu Allamaraju has done some really excellent work looking at their outages, because it's one thing to roll out new functionality. So when do they not have outages? When no developers are shipping code. When the Snowmageddon hit Seattle, the DevOps team was like, "Wicked, we actually don't need to do anything. The developers have all said they can't get in to work, no code is being shipped, nothing is breaking. Awesome."

Now, there is a reason why Amazon is not doing production deploys at Thanksgiving. That would make no sense. Just leave the system as it is. Because things break when you change things. It's sort of obvious. But what we're talking about is, well, if things break when we change things, how do we better understand that? For Subbu, it's a compartmentalization strategy: absolutely moving forward with the view that everything that gets rolled out needs to be isolated, needs to be compartmentalized.

Thinking about this view, we got the notion of Progressive Delivery, but also mitigation is still so important. This is a talk about DevOps, and just culture and a culture which is not based on blame is so important. We should try and make it so that we're rolling out stuff that doesn't break. That's super important. But things will always break. And thinking about the incident response, it's what you learn from incident response that is perhaps the most important thing. Think about that, and then into how we're going to have Progressive Delivery in the future, because it's about, yes, a peer review. And also just the learnings. I mean, I know a lot of startups and the most useful thing they see as they build a platform is Amazon's post-mortems. Learning from failure, I think, is probably the key insight in terms of DevOps.

So in terms of this world, if we're going to have Kubernetes, where do we have developers? As I said, I'm about practitioners. Developers today live in not that many places. They're in Slack to share GIFs, they're in their editor, and they are using Git and GitHub, or maybe GitLab, or whatever. So GitOps is a notion that was put forward: you store the desired state in Git and then let the automation capabilities of Kubernetes roll something out. You monitor that and then make changes. So you've got the desired state and the actual state. And hopefully, you try and get them to come together. It was kind of interesting. This is something that happens in practice. Talking to Boeing, what they do is they store their user IDs in Git, and then they roll out new services to them just by doing pull requests. So it's automation via pull request. I think one of the key things is if we're going to get things out of the way of developers so they can be more effective, Git is absolutely key to the way developers work today.
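The desired-state versus actual-state idea can be sketched as a single reconcile step. This is an illustration under simplifying assumptions, not how any particular GitOps tool is implemented: both states are plain dictionaries from service name to version, whereas in practice the desired state would be read from files in a Git repository and the actual state from the cluster.

```python
def reconcile(desired: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Return the actions needed to make `actual` match `desired`.

    Both maps go from service name to version. The action strings are a
    made-up format for this sketch.
    """
    actions = []
    for name, version in sorted(desired.items()):
        if name not in actual:
            actions.append(f"create {name}@{version}")
        elif actual[name] != version:
            actions.append(f"update {name} {actual[name]} -> {version}")
    # Anything running that is no longer declared gets removed.
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(f"delete {name}")
    return actions
```

A merged pull request changes `desired`; running the reconcile step again produces the actions that bring the system back in line, and an empty list means the two states have converged.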

Observability

Observability. I mentioned Charity Majors earlier. Absolutely brilliant. She's doing fantastic work. Honeycomb is a really powerful platform for understanding what's happening in the system and could be used within the context of 5 or 10 or 15% of the overall traffic in a progressive delivery context. She says debugging in production, because you're always going to need to debug in production because something's going to break, whatever you do. But it's the notion of, how do we have a better understanding of what the system is doing? I think the key there, there's nothing new, distributed traces, logs, and metrics, and just bringing them together. But the scale of data, I think, is really important.

Cindy Sridharan (@copyconstruct), Charity Majors and Jaana B. Dogan from Google are all people you should know if you want to understand observability and making systems that don't break, or that you can fix. Follow them, and you'll have a much better understanding of the world. Because you're going to have to get inside the machine. It doesn't matter how much testing you do. This is indeed modern times.

So here are some of the things we've spoken about today. I think one of the key things I'd like to finish on is the notion of using the abundance. Why are we able to do Progressive Delivery? Why are the organizations that have done this stuff able to do it? It's because they've got this cloud capability, as in the Sumo Logic example. If I'm a Netflix engineer, chances are high that if I want to spin up, say, 10 instances, 10 VMs or 10 containers or whatever, I'll actually spin up 100 and throw away the 90 that don't meet the performance characteristics that I want. That's the sort of world we're moving to, where we've got this cloud abundance.
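The "spin up 100, keep 10" pattern can be sketched as follows. This is only an illustration under stated assumptions: `launch_instance` is a hypothetical stand-in that returns a simulated latency rather than calling any cloud API, and "performance characteristics" is reduced to a single latency number.

```python
import random


def launch_instance(rng: random.Random) -> float:
    """Stand-in for starting a VM or container and benchmarking it.

    Returns a simulated latency in milliseconds. A real version would call a
    cloud API and run a benchmark, which this sketch does not attempt.
    """
    return rng.uniform(5.0, 50.0)


def provision_best(
    wanted: int, launched: int, rng: random.Random
) -> tuple[list[float], list[float]]:
    """Launch more instances than needed, keep the fastest, discard the rest."""
    if launched < wanted:
        raise ValueError("must launch at least as many instances as wanted")
    latencies = sorted(launch_instance(rng) for _ in range(launched))
    return latencies[:wanted], latencies[wanted:]


if __name__ == "__main__":
    kept, discarded = provision_best(wanted=10, launched=100, rng=random.Random(0))
    print(f"kept {len(kept)} instances, threw away {len(discarded)}")
```

The design only makes sense when launching and discarding instances is cheap, which is the cloud abundance being described.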

So what does that mean in terms of the architecture that we can develop and the user experiences that we can provide? I think it's that. We want developer experience to be amazing. And that has very much been RedMonk's mantra ever since we founded the company. But at the end of the day, it also needs to be about the user and customer experience. And that's Progressive Delivery. Thank you.

Questions & Answers

Moderator: With the Progressive Delivery canary, so I guess feedback becomes really important, not just about pushing code, but can you say more about getting feedback?

Governor: In terms of the users or in terms of the system?

Moderator: Yes. When, let's say, you're just rolling out to Japan or something like that.

Governor: Yes, and that's what I was supposed to be talking about, closing the loop at the end, so Tracy nailed it. I think that there are a couple of things there. One, is obviously we've got some really effective incident management platforms now and alerting platforms. So if you think about the data now that you can collect with the likes of PagerDuty, that data really becomes a gold mine. I think it's no surprise that we've seen the acquisition of VictorOps and Opsgenie, Atlassian very much moving into a world where it's beginning to collect the data after the fact, or at least the user experience part, some of that semi-structured stuff, as well as the structured information that we've got logged. Logs are semi-structured as well. But I think Incident Management is one of the keys there and it should be part of this worldview.

Participant 1: Thank you for the talk. My question is, for a small team, between 5 and 10 people, we've got the DevOps mindset and we have CD as well. But I wanted to know if the tooling is mature enough for a small team to do Progressive Delivery without having a team helping you out with the releases and how to roll it out to the rest of your population. Is the tooling automated enough for a small team to do it?

Governor: I think that is both an easy and a hard answer. I mean, the easy answer is no. As I said, the organizations I've seen that are doing this primarily do have big teams that are able to focus on specific problems like that. That said, we're definitely seeing a marked shift in terms of - I mean, certainly, if you look at the development velocity of LaunchDarkly or someone like Weaveworks, again, these are small teams that are quite tightly focused on problems. I think it's a little bit early but within the next 12 to 18 months, we're going to see much better-automated tooling for canarying at least, and thinking about how we can start rolling stuff out more safely.

I don't think that we've got solutions to all of this, and definitely Tracy gave me the chance to speak a little bit in the future. But I think that what we are seeing is vendors are beginning to do that work. So, hopefully in 12 months, I could come back and I'd be like, "Oh, here are some actual solutions." But at the moment it is, I'd say it's a bit early.

Participant 2: From a user perspective - this is a real example. In my company we made a big change to a search form. We are a real estate company. We made a big change and users don't like change. A lot of users are writing that they don’t like the new form, although it's much more powerful, and we launched it in A/B tests. But we cannot roll back the functionality now, because the data are not compatible with the previous version. How would you face this problem?

Governor: Well, I think that part of that is going to be let's not roll it out immediately to everyone. So I think that's partly the notion of progressive delivery now. There are two things here that we're talking about, well, quite a few things. One, we are awful at understanding the psychology of change management and helping people to understand the value of adopting a new approach.

Who in this room thinks their employer offers them enough opportunities in training? That's amazing. I would have expected nobody to put up their hands in the room. I want to know who you work for and I'm going to try and get a job there.

The thought about, organizations are always like, "We don't want to train users." Just everyone's expected to use it. "It's like a Facebook interface. Anybody can use that." I don't know about you but I find Facebook has the worst interface known to mankind. Everything is hidden. You can't find anything, your privacy settings keep changing without telling you and all that sort of stuff. So there is psychology of change.

And the other one there is that, yes, I think once you start getting into questions of state, everything always gets harder. Once data is changing based on the service rollout, that does create a new set of challenges. But I think that's why you want to test it with a small constituency first. Rollbacks, I've talked about Istio making rollbacks easier. That's not to say they're easy, and a lot of us aren't running complete Kubernetes and Istio infrastructures. We're probably just running some Java apps or something like that. I think it's somewhat of a feature statement.

But I think with a lot of these things, like service routing - so the network view of Istio, which is interesting because you have the layer seven model. Well, you could actually do some of the things I'm describing using switches, and you think about F5 Networks and the work they've done; that's a pretty sophisticated routing architecture.

I think I've seen Per Buer here. He spent years in the Varnish ecosystem; you can do some absolutely incredible service routing things with Varnish. VCL was amazing for taking traffic and routing it in particular directions. That's where there's an aspect of not all of this is new. And you might be able to do some of it. But yes, you're right. My view would be avoid breaking changes for a very big user population and avoid some of the problems that you might get.

Participant 3: Thanks for the talk. Hopefully, these are two short questions. The first one I was wondering about is the scope of Continuous Delivery: does it include the operational and the process-related work prior to the technical aspect? So I'm talking about change management or submitting CRs, which is painful work. If you think of it from the DevEx aspect, it's a pretty painful process. And the keynote from yesterday kind of highlighted that. Yes, it's not really a good idea to have to submit a manual change request and wait for it to get approved. And then you go through all the cycles. So that's the first one.

The second one is, for canary, how long would you have to wait until you're comfortable enough to move to the next state? But then what if you're in that state, and maybe you're at 90% to 95%, and then the perfect recipe for a storm brews and you find out things fail?

Governor: Both are excellent questions. The good news is we can get the business involved in service rollout again from a product ownership perspective. The bad news is that the business is involved, again, from a service management perspective. One of the advantages of the world that we've been building, where developers are much more enabled to just build the functionality and move on ahead, is that there have been fewer of those gating factors. So business is always going to change its mind.

But on the other hand, what are we here for? We're generally here to frankly make money and serve customers. I think from that perspective, yes, listening to - as a developer, it's easy to go, "Oh, the last thing you want to do is have marketing have a voice." But on the other hand, there has never been a successful technology company that didn't have marketing that was effective. So I think that listening to the business is a challenge. We don't want to go back to a world of everything being waterfall style, change management requests, and so on. I think we're doing something new.

Progressive Delivery is - I want to be educated. I've come here with an idea. I hope I haven't given the indication that this is like a fully-fledged thing you must adopt. It's more I'm trying to describe some practices. And I think like a lot of the problems that we're currently facing, these things are multidisciplinary. When we had ITIL, nobody talked about empathy, empathy for your colleagues or your end users. I think we have made some progress in thinking about, "Well, what does that mean?" Being at a conference where someone can say, "HugOps," and you know what they mean.

When there is a massive systems failure, it's all too easy for us, as a community of users, to start bitching about it. But let's think about the SREs on the other side of the fence who are really struggling, not sleeping, feel terrible, trying to get something fixed. So I think that there are going to be some changes to our processes, and, as I said, questions. Which user populations? How can we be more sophisticated about the beta program? What does alpha look like? What's a preview? Thinking about that stuff, I think, does absolutely need to be part of this. Which users like change? Which users want services to be rolled out immediately? Some people do like stuff that breaks and they have to work it out. Just not most people.

I spent so long on the first question I forgot the second question. I mean, this is again canarying: how long should you do it before you know? That's a really good question. If you look at Weaveworks and the Flagger tool, that's very much just in terms of, at least as it’s initially constituted, are there any catastrophic breaking changes? You can begin to identify that fairly quickly, but it's really going to depend, because for example, if you're doing something which is more like A/B testing, you've got to roll it out to a user population that can have a meaningful experience with the tool before you can begin to look at that feedback and decide whether that feature should be rolled out as it is. Which I guess then feeds back into your previous question: policy, change management, what does the business think, and how can we become more effective in that alignment of business and technology?
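The "are there any catastrophic breaking changes?" check can be sketched as a stepped rollout with a rollback. This is a simplified illustration and does not model Flagger or Istio: `progressive_rollout`, the traffic steps, the `error_rate_at` callback and the threshold are all hypothetical, and the question of how long to wait at each step is left to whatever `error_rate_at` measures.

```python
from typing import Callable, Sequence


def progressive_rollout(
    steps: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> tuple[int, list[int]]:
    """Shift traffic to a new version step by step, rolling back on failure.

    steps          -- traffic percentages to try, in order (assumed values)
    error_rate_at  -- hypothetical callback returning the observed error rate
                      once the new version is serving that share of traffic
    max_error_rate -- threshold above which the change counts as broken

    Returns (final_percent, steps_attempted). final_percent is 0 when a step
    breached the threshold and the rollout was rolled back.
    """
    attempted: list[int] = []
    for percent in steps:
        attempted.append(percent)
        if error_rate_at(percent) > max_error_rate:
            return 0, attempted  # roll back: all traffic returns to the old version
    return (attempted[-1] if attempted else 0), attempted


if __name__ == "__main__":
    steps = [5, 25, 50, 100]

    # A healthy release: the error rate stays low at every step.
    print(progressive_rollout(steps, lambda percent: 0.001, max_error_rate=0.01))

    # A release that only breaks under heavier load, late in the rollout.
    def breaks_late(percent: int) -> float:
        return 0.2 if percent >= 50 else 0.001

    print(progressive_rollout(steps, breaks_late, max_error_rate=0.01))
```

A failure at a late step still triggers the rollback here, but as the answer to the earlier question notes, that only helps if the data written by the new version remains compatible with the old one.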

 

 

Recorded at:

Apr 11, 2019
