
Should We Really Run It if We Build It?


Summary

"Build it, run it" is the war-cry of the startup and scale up industry. Is it really that simple? Are there hidden costs like engineer burnout and a lasting impact on a young culture? And do B2B and B2C companies have different prerogatives?

Bio

Paul Hammant is co-creator of Selenium and other OSS tools. He works as a CD consultant.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Hammant: Nowadays, I like to think that I'm the guy who speaks most and loudest on trunk-based development. If anyone has a need to get me into their company to drag them into that, as well as CI, as well as CD, and all of that stuff, which I think sits on TBD and monorepos, then they can call me. In terms of InfoQ, Floyd contacted us at ThoughtWorks, and I was volunteered to maybe write an article. We did one based on a case study that Ian Cartwright and I had done in the U.S. I just think it's interesting that I've come back to the InfoQ family to present for the first time, when I was there at the beginning, writing an article, which you can still find.

Premise

The idea is we're in startups. There's possibly a good chunk of the people present here today who are, or have recently been, in a startup. Then you move through a scale-up. Then you're a legacy company, presumably, with thousands or millions of clients, watching your rearview mirror as somebody nimbly overtakes you because they're better. Your focus, for now, is on the startup phase. Your CEO has told you that he's heard this war cry that if you build it, you can run it. You think, "Maybe we can." We have an idea of what that's going to be like: "I'm going to go home and then there's going to be an alert," and you hope for the best, maybe.

We should think about the rationale. When the CEO says, "Why are we going to do it?", it's because we're proud of what we make. It's also maybe a focus area for us as developers: we want to eliminate all the defects, and the best way to stay focused on that is to be exposed to the defects as they happen rather than treat them as an over-the-fence thing. The fence can quite often be dev to QA going downstream, but the fence is upstream from us too, BAs, classically, for feature roadmaps and backlogs. We also have a fence to the people that are using the software in production, and how they get something to each of us to fix. If we're exposed to things as they happen, then maybe we can make it good. The rationale seems solid.

There's some poison here, and this one, for me, I can't tell you how much I hate Slack. Honestly, every company, every client I have is trying to drag me into Slack. There are maybe three, four, five Slacks I have to ping between, and I've stupidly done it with different logins for the different companies. The whole pinging around is hard, and it's fraying my nerves, the alerts that come through Slack and all those channels. It's like somebody's hailing me here, and it's, "No, I'm confused. I went into the wrong channel. That was from a week ago, and I missed it a week ago." There's no doubt about it, you can make Slack ops work for your company, especially for supporting the production system. Slack certainly works for impromptu development communication, especially when you're not all collocated, which would otherwise be the extreme programming ideal.

Slack isn't even the only way. A startup I was involved with was using WhatsApp in exactly the same structure. They had a Slack, but they were only using that for planned work. Think of a stream going downhill, with planned work being dropped in at the top, and it's going to flow through as fast as we can make it, which is why the Lean-Agile community talks about flow. The stream is not water, it's made of molasses. If somebody's chucking defects in there, rocks, and they're being thrown into the stream midway, they're slowing down the flow. We had a separation where we were using Slack for planned work and then WhatsApp for unplanned work. That's two different systems that can ping my phone 24/7, and my nerves are going to be even more frayed, personally, and I don't think I'm speaking just for myself. The poison can be just the interrupts. Then you have family, and then there's a work-life balance problem, and pressure to not attend to your phone.

Burnout

We risk burnout. Of course, I think the clue was in the setup for this. The fear with a "build it, support it" ethos is that we might be in this place where we're pushed to just collapse because there are too many incidents too often, and then there are five working days but actually seven days in a week. Depending on the nature of our business, we're supporting the stack, the solution, the users seven days a week, or not. We thought, in agile, for extreme programming at least, that it would be maybe six hours a day pairing, and then maybe another two hours a day attending meetings and doing email and stuff like that, and then you go home, exactly at the eight-hour mark. We're in the U.K., luckily, we have seven-hour working days, so we just adjust that down a little bit.

Startups, especially when you feel there's a buzz going on for the thing you're building and the race to beat some known or unknown competitor to market, you might find yourself working up to 10 hours a day. In some cases, I think, back in 2001, just before the downturn, I was doing 16 hours a day. You can't sustain that. We might observe that 10 and 12 are regular, and then if we're doing support too, it's going north of that. At my age, I can't do that anymore, and I'm not even sure I could have done it as a 20-year-old. If I thought I could, I was kidding myself, and I'm not the only one.

Quitting

Quitting is your outcome. That's your bad outcome. You could plan to quit: I'm going to work a month or two, sharpening up my CV, going to interviews on the down-low, and then hand my notice in. I'm going to work my notice in the U.K.; not so much in the U.S., where you'd be black-bagged the same day. You're going to find a new employer, and you're going to have some relief in that. Or the quitting could be because you've reached a breaking point. The quitting could be because it seemed like the only rational course of action at that particular moment in time: a shouting match was happening over an incident, or immediately after the incident, and you hadn't been home, or whatever it is. It can get that bad in startups. You disappear for a regular job, and you're here at a new place. It is a reset. A change is as good as a break. Maybe you're in a startup and one of the reasons you joined was because you really super love their value proposition and even the technologies being used, and somehow, in the induction cycle, you were persuaded to take stock and not just salary. I'll call the sum of those two things compensation.

There are eyes-open ways of going into that, running a spreadsheet to see, "I'm going to move my salary down from here to here, and I'm going to take that amount as options. Those will come in one year or four years later, and they'll come in in tranches, but I must remember to get a checkbook out to actually buy them at that stage at the strike price." If you're going to quit, it could easily be ahead of that schedule. You might actually be disappearing and losing some of your delayed compensation. We think that maybe 9 out of 10 startups fail anyway, but what if that was one of the ones that was going to make it, and you're not leaving a sinking ship? It's a ship that was going to make it, and you were the one that bailed early. We have all left companies, hopefully for good reasons, but hopefully not too many of us have left companies that went on to become unicorns and left us struggling to make the rent because we bailed too early. Those are bad lessons that would sit with you for the rest of your life, maybe: if you'd just stuck it out another month or two, you'd have been fine.

As with all people that exit bad situations, whether it's a bad support situation or a bad development situation because your team leader or your architect is yelling at you or telling you not to test, really bad things, quite often the people that do leave are the most talented. They do so quietly and they do it the planned way. They leave, and the organization is left with the people who are willing to put up with more shit and more misery, who have maybe, throughout their lives, been acclimatized to being yelled at. Those are the people that stick around, and they should also leave if it's a really poisonous outfit. The quitters can quite often be the most talented.

Remedy

There's a remedy, and the remedy is known to established organizations already. The people that have gone past scale-up and are now the incumbent legacy company have proper three-line support. People argue about what is level 1, what is level 2, what is level 3, but let's just say there can be some variation. We'll say that, classically, the development team that would support a stack is actually level 3. The main life is: we are developers in a dev team, we have multiple functions within the team, we have a nice sprint-centric way of working, and all the candy upside of living in a dev team, but we also do support.

Support is unplanned work that comes in. In the stream metaphor, it's the rock dropped in the stream. If you're lucky, it comes into your stream during a working day. Then you can work on it as it happens, and maybe you are dropping something else to work on that rock that's just fallen in the stream. Maybe that's all good and you still go home on time. What we're really worrying about are the ones that come out of hours. In a high-functioning organization that's structured for long-term delivery success, there will be a bunch of support professionals. I think that's maybe the historical industry roll-up title for those people. There are plenty of modern ways of phrasing it, but they might have a career in that. They may never have done any of the other roles within a large IT organization; they may have focused on that one.

Level 1, depending on who you are, is bots or account representatives and things like that. That's the first alerting system that something could be wrong that hasn't involved a human. Triage would go through level 2. Then, if it's midnight and we need to call somebody else, it'd be somebody from level 2 that does it, hopefully, in this equitable setup for how we do support, rather than the Slack-ops or WhatsApp-centric way of doing it.

If we dwell on level 2 a little bit more, we could say level 2 staff respond to the users, and we don't always know who the users are. Depending on the system, they can be internal or external. The users could also be companies rather than individual humans. Level 2 is awake and alert, hopefully, at any time an incident is happening, whereas the developer, who would be level 3, might be happily asleep. They draw on a body of knowledge, and they can resolve issues themselves. The best-case scenario is a team that's been skilled up with, if need be, SQL server access on some temporary-password basis and enough SQL skills to run known remedies for frequently or infrequently appearing incidents before they've been fixed properly by developers. They have their own systems, their own software, software that we don't use. If we live in Git and IntelliJ or another JetBrains product and Jira, they don't live in those tools. They might subsequently slot something into Jira, but they don't live in those tools for their operation. One of the human aspects that maybe we have all institutionalized now, in whichever part of the IT world we're in, is that we have respect for the people that feed us stuff from the support team, and they hopefully have respect for us as they feed us things. It's not always the case, but we should have it if it's working well.

Contrasting level 2 and level 3: level 2 is 24/7. Maybe if you're a business selling to businesses, it's five days a week. If Singapore is one of your marketplaces and you're sitting in Hawaii, then maybe it's six days a week, because Saturday or Sunday is debatable depending on which side of the International Date Line you are, and maybe your hours are not just regular New York hours, maybe it's a long day. Depending on the nature of your company, even its age a bit, you might actually decide not to do 24/7 when you should, because you're actually a startup and there are only eight developers. You just can't support it, so you're going to take a risk that the smaller number of users who are awake at midnight using your booking application and might encounter something are going to be ok, during your startup phase, with nothing being done about it until morning, including downtime, because you only have seven developers. If you're hockey-sticking, then you shouldn't be in that situation. Everything about the growth of a startup through scale-up is acquiring the personnel and their practices on a just-in-time basis for when they're needed. Maybe part of what I'm trying to tell you today is that sometimes the support might have to come a little bit earlier if we want to do it right.

Level 2 has run books. That's a very historical term for a body of knowledge that is curated over time: growing and trimmed, tweaked, fixed, eliminated. It could be a Word document, and it talks about what we do when this particular incident happens. Part of the onboarding for level 2 staff would be to train through that. They have investigative powers. They take tracker issues, and then they might slot some of those into a backlog tool. They might have a historical tool, which will be a trouble-ticket system, and devs might prefer to work out of Jira or Trello. In some agile teams, there'll be transcription: someone in a project manager role or scrum master role will be copy-and-pasting stuff from the tracker tool into Jira, and I personally don't think that's right. I think, as a dev team, we should be sophisticated enough to look at two queues and work out what we should do in a week. It seems, to me, that transcription from one system of record to another is just a total waste.

The level 2 team can actually have toggles created for them. If you're an airline and you're also renting cars as a secondary funding stream, you could have partnerships with Hertz, Avis, and other rent-a-car companies, but one of those is going down. Their service has gone down for a little period of time, and it's affecting the page that would render the aggregation of all of those hire-a-car results. You want to empower your ops team to flip a toggle without asking or calling anyone in executive management or anyone in level 3 support. They can flip a toggle and maybe just send an email saying, "We turned Hertz off." Then maybe, before everyone gets out of bed and comes in to work, Hertz is turned on again because their own ops team has fixed their issue. That would be a classic case for a toggle that you've configured to work at run-time. As developers, maybe we understand that toggles have a lot of uses, but that was a toggle that we put not into the end-user application, but into the admin console for the running stack to toggle something off. In the run book, we told the operations team when they could make that decision without asking anyone in leadership.
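To make that concrete, here is a minimal sketch of the kind of run-time toggle being described. The class and method names are hypothetical, not something the talk prescribes: the admin console calls setEnabled, and the aggregation page only fans out to partners that are still switched on.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Hypothetical sketch of a per-partner "kill switch" that the level 2/ops
// team can flip from an admin console at run-time, with no redeploy.
public class PartnerToggles {

    private final Map<String, Boolean> enabled = new ConcurrentHashMap<>();

    public PartnerToggles(List<String> partners) {
        partners.forEach(p -> enabled.put(p, true)); // everything on by default
    }

    // Called by the admin console, e.g. "We turned Hertz off"
    public void setEnabled(String partner, boolean on) {
        enabled.put(partner, on);
    }

    // Called by the aggregation page on every request
    public List<String> activePartners() {
        return enabled.entrySet().stream()
                .filter(Map.Entry::getValue)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        PartnerToggles toggles = new PartnerToggles(List.of("Hertz", "Avis", "Other"));
        toggles.setEnabled("Hertz", false);           // partner's service has gone down
        System.out.println(toggles.activePartners()); // aggregation now skips Hertz
    }
}
```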

Level 3, hopefully, does working-hours stuff only. That would mean your first priority, if you're first in on any morning, would be to look at the queue of things that need attention, the ones we've taken a gamble on being able to wait four hours to be resolved. That's a nicer situation than a call through the night. We can help the ops team with their body of work, their run books. We can participate in those activities, sometimes called path to production, long before we're actually in production. We have interviews with the ops team that would support new applications that are going to slot into the existing array of applications: what would it be like to support them, what would they need, and what can we do for them now, as developers, to make their operability better? We can, as mentioned earlier, take work directly from the tracker if we need to work on it. If that's been settled on as the mechanism for giving stuff to developers, then we should stick to it and hold on to it, because if we stop attending to the tracker, then some project manager starts copy-and-pasting into Jira, and now we're in that wasteful place, with double references, because you've got to point the Jira at the tracker ticket number, and you've got to point the ticket at the Jira. Those two tools don't necessarily have hyperlinks to each other, and aren't single sign-on, and all sorts of corporate problems like that. We end up with a more supportable system if we've demarcated correctly between level 2 and level 3.

B2B

This differs between types of companies. Cassie passed me a note saying, "Make sure that you differentiate between the types of business." In B2B, you maybe don't have end users that you focus on most. You mostly focus on their account manager or their partner, who would call you and operate as maybe level 1 in some regard. You have your account manager who deals with their account manager, and either one of them could be in a persuasive moment with you that you should drop everything to work on an issue, especially if it's a whale customer. The mechanisms of funding for how a company acquires customers who are other businesses are curious.

There's a division between opex, which is how we pay for all the regular developers making the base platform, and then there's an onboarding effort, sometimes called pro services, and the funding streams for that are quite often chargebacks where it counts. If the company is big enough, if they're a whale, you'll just comp all of the customizations they want for your stack and say, "We're happy to have you as a customer." If they're small, you'll make them pay to play. All of that, the nature of their onboarding, the size of their account within your larger corporate deliverables, drives your attention to them in a support cycle. Maybe the minnows who pay to play are going to be attended to at 9 a.m. when the first developers come in to work and have a look at something, presuming there's nothing else higher on the priority list. If it was a whale, you might be up at 3 a.m., and you might not get to bed again until 7 a.m., and you might not come in that day. The criticality is sometimes driven by the financial importance of the client to the company.

B2C

B2C is not that, right? It's people with phones. In this age, hockey sticks can happen whilst you're asleep. You went to bed with 10,000 customers, you woke up with 100,000. It can be driven by news cycles or just word of mouth if your application is good enough. There are renames that happen within the support world for what the support engineers are called, even how engineering-focused they are versus how support-focused they are. I think one of the latest renames is to call this CX, customer experience. It's confusing to us developers, because we keep thinking of UX for user experience, and then we're all confused because our little reptile minds can't hold too much at the same time. Either way, CX seems to be sticking, and it's suddenly a field of science. There's a whole bunch of training around it, a whole bunch of expertise that developers can't match but can learn.

Platforms change too when you're dealing with maybe millions of customers. You might, as a startup, be speaking to the likes of Salesforce to deliver some of your customer support operation, and you might be dealing with having to customize Salesforce to your needs as you integrate it. There are 20 or 30 different companies you could bring into your startup, and they're going to be quite cheap for you to consider at the outset. As a truly sophisticated company, I think Gartner wrote this, if you're at the top of your vertical, you're probably going to write all your own software. That may have been true 10 years ago, but now, in the era of Mongo and a whole bunch of super-advanced databases, maybe you're not writing your own software anymore. You'd have to be Google to do that. Either way, support changes when there are multiple channels: the phone you have has a little rectangle or a button or a hamburger fly-out thing that integrates some rectangles that came from that vendor, from Salesforce or another, or it could be the classic "I'm calling a call center support line," or, for us devs, we're also going to look for an email way of reaching the support agency for the piece we're complaining about. We want to do that asynchronously and get on to something else. It does change whether you're B2B or B2C.

Systems

Systems, other than Salesforce and the like, would classically be PagerDuty, and there are about a dozen variants of companies that compete at this level. They can strike you as quite expensive, but they're not when you're a startup. They would worry you if you had 10,000 people that you'd engineered into PagerDuty and were paying a monthly fee for, with only modest breaks for volume, and maybe then you'd think about your own or some other solution, or changing vendor if you'd gotten into bed with the wrong one and it was going to be costly later. You don't care as a startup. You never try to save the CEO's money by writing something yourself. You're always trying to get to the goal line quicker and spend as much of the CEO's money as you can and have them tell you no, rather than assume the no was going to come.

I foolishly, when I was a director of engineering, went and wrote my own extension to Confluence, and Confluence doesn't have extensions. There I am, hacking this joyfully after hours, my wife complaining that I should have been paying more attention. I made this JavaScript thing over several pages, and it would load an adjacent page and then read data from the HTML, treating that as a table. I was dropping query functions in it, and it would go and render something on a just-in-time basis. I have screenshots of it, so I should do a blog entry. I went and rolled it out, and nobody used it at all. It's like, "Yes, that was just a blog post, wasn't it?" The clue for me was that I was actually banned from developing on the first day of being hired as director. They hadn't told me this in the interview, or I might not have taken the job, but it was probably the right decision. I would do the same now to somebody else if I was in the same position.

Do Not Disturb

One of your problems in any support cycle, and it doesn't matter whether it's Slack ops, WhatsApp, PagerDuty, or Salesforce plugins, is that a developer is going to have a phone, Android or iPhone, and they all have the same feature, "do not disturb," which allows you to sleep. It doesn't seem unreasonable to me that we allow developers to have that, even on the night they're on-call, if there's a rotation; the previous slide had an implicit rotation. Whatever tool we're using, even if that's a human process in your first month of operation, where you've got a person running a spreadsheet of who's on-call when an incident happens: ok, what's their phone number? Who has their phone number? Then you call their phone. Be aware that they may not want to pick up. If criticality demands it, call again within 30 seconds. At some moment, after one, two, or three calls, you'll bust through the "do not disturb" feature of the iPhone and the Android. Then, hopefully, [inaudible 00:24:14] and they can pick it up and say, "What's up," stifling a yawn, "Where's my coffee?" Then you're on it. Then you're back into Slack, where the incident has been running for five minutes now, or in email, or there's a bridge line and you're now going to be speaking to a bunch of people in your company, and you're the sleepy developer. It seems fair that we allow people to utilize their "do not disturb" to ease their work-life balance, even when they know they're on-call.
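The "call again within 30 seconds" escalation might look roughly like this in code. It's a hedged sketch only; PagerDuty and its competitors implement this for you, and the Pager interface and names here are invented for illustration.

```java
import java.util.List;

// Hypothetical sketch of the repeat-call escalation described above.
// Two or three calls in quick succession are what punch through "do not disturb";
// after that, move on to the next person in the rotation.
public class OnCallEscalation {

    interface Pager {                  // invented for illustration
        boolean call(String engineer); // true if the engineer picked up
    }

    static boolean page(List<String> rotation, Pager pager, long retryDelayMillis)
            throws InterruptedException {
        for (String engineer : rotation) {
            for (int attempt = 1; attempt <= 3; attempt++) {
                if (pager.call(engineer)) {
                    return true;                 // someone is awake and on the incident
                }
                Thread.sleep(retryDelayMillis);  // "call again within 30 seconds"
            }
        }
        return false;                            // nobody answered; escalate further
    }

    public static void main(String[] args) throws InterruptedException {
        // Toy run: nobody picks up, so the whole rotation is cycled and we get false.
        boolean handled = page(List.of("alice", "bob"), engineer -> false, 10);
        System.out.println("Incident acknowledged: " + handled);
    }
}
```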

Preparedness

We need a rotation. You need to do this early, perhaps. If you hockey-stick overnight and suddenly the incidents have gone up from one every three days to four a night, you're in a position where you're pushing people immediately to the burnout place. There aren't many things you need to do in dev before your hockey stick, but this is one of them. I think if you're in a PHP stack and you're about to hockey-stick, you probably should have solved your PHP problem before you have hockey sticks. Although Facebook didn't, they just made the PHP solution work.

There's another idea of a rotation from dev into the support team, or the nascent support team. You second somebody for a month to go and sit with them and be part of their team: "No, you won't have to be awake through the night shift. You can do it through the day shift, because it's three shifts. You're going to listen to every incident, you're going to help work out where the systems are deficient and the bits that dev could do for them," or they really are just going to learn the way they do stuff and be able to do it as one of them. On any rotation basis, and XP does this already with QA automators and devs, when you're pairing one with the other, the distinction is somewhat blurred, but we're trying to apply the same idea between dev and support, especially if they're local. Can we slot into their team for a month to see how they work? Can we improve the way the system performs for them? There are times that I've seen this done, and it's not me who's done it, it's me observing really successful companies operating this way. The times I've seen it done, it really works. It was called a SWAT team when I saw it, and it was super impressive, almost to the point where somebody could say, "I don't want to rotate back to being a developer. I want to stay in this team." In our quiet moments, we're doing fire drills, all the stuff that Netflix was doing with yanking cables and sexy open-source product names for the same thing, the chaos engineering; this company was doing it too and just calling it a fire drill. There's plenty of fun to be had. Somebody that previously thought development was their career track can take a year off, be in the support world, and still do just fine.

Measuring

We all know about blameless postmortems, hence the picture on the right. It's the best I could find. We want to do things blamelessly, because if we don't, habits change. In any part of your life, if somebody in a position of authority yells at you, even once, you modify your behavior for every subsequent meeting with that person to avoid being yelled at. All of that defensive behavior is just padding. In the history of agile, we'd ask, "How long is that going to take?" You'd say, "Eight weeks," and you'd think, "I think it's only a five-minute job," because everyone you asked padded it by 50%, and then it got all the way to the CTO and suddenly it's a month's work. We pad stuff in order to be defensive so that we don't get yelled at again. If we don't make it safe for people to fail, and that includes during postmortems and blameless autopsies, if we don't keep it safe for people to fail or for systems to be broken and for humans to subsequently resolve it, we change the behaviors of the team toward being persistently malfunctioning. There's some folklore about monkeys on a ladder, some bananas, and a firehose, which is a great story, but it turns out not to be true.

We should maybe think about the run books as living documents rather than Word files that are emailed around. That's the worst. Everything that's emailed as an attachment automatically has 100 versions, and nobody is sure which is the most current or most valuable, or the definition of value changes based on who you ask. Can we have a system of record that actually allows for updates with a bit of an audit trail? That could be Confluence. It seems to be one of the Atlassian products where I get what it does. We want to maybe audit the operations too, not just the curation of the run books. If the operations team has temporarily gained access to production to go and do a data fix with a SQL update statement, how do we audit that? Is that in a place that we can turn to auditors later, if we have them, and say, "That's what happened here with the permissions around that"?
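A sketch of that "auditable data fix" idea could be as simple as an append-only log of who ran what, under which temporary grant. The class, file, and grant names here are hypothetical, not anything the talk prescribes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.Instant;

// Hypothetical append-only audit trail for temporary production access.
// Every grant and every statement run under it gets a line we can show an auditor later.
public class ProductionAuditLog {

    private final Path logFile;

    public ProductionAuditLog(Path logFile) {
        this.logFile = logFile;
    }

    public void record(String operator, String grantId, String action) throws IOException {
        String line = String.join("\t", Instant.now().toString(), operator, grantId, action)
                + System.lineSeparator();
        // Append-only: we never rewrite history, only add to it.
        Files.writeString(logFile, line, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        ProductionAuditLog log = new ProductionAuditLog(Path.of("prod-audit.log"));
        log.record("level2-oncall", "TEMP-GRANT-42",
                "UPDATE bookings SET status='CANCELLED' WHERE id=12345");
    }
}
```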

We might actually, as a software delivery team, be writing systems that are custom in-house, not from Salesforce land, that do parts of our auditing cycle to allow us to stand in front of an auditor in the future, which could be Ernst & Young, or allow some member of our team to stand in front of a judge and jury in the case of stolen assets or something like that and say, "We did our best, and we had a provable trail." Then we measure everything about it, because it could inform how we tweak everything to be better the next time. We stick with the positivity, avoid the negativity around incidents, keep it calm and rational, have somebody run the incident, and then close it off. Pat was mentioning in the keynote that plenty of things can just be forgotten out of retrospectives. The same is true of blameless postmortems. We have to make sure that we can work forwards with the stuff that we take out of the back of that and action it. Maybe that becomes a feature in the planned backlog rather than something that's slotted into or remains in the unplanned backlog, especially if it feels like work that saves future incidents rather than work to solve a specific incident.

Staffing

You're a startup or a scale-up, and you're wondering, "How are we going to do this with seven developers who are covering front end and back end, where one of them is doubling as scrum master and there are no QAs?" Not to mention any names, in particular. The answer is you could borrow this as an elastic resource from a company that offers support on this basis using your tools while supporting other companies as well. They can have a follow-the-sun model, meaning whichever eight-hour segment of the day you're particularly worried about is daylight for them. When they pick up the phone, they answer with your company's name, even though on the next phone call they might say, "We're Cisco," and then, 10 minutes later, they're Oracle. That could get you through a tricky period where you don't want to take people out of the highly trained and aligned dev force.

You don't want to diminish that, but you need somebody to pick this up so that we're not all called through the night and there are people who can solve incidents. You may not have a multiyear relationship with that company, though maybe they hope you will, but it's ok to have some company fulfilling this function for six months until you are hockey-sticking and your funding levels are increasing because money is coming in through rounds of funding from venture capitalists or others. Then, suddenly, your buying power for additional staff has increased, and you could maybe think about hiring the first people to lead that group, or to replace that group and count them as in-house. If you're sophisticated enough, you will run it in-house in time, according to that Gartner report. Then you don't have to do 24/7 straightaway. You could do 18-hour days, 5 days a week, and count that as just peachy even if your usage cycles are longer than that. It's a gamble. A lot of what we do in development is a gamble if you're chasing a delivery date with scant resource.

Success

Success is none of your staff quitting, really. That has multiple axes. There's not quitting because they felt too much pressure, but there's another not quitting, which is the people who were there at the beginning that sailed with you all the way through to your acquisition or your flotation. The team photo is interesting: the one taken on the first day, with the founding team, and the one with the people who made it through to flotation don't always have the same people in them. Maybe only one or two, and it's the CEO. Most of the tech staff have been rotated out, hopefully keeping their options. It's interesting: if you make an energetic enough company where the software is so compelling, and we've heard speakers already in this track talk about how great their companies are, and it's very believable, that every aspect of it, development, testing, which is downstream from development, and upstream, the support aspect, none of those are odious or cancerous, then you're more likely to stay the course. People actually look for reasons to leave a company. If none presents itself, then the easiest thing for you to do is just remain in that company, provided they are meeting all your compensation asks, your, "Hey, I've grown with you. I'm no longer the post boy. Can I now be director of engineering?" Maybe the answer can be yes in many companies.

Success is multifaceted. Your staff not quitting is one facet. Success is also maybe secondarily measured, and I've seen this in companies, again, not something I personally engineered but something I've witnessed as a major triumph, by a support engineer sitting amongst the developers and being treated as an equal. The more separation you have, the further away they are, the easier it is for you to have some animosity toward them. I started coding professionally in the '80s, and you would hear conversations where somebody slammed down the phone and then, Brits at least, gave expletive-driven accounts of what happened, calling somebody's parentage somewhat questionable. That's normal. No, it wasn't. That was nasty. If that person knew you were speaking about them like that, they probably wouldn't pick up the phone to you again. Guess what, when they slam the phone down on the ops side, they're probably yelling at you about your ... In the early '90s, we were moaning about only having 4 megs of memory in a 286 PC or something, and we wanted 8 megs of memory. Anyway, things change. Those problems were solved.

If we've moved to a situation where, between the people doing development and the people doing support, there are friendships, you're linked in to these people, you look forward to working with them again, and there's rotation between both of those groups, then the social functions aren't just staged. They're genuine: "Come along," or "We'll wait for you to fix Velma's RAM." We can be in a position where we're friends with these people, and that's what we should engineer within startups and scale-ups, even if it's a vendor. If it's a software supplier vendor, we should hope to have the same relationship with them as we would have with our own staff. It shouldn't change. We want it to be fair for them too. If we outsource all sorts of support and we're just keeping dev in-house, we don't want them hurting like our own level 3 staff would hurt in the same situation. Success is multifaceted, and I think it's certainly attainable. I've seen it done a few times, and I think it's very possible for anyone to engineer that for the startup they're at least influencing during its initial period of time.

That was my last slide, and I'm not sure; Cassie wasn't giving me any "move forward" signals, so I may be ahead of schedule. I think the answer is yes, we should support it if we build it, as long as we do that fairly. Any questions?

Questions and Answers

Shum: Actually, one of my questions is around the blameless postmortems. What if you're part of an organization that doesn't have that trust? How do you introduce that?

Hammant: Yes. That's changing culture, isn't it?

Shum: Yes.

Hammant: I mean, it's very difficult. If you can get enough pats, and we have plenty of friends who could do this, somebody who can just watch the postmortem go through and then pass a few comments very constructively, then maybe you can move them a little bit toward a better place. If they're already doing nasty postmortems, the likelihood is that there's a problem in the culture. If you have a bad message to give to somebody in a position of power, they're not going to take it. They're going to reject you. One of your problems is that organizations that have gone bad are very difficult to drag back to good. If you start good, let's say, and I obsess about build times, at the startup I was involved with, the first build time was 40 seconds, including tests. I made an announcement: "This is ours to lose now. We stick with this and the CEO never yells at us. If we get worse, then we get into a perilous place." There are some things you can do from the larger agile, people-friendly way of doing things that are easy if you're one of the first people in the company and the business is a good sponsor for this, because actually their goodwill is required for you to make any changes to the way we work. There are times when it's going to happen quite easily, just from the setup moment. If the corporate culture is already bad, it's going to be very difficult for you. It's going to be a collective effort, or it might require a change of the guard at the top, like a new CTO comes in and says, "Hey, I'm from Netflix. Let's do this the Netflix way." Then you'll go, "Phew." There's a chance, at least within the 100-day rule, that that CTO might bring in some changes. Any other questions?

Participant 1: It's a challenging topic, and some of your ideas I can relate to. One of the things that I wonder is, by doing a level 2 or level 3, aren't you transferring the problem to level 2?

Hammant: That's really good, and I should have worked it into the talk. There's an unworded problem within our industry around risk and responsibility. There's the business, who has a need for you to do something very cheaply and quickly and perfectly. Then there's you as a deliverer of that. Sometimes the business asks you to take responsibility for doing something, and maybe change the way of doing it, but they don't necessarily take back an equal measure of risk. Quite often this is an exchange of risk and responsibility, but the business is asking you to take both. It takes some maturity to realize that and say, "Hey." Maybe the premise was there: "We build it, we run it. I'm taking risk and responsibility." I think you have to make sure that any change within an organization to start considering level 2 support, and the need for it, is part of a larger exchange of risk and responsibility. You can do many exchanges of those, but you shouldn't have one side of that arrangement take both. You should be able to call it out safely and say, "Hey, I seem to have both risk and responsibility here, which means I'm going to be the hero for a few minutes, and then I'm going to mess up one day and be yelled at." You don't want that.

Participant 2: A question around support and getting escalated issues, not out of hours but during development hours. You're in a development team, you have maybe a sprint goal that you're trying to hit, you have a standup where you said, "I'm going to get this done by the next standup," and then something comes up, or you are watching the dashboard of production issues at the same time as you're trying to do development. You're expected to do both at the same time.

Hammant: No, that's risk and responsibility again. If you're in working hours, you have a scrum master, scrum mistress, project manager, or coach present whose job is to watch the queues. Your job is to keep your headphones on if you're [inaudible 00:41:49], or at your wide bench for you and your [inaudible 00:41:53] if you've got two screens, two mice, one CPU, and you just carry on with your duty until somebody comes up and says, "Hey, when you finish that, can you look at Jira 123?" That was the backlog tool. "Could you look at this trouble ticket?" You don't, as a developer in working hours, keep half an eye on the incident channel.

Participant 2: That would be great, but sometimes you have to rotate that, don't you? The teams will rotate.

Hammant: Your team might have decided that, because the project manager said, "I can't do this because I've got meetings." You might take a rotating duty, yes. One of you, out of the eight of you, is looking at those queues. Somebody else can answer the question here.

Participant 3: Maybe I can help with this. Something we do is that we have something called an interrupt role in the team. That's the person that can be interrupted. If there is an incident during hours, then that's their priority; their usual work is not feature work. Let's say they're looking at tech debt or incoming bugs and so on, or build failures and so on. If there's an incident, that's their priority. They're not part of the capacity planning for that sprint, and it's a rotating role.

Hammant: XP had the same, didn't it? It used to have a bug pair.

Participant 4: My question is about level 3. I'm not completely clear about how you convince the team to do level 3 support, because that also involves being on standby, right?

Hammant: On-call, yes.

Participant 4: That also affects your private life.

Hammant: Yes. If it was fair, you were told in the job interview that there would be elements of support. If it wasn't fair, you were surprised six months in that we've just invented this role called level 3 support. It should be something that's eyes-open, and in your partnership with your wife or husband and your consideration of the role being offered, you should factor it in. If you're in an interview cycle, you could ask to speak to somebody who currently does support, to ask them what that's like. In every interview you do, in every round of every interview, you should turn the tables halfway through some portion and at least get your own questions answered: what does support look like, especially if you've never done it before? I mean, it is troublesome to be pinged. I was director of engineering and had to be pinged on everyone else's support calls at one stage. The bigger my dev team gets, the more calls I get if I'm overseeing every incident. Ok, I was a younger man then.

Participant 4: In your view, that's also a rotation.

Hammant: If it's fair and it's a rotation, then we work out what we need to do there. If we're accumulating tons of defects and the support incidents around one particular piece of your application are going up and up and up, we have to have a conversation with the business about slowing down the rate of functional deliverables to attend to something that clearly is tech debt, in order to make the software more robust in that regard. If the business isn't doing that, they're in that same place of giving you risk and responsibility without it being a fair exchange. Even if they're not explicitly allowing us to, the best teams are going to attend to that tech debt as they accumulate it, meaning at the finish of any one week we don't have any tech debt. We've adjusted every story estimate, if we're still estimating, to include the remediation of the tech debt as it accumulates, even if it's a surprise. One of the things we do around estimating, for points or gummy bears or t-shirts, whatever your estimating model is, is to try to think of the average time it takes to do a thing. Sometimes you come in quicker and sometimes you come in slower because there's some tech debt. If we're ok with the business and they're not in that shouty place, then they don't mind if some stories take longer than others, and if the amount of story points done in a week, or whatever its rational length is, roughly matches the expectation after a few weeks, they're ok with that. We should be attending to tech debt, and that does include making the software more robust in production so we don't get caught out. Speak to your boss about fairness on tech debt, assuming you're not the boss. You're the boss? Ok. Bad news: you're going to have to accept less functional delivery from your developers.

Participant 5: What advice would you give to people picking a level 2 support partner? What would you do to make sure they are set up for success?

Hammant: There are hundreds, honestly. You sit there and you just want to choose one, because you've got other things to get on with. You go, "Ok, which ones eliminated themselves? Ok, who else is in the mix?" You find Salesforce is gone and these other five are here. Depending on the market it changes too, because not everyone does the U.K. or South America. His problems must have been super crazy. You find a partner; you whittle them down to three, you have some selection criteria, you have an objective interview. You'll ask them things like, "What is your technology?" You assess them as if they're a long-term partner, and you'll make critical judgments about their service too. They could have people that are available, but their software crashed, and you go, "Great, so their downtime is affecting our downtime." Provided they pass the beauty contest, the only thing that remains... That process being a beauty contest, I have a blog entry on that one, not for support, for something else. Look it up; it's called "Like a Used Sofa".

I think the one remaining thing that goes wrong here is how we integrate them. They have to stand up software, and we have to stand up software. If we're honest, we have a dev environment, and then, maybe not yet, but soon, we might have a QA environment, which is deployed to less frequently than the dev environment. I don't mean my personal dev environment, I mean the shared dev environment, after I've committed and after CI has said it was good. If I'm sufficiently mature, I might have a UAT environment, and I might have a [inaudible 00:48:21], but assume I don't. I just have dev and QA and production. I want to stand up something with them that we will both agree is dev. Separately, I want to stand up something that's another supportable environment. For me, I'll call it QA, and maybe it has hardcoded users in it [inaudible 00:48:38], but I want its provision to be separate from the dev provision that my devs are actively coding against. I don't want the data to be mixed on their side, because it's not mixed on my side.

When I go live, I want them to stand up one more, totally separate from the other two, nothing mixed, no shared configuration, no shared users, and I want that one to be live and untangled from any of the other issues that I might have been testing. As I bring up a support capacity that involves my own devs, I want environment separation. "Environment" is a canonical name. Most of the time, these partners are just appalling. They don't know what CI is, let alone Jenkins. They only have one environment, they'll call it Sandbox, so your QA and your dev have two feet in one shoe and are mingling there. Then somebody changes configuration for dev, which reconfigures QA, and you don't know if we're still playing with that. What have you done? You've brought that down. They're all appalling.

As it turns out, I'm in charge of a new tool called Servirtium, which is of the type of technology called service virtualization. One of the things we're using it for in a startup is to record the interactions with the service stack and then play them back in a CI loop, so that the build is always green, always passes, and is isolated from the supplier's sandbox environment, which is up and down like a ping-pong ball and nowhere near the capacity of production, nowhere near the response times of production. You maybe have to employ some dev tricks to make the supplier's stack more reliable.
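The record-and-playback idea goes roughly like this. This is a hand-rolled sketch of the concept, not Servirtium's actual API; the class and method names are invented. You record the supplier's responses once against their sandbox, persist them, and have CI replay them so the build stays green even when the sandbox is down.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Invented sketch of the service-virtualization idea: record real supplier responses
// once, then replay them deterministically in CI instead of hitting the flaky sandbox.
public class RecordingStub {

    private final Map<String, String> recordings = new HashMap<>();
    private final HttpClient http = HttpClient.newHttpClient();

    // "Record" mode: call the real sandbox and remember what it said.
    public String record(String url) throws IOException, InterruptedException {
        HttpResponse<String> response = http.send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        recordings.put(url, response.body());
        return response.body();
    }

    // "Replay" mode: CI never leaves the building; the answer is always the same.
    public String replay(String url) {
        String body = recordings.get(url);
        if (body == null) {
            throw new IllegalStateException("No recording for " + url);
        }
        return body;
    }

    // Persist recordings so the CI job can load them without network access.
    public void save(Path dir) throws IOException {
        Files.createDirectories(dir);
        for (Map.Entry<String, String> e : recordings.entrySet()) {
            Files.writeString(dir.resolve(Integer.toHexString(e.getKey().hashCode()) + ".txt"),
                    e.getValue());
        }
    }
}
```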

 


 

Recorded at:

May 25, 2020
