
DevOps for the Database

Summary

Baron Schwartz explores real-life stories that answer two questions: “Why is it hard to apply DevOps principles and practices to databases, and how can we get better at it?” He covers topics including: what the research shows about DevOps, databases, and company performance; current & emerging trends in building and managing data tiers; the traditional dedicated DBA role, and more.

Bio

Baron Schwartz is the CTO and founder of VividCortex. He has written a lot of open source software, and several books including High Performance MySQL. He’s focused his career on learning and teaching about scalability, performance, and observability of systems generally (including the view that teams are systems and culture influences their performance), and databases specifically.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

First, let me talk a little bit about my background and where I'm coming from on this. So, I've been working with databases for a couple of decades now, which is terrifying to think about: first as a developer for a handful of years and then as a consultant. And now that I've revealed that I'm a consultant, I'm officially disqualified from ever speaking at QCon again, because we prefer practitioners over consultants. But I'm not a consultant anymore. Now I'm a vendor. So, bring the tomatoes. I was at Percona, which now is more than MySQL, but in those days it was MySQL consulting for people who were trying to scale MySQL in the web 2.0 era, and I was there for about five years and solved a lot of performance problems.

I wrote a couple of books, a few books. I wrote a lot of software, most of it open-sourced, most of it around database performance, monitoring, database operations, things like that. And then I founded VividCortex, which is a database monitoring company, very focused on performance, and that's been six years. And during this time I've been a DBA, I've been a developer, I have gotten up in the middle of the night and fixed things. I've gotten up in the middle of the night and had a hard time fixing things. I've lived the life. I've also gotten the benefit, during the consulting and with customers at VividCortex, of lots of other people's experience in seeing things that work and things that don't. And I'm very curious and so I often ask why? What was it that worked there and what was it that worked here? And I've tried a bunch of these things myself.

As you can probably guess, the things that I'm going to talk about today, DevOps for the database, well, you should guess, I guess. I haven't successfully implemented all of these things. I'm a knucklehead, I've been successful with some things, I've made a lot of mistakes, still working on some other things. But this is the combined experiences of the last 20 years or so of a lot of different folks, the patterns that I've drawn to try and figure out what works and what doesn't.

Now, in contrast to Jez and Nicole's keynote this morning, this is not research, this is not science. This is opinion. I did tweet out a survey, and more than 50 people filled it out, and they contributed to these slides. They helped clarify some things, they added nuances, and I've actually included some of the things that people said in the surveys, some anecdotes tying together some concepts. But again, with that survey, I have no training on how to create a proper survey. That was just me putting down five or so questions and asking people to fill them out. So, no science, no research, no validity, just one guy's opinions. Lots of personal experience though.

Three Database DevOps Stories

So, to anchor this, I want to talk about a few different database stories that I've seen, either firsthand or at very close range, and what's common and what's different between them. So, in the first one, there's a company that's growing incredibly fast and is making money hand over fist. They're in a highly-regulated industry, their product is super boring, their team is growing so fast that they're throwing money at problems. Their customer base is growing so fast that they're just buying more hardware. They had one DBA and a lot of developers, and then a lot of developers became really a lot of developers in very short order. And the DBA was moved out of an individual contributor role into a management role, where her job was not to run the databases, but to hire a team of a handful of developers to run the databases.

So, I'm sure several of us have seen or done this transition from running things to running a team who runs things. Essentially, their strategy there was to multiply the database administration team in proportion to the development team. And a lot of things had to be deprioritized along the way. Some things that made me a little bit sad, some things that made me really happy. What I observed though was that it was not a fundamental change. It was just a multiplication. We'd been doing this before, we're going to do five to eight times more of that in the future. And I did not observe that quality was getting better. I heard that there were a lot of outages and incidents and a lot of problems, that there was a lot of pressure on the DBA team to keep up with the development team's workload. So this was a traditional DBA kind of a culture, and I'll talk a little bit more about what that means later.

In the second case, there was a friend of mine who had joined, this is probably about five years ago now, one of Silicon Valley's hottest, fastest growing startups, which had exploded into an overnight success. And he joined as the sole DBA; there were about 20 developers. A short time later, we connected somehow, I forget how, and we were talking. And he said, "We're going from 20 to about 100 developers over the course of the next month or so. We're about halfway into that. We were 20 developers a month or so ago, and now we're approaching 50 and we'll be at 100 soon." And there was a real "move fast and break things" culture there. They may have actually had that printed out on posters on the walls. And the mandate was for developers to come in and ship code on day one. This was how you proved that you could keep up in this hyper-growth organization.

And as I talked to him, I said, "Well, what's that like for you?" And he said, "Well, my mandate is to keep the databases up and fast, but I have no control over what these developers are doing, and of course, they're super junior. It's day one, they're shipping stuff, and of course, it's causing outages." And I probably did a beard stroke thing even though I had no beard and said, "Well, how are you thinking you're going to handle this?" And he said, "Well, I'm not getting any more resources. They're refusing to hire anybody to help me. So, I have decided that I'm going to implement change control processes and the mandate is that I have to review anything before it goes into production and look at any query changes or schema changes or things like that." And I was like, "I'm your friend. I just have to tell you this is not going to work. You can't. You just can't. It's not physically possible. Even if you could keep up with 100 developers all just competing with each other to push code over the wall, you're still not going to solve the outages."

Another story is this relatively small team that was running a relatively large media organization, very well known. I won't say who it is. They definitely punched above their weight as far as efficiency of their infrastructure. You could compare them side by side with some of the names that we're all familiar with every day, and look at things like server to person ratio, and server to audience ratio and things like that. And they were a real outlier. And there were two operations folks there. One was classical operations, and one was database operations.

And, in this case, they happened to bring us in as a customer, and I thought, "Oh, great. Now, we're going to have two people using our software in this small team." I actually didn't know anything about the team beyond that. I just knew that there were two people who were interested in our product. They brought us in, and 17 developers overseas began using the software on a daily basis and the people who purchased us didn't use it very frequently at all, more like on a weekly basis. And when I noticed that, I reached out and said, "What's going on here? This is not what I expected. Can you tell me what you're doing?" And he's like, "Well, developers are smart, they can use a database monitoring product. I don't want to monitor the databases, we can't be the bottlenecks." And that was the first time that I saw in our business what we call the golden motion, which is the golden, the magical thing that makes things happen, turns customers into successes. Internally, we started trying to replicate that.

And part of this, what I'm talking about today comes from being really intentional over the last handful of years, four or five years, they were a super early customer, and trying to figure out what helps some customers broaden the database responsibility from just the DBA to the entire team. And in what cases does it actually not work? And you end up with a DBA who either won't or can't get developers involved with the database.

Benefits of Database DevOps

So, I didn't think of it at first as DevOps, but it really is. It's DevOps. The database is one of the last holdouts of DevOps, and over time it dawned on me that's what was going on. And so I started thinking about this in terms of parallels to the rest of DevOps. So, generally speaking, if you can apply DevOps principles, practices, cultures, habits and so forth to the database and the things immediately around the database, you get the same benefits that you get everywhere else in your code base and in your applications. And in particular, for me, cueing off of Nicole and Jez's work, I'd started talking about software delivery performance. It's a great, great term for that. It means not only do you have faster, better, and cheaper, but now you can have all three. So it's not "faster, better, cheaper, pick any two." It's "faster, better, cheaper, pick all three." And you could add a couple of things in there too, like make more money.

So, I think this is a fundamental mindset change; that iron triangle of faster, better, cheaper, you can only pick two, at least for me, with 20, 25 years of working in software, had become fundamentally beaten into me to the point where I never thought to question it anymore. And in the last 10 years, I've seen lots of existence proofs that that is actually a fallacy.

The other thing is you get a virtuous cycle of stability, leading to more speed, which leads to more stability, etc., which I've seen go the other way as well. Instability leads to slowing things down, which means that you can't fix things as fast and things get less stable. So I've seen it go both ways, and definitely DevOps is the winner there. I'll have some resources and links at the end, but you should totally get two things: you should totally get "Accelerate," the book by Nicole et al., and you should also get this year's State of DevOps report and read it.

Detriments of Lacking DevOps

And if you don't have DevOps, what do you have? In the database arena, particularly, you have a DBA, like my friend at the fast-growing startup, who is responsible for code that they can't control. Because a query is code. It's running in the database, and when you deploy your application, you're deploying queries into the database as well. And so that often leads to them trying to take control by putting in gatekeeping kinds of processes, which creates a dependency for developers, which offloads developer work onto the DBA, which diverts them from strategic activities like architecting the next version of the platform and helping developers design better applications and schema. Which is a shame, because now you have a highly skilled and highly knowledgeable database administrator who is doing unskilled dependency work for the developers and creating a bottleneck. And that's a real waste of a resource because frankly, we all can be good at databases, but we also need people who are experts at databases, as I'll argue a little bit later. And those folks who can really excel or who choose to really excel in those areas can be a little bit hard to find, so don't waste them.

So, engineering doesn't get the benefit of these DBAs helping them build better applications faster. And ultimately, you get that human in the feedback loop, and production feedback doesn't get back into development, which makes it even harder for developers to build things that will run well in production, and that limits how fast you can learn and fix and improve. And ultimately, developer productivity declines and they become dependent on DBAs to literally debug their application. I have seen many, many organizations and teams where really detailed information about what the database is doing, is actually a crucial signal for understanding what the app is doing. And when you look at it that way, in fact, I've seen many cases where we figured out application bugs by looking at database behavior, which really should be unacceptable, but it's actually pretty commonplace.

What Is Database DevOps?

So let's look at what DevOps means in the context of the database. And then from there, I want to go into what works or what I've seen work. Let me be really clear, I don't know that it works. I've just seen it work. And what doesn't, or what I have seen fail, and then some thoughts about the processes.

So, a few of the crucial aspects, what I believe are crucial, are, first of all, that developers have to own how their system operates against the database in production. So that means they have to own the schema, they have to own the performance. It's really critical to own the performance and the overall workload of the application against the database. They have to be able to debug and troubleshoot and triage and repair, as much as possible, their own database outages. And as developers, probably some of us maybe are cringing a little bit in fear, thinking databases are really, really hard. Well, it's true. They are really, really hard, but this is not to say that every developer has to be able to do this alone in the most obscure database outages at 3:00 a.m. on a Sunday without calling in reinforcements or something like that. We should be able to call in reinforcements, and it's really valuable to have somebody who understands how buffer pools and latches work. We don't all have to be able to do that, but we have to be able to understand that we're running a query very slowly or very frequently, or just some of the basics.

The schema and the data model are part of the codebase, and it's not a separate codebase, and they're also part of the deployment pipeline and, ideally, it is not a separate deployment pipeline. So the way that code gets to production is the same way that the database changes, schema changes and things like that get to production. And those things are version-controlled and they're deployed through the same tooling. Those schema migrations are ideally automated. Now, this is one of the hard and challenging things that is the last holdout, I would say, partly because of technical limitations, partly because it's a database, and data doesn't move. Data has inertia and resistance.
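To make the idea of version-controlled, automated schema migrations a little more concrete, here is a minimal sketch, not any particular tool's method. It assumes SQLite (via Python's standard `sqlite3` module) as a stand-in database; the `MIGRATIONS` list, the `schema_migrations` bookkeeping table, and the `apply_migrations` function are all hypothetical names introduced for illustration.

```python
import sqlite3

# Hypothetical migrations. In a real repository these would live as
# version-controlled files next to the application code.
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"),
    (2, "ALTER TABLE users ADD COLUMN created_at TEXT"),
]


def apply_migrations(conn, migrations=MIGRATIONS):
    """Apply pending migrations in version order; return the versions applied."""
    # Bookkeeping table (an assumed convention) recording what has already run.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (version INTEGER PRIMARY KEY)"
    )
    done = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}
    applied = []
    for version, sql in sorted(migrations):
        if version in done:
            continue  # already applied on a previous deploy
        conn.execute(sql)
        conn.execute("INSERT INTO schema_migrations (version) VALUES (?)", (version,))
        conn.commit()
        applied.append(version)
    return applied
```

Because the runner records what it has applied, the same deployment step can run on every release: a second call against the same database applies nothing. Production databases add concerns this sketch ignores, such as locking and long-running changes on large tables.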

And a couple of other, somewhat more optional things, which I've seen companies do very well both with and without, are automatically rebuilding pre-production environments that are somewhat like production, perhaps by restoring last night's backup once a day, which is also a really nice way to test that the backups are restorable. Hint, hint. If you're not doing that, you have Schrödinger's backups. And automation of database operations. If you're not using something like RDS, then you basically don't get to large scale without building your own internal data platform that has some of those attributes.
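The "restore last night's backup to prove it is restorable" idea can be sketched roughly as follows. This is an illustration under assumptions: SQLite stands in for the real database, the "backup" is simply another SQLite connection, and `verify_backup` and its checks are hypothetical, not a description of any specific backup tool.

```python
import sqlite3


def verify_backup(backup_conn, expected_tables):
    """Restore a backup into a scratch database and run basic sanity checks.

    Returns (passed, missing_tables).
    """
    scratch = sqlite3.connect(":memory:")
    # "Restore": copy the backup's contents into the scratch database.
    backup_conn.backup(scratch)
    # Check 1: the restored database passes SQLite's own integrity check.
    ok = scratch.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
    # Check 2: the tables the application depends on are actually present.
    tables = {
        row[0]
        for row in scratch.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    }
    missing = sorted(set(expected_tables) - tables)
    scratch.close()
    return ok and not missing, missing
```

Run once a day against the newest backup, a check like this turns "we think the backups work" into something observed; the restored copy can then double as the pre-production dataset.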

So, one thing to mention: nobody, or very few folks that I'm aware of, do all of these things, or certainly not perfectly, and I've put some of what I think are the most important ones first. And that's cool. You can do the things that are most important and valuable for you first. You can pick something. In fact, as I will argue, I think it's really rare to see somebody succeed in starting to adopt some of these things without picking something and doing it first, preferably something little, like just adding monitoring.

I want to tie in Charity Majors. She has a great quote. I'm going to mangle it because I don't know exactly what the quote is. But she talks about the first age of DevOps, which is when operations people started to do development, infrastructure as code. So now, infrastructure is defined in Chef and Puppet and so forth. Whereas before it was everything done at the command line and the automation was you pushing keys on the keyboard. So the first age of DevOps was when we started to describe our infrastructure as code. And she says, "The second age of DevOps is here when developers need to be able to do operations." And I 100% agree with that. Especially in the database arena. And I think that's actually a really important distinction, and it's definitely time for that transformation.

So to quote a little bit from this year's State of DevOps report, on page 57, I think actually maybe there are two slides, but at least this one. I've highlighted some stuff in bold that I think is really important. It echoes some of the same themes: integrating database work into the software delivery process. So, for example, automating it with the delivery pipeline, database changes as scripts in version control, managing them in the same way as other production application deployments and so forth. And these teams discuss, so the communication and collaboration is a really big part of this. I put lots of dot, dot, dots where I've shortened this to fit onto a slide, but go grab the full report and read it. There's so much value there.

Bringing DevOps to the Database

So let's talk about how I have seen people try and succeed in bringing DevOps to the database. There are four core elements. I haven't pulled this from anywhere, but it probably resonates and feels pretty familiar with things like the four major attributes of DevOps, or some of the acronyms that you've probably seen. I always think people first; however, I'm going to go the other direction, because I'm going to talk about what needs to happen in the organization and then I'm going to back that into what kinds of people you need to be able to execute on that. So: people, culture, the structure and the process of actually getting there, and tooling. I want to be clear, this is not a DevOps adoption project. It's more of a process, it's not done. It's continuous and ongoing, but I want to talk about some of the ways that I've seen people approach at least starting that change in their orgs.

Tooling

So I want to start with tooling, because tooling is the most concrete and easily grappled with of these things. So the first tooling that we need is deploy and release. We typically have continuous deployment. I hope everybody has continuous deployment for their code, their application code, and the build of artifacts. You need that for the database as well. You need to be able to do frequent and automated changes to the database. So this is a tall order. Sounds simple. Probably doesn't sound simple at all actually, to anybody who has done it, but it is really, really important, and you need to eliminate the manual interactions. If you are SSHing or connecting to the database and doing things like... My first major database operations stuff was in a Microsoft SQL Server shop, and at least half of the business logic was in stored procedures. At least. And deployment was copy-pasting: opening up the stored procedure on the development servers, copying the SQL, opening up on the production servers, pasting it in, and saving it. So that is the kind of manual toil that keeps you from working on the system, keeps you working in the system instead of improving the system itself.

It's actually relatively easy to see where manual work is. When it's automated away, it becomes invisible and becomes a little bit hard to say how much of your work have you automated, but it's really easy for folks to say how much work are we still doing manually, because it's typically a very painful process that somebody is explicitly conscious of. And so pay attention to those things that you're explicitly conscious of somebody doing manually. Have discussions and retrospectives and talk with other folks around the org, and you will often be very surprised at what it is that actually keeps the wheels on, and somebody is doing something manually on a continual basis and that needs to be automated.

The next part of tooling is monitoring and observability. If you can't see how your databases are running, you're not going to improve them. So, it's that simple. I don't think of monitoring and observability as the same things. I don't think that observability is just the new buzzword for monitoring. I consider observability to be an attribute or property of a system that allows it to be observed and the whole bunch of things that need to be true for that to happen. Monitoring is actually an activity, in my view, of continually either manually or automatically essentially testing a system for conditions that you have predetermined. Observability is much more about taking the telemetry that's instrumented out of the system into an analytics pipeline and then being able to ask ad hoc questions on the fly about how your systems are behaving. Put another way, monitoring tells you if your systems are broken; observability tells you why they're broken and how to fix them.
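The distinction between monitoring (testing for a predetermined condition) and observability (asking ad hoc questions of telemetry) can be illustrated with a small sketch. The event records, field names, threshold, and function names below are all made up for illustration; real telemetry pipelines are far richer than a list of dictionaries.

```python
from collections import defaultdict

# Hypothetical per-query telemetry events, as an instrumented app might emit.
EVENTS = [
    {"query": "SELECT * FROM orders WHERE user_id = ?", "latency_ms": 400, "endpoint": "/orders"},
    {"query": "SELECT * FROM orders WHERE user_id = ?", "latency_ms": 380, "endpoint": "/orders"},
    {"query": "SELECT name FROM users WHERE id = ?", "latency_ms": 5, "endpoint": "/profile"},
    {"query": "SELECT name FROM users WHERE id = ?", "latency_ms": 7, "endpoint": "/orders"},
]


def latency_alarm(events, threshold_ms=100):
    """Monitoring: test one predetermined condition (average latency too high)."""
    average = sum(e["latency_ms"] for e in events) / len(events)
    return average > threshold_ms


def total_latency_by(events, field):
    """Observability-style question: where is the time going, sliced by any field?"""
    totals = defaultdict(int)
    for event in events:
        totals[event[field]] += event["latency_ms"]
    # Largest contributors first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Here `latency_alarm(EVENTS)` only says that something is wrong, while `total_latency_by(EVENTS, "query")` or `total_latency_by(EVENTS, "endpoint")` lets someone slice the same data by whatever dimension the investigation needs, without having decided the question in advance.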

Tooling for deployment is great, but the best and fastest moving teams that I see working under the highest scale with the lowest downtime, etc., etc., have a lot of knowledge-sharing and a lot of tooling that isn't what we would consider to be tooling, necessarily. But it might be things like wiki pages, documentation, chatbots, those kinds of things. So there's a lot of ancillary tooling that serves as a way to tie people's experience and knowledge and mental models about the system together in a shareable way that other people can benefit from rapidly, including themselves in the future, because I think we can all recognize when we have opened the system up to look at some code that we wrote. In my case, it might've just been yesterday and I already can't remember what it did. But certainly, some system that you wrote before, and you're looking at a line of code and going, "I know this was an important line of code, but I can't remember why."

So this shared knowledge base, documentation, runbooks, playbooks, dashboards, those kinds of things, is really important to invest in. And this is a screenshot from Etsy's Deployinator, some automation that they created at Etsy to help deploy changes. And if you notice, there are important links at the bottom. "What to watch after a push." So, Etsy has a whole set of wiki pages and things like that. Chat commands, I don't remember exactly in their case everything that they have. GitHub has something too. They call it the deploy confidence processes, where they gain confidence in a deploy after they push it out, and it's a bunch of wiki pages with links into dashboards and different monitoring tools, what to look for, what kinds of things to run in the chat, are you seeing what you expected or seeing something unexpected?

And you can tie these things in to your monitoring and alert notifications as well. So you can put links in, so that when somebody gets the page for something being problematic at 3:00 a.m., they can go and look at the wiki page and it can say it might be this, it's often that, click on this, load this dashboard and you'll see whether it is or not, and here's what to do and here's how to verify that it works or it doesn't work or anything like that. So those are all really, really important. I value them so much that they're a core part of VividCortex's monitoring. We're investing heavily in this right now.
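One simple way to picture alerts that carry their own context is a lookup from alert name to runbook and dashboard links, attached to the page text. This is only a sketch: the `RUNBOOKS` table, the `example.com` URLs, and `build_page` are hypothetical and do not describe how any particular paging or monitoring product works.

```python
# Hypothetical mapping from alert name to shared knowledge about that alert.
RUNBOOKS = {
    "replication_delay": {
        "runbook": "https://wiki.example.com/runbooks/replication-delay",
        "dashboard": "https://dash.example.com/d/replication",
    },
}


def build_page(alert_name, detail):
    """Build the notification text, including links for whoever gets paged."""
    lines = [f"ALERT {alert_name}: {detail}"]
    links = RUNBOOKS.get(alert_name)
    if links is None:
        # Make the gap visible so the missing documentation gets written.
        lines.append("No runbook yet: write one after this incident.")
    else:
        lines.append(f"Runbook: {links['runbook']}")
        lines.append(f"Dashboard: {links['dashboard']}")
    return "\n".join(lines)
```

For example, `build_page("replication_delay", "replica is 45s behind")` produces a three-line message with the runbook and dashboard links, so the person woken at 3:00 a.m. starts from what the team already knows.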

Structure: Team Orientation

So, so much for the tooling. Let's talk about culture and structure and process. Coming up next, some things that I have seen companies do specifically as they go through a process of adopting or broadening or expanding DevOps practices around the database. So the first thing is pretty universal with software teams in general. Teams work best when they are not segregated artificially into, for example, all the Go developers are in one team and all the JavaScript developers are in another team, or all the operations people are here and all the QA people are there. That's what we call silos. It reduces knowledge-sharing, it creates bottlenecks in the way of getting things done. And the teams that avoid this, which I believe would be describing more of the Westrum organizational model, work better.

Process

As you start on this, your first rule should be don't bring down production. That will definitely not advance buy-in and confidence and all of these kinds of things. So, I used to be an EMT, and I still remember the EMT classes and how boring it was that the instructor would say, "Do this, do that," and the other thing, and "stabilize the patient and then transport. Stabilize the patient and then transport." Another way that I've heard this said in the same vein as "move fast and break things," versus “move fast and don't break things,” is it's okay to punch a hole in the boat, just make sure that you punch it above the waterline. So, you have to be careful and diligent. If you read the Chaos Engineering book, they also talk a lot about this. It's a short book, I encourage everybody to read it. Forty pages or so of brilliance. Somewhere in there, a few pages in, they say, "Chaos engineering is not for breaking systems in production. If you know that your system is going to break, you don't need chaos, you need to fix it." So, you don't go release Chaos Monkey on a system that you think is going to break. You want to release it on a system that you think is going to stay up, and then watch what happens and learn.

It will be a process, and there is a ton of politics and persuasion, selling, marketing, advocating. You've got to get people on board. And to do that, it helps to have a plan. So often, the leadership stuff that we read is all about this fill in the blanks formula of vision and mission and Simon Sinek's "Start with Why" and all of these kinds of things. Those are all very helpful, but what most of the time is lacking, is planning and execution. What are we going to do in what order, and how is that actually going to achieve our goals? And then follow up and follow through, crossing all of the T's, dotting all of the I's. So, create an upfront plan.

I know this might sound a little bit anti-agile, a little bit anti-lean, but do create some kind of a high-level, staged, thematic progression, so that people can see where this is going. So it isn't just like, "I have this bright idea, let's do something." "Why?" "Well, I'm not really sure, but I just feel compelled to do it." You need to have a long-term direction that this is going. And when you're talking about change, it's really, really important to help people understand what is staying the same, the continuity. Because people fear change. We all do. Even those of us who say that we like to embrace change, if we're really honest with ourselves, we're comfortable with the way things are and we would like to maybe relax and take a breather from all of this constant change. So, when you're trying to get other people to embrace change, it helps a lot to emphasize to them what is actually invariant here, that these familiar and comforting things are not going away.

As a process, it's super important, I think, to pick something and get started. And make it as small as possible. Begin with small wins and build on top of that. I'll talk later about some anti-patterns that I've seen. And spoiler alert, "We're adopting DevOps in one fell swoop," or "We're becoming agile. Everybody start doing scrum and stand-ups," don't work. I think we probably all know that. So pick one team. We don't have to do this on every team. It's a great idea to have one team outperform the others. We at VividCortex have a couple of different teams and we follow some of the anti-patterns that I've already talked about and we see differences in performance in various areas, difference in reliability and resilience. And it's great to have one team who is delivering software better, faster with fewer outages and lower change failure rate. That means that you have something concrete that you can sell to the other teams and try and figure out what is it about these teams that's a little bit different anyway? Because there's often things in there that you don't understand. There's things well below the surface.

So, for example, instead of taking something old that the business depends on, that sucks and is painful, and transforming it, which I've made the mistake of doing and causing a lot of problems, how about trying it on something new? Stakes are low. Maybe it's in beta, maybe there's no production dependency on it and customers aren't actually touching it. Make that really great and then take some of those things and try and apply them to the old legacy code later. So, starting small, getting buy-in, earns you the right to continue this process.

Culture

And culture is a big part of creating change, and of DevOps in general. So I think we've probably all heard people say that DevOps is a culture initiative or culture problem as much as anything else. At root, it's not really technological. The problem is culture is an emergent artifact. It's not the thing in and of itself, but it's the thing that happens as a result of other things. And so you can't operate on culture directly. You can't just go and change culture. What you can do is things like changing processes, you can change tooling, you can change incentives, and that's where the real money lies. Changing the incentives is what drives culture. People talk about culture, culture fit. And other people will say, "Show me who gets rewarded or promoted or whatever, and then I'll tell you what your real values are and what your real culture is." And that's so true.

And so if you want to change that culture, you've got to change what the perceived status of work is. Another shout out to Charity Majors. She tweeted that at one point and it immediately resonated with me. Operations work shouldn't be low-status. Carrying a pager shouldn't be low-status. Being on call for your own code shouldn't be low-status. Make the reverse low-status. Make it higher status to be the person who is owning the production behavior of your systems.

Look at what creates friction and resistance to the change that you want to create, and then create a newer path that's easier, the path of compliance. Make the right thing the easiest thing, and try to make the wrong thing or the old way that you don't want, to be a little bit harder. Maybe just starve it of attention. That works sometimes. But definitely get everybody in the same boat as far as experiencing the benefits of the new way. And again, silos get in the way of this. So, anti-silo helps people to understand this is actually really great if I step out of the area that I've been operating in. And, share the pain equally.

Another little anecdote, and this will be a little bit of a tangent: I was doing a sales pipeline review with somebody at one point and there was high confidence that a deal was going to proceed, because they really needed to buy, because there was a DBA who was terribly miserable and working all kinds of late hours. They obviously needed to buy because tooling that we could provide was clearly the only thing that could turn the situation around. And then somebody asked the critical question, "Does this person's boss know that they're staying up nights and weekends?" "No." The deal did not close, because only one person was experiencing the pain.

And I have actually done this myself. After I came out of my consulting career, before founding VividCortex, I went back to a company that I had been at about five years earlier. I had been the person who installed Nagios, I'd set up a whole bunch of Nagios checks for the database. I went back, those same checks were still running. And replication behavior had changed greatly as this company had grown over five years, and so replication was continually delayed. And there was a replication delay alert that was continually emailing. And nobody was looking at any of this because it was so noisy. Everybody had just filtered all of the Nagios alerts to /dev/null, of course.

So I look at that and I'm like, "Okay. So, clearly, this is not an actionable alert. Let's delete it." "No, no, no, no. You can't delete it, it's an alert. It tells us things. Are you kidding?" So, who is actually on call for this? Who is getting paged? Nobody. Literally, nobody was getting paged. "Does anybody know if the databases are even staying up?" No. Nobody knew if the databases were staying up. This was a very much batch, daytime-oriented kind of a thing. There was a bunch of nighttime batch-oriented stuff that was crashing the databases every night, and nobody knew. And they would just restart and all of these processes were failing. Leads you to ask if the processes were actually critical or not. I hooked Nagios emails into PagerDuty, and I had the little robo-voice call me, and I had very small infant children. And at 3:00 a.m., I'm picking up my phone, "Hello?" "Mer, mer, mer, mer," robot voice telling me about the database server being down. So I very quickly made some changes, and it wasn't until I left the company that I realized nobody knew I was doing this. Nobody knew that I was up all night and that my spouse was getting woken up every night. Nobody knew and nobody cared. So, if the pain is not shared, you're not going to make any changes happen.
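One way to make an alert like that replication delay check actionable is to page only when the delay stays above a threshold for several consecutive checks, instead of emailing on every sample. The following is a minimal sketch of that idea, not anything from Nagios or PagerDuty; the class name `SustainedAlert`, its parameters, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SustainedAlert:
    """Fire only after `required_breaches` consecutive samples exceed the threshold."""
    threshold_seconds: float   # hypothetical: delay considered unhealthy
    required_breaches: int     # hypothetical: consecutive bad samples before paging
    _streak: int = 0

    def observe(self, delay_seconds: float) -> bool:
        """Record one sample; return True if someone should be paged."""
        if delay_seconds > self.threshold_seconds:
            self._streak += 1
        else:
            self._streak = 0  # delay recovered, so reset the streak
        return self._streak >= self.required_breaches


# Made-up replication delay samples, in seconds.
alert = SustainedAlert(threshold_seconds=300, required_breaches=3)
samples = [10, 400, 20, 350, 500, 900, 30]
decisions = [alert.observe(s) for s in samples]
```

With these made-up samples, only the sixth reading pages anyone: it is the third consecutive sample above the threshold, and the brief spike at the second sample is ignored.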

So, along the same lines, you’ve got to have leadership support. And leadership comes from all levels, but you definitely have to have the HiPPO's support as well, the highest paid person's opinion. So you've got to have real buy-in and support there. Not just words, but you've got to have actions from those people. It's okay if it's a project or an initiative that you're doing at a smaller scope and it doesn't require executive transformative kinds of things, but transformational leadership helps a hell of a lot. If you have somebody passive aggressively working against it, it's not going to happen, I can promise that. So, you can starve the old way, I talked about that a little bit before.

The other thing is that leadership comes from beneath. I have this mental diagram that I often draw where there's a person in a leadership position and they're focused on this level of a high level of business-focused kinds of things that they're thinking about. And then there's a person that they manage who's down here, who's focusing on doing. Then there's this space in between, and therein lies the manager-managee relationship, and whether somebody manages up and owns that space in between them and their manager, or whether they leave that for their manager to manage down and micromanage, is huge. It makes all the difference in whether people have autonomy and show themselves up as leaders from the bottom of the organization. So you may not have a leader or a manager who is very empowering, but if you can, if you have somebody who can learn, you can show them that you are able to communicate and structure and run things and take authority and responsibility over them, produce results without them being so involved. And by doing that, you can earn the right to do a little bit more and a little more.

You’ve definitely got to have lots and lots of communication and trust, laterally as well as vertically, for these kinds of things to happen. And this is where that vision and mission and "start with why" and stuff like that actually does make sense. A compelling reason why the new way that you're advocating is going to be better than the old way is really important for getting people to buy in. Tying that into customer and business-specific initiatives is very, very important. So if you can tie that into the customer impact, whether it is the experience or how successful customers can be, and if you can tie that into high-level, widely broadcast commitments and organizational priorities and mandates, things like OKRs, you're much more persuasive doing that.

The other thing that I've spent a lot of time thinking about, and I'll tell you, I'm just a beginner in this because I didn't really come from a family of a lot of trust and love and autonomy. So, I've really been learning these things in the workplace over the last couple of decades, about building trust with other people and about understanding and empathizing with other people and supporting somebody. So, I think about psychological trust. I think three things. The one thing is listening to people and saying, "I hear you. You matter to me, you exist." The second thing is "I care." So, “I've got your back. I hear you. I've got your back.” And you put those things together, “I hear you, you matter to me, and I've got your back, and I'm going to support you.” And that is the recipe that I think about for creating psychological trust and safety in a team.

People: You Need Experts

The final thing, now that I've described all of these things that I've seen help or appear to help a company succeed in bringing DevOps in, generally, but also specifically to the database, is that you definitely need people who know what they're doing. You need people who are going to be able to look at a tool, for example, and go, "That tool is going to cause an outage, because that tool was written for Microsoft Access on a desktop and we are running PostgreSQL on a thousand nodes in the cloud." So, just a goofy example, but I have seen those kinds of things. When I was doing performance consulting, a lot of the time the performance problems that we saw were caused by automation, by monitoring systems that were causing undue burden on the systems. There were a lot of problems that came from the tooling itself and not from the application.

When you have a DBA, that DBA is an expert with a lot of deep domain expertise. They should be your subject matter expert, they should not be caretaking the database. They should be helping everybody to be better with the database. And engineers can learn this and I've seen this at scale. Some of our customers have hundreds of engineers and zero DBAs. And I only know this because they are customers, but more than 50% of those engineers are using our product on a daily basis. They have service-oriented teams. You build, you own, you run. And they are watching what is going on in the database and they're competent with the database, even though they may not understand, again, buffer pools and latches and SQL transaction isolation levels. So you need those people, but the other people can actually learn and enjoy working with the database. So, educational efforts can be really, really important. And whether you continue to call somebody a DBA, or, as in the customer that I just mentioned, you say no DBAs, or you just redefine what DBA means, isn't really as important as looking at how everybody is working and whether that changes. So those are some of the things that I've seen apparently lead to success.

Failure

Let's talk about failure. First of all, bad tooling will bring your systems down, make you look bad, everything will go backwards. Any change that you're trying to introduce is going to fail or become much more difficult to justify and advocate for. A lot of database automation and database tooling, operations tooling, is very fragile. It is not built by people who have experience running at scale, and it will cause critical outages, and it will cause critical career changes sometimes. So, you've got to be really careful about this. Automation that tries to do too much, that tries to work outside of a well-defined scope, is more likely to have these problems, in my experience. On the other hand, accepting that there are areas where automation doesn't exist and just doing manual toil will not lead to good places either. It tends to just perpetuate the problems until they get to the point where there's some kind of a crisis that builds up.

The other thing that I've seen not succeed is when you have one set of tooling for getting code to production and another for getting database changes. Now you have two distributed systems that you have to coordinate with each other. And let's be real, these are distributed systems. If they run on more than one computer, you've got these problems. It's okay if you do this in the short term or you build small things with defined scope that you think you can merge into your code deployment tooling later, but definitely you should try and leverage the code deployment tooling and the automation, infrastructure as code, all of those kinds of things that you have, rather than building a shadow copy in parallel beside it.
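As a rough illustration of schema changes going out through the same path as the code, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for a real database. The `MIGRATIONS` list, the `schema_version` table, and the `apply_migrations` and `deploy` functions are hypothetical names invented for this example, not any particular tool's API.

```python
import sqlite3

# Hypothetical, ordered schema changes kept in the same repository as the code.
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    (2, "ALTER TABLE users ADD COLUMN email TEXT"),
]


def apply_migrations(conn: sqlite3.Connection) -> list[int]:
    """Apply any migrations not yet recorded; return the versions applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER PRIMARY KEY)"
    )
    done = {row[0] for row in conn.execute("SELECT version FROM schema_version")}
    applied = []
    for version, statement in MIGRATIONS:
        if version in done:
            continue  # already applied by an earlier deploy
        conn.execute(statement)
        conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
        conn.commit()
        applied.append(version)
    return applied


def deploy(conn: sqlite3.Connection) -> list[int]:
    """One deploy step: schema changes ride along with the code rollout."""
    applied = apply_migrations(conn)
    # ... the existing code rollout steps would follow here ...
    return applied


conn = sqlite3.connect(":memory:")
first_run = deploy(conn)
second_run = deploy(conn)  # rerunning the deploy applies nothing new
```

Because applied versions are recorded in the database itself, rerunning the deploy is a no-op, so there is no second system whose state has to be coordinated with the code deployment.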

Culture fail. Obviously, some of the hardest and stickiest problems. I only have limited success here, have made some changes that I can directly attribute to success in my own teams. I've made other changes that haven't stuck. So, this is hard stuff. But generally, any friction, anything that is easier to do the old way, people are going to do it the way they're used to, they're going to do it the way that's easiest, most convenient, that doesn't require them to check with someone else.

The other thing is bringing in a vendor. So, I've definitely seen multiple times where I have been brought in as a vendor and it was assumed that the tooling that we sell was going to be able to create a culture change. Sometimes we can support that culture change, but we are not going to change anybody's culture from the outside. And if you bring in a vendor, us or someone else, and you're putting your neck on the line, if you're putting your career and your reputation on the line, it could be career-ending. And I've also seen some cases where people feel like their career isn't worth continuing there unless they can make some of these changes. So it's like, "What can I lose? I'm going to try and bring in some tooling to support our efforts here and if it doesn't work, I'm going to find another job."

And clinging to legacy DBA roles, I don't know anybody who says this better than Silvia Botros. We won't read the whole quote, but you can look at it later. I think the key thing that I want to pull out from this is that the job of the DBA means she is the only person with access. The go-to person, the person, the only person. So if you are any of those, "the person", the only person they ask, if you are that single dependency for the development team, rather than being embedded with the product team and helping very early in the life cycle, that role, that workflow is itself a problem. So, click on that link and read it. That's gold.

Leadership fail. I'm running a little shorter on time than I thought, so I'll skip past a couple of these. Planning fail. Doing an all or nothing or pushing too fast. Pushing for speed at the expense of stability or pushing for feature delivery and not giving people enough time to pay down technical debt, to sit with their systems, to use that observability tooling to become familiar with how the systems really run, to poke and ask little questions that may not seem like they have any real purpose, but as a result of that, discover something new and unexpected that maybe if you hadn't fixed it would have been one of the five or six causes that came together to cause an outage in the future.

Challenges

So what is the hardest part? And there's politics, there's culture, there's all of these challenges around the people side of it. Particularly, I would say, as individuals, advocating for change is a skill, and it is a very difficult skill to learn. But it is critical for your career growth. If you look at the people who grow in their careers, they're the ones who can clearly articulate what needs to be done, why and what's the benefit, and how to do it. That is a skill that all of us should be working on, not just coding and testing.

The tooling. Traditional databases are very hard to operate in a cloud-native way. They were not built for non-blocking schema changes. Failover is hard; there's all of these kinds of things. And then when you get them at scale and performance is critical and you can't have outages and downtime, even planned maintenance, it becomes much, much, much harder. So I would encourage you to look at the work that GitHub has done on non-blocking schema change. Facebook has done some similar stuff. In a past life, I wrote some tooling, but I think GitHub's tooling is the best out there in the open source arena right now. So those are really, really hard technical problems.
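To give a feel for why non-blocking schema change is hard, here is a toy sketch of the general shadow-table idea: build a new table with the desired schema, copy rows across in small batches, then swap the names. It uses Python's built-in `sqlite3` module and a hypothetical `online_add_column` function. It is not how GitHub's or Facebook's tooling is implemented, and it deliberately leaves out the hardest part, which is capturing writes that arrive while the copy is running.

```python
import sqlite3


def online_add_column(conn: sqlite3.Connection, batch_size: int = 2) -> int:
    """Rebuild `users` with an extra `email` column, copying in small batches."""
    conn.execute(
        "CREATE TABLE users_new (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )
    last_id = 0
    batches = 0
    while True:
        rows = conn.execute(
            "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        conn.executemany("INSERT INTO users_new (id, name) VALUES (?, ?)", rows)
        conn.commit()  # short transactions, so each batch holds locks only briefly
        last_id = rows[-1][0]
        batches += 1
    # Cut over: swap the table names in one short step.
    # ASSUMPTION: no writes hit `users` during the copy; real tools must handle them.
    conn.execute("ALTER TABLE users RENAME TO users_old")
    conn.execute("ALTER TABLE users_new RENAME TO users")
    conn.commit()
    return batches


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (name) VALUES (?)", [("a",), ("b",), ("c",), ("d",), ("e",)]
)
conn.commit()
batches = online_add_column(conn)
```

Copying in batches keeps each transaction short, which is what lets the original table keep serving traffic. Keeping the copy consistent with concurrent writes, and cutting over safely, are the parts that make production tooling difficult.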

And I want to wrap up by asking what happens to the DBA. So, if we're bringing DevOps to the database, where does the DBA go? And my answer is that the DBA becomes a database reliability engineer. Get Charity and Laine's book, "Database Reliability Engineering." Seriously, it is a phenomenal book. In terms of culture, processes, tooling, the various rubrics, the way that we think about databases needs to change, and it's all there in that book. We need to take the same shift that went from sysadmin to systems reliability engineer, site reliability engineer, and apply that same kind of worldview change to the database, from database administration to database reliability engineering.

The Rewards

If you can do this, and I know we can, we can do it a little bit at a time, maybe just small things at a time, we get better outcomes for our company. We get the outcomes from the project itself, from the process, from the culture that we build, the tooling that we have that we didn't have before, and the individual benefits to you and to your careers. So I'll stop there. Thank you to the many people who haven't had a chance to review these slides and I can't say that they endorse them yet. And if you want these slides, scan the QR code and I'll see you in the hallways for questions because I have one minute, actually.

Recorded at:

Jan 02, 2019

Baron Schwartz
