In this podcast, Melissa Benua, Director of Engineering at mParticle, sat down with InfoQ podcast host Daniel Bryant and discussed: the importance of the roles of testing and security within DevOps; the benefits and challenges of building systems with teams of generalists; and how to “bake in” observability of systems from day zero.
Key Takeaways
- Testing and security are both inextricably tied up in quality. You can't have software that does what the users want in a safe and happy way -- and in a performant way -- without those two things.
- Engineers should consider the design implications of important cross-functional requirements, such as observability, testability, and security, within their systems from day zero.
- Using cross-functional teams to design and build systems can be highly effective. However, these teams should have access to specialists with deep knowledge in, say, security and performance.
- Start simple with observability. For example, if you are deploying to a public cloud, you can take advantage of their metrics, logging, and tracing solutions, without the need to build and run your own.
- Codifying security and observability assertions and checks into a continuous delivery pipeline is an effective way to ensure compliance and standardisation across an organisation.
Transcript
00:21 Introductions
00:21 Daniel Bryant: Hello, and welcome to The InfoQ Podcast. I'm Daniel Bryant, News Manager here at InfoQ, and Director of Dev Rel at Ambassador Labs. In this edition of the podcast, I had the pleasure of chatting with Melissa Benua, Director of Engineering at mParticle. I met Melissa at a panel at an online security conference and enjoyed chatting to her about all things platforms, testing, and continuous delivery. In this podcast, I wanted to dive a little deeper into the concept of DevTestSecOps, the shifting left of testing and security, and how to encourage everyone in the organization to take responsibility for these important topics. From my previous discussions with Melissa, I knew that she has a lot of hard-won experience, both as an individual contributor and also as a technical leader. And so I wanted to explore the differing approaches to testing and continuous delivery for both types of roles.
01:04 Daniel Bryant: Hello, Melissa, and welcome to The InfoQ Podcast.
01:06 Melissa Benua: Hi. Thanks for having me, Daniel.
01:07 Daniel Bryant: Could you introduce yourself for the listeners please?
01:10 Melissa Benua: My name is Melissa Benua. I am a Director of Engineering at a startup called mParticle and I do many things, but foremost among them is our DevTestSecOps strategy, which is kind of a mouthful.
01:21 How important is testing and security in relation to DevOps?
01:21 Daniel Bryant: You and I met on a SnykCon panel, a few months ago now, I guess it was. And I was super interested when you did your intro there. You mentioned DevTestSecOps, and initially we were there to talk about DevSecOps, but I was like "Testing makes sense. Right? Everyone's got to do testing". Could you unpack that a little bit, Melissa? What is your motivation for doing all that?
01:39 Melissa Benua: Testing and security are both inextricably tied up in quality. You can't have software that does what the users want, in a safe and happy way, in a performant way, without those two things. And you really can't have one without the other. So I've tied them together, because the strategies are so intertwined. Part of ensuring our code quality is static analysis, which is also a reasonable way of checking for security vulnerabilities. Likewise, making sure we understand the performance characteristics of our services is important for quality and for testing, but also for security, so you know when you've got a deviation. It's all about knowing that something strange has happened.
02:15 How can developers establish a baseline for what “normal” looks like in their system?
02:15 Daniel Bryant: That's the really interesting thing, I guess, because I think many engineers I've chatted with in the past don't know the baseline of their system. They don't know what weird looks like. Have you got any advice for how folks should go about, maybe at an organizational level or a technical level, establishing those baselines?
02:30 Melissa Benua: You have to start simple. So if you have nothing, you have to start at least with your basic metrics. There's an acronym for them that I never remember; there's a clever one. But you have to know your successes, your failures, your latencies. What does that look like on a normal day? So you have to instrument: this is what successes are. Most services have a sine wave; if you follow business traffic or home traffic, the sine wave will invert. What do your latencies look like on a normal day at the 50th, 90th, 99th percentile? What are your error rates? Really just start there. Those are the bare-bones basics: latencies, requests, successes and failures, and CPU usage. CPU usage is the other bare-bones simple one to track. If you're usually running at 30% and then one day you do a deploy and you're running at 90% utilization, something strange has happened.
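As a rough sketch of what this baseline instrumentation can look like, here is a minimal example assuming the Python prometheus_client library; the metric names and the handler wrapper are hypothetical, not something prescribed in the conversation.

```python
# Minimal baseline metrics sketch, assuming the Python prometheus_client
# library; metric names and the handler wrapper are hypothetical.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total requests", ["status"])
LATENCY = Histogram("http_request_latency_seconds", "Request latency")

def handle_request(do_work):
    """Record success/failure counts and latency around a request handler."""
    start = time.time()
    try:
        result = do_work()
        REQUESTS.labels(status="success").inc()
        return result
    except Exception:
        REQUESTS.labels(status="failure").inc()
        raise
    finally:
        # A latency histogram lets the metrics backend derive the 50th, 90th
        # and 99th percentiles over a normal day.
        LATENCY.observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for scraping
```

CPU utilization is usually better taken from a host-level exporter or the cloud provider than from application code.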
03:12 Daniel Bryant: Who do you think should predominantly own these things? And this is probably a bit of a trick question. But is it Dev, is it Ops, is it someone else?
03:19 Melissa Benua: It's so tricky. It depends on the model of your engineering groups. If you have a traditional Ops group, then they may own the alerts. But if you have a more savvy developer group, or developers who slide into DevOps, then it should be the developers. My philosophy here is that whoever owns the metrics should be the people you would call to fix it. I've done many years of on-call rotation and there's nothing worse than being called for an issue that you cannot fix. All you can do is say "Yep, that sure is an issue. Let me go call someone else". There's nothing worse than that. Nothing burns out a Dev team or an Ops team or any team faster than being woken up for no reason. So my opinion is that whoever owns the metrics should be the people who at least know how to provide the fix, and preferably should be the ones providing the fix.
04:04 How do you encourage developers to take more responsibility for operational concerns?
04:04 Daniel Bryant: I like that. I've often, the same as you, been on call, both from an IC and a leadership point of view, and it's really powerful, I think, making the folks who are creating these things responsible. I used to get a lot of pushback, though, doing the whole "devs on call" thing. Developers used to say to me, and this was a fair bit of time ago when I was doing development, to be fair, about five years ago or so, but developers used to say to me "That's not my job. I'm here to create code, create functions, and then I hand it off to Ops". Have you got any advice for folks who say that?
04:29 Melissa Benua: Anytime there's a handoff, you've got an organizational problem, in my opinion. I may be dating myself, but we had an SDET team, so developers handed off their code to the SDETs to test, and then the SDETs handed off the code to Ops to monitor. And that code was a nightmare. And not only was the code a nightmare, but the teams were a nightmare, because more often than not you'd end up with people arguing with each other. It ended up with live incidents where "Well, it's Dev's fault because Dev wrote the bad code", "Well, it's Test's fault, because Test didn't catch it", "Well, actually it's Ops' fault because they monitored the wrong thing". You just end up with a lot of finger-pointing and blame. And there are a lot of inefficiencies at every step where you have to hand off. Every time you hand off, you lose something.
05:07 Daniel Bryant: Oh, interesting.
05:09 Melissa Benua: And what you end up with is developers who own enormous swaths of functionality, but have very little responsibility for it. So I would much rather have developers own small pieces of functionality, but have deep-cutting responsibility for it, because you preserve context better that way.
05:22 Daniel Bryant: Very interesting. That goes back to the whole Amazon two-pizza team, properly cross-functional teams, because one pushback I get is that, as a full-stack developer, you now need to know everything, almost from the CPU chips all the way through to React and so forth at the front end. I guess that you need to have those responsibilities and those understandings, but at a team level, not at an individual level.
05:44 Melissa Benua: Exactly. I think it's too broad to expect somebody to be a master of React, and also to know how C++ code compiles down at the bytecode level, and also to know how to do ops; it's too much. I'd much rather have a UI expert and a backend expert, but they own their slice of the system all the way through to production, because nobody knows what your backend system is supposed to be doing better than the person who wrote it. You know what your code is supposed to be doing. You have the intrinsic insight and, in theory, the tooling available to know "I wrote this code, it's supposed to do this thing. Here's what it looks like when it's doing that thing. Here's what it looks like when it's not doing that thing". If you throw it over the wall to an Ops person, they know generally what services look like, but they don't really know what yours should look like.
06:23 How should dev and ops most effectively collaborate and interact via a platform and the associated tooling?
06:23 Daniel Bryant: I was actually going to ask you about platforms. And I think you mentioned tooling there, which is in a similar space. How do you think Dev and Ops, and Test of course, should interact in this new world of containers, infrastructure as code, and everything in the cloud? What's the best thing you've seen in your experience, or what would you recommend to folks about how to interact with and use the platform and the tools appropriately?
06:45 Melissa Benua: Yeah. And this is fluid, in that the answers are going to change over time as certain technologies go from edge case to specialty case to mainstream case. I've seen it change. My preference is that operations, like Ops teams, are fantastic at knowing services in general: how do you scale a service, how does cost factor in, general reliability and monitoring, in a cross-cutting way. They're generally more Linux savvy. If you need somebody to SSH onto a box and do something crazy, they can do it. So far, I've seen them take over ownership of managed services, like running your Kubernetes cluster, because it's a specialized skill set. It's not something that's entered the mainstream yet.
07:26 Melissa Benua: But in five years I may have a different opinion. I may feel like "Oh, the tooling actually is mature enough that the Dev could figure out how their team is going to scale in Kubernetes". Today, I don't think we're there. But this is what I mean by "it shifts over time", as the tooling and the tools themselves enter the mainstream and more and more people know them, and the resources and everything become more accessible. I use a simple rule: if you have to go into Amazon and write YAML to configure your services, you probably need a specialist. If you can do it through the UI or through a simple Terraform, you probably don't.
07:57 Daniel Bryant: Oh, I like that. I like that. I've had a few interesting chats with folks in terms of higher-level constructs. Terraform gets mentioned a lot, as does Pulumi, things like this. But, again, that comes back to some of the arguments about developers having to know more stuff. It's great that Pulumi allows you to write infrastructure as code in a language of your choosing, a general-purpose language, but then does the responsibility fall more onto the developers again?
08:17 Melissa Benua: I view Ops in the same way as I do security, in the same way as I do Test, in that they're a specialist provider whose job is to make it easy for the other teams to interact with them. So Ops writes the grand bulk of the Terraform and makes it so Devs only have to write a little: "You're setting up a new service? Great. Here are the five lines of Terraform template. You just have to customize little pieces". Same thing with security. The security guys know the tools and know the rules that need to be set, but they should make it easy for the developers to run and validate that they've done the right thing. Same thing with Test. The test experts know what an end-to-end automation framework should look like, they know the characteristics of it, they know what it should be validating, but they shouldn't be writing every single test case.
08:55 What is the best way to introduce the DevTestSecOps mindset into a traditional team?
08:55 Daniel Bryant: So, what's the best way to go about introducing the DevTestSecOps mindset into a traditional team?
09:00 Melissa Benua: The most important thing you need is champions. You don't even need an expert, but you need a champion. You need somebody who's going to advocate for whatever it is. It may be that you have a strong quality advocate, it may be that you have one or two security advocates, preferably one per team. So it sort of depends on your team structure and team size and team duration, whether you have short-lived teams or long-lived teams. I've tried it where I said from on high "Hey, organization, I need you to do these things", and it doesn't matter what your title is, it's unlikely to be effective. But if you can convince one or two people on the teams "Hey, look at this cool thing", and they can say "Oh, look. This is amazing. It made my life easier", then you have all these little seeds planted, little secret seeds planted in the teams, and that is much more impactful. Because, if you hear it from one person, fine. If you hear it from five people, maybe there's something interesting.
09:49 Daniel Bryant: Any advice on how to create that? I've seen some folks like Gene Kim talk about running internal bootcamps, trying to almost, like, create champions. Any advice or thoughts on that process?
10:00 Melissa Benua: I've done it a couple of ways. The most effective has been having people float temporarily onto a project that they're interested in. So my team will have a list of projects we want to do, and somebody will say "I really want to try this", and I'll see if anybody is interested in working on a short-term project, like a month or two months: having them float over to my team, work on that, whether it's a testing thing, a tooling thing, whatever it is, and then pull back to their team. And that does a couple of things. If somebody floats onto my team, hopefully I gain credibility as somebody who knows what they're doing, in theory.
10:32 Melissa Benua: And then they are the ones who are doing some of the work, so they gain empathy for work that's sometimes quite hard and sometimes quite thankless. And then they take those best practices back to their team, and the team is hearing it from their teammate, not from somebody they only sort of know; somebody they interact with and trust every single day. So rotations, floats, special projects, hackathons: all of those are secret tools I've used to build empathy.
10:54 Daniel Bryant: I like that. That is definitely a bonus thing to take away there, I think. Because I find this really hard. I've done consultancy in the past, and getting internal champions is something I struggle with at times. But once I did it, the magic happened, to your point.
11:07 Melissa Benua: Exactly.
11:08 Daniel Bryant: You've got that chatter in the lunchroom, that chatter around the water cooler. And suddenly everyone is going "Well, yeah. I'm using Jenkins. It's much easier just building my thing now". And you're like "There we go".
11:16 Melissa Benua: Exactly. I know that I've won when it's not me saying "But what about this cool thing?", it's the developer saying "Oh, yeah. Cypress". All of a sudden, they're going from me pushing Cypress to one person trying out Cypress to all of them arguing over the best test pattern to use in their Cypress automation. That's it. Then I know we're good to go, we're solid.
11:33 Daniel Bryant: I tell you, personally, with that advice I've definitely seen that. So when I'm sat in a meeting and someone echoes back something I've been championing all along, I'm like "Result". Yes. Nice.
11:40 Melissa Benua: Yes.
11:41 Could you share any tips and tricks for thinking about understandability in relation to green and brownfield systems?
11:41 Daniel Bryant: Excellent. Well, let's change gears a little bit now and think about understandability. Because, as you and I were talking about off mic, you've got to be able to understand the system to think about testing and security. Have you got tips and tricks for folks to think about understandability, either when they're greenfield, building a new system, or when they need to understand something that's already been in production for some time?
12:04 Melissa Benua: If you're greenfield, it's relatively easy. There are a lot of best practices for how to start, and a lot of ways to bootstrap. OpenTelemetry is something like that, and you can plug into a number of different providers; your understandability is basically done for you. But if you're not greenfield, which most of us aren't, you've got to bolt something on. I always say to start easy and start with what may already be available to you that you might not know about. For example, if you're running in a cloud, your cloud provides metrics. Amazon provides you CloudWatch, Azure provides you... what? I forget the name of the insight metrics, but I've used it before. Start with what you get for free and see what that looks like, see what that tells you. And then see what that doesn't tell you, because there are going to be plenty of insights you don't have, and then that's the next step: to figure out "Okay, I understand my basic performance metrics, but I don't understand how many times and in what ways I'm calling into this database with duplicates", or something like that.
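For example, the "free" metrics from a cloud provider can be pulled programmatically. Here is a hedged sketch using boto3 against AWS CloudWatch; the namespace, metric, and load balancer name are hypothetical placeholders.

```python
# Sketch: reading the metrics you already get "for free" from a cloud
# provider, assuming AWS CloudWatch via boto3. The namespace, metric and
# dimension values are hypothetical placeholders.
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-service/1234567890"}],
    StartTime=datetime.utcnow() - timedelta(days=1),
    EndTime=datetime.utcnow(),
    Period=300,                 # five-minute buckets
    Statistics=["Average"],     # percentiles would go in ExtendedStatistics
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```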
12:52 Melissa Benua: There are other metrics that you might want to know about because they indicate a problem, and usually, if you own your service, you understand that. That's where developer-owned Ops really shines over traditional Ops: traditional Ops is going to look for different metrics than the person who wrote the code. The person who wrote the code, or the team that's responsible for the code, is intrinsically going to come up with different failure scenarios that are functional, that depend on what is correct and what is not correct.
13:16 Melissa Benua: So it's figuring out "Okay, I need to understand what my interaction with the database is, and I'm not getting enough metrics from the free, cheap stuff, so I need to add a counter". I always say to start with a counter and not a log, because counters scale and logs don't. Obviously distributed tracing is better. But let's just say logs don't scale, with an asterisk; they technically do, but they're a hundred times more expensive than a counter. If you think about a counter in Datadog, it increments a hundred times, versus a hundred log lines where each line is 200 bytes. There's a significant cost in network traffic and storage, in natural language processing, in parsing and aggregating log lines, that just doesn't exist for a counter. It's just a byte.
13:57 Daniel Bryant: I love it. Yeah, that was great advice. I think that's hard-won advice, by the sound of it.
14:02 Melissa Benua: Yeah. I know logs are easy to reach for because "Oh, we can look at logs on our machines. Everybody understands logs". But they don't scale, at all. So I always prefer to add a metric and then, if you have a truly exceptional weird case, a log to provide more information. But it really should be a truly exceptional weird case.
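As a rough illustration of "counter first, log only for the truly exceptional case", here is a sketch assuming the Datadog DogStatsD Python client; the metric name, tags, and helper functions are hypothetical.

```python
# Sketch of "prefer a counter over a log line", assuming the Datadog
# DogStatsD Python client; metric names and helper functions are hypothetical.
import logging
from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)
log = logging.getLogger(__name__)

def record_duplicate_db_call(query_name: str) -> None:
    # Cheap and easy to aggregate: a single counter increment.
    statsd.increment("db.duplicate_calls", tags=[f"query:{query_name}"])

def record_truly_exceptional(error: Exception) -> None:
    # Expensive but rich: reserve full log lines for genuinely weird cases.
    log.error("Unexpected database failure: %s", error)
```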
14:19 How can you “shift left” some of the responsibility of security and testing?
14:19 Daniel Bryant: Great advice, Melissa. Great advice. Something I want to pick up on that you mentioned there, which I think is super interesting, is that the person who created the thing probably knows best, or can intrinsically identify something that's going wrong; they have a stronger mental model, most likely. Any advice on how to shift left some of these responsibilities? You and I briefly talked off mic, again, about how security testing really needs to be shifted left. It's not a thing you do at the end of the pipeline before release; it is something to think about almost from day zero. Any tips or hard-won experience in that area too?
14:47 Melissa Benua: The earlier you can think about it, the better. As you're writing your code, you're thinking about "Oh, this is what I'm going to monitor". Before you're writing your code, or as you're designing, understanding your failure cases is an intrinsic part of design: "How is this going to go wrong?" And then making sure you have the monitoring so that you'll know when it's going wrong and that you're handling it appropriately. Having the tooling in place matters too: if you're going to use automated tooling to catch issues, I always prefer that it happens at pull request time, and it needs to happen automatically, as automated and as fast as possible. So, if you're going to rely a lot on unit tests, your code should be unit tested, and it should have passed its unit tests before it lands. This is part of continuous delivery: you should never have code in a deployable branch that's not ready to be deployed.
15:28 Daniel Bryant: Totally makes sense.
15:29 Melissa Benua: That's also part of shifting left, because you can't land stuff and then test it. You have to test it and then land it. That includes all the pieces.
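As a small example of "test it and then land it", a failure case identified at design time can be captured as a pull-request-gating unit test; the sketch below uses pytest and invented names.

```python
# Sketch: a pull-request-gating unit test for a failure case identified at
# design time, using pytest. parse_user_payload is a hypothetical handler.
import pytest

class InvalidPayloadError(Exception):
    """Raised when a request payload is missing required fields."""

def parse_user_payload(payload: dict) -> str:
    # The failure mode was decided at design time: bad input fails loudly.
    if "user_id" not in payload:
        raise InvalidPayloadError("missing user_id")
    return str(payload["user_id"])

def test_missing_user_id_raises_a_clear_error():
    # This must pass before the change lands in the deployable branch.
    with pytest.raises(InvalidPayloadError):
        parse_user_payload({})

def test_valid_payload_round_trips():
    assert parse_user_payload({"user_id": 42}) == "42"
```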
15:38 How can engineers go about ensuring their system is observable?
15:38 Daniel Bryant: Yeah. Something I heard you say there about thinking about these things at design time, I think that's super valuable advice. I've heard of using tools like threat modeling, when you're doing security, to think about the security aspect. Any comments around that? Or is there an analogy for thinking about observability, like observability modeling or something like that?
15:55 Melissa Benua: I don't know that I've ever seen tooling to do observability modeling, which doesn't mean it doesn't exist; it just means that I haven't seen it. Maybe I haven't thought to look for it, but I'm used to having to do it on my own. I prefer to think of observability at a service-wide level. Especially if you're using microservices: with one service, standardization doesn't matter quite as much, but if you have 40 services or a hundred services, your standardization matters a lot. So apply the standard model first, so request metrics aren't named one way in one service, another way in another service, and a third way in a third service; you have a standard model to build off of. And then, once you have the standard model of things that you get more or less for free with your service or your new piece, run through your functional path, run through your tests or your functional testing, especially your end-to-end testing or any user-interaction testing, to figure out what your key paths are and what your key breaking points are.
16:42 Melissa Benua: Because then you're using real interactions and not theoretical interactions. A lot of times where things fall down is that we thought about how something should work in our heads, but never actually tried it. I don't know how many times I've had a dev sign off on a change, and you say "Did you test it?", and they say "Well, I unit tested it", but it goes to production and nobody actually ran through the scenario end to end. And it turns out all kinds of things come out when you actually run through a user scenario end to end: each unit piece was working fine by itself but, when you put them together, it didn't make any sense. It's similar with observability. Each of your metrics may seem to make sense individually, but if you don't do a pass all the way through and validate that "Yep, this actually makes sense", you're going to find that out in production, often to your sadness.
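One lightweight way to get the standardization Melissa describes is a shared metrics module that every service imports, so request metrics are named identically everywhere. The sketch below assumes prometheus_client, and the naming convention itself is an invented example.

```python
# Sketch: a shared metrics module imported by every service, so request
# metrics are named identically across services. Assumes prometheus_client;
# the naming convention itself is hypothetical.
from prometheus_client import Counter, Histogram

# Registered once with fixed names; services differ only by label values.
REQUESTS = Counter("requests_total", "Total requests", ["service", "status"])
LATENCY = Histogram("request_latency_seconds", "Request latency", ["service"])

def request_metrics(service: str):
    """Return (success counter, failure counter, latency histogram) for a service."""
    return (
        REQUESTS.labels(service=service, status="success"),
        REQUESTS.labels(service=service, status="failure"),
        LATENCY.labels(service=service),
    )

# Usage in a service, for example:
#   ok, failed, latency = request_metrics("checkout")
#   ok.inc()
#   latency.observe(0.123)
```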
17:24 Wrapping up, and could you talk a little more about how all of the concepts we’ve discussed today relate to continuous delivery?
17:24 Daniel Bryant: Yes. We're close to wrapping up now, but I'd love to get your thoughts around how what we've been discussing today relates to continuous delivery and deployment, as I think these are super important topics for the listeners.
17:33 Melissa Benua: So, when you have a continuous delivery system, and I talked a little bit about it earlier, the continuous delivery system implies that every change in your mainline branch, or your deployable branch, could in theory go live. It doesn't mean it has to immediately, instantly go live, but it should be able to go live without any additional extra validation. What that looks like is a lot of automated tooling, but also a lot of manual testing and whatever else is appropriate happening at pull request time. So you open a pull request, preferably a quite small one, you get your build run, you get your unit tests run, you get any static analysis and simple automated checks that can be run on code run, and then you have to deploy it somewhere, to a test environment.
18:10 Melissa Benua: This is where traditionally you used to loan out test environments; now, with containers and Terraform or infrastructure as code, you can spin one up pretty easily, and it should be as production-like as possible so that you can do your integration testing and end-to-end testing, and you can do your observability testing there. You can do a manual test pass, because most changes still need a manual test pass just to validate that the scenario is correct. I don't know how many times I've heard people say "We don't do manual testing", but they don't really mean that. What they mean is "We don't do manual functional or regression testing", not "We don't do scenario validation". Because I promise you, if nobody's doing it, your users are not happy about it.
18:45 Melissa Benua: So by the time something has finished PR, it needs to have been truly tested. You need to have spun up the UI and looked at some pages, whether that's through UI automation or with somebody looking. Every piece needs to have actually been deployed and run and truly validated to ensure that, when you've landed, "Yep. No problem. We can go live, in theory, instantly". There's no room for slack or margin of error. You can't hand-wave away "Yeah, I swear I ran the build", when you didn't and your code doesn't compile.
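To make "actually deployed and run and truly validated" concrete, here is a hedged sketch of an end-to-end smoke test run against an ephemeral, production-like environment at pull request time; the URL, endpoints, and environment variable are hypothetical.

```python
# Sketch: an end-to-end smoke test run against an ephemeral, production-like
# test environment at pull request time. The base URL, endpoints and the
# TEST_ENV_URL environment variable are hypothetical.
import os
import requests

BASE_URL = os.environ.get("TEST_ENV_URL", "https://pr-1234.test.example.com")

def test_health_endpoint_is_up():
    response = requests.get(f"{BASE_URL}/health", timeout=5)
    assert response.status_code == 200

def test_key_user_scenario_end_to_end():
    # Exercise a real user path, not just the units in isolation.
    created = requests.post(
        f"{BASE_URL}/api/items", json={"name": "smoke-test-item"}, timeout=5
    )
    assert created.status_code == 201
    item_id = created.json()["id"]

    fetched = requests.get(f"{BASE_URL}/api/items/{item_id}", timeout=5)
    assert fetched.status_code == 200
    assert fetched.json()["name"] == "smoke-test-item"
```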
19:11 What advice would you offer to folks looking at implementing canary releasing or feature flagging?
19:11 Daniel Bryant: Yeah. I've definitely seen that. I've definitely seen that. Have you got any tips on progressive rollout? I see a lot of folks talking about progressive delivery as well, and I think that's a really interesting idea, as all of that kind of boils down to canary releasing and feature flagging, kind of limiting the blast radius. Have you got any tips you'd like to share with our listeners around that?
19:29 Melissa Benua: Both. Canary releases are great if you're worried about the box itself. If you're incrementing some version and you're like "Oh, I don't know about the performance of my new version of Linux, or my new runtime framework version", canarying is great for that, because you can clearly see differences. Feature flagging is much better and safer when the change is more binary. Because, if you're canarying, you're still affecting some percentage of your users; a small percentage in theory, but it's still some percentage of your users. So canarying is great when you're worried about performance changes that aren't critical. Feature flags cut across the whole box, or the whole container, whatever: feature flagging will hit every box you have, but will limit the blast radius of users. So, if you're more worried about taking out users, it's better to do feature flagging so you can limit the scope to a specific set of users. But, if you're more worried about the box itself failing, it's better to canary so that you only lose one box instead of losing all your boxes.
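A feature flag that limits the blast radius to a percentage of users is often implemented by hashing a stable user ID; the sketch below is a minimal, hypothetical version of that idea rather than any particular feature-flag product.

```python
# Sketch: limiting the blast radius of a feature to a percentage of users by
# hashing a stable user ID. A minimal hypothetical illustration, not any
# specific feature-flag product.
import hashlib

def feature_enabled(flag_name: str, user_id: str, rollout_percent: float) -> bool:
    """Deterministically enable a flag for roughly rollout_percent of users."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in [0, 100)
    return bucket < rollout_percent

# Example: roll the new checkout flow out to ~10% of users on every box.
if feature_enabled("new-checkout-flow", user_id="user-42", rollout_percent=10):
    pass  # serve the new code path
else:
    pass  # serve the existing code path
```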
20:24 Daniel Bryant: Well, it's been great chatting, Melissa. Thank you.
20:33 Melissa Benua: Yeah. Thank you. This was fun.