
Dan Craig on Bringing a Federal Agency Up to Speed on DevOps

   

1. Hello. I'm Manuel Pais. I'm here at the Agile Conf 2015 with Dan Craig, Agile Delivery Director at Steel Thread. Thanks for accepting our invitation. Can you briefly introduce yourself to our audience?

Sure. I'm Dan Craig, Director of Agile Delivery at Steel Thread Software. We're based right here in the Washington, D.C. area. As for my background, I've been doing Agile development since 2001, all of it in the commercial sector. But starting in 2009 I really focused on the continuous delivery and DevOps problem, and I've been working on that since, purely in the federal sector.

   

2. So that's the theme of your talk here: your work at this federal agency. Can you guide us through how the engagement process took place? What were the problems that they were facing when they asked for help?

Sure. Yeah, a little bit of background. It's a large organization within the Commerce Department. They really became highly visible during the height of the recession, when President Obama cited them as one of the potential instruments to support businesses and turn the economy around. So Congress pushed quite a bit of funding their way to build the next-generation platform that was going to support their public-facing systems and any back-office systems tied to them.

So they took that funding and began working, and about a year in they were really experiencing some problems. Those problems were largely due to issues with configuration management: building the wrong things and then, once built, actually deploying the wrong binaries into production. So the CIO, a very forward-thinking guy who had worked in the commercial sector and had seen solutions to these problems, went looking for someone to solve this. We had done the same work at DISA and he found us. He gave us a small, several-week engagement to build a strategy roadmap, I think maybe more as a test of us. After that was delivered, they gave us the longer-term engagement.

   

3. Which strategies did you devise for attacking the problems that they had?

Well, during the roadmapping session, we didn't have enough time to understand every nuance of the problem. But we did know enough to come in knowing we had to build a platform that was going to support these Agile practices, and that we needed to attack the problem from the bottom up. So we wanted first to concentrate on configuration management and, once that was a bit in order, we wanted to work on the development practices. Not so much the Scrum management practices but really the development practices: the automated builds, the unit testing, and eventually ARA, application release automation into their test environments, and getting testing going.

Our strategy was to get only to that level, as quickly as possible, in a pilot program; three programs were going to come through. Then we would take all our learning, take a break, codify it into the platform and open our doors for business for the rest of the systems to come in.

   

4. During that process, what were the biggest challenges you faced?

Well, actually, that strategy held up fairly well. Early on, the platform went in pretty well. We were a little delayed because of networking considerations: there was a demand for external access, so we had to work with the networking and security teams to open up a DMZ that allowed outside access for contractors yet provided all the services, the disaster recovery and high availability, that you get on a production system.

I think past that, the next challenge was really an operations team that maybe didn't believe in the vision yet, and a test team that had never seen automated testing at any kind of scale. That forced us to jump in and act on their behalf: we wrote the first deployment scripts, we wrote the first test scripts. That slowed us down quite a bit.

   

5. You mentioned some development practices like configuration management, promotion of binaries and automated testing. I guess we expect those to be common practice already, but in reality a lot of organizations are still not there. Why do you think those were never adopted before you came in? Lack of skills, prioritization or resources?

It's probably a little of all of that. They certainly had teams whose responsibility was configuration management, deployment and testing. To be honest, with the push I often see in the United States around Scrum and the business-side practices, you often see a lot of certified Scrum practitioners with no real grounding in the development practices that support and back them up. So I don't think I've ever been to a place that didn't have configuration management, but often there wasn't the rigor behind the process.

Almost every organization invests to some degree in automated testing, but they've never seen it implemented at scale and they don't know exactly what they're looking for. So we found that showing testers how to write small, atomic tests that are data-independent and portable across environments really tends to click once they've seen it. So I think it's lack of exposure in a lot of ways.
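To make that concrete, here is a minimal sketch of what such a small, atomic, data-independent test can look like in Python with Selenium. The base URL environment variable, the page path and the element ID are hypothetical placeholders, not details of the agency's actual application.

    # A small, atomic, data-independent UI test: one behaviour, one assertion,
    # no hard-coded host names and no reliance on data left behind by other tests.
    import os
    import unittest

    from selenium import webdriver
    from selenium.webdriver.common.by import By


    class SearchSmokeTest(unittest.TestCase):
        def setUp(self):
            # The environment decides where the test runs, so the same script
            # is portable across dev, test and staging.
            self.base_url = os.environ.get("APP_BASE_URL", "http://localhost:8080")
            self.driver = webdriver.Firefox()

        def test_search_shows_results_panel(self):
            self.driver.get(self.base_url + "/search?q=smoke")
            panel = self.driver.find_element(By.ID, "results")
            self.assertTrue(panel.is_displayed())

        def tearDown(self):
            self.driver.quit()


    if __name__ == "__main__":
        unittest.main()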

   

6. Speaking about the technology stack, there have been several examples in the US and abroad of federal organizations moving from proprietary technology stacks to open source and cloud. Did you encounter similar experiences, or did you face resistance to tool and technology changes?

Right. So our charter on this particular engagement was to create a platform that supported development activities; we were not the development team itself. But we did play witness to exactly what you're talking about. What we typically tended to see was that the engineering teams were very strong on moving towards open source: "I want to use Tomcat instead of WebSphere", right? Those kinds of debates would rage on, and often I would see the organization hesitant to make that move, largely because they didn't quite yet trust this open source community thing. They wanted a vendor behind the product and they wanted to make a standards choice, so they wanted to go with a vendor and make that the de facto standard.

I will say I see engineering slowly winning this discussion but it's slow. There are signs of it all over. We do see the engineering teams making more and more moves towards a technology stack that's open.

My customer has their own private cloud they're rolling out, yet they're already doing production deployments to Amazon- and Google-based clouds. If I had to bet on it, I'd say somewhere off in the future we're moving towards open source and open standards.

   

7. In terms of the choices you made for the platform you worked on, can you expand a bit on your process for deciding which tools and technologies to adopt?

Sure. Like I said, I'm from an organization really steeped in Agile development. So our one mandate was that it had to be open source, which is where we lean anyway. We chose Subversion for the SCM; there was some debate around Git, but there had been zero exposure to Git among the 500-plus contractors at this site, and we thought that might be a bridge too far. So Subversion was the SCM. We chose Jenkins, at the time still Hudson; Oracle was kind of reclaiming Hudson and we were a little worried it might go to a commercial license, so we went with Jenkins for the build automation. And then Nexus open source for artifact repository management and Sonar for continuous inspection and dashboarding.

We did bring in Selenium as an open source test tool to kind of show them the art of the possible with the test tool. So yeah, we were given a long leash to bring in our own products and pretty much the full suite is open source.
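As a rough illustration of how pieces of that stack can be wired together, here is a minimal Python sketch that triggers a Jenkins job over its remote API, waits for the result, and then asks SonarQube for the project's quality-gate status. The host names, job name and project key are hypothetical, authentication (API tokens, CSRF crumbs) is omitted, and it assumes a SonarQube version that exposes the quality-gate web service; it is not the pipeline this team actually built.

    # Sketch: kick off a build, wait for Jenkins to report a result, then read
    # the Sonar quality gate. Simplified: no auth, and it assumes the build we
    # just queued is the one that becomes "lastBuild".
    import time
    import requests

    JENKINS = "https://jenkins.example.gov"        # hypothetical host
    SONAR = "https://sonar.example.gov"            # hypothetical host
    JOB = "next-gen-platform-build"                # hypothetical job name
    PROJECT_KEY = "gov.agency:next-gen-platform"   # hypothetical Sonar key


    def build_and_check_quality_gate() -> bool:
        # Queue the Jenkins job; Jenkins returns immediately.
        requests.post(f"{JENKINS}/job/{JOB}/build").raise_for_status()

        # Poll the job's last build until a result is available.
        while True:
            build = requests.get(f"{JENKINS}/job/{JOB}/lastBuild/api/json").json()
            if build.get("result"):
                break
            time.sleep(10)
        if build["result"] != "SUCCESS":
            raise RuntimeError("build failed: " + build["result"])

        # Ask SonarQube whether the project passed its quality gate.
        gate = requests.get(
            f"{SONAR}/api/qualitygates/project_status",
            params={"projectKey": PROJECT_KEY},
        ).json()
        return gate["projectStatus"]["status"] == "OK"


    if __name__ == "__main__":
        print("quality gate passed:", build_and_check_quality_gate())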

   

9. Was that because of legacy or other technical constraints?

It was kind of negotiated how we landed there. In our early pilot, we were doing deployments via shell script just to show how it would work. When it was time to really production-harden the approach, we actually started with Puppet, because there was some Puppet work in-house, and we actually deployed our own platform using Puppet. What we found was a real challenge in bringing dev and ops together using Puppet. The engineers saw it as a little extra coding burden; not all of them knew Ruby, not all of them knew how to do it. And the operations team wasn't quite as steeped in the engineering practices. So it just wasn't catching on, and it became obvious knowledge transfer wasn't going to work.

So we moved to Ansible, which kind of respected the roles. We had project teams creating playbooks and operations creating inventory files, and you put those two together and that runs your deployment in Ansible. So that worked great.
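As a rough sketch of that division of labour, here is what combining a team-owned playbook with an ops-owned inventory can look like when driven from Python; the file paths, the extra variable and the environment name are hypothetical, not the agency's actual layout.

    # Sketch: the project team owns the playbook (what to deploy and how),
    # operations owns the inventory (which hosts), and a deployment is just
    # the two run together with ansible-playbook.
    import subprocess


    def deploy(app_version: str, environment: str) -> None:
        playbook = "playbooks/deploy_app.yml"           # owned by the project team
        inventory = f"inventories/{environment}/hosts"  # owned by operations
        subprocess.run(
            [
                "ansible-playbook",
                "-i", inventory,
                playbook,
                "-e", f"app_version={app_version}",  # pin the artifact to deploy
            ],
            check=True,
        )


    if __name__ == "__main__":
        deploy("1.4.2", "test")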

Concurrent to all of that, the operations and network teams were working on system-level deployments of Linux and some middleware, and they were using Puppet, which we think is appropriate. So right now we're in a hybrid state: the system goes in using Puppet and applications generally go in using Ansible.

   

10. Was it also important that Ansible works on a push model instead of a pull model?

Very important, yeah. Early on we bumped into cases where we were unable to put a Puppet agent on a machine. For example, the web servers sat in a DMZ where by default they weren't getting the Puppet agent installed, and on F5 devices you can't put a Puppet agent at all. You can't pull down those rules, so Ansible is very useful in pushing them.
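To show what the push model means in practice, here is a tiny sketch: from the control node, Ansible only needs SSH access to reach those hosts, so nothing has to be installed on or pulled by the target. The group name and inventory path are hypothetical.

    # Sketch: an ad-hoc push to the DMZ web servers. If SSH works, Ansible can
    # push configuration there; no agent on the target, nothing to pull.
    import subprocess

    subprocess.run(
        ["ansible", "dmz_web", "-i", "inventories/prod/hosts", "-m", "ping"],
        check=True,
    )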

   

11. In terms of the security and compliance constraints required for that agency, did they impact the kind of changes and improvements that you were making? I guess what you just mentioned about using Ansible was an example of that?

Well, compliance certainly impacted us. In pilot mode, some things are forgiven; as you move into production, nothing is forgiven. So we had to begin to build a mechanism to collect the metrics upon which go/no-go decisions were being made to move forward to the next steps in these pipelines. And we had to collect the artifacts that were created, such as test reports or WebInspect security scan results. We had to talk about the retention policy for these things and where we were going to store them. So it took quite a bit of negotiation and effort to get the platform to support those kinds of things.
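As an illustration of that kind of mechanism, here is a minimal Python sketch that copies a stage's artifacts into a retention area and derives a go/no-go decision from a couple of metrics. The paths, file names and thresholds are hypothetical assumptions, not the agency's actual policy.

    # Sketch: archive what a pipeline stage produced, then decide go/no-go.
    import json
    import shutil
    from pathlib import Path

    RETENTION_DIR = Path("/var/audit/pipeline")  # hypothetical retention location


    def archive_artifacts(run_id: str, workspace: Path) -> Path:
        """Copy the stage's reports into a per-run retention folder."""
        target = RETENTION_DIR / run_id
        target.mkdir(parents=True, exist_ok=True)
        for report in workspace.glob("reports/**/*"):
            if report.is_file():
                shutil.copy2(report, target / report.name)
        return target


    def go_no_go(metrics: dict) -> bool:
        """Decide whether the release moves to the next pipeline stage."""
        # Example thresholds only; the real criteria are negotiated with the
        # compliance and security stakeholders.
        return (
            metrics.get("failed_tests", 1) == 0
            and metrics.get("high_severity_findings", 1) == 0
        )


    if __name__ == "__main__":
        archived = archive_artifacts("run-001", Path("workspace"))
        metrics = json.loads((archived / "metrics.json").read_text())
        print("promote to next stage:", go_no_go(metrics))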

Now, security really wasn't as much of a challenge. They had three primary tools they used: WebInspect at the app layer, Nessus at the system layer and then NGS SQuirreL at the Oracle database level. Those we were able to integrate with the platform fairly easily. The real challenge there was how do you train developers, who had never really seen this data before, to read it and make it useful early in the development cycle? So that's really where the work was.

   

12. Were the results from these tools integrated into a delivery pipeline, or was it more manual?

Right. They weren't at first, so it was very manual. The results would be located in the workspace of the Jenkins job that kicked them off, for example, so you had to go all over the place to find all the various results of a given pipeline release. Right now we have pipeline development underway: a custom pipeline that sits on top of all these tools, orchestrates their running, gathers data and creates a kind of end-of-run report that satisfies audit and also retention rules.
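As a rough sketch of what such an orchestration layer can produce, here is a small Python example that records where each tool left its results and writes a single end-of-run report, instead of leaving results scattered across job workspaces. The stage names, result paths and report location are hypothetical.

    # Sketch: one end-of-run report pointing at every stage's results.
    import json
    from datetime import datetime, timezone
    from pathlib import Path


    def write_run_report(run_id: str) -> Path:
        # In the real pipeline these locations would come from actually running
        # the build, test and scan stages; here they are placeholders.
        stage_results = {
            "build": "artifacts/build/app-1.4.2.war",
            "unit-tests": "reports/junit/",
            "functional-tests": "reports/selenium/",
            "security-scan": "reports/webinspect/scan.xml",
        }
        report = {
            "run_id": run_id,
            "finished_at": datetime.now(timezone.utc).isoformat(),
            "stages": stage_results,
        }
        out = Path(f"reports/pipeline-{run_id}.json")
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(json.dumps(report, indent=2))
        return out


    if __name__ == "__main__":
        print("end-of-run report written to", write_run_report("run-001"))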

Manuel: Are you aiming to have a continuous delivery pipeline in the future? I believe you're not there yet.

Dan: Well, people use the term continuous delivery; I'm not sure that's yet on the table as a goal. We still have quite a bit of work to do simply to get a true DevOps [culture] working and a pipeline that is fully functional. The goal that we're working under right now is to have deploys to production in under 24 hours from the point of code commit. If we can get there over the next year, I would say that's a fantastic victory, and then it would be appropriate to talk about true continuous delivery.

   

13. Do you believe that federal agencies maybe face specific DevOps challenges in terms of the culture and the collaboration and sharing that is required? Or is it just that each organization has their own context and their own challenges?

I do. Speaking as someone who worked in the commercial world his entire career and has only been working in the federal sector for the past five or six years, there's a host of problems in the federal world that I never came across in the commercial world.

It all starts with the procurement cycle itself. The federal procurement cycle still doesn't understand Agile development very well. There are attempts coming out of groups like 18F within GSA, but it's still not very widespread. How you write a contract that allows resources to move in and out with the fluid nature you need for a DevOps initiative, they still don't quite understand. Even when you fight your way through that, contractors land on site, and these contractors are transient: any training you put into them walks away as soon as that contract ends. It's one of the biggest issues I've seen.

And then, it's not a political statement, but with the government employees themselves, the labor unions they work under pretty much put a bullhorn in the hand of anyone who does not agree with the movement forward, to amplify their disagreement. We actually had a stakeholder who, we would joke, spent quite a bit of time in meetings dealing with grievances from various parts of the organization. So it takes quite a bit of fortitude to be a stakeholder in the federal government trying to bring this into place.

There are other things. There is policy at every level, policy and compliance that we've talked about. Even just shifts in administration, congressional shifts of power, budgetary shifts, all these things can affect a DevOps initiative in the federal world.

   

14. Going back to the move from a pilot, when you were testing out the changes, to then opening up to the whole agency: I believe there were several growing pains that you mentioned during your talk as well. Can you briefly summarize them?

Sure. Some of the growing pains came from a business decision. We had completed a pilot, and we knew a lot of work had to be done to harden it. The customer made a painful decision that they needed to make: we needed to open the doors to more projects before we were ready. The reason being, they were having such problems with their CM, build and deployment activities that they thought it would be more painful to stay in that state than for us to build the plane while flying it. So we opened the gates and allowed more projects in. It really put us into a six-month tailspin as we had to build out the process around the tools to support it.

So the big problem was that, for a while, no two projects looked the same. Even though they were all doing builds, bundles and deploys, every team was branching and tagging for a different reason and with a different standard. Deployments, like I said, started with shell scripts, and it was only later that we made them consistent with Ansible or Puppet. The operations team's hesitance, until they had seen some cycles, was clearly part of the growing pains. The test team continues to be a growing pain in that we finally got past tool selection, but now it's about educating 130-plus test team members on how to write scripts that aren't brittle, that can work in any environment, et cetera. So those are a couple of them anyway.

   

15. Final question, you are still working for this organization, right?

Yes.

Manuel: So how do you envision being able to say, "Okay, my work is done here because these and these patterns or practices are in place"? How do you see that happening?

I think it's a fantastic question; I was asked this earlier today. I think I'll know it's time for me to leave when I see more evidence of organic DevOps going on, and we're seeing good signs of that now. To me, organic DevOps is when a developer has a new component they're building or deploying and they know to reach out to the operations team and say, "Hey, let's get together on this. In our case, using Ansible, I'm going to do my playbook; can you update our inventory file? Let's make sure we're all good," without me ever knowing about it. Once I see evidence of that, I know it's time to pull the customer aside and say, hey, maybe you're arriving; we can throttle down on the subject-matter-expert type people you have floating around here and bring in more of the O&M [Operations & Maintenance] kind of operationalized teams. So that's what I'm looking for.

Manuel: Well, thank you very much, Dan.

Oh, thank you very much.

Sep 20, 2015
