Debate: What is the Role of an Operations Team in Software Development Today? [Updated May 10th]
[Last update: May 10th, 05:00 GMT - final notes and summary added]
In the last several years, with the rise of such phenomena as Cloud Computing and DevOps, there has been some debate about the role of the traditional Operations team as it is often found in today's software development shops. InfoQ will explore this debate further, to get an understanding of the different aspects which are involved and the tradeoffs of each approach.
A big question which lies at the core of this debate is:
Who should be responsible for the management, monitoring and operation of a production application?
To start off the discussion around this debate, we have asked for input from Bjorn Freeman-Benson, Director of Engineering at New Relic (a provider of SaaS-based application performance management tools), and Carlos Armas, the lead editor of InfoQ's Operations community. However, this is just the beginning of the debate - as the discussion grows and evolves InfoQ will update this article with the latest discussion, and the discussion on Twitter is also being tracked by following the #roleofops tag. We want you to participate in this debate, so please feel free to send a tweet with the #roleofops tag in it, send us an email at email@example.com with your input, or leave a comment on this post to add your voice to the debate.
Bjorn Freeman-Benson: There has been a lot of talk, blogging, and general powerpointing about the topic of the changing role of operations. A lot of that communication has included a plea to bring development and operations closer together. They may call it Dev-Ops or another catchphrase, but it's always about how Development and Operations Management need to communicate more effectively to manage production applications. While my colleagues here at New Relic and I agree with that generally, we think companies should consider a more definitive step toward more effective operations management - that is, consider making each development team responsible for the deployment and performance of their own applications.
Let's consider some reasons why this is not as unusual or as radical as it may sound.
First, who knows more about an application than the team that created it? Leading up to the moment of deployment, either in the datacenter or a Cloud, the application development team has conferred with the business app owners, designed, architected, and coded the application, selected and tested and integrated the various components of the application (app servers, OS, databases, integration middleware, etc.), created prototypes, run functional tests and maybe load and scalability tests, demonstrated the capability to the business, and finally got the app ready for production. How can weeks or months of collective knowledge possibly get transferred to the Operations team?
Shouldn't the dev team make the critical deployment choices - whether to scale up or scale out the hardware hosts, whether to virtualize hosts, what the optimal CPU and memory are, etc.? Shouldn't the dev team decide what performance monitoring and logging they need? Shouldn't the dev team monitor, deal with alerts, handle performance and availability incidents, and deal with the cranky calls from the business when the app goes wacky? (Why should Operations have all that fun?) I know this approach can work - we use it ourselves at New Relic to manage one of the busiest SaaS applications in the world. Ironically, our SaaS application is one used by nearly 4,000 dev teams to manage their production applications.
Carlos Armas: It is attractive to think that because a team designed and built a system it is the best prepared to operate, monitor, and scale it. Extrapolating that logic, I would contend the development team should be doing the company accounting because software developers are good with numbers, or cleaning up the office after hours and taking out the recyclables to the containers outside since they care for the environment. Doing so ignores a practice that is several hundred years old: division of labour. The fact that the development team knows the application better than anyone else is not a good reason to give them responsibility for operation and maintenance once the code is released to production.
I will assume the role of a business owner, and give you three reasons why I do not want this to happen to my business:
1) Financial: A software developer's average salary is consistently higher than an operations support engineer's average salary. As a business owner, why would I want my most expensive team to be performing operational tasks which other teams can do in a more financially effective way? As any good ScrumMaster in any Agile team would tell us, removing obstacles and making hidden work visible is one of their primary tasks, so the development team can more efficiently develop code. Why would I willingly bring obstacles and extra work to the team to reduce its velocity?
2) Quality: I believe in checks and balances, because nobody is perfect. When a team is responsible for the full lifecycle of the application, the customer is the one who ultimately suffers the consequences: a bad user experience. When the team that develops the code is the one that decides whether its quality is good enough to release, there's an increased risk of releasing imperfect code. Owning the full lifecycle makes people complacent, and the "we can always fix it in production" syndrome kicks in.
As a business owner, I would prefer to keep instead a system of checks and balances, so our customers are well-served and reward us for that:
- I would not want the development team responsible for doing QA, but accountable to the QA team for the quality of the code
- I would not want the QA team to release and operate the application in production, but instead be accountable to the operations team for thoroughly testing the application so issues are detected earlier and bad product rejected back to development before it hits customers
- I would not want the operations team to assume the role of constraining the frequency at which we release new functionality to customers, but be accountable for application availability and performance, and be responsible for releasing and operating the code, as well as feeding back bugs and imperfections that made it out despite the inspection and test process
Essentially, I would want quality to be a shared responsibility, with teams accountable to each other for very specific areas and roles. And I would want the relationship among the teams to be one of positive conflict fuelling continuous quality improvement, with the customer as the only end in mind.
3) Opportunity cost: Who is going to develop new features to improve the application when the software team is busy operating it? Who is going to fix software bugs? Will developers leave in the middle of a next-generation design session with the product team because a pager went off alerting them of an Amazon EC2 instance failure?
Bjorn Freeman-Benson: Secondly, what really is the role of operations when apps are deployed in the cloud? More and more web applications are being deployed onto public or private cloud infrastructures. At New Relic, more than 40% of our customers have apps deployed in a cloud, and we expect that group to be in the majority by the end of 2010. Granted, many of these are smaller companies without the legacy datacenters or data integration worries that larger enterprises have to deal with. But every day we sign up a new customer who represents a large organization with a traditional datacenter, and the customer's application is deployed at Amazon or another public cloud.
So, in these cases, what role does Ops play? They are not responsible for cloud hardware, networks, telecom. They don't make the choice about infrastructure monitoring. The only thing belonging to their company in the cloud is the application code itself. Everything else is the responsibility of the cloud infrastructure provider. So, to put it bluntly, when my app is deployed in a cloud, who needs Ops? The Dev team must be responsible for the successful performance of cloud based apps. If not them, no one will.
Carlos Armas: Interesting, it seems as if folks expect systems in the cloud to manage themselves, which is a mistake. Let's go back a few years to illustrate the point and talk about managed services.
Managed services as we know them today began when hosting providers realized they could potentially go beyond offering hardware and bandwidth to their customers, and include systems administration and application management in their portfolio. It turned out to be a tricky business, and it was very hard to deliver with consistency and good quality. Only a few providers got it right, and at a very high price.
From the above perspective, cloud computing pushes system administration and application management back to the customer. (We do not intend to define cloud computing here, it probably deserves a different debate :) )
So what is Ops going to do in the cloud? Systems administration, to begin with. Application deployment, monitoring, issue escalation and response. 24x7 on-call support does not change because the systems are in the cloud. System administration (be that Unix, Linux, Windows, etc.) is an activity developers do not do well because it is not their area of expertise, just as system administrators are not good software developers, or good marketing and communication executives.
However, the composition of the operations team changes gradually as (and if) cloud computing becomes pervasive. You mentioned hardware, telecom, and networks previously. Obviously the demand for those skills will migrate from the cloud customer's organization to the cloud provider. The system administrator role will remain with the customer, and will be just as badly needed as before (possibly more so) since the provider no longer manages the OSes of your virtual servers for you. (And remember, as a business owner I want the software engineers to develop code, and my ScrumMaster to prevent them from doing hidden work while also removing other obstacles in their way.)
We could also argue that in such a scenario the Operations team ceases to exist, and the operations engineers become part of the software development team, or another team. That could be possible, and in fact in small organizations it currently happens. However I do not think we are concerned about the organization chart here, but more about the role of operations and development within the context of emerging hosting technology.
Essentially, my perspective is that the software development team should not be responsible and accountable for operations tasks not because they are not able to do it, but because it makes no sense financially, organizationally and business-wise.
Follow-up added April 16th
Bjorn Freeman-Benson: Before I provide another "proof point" in my argument that developers should be responsible for production operations for their applications, let me respond to a couple of Carlos' points. Carlos, I know you are being facetious when you say "[maybe] the development team should be doing the company accounting because software developers are good with numbers." No, I am not saying developers are capable of doing everyone else's job. But, we are not talking about having developers use operations skills that are way outside of a developer's capabilities. Accounting is pretty foreign to developers. But configuring hardware, deploying the software stack, managing the server capacity for a given application, establishing backup schedules - none of these Ops tasks are beyond the capability of the typical developer.
Further, when we are talking about Cloud, you say "...it seems as if folks expect systems in the cloud to manage themselves, which is a mistake." No, clearly cloud infrastructures will not manage themselves. But we are saying the sys-admin work associated with the cloud is done by the cloud provider. So I am saying that if we are running IT in a company whose apps are deployed in the cloud, why do WE need sys-admins? We need the developers to do most of the application management save what the cloud team does. If one uses one of the cloud platforms - from RightScale, Heroku, Stax, Gigaspaces, EngineYard, and others - most of the sys-admin-related work is done by the platform, not by human sys-admins.
Now let me give you another reason why Developers should take responsibility for the application in production. Who is better able to find the root cause of performance problems than the team that wrote the code? Let's say the Operations team is alerted, either by a performance monitoring tool or by the inevitable angry phone call from the business people, that an application's performance is lagging. What can the typical operations staffer do with that information? If they are extremely lucky they will be able to isolate the cause of the problem to some piece of failing infrastructure. Frankly that is relatively rare. Usually the various infrastructure monitoring tools Ops uses are showing all "green lights." More often than not, the problem lies inside the application. Who knows more about what is happening inside that VM than the development team? Most ops people are not developers, cannot read code and would not be able to track down a problem whose root cause was buried within the application layer. Why have a middle man in operations get alerted simply so he can turn the problem over to the dev team anyway? Just have the developers get the alerts and jump on the problem sooner! If it's their code that is the problem they should be the ones remediating the problem. The secondary consequence of this system is that developers become a bit more diligent about the code they push into production, knowing they have to live with the results.
Follow-up added April 19th
Carlos Armas: Yes, I was in a very facetious mood. :)
I find humour to be a wonderful vehicle in facilitating communication and understanding. The more I read your comments, the more it feels as if our ideas are not as opposite as they look to the casual observer or at first sight.
I agree developers are able to learn and perform many Ops tasks, definitely. Keep in mind I still want them to be concerned with writing code in the first place, as I wear my business owner hat. I will go back to this point as I address a very important point you make below.
We have a very fluid, and evolving concept here: "cloud computing". If you ask me to define it, I will probably run the other way. (I do not want to attract the attention of the hordes of modern-day cloud computing evangelists).
Suffice it to say that cloud computing is very broad. It covers services that require minimal or possibly no professional operations support (Heroku, et al.). It also covers services such as Amazon's EC2, Rackspace's RackspaceCloud, Opsource's OpsourceCloud (to name a few) where there's a substantial amount of Ops work involved, depending on the kind of application to support.
There is a strong case to make for a SaaS provider focused on a very specific service to have a homogeneous team that keeps the service ticking from concept to delivery, with razor-sharp focus on delivering the application (which might be the case of New Relic).
One possible contrasting example would be a company that decides it doesn't want to spend a lot of capital in powering a development environment and moves its development infrastructure to EC2. Fast provisioning, quick turnaround, what's not to like? There are many other examples that come to mind.
So the moral of the story here is, "it depends".
With regard to your point about root-cause analysis, let me start my counterpoint by highlighting a couple of your statements which sounded like music to my ears:
- "The secondary consequence of this system is that developers become a bit more diligent about the code they push into production, knowing they have to live with the results"
- "Why have a middle man in operations get alerted simply so he can turn the problem over to the dev team anyway?"
As I see it, if Dev starts doing Ops tasks:
- Developers would be more diligent, as they would have to live with the results of the code pushed to production
- This would resolve the middle-man syndrome (Ops) caught in-between, not able to fix the problem yet accountable for the failure
Let me add my own grievances, based on direct experience:
- Why is it that, in a number of companies (apparently a large number) Ops is perceived (and acts) as an entity that blocks change, and adds red tape to workflow to the point that it is almost impossible to be agile and release code to production?
- Why does Operations consistently stonewall Development?
- Why is it that Development circumvents strict company policies, and ends up buying cloud-computing services (because Ops did not provide the service in the first place)?
- Why is Ops penalized for missing an SLA target if the failure was not related to faulty infrastructure or processes, but failing code?
- Why isn't Ops taking advantage of the rapid, flexible deployment capabilities of cloud services? Is it that the "guardian" mentality of '70s computing practices is still alive?
Something seems clear to me: there's conflict now, but not being versed in the history of information technology practice, it is hard to tease out the reasons why we got here in the first place. My untested and simplistic theory is that change is bad for operations, while software development is change. So there's a "primal" contradiction that needs to be managed wisely.
Let me put my business hat back on, as I run cost numbers and try to provide a process to manage (wisely?) the conflict. Let me express it in Agile user-story statements - as a business owner, I want:
- Software engineers not doing the company accounting (didn't you see that coming? :) )
- Operations engineers primarily focused on 24/7 service availability
- Software engineers primarily focused on service improvement
- A zero-wall policy between development and operations
- Operations engineers as core members of Agile teams
- Software engineers regularly rotating through on-call duties for third-level escalations
  - In pain, there is learning
- Development and Operations sharing responsibility for application availability and latency
  - I want to be blunt here: missed target, no bonus for either team, and I do not care who broke it. Corollary: in financial pain, there is learning
- Operations engineers required to learn the core, essential parts of the service application layer
  - To the point they can help set up trend monitoring, and be able to predict failure build-up scenarios
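As one illustration of the trend monitoring mentioned in the last item, here is a minimal sketch in Python. It assumes a metric (for example, disk usage in percent) sampled at even intervals, fits a straight line through the samples, and estimates how long until a threshold is crossed. The function name, the sample values, and the choice of a simple linear fit are all assumptions made for illustration; real failure build-up is often not linear.

```python
from typing import Optional, Sequence


def hours_until_threshold(
    samples: Sequence[float],
    threshold: float,
    interval_hours: float = 1.0,
) -> Optional[float]:
    """Estimate hours until an evenly sampled metric reaches `threshold`.

    Fits a least-squares line through the samples. Returns 0.0 if the
    latest sample is already at or above the threshold, and None if the
    trend is flat or falling (no crossing predicted).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    if samples[-1] >= threshold:
        return 0.0

    # Sample positions are 0, 1, ..., n-1, so their mean is (n-1)/2.
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    if slope <= 0:
        return None

    intercept = mean_y - slope * mean_x
    # Position on the fitted line where the threshold is reached.
    x_cross = (threshold - intercept) / slope
    # Measure from the most recent sample, never reporting a negative time.
    return max(0.0, (x_cross - (n - 1)) * interval_hours)


if __name__ == "__main__":
    disk_usage_percent = [50.0, 52.0, 54.0, 56.0]  # hypothetical hourly samples
    print(hours_until_threshold(disk_usage_percent, threshold=90.0))
```

An ops engineer who knows which application-level metrics matter (queue depth, connection pool usage, and so on) could feed those into the same kind of projection rather than only infrastructure counters.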
Essentially, I still want my software development team to write code and build new features that amaze our customers, and not be distracted by anything else (including cloud computing). I still want my Operations team (either a multi-engineer team or a part-time remote sysadmin guy) to be tuned up and extremely responsive to the team that builds the stuff our customers want. I want my software development team to be accountable for the code they wrote and deemed fit to release to our customers. And I want my Operations to learn the application layer to the point they can call out bug vs. infrastructure anomaly (and as the icing on the cake, stop complaining about change).
Follow-up added April 20th
Bjorn Freeman-Benson: Carlos, thanks for clarifying your position. In brief, I agree with most of your observations about how Ops is perceived by Devs ("change-blockers") and I would add how Dev is perceived by Ops ("a bunch of friggin' cowboys.") And I see we have some interesting reader comments about our debate. I have also gotten some feedback offline (thanks @markimbriaco and @randybias) to the effect that my position comes across as black or white, and militantly against an Ops function. I didn't mean to, and hope I didn't portray Ops people as completely unnecessary or incompetent. I do not believe that no company needs an operations function. That is clearly not the case.
My position and my perspectives are focused on applications, and who has responsibility for them. After all, what is IT for, if not for application development and operations?
Let me use this posting to clarify and to react to some of Carlos' comments above.
First, all the discussions I have heard and read by ops people (and a good one to read is by one of our commenters, John Willis) tell us that no one knows better than the ops people themselves that the role of operations is changing pretty dramatically. Carlos, your comments show that, too. They can see that their datacenters and applications are different than they were only a few years ago. It used to be that a datacenter contained a hodge-podge of proprietary technologies - a mainframe, some AS-400s, an RS-6000, some DEC minis, and some Wintel servers that "those web guys" used, plus a bunch of storage devices which needed frequent care and feeding. There are still datacenters with this kind of variety, and in those, and for the teams running them, I think maybe less is changing. However, it's not uncommon today for a datacenter to consist of 1000 Linux/Tomcat blades, all nearly clones of one another. It's also not uncommon for nearly all the applications to be web-based (Java, .Net, Ruby, PHP), and in those datacenters there will be fewer management tools to learn and fewer proprietary systems to support. Cloud computing takes this picture to an extreme. So in more and more cases, the role of the ops team is being simplified by this standardization and commoditization. It's our contention that the picture I paint is becoming the norm rather than the exception.
Even in the case of the highly standardized datacenter (my 1000 blades example) there is still a role for operations. There are numerous jobs that need specialized knowledge - database administration, capacity planning, data backup and restore, disaster recovery planning, power management, telecom management, and a lot more. The people who perform these functions do so for the whole datacenter (or for the cloud provider if that is who employs them.)
The crux of my argument is that responsibility for application management should largely reside with the application development team, not with the operations team. And Carlos, (this will be a shocker) I completely agree with your observation that:
"My untested and simplistic theory is that change is bad for operations, while software development _is_ change. So there's a 'primal' contradiction that needs to be managed wisely."
It is my contention that developers and architects are better prepared to make deployment, monitoring, and incident management decisions than the ops team because of their intimate knowledge of the application architecture and language. In the case of application management, a separation of responsibilities between Ops and Dev is less efficient. It's less clear who is responsible to the business for the success of the app. And finally, by putting ongoing management of apps squarely on the job description of the developers, your application quality will improve. Developers will no longer be allowed to hand off a poorly coded app to Ops and walk away from the ensuing mess.
I like your Agile story approach to "What the business owner wants" but I would like to hear some reader comments before I comment on those.
Follow-up added April 21st
Carlos Armas: As much as I would love to believe the role of the operations team is being simplified (and I wish it were), I see the opposite happening.
The operations role, in my opinion, has been misunderstood and later minimized over the last 15 years or so. Not too surprising, because in large part it was the fault of Ops.
It started in mainframe times, when the MIS (Managers of Information Systems) took on the role of "priests of the computing temple". The adjudicators of "machine time" behind the glass walls operated with the principles of rite, secrecy, and separation. Too good for scrutiny, too in control to challenge in the realms of the business playfield.
Times have changed. As I see it, the simplest parts of the job have been slowly fading away. We no longer segregate /usr/bin to fast and slow hard disks, or nurture and pamper that 12GB-memory Sun E4500 that took over the place of a deity in the datacenter. I cannot remember the last time I used a crimping tool to make my own cables (thank heavens!). I also cringe and contort when I have to compile something because yum will give me a slightly older version which won't cut it.
I would say that the physical tasks of operations have long disappeared from our job description, and have been pushed down and away to starting/supporting roles. On the other hand, our job got increasingly complex. The multi-server homogeneous datacenter (even the virtual, 'cloudy' one) brought a different, higher level of headaches and complexity. With puppet and other related automated deployment mechanisms came what I call "the atomic risk". A simple typo in /etc/sudoers on a single server might have been easy to fix; now we have the multiplier/accelerating effect of automation, which helps the error spread in a matter of minutes, if not seconds, to thousands of servers.
Our daily challenges have changed from "why is the compilation bombing?" to "how do I cajole my puppet module that deploys app Y to 120 servers to install release X, but not release X+1 before it's ready, so I do not end up with alpha-quality code in our production instances?" In that sense, I love how the 'constraints of the physical world' are becoming less of an obstacle in a cloud-provider environment. Negotiation, procurement, logistics, racking and configuration are done beforehand. That's progress. The trend began way before cloud computing, and is definitely welcome. Yet my job is not getting simpler, though I am getting way more done with the way automation has come to help.
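To make the "release X, but not X+1" problem and the multiplier effect of automation concrete, here is a minimal Python sketch of one common mitigation: pin the approved release and roll out in growing batches, starting with a single canary host and stopping as soon as a batch fails its health check. This is not Puppet's API. The `apply` and `healthy` callbacks, the host names, and the batch sizes are hypothetical placeholders for whatever deployment and monitoring tooling is actually in use.

```python
from typing import Callable, List


def plan_batches(hosts: List[str], canary_size: int = 1, growth: int = 10) -> List[List[str]]:
    """Split hosts into batches: a small canary first, then progressively larger ones."""
    if canary_size < 1 or growth < 1:
        raise ValueError("canary_size and growth must be at least 1")
    batches: List[List[str]] = []
    start, size = 0, canary_size
    while start < len(hosts):
        batches.append(hosts[start:start + size])
        start += size
        size *= growth
    return batches


def staged_rollout(
    hosts: List[str],
    release: str,
    approved_release: str,
    apply: Callable[[str, str], None],   # hypothetical: deploys `release` to one host
    healthy: Callable[[str], bool],      # hypothetical: post-deploy health check
) -> List[str]:
    """Deploy only the approved release, batch by batch.

    Returns the hosts that were touched. Stops after the first batch that
    contains an unhealthy host, so a bad change reaches a handful of
    servers instead of all of them.
    """
    if release != approved_release:
        raise ValueError(f"release {release!r} is not the approved release {approved_release!r}")
    touched: List[str] = []
    for batch in plan_batches(hosts):
        for host in batch:
            apply(host, release)
        touched.extend(batch)
        if not all(healthy(host) for host in batch):
            break  # halt the rollout; remaining hosts keep the old release
    return touched


if __name__ == "__main__":
    servers = [f"web-{i:03d}" for i in range(120)]
    deployed = {}
    touched = staged_rollout(
        servers, "X", "X",
        apply=lambda host, rel: deployed.__setitem__(host, rel),
        healthy=lambda host: True,
    )
    print(len(touched), "hosts updated")
```

The design choice here is simply to bound the blast radius: the same automation that can spread a typo to thousands of servers can also be made to stop itself after the first few.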
Let's put it this way: it got more complex, but now I have much better tools to assist me. And, as a segue to my next point: I am grateful to the developers that brought to life such automation tools, and that is precisely the reason I want them to keep developing new ideas, and not managing deployed applications :)
I agree with your thoughts about developers and architects being better prepared to make deployment/monitoring/incident management decisions in principle. There is no doubt in my mind developers and architects know better than anyone else what they built.
As for who is responsible to the business for the success of the app, I still have the perspective (maybe an old-fashioned one) that an application is part of the service ecosystem, and can't survive without an infrastructure foundation that supports it -- even in cloud environments, the infrastructure needs management, and I would rather have the folks who are specialized in that area performing those tasks. I guess this is a matter of viewpoints, we will likely agree to disagree here.
Now, going back to developers having app production management responsibility in their job description, as you mentioned: I like it! Immensely. "Give me your tired, your young". Rotate your developers to Ops team positions so they get first-hand experience of the needs and challenges of delivering a consistent, 24/7 SLA-backed user experience, while instilling the app-level knowledge. Same with newly hired developers. In reciprocity (retaliation?) I will rotate my ops engineers in as members of your Scrum teams, so they experience first-hand the 'removal of obstacles', frustration with delays and red tape, etc. This has the added benefit of taking down the (artificial) walls between the teams, so there's no more "us vs. them".
The above would ensure that I keep the developers doing what I need them to do (business owner hat on): building new functionality, but helping transfer the knowledge so the apps can be supported in production efficiently.
Follow-up added April 27th
Bjorn Freeman-Benson: Well, this has been an interesting week. Sorry I haven't posted in a few days. We deployed a significant new feature on our SaaS tool and we kicked off our next round of development. We now include production profiling in our app performance monitoring tool. We push something new at least weekly and do ad-hoc patching several times a week. Keeps us busy. Let me also say that at New Relic we also have a very good sys-admin named Bayard Carlin. I know, you probably thought by my comments that we had no ops people. But no, we have one. He is also our internal IT department serving our employees' needs. I will talk more about Bayard next post.
In looking over the comments from some of the readers, I saw several really good remarks that I would like to highlight and react to. In my next post, I will summarize what we have learned from all of your feedback and from Carlos' insights.
First, David Sims commented "It is indeed good for our developers to be deeply involved in technical support, as it leads to a better product that they produce. However, like Carlos pointed out, as a small business owner, I know it's not always the best use of resources for a developer to answer questions that a skilled support engineer can handle." I agree with both of those points. We have seen that product quality consistently improves with developer involvement in production operations. We also agree with David that it is a challenge for the development team to devote time to operations when they have new code to write. However if David means operations work is not "high value" enough, I dispute that. I am not assigning a value to the work Ops does that is any less than the work Dev does. Just different.
Second, Geva Perry's analysis of the impact Cloud Computing is having and will yet have on the role of operations is very valuable thinking that we should expand on, maybe in another thread one day. At New Relic we have some apps in the cloud and others in a more traditional hosting environment. We have lots of customers, though, deployed in all kinds of cloud environments and we hear from some of them about how they struggle with the new and different demands of that deployment option.
Third, I agree and disagree with John Allspaw's comments. I disagree when he says that (Cloud) automation will not appreciably reduce the number of Ops people. I think it's an inevitable trend. I agree that in most larger organizations, there will remain an Ops organization and success will be measured by the degree to which they learn to collaborate, not the degree to which they obliterate the other.
Fourth, I like Sellers Smith's "signs of a healthy operations environment." I think he is on the right track. I still favor shifting more responsibility for application and service level success to the developers so that there is less emphasis on the hand-offs between Dev and Ops and more on building apps and app platforms with the end in mind. Think of the Design for Maintainability movement in industrial engineering and consumer products and you will see where I am going.
Next post I will summarize what we have learned and solicit your comments.
Follow-up added May 10th
Bjorn Freeman-Benson: This will be my last posting on the debate, though I may jump back in as a commenter if I see some more great comments like those below. First of all, this has been a great experience for me and for my colleagues at New Relic. This has stimulated some interesting internal discussions (see Bayard Carlin's comment below for an example). We have heard from customers, partners, and other friends in the business. It's clear from these discussions, and from your comments in the debate, that the Ops function is not going away for most organizations of any size. There are too many jobs cutting across all of IT needing compliance, governance and standardization that cannot be left to individual app teams. So let's assume you are going to have an Ops function if you are employed by a larger company. Startups and small companies may (and should, in our opinion) successfully blend Dev and Ops into a single role.
In this posting, I would like to summarize our thinking on both the Role of Operations and on some ways Dev and Ops can work more harmoniously, productively and efficiently.
The role of IT operations has changed significantly in the past, say, 10 years. And in the past 2 years the pace of change has accelerated pretty dramatically. On this, I think Ops people and most others in IT can agree. As more applications have moved to a distributed model using web technologies, application complexity has increased at the same time that application development cycle times have decreased. What a bind Operations finds itself in! Ops teams need deeper and more varied skills to manage more complex application environments even while agile methodologies, Cloud deployment platforms, high-performance IDEs like Eclipse, and new application frameworks like Rails, Spring, JEE, Grails, and .NET enable faster and faster application development.
To succeed today, the Ops team will need to adapt to a faster pace of deployment and to a continuous ratcheting up of complexity. In my opinion the role will call for one set of skills that are solely within the Ops team (because they apply across the whole enterprise) and another set of skills that are shared with the Development teams. Shared skills are those where Dev takes the lead but Ops works side by side (more on that later).
These skills reside within the Ops team:
- Hardware and network configuration and deployment
- OS and firmware maintenance
- Application stack maintenance (app servers, frameworks, plugins, etc.)
- Capacity planning
- Storage and backup management
- Disaster planning
- Security and access administration
- Telecom management
And these skills are shared with Development (with Dev taking the lead and responsibility):
- Application deployment
- Application, Network, and Infrastructure monitoring
- Log management
- Database design and administration
- Incident management and troubleshooting
- Application performance management
- Service level management and reporting
The best thing an Ops team leader can do for his people is to provide them with opportunities to cross train in those skills in which Development takes the lead.
Finally, here are some simple recommendations for improving the Ops job:
- Move most Ops staff into cross-functional application teams consisting of developers, Ops, DBAs, and business analysts (product managers)
- Physically co-locate Dev and Ops people into cross-functional units
- Assign cross-functional success metrics that make an entire team responsible for on-time, on-target application delivery and performance
- Make sure the whole team understands how to monitor performance, gather critical data, and interpret performance metrics in a critical situation
- Involve developers in the selection and customization of application and database monitoring solutions
- Insist that the whole team meet with their business customer-counterparts at least monthly to review progress and goals
Thanks for following our debate. It's been a lot of fun. Special thanks to Carlos Armas for challenging our positions and taking on the role of opponent.
What is the Role of an Operations Team in Software Development Today?
We are in the middle of dealing with this very same issue from an organizational perspective. We have a large development organization located in multiple development centers across the country as well as a significantly larger operations organization. Our market differentiators (in some cases good, in others bad) are systems we built many years ago but still depend on today. However, we are starting to bring commercial products into the mix, since we have not been able to keep up with many of the advances (process, framework, methodology or even tool sets) taking place in the marketplace today.
We have a mixed support model with operations folks and development staff supporting operational systems, but neither in a really holistic fashion. Sometimes it depends on the day of the week or the time of the day who takes the first call and tries to help. In time it gets to the right person (mostly), but it takes time (not acceptable from a customer support standpoint) and mostly is not a way to run a quality service organization.
So we have started a 12 month effort to define the difference between replacement, enhancement and bug fix on the development side as well as from an operational application support standpoint; what is tier 1, 2, 3 or even 4. Once we have those definitions agreed to (not just within the IT organization but also with our internal business customer) we will start untangling the 2 organizations such that we can count on our operations staff to 'fully' support systems (whether software or hardware) and our development staff to focus on delivery of new capabilities.
So we're pretty much aligned with Carlos on his perspective. However, do recognize that if you move to a public cloud, or even a private cloud run by your vendor, the size and extent of your operations staff should change.
Re: What is the Role of an Operations Team in Software Development Today?
It seems to me that the three of you have described case studies for different operational environments. What works for Bjorn's operational environment may not for Carlos's and Stephen's.
I can paint a tiny case study description for my small organization: we are like Bjorn (developers handle everything: dev, QA, support, ops). This has not always been the case, and it will change going forward. It is indeed good for our developers to be deeply involved in technical support, as it leads to a better product that they produce. However, like Carlos pointed out, as a small business owner, I know it's not always the best use of resources for a developer to answer questions that a skilled support engineer can handle.
All of what you three have described sounds like case studies, and I've benefitted from reading them all. :-)
Flux Job Scheduler
Blog ~ jobscheduler.com
I think there is a middle ground
Amazon Web Services is a great example of enabling developers to self-serve their applications with limited operations. Not only EC2 and EBS; but also they have added Data services, elastic load balancing, and monitoring as services.
I have been in operations for over 30 years and you will never hear me advocate the elimination of operations. However, I think that, as an industry, operations needs to clean up and man up. Developers over the last 15 years have made enormous strides in productivity, and most ops shops are still running the same as 15 years ago. Operations needs to think “2010” and add services that facilitate developers' needs. Operations should still be responsible for the ebb and flow of ops and be less dependent on guesses of what dev wants.
DevOps, not Dev over Ops!
Developer superiority over ops seems to be partly due to startup culture, partly due to a long standing hubris-driven power struggle between developers and non-developers, and also partly due to a recent trend where frameworks & platforms weren’t designed with appropriate instrumentation and management interfaces from the start (early J2EE comes to mind!).
There are very different value systems at play when you’re paid to preserve & incrementally enhance something vs. create something new. There are even legal requirements to keep dev & ops separate in some industries. The point of devops tools / frameworks is that more collaboration, understanding, and goal alignment between dev & ops are good, helpful things, much in need.
Basically agree with John -- ops needs to modernize and improve their processes, but that doesn't mean they go away…. far from it. Operations folks could learn a lot from the agile tools that developers have learned these past 10 years. But developers also need to understand that infrastructure is a different beast than straight code -- you're manipulating something more akin to a robot, where transition states, incomplete knowledge, and continuous sensing are common.
Looking at many of the cloud management on-ramps out there, they seem to me designed more for developer self-service than for ops, IMO. This is fine, because dev are often the ones trying new technologies, they're the ones that are building new things in departments or small businesses, whereas many ops departments aren’t quite so proactive, though there are exceptions. But it shouldn't be taken to an extreme conclusion that *in general* sysadmins need to become more like developers. Self service is nice, but it's not the crucial feature to make an application reliable, fast, and secure.
To the topic itself:
It seems to me that we are actually looking at several trends that are changing the nature of operations, and therefore, require a re-examination.
Trend 1: Agile/Lean/Continuous Integration/Continuous Deployment -- By their nature, these new development practices require a streamlined application lifecycle. In turn, this means that development and operations processes have to be tightly coupled. Different organizations will address this in different ways: in some, development will take responsibility for the operations. In other cases, there will be dedicated ops people but they will report to the dev organizations, and in others, there will be a separate dedicated ops team. The decision on which approach works best depends on factors such as the size of the organization, the nature of the app (more on this below), the history and politics of the company, etc.
Trend 2: Automation of Operations -- We are seeing significant innovation in the automation of IT operations themselves. Companies such as John Willis's employer, OpsCode, with Chef, and Puppet Labs, with Puppet, are driving this trend to the next level by providing domain-specific languages for the automation of many IT tasks that were previously done manually (or at least, on an ad hoc basis with home-grown scripts). Others have already mentioned self-service capabilities, and of course, there is the fact that almost every aspect of the infrastructure is being exposed through an API, which allows developers to access it programmatically.
In short, and this may be painful for some to hear, many aspects of operations are being automated through programming and are less necessary. That's not to say there is no need for dedicated operations positions, but that they are changing their nature to higher level tasks.
Trend 3: Infrastructure-as-a-Service -- The increasing use of public IaaS clouds takes the automation of IT described in Trend 2 to the next level. Not only are many of the tasks automated, but the entire infrastructure is outsourced further removing the need for operations personnel in the organization that owns the app. The good news for ops folks is that IaaS (and other cloud) providers need highly-skilled ops people and in those types of companies they too are the rock stars and not just developers.
Trend 4: Platform-as-a-Service -- Although in many ways this is just a special case of the IaaS trend (which can also be referred to as "cloud computing" in general), it's important enough to this debate to list separately. Increasingly, we will see cloud platforms that handle even relatively high level tasks that have to do with the software stack (e.g., JEE/Spring, LAMP, RoR). This further decreases the need for sys admin tasks. In addition, many of these platforms, such as Heroku, Engine Yard, Force.com, Google App Engine and others, come pre-integrated and pre-instrumented with monitoring and management tools that don't require special ops expertise and can be handled by developers. For example, in Heroku, you can add New Relic's APM capabilities with a click and now have sophisticated management and monitoring capabilities.
These changes obviously don't happen overnight but I think the trends are undeniable and will have a pretty profound effect on the role of operations and dev.
Re: Converging Trends
I do not believe that developers can be cross-functional to such a degree that they can be good developers and at the same time be good DBAs and be good at IT operations, ... we should not exaggerate, people. I agree that this might be a good model for small development shops but, like I mentioned, do not exaggerate.
Re: Converging Trends
1) That the overall need for operations folks (in numbers) is being reduced due to automation
2) That the need is moving from the organization that owns the app to the cloud provider
3) That there is a need for development and ops to work more closely -- and I stated explicitly that likely in small shops the roles will converge but in other cases it might simply be a question of merging the development and ops teams (which is what is actually happening particularly in many SaaS shops)
Re: Converging Trends
I don't agree. "Operations" type work will always be here, and some could argue it's going to increase over time. Automation/cloud are only going to reduce the manual datacenter work for the most part, and that's not where most ops work is needed. I won't repeat what I said here about the topic: www.kitchensoap.com/2009/05/22/annoying-to-me/
About your #3 above: I'll say that making predictions on the reporting structure of development and operations (or merging, etc.) is orthogonal to the 'devops' trend of development and operations groups collaborating to a greater degree. It's very possible for development and ops to work together while being parallel groups within an organization.
Re: Converging Trends
Re #3 - I'm not trying to predict that. I think my main point is being lost, which is that dev and ops need to work more closely together and that will take different forms in different organizations.
In addition to commune-like levels of collaboration however, we need to recognise that developers and operations people have skills and understanding to pass on to each other. Stereotypically, developers don't know about anything outside the IDE, and ops people don't know how to code. That's not entirely correct either, although I have worked with examples of both.
Hopefully the skills transfer will naturally flow from the collaboration. I'm not suggesting that developers need to be operating system experts, but knowing about the ecosystem that your code runs in makes a world of difference. It seems that the two roles have been wedged apart by vendors, habit, and a sociopathic need to build specialised teams in organisations.
Architects and developers in operations will never work
Re: My take
* Accountability. Regardless of your organizational structure, everyone needs to take accountability for the successful operation - and use - of the system.
* Knowledge Sharing. The folks operating a system need to understand how it was built, and the folks building a system need to understand how it actually performs (and is used by customers). You can achieve this knowledge sharing with a single DevOps organization, or with close collaboration between the two organizations.
* Common Deliverables. As part of building software, work products necessary to operate the software need to be built (e.g., deployment process, monitoring, server specifications and configuration). Ideally, they are built by a joint team, and serve as a mechanism for process improvement.
* Joint Process Improvement. Teams should be continually improving their processes. If the teams are separate, representatives from each team should be providing - and accepting - feedback on ways to improve the processes.
Dev and Ops as opposed to Dev vs. Ops
In general, I think the debate should not be "how can we separate them" or "how can one group replace the other", but rather "how can we bring the two together for additional value".
One of the things I have in mind is that operations and developers sit together and define "monitoring hooks" that allow operations to better monitor the working of the app - not only the 'physical' parameters like VM size or request response times, but also other factors that can be counted within the app. This can even go as far as involving the biz people to expose business metrics into the monitoring systems, so that the business stakeholders can be alerted on out-of-bounds business metrics.
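As a rough illustration of the "monitoring hooks" idea above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Hook` and `HookRegistry` names, the metric names, and the bounds are hypothetical and do not come from any particular monitoring product. The application increments named counters, and the monitoring side asks which ones are outside the agreed bounds.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Hook:
    """A named in-app counter with optional alerting bounds (hypothetical)."""
    name: str
    low: Optional[float] = None    # alert if the value falls below this
    high: Optional[float] = None   # alert if the value rises above this
    value: float = 0.0


@dataclass
class HookRegistry:
    """Holds the hooks that dev, ops, and business agreed to expose."""
    hooks: Dict[str, Hook] = field(default_factory=dict)

    def register(self, name: str, low: Optional[float] = None,
                 high: Optional[float] = None) -> Hook:
        hook = Hook(name, low, high)
        self.hooks[name] = hook
        return hook

    def increment(self, name: str, amount: float = 1.0) -> None:
        # Called from application code wherever the event happens.
        self.hooks[name].value += amount

    def out_of_bounds(self) -> List[str]:
        # Called by the monitoring side; returns one message per violation.
        alerts = []
        for hook in self.hooks.values():
            if hook.low is not None and hook.value < hook.low:
                alerts.append(f"{hook.name}={hook.value} is below {hook.low}")
            if hook.high is not None and hook.value > hook.high:
                alerts.append(f"{hook.name}={hook.value} is above {hook.high}")
        return alerts


# Illustrative usage with made-up metrics and thresholds.
registry = HookRegistry()
registry.register("orders_completed", low=5)   # business metric
registry.register("payment_errors", high=2)    # technical metric

for _ in range(3):
    registry.increment("orders_completed")
registry.increment("payment_errors")

alerts = registry.out_of_bounds()
print(alerts)
```

In this sketch only `orders_completed` is flagged, because three completed orders is under the agreed minimum of five, while one payment error is within its limit. A real setup would export these values to whatever monitoring tool the team already uses rather than checking them in-process.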
"Mostly" absent from this debate
Sitting together - was Re: Dev and Ops as opposed to Dev vs. Ops
Today, operations and development have a common stand-up, share a common space and there are frequent ad-hoc conversations. When there is a serious production issue, development is there helping to troubleshoot the problem and offering advice. Ops has moved many of its tools into the development environment - for example Splunk to do log file analysis. Not only do we have more resilient and scalable systems, but it is quite frankly a much more enjoyable place to work.
On the monitoring hooks item, we have a plugin (Zapcat) for our monitoring tool (Zabbix) that allows us to monitor and alert on any JMX properties. Ops and development work together to monitor select standard properties (e.g., tenured gen memory usage, ActiveMQ queue sizes) and custom properties built into the application (e.g., transaction rates, module error states and messages).
RE: Debate: What is the Role of an Operations Team in Software Development
We have seen some extreme opinions here. As with most of life there is a happy medium to be found ;-).
I respectfully disagree with Bjorn about a couple of things. There must be checks and balances in place especially for companies subject to compliance. Developers and ops people have very different mindsets. I have seen few developers who are also truly skilled ops people. They are worth their weight in gold.
The key thing here is the relationship between the organizations. We all provide some form of product to customers. The product consists of software, database, network, OS, etc. Every company's goal is to make the customers happy. You can't provide an excellent product without good communication and relationships between development and ops. There can be no walls.
I worked at a large e-tailer. When we had an outage there was a lot of confusion and finger pointing. We had to get someone from the development organization on a call if it was finally determined to be an app issue. This led to longer outages.
I started involving someone from the development organization in *every* crit sit. There was a rotating on-call person, same as ops. The developers started to get a good idea how their software was actually working in production. Ops got a better idea of why the software was written the way it was. We discovered that the developers were not a bunch of crazies who were lobbing crap code over the wall. This helped break down the wall between the two organizations. The relationships got better. We had fewer outages and shortened the MTTR considerably. In short we had a better product.
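For readers who want to track the MTTR mentioned above, here is a small Python sketch of the usual calculation: total time from detection to resolution, divided by the number of incidents. The function name and the incident timestamps are invented for illustration and are not data from the e-tailer described here.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Incident = Tuple[datetime, datetime]  # (detected, resolved)


def mean_time_to_repair(incidents: List[Incident]) -> timedelta:
    """Average of (resolved - detected) over all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - detected for detected, resolved in incidents),
                timedelta())
    return total / len(incidents)


def incident(start: datetime, minutes: int) -> Incident:
    # Helper for building made-up sample data.
    return (start, start + timedelta(minutes=minutes))


day = datetime(2010, 5, 1, 9, 0)

# Hypothetical outages before and after developers joined every call.
before = [incident(day, 90), incident(day, 120), incident(day, 60)]
after = [incident(day, 30), incident(day, 45), incident(day, 15)]

mttr_before = mean_time_to_repair(before)
mttr_after = mean_time_to_repair(after)
print(mttr_before, mttr_after)
```

Comparing the two figures over the same kind of period is one simple way to check whether a process change, such as putting a developer on every incident call, is actually shortening outages.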
I am lucky to be a New Relic power customer. I use the product to view what is going on in our production app. I talk to the product developers on a daily basis. I tell them what I need. They take my feedback and, if reasonable, add it to the product. I listen to them and try to give them what they need. The development organization is involved (voluntarily) in every production issue we have. We are one team with different reporting structures. There is no confusion, BS or finger pointing. I am not shining anyone with marketing fluff here. We provide a product. It *must* be the best we can make it. Our customers *must* be happy or we will not succeed.