

Tom Limoncelli on DevOps and Automation

Interview with Tom Limoncelli by Barry Burd on Sep 25, 2015

Bio: Tom is an internationally recognized author, speaker, and system administrator. His new book, The Practice of Cloud System Administration (http://the-cloud-book.com), launched last year. He works in New York City at Stack Exchange, home of ServerFault.com and StackOverflow.com. Previously he's worked at small and large companies including Google, Bell Labs / Lucent, and AT&T.


   

1. [...] Why are developers increasingly concerned with operations?

Barry's full question: I am Barry Burd, I am a professor at Drew University in Madison, New Jersey and I am interviewing Tom Limoncelli at QCon New York. Tom is an internationally recognized author, speaker and system administrator. His new book, “The Practice of Cloud System Administration”, shipped last November. His past books include “Time Management for System Administrators” and “The Practice of System and Network Administration”. He works in New York City at Stack Exchange, home of careers.stackoverflow.com. Previously he has worked at companies such as Bell Labs and Google. Tom, welcome to QCon. Why are developers increasingly concerned with operations?

Developers are increasingly concerned with operations because their duties overlap more and more. In the old days, developers wrote software, it got shrink-wrapped, put on a floppy disk or a CD-ROM and shipped to customers. So, developers had no direct responsibility for operations. But nowadays, with the web, developers are often at the same company that is running the code, so if there is an operational problem, the people doing operations are down the hall. More than just down the hall: often it is a combined organization, and companies have found huge benefits in having a combined developer and operations team because you get the benefits of vertical integration and better cooperation.

   

2. What is DevOps and how does it affect business?

DevOps is a cultural change within businesses. A lot of people think that it is about tools or various software packages, but really, fundamentally, it is a cultural change that businesses are endeavoring to make, and this change is kind of the opposite of Waterfall. In Waterfall, developers make the software and, when they are done, they throw it over the wall to the operations people, and if operations people have a problem, they have to figure it out on their own. DevOps pushes those two groups together, forming a DevOps team that is mutually responsible for the development and operation of the system. So, instead of separation, you have empathy.

Developers develop empathy for the operational needs of the company and operations develops empathy for the developers' needs and, as a result, you get all these benefits: improved uptime, improved ability to introduce new changes and new features. The majority of companies that have made this cultural change have found that, as a result, certain tools become useful, things like continuous integration and infrastructure as code. Because they all use those tools, DevOps is often confused with the tools, and people think that if they have a continuous integration system or use source code control or do web-based development, then they are automatically DevOps. But those are really symptoms or downstream effects of having this good cultural change.

   

3. Most of the practices that I read about are things that companies like to say that they do, but they do not actually do. To what extent has the DevOps culture really penetrated the software industry?

Good question. I think we are at the base of the mountain looking up. This is really just the beginning. First of all, there are many different techniques that fall under the DevOps umbrella, all part of the vertical integration of developers and operations having shared responsibility for the product. Of these techniques, some companies do one or two, and some companies do dozens and dozens of them. I think a lot of companies say they are DevOps when they are really just beginning.

One big mistake that I see is that companies will often create a new team called the DevOps team and say “We put all of our company DevOps in this team”, and that is the wrong way of doing it. That is kind of like saying that the quality of our software is important so we are going to have a quality department, which is what you did in the 1970s and 1980s. Modern software development has realized that quality happens because everyone is concerned with quality: you build quality in from the start, with developers writing unit tests along with the code. It is an integrated, pervasive system, and certainly DevOps is best when it is pervasive in the organization.

   

4. What features do system administrators need from developers to make systems operable?

When developers and operations people start really communicating and collaborating, it is very common that they discover there are certain features that need to be designed into the software to make it much more operable. One of the most fundamental is visibility. So, for example, the ability to monitor not just “Is the system up or down?”, which is kind of like looking at the software as a black box, but being more invasive and having white box monitoring, so visibility into buffer counts and API call counts and all these different things. Another thing is the ability to do failover in a supported manner. I cannot tell you how many times I have seen software where the only way to do failover was some hack that operations figured out and it is not really supported.

In a large environment, if you are distributing your work over dozens of machines, you have to expect that at any given moment the chance of one being down is pretty high, so you need some kind of failover built into the software and not tacked on at the end. So visibility, resiliency and also logging, which is really critical to maintain the history of the system. The system needs to generate logs that are useful for operations and useful in debugging problems, as well as for monitoring specific error situations.
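As an illustration of white box monitoring, here is a minimal Python sketch of a service that keeps internal counters and exposes them as a snapshot a monitoring system could poll. The class name `ServiceMetrics`, the metric names, and the JSON shape are assumptions made for this example; they do not describe any system mentioned in the interview.

```python
import json
import threading
from collections import Counter


class ServiceMetrics:
    """Hypothetical holder for internal counters exposed to a monitoring system."""

    def __init__(self):
        self._lock = threading.Lock()   # counters may be updated from many threads
        self._api_calls = Counter()     # API call counts, keyed by call name
        self._buffer_depth = 0          # current number of items waiting in a buffer

    def record_api_call(self, name):
        with self._lock:
            self._api_calls[name] += 1

    def set_buffer_depth(self, depth):
        with self._lock:
            self._buffer_depth = depth

    def snapshot(self):
        """Return internal state, not just an up/down flag."""
        with self._lock:
            return {
                "up": True,
                "api_calls": dict(self._api_calls),
                "buffer_depth": self._buffer_depth,
            }


metrics = ServiceMetrics()
metrics.record_api_call("search")
metrics.record_api_call("search")
metrics.record_api_call("post_answer")
metrics.set_buffer_depth(17)

# A monitoring system would poll something like this on a schedule.
print(json.dumps(metrics.snapshot(), sort_keys=True))
```

A black box check would only see the `up` field; the other fields are the kind of internal visibility that has to be designed into the software.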

Barry: I think what I hear you saying is that logging cannot be added as an afterthought.

Right. Yes. Logging cannot be added as an afterthought.

Barry: Visibility of certain parts of the system, performance and so on cannot be added as an afterthought. If there are two teams, the developers and the operators, they have to be talking to one another continuously.

Right. And not just continuously at the end of the process: in the best, high-performing DevOps teams, the operations people are involved from the very beginning of design. So, where I work, at stackoverflow.com, when developers are just thinking about a new feature or a new subsystem, they get a delegate from our SRE team, who sits there and helps them with the architecture and prevents more problems than the developers could imagine they would have had. For example, we recently had a new feature being developed and the operations people said: “Hey, this is great. How are we going to do back-ups?” That one question made the developers rethink the whole architecture and, as a result, when the new feature launched, it had the back-up system that was needed. It would have been quite a disaster if they were two days from launch and someone said “How are we going to do back-ups of this?” and they would have said “Oh, no. What are we going to do?” And I have been at companies who said “Well, we will run without back-up for the first couple of weeks because we did not think about that”. And that is really scary. That is a really scary business risk and it is so much better to have people thinking about those things from the very beginning.

Barry: Does some of this have to do with the fact that software is more user-facing than it used to be? If it is software that is aimed at the consumer, it is more consumer-facing. And I remember the days when you brought a deck of cards to the operator and the operator went behind the desk and you did not see it for four hours and then in that case the notion of a DevOps might not make sense. I am thinking that part of the reason why DevOps is such an important concept is because there is a continuous chain now between the software developers and the actual operation to users of the software.

If you think about when web sites were brand new, they were often just static content, so you could update them once a week with new static content. But now web sites are very interactive and, because they are interactive, users are always thinking about new features that they might want and product development people are thinking about ways that the site can be improved, so they want those improvements rapidly. Could you imagine if Google, the search engine, updated their web index once a year? I mean, that would be crazy, right? But instead, you can post to your blog and literally seconds later, it is in the search index and searchable, and that is because from end to end they have had to work out how they were going to have those changes constantly happening. It is a very dynamic system. And even if you are not big like Google, even if you are doing a simple web site, you are going to be shipping new releases of that web site constantly, and that is great because that means that web sites and applications are much more customer-oriented. I remember back in the day Microsoft Office used to have a new release every three years and in between there were some bug fix releases, but basically, you had to wait three years for those new features.

Now web sites ship new features all the time. Etsy.com had a blog post recently where they said that in 2014 they shipped 9000 releases of their software. Literally, more than once an hour during the 9 to 5 hours of the day at least one new release was coming out. That means that they can respond to demands for new features very effectively and, in fact, they can even experiment: they can try a feature one way and back it out if people do not like it. You cannot do that if you are on a three-year software cycle.
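To make the "more than once an hour" figure concrete, here is a back-of-the-envelope check in Python. Only the 9000 releases figure comes from the interview; the 260 working days (52 weeks of 5 days) and the 8-hour day are assumptions made for this estimate.

```python
releases_per_year = 9000   # figure quoted in the interview
working_days = 52 * 5      # assumption: 260 working days, ignoring holidays
hours_per_day = 8          # assumption: a 9 to 5 working day

per_day = releases_per_year / working_days
per_hour = per_day / hours_per_day

print(f"{per_day:.1f} releases per working day, {per_hour:.1f} per working hour")
# prints "34.6 releases per working day, 4.3 per working hour"
```

Under these assumptions the rate is several releases per working hour, which is consistent with the claim of at least one release every hour.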

   

5. Is there a difference between the mobile world and the desktop web world when it comes to DevOps?

Interesting point. In the mobile world people expect apps to be updated frequently. With most of the apps that I use, I see a new release every week or two, and the apps that update every couple of months or once a year, or seem to never update, are the apps that I end up no longer using because there is nothing new for me there, right?

   

6. So how can one automate a system without creating a system which is difficult to maintain?

You know, a big part of DevOps is that once you are in a situation where you can do a lot of new software releases, you want to automate the software release process because for example Etsy, with thousands of releases every year, obviously each one cannot have been done manually. So, it is a good question. How do you automate in a way that does not corner you in? So, to answer that question I am going to first tell you how most people get into the bad situation and then I will tell you how to fix it. So, what people often do is that they automate as much as they can and the things that are too difficult to automate, they leave for the humans to do manually and that is called “the leftover principle”.

By doing that, you keep chipping away and automating more and more, and eventually what is left for the humans to do is the stuff that is really difficult and really complicated, or the stuff that is so rare that it is not worth automating. When you do this, you end up with a system that is so sophisticated and complicated that it becomes very difficult to debug. Another way to look at it: what is left over for the humans requires so much skill that it is difficult to hire anyone who can do that kind of work. One of the reasons why I think that Google has difficulties hiring SREs is that they have automated things to the point where the only people they can hire have to have an incredible skill level to be able to maintain what has been developed.

So, how do you fix this problem? I think it is important to think in terms of Iron Man instead of Ultron. If you think of Iron Man’s suit, it takes what Tony Stark can do and just does it a little bit better, right? Or it does the things that humans are not good at, but Tony Stark does the rest, as opposed to Ultron, which was this complicated system that went out of control and that no one could understand well enough to defeat. So, what does that mean for system administrators or developers? It means that you want to make tools that do what humans do, but just do it better. So, for example, a monitoring system: as a human, I know that I can collect monitoring statistics and metrics from all our services manually. I could collect each one of those metrics. But what a human is not good at is repetition, so I would not want to do that every five minutes accurately, and also, humans like to sleep. So, monitoring systems automate that process by doing it 24 hours a day.

Barry: There is a speed issue also.

Right. There is a speed issue. So the monitoring system, built right, does what a human can do, but it just does it faster, more repeatably and 24 hours a day. Another example: I used to work on a large virtualization project and if a physical machine got sick, we needed to evacuate the virtual machines off of it, send it to repairs, which was the people in the data center, and then, when it was repaired, bring it back into the system. There were basically five discrete tasks that had to be done when a physical machine was sick.

Instead of building a large complicated system that did all that, what we built was five tools, and each tool was something a human could use to do one of those tasks. Later, once those tools were all stable and people were good at using them, we decided we just needed automation that could kick off those tools at the right time, and that automation could run 24 hours a day. The automation was simple because it was just doing what the humans would normally have done, only it did it while we were asleep, and so it never grew into that Ultron system that could never be controlled.

So, still, what do we do about those problems that are so complex that you do not want to rely on software to do them and it is difficult to hire people who can do them? Even in that case, there were certain edge cases that were not worth coding around and so, instead of being paged a couple of times a day, only the edge cases paged us for human intervention, and that happened maybe once a week or less. What we found was that we were able to simplify a lot of those edge cases. In this case, because it was a virtualization environment, we were in a position where we could wipe and reinstall machines rather than try to fix them. But even if one of those five tools broke, it was doing what the humans normally did. So, when it broke, it broke in ways that humans understood.
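The approach described above, small tools a human can run plus thin automation that kicks them off, can be sketched in Python as follows. The interview does not name the five tasks, so the function names below (`drain_host`, `evacuate_vms`, and so on) are hypothetical placeholders, and the `page` callback stands in for whatever paging system a team uses.

```python
# Each tool does one task and could equally be run by a human from a shell.
# The five task names are hypothetical; the interview does not list them.
def drain_host(host, log):
    log.append(f"drain {host}")


def evacuate_vms(host, log):
    log.append(f"evacuate {host}")


def request_repair(host, log):
    log.append(f"repair-ticket {host}")


def verify_repair(host, log):
    log.append(f"verify {host}")


def return_to_service(host, log):
    log.append(f"return {host}")


TOOLS = [drain_host, evacuate_vms, request_repair, verify_repair, return_to_service]


def handle_sick_host(host, tools=TOOLS, page=print):
    """Run the same tools a human would, in the same order.

    If any tool fails, stop and page a human instead of trying to be clever.
    Returns the log of completed steps and whether every step succeeded.
    """
    log = []
    for tool in tools:
        try:
            tool(host, log)
        except Exception as err:
            page(f"{tool.__name__} failed on {host}: {err}")
            return log, False
    return log, True


steps, ok = handle_sick_host("host42")
print(ok, steps)
```

Because the automation is only a loop over the tools, a failure stops at a step a human already knows how to do by hand, which is the "broke in ways that humans understood" property described above.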

   

7. So, let us put in a little plug here: how does careers.stackoverflow.com make hiring and getting hired better?

Great. A lot of people are familiar with Stack Overflow. It is the Q&A website that almost any developer in the world goes to for their answers and also for the community of people writing answers. Careers.stackoverflow.com is our jobs and career site. The way that it is different is that we have a couple of rules there, and a gamified process, designed to make the process suck less for developers. For example, we do not permit third party recruiters. So, if you have an account on careers, you are not constantly inundated by recruiters with nebulous job offers. The other thing is that, for a hiring company, I like to believe that the resumes they get are of higher quality because they are recruiting from within the Stack Overflow community, which is a great community. Also, before they even talk to a candidate, they can see the candidate's reputation score on the site and that kind of thing, which our customers find very useful.

   

8. How do you keep recruiters off of the site?

We have a company policy that our customers cannot be third party recruiters. I mean they can be recruiters for a company, but not third party recruiters.

   

9. Ok. You have a new book coming out. Are there any surprises?

There are going to be tons of surprises. One of the surprises that we have not talked about much is that we have developed an assessment system. One thing we found with our past system administration books is that people say “Hey, lots of great ideas, but I do not know where to start” or “I think my team is doing well, but I am not exactly sure”. So, we have developed an assessment system that lets you self-evaluate your team and helps identify what your areas of improvement are. Also, if you do the self-assessment, say, monthly or quarterly and you keep it in a spreadsheet, you can build a view of how you are improving over time, and you can do it per team or per service. I have seen this done at other companies and it really has a very powerful effect in helping guide your priorities.
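As a rough illustration of tracking self-assessments over time, the following Python sketch averages scores per team and per quarter from a spreadsheet exported as CSV. The column names, the assessment areas, and the numeric scale are all assumptions made for this example; the interview does not describe the book's actual assessment format.

```python
import csv
import io
from collections import defaultdict

# Hypothetical spreadsheet export: one row per team, quarter, and assessed area.
SHEET = """team,quarter,area,score
web,2015Q1,monitoring,2
web,2015Q1,backups,3
web,2015Q2,monitoring,3
web,2015Q2,backups,4
data,2015Q1,monitoring,1
data,2015Q1,backups,2
data,2015Q2,monitoring,1
data,2015Q2,backups,4
"""


def average_by_team_and_quarter(text):
    """Return {(team, quarter): mean score} so trends can be compared over time."""
    scores = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        scores[(row["team"], row["quarter"])].append(int(row["score"]))
    return {key: sum(values) / len(values) for key, values in scores.items()}


averages = average_by_team_and_quarter(SHEET)
for (team, quarter), avg in sorted(averages.items()):
    print(team, quarter, avg)
```

Grouping by service instead of team would only require a different key column.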

Barry: Sounds good. Tom, thank you very much.

Thank you for having me.
