Q&A on the Practice of System and Network Administration (3rd Edition)
The book The Practice of System and Network Administration takes a holistic view of system administration: it provides a framework and strategies for solving problems regardless of the operating system, brand of computer, or type of environment. The aim of this book is to help people become professional system administrators. The third edition incorporates new developments like DevOps, infrastructure as code, continuous integration (CI), operational excellence and assessments.
InfoQ readers can download a book extract with a discount code.
InfoQ interviewed Thomas A. Limoncelli, Christina J. Hogan, and Strata R. Chalup about game changing strategies for system administration, benefits from using an architecture that uses open standards and open protocols, what can go wrong when a new service is launched, and how to prepare for possible problems, state of the art practices for monitoring systems, things that DevOps has brought to System Administration, improving communication and collaboration between systems administration and the users of systems, how to assess the services provided by system administration, and which development they expect to happen in system administration in the future.
InfoQ: For whom is this book intended?
Limoncelli: This book is for system administrators that work in small and large enterprises and educational institutions. It is useful whether you work at a helpdesk, desktop support and delivery organization, or back-end services.
Chalup: We wrote this book for system administrators and those performing systems work (who may not be career system administrators) who want to “level up” by learning “the why” of best practices as well as “the how”. We describe design and process patterns and explain them, rather than focusing on a specific platform or program. That way the reader can apply the pattern to any current technology.
Hogan: The book is also intended for managers who have system administrators in their organizations. In particular, the last two chapters in the book focus on how to constructively assess the system administration organization, and how to drive and prioritize improvements.
InfoQ: What's new in the third edition?
Hogan: The third edition is a huge update. The field of system administration has seen some significant advances in the past ten years. This edition captures those advances, and demonstrates how best practices have changed. We look at how DevOps methods can be applied in environments that run commercial software. We also focus a lot on automation, and how integration with other systems such as the HR database and the inventory database enables you to build top-notch services.
Limoncelli: A whole lot! 28 of the 56 chapters are new. In the previous edition we had one chapter on desktop services; there are now 8 chapters covering everything from architecture, to desktop lifecycle, to managing new employee onboarding. The chapter on servers was replaced by a 3-chapter sequence. The chapter on running services is now a 7-chapter sequence that covers planning, different deployment approaches, and service conversions. The book is 50% longer than the first edition, and 20% longer than the second edition.
InfoQ: The book starts with some game changing strategies. Which are they, and why do they matter?
Limoncelli: We start with a chapter about what to do if your organization is a hot mess. Feedback we received about previous editions was that it is difficult to do the right thing when you feel like your entire network is on fire. Therefore the first chapter is about putting out enough fires so that you can use the advice in the rest of the book.
Chapters 1-4 are about strategies for organizing your work. For example, system administrators sometimes launch a new service and their customers hate it. Oops! Now you feel like you wasted a year of work. We explain how to launch a new service as a series of mini-launches, perhaps once a week. The first launch might have fewer features and only be visible to a small group of users. The feedback you get is invaluable and informs the next mini-launch. Over time each mini-launch adds features and supports more users. The project might still take a year, but when it is done you’ve built a better system that more closely matches what users want. This works whether you are rolling out a new printer in an office environment or a million dollar web site.
Starting the book this way may surprise someone that expects a system administration book to be about what commands to type and buttons to click. However if you talk with any senior system administrator, they’ll tell you that these are the real secrets of top system administrators. This is the kind of advice you won’t find in the manual.
Hogan: Chapters 1 to 4 describe the changes in mindset and approach that are the basis for how system administration has evolved over the last ten years. The approaches described in those chapters constitute the mindset that all system administrators should bring to the job, and these approaches should inform system administrators’ decisions on how to tackle every challenge.
InfoQ: What are the benefits of an architecture that uses open standards and open protocols?
Limoncelli: You get competition that leads to better products and lower prices. It was radical to say this in our earlier editions but now this kind of thing is conventional wisdom. We’re proud to have been on the leading edge with that one. However we’re also appalled at the continued attempts by vendors to find new and creative ways to lock in customers. This edition tries to educate people so they can watch for the new, more subtle, attempts by vendors to do this.
InfoQ: What can go wrong when a new service is launched?
Limoncelli: Everything! It isn’t fast enough, it doesn’t have the features customers wanted, it disrupts unrelated systems in the datacenter, it doesn’t work with all the browsers you’d expect, users connecting via VPN can’t use it… just about anything. A few years ago Apple had a website outage during one of their famous keynote presentations because a new “real time news feed” feature didn’t scale to millions of simultaneous users. Who could have expected that? Well, we would have! Something so critical should not have been exposed to millions of users without capacity testing first. However where could Apple have found millions of users to test that bit of code ahead of time? Well, they could have put it on their homepage as an invisible element. That would have tested its ability to scale with enough lead time to fix any problems. This isn’t arm-chair quarterbacking. Other companies do that kind of testing all the time. Facebook Messenger was running in people’s browsers as an invisible service sending fake messages for 6 months until the scaling issues had been worked out.
InfoQ: How can system administrators prepare themselves to deal with problems during launch?
Chalup: The key is to get information as early as possible. Discovering a problem on launch day is the worst. A simple technique is to have a beta launch to find problems early. Everyone knows that, but people don’t think to do it for internal systems or system administration tools. We take this even further. Can you launch a single feature to validate assumptions months ahead of the real launch? I like to launch a service with no features, just the welcome-page, months ahead of the actual system launch. This gives us time to practice software upgrades, develop the backup procedures, document and test our runbook, and so on. Meanwhile the developers flesh out the system by adding features. When the system is ready for real users, there are very few surprises because the system has been running for months. Best of all, users get access to new features faster.
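As an illustration (not taken from the book), the idea of exposing a service to a small group first and widening the audience with each mini-launch can be sketched as a deterministic percentage rollout. The function name `in_rollout`, the feature name, and the user IDs below are hypothetical; the sketch only assumes that each user has a stable identifier.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Return True if this user is inside the first `percent` of 100 buckets.

    The bucket is derived from a hash of the feature name and user ID, so a
    given user always lands in the same bucket for a given feature.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


# Hypothetical user population and feature name, for illustration only.
users = [f"user{i}" for i in range(1000)]
week1 = {u for u in users if in_rollout(u, "new-print-service", 5)}
week2 = {u for u in users if in_rollout(u, "new-print-service", 25)}

# Everyone who saw the first mini-launch still sees the second one.
print(week1 <= week2)
```

Because the bucket depends only on the feature name and the user ID, raising the percentage adds users without removing anyone who already had access, which keeps the feedback from early users comparable from one launch to the next.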
InfoQ: What are the state-of-the-art practices for monitoring systems?
Hogan: The industry is making a big shift right now from up/down monitoring to time series-based monitoring. The old way is to monitor if something is up or down and alert if, for example, it has been unreachable for a certain amount of time. The new way is to collect telemetry about many aspects of the system and do data-mining on the history of the data to notice when the system is sick. Now we can cure the underlying causes and prevent the outage. As a result, it is less common to be woken up at 4 AM because the system is down, and more likely that during the day you fix a small problem before it results in an outage. The old way is like trying to help someone having a heart attack; the new way is like treating high blood pressure. Some systems that use this newer methodology include Bosun, Prometheus, and Circonus.
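To make the contrast concrete, here is a minimal sketch comparing the two styles on a made-up disk-usage series. It does not use Bosun, Prometheus, or Circonus; the function names, the sample data, and the 48-hour horizon are all assumptions chosen for illustration. The trend check fits a least-squares line to the history and estimates how long until the disk is full.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (hour, percent of disk used)


def hours_until_full(samples: List[Sample], capacity: float = 100.0) -> Optional[float]:
    """Fit a least-squares line to the samples and estimate the hours left
    after the last sample until usage reaches `capacity`.

    Returns None if there is too little data or usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])


def up_down_alert(samples: List[Sample], capacity: float = 100.0) -> bool:
    """Old style: alert only once the disk is already full."""
    return samples[-1][1] >= capacity


def trend_alert(samples: List[Sample], horizon_hours: float = 48.0) -> bool:
    """New style: alert when the trend predicts a full disk within the horizon."""
    remaining = hours_until_full(samples)
    return remaining is not None and remaining <= horizon_hours


# Hypothetical history: usage grows one percentage point per hour from 60%.
history = [(float(h), 60.0 + h) for h in range(6)]
print(up_down_alert(history), trend_alert(history))
```

With this history the up/down check stays silent because the disk is only 65% full, while the trend check fires because the fitted line reaches 100% well inside the 48-hour horizon. That is the "treating high blood pressure" case: the problem is handled during the day, before it becomes an outage.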
InfoQ: Can you explain the “fix it once” mantra?
Hogan: When something breaks, it is tempting to just fix it quickly (for example by rebooting the server) and then move on. This can be driven by the fact that it is user-impacting, and you need to get people back up and running as fast as possible, or it could simply be because you are super-busy. However, if you don’t understand why it broke, and fix the underlying reasons, then it will break again, and you will need to fix it again. It is better to fix the underlying problem once, rather than rebooting the server every time it breaks.
For example, if you have a service, and every so often the machines that it is running on experience high CPU, memory and swap usage, you can have your monitoring detect that condition, generate an alert and have someone reboot the system. You could even get clever and have some automation reboot it for you. Or you could do some investigation into which process is chewing up CPU and memory, check for known bugs and, if necessary, raise a support case with the vendor to get the bug fixed. Then you upgrade to the fixed version when it is available. The latter approach is what we mean by fixing things once. You fix the problem permanently, rather than continually repeating the workaround. It’s not that you leave the machine broken until you have the permanent fix, but that you investigate the issue thoroughly, and fix it permanently as soon as possible.
Chalup: It's much better to drain the swamp than to fight the individual alligators!
InfoQ: What are the main things that DevOps has brought to System Administration?
Chalup: DevOps has brought a level of collaborative accountability to the profession. It's explicitly part of a programmer’s responsibility to create maintainable systems with functional APIs and a system administrator’s responsibility to create a managed and monitored landscape in which those systems can operate. Neither side gets to throw things over the wall and then point fingers when something goes wrong. The focus on a complete life cycle for a system, from design to development to release to maintenance, shifts both groups’ thinking into a more holistic mode.
Limoncelli: DevOps techniques lead to an environment that is less stressful and more productive. Imagine if job advertisements were completely honest. Most companies advertising for IT workers would state that the job is mostly great except for twice a year when “Hell Month” arrives and everyone scrambles to deploy the new release of some major software system. This month is so full of stress, fear, and blame that it makes you hate your employer, your job, and your life. Sadly, at many companies Hell Month is every month. A company that adopts the DevOps principles is different. A rapid release environment automatically deploys upgrades to production weekly, daily, or more often. Little or no human involvement is required. It is not a stressful event; it is just another day. There is no fear of an upcoming Hell Month.
Companies that use these techniques are rare now but are growing in number. When they are the majority, companies that have not eliminated Hell Month will find it difficult to hire employees. This doesn’t just include IT workers. Given the choice between working at two companies that are otherwise equal, wouldn’t you pick the one known for providing its employees with seamless technology and support?
InfoQ: What suggestions do you have for improving communication and collaboration between system administrators and the users of their systems?
Chalup: It's really important that customer communication includes an immediate attempt to find out the goal and urgency of the customer’s request.
I once heard a customer asking a colleague for a piece of network hardware. The colleague told the customer to await a desk visit in about 15 minutes for assistance. The customer left, and within 10 minutes the network went down. The problem was eventually traced to a piece of transmission equipment that the customer had appropriated from the network “because I needed one and it didn't look like anyone was using it.”
System administrators need to go beyond the symptoms of a problem and discover what the customers are actually trying to accomplish as the end goal. It's important to cultivate a mindset of being a customer enabler, rather than a systems maintainer. Customer requests shouldn't be thought of as annoying interruptions to be gotten rid of as quickly as possible, but as real world use cases that help us better understand how to provide effective services.
Limoncelli: Good communication with users is not enough. We must develop bi-directional empathy and collaborate to create IT systems that are useful and sustainable. We must understand our users to the point that we develop empathy. This can be done by shadowing a user for a week to better understand their process and discover the annoyances and pitfalls of the systems we built. This can be done on the team level too. Once I worked with two teams that depended on each other but rarely talked to each other. I coordinated an effort where both teams sat down and walked through a major process, listing the steps and pointing out the rough edges, the unreliable parts, and the burdensome manual steps. This new understanding led each team to make changes to improve the process. Some things were small, like displaying data sorted by date instead of last name. Other things were big, like providing an API so that the other team could get what they needed without opening a service request. This spawned many projects that made life better for members of both teams. I remember at one point someone at the meeting saying, “No need to file a bug for that one… I just fixed the code. It will be in production tonight!”
Empathy is a two-way street. Developers often don’t appreciate how difficult operations is, and refuse to add features that would reduce a lot of operational strife. Why should they? Their performance reviews are based on whether or not new features get written. However if all developers have shared responsibility for uptime, and have to take a turn being oncall, you’d be amazed at how fast those operational pain-points get fixed.
Hogan: The most important part of communicating with the end-users is listening, and making sure that people know that they are being heard. Provide a forum where people can make suggestions, perhaps vote on the next big project, or the small improvements that would eliminate large time sinks. Once you find a way for people to provide their feedback, you need to dedicate some resources to delivering on those top requests, updating the forum with which requests are completed, in progress and under consideration. That way people can see the value in participating, and know that they are being heard.
InfoQ: How can you assess the services provided by system administration?
Chalup: Any assessment tool has a set of expectations or metrics to which the assessment is tailored. The two main assessment systems, which are somewhat orthogonal, are customer satisfaction and organizational maturity.
A customer satisfaction matrix will be about responsiveness, end to end solution completeness, ticket resolution time, and similar. It's an important tool to assess how well you are serving customers, since much of the work system administrators do is preventive in nature and subject to being overlooked. A variant of this survey might be to assess satisfaction with the services themselves, e.g., application suitability and responsiveness.
With regard to the overall service provided to the organization, a capability maturity model is the standard way to measure the overall maturity of the operational practices of the IT team. We discuss applying a CMM in this book, and how the various levels of process, repeatability, and documentation create a more functional systems group. We include a 40-page guide for organizations interested in taking this approach. This includes a complete assessment system you can adapt to your own team plus instructions on how to use it.
InfoQ: Which developments do you expect to see in system administration in the future?
Chalup: I hope we will see the development of technologically aware workplace personal assistants, more like a system administration Siri than our old love-to-hate friend Clippy. These automated expert systems would create a virtual help desk that would be always available and able to escalate to live staff, after asking for some useful background information.
Limoncelli: Everything is becoming more programmable. This doesn’t just permit IT to automate their work, but enables self-service portals that let non-IT workers be productive without waiting for IT. People often think of this as something only “cloud companies” do. However it is happening at all levels. At one company anyone that requested remote access to the network (VPN access) had to make the request, the IT department had to get their manager to approve it, install software on their laptop, and configure it. As they improved the programmability of their systems, they eventually ended up with a system where a user would request VPN access from a self-service portal. Their manager would receive email with an “approve” link to click on. Within minutes the laptop’s software update system (Puppet) would install the VPN client software and securely configure it. This eliminated wait time, typos, and security problems that come from misconfiguration. This required 5 subsystems to be programmable via APIs. 10 years ago that would have been impossible.
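The request, approve, and install flow described in this answer can be pictured as a small state machine. The sketch below is purely illustrative and is not the system from the interview: the class and function names are hypothetical, and the email and configuration-management steps are stand-ins that only record what a real integration would do.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class State(Enum):
    REQUESTED = "requested"
    APPROVED = "approved"
    PROVISIONED = "provisioned"


@dataclass
class VpnRequest:
    user: str
    manager: str
    state: State = State.REQUESTED
    log: List[str] = field(default_factory=list)


def submit(user: str, manager: str) -> VpnRequest:
    """User asks for VPN access through the (hypothetical) self-service portal."""
    req = VpnRequest(user=user, manager=manager)
    # Stand-in for sending the manager an email with an "approve" link.
    req.log.append(f"approval link sent to {manager}")
    return req


def approve(req: VpnRequest, approver: str) -> None:
    """Manager clicks the approve link; only the recorded manager may approve."""
    if req.state is not State.REQUESTED:
        raise ValueError(f"cannot approve a request in state {req.state.value}")
    if approver != req.manager:
        raise PermissionError(f"{approver} is not the manager for {req.user}")
    req.state = State.APPROVED
    req.log.append(f"approved by {approver}")


def provision(req: VpnRequest) -> None:
    """Stand-in for asking the configuration system to install the VPN client."""
    if req.state is not State.APPROVED:
        raise ValueError("request must be approved before provisioning")
    req.state = State.PROVISIONED
    req.log.append(f"VPN client install queued for {req.user}")


# Hypothetical walk-through of one request.
request = submit("alice", "bob")
approve(request, "bob")
provision(request)
print(request.state.value, request.log)
```

Each transition checks the current state before acting, so no step can be skipped or repeated out of order. Removing the manual hand-offs in this way is what takes out the waiting time and the typos mentioned in the answer.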
Hogan: The Internet of Things revolution is going to provide some new challenges in the coming years. Our networks will be dominated by a lot of “smart” devices, that are a lot less smart than those we are accustomed to dealing with. These devices are being made by companies that traditionally did not make networked devices, and the concept of their device being hacked and becoming part of a botnet is alien to them, but something that they need to come to terms with and address. In the meantime, it will fall to the system administrators to figure out how to protect their enterprise networks, and the rest of the world, from the company’s smart lightbulbs, blinds, AV systems, fridges and toilets!
IoT will also mean that system administrator teams will need to reach out to their facilities teams to make sure that they are involved in the product evaluation and selection process. Emerging standards should make these devices easier to manage, but only if the manufacturers are compliant. It’s a new field that is rapidly changing. Make sure you keep abreast of developments so that all these new types of network devices can be managed as easily and seamlessly as possible, or you will find yourself fighting endless IoT fires.
About the Book Authors
Thomas A. Limoncelli is an internationally recognized author, speaker, and system administrator with 20+ years of experience at companies like Google, Bell Labs and StackOverflow.com. He manages the SRE team at StackOverflow.com.
Christina J. Hogan has 20+ years experience in system administration and network engineering, from Silicon Valley, to Italy, and Switzerland. She has a Masters in CS, a PhD in Aeronautical Engineering and has been part of a Formula 1 racing team.
Strata R. Chalup has 25+ years experience in Silicon Valley focusing on IT strategy, best-practices, and scalable infrastructures at firms including Apple, Sun, Cisco, McAfee, and Palm.