Interview and Book Review: DevOps Troubleshooting: Linux® Server Best Practices
Kyle Rankin delivers practical advice and techniques for team-oriented troubleshooting of Linux servers in a DevOps culture. The book targets systems engineers, developers, and QA staff who have gaps in their knowledge of troubleshooting Linux servers. Experienced Linux system engineers will find the content refreshing and prescriptive about what should be shared in a cross-functional team environment. A seasoned Linux engineer could use this book as the basis for a short series of hands-on labs that prepare developers and QA staff for the times when trouble strikes.
Focusing on DevOps
The first chapter provides an approach to troubleshooting for everyone in a DevOps culture. DevOps is put front and center, setting the stage for individuals with different expertise to interact efficiently. The first chapter guides people to do the following things:
- Approach troubleshooting with a divide-and-conquer mentality.
- Prefer IRC or similar text-based chat technology during troubleshooting.
- Start troubleshooting with quick, simple tests and avoid slow, complex tests if possible.
- Use known solutions first.
- Document your troubleshooting experience including the solution.
- Start by considering what has changed recently.
- Build up knowledge and understanding of how systems work and interact.
- Use the internet for specific inquiries only.
- Gather as much information as possible about the server(s) having issues before rebooting.
Throughout the book, Kyle makes the case that common-ground troubleshooting skills are an important part of a DevOps culture. He states it as follows:
"In a DevOps organization, cooperation between all the teams is stressed, but when it comes to troubleshooting, often people still fall into their traditional roles even if there’s no blame game. Why? Well, even if everyone wants to work together, without the same troubleshooting skills and techniques, everyone may still be waiting on everyone else to troubleshoot their part. The goal of this book is to get every member of your DevOps team on the same page when it comes to Linux troubleshooting."
Chapters 2 through 10 each address one problem domain of a Linux server. Kyle approaches the following topics from a troubleshooting perspective: server slowness, booting, disks, networking, DNS, email, websites, databases, and hardware.
Linux servers appear to slow down when troublesome processes cause high load of one of the following types: CPU, RAM, or I/O. During troubleshooting, a team needs to identify the processes that are degrading server performance and stop them. Most Linux servers have the tools to determine both the category of the issue and the offending process(es). The command line interface (CLI) tool "uptime" helps diagnose CPU load issues by reporting the load averages for the last 1, 5, and 15 minutes. The CLI tool "top" helps diagnose CPU and RAM load issues by continually reporting system information to the console. The CLI tool "iotop" helps diagnose I/O load issues. These tools can only analyze an issue that exists while they are running; a different tool is needed for analysis after the issue has occurred. The "sysstat" package provides a collection of tools that gather data over time at a configurable interval and report that information after the fact.
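As a rough illustration of reading the load averages that "uptime" reports, the following Python sketch parses text in the format of Linux's /proc/loadavg and compares a load figure against the CPU count. The function names and the threshold of 1.0 per CPU are assumptions made for this example and are not taken from the book.

```python
import os


def parse_loadavg(text):
    """Return the 1-, 5-, and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def is_overloaded(load, cpu_count, threshold=1.0):
    """Return True when the load per CPU exceeds the threshold.

    The default of 1.0 per CPU is an assumed rule of thumb, not a hard limit.
    """
    return load / cpu_count > threshold


def current_load_report():
    """Describe the current load on this machine (Unix-like systems only)."""
    one, five, fifteen = os.getloadavg()  # the same figures that uptime prints
    cpus = os.cpu_count() or 1
    state = "overloaded" if is_overloaded(one, cpus) else "ok"
    return f"load {one:.2f} {five:.2f} {fifteen:.2f} on {cpus} CPU(s): {state}"
```

Calling current_load_report() on a Linux server returns a one-line summary that can be pasted into a team chat while troubleshooting.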
Kyle covers the Linux boot process from the BIOS to the init process, describing both the classic "System V Init" and "Upstart Init". The book then dives into the individual components of the boot process that can cause problems and discusses how to resolve them. It covers the following: BIOS, GRUB, disabling the splash screen, mounting the root file system, and mounting secondary file systems. The order of the material leads the reader through the potential problems so that each one can be analyzed and worked through sequentially.
Readers gain knowledge about disk issues through a series of troubleshooting scenarios. The first is how to manage a full disk, which Linux prepares for by setting aside reserved space so that the root user can log in and move files around. The reserved space can be examined with the "tune2fs" utility. The "du" command then assists in tracking down the largest directories. Next, if the disk is not full but you can't create files, you may have run out of free inodes. Use the command "df -i" to see how many inodes are in use. Another disk issue arises when a file system protects itself by remounting read-only after experiencing an error. Use the "mount" command to remount it read-write. Alternatively, unmount a corrupted disk with "umount" and use the "fsck" utility to check and repair it. Printing "/proc/mdstat" to the console with "cat" reveals failed disks in a software RAID. The "mdadm" command can remove a failed disk from a RAID configuration and likewise add a good disk to it.
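The inode and reserved-space checks described above can also be approximated from a script. The following Python sketch uses os.statvfs, which exposes the same kind of counters that "df" and "df -i" summarize. The function names and the dictionary keys are assumptions chosen for this example.

```python
import os


def percent_used(total, free):
    """Return the percentage of a resource in use, or 0.0 if total is zero."""
    if total == 0:
        return 0.0
    return 100.0 * (total - free) / total


def filesystem_report(path="/"):
    """Summarize block, inode, and root-reserved usage for one file system."""
    st = os.statvfs(path)
    return {
        # Blocks in use, as a share of all blocks on the file system.
        "blocks_used_pct": percent_used(st.f_blocks, st.f_bfree),
        # Inodes in use, comparable to the IUse% column of "df -i".
        "inodes_used_pct": percent_used(st.f_files, st.f_ffree),
        # Free blocks that unprivileged users cannot consume.
        "reserved_blocks": st.f_bfree - st.f_bavail,
    }
```

A file system where inodes_used_pct is close to 100 while blocks_used_pct is low matches the "can't create files on a disk that is not full" scenario.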
When a server becomes inaccessible from a client computer, each layer of the network should be analyzed in turn, starting from the client. Start by using the "ethtool" command on the client computer to determine whether a physical connection to the network exists. Once the link is detected, the next step is to determine whether the interface is up and has an IP address. The "ifconfig" command reports the status of an interface and its IP address. Next, check for a default gateway with the "route" command, and use the "ping" command to test communication from the client to other computers. With basic communication established, test DNS by using "nslookup" to make sure the server name resolves. Use "traceroute" to determine whether communication breaks down somewhere along the route to the server. Once the route to the machine is confirmed, use the "telnet" utility to check whether the remote port is open. Next, use SSH to connect to the server, then use the "netstat" command locally on the server to check which ports are listening, and check the firewall with the "iptables" command. Use the "iftop" utility to detect slowly performing networks. Finally, the "tcpdump" utility or the "Wireshark" tool allows deep protocol inspection of traffic in and out of a server if necessary.
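The "telnet" port test in this sequence amounts to attempting a TCP connection and seeing whether it succeeds. A minimal Python sketch of that single step follows; the function name and the 3-second default timeout are assumptions for illustration.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, port_is_open("web1.example.com", 80) checks a hypothetical web server. A False result does not say which layer failed, so the earlier link, routing, and DNS checks are still needed to narrow the problem down.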
DNS issues can exist on both the client side and the server side. On the client, the "nslookup" utility can help determine whether the issue is a misconfigured "/etc/resolv.conf" file. In another case, the server may respond to the client indicating that the host searched for is not configured. The "dig" utility is a powerful tool that can help detect server problems involving recursive name servers, DNS caching, TTLs, zone syntax errors, and zone transfers.
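When checking a client's "/etc/resolv.conf", the first question is usually which name servers it lists. The following Python sketch extracts them from resolv.conf-style text. It is a simplified parser written for this example: it ignores directives such as "search" and "options", and the sample addresses are made up.

```python
def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # Drop comments, which resolv.conf marks with '#' or ';'.
        line = line.split("#", 1)[0].split(";", 1)[0].strip()
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


# Hypothetical file contents used only to demonstrate the parser.
SAMPLE = """# written by a hypothetical DHCP client
search example.com
nameserver 10.0.0.2
nameserver 10.0.0.3
"""
```

An empty result from parse_nameservers points to a client with no configured name server, which would explain failed lookups before the DNS server itself is ever contacted.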
Email is delivered according to well-known protocols. The headers within an email expose important information about the route the email took during delivery. If email fails to arrive, "telnet" can be used to send a test email by connecting to port 25 and typing commands according to the SMTP protocol. The codes returned by the mail server can be used to diagnose issues. Additionally, "nmap" will reveal whether a mail server is listening on a port. Scanning the logs and inspecting the mail server's configuration are further options for diagnosing email issues.
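The codes a mail server returns during a manual SMTP session are three-digit numbers whose first digit indicates the outcome. The following Python sketch classifies a reply line by that first digit. The function name and the category labels are assumptions chosen for this example, and the reply texts in the usage note are illustrative.

```python
SMTP_REPLY_CLASSES = {
    "2": "success",
    "3": "intermediate",  # the server is waiting for more input, e.g. after DATA
    "4": "transient failure",  # retrying later may succeed
    "5": "permanent failure",  # retrying unchanged will not help
}


def classify_smtp_reply(line):
    """Return (code, category) for one SMTP reply line such as '250 Ok'."""
    code = line[:3]
    if len(code) != 3 or not code.isdigit():
        raise ValueError(f"not an SMTP reply line: {line!r}")
    return int(code), SMTP_REPLY_CLASSES.get(code[0], "unknown")
```

For instance, a reply such as "550 5.1.1 User unknown" is classified as a permanent failure, while a 4xx reply suggests the message may go through on a later attempt.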
The Nginx and Apache web servers are applications that provide similar website services. A large part of a website's responsibility is to receive requests and send responses, and when a website becomes unavailable it is quickly noticed. The first thing to do is check whether the service providing the website is running, which can be done with the "service" command. Also make sure the service is configured to start when the server boots by using the "chkconfig" command (or a similar tool, depending on the init system). The command line tools "wget" and "curl" use the HTTP protocol to communicate with websites and can quickly test their availability. For example, after restarting a web server you can use these utilities to make sure a specific URL returns a success response (status code 200). HTTP has a set of well-known status codes, and the book describes the status code ranges and their associated meanings. Web servers also generate logs that can be inspected for errors. The most common errors found in the logs include configuration errors and permission issues. Additionally, web servers can report their status directly through a web page. These statistics can be used to determine whether excessive load is causing sluggishness or errors.
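To give a feel for how status code ranges help when reading web server logs, the following Python sketch tallies responses by range (2xx, 4xx, and so on). It assumes access log lines in a Common Log Format style, where the status code follows the quoted request line; the function name and the sample lines are made up for this example.

```python
import re
from collections import Counter

# Matches the three-digit status code that follows the quoted request line.
STATUS_RE = re.compile(r'"\s(\d{3})(?:\s|$)')


def count_status_classes(lines):
    """Count responses per status code range, e.g. {'2xx': 2, '5xx': 1}."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return counts


# Hypothetical log lines used only to demonstrate the function.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Oct/2012:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.7 - - [10/Oct/2012:13:55:40 -0700] "GET /missing HTTP/1.1" 404 209',
    '192.0.2.7 - - [10/Oct/2012:13:55:41 -0700] "POST /app HTTP/1.1" 500 512',
    '192.0.2.9 - - [10/Oct/2012:13:55:45 -0700] "GET / HTTP/1.1" 200 1024',
]
```

A sudden rise in the 5xx count points toward the server or application, while a rise in 4xx usually points toward bad links or client requests.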
MySQL and PostgreSQL are two industrial-strength database technologies. Like the aforementioned web servers, both run as services that can be checked with the "service" and "chkconfig" commands. The general patterns for inspecting a database server are similar to those for web servers. Inspecting logs can reveal previous issues, and the databases themselves can report on their current status. Logging can also be configured for specialized purposes such as tracking slow queries. Additionally, the tooling used for analyzing server slowness and disk issues applies to database troubleshooting.
Troubleshooting may lead to the detection of failed hardware. Hard drives fail most often; however, the following can also fail (or cause failure): RAM, network cards, high temperatures, and power supplies. Each device can exhibit unique symptoms of failure, and some symptoms have multiple root causes.
InfoQ spoke with Kyle Rankin, the book's author, about several topics:
InfoQ: What other books have helped you (Kyle Rankin) become an expert in Linux servers? Why?
Kyle: With technical books there are the ones you keep on your shelf at home and the ones you make sure are at your desk. As I've changed jobs I've noticed a few books always make the cut to the next desk:
DNS and BIND and Postfix: The Definitive Guide: I keep both of these books around for the same reason. In my job I'm always working with DNS and email servers and these two books have always been my first resource when I have a question about BIND or Postfix configuration, respectively.
TCP/IP Illustrated, Vol 1: This is _the_ book for understanding at a fundamental level how TCP/IP works. In my book I say a lot that understanding how something works greatly helps you when troubleshooting it, and this book goes far beyond simply explaining the 3-way TCP handshake or what a MAC address is and digs deep into explaining all the major low-level protocols.
Forensic Discovery: I imagine a lot of systems administrators passed on this book when it came out because they assumed it was aimed at the security crowd. While I have a definite interest in security and forensics, the beginning chapters of this book do about as good an explanation of how a Linux file system works at a low level as I've seen. I also appreciate that it's a nice thin book. Too many authors fill their book out either with long-winded explanations or a lot of reference material just to make the book thick and I prefer books that are thin and to the point.
InfoQ: What differentiates your book from other Linux server and troubleshooting books?
Kyle: One of the first things you will notice is that my book isn't the size of a telephone book. I think far too many technical books in general focus on thickening the book with what amounts to reference material you can get from free software documentation or man pages. Otherwise a lot of technical books try too hard to sound clever or technical at the expense of readability. I prefer to take a much simpler, more practical approach to troubleshooting and talk about general classes of problems and how to use common tools to track down root causes. There were many opportunities where I could have, for instance, spent pages documenting every single MySQL or Postgres diagnostic metric. Instead I try to pare things down to what you actually need to know.
InfoQ: What approach should a person take to help a cross-functional team learn the material in the book?
Kyle: One of the best ways to introduce a cross functional team to the book would be when troubleshooting a non-production issue. That way everyone can take their time and walk through the steps and not worry about making mistakes. Have a team leader assign out parts of the system troubleshooting process to members of the team who might traditionally take a developer or QA role. Another approach might be to apply troubleshooting steps in the book during a post-mortem process. The team can then discuss the steps they took to solve a problem and compare and contrast them to the ideas presented in the book.
About the Book Author
Kyle Rankin is the Systems Engineering Lead for Artemis Internet, Inc. In addition to "DevOps Troubleshooting", he is the author of "The Official Ubuntu Server Book", "Knoppix Hacks", "Knoppix Pocket Reference", "Linux Multimedia Hacks", and "Ubuntu Hacks", and a contributor to a number of other books. Rankin is an award-winning columnist for Linux Journal, and has written for PC Magazine, TechTarget websites and other publications. He speaks frequently on Open Source software including at SCALE, OSCON, Linux World Expo, Penguicon, and a number of Linux Users Groups.