Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Monitoring SRE's Golden Signals

Key Takeaways

  • Golden signals are critical for ops teams to monitor their systems and identify problems.
  • These signals are especially important as we move to microservices and containers, where functions are spread more thinly across more components, including third parties.
  • There are many metrics you could monitor, but industry experience has shown that these five (rate, errors, latency, saturation, and utilization) contain virtually all the information you need to know what’s going on and where.
  • Getting these signals is quite challenging and varies a lot by service and tooling available.
  • These signals are most effectively used as part of anomaly detection, looking for unusual patterns, as these metrics will vary widely by time and day.

Golden signals are increasingly popular these days, due in part to the rise of Site Reliability Engineering (SRE), and the related Google book plus various people’s blogs on improving your site, performance, and monitoring.

Monitoring golden signals is important, but not much is written about how to actually monitor and use them. This article outlines what golden signals are, and how to get and use them in the context of various common services.

What are golden signals?

There is no definitive agreement, but these are the three main lists of golden signals today:

  • From the Google SRE book: Latency, Traffic, Errors, Saturation
  • USE Method (from Brendan Gregg): Utilization, Saturation, Errors
  • RED Method (from Tom Wilkie): Rate, Errors, and Duration

You can see the overlap. USE is about resources with an internal view, while RED is about requests and real work, with an external view.

In this article we will be focusing on a superset consisting of five signals:

  • Request Rate — request rate, in requests/sec.
  • Error Rate — error rate, in errors/sec.
  • Latency — response time, including queue/wait time, in milliseconds.
  • Saturation — how overloaded something is, directly measured by things like queue depth (or sometimes concurrency). Becomes non-zero when the system gets saturated.
  • Utilization — how busy the resource or system is. Usually expressed 0–100% and most useful for predictions (saturation is usually more useful for alerts).

Saturation & utilization are often the hardest to get, but they are also often the most valuable for hunting down current and future problems.

What should we do with our signals?

One of the key reasons these are “golden” signals is they try to measure things that directly affect the end-user and work-producing parts of the system — they are direct measurements of things that matter.

This means they are more useful than less-direct measurements such as CPU, RAM, networks, replication lag, and endless other things.

We use the golden signals in several ways:

  • Alerting — tell us when something is wrong.
  • Troubleshooting — help us find and fix the problem.
  • Tuning & Capacity Planning — help us make things better over time.

The first aspect to focus on is how to alert on these signals.

Broadly, you can and should use your current alerting methods on these signals, as they will be more useful than the CPU, RAM, and other lower level indicators that are usually monitored. Once you have the data, observe for a while, then start adding basic alerts into your normal workflow to see how these signals affect your systems.

However, golden signals are also harder to alert on as they don’t fit traditional static alerting thresholds as well as high CPU usage, low available memory or low disk space do. Static thresholds work, but are hard to set well and generate lots of alert noise, as any ops person (and anyone living with them) will tell you.

In any case, start with static alerts, but set thresholds at levels where we’re pretty sure something is unusual or wrong, such as latency over 10 seconds, long queues, or error rates above a few per second.

If you use static alerting, don’t forget the lower bound alerts, such as near zero requests per second or latency, as these often mean something is wrong, even at 3 a.m. when traffic is light.
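
As a starting point, here is a minimal sketch of such static checks, including the lower-bound alerts mentioned above (all threshold values are placeholders to tune for your own system):

def check_static_thresholds(rate_rps, error_rps, p95_latency_ms, queue_depth):
    """Return a list of alert strings; thresholds here are placeholders."""
    alerts = []
    if p95_latency_ms > 10_000:          # latency over 10 seconds
        alerts.append("latency too high")
    if error_rps > 3:                    # more than a few errors per second
        alerts.append("error rate too high")
    if queue_depth > 0:                  # saturation: anything queued is suspect
        alerts.append("requests are queueing")
    if rate_rps < 0.1:                   # lower-bound alert: near-zero traffic
        alerts.append("request rate near zero")
    return alerts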

Are you average or percentile?

Basic alerts typically use average values to compare against some threshold, but - if your monitoring system can do it - use median values instead, which are less sensitive to big/small outlier values. This will reduce false alerts.

Percentiles are even better. For example, you can alert on 95th percentile latency, which is a much better measure of bad user experience. If the 95th percentile is good, then most everyone is good. You’ll often be shocked by how bad your percentiles are.
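
If you have access to the raw latency samples, a 95th percentile is also easy to compute yourself; a rough nearest-rank sketch (most monitoring systems will do this for you):

def p95(latencies_ms):
    """Rough 95th percentile (nearest rank) of a list of latency samples."""
    if not latencies_ms:
        return 0.0
    ordered = sorted(latencies_ms)
    index = min(len(ordered) - 1, int(round(0.95 * (len(ordered) - 1))))
    return ordered[index]

# Usage: alert if p95(last_five_minutes_of_latencies) > 500   (500 ms is a placeholder)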

Are you an anomaly, or just weird?

Ideally, you can start using anomaly detection on your golden signals. This is especially useful to catch off-peak problems or unusually low metric values, such as when your web request rate at 3am is 5x higher than normal or drops to zero at noon due to a network problem. Furthermore, anomaly detection allows for tighter alerting bands so you can find issues much faster than you would with static thresholds (which must be fairly broad to avoid false alerts).
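
As a very naive illustration of the idea (real tools use far better algorithms and handle seasonality and trends), a rolling z-score flags values that sit far outside the recent norm:

from statistics import mean, stdev

def is_anomalous(history, current, threshold=3.0):
    """Flag the current value if it is more than `threshold` standard deviations
    away from recent history (naive; ignores seasonality and trends)."""
    if len(history) < 10:
        return False                 # not enough data to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold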

However, anomaly detection can be challenging, as few on-premises monitoring solutions can even do it. It’s also fairly new and hard to tune well (especially with the ’seasonality’ and trending that are common in golden signals). Tools that support anomaly detection well include some SaaS / cloud monitoring solutions such as DataDog or SignalFX, as well as on-premises tools like Prometheus or InfluxDB.

Regardless of your tooling, if you want to better understand the various options, algorithms, and challenges that anomaly detection poses, you should read Baron Schwartz’s book on Anomaly Detection for Monitoring.

Can I see you?

In addition to alerting, you should also visualize these signals. Try to get all of a given service’s signals together on one page so you can visually correlate them in time, to see how error rates relate to latency, request rates, and other signals. Most monitoring tools, Datadog for example, can build such combined dashboards.

You can also enrich your metrics with tags/events, such as deployments, auto-scale events, restarts, and more. Ideally, display all these metrics on a system diagram to see how services are related and where latency or errors in lower levels might affect higher levels.

Fix me, fix you

As a final note on alerting, I’ve found SRE golden signal alerts more challenging to respond to, because they are symptoms of an underlying problem that is rarely directly exposed by the alert. For example, a single high-latency problem on a low-level service can easily cause many latency and error alerts all over the system.

This often means engineers must have more system knowledge and be more skilled at digging into the problem, which can easily lie in any of a dozen services or resources.

Engineers have always had to connect all the dots and dig below the alerts, even for basic high CPU or low RAM issues. But the golden signals are usually even more abstract, and it’s easy to have a lot of them.

Fortunately, golden signals help by providing clear metrics on each service and each layer of the stack. This helps nail down which services are most likely contributing to an issue (especially if you have accurate dependency information), and thus where to focus your effort.

Now, let’s look at how to get the golden signals data from common services.

Getting data from multiple services

There are a few nuances and challenges to getting this data in a usable way, and, in the interest of space, the elements below have been simplified somewhat. Also note that, in some cases, you have to do your own processing, such as delta calculations (change per second) when you use counter-based metrics such as network bytes, log lines, total requests, and others (most monitoring systems will do this automatically).
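
If your monitoring system does not do the delta calculation for you, it is simply the difference between two counter samples divided by the elapsed time; a minimal sketch:

import time

_last_sample = {}   # metric name -> (timestamp, counter value)

def counter_to_rate(name, counter_value):
    """Convert a monotonically increasing counter into a per-second rate."""
    now = time.time()
    previous = _last_sample.get(name)
    _last_sample[name] = (now, counter_value)
    if previous is None:
        return None                  # first sample, no rate yet
    prev_time, prev_value = previous
    elapsed = now - prev_time
    if elapsed <= 0 or counter_value < prev_value:
        return None                  # counter reset or clock problem
    return (counter_value - prev_value) / elapsed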

Load balancers’ golden signals

Load balancers are key components of most modern systems, usually in front of an application, but increasingly inside systems too, supporting containers, socket services, databases, and more.

There are several popular load balancers in use today, so we’ll cover the three most common ones: HAProxy, AWS ELB, and AWS ALB.

Load balancer frontends and backends

Load balancers have frontends and backends, usually several of each. Depending on your architecture, you may just use a sum of all of these, or you can break out signals for various services for more detail.

Also, as you’ll see below, load balancers usually have better backend data for the various web/app servers than you can get from the web/app servers themselves, so you can choose whichever is easier to monitor.

HAProxy

HAProxy data comes in CSV format, and can be accessed in a few ways: via the built-in HTTP stats page (as CSV, by appending ;csv to the stats URL), or via the unix stats socket (the show stat command).

HAProxy golden signals (capitalized words reference official HAProxy variable names) can be retrieved as follows; a small polling sketch follows the list:

  • Request Rate — Use REQ_TOT and do delta processing to get the rate. Use RATE for servers (though this is only over the last second).
  • Error Rate — Use response errors, ERESP, which counts backend errors. It’s a counter, so you must do delta processing on it. You can also get this per backend server, as well as HTTP 5xx error counts, which are critical for any system.
  • Latency — Use response time, RTIME (per backend), which calculates an average over the last 1024 requests. Also available per server.
  • Saturation — Use the number of queued requests, QCUR. This should be zero, so alert if it is higher.
  • Utilization — This is not useful for HAProxy, as it’s very hard to measure and, on most systems, HAProxy’s capacity exceeds the backend systems by far. Therefore, it’s nearly impossible to overload.
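
Here is a minimal sketch of polling the CSV stats over HTTP and pulling out a few of these fields; the stats URL and the frontend/backend names are placeholders for your own configuration, and exact field availability varies by HAProxy version:

import csv
import io
import urllib.request

STATS_URL = "http://localhost:8404/stats;csv"    # placeholder stats URL

def haproxy_stats():
    """Return a dict of (proxy, server) -> row from HAProxy's CSV stats page."""
    raw = urllib.request.urlopen(STATS_URL).read().decode()
    # The header line starts with "# "; strip it so DictReader sees the field names.
    reader = csv.DictReader(io.StringIO(raw.lstrip("# ")))
    return {(row["pxname"], row["svname"]): row for row in reader}

stats = haproxy_stats()
frontend = stats[("my_frontend", "FRONTEND")]    # placeholder proxy names
backend = stats[("my_backend", "BACKEND")]
req_total = int(frontend["req_tot"] or 0)        # delta-process this for request rate
resp_errors = int(backend["eresp"] or 0)         # delta-process this for error rate
queue_depth = int(backend["qcur"] or 0)          # saturation: should stay at zero
avg_rtime_ms = int(backend.get("rtime") or 0)    # average latency over the last 1024 requests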

AWS ELB and ALB

All data points for the AWS Elastic Load Balancing and Application Load Balancing services come via CloudWatch. If you do percentiles and statistics on ELB/ALB signals, be sure to read the Statistics for Classic Load Balancer Metrics section of the documentation carefully.

ELB metrics are available for the ELB as a whole, but not per backend group or server. ALB data is very similar, with more available data and a few differences in metric names. Metrics are available for the ALB as a whole, and by Target Group (via Dimension Filtering), where you can get data for backend servers instead of monitoring the web/app servers directly. Per-server data is not available from the ALB (though you can filter by Availability Zone, which would be per-server if you had only one target backend server per Availability Zone).

ALB/ELB signals are as follows (the sum(), average(), and max() parts refer to the CloudWatch statistics you must choose when retrieving these metrics); a small CloudWatch sketch follows the list:

  • Request Rate — Use requests per second, from sum(RequestCount) divided by the configured CloudWatch sampling time, either 1 or 5 minutes. This will include errors.
  • Error Rate — You should add three metrics:

ELB: sum(HTTPCode_Backend_5XX), sum(HTTPCode_ELB_5XX), and sum(BackendConnectionErrors)

ALB: sum(HTTPCode_Target_5XX_Count), sum(HTTPCode_ELB_5XX_Count), and sum(TargetConnectionErrorCount)

  • Latency — Use averages:

ELB: average(Latency)

ALB: average(TargetResponseTime)

  • Saturation — Use:

ELB: max(SurgeQueueLength) and sum(SpilloverCount)

ALB: sum(RejectedConnectionCount)

  • Utilization — There is no good way to get utilization data for ELBs or ALBs, as these metrics are neither provided nor exposed to us.
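
As an illustration, here is a minimal boto3 sketch that pulls the ALB request rate from CloudWatch; the region and the LoadBalancer dimension value are placeholders, and the error, latency, and saturation metrics follow the same pattern:

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")   # placeholder region

def alb_request_rate(load_balancer_dimension):
    """Requests/sec for one ALB, from the sum of RequestCount over 60-second periods."""
    period = 60
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/ApplicationELB",
        MetricName="RequestCount",
        Dimensions=[{"Name": "LoadBalancer", "Value": load_balancer_dimension}],
        StartTime=end - timedelta(minutes=10),
        EndTime=end,
        Period=period,
        Statistics=["Sum"],
    )
    datapoints = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    if not datapoints:
        return 0.0
    return datapoints[-1]["Sum"] / period      # note: RequestCount includes errors

# Usage (the dimension value is a placeholder, e.g. "app/my-alb/0123456789abcdef"):
# print(alb_request_rate("app/my-alb/0123456789abcdef"))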

Web servers’ golden signals

It’s critical to get good signals from the web servers. Unfortunately, they don’t usually provide this data directly, and when they do they still lack aggregated data for all the servers. Thus we are left with three choices:

  1. Use the very limited built-in status reports/pages
  2. Collect & aggregate the web server’s HTTP logs
  3. Utilize the upstream load balancer per-server metrics (if we can)

The last choice is usually the best, because the load balancers have better metrics than the web servers do. See the above section on load balancing golden signals to find out how.

However, not all systems have the right load balancing type, and not all monitoring systems support this type of backend data. In those cases, we must resort to the first two options.

The following are painful, but worthwhile ways to retrieve golden signals from web servers using status pages and HTTP logs. We will look at two popular web servers: Apache and Nginx.

Enable status monitoring

To get monitoring data, you first need to enable status monitoring: mod_status (with ExtendedStatus On) for Apache, and the stub_status module for Nginx.

Enable logging

You also need to turn on logging and add response time to the logs by editing your web configs:

  • Apache — Add the “%D” field to the log definition (usually at the end of the httpd.conf file), which logs the response time in microseconds (use “%T” if on Apache v1.x, but note this logs whole seconds only).
  • Nginx — Add the “$upstream_response_time” field to log backend response time (usually in the nginx.conf file).

Log processing for metrics

Latency and other key metrics can only be obtained from the logs, which are hard to read, parse, and summarize. This section describes how to best do that for various servers.

There are many HTTP log tools, but they can’t calculate the golden signals — we need tools that can read & summarize logs. The ELK stack can do this (as can Splunk, Sumologic, Logz, and others) with detailed summary queries on response time, status counts, etc. Also most SaaS monitoring tools today, such as DataDog, can extract these.

Traditional monitoring systems such as Zabbix can’t do this so easily, especially as we need log aggregation metrics, not the actual log lines. Ideally you can find a monitoring system or tools that natively support web server log metrics.

For web servers, we can get standard status and monitoring statistics as follows (a small status-page sketch follows the list):

  • Request Rate — Requests per second, which you can get via the server’s status info:

Apache — Apache provides Total Accesses, which you need to do delta processing on to get requests per second (just the difference between current metric and the last one, divided by the number of seconds between them, e.g. (50800 - 50200) / 60 sec = 10/second). Do not use Apache’s “Requests per Second” as it relates to the entire server process lifetime, which could be months or years, not the last few minutes that we’re interested in.

Nginx — Do delta processing on requests to get requests per second.

  • Error Rate — Count the log’s 5xx errors per second.
  • Latency — Average (or take the median of) the response times from the log. I suggest a 1-5 minute sampling period to reduce noise while staying responsive.
  • Saturation — Not very useful because it’s nearly impossible to saturate nginx in most systems, thus it will always be zero unless you do billions of requests per day per server.
  • Utilization — Also not useful for Nginx. For Apache, monitor the ratio of BusyWorkers to the smallest of MaxRequestWorkers, MaxClients, and ServerLimit (from the httpd.conf file; they all interact, and the smallest wins). Also count HTTP 503 errors from the logs and delta-process them into errors/second.
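
Here is a minimal sketch that reads Apache's machine-readable status page and turns Total Accesses into a request rate; the URL is a placeholder, and the same pattern works for Nginx's stub_status output with different field names:

import time
import urllib.request

STATUS_URL = "http://localhost/server-status?auto"    # placeholder URL; needs mod_status

def read_apache_status():
    """Parse the 'key: value' lines of mod_status ?auto output into a dict."""
    raw = urllib.request.urlopen(STATUS_URL).read().decode()
    status = {}
    for line in raw.splitlines():
        if ": " in line:
            key, value = line.split(": ", 1)
            status[key] = value
    return status

first = read_apache_status()
time.sleep(60)
second = read_apache_status()

elapsed = 60
requests_per_sec = (int(second["Total Accesses"]) - int(first["Total Accesses"])) / elapsed
busy_workers = int(second["BusyWorkers"])    # compare to MaxRequestWorkers for utilization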

App servers’ golden signals

Application servers are where the application’s main work gets done. Ideally, you can embed observability metrics in your code on the app servers. This is especially useful for the Error Rate and Latency golden signals, as it will save you considerable effort further down the line. In particular, for Golang, Python, and Node.js, this is essentially the only option.

PHP

PHP runs either as Apache mod_php or PHP-FPM. For mod_php there are no good external signals, just the Apache status page and logs as covered in the above section on web servers.

For PHP-FPM, we need to enable the status page, which is available in JSON, XML, and HTML formats. We also need the PHP-FPM access, error, and slow logs.

The error log is set in the main PHP-FPM config file. The access log is rarely used, but turn it on and set the format to include “%d”, the time taken to serve requests (this is done in the pool configuration, usually www.conf, not in the main php-fpm.conf). The slow log is also set in this file. For more details, see this useful how-to page.

Finally, you should configure your PHP error logs properly. By default, they go into the web server’s error log, mixed in with 404 and other HTTP errors, which makes them hard to analyze or aggregate. It’s better to add an error_log override setting (php_admin_value[error_log]) in your PHP-FPM pool file (usually www.conf).

For PHP-FPM, our signals are as follows (a small status-page sketch follows the list):

  • Request Rate — There is no direct way to get this other than to read the access log and aggregate it into requests per second.
  • Error Rate — Count PHP errors per second from the PHP error log (the PHP-FPM error log itself doesn’t contain any metrics or application error info).
  • Latency — From the PHP-FPM access log get the response times and average them.
  • Saturation — Monitor the “Listen Queue”, which becomes non-zero when there are no more FPM processes available, meaning the pool is saturated.
  • Utilization — Monitor the in-use FPM processes (“Active Processes”) using your monitoring system’s process counters, and compare to the configured maximum processes in your FPM config.
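
Assuming the status page has been enabled and is reachable on localhost (the path below is a placeholder), a small sketch of reading it as JSON:

import json
import urllib.request

FPM_STATUS_URL = "http://localhost/status?json"     # placeholder; requires pm.status_path = /status
MAX_CHILDREN = 50                                    # placeholder: pm.max_children from www.conf

def fpm_signals():
    """Pull saturation and utilization signals from the PHP-FPM status page."""
    status = json.loads(urllib.request.urlopen(FPM_STATUS_URL).read().decode())
    return {
        "listen_queue": status["listen queue"],          # saturation: should stay at zero
        "active_processes": status["active processes"],  # in-use FPM workers
    }

signals = fpm_signals()
utilization = signals["active_processes"] / MAX_CHILDREN
saturated = signals["listen_queue"] > 0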

Java

For Java, the golden signals are better monitored upstream, either in a frontend web server or a load balancer. To monitor Tomcat directly, we need to configure it for monitoring, which means making sure you have good access/error logging in your application code, and turning on JMX support. Enable JMX at the JVM level and restart Tomcat. For security, make JMX access read-only and limit it to the local machine or a read-only user.

The Tomcat signals are as follows (a Jolokia-based sketch follows the list):

  • Request Rate — Via JMX, use GlobalRequestProcessor’s requestCount and do delta processing to get requests per second.
  • Error Rate — Via JMX, use GlobalRequestProcessor’s errorCount and do delta processing for errors per second. Includes non-HTTP errors unless you filter by processor.
  • Latency — Via JMX, get GlobalRequestProcessor’s processingTime, but this is total time since restart, which when divided by requestCount will give you the long-term average response time, which is not very useful. Ideally, your monitoring system or scripts can store both these values each time you sample, then get the differences and divide them.
  • Saturation — If ThreadPool’s currentThreadsBusy value equals maxThreads value then Tomcat is saturated and will start queueing.
  • Utilization — Use JMX to calculate currentThreadsBusy / maxThreads which corresponds to thread utilization percentage.
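
If you script this yourself, one option is a JMX-over-HTTP bridge such as Jolokia. Here is a rough sketch, assuming a Jolokia agent is attached to Tomcat (the port and connector name are placeholders), that samples the GlobalRequestProcessor counters and computes the interval-based latency described above:

import json
import time
import urllib.request

JOLOKIA_URL = "http://localhost:8778/jolokia/"     # placeholder Jolokia agent endpoint
MBEAN = 'Catalina:type=GlobalRequestProcessor,name="http-nio-8080"'   # placeholder connector

def read_request_processor():
    """Read requestCount, errorCount, and processingTime via a Jolokia POST read request."""
    body = json.dumps({
        "type": "read",
        "mbean": MBEAN,
        "attribute": ["requestCount", "errorCount", "processingTime"],
    }).encode()
    request = urllib.request.Request(
        JOLOKIA_URL, data=body, headers={"Content-Type": "application/json"})
    return json.loads(urllib.request.urlopen(request).read().decode())["value"]

first = read_request_processor()
time.sleep(60)
second = read_request_processor()

request_delta = second["requestCount"] - first["requestCount"]
requests_per_sec = request_delta / 60
errors_per_sec = (second["errorCount"] - first["errorCount"]) / 60
# Average latency over the interval rather than since restart:
avg_latency_ms = ((second["processingTime"] - first["processingTime"]) / request_delta
                  if request_delta else 0.0)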

Ruby

Ruby running under Passenger provides a passenger-status page which can be queried to get key metrics.

For Passenger or Apache mod_rails, our signals are:

  • Request Rate — Get the “Processed” count per “Application Group” and do delta processing to calculate requests per second.
  • Error Rate — There is no obvious way to get this as Passenger has no separate error log.
  • Latency — You need to get this from the Apache/Nginx access logs. See the above section on web servers golden signals.
  • Saturation — A non-zero value for “Requests in Queue” per “Application Group” will tell you that the server is saturated. Note: do not use “Requests in top-level queue” as this should always be zero, according to the documentation.
  • Utilization — There is no obvious way to get this signal.

Python, Node.js, and Golang

For Python, Node.js, and Golang, the app servers are very hard to monitor externally; most people monitor them by embedding metrics in the code and/or using a special package or APM tool such as New Relic. Several services, such as DataDog, can help with this, but you still have to code the metrics yourself.

For Python, Django has a Prometheus module that can be useful.

For Node.js, KeyMetrics and AppMetrics provide an API to get most of this data.

For Golang, some people use Caddy, which can provide latency via the {latency_ms} field (similar to Apache/Nginx), but does not provide status or request rate data (although there is a Prometheus plug-in that has a few metrics).

If you don’t use an existing library or tool, you can always embed golden signals directly in your code (a minimal sketch follows the list):

  • Request Rate — Most code runs on a per-request basis, so this is hard to track globally, because the code ends and loses state after each request. The easiest approach is usually to keep a global request counter and emit it directly.
  • Error Rate — Similar to request rate, probably best using a global counter.
  • Latency — Easy to get per request by capturing start and end times, but you probably need to keep a running total of elapsed time to divide by the request count.
  • Saturation — This is very hard to get from within your code as you have no easy global access, and the global server-level services don’t track this.
  • Utilization — Same as saturation.
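
Here is a minimal, framework-agnostic sketch of such embedded counters; in a real service you would call record_request() from your request handler and have a reporter thread ship the snapshot to your monitoring system:

import threading

_lock = threading.Lock()
_requests = 0
_errors = 0
_total_latency_ms = 0.0

def record_request(latency_ms, is_error=False):
    """Call this at the end of every request handler."""
    global _requests, _errors, _total_latency_ms
    with _lock:
        _requests += 1
        _total_latency_ms += latency_ms
        if is_error:
            _errors += 1

def snapshot_and_reset():
    """Called periodically by a reporter; returns totals for the interval."""
    global _requests, _errors, _total_latency_ms
    with _lock:
        requests, errors, total_latency = _requests, _errors, _total_latency_ms
        _requests, _errors, _total_latency_ms = 0, 0, 0.0
    avg_latency = total_latency / requests if requests else 0.0
    return {"requests": requests, "errors": errors, "avg_latency_ms": avg_latency}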

Databases’ golden signals

Databases are at the core of most online systems, and thus their golden signals are often critical to good system monitoring and troubleshooting. In particular, high latency at the database level is often the root cause of website or app problems.

MySQL’s golden signals

Getting MySQL’s golden signals varies in complexity depending on the version you are running, and whether you are running MySQL yourself or using a cloud service such as AWS RDS or Aurora.

Everything below should apply to MySQL-based AWS RDS service, as long as the performance schema is activated — you must turn this on via your AWS RDS instance parameter group and restart your database.

For AWS Aurora, CloudWatch can provide all that you need.

For MySQL, our signals are:

  • Request Rate — Queries per second, most easily measured by summing MySQL’s status variables com_select, com_insert, com_update, com_delete, and Qcache_hits, followed by delta processing to get queries per second.

Here’s an SQL code snippet that retrieves the sum of all queries executed:

SELECT sum(variable_value)
FROM information_schema.global_status
WHERE variable_name IN ('com_select', 'com_update', 'com_delete', 'com_insert', 'qcache_hits');

  • Error Rate — Get the global error count from the performance schema, then do delta processing.

Here’s an SQL code snippet that retrieves the global error count:

SELECT sum(sum_errors) AS error_count
FROM performance_schema.events_statements_summary_by_user_by_event_name
WHERE event_name IN ('statement/sql/select', 'statement/sql/insert', 'statement/sql/update', 'statement/sql/delete');

  • Latency — We can get the latency from the performance schema. To get the latency we use two statements, SELECT and TRUNCATE, as we must truncate the table between reads:

SELECT (avg_timer_wait)/1e9 AS avg_latency_ms
FROM performance_schema.events_statements_summary_global_by_event_name
WHERE event_name = 'statement/sql/select';

TRUNCATE TABLE performance_schema.events_statements_summary_global_by_event_name;

  • Saturation — This is hard to get. The easiest way is to monitor running threads and alert when there is a sharp increase. This is an instantaneous measurement, so you should average it over several short monitoring intervals.

SELECT sum(variable_value)
FROM information_schema.global_status
WHERE variable_name = 'THREADS_RUNNING';

If you have an I/O-limited workload, you can monitor I/O utilization in Linux or DiskQueueDepth on AWS RDS.

  • Utilization — Monitor CPU and I/O usage directly from the operating system. On AWS RDS you can also get CPUUtilization from CloudWatch.

Operating system’s golden signals

The OS is of course a key part of any system, as it underlies every other service. Thus, it’s often critical to monitor the CPU, RAM, network, and disks to ensure good and reliable service.

Linux’s golden signals

Application services depend on the hardware underneath them, which provides their core resources: CPU, RAM, network, and I/O. So ideally we can retrieve our golden signals from Linux, or whatever the operating system may be. This is especially important when upper-level services are sensitive to the underlying resource usage (in particular I/O).

Even if you don’t alert on these things, they provide valuable details for observed changes in higher-level metrics. For example, if MySQL latency suddenly rises — were there changes in the SQL, in the data, or in the underlying I/O system?

For Linux, we are mostly interested in CPU and disk performance, as RAM and modern networks are much less important from a golden signals perspective as they are quite fast and not usually a bottleneck.

CPU

For CPU, the only real signals are utilization and saturation, as errors, latency, and requests aren’t very relevant:

  • Saturation — We want the CPU run queue length, which becomes greater than zero when processes have to wait for a CPU to become available. It’s hard to get an accurate measurement.

We can use load average, but only if there is very little I/O, as Linux’s load average includes processes waiting on I/O. A load average above one to two times the CPU count is usually considered saturated. Real CPU run queue data is very hard to get, as the usual tools take only instantaneous measurements, which are quite noisy.

To solve this, I created a new open source tool called runqstat (still under development).

CPU steal percentage also indicates the CPU is saturated. It is also useful for single-threaded loads like HAProxy or Node.js where the run queue is less accurate.

  • Utilization — We want the CPU percentage, but this also needs a bit of math, so you need to calculate CPU percentage as:

User + Nice + System + (probably) Steal

Do not add the idle or iowait percentages. A small /proc/stat sketch of this calculation follows.
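
A minimal sketch of that calculation straight from /proc/stat, using two samples and their deltas:

import time

def cpu_times():
    """Read the aggregate 'cpu' line from /proc/stat and name its fields."""
    with open("/proc/stat") as f:
        fields = [int(v) for v in f.readline().split()[1:]]
    names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
    return dict(zip(names, fields))

first = cpu_times()
time.sleep(5)
second = cpu_times()

delta = {k: second[k] - first[k] for k in first}
busy = delta["user"] + delta["nice"] + delta["system"] + delta["steal"]
total = sum(delta.values())
cpu_utilization_pct = 100.0 * busy / total if total else 0.0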

For containers, it’s more complex, and saturation and utilization are harder to measure. One simple way to get saturation is to read each container’s cpu.stat file in the cgroup filesystem and use the nr_throttled counter, which increments each time the container is CPU-throttled. Delta this to get throttles/second, which tells you the container is asking for more CPU than its limit allows.

Disk usage

For disk usage, golden signals come from the iostat utility or directly from /proc/diskstats (most monitoring agents do this). Get data for the disk(s) where your service’s data sits. A small /proc/diskstats sketch follows the list.

  • Request Rate — The disk request rate is the IOs per second for the disk system (after request merging), which are the r/s and w/s columns in iostat.
  • Error Rate — Not useful as disk errors are much too rare to be relevant.
  • Latency — The read and write times, which in iostat is the average wait time (await). This includes queue time, so it rises sharply when the disk is saturated. You can also measure iostat’s service time if you have I/O-sensitive services.
  • Saturation — Best measured by the I/O queue depth per disk, which is the aqu-sz (avgqu-sz on older versions) column in iostat.
  • Utilization — This is not useful on modern SSD, cloud, or SAN disks because they process IO requests in parallel.
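
A minimal sketch of computing these signals from /proc/diskstats for one device; the device name and sampling interval are placeholders:

import time

DEVICE = "sda"     # placeholder: use the device that holds your service's data

def disk_counters(device):
    """Return (reads, read_ms, writes, write_ms) for one device from /proc/diskstats."""
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == device:
                return int(parts[3]), int(parts[6]), int(parts[7]), int(parts[10])
    raise ValueError("device not found: " + device)

r1, rms1, w1, wms1 = disk_counters(DEVICE)
time.sleep(60)
r2, rms2, w2, wms2 = disk_counters(DEVICE)

interval = 60
reads_per_sec = (r2 - r1) / interval
writes_per_sec = (w2 - w1) / interval
ios = (r2 - r1) + (w2 - w1)
# Average wait per I/O (similar to iostat's await), including queue time:
avg_await_ms = ((rms2 - rms1) + (wms2 - wms1)) / ios if ios else 0.0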

Conclusion

Golden signals are a rich and interesting area for exploration, as many services still lack good monitoring or don’t expose the right metrics.

Even in the midst of a cloud and DevOps revolution, many mainstream tools still don’t really provide all the data we need to effectively manage and maintain our infrastructure, but we keep working to make it better, day by day.

About the Author

Steve Mushero has 25+ years of IT & Software experience, mostly as CTO or IT Manager, in a wide range of startups and large companies, from New York to Seattle, Silicon Valley, and Shanghai. He is currently CEO of OpsStack.io and of ChinaNetCloud, which is China’s first and largest Internet Managed Service Provider. Steve is based in Shanghai and Silicon Valley.
