BT

Book Review: DTrace: Dynamic Tracing in Oracle Solaris, Mac OS X and FreeBSD

Posted by Alex Blewitt on Aug 03, 2011 |

DTrace provides a means to dynamically instrument and trace a running program with zero overhead when tracing is not enabled. It provides a means to trace both in user and kernel level with the same technology.

"DTrace: Dynamic Tracing in Oracle Solaris, Mac OSX and FreeBSD" is written by Brendan Gregg and Jim Mauro, both expert users in DTrace and the Solaris Internals from Prentice Hall. It covers the key features of the DTrace environment, the D language that is used to write the scripts used to provide answers to questions regarding the performance of the system under question. It also provides practical examples across all operating systems which have DTrace embedded, including Solaris, Mac OSX (since 10.5) and FreeBSD (since 7.1).

Probes

Programs can be traced with either dynamic or static probes. Dynamic probes are often used with kernel level function calls; instead of having to embed traces in each function call, the kernel can trace function entry and exit points without any need for code. On the other hand, programs can also have probes compiled in.

Unlike other approaches for tracing (where the trace points are compiled in or not whether or not they are debug mode), DTrace probes use dynamic code replacement. When a trace is not enabled, the compiled code runs as usual; when it is dynamically enabled, the instructions are temporarily replaced by tracing code. As a result, traces that are not enabled have no impact on the runtime of code.

Probes are broken down by provider, module, function and name. When used in a D script, the format is provider:module:function:name, although each can be specified with the form of a trailing wildcard to match appropriate names. The exact format and meaning of module, function and name is provider dependent.

Providers

Providers enable different kind of conditions to be monitored. For example, the profile provider allows samples to be taken periodically (say, once per second) and can be configured to run once per system or once per CPU (tick). Other providers include the fbt, or functional boundary trace, which allows calls to kernel-level functions to be intercepted, timed and aggregated.

In addition, there are providers which can be configured on a per-process level instead of system-wide. For example, the pid provider allows you to introspect the state of a single process, and for OSX, objc provider allows introspection of Objective-C objects at runtime.

Statically compiled traces can be probed with the sdt provider, which can augment the code at a lower level than method entry/exit, and can be used to expose in-memory data structures to collecting scripts.

D scripts

Although probes can be enabled dynamically, they just produce data. In order to determine performance characteristics the data needs to be processed, aggregated and filtered. In order to get the best performance possible, trace programs are compiled in and executed in the process space of the program under trace. To prevent arbitrary code execution and introduce crashes, trace scripts are written in the D programming language, similar to C syntax with awk style data processing. The dtrace executable compiles the D script and loads it into the target process; to do so, it needs root access.

D programs are a set of probes with corresponding actions. When a probe is matched, the corresponding action is fired, in much the same way that an awk script will execute its actions if a particular regular expression or predicate is met. D scripts can unconditionally fire on a probe being fired, or they can have an optional predicate which can be used to further restrict the probe's action. The predicate can be based on the current state of the script or on the arguments to the probe in question; this allows selective filtering on system-wide processes to a specific process id or a specific subset of function calls.

The actions that can be taken either include displaying output (with printf), or more commonly, aggregating data which can then be displayed later. In order to speed up execution of DTrace scripts, data is often aggregated (which uses per-thread caches to store the results) and then merges them into one dataset for presentation at the end of the script. This data can be post-processed, such as quantizing or summing as desired.

Examples

Here is an example of a DTrace script which prints out "Hello World" every second:

$ sudo dtrace -n 'profile:::tick-1s { printf("%u Hello World",walltimestamp/1000000000)}'
dtrace: description 'profile:::tick-1s ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  3  62423                         :tick-1s 1310249459 Hello World
  3  62423                         :tick-1s 1310249460 Hello World
  3  62423                         :tick-1s 1310249461 Hello World

In this case, the dtrace provider is running on CPU#3, and the ID corresponds to the unique provider for ticking on a 1 second interval. In this case, walltimestamp evaluates to a nanosecond value of the current system time, and here shows the profiling being done ever second. If we wanted to fire this on all CPUs (for example, we wanted to get a snapshot of the per-CPU utilisation at the time) then we can use the profile function instead:

$ sudo dtrace -n 'profile:::profile-1s { printf("%u Hello World",walltimestamp/1000000000)}'
dtrace: description 'profile:::profile-1s ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0  62424                      :profile-1s 1310249540 Hello World
  1  62424                      :profile-1s 1310249540 Hello World
  2  62424                      :profile-1s 1310249540 Hello World
  3  62424                      :profile-1s 1310249540 Hello World
  0  62424                      :profile-1s 1310249541 Hello World
  1  62424                      :profile-1s 1310249541 Hello World
  2  62424                      :profile-1s 1310249541 Hello World
  3  62424                      :profile-1s 1310249541 Hello World

It's worth noting that the authors explicitly recommend against profiling on a 1s basis, and instead suggest profiling at 997Hz or 1001Hz to ensure that regular system-wide processes (which might be scheduled at a 1s period) don't skew the results seen over a short running DTrace script.

However, it's more common for a DTrace script to collect data; either timings or textual data. To enable the fastest processing possible, there are two different areas of storage; per-thread and global. Although it's possible to write to global storage, this can lead to race conditions and data loss if competing threads write out at the same time. Instead, it is preferable if data is collected on a per-thread basis and then aggregated at the end of the run.

Aggregations and associative arrays

An associative array is a key-value structure, found in many scripting languages. The key can be multi-valued, and whilst the value contains a single expression, it is fairly common for the value used to be the product of an aggregating function.

An aggregation is a variable which can be written to from multiple CPUs at once without data corruption. These are evaluated with aggregation functions, which allow data to be aggregated afterwards across multiple threads. Aggregation functions are prefixed with an @ symbol, and typically use functions such as count(), sum() and avg(), as well as processing functions such as quantize() and lquantize(). Whilst aggregations can be scalars, they are typically also used in conjunction with associative arrays to record specific information about the actions as well.

Here is a simple DTrace script to monitor the file usage on the system, adapted from Brendan Gregg's DTrace OneLiner scripts (installed as /usr/bin/filebyproc.d on systems that support it):

$ sudo dtrace -n 'syscall::open*:entry { @files[execname, copyinstr(arg0)] = count(); }'
dtrace: description 'syscall::open*:entry ' matched 3 probes
^C
  mdworker        /.vol/234881026/1233333//Contents/PkgInfo          1
  mdworker        /.vol/234881026/1233334//Contents/PkgInfo          1
  mdworker        /.vol/234881026/1233335//Contents/PkgInfo          1
  mdworker        /.vol/234881026/1233336//Contents/PkgInfo          1
  mdworker        /.vol/234881026/1233337//Contents/PkgInfo          1
  mds             /private/var/db/sudo/alex                          2
  mds             live.8.indexArrays                                24
  mds             live.8.indexUpdates                               24
  mds             .                                                151

In this case, the files variable is an associative array, keyed by the combination of execname and arg0. In this case, the function being probed is any kernel function beginning with open* – in this case, open(), open_nocancel() and open_extended(). For each of these functions, the first argument is the name of the file being opened; man 2 open tells us that the signature is open(const char *path, int oflag, ...);.

Because DTrace runs in the kernel-level space, function arguments may not be available if they are in user-level space. The D function copyinstr() takes a string in user-level space, and makes it available in kernel-level space for processing.

Finally, the value of the array is assigned the value count(). This isn't evaluated directly, but rather, each time a new entry is written to the array, the value is incremented. It has a similar effect to @files[...]++, except that instead of reading and writing back the value it simply adds a new value to a per-CPU cache, which then gets added together at the end.

The default operation is to print out the value of the aggregation at the end of the run; however, with a combination of the profile provider above, you could get a periodic dump of data.

Quantizing results

There is a difference between data and information; to get the latter form the former, it helps to be able to process it to give a meaningful answer. One way of doing this is with quantizations. DTrace has two options built in; quantize(), which provides a logarithmic range and lquantize() to perform a linear range. For example, consider the following example which determines which CPU the application is running on:

$ sudo dtrace -n 'profile:::profile-100hz /pid/ { @sample[pid,execname] = lquantize(cpu,0,10,1);}'
dtrace: description 'profile:::profile-100hz ' matched 1 probe
^C

     1762  dtrace
           value  ------------- Distribution ------------- count
             < 0 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
               1 |                                         0

      159  Safari
           value  ------------- Distribution ------------- count
             < 0 |                                         0
               0 |@@                                       6
               1 |@@@@@@@@@@                               35
               2 |@@@@@@@@@@@@@@                           48
               3 |@@@@@@@@@@@@@@                           48

This sets up a profile running at 100Hz (0.1s) and collects the process id and executable name, with a linearly quantised data set. We log the CPU number, with the range 0..10 in steps of 1. You can see from the distribution above that the dtrace program only runs on one CPU (probably because it is single threaded; all the work is done in the kernel), whilst Safari is running across all processors. This works for a small fixed set of data, but often it's useful to know order of magnitude instead, which can be achieved by the quantize() function instead. Here is a script which gives a breakdown of the sizes of write() functions:

$ sudo dtrace -n 'syscall::write:entry { @sizes=quantize(arg2); }'
dtrace: description 'syscall::write:entry ' matched 1 probe
^C
           value  ------------- Distribution ------------- count
               0 |                                         0
               1 |@@                                       11
               2 |                                         0
               4 |@@                                       8
               8 |                                         0
              16 |                                         0
              32 |@@@                                      13
              64 |@                                        4
             128 |@@@@@@@@@                                41
             256 |@@                                       10
             512 |@@@@@@@@@@@                              54
            1024 |@                                        3
            2048 |@                                        5
            4096 |@@@@@                                    23
            8192 |@@@                                      14
           16384 |                                         0

This tells us that the system as a whole (over the couple of seconds the data was being collected for) spent mostly with 512-byte writes, followed by 128-byte writes. Although the aggregation could have told us the exact values, it wouldn't have been as useful to understand what is happening here. Clearly, monitoring the system as a whole is useful but often DTrace is used to monitor specific processes. For example, we can concentrate on just the network traffic generated by Safari with syscall::write:entry /execname="Safari"/ {...}.

More details

This introduction has just scratched the surface of what DTrace can do. The book takes you through not only the full language syntax and built-in variables (such as execname and walltimestamp) but also provides more details about the uses to which DTrace can be put. The one-liners above demonstrate its capabilities, but if you wanted to use DTrace to optimise the performance of a database server, DTrace will allow you to get answers to questions which you wouldn't have easily been able to get before.

Not only that, but because the low-level details of each operating system's kernel and IO libraries can vary, the book takes you through specific cases of where problems differ on Solaris, OSX and FreeBSD. There are also a number of helpful DTrace recipes, which provide standard solutions to specific problems that are useful when debugging performance issues. Many of these take the form of trying to answer a specific problem, giving not only an answer but also a feel of how one can go about solving similar problems in the future. If you're interested in learning more about the low-level details of DTrace then this is the key book to read.

About the Book Authors

Brendan Gregg is a performance specialist at Joyent and is known worldwide in the field of DTrace. Brendan created and developed the DTraceToolkit and is the coauthor of Solaris Performance and Tools (McDougall, Mauro, and Gregg, 2006) as well as numerous articles about DTrace. He was previously the performance lead for the Sun/Oracle ZFS storage appliance and a software developer on the Fishworks advanced development team at Sun, where he worked with the three creators of DTrace. He has also worked as a system administrator, performance consultant, and instructor, and he has taught DTrace worldwide including workshops that he authored. His software achievements include creating the DTrace IP, TCP, and UDP providers; the DTrace JavaScript provider; and the ZFS L2ARC. Many of Brendan’s DTrace scripts are shipped by default in Mac OS X.

Jim Mauro is a senior software engineer for Oracle Corporation. Jim works in the Systems group, with a primary focus on systems performance. Jim’s work includes internal performance-related projects, as well as working with Oracle customers on diagnosing performance issues on production systems. Jim has 30 years of experience in the computer industry, including 19 years with Sun Microsystems prior to the acquisition by Oracle. Jim has used DTrace extensively for his performance work since it was first introduced in Solaris 10 and has taught Solaris performance analysis and DTrace for many years. Jim coauthored the first and second editions of Solaris Internals (McDougall and Mauro, 2006) and Solaris Performance and Tools (McDougall, Mauro, and Gregg, 2006) and has written numerous articles and white papers on various aspects of Solaris performance and internals.

Hello stranger!

You need to Register an InfoQ account or to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread
Community comments

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Discuss

Educational Content

General Feedback
Bugs
Advertising
Editorial
InfoQ.com and all content copyright © 2006-2013 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.
Privacy policy
BT