Understanding System Utilization

This chapter introduces and demonstrates tools that help in understanding overall system utilisation.

Tools discussed in this posting

uptime
perfbar
cpubar
vmstat
mpstat
zonestat

uptime – Print Load Average

The easiest way to get an overview of:

how long a system has been running,
the current CPU load averages,
how many users are active,

is the uptime command.

uptime – print CPU load averages

The numbers printed to the right of “load average:” are the 1-, 5- and 15-minute load averages of the system. The load average is a measure of the number of runnable and running threads, so it has to be interpreted relative to the number of active CPUs in the system. For example, a load average of three (3) on a single-CPU system would indicate some CPU overloading, while the same load average on a thirty-two (32) way system would indicate an unloaded system.
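The relation between load average and CPU count can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are invented for this example, and the sample line uses made-up values rather than output captured from a real system.

```python
import re

def parse_load_averages(uptime_line):
    """Return the 1-, 5- and 15-minute load averages found in an uptime line."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load averages found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())

def load_per_cpu(load_average, active_cpus):
    """Normalise a load average by the number of active CPUs."""
    return load_average / active_cpus

# Hypothetical uptime line (illustrative values only).
sample = "10:15am  up 12 day(s),  3:04,  2 users,  load average: 3.00, 2.50, 1.75"
one_min, five_min, fifteen_min = parse_load_averages(sample)

print(load_per_cpu(one_min, 1))   # single-CPU system: well above 1 per CPU
print(load_per_cpu(one_min, 32))  # 32-way system: far below 1 per CPU
```

A per-CPU value well above one suggests threads are queuing for a CPU, while a value far below one suggests spare capacity.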

perfbar [Tools CD] - A lightweight CPU Meter

perfbar is a tool that displays a single bar graph that color codes system activity. The colors are as follows:

Blue = System idle.
Red = System time.
Green = User time.
Yellow = I/O wait (obsolete on Solaris 10 and later).

perfbar – sample output of a system with 16 CPU cores

perfbar was enhanced in version 1.2 to provide better support for servers with many CPUs through a multi-line visualisation. See below.

perfbar: Visualisation of a Sun T5240 System with 128 strands (execution units) without any load

perfbar can be called without specifying any command line arguments. perfbar provides a large number of options which can be viewed with the -h option:

$ perfbar -h

perfbar 1.2

maintained by Ralph Bogendoerfer
based on the original perfbar by:
Joe Eykholt, George Cameron, Jeff Bonwick, Bob Larson

Usage: perfbar [X-options] [tool-options]
   supported X-options:
      -display <display> or -disp <display>
      -geometry <geometry> or -geo <geometry>
      -background <background> or -bg <background>
      -foreground <foreground> or -fg <foreground>
      -font <font> or -fn <font>
      -title <title> or -t <title>
      -iconic or -icon
      -decoration or -deco
   supported tool-options:
      -h, -H, -? or -help: this help
      -v or -V: verbose
      -r or -rows: number of rows to display, default 1
      -bw or -barwidth: width of CPU bar, default 12
      -bh or -barheight: height of CPU bar, default 180
      -i or -idle: idle color, default blue
      -u or -user: user color, default green
      -s or -system: system color, default red
      -w or -wait: wait color, default yellow
      -int or -interval: interval for display updates (in ms),default 100
      -si or -statsint: interval for stats updates (in display intervals), default 1
      -avg or -smooth: number of values for average calculation, default 8

There are also a number of key strokes understood by the tool:

Q or q: Quit
R or r: Resize - this changes the window to the default size according to the number of CPU bars, rows and the chosen bar width and height.
Number keys 1 - 9: Display this number of rows.
+ and -: Increase or decrease number of rows displayed.

The tool is currently available as a beta in version 1.2. This latest version is not yet part of the Performance Tools CD 3.0. The engineers from the Sun Solution Center in Langen/Germany made it available for free through:

cpubar [Tools CD] - A CPU Meter, showing Swap, and Run Queue

cpubar displays one bar-graph for each processor with the processor speed(s) displayed on top. Each bar-graph is divided into four areas (top to bottom):

Blue - CPU is available.
Yellow - CPU is waiting for one or more I/O to complete (N/A on Solaris 10 and later).
Red - CPU is running in kernel space.
Green - CPU is running in user space.

As with netbar and iobar, a red and a dashed black & white marker shows the maximum and average used ratios respectively.

The bar-graphs labeled 'r', 'b' and 'w' display the run, blocked and wait queues. A non-empty wait queue is usually a symptom of a previous persistent RAM shortage. The total number of processes is displayed on top of these three bars.

The bar-graph labeled 'p/s' displays the process creation rate per second.

The bar-graph labeled 'RAM' displays the RAM usage (red=kernel, yellow=user, blue=free); the total RAM is displayed on top.

The bar-graph labeled 'sr' displays the scan rate on a logarithmic scale (a high scan rate is usually a symptom of RAM shortage).

The bar-graph labeled 'SWAP' displays the SWAP (a.k.a. Virtual Memory) usage (red=used, yellow=reserved, blue=free); the total SWAP space is displayed on top.

cpubar – sample output

vmstat – System Glimpse

The vmstat tool provides a glimpse of the current system behavior in a one line summary including both CPU utilisation and saturation.

In its simplest form, the command vmstat <interval> (e.g. vmstat 5) will report one line of statistics every <interval> seconds. The first line can be ignored as it is the summary since boot; all other lines report statistics of samples taken every <interval> seconds. The underlying statistics collection mechanism is based on kstat (see kstat(1)).

Let's run two copies of a CPU intensive application (cc_usr) and look at the output of vmstat 5. First start two (2) instances of the cc_usr program.

two (2) instances of cc_usr started

Now let's run vmstat 5 and watch its output.

vmstat – vmstat 5 report

First observe the cpu:id column which represents the system idle time (here 0%). Then look at the kthr:r column which represents the total number of runnable threads on dispatcher queues (here 1).

From this simple experiment, one can conclude that the system idle time for the five-second samples was always 0, indicating 100% utilisation. In addition, kthr:r was mostly one and sustained, indicating modest saturation for this single-CPU system (remember we launched two (2) CPU-intensive applications).

A couple of notes with regard to CPU utilisation:

100% utilisation may be fine for your system. Think about a high-performance computing job: the aim will be to maximise utilisation of the CPU.
Values of kthr:r greater than zero indicate some CPU saturation (i.e. more jobs would like to run but cannot because no CPU is available). However, performance degradation should be gradual.
The sampling interval is important. Very short intervals produce noisy readings and add measurement overhead, while very long intervals average out short bursts of activity.
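The reading of cpu:id and kthr:r described above can be sketched as follows. This is a minimal illustration, not a full vmstat parser: it assumes the default vmstat layout in which kthr:r is the first field and cpu:id the last, the function names are invented for this example, and the sample line contains made-up values.

```python
def summarize_vmstat_line(line):
    """Extract kthr:r (first field) and cpu:id (last field) from a data line."""
    fields = line.split()
    return {"runnable": int(fields[0]), "idle_pct": int(fields[-1])}

def describe(sample):
    """Return (utilisation in percent, whether the run queue is non-empty)."""
    utilisation = 100 - sample["idle_pct"]
    saturated = sample["runnable"] > 0
    return utilisation, saturated

# Hypothetical vmstat data line (illustrative values only).
sample_line = " 1 0 0 1047976 512340 0 2 0 0 0 0 0 0 0 0 0 412 180 210 99 1 0"

utilisation, saturated = describe(summarize_vmstat_line(sample_line))
print("utilisation: %d%%, saturated: %s" % (utilisation, saturated))
```

Remember to skip the first data line that vmstat prints, since it is the summary since boot.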

vmstat reports some additional information that can be interesting such as:

Column	Comments
in	Number of interrupts per second.
sys	Number of system calls per second.
cs	Number of context switches per second (both voluntary and involuntary).
us	Percent user time: time the CPUs spent processing user-mode threads.
sy	Percent system time: time the CPUs spent processing system calls on behalf of user-mode threads, plus the time spent processing kernel threads.
id	Percent of time the CPUs are waiting for runnable threads.

mpstat - Report per-Processor or per-Processor Set Statistics

The mpstat command reports processor statistics in tabular form. Each row of the table represents the activity of one processor. The first table summarizes all activity since boot. Each subsequent table summarizes activity for the preceding interval. The output table includes:

Column	Comments
CPU	Prints processor ID.
minf	Minor faults (per second).
mjf	Major faults (per second).
xcal	Inter-processor cross-calls (per second).
intr	Interrupts (per second).
ithr	Interrupts as threads (not counting clock interrupt) (per second).
csw	Context switches (per second).
icsw	Involuntary context switches (per second).
migr	Thread migrations (to another processor) (per second).
smtx	Spins on mutexes (lock not acquired on first try) (per second).
srw	Spins on readers/writer locks (lock not acquired on first try) (per second).
syscl	System calls (per second).
usr	Percent user time.
sys	Percent system time.
wt	Always 0.
idl	Percent idle time.

The reported statistics can be broken down into the following categories:

Processor utilisation: see columns usr, sys and idl for a measure of CPU utilisation on each CPU.
System call activity: see the syscl column for the number of system calls per second on each CPU.
Scheduler activity: see columns csw and icsw. As the ratio icsw/csw approaches one (1), a growing share of context switches are preemptions, caused by higher-priority threads or by the expiration of a thread's time quantum. The migr column shows the number of times the OS scheduler moves ready-to-run threads to an idle processor. If possible, the OS tries to keep a thread on the last processor on which it ran. If that processor is busy, the thread migrates.
Locking activity: column smtx indicates the number of mutex contention events in the kernel. Column srw indicates the number of reader-writer lock contention events in the kernel.
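The icsw/csw ratio mentioned under scheduler activity is simple to compute from one mpstat row. The sketch below is illustrative only: the function name is invented for this example and the input values are made up.

```python
def involuntary_switch_ratio(csw, icsw):
    """Fraction of context switches that were involuntary (0.0 to 1.0)."""
    if csw == 0:
        # No context switches in this interval, so nothing was preempted.
        return 0.0
    return icsw / csw

# Hypothetical per-second values for one CPU (illustrative only).
print(involuntary_switch_ratio(csw=200, icsw=180))  # close to 1: mostly preemption
print(involuntary_switch_ratio(csw=200, icsw=10))   # close to 0: mostly voluntary
```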

Now, consider the following sixteen-way (16) system used for testing. This time four (4) instances of the cc_usr program were started and the output of vmstat 5 and mpstat 5 recorded.

Below, observe first the processor information, then the start of the four (4) copies of the program, and finally the output of vmstat 5.

vmstat – vmstat 5 output on sixteen way system

Rightly, vmstat reports a user time of 25% because one-fourth (¼) of the system is used (remember 4 programs started, 16 available CPUs, i.e. 4/16 or 25%).
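The arithmetic behind that figure can be written down as a small sketch. It assumes each program is a single CPU-bound thread that keeps one CPU fully busy; the function name is invented for this example.

```python
def expected_user_percent(busy_threads, cpus):
    """System-wide user time if each busy thread saturates one CPU."""
    # At most one busy thread can run on each CPU at any moment.
    running = min(busy_threads, cpus)
    return 100.0 * running / cpus

print(expected_user_percent(4, 16))  # the sixteen-way experiment
print(expected_user_percent(2, 1))   # the earlier single-CPU experiment
```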

Now let's look at the output of mpstat 5.

mpstat – mpstat 5 sample output on sixteen way system

In the above output (two sets of statistics), one can clearly identify the four running instances of cc_usr on CPUs 1, 3, 5 and 11. All these CPUs are reported with 100% user time.
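Picking out the busy processors from the usr column can be sketched as follows. The per-CPU values here are assumptions chosen to match the description (CPUs 1, 3, 5 and 11 at 100% user time, all others set to 0 for simplicity), and the function name and threshold are invented for this example.

```python
def busy_cpus(usr_by_cpu, threshold=90):
    """Return the IDs of CPUs whose user time is at or above the threshold."""
    return sorted(cpu for cpu, usr in usr_by_cpu.items() if usr >= threshold)

# Assumed usr values for a sixteen-way system (illustrative only).
usr_by_cpu = {cpu: 0 for cpu in range(16)}
for cpu in (1, 3, 5, 11):
    usr_by_cpu[cpu] = 100

print(busy_cpus(usr_by_cpu))
# Averaging over all CPUs gives the system-wide figure that vmstat reports.
print(sum(usr_by_cpu.values()) / len(usr_by_cpu))
```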

vmstat – Monitoring Paging Activity

The vmstat command can also be used to report on system paging activity with the -p option. Using this form of the command, one can quickly get a clear picture of whether the system is paging because of file I/O (OK) or because of a physical memory shortage (BAD).

Use the command as follows: vmstat -p <interval in seconds>. The output format includes the following information:

Column	Description
swap	Available swap space in Kbytes.
free	Amount of free memory in Kbytes.
re	Page reclaims - number of page reclaims from the cache list (per second).
mf	Minor faults - number of pages attached to an address space (per second).
fr	Page frees in Kbytes per second.
de	Calculated anticipated short-term memory shortfall in Kbytes.
sr	Scan rate - number of pages scanned by the page scanner per second.
epi	Executable page-ins in Kbytes per second.
epo	Executable page-outs in Kbytes per second.
epf	Executable page-frees in Kbytes per second.
api	Anonymous page-ins in Kbytes per second.
apo	Anonymous page-outs in Kbytes per second.
apf	Anonymous page-frees in Kbytes per second.
fpi	File system page-ins in Kbytes per second.
fpo	File system page-outs in Kbytes per second.
fpf	File system page-frees in Kbytes per second.

As an example of vmstat -p output, let's try the following command:

find / > /dev/null 2>&1

and then monitor paging activity with: vmstat -p 5

As can be seen from the output, the system is showing paging activity because of file system read I/O (column fpi).

vmstat – sample output reporting on paging activity
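The distinction between paging caused by file I/O and paging caused by memory shortage can be sketched as a simple classifier. The rule used here is an assumed rule of thumb (scanner activity or anonymous page-outs point to memory shortage), not something vmstat reports itself; the function name is invented and the sample values are made up.

```python
def classify_paging(sample):
    """Classify one vmstat -p sample, given as a dict keyed by column name."""
    # Assumed rule of thumb: a non-zero scan rate or anonymous page-outs
    # suggest the system is short of physical memory.
    if sample.get("sr", 0) > 0 or sample.get("apo", 0) > 0:
        return "memory shortage"
    # File system paging alone is expected during ordinary file I/O.
    if any(sample.get(col, 0) > 0 for col in ("fpi", "fpo", "fpf")):
        return "file I/O"
    return "no paging"

# Hypothetical sample taken while find is running (illustrative values only).
find_run = {"sr": 0, "api": 0, "apo": 0, "fpi": 1200, "fpo": 0, "fpf": 0}
print(classify_paging(find_run))
```

A single sample is weak evidence either way, so it is worth looking at several consecutive intervals before concluding that memory is short.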

zonestat - [OpenSolaris.org] Monitoring Resource Consumption within Zones

Jeff Victor developed an Open Source Perl script to measure utilization within zones. The tool is freely available for download on the OpenSolaris.org project pages.

It may be called with the following syntax:

zonestat [-l] [interval [count]]

The output looks like:

        |----Pool-----|------CPU-------|----------------Memory----------------|
        |---|--Size---|-----Pset-------|---RAM---|---Shm---|---Lkd---|---VM---|
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap| Use| Cap| Use| Cap| Use| Cap| Use
-------------------------------------------------------------------------------
  global  0D  66K    2       0.1   1 25      986M      139K  18E   2M  18E 754M
    db01  0D  66K    2       0.1   2 50   1G 122M 536M      536M    0   1G 135M
   web02  0D  66K    2 0.42  0.0   1 25 100M  11M  20M       20M    0 268M   8M

zonestat can also be used to monitor zone limits (caps).
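In the sample output above, the S% column appears to be each zone's portion of the total CPU shares (Shr) in the pool. That reading is an assumption based on the numbers shown, and the sketch below only reproduces the arithmetic; the function name is invented for this example.

```python
def share_percentages(shares):
    """Convert per-zone CPU shares into rounded percentages of the total."""
    total = sum(shares.values())
    return {zone: round(100 * value / total) for zone, value in shares.items()}

# Shr values taken from the sample output above.
shares = {"global": 1, "db01": 2, "web02": 1}
print(share_percentages(shares))
```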

Thomas Bastian was a coauthor of an earlier version of this document. The earlier version of this page was published in "The Developers Edge" in 2009.
