Solaris Performance: Getting Started

These pages are the Swiss Army knife for everyone who needs to analyze and tune a Solaris system.

The intention is to provide freely available tools and commands that help make a quick assessment of the system.

This performance primer follows the standard process for analyzing performance. You first have to understand the whole stack, from the hardware through the operating system to the application. This foundation will allow you to narrow down your performance problem and fix it.

The nature of all performance problems is basically straightforward: an algorithm (the application) under a given load exhausts the resources underneath.

This primer allows everyone to determine which Solaris and hardware resources have become the bottleneck. You then have two options: reduce the demand on that resource (tune the application), or increase its supply (tune or extend the system).

The Structure of the Primer

Automate Sampling with dimSTAT

The second chapter of this Solaris performance primer deals with a must-have utility: dimSTAT. dimSTAT is a freely available monitoring tool that will monitor entire data centers while you are gone fishing...

dimSTAT is a tool for general and detailed performance analysis and monitoring of Solaris and Linux systems. It is a monitoring framework offering flight-recorder type functionality: a central site monitors a number of nodes for performance data and stores the results for easy display and post-processing.

You can download the software and documentation from: http://dimitrik.free.fr/

dimSTAT - Installation

dimSTAT installation is straightforward. Get hold of the latest distribution files and untar them into a directory of your choice:

 

cd /tmp

tar -xvf <path_to_dimstat>/dim_STAT-v81-sol86.tar

 

Then run the INSTALL.sh file with /bin/sh and follow the instructions:

root@soldevx> sh INSTALL.sh

 

===========================================

** Starting dim_STAT Server INSTALLATION **

===========================================

HOSTNAME: soldevx

IP: ::1

DOMAINE:

Is it correct? (y/n): n

** Hostname [soldevx]: localhost

** IP addres [::1]:

** Domainname []:


** Domainname []: .

**

** ATTENTION!

**

** On your host You have to assign a USER/GROUP pair as owner

** of all dim_STAT modules (default: dim/dim)

User: dim

Group: dim

Is it correct? (y/n): y

**

** WARNING!!!

**

** User dim (group dim) is not created on your host...

** You may do it now by yourself or let me do it during

** installation...

**

 

May I create this USER/GROUP on your host? (y/n): y

======================================

** dim_STAT Directory Configuration **

======================================

** WebX root directory (5MB):

=> /WebX

=> /opt/WebX

=> /etc/WebX

 

[/opt/WebX]:

 

** HOME directory for dim_STAT staff [/apps]: /export/home/dimstat

** TEMP directory : /opt/WebX

=> HOME directory : /export/home/dimstat

=> TEMP directory : /tmp

=> HTTP Server Port : 80

=> DataBase Server Port : 3306

=> Default STAT-service Port : 5000

Is it correct? (y/n): y

** WARNING!!!

** ALL DATA WILL BE DELETED IN: /export/home/dimstat/* !!!

** AS WELL /WebX, /etc/WebX, /opt/WebX !!!

Is it correct? (y/n): y

** Cleanup /export/home/dimstat

** Add User...

** WebX Setup...

** dim_STAT Server extract...

** HTTP Server Setup...

** Database Server Setup...

** ADMIN/Tools Setup...

** TEMP directory...

** Permissions...

** Crontab Setup...

Sun Microsystems Inc. SunOS 5.11 snv_79a January 2008

Warning - Invalid account: 'dim' not allowed to execute cronjobs

 

**

** INSTALLATION is finished!!!

**

May I create now a dim_STAT-Server start/stop script in /etc/rc*.d? (y/n): y

** ===================================================================

**

** You can start dim_STAT-Server now from /export/home/dimstat/ADMIN:

**

** # cd /export/home/dimstat/ADMIN

** # ./dim_STAT-Server start

**

** and access homepage via Web browser - http://localhost:80

**

** To collect stats from any Solaris-SPARC/x86 or Linux-x86 machines

** just install & start on them [STAT-service] package...

**

** Enjoy! ;-)

**

** -Dimitri

** ===================================================================

 

root@soldevx>

 

After installation, please proceed with installation of the STAT service on the nodes of your choice:

 

cd dimSTAT

pkgadd -d dimSTAT-Solx86.pkg

 

Note: the dimSTAT STAT service needs to be installed on all nodes that you want to monitor and record performance data from.

dimSTAT – Configuration

The following steps will guide you through a simple dimSTAT configuration.

First, open a browser and navigate to http://localhost. You should see a similar screen:

Then click on the dim_STAT Main Page link (Welcome!). A screen similar to the following should now appear:

Let's start a new collect. Click on the “Start New Collect” link:

Enter the information to start a new collect on the host named localhost. Click the “Continue” button to move to the next screen:

Select the monitoring options of your choice (for example: vmstat, mpstat, iostat, netstat, etc...). Finally click the “Start STAT(s) Collect Now!!!” button to start monitoring.

The simple dimSTAT configuration is now complete. dimSTAT will record the selected data into a local database.

dimSTAT – Analysis

The following steps will guide you through a simple dimSTAT analysis session.

First, open a browser and navigate to http://localhost/. You should see a similar screen:

Next click on the “Welcome!” link to proceed to the next screen:

Click on the “Analyze” button to start analyzing recorded data. You should see a similar screen:

Select “Single-Host Analyze” and click on the “Analyze” button to proceed to the next screen:

Select the second line (with ID = 2) and click on the “VM stat” button to proceed to the next screen:

You should see a similar screen (top part).

Scroll to the bottom of the screen and select the tick boxes “CPU Usr%”, “CPU Sys%”, “CPU Idle%”. Then click on “Start” button to display the results.

Et voilà! dimSTAT displays a nice little graph that shows the percentage of usr, sys and idle time for the selected system.

Thomas Bastian was a coauthor of an earlier version of this document. The earlier version of this page has been published in the "The Developers Edge" in 2009.

Explore your Solaris System

Explore your System: Solaris System Inventory

A Solaris system may have a single CPU or hundreds of them, a single disk or an entire disk farm. Anyone who deals with performance has to know the quantitative aspects of the system. The commands listed here answer these questions.

It's pivotal that you get an understanding of the components of the system you want to tune. Knowing the hardware components and the installed software allows you to understand the quantitative limits of the system.

Solaris offers a wealth of commands to identify the characteristics of the running system. The following chapter discusses commands that help administrators and software developers understand and document accurately the hardware and software specifications.

This document reflects the state of the art as of spring 2010.

  • SunOS 5.10 known as Solaris 10
  • SunOS 5.11 (build snv_111b) known through the distribution OpenSolaris 2009.06

Both operating system versions are very similar. Commands that don't work in both versions are tagged as such.

uname - Printing Information about the Current System

The uname utility prints information about the current system on the standard output: the operating system software revision level, the processor architecture and platform attributes.

The table below lists selected options to uname:

Option   Comments

-a       Prints basic information currently available from the system.
-s       Prints the name of the operating system.
-r       Prints the operating system release level.
-i       Prints the name of the platform.
-p       Prints the processor type or ISA (Instruction Set Architecture).

uname – selected options

uname – sample output
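For scripted checks, the same information is available from Python's standard library. A small sketch (the field names are Python's os.uname() attributes, mapped loosely onto the uname options):

```python
import os

# os.uname() exposes the fields that uname(1) prints:
# sysname (-s), release (-r), machine (comparable to -p/-i),
# plus nodename and version.
info = os.uname()

print("OS name :", info.sysname)   # e.g. SunOS on a Solaris system
print("Release :", info.release)   # e.g. 5.11
print("Machine :", info.machine)   # e.g. i86pc or sun4v
```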

/etc/release – Detailed Information about the Operating System

The file /etc/release contains detailed information about the operating system. The content provided allows engineering or support staff to unambiguously identify the Solaris release running on the current system.

/etc/release file – sample output

showrev - Show Machine, Software and Patch Revision (Solaris 10 and older)

The showrev command shows machine, software revision and patch revision information. With no arguments, showrev shows the system revision information including hostname, hostid, release, kernel architecture, application architecture, hardware provider, domain and kernel version.

showrev – machine and software revision

To list patches installed on the current system, use the showrev command with the -p argument.

showrev -p – patch information

This command no longer exists in newer SunOS 5.11 builds. The new packaging system (IPS) comes with a completely new set of commands.

pkg - IPS Packages (SunOS 5.11 only!)

List the installed packages with pkg list. The variant pkg list -a lists all packages, whether installed or not.

pkg list command

isainfo - describe instruction set architectures

The isainfo command describes instruction set architectures. The isainfo utility is used to identify various attributes of the instruction set architectures supported on the currently running system. It can answer whether 64-bit applications are supported, or if the running kernel uses 32-bit or 64-bit device drivers.

The table below lists selected options to isainfo:

Option   Comments

<none>   Prints the names of the native instruction sets for portable applications.
-n       Prints the name of the native instruction set used by portable applications.
-k       Prints the name of the instruction set(s) used by the operating system kernel components such as device drivers and STREAMS modules.
-b       Prints the number of bits in the address space of the native instruction set.

isainfo – selected options

isainfo – describe instruction set architectures
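The question `isainfo -b` answers (bits in the native address space) can also be derived from inside any process. A Python sketch illustrating the idea:

```python
import struct

# A native pointer's size in bits reveals whether this process runs
# with a 32-bit or a 64-bit address space -- the same question that
# `isainfo -b` answers for the native instruction set.
address_bits = struct.calcsize("P") * 8
print(address_bits)  # 32 or 64
```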

isalist - Display native Instruction Sets Executable on this Platform

The isalist command displays the native instruction sets executable on this platform. The names are space-separated and are ordered in the sense of best performance. Earlier-named instruction sets may contain more instructions than later-named instruction sets. A program that is compiled for an earlier-named instruction set will most likely run faster on this machine than the same program compiled for a later-named instruction set.

isalist – display native instruction sets

psrinfo - Display Information about Processors

The psrinfo command displays information about processors. Each physical processor may support multiple virtual processors. Each virtual processor is an entity with its own interrupt ID, capable of executing independent threads.

The table below lists selected options to psrinfo:

Option   Comments

<none>   Prints one line for each configured processor, displaying whether it is online, non-interruptible (designated by no-intr), spare, off-line, faulted or powered off, and when that status last changed.
-p       Prints the number of physical processors in a system.
-v       Verbose mode: prints additional information about the specified processors, including processor type, floating point unit type and clock speed. If any of this information cannot be determined, psrinfo displays unknown.

psrinfo – selected options

psrinfo – display processor information

prtdiag - Display System Diagnostic Information

The prtdiag command displays system diagnostic information. On Solaris 10 for x86/x64 systems, the command is only available with Solaris 10 01/06 or higher.

prtdiag – print system diagnostic information

prtconf - Print System Configuration

The prtconf command prints system configuration information. The output includes the total amount of memory, and the configuration of system peripherals formatted as a device tree.

prtconf – print system configuration

cpuinfo [Tools CD] – Display CPU Configuration

The cpuinfo utility prints detailed information about the CPU type and characteristics (number, type, clock and strands) of the running system.

cpuinfo – sample output

meminfo [Tools CD] - Display physical Memory, Swap Devices, Files

meminfo is a tool to display the configuration of physical memory and swap devices or files.

meminfo – sample output 

 


 

Understanding System Utilization

The following chapter introduces and demonstrates tools that help out in understanding overall system utilisation.

Tools discussed in this posting

uptime – Print Load Average

The easiest way to gain an overview of:

  • how long a system has been running,

  • the current CPU load averages, and

  • how many users are currently active

is with the command uptime.

uptime – print CPU load averages

The numbers printed to the right of “load average: ” are the 1-, 5- and 15-minute load averages of the system. The load average is a measure of the number of running and runnable threads, so it has to be put in relation to the number of active CPUs in the system. For example, a load average of three (3) on a single-CPU system would indicate some CPU overload, while the same load average on a thirty-two (32) way system would indicate a lightly loaded system.
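The rule of putting the load average in relation to the CPU count is easy to script. A Python sketch of the arithmetic (the 1.0 threshold is a rule of thumb, not a hard limit):

```python
import os

def per_cpu_load(load1, ncpus):
    """Normalize a load average by the number of active CPUs.

    Values well above 1.0 suggest runnable threads are queueing for
    CPU time; values well below 1.0 suggest spare capacity.
    """
    return load1 / ncpus

# The example from the text: a load average of 3 on 1 CPU vs. 32 CPUs.
print(per_cpu_load(3.0, 1))   # 3.0   -> overloaded single-CPU system
print(per_cpu_load(3.0, 32))  # ~0.09 -> mostly idle 32-way system

# Against the live system (os.getloadavg() exists on Solaris and Linux):
load1, load5, load15 = os.getloadavg()
print(per_cpu_load(load1, os.cpu_count()))
```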

perfbar [Tools CD] - A lightweight CPU Meter

perfbar is a tool that displays a single bar graph that color-codes system activity. The colors are as follows:

  • Blue = idle time.

  • Red = system time.

  • Green = user time.

  • Yellow = I/O wait (obsolete on Solaris 10 and later).

 

perfbar – sample output of a system with 16 CPU cores

perfbar has been enhanced in version 1.2 to provide better support for servers with many CPUs through a multi-line visualisation. See below.

perfbar: Visualisation of a Sun T5240 System with 128 strands (execution units) without any load

perfbar can be called without any command line arguments. It also provides a large number of options, which can be viewed with the -h option:

$ perfbar -h

perfbar 1.2

maintained by Ralph Bogendoerfer
based on the original perfbar by:
Joe Eykholt, George Cameron, Jeff Bonwick, Bob Larson

Usage: perfbar [X-options] [tool-options]
   supported X-options:
      -display <display> or -disp <display>
      -geometry <geometry> or -geo <geometry>
      -background <background> or -bg <background>
      -foreground <foreground> or -fg <foreground>
      -font <font> or -fn <font>
      -title <title> or -t <title>
      -iconic or -icon
      -decoration or -deco
   supported tool-options:
      -h, -H, -? or -help: this help
      -v or -V: verbose
      -r or -rows: number of rows to display, default 1
      -bw or -barwidth: width of CPU bar, default 12
      -bh or -barheight: height of CPU bar, default 180
      -i or -idle: idle color, default blue
      -u or -user: user color, default green
      -s or -system: system color, default red
      -w or -wait: wait color, default yellow
      -int or -interval: interval for display updates (in ms),default 100
      -si or -statsint: interval for stats updates (in display intervals), default 1
      -avg or -smooth: number of values for average calculation, default 8

There are also a number of key strokes understood by the tool:

  • Q or q: Quit

  • R or r: Resize - this changes the window to the default size according to the number of CPU bars, rows and the chosen bar width and height.

  • Number keys 1 - 9: Display this number of rows.

  • + and -: Increase or decrease number of rows displayed.

The tool is currently available as a beta of version 1.2. This latest version is not yet part of the Performance Tools CD 3.0. The engineers from the Sun Solution Center in Langen, Germany, have made it available for free through:

cpubar [Tools CD] - A CPU Meter, showing Swap, and Run Queue

cpubar displays one bar-graph for each processor with the processor speed(s) displayed on top. Each bar-graph is divided in four areas (top to bottom):

  • Blue - CPU is available.

  • Yellow - CPU is waiting for one or more I/O to complete (N/A on Solaris 10 and later).

  • Red - CPU is running in kernel space.

  • Green - CPU is running in user space.

As with netbar and iobar, a red and a dashed black & white marker shows the maximum and average used ratios respectively.

The bar-graphs labeled 'r', 'b' and 'w' display the run, blocked and wait queues. A non-empty wait queue is usually a symptom of a previous persistent RAM shortage. The total number of processes is displayed on top of these three bars.

The bar-graph labeled 'p/s' is displaying the process creation rate per second.

The bar-graph labeled 'RAM' is displaying the RAM usage (red=kernel, yellow=user, blue=free), the total RAM is displayed on top.

The bar-graph ('sr') is displaying (using a logarithmic scale) the scan rate (a high level of scans is usually a symptom of RAM shortage).

The bar-graph labeled 'SWAP' is displaying the SWAP (a.k.a Virtual Memory) usage (red=used, yellow=reserved, blue=free), the total SWAP space is displayed on top.

cpubar – sample output

vmstat – System Glimpse

The vmstat tool provides a glimpse of the current system behavior in a one line summary including both CPU utilisation and saturation.

In its simplest form, the command vmstat <interval> (i.e. vmstat 5) will report one line of statistics every <interval> seconds. The first line can be ignored as it is the summary since boot, all other lines report statistics of samples taken every <interval> seconds. The underlying statistics collection mechanism is based on kstat (see kstat(1)).
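Because the first line is the since-boot summary, any script that post-processes vmstat output has to drop it. A Python sketch with abbreviated, made-up sample text (a real vmstat report has more columns):

```python
# Average the idle column from a captured `vmstat 5` run.
# The sample is abbreviated and invented for illustration only.
sample = """\
 kthr      memory            page          cpu
 r b w   swap  free  re  mf pi po  us sy id
 0 0 0 100000 50000   5  10  0  0   2  1 97
 1 0 0  99000 49000   0  40  0  0  99  1  0
 1 0 0  99000 48000   0  38  0  0  98  2  0
"""

rows = sample.splitlines()[2:]   # drop the two header lines
rows = rows[1:]                  # drop the since-boot summary line
idle = [int(r.split()[-1]) for r in rows]
print(sum(idle) / len(idle))     # average idle% over the interval samples
```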

Let's run two copies of a CPU intensive application (cc_usr) and look at the output of vmstat 5. First start two (2) instances of the cc_usr program.

two (2) instances of cc_usr started

Now let's run vmstat 5 and watch its output.

vmstat – vmstat 5 report

First observe the cpu:id column which represents the system idle time (here 0%). Then look at the kthr:r column which represents the total number of runnable threads on dispatcher queues (here 1).

From this simple experiment, one can conclude that the system idle time for the five second samples was always 0, indicating 100% utilisation. On the other hand, kthr:r was mostly one and sustained indicating a modest saturation for this single CPU system (remember we launched two (2) CPU intensive applications).

A couple of notes with regard to CPU utilisation:

  • 100% utilisation may be fine for your system. Think about a high-performance computing job: the aim will be to maximise utilisation of the CPU.

  • Values of kthr:r greater than zero indicate some CPU saturation (i.e. more jobs would like to run but cannot because no CPU was available). However, performance degradation should be gradual.

  • Sampling interval is important: don't choose an interval that is too small or too large.

vmstat reports some additional information that can be interesting such as:

Column   Comments

in       Number of interrupts per second.
sys      Number of system calls per second.
cs       Number of context switches per second (both voluntary and involuntary).
us       Percent user time: time the CPUs spent processing user-mode threads.
sy       Percent system time: time the CPUs spent processing system calls on behalf of user-mode threads, plus the time spent processing kernel threads.
id       Percent of time the CPUs are idle, waiting for runnable threads.

mpstat - Report per-Processor or per-Processor Set Statistics

The mpstat command reports processor statistics in tabular form. Each row of the table represents the activity of one processor. The first table summarizes all activity since boot. Each subsequent table summarizes activity for the preceding interval. The output table includes:

Column   Comments

CPU      Processor ID.
minf     Minor faults (per second).
mjf      Major faults (per second).
xcal     Inter-processor cross-calls (per second).
intr     Interrupts (per second).
ithr     Interrupts as threads, not counting clock interrupt (per second).
csw      Context switches (per second).
icsw     Involuntary context switches (per second).
migr     Thread migrations to another processor (per second).
smtx     Spins on mutexes, lock not acquired on first try (per second).
srw      Spins on readers/writer locks, lock not acquired on first try (per second).
syscl    System calls (per second).
usr      Percent user time.
sys      Percent system time.
wt       Always 0.
idl      Percent idle time.

The reported statistics can be broken down into following categories:

  • Processor utilisation: see columns usr, sys and idl for a measure of CPU utilisation on each CPU.

  • System call activity: see syscl column for the number of system call per second on each CPU.

  • Scheduler activity: see column csw and column icsw. As the ratio icsw/csw comes closer to one (1), threads get preempted because of higher priority threads or expiration of their time quantum. Also the column migr displays the number of times the OS scheduler moves ready-to-run threads to an idle processor. If possible, the OS tries to keep the threads on the last processor on which it ran. If that processor is busy, the thread migrates.

  • Locking activity: column smtx indicates the number of mutex contention events in the kernel. Column srw indicates the number of reader-writer lock contention events in the kernel.
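The icsw/csw rule of thumb from the scheduler bullet above can be computed directly. A Python sketch with hypothetical per-CPU samples:

```python
def preemption_ratio(icsw, csw):
    """Fraction of context switches that were involuntary.

    As this ratio approaches 1, threads are being preempted
    (higher-priority work, or expired time quanta) rather than
    giving up the CPU voluntarily.
    """
    return icsw / csw if csw else 0.0

# Hypothetical (icsw, csw) pairs as mpstat would report them per CPU:
samples = [(5, 250), (240, 260), (10, 300)]
ratios = [round(preemption_ratio(i, c), 2) for i, c in samples]
print(ratios)  # [0.02, 0.92, 0.03] -- the second CPU is heavily preempted
```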

Now, consider the following sixteen-way (16) system used for test. This time four (4) instances of the cc_usr program were started and the output of vmstat 5 and mpstat 5 recorded.

Below, observe the output of processor information. Then the starting of the four (4) copies of the program and last the output of vmstat 5.

vmstat – vmstat 5 output on sixteen way system

Rightly, vmstat reports a user time of 25% because one-fourth (¼) of the system is used (remember 4 programs started, 16 available CPUs, i.e. 4/16 or 25%).

Now let's look at the output of mpstat 5.

mpstat – mpstat 5 sample output on sixteen way system

In the above output (two sets of statistics), one can clearly identify the four running instances of cc_usr on CPUs 1, 3, 5 and 11. All these CPUs are reported with 100% user time.

vmstat – Monitoring paging Activity

The vmstat command can also be used to report on system paging activity with the -p option. Using this form of the command, one can quickly see whether the system is paging because of file I/O (OK) or because of a physical memory shortage (BAD).

Use the command as follows: vmstat -p <interval in seconds> . The output format includes following information:

Column   Description

swap     Available swap space in Kbytes.
free     Amount of free memory in Kbytes.
re       Page reclaims - number of page reclaims from the cache list (per second).
mf       Minor faults - number of pages attached to an address space (per second).
fr       Page frees in Kbytes per second.
de       Calculated anticipated short-term memory shortfall in Kbytes.
sr       Scan rate - number of pages scanned by the page scanner per second.
epi      Executable page-ins in Kbytes per second.
epo      Executable page-outs in Kbytes per second.
epf      Executable page-frees in Kbytes per second.
api      Anonymous page-ins in Kbytes per second.
apo      Anonymous page-outs in Kbytes per second.
apf      Anonymous page-frees in Kbytes per second.
fpi      File system page-ins in Kbytes per second.
fpo      File system page-outs in Kbytes per second.
fpf      File system page-frees in Kbytes per second.

As an example of vmstat -p output, let's try the following command:

find / > /dev/null 2>&1

and then monitor paging activity with: vmstat -p 5

As can be seen from the output, the system is showing paging activity because of file system read I/O (column fpi).

vmstat – sample output reporting on paging activity

zonestat - [OpenSolaris.org] Monitoring Resource Consumption within Zones 

Jeff Victor developed an Open Source Perl script to measure utilization within zones. The tool is freely available for download on the OpenSolaris.org project pages.

It may be called with the following syntax:

 

zonestat [-l] [interval [count]]

The output looks like:

        |----Pool-----|------CPU-------|----------------Memory----------------|
        |---|--Size---|-----Pset-------|---RAM---|---Shm---|---Lkd---|---VM---|
Zonename| IT| Max| Cur| Cap|Used|Shr|S%| Cap| Use| Cap| Use| Cap| Use| Cap| Use
-------------------------------------------------------------------------------
  global  0D  66K    2       0.1   1 25      986M      139K  18E   2M  18E 754M
    db01  0D  66K    2       0.1   2 50   1G 122M 536M      536M    0   1G 135M
   web02  0D  66K    2 0.42  0.0   1 25 100M  11M  20M       20M    0 268M   8M

zonestat also allows you to monitor zone limits (caps).


 

Process Introspection

The next step in a performance analysis is to figure out what the application is doing. Configuring an application is one thing; checking whether the application actually picked up all of its configuration is another. The tools below tell you what your application is doing.

Solaris provides a large collection of tools to list and control processes. For an overview and detailed description, please refer to the manual pages of proc(1). The following chapter introduces the most commonly used commands.

pgrep – Find Processes by Name and other Attributes

The pgrep command finds processes by name and other attributes. For that, the pgrep utility examines the active processes on the system and reports the process IDs of the processes whose attributes match the criteria specified on the command line. Each process ID is printed as a decimal value and is separated from the next ID by a delimiter string, which defaults to a newline.

pgrep – find processes by name and other attributes
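The heart of pgrep — walk the process table and match an attribute — can be sketched in a few lines of Python. This sketch reads the Linux-style /proc/<pid>/cmdline text files; on Solaris, pgrep itself (or the binary procfs structures described in proc(4)) is the real interface:

```python
import os

def pgrep_sketch(pattern):
    """Tiny pgrep lookalike: PIDs whose command line contains `pattern`.

    Assumes the Linux-style /proc/<pid>/cmdline text format; Solaris
    stores this data in binary procfs files instead.
    """
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/cmdline" % entry, "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # process exited or is not readable
        if pattern in cmdline:
            pids.append(int(entry))
    return pids

print(pgrep_sketch("init"))  # PIDs whose command line mentions "init"
```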

pkill – Signal Processes by Name and other Attributes

The pkill command signals processes by name and other attributes. pkill functions identically to pgrep, except that each matching process is signaled as if by kill(1) instead of having its process ID printed. A signal name or number may be specified as the first command line option to pkill.

pkill – signal processes by name and other attributes

ptree - Print Process Trees

The ptree command prints the parent-child relationships of processes: it prints the process trees containing the specified pids or users, with child processes indented from their respective parent processes. An argument of all digits is taken to be a process ID; otherwise it is assumed to be a user login name. The default is all processes.

ptree – no options

sta [Tools CD] – Print Process Trees

The sta tool provides similar output to ptree. See example run below.

sta – sample output

pargs - Print Process Arguments, Environment, or auxiliary Vector

The pargs utility examines a target process or process core file and prints arguments, environment variables and values, or the process auxiliary vector.

pargs – sample output

pfiles – Report on open Files in Process

The pfiles command reports fstat(2) and fcntl(2) information for all open files in each process. In addition, a path to the file is reported if the information is available from /proc/pid/path. This is not necessarily the same name used to open the file. See proc(4) for more information.

pfiles – sample output

pstack – Print lwp/process Stack Trace

The pstack command prints a hex+symbolic stack trace for each process or specified lwps in each process.

Note: use jstack for java processes

pstack – sample output

jstack – Print Java Thread Stack Trace [see $JAVA_HOME/bin]

The jstack command prints Java stack traces of Java threads for a given Java process or core file or a remote debug server. For each Java frame, the full class name, method name, 'bci' (byte code index) and line number, if available, are printed.

jstack – sample output

pwdx – Print Process current Working Directory

The pwdx utility prints the current working directory of each process.

pwdx – sample output

pldd – Print Process dynamic Libraries

The pldd command lists the dynamic libraries linked into each process, including shared objects explicitly attached using dlopen(3C). See also ldd(1).

pldd – sample output

pmap - Display Information about the Address Space of a Process

The pmap utility prints information about the address space of a process. By default, pmap displays all of the mappings in the virtual address order they are mapped into the process. The mapping size, flags and mapped object name are shown.

pmap – default output

An extended output is available by adding the -x option (additional information about each mapping) and the -s option (additional HAT size information).

pmap – extended output
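The data pmap decodes can also be peeked at from inside a process. A Python sketch using the Linux-style /proc/self/maps text file (Solaris keeps the equivalent information in binary procfs files that pmap(1) reads):

```python
# List the first few mappings of the current process's address space.
# /proc/self/maps is the Linux text format; on Solaris the same data
# lives in binary procfs files decoded by pmap(1).
with open("/proc/self/maps") as f:
    mappings = f.readlines()

print(len(mappings), "mappings")
for line in mappings[:3]:
    # each line: address range, permissions, offset, dev, inode, path
    print(line.rstrip())
```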

showmem [Tools CD] – Process private and shared Memory usage

The showmem utility wraps around pmap and ps to determine how much private and shared memory a process is using.

showmem – sample output

plimit - Get or set the Resource Limits of running Processes

In the first form, the plimit utility prints the resource limits of running processes.

plimit – displaying process resource limits

 

In the second form, the plimit utility sets the soft (current) limit and/or the hard (maximum) limit of the indicated resource(s) in the processes identified by the process-ID list, pid. As an example, let's limit the current (soft) core file size of the trashapplet process with PID 897 to five (5) MB, using the command: plimit -c 5m,unlimited 897.

plimit – setting the current (soft) core file limit to 5 MB 
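The soft/hard distinction plimit manipulates is the same one exposed by setrlimit(2). A Python sketch that does for the calling process what `plimit -c 5m,unlimited <pid>` does for another (the clamp is needed because a soft limit may never exceed the hard limit):

```python
import resource

# Current core-file limits: (soft, hard), in bytes.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)

# Aim for a 5 MB soft limit, clamped to the hard limit.
target = 5 * 1024 * 1024
if hard != resource.RLIM_INFINITY:
    target = min(target, hard)

# Unlike plimit, setrlimit only acts on the calling process.
resource.setrlimit(resource.RLIMIT_CORE, (target, hard))
new_soft, new_hard = resource.getrlimit(resource.RLIMIT_CORE)
print(new_soft, new_hard)
```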


Process Monitoring with prstat

The following chapter takes a deeper look at the Solaris tool prstat(1), the all-round utility that helps in understanding system utilisation.

prstat – The All-round Utility

One of the most important and most widely used utilities found in Solaris is prstat (see prstat(1)). prstat gives fast answers to questions such as:

  • How much is my system utilized in terms of CPU and memory?

  • Which processes (or users, zones, projects, tasks) are utilizing my system?

  • How are processes/threads using my system (user bound, I/O bound)?

In its simplest form, the command prstat <interval> (i.e. prstat 2) will examine all processes and report statistics sorted by CPU usage.

prstat – prstat 2 command reporting on all processes and sorting by CPU usage

As can be seen from the screen capture, processes are ordered from top (highest) to bottom (lowest) according to their current CPU usage (in % - 100% means all system CPUs are fully utilized). For each process in the list, following information is printed:

  • PID: the process ID of the process.

  • USERNAME: the real user (login) name or real user ID.

  • SIZE: the total virtual memory size of the process, including all mapped files and devices, in kilobytes (K), megabytes (M), or gigabytes (G).

  • RSS: the resident set size of the process (RSS), in kilobytes (K), megabytes (M), or gigabytes (G).

  • STATE: the state of the process (cpuN/sleep/wait/run/zombie/stop).

  • PRI: the priority of the process. Larger numbers mean higher priority.

  • NICE: nice value used in priority computation. Only processes in certain scheduling classes have a nice value.

  • TIME: the cumulative execution time for the process.

  • CPU: The percentage of recent CPU time used by the process. If executing in a non-global zone and the pools facility is active, the percentage will be that of the processors in the processor set in use by the pool to which the zone is bound.

  • PROCESS: the name of the process (name of executed file).

  • NLWP: the number of lwps in the process.

The <interval> argument given to prstat is the sampling/refresh interval in seconds.
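For scripted use, prstat output lends itself to post-processing with standard tools. Below is a minimal, hypothetical sketch that sums the CPU column of a captured sample; the inlined lines only mimic prstat's default column layout, in practice you would pipe in real prstat output.

```shell
# Sum the CPU column (field 9) of a captured prstat sample. The sample
# lines are inlined for illustration; in practice you would pipe in
# something like `prstat 2 2`.
out=$(awk 'NR > 1 && $9 ~ /%$/ {   # skip the header line
               sub(/%/, "", $9)    # strip the trailing percent sign
               total += $9
           }
           END { printf "total CPU: %.1f%%", total }' <<'EOF'
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  2223 user1    1064K  752K cpu0     0    0  0:01:25  45% cc_usr/1
  2224 user1    1064K  752K run      0    0  0:01:24  44% cc_usr/1
   693 root       98M   36M sleep   59    0  0:00:51 1.8% Xorg/1
EOF
)
echo "$out"
```

On this hypothetical sample the total comes out at 90.8%; on a one-CPU system, a total near 100% indicates saturation.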

Special Report – Sorting

The prstat output can be sorted by a criterion other than CPU usage. Use the option -s (descending) or -S (ascending) with the criterion of choice (e.g. prstat -s time 2):

Criteria   Comments

cpu        Sort by process CPU usage. This is the default.

pri        Sort by process priority.

rss        Sort by resident set size.

size       Sort by size of process image.

time       Sort by process execution time.

Special Report – Continuous Mode

With the option -c to prstat, new reports are printed below previous ones, instead of overprinting them. This is especially useful when gathering information to a file (e.g. prstat -c 2 > prstat.txt). The option -n <number of output lines> can be used to set the maximum length of a report.

prstat – continuous report sorted by ascending order of CPU usage

Special Report – by users

With the option -a or -t to prstat, additional reports about users are printed.

prstat – prstat -a 2 reports by user

Special Report – by Zones

With the option -Z to prstat, additional reports about zones are printed.

Special Report – by Projects (see projects(1))

With the option -J to prstat, additional reports about projects are printed.

prstat – prstat -J 2 reports about projects

Special Report – by Tasks (see newtask(1))

With the option -T to prstat, additional reports about tasks are printed.

prstat – prstat -T 2 reports by tasks

Special Report – Microstate Accounting

Unlike other operating systems that gather CPU statistics every clock tick or every fixed time interval (typically every hundredth of a second), Solaris 10 incorporates a technology called microstate accounting that uses high-resolution timestamps to measure CPU statistics for every event, thus producing extremely accurate statistics.

The microstate accounting system maintains accurate time counters for threads as well as CPUs. Thread-based microstate accounting tracks several meaningful states per thread in addition to user and system time, which include trap time, lock time, sleep time and latency time. prstat reports the per-process (option -m) or per-thread (option -mL) microstates.
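Because the microstates partition a thread's time, the eight percentage columns of a prstat -m line (USR through LAT) should add up to about 100%. A quick, hypothetical sanity check over one inlined sample line (field positions assume the default prstat -m layout):

```shell
# Add up the eight microstate percentages (fields 3-10) of one sample line.
out=$(awk '{ s = $3 + $4 + $5 + $6 + $7 + $8 + $9 + $10
             printf "%s sums to %.0f%%", $15, s }' <<'EOF'
   693 root     1.8 0.4 0.0 0.0 0.0 0.0  98 0.0  52   2  1K  24 Xorg/1
EOF
)
echo "$out"
```

If a line sums to far less than 100%, the output was most likely truncated or misparsed.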

prstat – prstat -m 2 reports on process microstates

The screen output shown above displays microstates for the running system. Looking at the top line with PID 693, one can see that the process Xorg spends 1.8% of its time in userland and sleeps (98%) for the rest of the time. prstat – prstat -mL 2:

The screen output shown above displays per-thread microstates for the running system. Looking at the line with PID 1311 (display middle), one can see the microstates for LWP #9 and LWP #8 of the process firefox-bin.

prstat Usage Scenario – CPU Latency

One important measure of CPU saturation is the latency (LAT column) output of prstat. Let's once again start two (2) copies of our CPU intensive application.

prstat – observing latency with CPU intensive application

Now let's run prstat with microstate accounting reporting, i.e. prstat -m 2 and record the output:

prstat – prstat -m 2 output

Please observe the top two (2) lines of the output with PID 2223 and PID 2224. One can clearly see that both processes exhibit a high percentage of their time (50% and 52% respectively) in LAT microstate (CPU latency). The remaining time is spent in computation as expected (USR microstate). Clearly in this example, both CPU bound applications are fighting for the one CPU of the test system, resulting in high waiting times (latency) to gain access to a CPU.
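Scanning a recorded prstat -m log for starved processes can be automated by filtering on the LAT column. A hypothetical sketch (sample lines inlined; LAT is field 10 in the default prstat -m layout):

```shell
# Print processes whose LAT column (field 10) exceeds 20%.
out=$(awk 'NR > 1 && $10 + 0 > 20 { print $1, $15, $10 "% LAT" }' <<'EOF'
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
  2223 user1     48 0.1 0.0 0.0 0.0 0.0 0.0  50   0 187  1K   0 cc_usr/1
  2224 user1     46 0.1 0.0 0.0 0.0 0.0 0.0  52   0 171  1K   0 cc_usr/1
   693 root     1.8 0.4 0.0 0.0 0.0 0.0  98 0.0  52   2  1K  24 Xorg/1
EOF
)
echo "$out"
```

The 20% threshold is an arbitrary assumption; pick a value that suits your baseline.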

prstat Usage Scenario – High System Time

Let's run a system call intensive application and watch the output of prstat. First, start one instance of cc_sys:

prstat – system call intensive application

Then watch the prstat -m 2 output:

prstat – prstat -m 2 output for system call intensive application

Note the top line of the above output with PID 2310. One clearly identifies high system time usage (61%) for the process cc_sys. Also notice the high ratio of ICX/VCX (277/22), which shows that the process is frequently involuntarily context switched off the CPU.

prstat Usage Scenario – Excessive Locking

Frequently, poor scaling is observed for applications on multi-processor systems. One possible root cause is badly designed locking inside the application, resulting in a large amount of time spent waiting for synchronisation. The prstat column LCK reports the percentage of time spent waiting on user locks.

Let's look at an example with a sample program that implements a locking mechanism for a critical section using reader/writer locks. The program has four (4) threads seeking access to the shared critical region as readers while one thread accesses the critical section in writer mode. To exhibit the problem, the writer has been slowed down on purpose, so that it spends some time holding the critical section (effectively barring access for the readers).

First start the program such that the writer spends zero (0) microseconds in the critical region (ideal case).

cc_lck 0 – running in ideal conditions

Now let's observe the per-thread microstates. Use prstat -mL -p 2626 2 for this.

cc_lck 0 – prstat output

One can observe that all five (5) threads are fighting almost equally for computing resources. Since neither the readers nor the writer holds the critical section for long, there is no wait time registered.

Now let's restart the whole test with a writer wait time of ten (10) microseconds.

Again, let's observe the microstates. Use prstat -mL -p 2656 2 for this.

cc_lck 10 – prstat output

Now the picture looks different. The four (4) reader threads are spending 84% of their time waiting on the lock for the critical region. The writer (LWP #1), on the other hand, is spending most of its time sleeping (82%).
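The same per-thread log can be condensed mechanically. The hypothetical sketch below averages the LCK column (field 8 in the default prstat -mL layout) over the reader LWPs, excluding the writer (LWP #1); the inlined lines are illustrative, not real output.

```shell
# Average LCK over the reader threads of cc_lck (all LWPs except /1).
out=$(awk '$15 ~ /^cc_lck/ && $15 !~ /\/1$/ { lck += $8; n++ }
           END { printf "readers avg LCK: %.0f%%", lck / n }' <<'EOF'
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
  2656 user1     14 0.1 0.0 0.0 0.0  84 0.0 1.9 187   2  1K   0 cc_lck/5
  2656 user1     14 0.1 0.0 0.0 0.0  84 0.0 1.8 190   2  1K   0 cc_lck/4
  2656 user1     13 0.1 0.0 0.0 0.0  85 0.0 1.7 182   3  1K   0 cc_lck/3
  2656 user1     14 0.1 0.0 0.0 0.0  83 0.0 1.9 185   2  1K   0 cc_lck/2
  2656 user1     16 1.9 0.0 0.0 0.0 0.0  82 0.1 102   1  1K   0 cc_lck/1
EOF
)
echo "$out"
```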

While in the case of this sample application, the locking problems are obvious when looking at the source code, prstat microstate accounting capabilities can help pin-point locking weaknesses in larger applications.


Understanding IO

The following chapter takes a deeper look at understanding I/O behavior.

iobar [Tools CD] - Display I/O for Disk Devices Graphically

iobar displays two bar-graphs for each disk device. The read and write data rates are displayed in green in the left and right areas. The disk utilization is shown in the middle (red). At the bottom of the bars, input and output rates are displayed numerically; the value can be selected between last (green), average (white) and maximum (red) with the middle mouse button. The display mode can be toggled between logarithmic and linear with the left mouse button. In linear mode, scaling is automatic. All values are in bytes per second.

iobar – sample output

iotop [Tools CD] – Display iostat -x in a top-like Fashion

iotop is a binary that collects I/O statistics for disks, tapes, NFS-mounts, partitions (slices), SVM meta-devices and disk-paths. The display of those statistics can be filtered by device class or using regular expressions. Also the sorting order can be modified.

iotop – sample output

iostat – I/O Wizard

If you are looking to understand I/O behaviour on a running system, your first stop will be the command iostat. iostat gives fast answers to questions such as:

  • How much I/O is the system doing, in terms of input/output operations per second (IOPS) and throughput (MB/second)?

  • How busy are my I/O subsystems (latency and utilisation)?

In its simplest form, the command iostat -x <interval> (e.g. iostat -x 2) will examine all I/O channels and report statistics. See iostat -xc 2:

As can be seen from the screen capture, the iostat -x <interval> command will report device statistics every <interval> seconds. Every device is reported on a separate line and includes the following information:

  • device: device name

  • r/s: device reads per second, i.e. read IOPS.

  • w/s: device writes per second, i.e. write IOPS.

  • kr/s: kilobytes read per second.

  • kw/s: kilobytes written per second.

  • wait: average number of transactions waiting for service (queue length)

  • actv: average number of transactions actively being serviced (removed from the queue but not yet completed). This is the number of I/O operations accepted, but not yet serviced, by the device.

  • svc_t: average response time of transactions, in milliseconds. The svc_t output reports the overall response time, rather than the service time of a device. The overall time includes the time that transactions are in queue and the time that transactions are being serviced.

  • %w: percent of time there are transactions waiting for service (queue non-empty).

  • %b: percent of time the disk is busy (transactions in progress).

By adding the option -M to iostat, the report outputs megabytes instead of kilobytes.
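Totals across devices are easy to derive from a captured report. A hypothetical sketch that adds up r/s and w/s (fields 2 and 3 of the default iostat -x layout) from inlined sample lines:

```shell
# Total IOPS = sum of r/s + w/s over all devices (header skipped).
out=$(awk 'NR > 1 { iops += $2 + $3 }
           END { printf "total IOPS: %.1f", iops }' <<'EOF'
device    r/s    w/s    kr/s   kw/s wait actv  svc_t  %w  %b
cmdk0   198.2    1.1 25370.0    8.8  0.0  0.9    4.6   0  92
sd0       0.0    0.0     0.0    0.0  0.0  0.0    0.0   0   0
EOF
)
echo "$out"
```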

iostat Usage Scenario – Sequential I/O

Let's study the output of iostat when doing sequential I/O on the system. For that, become super-user and in a terminal window, start the command:

  • dd if=/dev/rdsk/c1d0s0 of=/dev/null bs=128k &

Then start the iostat command with iostat -xM 10 and watch the output. After a minute stop the iostat and dd processes.

As can be seen from the screen capture above, the disk in the test system can sustain a read throughput of just over 25 MB/second, with an average service time below 5 milliseconds.
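That figure is consistent with simple arithmetic: sequential throughput is roughly IOPS times transfer size. With the 128 KB blocks used by dd, about 200 IOPS yields about 25 MB/s (the IOPS figure here is an assumption for illustration):

```shell
# Throughput (MB/s) ~= IOPS x block size (KB) / 1024.
iops=200        # assumed request rate
bs_kb=128       # dd block size from the example above
mb_per_s=$(( iops * bs_kb / 1024 ))
echo "${mb_per_s} MB/s"
```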

iostat Usage Scenario – Random I/O

Let's study the output of iostat when doing random I/O on the system. For that, start the command:

  • find / >/dev/null 2>&1 &

Then start the iostat command with iostat -xM 10 and watch the output. After a minute stop the iostat and find processes.

iostat – random I/O

As can be seen from the screen capture above, the same disk in the test system delivers just less than 1 MB/second on random I/O.

Properly sizing an I/O subsystem is not a trivial exercise. One has to take into considerations factors like:

  • Number of I/O operations per second (IOPS)

  • Throughput in Megabytes per second (MB/s)

  • Service times (in milliseconds)

  • I/O pattern (sequential or random)

  • Availability of caching
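As a back-of-the-envelope illustration of these factors (all figures below are assumptions, not measured values), the required number of spindles is driven by whichever of the IOPS and throughput requirements is larger:

```shell
# Disks needed = max(ceil(req_iops/disk_iops), ceil(req_mbs/disk_mbs)).
req_iops=2000   # workload requirement: IOPS (assumed)
req_mbs=300     # workload requirement: MB/s (assumed)
disk_iops=150   # assumed per-disk random IOPS
disk_mbs=60     # assumed per-disk sequential MB/s
by_iops=$(( (req_iops + disk_iops - 1) / disk_iops ))   # ceiling division
by_mbs=$((  (req_mbs  + disk_mbs  - 1) / disk_mbs  ))
disks=$(( by_iops > by_mbs ? by_iops : by_mbs ))
echo "need at least ${disks} disks"
```

Caching and the sequential/random mix shift both per-disk figures considerably, which is why the factors above must be considered together.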

zpool iostat: iostat for zfs pools

ZFS comes with its own version of iostat. It is built into the zpool command, since I/O is a feature of the pool. The behavior is very similar to iostat. There are, however, fewer options:

zpool iostat [-T u | d] [-v] [pool] ... [interval [count]]

The -T option allows you to specify the time stamp format. The verbose option (-v) shows the I/O for each vdev device.

zpool iostat with a 10 second sample time

The verbose option is the option of choice in more complex environments:

zpool iostat with verbose option and 10s sampletime

iosnoop [DtraceToolkit] – Print Disk I/O Events

iosnoop is a program that prints disk I/O events as they happen, with useful details such as UID, PID, filename, command, etc. iosnoop measures disk events that have made it past system caches.

Let's study the output of iosnoop when doing random I/O on the system. For that, start the command:

  • find / >/dev/null 2>&1 &

Then start the iosnoop command and watch the output. After a minute stop the iosnoop and find processes.

iosnoop – sample output

iopattern [DtraceToolkit] – Print Disk I/O Pattern

iopattern prints details on the I/O access pattern for disks, such as percentage of events that were of a random or sequential nature. By default, totals for all disks are printed.

Let's study the output of iopattern when doing random I/O on the system. For that, start the command:

  • find / >/dev/null 2>&1 &

Then start the iopattern command and watch the output. After a minute stop the iopattern and find processes.

iopattern – sample output

iotop [DtraceToolkit] – Display top Disk I/O Events by Process

iotop prints details on the top I/O events by process.

Let's study the output of iotop when doing random I/O on the system. For that, start the command:

  • find / >/dev/null 2>&1 &

Then start the iotop command and watch the output. After a minute stop the iotop and find processes.

fsstat [Solaris 10+] – Report File System Statistics

fsstat reports kernel file operation activity by the file system type or by the path name, which is converted to a mount point. Please see the man page fsstat(1) for details on all options.

Let's study the output of fsstat when doing random I/O on the system. For that, start the command:

find / >/dev/null 2>&1 &

Then start the fsstat / 1 command and watch the output. After a minute stop the fsstat and find processes.

fsstat – sample output


Tracing

The following chapter introduces further tools and techniques for tracing at large.

DTrace is the Solaris framework everyone wants to use for queries of any depth. The technology is, however, rather complex to learn.

I limited the number of DTrace scripts to the ones in the DTrace Toolkit. The focus of this primer is to provide tools categorized by problem domain.

There are a number of visual tracing solutions built on top of DTrace.

  • Project D-Light 
  • Chime
  • Project fishworks as used in the Oracle ZFS appliance (not available on general purpose systems)

truss – First Stop Tool

truss is one of the most valuable tools in Solaris to understand various issues with applications. truss can help understand which files are read and written, which system calls are called and much more. Although truss is very useful, one has to understand that it is also quite intrusive on the applications traced and can therefore influence performance and timing.

A standard usage scenario for truss is to get a summary of system call activity of a process over a given window of time.

truss – system call summary

As can be seen from the output above, the find process issues large amounts of fstat(2), lstat(2), getdents(2) and fchdir(2) system calls. The getdents(2) system call consumes roughly 45% of the total system time (0.149 seconds of 0.331 seconds of total system time).
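The 45% figure quoted above is simply the ratio of the two times reported in the summary:

```shell
# getdents(2) share of system time: 0.149 s out of 0.331 s total.
out=$(awk 'BEGIN { printf "%.0f%%", 0.149 / 0.331 * 100 }')
echo "$out"
```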

Another standard usage scenario of truss, is to get a detailed view of the system calls issued by a given process.

truss – detailed system call activity

The output above shows truss giving out details about the system calls issued and their parameters. Further details can be obtained with the -v option. For example:

  • truss -v all -p <pid>

Yet another standard usage scenario is to restrict output of truss to certain system calls:

  • truss -t fstat -p <pid>

would limit the output to fstat(2) system call activity.

truss -t – sample output

Finally, combining the -t option with -v, one gets an output like this:

truss -t -v – sample output

plockstat – Report User-Level Lock Statistics

The plockstat utility gathers and displays user-level locking statistics. By default, plockstat monitors all lock contention events, gathers frequency and timing data about those events, and displays the data in decreasing frequency order, so that the most common events appear first. plockstat gathers data until the specified command completes or the process specified with the -p option completes. plockstat relies on DTrace to instrument a running process or a command it invokes to trace events of interest. This imposes a small but measurable performance overhead on the processes being observed. Users must have the dtrace_proc privilege and have permission to observe a particular process with plockstat.

Let's study the output of plockstat by running our sample reader/writer locking program cc_lck. First start cc_lck with the writer blocking for ten microseconds:

Then run the plockstat tool for ten seconds:

  • plockstat -A -e 10 -p <pid>

The output should be similar to the screen shot below. From the output, one can observe some contention on the reader/writer lock.

plockstat – sample output

pfilestat [DtraceToolkit] – Trace Time Spent in I/O

pfilestat prints I/O statistics for each file descriptor within a process. In particular, the time breakdown during read() and write() events is measured. This tool helps in understanding the impact of I/O on the process.

To study the output of pfilestat, let's start the following command as root:

  • dd if=/dev/rdsk/c1d0s0 of=/dev/null bs=1k &

Then in another window, let's start the pfilestat tool with the pid of the dd command as argument:

  • pfilestat <pid of dd command>

The output should be similar to the screen shot below:

pfilestat – sample output

pfilestat breaks down the process time into the percentage spent reading (read), writing (write), waiting for CPU (waitcpu), running on CPU (running), sleeping on read (sleep-r) and sleeping on write (sleep-w).

cputrack/cpustat – Monitor Process/System w/ CPU perf. counters

The cputrack utility allows CPU performance counters to be used to monitor the behavior of a process or family of processes running on the system. The cpustat utility allows CPU performance counters to be used to monitor the overall behavior of the CPUs in the system.

Using cputrack/cpustat requires intimate knowledge of the CPU and system under observation. Please consult the system/CPU documentation for details on the counters available. cpustat or cputrack with the -h option will list all available performance counters.

To observe the output of cputrack, let's run the tool with our sample program cc_usr.

Use the following command (all in one line):

  • cputrack -t -c pic0=FP_dispatched_fpu_ops,cmask0=0,umask0=0x7,pic1=FR_retired_x86_instr_w_excp_intr,cmask1=0 cc_usr

The output should look like this:

cputrack – sample output

In the above output, one can see that the cc_usr program executed roughly 600 million instructions per second with roughly 160 million floating point operations per second.


Understand the Network

The following chapter takes a deeper look at network utilisation.

netbar [Tools CD] - Display Network Traffic graphically

netbar displays two bar-graphs for each network interface.

  • The left one is the input bandwidth, the right one the output bandwidth.

  • The green area shows the used bandwidth and the blue area shows the available one.

On each bar-graph, a red marker shows the maximum bandwidth observed during the last period, and a dashed black & white marker shows the average bandwidth during the same period.

At the bottom of the bars, input and output rates are displayed numerically; the value can be selected between last (green), average (white) and maximum (red) with the middle mouse button. Between the bar-graphs, a white line displays the error rate while a red line displays the collision rate.

The display mode can be toggled between logarithmic and linear with the left mouse button. In linear mode, scaling is automatic.

A thin white line shows the reported maximum interface speed. If this line spans the whole two bars, the interface is in full-duplex mode, while if the line is limited to half of the bars, the interface is in half-duplex mode. All values are in bits per second.

netbar – sample output

netsum [Tools CD] – Displays Network Traffic

netsum is a netstat-like tool; however, its display output is in kilobytes per second, packets per second, errors, collisions, and multicast.

netsum – sample output

nicstat [Tools CD] - Print Statistics for Network Interfaces

nicstat prints statistics for the network interfaces such as kilobytes per second read and written, packets per second read and written, average packet size, estimated utilisation in percent and interface saturation.
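Such a utilisation estimate can be approximated by hand from the byte counters and the link speed. A hypothetical sketch (the rates and link speed are assumed figures; the exact formula nicstat uses may differ):

```shell
# Estimated utilisation: bits moved per second versus link capacity.
out=$(awk 'BEGIN {
    rkb  = 4000                    # assumed read KB/s
    wkb  = 2000                    # assumed write KB/s
    link = 100 * 1000 * 1000       # assumed 100 Mbit/s link
    bits = (rkb + wkb) * 1024 * 8  # KB/s -> bits/s
    printf "util: %.1f%%", bits / link * 100
}')
echo "$out"
```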

nicstat – sample output

netstat – Network Wizard

If you are looking to understand network behavior on a running system, your first stop may be the command netstat. netstat gives fast answers to questions such as:

  • How many TCP/IP sockets are open on my system?

  • Who communicates with whom? And with what parameters?

The netstat command has many options that will satisfy everyone's needs. Please refer to netstat(1) for details.

netstat Usage Scenario – List open Sockets

Often, one will want to look at the list of network sockets on a system. The netstat command delivers this type of information for the protocol TCP with the following command:

  • netstat -af inet -P tcp

If you are interested in the protocol UDP, replace tcp with udp, i.e.

  • netstat -af inet -P udp

As an example, let's run the following command and capture the output:

  • netstat -af inet -P tcp

netstat – sample output list TCP network sockets

The command outputs one line per socket in the system. The information includes:

  • Local Address: the local socket endpoint with interface and protocol port.

  • Remote Address: the remote socket endpoint with interface and protocol port.

  • Swind: sending window size in bytes.

  • Send-Q: sending queue size in bytes.

  • Rwind: receiving window size in bytes.

  • Recv-Q: receiving queue size in bytes.

  • State: protocol state (e.g. LISTEN, IDLE, TIME_WAIT, etc.).
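Such a listing is conveniently condensed with awk. The hypothetical sketch below tallies sockets per protocol state (sample lines inlined; the state is the last field of each socket line in real netstat output):

```shell
# Count sockets per TCP state from captured netstat output.
out=$(awk 'NF { count[$NF]++ }
           END { for (s in count) print s, count[s] }' <<'EOF'
      *.22                 *.*                0      0 49152      0 LISTEN
localhost.32771      localhost.5987     49152      0 49152      0 ESTABLISHED
localhost.5987       localhost.32771    49152      0 49152      0 ESTABLISHED
      *.111                *.*                0      0 49152      0 LISTEN
EOF
)
echo "$out" | sort
```

A sudden pile-up of TIME_WAIT or CLOSE_WAIT entries in such a tally is often the first hint of a connection-handling problem.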

tcptop/tcptop_snv [DtraceToolkit] – network "top"

tcptop (Solaris 10) and tcptop_snv (OpenSolaris) display top TCP network packets by process. To do so, the tool analyses TCP network packets and prints the responsible PID and UID, plus standard details such as IP address and port. The utility can help identify which processes are causing TCP traffic.

You can start the tool with the command tcptop on Solaris 10 and tcptop_snv on OpenSolaris. Let's study the output of tcptop_snv. For that, start tcptop_snv in one window and in another one, generate some network traffic with the command:

  • scp /kernel/genunix localhost:/tmp

The output should be similar to this screen:

tcptop_snv – sample output

tcpsnoop/tcpsnoop_snv [DtraceToolkit] – Network Snooping

tcpsnoop (Solaris 10) and tcpsnoop_snv (OpenSolaris) snoop TCP network packets by process. The tool operates in a similar way to tcptop and tcptop_snv; however, information is displayed continuously.

You can start the tool with the command tcpsnoop on Solaris 10 and tcpsnoop_snv on OpenSolaris. Let's study the output of tcpsnoop_snv. For that, start tcpsnoop_snv in one window and in another one, generate some network traffic with the command:

  • scp /kernel/genunix localhost:/tmp

The output should be similar to this screen:

tcpsnoop_snv – sample output

nfsstat – NFS statistics

nfsstat displays statistical information about the NFS and RPC (Remote Procedure Call) interfaces to the kernel. It can be used to view client and/or server side statistics broken down by NFS version (2, 3 or 4).
