OSWatcher Startup/Restart On Exadata

When the question of what starts OSWatcher (OSW) on Exadata was raised at a client site I thought I’d take a quick look. It took me a little longer than I expected to work out the detail and therefore it seems worth sharing.

If you’re simply looking to change the “snapshot interval”, “archive retention” or “compression command”, then /opt/oracle.cellos/validations/init.d/oswatcher is the file to modify. In it you’ll find a line with ./startOSW.sh X Y Z, where X is the snapshot interval, Y is the archive retention and Z is the command used to compress the output files.
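If you want to see the current values before editing anything, something like the following should do. It is only a sketch: it assumes the startOSW.sh call sits on a single uncommented line with its three arguments separated by whitespace, and show_osw_settings and OSW_INIT are names I made up for the example.

```shell
#!/bin/sh
# Print the three arguments passed to startOSW.sh in the OSWatcher init script.
# OSW_INIT defaults to the path named in the post; override it to look elsewhere.
OSW_INIT="${OSW_INIT:-/opt/oracle.cellos/validations/init.d/oswatcher}"

show_osw_settings() {
    # $1 = file to inspect
    # Drop commented-out lines, then report the first startOSW.sh call found.
    grep -v '^[[:space:]]*#' "$1" |
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /startOSW\.sh$/) {
                print "snapshot interval: " $(i+1)
                print "archive retention: " $(i+2)
                print "compression command: " $(i+3)
                exit
            }
        }
    }'
}

# Only run against the real file when it exists (i.e. on an Exadata node).
if [ -r "$OSW_INIT" ]; then
    show_osw_settings "$OSW_INIT"
fi
```

Take a copy of the file before changing it, since it lives under a directory the Exadata software manages.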

If you’re curious to know the details of what starts and restarts OSWatcher then read on.

The following applies to the X2-2 I regularly get my hands on. I don’t know if things change with later versions, so apologies if this isn’t applicable to your Exadata environment.

Startup of OSWatcher on boot is indirectly handled by /etc/init.d/rc.local, which includes:

########### BEGIN DO NOT REMOVE Added by Oracle Exadata ###########
if [ -x /etc/rc.d/rc.Oracle.Exadata ]; then
  . /etc/rc.d/rc.Oracle.Exadata
fi
########### END DO NOT REMOVE Added by Oracle Exadata ###########

/etc/rc.d/rc.Oracle.Exadata includes:

# Perform validations step
/opt/oracle.cellos/vldrun -all

The main purpose of /opt/oracle.cellos/vldrun and the Perl script /opt/oracle.cellos/validations/bin/vldrun.pl appears to be ensuring configuration changes are made on initial boot and after upgrades, although I haven’t looked into all the detail yet. The part of /opt/oracle.cellos/vldrun that is relevant in the context of starting OSWatcher on every boot is:

$VLDRUN_PL -quiet "$@"

This executes /opt/oracle.cellos/validations/bin/vldrun.pl with the -quiet and -all arguments (-all being what was passed to /opt/oracle.cellos/vldrun).

The “quiet” argument is pretty obvious and a little reading reveals that “all” simply means that all scripts in /opt/oracle.cellos/validations/init.d/ should be executed.

So off to /opt/oracle.cellos/validations/init.d/ we go:

[root@my-host ~]# ls -1 /opt/oracle.cellos/validations/init.d/
[root@my-host ~]#

… and in oswatcher, as already mentioned in the second paragraph of the post, you’ll find ./startOSW.sh X Y Z, where X is the snapshot interval, Y is the archive retention and Z is the compression command used to compress the output files.

OK, so that’s what starts OSWatcher on boot, but you should also know that OSWatcher is restarted daily by /etc/cron.daily/cellos, which includes:

/opt/oracle.cellos/validations/bin/vldrun.pl -script oswatcher > /dev/null 2>&1
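To check whether your own system follows the same chain, you can grep each file for the line quoted above. This is a rough sketch rather than anything official: the paths and strings are the ones from this post, while check_link, trace_osw_chain and the OSW_ROOT prefix are names invented for the example.

```shell
#!/bin/sh
# Check each link in the OSWatcher start/restart chain described in this post.

check_link() {
    # $1 = file to search, $2 = fixed string the post says it contains
    if grep -qF -- "$2" "$1" 2>/dev/null; then
        echo "OK      $1 references $2"
    else
        echo "MISSING $1 does not reference $2 (or is unreadable)"
        return 1
    fi
}

trace_osw_chain() {
    # $1 = optional root prefix, handy when inspecting a copy of the files
    root="${1:-}"
    rc=0
    check_link "$root/etc/init.d/rc.local" "rc.Oracle.Exadata" || rc=1
    check_link "$root/etc/rc.d/rc.Oracle.Exadata" "/opt/oracle.cellos/vldrun -all" || rc=1
    check_link "$root/opt/oracle.cellos/vldrun" "-quiet" || rc=1
    check_link "$root/opt/oracle.cellos/validations/init.d/oswatcher" "startOSW.sh" || rc=1
    check_link "$root/etc/cron.daily/cellos" "vldrun.pl -script oswatcher" || rc=1
    return $rc
}

# Leave OSW_ROOT unset on a live host.
trace_osw_chain "${OSW_ROOT:-}" || echo "chain differs from the one described here"
```

A MISSING line does not mean OSWatcher is broken, only that the software version on that host does things differently from the X2-2 I looked at.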

The only bit of all this that doesn’t really sit right with me is that OSWatcher is included with “validations”. That doesn’t seem like an appropriate description to me.

Trivial as it may be, I hope that later versions of the Exadata software move from what is described above to the “service” based approach used on non-Exadata platforms and documented in How To Start OSWatcher Black Box Every System Boot [ID 580513.1]. This feels like a much more standard approach and allows control of the service using the /sbin/service and /sbin/chkconfig commands.

Production Support Tips & Tricks #1 – Collecting Log Data

Early this year (2012) I started working on a presentation, my first, that I hoped to submit to UKOUG. The thrust of the presentation was to be tips on making your experiences with Oracle Support more pleasant, and on keeping your support analyst busy rather than yourself. A prospective title was “with Support like this who needs enemies” – perhaps that’s a bit strong ;-). Several things conspired to make it unlikely I would get to present it, so I faltered and things ground to a halt. After a period of inactivity I have decided to convert it into a short series of blog posts. This is the first. Part 2 is here – “Production Support Tips & Tricks #2 – SQL Trace”.

This post contains some advice for collecting log data when raising SRs. It’s mostly obvious but hopefully not to all.

ADR Package

You already know so I’m not going to waste my breath.

Get everything packaged up, not just the trace files you think Oracle need. Avoids repeat requests.

Well covered by others so I’m not going near it:
John Hallas quality UKOUG presentation
Uwe Hesse’s super blog entry

Nah – see above

Not related to diagnostic collection but listener targets don’t auto purge so your housekeeping scripts need to make calls to adrci to force a purge.
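A housekeeping script along these lines is one way to do it. Treat it as a sketch: it relies on adrci’s “show homes”, “set homepath” and “purge -age” commands (the age is in minutes) and on listener homes living under diag/tnslsnr, while the 30 day retention and the names purge_listener_homes, ADRCI and PURGE_AGE_MIN are my own choices for the example.

```shell
#!/bin/sh
# Purge old content from listener ADR homes, which are not purged automatically.
ADRCI="${ADRCI:-adrci}"
PURGE_AGE_MIN="${PURGE_AGE_MIN:-43200}"   # assumption: keep 30 days; adrci takes minutes

purge_listener_homes() {
    # "show homes" prints a header followed by one ADR home per line;
    # keep only the listener homes.
    "$ADRCI" exec="show homes" | grep 'diag/tnslsnr/' |
    while read -r home; do
        echo "purging $home"
        # stdin is redirected so adrci cannot swallow the rest of the home list
        "$ADRCI" exec="set homepath $home; purge -age $PURGE_AGE_MIN" < /dev/null
    done
}

# Only run when adrci is actually on the PATH.
if command -v "$ADRCI" > /dev/null 2>&1; then
    purge_listener_homes
fi
```

Run it as the owner of the listener (grid or oracle, depending on your install) from cron or whatever scheduler you already use for housekeeping.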

Diagcollection.sh for clusters

diagcollection.sh is a script in your CRS home which collates all CRS related log files on the current cluster node.

It’s not easy manually collecting everything Oracle Support may require. This script makes it easy.

There are several options, which you can list with the “-h” option. Or just collect everything:

$ diagcollection.sh

Uncompressed, the resulting tar file can be very large:

-rw-r--r-- 1 grid oinstall 1.1G Feb 22 21:49 crsData_n02_20120222_2144.tar

Even compressed, the file can still be a lengthy upload to M.O.S (multiplied by the number of nodes):

-rw-r--r-- 1 grid oinstall 69M Feb 22 21:49 crsData_n02_20120222_2144.tar.gz
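If you want a feel for the upload size before committing to it, a few lines of shell will report the raw and gzipped size of a collection without touching the original file. This is just an illustration: compression_report is a made-up helper, and the crsData_*.tar pattern is taken from the listings above.

```shell
#!/bin/sh
# Report how much a diagcollection tar shrinks under gzip, leaving the file untouched.

compression_report() {
    # $1 = path to an uncompressed tar file
    raw=$(( $(wc -c < "$1") ))
    packed=$(( $(gzip -c "$1" | wc -c) ))
    echo "$1: $raw bytes raw, $packed bytes gzipped"
}

# Report on any collections sitting in the current directory.
for f in crsData_*.tar; do
    if [ -f "$f" ]; then
        compression_report "$f"
    fi
done
```

Multiply the gzipped figure by the number of nodes to get a rough idea of what you are about to push to M.O.S.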

diagcollection.sh is just a wrapper for diagcollection.pl.

OS Watcher Black Box (OSWbb)

A quote from the user guide:

a collection of UNIX shell scripts intended to collect and archive operating system and network metrics to aid support in diagnosing performance issues

“Because every vendor wants to blame another vendor and OSWbb helps that process”
“Because every issue is the fault of the database so you need ammunition to feed to your vendor”
“insert your own cynical quote here”

Download from M.O.S – “OS Watcher Black Box User Guide [ID 301137.1]”. It is certified on AIX, Tru64, Solaris, HP-UX, Linux.

It is easy to run:

nohup ./startOSWbb.sh &

easy to stop


and easy to send

$ ./tarupfiles.sh
-rw-r--r-- 1 oracle oinstall 1.2M Feb  8 22:00 osw_archive_0208122216.tar.Z


You can install OSWbb as a Linux service – “How To Start OSWatcher Black Box Every System Boot [ID 580513.1]” – or use any scheduling tool. Alternatively you can control it via CRS; this way it is only active when the cluster is active, which has plus and minus points. For details of this see M.O.S note “Making Applications Highly Available Using Oracle Clusterware [ID 1105489.1]”.

To do it you need an action script. There is a perfectly good demo one in “$GRID_HOME/crs/demo”. Alternatively, the one I use for testing at home can be found here – osw.scr (use at your peril).

$ $GRID_HOME/bin/crsctl add resource osw -type ora.local_resource.type \
 -attr "AUTO_START=always,ACTION_SCRIPT=$GRID_HOME/crs/script/oswbb.scr"
$ $GRID_HOME/bin/crsctl status res osw
STATE=ONLINE on n01, ONLINE on n02
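For anyone who has not written an action script before, the skeleton below shows the general shape. Clusterware calls the script with start, stop, check or clean as the first argument and treats a zero exit status as success (or “online” for check). It is a minimal sketch and not the osw.scr mentioned above: OSW_HOME is a hypothetical install directory, the process pattern OSWatcher.sh and the stopOSWbb.sh script sitting next to startOSWbb.sh are assumptions about a stock OSWbb install, and osw_action is a name invented for the example.

```shell
#!/bin/sh
# Sketch of a Clusterware action script for OSWbb.
# Assumptions: OSW_HOME is where OSWbb was unpacked (hypothetical default below)
# and the collector appears in the process list as OSWatcher.sh.
OSW_HOME="${OSW_HOME:-/opt/oswbb}"
OSW_PATTERN="${OSW_PATTERN:-OSWatcher.sh}"

osw_action() {
    case "$1" in
        start)
            [ -x "$OSW_HOME/startOSWbb.sh" ] || return 1
            # Start with the OSWbb defaults, detached from the calling shell.
            ( cd "$OSW_HOME" && exec nohup ./startOSWbb.sh > /dev/null 2>&1 ) &
            ;;
        stop|clean)
            [ -x "$OSW_HOME/stopOSWbb.sh" ] || return 1
            ( cd "$OSW_HOME" && ./stopOSWbb.sh ) > /dev/null 2>&1
            ;;
        check)
            # Exit status 0 tells Clusterware the resource is online.
            pgrep -f "$OSW_PATTERN" > /dev/null 2>&1
            ;;
        *)
            echo "usage: $0 {start|stop|check|clean}" >&2
            return 2
            ;;
    esac
}

# Clusterware passes the requested action as the first argument.
if [ $# -gt 0 ]; then
    osw_action "$1"
    exit $?
fi
```

Save it somewhere readable on every node, make it executable and point ACTION_SCRIPT at it. A production version would want logging and a check that start actually succeeded.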

From “OS Watcher For Windows (OSWFW) User Guide [ID 433472.1]”:

OS Watcher for Windows is no longer supported.
It has been replaced by the Cluster Health Monitor.

From “Cluster Health Monitor (CHM) FAQ [ID 1328466.1]”

Is the Cluster Health Monitor replacing OSWatcher?
…there [is] some information such as top, traceroute, and netstat that the Cluster Health Monitor does not collect, so running the Cluster Health Monitor while running OSWatcher is ideal. Both tools complement each other rather than supplement…

In my opinion another reason for still using OSWbb in spite of CHM is that CHM output is very difficult to review yourself. It is also not yet the tool of choice for many within Oracle Support. OSWbb still has a place.

Quote from traceroute Unix man page by way of caveat:

Because of the load it could impose on the network, it is unwise to use traceroute during normal operations or from automated scripts.


“OS Watcher Black Box” was originally called “OS Watcher” but was renamed due to a clash of names with other unrelated, non-Oracle tool(s).

More to follow in the future