UKOUG Database Server SIG (Leeds 2013)

On Thursday I attended the UKOUG Database Server SIG in Leeds. All slides have been uploaded to the UKOUG website.

It’s the first SIG I’ve attended this year and after enjoying it so much I ought to try harder to get to other events. We had Oak Table presence from David Kurtz talking all things partitioning/compression and purging, two Oracle employees discussing first support and then ZFS/NetApp (I particularly enjoyed this one) and then Edgars Rudans on his evolution of the Exadata Minimal Patching process. All of these presentations are well worth downloading and checking out.

The last presentation of the day was me. I’ve never presented before and it took a big step out of my comfort zone to get up there but I’m so glad I did. I would recommend it to anyone currently spending time in the audience thinking “I wish I had the confidence to do that”. It’s nerve racking beforehand but exhilarating afterwards.

When blogging in the past I’ve liked how it makes you think a little bit harder before pressing the publish button. I think the best thing I got out of the whole presentation process was that it made me dig even deeper to make sure I’d done my homework.

After the SIG there was a good group headed out for Leeds Oracle Beers #3 which involved local beer, good burgers and Morris Dancing, all good fun.

Advanced Compression: Visualising Insert Overhead

There is a My Oracle Support note, “Master Note for OLTP Compression [ID 1223705.1]”, which talks about the overheads of “COMPRESS FOR ALL OPERATIONS”. It’s a great note and, for a very artificial test, quotes overheads of around 25% CPU, 100% redo and 100% undo when inserting data via conventional methods. The purpose of this post is to visualise when the extra resources are used.

My test uses a SQL*Plus session that repeatedly inserts a single row into a table with “COMPRESS FOR ALL OPERATIONS” enabled. After each insert I use the SQL*Plus HOST command to capture redo and undo statistics for my session. I struggled to measure CPU for each insert. The “CPU used by this session” statistic was not granular enough for my single row inserts and the CPU tracked by SQL Trace was being rounded to 0ms or 1ms for each insert. However the execute elapsed attribute from SQL Trace did give me good granularity. Obviously SQL Trace has it’s own overheads as I’ve talked about before but I don’t think they are important in this case as I’m just after a visualisation of when extra resources are used.

The shell and SQL scripts used to capture statistics are at the foot of this post.

My SQL*Plus control session looks a little like this:

sqlplus neil
create table compteststats
(	testid		varchar2(10)
,	snapid		number
,	statname	varchar2(64)
,	statval		number);

create table compobj
(	col1 varchar2(30)
,	col2 number
,	col3 varchar2(30)
,	col4 varchar2(30)
,	col5 number
,	col6 varchar2(30))
compress for all operations;

exec dbms_application_info.set_module('TESTSESS','TESTSESS');
exec dbms_monitor.session_trace_enable

--capture starting values
host ./ TEST1

insert into compobj values (lpad('w',30,'w'),1,lpad('x',30,'x'),lpad('y',30,'y'),2,lpad('z',30,'z'));
host ./ TEST1

insert into compobj values (lpad('w',30,'w'),1,lpad('x',30,'x'),lpad('y',30,'y'),2,lpad('z',30,'z'));
host ./ TEST1

-- repeat above insert/host combination until the first block is full and we move to a second


I was then able to report redo and undo statistics using the SQL below

select snapid row#
,	max(decode(statname,'redo size',delta,0)) redo_size
,	max(decode(statname,'undo change vector size',delta,0)) undo_size
,	max(decode(statname,'HSC OLTP positive compression',delta,0)) HSC_OLTP_positive_comp
,	max(decode(statname,'HSC OLTP Compressed Blocks',statval,0)) HSC_OLTP_Compressed_Blocks
from (
	select testid,snapid,statname, statval
	,statval-lag(statval) over (partition by testid,statname order by snapid) delta
	from compteststats
	where testid = 'TEST2'
	and snapid > 0
group by snapid
order by snapid;

And I captured the elapsed time for each insert from the SQL Trace file using the grep/awk combo below. I then combined these values with the redo/undo figures in a spreadsheet.

grep "^EXEC #139771649698992" orcl1_ora_6987.trc | awk -F[,=] '{print $4}'

The chart produced from the spreadsheet looks like this, you can click to enlarge but get the gist from the small version.


The number of inserts is plotted on the X axis and the Y axis is either the value from v$sesstat or elapsed micro seconds from the trace file. The values are not important. What I found interesting was being able to visualise the life of the block. You can see the spikes in redo and undo each time the block is (re)compressed and as the block gradually fills up the compressions become more frequent until we get a lull in the compressions as we start to fill a second data block. The last spike on the chart is first compression of the second block.

Excerpts from my data is below. You can see the “HSC OLTP positive compression” has a delta of 1 for each spike demonstrating that the spike is caused by block compression.

---------- ---------- ---------- ---------------------- --------------------------
         1        604        112                      0                          0
         2        432         68                      0                          0
         3        432         68                      0                          0
         4        432         68                      0                          0
         5        432         68                      0                          0
         6        432         68                      0                          0
         7        432         68                      0                          0
         8        432         68                      0                          0
        51        432         68                      0                          0
        52        432         68                      0                          0
        53       9092       8320                      1                          1
        54        420         68                      0                          1
        55        420         68                      0                          1
        97        420         68                      0                          1
        98        420         68                      0                          1
        99       9152       8320                      1                          1
       100        420         68                      0                          1
       101        420         68                      0                          1
       613        432         68                      0                          1
       614        432         68                      0                          1
       615       9092       8320                      1                          2
       616        420         68                      0                          2
       617        420         68                      0                          2

As stated previously – here are my scripts.

sqlplus -S neil/neil <<SQLINP
@inscompstat $1


set verify off
insert into compteststats
select '&1', c.maxsnap,, s.value
from (
        select, s.value
        from v$session u, v$sesstat s, v$statname n
        where u.sid = s.sid
        and s.statistic# = n.statistic#
        and u.module = 'TESTSESS'
        and in ( 'redo size'
                ,       'undo change vector size'
                ,       'HSC OLTP positive compression'
                ,       'HSC OLTP Compressed Blocks')
        ) s
,       (select nvl(max(snapid),-1)+1 maxsnap
        from compteststats
        where testid = '&1') c

Compression Advisory: DBMS_COMPRESSION

I’ve been doing a lot of testing with COMPRESS FOR ALL OPERATIONS (OLTP compression) recently and spent a bit of time with the Compression Advisor. This post documents a few items I noticed along the way.

I picked up instructions on how to run the Compression Advisor from My Oracle Support note “Compression Advisory in 11GR2: Using DBMS_COMPRESSION [ID 762974.1]”. The note advises me to create a new tablespace and informs me that two temporary segments are created.

Please create a tablespace before running compression advisor as it creates two temporary tables and will do significant amount of I/O traffic against them.

Below is a sneak preview of the temporary segments that will be created.

create table "ACME".DBMS_TABCOMP_TEMP_UNCMP tablespace "SCRATCH" nologging
 as select /*+ DYNAMIC_SAMPLING(0) FULL("ACME"."ACCS") */ *
 from "ACME"."ACCS"  sample block( 99) mytab

create table "ACME".DBMS_TABCOMP_TEMP_CMP organization heap 
 tablespace "SCRATCH" compress for all operations nologging
 as select /*+ DYNAMIC_SAMPLING(0) */ *

DBMS_COMPRESSION runs using invoker rights so the privileges of the user I log on as are relevant.

select procedure_name, authid 
from dba_procedures 
where owner = 'SYS' 
and object_name = 'DBMS_COMPRESSION';

------------------------------ ------------

Taking all of the above into account I’ve created a new SCRATCH tablespace for my temporary segments and will run the analysis logged in as myself and not the schema user who owns the segment I want to compress. As we at ORAganism are so security minded 😉 I am not permitted to have the standard “dba” role and have a cut down one with only the permissions I need.

create tablespace scratch datafile '+data' size 50m;

create role nj_dba;
create user nj identified by nj;
grant create session, create any table, drop any table, select any table to nj_dba;
grant execute on sys.dbms_monitor to nj_dba;
grant nj_dba to nj;

Also according to note “ORA-01536: space quota exceeded for tablespace along with DBMS_COMPRESSION.GET_COMPRESSION_RATIO [ID 970324.1]” I need to give the owner of the segment being analysed a quota on my scratch tablespace.

To get rid of the errors, the user specified as the second parameter in the DBMS_COMPRESSION.GET_COMPRESSION_RATIO need to have … some QUOTA over the scratch tablespace specified by the first parameter.


alter user acme quota unlimited on scratch;

I have created my script to call the advisor, dbms_comp.sql, as below.

set serveroutput on
set feedback on
set verify off

	blkcnt_cmp BINARY_integer;
	blkcnt_uncmp BINARY_integer;
	row_cmp BINARY_integer;
	row_uncmp BINARY_integer;
	cmp_ratio number;
	comptype_str varchar2(60);
		scratchtbsname => upper('&3.')
	,	ownname => upper('&1.')
	,	tabname => upper('&2.')
	,	partname => null
	,	comptype => dbms_compression.comp_for_oltp
	,	blkcnt_cmp => blkcnt_cmp
	,	blkcnt_uncmp => blkcnt_uncmp
	,	row_cmp => row_cmp
	,	row_uncmp => row_uncmp
	,	cmp_ratio => cmp_ratio
	,	comptype_str => comptype_str
	,	subset_numrows => &4.
	DBMS_OUTPUT.PUT_LINE('Block count compressed = ' || blkcnt_cmp);
	DBMS_OUTPUT.PUT_LINE('Block count uncompressed = ' || blkcnt_uncmp);
	DBMS_OUTPUT.PUT_LINE('Row count per block compressed = ' || row_cmp);
	DBMS_OUTPUT.PUT_LINE('Row count per block uncompressed = ' || row_uncmp);
	--DBMS_OUTPUT.PUT_LINE('Compression type = ' ||comptype_str);
	DBMS_OUTPUT.PUT_LINE('Compression ratio = '||round(blkcnt_uncmp/blkcnt_cmp,1)||' to 1');
	DBMS_OUTPUT.PUT_LINE('Compression % benefit = '||round((blkcnt_uncmp-blkcnt_cmp)/blkcnt_uncmp*100,1));
	--DBMS_OUTPUT.PUT_LINE('Compression ratio org= '||cmp_ratio);
set verify on

So we should be good to go:

conn nj/nj
@orcl> @dbms_comp acme accs scratch 200000
ERROR at line 1:
ORA-20000: Compression Advisor tablespace "SCRATCH" does not exist
ORA-06512: at "SYS.PRVT_COMPRESSION", line 776
ORA-06512: at "SYS.DBMS_COMPRESSION", line 214
ORA-06512: at line 9

With SQL Trace enabled I see the following in the trace file.

PARSING IN CURSOR #139655103675152 len=64 dep=1 uid=37 oct=3 lid=37 tim=1357852093619559 hv=814796271 ad='67de3b40' sqlid='g6pggnss91mgg'
SELECT count(*) FROM user_tablespaces WHERE tablespace_name= :b1
BINDS #139655103675152:

Even though the temporary segments are created in the same schema as the object being analysed my own account also needs a quota on the scratch tablespace. No segments are created in my schema but I need the quota in order to get past the check on “user_tablespaces”.

alter user nj quota unlimited on scratch;

Let’s try again:

@orcl> @dbms_comp acme accs scratch 200000
ERROR at line 1:
ORA-01031: insufficient privileges
ORA-06512: at "SYS.PRVT_COMPRESSION", line 776
ORA-06512: at "SYS.DBMS_COMPRESSION", line 214
ORA-06512: at line 9

I couldn’t find anything about this in M.O.S or find any evidence from SQL Trace. However the power of trial and error tracked the missing privilege down to “ANALYZE ANY” (I could have granted OEM_MONITOR to provide this privilege but we’re all about the principle of least privilege today).


Let’s try again… again:

@orcl> @dbms_comp acme accs scratch 200000
Block count compressed = 1612
Block count uncompressed = 1632
Row count per block compressed = 123
Row count per block uncompressed = 121
Compression ratio = 1 to 1
Compression % benefit = 1.2

PL/SQL procedure successfully completed.

Elapsed: 00:00:09.82

Sadly my table is not going to benefit from compression but it’s a success nonetheless. So the full list of required privileges is:


You also need a quota for your schema user and an unnecessary quota for the executing user.

If you use VPD then don’t forget to take that into account.

And one last point, the temporary tables are created in thje target schema and the names are not session/object specific so don’t try to run Compress Advisor calls in parallel:

@orcl> @dbms_comp acme accs scratch 200000
ERROR at line 1:
ORA-00955: name is already used by an existing object
ORA-06512: at "SYS.PRVT_COMPRESSION", line 776
ORA-06512: at "SYS.DBMS_COMPRESSION", line 214
ORA-06512: at line 9 Client, Password Expiry and ORA-1017

I have had an intermittent issue at work recently where I am occasionally unable to change my password when it has expired.

From my weapon of choice, DOS prompt/”sqlplus” on Windows, I enter my current password and am prompted to change it due to its expiry. I then enter a new password, one I am confident is consistent with password policy and get a message stating my username/password is incorrect. You can see example output below (and yes I am absolutely certain I entered the same password twice 🙂 ).

u:\sqls>sqlplus neiljohnson@db10g1

SQL*Plus: Release Production on Wed Sep 19 09:36:05 2012

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Enter password:
ORA-28001: the password has expired

Changing password for neiljohnson
New password:
Retype new password:
ORA-01017: invalid username/password; logon denied

Password unchanged
Enter user-name:

I was unable to find a hit in M.O.S and settled on a work-a-round of changing my password on the Unix database server, silently hoping I wasn’t going mad. After a few more occurrences I resorted to Google and via the OTN forums found an M.O.S note detailing the problem.

ORA-1017 Trying To Change Expired Password Through Programmatic Interface From To Lower Database Version [ID 1426233.1]

Below is the suggested solution from this note and I find it out of step with Oracle’s usual stance on these things:

There are no workarounds available to programatically changing the password, apart from;

a) Use an older client version, or
b) Change the user’s password via an ALTER USER statement from a different session.
c) Upgrade the database version to

I’ve always been quite proud to tell my Sysadmin brethren that Oracle clients are backward compatible (and within reason forward compatible too). What we have here is the latest and greatest of the Oracle clients,, and a statement that I read as “this is not backward compatible – tough luck!”

I tried to diagnose this further with SQL trace on the Oracle server process but I only capture the insert into sys.aud$. I also tried SQL Net client tracing and in the trace file I can see an error recorded in between the ORA-28001 and ORA-01017. I’ve highlighted it in the extracts from the trace file below.

2012-09-19 10:44:33.304271 : nsbasic_brc:00 00 24 4F 52 41 2D 32  |..$ORA-2|
2012-09-19 10:44:33.304290 : nsbasic_brc:38 30 30 31 3A 20 74 68  ||
2012-09-19 10:44:33.304309 : nsbasic_brc:65 20 70 61 73 73 77 6F  |e.passwo|
2012-09-19 10:44:33.304327 : nsbasic_brc:72 64 20 68 61 73 20 65  |rd.has.e|
2012-09-19 10:44:33.304346 : nsbasic_brc:78 70 69 72 65 64 0A     |xpired. |
2012-09-19 10:44:33.304364 : nsbasic_brc:exit: oln=0, dln=69, tot=79, rc=0
2012-09-19 10:44:33.304383 : nioqrc:exit
2012-09-19 10:45:59.358748 : nsbasic_brc:exit: oln=11, dln=1, tot=22, rc=0
2012-09-19 10:45:59.359278 : nioqrc: found a break marker...
2012-09-19 10:45:59.359797 : nioqrc: Recieve: returning error: 3111
2012-09-19 10:45:59.360333 : nioqrc:exit
2012-09-19 10:45:59.360834 : nioqrs:entry
2012-09-19 10:45:59.406650 : nsbasic_brc:00 00 33 4F 52 41 2D 30  |..3ORA-0|
2012-09-19 10:45:59.407173 : nsbasic_brc:31 30 31 37 3A 20 69 6E  ||
2012-09-19 10:45:59.407702 : nsbasic_brc:76 61 6C 69 64 20 75 73  ||
2012-09-19 10:45:59.408227 : nsbasic_brc:65 72 6E 61 6D 65 2F 70  |ername/p|
2012-09-19 10:45:59.408753 : nsbasic_brc:61 73 73 77 6F 72 64 3B  |assword;|
2012-09-19 10:45:59.409280 : nsbasic_brc:20 6C 6F 67 6F 6E 20 64  |.logon.d|
2012-09-19 10:45:59.409820 : nsbasic_brc:65 6E 69 65 64 0A        |enied.  |

It seems from the M.O.S information below that ORA-03111 is not something I’m going to get very far with, in fact it’s probably a side effect of some other piece of code failing. Either way – it’s back to the work-a-round for me.

Error:  ORA 3111 
Text:   break received on communication channel 
Cause:  A break was processed on the communications channel, but was not 
        properly handled by SQL*Net software.
        This is an internal error message not usually issued.
Action: Contact customer support.

Hopefully this post will save someone else some time if faced with the same issue, sadly I am not currently able to provide a solution.

Production Support Tips & Tricks #2 – SQL Trace

Part 1: Production Support Tip & Tricks #1 – Collecting Log Data

There’s no need to discuss SQL Trace other than the age-old concern – the overhead. Incident managers are risk averse and when you ask if you can trace a session on production you will always be asked what overhead it has and invariably you will say “erm… 5%… ish”. We can see from the Oracle documentation that Oracle’s stance on this has changed over the years:

9i docs:

“The additional overhead … is normally insignificant compared with the inherent overhead caused by the application’s inefficiency.”

11.2 docs:

“Using the SQL Trace facility can have a severe performance impact and may result in increased system overhead, excessive CPU usage, and inadequate disk space”

In this post I am going to attempt to assess the overhead of SQL Trace. My theory is that there are 2 potential factors.

  1. The overhead of the SQL Trace code. There must be extra code involved and it must take some amount of CPU to execute it.
  2. The overhead of writing trace files. Writing to a file can not be without any overhead.

My cunning plan is to test queries in 3 states:

  • With SQL Trace disabled
  • With SQL Trace enabled
  • With SQL Trace enabled but redirected to /dev/null thus removing any overhead of writing to the trace file

The queries will be tested in IO bound and CPU bound states. The tests were done on a Linux laptop running Oracle XE 11g with a single hard drive so the results are not “enterprise class” – your mileage may vary.

All of the code used can be found at these links, for brevity I have not included them inline – build-acme-schema.sql sqltrace-long1.sql sqltrace-long2.sql sqltrace-short1.sql sqltrace-short2.sql

Test #1 – IO bound queries with a long(ish) elapsed time

I say “long(ish)” as the queries only take a few seconds, the key thing is that the majority of time is spent fetching data. I flush the buffer cache before every execution to make sure the queries are IO bound. The test code can be found in sqltrace-long1.sql.

The results are:

Test \ Execution 1 2 3 4 5 6 Average % Overhead
Long IO bound no trace 5.45s 5.45s 5.52s 5.52s 5.45s 5.34s 5.45s
Long IO bound with trace (32KB) 6.90s 6.89s 6.94s 7.09s 6.84s 7.01s 6.94s 27.31%
Long IO bound with trace (to null) 6.89s 6.82s 6.89s 6.94s 6.78s 7.12s 6.90s 26.61%

I was really surprised by this. For a small number of queries generating small trace files, only 32KB, I saw a 25% degradation in run time with SQL Trace enabled. The minimal size of the trace files did nothing to prove my theory about the overhead in writing the trace file but this was an eye opener.

Test #2 – CPU bound queries with a long(ish) elapsed time

This test used the same queries as before but I removed the flush of the buffer cache from before each query. This removed all IO waits. The test code can be found in sqltrace-long2.sql.

Test \ Execution 1 2 3 4 5 6 Average % Overhead
Long CPU bound no trace 3.79s 3.76s 3.80s 3.73s 3.71s 3.79s 3.76s
Long CPU bound with trace (11KB) 4.90s 4.99s 5.02s 5.00s 4.89s 4.87s 4.94s 31.39%
Long CPU bound with trace (to null) 4.95s 5.00s 5.05s 5.05s 4.90s 4.86s 4.96s 32.02%

Interesting – so by running the same workload but with no disk reads to slow it down the overhead is slightly more pronounced. But infrequent long running queries are not the only fruit.

Test #3 – IO bound frequent queries with a short elapsed time

This test uses very fast queries in a tight loop. The proportion of SQL executes to row fetches is very different to the previous tests. The buffer cache is flushed before each execution. The test code can be found in sqltrace-short1.sql.

Test \ Execution 1 2 3 4 5 6 Average % Overhead
Short IO bound no trace 43.99s 43.2s 39.44s 41.81s 39.00s 40.03s 41.24s
Short IO bound with trace (2.9MB) 41.35s 43.63s 40.70s 41.72s 43.24s 39.86s 41.75s 1.22%
Short IO bound with trace (to null) 41.12s 45.21s 41.50s 40.82s 42.44s 40.05s 41.85s 1.48%

So this more like it – only a 1.5% overhead.

Test #4 – CPU bound frequent queries with a short elapsed time

This test uses the same very fast queries in a tight loop from the previous test. The buffer cache flushes were removed and the number of iterations increased to get a longer elapsed time. The test code can be found in sqltrace-short2.sql.

Test \ Execution 1 2 3 4 5 6 Average % Overhead
Short CPU bound no trace 2.58s 2.46s 2.46s 2.63s 2.49s 2.47s 2.51s
Short CPU bound with trace (32MB) 7.55s 7.52s 7.57s 7.64s 7.43s 7.67s 7.56s 200.72%
Short CPU bound with trace (to null) 4.45s 4.57s 4.69s 4.63s 4.54s 4.62 4.58s 82.23%

Ouch – a 200% overhead! Also the overhead of writing to the trace file was visible this time, SQL Trace overhead dropped from 200% to 80% when the trace file was /dev/null.


I am almost reluctant to summarise as these are not the most scientific of tests but the things I am taking away from this excercise are:

  • SQL workload that is CPU bound will see a dramatic degradation in elapsed time when SQL Trace is enabled – I’ll be wary of this in the future
  • SQL workload that is IO bound will see less of an impact from SQL Trace, but the impact could still be significant.
  • When trace file size is large then the speed of the trace file target filesystem is important. In these tests I used /dev/null which is of course cheating.
  • When IO response is important to a test remember to set filesystemio_option=setall. I wasted several evenings scratching my head at inconsistent, unpredictable test results.
  • Your mileage may vary

Feel free to point out mistakes in my train of thought or in the tests themselves. I’m not too happy with tests 1 & 2 – I’m not convinced they prove anything worthwhile. Tests 3 & 4 however have results I have found interesting and informative.

I think I’ll continue to tell Incident Managers that the overhead is around 5-10%, but only after I’ve given some careful thought to the criticality of the workload I want to trace.

Production Support Tips & Tricks #1 – Collecting Log Data

Early this year (2012) I started working on a presentation, it would be my first, that I hoped to submit to UKOUG. The thrust of the presentation was to be tips on making your experiences with Oracle Support more pleasant, to help keep your support analyst busy rather than yourself. A prospective title was “with Support like this who needs enemies” – perhaps that’s a big strong ;-). Several things colluded to make it unlikely I would get to present it so I faltered and things ground to a halt. After a period of inactivity I have decided to convert it into a short series of blog posts. This is the first. Part 2 is here – “Production Support Tips & Tricks #2 – SQL Trace

This post contains some advice for collecting log data when raising SRs. It’s mostly obvious but hopefully not to all.

ADR Package

You already know so I’m not going to waste my breath.

Get everything packaged up, not just the trace files you think Oracle need. Avoids repeat requests.

Well covered by others so I’m not going near it:
John Hallas quality UKOUG presentation
Uwe Hesse’s super blog entry

Nah – see above

Not related to diagnostic collection but listener targets don’t auto purge so your housekeeping scripts need to make calls to adrci to force a purge. for clusters

What? is a script in your CRS home which collates all CRS related log files on the current cluster node.

It’s not easy manually collecting everything Oracle Support may require. This script makes it easy.

Several options, you can check them with the “-h” option. Or just collect everything:


Uncompressed the resulting tar file can be very large

-rw-r--r-- 1 grid oinstall 1.1G Feb 22 21:49 crsData_n02_20120222_2144.tar

Even compressed the file can still be a lengthy upload to M.O.S (multiplied by the # of nodes)

-rw-r--r-- 1 grid oinstall 69M Feb 22 21:49 crsData_n02_20120222_2144.tar.gz

Trivia? is just a wrapper for

OS Watcher Black Box (OSWbb)

A quote from the user guide:

a collection of UNIX shell scripts intended to collect and archive operating system and network metrics to aid support in diagnosing performance issues

“Because every vendor wants to blame another vendor and OSWbb helps that process”
“Because every issue is the fault of the database so you need ammunition to feed to your vendor”
“insert your own cynical quote here”

Download from M.O.S – “OS Watcher Black Box User Guide [ID 301137.1]”. It is certified on AIX, Tru64, Solaris, HP-UX, Linux.

It is easy to run:

nohup ./ &

easy to stop


and easy to send

$ ./
-rw-r--r-- 1 oracle oinstall 1.2M Feb  8 22:00 osw_archive_0208122216.tar.Z


You can install OSWbb as a Linux service – “How To Start OSWatcher Black Box Every System Boot [ID 580513.1]” or use any scheduling tool. Alternatvely you can control it via CRS, this way it is only active when the cluster is active which has plus and minus points. For details of this see M.O.S note “Making Applications Highly Available Using Oracle Clusterware [ID 1105489.1]”.

To do it you need an action script, there is a perfectly good demo one in “$GRID_HOME/crs/demo”. Alternatively the one I use for testing at home can be found here – osw.scr (use at your peril).

$GRID_HOME/bin/crsctl add resource osw -type ora.local_resource.type \
 -attr "AUTO_START=always,ACTION_SCRIPT=$GRID_HOME/crs/script/oswbb.scr"
$  $GRID_HOME/bin/crsctl status res osw
STATE=ONLINE on n01, ONLINE on n02

From “OS Watcher For Windows (OSWFW) User Guide [ID 433472.1]”:

OS Watcher for Windows is no longer supported.
It has been replace by the Cluster Health Monitor.

From “Cluster Health Monitor (CHM) FAQ [ID 1328466.1]”

Is the Cluster Health Monitor replacing OSWatcher?
…there [is] some information such as top, traceroute, and netstat that the Cluster Health Monitor does not collect, so running the Cluster Health Monitor while running OSWatcher is ideal. Both tools complement each other rather than supplement…

In my opinion another reason for still using OSWbb in spite of CHM is that CHM is very difficult to review yourself, it is also not yet the tool of choice for many within Oracle Support. OSWbb still has a place.

Quote from traceroute Unix man page by way of caveat:

Because of the load it could impose on the network, it is unwise to use traceroute during normal operations or from automated scripts.


“OS Watcher Black Box” was originally called “OS Watcher” but was renamed due to a clash of names with other unrelated, non-Oracle tool(s).

More to follow in the future

Mining the listener log with Perl

Update – 02-MAY-2012
I have had cause to use this script again at work so thought I’d develop it further. This post has been updated to reflect the new version.

Recently at work I was engaged by one of our application support teams because of application calls failing with “ORA-00020: maximum number of processes (300) exceeded”. We quickly identified that these errors were the symptom of another problem and not a concern in themselves, the real issue was slow database calls due to an infrastructure issue. The ORA-00020 errors were because the application was spawning new connections to try to work around the slow connections. Now I don’t know much about application servers but I know this is not uncommon. The application team however could not find any indication of this in their logs so it was up to me to prove it.

As far as I am aware there are 2 options for retrospectively proving this at a granular level (please share any other options I’ve missed):

1) Query the database audit trail – DBA_AUDIT_SESSION
2) Mine the listener log

On this database no auditing was enabled so I was left with the listener log as my data source.

I knocked up a quick KSH/AWK script to get my data and then massaged it via various means. The output was quite pleasing so over the recent long UK Easter holiday I Googled some Perl commands and had some fun with it, this post is to share the script and an example chart from Microsoft Excel based on the incident I described above.

The script can be downloaded from –

Below is an example of how to call the script – it works on *nix and Windows. The “-f” parameter is used to filter log entries on the date, this example is matching any minute between “11:00” and “12:59”. I’ve included a full usage map further down this post.

C:\>perl -excel -f "04-APR-2012 1[12]" listener.log  > llog.csv

This is the chart showing the logon spike – click to enlarge. You can see the connection spike at 12:56 of over 250 connection attempts.

Connection rate from listener log

The script allows a few other options such as allowing listener logs to be “piped” into the script to allow multiple files to be processed in one command. Usage notes below.

C:\>perl -help

    Usage: [-help] [-filter ] [-[no]excel] [-[no]host] 
         : -excel|noexcel       : Output dates in Excel friendly format (default is noexcel)
         : -help                : Print this help
         : -host|nohost         : Include host on program (default is to include)
         : -filter      : A regexp to match beginning of log line. e.g. "23-DEC-2012 0[12]" or "..-MAR-2013"
         : listener_log_path|-  : Input either a valid listener log name or "-" to read STDIN

         # Process listener.log looking for entries on the 21st and 22nd of Sept and format dates for Excel
         C:\>perl -filter "2[12]-SEP-2011" -excel listener.log

         # Process two listener log files looking for entries on Jan 3rd between 20:00 and 21:59 and exclude the hostname
         $ cat listener.log listener_db.log | -filter "03-JAN-2012 2[01]" -nohost -

An improvement I may make in the future is to expand minutes where there are no connection attempts. I have drafted some code to fill in gaps but have kept it to myself for now as it’s too ugly to share (at the moment).

I hope someone finds this useful and if not, well I found it entertaining if nothing else 🙂

ADRCI and listener purging

I won’t go into any details on ADRCI as others have already done a better job than I could. The key piece of information for this post is that the MMON process shoulders the responsibility for ADR purging. So what if there isn’t an MMON process?

Note: All ADRCI “show control” output on this page has been edited for readability

Below is the ADR purge details for my local listener. These settings were defined using “set control (SHORTP_POLICY = 168,LONGP_POLICY = 336)”. “168” = 7 days and “336” = 14 days.

set home diag/tnslsnr/n01/listener
show control

---------- ------------- ------------ -------------------------- ----------------- ----------------------------
1794992421 168           336          2012-01-25 20:41:09.052526

adrci> host
$ date

Thu Feb 16 21:26:16 GMT 2012

$ cd  /u01/app/oracle/diag/tnslsnr/n01/listener/alert
$ ls -ltr

total 30260
-rw-r----- 1 grid oinstall 10485934 Jan 25 03:45 log_1.xml
-rw-r----- 1 grid oinstall 10485779 Feb  4 03:11 log_2.xml
-rw-r----- 1 grid oinstall  9954308 Feb 16 21:43 log.xml

Notice that the “LAST_AUTOPRG_TIME” column is blank. It looks like this system is not being purged automatically. Also notice that the date of this test was Feb 16th and the oldest file in the “alert” directory is dated “Jan 25”. According to M.O.S note “Which Files Are Part Of SHORTP_POLICY And LONGP_POLICY In ADR? [ID 975448.1]” the “alert” directory is under LONGP_POLICY which is 14 days so the file “log_1.xml” should have been purged.

Below I force a manual purge

adrci> purge
adrci> host

$ cd  /u01/app/oracle/diag/tnslsnr/n01/listener/alert
$ ls -ltr

total 20004
-rw-r----- 1 grid oinstall 10485779 Feb  4 03:11 log_2.xml
-rw-r----- 1 grid oinstall  9960304 Feb 16 21:53 log.xml

It has now gone. And LAST_MANUPRG_TIME has been set to record the fact.

---------- ------------- ------------ -------------------------- ----------------- --------------------------
1794992421 168           336          2012-01-25 20:41:09.052526                   2012-02-16 21:53:51.761639

So my recommendation is that alongside ADR’s auto purging and your own maintenance on the alert.log & listener.log files in the “trace” directories you should also routinely call something like below:

for i in `adrci exec="show homes"|grep listener`;do
 echo "adrci exec=\"set home $i;purge\""
 adrci exec="set home $i;purge";
 # maybe check $? here

I noticed this M.O.S note recently too. I’ve never seen the symptoms but it’s worth taking a look.

ADR Auto Purge Does Not Work On All RAC Instances, ASM and Standby Databases [ID 1326983.1]

Controlling The SCAN Listener Log Location

I was lucky enough to attend UKOUG Conference 2011 last year. I saw several great presentations, one of which was a presentation on troubleshooting RAC by Julian Dyke. During this presentation Julian commented that we need to watch out for SCAN listeners having their ADR home inside the Grid Infrastructure home instead of with the Database listeners in the ADR home under $ORACLE_BASE.

Well this was one of the few things in his presentation I was fit to comment on so I tried to attract Julian’s attention by waving my arm, sadly my timid nature meant I went unnoticed. So this blog post is me going public with my attempted interruption.

First an example of how the log location for SCAN listeners typically differs to that of Database listeners. I personally do not like this behaviour as I like my Grid Infrastructure home to have resonably stable size, which it can have once the log directory’s rotation is mature, however the ADR home for listeners is not cleared down automatically so you can end up with a directory growing unchecked.

[grid@n01 n01]$ $GRID_HOME/bin/lsnrctl status LISTENER|grep Log
Listener Log File         /u01/app/oracle/diag/tnslsnr/n01/listener/alert/log.xml
[grid@n01 n01]$ $GRID_HOME/bin/lsnrctl status LISTENER_SCAN3|grep Log
Listener Log File         /u01/app/grid/11203/ghome_1/log/diag/tnslsnr/n01/listener_scan3/alert/log.xml

And we know from the Oracle documentation that ADR_BASE is defined as:

To specify the base directory into which tracing and logging incidents are stored when ADR is enabled.

The default on the server side is ORACLE_BASE, or ORACLE_HOME/log, if ORACLE_BASE is not defined.

So… let’s check ORACLE_BASE for CRS resources:

[grid@n01 ~]$ $GRID_HOME/bin/crsctl stat res ora.LISTENER.lsnr -p|grep -i base
[grid@n01 ~]$ $GRID_HOME/bin/crsctl stat res ora.LISTENER_SCAN3.lsnr -p|grep -i base
[grid@n01 ~]$

Notice above that the Database listener has $ORACLE_BASE set and the Scan listener does not. Hence the log location.

This can be controlled for Database listeners by using the SETENV switch for SRVCTL but this does not work for Scan listeners.

[grid@n01 ~]$ $GRID_HOME/bin/srvctl setenv listener -l listener_scan1 -t "ORACLE_BASE=/u01/app/oracle"
PRKO-3162 : The actions setenv, getenv, and unsetenv are not suppported for Single Client Access Name listener listener_scan1
PRCN-2066 : Failed to retrieve cluster listener for listener_scan1 because it has the type of Single Client Access Name Listener

There is another way of setting environment variables for cluster resources and that is via crsctl modify resource.

[grid@n01 ~]$ $GRID_HOME/bin/crsctl modify resource ora.LISTENER_SCAN3.lsnr -attr USR_ORA_ENV=ORACLE_BASE=/u01/app/oracle
[grid@n01 ~]$ $GRID_HOME/bin/crsctl stat res ora.LISTENER_SCAN3.lsnr -p|grep -i base
[grid@n01 ~]$ $GRID_HOME/bin/srvctl stop scan_listener -i 3
[grid@n01 ~]$ $GRID_HOME/bin/srvctl start scan_listener -i 3
[grid@n01 ~]$ lsnrctl status  listener_scan3 | grep Log
Listener Log File         /u01/app/oracle/diag/tnslsnr/n01/listener_scan3/alert/log.xml

To my mind the whole thing would seem a little cleaner if there was an “SRVCTL SETENV SCAN_LISTENER” command.

After writing this post I thought I’d check around for other information to support this post and, lo and behold, there is a My Oracle Support article covering almost exactly what I’ve written above. The article also talks about using the LISTENER.ORA attribute ADR_BASE_LISTENER_SCANn to control this.

Diagnostic Destination For Listeners Is Under Grid_home/Log Instead Of Oracle_base/Diag [ID 1269109.1]

A DBA_HIST_SQLSTAT query that I am very fond of

This is a quick post to share a SQL statement I use a lot at work. The query mines the AWR tables (beware the licence implications) for a specific SQL ID and date/time range and shows a few choice statistics for each snapshot period.


prompt enter start and end times in format DD-MON-YYYY [HH24:MI]
column sample_end format a21
select to_char(min(s.end_interval_time),'DD-MON-YYYY DY HH24:MI') sample_end
, q.sql_id
, q.plan_hash_value
, sum(q.EXECUTIONS_DELTA) executions
, round(sum(DISK_READS_delta)/greatest(sum(executions_delta),1),1) pio_per_exec
, round(sum(BUFFER_GETS_delta)/greatest(sum(executions_delta),1),1) lio_per_exec
, round((sum(ELAPSED_TIME_delta)/greatest(sum(executions_delta),1)/1000),1) msec_exec
from dba_hist_sqlstat q, dba_hist_snapshot s
where q.SQL_ID=trim('&sqlid.')
and s.snap_id = q.snap_id
and s.dbid = q.dbid
and s.instance_number = q.instance_number
and s.end_interval_time >= to_date(trim('&start_time.'),'dd-mon-yyyy hh24:mi')
and s.begin_interval_time <= to_date(trim('&end_time.'),'dd-mon-yyyy hh24:mi')
and substr(to_char(s.end_interval_time,'DD-MON-YYYY DY HH24:MI'),13,2) like '%&hr24_filter.%'
group by s.snap_id
, q.sql_id
, q.plan_hash_value
order by s.snap_id, q.sql_id, q.plan_hash_value

Nothing ground breaking and I’m sure many will have a similar script.

Below I have example output showing one of the scripts many successful outings, quite a dramatic plan change I’m sure you’ll agree.

SQL> @awrsql
enter start and end times in format DD-MON-YYYY [HH24:MI]
Enter value for sqlid: 1jjpo2i4b313g
Enter value for start_time: 15-NOV-2011
Enter value for end_time: 21-NOV-2011 13:00
Enter value for hr24_filter:

--------------------- ------------- --------------- ---------- ------------ ------------ ----------
15-nov-2011 TUE 08:00 1jjpo2i4b313g      3133159894     129629            0            5          0
16-nov-2011 WED 08:01 1jjpo2i4b313g      3133159894     115003            0            5         .1
17-nov-2011 THU 08:01 1jjpo2i4b313g      3133159894     115741            0            5          0
18-nov-2011 FRI 07:00 1jjpo2i4b313g      3133159894      30997            0            5         .1
18-nov-2011 FRI 08:00 1jjpo2i4b313g      3133159894      81034            0            5          0
21-nov-2011 MON 00:00 1jjpo2i4b313g       790865878         16     323091.6     323128.3    36905.8
21-nov-2011 MON 01:00 1jjpo2i4b313g       790865878         29     349676.2     349713.7    48387.2
21-nov-2011 MON 02:00 1jjpo2i4b313g       790865878         35     339474.6     339509.2    34057.7
21-nov-2011 MON 03:00 1jjpo2i4b313g       790865878         37     340934.6     340970.2    35899.4
21-nov-2011 MON 04:01 1jjpo2i4b313g       790865878         38     333469.1     333503.9    35450.8
21-nov-2011 MON 05:00 1jjpo2i4b313g       790865878         35     347559.3     347595.2    35231.8
21-nov-2011 MON 06:00 1jjpo2i4b313g       790865878         32     340224.8       340260    35208.3

I also like to use the query to track number of executions or LIO per execution over a longer time frame to see if either the frequency or individual impact of the SQL is changing over time. I can use the “hr24_filter” variable to do this, for example showing me all snapshots for hour “13” over a whole month.