Unexpected Database Directory Permissions

In some secure environments, code deployment privileges are split away from data modification privileges. Below is an example of a user created only to execute DDL for application schema changes.

create user deploy identified by deploy;
grant create any directory to deploy;
grant create any table to deploy;
grant create session to deploy;

And here we see the account being used to make some changes to an application schema, creating a new table and a database directory.

conn deploy/deploy@pdb
create table app.secrets
( cardnum varchar2(19)
, holder varchar2(40)
, balance number);
create directory sensitive_files as '/tmp/secrets';

Two weeks later my code monkey technician decides to have a cheeky look at the contents of the secrets table created previously.

select * from app.secrets
                  *
ERROR at line 1:
ORA-01031: insufficient privileges

Lovely – no access and my secrets are safe. However the technician happens to be aware of the file naming convention used by the ETL process and tries to read a dump file using UTL_FILE.

set serveroutput on
declare
  fh utl_file.file_type;
  str varchar2(1024);
begin
  fh := utl_file.fopen('SENSITIVE_FILES', 'secrets.dat', 'R');

  begin
    loop
      utl_file.get_line(fh, str);
      dbms_output.put_line(str);
    end loop;
  exception
    when no_data_found then
      null;
  end;
  
  utl_file.fclose(fh);    
end;
/

cardnum,holder,balance
1234-1234-4321-4321,Neil,0.05

Oh my, the “CREATE ANY DIRECTORY” privilege also gives me read permissions on the “contents” of the directory. And…

set serveroutput on
declare
  fh utl_file.file_type;
begin
  fh := utl_file.fopen('SENSITIVE_FILES', 'secrets.dat', 'W');

  utl_file.put_line(fh,'cardnum,holder,balance');
  utl_file.put_line(fh,'1234-1234-4321-4321,Neil,999999.99');
  utl_file.fclose(fh);    
end;
/

PL/SQL procedure successfully completed.

We can modify the file too. Here are the new contents when read again using UTL_FILE.

cardnum,holder,balance
1234-1234-4321-4321,Neil,999999.99

Looking at the permissions it may not be clear how the file contents were changed; “CREATE ANY DIRECTORY” is more powerful than it looks.

  1* select * from dba_sys_privs where grantee = 'DEPLOY'
SQL> /

GRANTEE    PRIVILEGE            ADM COM
---------- -------------------- --- ---
DEPLOY     CREATE SESSION       NO  NO
DEPLOY     CREATE ANY DIRECTORY NO  NO
DEPLOY     CREATE ANY TABLE     NO  NO

It’s probably an edge case but I found it surprising.
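
If you want to check your own systems for this, a reasonable first step is to see who holds the directory-related “ANY” privileges. A simple sketch using the same view queried above:

select grantee, privilege
from dba_sys_privs
where privilege in ('CREATE ANY DIRECTORY', 'DROP ANY DIRECTORY')
order by grantee, privilege;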

Checking Oracle Directory Object Path Validity

Bit of a mouthful that title. Today I was asked to check which, if any, directory objects had an invalid path. Normally this would be trivial, but on the system I was on there were more than 90 directory objects. So I knocked up a quick bit of SQL/KSH to do it for me and thought I’d share it here. I could have done something with PL/SQL but I was happy with quick and dirty.

In summary, the snippet below spools shell test commands to a temporary file and then uses “sed” to chop off the first and last lines before running the resulting shell script. I just pasted this into a PuTTY window at the SQL*Plus prompt – your mileage may vary.


set define off heading off pages 0 lines 200 trimspool on feedback off echo off
spool /tmp/dirchk.tmp
select '[[ -d '||directory_path||' ]] || echo "'||directory_name||' missing ('||directory_path||')"' from dba_directories order by 1;
spool off

!ls -l /tmp/dirchk.tmp
!sed -e '1d;$d' /tmp/dirchk.tmp > /tmp/dirchk.sh
!chmod 700 /tmp/dirchk.sh
!/tmp/dirchk.sh

On Linux I could have used the sed “-i” switch to edit in place but this was AIX.
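
For clarity, the generated /tmp/dirchk.sh ends up containing one test per directory object, along the lines of the following (paths taken from the example output below):

[[ -d /home/nj ]] || echo "NJTEMP missing (/home/nj)"
[[ -d /mnt/app/files/response ]] || echo "RESPONSE_DIR missing (/mnt/app/files/response)"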

Example output:

NJTEMP missing (/home/nj)
RESPONSE_DIR missing (/mnt/app/files/response)
SCHEMA_DIR missing (/mnt/app/schema)
CONVERSION missing (/mnt/app/conversion)
FTP_DIR missing (/mnt/app/ftp_files)

UKOUG Database Server SIG (Leeds 2013)

On Thursday I attended the UKOUG Database Server SIG in Leeds. All slides have been uploaded to the UKOUG website.

https://www.ukoug.org/events/ukoug-database-server-sig-meeting-may-2013/

It’s the first SIG I’ve attended this year and after enjoying it so much I ought to try harder to get to other events. We had Oak Table presence from David Kurtz talking all things partitioning/compression and purging, two Oracle employees discussing first support and then ZFS/NetApp (I particularly enjoyed this one) and then Edgars Rudans on his evolution of the Exadata Minimal Patching process. All of these presentations are well worth downloading and checking out.

The last presentation of the day was me. I’ve never presented before and it took a big step out of my comfort zone to get up there but I’m so glad I did. I would recommend it to anyone currently spending time in the audience thinking “I wish I had the confidence to do that”. It’s nerve-racking beforehand but exhilarating afterwards.

When blogging in the past I’ve liked how it makes you think a little bit harder before pressing the publish button. I think the best thing I got out of the whole presentation process was that it made me dig even deeper to make sure I’d done my homework.

After the SIG there was a good group headed out for Leeds Oracle Beers #3 which involved local beer, good burgers and Morris Dancing, all good fun.

Advanced Compression: Visualising Insert Overhead

There is a My Oracle Support note, “Master Note for OLTP Compression [ID 1223705.1]”, which talks about the overheads of “COMPRESS FOR ALL OPERATIONS”. It’s a great note and, for a very artificial test, quotes overheads of around 25% CPU, 100% redo and 100% undo when inserting data via conventional methods. The purpose of this post is to visualise when the extra resources are used.

My test uses a SQL*Plus session that repeatedly inserts a single row into a table with “COMPRESS FOR ALL OPERATIONS” enabled. After each insert I use the SQL*Plus HOST command to capture redo and undo statistics for my session. I struggled to measure CPU for each insert. The “CPU used by this session” statistic was not granular enough for my single row inserts and the CPU tracked by SQL Trace was being rounded to 0ms or 1ms for each insert. However, the execute elapsed attribute from SQL Trace did give me good granularity. Obviously SQL Trace has its own overheads, as I’ve talked about before, but I don’t think they are important in this case as I’m just after a visualisation of when extra resources are used.

The shell and SQL scripts used to capture statistics are at the foot of this post.

My SQL*Plus control session looks a little like this:

sqlplus neil
create table compteststats
(	testid		varchar2(10)
,	snapid		number
,	statname	varchar2(64)
,	statval		number);

create table compobj
(	col1 varchar2(30)
,	col2 number
,	col3 varchar2(30)
,	col4 varchar2(30)
,	col5 number
,	col6 varchar2(30))
compress for all operations;

exec dbms_application_info.set_module('TESTSESS','TESTSESS');
exec dbms_monitor.session_trace_enable

--capture starting values
host ./inscompstat.sh TEST1

insert into compobj values (lpad('w',30,'w'),1,lpad('x',30,'x'),lpad('y',30,'y'),2,lpad('z',30,'z'));
host ./inscompstat.sh TEST1

insert into compobj values (lpad('w',30,'w'),1,lpad('x',30,'x'),lpad('y',30,'y'),2,lpad('z',30,'z'));
host ./inscompstat.sh TEST1

-- repeat above insert/host combination until the first block is full and we move to a second

exit

I was then able to report the redo and undo statistics using the SQL below:

select snapid row#
,	max(decode(statname,'redo size',delta,0)) redo_size
,	max(decode(statname,'undo change vector size',delta,0)) undo_size
,	max(decode(statname,'HSC OLTP positive compression',delta,0)) HSC_OLTP_positive_comp
,	max(decode(statname,'HSC OLTP Compressed Blocks',statval,0)) HSC_OLTP_Compressed_Blocks
from (
	select testid,snapid,statname, statval
	,statval-lag(statval) over (partition by testid,statname order by snapid) delta
	from compteststats
	where testid = 'TEST2'
	and snapid > 0
)
group by snapid
order by snapid;

And I captured the elapsed time for each insert from the SQL Trace file using the grep/awk combo below. I then combined these values with the redo/undo figures in a spreadsheet.

grep "^EXEC #139771649698992" orcl1_ora_6987.trc | awk -F[,=] '{print $4}'

The chart produced from the spreadsheet looks like this:

[Chart: vis-advcomp_1]

The number of inserts is plotted on the X axis and the Y axis is either the value from v$sesstat or elapsed microseconds from the trace file. The values are not important. What I found interesting was being able to visualise the life of the block. You can see the spikes in redo and undo each time the block is (re)compressed, and as the block gradually fills up the compressions become more frequent, until we get a lull as we start to fill a second data block. The last spike on the chart is the first compression of the second block.

Excerpts from my data are below. You can see the “HSC OLTP positive compression” statistic has a delta of 1 for each spike, demonstrating that the spike is caused by block compression.

      ROW#  REDO_SIZE  UNDO_SIZE HSC_OLTP_POSITIVE_COMP HSC_OLTP_COMPRESSED_BLOCKS
---------- ---------- ---------- ---------------------- --------------------------
         1        604        112                      0                          0
         2        432         68                      0                          0
         3        432         68                      0                          0
         4        432         68                      0                          0
         5        432         68                      0                          0
         6        432         68                      0                          0
         7        432         68                      0                          0
         8        432         68                      0                          0
...
        51        432         68                      0                          0
        52        432         68                      0                          0
        53       9092       8320                      1                          1
        54        420         68                      0                          1
        55        420         68                      0                          1
...
        97        420         68                      0                          1
        98        420         68                      0                          1
        99       9152       8320                      1                          1
       100        420         68                      0                          1
       101        420         68                      0                          1
...
       613        432         68                      0                          1
       614        432         68                      0                          1
       615       9092       8320                      1                          2
       616        420         68                      0                          2
       617        420         68                      0                          2

As stated previously – here are my scripts.

inscompstat.sh

sqlplus -S neil/neil <<SQLINP
@inscompstat $1
exit
SQLINP

inscompstat.sql

set verify off
insert into compteststats
(testid,snapid,statname,statval)
select '&1', c.maxsnap, s.name, s.value
from (
        select n.name, s.value
        from v$session u, v$sesstat s, v$statname n
        where u.sid = s.sid
        and s.statistic# = n.statistic#
        and u.module = 'TESTSESS'
        and n.name in ( 'redo size'
                ,       'undo change vector size'
                ,       'HSC OLTP positive compression'
                ,       'HSC OLTP Compressed Blocks')
        ) s
,       (select nvl(max(snapid),-1)+1 maxsnap
        from compteststats
        where testid = '&1') c
;

Compression Advisory: DBMS_COMPRESSION

I’ve been doing a lot of testing with COMPRESS FOR ALL OPERATIONS (OLTP compression) recently and spent a bit of time with the Compression Advisor. This post documents a few items I noticed along the way.

I picked up instructions on how to run the Compression Advisor from My Oracle Support note “Compression Advisory in 11GR2: Using DBMS_COMPRESSION [ID 762974.1]”. The note advises me to create a new tablespace and informs me that two temporary segments are created.

Please create a tablespace before running compression advisor as it creates two temporary tables and will do significant amount of I/O traffic against them.

Below is a sneak preview of the temporary segments that will be created.

create table "ACME".DBMS_TABCOMP_TEMP_UNCMP tablespace "SCRATCH" nologging
 as select /*+ DYNAMIC_SAMPLING(0) FULL("ACME"."ACCS") */ *
 from "ACME"."ACCS"  sample block( 99) mytab

create table "ACME".DBMS_TABCOMP_TEMP_CMP organization heap 
 tablespace "SCRATCH" compress for all operations nologging
 as select /*+ DYNAMIC_SAMPLING(0) */ *
 from "ACME".DBMS_TABCOMP_TEMP_UNCMP mytab

DBMS_COMPRESSION runs using invoker rights so the privileges of the user I log on as are relevant.

select procedure_name, authid 
from dba_procedures 
where owner = 'SYS' 
and object_name = 'DBMS_COMPRESSION';

PROCEDURE_NAME                 AUTHID
------------------------------ ------------
INCREMENTAL_COMPRESS           CURRENT_USER
GET_COMPRESSION_TYPE           CURRENT_USER
GET_COMPRESSION_RATIO          CURRENT_USER
                               CURRENT_USER

Taking all of the above into account I’ve created a new SCRATCH tablespace for my temporary segments and will run the analysis logged in as myself and not the schema user who owns the segment I want to compress. As we at ORAganism are so security minded ;-) I am not permitted to have the standard “dba” role and have a cut down one with only the permissions I need.

create tablespace scratch datafile '+data' size 50m;

create role nj_dba;
create user nj identified by nj;
grant create session, create any table, drop any table, select any table to nj_dba;
grant execute on sys.dbms_monitor to nj_dba;
grant nj_dba to nj;

Also according to note “ORA-01536: space quota exceeded for tablespace along with DBMS_COMPRESSION.GET_COMPRESSION_RATIO [ID 970324.1]” I need to give the owner of the segment being analysed a quota on my scratch tablespace.

To get rid of the errors, the user specified as the second parameter in the DBMS_COMPRESSION.GET_COMPRESSION_RATIO need to have … some QUOTA over the scratch tablespace specified by the first parameter.

Therefore:

alter user acme quota unlimited on scratch;

I have created my script to call the advisor, dbms_comp.sql, as below.

set serveroutput on
set feedback on
set verify off

declare
	blkcnt_cmp BINARY_integer;
	blkcnt_uncmp BINARY_integer;
	row_cmp BINARY_integer;
	row_uncmp BINARY_integer;
	cmp_ratio number;
	comptype_str varchar2(60);
begin
	dbms_compression.get_compression_ratio(
		scratchtbsname => upper('&3.')
	,	ownname => upper('&1.')
	,	tabname => upper('&2.')
	,	partname => null
	,	comptype => dbms_compression.comp_for_oltp
	,	blkcnt_cmp => blkcnt_cmp
	,	blkcnt_uncmp => blkcnt_uncmp
	,	row_cmp => row_cmp
	,	row_uncmp => row_uncmp
	,	cmp_ratio => cmp_ratio
	,	comptype_str => comptype_str
	,	subset_numrows => &4.
	);
	DBMS_OUTPUT.PUT_LINE('Block count compressed = ' || blkcnt_cmp);
	DBMS_OUTPUT.PUT_LINE('Block count uncompressed = ' || blkcnt_uncmp);
	DBMS_OUTPUT.PUT_LINE('Row count per block compressed = ' || row_cmp);
	DBMS_OUTPUT.PUT_LINE('Row count per block uncompressed = ' || row_uncmp);
	--DBMS_OUTPUT.PUT_LINE('Compression type = ' ||comptype_str);
	DBMS_OUTPUT.PUT_LINE('Compression ratio = '||round(blkcnt_uncmp/blkcnt_cmp,1)||' to 1');
	DBMS_OUTPUT.PUT_LINE('Compression % benefit = '||round((blkcnt_uncmp-blkcnt_cmp)/blkcnt_uncmp*100,1));
	--DBMS_OUTPUT.PUT_LINE('Compression ratio org= '||cmp_ratio);
end;
/
set verify on

So we should be good to go:

conn nj/nj
@orcl> @dbms_comp acme accs scratch 200000
declare
*
ERROR at line 1:
ORA-20000: Compression Advisor tablespace "SCRATCH" does not exist
ORA-06512: at "SYS.PRVT_COMPRESSION", line 776
ORA-06512: at "SYS.DBMS_COMPRESSION", line 214
ORA-06512: at line 9

With SQL Trace enabled I see the following in the trace file.

PARSING IN CURSOR #139655103675152 len=64 dep=1 uid=37 oct=3 lid=37 tim=1357852093619559 hv=814796271 ad='67de3b40' sqlid='g6pggnss91mgg'
SELECT count(*) FROM user_tablespaces WHERE tablespace_name= :b1
END OF STMT
...
BINDS #139655103675152:
 Bind#0
  ...
  value="SCRATCH"

Even though the temporary segments are created in the same schema as the object being analysed, my own account also needs a quota on the scratch tablespace. No segments are created in my schema but I need the quota in order to get past the check on “user_tablespaces”.

alter user nj quota unlimited on scratch;

Let’s try again:

@orcl> @dbms_comp acme accs scratch 200000
declare
*
ERROR at line 1:
ORA-01031: insufficient privileges
ORA-06512: at "SYS.PRVT_COMPRESSION", line 776
ORA-06512: at "SYS.DBMS_COMPRESSION", line 214
ORA-06512: at line 9

I couldn’t find anything about this in M.O.S or find any evidence from SQL Trace. However, the power of trial and error tracked the missing privilege down to “ANALYZE ANY” (I could have granted OEM_MONITOR to provide this privilege but we’re all about the principle of least privilege today).

grant ANALYZE ANY to NJ_DBA;

Let’s try again… again:

@orcl> @dbms_comp acme accs scratch 200000
Block count compressed = 1612
Block count uncompressed = 1632
Row count per block compressed = 123
Row count per block uncompressed = 121
Compression ratio = 1 to 1
Compression % benefit = 1.2

PL/SQL procedure successfully completed.

Elapsed: 00:00:09.82

Sadly my table is not going to benefit from compression but it’s a success nonetheless. So the full list of required privileges is:

grant CREATE SESSION to NJ_DBA;
grant ANALYZE ANY to NJ_DBA;
grant CREATE ANY TABLE to NJ_DBA;
grant DROP ANY TABLE to NJ_DBA;
grant SELECT ANY TABLE to NJ_DBA;

You also need a quota for your schema user and an unnecessary quota for the executing user.

If you use VPD then don’t forget to take that into account.

And one last point: the temporary tables are created in the target schema and their names are not session/object specific, so don’t try to run Compression Advisor calls in parallel:

@orcl> @dbms_comp acme accs scratch 200000
declare
*
ERROR at line 1:
ORA-00955: name is already used by an existing object
ORA-06512: at "SYS.PRVT_COMPRESSION", line 776
ORA-06512: at "SYS.DBMS_COMPRESSION", line 214
ORA-06512: at line 9
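
If a run does fail like this it is worth checking for, and tidying up, any leftover advisor segments before trying again. A sketch is below – the owner is the schema from my example, so adjust to suit:

select owner, table_name
from dba_tables
where table_name like 'DBMS_TABCOMP_TEMP%';

drop table acme.dbms_tabcomp_temp_uncmp purge;
drop table acme.dbms_tabcomp_temp_cmp purge;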

11.2.0.3 Client, Password Expiry and ORA-1017

I have had an intermittent issue at work recently where I am occasionally unable to change my password when it has expired.

From my weapon of choice, a DOS prompt/“sqlplus” on Windows, I enter my current password and am prompted to change it due to its expiry. I then enter a new password, one I am confident is consistent with the password policy, and get a message stating my username/password is incorrect. You can see example output below (and yes, I am absolutely certain I entered the same password twice :) ).

u:\sqls>sqlplus neiljohnson@db10g1

SQL*Plus: Release 11.2.0.3.0 Production on Wed Sep 19 09:36:05 2012

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Enter password:
ERROR:
ORA-28001: the password has expired


Changing password for neiljohnson
New password:
Retype new password:
ERROR:
ORA-01017: invalid username/password; logon denied


Password unchanged
Enter user-name:

I was unable to find a hit in M.O.S and settled on a workaround of changing my password on the Unix database server, silently hoping I wasn’t going mad. After a few more occurrences I resorted to Google and via the OTN forums found an M.O.S note detailing the problem.

ORA-1017 Trying To Change Expired Password Through Programmatic Interface From 11.2.0.3 To Lower Database Version [ID 1426233.1]

Below is the suggested solution from this note and I find it out of step with Oracle’s usual stance on these things:

Solution
There are no workarounds available to programatically changing the password, apart from;

a) Use an older client version, or
b) Change the user’s password via an ALTER USER statement from a different session.
c) Upgrade the database version to 11.2.0.3
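
Workaround (b) is at least straightforward; from any session with the ALTER USER privilege it is just the statement below (the username and new password are obviously illustrative):

alter user neiljohnson identified by Str0ngPassw0rd1;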

I’ve always been quite proud to tell my Sysadmin brethren that Oracle clients are backward compatible (and within reason forward compatible too). What we have here is the latest and greatest of the Oracle clients, 11.2.0.3, and a statement that I read as “this is not backward compatible – tough luck!”

I tried to diagnose this further with SQL Trace on the Oracle server process but I only captured the insert into sys.aud$. I also tried SQL*Net client tracing, and in the trace file I can see an error recorded in between the ORA-28001 and the ORA-01017. I’ve highlighted it in the extracts from the trace file below.

2012-09-19 10:44:33.304271 : nsbasic_brc:00 00 24 4F 52 41 2D 32  |..$ORA-2|
2012-09-19 10:44:33.304290 : nsbasic_brc:38 30 30 31 3A 20 74 68  |8001:.th|
2012-09-19 10:44:33.304309 : nsbasic_brc:65 20 70 61 73 73 77 6F  |e.passwo|
2012-09-19 10:44:33.304327 : nsbasic_brc:72 64 20 68 61 73 20 65  |rd.has.e|
2012-09-19 10:44:33.304346 : nsbasic_brc:78 70 69 72 65 64 0A     |xpired. |
2012-09-19 10:44:33.304364 : nsbasic_brc:exit: oln=0, dln=69, tot=79, rc=0
2012-09-19 10:44:33.304383 : nioqrc:exit
...
2012-09-19 10:45:59.358748 : nsbasic_brc:exit: oln=11, dln=1, tot=22, rc=0
2012-09-19 10:45:59.359278 : nioqrc: found a break marker...
2012-09-19 10:45:59.359797 : nioqrc: Recieve: returning error: 3111
2012-09-19 10:45:59.360333 : nioqrc:exit
2012-09-19 10:45:59.360834 : nioqrs:entry
...
2012-09-19 10:45:59.406650 : nsbasic_brc:00 00 33 4F 52 41 2D 30  |..3ORA-0|
2012-09-19 10:45:59.407173 : nsbasic_brc:31 30 31 37 3A 20 69 6E  |1017:.in|
2012-09-19 10:45:59.407702 : nsbasic_brc:76 61 6C 69 64 20 75 73  |valid.us|
2012-09-19 10:45:59.408227 : nsbasic_brc:65 72 6E 61 6D 65 2F 70  |ername/p|
2012-09-19 10:45:59.408753 : nsbasic_brc:61 73 73 77 6F 72 64 3B  |assword;|
2012-09-19 10:45:59.409280 : nsbasic_brc:20 6C 6F 67 6F 6E 20 64  |.logon.d|
2012-09-19 10:45:59.409820 : nsbasic_brc:65 6E 69 65 64 0A        |enied.  |

It seems from the M.O.S information below that ORA-03111 is not something I’m going to get very far with; in fact it’s probably a side effect of some other piece of code failing. Either way – it’s back to the workaround for me.

Error:  ORA 3111 
Text:   break received on communication channel 
-------------------------------------------------------------------------------
Cause:  A break was processed on the communications channel, but was not 
        properly handled by SQL*Net software.
        This is an internal error message not usually issued.
Action: Contact customer support.

Hopefully this post will save someone else some time if faced with the same issue; sadly I am not currently able to provide a solution.

Production Support Tips & Tricks #2 – SQL Trace

Part 1: Production Support Tips & Tricks #1 – Collecting Log Data

There’s no need to discuss SQL Trace other than the age-old concern – the overhead. Incident managers are risk averse and when you ask if you can trace a session on production you will always be asked what overhead it has and invariably you will say “erm… 5%… ish”. We can see from the Oracle documentation that Oracle’s stance on this has changed over the years:

9i docs:

“The additional overhead … is normally insignificant compared with the inherent overhead caused by the application’s inefficiency.”

11.2 docs:

“Using the SQL Trace facility can have a severe performance impact and may result in increased system overhead, excessive CPU usage, and inadequate disk space”

In this post I am going to attempt to assess the overhead of SQL Trace. My theory is that there are 2 potential factors.

  1. The overhead of the SQL Trace code. There must be extra code involved and it must take some amount of CPU to execute it.
  2. The overhead of writing trace files. Writing to a file cannot be without some overhead.

My cunning plan is to test queries in 3 states:

  • With SQL Trace disabled
  • With SQL Trace enabled
  • With SQL Trace enabled but redirected to /dev/null thus removing any overhead of writing to the trace file

The queries were tested in both IO-bound and CPU-bound states. The tests were done on a Linux laptop running Oracle XE 11g with a single hard drive, so the results are not “enterprise class” – your mileage may vary.

All of the code used can be found at these links; for brevity I have not included it inline – build-acme-schema.sql sqltrace-long1.sql sqltrace-long2.sql sqltrace-short1.sql sqltrace-short2.sql
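
In case anyone is wondering how a trace file ends up in /dev/null: one way of doing it (a sketch of the idea, not necessarily exactly what my test scripts do, and the trace file path shown is made up) is to find the session’s trace file name and replace it with a symlink to /dev/null before enabling trace.

-- in the session to be traced, find the trace file name
select value from v$diag_info where name = 'Default Trace File';

-- at the OS prompt, point that file at /dev/null
$ ln -sf /dev/null /u01/app/oracle/diag/rdbms/xe/XE/trace/XE_ora_12345.trc

-- back in the session, enable trace as normal
exec dbms_monitor.session_trace_enable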

Test #1 – IO bound queries with a long(ish) elapsed time

I say “long(ish)” as the queries only take a few seconds; the key thing is that the majority of the time is spent fetching data. I flush the buffer cache before every execution to make sure the queries are IO bound. The test code can be found in sqltrace-long1.sql.

The results are:

Test \ Execution                     1      2      3      4      5      6      Average  % Overhead
-----------------------------------  -----  -----  -----  -----  -----  -----  -------  ----------
Long IO bound no trace               5.45s  5.45s  5.52s  5.52s  5.45s  5.34s  5.45s    -
Long IO bound with trace (32KB)      6.90s  6.89s  6.94s  7.09s  6.84s  7.01s  6.94s    27.31%
Long IO bound with trace (to null)   6.89s  6.82s  6.89s  6.94s  6.78s  7.12s  6.90s    26.61%

I was really surprised by this. For a small number of queries generating small trace files, only 32KB, I saw a degradation of over 25% in run time with SQL Trace enabled. The minimal size of the trace files did nothing to prove my theory about the overhead of writing the trace file, but this was an eye-opener.

Test #2 – CPU bound queries with a long(ish) elapsed time

This test used the same queries as before but I removed the flush of the buffer cache from before each query. This removed all IO waits. The test code can be found in sqltrace-long2.sql.

Test \ Execution                     1      2      3      4      5      6      Average  % Overhead
-----------------------------------  -----  -----  -----  -----  -----  -----  -------  ----------
Long CPU bound no trace              3.79s  3.76s  3.80s  3.73s  3.71s  3.79s  3.76s    -
Long CPU bound with trace (11KB)     4.90s  4.99s  5.02s  5.00s  4.89s  4.87s  4.94s    31.39%
Long CPU bound with trace (to null)  4.95s  5.00s  5.05s  5.05s  4.90s  4.86s  4.96s    32.02%

Interesting – so by running the same workload but with no disk reads to slow it down the overhead is slightly more pronounced. But infrequent long running queries are not the only fruit.

Test #3 – IO bound frequent queries with a short elapsed time

This test uses very fast queries in a tight loop. The proportion of SQL executes to row fetches is very different to the previous tests. The buffer cache is flushed before each execution. The test code can be found in sqltrace-short1.sql.

Test \ Execution                      1       2       3       4       5       6       Average  % Overhead
------------------------------------  ------  ------  ------  ------  ------  ------  -------  ----------
Short IO bound no trace               43.99s  43.2s   39.44s  41.81s  39.00s  40.03s  41.24s   -
Short IO bound with trace (2.9MB)     41.35s  43.63s  40.70s  41.72s  43.24s  39.86s  41.75s   1.22%
Short IO bound with trace (to null)   41.12s  45.21s  41.50s  40.82s  42.44s  40.05s  41.85s   1.48%

So this is more like it – only around a 1.5% overhead.

Test #4 – CPU bound frequent queries with a short elapsed time

This test uses the same very fast queries in a tight loop from the previous test. The buffer cache flushes were removed and the number of iterations increased to get a longer elapsed time. The test code can be found in sqltrace-short2.sql.

Test \ Execution                      1      2      3      4      5      6      Average  % Overhead
------------------------------------  -----  -----  -----  -----  -----  -----  -------  ----------
Short CPU bound no trace              2.58s  2.46s  2.46s  2.63s  2.49s  2.47s  2.51s    -
Short CPU bound with trace (32MB)     7.55s  7.52s  7.57s  7.64s  7.43s  7.67s  7.56s    200.72%
Short CPU bound with trace (to null)  4.45s  4.57s  4.69s  4.63s  4.54s  4.62s  4.58s    82.23%

Ouch – a 200% overhead! Also the overhead of writing to the trace file was visible this time: SQL Trace overhead dropped from 200% to around 80% when the trace file was /dev/null.

Summary

I am almost reluctant to summarise as these are not the most scientific of tests, but the things I am taking away from this exercise are:

  • SQL workload that is CPU bound will see a dramatic degradation in elapsed time when SQL Trace is enabled – I’ll be wary of this in the future
  • SQL workload that is IO bound will see less of an impact from SQL Trace, but the impact could still be significant.
  • When trace file size is large then the speed of the trace file target filesystem is important. In these tests I used /dev/null which is of course cheating.
  • When IO response is important to a test remember to set filesystemio_options=setall. I wasted several evenings scratching my head at inconsistent, unpredictable test results.
  • Your mileage may vary

Feel free to point out mistakes in my train of thought or in the tests themselves. I’m not too happy with tests 1 & 2 – I’m not convinced they prove anything worthwhile. Tests 3 & 4 however have results I have found interesting and informative.

I think I’ll continue to tell Incident Managers that the overhead is around 5-10%, but only after I’ve given some careful thought to the criticality of the workload I want to trace.

Production Support Tips & Tricks #1 – Collecting Log Data

Early this year (2012) I started working on a presentation, which would have been my first, that I hoped to submit to UKOUG. The thrust of the presentation was to be tips on making your experiences with Oracle Support more pleasant, to help keep your support analyst busy rather than yourself. A prospective title was “with Support like this who needs enemies” – perhaps that’s a bit strong ;-). Several things conspired to make it unlikely I would get to present it, so I faltered and things ground to a halt. After a period of inactivity I have decided to convert it into a short series of blog posts. This is the first. Part 2 is here – “Production Support Tips & Tricks #2 – SQL Trace”.

This post contains some advice for collecting log data when raising SRs. It’s mostly obvious but hopefully not to all.

ADR Package

What?
You already know so I’m not going to waste my breath.

Why?
Get everything packaged up, not just the trace files you think Oracle need. Avoids repeat requests.

How?
Well covered by others so I’m not going near it:
John Hallas quality UKOUG presentation
Uwe Hesse’s super blog entry

Example?
Nah – see above

Trivia
Not related to diagnostic collection, but listener ADR homes don’t auto-purge, so your housekeeping scripts need to make calls to adrci to force a purge.

Diagcollection.sh for clusters

What?
diagcollection.sh is a script in your CRS home which collates all CRS related log files on the current cluster node.

Why?
It’s not easy manually collecting everything Oracle Support may require. This script makes it easy.

How?
Several options, you can check them with the “-h” option. Or just collect everything:

$ diagcollection.sh

Uncompressed, the resulting tar file can be very large:

-rw-r--r-- 1 grid oinstall 1.1G Feb 22 21:49 crsData_n02_20120222_2144.tar

Even compressed, the file can still be a lengthy upload to M.O.S (multiplied by the number of nodes):

-rw-r--r-- 1 grid oinstall 69M Feb 22 21:49 crsData_n02_20120222_2144.tar.gz

Trivia?
diagcollection.sh is just a wrapper for diagcollection.pl.

OS Watcher Black Box (OSWbb)

What?
A quote from the user guide:

a collection of UNIX shell scripts intended to collect and archive operating system and network metrics to aid support in diagnosing performance issues

Why?
“Because every vendor wants to blame another vendor and OSWbb helps that process”
or
“Because every issue is the fault of the database so you need ammunition to feed to your vendor”
or
“insert your own cynical quote here”

How?
Download from M.O.S – “OS Watcher Black Box User Guide [ID 301137.1]”. It is certified on AIX, Tru64, Solaris, HP-UX and Linux.

It is easy to run:

nohup ./startOSWbb.sh &

easy to stop

./stopOSWbb.sh

and easy to send

$ ./tarupfiles.sh
-rw-r--r-- 1 oracle oinstall 1.2M Feb  8 22:00 osw_archive_0208122216.tar.Z

Example?

You can install OSWbb as a Linux service – “How To Start OSWatcher Black Box Every System Boot [ID 580513.1]” – or use any scheduling tool. Alternatively you can control it via CRS; this way it is only active when the cluster is active, which has plus and minus points. For details of this see M.O.S note “Making Applications Highly Available Using Oracle Clusterware [ID 1105489.1]”.

To do it you need an action script; there is a perfectly good demo one in “$GRID_HOME/crs/demo”. Alternatively the one I use for testing at home can be found here – osw.scr (use at your peril).
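
As a rough idea of what an action script needs to handle, here is a minimal sketch – the OSWbb home, the process name pattern and the redirections are assumptions, so compare it with the demo script before using anything like it:

#!/bin/sh
# minimal illustrative CRS action script for OSWbb
# CRS calls it with start, stop, check or clean
OSW_HOME=/u01/app/oswbb

case "$1" in
start)
    cd $OSW_HOME && nohup ./startOSWbb.sh > /dev/null 2>&1 &
    exit 0
    ;;
stop|clean)
    cd $OSW_HOME && ./stopOSWbb.sh > /dev/null 2>&1
    exit 0
    ;;
check)
    # return 0 if the OSWbb collector is running, 1 if not
    ps -ef | grep -v grep | grep OSWatcher > /dev/null 2>&1 && exit 0
    exit 1
    ;;
esac
exit 0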

$GRID_HOME/bin/crsctl add resource osw -type ora.local_resource.type \
 -attr "AUTO_START=always,ACTION_SCRIPT=$GRID_HOME/crs/script/oswbb.scr"
$  $GRID_HOME/bin/crsctl status res osw
NAME=osw
TYPE=ora.local_resource.type
TARGET=ONLINE       , ONLINE
STATE=ONLINE on n01, ONLINE on n02

Trivia?
From “OS Watcher For Windows (OSWFW) User Guide [ID 433472.1]”:

OS Watcher for Windows is no longer supported.
It has been replace by the Cluster Health Monitor.

From “Cluster Health Monitor (CHM) FAQ [ID 1328466.1]“

Is the Cluster Health Monitor replacing OSWatcher?
…there [is] some information such as top, traceroute, and netstat that the Cluster Health Monitor does not collect, so running the Cluster Health Monitor while running OSWatcher is ideal. Both tools complement each other rather than supplement…

In my opinion another reason for still using OSWbb in spite of CHM is that CHM is very difficult to review yourself; it is also not yet the tool of choice for many within Oracle Support. OSWbb still has a place.

Quote from traceroute Unix man page by way of caveat:

Because of the load it could impose on the network, it is unwise to use traceroute during normal operations or from automated scripts.

Hmmmmm…..

“OS Watcher Black Box” was originally called “OS Watcher” but was renamed due to a clash of names with other unrelated, non-Oracle tool(s).

More to follow in the future

Mining the listener log with Perl

Update – 02-MAY-2012
I have had cause to use this script again at work so thought I’d develop it further. This post has been updated to reflect the new version.

Recently at work I was engaged by one of our application support teams because of application calls failing with “ORA-00020: maximum number of processes (300) exceeded”. We quickly identified that these errors were the symptom of another problem and not a concern in themselves; the real issue was slow database calls due to an infrastructure issue. The ORA-00020 errors arose because the application was spawning new connections to try to work around the slow calls. Now I don’t know much about application servers but I know this is not uncommon. The application team however could not find any indication of this in their logs so it was up to me to prove it.

As far as I am aware there are 2 options for retrospectively proving this at a granular level (please share any other options I’ve missed):

1) Query the database audit trail – DBA_AUDIT_SESSION
2) Mine the listener log

On this database no auditing was enabled so I was left with the listener log as my data source.

I knocked up a quick KSH/AWK script to get my data and then massaged it via various means. The output was quite pleasing, so over the recent long UK Easter holiday I Googled some Perl commands and had some fun with it; this post shares the script and an example chart from Microsoft Excel based on the incident I described above.
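
For the curious, the quick-and-dirty approach amounts to little more than counting “establish” entries per minute, something along the lines of the one-liner below (a rough sketch rather than the exact script I used):

$ awk '/ establish / { print $1, substr($2, 1, 5) }' listener.log | sort | uniq -c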

The script can be downloaded from – llog-miner.pl

Below is an example of how to call the script – it works on *nix and Windows. The “-f” parameter is used to filter log entries on the date; this example matches any minute between “11:00” and “12:59”. I’ve included a full usage map further down this post.

C:\>perl llog-miner.pl -excel -f "04-APR-2012 1[12]" listener.log  > llog.csv

This is the chart showing the logon spike. You can see the connection spike at 12:56 of over 250 connection attempts.

[Chart: Connection rate from listener log]

The script supports a few other options, such as allowing listener logs to be “piped” into it so that multiple files can be processed in one command. Usage notes are below.

C:\>perl llog-miner.pl -help

    Usage: llog-miner.pl [-help] [-filter <string>] [-[no]excel] [-[no]host] <listener_log_path|->
         : -excel|noexcel       : Output dates in Excel friendly format (default is noexcel)
         : -help                : Print this help
         : -host|nohost         : Include host on program (default is to include)
         : -filter <string>     : A regexp to match beginning of log line. e.g. "23-DEC-2012 0[12]" or "..-MAR-2013"
         : listener_log_path|-  : Input either a valid listener log name or "-" to read STDIN

    Examples:
         # Process listener.log looking for entries on the 21st and 22nd of Sept and format dates for Excel
         C:\>perl llog-miner.pl -filter "2[12]-SEP-2011" -excel listener.log

         # Process two listener log files looking for entries on Jan 3rd between 20:00 and 21:59 and exclude the hostname
         $ cat listener.log listener_db.log | llog-miner.pl -filter "03-JAN-2012 2[01]" -nohost -

An improvement I may make in the future is to expand minutes where there are no connection attempts. I have drafted some code to fill in gaps but have kept it to myself for now as it’s too ugly to share (at the moment).

I hope someone finds this useful and if not, well I found it entertaining if nothing else :-)

ADRCI and listener purging

I won’t go into any details on ADRCI as others have already done a better job than I could. The key piece of information for this post is that the MMON process shoulders the responsibility for ADR purging. So what if there isn’t an MMON process?

Note: All ADRCI “show control” output on this page has been edited for readability

Below are the ADR purge details for my local listener. These settings were defined using “set control (SHORTP_POLICY = 168, LONGP_POLICY = 336)”. “168” = 7 days and “336” = 14 days.

adrci
set home diag/tnslsnr/n01/listener
show control

ADRID      SHORTP_POLICY LONGP_POLICY LAST_MOD_TIME              LAST_AUTOPRG_TIME LAST_MANUPRG_TIME
---------- ------------- ------------ -------------------------- ----------------- ----------------------------
1794992421 168           336          2012-01-25 20:41:09.052526

adrci> host
$ date

Thu Feb 16 21:26:16 GMT 2012

$ cd  /u01/app/oracle/diag/tnslsnr/n01/listener/alert
$ ls -ltr

total 30260
-rw-r----- 1 grid oinstall 10485934 Jan 25 03:45 log_1.xml
-rw-r----- 1 grid oinstall 10485779 Feb  4 03:11 log_2.xml
-rw-r----- 1 grid oinstall  9954308 Feb 16 21:43 log.xml

Notice that the “LAST_AUTOPRG_TIME” column is blank. It looks like this system is not being purged automatically. Also notice that the date of this test was Feb 16th and the oldest file in the “alert” directory is dated “Jan 25”. According to M.O.S note “Which Files Are Part Of SHORTP_POLICY And LONGP_POLICY In ADR? [ID 975448.1]” the “alert” directory falls under LONGP_POLICY, which is 14 days, so the file “log_1.xml” should have been purged.

Below I force a manual purge

adrci> purge
adrci> host

$ cd  /u01/app/oracle/diag/tnslsnr/n01/listener/alert
$ ls -ltr

total 20004
-rw-r----- 1 grid oinstall 10485779 Feb  4 03:11 log_2.xml
-rw-r----- 1 grid oinstall  9960304 Feb 16 21:53 log.xml

It has now gone. And LAST_MANUPRG_TIME has been set to record the fact.

ADRID      SHORTP_POLICY LONGP_POLICY LAST_MOD_TIME              LAST_AUTOPRG_TIME LAST_MANUPRG_TIME
---------- ------------- ------------ -------------------------- ----------------- --------------------------
1794992421 168           336          2012-01-25 20:41:09.052526                   2012-02-16 21:53:51.761639

So my recommendation is that alongside ADR’s auto purging and your own maintenance of the alert.log & listener.log files in the “trace” directories, you should also routinely call something like the below:

for i in `adrci exec="show homes"|grep listener`;do
 echo "adrci exec=\"set home $i;purge\""
 adrci exec="set home $i;purge";
 # maybe check $? here
done
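
For a single known home the same thing can be done in one call, and the retention policies themselves can be set through adrci in the same way (using the listener home from this post as the example):

adrci exec="set home diag/tnslsnr/n01/listener; set control (SHORTP_POLICY = 168, LONGP_POLICY = 336)"
adrci exec="set home diag/tnslsnr/n01/listener; purge"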

I noticed this M.O.S note recently too. I’ve never seen the symptoms but it’s worth taking a look.

ADR Auto Purge Does Not Work On All RAC Instances, ASM and Standby Databases [ID 1326983.1]