ORAganism

Mining the listener log with Perl

Posted in Oracle by Neil Johnson on April 11, 2012

Update – 02-MAY-2012
I have had cause to use this script again at work so thought I’d develop it further. This post has been updated to reflect the new version.

Recently at work I was engaged by one of our application support teams because of application calls failing with “ORA-00020: maximum number of processes (300) exceeded”. We quickly identified that these errors were the symptom of another problem and not a concern in themselves, the real issue was slow database calls due to an infrastructure issue. The ORA-00020 errors were because the application was spawning new connections to try to work around the slow connections. Now I don’t know much about application servers but I know this is not uncommon. The application team however could not find any indication of this in their logs so it was up to me to prove it.

As far as I am aware there are 2 options for retrospectively proving this at a granular level (please share any other options I’ve missed):

1) Query the database audit trail – DBA_AUDIT_SESSION
2) Mine the listener log

On this database no auditing was enabled so I was left with the listener log as my data source.

I knocked up a quick KSH/AWK script to get my data and then massaged it via various means. The output was quite pleasing so over the recent long UK Easter holiday I Googled some Perl commands and had some fun with it, this post is to share the script and an example chart from Microsoft Excel based on the incident I described above.

The script can be downloaded from – llog-miner.pl

Below is an example of how to call the script – it works on *nix and Windows. The “-f” parameter is used to filter log entries on the date, this example is matching any minute between “11:00″ and “12:59″. I’ve included a full usage map further down this post.

C:\>perl llog-miner.pl -excel -f "04-APR-2012 1[12]" listener.log  > llog.csv

This is the chart showing the logon spike – click to enlarge. You can see the connection spike at 12:56 of over 250 connection attempts.

Connection rate from listener log

The script allows a few other options such as allowing listener logs to be “piped” into the script to allow multiple files to be processed in one command. Usage notes below.

C:\>perl llog-miner.pl -help

    Usage: llog-miner.pl [-help] [-filter <string>] [-[no]excel] [-[no]host] <listener_log_path|->
         : -excel|noexcel       : Output dates in Excel friendly format (default is noexcel)
         : -help                : Print this help
         : -host|nohost         : Include host on program (default is to include)
         : -filter <string>     : A regexp to match beginning of log line. e.g. "23-DEC-2012 0[12]" or "..-MAR-2013"
         : listener_log_path|-  : Input either a valid listener log name or "-" to read STDIN

    Examples:
         # Process listener.log looking for entries on the 21st and 22nd of Sept and format dates for Excel
         C:\>perl llog-miner.pl -filter "2[12]-SEP-2011" -excel listener.log

         # Process two listener log files looking for entries on Jan 3rd between 20:00 and 21:59 and exclude the hostname
         $ cat listener.log listener_db.log | llog-miner.pl -filter "03-JAN-2012 2[01]" -nohost -

An improvement I may make in the future is to expand minutes where there are no connection attempts. I have drafted some code to fill in gaps but have kept it to myself for now as it’s too ugly to share (at the moment).

I hope someone finds this useful and if not, well I found it entertaining if nothing else :-)

Tagged with: , ,

ADRCI and listener purging

Posted in Oracle by Neil Johnson on February 16, 2012

I won’t go into any details on ADRCI as others have already done a better job than I could. The key piece of information for this post is that the MMON process shoulders the responsibility for ADR purging. So what if there isn’t an MMON process?

Note: All ADRCI “show control” output on this page has been edited for readability

Below is the ADR purge details for my local listener. These settings were defined using “set control (SHORTP_POLICY = 168,LONGP_POLICY = 336)”. “168″ = 7 days and “336″ = 14 days.

adrci
set home diag/tnslsnr/n01/listener
show control

ADRID      SHORTP_POLICY LONGP_POLICY LAST_MOD_TIME              LAST_AUTOPRG_TIME LAST_MANUPRG_TIME
---------- ------------- ------------ -------------------------- ----------------- ----------------------------
1794992421 168           336          2012-01-25 20:41:09.052526

adrci> host
$ date

Thu Feb 16 21:26:16 GMT 2012

$ cd  /u01/app/oracle/diag/tnslsnr/n01/listener/alert
$ ls -ltr

total 30260
-rw-r----- 1 grid oinstall 10485934 Jan 25 03:45 log_1.xml
-rw-r----- 1 grid oinstall 10485779 Feb  4 03:11 log_2.xml
-rw-r----- 1 grid oinstall  9954308 Feb 16 21:43 log.xml

Notice that the “LAST_AUTOPRG_TIME” column is blank. It looks like this system is not being purged automatically. Also notice that the date of this test was Feb 16th and the oldest file in the “alert” directory is dated “Jan 25″. According to M.O.S note “Which Files Are Part Of SHORTP_POLICY And LONGP_POLICY In ADR? [ID 975448.1]” the “alert” directory is under LONGP_POLICY which is 14 days so the file “log_1.xml” should have been purged.

Below I force a manual purge

adrci> purge
adrci> host

$ cd  /u01/app/oracle/diag/tnslsnr/n01/listener/alert
$ ls -ltr

total 20004
-rw-r----- 1 grid oinstall 10485779 Feb  4 03:11 log_2.xml
-rw-r----- 1 grid oinstall  9960304 Feb 16 21:53 log.xml

It has now gone. And LAST_MANUPRG_TIME has been set to record the fact.

ADRID      SHORTP_POLICY LONGP_POLICY LAST_MOD_TIME              LAST_AUTOPRG_TIME LAST_MANUPRG_TIME
---------- ------------- ------------ -------------------------- ----------------- --------------------------
1794992421 168           336          2012-01-25 20:41:09.052526                   2012-02-16 21:53:51.761639

So my recommendation is that alongside ADR’s auto purging and your own maintenance on the alert.log & listener.log files in the “trace” directories you should also routinely call something like below:

for i in `adrci exec="show homes"|grep listener`;do
 echo "adrci exec=\"set home $i;purge\""
 adrci exec="set home $i;purge";
 # maybe check $? here
done

I noticed this M.O.S note recently too. I’ve never seen the symptoms but it’s worth taking a look.

ADR Auto Purge Does Not Work On All RAC Instances, ASM and Standby Databases [ID 1326983.1]

Tagged with: , , ,

Controlling The SCAN Listener Log Location

Posted in Oracle by Neil Johnson on January 23, 2012

I was lucky enough to attend UKOUG Conference 2011 last year. I saw several great presentations, one of which was a presentation on troubleshooting RAC by Julian Dyke. During this presentation Julian commented that we need to watch out for SCAN listeners having their ADR home inside the Grid Infrastructure home instead of with the Database listeners in the ADR home under $ORACLE_BASE.

Well this was one of the few things in his presentation I was fit to comment on so I tried to attract Julian’s attention by waving my arm, sadly my timid nature meant I went unnoticed. So this blog post is me going public with my attempted interruption.

First an example of how the log location for SCAN listeners typically differs to that of Database listeners. I personally do not like this behaviour as I like my Grid Infrastructure home to have resonably stable size, which it can have once the log directory’s rotation is mature, however the ADR home for listeners is not cleared down automatically so you can end up with a directory growing unchecked.

[grid@n01 n01]$ $GRID_HOME/bin/lsnrctl status LISTENER|grep Log
Listener Log File         /u01/app/oracle/diag/tnslsnr/n01/listener/alert/log.xml
[grid@n01 n01]$ $GRID_HOME/bin/lsnrctl status LISTENER_SCAN3|grep Log
Listener Log File         /u01/app/grid/11203/ghome_1/log/diag/tnslsnr/n01/listener_scan3/alert/log.xml

And we know from the Oracle documentation that ADR_BASE is defined as:

Purpose
To specify the base directory into which tracing and logging incidents are stored when ADR is enabled.

Default
The default on the server side is ORACLE_BASE, or ORACLE_HOME/log, if ORACLE_BASE is not defined.

So… let’s check ORACLE_BASE for CRS resources:

[grid@n01 ~]$ $GRID_HOME/bin/crsctl stat res ora.LISTENER.lsnr -p|grep -i base
USR_ORA_ENV=ORACLE_BASE=/u01/app/oracle
[grid@n01 ~]$ $GRID_HOME/bin/crsctl stat res ora.LISTENER_SCAN3.lsnr -p|grep -i base
[grid@n01 ~]$

Notice above that the Database listener has $ORACLE_BASE set and the Scan listener does not. Hence the log location.

This can be controlled for Database listeners by using the SETENV switch for SRVCTL but this does not work for Scan listeners.

[grid@n01 ~]$ $GRID_HOME/bin/srvctl setenv listener -l listener_scan1 -t "ORACLE_BASE=/u01/app/oracle"
PRKO-3162 : The actions setenv, getenv, and unsetenv are not suppported for Single Client Access Name listener listener_scan1
PRCN-2066 : Failed to retrieve cluster listener for listener_scan1 because it has the type of Single Client Access Name Listener

There is another way of setting environment variables for cluster resources and that is via crsctl modify resource.

[grid@n01 ~]$ $GRID_HOME/bin/crsctl modify resource ora.LISTENER_SCAN3.lsnr -attr USR_ORA_ENV=ORACLE_BASE=/u01/app/oracle
[grid@n01 ~]$ $GRID_HOME/bin/crsctl stat res ora.LISTENER_SCAN3.lsnr -p|grep -i base
USR_ORA_ENV=ORACLE_BASE=/u01/app/oracle
[grid@n01 ~]$ $GRID_HOME/bin/srvctl stop scan_listener -i 3
[grid@n01 ~]$ $GRID_HOME/bin/srvctl start scan_listener -i 3
[grid@n01 ~]$ lsnrctl status  listener_scan3 | grep Log
Listener Log File         /u01/app/oracle/diag/tnslsnr/n01/listener_scan3/alert/log.xml

To my mind the whole thing would seem a little cleaner if there was an “SRVCTL SETENV SCAN_LISTENER” command.

After writing this post I thought I’d check around for other information to support this post and, lo and behold, there is a My Oracle Support article covering almost exactly what I’ve written above. The article also talks about using the LISTENER.ORA attribute ADR_BASE_LISTENER_SCANn to control this.

Diagnostic Destination For Listeners Is Under Grid_home/Log Instead Of Oracle_base/Diag [ID 1269109.1]

A DBA_HIST_SQLSTAT query that I am very fond of

Posted in Oracle by Neil Johnson on December 14, 2011

This is a quick post to share a SQL statement I use a lot at work. The query mines the AWR tables (beware the licence implications) for a specific SQL ID and date/time range and shows a few choice statistics for each snapshot period.

awrsql.sql:

prompt enter start and end times in format DD-MON-YYYY [HH24:MI]
 
column sample_end format a21
select to_char(min(s.end_interval_time),'DD-MON-YYYY DY HH24:MI') sample_end
, q.sql_id
, q.plan_hash_value
, sum(q.EXECUTIONS_DELTA) executions
, round(sum(DISK_READS_delta)/greatest(sum(executions_delta),1),1) pio_per_exec
, round(sum(BUFFER_GETS_delta)/greatest(sum(executions_delta),1),1) lio_per_exec
, round((sum(ELAPSED_TIME_delta)/greatest(sum(executions_delta),1)/1000),1) msec_exec
from dba_hist_sqlstat q, dba_hist_snapshot s
where q.SQL_ID=trim('&sqlid.')
and s.snap_id = q.snap_id
and s.dbid = q.dbid
and s.instance_number = q.instance_number
and s.end_interval_time >= to_date(trim('&start_time.'),'dd-mon-yyyy hh24:mi')
and s.begin_interval_time <= to_date(trim('&end_time.'),'dd-mon-yyyy hh24:mi')
and substr(to_char(s.end_interval_time,'DD-MON-YYYY DY HH24:MI'),13,2) like '%&hr24_filter.%'
group by s.snap_id
, q.sql_id
, q.plan_hash_value
order by s.snap_id, q.sql_id, q.plan_hash_value
/

Nothing ground breaking and I’m sure many will have a similar script.

Below I have example output showing one of the scripts many successful outings, quite a dramatic plan change I’m sure you’ll agree.

SQL> @awrsql
enter start and end times in format DD-MON-YYYY [HH24:MI]
Enter value for sqlid: 1jjpo2i4b313g
Enter value for start_time: 15-NOV-2011
Enter value for end_time: 21-NOV-2011 13:00
Enter value for hr24_filter:

SAMPLE_END            SQL_ID        PLAN_HASH_VALUE EXECUTIONS PIO_PER_EXEC LIO_PER_EXEC  MSEC_EXEC
--------------------- ------------- --------------- ---------- ------------ ------------ ----------
15-nov-2011 TUE 08:00 1jjpo2i4b313g      3133159894     129629            0            5          0
16-nov-2011 WED 08:01 1jjpo2i4b313g      3133159894     115003            0            5         .1
17-nov-2011 THU 08:01 1jjpo2i4b313g      3133159894     115741            0            5          0
18-nov-2011 FRI 07:00 1jjpo2i4b313g      3133159894      30997            0            5         .1
18-nov-2011 FRI 08:00 1jjpo2i4b313g      3133159894      81034            0            5          0
21-nov-2011 MON 00:00 1jjpo2i4b313g       790865878         16     323091.6     323128.3    36905.8
21-nov-2011 MON 01:00 1jjpo2i4b313g       790865878         29     349676.2     349713.7    48387.2
21-nov-2011 MON 02:00 1jjpo2i4b313g       790865878         35     339474.6     339509.2    34057.7
21-nov-2011 MON 03:00 1jjpo2i4b313g       790865878         37     340934.6     340970.2    35899.4
21-nov-2011 MON 04:01 1jjpo2i4b313g       790865878         38     333469.1     333503.9    35450.8
21-nov-2011 MON 05:00 1jjpo2i4b313g       790865878         35     347559.3     347595.2    35231.8
21-nov-2011 MON 06:00 1jjpo2i4b313g       790865878         32     340224.8       340260    35208.3

I also like to use the query to track number of executions or LIO per execution over a longer time frame to see if either the frequency or individual impact of the SQL is changing over time. I can use the “hr24_filter” variable to do this, for example showing me all snapshots for hour “13″ over a whole month.

Tagged with: , ,

Ad-hoc Users and Undo Usage

Posted in Oracle by Neil Johnson on September 28, 2011

Here is a quote from the Oracle Documentation about V$UNDOSTAT.MAXQUERYLEN:

MAXQUERYLEN NUMBER Identifies the length of the longest query (in seconds) executed in the instance during the period. You can use this statistic to estimate the proper setting of the UNDO_RETENTION initialization parameter. The length of a query is measured from the cursor open time to the last fetch/execute time of the cursor. Only the length of those cursors that have been fetched/executed during the period are reflected in the view.

And here is a quote from My Oracle Support note “V$UNDOSTAT MAXQUERYLEN Suddenly Appears With a High Value For a Simple Query [ID 1307600.1]“

the MAXQUERYLEN value is defined as “the length of a query measured from the cursor open time to the last fetch/execute time of the cursor”. Hence if the cursor is suspended (e.g. dbms_lock.sleep) or the session is switched out (session switching) then the MAXQUERYLEN time will continue to increase but will not be show as no current execution/fetch is occurring on the cursor.

This is expected behaviour

Nothing wrong with any of this but it is all relevant to the test that follows. In order to set up the test I change my local database to have a small undo tablespace and generate some workload. I’ve collapsed the code in case you want to jump straight to the story, in summary it generates some transactions and random queries with a duration between 1 and 10 minutes.

alter system set undo_tablespace = undo scope=both sid='*';

create undo tablespace undo_small 
datafile 'C:\ORACLEXE\ORADATA\XE\UNDO_SMALL.DBF' size 30m reuse autoextend on maxsize 500m;

alter system set undo_tablespace = undo_small scope=both sid='*';

create or replace function slow_query(i_seconds_to_wait in number)
	return number
as
	l_n number;
	l_pad2 bigtab.pad2%TYPE;

	cursor c_bigtab is
		select pad2 from bigtab; 
begin
	open c_bigtab;
	fetch c_bigtab into l_pad2;

	sys.dbms_lock.sleep(i_seconds_to_wait);
	
	fetch c_bigtab into l_pad2;
	close c_bigtab;

	return i_seconds_to_wait;
end slow_query;
/

-- session #1 - generate trxns

declare
	L_rno number;
	L_i number;
	L_pad1 bigtab.pad1%TYPE;
begin
	while true loop
		L_i := L_i + 1;
		L_rno := round(dbms_random.value(1,1000000));

		update bigtab
		set pad1 = pad1 where rno = L_rno;

		commit;

		dbms_lock.sleep(0.01);
	end loop;
end;
/

-- session #2 - generate queries of random length betwen 1 min and 10 mins

declare
	l_sleep number;
	l_slept number;
begin
	while true loop
		l_sleep := round(dbms_random.value(60,600));

		select slow_query(l_sleep) into l_slept from dual;
	end loop;
end;
/

Output from V$UNDOSTAT after the background workload has been running for a while is below along with the current size of the undo tablespace.

END_TIME_CHAR          TXNCOUNT MAXQUERYLEN MAXQUERYID    ACTIVEBLKS UNEXPIREDBLKS EXPIREDBLKS TUNED_UNDORETENTION
-------------------- ---------- ----------- ------------- ---------- ------------- ----------- -------------------
28-SEP-2011 19:45:50      38118         584 200k2np23z57q        160           912         496             1425
28-SEP-2011 19:55:50      38404         452 200k2np23z57q        160          1120        1280             1292
28-SEP-2011 20:05:50      38507         479 200k2np23z57q        160          1704        1464             1320
28-SEP-2011 20:15:50      38340         383 200k2np23z57q        160          2656         496             1224

   FILE_ID FILE_NAME                                             MB
---------- --------------------------------------------- ----------
         5 C:\ORACLEXE\ORADATA\XE\UNDO_SMALL.DBF                 30

Now a simple query is executed from SQL Developer and the first set of rows is fetched. V$SESSION output below.

SQL> select username,program,status,state,event,sql_id,seconds_in_wait siw
  2  from v$session
  3  where username = 'ADHOC';

USERNAME PROGRAM         STATUS   STATE               EVENT                       SQL_ID               SIW
-------- --------------- -------- ------------------- --------------------------- ------------- ----------
ADHOC    SQL Developer   INACTIVE WAITING             SQL*Net message from client 6fwqzurbc8y7k       7

SQL> select sql_text from v$sql where sql_id = '6fwqzurbc8y7k';

SQL_TEXT
-----------------------------------------------------------------------
select * from bigtab

Another quote from the Oracle Documentation

SQL_ID VARCHAR2(13) SQL identifier of the SQL statement that is currently being executed

You can see from the V$SESSION output above that we have a SQL ID and therefore must be executing some SQL. But we also have a session status of “INACTIVE” and are waiting on event “SQL*Net message from client”. That is a nice way of spotting sessions that are similar to the focus of this post.

Around 15 minutes later the session is still waiting on the user to act and V$UNDOSTAT looks consistent with my steady state.

USERNAME PROGRAM         STATUS   STATE               EVENT                       SQL_ID               SIW
-------- --------------- -------- ------------------- --------------------------- ------------- ----------
ADHOC    SQL Developer   INACTIVE WAITING             SQL*Net message from client 6fwqzurbc8y7k        912

END_TIME_CHAR          TXNCOUNT MAXQUERYLEN MAXQUERYID    ACTIVEBLKS UNEXPIREDBLKS EXPIREDBLKS TUNED_UNDORETENTION
-------------------- ---------- ----------- ------------- ---------- ------------- ----------- -------------------
28-SEP-2011 19:45:50      38118         584 200k2np23z57q        160           912         496             1425
28-SEP-2011 19:55:50      38404         452 200k2np23z57q        160          1120        1280             1292
28-SEP-2011 20:05:50      38507         479 200k2np23z57q        160          1704        1464             1320
28-SEP-2011 20:15:50      38340         383 200k2np23z57q        160          2656         496             1224
28-SEP-2011 20:25:50      38557         985 200k2np23z57q        160          2592         816             1825
28-SEP-2011 20:31:28      20492         866 200k2np23z57q        160          2592         816             1706

We now fetch the next set of records and check V$SESSION and V$UNDOSTAT again.

USERNAME PROGRAM         STATUS   STATE               EVENT                       SQL_ID               SIW
-------- --------------- -------- ------------------- --------------------------- ------------- ----------
ADHOC    SQL Developer   INACTIVE WAITING             SQL*Net message from client 6fwqzurbc8y7k      35

END_TIME_CHAR          TXNCOUNT MAXQUERYLEN MAXQUERYID    ACTIVEBLKS UNEXPIREDBLKS EXPIREDBLKS TUNED_UNDORETENTION
-------------------- ---------- ----------- ------------- ---------- ------------- ----------- -------------------
28-SEP-2011 19:45:50      38118         584 200k2np23z57q        160           912         496             1425
28-SEP-2011 19:55:50      38404         452 200k2np23z57q        160          1120        1280             1292
28-SEP-2011 20:05:50      38507         479 200k2np23z57q        160          1704        1464             1320
28-SEP-2011 20:15:50      38340         383 200k2np23z57q        160          2656         496             1224
28-SEP-2011 20:25:50      38557         985 200k2np23z57q        160          2592         816             1825
28-SEP-2011 20:32:49      26370         992 6fwqzurbc8y7k        160          2592         816             1766

V$UNDOSTAT.MAXQUERYID is now that of the SQL Developer session. What if an extra 10 minutes had elapsed?

10 minutes later…

END_TIME_CHAR          TXNCOUNT MAXQUERYLEN MAXQUERYID    ACTIVEBLKS UNEXPIREDBLKS EXPIREDBLKS TUNED_UNDORETENTION
-------------------- ---------- ----------- ------------- ---------- ------------- ----------- -------------------
28-SEP-2011 19:45:50      38118         584 200k2np23z57q        160           912         496             1425
28-SEP-2011 19:55:50      38404         452 200k2np23z57q        160          1120        1280             1292
28-SEP-2011 20:05:50      38507         479 200k2np23z57q        160          1704        1464             1320
28-SEP-2011 20:15:50      38340         383 200k2np23z57q        160          2656         496             1224
28-SEP-2011 20:25:50      38557         985 200k2np23z57q        160          2592         816             1825
28-SEP-2011 20:35:50      41787         992 6fwqzurbc8y7k        160          3440         688             1833
28-SEP-2011 20:41:31      20088        1293 6fwqzurbc8y7k        160          3440         688             2134

The TUNED_UNDORETENTION is starting to climb but the number of expired blocks is unchanged because our UNDO tablespace is extending – it is now at 42M.

   FILE_ID FILE_NAME                                             MB
---------- --------------------------------------------- ----------
         5 C:\ORACLEXE\ORADATA\XE\UNDO_SMALL.DBF                 42

So let’s prevent it from extending and wait another 10 minutes before fetching the next set of rows.

alter database datafile 5 autoextend off;

END_TIME_CHAR          TXNCOUNT MAXQUERYLEN MAXQUERYID    ACTIVEBLKS UNEXPIREDBLKS EXPIREDBLKS TUNED_UNDORETENTION
-------------------- ---------- ----------- ------------- ---------- ------------- ----------- -------------------
28-SEP-2011 20:05:50      38507         479 200k2np23z57q        160          1704        1464             1320
28-SEP-2011 20:15:50      38340         383 200k2np23z57q        160          2656         496             1224
28-SEP-2011 20:25:50      38557         985 200k2np23z57q        160          2592         816             1825
28-SEP-2011 20:35:50      41787         992 6fwqzurbc8y7k        160          3440         688             1833
28-SEP-2011 20:45:50      41534        1594 6fwqzurbc8y7k        160          4608         576             2752
28-SEP-2011 20:55:50      44669        2196 6fwqzurbc8y7k        160          5200         264             2766

We can see that the number of expired blocks is falling and the number of un-expired blocks is rising. 10 minutes later I click to fetch the next set of records and…


ORA-01555: snapshot too old: rollback segment number 17 with name "_SYSSMU17$" too small
01555. 00000 -  "snapshot too old: rollback segment number %s with name \"%s\" too small"
*Cause:    rollback records needed by a reader for consistent read are
           overwritten by other writers
*Action:   If in Automatic Undo Management mode, increase undo_retention
           setting. Otherwise, use larger rollback segments

That was only after 35-40 minutes – well within reach of a normal trip to the canteen (admittedly my undo tablespace is sized modestly in order to help me prove a point).

So, here, in reverse order, are my top 3 undo related threats when giving read only access via a tool such as SQL Developer or Toad to application support staff (even though they have the best of intentions).

#3 Undo tablespace pressure.
Unbeknownst to the ad-hoc user he can quite easily cause pressure on the undo tablespace causing an AUTOEXTEND file to grow purely to satisfy their next page of data.

#2 Undo tablespace alerts
For undo tablespaces with no AUTOEXTEND files Grid Control or some other monitoring tool will alert as the number of unexpired blocks starts to fall causing unnecessary (albeit low priority) work for your first line DBAs.

#1 ORA-1555
For undo tablespaces with no AUTOEXTEND files and either a sufficiently high transaction rate or sufficiently high gap between page 1 and page X of our ad-hoc user’s query we may find that the undo required to construct page X has been overwritten. The user gets ORA-1555, thinks “no harm done” and re-starts the query. However the alert log now contains something like below and an alert is fired shortly afterwards.

Wed Sep 28 21:14:48 2011
ORA-01555 caused by SQL statement below (SQL ID: 6fwqzurbc8y7k, Query Duration=3538 sec, SCN: 0x0000.001aad4e):

If the hour is late and the company is pretty twitchy then this could involve someone getting out of bed to investigate a potentially failed batch run.

Have I missed any undo related threats?

DBMS_STATS And Direct Path Operations

Posted in Oracle by Neil Johnson on July 28, 2011

Continuing my theme of (almost pointless) testing that DBMS_STATS does as you would hope, this post proves the answer to the question below that was posed at my work place.

Do direct path loads count as modifications for DBMS_STATS or do they bypass it?

My initial instinct was “of course it catches them” but I stopped myself. I’ve never actually seen it so it would be improper to say it out loud. Better dash home and test it (“dash” = 1 month later in this case).

This test is in from Database 11.2 and has been verified on 10.2 also. It has also been tested and tweaked thanks to Surachart’s input – see the first comment on the post. I thought I’d unearthed an interesting difference between 10g and 11g – it turns out I’d only discovered that ARCHIVELOG is not the same as NOARCHIVELOG – doh!

Create a test table, populate it and check that the modifications are caught.

DROP TABLE tab1;
CREATE TABLE tab1
(	col1 NUMBER
,	col2 VARCHAR2(1));

INSERT INTO tab1
SELECT	ROWNUM
,	'Y'
FROM	dual
CONNECT BY ROWNUM < 1000;
COMMIT;

exec DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO

column table_name format a10
select table_name,INSERTS,UPDATES,DELETES 
from USER_TAB_MODIFICATIONS
where table_name = 'TAB1';

TABLE_NAME    INSERTS    UPDATES    DELETES
---------- ---------- ---------- ----------
TAB1              999          0          0

All good. Now let’s try a direct path load.

ALTER TABLE tab1 NOLOGGING;

INSERT /*+ APPEND */ INTO tab1
SELECT	ROWNUM
,	'Y'
FROM	dual
CONNECT BY ROWNUM < 1000;
COMMIT;

exec DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO

column table_name format a10
select table_name,INSERTS,UPDATES,DELETES 
from USER_TAB_MODIFICATIONS
where table_name = 'TAB1';

TABLE_NAME    INSERTS    UPDATES    DELETES
---------- ---------- ---------- ----------
TAB1             1998          0          0

Looking good but the self doubter in me is asking:

How can you prove that was a direct path load?

Like this hopefully.

sqlplus /

ALTER TABLE tab1 LOGGING;

INSERT INTO tab1
SELECT	ROWNUM
,	'Y'
FROM	dual
CONNECT BY ROWNUM < 1000;
COMMIT;

@mystat
Enter value for namefilter: redo size

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
redo size                                                             19712

And…

sqlplus /

ALTER TABLE tab1 NOLOGGING;

INSERT /*+ APPEND */ INTO tab1
SELECT	ROWNUM
,	'Y'
FROM	dual
CONNECT BY ROWNUM < 1000;
COMMIT;

@mystat
Enter value for snamefilter: redo size

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
redo size for direct writes                                              52
redo size                                                              4500

All good, but…

What about DDL?

…shouts my annoying inner self.

TRUNCATE TABLE tab1;

exec DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO

column table_name format a10
select table_name,INSERTS,UPDATES,DELETES 
from USER_TAB_MODIFICATIONS
where table_name = 'TAB1';

TABLE_NAME    INSERTS    UPDATES    DELETES
---------- ---------- ---------- ----------
TAB1             3996          0       3996

OK, so both my annoying inner voice and I am happy that direct writes are indeed caught by DBMS_STATS. I know this is as expected but heck, sometimes it’s good to see it in black and white.

Tagged with: , ,

Interesting Change in Database Listener Behaviour at 11g

Posted in Oracle by Neil Johnson on June 21, 2011

There is a small yet significant improvement in the implementation of TNSLSNR at 11g around the way it writes to the listener log files, both “log.xml” and “listener.log”. First though I’ll describe some 10g listener behaviour that anyone who’s written their own script for recycling the log file will have noticed. I should also mention this is all Unix based – I’m not aware if the same issues exist in a Windows environment.

Below I check the listener log file and then rename it “listener.log.old”.

$ cd /u01/app/oracle/product/10.2.0/db_1/network/log
$ ls -l listener*
-rw-r----- 1 oracle oinstall 3715 Jun 19 22:33 listener.log
$ mv listener.log listener.log.old

Next I connect to a database via the listener and check for files starting with “listener” in the log location.

$ sqlplus a/a@orcl
SQL> exit

$ ls -l listener*
-rw-r----- 1 oracle oinstall 3964 Jun 19 22:34 listener.log.old

$ tail -1 listener.log.old
19-JUN-2011 22:34:48 * (CONNECT_DATA=(SERVICE_NAME=orcl)(CID=(PROGRAM=sqlplus)(HOST=vb-centos-10.2-a)(USER=oracle))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.105)(PORT=24980)) * establish * orcl * 0

There is no new “listener.log” and my recent connection was logged in the “listener.log.old” file.

I can go one step further and remove the file while the listener is running and while I am “tailing” the file.

### session 1
$ tail -f listener.log.old

### session 2
$ rm listener.log.old
$ sqlplus a/a@orcl

### session 1
19-JUN-2011 22:38:12 * (CONNECT_DATA=(SERVICE_NAME=orcl)(CID=(PROGRAM=sqlplus)(HOST=vb-centos-10.2-a)(USER=oracle))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.105)(PORT=25010)) * establish * orcl * 0

Log entries are still going to the old log file even though the file is not visible from the operating system.

$ ls -l listener*
ls: listener*: No such file or directory

If I left it like this I could end up with a huge listener.log file that is essentially invisible. If ever you have “du” and “df” output that disagree by a huge amount in an Oracle home filesystem then this may be why. You should never see it in the ADR though as that would imply 11g. Before we go on to that we’ll restart the listener logging so the correct file is used (there are many ways to do this by the way).

$ lsnrctl

LSNRCTL for Linux: Version 10.2.0.4.0 - Production on 19-JUN-2011 22:44:43

Copyright (c) 1991, 2007, Oracle.  All rights reserved.

Welcome to LSNRCTL, type "help" for information.

LSNRCTL> set current_listener listener
Current Listener is listener
LSNRCTL> set log_status off
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=vb-centos-10.2-a)(PORT=1521)))
listener parameter "log_status" set to OFF
The command completed successfully
LSNRCTL> set log_status on
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=vb-centos-10.2-a)(PORT=1521)))
listener parameter "log_status" set to ON
The command completed successfully

Now let’s demonstrate what happens if we repeat this test at 11g (11.2 in my case but the behaviour is the same at 11.1).

$ cd /u01/app/oracle/diag/tnslsnr/vb-centos-11/listener/trace
$ ls -l listener*
-rw-r----- 1 oracle oinstall 31486 Jun 16 18:36 listener.log
$ mv listener.log listener.log.old

$ sqlplus  a/a@orcl
SQL> exit

$ ls -l listener*
-rw-r----- 1 oracle oinstall 243 Jun 16 18:36 listener.log
-rw-r----- 1 oracle oinstall 31486 Jun 16 18:36 listener.log.old

$ cat listener.log
Thu Jun 16 18:36:56 2011
16-JUN-2011 18:36:56 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=orcl)(CID=(PROGRAM=sqlplus)(HOST=vb-centos-11.2-a)(USER=oracle))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.102)(PORT=49695)) * establish * orcl * 0

So this time the listener has recreated “listener.log” file and not continued to write to the renamed log file. If we use the strace utility we can see why. An extract from the 11g output is below. Notice the “open” commands for the “log.xml” and “listener.log” files.

lstat64("/u01/app/oracle/diag/tnslsnr/vb-centos-11/listener/alert/log.xml", {st_mode=S_IFREG|0640, st_size=90949, ...}) = 0
open("/u01/app/oracle/diag/tnslsnr/vb-centos-11/listener/alert/log.xml", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0660) = 15
fcntl64(15, F_SETFD, FD_CLOEXEC)        = 0
write(15, "<msg time='2011-06-16T18:12:17.1"..., 236) = 236
close(15)                               = 0
stat64("/u01/app/oracle/diag/tnslsnr/vb-centos-11/listener/alert/log.xml", {st_mode=S_IFREG|0640, st_size=91185, ...}) = 0
times({tms_utime=118, tms_stime=1077, tms_cutime=0, tms_cstime=0}) = 434969334
lstat64("/u01/app/oracle/diag/tnslsnr/vb-centos-11/listener/trace/listener.log", {st_mode=S_IFREG|0640, st_size=1021, ...}) = 0
open("/u01/app/oracle/diag/tnslsnr/vb-centos-11/listener/trace/listener.log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0660) = 15
fcntl64(15, F_SETFD, FD_CLOEXEC)        = 0
write(15, "Thu Jun 16 18:12:17 2011\n", 25) = 25

And below an extract from tracing a 10g listener.

_llseek(3, 0, [4511], SEEK_CUR)         = 0
write(3, "19-JUN-2011 22:41:01 * service_u"..., 49) = 49

This time we do not get an open for each write – we only see the file reopened if we restart logging.

open("/u01/app/oracle/product/10.2.0/db_1/network/log/listener.log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 3

Let’s hope the next improvement is to take the cycling of the trace “listener.log” file (not xml version) into the code.

Tagged with: , ,

YAPO-4068: Yet Another Post on ORA-04068

Posted in Oracle by Neil Johnson on May 11, 2011

I’ve had discussions with developers over the last few months about how to deal with the dreaded ORA-04068.

oerr ora 04068
04068, 00000, "existing state of packages%s%s%s has been discarded"
// *Cause:  One of errors 4060 - 4067 when attempt to execute a stored
//          procedure.
// *Action: Try again after proper re-initialization of any application's
//          state.

We talked about using the SERIALLY_REUSABLE pragma which is a great solution and is well covered by the post below from Lauren Schneider so I won’t go over it here.

How to solve ORA-4068

This is a good way of protecting your application from ORA-04068 by changing the source package. But what if you have a package where sometimes the package state needs to be carried forward to a subsequent call and sometimes not?

You can control this behaviour from your calling code also, using the Oracle supplied procedure DBMS_SESSION.MODIFY_PACKAGE_STATE (or DBMS_SESSION.RESET_PACKAGE). These are documented in the Oracle Docs here – MODIFY_PACKAGE_STATE Procedure

An example using 2 different database sessions is below. First we demonstrate hitting the ORA-04068 exception:

-- session 1
-- create a test package

create or replace package mypkg as
	G_var number := 1;

	function retvar return number;
end mypkg;
/
create or replace package body mypkg as
	function retvar return number is
	begin
		return mypkg.G_var;
	end retvar;
end mypkg;
/

-- test a call to the package
declare
	L_num number;
begin
	L_num := mypkg.retvar;
end;
/

PL/SQL procedure successfully completed.


-- session 2
-- alter the package

alter package mypkg compile;



-- session 1
-- when we come back to session 1 we have lost our package state

declare
	L_num number;
begin
	L_num := mypkg.retvar;
end;
/

*
ERROR at line 1:
ORA-04068: existing state of packages has been discarded
ORA-04061: existing state of package "NEIL.MYPKG" has been invalidated
ORA-04065: not executed, altered or dropped package "NEIL.MYPKG"
ORA-06508: PL/SQL: could not find program unit being called: "NEIL.MYPKG"
ORA-06512: at line 4

Now we repeat the process but use a call to DBMS_SESSION.REINITIALIZE in order to protect ourselves. We are safe to do this as we know we do not want to carry forward the package state.

-- session 1

declare
	L_num number;
begin
	L_num := mypkg.retvar;
end;
/

PL/SQL procedure successfully completed.


-- session 2
-- alter the package

alter package mypkg compile;


-- session 1
-- we no longer hit ORA-04068

exec dbms_session.modify_package_state(dbms_session.reinitialize)

PL/SQL procedure successfully completed.

declare
        L_num number;
begin
        L_num := mypkg.retvar;
end;
/

PL/SQL procedure successfully completed.


So we can control this behaviour from the source package and from the calling code. Super.

Before posting this I did a quick Google to see if I was duplicating anything (I should really have done this first) and came across this post by Eddie Awads.

Here is How to Unpersist Your Persistent PL/SQL Package Data

This talks about the DBMS_SESSION.REINITIALIZE procedure used in this post and has a good comment from Paweł Barut regarding using this method in web applications.

DBMS_STATS.SET_TABLE_STATS and statistic staleness

Posted in Oracle by Neil Johnson on March 21, 2011

Recently at work there was a discussion regarding setting optimiser statistics using DBMS_STATS SET_*_STATISTICS procedures. The question was asked:

If we set the statistics ourselves will the automatic stat’s job not overwrite them the following night?

Now I was confident that this was not the case but, having never tested this explicitly, thought it better to get my facts straight.

First some quotes from the Oracle Documentation

Setting Statistics

You can set table, column, index, and system statistics using the SET_*_STATISTICS procedures. Setting statistics in the manner is not recommended, because inaccurate or inconsistent statistics can lead to poor performance.

Determining Stale Statistics

Monitoring tracks the approximate number of INSERTs, UPDATEs, and DELETEs for that table and whether the table has been truncated since the last time statistics were gathered. You can access information about changes of tables in the USER_TAB_MODIFICATIONS view. Following a data-modification, there may be a few minutes delay while Oracle Database propagates the information to this view. Use the DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO procedure to immediately reflect the outstanding monitored information kept in the memory.

The GATHER_DATABASE_STATS or GATHER_SCHEMA_STATS procedures gather new statistics for tables with stale statistics when the OPTIONS parameter is set to GATHER STALE or GATHER AUTO. If a monitored table has been modified more than 10%, then these statistics are considered stale and gathered again.

The case under discussion was that old favourite – new partitions added to existing tables. The test that follows creates a simple table with four partitions. Three of the partitions are mature, each containing 10,000 rows. The fourth partition is quite new and only contains 1,000 rows.

drop table part_tab;
create table part_tab
(	id	number
,	yrmon	number(6)
,	padcol	varchar2(100))
partition by list (yrmon)
(partition p201101 values (201101)
,partition p201102 values (201102)
,partition p201103 values (201103)
,partition p201104 values (201104));

insert into part_tab
select rownum,201101,lpad('x',100,'y')
from dual connect by rownum <= 10000;
insert into part_tab
select rownum,201102,lpad('x',100,'y')
from dual connect by rownum <= 10000;
insert into part_tab
select rownum,201103,lpad('x',100,'y')
from dual connect by rownum <= 10000;
insert into part_tab
select rownum,201104,lpad('x',100,'y')
from dual connect by rownum <= 1000;

select PARTITION_NAME,NUM_ROWS,BLOCKS,AVG_ROW_LEN,LAST_ANALYZED
from user_tab_statistics
where table_name = 'PART_TAB';

PARTITION_NAME                   NUM_ROWS     BLOCKS AVG_ROW_LEN LAST_ANALYZED
------------------------------ ---------- ---------- ----------- -----------------
                                    31000        568         110 05-MAR-2011 21:24
P201101                             10000        174         110 05-MAR-2011 21:24
P201102                             10000        174         110 05-MAR-2011 21:24
P201103                             10000        174         110 05-MAR-2011 21:24
P201104                              1000         46         110 05-MAR-2011 21:24

The statistics on the fourth partition are now faked using DBMS_STATS.SET_TABLE_STATS to match the other partitions.

-- fake statistics in partition 201104
begin
 dbms_stats.set_table_stats(null
 , 'part_tab'
 , 'P201104'
 , numrows=>10000
 , numblks=>174
 , avgrlen=>110);
end;
/

PARTITION_NAME                   NUM_ROWS     BLOCKS AVG_ROW_LEN LAST_ANALYZED
------------------------------ ---------- ---------- ----------- -----------------
                                    31000        568         110 05-MAR-2011 21:24
P201101                             10000        174         110 05-MAR-2011 21:24
P201102                             10000        174         110 05-MAR-2011 21:24
P201103                             10000        174         110 05-MAR-2011 21:24
P201104                             10000        174         110 05-MAR-2011 21:24

Now we update the partitions in a variety of ways as described by the comments in the code below.

-- partition "P201101": update 11% of rows  (> staleness threshold) and commit
update part_tab
set padcol = lpad('x',100,'z')
where yrmon = 201101 and rownum <= 1100;
commit;

-- partition "P201102": update 11% of rows  (> staleness threshold) and rollback
update part_tab
set padcol = lpad('x',100,'z')
where yrmon = 201102 and rownum <= 1100;
rollback;

-- partition "P201103": update 9% of rows  (< staleness threshold) and commit
update part_tab
set padcol = lpad('x',100,'z')
where yrmon = 201103 and rownum <= 900;
commit;

-- partition "P201104": update 9% of rows according to optimiser statistics
--  (< staleness threshold) but 90% or actual rows (> staleness threshold) 
update part_tab
set padcol = lpad('x',100,'z')
where yrmon = 201104 and rownum <= 900;
commit;

exec DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO

select PARTITION_NAME,INSERTS,UPDATES,DELETES 
from USER_TAB_MODIFICATIONS
where table_name = 'PART_TAB';

PARTITION_NAME                    INSERTS    UPDATES    DELETES
------------------------------ ---------- ---------- ----------
                                        0       4000          0
P201101                                 0       1100          0
P201102                                 0       1100          0
P201103                                 0        900          0
P201104                                 0        900          0

From the output above we can see that the rolled back modifications on partition “P201102″ still increase the counters in USER_TAB_MODIFICATIONS. Interesting.

Now we gather optimiser statistics. If DBMS_STATS respects the faked statistics then partition P201104′s statistics should remain unchanged. If DBMS_STATS can see through our fakery then partition P201104′s statistics should be recollected.

Watch to see which partitions’ LAST_ANALYZED value change from “21:24″ to “21:26″.

exec dbms_stats.GATHER_DATABASE_STATS(options=>'gather auto')

select PARTITION_NAME,NUM_ROWS,BLOCKS,AVG_ROW_LEN,LAST_ANALYZED
from user_tab_statistics
where table_name = 'PART_TAB';

PARTITION_NAME                   NUM_ROWS     BLOCKS AVG_ROW_LEN LAST_ANALYZED
------------------------------ ---------- ---------- ----------- -----------------
                                    31000        568         110 05-MAR-2011 21:26
P201101                             10000        174         110 05-MAR-2011 21:26
P201102                             10000        174         110 05-MAR-2011 21:26
P201103                             10000        174         110 05-MAR-2011 21:24
P201104                             10000        174         110 05-MAR-2011 21:24

As expected USER_TAB_STATISTICS is the truth as far DBMS_STATS is concerned. The partitions with 11% modifications have new statistics collected. The partitions with 9% modifications keep their existing statistics, including partition “P201104″.

Nothing new or surprising here but I now have the hard facts to put in front of my colleague, remaining mindful of the comment “Setting statistics in the manner is not recommended” at all times :)

Follow

Get every new post delivered to your Inbox.

Join 264 other followers