Mining the listener log with Perl

Update – 02-MAY-2012
I have had cause to use this script again at work so thought I’d develop it further. This post has been updated to reflect the new version.

Recently at work I was engaged by one of our application support teams because of application calls failing with “ORA-00020: maximum number of processes (300) exceeded”. We quickly identified that these errors were the symptom of another problem and not a concern in themselves, the real issue was slow database calls due to an infrastructure issue. The ORA-00020 errors were because the application was spawning new connections to try to work around the slow connections. Now I don’t know much about application servers but I know this is not uncommon. The application team however could not find any indication of this in their logs so it was up to me to prove it.

As far as I am aware there are 2 options for retrospectively proving this at a granular level (please share any other options I’ve missed):

1) Query the database audit trail – DBA_AUDIT_SESSION
2) Mine the listener log

On this database no auditing was enabled so I was left with the listener log as my data source.

I knocked up a quick KSH/AWK script to get my data and then massaged it via various means. The output was quite pleasing so over the recent long UK Easter holiday I Googled some Perl commands and had some fun with it, this post is to share the script and an example chart from Microsoft Excel based on the incident I described above.

The script can be downloaded from – llog-miner.pl

Below is an example of how to call the script – it works on *nix and Windows. The “-f” parameter is used to filter log entries on the date, this example is matching any minute between “11:00″ and “12:59″. I’ve included a full usage map further down this post.

C:\>perl llog-miner.pl -excel -f "04-APR-2012 1[12]" listener.log  > llog.csv

This is the chart showing the logon spike – click to enlarge. You can see the connection spike at 12:56 of over 250 connection attempts.

Connection rate from listener log

The script allows a few other options such as allowing listener logs to be “piped” into the script to allow multiple files to be processed in one command. Usage notes below.

C:\>perl llog-miner.pl -help

    Usage: llog-miner.pl [-help] [-filter <string>] [-[no]excel] [-[no]host] <listener_log_path|->
         : -excel|noexcel       : Output dates in Excel friendly format (default is noexcel)
         : -help                : Print this help
         : -host|nohost         : Include host on program (default is to include)
         : -filter <string>     : A regexp to match beginning of log line. e.g. "23-DEC-2012 0[12]" or "..-MAR-2013"
         : listener_log_path|-  : Input either a valid listener log name or "-" to read STDIN

    Examples:
         # Process listener.log looking for entries on the 21st and 22nd of Sept and format dates for Excel
         C:\>perl llog-miner.pl -filter "2[12]-SEP-2011" -excel listener.log

         # Process two listener log files looking for entries on Jan 3rd between 20:00 and 21:59 and exclude the hostname
         $ cat listener.log listener_db.log | llog-miner.pl -filter "03-JAN-2012 2[01]" -nohost -

An improvement I may make in the future is to expand minutes where there are no connection attempts. I have drafted some code to fill in gaps but have kept it to myself for now as it’s too ugly to share (at the moment).

I hope someone finds this useful and if not, well I found it entertaining if nothing else :-)

ADRCI and listener purging

I won’t go into any details on ADRCI as others have already done a better job than I could. The key piece of information for this post is that the MMON process shoulders the responsibility for ADR purging. So what if there isn’t an MMON process?

Note: All ADRCI “show control” output on this page has been edited for readability

Below is the ADR purge details for my local listener. These settings were defined using “set control (SHORTP_POLICY = 168,LONGP_POLICY = 336)”. “168″ = 7 days and “336″ = 14 days.

adrci
set home diag/tnslsnr/n01/listener
show control

ADRID      SHORTP_POLICY LONGP_POLICY LAST_MOD_TIME              LAST_AUTOPRG_TIME LAST_MANUPRG_TIME
---------- ------------- ------------ -------------------------- ----------------- ----------------------------
1794992421 168           336          2012-01-25 20:41:09.052526

adrci> host
$ date

Thu Feb 16 21:26:16 GMT 2012

$ cd  /u01/app/oracle/diag/tnslsnr/n01/listener/alert
$ ls -ltr

total 30260
-rw-r----- 1 grid oinstall 10485934 Jan 25 03:45 log_1.xml
-rw-r----- 1 grid oinstall 10485779 Feb  4 03:11 log_2.xml
-rw-r----- 1 grid oinstall  9954308 Feb 16 21:43 log.xml

Notice that the “LAST_AUTOPRG_TIME” column is blank. It looks like this system is not being purged automatically. Also notice that the date of this test was Feb 16th and the oldest file in the “alert” directory is dated “Jan 25″. According to M.O.S note “Which Files Are Part Of SHORTP_POLICY And LONGP_POLICY In ADR? [ID 975448.1]” the “alert” directory is under LONGP_POLICY which is 14 days so the file “log_1.xml” should have been purged.

Below I force a manual purge

adrci> purge
adrci> host

$ cd  /u01/app/oracle/diag/tnslsnr/n01/listener/alert
$ ls -ltr

total 20004
-rw-r----- 1 grid oinstall 10485779 Feb  4 03:11 log_2.xml
-rw-r----- 1 grid oinstall  9960304 Feb 16 21:53 log.xml

It has now gone. And LAST_MANUPRG_TIME has been set to record the fact.

ADRID      SHORTP_POLICY LONGP_POLICY LAST_MOD_TIME              LAST_AUTOPRG_TIME LAST_MANUPRG_TIME
---------- ------------- ------------ -------------------------- ----------------- --------------------------
1794992421 168           336          2012-01-25 20:41:09.052526                   2012-02-16 21:53:51.761639

So my recommendation is that alongside ADR’s auto purging and your own maintenance on the alert.log & listener.log files in the “trace” directories you should also routinely call something like below:

for i in `adrci exec="show homes"|grep listener`;do
 echo "adrci exec=\"set home $i;purge\""
 adrci exec="set home $i;purge";
 # maybe check $? here
done

I noticed this M.O.S note recently too. I’ve never seen the symptoms but it’s worth taking a look.

ADR Auto Purge Does Not Work On All RAC Instances, ASM and Standby Databases [ID 1326983.1]

Controlling The SCAN Listener Log Location

I was lucky enough to attend UKOUG Conference 2011 last year. I saw several great presentations, one of which was a presentation on troubleshooting RAC by Julian Dyke. During this presentation Julian commented that we need to watch out for SCAN listeners having their ADR home inside the Grid Infrastructure home instead of with the Database listeners in the ADR home under $ORACLE_BASE.

Well this was one of the few things in his presentation I was fit to comment on so I tried to attract Julian’s attention by waving my arm, sadly my timid nature meant I went unnoticed. So this blog post is me going public with my attempted interruption.

First an example of how the log location for SCAN listeners typically differs to that of Database listeners. I personally do not like this behaviour as I like my Grid Infrastructure home to have resonably stable size, which it can have once the log directory’s rotation is mature, however the ADR home for listeners is not cleared down automatically so you can end up with a directory growing unchecked.

[grid@n01 n01]$ $GRID_HOME/bin/lsnrctl status LISTENER|grep Log
Listener Log File         /u01/app/oracle/diag/tnslsnr/n01/listener/alert/log.xml
[grid@n01 n01]$ $GRID_HOME/bin/lsnrctl status LISTENER_SCAN3|grep Log
Listener Log File         /u01/app/grid/11203/ghome_1/log/diag/tnslsnr/n01/listener_scan3/alert/log.xml

And we know from the Oracle documentation that ADR_BASE is defined as:

Purpose
To specify the base directory into which tracing and logging incidents are stored when ADR is enabled.

Default
The default on the server side is ORACLE_BASE, or ORACLE_HOME/log, if ORACLE_BASE is not defined.

So… let’s check ORACLE_BASE for CRS resources:

[grid@n01 ~]$ $GRID_HOME/bin/crsctl stat res ora.LISTENER.lsnr -p|grep -i base
USR_ORA_ENV=ORACLE_BASE=/u01/app/oracle
[grid@n01 ~]$ $GRID_HOME/bin/crsctl stat res ora.LISTENER_SCAN3.lsnr -p|grep -i base
[grid@n01 ~]$

Notice above that the Database listener has $ORACLE_BASE set and the Scan listener does not. Hence the log location.

This can be controlled for Database listeners by using the SETENV switch for SRVCTL but this does not work for Scan listeners.

[grid@n01 ~]$ $GRID_HOME/bin/srvctl setenv listener -l listener_scan1 -t "ORACLE_BASE=/u01/app/oracle"
PRKO-3162 : The actions setenv, getenv, and unsetenv are not suppported for Single Client Access Name listener listener_scan1
PRCN-2066 : Failed to retrieve cluster listener for listener_scan1 because it has the type of Single Client Access Name Listener

There is another way of setting environment variables for cluster resources and that is via crsctl modify resource.

[grid@n01 ~]$ $GRID_HOME/bin/crsctl modify resource ora.LISTENER_SCAN3.lsnr -attr USR_ORA_ENV=ORACLE_BASE=/u01/app/oracle
[grid@n01 ~]$ $GRID_HOME/bin/crsctl stat res ora.LISTENER_SCAN3.lsnr -p|grep -i base
USR_ORA_ENV=ORACLE_BASE=/u01/app/oracle
[grid@n01 ~]$ $GRID_HOME/bin/srvctl stop scan_listener -i 3
[grid@n01 ~]$ $GRID_HOME/bin/srvctl start scan_listener -i 3
[grid@n01 ~]$ lsnrctl status  listener_scan3 | grep Log
Listener Log File         /u01/app/oracle/diag/tnslsnr/n01/listener_scan3/alert/log.xml

To my mind the whole thing would seem a little cleaner if there was an “SRVCTL SETENV SCAN_LISTENER” command.

After writing this post I thought I’d check around for other information to support this post and, lo and behold, there is a My Oracle Support article covering almost exactly what I’ve written above. The article also talks about using the LISTENER.ORA attribute ADR_BASE_LISTENER_SCANn to control this.

Diagnostic Destination For Listeners Is Under Grid_home/Log Instead Of Oracle_base/Diag [ID 1269109.1]

Interesting Change in Database Listener Behaviour at 11g

There is a small yet significant improvement in the implementation of TNSLSNR at 11g around the way it writes to the listener log files, both “log.xml” and “listener.log”. First though I’ll describe some 10g listener behaviour that anyone who’s written their own script for recycling the log file will have noticed. I should also mention this is all Unix based – I’m not aware if the same issues exist in a Windows environment.

Below I check the listener log file and then rename it “listener.log.old”.

$ cd /u01/app/oracle/product/10.2.0/db_1/network/log
$ ls -l listener*
-rw-r----- 1 oracle oinstall 3715 Jun 19 22:33 listener.log
$ mv listener.log listener.log.old

Next I connect to a database via the listener and check for files starting with “listener” in the log location.

$ sqlplus a/a@orcl
SQL> exit

$ ls -l listener*
-rw-r----- 1 oracle oinstall 3964 Jun 19 22:34 listener.log.old

$ tail -1 listener.log.old
19-JUN-2011 22:34:48 * (CONNECT_DATA=(SERVICE_NAME=orcl)(CID=(PROGRAM=sqlplus)(HOST=vb-centos-10.2-a)(USER=oracle))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.105)(PORT=24980)) * establish * orcl * 0

There is no new “listener.log” and my recent connection was logged in the “listener.log.old” file.

I can go one step further and remove the file while the listener is running and while I am “tailing” the file.

### session 1
$ tail -f listener.log.old

### session 2
$ rm listener.log.old
$ sqlplus a/a@orcl

### session 1
19-JUN-2011 22:38:12 * (CONNECT_DATA=(SERVICE_NAME=orcl)(CID=(PROGRAM=sqlplus)(HOST=vb-centos-10.2-a)(USER=oracle))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.105)(PORT=25010)) * establish * orcl * 0

Log entries are still going to the old log file even though the file is not visible from the operating system.

$ ls -l listener*
ls: listener*: No such file or directory

If I left it like this I could end up with a huge listener.log file that is essentially invisible. If ever you have “du” and “df” output that disagree by a huge amount in an Oracle home filesystem then this may be why. You should never see it in the ADR though as that would imply 11g. Before we go on to that we’ll restart the listener logging so the correct file is used (there are many ways to do this by the way).

$ lsnrctl

LSNRCTL for Linux: Version 10.2.0.4.0 - Production on 19-JUN-2011 22:44:43

Copyright (c) 1991, 2007, Oracle.  All rights reserved.

Welcome to LSNRCTL, type "help" for information.

LSNRCTL> set current_listener listener
Current Listener is listener
LSNRCTL> set log_status off
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=vb-centos-10.2-a)(PORT=1521)))
listener parameter "log_status" set to OFF
The command completed successfully
LSNRCTL> set log_status on
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=vb-centos-10.2-a)(PORT=1521)))
listener parameter "log_status" set to ON
The command completed successfully

Now let’s demonstrate what happens if we repeat this test at 11g (11.2 in my case but the behaviour is the same at 11.1).

$ cd /u01/app/oracle/diag/tnslsnr/vb-centos-11/listener/trace
$ ls -l listener*
-rw-r----- 1 oracle oinstall 31486 Jun 16 18:36 listener.log
$ mv listener.log listener.log.old

$ sqlplus  a/a@orcl
SQL> exit

$ ls -l listener*
-rw-r----- 1 oracle oinstall 243 Jun 16 18:36 listener.log
-rw-r----- 1 oracle oinstall 31486 Jun 16 18:36 listener.log.old

$ cat listener.log
Thu Jun 16 18:36:56 2011
16-JUN-2011 18:36:56 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=orcl)(CID=(PROGRAM=sqlplus)(HOST=vb-centos-11.2-a)(USER=oracle))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.102)(PORT=49695)) * establish * orcl * 0

So this time the listener has recreated “listener.log” file and not continued to write to the renamed log file. If we use the strace utility we can see why. An extract from the 11g output is below. Notice the “open” commands for the “log.xml” and “listener.log” files.

lstat64("/u01/app/oracle/diag/tnslsnr/vb-centos-11/listener/alert/log.xml", {st_mode=S_IFREG|0640, st_size=90949, ...}) = 0
open("/u01/app/oracle/diag/tnslsnr/vb-centos-11/listener/alert/log.xml", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0660) = 15
fcntl64(15, F_SETFD, FD_CLOEXEC)        = 0
write(15, "<msg time='2011-06-16T18:12:17.1"..., 236) = 236
close(15)                               = 0
stat64("/u01/app/oracle/diag/tnslsnr/vb-centos-11/listener/alert/log.xml", {st_mode=S_IFREG|0640, st_size=91185, ...}) = 0
times({tms_utime=118, tms_stime=1077, tms_cutime=0, tms_cstime=0}) = 434969334
lstat64("/u01/app/oracle/diag/tnslsnr/vb-centos-11/listener/trace/listener.log", {st_mode=S_IFREG|0640, st_size=1021, ...}) = 0
open("/u01/app/oracle/diag/tnslsnr/vb-centos-11/listener/trace/listener.log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0660) = 15
fcntl64(15, F_SETFD, FD_CLOEXEC)        = 0
write(15, "Thu Jun 16 18:12:17 2011\n", 25) = 25

And below an extract from tracing a 10g listener.

_llseek(3, 0, [4511], SEEK_CUR)         = 0
write(3, "19-JUN-2011 22:41:01 * service_u"..., 49) = 49

This time we do not get an open for each write – we only see the file reopened if we restart logging.

open("/u01/app/oracle/product/10.2.0/db_1/network/log/listener.log", O_WRONLY|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = 3

Let’s hope the next improvement is to take the cycling of the trace “listener.log” file (not xml version) into the code.