A DBA_HIST_SQLSTAT query that I am very fond of

This is a quick post to share a SQL statement I use a lot at work. The query mines the AWR tables (beware the licence implications: the DBA_HIST views are part of the separately licensed Diagnostics Pack) for a specific SQL ID and date/time range and shows a few choice statistics for each snapshot period.

awrsql.sql:

prompt enter start and end times in format DD-MON-YYYY [HH24:MI]
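-- substitution variables:
--   &sqlid                    target SQL_ID
--   &start_time / &end_time   snapshot window, DD-MON-YYYY [HH24:MI]
--   &hr24_filter              optional hour-of-day filter (e.g. 13); leave blank for all hours
-- note: ELAPSED_TIME_DELTA is in microseconds, hence the /1000 to give milliseconds per execution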
 
column sample_end format a21
select to_char(min(s.end_interval_time),'DD-MON-YYYY DY HH24:MI') sample_end
, q.sql_id
, q.plan_hash_value
, sum(q.EXECUTIONS_DELTA) executions
, round(sum(DISK_READS_delta)/greatest(sum(executions_delta),1),1) pio_per_exec
, round(sum(BUFFER_GETS_delta)/greatest(sum(executions_delta),1),1) lio_per_exec
, round((sum(ELAPSED_TIME_delta)/greatest(sum(executions_delta),1)/1000),1) msec_exec
from dba_hist_sqlstat q, dba_hist_snapshot s
where q.SQL_ID=trim('&sqlid.')
and s.snap_id = q.snap_id
and s.dbid = q.dbid
and s.instance_number = q.instance_number
and s.end_interval_time >= to_date(trim('&start_time.'),'dd-mon-yyyy hh24:mi')
and s.begin_interval_time <= to_date(trim('&end_time.'),'dd-mon-yyyy hh24:mi')
and to_char(s.end_interval_time,'HH24') like '%&hr24_filter.%'
group by s.snap_id
, q.sql_id
, q.plan_hash_value
order by s.snap_id, q.sql_id, q.plan_hash_value
/

Nothing groundbreaking, and I’m sure many will have a similar script.

Below is some example output showing one of the script’s many successful outings: quite a dramatic plan change, I’m sure you’ll agree.

SQL> @awrsql
enter start and end times in format DD-MON-YYYY [HH24:MI]
Enter value for sqlid: 1jjpo2i4b313g
Enter value for start_time: 15-NOV-2011
Enter value for end_time: 21-NOV-2011 13:00
Enter value for hr24_filter:

SAMPLE_END            SQL_ID        PLAN_HASH_VALUE EXECUTIONS PIO_PER_EXEC LIO_PER_EXEC  MSEC_EXEC
--------------------- ------------- --------------- ---------- ------------ ------------ ----------
15-nov-2011 TUE 08:00 1jjpo2i4b313g      3133159894     129629            0            5          0
16-nov-2011 WED 08:01 1jjpo2i4b313g      3133159894     115003            0            5         .1
17-nov-2011 THU 08:01 1jjpo2i4b313g      3133159894     115741            0            5          0
18-nov-2011 FRI 07:00 1jjpo2i4b313g      3133159894      30997            0            5         .1
18-nov-2011 FRI 08:00 1jjpo2i4b313g      3133159894      81034            0            5          0
21-nov-2011 MON 00:00 1jjpo2i4b313g       790865878         16     323091.6     323128.3    36905.8
21-nov-2011 MON 01:00 1jjpo2i4b313g       790865878         29     349676.2     349713.7    48387.2
21-nov-2011 MON 02:00 1jjpo2i4b313g       790865878         35     339474.6     339509.2    34057.7
21-nov-2011 MON 03:00 1jjpo2i4b313g       790865878         37     340934.6     340970.2    35899.4
21-nov-2011 MON 04:01 1jjpo2i4b313g       790865878         38     333469.1     333503.9    35450.8
21-nov-2011 MON 05:00 1jjpo2i4b313g       790865878         35     347559.3     347595.2    35231.8
21-nov-2011 MON 06:00 1jjpo2i4b313g       790865878         32     340224.8       340260    35208.3

I also like to use the query to track the number of executions or LIO per execution over a longer time frame, to see whether the frequency or the individual impact of the SQL is changing over time. The “hr24_filter” variable helps here, for example by showing me only the snapshots for hour “13” over a whole month, as in the illustrative run below.
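For example, restricting a month-long window to just the 13:00 snapshots might look like this (the values are purely illustrative):

SQL> @awrsql
enter start and end times in format DD-MON-YYYY [HH24:MI]
Enter value for sqlid: 1jjpo2i4b313g
Enter value for start_time: 01-NOV-2011
Enter value for end_time: 30-NOV-2011 23:59
Enter value for hr24_filter: 13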

Where Are My Snapshots?

I’m sharing this in the hope that it will provide others with all the relevant information to understand this behaviour, should they hit it, rather than having to spend time searching MOS and performing their own tests. The information is there on MOS, but it took me a little time to find it (note IDs listed later).

“AWR is broken! We don’t have any snapshots over the weekend.”

That’s what started this investigation. I took a look and, sure enough, we didn’t have any snapshots for some instances over a few recent time periods. I checked the alert log and found trace files as shown below:

Trace file /data/<db os owner>/<DB NAME>/admin/diag/rdbms/<db name>/<DB NAME>8/trace/<DB NAME>8_m001_23365.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
ORACLE_HOME = /data/<db os owner>/<DB NAME>/product/11.2.0.2
System name:    Linux
Node name:      <hostname removed>
Release:        2.6.18-194.3.1.0.3.el5
Version:        #1 SMP Tue Aug 31 22:41:13 EDT 2010
Machine:        x86_64
Instance name: <DB NAME>8
Redo thread mounted by this instance: 8
Oracle process number: 87
Unix process pid: 23365, image: <db os owner>@<hostname removed> (M001)

*** 2011-04-06 18:48:50.018
*** SESSION ID:(704.3391) 2011-04-06 18:48:50.018
*** CLIENT ID:() 2011-04-06 18:48:50.018
*** SERVICE NAME:(SYS$BACKGROUND) 2011-04-06 18:48:50.018
*** MODULE NAME:(MMON_SLAVE) 2011-04-06 18:48:50.018
*** ACTION NAME:(Remote-Flush Slave Action) 2011-04-06 18:48:50.018

DDE rules only execution for: ORA 12751
----- START Event Driven Actions Dump ----
---- END Event Driven Actions Dump ----
----- START DDE Actions Dump -----
Executing SYNC actions
Executing ASYNC actions
----- START DDE Action: 'ORA_12751_DUMP' (Sync) -----
Runtime exceeded 900 seconds

The key thing to be aware of is that there has been a change in behaviour from 10g to 11g that limits the amount of time MMON (and its slaves) is allowed to run. MOS 761298.1 “ORA-12751 cpu time or run time policy violation” covers this. The purpose of the change is to stop MMON (and its slaves) contending with application sessions that are already fighting for resources.

In this case the instance was getting hit very hard by the application during testing, and the team reporting the problem were trying to squeeze every last drop of performance out of the system. Up until this point AWR reports had been used to identify where bottlenecks were occurring, and then approaches for avoiding them were proposed. Without AWR snapshots there would be no AWR reports, which in turn stopped them using the same approach to improve throughput…

Having highlighted the MOS note, I found there was still some question over whether this was actually the reason for the missing snapshots, so I had to come up with a way of validating the theory. I ran a manual snapshot with SQL trace enabled and then ran tkprof on the trace file.
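Something along these lines reproduces that test (a sketch only; the post doesn’t record exactly how tracing was enabled, and DBMS_MONITOR against the current session is just one option):

alter session set tracefile_identifier = 'manual_snap';
exec dbms_monitor.session_trace_enable(waits => true, binds => false);
exec dbms_workload_repository.create_snapshot;
exec dbms_monitor.session_trace_disable;
-- then, on the database server, run tkprof against the resulting trace file, e.g.
-- tkprof <trace file> manual_snap.prf sort=exeela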

There are 2 elements of the trace file/tkprof output that I feel are relevant.

1) The top section of the tkprof output, produced using the option “sort=exeela”, shows the following.

SQL ID: 1uk5m5qbzj1vt Plan Hash: 0

BEGIN dbms_workload_repository.create_snapshot; END;

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.09     901.15          0          0         22           1
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        2      0.09     901.15          0          0         22           1

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: SYS

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  enq: WF - contention                           19        0.00          0.00
  row cache lock                                  3        0.00          0.00
  gc current grant busy                           2        0.00          0.00
  reliable message                              900        1.00        900.85
  log file sync                                   1        0.00          0.00
  SQL*Net message to client                       1        0.00          0.00
  SQL*Net message from client                     1        0.00          0.00
********************************************************************************

900 seconds (15 minutes) waiting on “reliable message”. Unfortunately “reliable message” doesn’t make it onto the list of “wait events of greatest interest” in the Oracle documentation, and information is sparse on MOS. However, I believe this represents the session waiting for a response from the other instances in the cluster.

2) Looking into the raw trace file at the section where the wait on “reliable message” ends shows the following message:

KEWRGRFR: failed waiting for remote flushers, rcode=-125

This was accepted as confirmation that we were hitting the issue in MOS 761298.1, but focus then turned to how to live with this behaviour… And, obviously, hitting the instance(s) they were testing on a little less aggressively wasn’t an option, even though (in my opinion) they were unlikely to be getting optimal performance at that point (which I didn’t have time to investigate).

The options I offered were:

1) Manually take an AWR snapshot right before the testing starts, accept that there are unlikely to be any snapshots during the application testing itself, manually take another snapshot as soon as the testing completes, and then generate an AWR report covering the period between the two manual snapshots (see the sketch after this list).

2) Trace the MMON slaves while at least one of the database servers is heavily loaded, perform a manual snapshot, examine the trace files to pinpoint the exact cause of the delay in creating the snapshot, and raise an SR to establish whether we can disable the part of the snapshot activity that is responsible for the delay – assuming there is a specific part of the snapshot activity being slowed down by the load on the server(s), rather than all of it.
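For reference, option 1 boils down to something like this (a minimal sketch, assuming a suitably privileged account; the snapshot IDs for the report come from the awrrpt.sql prompts):

-- immediately before the testing starts
exec dbms_workload_repository.create_snapshot;

-- ... application testing runs; no automatic snapshots expected during this window ...

-- as soon as the testing completes
exec dbms_workload_repository.create_snapshot;

-- generate the report, choosing the two manual snapshots as the begin/end points
@?/rdbms/admin/awrrpt.sql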

It was clear to me that option 1 was the logical choice. Option 2 was suggested on the basis of the hidden parameter “_awr_disabled_flush_tables”, which is covered in MOS 1301503.1 “How To Diagnose AWR Snapshots”. It would be useful if searching MOS for “ORA-12751” found this note.