SQL Trace Without Storage Worries

I’ve recently been involved in working on an SR with Oracle that included a request to provide SQL Trace for a session affected by the reported issue. Clearly not an unreasonable request, but the problem was that we did not have a reproducible test case. I’m not going to cover the details of the actual problem, as that’s not the point of the post, but if you don’t know under what conditions you are going to generate a particular Oracle error (which you believe to be incorrectly raised) then you either trace every application session that could hit the problem or you limit your scope and hope you get lucky. I try to keep “hoping to get lucky” out of my daily work, but the option of tracing every session was simply not viable as it would generate too much data for us to manage at a file system level. What we needed was “flashback tracing”: a way to retrospectively enable tracing for a session once it had hit the error… In steps SQL Trace Buffering.

I first became aware of this functionality when attending Tanel Poder’s hacking session at Oracle Closed World in 2010, but I have since discovered that Julian Dyke covered it in his Oracle Diagnostics presentation, which Julian’s website shows as first given back in 2002 at the UKOUG Conference!

I suppose the fact that it’s taken nearly two years since I became aware of the functionality for me to use it in anger could suggest it doesn’t need to be on the list of things every Oracle DBA needs to know, but it is a really cool feature and hence I want to spread the word.

Buffered SQL Trace allows you to allocate some memory to hold SQL Trace data for a session. The memory buffer is written to in a circular fashion until the buffering is turned off, at which point the contents are written to disk.

Taken directly from Julian’s presentation:

Turn on the buffering

ALTER SESSION SET EVENTS 'immediate trace name trace_buffer_on level <level>';

*Where <level> is the size in bytes of the buffer

Turn off the buffering and flush to disk

ALTER SESSION SET EVENTS 'immediate trace name trace_buffer_off';
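
To put the two commands together, buffering the trace for your own session might look something like the sketch below; the 1MB buffer size and the use of the 10046 event to enable extended SQL Trace are illustrative choices on my part rather than anything prescribed by Julian’s presentation.

-- allocate a circular in-memory buffer for this session's trace data (size in bytes)
ALTER SESSION SET EVENTS 'immediate trace name trace_buffer_on level 1048576';

-- enable extended SQL Trace; while buffering is active the output is held in memory
ALTER SESSION SET EVENTS '10046 trace name context forever, level 12';

-- ... run the workload of interest ...

-- stop buffering, flushing the buffer contents to the trace file, then stop tracing
ALTER SESSION SET EVENTS 'immediate trace name trace_buffer_off';
ALTER SESSION SET EVENTS '10046 trace name context off';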

You can also set up the buffered tracing via ORADEBUG, which is the approach I used and will be shown later.

That was half the problem solved: we had a way of capturing SQL Trace data without actually writing it to disk for all of the application sessions, and therefore we were not in danger of filling the DIAGNOSTIC_DEST.

The second half of the problem was triggering Oracle to flush the buffered SQL Trace data to disk when we hit one of the errors that was believed to be falsely raised. This is pretty straightforward, as you can instruct Oracle to perform a “diagnostic action” when a particular error is encountered. There were four errors that we wanted to dump the trace data for, so the approach was as shown below:

SQL> oradebug setorapid <process id for session we wanted to trace>
SQL> oradebug session_event sql_trace wait=true, bind=true trace_buffer_on(256000)
SQL> oradebug session_event <error number> trace name trace_buffer_off; name errorstack level 3
SQL> oradebug session_event <error number> trace name trace_buffer_off; name errorstack level 3
SQL> oradebug session_event <error number> trace name trace_buffer_off; name errorstack level 3
SQL> oradebug session_event <error number> trace name trace_buffer_off; name errorstack level 3

As we wanted this for every application session, it was just a case of writing a statement that would generate the above commands for all the relevant sessions and spooling them to a file that could then be run.
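
A sketch of the kind of generator query is shown below. APP_USER is a placeholder for the application account, and the four error-event lines are left out because the actual error numbers aren’t shown above; the generated file would then be run from a SYSDBA SQL*Plus session.

set pagesize 0 linesize 200 trimspool on feedback off
spool enable_buffered_trace.sql

select 'oradebug setorapid ' || p.pid || chr(10) ||
       'oradebug session_event sql_trace wait=true, bind=true trace_buffer_on(256000)'
       -- the four "oradebug session_event <error number> ..." lines would be concatenated here too
from   v$session s,
       v$process p
where  s.paddr    = p.addr
and    s.username = 'APP_USER'
/

spool off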

As a final note, if you’ve not read Tanel’s posts on the new (at 11g) diagnostic framework then have a read of these two:

The full power of Oracle’s diagnostic events, part 1: Syntax for KSD debug event handling
The full power of Oracle’s diagnostic events, part 2: ORADEBUG DOC and 11g improvements

Production Support Tips & Tricks #2 – SQL Trace

Part 1: Production Support Tips & Tricks #1 – Collecting Log Data

There’s no need to explain what SQL Trace is, so let’s get straight to the age-old concern: the overhead. Incident managers are risk averse, and when you ask if you can trace a session on production you will always be asked what overhead it has, and invariably you will say “erm… 5%… ish”. We can see from the Oracle documentation that Oracle’s stance on this has changed over the years:

9i docs:

“The additional overhead … is normally insignificant compared with the inherent overhead caused by the application’s inefficiency.”

11.2 docs:

“Using the SQL Trace facility can have a severe performance impact and may result in increased system overhead, excessive CPU usage, and inadequate disk space”

In this post I am going to attempt to assess the overhead of SQL Trace. My theory is that there are 2 potential factors.

  1. The overhead of the SQL Trace code. There must be extra code involved and it must take some amount of CPU to execute it.
  2. The overhead of writing trace files. Writing to a file cannot be free of overhead.

My cunning plan is to test queries in 3 states:

  • With SQL Trace disabled
  • With SQL Trace enabled
  • With SQL Trace enabled but redirected to /dev/null, thus removing any overhead of writing to the trace file (a sketch of one way to do this follows this list)
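
On the last point, there is more than one way to send trace data to /dev/null; the sketch below shows the symlink trick, with a purely illustrative trace file path.

-- find the trace file the current session will write to (11g)
select value from v$diag_info where name = 'Default Trace File';

-- then, from the shell (or via the SQL*Plus HOST command), replace that file with a
-- symlink to /dev/null before enabling trace; the path here is just an example
host ln -sf /dev/null /u01/app/oracle/diag/rdbms/xe/XE/trace/XE_ora_12345.trc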

The queries were tested in both IO bound and CPU bound states. The tests were run on a Linux laptop running Oracle XE 11g with a single hard drive, so the results are not “enterprise class” – your mileage may vary.

All of the code used can be found at these links; for brevity I have not included it inline: build-acme-schema.sql, sqltrace-long1.sql, sqltrace-long2.sql, sqltrace-short1.sql, sqltrace-short2.sql

Test #1 – IO bound queries with a long(ish) elapsed time

I say “long(ish)” as the queries only take a few seconds; the key thing is that the majority of the time is spent fetching data. I flush the buffer cache before every execution to make sure the queries are IO bound. The test code can be found in sqltrace-long1.sql.
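
For reference, each timed execution broadly takes the shape sketched below. This is a simplification rather than the contents of sqltrace-long1.sql, it assumes DBMS_MONITOR is used to switch tracing on and off, and the table name is a placeholder.

-- flush the buffer cache so the run is IO bound
alter system flush buffer_cache;

set timing on

-- enable extended SQL Trace for this session (omitted for the "no trace" runs)
exec dbms_monitor.session_trace_enable(waits => true, binds => true)

-- stand-in for the real long(ish) test query against the ACME schema
select /*+ full(t) */ count(*) from acme_orders t;

exec dbms_monitor.session_trace_disable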

The results are:

Test \ Execution                      1        2        3        4        5        6        Average  % Overhead
Long IO bound no trace                5.45s    5.45s    5.52s    5.52s    5.45s    5.34s    5.45s    -
Long IO bound with trace (32KB)       6.90s    6.89s    6.94s    7.09s    6.84s    7.01s    6.94s    27.31%
Long IO bound with trace (to null)    6.89s    6.82s    6.89s    6.94s    6.78s    7.12s    6.90s    26.61%

I was really surprised by this. For a small number of queries generating only 32KB of trace data, I saw a degradation of over 25% in run time with SQL Trace enabled. The minimal size of the trace file did nothing to prove my theory about the overhead of writing the trace file, but this was an eye opener.

Test #2 – CPU bound queries with a long(ish) elapsed time

This test used the same queries as before but I removed the flush of the buffer cache from before each query. This removed all IO waits. The test code can be found in sqltrace-long2.sql.

Test \ Execution                      1        2        3        4        5        6        Average  % Overhead
Long CPU bound no trace               3.79s    3.76s    3.80s    3.73s    3.71s    3.79s    3.76s    -
Long CPU bound with trace (11KB)      4.90s    4.99s    5.02s    5.00s    4.89s    4.87s    4.94s    31.39%
Long CPU bound with trace (to null)   4.95s    5.00s    5.05s    5.05s    4.90s    4.86s    4.96s    32.02%

Interesting – so by running the same workload but with no disk reads to slow it down the overhead is slightly more pronounced. But infrequent long running queries are not the only fruit.

Test #3 – IO bound frequent queries with a short elapsed time

This test uses very fast queries in a tight loop. The proportion of SQL executes to row fetches is very different to the previous tests. The buffer cache is flushed before each execution. The test code can be found in sqltrace-short1.sql.
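
The shape of this test is roughly as sketched below; again this is not the actual sqltrace-short1.sql, and the table, predicate and iteration count are placeholders.

-- flush the buffer cache so even the short queries have to hit the disk
alter system flush buffer_cache;

exec dbms_monitor.session_trace_enable(waits => true, binds => true)

declare
  l_dummy number;
begin
  -- many executions of a very fast query, so executes dominate over row fetches
  for i in 1 .. 100000 loop
    select count(*) into l_dummy from acme_orders where order_id = mod(i, 1000) + 1;
  end loop;
end;
/

exec dbms_monitor.session_trace_disable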

Test \ Execution                      1        2        3        4        5        6        Average  % Overhead
Short IO bound no trace               43.99s   43.2s    39.44s   41.81s   39.00s   40.03s   41.24s   -
Short IO bound with trace (2.9MB)     41.35s   43.63s   40.70s   41.72s   43.24s   39.86s   41.75s   1.22%
Short IO bound with trace (to null)   41.12s   45.21s   41.50s   40.82s   42.44s   40.05s   41.85s   1.48%

So this is more like it – only around a 1.5% overhead.

Test #4 – CPU bound frequent queries with a short elapsed time

This test uses the same very fast queries in a tight loop from the previous test. The buffer cache flushes were removed and the number of iterations increased to get a longer elapsed time. The test code can be found in sqltrace-short2.sql.

Test \ Execution                      1        2        3        4        5        6        Average  % Overhead
Short CPU bound no trace              2.58s    2.46s    2.46s    2.63s    2.49s    2.47s    2.51s    -
Short CPU bound with trace (32MB)     7.55s    7.52s    7.57s    7.64s    7.43s    7.67s    7.56s    200.72%
Short CPU bound with trace (to null)  4.45s    4.57s    4.69s    4.63s    4.54s    4.62s    4.58s    82.23%

Ouch – a 200% overhead! The overhead of writing to the trace file was also visible this time: the SQL Trace overhead dropped from around 200% to around 80% when the trace file was /dev/null.

Summary

I am almost reluctant to summarise as these are not the most scientific of tests, but the things I am taking away from this exercise are:

  • SQL workload that is CPU bound will see a dramatic degradation in elapsed time when SQL Trace is enabled – I’ll be wary of this in the future
  • SQL workload that is IO bound will see less of an impact from SQL Trace, but the impact could still be significant.
  • When the trace files are large, the speed of the filesystem they are written to matters. In these tests I used /dev/null, which is of course cheating.
  • When IO response is important to a test, remember to set filesystemio_options=setall (the change is sketched after this list). I wasted several evenings scratching my head at inconsistent, unpredictable test results.
  • Your mileage may vary
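
For completeness, the filesystemio_options change mentioned above looks like the sketch below. It is a static parameter, so a restart is needed, and whether SETALL actually helps depends on your platform and filesystem.

-- check the current setting
show parameter filesystemio_options

-- enable both direct and asynchronous IO where the filesystem supports it
alter system set filesystemio_options = setall scope = spfile;

-- static parameter: the change only takes effect after a restart
shutdown immediate
startup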

Feel free to point out mistakes in my train of thought or in the tests themselves. I’m not too happy with tests 1 & 2 – I’m not convinced they prove anything worthwhile. Tests 3 & 4, however, produced results I found interesting and informative.

I think I’ll continue to tell Incident Managers that the overhead is around 5-10%, but only after I’ve given some careful thought to the criticality of the workload I want to trace.