DBMS_STATS And Direct Path Operations

Continuing my theme of (almost pointless) testing that DBMS_STATS does as you would hope, this post proves the answer to the question below that was posed at my work place.

Do direct path loads count as modifications for DBMS_STATS or do they bypass it?

My initial instinct was “of course it catches them” but I stopped myself. I’ve never actually seen it so it would be improper to say it out loud. Better dash home and test it (“dash” = 1 month later in this case).

This test is in from Database 11.2 and has been verified on 10.2 also. It has also been tested and tweaked thanks to Surachart’s input – see the first comment on the post. I thought I’d unearthed an interesting difference between 10g and 11g – it turns out I’d only discovered that ARCHIVELOG is not the same as NOARCHIVELOG – doh!

Create a test table, populate it and check that the modifications are caught.

DROP TABLE tab1;
CREATE TABLE tab1
(	col1 NUMBER
,	col2 VARCHAR2(1));

INSERT INTO tab1
SELECT	ROWNUM
,	'Y'
FROM	dual
CONNECT BY ROWNUM < 1000;
COMMIT;

exec DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO

column table_name format a10
select table_name,INSERTS,UPDATES,DELETES 
from USER_TAB_MODIFICATIONS
where table_name = 'TAB1';

TABLE_NAME    INSERTS    UPDATES    DELETES
---------- ---------- ---------- ----------
TAB1              999          0          0

All good. Now let’s try a direct path load.

ALTER TABLE tab1 NOLOGGING;

INSERT /*+ APPEND */ INTO tab1
SELECT	ROWNUM
,	'Y'
FROM	dual
CONNECT BY ROWNUM < 1000;
COMMIT;

exec DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO

column table_name format a10
select table_name,INSERTS,UPDATES,DELETES 
from USER_TAB_MODIFICATIONS
where table_name = 'TAB1';

TABLE_NAME    INSERTS    UPDATES    DELETES
---------- ---------- ---------- ----------
TAB1             1998          0          0

Looking good but the self doubter in me is asking:

How can you prove that was a direct path load?

Like this hopefully.

sqlplus /

ALTER TABLE tab1 LOGGING;

INSERT INTO tab1
SELECT	ROWNUM
,	'Y'
FROM	dual
CONNECT BY ROWNUM < 1000;
COMMIT;

@mystat
Enter value for namefilter: redo size

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
redo size                                                             19712

And…

sqlplus /

ALTER TABLE tab1 NOLOGGING;

INSERT /*+ APPEND */ INTO tab1
SELECT	ROWNUM
,	'Y'
FROM	dual
CONNECT BY ROWNUM < 1000;
COMMIT;

@mystat
Enter value for snamefilter: redo size

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
redo size for direct writes                                              52
redo size                                                              4500

All good, but…

What about DDL?

…shouts my annoying inner self.

TRUNCATE TABLE tab1;

exec DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO

column table_name format a10
select table_name,INSERTS,UPDATES,DELETES 
from USER_TAB_MODIFICATIONS
where table_name = 'TAB1';

TABLE_NAME    INSERTS    UPDATES    DELETES
---------- ---------- ---------- ----------
TAB1             3996          0       3996

OK, so both my annoying inner voice and I am happy that direct writes are indeed caught by DBMS_STATS. I know this is as expected but heck, sometimes it’s good to see it in black and white.

DBMS_STATS.SET_TABLE_STATS and statistic staleness

Recently at work there was a discussion regarding setting optimiser statistics using DBMS_STATS SET_*_STATISTICS procedures. The question was asked:

If we set the statistics ourselves will the automatic stat’s job not overwrite them the following night?

Now I was confident that this was not the case but, having never tested this explicitly, thought it better to get my facts straight.

First some quotes from the Oracle Documentation

Setting Statistics

You can set table, column, index, and system statistics using the SET_*_STATISTICS procedures. Setting statistics in the manner is not recommended, because inaccurate or inconsistent statistics can lead to poor performance.

Determining Stale Statistics

Monitoring tracks the approximate number of INSERTs, UPDATEs, and DELETEs for that table and whether the table has been truncated since the last time statistics were gathered. You can access information about changes of tables in the USER_TAB_MODIFICATIONS view. Following a data-modification, there may be a few minutes delay while Oracle Database propagates the information to this view. Use the DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO procedure to immediately reflect the outstanding monitored information kept in the memory.

The GATHER_DATABASE_STATS or GATHER_SCHEMA_STATS procedures gather new statistics for tables with stale statistics when the OPTIONS parameter is set to GATHER STALE or GATHER AUTO. If a monitored table has been modified more than 10%, then these statistics are considered stale and gathered again.

The case under discussion was that old favourite – new partitions added to existing tables. The test that follows creates a simple table with four partitions. Three of the partitions are mature, each containing 10,000 rows. The fourth partition is quite new and only contains 1,000 rows.

drop table part_tab;
create table part_tab
(	id	number
,	yrmon	number(6)
,	padcol	varchar2(100))
partition by list (yrmon)
(partition p201101 values (201101)
,partition p201102 values (201102)
,partition p201103 values (201103)
,partition p201104 values (201104));

insert into part_tab
select rownum,201101,lpad('x',100,'y')
from dual connect by rownum <= 10000;
insert into part_tab
select rownum,201102,lpad('x',100,'y')
from dual connect by rownum <= 10000;
insert into part_tab
select rownum,201103,lpad('x',100,'y')
from dual connect by rownum <= 10000;
insert into part_tab
select rownum,201104,lpad('x',100,'y')
from dual connect by rownum <= 1000;

select PARTITION_NAME,NUM_ROWS,BLOCKS,AVG_ROW_LEN,LAST_ANALYZED
from user_tab_statistics
where table_name = 'PART_TAB';

PARTITION_NAME                   NUM_ROWS     BLOCKS AVG_ROW_LEN LAST_ANALYZED
------------------------------ ---------- ---------- ----------- -----------------
                                    31000        568         110 05-MAR-2011 21:24
P201101                             10000        174         110 05-MAR-2011 21:24
P201102                             10000        174         110 05-MAR-2011 21:24
P201103                             10000        174         110 05-MAR-2011 21:24
P201104                              1000         46         110 05-MAR-2011 21:24

The statistics on the fourth partition are now faked using DBMS_STATS.SET_TABLE_STATS to match the other partitions.

-- fake statistics in partition 201104
begin
 dbms_stats.set_table_stats(null
 , 'part_tab'
 , 'P201104'
 , numrows=>10000
 , numblks=>174
 , avgrlen=>110);
end;
/

PARTITION_NAME                   NUM_ROWS     BLOCKS AVG_ROW_LEN LAST_ANALYZED
------------------------------ ---------- ---------- ----------- -----------------
                                    31000        568         110 05-MAR-2011 21:24
P201101                             10000        174         110 05-MAR-2011 21:24
P201102                             10000        174         110 05-MAR-2011 21:24
P201103                             10000        174         110 05-MAR-2011 21:24
P201104                             10000        174         110 05-MAR-2011 21:24

Now we update the partitions in a variety of ways as described by the comments in the code below.

-- partition "P201101": update 11% of rows  (> staleness threshold) and commit
update part_tab
set padcol = lpad('x',100,'z')
where yrmon = 201101 and rownum <= 1100;
commit;

-- partition "P201102": update 11% of rows  (> staleness threshold) and rollback
update part_tab
set padcol = lpad('x',100,'z')
where yrmon = 201102 and rownum <= 1100;
rollback;

-- partition "P201103": update 9% of rows  (< staleness threshold) and commit
update part_tab
set padcol = lpad('x',100,'z')
where yrmon = 201103 and rownum <= 900;
commit;

-- partition "P201104": update 9% of rows according to optimiser statistics
--  (< staleness threshold) but 90% or actual rows (> staleness threshold) 
update part_tab
set padcol = lpad('x',100,'z')
where yrmon = 201104 and rownum <= 900;
commit;

exec DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO

select PARTITION_NAME,INSERTS,UPDATES,DELETES 
from USER_TAB_MODIFICATIONS
where table_name = 'PART_TAB';

PARTITION_NAME                    INSERTS    UPDATES    DELETES
------------------------------ ---------- ---------- ----------
                                        0       4000          0
P201101                                 0       1100          0
P201102                                 0       1100          0
P201103                                 0        900          0
P201104                                 0        900          0

From the output above we can see that the rolled back modifications on partition “P201102″ still increase the counters in USER_TAB_MODIFICATIONS. Interesting.

Now we gather optimiser statistics. If DBMS_STATS respects the faked statistics then partition P201104’s statistics should remain unchanged. If DBMS_STATS can see through our fakery then partition P201104’s statistics should be recollected.

Watch to see which partitions’ LAST_ANALYZED value change from “21:24″ to “21:26″.

exec dbms_stats.GATHER_DATABASE_STATS(options=>'gather auto')

select PARTITION_NAME,NUM_ROWS,BLOCKS,AVG_ROW_LEN,LAST_ANALYZED
from user_tab_statistics
where table_name = 'PART_TAB';

PARTITION_NAME                   NUM_ROWS     BLOCKS AVG_ROW_LEN LAST_ANALYZED
------------------------------ ---------- ---------- ----------- -----------------
                                    31000        568         110 05-MAR-2011 21:26
P201101                             10000        174         110 05-MAR-2011 21:26
P201102                             10000        174         110 05-MAR-2011 21:26
P201103                             10000        174         110 05-MAR-2011 21:24
P201104                             10000        174         110 05-MAR-2011 21:24

As expected USER_TAB_STATISTICS is the truth as far DBMS_STATS is concerned. The partitions with 11% modifications have new statistics collected. The partitions with 9% modifications keep their existing statistics, including partition “P201104″.

Nothing new or surprising here but I now have the hard facts to put in front of my colleague, remaining mindful of the comment “Setting statistics in the manner is not recommended” at all times :)

When does METHOD_OPT size REPEAT not obey?

I have an example here of a case where a DBMS_STATS METHOD_OPT parameter setting of REPEAT does not do as expected (by me at least).

We have a table with 3 columns. The table is loaded with 1000 rows, all columns in all rows are set to 1.

drop  table tab1 ;
create table tab1 
(	colsz1 number
,	colsz3 number
,	colsz254 number
);
insert into tab1
select 1,1,1 from all_objects where rownum <= 1000;
commit;

We now gather statistics on each column with varying histogram sizes and check the resulting histogram settings for each one. The column names will help keep track of which is which.

exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colsz1 size 1');
exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colsz3 size 3');
exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colsz254 size 254');

column column_name format a12
select column_name,num_distinct,num_buckets,histogram 
from dba_tab_columns 
where table_name = 'TAB1' order by 1;

Below we can observe that “SIZE 1″ did not create a histogram and “SIZE 3″ & “SIZE 254″ created frequency histograms, each with 1 bucket. All as expected.

COLUMN_NAME  NUM_DISTINCT NUM_BUCKETS HISTOGRAM
------------ ------------ ----------- ---------------
COLSZ1                  1           1 NONE
COLSZ254                1           1 FREQUENCY
COLSZ3                  1           1 FREQUENCY

Now we add 3 new values to each column in the table giving us 4 distinct values in total.

insert into tab1 values (2,2,2);
insert into tab1 values (3,3,3);
insert into tab1 values (4,4,4);
commit;

If we gather statistics for all of the columns using METHOD_OPT size REPEAT what do you expect to happen? My expectation is that:

  • COLSZ1 will have 1 bucket
  • COLSZ3 will have 3 buckets and become a HEIGHT BALANCED histogram because we have more values that the number of buckets
  • COLSZ254 will have 4 buckets and remain a FREQUENCY histogram

Let’s see…

exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for all columns size repeat');

COLUMN_NAME  NUM_DISTINCT NUM_BUCKETS HISTOGRAM
------------ ------------ ----------- ---------------
COLSZ1                  4           1 NONE
COLSZ254                4           4 FREQUENCY
COLSZ3                  4           4 FREQUENCY

I was wrong! Size REPEAT ignored me (or perhaps corrected me) and created a fourth bucket. What if I don’t use size REPEAT and force a size of 3…

exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colsz3 size 3');

COLUMN_NAME  NUM_DISTINCT NUM_BUCKETS HISTOGRAM
------------ ------------ ----------- ---------------
COLSZ1                  4           1 NONE
COLSZ254                4           4 FREQUENCY
COLSZ3                  4           3 HEIGHT BALANCED

Now that’s what I expected. So now the histogram is HEIGHT BALANCED let’s try size REPEAT again.

exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for all columns size repeat');

COLUMN_NAME  NUM_DISTINCT NUM_BUCKETS HISTOGRAM
------------ ------------ ----------- ---------------
COLSZ1                  4           1 NONE
COLSZ254                4           4 FREQUENCY
COLSZ3                  4           3 HEIGHT BALANCED

Interesting. So this time size REPEAT did not introduce a fourth bucket.

At first this seemed to be significant and I was searching for a case where I could catch it out. However it looks to me that the DBMS_STATS code is behaving sensibly:

  • If we have a frequency histogram and a new value appears then DBMS_STATS will add an extra bucket if possible. In the example above I specifically asked for three buckets when there was only one value, when a fourth distinct value was added DBMS_STATS was smart enough to add another bucket.
  • If we have a HEIGHT BALANCED histogram and another distinct value is added to the column then DBMS_STATS assumes there was a good reason for specifying fewer buckets than values and leaves it well alone.

The title of this post was “When does METHOD_OPT size REPEAT not obey?”. The answer of course is “When it knows better!”

The implementation of this turns out to be quite simple and will be demonstrated in my next post.

What is the default for a METHOD_OPT value using SIZE REPEAT?

Past experience has always led me to believe that a METHOD_OPT value of “FOR ALL COLUMNS SIZE REPEAT” would default to “SIZE 1″ on columns without statistics. I have never explicitly tested this default however so this is the quick test I did to put my mind at rest.

First we create a test table that, being new, has never had any queries on it. Therefore “SIZE AUTO” should not create any histograms.

DROP TABLE tab1;
CREATE TABLE tab1
(	col1 NUMBER
,	col2 VARCHAR2(1));
INSERT INTO tab1
SELECT	ROWNUM
,	'Y'
FROM	dual
CONNECT BY ROWNUM < 100;
UPDATE tab1 SET col2 = 'N' WHERE ROWNUM = 1;
COMMIT;

exec dbms_stats.gather_table_stats(USER,'tab1',method_opt=>'for all columns size auto');

SELECT column_name, num_buckets FROM user_tab_columns WHERE table_name = 'TAB1';

COLUMN_NAME                    NUM_BUCKETS
------------------------------ -----------
COL1                                     1
COL2                                     1

Ok, as expected there are no histograms. So now we repeat the test but query the columns before gathering statistics. This time “SIZE AUTO” should create histograms, at least on the skewed column “COL2″.

DROP TABLE tab1;
CREATE TABLE tab1
(	col1 NUMBER
,	col2 VARCHAR2(1));
INSERT INTO tab1
SELECT	ROWNUM
,	'Y'
FROM	dual
CONNECT BY ROWNUM < 100;
UPDATE tab1 SET col2 = 'N' WHERE ROWNUM = 1;
COMMIT;

SELECT COUNT(*) FROM tab1 WHERE col1 = 1 and col2 = 'N';

exec dbms_stats.gather_table_stats(USER,'tab1',method_opt=>'for all columns size auto');

SELECT column_name, num_buckets FROM user_tab_columns WHERE table_name = 'TAB1';

COLUMN_NAME                    NUM_BUCKETS
------------------------------ -----------
COL1                                     1
COL2                                     2

NUM_BUCKETS for COL2 is “2” so we did get the expected histogram. Now we can repeat the test but this time gather statistics with “SIZE REPEAT”. If “SIZE REPEAT” defaults to “SIZE 1″ when no prior histograms are present then we should only see values of 1 for NUM_BUCKETS. If it defaults to “SIZE AUTO” then we should see NUM_BUCKETS > 0 for at least COL2. Let’s see:

DROP TABLE tab1;
CREATE TABLE tab1
(	col1 NUMBER
,	col2 VARCHAR2(1));
INSERT INTO tab1
SELECT	ROWNUM
,	'Y'
FROM	dual
CONNECT BY ROWNUM < 100;
UPDATE tab1 SET col2 = 'N' WHERE ROWNUM = 1;
COMMIT;

SELECT COUNT(*) FROM tab1 WHERE col1 = 1 and col2 = 'N';

exec dbms_stats.gather_table_stats(USER,'tab1',method_opt=>'for all columns size repeat');

SELECT column_name, num_buckets FROM user_tab_columns WHERE table_name = 'TAB1';

COLUMN_NAME                    NUM_BUCKETS
------------------------------ -----------
COL1                                     1
COL2                                     1

So I believe the default value for “SIZE REPEAT” is “SIZE 1″, please let me know if you see a hole in my thinking.

Optimiser Statistics and Lower Case Columns

I was contacted recently about a performance problem in an application we support. The query summarises the number of issues that have not been closed. A simplified version is shown below:

SELECT  COUNT(*)
FROM    "issues"
WHERE   "state" IN ('open','new');

There are two things of note here. Firstly your eyes do not deceive you – those are lower case table and column names (more on this later). Secondly the “state” column is indexed yet the query is doing a full table scan on the 400,000 row “issues” table.

The reason for the full scan is skewed “state” data. There are a handful of new and open issues but a heck of a lot of closed ones (and no possibility of setting them all to NULL before you ask). So the index looks unattractive to the optimiser:

400,000 rows / 3 distinct values = rubbish index access

The obvious thing to do is put a histogram on the “state” column. Things did not go quite to plan. My demonstration of this is below (on a 9.2.0.8 database – the same as the application in question).

1) Create a test table with a mixture of upper and lower case columns, notice we have to wrap table and column names in double quotes (“”) to avoid a conversion to upper case.

create table "tab1"(col1 number,"col2" number,"col3" number);

begin
	for i in 1..10 loop
		insert into "tab1" values (i,i,i);
	end loop;
	commit;
end;
/

2) DBMS_STATS also performs upper case conversions

exec dbms_stats.gather_table_stats(null,'tab1');

column column_name format a10
select  tc.table_name,tc.column_name, tc.nullable nulls
,       tc.num_distinct, tc.num_buckets buckets
from    user_tab_columns tc
where tc.table_name = 'tab1'
order by 1,2;

TABLE_NAME   COLUMN_NAM N NUM_DISTINCT    BUCKETS
------------ ---------- - ------------ ----------
tab1         COL1       Y
tab1         col2       Y
tab1         col3       Y

And again with double quotes (“”) to prevent case conversion

exec dbms_stats.gather_table_stats(null,'"tab1"');

column column_name format a10
select  tc.table_name,tc.column_name, tc.nullable nulls
,       tc.num_distinct, tc.num_buckets buckets
from    user_tab_columns tc
where tc.table_name = 'tab1'
order by 1,2;

TABLE_NAME   COLUMN_NAM N NUM_DISTINCT    BUCKETS
------------ ---------- - ------------ ----------
tab1         COL1       Y           10          1
tab1         col2       Y           10          1
tab1         col3       Y           10          1

So far everything works as expected. Now for the histogram.

3) Create histograms on lower and upper case columns

exec dbms_stats.delete_table_stats(null,'"tab1"');
exec dbms_stats.gather_table_stats(null,'"tab1"'-
               ,method_opt=>'for columns col1 size 254');
exec dbms_stats.gather_table_stats(null,'"tab1"'-
               ,method_opt=>'for columns "col2" size 254');

column column_name format a10
select  tc.table_name,tc.column_name, tc.nullable nulls
,       tc.num_distinct, tc.num_buckets buckets
from    user_tab_columns tc
where   tc.table_name = 'tab1'
order by 1,2;

TABLE_NAME   COLUMN_NAM N NUM_DISTINCT    BUCKETS
------------ ---------- - ------------ ----------
tab1         COL1       Y           10          9
tab1         col2       Y
tab1         col3       Y

That wasn’t expected, there’s no histogram on “col2″. After trying an assortment of “special” characters and seeing the following message far too frequently…

ERROR at line 1:
ORA-06502: PL/SQL: numeric or value error
ORA-06512: at "SYS.DBMS_STATS", line 10502
ORA-06512: at "SYS.DBMS_STATS", line 10516
ORA-06512: at line 1

…I did what I should have done originally and searched Metalink:

Bug 4892211 – DBMS_STATS does not recognise lower case / special characters in column names

  • This issue is fixed in*
  • 10.2.0.3 (Server Patch Set) <javascript:taghelp(‘FIXED_A203′)>
  • 11.1.0.6 (Base Release) <javascript:taghelp(‘FIXED_B106′)>

4) At first I thought a workaround would be to add a histogram to all columns on the table and then remove the ones I don’t want. However I’m faced with the same problem – I can’t specify “col3″ to remove that histogram:

exec dbms_stats.gather_table_stats(null,'"tab1"'-
               ,method_opt=>'for all columns size 254');

TABLE_NAME   COLUMN_NAM N NUM_DISTINCT    BUCKETS
------------ ---------- - ------------ ----------
tab1         COL1       Y           10          9
tab1         col2       Y           10          9
tab1         col3       Y           10          9

exec dbms_stats.gather_table_stats(null,'"tab1"'-
               ,method_opt=>'for columns "col3" size 1');

TABLE_NAME   COLUMN_NAM N NUM_DISTINCT    BUCKETS
------------ ---------- - ------------ ----------
tab1         COL1       Y           10          1
tab1         col2       Y           10          9
tab1         col3       Y           10          9

So in my case the best I could do was put the histogram on all indexed columns and use “method_opt=>’for all columns size repeat'”for future optimiser statistic collections to keep the histogram in place.