DBMS_STATS.SET_TABLE_STATS and statistic staleness

Recently at work there was a discussion regarding setting optimiser statistics using DBMS_STATS SET_*_STATISTICS procedures. The question was asked:

If we set the statistics ourselves will the automatic stat’s job not overwrite them the following night?

Now I was confident that this was not the case but, having never tested this explicitly, thought it better to get my facts straight.

First some quotes from the Oracle Documentation

Setting Statistics

You can set table, column, index, and system statistics using the SET_*_STATISTICS procedures. Setting statistics in the manner is not recommended, because inaccurate or inconsistent statistics can lead to poor performance.

Determining Stale Statistics

Monitoring tracks the approximate number of INSERTs, UPDATEs, and DELETEs for that table and whether the table has been truncated since the last time statistics were gathered. You can access information about changes of tables in the USER_TAB_MODIFICATIONS view. Following a data-modification, there may be a few minutes delay while Oracle Database propagates the information to this view. Use the DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO procedure to immediately reflect the outstanding monitored information kept in the memory.

The GATHER_DATABASE_STATS or GATHER_SCHEMA_STATS procedures gather new statistics for tables with stale statistics when the OPTIONS parameter is set to GATHER STALE or GATHER AUTO. If a monitored table has been modified more than 10%, then these statistics are considered stale and gathered again.

The case under discussion was that old favourite – new partitions added to existing tables. The test that follows creates a simple table with four partitions. Three of the partitions are mature, each containing 10,000 rows. The fourth partition is quite new and only contains 1,000 rows.

drop table part_tab;
create table part_tab
(	id	number
,	yrmon	number(6)
,	padcol	varchar2(100))
partition by list (yrmon)
(partition p201101 values (201101)
,partition p201102 values (201102)
,partition p201103 values (201103)
,partition p201104 values (201104));

insert into part_tab
select rownum,201101,lpad('x',100,'y')
from dual connect by rownum <= 10000;
insert into part_tab
select rownum,201102,lpad('x',100,'y')
from dual connect by rownum <= 10000;
insert into part_tab
select rownum,201103,lpad('x',100,'y')
from dual connect by rownum <= 10000;
insert into part_tab
select rownum,201104,lpad('x',100,'y')
from dual connect by rownum <= 1000;

select PARTITION_NAME,NUM_ROWS,BLOCKS,AVG_ROW_LEN,LAST_ANALYZED
from user_tab_statistics
where table_name = 'PART_TAB';

PARTITION_NAME                   NUM_ROWS     BLOCKS AVG_ROW_LEN LAST_ANALYZED
------------------------------ ---------- ---------- ----------- -----------------
                                    31000        568         110 05-MAR-2011 21:24
P201101                             10000        174         110 05-MAR-2011 21:24
P201102                             10000        174         110 05-MAR-2011 21:24
P201103                             10000        174         110 05-MAR-2011 21:24
P201104                              1000         46         110 05-MAR-2011 21:24

The statistics on the fourth partition are now faked using DBMS_STATS.SET_TABLE_STATS to match the other partitions.

-- fake statistics in partition 201104
begin
 dbms_stats.set_table_stats(null
 , 'part_tab'
 , 'P201104'
 , numrows=>10000
 , numblks=>174
 , avgrlen=>110);
end;
/

PARTITION_NAME                   NUM_ROWS     BLOCKS AVG_ROW_LEN LAST_ANALYZED
------------------------------ ---------- ---------- ----------- -----------------
                                    31000        568         110 05-MAR-2011 21:24
P201101                             10000        174         110 05-MAR-2011 21:24
P201102                             10000        174         110 05-MAR-2011 21:24
P201103                             10000        174         110 05-MAR-2011 21:24
P201104                             10000        174         110 05-MAR-2011 21:24

Now we update the partitions in a variety of ways as described by the comments in the code below.

-- partition "P201101": update 11% of rows  (> staleness threshold) and commit
update part_tab
set padcol = lpad('x',100,'z')
where yrmon = 201101 and rownum <= 1100;
commit;

-- partition "P201102": update 11% of rows  (> staleness threshold) and rollback
update part_tab
set padcol = lpad('x',100,'z')
where yrmon = 201102 and rownum <= 1100;
rollback;

-- partition "P201103": update 9% of rows  (< staleness threshold) and commit
update part_tab
set padcol = lpad('x',100,'z')
where yrmon = 201103 and rownum <= 900;
commit;

-- partition "P201104": update 9% of rows according to optimiser statistics
--  (< staleness threshold) but 90% or actual rows (> staleness threshold) 
update part_tab
set padcol = lpad('x',100,'z')
where yrmon = 201104 and rownum <= 900;
commit;

exec DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO

select PARTITION_NAME,INSERTS,UPDATES,DELETES 
from USER_TAB_MODIFICATIONS
where table_name = 'PART_TAB';

PARTITION_NAME                    INSERTS    UPDATES    DELETES
------------------------------ ---------- ---------- ----------
                                        0       4000          0
P201101                                 0       1100          0
P201102                                 0       1100          0
P201103                                 0        900          0
P201104                                 0        900          0

From the output above we can see that the rolled back modifications on partition “P201102″ still increase the counters in USER_TAB_MODIFICATIONS. Interesting.

Now we gather optimiser statistics. If DBMS_STATS respects the faked statistics then partition P201104′s statistics should remain unchanged. If DBMS_STATS can see through our fakery then partition P201104′s statistics should be recollected.

Watch to see which partitions’ LAST_ANALYZED value change from “21:24″ to “21:26″.

exec dbms_stats.GATHER_DATABASE_STATS(options=>'gather auto')

select PARTITION_NAME,NUM_ROWS,BLOCKS,AVG_ROW_LEN,LAST_ANALYZED
from user_tab_statistics
where table_name = 'PART_TAB';

PARTITION_NAME                   NUM_ROWS     BLOCKS AVG_ROW_LEN LAST_ANALYZED
------------------------------ ---------- ---------- ----------- -----------------
                                    31000        568         110 05-MAR-2011 21:26
P201101                             10000        174         110 05-MAR-2011 21:26
P201102                             10000        174         110 05-MAR-2011 21:26
P201103                             10000        174         110 05-MAR-2011 21:24
P201104                             10000        174         110 05-MAR-2011 21:24

As expected USER_TAB_STATISTICS is the truth as far DBMS_STATS is concerned. The partitions with 11% modifications have new statistics collected. The partitions with 9% modifications keep their existing statistics, including partition “P201104″.

Nothing new or surprising here but I now have the hard facts to put in front of my colleague, remaining mindful of the comment “Setting statistics in the manner is not recommended” at all times :)

What is the default for a METHOD_OPT value using SIZE REPEAT?

Past experience has always led me to believe that a METHOD_OPT value of “FOR ALL COLUMNS SIZE REPEAT” would default to “SIZE 1″ on columns without statistics. I have never explicitly tested this default however so this is the quick test I did to put my mind at rest.

First we create a test table that, being new, has never had any queries on it. Therefore “SIZE AUTO” should not create any histograms.

DROP TABLE tab1;
CREATE TABLE tab1
(	col1 NUMBER
,	col2 VARCHAR2(1));
INSERT INTO tab1
SELECT	ROWNUM
,	'Y'
FROM	dual
CONNECT BY ROWNUM < 100;
UPDATE tab1 SET col2 = 'N' WHERE ROWNUM = 1;
COMMIT;

exec dbms_stats.gather_table_stats(USER,'tab1',method_opt=>'for all columns size auto');

SELECT column_name, num_buckets FROM user_tab_columns WHERE table_name = 'TAB1';

COLUMN_NAME                    NUM_BUCKETS
------------------------------ -----------
COL1                                     1
COL2                                     1

Ok, as expected there are no histograms. So now we repeat the test but query the columns before gathering statistics. This time “SIZE AUTO” should create histograms, at least on the skewed column “COL2″.

DROP TABLE tab1;
CREATE TABLE tab1
(	col1 NUMBER
,	col2 VARCHAR2(1));
INSERT INTO tab1
SELECT	ROWNUM
,	'Y'
FROM	dual
CONNECT BY ROWNUM < 100;
UPDATE tab1 SET col2 = 'N' WHERE ROWNUM = 1;
COMMIT;

SELECT COUNT(*) FROM tab1 WHERE col1 = 1 and col2 = 'N';

exec dbms_stats.gather_table_stats(USER,'tab1',method_opt=>'for all columns size auto');

SELECT column_name, num_buckets FROM user_tab_columns WHERE table_name = 'TAB1';

COLUMN_NAME                    NUM_BUCKETS
------------------------------ -----------
COL1                                     1
COL2                                     2

NUM_BUCKETS for COL2 is “2″ so we did get the expected histogram. Now we can repeat the test but this time gather statistics with “SIZE REPEAT”. If “SIZE REPEAT” defaults to “SIZE 1″ when no prior histograms are present then we should only see values of 1 for NUM_BUCKETS. If it defaults to “SIZE AUTO” then we should see NUM_BUCKETS > 0 for at least COL2. Let’s see:

DROP TABLE tab1;
CREATE TABLE tab1
(	col1 NUMBER
,	col2 VARCHAR2(1));
INSERT INTO tab1
SELECT	ROWNUM
,	'Y'
FROM	dual
CONNECT BY ROWNUM < 100;
UPDATE tab1 SET col2 = 'N' WHERE ROWNUM = 1;
COMMIT;

SELECT COUNT(*) FROM tab1 WHERE col1 = 1 and col2 = 'N';

exec dbms_stats.gather_table_stats(USER,'tab1',method_opt=>'for all columns size repeat');

SELECT column_name, num_buckets FROM user_tab_columns WHERE table_name = 'TAB1';

COLUMN_NAME                    NUM_BUCKETS
------------------------------ -----------
COL1                                     1
COL2                                     1

So I believe the default value for “SIZE REPEAT” is “SIZE 1″, please let me know if you see a hole in my thinking.