METHOD_OPT of NULL

This is the start of a series of MEEK (maybe everyone else knows) posts as I seem to keep running into things about Oracle that I feel I should have known.

Until very recently I did not know what using METHOD_OPT => NULL in a DBMS_STATS call does. If you do then stop reading here 🙂

Creation of a test table…

MNASH@orcl> create table t as select * from dba_objects;

Table created.

MNASH@orcl> select count(*) from t;

  COUNT(*)
----------
     72563

MNASH@orcl> select table_name, num_rows from user_tables where table_name = 'T';

TABLE_NAME   NUM_ROWS
---------- ----------
T

MNASH@orcl> select table_name, column_name, num_distinct from user_tab_columns where table_name = 'T';

TABLE_NAME COLUMN_NAME                    NUM_DISTINCT
---------- ------------------------------ ------------
T          OWNER
T          OBJECT_NAME
T          SUBOBJECT_NAME
T          OBJECT_ID
T          DATA_OBJECT_ID
T          OBJECT_TYPE
T          CREATED
T          LAST_DDL_TIME
T          TIMESTAMP
T          STATUS
T          TEMPORARY
T          GENERATED
T          SECONDARY
T          NAMESPACE
T          EDITION_NAME

15 rows selected.

MNASH@orcl>

Gathering statistics using DBMS_STATS.GATHER_TABLE_STATS with METHOD_OPT => NULL

MNASH@orcl> exec dbms_stats.gather_table_stats(null,'T',method_opt=>null)

PL/SQL procedure successfully completed.

MNASH@orcl> select table_name, num_rows from user_tables where table_name = 'T';

TABLE_NAME   NUM_ROWS
---------- ----------
T               72563

MNASH@orcl> select table_name, column_name, num_distinct from user_tab_columns where table_name = 'T';

TABLE_NAME COLUMN_NAME                    NUM_DISTINCT
---------- ------------------------------ ------------
T          OWNER
T          OBJECT_NAME
T          SUBOBJECT_NAME
T          OBJECT_ID
T          DATA_OBJECT_ID
T          OBJECT_TYPE
T          CREATED
T          LAST_DDL_TIME
T          TIMESTAMP
T          STATUS
T          TEMPORARY
T          GENERATED
T          SECONDARY
T          NAMESPACE
T          EDITION_NAME

15 rows selected.

MNASH@orcl>

Statistics are gathered for the table, but not for the columns of the table.

I found out about this through a friend after puzzling over how some of the tables in a database had statistics at a table level, but none for the columns.

Just for completeness, the following shows what happens when METHOD_OPT is allowed to default:

MNASH@orcl> exec dbms_stats.gather_table_stats(null,'T')

PL/SQL procedure successfully completed.

MNASH@orcl> select table_name, num_rows from user_tables where table_name = 'T';

TABLE_NAME   NUM_ROWS
---------- ----------
T               72563

MNASH@orcl> select table_name, column_name, num_distinct from user_tab_columns where table_name = 'T';

TABLE_NAME COLUMN_NAME                    NUM_DISTINCT
---------- ------------------------------ ------------
T          OWNER                                    32
T          OBJECT_NAME                           43784
T          SUBOBJECT_NAME                          108
T          OBJECT_ID                             72563
T          DATA_OBJECT_ID                         7705
T          OBJECT_TYPE                              44
T          CREATED                                1174
T          LAST_DDL_TIME                          1272
T          TIMESTAMP                              1318
T          STATUS                                    1
T          TEMPORARY                                 2
T          GENERATED                                 2
T          SECONDARY                                 2
T          NAMESPACE                                21
T          EDITION_NAME                              0

15 rows selected.

MNASH@orcl>

Whether or not you want to use the default, which is to allow Oracle to create histograms as it sees appropriate, is a separate point which has received plenty of attention elsewhere.

In the case I was looking into there was actually a bug in the code used as a wrapper to DBMS_STATS and it was not intended that no column statistics were gathered, but I started to wonder if there would be a situation where using this undocumented option would be attractive. I did some fairly simple testing to establish the overhead of collecting column statistics…

The test was:

1 – Increase the number of rows in table T to around 1 million
2 – Delete table statistics for table T
3 – Flush the buffer cache
4 – Gather table statistics using compute with method_opt => null
5 – Repeat (2) to (4) 5 times
6 – Delete table statistics for table T
7 – Flush the buffer cache
8 – Gather table statistics using compute with method_opt => ‘for all columns size 1’
9 – Repeat (6) to (8) 5 times

Looking at the elapsed time for the statistics gathering I got averages of:

null: 2.3 seconds
for all columns size 1: 14.5 seconds

That’s a pretty big overhead, but a lot of information you are robbing the CBO of if you don’t gather them!

I then ran the same test, but with AUTO_SAMPLE_SIZE as this is a more realistic case and got averages of:

null: 1.5 seconds
for all columns size 1: 3.4 seconds

Note: The tests were performed on a 11.2.0.1 database.

Where is METHOD_OPT SIZE REPEAT’s repeat value stored?

In my last post When does METHOD_OPT SIZE REPEAT not obey we saw how SIZE REPEAT sometimes appears to know best and creates a FREQUENCY histogram with a different number of buckets than previously specified.

The mechanics of this became clearer while pondering another question that I often ask myself.

How does METHOD_OPT SIZE REPEAT remember the value last specified in order to repeat it?

My primitive testing leads me to believe that the key value is stored in column “SYS.HIST_HEAD$.BUCKET_CNT”. The test that follows shows my working.

The table created below has two columns with HEIGHT BALANCED histograms, one with no histogram and one with a FREQUENCY BASED histogram. The column names should help in following the test.

drop  table tab1 ;
create table tab1 
(	colskew1 number
,	colskew2 number
,	colnoskew number
,	colfreq number
);

insert into tab1
select case when rownum <= 500 then rownum else 0 end
,      case when rownum <= 500 then rownum else 0 end
,      rownum
,      case when rownum <= 50 then rownum else 0 end
from   all_objects 
where  rownum <= 4000;
commit;

exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colskew1 size 150');
exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colskew2 size 254');
exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colnoskew size 1');
exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colfreq size 254');

column column_name format a12
select c.column_name,t.num_rows, c.num_distinct,c.num_buckets,c.histogram 
from user_tab_columns c, user_tables t
where t.table_name = 'TAB1' and c.table_name = t.table_name order by 1;

COLUMN_NAME    NUM_ROWS NUM_DISTINCT NUM_BUCKETS HISTOGRAM
------------ ---------- ------------ ----------- ---------------
COLSKEW1           4000          501         150 HEIGHT BALANCED
COLSKEW2           4000          501         254 HEIGHT BALANCED
COLNOSKEW          4000         4000           1 NONE
COLFREQ            4000           51          51 FREQUENCY

Now if we check the value stored in “SYS.HIST_HEAD$.BUCKET_CNT” we see the following.

select c.name,c.intcol#,h.bucket_cnt
from sys.col$ c , sys.hist_head$ h
where c.obj# = (select object_id from dba_objects where object_name ='TAB1')
and h.obj# = c.obj#
and h.intcol# = c.intcol#;

NAME                              INTCOL# BUCKET_CNT
------------------------------ ---------- ----------
COLSKEW1                                1        150
COLSKEW2                                2        254
COLNOSKEW                               3          1
COLFREQ                                 4       4000

So for HEIGHT BALANCED histograms (including an absence of a histogram) BUCKET_CNT holds the number of buckets specified when creating the histogram. The FREQUENCY BASED histogram however stores a value equal to the row count. This is different to the NUM_BUCKETS value displayed when querying USER_TAB_COLUMNS.

Now a statistics collection with a METHOD_OPT of SIZE REPEAT simply needs to use a value similar to “MIN(BUCKET_CNT,254)” as its target bucket count.

This was backed up (if not exactly confirmed) by then checking the definition of the SYS view USER_TAB_COLS. The extract below is for the NUM_BUCKETS column:

       case when nvl(h.distcnt,0) = 0 then h.distcnt
            when h.row_cnt = 0 then 1
            when (h.bucket_cnt > 255
                  or
                  (h.bucket_cnt > h.distcnt
                   and h.row_cnt = h.distcnt
                   and h.density*h.bucket_cnt < 1))
                then h.row_cnt
            else h.bucket_cnt
       end

We can see that for a BUCKET_CNT > 255 the value used for NUM_BUCKETS is switched for HIST_HEAD$.ROW_CNT – this is the number of records in USER_HISTOGRAMS for the column, i.e. for FREQUENCY BASED histograms the number of buckets actually stored.

I’m not sure I’ve demonstrated this very well. In essence for SIZE REPEAT:

  • FREQUENCY BASED histograms disregard any value previously specified and pick up a bucket count of the number of distinct rows in the row sample (if NUM_DISTINCT is <= 254 of course).
  • HEIGHT BALANCED histograms can always reuse the value stored in HIST_HEAD$.BUCKET_CNT which will only be between 1 and 254.

When does METHOD_OPT size REPEAT not obey?

I have an example here of a case where a DBMS_STATS METHOD_OPT parameter setting of REPEAT does not do as expected (by me at least).

We have a table with 3 columns. The table is loaded with 1000 rows, all columns in all rows are set to 1.

drop  table tab1 ;
create table tab1 
(	colsz1 number
,	colsz3 number
,	colsz254 number
);
insert into tab1
select 1,1,1 from all_objects where rownum <= 1000;
commit;

We now gather statistics on each column with varying histogram sizes and check the resulting histogram settings for each one. The column names will help keep track of which is which.

exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colsz1 size 1');
exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colsz3 size 3');
exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colsz254 size 254');

column column_name format a12
select column_name,num_distinct,num_buckets,histogram 
from dba_tab_columns 
where table_name = 'TAB1' order by 1;

Below we can observe that “SIZE 1” did not create a histogram and “SIZE 3” & “SIZE 254” created frequency histograms, each with 1 bucket. All as expected.

COLUMN_NAME  NUM_DISTINCT NUM_BUCKETS HISTOGRAM
------------ ------------ ----------- ---------------
COLSZ1                  1           1 NONE
COLSZ254                1           1 FREQUENCY
COLSZ3                  1           1 FREQUENCY

Now we add 3 new values to each column in the table giving us 4 distinct values in total.

insert into tab1 values (2,2,2);
insert into tab1 values (3,3,3);
insert into tab1 values (4,4,4);
commit;

If we gather statistics for all of the columns using METHOD_OPT size REPEAT what do you expect to happen? My expectation is that:

  • COLSZ1 will have 1 bucket
  • COLSZ3 will have 3 buckets and become a HEIGHT BALANCED histogram because we have more values that the number of buckets
  • COLSZ254 will have 4 buckets and remain a FREQUENCY histogram

Let’s see…

exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for all columns size repeat');

COLUMN_NAME  NUM_DISTINCT NUM_BUCKETS HISTOGRAM
------------ ------------ ----------- ---------------
COLSZ1                  4           1 NONE
COLSZ254                4           4 FREQUENCY
COLSZ3                  4           4 FREQUENCY

I was wrong! Size REPEAT ignored me (or perhaps corrected me) and created a fourth bucket. What if I don’t use size REPEAT and force a size of 3…

exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colsz3 size 3');

COLUMN_NAME  NUM_DISTINCT NUM_BUCKETS HISTOGRAM
------------ ------------ ----------- ---------------
COLSZ1                  4           1 NONE
COLSZ254                4           4 FREQUENCY
COLSZ3                  4           3 HEIGHT BALANCED

Now that’s what I expected. So now the histogram is HEIGHT BALANCED let’s try size REPEAT again.

exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for all columns size repeat');

COLUMN_NAME  NUM_DISTINCT NUM_BUCKETS HISTOGRAM
------------ ------------ ----------- ---------------
COLSZ1                  4           1 NONE
COLSZ254                4           4 FREQUENCY
COLSZ3                  4           3 HEIGHT BALANCED

Interesting. So this time size REPEAT did not introduce a fourth bucket.

At first this seemed to be significant and I was searching for a case where I could catch it out. However it looks to me that the DBMS_STATS code is behaving sensibly:

  • If we have a frequency histogram and a new value appears then DBMS_STATS will add an extra bucket if possible. In the example above I specifically asked for three buckets when there was only one value, when a fourth distinct value was added DBMS_STATS was smart enough to add another bucket.
  • If we have a HEIGHT BALANCED histogram and another distinct value is added to the column then DBMS_STATS assumes there was a good reason for specifying fewer buckets than values and leaves it well alone.

The title of this post was “When does METHOD_OPT size REPEAT not obey?”. The answer of course is “When it knows better!”

The implementation of this turns out to be quite simple and will be demonstrated in my next post.