Where is METHOD_OPT SIZE REPEAT’s repeat value stored?

In my last post When does METHOD_OPT SIZE REPEAT not obey we saw how SIZE REPEAT sometimes appears to know best and creates a FREQUENCY histogram with a different number of buckets than previously specified.

The mechanics of this became clearer while pondering another question that I often ask myself.

How does METHOD_OPT SIZE REPEAT remember the value last specified in order to repeat it?

My primitive testing leads me to believe that the key value is stored in column “SYS.HIST_HEAD$.BUCKET_CNT”. The test that follows shows my working.

The table created below has two columns with HEIGHT BALANCED histograms, one with no histogram and one with a FREQUENCY BASED histogram. The column names should help in following the test.

drop  table tab1 ;
create table tab1 
(	colskew1 number
,	colskew2 number
,	colnoskew number
,	colfreq number
);

insert into tab1
select case when rownum <= 500 then rownum else 0 end
,      case when rownum <= 500 then rownum else 0 end
,      rownum
,      case when rownum <= 50 then rownum else 0 end
from   all_objects 
where  rownum <= 4000;
commit;

exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colskew1 size 150');
exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colskew2 size 254');
exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colnoskew size 1');
exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colfreq size 254');

column column_name format a12
select c.column_name,t.num_rows, c.num_distinct,c.num_buckets,c.histogram 
from user_tab_columns c, user_tables t
where t.table_name = 'TAB1' and c.table_name = t.table_name order by 1;

COLUMN_NAME    NUM_ROWS NUM_DISTINCT NUM_BUCKETS HISTOGRAM
------------ ---------- ------------ ----------- ---------------
COLSKEW1           4000          501         150 HEIGHT BALANCED
COLSKEW2           4000          501         254 HEIGHT BALANCED
COLNOSKEW          4000         4000           1 NONE
COLFREQ            4000           51          51 FREQUENCY

Now if we check the value stored in “SYS.HIST_HEAD$.BUCKET_CNT” we see the following.

select c.name,c.intcol#,h.bucket_cnt
from sys.col$ c , sys.hist_head$ h
where c.obj# = (select object_id from dba_objects where object_name ='TAB1')
and h.obj# = c.obj#
and h.intcol# = c.intcol#;

NAME                              INTCOL# BUCKET_CNT
------------------------------ ---------- ----------
COLSKEW1                                1        150
COLSKEW2                                2        254
COLNOSKEW                               3          1
COLFREQ                                 4       4000

So for HEIGHT BALANCED histograms (including an absence of a histogram) BUCKET_CNT holds the number of buckets specified when creating the histogram. The FREQUENCY BASED histogram however stores a value equal to the row count. This is different to the NUM_BUCKETS value displayed when querying USER_TAB_COLUMNS.

Now a statistics collection with a METHOD_OPT of SIZE REPEAT simply needs to use a value similar to “MIN(BUCKET_CNT,254)” as its target bucket count.

This was backed up (if not exactly confirmed) by then checking the definition of the SYS view USER_TAB_COLS. The extract below is for the NUM_BUCKETS column:

       case when nvl(h.distcnt,0) = 0 then h.distcnt
            when h.row_cnt = 0 then 1
            when (h.bucket_cnt > 255
                  or
                  (h.bucket_cnt > h.distcnt
                   and h.row_cnt = h.distcnt
                   and h.density*h.bucket_cnt < 1))
                then h.row_cnt
            else h.bucket_cnt
       end

We can see that for a BUCKET_CNT > 255 the value used for NUM_BUCKETS is switched for HIST_HEAD$.ROW_CNT – this is the number of records in USER_HISTOGRAMS for the column, i.e. for FREQUENCY BASED histograms the number of buckets actually stored.

I’m not sure I’ve demonstrated this very well. In essence for SIZE REPEAT:

  • FREQUENCY BASED histograms disregard any value previously specified and pick up a bucket count of the number of distinct rows in the row sample (if NUM_DISTINCT is <= 254 of course).
  • HEIGHT BALANCED histograms can always reuse the value stored in HIST_HEAD$.BUCKET_CNT which will only be between 1 and 254.

When does METHOD_OPT size REPEAT not obey?

I have an example here of a case where a DBMS_STATS METHOD_OPT parameter setting of REPEAT does not do as expected (by me at least).

We have a table with 3 columns. The table is loaded with 1000 rows, all columns in all rows are set to 1.

drop  table tab1 ;
create table tab1 
(	colsz1 number
,	colsz3 number
,	colsz254 number
);
insert into tab1
select 1,1,1 from all_objects where rownum <= 1000;
commit;

We now gather statistics on each column with varying histogram sizes and check the resulting histogram settings for each one. The column names will help keep track of which is which.

exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colsz1 size 1');
exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colsz3 size 3');
exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colsz254 size 254');

column column_name format a12
select column_name,num_distinct,num_buckets,histogram 
from dba_tab_columns 
where table_name = 'TAB1' order by 1;

Below we can observe that “SIZE 1″ did not create a histogram and “SIZE 3″ & “SIZE 254″ created frequency histograms, each with 1 bucket. All as expected.

COLUMN_NAME  NUM_DISTINCT NUM_BUCKETS HISTOGRAM
------------ ------------ ----------- ---------------
COLSZ1                  1           1 NONE
COLSZ254                1           1 FREQUENCY
COLSZ3                  1           1 FREQUENCY

Now we add 3 new values to each column in the table giving us 4 distinct values in total.

insert into tab1 values (2,2,2);
insert into tab1 values (3,3,3);
insert into tab1 values (4,4,4);
commit;

If we gather statistics for all of the columns using METHOD_OPT size REPEAT what do you expect to happen? My expectation is that:

  • COLSZ1 will have 1 bucket
  • COLSZ3 will have 3 buckets and become a HEIGHT BALANCED histogram because we have more values that the number of buckets
  • COLSZ254 will have 4 buckets and remain a FREQUENCY histogram

Let’s see…

exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for all columns size repeat');

COLUMN_NAME  NUM_DISTINCT NUM_BUCKETS HISTOGRAM
------------ ------------ ----------- ---------------
COLSZ1                  4           1 NONE
COLSZ254                4           4 FREQUENCY
COLSZ3                  4           4 FREQUENCY

I was wrong! Size REPEAT ignored me (or perhaps corrected me) and created a fourth bucket. What if I don’t use size REPEAT and force a size of 3…

exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colsz3 size 3');

COLUMN_NAME  NUM_DISTINCT NUM_BUCKETS HISTOGRAM
------------ ------------ ----------- ---------------
COLSZ1                  4           1 NONE
COLSZ254                4           4 FREQUENCY
COLSZ3                  4           3 HEIGHT BALANCED

Now that’s what I expected. So now the histogram is HEIGHT BALANCED let’s try size REPEAT again.

exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for all columns size repeat');

COLUMN_NAME  NUM_DISTINCT NUM_BUCKETS HISTOGRAM
------------ ------------ ----------- ---------------
COLSZ1                  4           1 NONE
COLSZ254                4           4 FREQUENCY
COLSZ3                  4           3 HEIGHT BALANCED

Interesting. So this time size REPEAT did not introduce a fourth bucket.

At first this seemed to be significant and I was searching for a case where I could catch it out. However it looks to me that the DBMS_STATS code is behaving sensibly:

  • If we have a frequency histogram and a new value appears then DBMS_STATS will add an extra bucket if possible. In the example above I specifically asked for three buckets when there was only one value, when a fourth distinct value was added DBMS_STATS was smart enough to add another bucket.
  • If we have a HEIGHT BALANCED histogram and another distinct value is added to the column then DBMS_STATS assumes there was a good reason for specifying fewer buckets than values and leaves it well alone.

The title of this post was “When does METHOD_OPT size REPEAT not obey?”. The answer of course is “When it knows better!”

The implementation of this turns out to be quite simple and will be demonstrated in my next post.

What is the default for a METHOD_OPT value using SIZE REPEAT?

Past experience has always led me to believe that a METHOD_OPT value of “FOR ALL COLUMNS SIZE REPEAT” would default to “SIZE 1″ on columns without statistics. I have never explicitly tested this default however so this is the quick test I did to put my mind at rest.

First we create a test table that, being new, has never had any queries on it. Therefore “SIZE AUTO” should not create any histograms.

DROP TABLE tab1;
CREATE TABLE tab1
(	col1 NUMBER
,	col2 VARCHAR2(1));
INSERT INTO tab1
SELECT	ROWNUM
,	'Y'
FROM	dual
CONNECT BY ROWNUM < 100;
UPDATE tab1 SET col2 = 'N' WHERE ROWNUM = 1;
COMMIT;

exec dbms_stats.gather_table_stats(USER,'tab1',method_opt=>'for all columns size auto');

SELECT column_name, num_buckets FROM user_tab_columns WHERE table_name = 'TAB1';

COLUMN_NAME                    NUM_BUCKETS
------------------------------ -----------
COL1                                     1
COL2                                     1

Ok, as expected there are no histograms. So now we repeat the test but query the columns before gathering statistics. This time “SIZE AUTO” should create histograms, at least on the skewed column “COL2″.

DROP TABLE tab1;
CREATE TABLE tab1
(	col1 NUMBER
,	col2 VARCHAR2(1));
INSERT INTO tab1
SELECT	ROWNUM
,	'Y'
FROM	dual
CONNECT BY ROWNUM < 100;
UPDATE tab1 SET col2 = 'N' WHERE ROWNUM = 1;
COMMIT;

SELECT COUNT(*) FROM tab1 WHERE col1 = 1 and col2 = 'N';

exec dbms_stats.gather_table_stats(USER,'tab1',method_opt=>'for all columns size auto');

SELECT column_name, num_buckets FROM user_tab_columns WHERE table_name = 'TAB1';

COLUMN_NAME                    NUM_BUCKETS
------------------------------ -----------
COL1                                     1
COL2                                     2

NUM_BUCKETS for COL2 is “2″ so we did get the expected histogram. Now we can repeat the test but this time gather statistics with “SIZE REPEAT”. If “SIZE REPEAT” defaults to “SIZE 1″ when no prior histograms are present then we should only see values of 1 for NUM_BUCKETS. If it defaults to “SIZE AUTO” then we should see NUM_BUCKETS > 0 for at least COL2. Let’s see:

DROP TABLE tab1;
CREATE TABLE tab1
(	col1 NUMBER
,	col2 VARCHAR2(1));
INSERT INTO tab1
SELECT	ROWNUM
,	'Y'
FROM	dual
CONNECT BY ROWNUM < 100;
UPDATE tab1 SET col2 = 'N' WHERE ROWNUM = 1;
COMMIT;

SELECT COUNT(*) FROM tab1 WHERE col1 = 1 and col2 = 'N';

exec dbms_stats.gather_table_stats(USER,'tab1',method_opt=>'for all columns size repeat');

SELECT column_name, num_buckets FROM user_tab_columns WHERE table_name = 'TAB1';

COLUMN_NAME                    NUM_BUCKETS
------------------------------ -----------
COL1                                     1
COL2                                     1

So I believe the default value for “SIZE REPEAT” is “SIZE 1″, please let me know if you see a hole in my thinking.

Optimiser Statistics and Lower Case Columns

I was contacted recently about a performance problem in an application we support. The query summarises the number of issues that have not been closed. A simplified version is shown below:

SELECT  COUNT(*)
FROM    "issues"
WHERE   "state" IN ('open','new');

There are two things of note here. Firstly your eyes do not deceive you – those are lower case table and column names (more on this later). Secondly the “state” column is indexed yet the query is doing a full table scan on the 400,000 row “issues” table.

The reason for the full scan is skewed “state” data. There are a handful of new and open issues but a heck of a lot of closed ones (and no possibility of setting them all to NULL before you ask). So the index looks unattractive to the optimiser:

400,000 rows / 3 distinct values = rubbish index access

The obvious thing to do is put a histogram on the “state” column. Things did not go quite to plan. My demonstration of this is below (on a 9.2.0.8 database – the same as the application in question).

1) Create a test table with a mixture of upper and lower case columns, notice we have to wrap table and column names in double quotes (“”) to avoid a conversion to upper case.

create table "tab1"(col1 number,"col2" number,"col3" number);

begin
	for i in 1..10 loop
		insert into "tab1" values (i,i,i);
	end loop;
	commit;
end;
/

2) DBMS_STATS also performs upper case conversions

exec dbms_stats.gather_table_stats(null,'tab1');

column column_name format a10
select  tc.table_name,tc.column_name, tc.nullable nulls
,       tc.num_distinct, tc.num_buckets buckets
from    user_tab_columns tc
where tc.table_name = 'tab1'
order by 1,2;

TABLE_NAME   COLUMN_NAM N NUM_DISTINCT    BUCKETS
------------ ---------- - ------------ ----------
tab1         COL1       Y
tab1         col2       Y
tab1         col3       Y

And again with double quotes (“”) to prevent case conversion

exec dbms_stats.gather_table_stats(null,'"tab1"');

column column_name format a10
select  tc.table_name,tc.column_name, tc.nullable nulls
,       tc.num_distinct, tc.num_buckets buckets
from    user_tab_columns tc
where tc.table_name = 'tab1'
order by 1,2;

TABLE_NAME   COLUMN_NAM N NUM_DISTINCT    BUCKETS
------------ ---------- - ------------ ----------
tab1         COL1       Y           10          1
tab1         col2       Y           10          1
tab1         col3       Y           10          1

So far everything works as expected. Now for the histogram.

3) Create histograms on lower and upper case columns

exec dbms_stats.delete_table_stats(null,'"tab1"');
exec dbms_stats.gather_table_stats(null,'"tab1"'-
               ,method_opt=>'for columns col1 size 254');
exec dbms_stats.gather_table_stats(null,'"tab1"'-
               ,method_opt=>'for columns "col2" size 254');

column column_name format a10
select  tc.table_name,tc.column_name, tc.nullable nulls
,       tc.num_distinct, tc.num_buckets buckets
from    user_tab_columns tc
where   tc.table_name = 'tab1'
order by 1,2;

TABLE_NAME   COLUMN_NAM N NUM_DISTINCT    BUCKETS
------------ ---------- - ------------ ----------
tab1         COL1       Y           10          9
tab1         col2       Y
tab1         col3       Y

That wasn’t expected, there’s no histogram on “col2″. After trying an assortment of “special” characters and seeing the following message far too frequently…

ERROR at line 1:
ORA-06502: PL/SQL: numeric or value error
ORA-06512: at "SYS.DBMS_STATS", line 10502
ORA-06512: at "SYS.DBMS_STATS", line 10516
ORA-06512: at line 1

…I did what I should have done originally and searched Metalink:

Bug 4892211 – DBMS_STATS does not recognise lower case / special characters in column names

  • This issue is fixed in*
  • 10.2.0.3 (Server Patch Set) <javascript:taghelp(‘FIXED_A203′)>
  • 11.1.0.6 (Base Release) <javascript:taghelp(‘FIXED_B106′)>

4) At first I thought a workaround would be to add a histogram to all columns on the table and then remove the ones I don’t want. However I’m faced with the same problem – I can’t specify “col3″ to remove that histogram:

exec dbms_stats.gather_table_stats(null,'"tab1"'-
               ,method_opt=>'for all columns size 254');

TABLE_NAME   COLUMN_NAM N NUM_DISTINCT    BUCKETS
------------ ---------- - ------------ ----------
tab1         COL1       Y           10          9
tab1         col2       Y           10          9
tab1         col3       Y           10          9

exec dbms_stats.gather_table_stats(null,'"tab1"'-
               ,method_opt=>'for columns "col3" size 1');

TABLE_NAME   COLUMN_NAM N NUM_DISTINCT    BUCKETS
------------ ---------- - ------------ ----------
tab1         COL1       Y           10          1
tab1         col2       Y           10          9
tab1         col3       Y           10          9

So in my case the best I could do was put the histogram on all indexed columns and use “method_opt=>’for all columns size repeat’”for future optimiser statistic collections to keep the histogram in place.