METHOD_OPT of NULL
This is the start of a series of MEEK (maybe everyone else knows) posts as I seem to keep running into things about Oracle that I feel I should have known.
Until very recently I did not know what using METHOD_OPT => NULL in a DBMS_STATS call does. If you do then stop reading here
Creation of a test table…
MNASH@orcl> create table t as select * from dba_objects;
Table created.
MNASH@orcl> select count(*) from t;
COUNT(*)
----------
72563
MNASH@orcl> select table_name, num_rows from user_tables where table_name = 'T';
TABLE_NAME NUM_ROWS
---------- ----------
T
MNASH@orcl> select table_name, column_name, num_distinct from user_tab_columns where table_name = 'T';
TABLE_NAME COLUMN_NAME NUM_DISTINCT
---------- ------------------------------ ------------
T OWNER
T OBJECT_NAME
T SUBOBJECT_NAME
T OBJECT_ID
T DATA_OBJECT_ID
T OBJECT_TYPE
T CREATED
T LAST_DDL_TIME
T TIMESTAMP
T STATUS
T TEMPORARY
T GENERATED
T SECONDARY
T NAMESPACE
T EDITION_NAME
15 rows selected.
MNASH@orcl>
Gathering statistics using DBMS_STATS.GATHER_TABLE_STATS with METHOD_OPT => NULL
MNASH@orcl> exec dbms_stats.gather_table_stats(null,'T',method_opt=>null) PL/SQL procedure successfully completed. MNASH@orcl> select table_name, num_rows from user_tables where table_name = 'T'; TABLE_NAME NUM_ROWS ---------- ---------- T 72563 MNASH@orcl> select table_name, column_name, num_distinct from user_tab_columns where table_name = 'T'; TABLE_NAME COLUMN_NAME NUM_DISTINCT ---------- ------------------------------ ------------ T OWNER T OBJECT_NAME T SUBOBJECT_NAME T OBJECT_ID T DATA_OBJECT_ID T OBJECT_TYPE T CREATED T LAST_DDL_TIME T TIMESTAMP T STATUS T TEMPORARY T GENERATED T SECONDARY T NAMESPACE T EDITION_NAME 15 rows selected. MNASH@orcl>
Statistics are gathered for the table, but not for the columns of the table.
I found out about this through a friend after puzzling over how some of the tables in a database had statistics at a table level, but none for the columns.
Just for completeness, the following shows what happens when METHOD_OPT is allowed to default:
MNASH@orcl> exec dbms_stats.gather_table_stats(null,'T') PL/SQL procedure successfully completed. MNASH@orcl> select table_name, num_rows from user_tables where table_name = 'T'; TABLE_NAME NUM_ROWS ---------- ---------- T 72563 MNASH@orcl> select table_name, column_name, num_distinct from user_tab_columns where table_name = 'T'; TABLE_NAME COLUMN_NAME NUM_DISTINCT ---------- ------------------------------ ------------ T OWNER 32 T OBJECT_NAME 43784 T SUBOBJECT_NAME 108 T OBJECT_ID 72563 T DATA_OBJECT_ID 7705 T OBJECT_TYPE 44 T CREATED 1174 T LAST_DDL_TIME 1272 T TIMESTAMP 1318 T STATUS 1 T TEMPORARY 2 T GENERATED 2 T SECONDARY 2 T NAMESPACE 21 T EDITION_NAME 0 15 rows selected. MNASH@orcl>
Whether or not you want to use the default, which is to allow Oracle to create histograms as it sees appropriate, is a separate point which has received plenty of attention elsewhere.
In the case I was looking into there was actually a bug in the code used as a wrapper to DBMS_STATS and it was not intended that no column statistics were gathered, but I started to wonder if there would be a situation where using this undocumented option would be attractive. I did some fairly simple testing to establish the overhead of collecting column statistics…
The test was:
1 – Increase the number of rows in table T to around 1 million
2 – Delete table statistics for table T
3 – Flush the buffer cache
4 – Gather table statistics using compute with method_opt => null
5 – Repeat (2) to (4) 5 times
6 – Delete table statistics for table T
7 – Flush the buffer cache
8 – Gather table statistics using compute with method_opt => ‘for all columns size 1′
9 – Repeat (6) to (8) 5 times
Looking at the elapsed time for the statistics gathering I got averages of:
null: 2.3 seconds
for all columns size 1: 14.5 seconds
That’s a pretty big overhead, but a lot of information you are robbing the CBO of if you don’t gather them!
I then ran the same test, but with AUTO_SAMPLE_SIZE as this is a more realistic case and got averages of:
null: 1.5 seconds
for all columns size 1: 3.4 seconds
Note: The tests were performed on a 11.2.0.1 database.
Where is METHOD_OPT SIZE REPEAT’s repeat value stored?
In my last post When does METHOD_OPT SIZE REPEAT not obey we saw how SIZE REPEAT sometimes appears to know best and creates a FREQUENCY histogram with a different number of buckets than previously specified.
The mechanics of this became clearer while pondering another question that I often ask myself.
How does METHOD_OPT SIZE REPEAT remember the value last specified in order to repeat it?
My primitive testing leads me to believe that the key value is stored in column “SYS.HIST_HEAD$.BUCKET_CNT”. The test that follows shows my working.
The table created below has two columns with HEIGHT BALANCED histograms, one with no histogram and one with a FREQUENCY BASED histogram. The column names should help in following the test.
drop table tab1 ; create table tab1 ( colskew1 number , colskew2 number , colnoskew number , colfreq number ); insert into tab1 select case when rownum <= 500 then rownum else 0 end , case when rownum <= 500 then rownum else 0 end , rownum , case when rownum <= 50 then rownum else 0 end from all_objects where rownum <= 4000; commit; exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colskew1 size 150'); exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colskew2 size 254'); exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colnoskew size 1'); exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colfreq size 254'); column column_name format a12 select c.column_name,t.num_rows, c.num_distinct,c.num_buckets,c.histogram from user_tab_columns c, user_tables t where t.table_name = 'TAB1' and c.table_name = t.table_name order by 1; COLUMN_NAME NUM_ROWS NUM_DISTINCT NUM_BUCKETS HISTOGRAM ------------ ---------- ------------ ----------- --------------- COLSKEW1 4000 501 150 HEIGHT BALANCED COLSKEW2 4000 501 254 HEIGHT BALANCED COLNOSKEW 4000 4000 1 NONE COLFREQ 4000 51 51 FREQUENCY
Now if we check the value stored in “SYS.HIST_HEAD$.BUCKET_CNT” we see the following.
select c.name,c.intcol#,h.bucket_cnt from sys.col$ c , sys.hist_head$ h where c.obj# = (select object_id from dba_objects where object_name ='TAB1') and h.obj# = c.obj# and h.intcol# = c.intcol#; NAME INTCOL# BUCKET_CNT ------------------------------ ---------- ---------- COLSKEW1 1 150 COLSKEW2 2 254 COLNOSKEW 3 1 COLFREQ 4 4000
So for HEIGHT BALANCED histograms (including an absence of a histogram) BUCKET_CNT holds the number of buckets specified when creating the histogram. The FREQUENCY BASED histogram however stores a value equal to the row count. This is different to the NUM_BUCKETS value displayed when querying USER_TAB_COLUMNS.
Now a statistics collection with a METHOD_OPT of SIZE REPEAT simply needs to use a value similar to “MIN(BUCKET_CNT,254)” as its target bucket count.
This was backed up (if not exactly confirmed) by then checking the definition of the SYS view USER_TAB_COLS. The extract below is for the NUM_BUCKETS column:
case when nvl(h.distcnt,0) = 0 then h.distcnt
when h.row_cnt = 0 then 1
when (h.bucket_cnt > 255
or
(h.bucket_cnt > h.distcnt
and h.row_cnt = h.distcnt
and h.density*h.bucket_cnt < 1))
then h.row_cnt
else h.bucket_cnt
end
We can see that for a BUCKET_CNT > 255 the value used for NUM_BUCKETS is switched for HIST_HEAD$.ROW_CNT – this is the number of records in USER_HISTOGRAMS for the column, i.e. for FREQUENCY BASED histograms the number of buckets actually stored.
I’m not sure I’ve demonstrated this very well. In essence for SIZE REPEAT:
- FREQUENCY BASED histograms disregard any value previously specified and pick up a bucket count of the number of distinct rows in the row sample (if NUM_DISTINCT is <= 254 of course).
- HEIGHT BALANCED histograms can always reuse the value stored in HIST_HEAD$.BUCKET_CNT which will only be between 1 and 254.
When does METHOD_OPT size REPEAT not obey?
I have an example here of a case where a DBMS_STATS METHOD_OPT parameter setting of REPEAT does not do as expected (by me at least).
We have a table with 3 columns. The table is loaded with 1000 rows, all columns in all rows are set to 1.
drop table tab1 ; create table tab1 ( colsz1 number , colsz3 number , colsz254 number ); insert into tab1 select 1,1,1 from all_objects where rownum <= 1000; commit;
We now gather statistics on each column with varying histogram sizes and check the resulting histogram settings for each one. The column names will help keep track of which is which.
exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colsz1 size 1'); exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colsz3 size 3'); exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colsz254 size 254'); column column_name format a12 select column_name,num_distinct,num_buckets,histogram from dba_tab_columns where table_name = 'TAB1' order by 1;
Below we can observe that “SIZE 1″ did not create a histogram and “SIZE 3″ & “SIZE 254″ created frequency histograms, each with 1 bucket. All as expected.
COLUMN_NAME NUM_DISTINCT NUM_BUCKETS HISTOGRAM ------------ ------------ ----------- --------------- COLSZ1 1 1 NONE COLSZ254 1 1 FREQUENCY COLSZ3 1 1 FREQUENCY
Now we add 3 new values to each column in the table giving us 4 distinct values in total.
insert into tab1 values (2,2,2); insert into tab1 values (3,3,3); insert into tab1 values (4,4,4); commit;
If we gather statistics for all of the columns using METHOD_OPT size REPEAT what do you expect to happen? My expectation is that:
- COLSZ1 will have 1 bucket
- COLSZ3 will have 3 buckets and become a HEIGHT BALANCED histogram because we have more values that the number of buckets
- COLSZ254 will have 4 buckets and remain a FREQUENCY histogram
Let’s see…
exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for all columns size repeat'); COLUMN_NAME NUM_DISTINCT NUM_BUCKETS HISTOGRAM ------------ ------------ ----------- --------------- COLSZ1 4 1 NONE COLSZ254 4 4 FREQUENCY COLSZ3 4 4 FREQUENCY
I was wrong! Size REPEAT ignored me (or perhaps corrected me) and created a fourth bucket. What if I don’t use size REPEAT and force a size of 3…
exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for columns colsz3 size 3'); COLUMN_NAME NUM_DISTINCT NUM_BUCKETS HISTOGRAM ------------ ------------ ----------- --------------- COLSZ1 4 1 NONE COLSZ254 4 4 FREQUENCY COLSZ3 4 3 HEIGHT BALANCED
Now that’s what I expected. So now the histogram is HEIGHT BALANCED let’s try size REPEAT again.
exec dbms_stats.gather_table_stats(null,'tab1',estimate_percent=>100,method_opt=>'for all columns size repeat'); COLUMN_NAME NUM_DISTINCT NUM_BUCKETS HISTOGRAM ------------ ------------ ----------- --------------- COLSZ1 4 1 NONE COLSZ254 4 4 FREQUENCY COLSZ3 4 3 HEIGHT BALANCED
Interesting. So this time size REPEAT did not introduce a fourth bucket.
At first this seemed to be significant and I was searching for a case where I could catch it out. However it looks to me that the DBMS_STATS code is behaving sensibly:
- If we have a frequency histogram and a new value appears then DBMS_STATS will add an extra bucket if possible. In the example above I specifically asked for three buckets when there was only one value, when a fourth distinct value was added DBMS_STATS was smart enough to add another bucket.
- If we have a HEIGHT BALANCED histogram and another distinct value is added to the column then DBMS_STATS assumes there was a good reason for specifying fewer buckets than values and leaves it well alone.
The title of this post was “When does METHOD_OPT size REPEAT not obey?”. The answer of course is “When it knows better!”
The implementation of this turns out to be quite simple and will be demonstrated in my next post.
leave a comment