rmoff

June 15, 2011

Global statistics high/low values when using DBMS_STATS.COPY_TABLE_STATS

Filed under: copy_table_stats, dbms_stats, DWH, oracle, Statistics — rmoff @ 08:16

There is a well-documented problem relating to DBMS_STATS.COPY_TABLE_STATS between partitions where high/low values of the partitioning key column were just copied verbatim from the source partition. This particular problem has now been patched (see 8318020.8). For background, see Doug Burns’ blog and his excellent paper which covers the whole topic of statistics on partitioned tables.

This post Maintaining statistics on large partitioned tables on the Oracle Optimizer blog details what the dbms_stats.copy_table_stats does with regards to the high/low values:

It adjusts the minimum and maximum values of the partitioning column as follows; it uses the high bound partitioning value as the maximum value of the first partitioning column (it is possible to have concatenated partition columns) and high bound partitioning value of the previous partition as the minimum value of the first partitioning column for range partitioned table

However, two problems as I see them remain:

  1. Table global stats don’t update high_value for partitioning key
  2. high_value of one partition overlaps with low_value of the next.
    • Partition high-bound values are defined as LESS THAN, not LESS THAN OR EQUAL TO – therefore the maximum possible value of the column is less than this, not equal to it.
    • The minimum value of the partitioning column is correct using this method (although be aware of 10233186 if you use a MAXVALUE in your range partitioning).

Here’s a script that demonstrates the two issues, written and commented based on execution on 11.1.0.7:

/* copy_stats_1.sql

Illustrate apparent problem with high_val on partition statistics when using partition to partition statistics copy
  * Table global stats do not update high_value for partitioning key
  * high_value of one partition overlaps with low_value of the next.

Requires display_raw function by Greg Rahn, see here: http://tinyurl.com/display-raw

https://rnm1978.wordpress.com/

*/

set echo off
set timing off
set feedback off
set linesize 156
set pagesize 57
col owner for a10
col table_name for a30
col column_name for a30
col partition_name for a20
col low_val for a10
col high_val for a10
col num_rows for 999,999,999,999
col "sum of num_rows" for 999,999,999,999
break on stats_update_time skip 1 duplicates

clear screen

prompt ===== This script uses the DISPLAY_RAW function =======
prompt
prompt Available here: http://structureddata.org/2007/10/16/how-to-display-high_valuelow_value-columns-from-user_tab_col_statistics/
prompt
prompt ========================================================
prompt
prompt
prompt
prompt =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
prompt 1. Set up an partitioned table with data and examine the statistics
prompt
prompt
set echo on
pause
-- Create fact table
drop table BASE_DATA;
CREATE table BASE_DATA ( day_key integer, store_key INTEGER,  item_key  INTEGER, fact_001 NUMBER(15,0), fact_002 NUMBER(15,0), fact_003 NUMBER(18,2))
 PARTITION BY RANGE (DAY_KEY)
  SUBPARTITION BY HASH (store_key)
  SUBPARTITION TEMPLATE ( SUBPARTITION "SP1" , SUBPARTITION "SP2" , SUBPARTITION "SP3" , SUBPARTITION "SP4")
 ( PARTITION "PART_20110401"  VALUES LESS THAN (20110402))
 PARALLEL;

pause
-- Create indexes
CREATE UNIQUE INDEX BASE_DATA_PK ON BASE_DATA ("DAY_KEY", "STORE_KEY", "ITEM_KEY") LOCAL parallel;
create bitmap index base_data_ix2 on base_data (store_key) local parallel;
create bitmap index base_data_ix3 on base_data (item_key) local parallel;

pause 

-- Populate fact table
exec DBMS_RANDOM.SEED('StraussCookPieterson');
insert into BASE_DATA values (20110401,101,2000, dbms_random.value(0,999) , dbms_random.value(0,999) , dbms_random.value(0,999) );
insert into BASE_DATA values (20110401,102,2000, dbms_random.value(0,999) , dbms_random.value(0,999) , dbms_random.value(0,999) );
commit;

pause 

-- Gather full stats on table
set feedback on
exec dbms_stats.gather_table_stats(     ownname=>USER, tabname=>'BASE_DATA', granularity=>'AUTO');
set feedback off

pause 

select * from base_data order by day_key;
pause
-- Examine statistics

set echo off
prompt
prompt DBA_PART_TABLES
select partitioning_type, subpartitioning_type, partition_count from dba_part_tables where table_name='BASE_DATA' and owner=USER;

prompt
prompt DBA_TAB_STATS_HISTORY
SELECT table_name, partition_name, stats_update_time
FROM   dba_tab_stats_history
WHERE  owner = USER
AND table_name = 'BASE_DATA'
ORDER  BY stats_update_time asc;
pause

prompt
prompt  DBA_TAB_STATISTICS (table level only):
prompt  **************************************
select table_name,num_rows,
to_char(LAST_ANALYZED,'YYYY-MM-DD-HH24:MI:SS') "LAST_ANALYZED"
from DBA_TAB_STATISTICS
where table_name='BASE_DATA' and owner=USER
and partition_name is null
;

pause 

compute sum of num_rows on report
prompt
prompt DBA_TAB_STATISTICS (Partition level):
prompt *************************************
select table_name,partition_name,num_rows,
to_char(LAST_ANALYZED,'YYYY-MM-DD-HH24:MI:SS') "LAST_ANALYZED"
from DBA_TAB_STATISTICS
where table_name='BASE_DATA' and owner=USER
and partition_name is not null
and subpartition_name is null
order by table_name,partition_name
;
clear computes

pause 

prompt DBA_PART_COL_STATISTICS:
prompt ************************
select a.partition_name,a.column_name,to_char(a.LAST_ANALYZED,'YYYY-MM-DD-HH24:MI:SS') "LAST_ANALYZED",
display_raw(a.low_value,b.data_type) as low_val,display_raw(a.high_value,b.data_type) as high_val
from DBA_PART_COL_STATISTICS a
inner join dba_tab_cols b on a.table_name=b.table_name and a.column_name=b.column_name and a.owner=b.owner
where a.table_name='BASE_DATA' and a.owner=USER and a.partition_name is not null
and a.column_name = 'DAY_KEY'
;

prompt
prompt Observe: Partition high/low values for DAY_KEY - currently 1st April
pause 

prompt
prompt DBA_TAB_COL_STATISTICS:
prompt ***********************
select a.column_name,to_char(a.LAST_ANALYZED,'YYYY-MM-DD-HH24:MI:SS') "LAST_ANALYZED",
display_raw(a.low_value,b.data_type) as low_val,display_raw(a.high_value,b.data_type) as high_val
from DBA_TAB_COL_STATISTICS a inner join dba_tab_cols b on a.table_name=b.table_name and a.column_name=b.column_name and a.owner=b.owner
where a.table_name='BASE_DATA' and a.owner=USER and a.column_name = 'DAY_KEY'
;
prompt
prompt Observe: Table high/low values for DAY_KEY - currently 1st April
pause 

prompt
prompt
prompt
prompt =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
prompt 2. Create new partition and use dbms_stats.copy_table_stats to set the stats for it. Leave data in the table unchanged.
prompt
prompt
pause

set feedback on
set echo on
alter table base_data add PARTITION "PART_20110402"  VALUES LESS THAN (20110403);
exec dbms_stats.copy_table_stats(ownname=>USER, tabname=>'BASE_DATA',SRCPARTNAME=>'PART_20110401',DSTPARTNAME=>'PART_20110402');
pause
set feedback off

select * from base_data order by day_key;
pause
-- Examine statistics

set echo off
prompt
prompt DBA_PART_TABLES
select partitioning_type, subpartitioning_type, partition_count from dba_part_tables where table_name='BASE_DATA' and owner=USER;

prompt
prompt DBA_TAB_STATS_HISTORY
SELECT table_name, partition_name, stats_update_time
FROM   dba_tab_stats_history
WHERE  owner = USER
AND table_name = 'BASE_DATA'
ORDER  BY stats_update_time asc;
pause

prompt  DBA_TAB_STATISTICS (table level only):
prompt  **************************************
select table_name,num_rows,
to_char(LAST_ANALYZED,'YYYY-MM-DD-HH24:MI:SS') "LAST_ANALYZED"
from DBA_TAB_STATISTICS
where table_name='BASE_DATA' and owner=USER
and partition_name is null
;

compute sum of num_rows on report
prompt
prompt DBA_TAB_STATISTICS (Partition level):
prompt *************************************
select table_name,partition_name,num_rows,
to_char(LAST_ANALYZED,'YYYY-MM-DD-HH24:MI:SS') "LAST_ANALYZED"
from DBA_TAB_STATISTICS
where table_name='BASE_DATA' and owner=USER
and partition_name is not null
and subpartition_name is null
order by table_name,partition_name
;
clear computes

prompt
prompt Side note: Oracle doesn't aggregate the partition num_rows statistic up to global when doing a copy stats, 
prompt            so whilst the sum of partition num_rows is four, the global num_rows is still two.
prompt            Of course, at this point, there are only actually two rows of data in the table.
prompt   
prompt (also, observe that LAST_ANALYZED for the new partition is that of the partition from where the stats were copied, and isn't
prompt  the same as STATS_UPDATE_TIME for the partition on DBA_TAB_STATS_HISTORY - which makes sense when you think about it)
pause

prompt
prompt DBA_TAB_PARTITIONS:
prompt ********************
select partition_name, high_value from dba_tab_partitions where table_name='BASE_DATA' and table_owner=USER;
prompt
prompt DBA_PART_COL_STATISTICS:
prompt ************************
select a.partition_name,a.column_name,to_char(a.LAST_ANALYZED,'YYYY-MM-DD-HH24:MI:SS') "LAST_ANALYZED",
display_raw(a.low_value,b.data_type) as low_val,display_raw(a.high_value,b.data_type) as high_val
from DBA_PART_COL_STATISTICS a
inner join dba_tab_cols b on a.table_name=b.table_name and a.column_name=b.column_name and a.owner=b.owner
where a.table_name='BASE_DATA' and a.owner=USER and a.partition_name is not null
and a.column_name = 'DAY_KEY'
;
prompt
prompt See the Partition high/low values for DAY_KEY in the new partition (PART_20110402) into which we copied the stats:
prompt --> low_value is correct
prompt --> high_value is out of range for possible data in that partition
prompt -----> high_value of the partition is < 20110403, ** not **  <= 20110403
prompt
pause
prompt
prompt DBA_TAB_COL_STATISTICS:
prompt ***********************
select a.column_name,to_char(a.LAST_ANALYZED,'YYYY-MM-DD-HH24:MI:SS') "LAST_ANALYZED",
display_raw(a.low_value,b.data_type) as low_val,display_raw(a.high_value,b.data_type) as high_val
from DBA_TAB_COL_STATISTICS a inner join dba_tab_cols b on a.table_name=b.table_name and a.column_name=b.column_name and a.owner=b.owner
where a.table_name='BASE_DATA' and a.owner=USER and a.column_name = 'DAY_KEY'
;
prompt
prompt See the Table high/low values for DAY_KEY - currently 1st April, even though the stats on individual partitions has a (wrong) high_val of 3rd April.
pause 
prompt
prompt
prompt
prompt =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
prompt 3. Add another new partition and use dbms_stats.copy_table_stats to set the stats for it. Leave data in the table unchanged.
prompt
prompt
pause

set feedback on
set echo on
alter table base_data add PARTITION "PART_20110403"  VALUES LESS THAN (20110404);
exec dbms_stats.copy_table_stats(ownname=>USER, tabname=>'BASE_DATA',SRCPARTNAME=>'PART_20110401',DSTPARTNAME=>'PART_20110403');
pause
set feedback off

select * from base_data order by day_key;
pause
-- Examine statistics

set echo off
prompt
prompt DBA_PART_TABLES
select partitioning_type, subpartitioning_type, partition_count from dba_part_tables where table_name='BASE_DATA' and owner=USER;

prompt
prompt DBA_TAB_STATS_HISTORY
SELECT table_name, partition_name, stats_update_time
FROM   dba_tab_stats_history
WHERE  owner = USER
AND table_name = 'BASE_DATA'
ORDER  BY stats_update_time asc;
pause

prompt  DBA_TAB_STATISTICS (table level only):
prompt  **************************************
select table_name,num_rows,
to_char(LAST_ANALYZED,'YYYY-MM-DD-HH24:MI:SS') "LAST_ANALYZED"
from DBA_TAB_STATISTICS
where table_name='BASE_DATA' and owner=USER
and partition_name is null
;

compute sum of num_rows on report
prompt
prompt DBA_TAB_STATISTICS (Partition level):
prompt *************************************
select table_name,partition_name,num_rows,
to_char(LAST_ANALYZED,'YYYY-MM-DD-HH24:MI:SS') "LAST_ANALYZED"
from DBA_TAB_STATISTICS
where table_name='BASE_DATA' and owner=USER
and partition_name is not null
and subpartition_name is null
order by table_name,partition_name
;
clear computes

pause

prompt
prompt DBA_TAB_PARTITIONS:
prompt ********************
select partition_name, high_value from dba_tab_partitions where table_name='BASE_DATA' and table_owner=USER;
prompt
prompt DBA_PART_COL_STATISTICS:
prompt ************************
select a.partition_name,a.column_name,to_char(a.LAST_ANALYZED,'YYYY-MM-DD-HH24:MI:SS') "LAST_ANALYZED",
display_raw(a.low_value,b.data_type) as low_val,display_raw(a.high_value,b.data_type) as high_val
from DBA_PART_COL_STATISTICS a
inner join dba_tab_cols b on a.table_name=b.table_name and a.column_name=b.column_name and a.owner=b.owner
where a.table_name='BASE_DATA' and a.owner=USER and a.partition_name is not null
and a.column_name = 'DAY_KEY'
;
prompt
prompt You can see that the high_value for the new partition is again too high for the possible values the partition could contain
prompt
prompt But this time we can also see the overlapping high_value of previous column with low_value of the next.
prompt   PART_20110401 has real stats 
prompt   PART_20110402 has copied stats, with a (wrong) high_value of 20110403
prompt   PART_20110403 has copied stats, with a low_value of 20110403 - which is the same as the high_value of the previous partition

prompt
pause
prompt
prompt DBA_TAB_COL_STATISTICS:
prompt ***********************
select a.column_name,to_char(a.LAST_ANALYZED,'YYYY-MM-DD-HH24:MI:SS') "LAST_ANALYZED",
display_raw(a.low_value,b.data_type) as low_val,display_raw(a.high_value,b.data_type) as high_val
from DBA_TAB_COL_STATISTICS a inner join dba_tab_cols b on a.table_name=b.table_name and a.column_name=b.column_name and a.owner=b.owner
where a.table_name='BASE_DATA' and a.owner=USER and a.column_name = 'DAY_KEY'
;
prompt
prompt Table high/low values for DAY_KEY - still 1st April, even though the stats on individual partitions has a (wrong) high_val of 4th April.
pause 

prompt
prompt
prompt
prompt =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
prompt 4. Add data to the table, gather real statistics, examine them.
prompt
prompt
pause

set echo on
-- Populate fact table
insert into BASE_DATA values (20110402,101,2000, dbms_random.value(0,999) , dbms_random.value(0,999) , dbms_random.value(0,999) );
insert into BASE_DATA values (20110403,101,2000, dbms_random.value(0,999) , dbms_random.value(0,999) , dbms_random.value(0,999) );
commit;

pause 

-- gather full stats
exec dbms_stats.gather_table_stats(     ownname=>USER, tabname=>'BASE_DATA', granularity=>'AUTO');
pause

select * from base_data order by day_key;
pause
-- Examine statistics

set echo off
prompt
prompt DBA_PART_TABLES
select partitioning_type, subpartitioning_type, partition_count from dba_part_tables where table_name='BASE_DATA' and owner=USER;

prompt
prompt DBA_TAB_STATS_HISTORY
SELECT table_name, partition_name, stats_update_time
FROM   dba_tab_stats_history
WHERE  owner = USER
AND table_name = 'BASE_DATA'
ORDER  BY stats_update_time asc;
pause

prompt  DBA_TAB_STATISTICS (table level only):
prompt  **************************************
select table_name,num_rows,
to_char(LAST_ANALYZED,'YYYY-MM-DD-HH24:MI:SS') "LAST_ANALYZED"
from DBA_TAB_STATISTICS
where table_name='BASE_DATA' and owner=USER
and partition_name is null
;
pause

compute sum of num_rows on report
prompt
prompt DBA_TAB_STATISTICS (Partition level):
prompt *************************************
select table_name,partition_name,num_rows,
to_char(LAST_ANALYZED,'YYYY-MM-DD-HH24:MI:SS') "LAST_ANALYZED"
from DBA_TAB_STATISTICS
where table_name='BASE_DATA' and owner=USER
and partition_name is not null
and subpartition_name is null
order by table_name,partition_name
;
clear computes
prompt
prompt Table num_rows is now accurate
pause 

prompt
prompt DBA_TAB_PARTITIONS:
prompt ********************
select partition_name, high_value from dba_tab_partitions where table_name='BASE_DATA' and table_owner=USER;
prompt
prompt DBA_PART_COL_STATISTICS:
prompt ************************
select a.partition_name,a.column_name,to_char(a.LAST_ANALYZED,'YYYY-MM-DD-HH24:MI:SS') "LAST_ANALYZED",
display_raw(a.low_value,b.data_type) as low_val,display_raw(a.high_value,b.data_type) as high_val
from DBA_PART_COL_STATISTICS a
inner join dba_tab_cols b on a.table_name=b.table_name and a.column_name=b.column_name and a.owner=b.owner
where a.table_name='BASE_DATA' and a.owner=USER and a.partition_name is not null
and a.column_name = 'DAY_KEY'
;

prompt
prompt Partition high/low values for DAY_KEY in each partition is correct
pause

prompt
prompt DBA_TAB_COL_STATISTICS:
prompt ***********************
select a.column_name,to_char(a.LAST_ANALYZED,'YYYY-MM-DD-HH24:MI:SS') "LAST_ANALYZED",
display_raw(a.low_value,b.data_type) as low_val,display_raw(a.high_value,b.data_type) as high_val
from DBA_TAB_COL_STATISTICS a inner join dba_tab_cols b on a.table_name=b.table_name and a.column_name=b.column_name and a.owner=b.owner
where a.table_name='BASE_DATA' and a.owner=USER and a.column_name = 'DAY_KEY'
;
prompt
prompt Table high/low values for DAY_KEY are now correct
pause 

/* #EOF */

From everything that I’ve read, representative stats are essential for Oracle to generate the most efficient explain plan to give the most optimal performance. Out of range problems caused by inaccurate statistics is something frequently referenced. However I’m out of my depth here to determine whether that’s true for the global statistics of this partitioning column not getting updated.

Copying stats have never been intended as a replacement for real stats, that much is clear and frequently stated. They should be part of a carefully designed stats gathering method, based on your applications data and frequency of loading. Hopefully the above, along with the other articles about copy stats out there, will add to the understanding of the functionality and importantly, its limitations. Copying the stats will just buy you time in a critical load schedule, postponing the point at which you do a proper gather. All copy stats is doing is making the statistics a bit more representative of the data – it’s not a proper sample of the data so the quality of the stats will never be as good as if you do a proper gather. When you do the real gather should be whichever comes first of:

  • the point at which you have time in your batch schedule
    or,
  • the stats are too unrepresentative of your data for the Oracle optimizer to generate a sufficiently efficient explain plan in order for your queries to run in the time which the users require.

Maria Colgan from Oracle has kindly reviewed my script and findings, and commented:

Your argument that copy stats sets the high_value wrongly (to high), is correct. We do over estimate the high value by setting it to the partition definition. As you correctly point out no value in the partition will have reach that high_value because a range partition is always specified as less than. We did this so that we can ensure there will be no greater value than this in the partition, otherwise we would have to guess what the max value is.

Maria also pointed out that with regards to the overlapping high/low values

However, this is not the expected behavior. The goal of copy_stats is to provide a temporary fix to the out of range problem by providing a representative set of statistics for a new partition. It is not supposed to be a replacement for statistics gathering.


Reading:

Watch out for these other bugs that I came across reference to:

  • 10234419 Extend dbms_stats.copy_table_stats to all range partitioning key columns
  • Doc ID 1292269.1ORA-01422 While running dbms_stats.copy_table_stats
    • ” This issue would occur when there are more than one schema with same table name.”

Many thanks to Maria Colgan and Doug Burns for reviewing this post and providing useful feedback.

Advertisements

May 26, 2011

Data Warehousing and Statistics in Oracle 11g – Automatic Optimizer Statistics Collection

Filed under: dbms_stats, DWH, oracle, Statistics — rmoff @ 13:35

Chucking a stick in the spokes of your carefully-tested ETL/BI …

My opinion is that automated stats gathering for non-system objects should be disabled on Oracle Data Warehouses across all environments.

All it does it cover up poor design or implementation which has omitted to consider statistics management. Once you get into the realms of millions or billions of rows of data, the automated housekeeping may well not have time to stat all of your tables on each run. And then it becomes a quasi-lottery when your tables will get processed. Or what if you’re working with intra-day loads (eg. near real-time) – the housekeeping job only runs once a day by default.

Even if you have a suitable window and are happy that the job gathers all that it needs to all of the time, what if you want to run your batch at the same time as the task window defined? If you want to run your batch highly parallel (and why wouldn’t you?) then will the stats gather suffer? or affect your batch by running stats highly parallel too?

Suppose you are relying on the auto stats job, and don’t want to run it at the same time as your batch, so you come up with a suitable schedule for them to run at different times. What happens when your DW grows and you need to add a new batch process, and so have to move the window again? How do you know that moving it won’t affect the previous batch’s stats?

If you’re building on an existing system and want to test the performance of your new batch, how are you going to simulate the behaviour of your auto stats job? Even if you trigger it manually, are you going to accurately simulate the statistics that it may or may not need to gather each night? How do you factor in the magical 10% staleness to trigger a stats gather? That is one serious test rig if you want to reproduce all of that.

If you have stats management in place, then turning the auto stats off (for non-system objects) won’t hurt. And if you’re not, then the auto stats job will cover this up in your environments all the way from Dev through to Prod. The first time someone will ask about stats management is when you’re scratching your head over a report or ETL stage “that used to work fine”. And then the horrible truth will dawn that you screwed up, and should have built it into your design from the beginning.

As we say around here, if you want a job done properly, do it tha’ sen. Or rather, as Greg Rahn more articulately says:

I tend to advise people that for a DW the stats gathering should be part of the data flow (ETL/ELT) process and to disable the default job
[…]
If you wish to collect your statistics manually, then you should change the value of AUTOSTATS_TARGET to ORACLE instead of AUTO (DBMS_STATS.SET_PARAM(‘AUTOSTATS_TARGET’,’ORACLE’)). This will keep the dictionary stats up to date and allow you to manually gather stats on your schemas

Julian Dyke says something supporting this view too in his presentation here:

In complex databases do not rely on Auto job
– Unpredictable collection behaviour / duration
– Unpredictable execution plan changes

If you can’t disable the autostats job for whatever reason (maybe another application on the same DB would require changes to accommodate it), then you can shield yourself from its nefarious influences by using LOCK_SCHEMA_STATS to lock the stats on your schema(s). When you manually maintain the stats yourself, you either unlock them first, or use the FORCE option of the stats procedures.

Stabilisers on a high-performance motorbike

It’s easy enough to understand why Oracle built the Automated Stats job, and why it’s enabled by default. In an effort to move towards the Self Managing Database, it makes sense to automate whatever you can, so that the scope for poor performance is reduced. Abstracting things slightly, the optimizer is just part of the DB code, and stats reason for being is to support the optimizer, so why not keep it under the covers where possible?
The trouble with this is that it might be fine for the middle of the road. The bog standard, quick-win, fire it and run project doing nicely standard OLTP work. One fewer thing for the developer to worry about. It’s probably quite good for lots of things. But Oracle RDBMS is a big beast, and an expensive bit of kit. Do you really want to meander along in the slow lane all the time, content to be using a one-size-fits-all approach?
Kawasaki motorbike with stabilisers

If you’re serious about exploiting the power of Oracle for your data warehouse, then you need to understand what needs to be done to get it to perform. One of the big factors is accurate, representative statistics. And to get these you have to take the stabilisers off and learn how to do it properly yourself, because you’re the one that understands your data. Data loads are going to be different, data distribution is going to be different, reporting is going to be different. There’s a finite set of patterns that you’ll find in standard DW methodology, but it’s up to you to read about them (Greg Rahn, Doug Burns, et al) and understand how they apply to your system, and not rely on Oracle’s approximation of a stats method for an average system.

Why do I need to manage the stats myself? Doesn’t Oracle do it automagically when they’re stale?

Doesn’t Oracle gather stats automagically when they’re stale?
Yes, it does, BUT:

  • Only if the window allocated to it allows for time
  • not stale ≠ representative stats .
    Or to rearrange the equation: your stats can be unrepresentative of your data, and the stats not be ‘stale’.

So even whilst they’re not “stale”, that’s not to say the global statistics for your table are still representative. After one day, the statistics are already becoming unrepresentative of the data (think max value of date, transaction number, etc), but are still not “stale”.
Oracle will, by default, consider a table “stale” once 10% has changed. But most DWs are going to be loading many millions of rows a day, so the 10% (default) change for a table to be considered stale is going to be quite high. A table loading 20 million rows per day will hit c.1 billion rows in total after less than two months. But of a billion rows, a hundred million (10%) need to change before the table’s statistics are “stale”. 20 into 100 goes 5 … so your statistics would only become “stale” roughly every five days.

There’s a good presentation from OpenWorld 2008 by Oracle’s Real World Performance Group, entitled Real-World Database Performance Techniques and Methods. In it they discuss statistics management in detail, including the following “Six Challenges to the Cost Based Optimizer”:

1. Data skew
2. Bind peeking
3. Column low/high values
4. Data correlation between columns
5. Cardinality Approximations
6. The debugging process

At least two of these (data skew, and column low/high values – out-of-range) can occur (which is bad, mm’kay?) with statistics which are STALE=FALSE.

The point is, if you’re serious about getting the best explain plan from the CBO, you can’t rely on STALE as a sole indicator of how representative your statistics are of your data.

Let’s remember why we even care about good statistics. Some people seem to think that it’s optional. That it’s the geek equivalent of spending every weekend lovingly polishing the exterior of one’s favourite car – nice to have and ideally should be done, but ultimately just for show and won’t make it go any faster.
The DB is there to support the users of whatever application it is. And users being users, want their answers now. This gives us our starting point, and a logical flow of conclusions drawn from this:

  • Our requirement is for performance, and repeatable, consistent performance.
    • To get this we want Oracle to execute the query as efficiently as possible.
    • To do this, Oracle needs to understand the data that it’s being asked to query.
    • If it doesn’t understand the data, how can we expect to access it in the most efficient way?
    • This understanding is imparted to Oracle through statistics.
    • So statistics need to be representative of the data.

As soon as you are changing data (eg a DW batch load), you need to consider whether the stats are still going to give the CBO the best chance of getting the right plan. If they aren’t as representative of your data as they could be then you can’t expect the CBO to come up with the best plan.
If your data doesn’t change much and once a week works for you then great. But the point is you need to understand your data, so that you can plan your statistics strategy around it so that Oracle can understand it.

Reading & References

Thanks to Greg Rahn for reviewing my post and suggesting some changes

May 19, 2011

OBIEE performance – get your database sweating

Filed under: bi, DWH, obiee, oracle, performance, rant, rrdtool — rmoff @ 16:01

Just because something produces the correct numbers on the report, it doesn’t mean you can stop there.

How you are producing those numbers matters, and matters a lot if you have an interest in the long-term health of your system and its ability to scale.

OBIEE is the case in point here, but the principle applies to any architecture with >1 tiers or components.

Let me start with a rhetorical question. The user has got a report which has ten rows of data. Which of the following methods is going to be a more efficient way to generate the report?

  1. Transfer the ten rows of data from the database back to the rendering application, and the rendered report to the user
  2. Transfer fourteen million rows of data from the database back to the rendering application, get the application to crunch these to the required ten rows, transfer rendered report to the user

Obviously, it’s the first one. In the second one (and this is no hyperbolic example to prove a point, I’m sorry to say) we have the following overheads:

  • Network traffic of fourteen million rows from the database to the application server
  • Disk and/or Memory impact on the application server, depending on how it’s processing those fourteen millions rows
  • CPU impact on the application server when it processes those fourteen million rows

Considering OBIEE specifically, you should be aiming to be able to answer all of your questions on the database directly. As soon as you start using the database simply as a source for dumping big volumes of rows into your OBIEE server, you’re storing up problems.
An RDBMS is designed for crunching lots of data. You’ve already lifted all that data off the disk when you selected it out of the tables on Oracle DB – why not get Oracle DB to do the additional processing required too? Where’s the benefit in lifting & shifting all of that data to then reprocess it again?
In fact, consider why you’re crunching big volumes of data each time in the first place. If multiple people have the same big queries, are you missing some aggregates in your data source that would support these queries much better? Or do you need to be looking towards multi-dimensional sources such as Oracle OLAP or Essbase?

The flexibility and power of OBIEE to do its own calculations and aggregations on data is a double-edged sword, and one to be wielded with responsibility. It is a great feature and one which cannot be avoided if you’re using federated queries across disparate sources. But if you’re using it out of ignorance or laziness to brute-force a solution instead of doing the big work at the lowest level possible then you’re asking for trouble.

If you rely on OBIEE to do the heavy work of your reporting solution, then you need to factor this in to your hardware spec for the machine. As a reporting interface to well aggregated data with fairly simple reports, we’ve found that it runs at minimal CPU, and doesn’t stress the disk. You can predict fairly reliably that this should scale just fine so long as your database can. But if you put the workload on the OBIEE server too, you’re going to hit bottlenecks much sooner.

The principle of moving as little data around as possible is described by Cary Millsap as “Filter Early” and described well in his blog article. Thanks to @RonCrisco for helping me remember the name. @cdturri pointed out that it’s also just common sense :-).
Applying it to a specific system, Exadata uses it in its SmartScan technology where it applies filtering of data directly on the storage rather than bringing all the data back up to the database SQL processing layer (h/t @ocpdba).

Case-study

Here’s an example of a situation where all is not well in OBIEE-land.

We were alerted to it by unix complaining that a filesystem was running low on space. Checking the contents of the filesystem we spotted these monsters:

[user@server]/data/bi/tmp $ls -lrt
total 112914646
[...]
-rw-------   1 user   biadmin    29122976800 Apr  6 11:32 nQS_20117_465_33898735.TMP
-rw-------   1 user   biadmin    24816966976 Apr  6 11:12 nQS_20117_464_33792457.TMP
-rw-------   1 user   biadmin    3582054936 Apr  6 11:46 nQS_20117_469_37979712.TMP
[...]

These are temporary files generated by BI Server (nqsserver), and for those of you viewing in black and white, those files are getting on for 30GB a go!

When queried, NQQuery.log embarrassingly admits facts such as:

Rows 13,894,550, bytes 3,260,497,648 retrieved from database query id: xxxx

and Usage Tracking shows the kind of damage being inflicted:

                      Total
                       time     row num db     cum num
START_TS     END_TS     sec   count  query      db row 

06-APR 10:05 10:18      579       0      6   3,436,816 
06-APR 10:05 10:18      553       0      4   3,239,101 
06-APR 10:05 10:18      383       0      3   1,624,656 
06-APR 10:11 11:48     5694       0      1  13,894,550 
06-APR 10:11 11:48     4314       0      1  11,840,156 
06-APR 10:21 10:27      336     456      4   3,239,101 

A look at the machine’s vital statistics for this time period shows the impact:

Affected BI Server, showing impact of workload which quietens down c.12:00

For a bit more detail, here’s the time period at greater resolution:

Systems metric during the heavy report execution

Notice that the server starts paging out during the work, and there’s a lot of disk activity – both read and write. CPU goes from a normal c.5% up to 20-30%.

This was one user doing this — care to place bets on how the system would perform if it were rolled out to ten users running this kind of workload?

For reference, here’s one of our other BI Servers which serves at least 10 times as many users, where the reports do the bulk of their work in the database:

Another BI Server on the same day, with a greater number of users but reports which do their big work on the database

Monitoring for problems

There are a few easy ways to look out for this bad behaviour. The first is in Usage Tracking, where you can look at S_NQ_ACCT.CUM_NUM_DB_ROW and compare it to S_NQ_ACCT.ROW_COUNT, even writing an alert for when this ratio goes above a defined threshold.
You could also look at an alert on S_NQ_ACCT.CUM_NUM_DB_ROW going above a fixed number, since depending on your system’s purpose and architectural principles you might want to aim to never be pulling back more rows from the database than necessary for a report.

I’d also pay close attention to S_NQ_ACCT.NUM_DB_QUERY, as this will show you reports generating more than one query on the database and thus increasing the workload on the BI Server processing the combined results.

Another thing to check is the NQQuery.log, looking at the number of bytes returned from a database query. This could feed into some kind of generated alert for closer investigation if a query returns above a certain number.
OBI 10g:

grep retrieved NQQuery.log|awk '{print $5}'|sort -n

OBI 11g:

grep retrieved NQQuery.log|awk '{print $20}'|sort -n

(if you’re running on Windows then to run this command you’ll need either cygwin or google for a Win32 version of grep/awk/sort – or brew your own Powershell version)

I’d also strongly recommend monitoring your BI Server’s TEMP folder (defined as WORK_DIRECTORY_PATHS in NQSConfig.INI), as this is where it lands the data to disk when it’s got to crunch it. Monitor this for two reasons – to spot when bad stuff’s happening, but also in case it fills up and causes your BI Server to barf. For the latter, you can expect to get:

[nQSError: 10058] A general error has occurred. 
[nQSError: 46118] Out of disk space. 

Developing for the long term

How do you assess your development projects for quality?
You presumably are keen that the reports match the specifications and return the right numbers. You hopefully also have NFRs for how quickly these reports run.
But what about system impact of new developments? How do you quantify this?

Unless you are delivering a standalone project, fire-and-forget, then maybe you can performance test for load and concurrency up-front to validate how well your solution will scale to the required users.
But lots of BI projects are iterative and may well be across functions too. Just because your development project is first to land it doesn’t give you the right to dump a steaming pile on the servers and proclaim that it works with your user volumes and meets their NFRs, so all is well. What about all the projects that come along next, and are going to be hindered by your mess?

This is where another double-edged sword comes into play – “Best Practice”. For me, Best Practice is a way of doing something that multiple implementations, time and expertise has shown to be the best way of not screwing up. It is the starting point from which to work, understanding deviations as required. What it is not, and what gets it the bad reputation, is a fixed set of crude rules to be adhered to blindly and implemented without questioning or understanding.

If a system is not adhering to the kind of best practice I’m talking about here – filter early, in essence – then there may be a good reason. But that reason must be consciously and loudly acknowledged and documented.

April 11, 2011

Entertaining Exadata FUD from HP

Filed under: DWH, Exadata, HP, oracle — rmoff @ 09:06

Chris Mellor at The Register posted an interesting article a couple of days ago, entitled HP and Violin build Oracle Exadata killer. The slidedeck has been removed from HP’s FTP site, but a bit of Google magic throws up a couple of mirror copies.

It’s an entertaining read (“Do a Proof of Concept! 94% win rate!! We can and do win against Exadata!!”), and a nice illustration of the FUD techniques that companies use in marketing their products against others. Greg Rahn has taken Netezza to task in the past for this, and to be fair at least Netezza had a serious white paper to back up their arguments. HP’s deck (including choice sections such as “How to sell against Exadata”) is IMHO nothing more than a biased set of arguments for salespeople to use to attempt to bullshit customers.

The deck is entirely aimed at OLTP workloads, and lays all its weight on the IOPS you’ll get from lots of Flash memory (the Violin bit). As any ful kno, building a Data Warehouse on a system based on IOPS with no reference to throughput (GB/s), it doomed to painful failure. My understanding of Exadata is that it’s in part all about a properly balanced configuration. Unbalance this configuration at your peril, as you’ll start pushing unidentified bottlenecks around your system.

Kevin Closson (ex-Oracle performance architect on Exadata, now at EMC), commented on The Register article:

By my assessment the HPDBS (DL980 + Violin solution) is likely not positioned as an Exadata killer for bandwidth-sensitive DW/BI workloads. It simply doesn’t have enough high-bandwidth storage plumbing.

Ironically for HP, they actually quote Kevin in their slide deck with a selective quotation from his blog article about a TPC-H benchmark result here. The quote is used to imply that Exadata isn’t anything other than just disks thrown together. If you’re going to selectively quote, how about this one in the next paragraph of the same article – “I’m just comparing one Oracle Database 11g result to another Oracle Database 11g resul[…]The benchmark result was outstanding and it was Oracle technology in both the Database and Storage grid[…]” …

Kevin goes on to comment on HPDBS:

On the other hand, a single-rack Exadata only supports a scalable read:write ratio of 40:1 (their data sheet 1,000,000 RIOP : 50,000 WIOPS). Actually, that 50,000 WIOPS is a gross number accounting neither for redundant writes (ASM redundancy) nor the larger sequential writes that a transaction processing system also must concurrently sustain. In other words, mileage varies (downward trend)

Maybe HPDBS is a good solution, maybe it’s not. The deck certainly wasn’t published for public consumption so maybe it’s unfair to judge it on that. But it’s an interesting peek into the murky workings of technical sales. I’d be naïve to think that there isn’t a whole bunch of Oracle decks with similar “how to sell against xxxx” sections.

At the end of the day, solutions need to be judged on hard facts and whole pictures alone. For that reason, I’d take the technical blogs of respected writers any day above a sales pitch. Kevin Closson is a good example of this – working for a competitor he could easily have taken the opportunity to stick the knife in to Exadata, but as a respected technical writer he lets the facts speak for themselves.

March 11, 2011

Getting good quality I/O throughput data

Filed under: AWR, io, oracle, performance, visualisation — rmoff @ 11:33

This post expands on one I made last year here about sampling frequency (of I/O throughput, but it’s a generic concept).
The background to this is my analysis of the performance and capacity of our data warehouse on Oracle 11g.

Before I get too boring, here’s the fun bit:

Pork Pies per Hour (PP/h)

Jim wants to enter a championship pork-pie eating competition. He’s timed himself practising and over the course of an hour he eats four pork pies. So we might say that his Pork Pies per Hour (PP/h) rate is 4.

The competition lasts for thirty minutes. The world champion can eat eight pork pies in thirty minutes. Does Jim stand a chance?
(let’s pretend he has an insatiable appetite and isn’t going to get full, and all other smart-ass factors)

If his consumption rate was 4 PP/h, and he only has half an hour, then we would predict he’ll consume 4 * 0.5 = 2 pork pies. So Jim’s going to get his ass beat.

Or is he?

What if the rate of 4 PP/h masks the fact that the hour broke down as follows:

  • First 15 minutes: he ate one pork pie
  • 15 – 30 minutes: had a beer
  • 30 – 45 minutes: gobbled down three pork pies
  • 45 – 60 minutes: he had another beer and gently sweated pork fumes?

If that were the case and we had sampled every fifteen minutes, we’d see this:

So what we want to know, which is the maximum rate at which he can consume pork pies, is exposed only when we sample at an appropriate frequency.

Mega bites to Mega bytes

Enough of pork pies, and back to the tasty subject of I/O throughput. The point I am trying to make is that without an appropriate sample size the data that we have becomes less useful.

You can always process your data further to derive conclusions from it over a longer term. Five second samples are going to be fairly unintelligible if considered unprocessed over the timespan of a year.

But what you can’t do is add back in the detail that you lost by sampling with too great a frequency. Once that moment’s passed, it’s gone.

By the nature of metric which is a rate at which something happens, the sample is going to be an average over the sampling period. The problem with this is that it can mask peaks in the throughput. Peaks may (or may not) be hitting a ceiling in your system which an average figure will make you think you are plenty clear of.

System metrics for Oracle are available through AWR, which typically samples every hour. OS-level tools may sample more frequently, but in the context of capacity planning and analysis, periods are often going to be 10s of minutes, or hourly and upwards.

The following illustrates the effect of averaging I/O throughput figures.

The data is the average I/O throughput, sampled every five seconds (through this method). Note that already this is an average, but in the context of hourly samples (for example) we will have to live with five seconds as the starting point.

This first graph shows the original data, with a five minute average drawn over it. For each, the maximum is noted.
It can be seen that the maximum the I/O throughput hit was a shade over 3GB/s. That was the real maximum that we were driving through the I/O pipe over the period of time (ignoring that it’s a 5-second average). Now look at the maximum of the 5 minute average – we lost c240MB/s in our maximum, which is now 2.7GBs.

In the second graph the original sample is shown, with a 30 minute average. It’s clear to see the effect of averaging the data has – the peaks and troughs are smoothed out, giving a more even line. But is this what we want? Our apparent maximum I/O based on a 30 minute average has now almost halved!
Apparently, we only needed 1.6GB/s of I/O throughput during this time period. The graph clearly shows that this is a false statement. But what about now?

Same graph as before, but without the context of the original sample. Given this data – which is what you’ll have if you collect I/O throughput data that’s sampled every 30 minutes – then how would you know what the maximum throughput during that period was? It is impossible to know!

Pushing this point further, the same 30 minute average, over an extended period:

What’s the maximum throughput that was required during this period? When was the I/O throughput approaching capacity?
You can’t know from the averaged data alone!

Here’s the same extended period, with the original 5 second samples. This is just proving the point, that the 30 minute samples have obliterated the peaks particularly around 04:00 – 06:00.

So what?

To be able to plan for a system’s I/O capacity we need to know more than how much I/O it transferred over a relatively long period of time. We need to know what the biggest demand it put on the system was, otherwise we risk unseen bottlenecks. To make this useful, we also need to understand if these big demands were prolonged peaks or not. Particularly in a DW/BI environment, load is generally going to be sporadic. Sure, we run regular batches at predictable times, and may look to cache reports at fixed times, but they’re exceptions not the rule.

If a system has a limit of 3GB/s, and we hit that 3GB/s for a few seconds, what is the implication of that? We need to transfer the same amount of data, so hitting the limit means that the transfer is going to take longer. But a few seconds here and there may not matter — it all comes down to the context.

A report that runs for five minutes which hits I/O throughput limit for a second or two isn’t as much of a concern as one where the I/O hits the limit for minutes on end. There’s plenty written about system capacity and scalability, and it should be clear that if a system is hitting a capacity limit (whether it’s I/O, CPU, or whatever) for prolonged periods then the overall stability is going to suffer. Maybe that five minute report which spends four minutes on bottlenecked I/O doesn’t bother the user, but what about the report that’s supposed to run in a few seconds which is sat waiting for I/O at the same time?

(Ed: I’m pretty sure that I’m straying into the realm of knees and such, but need (knee’d?) to go off an read some more about it first to be sure)

I love twitter

In writing this, I’ve been struggling with some of the concepts which I can instinctively feel but don’t have the understanding to properly articulate. Particularly, if my assertion is that long sample periods are not a good idea, what is a good sample period? It’s no use being a smart-ass and rubbishing the data we collect if I can’t explain how we should collect it.

So, I turned to twitter. Twitter is awesome. (I think of Chet almost everytime I say this because he was one of the main guys who convinced me it was as good as the fuss made out. Check out his presentation all about it here here).

and back the answers came:

twitter is teh awesome

twitter is teh awesome

Amazingly helpful stuff, and focussed on my specific question. Sure, Google has changed our lives when it comes to finding the answers to questions. But (a) there is a lot of crap written on the internet (Blunder On So, anyone?), and (b) you will often find generally interesting things in the area in which you are interested, but for the more specific uncommon questions it’s unlikely you’ll get a direct hit.

Here, I had industry-leading figures directly answering my specific question! I love twitter. And I really appreciate experts taking the time to share their knowledge, experience and expertise.

What next?

Nothing that I’m writing here is new, but I enjoy writing things down to clarify my thoughts.

Courtesy of the good folk of twitter, I have some great links to follow up and digest.

March 9, 2011

Comparing methods for recording I/O – V$SYSSTAT vs HP Measureware

Filed under: io, oracle, rrdtool, unix — rmoff @ 13:38

I wrote last year about Graphing I/O data using gnuplot and Oracle V$SYSSTAT, using a script from Kevin Closson in his article How To Produce Raw, Spreadsheet-Ready Physical I/O Data With PL/SQL. Good For Exadata, Good For Traditional Storage.
Here I’ve got a simple comparison of the data recorded through this script (in essence, Oracle’s V$SYSSTAT), and directly on the OS through HP’s MeasureWare. It’s graphed out with my new favourite tool, rrdtool:

Happily for me, the data marries up almost exactly. One might say that so it ought. But it’s always good to have explicit proof of such a supposition.

January 8, 2011

Materialised Views – PCT Partition Truncation

Filed under: bug, DWH, MViews, oracle — rmoff @ 18:12

I’ve been doing some work recently that involved the use of Materialised Views on Oracle 11g (11.1.0.7), particularly around PCT refresh. There are some things that are not clear from the documentation, or are actually bugs so far as I’m concerned, and I’ve detailed these below.

In this example I was working on part of a DWH with c.2 millions rows aggregated up daily. One of the things that I spent a long time trying to get to work was Partition Truncation when using PCT refresh. We had tried and discarded “FAST” refresh as being completely non-performant for our volumes.

There was something about PCT that confused me for a while, and so maybe other thickos like me will also be confused so I’ll clarify it here: PCT is about Tracking the Change on Partitions of the detail table. It can have interactions with a partition on an MView, but this is not where its terminology refers to.

Our desired aggregation behaviour was as follows. Given the following:

  • a base fact table (partitioned by day)
  • a weekly aggregation MV (partitioned by week)
  • data loaded daily, MV refreshed daily

we would expect the following to be the most efficient way of refreshing the MV:

  1. When we load a new day’s data, for a new week: a new partition populated on the weekly MV
  2. When we load a new day’s data, for an existing week : truncate the partition on the weekly MV, and reload it (As any fule kno, truncate/insert is going to perform better than delete/insert)
  3. When existing data is updated on the base table: corresponding weekly MV partition truncated and reloaded

From what I read in the documentation, this is what PCT would be expected to do. However, it didn’t work like this for us at first.

The Answer

I’ve discussed these elements in more detail in sections below, but to cut to the chase, here is how we got PCT Partition Truncation to work for a detail table partitioned on day and aggregate table partitioned on week:

  • In the MView definition:
    1. Partition by range or list (n.b. not interval)
    2. Join to reference table for partition key (n.b. don’t use ANSI syntax)
      • Eg. Day->Week aggregation, join to calendar reference table
      • Alternatively partition MView on same partition key as detail table
    3. Define the MView USING NO INDEX to prevent the default global index being created
  • When refreshing the MView:

    1. alter session set “_mv_refresh_costing”=’rule_pt_pd_fa_co’;
    2. dbms_mview.refresh(
      • atomic_refresh=>false
      • method=>’P’ or ‘?’

Note: this is what we did to get it to work; it may be that other versions differ particularly around the _mv_refresh_costing parameter and its default value. I presume there is standard disclaimer around using undocumented parameters too.

RTFM!

Links to 11gR1 docs:

What it says in the manual

See here for the documentation regarding PCT partition truncation.
There are three conditions that you must obey for it to work (although see below for a further condition):

  1. The materialized view is partitioned on the partitioning key column or join dependent expressions of the detail table.
    If you think about this it makes a lot of sense. PCT is about tracking change on the partition of the detail table, and so if data in this partition changes, Oracle must be told how this relates to the data in the MView, so that it can refresh just the relevant subset of it.

    • An example of the former (partitioned on the partitioning key of the detail table) is base table and aggregate MV both partitioned on Store, or Warehouse, etc.

      The same partitioning key

    • An example of the latter (join dependent expression) is where you link to a second table in the MV definition, giving Oracle the understanding of the relation between two different partitioning columns (for example, how do weeks relate to days):

      Join Dependent Expression

  2. The materialized view should be range or list partitioned
  3. PCT refresh is nonatomic
    • That is, you specify the atomic_refresh parameter in your refresh call as false:
      EXEC DBMS_MVIEW.REFRESH(	[…], 
      			[…],
      			atomic_refresh=>false,
      			[…]);
      
    • By default (i.e. if you do not specify it), atomic_refresh=>true

<grumble>
The documentation on all this is there, if you look for it. My personal opinion is that other functionality of Oracle DB is better documented than this, and that DBMS_MVIEW as a whole gives away its long history in the disjointed nature of the documentation. If it were re-written today with a focus on data warehousing I’m confident it would be clearer. Anyway, that’s my excuse for not RTFMing properly…
</grumble>

What it doesn’t say in the manual

Despite having met all the above conditions for PCT partition truncation, it still wasn’t happening. When adding or updating data for an existing weekly partition, we always saw DELETE … INSERT, rather than TRUNCATE … INSERT. The DELETE took ages, as would be expected

A 10979 trace (see below) on the MView refresh showed this crucial line :

[...]

Value of _mv_refresh_costing : rule

[...]

This undocumented parameter is at the very heart of getting PCT partition truncation. By default it is set to Rule, which means that you will get DELETE … INSERT and never TRUNCATE…INSERT. There is a document on My Oracle Support: PCT REFRESH ISSUES DELETE WHERE IT SHOULD ISSUE TRUNCATE (Doc ID 733673.1).

The suggested solution is to set it to Cost, which instructs Oracle to cost the different options and use the best for the given situation. However, during testing we saw an instance of bad costing on this leading Oracle to still DELETE instead of TRUNCATE. Therefore, we opted for forcing the choice through rule, but specified in order of preference:

alter session set "_mv_refresh_costing"='rule_pt_pd_fa_co';

where:

  • pt = PCT – TRUNC (i.e. TRUNCATE … INSERT)
  • pd = PCT – DEL/TRUNC (i.e. DELETE … INSERT)
  • fa = FAST
  • co = COMPLETE

(Credit: “wilhelm2000” on this OTN forum post)

N.B. Despite the ‘TRUNC’ in the name, “PCT – DEL/TRUNC” does not do a TRUNCATE – maybe the name comes from DELETE being used as a kind of TRUNCATE?

[What it kind of says in the manual] – Default global index

Our elation at seeing a TRUNCATE occur was short-lived, because the refresh still took ages to run.

After a bit of head-scratching and more RTFMing, we found that the reason for the slow refresh time was down to a default global index that is created on MViews. Named I_SNAP$.[…], they are used when you do “fast” (incremental) refreshes, using the mv$log tables. However, if you’re only doing PCT refreshes of a MView, they are a positive hindrance because they need rebuilding every time. On an MView with millions of rows this takes a long time.

Here is an abridged trace for a PCT-Truncate refresh of an MView with the default global index in place:

/* MV_REFRESH (IND_UNUSABLE) */ ALTER INDEX "HR"."I_SNAP$_MV_WEEK" UNUSABLE

ALTER TABLE "HR"."MV_WEEK" TRUNCATE  PARTITION PART_20101122 UPDATE GLOBAL INDEXES

/* MV_REFRESH (INS) */ INSERT /*+ SKIP_UNQ_UNUSABLE_IDX APPEND BYPASS_RECURSIVE_CHECK */ INTO[…]

BEGIN  sys.dbms_index_utl.multi_level_build(index_list=>'"HR"."I_SNAP$_MV_WEEK"',just_unusable=>TRUE, cont_after_err=>TRUE, concurrent=>TRUE); END;

By defining the MView using the USING NO INDEX clause, the global index is not created and the PCT-Truncate works much more efficiently:

ALTER TABLE "HR"."MV_WEEK" TRUNCATE PARTITION[…]

/* MV_REFRESH (INS) */ INSERT /*+ APPEND  […]

Refresh method

To get PCT refresh, use method=>’P’

EXEC DBMS_MVIEW.REFRESH(	[…], 
			[…],
			method=>'P',
			[…]);

You can also use method=>’?’ where Oracle will try PCT first, and then Complete if PCT is not possible. However, if PCT isn’t possible you may have a problem that you want to know about rather than rebuilding the MV each time without you being aware of it.

Bug: Interval Partitioned MView and PCT Partition Truncation

Interval partitioning removes the headache of partition management for new data. Unfortunately, it appears that you can’t partition MViews and refresh them using PCT partition truncation.
If you try to PCT partition truncate refresh an interval-partitioned MView, you get this error:

ERROR at line 1:
ORA-12008: error in materialized view refresh path
ORA-00936: missing expression
ORA-06512: at "SYS.DBMS_SNAPSHOT", line 2545
ORA-06512: at "SYS.DBMS_SNAPSHOT", line 2751
ORA-06512: at "SYS.DBMS_SNAPSHOT", line 2720
ORA-06512: at line 1

You don’t get the error if you do a non-atomic refresh – but if you do a non-atomic refresh you don’t get partition truncation (you get DELETE instead).

Test case: mv_issue_01a.sql

Bug: MView defined with ANSI SQL

This will probably have some people exclaiming “serves you right …” but ANSI SQL joins (INNER JOIN etc, instead of joining tables in the WHERE clause) is what I was brought up on. I discovered after several frustrating hours that you shouldn’t use it when you define MViews, as it can result in an erroneous “COMPILATION ERROR” state for the MView when data on the base table is changed.

There’s a of MOS article: Mview Refresh Fails And COMPILE_STATE => COMPILATION_ERROR (Doc ID 1081493.1), which refers to bug 5759944 and for which there’s a patch number. It’s supposedly fixed in 11.2 but appears to still be present in 11.2.0.1.0.

Test case: mv_issue_02.sql

10979 Tracing

To understand exactly what happens when a MView refresh takes place, you can enable tracing using:

-- enable the trace
alter session set events '10979 trace name context forever';
-- SQL commands to trace go here
 [...]
-- Disable the trace 
alter session set events '10979 trace name context off';

Here’s an annotated trace log of a successful PCT partition truncation MView refresh:

First is one of the key bits – is the refresh atomic (atomic_refresh=>’true’, which is the default) or non-atomic (atomic_refresh=>’false’) which is required for PCT partition truncation

 NON ATOMIC REFRESH
 ON DEMAND REFRESH

 ONLY TRUNCATE based PCT REFRESH possible

Next follows the statements that Oracle would execute, for all of the MV refresh methods that it thinks are applicable in this refresh scenario.

Method 4 is a DELETE … INSERT, whilst Method 5 is TRUNCATE … INSERT.
Note that the global index hasn’t been removed from the MView, hence the additional index maintenance commands

Refresh Method 4
REFRESH Stmt 0
/* MV_REFRESH (DEL) */ DELETE FROM "HR"."MV_WEE[...]
REFRESH Stmt 1
/* MV_REFRESH (INS) */ INSERT /*+ BYPASS_RECURSIVE_CHEC[...]

Refresh Method 5
REFRESH Stmt 0
/* MV_REFRESH (IND_UNUSABLE) */ ALTER INDEX "HR[...]
REFRESH Stmt 1
ALTER TABLE "HR"."MV_WEEK" TRUNCATE  PARTITION [...]
REFRESH Stmt 2
/* MV_REFRESH (INS) */ INSERT /*+ SKIP_UNQ_UNUSABLE_IDX[...]
REFRESH Stmt 3
BEGIN  sys.dbms_index_utl.multi_level_build(index_list=[...]

Watch out for this parameter value – it’s crucial!


Value of _mv_refresh_costing : COST

By default (11.1.0.7) it’s “RULE” and will do DELETE…INSERT *always* instead of TRUNCATE…INSERT

If you’re using COST, then the cost of the possible refresh methods will be calculated:

REFRESH STATEMENT
/* MV_REFRESH (DEL) */ DELETE FROM "HR"."MV_WEE[...]
COST = 4.002855
REFRESH STATEMENT
/* MV_REFRESH (INS) */ INSERT /*+ BYPASS_RECURSIVE_CHEC[...]
COST = 2.571476
TOTAL COST OF REFRESH = 6.574331


REFRESH STATEMENT
/* MV_REFRESH (IND_UNUSABLE) */ ALTER INDEX "HR[...]
COST = 0.000000
REFRESH STATEMENT
ALTER TABLE "HR"."MV_WEEK" TRUNCATE  PARTITION [...]
COST = 0.000000
REFRESH STATEMENT
/* MV_REFRESH (INS) */ INSERT /*+ SKIP_UNQ_UNUSABLE_IDX[...]
COST = 2.279282
REFRESH STATEMENT
BEGIN  sys.dbms_index_utl.multi_level_build(index_list=[...]
COST = 0.000000
TOTAL COST OF REFRESH = 2.279282

After costing, or by rule, the method chosen will be stated:


Refresh method picked PCT - TRUNC

Oracle then records the actual statements executed:


 Executed Stmt -
/* MV_REFRESH (IND_UNUSABLE) */ ALTER INDEX "HR[...]
 Executed Stmt -
ALTER TABLE "HR"."MV_WEEK" TRUNCATE  PARTITION [...]
 Executed Stmt -
/* MV_REFRESH (INS) */ INSERT /*+ SKIP_UNQ_UNUSABLE_IDX[...]
 Executed Stmt -
BEGIN  sys.dbms_index_utl.multi_level_build(index_list=[...]

A final word of warning – remember that the trace file is an internal diagnostic, it is not for user-consumption. I spent a while worrying about this entry in the log:

PARSE ERROR #19:len=12330 dep=1 uid=142 oct=3 lid=142 tim=35441256289311 err=10980
SELECT "R"."BSNS_WK_KEY" "WK_KEY","D"."SOURCE_SYSTEM_KEY" "SOURCE_SYSTEM_KEY","D"."
CLOSE #19:c=0,e=0,dep=1,type=0,tim=35441256289311

Even though there was no error returned to the user session calling the refresh, I figured this must be some problem. However, according to MOS doc Errors In Refresh Snapshot Trace Files ORA-10980 (Doc ID 294513.1) the parse error is raised and cleared internally and therefore nothing to worry about.

January 5, 2011

Oracle Whitepaper – “Best Practices for a Data Warehouse on Oracle Database 11g”

Filed under: DWH, oracle — rmoff @ 09:11

Poking around on My Oracle Support today, I found a link to a white paper dated November 2010, titled “Best Practices for a Data Warehouse on Oracle Database 11g”. It’s new to me and I’ve not noticed a blog post announcing it, so I thought I’d share it here. It’s by Maria Colgan, who has posted in the past on both the Inside the Oracle Optimizer blog and The Data Warehouse Insider blog. Here’s the link to it:

I found the link via Exadata V2 Starter Kit (Doc ID 1244344.1). There wasn’t one in my stocking from Santa this year, but I can dream 😉
The doc has some other useful links including ones like the above white paper which are not solely Exadata related.

October 26, 2010

Graphing I/O data using gnuplot and Oracle V$SYSSTAT

Filed under: io, oracle, unix — rmoff @ 15:42

Continuing in the beard-scratching theme of Unix related posts (previously – awk), here’s a way to graph out the I/O profile of your Oracle database via the Oracle metrics in gv$sysstat, and gnuplot. This is only the system I/O as observed by Oracle, so for belts & braces (or to placate a cynical sysadmin ;-)) you may want to cross-reference it with something like sar.

First, a pretty picture of what you can get:

Example gnuplot output of Oracle I/O

Why would you want to do this when you’ve got a pretty GUI in EM/Grid and flash graphs? Because the data in EM is averaged out over a relatively large sample (one minute for recent history, and whatever your AWR samplerate is for older history) and thus lower than the point-in-time I/O being driven through your server. My previous post (The danger of averages) illustrates this. For identifying bottlenecks or capacity planning, you need to know how much I/O throughput you really use. Consider this rough example: Using V$SYSMETRIC_HISTORY you can get an average over the last minute. For the first 30 seconds of this sampled minute you ran a query consuming I/O at 100 MB/s. For the last 30 seconds there was no system activity. V$SYSMETRIC_HISTORY shows a figure of 50 MB/s, as this is the average over a minute. You look at your system’s IO profile in EM and see 50 MB/s. Your hardware guys tell you that the system has a capacity of 100 MB/s throughput, so you think you’re well in the clear for capacity, when in reality you’re already hitting it. Now, what happens when two instances of this same query – each requiring 100 MB/s – runs? It will take twice as long when run concurrently (because they’ll have to share the throughput available, and thus get ~50 MB/s). It might be that this is acceptable, that the trade-off of hardware cost to increase I/O throughput capacity isn’t justifiable for making the queries run faster. But it’s important to be aware of the bottlenecks in a system so that they can be mitigated and considered in any capacity planning.

So, anyway, back to the point of this post:

Kevin Closson’s written a neat script here which will write out IO metrics from gv$sysstat to a flat file on the Oracle host. It looks like this:

2010-10-26-09:09:58|1|1|0|
2010-10-26-09:10:03|0|0|0|
2010-10-26-09:10:08|51|51|0|
2010-10-26-09:10:13|87|87|0|
2010-10-26-09:10:19|108|108|0|
2010-10-26-09:10:24|118|118|0|
2010-10-26-09:10:29|116|117|0|
2010-10-26-09:10:34|451|454|0|
2010-10-26-09:10:39|692|694|0|
2010-10-26-09:10:44|894|895|2|
2010-10-26-09:10:49|875|879|1|
2010-10-26-09:10:54|990|990|2|
2010-10-26-09:10:59|922|920|1|
2010-10-26-09:11:04|768|765|2|

I wanted a quick way to visualise the data, and also to plot it out once it was over the number of rows that Excel will chart at once (32k I think). gnuplot was the answer, but it’s a bit of a sod to get running as a fire-and-forget so here’s a help.
Here’s how it works:

  • For ease of use it copies the file generated by the above script to your local machine. If you don’t want to do that then remove the scp line.
  • It uses sed to convert bar (|) characters to Space ( ), as gnuplot requires whitespace separated columns. tr would probably do the job too.
  • It uses an inline input to gnuplot, but you could move this to a separate config file if you wanted
  • It plots the graph on screen (assuming you have configured X), and also writes it to a png file
# Download the file from a remote server
# Prompts for password, or use ssh key authentication to make it seamless
scp user@remotehost:/tmp/io.DBINSTANCE.log .
# Convert bar (|) to Space ( )
sed -e 's/|/ /g' io.DBINSTANCE.log > io.log
# Plot a graph (remove --persist if you don't want the window displayed)
gnuplot --persist <<EOF
set title "Read I/O MB/s\nSampled every 5 seconds"
set xdata time
set timefmt "%Y-%m-%d-%H:%M:%S"
set format x "%d %b\n%H:%M"
set ylabel "MB/s"
# You can set the y range to a specific constant based on your IO capacity
set yrange [0:3000]
set xlabel "Date/Time"
unset key
set grid
plot "io.log" using 1:3 with boxes fs
set terminal png font "courier, 10" size 1200,800
set output "io.png"
replot
EOF

This was written on cygwin, and presumably should work on any ‘nix system.


Ob. OBIEE: An alternative to gnuplot would be to graph the data in OBIEE using the text file as source data to a hacked together RPD 🙂 :

September 14, 2010

The danger of averages – Measuring I/O throughput

Filed under: io, oracle — rmoff @ 12:39

This query, based on AWR snapshots on sys.wrh$_sysstat, includes in its metrics the I/O read throughput for a given snapshot duration.

However it’s important to realise the huge limitation to this figure – it’s an average. It completely shoots you in the foot if you’re looking at capacity requirements.

Consider this real-life example extracted from the above query:

Timestamp					Total Read MBPS
===========================================
14-SEP-10 05.15.12.660      113.748
14-SEP-10 06.00.40.953      202.250
14-SEP-10 06.45.52.750       34.649
14-SEP-10 07.30.03.394       10.953
14-SEP-10 08.15.15.243       57.833
14-SEP-10 09.00.27.180       30.177

So, it looks like early in the morning we’re using about 200 MB/s throughput, and by about 9am somewhere around 30-50 MB/s ?

Let’s have a look at V$SYSMETRIC_HISTORY (which gives numbers every minute for the last hour) for the samples corresponding to the last AWR sample in the above resultset (08:15 – 09:00):
[io2.sql]

BEGIN_TIME           VALUE_MB 
-------------------  -------- 
2010-09-14-08:14:55       0.7 
2010-09-14-08:15:54       0.0 
2010-09-14-08:16:54       0.1 
2010-09-14-08:17:55       0.0 
2010-09-14-08:18:54       0.0 
2010-09-14-08:19:55     318.5 
2010-09-14-08:20:54     258.5 
2010-09-14-08:21:54     183.6 
2010-09-14-08:22:55      24.8 
2010-09-14-08:23:54       0.0 
2010-09-14-08:24:55       0.0 
2010-09-14-08:25:54       0.0 
2010-09-14-08:26:54       0.0 
2010-09-14-08:27:55       0.0 
2010-09-14-08:28:54       0.0 
2010-09-14-08:29:55       0.0 
2010-09-14-08:30:54       0.0 
2010-09-14-08:31:54       0.1 
2010-09-14-08:32:55       0.0 
2010-09-14-08:33:54       0.0 
2010-09-14-08:34:55       0.0 
2010-09-14-08:35:54       0.1 
2010-09-14-08:36:54       3.0 
2010-09-14-08:37:54       0.0 
2010-09-14-08:38:54       0.0 
2010-09-14-08:39:55       0.0 
2010-09-14-08:40:54       0.0 
2010-09-14-08:41:55       0.1 
2010-09-14-08:42:54       0.0 
2010-09-14-08:43:54       0.0 
2010-09-14-08:44:55       0.0 
2010-09-14-08:45:54       0.0 
2010-09-14-08:46:55       0.1 
2010-09-14-08:47:54     156.9 
2010-09-14-08:48:54     413.1 
2010-09-14-08:49:55       0.1 
2010-09-14-08:50:54       0.0 
2010-09-14-08:51:55       3.1 
2010-09-14-08:52:54       0.0 
2010-09-14-08:53:54       0.0 
2010-09-14-08:54:55       0.0 
2010-09-14-08:55:54       0.0 
2010-09-14-08:56:55       0.0 
2010-09-14-08:57:54       0.0 
2010-09-14-08:58:54       0.0 
2010-09-14-08:59:55       0.6 

(METRIC_ID = 2093 “Physical Read Total Bytes Per Sec”)
If you average out the numbers in this 45-minute sample, you get 30MB/s – which ties in with what AWR shows too. But it is clearly wrong to say that the IO throughput for the period is 30 MB/s. In terms of capacity the system is at times utilising over 400 MB/s – albeit for a short period of time:

v$sysmetric_history

(As a side note – this is a good illustration of why a bar chart is more appropriate here, rather than a line graph which is what I initially opted for. A line chart joins the data points giving a incorrect assumption of the value at points in between samples. A bar chat shows what it was when we sampled it, and only then. For planning capacity, it’s important to be considering only what we know to be true.)

For estimating something like disk space requirements, an average per x time slice extrapolated up may work, because the peaks will balance out the troughs – kind of what the point of an average is. But when thinking about capacity and the size of pipes a system requires, an average can wildly distort things. It’s the peaks that are important because it’s those that will bottleneck the system.

This script from Kevin Closson is good for recording granular IO throughput over a period of time.

Blog at WordPress.com.