rmoff

December 31, 2010

Data Warehousing and Statistics in Oracle 11g – incremental global statistics

Filed under: etl, oracle, Statistics — rmoff @ 00:39

This is a series of posts where I hope to humbly plug some gaps in the information available (or which has escaped my google-fu) regarding statistics management in Oracle 11g specific to Data Warehousing.

Incremental Global Statistics is new functionality in Oracle 11g (and 10.2.0.4?) and is explained in depth in several places including:

In essence, Oracle maintains information about each partition when statistics is gathered on the partition, and it uses this to work out the global statistics – without having to scan the whole table. For a more detailed description, see the above links. It’s important to note that this is not the same as aggregated global statistics (which Doug Burns covers in detail here)

To use it, you there are two conditions:

  1. The INCREMENTAL value for a partition table is set to TRUE
  2. You gather statistics on that table with the GRANULARITY parameter set to AUTO

Too good to be true?

From what I’d read, it sounded ideal for what we do, which is load big (10M+ rows per day) tables, partitioned on day. However, when I started testing it I discovered what may be a problem. The information about each partition is stored in the SYSAUX tablespace in two tables:

  • SYS.WRI$_OPTSTAT_SYNOPSIS_HEAD$
  • SYS.WRI$_OPTSTAT_SYNOPSIS$

Looking at the WRI$_OPTSTAT_SYNOPSIS$ table on one of our databases, it had 216 million rows in and took up about 16GB of space. The documentation does say “the SYSAUX tablespace consumes additional space to maintain the global statistics” but this is an awful lot of space.

When I ran a test to gather incremental statistics on a two-partition table with a single row in each partition, it took three minutes to gather stats each time! A quick look at SQL Monitor showed that a lot of the time was spent on a DELETE from WRI$_OPTSTAT_SYNOPSIS$.

In a database with no other data in WRI$_OPTSTAT_SYNOPSIS$, the stats gather was sub-second.

Looking at the data on WRI$_OPTSTAT_SYNOPSIS$, it can be determined that for every table, partition, and column, there is an entry on the header table WRI$_OPTSTAT_SYNOPSIS_HEAD$. There is a unique synopsis number given to each combination, which has one or many entries on the synopsis hash table WRI$_OPTSTAT_SYNOPSIS$. There seems to be one hash for every distinct value in the table/partition/column.

Behind the scenes

You can check whether a table is set for INCREMENTAL global statistics in two ways.
For an individual table, dbms_stats.get_prefs will return the value:

select dbms_stats.get_prefs(ownname=>'HR',pname=>'INCREMENTAL', tabname=>'BASE_T1') from dual;

Alternatively, to list all tables in the database that have INCREMENTAL set, use this:

select u.name "OWNER" ,o.name "TABLE_NAME" ,p.valchar
from  sys.OPTSTAT_USER_PREFS$ p 
inner join sys.obj$ o on p.obj#=o.obj# 
inner join sys.user$ u on o.owner#=u.user#
where p.PNAME = 'INCREMENTAL'

To look at the synopses (synopsii?), use this query:

SELECT u.NAME                                          "owner",
       o.NAME                                          "table_name",
       p.subname                                       "partition_name",
       c.NAME                                          "column_name",
       to_char(h.analyzetime, 'YYYY-MM-DD-HH24:MI:SS') "analyse_Time",
       COUNT(*)                                        "hash entries"
FROM   sys.wri$_optstat_synopsis_head$ h
       left join sys.wri$_optstat_synopsis$ s
         ON h.synopsis# = s.synopsis#
       left join sys.obj$ o
         ON h.bo# = o.obj#
       left join sys.user$ u
         ON o.owner# = u.user#
       left join sys.col$ c
         ON h.bo# = c.obj#
            AND h.intcol# = c.intcol#
       left join (SELECT bo#,
                         obj#
                  FROM   sys.tabpart$
                  UNION ALL
                  SELECT bo#,
                         obj#
                  FROM   sys.tabcompart$) tp
         ON h.bo# = tp.bo#
            AND h.group# = tp.obj# * 2
       left join sys.obj$ p
         ON tp.obj# = p.obj#
GROUP  BY u.NAME,
          o.NAME,
          p.subname,
          c.NAME,
          h.analyzetime
ORDER  BY u.NAME,
          o.NAME,
          c.NAME;  

Test case

This is the test case I’ve been using to investigate the issue. It is hopefully self-documenting.

I’ve written a set of queries that examine the statistics in the data dictionary so that I can see how they get built up.
USER_TAB_STATS_HISTORY is good for seeing a chronological record of the stats gathers.

/* https://rnm1978.wordpress.com/ */

--
-- *****************************
--  test_incr_stats.sql
--
--  Test incremental statistic gathering
-- *****************************
--

set echo off
set timing off
set feedback on


prompt
prompt ************
prompt Check the current size of the synopsis table WRI$_OPTSTAT_SYNOPSIS$

select table_name, num_rows from dba_tables where table_name like 'WRI$_OPTSTAT_SYNOPSIS%';

select to_char(min(h.analyzetime),'YYYY-MM-DD-HH24:MI:SS') "Earliest Synopsis"
FROM    SYS.WRI$_OPTSTAT_SYNOPSIS_HEAD$ h;


prompt
prompt ************
prompt Create a simple test table

drop table BASE_T1;
CREATE TABLE BASE_T1 ( day_key   INTEGER, store_num INTEGER, fact_01   NUMBER(38,4) DEFAULT 0)
PARTITION BY RANGE ( "DAY_KEY") ( PARTITION PART_1 VALUES LESS THAN (2) NOCOMPRESS,
                            PARTITION PART_2 VALUES LESS THAN (3) NOCOMPRESS)
PARALLEL;

prompt
prompt ************
prompt Set the table to INCREMENTAL stats
exec dbms_stats.set_table_prefs(pname=>'INCREMENTAL',ownname=>USER,tabname=>'BASE_T1',pvalue=>'TRUE');


prompt
prompt ************
prompt Gather initial stats
set timing on
exec dbms_stats.gather_table_stats(     ownname=>USER, tabname=>'BASE_T1', granularity=>'AUTO');
set timing off



prompt
prompt ************
prompt Add one row of data to partition 1
insert into base_t1 (day_key,store_num,fact_01) values (1, 1,10);


prompt
prompt ************
prompt Gather stats
set timing on
exec dbms_stats.gather_table_stats(     ownname=>USER, tabname=>'BASE_T1', granularity=>'AUTO');
set timing off



prompt
prompt ************
prompt Add one row of data to partition 2
insert into base_t1 (day_key,store_num,fact_01) values (2, 1,10);


prompt
prompt ************
prompt Gather stats
set timing on
exec dbms_stats.gather_table_stats(     ownname=>USER, tabname=>'BASE_T1', granularity=>'AUTO');
set timing off




prompt
prompt ************
prompt Add another row of data to partition 1, with a new store_num value
insert into base_t1 (day_key,store_num,fact_01) values (1, 2,10);




prompt
prompt ************
prompt Gather stats
set timing on
exec dbms_stats.gather_table_stats(     ownname=>USER, tabname=>'BASE_T1', granularity=>'AUTO');
set timing off




set linesize 156
col table_name for a12
col partition_name for a14
col column_name for a12
col high_value for a8
col low_value for a8
col global_stats head "Global|Stats" for a7
col stale_stats head "Stale|Stats" for a5

prompt
prompt Incremental stats setting:
select dbms_stats.get_prefs(ownname=>USER,pname=>'INCREMENTAL', tabname=>'BASE_T1') from dual;

prompt
prompt Actual data in the table:
select day_key,count(*) from BASE_T1 group by day_key order by day_key asc;

prompt
prompt USER_TAB_STATISTICS:
select table_name,partition_name,num_rows,sample_size,
to_char(last_analyzed,'YYYY-MM-DD-HH24:MI:SS') "Last Analyzed",global_stats,stale_stats
from user_tab_statistics where table_name='BASE_T1';

prompt
prompt USER_TAB_STATS_HISTORY:
select table_name,partition_name,stats_update_time from user_tab_stats_history where table_name='BASE_T1'
order by stats_update_time asc
;

prompt
prompt USER_TAB_COL_STATISTICS:
select table_name,column_name,sample_size,to_char(last_analyzed,'YYYY-MM-DD-HH24:MI:SS') "Last Analyzed",
global_stats,num_distinct,low_value,high_value
from USER_TAB_COL_STATISTICS where table_name='BASE_T1';

prompt
prompt USER_PART_COL_STATISTICS:
select table_name,partition_name,column_name,sample_size,to_char(last_analyzed,'YYYY-MM-DD-HH24:MI:SS') "Last Analyzed",
global_stats,num_distinct,low_value,high_value
from USER_PART_COL_STATISTICS where table_name='BASE_T1';

prompt
prompt Synopsis data:
SELECT o.name "TABLE_NAME"
        ,p.subname "Part"
        ,c.name "Column"
        ,to_char(h.analyzetime,'YYYY-MM-DD-HH24:MI:SS') "Analyse Time"
        ,count(*) "Hash count"
FROM    SYS.WRI$_OPTSTAT_SYNOPSIS_HEAD$ h
left join sys.wri$_optstat_synopsis$ s on h.synopsis# = s.synopsis#
left join sys.obj$ o on h.bo#=o.obj#
left join sys.user$ u on o.owner#=u.user#
left join sys.col$ c on h.bo#=c.obj# and h.intcol# = c.intcol#
left join (select bo#,obj# from sys.tabpart$ union all select bo#,obj# from sys.tabcompart$) tp on h.bo#=tp.bo# and h.group#=tp.obj#*2
left join sys.obj$ p on tp.obj#=p.obj#
where  u.name = USER and o.name = 'BASE_T1'
group by u.name,o.name ,p.subname,c.name,h.analyzetime
order by u.name,o.name;

Results

This was run on Oracle 11.1.0.7, on several different databases. I’ve edited the output slightly for brevity.

Where SYS.WRI$_OPTSTAT_SYNOPSIS is small, it can be seen that the stats gather is fast – as would be expected for a table so small:

************
Check the current size of the synopsis table WRI$_OPTSTAT_SYNOPSIS$

TABLE_NAME     NUM_ROWS
------------ ----------
WRI$_OPTSTAT          0
WRI$_OPTSTAT       1940
WRI$_OPTSTAT     287236

Gather initial stats
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.16
************
Add one row of data to partition 1
************
Gather stats
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.27
************
Add one row of data to partition 2
************
Gather stats
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.27
************
Add another row of data to partition 1, with a new store_num value
************
Gather stats
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.34

Incremental stats setting:
DBMS_STATS.GET_PREFS(OWNNAME=>USER,PNAME=>'INCREMENTAL',TABNAME=>'BASE_T1')
------------------------------------------------------------------------------------------------
TRUE

Actual data in the table:

   DAY_KEY   COUNT(*)
---------- ----------
         1          2
         2          1

2 rows selected.


USER_TAB_STATISTICS:

                                                                       Global  Stale
TABLE_NAME   PARTITION_NAME   NUM_ROWS SAMPLE_SIZE Last Analyzed       Stats   Stats
------------ -------------- ---------- ----------- ------------------- ------- -----
BASE_T1                              3           3 2010-12-30-18:04:56 YES     NO
BASE_T1      PART_1                  2           2 2010-12-30-18:04:56 YES     NO
BASE_T1      PART_2                  1           1 2010-12-30-18:04:56 YES     NO

3 rows selected.


USER_TAB_STATS_HISTORY:

TABLE_NAME   PARTITION_NAME STATS_UPDATE_TIME
------------ -------------- ---------------------------------------------------------------------------
BASE_T1      PART_1         30-DEC-10 18.04.55.633710 +00:00
BASE_T1      PART_2         30-DEC-10 18.04.55.633710 +00:00
BASE_T1                     30-DEC-10 18.04.55.645162 +00:00
BASE_T1      PART_1         30-DEC-10 18.04.55.856920 +00:00
BASE_T1                     30-DEC-10 18.04.55.910722 +00:00
BASE_T1      PART_2         30-DEC-10 18.04.56.126645 +00:00
BASE_T1                     30-DEC-10 18.04.56.181336 +00:00
BASE_T1      PART_1         30-DEC-10 18.04.56.442624 +00:00
BASE_T1                     30-DEC-10 18.04.56.527702 +00:00

9 rows selected.


USER_TAB_COL_STATISTICS:

                                                          Global
TABLE_NAME   COLUMN_NAME  SAMPLE_SIZE Last Analyzed       Stats   NUM_DISTINCT LOW_VALU HIGH_VAL
------------ ------------ ----------- ------------------- ------- ------------ -------- --------
BASE_T1      DAY_KEY                3 2010-12-30-18:04:56 YES                2 C102     C103
BASE_T1      STORE_NUM              3 2010-12-30-18:04:56 YES                2 C102     C103
BASE_T1      FACT_01                3 2010-12-30-18:04:56 YES                1 C10B     C10B

3 rows selected.


USER_PART_COL_STATISTICS:

                                                                         Global
TABLE_NAME   PARTITION_NAME COLUMN_NAME  SAMPLE_SIZE Last Analyzed       Stats   NUM_DISTINCT LOW_VALU HIGH_VAL
------------ -------------- ------------ ----------- ------------------- ------- ------------ -------- --------
BASE_T1      PART_1         DAY_KEY                2 2010-12-30-18:04:56 YES                1 C102     C102
BASE_T1      PART_1         STORE_NUM              2 2010-12-30-18:04:56 YES                2 C102     C103
BASE_T1      PART_1         FACT_01                2 2010-12-30-18:04:56 YES                1 C10B     C10B
BASE_T1      PART_2         DAY_KEY                1 2010-12-30-18:04:56 YES                1 C103     C103
BASE_T1      PART_2         STORE_NUM              1 2010-12-30-18:04:56 YES                1 C102     C102
BASE_T1      PART_2         FACT_01                1 2010-12-30-18:04:56 YES                1 C10B     C10B

6 rows selected.


Synopsis data:

TABLE_NAME   Part                           Column                         Analyse Time        Hash count
------------ ------------------------------ ------------------------------ ------------------- ----------
BASE_T1      PART_2                         DAY_KEY                        2010-12-30-18:04:56          1
BASE_T1      PART_2                         FACT_01                        2010-12-30-18:04:56          1
BASE_T1      PART_1                         STORE_NUM                      2010-12-30-18:04:56          2
BASE_T1      PART_1                         DAY_KEY                        2010-12-30-18:04:56          1
BASE_T1      PART_2                         STORE_NUM                      2010-12-30-18:04:56          1
BASE_T1      PART_1                         FACT_01                        2010-12-30-18:04:56          1

6 rows selected.

Note that there are two hash values for the synopsis for PART_1 column STORE_NUM, because there are two values in the column in that partition.

You can see clearly from USER_TAB_STATS_HISTORY three things:

  1. The order in which stats are gathered – partitions, and then table
  2. That partitions that have not been updated are not re-analyzed
  3. That global stats are updated each time the stats gather is run

Where SYS.WRI$_OPTSTAT_SYNOPSIS is large, stats gather is much slower:

Check the current size of the synopsis table WRI$_OPTSTAT_SYNOPSIS$

TABLE_NAME                       NUM_ROWS
------------------------------ ----------
WRI$_OPTSTAT_SYNOPSIS_PARTGRP           0
WRI$_OPTSTAT_SYNOPSIS_HEAD$         64259
WRI$_OPTSTAT_SYNOPSIS$          216854569

************
Gather initial stats
Elapsed: 00:00:00.57
************
Add one row of data to partition 1
************
Gather stats
Elapsed: 00:03:04.58
************
Add one row of data to partition 2
************
Gather stats
Elapsed: 00:02:25.20
************
Add another row of data to partition 1, with a new store_num value
************
Gather stats
Elapsed: 00:02:25.76
************

Actual data in the table:

   DAY_KEY   COUNT(*)
---------- ----------
         1          2
         2          1

USER_TAB_STATISTICS:

                                                                       Global  Stale
TABLE_NAME   PARTITION_NAME   NUM_ROWS SAMPLE_SIZE Last Analyzed       Stats   Stats
------------ -------------- ---------- ----------- ------------------- ------- -----
BASE_T1                              3           3 2010-12-30-17:51:34 YES     NO
BASE_T1      PART_1                  2           2 2010-12-30-17:50:53 YES     NO
BASE_T1      PART_2                  1           1 2010-12-30-17:48:27 YES     NO

USER_TAB_STATS_HISTORY:

TABLE_NAME   PARTITION_NAME STATS_UPDATE_TIME
------------ -------------- ---------------------------------------------------------------------------
BASE_T1      PART_1         30-DEC-10 17.43.39.320426 +00:00
BASE_T1      PART_2         30-DEC-10 17.43.39.320426 +00:00
BASE_T1                     30-DEC-10 17.43.39.360753 +00:00
BASE_T1      PART_1         30-DEC-10 17.46.02.331166 +00:00
BASE_T1                     30-DEC-10 17.46.43.939090 +00:00
BASE_T1      PART_2         30-DEC-10 17.48.27.926559 +00:00
BASE_T1                     30-DEC-10 17.49.09.144722 +00:00
BASE_T1      PART_1         30-DEC-10 17.50.53.818049 +00:00
BASE_T1                     30-DEC-10 17.51.34.915096 +00:00

USER_TAB_COL_STATISTICS:

                                                          Global
TABLE_NAME   COLUMN_NAME  SAMPLE_SIZE Last Analyzed       Stats   NUM_DISTINCT LOW_VALU HIGH_VAL
------------ ------------ ----------- ------------------- ------- ------------ -------- --------
BASE_T1      DAY_KEY                3 2010-12-30-17:51:34 YES                2 C102     C103
BASE_T1      STORE_NUM              3 2010-12-30-17:51:34 YES                2 C102     C103
BASE_T1      FACT_01                3 2010-12-30-17:51:34 YES                1 C10B     C10B

USER_PART_COL_STATISTICS:

                                                                         Global
TABLE_NAME   PARTITION_NAME COLUMN_NAME  SAMPLE_SIZE Last Analyzed       Stats   NUM_DISTINCT LOW_VALU HIGH_VAL
------------ -------------- ------------ ----------- ------------------- ------- ------------ -------- --------
BASE_T1      PART_1         DAY_KEY                2 2010-12-30-17:50:53 YES                1 C102     C102
BASE_T1      PART_1         STORE_NUM              2 2010-12-30-17:50:53 YES                2 C102     C103
BASE_T1      PART_1         FACT_01                2 2010-12-30-17:50:53 YES                1 C10B     C10B
BASE_T1      PART_2         DAY_KEY                1 2010-12-30-17:48:27 YES                1 C103     C103
BASE_T1      PART_2         STORE_NUM              1 2010-12-30-17:48:27 YES                1 C102     C102
BASE_T1      PART_2         FACT_01                1 2010-12-30-17:48:27 YES                1 C10B     C10B

Synopsis data:

TABLE_NAME   Part                           Column                         Analyse Time        Hash count
------------ ------------------------------ ------------------------------ ------------------- ----------
BASE_T1      PART_1                         FACT_01                        2010-12-30-17:50:53          1
BASE_T1      PART_2                         DAY_KEY                        2010-12-30-17:48:27          1
BASE_T1      PART_1                         STORE_NUM                      2010-12-30-17:50:53          2
BASE_T1      PART_2                         FACT_01                        2010-12-30-17:48:27          1
BASE_T1      PART_2                         STORE_NUM                      2010-12-30-17:48:27          1
BASE_T1      PART_1                         DAY_KEY                        2010-12-30-17:50:53          1

Why the worry?

My worry is that with wide tables and many partitions, the synopsis history could grow very large. In the same way that FTS of a big table to gather global stats the ‘old’ way is going to get slower as the size increases, is the same going to happen with incremental stats?

I’ve dug around on My Oracle Support but not hit any specific bugs on this.

I found a posting on OTN describing the same behaviour as I’ve found, but with the comment “Oracle […] just tell me that it is normal, expected behaviour and that if I don’t like it I should turn off Incremental global stats.”.

Doing some back-of-a-fag-packet maths with some of our tables would have the synopsis information growing at 150k rows per day for one table. We’ve quite a few wide & big tables, and unless we can convince our users to make do with no history ๐Ÿ˜‰ they’re going to keep growing.

Hopefully someone will see this and point out something I’ve not understood about the functionality, or missed in the documentation.
If not, then I’m going to do some larger-scale testing to try and put numbers around stats gather times using incremental vs non-incremental. If nothing else, to get a better understanding of how big this table could be growing.

I’d be really interested to know what other data warehousing sites on 11g do in terms of partitioned table statistics and whether they use incremental stats or not.

Comments most welcome, please! ๐Ÿ™‚

Advertisements

December 6, 2010

Adding OBIEE monitoring graphs into OAS

Filed under: hack, jmx, monitoring, OAS, obiee, unix — rmoff @ 21:30

Introduction

This is the third part of three detailed articles making up a mini-series about OBIEE monitoring. It demonstrates how to capture OBIEE performance information, and optionally graph it out and serve it through an auto-updating webpage.

This final article describes how to bolt on to OAS a simple web page hosting the graphs that you created in part 2, plotting data from OBIEE collected in part 1.

The webpage

This is just an old-school basic HTML page, with a meta-refresh tag (which note that Chrome doesn’t work with) and img tags:

<html>
<meta http-equiv="refresh" content="60">
<head>
<title>OBIEE Servers</title>
</head>
<body>
<img src="server01.png"></br>
<img src="server02.png"></br>
</body>
</html>

I shan’t patronise you nor embarrass myself with my rusty HTML skills any further – I’ll leave you how to build your dashboard how you want it.

OAS

This is hack! I am not an expert at Apache, so please don’t take my word for it that this is the best way to do it. It worked for me, but mightn’t for you.

If you’ve got OAS installed for your OBIEE installation, you can tweak it to serve up your new graphs too. If you’re using OC4J, IIS, or another webserver, then you’ll have to figure this bit out yourself.

Assuming that your OBIEE JMX graphs and HTML files are in /tmp/obieejmx, make sure that they’re readable by all:

chmod -R o+rx /tmp/obieejmx

Now go to your OAS folder, navigate to Apache/Apache/conf, and MAKE A BACKUP of httpd.conf

cd /your/path/to/OAS/here
cd Apache/Apache/conf
cp httpd.conf httpd.conf.bak

Open httpd.conf in vi (or if you’re not a real man then FTP the file to Windows and open it in Notepad ๐Ÿ˜‰ )

  1. Search for
    <IfModule mod_alias.c>
  2. Add the following beneath it:
        Alias /obieejmx/ "/tmp/obieejmx/"
        <Directory "/obieejmx/">
            AllowOverride None
            Options None
            Order allow,deny
            Allow from all
        </Directory>
    
    • Here’s where you’d change the location of your graphs and HTML file if you needed to
  3. Save httpd.conf
  4. Restart Apache
    opmnctl restartproc ias-component=HTTP_Server

    or if that doesn’t work restart OAS

    opmnctl shutdown
    opmnctl startall

Assuming you normally access OBIEE through http://myserver:7777/analytics/ then you should now be able to go to http://myserver:7777/obieejmx/ and view the fruits of your hard-earned work.

What next

Obviously, the gnuplot/OAS hack is a bit crude, but for me was the quickest way to get “to market” the power of the OBIEE systems management metric collection by jmx that is possible for anyone with some basic *nix skills and some time to put it together.

The BI Management Pack for Enterprise Manager probably provides some if not all of this functionality but isn’t always available to use (and also has licensing implications).

Whether you collect metrics for day-to-day monitoring of OBIEE, capacity planning, or investigative work, I’ve hopefully demonstrated how easy it is to work with once you’ve got the basics mastered. And the beauty of doing it with shell scripts is that you can customise it to your heart’s content.

There’s a whole bunch of analysis that I’d like to do now, around things like our registered user count vs logged on users vs active users (to determine what actually is our concurrent user rate), as well as profiling BI Server load against database load.

It would also be fun to develop the HTML just a little bit further to create a mock drill-down on the graphs, although if you’re anything like me be aware of “just tweaking for a minute” turning into far too long given then throwaway nature of the solution.

Finally, bear in mind this is now dated technology – some of it may be on the junk heap for OBI11g.

Charting OBIEE performance data with gnuplot

Filed under: gnuplot, jmx, monitoring, obiee, unix, visualisation — rmoff @ 21:30

Introduction

This is the second part of three detailed articles making up a mini-series about OBIEE monitoring. It demonstrates how to capture OBIEE performance information, and optionally graph it out and serve it through an auto-updating webpage.

This article takes data that part one showed you how to collect into a tab-separated file that looks something like this:

2010-11-29-14:48:18     1       0       11      0       3       2       1       676     340     0       53      1       0       41      0       3       0
2010-11-29-14:49:18     1       0       11      0       3       2       1       676     0       0       0       1       0       0       0       3       0
2010-11-29-14:50:18     2       0       16      1       4       3       1       679     0       0       0       1       0       0       0       4       0
2010-11-29-14:51:18     2       2       19      1       4       3       1       679     32      0       53      1       0       58      0       4       0
2010-11-29-14:52:18     2       1       19      1       4       3       4       682     0       0       0       1       0       0       0       4       0
2010-11-29-14:53:18     2       1       19      1       4       3       4       682     0       0       0       1       0       0       0       4       0
2010-11-29-14:54:18     2       0       19      1       4       3       1       682     0       0       0       1       0       0       0       4       0

and plot it into something like looks like this:

gnuplot

Depending on what you’re wanting to use the data for, and your level of comfort with command-line tools, you may just want to take your datafile and plot it with something like Excel for ad-hoc analysis. However, for fully automated data capture and rendering, I’d recommend giving gnuplot a go. It’s not the simplest of tools and the documentation is so vast as to be overwhelming – but if you’ve the time to tweak it then you will find it very flexible.

gnuplot can be run interactively (which is useful for trialling settings) or as part of batch script, and can output to screen or file (eg. png). It assumes that your data is tab separated (which it should be if you’ve used my script).

The key bit to understand is the “plot xxx using yyy” statement. In this, xxx is your data file, whilst yyy is the columns to plot for x and y data respectively. So assuming that your x-axis is time and the first column in your datafile, and you want to plot the number of Active Sessions (which in my script is the fourth metric, so the fifth column – time being the first), you would have “using 1:5”.

plot "datafile.jmx" using 1:5 with lines

To plot to a PNG file, use a script like this:

gnuplot <<EOF
set xdata time
set timefmt "%Y-%m-%d-%H:%M:%S"
set format x "%d %b\n%H:%M"
set yrange[0:30]
set grid
set key box outside below
set terminal png font "arial, 7" size 410,200 enhanced truecolor
set output "graph01.png"
set title "Oracle BI PS Sessions\nActive Sessions"
plot "datafile.jmx" using 1:5 with lines

This line invokes gnuplot, and <<EOF tells it to pass the subsequent lines to gnuplot until end of file (or you put EOF literal in the file).

gnuplot <<EOF

This sets the x-axis as a time axis, gnuplot then scales it appropriately. The timefmt defines the format of the time data in the input file.

set xdata time
set timefmt "%Y-%m-%d-%H:%M:%S"

The format of the axis labels on the x-axis is defined thus:

set format x "%d %b\n%H:%M"

This hard-codes the y-axis range. Use “set autoscale” to revert to the default of gnuplot setting the range.

set yrange[0:30]

This puts a grid on the chart, and defines the title

set grid
set title "Oracle BI PS Sessions\nActive Sessions"

This tells gnuplot to write to a png file, using arial 7pt font, 410×200 pixels, and higher resolution. The “set output” command defines the actual filename.

set terminal png font "arial, 7" size 410,200 enhanced truecolor
set output "graph01.png"

If you omit the “set terminal png […]” line you’ll get an interactive graph from gnuplot, which can be useful for testing different settings.

Plotting multiple datasets

You can visualise multiple datasets alongside each other easily in gnuplot. You might want to do this for related metrics (eg Active Sessions vs Current Sessions), or a clustered OBIEE deployment:

The same metric on two servers in a load-balanced OBIEE deployment

Related metrics from a single server

To do this you append multiple statements to the “plot” command, separated by a comma:

plot "datafile_server01.jmx" using 1:4 with lines title "Server 01"\
, "datafile_server02.jmx" using 1:4 with lines title "Server 02"

(the \ is a line-continuation character)

The title for each dataset is defined as part of the statement as is seen above. This can be shown in a key which is enabled by default and can be moved to the bottom of the graph by using:

set key box outside below

Generating gnuplot scripts automagically

You may have realised by now that this is all quite fiddly to set up. Here are two scripts that will help generate gnuplot scripts. It’s based on the tcl script from part one and uses the first item in the array, ID, to determine the column number that a metric will appear in in the datafile.

This first one writes a series of gnuplot statements to plot each metric out on an individual graph, and assumes you have a two-server cluster so plots the datafiles from two servers on the same graph.

cat gnuplot_header
grep lappend obieejmx_server01.tcl|\
sed s/{//g|\
sed s/}//g|\
sed s/"lappend mbeanattrs "//g|\
sed s/"Oracle BI Management:Oracle BI=Performance,AppName="//g|\
sed s/"Oracle BI Server,"/"BI"/g|\
sed s/"name=Oracle BI General"/""/g|\
sed s/"type=Oracle BI Physical DB,name=Oracle Data Warehouse"/"DB"/g|\
sed s/"type=Oracle BI DB Connection Pool,name=Star_Oracle Data Warehouse_Oracle Data Warehouse Connection Pool"/"Conn Pool"/g|\
sed s/"Oracle BI Presentation Server,name=Oracle BI PS "/"BIPS"/g|\
sed s/"Connection"/"Conn"/g|\
sed s/"Current"/"Curr"/g|\
sed s/"Sessions"/"Sess"/g|\
awk 'FS="\t" {
        $1+=1
        print "set output \""$1".png\""
        print "set title \"" $2 "\\n" $3 "\""
        print "plot \"server01.jmx\" using 1:" $1 " with lines title \"server01\",\"server02.jmx\" using 1:" $1 " with lines title \"server02\""
}'

gnuplot_header is a file you need to create in the same directory, and can just be a copy of the example gnuplot statement above.

This script plots multiple metrics for a single server onto the same graph. You may want to break the statements up into separate graph plots to group together related metrics or ones with similar ranges – this script will at least give you the correct column and titles to get you started.

cat gnuplot_header
grep lappend obieejmx_server01.tcl|\
sed s/{//g|\
sed s/}//g|\
sed s/"lappend mbeanattrs "//g|\
sed s/"Oracle BI Management:Oracle BI=Performance,AppName="//g|\
sed s/"Oracle BI Server,"/"BI"/g|\
sed s/"name=Oracle BI General"/""/g|\
sed s/"type=Oracle BI Physical DB,name=Oracle Data Warehouse"/"DB"/g|\
sed s/"type=Oracle BI DB Connection Pool,name=Star_Oracle Data Warehouse_Oracle Data Warehouse Connection Pool"/"Conn Pool"/g|\
sed s/"Oracle BI Presentation Server,name=Oracle BI PS "/"BIPS"/g|\
sed s/"Connection"/"Conn"/g|\
sed s/"Current"/"Curr"/g|\
sed s/"Sessions"/"Sess"/g|\
awk 'BEGIN {
        FS="\t"
        print "set output \"server01.png\""
        print "set title \"server01\""
        printf "plot "
     }
     {
        $1+=1
        if (NR>1) {printf "\\\n,"}
        printf " \"server01.jmx\" using 1:" $1 " with lines title \"" $2 ":" $3 "\""
     }
     END {print ""}'

Both scripts use a series of sed statements to try and trim down the MBean names into something shorter for plotting on the graph key.

An example output would look something like this:

# One metric per graph, across two servers
gnuplot --persist <<EOF
set xdata time
set timefmt "%Y-%m-%d-%H:%M:%S"
set format x "%d %b\n%H:%M"
set grid
set key box outside below
set terminal png font "arial, 7" size 410,200 enhanced truecolor
set yrange [0:30]
set output "2.obiee.png"
set title "Oracle BI Presentation Server - Oracle BI PS Connection Pool\nCurrent Open Connections"
plot "server01.jmx" using 1:2 with lines title "server01","server02.jmx" using 1:2 with lines title "server02"
set output "3.obiee.png"
set title "Oracle BI Presentation Server - Oracle BI PS Query Cache\nCurrent Running Queries"
plot "server01.jmx" using 1:3 with lines title "server01","server02.jmx" using 1:3 with lines title "server02"
set output "5.obiee.png"
set title "Oracle BI Presentation Server - Oracle BI PS Sessions\nActive Sessions"
plot "server01.jmx" using 1:5 with lines title "server01","server02.jmx" using 1:5 with lines title "server02"
set output "6.obiee.png"
set title "Oracle BI Presentation Server - Oracle BI PS Sessions\nCurrent Sessions"
plot "server01.jmx" using 1:6 with lines title "server01","server02.jmx" using 1:6 with lines title "server02"
set output "7.obiee.png"
set title "Oracle BI Presentation Server - Oracle BI PS Sessions\nSessions Logged On"
plot "server01.jmx" using 1:7 with lines title "server01","server02.jmx" using 1:7 with lines title "server02"
set output "8.obiee.png"
# Single server, multiple metrics on one graph
gnuplot --persist <<EOF
set xdata time
set timefmt "%Y-%m-%d-%H:%M:%S"
set format x "%d %b\n%H:%M"
set grid
set key box outside right
set terminal png font "arial,9" size 1224,500 enhanced truecolor

set output "server01.png"
set title "server01"
set yrange[0:30]
plot  "server01.jmx" using 1:2 with lines title "BIPSConn Pool:Curr Open Conns"\
, "server01.jmx" using 1:3 with lines title "BIPSQuery Cache:Curr Running Queries"\
, "server01.jmx" using 1:5 with lines title "BIPSSess:Active Sess"\
, "server01.jmx" using 1:6 with lines title "BIPSSess:Curr Sess"\
, "server01.jmx" using 1:7 with lines title "BIPSSess:Sess Logged On"\
, "server01.jmx" using 1:8 with lines title "BIConn Pool:Curr Busy Conn Count"\
, "server01.jmx" using 1:13 with lines title "BI:Active Execute Requests"\
, "server01.jmx" using 1:14 with lines title "BI:Active Fetch Requests"\
, "server01.jmx" using 1:17 with lines title "BI:Total sessions"

Plotting subsets of data

You may well want to plot out subsets of the graph data, for example, the last 24 hours. This is simple to do, just run the jmx datafile through something like tail first:

# Create 24 hour extract of data
# Data is sampled every minute, so the last 24 hours will be 60*24 = 1440
tail -n 1440 server01.jmx > server01.24hr.jmx

Refreshing your graphs automagically

Assuming you’ve got your graph plotting script(s) in separate shell file(s), create a wrapper like this:

# Uncomment this for debug, to echo each statement as it's executed
# set -x
#
# This will loop for ever and ever
# (or until one becomes less than two)
while [ 1 -lt 2 ]
do
        # create any subset data files, using tail, as described above 
        ./create_6hr_logs.sh
        ./create_24hr_logs.sh
        # plot your graph(s)
        ./plot_by_server.sh
        ./plot_by_server_summary.sh
        ./plot_by_server_6hr.sh
        ./plot_by_server_6hr_summary.sh
        ./plot_6hr_graphs.sh
        ./plot_24hr_graphs.sh
        # Copy all output png files to a tmp folder
        # from where they'll be served up as a web page
        # (I told you it was a hack)
        cp *.png /tmp/obieejmx
        chmod g+r /tmp/obieejmx/*.png
        #
        # Write the current time, for debug purposes
        date
        # Wait for a minute
        sleep 60
done

And then run your wrapper in the same way as the metric collection, using nohup and &

nohup ./run_graphs_forever.sh &

What next?

So you’ve got your data … you’ve got your graphs … now why not serve them up through a convenient webpage that refreshes automatically? Click here for part three that shows you how to bolt this onto an existing OAS installation.

Collecting OBIEE systems management data with jmx

Filed under: jmx, mbeans, monitoring, obiee, systemsmanagement — rmoff @ 21:30

Introduction

This is the first part of three detailed articles making up a mini-series about OBIEE monitoring. It demonstrates how to capture OBIEE performance information, and optionally graph it out and serve it through an auto-updating webpage.

For some background on OBIEE’s Systems Management component, along with JMX and MBeans, see here and here. The following assumes you know your mbeans from coffee beans and jmx from a bmx.

The metric collection is built around the jmxsh tool. This is similar to jmxterm and both provide command-line access to jmx. Once it’s commandline, it’s scriptable ๐Ÿ™‚

This was developed and works on both Windows (through cygwin) and HP-UX.

The Script

jmxsh uses the a scripting language based on tcl, and with a bit of trial-and-error I developed the following script, obieejmx.tcl. It connects to a remote BI server, authenticates to the JMX agent, and then periodically polls the jmx counters and writes the values to a tab-separated file. It runs until you cancel it.

# obieejmx.tcl
# OBIEE JMX collector
# https://rnm1978.wordpress.com
#

# Set connection details
set host my-remote-server
set port 9980
set user jmx-user-name
set pw my-jmx-password

# Define the counters (userdefined ID / MBean / Attribute list, tab separated)
set mbeanattrs [list]
lappend mbeanattrs {1   Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS Connection Pool      Current Open Connections}
lappend mbeanattrs {2   Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS Query Cache  Current Running Queries}
lappend mbeanattrs {3   Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS Query Cache  Current Cache Entries}
lappend mbeanattrs {4   Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS Sessions     Active Sessions}
lappend mbeanattrs {5   Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS Sessions     Current Sessions}
lappend mbeanattrs {6   Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS Sessions     Sessions Logged On}
lappend mbeanattrs {7   Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Server,type=Oracle BI DB Connection Pool,name=Star_Oracle Data Warehouse_Oracle Data Warehouse Connection Pool     Current Busy Connection Count}
lappend mbeanattrs {8   Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Server,type=Oracle BI DB Connection Pool,name=Star_Oracle Data Warehouse_Oracle Data Warehouse Connection Pool     Current Connection Count}
lappend mbeanattrs {9   Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Server,type=Oracle BI Physical DB,name=Oracle Data Warehouse       KiloBytes/sec}
lappend mbeanattrs {10  Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Server,type=Oracle BI Physical DB,name=Oracle Data Warehouse       Queries/sec}
lappend mbeanattrs {11  Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Server,type=Oracle BI Physical DB,name=Oracle Data Warehouse       Rows/sec}
lappend mbeanattrs {12  Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Server,name=Oracle BI General      Active Execute Requests}
lappend mbeanattrs {13  Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Server,name=Oracle BI General      Active Fetch Requests}
lappend mbeanattrs {14  Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Server,name=Oracle BI General      Avg. query elapsed time}
lappend mbeanattrs {15  Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Server,name=Oracle BI General      Queries/sec}
lappend mbeanattrs {16  Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Server,name=Oracle BI General      Total sessions}
lappend mbeanattrs {17  Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Server,name=Oracle BI General      Succeeded Logins Ratio as %}

# Write the header
puts -nonewline "Timestamp"
foreach {mbeanattr} $mbeanattrs {
        # Get the mbean & attribute
        set parts [split $mbeanattr "\t"]
        set mbean [lindex $parts 1]
        set attr [lindex $parts 2]
        puts -nonewline "\t$mbean $attr"
}
puts ""

# Connect to the host
jmx_connect -h $host -p $port -U $user -P $pw

# Get the values
while {0 == 0} {
        puts -nonewline "[clock format [clock seconds] -format %Y-%m-%d-%H:%M:%S]"
        foreach {mbeanattr} $mbeanattrs {
                # Get the mbean & attribute
                set parts [split $mbeanattr "\t"]
                set mbean [lindex $parts 1]
                set attr [lindex $parts 2]

                #Uncomment for debug:
                #puts "---"
                #puts "$mbean $attr"

                # Get the metric and write to output
                puts -nonewline "\t[ jmx_get -m $mbean $attr]"

        }
        # Flush the output buffer line
        puts ""
        # Sleep for 60 seconds
        after 60000
}

You may want to tweak the polling frequency, depending on the metrics that you’re collecting and the purpose of them. For building up a general picture of system usage (active sessions, etc), then every minute – or greater – should be sufficient. For other metrics which record rates per second (eg “New Sessions / sec”) then you may well want to sample more frequently.

You invoke this via jmxsh (download link) using the following syntax:

$java -jar jmxsh-R5.jar obieejmx.tcl

or if you’ve setup the (ambiguously-named) file jmxsh, you can use this:

$./jmxsh obieejmx.tcl

It will write the counter values to stdout, so capture it to file using

./jmxsh obieejmx.tcl >> results.jmx 

To run it continually as a background process, use nohup (so it doesn’t die when you logoff) and & (to run it in the background):

# Run collector indefinitely
nohup ./jmxsh obieejmx.tcl >> results.jmx &

The output you’ll get will look like this:

2010-11-29-14:41:17     5       0       38      1       12      9       0       614     0       0       0       0       0       0       0       3       0
2010-11-29-14:42:17     5       0       33      1       12      9       0       614     0       0       0       0       0       0       0       3       0
2010-11-29-14:43:17     5       0       33      1       12      9       0       614     0       0       0       0       0       0       0       3       0
2010-11-29-14:44:17     5       0       33      1       12      9       0       614     0       0       0       0       0       0       0       3       0

To stop the collector running, you’ll need to find the process

$ps -ef|jmxsh 
userid 14695     1  2  Nov 29  ?         5:12 /opt/java6/bin/IA64N/java -jar ./jmxsh-R5.jar obieejmx.tcl

and then kill it

kill 14695

Defining the counters

You’ll have noticed in my script that I define an array of counter names. You can get a list of all the counters in various ways including through Presentation Services (saw.dll?perfmon), OAS/OC4J, or JConsole. My personal preference is using Presentation Services (saw.dll?perfmon) as it gives the list nice and neatly and with an explanation of each counter.
Once you’ve decided which you want to collect, you need to use jmxsh again to get the correct format. Counters are defined as Attributes (eg Current Open Connections) within MBeans (eg Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS Connection Pool). Different jmx interfaces seem to label the MBean in a different format, for example:

jmxsh:  Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS Sessions
jmxterm:Oracle BI Management:AppName=Oracle BI Presentation Server,Oracle BI=Performance,name=Oracle BI PS Sessions

(spot the order in which “Oracle BI=” and “AppName=” are listed)
So for using jmxsh in your script, use jmxsh to get the exact MBean names otherwise you’ll spend a long time tearing your hair out wondering why it’s not working!
To get jmxsh to list the MBeans, you use it in the ‘browse’ mode. First off, run jmxsh and connect to your OBIEE server using the jmx_connect command


$ java -jar jmxsh-R5.jar
jmxsh v1.0, Tue Jan 22 16:23:12 GMT 2008


Type 'help' for help.  Give the option '-?' to any command
for usage help.


Starting up in shell mode.
% jmx_connect -h myserver -p 9980 -U myjmxuser
Password: ********
Connected to service:jmx:rmi:///jndi/rmi://myserver:9980/jmxrmi.

Hit enter at this point and it’ll switch to browse mode, and list out the MBean Domains.


%
Entering browse mode.
====================================================


 Available Domains:


       1. java.util.logging
       2. JMImplementation
       3. Oracle BI Management
       4. java.lang


  SERVER: service:jmx:rmi:///jndi/rmi://myserver:9980/jmxrmi

Select Oracle BI Management

====================================================
Select a domain: 3
====================================================

This lists all the MBeans within Oracle BI Management – there’s a lot!


 Available MBeans:


       1. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS Chart Engine
       2. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Server,type=Oracle BI DB Connection Pool,name=Star_Oracle Data Warehouse_Oracle Data Warehouse Writeback Connection Pool
       3. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,type=Oracle BI PS Thread Pools,name=TaskScheduler
       4. Oracle BI Management:Oracle BI=Configuration,type=Presentation Server Configuration,name=Query
       5. Oracle BI Management:Oracle BI=Configuration,type=Presentation Server Configuration,name=AsyncLogon[ThreadPoolDefaults]
       6. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS Query Cache
[...]
     134. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS Sessions
[...]

You can filter by defining a glob filter by typing your search term at the “Select an mbean:” prompt. For example:

====================================================
Select an mbean: Performance
====================================================

 Available MBeans:

       1. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS Chart Engine
       2. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Server,type=Oracle BI DB Connection Pool,name=Star_Oracle Data Warehouse_Oracle Data Ware
house Writeback Connection Pool
       3. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,type=Oracle BI PS Thread Pools,name=TaskScheduler
       4. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS Query Cache
[...]
      53. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Server,type=Oracle BI Generic Cache,name=Star_DrillDownQuery_Cache

  SERVER: service:jmx:rmi:///jndi/rmi://myserver:9980/jmxrmi
  DOMAIN: Oracle BI Management
  GLOB:   *Performance* (space to clear)

This shows just MBeans with Performance in the name. Alternatively use a wildcard within the glob:

Select an mbean: Performance*Cache
====================================================

 Available MBeans:

       1. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS Query Cache
       2. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS Catalog XML Cache
       3. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Server,type=Oracle BI Generic Cache,name=Star_DrillDownInfo_Cache
       4. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS Security Manager Account Cache
       5. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Server,type=Oracle BI Generic Cache,name=Star_LDAP_Cache
       6. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS Catalog Attribute Cache
       7. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS Security Manager Account Memberships Cache
       8. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Server,type=Oracle BI Generic Cache,name=Star_Plan_Cache
       9. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Server,type=Oracle BI Generic Cache,name=Star_ColumnAggrInfo_Cache
      10. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Server,type=Oracle BI Generic Cache,name=Star_RowWiseInit_Cache
      11. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS DXE Cache
      12. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS Cube Cache
      13. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Server,name=Oracle BI Data Cache
      14. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS XML Document Caches
      15. Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Server,type=Oracle BI Generic Cache,name=Star_DrillDownQuery_Cache

  SERVER: service:jmx:rmi:///jndi/rmi://myserver:9980/jmxrmi
  DOMAIN: Oracle BI Management
  GLOB:   *Performance.*Cache* (space to clear)

If you use globs, remember to clear them by typing space and then enter, otherwise when you list Attributes you won’t see any which don’t also match your filter.

To view the Attributes for an MBean, enter the MBean’s number:

====================================================
Select an mbean: 134
====================================================

 Attribute List:

       1. -r- Integer     Current Sessions
       2. -r- Integer     Maximum Sessions
       3. -r- Integer     Sessions Logged On
       4. -r- Integer     Maximum Logged On
       5. -r- Integer     Current Embryonic Sessions
       6. -r- Integer     Maximum Embryonic Sessions
       7. -r- Integer     Active Sessions
       8. -r- Integer     Maximum Active Sessions
       9. -r- Integer     Total Lifetime Sessions
      10. -r- Integer     New Sessions / sec
      11. -r- Integer     Total Failed Logons
      12. -r- Integer     Failed Logons/sec
      13. -r- Integer     Total Logons
      14. -r- Integer     New Logons/Sec


  SERVER: service:jmx:rmi:///jndi/rmi://myserver:9980/jmxrmi
  DOMAIN: Oracle BI Management
  MBEAN:  Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS Sessions

Once you’ve chosen your MBeans and Attributes, you can incorporate them into the obieejmx.tcl script by adding additional lappend lines. The format is:

lappend mbeanattrs {<ID><tab><mbean><tab><attribute>}

ID is just a number used later on in the process, it can be whatever you like. Make sure the three values are tab-separated.
An example line is:

lappend mbeanattrs {17  Oracle BI Management:Oracle BI=Performance,AppName=Oracle BI Presentation Server,name=Oracle BI PS Sessions      Current Embryonic Sessions}

If you get the error “Cannot convert result to a string” then check your MBean and Attribute names, normally this error means it can’t find what you’ve asked for. Also check that the array member definitions (lappend) are tab separated, not just space separated.

Where next?

Now you’ve got the data, do something with it! See charting OBIEE performance data with gnuplot.

OBIEE monitoring

Filed under: hack, jmx, mbeans, monitoring, obiee, systemsmanagement — rmoff @ 21:29

Those of you who read my blog regularly may have noticed I have a slight obsession with the OBIEE systems management capability which is exposed through JMX. Venkat has blogged this week about JMX in OBI11g, and it’s clearly a technology worth understanding properly.
I’ve recently been tinkering with how to make use of it for monitoring purposes, most recently using JConsole and discussed here. What follows is an extension of this idea, cobbled together with a bit of shell scripting, awk, gnuplot, and sticky backed plastic. It’s built on OBIEE 10g – for OBI11g it may differ (although I understand that Performance MBeans still exist).

Whether you collect metrics for day-to-day monitoring of OBIEE, capacity planning, or investigative work, it’s valuable data (in my humble opinion) that will help you understand the usage of the application by the users that you support.

To whet your appetite, here’s a sample of what you can produce, in realtime:

Performance metrics from a two-server OBIEE cluster

Performance data related to Sessions in BI and Presentation Services

Before you start this, I recommend reading how to secure your jmx agent if you’re working with production systems.

Overview

There are three parts to my monitoring application, and you can pretty much pick and mix as you want. Obviously without any data collected then graphing it will be pretty dull, but you may opt to collect the data and then work with it another way (Excel, OBIEE, etc).
I’ve broken the details down into three separate blog posts:

  1. Metric collection from a remote BI Server, using jmxsh
  2. Graph rendering of the collected data, using gnuplot
  3. Web page serving of the rendered graphs, bolted onto the OAS already in place for Presentation Services.

December 3, 2010

OBIEE 10g – javahost hang

Filed under: javahost, obiee, sawserver — rmoff @ 14:06

Hot on the heels of one problem, another has just reared its head.

Users started reporting an error with reports that included charts:

Chart server does not appear to be responding in a timely fashion. It may be under heavy load or unavailable.

Set up is a OBIEE 10.1.3.4.1 two-server deployment with BI/PS/Javahost clustered and loadbalanced throughout.

Diagnostics

Javahost was running, and listening, on both servers:

$ps -ef|grep javahost
obieeadm 14076     1  0  Nov 25  ?         9:23 /app/oracle/product/OracleAS_1/jdk/bin/IA64N/java -server -classpath /app/oracle/product/obiee/web/javahost/lib/core/sautils.ja
$netstat -a|grep 9810|grep LISTEN
tcp        0      0  *.9810                 *.*                     LISTEN

In Javahost log file on both servers there were these errors reported, but since javahost had started over a week ago:

Nov 30, 2010 8:08:36 AM MessageProcessorImpl processMessage
WARNING: Unexpected exception. Connection will be closed
java.io.EOFException
        at com.siebel.analytics.web.sawconnect.sawprotocol.SAWProtocol.readInt(SAWProtocol.java:167)
        at com.siebel.analytics.javahost.MessageProcessorImpl.processMessage(MessageProcessorImpl.java:133)
        at com.siebel.analytics.javahost.Listener$Job.run(Listener.java:223)
        at com.siebel.analytics.javahost.standalone.SAJobManagerImpl.threadMain(SAJobManagerImpl.java:205)
        at com.siebel.analytics.javahost.standalone.SAJobManagerImpl$1.run(SAJobManagerImpl.java:153)
        at java.lang.Thread.run(Thread.java:595)

Charts are written to a temp folder, but none have been written since yesterday afternoon:

$ls -lrt /data/bi/tmp/sawcharts/ |tail -n 2
-rw-r-----   1 obieeadm   biadmin      13611 Dec  2 16:30 saw4cee1a27-7.tmp
-rw-r-----   1 obieeadm   biadmin          0 Dec  2 16:31 saw4cee1a27-32.tmp

$ls -lrt /data/bi/tmp/sawcharts/ |tail -n 2
-rw-r-----   1 obieeadm   biadmin       7454 Dec  2 15:25 saw4cee219b-1.tmp
-rw-r-----   1 obieeadm   biadmin          0 Dec  2 15:28 saw4cee219b-6.tmp

First time the error was seen: (from sawserver.out.log)

server01: Fri Dec  3 09:40:23 2010
server02: Thu Dec  2 15:44:38 2010

Resolution

It looked like javahost was up, but not responding to requests — which is pretty much what the error message said on the tin. The solution was that of many a computer problem – turn it off and turn it back on again.

Since the rest of the (production!) OBIEE service was up and in use, I didn’t want to use the normal shutdown script run-saw.sh as this would also kill Presentation Services. Therefore I extracted the following from run-saw.sh and ran it manually on server01:

set +u
ANA_INSTALL_DIR=/app/oracle/product/obiee
. ${ANA_INSTALL_DIR}/setup/common.sh
./shutdown.sh -service

This successfully killed javahost. I restarted it using :

nohup ./run.sh -service >> /data/bi/web/log/javahost.out.log 2>&1 &

But – the error remained when I refreshed the reports (on both servers).

I then killed javahost on server02 using the same method. At this point, Charts started working again. Presumably Presentation Services had been using javahost on server02 and not recognising it had hung saw no reason to switch to javahost on server01. Once it was killed on server02 it switched and thus started working again.
To complete the work I restarted javahost on server02.

Investigation

The only hit on MOS and Google I found was this: OBIEE Chart Server Error When Showing Charts (Doc ID 944139.1) which details some parameters to tweak, although more to do with javahost being busy (which it wasn’t in this case).

December 2, 2010

Troubleshooting OBIEE – LDAP (ADSI) authentication

Filed under: obiee, unix — rmoff @ 17:52

They say about travelling that it’s the journey and not the destination, and the same is true with this problem we hit during a deployment to Production.

We were deploying a new OBIEE 10g implementation, with authentication provided by Microsoft Active Directory (AD) through the LDAP functionality in OBIEE. As a side note, it’s a rather nice way to do authentication, although maybe I’m biased coming from our previous implementation which used EBS integrated authentication and was a bugger to set up and work with.

The Error

Testing of the LDAP authentication worked fine, but when we moved to Production we started hitting this error, intermittently. When we did hit the error, we saw the OBIEE “Logging On…” screen for a while before the error.

     [nQSError: 13011] Query for Initialization Block 'LDAP User Access Initialization Block' has failed.
     [53003] LDAP bind failure: Can't contact LDAP server.

Sporadically, it would all start working, until BI Server was next bounced, at which point it often stopped again.

The Cause

I’ll jump ahead to the answer here in case you’re not interested in the diagnostics.

Our Active Directory domain has multiple Domain Controllers (DCs). The Active Directory server defined in the LDAP connection is a DNS entry with multiple IPs, for the different DCs (for resilience). The IPs are returned in round-robin fasion, not load-balanced. For some reason one of the IPs returned by DNS lookup was not valid.

What was happening was that OBIEE was sporadically hitting the duff IP from the resolution of the AD server address. It would timeout after failing to connect to the duff IP, which is when we would get the “LDAP bind failure: Can’t contact LDAP server” error. An SR we raised confirmed that OBIEE won’t retry a failed connection, otherwise I guess we might have moved on to the next (valid) IP.

To workaround this until DNS could be fixed, we hardcoded (yuck)the AD server name as a specific IP rather than DNS entry, and crossed our fingers that the server doesn’t go pop ๐Ÿ™‚
A permanent solution would be to specify multiple LDAP servers in the RPD.

Diagnostics

(OBIEE 10.1.3.4.1, HP-UX Itanium 11.31)
Starting off with the most obvious, and then getting more and more detailed as the problem still couldn’t be resolved (pun intended).

  • Opening up the RPD in online mode and testing the LDAP Init Block and LDAP Server connection from within the Admin Tool both worked, proving that the connectivity from the BI Server to the LDAP server was not the problem
  • Pinging the AD server name from the BI Server, worked

As well as proving the connection between BI Server and LDAP, we checked the LDAP server. As our corporate AD, any problems with it would be well known before little BI got involved.

Next up we used ldapsearch to prove connectivity and valid LDAP credentials with the AD server:

$whereis ldapsearch
ldapsearch: /opt/ldapux/bin/ldapsearch
$/opt/ldapux/bin/ldapsearch -v -b "dc=mycompany, dc=co, dc=uk" -h adserver.co.uk -p 389 -D "cn=SVC_OBIEE, ou=service accounts, ou=service management, dc=mycompany, dc=co, dc=uk" -w Password "(sAMAccountName=testuser)" sAMAccountName

This should return output from the LDAP server. You can fiddle with the latter parameters to get different information out of LDAP – use the -help flag if you want to know more.

ldapsearch: started Thu Dec  2 12:33:18 2010

ldap_init( adserver.co.uk, 389 )
filter pattern: (sAMAccountName=testuser)
returning: sAMAccountName
filter is: (sAMAccountName=testuser)
version: 1
dn: CN=Fred Bloggs,OU=Users,OU=MyCompany,OU=Data Management,DC=mycompany,
 DC=co,DC=uk
sAMAccountName: TESTUSER
1 matches
$

From here, I set about building a test case. Test cases serve a dual purpose – they give the support team something to work with and reproduce the failure, but they also enforce a strictness in method which often reveals the problem itself. In an ideal BAAG world everything would only be approached in a methodical manner. However there is that no-man’s land between discovering a problem and beginning to properly diagnose it – the trick is to make the time in no-man’s land as short as possible (lest one gets shot down in a hail of irrational guesses and uncontrolled system changes).

The problem so far was one seen when a login to BI Server via Presentation Services was attempted, which by its nature is a GUI work pattern. To make it reproducible I ideally wanted a command line solution that could be run on demand to generate the failure. I dug out nqcmd, which executes commands against the BI Server directly and is fully scriptable. This had the added benefit of removing Presentation Services from consideration, which assuming the problem could be reproduced with nqcmd would be one less complex factor to debug.

You’ll find nqcmd in BIServer/server/Bin (or Bin64). On Unix don’t forget to set the environment paths first with sa-init.sh (or sa-init64.sh).

First I created a file q1.lsql with some valid Logical SQL (what Presentation Presentation Services fires at BI Server), which can then be called by nqcmd and executed:

SELECT Organisation."Store Name" saw_0 FROM "Loss Prevention" ORDER BY saw_0

Initially I ran nqcmd using Administrator login, to remove LDAP from consideration and prove everything else worked:

$nqcmd -d AnalyticsWeb64 -u Administrator -p adminPW -s q1.lsql -q

-------------------------------------------------------------------------------
          Oracle BI Server
          Copyright (c) 1997-2009 Oracle Corporation, All rights reserved
-------------------------------------------------------------------------------

[...]
Row count: 437
---------------------------------------------------------------------------------------------------------[...]
Processed: 1 queries

Next I tried to use my user ID to test the LDAP connectivity.

$date;nqcmd -d AnalyticsWeb64 -u myuserID -p RightPW -s q1.lsql -q;date
Tue Nov 23 08:42:33 GMT 2010

-------------------------------------------------------------------------------
          Oracle BI Server
          Copyright (c) 1997-2009 Oracle Corporation, All rights reserved
-------------------------------------------------------------------------------

[10018][State: 08004] [NQODBC] [SQL_STATE: 08004] [nQSError: 10018] Access for the requested connection is refused.
[53003] LDAP bind failure: Can't contact LDAP server.
Connect open failed

Connection open failed:
[10018][State: 08004] [NQODBC] [SQL_STATE: 08004] [nQSError: 10018] Access for the requested connection is refused.
[53003] LDAP bind failure: Can't contact LDAP server.
Connection open failed
Tue Nov 23 08:43:51 GMT 2010

By prefixing and suffixing the call to nqcmd with “date” and I record in my test case the time spend on nqcmd. In this case (lines 2 and 17 above) you can see it takes over a minute for BI Server to fail the login.

 

To rule out an error in parsing what the LDAP server returned on a successful connection, I tried it with the wrong password, but hit the same error:

$date;nqcmd -d AnalyticsWeb64 -u myuserID -p wrongPW -s q1.lsql -q
Tue Nov 23 08:12:38 GMT 2010

-------------------------------------------------------------------------------
          Oracle BI Server
          Copyright (c) 1997-2009 Oracle Corporation, All rights reserved
-------------------------------------------------------------------------------

date
[10018][State: 08004] [NQODBC] [SQL_STATE: 08004] [nQSError: 10018] Access for the requested connection is refused.
[53003] LDAP bind failure: Can't contact LDAP server.
Connect open failed

Connection open failed:
[10018][State: 08004] [NQODBC] [SQL_STATE: 08004] [nQSError: 10018] Access for the requested connection is refused.
[53003] LDAP bind failure: Can't contact LDAP server.
Connection open failed
$date
Tue Nov 23 08:13:55 GMT 2010

My next attempt with the right password crashed the BI Server process, and on restart LDAP authentication worked!

$date;nqcmd -d AnalyticsWeb64 -u myuserID -p RightPW -s q1.lsql -q;date
Tue Nov 23 08:44:01 GMT 2010

[...]
Row count: 437
[...]
Processed: 1 queries
Tue Nov 23 08:44:03 GMT 2010

I also saw the error messages I’d expect with an invalid password or username:

$date;nqcmd -d AnalyticsWeb64 -u myuserID -p WrongPW -s q1.lsql -q;date
Tue Nov 23 08:50:37 GMT 2010

-------------------------------------------------------------------------------
          Oracle BI Server
          Copyright (c) 1997-2009 Oracle Corporation, All rights reserved
-------------------------------------------------------------------------------

[10018][State: 08004] [NQODBC] [SQL_STATE: 08004] [nQSError: 10018] Access for the requested connection is refused.
[53012] User authentication failure: myuserID.
Connect open failed

Connection open failed:
[10018][State: 08004] [NQODBC] [SQL_STATE: 08004] [nQSError: 10018] Access for the requested connection is refused.
[53012] User authentication failure: myuserID.
Connection open failed
Tue Nov 23 08:50:37 GMT 2010
$date;nqcmd -d AnalyticsWeb64 -u badgerbadgermyuserID -p WrongPW -s q1.lsql -q;date
Tue Nov 23 08:51:20 GMT 2010

-------------------------------------------------------------------------------
          Oracle BI Server
          Copyright (c) 1997-2009 Oracle Corporation, All rights reserved
-------------------------------------------------------------------------------

[10018][State: 08004] [NQODBC] [SQL_STATE: 08004] [nQSError: 10018] Access for the requested connection is refused.
[53012] User authentication failure: badgerbadgermyuserID.
Connect open failed

Connection open failed:
[10018][State: 08004] [NQODBC] [SQL_STATE: 08004] [nQSError: 10018] Access for the requested connection is refused.
[53012] User authentication failure: badgerbadgermyuserID.
Connection open failed
Tue Nov 23 08:51:20 GMT 2010

Having already run a ping from the BI Server to LDAP server, I had no reason to run it again so I thought, but did anyway:

$/usr/sbin/ping mycompany.co.uk -n 1
PING 10.3.5.0: 64 byte packets

----10.3.5.0 PING Statistics----
1 packets transmitted, 0 packets received, 100% packet loss

100% packet loss. uh oh.

 

From here I did an nslookup up of the AD server name that we were using, and pinged each of the addresses returned – one of them was the one I happened to hit above, and was dead. The rest I’ve explained above under “The Cause”.

Geeky footnote

With that wonderful thing known as hindsight, I dug too deep too quickly, as I should have understood more around our AD server and what its name resolved to. Instead, I got stuck in to some tusc tracing:

# Determine the process ID of nqserver, and attach tusc to it
echo "Starting tusc"
ps -ef|grep nqsserver|grep -v grep|awk '{print $2}'|xargs tusc -aDeEfT "%F-%H:%M:%S" -Ao ~/tusc.`hostname`.out &

This writes acres of wonderful gobbledegook like this:

time(0x1ffffffff3abf628) ...............................
gettimeofday(0x1ffffffff3abf600, NULL) .................
write(6, "2 0 1 0 - 1 1 - 2 3   0 7 : 4 3 ".., 131) ....
send(20, "~ \0\0\002\0\0\0\0  \0\001\0\0\0".., 134, 0) .
gettimeofday(0x1ffffffff3ac0e10, NULL) .................
recv(20, "\f\0\0\0! ' \0\0", 8, 0) .....................
recv(20, "01\0\0\0U 17h ) \0\0- q ", 12, 0) ............
gettimeofday(0x1ffffffff3ac0e10, NULL) .................
send(20, "\b\0\0\002\0\0\0\0\0\0\0\0\0\0\0", 16, 0) ....
gettimeofday(0x1ffffffff3ac0e10, NULL) .................
recv(20, "\f\0\0\011' \0\0", 8, 0) .....................
recv(20, "01\0\0\0U 17h ) \0\0- q ", 12, 0) ............
gettimeofday(0x1ffffffff3ac0e10, NULL) .................
send(20, "04\0\0\002\0\0\0\0\0\0\0", 12, 0) ............
gettimeofday(0x1ffffffff3ac0e10, NULL) .................

but which also included this:

connect(22, 0x6000000000546540, 16) ................................................................................................... ERR#242 EHOSTUNREACH

A bit of google-fu throws up “EHOSTUNREACH: No route to host. A socket operation was attempted to an unreachable host.” and a figure of 75 seconds as the timeout – which matches with the delay observed before BI Server throws the error.
So in this case tusc wasn’t necessary but with a bit more nouse would have got me to the result sooner than a lucky ping did.

 

Blog at WordPress.com.