rmoff

September 8, 2011

Labelling Time axes in Excel

Filed under: Excel, hack, visualisation — rmoff @ 09:53

Excel may send chills down our spines when we hear users talking about its [ab]use, but it has its place in the toolset. For my money, it is a very good tool for knocking out graphs which look decent. Of course, rrdtool is my geek tool of choice for dynamic long-term graphing, but when doing scratch PoC work, I normally fall back to Excel.

One thing which has frustrated me over time is, well, time, and Excel’s handling thereof. How many times (these puns are getting tiresome already) have you seen an axis like this and gnashed your teeth?

Excel's default labelling of time axes sucks

Looking at the axis options shows some decimals, with no apparent bearing on the times shown on the axis:

By virtue of using Excel for quick ‘n dirty graphing, I normally don’t have the time to figure this out properly. I finally cracked, and I’m glad I did. The answer is very simple.

Times in Excel are stored as fractions of one day. So, with a value of 1 representing 24 hours, we can work backwards:

  • 1/2 is going to be half a day, twelve hours = 0.5
  • 1/24 = one hour = 0.04166666666667
  • 1/24/2 = Half an hour = 0.02083333333333
  • 1/24/2/2 = Quarter of an hour = 0.010416666666666667
  • 1/24/60 = One minute = 0.000694444444

This is a good example of understandable backend functionality (storing times as a plain number) that ought to be shielded from the end user, where the interface design has fallen one step short. Excel knows the data is Time, and a good interface would at the very least offer the option to define axes in terms of time, if not hide the decimals entirely.

Still, making use of what we have, we can get Excel to behave; it’s just a bit laborious.
You want a graph with a maximum of 1 hour, major lines every fifteen minutes, and minor at 5 minutes? No problem.
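
If you’d rather not do the arithmetic by hand, a quick shell calculation gives the values to type into the Axis Options (this assumes bc is installed; a calculator, or Excel itself, does the job just as well):

# Axis maximum: 1 hour
echo "scale=10; 1/24" | bc          # .0416666666

# Major unit: 15 minutes
echo "scale=10; 15/(24*60)" | bc    # .0104166666

# Minor unit: 5 minutes
echo "scale=10; 5/(24*60)" | bc     # .0034722222

Drop those into the Maximum, Major unit and Minor unit boxes (with a Minimum of 0) and you get the kind of axis shown below.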

Sensible time axis labelling in Excel

So next time you’re graphing a time series in Excel, fire up Calc and work out your fractions, for some proper time axis labelling.

May 19, 2011

OBIEE performance – get your database sweating

Filed under: bi, DWH, obiee, oracle, performance, rant, rrdtool — rmoff @ 16:01

Just because something produces the correct numbers on the report, it doesn’t mean you can stop there.

How you are producing those numbers matters, and matters a lot if you have an interest in the long-term health of your system and its ability to scale.

OBIEE is the case in point here, but the principle applies to any architecture with more than one tier or component.

Let me start with a rhetorical question. The user has got a report which has ten rows of data. Which of the following methods is going to be a more efficient way to generate the report?

  1. Transfer the ten rows of data from the database back to the rendering application, and then transfer the rendered report to the user
  2. Transfer fourteen million rows of data from the database back to the rendering application, get the application to crunch these down to the required ten rows, and then transfer the rendered report to the user

Obviously, it’s the first one. In the second one (and this is no hyperbolic example to prove a point, I’m sorry to say) we have the following overheads:

  • Network traffic of fourteen million rows from the database to the application server
  • Disk and/or Memory impact on the application server, depending on how it’s processing those fourteen million rows
  • CPU impact on the application server when it processes those fourteen million rows

Considering OBIEE specifically, you should be aiming to be able to answer all of your questions on the database directly. As soon as you start using the database simply as a source for dumping big volumes of rows into your OBIEE server, you’re storing up problems.
An RDBMS is designed for crunching lots of data. You’ve already lifted all that data off the disk when you selected it out of the tables on Oracle DB – why not get Oracle DB to do the additional processing required too? Where’s the benefit in lifting & shifting all of that data to then reprocess it again?
In fact, consider why you’re crunching big volumes of data each time in the first place. If multiple people have the same big queries, are you missing some aggregates in your data source that would support these queries much better? Or do you need to be looking towards multi-dimensional sources such as Oracle OLAP or Essbase?

The flexibility and power of OBIEE to do its own calculations and aggregations on data is a double-edged sword, and one to be wielded with responsibility. It is a great feature and one which cannot be avoided if you’re using federated queries across disparate sources. But if you’re using it out of ignorance or laziness to brute-force a solution instead of doing the big work at the lowest level possible then you’re asking for trouble.

If you rely on OBIEE to do the heavy work of your reporting solution, then you need to factor this in to your hardware spec for the machine. As a reporting interface to well aggregated data with fairly simple reports, we’ve found that it runs at minimal CPU, and doesn’t stress the disk. You can predict fairly reliably that this should scale just fine so long as your database can. But if you put the workload on the OBIEE server too, you’re going to hit bottlenecks much sooner.

The principle of moving as little data around as possible is what Cary Millsap calls “Filter Early”, and he describes it well in his blog article. Thanks to @RonCrisco for helping me remember the name. @cdturri pointed out that it’s also just common sense :-).
Applying it to a specific system, Exadata uses it in its SmartScan technology where it applies filtering of data directly on the storage rather than bringing all the data back up to the database SQL processing layer (h/t @ocpdba).

Case-study

Here’s an example of a situation where all is not well in OBIEE-land.

We were alerted to it by unix complaining that a filesystem was running low on space. Checking the contents of the filesystem we spotted these monsters:

[user@server]/data/bi/tmp $ls -lrt
total 112914646
[...]
-rw-------   1 user   biadmin    29122976800 Apr  6 11:32 nQS_20117_465_33898735.TMP
-rw-------   1 user   biadmin    24816966976 Apr  6 11:12 nQS_20117_464_33792457.TMP
-rw-------   1 user   biadmin    3582054936 Apr  6 11:46 nQS_20117_469_37979712.TMP
[...]

These are temporary files generated by BI Server (nqsserver), and for those of you viewing in black and white, those files are getting on for 30GB a go!

When queried, NQQuery.log embarrassingly admits facts such as:

Rows 13,894,550, bytes 3,260,497,648 retrieved from database query id: xxxx

and Usage Tracking shows the kind of damage being inflicted:

START_TS      END_TS   TOTAL_TIME_SEC   ROW_COUNT   NUM_DB_QUERY   CUM_NUM_DB_ROW

06-APR 10:05  10:18               579           0              6        3,436,816
06-APR 10:05  10:18               553           0              4        3,239,101
06-APR 10:05  10:18               383           0              3        1,624,656
06-APR 10:11  11:48              5694           0              1       13,894,550
06-APR 10:11  11:48              4314           0              1       11,840,156
06-APR 10:21  10:27               336         456              4        3,239,101

A look at the machine’s vital statistics for this time period shows the impact:

Affected BI Server, showing impact of workload which quietens down c.12:00

For a bit more detail, here’s the time period at greater resolution:

Systems metric during the heavy report execution

Notice that the server starts paging out during the work, and there’s a lot of disk activity – both read and write. CPU goes from a normal c.5% up to 20-30%.

This was one user doing this — care to place bets on how the system would perform if it were rolled out to ten users running this kind of workload?

For reference, here’s one of our other BI Servers which serves at least 10 times as many users, where the reports do the bulk of their work in the database:

Another BI Server on the same day, with a greater number of users but reports which do their big work on the database

Monitoring for problems

There are a few easy ways to look out for this bad behaviour. The first is in Usage Tracking, where you can look at S_NQ_ACCT.CUM_NUM_DB_ROW and compare it to S_NQ_ACCT.ROW_COUNT, even writing an alert for when this ratio goes above a defined threshold.
You could also look at an alert on S_NQ_ACCT.CUM_NUM_DB_ROW going above a fixed number, since depending on your system’s purpose and architectural principles you might want to aim to never be pulling back more rows from the database than necessary for a report.

I’d also pay close attention to S_NQ_ACCT.NUM_DB_QUERY, as this will show you reports generating more than one query on the database and thus increasing the workload on the BI Server processing the combined results.
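
As a rough sketch of the kind of scheduled check that would cover both of these, something like the following would do. The connection string, thresholds and query shape here are hypothetical examples that need tuning to your own system; the columns are the S_NQ_ACCT ones mentioned above.

# A sketch only: credentials and thresholds are examples, not recommendations
sqlplus -s ut_user/ut_password@utdb <<EOF
set pagesize 100 linesize 200
select start_ts,
       row_count,
       num_db_query,
       cum_num_db_row
from   s_nq_acct
where  cum_num_db_row > 1000000                       -- fixed-number threshold
or     cum_num_db_row > 100 * greatest(row_count, 1)  -- ratio threshold
order  by cum_num_db_row desc;
EOF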

Another thing to check is the NQQuery.log, looking at the number of bytes returned from a database query. This could feed into some kind of generated alert for closer investigation if a query returns above a certain number.
OBI 10g:

grep retrieved NQQuery.log|awk '{print $5}'|sort -n

OBI 11g:

grep retrieved NQQuery.log|awk '{print $20}'|sort -n

(if you’re running on Windows then to run this command you’ll need either Cygwin or a Win32 version of grep/awk/sort – or brew your own PowerShell version)
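
To turn that into a crude alert, a wrapper along these lines is one option. The threshold is a made-up example, and the awk field is the 10g position shown above, so switch $5 for $20 on 11g:

# Hypothetical threshold: roughly 1GB returned by a single database query
THRESHOLD=1000000000
# tr strips the thousands separators so the numeric comparison works
MAX_BYTES=$(grep retrieved NQQuery.log | awk '{print $5}' | tr -d ',' | sort -n | tail -1)
if [ "${MAX_BYTES:-0}" -gt "$THRESHOLD" ]; then
    echo "NQQuery.log shows a query returning ${MAX_BYTES} bytes from the database - worth a look"
fi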

I’d also strongly recommend monitoring your BI Server’s TEMP folder (defined as WORK_DIRECTORY_PATHS in NQSConfig.INI), as this is where it lands the data to disk when it’s got to crunch it. Monitor this for two reasons – to spot when bad stuff’s happening, but also in case it fills up and causes your BI Server to barf. For the latter, you can expect to get:

[nQSError: 10058] A general error has occurred. 
[nQSError: 46118] Out of disk space. 
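
A minimal sketch of a cron-able check on that work directory might look something like this (the path and thresholds are examples – substitute your own WORK_DIRECTORY_PATHS value):

WORK_DIR=/data/bi/tmp

# Flag any individual nqsserver temp file over ~10GB (find -size counts 512-byte blocks)
find "$WORK_DIR" -name 'nQS_*.TMP' -size +20971520 -exec ls -l {} \;

# Flag the filesystem itself filling up
USED_PCT=$(df -P "$WORK_DIR" | awk 'NR==2 {sub("%","",$5); print $5}')
if [ "$USED_PCT" -gt 90 ]; then
    echo "Filesystem for $WORK_DIR is ${USED_PCT}% full"
fi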

Developing for the long term

How do you assess your development projects for quality?
You presumably are keen that the reports match the specifications and return the right numbers. You hopefully also have NFRs for how quickly these reports run.
But what about system impact of new developments? How do you quantify this?

If you are delivering a standalone, fire-and-forget project, then maybe an up-front performance test for load and concurrency is enough to validate how well your solution will scale to the required users.
But lots of BI projects are iterative, and may well be cross-functional too. Just because your development project is the first to land doesn’t give you the right to dump a steaming pile on the servers and proclaim that it works with your user volumes and meets their NFRs, so all is well. What about all the projects that come along next, and are going to be hindered by your mess?

This is where another double-edged sword comes into play – “Best Practice”. For me, Best Practice is a way of doing something that multiple implementations, time and expertise have shown to be the best way of not screwing up. It is the starting point from which to work, understanding deviations as required. What it is not, and what gets it a bad reputation, is a fixed set of crude rules to be adhered to blindly and implemented without questioning or understanding.

If a system is not adhering to the kind of best practice I’m talking about here – filter early, in essence – then there may be a good reason. But that reason must be consciously and loudly acknowledged and documented.

March 11, 2011

Getting good quality I/O throughput data

Filed under: AWR, io, oracle, performance, visualisation — rmoff @ 11:33

This post expands on one I made last year here about sampling frequency (of I/O throughput, but it’s a generic concept).
The background to this is my analysis of the performance and capacity of our data warehouse on Oracle 11g.

Before I get too boring, here’s the fun bit:

Pork Pies per Hour (PP/h)

Jim wants to enter a championship pork-pie eating competition. He’s timed himself practising and over the course of an hour he eats four pork pies. So we might say that his Pork Pies per Hour (PP/h) rate is 4.

The competition lasts for thirty minutes. The world champion can eat eight pork pies in thirty minutes. Does Jim stand a chance?
(let’s pretend he has an insatiable appetite and isn’t going to get full, and ignore all other smart-ass factors)

If his consumption rate was 4 PP/h, and he only has half an hour, then we would predict he’ll consume 4 * 0.5 = 2 pork pies. So Jim’s going to get his ass beat.

Or is he?

What if the rate of 4 PP/h masks the fact that the hour broke down as follows:

  • First 15 minutes: he ate one pork pie
  • 15 – 30 minutes: had a beer
  • 30 – 45 minutes: gobbled down three pork pies
  • 45 – 60 minutes: he had another beer and gently sweated pork fumes?

If that were the case and we had sampled every fifteen minutes, we’d see this:

So what we want to know, which is the maximum rate at which he can consume pork pies, is exposed only when we sample at an appropriate frequency.

Mega bites to Mega bytes

Enough of pork pies, and back to the tasty subject of I/O throughput. The point I am trying to make is that without an appropriate sampling frequency, the data that we have becomes less useful.

You can always process your data further to derive conclusions from it over a longer term. Five second samples are going to be fairly unintelligible if considered unprocessed over the timespan of a year.

But what you can’t do is add back in the detail that you lost by sampling too infrequently. Once that moment’s passed, it’s gone.

By the nature of a metric that is a rate at which something happens, each sample is going to be an average over the sampling period. The problem with this is that it can mask peaks in the throughput. Peaks may (or may not) be hitting a ceiling in your system which an average figure will make you think you are comfortably clear of.

System metrics for Oracle are available through AWR, which typically samples every hour. OS-level tools may sample more frequently, but in the context of capacity planning and analysis, periods are often going to be 10s of minutes, or hourly and upwards.

The following illustrates the effect of averaging I/O throughput figures.

The data is the average I/O throughput, sampled every five seconds (through this method). Note that already this is an average, but in the context of hourly samples (for example) we will have to live with five seconds as the starting point.
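
If you want to reproduce this kind of re-sampling yourself, a simple awk bucket-average is enough. The input layout here is an assumption – one sample per line, epoch timestamp then throughput – so adjust the field handling to match your own collector’s output:

# Average 5-second samples into 5-minute (300-second) buckets
# Assumed input format: "epoch_seconds  throughput_bytes_per_sec"
awk '{
    bucket = int($1 / 300) * 300      # start of the 5-minute window
    sum[bucket] += $2
    cnt[bucket]++
}
END {
    for (b in sum) print b, sum[b] / cnt[b]
}' io_5sec.dat | sort -n > io_5min_avg.dat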

This first graph shows the original data, with a five minute average drawn over it. For each, the maximum is noted.
It can be seen that the maximum the I/O throughput hit was a shade over 3GB/s. That was the real maximum that we were driving through the I/O pipe over the period of time (ignoring that it’s a 5-second average). Now look at the maximum of the 5-minute average – we have lost c.240MB/s from our maximum, which is now 2.7GB/s.

In the second graph the original sample is shown, with a 30 minute average. It’s clear to see the effect of averaging the data has – the peaks and troughs are smoothed out, giving a more even line. But is this what we want? Our apparent maximum I/O based on a 30 minute average has now almost halved!
Apparently, we only needed 1.6GB/s of I/O throughput during this time period. The graph clearly shows that this is a false statement. But what about now?

Same graph as before, but without the context of the original sample. Given this data – which is what you’ll have if you collect I/O throughput data that’s sampled every 30 minutes – then how would you know what the maximum throughput during that period was? It is impossible to know!

Pushing this point further, the same 30 minute average, over an extended period:

What’s the maximum throughput that was required during this period? When was the I/O throughput approaching capacity?
You can’t know from the averaged data alone!

Here’s the same extended period, with the original 5 second samples. This is just proving the point, that the 30 minute samples have obliterated the peaks particularly around 04:00 – 06:00.

So what?

To be able to plan for a system’s I/O capacity we need to know more than how much I/O it transferred over a relatively long period of time. We need to know what the biggest demand it put on the system was, otherwise we risk unseen bottlenecks. To make this useful, we also need to understand whether these big demands were prolonged peaks or not. Particularly in a DW/BI environment, load is generally going to be sporadic. Sure, we run regular batches at predictable times, and may look to cache reports at fixed times, but they’re the exception, not the rule.

If a system has a limit of 3GB/s, and we hit that 3GB/s for a few seconds, what is the implication of that? We need to transfer the same amount of data, so hitting the limit means that the transfer is going to take longer. But a few seconds here and there may not matter — it all comes down to the context.

A report that runs for five minutes which hits I/O throughput limit for a second or two isn’t as much of a concern as one where the I/O hits the limit for minutes on end. There’s plenty written about system capacity and scalability, and it should be clear that if a system is hitting a capacity limit (whether it’s I/O, CPU, or whatever) for prolonged periods then the overall stability is going to suffer. Maybe that five minute report which spends four minutes on bottlenecked I/O doesn’t bother the user, but what about the report that’s supposed to run in a few seconds which is sat waiting for I/O at the same time?

(Ed: I’m pretty sure that I’m straying into the realm of knees and such, but need (knee’d?) to go off and read some more about it first to be sure)

I love twitter

In writing this, I’ve been struggling with some of the concepts which I can instinctively feel but don’t have the understanding to properly articulate. Particularly, if my assertion is that long sample periods are not a good idea, what is a good sample period? It’s no use being a smart-ass and rubbishing the data we collect if I can’t explain how we should collect it.

So, I turned to twitter. Twitter is awesome. (I think of Chet almost every time I say this, because he was one of the main guys who convinced me it was as good as the fuss made out. Check out his presentation all about it here).

and back the answers came:

twitter is teh awesome

Amazingly helpful stuff, and focussed on my specific question. Sure, Google has changed our lives when it comes to finding the answers to questions. But (a) there is a lot of crap written on the internet (Blunder On So, anyone?), and (b) you will often find generally interesting things in the area in which you are interested, but for the more specific uncommon questions it’s unlikely you’ll get a direct hit.

Here, I had industry-leading figures directly answering my specific question! I love twitter. And I really appreciate experts taking the time to share their knowledge, experience and expertise.

What next?

Nothing that I’m writing here is new, but I enjoy writing things down to clarify my thoughts.

Courtesy of the good folk of twitter, I have some great links to follow up and digest.

March 9, 2011

Comparing methods for recording I/O – V$SYSSTAT vs HP Measureware

Filed under: io, oracle, rrdtool, unix — rmoff @ 13:38

I wrote last year about Graphing I/O data using gnuplot and Oracle V$SYSSTAT, using a script from Kevin Closson in his article How To Produce Raw, Spreadsheet-Ready Physical I/O Data With PL/SQL. Good For Exadata, Good For Traditional Storage.
Here I’ve got a simple comparison of the data recorded through this script (in essence, Oracle’s V$SYSSTAT), and directly on the OS through HP’s MeasureWare. It’s graphed out with my new favourite tool, rrdtool:

Happily for me, the data marries up almost exactly. One might say that of course it ought to. But it’s always good to have explicit proof of such a supposition.

March 1, 2011

Shiny new geek toys — rrdtool and screen

Filed under: jmx, unix, visualisation — rmoff @ 20:28

I’ve added two new toys to my geek arsenal today. First is one with which I’ve dabbled before, but struggled to master. The second is a revelation to me and which I discovered courtesy of twitter.

rrdtool

rrdtool is a data collection and graphing tool which I’ve been aware of for a while. I wanted to use it when I wrote about Collecting OBIEE systems management data with JMX, but couldn’t get it to work. I’ll not lie to you – it is a bitch to work with at first. Or put a more polite way, it has a steep learning curve. But when you reach the top of the curve and realise its potential…wow. You’ll soon understand why it is so widely used. I plan to write this up soon, but it let me draw nice graphs like this:

Example output from rrdtool of data collected via jmx
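
The proper write-up will have to wait, but for a flavour, the general shape of the rrdtool workflow is something like this (the metric name, step and retention here are illustrative examples rather than the settings from my JMX collection):

# Create a round-robin database: one GAUGE metric sampled every 60s,
# keeping 24 hours' worth of 1-minute averages
rrdtool create sessions.rrd --step 60 \
    DS:active_sess:GAUGE:120:0:U \
    RRA:AVERAGE:0.5:1:1440

# Feed it a value (N = now)
rrdtool update sessions.rrd N:23

# Draw a graph of the last six hours
rrdtool graph sessions.png --start -6h \
    DEF:a=sessions.rrd:active_sess:AVERAGE \
    LINE2:a#0000FF:"Active Sessions"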

screen

The second discovery is a tool called screen. Innocuously named (and a bugger to search for on google), I’d never heard of it until today, when Frits Hoogland mentioned it and the subsequent stream of approving tweets from various geeks whose opinions I respect made me go and read up about it.

It is “a full-screen window manager that multiplexes a physical terminal between several processes, typically interactive shells”.
Why’s it good?
Two things:

  1. In one ssh session, you can have multiple ‘windows’. So you can be running one long process whilst working on another. Or developing some code and compiling it in the other. Or anything else clever you can think of.
  2. As important, you can reattach to sessions you’ve lost connectivity to. Running a long sqlplus job that you’ve had to wrap in a nohup’d shell wrapper? Disconnecting your laptop to take home and log on via VPN? Dialled in on a dodgy line that drops connection? All of these are no problem – you can reattach to your live session any time you want.

There’s a good introduction to screen in this article here.
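
For reference, the basic workflow goes something like this (the session name is just an example):

# Start a named session
screen -S longjob

# Inside it: Ctrl-a c opens a new window, Ctrl-a n / Ctrl-a p switch between
# them, and Ctrl-a d detaches while leaving everything running

# Later (different terminal, after a dropped VPN, whatever): list and reattach
screen -ls
screen -r longjob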

December 6, 2010

Charting OBIEE performance data with gnuplot

Filed under: gnuplot, jmx, monitoring, obiee, unix, visualisation — rmoff @ 21:30

Introduction

This is the second part of three detailed articles making up a mini-series about OBIEE monitoring. It demonstrates how to capture OBIEE performance information, and optionally graph it out and serve it through an auto-updating webpage.

This article takes data that part one showed you how to collect into a tab-separated file that looks something like this:

2010-11-29-14:48:18     1       0       11      0       3       2       1       676     340     0       53      1       0       41      0       3       0
2010-11-29-14:49:18     1       0       11      0       3       2       1       676     0       0       0       1       0       0       0       3       0
2010-11-29-14:50:18     2       0       16      1       4       3       1       679     0       0       0       1       0       0       0       4       0
2010-11-29-14:51:18     2       2       19      1       4       3       1       679     32      0       53      1       0       58      0       4       0
2010-11-29-14:52:18     2       1       19      1       4       3       4       682     0       0       0       1       0       0       0       4       0
2010-11-29-14:53:18     2       1       19      1       4       3       4       682     0       0       0       1       0       0       0       4       0
2010-11-29-14:54:18     2       0       19      1       4       3       1       682     0       0       0       1       0       0       0       4       0

and plots it into something that looks like this:

gnuplot

Depending on what you’re wanting to use the data for, and your level of comfort with command-line tools, you may just want to take your datafile and plot it with something like Excel for ad-hoc analysis. However, for fully automated data capture and rendering, I’d recommend giving gnuplot a go. It’s not the simplest of tools and the documentation is so vast as to be overwhelming – but if you’ve the time to tweak it then you will find it very flexible.

gnuplot can be run interactively (which is useful for trialling settings) or as part of a batch script, and can output to screen or file (e.g. png). It assumes that your data is tab separated (which it should be if you’ve used my script).

The key bit to understand is the “plot xxx using yyy” statement. In this, xxx is your data file, whilst yyy is the columns to plot for x and y data respectively. So assuming that your x-axis is time and the first column in your datafile, and you want to plot the number of Active Sessions (which in my script is the fourth metric, so the fifth column – time being the first), you would have “using 1:5”.

plot "datafile.jmx" using 1:5 with lines

To plot to a PNG file, use a script like this:

gnuplot <<EOF
set xdata time
set timefmt "%Y-%m-%d-%H:%M:%S"
set format x "%d %b\n%H:%M"
set yrange[0:30]
set grid
set key box outside below
set terminal png font "arial, 7" size 410,200 enhanced truecolor
set output "graph01.png"
set title "Oracle BI PS Sessions\nActive Sessions"
plot "datafile.jmx" using 1:5 with lines

This line invokes gnuplot, and <<EOF tells the shell to pass the subsequent lines to gnuplot until it reaches the end of the file (or a line containing just the EOF literal).

gnuplot <<EOF

This sets the x-axis as a time axis; gnuplot then scales it appropriately. The timefmt defines the format of the time data in the input file.

set xdata time
set timefmt "%Y-%m-%d-%H:%M:%S"

The format of the axis labels on the x-axis is defined thus:

set format x "%d %b\n%H:%M"

This hard-codes the y-axis range. Use “set autoscale” to revert to the default of gnuplot setting the range.

set yrange[0:30]

This puts a grid on the chart, and defines the title:

set grid
set title "Oracle BI PS Sessions\nActive Sessions"

This tells gnuplot to write to a png file, using Arial 7pt font, 410×200 pixels, with enhanced text and truecolor output. The “set output” command defines the actual filename.

set terminal png font "arial, 7" size 410,200 enhanced truecolor
set output "graph01.png"

If you omit the “set terminal png […]” line you’ll get an interactive graph from gnuplot, which can be useful for testing different settings.

Plotting multiple datasets

You can visualise multiple datasets alongside each other easily in gnuplot. You might want to do this for related metrics (eg Active Sessions vs Current Sessions), or a clustered OBIEE deployment:

The same metric on two servers in a load-balanced OBIEE deployment

Related metrics from a single server

To do this you append multiple statements to the “plot” command, separated by a comma:

plot "datafile_server01.jmx" using 1:4 with lines title "Server 01"\
, "datafile_server02.jmx" using 1:4 with lines title "Server 02"

(the \ is a line-continuation character)

The title for each dataset is defined as part of the statement as is seen above. This can be shown in a key which is enabled by default and can be moved to the bottom of the graph by using:

set key box outside below

Generating gnuplot scripts automagically

You may have realised by now that this is all quite fiddly to set up. Here are two scripts that will help generate gnuplot scripts. They’re based on the tcl script from part one and use the first item in the array, ID, to determine the column number in which a metric will appear in the datafile.

This first one writes a series of gnuplot statements to plot each metric out on an individual graph, and assumes you have a two-server cluster so plots the datafiles from two servers on the same graph.

cat gnuplot_header
grep lappend obieejmx_server01.tcl|\
sed s/{//g|\
sed s/}//g|\
sed s/"lappend mbeanattrs "//g|\
sed s/"Oracle BI Management:Oracle BI=Performance,AppName="//g|\
sed s/"Oracle BI Server,"/"BI"/g|\
sed s/"name=Oracle BI General"/""/g|\
sed s/"type=Oracle BI Physical DB,name=Oracle Data Warehouse"/"DB"/g|\
sed s/"type=Oracle BI DB Connection Pool,name=Star_Oracle Data Warehouse_Oracle Data Warehouse Connection Pool"/"Conn Pool"/g|\
sed s/"Oracle BI Presentation Server,name=Oracle BI PS "/"BIPS"/g|\
sed s/"Connection"/"Conn"/g|\
sed s/"Current"/"Curr"/g|\
sed s/"Sessions"/"Sess"/g|\
awk 'BEGIN {FS="\t"}
{
        $1+=1
        print "set output \""$1".png\""
        print "set title \"" $2 "\\n" $3 "\""
        print "plot \"server01.jmx\" using 1:" $1 " with lines title \"server01\",\"server02.jmx\" using 1:" $1 " with lines title \"server02\""
}'

gnuplot_header is a file you need to create in the same directory, and can just be a copy of the example gnuplot statement above.

This script plots multiple metrics for a single server onto the same graph. You may want to break the statements up into separate graph plots to group together related metrics or ones with similar ranges – this script will at least give you the correct column and titles to get you started.

cat gnuplot_header
grep lappend obieejmx_server01.tcl|\
sed s/{//g|\
sed s/}//g|\
sed s/"lappend mbeanattrs "//g|\
sed s/"Oracle BI Management:Oracle BI=Performance,AppName="//g|\
sed s/"Oracle BI Server,"/"BI"/g|\
sed s/"name=Oracle BI General"/""/g|\
sed s/"type=Oracle BI Physical DB,name=Oracle Data Warehouse"/"DB"/g|\
sed s/"type=Oracle BI DB Connection Pool,name=Star_Oracle Data Warehouse_Oracle Data Warehouse Connection Pool"/"Conn Pool"/g|\
sed s/"Oracle BI Presentation Server,name=Oracle BI PS "/"BIPS"/g|\
sed s/"Connection"/"Conn"/g|\
sed s/"Current"/"Curr"/g|\
sed s/"Sessions"/"Sess"/g|\
awk 'BEGIN {
        FS="\t"
        print "set output \"server01.png\""
        print "set title \"server01\""
        printf "plot "
     }
     {
        $1+=1
        if (NR>1) {printf "\\\n,"}
        printf " \"server01.jmx\" using 1:" $1 " with lines title \"" $2 ":" $3 "\""
     }
     END {print ""}'

Both scripts use a series of sed statements to try and trim down the MBean names into something shorter for plotting on the graph key.

An example output would look something like this:

# One metric per graph, across two servers
gnuplot --persist <<EOF
set xdata time
set timefmt "%Y-%m-%d-%H:%M:%S"
set format x "%d %b\n%H:%M"
set grid
set key box outside below
set terminal png font "arial, 7" size 410,200 enhanced truecolor
set yrange [0:30]
set output "2.obiee.png"
set title "Oracle BI Presentation Server - Oracle BI PS Connection Pool\nCurrent Open Connections"
plot "server01.jmx" using 1:2 with lines title "server01","server02.jmx" using 1:2 with lines title "server02"
set output "3.obiee.png"
set title "Oracle BI Presentation Server - Oracle BI PS Query Cache\nCurrent Running Queries"
plot "server01.jmx" using 1:3 with lines title "server01","server02.jmx" using 1:3 with lines title "server02"
set output "5.obiee.png"
set title "Oracle BI Presentation Server - Oracle BI PS Sessions\nActive Sessions"
plot "server01.jmx" using 1:5 with lines title "server01","server02.jmx" using 1:5 with lines title "server02"
set output "6.obiee.png"
set title "Oracle BI Presentation Server - Oracle BI PS Sessions\nCurrent Sessions"
plot "server01.jmx" using 1:6 with lines title "server01","server02.jmx" using 1:6 with lines title "server02"
set output "7.obiee.png"
set title "Oracle BI Presentation Server - Oracle BI PS Sessions\nSessions Logged On"
plot "server01.jmx" using 1:7 with lines title "server01","server02.jmx" using 1:7 with lines title "server02"
set output "8.obiee.png"
# Single server, multiple metrics on one graph
gnuplot --persist <<EOF
set xdata time
set timefmt "%Y-%m-%d-%H:%M:%S"
set format x "%d %b\n%H:%M"
set grid
set key box outside right
set terminal png font "arial,9" size 1224,500 enhanced truecolor

set output "server01.png"
set title "server01"
set yrange[0:30]
plot  "server01.jmx" using 1:2 with lines title "BIPSConn Pool:Curr Open Conns"\
, "server01.jmx" using 1:3 with lines title "BIPSQuery Cache:Curr Running Queries"\
, "server01.jmx" using 1:5 with lines title "BIPSSess:Active Sess"\
, "server01.jmx" using 1:6 with lines title "BIPSSess:Curr Sess"\
, "server01.jmx" using 1:7 with lines title "BIPSSess:Sess Logged On"\
, "server01.jmx" using 1:8 with lines title "BIConn Pool:Curr Busy Conn Count"\
, "server01.jmx" using 1:13 with lines title "BI:Active Execute Requests"\
, "server01.jmx" using 1:14 with lines title "BI:Active Fetch Requests"\
, "server01.jmx" using 1:17 with lines title "BI:Total sessions"

Plotting subsets of data

You may well want to plot out subsets of the graph data, for example, the last 24 hours. This is simple to do, just run the jmx datafile through something like tail first:

# Create 24 hour extract of data
# Data is sampled every minute, so the last 24 hours will be 60*24 = 1440
tail -n 1440 server01.jmx > server01.24hr.jmx

Refreshing your graphs automagically

Assuming you’ve got your graph plotting script(s) in separate shell file(s), create a wrapper like this:

# Uncomment this for debug, to echo each statement as it's executed
# set -x
#
# This will loop for ever and ever
# (or until one stops being less than two)
while [ 1 -lt 2 ]
do
        # create any subset data files, using tail, as described above 
        ./create_6hr_logs.sh
        ./create_24hr_logs.sh
        # plot your graph(s)
        ./plot_by_server.sh
        ./plot_by_server_summary.sh
        ./plot_by_server_6hr.sh
        ./plot_by_server_6hr_summary.sh
        ./plot_6hr_graphs.sh
        ./plot_24hr_graphs.sh
        # Copy all output png files to a tmp folder
        # from where they'll be served up as a web page
        # (I told you it was a hack)
        cp *.png /tmp/obieejmx
        chmod g+r /tmp/obieejmx/*.png
        #
        # Write the current time, for debug purposes
        date
        # Wait for a minute
        sleep 60
done

And then run your wrapper in the same way as the metric collection, using nohup and &

nohup ./run_graphs_forever.sh &

What next?

So you’ve got your data … you’ve got your graphs … now why not serve them up through a convenient webpage that refreshes automatically? Click here for part three that shows you how to bolt this onto an existing OAS installation.

September 23, 2010

A good maxim to bear in mind when designing reports

Filed under: bi, visualisation — rmoff @ 08:07

Jeff Atwood of Coding Horror fame observed:

This is sage advice (if a little crude :-)) to bear in mind when building reports for users.

February 8, 2010

Illustrating data

Filed under: bi, visualisation — rmoff @ 12:51

In the following list, which two mind-mapping programs are rated best?

Now look at the actual numbers, and answer again

Different answer?

I can’t be the only one in this frantic world whose eyes are drawn to the pictures instead of the words, and who leaps to conclusions.
It’s only because I use FreeMind and was surprised it scored so low … and then realised it hadn’t.
Looks like the HTML rendering isn’t the same here (FF3.6) as when the web page author wrote it.

This is a good example of how pictures can be so much more powerful than numbers alone, so long as you’re drawing the viewer’s eyes to the correct numbers!
