
March 5, 2010

Who’s been at the cookie jar? EBS-BI authentication and Load Balancers

Filed under: cluster, load balancing, obiee, sawserver, support — rmoff @ 10:44

We hit a very interesting problem in our Production environment recently. We hadn’t changed the configuration for a long time, but all of a sudden users were on the phone complaining. They could log in to BI from EBS, but the next link they clicked after logging in took them to the OBIEE “You are not logged in” screen.

Our users log in to EBS R12 and then, using EBS authentication, log in to OBIEE (10.1.3.4). Our OBIEE is deployed on OAS, load-balanced across two servers by an F5 BIG-IP hardware load balancer.

In the OBIEE NQServer.log we started to see a lot of these errors around the time users started complaining:

[nQSError: 13011] Query for Initialization Block 'EBS Security Context' has failed.
[nQSError: 23006] The session variable, NQ_SESSION.ACF, has no value definition.

The EBS/BI authentication configuration wasn’t done by me, and understanding how it works had been on my to-do list for a while, but, as is the way, I’d never quite got around to it. Here was a good reason to learn very quickly! This posting by Gerard Braat is fantastic and brought me up to speed quickly. There’s also a doc on My Oracle Support, 552735.1, and some more info from Gareth Roberts on the OTN forum here.

We stopped Presentation Services on one of the servers, and suddenly users could use the system again. If we swapped which server was stopped and which was running, users could still use the system. With one Presentation Services server running, the system was fine. With both up, users got “You are not logged in”. What did this demonstrate? That on their own, there was nothing wrong with our Presentation Services instances.

We soon suspected the load balancer. The load balancer sets a cookie on each user’s web browser at the initial connection as they connect to BI. The cookie is used in each subsequent connection to define which application server the user should be routed to. This is because Presentation Services cannot maintain state across instances and so the user must always come through to the same application server that they initially connected to (and therefore authenticated on).

What had happened was that the Load Balancer was issuing cookies with an expiry date already in the past (the clock was set incorrectly on it *facepalm*). This meant that the initial connection from EBS to BI was successful, because authentication was done as expected. But the next time the client came back to the BI server for a new or updated report, they hit the Load Balancer, and since the cookie holding the BI app server affinity was invalid (it had already expired) the Load Balancer sent them to any BI app server. If that wasn’t the one they had authenticated against, BI tried to authenticate them again, but without the acf URL string (which only comes through in the initial EBS click-through to BI) – hence the “The session variable, NQ_SESSION.ACF, has no value definition.” error in the NQServer.log and the “You are not logged in” error shown to the user.
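
A quick way to spot this kind of problem (Fiddler does it too, see the footnote below) is to pull back just the response headers from a request that goes via the load balancer and look at the persistence cookie’s Expires attribute. A minimal sketch – the hostname here is made up, and I’m assuming the F5 sets its cookie on the very first response:

# Dump only the response headers and show any cookies being set.
# If the Expires date on the persistence cookie is already in the past,
# affinity is broken and each request is free to land on either app server.
curl -s -D - -o /dev/null http://bi-vip.example.com/analytics/ | grep -i "set-cookie"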

As soon as the date was fixed on the load balancer, cookies were served properly; we brought up both Presentation Services instances and everything worked again. Phew.

Footnote: I cannot recommend this tool highly enough: Fiddler2. It makes tracing HTTP traffic, request headers, cookies, etc, a piece of cake (cookie?).

November 12, 2009

Deploying Oracle Business Intelligence Enterprise Edition on Sun Systems

Filed under: cluster, obiee, performance, unix — rmoff @ 11:28


A very interesting new PDF from Sun on deploying OBIEE has been published, with discussions on architecture, performance and best practice.

This Sun BluePrints article describes an enterprise deployment architecture for Oracle Business Intelligence Enterprise Edition using Sun servers running the Solaris Operating System and Sun Storage 7000 Unified Storage systems. Designed to empower employees in organizations in any industry—from customer service, shipping, and finance to manufacturing, human resources, and more—to become potential decision makers, the architecture brings fault tolerance, security, resiliency, and performance to enterprise deployments. Taking advantage of key virtualization technologies, the architecture can be used to consolidate multiple tiers onto a single system to help reduce cost and complexity. A short discussion of the performance characteristics of the architecture using a realistic workload also is included.

The paper’s by Maqsood Alam, Luojia Chen, Chaitanya Jadaru, Ron Graham and Giri Mandalika.

Direct download: Deploying Oracle Business Intelligence Enterprise Edition on Sun Systems
Main link (requires Sun registration to download): https://www.sun.com/offers/details/821-0698.xml

September 15, 2009

OBIEE cluster controller failover in action

Filed under: cluster, load balancing, obiee, performance, sawserver, unix — rmoff @ 15:06

Production cluster is 2x BI Server and 2x Presentation Services, with a BIG-IP F5 load balancer in front.


Symptoms

Users started reporting slow login times to BI.
Our monitoring tool (OpenView) reported “BIServer01 may be down. Failed to contact it using ping.”
BIServer01 could not be reached by ping or ssh from the Windows network.

Diagnostics

nqsserver and nqsclustercontroller on BIServer01 were logging these repeated errors:

[nQSError: 12002] Socket communication error at call=send: (Number=9) Bad file number

Whether OBIEE was running on BIServer01 or not, users could still use OBIEE but with a delayed login.

The majority of the login time was spent on the OBIEE “Logging in …” screen, which is not normally seen because login is usually quick.

Network configuration issues were found on BIServer01.

Initial suspicion was that EBS authentication was the cause of the delay, as it is only used at login time, which would fit with the behaviour observed. The EBS team checked their system and could see no problems. They also reported that the authentication SQL only hit EBS just before the OBIEE login completed.

Diagnosis

Using nqcmd on one of the Presentation Services boxes, it could be determined that failover of Cluster Controllers was occurring, but only after timing out on contacting the Primary Cluster Controller (BIServer01).

[biadm@PSServer01]/app/oracle/product/obiee/setup $set +u
[biadm@PSServer01]/app/oracle/product/obiee/setup $. ./sa-init64.sh
[biadm@PSServer01]/app/oracle/product/obiee/setup $nqcmd

-------------------------------------------------------------------------------
Oracle BI Server
Copyright (c) 1997-2006 Oracle Corporation, All rights reserved
-------------------------------------------------------------------------------

Give data source name: Cluster64
Give user name: Administrator
Give password: xxxxxxxxxxxxx
[60+ second wait here]

This conclusion was reached because there was no delay in connecting once PrimaryCCS was set to BIServer02. I changed the odbc.ini entry for Cluster64 to switch the CCS server order around:
[…]
PrimaryCCS=BIServer02
SecondaryCCS=BIServer01
[…]

[biadm@PSServer01]/app/oracle/product/obiee/setup $nqcmd

-------------------------------------------------------------------------------
Oracle BI Server
Copyright (c) 1997-2006 Oracle Corporation, All rights reserved
-------------------------------------------------------------------------------

Give data source name: Cluster64
Give user name: Administrator
Give password: xxxxxxxxxxxxx
[logs straight in]

Any changes to odbc.ini have to be followed by a bounce of sawserver.
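
For reference, the cluster-related part of the Cluster64 DSN in odbc.ini looks roughly like this. This is a sketch only: the driver, repository and other entries are omitted, and I’ve assumed the default Cluster Controller port of 9706 – check your own file.

[Cluster64]
[…]
PrimaryCCS=BIServer02
PrimaryCCSPort=9706
SecondaryCCS=BIServer01
SecondaryCCSPort=9706
[…]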

Resolution

To fix the slow login for users whilst the network problems were investigated, I switched the order of the CCS entries in the odbc.ini configuration and bounced each sawserver.
For the end-users the problem was resolved, as they could now log straight in.
However, at this stage we were still running with only half a cluster: if BIServer02 had failed at this point the BI service would have become unavailable.

The root cause was a network configuration error on the four servers, combined with a possible hardware failure.

Summary

Ignoring Scheduler, a two-machine OBIEE cluster has an Active:Active pair of BI Servers. Analytics traffic to these servers is routed via an Active:Passive pair of Cluster Controllers.

The client (e.g. sawserver) uses the ODBC config syntax to define which Cluster Controller to try contacting first – the PrimaryCCS. If it connects, the PrimaryCCS returns the name of a BI Server to the client, which then sends all subsequent ODBC connections to that BI Server directly.

If the client cannot connect to the PrimaryCCS in the time defined it will try the SecondaryCCS instead. The SecondaryCCS behaves exactly the same as the PrimaryCCS – it returns the name of the BI Server to the client for direct ODBC connection.

The Cluster Controller maintains the state of the BI Servers and if one becomes unavailable will know not to route any Analytics traffic to it.

The failover of the Cluster Controller itself is stateless; it is local to the client session context. This means that each new client session has to go through the failover from Primary to Secondary CCS, with the associated timeout delay.
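
Incidentally, if you want to put a number on that per-session delay, nqcmd can be run non-interactively rather than typing at the prompts. A rough sketch, assuming the environment has been sourced with sa-init64.sh as above and that /tmp/test.sql contains a trivial logical SQL statement (both the file and the password here are made up):

# Time how long a brand new session takes to connect and run a trivial query.
# With the primary CCS unreachable, most of the elapsed time is the CCS timeout.
time nqcmd -d Cluster64 -u Administrator -p xxxxxxxx -s /tmp/test.sql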

[update 21st Sept] I’ve tested out the same configuration over four VM OEL 4 servers, and cannot reproduce the delayed login time. When one CCS is taken down, failover to the other appears almost instantaneous. [/update]

FinalTimeOutForContactingCCS

odbc.ini has the parameter FinalTimeOutForContactingCCS set to 60 seconds. Changing this to a lower value does NOT appear to reduce the failover time.
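
For reference, I would expect it to sit in the same DSN stanza as the CCS entries shown above (placement assumed – check your own odbc.ini):

[Cluster64]
[…]
FinalTimeOutForContactingCCS=60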

August 14, 2009

Unix script to report on OBIEE and OBIA processes state

Filed under: Apache, cluster, dac, obia, obiee, sawserver, unix — rmoff @ 07:22

Here’s a set of scripts that I use on our servers as a quick way to check if the various BI components are up and running.

[Screenshot: output of are_processes_running.sh]

Because we split the stack across servers, the scripts are called in different combinations. On our dev boxes we have everything installed, so the script calls all three sub-scripts, whereas in Production each server runs one of:

  1. BI Server
  2. Presentation Server & OAS
  3. Informatica & DAC

The scripts source another script called process_check.sh which I based on the common.sh script that comes with OBIEE.

The BI Server script includes logic to only check for the cluster controller if it’s running on a known clustered machine. This is because in our development environment we don’t cluster the BI Server.

Each script details where the log files and config files can be found; obviously for your installation these will vary. I should have used variables for these, but hey, what’s a hacky script if not imperfect 🙂

The script was written and tested on HP-UX.

Installation

Copy each of these onto your server in the same folder.

You might need to add that folder to your PATH.

Edit are_processes_running.sh so that it calls the appropriate scripts for the components you have installed.

You shouldn’t need to edit any of the other scripts except to update log and config paths.

The scripts

are_processes_running.sh

# are_processes_running.sh
# RNM 2009-04-21
# https://rnm1978.wordpress.com

clear
echo "=-=-=-=-=-=-=-=-=-=-=- "
echo " "

# Comment out the scripts that are not required
# For example if there is no ETL on this server then only
# run the first two scripts
_are_BI_processes_running.sh
_are_PS_processes_running.sh
_are_INF_processes_running.sh

echo " "
echo "=-=-=-=-=-=-=-=-=-=-=- "

_are_BI_processes_running.sh

# _are_BI_processes_running.sh
# RNM 2009-04-21
# https://rnm1978.wordpress.com

. process_check.sh

########## BI Server #################
echo "====="
if [ "$(is_process_running nqsserver)" = yes ]; then
  tput bold
  echo "nqsserver (BI Server) is running"
  tput rmso
else
  tput rev
  echo "nqsserver (BI Server) is not running"
  tput rmso
  echo "  To start it enter:"
  echo "    run-sa.sh start64"
fi
echo "  Log files:"
echo "    tail -n 50 -f /app/oracle/product/obiee/server/Log/NQServer.log"
echo "    tail -n 50 -f /app/oracle/product/obiee/server/Log/nqsserver.out.log"
echo "    tail -n 50 -f /app/oracle/product/obiee/server/Log/NQQuery.log"
echo "  Config file:"
echo "    view /app/oracle/product/obiee/server/Config/NQSConfig.INI"

echo "====="
if [ "$(is_process_running nqscheduler)" = yes ]; then
  tput bold
  echo "nqscheduler (BI Scheduler) is running"
  tput rmso
else
  tput rev
  echo "nqscheduler (BI Scheduler) is not running"
  tput rmso
  echo "  To start it enter:"
  echo "    run-sch.sh start64"
fi
echo "  Log files:"
echo "    tail -n 50 -f /app/oracle/product/obiee/server/Log/NQScheduler.log"
echo "    tail -n 50 -f /app/oracle/product/obiee/server/Log/nqscheduler.out.log"
echo "    ls -l /app/oracle/product/obiee/server/Log/iBots/"
echo "  Config file:"
echo "    view /data/bi/scheduler/config/instanceconfig.xml"

echo "====="
echo "$hostname"
if [ "$(hostname)" = "BICluster1" -o "$(hostname)" = "BICluster2" ]; then
  if [ "$(is_process_running nqsclustercontroller)" = yes ]; then
    tput bold
    echo "BI Cluster Controller is running"
    tput rmso
  else
    tput rev
    echo "BI Cluster Controller is not running"
    tput rmso
    echo "  To start it enter:"
    echo "    run-ccs.sh start64"
  fi
    echo "  Log files:"
  echo "    tail -n 50 -f /app/oracle/product/obiee/server/Log/NQCluster.log"
  echo "    tail -n 50 -f /app/oracle/product/obiee/server/Log/nqsclustercontroller.out.log"
  echo "  Config file:"
  echo "    view /app/oracle/product/obiee/server/Config/NQClusterConfig.INI"
else
  echo "(Not checked for Cluster Controller because not running on BICluster1 or BICluster2)"
fi

_are_PS_processes_running.sh

# _are_PS_processes_running.sh
# RNM 2009-04-21
# https://rnm1978.wordpress.com

. process_check.sh

########## OAS  #################
echo "====="
if [ "$(is_process_running httpd)" = yes ]; then
  tput bold
  echo "Apache (HTTP server) is running"
  tput rmso
else
  tput rev
  echo "Apache (HTTP server) is not running"
  tput rmso
  echo "  It should have been started as part of OAS. Check that opmn (Oracle Process Manager and Notification) is running"
  echo "  If opmn is running then run this command to check the status of the components:"
  echo "    opmnctl status -l"
  echo "  If opmn is not running then start it with this command:"
  echo "    opmnctl startall"
fi
echo "  Log files:"
echo "    ls -lrt /app/oracle/product/OAS_1013/Apache/Apache/logs"
echo "  Config file:"
echo "    view /app/oracle/product/OAS_1013/Apache/Apache/conf/httpd.conf"

echo "====="
if [ "$(is_process_running opmn)" = yes ]; then
  tput bold
  echo "opmn (OAS - Oracle Process Manager and Notification) is running"
  tput rmso
else
  tput rev
  echo "opmn (OAS - Oracle Process Manager and Notification) is not running"
  tput rmso
  echo "  To start it use this command:"
  echo "    opmnctl startall"
fi
echo "  Log files:"
echo "    ls -lrt /app/oracle/product/OAS_1013/opmn/logs"
echo "    ls -lrt /app/oracle/product/OAS_1013/j2ee/home/log"
echo "  Config file:"
echo "    view /app/oracle/product/OAS_1013/opmn/conf/opmn.xml"
echo "    view /app/oracle/product/OAS_1013/j2ee/home/config/server.xml"

########## Presentation Services #################
echo "====="
if [ "$(is_process_running javahost)" = yes ]; then
  tput bold
  echo "javahost is running"
  tput rmso
else
  tput rev
  echo "javahost is not running"
  tput rmso
  echo "  It is started as part of the sawserver startup script"
  echo "  To start it run this command:"
  echo "    run-saw.sh start64"
    echo "  To start it independently run this command:"
  echo "    /app/oracle/product/obiee/web/javahost/bin/run.sh"
  fi
echo "  Log files:"
echo "    ls -lrt /data/web/web/log/javahost/"
echo "  Config file:"
echo "    view /app/oracle/product/obiee/web/javahost/config/config.xml"

echo "====="
if [ "$(is_process_running sawserver)" = yes ]; then
  tput bold
  echo "sawserver (Presentation Services) is running"
  tput rmso
else
  tput rev
  echo "sawserver (Presentation Services) is not running"
  tput rmso
  echo "  To start it enter:"
  echo "    run-saw.sh start64"
fi
echo "  Log files:"
echo "    tail -n 50 -f /data/web/web/log/sawserver.out.log"
echo "    tail -n 50 -f /data/web/web/log/sawlog0.log"

echo "  Config file:"
echo "    view /data/web/web/config/instanceconfig.xml"
echo "    ls -l /data/web/web/config/"

_are_INF_processes_running.sh

# _are_INF_processes_running.sh
# RNM 2009-04-22
# https://rnm1978.wordpress.com

. process_check.sh

########## Informatica #################
echo "====="
inf_running=1
if [ "$(is_process_running server/bin/pmrepagent)" = yes ]; then
  tput bold
  echo "pmrepagent (Informatica Repository Server) is running"
  tput rmso
else
  tput rev
  echo "pmrepagent (Informatica Repository Server) is not running"
  tput rmso
  inf_running=0
fi
if [ "$(is_process_running server/bin/pmserver)" = yes ]; then
  tput bold
  echo "pmserver (Informatica Server) is running"
  tput rmso
else
  tput rev
  echo "pmserver (Informatica Server) is not running"
  tput rmso
  inf_running=0
fi
if [ "$inf_running" -eq 0 ]; then
  echo " "
  echo "  To start PowerCenter:"
  echo "    cd /app/oracle/product/informatica/server/tomcat/bin"
  echo "    infaservice.sh startup"
fi
echo " "
echo "  Log files (PowerCenter):"
echo "    ls -lrt /app/oracle/product/informatica/server/tomcat/logs"
echo " "
echo "  Log files (ETL jobs):"
echo "    ls -lrt /app/oracle/product/informatica/server/infa_shared/SessLogs"
echo "    ls -lrt /app/oracle/product/informatica/server/infa_shared/WorkflowLogs"

########## DAC #################

echo "====="
if [ "$(is_process_running com.siebel.etl.net.QServer)" = yes ]; then
  tput bold
  echo "DAC is running"
  tput rmso
else
  tput rev
  echo "DAC is not running"
  tput rmso
  echo " "
  echo "  To start the DAC server:"
  echo "    cd /app/oracle/product/informatica/DAC_Server/"
  echo "    nohup startserver.sh &"
  echo " "
fi
echo "  Log files:"
echo "    ls -lrt /app/oracle/product/informatica/DAC_Server/log"

process_check.sh

# process_check.sh
# get_pid plagiarised from OBIEE common.sh
# RNM 2009-04-03
# RNM 2009-04-30 Exclude root processes (getting false positive from OpenView polling with process name)

get_pid ()
{
 # The first grep matches the process name, the second excludes the grep process
 # itself, and the third is a hacky way to exclude root-owned processes.
 # In ps -ef output the PID is in the second column.
 echo `ps -ef | grep "$1" | grep -v grep | grep -v "    root " | awk '{print $2}'`
}

is_process_running ()
{
process=$1
#echo $process
procid=`get_pid $process`
#echo $procid
if test "$procid" ; then
 echo "yes"
else
 echo "no"
fi
}

March 30, 2009

Bug in Clustered Publisher Scheduler – ClusterManager: detected 1 failed or restarted instances

Filed under: BI publisher, cluster, quartz — rmoff @ 10:40

Following on from setting up Publisher in a clustered environment, I’ve found a nasty little bug in Quartz, the scheduling element of Publisher.

Looking at the oc4j log file /opmn/logs/default_group~home~default_group~1.log I can see OC4J starting up, and then a whole load of repeated messages:

09/03/30 11:28:43 Oracle Containers for J2EE 10g (10.1.3.3.0) initialized
– ClusterManager: detected 1 failed or restarted instances.
– ClusterManager: Scanning for instance "myserver.fqdn.company.net1238408921404"'s failed in-progress jobs.
– ClusterManager: detected 1 failed or restarted instances.
– ClusterManager: Scanning for instance "myserver.fqdn.company.net1238408921404"'s failed in-progress jobs.
[… repeated for 38MB worth ]

Metalink to the rescue… a search for “ClusterManager: Scanning for instance” throws up doc 739623.1 – Repeated Error Appears In Log File – ClusterManager: detected 1 failed or restarted instances – which details the problem and references bug 7264646.

This is a bug in Quartz (the Publisher scheduling tool), which has been fixed in 1.5.2 (the version that’s included with Publisher is 1.5.1).

On my installation quartz was located in /j2ee/home/applications/xmlpserver/xmlpserver/WEB-INF/lib

Implementing the fix described in Metalink doc 739623.1 solved the problem.

March 24, 2009

Clustering Publisher – Scheduler and Report Repository

Filed under: BI publisher, cluster, obiee, quartz — rmoff @ 11:28

The Oracle BI Publisher Enterprise Cluster Deployment doc which I just found through Metalink highlighted a couple of points:
– Report repository should be shared
– The scheduler should be configured for a cluster

Report Repository
Through Admin>System Maintenance>Report Repository I changed the path from the default, /xmlp/XMLP, to an NFS mount, data/shared/xmlp, and restarted the xmlpserver application in OAS. On coming back up Publisher complained, because all its config files (in xmlp/Admin) had disappeared. I hadn’t moved any of the contents of /xmlp/XMLP, since “Report Repository” suggested to me that it was just for reports – ergo, with no reports yet created, there was nothing to move.
So pedantries aside, I moved the contents of /xmlp/XMLP to my new share, data/shared/xmlp. Publisher was happy after this.
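
For anyone doing the same, the move itself is nothing clever. A minimal sketch, assuming the default repository lives under the OBIEE install directory and writing the share as /data/shared/xmlp – both paths are site-specific, so check your own setup first:

# Copy the whole repository, Admin directory and all, to the shared mount
# (ideally with the xmlpserver application stopped first).
cp -Rp /app/oracle/product/obiee/xmlp/XMLP/* /data/shared/xmlp/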

A side effect of the config being held in the “Report Repository” path is that when I configured the second BI Publisher server to use this new shared path, all of the config I’d done on the first server was applied to the second. I wonder if this is how it’s supposed to work, or whether server-specific config is going to be written to a shared location and cause problems?

With hindsight, and if the config can be shared like this, setting up the shared file system first would have been best: then I’d only have had to configure the one server and the second would have picked it up (for Scheduler changes etc).

Scheduler
I installed the Scheduler schema successfully and ticked Enable Clustering under Scheduler Properties. Doing some poking around (google for “Enable Clustering” “Scheduler Properties”) I found this page, which documents Quartz (used for scheduling in BI Publisher; some more info here). It states:

Enable clustering by setting the “org.quartz.jobStore.isClustered” property to “true”. Each instance in the cluster should use the same copy of the quartz.properties file.

The last sentence of this is reassuring as it describes what I’ve now got with the shared Report Repository folder. Checking data/shared/xmlp/Admin/Scheduler/quartz-config.properties shows that it now includes:

org.quartz.jobStore.isClustered=true
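
For completeness, the other cluster-related settings described in the standard Quartz documentation look like this. This is a sketch with example values (the check-in interval is in milliseconds), not something lifted from the Publisher config itself:

# Quartz clustering settings (example values – see the Quartz documentation)
org.quartz.scheduler.instanceId=AUTO
org.quartz.jobStore.isClustered=true
org.quartz.jobStore.clusterCheckinInterval=20000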

March 23, 2009

OBIEE Publisher – configuring connection to clustered BI Server

Filed under: BI publisher, cluster, obiee — rmoff @ 13:52

I’m setting up a clustered OBIEE 10.1.3.4 production environment. There are four servers: two running BI Server + Cluster Controller + Scheduler, and two running OAS + Presentation Services + Publisher. Clustering of BI is configured; now I’m setting up the other bits. Today is Publisher.

On Publisher instance A, connections to the BI Servers directly work fine:
jdbc:oraclebi://serverA.fqdn.company.net:9703/
jdbc:oraclebi://serverB.fqdn.company.net:9703/
Both work individually as Connection Strings (with a database driver class of oracle.bi.jdbc.AnaJdbcDriver) – verified with the “Test Connection” button.
Connections also work when specifying the hostname only (i.e. no FQDN).

In Oracle Business Intelligence Enterprise Edition Deployment Guide p.40 the connection string to use for a cluster is specified:
jdbc:oraclebi://<host>:9706/PrimaryCCS=<Primary Cluster Controller Host>;PrimaryCCSPort=9706;SecondaryCCS=<Secondary Cluster Controller Host>;SecondaryCCSPort=9706
This doesn’t work straight out of the box. The following attempts all fail with “Could not establish connection”.
1 – documented suggestion :
jdbc:oraclebi://serverA:9706/PrimaryCCS=serverA;PrimaryCCSPort=9706;SecondaryCCS=serverB;SecondaryCCSPort=9706

2 – adding FQDN to the first instance of the cluster controller host had been suggested by a doc I read :
jdbc:oraclebi://serverA.fqdn.company.net:9706/PrimaryCCS=serverA;PrimaryCCSPort=9706;SecondaryCCS=serverB;SecondaryCCSPort=9706

3 – add FQDN to all hostnames just for good measure:
jdbc:oraclebi://serverA.fqdn.company.net:9706/PrimaryCCS=serverA.fqdn.company.net;PrimaryCCSPort=9706;SecondaryCCS=serverB.fqdn.company.net;SecondaryCCSPort=9706

A thought – we’ve proved that the BI Servers are up and running by specifying them as direct connections above, but we’ve not proved that the Cluster Controller is running. Logging in to BI Administrator and using the Cluster Manager showed that all the components were up and running.

Since things weren’t working as expected, I went looking for some log files.
It’s useful to remember that all J2EE/OAS logs for xmlpserver, analytics, etc can be viewed easily through Enterprise Manager. Log in to EM (in my case it’s at http://serverC:7777/em) and then navigate to OC4J home (under the ‘Members’ section) and then click ‘Logs’ in the top right of the page.
In this instance I found the xmlpserver logs under Components – OC4J – home:1 – Application xmlpserver
NB this also gives you the file path to the log if you prefer not to use the web interface each time: [OAS home]/j2ee/home/application-deployments/xmlpserver/home_default_group_1/application.log

There was nothing in the log since startup, so no smoking guns there.

Back to Google for a look to see if there’s more information on the syntax for the JDBC connection. Searching for jdbc:oraclebi PrimaryCCS threw up the Oracle Business Intelligence Publisher Administrator’s and Developer’s Guide.
From this the connection string can be clearly explained:

<URL>:= <Prefix>: [//<Host>:<Port>/][<Property Name>=<Property Value>;]*

where

<Prefix>: is the string jdbc:oraclebi

<Host>: is the hostname of the analytics server. It can be an IP Address or hostname. The default is localhost.

<Port> is the port number that the server is listening on. The default is 9703.
[…]

<PrimaryCCS> -(For clustered configurations) specifies the primary CCS machine name instead of using the “host” to connect. If this property is specified, the “host” property value is ignored. The jdbc driver will try to connect to the CCS to obtain the load-balanced machine. Default is localhost.

From the syntax in the doco I added LogLevel and LogFilePath to the jdbc connection string, but didn’t get any logs produced.
I changed the Publisher logging level to Debug (Admin>System Maintenance>Server Configuration) and restarted xmlpublisher through OAS. Tested the clustered connection string again but got no more detailed logging. Changed the logging level back to Exception.

I resorted to searching Metalink and Metalink 3 (because one support system would be too obvious). A hit straight away in Metalink 3: doc ID 559795.1, “BI Publisher does not accept cluster jdbc connection strings” – a semicolon is missing from the end of the statement!

This now works fine:
jdbc:oraclebi://serverA:9706/PrimaryCCS=serverA;PrimaryCCSPort=9706;SecondaryCCS=serverB;SecondaryCCSPort=9706;
For reference, this also works fine:
jdbc:oraclebi://badgerbadgerbadger:9706/PrimaryCCS=serverA;PrimaryCCSPort=9706;SecondaryCCS=serverB;SecondaryCCSPort=9706;
(i.e. the first hostname is ignored, as stated in the documentation)

This documentation error is listed as bug 7499504.

Moral of the story is – check Metalink for bugs first!
