rmoff

March 5, 2010

Who’s been at the cookie jar? EBS-BI authentication and Load Balancers

Filed under: cluster, load balancing, obiee, sawserver, support — rmoff @ 10:44

We hit a very interesting problem in our Production environment recently. We’d made no changes for a long time to the configuration, but all of a sudden users were on the phone complaining. They could login to BI from EBS but after logging in the next link they clicked took them to the OBIEE “You are not logged in” screen.

Our users login to EBS R12 and then using EBS authentication log in to OBIEE (10.1.3.4). Our OBIEE is deployed on OAS, load balanced across two servers by an F5 BIG-IP hardware load balancer.

In the OBIEE NQServer.log we started to see a lot of these errors around the time users started complaining:

[nQSError: 13011] Query for Initialization Block 'EBS Security Context' has failed.
[nQSError: 23006] The session variable, NQ_SESSION.ACF, has no value definition.

The EBS/BI authentication configuration was not done by me, and the theory of it was one of the things on my to-do list to understand but as is the way had never quite got around to it. Here was a good reason to learn very quickly! This posting by Gerard Braat is fantastic and brought me up to speed quickly. There’s also a doc on My Oracle Support, 552735.1, and some more info from Gareth Roberts on the OTN forum here.

We stopped Presentation Services on one of the servers, and suddenly users could use the system again. If we reversed the stopped/started servers, users could use the system. With one Presentation Services server running, the system was fine. With both up, users got “You are not logged in”. What did this demonstrate? That on their own, there was nothing wrong with our Presentation Services instances.

We soon suspected the load balancer. The load balancer sets a cookie on each user’s web browser at the initial connection as they connect to BI. The cookie is used in each subsequent connection to define which application server the user should be routed to. This is because Presentation Services cannot maintain state across instances and so the user must always come through to the same application server that they initially connected to (and therefore authenticated on).

What had happened was that the Load Balancer was issuing cookies with an expiry date already in the past (the clock was set incorrectly on it *facepalm*). This meant that the initial connection from EBS to BI was successful, because authentication was done as expected. But – the next time the client came back to the BI server for a new or updated report, they hit the Load Balancer and since the cookie holding the BI app server affinity was invalid (it had already expired) the Load Balancer sends them to any BI app server. If it’s not the one that they authenticated against then BI tries to authenticate them again, but they don’t have the acf URL string (which comes through in the initial EBS click through to BI), and hence the “The session variable, NQ_SESSION.ACF, has no value definition.” error in the NQServer.log and “You are not logged in” error shown to the user.

As soon as the date was fixed on the load balancer cookies were served properly, we brought up both Presentation Services, and everything worked again. Phew.

Footnote: I cannot recommend this tool highly enough : Fiddler2. It makes tracing HTTP traffic, request headers, cookies, etc, a piece of cake (cookie?).

November 6, 2009

OBIEE clustering – specifying multiple Presentation Services from Presentation Services Plug-in

Filed under: load balancing, OAS, obiee, sawserver, unix — rmoff @ 12:00

Introduction

Whilst the BI Cluster Controller takes care nicely of clustering and failover for BI Server (nqsserver), we have to do more to ensure further resilience of the stack.

A diagram I come back to again and again when working out configuration or connectivity problems is the one on P16 of the Deployment Guide. With this you can work out most issues for yourself through simple reasoning. Print it out, pin it to your wall, and read it!

As a reminder, when a user calls up the address for Answers or Dashboards the flow goes :

  1. web browser
  2. web serve r (eg OAS – Apache)
  3. app server (eg OAS – OC4J) -> BI Presentation Services Plug-in (“analytics”)
  4. BI Presentation Services
  5. (BI Server)
  6. (Database)

With clustering we are aiming to spread the load as much as possible. This gives us resilience if a component fails and capacity as the work is shared out.

This posting examines how to configure step 3 on the above list (BI Presentation Services Plug-in) to work with multiple BI Presentation Services.

From the Deployment Guide:

BI Presentation Services Plug-ins route session requests to BI Presentation Services instances using native protocol. The connections are load balanced using native load balancing capability.

BI Presentation Services receives requests from BI Presentation Services Plug-in […]. Although an initial user session request can go to any BI Presentation Services in the cluster, each user is then bound to a specific BI Presentation Services instance.

Be aware that “BI Presentation Services” is not the same as “BI Presentation Services Plug-in”:

  • “BI Presentation Services” is sawserver, a service in its own right.
  • “BI Presentation Services Plug-in” is a java servlet called analytics deployed within a J2ee application server.
    • There is also a version for IIS using ISAPI. This article is only about the j2ee version. The configuration principles should remain the same for the ISAPI plugin though.

Configuration

To configure the j2ee plug-in, do the following:

  1. Locate web.xml found in $J2EE_home/applications/analytics/analytics/WEB-INF
    • See note below regarding this path as it is contrary to that given in the Deployment Guide on p35
  2. Create a backup of the web.xml file
  3. By default the file will have two sets of init-params. Remove these:
    <init-param>
    <param-name>oracle.bi.presentation.sawserver.Host</param-name>
    <param-value>localhost</param-value>
    </init-param>
    <init-param>
    <param-name>oracle.bi.presentation.sawserver.Port</param-name>
    <param-value>9710</param-value>
    </init-param>
    
  4. Add in a new init-param in place of the two you removed, specifying your Presentation Services hosts and ports (syntax is host:port) in a semi-colon delimited list
    <init-param>
    <param-name>oracle.bi.presentation.sawservers</param-name>
    <param-value>BISandbox01:9710;BISandbox02:9710</param-value>
    </init-param>
  5. Save your modified web.xml file
  6. Restart your application server
    • In OAS you can use opmnctl restartproc
  7. Login to Answers and test that it works
  8. Stop one of your Presentation Services (sawserver)
  9. Refresh Answers. You’ll probably get a 500 Internal Server Error.
    • If you check the application.log it shows that it can’t connect to the Presentation Services (because you’ve just stopped it, duh!)
  10. Refresh Answers again in a minute or two. You should get Presentation Services back, but from a different instance.
    • Does anyone know where this period is defined, eg is it a timeout setting, multiple failed connection attempts?
  11. Work through all your Presentation Services servers, stopping and starting the service on each to ensure each is being picked up

How do you know which Presentation Services you’re using?

This is where it can get a bit confusing!

The images that you see rendered on the page are local to the BI Presentation Services Plug-in. So if you muck around with the files in /res you can tag the login page with the server that analytics plugin is running on. If you’re not using web server load balancing then this will always be the web server that you’re connecting to.

The web catalog is specified by the BI Presentation Services instance. Once your clustering is setup then obviously you must share or replicate your web catalog. However whilst setting up the plugin->presentation services connectivity it might be an idea to have separate instances. Set up the default dashboard on login simply to show the Presentation Sevices server name as a text box (hardcode it). Do this for each server. You can go and check the actual Request in the web catalog on each server’s file system to make sure you’re on the right one.

Logfiles

  • BI Presentation Services Plug-in:
    •  $J2EE_home/application-deployments/analytics/home_default_group_1/application.log
    • Also available through OAS’s Enterprise Manager, click Logs link top right and navigate to the analytics Application
  • BI Presentation Services:
    • $OracleBIData/web/log/sawlog0.log

Common errors

500 Internal Server Error

Servlet error: An exception occurred. The current application deployment descriptors do not allow for including it in this response. Please consult the application log for details.

BI Presentation Services Plug-in has thrown an error, and you should check its logfile (see below).

analytics: Servlet error java.net.ConnectException: Connection refused

The BI Presentation Services Plug-in is trying to connect to a Presentation Services and can’t. Either you’ve specified the wrong host or port details in the web.xml, or Presentation Services (sawserver) is not running.

Internal Server Error

The server encountered an internal error or misconfiguration and was unable to complete your request.

This typically means that the BI Presentation Services Plug-in is not running. Check in OAS that the analytics application is started

Bonus – shared config

In researching this I found an interesting point in the 10.1.3.4.1 release notes. You can specify the analytics configuration in a shared config file using the oracle.bi.presentation.sawbridge.configFilePath param-name.

On a clustered setup with shared filesystem you can therefore have one file listing the Presentation Services servers to use, and reference this from each analytics config.

Ref: Configuring Oracle BI EE Using an EAR File

web.xml location

The Deployment Guide p35 states that the web.xml for java servlet is $OracleBI_HOME/web/app/WEB-INF. However, in my experience this should actually be $J2EE_home/applications/analytics/analytics/WEB-INF.

The table on p97 in the Infrastructure Installation and Configuration Guide concurs with this, and shows different locations for web.xml. The difference is whether your installation using IIS or OAS/OC4J.

So for OAS/OC4J web.xml is $J2EE_home/applications/analytics/analytics/WEB-INF, and for IIS’s ISAPI plugin it is $OracleBI_HOME/web/app/WEB-INF

September 15, 2009

OBIEE cluster controller failover in action

Filed under: cluster, load balancing, obiee, performance, sawserver, unix — rmoff @ 15:06

Production cluster is 2x BI Server and 2x Presentation Services, with a BIG-IP F5 load balancer on the front.

1pub

Symptoms

Users started reporting slow login times to BI.
Our monitoring tool (Openview) reported that “BIServer01 may be down. Failed to contact it using ping.”.
BIServer01 cannot be reached by ping or ssh from Windows network.

Diagnostics

nqsserver and nqsclustercontroller on BIServer01 was logging these repeated errors:

[nQSError: 12002] Socket communication error at call=send: (Number=9) Bad file number

Whether OBIEE was running on BIServer01 or not, users could still use OBIEE but with a delayed login.

Majority of the login time spent on the OBIEE “Logging in … ” screen, which is not normally seen because login is quick.

Network configuration issues found on BIServer01.

Initial suspicion was that EBS authentication was the cause of the delay, as this is only used at login time so would fit with the behaviour observed. They checked their system and could see no problems. They also reported that the authentication SQL only hit EBS just before OBIEE logged in.

Diagnosis

Using nqcmd on one of the Presentation Services boxes it could be determined that failover of Cluster Controllers was occuring, but only after timing out on contacting the Primary Cluster Controller (BIServer01).
2pub

[biadm@PSServer01]/app/oracle/product/obiee/setup $set +u
[biadm@PSServer01]/app/oracle/product/obiee/setup $. ./sa-init64.sh
[biadm@PSServer01]/app/oracle/product/obiee/setup $nqcmd

-------------------------------------------------------------------------------
Oracle BI Server
Copyright (c) 1997-2006 Oracle Corporation, All rights reserved
-------------------------------------------------------------------------------

Give data source name: Cluster64
Give user name: Administrator
Give password: xxxxxxxxxxxxx
[60+ second wait here]

This conclusion was reached because after setting PrimaryCCS to BIServer02 there was no delay in connecting. I changed the odbc.ini entry for Cluster64 to switch the CCS server order around
[…]
PrimaryCCS=BIServer02
SecondaryCCS=BIServer01
[…]

[biadm@PSServer01]/app/oracle/product/obiee/setup $nqcmd

-------------------------------------------------------------------------------
Oracle BI Server
Copyright (c) 1997-2006 Oracle Corporation, All rights reserved
-------------------------------------------------------------------------------

Give data source name: Cluster64
Give user name: Administrator
Give password: xxxxxxxxxxxxx
[logs straight in]

Any changes to odbc.ini have to be followed by a bounce of sawserver.

Resolution

To fix the slow login for users whilst the network problems were investigated I switched the order of CCS in the odbc.ini configuration and bounced each sawserver:
3pub
For the end-users the problem was resolved as they could now log straight in.
However at this stage we’re still running with half a cluster. If BIServer02 had failed at this point then the BI service would have become unavailable.

The root-cause was a network configuration error on the four servers combined with a possible hardware failure.

Summary

Ignoring Scheduler, a two-machine OBIEE cluster has an Active:Active pair of BI Servers. Analytics traffic to these servers is routed via an Active:Passive pair of Cluster Controllers.

The client (eg sawserver) uses ODBC config syntax to define which Cluster Controller to try contacting first. This is the PrimaryCCS. If it connects then the PrimaryCCS will return the name of the BI Server to the client, which will then send all subsequent ODBC connections to the BI Server direct.

If the client cannot connect to the PrimaryCCS in the time defined it will try the SecondaryCCS instead. The SecondaryCCS behaves exactly the same as the PrimaryCCS – it returns the name of the BI Server to the client for direct ODBC connection.

The Cluster Controller maintains the state of the BI Servers and if one becomes unavailable will know not to route any Analytics traffic to it.

The failover of the Cluster Controller itself is stateless, it is local only to the client session context. This means that each new client session has to go through the failover from Primary to Secondary CCS with the associated timeout delay.

[update 21st Sept] I’ve tested out the same configuration over four VM OEL 4 servers, and cannot reproduce the delayed login time. When one CCS is taken down failover to the other appears almost instantaneous [/update]

FinalTimeOutForContactingCCS

odbc.ini has the parameter FinalTimeOutForContactingCCS set to 60 seconds. Changing this to a lower value does NOT appear to reduce the failover time.

April 15, 2009

OBIEE and F5 BIG-IP

Filed under: Apache, load balancing, OAS, obiee — rmoff @ 13:44

We’ve got a setup of two OAS/Presentation Services boxes and two BI Server boxes, with load balancing/failover throughout.
The Load Balancing of the web requests is being done through a separate bit of kit, an F5 BIG-IP load balancer. This directs the requests at the two OAS servers.

The problem we have is that by default OAS serves HTTP on port 7777, but the F5 is using port 80. A request for our load balanced URL: http://bi.mycompany.com/analytics/ barfs out with

Internet Explorer cannot display the webpage

Most likely causes:
-You are not connected to the Internet.
-The website is encountering problems.
-There might be a typing error in the address.

or in FireFox:

Failed to Connect The connection was refused when attempting to contact bi.mycompany.com:7777. Though the site seems valid, the browser was unable to establish a connection.

Using the excellent HttpFox add-in for Firefox I could see the HTTP requests/responses:

  1. http://bi.mycompany.com/analytics/ goes via the loadbalancer on the default HTTP port 80 to OAS
  2. OAS responds with HTTP/1.1 302 Moved Temporarily to http://bi.mycompany.com:7777/analytics/saw.dll?Dashboard
  3. The web client requests this URL (http://bi.mycompany.com:7777/analytics/saw.dll?Dashboard) from the load balancer but because it’s port 7777 F5 rejects the request (NS_ERROR_CONNECTION_REFUSED)

We could also just use the direct URL http://bi.mycompany.com/analytics/saw.dll?Dashboard but this is hardly user friendly (and also means that if they typo when entering it they’ll get an unhelpful error as above)

Looking at the httpd.conf for Apache to find the port config made me think of the UseCanonicalName setting which I also encountered recently. This setting is to do with how Apache deals with the server name in the URL being requested and the hostname of the server configured in Apache.
When I got the behaviour described above UseCanonicalName was set to Off, which I think means Apache does not rewrite the URL at all, so the redirect was to http://bi.mycompany.com:7777/analytics/saw.dll?Dashboard which is the F5 Load Balancer address.
If I changed UseCanonicalName to On then the F5 load balancing starts to work, as this happens instead:

  1. http://bi.mycompany.com/analytics/ goes via the loadbalancer on the default HTTP port 80 to OAS
  2. OAS responds with HTTP/1.1 302 Moved Temporarily to http://oasserver_1.mycompany.com:7777/analytics/saw.dll?Dashboard

i.e. the request goes directly to one of the load balanced servers, and correctly on port 7777.
The disadvantage of this is that the URL used by the web client then becomes http://oasserver_1.mycompany.com which means the user is no longer hitting the load balancer so any failover wouldn’t get picked up. It also means that users might start bookmarking OAS servers directly instead of the load balancer, again meaning that they don’t hit the load balancer so a server failover wouldn’t get picked up.

Eventually I got this resolved, with a bit of help from a very helpful chap at Oracle. By changing the httpd.conf to set Port 80, when Apache rewrites URLs it now uses Port 80.
Listen remains as 7777.
Traffic from web client now hits the LB on port 80, which forwards to 7777 on one of the OAS servers, which if necessary rewrite the URL and use port 80 in the rewrite.
Because Listen remains as 7777 there is no need to run Apache as root.
You can also set ServerName to the load balancer address (bi.mycompany.com) and UseCanonicalName to On. If you do this then I don’t think it’s possible to access web pages on a specific OAS server (eg oasserver_1) because entering http://oasserver_1.mycompany.com:7777/analytics just redirects to bi.mycompany.com/analytics.

Ref: Deploying F5 with Oracle Application Server 10g
Ref: Oracle HTTP Server – Port setting
Ref: Metalink 301755.1 – What Is the Difference Between Port & Listen In Httpd.Conf

Create a free website or blog at WordPress.com.