rmoff

October 18, 2010

When is a bug not a bug? When it’s a “design decision”

Filed under: informatica, obia, rant, support — rmoff @ 13:02

Last month I wrote about a problem that Informatica as part of OBIA was causing us, wherein an expired database account would bring Oracle down by virtue of multiple connections from Informatica.

I raised an SR with Oracle (under OBIA support), who after some back-and-forth with Informatica, were told:

This is not a bug. That the two error messages coming back from Oracle are handled differently is the result of a design decision and as such not a product fault.

Is “design decision” the new “undocumented feature” ?

September 2, 2010

Misbehaving Informatica kills Oracle

Filed under: bug, informatica, obia, ORA-28001, oracle, security — rmoff @ 13:23

This problem, which in essence is bad behaviour from Informatica bringing down Oracle, is a good illustration of unintended consequences of an apparently innocuous security setting.
Per our company’s security standards, database passwords expire every 90 days. When this happens users are prompted to change their password before they can continue logging into Oracle. This applies to application user IDs too.
It appears that Informatica 8.6.1 HF6 (part of OBIA 7.9.6.1) doesn’t handle an expired password well, spawning multiple connections to the database, eventually bringing Oracle down through memory SWAP space exhaustion.

As a side note, one of our DBAs has been investigating how to prevent a client connection accidentally (through bad coding) or maliciously (DoS) bringing down Oracle in this way, his findings are documented here.

Investigation

As the introduction to any good IT horror story goes … “Everything was running fine; nothing had changed”.

Then our monitoring showed swap space usage on our Oracle 11g database server increasing, and soon after Oracle crashed. The DBAs restarted Oracle, but shortly after the swap space usage was on its way up. The unix tool Glance showed a lot of Oracle processes on the box.

Database

Our Informatica repository user expired on the day on which this happened (Aug 27th):

select username, account_status, expiry_date from dba_users

USERNAME                       ACCOUNT_STATUS                   EXPIRY_DATE
------------------------------ -------------------------------- ---------
INF_REPO                       EXPIRED                          27-AUG-10

When a user ID expires an ORA-28001 is given at login:

sqlplus INF_REPO/password

SQL*Plus: Release 11.1.0.7.0 - Production on Thu Sep 2 08:40:17 2010

Copyright (c) 1982, 2008, Oracle.  All rights reserved.

ERROR:
ORA-28001: the password has expired

Changing password for INF_REPO
New password:

This is the throughput figures for Oracle from Enterprise Manager, note the Logons rate starting to increase at c.13:30 (The rate at 03:00AM is representative of the usual logon load on the database):

Server

Note SWAP space increase (dark blue line) at c. midday (nb GMT/BST mean not all the graphs will align):

Database server metrics

Database server metrics

Note number of Alive Oracle processes (faint yellow line!):

Database server metrics - Oracle application only

Database server metrics - Oracle application only

Informatica

In the Informatica exceptions.log and node.log are the initial errors:

ERROR [Master Elect Data Writer] [DOM_10162] An exception occurred while updating the master election row.
java.sql.SQLException: [informatica][Oracle JDBC Driver]No more data available to read.

ERROR [Master Elect Data Writer] [DOM_10162] An exception occurred while updating the master election row.
java.sql.SQLException: [informatica][Oracle JDBC Driver][Oracle]ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
HPUX-ia64 Error: 2: No such file or directory

ERROR [Master Elect Data Writer] [DOM_10162] An exception occurred while updating the master election row.
java.sql.SQLException: [informatica][Oracle JDBC Driver][Oracle]ORA-28001: the password has expired

Followed by the repeated error approximately every ten seconds:

ERROR [Master Elect Data Writer] [DOM_10162] An exception occurred while updating the master election row.
java.sql.SQLException: [informatica][Oracle JDBC Driver][Oracle]ORA-28001: the password has expired

There are also final errors in the log, occuring once only just after midnight:

FATAL [Domain Monitor] [PCSF_10374] Failed to persist [CpuUsageSummary] with error [[informatica][Oracle JDBC Driver]No more data available to read.].
FATAL [Domain Monitor] [PCSF_10374] Failed to persist [RepoUsageSummary] with error [[informatica][Oracle JDBC Driver]No more data available to read.].

After these Informatica shut down.

Theory

This is what I think is happening:

  • Informatica has a polling component (“[Master Elect Data Writer]”) that updates a database table (part of the Informatica repository) every minute
  • Once the user has expired, Informatica gets ORA-28001: the password has expired when it tries to connect
  • Informatica does not handle ORA-28001 correctly
  • It appears that it leaves the connection open
  • It then retries a few seconds later
  • The connections stack up, each taking swap space allocated to the Oracle process that the connection spawns
  • Eventually the server runs out of resource and Oracle crashes
  • At midnight another Informatica component (“[Domain Monitor]”) tries to update a database table (part of the Informatica repository), and gets the ORA-28001 error.
  • This second component (“[Domain Monitor]”) correctly takes the error as fatal and aborts the Informatica server process

Resolution

In my opinion, Informatica should consistently treat ORA-28001 as fatal.

At the very least, if Informatica isn’t treating ORA-28001 as fatal it should close the connection to the database correctly.


An update from Informatica here

October 12, 2009

OBIA 7.9.6.1 released

Filed under: informatica, obia — rmoff @ 07:47

The latest point release of Oracle Business Intelligence Applications, 7.9.6.1, has been released and is available for download from here (direct link to download).

The version.txt reports the version as:

Build: 7.9.6.1.100609.2038
Release Version: Oracle Business Intelligence Applications 7.9.6.1
Package: 100609.2038

No updated documentation library yet through, so can’t nosey through the release notes. The docs that come with the download are labelled 7.9.6 and dated April 09 so don’t look like they’ve been updated either.

————–

Update 13th Oct:
@alex has got the scoop on the new functionality. From this and the press release here’s a few snippets:

  • 28 supported languages
  • […] ideal for data warehouses running on the Sun Oracle Database Machine.
    • So I assume this means that it supports 11gR2 whether Exadata’d or not?
  • “adds support for Informatica PowerCenter 8.6.1 for ETL”
  • More supported DBs and OSs: IBM DB2 9.5 / Microsoft SQL Server 2008 / Teradata v13 / OEL 5 / IBM AIX 6.1 / Windows Server 2008 / RH Linux 5

Still no updated documentation on OTN. Never let it be said that the documentation lets OBIA down …. 😐
————–
Update 15th Oct:
Documentation is now available on OTN

March 25, 2009

ORA-12537 / ORA-12518 [Informatica DAC error CMN_1022]

Filed under: dac, informatica, obia, oracle — rmoff @ 08:44

We’re getting problems with an instance of Informatica / out-of-the-box OBIA on a new set of servers. When we run the execution plan we get this error soon after starting:

MAPPING> DBG_21075 Connecting to database [TNSENTRY], user [MYUSER]
MAPPING> CMN_1761 Timestamp Event: [Tue Mar 24 18:56:33 2009]
MAPPING> CMN_1022 Database driver error…
CMN_1022 [
Database driver error…
Function Name : Logon
ORA-12537: TNS:connection closed

Database driver error…
Function Name : Connect
Database Error: Failed to connect to database using user [MYUSER] and connection string [TNSENTRY].]

MAPPING> CMN_1761 Timestamp Event: [Tue Mar 24 18:56:33 2009]
MAPPING> CMN_1076 ERROR creating database connection.

One or two tasks using the DataWarehouse connection succeed, and then the rest fail with the above error.

That one or two tasks succeed proves that the connection string is specified correctly, plus I’d expect to see an auth error if our username/pw was incorrect. We’ve verified the Physical Data Source in DAC, but stupidly in Informatica (Workflow Manager – Connections – Relational) there’s no “test connection”.

Both of the errors, Informatica’s CMN_1022 and Oracle’s ORA-12537, are generic “somat’s bust” ones, neither providing a clue to what the problem is.

Metalink 3 has several entries for CMN_1022 but they just point to configuration/installation errors with the database connectivity.

There’s a matching article on OTN Forums but without a definitive solution

In DAC Physical Data Sources the Max Num Connections is 10. The OTN forum posting refers to performace so guessing maybe Oracle wasn’t happy with the # of concurrent connections I changed it to 1, but the problem remained.

This is on Informatica 8.1.1, Oracle client 10.2.0 and Oracle DB 11.1.0.7.

Our DBA had a look and validated all the connectivity, and also granted the user DBA just to make sure it wasn’t a priviledges issue.

I turned on tracing in the sqlclient (add trace_level_client=16 to the sqlnet.ora in $TNS_ADMIN) and got this rather helpful output:

***********************************************************************
Fatal NI connect error 12537, connecting to:
(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=mydb.company.com)(PORT=1521))(CONNECT_DATA=(SID=TNSENTRY)(SERVER=DEDICATED)(CID=(PROGRAM=pmdtm)(HOST=
apphost)(USER=unixuser))))

VERSION INFORMATION:
TNS for HPUX: Version 10.2.0.1.0 – Production
TCP/IP NT Protocol Adapter for HPUX: Version 10.2.0.1.0 – Production
Time: 25-MAR-2009 11:09:46
Tracing to file: /app/oracle/product/informatica/server/bin/cli_2844.trc
Tns error struct:
ns main err code: 12537
TNS-12537: TNS:connection closed
ns secondary err code: 12560
nt main err code: 507
TNS-00507: Connection closed
nt secondary err code: 0
nt OS err code: 0

and delving into the guts of the .trc file found:

(11) [25-MAR-2009 11:09:46:011] nsprecv: reading from transport…
(11) [25-MAR-2009 11:09:46:011] nttrd: entry
(11) [25-MAR-2009 11:09:46:100] nttrd: exit
(11) [25-MAR-2009 11:09:46:100] ntt2err: entry
(11) [25-MAR-2009 11:09:46:100] ntt2err: Read unexpected EOF ERROR on 38
(11) [25-MAR-2009 11:09:46:100] ntt2err: exit
(11) [25-MAR-2009 11:09:46:100] nsprecv: error exit
(11) [25-MAR-2009 11:09:46:100] nserror: entry
(11) [25-MAR-2009 11:09:46:101] nserror: nsres: id=0, op=68, ns=12537, ns2=12560; nt[0]=507, nt[1]=0, nt[2]=0; ora[0]=0, ora[1]=0, ora[2]=0

So maybe it’s the DB server that’s not playing ball? I’m guessing the “Read unexpected EOF ERROR on 38” might be relevant.

Taking the opportunity to learn a bit more about Oracle connectivity, I had a look at Oracle® Database Net Services Administrator’s Guide 10g Release 2 (10.2) – Troubleshooting Oracle Net Services. This details setting up logs and traces, and points to Trace Assistant, trcasst. Running it on one of the trace files from a failed connection reported this:

///////////////////////////////////////////////////////////////
Error found. Error Stack follows for thread #: 11
id:0
Operation code:68
NS Error 1:12537
NS Error 2:12560
NT Generic Error:507
Protocol Error:0
OS Error:0
NS & NT Errors Translation
12537, 00000 "TNS:connection closed"
// *Cause: "End of file" condition has been reached; partner has disconnected.
// *Action: None needed; this is an information message.
/
12560, 00000 "TNS:protocol adapter error"
// *Cause: A generic protocol adapter error occurred.
// *Action: Check addresses used for proper protocol specification. Before
// reporting this error, look at the error stack and check for lower level
// transport errors.For further details, turn on tracing and reexecute the
// operation. Turn off tracing when the operation is complete.
/
00507, 00000 "Connection closed"
// *Cause: Normal "end of file" condition has been reached; partner has
// disconnected.
// *Action: None needed; this is an information message.
/
///////////////////////////////////////////////////////////////

which is the same error as I found in the trace file but with each code explained.

We tested different permutations of servers:

Inf server A / 10g client -> DB Server A (11g) -> Fails
Inf server A / 10g client -> DB Server Y (11g) -> Success
Inf server B / 10g client -> DB Server B (11g) -> Success
Inf server A / 10g client -> DB Server Z (10g) -> Success
Inf server C / 11g client -> DB Server C (11g) -> Success
Inf Server C / 11g client -> DB Server A (11g) -> Success

So now we have three identical setups (same informatica/oracle client/oracle DB), two of which work, one fails – when run against Server A.

Our DBA ran a trace on the listener on Server A and picked up this error:

TNS-12518: TNS:listener could not hand off client connection
TNS-12547: TNS:lost contact
TNS-12560: TNS:protocol adapter error
TNS-00517: Lost contact
HPUX Error: 32: Broken pipe

which points to a possible OS issue.

Ref: Oracle® Database Installation Guide 11g Release 1 (11.1) for HP-UX – 2.7 Configure Kernel Parameters
Ref: Metalink article 550859.1 – TROUBLESHOOTING GUIDE TNS-12518 TNS listener could not hand off client connection

The UNIX team checked the kernel settings between DB Server A and DB Server Y, but found no differences (in particular they checked maxuprc and nproc).

This problem eventually got resolved after two actions:

1) Database server was restarted
2) Oracle PROCESSES was increased from 200 to 500

We suspect the restart fixed the problem as one of the UNIX guys spotted some “performance funnies” (technical term 😉 ) on the box prior to the restart.

Blog at WordPress.com.