Pages

Saturday, August 23, 2008

Diagnosing Library Cache Latch Contention: A Real Case Study

The other day, our data warehouse database was effectively down. The database was up and running; but all connection attempts to the database would just hang. Since connections just failed, the DBA couldn’t even get in to see if the sessions connected were running normally or not. The ETL team reported that the processing slowed down to a crawl. The DBA, as instructed, went on to check the wait events; but since she couldn’t even log in, she couldn’t do that. Interestingly, CPU was around 70%, normal for the time of the day; and I/O was about 90%, again normal.

So, in simple terms – she was stuck and she applied the age-old technique: when in doubt, reboot. She asked the SA to reboot the server. It took about 30 minutes; and after it cam backup and the database started everything seemed normal for about 10 minutes. And, then the same problem stuck – the database got unresponsive.

At that point, I was called for help. In this blog I am sort of trying to paint a picture of what I did in the next half hour or so to resolve the issue. I hope by giving this running commentary of the events, facts and figures, you will follow my thoughts as I went about diagnosing the issue.

Symptoms

(1) Database connections were hanging
(2) Connections on the server using SQL*Plus as sysdba was also hanging; so we couldn’t even examine the issue.
(3) The system couldn’t be rebooted everytime; every reboot means more work for the ETL folks to clean up the failed sessions.

Initial Analysis

Obviously something is holding some resource back in the database; but what? Unless we connected to the database, we couldn’t check. And since couldn’t check, we couldn’t remediate. It was a classic case of Catch-22.

Action

This is one of those cases where a backdoor entry to the database instance comes really handy. Most people are not aware of the option in SQL*Plus called “prelim”. This option connects to the SGA; but does not open a session.

(1) So, the first thing I did is to open a sessionless SQL*Plus session using the Oracle user:

$ sqlplus -prelim
SQL>

Note, it didn’t say anything familiar like “Connected to Oracle Database 10.2.0.3”, etc. All it showed was the “SQL>” prompt. That was because it didn’t actually connect to the database.

(2) Then I used the “oradebug” utility to analyze the SGA

SQL> oradebug setmypid
SQL> oradebug hanganalyze 12

This produced a tracefile in the user_dump_dest directory. The file wasn’t difficult to find, since it was the last file created. Even if I didn’t find the file, I could have used the process ID to find the file. The file would have been named crmprd1_ora_13392.trc, assuming 13392 was the process ID.

(3) Let’s examine the file. Here are first few lines:

*** 2008-08-23 01:21:44.200
==============
HANG ANALYSIS:
==============
Found 163 objects waiting for
<0/226/17/0x1502dab8/16108/no>
Open chains found:
Chain 1 : :
<0/226/17/0x1502dab8/16108/no>
<0/146/1/0x1503e898/19923/latch:>

This tells me a lot. First it shows that the SID 146 Serial# 17 is waiting for library cache latch. The blocking session is SID 226 Serial# 17. The latter is not waiting for anything of blocking nature. I also noted the OS process IDs of these sessions – 16108 and 19923.

(4) Next I checked for two more tracefiles with these OS PIDs in their names.

crmprd1_ora_16108.trc
crmprd1_ora_19923.trc

(5) I opened the first one, the one that is the blocker. Here are the first few lines:

*** 2008-08-23 01:08:18.840
*** SERVICE NAME:(SYS$USERS) 2008-08-23 01:08:18.781
*** SESSION ID:(226.17) 2008-08-23 01:08:18.781
LIBRARY OBJECT HANDLE: handle=c0000008dc703810 mtx=c0000008dc703940(8000) cdp=32737
name=UPDATE DW_ETL.FRRS_PROFILER SET CONSUMER_LINK = :"SYS_B_0", ADDRESS_LINK = :"SYS_B_1", ADDRESS_MATCH = :"SYS_B_2", PROC
ESSED=:"SYS_B_3" WHERE RNUM = :"SYS_B_4"
hash=a029fce7bb89655493e7e51a544592a4 timestamp=08-23-2008 00:10:23
namespace=CRSR flags=RON/KGHP/TIM/OBS/PN0/MED/KST/DBN/MTX/[504100d0]
kkkk-dddd-llll=0000-0001-0001 lock=N pin=0 latch#=10 hpc=0058 hlc=0058
lwt=c0000008dc7038b8[c0000008dc7038b8,c0000008dc7038b8] ltm=c0000008dc7038c8[c0000008dc7038c8,c0000008dc7038c8]
pwt=c0000008dc703880[c0000008dc703880,c0000008dc703880] ptm=c0000008dc703890[c0000008dc703890,c0000008dc703890]
ref=c0000008dc7038e8[c0000008dc7038e8,c0000008dc7038e8] lnd=c0000008dc703900[c0000008dc703900,c0000008dc703900]
LOCK OWNERS:
lock user session count mode flags
---------------- ---------------- ---------------- ----- ---- ------------------------
c0000008d079f1b8 c0000006151744d8 c0000006151744d8 16 N [00]
c0000008d4e90c40 c0000006151bcb58 c0000006151bcb58 16 N [00]
c0000008d0812c40 c0000008151a0438 c0000008151a0438 16 N [00]

(6) This is a treasure trove of information for debugging. First it shows the SID and Serial# (226.17), which confirms the SID we identified earlier. It shows the exact SQL statement being used. Finally it shows all the locks. I didn’t particularly care about the specifics of locks; but it gave me enough information to prove that the SID 226 was causing a wait for a whole lot of other sessions.

(7) My investigation is not done; I need to find out the sessions waiting for this. So, I searched the file for a section called “PROCESS STATE”. Here is a snippet from the file:

PROCESS STATE
-------------
Process global information:
process: c00000081502dab8, call: c000000817167890, xact: 0000000000000000, curses: c00000081519ef88, usrses: c000000815
19ef88
----------------------------------------
SO: c00000081502dab8, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=370, calls cur/top: c000000817167890/c000000817167890, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 115 0 4
last post received-location: kslpsr
last process to post me: c000000615002038 1 6
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: c000000615002038 1 6
(latch info) wait_event=0 bits=20
holding (efd=4) c0000008d7b69598 Child library cache level=5 child#=10
Location from where latch is held: kglhdgc: child:: latch
Context saved from call: 13
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
291 (197, 1219468295, 197)
279 (197, 1219468295, 197)
374 (197, 1219468295, 197)
267 (197, 1219468295, 197)
372 (197, 1219468295, 197)
... several lines sniped ...
307 (15, 1219468295, 15)
181 (6, 1219468295, 6)
waiter count=58
Process Group: DEFAULT, pseudo proc: c0000008e03150d8
O/S info: user: oracrmp, term: UNKNOWN, ospid: 16108
OSD pid info: Unix process pid: 16108, image: oracle@sdwhpdb1

(8) This told me everything I needed to know. There were 58 sessions waiting for library cache latch held by SID 226. I also knew the OS Process ID and the SQL statement of the blocking session.

(9) At that time we engaged the Application Owner to explain what was going on. As he explained it, he issues the update statement in a loop. And that’s not all; he executes it in 8 different threads. No wonder we have had library cache latch contention. So, we had to track 8; not just one session. We trudged on. All the sessions dumped their information. So, I searched the directory for some other files with the same issues:

$ grep “UPDATE DW_ETL” *.trc

(10) And I found 9 more sessions (or, rather, processes). Here is a snippet from another file:

350 (167, 1219470122, 167)
197 (167, 1219470122, 167)
waiter count=185
Process Group: DEFAULT, pseudo proc: c0000008e03150d8
O/S info: user: oracrmp, term: UNKNOWN, ospid: 16114

This process had 185 waiters! Ouch!

(11) Now comes a decision point. I knew who is blocking and who were being blocked; although I didn’t yet know what latch exactly is being contented for. I could have dumped the library cache latches to get that information; but the application owner volunteered to terminate the sessions. The application, fortunately, was restartable. So, we decided to kill all of these errant sessions on the unix prompt.

$ kill -9

(12) After killing a few processes, the database started responding. After killing all of them, the database wait events came back to completely normal. Connections were established and applications started behaving normally.

After step 11, I could have used the library cache dump to examine the exact library element in the center of the contention; but that’s a topic for another blog.

Takeaways

(1) When something seems to be “hung”, don’t get hung up on that. A session almost always waits for something; rarely it is just “hung”. You should check what it is waiting for by selecting the EVENT column of V$SESSION (10g) or V$SESSION_WAIT (9i).
(2) When you can’t logon to the database to get the information, try using oradebug command.
(3) To use oradebug, you should use SQL*Plus. Since you can’t login, use “sqlplus -prelim" to get the SQL prompt.
(4) Use oradebug setmypid to start the oradebug session, and then use oradebug hanganalyze to create a dumpfile of all hang related issues
(5) Use oradebug help to see all oradebug commands

Tuesday, August 19, 2008

Put External Procedures on a Different Listener

As a best practice, I recommend placing external procedures, if you use them, on a different listener.ora file. Why so?

External Procedures are actually O/S level programs executed by the listener, as the user "oracle" (or whatever the username for Oracle software is).

Let me repeat that: external procs allow database sessions to execute O/S level programs with the privilege level of the Oracle user! Do you see a problem here? Since the programs run as "oracle", they can do anything the user can do at the command line: "ls -l", "rm listener.log", or even "rm datafiles". Do you want that? Of course not.

Unfortunately there are several vulnerabilities that exist which exploit this particular feature. the CPU patches address some of them; but the bad guys respond by exposing even more holes. In this cat and mouse game, the best thing, in my opinion, is to reduce or even eliminate the possibility if feasible. If you don't use external procs (most people don't), why put them on the listener.ora file? Unfortunately the default config puts the external procs; so you should remove them and you can do that easily.

If you must use external procs, then I suggest using a different listener for that. Doing so allows you to shutdown that functionality while allowing normal database connectivity. If you perceive an imminent threat, you can take evasive action by shutting down the ext proc listener. You can't do that if the external jobs are on the same listener.

Again, consider this: what is the harm is doing that? Nothing. and what is the benefit? I just showed you, however small that may be. So, if you lost nothing and potentially gain something, why not do it?

Why should you set the ADMIN_RESTRICTIONS_LISTENER to ON

Recently someone probably went through the slides of my session on "Real Life DBA Best Practices" and had a question on OTN forum why I was recommending setting the parameter to ON, as a best practice. I responded on the forum; but I feel it's important enough to put it here as well.

As a best practice, I recommend setting this parameter to ON (the default is OFF). But as I profess, a best practice is not one without a clear explanation. Here is the explanation.

Over the period of time, the Oracle Database has encountered several security vulnerabilities, some of them on the listener. Some are related to buffer overflow. others involve unauthorized access into the listener process itself. Some of the listener access exploits come from external listener manipulations. Did you know that you do not need to even log into a server to connect to the listener? As long as the port the listener is listeneing on is open (and it will be, for obvious reasons) you can connect to the listener from a remote server.

In 10g, Oracle provided a default mechanism that does not require password from the oracle user manipulating the listener via online commands. Having said that, there have been bugs and there will be. Those vulnerabilities usually get fixed later; but most often the fix does not get to the software quickly enough.

So, what should you do to protect against these vulnerabilities? I consider a simple thing to do is to remove the possibilty altogether; and that's where the admin restrictions come into picture. After setting this parameter, you can't dynamically change the parameter. So, even though a connection is made somehow from an outside server - bug or not - eliminating the possibilty altogether mitigates the risk. And, that's why recommend it.

Let's ponder on the problem a little bit more. Is that a problem is setting the parameter? Absolutely not. When you need to change a parameter, you simply log on to the server, update the listener.ora and issue "lsnrctl reload". This reloads the parameter file dynamically. Since you never stopped the listener, you will not see unsuccessful conection requests from clients. So, it is dynamic. If you are the oracle user, then you can log on to the server; so there is no issue there.

I advocate this policy rather than dyanamic parameter changes, for these simple reasons:

(1) It plugs a potential hole dues to remote listener vulnerability attacks, regardless of the probabilty of that happening.
(2) It forces you to make changes to listener.ora file, which shows the timestamp.
(3) I ask my DBAs to put extensive comments on the parameter files, including the listener.ora file, to explain the change. I also ask them to comment a previous line and create a new line with the new value, rather than updating a value directly. This sort of documentation is a gem during debugging. Changing in the parameter file allows that, while dynamic change does not.

So, I don't see a single functionality I lose by this practice; and I just showed you some powerful reasons to adopt this practice. No loss, and some gain, however small you consider that to be - and that's why I suggest it.

As I mentioned earlier, a best practice is not one without a clear explanation. I hope this explanation makes it clear.

Saturday, August 09, 2008

Resource Manager I/O Calibration in 11g

You are considering a new platform for the Oracle database; or perhaps worried about the I/O capabilities of the existing platform. How do you know if the I/O drivers, LUN creation, etc. are capableof handling the I/O load? This is a very important question; I/O is the single most important thing in an Oracle database that can cripple an application, however large or small.


Here comes the Calibrate I/O tool in Oracle Database 11g. There are some really cool features of 11g that never made into the blitzkrieg of Oracle Marketing. This is one of them. This is a procedure in the DBMS_RESOURCE_MANAGER package. It drives some I/O to the database files to test the throughput and much more metrics.


So, what is the big deal? This type of tests can be done by simple unix system as well; or even by little apps from the storage vendors.


The important difference is that the tool uses Oracle database software libraries to drive the I/O; not some generic I/O generator. This capability is important - you are testing the I/O for an Oracle database, right?

So, let's get on with the how-it-works part of the post. You need to have asynch I/O capability otherwise this will fail with the following message:



DECLARE*

ERROR at line 1:ORA-56708: Could not find any datafiles with asynchronous i/o capability

ORA-06512: at "SYS.DBMS_RMIN", line 453ORA-06512: at "SYS.DBMS_RESOURCE_MANAGER", line 1153ORA-06512: at line 7

To enable asynch I/O, set two values in the init.ora file (or spfile) and recycle the instances:



disk_asynch_io = true



filesystemio_options = asynch

The first one is default; so you may not have to change it. The default of the second one is NONE; so you will probably have to change it. After setting these values and recycling the instance, confirm at the file level:



select name,asynch_io from v$datafile f,v$iostat_file i 

where f.file#=i.file_no 

and (filetype_name='Data File' or filetype_name='Temp File')

/









NAME ASYNCH_IO


-------------------------------------------------- ---------
+DG1/labrac/datafile/system.261.648437429 ASYNC_ON

+DG1/labrac/datafile/system.261.648437429 ASYNC_ON

+DG1/labrac/datafile/sysaux.262.648437447 ASYNC_ON


+DG1/labrac/datafile/undotbs1.263.648437455 ASYNC_ON

+DG1/labrac/datafile/undotbs2.265.648437477 ASYNC_ON


+DG1/labrac/datafile/users.266.648437523 ASYNC_ON

+DG1/sras_01 ASYNC_ON

This shows the files have Asynch I/O enabled.

Caution: even if the above query returns ASYNC_ON, you may not be able to run the calibrate process. The culprit is the number of asynch I/O slots available in the system. If there are no free slots, the files will not use asynch I/O and you will get ORA-56708. To find out the maximum avalable asynch I/O slots:


$ cat /proc/sys/fs/aio-max-nr
65536

To find out how many are being used:



$ cat /proc/sys/fs/aio-nr
65536

Well, that sort of tells it all. Of the 65536, all are being used; so we are getting ORA-56708. The reason was I was runninganother database on the server, which was consuming all those slots. After shutting down the other database, I checked the utilization again:

$ cat /proc/sys/fs/aio-nr


35456


The value dropped substantially and allowed my calibration to proceed.

Now start the calibration process. The precedure CALIBRATE_IO of the package DBMS_RESOURCE_MANAGER accepts two parameters:

  • NUM_PHYSICAL_DISKS - the number of physical disks you have. Remember: this is the number for disks, not LUNs.
  • MAX_LATENCY - the maximum latency for the disk access (in milliseconds)

And it has three OUT parameters:

  • MAX_IOPS - the maximum number of I/O Operations per second
  • MAX_MBPS - the maximum Megabytes per second
  • ACTUAL_LATENCY - the actual latency observed during the test

Since this has OUT parameters, we will have to write a little PL/SQL program to get the values.

set serveroutput on

declare

l_latency integer;

l_iops integer;

l_mbps integer;

begin

dbms_resource_manager.calibrate_io (

18, /* # of disks */

10, /* max latency */

l_iops, /* I/O Ops/sec */

l_mbps, /* MBytes/sec */

l_latency /* actual latency */

);

dbms_output.put_line ('I/O Ops/sec = ' l_iops);

dbms_output.put_line ('Actual Latency = ' l_latency);

dbms_output.put_line('MB/sec = ' l_mbps);

end;

/


When this is going on, you can check the status:






SQL> select * from v$io_calibration_status

2 /

STATUS
-------------
CALIBRATION_TIME
----------------
IN PROGRESS


This shows that the operation is still going on. After some time you will see the status change to READY. Here is the output from my system (a RAC database with 2 nodes, each with 4 CPUs, with a Dell NAS, with 28 disks).


I/O Ops/sec = 122
Actual Latency = 15
MB/sec = 13

PL/SQL procedure successfully completed.


So, it shows us that the actual latency is somewhat like 15 milliseconds. The system can sustain a maximum of 122 I/O Ops per second with a throuput of 13 MB/sec.


You can check the results of the I/O Calibration run later from the data dictionary view DBA_RSRC_IO_CALIBRATE. Let's check this view. Since it has only one row and there are many columns, I used the very handy Print Table utility by Tom Kyte.



SQL> exec print.tbl('select * from dba_rsrc_io_calibrate');

START_TIME : 09-AUG-08 10.49.34.440392 AM
END_TIME : 09-AUG-08 11.04.33.446146 AM
MAX_IOPS : 122
MAX_MBPS : 13
MAX_PMBPS : 4
LATENCY : 15
NUM_PHYSICAL_DISKS : 28



The columns are self explanatory, in line with the previous output, except the MAX_PMBPS, which shows the maximum sustainable I/O in MB/s for a single process.

This output is great; but it is an aggregation. You can check another view V$IOSTAT_FILE for file-level metrics in calibration run:


select file_no,small_read_megabytes,small_read_reqs,

large_read_megabytes,large_read_reqs

from v$iostat_file






Output:







FILE_NO SMALL_READ_MEGABYTES SMALL_READ_REQS LARGE_READ_MEGABYTES LARGE_READ_REQS
---------- -------------------- --------------- -------------------- ---------------
1 267 33234 230 256 1 0 6 0 2
2 514 50337 318 331
3 121 15496 118 118
4 44 5661 43 43
5 1 83 0 0 6 46 5846 45 45

In addition to the columns, this view has a lot more columns. Here is the output for a single datafile:


SQL> exec print.tbl('select * from v$iostat_file where file_no = 1')


FILE_NO : 1


FILETYPE_ID : 2


FILETYPE_NAME : Data File


SMALL_READ_MEGABYTES : 267


SMALL_WRITE_MEGABYTES : 11


LARGE_READ_MEGABYTES : 230


LARGE_WRITE_MEGABYTES : 0


SMALL_READ_REQS : 33235


SMALL_WRITE_REQS : 1083


SMALL_SYNC_READ_REQS : 33013


LARGE_READ_REQS : 256


LARGE_WRITE_REQS : 2


SMALL_READ_SERVICETIME : 2650538


SMALL_WRITE_SERVICETIME : 10676


SMALL_SYNC_READ_LATENCY : 2647154


LARGE_READ_SERVICETIME : 2745759


LARGE_WRITE_SERVICETIME : 10


ASYNCH_IO : ASYNC_ON


RETRIES_ON_ERROR : 0


-----------------


Using these merics you can decide if the I/O is sufficient or still needs more tweaking.

Recently I used it on a system that required Oracle 10g. But I installed 11g on that server and ran this calibration test, which unearthed some inherent issues with the I/O system. Once rectified, and reconfirmed with the calibration tool, it was ready for the 10g installation.


Thursday, August 07, 2008

Full Table Scans on RAC

The phone goes off, yet again, and this time it seems to be a serious problem. Here are the symptoms: all the applications seems to be working fine albeit at a lesser performance level, except one app, which has crawled to a halt. The app has abended since the queries are timing out. This is, by the a way a RAC database.

This is rather interesting - only one application is slow, instead of all the applications. If this were a systemwide issue, then all the apps would have been affected; not just one. The DBA checked the wait events, as she should have and the predminant events were "gc buffer busy". Well, this is a systemwide issue as a result of saturated interconnect, isn't it - she muses; then how come application A is the only one affected?

So, this is a big clue: the waits are RAC related - "gc buffer busy", yet only one application is slow. First, I got the SIDs of the sessions waiting for that event:

select SID, sql_id, sql_child_number
from v$session
where event = 'gc buffer busy';

After I got the SIDs, I checked the SQL statements issued by the sessions:

select sql_text
from v$sql
where sql_id = ''
and child_number = 24

The results were interesting - all of them were updating or selecting from a single table. Next I checked the execution plan for the SQLs:

select * from table (
dbms_xplan.display_cursor (
'9h1r67dtjxfbs',24,'BASIC'
))
/

PLAN_TABLE_OUTPUT
------------------------------------------
EXPLAINED SQL STATEMENT:
------------------------
select ...

Plan hash value: 871755554
---------------------------------------------------------------------------------
Id Operation Name
---------------------------------------------------------------------------------
0 SELECT STATEMENT


The output tells it all - there is a full table scan going on for a 14 million row table. Well, why? I checked for the presence of indexes and yes, there is an index. Why isn't the index being used?

This is yet another case of "why my index is not being used". It happens at a relatively lesser frequency; but happens nevertheless.

Upon examination I discovered that the table is range partitioned and the index is local. Next I checked if the index has become unusable somehow:

select partition_name, status
from dba_ind_partitions
where index_name = 'the_inex_name';


PARTITION_NAME STATUS
------------------------------ --------

PT_SEP2009 USABLE
PT_OCT2009 UNUSABLE
PMAX UNUSABLE

Note the status of the last two partitions - UNUSABLE. Why so?

I checked the last_ddl_time:

select last_ddl_time
from dba_objects
where object_name = 'the_index_name';

And voila! The index was altered today. Why? The answer came quickly: there is a partition maintenance script that splits the PMAX partition to PMAX and a partition named PT_OCT2009. The split worked fine; but the DBA forgot to add a line like:

alter index this_index_name rebuild partition PMAX;

So, the partition remained unusable. Remember, this is a local index; so the clause MAINTAIN GLOBAL INDEXES will not do anything. The partition must be explicitly rebuilt. Since the partition was not usable, the query plan suddenly changed to full table scan, as I saw earlier. Since this caused a lot of buffer gets, the buffer pool of all the instances of the RAC cluster got flloded with the blocks of this table, causing gc buffer busy waits. So, even though the wait event is RAC related, the issue was due to abnormally large buffer gets and was limited to a specific application only. Eventually, the issue would have saturated the interconnect and affected other apps; but since we caught it early, that event was averted.

I rebuilt the index partitions by explicitly issuing the alter index commands. After that, the queries started using the index and, of course, the gc waits disappeared.

Lessons:

(1) just because you see "gc" waits, don't automatically assume it is RAC specific issue. The underlying issue may have manifested itself as the RAC component issues.
(2) Add to the list of "why oracle is not using my index" one more: unusable index partition(s)
(3) When you split partition of a table, in addition to the clause "maintain global indexes", issue the rebuild of the split partitions of local indexes.

Monday, August 04, 2008

Issues in RAC with Wrong Listener Configuration

The one thing that frustrates me the most is the lack of clear communication. Without clarity in what is being transpired, it's impossible for the receiver to understand the meaning - not just the superficial content but the subtle undertones as well. And it fails to materialize a manageable goal. I could be rambling about humans, of course; but the issue in communication goes beyond - to the things we create - writings, books, articles, poems, software programs, whatever. Oh, did I mention software programs?

Well, in this case I am talking about the software programs we all live by - the Oracle database; more specifically the RAC. Recently we suffered an outage. The reason turned out to be a simple issue of human forgetfulness; but the frustrating part was the lack of communication from the CRS about the issue. So, here is how it went.

This is a three node RAC. We applied the RAC bundle patch in a rolling fashion. Node 3 was brought down, patch applied and was brought up - everything was hunky dorey.

Then node 2 was brought down, patch applied and was brought up; but here was a little problem - the VIP didn't come up. The VIP relocated to node 3. But all the 23 services defined on the node were online. Strange. We assumed the issue was with the bundle patch ina rolling manner; so we decided to go ahead and do the deed on node 1, the remaining one.

When node 1 was brought up, we saw the real issues. The VIP for node 1 came up on node 3 and none of the services were up on node 1. Worse, the services that were up on node 2 suddenly went offline.

A lot of frantic calls to Oracle Support, request for esaclation to duty manager followed, along with the customary yelling and sioghing by the customer. At the end of 2 hours and with a Sev 1 SR, we hadn't gone anywhere.

At this time I took an executive decision. I asked the DBA to issue an ALTER SYSTEM comamnd on node 2 to bring up the services manually, bypassing CRS.

ALTER SYSTEM SET SERVICE_NAME = 'X','Y','Z',....

All the services where up on the instance 2 at this time, confirmed by SHOW PARAMETER SERVICE_NAMES command; but they still showed up as OFFLINE in crs_stat command. That meant, the instance had no issue with the services; the CRS had.

Then we issued a lsnrctl reload command to relaod the listener (note, we didn't use CRS to manipulate the listener - a very important clue you will see why later). The listener started fine with all service names. So far so good. We tried a few connections from the client and they connected fine. We did the same for node 1 and it worked too.

So, at this time, we had the services all up and running; but CRS had no idea about them. That meant the service failover probably wouldn't occur.

Next, we reloacted the VIP using the crs_relocate command. Al the VIPs (of node 1 and node 2) came back to the respective nodes. That was strange as well. If there was an issue with VIPs related to the networking software or hardware, how could they come back?

All these time we analyzed all the logs - the css, crs, racg alogs along with the alert log of the CRS (this is 10.2; so we have the alert log for CRS as well). Nothing there to indicate why it happened. And that's the basis of my rambling about lack of communication.

So, as always we went through the changes in the last few weeks. One of the crucial changes was the introduction of local_listener and remote_listener parameters in init.ora. These parameters can be placed as absolute values - with the host name, ports, etc. - or as the TNS alias entries. We had chosen the latter. The local_listener parameter was ok; but the remote_listener was listener_xxx.

Here was the entry for listener_xxx on tnsnames.ora:

listeners_xxx = (address_list = (address = (protocol=tcp)(host=sxxx1)(port=1521)) (address = (protocol=tcp)(host=sxxx2)(port=1521)) (address = (protocol=tcp)(host=sxxx3)(port=1521)))

There is nothing wrong with the syntax; but note the host value; sxxx1, sxxx2 and sxxx3 are all staging servers; and this is a production cluster. The communication does not work between prod and staging. So, the remote listener parameter pointed to hosts that were not even reachable! By the way, the entries were correct in node 3.

This caused the services to fail when started. Well, then how come they came up fine with ALTER SYSTEM?

My guess is since the ALTER SYSTEM command didn't go through the listener, it was not affected. CRS, on the other hand, goes through the listener.

Fine; but how come my lsnrctl reload command work? Again, my guess is, that comamnd simply starts the listener; the remote listener parameter in the database doesn't affect it. Since the local_listener parameter was correct, the services were successfully registered.

What about the VIP? How did the relocate work? I can only guess, of course. The conjecture is that the VIP is forced to go to the preferred node unless there is some sort of hardware issue, which was not in our case. So the relocate worked.

Phew! all these due to the sloppy work by a DBA. While this was deplorable, the issue with human error will continue to exist. Typos and wrong copy-and-paste will happen, no matter how much severe we make the punishment. I strongly feel CRS should have pointed out this error somewhere. In all that gets spewn out on CRS and CSS logs, how much is actually decipherable? Almost 99% has no meaning attached to it, at least for mere mortals. The idea behind the alert log of the CRS was to show the customer relevant issues in a common log; but there was no indication of this issue.

See the point about communication?