Saturday, August 23, 2008

Diagnosing Library Cache Latch Contention: A Real Case Study

The other day, our data warehouse database was effectively down. The database was up and running; but all connection attempts to the database would just hang. Since connections just failed, the DBA couldn’t even get in to see if the sessions connected were running normally or not. The ETL team reported that the processing slowed down to a crawl. The DBA, as instructed, went on to check the wait events; but since she couldn’t even log in, she couldn’t do that. Interestingly, CPU was around 70%, normal for the time of the day; and I/O was about 90%, again normal.

So, in simple terms – she was stuck and she applied the age-old technique: when in doubt, reboot. She asked the SA to reboot the server. It took about 30 minutes; and after it cam backup and the database started everything seemed normal for about 10 minutes. And, then the same problem stuck – the database got unresponsive.

At that point, I was called for help. In this blog I am sort of trying to paint a picture of what I did in the next half hour or so to resolve the issue. I hope by giving this running commentary of the events, facts and figures, you will follow my thoughts as I went about diagnosing the issue.

Symptoms

(1) Database connections were hanging
(2) Connections on the server using SQL*Plus as sysdba was also hanging; so we couldn’t even examine the issue.
(3) The system couldn’t be rebooted everytime; every reboot means more work for the ETL folks to clean up the failed sessions.

Initial Analysis

Obviously something is holding some resource back in the database; but what? Unless we connected to the database, we couldn’t check. And since couldn’t check, we couldn’t remediate. It was a classic case of Catch-22.

Action

This is one of those cases where a backdoor entry to the database instance comes really handy. Most people are not aware of the option in SQL*Plus called “prelim”. This option connects to the SGA; but does not open a session.

(1) So, the first thing I did is to open a sessionless SQL*Plus session using the Oracle user:

$ sqlplus -prelim
SQL>

Note, it didn’t say anything familiar like “Connected to Oracle Database 10.2.0.3”, etc. All it showed was the “SQL>” prompt. That was because it didn’t actually connect to the database.

(2) Then I used the “oradebug” utility to analyze the SGA

SQL> oradebug setmypid
SQL> oradebug hanganalyze 12

This produced a tracefile in the user_dump_dest directory. The file wasn’t difficult to find, since it was the last file created. Even if I didn’t find the file, I could have used the process ID to find the file. The file would have been named crmprd1_ora_13392.trc, assuming 13392 was the process ID.

(3) Let’s examine the file. Here are first few lines:

*** 2008-08-23 01:21:44.200
==============
HANG ANALYSIS:
==============
Found 163 objects waiting for
<0/226/17/0x1502dab8/16108/no>
Open chains found:
Chain 1 : :
<0/226/17/0x1502dab8/16108/no>
<0/146/1/0x1503e898/19923/latch:>

This tells me a lot. First it shows that the SID 146 Serial# 17 is waiting for library cache latch. The blocking session is SID 226 Serial# 17. The latter is not waiting for anything of blocking nature. I also noted the OS process IDs of these sessions – 16108 and 19923.

(4) Next I checked for two more tracefiles with these OS PIDs in their names.

crmprd1_ora_16108.trc
crmprd1_ora_19923.trc

(5) I opened the first one, the one that is the blocker. Here are the first few lines:

*** 2008-08-23 01:08:18.840
*** SERVICE NAME:(SYS$USERS) 2008-08-23 01:08:18.781
*** SESSION ID:(226.17) 2008-08-23 01:08:18.781
LIBRARY OBJECT HANDLE: handle=c0000008dc703810 mtx=c0000008dc703940(8000) cdp=32737
name=UPDATE DW_ETL.FRRS_PROFILER SET CONSUMER_LINK = :"SYS_B_0", ADDRESS_LINK = :"SYS_B_1", ADDRESS_MATCH = :"SYS_B_2", PROC
ESSED=:"SYS_B_3" WHERE RNUM = :"SYS_B_4"
hash=a029fce7bb89655493e7e51a544592a4 timestamp=08-23-2008 00:10:23
namespace=CRSR flags=RON/KGHP/TIM/OBS/PN0/MED/KST/DBN/MTX/[504100d0]
kkkk-dddd-llll=0000-0001-0001 lock=N pin=0 latch#=10 hpc=0058 hlc=0058
lwt=c0000008dc7038b8[c0000008dc7038b8,c0000008dc7038b8] ltm=c0000008dc7038c8[c0000008dc7038c8,c0000008dc7038c8]
pwt=c0000008dc703880[c0000008dc703880,c0000008dc703880] ptm=c0000008dc703890[c0000008dc703890,c0000008dc703890]
ref=c0000008dc7038e8[c0000008dc7038e8,c0000008dc7038e8] lnd=c0000008dc703900[c0000008dc703900,c0000008dc703900]
LOCK OWNERS:
lock user session count mode flags
---------------- ---------------- ---------------- ----- ---- ------------------------
c0000008d079f1b8 c0000006151744d8 c0000006151744d8 16 N [00]
c0000008d4e90c40 c0000006151bcb58 c0000006151bcb58 16 N [00]
c0000008d0812c40 c0000008151a0438 c0000008151a0438 16 N [00]

(6) This is a treasure trove of information for debugging. First it shows the SID and Serial# (226.17), which confirms the SID we identified earlier. It shows the exact SQL statement being used. Finally it shows all the locks. I didn’t particularly care about the specifics of locks; but it gave me enough information to prove that the SID 226 was causing a wait for a whole lot of other sessions.

(7) My investigation is not done; I need to find out the sessions waiting for this. So, I searched the file for a section called “PROCESS STATE”. Here is a snippet from the file:

PROCESS STATE
-------------
Process global information:
process: c00000081502dab8, call: c000000817167890, xact: 0000000000000000, curses: c00000081519ef88, usrses: c000000815
19ef88
----------------------------------------
SO: c00000081502dab8, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=370, calls cur/top: c000000817167890/c000000817167890, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 115 0 4
last post received-location: kslpsr
last process to post me: c000000615002038 1 6
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: c000000615002038 1 6
(latch info) wait_event=0 bits=20
holding (efd=4) c0000008d7b69598 Child library cache level=5 child#=10
Location from where latch is held: kglhdgc: child:: latch
Context saved from call: 13
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
291 (197, 1219468295, 197)
279 (197, 1219468295, 197)
374 (197, 1219468295, 197)
267 (197, 1219468295, 197)
372 (197, 1219468295, 197)
... several lines sniped ...
307 (15, 1219468295, 15)
181 (6, 1219468295, 6)
waiter count=58
Process Group: DEFAULT, pseudo proc: c0000008e03150d8
O/S info: user: oracrmp, term: UNKNOWN, ospid: 16108
OSD pid info: Unix process pid: 16108, image: oracle@sdwhpdb1

(8) This told me everything I needed to know. There were 58 sessions waiting for library cache latch held by SID 226. I also knew the OS Process ID and the SQL statement of the blocking session.

(9) At that time we engaged the Application Owner to explain what was going on. As he explained it, he issues the update statement in a loop. And that’s not all; he executes it in 8 different threads. No wonder we have had library cache latch contention. So, we had to track 8; not just one session. We trudged on. All the sessions dumped their information. So, I searched the directory for some other files with the same issues:

$ grep “UPDATE DW_ETL” *.trc

(10) And I found 9 more sessions (or, rather, processes). Here is a snippet from another file:

350 (167, 1219470122, 167)
197 (167, 1219470122, 167)
waiter count=185
Process Group: DEFAULT, pseudo proc: c0000008e03150d8
O/S info: user: oracrmp, term: UNKNOWN, ospid: 16114

This process had 185 waiters! Ouch!

(11) Now comes a decision point. I knew who is blocking and who were being blocked; although I didn’t yet know what latch exactly is being contented for. I could have dumped the library cache latches to get that information; but the application owner volunteered to terminate the sessions. The application, fortunately, was restartable. So, we decided to kill all of these errant sessions on the unix prompt.

$ kill -9

(12) After killing a few processes, the database started responding. After killing all of them, the database wait events came back to completely normal. Connections were established and applications started behaving normally.

After step 11, I could have used the library cache dump to examine the exact library element in the center of the contention; but that’s a topic for another blog.

Takeaways

(1) When something seems to be “hung”, don’t get hung up on that. A session almost always waits for something; rarely it is just “hung”. You should check what it is waiting for by selecting the EVENT column of V$SESSION (10g) or V$SESSION_WAIT (9i).
(2) When you can’t logon to the database to get the information, try using oradebug command.
(3) To use oradebug, you should use SQL*Plus. Since you can’t login, use “sqlplus -prelim" to get the SQL prompt.
(4) Use oradebug setmypid to start the oradebug session, and then use oradebug hanganalyze to create a dumpfile of all hang related issues
(5) Use oradebug help to see all oradebug commands

27 comments:

Anonymous said...

Arup,

Excellent walkthrough of a real-world problem, I really do like these kind of examples.

Very useful!

John.

Arup Nanda said...

Thank you, John. That is precisely the idea behind these blog entries. Similar to this one, I plan to publish several real life walk-throughs of problems and solutions. Since I work as a DBA, I face issues like this all the time. All it needs is some time to collate the information in a presentable and coherent fashion. I hope I will find time to continue doing exactly that.

Noons said...

Damn useful stuff, Arup.
Bookmarked.

Arup Nanda said...

Thanks, Nuno. It was a pleasure to cross paths with you again. I haven't visited Oracle-L for quite a while, where I assume you are still active.

Anonymous said...

Arup,

thank you for this excelent example. For further reading I suggest Metalink Note: 215858.1 - "Interpreting HANGANALYZE trace files to diagnose hanging and performance problems".

Adrian Hollay

Anonymous said...

sqlplus -prelim on which database version does it works

parag said...

sqlplus -prelim on which database verion does it works

Arup Nanda said...

The -prelim option is definitely available on 10g Release 2. I tried it on 10g R1 and it worked; but I can't confirm its validity. None of our databases are R1. It's definitely not available in 9i and prior.

Anonymous said...

This post was very very useful. Thanks for sharing your experience with us.

Anonymous said...

Well said.

Chris Adkin said...

Hi Arup,

An interesting posting. However, several things spring to mind:-

1. The developer is using hard coded literals, evident from the fact that binds are being prefixed by :SYS_B and not :B1, :B2 etc. Presumably cursor_sharing is set to similar.

2. There are technical notes on Metalink alluding to high version counts when cursor_sharing is set to similar; 296377.1 and 377847.1. It would have been interesting to have had the ability to see the version counts on these UPDATEs and also wait events on the library cache pin mutex.

3. Did someone suggest to the developer that if he is carrying out updates in a tight loop, bulk binding via PL/SQL is really the way to go. I think there are some circumstances in 10g where the database engine will do this under the covers without it having to be explicitly coded for.

A very good walk through all in all.

Chris

Anonymous said...

Excellent article...Always i use to search eagerly this kind of real-time problem.

Please keep posting this kind of real-time problems you have resolved.

Awaiting for your new real-time issues walkthrough...

Please keep posting this kind of posts !!!!!

Regards,
Yasser.

Anonymous said...

Hi Arup,

If this issue occurs on Oracle 9i version, how can we login to sga with out connecting to the database.

As sqlplus -prelim option is present only in Oracle 10g version.

Regards,
Yasser.

Anonymous said...

Excellent!!!
Did help a lot with a similar problem.

Great work!

Anonymous said...

Arup,
Nice article. I am also facing same type of problem in my production environment. How you have solved this problem? It will be very usefull for me. Have application team done any code change?

Kumar Madduri said...

Hi Arup:

1. Great Article. What I like about your articles is the simplicity with which you explain. It is like peeling a banana and giving it to us for eating (if you know what I mean)

2. Few clarifications
You mention 'First it shows that the SID 146 Serial# 17 is waiting for library cache latch'
Should it not be
First it shows that the SID 146 Serial# 1 is waiting for library cache latch

3. You mention ' So, we had to track 8; not just one session. We trudged on. All the sessions dumped their information. So, I searched the directory for some other files with the same issues'

How did you track the remaining sessions? Was tha available in oradebug? Did you do a ls -ltr and get the latest files and checked the last 10 or 15 files?


4. You mentiond ' I could have dumped the library cache latches to get that information; but the application owner volunteered to terminate the sessions'

But this situation can happen again when the developer runs his program again. The remedy is probably not issue all those updates in parallel or pospone this activity to a later time when there may be less contention. What would be the long term solution?


Thank you again for such an excellent article.

Anonymous said...

Hi Arup:

Excellent article...thanks

k roy said...

Arup ... I really enjoyed reading this. Great stuff.

Thanks.

Will said...

Excellent analysis, Arup.

Anonymous said...
This comment has been removed by a blog administrator.
Unknown said...

i can't use sqlplus -prelim, my database version is 10.2.0.3, please see below :

[oracle@deployment2 ~]$ sqlplus -prelim

SQL*Plus: Release 10.2.0.3.0 - Production on Fri Jul 30 09:15:18 2010

Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

Enter user-name: sys
Enter password:
ERROR:
ORA-24300: bad value for mode


Enter user-name:

any comments ??

Thanks

Arup Nanda said...

@andes

I could reproduce the behavior in 10.2.0.3.

$ sqlplus -prelim

SQL*Plus: Release 10.2.0.3.0 - Production on Fri Jul 30 03:46:05 2010

Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

Enter user-name: / as sysdba
03:46:09 SQL> exit
Disconnected from ORACLE

$ sqlplus -prelim

SQL*Plus: Release 10.2.0.3.0 - Production on Fri Jul 30 03:46:16 2010

Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

Enter user-name: sys
Enter password:
03:46:40 SQL>

As you can see both "/ as sysdba" and "sys/password as sysdba" worked.

You are not entering " as sysdba". You are entering just "". Remember, sys password must be given with sysdba clause.

You may also enter it as:

$ sqlplus -prelim

SQL*Plus: Release 10.2.0.3.0 - Production on Fri Jul 30 04:12:40 2010

Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

Enter user-name: sys as sysdba
Enter password:
04:12:46 SQL>

Unknown said...

Hi Arup

i understand now. Thanks for your reply!

Emiliano said...

Hi Arup,
I have this situation, with db hang. Any idea?

Found 11 objects waiting for
<4947/951/0x6e845ff8/13204/No Wait>
Found 11 objects waiting for
<41/2570/0x6ebc6a58/12454/latch free>
Found 10 objects waiting for
<3336/1628/0x6eaf2648/679/latch free>
Found 20 objects waiting for
<5589/800/0x6e898b38/1259/latch free>
Found 16 objects waiting for
<2192/2293/0x6e82cdf8/27111/latch free>
Found 18 objects waiting for
<4475/1274/0x6e97c228/4024/latch free>
Found 20 objects waiting for
<3533/1281/0x6eb01928/1238/latch free>
Found 32 objects waiting for
<908/8663/0x6e8122d8/25902/latch free>
Found 16 objects waiting for
<5333/808/0x6e82bd38/8862/latch free>
Found 13 objects waiting for
<4450/1413/0x6ebdacc8/15475/latch free>
Found 14 objects waiting for
<5784/927/0x6ea836c8/5289/latch free>

Mohammed Taj said...

Useful info...


Taj

Keira Page said...

This is a really good read for me, Must admit that you are one of the best bloggers I ever saw.Thanks for posting this informative article.

Aryan Smith said...

It will be very usefull for me. Have application team done any code change?
Sheds Fort Lauderdale

Translate