The Arup Nanda Blog: 2009

Tuesday, December 29, 2009

Instance_Number is busy Message during Standby Instance Startup

Recently I encountered a situation that demonstrates how, in an Oracle database, an error may stem from a very unrelated cause. A DBA was building a physical standby database for an upcominmg training. The two servers he was using were part of a RAC cluster; so the Oracle binaries were already there. He decided to use the same ORACLE_HOME for the new databases as well - a quite logical decision to savbe on space and administration issues. He created a primary database on server n1 and a standby database on the server n2. Follooing the usual manual procedure in building the standby database, he copied the pfile from primary database, modified the parameters and brought up the standby instance in nomount mode on the server n2.

SQL> startup nomount pfile=initSTBY.ora

But it refused to come up, with the following error:

ORA-00304: requested INSTANCE_NUMBER is busy

Alert log showed:

USER (ospid: 14210): terminating the instance due to error 304
Instance terminated by USER, pid = 14210

This was highly unusual. The primary and standby both were non-RAC; there was no instance number concept in a non-RAC database. By the way, the RAC instance on that server (or on the server n1) was not running; so there was no question of any conflict with the RAC instances either. The primary database was called PRIM while the standby was called STBY - eliminating the possibility of an instance name clash as well. And this error came while merely trying to start the instance, not even while mounting - eliminating the standby controlfile as a cause.

The error 304 showed:

00304, 00000, "requested INSTANCE_NUMBER is busy"
// *Cause: An instance tried to start by using a value of the
// initialization parameter INSTANCE_NUMBER that is already in use.
// *Action: Either
// a) specify another INSTANCE_NUMBER,
// b) shut down the running instance with this number
// c) wait for instance recovery to complete on the instance with
// this number.

Needless to say, being for a non-RAC database there was no "instance_number" parameter in the initialization parameter file of primary or the standby. So, the suggestions for the resolution seemed odd. MetaLink provided no help. All the ORA-304 errors were related to RAC with the instance_number mismatch.

As it always happens, it fell on my lap at this time. With just days to go live, I had to find a solution quickly. Long hours of troubleshooting, tracing the processes and examination of occasional trace files did not yield any clue. All the clue seemed to point to RAC, which this database was not. The Oracle Home was a RAC home, which meant the oracle binary was linked with the "rac" option.

So, the next logical step was to install a new Oracle Home without the rac option. After doing so, I tried to bring up the instance, using the new ORACLE_HOME and LD_LIBRARY_PATH variable; but, alas, the same error popped up.

Just to eliminate the possibility of some unknown bug, I decided to put an instance_number parameter, setting it to "1", from the default "0". The same error. I changed it to "2", again, the result was the same error.

Although this didn't help, at least it gave a clue that the error was not related to instance_number. The error message was clearly wrong. With this in mind, I went back to the basics. I went through the alert log with a fine toothed comb, scanning and analyzing each line.

The following line drew my attention:

DB_UNQIUE_NAME STBY is not in the Data Guard configuration

This was odd; the db_unique_name STBY was not defined in DG configuration. [BTW, note the spelling of "unique" in the message above. That is not what I typed; it was a copy and paste from the actual message in the alert log. Someone in Oracle Development should really pay atytention to typos in messages. This is clearly more than a nuisance; what if some process scans for db_unique_name related errors? It will not find the message at all!]

Checking the dg configuration, I found that the DBA has correctly defined the primary and standby names. In any case, Data Guard has not been started yet; this is merely at the instance startup - why is it complaining for data guard configuration at this time.

Perplexed, I resorted to a different approach. I renamed the pfile and all other relevant files. Then I built the standby myself, from scratch - using the same names - PRIM and STBY. And this time, everything worked fine. The instance STBY did come up.

While this solvbed the urgency problem, everyone, inclduing myself, wanted to know what the issue was in the earlier case where the DBA had failed to bring up the instance. To get the answer, I compared the files I created with the DBA created when tried and failed. Voila! The cause was immediately clear - the DBA forgot to put a vital parameter in the pfile of the standby instance:

db_unique_name = 'STBY'

This parameter was absent; so it took the default value as the db_name, which was "PRIM". This caused the instance to fail with a seemingly unrelated message - "ORA-304 Instance_number is busy"!

Learning Points

In Oracle, most errors are obvious; but some are not. So, do not assume the error message is accurate. If all logic fails, assume the error messsage is wrong, or at least inaccurate.
Instead of over-analyzing the process already followed, it may make sense to take a breather, wipe out everything and start fropm scratch. This is evben mor effective when someone else does it, offering a fresh approach and possibly not repeating the same mistakes.
Finally, the issue at hand: if you do not define db_unique_name parameter in the standnby instance, you will receive ORA-304 during instance startup.

Hope this was helpful. Happy New Year, everybody.

Friday, December 18, 2009

Resolving Gaps in Data Guard Apply Using Incremental RMAN BAckup

Recently, we had a glitch on a Data Guard (physical standby database) on infrastructure. This is not a critical database; so the monitoring was relatively lax. And that being done by an outsourcer does not help it either. In any case, the laxness resulted in a failure remaining undetected for quite some time and it was eventually discovered only when the customer complained. This standby database is usually opened for read only access from time to time.This time, however, the customer saw that the data was significantly out of sync with primary and raised a red flag. Unfortunately, at this time it had become a rather political issue.

Since the DBA in charge couldn’t resolve the problem, I was called in. In this post, I will describe the issue and how it was resolved. In summary, there are two parts of the problem:

(1) What happened
(2) How to fix it

What Happened

Let’s look at the first question – what caused the standby to lag behind. First, I looked for the current SCN numbers of the primary and standby databases. On the primary:

SQL> select current_scn from v$database;

CURRENT_SCN
-----------
1447102

On the standby:

SQL> select current_scn from v$database;

CURRENT_SCN
-----------
1301571

Clearly there is a difference. But this by itself does not indicate a problem; since the standby is expected to lag behind the primary (this is an asynchronous non-real time apply setup). The real question is how much it is lagging in the terms of wall clock. To know that I used the scn_to_timestamp function to translate the SCN to a timestamp:

SQL> select scn_to_timestamp(1447102) from dual;

SCN_TO_TIMESTAMP(1447102)
-------------------------------
18-DEC-09 08.54.28.000000000 AM

I ran the same query to know the timestamp associated with the SCN of the standby database as well (note, I ran it on the primary database, though; since it will fail in the standby in a mounted mode):

SQL> select scn_to_timestamp(1301571) from dual;

SCN_TO_TIMESTAMP(1301571)
-------------------------------
15-DEC-09 07.19.27.000000000 PM

This shows that the standby is two and half days lagging! The data at this point is not just stale; it must be rotten.

The next question is why it would be lagging so far back in the past. This is a 10.2 database where FAL server should automatically resolved any gaps in archived logs. Something must have happened that caused the FAL (fetch archived log) process to fail. To get that answer, first, I checked the alert log of the standby instance. I found these lines that showed the issue clearly:

…
Fri Dec 18 06:12:26 2009
Waiting for all non-current ORLs to be archived...
Media Recovery Waiting for thread 1 sequence 700
Fetching gap sequence in thread 1, gap sequence 700-700
… …
Fri Dec 18 06:13:27 2009
FAL[client]: Failed to request gap sequence
GAP - thread 1 sequence 700-700
DBID 846390698 branch 697108460
FAL[client]: All defined FAL servers have been attempted.
…

Going back in the alert log, I found these lines:

Tue Dec 15 17:16:15 2009
Fetching gap sequence in thread 1, gap sequence 700-700
Error 12514 received logging on to the standby
FAL[client, MRP0]: Error 12514 connecting to DEL1 for fetching gap sequence
Tue Dec 15 17:16:15 2009
Errors in file /opt/oracle/admin/DEL2/bdump/del2_mrp0_18308.trc:
ORA-12514: TNS:listener does not currently know of service requested in connect descriptor
Tue Dec 15 17:16:45 2009
Error 12514 received logging on to the standby
FAL[client, MRP0]: Error 12514 connecting to DEL1 for fetching gap sequence
Tue Dec 15 17:16:45 2009
Errors in file /opt/oracle/admin/DEL2/bdump/del2_mrp0_18308.trc:
ORA-12514: TNS:listener does not currently know of service requested in connect descriptor

This clearly showed the issue. On December 15th at 17:16:15, the Managed Recovery Process encountered an error while receiving the log information from the primary. The error was ORA-12514 “TNS:listener does not currently know of service requested in connect descriptor”. This is usually the case when the TNS connect string is incorrectly specified. The primary is called DEL1 and there is a connect string called DEL1 in the standby server.

The connect string works well. Actually, right now there is no issue with the standby getting the archived logs; so there connect string is fine - now. The standby is receiving log information from the primary. There must have been some temporary hiccups causing that specific archived log not to travel to the standby. If that log was somehow skipped (could be an intermittent problem), then it should have been picked by the FAL process later on; but that never happened. Since the sequence# 700 was not applied, none of the logs received later – 701, 702 and so on – were applied either. This has caused the standby to lag behind since that time.

So, the fundamental question was why FAL did not fetch the archived log sequence# 700 from the primary. To get to that, I looked into the alert log of the primary instance. The following lines were of interest:

…
Tue Dec 15 19:19:58 2009
Thread 1 advanced to log sequence 701 (LGWR switch)
Current log# 2 seq# 701 mem# 0: /u01/oradata/DEL1/onlinelog/o1_mf_2_5bhbkg92_.log
Tue Dec 15 19:20:29 2009Errors in file /opt/oracle/product/10gR2/db1/admin/DEL1/bdump/del1_arc1_14469.trc:
ORA-00308: cannot open archived log '/u01/oraback/1_700_697108460.dbf'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory
Additional information: 3
Tue Dec 15 19:20:29 2009
FAL[server, ARC1]: FAL archive failed, see trace file.
Tue Dec 15 19:20:29 2009
Errors in file /opt/oracle/product/10gR2/db1/admin/DEL1/bdump/del1_arc1_14469.trc:
ORA-16055: FAL request rejected
ARCH: FAL archive failed.
Archiver continuing
Tue Dec 15 19:20:29 2009
ORACLE Instance DEL1 - Archival Error. Archiver continuing.
…

These lines showed everything clearly. The issue was:

ORA-00308: cannot open archived log '/u01/oraback/1_700_697108460.dbf'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory

The archived log simply was not available. The process could not see the file and couldn’t get it across to the standby site.

Upon further investigation I found that the DBA actually removed the archived logs to make some room in the filesystem without realizing that his action has removed the most current one which was yet to be transmitted to the remote site. The mystery surrounding why the FAL did not get that log was finally cleared.

Solution

Now that I know the cause, the focus was now on the resolution. If the archived log sequence# 700 was available on the primary, I could have easily copied it over to the standby, registered the log file and let the managed recovery process pick it up. But unfortunately, the file was gone and I couldn’t just recreate the file. Until that logfile was applied, the recovery will not move forward. So, what are my options?

One option is of course to recreate the standby - possible one but not technically feasible considering the time required. The other option is to apply the incremental backup of primary from that SCN number. That’s the key – the backup must be from a specific SCN number. I have described the process since it is not very obvious. The following shows the step by step approach for resolving this problem. I have shown where the actions must be performed – [Standby] or [Primary].

1. [Standby] Stop the managed standby apply process:

SQL> alter database recover managed standby database cancel;

Database altered.

2. [Standby] Shutdown the standby database

3. [Primary] On the primary, take an incremental backup from the SCN number where the standby has been stuck:

RMAN> run {
2> allocate channel c1 type disk format '/u01/oraback/%U.rmb';
3> backup incremental from scn 1301571 database;
4> }

using target database control file instead of recovery catalog
allocated channel: c1
channel c1: sid=139 devtype=DISK

Starting backup at 18-DEC-09
channel c1: starting full datafile backupset
channel c1: specifying datafile(s) in backupset
input datafile fno=00001 name=/u01/oradata/DEL1/datafile/o1_mf_system_5bhbh59c_.dbf
… …
piece handle=/u01/oraback/06l16u1q_1_1.rmb tag=TAG20091218T083619 comment=NONE
channel c1: backup set complete, elapsed time: 00:00:06
Finished backup at 18-DEC-09
released channel: c1

4. [Primary] On the primary, create a new standby controlfile:

SQL> alter database create standby controlfile as '/u01/oraback/DEL1_standby.ctl';

Database altered.

5. [Primary] Copy these files to standby host:

oracle@oradba1 /u01/oraback# scp *.rmb *.ctl oracle@oradba2:/u01/oraback
oracle@oradba2's password:
06l16u1q_1_1.rmb 100% 43MB 10.7MB/s 00:04
DEL1_standby.ctl 100% 43MB 10.7MB/s 00:04

6. [Standby] Bring up the instance in nomount mode:

SQL> startup nomount

7. [Standby] Check the location of the controlfile:

SQL> show parameter control_files

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
control_files string /u01/oradata/standby_cntfile.ctl

8. [Standby] Replace the controlfile with the one you just created in primary.

9. $ cp /u01/oraback/DEL1_standby.ctl /u01/oradata/standby_cntfile.ctl

10.[Standby] Mount the standby database:

SQL> alter database mount standby database;

11.[Standby] RMAN does not know about these files yet; so you must let it know – by a process called cataloging. Catalog these files:

$ rman target=/

Recovery Manager: Release 10.2.0.4.0 - Production on Fri Dec 18 06:44:25 2009

Copyright (c) 1982, 2007, Oracle. All rights reserved.

connected to target database: DEL1 (DBID=846390698, not open)
RMAN> catalog start with '/u01/oraback';

using target database control file instead of recovery catalog
searching for all files that match the pattern /u01/oraback

List of Files Unknown to the Database
=====================================
File Name: /u01/oraback/DEL1_standby.ctl
File Name: /u01/oraback/06l16u1q_1_1.rmb

Do you really want to catalog the above files (enter YES or NO)? yes
cataloging files...
cataloging done

List of Cataloged Files
=======================
File Name: /u01/oraback/DEL1_standby.ctl
File Name: /u01/oraback/06l16u1q_1_1.rmb

12.Recover these files:

RMAN> recover database;

Starting recover at 18-DEC-09
using channel ORA_DISK_1
channel ORA_DISK_1: starting incremental datafile backupset restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
destination for restore of datafile 00001: /u01/oradata/DEL2/datafile/o1_mf_system_5lptww3f_.dbf
...…
channel ORA_DISK_1: reading from backup piece /u01/oraback/05l16u03_1_1.rmb
channel ORA_DISK_1: restored backup piece 1
piece handle=/u01/oraback/05l16u03_1_1.rmb tag=TAG20091218T083619
channel ORA_DISK_1: restore complete, elapsed time: 00:00:07

starting media recovery

archive log thread 1 sequence 8012 is already on disk as file /u01/oradata/1_8012_697108460.dbf
archive log thread 1 sequence 8013 is already on disk as file /u01/oradata/1_8013_697108460.dbf
… …

13. After some time, the recovery fails with the message:

archive log filename=/u01/oradata/1_8008_697108460.dbf thread=1 sequence=8009
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 12/18/2009 06:53:02
RMAN-11003: failure during parse/execution of SQL statement: alter database recover logfile '/u01/oradata/1_8008_697108460.dbf'
ORA-00310: archived log contains sequence 8008; sequence 8009 required
ORA-00334: archived log: '/u01/oradata/1_8008_697108460.dbf'

This happens because we have come to the last of the archived logs. The expected archived log with sequence# 8008 has not been generated yet.

14.At this point exit RMAN and start managed recovery process:

SQL> alter database recover managed standby database disconnect from session;

Database altered.

15.Check the SCN’s in primary and standby:

[Standby] SQL> select current_scn from v$database;

CURRENT_SCN
-----------
1447474
[Primary] SQL> select current_scn from v$database;

CURRENT_SCN
-----------
1447478
Now they are very close to each other. The standby has now caught up.

Tuesday, December 08, 2009

Sepaking at New York Oracle Users Group

I had the opportunity to speak at the New York Oracle Users Group (http://nyoug.org/info/metro_area_user_group/Metro_Area_User_Group_Meeting_Sep_2009_Agenda.pdf) today at Hotel New Yorker. I spoke about reucing the risk during the 10g to 11g upgrade using various tools and techniques available to us as a DBA. Here is the slide deck, if you are iterested.

Some of the technologies I have described during the talk have been explained in detail on various articles I have written. Here they are:

Database Replay: http://www.oracle.com/technology/oramag/oracle/08-jan/o18dbasereplay.html
SQL Performance Analyzer (SPA):
http://www.oracle.com/technology/oramag/oracle/08-mar/o28sqlperf.html
SQL Plan Management: http://www.oracle.com/technology/oramag/oracle/09-mar/o29spm.html

As always, I highly appreciate folks coming to me after the talk and provide the feedback. Please keep the feedbacks coming, via emails, phone calls, telepathy, whatever. I really look forward to them.

Wednesday, November 18, 2009

RAC Perf Tuning Seminar in Istanbul, Turkey

For those who attended my 2-day training event in Istanbul, I wish to express my sincere thanks for the participation. For me, or any speaker, the privilege of having your attention for 2 days away from your work and family, means a lot. Some folks came all the way from Ankara and had to go back to work the following day. I hope you all got to learn something worthwhile your time.

The scripts are located in: http://www.proligence.com/racperf_istanbul/scripts.zip The password and userid is the ones I gave you in the class.

I also want to reiterate my earlier request to send me your detailed and honest feedback at arup@proligence.com. You may want to write it in Turkish, if that is more convenient. The important thing is to provide the feedback; I can always translate using Google.

Thanks, Hande and Madalina from Oracle University for arranging it. Much appreciated. Now I am off to Estonia.

Thursday, October 15, 2009

OOW09 Session#4 DBA 11g New Features

For all those who came to my last of my four sessions - 11g New Features for DBAs - I appreciate your taking the time. It was a pleasant surprise to see about 500 people showing up at a lunch time slot on the last day of the conference.

Here is the presentation link. I hope you enjoyed the session and found it useful.

OOW09 Session# 3

I just finished the third of my four presentations - SQL Plan Management. Considering it was at 9 AM on the last day of the conference, right after the big Aerosmith concert, I was expecting a lot less crowd. But, to my pleasant surprise about 150 brave souls turned up. Thank you all. I hope you found it useful.

Here is the presentation material. While you are there, feel free to browse around. And, of course, I will highly appreciate if you could send me your comments, either here or via email - whatever you are comfortable with.

Wednesday, October 14, 2009

OOW09 My Session#2

Today I delivered the second of my four sessions - "Upgrade Case Study: Database Replay, Snapshot Standby and Plan Baselines".

For those you attended, I thank you very much. Here is the presentation.

Tuesday, October 13, 2009

OOW Day2

Why do you come to Open World? I'm sure we will get all kinds of reasons, as many as there are stars in the sky. Some predominant themes are - getting to know more about the Oracle (or related) technologies by attending sessions, reconnecting with old friends and building networking. Of course, getting freebies from the Exhibit Halls, I'm sure, can't be far behind as a motivator.

I come to OOW for all those reasons as well. But high up in my list is the visit to the Exhibit Halls. No; not for the tee-shirts that do not fit me and graphics I don't really dig. I visit the demogrounds and exhibit halls to know about the products and tools that I should be aware of. Where else would you find 1000+ companies advertising the products at one place? Sure, I can call them and find out; but ho do I find them? OOW exhibit halls are prime "hunting" grounds to look for new ideas and tools that I should be interested in; or at least be aware of. I can not only look at the tools; I can actually get some relevant technical facts in 5 minutes which might take weeks of scheduling and hours of marketing talk. And, if I decide the product is not relevant; I can always walk away. I have the privilege of walking away; they don't. If I call them to my office, "they" have that option; not me :) If I find something attractive, I can always follow up and get to know more.

Oracle demogrounds are even better. Not only I can meet Oracle PMs there; but the people who never come out to the public world - developers, development managers, architects and so on. These unsung heroes are mostly the reason why Oracle is what it is now. I meet the known faces, get to know new ones and establish new relationships. They hear from me what customers want and I learn the innards of some features I am curious about.

So, I spent almost the whole day yesterday navigating through demo grounds and exhibit halls. I could cover only a small fraction. In between I had to attend some meetings at work. Going to OOW is never "going away". I wish it was.

Sunday, October 11, 2009

OOW09 - RAC Performance Tuning

For all those who came to my session - many, many thanks. There is no better sight for a presenter than to see a roomful of attendees, especially with people standing near the walls. The fire marshal was not amused probably; but I was grateful. The harrowing incident of a blue screen of death on my PC - not just once but twice - just before the presentation was about to start was enough to throw me into a panic mode; but the third time was a charm. It worked. Phew!

You can download the presentation here. And while you are there, look around and download some more of my sessions as well.

Thanks a lot once again. I'm off the keynote now.

ACE Directors Product Briefing '09

One of the most valuable benefits of being an Oracle ACE Director is the briefings by Oracle Product Managers at the Oracle HQ. This year the briefing was on Friday Oct 9th at Oracle conference center rather than the customary Hilton Hotel.

While I was a little disappointed at the coverage of the database topics, I quickly recovered from the alphabet soup that makes up the netherworld of middleware and tools. However, a surprise visit by Thomas Kurian to address questions from the audience about the various product roadmaps was testimonial that Oracle is dead serious about the ACE Program. That proves the commitment Oracle has made for the user community - very heartening.

As always, Vikky Lira and Lillian Buziak did a wonderful job of organizing the event. Considering about 100 ACE Directors from 20+ countries, that is no small task. Perhaps the highlight of the organization was the detailed briefing sheets Lillian prepared for each one individually, down to what car service one takes and when - simply superb! No amount of thanks will be enough. From the bottom of my heart, thank you, Vikky and Lillian. And, thank you Justin Kestelyn - for kicking off and running the event year after year.

Open World 09 Starts

Oracle Open World 2009 has officially started with the User Group sessions today. I am presenting a session today. I started off by registering and getting my cool Blogger badge holder, hanging off the even cooler ACE Director lanyard.

I went off to the first session of today on the IOUG bucket - Workload Management by Alex Gorbachev. Alex is one of those people who know their stuff; so there is always something to be learned from there. Alex successfully demonstrated the difference between Connection Load Balancing and Server Side Listener Load Balancing, with pmon trace to show how the sessions are balanced. It sheds light on the question - why Oracle is not balancing the workload.

If you didn't attend this, you should definitely download the presentation and check it out later.

Thursday, September 03, 2009

ASM Dynamic Volume Manager and ASM Clustered File System

Two of the top features in 11gR2 are the ASM Dynamic Volume Manager (ADVM) and ASM Clustered File System (ACFS). What is the big deal about these two?

ADVM allows you to create a volume from an ASM diskgroup. Here is an example where we created a volume called asm_vol1 of 100 MB on a diskgroup called DATA:


ASMCMD [+] > volcreate -G DATA -s 100M asm_vol1

Internally it issues the command


alter diskgroup DATA add volume 'asm_vol1' size 100M;

Now you enable the volume you just created:


ASMCMD [+] > volenable -G DATA asm_vol1

Internally it issues:

alter diskgroup DATA enable volume 'asm_vol1';

You can perform other commands like resize, delete, disable; but more on that later on a full length article.

Now that the volume is created, what can you do with it. Well, like all volumes, you can create a filesystem on it. Here is an example of creating a FS called acfs1:


[root@oradba2 ~]# mkdir /acfs1
[root@oradba2 ~]# /sbin/mkfs -t acfs /dev/asm/asm_vol1-207
mkfs.acfs: version                   = 11.2.0.1.0.0
mkfs.acfs: on-disk version           = 39.0
mkfs.acfs: volume                    = /dev/asm/asm_vol1-207
mkfs.acfs: volume size               = 268435456


[root@oradba2 ~]# /sbin/acfsutil registry -a -f /dev/asm/asm_vol1-207 /acfs1
acfsutil registry: mount point /acfs1 successfully added to Oracle Registry

If you get an error, use the force option:


[root@oradba2 /]# /sbin/mkfs.acfs -f /dev/asm/asm_vol1-207
mkfs.acfs: version                   = 11.2.0.1.0.0
mkfs.acfs: on-disk version           = 39.0
mkfs.acfs: volume                    = /dev/asm/asm_vol1-207
mkfs.acfs: volume size               = 268435456
mkfs.acfs: Format complete.

Now you mount the the filesystem:


[root@oradba2 /]# /bin/mount -t acfs /dev/asm/asm_vol1-207 /acfs1

Now if you will check the filesystem, you will notice a new one - /acfs1


[root@oradba2 /]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                     149313708  22429752 119176924  16% /
/dev/sda1               101086     11765     84102  13% /boot
tmpfs                   907628    591640    315988  66% /dev/shm
/dev/asm/asm_vol1-207
                        262144     37632    224512  15% /acfs1

This new filesystem is actually carved out of the ASM diskspace. This can be used as a regular filesystem:


[root@oradba2 /]# cd /acfs1
[root@oradba2 acfs1]# ls
lost+found
[root@oradba2 acfs1]# touch 1

So, what't the big deal about it? Plenty.

First, this is part of the ASM management; so all the bells and whistles of ASM - such as asynch i/o, etc. - applies to this filesystem.

Second, this is now a "cluster" filesystem; it is visible across a cluster. So, now you have a fully functional generic filesystem visible across the cluster.

Third, this is now protected by the Grid infrastructure, if you have installed it. Remember from my earlier posts that in 11gR2 you can now have a restartable grid infrastructure even on a single instance.

More on ASM Dynamic Volume Manager later in a full length article. But I hope this makes you really interested.

Wednesday, September 02, 2009

Oracle 11g R2 Features

Continuing on the previous posts, here is another gee-whiz feature of 11gR2 - the "deinstall" feature. Yes, that's right the deinstall one. Sometimes installations fail; sometimes you have to deinstall something to clean out the server for other use. Sometimes, I did, you have to clean out beta code to install production code. A deinstall utility stops all the processes, removes all the relevant software and components (such as diskgroups), updates all config files and make all necessary modifications to the other files. All these are done without you ever bothering about remnants that may cause issues later.

You have to download the deinstall software from 11gR2 download from OTN. Choose "see all" to get to that software.

Here is the demonstration of the deinstall utility:


[oracle@oradba2 deinstall]$ ./deinstall -home /opt/oracle/product/11.2/grid1
ORACLE_HOME = /opt/oracle/product/11.2/grid1
Location of logs /opt/oracle/oraInventory/logs/

############ ORACLE DEINSTALL & DECONFIG TOOL START ############


######################## CHECK OPERATION START ########################
Install check configuration START


Checking for existence of the Oracle home location /opt/oracle/product/11.2/grid
1
Oracle Home type selected for de-install is: SIHA
Oracle Base selected for de-install is: /opt/oracle
Checking for existence of central inventory location /opt/oracle/oraInventory
Checking for existence of the Oracle Grid Infrastructure home /opt/oracle/produc
t/11.2/grid1

Install check configuration END

Traces log file: /opt/oracle/oraInventory/logs//crsdc.log

Network Configuration check config START

Network de-configuration trace file location: /opt/oracle/oraInventory/logs/netd
c_check22387.log

Specify all Oracle Restart enabled listeners that are to be de-configured [LISTE
NER]:

Network Configuration check config END

Asm Check Configuration START

ASM de-configuration trace file location: /opt/oracle/oraInventory/logs/asmcadc_
check22388.log

Automatic Storage Management (ASM) instance is detected in this Oracle home /opt/oracle/product/
11.2/grid1.
ASM Diagnostic Destination : /opt/oracle
ASM Diskgroups : +DATA1,+FRA1
Diskgroups will be dropped
De-configuring ASM will drop all the diskgroups and it's contents at cleanup time. This will aff
ect all of the databases and ACFS that use this ASM instance(s).

After some initial question and answer it shows a summary of activities and prompts you for a confirmation:


####################### CHECK OPERATION SUMMARY #######################
Oracle Grid Infrastructure Home is: /opt/oracle/product/11.2/grid1
The cluster node(s) on which the Oracle home exists are: (Please input nodes seperated by ",", eg: node1,node2,...)null
Oracle Home selected for de-install is: /opt/oracle/product/11.2/grid1
Inventory Location where the Oracle home registered is: /opt/oracle/oraInventory
Following Oracle Restart enabled listener(s) will be de-configured: LISTENER
ASM instance will be de-configured from this Oracle home
Do you want to continue (y - yes, n - no)? [n]: y
A log of this session will be written to: '/opt/oracle/oraInventory/logs/deinstall_deconfig2009-09-02_02-12-22-PM.out'
Any error messages from this session will be written to: '/opt/oracle/oraInventory/logs/deinstall_deconfig2009-09-02_02-12-22-PM.err'

After you press "y", it starts the operation of a clean deinstallation. The output continues as shown below:


######################## CLEAN OPERATION START ########################
ASM de-configuration trace file location: /opt/oracle/oraInventory/logs/asmcadc_clean22389.log
ASM Clean Configuration START
ASM deletion in progress. This operation may take few minutes.
ASM Clean Configuration END

Network Configuration clean config START

Network de-configuration trace file location: /opt/oracle/oraInventory/logs/netdc_clean22390.log

De-configuring Oracle Restart enabled listener(s): LISTENER

De-configuring listener: LISTENER
    Stopping listener: LISTENER
    Listener stopped successfully.
    Unregistering listener: LISTENER
    Listener unregistered successfully.
    Deleting listener: LISTENER
    Listener deleted successfully.
Listener de-configured successfully.

De-configuring Listener configuration file...
Listener configuration file de-configured successfully.

De-configuring Naming Methods configuration file...
Naming Methods configuration file de-configured successfully.

De-configuring backup files...
Backup files de-configured successfully.

The network configuration has been cleaned up successfully.

Network Configuration clean config END


---------------------------------------->

At some point you will be asked to shutdown cssd, etc. which need root privileges. The deinstall utility shows a comamnd string you can run as root to accomplish this task:


Run the following command as the root user or the administrator on node "oradba2".

/opt/oracle/software/11gR2/deinstall/perl/bin/perl -I/opt/oracle/software/11gR2/deinstall/perl/lib -I/opt/oracle/software/11gR2/deinstall/crs/install /opt/oracle/software/11gR2/deinstall/crs/install/roothas.pl -force  -delete -paramfile /opt/oracle/software/11gR2/deinstall/response/deinstall_Ora11g_gridinfrahome1.rsp

Press Enter after you finish running the above commands

Running the command on a different terminal as root:


[root@oradba2 ~]# /opt/oracle/software/11gR2/deinstall/perl/bin/perl -I/opt/oracle/software/11gR2/deinstall/perl/lib -I/opt/oracle/software/11gR2/deinstall/crs/install /opt/oracle/software/11gR2/deinstall/crs/install/roothas.pl -force  -delete -paramfile /opt/oracle/software/11gR2/deinstall/response/deinstall_Ora11g_gridinfrahome1.rsp
2009-09-02 14:20:57: Checking for super user privileges
2009-09-02 14:20:57: User has super user privileges
2009-09-02 14:20:57: Parsing the host name
Using configuration parameter file: /opt/oracle/software/11gR2/deinstall/response/deinstall_Ora11g_gridinfrahome1.rsp
CRS-2673: Attempting to stop 'ora.cssd' on 'oradba2'
CRS-2677: Stop of 'ora.cssd' on 'oradba2' succeeded
CRS-4549: Stopping resources.
CRS-2673: Attempting to stop 'ora.diskmon' on 'oradba2'
CRS-2677: Stop of 'ora.diskmon' on 'oradba2' succeeded
CRS-4133: Oracle High Availability Services has been stopped.
ACFS-9200: Supported
Successfully deconfigured Oracle Restart stack

Now going back to the original terminal where deinstall was called from, press Enter. The output continues:

Oracle Universal Installer clean START

Detach Oracle home '/opt/oracle/product/11.2/grid1' from the central inventory on the local node : Done

Delete directory '/opt/oracle/product/11.2/grid1' on the local node : Done

The Oracle Base directory '/opt/oracle' will not be removed on local node. The directory is in use by Oracle Home '/opt/oracle/product/10.2/db1'.

The Oracle Base directory '/opt/oracle' will not be removed on local node. The directory is in use by central inventory.

Oracle Universal Installer cleanup was successful.

Oracle Universal Installer clean END

Oracle install clean START

Clean install operation removing temporary directory '/tmp/install' on node 'oradba2'

Oracle install clean END

Moved default properties file /opt/oracle/software/11gR2/deinstall/response/deinstall_Ora11g_gridinfrahome1.rsp as /opt/oracle/software/11gR2/deinstall/response/deinstall_Ora11g_gridinfrahome1.rsp1

######################### CLEAN OPERATION END #########################

####################### CLEAN OPERATION SUMMARY #######################
ASM instance was de-configured successfully from the Oracle home
Following Oracle Restart enabled listener(s) were de-configured successfully: LISTENER
Oracle Restart was already stopped and de-configured on node "oradba2"
Oracle Restart is stopped and de-configured successfully.
Successfully detached Oracle home '/opt/oracle/product/11.2/grid1' from the central inventory on the local node.
Successfully deleted directory '/opt/oracle/product/11.2/grid1' on the local node.
Oracle Universal Installer cleanup was successful.

Oracle install successfully cleaned up the temporary directories.
#######################################################################

############# ORACLE DEINSTALL & DECONFIG TOOL END #############

The components are cleanly deinstalled now. The directories have been cleaned up by this tool.

This was available in 11gR1 as well; but R2 just makes it very user friendly.

More on this later.

Tuesday, September 01, 2009

Oracle 11g Release 2 is Finally Out

Finally, it's that time again - the birth of a new versionof Oracle - 11g Release 2. Being Release 2, it does not have as much bells and whistles as the 11g.

I downloaded it immediately and started installation. Some of the gee-whiz features of this release are:

(1) Editions
(2) ASM Filesystem
(3) Oracle Restart
(5) Columnar Compression

I have been beta testing this for some time; so I had seen previews of the release. Continuing the previous serieses, I will write the new features series for 11gR2 on OTN as well - it will be a 11 part series.

A little bit about Oracle Restart. It adds a lightweight clusterware functionality to a single instance database. If the instance crashes, OR brings it up, monitors it ans so on. And by the way, this is called "Grid Infrastructure". So you have to install two Oracle Homes - one each for grid and the rdbms.

When there is Grid, there is srvctl, of course. The grid infrastructure comes with srvctl. Here is how you check what is running from a specific Oracle Home:

oracle@oradba1 ~# srvctl status home -o /opt/oracle/product/11gR2/db1 -s state.txt
Database d112d1 is running on node oradba1

The above command create a file called state.txt.

oracle@oradba1 ~# cat state.txt
db-d112d1

It shows the database name - D112D1.

This is done on a single instance Oracle database; not a cluster. But the grid infrastructure looks and feels like a cluster. Here are some more commands to check status:

oracle@oradba1 ~# srvctl status listener
Listener LISTENER is enabled
Listener LISTENER is running on node(s): oradba1
oracle@oradba1 ~# srvctl status asm -a
ASM is running on oradba1
ASM is enabled.

A bunch of new processes suppor this grid infrastructure:


oracle   19046     1  0 18:13 ?        00:00:03 /opt/oracle/product/11gR2/grid1/bin/ohasd.bin reboot
oracle 19487 1 0 18:15 ? 00:01:14 /opt/oracle/product/11gR2/grid1/bin/oraagent.bin
oracle 19502 1 0 18:15 ? 00:00:01 /opt/oracle/product/11gR2/grid1/bin/tnslsnr LISTENER -inherit
oracle 19656 1 0 18:15 ? 00:00:01 /opt/oracle/product/11gR2/grid1/bin/cssdagent
oracle 19658 1 0 18:15 ? 00:00:02 /opt/oracle/product/11gR2/grid1/bin/orarootagent.bin
oracle 19674 1 0 18:15 ? 00:00:01 /opt/oracle/product/11gR2/grid1/bin/ocssd.bin
oracle 19687 1 0 18:15 ? 00:00:00 /opt/oracle/product/11gR2/grid1/bin/diskmon.bin -d -f

Let's see what happens when you kill the instance.


oracle@oradba1 ~# ps -aefgrep pmon
oracle   14225 13768  0 23:15 pts/7    00:00:00 grep pmon
oracle   19866     1  0 18:16 ?        00:00:00 asm_pmon_+ASM
oracle   26965     1  0 20:53 ?        00:00:00 ora_pmon_D112D1
oracle@oradba1 ~# kill -9 26965

This will, of course, crash the instance. Let's chck after some time:


oracle@oradba1 ~# ps -aef|grep pmon
oracle   14315     1  0 23:15 ?        00:00:00 ora_pmon_D112D1
oracle   14686 11492  0 23:17 pts/2    00:00:00 grep pmon
oracle   19866     1  0 18:16 ?        00:00:00 asm_pmon_+ASM

Where did the pmon come from? Didn't the instance just crash?

The instance was restarted by Oracle Restart.

What if you want to just keep the instance down, e.g. during a maintenance. Well, just shutdown normally; the instance will stay down. When you are ready, start the instance using either SQL*Plus or srvctl:

oracle@oradba1 ~# srvctl start database -d d112d1

Remember, D112D1 is a single instance database.

More on this later, on OTN.

Monday, May 11, 2009

Collaborate 09 Ends

I wrapped up a hectic three days at Collaborate this year. As if to echo the sentiments on the economy in general, and Oracle database technology related in particular, the attendance was way down this year. The vast expanse of Orange County Convention Center didn't make it easy either; it made it appear even smaller! Well, it was Orlando; the weather and traffic cooperated and I'm pretty sure some attendees picked up a few well deserved resting points either alone or with visiting family.

I presented three sessions:

(1) RAC Performance Tuning
(2) Real World Best Practices for DBAs (with Webcast)
(3) All About Deadlocks

I also participated in the RAC Experts' Panel as a Panelist.

As always, I used to opportunity to meet old friends and acquintances. Collaborate is all about knowledge sharing; I didn't lose when it came to getting some myself. I attended some highly useful sessions:

(1) Database Capacity Planning by Ashish Rege
(2) Partitioning for DW by Vincent
(3) Hints on Hints by Jonathan Lewis
(4) 11g Performance Tuning by Rich Niemic
(5) HA Directions from Oracle by Ashish Ray
(6) Storage Performance Diagnosis by Gaja
(7) Two sessions - on Database Replay and Breaking Oracle by Jeremiah Wilton

and some more. It has been a worthy conference.

Friday, March 13, 2009

Excellent Information on Exadata

If you are interested to know more about the technology (not the marketing hype) behind Exadata - the datawarehouse appliance from Oracle, I just discovered a treasure trove of information on this blog:
http://www.pythian.com/news/1267/interview-kevin-closson-on-the-oracle-exadata-storage-server#comment-350433

Kevin Closson, one the lead architects of the Exadata machine has been interviewed by Christo Kutovsky and Paul Paul VallÃ©e of Phythian. Kevin, in his usual detailed style explains the innards of the beast, aided by the excellent leading questioning by Christo.

I believe this is best writing I have seen on Exadata - all good stuff, no fluff. Thank you, Kevin.

Saturday, January 24, 2009

Ultra-Fast MV Alteration using Prebuilt Table Option

Here is an interesting question posed to me one time and I had found a solution. After 9 years, I encountered the same question and was shocked to find that many people still don't know about a little trick that could avoid a potential problem later.

Someone asked me how to modify a column of a Materialized View, e.g. from varchar2(20) to varchar2(25), or something similar. Drop and recreate? Not an option. We are talking about a several hundred GB MV with a very complex query that will take days to complete.

Problem

When you alter a materialized view to add a column or modify a column definition, unfortunately there is no command functionally equivalent to ALTER MATERIALIZED VIEW … ADD COLUMN. The only way to alter an MV is to completely drop and recreate it with the alteration. That approach may be acceptable for small MVs; but for larger MVs the cost of rebuilding can make the process quite infeasible. In addition to the time it will take to rebuild the entire MV (which could be days, depending on the size), the redo/undo generation and the surge in logical I/O due to the MV query may seriously affect the performance of the source database. In some cases, large MVs may even fail to be rebuilt as sometimes the undo segments may not have the undo information for long running queries – causing ORA-1555 errors.

So is there a better approach? Yes, there is. In this document I am going to explain a better approach for creating an MV that makes the alterations possible without rebuilding the MV – a task accomplished in mere seconds as opposed to potentially days.

Concept of Segments

Segments are stored units in Oracle. So, a table has a segment; not a view – since the contents of the view are not stored; only the view definition is. A Materialized View, however, stores the contents; so it is a segment.

Actually, the concept of segment goes a little bit further. If the table is partitioned, then each partition is a different segment. So, the relationship between tables and segments is one-to-many.

When you create an object that needs storage, such as a table, an MV or an index, Oracle first creates the corresponding segment. Once that is complete, the segment is shrouded by the cover of the object. The segment still continue to exist; but is now connected to the object. Until the segment is completely created and populated, the object technically does not exist. The segment may, in some cases, have a different name from the object. If the segment creation (or population) fails, Oracle automatically cleans up the remnants of the failed segment; but sometimes it may not be, leaving behind the chards that are eventually cleaned up by SMON process.

MVs and Segments

Anyway, how is this discussion about segments relevant to our objective here –the fast alteration of MViews?

Plenty. Remember, MVs are nothing but tables behind the covers? Property-wise, MVs and tables are like sisters, not even cousins. You can think of MVs are regular tables with some built in intelligence about how they were created (the defining query), how often they should be refreshed automatically by a job and how queries should be transformed to take advantage of the presence of the MVs. But apart from that, there is not much difference. You can directly insert into an MV, create indexes and so on. As far as a segment is concerned, there is no difference between an MV and a table. In fact Oracle stores the segment as a table:

SQL> select SEGMENT_TYPE
2 from user_segments
3 where SEGMENT_NAME = 'MV1';

SEGMENT_TYPE
------------------
TABLE

However, the biggest difference is the very issue we are discussing – you can’t add/modify columns of an MV while you can do that freely for a table. If I could attempt to logically represent tables and MVs, here is how it would look like.

The segment is the same. If it was created as an MV, the properties of MV take over the segment. If it was created as a table, the properties of a table take over the control.

Prebuilt Table

Since under the covers the segment is the same for both MV and table, can’t you take advantage of the fact? Suppose you have a table and you now want to convert that to an MV. In other words, you want to repoint that arrow initially pointed at the table to the MV properties:

Can you do it? Yes, of course you can. Since at the segment level it is the same, Oracle allows you to do it. When you create an MV, you can use a special clause ON PREBUILT TABLE. Here is how you create a MV in the regular approach:

create materialized view mv1
never refresh as
select cast(count (1) as number(10)) cnt from t1;

If you check the objects created:

SQL> select object_id, data_object_id, object_type
2 from user_objects
3 where object_name = 'MV1';

OBJECT_ID DATA_OBJECT_ID OBJECT_TYPE
---------- -------------- -------------------
74842 74842 TABLE
74843 MATERIALIZED VIEW

So, it creates two objects – a table and an MV - anyway. Note a very important difference though: the DATA_OBJECT_ID for the MV object is null. If you drop the MV and check for the objects:

SQL> drop materialized view mv1;

Materialized View dropped.

SQL> select object_id, data_object_id, object_type
2 from user_objects
3 where object_name = 'MV1';

no rows selected

Even though there were two objects – a table and an MV, when you dropped the MV, both were dropped. The table object didn’t have an independent existence. Dropping the MV drops the table automatically.

Now, in the modified approach, you first create the table in the same name as the MV you are going to create:

SQL> create table mv1 (cnt number(10));

Next you create the MV by adding a new clause called ON PREBUILT TABLE shown below:

create materialized view mv1
on prebuilt table
never refresh
as
select cast(count (1) as number(10)) cnt from t1;

Now there will be two objects as well – one table and one MV. The MV simply took over the command over the segment but since the table already existed, it did not recreate the table object. So there are still only 2 objects.

One concern: since you created the table manually, can you accidentally drop it? Let’s see:

SQL> drop table mv1;
drop table mv1
*
ERROR at line 1:
ORA-12083: must use DROP MATERIALIZED VIEW to drop "ARUP"."MV1"

That answers it. The table simply loses its independent existence. However, see what happens when you drop the MV:

SQL> DROP MATERIALIZED VIEW mv1;

Materialized view dropped.

Now check the segment:

SQL> select segment_type
2 from user_segments
3 where segment_name = 'MV1';

SEGMENT_TYPE
------------------
TABLE

The segment still exists! When you dropped the MV, the segment was not dropped; it simply reverted to being a table. You can confirm that by checking the objects view:

OBJECT_ID DATA_OBJECT_ID OBJECT_TYPE
---------- -------------- -------------------
77432 77432 TABLE

Voila! The object still exists as a table. Previously you saw dropping the MV removed all the objects and the segment. However, in this approach the segment was preserved. Since it reverted to a table, you can do all things possible in a table – select from it, create index, and - most important – modify the column. You can alter the column to make NUMBER(11).

SQL> alter table mv1 modify (cnt number(11));

Table altered.

Now, create the MV again:

create materialized view mv1
on prebuilt table
never refresh as
select cast(count (1) as number(11)) cnt from t1;

That’s it. The MV is altered. The whole process took about a few seconds, and since you didn’t have to recreate the segment, you saved enormous load on the database. Here a schematic representation of what happened.

Now you know how powerful prebuilt table option is. It only affects how you define the MV; nothing else. All other properties of the MV remain intact. The end users don’t even know about the prebuilt table option; but for the DBA it remains a powerful tool in the arsenal. As a best practice I recommend creating any MV, regardless of size, with the ON PREBUILT TABLE clause. In small tables you probably don’t see a huge advantage; but what if today’s small table grows to a large one tomorrow? It’s better to be safe than sorry.

Conversion to the New Approach

Now that you understand the power of the prebuilt option, you may be wondering how to convert the existing MVs to the new clause. Unfortunately there is no conversion path. You have to drop and recreate the MVs. That is why this time – when we are moving MVs to new tablespaces – we have the golden opportunity.

One approach is to create new tables with new names and then rename them. Here are the steps:

1. Create a table with nologging clause from the old MV
create table new_mv1
nologging
as
select * from mv1;

2. Capture the MV definition from the data dictionary:

select dbms_metadata.get_ddl ('MATERIALIZED_VIEW','MV1')
from dual ;

DBMS_METADATA.GET_DDL('MATERIALIZED_VIEW','MV1')
------------------------------------------------
CREATE MATERIALIZED VIEW "ARUP"."MV1" ("CNT")
ORGANIZATION HEAP PCTFREE 10
… and so on …

3. Spool this to a file to be executed later.

4. Edit this file to place ON PREBUILT TABLE CLAUSE.

CREATE MATERIALIZED VIEW "ARUP"."MV1" ("CNT")
ORGANIZATION HEAP ON PREBUILT TABLE PCTFREE 10

5. Take a Data Pump export with CONTENTS=METADATA_ONLY option. This creates all relevant privileges on the export dump file. Keep it aside.

6. Drop the Materialized View MV1.

7. Rename table NEW_MV1 to MV1

8. Execute the script you created earlier to recreate the MV.

9. Import the export dump file. It will recreate all the privileges.

This is slow; but the best approach since it generates minimum amount of redo and undo.

Hope this is helpful. You may look at an article I wrote http://www.dbazine.com/oracle/or-articles/nanda2 The article describes, with complete code, how to alter an MV where the refresh occurs across databases.

Tuesday, January 20, 2009

The other day, we had a serious issue in the ASM diskgroups - one diskgroup refused to come up because one disk was missing; but it was not clear from the message which of the 283 devices was missing. This underscores the difficulty in diagnosing ASM discovery related issues. In this post, I have tried to present a way to diagnose this sort of issues through a real example.

We had planned to move from one storage device to another (EMC DMX-3 to DMX-4) using SRDF/A technology. The new storage was attached to a new server. The idea was to replicate data at the storage level using SRDF/A. At the end of the replication process, we shut the database and ASM down and brought up the ASM instance in the newer storage on the new server. Since the copy was disk level, the disk signatures were intact and the ASM disks retained their identity from the older storage. So, when the ASM instance was started, it recognized all the disks and mounted all the diskgroups (10 of them) except one.

While bringing up a disk group called APP_DG3 on the new server it complained that disk number “1” is missing; but it was not clear which particular disk was. In this blog the situation was diagnosed and performed.

Note: the asm disk paths changed on the storage. This was not really a problem; since we could simply define a new asm_diskstring. Remember: the diskstring initialization parameter simply tells the ASM instance which disks should be looked at while discovering. Once those disks are identified, ASM looks at its signature on the disk headers to check the properties - the disk number, the diskgroup it belongs to, the capacity, version compatibilty and so on. So as long as the correct asm_diskstring init parameter is provided, ASM can readily discover the disks and get the correct names.

Diagnosis

This issue arises when ASM does not find all the disks required for that disk group. There could be several problems:

(i) the disk itself is physically not present
(ii) it’s present but not allowed to be read/write at the SAN level
(iii) it’s present but permissions not present in the OS
(iv) it’s present but the disk is not mapped properly; so the disk header shows something else. ASM knows the disk number, group, etc. from the disk header. If the disk header is not readable; or is not an ASM disk, the header will not reveal anything to ASM and hence will not mount.

If an ASM diskgroup is not mounted, the group_number for that disk shows “0”. If it’s mounted, the group number shows whatever the group number of the disk group is. Please note: the disk numbers are dynamic. So, APP_DG1 may have a group number “1” but the number may change to “2” after the next recycle.

Since the issue involved APP_DG3, I checked the group number for the group APP_DG3 from the production ASM (the old SAN on the old server) by issuing the query:

ASM> select group_number, name
2 from v$asm_diskgroup
3 where name = 'APP_DG3'
4 /

GROUP_NUMBER NAME
------------ ------------------------------
4 ASM_DG3

This shows the group number is 4 for the APP_DG3 group. I will use this information later during the analsysis.

On the current production server, I checked the devices and disk number of group number 4:

ASM> select disk_number, path
2 from v$asm_disk
3 where group_number = 4;

DISK_NUMBER PATH
----------- --------------------
0 /dev/rdsk/c83t7d0
1 /dev/rdsk/c83t13d5
2 /dev/rdsk/c83t7d2
… and so on …
53 /dev/rdsk/c83t7d1

54 rows selected.

On the new server, I listed out the disks not mounted by the disk groups. Knowing that disks belonging to an unmounted diskgroup show a group number 0, the following query pulls the information:

ASM> select disk_number, path
2 from v$asm_disk
3 where group_number = 0;

DISK_NUMBER PATH
----------- --------------------
0 /dev/rdsk/c110t1d1
2 /dev/rdsk/c110t1d2
3 /dev/rdsk/c110t1d3
… and so on …
254 /dev/rdsk/c110t6d7

54 rows selected.

Carefully study the output. The results did not show anything for disk number “1”. The disks were numbered 0 followed by 2, 3 and so on. The final disk was numbered “254”, instead of 54. So, the disk number “1” was not discovered by ASM.

From the output we know that production disk /dev/rdsk/c83t7d0 mapped to new server disk /dev/rdsk/c110t1d1, since they have the same disk# (“0”). For disk# 2, production disk /dev/rdsk/c83t7d2 is mapped to /dev/rdsk/c110t1d2 and so on. However, production disk /dev/rdsk/c83t13d5 is not mapped to anything on the new server, since there is no disk #1 on the new server.

Next I asked the Storage Admin what he mapped for disk /dev/rdsk/c83t13d5 from production. He mentioned a disk called c110t6d25.

I checked in the new server, if that disk is even visible:

ASM> select path
2 from v$asm_disk
3 where path = '/dev/rdsk/c110t6d25'
4 /

no rows selected

It confirmed my suspicion – ASM can’t even read the disk. Again, the reasons could any of the above mentioned ones - disk is not presented, does not have correct permission, etc.

In this case the physical disk was actually present and was owned by “oracle”; but not accessible to ASM. The issue was with SRDF not making the disk read/write. It was still in sync mode, preventing the disk to be enabled for writing. ASM couldn't open the disk in read write mode; so it rejected it as a member of any diskgroup and assigned it a default disk number 254.

After Storage Admin fixed the issue by making the disk read write, I re-issued the discovery:

ASM> select path
2 from v$asm_disk
3 where path = '/dev/rdsk/c110t6d25'
4 /

PATH
-------
/dev/rdsk/c110t6d25

It returned with a value. Now ASM can read it correctly. I mounted the disk:

ASM> alter diskgroup APP_DG3 mount;

It mounted successfully; because it got all the disks to make up the complete group.

After that the disk# 254 also went away. Now the disks showed 0, 1, 2, 3, … 53 for the group on both prod and the new server.

Monday, January 19, 2009

Making a Shell Variable Read Only

Being inherently lazy, I am always a sucker for shortcuts, neat tricks to cut my work and, most important, not to do the same thing again and again. Here is a tip I find useful.

Have you ever been frustrated to find that some line has changed some important shell variable such as ORACLE_BASE inside a shell script? The list of variables that are important to safety and efficiency of your shell is a long one - PS1, ORACLE_BASE, PATH, and so on. Using this little known command, you can easily "protect" a variable. The trick is to make it readonly. First, set the variable:

# export ORACLE_BASE=/opt/oracle

Then make it readonly:

# readonly ORACLE_BASE

Now if you want to set it:

# export ORACLE_BASE=/opt/oracle1
-bash: ORACLE_BASE: readonly variable

You can't. You can't even unset the variable:

# unset ORACLE_BASE
-bash: unset: ORACLE_BASE: cannot unset: readonly variable

This is a cool way to protect important variables.

To get a list of variables that are readonly, use

# declare -r
declare -ar BASH_VERSINFO='([0]="3" [1]="00" [2]="15" [3]="1" [4]="release" [5]="i386-redhat-linux-gnu")'
declare -ir EUID="500"
declare -rx ORACLE_BASE="/opt/oracle"
declare -ir PPID="13204"
declare -r SHELLOPTS="braceexpand:emacs:hashall:histexpand:history:interactive-comments:monitor"
declare -ir UID="500"

Unfortunately there is no comamnd to make it readwrite.

In the same way, you can also prevent a specific variable not to be set. LD_LIBRARY_PATH should not be set during some type of installations. To force it that way:

# export LD_LIBRARY_PATH
# readonly LD_LIBRARY_PATH

Now if you want to assign a value:

# export LD_LIBRARY_PATH=d
-bash: LD_LIBRARY_PATH: readonly variable

You will not be able to. You can also achieve the same goal by:

# declare -r LD_LIBRARY_PATH=

I hope you find it useful.

Pages