Tuesday, January 20, 2009

The other day, we had a serious issue with the ASM diskgroups - one diskgroup refused to come up because a disk was missing; but it was not clear from the message which of the 283 devices it was. This underscores the difficulty of diagnosing ASM discovery related issues. In this post, I have tried to present a way to diagnose this sort of issue through a real example.

We had planned to move from one storage device to another (EMC DMX-3 to DMX-4) using SRDF/A technology. The new storage was attached to a new server. The idea was to replicate data at the storage level using SRDF/A. At the end of the replication process, we shut the database and ASM down and brought up the ASM instance in the newer storage on the new server. Since the copy was disk level, the disk signatures were intact and the ASM disks retained their identity from the older storage. So, when the ASM instance was started, it recognized all the disks and mounted all the diskgroups (10 of them) except one.

While bringing up a disk group called APP_DG3 on the new server, ASM complained that disk number “1” was missing; but it was not clear which particular disk that was. This post shows how the situation was diagnosed and resolved.

Note: the ASM disk paths changed on the new storage. This was not really a problem, since we could simply define a new asm_diskstring. Remember: the asm_diskstring initialization parameter simply tells the ASM instance which disks to look at during discovery. Once those disks are identified, ASM reads the signature on the disk headers to check the properties - the disk number, the diskgroup it belongs to, the capacity, version compatibility and so on. So as long as the correct asm_diskstring init parameter is provided, ASM can readily discover the disks and get the correct names.

Diagnosis

This issue arises when ASM does not find all the disks required for that disk group. There could be several problems:

(i) the disk itself is physically not present
(ii) it’s present but not allowed to be read/write at the SAN level
(iii) it’s present but permissions not present in the OS
(iv) it’s present but the disk is not mapped properly, so the disk header shows something else. ASM learns the disk number, diskgroup and so on from the disk header; if the header is not readable, or is not an ASM disk header at all, it reveals nothing to ASM and the diskgroup will not mount.

If an ASM diskgroup is not mounted, the group_number for its disks shows “0”. If it’s mounted, the group number shows whatever the group number of the disk group is. Please note: the group numbers are dynamic. So, APP_DG1 may have a group number “1” now, but the number may change to “2” after the next recycle.

Since the issue involved APP_DG3, I checked the group number for the group APP_DG3 from the production ASM (the old SAN on the old server) by issuing the query:

ASM> select group_number, name
2 from v$asm_diskgroup
3 where name = 'APP_DG3'
4 /

GROUP_NUMBER NAME
------------ ------------------------------
4 APP_DG3

This shows the group number is 4 for the APP_DG3 group. I will use this information later during the analysis.

On the current production server, I checked the disk numbers and paths of the disks in group number 4:

ASM> select disk_number, path
2 from v$asm_disk
3 where group_number = 4;

DISK_NUMBER PATH
----------- --------------------
0 /dev/rdsk/c83t7d0
1 /dev/rdsk/c83t13d5
2 /dev/rdsk/c83t7d2
… and so on …
53 /dev/rdsk/c83t7d1

54 rows selected.

On the new server, I listed the disks that were not part of any mounted disk group. Knowing that disks belonging to an unmounted diskgroup show a group number of 0, the following query pulls the information:

ASM> select disk_number, path
2 from v$asm_disk
3 where group_number = 0;


DISK_NUMBER PATH
----------- --------------------
0 /dev/rdsk/c110t1d1
2 /dev/rdsk/c110t1d2
3 /dev/rdsk/c110t1d3
… and so on …
254 /dev/rdsk/c110t6d7

54 rows selected.

Carefully study the output. The results did not show anything for disk number “1”; the disks were numbered 0 and then 2, 3 and so on. Also, the final disk was numbered “254” instead of the expected 53. So disk number “1” was not discovered by ASM.

From the output we know that production disk /dev/rdsk/c83t7d0 mapped to new server disk /dev/rdsk/c110t1d1, since they have the same disk# (“0”). For disk# 2, production disk /dev/rdsk/c83t7d2 is mapped to /dev/rdsk/c110t1d2 and so on. However, production disk /dev/rdsk/c83t13d5 is not mapped to anything on the new server, since there is no disk #1 on the new server.
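This comparison is mechanical enough to script. A sketch of the same logic, fed with {disk_number: path} maps built from the two query outputs (only the sample rows shown above, not the full 54):

```python
def find_missing(prod, new):
    """Given {disk_number: path} maps from the old and new servers,
    return the production entries whose disk numbers never showed up
    on the new server - i.e. the disks ASM failed to discover."""
    return {n: p for n, p in prod.items() if n not in new}

prod = {0: '/dev/rdsk/c83t7d0', 1: '/dev/rdsk/c83t13d5', 2: '/dev/rdsk/c83t7d2'}
new  = {0: '/dev/rdsk/c110t1d1', 2: '/dev/rdsk/c110t1d2'}
print(find_missing(prod, new))   # {1: '/dev/rdsk/c83t13d5'}
```

With 54 disks per group and ten groups, scripting this beats eyeballing the two listings.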

Next I asked the Storage Admin what he mapped for disk /dev/rdsk/c83t13d5 from production. He mentioned a disk called c110t6d25.

I checked in the new server, if that disk is even visible:

ASM> select path
2 from v$asm_disk
3 where path = '/dev/rdsk/c110t6d25'
4 /

no rows selected

It confirmed my suspicion - ASM couldn't even read the disk. Again, the reason could be any of the ones mentioned above - the disk was not presented, it did not have the correct permissions, and so on.

In this case the physical disk was actually present and was owned by “oracle”, but it was not accessible to ASM. The issue was with SRDF not making the disk read/write. It was still in sync mode, which prevented the disk from being opened for writing. Since ASM couldn't open the disk in read-write mode, it rejected it as a member of any diskgroup and assigned it a default disk number of 254.

After the Storage Admin fixed the issue by making the disk read/write, I re-issued the discovery:

ASM> select path
2 from v$asm_disk
3 where path = '/dev/rdsk/c110t6d25'
4 /

PATH
-------
/dev/rdsk/c110t6d25

It returned a row. Now ASM could read the disk correctly, so I mounted the disk group:

ASM> alter diskgroup APP_DG3 mount;

It mounted successfully, because ASM now had all the disks that make up the complete group.

After that the disk# 254 also went away. Now the disks showed 0, 1, 2, 3, … 53 for the group on both prod and the new server.