
Monday, September 01, 2008

Magic of Block Change Tracking

The other day, one of the instances of our main 3-node RAC database crashed due to an I/O error. Apparently there were many "dead paths" from the host to the SAN, which made some LUNs in the ASM diskgroup unrecognizable by the ASM instance. Why? Well, that's the topic for another blog. The point of this one has nothing to do with the cause. All I want to convey is that there was a LUN problem on one of the nodes which brought the instance down. Being RAC, the database was serviced from the other two nodes - praise God for RAC! - and the users barely noticed it (or, I would like to think so).

After a few days we noticed the incremental RMAN backup taking a long time. This caused major issues - it ran long and I/O waits went through the roof. In fact it took increasingly longer every day after that unfortunate collapse of the node. Everyone was quite intrigued - what could be the connection between an instance crash and a slow incremental backup? All sorts of theories cropped up - from failed HBA cards to undiscovered RAC bugs.
This is where I got involved. The following chronicles the diagnosis of the issue and the resolution.

First, the increased time is obviously a result of the incremental backups doing more work, i.e. processing more changed blocks. What caused so many changed blocks? Interviews with stakeholders yielded no clear answer - there was absolutely no reason for increased activity. Since we are doing proper research, I decided to start with the facts. How many extra blocks were the incrementals actually processing?

I started with this simple query:

select completion_time, datafile_blocks, blocks_read, blocks
from v$backup_datafile
where file# = 1
order by 1
/

Output:

COMPLETIO DATAFILE_BLOCKS BLOCKS_READ BLOCKS
--------- --------------- ----------- ----------
18-JUL-08 524288 32023 31713
19-JUL-08 524288 11652 10960
20-JUL-08 524288 524288 12764
21-JUL-08 524288 524288 5612
22-JUL-08 524288 524288 11089


The columns are:

DATAFILE_BLOCKS - the number of blocks in the datafile at that time
BLOCKS_READ - the exact number of blocks the RMAN incremental backup read
BLOCKS - the number of blocks it actually backed up

From the above output, a pattern emerges - until Jul 20th, the backup read only a few blocks; but on July 20th, it started scanning the entire file - all the blocks! I checked a few other datafiles and the story was the same everywhere. With a 4.5 TB database, if the incremental backup reads the datafiles in their entirety, I/O will obviously go for a toss. That explains the I/O and the elapsed time.

But why did RMAN switch from reading a few blocks to the whole file on that day? The #1 suspect was Block Change Tracking. This 10g feature allows RMAN to scan only the changed blocks instead of the entire file, and we use it. So, did something happen to make it disappear?
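As a side note, whether BCT is enabled at all on a database can be checked straight from the V$BLOCK_CHANGE_TRACKING view - a quick sanity check worth running alongside the backup history (a small sketch, not part of the original diagnosis):

select status, filename, bytes
from v$block_change_tracking
/

STATUS shows ENABLED or DISABLED, and FILENAME shows where the tracking file lives.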

To answer that, I issued a modified query:

select completion_time, datafile_blocks, blocks_read, blocks, used_change_tracking
from v$backup_datafile
where file# = 1
order by 1
/


Output:

COMPLETIO DATAFILE_BLOCKS BLOCKS_READ BLOCKS USE
--------- --------------- ----------- ---------- ---
18-JUL-08 524288 32023 31713 YES
19-JUL-08 524288 11652 10960 YES
20-JUL-08 524288 524288 12764 NO
21-JUL-08 524288 524288 5612 NO
22-JUL-08 524288 524288 11089 NO


Bingo! BCT use ceased on July 20th. That was what caused the whole file to be scanned. But why was it stopped? No one had actually stopped it.

Investigating further, I found the following in the alert log of Node 1:

Sun Jul 20 00:23:52 2008
CHANGE TRACKING ERROR in another instance, disabling change tracking
Block change tracking service stopping.


From Node 2:

Sun Jul 20 00:23:51 2008
CHANGE TRACKING ERROR in another instance, disabling change tracking
Block change tracking service stopping.

Alert log of Node 3 showed the issue:

Sun Jul 20 00:23:50 2008
Unexpected communication failure with ASM instance:
ORA-12549: TNS:operating system resource quota exceeded
CHANGE TRACKING ERROR 19755, disabling change tracking
Sun Jul 20 00:23:50 2008
Errors in file /xxx/oracle/admin/XXXX/bdump/xxx3_ctwr_20729.trc:
ORA-19755: could not open change tracking file
ORA-19750: change tracking file: '+DG1/change_tracking.dbf'
ORA-17503: ksfdopn:2 Failed to open file +DG1/change_tracking.dbf
ORA-12549: TNS:operating system resource quota exceeded
Block change tracking service stopping.

The last message shows the true error. The error was "operating system resource quota exceeded", which made the diskgroup unavailable. Since the ASM diskgroup was down, all the files in it were unavailable as well, including the BCT file. Surprisingly, Oracle decided to stop BCT altogether rather than report it as a problem and let the user decide the next steps. So block change tracking was silently disabled and the DBAs didn't get a hint of it. Ouch!

Resolution

Well, now that we had discovered the issue, we took the necessary steps to correct it. Because of the usual change control process, it took some time to have the change approved and put in place. We executed the following to re-create the BCT file:

alter database enable block change tracking using file '+DG1/change_tracking.dbf'

The entry in the alert log confirms it (on all nodes):

Block change tracking file is current.

But this did not solve the issue completely. To benefit from block change tracking, there has to be a baseline, which is generally a full backup. We never take a full backup. We always take an incremental image copy and then merge it into a full backup at a separate location. So, the first order of business was to take a full backup. After that we immediately took an incremental. It took just about an hour, down from some 18+ hours earlier.
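For context, the incremental-merge approach mentioned above is the standard RMAN incrementally updated backup pattern. A minimal sketch, assuming a tag named incr_merge (the tag is purely illustrative, not our actual script), looks like this:

run {
  recover copy of database with tag 'incr_merge';
  backup incremental level 1
    for recover of copy with tag 'incr_merge' database;
}

The first run creates the image copy baseline; later runs roll earlier incrementals into the copy and then take a fresh level 1.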

Here is some analysis, looking at the backup of just one file - file# 1, i.e. the SYSTEM datafile:

select COMPLETION_TIME, USED_CHANGE_TRACKING, BLOCKS, BLOCKS_READ
from v$backup_datafile
where file# = 1
order by 1
/

The output:

COMPLETIO USE BLOCKS BLOCKS_READ
--------- --- ---------- -----------
18-AUG-08 NO 31713 524288
18-AUG-08 NO 10960 524288
20-AUG-08 NO 12764 524288
21-AUG-08 NO 5612 524288
22-AUG-08 NO 11089 524288
23-AUG-08 NO 8217 524288
23-AUG-08 NO 8025 524288
25-AUG-08 NO 3230 524288
26-AUG-08 NO 6629 524288
27-AUG-08 NO 11094 524288 <= the filesize was increased
28-AUG-08 NO 3608 786432
29-AUG-08 NO 8199 786432
29-AUG-08 NO 12893 786432
31-AUG-08 YES 1798 6055
01-SEP-08 YES 7664 35411

Column descriptions:

USE - was Block Change Tracking used?
BLOCKS - the number of blocks backed up
BLOCKS_READ - the number of blocks read by the backup

Note that when BCT was not used, the *entire* file - 524288 blocks - was being read every time. Of course only a small percentage of that was actually backed up, since only that many blocks had changed; but the whole file was being checked. After BCT, note how the "blocks read" number dropped dramatically. That is the magic behind the dropped time.

I wanted to find out exactly how much I/O savings BCT was bringing us. A simple query would show that:

select sum(BLOCKS_READ)/sum(DATAFILE_BLOCKS)
from v$backup_datafile
where USED_CHANGE_TRACKING = 'YES'
/

The output:
.09581342

That's just 9.58%. After BCT, only 9.58% of the blocks of the datafiles were scanned! Consider the impact of that. Before BCT, the entire file was scanned for changed blocks. After BCT, only about 9.58% of the blocks were scanned for changed blocks. Just 9.58%. How sweet is that?!!!
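A per-file breakdown can be pulled with something along these lines (a sketch, not necessarily the exact query I ran):

select file#,
       sum(blocks_read) blocks_read,
       max(datafile_blocks) datafile_blocks,
       round(100*sum(blocks_read)/max(datafile_blocks),3) pct_read
from v$backup_datafile
where used_change_tracking = 'YES'
group by file#
order by file#
/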

Here are three representative files:

File# Blocks Read Actual # of blocks Pct Read
----- ------------- ------------------- --------
985 109 1254400 .009
986 1 786432 .000
987 1 1048576 .000


Note that files 986 and 987 were virtually unread (only one block each was read). Before BCT, all 1048576 blocks were read; after BCT, only 1 was. This makes perfect sense: these files hold essentially older data, so nothing changes there. RMAN incremental is now blazing fast because it scans less than 10% of the blocks. The I/O problem disappeared too, making the overall database performance even better.

So, we started with a seemingly random I/O issue, which caused a node failure, which led to increased incremental backup times, which was eventually tracked down to block change tracking having been silently disabled by Oracle without raising any alarm.

Takeaways:

The single biggest takeaway is that just because block change tracking is defined, don't assume it will always be in use. So, a periodic check for the use of the BCT file is a must. I will work on developing an automated tool to check for non-use of the BCT file. The tool will essentially issue:

SELECT count(1)
FROM v$backup_datafile
where USED_CHANGE_TRACKING = 'NO'
/

If the output is greater than zero, then an alert should be issued. A minimal sketch of what such a check could look like is below; the full tool is material for the next blog.
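This sketch is only illustrative - the seven-day window and the error number are assumptions, not part of any finished tool:

declare
  l_cnt number;
begin
  -- count recent datafile backup records that did not use block change tracking
  select count(1)
    into l_cnt
    from v$backup_datafile
   where used_change_tracking = 'NO'
     and completion_time > sysdate - 7;
  -- raise an alert if any backup bypassed BCT
  if l_cnt > 0 then
    raise_application_error(-20001,
      'BCT was not used by ' || l_cnt || ' recent datafile backup(s)');
  end if;
end;
/

Thanks for reading.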

Thursday, August 07, 2008

Full Table Scans on RAC

The phone goes off, yet again, and this time it seems to be a serious problem. Here are the symptoms: all the applications seem to be working fine, albeit at a lower performance level, except one app, which has crawled to a halt. The app has abended since its queries are timing out. This is, by the way, a RAC database.

This is rather interesting - only one application is slow, instead of all of them. If this were a systemwide issue, then all the apps would have been affected, not just one. The DBA checked the wait events, as she should have, and the predominant event was "gc buffer busy". Well, that points to a systemwide issue such as a saturated interconnect, doesn't it, she muses; then how come application A is the only one affected?

So, this is a big clue: the waits are RAC related - "gc buffer busy", yet only one application is slow. First, I got the SIDs of the sessions waiting for that event:

select SID, sql_id, sql_child_number
from v$session
where event = 'gc buffer busy';

After I got the SIDs, I checked the SQL statements issued by the sessions:

select sql_text
from v$sql
where sql_id = '9h1r67dtjxfbs'
and child_number = 24;

The results were interesting - all of them were updating or selecting from a single table. Next I checked the execution plan for the SQLs:

select * from table (
dbms_xplan.display_cursor (
'9h1r67dtjxfbs',24,'BASIC'
))
/

PLAN_TABLE_OUTPUT
------------------------------------------
EXPLAINED SQL STATEMENT:
------------------------
select ...

Plan hash value: 871755554
---------------------------------------------------------------------------------
Id Operation Name
---------------------------------------------------------------------------------
0 SELECT STATEMENT


The output tells it all - there is a full table scan going on for a 14 million row table. Well, why? I checked for the presence of indexes and yes, there is an index. Why isn't the index being used?

This is yet another case of "why is my index not being used". It happens less frequently than the other causes; but it happens nevertheless.

Upon examination I discovered that the table is range partitioned and the index is local. Next I checked whether the index had become unusable somehow:

select partition_name, status
from dba_ind_partitions
where index_name = 'the_index_name';


PARTITION_NAME STATUS
------------------------------ --------

PT_SEP2009 USABLE
PT_OCT2009 UNUSABLE
PMAX UNUSABLE

Note the status of the last two partitions - UNUSABLE. Why so?

I checked the last_ddl_time:

select last_ddl_time
from dba_objects
where object_name = 'the_index_name';

And voila! The index had been altered that very day. Why? The answer came quickly: there is a partition maintenance script that splits the PMAX partition into PMAX and a new partition named PT_OCT2009. The split worked fine; but the DBA forgot to add a line like:

alter index this_index_name rebuild partition PMAX;

So, the partition remained unusable. Remember, this is a local index, so the clause UPDATE GLOBAL INDEXES will not do anything for it; the local index partition must be explicitly rebuilt. Since the partition was not usable, the query plan suddenly changed to a full table scan, as I saw earlier. Since this caused a lot of buffer gets, the buffer pools of all the instances of the RAC cluster got flooded with the blocks of this table, causing gc buffer busy waits. So, even though the wait event is RAC related, the issue was due to abnormally large buffer gets and was limited to a specific application only. Eventually, the issue would have saturated the interconnect and affected other apps; but since we caught it early, that was averted.

I rebuilt the index partitions by explicitly issuing the alter index commands. After that, the queries started using the index and, of course, the gc waits disappeared.
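For the record, the rebuild commands were along these lines (the index name is a placeholder, just as in the listings above):

alter index this_index_name rebuild partition PT_OCT2009;
alter index this_index_name rebuild partition PMAX;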

Lessons:

(1) Just because you see "gc" waits, don't automatically assume it is a RAC-specific issue. The underlying problem may simply be manifesting itself through the RAC layer.
(2) Add one more entry to the list of "why is Oracle not using my index": unusable index partition(s). (A quick check for this is sketched below.)
(3) When you split a partition of a table, in addition to the clause UPDATE GLOBAL INDEXES, rebuild the split partitions of the local indexes.
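A sketch of that check for lesson (2), worth running after any partition maintenance:

select index_owner, index_name, partition_name, status
from dba_ind_partitions
where status = 'UNUSABLE'
/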

Monday, August 04, 2008

Issues in RAC with Wrong Listener Configuration

The one thing that frustrates me the most is the lack of clear communication. Without clarity in what is being conveyed, it's impossible for the receiver to understand the meaning - not just the superficial content but the subtle undertones as well - and it becomes impossible to act on it in any manageable way. I could be rambling about humans, of course; but the issue of communication goes beyond that - to the things we create: writings, books, articles, poems, software programs, whatever. Oh, did I mention software programs?

Well, in this case I am talking about the software programs we all live by - the Oracle database; more specifically the RAC. Recently we suffered an outage. The reason turned out to be a simple issue of human forgetfulness; but the frustrating part was the lack of communication from the CRS about the issue. So, here is how it went.

This is a three-node RAC. We applied the RAC bundle patch in a rolling fashion. Node 3 was brought down, the patch applied and the node brought back up - everything was hunky-dory.

Then node 2 was brought down, patched and brought back up; but here was a little problem - the VIP didn't come up. The VIP relocated to node 3. Yet all the 23 services defined on the node were online. Strange. We assumed the issue was with applying the bundle patch in a rolling manner; so we decided to go ahead and do the deed on node 1, the remaining one.

When node 1 was brought up, we saw the real issues. The VIP for node 1 came up on node 3 and none of the services were up on node 1. Worse, the services that were up on node 2 suddenly went offline.

A lot of frantic calls to Oracle Support and a request for escalation to the duty manager followed, along with the customary yelling and sighing by the customer. At the end of 2 hours and with a Sev 1 SR, we hadn't gotten anywhere.

At this point I took an executive decision. I asked the DBA to issue an ALTER SYSTEM command on node 2 to bring up the services manually, bypassing CRS:

ALTER SYSTEM SET SERVICE_NAMES = 'X','Y','Z',....

All the services were up on instance 2 at this point, as confirmed by the SHOW PARAMETER SERVICE_NAMES command; but they still showed up as OFFLINE in the crs_stat output. That meant the instance had no issue with the services; the CRS did.

Then we issued an lsnrctl reload command to reload the listener (note, we didn't use CRS to manipulate the listener - a very important clue, as you will see later). The listener started fine with all the service names. So far so good. We tried a few connections from the client and they connected fine. We did the same for node 1 and it worked there too.

So, at this time, we had the services all up and running; but CRS had no idea about them. That meant the service failover probably wouldn't occur.

Next, we relocated the VIPs using the crs_relocate command. All the VIPs (of node 1 and node 2) came back to their respective nodes. That was strange as well. If there had been an issue with the VIPs related to the networking software or hardware, how could they come back?

All this time we analyzed all the logs - the CSS, CRS and RACG logs along with the alert log of the CRS (this is 10.2, so we have an alert log for CRS as well). There was nothing there to indicate why it happened. And that's the basis of my rambling about lack of communication.

So, as always, we went through the changes of the last few weeks. One of the crucial changes was the introduction of the local_listener and remote_listener parameters in the init.ora. These parameters can be specified as absolute values - with the host names, ports, etc. - or as TNS alias entries. We had chosen the latter. The local_listener parameter was fine; but the remote_listener was set to listeners_xxx.

Here was the entry for listeners_xxx in tnsnames.ora:

listeners_xxx =
  (address_list =
    (address = (protocol=tcp)(host=sxxx1)(port=1521))
    (address = (protocol=tcp)(host=sxxx2)(port=1521))
    (address = (protocol=tcp)(host=sxxx3)(port=1521))
  )

There is nothing wrong with the syntax; but note the host values: sxxx1, sxxx2 and sxxx3 are all staging servers, and this is a production cluster. Communication does not work between prod and staging. So the remote_listener parameter pointed to hosts that were not even reachable! By the way, the entries were correct on node 3.

This caused the services to fail when started. Well, then how come they came up fine with ALTER SYSTEM?

My guess is since the ALTER SYSTEM command didn't go through the listener, it was not affected. CRS, on the other hand, goes through the listener.

Fine; but how come my lsnrctl reload command worked? Again, my guess is that the command simply reloads the listener's own configuration; the remote_listener parameter in the database doesn't affect it. Since the local_listener parameter was correct, the services were successfully registered.
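As an aside, once local_listener is correct, service registration can also be nudged from the database side without touching the listener at all (we relied on the listener reload, so this is just a side note):

alter system register;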

What about the VIP? How did the relocate work? I can only guess, of course. The conjecture is that the VIP is forced back to its preferred node unless there is some sort of hardware issue, which was not the case here. So the relocate worked.
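The fix itself, of course, was trivial - repoint the alias at the production hosts, something like the following (the pxxx host names are made up for illustration; the real production names are different):

listeners_xxx =
  (address_list =
    (address = (protocol=tcp)(host=pxxx1)(port=1521))
    (address = (protocol=tcp)(host=pxxx2)(port=1521))
    (address = (protocol=tcp)(host=pxxx3)(port=1521))
  )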

Phew! All this because of sloppy work by a DBA. While that is deplorable, human error will continue to exist. Typos and wrong copy-and-paste will happen, no matter how severe we make the punishment. I strongly feel CRS should have pointed out this error somewhere. Of all that gets spewed out in the CRS and CSS logs, how much is actually decipherable? Almost 99% has no meaning attached to it, at least for mere mortals. The idea behind the alert log of the CRS was to show the customer relevant issues in a common log; but there was no indication of this issue.

See the point about communication?
