Sunday, December 12, 2010

100 Things You Probably Didn't Know About Oracle Database

Recently, while delivering a presentation on Cache Fusion at the New York Oracle Users Group (www.nyoug.org), the regional user group where I have been a long-time member, I was surprised to hear from many participants beliefs they had held for a long time that were completely wrong. I always thought these things were as obvious as they come; but of course I was dead wrong. What was even more surprising was that most of these believers were veterans in Oracle Database technologies, not newbies. Part of the problem, I think, lies with a system that focuses on execution rather than learning; part of it is due to the lack of clear documentation. During that discussion some encouraged me to write about these topics. I immediately agreed it was a great idea and merited serious attention. Here is the product: my attempt at explaining some of the “mysteries” of how Oracle Database operates. I will cover 100 such nuggets of information, roughly one a week.

Before you start, however, I would like to bring your attention to an important point. You may already be aware of these facts; it is reasonable to believe that a vast majority of readers will be. Under no circumstances am I claiming these to be groundbreaking or awe-inspiring. If you are already familiar with this specific knowledge, I am not at all surprised; please feel free to skip ahead. For those who read on, I hope you find these helpful and will take a moment to write to me with your reactions.

Part 1: Myth of Commit Causing Buffer to be Flushed to the Disk

Consider a scenario: In the EMP table I updated my salary from 1000 to 2000; and committed immediately. The very instant after I issued the commit, if I check the datafile on disk of the tablespace where this table is located, which value should I see – 1000 or 2000? (Remember, the value was committed.)

Did you answer 2000 (perhaps because it was committed)? If so, then consider a normal application where commits are issued up to tens of thousands of times every minute. In a traditional database the weakest link in the chain is always I/O. If Oracle issued an update of the datafile every time someone commits, it would grind to a halt.

Did you answer 1000? Well, in that case, consider a case when the instance crashes. The datafile would have had 1000; not 2000 – the value that was committed. In such a case the instance must bring back the value committed (2000, in this case) to the datafile. How?

Let’s examine a different scenario. Suppose I did not issue a commit after the update (perhaps I was not sure of the implications of giving myself a pay hike, or perhaps I had a pang of conscience). I left the session as is and left for the day. The next day I was sick and didn’t come to work. Twenty-four hours have passed since I updated the record. At that point, if someone reads the datafile, what value would they see – 1000 or 2000?

Did you answer 1000? That is the logical choice, since the read consistency model of Oracle Database guarantees that other sessions will see the pre-change data for uncommitted transactions.

Question #3 in this scenario: if you check the redo log file (not the datafile), what value will you find there – 1000 or 2000? Remember, the change has not been committed. Did you answer 1000? It sort of makes sense; the changes are not committed, so there is no reason for them to be in the redo log file, which is a very important part of the recovery process. If you answered 2000, then how would you explain the recovery process? In case of instance failure, recovery must read the redo log file, and since the transaction was not committed, it must roll the value back to the previous one, 1000. How would it do that if the redo log file contains 2000, not 1000?

The answers, if you can’t wait any longer: 1000 for the first question, 2000 for the second and 2000 for the third. How so? Please read on.

Explanation

To understand the mechanics of the process, let’s go over the buffer management process of the Oracle database. It’s a rudimentary detail, but quite vital to understanding the myth here. Consider a very small table in an equally small tablespace we created:


SQL> create tablespace testts datafile '/tmp/testts_01.dbf' size 1M;

SQL> create table mythbuster1 (col1 varchar2(200)) tablespace testts;


Insert a row:


SQL> insert into mythbuster1 values ('ORIGINAL_VALUE');
SQL> commit;

Shut down and restart the database so that the buffer cache is completely devoid of this table. You could also issue ALTER SYSTEM FLUSH BUFFER_CACHE; but I want to make sure all traces of this table (and the value of the column inside it) vanish from all memory areas: buffer cache, shared pool, PGA, whatever. You can now check for the presence of the value in the datafile:

$ strings /tmp/testts_01.dbf
}|{z
-N?pD112D2
TESTTS
 1j)
 w>!
ORIGINAL_VALUE

The value is right there. Now suppose a user issues a statement like this from SQL*Plus:
SQL> select * from mythbuster1;

Oracle creates a process, called a “server process”, on behalf of this user session to service the requests from the session. This process is named, on Unix and similar OSes, oracle. Here is how you can find it:



$ ps -aef|grep sqlplus
oracle   14257 14214  0 13:42 pts/2    00:00:00 sqlplus   as sysdba
$ ps -aef | grep 14257
oracle   14257 14214  0 13:42 pts/2    00:00:00 sqlplus   as sysdba
oracle   14258 14257  0 13:42 ?        00:00:00 oracleD112D2 DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
The process 14258 is the server process. The SQL*Plus process is known as the user process, which can be any process a user executes: a Java program, a Pro*C program, a TOAD session and so on. It’s the server process that handles all interaction with the Oracle database, not the user process. This is why Oracle database interaction is said to be based on a Two-Task Architecture: there are always two tasks, the user task that a regular user has written and the server task that performs the database operations. This is an important concept established in the early foundations of the Oracle database to protect the database from errant code in the user task, introduced either maliciously or inadvertently.

The server process then identifies the block in which the row exists. Since the database instance just came up, the buffer cache is empty and the block will not be found there. Therefore the server process issues a read call against the datafile for that specific block, and the block is read from disk into the buffer cache. Until the loading of the block from disk to the buffer cache is complete, the session waits on the event db file scattered read, because in this case the session issued a full table scan. Had it performed an index scan, the session would have waited on the event db file sequential read. [I know, I know; it seems to defy conventional logic a little bit. I would have assumed index scans to be named scattered reads and full table scans sequential.]
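If you want to catch the wait event in action, you can watch the scanning session from a second session. A quick sketch, assuming access to the V$ views (the SID 123 is a placeholder for the scanning session's SID):

SQL> select sid, event, state
  2  from v$session_wait
  3  where sid = 123;

While the full scan is reading from disk, the EVENT column will show db file scattered read; for an index-driven read it will show db file sequential read.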

Once this process is complete, the buffer cache holds a copy of the block of the table mythbuster1. Subsequent sessions, if they select from the table, will simply get the data from this buffer, not from the disk.
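You can verify that the blocks are now cached. One way to sketch it, assuming you have access to V$BH and DBA_OBJECTS:

SQL> select b.status, count(*)
  2  from v$bh b, dba_objects o
  3  where b.objd = o.data_object_id
  4    and o.object_name = 'MYTHBUSTER1'
  5  group by b.status;

After the full scan, this should report at least one buffer for the table; run it again after a bounce and the rows disappear.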

Now, suppose the session issues the statement:

SQL> update mythbuster1 set col1 = 'CHANGED_VALUE';
And commits:
SQL> commit;

Immediately afterwards, check for the presence of the values in the datafile:
$ strings /tmp/testts_01.dbf
}|{z
-N?pD112D2
TESTTS
 1j)
 w>!
ORIGINAL_VALUE

The old value, not the new one, is found: the datafile on disk still has the old value even though the transaction has been committed. The update statement actually updated only the buffer in the cache, not the disk. So, when is the data on the disk updated?

The datafile gets updated by a process known as the Database Writer (a.k.a. Database Buffer Writer), named DBW0. Actually, there may be more than one such process; they are named DBW0, DBW1, etc., more conveniently addressed as DBWn. For the purpose of the discussion here, let’s assume only one process, DBW0. It has only one responsibility: to update the datafiles with the most up-to-date buffers from the buffer caches. [Note: I wrote buffer caches, plural. This is not a typo. There may be more than one buffer cache in the database (keep, recycle, default and other block sizes) but that’s a topic for another day.] A buffer that has been updated is known as a dirty buffer, since its contents differ from the block on disk. The DBW0 process writes the contents of the buffer to the disk, making it clean again.
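You can see both facts for yourself; a quick sketch, assuming access to V$BH:

SQL> show parameter db_writer_processes

SQL> select dirty, count(*)
  2  from v$bh
  3  group by dirty;

DIRTY = 'Y' marks buffers whose contents differ from the blocks on disk; after a checkpoint the count of dirty buffers drops.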

But the big question is: when does DBW0 write the dirty buffers to the disk? Ah, that’s the very question we are pondering here. There are several “triggering” events that cause DBW0 to copy the buffers to the disk, also called flushing the buffers. By the way, DBW0 is a lazy process; it does not flush buffers on its own or on a regular basis. It sleeps most of the time and must be woken up by another process to perform its duties. One such watchdog process is the Checkpoint process (you can check its existence with ps -aef | grep ckpt on Unix systems). Checkpoint does not actually perform the flushing (also called checkpointing); it calls the DBW0 process to do it. How often does the Checkpoint process perform a checkpoint? It depends on various conditions, the biggest of all being the MTTR setting, which we will cover in a later installment.
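To get a feel for how the MTTR setting drives checkpointing, you can compare the parameter with the instance's own estimates; a sketch:

SQL> show parameter fast_start_mttr_target

SQL> select target_mttr, estimated_mttr, actual_redo_blks
  2  from v$instance_recovery;

The lower the target, the more aggressively dirty buffers are written so that crash recovery can finish within that time.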

Next, let's examine a different scenario. Drop the table, create it again, bounce the database to remove all buffers of the table, and then perform the update; but do not commit. Then flush the buffers from the cache to the disk. You can trigger the checkpointing activity manually instead of waiting for the checkpoint process. Here is how to do it:

SQL> alter system checkpoint;
After that statement completes, check the presence of the values in the datafile again:
$ strings /tmp/testts_01.dbf
}|{z
-N?pD112D2
TESTTS
 1j)
 w>!
CHANGED_VALUE,
ORIGINAL_VALUE
The old value is still there, but that is an artifact; it will eventually be gone. The new value has been written to the datafile. But do you remember a very important fact: the transaction is still not committed? In a different session, if you check the data in the COL1 column, you will see the value ORIGINAL_VALUE. Where does Oracle get that value from? It gets it from the undo segments in the undo tablespace, which contain the pre-change value.
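You can see the uncommitted transaction and the undo it is holding; a sketch, assuming DBA access (the SID 123 is a placeholder for the updating session):

SQL> select t.xidusn, t.used_ublk, r.segment_name
  2  from v$transaction t, v$session s, dba_rollback_segs r
  3  where s.taddr = t.addr
  4    and t.xidusn = r.segment_id
  5    and s.sid = 123;

As soon as the session commits or rolls back, the row disappears from V$TRANSACTION.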

Well, now you may wonder how on earth the server process knows that the data is uncommitted and therefore that the undo segment is to be checked. Good question. Let me add yet another wrinkle: the datafile contains the new value, not the old one. How does Oracle even know which rows to return pre-change? It gets that information from the header of the block, where the transactions are recorded in what is called the Transaction Table or, a little differently, the Interested Transaction List (ITL). I will cover that in detail in a future installment of this series. For the time being, please bear in mind that the block header holds that information. When the server process accesses the buffer (or the block on disk) to get the column value, it consults the transaction table, sees that there is an uncommitted transaction against the block, and gets the undo information from there. Finally it creates a different copy of the buffer, as it would have looked had the update statement never been issued. This process is called Consistent Read (CR) processing.

Now back to our original discussion. Since DBW0 does not immediately flush the buffers to the datafile, the datafile can be inconsistent with the committed data. Won’t that compromise the recovery process? What happens when the instance crashes after a commit but before the flushing has occurred? Since the transaction was committed, recovery must update the datafile. Where does that information come from? The undo tablespace? No; the undo tablespace is just another set of datafiles that gets flushed in the same manner, so it may not have those values. Besides, it may not even contain the new value.

Redo Stream

This is where the other leg of the database’s guarantee of committed transactions comes in. When changes occur in a table, Oracle also records the information in another pool in memory called the log buffer. Compared to the buffer caches, which could total several terabytes, this buffer is tiny: often just a few MB. The update statement records the pre- and post-change values to the log buffer (not to the log file, mind you). But the log buffer is just an area of memory; it also goes away when the instance crashes. So how does Oracle use the information to protect the committed data?
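You can check how small the log buffer really is in your own instance:

SQL> show parameter log_buffer

SQL> select pool, name, bytes
  2  from v$sgastat
  3  where name = 'log_buffer';

Compare the BYTES value with your buffer cache size and the difference in scale is obvious.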

This is where the redo log files (a.k.a. online redo logs) come into picture. When the session commits, the contents of the log buffer are flushed to the redo log files. Until the flushing is completed, the session waits with various wait events depending on conditions, the majority of which are “log file sync” and “log file parallel write”. But does the log buffer flushing occur only when a commit occurs? No. There are other triggering events as well:
(1) When the log buffer becomes one third full
(2) When 1 MB of redo accumulates in the log buffer
(3) Every three seconds

There are other events as well; but these are the major ones. Since commit statement flushes the log buffer to the redo log file, even if the instance crashes the information is stored in the redo log file and can be easily read by the instance recovery processes. In case of a RAC database, a single instance may have crashed. The instance recovery is done by one of the surviving instances. But it must read the redo entries of the crashed instance to reconstruct the blocks on the disk. This is why the redo log files, although for only one instance, must be visible to all nodes.
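The effect of these non-commit triggers shows up in the system statistics; a quick sketch, assuming access to V$SYSSTAT:

SQL> select name, value
  2  from v$sysstat
  3  where name in ('redo writes', 'user commits');

If redo writes exceeds user commits, LGWR has been flushing the log buffer for reasons other than commits.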

Even if the commit is not issued, the other triggering events flush the contents of the log buffer to the redo log files as well. The presence of the redo entries on the redo log files is independent of the commit. From the previous discussion you learned that the checkpoint flushes the buffers from the cache to the disk, regardless of the issuance of the commit statement. Therefore, these are the interesting possibilities after a session updates the data (which is updated in the buffer cache):


Scenario | Session Committed? | Log Buffer Flushed? | Checkpoint Occurred? | Datafile Updated? | Redo Log Updated? | Comment
---------|--------------------|---------------------|----------------------|-------------------|-------------------|--------
   1     | No                 | No                  | No                   | No                | No                |
   2     | No                 | Yes                 | No                   | No                | Yes               |
   3     | No                 | No                  | Yes                  | Yes               | No                |
   4     | No                 | Yes                 | Yes                  | Yes               | Yes               |
   5     | Yes                | Yes                 | No                   | No                | Yes               | Commit will force a redo log flush
   6     | Yes                | Yes                 | Yes                  | Yes               | Yes               |

Looking at the table above you may see some interesting conundrums – redo log has the changed data but datafile does not and vice versa. How does Oracle know when and what exactly to recover since the presence of record in the redo log file is not a guarantee that the data was committed?

To address that issue, Oracle places a special marker, called a commit marker, in the redo stream, which goes into the redo log buffer. When instance recovery is required, Oracle doesn’t just recover everything present in the redo log; it looks for a commit marker. If one is not found, the changes are deemed uncommitted, and Oracle rolls them back. If the changes are not found in the redo log at all, they are guaranteed to be uncommitted (remember, a commit definitely flushes the log buffer to the redo log). In that case Oracle undoes them in the datafiles, a process known as rolling back. When the changes are found in the redo log along with a commit marker but there are no corresponding changes in the datafile (scenario #5), Oracle applies the changes to the datafile from the redo entries, a process known as rolling forward. Recovery consists of both rolling forward and rolling back.

To put it all together, here is a rough algorithm for the actions of the recovery process:

Read the redo log entries starting with the oldest one
Check the SCN number of the change
Look for the commit marker. If the commit marker is found, then data has been committed.
If found, then look for the changes in the datafile (via the SCN number)
    Change has been reflected in the datafile?
    If yes, then move on
    If no, then apply the changes to the datafile (roll forward)
If not found, then the data is uncommitted. Look for the changes in the datafile.
    Change found in datafile?
    If no, then move on
    If yes, then update the datafile with the pre-change data (rollback)

Takeaways

Let me reiterate some of the lessons from this installment.

(1) Data buffers are flushed to the disk from the buffer cache independently of the commit statement. Commit does not flush the buffers to the disk.
(2) If the buffer is modified in the buffer cache but not yet flushed to the disk, it is known as a dirty buffer.
(3) If a buffer is clean (i.e. not dirty), it does not mean that the data changes have been committed.
(4) When a commit occurs, the log buffer (not the buffer cache) is flushed to the disk.
(5) The log buffer may already have been flushed to the disk due to other triggering events. So if a change is found in the redo log file, the change is not necessarily committed.
(6) A commit statement puts a special “commit marker” on the redo log, which is the guarantee of a commit.
(7) The frequency of the flushing of the buffer cache to the datafiles is controlled by the MTTR setting and whether free buffers are needed in the cache due to incoming blocks from the datafiles.

How do you use this information? There are several things for you to consider:

(1) The more often you commit, the more often the log buffer is flushed, which is not good for I/O.
(2) The more aggressive the MTTR target, the less time instance recovery will take after a crash, but the more frequent the flushing to the datafiles will be, causing more I/O.
(3) The MTTR target has nothing to do with commit frequency; they are two independent activities. So reducing the commit frequency will not reduce the flushing frequency.
(4) If your buffer cache is small, buffers will need to be flushed more often.
(5) Exactly how small is “small”? There is no fixed formula; it depends on how much of the data in the buffer cache is updated.

I hope you enjoyed this installment of “100 Things …”. In future installments I will explain some of the other nuances of the Oracle database that you may not have been aware of. As always, I would highly appreciate it if you could drop me a line with your feedback: good, bad and anything in between.

59 comments:

Baskar said...

Hi Sir,

Wonderful Explanation. Thanks a lot.

baskar.l

Renjith Madhavan said...

This is invaluable . Thanks a lot sir ....

Regards
Renjith Madhavan

goiyala said...

simply excellant Arup.

Kumar Madduri said...

Hi Arup
This is good explanation. Not just this, I think all your blogs and presentations are very good. You explain it in the simplest terms and yet dont miss any point.

Kumar

Surachart said...

This is a great post.

Ankur Khurana said...

Waiting for next part..

Anonymous said...

Hi Sir,

Nice blog.Thanks for it.

After reading the blog i have 2 points to share :-


1. Now, suppose the session issues the statement:

SQL> update mythbuster1 set col1 = ‘CHANGED_VALUE’;

And commits:

SQL> commit;


After some paragraphs you mention "The new value is updated in the datafile. But do you remember a very important fact that the transaction is still not committed?
In a different session, if you check the data in COL1 column, you will see the value ORIGINAL_VALUE. Where does Oracle get that value from?"

I suppose the "commit" should not be there after the update statement.


2. "Although checkpoint forces a logfile switch, and therefore by definition a resultant redo log flush".

As far as i know Checkpoint does not cause a logfile switch.The "alter system switch logfile" may trigger complete checkpoint depending on the status of the redo log group.I might be wrong in understanding,kindly share your views.

Regards,
Anand

Arup said...

@Anand - thank you. For the first point, that was supposed to be a different scenario; but I didn't break it up from the original thread. So it may have added to the confusion. Added the next scenario. I hope that helps.

As to your second point, you are right. That was a typo resulting in some incorrect explanation on my part. I corrected them now.

Thanks again for catching them. IT also gave me an idea to talk about different type of checkpoints such as thread checkpoint, object checkpoint, etc.

Arup said...

Thank you @Bhaskar @Renjith @goiyala @Kumar @Surachart and @Ankur for sharing your thoughts.

Flado said...

Hi Arup,

As far as I know, before a dirty buffer may be written to disk, the redo about the change must be written. That's why you sometimes see 'log file sync' waits for the background processes (DBWn). So Scenario 3 cannot happen, I think.

Cheers!
Flado

Marko Sutic said...

Great explanation Arup.

Maybe the best blog post I've seen lately.

Regards,
Marko

Alberto said...

Great job, Arup.
Your blog is simply super...
Thanks a lot....
Ciao
Alberto

Srikar Dasari, Bangalore said...

Great Job Arup.
Would request you to share knowledge about I/O (Physical and logical) etc. in next instalments, as it is dificult/confusion to understand about I/O and how to interpret at various scenarios.

Thanks
Srikar

lava said...

great article and insight

Martin Berger said...

Arup, for your Question #3 I would expect the values 2000 and 1000 in the redo log: 2000 for the data block and 1000 for the block from UNDO TableSpace?
UNDO is just 'another' block of the data file and to be sure it's in the redo-stream before the real data block?

Bundit said...

I know, I know –
it seems to defy conventional logic a little bit.
I would have assumed index scan to be named scattered reads
and full table scans to be sequential


@Arup From your point, do you mean the index scan as "random I/O from disk drive"
and table scan as "sequential reads from disk drives" ?

Arup said...

@Bundit

>> @Arup From your point, do you mean the index scan as "random I/O from disk drive"
and table scan as "sequential reads from disk drives" ?

When the data is selected as a full table scan, it is "scattered" all over the buffer cache - hence the term scattered read. Index scans are much more disciplined - they follow a root block, then the leaf, then the subleaf, etc. Therefore they follow a "sequential" manner of searching. Hence the term sequential read.

Both are random reads as far as the database is concerned. The full table scan does not necessarily mean that the blocks will be contiguous.

Arup said...

@Martin - you are right. I was just trying to explain the concept without too much technical detail. So, to be technically 100% accurate, I should have mentioned the value to be 2000 in the data portion of the redo stream. However, as I wanted to balance content and complexity, I decided to omit that. Perhaps I should have included it. And you are right - the undo data will be found in the redo stream as well, but as undo data.

Martin Berger said...

Arup, your explanation was well balanced! Please do not change anything. I just wanted to make sure I remember that detail right, not to criticize you.

Rangarajan said...

Hi Arup,

Thanks for providing such simple and lucid explanation. However I too like Flado think that Scenario 3 cannot happen. Please correct me if my understanding is not correct.

Joel Garry said...

Very nice!

I'm wondering if perhaps the confusion about scenario 3 is that it might or might not have followed scenario 2. So yes, it would have to be in the redo log if the redo buffer was flushed, but not put there by scenario 3.

Flado said...

Joel,
No confusion, actually. I think this is invariant - at no time may a datafile get ahead of the redo stream.
Scenario 3 only becomes possible if we are talking about different changes - one described by the redo still in the log buffer, and a different (earlier) change written to the datafile.

Murali said...

Very Good Refresher!

Pavan Kumar said...

Hi Arup,

It's a very good refresher for us... !! It's went throught a like "Story teller by grand ma.. to little young grand daughter at bed time.. " Hat's off and tell me when you will visit india.. so we can meet you up..

Ashish said...

sir ji..
gr8 explanation.

I hv a scenario.. ltl bit different from this.
++ Suppose buffer cache is small like some KBs and I am updating a table with data in MBs so how does it accommodate the dirty buffer until data is not commited and at the same time other session also needs the buffer chache.

lava said...

yes, ots all too great a forum, the discussions ideas placed ehre and comments all great for dear ORACLE, and experts here

Jayesh said...

Very well explained Arup. Thanks a lot.

Anonymous said...

Hi Arup,

Great explanation..simply superb..Can you please explain what ASHISH asked...It will be a little bit refresher as well..Thanks!

lava said...

yes its superb article

Arup said...

@Ashish - as I mentioned in the article, commit has nothing to do with the buffers being written to the disk. The buffer to disk activity is done by dbwr process which occurs independently and asynchronously. When the buffers are written to the disk as a result of space running out of buffers or just checkpoint occurring, the transaction may or may not be committed. When commit occurs, the only thing happens is that the log buffer is flushed; not db buffer cache.

So, if the buffer cache is, say 80 KB, your db block size is 8K, you have 10 buffers (note, these are just purely theoretical numbers for illustration only). You update a table with 20 blocks. In that case the first 10 blocks will fill the buffer cache. When the 11th block comes, the first block has to be forced back to the disk to make room for it, even though it is not committed.

poluri said...

Hi Arup,
Great explanation in simplest terms..thanks a lot

Tone Czar said...

Thank you, Arup. I have a couple of your books, and they are invaluable!

Anonymous said...

Hi Arup,

Thanks a lot for invaluable knowledge.

~Srikanth

VISHWANATH SHARMA said...

Good to understand oracle internals...thanx arup

Sanblazer said...

Great Post,,can't wait for the rest of the parts.

Sanblazer said...

Just noticed the other part of the series..thx again

-Sandeep

Anonymous said...

thanks!!

Robert said...

Arup,

Great Post Simply u made it clear with the explanation.

I have clarification on the statement
"The old value is still there; but that is an artifact; it will eventually be gone. The new value is updated in the datafile"

Why both the modified value and original value are in datafile.

FYI,

I tried the same scenario in my database it come same as u told and i update the same column by different value "VERSION2" and i checked the datafile it shows VERSION2,ORGINAL.
Regards,

Robert

Sri said...

A Superb one ..:-)

Anonymous said...

This is excellent explanation never seen .. Many thanks
I have couple of doubts here ..
You said “The server process then identifies the block the row exists in”
1.Can you pl explain how Server process identifies the block the row exists ? is it using data dictionary ?

2.Can you please verify if my below understanding is correct ..in 10g (private redo/in memory undo enabled)

update emp set sal=2000 where emp_id in (1,2)
(let us say - there are 2 blocks to be updated ..each row exists in each block )

Now :

1.Server Process READS first data block into buffer cache
2.Server Process READS undo block into buffer cache
3.Server process prepared redo change vector for undo block in 'private redo'
4.Server process prepares change vector for data block and put it in 'in memory undo'

6.Server Process READS second data block into buffer cache
7.Server Process READS undo block into buffer cache
8.Server process prepared redo change vector for undo block in 'private redo'
9.Server process prepares change vector for data block and put it in 'in memory undo'
10.Finally combine all these change vectors and write them to 'redo log buffer'
11.now - visit first block and make the change (apply the redo vector) - update sal=2000 for emp_id=1
and visit second block and make the change (apply the redo vector) - update sal=2000 for emp_id=2

Is the above process flow correct ?


and now they are 'dirty blocks' - which are waiting for 'commit' /'rollback' from user

but in the mean time - the dirty blocks can be flushed to Disk ..??? and after the dirty blocks gets flushed
if I rollback - what happens ?

Many thanks for your time and help

Cheers
Bix

Arindam said...

I only knew the complete process superfluously. But ur explanation just gave me the best pictorial description I would have never even thought of in my wildest dreams. Thanks a TON for the wonderful insight... U indeed deserve the title of an Oracle junkie :-)

DeepakP said...

Thanks for this nice post.

Can you please explain how does recovery work on UNDO tablespace? Do we have ITLs for UNDO blocks?

Ravikanth said...

Thanks for teh nice post Arup.

Can you tell me where would the un-committed data be stored before writing to redo log buffer or data file i.e your scenario 1 of the table? Would it be in SGA or PGA and module holding it?

Arup said...

@Ravikanth - the uncommitted data will be in the buffer cache.

Hcl@Rajnish said...

awesome post got lots of satisfaction to read this article.

Unknown said...

Hello Sir,
Thank you so much for this wonderful explanation.

Regards
Ajay.

saurabh.sharma said...

Thanks Arup for your well balanced explanation, it really helps me a lot.

SwapnaVenkatesh said...

awesome explanation, thank you

Anonymous said...

Hi,

Thanks for a great article.

One question though. Following on from Flado's comment, I'm a little unclear how scenario 3 would arise and how Oracle would handle it in the event of an instance failure.

The only possible cause of events I can think of that might cause it is:
1. Make a data change without commiting
2. Issue an explicit checkpoint
3. Immediate instance failure

On recovery of the instance how is the datafile rolled back?
Presumably the undo associated with the change was also written to the datafiles as part of the checkpoint.
Does Oracle look at the SCN of the datafile header, realise it's greater than the last SCN in the log files and somehow work out the uncommited data with their associated undo and roll it back?

Many thanks.

Flado said...

Anon,

Scenario 3 still cannot happen. The checkpoint at step 2 will flush the redo before flushing the data block buffer.
Rolling back requires rolling forward of the undo tablespace and then using it to roll the uncommitted transactions back. Instance (and media) recovery can only roll forward, ergo no datafile may get ahead of the redo stream on disk.
Cheers,

Flado

Anonymous said...

Thanks for your reply, Flado.

That was my original understanding too, but based on this article I was beginning to wonder if a checkpoint didn't always flush relevent redo to the logs.

Arup - could you possibly comment on what circumstances you see scenario 3 being possible?

Regards.

Unknown said...

Simply Superb... Hatss off...Keep it up...

Sakthee said...

Arup,

In the scenario 2, u said u will not commit the updated value i.e., 2000 and leave. Log in after a day and you still see the value as '2000'. That means the value will roll backward only if SMON comes into picture? won't the reports or the new value available to other users will be inconsistent?

Regards,
Sakthi

Arup said...

@sakhtee - which specific portion of the text are you referring to?

If I made a change to 2000 from 1000, did NOT commit and leave, after 24 hours what values would someone see - that was the question. The answer depends on "where" you check.

(a) if you check the database by logging on to a session, then you will see 1000 (the pre-change value). Since the change was not committed, the users will be shown the pre-change value as a part of read consistency.

(b) if you check in the datafile, you will see 2000; since the datafile will have been updated by DBWn process already.

SMON has no role to play here.

Hope this helps.

Anonymous said...

Arup,
Thanks a lot.It is really helpfull information.

Rob K said...

Hi Arup,

Thanks for taking the time to write an EXCELLENT article - you described this technical process very well in an easy-to-understand language.

This article was particularly helpful in understanding some of the root causes behind "Checkpoint not complete" errors in my DB's alert log.

Thank you for taking the time to make this excellent blog posting! I will check your other blog postings as well.

Arup Nanda said...

Thank you, Rob.
