Sunday, December 12, 2010

New York Oracle User Group 2010 Session

Thank you to all those who came to my session at the NYOUG Metro Area Meeting. I know you had choices, and I am honored that you chose to grace my session. I apologize for running out of time. In hindsight, I should have entertained questions at the end.

You can download the presentation here.
You can download the paper that describes the presentation here.

Based on the feedback I received, I started a new blog series "100 Things You Probably Didn't Know About Oracle Database". You may want to check out the first of the series. It talks about the commits and the frequency of block flushes to the disk - questions that came up during my presentation.

100 Things You Probably Didn't Know About Oracle Database

Recently, while delivering a presentation on Cache Fusion at the New York Oracle Users Group (www.nyoug.org), the regional user group where I have been a long-time member, I was surprised to hear from many participants some beliefs they had held for a long time that were completely wrong. I always thought these were as obvious as they come; but of course I was dead wrong. What was even more surprising was that most of these believers were veterans of Oracle Database technologies, not newbies. Part of the problem – I think – lies with a system that focuses on execution rather than learning, and part of it is due to the lack of clear documentation. During that discussion some encouraged me to write about these topics. I immediately agreed it was a great idea and merited serious attention. Here is the product: my attempt at explaining some of the “mysteries” of how Oracle Database operates. I will cover 100 such nuggets of information, roughly once a week.

Before you start, however, I would like to bring your attention to this important point. You may already be aware of these facts. I was; so it is reasonable to believe that a vast majority would be as well. Under no circumstances am I claiming these to be groundbreaking or awe-inspiring. If you are already familiar with a specific piece of knowledge, I am not at all surprised; please feel free to skip it. For those who read on, I hope you find these helpful and will take a moment to write to me about how you felt.

Part 1: Myth of Commit Causing Buffer to be Flushed to the Disk

Consider a scenario: in the EMP table I updated my salary from 1000 to 2000 and committed immediately. The very instant after I issued the commit, if I check the datafile on disk for the tablespace where this table is located, which value should I see – 1000 or 2000? (Remember, the value was committed.)

Did you answer 2000 (perhaps because it was committed)? If so, then consider a normal application where commits are issued up to tens of thousands of times every minute. In a traditional database the weakest link in the chain is always I/O. If Oracle updated the datafile every time someone committed, it would grind to a halt.

Did you answer 1000? Well, in that case, consider what happens when the instance crashes. The datafile would have 1000, not 2000 – the value that was committed. In such a case the instance must bring the committed value (2000, in this case) back to the datafile. How?

Let’s examine a different scenario. Suppose I did not issue a commit after the update (perhaps I was not sure of the implications of giving myself a pay hike, or perhaps I had a pang of conscience). I left the session as is and left for the day. The next day I was sick and didn’t come to work. Twenty-four hours have passed since I updated the record. At that point, if someone reads the datafile, what value would they see – 1000 or 2000?

Did you answer 1000 – a logical choice, since the read consistency model of Oracle Database guarantees that other sessions will see the pre-change data for uncommitted transactions?

Question #3 in this scenario: if you check the redo log file (not the datafile), what value will you find there – 1000 or 2000? Remember, it has not been committed. Did you answer 1000? It sort of makes sense; the changes are not committed, so there is no reason for them to be in the redo log file, which is a very important part of the recovery process. If you answered 2000, then how would you explain the recovery process? In case of instance failure, the recovery must read the redo log file and, since the transaction was not committed, it must roll the value back to the previous one – 1000. How would it do that if the redo log file contains 2000, not 1000?

The answers, if you can’t wait any longer: 1000 for the first question, 2000 for the second and 2000 for the third. How so? Please read on.

Explanation

To understand the mechanics of the process, let’s go over the buffer management process of the Oracle database. It’s a rudimentary detail, but quite vital to understanding the myth here. Consider a very small table in an equally small tablespace:


SQL> create tablespace testts datafile '/tmp/testts_01.dbf' size 1M;

SQL> create table mythbuster1 (col1 varchar2(200)) tablespace testts;


Insert a row:


SQL> insert into mythbuster1 values ('ORIGINAL_VALUE');
SQL> commit;

Shut down and restart the database so that the buffer cache is completely devoid of this table. You could also issue ALTER SYSTEM FLUSH BUFFER_CACHE; but I want to make sure all traces of this table (and the value of the column inside) vanish from all memory areas – buffer cache, shared pool, PGA, whatever.
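A minimal sketch of that bounce (and the cache-flush alternative) from a SYSDBA session in SQL*Plus:

SQL> -- restart the instance so the buffer cache starts out empty
SQL> shutdown immediate
SQL> startup
SQL>
SQL> -- or, to flush only the buffer cache without a restart:
SQL> alter system flush buffer_cache;

You can now check the presence of the value in the datafile: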

$ strings /tmp/testts_01.dbf
}|{z
-N?pD112D2
TESTTS
 1j)
 w>!
ORIGINAL_VALUE

The value is right there. Now suppose a user issues a statement like this from SQL*Plus:
SQL> select * from mythbuster1;

Oracle creates a process – called a “server process” – on behalf of this user session to service the requests from the session. This process is named, on Unix and similar operating systems, oracle. Here is how you can find it:



$ ps -aef|grep sqlplus
oracle   14257 14214  0 13:42 pts/2    00:00:00 sqlplus   as sysdba
$ ps -aef | grep 14257
oracle   14257 14214  0 13:42 pts/2    00:00:00 sqlplus   as sysdba
oracle   14258 14257  0 13:42 ?        00:00:00 oracleD112D2 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
The process 14258 is the server process. The SQL*Plus process is known as the user process which can be any process a user executes such as a Java program, a Pro*C code, a TOAD session and so on. It’s the server process that handles all the interaction with the Oracle database; not the user process. This is why Oracle database interaction is said to be based on a Two Task Architecture; there are always two tasks – the user task that a regular user has written and the server task that performs the database operations. This is an important concept established during the early foundations of the Oracle database to protect the database from errant code in the user task introduced either maliciously or inadvertently.

The server process then identifies the block the row exists in. Since the database instance just came up, the buffer cache is empty and the block will not be found there. Therefore the server process issues a read call for that specific block from the datafile. The block is read from the disk into the buffer cache. Until the loading of the block from the disk to the buffer cache is complete, the session waits on the event db file scattered read. In this case the session issued a full table scan; had it performed an index scan, the session would have waited on the event db file sequential read. [I know, I know – it seems to defy conventional logic a little bit. I would have assumed index scans to be called scattered reads and full table scans sequential.]
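If you want to see these waits for yourself, a sketch like the following might help; it assumes the server process PID (14258) from the ps output above and access to the V$ views:

SQL> select s.sid, e.event, e.total_waits
       from v$session s, v$process p, v$session_event e
      where p.spid = '14258'
        and s.paddr = p.addr
        and e.sid   = s.sid
        and e.event like 'db file%read';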

Once this process is complete, the buffer cache holds a copy of the block of the table MYTHBUSTER1. Subsequent sessions, if they select from the table, will simply get the data from this buffer, not from the disk.
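One way to confirm that the blocks are now cached is to count the buffers belonging to the table – a rough sketch, assuming you have access to V$BH and DBA_OBJECTS:

SQL> select count(*) cached_buffers
       from v$bh b, dba_objects o
      where o.object_name = 'MYTHBUSTER1'
        and b.objd = o.data_object_id;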

Now, suppose the session issues the statement:

SQL> update mythbuster1 set col1 = 'CHANGED_VALUE';
And commits:
SQL> commit;

Immediately afterwards, check for the presence of the values in the datafile:
$ strings /tmp/testts_01.dbf
}|{z
-N?pD112D2
TESTTS
 1j)
 w>!
ORIGINAL_VALUE

The old value, not the new value, is found. The datafile on the disk still has the old value; not the new one, even though the transaction has been committed. The update statement actually updated only the buffer in the cache; not the disk. So, when is the data on the disk updated?

The datafile gets updated by a process known as Database Writer (a.k.a. Database Buffer Writer). It’s named DBW0. Actually, there may be more than one such process and they are named DBW0, DBW1, etc. – more conveniently addressed as DBWn. For the purpose of the discussion here, let’s assume only one process – DBW0. It has only one responsibility – to update the datafile with the most up to date buffers from the buffer caches. [Note: I used buffer caches – plural. This is not a typo. There may be more than one buffer cache in the database – keep, recycle, default and other block sizes – but that’s for another day]. The buffer that has been updated is known as a dirty buffer since its contents are different from the block on the disk. DBW0 process writes the contents of the buffer to the disk – making it clean again.
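To see how many database writer processes your own instance runs, a quick sketch (assuming access to the initialization parameters and V$BGPROCESS):

SQL> show parameter db_writer_processes

SQL> select name, description
       from v$bgprocess
      where name like 'DBW%'
        and paddr <> '00';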

But the big question is: when does DBW0 write the dirty buffers to the disk? Ah, that’s the very question we are pondering over here. There are several “triggering” events that cause DBW0 to copy the buffers to the disk – also called flushing the buffers. By the way, DBW0 is a lazy process; it does not flush buffers by itself or on a regular basis. It sleeps most of the time and must be woken up by another process to perform its duties. One such watchdog process is the checkpoint process (you can check its existence with ps -aef | grep ckpt on Unix systems). The checkpoint process does not actually perform the flushing (also called checkpointing); it calls the DBW0 process to do it. How often does the checkpoint process perform a checkpoint? It depends on various conditions – the biggest of all being the MTTR setting, which we will cover later in a different installment.
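If you are curious about the MTTR-driven checkpointing on your own instance, a sketch along these lines shows the target and the current estimate (assuming FAST_START_MTTR_TARGET is in use):

SQL> show parameter fast_start_mttr_target

SQL> select target_mttr, estimated_mttr
       from v$instance_recovery;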

Next, let's examine a different scenario. Drop the table, create it again, recycle the database to remove all buffers of the table, and then perform the update; but do not commit. Then flush the buffers from the cache to the disk. You can trigger the checkpointing activity manually instead of waiting for the checkpoint process.
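A rough sketch of the preparatory steps, reusing the same table and tablespace from before:

SQL> drop table mythbuster1;
SQL> create table mythbuster1 (col1 varchar2(200)) tablespace testts;
SQL> insert into mythbuster1 values ('ORIGINAL_VALUE');
SQL> commit;
SQL>
SQL> -- recycle the instance so no buffers of the table remain in the cache
SQL> shutdown immediate
SQL> startup
SQL>
SQL> update mythbuster1 set col1 = 'CHANGED_VALUE';
SQL> -- note: no commit here

With the uncommitted change sitting only in the buffer cache, trigger the checkpoint manually. Here is how to do it: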

SQL> alter system checkpoint;
After that statement completes, check the presence of the values in the datafile again:
$ strings /tmp/testts_01.dbf
}|{z
-N?pD112D2
TESTTS
 1j)
 w>!
CHANGED_VALUE,
ORIGINAL_VALUE
The old value is still there, but that is an artifact; it will eventually go away. The new value has been written to the datafile. But do you remember a very important fact – the transaction is still not committed? In a different session, if you check the data in the COL1 column, you will see the value ORIGINAL_VALUE. Where does Oracle get that value from? It gets it from the undo segments in the undo tablespace. The undo tablespace contains the pre-change value.
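You can see both halves of this for yourself – a sketch, assuming a second session and access to V$TRANSACTION:

SQL> -- from a second session: the reader still sees the pre-change value
SQL> select col1 from mythbuster1;

COL1
------------------------------------------------------------
ORIGINAL_VALUE

SQL> -- the open transaction, and how many undo blocks it is holding
SQL> select xidusn, xidslot, xidsqn, used_ublk
       from v$transaction;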

Well, now you may wonder how on earth the server process knows that the data is uncommitted and that the undo segment therefore needs to be checked. Good question. Let me add yet another wrinkle to it – the datafile contains the new value, not the old one. How does Oracle even know which rows to return with their pre-change values? It gets that information from the header of the block, where the transactions are recorded – called the Transaction Table or, a little bit differently, the Interested Transaction List (ITL). I will cover that in detail in a future installment of this series. For the time being, please bear in mind that the block header holds that information. When the server process accesses the buffer (or the block on the disk) to get the column value, it consults the transaction table, sees that there is an uncommitted transaction against it and gets the undo information from there. Finally it creates a different copy of the buffer as it would have looked had the update statement not been issued. This process is called Consistent Read (CR) processing.
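If you want a peek at those ITL entries now rather than waiting for that installment, a block dump is one way. This is only a sketch – the file and block numbers below are placeholders that you would replace with the values returned by the first query, and the dump lands in a trace file:

SQL> select dbms_rowid.rowid_relative_fno(rowid) rel_fno,
            dbms_rowid.rowid_block_number(rowid) block_no
       from mythbuster1;

SQL> -- substitute the file and block numbers returned above
SQL> alter system dump datafile 7 block 135;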

Now back to our original discussion. Since DBW0 does not immediately flush the buffers to the datafile, it leaves the datafile inconsistent with the committed data. Won’t that compromise the recovery process? What would happen if the instance crashed before the flushing occurred? Since the transaction was committed, recovery must bring the committed value back to the datafile. Where does that information come from? The undo tablespace? No; the undo tablespace is made of datafiles as well, flushed in the same lazy manner, so it may not have those values. Besides, it may not even contain the new value.

Redo Stream

This is where the other leg of the database’s guarantee of committed transactions comes in. When changes occur in the table, Oracle also records the information in another pool in memory called the log buffer. Compared to the buffer caches, which could be several terabytes, this buffer is tiny – often just a few MB. The update statement records the pre- and post-change values in the log buffer (not in the log file, mind you). But the log buffer is just an area of memory; it also goes away when the instance crashes. So how does Oracle use the information to protect the committed data?
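To see just how small the log buffer is on your instance, a quick sketch:

SQL> show parameter log_buffer

SQL> select name, value
       from v$sga
      where name = 'Redo Buffers';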

This is where the redo log files (a.k.a. online redo logs) come into the picture. When the session commits, the contents of the log buffer are flushed to the redo log files. Until the flushing is complete, the session waits on various wait events depending on conditions, the most common of which are “log file sync” and “log file parallel write” (you can observe these with the sketch after the list below). But does the log buffer flushing occur only when a commit occurs? No. There are other triggering events as well:
(1) When one third of the log buffer is full
(2) When 1 MB of log buffer is written
(3) Every three seconds
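A sketch of how you might observe those waits at the system level, assuming access to V$SYSTEM_EVENT:

SQL> select event, total_waits, time_waited
       from v$system_event
      where event in ('log file sync', 'log file parallel write');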

There are other events as well; but these are the major ones. Since commit statement flushes the log buffer to the redo log file, even if the instance crashes the information is stored in the redo log file and can be easily read by the instance recovery processes. In case of a RAC database, a single instance may have crashed. The instance recovery is done by one of the surviving instances. But it must read the redo entries of the crashed instance to reconstruct the blocks on the disk. This is why the redo log files, although for only one instance, must be visible to all nodes.

Even if the commit is not issued, the other triggering events flush the contents of the log buffer to the redo log files as well. The presence of the redo entries on the redo log files is independent of the commit. From the previous discussion you learned that the checkpoint flushes the buffers from the cache to the disk, regardless of the issuance of the commit statement. Therefore, these are the interesting possibilities after a session updates the data (which is updated in the buffer cache):


Scenario  Session     Log Buffer  Checkpoint  Datafile  Redo Log  Comment
          Committed?  Flushed     Occurred    Updated   Updated
--------  ----------  ----------  ----------  --------  --------  ----------------------------------
   1      No          No          No          No        No
   2      No          Yes         No          No        Yes
   3      No          No          Yes         Yes       No
   4      No          Yes         Yes         Yes       Yes
   5      Yes         Yes         No          No        Yes       Commit will force a redo log flush
   6      Yes         Yes         Yes         Yes       Yes


Looking at the table above you may see some interesting conundrums – the redo log has the changed data but the datafile does not, and vice versa. How does Oracle know when and what exactly to recover, since the presence of a record in the redo log file is not a guarantee that the data was committed?

To address that issue, Oracle places a special “marker”, called a commit marker, in the redo stream that goes into the log buffer. When instance recovery is required, Oracle doesn’t just recover everything present in the redo log; it looks for a commit marker. If one is not found, the changes are deemed uncommitted, and therefore Oracle rolls them back. If the changes are not found in the redo log at all, then they are uncommitted – guaranteed (remember, a commit will definitely flush the log buffer to the redo log). In that case Oracle rolls them back in the datafiles – a process known as rolling back. When the changes are found in the redo log (along with the commit marker) but there are no corresponding changes in the datafile (scenario #5), Oracle applies the changes to the datafile from the redo entries – a process known as rolling forward. Recovery consists of both rolling forward and rolling back.

To put it all together, here is a rough algorithm for the actions of the recovery process:

Read the redo log entries starting with the oldest one
Check the SCN number of the change
Look for the commit marker. If the commit marker is found, then data has been committed.
If found, then look for the changes in the datafile (via the SCN number)
    Change has been reflected in the datafile?
    If yes, then move on
    If no, then apply the changes to the datafile (roll forward)
If not found, then the data is uncommitted. Look for the changes in the datafile.
    Change found in datafile?
    If no, then move on
    If yes, then update the datafile with the pre-change data (rollback)

Takeaways

Let me reiterate some of the lessons from this installment.

(1) Data buffers are flushed to the disk from the buffer cache independently of the commit statement. Commit does not flush the buffers to the disk.
(2) If the buffer is modified in the buffer cache but not yet flushed to the disk, it is known as a dirty buffer.
(3) If a buffer is clean (i.e. not dirty), it does not mean that the data changes have been committed.
(4) When a commit occurs, the log buffer (not the buffer cache) is flushed to the disk.
(5) The log buffer may already have been flushed to the disk due to other triggering events. So if a change is found in the redo log file, the change is not necessarily committed.
(6) A commit statement puts a special “commit marker” on the redo log, which is the guarantee of a commit.
(7) The frequency of the flushing of the buffer cache to the datafiles is controlled by the MTTR setting and whether free buffers are needed in the cache due to incoming blocks from the datafiles.

How do you use this information? There are several things for you to consider:

(1) The more often you commit, the more often the log buffer is flushed – not good for I/O (see the sketch after this list).
(2) The more aggressive the MTTR target, the less time instance recovery will take after a crash, but the more frequently the buffer cache is flushed to the datafiles – causing more I/O.
(3) The MTTR target has nothing to do with commit frequency; they are two independent activities. So reducing the commit frequency will not reduce the flushing frequency.
(4) If your buffer cache is small, there will be more need to flush.
(5) Exactly how small is “small”? There is no fixed formula; it depends on how much of the data in the buffer cache is updated.
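To get a feel for how often your application commits relative to redo writes, a sketch like this might be a starting point (instance-wide numbers from V$SYSSTAT, accumulated since startup):

SQL> select name, value
       from v$sysstat
      where name in ('user commits', 'redo writes', 'redo size');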

I hope you enjoyed this installment of “100 Things …”. In future installments I will explain some of the other nuances of the Oracle database that you may not have been aware of. As always, I would highly appreciate it if you could drop me a line with your feedback – good, bad, and anything in between.

Friday, December 03, 2010

Exadata on Cloud

I just saw this from Oracle Newsflash. Oracle Launches Exadata On Demand for High-Performance Computing. http://www.oracle.com/newsletters/information-indepth/on-demand/dec-10/exadataod.html.

This is very exciting. Of course, the details are scant at this time. But when the model matures, it will allow us to play with the technology without investing a whole lot in it. A pet peeve of mine about Exadata has been that it is hardware, not just software. Unlike other Oracle products, I couldn't just download it and test the features. That limitation is huge. My role in my organization is to develop roadmaps and champion technological innovations. But without getting my hands dirty, it becomes a pure paper exercise, which is not something I ever do.

With this, I am sure a lot of folks will now be able to validate their current systems against the Exadata platforms, making it easier and more prudent for them to champion it within their sphere of influence.
