Friday, February 05, 2016

No Data Loss without Sync Data Guard

A few months ago I wrote about an exciting product that lets you achieve no data loss in a disaster even without running Data Guard in maximum protection mode, which relies on an expensive synchronous network. Recently I had a chance to actually work on that system and replicate the scenarios. Here I explain how it works, with commands and actual outputs.

Background

First, you may want to refresh your memory on the concept behind the product in the earlier blog post: http://arup.blogspot.com/2015/04/no-data-loss-without-synchronous-network.html. The product is the Phoenix system from Axxana. To reiterate, I am not an employee of that company, nor on its payroll; I just want to make others aware of this innovative, out-of-the-box solution to a problem we are all familiar with. I usually don't write about specific products unless I feel strongly about their effectiveness for the user community.

Let's examine the issue we are facing:
  1. In a Data Guard environment, some amount of redo information may be left behind at the primary site.
  2. If the primary site meets with a disaster where nothing is left, this data is lost forever.
  3. If that data loss is not acceptable, you have to run Data Guard in maximum protection mode, which guarantees that changes on the primary are recorded at the standby before the commit completes. That way the data is already at the standby site and is not affected by the loss of the primary site.
  4. However, this also means the network has to be very fast and very reliable to keep up with the updates to the standby site. Unfortunately, that also means the network is very expensive, and it can become technically challenging across very large distances.
  5. This is why many corporations that can't afford the expensive network infrastructure do not implement maximum protection Data Guard and simply accept the data loss as an unavoidable problem.
Well, until now. Axxana's product addresses the issue in a non-traditional manner. The only thing standing between an expensive network and a full recovery at the standby site after a disaster is the redo information that has not yet been transmitted to the standby server. If that information were available to the standby host, you would be able to recover completely.

The Phoenix system from Axxana is a solid state drive enclosed in a case protected from common disaster elements such as flood, fire, bomb blast, etc. You put this special box at the primary site and create one member of each redo log group on its storage, as shown in the figure below. The box has protected network ports that allow you to access its contents even if the normal ports are destroyed by the disaster. It even has a cellular modem that allows it to be reached remotely. In fact, the Axxana software can pull the necessary data from this special box remotely, from the standby host, and apply it to the standby database. And the best thing about it is that it is the plain vanilla Data Guard you are familiar with; there is no special magic going on under the hood.
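To make that concrete, here is roughly what multiplexing a redo log member onto the Blackbox storage looks like. This is only an illustrative sketch; the disk group name +AXX and the group numbers are my assumptions, not anything prescribed by Axxana.

-- Illustrative sketch only: add a second member to each redo log group on
-- the Blackbox storage, assumed here to be presented to ASM as a disk
-- group named +AXX (a hypothetical name).
SQL> ALTER DATABASE ADD LOGFILE MEMBER '+AXX' TO GROUP 1;
SQL> ALTER DATABASE ADD LOGFILE MEMBER '+AXX' TO GROUP 2;
SQL> ALTER DATABASE ADD LOGFILE MEMBER '+AXX' TO GROUP 3;

-- Confirm that every group now has a member on the Blackbox storage
SQL> SELECT group#, member FROM v$logfile ORDER BY group#;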
Fig 1. Axxana Blackbox Internals



Let's see how it works with a hands-on approach inside a lab. I have two hosts:



Role      Host       Database
--------  ---------  --------
Primary   e02lora    adb
Standby   e02rora1   adbsb


The storage is ASM, not a filesystem. Replication on a filesystem is relatively simple, so I deliberately chose ASM to see whether any ASM-specific nuances would come up. Data Guard is configured in maximum performance mode, so there is no need for an expensive, low-latency network infrastructure (a sketch of what such an asynchronous destination typically looks like follows the list below). The setup has an Axxana Phoenix system, which consists of three components:
  1. Blackbox - the "special" protected storage system.
  2. Collector - a small server at the primary site that collects the changes to the files under Axxana protection. In this case, they are the control file, redo logs and archived logs. I don't need to place any database files there, because the standby already has copies of those; what it lacks is the latest redo. Keeping this small set of files also makes it possible to put the protected files of many databases--not just one--inside a single blackbox.
  3. Recoverer - a small server at the standby site that receives the changes from the blackbox and keeps the files up to date, so that they are available when a disaster occurs.
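For reference, an asynchronous (maximum performance) redo transport destination in this kind of setup is typically defined along the following lines. This is a generic sketch, not the exact parameters from my lab; adbsb is used here as the TNS alias of the standby.

-- Generic sketch of a maximum performance (asynchronous) destination;
-- 'adbsb' is assumed to be the TNS alias of the standby database.
SQL> ALTER SYSTEM SET log_archive_dest_2 =
  2  'SERVICE=adbsb ASYNC NOAFFIRM VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=adbsb'
  3  SCOPE=BOTH;

-- Verify the resulting protection mode on the primary
SQL> SELECT protection_mode, protection_level FROM v$database;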

Setting up

First, I ensured that the managed recovery process was running and the standby database was in MOUNTED mode.
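The checks themselves are nothing Axxana-specific; a sketch of the usual commands (not the literal output from my session) is below.

-- On the standby: confirm the role and mount state
-- (expect DATABASE_ROLE = PHYSICAL STANDBY and OPEN_MODE = MOUNTED)
SQL> SELECT database_role, open_mode FROM v$database;

-- Start (or restart) managed recovery in the background
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;

-- Confirm that the MRP process is running
SQL> SELECT process, status, sequence# FROM v$managed_standby WHERE process LIKE 'MRP%';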

Then I created a test schema in the production database.
e02lora$ sqlplus / as sysdba
SQL> grant create session, create table, create procedure, unlimited tablespace to arup identified by arup;
Then I created some objects and data inside this schema.
SQL> connect arup/arup
SQL> create table T (c1 number, c2 timestamp);
SQL> create index IN_T ON T (C1);
SQL> create or replace procedure INS_T (p_c1 in number, p_commit in number) as
2 begin
3 for i in 1..p_c1 loop
4 insert into T values (i,current_timestamp);
5 if mod(i,p_commit)=0 then
6 commit;
7 end if;
8 end loop;
9 end;
10 /

SQL> create or replace procedure DEL_T (p_commit in number) as
2 begin
3 delete T where mod(C1, p_commit)=5;
4 commit;
5 end;
6 /
SQL> create or replace procedure UPD_T (p_commit number) as
2 begin
3 update T set C1 =-1 where mod(C1, p_commit)=3;
4 commit;
5 end;
6 /

SQL> exec INS_T (400000, 5000);

SQL> commit;
After creating the test data, I switched a few log files and waited a bit to make sure the changes were replicated to the standby.
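The log switches and the check that the redo had reached the standby were along these lines (a generic sketch, not the literal commands and output from my session):

-- On the primary: force a couple of log switches
SQL> ALTER SYSTEM SWITCH LOGFILE;
SQL> ALTER SYSTEM SWITCH LOGFILE;

-- On the standby: verify that the latest sequences have arrived and been applied
SQL> SELECT MAX(sequence#) AS last_received FROM v$archived_log;
SQL> SELECT MAX(sequence#) AS last_applied  FROM v$archived_log WHERE applied = 'YES';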

Simulating a Failure


Now to simulate a disconnected network. I didn't have any special tools with me, so I did the next best thing: I disabled the Ethernet interface on the standby host.

e02rora$ su -
Password:
e02rora$ ifdown eth0

At this point the standby host will not receive the updates; all the changes made to the primary database will exist at the primary site only. I first canceled the recovery process on the standby.

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;

I made a few changes to the primary data.

SQL> connect arup/arup
SQL> exec DEL_T (500);
SQL> exec UPD_T (500);

SQL> conn / as sysdba
SQL> alter system switch logfile;

SQL> conn arup/arup
SQL> select count(*) from T;

  COUNT(*)
----------
    399200

The table now has only 399,200 records. Since the managed recovery process is stopped, these final changes will not be replicated to the standby. I can confirm that by opening the standby in read-only mode and checking the actual table.


On the standby host:

SQL> conn / as sysdba
SQL> ALTER DATABASE OPEN READ ONLY;

Database altered.

SQL> SELECT COUNT (1) FROM ARUP.T;

 COUNT(1)
----------
400000


The standby has all 400,000 rows, compared to 399,200 rows in the primary. The discrepancy is due to the changes that were not propagated to the standby. At this point I simulated a failure of the primary by killing its pmon process.

e02lora$ ps -aef|grep pmon
UID PID PPID TTY STIME COMMAND
oracle 10184 10564 pty0 16:26:17 pmon
e02lora$ kill -9 10184


The primary database is down. The data in its redo logs is effectively lost as well, since it had not yet been propagated to the standby system.
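Before starting the recovery, it is worth knowing how far behind the standby is. A generic way to see that from the standby side (illustrative only, not from my session):

-- On the standby: report the transport and apply lag as seen by Data Guard
SQL> SELECT name, value, time_computed
  2  FROM   v$dataguard_stats
  3  WHERE  name IN ('transport lag', 'apply lag');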

Performing Recovery

Now that we have simulated a disaster, let's see how to recover from it. Remember, since the Data Guard configuration is maximum performance, the standby can only be recovered up to the most recent redo it has received. With the Axxana software, however, there is additional data that can be pulled from the Blackbox (the storage unit at the primary site that would not be destroyed). But how do you--the DBA--know what files are available at the standby site, which are still left at the primary site, and, most importantly, how to get those onto the standby host? It gets even more complicated because the Blackbox is just a storage device; you have to mount the volumes, mount the ASM disks, and so on. That may be fun when creating a brand new database; definitely not when you are under the gun to recover the database and bring your business back online.



No worries. This is where the Axxana software comes to the rescue. I ran the provided shell script at the standby site. The script contacts the blackbox at the primary site, pulls the needed data and completes the creation of the necessary files at the standby site. Once the files are at the standby site, all you have to do is perform the typical managed standby database recovery to complete the process. The best part of all? The script even gives you a step-by-step instruction sheet, along with the specific file names, which you can copy and paste when Oracle prompts for them. Here is how I called the script, and the resulting output.

root@e02rora1 AxxRecovery# ./AxxRecovery.sh
Logging to '/home/oracle/AxxRecovery/logs/axxana.debug.log'
Calculating Optimization Parameters [done]
Attempting to connect to Axxana's Recoverer [done]

Perform Failover?

Warning!
This operation terminates the Axxana protection for all DBs.

1) OK
2) Cancel & Exit

At this point the script pauses and asks me for confirmation. I enter "1" and the script continues:


Requesting Axxana's Recoverer to 'FAILOVER' [done]

Recoverer communicates with the Black Box (ENTER_FAILOVER_START) [done]
Recoverer communicates with the Black Box (ENTER_FAILOVER_BBX) [done]
Recoverer communicates with the Black Box (ENTER_FAILOVER_CONNECT) [done]

Recoverer is in FAILOVER mode

Transferring required logs files of the database 'adb'.

The script shows the names of the files along with their respective transfer status and percentage progress.

FILE NAME | Size in MBytes | Stage | Progress
==================================|=================|==============|===================
ADB-group_3.258.902686913 | 50 (0) | Completed | 100%
ADB-thread_1_seq_111.266.902852033| 46 (46) | Completed | 100%
ADB-group_2.257.902686911 | 50 (50) | Completed | 100%
ADB-1454516139530A12435456XT_112.C| 11 (11) | Completed | 100%
ADB-group_1.256.902686901 | 50 (50) | Completed | 100%
==================================|=================|==============|===================
Total: | 208 (159) | 5 of 5 files recovered

All required log files exist at '/home/oracle/AxxRecovery/axx_recovery_files'

Preparing user recovery scripts ...

You can safely perform the Oracle RPO=0 recovery process

Please follow the RPO=0 recovery instructions that are described in the file '/home/oracle/AxxRecovery/axx_required_files/recovery_instructions.txt'


As I mentioned, the script creates a detailed step-by-step instruction sheet to be followed for the standby recovery. I am actually glad that it does not perform the recovery automatically. That is one process you want to watch and proceed with caution; you probably have only one chance at it, and rushing through it may cause missteps. At the same time, you want to have to think as little as possible under those stressful conditions, so the detailed instruction sheet comes in handy. The last line shows the location of the instructions file. Here is how the file looks in my case, after removing some banner items:

|---------------------------------------------------------------------------------------------------------------|
| RECOVERY INSTRUCTIONS (03/02/2016 16:26:19) |
|===============================================================================================================|
===============================================================================================================

Step 1) Output setup
====================

Please log into the standby database (as sys user)
and then run the following commands at the SQL prompt:

SQL>
SET SERVEROUTPUT ON
SET LINES 999
SET PAGES 0

Step 2) Recovery using archive log files
========================================

Please run the following commands at the SQL prompt:
* (You can safely ignore the following ORA Error 'Managed Standby Recovery not active')

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;

SQL> RECOVER STANDBY DATABASE UNTIL CANCEL;

-- when requested, feed the following file(s):

+DATA/adbsb/archivelog/2016_02_03/thread_1_seq_110.382.902852027
/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ARCHIVELOG/2016_02_03/thread_1_seq_111.266.902852033

-- finally enter 'Cancel'

SQL> CANCEL

Step 3) Switching to primary control file
=========================================
Please run the following commands at the SQL prompt:

SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP NOMOUNT
SQL> ALTER SYSTEM SET CONTROL_FILES='/home/oracle/AxxRecovery/axx_required_files/1454516139530A12435456XT_112.C' SCOPE=SPFILE;
SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP MOUNT

Step 4) Renaming data and redo log file names
=============================================
Please run the following SQL statement

SQL> @/home/oracle/AxxRecovery/axx_required_files/logAndDateFileRename.sql

Step 5) Recovery using primary control file
===========================================
Please run the following command at the SQL prompt:

SQL> RECOVER DATABASE UNTIL CANCEL USING BACKUP CONTROLFILE
-- when requested, feed the following file(s):

/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_1.256.902686901
/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_2.257.902686911
/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_3.258.902686913

-- You should now see a message saying 'Media Recovery complete'

Step 6) Open the Standby Database
=====================================================
Please run the following commands at the SQL prompt:

SQL> ALTER DATABASE OPEN RESETLOGS;

Well, it's dumbed down enough for those stressful moments associated with a standby database recovery, down to even the SET PAGESIZE commands in SQL*Plus--quite easy to forget in stressful situations. Note that it shows the time it was generated at the very beginning, in a non-US date format (dd/mm/yyyy). Anyway, I followed the instructions step by step. Step 4 requires some attention: it shows how to rename the redo log and data files after switching to the primary control file.
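I won't reproduce logAndDateFileRename.sql here, since it is generated for you, but conceptually it issues statements of the following form so that the primary control file points at the standby-site copies of the files. The file names below are made up purely for illustration.

-- Conceptual illustration only; the real script is generated by the Axxana
-- software, and the file names below are hypothetical.
-- Point a data file entry at its standby-site copy:
SQL> ALTER DATABASE RENAME FILE
  2  '+DATA/adb/datafile/users.259.902686921'
  3  TO '+DATA/adbsb/datafile/users.312.902687455';

-- Point an online log entry at the copy retrieved from the Blackbox:
SQL> ALTER DATABASE RENAME FILE
  2  '+DATA/adb/onlinelog/group_1.256.902686901'
  3  TO '/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_1.256.902686901';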

For the sake of brevity I don't want to show the entire output. Here is an excerpt from the tail end of the activity:

...
... output truncated ...
...
Specify log: {=suggested | filename | AUTO | CANCEL}
/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_2.257.902686911
ORA-00279: change 5172890 generated at 02/03/2016 16:16:56 needed for thread 1
ORA-00289: suggestion : +DATA
ORA-00280: change 5172890 for thread 1 is in sequence #114
ORA-00278: log file '/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_2.257.902686911' no longer needed for this recovery

Specify log: {=suggested | filename | AUTO | CANCEL}
/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_3.258.902686913
Log applied.
Media recovery complete.

SQL> ALTER DATABASE OPEN RESETLOGS;

Database altered.

SQL> SELECT COUNT (1) FROM ARUP.T;

COUNT(1)
----------
399200


Voila! The output shows "Media recovery complete". The count is 399,200--the same as in the production database. The recovery process got those last changes. I accomplished my objective without synchronous replication.

Summary

Recovery without data loss has always been dependent on a high throughput, low latency, ultra-reliable network. Unfortunately, the cost of that infrastructure often precludes the use of no-data-loss setups in organizations. The Axxana solution is innovative in the sense that it addresses the issue with a completely out-of-the-box approach, slashing costs dramatically, in fact so dramatically that most corporations will be able to accomplish no data loss in their Oracle Data Guard setups. At the heart of the setup is the "blackbox", which is just solid state storage with protection to withstand water, fire and bombing. When a disaster obliterates the primary site, this blackbox likely survives, and through its cellular modem it allows its contents to be pulled to the standby site to reconstruct the important files that have not been replicated yet. Once those files are retrieved, it's quite simple to recover the standby database completely. In this blog you saw an example of how I did exactly that, as a user.

I hope this helps in making decisions about a no-data-loss Data Guard setup with respect to cost.

Monday, November 09, 2015

NOUG Session: How Cache Fusion Works

For all those who attended my session at the Northeast Oracle User Group at Babson College in Wellesley, MA: have you ever wondered how Cache Fusion knows where to get the block from? Or how block locks differ from row locks? Or are you confused about the meaning and purpose of the Global Cache Service (GCS), Global Resource Directory (GRD) and Global Enqueue Service (GES)? The session was meant to explain how all these actually work under the covers, with live demos.


I hope you found the session useful and entertaining. As with all my sessions, I use slides as an aid to the presentation, not to communicate the concepts. I have written an accompanying paper that explains these in detail. You can download the paper here, or the presentation here. All the scripts I used can be downloaded here.




As always, your feedback will be greatly appreciated.

Sunday, October 25, 2015

Oracle Open World 2015: The Art and Craft of Tracing

Thank you very much to those who attended my session The Art and Craft of Tracing in the #IOUG User Group track. It was a great pleasure to see a standing-room-only crowd, especially in a slot right after lunch. I hope you found the session useful.


Download the presentation here and the scripts I used here.


As always, I would love to hear from you. Please use the comments space to give your feedback, or send me an email, a tweet (@ArupNanda), or a Facebook post.

Thursday, September 24, 2015

NYOUG Session: Latches Demystified

Thank you to all those who came to attend my session on demystifying latches at the New York Oracle Users Group in Manhattan. I hope you found the session useful and enjoyable.





In this session, you learned:


What latches are, and their purpose
Buffer cache latches
Shared pool latches
Identifying latch waits


Here are the slide deck and the scripts to reproduce the demos I showed, as well as those I couldn't.


As usual, I would highly appreciate hearing from you about how you liked it, or didn't. Please feel free to comment here or contact me directly.


Email: arup@proligence.com
Twitter: @ArupNanda
Facebook: ArupKNanda
Google+: ArupNanda



Saturday, August 01, 2015

NYOUG July 2015 Session on Tracing and Profiling

Many thanks to all those who came to attend my two sessions, "The Art and Craft of Tracing" and "Profiling for Performance in PL/SQL", at the New York Oracle Users Group meeting on July 29th. Here you can download the presentations as well as the scripts I used for the demos.


Presentation: Tracing
Presentation: Profiling
Scripts for both the Tracing and Profiling sessions (ZIP file)


As always, your feedback will be highly appreciated. Do write to me about what you liked or didn't like, and how it helped you at work.



Sunday, April 12, 2015

No Data Loss without Synchronous Network

I don't usually write about specific products, but once in a while I encounter something worth talking about, especially if it addresses a very common issue anyone with datacenter management responsibilities will understand: the otherwise unavoidable loss of last-minute data changes in a database after a disaster, without resorting to expensive synchronous replication. This product solves that problem with an innovative, out-of-the-box approach. The product is the Phoenix system from Axxana.

Replication and Last Mile Transactions

Imagine this scenario: the datacenter is on fire. All components—servers, storage, network—are destroyed. You can't salvage anything of value. What happens to the data you need to continue your business? Fortunately you have a disaster recovery plan. You replicate the database to a remote datacenter using either Data Guard (from Oracle) or some storage-level technology such as EMC's SRDF. So the database comes up at the remote site; but there is a problem. Since you used asynchronous replication (such as maximum performance mode in Data Guard), not all changes have made it to the remote site. The final few changes, usually known as last mile transactions, were yet to be shipped to the remote site when the disaster struck. They are lost forever in the fire. While you do have a database at the remote datacenter, and that database is almost complete—the operative word is "almost"; not 100%, mind you—some data is inevitably lost. What if you can't afford to be "almost" complete? For organizations such as financial institutions, hospitals, manufacturing concerns and airlines, where losing data is simply not an option, this is not a luxury they can afford. Even in other organizations, where data loss may not be as unacceptable, the loss creates a sense of uncertainty, since you don't know exactly what was lost.

If you need to be 100% certain that all data is available at the remote site, what would you do?

It's easy: use synchronous replication. All changes on the primary database are reflected at the remote site in real time. If you used Data Guard, you would run it in maximum protection mode. In that case, Oracle writes the redo for each change to the remote site before confirming the commit at the local site. When the primary site is gone, the remote site is available with 100% of the data. Bye bye, "almost".
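For reference, switching a Data Guard configuration to that mode involves making the redo transport destination synchronous and explicitly raising the protection mode. A minimal sketch, with adbsb as a hypothetical TNS alias for the standby:

-- Illustrative sketch only: synchronous transport plus maximum protection.
SQL> ALTER SYSTEM SET log_archive_dest_2 =
  2  'SERVICE=adbsb SYNC AFFIRM VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=adbsb'
  3  SCOPE=BOTH;

-- Raising the mode to maximum protection requires a mounted instance
SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP MOUNT
SQL> ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE PROTECTION;
SQL> ALTER DATABASE OPEN;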

Brilliant. So why didn't you do it? Better yet, why doesn't everyone do it, if it is that simple? Data loss could be unacceptable in the worst case and unsettling at best; why accept it?

Synchronous Replication

Well, there is a pesky little detail. Synchronous replication means the network connectivity has to be rock solid. Networks have three general characteristics: throughput (how much data they can pump; imagine a three-lane highway compared to a single-lane local road), latency (how long the data takes to get through, independent of raw speed; the car on the highway may travel at 70 miles per hour, but a few minutes are still spent on the ramp to the highway) and reliability (does the data move from point A to point B with 100% accuracy, or does it have to be re-transmitted a few times?).

Synchronous replication requires the throughput to be very high (to support the large amount of change, at least during bursts of activity), the latency to be very low (otherwise the remote site gets the data late and responds to the primary even later, causing the primary to hold off the commit) and the network to be extremely reliable (otherwise the primary may think the remote site is not accessible and shut itself down to protect the data). If you run Data Guard in maximum protection mode, you are essentially telling Oracle to make sure that the remote site has absolutely, positively, undoubtedly (add any other adverb you can think of) received the change and committed it.

If Oracle can't ensure that for any reason, such as not getting a response in time over a less reliable network, what choice does it have? If it allows further data changes, those changes are not at the remote site yet; if the primary database fails at that point, the data is gone. Therefore Oracle has no choice but to shut down the primary database to stop any changes from coming in. It's maximum protection, after all, and that's what you have instructed it to do.

So, if you decide to use maximum protection, you have to use a high throughput, low latency and extremely reliable network infrastructure. Most commercial public networks are none of these. Either you contract a commercial carrier to provide this elevated level of service, or you build your own, for example using dark fiber. The cost of the network infrastructure becomes very high, especially when the remote site is far away from the primary. In many cases, the cost of the network itself may be several times that of the database infrastructure it protects. Owing to the cost, you may be forced to locate the remote site close by, e.g. in New York City and Hoboken, NJ. It will still be exceedingly expensive, and it may not offer the degree of protection you expect: those two cities are close enough to be in the same exposure area for disasters such as floods, hurricanes, war and so on. The farther away the remote site is, the more protected your data is, but the higher your cost. Could you accomplish no data loss without this expensive proposition?

Until now, the decision was really black and white: if you wanted no data loss at all, you had no choice but to go for a super expensive network solution. Many companies can't justify that high sticker price and therefore settle for potential data loss. Many, in fact, make detailed plans for handling this lost data as part of their business continuity efforts.

It's assumed that zero data loss is synonymous with an expensive network: if you don't have a high throughput, low latency, highly reliable network, you have to live with some data loss.

Here is the good news: no, you don't. It's now possible to use cheaper, commoditized public network infrastructure and still have complete data protection. Allow me to explain.

Data Protection in Oracle

In Oracle, the data files are written asynchronously, at intervals unrelated to the data changes and commits. In other words, when you commit a change, the data files may not have that changed data; in fact the change occurs in memory only (in the buffer cache) and may not exist in the data files for hours afterwards. Similarly, when you make a change but do not commit it, the data can still be persisted to the data files. Let me repeat that: the data files are updated with the changed data even if you haven't committed yet. This is why, if you have a storage or operating system level replication solution—even a synchronous one—replicating the data files, the remote site may or may not have the data, even hours after the change.
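You can see this lag for yourself by comparing the checkpoint SCN recorded in the data file headers with the current SCN of the database; an illustrative query:

-- The data file headers normally lag the current SCN, even for data
-- that was committed long ago (illustrative query).
SQL> SELECT current_scn FROM v$database;

SQL> SELECT file#, checkpoint_change#
  2  FROM   v$datafile_header
  3  ORDER  BY file#;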

How does Oracle protect data that was changed and committed but is only in memory, if the data files do not have it? It captures the pre- and post-change data and packages it into something called redo blocks. Remember, these have nothing to do with data blocks; they merely describe the changes created by activities performed on the database. This redo data—also known as redo change vectors—is written to a special area in memory called the log buffer. When you commit, the relevant redo blocks from the log buffer are written to special files in the database called redo log files, also known as online redo log files. The commit waits until this write—known as a redo flush—has completed. You can check for Oracle sessions waiting on this flush by looking at the wait event "log file sync". Since the changes—most importantly, the committed changes—are recorded in the redo log files, Oracle does not need to rely on memory alone to know which changes are committed and which are not. In case of a failure, Oracle examines the redo logs to find these changes and updates the data files accordingly. Redo logs are very small compared to the data files.
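If you are curious, a couple of illustrative queries show this wait in action:

-- Sessions currently waiting for their commit redo to be flushed by LGWR
SQL> SELECT sid, event, seconds_in_wait
  2  FROM   v$session_wait
  3  WHERE  event = 'log file sync';

-- Cumulative instance-wide statistics for the same wait event
SQL> SELECT event, total_waits, time_waited
  2  FROM   v$system_event
  3  WHERE  event = 'log file sync';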

By the way, a redo flush also occurs at other times: every three seconds, when 1 MB of redo accumulates in the log buffer, when the log buffer becomes a third full, and on a few other events; but those additional flushes merely keep the redo log file up to date even when there is no commit.

As you can see, the redo log file becomes the most important thing in the data recovery process. When a disaster occurs, you may have copies of the data files at the remote site (thanks to the replication); but as you learned in the previous section, those copies are not usable yet, since they may not have all the committed changes and may even contain uncommitted changes. In other words, the copy is not considered "consistent" by Oracle. After a disaster, Oracle needs to read the redo log files and apply the changes to make the data files consistent. This is known as "media recovery", and you have to initiate it. With synchronous replication at the storage or operating system level, the redo logs at the remote site are perfectly in sync with the primary site, and Oracle has no trouble getting to the last committed transaction just before the failure; no data is lost as a result of the recovery. With Data Guard in maximum protection mode this is not a concern either, since the changes are already at the remote site. But what about your cheaper, commodity network with asynchronous replication? The redo logs at the remote site will not be up to date with the primary site's redo logs. When you perform media recovery, you can't get to the very last change before the failure, simply because you may not have it; you can recover only up to the last information available in the redo at the remote site. This is known as "incomplete" media recovery, as distinguished from the "complete" media recovery described earlier. To complete the media recovery, you need the information in the redo log files at the primary site; but remember, the primary site is destroyed. That information is gone. You end up with data loss. Perhaps even worse, you won't know exactly how much you lost, since you don't have access to the primary redo log files.

Now, consider this situation carefully. All you need is the last redo log file from the primary database to complete the recovery. Unfortunately that file is not available, because it was destroyed or is otherwise inaccessible since the site itself is inaccessible. This tiny little file is the only thing standing between you and a complete recovery. What if you somehow, magically, had access to this file even though the rest of the data center is gone? You would be able to complete the recovery with no data loss and look like a hero, and that without a synchronous replication solution and its super expensive network.

Enter the Black Box

Oracle's redo data can be written to multiple files at the same time. These files form a redo log group and the files are called members. The log flush writes to all members before confirming the flush; as a result, all members of a group have the same data. Multiple members are created purely for redundancy: as long as one member of a group is available, Oracle can use it to perform recovery. This is where the new tool from Axxana comes in. It is a storage device—named Phoenix—on which you create a second member of each redo log group. The first member of each group stays on your normal storage, as usual. When disaster strikes and nukes the primary site, you still have a copy of the all-important redo log in the Phoenix system.
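A quick way to picture this is to look at the redo log groups and their members; with a Phoenix member in place, each group should show one member on your regular storage and one on the Blackbox. The +AXX disk group name below is a hypothetical example.

-- Illustrative query: list each redo log group with all of its members.
-- With the Phoenix member added, every group has a member on normal
-- storage (e.g. +DATA) and one on the Blackbox (e.g. the hypothetical +AXX).
SQL> SELECT l.group#, l.status, f.member
  2  FROM   v$log l JOIN v$logfile f ON f.group# = l.group#
  3  ORDER  BY l.group#, f.member;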

This is where the first benefit of the Phoenix system comes in. The storage is not just any ordinary storage: it's encased in a bomb-proof, fire-proof and water-proof container that protects the internal storage from many calamities. The storage has normal connectivity, such as a network port to serve it over the network as a NAS and a fibre channel port to connect it to a fibre switch as a SAN. Under ordinary circumstances you use these ports to connect it to your infrastructure and use the storage. After a disaster, due to the rugged nature of the enclosure, this storage will most likely be intact. All you have to do is access the data on it and perform a complete recovery. No data needs to be lost.

But how do you get to the information on the Phoenix system? This is where the second benefit comes in. The Phoenix system transfers its data to another Phoenix component at the remote site, creating a replica of the required redo logs there. Since the only data on it are the small redo log files, the amount to transfer is small and does not put a strain on your other network infrastructure.

Note a very important point: the Phoenix system does not transfer any data during normal operation. It is only during a disaster that the system transports the required redo log files, to enable a 100% recovery at the remote site.




But it still depends on getting the data out of the Phoenix system that was at the primary site. What if the disaster site is physically inaccessible, or it is infeasible to physically transport the Phoenix system to a location where it can be connected to your network? That is quite possible in case of floods, hurricanes and other natural disasters, or manmade ones like war or strikes. The network cables are also likely to be out of commission after a disaster. Without that access, how can the Phoenix system deliver the needed last mile transactions from the primary site, you might ask.

No worries; there is a cellular modem built into the Phoenix system that allows you to connect to it from the remote site and transfer the data wirelessly over a cellular network. The system also has its own battery, which keeps it operational even when external power is gone—a common occurrence in almost any calamity. What's more, the transfer of data after the disaster can also use this cellular connectivity, so you may not even need to physically connect the storage at the (now defunct) primary site to your network. The data you need to perform the complete, no-data-loss recovery may already have been transferred to the system at the remote site and be waiting for you. In any case, you have access to the data. And all this comes without the need to invest in synchronous replication and an expensive network infrastructure.



In summary, assuming the primary site is gone and you are using a cheaper asynchronous network, the remote site will be partially up to date and the last mile transactions would normally be lost with the primary database. However, since you used the disaster-proof Phoenix system for an additional member of each redo log group, the last mile transactions are intact in that system; the remote site just doesn't have them yet. At this point you have multiple options:

  1. If the network infrastructure between the primary and remote sites is still operational (rare, but possible), the Phoenix data transfer creates the copy of the redo logs at the remote site automatically. With that in place, you can perform a complete database recovery. Data loss: 0%.
  2. If the network infrastructure is not operational, Phoenix will automatically engage the built-in cellular modem and initiate the data transfer to the remote site. It will be slow; but the amount of data to be transferred is small, so it won't matter much. Once that is complete, your recovery will be complete, with 0% data loss.
  3. If you don't have cellular access either, but have access to the physical Phoenix device, you can mount it on your network and perform a complete recovery.

As I said at the start, this is a very simple and elegant out-of-the-box solution to a complex problem. It opens up the possibility of a no-data-loss recovery after a disaster where that option didn't even exist before. In a future post I will describe my hands-on experience with the Phoenix system in detail, with screenshots, scripts, etc. Stay tuned.

More about Axxana: www.axxana.com
