A few months ago I wrote about an exciting product that allows you
to achieve no data loss in a disaster even without running Data Guard
in maximum protection mode, which relies on super expensive synchronous
networks. Recently I had a chance to actually work on that system
and replicate the scenarios. Here I explain how it actually works,
with commands and actual outputs.
Background
First, you may want to refresh your memory on the concept behind the
product in the earlier blog post: http://arup.blogspot.com/2015/04/no-data-loss-without-synchronous-network.html
The product is the Phoenix system from Axxana.
To reiterate, I am not an employee of that company, nor on its
payroll; I just want to make others aware of this innovative,
out-of-the-box solution to a problem we are all familiar with. I usually
don't write about specific products unless I feel strongly about their
effectiveness for the user community.
Let's examine the issue we are facing:
- In a Data Guard environment, some amount of redo information may be left behind at the primary site.
- If the primary site meets with a disaster where nothing is left, this data is lost forever.
- If this data loss is not acceptable, you have to run Data Guard in maximum protection mode, which guarantees that updates occurring on the primary are reflected on the standby database immediately. That way the data is already present at the standby site and is not affected by the loss of the primary site. (A sketch of how this mode is checked and set appears after this list.)
- However, this also means that the network has to be super fast and super reliable to guarantee the updates at the standby site. Unfortunately, that also means the network is super expensive, and it can become technically challenging across very large distances.
- This is why many corporations that can't afford the expensive network infrastructure do not implement the maximum protection Data Guard and merely accept the data loss as an unavoidable problem.
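For reference, the protection mode is an ordinary database setting. Below is a minimal sketch of how it is checked and how it would be raised to maximum protection; the destination number log_archive_dest_2 and the service name adbsb are illustrative assumptions, not requirements:

SQL> -- Check the current protection mode and level
SQL> select protection_mode, protection_level from v$database;

SQL> -- Maximum protection needs a synchronous (SYNC AFFIRM) redo transport destination,
SQL> -- and the mode itself is raised with the primary in MOUNT state:
SQL> alter system set log_archive_dest_2 =
       'SERVICE=adbsb SYNC AFFIRM VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=adbsb';
SQL> alter database set standby database to maximize protection;

It is that SYNC requirement, and the network it demands, that pushes organizations to settle for maximum performance mode and live with the potential data loss.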
Well, until now. Axxana's product addresses the issue in a
non-traditional manner. The only thing that stands between an
expensive network and full recovery at the standby site in case of a
disaster is the redo information that has not yet been transmitted to
the standby server. If this information were available to the
standby host, you would be able to recover completely. The
Phoenix system from Axxana is a solid state drive enclosed inside a
case protected from common disaster elements such as flood, fire,
bombing, etc. You put this special box at the primary site and create
one member of each redo log group on its storage, as shown in the
figure below. The box has a protected network port that
allows you to access its contents even if the normal ports are all
destroyed by the disaster. It even has a cellular modem that allows
it to be reached remotely. In fact, the Axxana software can pull
the necessary data from this special box remotely, from the standby
host, and apply it to the standby database. And the best thing about it
is that this is the plain vanilla Data Guard you are familiar with;
there is no special magic under the hood.
Fig 1. Axxana Blackbox Internals
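Creating that extra redo log member is plain Oracle administration. Here is a minimal sketch, assuming the Blackbox LUN is presented to the primary as an ASM disk group that I am calling +AXX (a hypothetical name for illustration only):

SQL> -- Add a member on the Axxana-protected storage (+AXX is a hypothetical disk group)
SQL> -- to each online redo log group, alongside the existing members
SQL> alter database add logfile member '+AXX' to group 1;
SQL> alter database add logfile member '+AXX' to group 2;
SQL> alter database add logfile member '+AXX' to group 3;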
Let's see how it works with a hands-on approach inside a lab. I have
two hosts:
        | Host     | Database
--------|----------|---------
Primary | e02lora  | adb
Standby | e02rora1 | adbsb
The storage is ASM, not a filesystem. Replication on a filesystem is
relatively simple, so I deliberately chose ASM to see if any ASM
nuances came up. Data Guard is configured in maximum
performance mode, so there is no need for an expensive, fast, low-latency
network infrastructure. The setup has an Axxana Phoenix
system, which consists of three components:
- Blackbox - the "special" protected storage system.
- Collector - a small server at the primary site that
  collects the changes to the files under Axxana protection.
  In this case, those files are the controlfile, redo logs and archived logs.
  I don't need to place any database files there because the standby
  already has them; only the latest redo matters for this recovery.
  Keeping this small set of files also makes
  it possible to put the protected files of many databases--not
  just one--inside a single blackbox.
- Recoverer - a small server at the standby site that
  receives the changes from the blackbox and keeps the files up
  to date when a disaster occurs.
Setting up
First I ensured that the managed recovery process is running and the
standby database is in MOUNTED mode.
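A quick way to confirm both, sketched below (the exact columns of V$MANAGED_STANDBY vary slightly between versions), is to run on the standby:

SQL> -- Is the managed recovery process (MRP) running and applying redo?
SQL> select process, status, sequence# from v$managed_standby where process like 'MRP%';

SQL> -- Is the standby mounted rather than open?
SQL> select open_mode, database_role from v$database;

Then I created a test schema in the production database.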
e02lora$ sqlplus / as sysdba

SQL> grant create session, create table, create procedure, unlimited tablespace to arup identified by arup;
Then I created some objects and data inside this schema.
SQL> connect arup/arup
SQL> create table T (c1 number, c2 timestamp);
SQL> create index IN_T on T (C1);
SQL> create or replace procedure INS_T (p_c1 in number, p_commit in number) as
     begin
       for i in 1..p_c1 loop
         insert into T values (i, current_timestamp);
         if mod(i, p_commit) = 0 then
           commit;
         end if;
       end loop;
     end;
     /
SQL> create or replace procedure DEL_T (p_commit in number) as
     begin
       delete T where mod(C1, p_commit) = 5;
       commit;
     end;
     /
SQL> create or replace procedure UPD_T (p_commit number) as
     begin
       update T set C1 = -1 where mod(C1, p_commit) = 3;
       commit;
     end;
     /
SQL> exec INS_T (400000, 5000);
SQL> commit;
After creating the test data, I switched a few log files and waited
a bit to make sure the changes were replicated to the standby.
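The commands involved were essentially the following (a sketch; the switches run on the primary and the check on the standby):

SQL> -- On the primary: force a couple of log switches so the redo gets shipped
SQL> alter system switch logfile;
SQL> alter system switch logfile;

SQL> -- On the standby: verify the shipped sequences have been applied
SQL> select max(sequence#) from v$archived_log where applied = 'YES';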
Simulating a Failure
Now to simulate a disconnected network. I didn't have any special tools with me, so I did the next best thing: I disabled the Ethernet interface on the standby host.
e02rora$ su -
Password:
e02rora$ ifdown eth0
At this point the standby host will not receive the updates; all the
changes made to the primary database will remain at the primary site only. On the standby, I first canceled the recovery process.
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
I made a few changes to the primary data.
SQL> connect arup/arup
SQL> exec DEL_T (500);
SQL> exec UPD_T (500);
SQL> conn / as sysdba
SQL> alter system switch logfile;
SQL> conn arup/arup
SQL> select count(1) from T;

  COUNT(1)
----------
    399200
The table now has only 399,200 records. Since the managed recovery
process is stopped, these final changes will not be replicated to the
standby. I can confirm that by opening the standby in read-only mode
and checking the actual table.
On the standby host:
SQL> conn / as sysdba
SQL> ALTER DATABASE OPEN READ ONLY;

Database altered.

SQL> SELECT COUNT (1) FROM ARUP.T;

  COUNT(1)
----------
    400000
The standby still has all 400,000 rows, compared to 399,200 rows in the
primary. The discrepancy is due to the changes that were not propagated
to the standby. At this point I simulate a failure of the primary by
killing the pmon process.
e02lora$ ps -aef | grep pmon
UID      PID    PPID   TTY   STIME     COMMAND
oracle   10184  10564  pty0  16:26:17  pmon
e02lora$ kill -9 10184
The primary database is down. The data in the redo logs is lost as well, since it has not been propagated to the standby system yet.
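As an aside, before any recovery you would want to know how far behind the standby is. A minimal sketch of that check on the standby:

SQL> -- Highest archived log sequence the standby has received, per thread
SQL> select thread#, max(sequence#) from v$archived_log group by thread#;

SQL> -- Any gap in received sequences that the standby has already detected
SQL> select * from v$archive_gap;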
Performing Recovery
Now that we have simulated a disaster, let's see how to recover from
it. Remember, since the Data Guard configuration is Maximum
Performance, the standby can only be recovered up to the last redo it
received. With the Axxana software, however, there is additional
data that can be pulled from the Blackbox (the storage unit at the
primary site that would not have been destroyed). But how do you, the DBA, know which files are already available at the standby site, which are still left at the primary site, and, most
importantly, how do you get the latter to the standby host? It gets even more complicated since the Blackbox is just a storage device; you have to mount the volumes,
mount the ASM disks, and so on. That may be fun when creating a brand
new database; it is definitely not fun when you are under the gun to recover
the database and bring your business back online.
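Just to give a flavor of what doing this by hand would involve, discovering and mounting a foreign ASM disk looks roughly like the following; the disk paths and the disk group name are entirely hypothetical, and this is not the Axxana procedure:

SQL> -- On the ASM instance at the recovery site: include the new LUN in disk discovery
SQL> alter system set asm_diskstring = '/dev/oracleasm/disks/*', '/dev/axxana/*';

SQL> -- Then mount the disk group that holds the protected files
SQL> alter diskgroup AXX_DG mount;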
No worries. This is where the Axxana software comes to the rescue. I
ran the provided shell script at the standby site. This script
contacts the Blackbox at the primary site, pulls the needed data, and
completes the creation of the necessary files at the standby site. Once
the files are at the standby site, all you have to do is perform
the typical managed standby database recovery to complete the
process. The best part of all? The script even produces a
step-by-step instruction sheet, with the specific file names, which
you can copy and paste when Oracle prompts for them. Here is how I
called the script, and the resulting output.
root@e02rora1 AxxRecovery# ./AxxRecovery.sh
Logging to '/home/oracle/AxxRecovery/logs/axxana.debug.log'
Calculating Optimization Parameters               [done]
Attempting to connect to Axxana's Recoverer       [done]

Perform Failover?
Warning! This operation terminates the Axxana protection for all DBs.
1) OK
2) Cancel & Exit
At this point the script pauses and asks me for confirmation. I
enter "1" and the script continues:
Requesting Axxana's Recoverer to 'FAILOVER'                           [done]
Recoverer communicates with the Black Box (ENTER_FAILOVER_START)      [done]
Recoverer communicates with the Black Box (ENTER_FAILOVER_BBX)        [done]
Recoverer communicates with the Black Box (ENTER_FAILOVER_CONNECT)    [done]
Recoverer is in FAILOVER mode
Transferring required logs files of the database 'adb'.

The script shows me the names of the files along with their respective transfer status and the percentage progress.

FILE NAME                          | Size in MBytes  | Stage        | Progress
===================================|=================|==============|===================
ADB-group_3.258.902686913          | 50 (0)          | Completed    | 100%
ADB-thread_1_seq_111.266.902852033 | 46 (46)         | Completed    | 100%
ADB-group_2.257.902686911          | 50 (50)         | Completed    | 100%
ADB-1454516139530A12435456XT_112.C | 11 (11)         | Completed    | 100%
ADB-group_1.256.902686901          | 50 (50)         | Completed    | 100%
===================================|=================|==============|===================
Total:                             | 208 (159)       | 5 of 5 files recovered

All required log files exist at '/home/oracle/AxxRecovery/axx_recovery_files'
Preparing user recovery scripts ...

You can safely perform the Oracle RPO=0 recovery process
Please follow the RPO=0 recovery instructions that are described in the file
'/home/oracle/AxxRecovery/axx_required_files/recovery_instructions.txt'
As I mentioned, the script creates a detailed, step-by-step
instruction sheet to be followed for the standby recovery. I am
actually glad that it does not perform the recovery automatically.
That is one process you want to watch and proceed with caution; you
probably have only one chance at it, and rushing through it may
force missteps. At the same time, you want to have to think
as little as possible under those stressful conditions, so the
detailed instruction sheet comes in handy. The last line shows the
location of the instructions file. Here is how the file looks
in my case, after removing some banner items:
|---------------------------------------------------------------------------------------------------------------|
|                                 RECOVERY INSTRUCTIONS (03/02/2016 16:26:19)                                    |
|=================================================================================================================|
===================================================================================================================

Step 1) Output setup
====================
Please log into the standby database (as sys user) and then run the following commands at the SQL prompt:

SQL> SET SERVEROUTPUT ON
     SET LINES 999
     SET PAGES 0

Step 2) Recovery using archive log files
========================================
Please run the following commands at the SQL prompt:
* (You can safely ignore the following ORA Error 'Managed Standby Recovery not active')

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
SQL> RECOVER STANDBY DATABASE UNTIL CANCEL;
     -- when requested, feed the following file(s):
     +DATA/adbsb/archivelog/2016_02_03/thread_1_seq_110.382.902852027
     /home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ARCHIVELOG/2016_02_03/thread_1_seq_111.266.902852033
     -- finally enter 'Cancel'
SQL> CANCEL

Step 3) Switching to primary control file
=========================================
Please run the following commands at the SQL prompt:

SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP NOMOUNT
SQL> ALTER SYSTEM SET CONTROL_FILES='/home/oracle/AxxRecovery/axx_required_files/1454516139530A12435456XT_112.C' SCOPE=SPFILE;
SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP MOUNT

Step 4) Renaming data and redo log file names
=============================================
Please run the following SQL statement

SQL> @/home/oracle/AxxRecovery/axx_required_files/logAndDateFileRename.sql

Step 5) Recovery using primary control file
===========================================
Please run the following command at the SQL prompt:

SQL> RECOVER DATABASE UNTIL CANCEL USING BACKUP CONTROLFILE
     -- when requested, feed the following file(s):
     /home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_1.256.902686901
     /home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_2.257.902686911
     /home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_3.258.902686913
     -- You should now see a message saying 'Media Recovery complete'

Step 6) Open the Standby Database
=================================
Please run the following commands at the SQL prompt:

SQL> ALTER DATABASE OPEN RESETLOGS;
Well, it's dumbed down enough for those stressful moments associated
with a standby database recovery, down to the SET LINES and SET PAGES commands in SQL*Plus that are quite easy to forget under pressure. Note that it shows the generation time at the very beginning, in the non-US dd/mm/yyyy date format. Anyway, I followed the instructions step by step.
Step 4 requires some attention. It shows how to rename the redo logs
and data files after the failover.
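The generated script is essentially a series of ALTER DATABASE RENAME FILE statements; conceptually something like the following, where the source path is hypothetical and only illustrates the pattern, not the actual generated content:

SQL> -- Illustration only: repoint an online redo log recorded in the primary control file
SQL> -- to the copy pulled from the Blackbox into the local recovery directory
SQL> alter database rename file '+DATA/adb/onlinelog/group_1.256.902686901'
       to '/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_1.256.902686901';

This is exactly the kind of typing you do not want to do by hand under pressure, which is why having it generated for you matters.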
For the sake of brevity I don't want to show the entire output. Here
is an excerpt from the tail end of the activity:
... ... output truncated ... ...

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_2.257.902686911
ORA-00279: change 5172890 generated at 02/03/2016 16:16:56 needed for thread 1
ORA-00289: suggestion : +DATA
ORA-00280: change 5172890 for thread 1 is in sequence #114
ORA-00278: log file '/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_2.257.902686911' no longer needed for this recovery

Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_3.258.902686913
Log applied.
Media recovery complete.

SQL> ALTER DATABASE OPEN RESETLOGS;

Database altered.

SQL> SELECT COUNT (1) FROM ARUP.T;

  COUNT(1)
----------
    399200
Voila! The output shows "Media recovery complete". The count is
399,200, the same as in the production database. The
recovery process picked up those last changes. I accomplished my objective
without synchronous replication.
Summary
Recovery without data loss has always depended on a high
throughput, low latency, ultra-reliable network. Unfortunately, the
cost of that infrastructure often precludes no-data-loss
setups in many organizations. The Axxana solution is innovative in
the sense that it addresses the issue in a completely out-of-the-box
way, slashing costs dramatically; in fact, so dramatically
that most corporations will be able to accomplish no data loss in their
Oracle Data Guard setups. At the heart of the setup is the
"blackbox", which is simply solid state storage protected to
withstand water, fire and explosions. When a disaster obliterates the primary site, this
blackbox likely survives and, through its cellular modem, allows its
contents to be pulled to the standby site to reconstruct the
important files that have not yet been replicated. Once those files
are retrieved, it is quite simple to recover the standby database
completely. In this blog you saw an example of how I did exactly
that as a user.
I hope this helps in making decisions about a no-data-loss Data Guard setup with respect to cost.