Wednesday, June 17, 2009

May'09 Disaster Recovery: Production brought back to Life!

The test setup for selected users was functioning smoothly. This bought us some time to look for replacements for the bad Disks in the local market. The IT Administrators somehow managed to get the Disks. By the end of the 3rd Day, they had replaced and rebuilt the Disks in the RAID5 Configuration. The O/S Partition was formatted and the server O/S was upgraded to MS Windows 2003 Server [Standard Edition].

We managed to recover most of the data on the RAID5 Disks, especially the Application. The database files were physically recovered, but I decided to rebuild the Database on the Server for the following reasons:
  1. The Server O/S was upgraded
  2. Oracle 9iR2 Software had to be reinstalled
  3. Data had changed in the last 3 days, as people were using the Secondary Setup. So, I needed to clone Production from the Secondary Setup.
On the 4th Day, I cleared the Database Files on the Production Server, installed the Oracle 9iR2 software and patched it to 9.2.0.8. The Application was intact, so I did not have to do anything on that front. Now, I had 3 options to set up the Production Server.
  1. Using export dump of the Secondary Instance, create the Production Oracle Database.
  2. Using RMAN, clone the Production Oracle Database with the "Duplicate Target Database" command.
  3. Or, simply clone the Production Oracle Database using the Cold Backup of the Secondary Database.
The 3rd option was simple and effective. With assistance from one of my colleagues, I performed the Cloning using the Cold Backup, and within an hour the Production Instance was up.
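
The file-copy part of a Cold Backup clone can be sketched as below. This is only an illustration, not the procedure we actually ran: it assumes both databases are cleanly shut down, that the directory layout is the same on both servers (so the Controlfile needs no changes), and the function names `sha256_of` and `copy_cold_backup` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_cold_backup(source_dir: Path, target_dir: Path) -> list:
    """Copy every file under source_dir to target_dir and verify each copy.

    Assumes the database owning the files is shut down (a cold backup),
    so the files do not change while they are being copied.
    Returns the relative paths that were copied, in sorted order.
    """
    copied = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = src.relative_to(source_dir)
        dst = target_dir / relative
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves file timestamps
        if sha256_of(src) != sha256_of(dst):
            raise IOError(f"checksum mismatch after copying {relative}")
        copied.append(relative.as_posix())
    return copied
```

The checksum comparison is there because a silently truncated datafile would only show up later, when the instance fails to open. Starting the instance on the target is a separate, manual step.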

Application Testing was carried out to check that the Application was running smoothly and to validate the Data. Once the testing was successful, we registered the database in the Recovery Catalog and took a Full Database RMAN Backup. For this, we had to ensure that the Oracle Services were owned by the Domain Administrator Account and not the Local Account, so that the Controlfile Autobackup and SPFILE backup could be written to the shared location along with the RMAN Backups.
[Note: 145843.1 How to Configure RMAN to Write to Shared Drives on Windows NT/2000]
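
For illustration, the RMAN commands for such a backup could be generated as in the sketch below. The share path and the function name `build_rman_backup_script` are hypothetical, and the exact commands we ran are not reproduced here. The sketch assumes RMAN is connected to both the target database and the Recovery Catalog, and that the instance was started with an SPFILE (in which case the Controlfile Autobackup also contains the SPFILE).

```python
def build_rman_backup_script(share: str) -> str:
    r"""Return RMAN commands that register the database and back it up to a share.

    `share` is a UNC path such as \\backupsrv\rman (a made-up example).
    """
    share = share.rstrip("\\")  # avoid a doubled separator before %F / %U
    commands = [
        # Needs a Recovery Catalog connection.
        "REGISTER DATABASE;",
        "CONFIGURE CONTROLFILE AUTOBACKUP ON;",
        # %F is mandatory in the autobackup format string.
        "CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK"
        f" TO '{share}\\%F';",
        # %U gives every backup piece a unique file name.
        f"CONFIGURE CHANNEL DEVICE TYPE DISK FORMAT '{share}\\%U';",
        "BACKUP DATABASE PLUS ARCHIVELOG;",
    ]
    return "\n".join(commands) + "\n"


if __name__ == "__main__":
    print(build_rman_backup_script(r"\\backupsrv\rman"))
```

The output can be saved to a command file and passed to RMAN. Writing to the share only works if the Oracle Services run under an account that can access it, which is the point of the note above.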

Once the backup was complete, the Temporary Setup was shut down and we brought the Production Server online for all the users.

In the next 2 weeks, the following arrangements were made to address the shortcomings seen in the Disaster Situation:
  1. The Application is backed up to Tapes daily.
  2. Application files, as of 13th May 2009, have been backed up to DVDs. The Application files are now backed up to DVDs every 15 days.
  3. A Temporary Server was arranged by the IT Administrators and has been cloned (using RMAN) from the Production Server. Scheduled Jobs that clone the Secondary database 3 times a day have been put in place. The cloning process takes more than 2 hours to complete. In case of an unforeseen disaster, we can easily switch to the Secondary Server with minimal Data Loss.
  4. Source Code Backup and the relevant Document Control are strictly ensured prior to moving to Production.
  5. After the Production Server’s Operating System Upgrade to Windows Server 2003, the RMAN backup location has been changed to a SAN Storage location, which is further backed up to Tapes by the IT Administrators. The earlier performance issue, where the RMAN backup took 13 hours, has been resolved. Now, the backup completes in less than an hour. You can read about it here.
  6. For the next 3 months, we will be carrying out a planned monthly recovery of the complete Application from the Tapes and/or the DVDs. Once we are comfortable with the recovery simulation, we can carry it out every Quarter.
  7. Finally, the new Server Procurement Process has been started, and the server to be purchased has been finalized.
We have set up the Secondary Database on a Temporary Server. Due to our Oracle Licensing limitations (Standard Edition), we are not able to use Oracle Data Guard, as it is only available as part of the Enterprise Edition. Being simple and effective, I have currently opted to use RMAN Duplicate Target Database to clone the Secondary Database on the Temporary Server. Alternatively, I am trying to find a solution for manual replication via my own scripts, as the current cloning method leaves us exposed to some data loss between clones. The Cloning has been scripted and scheduled to run 3 times a day:
  • 4 am: Clone after the Full Database RMAN Backups
  • 11 am: Clone in the middle of the Working Day (after all Archivelog Backups are available up to 11 am)
  • 4 pm: Clone at the end of the Working Day (after all Archivelog Backups are available up to 4 pm)
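
To get a rough feel for the data loss exposure of this schedule, the gaps between clones can be worked out as in the sketch below. It is a simplification under stated assumptions: each clone is treated as a snapshot of Production at its start time, the clone duration is taken as exactly 2 hours (the real one is somewhat longer), and the function names `clone_gaps` and `worst_case_exposure` are hypothetical.

```python
from datetime import timedelta


def clone_gaps(start_hours):
    """Return the gaps between consecutive clone starts over a 24-hour cycle.

    start_hours: clone start times as whole hours of the day (0-23).
    The last gap wraps around midnight to the first clone of the next day.
    """
    hours = sorted(start_hours)
    gaps = []
    for i, start in enumerate(hours):
        nxt = hours[(i + 1) % len(hours)]
        # "or 24" covers a single daily clone, where the modulo gives 0.
        gaps.append(timedelta(hours=(nxt - start) % 24 or 24))
    return gaps


def worst_case_exposure(start_hours, clone_duration_hours):
    """Longest gap plus the clone duration.

    Until the next clone finishes, the Secondary still holds the data
    of the previous clone, so the running clone adds to the exposure.
    """
    longest = max(clone_gaps(start_hours))
    return longest + timedelta(hours=clone_duration_hours)


if __name__ == "__main__":
    schedule = [4, 11, 16]  # 4 am, 11 am, 4 pm
    print([int(g.total_seconds() // 3600) for g in clone_gaps(schedule)])
    # prints [7, 5, 12]
    print(worst_case_exposure(schedule, 2))
    # prints 14:00:00
```

The 12-hour gap is the overnight one, so how much it matters depends on how much work happens after 4 pm. Within the working day, the gaps are 7 and 5 hours plus the clone time.
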
Once the Production System is migrated to a new server, the old server can be used as a Standby Database. Also, we will be migrating the production Database from Oracle 9iR2 [9.2.0.8] to Oracle Database 10g [10.2.0.4].