Sunday, May 31, 2009

May'09 Disaster Recovery: Production Lost

Finally, I am able to write about the Production Server Crash that happened at 07:25 am on 14th May '09.

Our server has a 4-disk RAID5 configuration. We had encountered a disk failure a couple of weeks earlier, which was recovered the same day. That was when our ERP Database Files got corrupted. You can read about it here.

This time, 2 disks of the RAID5 configuration crashed. RAID5 tolerates the loss of only one disk, so our 4-disk array needed at least 3 disks running to survive. So, everything along with the server went down. The server was hosting our Oracle Forms6i-based ERP Application as well as our Oracle ERP Database.
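To see why one failed disk is survivable and two are not, here is a minimal Python sketch of the XOR parity idea behind RAID5. It is a simplification under stated assumptions: a single stripe, a fixed parity position (real controllers rotate parity across the disks), and made-up 4-byte blocks. The helper name `xor_blocks` is mine, not something from our controller.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical 4-disk RAID5 array: 3 data blocks + 1 parity block.
data = [b"ERP1", b"ERP2", b"ERP3"]   # made-up block contents
parity = xor_blocks(data)
stripe = data + [parity]

# One failed disk: XOR of the three survivors rebuilds the missing block.
lost = 1
survivors = [blk for i, blk in enumerate(stripe) if i != lost]
rebuilt = xor_blocks(survivors)
print("single failure rebuilt:", rebuilt == stripe[lost])

# Two failed disks (indexes 1 and 2): the two survivors only give the XOR of
# the two missing blocks, which is not enough to separate them again.
combined = xor_blocks([stripe[0], stripe[3]])
print("only the XOR of the lost pair is known:",
      combined == xor_blocks([stripe[1], stripe[2]]))
```

With one disk gone, the array can still serve every block. With two gone, each stripe is short of information, which is the situation we were in.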

Initially, I was fairly confident, as I had my Full Database RMAN Backups available as well as last night's full database export dump file. The disks in the RAID5 configuration were a more or less obsolete model, so we were not able to get replacements. Our supplier told us that the disks might be available in the Australian Warehouse, and that it would take 2-3 weeks to send them to us. As far as business continuity is concerned, there was none in place, and we had to make a temporary setup available as early as possible.

The IT Administrators immediately arranged for a Server where I could build and bring the whole ERP online. I started the Oracle Product Installation [9.2.0, and patched it to 9.2.0.8]. The Oracle Product Installation and Oracle Database creation were painfully slow, so slow that after almost 5-6 hours the database creation had still not completed. The Server was a Pentium-III 1 GHz single processor, with 1 GB RAM.

Meanwhile, I found two severe concerns that would hinder the re-build/recovery of the ERP.

  1. The Application Backup was supposed to be on the DDS tapes, but the tape drive was built into the production server. To access the tapes, we would have to bring the server back, as we did not have a spare tape drive.
  2. The controlfile autobackup and spfile backup belonging to the Full Database RMAN Backup were on the crashed server (they should have been on the backup location in the first place). This meant that all my RMAN Backups were of no use until the controlfile autobackup file was recovered from the crashed server's disks.
    Later, I found out that one of the batch jobs was resetting the controlfile autobackup location from "\\testdb\orcl" to "D:\ORACLE\ora92\DATABASE\...".
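A check like the following could have caught the second problem before the crash. It is only a sketch, assuming the autobackup format contains RMAN's %F variable, which produces names of the form c-<DBID>-<YYYYMMDD>-<QQ>. The function names, the one-day freshness limit and the sample file names are hypothetical, not part of our actual setup.

```python
import re
from datetime import date, datetime, timedelta
from pathlib import Path

# %F in an RMAN autobackup format expands to c-<DBID>-<YYYYMMDD>-<QQ>,
# where QQ is a two-digit hexadecimal sequence number.
AUTOBACKUP_NAME = re.compile(r"c-(\d+)-(\d{8})-([0-9A-Fa-f]{2})")


def latest_autobackup_date(names):
    """Return the newest date among autobackup-style file names, or None."""
    dates = []
    for name in names:
        match = AUTOBACKUP_NAME.search(name)
        if not match:
            continue
        try:
            dates.append(datetime.strptime(match.group(2), "%Y%m%d").date())
        except ValueError:
            continue  # looked like an autobackup name but the date is invalid
    return max(dates, default=None)


def autobackup_is_fresh(names, today, max_age_days=1):
    """True if an autobackup no older than max_age_days is present."""
    latest = latest_autobackup_date(names)
    return latest is not None and (today - latest) <= timedelta(days=max_age_days)


def check_location(directory, today=None):
    """Scan a backup directory and report whether a fresh autobackup exists."""
    names = [p.name for p in Path(directory).iterdir() if p.is_file()]
    return autobackup_is_fresh(names, today or date.today())


# Hypothetical listing of the backup share, for illustration only.
sample = ["c-1234567890-20090512-00", "c-1234567890-20090513-01", "export.dmp"]
print(latest_autobackup_date(sample))
```

Scheduled daily against the backup share (for example `check_location(r"\\testdb\orcl")`), a False result would have flagged that the autobackups were landing somewhere else.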
I had last night's full export dump, and had no choice but to create a new database from it. Because the installation on the alternate server was so slow, I started the Database Creation on the Test Server in parallel, which was a Pentium IV with 2 GB RAM. The Database Creation finished in less than 1 hour. Then, I started the full database import, which took another 3-4 hours to complete. If I had the RMAN Backups available, this whole process would have finished in 3 hours or less. Nevertheless, the import gave me some benefits that are only possible when a database is migrated to a new server. I will discuss this in the next post.

Meanwhile, I had another problem on my hands: where would I get the ERP Application from, as the backups were supposed to be on the tapes and the tape drive was not functional? We had a year-old application source available at one of the locations. The issue here was that in a year's span, the developers had put in a couple of new modules, reports and bug fixes. We were maintaining a Manual Source Code Control [Change Management] and were able to recover 95-96% of the ERP Application. There were 2-3 sub-modules whose source code was missing due to mismanaged source code control. The only way we could have a complete recovery was to recover from the tapes, or to repair the disks and recover from them.

We managed to bring the temporary setup online for a set of users per department.

By now, the management realized that a Business Continuity Plan needed to be in place, and that our Disaster Recovery Plan needed a serious re-visit.
  1. Our Backups to tapes should be made easily recoverable, even in case of a tape drive failure.
  2. Our Application Backup and Source Code should be copied to other media at an interval of 1 month or less.
  3. Our RMAN Backup needs to be made more foolproof if we are to rely on it completely.
  4. Our Business Continuity Setup should be made available so as to avoid major downtime.
In the next post, I will be sharing with you more on the subject and the Complete Recovery of the Production Server.

Sunday, May 24, 2009

RMAN Performance Hiccups Resolved

This is a follow-up to the earlier post "Investigating Hiccups in RMAN Implementation for Production Database", where I was trying to find the cause of the severe performance issues I saw when I carried out RMAN Backup on a network storage (SAN, available to me as NAS). Here are the findings from the earlier tests:

The backup tablespace size is 1.36 GB with one datafile.
The backup piece size in all 3 test cases is 1.1 GB each.
RMAN Backup Time at 3 locations:
  • Local Backup Duration (D:\oracle\backup): 04:55 minutes
  • Shared Backup Duration (\\testdb\orcl): 02:25 minutes
  • Shared SAN Duration (\\blade5\mis\backup\RMAN): 26:25 minutes
So, basically I was expecting much better performance from the SAN, or at least performance similar to the shared backup location. We (me, the IT Guys and the Oracle Support Guys) really were not able to figure out why this was happening.
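To put the three timings above on a common scale, here is a small Python sketch that converts them to throughput. It assumes the 1.1 GB backup piece size applies to each run and takes 1 GB as 1024 MB; the helper name `throughput_mb_per_s` is just for illustration.

```python
def throughput_mb_per_s(size_gb, duration_mmss):
    """Convert a size in GB and an mm:ss duration into MB per second."""
    minutes, seconds = (int(part) for part in duration_mmss.split(":"))
    return size_gb * 1024 / (minutes * 60 + seconds)


PIECE_SIZE_GB = 1.1  # backup piece size reported for all 3 test cases

# Locations and durations exactly as measured in the tests above.
durations = {
    r"D:\oracle\backup": "04:55",
    r"\\testdb\orcl": "02:25",
    r"\\blade5\mis\backup\RMAN": "26:25",
}

rates = {loc: throughput_mb_per_s(PIECE_SIZE_GB, d) for loc, d in durations.items()}

for loc, rate in rates.items():
    print(f"{loc}: {rate:.1f} MB/s")
```

Seen this way, the SAN run was roughly 11 times slower than the shared location (1585 seconds against 145 seconds for the same backup piece), which is far more than network overhead alone would explain.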

After the crash that occurred on the 11th May, we had the server completely recovered and rebuilt. While rebuilding the server, the IT Administrators upgraded the O/S from Windows 2000 Server Standard Edition to Windows 2003 Server Standard Edition.

After cloning the Recovered Database, I set up the RMAN Backup Jobs and did a trial backup on the Shared SAN location, and guess what!!!! My backup, which used to take 8 to 13 hrs to complete on the Shared SAN location, now finished in 35-40 minutes.

I guess there was something related to the O/S after all. Exactly which change resolved the backup performance issue is still unknown; all I know is that upgrading to Windows 2003 Server made it go away.

Tuesday, May 19, 2009

Major Disaster Recovered Last Week

Last week, on 14th May '09 to be more precise, we had a major Server Crash. More than 2 of the Disks in the RAID5 Configuration went Poof!!! And yes, we brought the Server back from the dead, even though it really needed to lie in its grave this time (the server's been running for more than 7-8 years) !!!! Can't wait to share my last week's experience with you, but before I post about it I need to carry out some priority tasks.

Lots of endeavors to share, related to RMAN and exp Backups, Temporary System setup, Application Recovery Failure, Tape Backup Strategy, Re-Visit on our D/R Plan, Recovery from Failed Disks, O/S Upgrade and new findings on the RMAN Backup performance issue, Standby and DG, and last but not the least, procurement of a new server!!! At last, after 2 years, the management realizes we need to Change the server. :D

Tuesday, May 5, 2009

10G Flashback-related Crash Scenario

Today, I created a crash scenario by deleting all the files (including the archives and backups) in the FLASH RECOVERY AREA (C:\oracle\flash_recovery_area) of my test 10G database. After that, I was only able to mount the database. A "startup" or an "alter database open" gave the following error: "ORA-38760: This database instance failed to turn on flashback database".

Here, I will show how I worked around the problem, to get my database up and running.


SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Prod
PL/SQL Release 10.2.0.1.0 - Production
CORE 10.2.0.1.0 Production
TNS for 32-bit Windows: Version 10.2.0.1.0 - Production
NLSRTL Version 10.2.0.1.0 - Production

SQL> archive log list
Database log mode Archive Mode
Automatic archival Enabled
Archive destination USE_DB_RECOVERY_FILE_DEST
Oldest online log sequence 192
Next log sequence to archive 192
Current log sequence 194

SQL> show parameter rec
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
buffer_pool_recycle string
control_file_record_keep_time integer 7
db_recovery_file_dest string C:\oracle/flash_recovery_area
db_recovery_file_dest_size big integer 4G
db_recycle_cache_size big integer 0
ldap_directory_access string NONE
recovery_parallelism integer 0
recyclebin string on
use_indirect_data_buffers boolean FALSE

SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-38760: This database instance failed to turn on flashback database

SQL> alter database flashback off;
Database altered.

SQL> alter database flashback on;
alter database flashback on
*
ERROR at line 1:
ORA-38706: Cannot turn on FLASHBACK DATABASE logging.
ORA-38714: Instance recovery required.

SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-38760: This database instance failed to turn on flashback database



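Usually the database should open at this point. In our case it did not, so we need to check for a Guaranteed Restore Point and, if one exists, drop it. A guaranteed restore point relies on flashback logs in the flash recovery area, and those were among the files I had deleted.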



SQL> select NAME, SCN, GUARANTEE_FLASHBACK_DATABASE, DATABASE_INCARNATION# from v$restore_point;
NAME SCN GUA DATABASE_INCARNATION#
---------- ---------- --- ---------------------
A 4073278 YES 2

SQL> DROP RESTORE POINT A;
Restore point dropped.

SQL> select NAME, SCN, GUARANTEE_FLASHBACK_DATABASE, DATABASE_INCARNATION# from v$restore_point;
no rows selected

SQL> alter database open;
Database altered.

SQL> show sga
Total System Global Area 314572800 bytes
Fixed Size 1248768 bytes
Variable Size 79692288 bytes
Database Buffers 226492416 bytes
Redo Buffers 7139328 bytes




Sunday, May 3, 2009

Documentation Index for Real Application Clusters

I was looking for notes on Oracle Data Guard Installation and Troubleshooting on Metalink (My Oracle Support), and accidentally found this great Metalink Note on Real Application Clusters. Thought of sharing this information on the blog. It will surely help us some day.
Subject: Documentation Index for Real Application Clusters
Doc ID: 188135.1
Type: REFERENCE
Modified Date: 01-JUN-2007
Status: PUBLISHED