Sunday, May 31, 2009

May'09 Disaster Recovery: Production Lost

I am finally able to write about the production server crash that happened at 07:25 am on 14th May '09.

Our server has a 4-disk RAID5 configuration. We had encountered a disk failure a couple of weeks earlier, which was recovered the same day. That was when our ERP database files got corrupted. You can read about it here.

This time, 2 disks of the RAID5 array crashed. A 4-disk RAID5 array can survive the loss of only one disk, so at least 3 disks must keep running. With two gone, the array and everything on the server went down. The server was hosting our Oracle Forms6i-based ERP application as well as our Oracle ERP database.
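
Why one lost disk is survivable and two are not can be shown with a minimal Python sketch of RAID5-style XOR parity. It is a simplification: it models a single stripe with three data blocks and one parity block, and the block contents and the `xor_blocks` helper are made up for illustration. A real controller also rotates the parity block across the disks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical 4-disk array: three data blocks plus parity.
data = [b"ERP1", b"ERP2", b"ERP3"]
parity = xor_blocks(data)
stripe = data + [parity]

# One disk lost (disk 1): XOR of the three survivors rebuilds its block.
survivors = stripe[:1] + stripe[2:]
rebuilt = xor_blocks(survivors)

# Two disks lost (disks 0 and 1): the two survivors only yield the XOR of
# the two missing blocks, which cannot be split back into the originals.
leftover = xor_blocks(stripe[2:])
```

Here `rebuilt` equals the lost block `data[1]`, while `leftover` is only `data[0] XOR data[1]`, which is why the second failure took the whole array down.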

Initially, I was fairly confident, as I had my full database RMAN backups available as well as last night's full database export dump file. The disks in the RAID5 array were a more or less obsolete model, so we were not able to replace them right away. Our supplier told us that the disks might be available in the Australian warehouse, and that it would take 2-3 weeks to ship them to us. As far as business continuity is concerned, there was none in place, and we had to make a temporary setup available as early as possible.

The IT Administrators immediately arranged a server where I could build the whole ERP and bring it online. I started the Oracle product installation [9.2.0, patched to 9.2.0.8]. The installation and the database creation were painfully slow: after almost 5-6 hours, the database creation still had not completed. The server was a Pentium-III 1 GHz single processor with 1 GB RAM.

Meanwhile, I found two severe concerns that would hinder the re-build/recovery of the ERP.

  1. The application backup was supposed to be on the DDS tapes, and the tape drive was built into the production server. To access the tapes, we would have to bring the server back, as we did not have a spare tape drive.
  2. The controlfile autobackup and spfile backup belonging to the full database RMAN backup were on the crashed server, when they should have been on the backup location in the first place. This meant that all my RMAN backups were of no use until the controlfile autobackup file was recovered from the crashed server's disks.
    Later, I found out that one of the batch jobs was resetting the controlfile autobackup location from "\\testdb\orcl" to "D:\ORACLE\ora92\DATABASE\...".
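
A reset like this could be caught by a scheduled check of the RMAN configuration. The following is a minimal Python sketch, not what we actually ran: it assumes the text of RMAN's `SHOW ALL` output has already been captured into a string, and the function names and both sample lines are made up for illustration. The expected prefix is the backup share mentioned above.

```python
import re

# Assumption: the backup share named in the post; adjust to the real location.
EXPECTED_PREFIX = r"\\testdb\orcl"

# Matches the disk autobackup format line as printed by RMAN's SHOW ALL.
FORMAT_LINE = re.compile(
    r"CONFIGURE\s+CONTROLFILE\s+AUTOBACKUP\s+FORMAT\s+FOR\s+DEVICE\s+TYPE"
    r"\s+DISK\s+TO\s+'([^']*)'",
    re.IGNORECASE,
)


def autobackup_format(show_all_output):
    """Return the configured disk autobackup format string, or None."""
    match = FORMAT_LINE.search(show_all_output)
    return match.group(1) if match else None


def autobackup_is_offhost(show_all_output, expected_prefix=EXPECTED_PREFIX):
    """True only if the autobackup format points under the expected share."""
    fmt = autobackup_format(show_all_output)
    return fmt is not None and fmt.lower().startswith(expected_prefix.lower())


# Illustrative sample lines, not captured from the real server.
SAMPLE_OK = (
    r"CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK "
    r"TO '\\testdb\orcl\%F';"
)
SAMPLE_DEFAULT = (
    "CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK "
    "TO '%F'; # default"
)
```

With these samples, `autobackup_is_offhost(SAMPLE_OK)` is true and `autobackup_is_offhost(SAMPLE_DEFAULT)` is false, so a job running such a check after every batch run would flag the day the location silently moved back to the local disk.
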

I had last night's full export dump, and had no choice but to create a new database from it. Because the installation on the alternate server was so slow, I started the database creation in parallel on the Test Server, which was a Pentium IV with 2 GB RAM. The database creation finished in less than 1 hour. Then I started the full database import, which took another 3-4 hours to complete. Had the RMAN backups been usable, this whole process would have finished in 3 hours or less. Nevertheless, the import gave me some benefits that are only possible when migrating the database to a new server. I will discuss this in the next post.

Meanwhile, I had another problem on my hands. Where would I get the ERP application from, given that the backups were supposed to be on the tapes and the tape drive was not functional? We had a year-old copy of the application source available in one location. The issue was that over that year, the developers had added a couple of new modules, reports and bug fixes. We were maintaining manual source code control [Change Management] and were able to recover 95-96% of the ERP application. There were 2-3 sub-modules whose source code was missing due to mismanaged source code control. The only way to get a complete recovery was to recover from the tapes, or to repair the disks and recover from them.

We managed to bring the temporary setup online for a set of users per department.

By now, the management had realized that a Business Continuity Plan needed to be in place, and that our Disaster Recovery Plan needed a serious revisit.

  1. Our backups to tape should be easily recoverable, even in case of a tape drive failure.
  2. Our application backup and source code should be copied to other media at an interval of 1 month or less.
  3. Our RMAN backup needs to be made more foolproof if we are to rely on it completely.
  4. Our business continuity setup should be made available so as to avoid major downtime.
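
The one-month target in point 2 is easy to monitor automatically. Here is a minimal Python sketch, assuming the copies land in a directory that the script can read. The function names and the 30-day constant are illustrative choices, not part of our actual setup.

```python
import os
import time

MAX_AGE_DAYS = 30  # assumption: "1 month or less" taken as 30 days


def newest_file_age_days(directory, now=None):
    """Age in days of the newest file under directory, or None if empty."""
    now = time.time() if now is None else now
    newest = None
    for root, _dirs, files in os.walk(directory):
        for name in files:
            mtime = os.path.getmtime(os.path.join(root, name))
            if newest is None or mtime > newest:
                newest = mtime
    if newest is None:
        return None
    return (now - newest) / 86400.0


def backup_is_fresh(directory, max_age_days=MAX_AGE_DAYS, now=None):
    """True if the directory holds at least one file newer than the limit."""
    age = newest_file_age_days(directory, now)
    return age is not None and age <= max_age_days
```

An empty directory counts as not fresh, so a copy job that silently stopped writing is reported the same way as one that is simply overdue.
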

In the next post, I will share more on the subject and on the complete recovery of the production server.
