Our server uses a 4-disk RAID5 configuration. We had a disk failure a couple of weeks back, which was recovered the same day. That was when our ERP database files got corrupted. You can read about it here.
This time, 2 disks of the RAID5 array crashed. RAID5 can survive the loss of only one disk, so our 4-disk array needed at least 3 disks running. As a result, the array went down and took the server with it. The server was hosting our Oracle Forms6i-based ERP application as well as our Oracle ERP database.
Initially, I was fairly confident, as I had my full database RMAN backups available as well as last night's full database export dump file. The disks in the RAID5 array were a more or less obsolete model, so replacements were not readily available. Our supplier told us that the disks might be available in the Australian warehouse, and that it would take 2-3 weeks to send them to us. There was no business continuity setup in place, so we had to make a temporary setup available as early as possible.
The IT administrators immediately arranged a server where I could build and bring the whole ERP online. I started the Oracle product installation [9.2.0, patched to 9.2.0.8]. The installation and the database creation were painfully slow: after 5-6 hours, the database creation had still not completed. The server was a single-processor Pentium-III 1 GHz with 1 GB RAM.
Meanwhile, I found two severe concerns that would hinder the re-build/recovery of the ERP.
- The application backup was supposed to be on the DDS tapes, and the tape drive was built into the production server. To access the tapes, we would have to bring the server back, as we did not have a spare tape drive.
- The controlfile autobackup and spfile backup belonging to the full database RMAN backup were on the crashed server (they should have been on the backup location in the first place). This meant that all my RMAN backups were of no use until the controlfile autobackup file was recovered from the crashed server's disks.
Later, I found out that one of the batch jobs was resetting the controlfile autobackup location from "\\testdb\orcl" to "D:\ORACLE\ora92\DATABASE\...".
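A check along these lines could have caught the reset earlier. This is only a minimal sketch: it assumes the output of RMAN's `SHOW ALL` command has been captured as text, and the function names and the sample local path are hypothetical. Only the expected location `\\testdb\orcl` comes from our setup.

```python
import re

# Matches the line RMAN prints in SHOW ALL for the disk autobackup format, e.g.
# CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '<path>\%F';
_FORMAT_LINE = re.compile(
    r"CONFIGURE\s+CONTROLFILE\s+AUTOBACKUP\s+FORMAT\s+FOR\s+DEVICE\s+TYPE\s+DISK"
    r"\s+TO\s+'([^']*)'",
    re.IGNORECASE,
)


def autobackup_format(show_all_output):
    """Return the configured disk autobackup format string, or None if absent."""
    match = _FORMAT_LINE.search(show_all_output)
    return match.group(1) if match else None


def autobackup_location_ok(show_all_output, expected_prefix):
    """True only if the autobackup format starts with the expected location."""
    fmt = autobackup_format(show_all_output)
    if fmt is None:
        return False
    # Windows paths are case-insensitive, so compare in lower case.
    return fmt.lower().startswith(expected_prefix.lower())


# Hypothetical captured output: the format has been reset to a local directory.
sample = r"""
CONFIGURE CONTROLFILE AUTOBACKUP ON;
CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO 'D:\some\local\dir\%F';
"""

print(autobackup_location_ok(sample, r"\\testdb\orcl"))  # False
```

Run after every batch job that touches the RMAN configuration, a check like this would flag the moment the autobackup location drifts back to the database server's own disks.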
Meanwhile, I had another problem on my hands. Where would I get the ERP application from, given that the backups were supposed to be on the tapes and the tape drive was not functional? We had a year-old copy of the application source available on one of the locations. The issue was that over that year, the developers had added a couple of new modules, reports and bug fixes. We were maintaining manual source code control [Change Management], and with it we were able to recover 95-96% of the ERP application. The source code of 2-3 sub-modules was missing due to mismanaged source code control. The only way to achieve a complete recovery was to restore from the tapes, or to repair the disks and recover from them.
We managed to bring the temporary setup online for a set of users per department.
By now, management had realized that a Business Continuity Plan needed to be in place, and that our Disaster Recovery Plan needed a serious revisit:
- Our backups to tape should be easily recoverable, even in case of a tape drive failure.
- Our application backup and source code should also be maintained on other media, refreshed at least once a month.
- Our RMAN backup needs to be made more foolproof if we are to rely on it completely.
- Our business continuity setup should be made available so as to avoid major downtime.
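One way to make the RMAN point above less dependent on memory is to verify regularly that a fresh controlfile autobackup actually exists on the backup location. The sketch below assumes the autobackup format ends in RMAN's `%F` variable, which expands to `c-<DBID>-<YYYYMMDD>-<QQ>` (QQ being a hexadecimal sequence number). The function names, the DBID and the file names in the usage example are hypothetical.

```python
import re
from datetime import date, datetime

# %F expands to c-<DBID>-<YYYYMMDD>-<QQ>; anything after that (such as an
# extension added in the format string) is ignored.
_PIECE = re.compile(r"^c-(\d+)-(\d{8})-([0-9A-Fa-f]{2})")


def latest_autobackup_date(filenames):
    """Return the newest date found in autobackup piece names, or None."""
    dates = []
    for name in filenames:
        match = _PIECE.match(name)
        if not match:
            continue
        try:
            dates.append(datetime.strptime(match.group(2), "%Y%m%d").date())
        except ValueError:
            # Eight digits that do not form a valid calendar date: skip.
            continue
    return max(dates) if dates else None


def has_recent_autobackup(filenames, today, max_age_days=1):
    """True if the newest autobackup piece is at most max_age_days old."""
    latest = latest_autobackup_date(filenames)
    return latest is not None and (today - latest).days <= max_age_days


# Hypothetical directory listing of the backup location.
listing = ["c-1234567890-20080110-00", "c-1234567890-20080112-01", "export.dmp"]
print(has_recent_autobackup(listing, today=date(2008, 1, 13)))  # True
print(has_recent_autobackup(listing, today=date(2008, 1, 20)))  # False
```

In practice, the listing would come from the backup share (for example via `os.listdir`), and a `False` result would be a reason to alert the DBA before the backups are ever needed.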