Pages

Thursday, April 30, 2009

Complete Recovery of Production Instance

Yesterday at 01:12 pm, the production instance crashed. We were reported of "Shared Memory Realm Does not Exist" error by the users.

I noticed in the Alertlog, that the Control File (control05.ctl) was having some problem.


Wed Apr 29 13:04:37 2009
Errors in file d:\oracle\admin\orcl0\bdump\orcl_arc2_4852.trc:
ORA-00202: controlfile: 'C:\ORACLE\ORADATA\ORCL\CONTROL05.CTL'
ORA-27091: skgfqio: unable to queue I/O
ORA-27070: skgfdisp: async read/write failed
OSD-04006: ReadFile() failure, unable to read from file
O/S-Error: (OS 2) The system cannot find the file specified.


I brought down the instance, as it was in nomount state, and then renamed the corrupted Control File (control05.ctl) and by copying one of the other control files, I created a new Control File (control05.ctl).

I fired the startup to see the control file issue was resolved, but also got a couple of new error messages related to datafile 12.

The alert log revealed the following:


Wed Apr 29 13:17:29 2009
Errors in file d:\oracle\admin\orcl0\udump\orcl_ora_6512.trc:
ORA-00600: internal error code, arguments: [krdsod_1], [198804811], [0], [198804867], [0], [], [], []
ORA-01115: IO error reading block from file 12 (block # 334316)
ORA-01110: data file 12: 'D:\ORACLE\ORADATA\ORCL\ORION_ISCO01.DBF'
ORA-27070: skgfdisp: async read/write failed
OSD-04016: Error queuing an asynchronous I/O request.
O/S-Error: (OS 2) The system cannot find the file specified.


This file was part of the main production tablespace, kind of the heart of the Functional ERP System. I fired a recover on the tablespace and it still gave the same error.


SQL> alter database recover datafile 12;
alter database recover datafile 12
*
ERROR at line 1:
ORA-00283: recovery session canceled due to errors
ORA-01115: IO error reading block from file 12 (block # 334316)
ORA-01110: data file 12: 'D:\ORACLE\ORADATA\ORCL\ORION_ISCO01.DBF'
ORA-27070: skgfdisp: async read/write failed
OSD-04016: Error queuing an asynchronous I/O request.
O/S-Error: (OS 2) The system cannot find the file specified.


I was wondering how many files have been affected in this manner, to decide between full database restore or individual datafile/tablespace restore and recover.

Meanwhile, from the IT Admins, I came to know that one of the disks from the 4 RAID5 disks had a problem (defunc) at almost the same time when we noticed a Instance Crash, and that the disk needed to be fixed.

The database is in Archivelog mode and we have daily full database backups and also Archivelog backups in every 4 hours taken. Then I tried to restore the whole tablespace (datafiles 11, 12, 18, 19), and it again failed at datafile 12. Then, I renamed the existing datafile (ORION_ISCO01.DBF) and then restored the datafile 12 (ORION_ISCO01.DBF) from the backup, and eventually restored all the other 3 datafiles. And then I sucessfully recovered the tablespace (SR 7473105.994).

Now, I was able to successfully open the database.


SQL> alter database open;
Database altered.

SQL>


I checked the Systems and they were all running fine. And, then I took a full database RMAN backup with retention policy of redundacy 2. I raised a Service Request (SR 7559528.992) with Oracle Support to look into what the error messages mean and is it directly related to the RAID5 disk failure.
I would require to answer a few questions to the Management regarding this Incident, or should I say, Disaster Recovery Situation. It's high time, we need to change the database server.

Today, the users are using the system as if nothing happened. Thanks to the implementation of RMAN, I was able to bring the database back online in 3:30 hours. I can't imagine what I would have done with the full export dumps, or the ealier weekly Cold Backups (NOARCHIVELOG).

I will keep you all posted on my further findings.

2 comments: