Monday, November 9, 2009

Configure Enterprise Manager (Database Control) Manually

Oracle Enterprise Manager - Database Control (dbconsole) is something that usually installed and configured automatically whenever you install a new Oracle 10g database, using Database Configuration Assistant.

I was wondering, if i was to delete the service or remove Enterprise Manager - Database Control, how is it possible to bring back EM! This is what I did to figure manually install and configure the Enterprise Manager for my 10g Test Database.

First, I dropped my repository to remove the EM-Database Control and delete the EM Service.
C:\>emca -deconfig dbcontrol db -repos drop

STARTED EMCA at Nov 8, 2009 2:23:59 PM
EM Configuration Assistant, Version 10.2.0.1.0 Production
Copyright (c) 2003, 2005, Oracle. All rights reserved.

Enter the following information:
Database SID: db10gee
Listener port number: 1521
Password for SYS user:
Password for SYSMAN user:

Do you wish to continue? [yes(Y)/no(N)]: y
Nov 8, 2009 2:24:20 PM oracle.sysman.emcp.EMConfig perform
INFO: This operation is being logged at
C:\oracle\product\10.2.0\db_2\cfgtoollogs\emca\DB10GEE\emca_2009-11-08_02-23-59-PM.log.
Nov 8, 2009 2:24:24 PM oracle.sysman.emcp.util.DBControlUtil stopOMS
INFO: Stopping Database Control (this may take a while) ...
Nov 8, 2009 2:24:31 PM oracle.sysman.emcp.EMReposConfig stopDBMSJobs
WARNING: Error initializing SQL connection. SQL operations cannot be
performed
Nov 8, 2009 2:24:31 PM oracle.sysman.emcp.EMReposConfig invoke
WARNING: Unable to remove DBMS jobs.Nov 8, 2009 2:24:31 PM
oracle.sysman.emcp.EMReposConfig dropRepository
INFO: Dropping the EM repository (this may take a while) ...
Terminate batch job (Y/N)? n

C:\>emctl status agent
Oracle Enterprise Manager 10g Database Control Release 10.2.0.1.0
Copyright (c) 1996, 2005 Oracle Corporation. All rights
reserved.
---------------------------------------------------------------
Agent is Not Running

C:\>emctl start dbconsole
Oracle Enterprise Manager 10g Database Control Release 10.2.0.1.0
Copyright (c) 1996, 2005 Oracle Corporation. All rights
reserved.
http://zaffark:5500/em/console/aboutApplication
Starting Oracle Enterprise Manager 10g Database Control ...The service name
is invalid.

More help is available by typing NET HELPMSG 2185.

C:\>NET HELPMSG 2185

The service name is invalid.

EXPLANATION

You tried to start a service that is not configured on this
system.

ACTION

Check the spelling of the service name or check the configuration
information for the service using the Services option from Server Manager.

So, the EM-Database Control has been removed and the EM Windows Service has been deleted. So, let's create the Repository and install Enterprise Manager-Database Control.
C:\>emca -repos create

STARTED EMCA at Nov 8, 2009 3:23:24 PM
EM Configuration Assistant, Version 10.2.0.1.0 Production
Copyright (c) 2003, 2005, Oracle. All rights reserved.

Enter the following information:
Database SID: DB10GEE
Listener port number: 1521
Password for SYS user:
Password for SYSMAN user:

Do you wish to continue? [yes(Y)/no(N)]: Y
Nov 8, 2009 3:23:41 PM oracle.sysman.emcp.EMConfig perform
INFO: This operation is being logged at
C:\oracle\product\10.2.0\db_2\cfgtoollogs\emca\DB10GEE\emca_2009-11-08_03-23-24-PM.log.
Nov 8, 2009 3:23:42 PM oracle.sysman.emcp.EMReposConfig
createRepository
INFO: Creating the EM repository (this may take a while) ...
Nov 8, 2009 3:28:49 PM oracle.sysman.emcp.EMReposConfig invoke
INFO: Repository successfully created
Enterprise Manager configuration completed successfully
FINISHED EMCA at Nov 8, 2009 3:28:50 PM

C:\>emca -config dbcontrol db

STARTED EMCA at Nov 8, 2009 3:30:30 PM
EM Configuration Assistant, Version 10.2.0.1.0 Production
Copyright (c) 2003, 2005, Oracle. All rights reserved.

Enter the following information:
Database SID: db10gee
Database Control is already configured for the database db10gee
You have chosen to configure Database Control for managing the database
db10gee
This will remove the existing configuration and the default settings and
perform a fresh configuration
Do you wish to continue? [yes(Y)/no(N)]: y
Listener port number: 1521
Password for SYS user:
Password for DBSNMP user:
Password for SYSMAN user:
Email address for notifications (optional):
Outgoing Mail (SMTP) server for notifications (optional):
-----------------------------------------------------------------

You have specified the following settings
Database ORACLE_HOME
................ C:\oracle\product\10.2.0\db_2
Database hostname
................ zaffark
Listener port number ................ 1521
Database SID ................ db10gee
Email address for notifications ...............
Outgoing Mail (SMTP) server for notifications ...............

-----------------------------------------------------------------
Do you wish to continue? [yes(Y)/no(N)]: y
Nov 8, 2009 3:31:18 PM oracle.sysman.emcp.EMConfig perform
INFO: This operation is being logged at
C:\oracle\product\10.2.0\db_2\cfgtoollogs\emca\DB10GEE\emca_2009-11-08_03-30-29-PM.log.
Nov 8, 2009 3:31:23 PM oracle.sysman.emcp.util.DBControlUtil stopOMS
INFO: Stopping Database Control (this may take a while) ...
Nov 8, 2009 3:31:48 PM oracle.sysman.emcp.util.DBControlUtil startOMS
INFO: Starting Database Control (this may take a while) ...
Nov 8, 2009 3:32:48 PM oracle.sysman.emcp.EMDBPostConfig
performConfiguration
INFO: Database Control started successfullyNov 8, 2009 3:32:50 PM
oracle.sysman.emcp.EMDBPostConfig performConfiguration
INFO: >>>>>>>>>>> The Database Control URL
is http://zaffark:5500/em <<<<<<<<<<<
Enterprise Manager configuration completed successfully
FINISHED EMCA at Nov 8, 2009 3:32:50 PM

C:\>

I checked my Database Control URL and the Enterprise Manager configuration was a success.

Thursday, September 24, 2009

Alas! Our Systems will run on a New Server...

Yesterday, I came to know that we have received the new server. The installations & configurations will be done on Saturday, and will be made available to me to carry out the Migration. I will share more on it soon, including the Server Configurations.

Thursday, July 9, 2009

Toad World PIPELINE Newsletter (July '09 Issue) has my Technical Paper

Today, I was immensely delighted to know that the Toad World PIPELINE Newsletter (July '09 Issue) is about to be released. You might question me why the emotions at display?

I had sent my Oracle Technical Paper (the one which was published in Oracle Customer Knowledge Exchange in Apr '09) to the Technical Content Editor of Quest Software, after which it was immediately considered for being published in the July '09 Issue of Toad World PIPELINE Newsletter.

Since then I was anxiously waiting for a reply from Quest Software or the July '09 Issue itself. Yesterday, I received an email stating that the newsletter is just about to go out (next day or so) and that I can see my feature before the issue.

Here is the link to Toad World PIPELINE Newsletter (July '09 Issue):
http://www.quest.com/ecard/27254/default.htm

Wednesday, June 17, 2009

May'09 Disaster Recovery: Production brought back to Life!

The test setup for selected users was functioning smoothly. This bought us sometime to look for the bad Disks in local Market. The IT Administrators somehow managed to get the Disks. By the en of the 3rd Day, they replaced and rebuilt the Disks in the RAID5 Configuration. The O/S Partition was formatted and server O/S was upgraded to MS Windows 2003 Server [Standard Edition].

We managed to recover most of the data on the RAID5 Disks, especially the Application. The database files were physically recovered, but I considered rebuilding the Database on the Server for the following reasons:
  1. The Server O/S was upgraded
  2. Oracle 9iR2 Software had to be reinstalled
  3. Data had changed in the last 3 days, as people were using the Secondary Setup. So, I needed to clone the production as the Secondary Setup.
On the 4th Day, I cleared the Database Files on the Production Server and installed the Oracle 9iR2 software and patched it to 9.2.0.8. The Application was intact, so I did not have to do any thing on those lines. Now, I had 3 options to setup the Production Server.
  1. Using export dump of the Secondary Instance, create the Production Oracle Database.
  2. Using RMAN, clone the Production Oracle Database with the "Duplicate Target Database" command.
  3. Or, simply clone the Production Oracle Database using the Cold Backup of the Secondary Database.
The 3rd option was simple and effective. With assistance from one of my colleague, we performed the Cloning using Cold Backup. And, Within an hour the Production Instance was up.

The Application Testing was carried out to check the Application were running smoothly and for the Data Validation. Once the testing was successful, we registered the database in the Recovery Catalog, and took a full database backup. We had to ensure that the Oracle Services were owned by the Domain Administrator Account and not the Local Account for registering the database and taking a Full Database RMAN Backup ensuring that the Controlfile Autobackup and SPFILE backup was on the shared location along with the RMAN Backups.
[Note: 145843.1 How to Configure RMAN to Write to Shared Drives on Windows NT/2000]

Once the backup was complete, the Temporary Setup was shutdown and we brought the Production Server online for all the users.

In the next 2 weeks, the following necessary arrangements were made for the short-comings seen in the Disaster Situation:
  1. Application Backup is daily ensured to Tapes
  2. Application files, as of 13th May 2009, have been backed up to DVDs. Every 15 days, Application files backup to DVDs is being ensured.
  3. A Temporary Server was arranged by IT Administrators and has been cloned (using RMAN) same as that of the Production Server. Scheduled Jobs run at 3 intervals so as to Clone the Secondary database is put in place. The cloning process takes more than 2 hours to complete. In case of an unforeseen disaster, we can easily switch to the Secondary Server with a minimal Data Loss.
  4. Source Code Backup and its relevant Document Control is strictly ensured prior to moving to Production.
  5. After the Production Server’s Operating System Upgrade to Windows Server 2003, the RMAN backup location has been changed to SAN Storage location, which is further backed up to Tapes by the IT Administrators. The earlier RMAN backup performance issue of 13 hours has been resolved. Now, the backup completes in less than an hour. You can read about it here.
  6. For the next 3 months, we will be carrying out planned monthly recovery of complete Application from the Tapes and/or the DVDs. Once the recovery simulation comfort level is attained, we can carry out the simulation every Quarter.
  7. Finally, the new Server Procurement Process has been started, and which server to purchase has been finalized.
We have setup the Secondary Database on a Temporary Server. Due to our limitations on Oracle Licensing (Standard Edition), we are not able to use the Oracle Dataguard, as it is only available as part of the Enterprise Edition. Being simple and effective, I have currently opted to use RMAN Duplicate Target Database to clone the Secondary Database on the Temporary Server. Alternately, I am trying to find a solution for manual replication via own scripts, as the current cloning method adds some data loss disadvantages. The Cloning has been scripted and scheduled to run at 3 intervals in a day:
  • 4 am: Clone after the Full Database RMAN Backups
  • 11 am: Clone in the mid of the Working Day (after all Archivelog Backups are available up to 11 am)
  • 4 pm: Clone at the end of the Working Day (after all Archivelog Backups are available up to 4 pm)
Once the Production System is migrated to a new server, the old server can be used as a Standby Database. Also, we will be migrating the production Database from Oracle 9iR2 [9.2.0.8] to Oracle Database 10g [10.2.0.4].

Sunday, May 31, 2009

May'09 Disaster Recovery: Production Lost

Finally, I am able to write on the Production Server Crash that happened at 07:25 am on 14th May '09.

Our server has a 4 disk configuration for RAID5. We had encountered a disk failure couple of weeks back, which was recovered the same day. That was when our ERP Database Files had got corrupted. You can read about it here.

This time, 2 Disks of the RAID5 Configuration crashed and to survive a crash in a RAID5 setup you need minimum 3 disks running. So, everything along with the server went down. The server was hosting our Oracle Forms6i-based ERP Application as well as our Oracle ERP Database.

Initially, I was a little confident as I had my Full Database RMAN Backups available as well as last night's full database export dump file. For the disks in RAID5 Configuration, being kind of an obsolete model, we were not available to carry out the replacements. Our supplier told us that the disks may be available in the Australian Warehouse, and it would take 2-3 weeks to send it to us. As far as business continuity is concerned, there was none in place and we had to make a temporary setup available as earliest as possible. The IT Administrators immediately arranged for a Server where I could build and bring the whole ERP online. I started the Oracle Product Installation [9.2.0, and patched it to 9.2.0.8]. The Oracle Product Installation and Oracle Database creation was too slow. Literally slow, to the extent that it took almost 5-6 hours and the database creation still did not complete. The Server was a Pentium-III 1 GHz single processor, with 1 GB RAM.

Meanwhile, I found two severe concerns that would hinder the re-build/recovery of the ERP.

  1. The Application Backup was supposed to be on the DDS tapes, the tape drive of which was in-built with the production server. To access the tape drives, we would have to bring back the server, as we did not have any spare tape drive.
  2. The controlfile autobackup and spfile Backup related to the Full Database RMAN Backup were on the crashed server (which should have been on the backup location in te first place). This means that all my RMAN Backups were of no use, until the controlfile autobackup file was recovered from the crashed server's disks.
    Later, I found out that one of the batch jobs was resetting the controlfile autobackup location from "\\testdb\orcl" to "D:\ORACLE\ora92\DATABASE\...".
I had the last night's full export dump, and had no choice but to create a new database with the same. Because the installation activity on the alternate server was quite slow, I started the Database Creation on the Test Server in parallel, which was a Pentium IV with 2 GB RAM. The Database Creation finished in less than 1 hour. Then, I started the full database import, which took another 3-4 hours to complete. If I had the RMAN Backups available, this whole process would have finished in 3 hours or less. Nevertheless, the import activity provided some benefits to me which were only possible in the migration of the database to a new server. I will be discussing about this in the next post.

Meanwhile, I had another problem at my head. From where do I get the ERP Application as the backups were supposed to be on the tapes, and the tape drive was not functional. We had an year old application source available on one of the locations. The issue here was that in a year's span, the developers had put in a couple of new modules, reports and bug fixes. We were maintaining a Manual Source Code Control [Change Management] and were able to recover 95-96% of the ERP Application. There were 2-3 sub modules whose source code was missed due to mismanaged source code control. The only way we could have a complete recovery was if we could recover from the tapes, or repair and recover the same from the Disks.

We managed to bring the temporary setup online for a set of users per department.

By now, the management realized that a Business Continuity Plan needs to be in place. And, that our Disaster Recovery Plan needed a serious re-visit.
  1. Our Backups to tapes should be made easily recoverable, even in case of a tape drive failure.
  2. Our Application Backup and Source Code should be maintained in other Medias atleast at an interval of 1 month or less.
  3. Our RMAN Backup needs to be made more fool proof if we were to completely rely on it.
  4. Our Business Continuity Setup should be made available so as to avoid major downtime.
In the next post, I will be sharing with you more on the subject and the Complete Recovery of the Production Server.

Sunday, May 24, 2009

RMAN Performance Hiccups Resolved

This is in relation with the earlier post "Investigating Hiccups in RMAN Implementation for Production Database", where I was trying to find out the severe performance issues when I carried out RMAN Backup on a network storage (SAN, available to me as NAS). Here are the findings from the earlier tests:

The backup tablespace size is 1.36 GB with one datafile.
The backup piece size in all 3 test cases is 1.1 GB each.
RMAN Backup Time at 3 location:
  • Local Backup Duration (D:\oracle\backup): 04:55 minutes
  • Shared Backup Duration (\\testdb\orcl): 02:25 minutes
  • Shared SAN Duration (\\blade5\mis\backup\RMAN): 26:25 minutes
So, basically I was expecting a lot more better performance or atleast similar performance to the shared backup location for RMAN Backups. We (me, IT Guys and Oracle Support Guys) really were not able to figure out why it was happening.

With the crash that occured on the 11th May, we had the server completely recovered and rebuilt. In the event of rebuilding the server, the IT Administrators upgraded the O/S from Windows 2000 Server Standard Edition to Windows 2003 Server Standard Edition.

After cloning the Recovered Database, I setup RMAN Backup Jobs and did a trial backup on the Shared SAN location, and guess what!!!! My backup which used to take 8 hrs to 13 hrs to complete on the Shared SAN location, now finished in 35-40 minutes.

I guess there was something related to the O/S after all. What change resolved the backup performance issue, is still unknown, except for the fact that upgrading to Windows 2003 Server helped resolve the backup performance issue.

Tuesday, May 19, 2009

Major Disaster Recovered Last Week

Last week, to be more precise 14th May '09, we had a major Server Crash. More than 2 of the Disks in the RAID5 Configuration went Poof!!! And yes, We brought the Server back from the dead, even though it really needed to lay in its grave this time (server's been running for more than 7-8 years) !!!! Can't wait to share with you my last week's experience, but before I post about it I need to carry out some priority tasks.

Lot of endeavors to share related to RMAN and exp Backups, Temporary System setup, Application Recovery Failure, Tape Backup Startegy, Re-Visit on our D/R Plan, Recovery from Failed Disks, O/S Upgrade and new findings on the RMAN Backup performance issue, Standby and DG, and last but not the least procurement of a new server!!! At last, after 2 years the management realizes we need to Change the server. :D

Tuesday, May 5, 2009

10G Flashback-related Crash Scenario

Today, I created a crash scenario by deleting all the files (including the archives and backups) in the FLASH RECOVERY AREA (C:\oracle\flash_recovery_area) in in my test 10G database. After which, I was only able to mount the database. A "startup" or an "alter database open" gave the following error: "ORA-38760: This database instance failed to turn on flashback database".

Here, I will show how I worked around the problem, to get my database up and running.


SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Prod
PL/SQL Release 10.2.0.1.0 - Production
CORE 10.2.0.1.0 Production
TNS for 32-bit Windows: Version 10.2.0.1.0 - Production
NLSRTL Version 10.2.0.1.0 - Production

SQL> archive log list
Database log mode Archive Mode
Automatic archival Enabled
Archive destination USE_DB_RECOVERY_FILE_DEST
Oldest online log sequence 192
Next log sequence to archive 192
Current log sequence 194

SQL> show parameter rec
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
buffer_pool_recycle string
control_file_record_keep_time integer 7
db_recovery_file_dest string C:\oracle/flash_recovery_area
db_recovery_file_dest_size big integer 4G
db_recycle_cache_size big integer 0
ldap_directory_access string NONE
recovery_parallelism integer 0
recyclebin string on
use_indirect_data_buffers boolean FALSE

SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-38760: This database instance failed to turn on flashback database

SQL> alter database flashback off;
Database altered.

SQL> alter database flashback on;
alter database flashback on
*
ERROR at line 1:
ORA-38706: Cannot turn on FLASHBACK DATABASE logging.
ORA-38714: Instance recovery required.

SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-38760: This database instance failed to turn on flashback database



Usually at this point database should start. In our case, we need to check for Guaranteed Restore Point, and if it exists then we need to drop it.



SQL> select NAME, SCN, GUARANTEE_FLASHBACK_DATABASE, DATABASE_INCARNATION# from v$restore_point;
NAME SCN GUA DATABASE_INCARNATION#
---------- ---------- --- ---------------------
A 4073278 YES 2

SQL> DROP RESTORE POINT A;
Restore point dropped.

SQL> select NAME, SCN, GUARANTEE_FLASHBACK_DATABASE, DATABASE_INCARNATION# from v$restore_point;
no rows selected

SQL> alter database open;
Database altered.

SQL> show sga
Total System Global Area 314572800 bytes
Fixed Size 1248768 bytes
Variable Size 79692288 bytes
Database Buffers 226492416 bytes
Redo Buffers 7139328 bytes




Sunday, May 3, 2009

Documentation Index for Real Application Clusters

I was looking for notes on Oracle Data Guard Installation and Troubleshooting on Metalink (My Oracle Support), and accidentally found this great Metalink Note on Real Application Clusters. Thought of sharing this information on the blog. It will surely help us some day.
Subject: Documentation Index for Real Application Clusters
Doc ID: 188135.1                                     Type:REFERENCE 
Modified Date :01-JUN-2007          Status:PUBLISHED

Thursday, April 30, 2009

Complete Recovery of Production Instance

Yesterday at 01:12 pm, the production instance crashed. We were reported of "Shared Memory Realm Does not Exist" error by the users.

I noticed in the Alertlog, that the Control File (control05.ctl) was having some problem.


Wed Apr 29 13:04:37 2009
Errors in file d:\oracle\admin\orcl0\bdump\orcl_arc2_4852.trc:
ORA-00202: controlfile: 'C:\ORACLE\ORADATA\ORCL\CONTROL05.CTL'
ORA-27091: skgfqio: unable to queue I/O
ORA-27070: skgfdisp: async read/write failed
OSD-04006: ReadFile() failure, unable to read from file
O/S-Error: (OS 2) The system cannot find the file specified.


I brought down the instance, as it was in nomount state, and then renamed the corrupted Control File (control05.ctl) and by copying one of the other control files, I created a new Control File (control05.ctl).

I fired the startup to see the control file issue was resolved, but also got a couple of new error messages related to datafile 12.

The alert log revealed the following:


Wed Apr 29 13:17:29 2009
Errors in file d:\oracle\admin\orcl0\udump\orcl_ora_6512.trc:
ORA-00600: internal error code, arguments: [krdsod_1], [198804811], [0], [198804867], [0], [], [], []
ORA-01115: IO error reading block from file 12 (block # 334316)
ORA-01110: data file 12: 'D:\ORACLE\ORADATA\ORCL\ORION_ISCO01.DBF'
ORA-27070: skgfdisp: async read/write failed
OSD-04016: Error queuing an asynchronous I/O request.
O/S-Error: (OS 2) The system cannot find the file specified.


This file was part of the main production tablespace, kind of the heart of the Functional ERP System. I fired a recover on the tablespace and it still gave the same error.


SQL> alter database recover datafile 12;
alter database recover datafile 12
*
ERROR at line 1:
ORA-00283: recovery session canceled due to errors
ORA-01115: IO error reading block from file 12 (block # 334316)
ORA-01110: data file 12: 'D:\ORACLE\ORADATA\ORCL\ORION_ISCO01.DBF'
ORA-27070: skgfdisp: async read/write failed
OSD-04016: Error queuing an asynchronous I/O request.
O/S-Error: (OS 2) The system cannot find the file specified.


I was wondering how many files have been affected in this manner, to decide between full database restore or individual datafile/tablespace restore and recover.

Meanwhile, from the IT Admins, I came to know that one of the disks from the 4 RAID5 disks had a problem (defunc) at almost the same time when we noticed a Instance Crash, and that the disk needed to be fixed.

The database is in Archivelog mode and we have daily full database backups and also Archivelog backups in every 4 hours taken. Then I tried to restore the whole tablespace (datafiles 11, 12, 18, 19), and it again failed at datafile 12. Then, I renamed the existing datafile (ORION_ISCO01.DBF) and then restored the datafile 12 (ORION_ISCO01.DBF) from the backup, and eventually restored all the other 3 datafiles. And then I sucessfully recovered the tablespace (SR 7473105.994).

Now, I was able to successfully open the database.


SQL> alter database open;
Database altered.

SQL>


I checked the Systems and they were all running fine. And, then I took a full database RMAN backup with retention policy of redundacy 2. I raised a Service Request (SR 7559528.992) with Oracle Support to look into what the error messages mean and is it directly related to the RAID5 disk failure.
I would require to answer a few questions to the Management regarding this Incident, or should I say, Disaster Recovery Situation. It's high time, we need to change the database server.

Today, the users are using the system as if nothing happened. Thanks to the implementation of RMAN, I was able to bring the database back online in 3:30 hours. I can't imagine what I would have done with the full export dumps, or the ealier weekly Cold Backups (NOARCHIVELOG).

I will keep you all posted on my further findings.