Thursday, March 19, 2009

Journey from 9.0.1 to 9.2.0.8: Part-III

Within the first few weeks of joining the Company, it was brought to my attention that the System Crashes were quite frequent, and that I had to resolve them at the earliest to prevent downtimes with business impact. I inquired and gathered enough non-technical information regarding the System Crash issue from the perspective of the developers, the IT Admins, and a few of the users. I also had a glimpse of the Alert.log and found a peculiar error that was recurring with notable frequency. Then it happened. Users started calling in to report that their sessions had hung and that they were not able to log in to open new sessions. Even the developers could not do anything. I logged in using Toad with my sysdba account, and even that session was frozen.

I remember rushing towards the Datacenter and coordinating with the IT Admins, as I did not have any remote access to control the database from my end. To my surprise, it was a pretty casual scene there. We logged in to the server, and I noticed that the CPU utilization was at 99-100%. I tried to connect using sysdba privileges and was not able to log in. Eventually, we decided to restart the server, and then the system was back to normal.

I had to do something so that we would not need to restart the server every time, and to prevent the Crashes from happening altogether. Two questions were haunting me, and I was desperately seeking answers to them:

  1. What could have caused the server CPU utilization to shoot up to 99-100%?
  2. Why wasn’t I able to log in using my sysdba privileges, despite being part of the ORA_DBA group?

A detailed study of the Alert.log revealed that the instance had raised a series of “ORA-600 [25012]” errors, immediately after which the instance crashed or the server was restarted. Further investigation revealed that the error had been occurring since 2005, and its occurrences had gradually increased over time. A few weeks later, another similar incident occurred. This time we could see ORA-600 [12209] and ORA-600 [17281] in the Alert.log and the trace files. I immediately raised a Service Request [6434255.992] with Oracle Support.
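
As a side note, a quick way to gauge how often such errors appear is to scan the Alert.log for the ORA-600 signature. The sketch below uses Windows findstr (the server was Windows-based); the path and SID shown are placeholders, not our actual ones:

  REM Count and locate ORA-00600 occurrences in the alert log (path/SID are placeholders)
  findstr /N /C:"ORA-00600" D:\oracle\admin\ORCL\bdump\alertORCL.log

  REM Narrow the search to a specific internal error argument, e.g. [25012]
  findstr /N /C:"ORA-00600: internal error code, arguments: [25012]" D:\oracle\admin\ORCL\bdump\alertORCL.log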

Oracle Support pointed out that we were hitting Bug [2622690], and that there are many reasons why it could occur; in our case it was identified by a call stack and circumstances match. They could not do anything more, as we were running the database on a desupported version of Oracle; the last supported release was Oracle 9iR2 [9.2.0.8.0]. Below is the update from Oracle Support on the issue:

You are hitting bug 2622690.

Details:

ORA-600 [12209] can occur when using shared servers (MTS) if the client connection aborts.

In this case the database crashed because PMON had problems cleaning up the Shared server process. There is a fix created on top of 9.0.1.4 for ms windows. The patch should be located under 3183731. However, you must first install patchset 9.0.1.4

The bug is fixed in 10g, but I could find no occurrences for the bug in 9.2.0 either.
Therefore, my recommendation is to install and upgrade to a supported release, which is 9.2.0.8 or 10g. The bug is fixed in 10g and should not reoccur in 9.2.0.8.

However should it reoccur, we can request a backport.

A new note was created, which can be accessed via Metalink [452099.1].

My analysis, with all the information in hand, regarding the first question was that whenever the bug was hit and the CPU utilization climbed to 99-100%, Oracle kept generating a dump file until the server was restarted; its size would usually vary from 25 MB to 300 MB (based on the last 3-4 known cases).

Later, I found that the max_dump_file_size parameter was set to UNLIMITED. I changed it to 100 MB and monitored the effect of the change on the ORA-600 errors encountered. Based on the findings, I further reduced max_dump_file_size to 10 MB. After this change, I rarely had to restart the server; I could restart the database via remote access. You can call this a temporary workaround to the ORA-600 errors being encountered.
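
For illustration, this is roughly how such a change can be applied from SQL*Plus. The SCOPE clause below assumes an spfile is in use; with a plain init.ora, the parameter would instead be changed in memory and the file edited for persistence:

  -- Check the current setting
  SHOW PARAMETER max_dump_file_size

  -- Limit trace/dump file size to 10 MB (assumes an spfile;
  -- with a pfile, use SCOPE=MEMORY and edit init.ora separately)
  ALTER SYSTEM SET max_dump_file_size = '10M' SCOPE=BOTH;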

With respect to the second question, I suppose the login failures were caused by the CPU utilization being pinned at 99-100%, with Oracle prioritizing the dump file creation to the point where it probably did not have sufficient resources to establish new connections (be it sys or any other account).

To fix this issue permanently, we had to migrate the database to the latest supported release, i.e. Oracle 10gR2. But due to resource constraints on the server, especially the available memory (1.5 GB) and, to some extent, the storage, the only way we could move to 10gR2 was to migrate to a new server with an optimal configuration. If we were to fix the issue on the existing server, we had to at least apply patch 3183731 over Oracle 9iR1 (9.0.1.4), as suggested by Oracle Support, or upgrade to Oracle 9iR2 (9.2.0.8). Oracle Support recommended moving to Oracle 9iR2 [9.2.0.8].

I prepared a Root Cause Analysis report on the incident, along with a management summary depicting the average loss (in terms of cost) that the company was incurring due to the downtimes resulting from these incidents, and built up a case to migrate to Oracle 9iR2 [9.2.0.8].

What I had to do next was to draft the action plan for the migration from Oracle 9iR1 [9.0.1] to Oracle 9iR2 [9.2.0.8], and then carry out a series of tests to ensure that I tackled all the migration-related issues beforehand, especially bearing in mind that we had an OLTP ERP database, and that the ORA-600 nightmares would not recur post-migration. In parallel, I somehow had to upgrade the RAM on the server, both for a smooth migration and to address the performance issues related to the resource constraints.

In a nutshell, my action plan for the migration/upgrade on the same server was as follows (a rough sketch of the manual upgrade commands follows the list):

  1. Firstly, to install the Oracle 9iR2 [9.2.0] base home in a separate location on the database server,
  2. Then, to patch the Oracle 9iR2 [9.2.0] base home with the 9.2.0.8 patchset,
  3. Before beginning the upgrade process, to ensure a cold backup of the Oracle 9iR1 [9.0.1] database is taken,
  4. Then, using either the DBUA or the manual process, to upgrade the Oracle 9iR1 [9.0.1] database to Oracle 9iR2 [9.2.0.8],
  5. If everything is perfect, to take a post-upgrade cold backup,
  6. If any major issues are encountered during the upgrade process, to restore the pre-upgrade cold backup and point the database back to the old Oracle 9iR1 [9.0.1] home.
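
For step 4, the manual path boils down to running the dictionary upgrade and patchset scripts from the new home. The sketch below is an outline from memory rather than the exact commands we ran; the spool file name is arbitrary, and it assumes ORACLE_HOME and ORACLE_SID already point to the patched 9.2.0.8 home with a suitably adjusted init.ora:

  sqlplus /nolog
  SQL> CONNECT / AS SYSDBA
  SQL> SPOOL upgrade_to_9208.log
  SQL> STARTUP MIGRATE
  SQL> REM Upgrade the 9.0.1 data dictionary to 9.2.0
  SQL> @?/rdbms/admin/u0900010.sql
  SQL> REM Apply the 9.2.0.8 patchset catalog changes
  SQL> @?/rdbms/admin/catpatch.sql
  SQL> SPOOL OFF
  SQL> SHUTDOWN IMMEDIATE
  SQL> STARTUP
  SQL> REM Recompile any invalid objects
  SQL> @?/rdbms/admin/utlrp.sql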

In the final part of the "Journey from 9.0.1 to 9.2.0.8" series, I will share my experience with the Oracle 9iR2 [9.2.0.8] production upgrade and how I tackled the post-upgrade issues.
