
Tuesday, February 24, 2009

Journey from 9.0.1 to 9.2.0.8: Part-II

We saw in the first part that serious attention was needed on a few issues that were escalating day by day, namely:


  1. Blocking Locks, affecting user activities
  2. Serious Performance Degradation (Slow Performance, Session Hangs, etc.)
  3. System Crashes in Business Hours, causing downtime
After I joined the company, it took me a while to get the full picture by studying the logs and regularly interacting with everyone in the department, with the users, and with the relevant points of contact in the company. I remember drafting a 19-page system review report, which highlighted the following:


  • SERVER CONFIGURATIONS & USAGE STATISTICS
  • APPLICATION ARCHITECTURE & BACKUP STRATEGY
  • DATABASE ARCHITECTURE & CONFIGURATION
  • DATABASE BACKUP-RECOVERY STRATEGY
  • IMMEDIATE RISK FACTORS & AVAILABLE SOLUTIONS (w/ACTION PLAN)
  • LONG-TERM RISK FACTORS & AVAILABLE SOLUTIONS
  • SERVICE IMPROVEMENT PLANS
The database was hosted on an IBM eServer with a dual-core Pentium III 1 GHz, 1.5 GB RAM, and a 50 GB RAID 5 disk array (40 GB dedicated to Oracle). The operating system was Windows 2000 Server Standard Edition.
The database was created using Oracle 9i Release 1 (9.0.1) software, with the default name that Oracle gives (ORCL). Default Block Size was used (4K). Archive Logging was disabled. The size of all the Datafiles was around 30 GB. Total SGA was 930 MB.
Out of 930 MB SGA, the Shared Pool was 320 MB, the Buffer Cache was 600 MB, the Log Buffer was 512 KB, and the Sort Area Size was 1.5 MB.

The scheduled backups were schema-level export dumps, taken on alternate days and pushed to DDS tapes. No recovery testing had ever been carried out to check the validity of these backups, or even whether it was possible to recover the data from them. This task was of grave importance: to test the validity of the current backups and replace them with Oracle RMAN/user-managed backups, or at least have a full database export carried out until RMAN was implemented.
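
For reference, a full database export on 9i with the original exp utility looks roughly like the sketch below; the credentials, SID, paths and file names are placeholders, not the actual script:

    REM placeholders: credentials, SID and paths are illustrative only
    exp system/password@ORCL FULL=Y CONSISTENT=Y STATISTICS=NONE FILE=D:\backup\orcl_full.dmp LOG=D:\backup\orcl_full_exp.log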

Enabling archive logging took some time due to the resource constraints on the server. But until that could be resolved, I had already scripted and scheduled the creation of a full database export dump. I periodically tested a complete recovery from the full export dumps, just to assure database availability up to the point of the last backup. This was really helpful, because I could now also test migration of the database to any version using the export dumps.
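
The recovery test itself was nothing more exotic than a full import of that dump into a freshly created scratch database; a sketch, again with illustrative names and paths:

    REM TESTDB is an illustrative scratch instance, paths are placeholders
    imp system/password@TESTDB FULL=Y IGNORE=Y FILE=D:\backup\orcl_full.dmp LOG=D:\backup\orcl_full_imp.log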

Two major resource contentions were clearly visible:


  • The free RAM on the server would drop to between 150 and 200 MB at peak hours, indicating that the server and the database required more memory for better performance.

We had a hard time finding RAM for the server as the model was nearly extinct. Eventually, over a period of around a year, we were able to upgrade the RAM to 4 GB. Until then, there was no option but to fix a few things using the existing resources.

  • And the 40 GB partition on which the database resided had only 4 GB of free space left.
Of the 30 GB of datafiles, the TEMP tablespace accounted for an unusual 12 GB, so I resized the temporary tablespace to free up quite a lot of space.
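
On 9i, the usual way to shrink TEMP is to create a new, right-sized temporary tablespace, make it the database default, and drop the oversized one. A minimal sketch, with illustrative paths and sizes:

    -- tempfile path and sizes below are illustrative
    CREATE TEMPORARY TABLESPACE temp2
      TEMPFILE 'D:\oracle\oradata\orcl\temp02.dbf' SIZE 2000M
      EXTENT MANAGEMENT LOCAL UNIFORM SIZE 1M;
    ALTER DATABASE DEFAULT TEMPORARY TABLESPACE temp2;
    DROP TABLESPACE temp INCLUDING CONTENTS AND DATAFILES;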

The listener.log file had grown to more than 2 GB in size. After recreating the log file, I was able to reclaim another 2 GB of space.
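
Recreating the listener log amounts to switching listener logging off, moving the old file aside at the OS level, and switching logging back on so a fresh listener.log is created. Roughly (the path is illustrative):

    REM log path is illustrative; on Windows the old file may stay locked until the listener is restarted
    lsnrctl set log_status off
    rename D:\oracle\ora90\network\log\listener.log listener_old.log
    lsnrctl set log_status on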

With the reclaimed space, I appropriately sized the tablespaces and datafiles that were short of available free space, such that more than 80-85% free space was available in each tablespace.
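
Datafile sizing on 9i is done with plain ALTER DATABASE DATAFILE statements; the file names and sizes below are examples only, not the exact commands used:

    -- file name and sizes are illustrative
    ALTER DATABASE DATAFILE 'D:\oracle\oradata\orcl\users01.dbf' RESIZE 2000M;
    ALTER DATABASE DATAFILE 'D:\oracle\oradata\orcl\users01.dbf'
      AUTOEXTEND ON NEXT 50M MAXSIZE 4000M;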

Statistics had not been gathered on any of the tables. A series of tests was carried out with the developers on the behavior of the system in CBO mode with statistics gathered. One notable issue we faced during the transition was that some Oracle Reports would occasionally show no data; these were fixed by adding NVL functions on the parameters referenced by Oracle Reports. Above all, we successfully migrated from RBO to CBO and finally had numbers to measure performance with (COST, CARDINALITY, etc.).
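
On 9i, gathering statistics is a DBMS_STATS call per schema; with statistics in place and optimizer_mode left at its default of CHOOSE, the cost-based optimizer takes over. A sketch (the schema name is illustrative):

    -- 'ERP_OWNER' is an illustrative schema name, not the actual one
    EXEC DBMS_STATS.GATHER_SCHEMA_STATS(ownname => 'ERP_OWNER', cascade => TRUE);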

I configured Statspack to run at an interval of 15 minutes and analyzed the reports for activities performed during peak hours. Time and again, I revisited the top SQL statements and, wherever possible, tuned them with the assistance of the developers. The ERP was poorly indexed. With the CBO in play, I was able to test the performance gains from new indexes on the tables referenced by the top disk-read and buffer-get queries, and then create them on production.
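
Statspack snapshots at a 15-minute interval can be scheduled with DBMS_JOB from the PERFSTAT schema (the package itself is installed with spcreate.sql, and reports are produced with spreport.sql); a minimal sketch, assuming job_queue_processes is set:

    VARIABLE jobno NUMBER;
    BEGIN
      -- take a snapshot now, then every 15 minutes
      DBMS_JOB.SUBMIT(:jobno, 'statspack.snap;', SYSDATE, 'SYSDATE + 15/1440');
      COMMIT;
    END;
    /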

I came across a lot of unindexed foreign keys. This was one of the main reasons for the blocking locks. Using one of the scripts that I had, I created indexes for the unindexed foreign keys. I also segregated the indexes into separate tablespaces dedicated to indexes. In a couple of days, we could see a drastic decrease in the number of user calls related to the blocking lock issue.
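
The idea behind such a script is to list foreign-key constraints whose columns have no supporting index on the child table; a simplified version (it checks columns individually and ignores composite-key subtleties) would look something like this, with each offending column then getting a plain CREATE INDEX in the dedicated index tablespace:

    SELECT c.table_name, cc.column_name, c.constraint_name
      FROM user_constraints c, user_cons_columns cc
     WHERE c.constraint_name = cc.constraint_name
       AND c.constraint_type = 'R'
       AND NOT EXISTS (SELECT 1
                         FROM user_ind_columns ic
                        WHERE ic.table_name      = cc.table_name
                          AND ic.column_name     = cc.column_name
                          AND ic.column_position = cc.position)
     ORDER BY c.table_name, cc.position;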

By now, I had gradually dealt with both the performance degradation issue and the blocking locks issue, yet there were still serious resource constraints to deal with until I could upgrade the RAM on the server or migrate the database to a new server.

In the next part, I will elaborate on the system crash issue that occurred a couple of weeks after I joined, and how I tackled and resolved it.
