oracle

ORA- 4031 – A Case Study

Today I will be taking up one ORA-4031 issue we faced on one of our client database. Please note that SQL statements mentioned below have been changed and does not reveal any confidential information.

Our client was facing ORA-4031 on a 9.2 database.This was a new Database which had gone Live recently. Following error messages were recorded in alert log.

ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 4200 bytes of shared memory ("shared pool","select    owner#,name,namespace...","sga heap(1,0)","library cache")

Initial shared pool at time of failure was 160 Mb. After the failure the shared_pool_size was increased to 1 Gb. We did not have any other information and were asked to perform RCA for the same. Luckily we had Statspack configured on the server, so we took a report for the duration when the error occurred.

              Snap Id     Snap Time      Sessions Curs/Sess Comment
            --------- ------------------ -------- --------- -------------------
Begin Snap:     1000 15-Apr-08 06:32:06  #######       4.6
  End Snap:     1001 15-Apr-0806:53:41  #######       4.6
   Elapsed:               21.58 (mins)

Cache Sizes (end)
~~~~~~~~~~~~~~~~~
Buffer Cache:     2,048M      Std Block Size:          8K
Shared Pool Size:       160M          Log Buffer:     32,768K

Load Profile
~~~~~~~~~~~~                Per Second       Per Transaction
                           ---------------       ---------------
    Redo size:              2,459.98              2,911.95
    Logical reads:              6,447.76              7,632.40
    Block changes:                  8.31                  9.83
    Physical reads:              3,642.28              4,311.47
    Physical writes:                  0.80                  0.95
    User calls:                106.51                126.08
    <strong>Parses:                151.48                179.31</strong>
    <strong>Hard parses:                 11.08                 13.12</strong>
    Sorts:                 91.50                108.31
    Logons:                  1.41                  1.67
    Executes:                283.09                335.10
    Transactions:                  0.84

We can clearly see that lot of hard parses were occurring at the time of error.

Library Cache Activity for DB: TESTDB1  Instance: TESTDB1
 Snaps: 1000 -1001-&gt;"Pct Misses"  should be very low

                         Get  Pct        Pin        Pct               Invali-
Namespace           Requests  Miss     Requests     Miss     Reloads  dations
--------------- ------------ ------ -------------- ------ ---------- --------
BODY                   1,564    1.9          2,903   62.9        469        0
CLUSTER                6,827    0.0          9,011    0.1          0        0
INDEX                  9,376   62.8          9,375   63.7          5        0
SQL AREA              77,058   11.0        442,341    5.2      <strong>6,723 </strong>       0
TABLE/PROCEDURE       72,495   10.4        135,173   24.2      <strong>8,291</strong>        0
TRIGGER                2,288   14.9          2,288   49.5        218        0
          -------------------------------------------------------------

Also looking at information from V$LIBRARYCACHE, we can see that there were lot of reloads happening for SQL AREA and TABLE/PROCEDURE Namespace. This gives indication that Shared Pool is not appropriately sized. So we kind of expected this to be a shared pool sizing issue.

But after increasing the shared_pool_size to 1 Gb, we saw that memory for “sql area” was increasing. So as to avoid further ORA-4031, we started flushing the shared pool when the free memory reached to 200M.

NAME                                  BYTES
    -------------------------- ----------------
 <strong>free memory                     597,154,376</strong>  &lt; -large amount of free, was done after flushing shared pool
<strong>sql area                        315,365,096</strong> &lt; - High value
library cache                    82,229,616
miscellaneous                    81,476,336
gcs resources                    39,331,928
gcs shadows                      26,133,184

Whenever I observe high memory usage for Sqlarea, I run following sql statements.

1) 10g

<strong><span style="color: #3366ff;">SQL&gt; select sa.sql_text,sa.version_count ,ss.*from v$sqlarea sa,v$sql_shared_cursor ss where<span> </span>sa.address=ss.address and sa.version_count &gt; 50 order by sa.version_count ;</span></strong>

8i/9i

<strong><span style="color: #3366ff;">select sa.sql_text,sa.version_count ,ss.*from v$sqlarea sa,v$sql_shared_cursor ss<span> </span>where sa.address=ss.KGLHDPAR<span> </span>and sa.version_count &gt; 50<span> </span>order by sa.version_count ;</span></strong>

– This one is to find if child cursors are being generated. To know about child cursors, refer to following Metalink note

Note 296377.1 – Handling and resolving unshared cursors/large version_counts

We did not get any SQL in this category.

2) Use script from Asktom website to find if application is using Bind variables. You can find ,more information by clicking here

This returned quite a few statements

    INSERT INTO EMP (columnname, columnname, columnname, columnname, columnname, columnname, columnname, columnname, columnname, columnname, columnname ,columnname) VALUES ('#', TO_DATE('#','#'), @, '#',
    @, @, @, '#', @, @, @,     @, @, @, @, @)          <strong>1948</strong>

    DELETE FROM EMP WHERE EMPID=@          <strong>1976</strong>

    INSERT INTO EMP (columnname, columnname, columnname, columnname) VALUES (@, @, '#',    @, NULL , @)          <strong>1976</strong>

    DELETE FROM EMP WHERE TYPE=@ AND EMPID=@          <strong>1976</strong>

     INSERT INTO DEPT VALUES (@, '#', '#', TO_DATE('#','#'), SYSDATE)
          <strong>2012</strong>

    DELETE FROM DEPT WHERE DEPTNO=@ AND DEPT_ID=@ AND MGR=@ AND SAL='#'          <strong>2375</strong>

    INSERT INTO SALARY (columnname, columnname, columnname, columnname, columnname, columnname, columnname) VA LUES (@, @, @, '#', '#', @, TO_DATE('#','#'), TO_DATE('#','#'), @, @)          <strong>2377</strong>

We can clearly see that statements not being shared and thus causing the high memory usage in shared pool.

Each sql statement in shared pool will be occupying some memory. If the sql are similar and differ only in literal, then they should be shared as otherwise they will occupy lot of extra space. By not sharing the statements, we are wasting space in shared pool and causing shared pool fragmentation.

Solution again is to use Bind variables or CURSOR_SHARING.

Can ASM DiskGroup Be Renamed?

This was actually a question on Oracle forum which I had replied to. Basically the Poster, wanted to know if he could rename the ASM Diskgroup name by renaming/editing ASM Disk header. He had also mentioned that he had heard about this being done by Oracle for its few customers using kfed.

Answer is NO. It is not possible to rename the diskgroup by editing the ASM disk header. kfed is known to be used for patching ASM disk headers for corruption (only oracle support can do it) and for viewing ASM header contents. Only way to change this by dropping and recreating the diskgroup.

In case you wish to create a new diskgroup with a name say +DG1 which was being used by a Diskgroup which is not mounted (Have some ASM Disk members still in ASM_DISKSTRING path), then you would face following error

    <strong>ORA-15030</strong>: diskgroup name "string" is in use by another diskgroup
    <strong>Cause:</strong> A CREATE DISKGROUP command specified a Diskgroup name that was already assigned to another diskgroup.
    <strong>Action: </strong>Select a different name for the Diskgroup.

In case you wish to create the Diskgroup with same name +DG1 you will be required to clear the ASM disk header using

dd if=/dev/zero of=/dev/raw/raw11 bs=1024 count=100

After this you can recreate the Diskgroup with same name.

One Poster suggested renaming at LUN/Storage level. I believed this to be a destructive idea which could corrupt the Diskgroup. nvengurl replied to this and informed that we read ASM disk header to mount the Diskgroup and thus changing the LUN name/id/path will not solve the issue.

Update: This article is valid for 10g and 11gR1. Since 11gR2 oracle has introduced renamedg utility which can be used to rename diskgroup. I have documented the steps here

11.1.0.6 ASM installation on Solaris fails -I

Looks like that there is no easy way for me to install/configure Oracle components. And every install/configuration leads me to some or the other problems. Anyways I was trying to install Oracle Database 11g on a Solaris 10 but the CSS does not seems to come up. While issuing localconfig add as root user it gives following message and fails

-bash-3.00# ./localconfig add
Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
Configuration for local CSS has been initialized

Cleaning up Network socket directories
Setting up Network socket directories
Adding to inittab
Startup will be queued to init within 30 seconds.
Checking the status of new Oracle init process...
Expecting the CRS daemons to be up within 600 seconds.

Giving up: Oracle CSS stack appears NOT to be running.
Oracle CSS service would not start as installed
Automatic Storage Management(ASM) cannot be used until Oracle CSS service is started

On checking further for css log messages in $ORACLE_HOME/log/hostname/client , I found following messages

Oracle Database 11g CRS Release 11.1.0.6.0 - Production Copyright 1996, 2007 Oracle. All rights reserved.
2008-06-18 21:38:29.721: [ CSSCLNT][1]clsssInitNative: failed to connect to (ADDRESS=(PROTOCOL=ipc)(<strong>KEY=OCSSD_LL_test1zone2_</strong>))
, rc 9

On checking the /var/adm/messages file , found following errors

Jun 18 01:16:23 test1zone2 root: [ID 702911 user.error] Oracle Cluster Synchronization Service starting by user request.
Jun 18 01:16:25 test1zone2 root: [ID 702911 user.error] Cluster Ready Services completed waiting on dependencies.
Jun 18 01:16:35 test1zone2 last message repeated 9 times

/var/tmp/.oracle does not show any files created after running localconfig add. These are called Socket files and CSS uses these for communication.I suspect this to be a issue.
I checked our Linux machine (with RAC) and found that we have some files and one of them is named as

srwxrwxrwx  1 oracle oinstall 0 Jun 11 22:10 <strong>sOCSSD_LL_prod01_</strong>

This is quite similar to the error message in css.log files i.e KEY=OCSSD_LL_test1zone2_

As this is a test machine so I cant raise a ticket with Oracle. Currently posted a thread on Oracle Forum. Let’s see if someone is able to figure out something. I will keep you posted. If anyone is interested in reading how these sockets work then they can visit this article from Frits Hoogland.

DBConsole Issue on RAC -Part II

Continuing with DBConsole issue, we were able to get the cluvfy return success for the nodes.

Basically Oracle suggested us that when we use ssh we should not get any banner. E.g

[oracle@PROD01 ~]$ ssh PROD02 date
Fri Jun 13 02:00:41 IST 2008

But in our case it was displaying a banner which displayed a warning message when someone logged in Server.

As we are using Linux, we renamed file /etc/issue.net to something else and tried running cluvfy again. It was successful this time

[oracle@PROD01 ~]$ cluvfy comp nodecon -n all

Verifying node connectivity

Checking node connectivity...

Node connectivity check passed for subnet "10.X.X.X" with node(s) PROD02,PROD01.
Node connectivity check passed for subnet "192.X.X.X" with node(s) PROD02,PROD01

Interfaces found on subnet "192.X.X.X" that are likely candidates for VIP:
PROD02 eth3:192.X.X.X
PROD01 eth3:192.X.X.X

Interfaces found on subnet "10.X.X.X" that are likely candidates for a private interconnect:
PROD02 eth2:10.X.X.X eth2:10.X.X.X
PROD01 eth2:10.X.X.X eth2:10.X.X.X

Node connectivity check passed.


Verification of node connectivity was successful.

But the DBConsole issue still remains. It is still unable to find the hostname. Now waiting for Oracle 🙁

DBConsole Issue on RAC -Part I

Currently I am working on issue where DBConsole is not starting on our 2 Node RAC system. When I try to start, I get following errors

[oracle@PROD01 ~]$ emctl status dbconsole
TZ set to US/Pacific
Exception in getting local host
java.net.UnknownHostException: PROD01: PROD01
        at java.net.InetAddress.getLocalHost(InetAddress.java:1191)
        at oracle.sysman.emSDK.conf.TargetInstaller.getLocalHost
(TargetInstaller.java:5561)
        at oracle.sysman.emSDK.conf.TargetInstaller.main
(TargetInstaller.java:4126)
Exception in getting local host

I tried recreating the DBConsole but that also failed and gave following error

[oracle@PROD01 ~]$ emca -config dbcontrol db  -cluster

STARTED EMCA at Jun 12, 2008 3:29:40 AM
EM Configuration Assistant, Version 10.2.0.1.0 Production
Copyright (c) 2003, 2005, Oracle.  All rights reserved.

Jun 12, 2008 3:29:40 AM oracle.sysman.emcp.util.ClusterUtil getHostName
SEVERE: Error getting hostname for the cluster node PROD01. This node may not be configured correctly
Enter the following information:
Database unique name: testdb1
Jun 12, 2008 3:29:42 AM oracle.sysman.emcp.ParamsManager getInaccessibleNodeList
WARNING: The following cluster nodes are unavailable: [PROD01, PROD02].
Jun 12, 2008 3:29:42 AM oracle.sysman.emcp.ParamsManager getInaccessibleSidList
WARNING: The requested operation will not be performed for the following instances: [testdb11, testdb12].
No cluster nodes found when configuring the RAC database for EM

Above error informs that the nodes are not available, but if we check the status, they are indeed running.

[oracle@PROD01 ~]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.testdb1.db    application    ONLINE    ONLINE    PROD01
ora....omp1.cs application    ONLINE    ONLINE    PROD01
ora....11.inst application    ONLINE    ONLINE    PROD01
ora....12.inst application    ONLINE    ONLINE    PROD02
ora....SM1.asm application    ONLINE    ONLINE    PROD01
ora....01.lsnr application    ONLINE    ONLINE    PROD01
ora....d01.gsd application    ONLINE    ONLINE    PROD01
ora....d01.ons application    ONLINE    ONLINE    PROD01
ora....d01.vip application    ONLINE    ONLINE    PROD01
ora....dM2.asm application    ONLINE    ONLINE    PROD02
ora....02.lsnr application    ONLINE    ONLINE    PROD02
ora....d02.gsd application    ONLINE    ONLINE    PROD02
ora....d02.ons application    ONLINE    ONLINE    PROD02
ora....d02.vip application    ONLINE    ONLINE    PROD02

At this moment I searched Metalink for any known issues. I came across

Note.388440.1 – Problem Emca Fails To Configure DB Control For RAC Database Error Getting Hostname For The Cluster Node

According to this we need to confirm that SSH is set and output of “cluvfy comp nodecon -n all” command should return Sucess. In our case SSH was already set. So I tried using the command but it was Unsucessful

[oracle@PROD01 ~]$ cluvfy comp nodecon -n all

Verifying node connectivity

Verification of node connectivity was unsuccessful on all the nodes.

At this moment we created a SR with Oracle Support. We were asked to then check

Note 549667.1 – Cluvfy returns “Unsuccessful” for most commands, with no other details

We verified that this note was not applicable to us as file permissions for files (Discussed in Note 549667.1) were correctly set. Now we have one more SR which has been created with RAC team to resolve the “Cluvfy” issue.

It’s been long wait and despite of SR being Escalated, still haven’t got a response from Analyst. Will keep you all posted about the issue and will share the solution. Meanwhile if someone else has also faced this situation and resolved it, then do let me know.