
‘Metric execution timed out’ error reported by Grid

We use Enterprise Manager Grid Control 10.2 to monitor our Oracle database environments. Apart from the normal metrics provided by Oracle, we also use host-based and SQL user-defined metrics, which allow you to monitor almost anything. Coming back to the main issue, the following error was reported from one of the monitored targets (a non-production server):

PROD Apr 30, 2010 8: Status
Time: May 30, 2010 2:28:14 AM PDT
Sev: Metric Error Start
Message: Metric evaluation error start - Metric execution timed out
in 360 seconds

Checking $AGENT_HOME/sysman/log/emagent.trc, the same messages were found recorded in the trace file:

2010-05-30 02:39:11,585 Thread-4045400992 WARN  scheduler: oracle_database:orcli.xx.xx.xx_orcli1:Response is due to run again but is still running from last schedule
2010-05-30 02:39:12,280 Thread-3981994912 ERROR fetchlets.oslinetok: Metric execution timed out in 600 seconds
2010-05-30 02:39:13,279 Thread-4088384416 ERROR fetchlets.oslinetok: Metric execution timed out in 600 seconds

I ran emctl status agent, but it did not report any pending files. My first step was to restart the agent, but it didn't come up properly after the restart:

[oracle@prod1]~% emctl status agent
Oracle Enterprise Manager 10g Release 4 Grid Control 10.2.0.4.0.
Copyright (c) 1996, 2007 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent Version     : 10.2.0.4.0
OMS Version       : 10.2.0.4.0
Protocol Version  : 10.2.0.4.0
Agent Home        : /oracle/product/agent10g
Agent binaries    : /oracle/product/agent10g
Agent Process ID  : 1025
Parent Process ID : 1009
---------------------------------------------------------------
Agent is Running but Not Ready

Looking again at the emagent.trc file, the following error was observed:

2010-05-30 03:19:30,238 Thread-4092558240 ERROR TargetManager: TIMEOUT when compute dynamic properties for target orcli.xx.xx.xx_orcli1

At this point I enabled a blackout for this host, as I didn't want too many alerts for this server. On checking MOS (aka Metalink), I found a note recommending altering the values of parameters in the agent configuration file that control the time limit for the agent to make a connection to the database. On checking, though, those parameters were already set to the specified values (I don't remember the MOS article now; in any case it was not relevant here).
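For reference, a blackout can also be managed from the agent command line with emctl; a minimal sketch (the blackout name db_timeout_issue is made up here):

# start a named blackout on this agent (specific targets can be listed after the name)
$AGENT_HOME/bin/emctl start blackout db_timeout_issue
# check it, and stop it once the issue is resolved
$AGENT_HOME/bin/emctl status blackout
$AGENT_HOME/bin/emctl stop blackout db_timeout_issue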
The next step was to run the top command to check server usage:

[oracle@prod1]~/product/agent10g/sysman/config% top

top - 03:33:00 up 298 days,  8:57,  3 users,  load average: 1.10, 1.10, 1.24
Tasks: 333 total,   2 running, 331 sleeping,   0 stopped,   0 zombie
Cpu(s): 13.6% us,  1.2% sy,  1.0% ni, 83.5% id,  0.7% wa,  0.0% hi,  0.0% si
Mem:  32939820k total, 31860120k used,  1079700k free,   655108k buffers
Swap: 12578884k total,   425388k used, 12153496k free, 22205928k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3042 oracle    25   0 50780  13m 6300 R 99.7  0.0   2780:40 tnslsnr
 8588 root      25  10 77084  10m 1228 S  8.3  0.0 594:49.26 sec.pl
19660 oracle    15   0 5978m 1.4g 1.4g S  0.7  4.6  81:12.82 oracle
19662 oracle    -2   0 5978m 1.8g 1.8g S  0.7  5.7  28:26.12 oracle
19666 oracle    -2   0 5978m 1.8g 1.8g S  0.7  5.7  29:23.61 oracle
 8402 root      15   0 21968 1296  876 S  0.3  0.0   6:35.15 sshd
 8857 root      34  19     0    0    0 S  0.3  0.0 380:06.93 kipmi0
 9413 oracle    15   0 3154m 103m  96m S  0.3  0.3  19:11.10 oracle

We can see that tnslsnr is consuming close to 100% CPU. The lsnrctl status command hung, and a tnsping to the listener service also timed out. For reference, the hung checks looked like this (a sketch; the orcl1 service alias is hypothetical):
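lsnrctl status LISTENER_prod1    # hung with no output
tnsping orcl1                    # timed out as well

I then restarted the listener by killing it and starting it with srvctl: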

[oracle@prod1]~% kill -9 3042
[oracle@prod1]~% ps -ef|grep tns
oracle   15600 30213  0 03:34 pts/1    00:00:00 grep tns
[oracle@prod1]~% srvctl start listener -n prod1
[oracle@prod1]~% ps -ef|grep tns
oracle   16022     1  0 03:35 ?        00:00:00 /oracle/product/10.2/bin/tnslsnr LISTENER_prod1 -inherit
oracle   16078 30213  0 03:35 pts/1    00:00:00 grep tns

After this, the agent came up normally. This correlates with the timeout issue reported by the agent while making a connection to the database. In hindsight, the correct approach would have been to check the listener connections first instead of restarting the agent.

Coming back to the problem: the agent was now up and running, but the Grid Console reported the agent as offline and the emctl upload agent command failed. We again looked at emagent.trc, which recorded the following:

2010-05-30 03:40:31,050 Thread-4136642240 WARN  recvlets.aq: Deferred nmevqd_refreshState for [orcli.xx.xx.xx_orcli1,oracle_database]

This issue was caused by the blackout. MOS note “Agent crashes with accumulation of blackouts and recvlets.aq: Failed to refreshState [ID 740000.1]” describes the issue.

The note asks you to stop the agent, remove the blackouts.xml file, issue the emctl clearstate agent command, and finally restart the agent. The commands are below for reference:

$AGENT_HOME/bin/emctl stop agent

rm $AGENT_HOME/sysman/emd/blackouts.xml

$AGENT_HOME/bin/emctl clearstate agent

$AGENT_HOME/bin/emctl start agent

After performing these steps, emctl upload agent succeeded and Grid started monitoring the target normally. Note that if you are using a clustered agent, the log files and the blackouts.xml file will be found under $AGENT_HOME/<node_name>/sysman/.
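For a clustered agent, the same cleanup would look roughly like this (a sketch; the node directory name prod1 is illustrative):

$AGENT_HOME/bin/emctl stop agent
rm $AGENT_HOME/prod1/sysman/emd/blackouts.xml
$AGENT_HOME/bin/emctl clearstate agent
$AGENT_HOME/bin/emctl start agent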

opmnctl start fails when ORA_NLS10 parameter is set

I ran into an issue where, while starting the Grid Control components, the opmnctl command failed with the following errors:

./opmnctl start
globalInitNLS: NLS boot file not found or invalid
 -- default linked-in boot block used
XML parser init: error 201.
globalInitNLS: NLS boot file not found or invalid
 -- default linked-in boot block used
XML parser init: error 201.
globalInitNLS: NLS boot file not found or invalid
 -- default linked-in boot block used
XML parser init: error 201.
opmnctl: opmn start failed

On checking environment variables, I found that the ORA_NLS10 environment variable was set. Unsetting it resolved the issue.
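The fix is a one-liner; a sketch:

# confirm the variable is set in the current shell
env | grep ORA_NLS10
# unset it and retry
unset ORA_NLS10
./opmnctl start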

Cluvfy Reports CRS not Installed on Nodes

Before I start my post, I would like to wish all our readers a “Happy New Year”.

While reviewing a 10gR2 RAC configuration, I faced the following errors on invoking the cluvfy utility:

$ ./cluvfy stage -post crsinst -n prod01,prod02 -verbose

Performing post-checks for cluster services setup 

Checking node reachability...

Check: Node reachability from node "prod01-fe"
  Destination Node                      Reachable?
  ------------------------------------  ------------------------
  prod01                              yes
  prod02                              yes
Result: Node reachability check passed from node "prod01-fe".

Checking user equivalence...

Check: User equivalence for user "oracle"
  Node Name                             Comment
  ------------------------------------  ------------------------
  prod02                              passed
  prod01                              passed
Result: User equivalence check passed for user "oracle".

ERROR:
CRS is not installed on any of the nodes.
Verification cannot proceed.

Post-check for cluster services setup was unsuccessful on all the nodes.

Cluvfy reported that Clusterware was not installed on any of the nodes. This was strange and unexpected, as a RAC database was already running on these nodes and crs_stat reported the status of all CRS resources. ocrcheck also did not report any problems with the OCR. To dig further, I checked the cluvfy logs located under $ORA_CRS_HOME/cv/log, which had recorded the following errors:

[main] [15:52:41:800] [OUIData.readInventoryData:393]  ==== CRS home added: Oracle home properties:
Name     : OraCRS
Type     : CRS-HOME
Location : /app/oracle/product/10.2/crs
Node list: [prod01-fe, prod02-fe]
; Thu Dec 24 15:52:41 GMT+08:00 2009
[main] [15:52:41:800] [OUIData.readInventoryData:401]  ==== ORACLE home added: Oracle home properties:
Name     : OraHome
Type     : ORACLE-HOME
Location : /app/oracle/product/10.2/db_1
Node list: [prod01, prod02]
; Thu Dec 24 15:52:41 GMT+08:00 2009
[main] [15:52:41:800] [VerificationUtil.isCRSInstalled:1262]  CRS wasn't found installed  on node: prod02; Thu Dec 24 15:52:41 GMT+08:00 2009
[main] [15:52:41:800] [VerificationUtil.isCRSInstalled:1262]  CRS wasn't found installed  on node: prod01; Thu Dec 24 15:52:41 GMT+08:00 2009

You can see that for CRS_HOME the node list is recorded as [prod01-fe, prod02-fe], while for DB_HOME it is [prod01, prod02]. The actual hostnames of the nodes are prod01 and prod02; prod01-fe and prod02-fe were aliases for the two nodes.

Cluvfy uses the central Oracle Inventory to obtain ORACLE_HOME information. Checking $ORACLE_BASE/oraInventory/ContentsXML/inventory.xml confirmed that the nodes corresponding to CRS_HOME were stored as prod01-fe and prod02-fe:

<INVENTORY>
<VERSION_INFO>
   <SAVED_WITH>10.2.0.1.0</SAVED_WITH>
   <MINIMUM_VER>2.1.0.6.0</MINIMUM_VER>
</VERSION_INFO>
<HOME_LIST>
<HOME NAME="OraCRS" LOC="/app/oracle/product/10.2/crs" TYPE="O" IDX="1" CRS="true">
   <NODE_LIST>
      <NODE NAME="prod01-fe"/>
      <NODE NAME="prod02-fe"/>
   </NODE_LIST>
</HOME>
<HOME NAME="OraHome" LOC="/app/oracle/product/10.2/db_1" TYPE="O" IDX="2">
   <NODE_LIST>
      <NODE NAME="prod01"/>
      <NODE NAME="prod02"/>
   </NODE_LIST>
</HOME>

There are a few reported issues on Metalink, but they relate to CRS=”true” not being present in the CRS_HOME entry. There is one more reported issue, listed here by Surachart, which was caused by incorrect permissions on the /etc/oraInst.loc file.

To correct this problem, we had to run the runInstaller -updateNodeList command to update the correct nodes in the file. Even though some Metalink notes recommend changing this file manually, I would recommend using the runInstaller command for updating the inventory.
Before executing the command, I ran the olsnodes command to confirm that the OCR stored prod01 and prod02 in its repository.
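The check itself is a one-liner (output shown is illustrative):

$ olsnodes
prod01
prod02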

runInstaller -updateNodeList -silent "CLUSTER_NODES={prod01,prod02}" ORACLE_HOME="/app/oracle/product/10.2/crs" ORACLE_HOME_NAME="OraCRS" LOCAL_NODE="prod01" CRS=true
Starting Oracle Universal Installer...

No pre-requisite checks found in oraparam.ini, no system pre-requisite checks will be executed.
The inventory pointer is located at /var/opt/oracle/oraInst.loc
The inventory is located at /app/oracle/oraInventory
'UpdateNodeList' was successful.

After this, cluvfy worked fine. I do not have any explanation for why the incorrect nodes were recorded in the Oracle Inventory, but the above solution should take care of it.

ORA-27146 Errors while starting up Database

A short post on an issue I faced while starting up a cloned database:

SQL> startup nomount
ORA-24323: value not allowed
ORA-27146: post/wait initialization failed

This database environment had been cloned and was being started for the first time after cloning. Running the HCVE scripts on the system indicated that the OS kernel parameters were adequately sized:

SHMMNI set to 1000 is adequate
SHMSEG set to 120 is adequate
SEMMNI set to 700 is adequate
SEMMNS set to 2000 is adequate
MAX_THREAD_PROC set to 256 is adequate
NFLOCKS set to 1450 is adequate

My Oracle Support articles suggested this issue is related to semaphore settings. As the settings were correct, I attempted to start the instance after commenting out the processes parameter in init.ora (it was set to 1000). This time the database started.

I then set the processes parameter back to 1000 and attempted to restart the database. It started again!! I am clueless as to why it initially failed and then started with the same setting. Does anyone have any reasoning for this behavior?
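One speculative guess: semaphore sets left over from an earlier instance on the clone host could have temporarily exhausted the available semaphores and been released in the interim. If you hit this, semaphore usage at failure time can be checked with ipcs; a sketch (the set ID is a placeholder):

# list the semaphore sets currently allocated on the host
ipcs -s
# remove a stale set owned by oracle, only after verifying its instance is dead
ipcrm -s <semid>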

Oracle Patch Set Update (PSU 2) released for 10.2.0.4 and 11.1.0.7

Oracle has recently released Patch Set Update 2 (PSU 2) along with CPU October 2009. As mentioned earlier, PSUs are released quarterly along with the CPU patches.

If you are running a 10.2.0.4 RAC database, you need to apply the following patches:

1) Patch 8833280 (10.2.0.4.2 PSU 2) on Oracle Database 10.2.0.4
2) Patch 8436582 (10.2.0.4 CRS Bundle Patch #4) on Oracle Clusterware 10.2.0.4
3) Patch 8705958 (10.2.0.4 CRS PSU 2) on Oracle Clusterware 10.2.0.4

Note that the CRS PSU 2 patch contains fixes for the bugs fixed in Bundle Patch #4.

Critical Patch Updates (CPUs) are released each quarter and contain security fixes identified by Oracle. Oracle has recently introduced Patch Set Updates (PSUs), which are proactive cumulative patches containing recommended bug fixes, released on a regular and predictable schedule (they also contain the security fixes from the CPU). PSUs are on the same quarterly schedule as the Critical Patch Updates, specifically the Tuesday closest to the 15th of January, April, July, and October.

It is recommended to apply the latest database patch set (already 10.2.0.4 here) along with the latest Patch Set Update (PSU 2) to the Oracle Database, as this fixes the critical bugs and also serves as a new baseline version for reporting issues to Oracle.

In the case of RAC, you also need to apply the CRS Patch Bundle to fix the CRS issues identified by Oracle.

Please note that Patch Set Updates can be applied on the base release version or on any earlier Patch Set Update. For example, 10.2.0.4.2 can be applied on 10.2.0.4.1 and 10.2.0.4.0. Once a Patch Set Update has been applied, the recommended way to get future security content is to apply subsequent Patch Set Updates. Reverting from an applied Patch Set Update back to the Critical Patch Update, while technically possible, requires significant time and effort, and is not advised.
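For reference, applying a PSU follows the usual OPatch flow; a minimal sketch (paths illustrative; the patch README is the authoritative procedure):

# check which patches are already recorded in the inventory
$ORACLE_HOME/OPatch/opatch lsinventory
# with instances and listener down, apply the PSU from its unzipped directory
cd 8833280
$ORACLE_HOME/OPatch/opatch apply
# after startup, load the PSU SQL changes as described in the README
SQL> @?/rdbms/admin/catbundle.sql psu apply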

To know more, please read the following My Oracle Support articles:

Note 405820.1 – 10.2.0.X CRS Bundle Patch Information

Note 854428.1 – Intro to Patch Set Updates (PSU)

Note 8833280.8 lists the bugs fixed as part of PSU 2.

SQL*Plus connection on AIX taking too long

Recently we faced a performance issue wherein a SQL*Plus connection from application server A (AIX 5.3) to database server DB (a different physical server, 10gR2 on AIX 5.3) was taking nearly 1-2 minutes. We did some quick checks to narrow down the problem area, i.e., is it the network, the database server, or some issue with the application server?

1) vmstat output reported normal run queues, memory utilisation, and idle CPU on application server A and database server DB. This rules out a resource issue.

2) Connections from another application server, B, were taking normal time. This rules out a database server issue.

3) If an incorrect password was entered in the connection details, the second attempt got a normal, quick response. This indicated some network issue.

4) The system admin checked for network issues with the traceroute command and reported normal results.

Even though the SA confirmed there was no network issue, we were sure this had something to do with DNS or the network. As we had an identical environment from which the connection was working fine, we initiated tracing of the session using truss:

truss -aeo /tmp/sqlplus_timing.txt sqlplus username/passwd@DB

Issuing a tail command on the trace for application server A reported the following:

open("/etc/netsvc.conf", O_RDONLY)              Err#13 EACCES
open("/etc/irs.conf", O_RDONLY)                 Err#2  ENOENT
getdomainname(0x09001000A00A0E98, 1024)         = 0
__libc_sbrk(0x0000000000010020)                 = 0x000000001021CBE0
getdomainname(0x09001000A00A0E98, 1024)         = 0
open("/etc/hesiod.conf", O_RDONLY)              Err#2  ENOENT
getdomainname(0x09001000A00A0E98, 1024)         = 0
getdomainname(0x09001000A00A0E98, 1024)         = 0
getdomainname(0x09001000A00A0E98, 1024)         = 0
socket(2, 2, 0)                                 = 9
getsockopt(9, 65535, 4104, 0x0FFFFFFFFFFF5B74, 0x0FFFFFFFFFFF5B70) = 0
connext(9, 0x09001000A0015B50, 16)              = 0
send(9, 0x0FFFFFFFFFFF6A80, 32, 0)              = 32
_poll(0x0FFFFFFFFFFF5C20, 1, 5000)              = 1
nrecvfrom(9, 0x0FFFFFFFFFFF7DB0, 1024, 0, 0x0FFFFFFFFFFF63E0, 0x0FFFFFFFFFFF5C08) = 78
send(9, 0x0FFFFFFFFFFF6A80, 22, 0)              = 22
_poll(0x0FFFFFFFFFFF5C20, 1, 5000)              = 1
nrecvfrom(9, 0x0FFFFFFFFFFF7DB0, 1024, 0, 0x0FFFFFFFFFFF63E0, 0x0FFFFFFFFFFF5C08) = 22
close(9)                                        = 0
socket(2, 2, 0)                                 = 9
sendto(9, 0x0FFFFFFFFFFF6A80, 22, 0, 0x09001000A0015B60, 16) = 22
_poll(0x0FFFFFFFFFFF5C20, 1, 5000) (sleeping...)
_poll(0x0FFFFFFFFFFF5C20, 1, 5000)              = 0
close(9)                                        = 0
socket(2, 2, 0)                                 = 9
sendto(9, 0x0FFFFFFFFFFF6A80, 22, 0, 0x09001000A0015B60, 16) = 22
_poll(0x0FFFFFFFFFFF5C20, 1, 5000) (sleeping...)
_poll(0x0FFFFFFFFFFF5C20, 1, 5000)              = 0
close(9)                                        = 0
socket(2, 2, 0)                                 = 9
sendto(9, 0x0FFFFFFFFFFF6A80, 22, 0, 0x09001000A0015B60, 16) = 22
_poll(0x0FFFFFFFFFFF5C20, 1, 10000) (sleeping...)
_poll(0x0FFFFFFFFFFF5C20, 1, 10000)             = 0
close(9)                                        = 0

We can see a lot of sleeping messages for the _poll system call. We did not find these waits in the truss output of application server B. Going up in the truss output file, we see EACCES (permission denied) on /etc/netsvc.conf and ENOENT (file/directory not present) on /etc/irs.conf. On application server B, there were no EACCES errors on /etc/netsvc.conf, though ENOENT errors were reported for /etc/irs.conf. Looks like we need to check the /etc/netsvc.conf file permissions.

Comparing file permissions, we noticed that the permissions on /etc/netsvc.conf had changed on server A: the root user alone had read permission on the file. Granting read permission on /etc/netsvc.conf to others resolved the issue immediately.
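The fix amounted to the following (a sketch):

# before the fix: readable by root alone
ls -l /etc/netsvc.conf
# grant read permission to others
chmod o+r /etc/netsvc.conf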

As per the AIX docs, the /etc/netsvc.conf file is used to specify the ordering of name resolution for the sendmail command and the gethostbyname, gethostbyaddr, and gethostent subroutines, as well as alias resolution for the sendmail command. It is also used to override the default order and the order given in the /etc/irs.conf file, by creating the /etc/netsvc.conf configuration file and specifying the desired ordering.

e.g., if the resolver cannot find the name in the /etc/hosts file and you want the resolver to use NIS, enter:

hosts = local , nis

I had usually seen the /etc/resolv.conf file being used to specify name resolution. The AIX docs describe the following process for name resolution:

To resolve a name in a domain network, the resolver routine first queries the domain name server database, which might be local if the host is a domain name server or on a foreign host. Name servers translate domain names into Internet addresses. The group of names for which a name server is responsible is its zone of authority. If the resolver routine is using a remote name server, the routine uses the domain name protocol (DOMAIN) to query for the mapping. To resolve a name in a flat network, the resolver routine checks for an entry in the local /etc/hosts file. When NIS or NIS+ is used, the /etc/hosts file on the master server is checked.

By default, resolver routines attempt to resolve names using the above resources. BIND/DNS is tried first. If the /etc/resolv.conf file does not exist or if BIND/DNS could not find the name, NIS is queried if it is running. NIS is authoritative over the local /etc/hosts, so the search ends here if it is running. If NIS is not running, then the local /etc/hosts file is searched. If none of these services can find the name, then the resolver routines return with HOST_NOT_FOUND. If all of the services are unavailable, then the resolver routines return with SERVICE_UNAVAILABLE.

The default order described above can be overwritten by creating the /etc/irs.conf configuration file and specifying the desired order. Also, both the default and /etc/irs.conf orderings can be overwritten with the environment variable, NSORDER. If either the /etc/irs.conf file or NSORDER environment variable are defined, then at least one value must be specified along with the option.
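For example, the default order can be overridden for a single session with the NSORDER environment variable (values as documented for AIX):

# consult the local /etc/hosts file first, then fall back to DNS
export NSORDER=local,bind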

As mentioned, /etc/irs.conf is used to modify the default order, and /etc/netsvc.conf can be used to override the order given in /etc/irs.conf. I couldn't find any reference explaining the time limit used by the resolver, but it seems that after trying for some limited time, it falls back on the default order of name resolution.