We use Oracle Enterprise Manager 10.2 Grid Control to monitor our Oracle database environments. Apart from the standard metrics provided by Oracle, we also use host-based and SQL user-defined metrics, which let you monitor almost anything. Coming back to the main issue, the following error was reported from one of the monitored targets (a non-production server):
PROD Apr 30, 2010 8: Status Time: May 30, 2010 2:28:14 AM PDT Sev: Metric Error Start Message: Metric evaluation error start - Metric execution timed out in 360 seconds
Checking $AGENT_HOME/sysman/log/emagent.trc, the same timeout messages were recorded in the trace file:
2010-05-30 02:39:11,585 Thread-4045400992 WARN scheduler: oracle_database:orcli.xx.xx.xx_orcli1:Response is due to run again but is still running from last schedule
2010-05-30 02:39:12,280 Thread-3981994912 ERROR fetchlets.oslinetok: Metric execution timed out in 600 seconds
2010-05-30 02:39:13,279 Thread-4088384416 ERROR fetchlets.oslinetok: Metric execution timed out in 600 seconds
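Trawling the trace by eye gets tedious when the errors repeat. As a hedged sketch (the sample file and its two lines are illustrative, generated on the spot so the example is self-contained), you can count the timeout errors to see how widespread they are; against a live agent you would grep $AGENT_HOME/sysman/log/emagent.trc instead:

```shell
# Hedged sketch: count metric-timeout errors in an agent trace file.
# A two-line sample trace is generated here so the example runs anywhere;
# on a real host, grep $AGENT_HOME/sysman/log/emagent.trc instead.
cat > emagent_sample.trc <<'EOF'
2010-05-30 02:39:12,280 Thread-3981994912 ERROR fetchlets.oslinetok: Metric execution timed out in 600 seconds
2010-05-30 02:39:13,279 Thread-4088384416 ERROR fetchlets.oslinetok: Metric execution timed out in 600 seconds
EOF
grep -c "Metric execution timed out" emagent_sample.trc   # prints 2 for this sample
```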
I ran emctl status agent, but it did not report any pending upload files. My first step was to restart the agent, but it did not come up properly after the restart:
[oracle@prod1]~% emctl status agent
Oracle Enterprise Manager 10g Release 4 Grid Control 10.2.0.4.0.
Copyright (c) 1996, 2007 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent Version     : 10.2.0.4.0
OMS Version       : 10.2.0.4.0
Protocol Version  : 10.2.0.4.0
Agent Home        : /oracle/product/agent10g
Agent binaries    : /oracle/product/agent10g
Agent Process ID  : 1025
Parent Process ID : 1009
---------------------------------------------------------------
<strong>Agent is Running but Not Ready</strong>
Looking again at the emagent.trc file, the following error was observed:
2010-05-30 03:19:30,238 Thread-4092558240 ERROR TargetManager: TIMEOUT when compute dynamic properties for target orcli.xx.xx.xx_orcli1
At this point I enabled a blackout for this host, as I didn't want too many alerts for this server. On checking MOS (aka Metalink), I found a note recommending altering the values of parameters in the agent configuration file that control the time limit for the agent to connect to the database. On checking, though, those parameters were already set to the specified values. (I don't remember the MOS article now; in any case it was not relevant to our situation.)
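For reference, the 10.2 agent keeps its settings in $AGENT_HOME/sysman/config/emd.properties. Since the exact parameters from that MOS note aren't recalled above, the sketch below just greps any timeout-looking keys out of a made-up sample file; the key name and values shown are illustrative, not the ones from the note:

```shell
# Hedged sketch: list timeout-related keys in the agent config.
# A sample emd.properties is generated so this runs anywhere; on a real host
# grep $AGENT_HOME/sysman/config/emd.properties instead. Key names/values are illustrative.
cat > emd.properties.sample <<'EOF'
# sample agent settings (illustrative)
dynamicPropsComputeTimeout=30
EMD_URL=https://prod1.example.com:3872/emd/main/
EOF
grep -i 'timeout' emd.properties.sample
```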
The next step was to run the top command to check server usage:
[oracle@prod1]~/product/agent10g/sysman/config% top
top - 03:33:00 up 298 days, 8:57, 3 users, load average: 1.10, 1.10, 1.24
Tasks: 333 total, 2 running, 331 sleeping, 0 stopped, 0 zombie
Cpu(s): 13.6% us, 1.2% sy, 1.0% ni, 83.5% id, 0.7% wa, 0.0% hi, 0.0% si
Mem: 32939820k total, 31860120k used, 1079700k free, 655108k buffers
Swap: 12578884k total, 425388k used, 12153496k free, 22205928k cached

  PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
<strong> 3042 oracle 25   0 50780  13m 6300 R 99.7  0.0   2780:40 tnslsnr</strong>
 8588 root   25  10 77084  10m 1228 S  8.3  0.0 594:49.26 sec.pl
19660 oracle 15   0 5978m 1.4g 1.4g S  0.7  4.6  81:12.82 oracle
19662 oracle -2   0 5978m 1.8g 1.8g S  0.7  5.7  28:26.12 oracle
19666 oracle -2   0 5978m 1.8g 1.8g S  0.7  5.7  29:23.61 oracle
 8402 root   15   0 21968 1296  876 S  0.3  0.0   6:35.15 sshd
 8857 root   34  19     0    0    0 S  0.3  0.0 380:06.93 kipmi0
 9413 oracle 15   0 3154m 103m  96m S  0.3  0.3  19:11.10 oracle
We can see that tnslsnr is consuming close to 100% CPU. I tried the lsnrctl status command, which hung. I then tried pinging the listener service with tnsping, which also timed out. I restarted the listener by killing the process and starting it again with srvctl:
[oracle@prod1]~% kill -9 3042
[oracle@prod1]~% ps -ef|grep tns
oracle 15600 30213  0 03:34 pts/1 00:00:00 grep tns
[oracle@prod1]~% srvctl start listener -n prod1
[oracle@prod1]~% ps -ef|grep tns
oracle 16022     1  0 03:35 ?     00:00:00 /oracle/product/10.2/bin/tnslsnr LISTENER_prod1 -inherit
oracle 16078 30213  0 03:35 pts/1 00:00:00 grep tns
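The kill -9 above used a PID read off the top output. As a hedged sketch (assuming the listener binary is named tnslsnr, as in the ps listing), pgrep with an exact-name match can locate that PID so you don't fat-finger the wrong process:

```shell
# Find the listener PID by exact process name before killing it.
# pgrep -x matches the process name exactly, so it cannot match this script itself.
pid=$(pgrep -x tnslsnr | head -1)
if [ -n "$pid" ]; then
  echo "listener pid: $pid"   # then: kill -9 "$pid" && srvctl start listener -n prod1
else
  echo "no tnslsnr process found"
fi
```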
After this the agent came up normally. This correlates with the timeout the agent reported while connecting to the database. In hindsight, the correct approach would have been to check the listener connections first instead of restarting the agent.
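Since lsnrctl status itself hung during this incident, one defensive habit (a sketch, assuming GNU coreutils `timeout` is available on the host) is to wrap listener checks in a timeout so a hung listener cannot hang the health check as well:

```shell
# Wrap a command in a timeout and classify the result.
# Exit code 124 from `timeout` means the command was killed for running too long.
check_with_timeout() {
  secs="$1"; shift
  timeout "$secs" "$@" >/dev/null 2>&1
  case $? in
    0)   echo "OK" ;;
    124) echo "TIMED OUT" ;;
    *)   echo "FAILED" ;;
  esac
}
# Real use would be: check_with_timeout 10 lsnrctl status
check_with_timeout 1 sleep 5   # demo with a command that "hangs"; prints TIMED OUT
```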
Coming back to the problem: the agent was now up and running, but the Grid Console reported it as offline, and the emctl upload agent command failed. We again looked at emagent.trc, which recorded the following errors:
2010-05-30 03:40:31,050 Thread-4136642240 WARN recvlets.aq: Deferred nmevqd_refreshState for [orcli.xx.xx.xx_orcli1,oracle_database]
This issue was caused by the blackout. The MOS note "Agent crashes with accumulation of blackouts and recvlets.aq: Failed to refreshState [ID 740000.1]" describes the issue.
The note asks you to stop the agent, remove the blackouts.xml file, issue the emctl clearstate agent command, and finally restart the agent. Please find the commands below for reference:
$AGENT_HOME/bin/emctl stop agent
rm $AGENT_HOME/sysman/emd/blackouts.xml
$AGENT_HOME/bin/emctl clearstate agent
$AGENT_HOME/bin/emctl start agent
After performing these steps, 'emctl upload agent' succeeded and Grid started monitoring the target normally. Note that if you are using a clustered agent, the log files and the blackouts.xml file will be found under $AGENT_HOME/<node_name>/sysman/.
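To avoid guessing which layout a given agent home uses, a small sketch like the one below can locate blackouts.xml under either a standalone or clustered agent home. The demo directory tree and node name are illustrative and created on the spot so the example runs anywhere; on a real host you would just run the final `find` against your actual $AGENT_HOME:

```shell
# Locate blackouts.xml under a standalone or clustered agent home.
# A demo clustered-style tree is created here; on a real host, just run
# the final `find` against your actual $AGENT_HOME.
AGENT_HOME="${AGENT_HOME:-./agent10g_demo}"
mkdir -p "$AGENT_HOME/prod1/sysman/emd"      # clustered layout: <node_name>/sysman
touch "$AGENT_HOME/prod1/sysman/emd/blackouts.xml"
find "$AGENT_HOME" -name blackouts.xml       # prints the full path(s) found
```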