Amit Bansal

Physical Corruption: ORA-1578 part 3

Recently we encountered a block corruption issue in 10g database on Linux x86 64 bit and using ASM for storing database files.
Saurabh had earlier written articles on ora-1578.

http://askdba.org/weblog/2008/04/physical-corruption-ora-1578-part-1/
http://askdba.org/weblog/2008/05/physical-corruption-ora-1578-part-2/

It was first time I was working on a corruption issue, so I followed above notes and found them quite useful. Good thing about ora-1578 error message is that it gives the problematic block and datafile number

ORA-01578: ORACLE data block corrupted (file # 699, block # 188654)
ORA-01110: data file 699:
'+DG_IDX/james/datafile/scott_emp_detail.381.705692005'

Querying dba_extents revealed that it was a Fact table and one partition was giving error.Since we had rman backup for this database we directly went for recovery using blockrecover command

Recovery Manager: Release 10.2.0.4.0 - Production on Sat Aug 7 01:48:51 2010

Copyright (c) 1982, 2007, Oracle.  All rights reserved.

connected to target database: james (DBID=1812247572)

RMAN> run {blockrecover datafile 699 block 188654;}

Starting blockrecover at 07-AUG-10
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=2231 instance=james4 devtype=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: sid=2190 instance=james4 devtype=DISK
allocated channel: ORA_DISK_3
channel ORA_DISK_3: sid=2142 instance=james4 devtype=DISK
allocated channel: ORA_DISK_4
channel ORA_DISK_4: sid=2186 instance=james4 devtype=DISK
allocated channel: ORA_DISK_5
channel ORA_DISK_5: sid=2170 instance=james4 devtype=DISK
allocated channel: ORA_DISK_6
channel ORA_DISK_6: sid=2203 instance=james4 devtype=DISK
allocated channel: ORA_DISK_7
channel ORA_DISK_7: sid=2240 instance=james4 devtype=DISK
allocated channel: ORA_DISK_8
channel ORA_DISK_8: sid=2195 instance=james4 devtype=DISK

channel ORA_DISK_2: restoring block(s)
channel ORA_DISK_2: specifying block(s) to restore from backup set
restoring blocks of datafile 00699
channel ORA_DISK_2: reading from backup piece /RMNGD_RMAN1/backup/james/29lk4ift_1_1

user interrupt received
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of blockrecover command at 08/07/2010 02:52:27
RMAN-03099: job cancelled at user request

We had to cancel the block recovery as even after 1 hour it could not locate the block from backup piece. Checking backup piece found that this backup piece was of 184G

ls -lh /RMNGD_RMAN1/backup/james/29lk4ift_1_1
-rw-r-----  1 oracle dba 184G Jul 31 18:17 /RMNGD_RMAN1/backup/james/29lk4ift_1_1

We had stopped the data load for this table but delaying further was not possible as it was impacting application. Finally we  decided to use dbms_repair table to allow DML to mark the block as corrupt .
This table was FACT table which could have been loaded again (lucky for us 🙂 ). Before marking it corrupt, we took a dump of block

ALTER SYSTEM DUMP DATAFILE 699 block 188654;
BEGIN
DBMS_REPAIR.ADMIN_TABLES (
     TABLE_NAME => 'REPAIR_TABLE',
     TABLE_TYPE => dbms_repair.repair_table,
     ACTION     => dbms_repair.create_action,
     TABLESPACE => 'USERS');
END;
/

SET SERVEROUTPUT ON
DECLARE num_corrupt INT;
BEGIN
num_corrupt := 0;
DBMS_REPAIR.CHECK_OBJECT (
     SCHEMA_NAME => 'SCOTT',
     OBJECT_NAME => 'EMP_DETAIL_FACT',
     PARTITION_NAME => 'SYS_P1140013',
     REPAIR_TABLE_NAME => 'REPAIR_TABLE',
     CORRUPT_COUNT =>  num_corrupt);
DBMS_OUTPUT.PUT_LINE('number corrupt: ' || TO_CHAR (num_corrupt));
END;
/

SELECT OBJECT_NAME, BLOCK_ID, CORRUPT_TYPE, MARKED_CORRUPT,CORRUPT_DESCRIPTION, REPAIR_DESCRIPTION FROM REPAIR_TABLE;

  SET SERVEROUTPUT ON
  DECLARE num_fix INT;
  BEGIN
  num_fix := 0;
  DBMS_REPAIR.FIX_CORRUPT_BLOCKS (
      SCHEMA_NAME => 'SCOTT',
      OBJECT_NAME => 'EMP_DETAIL_FACT',
     PARTITION_NAME => 'SYS_P1140013',
      OBJECT_TYPE => dbms_repair.table_object,
      REPAIR_TABLE_NAME => 'REPAIR_TABLE',
      FIX_COUNT=> num_fix);
  DBMS_OUTPUT.PUT_LINE('num fix: ' || TO_CHAR(num_fix));
  END;
  /

  execute DBMS_REPAIR.SKIP_CORRUPT_BLOCKS('SCOTT','EMP_DETAIL_FACT')

Refer 10g doc to get details on DBMS_REPAIR.

This allowed the application to load data again. Later we worked with application team and created a new segment and renamed the old one to *_bkp and then dropped it.
Next day we got a alert informing errors with our rman backup

ORA-19566: exceeded limit of 0 corrupt blocks for file +DG_IDX/james/datafile/scott_emp_detail.381.705692005
continuing other job steps, job failed will not be re-run
channel ORA_DISK_8: finished piece 1 at AUG-10-2010 01:56:45
piece handle=/RMNGD_RMAN3/backup/james/i2lkuv0q_1_1 tag=james_INC_20100810 comment=NONE
channel ORA_DISK_8: backup set complete, elapsed time: 00:26:43
channel ORA_DISK_1: finished piece 1 at AUG-10-2010 02:13:01
piece handle=/RMNGD_RMAN1/backup/james/hrlkuv0i_1_1 tag=james_INC_20100810 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:43:07
channel ORA_DISK_6: finished piece 1 at AUG-10-2010 02:21:26
piece handle=/RMNGD_RMAN2/backup/james/i0lkuv0n_1_1 tag=james_INC_20100810 comment=NONE
channel ORA_DISK_6: backup set complete, elapsed time: 00:51:27
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================

Backup job failed again as the table was still lying in recyclebin.

03:16:47 SQL> select * from dba_recyclebin where original_name='EMP_DETAIL_FACT_BKP';

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==03:16:47 SQL> select * from dba_recyclebin where original_name=\'EMP_DETAIL_FACT_BKP\';

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==03:16:47 SQL> select * from dba_recyclebin where original_name='EMP_DETAIL_FACT_BKP';

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==03:16:47 SQL> select * from dba_recyclebin where original_name='EMP_DETAIL_FACT_BKP';

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==$0 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12                                                                                                                                            NO        NO               4465594     4465594   4465594
      1088

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==$0 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12
                                                                                                                                               NO        NO               4465594     4465594   4465594
     12896 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12                                                                                                                                            NO        NO               4465594     4465594   4465594
      1088

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==03:16:47 SQL> select * from dba_recyclebin where original_name='EMP_DETAIL_FACT_BKP';

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==$0 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12                                                                                                                                            NO        NO               4465594     4465594   4465594
      1088

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==$0 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12
                                                                                                                                               NO        NO               4465594     4465594   4465594
     12896 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12
                                                                                                                                               NO        NO               4465594     4465594   4465594
     12896 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12                                                                                                                                            NO        NO               4465594     4465594   4465594
      1088

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==03:16:47 SQL> select * from dba_recyclebin where original_name='EMP_DETAIL_FACT_BKP';

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==03:16:47 SQL> select * from dba_recyclebin where original_name='EMP_DETAIL_FACT_BKP';

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==$0 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12                                                                                                                                            NO        NO               4465594     4465594   4465594
      1088

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==$0 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12
                                                                                                                                               NO        NO               4465594     4465594   4465594
     12896 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12                                                                                                                                            NO        NO               4465594     4465594   4465594
      1088

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==03:16:47 SQL> select * from dba_recyclebin where original_name='EMP_DETAIL_FACT_BKP';

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==$0 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12                                                                                                                                            NO        NO               4465594     4465594   4465594
      1088

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==$0 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12
                                                                                                                                               NO        NO               4465594     4465594   4465594
     12896 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12
                                                                                                                                               NO        NO               4465594     4465594   4465594
     12896 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12
                                                                                                                                               NO        NO               4465594     4465594   4465594
     12896 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12                                                                                                                                            NO        NO               4465594     4465594   4465594
      1088

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==03:16:47 SQL> select * from dba_recyclebin where original_name=\'EMP_DETAIL_FACT_BKP\';

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==03:16:47 SQL> select * from dba_recyclebin where original_name='EMP_DETAIL_FACT_BKP';

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==03:16:47 SQL> select * from dba_recyclebin where original_name='EMP_DETAIL_FACT_BKP';

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==$0 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12                                                                                                                                            NO        NO               4465594     4465594   4465594
      1088

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==$0 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12
                                                                                                                                               NO        NO               4465594     4465594   4465594
     12896 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12                                                                                                                                            NO        NO               4465594     4465594   4465594
      1088

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==03:16:47 SQL> select * from dba_recyclebin where original_name='EMP_DETAIL_FACT_BKP';

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==$0 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12                                                                                                                                            NO        NO               4465594     4465594   4465594
      1088

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==$0 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12
                                                                                                                                               NO        NO               4465594     4465594   4465594
     12896 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12
                                                                                                                                               NO        NO               4465594     4465594   4465594
     12896 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12                                                                                                                                            NO        NO               4465594     4465594   4465594
      1088

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==03:16:47 SQL> select * from dba_recyclebin where original_name='EMP_DETAIL_FACT_BKP';

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==03:16:47 SQL> select * from dba_recyclebin where original_name='EMP_DETAIL_FACT_BKP';

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==$0 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12                                                                                                                                            NO        NO               4465594     4465594   4465594
      1088

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==$0 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12
                                                                                                                                               NO        NO               4465594     4465594   4465594
     12896 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12                                                                                                                                            NO        NO               4465594     4465594   4465594
      1088

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==03:16:47 SQL> select * from dba_recyclebin where original_name='EMP_DETAIL_FACT_BKP';

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==$0 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12                                                                                                                                            NO        NO               4465594     4465594   4465594
      1088

SCOTT                       BIN$jXvLFhsgCOXgQA1M6jspCQ==$0 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12
                                                                                                                                               NO        NO               4465594     4465594   4465594
     12896 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12
                                                                                                                                               NO        NO               4465594     4465594   4465594
     12896 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12
                                                                                                                                               NO        NO               4465594     4465594   4465594
     12896 EMP_DETAIL_FACT_BKP
DROP                            Table Partition
scott_emp_detail                                                                                                      2009-11-09:18:41:14
2010-08-10:09:40:39                                                    9.9476E+12
                                                                                                                                               NO        NO               4465594     4465594   4465594
     12896

So we decided to purge the recyclebin and resized datafile to ensure that we get rid of corrupt block. As a result maximum block for datafile was 117765, whereas we had error for block 188654.

03:19:07 SQL> conn scott

Enter password: 

Connected.

03:19:20 SQL> purge recyclebin;

03:23:39 SQL> alter database datafile '+DG_IDX/james/datafile/scott_emp_detail.381.705692005' resize 3800m;

Database altered.

  1* select max(block_id) from dba_extents where file_id=699

03:24:25 SQL> /

MAX(BLOCK_ID)

-------------

       117765

We also ran rman backup validate and dbverify to ensure that there was no corruption

RMAN> backup validate datafile 699;

Starting backup at 11-AUG-10
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=2171 instance=james4 devtype=DISK
allocated channel: ORA_DISK_2
channel ORA_DISK_2: sid=2157 instance=james4 devtype=DISK
channel ORA_DISK_1: starting full datafile backupset
channel ORA_DISK_1: specifying datafile(s) in backupset
input datafile fno=00699 name=+DG_IDX/james/datafile/scott_emp_detail.381.705692005
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:35
Finished backup at 11-AUG-10

Since this is ASM, dbv needs you to specify system password

dbv file=+DG_IDX/james/datafile/scott_emp_detail.381.705692005 userid=system/oracle123 blocksize=32768

DBVERIFY - Verification starting : FILE = +DG_IDX/james/datafile/scott_emp_detail.381.705692005
DBVERIFY - Verification complete

Total Pages Examined         : 118400
Total Pages Processed (Data) : 110979
Total Pages Failing   (Data) : 0
Total Pages Processed (Index): 0
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 7421
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 0
Total Pages Marked Corrupt   : 0
Total Pages Influx           : 0
Highest block SCN            : 0 (0.0)

All this exercise made us re-look at rman backup. We realized that we need to reduce the size of RMAN backup piece as having smaller piece will help us in faster recovery of data blocks or datafile loss.
We had two options

1) Specify value of maxpiecesize so as to limit size of backup piece.
2) Specify filesperset parameter to limit the number of files in backup piece.

Both things meant we had more backup pieces but definitely it would result in improved recovery times. We modified the backup script to include filesperset parameter

BACKUP AS COMPRESSED BACKUPSET INCREMENTAL LEVEL 1 DATABASE FILESPERSET=1

In our case we could reload the data so we dropped the old table. But in case you can’t do it and you have rman backups failing because of corrupt block issue, you can use following syntax in rman run clause to skip ‘n’ corrupt blocks

set maxcorrupt for datafile 699 to 1;

Cloning RAC Database with RMAN fails with error

Steps for cloning a RAC database  with RMAN is similar to cloning a single instance database. But while using rman duplicate in 10g, you will get following errors

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 06/14/2010 00:09:42
RMAN-03015: error occurred in stored script Memory Script
RMAN-06136: ORACLE error from auxiliary database: ORA-38856: cannot mark instance UNNAMED_INSTANCE_2 (redo thread 2) as enabled

RMAN> exit

This is due to Bug 4355382 ORA-38856: FAILED TO OPEN DATABASE WITH RESETLOGS WHEN USING RAC BACKUP

To avoid this we need to set the following initialization parameter on the auxiliary database; and restart the database in no mount mode

_no_recovery_through_resetlogs=TRUE

Unset this once rman duplicate command completes successfully. One more notable thing which I noticed while cloning this database was that if you use skip readonly clause to skip duplicating read only tablespace, entries for read only tablespace still remain and file names are marked as missing

/oracle/product/10.2/dbs/MISSING00048
OFFLINE READ ONLY

/oracle/product/10.2/dbs/MISSING00049
OFFLINE READ ONLY

This is normal behavior and not a bug. From docs

SKIP READONLY Excludes datafiles in read-only tablespaces from the duplicate database.
Note: A record for the skipped read-only tablespace still appears in DBA_TABLESPACES. By using this feature, you can activate the read-only tablespace later. For example, you can store the read-only tablespace data on a CD-ROM, then mount the CD-ROM later and view the data

If you don’t plan to make use of these tablespaces, you can drop them by simply using drop tablespace command .Meanwhile do you know that  tables can be dropped from read-only tablespace. To know more about it, you can read this entry from Tanel describing this behavior

‘Metric execution timed out’ error reported by Grid

We use 10.2 Enterprise Manager Grid control  to monitor our oracle database environments . Apart from the normal metrics provided by oracle we also use host based and sql user defined metrics which allows you to monitor almost anything . Coming back to the main issue, following error was reported from one of the monitored target (Non Prod server)

PROD Apr 30, 2010 8: Status
Time: May 30, 2010 2:28:14 AM PDT
Sev: Metric Error Start
Message: Metric evaluation error start - Metric execution timed out
in 360 seconds

Checking the $AGENT_HOME/sysman/log/emagent.trc, same messages were recorded in the trace file

2010-05-30 02:39:11,585 Thread-4045400992 WARN  scheduler: oracle_database:orcli.xx.xx.xx_orcli1:Response is due to run again but is still running from last schedule
2010-05-30 02:39:12,280 Thread-3981994912 ERROR fetchlets.oslinetok: Metric execution timed out in 600 seconds
2010-05-30 02:39:13,279 Thread-4088384416 ERROR fetchlets.oslinetok: Metric execution timed out in 600 seconds

I used emctl status agent but it did not report any pending files. My first step was to  restart the agent but it didn’t come up properly after restart

[oracle@prod1]~% emctl status agent
Oracle Enterprise Manager 10g Release 4 Grid Control 10.2.0.4.0.
Copyright (c) 1996, 2007 Oracle Corporation.  All rights reserved.
---------------------------------------------------------------
Agent Version     : 10.2.0.4.0
OMS Version       : 10.2.0.4.0
Protocol Version  : 10.2.0.4.0
Agent Home        : /oracle/product/agent10g
Agent binaries    : /oracle/product/agent10g
Agent Process ID  : 1025
Parent Process ID : 1009
---------------------------------------------------------------
<strong>Agent is Running but Not Ready</strong>

Again looking at emagent.trc file, following error was observed

2010-05-30 03:19:30,238 Thread-4092558240 ERROR TargetManager: TIMEOUT when compute dynamic properties for target orcli.xx.xx.xx_orcli1

At this moment I enabled the blackout for this host as I didn’t wanted too many alerts for this server. On checking MOS aka metalink, I found a note recommending altering values of parameters in agent configuration file which controls the time limit for agent to make connection to databse. But on checking found those parameter’s were already set to specified value (Don’t remember the MOS article now, anyways it was not relevant in our case).
Next step was that I executed top command to see server usage

[oracle@prod1]~/product/agent10g/sysman/config% top

top - 03:33:00 up 298 days,  8:57,  3 users,  load average: 1.10, 1.10, 1.24
Tasks: 333 total,   2 running, 331 sleeping,   0 stopped,   0 zombie
Cpu(s): 13.6% us,  1.2% sy,  1.0% ni, 83.5% id,  0.7% wa,  0.0% hi,  0.0% si
Mem:  32939820k total, 31860120k used,  1079700k free,   655108k buffers
Swap: 12578884k total,   425388k used, 12153496k free, 22205928k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 <strong>3042 oracle    25   0 50780  13m 6300 R 99.7  0.0   2780:40 tnslsnr</strong>
 8588 root      25  10 77084  10m 1228 S  8.3  0.0 594:49.26 sec.pl
19660 oracle    15   0 5978m 1.4g 1.4g S  0.7  4.6  81:12.82 oracle
19662 oracle    -2   0 5978m 1.8g 1.8g S  0.7  5.7  28:26.12 oracle
19666 oracle    -2   0 5978m 1.8g 1.8g S  0.7  5.7  29:23.61 oracle
 8402 root      15   0 21968 1296  876 S  0.3  0.0   6:35.15 sshd
 8857 root      34  19     0    0    0 S  0.3  0.0 380:06.93 kipmi0
 9413 oracle    15   0 3154m 103m  96m S  0.3  0.3  19:11.10 oracle

We can see that tnslsnr is consuming close to 100% cpu. I tried lsnrctl status command which hung. Then I tried pinging to the listener service using tnsping which also timed out.   I restarted the listener by killing it and starting with srvctl

[oracle@prod1]~% kill -9 3042
[oracle@prod1]~% ps -ef|grep tns
oracle   15600 30213  0 03:34 pts/1    00:00:00 grep tns
[oracle@prod1]~% srvctl start listener -n prod1
[oracle@prod1]~% ps -ef|grep tns
oracle   16022     1  0 03:35 ?        00:00:00 /oracle/product/10.2/bin/tnslsnr LISTENER_prod1 -inherit
oracle   16078 30213  0 03:35 pts/1    00:00:00 grep tns

After this agent came up normally. This co-relates with timeout issue reported by agent while making connection to database. I think the correct approach to the problem should have been checking the listener connections instead of restarting agent.

Coming back to the problem, agent was up and running now but  Grid Console reported agent as offline and emctl upload agent command failed . We again look at emagent.trc which recorded following errors

2010-05-30 03:40:31,050 Thread-4136642240 WARN  recvlets.aq: Deferred nmevqd_refreshState for [orcli.xx.xx.xx_orcli1,oracle_database]

This issue was caused due to blackout .  MOS note “Agent crashes with accumulation of blackouts and recvlets.aq: Failed to refreshState [ID 740000.1] ” describes the issue.

Note asks to stop the agent,then removing the blackouts.xml file, issuing emctl clearstate agent command and finally restarting the agent. Please find below commands for reference

$AGENT_HOME/bin/emctl stop agent

rm $AGENT_HOME/sysman/emd/blackouts.xml

$AGENT_HOME/bin/emctl clearstate agent

$AGENT_HOME/bin/emctl start agent

After performing this step, ’emctl upload agent’ succeeded and grid started monitoring the target normally. Note that in case you are using clustered agent, log files and blackouts.xml file will be found under $AGENT_HOME/<node_name>/sysman/

Downloading Oracle Software directly to Server

This article is based on Pythian’s aricle DOWNLOADING FROM OTN DIRECTLY TO YOUR DATABASE SERVER.

Article tell’s about using lynx (text based web browser available on linux) to generate the cookies file which can be then used with wget to download software. But when I tried to use lynx , it repeatedly gave error for javascript.  On searching net , I found a firefox extension “Export Cookies” which creates wget compatible cookies.

You can find the extension at https://addons.mozilla.org/en-US/firefox/addon/8154/

After you install the extension, clear your cookies and then login to otn.oracle.com and sign in. After you have done this,  go to Tools -> Export cookies and save the file.

Then you can copy the file to your server and use wget to download the software. Syntax is

wget –load-cookies=cookies.txt <otn_url>

E.g To download 11gr2 Grid infrastructure software for linux x86-64 bit

wget --load-cookies=cookies.txt http://download.oracle.com/otn/linux/oracle11g/R2/linux.x64_11gR2_grid.zip

11gR2 Silent Install errors with [SEVERE] – Invalid My Oracle Support credentials

While performing  11gR2 RAC Database Software installation with response file (silent mode), I got this error.

[SEVERE] - Invalid My Oracle Support credentials

I was using default response files for performing this installation. Searching on MOS (My Oracle Support ), I got a note indicating that following parameter need’s to be specified

MYORACLESUPPORT_USERNAME = 

MYORACLESUPPORT_PASSWORD =

I provided a dummy value for these two paramters but the installation again failed with same errors. Then I gave correct credentials and tried again, this time installation was succesful. My initial thought was ” Is Oracle forcing people to have valid MOS account for performing installation” But this couldn’t be true as I had earlier done GUI installation without providing these credentials. It was clear that there was some other parameter which controlled this behavior. So as to diagnose it , I created a response file using GUI (Un checking the check box for my oracle support credentials) . This time response file had two additional parameters

#------------------------------------------------------------------------------
# Specify whether to enable the user to set the password for
# My Oracle Support credentials. The value can be either true or false.
# If left blank it will be assumed to be false.
#
# Example    : SECURITY_UPDATES_VIA_MYORACLESUPPORT=true
#------------------------------------------------------------------------------
SECURITY_UPDATES_VIA_MYORACLESUPPORT=false

#------------------------------------------------------------------------------
# Specify whether user wants to give any proxy details for connection.
# The value can be either true or false. If left blank it will be assumed
# to be false.
#
# Example    : DECLINE_SECURITY_UPDATES=false
#------------------------------------------------------------------------------
DECLINE_SECURITY_UPDATES=true

We can see that DECLINE_SECUIRTY_UPDATES value is changed from default value of false to true.  From 11.2 docs

DECLINE_SECURITY_UPDATES
If the direct connection is unavailable, set this parameter to false if you do not want to configure Oracle Configuration Manager, or do not want to provide any proxy details for the connection. The default is False.

If you do not set MYORACLESUPPORT_USERNAME to any value and you set this parameter to true, the Oracle Configuration Manager is not configured.

So to workaround this problem, we need to set DECLINE_SECURITY_UPDATES=true