11g

Lists 11gR1/R2 articles

PROCESSED Messages not clearing from Oracle Queue

I was contacted by the Dev team to look into a development database where the Oracle queue size kept on increasing. As per them, messages were not getting cleared: they were running a count on the queue table and the count kept increasing.
To find the exact cause, I included MSG_STATE in the query and found a large number of messages stuck in PROCESSED state:

SQL> select queue,msg_state,count(*) from AQ$JMS_QUEUE_TABLE group by queue,msg_state;

QUEUE                          MSG_STATE          COUNT(*)
------------------------------ ---------------- ----------
AQ$_JMS_QUEUE_TABLE_E          EXPIRED                   3
JMS_QUEUE                      PROCESSED               343
JMS_QUEUE                      READY                     3

Since this was a dev box, I initially tried purging the queue using dbms_aqadm.purge_queue_table:

DECLARE
  po_t dbms_aqadm.aq$_purge_options_t;   -- default purge options
BEGIN
  dbms_aqadm.purge_queue_table(
    queue_table     => 'SCOTT.JMS_QUEUE_TABLE',
    purge_condition => NULL,             -- no condition: target all messages
    purge_options   => po_t);
END;
/

This removed the EXPIRED and READY messages, but the PROCESSED message count didn’t decrease.
Next I checked the queue definition using ALL_QUEUES and noticed that RETENTION was set to 3600:

SQL> SELECT owner, name, retention FROM all_queues WHERE name LIKE '%JMS%';

OWNER      NAME                   RETENTION
---------- ---------------------- ----------
SCOTT      JMS_QUEUE                    3600
SCOTT      AQ$_JMS_QUEUE_TABLE_E           0

When a queue is defined in Oracle, you can specify how long a message remains in the queue table after it has been dequeued, i.e. the retention period (in seconds). Setting the retention to 0 cleared the PROCESSED messages instantly:

exec dbms_aqadm.alter_queue(QUEUE_NAME=>'SCOTT.JMS_QUEUE',RETENTION_TIME=>0)
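To confirm, you can re-run the MSG_STATE count from above (a quick sanity check, using the same view as before):

SQL> select queue, msg_state, count(*) from AQ$JMS_QUEUE_TABLE group by queue, msg_state;

The PROCESSED rows should be gone, leaving only the READY (and any EXPIRED) messages.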

Some interesting learning for me 🙂

ORA-01873 error running SAP pre-upgrade scripts

While trying to run the pre_upgrade_status.sql script for a DB upgrade (10.2.0.4 to 11.2.0.4), we got the following error

@pre_upgrade_status.sql
ERROR at line 1:
ORA-01873: the leading precision of the interval is too small
ORA-06512: at line 8

On debugging further, I found that it was due to the limit imposed by the “INTERVAL DAY TO SECOND” datatype. By default the day precision is 2, i.e. INTERVAL DAY TO SECOND is equivalent to INTERVAL DAY(2) TO SECOND(6), so it cannot hold an interval of 100 or more days. As per Oracle this is not a bug, and you should use the unconstrained PL/SQL subtype DSINTERVAL_UNCONSTRAINED instead.

I couldn’t find any hits on the web for this, as the error is only encountered when the AWR retention is set to 100 days or more. It is easily reproducible:

-- Set retention to 101 days (145440 minutes, i.e. more than 100 days)
execute dbms_workload_repository.modify_snapshot_settings (interval => 60, retention => 145440)

-- Execute the following code

declare
  sap_awr_small          BOOLEAN := FALSE;
  sap_awr_retention_days NUMBER;
  -- default precision: INTERVAL DAY(2) TO SECOND(6),
  -- which overflows once the retention reaches 100 days
  sap_awr_retention      INTERVAL DAY TO SECOND;
BEGIN
  SELECT retention INTO sap_awr_retention   -- fails here (line 8) with ORA-01873
    FROM dba_hist_wr_control
   WHERE dbid = (SELECT dbid FROM v$database);

  sap_awr_retention_days := extract(day from sap_awr_retention);
  IF (sap_awr_retention_days < 42) THEN
    sap_awr_small := TRUE;
  END IF;
end;
/

ERROR at line 1:
ORA-01873: the leading precision of the interval is too small
ORA-06512: at line 8

An easy workaround is to set the retention to a lower value, say 99 days (99 × 24 × 60 = 142,560 minutes), and revert it post upgrade:

execute dbms_workload_repository.modify_snapshot_settings (interval => 60,retention =>142560)

In case you are facing this error in your application code, refer to the My Oracle Support article “ORA-1873: When Passing Datetime Interval with Precision > 2 in a Parameter” (Doc ID 394260.1) for modifying your code to use an unconstrained type.
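For illustration, here is a hedged sketch (not SAP’s actual fixed script) of the same block rewritten with DSINTERVAL_UNCONSTRAINED, which removes the DAY(2) limit:

declare
  sap_awr_small          BOOLEAN := FALSE;
  sap_awr_retention_days NUMBER;
  sap_awr_retention      DSINTERVAL_UNCONSTRAINED;  -- no DAY(2) precision limit
BEGIN
  SELECT retention INTO sap_awr_retention
    FROM dba_hist_wr_control
   WHERE dbid = (SELECT dbid FROM v$database);

  sap_awr_retention_days := extract(day from sap_awr_retention);
  IF (sap_awr_retention_days < 42) THEN
    sap_awr_small := TRUE;
  END IF;
end;
/

This runs cleanly even with the retention set above 100 days.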

LGWR terminating instance due to error 338

Recently we came across an issue where our DB crashed with an ORA-00338 error.

Errors in file /oracle/diag/rdbms/orcl11g/orc11g/trace/orc11g_lgwr_24118.trc:
ORA-00338: log 2 of thread 1 is more recent than control file
ORA-00312: online log 2 thread 1: '/oracle/oradata/orcl11g/redo02.log'
LGWR (ospid: 24118): terminating the instance due to error 338

The DB couldn’t be restarted as it gave the same errors on open. For multiplexed redo logs, the error is reported for both members.

Error Description (reference: http://psoug.org/oraerror/ORA-00338.htm):
Log string of thread string is more recent than control file

Error Cause:

The control file change sequence number in the log file is greater than the number in the control file. This implies that the wrong control file is being used. Note that repeatedly causing this error can make it stop happening without correcting the real problem. Every attempt to open the database will advance the control file change sequence number until it is great enough.

Action:
Use the current control file or do backup control file recovery to make the control file current. Be sure to follow all restrictions on doing a backup control file recovery.
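Relating this cause text to our case: the control file side of that comparison can be queried directly (a hedged sketch; the DB must be at least mounted) and compared with the “Control Seq” value that appears in the redo log header dump further below:

SQL> select controlfile_sequence# from v$database;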

The above explanation suggests a problem with the controlfile. Normally these errors are seen when doing incomplete recovery. To troubleshoot, we took dumps of the redo log files (on the recommendation of Oracle Support).

Note: the logs below are from a test system that was used to reproduce the issue.

SQL> alter system dump logfile '/oracle/oradata/orcl11g/redo01.log' validate;
System altered.
SQL> alter system dump logfile '/oracle/oradata/orcl11g/redo02.log' validate;
ERROR at line 1:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/oracle/oradata/orcl11g/redo02.log'
SQL> alter system dump logfile '/oracle/oradata/orcl11g/redo03.log' validate;
*
ERROR at line 1:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/oracle/oradata/orcl11g/redo03.log'

The dumps of redo02.log and redo03.log failed with errors saying they did not contain any redo. Since the redo01.log dump succeeded, we looked at its trace file.

DUMP OF REDO FROM FILE '/oracle/oradata/orcl11g/redo01.log'
 Opcodes *.*
 RBAs: 0x000000.00000000.0000 thru 0xffffffff.ffffffff.ffff
 SCNs: scn: 0x0000.00000000 thru scn: 0xffff.ffffffff
 Times: creation thru eternity
 VALIDATE ONLY
 FILE HEADER:
 Compatibility Vsn = 186647552=0xb200400
 Db ID=970369526=0x39d6a9f6, Db Name='TESTDB'
 Activation ID=2650290266=0x9df8385a
 Control Seq=5124=0x1404, File size=102400=0x19000
 File Number=1, Blksiz=512, File Type=2 LOG
 descrip:"Thread 0001, Seq# 0000000001, SCN 0x00000016f528-0xffffffffffff"
 thread: 1 nab: 0xffffffff seq: 0x00000001 hws: 0x3 eot: 1 dis: 0
 resetlogs count: 0x33acc28a scn: 0x0000.0016f528 (1504552)
 prev resetlogs count: 0x3377bd37 scn: 0x0000.000e2006 (925702)
 Low scn: 0x0000.0016f528 (1504552) 12/22/2014 06:13:30
 Next scn: 0xffff.ffffffff 01/01/1988 00:00:00
 Enabled scn: 0x0000.0016f528 (1504552) 12/22/2014 06:13:30
 Thread closed scn: 0x0000.0016f528 (1504552) 12/22/2014 06:13:30
 Disk cksum: 0xcec6 Calc cksum: 0xcec6
 Terminal recovery stop scn: 0x0000.00000000
 Terminal recovery 01/01/1988 00:00:00
 Most recent redo scn: 0x0000.00000000
 Largest LWN: 0 blocks
 End-of-redo stream : No
 Unprotected mode
 Miscellaneous flags: 0x800000
 Thread internal enable indicator: thr: 0, seq: 0 scn: 0x0000.00000000
 Zero blocks: 0
 Format ID is 2
 redo log key is 1679de3ad36cdd2684143daaa1635b8
 redo log key flag is 5
 Enabled redo threads: 1
END OF REDO DUMP
----- Redo read statistics for thread 1 -----

If you look at the dump, it says the sequence is 1 and the DB name is “TESTDB”. Our instance name was orc11g and the last sequence# was well above 1 (this can also be confirmed from v$log). This indicated that our redo logs had been overwritten by some other process, and the DB name in the redo log header hinted that it was done by the reporting clone refresh process. The problem was that the redo log volume was cross-mounted on a reporting clone, which overwrote the redo logs during its refresh.
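For reference, the sequence comparison mentioned above can be done with a simple query (standard v$log columns):

SQL> select group#, thread#, sequence#, status from v$log;

A current sequence# well above what the file header dump shows means the on-disk logs are not the ones this database wrote last.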

As a fix, a new volume was provisioned for the clone redo logs and the volume export was revoked. Since the current redo log had also been overwritten, we had to restore the last hot backup and perform incomplete recovery up to the last ETL start time (this was a data warehouse DB). In an OLTP database this would have meant data loss.

This issue is easily reproducible: if you clone an instance using the same paths for the redo logs and open the clone, you will hit this error in your source database. Datafiles are protected by DBWR, so we are protected from the risk of another database opening our datafiles; that attempt fails with ORA-01157:

ORA-01157: cannot identify/lock data file 3 - see DBWR trace file
ORA-01110: data file 3: '/oracle/oradata/orcl11g/undotbs01.dbf'

portmap: unrecognized service on RHEL6

A quick note for people using NFS for shared storage on a RAC database. Up to RHEL5, we had to ensure the nfs, nfslock and portmap services were running.
These services are required; otherwise you will get the following errors while mounting the database:

ORA-00210: cannot open the specified control file
ORA-00202: control file: '/u01/oradata/orcl/control01.ctl'
ORA-27086: unable to lock file - already in use

Normally these could be auto-enabled on boot using the chkconfig command. While working on a similar issue today, I found out that the portmap service is not present in RHEL6:

# service portmap status
portmap: unrecognized service

The portmap service was used to map RPC program numbers to IP address/port combinations in earlier versions of Red Hat Enterprise Linux.
As per the RHEL6 docs, portmap has been replaced by rpcbind in Red Hat Enterprise Linux 6 to enable IPv6 support.
So the following command will work:

# service rpcbind status
rpcbind (pid  1587) is running...

You can read about NFS and the associated processes in the RHEL6 docs.

cursor: pin S wait on X

A few weeks back, a thread came up on the oracle-l freelist regarding the “cursor: pin S wait on X” event. Tanel Poder replied with an excellent explanation; you can read the full thread here.

This wait happens when your session wants to examine / execute an existing
child cursor – but it’s being pinned in X mode by someone else. The usual
reasons for this are:
1) The cursor got flushed out for some reason (or it’s just the first time
anyone is running it) and the other session is currently hard
parsing/compiling/loading the contents of this cursor back into library
cache. The longer the parsing/loading takes, the longer you wait. If the
parsing session hits some (performance) issues, this pin may be held for a
long time.

or

2) The child cursor is already loaded into cache, but someone else is
currently flushing it out (shared pool shortage, fragmentation & loading of
a big object, SGA target / ASMM reducing the SGA size).

So, as explained above, this can be seen as a side effect of setting SGA_TARGET in the environment. I have also observed this behavior in a RAC system where a large shared pool (>10G) was under pressure and the LCK0 process was trying to release memory, causing a lot of processes to wait on this event. At that point the database is completely hung. “cursor: pin S wait on X” might be reported for a lot of sessions, but we should check whether shared pool free memory is low and try to fix that first.
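When this event shows up, the first thing to identify is the session holding the mutex in exclusive mode. A hedged sketch (on 11g, BLOCKING_SESSION is normally populated for this wait; on older versions the holder’s SID is encoded in the high-order bytes of P2, so on a 64-bit platform trunc(p2/4294967296) recovers it):

select sid, sql_id, event, blocking_session,
       trunc(p2/4294967296) holder_sid_from_p2
from   v$session
where  event = 'cursor: pin S wait on X';

Once you have the holder, check what it is doing (hard parsing a big cursor, stuck on some other wait, etc.) before considering killing anything.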

11.2 AQ – Messages stuck in PROCESSED, WAIT State

I had been working today on an AQ issue where messages were not moving from WAIT to READY state. This was on an 11.2.0.3 database running on RHEL5u6 (64-bit). I thought I would quickly document it for easy reference.
When you enqueue a record, you can specify a delay after which the record becomes ready for dequeue.
During this time MSG_STATE is ‘WAIT’, and it changes to ‘READY’ once the current time passes the delay timestamp. All the stuck records had a DELAY_TIMESTAMP earlier than the current time. This can be verified by querying AQ$<queue_table>.

In the example below, you can see that MSG_ID ‘D2DA56B9746E2E59E043096F93452D8A’ has a DELAY_TIMESTAMP of ’10-JAN-13 05.42.07.835125 AM’, which is earlier than the current time (09:52 AM), yet the message is still in WAIT state:

 09:52:00 SQL > select msg_id,msg_state,enq_timestamp,delay_timestamp from AQ$TEST_QTABLE where rownum <5;

 MSG_ID				 MSG_STATE  ENQ_TIMESTAMP		DELAY_TIMESTAMP
-------------------------------- ---------- ------------------------------ ------------------------------
D2DA56B94EF02E59E043096F93452D8A WAIT	    09-JAN-13 07.37.07.595334 AM   23-FEB-13 07.37.07.595334 AM
D2DA56B9746E2E59E043096F93452D8A WAIT	    09-JAN-13 07.42.07.835125 AM   10-JAN-13 05.42.07.835125 AM
D2DA56B94ED32E59E043096F93452D8A WAIT	    09-JAN-13 07.37.05.274447 AM   23-FEB-13 07.37.05.274447 AM
D2F1EDF5112D59A1E043096F93451A42 READY	    10-JAN-13 07.35.33.089265 AM   10-JAN-13 07.35.34.089265 AM

Along with this, there were a lot of messages in PROCESSED state. My first thought was that retention for this queue was set to a higher value, but on checking I found it was set to 0. You can read my previous article on retention and PROCESSED messages in a queue.

V$AQ revealed that there were two more queues with pending PROCESSED messages. As per the documentation, the READY column should only count messages in READY state, but here it included the PROCESSED messages too (not sure if this is also a bug):

select * from v$AQ where ready >1000
/
      QID    WAITING	   READY    EXPIRED AVERAGE_MSG_AGE TOTAL_WAIT AVERAGE_WAIT
--------- ---------- ---------- ---------- --------------- ---------- ------------
    79884	 1757	 6153814	  0	      92459 5.6914E+11	 92459.9842
    79892	    0	  183916	  0	   16643589 3.0610E+12	 16643589.1
    79894	    0	    1122	  0	    5420368 6081652896	    5420368

I verified that the aq_tm_processes parameter was not set (Oracle recommends leaving it unset in 10g and above so that Oracle can auto-tune it) and that the queue monitor (qmnc) and its slave processes were running:

$ps -ef|grep ora_q
oracle   25235     1  0 00:53 ?        00:00:00 ora_qmnc_alpha
oracle   25334     1  1 00:53 ?        00:04:26 ora_q001_alpha
oracle   26270     1  0 00:56 ?        00:00:00 ora_q002_alpha
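For reference, the “not set” check itself can be done like this (ISDEFAULT = 'TRUE' means the parameter was not explicitly set):

SQL> show parameter aq_tm_processes

select value, isdefault from v$parameter where name = 'aq_tm_processes';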

Checking further, I looked through the alert log for ORA- errors and found a few entries for ORA-01405:

ORA-01405: fetched column value is NULL

On checking the trace directory, the same ORA-01405 error was logged in the queue process tracefile:

kwqdlScanDeqLog during stmt fetch: retval -1, errnum 1405, errbuf ORA-01405: fetched column value is NULL

I checked My Oracle Support and located note 1162862.1 – ORA-01405: Fetched Column Value Is Null For An Advanced Queue.

In Oracle 11.2 a new dequeue log table, named AQ$_<queue_table>_L, was introduced to store the message identifiers of committed dequeue operations on the queue. Due to bug 12899768, a few message IDs with the value ‘00000000000000000000000000000000’ end up in this table, which causes PROCESSED messages to get stuck.

Note 12899768.8 indicates that this is a bug in 11.2.0.2 and 11.2.0.3.

Symptoms include:

– ORA-01405 errors from the qmon process
– A lot of PROCESSED messages in the queue
– Messages with message IDs like ‘00000000000000000000000000000000’
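To confirm you are hitting this, a quick check against the dequeue log table (a hedged sketch; the table name follows the AQ$_<queue_table>_L pattern for the TEST_QTABLE queue table used above):

select count(*) from AQ$_TEST_QTABLE_L
where msgid = '00000000000000000000000000000000';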

We need to either apply the patch for this bug or set event ‘10852 trace name context forever, level 16384’.
On searching for this event, I came across Note 1365655.1 – Dequeue by msgid of a message in WAIT state does not work properly in 11.2.

This note matched our main problem of messages not changing state from WAIT to READY. It also suggested setting event 10852 followed by a database restart. To correct the issue we followed the steps below:

a) Stopped the apps to prevent enqueue/dequeue activity. You could also stop the queue using dbms_aqadm.stop_queue(), but that would generate errors in the app.
b) Deleted the messages with msgid ‘00000000000000000000000000000000’ from the dequeue log table:

delete from AQ$_TEST_QTABLE_L where msgid = '00000000000000000000000000000000';
commit;

c) Set the event in the spfile and restarted the database:

conn / as sysdba
alter system set event='10852 trace name context forever, level 16384' scope=spfile;

After the restart, messages with a DELAY_TIMESTAMP older than the current time changed state to READY, and the counts in V$AQ started dropping, indicating that the qmon process was purging the PROCESSED messages.
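A simple way to verify the cleanup (same queue table as in the examples above):

select msg_state, count(*)
from   AQ$TEST_QTABLE
group by msg_state;

The PROCESSED count should trend towards zero, and the previously stuck WAIT messages should now show as READY until they are dequeued.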