LGWR terminating instance due to error 338

Recently we came across a issue where our DB crashed with ORA-00338 error .

Errors in file /oracle/diag/rdbms/orcl11g/orc11g/trace/orc11g_lgwr_24118.trc:
ORA-00338: log 2 of thread 1 is more recent than control file
ORA-00312: online log 2 thread 1: '/oracle/oradata/orcl11g/redo02.log'
LGWR (ospid: 24118): terminating the instance due to error 338

DB couldn’t be restarted as it gave same errors while opening. For multiplexed redo log files, it reports error for both log files.

Error Description: (Reference :http://psoug.org/oraerror/ORA-00338.htm)
Log string of thread string is more recent than control file

Error Cause:

The control file change sequence number in the log file is greater than the number in the control file. This implies that the wrong control file is being used. Note that repeatedly causing this error can make it stop happening without correcting the real problem. Every attempt to open the database will advance the control file change sequence number until it is great enough.

Action:
Use the current control file or do backup control file recovery to make the control file current. Be sure to follow all restrictions on doing a backup control file recovery.

Above explanation suggests that there is problem with controlfile. Normally these errors are seen when doing incomplete recovery. To troubleshoot it ,we took dumps of redo log files (On recommendation of Oracle support)

Note: Below logs are from test system which was used to reproduce the issue

SQL> alter system dump logfile '/oracle/oradata/orcl11g/redo01.log' validate;
System altered.
SQL> alter system dump logfile '/oracle/oradata/orcl11g/redo02.log' validate;
ERROR at line 1:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/oracle/oradata/orcl11g/redo02.log'
SQL> alter system dump logfile '/oracle/oradata/orcl11g/redo03.log' validate;
*
ERROR at line 1:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/oracle/oradata/orcl11g/redo03.log'

redo02.log and redo03.log dump failed with errors that it does not contain redo. Since redo01.log dump was successful, we looked at trace.

DUMP OF REDO FROM FILE '/oracle/oradata/orcl11g/redo01.log'
 Opcodes *.*
 RBAs: 0x000000.00000000.0000 thru 0xffffffff.ffffffff.ffff
 SCNs: scn: 0x0000.00000000 thru scn: 0xffff.ffffffff
 Times: creation thru eternity
 VALIDATE ONLY
 FILE HEADER:
 Compatibility Vsn = 186647552=0xb200400
 Db ID=970369526=0x39d6a9f6, Db Name='TESTDB'
 Activation ID=2650290266=0x9df8385a
 Control Seq=5124=0x1404, File size=102400=0x19000
 File Number=1, Blksiz=512, File Type=2 LOG
 descrip:"Thread 0001, Seq# 0000000001, SCN 0x00000016f528-0xffffffffffff"
 thread: 1 nab: 0xffffffff seq: 0x00000001 hws: 0x3 eot: 1 dis: 0
 resetlogs count: 0x33acc28a scn: 0x0000.0016f528 (1504552)
 prev resetlogs count: 0x3377bd37 scn: 0x0000.000e2006 (925702)
 Low scn: 0x0000.0016f528 (1504552) 12/22/2014 06:13:30
 Next scn: 0xffff.ffffffff 01/01/1988 00:00:00
 Enabled scn: 0x0000.0016f528 (1504552) 12/22/2014 06:13:30
 Thread closed scn: 0x0000.0016f528 (1504552) 12/22/2014 06:13:30
 Disk cksum: 0xcec6 Calc cksum: 0xcec6
 Terminal recovery stop scn: 0x0000.00000000
 Terminal recovery 01/01/1988 00:00:00
 Most recent redo scn: 0x0000.00000000
 Largest LWN: 0 blocks
 End-of-redo stream : No
 Unprotected mode
 Miscellaneous flags: 0x800000
 Thread internal enable indicator: thr: 0, seq: 0 scn: 0x0000.00000000
 Zero blocks: 0
 Format ID is 2
 redo log key is 1679de3ad36cdd2684143daaa1635b8
 redo log key flag is 5
 Enabled redo threads: 1
END OF REDO DUMP
----- Redo read statistics for thread 1 -----

If you look at dumpfile it says that sequence is set to 1 and DB name=”TESTDB’ .Our instance name is orc11g and last sequence# was more than 1 (can also be confirmed from v$log). This indicated that our redo logs were overwritten by some other process. DB name in redo log hinted that it was done by reporting clone refresh process. Problem was that redo log volume was cross mounted on a reporting clone which overwrote the redo logs during the refresh process .

As a fix, new volume was provisioned for the clone redo logs and volume export was revoked. Since current redo log also got overwritten, we had to restore last hot backup and perform incomplete recovery till the last ETL start time (This was a datawarehouse db). In OLTP this would have caused data loss.

This issue is easily reproducible. If you try to clone a instance by using same name for redo log and open the database, you will receive the error in your source database. DB files are protected by DBWR so we are protected from risk of other Database opening datafiles and get ORA-01157

ORA-01157: cannot identify/lock data file 3 - see DBWR trace file
ORA-01110: data file 3: '/oracle/oradata/orcl11g/undotbs01.dbf'

Tags: corruption, oracle, redo

3 thoughts on “LGWR terminating instance due to error 338”

alok

25 December, 2014 at 2:57 pm

Nice blog.

Pavan Kumar Kumar N

3 January, 2015 at 10:51 pm

Hi Amit,

It’s an good post buddy.. !!
One question was popping out of my brain, why the ORL’s are placed on nfs mounts, I hope it would be better to replace to local file system if it’s non-ASM. Since, the performance and furture issues will get minimized.

- Amit
  
  7 January, 2015 at 10:50 am
  
  Pavan,
  
  This is 60Tb warehouse so we are using netapp filers for database storage which also provides us facility of snapshot backups.
  
  Regards
  Amit

LGWR terminating instance due to error 338

Related

3 thoughts on “LGWR terminating instance due to error 338”

alok

Pavan Kumar Kumar N

Amit

Leave a ReplyCancel reply