Performance

Resolving Shutdown Immediate Hang Situations

Many times Oracle DBA’s are faced with a situation where shutting down the instance with shutdown immediate seems to take infinite time and gives a impression that the session is hung. In most of the cases this is not a hang. Hang can be defined as a scenario when few sessions/processes are waiting on some action and in turn blocking other session/processes to complete their tasks. Now the original sessions could be waiting because of resource issues or Oracle Bugs.

Shutdown immediate can take long time to complete (appear to be hung) because of three reasons:

1. Uncommitted transactions are being rolled back.

2. SMON is cleaning temp segments or performing delayed block cleanouts.

3. Processes still continue to be connected to the database and do not terminate.

1. Uncommitted transactions are being rolled back:

This is the case when the message ‘Waiting for smon to disable tx recovery’ is posted in the alert log after we issue shutdown immediate.

There are two reasons for this:
– A large query was running at the time of shutdown immediate.
-A large transaction was running at the time of shutdown immediate.

For large queries:

SQL > select count(*) from v$session_longops where time_remaining>0;

If it returns a value > 0 then we can do a shutdown abort and then startup restrict and then again shutdown immediate.

For large transactions:

SQL > select sum(used_ublk) from v$transaction;

If it returns a large value then we have to wait for a long time for shutdowm to get completed.
If the large transaction is aborted and then shutdown is issued then we have to query v$fast_start_transactions and v$fast_start_server, we will not see anything in v$transaction at this time.

At this particular moment transaction recovery is going on and the count(*) will keep on decreasing:

SQL > select count(*) from v$fast_start_transaction;

Decreasing count will show that recovery is going on and when the recovery is completed the database will be shutdown.

But it is not desirable under some circumstances such as, when we have very short maintance window and we need to perform a shutdown immediate to do some work, in those cases we can use the following event and set in the init.ora file TEMPERORARLY To disable transaction recovery:

event=”10513 trace name context forever, level 2″

and bounce the instance and issue shutdown immediate to get complete without transaction recovery.SMON will not do a transaction recovery untill this event is set in the init.ora file so it is necessary to remove this event whenever you get a chance to shutdown the database again, this time shutdown immediate can even take 3-5 hours(Just remove this event from pfile).

2. SMON is cleaning temp segments or performing delayed block cleanouts:

During a SHUTDOWN IMMEDIATE and SHUTDOWN NORMAL, SMON cleans up extents which are no longer needed and marking them as freed. It means that count from uet$ will decrease and count in fet$ will increase.

To verify that the temporary segments are decreasing have an active session available in SQL during the SHUTDOWN IMMEDIATE. Run query to ensure the database is not hanging, but is actually perform extent cleanup:

SQL> select count(block#) from fet$;
COUNT(BLOCK)
----------
115

SQL> select count(block#) from uet$;
COUNT(BLOCK)
----------
713

After some time, issue the query again and check the results:

SQL> select count(block#) from fet$;
COUNT(BLOCK)
----------
210

SQL > select count(block#) from uet$;
COUNT(BLOCK)
----------
512

If you do not have sufficient time to wait for this cleanup then you can set the following event and bounce the database and reissue shutdown immediate to skip this cleanup:

event=”10061 trace name context forever, level 10″

It allows you to prevent SMON from cleaning up temporary segments. Again it is not recommended to set this event event forever. Whenever you have large downtime remove this event and allow SMON to do its work.

3. Processes still continue to be connected to the database and do not terminate:

After issuing shutdown immediate, If we see entries in alert log file as:

Tue Jan  8 12:00:27 2008
Active call for process 10071 user 'oracle' program '[email protected] (J001)'
SHUTDOWN: waiting for active calls to complete.
Tue Jan  8 12:00:57 2008

SHUTDOWN: Active sessions prevent database close operation

It shows that there are some active calls at program ‘[email protected] (J001)’ which pmon is not able to clear up.This message is due to the fact that database is waiting for pmon to clean up processes, but pmon is unable to clean them. The client connections to the server are causing the shutdown immediate or normal to hang. Do the following in this case:

1. Before shutdown immediate, shutdown the listener:

$ lsnrctl stop

2. Now check if there are any connection present at the database as:

$ ps -eaf | grep LOCAL

It will give you the OSPIDs of the client connected to database.

3 Manually kill them as:

# Kill -9 <OSPID>

4. Issue shutdown immediate now.

Do not forget to bring up the listener after startup 🙂

In addition to this you can set 10046 event in the session used to shutdown the instance. This will help to tell the event on which session is waiting

SQL>alter session set events '10046 trace name context forever, level 12'

SQL>Shutdown immediate;

Look for the trace file in user_dump_dest location. Also look at the alert.log for any other messages. They might be helpful in case the shutdown is experiencing hang situation.

Tracing Sessions with Event 10046

10046 Event is used to trace SQL statements (similar to sql_trace=true) with additional details depending on the tracing level which is specified.

10046 EVENT levels:

1 – Enable standard SQL_TRACE functionality (Default)
4 – As Level 1 PLUS trace bind values
8 – As Level 1 PLUS trace waits
12 – As Level 1 PLUS both trace bind values and waits

There are various ways to enable the tracing

1) Using Alter session command

ALTER SESSION SET EVENTS ‘10046 TRACE NAME CONTEXT FOREVER, LEVEL 12’;

To Turn it off
ALTER SESSION SET EVENTS ‘10046 TRACE NAME CONTEXT OFF’;

2) Using DBMS_SESSION

EXECUTE DBMS_SESSION.SESSION_TRACE_ENABLE(waits => TRUE, binds => FALSE);


The TRUE argument specifies that wait information will be present in the trace. The FALSE argument specifies that bind information will not be present in the trace.
The SESSION_TRACE_DISABLE procedure disables the trace for the invoking session.

For example:
EXECUTE DBMS_SESSION.SESSION_TRACE_DISABLE();

3) Using DBMS_SYSTEM.SET_SQL_TRACE_IN_SESSION

EXEC SYS.DBMS_SYSTEM.SET_SQL_TRACE_IN_SESSION(SID,SERIAL#,TRUE);

Where SID and SERIAL# points to the session being traced and can be retrieved from V$SESSION

To turn it off
EXEC SYS.DBMS_SYSTEM.SET_SQL_TRACE_IN_SESSION(SID,SERIAL#,FALSE);

4) Using DBMS_MONITOR

Starting from Oracle 10g, there is a new package which can be used to enable tracing.

EXEC DBMS_MONITOR.SESSION_TRACE_ENABLE(SESSION_ID=>sid,SERIAL_NUM=>serial#,WAITS=>TRUE,BINDS=>TRUE)

To turn it off

EXECUTE DBMS_MONITOR.SESSION_TRACE_DISABLE(session_id => sid, serial_num => serial#);

Please refer to following documentation link for more details

http://download.oracle.com/docs/cd/B19306_01/server.102/b14211/sqltrace.htm#i8759

5) Using ORADEBUG

connect / as sysdba
Sql>oradebug setospid 1234
Sql>oradebug unlimit
Sql>oradebug event 10046 trace name context forever,level 12

Here 1234 is the OS process Id for the session to be traced.

Sql>oradebug tracefile_name

Above command will show the location of the trace file

To disable oradebug tracing

Sql> oradebug event 10046 trace name context off

Whichever method you use, Oracle will create the trace files in the server’s user_dump_dest directory.

Sign on Letter to Larry Ellison on AWR and ASH Licensing

While surfing across the blog’s, I found this particular link, a compaign which is aimed at getting free access to the ASH and AWR reports. I believe most of us know that we need to pay extra fees for the Diagnostic Tuning Pack which also includes ASH and AWR report. (If you’re not already familiar with these restrictions, you can read about them in the Oracle 10g Licensing Information Manual)

We all have seen how much better these reports are , when it comes to diagnosing general performance problem or some transient problems.

I believe this is a good initiative by Mark Brinsmead and I have already signed onto the letter. You can also join in by signing here.

HW enqueue contention with LOB

Definition
Enqueues are local locks that serialize access to various resources. Enqueue wait event indicates a wait for a lock that is held by another session (or sessions) in an incompatible mode to the requested mode.

HW Enqueue – High Watermark enqueue is acquired when a segment’s high water mark is moved, which will normally happen during a heavy insert operation on the segment.

Detailed explanation
The HW enq is used to move the segment hwm and also to reclaim the space in lob segments.HW enqueue is required for extending segments in locally managed as well as dictionary managed tablespace. The period for which it is held will be shorter for locally managed tablespaces but we do hold it to serialize extensions.

When lobs are deleted or updated, space freed up is not reclaimed immediately. Deleted lob chunks are maintained in the lob index segment for purpose of CR – the committed free space is reclaimed from the lob index when the free space in the lob-base segment is exhausted. Delayed reclaimation is indirectly controlled by PCTVERSION or RETENTION. Movement of free space from lob index to lob-base segment is done under hw enqueue.

When inserts happen, space for new rows are satisfied in the following orderly ways:

a. Allocate free blocks from the segment directly (these blocks are not mapped by the Lobindex).

b. Reclaim freed blocks (from previous updates/ deletes, the ones mapped by the lobindex) that are not needed to satisfy PCTVERSION or RETENTION. In extreme cases, reclaimed blocks even if it means not being able to guarantee RETENTION. This requires HW-enqueue on the lobindex segment.

c. Extend the segment.

To Identify HW – contention
Finding Blockers:

If a session is stuck waiting on an “enqueue” wait use View V$LOCK to find the blocker/s:

SELECT DECODE(request,0,’Holder: ‘,’Waiter: ‘)||sid sess, id1, id2, lmode, request, type
FROM V$LOCK
WHERE (id1, id2, type) IN (SELECT id1, id2, type FROM V$LOCK WHERE request>0)
ORDER BY id1, request;

To view the waiters:

SELECT * FROM v$lock WHERE request > 0;

Description:

* V$LOCK.TYPE is the lock type from P1
* V$LOCK.ID1 is the lock ID1 from P2
* V$LOCK.ID2 is the lock ID2 from P3

In v$lock Id1 is the tablespace number in which the segment (that is having its high water mark changed) resides and ID2 is the tablespace relative DBA of the segment header.

To identify the segment on which HW enqueue wait is being observed, run the following:

select DBMS_UTILITY.DATA_BLOCK_ADDRESS_FILE(ID2) FILE#,
DBMS_UTILITY.DATA_BLOCK_ADDRESS_BLOCK(ID2) BLOCK#
from v$lock
where type = ‘HW’;

select owner, segment_type, segment_name
from dba_extents
where file_id = <>
and <> between block_id and block_id + blocks – 1;

Reducing/ Resolving Contention:

1. Recreate the objects with the freelist groups and preallocate extents using ALTER TABLE…ALLOCATE EXTENT. This will help improve performance

2. Review the freelist and freelist group setting. If it is too high – we may be moving more blocks to freelists and hold segment hw for a longer period of time since it takes time to format all the blocks.

If preallocation does not work, increase the number of free lists. The speed of high water mark movement depends on the number of free lists.

3. If this is during lob space reclamation, review the pctversion setting. Depending on the usage of the blob you can set pctversion to 0 or small values to avoid space reclamation. The default is 10%.

4. Most issues with HW contention on LOBs and space reclaimation have been resolved by applying patch# 6376915 and setting the event 44951.

This fix causes ASSM LOB space to batch up reclaim instead of just reclaiming the requested/required number of LOB chunks. To enable this fix, set event 44951 to the maximum number of chunks that you would like to have reclaimed per attempt. The maximum allowed is 1024. Anything larger becomes 1024. However, if the requested amount of space is larger than the event’s value, the reclaim will be for the requested amount of space.

5. Other considerations:
a. Partitioning the LOB segment.
b. Increasing the chunk size.
c. Rebuild the table or truncate the table.

One such issue which we faced in one of our environment.

AWR Report

Snap Id Snap Time Sessions Cursors/Session
Begin Snap: 129618 27-Jun-12 01:00:10 240 2.1
End Snap: 129631 27-Jun-12 03:10:18 276 2.1
Elapsed: 130.13 (mins)
DB Time: 1,683.99 (mins)

Cache Sizes

Begin End
Buffer Cache: 7,136M 7,136M Std Block Size: 8K
Shared Pool Size: 1,536M 1,536M Log Buffer: 19,456K

Load Profile

Per Second Per Transaction Per Exec Per Call
DB Time(s): 12.9 0.2 0.01 0.00
DB CPU(s): 0.6 0.0 0.00 0.00
Redo size: 1,512,791.1 24,438.7
Logical reads: 42,623.7 688.6
Block changes: 6,243.5 100.9
Physical reads: 2,725.6 44.0
Physical writes: 364.1 5.9
User calls: 4,557.4 73.6
Parses: 1,646.8 26.6
Hard parses: 0.0 0.0
W/A MB processed: 0.6 0.0
Logons: 0.1 0.0
Executes: 1,647.5 26.6
Rollbacks: 0.8 0.0
Transactions: 61.9

Top 5 Timed Foreground Events

Event Waits Time(s) Avg wait (ms) % DB time Wait Class
enq: HW – contention 119,796 57,558 480 56.97 Configuration
direct path read 1,293,467 16,330 13 16.16 User I/O
log file sync 554,365 9,730 18 9.63 Commit
db file sequential read 274,737 5,536 20 5.48 User I/O
DB CPU 4,789 4.74

We can see in above report that this is 2 hour report and 57% of DB time is spent on waits for enq: HW -Contention. Other wait events are also related to I/O and this also indicates slow I/O device (Avg wait for db file sequential read is 20 ms, normally you would see 6 ms). We  move to section for “SQL Ordered by Elapsed Time” to see if we can identify sql

SQL ordered by Elapsed Time

Elapsed Time (s) Executions Elapsed Time per Exec (s) %Total %CPU %IO SQL Id SQL Module SQL Text
63,955.96 958,651 0.07 63.30 0.50 8.76 16w9z88mxsyz3 insert into SCOTT.JBPM4_…
15,336.34 0 15.18 4.81 95.54 7paqx6x20zdy3 JDBC Thin Client SELECT /*+ ORDERED PARALLEL (…

As per AWR report ,sql_id 16w9z88mxsyz3 is taking 64% of total elapsed time for all the sql and this is insert statement with very high execution rate. Digging further into structure of table, we identified that it has 2 clob columns and its residing in tablespace of 4M extent size. We can also look at section “Segments by Physical Writes” ,”Segments by Direct Physical Writes” to identify the segment and notice that they are Lob segments associated with this table. Solution in this case was to hash partition this table (64 partitions) to distribute the inserts into different segments and reduce the number of concurrent requests to allocate extents.

Diagnostics For Database Hang

Many times Oracle DBA’s are in a situation when the database is hung and does not seem to be responding. In some scenarios, state is such that you cannot even connect to the sqlplus session. Majority people restart the database (Sometimes I wonder if this is due to the fact that most of us started working on Computer’s on Microsoft Windows 🙂 ) and then log a ticket with Oracle support . They inturn happily inform us that “They do not have any diagnostic information to diagnose and resolve the issue and we need to wait for next occurrence to collect some diagnostic information)

Based on my experiences , I am writing this article to assist my fellow Oracle DBA’s to diagnose the problem and collect the required information. So Let’s Start.

1) First of all we need to ensure that this is really a database hung situation and not a slow database condition. This can be done by asking some questions to users.

a) Is a particular user complaining of database hang or its the condition for all the users. If one or few user are reporting then are these users executing a batch job?

b)Are you able to make new connections to database?

c)Also check if any initialisation parameter has been changed recently?

d)Check if any resource manager plan is in effect.

One more way to establish if database is hung is t0 try to query v$session_wait view to find events being waited on

select sid,event,seq#,p1,p2,p3 from V$session_wait where wait_time=0 and event not like ‘%message%’;

This will give the events for all the waiting session. In case you see something like ‘log file switch (archiving required)’ then this problem is caused by archiving issue. See if there is free space in archiving destination.

Suppose this gives events like row cache enqueue or latches, then we need to gather Hanganalyze and Systemstate for the support.

Else it could be that you are experiencing a slow database. In this case use AWR or statspack to diagnose the issue. Look out for top timed events. In case you see Library Latch or shared pool latch consuming lot of time, then look at the Hard parses per sec section in Load profile.

2)Look at database alert log and see if any messages are present. In case you are facing Latching or Enqueue issues, then you might see errors like below

PMON failed to acquire latch, see PMON dump

Errors in file /u01/BDUMP/test10_pmon_12864.trc:

In this case you are required to upload tracefile (reported in alert log) to Oracle support.

Note : -Make sure that max_dump_file_size is set to unlimited so that the tracefile contains the complete data.

Coming back to Hanganalyze and Systemstate. Find details for them below

A)Hanganalyze

HANGANALYZE is used to determine if a session is waiting for a resource, and reports the relationships between blockers and waiters.

Use following syntax and take hanganalyze from two sessions at interval of 1 min

SQL>sqlplus “/ as sysdba”

SQL> oradebug setmypid

SQL> oradebug unlimit;

SQL> oradebug hanganalyze 3

SQL>oradebug tracefile_name

Last command will report the tracefile name which has to be uploaded to Oracle support.

Alternatively you can use

SQL>ALTER SESSION SET EVENTS ‘IMMEDIATE TRACE NAME HANGANALYZE LEVEL 3’;

In case you wish to understand how to interpret hanganalyze file, then use Metalink Note:215858.1: Interpreting HANGANALYZE trace files to diagnose hanging and performance problems

B)Systemstate

Systemstate is used to dump the process information which is useful for Oracle support to diagnose why the sessions are waiting.

For 9.2.0.6 and above gather systemstate as below

SQL> oradebug setmypid

SQL> oradebug unlimit;

SQL> oradebug dump systemstate 266

SQL>oradebug tracefile_name

Last command will report the tracefile name which has to be uploaded to Oracle support. Perform this 2-3 times at interval of 1 min.

Again you can use

ALTER SESSION SET EVENTS ‘IMMEDIATE TRACE NAME SYSTEMSTATE LEVEL 266’;

For Oracle 9.2.0.5 and less use level 10 instead of 266

ALTER SESSION SET EVENTS ‘IMMEDIATE TRACE NAME SYSTEMSTATE LEVEL 266’;

Level 266 includes short stacks (Oracle function calls) which are useful for Oracle Developers to determine which Oracle function’s are causing the problem. This is also helpful in matching existing bugs.

In case you are unable to connect to database then capture systemstate using below note

Note 121779.1-Taking a SYSTEMSTATE dump when you cannot CONNECT to Oracle.

Apart from this, following information can also be captured

a)Database alert log

b)AWR report /statspack report for 30-60 min during database hang

c)Output of OS tools to ensure that everything is fine at OS level.

E.g

$vmstat 2 20

This will capture 20 snapshots at 2 seconds interval. Look for CPU contention or swapping issues

In Addition to above , you can use utility called LTOM which has got some predefined rules based on which it determines that database is in Hung situation and takes systemstate and hanganalyze dump automatically .

Please refer to following Metalink note for more details

Note:352363.1: LTOM – The On-Board Monitor User Guide

In case you are able to narrow down to a blocking session manually, then you can very well take errorstack for the blocking process as below

connect / as sysdba

oradebug setospid 1234

oradebug unlimit

oradebug dump errorstack 3

wait 1 min

oradebug dump errorstack 3

wait 1 min

oradebug dump errorstack 3

oradebug tracefile_name

* In case the ospid for blocking session is 1234

Last command will report the tracefile name which has to be uploaded to Oracle support.

In case you are able to capture the above information, you stand 99% chance of getting solution . I have kept 1 % for the cases when Oracle Support will ask for setting up some events and waiting for Next Hang Occurence for getting more information.