oracle

11g: Multiple failed login attempt can block New Application connections

In Oracle database 11g, a new feature/enhancement has potential of bringing down the application due to multiple sessions hitting database with wrong password.

As per Bug 7715339 – Logon failures causes “row cache lock” waits – Allow disable of logon delay [ID 7715339.8]

In 11g there is an intentional delay between allowing failed logon attempts to retry. After 3 successive failures a sleep delay is introduced starting
at 3 seconds and extending to 10 seconds max. During each delay the user X row cache lock is held in exclusive mode preventing any concurrent logon attempt as user X (and preventing any other operation which would need the row cache lock for user X).

Let’s simulate this using a test case.I am using a 11.2.0.3 database.Open two or more database server  connections

Connection 1

create user amit identified by amit;
grant connect to amit;
alter user amit profile MONITORING_PROFILE;

MONITORING_PROFILE has failed_login_attempts set to unlimited.

Create two scripts

$cat 1.sh
sqlplus -s amit/test123 <<EOF
EOF
$cat lock_db.sh
#!/bin/bash
for i in {1..1000}
do
nohup sh 1.sh &
done

I have set loop to 1000 which you can increase depending on your process limit. This is test database with process parameter set to 150. Execute script

sh lock_db.sh 

Now if you go to Connection 2, you will see following issues

a) Database will not allow fresh connections for short time as you will get ORA-00020: maximum number of processes (150) exceeded errors.
After that database connections will be possible
b) You cannot connect to AMIT user for considerable time even with correct password. All sessions will be waiting on library cache lock

 select event,count(*) from V$session group by event order by 2;

EVENT								   COUNT(*)
---------------------------------------------------------------- ----------
Streams AQ: waiting for time management or cleanup tasks		  1
VKTM Logical Idle Wait							  1
Space Manager: slave idle wait						  1
SQL*Net message from client						  1
ges remote message							  1
ASM background timer							  1
GCR sleep								  1
smon timer								  1
pmon timer								  1
Streams AQ: qmn coordinator idle wait					  1
asynch descriptor resize						  1
Streams AQ: qmn slave idle wait 					  1
PING									  1
class slave wait							  2
gcs remote message							  2
DIAG idle wait								  2
wait for unread message on broadcast channel				  2
rdbms ipc message							 18
library cache lock							101

select
 distinct
   ses.ksusenum sid, ses.ksuseser serial#, ses.ksuudlna username,KSUSEMNM module,
   ob.kglnaown obj_owner, ob.kglnaobj obj_name
   ,lk.kgllkcnt lck_cnt, lk.kgllkmod lock_mode, lk.kgllkreq lock_req
   , w.state, w.event, w.wait_Time, w.seconds_in_Wait
 from
  x$kgllk lk,  x$kglob ob,x$ksuse ses
  , v$session_wait w
where lk.kgllkhdl in
(select kgllkhdl from x$kgllk where kgllkreq >0 )
and ob.kglhdadr = lk.kgllkhdl
and lk.kgllkuse = ses.addr
and w.sid = ses.indx
order by seconds_in_wait desc
/

      SID    SERIAL# USERNAME	 MODULE 			     OBJ_OWNER	     OBJ_NAME		LCK_CNT  LOCK_MODE   LOCK_REQ STATE		  EVENT 			  WAIT_TIME SECONDS_IN_WAIT
---------- ---------- ---------- ----------------------------------- --------------- --------------- ---------- ---------- ---------- ------------------- ------------------------------ ---------- ---------------
       153	   77		 testbox01		     84 		      0 	 0	    3 WAITING		  library cache lock			  0		  3
	36	  101		 testbox01		     84 		      0 	 0	    3 WAITING		  library cache lock			  0		  3
	16	  137		testbox01		     84 		      0 	 0	    3 WAITING		  library cache lock			  0		  3
       143	  203		 testbox01		     84 		      0 	 0	    3 WAITING		  library cache lock			  0		  3
	26	  261		 testbox01		     84 		      1 	 3	    0 WAITED KNOWN TIME   library cache lock			119		  3

You will notice that seconds_in_wait value will increase . Also ‘alter system kill session’  takes long time to kill session (it seems as if hung). As per bug ,any operation involving row cache lock for that user will not be allowed. I confirmed by trying to reset profile to Default and session was stuck.

Till all these sessions are cleared from database, you will not be allowed to login. This can severy impact applications as fresh connection would not be possible. Pre-existing sessions will work
If you repeat same experiment in 10g, you will not face similar issue.

To disable this behavior i.e no sleep delay set following event

event=”28401 trace name context forever, level 1″

You can set it dynamically too using

alter system set events ‘28401 trace name context forever, level 1’;

To go back to original behavior

alter system set events ‘28401 trace name context off’;

Bug 7715339 mentions that if FAILED_LOGIN_ATTEMPTS is set to some value, then further attempts to log in will then correctly fail immediately with no delay

I tested this and found that this is not entirely true as my script was able to again spawn 101 sessions which were waiting on library cache lock again. Fresh connection with correct password was again stuck. DBA_USERS view was showing that account is locked. Though here the session cleared in quick time as compared to test case involving failed_login_attempts to unlimited.

V$SQL_SHARED_CURSOR in 11.2.0.2

Oracle 11.2.0.2 has introduced new column “REASON” to V$SQL_SHARED_CURSORS. I came across this extremely useful column while debugging multiple child cursor issue (11.2.0.2 seems to have too many bugs which can cause multiple child cursors).

As per documentation , REASON column has been introduced in 11.2.0.2

REASON - CLOB	Child number, id, and reason the cursor is not shared. The content of this column is structured using XML.

Reason column is helpful as we can easily interpret the reason why child cursors are not being shared. e.g

<ChildNode>
<ChildNumber>151</ChildNumber>
<ID>9</ID>
<reason>PQ Slave mismatch(5)</reason>
<size>2x4</size>
<ctxpq_StmtId_pqctx>0</ctxpq_StmtId_pqctx>
<fxu_kxfxutqidb>2422463</fxu_kxfxutqidb>
</ChildNode>
<ChildNode><ChildNumber>151</ChildNumber>
<ID>3</ID>
<reason>Optimizer mismatch(2)</reason>
<size>4x4</size>
<parallel_query_default_dop>16</parallel_query_default_dop>
<kxfr_Default_DOP>80</kxfr_Default_DOP>
<isParallel>1</isParallel>
<Need_Default_Dop>0</Need_Default_Dop>
</ChildNode>
<ChildNode>
<ChildNumber>151</ChildNumber>
<ID>9</ID><reason>PQ Slave mismatch(5)</reason><size>2x4</size><ctxpq_StmtId_pqctx
>0</ctxpq_StmtId_pqctx><fxu_kxfxutqidb>1433944</fxu_kxfxutqidb></ChildNode>

We can verify the same thing by checking PQ_SLAVE_MISMATCH column

select sql_id,PQ_SLAVE_MISMATCH,OPTIMIZER_MODE_MISMATCH from  V$SQL_SHARED_CURSOR where sql_id='b9uz0akdcx58q' and child_number=151;

SQL_ID	      P O
------------- - -
b9uz0akdcx58q Y N

There is one more column PURGED_CURSOR which is new (I guess it came from 11.2.0.1) which tells if the cursor has been purged using dbms_shared_pool.purge package.

BUG:10082277: Database Connections failing with ORA-4031

A short write-up on a problem faced in few newly upgraded databases to 11.2.0.1 :

I faced this issue in couple of databases which were recently upgraded to 11gR2, the problem is so severe that even connecting using “/ as sysdba”  is not working and erroring out with

<span style="font-family: arial,helvetica,sans-serif; font-size: small;">ORA-04031: unable to allocate 254 bytes of shared memory ("shared pool","unknown object","PCUR^bb2f222c","kkscsAddChildNodeToContext)</span>

The SGA is sized efficiently, SGA_TARGET=8GB and shared_pool_size=900MB. Looking at the trace file generated by this error, there were 4 subpools generated and the all the 4 subpools have enough “free momory” under them.

=======================
Memory Utilization of Subpool 1
=======================

Allocation Name  Size
______________ ____________
“free memory ”    413 721 528

=======================
Memory Utilization of Subpool 2
=======================
Allocation Name  Size
_________          ____________
“free memory ”    369 109 200

=======================
Memory Utilization of Subpool 3
=======================
Allocation Name     Size
______________    ____________
“free memory ”       1 035 124 136
“PCUR ” 4 955 061 232

=======================
Memory Utilization of Subpool 4
=======================
Allocation Name       Size
______________       ____________
“free memory ”          370 023 728

We can see that all the subpools have more than 300MB of memory as free, But in the “Subpool 3” we can see some unusual amount of memory allocated to heap area “PCUR”  4 955 061 232 i.e 5GB Approx.

Now its time to find where “PCUR’ is leaking memory??

Searching in Metalink for this showed a known BUG in 11.2.0.1 and 11.2.0.2 :

<span style="font-family: arial,helvetica,sans-serif; font-size: small;">Bug 10082277  Excessive allocation in PCUR heap of "kkscsAddChildNo" (ORA-4031)</span>

As per this BUG, memory type “”kkscsAddChildNo”” under the “perm” space of “PCUR” subheap is growing over the time and is not getting deallocated leading to this situation.  The solution is to apply this patch.

Oracle will be fixing this in version 12G.

BUG: Creating 11gR2 RAC Using VMware On Toshiba Laptop

I was setting up 11gR2 RAC grid infra on my new Toshiba L745 series laptop on OEL5.4, the installation was going very smoothly but suddenly the VMWare machines crashed while performing remote operations on second node with following message:

<span style="font-family: arial,helvetica,sans-serif;">"vmxaiomgr.retrycontabort" Please disconnect and reconnect if the storage is external to comupter</span>

The next thing was to look at the error file generated by vmware, and it states the following about this error:

<span style="font-family: arial,helvetica,sans-serif;">"Insufficient quota to complete the requested service (1453)" </span>

Though I had enough Diskspace, RAM availability, CPU etc. etc. it was hard to believe that I am running out of allocated resources.
Googling this error message, I saw THIS VMware Communities blog which explains the same error which I faced. It explains about a bug in Toshiba’s WiFi service “cfWiMAXService” i.e “ConfigFree WiMAX” which is causing all this and not allowing to perform the remote copy operation of Grid Infra install.

<span style="font-family: arial,helvetica,sans-serif;">The solution is to disable ConfigFree Service on Toshiba Laptop</span>

After disabling this service, the installation worked fine. Though I am not aware what will be the effect of disabling this serviceon laptop but the laptop is working fine since last two weeks.

Thanks to the above mentioned link which helped me solving this issue.

10.2 CRS startup issue

Today I faced a strange issue with CRS  post host reboot. CRS was not coming up and we could see following message in $ORA_CRS_HOME/log/<hostname>/client/clsc*.log

cat clsc26.log
Oracle Database 10g CRS Release 10.2.0.4.0 Production Copyright 1996, 2008 Oracle.  All rights reserved.
2011-07-01 21:00:14.345: [ COMMCRS][2541577376]clsc_connect: (0x6945e0) no listener at (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_UI_SOCKET))

2011-07-01 21:00:14.345: [ COMMCRS][2541577376]clsc_connect: (0x695020) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=SYSTEM.evm.acceptor.auth))

It looked like like issue with socket files, so I removed /var/tmp/.oracle files (This is RHEL4 box). Tried starting crs with ‘crsctl start crs’ and still no socket files were written. /tmp/crsctl*log files were getting generated but they were empty. I spent close to 1 hour rebooting host and trying various stuff. Then I decided to run the daemons mentioned in /etc/inittab  manually i.e

/etc/init.d/init.evmd run
/etc/init.d/init.cssd fatal
/etc/init.d/init.crsd run

When I ran init.evmd I got following errors

# /etc/init.d/init.evmd run
Startup will be queued to init within 30 seconds.
/home/oracle/.bash_profile: line 6: ulimit: open files: cannot modify limit: Operation not permitted
*** glibc detected *** double free or corruption (fasttop): 0x0000000000688960 ***
-bash: line 1: 17389 Aborted                 /apps/oracle/product/102crs/bin/crsctl check boot >/tmp/crsctl.17085

It pointed to issue with .bash_profile so I renamed it to .old and retried the operation. This time it succeeded and crs also came up fine.

There was entry for ulimit -n 2048 in .bash_profile which was causing it. I am not aware why ulimit is causing issue, will try to find it and post details

ORA-00132 while starting 11gR2 database

Short post on a error which can take lot of time to debug. You have upgraded a database and after restoring all files (init,tnsnames) try to startup 11g database and it fails with following error

SQL> startup nomount
ORA-00119: invalid specification for system parameter REMOTE_LISTENER
ORA-00132: syntax error or unresolved network name 'scan-clu:1521'

11gR2 requires remote_listener to be set to <scan_name:port>. Problem is that the sqlnet.ora file has been modified/replaced (from earlier version backup) which does not have EZCONNECT and has only TNSNAMES. Ensure that your sqlnet.ora file contain’s EZCONNECT

NAMES.DIRECTORY_PATH= (TNSNAMES, EZCONNECT)