Unix

Cluster SSH tool Utility

Many times you come across scenario’s when you wish to open multiple ssh windows and execute same commands.e.g You wish to see alert log for multi-node RAC simultaneously or edit sysctl.conf files for multiple machines.

Cluster SSH utility helps solve this problem as it opens multiple SSH sessions and allows simultaneous control.

In case you are using MAC OS X, you can download from Google Code site . Utility is called csshX.

For Linux you can download from Sourceforge site and utility is called Cluster SSH.

I am using csshX to show demo. Suppose I have 3 hosts host1,host2,host3 I need to use following commands on terminal app

$csshX host1 host2 host3

or

$csshX host[1-3]

As you can see we can specify range using square brackets.  csshX will create an SSH session to each remote host in separate Terminal.app windows. A master window will also be created. All keyboard input in the master will be sent to all the slave windows. Below screenshot displays how windows will look like

I can enter commands in Master (Red Color) window and it will execute same commands in all 3 windows.If you want to execute commands in particular window then go directly  to that window.  In case we need to open 3 sessions for host1 then we execute

$csshx host1+3

Effect Of Multiple SHMMAX Settings

Last week I saw a warning message at database startup time saying:

WARNING: EINVAL creating segment of size 0x000000000f0020xx
fix shm parameters in /etc/system or equivalent 

It is an Oracle 10204 database running on Solaris.

Searching MOS for exact meaning for this warning, it states that a new shared memory segment is getting created to accommodate SGA.

As the message indicated, I opened /etc/system file to verify the settings of SHMMAX parameter and found the SHMMAX value to be 4GB. I stopped at this point and closed the /etc/system file. Then the next thing to check is the number of oracle instances running on the server and the size of largest SGA.

There were two instances running on the server and the largest SGA was set to 1.8G and the other SGA size was 700M.

This setting shows that there is no need to create additional shared memory segment. Then I checked the /etc/system file again, but this time I used the following command :

<span style="font-size: small;"><span style="font-family: arial,helvetica,sans-serif;">$ cat /etc/system | grep shmmax</span></span>
<span style="font-size: small;"><span style="font-family: arial,helvetica,sans-serif;">set shmsys:shminfo_shmmax=4000000000  ==&gt;4GB</span></span>
<span style="font-size: small;"><span style="font-family: arial,helvetica,sans-serif;">set shmsys:shminfo_shmmax=500000000  ==&gt;500M</span></span><span style="font-size: small;"><span style="font-family: arial,helvetica,sans-serif;">

There were two different values set for SHMMAX parameter.

The cause of the above warning message came out to be:
As the files are read from bottom-to-top, server was taking SHMMAX value as 500M and ignoring the 4GB value.

After commenting SHMMAX value of 500M, the warning message disappeared.

How To Discover Disk Name When LUN Number is Known (OEL5)

I am creating 11gR2 RAC setup for one of my client. Following Oracle documentation for storage, I opted for Oracle ASM and asked storage team for new physical devices.
Storage admin thus provided me with a set of LUNs instead of actual physical device names like /dev/sdcxxx.

Now the major task is to get the actual device name associated with LUNs.
In OEL4 this is easy to get by issuing

# iscsi-ls -l
This command will give output like:

<span style="font-size: medium;"><span style="font-family: times new roman,times;">DEVICE DETAILS:
---------------
LUN ID : 0
Vendor: DELL     Model: MD3000i          Rev: 0670
Type:   Direct-Access                    ANSI SCSI revision: 05
page83 type3: 6001e4f0003fa9970000083c48573c94
page80: 383438303045500a
Device: /dev/sdh
LUN ID : 1
Vendor: DELL     Model: MD3000i          Rev: 0670
Type:   Direct-Access                    ANSI SCSI revision: 05
page83 type3: 6001e4f0003fbc5b000008dd486dae67
page80: 383438303045500a
Device: /dev/sdl
LUN ID : 2
Vendor: DELL     Model: MD3000i          Rev: 0670
Type:   Direct-Access                    ANSI SCSI revision: 05
page83 type3: 6001e4f0003fbc5b0000082048574033
page80: 383438303045500a
Device: /dev/sdp</span></span>

But this command is obsolete in OEL5. The replacement for iscsi-ls in OEL5 is iscsiadm.

In OEL5, there are two ways to get the device name associated with LUN:

1. Use iscsiadm -m host -P 4

# iscsiadm -m host -P 4

It will show the output as:

<span style="font-size: medium;"><span style="font-family: times new roman,times;">************************
Attached SCSI devices:
************************
scsi6 Channel 00 Id 0 Lun: 0
Attached scsi disk sdk          State: running
scsi6 Channel 00 Id 0 Lun: 1
Attached scsi disk sdm          State: running
scsi6 Channel 00 Id 0 Lun: 10
Attached scsi disk sdah         State: running
scsi6 Channel 00 Id 0 Lun: 11
Attached scsi disk sdai         State: running
scsi6 Channel 00 Id 0 Lun: 13
Attached scsi disk sdap         State: running
scsi6 Channel 00 Id 0 Lun: 14
Attached scsi disk sdaq         State: running</span></span>

2. Install the OEL5 package: lsscsi-0.17-3.el5 (or similar version).

# lsscsi -l

<span style="font-size: medium;"><span style="font-family: times new roman,times;">It will show the output as:</span></span>

<span style="font-size: medium;"><span style="font-family: times new roman,times;">[6:0:0:22]   disk    DELL     MD3000i          0670  /dev/sdbp
state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=60
[6:0:0:24]   disk    DELL     MD3000i          0670  /dev/sdbq
state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
[6:0:0:31]   disk    DELL     Universal Xport  0670  /dev/sdbx
state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=60
[6:0:0:32]   disk    DELL     MD3000i          0670  /dev/sdby
state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
[6:0:0:33]   disk    DELL     MD3000i          0670  /dev/sdcg
state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=60
[6:0:0:71]   disk    DELL     MD3000i          0670  /dev/sdch
state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=60
[6:0:0:72]   disk    DELL     MD3000i          0670  /dev/sdci
state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
[6:0:0:73]   disk    DELL     MD3000i          0670  /dev/sdck
state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=60</span></span>

<span style="font-size: medium;"><span style="font-family: times new roman,times;">Here we can see the LUN numbers in begning:</span></span>

[6:0:0:73] i.e 73 is the LUN number here.

[6:0:0:72] i.e 72 is the LUN number here.

Finally I created the disks with correct device names. 🙂

Cheers!!!

Saurabh Sood

Sqlplus connection on AIX taking too long

Recently we faced a performance issue, where in sqlplus connection from application server A (AIX 5.3) to Database server DB (different physical server,10gR2 on AIX 5.3) was taking nearly 1-2 minutes.  We did some quick check’s to narrow down the problematic area.i.e Is it network or database server or some issue with application server?

1)Vmstat output reported Normal Run queue’s/memory utilisation/idle cpu on Application server A and Database Server DB . Rules out resource issue.

2) Connections from other application server B was taking normal time. Rules out Database server issue

3) If incorrect password was entered in connection details, second attempt resulted in normal quick response time. Indicates some network issues.

4) System Admin checked for network issues with traceroute command and reported normal results.

Even though SA confirmed that there was no network issue, we were sure that this has to do something with DNS/Network. As we had identical environment from where connection was working fine , so we initiated tracing of session using truss

<span style="font-family: verdana, geneva;"><strong>truss -aeo /tmp/sqlplus_timing.txt sqlplus username/passwd@DB</strong></span>

Issuing a tail command on trace for Application Server A reported following

<strong>open("/etc/netsvc.conf", O_RDONLY)              Err#13 EACCES
open("/etc/irs.conf", O_RDONLY)                 Err#2  ENOENT
getdomainname(0x09001000A00A0E98, 1024)         = 0</strong>
__libc_sbrk(0x0000000000010020)                 = 0x000000001021CBE0
getdomainname(0x09001000A00A0E98, 1024)         = 0
open("/etc/hesiod.conf", O_RDONLY)              Err#2  ENOENT
getdomainname(0x09001000A00A0E98, 1024)         = 0
getdomainname(0x09001000A00A0E98, 1024)         = 0
getdomainname(0x09001000A00A0E98, 1024)         = 0
socket(2, 2, 0)                                 = 9
getsockopt(9, 65535, 4104, 0x0FFFFFFFFFFF5B74, 0x0FFFFFFFFFFF5B70) = 0
connext(9, 0x09001000A0015B50, 16)              = 0
send(9, 0x0FFFFFFFFFFF6A80, 32, 0)              = 32
_poll(0x0FFFFFFFFFFF5C20, 1, 5000)              = 1
nrecvfrom(9, 0x0FFFFFFFFFFF7DB0, 1024, 0, 0x0FFFFFFFFFFF63E0, 0x0FFFFFFFFFFF5C08) = 78
send(9, 0x0FFFFFFFFFFF6A80, 22, 0)              = 22
_poll(0x0FFFFFFFFFFF5C20, 1, 5000)              = 1
nrecvfrom(9, 0x0FFFFFFFFFFF7DB0, 1024, 0, 0x0FFFFFFFFFFF63E0, 0x0FFFFFFFFFFF5C08) = 22
close(9)                                        = 0
socket(2, 2, 0)                                 = 9
sendto(9, 0x0FFFFFFFFFFF6A80, 22, 0, 0x09001000A0015B60, 16) = 22
_poll(0x0FFFFFFFFFFF5C20, 1, 5000) (sleeping...)
_poll(0x0FFFFFFFFFFF5C20, 1, 5000)              = 0
close(9)                                        = 0
socket(2, 2, 0)                                 = 9
sendto(9, 0x0FFFFFFFFFFF6A80, 22, 0, 0x09001000A0015B60, 16) = 22
_poll(0x0FFFFFFFFFFF5C20, 1, 5000) (sleeping...)
_poll(0x0FFFFFFFFFFF5C20, 1, 5000)              = 0
close(9)                                        = 0
socket(2, 2, 0)                                 = 9
sendto(9, 0x0FFFFFFFFFFF6A80, 22, 0, 0x09001000A0015B60, 16) = 22
_poll(0x0FFFFFFFFFFF5C20, 1, 10000) <strong>(sleeping...)</strong>
_poll(0x0FFFFFFFFFFF5C20, 1, 10000)             = 0
close(9)                                        = 0

We can see lot of sleeping messages for _poll system call.  We did not find these wait’s in truss output of Application Server B . Going up in truss output file, we see EACCESS (permission issue) on /etc/netsvc.conf and ENONT (file/directory not present) on /etc/irs.conf file.  On application server B, there were no EACCESS errors on /etc/netsvc.conf but ENONT errors were reported for /etc/irs.conf . Look’s like we need to check etc/netsvc.conf file permissions.

Comparing file permission’s, we noticed that permission on /etc/netsvc.conf had changed on Server A and  root user alone had read permission on file. Granting read permission on /etc/netsvc.conf to other’s resolved the issue immediately.

As per AIX Doc’s, /etc/netsvc.conf file is used to specify the ordering of name resolution for the sendmail command, gethostbyname subroutine, gethostaddr subroutine, and gethostent subroutine and alias resolution for the sendmail command. It is also used to override the default order and the order given in the /etc/irs.conf file by creating the /etc/netsvc.conf configuration file and specifying the desired ordering.

e.g If the resolver cannot find the name in the /etc/hosts file and you want to the resolver to use NIS, enter:

hosts = local , nis

If the resolver cannot find the name in the /etc/hosts file and you want to the resolver to use NIS, enter:
hosts = local , nis

I had usually seen /etc/resolv.conf file to specify name resolution.  AIX Doc’s for name resolution specifies following process for name resolution

To resolve a name in a domain network, the resolver routine first queries the domain name server database, which might be local if the host is a domain name server or on a foreign host. Name servers translate domain names into Internet addresses. The group of names for which a name server is responsible is its zone of authority. If the resolver routine is using a remote name server, the routine uses the domain name protocol (DOMAIN) to query for the mapping. To resolve a name in a flat network, the resolver routine checks for an entry in the local /etc/hosts file. When NIS or NIS+ is used, the /etc/hosts file on the master server is checked.

By default, resolver routines attempt to resolve names using the above resources. BIND/DNS is tried first. If the /etc/resolv.conf file does not exist or if BIND/DNS could not find the name, NIS is queried if it is running. NIS is authoritative over the local /etc/hosts, so the search ends here if it is running. If NIS is not running, then the local /etc/hosts file is searched. If none of these services can find the name, then the resolver routines return with HOST_NOT_FOUND. If all of the services are unavailable, then the resolver routines return with SERVICE_UNAVAILABLE.

The default order described above can be overwritten by creating the /etc/irs.conf configuration file and specifying the desired order. Also, both the default and /etc/irs.conf orderings can be overwritten with the environment variable, NSORDER. If either the /etc/irs.conf file or NSORDER environment variable are defined, then at least one value must be specified along with the option.

As mentioned, /etc/irs.conf is used to modify the default order and /etc/netsvc.conf can be used to override the order given in /etc/irs.conf. I couldn’t find any reference explaining the time limit used by resolver, but seem’s like after checking for some limited time, it fall’s back on default order of name resolution.


Use awk/sed in vi

Thought of sharing some useful info which can help you to do your work faster in vi.You can use awk/sed scripts in vi using following command in vi

:%!scriptname

Here the scriptname file should have execute privilges for user. I used this to create a useful script which I was doing by typing multiple substitution command in vi.

e.g Your file contains list of table
$cat t.lst
BANK005
BJSTM
BJS_ORG
CHAINED_ROWS
CORR_BAM
CORR_CAM
CORR_EIT
CORR_GAC
CORR_GAM
CORR_ITC
CORR_LDT
CORR_LHT
Create script (quotes)  with following command and give execute permission to user.

sed -e “s/^/’/g” -e “s/$/’,/” $1|awk ‘{printf (“%s”,$0)}’|sed -e “s/^/(/g” -e “s/,$/)/g”

open t.lst in vi and type :%!quotes
('BANK005','BJSTM','BJS_ORG','CHAINED_ROWS','CORR_BAM','CORR_CAM','CORR_EIT','CORR_GAC','CORR_GAM','CORR_ITC','CORR_LDT','CORR_LHT')

Similarly if you wish to remove blank lines, have a file blank like

awk ‘!NF==0 {print $0}’ $1
Blank lines can also be directly removed from vi using :g/^$/d
Isn’t it cool.. 🙂

Connections to DataBase Hang Including “/ as sysdba”

Recently I faced one issue where all the connection to database hung and it was also not possible to login to database using “/ as sysdba”.
To get access of sqlplus I used the following syntax:

$ sqlplus -prelim / as sysdba

With “prelim” option we can run some commands which will help in collection useful information about the problem.

This will work only in Oracle 10g and higher version.

After successfully getting connected run the following commands to generate Hanganalyze and systemstate traces:

SQL> oradebug setmypid

SQL> oradebug unlimit

SQL> oradebug dump systemstate 266

SQL> oradebug tracefile_name

— This will give you the name of the tracefile generated.

SQL > oradebug dump hanganalyze 2

SQL > oradebug tracefile_name

To analyze these trace files one should be aware of Metalink Note: 215858.1.

After analyzing these files I found that following event was active and causing the hang:

<span style="font-family: arial,helvetica,sans-serif;"><span style="font-size: small;">"resmgr:cpu quantum"
Cmd: PL/SQL Execute

It means that the sessions are waiting for their turn on CPU.

This event occurs when resource manage is active and controls the allocation of CPU to processes.

We can also see the command which is causing all this: i.e some PL/SQL code was executing and spnning on for CPU.

After finding out this, checked with “TOP” command, got the PID of the process consuming all the cpu and killed that process with “kill -9”

After killing that process the users were able to connect.

So the cause of the Hang was found i.e PL/SQL, but it is still unknown why  PL/SQL caused problems. 🙂

Cheers!!!

Saurabh Sood