Recently we faced a performance issue in which a sqlplus connection from application server A (AIX 5.3) to database server DB (a different physical server, 10gR2 on AIX 5.3) was taking nearly 1-2 minutes. We did some quick checks to narrow down the problem area, i.e. is it the network, the database server, or the application server?
1) vmstat output showed normal run queues, memory utilisation and idle CPU on both Application Server A and Database Server DB. This rules out a resource issue.
2) Connections from another application server, B, were taking the normal time. This rules out a database server issue.
3) If an incorrect password was entered in the connection details, the second attempt got a normal, quick response. This indicates some network issue.
4) The system admin checked for network issues with the traceroute command and reported normal results.
Even though the SA confirmed that there was no network issue, we were sure that this had something to do with DNS/network. As we had an identical environment from which the connection was working fine, we traced the session using truss:
<strong>truss -aeo /tmp/sqlplus_timing.txt sqlplus [email protected]</strong>
Running tail on the trace file for Application Server A showed the following:
<strong>open("/etc/netsvc.conf", O_RDONLY) Err#13 EACCES
open("/etc/irs.conf", O_RDONLY) Err#2 ENOENT
getdomainname(0x09001000A00A0E98, 1024) = 0</strong>
__libc_sbrk(0x0000000000010020) = 0x000000001021CBE0
getdomainname(0x09001000A00A0E98, 1024) = 0
open("/etc/hesiod.conf", O_RDONLY) Err#2 ENOENT
getdomainname(0x09001000A00A0E98, 1024) = 0
getdomainname(0x09001000A00A0E98, 1024) = 0
getdomainname(0x09001000A00A0E98, 1024) = 0
socket(2, 2, 0) = 9
getsockopt(9, 65535, 4104, 0x0FFFFFFFFFFF5B74, 0x0FFFFFFFFFFF5B70) = 0
connext(9, 0x09001000A0015B50, 16) = 0
send(9, 0x0FFFFFFFFFFF6A80, 32, 0) = 32
_poll(0x0FFFFFFFFFFF5C20, 1, 5000) = 1
nrecvfrom(9, 0x0FFFFFFFFFFF7DB0, 1024, 0, 0x0FFFFFFFFFFF63E0, 0x0FFFFFFFFFFF5C08) = 78
send(9, 0x0FFFFFFFFFFF6A80, 22, 0) = 22
_poll(0x0FFFFFFFFFFF5C20, 1, 5000) = 1
nrecvfrom(9, 0x0FFFFFFFFFFF7DB0, 1024, 0, 0x0FFFFFFFFFFF63E0, 0x0FFFFFFFFFFF5C08) = 22
close(9) = 0
socket(2, 2, 0) = 9
sendto(9, 0x0FFFFFFFFFFF6A80, 22, 0, 0x09001000A0015B60, 16) = 22
_poll(0x0FFFFFFFFFFF5C20, 1, 5000) (sleeping...)
_poll(0x0FFFFFFFFFFF5C20, 1, 5000) = 0
close(9) = 0
socket(2, 2, 0) = 9
sendto(9, 0x0FFFFFFFFFFF6A80, 22, 0, 0x09001000A0015B60, 16) = 22
_poll(0x0FFFFFFFFFFF5C20, 1, 5000) (sleeping...)
_poll(0x0FFFFFFFFFFF5C20, 1, 5000) = 0
close(9) = 0
socket(2, 2, 0) = 9
sendto(9, 0x0FFFFFFFFFFF6A80, 22, 0, 0x09001000A0015B60, 16) = 22
_poll(0x0FFFFFFFFFFF5C20, 1, 10000) <strong>(sleeping...)</strong>
_poll(0x0FFFFFFFFFFF5C20, 1, 10000) = 0
close(9) = 0
We can see a lot of sleeping messages for the _poll system call. We did not find these waits in the truss output of Application Server B. Going up in the truss output file, we see EACCES (permission denied) on /etc/netsvc.conf and ENOENT (file/directory not present) on /etc/irs.conf. On Application Server B, there were no EACCES errors on /etc/netsvc.conf, although ENOENT errors were reported for /etc/irs.conf there too. Looks like we need to check the /etc/netsvc.conf file permissions.
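Reading a long truss file by eye is tedious, so the same two signals (failed open calls and _poll calls that timed out) can be pulled out with a small script. The following is only a sketch in Python, not something we ran at the time; the function name summarize_truss is made up for illustration, and the patterns assume the output format shown in the trace above.

```python
import re

# Failed open() calls as truss prints them, for example:
#   open("/etc/netsvc.conf", O_RDONLY) Err#13 EACCES
FAILED_OPEN = re.compile(
    r'open\("([^"]+)",\s*[A-Z_|]+\)\s+Err#(\d+)\s+([A-Z0-9]+)'
)

# _poll() calls that returned 0, i.e. the timeout expired with no reply.
# Group 1 is the timeout in milliseconds (third argument).
POLL_TIMED_OUT = re.compile(
    r'_poll\(0x[0-9A-Fa-f]+,\s*\d+,\s*(\d+)\)\s*=\s*0\b'
)


def summarize_truss(text):
    """Return (failed_opens, timed_out_ms) for a chunk of truss output.

    failed_opens is a list of (path, errno, errno_name) tuples.
    timed_out_ms is the sum of the timeouts of all _poll calls that returned 0.
    """
    failed_opens = [
        (path, int(errno), name)
        for path, errno, name in FAILED_OPEN.findall(text)
    ]
    timed_out_ms = sum(int(ms) for ms in POLL_TIMED_OUT.findall(text))
    return failed_opens, timed_out_ms
```

Fed the contents of /tmp/sqlplus_timing.txt, this would list the EACCES on /etc/netsvc.conf next to the ENOENT entries. For the tail shown above, the timed-out polls add up to 5000 + 5000 + 10000 = 20000 ms.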
Comparing file permissions, we noticed that the permissions on /etc/netsvc.conf had changed on Server A, and only the root user had read permission on the file. Granting read permission on /etc/netsvc.conf to others resolved the issue immediately.
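The same comparison could be scripted so it can be run on both servers. This is a minimal sketch in Python, assuming a Python interpreter is available on the host (that was not part of our setup). The helper names other_can_read and file_is_world_readable are made up for illustration.

```python
import os
import stat


def other_can_read(mode):
    """True if the read bit for 'others' is set in a st_mode value."""
    return bool(mode & stat.S_IROTH)


def file_is_world_readable(path):
    """True if users other than the owner and group can read the file."""
    return other_can_read(os.stat(path).st_mode)
```

Calling file_is_world_readable("/etc/netsvc.conf") would be expected to return False on a host in the state Server A was in, where only root could read the file.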
As per the AIX docs, the /etc/netsvc.conf file is used to specify the ordering of name resolution for the sendmail command, gethostbyname subroutine, gethostbyaddr subroutine, and gethostent subroutine, and alias resolution for the sendmail command. The default order, and the order given in the /etc/irs.conf file, can be overridden by creating the /etc/netsvc.conf configuration file and specifying the desired ordering.
e.g. if the resolver cannot find the name in the /etc/hosts file and you want the resolver to use NIS, enter:
hosts = local , nis
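To see which order a given host is actually configured with, the hosts line can be read back out of the file. The sketch below is in Python and handles only the simple comma-separated form shown above. It treats # as a comment marker (an assumption on my part) and does not interpret any further options an entry may carry. The function name parse_hosts_order is made up for illustration.

```python
def parse_hosts_order(conf_text):
    """Return the resolver order from a 'hosts = ...' line, or None if absent."""
    for raw_line in conf_text.splitlines():
        # Assumption: anything after '#' is a comment.
        line = raw_line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip().lower() == "hosts":
            # Split on commas and drop the padding around each service name.
            return [name.strip() for name in value.split(",") if name.strip()]
    return None
```

For the example line above, this returns the list ["local", "nis"]. Note that the script itself needs read permission on /etc/netsvc.conf, which is exactly what was missing on Server A.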
I had usually seen the /etc/resolv.conf file used to configure name resolution. The AIX docs describe the following process for name resolution:
To resolve a name in a domain network, the resolver routine first queries the domain name server database, which might be local if the host is a domain name server or on a foreign host. Name servers translate domain names into Internet addresses. The group of names for which a name server is responsible is its zone of authority. If the resolver routine is using a remote name server, the routine uses the domain name protocol (DOMAIN) to query for the mapping. To resolve a name in a flat network, the resolver routine checks for an entry in the local /etc/hosts file. When NIS or NIS+ is used, the /etc/hosts file on the master server is checked.
By default, resolver routines attempt to resolve names using the above resources. BIND/DNS is tried first. If the /etc/resolv.conf file does not exist or if BIND/DNS could not find the name, NIS is queried if it is running. NIS is authoritative over the local /etc/hosts, so the search ends here if it is running. If NIS is not running, then the local /etc/hosts file is searched. If none of these services can find the name, then the resolver routines return with HOST_NOT_FOUND. If all of the services are unavailable, then the resolver routines return with SERVICE_UNAVAILABLE.
The default order described above can be overwritten by creating the /etc/irs.conf configuration file and specifying the desired order. Also, both the default and /etc/irs.conf orderings can be overwritten with the environment variable, NSORDER. If either the /etc/irs.conf file or NSORDER environment variable are defined, then at least one value must be specified along with the option.
As mentioned, /etc/irs.conf is used to modify the default order, and /etc/netsvc.conf can be used to override the order given in /etc/irs.conf. I couldn't find any reference explaining the time limit used by the resolver, but it seems that after trying for a limited time, it falls back on the default order of name resolution.
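Since I have no documented figure for that time limit, the practical option is to measure how long a lookup takes on the affected host. This is a minimal sketch in Python, again assuming an interpreter is available there. The function name time_lookup is made up, and dbhost.example.com is a placeholder, not our real host name.

```python
import socket
import time


def time_lookup(hostname, resolver=socket.gethostbyname):
    """Return (result, elapsed_seconds) for a single name lookup.

    result is whatever the resolver returned, or the exception it raised.
    The resolver argument exists so a different lookup function can be
    swapped in, for example in tests.
    """
    start = time.monotonic()
    try:
        result = resolver(hostname)
    except OSError as exc:  # socket.gaierror and socket.herror are OSError subclasses
        result = exc
    return result, time.monotonic() - start


# Example call (placeholder host name):
#   address, seconds = time_lookup("dbhost.example.com")
```

Running this as the application user on both Server A and Server B would show whether the delay is in the lookup itself, separately from anything sqlplus or the listener does.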