Odd Client Connection 2059 Error

Discussion:

Tim Zielke

2014-05-12 19:54:09 UTC

Hello,

I am curious whether anyone else has run across something similar to what we experienced last week with a client application connecting to a remote queue manager, or has any insight into the issue.

Configuration:
We have a client application at 7.0.1 on a Solaris 10 non-global zone (server A) connecting into a remote queue manager at 7.1.0.2 on a Solaris 10 non-global zone (server B).

Issue:
Between 5/4 and 5/11, the client application on A would time out with a 2059 (MQRC_Q_MGR_NOT_AVAILABLE) while trying to connect to the queue manager on B. We captured client application and queue manager traces of the connection issue, and IBM Level 2 support reviewed them. They found from the client trace that the client app was correctly resolving the IP address of the host system, sending an initial TSH, going into receive mode, and then timing out with no response back from the queue manager. On the queue manager side, there was no evidence of any activity for a connection request for this client channel. This is consistent with what I saw when running DIS CHSTATUS on the queue manager during a connection attempt: nothing was reported.

A TCP stack investigation of servers A and B during a client connection attempt showed an established connection for A -> B:1414 in the TCP stacks on both A and B. What was odd was that the TCP stack on B also showed two more sockets for A -> B:1414 in a FIN_WAIT_2 state. A socket goes into FIN_WAIT_2 when the local application actively closes it and the peer has acknowledged the FIN but not yet sent its own, so the close was presumably initiated by the queue manager. During the client connection attempt, neither the TCP receive queue nor the send queue of the established A -> B:1414 connection showed any bytes pending on either TCP stack. So there was no evidence of ACK packets being blocked, for example.
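
The socket-state check described above can be sketched as a small script that tallies TCP states for one listener port from captured netstat output. This is only an illustration: the sample lines are made up, the column layout is assumed to resemble Solaris `netstat -an` (address and port joined by a dot, state in the last column), and `tally_states` is a hypothetical helper name, not an MQ or Solaris tool.

```python
from collections import Counter

# Made-up sample in an assumed Solaris-like `netstat -an` layout:
# local address, remote address, Swind, Send-Q, Rwind, Recv-Q, state.
SAMPLE = """\
10.0.0.2.1414   10.0.0.1.40112   49640   0   49640   0   ESTABLISHED
10.0.0.2.1414   10.0.0.1.40098   49640   0   49640   0   FIN_WAIT_2
10.0.0.2.1414   10.0.0.1.40101   49640   0   49640   0   FIN_WAIT_2
10.0.0.2.22     10.0.0.9.51000   49640   0   49640   0   ESTABLISHED
"""


def tally_states(netstat_text, local_port):
    """Count TCP states for sockets whose local address ends in .<local_port>."""
    counts = Counter()
    suffix = "." + str(local_port)
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank lines and short header fragments
        local_addr, state = fields[0], fields[-1]
        if local_addr.endswith(suffix):
            counts[state] += 1
    return counts


print(dict(tally_states(SAMPLE, 1414)))
```

Run against real output saved from server B, a tally like this makes it easy to spot lingering FIN_WAIT_2 entries next to the one expected ESTABLISHED connection.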

We were also able to successfully telnet from A -> B:1414.
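
A scripted equivalent of the telnet test might look like the sketch below. `can_connect` is a hypothetical helper, and the host name you pass is up to you. Note that a successful connect only shows that the TCP handshake completed at the listening port. It does not show that the listener application ever read the data sent afterwards, which matches what was seen here: telnet worked while the MQ connection still timed out.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `can_connect("serverB", 1414)` would probe the listener port from server A, where "serverB" is a placeholder for the real host name.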

We also recycled the queue manager on server B once this issue started, and this did not correct the issue.

When I checked on this client app this morning, everything was magically working again. Neither server A nor server B was rebooted. The odd FIN_WAIT_2 sockets are now gone from the TCP stack on server B, and I see only the one expected established connection for A -> B:1414 there.

We suspect some kind of underlying network issue, but I doubt we will be able to uncover what it was. Has anyone else seen this type of behavior before, or does anyone have any insight into what could have been happening here?

Thanks,
Tim

To unsubscribe, write to LISTSERV-0lvw86wZMd9k/bWDasg6f+***@public.gmane.org and,
in the message body (not the subject), write: SIGNOFF MQSERIES
Instructions for managing your mailing list subscription are provided in
the Listserv General Users Guide available at http://www.lsoft.com
Archive: http://listserv.meduniwien.ac.at/archives/mqser-l.html

Schanz, Arthur

2014-05-13 00:21:28 UTC

Tim -
Are you using SSL on this connection? If so, I have seen SSL 3-way handshakes exhibit similar behavior. It's not normal, but I have seen it occur.
Another possibility is a firewall 'getting in the way'. Are these long-running connections that the firewall might be aging out, but both the appl and Qmgr still think are active? Heartbeating would prevent that scenario.
Just a few things to check/consider.
Cheers,
Art

Art Schanz
Distributed Computing Specialist
National IT Services - Federal Reserve
(804) 697-3889
***@frit.frb.org
Sent from my BlackBerry
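
For the firewall idle-timeout scenario raised above, the general idea of keeping a quiet connection alive can be illustrated at the TCP level. This is not MQ channel heartbeating (the HBINT channel attribute) itself. It is a sketch of the TCP-level analogue, SO_KEEPALIVE, using a hypothetical helper named `make_keepalive_socket`. Probe timing is governed by operating system settings and is not set here.

```python
import socket


def make_keepalive_socket():
    """Create a TCP socket with keepalive probes enabled (OS default timing)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the kernel to send periodic probes on an otherwise idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock


sock = make_keepalive_socket()
# A nonzero value means the option is enabled on this socket.
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
sock.close()
```

Whether such probes get through an intermediate firewall before it ages out the session depends on the probe interval relative to the firewall's idle timeout.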

Doug Clark

2014-05-13 01:20:24 UTC

Do you have multiple network cards on your servers? I have found that the
IP address supplied for a return connection can vary unless it is explicitly defined.

Tim Zielke

2014-05-13 20:27:59 UTC

Hi Doug and Art,

Thanks for the information.

This client app is using an SSL connection, but the same behavior happened with non-SSL connections. Each client connection attempt would time out after about 5 minutes, so these were not long-running connections.

I am not sure about multiple network cards, but I do see multiple network interfaces on these servers. Before the issue resolved itself, I was getting ready to run some snoop commands on these servers to see whether the TCP packets for the TSH request were reaching server B. If this issue comes up again, I will keep in mind that I may need to snoop on all of the network interfaces.

Thanks,
Tim
