Cause of AS down: Connection failure for host 127.0.0.1 port

Posted by qasimpatwekar on 06-Jul-2017 03:00

Hi All,

OE 10.2B SP07 (TF17)

I am trying to find the cause of sudden stop of AS. Checked the server log and see below error:

[17/07/05@12:03:46.119+0200] P-031874 T-3421734688 1 AS -- Connection failure for host 127.0.0.1 port 46243 transport TCP. (9407)
[17/07/05@12:03:46.119+0200] P-007436 T-1879332640 1 AS -- Connection failure for host 127.0.0.1 port 49546 transport TCP. (9407)
[17/07/05@12:03:46.119+0200] P-007125 T-3607299872 1 AS -- Connection failure for host 127.0.0.1 port 35242 transport TCP. (9407)
[17/07/05@12:03:46.119+0200] P-009587 T-3692168992 1 AS -- Connection failure for host 127.0.0.1 port 44151 transport TCP. (9407)
[17/07/05@12:03:46.121+0200] P-007439 T-1451353888 1 AS -- Connection failure for host 127.0.0.1 port 38759 transport TCP. (9407)
[17/07/05@12:03:46.123+0200] P-009524 T-1727805216 1 AS -- Connection failure for host 127.0.0.1 port 46113 transport TCP. (9407)
[17/07/05@12:03:46.123+0200] P-007442 T-2127800096 1 AS -- Connection failure for host 127.0.0.1 port 36389 transport TCP. (9407)
[17/07/05@12:03:46.124+0200] P-009528 T-4251031328 1 AS -- Connection failure for host 127.0.0.1 port 34138 transport TCP. (9407)
[17/07/05@12:03:46.124+0200] P-007436 T-1879332640 2 AS AS Application Server Shutdown. (5476)
[17/07/05@12:03:46.137+0200] P-007442 T-2127800096 2 AS AS Application Server Shutdown. (5476)
[17/07/05@12:03:46.156+0200] P-009528 T-4251031328 2 AS AS Application Server Shutdown. (5476)
[17/07/05@12:03:46.159+0200] P-007125 T-3607299872 2 AS AS Application Server Shutdown. (5476)
[17/07/05@12:03:46.160+0200] P-007439 T-1451353888 2 AS AS Application Server Shutdown. (5476)
[17/07/05@12:03:46.164+0200] P-031874 T-3421734688 2 AS AS Application Server Shutdown. (5476)
[17/07/05@12:03:46.171+0200] P-009524 T-1727805216 2 AS AS Application Server Shutdown. (5476)
[17/07/05@12:03:46.177+0200] P-009587 T-3692168992 2 AS AS Application Server Shutdown. (5476)

Analysis:

I did search on Prokb and other related OE groups and came to know that it can cause due to multiple occasion like agent is not able to reach broker, network issue etc.

I can reproduce the issue by killing broker process. We see same messages which shows like above. But in this case we see additional message in Adminserver log like:

Alert! Alert Name: AS_BrokerAbnormalShutdown, Resource Name: qaddemo.qadui_ASlive, Severity: 2, Message: AppServer broker shutdown abnormally. Broker: qadui_ASlive. (11085)
Alert! Alert Name: NS_BrokerTimeout, Resource Name: qaddemo.NS1, Severity: 2, Message: The registered Broker is not responding. The Broker will not be registered. Broker: 4de0d10bc037932d:-64b4c2c9:14bdaa947ac:-7ffd AS.qadui_ASlive qaddemo.qad.com/192.168.0.51 51300. (11010)

In my customers case this I dont see above message so we can say that broker is not killed.

No clue in system logs.

Question:

1) Is any other thought on why AS went down?

2) From where the port number comes in the error message? example port 46113? Its not in the range of min max port range. also IP shown is server loop back IP, what we can conclude from the error message?

"Connection failure for host 127.0.0.1 port 46113 transport TCP. (9407)" port is mentioned as 46113

[View:/cfs-file/__key/communityserver-discussions-components-files/18/7506.qadui_5F00_ASprod.broker.log.gz:320:240][View:/cfs-file/__key/communityserver-discussions-components-files/18/4265.qadui_5F00_ASprod.server.log.gz:320:240][View:/cfs-file/__key/communityserver-discussions-components-files/18/2210.admserv.log.gz:320:240][View:/cfs-file/__key/communityserver-discussions-components-files/18/5342.server_2D00_log.zip:320:240]

Posted by Roy Ellis on 11-Jul-2017 14:29

Your AppServer Broker is definitely dying or crashing.  

The port numbers you are asking about are created by the broker at random when it starts agents (this is sometimes called an ephemeral port).  Each agent has its own port number to communicate with the broker.  The error message in the server log says the agent has lost its connection (via the ephemeral port).  In this case the agent shuts its self down, since there is no controlling broker. Forget about the port numbers they are not important to the problem.

Your job is to find out why the broker is crashing or dying.

1) The first thing to do is increase logging for the broker.  

2) Next check all the log files in the working directly, there may be more information in another log (like the admserv.log) or there may be other files that are created when the broker crashes.

3) Also you can use JAVA startup commands to monitor memory usage and dump a thread stack if the broker process crashes (search kbase for "java dump")

4) next when your AppServer crashes, have you checked to see if the process still exists?  Maybe it is just unresponsive?  If that is the case, getting a stack dump of the process is your next step.

5) lastly monitor the broker java process from the operating system.  does the memory grow continually?  does the CPU go runaway?

Whatever you choose above the idea is to find out why the broker is no longer available to the agents.

Let me know what you find.

Regards,

Roy Ellis

All Replies

Posted by Srinivas Munigala on 06-Jul-2017 04:00

Hi,

Following Knowledge Base article gives more information on error 9407. Hope, this answers all your questions:

knowledgebase.progress.com/.../P124127

I am copying the same content here incase if you are unable to access above site:

--------------------------------------------------------------------------------------------------------------

What does the  "Connection failure for host <host_name> port <port> transport <transport_name>. (9407)"  mean ?

The error "Connection failure for host <host_name> port <port> transport <transport_name>. (9407)" is very generic, and It signifies that for some reason, the other endpoint of the connection can no longer be reached.

   This can apply to any socket connection made by the AppServer agent, including, but not limited to:

   The connection to the owning broker process

   A connection from the agent to another AppServer broker

   A connection from the agent to a WebService

   For state-aware and state-reset operating modes, the connection to the client the agent is currently bound to.

It can be caused by the other endpoint terminating without closing the connection cleanly, or because the network connection was interrupted.

Without knowing what the other endpoint is, and what that was doing at the time of the issue, it's impossible to identify the root cause of this error.

How to identify the endpoint ?

   What's the hostname/address reported in the error message?

   If host is "localhost" or 127.0.0.1, per IP protocol standards means the connection was within the same machine local machine. Note: Modern OS'es will shortcut traffic to this address so that it never reaches the actual network, so for this name/address the network itself does not need to be investigated.

   Otherwise, trace this back to specific machines or network segments.

   What's the port number reported in the error message?

This must be checked on the machine found in Step 1 above.

   If there is an OpenEdge installation, cross-reference with ubroker.properties & conmgr.properties files to see if this is in the defined port -minport -maxport range of a specific database/appserver/other component. If yes, check logs for those additional components for additional symptoms.

   Otherwise, use netstat -b or equivalent OS tools if end-point of connection can still be identified.

Note that these steps are not full-proof.

   Ports may be outside of any -minport -maxport range, as there are cases where the AVM will just claim any available port from the OS if it needs one. (So-called ephemeral ports, these may or may not be within a set range depending on OS):

       For a list of popular ephemeral port ranges for various Operating Systems see:  support.sas.com/.../946.html.

       For information on how to possibly change the ephemeral port ranges see:  www.ncftp.com/.../ephemeral_ports.html.  

   Netstat may fail to detect anything if the failing connection has been cleaned up by the OS network layers already.

   if the issue occurs frequently enough, repeated snapshots while the environment is still running can help identify this. But a certain level of uncertainty will remain.

What can be done to get more data out of the AppServer Agent reporting error 9407?

Add the -debugalert to the AppServer Agents' startup parameters. This will make the Agent include the ABL trace for any error message that gets written to the log to assist in identifying if the error occurs while the Agent is running ABL code, and if yes, where in the code.  That can help identify the nature of the other endpoint:

   If there is no ABL trace, the Agent is not running any ABL code at the time of the error, so it is likely receiving a request or sending the response. At this point, the endpoint is likely the AppServer Broker (for stateless/state-free operating modes) or the AppServer client (for state-aware/state-reset operating modes).

   If there is an ABL trace, the line referenced by the trace likely has a statement that would cause TCP activity - writing data to a socket, or something similar. Analysing the ABL code in detail can help determine the type of connection established, and with that the nature of the other endpoint.

Thanks,

Srinivas Munigala

Posted by qasimpatwekar on 06-Jul-2017 05:13

Hi Srinivas,

Thanks for your reply.

Yes, I already gone through this kbase along with other kbases related to this error. But I want to understand where exactly this port number comes from?

You can also test this scenario like:

start appserver

query and notedown port number

kill broker by kill PID

check AS server and you will find different different port number which is not in the list of min and max port list.

I want to know from where this port number comes from?

Any idea?

Thanks,

Qasim

Posted by qasimpatwekar on 11-Jul-2017 00:28

Hello All,

Can someone share valuable inputs on this?

Thanks in advance.

Posted by gus bjorklund on 11-Jul-2017 13:58

port number incorrect?

Posted by Roy Ellis on 11-Jul-2017 14:29

Your AppServer Broker is definitely dying or crashing.  

The port numbers you are asking about are created by the broker at random when it starts agents (this is sometimes called an ephemeral port).  Each agent has its own port number to communicate with the broker.  The error message in the server log says the agent has lost its connection (via the ephemeral port).  In this case the agent shuts its self down, since there is no controlling broker. Forget about the port numbers they are not important to the problem.

Your job is to find out why the broker is crashing or dying.

1) The first thing to do is increase logging for the broker.  

2) Next check all the log files in the working directly, there may be more information in another log (like the admserv.log) or there may be other files that are created when the broker crashes.

3) Also you can use JAVA startup commands to monitor memory usage and dump a thread stack if the broker process crashes (search kbase for "java dump")

4) next when your AppServer crashes, have you checked to see if the process still exists?  Maybe it is just unresponsive?  If that is the case, getting a stack dump of the process is your next step.

5) lastly monitor the broker java process from the operating system.  does the memory grow continually?  does the CPU go runaway?

Whatever you choose above the idea is to find out why the broker is no longer available to the agents.

Let me know what you find.

Regards,

Roy Ellis

Posted by qasimpatwekar on 14-Jul-2017 21:22

Hi Roy,

Thanks for the troubleshooting steps.

I have communicate with customer. As this problem is less frequent (twice in last 2-3 years), it is not advisable at this moment to ask for increase in broker logging. Regarding all other steps, we will have to wait.

Thanks again Srinivas, Gus, Roy.

Thanks,

Qasim.

This thread is closed