Problems with agents in classic appserver that are stuck at

Posted by dbeavon on 27-Dec-2016 07:55

Something in our environment recently changed (exactly what was changed is something that we are still trying to determine - it may have been a hotfix we applied on top of 11.3.3) and now we are experiencing some pretty severe "classic" appserver issues, particularly with the "state-reset" mode of operation.

It is a bit hard to explain what is happening in our "state-reset" appservers.  They will be running smoothly for a week when, all of a sudden, they will all go into a "CONNECTNG" state and won't actually accept any new connection requests.  The broker log shows some IOException entries and some SocketException entries as well.  The problem isn't self-correcting so we have to use -stop (or -kill) on the broker to get things back to normal again.

While investigating this and trying to reproduce the behavior (in development as opposed to production), I found that it is actually quite *easy* for a remote client to begin the process of connecting to a state-reset appserver, then die and leave an agent at the "CONNECTNG" status.  I was able to do it at will.  And the appserver *never* brings those corrupted agents back online until the entire broker is stopped or killed.

I wonder if we've picked up a change in behavior at some point because I don't remember our agents being leaked away as easily as this in years past.  I found a couple of articles (below) that reference a "connectingTimeout" but, based on those articles, I suspect this is no longer used for any purpose.  (Our configuration sets this value to "60" but I don't think it is doing anything to clean up agents at that "CONNECTNG" status.)  Does anybody have any experience with troubleshooting "CONNECTNG" statuses?

Thanks in advance, David

Posted by dbeavon on 04-Apr-2017 14:20

All Replies

Posted by dbeavon on 23-Feb-2017 17:58

It turned out that these errors on the server/broker side coincided with errors in the .Net open client that looked like so:

"Hashtable insert failed. Load factor is too high."

While it was the very last thing that occurred to us to do, we decided to explore the idea that this client-side error was the *root* cause of the problem on the broker side.  Initially we had believed it was just yet another client that was complaining about being unable to connect to a broken broker.  (We didn't suppose that the .Net client with a corrupted Hashtable would be the original reason why the broker broke in the first place.)

So we tried various work-arounds related to the specific client, the first and most successful one being to shorten the amount of time it took for the state-reset clients to connect to the "state-reset" appserver (via the connect procedure).  For whatever reason, (1) shortening the connection time worked around (2) a timing issue in the .Net open client that (3) caused an internal hashtable corruption in that client, which in turn (4) put the broker into a confused state where it not only wouldn't accept connections from the aforementioned corrupted client, but wouldn't accept connections from ANYONE.

To make a long story short, be very wary of any error message from the Progress .Net open client that says "Hashtable insert failed", because it can lead to bigger issues which affect the entire broker.  In fact, it seemed that even the first such error can cause significant trouble.  Clearly it is not a part of the design for the .Net open client to run into an internal Hashtable error, let alone bring down an appserver broker.

In our case, this was happening because we have multiple distinct "state-reset" clients running in the same process on separate threads.  The clients are running with independent connections and appobjects but they still step on each other on rare occasion.  Hopefully the issue with the .Net open client will be fixed one day, but in the meantime you can wrap all your usage of the .Net open client with your own synchronization code to avoid the bug.  The most important parts to synchronize are the creation of the connection and appobject (even if they are ostensibly independent of each other).  The subsequent use of these, once they are created, can probably happen concurrently on multiple threads without additional synchronization but I haven't tested that.

Hope this helps.  I found very little when searching for appserver issues involving the error message "Hashtable insert failed".  So hopefully the next person running into this type of issue will find a bit more to go on.
