Rare .Net OpenClient Error Related to Concurrent Creation of

Posted by dbeavon on 08-Jul-2019 16:00

We are using OpenEdge 11.7.4 on PASOE servers, and on the OE .Net openclients.  The openclients connect to PASOE via the APSV transport.

It is rare but once in a while the entire appdomain which is used to make these openclient connections will become "corrupted", for lack of a better word.  And all state-reset (aka. session-managed) openclient requests will fail until the appdomain is restarted.  This is especially bothersome considering the fact that these app-objects and proc-objects are supposed to be independent of each other (they are new'ed up from scratch whenever we make a round-trip to PASOE.)

The exception that is thrown is very unhelpful (as is the case with many exceptions from the .Net openclient).  It suppresses the full callstack and it obscures the true underlying exception - since it doesn't attach it as an "Inner" exception.  Here is the message that we get for the entire lifetime of the appdomain, until it is restarted.:

Progress.Open4GL.DynamicAPI.SessionPool+NoAvailableSessionsException: SessionPool : NoAvailableSessions[CONNECT FAILURE: Unable to resolve hostname grpoeprod (8821) ]

Of course if we go to the server and attempt to ping the hostname "grpoeprod" there is no problem resolving that name.  Similarly if we launch other client applications, they have no problems.  And finally, if we cycle the appdomain which had become "corrupted", then it won't have problems again after it restarts.

I'm pretty sure this is similar to an issue I had reported in the past : PSC00355194.  That was back in the days when we were connecting our .Net openclient to "classic" appserver on HP-UX.  At the time I was fairly convinced that the problem was related to certain static members in .Net that had no concurrency protection.  Supposedly the related bug was going to be fixed in an early SP on 11.7 (11.7.1 I believe).  But we are up to SP 11.7.4 now, and we are still having these rare problems.

It would probably take a day or two to come up with a repro for this and submit it.  Is there still work being done to fix the .net openclient in 11.7?  I know the 11.7.5 SP was released, and I have to admit that this is a rare issue.  But it is a very disruptive issue, however, and involves recognizing the bug from the error messages, and stopping and restarting the appdomain that is failing.  In some cases it is a mission-critical appdomain used in a Windows Service.  

Has anyone else experienced this bogus error in the .net openclient: "Unable to resolve hostname"

The KB articles don't make reference to the bogus variety of the error; the only KB's I've found are the instances where the error message was legitimate.

All Replies

Posted by Shelley Chase on 05-Aug-2019 16:29

Hi David,

I have not heard of this before.

Yes, please create a repro and log a tech support issue. Engineering can then take a look and see what we can do.

Thanks

-Shelley

Posted by dbeavon on 29-Aug-2019 17:40

The repro is certainly the hard part, when it comes to concurrency issues. There have been other issues like this in the past as well. Fortunately they are quite rare.

The root problem, based on my investigation of decompiled code, seems to be the unprotected static members within the OpenClient classes (static hashtables, dictionaries, collections etc.) In one of our most important windows services, I eventually had to resort to using critical sections to avoid bugs.  The critical sections are placed around code where I create new instances of Connection and AppObject (constructors of Progress.Open4GL.Proxy.Connection and Progress.Open4GL.Proxy.AppObject) because I know that those constructors have never been especially thread-safe (especially in relation to "session-managed" connections to PASOE - which ostensibly should be using totally independent appobjects and connections).

Once the concurrency issues arise we start getting the errors ("NoAvailableSessions / Unable to resolve hostname ").  They will keep happening for any subsequent attempt to use the openclient.  The only way to get things working again is to kill the entire process or app domain.    It is clear that internally the process is developing a corruption of some kind - probably in the underlying static members. 

Another fun example happened in one of our .net openclient applications today.  This is an application that connects to PASOE.  It started complaining about a NameServerInterruptException, which is especially interesting given that the PASOE architecture doesn't rely on nameserver.

Progress.Open4GL.Exceptions.NameServerInterruptException: Connect Failure: Timeout or Interrupt occurred while attempting to communicate with NameServer. (8231)

Unfortunately the ToString() implementations of these openclient exceptions have never provided the full callstack.  I think the Progress developers went out of their way to obscure the callstack, and/or they neglected to attach inner exceptions whenever an exception was rethrown in the API.

To make a long story short, today's exception was yet another example of a corruption in the static members of the API, because after simply restarting the .Net process, things started working again and the exceptions stopped being thrown.  The attempts to use the openclient stopped having any more complaints about a "NameServerInterruptException".

Just an FYI for others who use the .net openclient.

Posted by Matt Baker on 30-Aug-2019 14:07

David,  Thanks for the info.  I have added a story to the backlog to look into this.  If you call tech support they may be able to log a bug for you so we can prioritize this higher.

Posted by gus bjorklund on 30-Aug-2019 19:00

> On Aug 29, 2019, at 1:42 PM, dbeavon wrote:

>

> I think the Progress developers went out of their way to obscure the callstack

Actually, they did not.

Posted by Matt Baker on 30-Aug-2019 19:43

David,

Can you give me an idea about the usage patterns you have in  your application?

Are you recreating app objects on each request?  

Are you creating connection objects on each request?

Are there concurrent requests using the same app object?

Are there concurrent requests using different app objects, but make re-use of the same connection object?

Are you calling _release..() on the app object's or connection objects in each request?

Posted by dbeavon on 30-Aug-2019 19:46

>> Actually, they did not.

I take that as a challenge!  The next time I see an openclient exception without any callstack I'm going to have to fire up Justdecompile from Telerik, and find out exactly what is up with these exceptions!

I can already tell you this much, the ToString() implementation of these exceptions are pretty non-standard.  I suspect they are neither inherited nor improved, as compared to the base Exception class.  Any implementation of ToString() should be reliable for troubleshooting unusual exceptions (ie. the ones which are by-passing our custom catch blocks).  

In the case of an unexpected openclient problem in production, the openclient exceptions' ToString() implementations are really of no use at all.  These openclient exceptions normally omit all the good stuff from their ToString() implementations, eg. (1) inner exceptions, (2) callstacks, and (3) the "RETURN-VALUE" which is the error message that came from the appserver.

I hope this will be changing in future versions of OE 12...

Posted by dbeavon on 30-Aug-2019 20:29

>> Can you give me an idea about the usage patterns you have in  your application?

For "state-reset" code (aka. "session managed" pasoe clients), we create brand new connections and new appobjects.   In theory they are totally independent of each other.

Internally there are static members that are shared... not by my choice... and I had spent a lot of time digging into this back in 2017 (per case 00375301  / bug PSC00355194).

For example, here is the callstack where two appobjects were stepping on each other.  Notice at the top of the stack there is a static Hashtable with no concurrency protection.  I didn't ever create my own repro for this, hoping that Progress would pick up the ball and do some of that work. 

> mscorlib.dll!System.Collections.Hashtable.Insert(object key, object nvalue, bool add) Unknown
Progress.o4glrt.dll!Progress.SupportClass.BufferedStreamManager.BufferedStreamsHashTable.MarkPosition(int index, System.IO.Stream stream) Unknown
Progress.o4glrt.dll!Progress.UBroker.Client.TcpClientMsgInputStream.readMsg() Unknown
Progress.o4glrt.dll!Progress.UBroker.Client.BrokerSystem.readMsg() Unknown
Progress.o4glrt.dll!Progress.UBroker.Client.BrokerSystem.connectPacket(string requestID, string username, string password, string clientInfo, bool fToBroker, int clntAskCaps, out int connAckFlags) Unknown
Progress.o4glrt.dll!Progress.UBroker.Client.BrokerSystem.processConnect(string requestID, object connectionInfo, string username, string password, string clientInfo, bool fToBroker, int clntAskCaps, out int connAckFlags) Unknown
Progress.o4glrt.dll!Progress.UBroker.Client.BrokerSystem.connect(string requestID, string url, string username, string password, string clientInfo, out int connAckFlags) Unknown
Progress.o4glrt.dll!Progress.Open4GL.DynamicAPI.Session.connect(string requestID, string url, string userId, string password, string clientInfo, int proxyGenVersion) Unknown
Progress.o4glrt.dll!Progress.Open4GL.DynamicAPI.SessionPool.BrokerSessionList.reserveSession(bool fCreateNewSession) Unknown
Progress.o4glrt.dll!Progress.Open4GL.DynamicAPI.SessionPool.reserveSession(bool fCreateNewSession, Progress.Open4GL.DynamicAPI.SessionPool.PickList pickList, bool fRefresh) Unknown
Progress.o4glrt.dll!Progress.Open4GL.DynamicAPI.SessionPool.createSession() Unknown
Progress.o4glrt.dll!Progress.Open4GL.DynamicAPI.SessionPool.initializePool() Unknown
Progress.o4glrt.dll!Progress.Open4GL.DynamicAPI.SessionPool.SessionPool(string appName, Progress.Open4GL.DynamicAPI.IPoolProps properties, Progress.Common.EhnLog.IAppLogger log, string poolName, string requestID) Unknown
Progress.o4glrt.dll!Progress.Open4GL.DynamicAPI.SessionPool.createPool(string appName, Progress.Open4GL.DynamicAPI.IPoolProps properties, Progress.Common.EhnLog.IAppLogger log, string requestID) Unknown
Progress.o4glrt.dll!Progress.Open4GL.Proxy.ProObject.initAppObject(string appName, Progress.Open4GL.DynamicAPI.IPoolProps properties, Progress.Common.EhnLog.IAppLogger log, string requestID) Unknown
Progress.o4glrt.dll!Progress.Open4GL.Proxy.AppObject.initAppObject(string appName, Progress.Open4GL.DynamicAPI.IPoolProps properties, Progress.Common.EhnLog.IAppLogger log, string requestID, int proxyGenVersion) Unknown
LumberTrackProxy.dll!UFP.LumberTrack.AppServer.Proxy.LumberTrackProxy.LumberTrackProxy(Progress.Open4GL.Proxy.Connection connectObj, bool useWebConfigFile) Unknown

I suspect that this error was most likely to happen the *first* time an openclient was used in the lifetime of an appdomain, *if* it was being used on multiple threads simultaneously (via a Parallel.Foreach or similar).  In certain of our hosting contexts, the appdomains are cycled regularly.  For example SSRS can host our custom code and it cycles appdomains quite frequently.  Some of our windows services and web applications get cycled quite frequently too.  I suspect that if you were to repeatedly restart a new appdomain in a loop, and if you were to write code that does a Parallel.Foreach when the appdomain first started, and if the code in the Parallel.Foreach tried to "new" an independent AppObject in the loop ... then that is where you would eventually find the bug.  Again I don't think that anyone ever built that repro to actually demonstrate the bug;  I think they probably just did a code-review and changed the code by adding better protection around the static Hashtable.

The existence of a bug isn't what bothers me.  Every piece of software has bugs.

What bothers me is the impression that the exceptions are deliberately opaque, and that the inner exceptions and callstacks are being hidden.  The real annoyance to me in this case was the amount of work that was involved - simply to get a meaningful callstack like the one you see above.  The openclient for .Net never seems to give helpful stacks and the root problem is often obscured by layers of subsequent/misleading exceptions that are re-thrown afterwards.  As I recall, I think I had to use procdump or something similar to capture the first-chance exception (procdump: https://docs.microsoft.com/en-us/sysinternals/downloads/procdump ).

Hopefully this grumbling is somewhat constructive.  Maybe OE 12 will give us some improvement in the openclient exceptions!

Posted by dbeavon on 30-Aug-2019 20:44

Is there still work being done to fix the .net openclient in 11.7?

Just to make this more constructive, I am willing to work on a repro for the issue: "NoAvailableSessions / Unable to resolve hostname".   I just didn't want to spend a day or more on this unless Progress was willing to fix it in an SP for 11.7.  It won't be feasible for us to upgrade to OE 12 for a year or more.

Thankfully it is a rare issue and we remember - most of the time - that we just need to recycle the .Net application (rather than trying to find a DNS problem with the related hostname).

Posted by dbeavon on 31-Aug-2019 21:28

I have a repro now, and I'll submit to tech support.

Insofar as the openclient exception implementations go, I think I've figured out what is so non-standard about them.  There are three primary issues going on.

  • Exceptions derived from ProException inherit a very poor implementation of ProException.ToString() .  That implementation is suppressing some of the most valuable stuff for troubleshooting - stack details and inner exception(s) details.

    It probably would have been better to leave the ToString() alone, and use the implementation from the base class (Exception), especially if you omit the good stuff.  Here is what Microsoft says about their implementation:

    • The default implementation of ToString obtains the name of the class that threw the current exception, the message, the result of calling ToString on the inner exception, and the result of calling Environment.StackTrace. If any of these members is null, its value is not included in the returned string.

  • None of the Progress exceptions ever bother to capture inner exception and *store* it!
    • The ProException class has something that appears to be a Progress-specific replacement for the inner exception. It is called a "Previous" exception member. But in practice, Progress never bothers to put any information in here either, and it certainly doesn't show up in the output of ToString().

  • Aggressive catching and rethrowing exceptions, recharacterizing the root cause of any given problem.  This is not a huge deal in itself (in fact, I would rather have my exceptions to be strongly typed, and to include as much valuable contextual detail as possible).  But when you combine this with other two factors (mentioned above) then the original/root cause of an error becomes quickly buried and obscured. 

Here is the stack I was working with today.  The root cause of the bug appears to be something basic (ie. static "Vector" members of HTTPConnection that aren't protected for concurrency).  Again the bug itself doesn't bother me as much as the implementations of the openclient Exceptions themselves.

 

The original exception is from ArrayList (because of the concurrent use of a class that isn't safe for threading).  It is simply 'System.IndexOutOfRangeException' - Index was outside the bounds of the array.  But soon this will be translated several times over until it looks totally different : "Unable to resolve hostname localhost". (... which we would then send to our network team, so they can spin their wheels for a while, and buy us some time to restart the application ;-)

Notice the red highlights in the callstack above.  At each of those points, Progress interacts with the exception - catches and rethrows it.  At each of those points the exception is translated and re-thrown.  At each point, the inner exception, and the original callstack, is omitted!  It reminds me of the children's game called telephone.


  • Progress.o4glrt.dll!Progress.UBroker.Client.HttpClientProtocol.setupProxyServer
    • Catches exception and changes to Progress.UBroker.Util.NetworkProtocolException
      ToString() output is simply "Progress.UBroker.Util.NetworkProtocolException: 9988"
    • Thrown again like so: throw new NetworkProtocolException(x, y, z); /* NO INNER SUPPLIED */

  • Progress.o4glrt.dll!Progress.UBroker.Client.BrokerSystem.processConnect

    • Catches exception and changes to Progress.Open4GL.Exceptions.ConnectProtocolException
      ToString output is simply "Progress.Open4GL.Exceptions.ConnectProtocolException: Invalid http protocol configuration: Failure removing localhost from proxy list: Index was outside the bounds of the array. (9988)"
    • Thrown again like so : throw new ConnectProtocolException(x, y, z); /* NO INNER SUPPLIED */

  • Progress.o4glrt.dll!Progress.UBroker.Client.BrokerSystem.connect

    • Catches exception and changes to Progress.Open4GL.Exceptions.UnknownHostnameException
      ToString output is simply "Progress.Open4GL.Exceptions.UnknownHostnameException: CONNECT FAILURE: Unable to resolve hostname localhost (8821) "
    • Thrown again like so :throw new UnknownHostnameException(x);  /* NO INNER SUPPLIED */

  • Progress.o4glrt.dll!Progress.Open4GL.DynamicAPI.SessionPool.BrokerSessionList.reserveSession

    • Catches exception and changes to Progress.Open4GL.DynamicAPI.SessionPool.NoAvailableSessionsException
      ToString output is simply "Progress.Open4GL.DynamicAPI.SessionPool+NoAvailableSessionsException: SessionPool : NoAvailableSessions[CONNECT FAILURE: Unable to resolve hostname localhost (8821) ]"
    • Thrown again like so: throw new SessionPool.NoAvailableSessionsException(x); /* NO INNER SUPPLIED */

  • APP_Production_Maintenance_Kiln_Proxy.dll!UFP.LumberTrack.AppServer.Proxy.APP_Production_Maintenance_Kiln_Proxy.APP_Production_Maintenance_Kiln_Proxy

    • Custom appobject constructor, created by proxygen...
    • Catches exception and throws another instance of the same type that was caught (Progress.Open4GL.DynamicAPI.SessionPool.NoAvailableSessionsException)
      ToString output is simply "Progress.Open4GL.DynamicAPI.SessionPool+NoAvailableSessionsException: SessionPool : NoAvailableSessions[CONNECT FAILURE: Unable to resolve hostname localhost (8821) ]"
    • This is the final exception that is observed by our custom code.

To make a long story short, an exception is caused by a internal bug (a pretty basic one).  But the exception itself gets retranslated *five* times and loses inner exceptions and callstacks and even the original exception message itself.  It comes out the other end as something that is totally meaningless, and is useless for troubleshooting the root of the problem!

Posted by dbeavon on 06-Sep-2019 17:06

>> I have a repro now, and I'll submit to tech support.

Tech support was able to recreate this as well.  The case number is 00508141.  

We are eager for a fix, despite the fact that it is rare.  When it comes up, it causes an outage and involves manual intervention (to restart the app or appdomain).  Our openclient processes won't auto-recover (ie. after the exception, the static member fields within the API become corrupted and don't repair themselves until things are restarted).

Posted by dbeavon on 14-Apr-2020 17:05

FYI, the concurrency issues were fixed in a hotfix for 11.7.

For anyone who finds this post after googling "NameServerInterruptException" I wanted to add a bit more information on that topic too....   See in the posts above that I had been seeing some cases I'd observe instances of the exception NameServerInterruptException, which is especially interesting given that the PASOE architecture doesn't rely on nameserver.

Progress.Open4GL.Exceptions.NameServerInterruptException: Connect Failure: Timeout or Interrupt occurred while attempting to communicate with NameServer. (8231)

If this exception ever comes up, you should review the URL that you are using to connect to PASOE via APSV.

Here is an image showing the underlying source of the error.

Notice in the watch window at the bottom that I was using an invalid URL.  Also notice that the stack is waste-deep in NameServerClient logic (which should not be relevant in any way to a PASOE connection, but this code-path must have been taken because of the invalid URL).  Finally notice that the exception itself starts out as something totally that is not explicitly related to nameserver (ie. SocketException from UDP).

We have been affected intermittently by this issue for many months.  As it turns out, on rare occasion our URL's are malformed.  It happens for reasons that are related to our custom code, and generally only in production.  We would see NameServerException and really scratch our heads given the fact that we had already migrated to PASOE.

The root of the problem was apparently in our own custom code.  However the .Net openclient doesn't do us any favors where the exception is concerned.  It transforms a SocketException into other types of exceptions, at several stages, and ultimately loses all the underlying details.  The final output of Exception.ToString() is misleading and unhelpful.

I'm also a bit surprised that the openclient code gets so far down the wrong path.  You would think that a Url property would reject my malformed URL ("LumberTrack") at an earlier stage (by throwing an InvalidOperationException or ArgumentException or whatever that says there was a bad URL).

Hopefully this helps other customers migrating to PASOE who might still encounter NameServerException.

This thread is closed