Zombie remote servers - need some advice

Posted by 302218 on 17-Feb-2020 10:55

In November last year we migrated

  • from OpenEdge 11.3.3 on Solaris SPARC – Solaris 10 zones
  • to OpenEdge 11.7.3 on RHEL 7.7 (Maipo) – VMware vSphere 6.0 (10719132), ESXi 6.0

Everything went according to plan and we got a huge performance boost:

  • Database online backup – 10 times faster
  • Most batch jobs – 7 times faster

A huge success, everything was good …

… until runtime clients on Windows 7 Enterprise systems randomly started to get SSL connection errors when trying to connect to the database. That was bad enough, but the situation is getting worse over time: zombie remote server processes render the database unavailable for new connections for as long as 17-plus minutes.

 This is the error the clients are getting:

 SSL error 12072 - SSL Client handshake failure (336130315) SSL routines occurred. (12168)

Error starting SSL handshake with the OpenEdge database server. (12167)

 

This is what I can see in the database log file:

 [2020/02/17@09:31:33.820+0100] P-1045937   T-140231528724288 I SRV   13: (12151) SSL error 12067 - SSL accept failed occurred.

[2020/02/17@09:31:33.820+0100] P-1045937   T-140231528724288 I SRV   13: (12154) Error while attempting to create the SSL Client instance.

……

[2020/02/17@09:49:03.940+0100] P-1045937   T-140231528724288 I SRV   13: (1334) Rejecting login -- too many users for this server.

[2020/02/17@09:49:03.940+0100] P-1045937   T-140231528724288 I SRV   13: (-----) User count inconsistency detected: usrcnt=5 users=15

[2020/02/17@09:49:03.940+0100] P-1045937   T-140231528724288 I SRV   13: (-----) User count corrected: usrcnt=15 users=15

 

Relevant database startup parameters - (250 concurrent remote clients max.):

 -n 850

-Mn 40

-Ma 15

-Mi 5

-S 47311

-minport 8400

-maxport 8499

-ServerType 4GL

-PendConnTime 10
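
For orientation, here is a rough back-of-the-envelope check of what these parameters allow. It is only a sketch under simplifying assumptions: every remote server takes exactly one port from the -minport/-maxport range, all ports in the range are free, and secondary login brokers are ignored. The function name remote_capacity is made up for illustration.

```python
def remote_capacity(mn, ma, minport, maxport):
    """Rough upper bounds for remote connections (simplified model)."""
    # Port range is inclusive on both ends.
    ports = maxport - minport + 1
    # Assumption: one port per remote server, so servers are capped by both -Mn and the range.
    servers = min(mn, ports)
    # Each server accepts at most -Ma clients.
    return {"ports": ports, "servers": servers, "client_slots": servers * ma}


# Values from the startup parameters listed above.
current = remote_capacity(mn=40, ma=15, minport=8400, maxport=8499)
print(current)
# {'ports': 100, 'servers': 40, 'client_slots': 600}
```

Under these assumptions the current settings leave room for 600 remote client slots on 40 servers, against a stated maximum of 250 concurrent remote clients.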

 

Therefore I opened a TechSupport case. It turns out that the issue, which causes a remote server to become a zombie as soon as it encounters a 12151 error, is a product defect (OCTA-19107). The TechSupport engineer was able to reproduce it on 11.7.5, and so far a fix is expected to make it into 11.7.6, which is scheduled for sometime in Q2/2020.

 

What we’ve found out so far is that the 12151 error turns the remote server into a zombie, while the database broker still forwards connection requests to that zombie remote server. These requests all fail for minutes, until the same server adjusts its user count to the maximum, at which point the clients get the error that the server has no more resources.

 

Still, I have not found out what causes the initial SSL connection error on a given remote server, so that it encounters the 12151 error which makes it a zombie server, or what causes the database broker to adjust the user count to its max some time later.

 

Nevertheless, until I have found the root cause and OpenEdge 11.7.6 is released, I need a strategy to mitigate the problem in some way, and this is what I’ve come up with:

 

  • Increase the pending connection timeout (-PendConnTime) – as long as a zombie server has a pending connection, subsequent connection requests should be forwarded to other remote servers
  • Change to -Mn 80, -Mi 2, -Ma 8 – to have the remote clients spread over more remote servers
  • Implement a monitoring job which greps the database log file to identify remote servers which got a 12151 error and eventually terminates them via promon – terminate zombie servers
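
The log-scanning part of the monitoring job could look roughly like the sketch below. It assumes the log line layout shown in the excerpts above (timestamp, P-<pid>, T-<thread>, severity, SRV <number>, message number in parentheses). The names LOG_LINE and find_suspect_servers are made up for illustration, and the sketch only reports candidates; terminating them via promon is deliberately left as a manual step.

```python
import re

# Pattern for database log lines written by a remote server (layout assumed from the excerpts above).
LOG_LINE = re.compile(
    r"\[(?P<ts>[^\]]+)\]\s+P-(?P<pid>\d+)\s+T-\d+\s+\w\s+"
    r"SRV\s+(?P<srv>\d+):\s+\((?P<msg>[^)]+)\)"
)


def find_suspect_servers(lines, msg_number="12151"):
    """Map server number -> (pid, timestamp of the first matching message)."""
    suspects = {}
    for line in lines:
        m = LOG_LINE.search(line)
        if m and m.group("msg") == msg_number:
            # Keep only the first occurrence per server.
            suspects.setdefault(int(m.group("srv")), (m.group("pid"), m.group("ts")))
    return suspects


# Sample lines copied from the log excerpt in this post.
sample = [
    "[2020/02/17@09:31:33.820+0100] P-1045937   T-140231528724288 I SRV   13: (12151) SSL error 12067 - SSL accept failed occurred.",
    "[2020/02/17@09:31:33.820+0100] P-1045937   T-140231528724288 I SRV   13: (12154) Error while attempting to create the SSL Client instance.",
    "[2020/02/17@09:49:03.940+0100] P-1045937   T-140231528724288 I SRV   13: (1334) Rejecting login -- too many users for this server.",
]

suspects = find_suspect_servers(sample)
for srv, (pid, ts) in suspects.items():
    print(f"server {srv} (pid {pid}) logged 12151 at {ts}")
```

In a real job the lines would come from the database log file rather than a hard-coded list, and the job would have to remember which servers it has already reported.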

 

Any thoughts are welcome!

 

Thanks in Advance.

All Replies

Posted by gus bjorklund on 18-Feb-2020 21:56

you have minport 8400 and maxport 8499 which allows only for 99 network clients if all those ports are free (which they may not be).

Posted by 302218 on 19-Feb-2020 06:50

Thanks for your reply. Yes, those ports (8400 to 8499) are free on the system and should be enough to be able to start 80 remote servers.

Thanks, Richard.

Posted by Rob Fitzpatrick on 19-Feb-2020 14:59

If the fix is expected to be in 11.7.6, which is due soon, I'd ask TS whether a hotfix can be provided for an earlier 11.7 SP.  Then you can have the fix rather than try to find a workaround, without having to wait for 11.7.6.
