Load average skyrockets in deadly embrace situation

Posted by Simon L. Prinsloo on 06-Oct-2016 04:51

Hi

My client experiences a problem where two AppServer processes end up in a deadly embrace (deadlock). As both calls are made from a batch job, this should only lock up processing of the message queue, but its effect goes much wider: the load average of the machine skyrockets, causing system-wide response problems for online users.

Is it expected that the deadlock will drive up the load of the whole machine like this, and how can we avoid it? We need to be able to keep the locks in place long enough to identify the offending processes without impacting the interactive users, because at present the remedial action for the response problem comes down to restarting the batch jobs and their associated AppServer, thereby losing all evidence of the code causing the deadly embrace before we can pinpoint the source.

They are using OpenEdge 11.6.1 on Red Hat Enterprise Linux 5 (kernel 2.6.18-308.el5).

Regards

Simon

All Replies

Posted by gus bjorklund on 06-Oct-2016 06:34

How do you know there is a deadly embrace?

You can send a SIGUSR1 to the app server and it will write a protrace file that will tell you where it is in the code.

You might need to do that several times to see different places where it is executing.

Posted by Rob Fitzpatrick on 06-Oct-2016 06:41

You could turn on client statement caching for the AppServer agents; use the option for the current frame, not the whole stack.  You can then use ProTop to view that info, including program names and line numbers, plus the blocked-client list, in real time.  When the problem hits, just hit "M" to mail yourself the info from the screens for later analysis.

Or you can write your own code to get that info from _Connect, _Lock, etc., or screen-scrape promon.
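
A rough sketch of that kind of query (the VST field names here are from memory, so verify them against your database's _Lock and _Connect schema; also note that scanning _Lock can itself be slow when the lock table is large):

/* sketch: list current record locks with the owning client's details.   */
/* VST field names from memory - check them against _Lock / _Connect.    */
FOR EACH _Lock NO-LOCK:
  IF _Lock._Lock-Usr = ? THEN LEAVE.  /* unused slots start here - stop scanning */
  FIND FIRST _Connect NO-LOCK
       WHERE _Connect._Connect-Usr = _Lock._Lock-Usr NO-ERROR.
  IF NOT AVAILABLE _Connect THEN NEXT.
  DISPLAY
    _Connect._Connect-Pid          COLUMN-LABEL "PID"
    _Connect._Connect-Name         FORMAT "x(12)" COLUMN-LABEL "User"
    _Lock._Lock-Name               FORMAT "x(18)" COLUMN-LABEL "Table"
    _Lock._Lock-RecId              COLUMN-LABEL "RecId"
    _Lock._Lock-Flags              COLUMN-LABEL "Flags"
    /* only populated when client statement caching is switched on */
    _Connect._Connect-CacheInfo[1] FORMAT "x(50)" COLUMN-LABEL "Last statement"
    WITH WIDTH 160.
END.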

Posted by Simon L. Prinsloo on 06-Oct-2016 09:59

At first, the deadlock was an educated guess, as the message queue would stop processing until you killed the process and restarted it.

But the code has a standard include, dating from around version 6, that is used to lock a record if possible, or to tell the user what is happening when the record is locked.

I changed it to publish an event when the session is in batch mode and the lock duration exceeds a threshold. I then trap the event to log some information about the lock. This was not very useful at first, but once the customer upgraded to v11.6 I could enhance the code to give me more information.
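
In rough terms the subscriber side looks something like this (heavily simplified; the procedure name and log format are illustrative, not the client's actual code):

/* illustrative subscriber - the real handler logs more detail */
SUBSCRIBE TO "ExtremeLockDuration" ANYWHERE RUN-PROCEDURE "LogExtremeLockDuration".

PROCEDURE LogExtremeLockDuration:
  DEFINE INPUT PARAMETER phBuffer AS HANDLE NO-UNDO.
  DEFINE INPUT PARAMETER prRowid  AS ROWID  NO-UNDO.

  /* one line per long wait: when, which table, which record */
  OUTPUT TO VALUE("locktrace.log") APPEND.
  PUT UNFORMATTED STRING(NOW) " still waiting on " phBuffer:TABLE
      " rowid " STRING(prRowid) SKIP.
  OUTPUT CLOSE.
END PROCEDURE.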

Using this, we identified more than one of these conditions. As they fixed the code and the situation occurred less and less frequently, I noticed a direct correlation between the occurrence of a lock and a significant rise in the load average.

Whenever the load average rapidly rises from its normal level (around 15) to 25+, data is also written to the lock log. As soon as I kill either process writing to the log, the load average drops like a stone.

That being said, I think I just answered my own question.

I suspect that the customer's code that informs the user what is happening while it tries to lock the record has a tight loop that is killing off a processor. In the past the machine had fewer processors that all users had to share, so this tight loop could go unnoticed, but this machine has almost 100 cores and such a "hanging" process gets access to many more cycles than usual.

If memory serves, the offending code goes something like this:

IF LOCKED ({1})
THEN DO:
  MESSAGE "Record is locked...".
  w_locktime = ETIME.
  /* busy-wait: retries the FIND as fast as it can, with no pause at all */
  DO WHILE NOT AVAILABLE ({1}):
    FIND {1} EXCLUSIVE-LOCK WHERE ROWID({1}) = w_lockrowid
        NO-WAIT NO-ERROR.
    IF SESSION:BATCH-MODE
    AND ETIME - w_locktime > 20000
    THEN DO:
      PUBLISH "ExtremeLockDuration" (INPUT BUFFER {1}:HANDLE, INPUT w_lockrowid).
      w_locktime = ETIME.
    END.
  END.
END.

Posted by Frank Meulblok on 06-Oct-2016 11:12

Without anything to throttle that DO WHILE NOT AVAIL... loop, it'll try re-reading the record as fast as it can.

So that's a likely suspect for something that maxes out your CPU.
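
Even a one-second pause between retries would change the picture completely. Something along these lines (just a sketch, not a drop-in replacement for the include; it assumes PAUSE acts as a sleep in the batch session):

/* throttled retry loop: sleep between attempts instead of spinning the CPU */
DO WHILE NOT AVAILABLE ({1}):
  FIND {1} EXCLUSIVE-LOCK WHERE ROWID({1}) = w_lockrowid
      NO-WAIT NO-ERROR.
  IF NOT AVAILABLE ({1}) THEN
    PAUSE 1 NO-MESSAGE.  /* give the lock holder a chance to finish */
END.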

Posted by gus bjorklund on 06-Oct-2016 11:22

the load average goes up due to the do while not avail loop spinning continuously without any pause. so it burns cpu cycles.

this technique is called “busy waiting” and should not be used without any limits.
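
for example, keep the pause and also give up after some maximum wait. a sketch only - the one second pause and the 5 minute limit are arbitrary numbers, and PAUSE is assumed to behave as a sleep in a batch session:

/* bounded wait: stop retrying after roughly 5 minutes */
w_locktime = ETIME.
DO WHILE NOT AVAILABLE ({1}):
  FIND {1} EXCLUSIVE-LOCK WHERE ROWID({1}) = w_lockrowid
      NO-WAIT NO-ERROR.
  IF NOT AVAILABLE ({1}) THEN DO:
    IF ETIME - w_locktime > 300000 THEN
      RETURN ERROR "could not lock the record within 5 minutes".
    PAUSE 1 NO-MESSAGE.
  END.
END.

or drop the NO-WAIT and let the FIND block in the lock queue, with the -lkwtmo lock wait timeout as the upper bound.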
