How does client-server ABL recover from locked database reco

Posted by dbeavon on 08-Oct-2019 20:42

On two occasions today I looked in PASOE and noticed that an ABL session was "stuck" waiting on a lock.  Ie. my impression is that the ABL logic in the PASOE session had reached a point where a dependency on a lock couldn't be satisfied, and the logic had to wait.  But once the lock was clearly available again, the PASOE session would *not* proceed.  (The lock should have become available based on the fact that there were no remaining active transactions in the system other than that PASOE session.)

So what is it that would allow an ABL session to "wake up" once a record lock becomes available?

I believe there must be some sort of a polling, along with some sort of network message that is sent on an interval  (perhaps based on Nmsgwait ?)

I'd like to have a way to troubleshoot this when it happens again. 

Better yet, I'd like to avoid the problem altogether.  I was considering using STOP-AFTER so that session won't need to wait indefinitely for locks to become available.  But when I was reading the KB about stop-after, (https://knowledgebase.progress.com/articles/Article/What-does-the-STOP-AFTER-expression-do ) I noticed that there is a caveat that says the feature won't work when the "session is blocked waiting for a network message ".

Any pointers would be very much appreciated.  This information is probably "client-server 101".  But for some reason I have never come across this information.

Posted by gus bjorklund on 11-Oct-2019 14:38

> On Oct 10, 2019, at 5:44 PM, dbeavon wrote:

>

> the most confusing to me was the information in the "_lock" VST

When deleting records, first an exclusive lock is acquired if one is not already in effect. then the record is deleted and replaced by a "rowid placeholder". that is done so that no one else can take that rowid until after the deleting transaction has been committed. otherwise if someone else took the rowid, it would be impossible to undo the incomplete transaction.

then, after the row is deleted, the exclusive lock in the lock table is removed as it is no longer needed. the rowid placeholder serves as an exclusive lock.

thus a loop that deletes many records only uses one entry in the lock table.

Posted by dbeavon on 10-Oct-2019 21:42

I figured this out.  It wasn't actually stuck, although the symptoms implied that it was:  The stack shown for the session (in OEE) remained in the same place for long periods of time.  The transaction was open, and the _lock VST table wasn't growing.  All the evidence seemed to indicate that the client/server ABL session was stuck.  But it turned out that what was happening is the logic was going into an unusual block that was *deleting* lots of records in a loop (well over 100,000 of them).  So the stack (in OEE) showed the same line number while the deletes were underway.

In my defense, the line number that was displayed was not actually pointing to the DELETE statement - or things would have been a lot more clear to me.  It wasn't until about the fifth time I re-read the source code that I figured it out.  

Those deletes run a bit slow over client/server.  They weren't ever a problem when things ran in shared memory (or maybe they would have become a problem if there were even more of them).  The lock wait timeout (-lkwtmo 40) didn't apply in this case,  because the session was continuing to delete records;  it wasn't ever "waiting" on any locks.  As a side note, I still wish OE would introduce *transaction* timeouts in addition to the lock wait timeouts.  A transaction timeout would be extremely helpful.  I don't think that transaction timeouts should be too difficult for the database to enforce, especially for client/server connections.

The thing that was probably the most confusing to me was the information in the "_lock" VST.  Right up until the end, I was convinced that I would be able to troubleshoot the long-running transaction by watching that particular VST.  But it really doesn't show anything interesting in the case of a lot of deletes (its not as if I noticed 100,000 entries for each of the deletes in the active transaction).  It always showed a single exclusively-locked record, and seemed to just confirm that the transaction was stuck.  I probably should have been using some other VST as well.

Thanks for the feedback Gus.  I probably shouldn't be so suspicious of client/server ABL.

All Replies

Posted by gus bjorklund on 08-Oct-2019 21:04

for self-serving 4GL sessions, when waiting for a lock, the session blocks on a semaphore. when the lock-holder releases the lock, it also signals the waiting session's semaphore to wake it up. the waiting session then tries to acquire the lock again.

for client-server 4GL sessions over the network, when waiting for a lock, the 4GL session is waiting for a network message from the server. the network message normally contains the requested record. when there is a lock conflict, the server queues the lock request and marks the database connection as waiting. when the lock holder releases the lock, it marks the connection as server action pending and marks the server as one of your clients needs attention. when no longer serving an earlier request, the server sees this and tries to acquire the lock again.

in PASOE, the worker thread can go do work on behalf of another session when the session is waiting for a lock. but i do not remember if it actually does this or if the worker thread blocks.

there is another possibility though: the lock may not be released by the lock holder - it may be only have been downgraded from exclusive to share at the end of a transaction and there may not be any active transactions other than the blocked client's. in this case, if the client wants a share lock, it will be granted. but if the client wants an exclusive lock, it will have to continue waiting even though it is the only session with an active transaction.

Posted by dbeavon on 08-Oct-2019 21:31

Thanks Gus,

Have you seen cases where the client-server 4GL session won't ever wake up again, despite the lock becoming available?  I don't want to jump to the worst possible conclusion and point to a possible bug in _mprosrv, but I haven't found another explanation yet.

Also, what are the chances that STOP-AFTER would be successful in releasing the 4GL session again (and properly generating a STOP/failure condition).  The KB seemed to say this may not work when waiting on a network message (which we are likely doing given that we are waiting on a lock).

>> there is another possibility though: the lock may not be released by the lock holder ...

Thanks for pointing me to this possibility.  I hadn't considered it.  I suppose the best way to troubleshoot is to scan the _lock table VST and find any potential conflicts that way:

knowledgebase.progress.com/.../19136

I had hoped to be able to point our DBA to the OEE console, so that they would find the problem without writing custom queries on the VST.  But I don't think OEE is going to cut it this time.

Posted by dbeavon on 08-Oct-2019 22:11

There is another thing that is bothering me about the behavior of the client-server sessions that are "stuck".  As far as I'm concerned, the 4GL sessions in PASOE should always break themselves free after 40 seconds ( based on client session parameter -lkwtmo 40 ).

The only resource contention that I see is on a single table.  And if records in that table are not made available to the remote client session, then the remote session should raise the STOP condition after 40 seconds.  I shouldn't have to introduce any explicit code to make that sort of thing happen (eg. by using STOP-AFTER)

No matter how I look at it, I'm still suspicious that there is some unreasonable blocking that is taking place between the _mprosrv process and the remote 4GL session.

Posted by dbeavon on 10-Oct-2019 21:42

I figured this out.  It wasn't actually stuck, although the symptoms implied that it was:  The stack shown for the session (in OEE) remained in the same place for long periods of time.  The transaction was open, and the _lock VST table wasn't growing.  All the evidence seemed to indicate that the client/server ABL session was stuck.  But it turned out that what was happening is the logic was going into an unusual block that was *deleting* lots of records in a loop (well over 100,000 of them).  So the stack (in OEE) showed the same line number while the deletes were underway.

In my defense, the line number that was displayed was not actually pointing to the DELETE statement - or things would have been a lot more clear to me.  It wasn't until about the fifth time I re-read the source code that I figured it out.  

Those deletes run a bit slow over client/server.  They weren't ever a problem when things ran in shared memory (or maybe they would have become a problem if there were even more of them).  The lock wait timeout (-lkwtmo 40) didn't apply in this case,  because the session was continuing to delete records;  it wasn't ever "waiting" on any locks.  As a side note, I still wish OE would introduce *transaction* timeouts in addition to the lock wait timeouts.  A transaction timeout would be extremely helpful.  I don't think that transaction timeouts should be too difficult for the database to enforce, especially for client/server connections.

The thing that was probably the most confusing to me was the information in the "_lock" VST.  Right up until the end, I was convinced that I would be able to troubleshoot the long-running transaction by watching that particular VST.  But it really doesn't show anything interesting in the case of a lot of deletes (its not as if I noticed 100,000 entries for each of the deletes in the active transaction).  It always showed a single exclusively-locked record, and seemed to just confirm that the transaction was stuck.  I probably should have been using some other VST as well.

Thanks for the feedback Gus.  I probably shouldn't be so suspicious of client/server ABL.

Posted by Peter Judge on 11-Oct-2019 13:42

> As a side note, I still wish OE would introduce *transaction* timeouts in addition to the lock wait timeouts.  A transaction timeout would be
>extremely helpful.  I don't think that transaction timeouts should be too difficult for the database to enforce, especially for client/server
>connections.
 
Do you mean as a startup parameter?
 
I believe you can use the STOP-AFTER block modifier to do this in code.
 
 

Posted by dbeavon on 11-Oct-2019 14:30

>> I believe you can use the STOP-AFTER block modifier to do this in code.

Yes, that is along the lines of what I need.  But it seems better to manage this at the database level, since STOP-AFTER is sort of "hit-and-miss".  Also that feature requires the addition of explicit code into every appserver entry point that has a potential to be long-running.

I'm pretty sure that STOP-AFTER is a feature of the ABL-client-session, rather than a timeout which is enforced by the remote database server.  (IE.  my understanding is that the way it works is analogous to a bot that would push the the ENDKEY button after a predetermined amount of time, and then cross its fingers.  There is a KB that describes various scenarios where the STOP-AFTER behavior would not work reliably.)

Posted by gus bjorklund on 11-Oct-2019 14:34

As peter says, you can use the stop after mechanism in this case.

Depending on how your code was written, it is possible for each row (and its index entries) to be deleted in separate transactions. That would make it run comparatively slowly.

Also, be careful what you wish for. If your code is deleting a million records in one transaction and the transaction timeout stops you at 999,000, all those deleted records have to be put back.

Posted by gus bjorklund on 11-Oct-2019 14:38

> On Oct 10, 2019, at 5:44 PM, dbeavon wrote:

>

> the most confusing to me was the information in the "_lock" VST

When deleting records, first an exclusive lock is acquired if one is not already in effect. then the record is deleted and replaced by a "rowid placeholder". that is done so that no one else can take that rowid until after the deleting transaction has been committed. otherwise if someone else took the rowid, it would be impossible to undo the incomplete transaction.

then, after the row is deleted, the exclusive lock in the lock table is removed as it is no longer needed. the rowid placeholder serves as an exclusive lock.

thus a loop that deletes many records only uses one entry in the lock table.

Posted by gus bjorklund on 11-Oct-2019 18:22

you are correct that stop-after is a feature of the 4GL runtime, NOT of the database. this is so because (in spite of its few (solvable) shortcomings) stop-after can be also used for things that do not have anything to do with the database.=

Posted by dbeavon on 11-Oct-2019 21:26

True,  but you point out that it doesn't always work reliably, even for purposes that have nothing to do with the database (eg to interrupt external interfaces).  That is the reason that the database needs a "backup plan".

... so if a user session locks a record in the database, and then that session gets held up by some external interface that is outside the control of the 4GL runtime, then that record will remain locked even after the stop-after timeout has expired.  

That is why the database server needs its own timeout - so that it can eventually release those locked records one way or another.  Otherwise there is a potential concurrency problem that could grow out of hand (sessions will start piling up).  IMHO the database server should be able to protect its own interests, and should NOT be at the mercy of some misbehaving 4GL session that runs on some remote PASOE server somewhere.  This is analogous to a "DBA-administrator-bot" that would disconnect remote connections after a predetermined amount of time has elapsed.  

Posted by gus bjorklund on 11-Oct-2019 22:58

some people do use 4GL jobs that disconnect users when their transactions have been active too long --- say 30 minutes for example. these are not too hard to write. they get the transaction start times from the _trans VST

Posted by dbeavon on 14-Oct-2019 12:20

Gotcha.  But it seems like it should be part of the database.  I can't think why a database wouldn't want to be able to enforce its own transaction timeouts.  

I never get used to seeing the misbehaving sessions that have long-running transactions open for hours at a time (in promon).

Posted by gus bjorklund on 08-Nov-2019 23:04

This thread is closed