Transaction undo and life after logout

Posted by George Potemkin on 14-May-2018 04:34

When transaction is interrupted by a signal it will be backed out, a session will log out from a database and… process will keep running for a while: the time of its life after logout equals approximately 50-80% of transaction’s duration.

The common terminating signals that can interrupt a Progress session can be divided into two groups - "stoppers" and "terminators".
"Stoppers": SIGINT, SIGPIPE, STOP and UNDO statements in 4GL (4GL statements are not the signals but anyway..).
"Terminators": SIGHUP, SIGQUIT, SIGTERM and proshut -C disconnect

These groups are different in many aspects:
Duration of transaction undo triggered by a "stopper" is just a bit longer than transaction duration (an undo adds BI reads).
Undo initiated by a "terminator" is much faster - 25-50% of transaction duration (the factor varies with transaction size). It’s the same time that broker/watchdog will spend on transaction undo if a session was terminated by SIGKILL.

If a session has opened the transactions in a few databases then all transactions will be undone one transaction in time, in other words, a session will be temporary "frozen" in all databases except one. Transaction backouts initiated by the "stoppers" will be done in the order in which a session has been connected to the databases. Transaction backouts initiated by the "terminators" will be done in the /reverse/ order: the last connected database will be the first one backed out.

The different scenarios, the different undo times - does it mean the different algorithms that Progress is using to undo the transactions depending on the trappable or untrappable STOP conditions? "Stoppers" do not set neither the "usertodie" flag (_Connect-Disconnect) nor the "resyncing" flag (_Connect-Resync). "Terminators" set the "usertodie" flags in all databases (except, of course, proshut -C disconnect that does it only for one database) and set the "resyncing" flag for one database - the one that is rolling back its transaction.

"Terminators" cause a session to write the backout messages to a log of the corresponding database (even if transaction was in the ALLOCATED status, i.e. when the transaction did not really change a database):
(2252) Begin transaction backout.
(2253) Transaction backout completed.
Of course, "stoppers" do not leave such messages.

If 4GL code is trapping the STOP condition and quits on STOP (IF RETRY THEN QUIT) then a session is terminated immediately after logout from the last connected database.
The "stoppers" (except SIGHUP) leave a session on pause. For example:
KILL signal received. (298)
Press space bar to continue.
The message really means "Press space bar to terminate the session".
The common signals ("stopper" or "terminator") will terminate a session in this state. Let’s treat this state as an end of session’s execution.

Between this state and the logout from the last database (= first connected database) the session keeps running for a while. Example:

[2018/05/12@09:24:19.113+0400] P-1135       T-140073628817216 I ABL     8: (452)   Login by root on /dev/pts/27. 
[2018/05/12@09:24:19.115+0400] P-1135       T-140073628817216 I ABL     8: (7129)  Usr 8 set name to root. 
[2018/05/12@09:24:19.224+0400] P-1135 Transaction begins.
[2018/05/12@09:25:19.228+0400] P-1135 Transaction is ready.
[2018/05/12@09:25:47.000+0400] kill -HUP 1135
[2018/05/12@05:25:47.000+0000] P-1135       T-140073628817216 I ABL     8: (562)   HANGUP signal received. 
[2018/05/12@09:25:47.117+0400] P-1135       T-140073628817216 I ABL     8: (2252)  Begin transaction backout. 
[2018/05/12@09:26:09.014+0400] P-1135       T-140073628817216 I ABL     8: (2253)  Transaction backout completed. 
[2018/05/12@09:26:09.014+0400] P-1135       T-140073628817216 I ABL     8: (453)   Logout by root on /dev/pts/27. 
[2018/05/12@09:26:53.000+0400] P-1135 is terminated.

Tran Time: 60.004 sec
Undo Time: 21.897 sec
Time after logout: 43.986 sec

Note: Undo + Time after logout seems to be always a bit longer than a transaction duration.

"P-1135 is terminated" was reported by:

while sleep 1
do
  kill -0 $PID 2>/dev/null && continue
  echo [`date '+%Y/%m/%d@%H:%M:%S.000%z'`] P-$PID is terminated. >>$LOG
  break
done &

Stack trace of a process after logout shows the calls:

fgetcxdll freadxdll frdfrom bfundo bfrej rnxitn_noval rninterpret rnrq main

I can be wrong but it looks like a process is reading lbi file. The size of lbi file created during transaction is almost twice (1.8-1.9 times) higher than the number of bytes written to BI file by transaction before it was interrupted. Lbi file is not growing during transaction undo. The size of lbi file is only slightly (700-800 bytes) increasing after database logout. I did not use the -nosavepoint parameter during the tests. The -lbimod parameter also has the default value.

If a session does read its lbi file at the final phase then it reads only the part of it related to the interrupted transactions. Anyway it’s not clear why process does not terminate immediately after logout from the last database.

Of course, it’s a minor issue because after logout a process is not using database resources. We can safely use SIGKILL to terminate a process provided we are 100% sure that a session is not connected to any other databases. But we will have 100% insurance only when a process is terminated.

Tested with 11.7.2 on Linux and HP-UX

All Replies

Posted by ChUIMonster on 14-May-2018 07:29

If I am following your logic one potential optimization might be to skip the LBI processing?

Posted by George Potemkin on 14-May-2018 07:53

Maybe the LBI processing (if it's really LBI) is just a bug?

Posted by ChUIMonster on 14-May-2018 08:25

For a "stopper" I can imagine it is a feature - you can catch the STOP condition in the startup procedure and then re-initialize your session.  In a case like that you would want any UNDO variables and temp-tables to be properly undone.

For a "terminator" it seems like there would be no point and you could just throw it all away.

Of course if everything is defined NO-UNDO and if you are not writing code with lots of nested sub-transactions there ought not to be much to undo anyway so the sheer size of what you are seeing seems a bit off kilter.

If they feel so inclined someone with knowledge of the internals can probably correct my ramblings and explain why it needs to be done the way that it is being done.

Posted by George Potemkin on 14-May-2018 08:39

It would be interesting to know why there is a difference in transaction undoes triggered by a "stopper" and by a "terminator". Why a second one is much faster?

Most likely the UNDO variables are treated as the fields of some pseudo tables (one pseudo table per procedure) in some pseudo database. But in my test program all variables are defined as NO-UNDO and 'strings lbi*' shown only the contents of the customer.name fields. I used the simple transactions in sports database:

  REPEAT:
    FOR EACH sports.customer EXCLUSIVE-LOCK:
      ASSIGN sports.customer.name = "x" + sports.customer.name.
      ASSIGN sports.customer.name = SUBSTRING(sports.customer.name, 2).
    END.
    
    IF NOW GE vEndTime THEN
    LEAVE.
  END.

Posted by gus bjorklund on 15-May-2018 08:38

Tom: it /doesn’t/ need to be done the way it is done. like many things, there is more than one way to do it. there are also opportunities to optimise the present implementation. I won’t get into those here.

George: i’ll explain as best i can.

Remember that transactions are an /error handling/ mechanism. There are really four variants of undo in the client (plus two in the database, but that is another topic). they are: 0) undo triggered by an error condition detected in application logic and execution of an UNDO statement, 1) undo triggered by some error condition like a duplicate key, 2) undo triggered by a stop condition, 3) undo and termination triggered by an error detected by a signal handler.

The LBI file is used to record client-side history so that some or all actions can be undone. In all cases, there may or may not be LBI reading to accompany the undo action. The LBI file is composed of one or more nested “ar units” or “action-recovery units”. An ar unit matches execution of a block in the 4GL. One ar unit may align with a block that has a database transaction and that does not have to be the outermost ar unit. Only one because database transactions cannot be nested in the 4GL. Subsequent ar units will be sub-transactions and their database actions may be recorded in the LBI file or not if database savepoints are being used by the client. Also, a nested ar unit may involve the start of a transaction on another database. When that happens, the second database transaction “joins” the outer one in a sense. While actions on the second database can be undone, that transaction can’t be committed until the outer database transaction commits.

Along with block nesting, the LBI file is used to record values of local variables, temp table actions, database actions, buffer connections and disconnections and a few other items related to client side state information.

During forward processing, the LBI file is buffered in memory and written sequentially when the buffer fills. When an undo happens, the LBI is read backwards from eof and a new eof established when the undo is finished. Depending on how much is being undone, the new eof may be at the beginning of the LBI or at the start of some block’s ar unit.

When an undo happens, the undo may go back to any active ar unit. The target ar unit may be the same one that matches a database transaction, higher, or lower. In the first four cases mentioned above, the 4GL session will or can continue, depending on the application logic. The undo processing is very similar for all of them, except that a stop condition may be triggered by a lost database connection. In such a case, database actions in subtransactions cannot be undone by the client. undo must include the ar unit for the outermost database transaction but may go farther than that but not does necessarily stop the session.

In the last case, the client session is going to be terminated due to a fatal error. much of the undo processing is the same as above, except that it goes all the way to the outermost ar-unit and then the client will exit. In some cases, depending on the type of error (e.g. an access violation), undo processing consists only of rolling back the database transaction followed by database disconnect and process exit.

I’m sure I have left out something important but i cannot remember what it is.

-gus

Posted by George Potemkin on 15-May-2018 09:13

> There are really four variants of undo in the client (plus two in the database, but that is another topic).

Thanks, Gus. Are they mutually exclusive? I tried SIGINT + pause + SIGHUP  (numbers 2 and 3 in your list) and I got the rather unexpected results for transaction undo. And, by the ways, for this combination of signals a process is terminating immediately after logout.

Posted by gus bjorklund on 20-May-2018 12:54

There is a lot more to the story.

Signals can trigger a stop condition (e.g. ctrl-c), clean termination (e.g. SIGHUP) similar to a QUIT statement or a fatal error with no undo (e.g. SIGSEGV).

Note that an undo does not have to cause termination of the transaction. It can just cause a rollback of some or all of its actions and then initiate a retry.

Furthermore, the action triggered by the signal may have to be postponed. An example of that is a SIGHUP that occurs in the middle of an uninterruptible action such as updating a database buffer or inserting a lock into the lock table. In such a case the operation must be completed in order to avoid corrupting shared memory data structures. So the undo and quit has to be postponed until the main thread reaches a point where those actions can be safely taken.

On the other hand, when there is an error like a segmentation fault, access violation, or illegal instruction, we can’t just allow the main thread to continue even if in the middle of an uninterruptible action because the error will just occur again, resulting in an infinite loop. So we must terminate the process with a minimum of cleanup. In such cases, there is infinite loop detection that can shortcut the cleanup and just call exit(). Sometimes that will cause a database shutdown (e.g. incomplete microtransaction) and sometimes it wont.

Note that when you are connected to multiple databases, undo actions might be required in none, one, several, or all connected databases. Depending on the nature of the thing that triggers the undo, this may or may not be possible.

In your case of SIGINT + pause + SIGHUP, you are interrupting the interrupt handling, which adds further complications depending on when it happens.

HTH

gus

PS: yes, tom sometimes the LBI processing can be skipped. But usually not.

Posted by George Potemkin on 20-May-2018 13:30

As usually more questions:

Signal diagnostic mechanism decribes SIGALRM signal as:
(4397) utalsig: Utility Alarm signal handler.
(4468) immediate: do not delay handling this signal.

Some old knowledgebase articles wrote:
"The 4GL PAUSE statement will cause the Progress session to wait for a SIGALRM signal sent by the OS."
SIGALRM "should wake up all clients from any PAUSE statements."

But it's not true at least for the modern Progress versions. Is SIGALRM completely ignored by Progress?

Proshut -C disconnect sends a signal to the self-service clients:
It's SIGEMT (Solaris, AIX, HP-UX - Progress 10.2B) or SIGTRAP (HP-UX - Progress 11.7) or SIGUSR2 (Linux)
But they do not have any visible effects on Progress sessions if we send the same signals using the kill command.
On other hand, Windows does not use the signals and proshut -C disconnect on Windows has the limitations that do not exist on Unix.
Do these signals only force a session to check a 'usrtodie' flag in shared memory?

There are many knowledgebase articles in which the danger of the "trap" command is discussed. But Progress sessions seem to ignore the traps. Or did I do it in a wrong way?
Example:
trap "" SIGHUP
mpro sports2000
Progress session will be able to get SIGHUP
I tried it with Progress 10.1C, 10.2B and 11.7

Posted by Dmitri Levin on 06-Jun-2018 11:41

And just to illustrate the life after a logout here is an example from our Live system today

[2018/06/06@08:16:01.290-0400] P-38667148   T-1     I ABL   461: (452)   Login by webadm on batch.

[2018/06/06@08:16:01.353-0400] P-38667148   T-1     I ABL   461: (7129)  Usr 461 set name to webadm.

[2018/06/06@08:34:00.912-0400] P-38667148   T-1     I ABL   461: (2252)  Begin transaction backout.

[2018/06/06@08:34:00.912-0400] P-38667148   T-1     I ABL   461: (2253)  Transaction backout completed.

[2018/06/06@08:34:00.918-0400] P-38667148   T-1     I ABL   461: (2252)  Begin transaction backout.

[2018/06/06@08:40:55.361-0400] P-38667148   T-1     I ABL   461: (2253)  Transaction backout completed.

[2018/06/06@08:40:55.361-0400] P-38667148   T-1     I ABL   461: (453)   Logout by webadm on batch.

So the session was disconnected by "proshut -C disconnect". Transaction was backed out in about 7 minutes. And after I was able to see " Logout by webadm on batch" above I was still able to see the process for 3 more minutes

# ps -ef | grep webadm

 webadm 38667148 46203638  44 08:16:01      -  6:31 /apps/dlc117/bin/_progres -b -pf /usr/dbx/live/livecon.pf -p weboe/web-process.p

After about 3 minutes it died.

Posted by gus bjorklund on 16-Jun-2018 10:48

> On Jun 6, 2018, at 12:42 PM, Dmitri Levin wrote:

>

> And after I was able to see " Logout by webadm on batch" above I was still able to see the process for 3 more minutes

>

>

there are several possibilities to explain this behaviour.

0) program was connected to several databases

1) transaction included some temp tables and those changes had to be backed out after database backout finished

3) program captured the disconnect and had some other logic to process before quitting

4) program continued to try to access the disconnected database with multiple retries

5) system was heavily loaded and it took some time to delete the temp files it created

6) there is some sort of bug in _progres=

This thread is closed