Were the changes written to BI file?

Posted by George Potemkin on 02-Mar-2018 08:06

Friday's question.

But at first the very long introduction (sorry).

Self-service session crashed holding a latch:

[2018/02/23@16:46:55.591+0600] P-6053 T-1 I WDOG 2462: (5028) SYSTEM ERROR: Releasing regular latch. latchId: 7 
[2018/02/23@16:46:55.592+0600] P-6053 T-1 I WDOG 1104: (2522) User 2462 died holding 1 shared memory locks. 

It's very unexpected that the session held the GST latch but it's a different story.

Ones of the first disconnected sessions were the background processes:

[2018/02/23@16:46:55.591+0600] P-6051 T-1 I APW 1102: (453) Logout by xxxxxx on batch. 
[2018/02/23@16:46:55.593+0600] P-6055 T-1 I AIW 1105: (453) Logout by xxxxxx on batch. 
[2018/02/23@16:46:55.600+0600] P-6054 T-1 I BIW 1106: (2520) Stopped. 
[2018/02/23@16:46:55.687+0600] P-6053 T-1 F WDOG 1104: (1007) SYSTEM ERROR: unexpected wakeup from wait code -2, user 1104 
[2018/02/23@16:46:55.687+0600] P-6053 T-1 I WDOG 1104: (439) ** Save file named core for analysis by Progress Software Corporation. 
[2018/02/23@16:46:55.687+0600] P-6053 T-1 I WDOG 1104: (453) Logout by xxxxxx on batch. 
[2018/02/23@16:46:55.691+0600] P-6050 T-1 I APW 1103: (453) Logout by xxxxxx on batch. 
[2018/02/23@16:46:57.820+0600] P-24168 T-1 I BROKER 0: (15192) The database will complete shutdown within approximately 60 seconds. 
[2018/02/23@16:46:57.821+0600] P-24168 T-1 I BROKER 0: (2249) Begin ABNORMAL shutdown code 2

Promon session started by dbmon script was disconnected a bit later and fortunately it gathered the information at the right time:

02/23/18 Activity: Summary
16:46:59 02/23/18 16:41 to 02/23/18 16:46 (5 min 8 sec)

Event Total Per Sec |Event Total Per Sec
Record Waits 11 0.0 |Active trans 217

Writes by APW 93 % Writes by BIW 58 % Writes by AIW 99 %
DB Size: 3643 GB BI Size: 11 GB AI Size: 625 MB
Empty blocks: 457 Free blocks: 52241 RM chain: 113
Buffer Hits 99 % Primary Hits 99 % Alternate Hits 99 %
0 Servers, 3897 Users (2392 Local, 1505 Remote, 750 Batch), 0 Apws

"Status: Active Transactions" screen gave us the list of those 217 transactions.

188 transactions were in "Active" state and the rest 29 were in "Begin" state.
121 transactions created by SELF/ABL sessions.

Only 10 sessions backed out their transactions (3 transactions in "Begin" state and 7 in "Active" state):

[2018/02/23@16:47:04.576+0600] P-17891 T-1 I ABL 1315: (2252) Begin transaction backout. 
[2018/02/23@16:47:04.577+0600] P-17891 T-1 I ABL 1315: (2253) Transaction backout completed. 
...
[2018/02/23@16:47:05.364+0600] P-28364 T-1 I ABL 1602: (2252) Begin transaction backout. 
[2018/02/23@16:47:05.364+0600] P-28364 T-1 I ABL 1602: (2253) Transaction backout completed.

Then db broker died as well:

[2018/02/23@16:47:05.574+0600] P-24168 T-1 F BROKER 0: (1007) SYSTEM ERROR: unexpected wakeup from wait code -2, user 0 
[2018/02/23@16:47:05.574+0600] P-24168 T-1 I BROKER 0: (439) ** Save file named core for analysis by Progress Software Corporation. 
[2018/02/23@16:47:05.574+0600] P-24168 T-1 I BROKER 0: (453) Logout by xxxxxx on batch.

[2018/02/23@16:47:06.587+0600] P-24168 T-1 I BROKER : (16869) Removed shared memory with segment_id: 32 
[2018/02/23@16:47:06.587+0600] P-24168 T-1 I BROKER : (16869) Removed shared memory with segment_id: 33 
[2018/02/23@16:47:06.587+0600] P-24168 T-1 I BROKER : (16869) Removed shared memory with segment_id: 34 
[2018/02/23@16:47:06.587+0600] P-24168 T-1 I BROKER : (334) Multi-user session end.

There were another 2019 logout messages written to db log after "Multi-user session end".

Among them there were the sessions with the active transactions:

02/23/18 Status: Active Transactions by user number for all tenants
16:46:59 
Usr:Ten Name Domain Type Login time Tx start time Trans id BI RReads BI RWrites Trans State
2618 xxxxxx 0 SELF/ABL 02/23/18 09:29 02/23/18 16:37 1998135437 0 55 Active FWD 
2804 yyyyyy 0 SELF/ABL 02/23/18 13:01 02/23/18 16:39 1998555745 0 55 Active FWD 
3218 zzzzzz 0 SELF/ABL 02/23/18 11:34 02/23/18 16:43 1999277796 0 93 Active FWD
[2018/02/23@16:52:58.527+0600] P-20577 T-1 I ABL 2618: (453) Logout by xxxxxx on /dev/pts/1414. 
[2018/02/23@17:02:18.414+0600] P-5030 T-1 I ABL 2804: (453) Logout by yyyyyy on /dev/pts/1283. 
[2018/02/23@17:05:10.474+0600] P-7933 T-1 I ABL 3218: (453) Logout by zzzzzz on /dev/pts/1480.

They did not have the "transaction backout" messages.

It's, of course, possible to start a database after a previous "multi-user session end" and the database was started.
And logouts of the processes from previous multi-user session lasted next 18 minutes.

End of long introduction.

Short question:
Did these old sessions with active transactions write their recovery notes to bi files after new multi-user session was started?

New multi-user session wrote:

BROKER 0: (12080) Begin Replication Redo Phase for 33 live transactions from 327680 . 
BROKER 0: (7163) Begin Physical Undo 256 transactions at block 341145 offset 16382

256 is more than the number of transactions at the beginning of shutdown.

Unfortunately the customer did not use the -crTXDisplay parameter to start the database. So I don't know which transactions were undone.

Posted by gus bjorklund on 02-Mar-2018 14:00

hard to know for sure. most likely not. but:

when the error 5028 and 2522 were detected, the detector (WDOG in this case) would have turned on the abnormal shutdown flag. this flag is checked in the relevant write functions before all data extent, bi, and ai writes and if enabled then the write is skipped. this applies to all database connections, self-serving, served, and utility. apw, biw and aiw too. when a process discovers that the abnormal shutdown flag is enabled, it is supposed to go away as soon as possible, without doing anything more. but it may be connected to other databases from which it has to back out transactions.

often, when this sitation arises, a variety of side-effects occur with error messages. like the error 1007 from the WDOG.

it is possible that the begin transaction backout messages were written but the transactions were not actually backed out in the database in question because all database i/o was disabled. the code that writes the messages does not discover that until after the messages are written.

to answer the question:

it should not have been possible to restart the database while there were old processes still connected. the startup code has checks to see if shared memory exists and if there are processes with old extant shared memory attachments before starting anything. if so, that code is supposed to prevent startup.

without examining the bi file it is impossible to know if something got written by an old process from the previous session.

an ai scan might reveal something.

-gus

All Replies

Posted by gus bjorklund on 02-Mar-2018 14:00

hard to know for sure. most likely not. but:

when the error 5028 and 2522 were detected, the detector (WDOG in this case) would have turned on the abnormal shutdown flag. this flag is checked in the relevant write functions before all data extent, bi, and ai writes and if enabled then the write is skipped. this applies to all database connections, self-serving, served, and utility. apw, biw and aiw too. when a process discovers that the abnormal shutdown flag is enabled, it is supposed to go away as soon as possible, without doing anything more. but it may be connected to other databases from which it has to back out transactions.

often, when this sitation arises, a variety of side-effects occur with error messages. like the error 1007 from the WDOG.

it is possible that the begin transaction backout messages were written but the transactions were not actually backed out in the database in question because all database i/o was disabled. the code that writes the messages does not discover that until after the messages are written.

to answer the question:

it should not have been possible to restart the database while there were old processes still connected. the startup code has checks to see if shared memory exists and if there are processes with old extant shared memory attachments before starting anything. if so, that code is supposed to prevent startup.

without examining the bi file it is impossible to know if something got written by an old process from the previous session.

an ai scan might reveal something.

-gus

Posted by George Potemkin on 03-Mar-2018 02:48

Thanks, Gus! Now all pieces of puzzle came together for me.

> the startup code has checks to see if shared memory exists and if there are processes with old extant shared memory attachments before starting anything. if so, that code is supposed to prevent startup.

In our case the shared memory was removed by broker after it got the fatal error (SYSTEM ERROR: unexpected wakeup from wait code -2, user 0). It's supposed that after the fatal errors the processes should leave a database without making any changes. I guess the broker has removed the shared memory segments and semaphores because it saw that watchdog is dead and nobody can care about the database anymore.

As we know OS will not release the shared memory untill there are the processes that are connected to these segements. The startup code will be, of course, unable to see the previous memory. That is why I was afraid that db changes stored in old shared memory could be saved on disk after new multi-user session is started.

It's a bit strange that only one of the old processes complained about the semaphores:

[2018/02/23@17:07:08.502+0600] P-24554      T-1     I ABL  1971: (1075)  Semaphore id 147 was removed 
[2018/02/23@17:07:08.502+0600] P-24554      T-1     F ABL      : (6517)  SYSTEM ERROR: Unexpected error return from semAdd -1 

This thread is closed