Retry Question - user was not disconnected properly?

Posted by dbeavon on 20-Feb-2019 16:32

Given that OE clients are allowed to start transactions and keep them open indefinitely, we are occasionally in the habit of killing long-running client transactions with proshut commands. We have even introduced some auto-killing scripts, for better or worse.

This past week when I was out of the office, there was a user that we had attempted to kill (via an auto-killing script that detects long-running transactions). But the user was never fully killed, and it seemed to continue to hold onto resources, and it may have even caused the entire database to eventually crash.

Here are the logs that show the crashing of the database. P26345 was a remote server, and user 1532 was a PASOE client-server connection:

[2019/02/14@12:31:38.457-0500] P-26345 T-1 F SRV 11: (6495) Out of free shared memory. Use -Mxs to increase. 
[2019/02/14@12:31:38.457-0500] P-26345 T-1 I SRV 11: (5028) SYSTEM ERROR: Releasing regular latch. latchId: 17 
[2019/02/14@12:31:38.457-0500] P-26345 T-1 I SRV 11: (5029) SYSTEM ERROR: Releasing multiplexed latch. latchId: 11916353 
[2019/02/14@12:31:38.457-0500] P-26345 T-1 I SRV 11: (2522) User 1532 died holding 2 shared memory locks. 
[2019/02/14@12:31:38.457-0500] P-26345 T-1 I SRV 11: (2520) Stopped.

[2019/02/14@12:31:38.468-0500] P-5097 T-1 I RPLS 63: (453) Logout by root on batch. 
[2019/02/14@12:31:38.472-0500] P-26212 T-1 I SRV 10: (2520) Stopped. 
[2019/02/14@12:31:38.475-0500] P-5120 T-1 I BIW 67: (2520) Stopped. 
[2019/02/14@12:31:38.475-0500] P-5120 T-1 I : (453) Logout by root on batch. 
[2019/02/14@12:31:38.480-0500] P-16375 T-1 I SRV 48: (2520) Stopped. 
[2019/02/14@12:31:38.483-0500] P-5123 T-1 I AIW 68: (2520) Stopped. 
[2019/02/14@12:31:38.483-0500] P-5123 T-1 I : (453) Logout by root on batch. 
[2019/02/14@12:31:38.494-0500] P-8276 T-1 I OEMAGE126: (453) Logout by DB_Agent on batch. 
[2019/02/14@12:31:38.510-0500] P-5111 T-1 I APW 65: (453) Logout by root on batch. 
[2019/02/14@12:31:38.527-0500] P-12441 T-27035 I SRV 5: (453) Logout by APP_Olm on . 
[2019/02/14@12:31:38.529-0500] P-21175 T-1 I APPSRV119: (453) Logout by APP_ArMgmt on batch. 
[2019/02/14@12:31:38.531-0500] P-24198 T-1 I SRV 37: (2520) Stopped. 
[2019/02/14@12:31:38.544-0500] P-16463 T-1 I SRV 55: (2520) Stopped. 
[2019/02/14@12:31:38.546-0500] P-5117 T-1 I APW 66: (453) Logout by root on batch. 
[2019/02/14@12:31:38.560-0500] P-5085 T-1 I BROKER 0: (15192) The database will complete shutdown within approximately 60 seconds. 
[2019/02/14@12:31:38.560-0500] P-5085 T-1 I BROKER 0: (2249) Begin ABNORMAL shutdown code 2

About fifteen minutes earlier was the point in time when our automated script had tried to kill that long-running transaction (user 1532 connected via client-server).

Here is that part of the log.

[2019/02/14@12:14:39.820-0500] P-9056 T-1 I SHUT 163: (452) Login by root on batch. 
[2019/02/14@12:14:39.826-0500] P-9056 T-1 I SHUT 163: (-----) User 1532 disconnect initiated
[2019/02/14@12:14:39.826-0500] P-9056 T-1 I SHUT 163: (453) Logout by root on batch.

[2019/02/14@12:14:39.826-0500] P-26345 T-1 I SRV 11: (1166) Server disconnecting user 1532. 
[2019/02/14@12:14:39.827-0500] P-26345 T-1 I SRV 11: (2252) Begin transaction backout.

At most, the attempt to kill the user was responsible for the eventual crash. In the least, the attempt to kill the user was unsuccessful. We can see that fifteen minutes later this user was still holding memory locks.

I should also mention that between the time that we attempted to kill this user, and the time the database crashed, we had lots of "lock table overflow" errors in the logs as follows:

[2019/02/14@12:20:58.155-0500] P-8795 T-1 I SRV 29: (915) Lock table overflow, increase -L on server 
...
[2019/02/14@12:20:58.383-0500] P-5144 T-1 I ABL 70: (915) Lock table overflow, increase -L on server 
...
[2019/02/14@12:20:59.481-0500] P-5660 T-1 I ABL 96: (915) Lock table overflow, increase -L on server 
...
[2019/02/14@12:21:00.528-0500] P-5703 T-1 I ABL 98: (915) Lock table overflow, increase -L on server



I'm guessing that something bad happened during the attempt to disconnect the user from the database. I am not familiar with some of these messages in the logs but they seem pretty scary, eg: "Out of free shared memory. Use -Mxs to increase". and "SYSTEM ERROR: Releasing regular latch. latchId: 17 "

I'm trying to find KB articles to figure out what actually went wrong. If anyone can interpret that from the logs above, I would be happy to know. We are fairly new to the use of "remote servers". We currently do not use the -Mxs parameter to start the database.

I had hoped that disconnecting client-server users was going to be safer than disconnecting shared-memory users! Should we avoid auto-killing the long-running transactions when they are associated with a remote client-server user?

All Replies

Posted by dbeavon on 20-Feb-2019 16:47

From gus bjorkland -

server 11 was disconnecting user 1532.
user 1532 had a partially completed transaction and the server was in the process of
rolling it back when it ran out of memory for record locks. (rollback sometimes requires
additional record locks)
because of this, the server could not finish rolling back the transaction so it declared
a fatal error which then caused the database to be shut down.
the subsequent error messages are side effects of this.

Posted by dbeavon on 20-Feb-2019 17:02

Thanks gus,  

Is this type of a problem specific to client-server connections?  We've only recently started using "client-server" connections recently (from PASOE) and I'm very suspicious of the "remote servers".  I don't recall running out of free shared memory from a "shared-memory" client ... regardless of whether any rollbacks were underway or not.

Also, I'm wondering if the client-server connections are using the same pool of locks (-L).  We never reached that limit before the crash occurred.

I hoped that disconnecting "client-server" connections would be much safer than this.  In one of the KB's which explained this message ("user died holding shared memory locks"), there was also a statement saying that a fix is to migrate over to "client-server" rather than "self-service":

https://knowledgebase.progress.com/articles/Article/P36402 

For sites where this occurs frequently, a way to prevent an abnormal database shutdown is to run clients as remote clients as opposed to self-service.

 

Given the crash points to running out of shared memory, should we be changing the default value of -Mxs?  Is that common?  Is it more common for customers as they migrate to "client-server" connections?

Posted by gus bjorklund on 20-Feb-2019 17:16

no, they are not specific to client-server (nor to appservers). in theory, any database connection in multi-user could run out of locks during rollback. in single-user, there are no locks.

yes, all from the same pool of locks, namely the lock table in shared memory. in theory, the lock table could be partially paged to disk in a memory-mapped scratch file, but it is not.

increasing -Mxs is not the solution. you should increase the size of your lock table.

-Mxs exists as an emergency workaround for bugs in the code that calculates how much shared memory is required. this is sort of done by calculating size of data structures times number of entries and adding them all up. -Mxs allows you to add more if the calculation is wrong and gives a size that is too small. currently it does not.

Posted by George Potemkin on 20-Feb-2019 17:18

> (5028) SYSTEM ERROR: Releasing regular latch. latchId: 17

It's LKF latch (Lock Table Free List Latch).

> (5029) SYSTEM ERROR: Releasing multiplexed latch. latchId: 11916353

11916353=101101011101010001000001

low 4 bits = 0001 means it's LKT latch (Lock Table Hash Chain Latches. Family of type 1 Object latches).

> Given the crash points to running out of shared memory, should we be changing the default value of -Mxs?

No. The customer has deleted more records in one transaction than the -L value. It's a point of no return. Such transaction cannot be undone without db crash.

This thread is closed