User was not fully disconnected ?

Posted by dbeavon on 20-Feb-2019 16:28

Given that OE clients are allowed to start transactions and keep them open indefinitely, we are occasionally in the habit of killing long-running client connections with proshut commands.  We have even introduced some auto-killing scripts, for better or worse.

This past week when I was out of the office, there was a user that we had attempted to kill (via an auto-killing script that detects long-running transactions).  But the user was never fully killed, and it seemed to continue to hold onto resources, and it may have even caused the entire database to eventually crash.

Here are the logs that show the crashing of the database.  P26345 was a remote server, and user 1532 was a PASOE client-server connection:

[2019/02/14@12:31:38.457-0500] P-26345      T-1     F SRV    11: (6495)  Out of free shared memory.  Use -Mxs to increase. 
[2019/02/14@12:31:38.457-0500] P-26345      T-1     I SRV    11: (5028)  SYSTEM ERROR: Releasing regular latch. latchId: 17 
[2019/02/14@12:31:38.457-0500] P-26345      T-1     I SRV    11: (5029)  SYSTEM ERROR: Releasing multiplexed latch. latchId: 11916353 
[2019/02/14@12:31:38.457-0500] P-26345      T-1     I SRV    11: (2522)  User 1532 died holding 2 shared memory locks. 
[2019/02/14@12:31:38.457-0500] P-26345      T-1     I SRV    11: (2520)  Stopped. 

[2019/02/14@12:31:38.468-0500] P-5097       T-1     I RPLS   63: (453)   Logout by root on batch. 
[2019/02/14@12:31:38.472-0500] P-26212      T-1     I SRV    10: (2520)  Stopped. 
[2019/02/14@12:31:38.475-0500] P-5120       T-1     I BIW    67: (2520)  Stopped. 
[2019/02/14@12:31:38.475-0500] P-5120       T-1     I          : (453)   Logout by root on batch. 
[2019/02/14@12:31:38.480-0500] P-16375      T-1     I SRV    48: (2520)  Stopped. 
[2019/02/14@12:31:38.483-0500] P-5123       T-1     I AIW    68: (2520)  Stopped. 
[2019/02/14@12:31:38.483-0500] P-5123       T-1     I          : (453)   Logout by root on batch. 
[2019/02/14@12:31:38.494-0500] P-8276       T-1     I OEMAGE126: (453)   Logout by DB_Agent on batch. 
[2019/02/14@12:31:38.510-0500] P-5111       T-1     I APW    65: (453)   Logout by root on batch. 
[2019/02/14@12:31:38.527-0500] P-12441      T-27035 I SRV     5: (453)   Logout by APP_Olm on  . 
[2019/02/14@12:31:38.529-0500] P-21175      T-1     I APPSRV119: (453)   Logout by APP_ArMgmt on batch. 
[2019/02/14@12:31:38.531-0500] P-24198      T-1     I SRV    37: (2520)  Stopped. 
[2019/02/14@12:31:38.544-0500] P-16463      T-1     I SRV    55: (2520)  Stopped. 
[2019/02/14@12:31:38.546-0500] P-5117       T-1     I APW    66: (453)   Logout by root on batch. 
[2019/02/14@12:31:38.560-0500] P-5085       T-1     I BROKER  0: (15192) The database will complete shutdown within approximately 60 seconds. 
[2019/02/14@12:31:38.560-0500] P-5085       T-1     I BROKER  0: (2249)  Begin ABNORMAL shutdown code 2

About fifteen minutes earlier was the point in time when our automated script had tried to kill that long-running transaction (user 1532 connected via client-server). 

Here is that part of the log.

[2019/02/14@12:14:39.820-0500] P-9056       T-1     I SHUT  163: (452)   Login by root on batch. 
[2019/02/14@12:14:39.826-0500] P-9056       T-1     I SHUT  163: (-----) User 1532 disconnect initiated
[2019/02/14@12:14:39.826-0500] P-9056       T-1     I SHUT  163: (453)   Logout by root on batch. 

[2019/02/14@12:14:39.826-0500] P-26345      T-1     I SRV    11: (1166)  Server disconnecting user 1532. 
[2019/02/14@12:14:39.827-0500] P-26345      T-1     I SRV    11: (2252)  Begin transaction backout. 

At most, the attempt to kill the user was responsible for the eventual crash.  In the least, the attempt to kill the user was unsuccessful.  We can see that fifteen minutes later this user was still holding memory locks.

I should also mention that between the time that we attempted to kill this user, and the time the database crashed, we had lots of "lock table overflow" errors in the logs as follows:

[2019/02/14@12:20:58.155-0500] P-8795       T-1     I SRV    29: (915)   Lock table overflow, increase -L on server 
...
[2019/02/14@12:20:58.383-0500] P-5144       T-1     I ABL    70: (915)   Lock table overflow, increase -L on server 
...
[2019/02/14@12:20:59.481-0500] P-5660       T-1     I ABL    96: (915)   Lock table overflow, increase -L on server 
...
[2019/02/14@12:21:00.528-0500] P-5703       T-1     I ABL    98: (915)   Lock table overflow, increase -L on server

I'm guessing that something bad happened during the attempt to disconnect the user from the database.  I am not familiar with some of these messages in the logs but they seem pretty scary, eg: "Out of free shared memory. Use -Mxs to increase".  and "SYSTEM ERROR: Releasing regular latch. latchId: 17 "

 

I'm trying to find KB articles to figure out what actually went wrong.  If anyone can interpret that from the logs above, I would be happy to know.    We are fairly new to running remote servers.  We currently do not use any -Mxs parameter to start the database.

I had hoped that disconnecting client-server users was going to be safer than disconnecting shared-memory users!  Should we avoid auto-killing the long-running transactions when they are associated with a client-server user?

All Replies

Posted by gus bjorklund on 20-Feb-2019 16:38

server 11 was disconnecting user 1532.

user 1532 had a partially completed transaction and the server was in the process of

rolling it back when it ran out of memory for record locks. (rollback sometimes requires

additional record locks)

because of this, the server could not finish rolling back the transaction so it declared

a fatal error which then caused the database to be shut down.

the subsequent error messages are side effects of this.

This thread is closed