What crashed the database?

Posted by craig love on 21-Dec-2017 18:04

Progress 11.3.2

RHEL 6.8

Last night our production database crashed and I'm trying to figure out what actually caused the problem.

I believe this is the event that caused the db to shutdown.


[2017/12/22@00:03:50.089+1100] P-13116      T-140203689506592 I ABL   105: (452)   Login by dtlive on /dev/pts/17.
[2017/12/22@00:03:50.203+1100] P-13116      T-140203689506592 I ABL   105: (7129)  Usr 105 set name to .
[2017/12/22@00:03:50.254+1100] P-22732      T-140669041129216 I SRV    16: (8873)  Login usernum 548, remote SQL client.
[2017/12/22@00:03:50.254+1100] P-22732      T-140669041129216 I SRV    16: (7129)  Usr 548 set name to ODBC.
[2017/12/22@00:04:21.533+1100] P-13161      T-140376655259424 I ABL   106: (452)   Login by dtlive on /dev/pts/18.
[2017/12/22@00:04:21.617+1100] P-13161      T-140376655259424 I ABL   106: (7129)  Usr 106 set name to .
[2017/12/22@00:04:51.483+1100] P-13214      T-140429425911584 I ABL   117: (452)   Login by dtlive on /dev/pts/21.
[2017/12/22@00:04:51.566+1100] P-13214      T-140429425911584 I ABL   117: (7129)  Usr 117 set name to .
[2017/12/22@00:05:01.888+1100] P-4469       T-140597915551488 I WDOG   54: (2526)  Disconnecting client 469 of dead server 17.
[2017/12/22@00:05:01.888+1100] P-26020      T-140637056100128 I ABL   115: (453)   Logout by aubatch on batch.
[2017/12/22@00:05:01.888+1100] P-4469       T-140597915551488 I WDOG   54: (2256)  SYSTEM ERROR: User 469 died during microtransaction.
[2017/12/22@00:05:01.889+1100] P-5037       T-139746551977728 I BACKUP146: (5462)  End backup of Data file(s).
[2017/12/22@00:05:01.889+1100] P-5037       T-139746551977728 I BACKUP146: (1726)  Backup cancelled, disconnected by server.
[2017/12/22@00:05:01.892+1100] P-2161       T-139818795464480 I ABL   150: (453)   Logout by  on batch.
[2017/12/22@00:05:01.896+1100] P-5037       T-139746551977728 I BACKUP146: (453)   Logout by root on batch.
[2017/12/22@00:05:01.904+1100] P-16339      T-139809017812736 I SRV     5: (2520)  Stopped.
[2017/12/22@00:05:01.904+1100] P-16341      T-140515433113344 I SRV     6: (2520)  Stopped.
[2017/12/22@00:05:01.904+1100] P-20451      T-139978265982720 I SRV    10: (2520)  Stopped.
[2017/12/22@00:05:01.931+1100] P-26912      T-139692325250816 I SRV    13: (2520)  Stopped.
[2017/12/22@00:05:01.956+1100] P-14978      T-140674025031424 I SRV     4: (2520)  Stopped.
[2017/12/22@00:05:01.995+1100] P-4463       T-140173412206336 I AIW    53: (2520)  Stopped.
[2017/12/22@00:05:01.996+1100] P-4457       T-140437947307776 I BIW    52: (2520)  Stopped.
[2017/12/22@00:05:01.996+1100] P-4463       T-140173412206336 I          : (453)   Logout by root on batch.
[2017/12/22@00:05:01.997+1100] P-4457       T-140437947307776 I          : (453)   Logout by root on batch.
[2017/12/22@00:05:02.029+1100] P-4477       T-140387071031040 I APW    55: (453)   Logout by root on batch.
[2017/12/22@00:05:02.154+1100] P-30755      T-140557580703488 I SRV    14: (2520)  Stopped.
[2017/12/22@00:05:02.184+1100] P-23907      T-140350851135232 I SRV    11: (2520)  Stopped.
[2017/12/22@00:05:02.247+1100] P-26861      T-140579751507712 I SRV    12: (2520)  Stopped.
[2017/12/22@00:05:02.356+1100] P-4450       T-140582885857024 I BROKER  1: (453)   Logout by root on batch.
[2017/12/22@00:05:02.888+1100] P-4469       T-140597915551488 I WDOG   54: (2526)  Disconnecting client 470 of dead server 17.
[2017/12/22@00:05:03.007+1100] P-5527       T-140634188207904 I APPSRV104: (453)   Logout by nvyas on batch.

                Fri Dec 22 00:05:03 2017
[2017/12/22@00:05:03.089+1100] P-13246      T-140440726918912 I DBUTIL   : (451)   prostrct list session begin for root on batch.
[2017/12/22@00:05:03.118+1100] P-13246      T-140440726918912 I DBUTIL   : (334)   prostrct list session end.
[2017/12/22@00:05:03.243+1100] P-4417       T-140192284309248 I BROKER  0: (15192) The database will complete shutdown within approximately 60 seconds.
[2017/12/22@00:05:03.248+1100] P-4417       T-140192284309248 I BROKER  0: (2249)  Begin ABNORMAL shutdown code 2
[2017/12/22@00:05:03.248+1100] P-4417       T-140192284309248 I BROKER  0: (2525)  Disconnecting dead server -1.
[2017/12/22@00:05:03.264+1100] P-9179       T-139789519144736 I ABL    73: (453)   Logout by  on batch.
[2017/12/22@00:05:03.361+1100] P-8533       T-139741812905760 I ABL   122: (453)   Logout by  on batch.
[2017/12/22@00:05:03.426+1100] P-8814       T-139806391924480 I SRV     3: (2520)  Stopped.
[2017/12/22@00:05:03.662+1100] P-5299       T-140117125142304 I APPSRV169: (453)   Logout by mfg on batch.
[2017/12/22@00:05:03.666+1100] P-6214       T-140639833450240 I SQLREMC 541: (-----) Disconnected due to shutdown of primary database: /db/live/mfgprod
[2017/12/22@00:05:03.666+1100] P-6214       T-140639833450240 I SRV     2: (453)   Logout by ODBC on  .
[2017/12/22@00:05:03.667+1100] P-6214       T-140639833450240 I SRV     2: (7132)  dsmAuditingEnableGet called for invalid user. rtc: -20031.
[2017/12/22@00:05:03.669+1100] P-22732      T-140669058762496 I SQLREMC 500: (-----) Disconnected due to shutdown of primary database: /db/live/mfgprod
[2017/12/22@00:05:03.669+1100] P-22732      T-140669058762496 I SRV    16: (453)   Logout by ODBC on  .

Now user 469 is an odbc sql user.  And should only be reading (sql select) from the database.

So what could it be doing to cause a transaction?

Any idea how to avoid this? (2nd time in the last few weeks. previously no problems. The only thing I believe that has happened is the splitting up of sql and 4gl clients on to different servers).

REF:

[2017/12/22@00:03:25.127+1100] P-22756      T-139671805683456 I SRV    17: (8873)  Login usernum 469, remote SQL client.
[2017/12/22@00:03:25.127+1100] P-22756      T-139671805683456 I SRV    17: (7129)  Usr 469 set name to ODBC.
[2017/12/22@00:03:25.157+1100] P-22980      T-139876293494528 I SRV    18: (8873)  Login usernum 468, remote SQL client.
[2017/12/22@00:03:25.157+1100] P-22980      T-139876293494528 I SRV    18: (7129)  Usr 468 set name to ODBC.

All Replies

Posted by Paul Koufalis on 21-Dec-2017 20:31

Check the ODBC DSN. The default isolation level is READ COMMITTED which locks records.  User 469 died during a microtransaction, which is not a transaction in the 4GL sense. It's internal to the DB.

It could also be that client 470 did something non-kosher that killed server 17, and user 469 was an innocent bystander.

Posted by Paul Koufalis on 21-Dec-2017 20:34

Somehow my post got deleted. Oh well.

Check the isolation level in the ODBC DSN. If it's READ COMMITTED, then the ODBC client is locking records even if it isn't modifying them.

It is also possible that another user of server 17 (like user 470) did something non-kosher, killed the server, and user 469 was in innocent bystander killed by the wdog.

Posted by Libor Laubacher on 21-Dec-2017 21:22

>>

Any idea how to avoid this? (2nd time in the last few weeks. previously no problems. The only thing I believe that has happened is the splitting up of sql and 4gl clients on to different servers).

<<

Spliting up the clients to go to their own broker was a good idea, it should be like this. On how to avoid it - best is to find the query causing this (unless there is someone or something killing the process), but narrowing down the query can be long and painful process.

If the queries you are running haven't changed or new ones added, I'd check indexes (idxcheck) to make sure there is no corruption there and also would set -Ma 1 and -Ma -1 for the SQL broker, so should this happen again, the client won't take up others with it - as Paul suggested a client could have "just" crashed without doing anything disruptive on its own, but since it crashed the server, it killed off other clients with it and one of that of in MTX.

Actually, I recall a similar issue in past, there were SQL crashes taking the database down, we never got to 100% pinpoint the reason of the crashes (bug suspected, but upgrade wasn't possible at that time). after setting -Ma & -Mi 1, the SQL client occassionally still kept crashing, but since its alone there, it never took the db anymore down with it.

Posted by George Potemkin on 21-Dec-2017 23:29

> Any idea how to avoid this? (2nd time in the last few weeks. previously no problems.

How often you get the errors 2526 without db crash?:

> WDOG   54: (2526)  Disconnecting client 470 of dead server 17.

> WDOG   54: (2256)  SYSTEM ERROR: User 469 died during microtransaction.

User (really a server on behalf of user) was /waiting/ for a buffer lock. In other words, it tried to read data but did not even get an access to data.

> BACKUP146: (5462)  End backup of Data file(s).

Probably the server was wiating for a buffer locked by probkup. I guess probkup can lock the buffers a bit longer than the processes of any other types.

First thought: the server was killed -9. Note that the chances to crash a database by killing a process that reads database rather low. I would estimate them as 10% or less. So if you have a couple of the incidents with db crash then I would expect that kill -9 was used many times.

Posted by lbryan on 22-Dec-2017 02:49

I've seen this happen before on client systems where a rogue report has crashed the database but the report in question couldn't be tracked down. I found that setting a min port , max port range so that each of the brokers were within their own defined range of ports prevented the database from crashing.

Posted by George Potemkin on 22-Dec-2017 07:18

> I found that setting a min port , max port range so that each of the brokers were within their own defined range of ports prevented the database from crashing.

Port ranges separated per login brokers is indeed a good thing but how can it prevent the servers from crashing? Even theoretically! Login broker decides which port to use to start a server. It does not matter for a server if it's a port 3000 or 30000. A server does not know if the nearest ports are in use and by which processes. I guess there was another change that solved the problem. Or, for example, the old port range was used by port scanners but not the new range.

Posted by lbryan on 22-Dec-2017 07:35

Well the issues stopped after this so I thought it might be a helpful suggestion. As I recall, an sql process, which we couldn't track down what was being run was causing a memory leak, memory spiked and the database servers died and eventually the db would crash. When the port ranges were defined, the issues ceased. Exactly what the route cause was, I never did find out but I was happy with the result so thought I would share.

This thread is closed