Test DB crash on Linux RH 7.2 while running OE 11.7.2

Posted by dvoyat on 29-Nov-2017 10:15

Hi

We're seeing database crashes while running 11.7.2 on a Linux RH 7.2 server.
This is a test server on which we wanted to check 11.7.2.
It has 8 GB of RAM, and nothing is running on it except one appserver and a few batch processes.

The maximum number of connections to the DB is fewer than 10, including BIW, WDG and APW (we have not enabled AI).

Below is the end of the db log from the last crash:

[2017/11/29@09:30:10.825+0000] P-11752      T-139666002794368 I ABL    32: (1075)  Semaphore id -32767 was removed
[2017/11/29@09:30:10.895+0000] P-12404      T-139627545966464 I ABL    34: (1075)  Semaphore id -32767 was removed
[2017/11/29@09:30:10.895+0000] P-12404      T-139627545966464 F ABL      : (6517)  SYSTEM ERROR: Unexpected error return from semAdd -1
[2017/11/29@09:30:41.510+0000] P-11412      T-139995119421248 I WDOG   28: (2256)  SYSTEM ERROR: User 32 died during microtransaction.
[2017/11/29@09:30:41.510+0000] P-11412      T-139995119421248 I WDOG   28: (2527)  Disconnecting dead user 32.
[2017/11/29@09:30:41.514+0000] P-11379      T-139937231816512 I BIW    26: (2520)  Stopped.

I can't find anything in the appserver log or the batch process logs; the only errors are in the db log file.
The db starts with the default 3 semaphore sets, and the kernel has been tuned with SEMMSL 250, SEMMNI 256 and SEMMNS 32000.
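
For anyone who wants to confirm which semaphore limits are actually in effect without root access, here is a minimal sketch. It relies on the Linux convention that /proc/sys/kernel/sem holds four values in the order SEMMSL, SEMMNS, SEMOPM, SEMMNI; the function names are made up for illustration.

```python
from typing import Dict

# Field order used by /proc/sys/kernel/sem on Linux.
SEM_FIELDS = ("SEMMSL", "SEMMNS", "SEMOPM", "SEMMNI")


def parse_sem_limits(text: str) -> Dict[str, int]:
    """Parse the four whitespace-separated values of /proc/sys/kernel/sem."""
    values = text.split()
    if len(values) != len(SEM_FIELDS):
        raise ValueError(f"expected {len(SEM_FIELDS)} values, got {len(values)}")
    return dict(zip(SEM_FIELDS, (int(v) for v in values)))


def read_sem_limits(path: str = "/proc/sys/kernel/sem") -> Dict[str, int]:
    """Read the live limits from the running kernel (readable by any user)."""
    with open(path) as handle:
        return parse_sem_limits(handle.read())
```

Calling `read_sem_limits()` on the server returns a dict that can be compared with the values the kernel was supposed to be tuned to.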

All active appservers (classic) and batch processes have SELF connections via shared memory.

While running 11.7 earlier we used to get an almost similar error, which we managed to get rid of by setting 2 SEMSETS (it's not easy for us to change kernel parameters as we don't have permanent root access). I'm now in the process of getting more memory (16 GB), but I'm wondering if anyone has already faced something similar.

Most forum references relate to kernel settings and/or memory, but with almost no activity on the database I'm surprised to get a crash. Is there anything in the 11.7.2 features that requires more memory? It may well be that we are close to some OS limits, but we have previously run several test dbs and many more batch processes with 11.6.3 on the same server (same config) without any problems.

Denis

  

Posted by dvoyat on 19-Dec-2017 03:11

Progress confirmed on our support case that the issue is caused by the RemoveIPC setting (it defaults to YES and needs to be reset to NO); alternatively, it can be avoided by running the Progress components as a properly defined system user. It would be good if Progress gathered all Linux-specific settings and recommendations in the installation guide or a similar document.

Denis

All Replies

Posted by cjbrandt on 29-Nov-2017 10:34

Linux will write out to /var/log/messages if the OS killed a process due to a resource issue.  Might want to check.

If the issue isn't related to an OS limit, AI logs can be helpful to see what activity was happening in the db at the time of the crash.  If AI isn't set up, you might want to enable it.
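
The /var/log/messages check can be scripted along these lines. This is only a sketch: the patterns are assumptions based on typical kernel OOM-killer messages, whose exact wording varies between kernel versions, and `find_oom_lines` is a made-up helper name.

```python
import re
from typing import Iterable, List

# Assumed patterns; typical OOM-killer messages contain one of these phrases,
# but the wording is not guaranteed across kernel versions.
OOM_PATTERN = re.compile(r"oom-killer|Out of memory|Killed process \d+", re.IGNORECASE)


def find_oom_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that look like OOM-killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]
```

To use it, open /var/log/messages (this usually needs root) and pass the file object to `find_oom_lines`. An empty result only means none of these patterns matched, not that the OS is ruled out.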

Posted by George Potemkin on 29-Nov-2017 10:40

It's a Progress issue. Something (a bug?) seems to corrupt data in shared memory. The errors you got are the result of reading the corrupted data.

Check https://community.progress.com/community_groups/openedge_rdbms/f/18/t/26225?pi20882=3

Posted by dvoyat on 30-Nov-2017 01:54

Thanks George and Cjbrandt for your replies. Just to add: there was no error in /var/log at all. I can certainly try to enable AI and see if I can get some more detailed information. I've also gone through various posts in the community, including the one George kindly mentioned, but did not find anything obvious, except the suspicion of a bug.

We made some earlier attempts with 11.7 on the same server this spring. We tried to run several test dbs and got something similar. At that time we stopped all dbs, restarted only one and changed the semsets setting to 2 (rather than the default of 3). It worked, but I had to move to other project work and therefore paused further 11.7 Linux testing, which I'm now restarting.

I can also change both the appserver and the batch processes to connect via TCP sockets and see if it goes better. I also left SEMOPM at its default of 32, as I understood from an earlier post that this is not relevant to Progress (or at least was not).

Posted by cjbrandt on 30-Nov-2017 12:00

Track the difference in performance between running batch programs via TCP and via shared memory.  It is noticeable.

Posted by dvoyat on 06-Dec-2017 02:58

Hi

Coming back to this issue with some more findings. It's been a bit painful to capture evidence, but I've been able to reproduce the problem and to eliminate it. I still need to understand what changed between 11.7 and 11.7.2 to make it happen on the same RedHat 7.2 server.

As a short summary, I've been tracking what could remove the semaphores, which eventually crashes the db a bit later. We start the DB from the adminserver (autostart), and the adminserver is started via 'sudo' using a restricted user account. The semaphores are owned by that same restricted account. We also use this restricted account for other maintenance activities, either via manual "sudo" or via scheduled processes (cron).

Whenever I ran a "sudo" command as this restricted account, running "ipcs -s" afterwards showed that there was no longer any semaphore set available.
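
The before/after comparison with "ipcs -s" can be automated roughly as follows. This is a sketch that assumes the usual five-column layout of the util-linux `ipcs -s` output (key, semid, owner, perms, nsems); the helper names and the account name in the usage note are hypothetical.

```python
from typing import Dict, List

IPCS_COLUMNS = ("key", "semid", "owner", "perms", "nsems")


def parse_ipcs_semaphores(output: str) -> List[Dict[str, str]]:
    """Return one dict per semaphore set listed in `ipcs -s` output."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows have five fields and start with a hex key such as 0x00000000;
        # this skips the banner, the column header and blank lines.
        if len(fields) == len(IPCS_COLUMNS) and fields[0].startswith("0x"):
            rows.append(dict(zip(IPCS_COLUMNS, fields)))
    return rows


def sets_owned_by(output: str, owner: str) -> List[str]:
    """Return the semaphore set ids owned by the given account."""
    return [row["semid"] for row in parse_ipcs_semaphores(output) if row["owner"] == owner]
```

Capture the output with `subprocess.run(["ipcs", "-s"], capture_output=True, text=True, check=True).stdout` before and after the "sudo" call, then compare `sets_owned_by(output, "dbadmin")` for the two captures ("dbadmin" stands in for the restricted account). If the list goes from non-empty to empty, the sets were removed.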

I tested several times and each time the semaphores got removed. After a bit of searching I managed to eliminate the problem by disabling "RemoveIPC=yes" in the RH logind.conf (as I understand it, the default since RH 7) and restarting systemd-logind. This does indeed prevent the removal of, at least, the semaphores. Since then the DB has worked fine for more than 24 hours with its regular activity via appserver, batch processes, etc.
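
A quick way to see what a given logind.conf actually sets is sketched below. It only looks at the text of one file (drop-in directories are ignored), and the `default` argument is an assumption: True mirrors the behaviour reported in this thread, so pass False if your build defaults the other way. `remove_ipc_enabled` is a made-up name.

```python
def remove_ipc_enabled(conf_text: str, default: bool = True) -> bool:
    """Return the effective RemoveIPC value found in logind.conf text.

    `default` is used when the key is absent, commented out or empty.
    True matches what was observed in this thread and is an assumption.
    """
    enabled = default
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments (systemd accepts '#' and ';').
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "RemoveIPC":
            value = value.strip().lower()
            # systemd booleans: yes/true/1/on versus no/false/0/off.
            enabled = default if not value else value in ("yes", "true", "1", "on")
    return enabled
```

Feed it the contents of /etc/systemd/logind.conf. A result of True means semaphores owned by a regular user can be removed when that user's last session ends.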

Many forum posts suggest that the proper way of fixing this kind of issue is not to disable "RemoveIPC" but to ensure that all processes and daemons requiring communication and synchronization mechanisms are started by system users, which are excluded from the removal. I'll probably look into that later with more in-house experts, but the fact is that it was working fine with 11.7.
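
To check whether an account would count as a system user, something like the sketch below may help. It assumes the boundary matches SYS_UID_MAX from /etc/login.defs, with 999 as a fallback; systemd fixes its own boundary at build time, so this is only an approximation. Both function names are made up.

```python
def parse_sys_uid_max(login_defs_text: str, fallback: int = 999) -> int:
    """Return SYS_UID_MAX from login.defs text, or `fallback` if it is not set.

    The fallback of 999 is an assumption; check your distribution.
    """
    for raw in login_defs_text.splitlines():
        # Drop trailing comments, then expect "SYS_UID_MAX <number>".
        fields = raw.split("#", 1)[0].split()
        if len(fields) == 2 and fields[0] == "SYS_UID_MAX":
            return int(fields[1])
    return fallback


def is_system_uid(uid: int, sys_uid_max: int) -> bool:
    """True for root and other UIDs at or below the system UID boundary."""
    return 0 <= uid <= sys_uid_max
```

The UID of the account that starts the adminserver can be obtained with `pwd.getpwnam(name).pw_uid` from the standard library and passed to `is_system_uid`.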

I'll contact Progress Tech Support anyway, but if anyone has experience with this I'd be more than happy to get feedback.

As background, all our current application and DB servers have so far been running HP-UX; we're only testing with RH Linux for now.

Denis  

This thread is closed