Does anybody see "multi-threaded" ABL sessions in the db logs after a signal is received? For example, after a HANGUP signal?
Try:
grep " T-2 " *.lg
T-2 I ABL 1923: (562) HANGUP signal received.
I saw the same issue for other signals:
Received signal 7; handling as SIGHUP. (4375)
KILL signal received. (298)
but these signals are rarely used.
Statistics from a customer running HP-UX, comparing the number of “HANGUP signal received” messages issued by T-2 to the total number of such messages:
2018/05/31 10:37:45   24 of 3388 (0.7%)
2018/04/20 18:47:41  308 of 2016 (15.2%)
2017/04/25 11:43:03   14 of 2509 (0.7%)
2017/04/20 11:50:05   24 of 2653 (0.9%)
2016/03/02 18:02:00  780 of 8012 (9.7%)
These were not full-day logs, just the db logs up to each incident. That is why the percentage of fake threads varies between logs. Most of the fake threads occurred in the middle of the day. It looks like there are hundreds of “T-2” messages per day. Only 5 of them in the last two years resulted in critical issues for a database.
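For anyone who wants to collect the same ratio from their own logs, here is a minimal sketch in Python. It assumes only what is visible in the log line quoted above: the thread tag appears as " T-2 " and the message carries the number (562) with the text "HANGUP signal received". The function name `hangup_stats` is made up for this example.

```python
from typing import Iterable, Tuple

HANGUP_TEXT = "(562) HANGUP signal received"  # as it appears in the quoted log line
FAKE_THREAD_TAG = " T-2 "                     # same pattern as the grep above


def hangup_stats(lines: Iterable[str]) -> Tuple[int, int, float]:
    """Return (count from T-2, total count, percentage) of HANGUP messages."""
    total = 0
    from_t2 = 0
    for line in lines:
        if HANGUP_TEXT not in line:
            continue
        total += 1
        # Prepend a space so a line that starts with "T-2 " also matches.
        if FAKE_THREAD_TAG in " " + line:
            from_t2 += 1
    pct = round(100.0 * from_t2 / total, 1) if total else 0.0
    return from_t2, total, pct
```

The function accepts any iterable of lines, so an open file object for a .lg file can be passed to it directly (opening the file with errors="replace" helps if the log contains odd bytes).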
Does it happen only on HP-UX?
signal 7 is SIGEMT. This is used on some UNIX systems to get a process’s attention in various situations such as when a disconnect is requested. Also, IIRC, proshut uses it to signal the main broker.
Hi Gus! Yes, signal 7 is SIGEMT on HP-UX. In Progress V10.2B on HP-UX the broker used it to notify local clients to check the Usrtodie flag. But the client sessions handled SIGEMT as a HANGUP signal. Since V11, Progress on HP-UX uses SIGTRAP instead of SIGEMT. Progress on Solaris and AIX still uses SIGEMT. All Progress versions on Linux use SIGUSR2 for the same purpose. I refer to all of them as the “disconnect” signal.
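The platform differences listed above can be written down as a small lookup. This is only a sketch of what the post says, not an official Progress reference. The function names are invented for the example, and treating every HP-UX release before V11 like 10.2B is an assumption.

```python
import signal
from typing import Optional


def disconnect_signal_name(platform: str, major_version: int = 11) -> Optional[str]:
    """Name of the "disconnect" signal per the post above (None if not covered)."""
    if platform == "HP-UX":
        # The post only confirms SIGEMT for 10.2B; applying it to all
        # pre-V11 releases is an assumption.
        return "SIGTRAP" if major_version >= 11 else "SIGEMT"
    if platform in ("Solaris", "AIX"):
        return "SIGEMT"
    if platform == "Linux":
        return "SIGUSR2"
    return None


def local_signal_number(name: str) -> Optional[int]:
    """Number of the named signal on the machine running this script, if defined."""
    sig = getattr(signal, name, None)
    return int(sig) if sig is not None else None
```

The second helper is there because signal numbers are not portable: SIGEMT, for example, is simply not defined on some systems, in which case `local_signal_number("SIGEMT")` returns None.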
But my point was: the fake threads are not specific to SIGHUP signals. Any handled signal has a chance of causing a “dissociative identity disorder” in a Progress session. But I'm not sure whether the issue exists only on HP-UX (Itanium) or not.
not sure what you mean by fake threads.
_progres on UNIX and Linux has only the main thread. When a signal occurs, the main thread’s execution is interrupted and control is given to the signal handler. depending on where the main thread is interrupted, the signal handler can’t do various things and there are library functions which may not be used in signal handlers. in easy cases when things go properly, the main thread continues execution after the signal handler returns.
further complication is that certain signals are ignored sometimes but not always. for example, SIGINT (normally from ctrl-c) must be handled in an interactive session but ignored in batch and is ignored by servers.
further complication is that some signals, for example a floating point error or illegal instruction, are fatal errors and the process must be terminated /without/ continuing the main thread (otherwise the error will just reoccur).
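As a rough illustration of the "easy case" described above (the handler interrupts the main thread, does very little, and the main thread then continues), here is a sketch in Python. It is only an analogy: Python itself defers handlers and always runs them in the main thread, so it says nothing about how _progres is implemented in C. The names `note_signal` and `run_demo` are invented for the example, and it assumes a UNIX system and Python 3.8 or later.

```python
import signal
import threading

pending = []  # (signal number, ran in main thread?) noted by the handler


def note_signal(signum, frame):
    # Keep the handler tiny: just record the signal and the thread it ran in.
    in_main = threading.current_thread() is threading.main_thread()
    pending.append((signum, in_main))


def run_demo():
    """Send SIGHUP to ourselves and return what the handler recorded."""
    pending.clear()
    previous = signal.signal(signal.SIGHUP, note_signal)
    try:
        signal.raise_signal(signal.SIGHUP)
        # The main flow resumes here after the handler has returned.
        return list(pending)
    finally:
        signal.signal(signal.SIGHUP, previous)  # restore the old disposition
```

In this sketch the handler always reports the main thread, which corresponds to the T-1 tag one would expect in the db log for a session that has only a main thread.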
> not sure what you mean by fake threads.
I expect to see this in the db log:
T-1 I ABL 1923: (562) HANGUP signal received.
instead of
T-2 I ABL 1923: (562) HANGUP signal received.
Examples:
community.progress.com/.../23440
Upgrade from 10.2B0817 to 11.7.2 did not eliminate 1077's:
SYSTEM ERROR: Attempt to free buffer type 2 (1077)
which in fact is tightly linked to:
INTERNAL ERROR: pnmsgbuf chain in inconsistent state
Since the upgrade to 11.7.2 the error has not crashed a database. We are just unable to disconnect the sessions (with open transactions) after these errors. But now we can generate their protrace files:
(6) 0xc00000000064d530 ___ksleep + 0x30 [/usr/lib/hpux64/libc.so.1]
(7) 0xc0000000002f1f60 __mxn_sleep + 0x1190 at /ux/core/libs/threadslibs/src/common/pthreads/sleep.c:1260 [/usr/lib/hpux64/libpthread.so.1]
(8) 0xc000000000226930 __pthread_cond_wait + 0x3050 at /ux/core/libs/threadslibs/src/common/pthreads/cond.c:3127 [/usr/lib/hpux64/libpthread.so.1]
(9) 0xc0000000002239b0 __pthread_cond_wait + 0xd0 at /ux/core/libs/threadslibs/src/common/pthreads/cond.c:2249 [/usr/lib/hpux64/libpthread.so.1]
(10) 0xc00000003cd63800 ThreadMain + 0x12a0 at /build/slot1/p800_P/src/lib/cs/unix/hp700_ux90/amqxprmx.c:1996 [/opt/mqm/lib64/libmqe_r.so]
(11) 0xc0000000002400e0 __pthread_bound_body + 0x1c0 at /ux/core/libs/threadslibs/src/common/pthreads/pthread.c:4929 [/usr/lib/hpux64/libpthread.so.1]
> further complication is that certain signals are ignored sometimes but not always.
Gus, I'm currently working (in my free time) on an article about signal handling in Progress. I can send it to you for proofreading when it is done, if you're interested.
sure, i can do that.
be aware that signal handling in the PASOE agent processes is radically different due to the multithreading. I wrote a mostly new handler for that. i don’t know what changes have been made since then.
also, signal 7 is SIGEMT on HP-UX. on others it is SIGABRT.
> be aware that signal handling in the PASOE agent processes is radically different due to the multithreading.
I'm writing only about signal handling in ABL sessions (and just a little bit about broker, servers, page writers and promon). Still it's a lot of information that is not well known.
> also, signal 7 is SIGEMT on HP-UX. on others it is SIGABRT.
Yes, SIGEMT is not specified in the POSIX standard.
It's a bit inconvenient that Progress uses different signals for the same task. Why isn't it, let's say, SIGTRAP everywhere (instead of SIGEMT and SIGUSR2)? Sorry, I'm grumbling as usual. ;-)
The names are the names from the UNIX documentation. While the signal names are mostly the same, there are a few that have different names because of history. Original UNIX was written for PDP-11 and some names come from the PDP-11 processor architecture. A few were changed later on.
there are also three different signal handling APIs. The original one had some serious flaws. While it is still in the C library, no one should use it except for the simplest possible use cases (like terminating the process after n seconds, with no cleanup).