11.6.3 promon crash/DB shutdown bug

Posted by Rob Fitzpatrick on 19-Apr-2017 14:34

Just an FYI for people out there running 11.6.3: I found a bug that TS says is new in 11.6.3 and is also present in 11.7.0.

I found it when running a promon script but I am also able to reproduce it running interactively.  It causes promon to crash with error 49 (memory violation).  The promon session holds a lock on the MTX latch when it crashes, so the database shuts down abnormally.

I haven't narrowed down the minimum steps needed to recreate the crash, but I have a script that reproduces it reliably.  The content of the script (with annotations) is below.

m       Modify defaults
1       page size
9999
q
1       User control
1       all users
q
4       Record locking table
1       all users
q
5       Activity
q
6       Shared resources
q
7       Database status
q
R&D
debghb
5       Adjust monitor options
1       Display page length
9999
6       Number of auto repeats
3
t       Main menu
1       Status displays
4       Processes/clients
2       Blocked clients
p
3       Active transactions
p
p
9       BI log
p
10      AI log
p
12      Startup parameters
p
13      Shared resources
p
14      Shared memory segments
p
17      Servers by broker
t       Main menu
2       Activity displays
3       Buffer cache
p
5       BI log
p
6       AI log
p
10      Space allocation
p
13      Other
t       Main menu
3       Other displays
1       Performance indicators
p
4       Checkpoints
t       Main menu
6       Hidden menu
8       Resource queues
p
11      Latch counts
x       Exit

You would have to remove the annotations to have a functioning script for promon stdin.  I showed them here for clarity.
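Removing the annotations by hand is error-prone, so here is a minimal Python sketch of one way to do it. It assumes that every promon input in the script is a single token with no embedded spaces, that the annotation is whatever follows the first run of whitespace, and that blank lines are not meaningful input. The function name `strip_annotations` is made up for this illustration.

```python
def strip_annotations(annotated_script: str) -> str:
    """Keep only the first whitespace-delimited token of each non-blank line."""
    keys = []
    for line in annotated_script.splitlines():
        tokens = line.split()
        if tokens:  # assumption: blank lines carry no input and can be dropped
            keys.append(tokens[0])
    return "\n".join(keys) + "\n"


# A few lines copied from the annotated script above.
sample = """\
m       Modify defaults
1       page size
9999
q
R&D
debghb
5       Adjust monitor options
"""

print(strip_annotations(sample), end="")
```

The output is one keystroke sequence per line, which is the form the script needs to be in before it is fed to promon on stdin.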

Workaround: if "debghb" is moved from its location above to the latest point possible, i.e. between "t" and "6", six lines from the bottom, then promon does not crash and the DB does not shut down.  I hope that makes sense.
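For anyone who keeps such scripts as a list of inputs, the workaround can be sketched in Python as below. This only illustrates the edit described above, under the assumption that the script is already stripped of annotations and held as a list of strings. The helper name `move_debghb_late` is hypothetical, and the sample list is a shortened stand-in for the real script, not a tested promon session.

```python
def move_debghb_late(keys):
    """Return a copy of keys with "debghb" moved between the last "t" and the "6" after it."""
    if "debghb" not in keys:
        return list(keys)
    rest = [k for k in keys if k != "debghb"]
    # Scan backwards for the last "t" (main menu) immediately followed by "6" (hidden menu).
    for i in range(len(rest) - 2, -1, -1):
        if rest[i] == "t" and rest[i + 1] == "6":
            return rest[: i + 1] + ["debghb"] + rest[i + 1 :]
    raise ValueError('no "t" immediately followed by "6" in the script')


# Shortened stand-in for the script above (illustration only).
short_script = ["R&D", "debghb", "5", "1", "9999", "t", "1", "14", "p", "t", "6", "8", "p", "11", "x"]
print(move_debghb_late(short_script))
```

The function raises rather than guessing when it cannot find the "t" / "6" pair, since a silently misplaced "debghb" would defeat the purpose of the workaround.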

For those who are interested, I will post an update here when the 11.6.3 hotfix is available.

All Replies

Posted by George Potemkin on 19-Apr-2017 14:44

Is it the same issue as the defect # PSC00356177?

community.progress.com/.../30418

Posted by cjbrandt on 19-Apr-2017 14:47

George P reported something that sounds similar on Solaris 64-bit.  What OS did you see this on?

Posted by Rob Fitzpatrick on 19-Apr-2017 14:52

George: it sounds similar, but TS said this issue is not in 11.6.2 and prior.  I'll try to repro in Linux 11.6.3.

CJ: Sorry, I should have given the platform.  I encountered this in 64-bit OE 11.6.3 on Linux x64 but I have also reproduced it in 64-bit 11.6.3 on Windows 7.

Posted by Rob Fitzpatrick on 19-Apr-2017 14:55

Update:

I can reproduce the error 49 following George's steps from the other thread.  So this might be related.

I'll try my steps again without R&D 1 14 and see if that changes anything.

Posted by George Potemkin on 19-Apr-2017 14:58

> The promon session holds a lock on the MTX latch when it crashes

Are you getting the error 5028 for latchId 1 (MTX) or 2 (USR)?

Posted by Rob Fitzpatrick on 19-Apr-2017 14:59

Update 2:

My script above works when R&D 1 14 is removed.  So it does indeed look like this is the same bug George reported, though it is cross-platform.

Posted by Rob Fitzpatrick on 19-Apr-2017 15:06

> Are you getting the error 5028 for latchId 1 (MTX) or 2 (USR)?

The (5028) error was: SYSTEM ERROR: Releasing regular latch. latchId: 2

I'm confused. I thought MTX was 2 and USR was 3.

for each _latch no-lock:
  display _latch-id _latch-name.
end.

1 0
2 MTL_MTX
3 MTL_USR
4 MTL_OM
5 MTL_BIB
...

Posted by George Potemkin on 19-Apr-2017 15:10

_Latch._Latch-Id = real LatchID + 1

Common Progress rule: "plus or minus one" does not matter. ;-)

MTX latch was the first and the only latch in V5 Progress db and it was called MT lock.
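To make the off-by-one rule concrete, here is a small Python sketch that turns the latchId printed in a (5028) message into the matching _Latch row. The table holds only the rows shown in the _Latch listing earlier in this thread; that listing was cut off with "...", so later latches are deliberately absent, and the first row is left out because its name is unclear in the listing. The function and variable names are invented for this example.

```python
# _Latch-Id -> _Latch-Name, copied from the listing earlier in the thread.
# The listing is truncated, so this table is intentionally incomplete.
LATCH_NAMES_BY_VST_ID = {
    2: "MTL_MTX",
    3: "MTL_USR",
    4: "MTL_OM",
    5: "MTL_BIB",
}


def vst_id_from_log_id(log_latch_id):
    """Apply the rule stated above: _Latch._Latch-Id = real latchId + 1."""
    return log_latch_id + 1


def latch_name_from_log_id(log_latch_id):
    """Return the _Latch-Name for a latchId from the log, or None if not in the table."""
    return LATCH_NAMES_BY_VST_ID.get(vst_id_from_log_id(log_latch_id))


print(latch_name_from_log_id(1))  # MTL_MTX
print(latch_name_from_log_id(2))  # MTL_USR
```

So the "latchId: 2" in the (5028) message above corresponds to _Latch-Id 3, the USR latch.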

Posted by Rob Fitzpatrick on 19-Apr-2017 15:19

> Common Progress rule: "plus or minus one" does not matter. ;-)

Well, I learned something new so today's a good day.  :)

> _Latch._Latch-Id = real LatchID + 1

I believe you, but this is non-obvious.  When the 5028 says "latchid" I expect it to mean "_latch-id".

I'm aware of such cases in other tables, like _connect-id = _connect-usr + 1.  It seems like _Latch is missing a field like "_latch-num" to hold the "real" number that shows up in the db log.  And the 5028 message should be reworded.

Posted by George Potemkin on 19-Apr-2017 15:26

> And the 5028 message should be reworded.

And what about the 5029? ;-)

(5029)  SYSTEM ERROR: Releasing multiplexed latch. latchId: 1489504328

It's the BHT latch, by the way. :-)

> 1 0
> 2 MTL_MTX

BTW, Progress does use the memory for the nameless latchId 0 though it's not a real latch.
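For anyone scanning a database log for these two messages, a minimal Python sketch of a parser follows. The pattern is built only from the (5028) and (5029) texts quoted in this thread, so treat it as an assumption that real log lines carry exactly this wording. `parse_latch_release` is a made-up helper name, and it returns the raw latchId without trying to interpret it.

```python
import re

# Built only from the two messages quoted in this thread; other latch-related
# messages may be worded differently.
LATCH_RELEASE_RE = re.compile(
    r"SYSTEM ERROR: Releasing (?P<kind>regular|multiplexed) latch\. "
    r"latchId: (?P<latch_id>\d+)"
)


def parse_latch_release(line):
    """Return (kind, latch_id) if the line holds one of the two messages, else None."""
    match = LATCH_RELEASE_RE.search(line)
    if match is None:
        return None
    return match.group("kind"), int(match.group("latch_id"))


print(parse_latch_release("SYSTEM ERROR: Releasing regular latch. latchId: 2"))
print(parse_latch_release(
    "(5029)  SYSTEM ERROR: Releasing multiplexed latch. latchId: 1489504328"
))
```

Returning the kind alongside the number matters because, as the 5029 example shows, the value in the multiplexed case is not a small index that could be looked up the same way as the regular one.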

Posted by Richard Banville on 19-Apr-2017 15:33

The issue is that when the super secret "debghb" setting is on in promon, examining "14. Shared memory segments" has the adverse side effect of zeroing a pointer it should not be zeroing.

The next reference to this pointer will cause promon to crash.

Depending on when the pointer is accessed, promon may be holding a resource at the time, and that can cause a crash.  This is seen at disconnection but could happen sooner, depending on the activity performed.

Posted by Richard Banville on 19-Apr-2017 15:36

And yes George, I believe it is the same issue, and the fix is available in HotFix 11.6.3.017.

Posted by Rob Fitzpatrick on 19-Apr-2017 15:37

Thanks Rich and George.  That confirms how I should edit my script for safety until I have the fix.

Posted by gus bjorklund on 20-Apr-2017 17:22

> On Apr 19, 2017, at 4:12 PM, George Potemkin wrote:
>
> MTX latch was a first latch in Progress db and it was called db lock.

wrong. sorry george. you get a red card. :)

The first release to have shared memory was 5.2A. In that release, there were two memory locks: the DB lock and the MTX lock.

The MTX lock served a purpose similar to what it does today although the implementation was quite different. For various reasons, /all/ database writes were performed while holding the MTX lock.

The DB lock was used to lock the entire shared memory region when any shared data structure was accessed or modified.

Some other fun facts about the ancient version 5:

* max segment size was 8 MB
* max -B was 32,000
* db and bi block size was 1 kb
* lock table size was limited to 32 kb
* bi cluster size was fixed at 16 kb
* TP1 benchmark performance was about 10 tps
* no data servers
* 4GL could connect to only one database at a time
* there were no internal procedures in 4GL

* there were no internal procedures in 4GL

-gus

Posted by George Potemkin on 21-Apr-2017 02:51

My fault. I thought that DB lock was what we know now as a login semaphore.

Posted by gus bjorklund on 21-Apr-2017 17:14

What used to be the db lock became the USR latch in v 6.3A.

There was a login semaphore in version 5 also, same as now.
