What Are Very Large Latch IDs?

Posted by Paul Koufalis on 19-Jul-2016 14:16

11.6 64-bit on Linux.

WDOG   62: (5029)  SYSTEM ERROR: Releasing multiplexed latch. latchId: 3096224809273976

What is that latch ID? That's a big honkin' number. 

All Replies

Posted by George Potemkin on 19-Jul-2016 14:24

It's a multiplexed latch of the governing latch BUF

latchId: 3096224809273976 = 0xB000003E6CA78

Tip: high 4 bytes are always 0x000B0000

In the case of an LKT (LHT) latch, the high 4 bytes are always zero. For example:

SYSTEM ERROR: Releasing multiplexed latch. latchId: 43520 
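
If you want to classify these IDs yourself, here is a minimal sketch in C. It is based only on the observation above (high 4 bytes 0x000B0000 for BUF, zero for LKT/LHT), not on any documented encoding, so treat it purely as a heuristic:

#include <stdio.h>
#include <stdint.h>

/* Heuristic decoder for the latchId reported by error 5029.
 * Based purely on observed values, not on documented behaviour. */
int main(void)
{
    uint64_t latch_ids[] = { 3096224809273976ULL, 43520ULL };

    for (int i = 0; i < 2; i++) {
        uint64_t id   = latch_ids[i];
        uint32_t high = (uint32_t)(id >> 32);            /* high 4 bytes */
        uint32_t low  = (uint32_t)(id & 0xFFFFFFFFULL);  /* low 4 bytes  */

        printf("latchId %llu = 0x%08X%08X -> %s\n",
               (unsigned long long)id, (unsigned)high, (unsigned)low,
               high == 0x000B0000 ? "looks like a BUF object latch" :
               high == 0          ? "looks like an LKT/LHT latch"   :
                                    "unknown");
    }
    return 0;
}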

Posted by Paul Koufalis on 19-Jul-2016 20:40

Thanks George!

Posted by George Potemkin on 19-Jul-2016 23:32

BTW, does anybody know how Progress calls these (sub-)latches of the governing latches (BUF and LKT)?

Progress seems to use the latObjLatchLock() call to lock the multiplexed latches of the BUF latches - the Buffer Control Block latches.
The article 000035040 seems to call them the object latches:
knowledgebase.progress.com/.../000035040
"Upgrade to OpenEdge 10.2B08, 11.3.0.0 or later where the problem of the process not in service when executing a critical session on the object latches has been addressed."

latObjLatchFree: owner %d of object latch %i is not me (lockCnt: %i, addr: %x) latch stack: %d
latObjLatchLock2: latchNum: %l, (lockCnt: %i, addr: %x) latch stack: %d latches: %x
latObjLatchLock being called with a latch other than a lock table chain latch


But they do not seem to be the same as the Object Number Latches introduced in 11.6:
User <usrnum> died holding the Object Number Latch. (18115)
"Object Number Latch" seems to be one of the "service" (?) latches like:
BIADD Latch
AI Mgt Latch
XAlist Latch
DB Service Latch
Large Keys Enablement Latch
Audit Policy Latch
Object Number Latch
Audit Latch
SQLTSINFO Latch
REPL EOF Latch

Progress seems to use the pmObjectLockLock call to lock the multiplexed latches of the LKT latches (called "LHT" in _Latch VST) - Lock Table Hash Chain latches (introduced in 10.0A).
10.2B06 Service Pack Readme seems to call them the private latches:
knowledgebase.progress.com/.../fileField
"Replication server can get blocked on private latch in pmObjectLockLock() when more than 1 remote SQL user is connected to a SQL92 remote server."

Do the new messages introduced in V11 talk about the same private latches?

11.0: SYSTEM ERROR: User <usrnum> died holding the <latch-name> latch. (15767)
Desc: The user has died while holding a private latch.

11.6: User <user-number> died holding the <private-latch-name>. (18386)
Desc: User with usernum died holding a private latch as specified.

Just curious as always.

Posted by ChUIMonster on 20-Jul-2016 06:46

I am very interested too.

Posted by Richard Banville on 20-Jul-2016 09:07

I gave a talk on latches at the various conferences some years back.  However, since then some things have changed which can make this stuff a bit more confusing than it really is.

> BTW, does anybody know how Progress calls these (sub-)latches of the governing latches (BUF and LKT)?

Do you mean what we call them or how we call them?

They are called object latches for BUF (true object latches type 2) and Latch families for LKT (one latch governs multiple objects but not all of them).

We implemented true object latches by embedding the actual latch value/structure within the structure that requires protection.

In other words, for the BUF latches, each -B buffer control structure (bktbl_t) has a latch embedded in it that is used for its own protection.

BUF: latObjLatchLock2()/latObjLatchFree2().

LKT: latlatch()/latunlatch() or latObjLatchLock()/latObjLatchFree(), depending on the circumstance (whether the operation requires operating on multiple chains or a single chain).

However, realize that the exact function names used in supporting this mechanism are subject to change without notice.
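
A rough sketch of that idea, with made-up field names (the real bktbl_t layout and latch implementation are OpenEdge internals and are not shown here):

#include <stdint.h>

/* Illustration only: field names and layout are invented, not the
 * actual OpenEdge source.                                          */
typedef struct objlatch {
    volatile int locked;       /* simple spinlock word              */
    int          owner;        /* usrnum of the holder, -1 if free  */
} objlatch_t;

typedef struct bktbl {
    uint64_t     qself;        /* shared pointer back to this entry */
    uint64_t     bt_qbuf;      /* shared pointer to the block image */
    objlatch_t   latch;        /* object latch embedded in the very */
                               /* structure it protects             */
    /* ... LRU chain pointers, flags, etc. ... */
} bktbl_t;

/* Locking a buffer then touches only its own embedded latch, so two
 * processes working on different buffers never contend on a shared
 * governing latch.                                                  */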

> But they do not seem to be the same as the Object Number Latches introduced in 11.6:

There is no such thing as "Object Number Latches".  There is an administrative latch called an "Object Number Latch".  The terminology was inappropriately coined in that poorly formed message, leading you to believe that it was a latch "type", not a latch "name".  Although these latches have not been given an appropriate name as a group, they should be thought of as "administrative latches".  I will work on "socializing" that.  BTW, this message is no longer used in 11.7.

> Progress seems to use the pmObjectLockLock call to lock the multiplexed latches of the LKT latches (called "LHT" in _Latch VST) - Lock Table Hash Chain latches (introduced in 10.0A).

This is not true. latObjLatchLock() or latlatch() is used but again, it is the mechanism that is more important than the function name (but I appreciate that you often use the function name to decipher stack traces).  pmObjectLockLock() is only used to protect the plugin manager stuff, currently only implemented for replication.

> 10.2B06 Service Pack Readme seems to call them the private latches:

Process "Private latches" were originally introduced to protect data structures between threads of the same process, not to protect shared memory data structures.  However, the same process private latch mechanism was leveraged to protect certain shared data structures.  Although the mechanism was architected flexible enough to work in this way, it was incorrect to refer to this use of the latching mechanism as "private latches".

The use of the term "private latches" is therefore inappropriate here.  I stressed that at the time but the developer continued to call them by that name and incorrectly added references to that mechanism in messages and KBases.  I should have "strong armed" it at the time but did not.  It is something that Progress needs to clarify and clean up.

> Do the new messages introduced in V11 talk about the same private latches?

> 11.0: SYSTEM ERROR: User <usrnum> died holding the <latch-name> latch. (15767)

> Desc: The user has died while holding a private latch.

This message is obsolete and is no longer used anywhere as far as I can see.

> 11.6: User <user-number> died holding the <private-latch-name>. (18386)

> Desc: User with usernum died holding a private latch as specified.

Yes, this is the same, but please let's start referring to them as administrative latches.  I will do what I can to clean this up throughout the organization.

Posted by George Potemkin on 20-Jul-2016 09:37

Thanks, Richard, for your answers.

> Do you mean what we call them or how we call them?

What you call them. The question was about the terminology.

> However, the same process private latch mechanism was leveraged to protect certain shared data structures.

I understand (more or less) how not to kill a process while it holds a multiplexed latch (like BUF or LKT). But should we care about the private latches or the administrative latches when we are going to use kill -9?

> I gave a talk on latches at the various conferences some years back.

To recall the links for the community:
B10: A New Spin on Some Old Latches
community.progress.com/.../2143.exchange-2008-breakout-sessions
OPS-28: A New Spin on Some Old Latches
download.psdn.com/.../OPS-28_Banville.ppt

OE1108: What are You Waiting For? Reasons for Waiting Around! (Recording)
community.progress.com/.../1010.exchange-2011-breakout-sessions
PCA2011 Session 105: What are you waiting for? Reasons for waiting around! (Presentation)
pugchallenge.org/.../Waiting_AmericaPUG.pptx

Posted by gus bjorklund on 20-Jul-2016 09:40

i’m not quite sure what you are asking, george.

the terminology is a bit inconsistent in places and also the implementation details have changed over time. some of the differences in terminology are due to the age of the writings.

take the buffer pool for an example. currently, each buffer has its own latch. in promon these are grouped together and the counters are summed and reported under BUF1, BUF2, BUF3, and BUF4. so they are “multiplexed” only for reporting purposes.

in earlier times, there was another type of lock called multiplexed latches or muxlatches, where each buffer was assigned to one of several multiplexed locks that were each under a governing latch (called BUF1, BUF2, BUF3, and BUF4). i won't explain how these work since they are obsolete.

either way, those were all in shared memory and the latches can be manipulated by any local database connection.

however, there is some code which is used to operate on shared data structures and on private data structures (i.e. data structures that are in process private memory). since the code locks and unlocks latches, we put in “private latches” for the data structures that are in process private memory. given that they are process private, they can be manipulated only by threads in the process.
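
a minimal sketch of that distinction (names invented, with a pthread mutex standing in for the process-private case; this is not the OpenEdge implementation):

#include <pthread.h>

/* conceptual sketch only; names are invented and this is not the
 * OpenEdge implementation.                                         */

/* a shared-memory latch lives in the shm segment, so any process
 * attached to the segment (any local connection) can take it.      */
typedef struct shm_latch {
    volatile int holder;          /* usrnum of the owner, 0 if free */
} shm_latch_t;

/* a "private latch" guards a structure in process-private memory;
 * only threads of that one process can ever reach it.              */
typedef struct priv_latch {
    pthread_mutex_t mtx;
} priv_latch_t;

static void priv_latch_lock(priv_latch_t *l)   { pthread_mutex_lock(&l->mtx); }
static void priv_latch_unlock(priv_latch_t *l) { pthread_mutex_unlock(&l->mtx); }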

Posted by Richard Banville on 20-Jul-2016 09:52

> I understand (more or less) how not to kill a process while it holds a multiplexed latch (like BUF or LKT). But should we care about the private latches or the administrative latches when we are going to use kill -9?

For many of them you will have the same miserable experience if you kill -9 a user holding one of those admin type latches.

... but not all of them.  Some are used more as a locking mechanism than a latching mechanism, so a user can be killed with kill -9 and its lock-type latch will still be released.   Yes, I know this muddies the waters even more.

The ones to be careful of are those used when adding BI extents, during JTA/XA list management, during on-line object creation, and during auditing.

The difficulty is that it is hard to tell if you have these held or not.  Something should be added to promon/VSTs for them.

Posted by George Potemkin on 20-Jul-2016 09:56

Our customers are strongly demanding a safe way to disconnect processes when proshut -C disconnect does not work. 10 years ago Paul Koufalis wrote the killprosession script to do exactly such things. The approach implemented in the script has remained the best all these years, but unfortunately it's not 100% safe. I'm trying to take one step forward and give the customers a tool to avoid errors like the one that started this topic. I'm looking for any information that can help to implement a more reliable method to stop Progress processes.

Posted by George Potemkin on 20-Jul-2016 10:06

> For many of them you will have the same miserable experience if you kill -9 a user holding one of those admin type latches.

But only if we kill -9 a multi-threaded process, right?

ABL sessions will not issue the error 18386, will they?

User <user-number> died holding the <private-latch-name>. (18386)

Posted by Richard Banville on 20-Jul-2016 10:12

Unfortunately no.  As I tried to explain in the previous post, these are incorrectly referred to as "private latches", but not all of them are actually process private; some of them protect inter-process shared memory data.  I will have to examine the code to get you the real list (later today) but this is exactly what makes the incorrect terminology confusing.

Posted by James Palmer on 20-Jul-2016 10:14

You should come along to this session, George... www.pugchallenge.eu/.../emeaprogdets.p

:) :) :)

Posted by George Potemkin on 20-Jul-2016 10:49

> I will have to examine the code to get you the real list (later today) but this is exactly what makes the incorrect terminology confusing.

Thanks again, Richard!

I have written a family of intentionally crazy processes that create as much latch activity as possible, and now I'm running some inhuman experiments with them. ;-) In my tests I mainly use Progress 10.2B08. Should I start using 11.6 to get the 18386's?

Posted by George Potemkin on 20-Jul-2016 11:15

> for the BUF latches, each -B buffer control structure (bktbl_t) has a latch embedded in it that is used for its own protection.

When Progress finds a wrong dbkey, it writes the block to a log file and it also seems to dump the buffer header (a part of the buffer pool):

(-----) SYSTEM DEBUG: Database buffer block
(-----) pbktbl = 0x7000000076ff368
(-----) pbktbl->qself = 0x000b0000076ff368
(-----) XBKBUF(pbktbl->qself) = 0x07000000076ff368
(-----) pbktbl->bt_qbuf = 000b00000c6b8a68
(-----) XBKBUF(pbktbl->bt_qbuf) = 0x70000000c6b8a68
(-----) qusrctl: 0x000b0000011c7b70
(-----) use count: 1, governing latch: 27, lru: 0, state: 4
(-----) changed: 0, chkpt: 0, writing: 0, fixed: 0
(-----) aged: 0, onlru: 0, cleaning: 0, apwq: 0
(-----) bt_qlrunxt: 0x000b0000078fb4a8, bt_qlruprv:  0x000b000008ad5648
(-----) bt_qapwnxt: 0x0000000000000000, bt_qapwprv:  0x0000000000000000
(-----) bt_qfstuser: 0x0000000000000000, bt_qcuruser:  0x0000000000000000
(-----) pbkbuf = 0x70000000c6b8a68
(-----) Block dbkey = 557303360   Offset = 71334830080


Does "pbktbl->qself" reports a multiplexed latch that protects the buffer?

Posted by Richard Banville on 20-Jul-2016 12:08

We may be getting a little too deep here...

Many shm structures have a qself pointer.  It is a shared memory pointer (shm segment and offset) back to the structure itself, which is used in part for self-validation and memory overwrite detection.

The actual object latch in the bktbl_t is not printed out in this part of the message.

Posted by George Potemkin on 20-Jul-2016 12:37

The value of a qself pointer looks very similar to latchId reported by 5029:

SYSTEM ERROR: Releasing multiplexed latch. latchId: 3096224809273976 (= 0x000B0000 03E6CA78)

Both are 8 bytes where the 4 high bytes are 0x000B0000. When I tried to answer Paul, I was not 100% sure if 0x000B0000 always means that the governing latch is BUF. My assumption was based only on an interpretation of the customers' cases. Is it just a coincidence that the same value is stored in the buffer header? Is it correct that 0x000B0000 in latchId means the BUF latch?

Posted by Richard Banville on 20-Jul-2016 12:55

No.  This is an OpenEdge shared ptr that is encoded to consist of the internal segment table entry number and the offset within the shm segment pointed to by that table entry; when decoded, it results in a process-specific address which points to the shared data.  It is done this way since absolute addresses do not work for data in shared memory.

The latch id in this case is indeed the shared pointer to the bktbl_t but not the latch itself.  The latch itself is 32 bytes further within that structure.

Each latch has to have a unique identifier.  For the BUF latches, the shared pointer to the associated bktbl is the unique identifier.  Prior to true object latching, the value would have been one of 4 integer values < 32.   I think  they were 24, 25, 26, 27 but can't remember for sure.  
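
To tie that back to the number in the original error, a heavily hedged sketch: the simple high/low split below is only a guess at the encoding Richard describes, since the real bit layout of the shared pointer is internal.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t qself   = 0x000B000003E6CA78ULL;    /* value from error 5029        */
    uint32_t segment = (uint32_t)(qself >> 32);  /* segment table entry? (guess) */
    uint32_t offset  = (uint32_t)(qself & 0xFFFFFFFFULL); /* offset in segment   */

    printf("segment entry 0x%X, offset 0x%X\n", (unsigned)segment, (unsigned)offset);

    /* per Richard, this points at the bktbl_t itself; the embedded
     * object latch sits 32 bytes further into that structure.       */
    printf("embedded latch at offset 0x%X\n", (unsigned)(offset + 32));
    return 0;
}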
