The Secret Life of Latches

Posted by George Potemkin on 11-Oct-2016 04:04

The first two latches (MT and DB locks) were introduced in Progress V5, released in 1988 (the first version that used shared memory). But the changes made in V6.3 (released in 1991) were, IMHO, a breakthrough. Apart from the separate latches that protect access to most of the shared memory structures, we got watchdog functionality, background writers and a lot of secret options in promon. ;-) I think it would be right to say that this year is the 25th anniversary of Progress latches, and I wish to thank Gus Bjorklund for his excellent work!

During the conference in Noordwijk I gave a short introduction to a rather long presentation about the latches.
The presentation can be downloaded here:
ftp.progress-tech.ru/.../SecretLifeOfLatches.pptx

Questions, comments, additions are welcome!

Best regards,
George

All Replies

Posted by George Potemkin on 30-Oct-2016 07:58

I have some questions about the latches and how Progress uses them. Here are just a few of them. I hope someone knows the answers.

1. What is the purpose of the SEQ latch?
Progress stores the current values of the database sequences in the sequence block (bk_type 6, the third block in the Schema Area, or mb_seqblk) as well as in the sequence control structures protected by the SEQ latch. At least the broker reads the sequence block during database startup. The CURRENT-VALUE() function reads the sequence block using a shared buffer lock (DB Buf S Lock) and locks the SEQ latch; in other words, it reads both sources of a sequence's current value. NEXT-VALUE() uses an exclusive buffer lock (DB Buf X Lock) and locks the SEQ latch twice; in other words, it updates both sources. Moreover, since V11 Progress additionally locks the transaction end queue (TXQ) latch - twice in both cases. Sequence reads as well as sequence updates are not part of a transaction. Why are these operations additionally protected by the TXQ latch?
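
To summarize the observations in pseudo-C: only the lock counts below come from promon; the names, the grouping and the exact ordering are my assumptions, not Progress source.

typedef int latch_t;
enum { SEQ_LATCH, TXQ_LATCH };
extern void latch_lock(latch_t l);
extern void latch_unlock(latch_t l);
extern void buf_lock_X(void *blk);   /* DB Buf X Lock */
extern void buf_unlock(void *blk);
extern void *seq_block;              /* mb_seqblk     */

void next_value_sketch(void)
{
    buf_lock_X(seq_block);           /* exclusive lock on the sequence block */
    latch_lock(SEQ_LATCH);           /* 1st SEQ lock                         */
    /* ... update the cached current value and the block image ... */
    latch_unlock(SEQ_LATCH);
    latch_lock(SEQ_LATCH);           /* 2nd SEQ lock seen per NEXT-VALUE()   */
    latch_unlock(SEQ_LATCH);
    buf_unlock(seq_block);
    latch_lock(TXQ_LATCH);           /* the extra TXQ locks since V11, even  */
    latch_unlock(TXQ_LATCH);         /* though no transaction is involved    */
    latch_lock(TXQ_LATCH);
    latch_unlock(TXQ_LATCH);
}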

2. What is the difference between the queue-type latches (USR and SCH) and the TXQ latch that protects the acquire/release of the TXE resource locks (Share, Update & Commit)?
In both cases the processes use semaphores to wait for the database resources. In the case of the queue-type latches, these are the semaphore latch waits reported by promon/R&D/2/13 "Activity: Other". In the case of the TXQ latch, these are the waits reported by promon/R&D/debghb/6/9 "Activity: TXE Lock Activity". In both cases we can also get naps for the latches. Does it mean that Progress uses two locking mechanisms (one a short-term lock and the other a medium-term lock) to access the same database resources? Sequentially? That makes no sense to me. In parallel? I can't see how either.
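
For reference, a plain (spin/nap) latch looks roughly like the sketch below, written with C11 atomics; the names and the constant are illustrative, not Progress source. A queue-type latch differs in that a loser enqueues itself and blocks on a semaphore until the releaser wakes it, which is why those waits show up as semaphore waits instead of naps.

#include <stdatomic.h>
#include <unistd.h>

#define SPIN_LIMIT 6000                  /* cf. the -spin startup parameter */

typedef struct { atomic_flag busy; } spin_latch_t;

/* Plain latch: spin, then "nap" and retry.  After a nap the process
   competes from scratch - there is no queue and no fairness. */
static void spin_latch_lock(spin_latch_t *l)
{
    int spins = 0;
    while (atomic_flag_test_and_set(&l->busy)) { /* true = was already set */
        if (++spins == SPIN_LIMIT) {
            usleep(1000);                        /* counted as a "nap"     */
            spins = 0;
        }
    }
}

static void spin_latch_unlock(spin_latch_t *l)
{
    atomic_flag_clear(&l->busy);
}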

3. A process can lock a few latches at the same time. The most obvious example is the MTX latch. Unlike any other latch, it does not protect a specific structure in shared memory: it protects transaction allocation and the bi/ai recovery note /order/. An MTX lock can be held during disk IO, but even when that is not the case, the duration of MTX latch locks is the longest among the latches. A process holds the MTX latch while it loads another latch's code into the CPU cache and while that code is updating the corresponding structure in shared memory. The time needed to load a latch's code seems to be of the same order as the execution time of that code. Hence if a process holds a latch and needs to lock another latch, the lock duration of the first latch can increase significantly even if there is no contention on the second latch.
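
A rough model of that effect: let Tc be a latch's critical-section time and Tm the time to fault the second latch's code and data into the CPU cache, both on the order of ~100 nanoseconds. Holding MTX across an uncontended acquisition of, say, the BIB latch then stretches the MTX hold from Tc(MTX) to roughly Tc(MTX) + Tm(BIB) + Tc(BIB), i.e. about three times longer, before any contention on the second latch is added at all.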

I know that processes can simultaneously lock the LKT and LKF latches (Lock Table Hash Latch and Lock Table Free Chain Latch). I guess that a process locks an LKT latch first and then may also need to lock the LKF latch; for self-service sessions this happens about 90% of the time. If that is true, why is the LKT latch multi-threaded (a type 1 family) while the LKF latch is a regular (plain) latch? LKF can't be multi-threaded by its nature. Different processes can simultaneously lock the different latches of the LKT family, but they still have to wait for the LKF latch, so the whole mechanism works as if it were single-threaded.
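
Assuming the obvious scheme (my assumption - reusing spin_latch_t from the sketch above), a multiplexed latch family hashes the protected object to one of its member latches, which makes the single-file funnel through LKF easy to see:

#define N_LKT 4                            /* latches 9-12 in the table below
                                              form one LKT family            */
static spin_latch_t lkt[N_LKT];
static spin_latch_t lkf;

extern void *pop_free_entry(void);         /* stand-ins for the real work    */
extern void  chain_insert(unsigned chain, void *entry);

void lock_table_insert(unsigned hash_chain)
{
    spin_latch_lock(&lkt[hash_chain % N_LKT]);  /* concurrent across chains  */
    spin_latch_lock(&lkf);                      /* but every path serializes */
    void *entry = pop_free_entry();             /* on the one free chain     */
    spin_latch_unlock(&lkf);
    chain_insert(hash_chain, entry);
    spin_latch_unlock(&lkt[hash_chain % N_LKT]);
}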

Which combinations of latches (other than MTX+anything and LKT+LKF) can be locked by a process at the same time?
I know that Progress uses a protocol based on the latch masks.
We can see these masks in promon/R&D/4/6 "Restricted Options: 8j2hhs7gio".
Below is a table with the values converted to hexadecimal and binary, sorted by mask, but I don't know how Progress uses them (a sketch of one possible reading follows the table).
For example, the two LRU latches - for the primary and for the alternate buffer pool - have totally different masks. Either promon reports values that are not really used, or there is something about the LRU latch for the alternate buffer pool that we do not yet know.

 Latch  Mask/hex  Mask/binary
-- ---  --------- ---------------------------------
 0 L00  100000000 100000000000000000000000000000000
 1 MTX  100000000 100000000000000000000000000000000
 2 USR  100000000 100000000000000000000000000000000
 3 OM   100000000 100000000000000000000000000000000
 4 BIB  100000000 100000000000000000000000000000000
 5 SCH  100000000 100000000000000000000000000000000
 8 TXT  100000000 100000000000000000000000000000000
 9 LKT  100000000 100000000000000000000000000000000
13 SEQ  100000000 100000000000000000000000000000000
15 TXQ  100000000 100000000000000000000000000000000
22 LRU  100000000 100000000000000000000000000000000
29 CDC  100000000 100000000000000000000000000000000
30 SEC  100000000 100000000000000000000000000000000
31 L31  100000000 100000000000000000000000000000000
28 INC   F0000000  11110000000000000000000000000000
20 PWQ   F03C0000  11110000001111000000000000000000
23 LRU   F0C00000  11110000110000000000000000000000
19 BHT   F0F80000  11110000111110000000000000000000
18 BFP   F1000000  11110001000000000000000000000000
21 CPQ   F1000000  11110001000000000000000000000000
24 BUF   FF400000  11111111010000000000000000000000
25 BUF   FF400000  11111111010000000000000000000000
26 BUF   FF400000  11111111010000000000000000000000
27 BUF   FF400000  11111111010000000000000000000000
 7 GST   FFF8A000  11111111111110001010000000000000
16 EC    FFFFC000  11111111111111111100000000000000
 6 LKP   FFFFE200  11111111111111111110001000000000
17 LKF   FFFFE200  11111111111111111110001000000000
12 LKT   FFFFF200  11111111111111111111001000000000
11 LKT   FFFFFA00  11111111111111111111101000000000
10 LKT   FFFFFE00  11111111111111111111111000000000
14 AIB   FFFFFF00  11111111111111111111111100000000
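
One possible reading of the masks - purely a guess on my part - is that mask[i] encodes which latches a process may already hold when it requests latch i, i.e. a latch-ordering rule for deadlock avoidance. The LKT family fits this pattern: latch 10 may be requested while holding latches 9-31, latch 11 while holding 9 and 11-31, latch 12 while holding 9 and 12-31, but never while holding a lower-numbered latch. It does not explain everything (under this reading the lone bit 32 in the first group would have to mean "holding nothing"), but a check of that kind would look like:

#include <stdint.h>
#include <assert.h>

extern uint64_t latch_mask[32];      /* the values in the table above       */

/* held = bitmap of the latches this process currently holds.  Whether any
   such check exists, and in which builds, is unknown - this is a sketch of
   the guessed protocol only. */
void latch_lock_checked(uint64_t held, int id)
{
    assert((held & ~latch_mask[id]) == 0);   /* no disallowed latch held    */
    /* ... normal spin/nap acquisition ... */
}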

Posted by George Potemkin on 30-Oct-2016 12:38

I have updated my LatchMon.p program that I used during the tests with the latches. The program requires Progress V11.5 or higher. It reads the _Latch VST 10 times per second (the default polling interval is 0.1 sec) and creates two reports for the specified number of sampling intervals:

LatchStat.<dbname>.txt - statistics per latch.
LatchHold.<dbname>.txt - latch statistics per user (for the most active users only).

Apart from the usual columns like "Locks" and "Naps" and some auxiliary "technical" fields, the LatchStat report contains two new columns:

"Busy%" - percent of time the latch was busy. Note that neither high latch locks nor high latch naps (waits) indicate that latch is busy. The value in "Busy%" is calculated as the number of times when _Latch-Owner catches the current latch owner compared to the number of checks ("Polls") done per sampling interval. Average latch lock duration can be estimated as the ration of "Busy%" to "Locks" values.

"Users" - the number of users that were caught during the checks as a last latch holder (_Latch-holder).

Of course, both values exist only for regular (plain) latches.
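
A worked example of the "Busy%" estimate: with a 10-second sampling interval the program makes about 100 polls. If the latch owner was caught in 5 of them, Busy% = 5, i.e. the latch was held for roughly 0.5 sec in total; with 100,000 locks during the same interval the average lock duration comes out to 0.5 sec / 100,000 = 5 microseconds.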

LatchHold reports the total number of latch locks ("Locks") plus a "User%" column - how many times a particular user was caught as the last latch holder.
"BgnTime" and "EndTime" are the first/last times the user was found as a holder of /any/ latch during the current sampling interval.
"DbAccess", "*Read" and "*Write" are the user's activity between BgnTime and EndTime.

It will be a bad sign if LatchMon.p creates a third report: LatchLock.<dbname>.txt
The file reports all cases when a latch was locked longer than a polling interval (i.e. 0.1 sec or longer). That is a million times longer than a normal latch lock duration (~100 nanoseconds), but it can happen, especially with the MTX latch.

The program can be downloaded here:
ftp.progress-tech.ru/.../LatchMon.p

Posted by Richard Banville on 14-Nov-2016 07:21

#1 - This is an oversight that can be cleaned up.  I'll have a bug filed.  Realize that it is "functionally correct" but can be improved to help with performance.

#2 - The difference is that the USR and SCH latches queue to acquire the latch.  The TXQ latch does not queue to acquire the latch, but a conflict in acquiring the TXE lock will queue the process waiting for that lock.  This mechanism is similar to how the BUF latch and the buffer lock work together.
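
A sketch of that two-level pattern, under assumed names (not actual Progress source; spin_latch_t is from the sketch earlier in the thread): the short-term latch is held only long enough to inspect and update the lock bookkeeping, and never across the wait itself.

#include <semaphore.h>

typedef struct proc { sem_t sem; struct proc *next; } proc_t;
typedef struct resource {
    spin_latch_t latch;                /* e.g. TXQ, or BUF for buffer locks */
    proc_t      *wait_queue;
    /* ... lock state ... */
} resource_t;

extern int  grantable(resource_t *r, proc_t *p);    /* assumed helpers */
extern void grant(resource_t *r, proc_t *p);
extern void enqueue(proc_t **q, proc_t *p);

void resource_lock(resource_t *r, proc_t *self)
{
    for (;;) {
        spin_latch_lock(&r->latch);          /* short-term: microseconds   */
        if (grantable(r, self)) {
            grant(r, self);
            spin_latch_unlock(&r->latch);
            return;
        }
        enqueue(&r->wait_queue, self);       /* medium-term lock wait      */
        spin_latch_unlock(&r->latch);
        sem_wait(&self->sem);                /* the releaser posts the     */
    }                                        /* next waiter's semaphore    */
}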

#3 - We continue to work on opening every flood gate possible.  As we make the higher level latches more concurrent, it pushes the next bottleneck lower in the stack.  Just because there is a contentious bottleneck at a lower level does not mean, however, that making higher level processing more concurrent should not be done; it just means that the end user's overall operation is not completely concurrent yet.  For example, the plan with the LKF latch was to remove it altogether and make acquiring a free lock table entry a latch-free atomic operation.  I had prototyped this when I did the latch concurrency improvements in 10.1C.  Even though it is latch free, there is still contention at the machine level for the atomic swap, so the minor performance improvement seen at that time did not warrant the risk.  Without multiple free chains, the overall operation will not be 100% concurrent - it's just that you as the end user will not have the information to identify it.  I agree, this is something we should revisit.
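
The classic shape of such a latch-free pop, sketched with C11 atomics (an illustration, not the 10.1C prototype itself; a production version would also have to defend against the ABA problem, e.g. by packing a version counter next to the head pointer, and against an entry being reused while another CPU still dereferences it):

#include <stdatomic.h>
#include <stddef.h>

typedef struct lk_entry { struct lk_entry *next; /* ... */ } lk_entry_t;

static _Atomic(lk_entry_t *) free_head;     /* head of the one free chain */

lk_entry_t *lk_free_pop(void)
{
    lk_entry_t *head = atomic_load(&free_head);
    /* On failure the CAS reloads 'head', so we simply retry.  No latch
       and no nap - but every CPU still fights over the same cache line,
       which is the machine-level contention mentioned above. */
    while (head != NULL &&
           !atomic_compare_exchange_weak(&free_head, &head, head->next))
        ;
    return head;
}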

As always, there is more work to do to improve the concurrency of shared resources for performance reasons, and we continue to work in this area.

Thanks for your insights.

Posted by George Potemkin on 14-Nov-2016 08:58

Richard, thanks for the information. Just for clarification: the only aim of the questions was to verify my understanding of how the latches work. I'm not complaining about performance. For example, there were customer cases where the sequences were a bottleneck, but that was because the application did something obviously wrong rather than because of the implementation of sequences in Progress. I have never seen a bottleneck on the latches related to the lock table (even in the case of remote clients, where the number of LKT latch locks is much higher than the number of record lock requests). As usual, it was just my curiosity. ;-)
