The following almost always results in zeros:
for each _resrc no-lock where _resrc-name = "shared memory":
display _resrc.
end.
But I have recently found a few possibly interesting incidents at a customer site with non-zero values. Does anyone know what activities this resource is involved with?
No, it's all latch waits.
The stat is only collected when latch statistics collection has been requested.
Shared memory latch waits.
What sorts of events would require such a latch?
Mere access to shared memory seems unlikely; otherwise wouldn't there always be tons and tons of these all of the time, instead of it being a very rare thing?
I'm wondering if it is being driven by something like "proutil increaseto"? Or maybe memory in -Mxs being reallocated to -L when the lock table gets filled up? Or something like that?
> Or maybe memory in -Mxs being reallocated to -L when the lock table gets filled up?
Shared memory lock queue stays zero in such cases.
I have checked my archive of customer data: the shared memory lock queue was always zero.
> No, it's all latch waits.
> The stat is only collected when latch statistics collection has been requested.
Aha!
That makes sense.
I will need to double check but I'm guessing that tech support's "gather" script enables latch statistics collection while it runs.
I have enabled:
2. Enable latch activity data collection
and 'for each customer: end' results in:
05/20/19        Activity: Resource Queues
17:43:32        05/20/19 17:43 to 05/20/19 17:43 (17 sec)

                    - Requests -
Queue                      Total
Shared Memory               1215
Record Lock                   84
DB Buf S Lock                177
The Total Locks in Activity: Latch Counts is 1051.
The difference between the Shared Memory queue locks (1215) and the Latch Counts total (1051) is 164.
So latch locks are counted in the Shared Memory queue locks, plus something else.
Yes, the Tech Support promon gather script does indeed enable latch statistics collection.
George, type 1 object latches are counted twice, once for the governor, once for the object latch.
Feature or bug? I think we should add some additional statistics for the object latches then it would be clearer as to what is going on.
In general LKP, LHT* behave as type 1 object latches but not always. There are times for locking an entire chain where the object latch is not obtained, just the governor.
With this in mind, the counts are more what you would expect given the sum of the latches compared with the "Shared Memory" Resource Queue report.
Thanks, Richard!
> Yes, the Tech Support promon gather script does indeed enable latch statistics collection.
Does that mean a kind of official support for these options? Can we use them safely in a production environment, apart from possible performance degradation?
2. Enable latch activity data collection
3. Enable latch timing data collection
Latch timing data collection does not seem to be implemented on some platforms. Where can we use this option?
It is not officially supported since the mechanisms have never been properly QA'ed.
I would not expect it to crash or anything - that would be considered a bug. But there is the possibility that the numbers reported are not accurate.
I don't see any platform where the latch timing is enabled anymore. Is this something that would be useful for you if re-implemented?
> I would not expect it to crash or anything - that would be considered a bug.
That's what I meant by "supported": issues can be reported to PSTS.
> I don't see any platform where the latch timing is enabled anymore. Is this something that would be useful for you if re-implemented?
I have not played with the versions where latch timing data collection was implemented, so I can't say whether it would be useful.
I asked about this option just because the gather.sh script says:
# Last modified 12/10/2013
echo 2 >>gatherin.txt #Enable latch activity collection
echo 3 >>gatherin.txt #Enable latch timing collection
> On May 22, 2019, at 12:28 PM, George Potemkin wrote:
>
> I don't see any platform where the latch timing is enabled anymore
It could be re-implemented now that there is decent low-overhead high-resolution timers on all the current processor architectures.
An old implementation used gettimeofday() on most machines and a memory-mapped microsecond timer on Sequent machines. The cost to read that timer was one memory access. Back then, it was the only low-cost timer available.
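For what it's worth, the cost of reading a modern monotonic clock can be estimated with a quick sketch (Python here for convenience, though the database code would of course be C; the interpreter loop overhead inflates the number, so treat it as an upper bound):

```python
import time

def timer_read_cost_ns(samples=100_000):
    """Estimate the average cost of one high-resolution clock read."""
    start = time.perf_counter_ns()
    for _ in range(samples):
        time.monotonic_ns()          # one timer read per iteration
    elapsed = time.perf_counter_ns() - start
    return elapsed / samples

cost = timer_read_cost_ns()
print(f"~{cost:.0f} ns per clock read (includes call overhead)")
```

On current hardware the underlying clock read (rdtsc and friends) is tens of nanoseconds, which is why re-implementing latch timing looks feasible now.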
Yep, but I will not spend time on it if no one finds it useful.
It has been missing from the product for a very long time; this is the first time I've heard it mentioned, and I got no answer as to whether or why it would be useful.
Agreed.
Fact: the duration of at least MTX latch locks may vary significantly depending on external factors like a failed disk in a RAID or long process queues at the OS level. "Significantly" = by millions (maybe billions) of times. The number of latch locks/naps will be low during such extreme cases. I don't know if there are situations where the duration of latch locks is only moderately increased above normal - say, by tens or hundreds of times. IMHO, the only way to see such situations is latch timing.
I would enable latch timing collection in tests without competition for resources in shared memory - with only one session running. Just to estimate the percentage of time the session holds locks on resources compared to the total execution time. That gives an estimate of how many processes can run simultaneously before the competition becomes a bottleneck.
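That estimate could be sketched roughly like this (the numbers are purely hypothetical, not from any real measurement):

```python
def max_sessions_before_contention(latch_held_seconds, total_seconds):
    """If one session holds latches for a fraction p of its runtime,
    roughly 1/p such sessions can run before the latches are busy
    essentially all of the time."""
    return round(total_seconds / latch_held_seconds)

# Hypothetical single-session test: latch timing reports 0.2 s of
# total latch hold time during a 60 s run.
print(max_sessions_before_contention(0.2, 60.0))  # -> 300
```

It is only a back-of-the-envelope bound - it ignores which latches are held and how the accesses are distributed - but that is exactly the kind of estimate latch timing would make possible.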
I have avoided trying to use it because, in the past (v9 was I think the last time I tried), it had a noticeable impact on performance.
I would welcome being able to obtain that data if the impact of collecting it was not significant or painful.
I think that having that data would be helpful in trying to figure out which latch tuning options to use and how well those changes have worked. Right now a lot of that sort of thing is trial and error.
Even if the timing data is difficult or expensive, the current PROMON, "debghb", 11 has two columns on the far right "nap max total / hwm" that seem like they might be good candidates to add to the _Latch VST. My thinking there is that they would give better insight into how frequently the -spin limit is being reached, how much napping is actually taking place and how long those naps are.
On a somewhat related note -- I was speculating to myself on a long drive home that the distribution of block accesses is very probably a Pareto distribution or something similar (IOW -- the vast majority of spinning is probably to gain access to a relatively tiny number of resources) for many applications (maybe not all, but I suspect the majority).

I have no way to really know if that is actually true, but if there is a block access counter somewhere (maybe related to how lruskips gets calculated?) and if all of those values could be dumped (or queried) relatively painlessly, that would be a fascinating bit of information -- especially if the db table/index/blob # was also available. That would be really useful for finding the sources of hot spots and designing ways to mitigate them.
> if there is a block access counter somewhere
there are counters in all the buffer header structures for the blocks that are currently in memory. maybe a vst can read these but i forget.
> the current PROMON, "debghb", 11 has two columns on the far right "nap max total / hwm" that seem like they might be good candidates to add to the _Latch VST. My thinking there is that they would give better insight into how frequently the -spin limit is being reached, how much napping is actually taking place and how long those naps are.
They are useful only when -napmax is larger than -nap. But why are they set that way? The normal duration of a latch lock is less than 100 nanoseconds. The default -nap is 10 milliseconds. In other words, if a process fails to get a latch it kindly allows other processes to lock the latch almost 100,000 times (= 10 ms / 100 ns). A higher nap time increases the "kindness" of our process, but is its job less important than that of the other processes? This only makes response time less predictable. Of course, -napmax was introduced to save CPU time: spinning consumes CPU but does not make any useful changes in the database. The "courtesy" of our process allows other processes to consume CPU by spinning while a latch is busy. When a resource is a bottleneck, the law of the jungle comes on stage: -napmax = -nap = 1 ms.
> there are counters in all the buffer header structures for the blocks that are currently in memory.
I guess only the countdown access counters used for lruskips mechanism:
DEFINE VARIABLE i AS INTEGER NO-UNDO.
DEFINE VARIABLE vRecid AS INT64 NO-UNDO.

DISABLE TRIGGERS FOR DUMP OF customer.
FIND FIRST customer NO-LOCK.
ASSIGN vRecid = RECID(customer).

MESSAGE "Dbkey:" vRecid - (vRecid MOD 32) SKIP "Zero"
VIEW-AS ALERT-BOX INFORMATION BUTTONS OK.

DO i = 1 TO 1000:
  FIND FIRST customer NO-LOCK WHERE RECID(customer) EQ vRecid.
END.

MESSAGE "Update" VIEW-AS ALERT-BOX INFORMATION BUTTONS OK.
Result:
06/06/19                Status: Lru Chains
  Num  DBKEY  Area  Hash  T  S  Usect Flags  Updctr  Lsn  Chkpnt  Lru Skips
    1    384     8   221  D  0      L    36       0    0       0          0

10:39:05                Adjust Latch Options
  8. Adjust LRU force skips: 2147483647

06/06/19                Activity: Buffer Cache
Primary Buffer Pool
  Logical reads              1000

06/06/19                Status: Lru Chains
10:35:38
  Num  DBKEY  Area  Hash  T  S  Usect Flags  Updctr  Lsn  Chkpnt  Lru Skips
    1    384     8   221  D  0      L    36       0    0       0 2147482648
2147483647 - 2147482648 = 999
In other words, "Skips" decreases from -lruskips - 1 down to 0 and is then reset.
So setting -lruskips to 2147483647 gives us block access counters, but only for a limited time interval.
Tested on V12.0.
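For what it's worth, here is one plausible model of that countdown in Python. It is only a guess at the mechanism that happens to reproduce the promon numbers above (reset to -lruskips on the access that updates the LRU chain); it has not been checked against the actual source:

```python
def access_block(buf, lruskips):
    """Model of the countdown skip counter: when the counter is 0 the
    access updates the LRU chain and reloads the counter from -lruskips;
    otherwise the LRU update is skipped and the counter is decremented."""
    if buf["skips"] == 0:
        buf["skips"] = lruskips      # this access touches the LRU chain
        buf["lru_updates"] += 1
    else:
        buf["skips"] -= 1            # LRU update skipped

LRUSKIPS = 2_147_483_647
buf = {"skips": 0, "lru_updates": 0}
for _ in range(1000):                # the 1000 FIND FIRST reads above
    access_block(buf, LRUSKIPS)

print(buf["skips"])                  # -> 2147482648, as promon showed
print(LRUSKIPS - buf["skips"])       # -> 999
```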
> I have avoided trying to use it because, in the past (v9 was I think the last time I tried), it had a noticeable impact on performance.
We could enable latch timing data collection for a couple of seconds just to check if the durations of latch locks are normal or not.
Currently latch timing is like Planet X: nobody has seen it; its existence is only a guess or a theory.
> On Jun 6, 2019, at 2:14 AM, George Potemkin wrote:
>
> But why they are set so?
Because when I set them originally, it was not possible to sleep for less than 10 milliseconds (16 on DOS), processors were 5,000 times slower, and with 100 processes doing nothing but sleeping for 10 msec there was no CPU time left.
-nap seems to still be undocumented in the oe12 online docs. As of at least 11.7 (and if I recall correctly quite a lot further back than that) you can "proserve sports2000 -nap 1" successfully and PROMON R&D, 4, 4, 4 will show the initial sleep time as 1 but you cannot change it online to anything less than 10.
My mistake -- I was actually running 10.2b when I thought I was running 11.7. The "-nap 1" startup parameter is allowed at least as far back as oe10.2b08, 11.7 PROMON does not allow values less than 10 but OE12 PROMON *does* allow changing -nap online to 1.
11.7 does allow you to change -nap to less than 10 via _dbparams._dbparams-value
these days, using nanosleep() you can sleep for less than 1 msec. though the resolution in the function argument is in nanoseconds, the actual granularity is higher. The minimum sleep time is also (much) higher but i don't know the actual values.
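A quick way to see the actual granularity from user space (sketched in Python, whose time.sleep() ends up in nanosleep() on Linux; the measured overshoot is the timer granularity plus scheduler latency, so the exact numbers vary by OS and load):

```python
import time

def measure_sleep_ns(requested_ns, iterations=20):
    """Average actual duration of a short sleep, to compare against
    the requested duration."""
    total = 0
    for _ in range(iterations):
        start = time.monotonic_ns()
        time.sleep(requested_ns / 1e9)
        total += time.monotonic_ns() - start
    return total // iterations

actual = measure_sleep_ns(1_000_000)   # request 1 ms
print(f"requested 1.000 ms, got ~{actual / 1e6:.3f} ms")
```

On a typical idle Linux box the overshoot is tens of microseconds, far better than the old 10 ms floor, which supports the point that sub-millisecond naps are realistic now.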
Nowadays CPUs are faster. New Progress versions are faster (for example, the DO loop). Is the percentage of time during which, for example, the same FOR EACH loop holds locks on shared resources decreasing with new Progress versions? In other words, do client sessions spend less time holding the latches? Any guesses? I have not tested this with different Progress versions.
When OE "naps" what does it actually do? Is it calling something like usleep() or nanosleep()? The usleep() man page claims to have granularity down to 4 or 5 microseconds. Is that realistic?
OE00184551, DB lat, 10.1C04 (June 2009):
Improved latch performance can be achieved by changing select() to nanosleep()
Stack trace while napping:
_p_nsleep
nsleep
nanosleep
utnap@AF12_8
latSleep
latObjLatchLock
> On Jun 6, 2019, at 9:15 AM, ChUIMonster wrote:
>
> The "-nap 1" startup parameter is allowed at least as far back as oe10.2b08
my recollection is that the minimum was 10 and if you specified a smaller value it was raised to 10. but maybe i'm wrong.
PROGRESS version 8.1A as of Tue Nov 12 1996
proserve sports -nap 1
promon sports
4. Initial latch sleep time: 1 milliseconds
5. Maximum latch sleep time: 100 milliseconds
PROGRESS Version 9.1B as of Thu Aug 17 22:49:26 EDT 2000
4. Initial latch sleep time: 1 milliseconds
5. Maximum latch sleep time: 5000 milliseconds
it still might be 10 msec because the select() and poll() functions also used to round up to whatever the minimum was.
examination of the code is needed but i don't have access to it anymore