The following almost always results in zeros:
for each _resrc no-lock where _resrc-name = "shared memory":
display _resrc.
end.
But I have recently found a few possibly interesting incidents at a customer site with non-zero values. Does anyone know what activities this resource is involved with?
No, it's all latch waits.
The stat is only collected when latch statistics collection has been requested.
Shared memory latch waits.
What sorts of events would require such a latch?
Mere access to shared memory seems unlikely; otherwise wouldn't there always be tons and tons of these all of the time, instead of it being a very rare thing?
I'm wondering if it is being driven by something like "proutil increaseto"? Or maybe memory in -Mxs being reallocated to -L when the lock table gets filled up? Or something like that?
> Or maybe memory in -Mxs being reallocated to -L when the lock table gets filled up?
Shared memory lock queue stays zero in such cases.
I have checked my archive of customer data: the shared memory lock queue was always zero.
> No, it's all latch waits.
> The stat is only collected when latch statistics collection has been requested.
Aha!
That makes sense.
I will need to double check but I'm guessing that tech support's "gather" script enables latch statistics collection while it runs.
I have enabled:
2. Enable latch activity data collection
and 'for each customer: end' results in:
05/20/19        Activity: Resource Queues
17:43:32        05/20/19 17:43 to 05/20/19 17:43 (17 sec)

                    - Requests -
Queue                      Total
Shared Memory               1215
Record Lock                   84
DB Buf S Lock                177
The Total Locks in Activity: Latch Counts is 1051.
The difference between the Shared Memory queue locks (1215) and the Latch Counts total (1051) is 164.
So latch locks are counted in the Shared Memory queue locks, plus something else.
Yes, the Tech Support promon gather script does indeed enable latch statistics collection.
George, type 1 object latches are counted twice, once for the governor, once for the object latch.
Feature or bug? I think we should add some additional statistics for the object latches then it would be clearer as to what is going on.
In general LKP, LHT* behave as type 1 object latches but not always. There are times for locking an entire chain where the object latch is not obtained, just the governor.
With this in mind, the counts are more what you would expect given the sum of the latches compared with the "Shared Memory" Resource Queue report.
Thanks, Richard!
> Yes, the Tech Support promon gather script does indeed enable latch statistics collection.
Does that mean a kind of official support for these options? Can we use them safely in a production environment, apart from possible performance degradation?
2. Enable latch activity data collection
3. Enable latch timing data collection
Latch timing data collection does not seem to be implemented on some platforms. Where can we use this option?
It is not officially supported since the mechanisms have never been properly QA'ed.
I would not expect it to crash or anything - that would be considered a bug. But there is the possibility that the numbers reported are not accurate.
I don't see any platform where the latch timing is enabled anymore. Is this something that would be useful for you if re-implemented?
> I would not expect it to crash or anything - that would be considered a bug.
That's what I meant by "supported": issues can be reported to PSTS.
> I don't see any platform where the latch timing is enabled anymore. Is this something that would be useful for you if re-implemented?
I have not played with the versions where latch timing data collection was implemented, so I can't say whether it would be useful.
I asked about this option just because the gather.sh script says:
# Last modified 12/10/2013
echo 2 >>gatherin.txt #Enable latch activity collection
echo 3 >>gatherin.txt #Enable latch timing collection
> On May 22, 2019, at 12:28 PM, George Potemkin wrote:
>
> I don't see any platform where the latch timing is enabled anymore
It could be re-implemented now that there is decent low-overhead high-resolution timers on all the current processor architectures.
An old implementation used gettimeofday() on most machines and a memory-mapped microsecond timer on Sequent machines. The cost to read that timer was one memory access. Back then, it was the only low-cost timer available.
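For what it's worth, the cost of reading a modern monotonic clock can be estimated with a quick sketch (Python here for convenience, though the database code would of course be C; the interpreter loop overhead inflates the number, so treat it as an upper bound):

```python
import time

def timer_read_cost_ns(samples=100_000):
    """Estimate the average cost of one high-resolution clock read."""
    start = time.perf_counter_ns()
    for _ in range(samples):
        time.monotonic_ns()          # one timer read per iteration
    elapsed = time.perf_counter_ns() - start
    return elapsed / samples

cost = timer_read_cost_ns()
print(f"~{cost:.0f} ns per clock read (includes call overhead)")
```

On current hardware the underlying clock read (rdtsc and friends) is tens of nanoseconds, which is why re-implementing latch timing looks feasible now.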
Yep, but I will not spend time on it if no one finds it useful.
It has been missing from the product for a very long time; this is the first time I've heard it mentioned, and I got no answer as to whether or why it would be useful.
Agreed.
Fact: the duration of at least MTX latch locks may vary significantly depending on external factors like a failed disk in a RAID or long process queues at the OS level. "Significantly" = by millions (maybe billions) of times. The number of latch locks/naps will be low during such extreme cases. I don't know if there are situations where the duration of latch locks is only moderately increased above normal - say, by tens or hundreds of times. IMHO, the only way to see such situations is latch timing.
I would enable latch timing collection in tests without competition for resources in shared memory - with only one session running. Just to estimate the percentage of time the session holds locks on resources compared to the total execution time. That gives an estimate of how many processes can run simultaneously before the competition becomes a bottleneck.
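That estimate could be sketched roughly like this (the numbers are purely hypothetical, not from any real measurement):

```python
def max_sessions_before_contention(latch_held_seconds, total_seconds):
    """If one session holds latches for a fraction p of its runtime,
    roughly 1/p such sessions can run before the latches are busy
    essentially all of the time."""
    return round(total_seconds / latch_held_seconds)

# Hypothetical single-session test: latch timing reports 0.2 s of
# total latch hold time during a 60 s run.
print(max_sessions_before_contention(0.2, 60.0))  # -> 300
```

It is only a back-of-the-envelope bound - it ignores which latches are held and how the accesses are distributed - but that is exactly the kind of estimate latch timing would make possible.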
I have avoided trying to use it because, in the past (v9 was I think the last time I tried), it had a noticeable impact on performance.
I would welcome being able to obtain that data if the impact of collecting it was not significant or painful.
I think that having that data would be helpful in trying to figure out which latch tuning options to use and how well those changes have worked. Right now a lot of that sort of thing is trial and error.
Even if the timing data is difficult or expensive, the current PROMON, "debghb", 11 has two columns on the far right "nap max total / hwm" that seem like they might be good candidates to add to the _Latch VST. My thinking there is that they would give better insight into how frequently the -spin limit is being reached, how much napping is actually taking place and how long those naps are.
On a somewhat related note -- I was speculating to myself on a long drive home that the distribution of block accesses is very probably a Pareto distribution or something similar (IOW -- the vast majority of spinning is probably to gain access to a relatively tiny number of resources) for many applications (maybe not all, but I suspect the majority).

I have no way to really know if that is actually true, but if there is a block access counter somewhere (maybe related to how lruskips gets calculated?) and if all of those values could be dumped (or queried) relatively painlessly, that would be a fascinating bit of information -- especially if the db table/index/blob # was also available. That would be really useful for finding the sources of hot spots and designing ways to mitigate them.
> if there is a block access counter somewhere
there are counters in all the buffer header structures for the blocks that are currently in memory. maybe a vst can read these but i forget.
> the current PROMON, "debghb", 11 has two columns on the far right "nap max total / hwm" that seem like they might be good candidates to add to the _Latch VST. My thinking there is that they would give better insight into how frequently the -spin limit is being reached, how much napping is actually taking place and how long those naps are.
They are useful only when -napmax is larger than -nap. But why are they set that way? The normal duration of a latch lock is less than 100 nanoseconds. The default -nap is 10 milliseconds. In other words, if a process fails to get a latch it kindly allows other processes to lock the latch almost 100,000 times (= 10 ms / 100 ns). A higher nap time increases the "kindness" of our process, but is its job less important than that of the other processes? This only makes response time less predictable. Of course, -napmax was introduced to save CPU time: spinning consumes CPU but does not make any useful changes in the database. The "courtesy" of our process allows other processes to consume CPU by spinning while a latch is busy. When a resource is a bottleneck, the law of the jungle comes on stage: -napmax = -nap = 1 ms.
> there are counters in all the buffer header structures for the blocks that are currently in memory.
I guess only the countdown access counters used for lruskips mechanism:
DEFINE VARIABLE i AS INTEGER NO-UNDO.
DEFINE VARIABLE vRecid AS INT64 NO-UNDO.

DISABLE TRIGGERS FOR DUMP OF customer.
FIND FIRST customer NO-LOCK.
ASSIGN vRecid = RECID(customer).

MESSAGE "Dbkey:" vRecid - (vRecid MOD 32) SKIP "Zero"
VIEW-AS ALERT-BOX INFORMATION BUTTONS OK.

DO i = 1 TO 1000:
  FIND FIRST customer NO-LOCK WHERE RECID(customer) EQ vRecid.
END.

MESSAGE "Update" VIEW-AS ALERT-BOX INFORMATION BUTTONS OK.
Result:
06/06/19                Status: Lru Chains
  Num  DBKEY  Area  Hash  T  S  Usect Flags  Updctr  Lsn  Chkpnt  Lru Skips
    1    384     8   221  D  0      L    36       0    0       0          0

10:39:05                Adjust Latch Options
  8. Adjust LRU force skips: 2147483647

06/06/19                Activity: Buffer Cache
Primary Buffer Pool
  Logical reads              1000

06/06/19                Status: Lru Chains
10:35:38
  Num  DBKEY  Area  Hash  T  S  Usect Flags  Updctr  Lsn  Chkpnt  Lru Skips
    1    384     8   221  D  0      L    36       0    0       0 2147482648
2147483647 - 2147482648 = 999
In other words, "Skips" decreases from -lruskips - 1 down to 0 and is then reset.
So setting -lruskips to 2147483647 gives us block access counters, but only for a limited time interval.
Tested on V12.0.
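For what it's worth, here is one plausible model of that countdown in Python. It is only a guess at the mechanism that happens to reproduce the promon numbers above (reset to -lruskips on the access that updates the LRU chain); it has not been checked against the actual source:

```python
def access_block(buf, lruskips):
    """Model of the countdown skip counter: when the counter is 0 the
    access updates the LRU chain and reloads the counter from -lruskips;
    otherwise the LRU update is skipped and the counter is decremented."""
    if buf["skips"] == 0:
        buf["skips"] = lruskips      # this access touches the LRU chain
        buf["lru_updates"] += 1
    else:
        buf["skips"] -= 1            # LRU update skipped

LRUSKIPS = 2_147_483_647
buf = {"skips": 0, "lru_updates": 0}
for _ in range(1000):                # the 1000 FIND FIRST reads above
    access_block(buf, LRUSKIPS)

print(buf["skips"])                  # -> 2147482648, as promon showed
print(LRUSKIPS - buf["skips"])       # -> 999
```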
> I have avoided trying to use it because, in the past (v9 was I think the last time I tried), it had a noticeable impact on performance.
We could enable latch timing data collection for a couple of seconds just to check if the durations of latch locks are normal or not.
Currently latch timing is like Planet X: nobody has seen it; its existence is only a guess or a theory.
> On Jun 6, 2019, at 2:14 AM, George Potemkin wrote:
>
> But why they are set so?
Because when I set them originally, it was not possible to sleep for less than 10 milliseconds (16 on DOS), processors were 5,000 times slower, and with 100 processes doing nothing but sleeping for 10 msec there was no CPU time left.
-nap seems to still be undocumented in the oe12 online docs. As of at least 11.7 (and if I recall correctly quite a lot further back than that) you can "proserve sports2000 -nap 1" successfully and PROMON R&D, 4, 4, 4 will show the initial sleep time as 1 but you cannot change it online to anything less than 10.
My mistake -- I was actually running 10.2b when I thought I was running 11.7. The "-nap 1" startup parameter is allowed at least as far back as oe10.2b08, 11.7 PROMON does not allow values less than 10 but OE12 PROMON *does* allow changing -nap online to 1.
11.7 does allow you to change -nap to less than 10 via _dbparams._dbparams-value
these days, using nanosleep() you can sleep for less than 1 msec. though the resolution in the function argument is in nanoseconds, the actual granularity is higher. The minimum sleep time is also (much) higher but i don't know the actual values.
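A quick way to see the actual granularity from user space (sketched in Python, whose time.sleep() ends up in nanosleep() on Linux; the measured overshoot is the timer granularity plus scheduler latency, so the exact numbers vary by OS and load):

```python
import time

def measure_sleep_ns(requested_ns, iterations=20):
    """Average actual duration of a short sleep, to compare against
    the requested duration."""
    total = 0
    for _ in range(iterations):
        start = time.monotonic_ns()
        time.sleep(requested_ns / 1e9)
        total += time.monotonic_ns() - start
    return total // iterations

actual = measure_sleep_ns(1_000_000)   # request 1 ms
print(f"requested 1.000 ms, got ~{actual / 1e6:.3f} ms")
```

On a typical idle Linux box the overshoot is tens of microseconds, far better than the old 10 ms floor, which supports the point that sub-millisecond naps are realistic now.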
Nowadays CPUs are faster. New Progress versions are faster (for example, the DO loop). Is the percentage of time during which, for example, the same FOR EACH loop holds locks on shared resources decreasing with new Progress versions? In other words, do client sessions spend less time holding the latches? Any guesses? I have not tested this with different Progress versions.
When OE "naps" what does it actually do? Is it calling something like usleep() or nanosleep()? The usleep() man page claims to have granularity down to 4 or 5 microseconds. Is that realistic?
OE00184551, DB lat, 10.1C04 (June 2009):
Improved latch performance can be achieved by changing select() to nanosleep()
Stack trace while napping:
_p_nsleep
nsleep
nanosleep
utnap@AF12_8
latSleep
latObjLatchLock
> On Jun 6, 2019, at 9:15 AM, ChUIMonster wrote:
>
> The "-nap 1" startup parameter is allowed at least as far back as oe10.2b08
my recollection is that the minimum was 10 and if you specified a smaller value it was raised to 10. but maybe i'm wrong.
PROGRESS version 8.1A as of Tue Nov 12 1996
proserve sports -nap 1
promon sports
4. Initial latch sleep time: 1 milliseconds
5. Maximum latch sleep time: 100 milliseconds
PROGRESS Version 9.1B as of Thu Aug 17 22:49:26 EDT 2000
4. Initial latch sleep time: 1 milliseconds
5. Maximum latch sleep time: 5000 milliseconds
it still might be 10 msec because the select() and poll() functions also used to round up to whatever the minimum was.
examination of the code is needed but i don't have access to it anymore