CDC policies and OM latches

Posted by George Potemkin on 13-Sep-2018 13:38

The -omsize value was set higher than the number of the objects in database (-omsize 2000 vs 1190 objects). 

OM latch controls Object Manager cache: “This is the Object to Area mapping information used to show what Object resides in what Area. The Area also anchors the Index Delete Chain”.

Richard T. Banville wrote:

It's not just schema changes that can page something into the secondary cache. Any update where something is not found in the primary cache will cause a latch on the secondary cache. It's just that creating a new table forces it to be in the secondary cache.

More about OM latch:

community.progress.com/.../49024 

Problem: after enabling CDC policies we got the high numbers of the OM latch locks and its naps: ~500,000 locks per sec, ~4,000-6,000 naps perc. OM latch was almost 100% busy (the most snapshots in promon shown a latch owner). In fact the OM latch locks were just 2-6 times less than BHT latch locks on any time intervals. In other words there is a correlation between the activities of these two latches. IMHO, it’s very strange - OM latch deals with db objects while BHT latch deals with the activity of db objects.

Any explanation?

BHT/OM latch lock ratio gives a hint of index tree levels.

All Replies

Posted by Rob Fitzpatrick on 13-Sep-2018 14:08

No explanation, but a few questions:

Were there any objects in the secondary cache, according to promon R&D 1 13?  

Was -omsize set to 2000 at DB start or increased online?

> after enabling CDC policies we got the high numbers of the OM latch locks and its naps

Would the online addition of CDC policies cause the creation of a large number of change tables, whose OM entries would be in the secondary cache (and result in latching) until DB restart?

Posted by Richard Banville on 13-Sep-2018 14:08

My theory is that the changes for CDC either check the object information to see if CDC is required and it causes an om latch or that something (maybe cdc policies) are no pre-loaded into the primary om cache.  Either way, if the om cache is sized properly and you are seeing om latch contention, then this it is a bug.  I'll have someone file a bug for this so that additional investigation can be done on it.

Posted by Richard Banville on 13-Sep-2018 14:11

Would the online addition of CDC policies cause the creation of a large number of change tables, whose OM entries would be in the secondary cache (and result in latching) until DB restart?

Yes.

Posted by Rob Fitzpatrick on 13-Sep-2018 14:17

> Yes.

Does that suggest this could be expected behaviour rather than a bug?

Posted by Richard Banville on 13-Sep-2018 14:22

Yes, in that case it would be expected behavior. I did not read that George had added them online.

If that is the case, then they go into the secondary om cache and the secondary cache is always latched when accessed.

It does show the value of not latching the primary cache though.

This could be an area of improvement for us when adding new policies.

Posted by George Potemkin on 13-Sep-2018 14:23

> Were there any objects in the secondary cache, according to promon R&D 1 13?  

09/13/18        Status: Shared Resources  
10:36:49        
Active transactions:               481
Lock table entries in use:         35235 of 1500000
Lock table high water mark:        776635
Primary object cache (-omsize):    Total: 2000 Used: 1176 Pinned: 0
Secondary object cache (-omsize):  Total: 2000 Used: 164 Pinned: 0
Primary MT cache (-mtpmsize):      Total: 1024 Used: 0 Pinned: 0
Secondary MT cache (-mtpmsize):    Total: 1024 Used: 0 Pinned: 0

5 min later

09/13/18        Status: Shared Resources  
10:42:15        
Active transactions:               496
Lock table entries in use:         33006 of 1500000
Lock table high water mark:        776635
Primary object cache (-omsize):    Total: 2000 Used: 1176 Pinned: 0
Secondary object cache (-omsize):  Total: 5120 Used: 164 Pinned: 0
Primary MT cache (-mtpmsize):      Total: 1024 Used: 0 Pinned: 0
Secondary MT cache (-mtpmsize):    Total: 1024 Used: 0 Pinned: 0

> Was -omsize set to 2000 at DB start or increased online?

-omsize 2000 at startup and it was increased online to 5000 

Posted by George Potemkin on 13-Sep-2018 14:32

CDC policies were activated online. It caused OM latch naps. Deactivation of policies did not resolve the issue.

Posted by George Potemkin on 14-Sep-2018 08:13

Quick test results:

1. If we just create a CDC policy for a table then the /reading/ its records will cause OM latch activity. But only if you don't restart a database.

2. The number of OM latch locks is increasing with the record reads. For example, 1 read -> 2 OM latch locks, 10 reads -> 600+ OM latch locks.

3. OM latch activity stays the same even if you delete a CDC policy. Unless you restart a database.

4. 'FOR EACH _Cdc-Table-Policy NO-LOCK' query caused the OM latch locks in production. Really a session running CDCAdminProcess.p crashed while holding the OM latch:

ABL  1507: (49)    SYSTEM ERROR: Memory violation.

WDOG 1507: (5028)  SYSTEM ERROR: Releasing regular latch. latchId: 3

But the tests show no OM latch locks when the query reads one record in _Cdc-Table-Policy.

Posted by Richard Banville on 14-Sep-2018 08:20

1. That is expected behavior

2. What records are you reading?  The user data or the CDC data? The ratio 10 reads : 600 OM latches seems wrong

3. Probably something that could be optimized on our end.

4. Should be reported and investigated by Progress.

Posted by George Potemkin on 14-Sep-2018 08:32

> What records are you reading?  The user data or the CDC data?

We read the history1 table in atm database (in other words, the user data).

Posted by Valeriy Bashkatov on 14-Sep-2018 08:37

Hi Richard,

Today I created case  for this issue

00458901 - High OM latches and abnormal shutdown of the database after CDCBatch start

Posted by Richard Banville on 14-Sep-2018 08:42

Thanks.  That will help get the right resources to investigate the issue and determine if there is a bug here and if so how to fix it.

If I am understanding this properly, the reads of the existing history1 table (where all object references should be in primary om cache if set large enough at startup) are incurring om latch activity after the secondary om cache contains entries for the CDC change capture tables.

Am I properly interpreting what you are encountering?

Posted by Valeriy Bashkatov on 14-Sep-2018 09:50

Yes, you understood correctly.

I again checked this on the ATM database and here's the result.

1. I deleted all CDC policies and restarted the database.

2. After starting the database for the history1 table, I added one CDC policy, but did not activate it. I did this through the Pro2 Administration Utility.

3. Now in the promon menu R&D 1/13 Status: Shared Resources I see that I have 1 object in the Secondary object cache

09/14/18 Status: Shared Resources
16:59:34

Active transactions: 0
Lock table entries in use: 0 of 10016
Lock table high water mark: 10
Primary object cache (-omsize): Total: 1024 Used: 265 Pinned: 0
Secondary object cache (-omsize): Total: 1024 Used: 1 Pinned: 1
Primary MT cache (-mtpmsize): Total: 1024 Used: 0 Pinned: 0
Secondary MT cache (-mtpmsize): Total: 1024 Used: 0 Pinned: 0
Number of servers: 2 (41 allocated)
Total users: 2 (200 allocated)
Self-service: 0
Remote: 2
Batch: 0
Watchdog status: Executing
BIW status: Executing
Number of page writers: 1
Number of monitors: 2
Number of semaphores allocated: 249
Shared memory allocated: 852384K (1 segments. The last segment was not locked in memory)

Enter <return>, R, P, T, or X (? for help):

4. Now reset promon statistics in the promon menu R&D debghb 6/11 Activity: Latch Counts. And start this simple code:

MESSAGE "Zero Activity: Latch Counts" VIEW-AS ALERT-BOX. /*manuale reset promon statistics and click OK*/

DEF VAR i AS INT NO-UNDO.
i = 0.
FOR EACH atm.history1 NO-LOCK:
DISPL history1. PAUSE 0.
i = i + 1.
IF i > 10 THEN
DO:
LEAVE.
END.
END.

MESSAGE "Update Activity: Latch Counts" VIEW-AS ALERT-BOX.

5. Update promon statistics in Activity: Latch Counts

09/14/18 Activity: Latch Counts
17:00:55 09/14/18 16:59 to 09/14/18 17:00 (1 min 26 sec)

----- Locks ----- ------ Busy ------ Naps ---------- Spins ----------- ----- Nap Max -----
Owner Total /Sec /Sec Pct /Sec /Sec /Lock /Busy Total HWM

MTX -- 0 0 0 0.0 0 0 0 0 0 0
USR -- 6 0 0 0.0 0 0 0 0 0 0
OM -- 762 8 0 0.0 0 0 0 0 0 0
BIB -- 86 1 0 0.0 0 0 0 0 0 0
SCH -- 0 0 0 0.0 0 0 0 0 0 0
LKP -- 36 0 0 0.0 0 0 0 0 0 0
GST -- 0 0 0 0.0 0 0 0 0 0 0
TXT -- 0 0 0 0.0 0 0 0 0 0 0
LKT -- 0 0 0 0.0 0 0 0 0 0 0
LKT -- 0 0 0 0.0 0 0 0 0 0 0
LKT -- 0 0 0 0.0 0 0 0 0 0 0
LKT -- 0 0 0 0.0 0 0 0 0 0 0
SEQ -- 0 0 0 0.0 0 0 0 0 0 0
AIB -- 0 0 0 0.0 0 0 0 0 0 0
TXQ -- 2 0 0 0.0 0 0 0 0 0 0
EC -- 0 0 0 0.0 0 0 0 0 0 0
LKF -- 54 0 0 0.0 0 0 0 0 0 0
BFP -- 0 0 0 0.0 0 0 0 0 0 0
BHT -- 1583 18 0 0.0 0 0 0 0 0 0
PWQ -- 0 0 0 0.0 0 0 0 0 0 0
CPQ -- 858 9 0 0.0 0 0 0 0 0 0
LRU -- 8 0 0 0.0 0 0 0 0 0 0
LRU -- 0 0 0 0.0 0 0 0 0 0 0
BUF -- 1954 22 0 0.0 0 0 0 0 0 0
BUF -- 434 5 0 0.0 0 0 0 0 0 0
BUF -- 348 4 0 0.0 0 0 0 0 0 0
BUF -- 430 5 0 0.0 0 0 0 0 0 0
INC -- 0 0 0 0.0 0 0 0 0 0 0
CDC -- 0 0 0 0.0 0 0 0 0 0 0
SEC -- 0 0 0 0.0 0 0 0 0 0 0
LG -- 0 0 0 0.0 0 0 0 0 0 0

Enter <return>, A, L, R, S, U, Z, P, T, or X (? for help):

As we can see, OM latch is now used very actively when only 10 records are read.

6. Now online I delete this CDC-policy via Pro2

09/14/18 Status: Shared Resources
17:03:27

Active transactions: 0
Lock table entries in use: 0 of 10016
Lock table high water mark: 10
Primary object cache (-omsize): Total: 1024 Used: 265 Pinned: 0
Secondary object cache (-omsize): Total: 1024 Used: 1 Pinned: 0
Primary MT cache (-mtpmsize): Total: 1024 Used: 0 Pinned: 0

Secondary object cache is still in use.

7. I execute my code again and look into Activity: Latch Counts

09/14/18 Activity: Latch Counts
17:04:14 09/14/18 17:03 to 09/14/18 17:04 (14 sec)

----- Locks ----- ------ Busy ------ Naps ---------- Spins ----------- ----- Nap Max -----
Owner Total /Sec /Sec Pct /Sec /Sec /Lock /Busy Total HWM

MTX -- 0 0 0 0.0 0 0 0 0 0 0
USR -- 2 0 0 0.0 0 0 0 0 0 0
OM -- 762 54 0 0.0 0 0 0 0 0 0
BIB -- 14 1 0 0.0 0 0 0 0 0 0
SCH -- 0 0 0 0.0 0 0 0 0 0 0
LKP -- 8 0 0 0.0 0 0 0 0 0 0
GST -- 0 0 0 0.0 0 0 0 0 0 0
TXT -- 0 0 0 0.0 0 0 0 0 0 0
LKT -- 0 0 0 0.0 0 0 0 0 0 0
LKT -- 0 0 0 0.0 0 0 0 0 0 0
LKT -- 0 0 0 0.0 0 0 0 0 0 0
LKT -- 0 0 0 0.0 0 0 0 0 0 0
SEQ -- 0 0 0 0.0 0 0 0 0 0 0
AIB -- 0 0 0 0.0 0 0 0 0 0 0
TXQ -- 2 0 0 0.0 0 0 0 0 0 0
EC -- 0 0 0 0.0 0 0 0 0 0 0
LKF -- 12 0 0 0.0 0 0 0 0 0 0
BFP -- 0 0 0 0.0 0 0 0 0 0 0
BHT -- 1569 112 0 0.0 0 0 0 0 0 0
PWQ -- 0 0 0 0.0 0 0 0 0 0 0

OM latch is still in use.

8. Again, create the CDC policy and restart the database.

09/14/18 Status: Shared Resources
17:07:37

Active transactions: 0
Lock table entries in use: 0 of 10016
Lock table high water mark: 0
Primary object cache (-omsize): Total: 1024 Used: 265 Pinned: 0
Secondary object cache (-omsize): Total: 1024 Used: 0 Pinned: 0
Primary MT cache (-mtpmsize): Total: 1024 Used: 0 Pinned: 0
Secondary MT cache (-mtpmsize): Total: 1024 Used: 0 Pinned: 0

Now Secondary object cache is not used.

9. I execute my code again and look in Activity: Latch Counts

09/14/18 Activity: Latch Counts
17:09:35 09/14/18 17:09 to 09/14/18 17:09 (5 sec)

----- Locks ----- ------ Busy ------ Naps ---------- Spins ----------- ----- Nap Max -----
Owner Total /Sec /Sec Pct /Sec /Sec /Lock /Busy Total HWM

MTX -- 0 0 0 0.0 0 0 0 0 0 0
USR -- 0 0 0 0.0 0 0 0 0 0 0
OM -- 0 0 0 0.0 0 0 0 0 0 0
BIB -- 0 0 0 0.0 0 0 0 0 0 0
SCH -- 0 0 0 0.0 0 0 0 0 0 0
LKP -- 4 0 0 0.0 0 0 0 0 0 0
GST -- 0 0 0 0.0 0 0 0 0 0 0
TXT -- 0 0 0 0.0 0 0 0 0 0 0
LKT -- 0 0 0 0.0 0 0 0 0 0 0
LKT -- 0 0 0 0.0 0 0 0 0 0 0
LKT -- 0 0 0 0.0 0 0 0 0 0 0
LKT -- 0 0 0 0.0 0 0 0 0 0 0
SEQ -- 0 0 0 0.0 0 0 0 0 0 0
AIB -- 0 0 0 0.0 0 0 0 0 0 0
TXQ -- 2 0 0 0.0 0 0 0 0 0 0
EC -- 0 0 0 0.0 0 0 0 0 0 0
LKF -- 6 1 0 0.0 0 0 0 0 0 0
BFP -- 0 0 0 0.0 0 0 0 0 0 0
BHT -- 1641 328 0 0.0 0 0 0 0 0 0
PWQ -- 0 0 0 0.0 0 0 0 0 0 0
CPQ -- 52 10 0 0.0 0 0 0 0 0 0

Now the OM latch is not used.

P.S.
Pay attention to BHT. Why is it so big, if we only read 10 records?

Posted by George Potemkin on 14-Sep-2018 10:30

By the way, what are the "Pinned" objects in a object cache (and in MT cache)?

Posted by Richard Banville on 14-Sep-2018 10:41

This is to provide isolation from uncommitted changes to object entries.

When making changes to an object the "old" committed version of the object is pinned in the cache while the new uncommitted version resides on disk as it is being changed.  

This is how we ensure all other OLTP sees only the committed version of the object entry as opposed to the changed but not committed yet version.  If the entry was not pinned in the cache, it could be paged out and when read on demand from disk NO-LOCK, the uncommitted version would show up in the cache.  

After the transaction which changed the cache entry ends, committed or rolled back, the entry in the cache will be updated with the current on-disk version the next time it is referenced.

Posted by George Potemkin on 14-Sep-2018 12:20

Thanks, Richard!

Posted by Valeriy Bashkatov on 17-Sep-2018 07:43

I see a fresh article about this

knowledgebase.progress.com/.../High-OM-latches-after-online-activation-of-CDC-policies

>>Cause

>>This is expected behavior.

But I do not agree that this should be the expected behavior. Performance degradation should not be due to the simple addition of CDC policies and this is certainly not expected by the end customers.

Posted by Rob Fitzpatrick on 17-Sep-2018 08:17

George, I'm re-reading your earlier post:

>  If we just create a CDC policy for a table then the /reading/ its records will cause OM latch activity. But only if you don't restart a database.

I think I misunderstood what you said.  I originally read this as "reads of the change table will cause OM latch activity".  That would be expected behaviour.  

But if you meant "reads of the application table for which the CDC policy was created will cause OM latch activity", then I wouldn't expect that, given that the OM cache was initially sized to have enough entries for all of the application objects.

Can you clarify please?

Posted by George Potemkin on 17-Sep-2018 08:38

For example, 'FOR FIRST atm.history1 NO-LOCK: END' creates OM latch locks if CDC policy is enabled for history1.

Reading of _Cdc-Change-Tracking or _Cdc-Table-Policy does not create OM latch locks.

This thread is closed