I'm using OE 11.3 on HP-UX. I have type 2 storage areas.
I've started up a number of separate process (eg 10) to delete data out of the same table. Each process is started with a parameter to say what needs to be deleted within that process.
The processes seem to do the table deleting work *slower* in parallel (as a team) than if I just run a single process.
I think there must be some massive concurrency problem in OE related to deletes. Is OE taking an entire table lock or such, just in order to delete a single record? Or perhaps the indexes are the concurrency bottleneck? Any tips on how I should go about this?
At the moment we are trimming old data (prior to a certain date). Each process is trimming data for a different part of the organization.
Should I be disabling indexes (ie ones that don't contribute to the partitioning of the organization's data)?
Any help would be appreciated.
Thanks, David
Sounds like your processes are attempting to read the same records, and, since one of them is in the process of deleting it, the rest of them waits for the record to be unlocked (or in this case deleted).
I'd do a FOR EACH with NO-LOCK with my selection criteria, then, within the for-each reread the record with EXCLUSIVE-LOCK NO-WAIT NO-ERROR, and IF AVAILABLE THEN DELETE it.
Sometimes, the for each of a second process might need to look at a record that a first process has locked, thus having to wait... Especially, if the FOR EACH criteria requires a final selection on the client-side...
> The processes seem to do the table deleting work *slower* in parallel (as a team) than if I just run a single process.
Check promon/R&D/deghb
3/1. Activity: Performance Indicators
2/13. Activity: Resource Queues
6/11. Activity: Latch Counts
1/4/2. Status: Blocked Clients
6/15. Status: Buffer Lock Queue
> Each process is started with a parameter to say what needs to be deleted within that process.
Do the processes use an unique index to find the records to delete?
Thanks for the tips guys, I will take a closer look.
I think I have a few extra strategies, 1 - disable as many indexes as feasible, 2 - larger transactions (100,000 rows at a time instead of one). 3 - gather rowids into temp-tables with no-lock for concurrency, then circle back and delete - using exclusive-lock - within larger transactions:
I imagine that part of the problem is overhead related to so many tiny, individual transactions. (I only get about 100 commits per second (according to 3/1 Activity: Performance Indicators - thanks for pointing it out). My hope is that maybe I can increase the number of rows deleted per commit, without increasing the costs of the transaction overhead.)
Initially I was a little concerned about the table-locks. I wasn't sure if the OE database essentially goes into "single user mode" while deleting. The concern was because of the following promon output:
Usr:Ten Name Domain Chain #
Lock Flags Tran State Tran ID
-----------
66 0 TAB 9645
IX L Phase 2 FWD 517284980
-----------
66 0 TAB 9648
IX L Phase 2 FWD 517284980
-----------
Notice the TAB locks that are taken with "IX" lock flags.
It turns out this shouldn't be a conflict between two ABL programs that are deleting data using row locking: per KB
knowledgebase.progress.com/.../P37194
Either way, I don't think OE is especially great at deleting data, at least not within ABL. It appears others have had the same questions, I found these posts while doing more research. I expect any ABL developer trying to delete data (ie. tens of thousands of rows at a time or more) would have the exact same complaints we're having.
*Deleting large amount of data in openedge database*
*Deleting the contents of a table...fast!*
community.progress.com/.../8292
Maybe the SQL92 engine has some better tricks for optimizing deletion work in batches. That will be my fallback if/when I give up on ABL. I suppose the last-ditch effort is to create a brand new storage area with duplicate tables and move in the data I want, skipping over the stuff I want "deleted". However it seems silly to do this when we are keeping the majority (eg 90%) of the rows.
Thanks, David
Actually it is normal.
Deleting data is best accomplished with a single thread. When you run multiple deletes in parallel things slow down considerably.
Also -- this thread really ought to be in the database forum. I doubt the db engineering team has seen it at this point.
I found some low hanging fruit.... I found this one after coming to the conclusion that my only real bottleneck in the database seems to be latch timeouts:
(Found via checking promon/R&D/deghb): 3/1. Activity: Performance Indicators
So I changed -spin from 2000 to 500000 and my commits per second are now in the 300-500 range (instead of 100 before). And better yet, I'm actually seeing some disk utilization in glance!
I wish OE would auto-tune the spin. This seems like an especially obscure thing for the average OE DBA to be responsible for. The CPU usage is still under 10% for this entire team of processes. Not sure why our we always used -spin 2000 by default. Maybe it is the best-practice from a decade ago.
> my only real bottleneck in the database seems to be latch timeouts
Which latch?
> Maybe it is the best-practice from a decade ago.
The optimal value of the -spin depends from main kind of database activity at the current moment (or by a latch that is a bottleneck right now).
Tuning spin for varying workloads is a bit of "magic". Take care that your new setting doesn't cause excessive CPU consumption during your normal workload.
There are a lot of cases where the latch timeouts can actually be resolved in other/better ways than just increasing -spin. Take a look at the Latch Counts (debghb - 11) and take a look/post which latches you are having issues with.
What would make the SQL route faster than the ABL connection with the current topic of large amounts of deletes?
Keith/George,
Secret menus? I'm not a DBA, and have little experience with the OE database at this level.
Anyway, I found it. For those of us who don't know the deep secrets, its like so:
From: pugchallenge.org/.../332_debghb_v2.pptx
Start promon
Enter: R&D (also works in lower case starting in V10)
Enter: debghb
You have now entered the debghb “zone”
Two main differences in the world of debghb
Extensions to some existing R&D screens
Enables access to “This menu is not here Menu”
Enter: “6” even though there is no visible option 6
See next slide
------------
Anyway here is the screen I used to see my transaction thruput and the latch waits. (Notice that I have a real heavy workload where index data maintenance is concerned. I think half our indexes aren't even used so it kind of annoys me...)
03/08/16 Activity: Performance Indicators
10:44:54 03/08/16 10:44 to 03/08/16 10:44 (10 sec)
Total Per Min Per Sec
Per Tx
Commits 4427 26562 442.70
1.00
Undos 0 0 0.00
0.00
Index operations 136291 817746 13629.10
30.79
Record operations 36843 221058 3684.30
8.32
Total o/s i/o 10977 65862 1097.70
2.48
Total o/s reads 3071 18426 307.10
0.69
Total o/s writes 7906 47436 790.60
1.79
Background o/s writes 7718 46308 771.80
1.74
Partial log writes 3 18 0.30
0.00
Database extends 0 0 0.00
0.00
Total waits 2966 17796 296.60
0.67
Lock waits 0 0 0.00
0.00
Resource waits 2966 17796 296.60
0.67
Latch timeouts 1735 10410 173.50
0.39
=====================================
More on indexes....
03/08/16 Activity: Index
10:45:21 03/08/16 10:45 to 03/08/16 10:45 (10 sec)
Total Per Min Per Sec
Per Tx
Find index entry 17776 106656 1777.60
4.24
Create index entry 0 0 0.00
0.00
Delete index entry 111143 666858 11114.30
26.53
Remove locked entry 0 0 0.00
0.00
Split block 0 0 0.00
0.00
Free block 29 174 2.90
0.01
================================
Finally the latches:
03/08/16 Activity: Latch Times
11:17:45 03/08/16 11:17 to 03/08/16 11:17 (10 sec)
Locks Lk usec Wt usec Lk usec Wt usec
/Sec /Lock /Lock /Sec /Sec
MTX 15888 0 0 0 0
USR 0 0 0 0 0
OM 14611 0 0 0 0
BIB 15782 0 0 0 0
SCH 0 0 0 0 0
LKP 6658 0 0 0 0
GST 0 0 0 0 0
TXT 1408 0 0 0 0
LKT 5253 0 0 0 0
LKT 5209 0 0 0 0
LKT 5273 0 0 0 0
LKT 5317 0 0 0 0
SEQ 0 0 0 0 0
AIB 0 0 0 0 0
TXQ 20439 0 0 0 0
===========================================
counts:
03/08/16 Activity: Latch Counts
11:21:57 03/06/16 02:08 to 03/08/16 11:21 (57 hrs 13 min)
----- Locks ----- ------ Busy ------ Naps -
--------- Spins ----------- ----- Nap Max -----
Owner Total /Sec /Sec Pct /Sec /
Sec /Lock /Busy Total HWM
MTX -- 712129433 3457 0 0.0 28 13
051 3 3234870 138 250
USR -- 194110 0 0 0.0 0
0 0 0 0 0
OM -- 1347954788 6544 0 0.0 0 1
858 0 1867741 370 10
BIB -- 699531877 3396 0 0.0 0 1
384 0 1926612 35 250
SCH -- 370730 1 0 0.0 0
0 0 0 0 0
LKP -- 286422796 1390 0 0.0 0
102 0 18128 110 10
GST -- 253893 1 0 0.0 0
0 0 0 0 0
TXT -- 102473565 497 0 0.0 0
3 0 218623 1 10
LKT -- 314458151 1526 0 0.0 0
0 0 621 39 10
LKT -- 271685318 1319 0 0.0 0
0 0 464 14 10
LKT -- 255297769 1239 0 0.0 0
0 0 400 1 20
LKT -- 253810487 1232 0 0.0 0
0 0 431 44 10
SEQ -- 129452 0 0 0.0 0
0 0 0 0 0
AIB -- 39067 0 0 0.0 0
0 0 0 0 0
TXQ -- 896593025 4352 0 0.0 6
994 0 171644 1 40
Enter <return> for more, A, L, R, S, U, Z, P, T, or X (? for help):
03/08/16 Activity: Latch Counts
11:22:00 03/06/16 02:08 to 03/08/16 11:21 (57 hrs 13 min)
----- Locks ----- ------ Busy ------ Naps -
--------- Spins ----------- ----- Nap Max -----
Owner Total /Sec /Sec Pct /Sec /
Sec /Lock /Busy Total HWM
EC -- 0 0 0 0.0 0
0 0 0 0 0
LKF -- 690158906 3350 0 0.0 0
365 0 171000 81 10
BFP -- 2 0 0 0.0 0
0 0 0 0 0
BHT -- 7093100330 34437 0 0.0 0 2
295 0 31513724 491 10
PWQ -- 408399 1 0 0.0 0
0 0 0 0 0
CPQ -- 34280260 166 0 0.0 0
0 0 11073 0 0
LRU -- 6465601420 31390 0 0.0 1 4
811 0 69780 874 10
LRU -- 376621493 1828 0 0.0 0
0 0 0 0 0
BUF -- 3519010800 17084 0 0.0 0 8
525 0 3850820 3 20
BUF -- 3615680361 17554 0 0.0 2 17
875 1 39168405 14 20
BUF -- 3579977569 17380 0 0.0 3 3
003 0 2850957 69 20
BUF -- 3553969436 17254 0 0.0 3 4
998 0 14921600 2 40
INC -- 0 0 0 0.0 0
0 0 0 0 0
L29 -- 0 0 0 0.0 0
0 0 0 0 0
L30 -- 0 0 0 0.0 0
0 0 0 0 0
L31 -- 0 0 0 0.0 0
0 0 0 0 0
@bremmeyr - I think OE-SQL92 came to mind when I saw those related comments on this thread :community.progress.com/.../8292
I have used OE-SQL92 quite a bit and it often performs better than ABL for bulk operations; I assume deletes are no different. For one, the technology can leverage a different set of concurrency/isolation modes and locking levels (even up to locking the entire table if appropriate). Additionally I'm guessing that it can batch up and pipeline the various units of work (eg. index deletions) in a way that is unlike than what an individual ABL DELETE row statement would do. Finally, in SQL92, a large deletion statement can be processed entirely within native, non-custom code without any AVM which, provided that is part of the bottleneck, would be a huge benefit on the CPU side of things.
I still have some hope for my ABL deletion program. Especially if I can find more silver bullets like the -spin change.
Sorry for the short hand... checking in between normal work :-)
Your -omsize parameter needs to be increased to reduce the OM latches knowledgebase.progress.com/.../20398
You could benefit from setting lruskips and looking at the size of your -B knowledgebase.progress.com/.../000031150
Setting lruskips to 20 will potentially give you some relief without much risk of degradation. Using a properly sized secondary buffer pool (-B2) and assigning the proper objects can also help out drastically.
It is possible that the BI file isn't sized properly and/or your -bibufs parameter is too low. Most people suggest using a biblocksize of 16KB and a large enough cluster size so that your checkpoints aren't too frequent (at least a few minutes apart). Some of the MTX latches are just part of deleting index entries though.
You say you have two storage areas... for the entire database or just this table? Hopefully they are at least Type II areas, otherwise you need to rethink the physical layout.
David,
> 11:21:57 03/06/16 02:08 to 03/08/16 11:21 (57 hrs 13 min)
57 hours is too long interval for statistics. ;-)
> On Mar 8, 2016, at 1:03 PM, George Potemkin wrote:
>
> 57 hours is too long interval for statistics. ;-)
All depends on what you want to measure.
For determining a week's worth of after-image volume it is too short. :)
Gus, when we talk about the latch statistics I would not agree with any intervals longer than 2 seconds (but I would like to see statistics for many intervals). ;-)
Your comments make me curious. Even with 2 different agents I envision OE-SQL92 and OE-ABL\4GL as accessing the stored data with the same database engine. Perhaps access to the data is more separated then I thought. Another way to put it, is ABL and SQL 2 paths to the same data or 2 paths to the same database engine?
think of it as 2 paths to same data
2 "languages", one storage engine
OE-SQL92 and OE-ABL are totally different. They share low-level database resources, but resolve queries in different ways. In promon you will often see lock types which are specifically there to support SQL92 features, while ABL would not otherwise have any use for them.
(Whereas OE-ABL must respect SQL92 locks - and sometimes even take SQL92-style locks in order to coexist - ABL doesn't have the ability to take and use the full set of locks that are possible - see knowledgebase.progress.com/.../P37194)
There are other differences beyond locking types. SQL92 can also do some cool things to resolve queries from index data alone, while ABL doesn't really have anything like that.
We use SQL92 primarily for reporting purposes but it has many other uses too.
Your idea about "2 paths to the same database engine" is not accurate. That was probably the way SQL89 was built (which is now deprecated). SQL89 always felt like an ugly SQL syntax layer that was built on top of the same AVM and it probably used the exact same database engine features as it would have been used in ABL.
As cool as SQL92 is, I should put in a plug for an idea I just added last month. We need to be able to do a QUERY-PREPARE of sorts and generate results using the SQL92 engine. The results would be stored in a regular ABL temp-table for subsequent processing....
Please vote!
community.progress.com/.../make_use_of_sql92_query_optimizer_from_an_abl_program
The only alternative is shelling out to the OS and running sqlexp commands, and importing the results. This feels very hack-ish.