Trouble deleting data

Posted by dbeavon on 07-Mar-2016 20:14

I'm using OE 11.3 on HP-UX.   I have type 2 storage areas.

I've started up a number of separate process (eg 10) to delete data out of the same table.  Each process is started with a parameter to say what needs to be deleted within that process.

The processes seem to do the table deleting work *slower* in parallel (as a team) than if I just run a single process.

I think there must be some massive concurrency problem in OE related to deletes.  Is OE taking an entire table lock or such, just in order to delete a single record?  Or perhaps the indexes are the concurrency bottleneck? Any tips on how I should go about this? 

At the moment we are trimming old data (prior to a certain date).  Each process is trimming data for a different part of the organization. 

Should I be disabling indexes (ie ones that don't contribute to the partitioning of the organization's data)?

Any help would be appreciated.

Thanks, David

All Replies

Posted by smat-consulting on 07-Mar-2016 21:09

Sounds like your processes are attempting to read the same records, and, since one of them is in the process of deleting it, the rest of them waits for the record to be unlocked (or in this case deleted).

I'd do a FOR EACH with NO-LOCK with my selection criteria, then, within the for-each reread the record with EXCLUSIVE-LOCK NO-WAIT NO-ERROR, and IF AVAILABLE THEN DELETE it.

Sometimes, the for each of a second process might need to look at a record that a first process has locked, thus having to wait... Especially, if the FOR EACH criteria requires a final selection on the client-side...

Posted by George Potemkin on 08-Mar-2016 02:47

> The processes seem to do the table deleting work *slower* in parallel (as a team) than if I just run a single process.

Check promon/R&D/deghb

3/1.   Activity: Performance Indicators

2/13.  Activity: Resource Queues

6/11.  Activity: Latch Counts

1/4/2. Status: Blocked Clients

6/15.  Status: Buffer Lock Queue

> Each process is started with a parameter to say what needs to be deleted within that process.

Do the processes use an unique index to find the records to delete?

Posted by dbeavon on 08-Mar-2016 07:53

Thanks for the tips guys,  I will take a closer look.

I think I have a few extra strategies, 1 - disable as many indexes as feasible, 2 - larger transactions (100,000 rows at a time instead of one).  3 - gather rowids into temp-tables with no-lock for concurrency, then circle back and delete - using exclusive-lock - within larger transactions:

I imagine that part of the problem is overhead related to so many tiny, individual transactions.  (I only get about 100 commits per second (according to 3/1 Activity: Performance Indicators - thanks for pointing it out).  My hope is that maybe I can increase the number of rows deleted per commit, without increasing the costs of the transaction overhead.)

Initially I was a little concerned about the table-locks.  I wasn't sure if the OE database essentially goes into "single user mode" while deleting.  The concern was because of the following promon output:

Usr:Ten   Name      Domain     Chain #                        

Lock Flags   Tran State          Tran ID

-----------

  66                     0      TAB        9645

  IX    L      Phase 2 FWD         517284980

-----------

  66                     0      TAB        9648                    

  IX    L      Phase 2 FWD         517284980

-----------

Notice the TAB locks that are taken with "IX" lock flags.  

It turns out this shouldn't be a conflict between two ABL programs that are deleting data using row locking: per KB

knowledgebase.progress.com/.../P37194

Either way, I don't think OE is especially great at deleting data, at least not within ABL.  It appears others have had the same questions, I found these posts while doing more research.  I expect any ABL developer trying to delete data (ie. tens of thousands of rows at a time or more) would have the exact same complaints we're having.

*Deleting large amount of data in openedge database*

community.progress.com/.../42

*Deleting the contents of a table...fast!*

community.progress.com/.../8292

Maybe the SQL92 engine has some better tricks for optimizing deletion work in batches.  That will be my fallback if/when I give up on ABL.  I suppose the last-ditch effort is to create a brand new storage area with duplicate tables and move in the data I want, skipping over the stuff I want "deleted".  However it seems silly to do this when we are keeping the majority (eg 90%) of the rows.

Thanks, David

Posted by ChUIMonster on 08-Mar-2016 08:16

Actually it is normal.

Deleting data is best accomplished with a single thread.  When you run multiple deletes in parallel things slow down considerably.

Posted by ChUIMonster on 08-Mar-2016 08:17

Also -- this thread really ought to be in the database forum.  I doubt the db engineering team has seen it at this point.

Posted by dbeavon on 08-Mar-2016 09:10

I found some low hanging fruit....  I found this one after coming to the conclusion that my only real bottleneck in the database seems to be latch timeouts:

 (Found via checking promon/R&D/deghb): 3/1.   Activity: Performance Indicators

So I changed -spin from 2000 to 500000 and my commits per second are now in the 300-500 range (instead of 100 before).  And better yet, I'm actually seeing some disk utilization in glance!  

I wish OE would auto-tune the spin.  This seems like an especially obscure thing for the average OE DBA to be responsible for.  The CPU usage is still under 10% for this entire team of processes.  Not sure why our we always used -spin 2000 by default.  Maybe it is the  best-practice from a decade ago.

Posted by George Potemkin on 08-Mar-2016 09:26

> my only real bottleneck in the database seems to be latch timeouts

Which latch?

> Maybe it is the  best-practice from a decade ago.

The optimal value of the -spin depends from main kind of database activity at the current moment (or by a latch that is a bottleneck right now).

Posted by Keith Sudbury on 08-Mar-2016 09:29

Tuning spin for varying workloads is a bit of "magic". Take care that your new setting doesn't cause excessive CPU consumption during your normal workload.

There are a lot of cases where the latch timeouts can actually be resolved in other/better ways than just increasing -spin. Take a look at the Latch Counts (debghb - 11) and take a look/post which latches you are having issues with.

Posted by bremmeyr on 08-Mar-2016 09:47

What would make the SQL route faster than the ABL connection with the current topic of large amounts of deletes?

Posted by dbeavon on 08-Mar-2016 10:22

Keith/George,

Secret menus?  I'm not a DBA, and have little experience with the OE database at this level.

Anyway, I found it.  For those of us who don't know the deep secrets, its like so:

From: pugchallenge.org/.../332_debghb_v2.pptx

Start promon

Enter: R&D (also works in lower case starting in V10)

Enter: debghb

You have now entered the debghb “zone”

Two main differences in the world of debghb

Extensions to some existing R&D screens

Enables access to “This menu is not here Menu”

Enter: “6” even though there is no visible option 6

See next slide

------------

Anyway here is the screen I used to see my transaction thruput and the latch waits.  (Notice that I have a real heavy workload where index data maintenance is concerned.  I think half our indexes aren't even used so it kind of annoys me...)

03/08/16        Activity: Performance Indicators

10:44:54        03/08/16 10:44 to 03/08/16 10:44 (10 sec)

                                   Total         Per Min          Per Sec

   Per Tx

Commits                              4427           26562           442.70

     1.00

Undos                                   0               0             0.00

     0.00

Index operations                   136291          817746         13629.10

    30.79

Record operations                   36843          221058          3684.30

     8.32

Total o/s i/o                       10977           65862          1097.70

     2.48

Total o/s reads                      3071           18426           307.10

     0.69

Total o/s writes                     7906           47436           790.60

     1.79

Background o/s writes                7718           46308           771.80

     1.74

Partial log writes                      3              18             0.30

     0.00

Database extends                        0               0             0.00

     0.00

Total waits                          2966           17796           296.60

     0.67

Lock waits                              0               0             0.00

     0.00

Resource waits                       2966           17796           296.60

     0.67

Latch timeouts                       1735           10410           173.50

     0.39

=====================================

More on indexes....

03/08/16        Activity: Index

10:45:21        03/08/16 10:45 to 03/08/16 10:45 (10 sec)

                                   Total         Per Min          Per Sec

   Per Tx

Find index entry                    17776          106656          1777.60

     4.24

Create index entry                      0               0             0.00

     0.00

Delete index entry                 111143          666858         11114.30

    26.53

Remove locked entry                     0               0             0.00

     0.00

Split block                             0               0             0.00

     0.00

Free block                             29             174             2.90

     0.01

================================

Finally the latches:

03/08/16        Activity: Latch Times

11:17:45        03/08/16 11:17 to 03/08/16 11:17 (10 sec)

       Locks   Lk usec   Wt usec   Lk usec   Wt usec

        /Sec     /Lock     /Lock      /Sec      /Sec

MTX     15888         0         0         0         0

USR         0         0         0         0         0

OM      14611         0         0         0         0

BIB     15782         0         0         0         0

SCH         0         0         0         0         0

LKP      6658         0         0         0         0

GST         0         0         0         0         0

TXT      1408         0         0         0         0

LKT      5253         0         0         0         0

LKT      5209         0         0         0         0

LKT      5273         0         0         0         0

LKT      5317         0         0         0         0

SEQ         0         0         0         0         0

AIB         0         0         0         0         0

TXQ     20439         0         0         0         0

===========================================

counts:

03/08/16        Activity: Latch Counts

11:21:57        03/06/16 02:08 to 03/08/16 11:21 (57 hrs 13 min)

               ----- Locks -----        ------ Busy ------        Naps        -

--------- Spins -----------    ----- Nap Max -----

  Owner        Total        /Sec        /Sec           Pct        /Sec        /

Sec       /Lock       /Busy       Total   HWM

MTX  --     712129433        3457           0           0.0          28       13

051           3     3234870         138   250

USR  --        194110           0           0           0.0           0

 0           0           0           0     0

OM   --    1347954788        6544           0           0.0           0        1

858           0     1867741         370    10

BIB  --     699531877        3396           0           0.0           0        1

384           0     1926612          35   250

SCH  --        370730           1           0           0.0           0

 0           0           0           0     0

LKP  --     286422796        1390           0           0.0           0

102           0       18128         110    10

GST  --        253893           1           0           0.0           0

 0           0           0           0     0

TXT  --     102473565         497           0           0.0           0

 3           0      218623           1    10

LKT  --     314458151        1526           0           0.0           0

 0           0         621          39    10

LKT  --     271685318        1319           0           0.0           0

 0           0         464          14    10

LKT  --     255297769        1239           0           0.0           0

 0           0         400           1    20

LKT  --     253810487        1232           0           0.0           0

 0           0         431          44    10

SEQ  --        129452           0           0           0.0           0

 0           0           0           0     0

AIB  --         39067           0           0           0.0           0

 0           0           0           0     0

TXQ  --     896593025        4352           0           0.0           6

994           0      171644           1    40

Enter <return> for more, A, L, R, S, U, Z, P, T, or X (? for help):

03/08/16        Activity: Latch Counts

11:22:00        03/06/16 02:08 to 03/08/16 11:21 (57 hrs 13 min)

               ----- Locks -----        ------ Busy ------        Naps        -

--------- Spins -----------    ----- Nap Max -----

  Owner        Total        /Sec        /Sec           Pct        /Sec        /

Sec       /Lock       /Busy       Total   HWM

EC   --             0           0           0           0.0           0

 0           0           0           0     0

LKF  --     690158906        3350           0           0.0           0

365           0      171000          81    10

BFP  --             2           0           0           0.0           0

 0           0           0           0     0

BHT  --    7093100330       34437           0           0.0           0        2

295           0    31513724         491    10

PWQ  --        408399           1           0           0.0           0

 0           0           0           0     0

CPQ  --      34280260         166           0           0.0           0

 0           0       11073           0     0

LRU  --    6465601420       31390           0           0.0           1        4

811           0       69780         874    10

LRU  --     376621493        1828           0           0.0           0

 0           0           0           0     0

BUF  --    3519010800       17084           0           0.0           0        8

525           0     3850820           3    20

BUF  --    3615680361       17554           0           0.0           2       17

875           1    39168405          14    20

BUF  --    3579977569       17380           0           0.0           3        3

003           0     2850957          69    20

BUF  --    3553969436       17254           0           0.0           3        4

998           0    14921600           2    40

INC  --             0           0           0           0.0           0

 0           0           0           0     0

L29  --             0           0           0           0.0           0

 0           0           0           0     0

L30  --             0           0           0           0.0           0

 0           0           0           0     0

L31  --             0           0           0           0.0           0

 0           0           0           0     0

Posted by dbeavon on 08-Mar-2016 10:57

@bremmeyr - I think OE-SQL92 came to mind when I saw those related comments on this thread :community.progress.com/.../8292

I have used OE-SQL92 quite a bit and it often performs better than ABL for bulk operations; I assume deletes are no different.  For one, the technology can leverage a different set of concurrency/isolation modes and locking levels (even up to locking the entire table if appropriate). Additionally I'm guessing that it can batch up and pipeline the various units of work (eg. index deletions) in a way that is unlike than what an individual ABL DELETE row statement would do.  Finally, in SQL92, a large deletion statement can be processed entirely within native, non-custom code without any AVM which, provided that is part of the bottleneck, would be a huge benefit on the CPU side of things.

I still have some hope for my ABL deletion program.  Especially if I can find more silver bullets like the -spin change.

Posted by Keith Sudbury on 08-Mar-2016 11:10

Sorry for the short hand... checking in between normal work :-)

Your -omsize parameter needs to be increased to reduce the OM latches knowledgebase.progress.com/.../20398

You could benefit from setting lruskips and looking at the size of your -B knowledgebase.progress.com/.../000031150

Setting lruskips to 20 will potentially give you some relief without much risk of degradation. Using a properly sized secondary buffer pool (-B2) and assigning the proper objects can also help out drastically.

It is possible that the BI file isn't sized properly and/or your -bibufs parameter is too low. Most people suggest using a biblocksize of 16KB and a large enough cluster size so that your checkpoints aren't too frequent (at least a few minutes apart). Some of the MTX latches are just part of deleting index entries though.

You say you have two storage areas... for the entire database or just this table? Hopefully they are at least Type II areas, otherwise you need to rethink the physical layout.

Posted by George Potemkin on 08-Mar-2016 12:02

David,

> 11:21:57        03/06/16 02:08 to 03/08/16 11:21 (57 hrs 13 min)

57 hours is too long interval for statistics. ;-)

Posted by gus on 08-Mar-2016 12:42

> On Mar 8, 2016, at 1:03 PM, George Potemkin wrote:

>

> 57 hours is too long interval for statistics. ;-)

All depends on what you want to measure.

For determining a week's worth of after-image volume it is too short. :)

Posted by George Potemkin on 08-Mar-2016 13:36

Gus, when we talk about the latch statistics I would not agree with any intervals longer than 2 seconds (but I would like to see statistics for many intervals). ;-)

Posted by bremmeyr on 09-Mar-2016 16:36

Your comments make me curious. Even with 2 different agents I envision OE-SQL92 and OE-ABL\4GL as accessing the stored data with the same database engine.  Perhaps access to the data is more separated then I thought. Another way to put it, is ABL and SQL 2 paths to the same data or 2 paths to the same database engine?

Posted by gus on 09-Mar-2016 16:42

think of it as 2 paths to same data

Posted by ChUIMonster on 09-Mar-2016 16:43

2 "languages", one storage engine

Posted by dbeavon on 09-Mar-2016 17:21

OE-SQL92 and OE-ABL are totally different.  They share low-level database resources, but resolve queries in different ways.  In promon you will often see lock types which are specifically there to support SQL92 features, while ABL would not otherwise have any use for them.  

(Whereas OE-ABL must respect SQL92 locks - and sometimes even take SQL92-style locks in order to coexist - ABL doesn't have the ability to take and use the full set of locks that are possible - see knowledgebase.progress.com/.../P37194)

There are other differences beyond locking types.  SQL92 can also do some cool things to resolve queries from index data alone, while ABL doesn't really have anything like that.

We use SQL92 primarily for reporting purposes but it has many other uses too.

Your idea about "2 paths to the same database engine" is not accurate.  That was probably the way SQL89 was built (which is now deprecated).  SQL89 always felt like an ugly SQL syntax layer that was built on top of the same AVM and it probably used the exact same database engine features as it would have been used in ABL.

Posted by dbeavon on 09-Mar-2016 17:29

As cool as SQL92 is, I should put in a plug for an idea I just added last month.  We need to be able to do a QUERY-PREPARE of sorts and generate results using the SQL92 engine.  The results would be stored in a regular ABL temp-table for subsequent processing....  

Please vote!

community.progress.com/.../make_use_of_sql92_query_optimizer_from_an_abl_program

The only alternative is shelling out to the OS and running sqlexp commands, and importing the results.  This feels very hack-ish.

This thread is closed