Setting the -pica parameter for OE Replication

Posted by lbryan on 15-Sep-2017 02:07

Hello,

I am looking for some advise on setting the -pica parameter for a group of databases with replication enabled.

From what I've read, all databases must have the same -pica value. Articles suggest the calculation is

-pica = (TAIW / BlockCount) * 1.25

So should I take the Total AI writes from the most used database?

Is this the total AI writes within 1 hour?

Also, it says from the  AI writes from the most busy period but I don't know when this is so not sure how to best estimate. I did an average of this in order to calculate the ai extents required so could this be used?

Any advice would be greatly appreciated.

All Replies

Posted by James Palmer on 15-Sep-2017 04:41

Where did you read that -pica should be the same on all databases? Because that's news to me, and something I fall foul of.

Which version of Progress are we talking by the way?

Posted by lbryan on 15-Sep-2017 04:48
Posted by Paul Koufalis on 15-Sep-2017 08:31

This advice should apply to 99% of OE environments:

If you are on any version before 10.2B08, use the maximum allowable pica value in your version. In 10.2B08 the max was raised from 8192 to 1,000,000 (I think they were sick of raising it every couple of years). In that case, use 32,768.

Larger pica values are required if a) the total AI writes is VERY high and/or b) the bandwidth to the target is relatively slow.

You cannot base your pica calculation only on AI writes unless you have a very fast (read local) network. The minute you're going over a WAN, bandwidth varies wildly. Last week one of our customer's replication was falling behind every day and almost catching up every night. We were tracking it in ProTop, watching AI files fill and get locked and fill and get locked. Replication status was stuck in "synchronizing" mode and never quite got back to "Normal Processing". The network admins swore nothing was wrong and only 1% of the WAN bandwidth was being used.

It must be a Progress bug!!  Not.

Copying a 1 MB file from source server to target server took minutes. Copying it back from target to source server took one second. Yeah, must be a Progress issue.

The lesson here is to monitor pica usage and to stop the replication server process before the pica buffers fill, otherwise forward processing for ALL users will be stalled and your phone will start ringing off the hook. Monitor replication server AND agent status (both sides) as there are bugs in earlier versions whereby the source will report "Normal Processing" and the target will report "Listening". ProTop does all of this and more.

Regards,

Paul

Posted by ChUIMonster on 15-Sep-2017 08:46

To complete Paul's story...

... after pestering the network vendor they had an "aha moment" and suddenly found something to change.  Normal bandwidth was restored and the replication caught up in a few hours and returned to its normal state of being always up to date.

Posted by n.lewers on 15-Sep-2017 16:22

When you say that "watching AI files fill and get locked and fill and get locked", is it then possible to fill up all available AI extents on the source DB and if so what happens if that is the case?  I would assume that the source db would then generate an error and pause, but does it give an AI write error or a replication sync error?   Thanks,

Posted by Paul Koufalis on 15-Sep-2017 18:57

If AI files are fixed size, once all of them fill your database will either crash or stall, depending on the use or not of the -aistall DB startup parameter.

If AI files are variable and are rotated every x minutes, (n-1) AI files will fill (status: locked) and the nth AI file (the busy one) will continue to grow until you hit some external limit: 2 GB file size if large files are not enabled, or you fill the file system. When this happens, again the DB will crash or stall depending on the use of -aistall.

As this is happening, the replication server and agent processes would be generating error messages in the database log files.

To summarize, you should be monitoring pica usage, AI file status (free, full, locked), replication server and agent status and the database log file. You can write your own scripts to do this or you can use ProTop to generate alerts when configured threshold(s) are reached. For example, you could configure ProTop to send an email when 2 AI files are locked and to send an SMS when 4 AI files are locked. Please note that alert generation is a paid add-on to ProTop Free.

Regards,

Posted by lbryan on 18-Sep-2017 06:02

Thanks. One of the databases will have a high volume of ai writes so I think I should start with this database to get the most accurate value for -pica.

I'm not sure how to check the network speed. Just copy a file and see the speed it says?

I should have said - I am using version 11.6.

Posted by Paul Koufalis on 18-Sep-2017 07:43

Your network admin can tell you the theoretical bandwidth and then you can copy some largish files (hundreds of MB) to estimate real bandwidth.

Another thing that I like to do is rotate AI files every 15 minutes or less and chart the file sizes. If the peaks are high relative to my network bandwidth, I might ask the network admin to spit out a chart of bandwidth utilization between the main datacentre and the DR site. You may find that your peak AI writes correspond to peak network usage, requiring some QoS configuration.

Posted by Dmitri Levin on 22-Sep-2017 14:50

Paul

Is there a way to quantitatively monitor -pica usage with either promon or VST  or anything else?

Other than it is Good ("Normal Processing".) or Bad ("synchronizing" )

All the rest that you write above sounds pretty straight forward:

Count how many AI are Locked, check the status of  server and agent being 1101 and 3101. We can even rotate AI files faster ( 15 min ) or slower ( 60 min or even 2 hr ) based on how many AI are locked out of total number that are available.

If server status not 1101 dsrutil -C restart server

If agent status not  3101 dsrutil -C restart agent

Posted by Libor Laubacher on 22-Sep-2017 14:59

promon - r&d - 1 - 16 I think

Posted by Paul Koufalis on 22-Sep-2017 15:00

Dmitri,

I think you're asking for promon - R&D - 1 - 16. That's the only way that I know to monitor pica usage and that's how ProTop does it. There is nothing in the VSTs.

FYI I use dsrutil db -C status -detail and check for 6021 or 3049.

FYI #2 (not for Dmitri but for other readers) dsrutil db -C restart agent is only available in 11.6+. Prior to that, you had to bounce the target DB to restart the agent.

Posted by Rob Fitzpatrick on 22-Sep-2017 15:35

You can get R&D 1|16 info from the _DbServiceManager VST in 11.6+.  Example:

find _dbservicemanager no-lock.         
display _dbservicemanager with 1 column.

ID                                   : 1                
Size of Pica                         : 8193.08          
Available Pica Message Entries       : 74908            
Free Pica Message Entries            : 74908            
Queued Pica Message Entries          : 0                
Used Message High Water Mark         : 15288            
Number of Times Pica Filled          : 0                
User Number of Last Pica Latch Holder: -1               
Number of Times Locks Taken Out      : 799392373        
Number of Lock Collisions            : 2021             
PID of Lock Holder                   : 0                
User Type of Lock Holder             : Not available    
User Number of Lock Holder           : 0                

(I fudged the column labels for the sake of clarity.)

More info on the new and updated VSTs, and other OE Repl enhancements:

http://pugchallenge.org/downloads2017/372_Whats%20New%20in%20Replication.pptx

Posted by Dmitri Levin on 28-Sep-2017 17:28

Thank you Libor, Paul and Rob

Communication Area Size   :      16384.00 KB

 Total Message Entries   :      149796

 Free Message Entries    :      149796

 Used Message Entries    :           0

 Used HighWater Mark     :         216

 Area Filled Count       :           0

 Service Latch Holder    :          -1

 Access Count            :      589755

 Access Collisions       :         114

Now I know I set -pica too high at 16384.

Used HighWater Mark shows that.

So the goal is that HWM is somehow close to Total Message Entries, like half or more, right?

Area Filled Count should be small (single digit) or zero?

Access Collisions ???

Posted by kevin hermans on 29-Sep-2017 03:55

I see you have also Access Collisions is there somebody who can tell something more about this parameter.
I think there is no problem if it not exceeds a certain proportion to the Access Count but I'm curious.

Communication Area Size   :       65.00 KB
  Total Message Entries   :         595
  Free Message Entries    :         595
  Used Message Entries    :           0
  Used HighWater Mark     :         495
  Area Filled Count       :           0
  Service Latch Holder    :          -1
  Access Count            :   213971543
  Access Collisions       :       65747

Posted by George Potemkin on 29-Sep-2017 04:13

My bet: access count and access collisions are similar to the locks and the naps for the latches. Service latch is an administrative (a.k.a. “private”) latch. The administrative latches are quite different from the spin's latches but any latches are born to resolve the collisions.

Posted by George Potemkin on 29-Sep-2017 07:59

Example of pica filled:

Communication Area Size: 20500.00 KB
Total Message Entries: 187428

Time     Free Msgs  Used Msgs  Used HWM  Filled Cnt  Access Cnt  Collisions  Locked by Replication Server
04:00:28    187428          0     39351           0   146653272       24813          0
04:30:29    176446      10982     39351           0   146787113       24823      10982 
05:00:30    129880      57548     57554           0   146903954       24837      57548 
05:30:31     82080     105348    105349           0   147012984       24864     105348 
06:00:32     40849     146579    146579           0   147115650       24895     146579 
06:30:33      1727     185701    185736           0   147259217       24934     185701 
07:00:34        90     187338    187428        2757   147399645       24985     187338 
07:30:35         9     187419    187428        6039   147546696       25036     187419 
08:00:36      6094     181334    187428        6826   147699472       25088     181334 
08:30:37       148     187280    187428        8676   147835357       25132     187280 
09:00:38       282     187146    187428       11422   147990051       25175     187146 
09:30:39        11     187417    187428       13920   148140550       25230     187417 
10:00:40         8     187420    187428       16565   148301390       25271     187420 
10:30:40         0     187428    187428       19071   148474473       25320     187428   
11:00:41         6     187422    187428       20430   148667535       25351     187422 
11:30:42         0     187428    187428       21451   148860809       25367     187428 
12:00:43         0     187428    187428       22290   149052507       25387     187428
12:30:44         0     187428    187428       23280   149244708       25404     187428
13:00:45         6     187422    187428       24393   149439472       25427     187422 
13:30:46         0     187428    187428       25901   149634877       25463     187428
14:00:47         0     187428    187428       27601   149829466       25516     187428
14:30:48         0     187428    187428       30126   150026275       25562     187428

Access Collisions are almost the same all the time.

Used Message Entries = Locked by OpenEdge Replication Server

Activity: AI Log

Activity: AI Log

Time    Total Wrt  AIW Wrt Notes Wrt  Bytes Wrt BusyWaits Partial writes
04:00:28  6410836  6134415    860955K 98412861K  3506  248953
04:30:29    30234    29677   4079689    476171K    17     429
05:00:30    76472    75515  10441933   1211905K    40     616
05:30:31    73177    72211  10123924   1159241K    45     614
06:00:32    66706    65792   9253486   1055882K    45     614
06:30:33    86124    85119  11848009   1365994K    54     622
07:00:34    50341    49417   6874101    794400K    35     615
07:30:35    47549    46614   6566173    749848K    37     612
08:00:36    64745    63774   8997106   1024461K    46     620
08:30:37    44369    43458   6426670    698868K    43     621
09:00:38    29663    28685   4459687    464162K    41     609
09:30:39    41087    40112   6251794    646457K    56     624
10:00:40    28148    27162   4220942    439805K    91     617
10:30:40    26172    25201   4019687    408615K    99     595
11:00:41    13070    12416   2005665    201711K   173     443
11:30:42    10044     9531   1543874    154703K   112     359
12:00:43     8280     7677   1261205    126705K    68     347
12:30:44     9675     9166   1479821    148782K   149     362
13:00:45    11091    10619   1723126    171474K   185     359
13:30:46    15510    14665   2414339    240059K   287     488
14:00:47    17176    16211   2681374    266200K   226     514
14:30:48    24562    23546   3876160    383844K   154     538

And, of course, the filled pica queue cause the slowness on the source db (Total AI Writes are decreased by 10 times).

Posted by Jeff Owen on 29-Sep-2017 08:04
Access collisions is the number of times that a process tried to add to the queue or remove from the queue (repl server) and found
that the queue was locked.  It can be used with access counts to determine what level contention exists when accessing the queue.
Your example suggests that there is very little contention on the pica queue.  And that interactions with the pica queue are not affecting
the performance of the source database.
 
-jeff
 
Posted by George Potemkin on 29-Sep-2017 08:35

> Access collisions is the number of times that a process tried to add to the queue or remove from the queue (repl server) and found that the queue was locked.

What is a typical duration of a queue lock and what are the factors that can affect it?

Posted by Jeff Owen on 29-Sep-2017 09:56
Once the lock is obtained, a free entry is either available or it is not. Search of the queue is not required to find a free entry while
the lock is held.  If a free entry is not available, the lock is released.  If a free entry on the queue is available, a small amount of  data
is updated in the message entry ( 112 bytes). Then the lock is released.  So the lock is held for a very, very, short time in duration.
 
-jeff
 
 
Posted by George Potemkin on 29-Sep-2017 12:16

Thanks, Jeff.

The pica queue lock is the same as service latch lock, isn't it?

>  So the lock is held for a very, very, short time in duration.

Is it less than a microsecond as in case with the "spin" latches? Sorry I don't know a correct term to name the classical latches (like MTX, USR etc) in opposite to the administrative latches (like the pica service latch).

Access Count: 150,000,000 per 30 min in my example. If access count = lock count then we can roughly estimate the total duration of the queue lock as being less than 150 sec during 30 min interval or 8% (or less) of all time the queue is locked. Is it correct?

Access Collisions to Access Count ratio is only 0.017% in my example and 0.019% in Dmitri's example. Which processes and how activily are competing for the queue lock? Why the collisions are so rare?

UPD: The administrative latches seem to be used only for situations with low collisions (like BIADD latch used by prostrct add online utility). I guess they don't even use the spin lock mechanism. So the low access collisions in Service Manager seem to be expected.

Posted by George Potemkin on 01-Oct-2017 03:38

I missed that Status: Database Service Manager reports the counters since db startup. So I slightly "re-worded" the same statistics I posted above.

Status: Database Service Manager


Time Free Msgs Used Msgs Locked by RPLS Filled Access Collisions AI Blocks Percent Access/Block Notes/Block 04:00:28 187428 0 0 0 146,653,272 24813 6,161,883 96% 23.80 140 04:30:29 176446 10982 10982 0 133,841 10 29,805 99% 4.49 137 05:00:30 129880 57548 57548 0 116,841 14 75,856 99% 1.54 138 05:30:31 82080 105348 105348 0 109,030 27 72,563 99% 1.50 140 06:00:32 40849 146579 146579 0 102,666 31 66,092 99% 1.55 140 06:30:33 1727 185701 185701 0 143,567 39 85,502 99% 1.68 139 07:00:34 90 187338 187338 2757 140,428 51 49,726 99% 2.82 138 07:30:35 9 187419 187419 3282 147,051 51 46,937 99% 3.13 140 08:00:36 6094 181334 181334 787 152,776 52 64,125 99% 2.38 140 08:30:37 148 187280 187280 1850 135,885 44 43,748 99% 3.11 147 09:00:38 282 187146 187146 2746 154,694 43 29,054 98% 5.32 153 09:30:39 11 187417 187417 2498 150,499 55 40,463 98% 3.72 155 10:00:40 8 187420 187420 2645 160,840 41 27,531 98% 5.84 153 10:30:40 0 187428 187428 2506 173,083 49 25,577 98% 6.77 157 11:00:41 6 187422 187422 1359 193,062 31 12,627 97% 15.29 159 11:30:42 0 187428 187428 1021 193,274 16 9,685 96% 19.96 159 12:00:43 0 187428 187428 839 191,698 20 7,933 96% 24.16 159 12:30:44 0 187428 187428 990 192,201 17 9,313 96% 20.64 159 13:00:45 6 187422 187422 1113 194,764 23 10,732 97% 18.15 161 13:30:46 0 187428 187428 1508 195,405 36 15,022 97% 13.01 161 14:00:47 0 187428 187428 1700 194,589 53 16,662 97% 11.68 161 14:30:48 0 187428 187428 2525 196,809 46 24,024 98% 8.19 161 15:00:49 8 187420 187420 2627 191,212 59 26,520 98% 7.21 160 15:30:51 2244 185184 185184 2285 175,022 70 28,485 98% 6.14 159 16:00:52 24712 162716 162716 1165 172,104 87 42,393 99% 4.06 161 16:30:53 103103 84325 84325 0 143,864 55 26,906 98% 5.35 152 17:00:54 123870 63558 63558 0 68,242 14 17,954 97% 3.80 145 17:30:55 141597 45831 45831 0 67,828 9 19,297 97% 3.51 146 18:00:56 187428 0 0 0 178,082 40 16,868 97% 10.56 149 18:30:57 187428 0 0 0 257,580 44 7,775 94% 33.13 143 19:00:58 187428 0 0 0 345,113 96 25,769 98% 13.39 149 19:30:59 187428 0 0 0 355,315 77 22,158 98% 16.04 137 20:00:59 187428 0 0 0 281,343 69 18,364 98% 15.32 137 20:31:00 187428 0 0 0 81,337 10 902 89% 90.17 142 21:01:01 187428 0 0 0 60,178 5 190 71% 316.73 138 21:31:02 187428 0 0 0 59,145 2 247 79% 239.45 140 22:01:03 187428 0 0 0 55,854 4 130 71% 429.65 132 22:31:04 187428 0 0 0 53,761 0 28 25% 1,920.04 125 23:01:05 187428 0 0 0 57,486 1 161 65% 357.06 137 23:31:06 187428 0 0 0 48,780 1 15 42% 3,252.00 132 00:01:07 187428 0 0 0 51,271 3 658 96% 77.92 147 00:31:08 187428 0 0 0 50,421 3 1,696 98% 29.73 143 01:01:09 187428 0 0 0 78,023 3 442 71% 176.52 140 01:31:10 187428 0 0 0 172,987 4 2,707 83% 63.90 143 02:01:11 187428 0 0 0 80,442 1 281 51% 286.27 142 02:31:12 187428 0 0 0 85,199 1 593 68% 143.67 134 03:01:13 187428 0 0 0 185,981 23 9,860 97% 18.86 132

where:

First interval - 410 hrs 14 min (db uptime).
The rest intervals - 30 min.
"Filled", "Access", "Collisions" are the increments per interval of the corresponding counters in Status: Database Service Manager ("Area Filled Count", "Access Count", "Access Collisions").

"AI Blocks" - the number of AI blocks generated per interval = "Total AI Writes" minus "Partial Writes" (from Activity: AI Log)

"Percent" = "AI Blocks" / "Total AI writes" * 100% = an indirect indicator of idle db.

"Notes/Block" = "Records written" (from Activity: AI Log) / "AI Blocks". AI blocksize = 16K. Average size of recovery notes is 112 bytes in the given example.

Conclusions/Remarks:

An unknown process(es) on source db is creating the collisions trying to access the pica queue but all 47 checks by promon show that the message entries locked only by Replication Server. The collisions are extremely low even during the most busy time: 96 collisions per 30 minutes (or one collision in 20 sec). In other words, almost 100% of the pica queue accesses are generated by Replication Server. I guess the ABL sessions do not access the pica queue. In my example there were about 2,500 sessions. It's unlikely that Replication Server would be able to dominate over their access requests. So the access collisions can be created only by the special processes that are working with AI files. UPD: Documentation says: "AIW and REPL attempting simultaneous access".

Replication Server generates at least one access per incoming AI block (e.g. 102,666 accesses per 66,092 AI blocks). When transaction activity is low the number of pica queue accesses per AI block can be huge. In other words, Replication Server accesses the pica queue at least 27 times per sec (e.g. 48,780 accesses per 30 min with only 15 incoming AI blocks). For comparison: at busy time db was able to generate 85,502 AI blocks per 30 min or 48 AI blocks per sec.

Max value of "Area Filled Count" per 30 min interval was 3,282 or almost twice per sec.

Posted by George Potemkin on 01-Oct-2017 06:53

> Now I know I set -pica too high at 16384.

> Used HighWater Mark shows that.

16 MB of RAM per db is not a big price for insurance. ;-)

> So the goal is that HWM is somehow close to Total Message Entries, like half or more, right?

Unfortunately HWM is almost useless without date/time when the message entries had rised to the specified value. It could take place a month ago due to some particular conditions at this time (e.g. abnormal peak of transaction activity, a network issue between source and target dbs etc). The rest of time (untill db restart) you will see a "reminder" about past bad day that does not matter for the current situation.

.

> Area Filled Count should be small (single digit) or zero?

Only zero! Other values would mean that performance on source db was considerably degradated due to a short pica queue at least for some period of time.

> Access Collisions ???

It's a field that can be totally ignored.

Posted by Jeff Owen on 02-Oct-2017 07:36
The duration of a lock of the pica queue is less than a millisecond.  Processes that flush AI buffers add entries to the pica queue and the replication server reads entries from the queue.    I should note that the pica queue is not used during the startup synchronization process.  During synchronization, the repl server reads directly from the AI extent(s).  The pica queue  is then “turned on” once is gets close to the most recently written AI blocks.  The pica queue is then used  during normal processing.  I would assume your contention is very low because you are running an AIW on the source.  The only contention in this case would only between the AIW and the repl server.
 
Generally, you should see very little contention for the pica queue with an AI configuration that is well tuned. Otherwise, we’d like to hear about an environment where a high level of pica queue is seen.
 
-jeff
 
Posted by George Potemkin on 02-Oct-2017 08:47

Earlier mentioned knowledgebase article "Clarification of the -pica replication enabled database startup parameter"

knowledgebase.progress.com/.../P104638

says: "There is a PICA queue on both the source and target databases"

Should the -pica parameters be the same on source and target databases or should we tune them individually? These databases are running on a different hardware and under a different load.

I guess that in the customer's case I posted above the performance degradation on source db was caused by slowness of  replication agent on target db. Theory: replication server was spending a lot of time waiting for the acknowledgments from replication agent that in turn resulted in the filled pica queue on source db => source's APW was waiting for pica access => the updates on source db was very slow. Is it a reasonable explanation?

Will we also get the negative impact on source database if the pica queue is filled on target database for any reasons? Will it cause the filled pica queue also on source db as well?

Posted by George Potemkin on 02-Oct-2017 09:08

Comparing the pica queues on source and target dbs:

Source:

09/27/17        Status: Database Service Manager
16:30:53        
Communication Area Size   :      20500.00 KB
  Total Message Entries   :      187428
  Free Message Entries    :      103103
  Used Message Entries    :       84325
  Used HighWater Mark     :      187428
  Area Filled Count       :       36203
  Service Latch Holder    :          -1
  Access Count            :   150708477
  Access Collisions       :       25833
Registered Database Service Objects
Name                             Rdy Status  Messages Locked by
OpenEdge Replication Server       Y  RUN        84325 

Target:

09/27/17        Status: Database Service Manager
16:30:49        
Communication Area Size   :      20500.00 KB
  Total Message Entries   :      187428
  Free Message Entries    :      187428
  Used Message Entries    :           0
  Used HighWater Mark     :           0
  Area Filled Count       :           0
  Service Latch Holder    :          -1
  Access Count            :    24717946
  Access Collisions       :         141
Registered Database Service Objects
Name                             Rdy Status  Messages Locked by
OpenEdge Replication Agent        Y  RUN            0 

Access count on source db is 6 times larger than the one on target.
Large Area Filled Count on source db and zero HWM on target.

Posted by Jeff Owen on 02-Oct-2017 10:16
The pica queue on the target does not need to be same on the source and the target.  The default size on the target is likely sufficient since the queue is only used by dsrutil  to terminate and agent, restart and agent (11.6+), and to start transition.  When configuring restart after transition, the value used for the restart should be the same as the setting for the original source.
 
Regarding the customer’s problem, the pica queue on the target should not fill up during the processing of the AI blocks.  The queue on the target is not used as part of the replication process.   The repl server generally doesn’t wait for acknowledgments from the agent so I would be skeptical of that theory.  The agent may not be keeping up with the source.  How similar in configuration are the source and target servers?  Is a BIW running on the target?  And if AI is configured on the target, is an AIW running?
 
Posted by George Potemkin on 03-Oct-2017 04:45

BIW is running on the target.  AI on the target is not enabled.

The customer's case: a rather high transaction activity (~ 0.7 MB/sec of AI/BI writes) caused the persistent increase of the used message entries in the pica queue on source db. When the pica queue was totally filled we got 100% busy MTX and AIB latches. It means that almost every snapshot in promon shown an owner of these latches. MTX latch was owned by ABL users (each time a different one). AIB latch was always owned by AIW process. Just for a record: the competition for AIB latch between ABL users are not possible because they always lock AIB (and BIB) latch under cover of MTX latch: they lock MTX latch first and only then they're requesting AIB/BIB lock. I guess there are only a few special processes that can compete for AIB latch directly but mainly it's AIW.

The number of the AIB latch locks was rather low for a busy latch: 700-2000 locks per sec. Under normal conditions the lock duration of AIB latch is less than 100 nanoseconds. Hence to make the latch to be 100% busy we need 10 millions locks per sec - it's impossible to get such high numbers of AIB locks in a real database.

> The duration of a lock of the pica queue is less than a millisecond.

It's a huge time compared to the duration of the "spin" latches like AIB. And it also means that we can create only about a thousand locks per sec (that matches the numbers above).

New explanation of the customer's case:

I guess the AIW process is requesting the pica queue lock while it holds AIB latch. When AIW is forced to wait for the pica queue lock then the duration of AIB latch is increasing by 10 thousands times - from 100 ns to 1 ms. ABL process that already owns the MTX latch is forced to wait for AIB latch. That is why the MTX latch became 100% busy at the same time with the AIB latch. The number of MTX/AIB/BIB latch locks are closed to each other - the difference between MTX and AIB latch locks used to be plus/minus 1%, the difference between MTX and BIB latches used to be plus/minus 10% (or more during the idle time).

If this my "theory" is correct then the only solution is to have a large pica queue to avoid its total filling during the periods with highest transaction activity.

Posted by Jeff Owen on 03-Oct-2017 07:42
I misspoke regarding the duration of a  pica queue lock.  The duration of a pica queue lock is less than a microsecond.   The mechanism is the same as with “spin” latches.  My apologizes for any confusion.
 
It’s not so much, in the customer’s case, that is it is waiting for the pica queue lock, it’s that it’s waiting for a free message entry on the queue.  The APW is queuing messages faster than the repl server can read AI blocks and send them to the target.  The repl server’s job can be affected by network bandwidth, the target server being less  performant than the source server, and if the target is not tuned as well as the source.  In any case, I would agree that the solution for the customer is to increase the pica queue size so that it is large enough to accommodate the highest level of transaction activity.
 
 
Posted by George Potemkin on 03-Oct-2017 09:10

Thanks again, Jeff.

> The duration of a pica queue lock is less than a microsecond.   The mechanism is the same as with “spin” latches.

How long AIW will wait before a second try if it fails to get the pica queue lock? 10 ms? Does it use the -nap value?

Posted by Jeff Owen on 04-Oct-2017 06:56
Yes, 10 ms.  No for using -nap value, it is not configurable.
 
-jeff
 
Posted by Dmitri Levin on 06-Oct-2017 11:43

>The customer's case: a rather high transaction activity (~ 0.7 MB/sec of AI/BI writes) caused the persistent increase

>of the used message entries in the pica queue on source db.

What was the -pica value? And if it was increased, what is it now?

We have pretty similar transaction activity.

>BIW is running on the target.  AI on the target is not enabled.

Why would one enable AI on Target DB and is it even possible?

Posted by Valeriy Bashkatov on 10-Oct-2017 03:18

>Why would one enable AI on Target DB and is it even possible?

Yes it is possible. Someone use AI on the Target database to archive AI extents from Target not from Source, but for me this decision is doubtful, since there is a possible a time lag between the Source and Target.

However, if you want to use a Replication Set in OE 11.7 you must activate after-imaging on your target databases for this.

This thread is closed