Replication Redo 0%

Posted by Dmitri Levin on 22-Feb-2018 15:40

IBM AIX 7.1

OpenEdge 11.7.1 with OE Replication Plus

We had a power outage in the area ( truck hitting Peco pole) and all live databases went down. Here is the log of what happened when DB went up. Obviously database has to roll back transactions and thus we had 79 seconds of Physical Redo process. It was nice to see that Progress is reporting percentage of Physical Redo. good job! And 79 seconds not bad for 1.5 TB database with thousand clients.

My question is what is Replication Redo phase? Why it is reporting 0% all 121 seconds? Overall it was a good experience to see how fast OE db went up and also OE Replication pick up immediately without any intervention. I had more to do to fix Oracle DataGuard than OE Replication, though it worked fine too eventually.

OpenEdge Release 11.7.1 as of Wed Jun 14 19:00:30 EDT 2017

13:08:50 BROKER     The startup of this database requires 35013Mb of shared memory. Maximum segment size is 32768Mb.

13:09:49 BROKER 0: Multi-user session begin. (333)

13:09:49 BROKER 0: Connections to this database will not be allowed until all Database Services started have completed their startup and initialization. (10545)

13:09:50 BROKER 0: Begin Physical Redo Phase at 16384 . (5326)

13:10:20 BROKER 0: Physical Redo 22% at block 18231. 1847 blocks processed in 30 seconds.

13:10:50 BROKER 0: Physical Redo 46% at block 20163. 3779 blocks processed in 60 seconds.

13:11:09 BROKER 0: Physical Redo 100% at block 21616. 5232 blocks processed in 79 seconds.

13:11:09 BROKER 0: Physical Redo Phase Completed at blk 21616 off 16205 upd 292420. (7161)

13:11:09 BROKER 0: At end of Physical redo, transaction table size is 4096. (13547)

13:11:09 BROKER 0: Begin Replication Redo Phase for 20 live transactions from 16384 . (12080)

13:11:39 BROKER 0: Replication Redo 0% at block 4095. 0 blocks processed in 30 seconds.

13:12:09 BROKER 0: Replication Redo 0% at block 4095. 0 blocks processed in 60 seconds.

13:12:39 BROKER 0: Replication Redo 0% at block 4095. 0 blocks processed in 90 seconds.

13:13:09 BROKER 0: Replication Redo 0% at block 4095. 0 blocks processed in 120 seconds.

13:13:10 BROKER 0: Replication Redo 100% at block 4095. 0 blocks processed in 121 seconds.

13:13:10 BROKER 0: Replication Redo Phase completed at block 145534, offset 398. (12081)

13:13:10 BROKER 0: Begin Physical Undo 37 transactions at block 21616 offset 16241 (7163)

13:13:11 BROKER 0: Physical Undo Phase Completed at 21615 . (5331)

13:13:11 BROKER 0: Begin Logical Undo Phase, 37 incomplete transactions are being backed out. (7162)

13:13:11 BROKER 0: Logical Undo Phase begin at Block 21615 Offset 5802. (11231)

13:13:13 BROKER 0: Logical Undo Phase completed at block 145533, offset 15860. (12095)

13:13:16 BROKER 0: Login by root on /dev/pts/4. (452)

13:13:16 BROKER 0: The OpenEdge Replication Server is starting...

Posted by Dapeng Wu on 23-Feb-2018 15:44

A bug has been entered: PSC00363860.

Dapeng

Posted by George Potemkin on 25-Feb-2018 05:33

Fortunately the documentation is wrong.

On screen:

13:59:38 BROKER 0: Multi-user session begin. (333)
13:59:38 BROKER 0: Crash Recovery Status Interval (-crStatus): 1 seconds (18186)
13:59:38 BROKER 0: Crash Recovery Transaction Display (-crTXDisplay): ON (18187)
13:59:38 BROKER 0: Begin Physical Redo Phase at 2880 . (5326)
13:59:38 BROKER 0: Physical Redo Phase Completed at blk 2976 off 8144 upd 2975. (7161)
13:59:38 BROKER 0: At end of Physical redo, transaction table size is 128. (13547)
13:59:38 BROKER 0: Begin Physical Undo 1 transactions at block 2976 offset 8180 (7163)
13:59:38 BROKER 0: Transaction 404 by user 5 at 2018/02/25@13:57:43. Counter: 26
13:59:39 BROKER 0: Physical Undo 30% at block 2064. 913 blocks processed in 1 seconds. Counter: 58
13:59:40 BROKER 0: Physical Undo 59% at block 1186. 1791 blocks processed in 2 seconds. Counter: 44
13:59:41 BROKER 0: Physical Undo 87% at block 228. 2621 blocks processed in 3 seconds. Counter: 31
13:59:41 BROKER 0: Physical Undo Phase Completed at 1086 . (5331)
13:59:41 BROKER 0: Begin Logical Undo Phase, 1 incomplete transactions are being backed out. (7162)
13:59:41 BROKER 0: Logical Undo Phase begin at Block 1086 Offset 6840. (11231)
13:59:41 BROKER 0: Logical Undo 100% at block 1086. 2915 blocks processed in 3 seconds. Counter: 26
13:59:41 BROKER 0: Logical Undo Phase completed at block 1086, offset 6840. (12095)

In db log:

[2018/02/25@13:59:38.016+0300] P-19741 T-140406907832064 I BROKER 0: (333) Multi-user session begin. 
[2018/02/25@13:59:38.272+0300] P-19741 T-140406907832064 I BROKER 0: (18186) Crash Recovery Status Interval (-crStatus): 1 seconds 
[2018/02/25@13:59:38.272+0300] P-19741 T-140406907832064 I BROKER 0: (18187) Crash Recovery Transaction Display (-crTXDisplay): ON 
[2018/02/25@13:59:38.272+0300] P-19741 T-140406907832064 I BROKER 0: (5326) Begin Physical Redo Phase at 2880 . 
[2018/02/25@13:59:38.279+0300] P-19741 T-140406907832064 I BROKER 0: (7161) Physical Redo Phase Completed at blk 2976 off 8144 upd 2975. 
[2018/02/25@13:59:38.279+0300] P-19741 T-140406907832064 I BROKER 0: (13547) At end of Physical redo, transaction table size is 128. 
[2018/02/25@13:59:38.280+0300] P-19741 T-140406907832064 I BROKER 0: (7163) Begin Physical Undo 1 transactions at block 2976 offset 8180 
[2018/02/25@13:59:38.280+0300] P-19741 T-140406907832064 I BROKER 0: (-----) Transaction 404 by user 5 at 2018/02/25@13:57:43. Counter: 26
[2018/02/25@13:59:41.614+0300] P-19741 T-140406907832064 I BROKER 0: (5331) Physical Undo Phase Completed at 1086 . 
[2018/02/25@13:59:41.614+0300] P-19741 T-140406907832064 I BROKER 0: (7162) Begin Logical Undo Phase, 1 incomplete transactions are being backed out. 
[2018/02/25@13:59:41.614+0300] P-19741 T-140406907832064 I BROKER 0: (11231) Logical Undo Phase begin at Block 1086 Offset 6840. 
[2018/02/25@13:59:41.618+0300] P-19741 T-140406907832064 I BROKER 0: (12095) Logical Undo Phase completed at block 1086, offset 6840.

I don't understand why Progress writes the "Transaction by user at" message to both destinations because the message in promsgs file does not have the "%B" tag and the message is not dublicated/embedded into the Progress executables:

In promsgs file:

Transaction %l by user %d at %s. Counter: %l

Another side note: as we see in this message the recovery notes contains the user numbers but 'rfutil -S scan verbove' reports only user's names stored in RL_TBGN notes. It's very inconvenient when we need tp parse the transactions generated by batch processes that share the same login name.

Conclusion: use the -crTXDisplay parameter to start a database. Especially after db crash.

All Replies

Posted by Dapeng Wu on 22-Feb-2018 16:25

Replication redo is a part of the recovery phase if the database is replication enabled. It's used to collect some information from the recovery log file so that it can re-establish the replication environment, for example, record locks etc. Without this phase, some operations, say read on a target database, may see inconsistent results or unexpected behaviors.

The log file clearly shows some work has been done during the replication redo phase. So it looks like a bug in the percentage report.

Dapeng

Posted by George Potemkin on 23-Feb-2018 00:56

> It was nice to see that Progress is reporting percentage of Physical Redo. good job!

This feature is available since version 11.6:

Crash Recovery Status Interval (-crStatus): <num> seconds (18186)
This parameter defines the interval between progress approximation messages for the redo and undo phases of crash recovery.

Crash Recovery Transaction Display (-crTXDisplay): <on/off> (18187)
If you will use the –crTXDisplay parameter then Progress will write to db log the numberless ;-) message # 18158:
Transaction <TRID> by user <num> at <datetime>. Counter: <cluster>
This information can assist you in determining if there are many transactions to back out during crash recovery, or if there is perhaps one long-running transaction.

Posted by George Potemkin on 23-Feb-2018 03:30

Off-topic: Am I correctly interpret the values reported in the messages?

> 13:11:09 BROKER 0: Physical Redo 100% at block 21616. 5232 blocks processed in 79 seconds.
> 13:11:09 BROKER 0: Physical Redo Phase Completed at blk 21616 off 16205 upd 292420. (7161)

I think the messages say that broker scanned 5232 BI blocks and updated 292420 DB blocks.
Hence one BI block in average contains 56 (=292420 / 5232) recovery notes that describe the physical changes in DB blocks.

Max offset in BI blocks mentioned in the messages is 16384 (=16K). Hence BI block size = 16K
The average size of recovery notes that describe the physical changes in DB blocks is less than 293 bytes (=16K / 56). Not all recovery notes describe the physical changes. I would expect that real average size of BI notes is less than 200 bytes. In other words BI blocks contained enough number of notes like RL_TBGN, RL_TEND or RL_TMSAVE. Hence either the most of transactions were short or they consisted of the small subtransactions.

Redo Phase updated 292420 DB blocks in 79 seconds => 3701.52 DB blocks/sec
Obviously db block size is 8K. => DB Writes were 28.92 MB/sec. Fast!

BI Writes were 66 blocks/sec or 1.16 MB/sec (= 5232 / 79 * 16K). It's less than a typical BI writes in multi-user environment - a hundred or so BI blocks/sec. It's much less than the speed of BI writes in the Fugal test (proutil -C bigrow -zextendSyncIO). Hence the bottleneck during Physical Redo Phase were DB reads/writes.

> 13:13:11 BROKER 0: Begin Logical Undo Phase, 37 incomplete transactions are being backed out. (7162)
> 13:13:11 BROKER 0: Logical Undo Phase begin at Block 21615 Offset 5802. (11231)
> 13:13:13 BROKER 0: Logical Undo Phase completed at block 145533, offset 15860. (12095)

Logical Undo Phase did NOT created 123,918 (=145533 – 21615) BI blocks, did it? It should not be larger than the number of BI blocks processed by Physical Redo Phase (5,232 BI blocks). Undo Phase completed at BI block with large number because the block numbers are not sequential in the ring of BI clusters. Right?

Posted by Dapeng Wu on 23-Feb-2018 07:25

George,

I think you got most of the calculation right. One thing I want to point out is that in Redo Phase, not all the block updates will turn into actual writes. A block may be updated many times but only with one or even no writes.

During the Redo phase, we only replay the notes from the BI. This means no BI writes in this phase. So your calculation of 66 block/sec is actually from BI reads. When recovery moves to Physical/Logical undo phase, then BI activity will involve writes because we need to record the (new) actions to roll back uncommitted transactions.

Your understanding of the Logical undo phase is also correct. You can't use the block number/offsets to calculate the number of BI blocks, as BI blocks are grouped into clusters, but clusters may be out of order.

Dapeng

Posted by George Potemkin on 23-Feb-2018 07:47

Thanks, Dapeng!

> One thing I want to point out is that in Redo Phase, not all the block updates will turn into actual writes.

And most likely there were actually the /reads/. As far as I know db was started with -B ~ 3,000,000. 292,420 of updated db blocks used only small part of this memory. So Physical Redo Phase created 28.92 MB/sec of disk reads. It looks as a reasonable number for random disk reads.

Posted by Dapeng Wu on 23-Feb-2018 15:44

A bug has been entered: PSC00363860.

Dapeng

Posted by Dmitri Levin on 23-Feb-2018 16:35

>> Crash Recovery Status Interval (-crStatus): <num> seconds (18186)

The default 30 seconds works pretty good. Nice to know I can change it. Thanks

>>Crash Recovery Transaction Display (-crTXDisplay): <on/off> (18187)

>>If you will use the –crTXDisplay parameter then Progress will write to db log

The documentation says

Use Crash Recovery Transaction Display (-crTXDisplay) to display to the screen only (not the log file) the

contents of the transaction table prior to the Physical Undo phase of crash recovery.

Posted by George Potemkin on 25-Feb-2018 05:33

Fortunately the documentation is wrong.

On screen:

13:59:38 BROKER 0: Multi-user session begin. (333)
13:59:38 BROKER 0: Crash Recovery Status Interval (-crStatus): 1 seconds (18186)
13:59:38 BROKER 0: Crash Recovery Transaction Display (-crTXDisplay): ON (18187)
13:59:38 BROKER 0: Begin Physical Redo Phase at 2880 . (5326)
13:59:38 BROKER 0: Physical Redo Phase Completed at blk 2976 off 8144 upd 2975. (7161)
13:59:38 BROKER 0: At end of Physical redo, transaction table size is 128. (13547)
13:59:38 BROKER 0: Begin Physical Undo 1 transactions at block 2976 offset 8180 (7163)
13:59:38 BROKER 0: Transaction 404 by user 5 at 2018/02/25@13:57:43. Counter: 26
13:59:39 BROKER 0: Physical Undo 30% at block 2064. 913 blocks processed in 1 seconds. Counter: 58
13:59:40 BROKER 0: Physical Undo 59% at block 1186. 1791 blocks processed in 2 seconds. Counter: 44
13:59:41 BROKER 0: Physical Undo 87% at block 228. 2621 blocks processed in 3 seconds. Counter: 31
13:59:41 BROKER 0: Physical Undo Phase Completed at 1086 . (5331)
13:59:41 BROKER 0: Begin Logical Undo Phase, 1 incomplete transactions are being backed out. (7162)
13:59:41 BROKER 0: Logical Undo Phase begin at Block 1086 Offset 6840. (11231)
13:59:41 BROKER 0: Logical Undo 100% at block 1086. 2915 blocks processed in 3 seconds. Counter: 26
13:59:41 BROKER 0: Logical Undo Phase completed at block 1086, offset 6840. (12095)

In db log:

[2018/02/25@13:59:38.016+0300] P-19741 T-140406907832064 I BROKER 0: (333) Multi-user session begin. 
[2018/02/25@13:59:38.272+0300] P-19741 T-140406907832064 I BROKER 0: (18186) Crash Recovery Status Interval (-crStatus): 1 seconds 
[2018/02/25@13:59:38.272+0300] P-19741 T-140406907832064 I BROKER 0: (18187) Crash Recovery Transaction Display (-crTXDisplay): ON 
[2018/02/25@13:59:38.272+0300] P-19741 T-140406907832064 I BROKER 0: (5326) Begin Physical Redo Phase at 2880 . 
[2018/02/25@13:59:38.279+0300] P-19741 T-140406907832064 I BROKER 0: (7161) Physical Redo Phase Completed at blk 2976 off 8144 upd 2975. 
[2018/02/25@13:59:38.279+0300] P-19741 T-140406907832064 I BROKER 0: (13547) At end of Physical redo, transaction table size is 128. 
[2018/02/25@13:59:38.280+0300] P-19741 T-140406907832064 I BROKER 0: (7163) Begin Physical Undo 1 transactions at block 2976 offset 8180 
[2018/02/25@13:59:38.280+0300] P-19741 T-140406907832064 I BROKER 0: (-----) Transaction 404 by user 5 at 2018/02/25@13:57:43. Counter: 26
[2018/02/25@13:59:41.614+0300] P-19741 T-140406907832064 I BROKER 0: (5331) Physical Undo Phase Completed at 1086 . 
[2018/02/25@13:59:41.614+0300] P-19741 T-140406907832064 I BROKER 0: (7162) Begin Logical Undo Phase, 1 incomplete transactions are being backed out. 
[2018/02/25@13:59:41.614+0300] P-19741 T-140406907832064 I BROKER 0: (11231) Logical Undo Phase begin at Block 1086 Offset 6840. 
[2018/02/25@13:59:41.618+0300] P-19741 T-140406907832064 I BROKER 0: (12095) Logical Undo Phase completed at block 1086, offset 6840.

I don't understand why Progress writes the "Transaction by user at" message to both destinations because the message in promsgs file does not have the "%B" tag and the message is not dublicated/embedded into the Progress executables:

In promsgs file:

Transaction %l by user %d at %s. Counter: %l

Another side note: as we see in this message the recovery notes contains the user numbers but 'rfutil -S scan verbove' reports only user's names stored in RL_TBGN notes. It's very inconvenient when we need tp parse the transactions generated by batch processes that share the same login name.

Conclusion: use the -crTXDisplay parameter to start a database. Especially after db crash.

This thread is closed