OpenEdge replication to secondary target fails.

Posted by Simon L. Prinsloo on 02-Dec-2019 08:13

Good day

My client has (almost) 24/7 operations with 11 databases, two of which are particularly large (backups are 134 GB and 194 GB).

For a number of years now, they have been running RHEL and OpenEdge 11.6 with OpenEdge Replication. The DR server is identical to the production server, except that it has only half the number of processors.

The current OS does not support OpenEdge 11.7.

The customer bought a new server and installed CentOS 7.

In preparation for the switch to the new production server, OpenEdge 11.6 was installed on the new server and it was configured as a secondary replication target. The idea is to transition to this machine, restage the old DR and then remove the old production server. That server will then be rebuilt using CentOS 7 and OpenEdge 11.6, after which it will become the new DR machine, the current DR will be decommissioned and the client will upgrade to OpenEdge 11.7.5.

Currently, 10 of the 11 databases have been staged successfully and are replicating perfectly to the new server as well as the old DR.

The 11th database is the largest one, and all three attempts to set up the secondary target for it on the new server failed with the same problem.

We double-checked the configuration and see no difference from the others, which worked. During each attempt, a new online backup was taken with the -REPLTargetCreation switch, synced to the new machine (which takes the best part of 12-14 hours) and restored. During that time, almost 50% of the AI extents on the production db fill up.

When replication is restarted, the following is found in the log of the new target:

Database ... is being replicated from database ... on host <production>.

The Replication Server has been terminated or the Source database has been shutdown.  The Agents will enter PRE-TRANSITION, waiting for re-connection from the Replication Server.

The following appears in the log of the source database:

RPLU 18: (-----) The OpenEdge Replication Server is starting...
RPLU 18: (10718) Restarting OpenEdge Replication Server.
RPLU 18: (453) Logout by root on /dev/pts/6.
RPLS 19: (10500) The Fathom Replication Server successfully started as PID 30276.
RPLS 19: (10842) Connecting to Fathom Replication Agent <DR agent>.
RPLS 19: (10507) The Fathom Replication Server has successfully connected to the Fathom Replication Agent <DR agent> on host <DR host>.
RPLS 19: (10842) Connecting to Fathom Replication Agent <new server agent>.
RPLS 19: (10507) The Fathom Replication Server has successfully connected to the Fathom Replication Agent <new server agent> on host <new server host>.
RPLS 19: (11251) The Replication Server successfully connected to all of its configured Agents.
RPLS 19: (10508) Beginning Fathom Replication synchronization for the Fathom Replication Agent <DR agent>.
RPLS 19: (10440) Either the Fathom Replication Agent <new server agent> has been incorrectly configured or the target database ... has been improperly sourced.
RPLS 19: (11696) The Agent <new server agent> cannot be properly configured and is being terminated.
RPLS 19: (10700) The Fathom Replication Agent <new server agent> is being terminated.
RPLS 19: (10504) Unexpected error -129 returned to function rpSRV_ServerLoop.
RPLS 19: (10700) The Fathom Replication Agent <DR agent> is being terminated.
RPLU 18: (452) Login by root on /dev/pts/6.
RPLU 18: (7129) Usr 18 set name to root.
RPLS 19: (10505) The Fathom Replication Server is ending.

At this point the replication server is terminated. The only recourse in each case was to remove the new server's configuration from the repl.properties file (i.e. restore the old version) and to restart the replication server, at which point the DR was brought up to date and normal processing continued.
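To spot this failure quickly after a restart, the source database log can be scanned for the message numbers seen above. This is only a minimal sketch: the function name and the log path are hypothetical, and it assumes the real log lines contain "RPLS" followed by the message number in parentheses, as in the excerpt.

```shell
#!/bin/sh
# Message numbers taken from the log excerpt above (10440, 11696, 10504).
REPL_ERRORS='10440|11696|10504'

scan_repl_errors() {
    # $1: path to the source database log file (hypothetical, site specific).
    # Prints every RPLS line carrying one of the message numbers above.
    # Exit status is non-zero when no such line is found.
    grep -E "RPLS.*\(($REPL_ERRORS)\)" "$1"
}

# Example call (path is an assumption):
#   scan_repl_errors /db/prod/bigdb.lg
```

An empty result after the replication server restarts suggests that this particular agent configuration error did not occur; it does not prove that synchronisation completed.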

I can find nothing regarding the error -129 reported in message number 10504. Nor do I see any problem in the configuration of the server's repl.properties file they are using. I did, however, note that the new target server's repl.properties does not contain any reference to the primary target (DR database), but apparently it is the same for the 10 databases that do work.

Since the database was sourced three times already, I am also of the opinion that incorrect sourcing can be ruled out as well.

Any insights are welcome.

Regards

Simon

Posted by Simon L. Prinsloo on 06-Dec-2019 11:16

An answer in this thread pointed us in the correct direction. It turns out that the daily backup was the culprit.

After disabling the daily backup, the client was able to successfully stage the secondary target.

They did not touch the .recovery files.
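One way to keep the daily backup from running during a staging window is a simple marker file that the backup job checks first. This is a minimal sketch, not what the client actually did: the marker path, the function name and the idea of wrapping the daily backup in such a script are all assumptions.

```shell
#!/bin/sh
# Hypothetical marker file: create it before taking the -REPLTargetCreation
# backup and remove it once the new target has synchronised.
STAGING_MARKER=${STAGING_MARKER:-/tmp/repl_staging_in_progress}

daily_backup_allowed() {
    # Succeeds (status 0) only when no staging marker is present.
    [ ! -e "$STAGING_MARKER" ]
}

if daily_backup_allowed; then
    echo "no staging in progress: daily backup may run"
    # The site's usual daily backup command would go here.
else
    echo "staging in progress: skipping daily backup"
fi
```

The marker would stay in place from the moment the baseline backup is taken until the replication server and the new agent have synchronised.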

Thanks

Simon

All Replies

Posted by Pieterm on 02-Dec-2019 08:19

Simon,

Are there no old repl.recovery files still lying around, and no other backups running on the source?

Posted by Simon L. Prinsloo on 02-Dec-2019 13:29

Hi

It turns out that there may have been another backup running while they were busy with the setup of this target.

It seems that the repl.recovery files are constantly updated while normal processing is active. We cannot afford to lose the primary target while we are creating the secondary target, so the question is whether it is safe to delete the .recovery file on the source, or would that break replication to the primary target? Should it be deleted on the primary target as well?

Some background on the steps being followed:

1. The primary target database is shut down. As a consequence of this, the replication server on the source dies as well.

2. Confirm that the replication server on the source is gone.

3. probkup online <dbname> <outputfile> -REPLTargetCreation -com

4. rsync the backup to the new server.

5. prorest the backup on the new server.

6. proutil <dbname> -C enableSiteReplication target (on the new server)

7. Replace the .repl.properties on the source with a version containing both targets.

8. Start both the target databases.

9. Restart the replication server on the source database.
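Steps 3 to 6 above can be sketched as a dry-run script that only prints the commands. The database name, backup path and host name are hypothetical placeholders and the rsync options are an assumption. The other steps are left out because the thread does not give exact commands for them.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for steps 3-6 instead of running them.
# All three arguments are hypothetical placeholders.
print_staging_plan() {
    db=$1        # database name
    backup=$2    # backup file path, assumed identical on both machines
    newhost=$3   # host name of the new secondary target

    # Step 3 (on the source): online backup flagged for target creation.
    echo "probkup online $db $backup -REPLTargetCreation -com"
    # Step 4: copy the backup to the new server (rsync options are an assumption).
    echo "rsync -av $backup $newhost:$backup"
    # Step 5 (on the new server): restore the backup.
    echo "prorest $db $backup"
    # Step 6 (on the new server): enable the restored copy as a replication target.
    echo "proutil $db -C enableSiteReplication target"
}

print_staging_plan "${DB:-mydb}" "${BACKUP:-/backup/mydb.bck}" "${NEWHOST:-newserver}"
```

Printing the plan first makes it easy to review the exact command lines before a run that takes 12-14 hours to sync.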

Posted by Simon L. Prinsloo on 02-Dec-2019 13:34

One more thing: when doing the first 10 databases, they did not touch the .recovery files in any way, but all of those databases were small enough to get replication up and running before the daily backup started.

Posted by Dapeng Wu on 03-Dec-2019 15:27

Simon,

Please take a look at this article to see if it helps:

knowledgebase.progress.com/.../000057059

It says:

Create the backup for target creation:

IMPORTANT: No more PROBKUP, PROQUIET disable markbackedup or RFUTIL mark backedup commands to be run against the source database until this probkup volume has been restored to target, started and the RPLS and RPLA have synchronised or this will fail and the target baseline will have to be re-taken.

$ probkup online <source> <source>_b -REPLTargetCreation

$ rm <source>.repl.recovery (** important **)

This thread is closed