First the background; the questions are at the end.
OpenEdge 10.2A, Linux
I have an OE Replication test bed set up and running, with one source and two targets. The source is happily replicating to the targets just like it should.
But that isn't the real world. In the real world "stuff happens". So I've been testing some of the things that I would expect to happen and seeing how OE Replication reacts.
In the real world the target server sometimes needs maintenance. So it is important to be able to shut it down and restart it without undue difficulty. So my first test was to proshut my targets and then restart them:
Ok, that works! Good.
Let's take it a step further... In the real world the production server (the "source") isn't always the one that crashes. So I crashed one of my targets by removing the .lk file. I had this idea that that was a really simple test that OE Replication would pass with flying colors.
No such luck.
Not only that but there are virtually no diagnostics, error messages or indicators of any kind that a) there is a problem or b) what the nature of the problem might be. The target database comes up and appears to start normally. The .lg file shows no indication of any problem. But if you try to start an mpro session you get this very helpful message:
┌─────────────────────── Error ────────────────────────┐
│ You are not licensed to access the database. (10830) │
│                                                      │
│ ──────────────────────────────────────────────────── │
│                         <OK>                         │
└──────────────────────────────────────────────────────┘
Looking in repl.agent.startup.lg isn't much more useful. It shows:
...
Opening logging subsystem using message file /usr/pro102a/promsgs.
Opening database log /data/s2k-rpt/sports2000.lg.
Child process started normally
Parent process killed
(Isn't there supposed to be a new common .lg file format that everyone uses?)
One might then think that "dsrutil sports2000 -C status" or "dsrutil sports2000 -C status -detail" would shed additional light on matters. These commands return the values 3199 and 1063 respectively, values that, oddly enough, do not correspond to any of those listed in the OpenEdge Replication User Guide. Incidentally, they also do not match the status codes shown by the _Repl* VSTs, which are not reflected in the documentation either.
Well then, perhaps "dsrutil sports2000 -C monitor" might be helpful? No, not on the target. That results in "Cannot connect to replication shared memory. Status = -1". How about at the source? That, at least, connects successfully so let's see what it has to say about our crashed target (R -> 2):
Database: /data/s2k/sports2000
Agent:
Name: s2k-rpt
ID: 2
Host name: 127.0.0.1
Target database: /data/s2k-rpt/sports2000
State: Recovery Failed
Critical: No
Method: Asynchronous
Server/Agent connection time: Mon Apr 20 13:33:13 2009
Remote agent is waiting for: Nothing
Recovery state: Failed for the agent
Maximum bytes in TCP/IP message: 8504
Server/Agent connection timeout: 30.000 seconds
That's a bit more meaningful. At least I can (finally) see that "recovery failed".
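For anyone who wants to script this check rather than eyeball the screen, here is a minimal sketch that pulls the "State" and "Recovery state" fields out of text laid out like the excerpt above. It assumes you have already captured the monitor screen as plain "Key: value" lines, one agent per capture; the real dsrutil screen is interactive and may be laid out differently. The function names parse_monitor_block and agent_needs_attention are made up for this example.

```python
def parse_monitor_block(text):
    """Turn 'Key: value' lines into a dict, splitting on the first colon only."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon at all
            fields[key.strip()] = value.strip()
    return fields


def agent_needs_attention(fields):
    """True when either state field mentions a failure (assumed wording)."""
    states = (fields.get("State", ""), fields.get("Recovery state", ""))
    return any("fail" in state.lower() for state in states)


# Sample text copied from the monitor excerpt in this post.
SAMPLE = """\
Database: /data/s2k/sports2000
Agent:
Name: s2k-rpt
ID: 2
Host name: 127.0.0.1
Target database: /data/s2k-rpt/sports2000
State: Recovery Failed
Critical: No
Method: Asynchronous
Server/Agent connection time: Mon Apr 20 13:33:13 2009
Remote agent is waiting for: Nothing
Recovery state: Failed for the agent
Maximum bytes in TCP/IP message: 8504
Server/Agent connection timeout: 30.000 seconds
"""

agent = parse_monitor_block(SAMPLE)
print(agent["Name"], "->", agent["State"])
print("needs attention:", agent_needs_attention(agent))
```

Splitting on the first colon only keeps values such as the connection time (which contains colons of its own) intact.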
So the questions:
1) Is there something wrong with my attempted recovery technique? (Restart the db and reconnect the agent)
2) Where is the real list of status codes and their meanings?
3) Is it unreasonable to expect meaningful log file entries, error messages and status codes?
4) Is it unreasonable to think that simple recovery from such a commonplace exception should be easily handled?
Since a simple recovery didn't work it's time to try something less simple.
Obviously I wouldn't want to shut down my source database. Likewise I wouldn't want to stop replication to my other target. So the obvious thing to do would be to take a new probkup online as my starting point for restarting the crashed target. That also happens to be how I started it in the first place so that would seem like a workable approach.
Making the backup is easy.
Restoring it less so. The first thing that happens is that you are informed:
07:35:39 BROKER 0: Access to database sports2000 not allowed. The database is enabled for site replication but either replication is not running, or this process is not authorized to open a replication enabled database. (10356)
Ok, actually I sort of knew that would happen (although the documentation is far from clear on the subject). And I'm very happy to see a relatively clear and meaningful error message! So let's disable replication so that I can restore the db... Oops!
dsrutil sports2000 -C disablesitereplication target
The Fathom Replication Utility cannot connect to database /data/s2k-rpt/sports2000. (10717)
Let's try proutil...
proutil sports2000 -C disablesitereplication target
OpenEdge Release 10.2A as of Fri Oct 31 20:06:43 EDT 2008
WARNING! You are about to disable OpenEdge Replication for the target database sports2000. Do you wish to continue [n]?
y
Replication (target) has been disabled for database sports2000. (10355)
Gee, I kind of expected that I should use the OE Replication utilities to manage OE Replication...
Restore, restart the db, try reconnecting the agent...
Nope. No change.
Ok, terminate the replication server and restart it...
Interesting. That seems to have pretty much hosed things. dsrutil -C monitor thinks that there is one agent connected and another "initializing". But neither target database can actually be used. The VSTs are totally screwed up too.
Looks to me like I'm going to have to completely shut down OE Replication and restart it from scratch.
All because one of my targets crashed?
Tech support says that this behavior is a bug.
No word on what the real list of status codes and such is.
Good news!
I seem to have been suffering from a brain cramp regarding status codes. This morning they are making sense!
For the curious -- the VST codes are the "dsrutil dbname -C status -detail" codes. Not "1" or "2" as suggested in the documentation.
_ReplAgtCtl-CommStatus and _ReplAgt-CommStatus do, however, appear to be non-functional. They are always 0.
For the curious...
I'm an idiot. I had an e-mail sitting in my inbox from Keith on 4/15 explaining about the status codes.
Sorry Keith.
I may have a solution. Further testing is needed but the following procedure seems to be working:
0) Start with OE Replication up and running with 2 targets. Arrange to have some transaction activity so that data is flowing to the targets.
1) Crash a target database. Removing the .lk file and using proshut -byF seem to have essentially the same effect. I haven't graduated to "kill -9", pulling the plug, hosing the SAN or dumping halon all over the server like a real sys admin would. Yet.
2) The source db and the other target will continue merrily on their way.
3) Roughly 30 seconds after the crash (probably depends on various *.repl.properties settings) the status of the control agent for the crashed db will change to "Recovery Failed".
4) Restart the crashed db and take it through normal crash recovery.
5) Perform a normal clean shutdown (proshut -by) on the formerly crashed db.
6) Perform a clean shutdown on the other target.
7) With both targets down the replication server should terminate. The source db will stay up and running. Verify this.
8) Restart the crashed target.
9) Restart the other target.
10) Restart the replication server.
11) Replication should now proceed to synchronization and normal processing on both targets.
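Steps 4 through 11 can be sketched as a dry-run plan that only prints the commands in order, so the sequence can be reviewed before anything is touched. Only "proshut -by" and "dsrutil ... -C monitor" come from this thread. The "proserve ... -DBService replagent" startup line and "dsrutil ... -C restart server" are my assumptions about how the targets and the replication server get started here, and the second target's path is hypothetical. Substitute whatever you actually use.

```python
def recovery_plan(source_db, crashed_target, other_target):
    """Return steps 4-11 as (description, argv) pairs; argv is None for manual checks."""

    def start_target(db):
        # Assumption: targets are served with the replication agent enabled.
        return ["proserve", db, "-DBService", "replagent"]

    def clean_shutdown(db):
        # From the thread: a normal, clean shutdown.
        return ["proshut", db, "-by"]

    return [
        ("restart crashed target, let crash recovery run", start_target(crashed_target)),
        ("clean shutdown of formerly crashed target", clean_shutdown(crashed_target)),
        ("clean shutdown of other target", clean_shutdown(other_target)),
        ("verify replication server ended and source is still up", None),
        ("restart crashed target", start_target(crashed_target)),
        ("restart other target", start_target(other_target)),
        # Assumption: the replication server is restarted from the source side.
        ("restart replication server", ["dsrutil", source_db, "-C", "restart", "server"]),
        ("watch synchronization", ["dsrutil", source_db, "-C", "monitor"]),
    ]


plan = recovery_plan(
    "/data/s2k/sports2000",
    "/data/s2k-rpt/sports2000",
    "/data/s2k-rpt2/sports2000",  # hypothetical path for the second target
)

# Number from 4 so the output lines up with the step numbers above.
for number, (description, argv) in enumerate(plan, start=4):
    command = " ".join(argv) if argv else "(manual check)"
    print(f"{number:>2}) {description}: {command}")
```

Nothing is executed; the point is only to make the ordering explicit: both targets go down cleanly before either comes back, and the replication server is restarted last.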
If significant transaction volume occurs while the target databases are down, synchronization could take a very long time to complete.
My experiments seem to indicate that synchronization of the crashed db starts at the beginning of the oldest locked AI extent. (I have not exhaustively tested the "oldest locked extent" theory -- that's just an eyeball guess based on the number of blocks being transferred.) The normally shut down db will start at a much friendlier point. In my test run I had a 2.5GB AI extent that took 2 hours to synchronize. Keep an eye on the number of blocks transferred (dsrutil dbname -C monitor) to make sure that something is actually happening. There don't seem to be any other diagnostics to tell you what is going on. You won't be able to connect to the targets with a VST-based monitor until after synchronization completes (you can, however, monitor _Repl-AgentControl from the source db).
I'm not sure if the very long synchronization is inherent in the process or if tweaking properties & parameters might speed it up. Any advice on that point would be welcome.
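To put rough numbers on that, here is a back-of-the-envelope sketch based on the one data point above (2.5GB in 2 hours). It assumes the rate stays linear, which I have not verified. The 10GB backlog is a made-up figure, and the block counts fed to is_stalled are hypothetical readings you would copy by hand from the monitor screen.

```python
def sync_rate_gb_per_hour(gb_done, hours):
    """Observed synchronization rate from a finished (or partial) run."""
    return gb_done / hours


def eta_hours(gb_remaining, rate_gb_per_hour):
    """Estimated hours left, assuming the rate stays constant."""
    return gb_remaining / rate_gb_per_hour


def is_stalled(block_counts, min_delta=1):
    """True if the blocks-transferred counter moved by less than min_delta
    between the oldest and newest reading (oldest first)."""
    return len(block_counts) >= 2 and block_counts[-1] - block_counts[0] < min_delta


rate = sync_rate_gb_per_hour(2.5, 2)  # the test run described above
print(f"observed rate: {rate} GB/hour")
print(f"hypothetical 10 GB backlog: {eta_hours(10, rate)} hours")
print("stalled:", is_stalled([4200, 4200, 4200]))  # hypothetical readings
```

At that rate, every additional GB of AI notes waiting for the crashed target costs the better part of an hour, which is why the starting point for synchronization matters so much.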
"Recovery Failed" (#3 above) seems to mean that the replication server lost communications with the target. For some reason this state can only be reset by restarting both targets. That seems odd. Is that by design? Or is there a better procedure? If there is I'd appreciate knowing about it.