AI Management Daemon

Posted by Nigel Allen on 16-Mar-2014 21:47

Antipodean Greetings

I'm trying to gather some information about using the AI Management Daemon.

I'm particularly curious about the directories used for -aiarcdir. How do most sites that have implemented it handle this? Do you use a local directory with additional scripts to transfer the files to another server/room/city/planet, or do you use NFS-mounted directories?

I just read that the daemon will write to the archival directory if there is enough room. What happens if the NFS directory is not mounted?

I've read through all the KB entries and Paul Koufalis' presentations and I feel like they have raised more questions in my mind than they have answered.

If you are willing to share some of your secrets and/or horror stories I (and I'm sure many others) would be very grateful.

Nigel.

All Replies

Posted by pfred on 16-Mar-2014 22:15

Hi Nigel

Paul's presentation was great (if it is the one I'm thinking about).

We use the AI Management Daemon (Thanks Progress for doing it - it's fantastic). We are running under Linux FYI. The aiarcdir for us points to a directory on a different local filesystem. We then have a cron job that runs every 15 minutes (as root) looking for files in the aiarcdir. When it finds them there it moves them to a new directory (DIR-B), compresses them and sets their permissions appropriately.

#
#       Move AI files if they exist - run as root
#
5,20,35,50 * * * * /home/sysadmin/sh/ai-move
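
The ai-move script itself boils down to something like this (a simplified sketch from memory - the paths, user/group and permissions here are illustrative, not our exact values):

#!/bin/sh
# Simplified sketch of ai-move (run as root).
SRC=/u1/aiarchive        # the -aiarcdir the daemon writes to
DST=/u1/ai-dirb          # "DIR-B": staging area for the offsite copy

for f in "$SRC"/*; do
    [ -f "$f" ] || continue
    b=$(basename "$f")
    mv "$f" "$DST/$b"                        # move, not copy
    gzip "$DST/$b"
    chown sysadmin:sysadmin "$DST/$b.gz"     # give ownership to a useful account
    chmod 640 "$DST/$b.gz"
done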

A second cron job (running as a non-root user, but with ssh keys that allow copying to an offsite location) runs two minutes later and uses scp to copy everything in the DIR-B folder to the offsite location. It does a sum -s at both ends and compares the results to make sure the scp was 100% successful, then deletes the copy in DIR-B since it is now safely somewhere else.

#
#       Copy AI files to OFFSITE LOCATION if they exist
#
7,22,37,52 * * * * /home/sysadmin/sh/ai-remcopy

If the remote site is not available (VPN down) they will sit in the DIR-B folder until they can be copied and checked. If the link goes down in the middle of a transfer the sums won't match, so it will try again later when the link comes back up.
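
And ai-remcopy is essentially this (again a simplified sketch - the host and paths are illustrative):

#!/bin/sh
# Simplified sketch of ai-remcopy (run as the non-root user with the ssh keys).
DIRB=/u1/ai-dirb
RHOST=backup@offsite
RDIR=/dr/ai-in

for f in "$DIRB"/*.gz; do
    [ -f "$f" ] || continue
    b=$(basename "$f")
    scp "$f" "$RHOST:$RDIR/$b" || continue             # link down: retry next run
    lsum=$(sum -s "$f" | awk '{print $1}')
    rsum=$(ssh "$RHOST" "sum -s $RDIR/$b" | awk '{print $1}')
    # delete the local copy only once the checksums agree
    [ "$lsum" = "$rsum" ] && rm "$f"
done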

Cheers

Peter

Posted by Nigel Allen on 16-Mar-2014 23:33

Hi Peter.

Good to see you're still around!

That all makes a lot of sense to me. One question though:

If you delete the files in DIR-B once you're happy that the copy is valid, at what stage do you remove the files from the original arcdir?

Thanks and Regards

Nigel.

Posted by pfred on 16-Mar-2014 23:51

The first job is MOVING them - not copying. They are gone once the first script runs. After the second script runs (and they are copied to the remote location and checked) they are then removed completely from DIR-B.

So on the original machine there is only ever one (1) copy of the AI file(s): moved from the aiarcdir to DIR-B, chowned to a useful account, and compressed/gzipped. Two minutes later the second job runs, gets them offsite, and removes them completely.

Cheers

Peter

Posted by Nigel Allen on 16-Mar-2014 23:54

Aha! I read "copies" in your description but overlooked the "move" in the cron script name. Apologies, and thanks for the clarification.

N/

Posted by Rob Fitzpatrick on 17-Mar-2014 08:42

We do something similar to what pfred described: archive to a local directory on prod; from there, copy to the remote system's input directory; remove from aiarcdir when successfully copied; a cron job on DR rolls forward files from the input directory to the target and moves them to an archive directory.
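
The DR-side job is essentially a loop like this (a simplified sketch - the db path, directories and file pattern are placeholders, and it assumes the archive file names sort in AI sequence order):

#!/bin/sh
# Sketch of the DR roll-forward cron job.
IN=/dr/ai-in             # where prod's copies land
DONE=/dr/ai-applied      # archive of applied AI files
DB=/dr/db/proddb         # the restored target db (not in use)

for f in $(ls "$IN" 2>/dev/null | sort); do
    rfutil "$DB" -C roll forward -a "$IN/$f" || exit 1   # stop on first failure
    mv "$IN/$f" "$DONE/"
done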

I had a client site where the primary aiarcdir (there were two) was an NFS share to DR and it didn't work well.  There was a network disruption, the NFS mount was stale and file copies no longer happened.  But the production system could still see the directory, even though it now had no contents, so the daemon didn't switch to the secondary aiarcdir and stopped archiving/switching extents.  We couldn't even make it switch directories manually.  The daemon got into an unresponsive state, couldn't be shut down or signalled, and eventually we had to restart the DB to recover the situation.

So in theory it's nice to archive directly to a remote box but I got burned by an unreliable network.  I'd be interested to hear what others do.

Posted by Paul Koufalis on 17-Mar-2014 08:49

I used to use -aiarcdir "/some_nfs_mount,/some_local_fs" but I hit a situation in 11.2 where the AIMGT could not write to the NFS directory and rather than returning with an error, it froze and stopped ALL writes to the DB. The current theory is that the AIMGT froze while holding some latch but unfortunately PSTS cannot reproduce.

As a precaution, I have returned to the method described by Peter, though I wonder whether rsync (with or without --checksum) would be a more elegant solution than scp.
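
Something along these lines would collapse the copy, verify and delete into one step (an untested sketch - the host and paths are placeholders):

# --checksum makes rsync compare file contents rather than size/mtime when
# deciding what to send; transferred data is always verified with rsync's
# own checksums, and --remove-source-files deletes the local copy only
# after a successful transfer.
rsync -av --checksum --remove-source-files /u1/ai-dirb/ backup@offsite:/dr/ai-in/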

Posted by ChUIMonster on 17-Mar-2014 10:12

If I could change three things about the ai management daemon they would be:

1) Fix the file naming.  Allow a user-specified mask in the style of the "date" command so that I can decide which elements of the name I want to have, e.g. %s = seq#, %i = ISO date, %e = extent#, %p = path, %d = dbname. Then I could say something like -aiarcname "%d.%s" and the name would be a nice, simple, easy-to-read, easy-to-type dbname.#

2) Instruct it to archive to multiple directories simultaneously.  Write an error to the log if one is not available but, so long as at least one is writeable, continue processing normally.  Continually re-check directories that were previously unavailable.

2a) An interesting option might be to not mark extents empty until they have been copied to all targets.  This would make transient network problems and full filesystems less disruptive and enable "self healing" of the ai daemon in these cases.

2b) Unless X extents are full -- then start freeing them up to permit extent switches so long as there is at least one viable target to write the archives to.

3) Support remote transfer protocols.  Specifically "scp" but this could probably be done in a portable way that would allow a user specified transfer command.

Posted by Rob Fitzpatrick on 17-Mar-2014 19:21

Paul,

That sounds very much like what happened to me.  If I recall it was a directory on DR, shared to prod as an NFS mount.  AIX 6.1, 10.2B02.

Posted by gus on 18-Mar-2014 11:38

@paul koufalis: /any/ program that writes to an NFS mounted filesystem might be blocked for an indeterminate amount of time. if that program is holding locks or has acquired other resources, they cannot be freed until the blocked process can continue. this behaviour is not limited to nfs. it can happen with I/O operations on other kinds of filesystems and devices too. fortunately not often.

Posted by Paul Koufalis on 18-Mar-2014 12:06

@gus: it seems I sometimes take certain operations for granted: I incorrectly assumed that the task would succeed or fail.  It did neither and that's a little more difficult to script around.  I can only hope that the archive to the local FS has a lower probability of freezing than the write to the NFS mount.

Posted by Tim Kuehn on 18-Mar-2014 12:25

Another option is to have a sequence of locations to write to, followed by a script to launch before or after each AI is written.

This way, the daemon can write to a local location, fire off a script to do the NFS copy, and if that copy blocks, the daemon / db won't end up blocked.

Posted by gus on 20-Mar-2014 08:45

yes. this will help. but if your nfs server is down, you are just postponing the inevitable.

Posted by Tim Kuehn on 20-Mar-2014 09:07

That depends on the script. If it's written to check whether a prior instance completed, and to send an alert if it hasn't, that would cover off the cases where the NFS copy failed with a hang, and would only leave the system exposed for the period between AI archives plus the time for someone to respond with a correction.
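
Something like this at the top of the copy script would do it (a sketch - the lock file path and alert command are site-specific examples):

#!/bin/sh
# Detect a hung prior run and alert before doing any work.
LOCK=/var/run/ai-remcopy.lock

if [ -e "$LOCK" ]; then
    # the previous run never cleaned up - probably hung on the NFS/scp step
    echo "AI offsite copy appears hung" | mail -s "AI copy alert" admin@example.com
    exit 1
fi
touch "$LOCK"
trap 'rm -f "$LOCK"' EXIT

# ... the usual copy/verify work goes here ...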

Posted by ChUIMonster on 20-Mar-2014 09:15

If I could change 3 things...

Posted by Nigel Allen on 20-Mar-2014 17:14

If I could change 3 things....

(I'd cheat and make it five) :)
