We have a fault-tolerant Management Framework set up with primary and backup Directory Service (DS) storage. Most of the time, failovers work: the backup and primary Directory Services switch roles after a graceful shutdown.
However, when there is a network failure and the replication connections are dropped, the DS hangs and locks up with a NullPointerException.
At that point, both the primary and the backup are dysfunctional and cannot recover. The only way to restore service is to kill the process and restart it, which is going to be an issue.
Is there any fix for this?
We're on the latest patch:
Sonic Management
Release 7.6.2 Build Number 196
Copyright (c) 1999-2008 Progress Software Corporation.
All rights reserved.
SonicMQ Continuous Availability Edition [Serial Number 4009181]
Release 7.6.2 Build Number 209 Protocol P30
Copyright (c) 1999-2008 Progress Software Corporation.
---------------
------LOG--
[10/04/15 00:43:53] ID=DIRECTORY SERVICE (warning) Replication connection SSL://edc-prod-dm:22506 failed to connect: No route to host
[10/04/15 00:44:14] ID=DIRECTORY SERVICE (severe) Directory Service failure, trace follows...
java.lang.NullPointerException
at com.sonicsw.mtstorage.impl.Storage.doGet(Storage.java:231)
at com.sonicsw.mtstorage.impl.Storage.get(Storage.java:160)
...
at com.sonicsw.mf.framework.util.StateManager.requestStateChange(StateManager.java:106)
at com.sonicsw.mf.framework.directory.DSComponent$DSStateHandler.activateNewState(DSComponent.java:4559)
at com.sonicsw.mf.framework.directory.DSComponent$DSStateHandler.access$5900(DSComponent.java:4466)
at com.sonicsw.mf.framework.directory.DSComponent$8.run(DSComponent.java:4493)
[10/04/15 00:44:14] ID=DIRECTORY SERVICE (severe) The transition failed...
com.sonicsw.mf.common.runtime.NonRecoverableStateChangeException: Directory Service startup failure
...
[10/04/15 00:44:14] ID=DIRECTORY SERVICE (severe) Aborting container
[10/04/15 00:44:14] (warning) Shutdown initiated (exit code=4)
...
[10/04/15 00:44:26] (warning) Failed to unload ID=DIRECTORY SERVICE, trace follows...
com.sonicsw.mf.common.MFRuntimeException: java.lang.NullPointerException
at com.sonicsw.mf.framework.directory.DSComponent.createMFRuntimeException(DSComponent.java:3228)
at com.sonicsw.mf.framework.directory.DSComponent.stopDirectoryService(DSComponent.java:1390)
at com.sonicsw.mf.framework.directory.DSComponent.stop(DSComponent.java:1136)
My suggestion would be to contact Customer Support about this. They would be better placed to diagnose it.
One question: you mention this happens on a network failure. I am assuming you are using local storage and not a network drive, but it is worth asking, just in case. If the storage were on a network drive, a network failure could produce low-level storage exceptions like this.
Both DS stores are on local drives. I opened a support ticket and was told the problem has to do with the DS cache loaded into memory: the underlying code does not handle the DS cache properly when the replication connections are dropped.
Looking forward to the fix.
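Until a fix is available, it may help to at least detect this state from the container log so someone can be alerted before both DS instances are down for long. Below is a rough sketch, not part of any Sonic tooling: it scans log lines for a replication connection failure followed shortly by a severe "Directory Service failure" entry, as in the excerpt above. The function name `find_ds_failures`, the 60-second window, and the timestamp format (assumed here to be yy/MM/dd) are my own assumptions; the message strings are taken from the log in this thread.

```python
import re
from datetime import datetime, timedelta

# Assumption: timestamps are yy/MM/dd HH:mm:ss. Adjust if your logs differ.
TS_FORMAT = "%y/%m/%d %H:%M:%S"

# Matches entries such as:
#   [10/04/15 00:44:14] ID=DIRECTORY SERVICE (severe) Aborting container
#   [10/04/15 00:44:14] (warning) Shutdown initiated (exit code=4)
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\]\s+"
    r"(?:ID=(?P<id>.+?)\s+)?\((?P<sev>\w+)\)\s+(?P<msg>.*)$"
)

# A few lines copied from the log excerpt in this thread.
SAMPLE_LOG = """\
[10/04/15 00:43:53] ID=DIRECTORY SERVICE (warning) Replication connection SSL://edc-prod-dm:22506 failed to connect: No route to host
[10/04/15 00:44:14] ID=DIRECTORY SERVICE (severe) Directory Service failure, trace follows...
java.lang.NullPointerException
[10/04/15 00:44:14] ID=DIRECTORY SERVICE (severe) Aborting container
[10/04/15 00:44:14] (warning) Shutdown initiated (exit code=4)
"""


def find_ds_failures(lines, window_seconds=60):
    """Return (drop_time, failure_time) pairs where a severe Directory
    Service failure follows a replication connection failure within
    window_seconds. The function and its window are hypothetical."""
    hits = []
    last_drop = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # stack-trace lines, "..." and other non-entry lines
        ts = datetime.strptime(m["ts"], TS_FORMAT)
        msg = m["msg"]
        if msg.startswith("Replication connection") and "failed to connect" in msg:
            last_drop = ts  # remember the most recent dropped connection
        elif m["sev"] == "severe" and msg.startswith("Directory Service failure"):
            if last_drop is not None and ts - last_drop <= timedelta(seconds=window_seconds):
                hits.append((last_drop, ts))
    return hits
```

Calling `find_ds_failures(SAMPLE_LOG.splitlines())` on the excerpt yields one pair of timestamps, which could then drive an alert or a supervised restart. It only recognizes the symptom in the log; it does nothing about the underlying cache handling.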
The issue was handled through a Customer Support ticket.