Resilience

This section describes client lock resilience and server lock resilience.

Client Lock Resilience

The Distributed Lock mechanism provides lock resilience by duplicating each lock request to a second Distributed Lock server. The resilience model tolerates only a single point of failure, which is normally adequate for resilience technology: network failure between localized systems is extremely rare, especially when network components are duplicated.

In resilient mode, the client process issues a second lock request, to the secondary Distributed Lock server, only after the primary Distributed Lock server has acknowledged the original lock request as successful. The two responses are compared, and if the secondary response differs, an error is logged so that the problem can be investigated. The performance cost of resilience is therefore one additional socket send and receive per lock request. While both the primary and secondary Distributed Lock servers continue to respond, the process is executing in resilient mode.

If communication with the primary Distributed Lock server is interrupted or lost, an error is logged and the client process automatically promotes the secondary Distributed Lock server to become the new primary. From this point, lock requests are no longer duplicated and the process communicates only with the new primary lock server. The process is also no longer resilient: any subsequent communication failure with the Distributed Lock server causes the client process to wrap up and exit.

If communication with the secondary Distributed Lock server fails while in resilient mode, an error message is logged and the process continues to communicate only with the primary Distributed Lock server. The process is therefore no longer resilient.

Once communication with either Distributed Lock server fails, no further communication with the failed server is attempted for the remainder of the lifetime of the process, because attempting to do so could cause lock confusion and undermine the lock mechanism. All communication errors are logged to the jbase_error_trace.
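The client behaviour described above can be summarised as a small state machine. The following Python sketch is illustrative only and is not the jBASE implementation: the class name ResilientLockClient, the LockServerDown exception and the callable "servers" are hypothetical stand-ins for the real socket communication. It also assumes that a request which fails against the primary is retried against the newly promoted primary, which the text above does not state.

```python
class LockServerDown(Exception):
    """Raised by the hypothetical transport when a lock server is unreachable."""


class ResilientLockClient:
    """Illustrative model of the client-side resilience rules (not jBASE code)."""

    def __init__(self, primary, secondary=None, log=print):
        # Each server is modelled as a callable: server(key) -> bool (lock granted).
        self.primary = primary
        self.secondary = secondary  # None means the process is not resilient
        self.log = log

    @property
    def resilient(self):
        return self.secondary is not None

    def lock(self, key):
        try:
            result = self.primary(key)
        except LockServerDown:
            if self.secondary is None:
                # No server left to fall back on: wrap up and exit.
                self.log("lock server lost while not resilient; exiting")
                raise SystemExit(1)
            # Promote the secondary; the failed server is never contacted again.
            self.log("primary lock server lost; promoting secondary")
            self.primary, self.secondary = self.secondary, None
            return self.lock(key)  # assumption: retry against the new primary

        # Duplicate the request only after the primary reports success.
        if result and self.secondary is not None:
            try:
                mirror = self.secondary(key)
            except LockServerDown:
                self.log("secondary lock server lost; no longer resilient")
                self.secondary = None
            else:
                if mirror != result:
                    self.log(f"lock response mismatch for {key!r}")
        return result
```

In this model a failed server is dropped by discarding its reference, so no later request can reach it, which mirrors the rule that a failed server is never contacted again.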

Resilient mode should not be used with a configuration that integrates with direct local locking processes, because the local processes do not communicate with the primary lock server, let alone a secondary lock server. If resilience is required, then all processes, both local and remote, must be configured to communicate with the same primary and secondary Distributed Lock servers via the JDLS environment variable.

Server Lock Resilience

Client failure resilience is built into the Distributed Lock server processes, irrespective of resilience mode.

If communication with a client process fails, the Distributed Lock server process handling lock requests on behalf of that client releases any outstanding locks and then exits.

This procedure ensures that absent or misbehaving clients cannot continue to hold locks, irrespective of the state of the client system. The release of locks in this scenario by the Distributed Lock server process only affects the local system and has no effect on any other Distributed Lock server process executing on another primary or secondary lock server system.
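As a rough illustration of this clean-up rule, the Python sketch below models one handler per connected client. It is an assumed model, not the jBASE server code: the ClientHandler class, its method names and the dictionary used as a lock table are all hypothetical. The lock table is local to one server system, so releasing locks here cannot touch any other lock server.

```python
class ClientHandler:
    """Illustrative per-client handler on one lock server (not jBASE code)."""

    def __init__(self, lock_table, client_id):
        self.lock_table = lock_table  # shared only by handlers on THIS server
        self.client_id = client_id
        self.held = set()             # keys locked on behalf of this client

    def acquire(self, key):
        # Grant the lock if it is free or already owned by this client.
        if self.lock_table.get(key, self.client_id) != self.client_id:
            return False
        self.lock_table[key] = self.client_id
        self.held.add(key)
        return True

    def on_disconnect(self):
        # Release every outstanding lock for the lost client. A real handler
        # process would exit after this step.
        for key in self.held:
            if self.lock_table.get(key) == self.client_id:
                del self.lock_table[key]
        released = sorted(self.held)
        self.held.clear()
        return released
```

Once on_disconnect has run for a lost client, another client's handler on the same server can acquire the same keys.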

The Distributed Lock Listener service can be stopped or restarted (using the -k and -ib options together with the -D option on the jDLS command line) without interfering with the communications of currently connected remote clients. However, new clients cannot connect while the Distributed Lock Listener service is not listening, so these options should be used with great care.

