failovermon(1M) DG/UX R4.11MU05 failovermon(1M)
NAME
failovermon - manage failover monitors
SYNOPSIS
failovermon -o add [ -i interval ] [ -r retries ] [ -l lost-pulse ]
[ -g regain-pulse ] [ -b ] [ -s ] hostname
failovermon -o delete hostname
failovermon -o modify [ -i interval ] [ -r retries ] [ -l lost-pulse
] [ -g regain-pulse ] [ -bn ] [ -s ] hostname
failovermon -o list [ -qv ] [ hostname ... ]
failovermon -o start [ hostname ... ]
failovermon -o stop [ hostname ... ]
DESCRIPTION
Failovermon provides operations for manipulating entries in the
failover monitors(4M) database as well as operations for starting and
stopping failovermon monitors. Failover monitors and their action
scripts (lost-pulse and regain-pulse) are set up and execute on the
system that is serving in the backup role. This system should
already have been set up for failover using the operator initiated
failover operations through sysadm.
The failovermon process monitors the specified system with a
heartbeat message. This message is sent from the failovermon process
to the failoverd(1M) process on the host being monitored. The
heartbeat is sent over all communication paths that have been set up
for the host being monitored using the admfailoveraltcommpath(1M)
command. As long as at least one response is received by the monitor
the heartbeat is successful. The monitor then sleeps for the number
of seconds specified in its interval value.
If no response is received on any of the communications paths, the
retries value is examined to determine whether or not to declare the
host failed. If the retries value is zero, the monitor immediately
executes the lost-pulse script. If the retries value is not zero,
the monitor continues to try to communicate with the host until the
retries value is exceeded, and then executes the lost-pulse action
script.
The monitor continues to attempt to communicate with the failed host.
When communications are re-established the regain-pulse action script
is executed.
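The heartbeat and retry behavior described above can be sketched as a
small simulation. This is a minimal model, not the failovermon
implementation. The name simulate_monitor, the representation of each
heartbeat round as a list of per-path booleans, and the reading of
"exceeded" as "more consecutive failed rounds than the retries value"
are assumptions made for illustration.

```python
def simulate_monitor(rounds, retries=0):
    """Return the action scripts a monitor would run, in order.

    rounds  -- iterable of lists; each list holds one boolean per
               communication path (True means a response arrived).
    retries -- hypothetical stand-in for the -r value.
    """
    events = []
    failures = 0          # consecutive rounds with no response on any path
    host_failed = False   # True once lost-pulse has been run

    for paths in rounds:
        if any(paths):
            # One response on any path makes the heartbeat successful.
            failures = 0
            if host_failed:
                events.append("regain-pulse")
                host_failed = False
        else:
            failures += 1
            # Assumption: the host is declared failed once the number of
            # consecutive failed rounds exceeds the retries value.
            if not host_failed and failures > retries:
                events.append("lost-pulse")
                host_failed = True
    return events
```

Under these assumptions, a retries value of zero runs lost-pulse on the
first round in which every path is silent, and regain-pulse runs once
on the first successful round after a declared failure.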
The failovermon monitor can be configured to monitor the host it is
running on. This type of monitoring is used to detect a system hang.
The monitor determines whether the system it is invoked on has the
wdt() driver configured. If the driver is configured and the system
has a hardware timer (AV5500, AV8500, and AV9500 systems), the wdt
driver internally resets a register every second. If the driver fails
to reset the timer within one second, a warm reset of the system is
triggered.
The failovermon monitor communicates with the wdt driver for a higher
level of monitoring. The failovermon process tries to use the fork,
exec, open, read, write, and close system calls every user specified
interval seconds. Upon successful completion, the failovermon process
sends a message to the wdt driver indicating the system is not hung.
If the wdt driver does not get this message from the failovermon
process within 600 seconds, the wdt driver initiates a system halt to
alleviate the hang. This, in conjunction with the proper set up via
the dg_sysctl(1M) command, will allow the system to reboot itself.
This level of hang detection will not detect a single runaway process
that is causing an application to perform poorly. This is designed to
detect hangs in the operating system by exercising several extensive
code paths.
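The higher level check described above can be approximated in user
space as follows. This is a sketch only: the function names are
hypothetical, the message sent to the wdt driver is not modeled because
its interface is not described here, and Python's subprocess module is
used as a stand-in for the fork and exec step.

```python
import os
import subprocess
import sys
import tempfile

# From the text: the driver acts if it hears nothing for 600 seconds.
WDT_TIMEOUT_SECONDS = 600


def exercise_system_calls():
    """Exercise process creation and basic file I/O; return True on success."""
    # Process creation: start a trivial child process and wait for it.
    child = subprocess.run([sys.executable, "-c", "pass"])
    if child.returncode != 0:
        return False

    # open, write, read, and close on a scratch file.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"ping")
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 4)
    finally:
        os.close(fd)
        os.remove(path)
    return data == b"ping"


def watchdog_would_fire(seconds_since_last_report):
    """True if more than the 600 second window has passed without a report."""
    return seconds_since_last_report > WDT_TIMEOUT_SECONDS
```

A monitor built on this sketch would call exercise_system_calls() once
per interval and report to the driver only when it returns True.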
The failovermon monitor is also used to check multi-path LAN I/O
paths. This is one of the functions that is performed when the
monitor is monitoring the host it is running on. If the system does
not currently have a monitor running, the admiopath(1M) command will
create one.
When the failovermon process is stopped or terminates abnormally, the
wdt driver ceases the high level monitoring. The wdt driver continues
to perform its lower level monitoring (on systems with a hardware
watch dog timer) until the driver is deconfigured from the system.
Operations
add Add a failovermon monitor entry for the specified hostname
to the failover monitors database. This operation
optionally lets the administrator start the monitor at this
time.
delete Delete a failovermon monitor entry for hostname from the
failover monitors database. This operation also terminates
an existing monitor, if one is running.
modify Modify a failovermon monitor entry for hostname. This
operation optionally lets the administrator restart the
current monitor (if one is running) or start one using the
new information.
list List failover monitors database entries. The list
operation reports the following monitor information to
stdout:
the name of the host that is being monitored
a flag indicating whether a monitor is running
a flag indicating whether the monitor is brought up
at system reboot time
the interval value
the retries value
the lost pulse action script name
the regain pulse action script name
With the `verbose' format (-v), information is printed in
aligned columns with headers. With the `quiet' format (-q)
headers are suppressed and each host entry is printed on a
separate line. If both -q and -v are specified, the output
is in `quiet' format.
start Start a failovermon monitor for the specified host(s). When
no hostname is specified, a monitor is started for every
entry in the monitors database that is not currently
active.
stop Stop a failovermon monitor for the specified host(s). When
no hostname is specified, all currently active monitors are
stopped.
Options
The following options can be used with the add or modify operations:
-b Start on reboot. This option specifies that this monitor
is to be brought up when the system is rebooted.
-i interval
The time in seconds that the failovermon monitor waits
after receiving a reply to a handshake before initiating
the next handshake. The default is zero for an add
operation or the current interval value for a modify
operation.
-r retries
The number of times the failovermon monitor should continue
to try to communicate with the failoverd server of the
specified system before declaring the system failed. The
default is zero for an add operation or the current retries
value for a modify operation.
-l lost-pulse
The full pathname to the user created script to be executed
when the monitor declares a system to be failed. This
script should contain an admfailoverdisk(1M) command line
to transfer the physical disks from the failed host to the
backup host. This script should also contain any system
set up required for the application or its users. The
default is /etc/failover/failovermon_lost_pulse for an add
operation or the current lost-pulse value for a modify
operation.
-g regain-pulse
The full pathname to the user created script to be executed
when the monitor regains the pulse of the system it is
monitoring. This script should contain any actions that
should be performed when the heartbeat is regained (e.g.,
the administrator may want to shut down the application and
move the disks back to the original host). The default is
/etc/failover/failovermon_regain_pulse for an add operation
or the current regain-pulse value for a modify operation.
-s If specified on an add operation this option indicates that
the monitor should be started. If specified on a modify
operation this option indicates that the currently running
monitor should be stopped and restarted with the new
values. If no monitor is running, one will be started.
The following option can be used with the modify operation:
-n Do not start on reboot. This option specifies that this
monitor is not to be brought up when the system is
rebooted.
The following options can be used with the list operation:
-q Quiet. Produce an unformatted listing with no headers,
fields delimited by a single space.
-v Verbose. Produce a formatted listing with headers and
aligned columns. This option is the default.
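Because the quiet format prints one entry per line with fields
separated by a single space, it lends itself to parsing by a script.
The following sketch assumes the fields appear in the order given under
the list operation. The names MonitorEntry and parse_quiet_listing are
hypothetical, and the two flag fields are kept as raw text because
their exact spelling is not documented here.

```python
from typing import NamedTuple


class MonitorEntry(NamedTuple):
    host: str
    running: str          # raw flag text, spelling not documented here
    start_on_reboot: str  # raw flag text, spelling not documented here
    interval: int
    retries: int
    lost_pulse: str
    regain_pulse: str


def parse_quiet_listing(text):
    """Parse the text of a quiet listing into MonitorEntry records."""
    entries = []
    for line in text.splitlines():
        fields = line.split(" ")
        if len(fields) != 7:
            continue  # skip blank or unexpected lines
        host, running, boot, interval, retries, lost, regain = fields
        entries.append(
            MonitorEntry(host, running, boot, int(interval),
                         int(retries), lost, regain)
        )
    return entries
```

The text passed to parse_quiet_listing would be the captured output of
failovermon -o list -q.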
EXAMPLE
The following command line adds a failovermon monitor for a system
named hostA. The monitor sends messages every 60 seconds and retries
the handshake message three times before executing the
/hostA_has_failed script. Should hostA return, the /hostA_is_back
script is executed:
failovermon -o add -i 60 -r 3 -l /hostA_has_failed -g /hostA_is_back hostA
You can then start the monitor with the following command line:
failovermon -o start hostA
To modify this monitor and restart it with an interval of 1200
seconds (i.e., 20 minutes), for example for off-peak monitoring, use
the following command line:
failovermon -o modify -i 1200 -s hostA
To stop this monitor, use the following command line:
failovermon -o stop hostA
FILES
/etc/failover/monitors failover monitors database
DIAGNOSTICS
Warnings
- Cannot initiate connection with host <hostname>, retrying.
- A monitor for <hostname> is already running.
- An attempt was made to delete a monitors database entry that
did not exist.
Errors
- Monitor for <hostname> not running.
- Monitor for <hostname> is already running.
- An attempt was made to add, delete, modify, or list a monitor
for an invalid host.
- An attempt was made to modify or list a monitors database
entry that did not exist.
- An attempt was made to add a monitors database entry that
already existed.
- The wdt driver is not configured on this system.
Exit Codes
0 The operation was successful.
1 The operation was unsuccessful.
2 The operation failed due to access restrictions.
3 There was an error in the command line.
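A calling script can branch on these exit codes. The sketch below maps
each documented code to its meaning. The helper names describe_exit and
run_failovermon are hypothetical, and run_failovermon assumes the
failovermon command is on the search path.

```python
import subprocess

# Meanings copied from the Exit Codes list above.
EXIT_MEANINGS = {
    0: "The operation was successful.",
    1: "The operation was unsuccessful.",
    2: "The operation failed due to access restrictions.",
    3: "There was an error in the command line.",
}


def describe_exit(code):
    """Return the documented meaning of an exit code."""
    return EXIT_MEANINGS.get(code, "Undocumented exit code %d." % code)


def run_failovermon(operation, *args):
    """Run failovermon -o <operation> <args>; return (code, meaning)."""
    result = subprocess.run(["failovermon", "-o", operation, *args])
    return result.returncode, describe_exit(result.returncode)
```

For example, run_failovermon("stop", "hostA") corresponds to the stop
command line shown in the EXAMPLE section.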
SEE ALSO
admfailoveraltcommpath(1M), admfailoverdisk(1M), failoverd(1M),
listen(1M), sysadm(1M), failover(4M), cap_defaults(5).
NOTES
You must have appropriate privilege to perform all operations except
list. On a generic DG/UX system, appropriate privilege is granted by
having an effective UID of 0 (root). See the appropriate_privilege(5)
man page for more information.
On a system with DG/UX information security, appropriate privilege is
granted by having one or more specific capabilities enabled in the
effective capability set of the user. See the cap_defaults(5) man
page for the default capabilities for this command.
It is possible for a system to be in a state where users get no
response but the monitor continues to detect a heartbeat. If you
detect this condition, reset or `hot-key' the system that is hung.
The monitor can then detect the failure and perform its functions, so
that the applications can be restarted while the failed system is
rebooted.
If you add communications paths to the failover altcommpath database
after a monitor has been started, you must stop and restart the
monitor in order for the new paths to be used.
If you intend to shut down a system that is being monitored and do
not want the monitor to detect the system being down and execute its
lost-pulse action script, stop the monitor before shutting down the
system. Additionally, if the system that is being monitored is using
multi-path LAN I/O or the watch dog timer monitoring, you must take
this into account when setting up the failovermon monitor on the
backup system. Failure to account for the time it takes to switch to
the alternate LAN path, or to reset the system, will result in the
disks being taken by the backup system while still visible on the
primary system.
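One way to reason about this timing is a rough estimate of how long the
backup monitor waits before running lost-pulse. The model below is an
assumption, not documented behavior: it supposes that each failed
attempt takes attempt_timeout seconds (a hypothetical value that this
page does not define) and that the monitor sleeps for the interval
between retries. The function names are also hypothetical.

```python
def seconds_until_lost_pulse(interval, retries, attempt_timeout):
    """Estimate time from the first missed heartbeat to lost-pulse.

    Assumes every failed attempt takes attempt_timeout seconds and that
    the monitor sleeps interval seconds between retries.
    """
    return (retries + 1) * attempt_timeout + retries * interval


def tolerates_outage(interval, retries, attempt_timeout, expected_outage):
    """True if the estimated wait is longer than the expected outage."""
    waited = seconds_until_lost_pulse(interval, retries, attempt_timeout)
    return waited > expected_outage
```

Under this model, an interval of 60, a retries value of 3, and an
assumed attempt timeout of 10 seconds give an estimate of 220 seconds.
That is shorter than the 600 second watch dog window described above,
so such a monitor could run lost-pulse before a hung primary system has
been reset.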
Licensed material--property of copyright holder(s)