failovermon(1M)                DG/UX R4.11MU05               failovermon(1M)


NAME
       failovermon - manage failover monitors

SYNOPSIS
       failovermon -o add [ -i interval ] [ -r retries ] [ -l lost-pulse ]
                     [ -g regain-pulse ] [ -b ] [ -s ] hostname

       failovermon -o delete hostname

       failovermon -o modify [ -i interval ] [ -r retries ] [ -l lost-pulse
                     ] [ -g regain-pulse ] [ -bn ] [ -s ] hostname

       failovermon -o list [ -qv ] [ hostname ... ]

       failovermon -o start [ hostname ... ]

       failovermon -o stop [ hostname ... ]

DESCRIPTION
       Failovermon provides operations for manipulating entries in the
       failover monitors(4M) database as well as operations for starting and
       stopping failovermon monitors.  Failover monitors and their action
       scripts (lost-pulse and regain-pulse) are set up and execute on the
       system that is serving in the backup role.  This system should
       already have been set up for failover using the operator-initiated
       failover operations through sysadm.

       The failovermon process monitors the specified system with a
       heartbeat message.  This message is sent from the failovermon process
       to the failoverd(1M) process on the host being monitored.  The
       heartbeat is sent over all communication paths that have been set up
       for the host being monitored using the admfailoveraltcommpath(1M)
       command.  As long as at least one response is received by the monitor
       the heartbeat is successful.  The monitor then sleeps for the number
       of seconds specified in its interval value.
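
       The success rule for a single heartbeat can be sketched in shell as
       follows.  This is an illustrative model only: the function name
       heartbeat_ok and the reply/noreply tokens are hypothetical and are
       not part of failovermon.

```shell
# heartbeat_ok RESULT...
# Hypothetical model: each RESULT stands for one communication path and is
# either "reply" (the path answered) or "noreply" (it did not).
# Succeeds if at least one path answered, as the text above describes.
heartbeat_ok() {
    for result in "$@"; do
        if [ "$result" = reply ]; then
            return 0
        fi
    done
    return 1
}
```

       For example, heartbeat_ok noreply reply succeeds, while heartbeat_ok
       noreply noreply fails.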

       If no response is received on any of the communications paths, the
       retries value is examined to determine whether or not to declare the
       host failed.  If the retries value is zero, the monitor immediately
       executes the lost-pulse script.  If the retries value is not zero,
       the monitor continues to try to communicate with the host until the
       retries value is exceeded.  Then the monitor executes the lost-pulse
       action script.

       The monitor continues to attempt to communicate with the failed host.
       When communications are re-established the regain-pulse action script
       is executed.
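
       The lost-pulse and regain-pulse decisions described above can be
       modeled as a small state machine.  The sketch below is not the
       monitor's implementation.  It assumes that "exceeded" means the
       number of consecutive failed heartbeats is greater than the retries
       value, so that retries=0 fires on the first failure.  The function
       name simulate_monitor and the ok/fail tokens are hypothetical.

```shell
# simulate_monitor RETRIES RESULT...
# Hypothetical model of the monitor's decisions, not the real code.
# Each RESULT is "ok" (some path answered) or "fail" (no path answered).
# Prints the name of the action script that would run at each transition.
simulate_monitor() {
    retries=$1
    shift
    failures=0   # consecutive failed heartbeats
    state=up     # up = host considered alive, down = host declared failed
    for result in "$@"; do
        if [ "$result" = ok ]; then
            if [ "$state" = down ]; then
                echo regain-pulse
                state=up
            fi
            failures=0
        else
            failures=$((failures + 1))
            # Assumption: the host is declared failed once the count of
            # consecutive failures is greater than the retries value.
            if [ "$state" = up ] && [ "$failures" -gt "$retries" ]; then
                echo lost-pulse
                state=down
            fi
        fi
    done
}
```

       Under that assumption, simulate_monitor 0 ok fail ok reports a
       lost-pulse followed by a regain-pulse, while simulate_monitor 2 fail
       fail ok reports nothing because the retries value was never
       exceeded.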

       The failovermon monitor can be configured to monitor the host it is
       running on.  This type of monitoring is used to detect a system hang.
       The monitor determines whether the system it is invoked on has the
       wdt() driver configured. If configured, and the system has a hardware
       timer (AV5500, AV8500, and AV9500 systems) the wdt driver internally
       resets a register every second.  If it fails to reset the timer in
       one second, it triggers a warm reset of the system.

       The failovermon monitor communicates with the wdt driver for a higher
       level of monitoring. Once every interval seconds (the user-specified
       interval value), the failovermon process tries to use the fork, exec,
       open, read, write, and close system calls. Upon successful completion,
       the failovermon process sends a message to the wdt driver indicating
       that the system is not hung. If the wdt driver does not get this
       message from the failovermon process within 600 seconds, the wdt
       driver initiates a system halt to alleviate the hang. This, in
       conjunction with the proper setup via the dg_sysctl(1M) command,
       allows the system to reboot itself.

       This level of hang detection will not detect a single runaway process
       that is causing an application to perform poorly. This is designed to
       detect hangs in the operating system by exercising several extensive
       code paths.
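
       The kind of probe described above can be approximated in shell.  The
       sketch below is only an analogy: it exercises process creation and
       basic file I/O through ordinary commands, and it does not talk to
       the wdt driver.  The function name hang_probe and the temporary file
       name are hypothetical.

```shell
# hang_probe
# Hypothetical stand-in for the monitor's self-test.  Succeeds only if a
# child process can be created and a file can be written and read back.
hang_probe() {
    tmp=${TMPDIR:-/tmp}/hang_probe.$$
    # fork and exec: run a separate sh process.
    # open, write, close: the child creates the file and writes one line.
    sh -c 'echo probe > "$1"' sh "$tmp" || return 1
    # open, read, close: read the line back in the current shell.
    line=
    read -r line < "$tmp"
    rm -f "$tmp"
    [ "$line" = probe ]
}
```

       A caller would treat a zero exit status from hang_probe as the point
       at which the real monitor would notify the wdt driver that the
       system is not hung.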

       The failovermon monitor is also used to check multi-path LAN I/O
       paths.  This is one of the functions that is performed when the
       monitor is monitoring the host it is running on. If the system does
       not currently have a monitor running, the admiopath(1M) command will
       create one.

       When the failovermon process is stopped or terminates abnormally, the
       wdt driver ceases the high level monitoring. The wdt driver continues
       to perform its lower level monitoring (on systems with a hardware
       watch dog timer) until the driver is deconfigured from the system.

   Operations
       add       Add a failovermon monitor entry for the specified hostname
                 to the failover monitors database.  This operation
                 optionally lets the administrator start the monitor at this
                 time.

       delete    Delete a failovermon monitor entry for hostname from the
                 failover monitors database.  This operation also terminates
                 an existing monitor, if one is running.

       modify    Modify a failovermon monitor entry for hostname.  This
                 operation optionally lets the administrator restart the
                 current monitor (if one is running) or start one using the
                 new information.

       list      List failover monitors database entries.  The list
                 operation reports the following monitor information to
                 stdout:

                     the name of the host that is being monitored
                     a flag indicating whether a monitor is running
                     a flag indicating whether the monitor is brought up
                         at system reboot time
                     the interval value
                     the retries value
                     the lost pulse action script name
                     the regain pulse action script name

                 With the `verbose' format (-v), information is printed in
                 aligned columns with headers.  With the `quiet' format (-q)
                 headers are suppressed and each host entry is printed on a
                 separate line.  If both -q and -v are specified, the output
                 is in `quiet' format.

       start     Start a failovermon monitor for the specified host(s). When
                 no hostname is specified, a monitor is started for every
                 entry in the monitors database that is not currently
                 active.

       stop      Stop a failovermon monitor for the specified host(s). When
                 no hostname is specified, all monitors that are currently
                 active are stopped.

   Options
       The following options can be used with the add or modify operations:

       -b        Start on reboot.  This option specifies that this monitor
                 is to be brought up when the system is rebooted.

       -i interval
                 The time in seconds that the failovermon monitor waits
                 after receiving a reply to a handshake before initiating
                 the next handshake.  The default is zero for an add
                 operation or the current interval value for a modify
                 operation.

       -r retries
                 The number of times the failovermon monitor should continue
                 to try to communicate with the failoverd server of the
                 specified system before declaring the system failed.  The
                 default is zero for an add operation or the current retries
                 value for a modify operation.

       -l lost-pulse
                 The full pathname to the user created script to be executed
                 when the monitor declares a system to be failed.  This
                 script should contain an admfailoverdisk(1M) command line
                 to transfer the physical disks from the failed host to the
                 backup host.  This script should also contain any system
                 set up required for the application or its users.  The
                 default is /etc/failover/failovermon_lost_pulse for an add
                 operation or the current lost-pulse value for a modify
                 operation.

       -g regain-pulse
                 The full pathname to the user created script to be executed
                 when the monitor regains the pulse of the system it is
                 monitoring.  This script should contain any actions that
                 should be performed when the heart beat is regained (e.g.,
                 the administrator may want to shut down the application and
                 move the disks back to the original host).  The default is
                 /etc/failover/failovermon_regain_pulse for an add operation
                 or the current regain-pulse value for a modify operation.

       -s        If specified on an add operation this option indicates that
                 the monitor should be started.  If specified on a modify
                 operation this option indicates that the currently running
                 monitor should be stopped and restarted with the new
                 values.  If no monitor is running, one will be started.

       The following option can be used with the modify operation:

       -n        Do not start on reboot.  This option specifies that this
                 monitor is not to be brought up when the system is
                 rebooted.
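
       Because -l and -g take the full pathname of a user-created script, it
       can be useful to check a candidate script before passing it to
       failovermon.  The helper below is a sketch, not part of failovermon:
       the name check_action_script is hypothetical, and the requirement
       that the script be a regular executable file is an assumption.

```shell
# check_action_script PATH
# Hypothetical pre-flight check for a lost-pulse or regain-pulse script.
# Succeeds if PATH is a full pathname naming an executable regular file.
check_action_script() {
    case $1 in
        /*) ;;   # full pathname, as the -l and -g options require
        *)  echo "$1: not a full pathname" >&2
            return 1 ;;
    esac
    # Assumption: the monitor needs to be able to execute the file.
    if [ ! -f "$1" ] || [ ! -x "$1" ]; then
        echo "$1: not an executable file" >&2
        return 1
    fi
    return 0
}
```

       For example, check_action_script /hostA_has_failed could be run
       before an add operation that names that script with -l.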

       The following options can be used with the list operation:

       -q        Quiet.  Produce an unformatted listing with no headers,
                 fields delimited by a single space.

       -v        Verbose.  Produce a formatted listing with headers and
                 aligned columns.  This option is the default.
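
       The quiet format lends itself to scripting.  The sketch below splits
       one line of quiet output into named fields.  It assumes that the
       fields appear in the order given in the description of the list
       operation and that no field contains a space.  The function name
       parse_monitor_line is hypothetical, and the yes/no flag values in
       the example are placeholders, since the actual flag representation
       is not documented here.

```shell
# parse_monitor_line LINE
# Hypothetical parser for one line of `failovermon -o list -q` output.
# Assumed field order: hostname, running flag, start-on-reboot flag,
# interval, retries, lost-pulse script, regain-pulse script.
parse_monitor_line() {
    # Split the line on whitespace into positional parameters.
    set -- $1
    if [ $# -ne 7 ]; then
        return 1
    fi
    echo "host=$1 running=$2 boot=$3 interval=$4 retries=$5 lost=$6 regain=$7"
}
```

       For example, the placeholder line
       "hostA yes no 60 3 /hostA_has_failed /hostA_is_back" would be
       reported with host=hostA, interval=60, and retries=3.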

EXAMPLE
       To add a failovermon monitor for a system named hostA that sends a
       heartbeat every 60 seconds and retries the handshake three times
       before executing the /hostA_has_failed script, and that executes the
       /hostA_is_back script should hostA return, use the following command
       line:

     failovermon -o add -i 60 -r 3 -l /hostA_has_failed -g /hostA_is_back hostA

       You can then start the monitor with the following command line:

         failovermon -o start hostA

       To modify this monitor and restart it with an interval of 1200
       seconds (i.e., 20 minutes), for example for off-peak monitoring, use
       the following command line:

         failovermon -o modify -i 1200 -s hostA

       To stop this monitor, use the following command line:

         failovermon -o stop hostA


FILES
       /etc/failover/monitors   failover monitors database

DIAGNOSTICS
   Warnings
       -      Cannot initiate connection with host <hostname>, retrying.

       -      A monitor for <hostname> is already running.

       -      An attempt was made to delete a monitors database entry that
              did not exist.

   Errors
       -      Monitor for <hostname> not running.

       -      Monitor for <hostname> is already running.

       -      An attempt was made to add, delete, modify, or list a monitor
              for an invalid host.

       -      An attempt was made to modify or list a monitors database
              entry that did not exist.

       -      An attempt was made to add a monitors database entry that
              already existed.

       -      The wdt driver is not configured on this system.

   Exit Codes
        0     The operation was successful.

        1     The operation was unsuccessful.

        2     The operation failed due to access restrictions.

        3     There was an error in the command line.
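
       A calling script can translate these exit codes into messages.  The
       helper below is a sketch.  The function name describe_exit is
       hypothetical, and the message for other values is an addition for
       completeness, since only codes 0 through 3 are documented.

```shell
# describe_exit CODE
# Hypothetical helper: map a failovermon exit code to its documented meaning.
describe_exit() {
    case $1 in
        0) echo "operation was successful" ;;
        1) echo "operation was unsuccessful" ;;
        2) echo "operation failed due to access restrictions" ;;
        3) echo "error in the command line" ;;
        *) echo "undocumented exit code $1" ;;
    esac
}
```

       A typical use would be to run a failovermon command and then call
       describe_exit $? to log the outcome.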

SEE ALSO
       admfailoveraltcommpath(1M), admfailoverdisk(1M), failoverd(1M),
       listen(1M), sysadm(1M), failover(4M), cap_defaults(5).

NOTES
       You must have appropriate privilege to perform all operations except
       list.  On a generic DG/UX system, appropriate privilege is granted by
       having an effective UID of 0 (root). See the appropriate_privilege(5)
       man page for more information.

       On a system with DG/UX information security, appropriate privilege is
       granted by having one or more specific capabilities enabled in the
       effective capability set of the user. See the cap_defaults(5) man
       page for the default capabilities for this command.

       It is possible for systems to be in a state where users get no
       response but the monitor continues to detect a heartbeat.  If this is
       detected you should reset or `hot-key' the system that is hung.  This
       lets the monitor detect a failure and perform its functions that let
       the applications be restarted while the failed system is rebooted.

       If you add additional communications paths to the failover
       altcommpath database after a monitor has been started, you need to
       stop and start the monitor in order for those additional paths to be
       used.

       If you intend to shut down a system that is being monitored and do not
       want the monitor to detect the system being down and execute its
       lost-pulse action script, you should stop the monitor before shutting
       down the system.  Additionally, if the system that is being monitored
       is using multi-path LAN I/O or the watch dog timer monitoring, you
       must take this into account when setting up the failovermon monitor
       on the backup system. Failure to account for the time it takes to
       switch to the alternate LAN path, or to reset the system, will result
       in the disks being taken by the backup system and still visible on
       the primary system.
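
       The two preceding notes both involve stopping and starting a monitor
       around some other action.  The wrappers below are a sketch that uses
       only the documented stop and start operations.  The function names
       and the FAILOVERMON variable are hypothetical.  Setting FAILOVERMON
       to "echo failovermon" gives a dry run that prints the commands
       instead of executing them.

```shell
# Command to run; override with "echo failovermon" for a dry run.
FAILOVERMON=${FAILOVERMON:-failovermon}

# pause_monitor HOST: stop the monitor before a planned shutdown of HOST,
# so that the lost-pulse action script is not executed.
pause_monitor() {
    $FAILOVERMON -o stop "$1"
}

# resume_monitor HOST: start the monitor again once HOST is back up.
resume_monitor() {
    $FAILOVERMON -o start "$1"
}

# restart_monitor HOST: stop and start the monitor, for example after
# adding communications paths to the altcommpath database.
restart_monitor() {
    pause_monitor "$1" && resume_monitor "$1"
}
```

       For a planned shutdown of hostA, pause_monitor hostA would be run on
       the backup system first, and resume_monitor hostA after hostA has
       returned to service.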


Licensed material--property of copyright holder(s)
