Bongoyo: Methods Overview

1. The 10,000m view

The system works by keeping several fail-over servers (f1, f2, ...) ready to assume the identities of one or more virtual servers (v1, v2, ...). Once a virtual server is determined to be unavailable, one of the fail-over servers takes over its identity. Bongoyo itself places no limit on how many virtual servers each fail-over server can assume the identities of. All this work is done by the bongoyo daemon running on each of the fail-over servers.

2. Configuration

When bongoyo starts, it reads a configuration file (by default, /etc/bongoyo.conf).

Bongoyo was originally designed to reread its configuration file on a HUP signal. It still does, but little attention has been paid to the effects of dynamic reconfiguration on a running bongoyo, say, in the middle of taking over a virtual server. Fixing this is on the wish list. For now, consider the HUP signal unreliable.

3. The Boss

Before any monitoring or take-over can happen, a boss has to be selected. One of the bongoyo daemons will become the boss. There can be one and only one boss at any time. If at any time there is more than one boss, or no boss at all, an election is called to select a new boss. All fail-over servers monitor the boss to make sure that it does not go away.
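
The "one and only one boss" rule can be illustrated with a small sketch. It is not taken from the bongoyo source: the function name and the dictionary of boss claims are hypothetical, and how the claims are actually collected is not described here.

```python
def election_needed(boss_claims):
    """Return True when an election should be called.

    boss_claims is a hypothetical mapping of fail-over server name to
    True/False, where True means that daemon currently believes it is
    the boss. An election is needed unless exactly one daemon claims
    the role.
    """
    bosses = [name for name, is_boss in boss_claims.items() if is_boss]
    return len(bosses) != 1
```

With this sketch, both an empty set of claims and two simultaneous claims lead to an election, while a single claim does not.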

The following are the functions of the boss:

Monitors the availability of the virtual server(s).
Calls for volunteers for a virtual server that has died.
Selects a volunteer for a virtual server.
Tells a fail-over server to release a virtual server.

In general, the role of the boss is to be a big kludgy semaphore against race/deadlock conditions. It is a simple and necessary hack in distributed computing.

4. Monitoring the virtual server: the combo states

It is the task of the boss to monitor the virtual server, and it does the monitoring in three ways:

Check that the service is up.
Check that the IP is reachable (TCP connect).
Query all fail-over servers to see if any of them thinks it has assumed the duty of a virtual server.

The results of the monitoring are represented by a 3-character combo state. The first character represents the state of the service on the virtual server; it can be P(ending), U(p) or D(own). The second character represents the IP reachability; it can also be P(ending), U(p) or D(own). The third character represents the number of fail-over servers that think they are the virtual server; it can be 0 (none), 1 (one, which is what you hope to see) or M(ultiple). For example, a combo state of DU1 means the service is dead, but the IP of the virtual server is still reachable and one fail-over server claims responsibility for the virtual server.
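
As an illustration only, the encoding described above could be assembled like this. The function names are hypothetical and the real daemon may represent the state differently; only the character meanings come from the description.

```python
def claim_char(claimants):
    """Map the number of fail-over servers claiming the virtual server
    to the third combo character: 0, 1, or M for more than one."""
    if claimants < 0:
        raise ValueError("claimant count cannot be negative")
    if claimants == 0:
        return "0"
    if claimants == 1:
        return "1"
    return "M"


def combo_state(service, ip, claimants):
    """Build the 3-character combo state.

    service and ip are each one of "P" (pending), "U" (up) or "D" (down);
    claimants is the number of fail-over servers that think they are
    the virtual server.
    """
    for value in (service, ip):
        if value not in ("P", "U", "D"):
            raise ValueError("state must be P, U or D, got %r" % (value,))
    return service + ip + claim_char(claimants)
```

For the example in the text, combo_state("D", "U", 1) gives DU1.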

The combo state that you most want to see is UU1. The combo state in which it is safe to request volunteers for a virtual server is DD0. There are combo states, like UUM, in which the boss will tell all the fail-over servers to release the virtual server. There are also combo states that the boss can't do anything about, like DU0.
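
A minimal sketch of this decision, assuming a simple lookup table. Only the four combo states named above are mapped; the action labels are hypothetical names invented for this example, and every other state is left as "unspecified" because its handling is not described here.

```python
# Action labels are hypothetical; only the combo states named in the
# text are listed.
NAMED_STATES = {
    "UU1": "healthy",             # the state you most want to see
    "DD0": "request_volunteers",  # safe to ask for a volunteer
    "UUM": "release_all",         # tell all fail-over servers to release
    "DU0": "no_action",           # nothing the boss can do
}


def boss_action(combo):
    """Return the action label for a combo state, or "unspecified"
    for states this overview does not describe."""
    return NAMED_STATES.get(combo, "unspecified")
```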

5. Self monitoring

Each of the fail-over servers monitors itself, to have some idea of how suitable it is to assume the duty of a failed virtual server. This is called the suitability self check in the program. The reliability of this check depends on the module used to do the check. For example, the check for the LPRNG module just tries to open a connection to TCP port 515 on itself. If the connection is accepted, the check is considered successful.

The result determines whether a fail-over server volunteers when the boss sends a request to take over a virtual server.
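
A connect-style self check of the kind described for the LPRNG module might look roughly like the sketch below. It is not the module's actual code: the function name, the loopback address and the timeout are assumptions, and how the result is turned into a suitability factor is not shown.

```python
import socket


def tcp_self_check(port=515, host="127.0.0.1", timeout=2.0):
    """Return True if a TCP connection to host:port is accepted.

    The host and timeout defaults are assumptions made for this
    sketch; port 515 is the one mentioned for the LPRNG check.
    """
    try:
        # create_connection raises OSError on refusal or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

An accepted connection only shows that something is listening on the port, which is why the reliability of the check depends on the module.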

6. Volunteer request and selection

When the boss has determined that a virtual server is down and suitable for taking over (combo state DD0), it sends a volunteer request message to all the fail-over servers. Each fail-over server replies with a suitability factor of 0-100, 100 being ideal. The boss gathers all the replies and selects a volunteer. It then sends the selected fail-over server the command to take over the virtual server and changes the combo state for the virtual server from DD0 to PP0.
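
The selection rule itself is not spelled out above. The sketch below assumes the boss picks the reply with the highest suitability factor, treats a factor of 0 as "not volunteering", and breaks ties by server name; all three are assumptions for illustration, as is the function name.

```python
def select_volunteer(replies):
    """Pick a volunteer from the gathered replies.

    replies is a hypothetical mapping of fail-over server name to its
    suitability factor (0-100, 100 being ideal). Returns the chosen
    server name, or None if nobody is suitable.
    """
    candidates = {
        name: factor
        for name, factor in replies.items()
        if 0 < factor <= 100  # assumption: 0 means "do not pick me"
    }
    if not candidates:
        return None
    # Iterate names in sorted order so ties resolve predictably:
    # max() keeps the first maximal element it sees.
    return max(sorted(candidates), key=candidates.get)
```

After a successful selection, the boss would send the take-over command and move the combo state from DD0 to PP0, as described above.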

7. Taking over a virtual server

When a fail-over server receives an assume_vserver command from the boss, it does the following:

Executes the pre_takeover command.
Takes over the IP (by dedicating or aliasing an interface).
Executes the post_takeover command.

If a takeover fails, nothing is done to notify the boss explicitly. The boss will know this implicitly when the combo state reaches DD0 again. This also places an implicit time limit on how long a fail-over server has to take over a virtual server. (The time limit is admittedly hand-waving at this point; putting more concrete numbers on all these timeouts is on the wish list.)
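
The three take-over steps can be sketched as an ordered sequence. This is an illustration, not the daemon's code: the steps are passed in as hypothetical callables that return True on success, and stopping at the first failure is an assumption. In keeping with the description above, the result stays local and nothing is reported to the boss.

```python
def assume_vserver(pre_takeover, take_over_ip, post_takeover):
    """Run the take-over steps in order.

    Each argument is a hypothetical callable returning True on
    success. Returns True only if all three steps succeed; the
    sketch stops at the first failing step (an assumption) and does
    not notify the boss either way.
    """
    for step in (pre_takeover, take_over_ip, post_takeover):
        if not step():
            return False
    return True
```

If this returns False, the virtual server simply stays down, and the boss finds out when its own monitoring brings the combo state back to DD0.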

Page mangling by Edwin Lim.

Tue Sep 5 00:36:50 EDT 2000