Bongoyo: program details


1. Overview

This section is probably interesting only to those who want to understand the internals of the program. If you just want to use bongoyo, you can skip this section.

Bongoyo's objective is high availability. It accomplishes this by running on multiple fail-over servers that can take over the server/service that is to be made highly available. It follows immediately that bongoyo needs a way to know if the HA server/service is down, and a way to provide the same service if it is. Of course, when it comes to implementing the solution, a lot more details come up that need to be taken care of and that were not obvious from 10,000 feet. In the rest of this section, an attempt will be made to describe the internals of bongoyo, with the hope that the reader can understand what the program does, and perhaps contribute to it.

Bongoyo is written in Perl. Some of the reasons why Perl is used are:

  1. Perl seems to have enough juice to get the job done, despite being known as a "scripting" language.
  2. Rapid development: Perl has many features that make life easier, chief among them garbage collection. With GC, less attention is needed for memory management, which allows the flexibility of playing with various data structures.
  3. Multi-platform support.
It is planned that the next version of bongoyo will be rewritten in a "real" language, most likely C, C++ or Java. Right now, the main objective is to make the underlying concept of bongoyo as robust and as flexible as possible.

2. Tasks and modules - flexibility

Like many programs, bongoyo is designed to be very flexible. Its flexibility comes mainly from two "ingredients":

  1. cooperative multitasking
  2. modules
These two design elements are not directly relevant to high availability, but they are the backbone of bongoyo.

2.1 Cooperative multitasking

The task approach has the advantage of running arbitrary subroutines without having to resort to a huge and ugly loop of while's and if's. For example, the subroutine that checks if the boss is alive is run as one task, and the subroutine that checks the service is run as another task; if there is another service that needs to be checked, it can run as yet another task. When an incoming connection is received, a task is started to collect the message; when an external task needs to be executed, a task is created to run it; and so on. This makes the program very, very flexible. The trade-off, however, is that weird interactions like deadlocks and race conditions are more likely.

Once bongoyo has initialized (a few small subroutines), the first thing it runs is the task switcher, main_loop. main_loop is pretty much the equivalent of the kernel scheduler: all it does is run one subroutine (task) after another. Each task will either run until it is completed, or it will return control to main_loop and schedule itself to be executed again. Since this is not a full-blown kernel with protected memory space for each task and pre-emptive multitasking, these things have to be simulated.
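
As a very rough sketch (this is not bongoyo's actual code; the task-record layout is invented for illustration), a cooperative scheduler of this kind boils down to something like:

    #!/usr/bin/perl -w
    use strict;

    my @tasks;   # each entry: { sub => \&task_sub, wake => $epoch_seconds, ... }

    # Stand-in for main_loop: run every task whose wake-up time has
    # arrived, then delete or reschedule it based on its return value.
    sub main_loop {
        while (@tasks) {
            my $now = time();
            for my $t (@tasks) {
                next if $t->{wake} > $now;        # not due yet
                my $ret = $t->{sub}->($t);        # run the task once
                if ($ret < 0) {
                    $t->{done} = 1;               # negative: delete the task
                } else {
                    $t->{wake} = time() + $ret;   # otherwise: rerun in $ret seconds
                }
            }
            @tasks = grep { !$_->{done} } @tasks;
            sleep 1 if @tasks;   # crude pause; no fairness guarantees, as noted below
        }
    }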

Most tasks will complete in a single run. Some tasks cannot or might not complete in a single run, e.g., because input might block. In situations like this, the task has to save its state in its task-specific data, $s_task (so that it knows which state it was at when it gets run again), and return control to main_loop. The return value is an integer: if it is less than 0, the task is deleted; otherwise the task is scheduled to run again after that many seconds (1 = 1 s, 2 = 2 s, etc., 0 = ASAP), or longer (if there are lots of tasks--there is no guarantee).
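
For example, a multi-run task might look like this under the sketch above (again illustrative; the fields inside $s_task->{data} are invented, and the countdown stands in for, say, waiting on input that might block):

    # A multi-run task: it keeps its state in $s_task->{data} so it can
    # pick up where it left off, and tells main_loop when to rerun it.
    sub slow_countdown {
        my ($s_task) = @_;
        my $d = $s_task->{data} ||= {};             # task-specific scratch space
        $d->{left} = 5 unless defined $d->{left};   # first run: set up state
        print "tick, $d->{left} runs left\n";
        return -1 if --$d->{left} < 0;              # negative return: delete task
        return 2;                                   # rerun in roughly 2 seconds
    }

With the main_loop sketch above, this would be registered as something like push @tasks, { sub => \&slow_countdown, wake => time() }; before calling main_loop().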

2.1.1 Task specific data

To provide for task-specific memory, the $s_task data structure is maintained for each task. It contains things like the task's subroutine pointer, its wake-up time, and a pointer to more task-specific data.
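
Concretely, $s_task can be pictured as a hash reference along these lines (the field names are guesses for illustration, not necessarily the actual ones):

    # Illustrative layout of one task's $s_task record:
    my $s_task = {
        sub  => \&service_monitor,   # subroutine pointer for the task
        wake => time() + 5,          # wake-up time (epoch seconds)
        data => {                    # pointer to more task-specific data,
            state => 'waiting',      #   e.g. which state the task was at
        },
    };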

2.2 Modules

Aside from the core bits, most of the functionality of bongoyo is implemented as modules. This allows for easy expansion of the program. There are three main classes of modules:

  1. OS specific: Contains things like how to obtain interface data, how to alias an IP, etc. Supporting a new platform is as simple as writing an OS-specific module for it. The OS-specific modules get loaded automatically by bongoyo (autodetected).
  2. HA related: Contains the HA details, like how to check a particular service (e.g., the LPRNG module for LPRng), what to do before taking over an IP, etc. Supporting a particular service basically amounts to writing a module for it. This class of modules gets loaded as needed when the configuration file is parsed (see the loading sketch after this list).
  3. Miscellaneous support: There are no modules in this class at the moment. The first module will probably be a mini webserver that allows status viewing and dynamic control of the bongoyo daemon. This class of modules will probably get loaded when the configuration file is parsed (like the HA-related modules).
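
As an illustration only (this is not bongoyo's actual loader, and the file layout is invented), loading a service module by name when the configuration file is parsed could be as simple as:

    # Hypothetical on-demand loader: map a service name from the
    # configuration file to a module file and require it at run time.
    sub load_service_module {
        my ($name) = @_;                             # e.g. "LPRNG"
        my $file = "Bongoyo/" . uc($name) . ".pm";   # invented layout
        eval { require $file };                      # pulls it in via @INC
        die "cannot load module for '$name': $@" if $@;
    }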

3. The core bits

Basically, everything that cannot be modularized is considered a core bit of bongoyo. There are two classes of core bits:

  1. HA related: The major components are boss_monitor(), duty_monitor() and service_monitor().
  2. Program support: This contains bits that are not specific to HA, like main_loop, configuration parsing, etc.

3.1 HA related core bits

3.1.1 boss_monitor

In a distributed system, there is always the problem of semaphores/locking. There are two obvious approaches: (1) the semaphore is contested by all servers for each action, or (2) the semaphore is contested once and held forever. Since the second approach seems like the simpler solution, that is what is used in bongoyo. The fail-over server that wins the contest is called the boss.

Each fail-over server runs the boss_monitor task at all times. The purpose of this task is to make sure that there is one and only one boss at any time. This includes:

  1. Broadcast query/reply for the boss.
  2. Boss "heartbeat" to make sure that the boss is up (see the sketch after this list).
  3. Calling for a boss duty election (the boss dies, or there is more than one boss).
  4. Running boss-related tasks (if self becomes the boss).
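
To make the heartbeat idea concrete, the liveness side of boss_monitor boils down to comparing the time of the last heartbeat against a timeout, roughly like this (a sketch only; the 10-second timeout and all the names here are invented):

    my $last_heartbeat = time();   # updated whenever a boss packet arrives
    my $boss_timeout   = 10;       # invented value, in seconds

    sub call_for_election { print "boss presumed dead, electing a new one\n" }  # stub

    # Heartbeat check in task form: rerun every ~1 second; if the boss
    # has been silent for too long, call for a boss duty election.
    sub check_boss_alive {
        if (time() - $last_heartbeat > $boss_timeout) {
            call_for_election();
        }
        return 1;
    }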

The way the boss selection works is this:

3.1.2 duty_monitor

duty_monitor is basically the slave to the boss, and it contains all the bits needed to process commands from the boss. It gets run by all fail-over servers (including the server that is currently the boss). The functions of this task include:

  1. Performing a self-check to see if self is suitable for taking over a failed virtual server.
  2. Accepting commands from the boss to take over/release a virtual server (see the dispatch sketch after this list).
  3. Answering volunteer calls from the boss (for a failed virtual server).
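
For instance, the command handling could be pictured as a small dispatch table (purely illustrative; the command names and handlers are invented):

    # Hypothetical dispatch table from boss commands to handler subs.
    my %handler = (
        TAKEOVER  => sub { my ($vs) = @_; print "taking over $vs\n" },
        RELEASE   => sub { my ($vs) = @_; print "releasing $vs\n" },
        VOLUNTEER => sub { my ($vs) = @_; print "offering to host $vs\n" },
    );

    sub handle_boss_command {
        my ($cmd, $virtual_server) = @_;
        my $h = $handler{$cmd};
        unless ($h) {
            warn "unknown command from boss: $cmd\n";
            return;
        }
        $h->($virtual_server);
    }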

3.1.3 service_monitor

This task gets run only by the boss. Its function is to make sure that the virtual server is available. It does this by:

  1. Monitoring the virtual server.
  2. Calling for a volunteer to take over a virtual server if it is foobar'ed.

In an attempt to be robust (admittedly, the exact reasoning behind this design decision has been sort of forgotten), service_monitor keeps track of three states for each virtual server:

  1. Service availability
  2. IP reachability
  3. How many fail-over servers think that they are a particular virtual server.
The three states are combined into one big "combo state", for lack of a better name.

A combo state consists of three characters, summarizing the three states of a virtual server:

  1. Service availability: P(ending), U(p) or D(own).
  2. IP reachability: P(ending), U(p) or D(own).
  3. # of responsible fail-over servers: 0, 1 or M(ultiple).

Ideally, the combo state of a virtual server should be UU1. However, we are dealing with computers after all, so all kinds of weird stuff can happen. The only time a request for a volunteer is sent is when the combo state is DD0.
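
For illustration, building and testing a combo state could look like the following (the subroutine is invented, but the P/U/D and 0/1/M encoding and the DD0 trigger are exactly as described above):

    # Build the three-character combo state for one virtual server.
    # $service and $ip are each 'P', 'U' or 'D'; $n is the number of
    # fail-over servers that think they are this virtual server.
    sub combo_state {
        my ($service, $ip, $n) = @_;
        my $count = $n == 0 ? '0' : $n == 1 ? '1' : 'M';
        return $service . $ip . $count;
    }

    my $state = combo_state('D', 'D', 0);
    print "calling for a volunteer\n" if $state eq 'DD0';   # the only trigger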


Page mangling by Edwin Lim.

Wed Sep 6 01:39:53 EDT 2000
