The Assimilation Monitoring Project
Several considerations shape the CMA architecture. The first of these is probably robustness: the CMA needs to be able to fail over and recover while maintaining system state, and to keep responding to client input without losing any important messages.
A simple and relatively proven architecture for this kind of thing is to have a front-end process which reads messages bound for the CMA and puts them in a persistent queue, using a tool similar to Qpid or WebSphere MQ. It is worth noting that Qpid doesn't solve all possible failover-type problems, but it reduces the number of cases to take care of, and significantly reduces the probability of hitting the corner cases that remain.
This then puts the structure into two sets of components:
This architecture allows multiple instances of the packet reader to run at once. It is not yet clear how the queue reader/packet writer job should be structured, how many queues there should be, and so on.
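To make the reader/queue split concrete, here is a minimal sketch in Python. It is not the project's code: SQLite stands in for a real broker such as Qpid purely so the example is self-contained, and the names `PersistentQueue` and `packet_reader_step` are hypothetical. The point it illustrates is that a message is committed to durable storage before the reader moves on, and is removed only once the consumer acknowledges it.

```python
import sqlite3


class PersistentQueue:
    """Durable FIFO backed by SQLite (an assumed stand-in for Qpid or similar)."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, sender TEXT, payload BLOB)"
        )
        self.db.commit()

    def put(self, sender, payload):
        # The 'with' block commits on success, so the message is on disk
        # before put() returns to the packet reader.
        with self.db:
            self.db.execute(
                "INSERT INTO queue (sender, payload) VALUES (?, ?)",
                (sender, payload),
            )

    def peek(self):
        # Oldest message still in the queue, or None when it is empty.
        return self.db.execute(
            "SELECT id, sender, payload FROM queue ORDER BY id LIMIT 1"
        ).fetchone()

    def ack(self, msg_id):
        # Delete a message only after the consumer has fully processed it,
        # so a consumer crash leaves the message in place for a retry.
        with self.db:
            self.db.execute("DELETE FROM queue WHERE id = ?", (msg_id,))


def packet_reader_step(sock, queue):
    """Read one datagram from sock and enqueue it durably."""
    payload, addr = sock.recvfrom(65535)
    queue.put("%s:%d" % addr[:2], payload)
```

In this sketch `sock` would be a bound UDP socket and the queue reader would loop over `peek()` and `ack()`. With a real broker, several packet readers could feed the same queue, which is what allows them to be multiple instances.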
The CMA packet reader architecture is very simple. It performs these functions:
In my original thoughts on this subject, I had assumed that all the CMA software would be in Python. However, it probably makes sense to write this one module in C, since it is very simple and the GSource code is well suited to this kind of task. I haven't thought out all the details yet.
This code is more complex than the clients, or the packet reader above. It makes sense for this code to be in a higher-level language with garbage collection. I currently think of this as a good thing to write in Python. Java is also a reasonable candidate language for this task - particularly since the native interfaces for Qpid are Java interfaces.
There are several kinds of messages that might be received from clients:
For the first four types of packets, the actions are pretty similar:
A heartbeat timeout will eventually invoke a finite state machine to disambiguate the failure. That is, if machine B is being monitored by machines A and C, then when A reports that B is down, machine C is expected to make a similar report soon (within two heartbeat intervals). If it does not, then something funky is going on and further investigation is likely in order.
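The disambiguation rule above can be sketched as a small state machine. This is only an illustration under assumptions: the class name `DeathReportTracker`, the state names, and the choice to measure the two-heartbeat window from the first report are all mine, not part of any existing design.

```python
class DeathReportTracker:
    """Tracks 'machine is down' reports for one monitored machine (hypothetical)."""

    def __init__(self, monitors, heartbeat_interval):
        self.monitors = set(monitors)   # machines watching this one, e.g. {"A", "C"}
        self.window = 2 * heartbeat_interval
        self.reports = {}               # reporter -> time of its first report
        self.state = "UP"

    def report_down(self, reporter, now):
        """Record a down report and return the resulting state."""
        if reporter not in self.monitors:
            return self.state           # ignore reports from non-monitors
        self.reports.setdefault(reporter, now)
        if set(self.reports) == self.monitors:
            # Every monitor agrees, so the failure is corroborated
            # (a late report also lands here in this sketch).
            self.state = "CONFIRMED_DOWN"
        elif self.state == "UP":
            self.state = "SUSPECTED"
        return self.state

    def tick(self, now):
        """Call periodically; flags a suspicion that was never corroborated."""
        if self.state == "SUSPECTED":
            first = min(self.reports.values())
            if now - first > self.window:
                self.state = "INVESTIGATE"
        return self.state
```

For example, with monitors A and C and a one-second heartbeat, a report from A at t=10 moves B to `SUSPECTED`; a report from C before t=12 confirms the failure, while silence from C past t=12 moves B to `INVESTIGATE`.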
When a machine is a member of a higher-level ring and the machine making the report is not connected to the same switch, active probes are in order to see whether network components (switches or routers) might be implicated.
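As a rough sketch, the probe condition above reduces to a simple predicate. The `Machine` record and its `switch` and `in_higher_ring` fields are hypothetical names chosen for illustration; how the CMA would actually learn switch connectivity is not covered here.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    switch: str            # identifier of the switch this machine is attached to
    in_higher_ring: bool   # True if the machine also belongs to a higher-level ring


def needs_network_probe(reported, reporter):
    """True when a down report should trigger active probes of the network."""
    # Probe only when the reported machine sits in a higher-level ring and
    # the report crossed at least one switch boundary.
    return reported.in_higher_ring and reporter.switch != reported.switch
```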