The Assimilation Project
Welcome to the Assimilation project. (README)
We provide open source discovery with zero network footprint integrated with highly-scalable monitoring. Here are the problems we address:
What we do: Continually discover and monitor systems, services, switches and dependencies with very low human and network overhead.
The Assimilation Project is designed to discover and monitor infrastructure, services, and dependencies on a network of potentially unlimited size, without significant growth in centralized resources. The work of discovery and monitoring is delegated uniformly in tiny pieces to the various machines in a network-aware topology - minimizing network overhead and being naturally geographically sensitive.
The two main ideas are:
The original monitoring scalability idea was outlined in two different articles
Together, these two main ideas create a system that provides significant capabilities: a great out-of-the-box experience for new users and smooth accommodation of growth in virtually all environments.
For a human-driven overview, we recommend our videos from interviews and conference presentations.
We also have a few demos, which demonstrate the ease of use and power of the Assimilation software.
The project software undergoes a number of rigorous static and dynamic tests to ensure its continued integrity.
The team currently posts updates in the following places:
This concept has two kinds of participating entities:
The picture below shows the architecture for discovering system outages.
Each of the blue boxes represents a server. Each of the connecting arcs represents a bidirectional heartbeat path. When a failure occurs, the systems which observe it report directly to the central collective management authority (not shown on this diagram). Several things are notable about this kind of heartbeat architecture:
This is all controlled and directed by the collective management authority (CMA), which is designed to run in an HA cluster using a product like Pacemaker. The disadvantage of this approach is that getting started after a complete data center outage or shutdown can take a while - this part is not O(1).
An alternative approach would be to make the rings self-organizing. The advantage of this is that startup after a full data center outage would happen much more quickly. The disadvantage is that this solution is much more complex, and embeds knowledge of the desired topology (which is to some degree a policy issue) into the nanoprobes. It is also unlikely to work as well when CDP or LLDP are not available, and to properly diagnose complex faults, it is necessary to know the order in which nodes are placed on rings.
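Why ring order matters for diagnosis can be shown with a small Python sketch. It assumes the CMA keeps each ring as an ordered list of node names and that each node exchanges heartbeats with the nodes immediately before and after it. The function names and data layout are hypothetical illustrations, not the project's actual code.

```python
def ring_neighbors(ring):
    """Map each node to the ring neighbors it exchanges heartbeats with.

    `ring` is an ordered list of node names (an assumed representation).
    """
    n = len(ring)
    if n < 2:
        return {node: () for node in ring}
    neighbors = {}
    for i, node in enumerate(ring):
        prev_node = ring[i - 1]          # index -1 wraps to the last node
        next_node = ring[(i + 1) % n]    # wrap around to the first node
        # In a two-node ring, previous and next are the same machine.
        neighbors[node] = tuple(dict.fromkeys((prev_node, next_node)))
    return neighbors


def suspects(ring, reports):
    """Return nodes that every one of their ring neighbors reports as silent.

    `reports` maps a reporting node to the set of neighbors it can no
    longer hear (a hypothetical report format).
    """
    dead = set()
    for node, peers in ring_neighbors(ring).items():
        if peers and all(node in reports.get(p, set()) for p in peers):
            dead.add(node)
    return dead


if __name__ == "__main__":
    ring = ["a", "b", "c", "d"]
    # Both neighbors of "b" report it silent, so "b" is the likely failure.
    print(suspects(ring, {"a": {"b"}, "c": {"b"}}))
```

Without the ordering, the two reports about "b" could not be recognized as coming from both of its neighbors, which is the kind of correlation a central authority can do and an unordered set of nodes cannot.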
One of the key aspects of this system is that it is largely auto-configuring and incorporates discovery into its basic philosophy. It is expected that a customer will drop the various nanoprobes onto the clients being monitored; once the nanoprobes are installed and activated, the systems register themselves and are automatically configured into the system.
Zero-network-footprint discovery is the process of discovering systems and services without sending active probes across the network which might trigger security alarms. Some examples of current and anticipated zero-network-footprint discovery techniques include:
These techniques will not immediately provide a complete list of all systems in the environment. However, as nanoprobes are activated on systems discovered in this way, the process will converge to include the complete set of systems and edge switches in the environment - without setting off even the most sensitive security alarms.
In addition, the netstat information correlated across the servers also provides information about dependencies and service groups.
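A rough Python sketch of how such correlation could work follows. It assumes each nanoprobe reports two kinds of netstat-style facts: listening sockets and established outbound connections. The tuple layouts and the function name are assumptions made for illustration, not the project's data model.

```python
def infer_dependencies(listeners, connections):
    """Correlate per-host socket data into service dependency edges.

    listeners:   iterable of (host_ip, port, service_name)
    connections: iterable of (client_service, remote_ip, remote_port)
    Returns a sorted list of (client_service, server_service) pairs.
    """
    # Index every listening endpoint by its (ip, port) pair.
    by_endpoint = {(ip, port): svc for ip, port, svc in listeners}
    edges = set()
    for client, remote_ip, remote_port in connections:
        server = by_endpoint.get((remote_ip, remote_port))
        if server is not None:
            # The client talks to a known listener: record the dependency.
            edges.add((client, server))
    return sorted(edges)


if __name__ == "__main__":
    listeners = [
        ("10.0.0.5", 5432, "postgres"),
        ("10.0.0.6", 80, "nginx"),
    ]
    connections = [
        ("webapp", "10.0.0.5", 5432),   # webapp depends on postgres
        ("nginx", "10.0.0.7", 8000),    # remote end not yet discovered
    ]
    print(infer_dependencies(listeners, connections))
```

A connection whose remote endpoint matches no known listener produces no edge here; in practice it would point at a system that has not been discovered yet.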
Furthermore, these nanoprobes use zero-network-footprint discovery methods to discover systems not being monitored and services on the systems being monitored. Zero-network-footprint discovery methods are methods which cannot trip even the most sensitive network security alarm - because no probes (packets) are sent over the network to perform discovery.
This discovery process is intended to achieve these goals:
The nanoprobe code is written largely in C and minimizes use of:
To do this, we will follow a management-by-exception philosophy for exception monitoring - when nothing is wrong, nothing will be reported. Although the central part of the code will likely be available only on POSIX systems, the nanoprobes will also be available on various flavors of Windows.
To the degree possible, we will perform exception monitoring of services on the machine they're provided on - which implies zero network overhead to monitor working services. Stated another way, we follow a management-by-exception philosophy. Our primary tool for monitoring services is a re-implemented Local Resource Manager from the Linux-HA project.
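The management-by-exception idea can be illustrated with a minimal Python sketch. It assumes a local monitor that checks services on its own machine and calls a reporting function only when a service changes state. The class name, the callback, and the choice to also report recovery are assumptions for illustration, not the project's interface.

```python
class ExceptionMonitor:
    """Report service state changes only; stay silent while all is well."""

    def __init__(self, report):
        # `report` is a hypothetical callback that would send one message
        # upstream; nothing crosses the network unless it is called.
        self._report = report
        self._last = {}

    def observe(self, service, healthy):
        """Record a local check result and report only on a change."""
        previous = self._last.get(service, True)   # assume healthy at start
        self._last[service] = healthy
        if healthy != previous:
            self._report(service, "recovered" if healthy else "failed")


if __name__ == "__main__":
    sent = []
    monitor = ExceptionMonitor(lambda svc, event: sent.append((svc, event)))
    for result in (True, True, False, False, True):
        monitor.observe("web", result)
    # Five local checks, but only the two state changes were reported.
    print(sent)
```

The local checks cost no network traffic at all; only the transitions do.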
There are three kinds of testing I see as necessary:
We are currently using the Testify software written by the folks at Yelp, and will probably try some of the alternatives as well. We are very pleased with the results it is bringing. The nice thing about this is that much of the detailed, gnarly C code is wrapped by the Python code, so when I run the Python tests of those wrappers, the C code under them gets well tested too.
Not quite sure how to best accomplish this. Some of it can just be my home network, but I suppose I could also spin up some cloud VMs too... Not sure yet... Automation is a GoodThing.
I have been thinking about this quite a bit, and have what I think is a reasonable idea about it. It involves writing a simulator to simulate up to hundreds of thousands of nanoprobe clients through a separate Python process - probably using the Twisted framework. It would accept and ACK requests from the CMA and randomly create failure conditions similar to those in the "real world" - except at a radically faster rate. This is a big investment, but likely worth it. It helps to have this in mind while designing the CMA as well, since there are things it could do to make this job a little easier.
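As a very rough illustration of the simulator idea, here is a Python sketch that uses only the standard library rather than Twisted and runs in simulated time steps instead of over a real network. All class names, function names, and parameters are hypothetical; it only shows the shape of "ACK requests, fail at random, much faster than real life".

```python
import random


class SimulatedNanoprobe:
    """A stand-in for one nanoprobe client (hypothetical model)."""

    def __init__(self, name, rng, failure_rate):
        self.name = name
        self.alive = True
        self._rng = rng
        self._failure_rate = failure_rate

    def tick(self):
        """Advance one simulated time step; a live probe may randomly fail."""
        if self.alive and self._rng.random() < self._failure_rate:
            self.alive = False

    def handle_request(self, request_id):
        """ACK a CMA request if alive; a failed probe stays silent (None)."""
        return ("ACK", self.name, request_id) if self.alive else None


def run_simulation(count, steps, failure_rate, seed=0):
    """Simulate `count` probes for `steps` steps.

    Returns the probes and a list of (step, probe_name) for every
    request that went unacknowledged.
    """
    rng = random.Random(seed)   # seeded so that test runs are repeatable
    probes = [SimulatedNanoprobe(f"probe{i}", rng, failure_rate)
              for i in range(count)]
    unacked = []
    for step in range(steps):
        for probe in probes:
            probe.tick()
            if probe.handle_request(step) is None:
                unacked.append((step, probe.name))
    return probes, unacked


if __name__ == "__main__":
    probes, unacked = run_simulation(count=1000, steps=50, failure_rate=0.001)
    failed = sum(1 for p in probes if not p.alive)
    print(f"{failed} of {len(probes)} simulated probes failed")
```

Seeding the random generator keeps failure patterns reproducible, which matters when a simulated failure exposes a CMA bug that needs to be replayed.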