The Assimilation Monitoring Project
Welcome to the Assimilation monitoring project. (README)
We provide hyperscale monitoring driven by integrated continuous Stealth Discovery™ and address these system management problems:
What we do: monitor systems with near-zero overhead on both the systems and their administrators.
This is a new project designed to monitor systems and services on a network of potentially unlimited size, without significant growth in centralized resources. The work of monitoring is delegated uniformly in tiny pieces to the various machines being monitored in a network-aware topology - minimizing network overhead and being naturally geographically sensitive.
The two main ideas are:
The original scalability idea was outlined in two different articles
These two main ideas create a system which will have both a great out-of-the-box experience for new users and smooth accommodation of growth to virtually all environments.
For a human-driven overview, we recommend our overview video - from LinuxCon NA 2012 in San Diego.
For source and licensing information see the
The project software undergoes a number of rigorous static and dynamic tests to ensure its integrity.
We now have a store featuring swag for the Assimilation Project! Purchases from this store help support the project. Check it out at http://www.printfection.com/AssimilationProject
The team currently posts updates in the following places:
This concept has two kinds of participating entities:
The picture below shows the architecture for discovering system outages.
Each of the blue boxes represents a server. Each of the connecting arcs represents a bidirectional heartbeat path. When a failure occurs, the systems which observe it report directly to the central collective management authority (not shown on this diagram). Several things are notable about this kind of heartbeat architecture:
This is all controlled and directed by the collective monitoring authority (CMA) - which is designed to be configured to run in an HA cluster using a product like Pacemaker. The disadvantage of this approach is that getting started after a complete data center outage/shutdown can take a while - this part is not O(1).
An alternative approach would be to make the rings self-organizing. The advantage of this is that startup after a full datacenter outage would happen much more quickly. The disadvantage is that this solution is much more complex, and embeds knowledge of the desired topology (which is to some degree a policy issue) into the nanoprobes. It is also not likely to work as well when CDP or LLDP are not available.
One of the key aspects of this system is that it is largely auto-configuring and incorporates discovery into its basic philosophy. It is expected that a customer will drop the various nanoprobes onto the clients being monitored; once the nanoprobes are installed and activated, the systems register themselves and are automatically configured into the system.
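For the centrally directed approach, here is a minimal Python sketch of how an authority might hand out heartbeat partners. It assumes one simple ring in which each server exchanges heartbeats with exactly two neighbors, so per-server work stays constant as the ring grows. The function names and host names are hypothetical and are not taken from the project code.

```python
def ring_neighbors(servers):
    """Map each server to its (previous, next) heartbeat partners in one ring."""
    if len(servers) < 2:
        raise ValueError("a ring needs at least two servers")
    n = len(servers)
    return {
        servers[i]: (servers[(i - 1) % n], servers[(i + 1) % n])
        for i in range(n)
    }


def reporters_of_failure(neighbors, failed):
    """Servers that would notice `failed` going silent and report it."""
    return sorted(set(neighbors[failed]))


ring = ring_neighbors(["alpha", "bravo", "charlie", "delta"])
print(ring["alpha"])
print(reporters_of_failure(ring, "charlie"))
```

Building the whole map takes time proportional to the number of servers, which is consistent with the note above that centrally directed startup is not O(1).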
Stealth discovery is a process of discovering systems and services without using active probes which might trigger security alarms. Some examples of current and anticipated stealth discovery techniques include:
These techniques will not immediately provide a complete list of all systems in the environment. However, as nanoprobes are activated on systems discovered in this way, this process will converge to include the complete set of systems and edge switches in the environment - without setting off even the most sensitive security alarms.
In addition, the netstat information correlated across the servers also provides information about dependencies and service groups.
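To make the netstat correlation concrete, here is a small Python sketch. It assumes each monitored host reports its listening sockets and its outbound connections as plain tuples; that record format and the name `build_dependencies` are illustrative assumptions, not the project's actual data model.

```python
from collections import defaultdict


def build_dependencies(listeners, connections):
    """
    listeners:   {host: {(ip, port): service_name}}  sockets each host listens on
    connections: {host: [(remote_ip, remote_port)]}  outbound connections per host
    Returns {client_host: {(server_host, service_name), ...}}.
    """
    # Index every listening endpoint so it can be looked up by address.
    endpoint_owner = {}
    for host, sockets in listeners.items():
        for endpoint, service in sockets.items():
            endpoint_owner[endpoint] = (host, service)

    depends_on = defaultdict(set)
    for client, remotes in connections.items():
        for endpoint in remotes:
            owner = endpoint_owner.get(endpoint)
            if owner is not None:
                depends_on[client].add(owner)
    return dict(depends_on)


listeners = {
    "db1": {("10.0.0.5", 5432): "postgres"},
    "web1": {("10.0.0.7", 80): "httpd"},
}
connections = {
    "web1": [("10.0.0.5", 5432)],
    "db1": [("192.0.2.44", 443)],  # no monitored host listens on this endpoint
}
print(build_dependencies(listeners, connections))
```

In this sketch, connections that match no known listener are simply ignored; in practice they might instead be treated as hints about systems that are not yet being monitored.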
Furthermore, these nanoprobes use stealth discovery methods to discover systems not being monitored and services on the systems being monitored. Stealth discovery methods are methods which cannot trip even the most sensitive network security alarm - because no probes are sent over the network.
This discovery process is intended to achieve these goals:
The nanoprobe code is written largely in C and minimizes use of:
To do this, we will follow a no news is good news philosophy for exception monitoring - when nothing is wrong, nothing will be reported. Although the central part of the code will likely be available only on POSIX systems, the nanoprobes will also be available on various flavors of Windows.
To the degree possible, we will perform exception monitoring of services on the machine they're provided on - which implies zero network overhead to monitor working services. Stated another way, we follow a no news is good news philosophy. Our primary tool for monitoring services is the Local Resource Manager from the Linux-HA project.
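The no news is good news idea can be sketched in a few lines of Python. This is only an illustration: it uses plain callables as health checks rather than the Local Resource Manager, and the class name `ExceptionMonitor`, the `report` callback, and the choice to also report recovery are all assumptions made for the example.

```python
class ExceptionMonitor:
    """Runs local checks and emits a report only when a service's state changes."""

    def __init__(self, checks, report):
        self.checks = checks    # {service_name: callable returning True if healthy}
        self.report = report    # callable(service_name, healthy), used only on change
        self.last_state = {name: True for name in checks}  # assume healthy at start

    def poll(self):
        for name, check in self.checks.items():
            healthy = bool(check())
            if healthy != self.last_state[name]:
                self.last_state[name] = healthy
                self.report(name, healthy)


sent = []
state = {"httpd": True}
monitor = ExceptionMonitor(
    {"httpd": lambda: state["httpd"]},
    lambda name, ok: sent.append((name, ok)),
)
monitor.poll()            # healthy: nothing is sent
state["httpd"] = False
monitor.poll()            # failure: one report is sent
monitor.poll()            # still failed: nothing new is sent
print(sent)
```

Because the checks run locally and a message goes out only when something changes, a healthy service generates no network traffic at all.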
There are three kinds of testing I see as necessary
We are currently using the Testify software written by the folks at Yelp, and will probably try some of the alternatives as well. Very pleased with the results it's bringing. The nice thing about this is that much of the detailed, gnarly C code is wrapped by the Python code, so when I run the Python tests of those wrappers, the C code under them gets well tested too.
Not quite sure how to best accomplish this. Some of it can just be my home network, but I suppose I could also spin up some cloud VMs too... Not sure yet... Automation is a GoodThing.
I have been thinking about this quite a bit, and have what I think is a reasonable idea about it. It involves writing a simulator to simulate up to hundreds of thousands of nanoprobe clients through a separate python process - probably using the Twisted framework. It would accept and ACK requests from the CMA and randomly create failure conditions similar to those in the "real world" - except at a radically faster rate. This is a big investment, but likely worth it. It helps to have this in mind while designing the CMA as well - since there are things that it could do to make this job a little easier.
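A very rough sketch of what the core of such a simulator might look like follows. To stay self-contained it uses only the Python standard library and an in-process loop instead of Twisted and real network traffic, so it shows the idea rather than the intended implementation. All class names, the tuple used for an ACK, and the per-tick failure rate are assumptions made for the example.

```python
import random


class SimulatedNanoprobe:
    """Stand-in for one nanoprobe: ACKs requests unless it has 'failed'."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request_id):
        # A dead probe stays silent, which the caller sees as a missing ACK.
        return ("ACK", self.name, request_id) if self.alive else None


class Simulator:
    def __init__(self, count, failure_rate, seed=0):
        self.rng = random.Random(seed)     # seeded so runs are repeatable
        self.failure_rate = failure_rate   # chance per tick that a live probe fails
        self.probes = [SimulatedNanoprobe(f"probe{i}") for i in range(count)]

    def tick(self):
        """Advance simulated time one step, failing some live probes at random."""
        newly_failed = []
        for probe in self.probes:
            if probe.alive and self.rng.random() < self.failure_rate:
                probe.alive = False
                newly_failed.append(probe.name)
        return newly_failed

    def broadcast(self, request_id):
        """Send one request to every probe; return the ACKs that came back."""
        replies = (probe.handle(request_id) for probe in self.probes)
        return [reply for reply in replies if reply is not None]


sim = Simulator(count=10_000, failure_rate=0.01, seed=42)
failed = sim.tick()
acks = sim.broadcast(request_id=1)
print(len(failed), len(acks))
```

Raising `failure_rate` or calling `tick()` more often compresses simulated time, which is one way to get failures at a radically faster rate than in a real network.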