The Assimilation Project
Incredibly easy to configure, easy on your network, incredibly scalable.

Introduction

Welcome to the Assimilation project.

We provide open source discovery with zero network footprint integrated with highly-scalable monitoring. Here are the problems we address:

  • Organizations are vulnerable to attack through forgotten or unknown systems (30% of all intrusions)
  • Organizations have no automatic infrastructure discovery, or they run it infrequently
    • => System configuration information is out of date, or only in people's heads
  • System discovery is not integrated with monitoring
    • Most organizations have no way of knowing they're monitoring everything - and probably aren't
    • Most monitoring is time-consuming to configure, and is typically incomplete, out of date, and easily confused
  • Monitoring is complex and expensive to scale.

What we do: Continually discover and monitor systems, services, switches and dependencies with very low human and network overhead

  • Discover systems, services, switches and dependencies using zero network footprint techniques
  • Monitor systems and services with very low overhead and extreme scalability
  • Make monitoring easy to configure and manage

The Assimilation Project is designed to discover and monitor infrastructure, services, and dependencies on a network of potentially unlimited size, without significant growth in centralized resources. The work of discovery and monitoring is delegated uniformly in tiny pieces to the various machines in a network-aware topology - minimizing network overhead and being naturally geographically sensitive.

The three main ideas are:

  • distribute discovery throughout the network, doing most discovery locally
  • distribute the monitoring as broadly as possible in a network-aware fashion.
  • use autoconfiguration and zero-network-footprint discovery techniques to monitor most resources automatically, both during initial installation and during ongoing system addition and maintenance.

The original monitoring scalability idea was outlined in two articles:

  1. http://techthoughts.typepad.com/managing_computers/2010/10/big-clusters-scalable-membership-proposal.html
  2. http://techthoughts.typepad.com/managing_computers/2010/11/a-proposed-network-discovery-design-for-scalable-membership-and-monitoring.html

Together, these main ideas create a system that provides a great out-of-the-box experience for new users and accommodates growth smoothly in virtually any environment.

For a human-driven overview, we recommend our videos from interviews and conference presentations.

We also have a few demos that show the ease of use and power of the Assimilation software.

Project Integrity

The project software undergoes a number of rigorous static and dynamic tests to ensure its continued integrity.

  • Highly restrictive gcc options in all compiles - no warnings allowed (-Werror)
  • Static Analysis via the Clang static analyzer - zero warnings allowed before changes are pushed to public repository
  • Static Analysis by Coverity before each release candidate (and other times)
  • Four collections of regression tests - successful run required before any changes are pushed to public repository
  • pylint Python code checker - enforces Python coding standards and performs static error checks

Progress Reports on the Project

The team currently posts updates in the following places:

External Links

Architecture

This concept has two kinds of participating entities:

  • a Collective Management Authority (CMA) - monitoring the collective and collecting discovery information
  • a potentially very large number of lightweight monitoring/discovery agents (aka nanoprobes)

Scalable Monitoring

The picture below shows the architecture for discovering system outages.

MultiRingHeartbeat.png
Multi-Ring Heartbeating Architecture

Each of the blue boxes represents a server. Each of the connecting arcs represents a bidirectional heartbeat path. When a failure occurs, the systems which observe it report directly to the central Collective Management Authority (not shown on this diagram). Several things are notable about this kind of heartbeat architecture:

  • It has no single point of failure: each system is monitored by at least two other systems.
  • It is simple to tell a switch failure from a host failure by noting which systems report the failure and which do not.
  • Each system talks to no more than four other systems - no matter how large the collection being monitored. Since the central system hears from the monitored systems only when a failure occurs, the monitoring workload does not grow as the number of monitored systems grows (see the sketch following this list).
  • Approximately 96% of all monitoring traffic stays within edge switches.
  • This architecture is naturally geographically sensitive. Very little traffic goes between sites to monitor multiple sites from a central location.
  • This architecture is simple and easy to understand.
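
To make the constant per-node workload concrete, here is a minimal Python sketch of ring-based heartbeat partner assignment. The host names and the two-level ring layout are illustrative, and the real nanoprobe logic is written in C: each node heartbeats its two neighbors on its switch's ring, and one representative per switch also joins an upper ring, so no node ever watches more than four peers.

    # Illustrative sketch of multi-ring heartbeat partner assignment.
    # Host names and the ring layout are invented for the example.

    def ring_partners(nodes):
        """Map each node to its neighbors on a closed ring."""
        count = len(nodes)
        return {node: {nodes[(i - 1) % count], nodes[(i + 1) % count]} - {node}
                for i, node in enumerate(nodes)}

    # One heartbeat ring per edge switch...
    switch_rings = {
        "switch-a": ["a1", "a2", "a3", "a4"],
        "switch-b": ["b1", "b2", "b3"],
    }
    partners = {}
    for members in switch_rings.values():
        for node, peers in ring_partners(members).items():
            partners.setdefault(node, set()).update(peers)

    # ...plus one upper ring joining a representative from each switch.
    upper_ring = [members[0] for members in switch_rings.values()]
    for node, peers in ring_partners(upper_ring).items():
        partners[node].update(peers)

    for node in sorted(partners):
        # Two ring neighbors plus at most two upper-ring neighbors: <= 4.
        print(node, "heartbeats", sorted(partners[node]))

The same bookkeeping explains the switch-versus-host distinction above: if only one node's ring neighbors report it dead, a host failed; if every node behind a switch goes silent at once, the switch did.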

This is all controlled and directed by the Collective Management Authority (CMA) - which is designed to run in an HA cluster using a product like Pacemaker. The disadvantage of this approach is that getting started after a complete data center outage/shutdown can take a while - this part is not O(1).

An alternative approach would be to make the rings self-organizing. The advantage is that startup after a full datacenter outage would happen much more quickly. The disadvantage is that this solution is much more complex and embeds knowledge of the desired topology (which is to some degree a policy issue) into the nanoprobes. It is also unlikely to work as well when CDP or LLDP is unavailable, and properly diagnosing complex faults requires knowing the order in which nodes are placed on the rings.

Autoconfiguration through Discovery

One of the key aspects of this system is that it is largely auto-configuring and incorporates discovery into its basic philosophy. A customer is expected to drop the various nanoprobes onto the machines to be monitored; once the nanoprobes are installed and activated, the systems register themselves and are automatically configured into the system.
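
As a toy illustration of the registration step, here is what a nanoprobe's startup announcement might look like; the CMA address, port, and wire format below are invented for the example:

    # Hypothetical sketch of nanoprobe self-registration: announce this
    # host to the CMA with one small UDP packet and let the CMA take it
    # from there. Address, port, and message format are invented.
    import json
    import socket

    def announce(cma_addr=("cma.example.com", 1984)):
        hello = json.dumps({
            "hostname": socket.gethostname(),
            "capabilities": ["discovery", "heartbeat"],
        }).encode()
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.sendto(hello, cma_addr)

    announce()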

What is Zero Network Footprint Discovery™?

Zero-network-footprint discovery is a process of discovering systems and services without sending active probes across the network which might trigger security alarms. Some examples of current and anticipated zero-network-footprint discovery techniques include the following (a sketch of the netstat technique appears after the list):

  • Discovery of newly installed systems by auto-registration
  • Discovery of network topology using LLDP and CDP aggregation
  • Discovery of services using netstat -utnlp
  • Discovery of services using "service" command and related techniques
  • Discovery of systems using arp -n
  • Discovery of systems using netstat -utnp
  • Discovery of service interdependencies using netstat -utnp
  • Discovery of network filesystem mount dependencies using the mount table
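
As a concrete example of the netstat technique above, the sketch below parses the local listening-socket table - no packet ever leaves the machine, so nothing can trip a security alarm. The field positions assume the usual Linux net-tools output format:

    # Minimal sketch: discover listening services by parsing local
    # "netstat -utnlp" output. Field layout assumes common Linux
    # net-tools formatting; run as root to see all programs' names.
    import subprocess

    def listening_services():
        out = subprocess.run(["netstat", "-utnlp"],
                             capture_output=True, text=True, check=True).stdout
        services = []
        for line in out.splitlines():
            fields = line.split()
            if not fields or fields[0] not in ("tcp", "tcp6", "udp", "udp6"):
                continue                   # skip the header lines
            addr, _, port = fields[3].rpartition(":")
            services.append({"proto": fields[0], "addr": addr,
                             "port": port, "program": fields[-1]})
        return services

    for svc in listening_services():
        print(svc)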

These techniques will not immediately provide a complete list of all systems in the environment. However, as nanoprobes are activated on systems discovered in this way, the process converges to include the complete set of systems and edge switches in the environment - without setting off even the most sensitive security alarms.

In addition, correlating the netstat information across servers also reveals dependencies and service groups.
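
To illustrate, here is a hypothetical sketch of that correlation step, with hand-written tuples standing in for per-host netstat -utnp output: a connection from one host to an address another host is listening on becomes a dependency edge.

    # Hypothetical dependency inference from netstat data gathered on
    # two hosts; the hosts, addresses, and ports are invented.
    observations = [
        # (host, state, local address, remote address)
        ("web1", "LISTEN",      "10.0.0.5:80",    "0.0.0.0:*"),
        ("web1", "ESTABLISHED", "10.0.0.5:41022", "10.0.0.9:5432"),
        ("db1",  "LISTEN",      "10.0.0.9:5432",  "0.0.0.0:*"),
    ]

    listeners = {local: host
                 for host, state, local, _ in observations if state == "LISTEN"}
    for host, state, _, remote in observations:
        if state == "ESTABLISHED" and remote in listeners:
            print(f"{host} depends on {listeners[remote]} ({remote})")
    # -> web1 depends on db1 (10.0.0.9:5432)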

Furthermore, the nanoprobes use these zero-network-footprint methods both to discover systems that are not yet being monitored and to discover the services on the systems that are - all without sending a single discovery packet over the network.

This discovery process is intended to achieve these goals:

  • Simplify initial installation
  • Provide a continuous audit of the monitoring configuration
  • Create a rich collection of information about the data center

DiscoveryMethods.png
Zero-Network-Footprint Discovery Process

Lightweight monitoring agents

The nanoprobe code is written largely in C and minimizes use of:

  • CPU
  • memory
  • disk
  • network resources

To do this, we follow a management-by-exception philosophy: when nothing is wrong, nothing is reported. Although the central part of the code will likely be available only on POSIX systems, the nanoprobes will also be available on various flavors of Windows.
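
A toy sketch of the idea: status is checked locally on every cycle, but a report goes upstream only when the status changes. The check_service() and report() helpers are hypothetical placeholders, not the real nanoprobe interfaces.

    # Toy sketch of management by exception: monitor locally all the
    # time, report upstream only on a change.
    import time

    def check_service(name):
        return "up"                            # placeholder local check

    def report(name, status):
        print(f"reporting {name}: {status}")   # placeholder CMA report

    def monitor_loop(services, interval=10):
        last = {}
        while True:
            for name in services:
                status = check_service(name)
                if status != last.get(name):   # the "exception"
                    report(name, status)       # otherwise: send nothing
                    last[name] = status
            time.sleep(interval)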

Service Monitoring

To the degree possible, we perform exception monitoring of services on the machines that provide them - which implies zero network overhead for monitoring working services, in keeping with the management-by-exception philosophy above. Our primary tool for monitoring services is a re-implementation of the Local Resource Manager (LRM) from the Linux-HA project.
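
The LRM convention is to run OCF resource agents with the action name as the argument and parameters passed as OCF_RESKEY_* environment variables. Here is a minimal sketch of driving an agent's monitor action; treat the paths and the example agent as illustrative:

    # Minimal sketch of invoking an OCF resource agent's "monitor"
    # action, following the OCF calling convention. Exit code 0 means
    # running; 7 (OCF_NOT_RUNNING) means cleanly stopped.
    import os
    import subprocess

    def ocf_monitor(provider, agent, params=None):
        env = dict(os.environ, OCF_ROOT="/usr/lib/ocf")
        for key, value in (params or {}).items():
            env[f"OCF_RESKEY_{key}"] = value   # OCF parameter convention
        agent_path = f"/usr/lib/ocf/resource.d/{provider}/{agent}"
        return subprocess.run([agent_path, "monitor"], env=env).returncode

    # e.g. rc = ocf_monitor("heartbeat", "IPaddr2", {"ip": "10.0.0.42"})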

Testing Strategy

There are three kinds of testing I see as necessary:

  • junit/pyunit-style unit testing for the Python code
  • Testing of the C nanoprobes in situ
  • System-level (simulated) testing for the CMA

Each of these areas is discussed below.

Unit-level testing

We are currently using the Testify framework written by the folks at Yelp, and will probably try some of the alternatives as well; so far we are very pleased with the results. The nice thing about this arrangement is that much of the detailed, gnarly C code is wrapped by Python code, so when I run the Python tests of those wrappers, the C code underneath gets well tested too.
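
For flavor, here is what a small Testify test might look like, following the conventions in Testify's documentation; the ring_partners() function under test is invented for the example.

    # Sketch of a Testify-style unit test.
    from testify import TestCase, assert_equal, setup, run

    def ring_partners(nodes):
        """Neighbors of each node on a closed ring (example code under test)."""
        count = len(nodes)
        return {node: {nodes[(i - 1) % count], nodes[(i + 1) % count]} - {node}
                for i, node in enumerate(nodes)}

    class RingPartnerTest(TestCase):
        @setup
        def build_ring(self):
            self.nodes = ["a", "b", "c", "d"]

        def test_every_node_has_two_partners(self):
            for peers in ring_partners(self.nodes).values():
                assert_equal(len(peers), 2)

    if __name__ == "__main__":
        run()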

Testing of the Nanoprobes

I am not quite sure how best to accomplish this. Some of it can run on my home network, and I could also spin up some cloud VMs - that part is not settled yet. Either way, automation is a GoodThing.

Testing of the Collective Management Code

I have been thinking about this quite a bit and have what I think is a reasonable idea: write a simulator that impersonates up to hundreds of thousands of nanoprobe clients from a separate Python process - probably using the Twisted framework. It would accept and ACK requests from the CMA and randomly create failure conditions similar to those in the "real world" - except at a radically faster rate. This is a big investment, but likely worth it. It also helps to have this in mind while designing the CMA, since there are things it could do to make this job a little easier.
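
A rough sketch of what the Twisted side of such a simulator could look like - one process pretending to be many nanoprobes, ACKing whatever the CMA sends and occasionally going silent to fake a failure. The port numbers, failure rate, and message format are invented.

    # Rough sketch of a nanoprobe simulator using Twisted: ACK every
    # CMA request, and randomly go silent to simulate node failures.
    import random
    from twisted.internet import reactor
    from twisted.internet.protocol import DatagramProtocol

    class FakeNanoprobe(DatagramProtocol):
        def datagramReceived(self, datagram, addr):
            self.transport.write(b"ACK " + datagram, addr)  # ACK the CMA

        def maybe_fail(self):
            if random.random() < 0.01:
                self.transport.stopListening()  # go silent, like a dead node
            else:
                reactor.callLater(1.0, self.maybe_fail)

    for i in range(1000):                 # one UDP port per fake nanoprobe
        probe = FakeNanoprobe()
        reactor.listenUDP(20000 + i, probe)
        reactor.callLater(1.0, probe.maybe_fail)

    reactor.run()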