Why monitoring sucks — for now

Today’s operations engineers are faced with choosing between two imperfect routes for infrastructure monitoring. On one hand, there’s no shortage of complicated, inflexible and expensive enterprise tools shrouded in the glory of vendor lock-in. On the other, we have a veritable zoo of open source tools — many of which are great at addressing specific pain points, but are small pieces of a larger puzzle.

The shortcomings of the monitoring space have spawned a loosely organized grassroots movement in the devops community to address the challenge, and have led to numerous blog posts, an IRC channel and a collection of GitHub repos.

Although there is a litany of complaints, I would submit that this pain is rooted in a way of monitoring that ignores the realities of growing and scaling businesses along with the demands of fast-paced infrastructure teams. Perhaps it’s time for us to think of monitoring in a new way.

A new (old) model
Drawing on his experience as a fighter pilot in the Korean War, United States Air Force Colonel John Boyd formulated the “OODA loop.” OODA stands for observe, orient, decide and act. Boyd theorized that the faster a team could observe what’s happening, orient itself to the situation, decide how to respond and act, the quicker and more effective its response would be. Teams that iterate through the loop faster than their opponents gain a competitive advantage. I’d suggest that a well-designed monitoring tool can help automate the OODA loop for operations teams. Below are the essential components of monitoring infrastructure for fast-paced teams.

1. Deep integration
Most open source monitoring tools only tackle one aspect or a subset of the OODA loop. For instance: Graphite and Cacti provide trending (orientation), Nagios provides alerting (decision and action) and StatsD and collectd gather metrics (observation). But integrating these projects is a daunting task, and the result is often a Frankenstein’s monster of Perl scripts and PHP dashboards. While each of these tools is helpful, each paints only part of the picture. An ideal tool would integrate all four steps of the OODA loop into one harmonious system. Where necessary, one would also expect API endpoints that allow custom behavior and the flexibility to further automate a team’s actions.

2. Contextual alerting and pattern recognition
Most monitoring tools require the user to predefine all of the conditions on which to alert. For instance, one would set static thresholds that say, “Notify me when disk usage goes above 90 percent,” or, “Notify me when CPU usage goes above 75 percent.” However, static thresholds are a poor substitute for pattern recognition, the basis of cognitive decision-making. Setting static thresholds for applications whose load varies throughout the day, week or month is hell: a value that is perfectly normal at peak hours can signal serious trouble at 3 a.m. At any given point, monitoring infrastructure should be able to reflect upon its current state, its past state and its forecast state and ask, “Do current trends deviate enough to warrant action?” If so, it should immediately notify the team with context. What if ops teams could look at a graph and say to the system, “Alert us when something looks (or doesn’t look) like this?”
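
To make the contrast with static thresholds concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular tool works: the function name is_anomalous, the sample CPU figures and the cutoff of three standard deviations are all hypothetical choices made for the example. The idea is simply to compare the current reading with what the same time slot looked like in the past, rather than with a fixed number.

```python
from statistics import mean, stdev


def is_anomalous(history, current, z_threshold=3.0):
    """Return True when `current` lies more than `z_threshold` standard
    deviations from the mean of `history` (a simple z-score check)."""
    if len(history) < 2:
        return False  # not enough past data to judge
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) / spread > z_threshold


def static_alert(current, limit=75):
    """The static rule from the text: alert when CPU usage exceeds 75 percent."""
    return current > limit


# Hypothetical CPU usage (percent) seen in the same hour on previous days.
busy_hour_history = [78, 82, 80, 79, 81, 83, 80]   # peak traffic, normal
quiet_hour_history = [10, 12, 11, 9, 10, 11, 12]   # overnight, normal

# 81 percent at peak: the static rule fires, the baseline check does not.
print(static_alert(81), is_anomalous(busy_hour_history, 81))

# 40 percent overnight: the static rule stays silent, the baseline check fires.
print(static_alert(40), is_anomalous(quiet_hour_history, 40))
```

A real system would need something sturdier than a z-score over seven samples, but even this toy version shows the two failure modes of a fixed threshold: a false alarm at peak and a missed anomaly overnight.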

3. Timeliness
The term “real time” has been watered down, but it carries a specific meaning. Real-time computing concepts in monitoring systems relate to an intrinsic property of events: they happen on a timeline. Monitoring systems must be real time, because the timeliness of the data impacts its correctness and utility. All aspects of a monitoring system must respond immediately to events. The OODA loop is only effective when it is faster than the environment or opponent that it is running against. If you’re operating on assumptions that are a minute old, it’s hard to say much of anything about what’s happening now.

4. High resolution
The resolution of monitoring systems is critical. With most options offering updates once every one to five minutes, low-resolution monitoring obscures a world of patterns that are invisible until you’ve zoomed in. The difference between a one-second graph updated in real time and a one-minute graph updated every five minutes is the difference between a fluid HD film and a paper flip-book.
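
A small Python sketch shows how much a coarse resolution can hide. The numbers are invented for the example (a hypothetical latency series with a five-second spike), and the downsample helper is my own illustrative function, not part of any monitoring tool.

```python
def downsample(samples, window):
    """Average consecutive, non-overlapping groups of `window` samples."""
    averaged = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


# Hypothetical one-second latency samples (ms) covering two minutes:
# a steady 20 ms, with a five-second spike to 500 ms in the second minute.
per_second = [20] * 120
per_second[70:75] = [500] * 5

# The same data as a one-minute graph would see it.
per_minute = downsample(per_second, 60)

print(max(per_second))  # 500: the spike is obvious at one-second resolution
print(per_minute)       # [20.0, 60.0]: the spike is flattened into a mild bump
```

At one-second resolution the spike is unmistakable. Averaged over a minute, it becomes a 60 ms reading that few people would look at twice.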

5. Dynamic configuration
The fluidity of modern architectures demands monitoring infrastructure that can keep up with the changes that ops teams require. The rise of virtualized infrastructure combined with dynamic configuration management systems means that there may be a great deal of host churn. This churn challenges the concepts of host identity that traditional monitoring tools have built in as fundamental abstractions.

What’s next for monitoring?
The pace of business today requires tools that help teams move rapidly through the OODA loop. Smarter software can push this process forward, offering deep integration across infrastructure, pattern recognition to quickly spot problems, real-time updates at high resolution and automatic adaptation to changing environments. With a set of tools like that, operations teams can respond to incidents, resolve them for good and drive business value while leaving competitors’ heads spinning.

Cliff Moon is the founder and CTO of Boundary, a provider of real-time network and application monitoring-as-a-service. The views expressed here are personal and do not necessarily reflect those of any company with which he is or has been affiliated.

Special thanks and acknowledgement go out to Coda Hale for his views on monitoring and metrics (read Metrics, Metrics Everywhere).

Image courtesy of Flickr user purpleslog.