Defining Infrastructure Monitoring for Alerts, Prevention and Change
Recently I was asked to provide some feedback on a Network Monitoring project scope for a multi-national organisation. On review I felt there was a need to define the different components of network monitoring (or, as I refer to it, Infrastructure Monitoring) to establish requirements and measure the success of the project.
Keeping it simple, I split Infrastructure Monitoring into five components:
1) Alert
A notification process that informs key personnel when a service (or a component of it) is either not responding or not performing as expected.
2) Preventative
A notification process that pro-actively advises key personnel of approaching conditions that may impact the availability or performance of a service (or a component of it).
3) Historical
The retention of data in such a way as to facilitate capacity management through trend analysis and forecasting.
4) Performance
Establishing standardized tests that measure execution times, producing benchmarks against which performance drift can be monitored.
5) Change
Using the Performance Benchmarking to analyze the impact and success of change.
These definitions can be split into three simple maturity groups:

Re-Active: You at least know when something’s wrong
Alert Monitoring is where most environments start off, and it is traditionally the easiest to establish. It is really the bare minimum for any organisation, particularly given the number of low-cost or open source solutions available.
Active: You’re trying to stop things before they go wrong
Preventative Monitoring is really Alert Monitoring taken one step further: instead of alerting that a drive is full, send an alert when it hits 85% capacity. Establishing a preventative alert system requires analysis of your environment’s capabilities, but it is critical in providing consistent service to your clients.
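As a rough illustration of the difference between the two, here is a minimal Python sketch that classifies a volume as "ok", "preventative" or "alert". The 85% figure is the one from the example above. The function names, and the choice to treat 100% as "full", are assumptions made for illustration only; a real environment would set its own thresholds after analysing its capabilities.

```python
import shutil

# 85% comes from the example in the text; treating 100% as "full" is an
# assumption for illustration, not a recommended setting.
WARN_PERCENT = 85.0
FULL_PERCENT = 100.0


def classify_usage(used_bytes: int, total_bytes: int) -> str:
    """Return 'alert', 'preventative' or 'ok' for a volume (hypothetical helper)."""
    percent = 100.0 * used_bytes / total_bytes
    if percent >= FULL_PERCENT:
        return "alert"          # Re-Active: the drive is already full
    if percent >= WARN_PERCENT:
        return "preventative"   # Active: warn before it fills
    return "ok"


def check_volume(path: str = "/") -> str:
    """Classify the volume that contains `path` using the standard library."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return classify_usage(usage.used, usage.total)
```

Calling check_volume() on a schedule and sending the result to whoever is on call would be one simple way to turn this into a notification process.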
The value of historical retention cannot be overstated. Understanding where you are today means knowing where you were, and without that you can forget about predicting tomorrow.
Graphs of storage, performance and resource consumption provide a mechanism for recognising trends and can be understood by anyone at any level of the business, from Board to CEO to client.
Some time ago I was challenged by my company about our rising internet costs. Using three years of history I was able to show that our costs reflected our growth in staff over those years, the growth of external services between our clients and vendors, and the increased mobility of our staff.
We were then able to forecast the trend for the next few years and revisit our setup, procedures and vendors, which ultimately produced a solution with more capacity at lower cost.
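For anyone wondering what "forecasting the trend" can look like at its simplest, the sketch below fits a straight line to evenly spaced historical samples and extends it forward. This is an assumed technique (ordinary least squares), not necessarily what we used at the time, and the function names are hypothetical.

```python
def fit_line(values):
    """Least-squares fit of y = slope * x + intercept, with x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast(values, periods_ahead):
    """Extend the fitted line `periods_ahead` samples past the last one."""
    slope, intercept = fit_line(values)
    last_x = len(values) - 1
    return slope * (last_x + periods_ahead) + intercept
```

Fed with, say, one figure per month of retained history, forecast(history, 12) gives a straight-line estimate for a year out. A straight line is only a starting point; growth that compounds will outrun it.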
Pro-Active: You recognise the root cause of most failures is change
In my experience, if you exclude flat-out failures, the majority of issues that arise can be traced back to Change.
A mature monitoring environment looks to measure the success of a change not only by whether it was implemented successfully, but by whether its impact on clients was as anticipated.
Unfortunately in complex environments it can get tricky, as the change may have improved service A but impacted service D, four “steps” away.
To measure this effectively you need a performance profile of your solutions and services, which can be quite complex to establish and maintain.
However, the reward is that you can (a) confirm that the processes or plans you are putting into effect are having the desired (hopefully positive!) effect, and (b) detect any capacity requirement that historical data may not show easily.
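A very small sketch of the idea, in Python: time a standardized operation to get a benchmark, then compare the figure taken after a change against the one taken before it. The names benchmark and change_impact, and the 10% tolerance, are assumptions for illustration; a real performance profile would cover many operations across many services.

```python
import statistics
import time


def benchmark(operation, runs=5):
    """Median wall-clock time in seconds over `runs` executions of `operation`."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def change_impact(baseline_seconds, current_seconds, tolerance=0.10):
    """Classify a change by its relative shift from the baseline.

    The 10% tolerance is an arbitrary illustrative value.
    """
    shift = (current_seconds - baseline_seconds) / baseline_seconds
    if shift > tolerance:
        return "regressed"
    if shift < -tolerance:
        return "improved"
    return "unchanged"
```

Running the same benchmark against service D as well as service A, before and after the change, is what lets you catch the impact four "steps" away.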
I am working on one of those “difficult” ones right now: a performance drift which has seen an operation’s execution time increase from an average of 1.5 seconds to 3.5 seconds over the last 7 months.
It’s too subtle to appear on a direct graph and our users don’t even notice it. But this is a heavily used resource, and if it is not managed, in a year’s time it will blow out to 8 or 9 seconds, and that will be noticed.
Our suspicion is Change: either the addition of a number of new solutions using shared infrastructure, or simple growth, with the resource being used more by more people.
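To show how sensitive a projection like this is to the shape of the growth, the sketch below extends the drift described above under two assumed models: a constant increase per month, and a constant percentage increase per month. Both models and both function names are assumptions for illustration; they are not how the 8 to 9 second estimate was derived.

```python
def project_linear(start, end, months_observed, months_ahead):
    """Project forward assuming the same absolute increase every month."""
    per_month = (end - start) / months_observed
    return end + per_month * months_ahead


def project_compound(start, end, months_observed, months_ahead):
    """Project forward assuming the same percentage increase every month."""
    monthly_factor = (end / start) ** (1.0 / months_observed)
    return end * monthly_factor ** months_ahead


# Figures from the text: 1.5 s rising to 3.5 s over 7 months, projected 12 months on.
linear_estimate = project_linear(1.5, 3.5, 7, 12)
compound_estimate = project_compound(1.5, 3.5, 7, 12)
print(f"linear: {linear_estimate:.1f} s, compound: {compound_estimate:.1f} s")
```

The straight-line model gives a little under 7 seconds and the compounding model roughly 15 seconds, so the 8 to 9 second estimate sits between the two. Which curve the drift follows is exactly what the benchmarks should tell us over the coming months.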
Hopefully someone looking to establish their own Infrastructure Monitoring will come across this post and find that it helps them define their own requirements.
