In An analytics-based approach to service assurance: Part 1 – What’s the problem?, we used too many bulleted lists and painted a fairly bleak picture of the current state of service assurance in the industry. In this post, we try to answer the question of whether an analytics-based approach can do a better job.
So where do we start? By performance-managing the world and running Lucene queries? Useful tools they may be, but a panacea they are not.
We start with an event analytics pipeline. Why start with events? Because events indicate a state or a change of state, and other data, like metrics, is ultimately processed to generate events. So our event management pipeline is really our service assurance baseline.
We aggregate all events across all domains and all layers so that we have complete visibility; this includes infrastructure, service and customer data. The downside is that this is a lot of data (big data), and the first thing we need to do is filter out noise. We can filter noise from an event stream in the same way we can filter noise from a radio signal; this is our first stage of event analytics.
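As a sketch of what such a filter stage might look like (the event fields, severity scale and window size here are illustrative assumptions, not our actual implementation):

```python
from collections import OrderedDict

# Hypothetical filter stage: drops low-severity noise and suppresses
# duplicates of (source, type) pairs seen within a recent window.
def filter_noise(events, min_severity=3, window=1000):
    seen = OrderedDict()  # recently seen keys, insertion-ordered
    for ev in events:
        if ev["severity"] < min_severity:
            continue  # informational noise
        key = (ev["source"], ev["type"])
        if key in seen:
            continue  # duplicate within the window
        seen[key] = True
        if len(seen) > window:
            seen.popitem(last=False)  # evict the oldest key
        yield ev

raw = [
    {"source": "r1", "type": "link_down", "severity": 5},
    {"source": "r1", "type": "link_down", "severity": 5},  # duplicate
    {"source": "r2", "type": "heartbeat", "severity": 1},  # noise
]
print(len(list(filter_noise(raw))))  # 1
```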
Next, we try to cluster the filtered event stream, grouping together events which may be associated. This can be based upon time, event content or other context.
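A minimal illustration of the time-based variant, assuming each event carries a timestamp and that a quiet gap separates incidents (the gap value here is an arbitrary choice for the example):

```python
# Hypothetical time-based clustering: events closer together than
# `gap` seconds land in the same candidate group (incident).
def cluster_by_time(events, gap=60):
    groups, current = [], []
    for ev in sorted(events, key=lambda e: e["ts"]):
        if current and ev["ts"] - current[-1]["ts"] > gap:
            groups.append(current)  # quiet gap: close the group
            current = []
        current.append(ev)
    if current:
        groups.append(current)
    return groups

events = [{"ts": 0}, {"ts": 10}, {"ts": 300}, {"ts": 320}]
print([len(g) for g in cluster_by_time(events)])  # [2, 2]
```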
Once we know that we have a group of associated events, which represent an incident, we can apply a third level of analytics to identify causality within the group.
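For illustration, a deliberately simplistic causality heuristic, assuming each event is tagged with the layer it came from; real causal analysis would also draw on topology and event content:

```python
# Toy heuristic: within a group, prefer the earliest event at the
# lowest layer (infrastructure before service before customer) as
# the probable cause. Layer names are assumptions for the example.
LAYER_RANK = {"infrastructure": 0, "service": 1, "customer": 2}

def probable_cause(group):
    return min(group, key=lambda e: (LAYER_RANK[e["layer"]], e["ts"]))

group = [
    {"ts": 5, "layer": "customer",       "type": "slow_page"},
    {"ts": 2, "layer": "service",        "type": "api_errors"},
    {"ts": 1, "layer": "infrastructure", "type": "link_down"},
]
print(probable_cause(group)["type"])  # link_down
```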
These three stages form the basis of our event analytics pipeline:
The output of this pipeline needs to be coupled to an Incident and Problem Management system, which is the interface that operators use to track the resolution of the problem and capture the output.
In order to prioritise the issues that get fixed, we need to be able to map these incidents to the services they impact. We can do this by mapping the service inventories from the orchestration and control stack to the events in each incident. We call the aggregation of service inventory data the Real-time Inventory, although it need not be a discrete function.
This also provides an input to service status and can be used to enrich the incident, providing valuable context for the operators that are trying to fix the underlying problem. This is an example of the coupling between orchestration and analytics; see also: Big Data Analytics and the bifurcation of OSS: what about the F_APS? and ETSI NFV and Big Data Analytics with PNDA.
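To illustrate the idea of service-impact mapping, a toy lookup against an aggregated inventory (the resource and service names are invented for the example):

```python
# Hypothetical real-time inventory lookup: map resources named in an
# incident's events to the services that depend on them.
inventory = {  # resource -> services, aggregated from orchestration
    "vrouter-7": ["vpn-acme", "cdn-edge"],
    "host-12":   ["vpn-acme"],
}

def impacted_services(incident_events):
    services = set()
    for ev in incident_events:
        services.update(inventory.get(ev["source"], []))
    return sorted(services)

incident = [{"source": "vrouter-7"}, {"source": "host-12"}]
print(impacted_services(incident))  # ['cdn-edge', 'vpn-acme']
```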
Now we create a pipeline for processing time series or metric data; we use anomaly detection, rather than prescribed thresholds, to dynamically pick out significant deviations from the norm:
The anomaly detector outputs events which feed into the event analytics pipeline. The event processing pipeline provides a baseline against which we can measure the effectiveness of the anomaly detection. For example, if an anomaly is the first alert to indicate an incident/problem, its value is measurable by how much it reduces mean time to detect (MTTD); if an anomaly indicates causality, its value is measurable by how much it reduces mean time to repair (MTTR).
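One common way to implement this kind of dynamic thresholding is a rolling z-score detector, sketched below; the window size and sensitivity `k` are illustrative assumptions, not the algorithm we actually used:

```python
from collections import deque
import math

# Flag a sample as anomalous when it deviates from a rolling baseline
# by more than `k` standard deviations, instead of using a prescribed
# static threshold.
def detect_anomalies(samples, window=20, k=3.0):
    hist = deque(maxlen=window)
    events = []
    for i, x in enumerate(samples):
        if len(hist) == hist.maxlen:
            mean = sum(hist) / len(hist)
            var = sum((v - mean) ** 2 for v in hist) / len(hist)
            std = math.sqrt(var)
            if std > 0 and abs(x - mean) > k * std:
                events.append({"index": i, "value": x})
        hist.append(x)
    return events

# A mostly flat metric with small jitter, then a genuine spike.
metric = [10.0] * 30 + [10.2] + [10.0] * 10
metric.append(50.0)
print(detect_anomalies(metric))  # one event, at the spike (index 41)
```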
We built this system on PNDA, integrating a mixture of custom applications, open source applications and licensed software from ISVs:
Can’t resist a bulleted list:
- We used Moogsoft AIOps for the event analytics pipeline; we wrote a PNDA consumer for Moog to take an event feed from PNDA
- We used Ontology.com for the real-time inventory capability; their semantic web approach proved effective at presenting an aggregated view over disparate inventory distributed through the orchestration and control stack
- We wrote a custom PNDA application for metric anomaly detection; more on metric anomaly detection in a future post.
- We added good old Elasticsearch as another consumer on PNDA, as we generally find the need for Lucene queries for more detailed log searches and investigations.
So did it achieve what we set out to do: provide a more effective approach to service assurance? To answer this question, we looked at a number of factors:
- Are events correctly grouped into incidents/situations?
- Are all issues that we know about identified?
- Are they identified earlier than they would be otherwise?
- Are tenants & services impacted by incidents correctly identified?
- Are incidents identified that wouldn’t be otherwise, i.e. an increased number of detected incidents?
- Are recurring incidents identified?
The pudding was the proof:
- 99.9999% reduction from events/log messages to situations:
  - ~400,000,000 events/log messages per day
  - reduced to ~250 situations per day
- All known issues identified
- Reduction in mean time to detect (MTTD)
- Service impact correctly flagged
- Automatically identified recurring situations
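The headline reduction figure can be sanity-checked with quick arithmetic:

```python
# Reduction from raw events/log messages to situations per day.
events_per_day = 400_000_000
situations_per_day = 250
reduction = 1 - situations_per_day / events_per_day
print(f"{reduction:.4%}")  # 99.9999%
```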
All of this was achieved across domains, without any prescribed rules, models, or thresholds. There was a lot of groundwork involved in getting to that point (2+ years); however, this is now all reproducible with much less effort.