Service assurance is not well defined within the industry. So what is it that we are trying to achieve with service assurance? Simply put:
- I want to know that my services are working, and if they’re not, I want to be proactively informed (service health)
- If they aren’t working, I want to know what the underlying cause is (root cause analysis)
- I want to know which services are impacted so I can prioritise the fix (service impact analysis)
- Then I want as much relevant information as possible to help fix the problem as fast as possible (enrichment and knowledge base)
- I want to track and capture the resolution of the problem (incident and problem management)
- I want to try and ensure that the problem doesn’t happen again; if it does, I want to know it’s a repeat problem, and I want to know what I did last time to fix it (incident profiling)
- Lastly, and maybe I’m dreaming now, I’d really like to predict when a service-impacting problem is going to happen so that I can fix it before it does (predictive analytics)
Overall I’d like to maximise the availability of the service I’m delivering (within the bounds of the quality of experience my customers require), whilst minimising my operational costs.
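To make the first few goals concrete, here is a minimal sketch of how root cause analysis and service impact analysis might operate over a simple dependency model. Everything in it (the services, nodes, topology and suppression rule) is invented for illustration, not taken from any particular product:

```python
# Hypothetical sketch of root cause and service impact analysis over a
# simple dependency graph. All names and the topology are invented.

SERVICE_DEPENDS_ON = {
    "voice": ["core-router-1", "msc-1"],
    "data":  ["core-router-1", "pgw-1"],
    "sms":   ["msc-1", "smsc-1"],
}

def impacted_services(failed_resource):
    """Service impact analysis: which services rely on the failed resource?"""
    return sorted(s for s, deps in SERVICE_DEPENDS_ON.items()
                  if failed_resource in deps)

def root_cause(alarms, topology):
    """Naive root cause isolation: if a node and its upstream parent are
    both alarming, suppress the child's alarm as a symptom of the parent."""
    return sorted(a for a in alarms if topology.get(a) not in alarms)

# "core-router-1" feeds both downstream nodes, so their alarms are symptoms.
topology = {"pgw-1": "core-router-1", "msc-1": "core-router-1"}
alarms = {"core-router-1", "pgw-1", "msc-1"}
print(root_cause(alarms, topology))        # ['core-router-1']
print(impacted_services("core-router-1"))  # ['data', 'voice']
```

Real systems obviously work over far richer topologies and alarm streams, but the principle is the same: collapse many symptomatic alarms to one causal alarm, then walk the dependency model upwards to rank the services affected.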
Simple? Not today:
- We have complex multi-layer, multi-domain networks
- They are producing a huge amount of data: event data, metric data, flow data, … which includes symptomatic data, causal data and noise
- Most deployed systems use prescribed rules (rule-based reasoning, or RBR, e.g. Netcool, circa 1996), topology models (model-based reasoning, or MBR, e.g. Riversoft, circa 1999), and static metric thresholds, which do a poor job of isolating causes from symptoms, and symptoms from noise
- We silo the data by type and by domain, making cross-domain assurance almost impossible
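The weakness of static metric thresholds in particular can be seen in a small sketch. The code below is purely illustrative (the metric, limit, window size and deviation factor are all invented assumptions): a fixed limit fires throughout a routine busy-hour climb, whereas even a simple rolling statistical baseline flags only the genuinely anomalous sample.

```python
# Illustrative sketch (not from any deployed product): why fixed metric
# thresholds flood operators with symptomatic noise, while a simple
# statistical baseline isolates genuine anomalies. Numbers are invented.

from statistics import mean, stdev

def static_threshold_alerts(samples, limit=60.0):
    """Classic static rule: alert whenever the metric crosses a fixed limit."""
    return [i for i, v in enumerate(samples) if v > limit]

def adaptive_alerts(samples, window=10, k=3.0):
    """Alert only when a sample deviates more than k standard deviations
    from a rolling baseline, so a routine busy-hour climb is not flagged."""
    alerts = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(samples[i] - mu) > k * sigma:
            alerts.append(i)
    return alerts

# Hourly link utilisation (%): a gradual busy-hour climb is normal, but
# the sample of 99 at index 14 is a genuine anomaly.
util = [40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 99, 68, 70, 72]
print(static_threshold_alerts(util))  # [11, 12, 13, 14, 15, 16, 17] - mostly noise
print(adaptive_alerts(util))          # [14] - only the real anomaly
```

The static rule generates seven alerts, six of which are false positives; the adaptive rule generates one. Scale that difference up to millions of metrics per hour and the operational cost of prescribed rules becomes clear.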
As a result, many service assurance implementations are ineffective today, with the following consequences:
- Customers report faults before service assurance systems identify or prioritise them
- False negatives – faults are missed entirely
- False positives – time is wasted chasing non-issues
- Duplication – two people investigate the same underlying fault
- Finger pointing – time is wasted assigning blame between teams
- Causes are left undiagnosed
- Repeat faults – same fault recurs; lessons not learnt
The net effect is that the overall service availability is lower than it should be. To add some numbers to this, consider the following statistics from a mobile network operator:
If these statistics are to be believed, the service assurance systems were proactively detecting <1% of service impacting issues.
If this represents the state of many network operations today, are we now at an inflexion point where we can apply analytics to service assurance to provide a more effective answer?
We try to answer this question in: An analytics-based approach to service assurance: Part 2 – Is analytics the answer?