Can AI Accelerate Root Cause Analysis in Networks?

Mobile networks are designed and built with security in mind, and it takes considerable effort for attackers to do any damage. Security incidents can still occur, however, and when they do, rapid countermeasures are crucial. Equally important is the investigation of the incident: finding the root cause and taking appropriate action so that the real problem can be addressed quickly and prevented from happening again. These actions become even more important as new applications, use cases, and industries connect to the network, with an emphasis on requirements for high resilience, minimal downtime, and rapid recovery.

One such measure is root cause analysis. Recent high-profile network and cloud outages have led the R&D community to step up efforts to develop solutions that can restore services faster and reduce damage and loss. Today, there are also compliance regulations in place that require service providers to provide timely information about the root cause of incidents, reinforcing the importance of root cause analysis.

In the 5G era, networks rely on virtualization to improve flexibility and performance. To achieve these benefits, network function virtualization introduces several levels of abstraction. When an incident occurs, symptoms detected at one level may have their true root cause at another level. It is important to link symptoms to the true root cause efficiently and effectively, even when they belong to different levels.

What is Network Function Virtualization?

Network function virtualization is the migration of network functions from purpose-built physical network nodes to software that runs on a generic hardware computing platform. It makes it possible for Communication Service Providers (CSPs) to manage, move, and expand their network capabilities on demand using software-based virtual applications running on distributed hardware resources.

In collaboration with researchers at Concordia University, we explored possibilities for improving incident investigation in virtualized environments, specifically in 5G networks. This research resulted in a new solution that combines well-established provenance graph analysis with AI-based techniques for effective root cause analysis.

Let’s expand a bit on the thinking behind this research, and its applicability in mobile networks.

What is incident investigation using provenance, and what are the challenges?

A provenance graph is a well-known tool for capturing causal relationships between events that occur in a system. Provenance graph analysis can help identify the root cause of a security incident by tracing all events in a sequence, from the last recorded event related to the incident (that is, the symptom) all the way back to the original event that caused the incident (the root cause).
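The backward trace described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event names and the `caused_by` mapping are hypothetical, and a real graph would carry far more detail per event.

```python
from collections import deque

# Hypothetical toy graph: each event maps to the events that directly caused it.
caused_by = {
    "service_outage": ["vm_crash"],
    "vm_crash": ["port_deleted"],
    "port_deleted": ["admin_api_call"],
    "admin_api_call": [],
}

def trace_root_causes(caused_by, symptom):
    """Walk backward from the symptom and return events with no recorded cause."""
    seen = {symptom}
    queue = deque([symptom])
    roots = []
    while queue:
        event = queue.popleft()
        parents = caused_by.get(event, [])
        if not parents:
            # Nothing recorded before this event: a root-cause candidate.
            roots.append(event)
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return roots

print(trace_root_causes(caused_by, "service_outage"))  # ['admin_api_call']
```

The traversal visits every ancestor of the symptom, which is why the cost grows quickly as more events are recorded.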

However, in a virtual environment, this process can become difficult and expensive. This is because:

  • As the number of recorded events increases, the efficiency and scalability of existing provenance-based solutions may drop significantly if they are applied as is.
  • The multi-level nature of the Network Function Virtualization (NFV) environment makes provenance capture and analysis difficult and error-prone without proper models and processes.

For example, determining the causal dependencies and semantic relationships between events that occurred at different levels requires extensive knowledge and, most likely, human expertise. However, the human analyst's task can still be made faster and easier with the support of the right tools.

How to approach incident investigation in multi-level NFV

In this context, we have undertaken a research journey that has resulted in what we call ProvTalk, a provenance analysis system designed to deal with the unique multi-level nature of NFV. It builds on DominoCatcher, the prototype from our previous root cause analysis research.

Image: DominoCatcher root cause analysis research prototype

Our solution was developed in collaboration with experts at Concordia University and addresses the following:

  • It connects provenance graphs at different levels of the NFV stack by capturing dependencies across levels.
  • It helps the human analyst identify the root cause of security incidents. To this end, it uses graph pruning techniques and mining of recurring patterns (related to the system or the user) to reduce the complexity of graph analysis through aggregation, while preserving the details that are valuable for effective root cause analysis.
  • Finally, it takes a rule-based approach to automatically translate the details of a provenance graph (or a subset of it) into an incident report that human analysts can interpret.

Figure 1: Overview of the provenance analysis solution

Provenance analysis and data mining: how it works

Let’s take a closer look at the technical features behind the new provenance analysis solution.

To enable provenance analysis, we first defined a platform-independent provenance model based on the World Wide Web Consortium (W3C) standard specification PROV-DM, which makes it possible for us to organize the different levels of the NFV stack into different layers in the provenance graph. This model captures virtual resources (at different levels of abstraction) as nodes, and operations on those resources as edges connecting the nodes. To define cross-level dependencies, we used specially tagged edges to connect virtual resources at different levels.
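To make the model more concrete, here is a minimal Python sketch of a layered graph with tagged cross-level edges. It is not the ProvTalk data model: the class names, level names, and the choice to tag an edge automatically when its endpoints sit at different levels are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Resource:
    rid: str
    level: str  # illustrative level names, e.g. "service" or "virtual"

@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    operation: str
    cross_level: bool = False  # the "special tag" for cross-level dependencies

@dataclass
class LayeredGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_resource(self, rid, level):
        self.nodes[rid] = Resource(rid, level)

    def add_operation(self, src, dst, operation):
        # Assumption: an edge is tagged when it links two different levels.
        cross = self.nodes[src].level != self.nodes[dst].level
        self.edges.append(Edge(src, dst, operation, cross))

    def layer(self, level):
        """Return all resources recorded at one level of the stack."""
        return [n for n in self.nodes.values() if n.level == level]

g = LayeredGraph()
g.add_resource("vnf-firewall", "service")
g.add_resource("vm-42", "virtual")
g.add_resource("port-7", "virtual")
g.add_operation("vnf-firewall", "vm-42", "deployed_on")
g.add_operation("vm-42", "port-7", "attach_port")

cross_edges = [e for e in g.edges if e.cross_level]
print(cross_edges)
```

Keeping the level on each node is what lets later analysis steps follow a dependency from one layer into another.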

Once the model is defined and validated, we use it to automatically capture all virtual resources at different levels, along with the management operations that modify them. Event-intercepting mechanisms deployed as middleware keep track of what is happening at runtime at each level.
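As a rough illustration of event interception, the Python sketch below wraps a management operation in a decorator that records each successful call. The decorator, the `create_port` function, and the in-memory `event_log` are hypothetical stand-ins; real middleware would hook into the platform's own APIs.

```python
import functools
import time

# Hypothetical in-memory sink; a real deployment would feed the graph store.
event_log = []

def intercept(level):
    """Record every successful call to the wrapped management operation."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # Reached only if the operation did not raise an exception.
            event_log.append({
                "level": level,
                "operation": func.__name__,
                "args": args,
                "kwargs": kwargs,
                "timestamp": time.time(),
            })
            return result
        return wrapper
    return decorator

@intercept(level="virtual")
def create_port(network_id, port_id):
    # Placeholder for a real management operation.
    return f"{network_id}/{port_id}"

create_port("net-1", "port-7")
print(event_log[-1]["operation"], event_log[-1]["level"])
```

The wrapped function behaves exactly as before, so the capture stays transparent to whoever calls it.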

But what should be done when an incident occurs, e.g. a security breach of virtual resources at a particular level? First of all, as soon as an alert is received, the multi-level provenance graph must be searched for the root cause. Since human participation is central to this process, and since the provenance model generally includes too much low-level information to be processed manually, we have developed a set of tools that simplify the provenance information and make it easier for human analysts to interpret and understand what happened. All these tools run automatically, and the resulting information is then provided to the analyst, who can adjust and analyze it by applying their own expertise.

These tools are applied in a three-step process:

Step 1: Multi-level pruning

The first tool is multi-level pruning, which uses meta-information from the incident alert to filter irrelevant information out of the provenance graph by following cross-level dependencies. This means that human analysts can identify and set aside potentially irrelevant parts of the graph at different levels more efficiently, through means that do not exist today. This tool helps narrow down the search space for root causes considerably.
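One simple way to picture pruning is as backward reachability from the resource named in the alert. The Python sketch below is an assumption-laden simplification, not the pruning algorithm from the research: the edge list and the `L1:`/`L2:` level prefixes are hypothetical, and the only alert metadata used is a resource identifier.

```python
def prune(edges, alert_resource):
    """Keep only edges on paths that lead to the alerted resource."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)

    # Collect every node from which the alerted resource can be reached.
    relevant = {alert_resource}
    stack = [alert_resource]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in relevant:
                relevant.add(parent)
                stack.append(parent)

    return [(s, d) for s, d in edges if s in relevant and d in relevant]

# Hypothetical edges; the prefix marks the level of each resource.
edges = [
    ("L1:tenant-a", "L2:vm-42"),   # cross-level dependency
    ("L2:vm-42", "L2:port-7"),
    ("L1:tenant-b", "L2:vm-99"),   # unrelated to the alert
    ("L2:vm-99", "L2:port-8"),
]

pruned = prune(edges, "L2:port-7")
print(pruned)
```

Because cross-level edges are followed like any other edge, parts of other levels that cannot influence the alerted resource drop out as well.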

Step 2: Mining-Based Aggregation

The second tool is mining-based aggregation, which makes it possible to aggregate parts of a graph in a reversible way, reducing redundancy in the graph and adding high-level semantics to low-level operations. This can make the provenance graph easier to understand. More specifically, the aggregation targets the most common sequences of lower-level operations that automatically run after a higher-level operation in the NFV stack. It also targets routine administrative operations (e.g., maintenance tasks) that appear regularly in the provenance graph. Mining-based aggregation gives human analysts only the need-to-know information about what is happening at the low level, enabling them to focus on the main task, which is discovering the root cause.
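The idea of reversible aggregation can be illustrated with a deliberately simple stand-in: find the most frequent run of consecutive operations and collapse each occurrence into one labeled step. This Python sketch does not reproduce the mining technique used in the research; the operation names and the `boot_instance` label are hypothetical.

```python
from collections import Counter

def most_common_ngram(ops, n):
    """Return the most frequent run of n consecutive operations."""
    grams = Counter(tuple(ops[i:i + n]) for i in range(len(ops) - n + 1))
    return grams.most_common(1)[0][0]

def aggregate(ops, pattern, label):
    """Replace each occurrence of the pattern with a single labeled step."""
    out, i, n = [], 0, len(pattern)
    while i < len(ops):
        if tuple(ops[i:i + n]) == pattern:
            out.append(label)
            i += n
        else:
            out.append(ops[i])
            i += 1
    return out

def expand(ops, pattern, label):
    """Undo aggregate(): restore the original low-level operations."""
    out = []
    for op in ops:
        if op == label:
            out.extend(pattern)
        else:
            out.append(op)
    return out

# Hypothetical low-level operation log.
ops = ["create_vm", "alloc_ip", "bind_port", "delete_rule",
       "create_vm", "alloc_ip", "bind_port"]

pattern = most_common_ngram(ops, 3)
summary = aggregate(ops, pattern, "boot_instance")
print(summary)
```

Keeping the pattern alongside its label is what makes the step reversible: `expand(summary, pattern, "boot_instance")` returns the original list, so an analyst can drill back down when needed.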

Step 3: Translate the graph into human-readable text

Finally, once certain paths have been identified in the provenance graph, the third tool can translate these fragments into text that human analysts can easily read, providing additional guidance in the investigation process. This feature can also be used to generate a report describing the outcome of the analysts' investigation. The report, written in natural language (English, in our case), explains what happened and how the symptoms of the incident relate to the root causes. To do this, it describes the virtual resources and suspicious operations that were involved, the timing of what happened, and the parties that carried out those operations.
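A rule-based translation of this kind can be sketched as a lookup from operation type to sentence template. The templates, field names, and sample path below are hypothetical and only illustrate the general approach in Python, not the rules used in ProvTalk.

```python
# Hypothetical rules: one sentence template per known operation type.
TEMPLATES = {
    "create_port": "{agent} created port {target} at {time}.",
    "delete_rule": "{agent} deleted security rule {target} at {time}.",
}
# Fallback rule for operations without a dedicated template.
DEFAULT = "{agent} performed {operation} on {target} at {time}."

def describe(path):
    """Turn a list of graph edges into a short English narrative."""
    sentences = []
    for edge in path:
        template = TEMPLATES.get(edge["operation"], DEFAULT)
        sentences.append(template.format(**edge))
    return " ".join(sentences)

path = [
    {"agent": "admin-3", "operation": "delete_rule",
     "target": "sg-rule-12", "time": "10:02"},
    {"agent": "tenant-b", "operation": "attach_volume",
     "target": "vm-42", "time": "10:05"},
]

report = describe(path)
print(report)
```

Each sentence covers the same elements the report is meant to convey: who acted, what they did, which resource was affected, and when.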

Towards effective incident handling in 5G and future 6G

The main benefit of this research work is the creation of a concise and interpretable provenance graph from data comprising a large number of events that occurred across several levels of NFV. This facilitates the human analyst's task of finding the root cause of a security incident, and it significantly reduces provenance graph sizes without losing vital investigation information, while also reducing latency and computational costs. For cloud computing service providers, this offers clear and substantial benefits: lower costs related to incident investigation and significantly improved incident response times.

We expect that our solution can also be readily adapted as a basis for efficient incident analysis in 6G. Many aspects of the future 6G will evolve from 5G, with native virtualization and cloud technologies as key factors, and we have specifically designed ProvTalk to address the specific nature and complexity of virtualized environments.

Further reading

Read the research paper behind the work, published at the Network and Distributed System Security (NDSS) Symposium.

Learn more about other Ericsson cybersecurity initiatives developed in collaboration with Concordia University.

Learn more about Ericsson’s vision for future network security.

Learn more about Network Function Virtualization (NFV) and its role in improving the reliability of the 5G network.

This work was carried out as part of the Industrial Research Chair between Ericsson and Concordia University, with funding from the Natural Sciences and Engineering Research Council of Canada (NSERC). Read more about it here.
