Meta Analysis Introduction
The SIOS iQ Meta Analysis feature, which appears under Performance Root Cause Analysis, adds Deep Learning to strengthen iQ’s overall Performance Root Cause Analysis. Deep Learning is a Machine Learning approach that helped AlphaGo master the game of Go. In SIOS iQ, Deep Learning helps identify the root causes of performance problems across very large datasets of events (behaviors, topologies, anomalies and patterns over time) in highly dynamic virtualization and cloud environments. Meta Analysis reduces problem identification to a small number of recurring anomalous behavior patterns and their root cause(s). IT admins can now manage even the largest and “noisiest” environments and gain insights quickly, eliminating hours or days of trial-and-error guesswork when trying to understand and mitigate a problem affecting infrastructure operations and application service delivery.
Problem vs Issue
- An Issue is an incident identified by the Performance Root Cause Analysis feature, powered by patented Topological Behavior Analysis (TBA), that takes place at a particular time in the environment.
- A Problem is a holistic view of the performance issues (incidents) over time that better reveals their root cause and recommendations to address them.
Following these definitions, Performance Root Cause Analysis identifies each Performance issue (incident), while Performance Meta Analysis identifies the related Performance Problems, analyzes their root cause, and provides resolution(s) for them.
How does Performance Meta Analysis work?
As issues (incidents) are identified in the environment, they are gathered and analyzed by the Meta Analysis feature across behaviors, topologies, anomalies, and patterns over time. As a result, the root cause and recommendations provided are no longer based on individual issues (incidents) but on an overall problem that repeats itself in the environment across the topologies of the objects. Currently, Meta Analysis is performed based on the centers of contention (i.e., hosts and datastores).
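The idea of rolling individual issues up into recurring problems can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not how SIOS iQ implements Meta Analysis: the `Issue` class, its field names, and the `group_into_problems` function are hypothetical, and the grouping simply keys on the center of contention (host or datastore) rather than applying any learning.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """One incident; all field names here are hypothetical."""
    issue_id: str
    center_of_contention: str  # host or datastore identified for the incident
    severity: str              # e.g. "critical", "warning", "info"


def group_into_problems(issues):
    """Group incidents by center of contention, most recurrent first."""
    grouped = defaultdict(list)
    for issue in issues:
        grouped[issue.center_of_contention].append(issue)

    problems = []
    for obj, members in grouped.items():
        problems.append({
            "center_of_contention": obj,
            "issue_ids": [m.issue_id for m in members],
            # Severity distribution of the grouped incidents.
            "severity_counts": dict(Counter(m.severity for m in members)),
        })
    # Problems that recur most often are listed first.
    problems.sort(key=lambda p: len(p["issue_ids"]), reverse=True)
    return problems


issues = [
    Issue("i-1", "datastore-A", "critical"),
    Issue("i-2", "host-1", "warning"),
    Issue("i-3", "datastore-A", "warning"),
    Issue("i-4", "datastore-A", "critical"),
]
problems = group_into_problems(issues)
for problem in problems:
    print(problem)
```

In this toy data, three of the four incidents collapse into a single problem centered on `datastore-A`, which mirrors the reduction from many issues to a few problems described above.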
How to use Performance Meta Analysis
There are two simple workflows for the Performance Root Cause Analysis enabled by Meta Analysis.
PERC Topology Dashboard and PERC Dashboard
All performance-related problems identified by Meta Analysis surface through the PERC Topology and PERC dashboards. When the user selects one of the severities, the user is guided to a list of corresponding problems on the Performance Root Cause Analysis dashboard (discussed later in more detail), as illustrated in Figure 1.
Performance Root Cause Dashboard
Another convenient way to access performance related problems identified by SIOS iQ is through the Performance Root Cause dashboard directly.
Once there, the user has access to the performance problems that match the current filters (default: 24 hours, In-Progress); see Figure 2.
The left side of the Dashboard presents a list of the performance problems discovered by the Meta Analysis feature for the selected time frame and the filter specifications selected via the Show/Hide selector. Problems that have not yet been fully analyzed are labeled “Unanalyzed”, and each problem is accompanied by a severity wheel indicating the number of correlated issues and their severity distribution. On the right side of the Dashboard, the Details pane presents additional information for the selected problem, including its type, root cause object(s), and recommendation(s) for mitigating the problem. Above the Details pane, a dropdown menu determines the scope of Meta Analysis for the selected problem (i.e., the number of individual incidents analyzed by Meta Analysis). The user has the option to set each issue as Muted, Acknowledged, or Resolved (see Issue Action States for more information). Some states can be re-activated by the user (see Issue State Modification). Within the Details pane, the View N Issues button allows the user to play back individual similar incidents, while the Impact Details button opens a Performance Impact screen that dives into the symptoms associated with each correlated issue (see Figure 3).
The Performance Impact screen provides a list of all issues associated with the selected Meta Analysis problem, which may be sorted by Start Time, End Time, or Severity. For any selected issue, details are provided to the right along with a graph illustrating the symptom(s) (e.g., Memory Swapping, Latency) associated with the problem. These symptom graphs provide a filter allowing the user to add/remove interrelated symptoms, enabling custom visualization and evaluation of each issue. Collectively, the Performance Impact symptom graphs provide the user with an interactive framework for inspecting the global impact of Meta Analysis problems on the infrastructure and gaining a more comprehensive understanding of problem scope.
Finally, the Root Cause Analysis pane on the Performance Root Cause Analysis Dashboard provides the Meta Analysis graph for each selected Meta Analysis problem. This graph is a visual representation of the selected problem that breaks down the related objects by Root Cause, Impacted, and Associated (Figure 4). While there could be a number of Root Cause objects identified over time as a result of the analysis, visualizing the relationship edges and highlight(s) reveals that only a subset of the Root Cause objects are actually causing most of the “damage” resulting from the Performance problem. Green “healthy” edges complete the picture of topological relationships, indicating that the connected Associated Objects are not involved in the problem.
How to use the “Playback” feature of the Performance Root Cause Analysis
By selecting the View N Issues button shown in Figure 2, you can access the playback feature and investigate the individual incidents that were incorporated into the Meta Analysis of the problem.
SIOS iQ learns the behavior of each individual object across different metrics in the environment, leveraging the principles of machine learning and topological behavior analysis. SIOS iQ identifies the anomalies in behavior that potentially cause performance issues for the application, correlates the anomalies to derive their relationships and determine the root cause of the problem (such as an object or event), and recommends a solution to address it. SIOS iQ then presents any infrastructure component events that may affect performance through the Performance Root Cause Analysis Dashboard (not available in the SIOS iQ free edition).
The Impact Analysis tab provides information regarding the impacted and associated object(s) for each Issue in the PERC Issue and Performance Root Cause Lists. In List View, these objects can be sorted by Name, Type or Impacted status. Properties and Impact data (for impacted objects) for a specific object can be accessed by selecting it in the list and clicking the Properties or Impact button, respectively. Topology View (the default) provides a comprehensive, interactive graphical representation of the relationships among the root cause, impacted and associated objects or events, each indicated as shown in the legend. Selecting any object in the graph provides access to its Name, Type, Properties and Impact Details as shown below.
Symptom Graphs and Learned Behavior
SIOS iQ machine learning develops behavior patterns that appear in the Symptoms graph and the Impact Analysis graph, which show learned behavior versus anomalous behavior. The highlighted Learned Behavior region (Best Practices region, when iQ is still in a learning state) represents the expected behavior of the symptom being displayed. Depending upon the Sensitivity setting selected by the user, the learned behavior and its underlying statistical features are combined to determine a decision region, where any data point lying outside this region is identified as an anomaly. Below is a sample image of a symptom graph and a summary of all of its individual parts.
- The Issue Type
- The object displaying the symptom
- The anomalous metric identified as having the most impact on the impacted object
- The history of the given metric
- The red highlighted section shows the duration of the selected event
- The blue highlighted section shows the learned expected behavior
- The observed values of the given metric
- The legend for the symptom graph
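The relationship between learned behavior, the Sensitivity setting, and the decision region can be illustrated with a small Python sketch. This is a deliberately simple stand-in, not the statistical model SIOS iQ uses: it assumes the learned behavior is summarized by the mean and standard deviation of a metric's history, and the `SENSITIVITY_WIDTH` mapping, the function names, and the sample values are all hypothetical.

```python
from statistics import mean, stdev

# Hypothetical mapping: higher sensitivity means a narrower decision region,
# measured in standard deviations around the learned mean.
SENSITIVITY_WIDTH = {"low": 4.0, "medium": 3.0, "high": 2.0}


def decision_region(history, sensitivity="medium"):
    """Return (low, high) bounds derived from the metric's history."""
    center = mean(history)
    spread = stdev(history)
    width = SENSITIVITY_WIDTH[sensitivity]
    return center - width * spread, center + width * spread


def find_anomalies(history, observed, sensitivity="medium"):
    """Return (index, value) pairs for observations outside the region."""
    low, high = decision_region(history, sensitivity)
    return [(i, v) for i, v in enumerate(observed) if v < low or v > high]


# Illustrative latency samples (arbitrary units), not real product data.
history = [10.0, 12.0, 11.0, 9.0, 10.0, 11.0, 12.0, 9.0]
observed = [10.5, 11.0, 14.0, 25.0]

print(find_anomalies(history, observed, "medium"))
print(find_anomalies(history, observed, "high"))
```

With these sample values, the reading of 14.0 falls inside the region at medium sensitivity but outside it at high sensitivity, while 25.0 is flagged at both settings. That is the trade-off the Sensitivity setting controls: a narrower region flags more data points as anomalies.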
Infrastructure Event Correlation
In Performance Root Cause Analysis, performance issues may be identified whose true root cause(s) are virtualization- and infrastructure-related events (such as VM migration and VM provisioning). Such events are correlated and appear in the list of Root Cause Objects as well as in the Symptom Graph, as illustrated below.
|Infrastructure Event Type|Description|
|VM Migration event|Migrated VMs have the potential to introduce greater workload (CPU/memory usage, IOPS, etc.) on underlying resource layers (Compute, Storage or Network) and may eventually cause a negative impact on the performance of the related objects (host, datastore, VM, etc.).|
|Newly Provisioned VM|Provisioning a new VM has the potential to introduce greater workload (CPU/memory usage, IOPS, etc.) on underlying resource layers (Compute, Storage or Network) and may eventually cause a negative impact on the performance of the related objects (host, datastore, VM, etc.).|
Utilizing the topological relationship of corresponding objects, SIOS iQ correlates VM migration and provisioning events with identified Performance Root Cause Issues and identifies whether each event constitutes a true root cause of a related performance issue.
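As a rough illustration of this kind of correlation, the Python sketch below pairs each issue with the migration or provisioning events that landed on a topologically related object shortly before the issue started. It is a minimal sketch under stated assumptions, not the SIOS iQ algorithm: the `InfraEvent` and `PerfIssue` classes, the `topology` dictionary, and the 30-minute `window` are all hypothetical, and the result is only a list of candidate events, not a verdict on which one is the true root cause.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InfraEvent:
    kind: str     # "vm_migration" or "vm_provisioning"
    vm: str
    target: str   # host or datastore that received the VM
    minute: int   # event time, in minutes on a shared clock


@dataclass(frozen=True)
class PerfIssue:
    issue_id: str
    root_cause_object: str
    start_minute: int


def correlate(events, issues, topology, window=30):
    """Map each issue id to events on related objects within the window."""
    result = {}
    for issue in issues:
        # The root cause object itself plus its direct topological neighbors.
        related = {issue.root_cause_object} | topology.get(
            issue.root_cause_object, set()
        )
        result[issue.issue_id] = [
            e for e in events
            if e.target in related
            and 0 <= issue.start_minute - e.minute <= window
        ]
    return result


# Hypothetical topology: datastore-A is shared by host-1 and host-2.
topology = {"datastore-A": {"host-1", "host-2"}}
events = [
    InfraEvent("vm_migration", "vm-7", "host-1", minute=100),
    InfraEvent("vm_provisioning", "vm-9", "host-3", minute=105),
    InfraEvent("vm_migration", "vm-2", "host-2", minute=10),
]
issues = [PerfIssue("i-1", "datastore-A", start_minute=120)]

candidates = correlate(events, issues, topology)
print(candidates["i-1"])
```

Here only the migration of `vm-7` qualifies: the provisioning of `vm-9` happened on an unrelated host, and the migration of `vm-2` occurred too long before the issue began.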
Performance Impact Graph & Symptoms
The Performance Impact graph provides chart information and symptoms metrics regarding the Impacted and Root Cause object(s) for each issue in the PERC Issue and Performance Root Cause lists. The Performance Impact data for Impacted and Root Cause objects can be accessed by selecting the Root Cause Object link on the Details tab or selecting the object on the Impact Analysis tab and clicking the Impact button.
What does that mean? And what should I do with this information?
For detailed information about each possible Root Cause event, please see the description in the Specific Issue Details topic.