Machine Learning and Anomaly Detection

Overview

Machine learning lets you detect patterns and anomalies in large datasets, so you can identify issues early, before they escalate.

Netdata offers Anomaly Advisor, a tool designed to improve your troubleshooting experience, reduce mean time to resolution, and prevent issues from escalating. You can access it through the Netdata dashboard.

:::tip

To configure ML on your nodes, check the ML configuration documentation.

:::

Design Principles

When you use Netdata's machine learning models, you benefit from these key principles:

Principle	Description
Unsupervised Learning	Models operate independently without requiring your input
Real-time Performance	While ML impacts CPU usage, you won't experience any compromise to Netdata's high-fidelity, real-time monitoring
Seamless Integration	ML-based insights are fully embedded into your existing Netdata infrastructure monitoring and troubleshooting workflow
Assistance Over Alerts	ML helps you investigate potential issues rather than triggering unnecessary alerts - no 3 AM wake-ups for minor anomalies
Many Light Models	Netdata uses many lightweight models instead of a few heavy ones, optimizing for resource usage while maintaining accuracy
Scalable Architecture	The system is designed to handle thousands of metrics simultaneously, scoring each one every second with minimal latency

:::note

Netdata deliberately avoids using deep learning models, as they would introduce heavy dependencies and resource requirements that wouldn't align with Netdata's goal of running efficiently on any Linux system. Instead, the implementation uses the lightweight dlib library and spreads training costs over a wide window to minimize performance impact.

:::

Types of Anomalies You Can Detect

Anomaly Type	Description	Business Impact
Point Anomalies	Unusually high or low values compared to historical data	Early warning of service degradation
Contextual Anomalies	Sequences of values that deviate from expected patterns	Identification of unusual usage patterns
Collective Anomalies	Multivariate anomalies where a combination of metrics appears off	Detection of complex system issues
Concept Drifts	Gradual shifts leading to a new baseline	Recognition of evolving system behavior
Change Points	Sudden shifts resulting in a new normal state	Identification of system changes

How Netdata ML Works

flowchart TD
    Raw("**Raw Metrics**<br/><br/>Last 4 Hours")
    Preprocess("**Preprocess**<br/><br/>Feature Vectors")
    Train("**Train k-means<br/><br/>k=2**")
    Model("**Trained Model**")
    
    M1("Model 1<br/><br/>**Recent Data**") 
    M2("Model 2<br/><br/>**Older Data**")
    M3("Model 3<br/><br/>**Even Older Data**")
    MN("Model N<br/><br/>**Up to 2 Days Old**")
    
    NewData("**New Metrics**")
    DistCalc("**Calculate**<br/><br/>Euclidean Distance<br/>to Cluster Centers")
    Threshold("**Distance > 99th**<br/><br/>Percentile?")
    FlagA("**Flag as Anomalous**<br/><br/>in This Model")
    FlagN("**Flag as Normal**<br/><br/>in This Model")
    
    AllResults("**Results from All Models**")
    AllAgree("**All Models<br/><br/>Agree it's<br/><br/>Anomalous?**")
    SetBit("**Set Anomaly Bit = 100**<br/><br/>True")
    ClearBit("**Set Anomaly Bit = 0**<br/><br/>False")
    
    Raw --> Preprocess
    Preprocess --> Train
    Train --> Model
    Model --> M1
    Model --> M2
    Model --> M3
    Model --> MN
    
    M1 --> NewData
    M2 --> NewData
    M3 --> NewData
    MN --> NewData
    
    NewData --> DistCalc
    DistCalc --> Threshold
    Threshold -->|Yes| FlagA
    Threshold -->|No| FlagN
    
    FlagA --> AllResults
    FlagN --> AllResults
    AllResults --> AllAgree
    AllAgree -->|Yes| SetBit
    AllAgree -->|No| ClearBit
    
    %% Style definitions
    classDef neutral fill:#f9f9f9,stroke:#000000,stroke-width:3px,color:#000000,font-size:16px
    classDef process fill:#ffeb3b,stroke:#000000,stroke-width:3px,color:#000000,font-size:16px
    classDef complete fill:#4caf50,stroke:#000000,stroke-width:3px,color:#000000,font-size:16px
    classDef anomaly fill:#f44336,stroke:#000000,stroke-width:3px,color:#000000,font-size:16px
    
    %% Apply styles
    class Raw,Preprocess,NewData,AllResults neutral
    class Train,Model,M1,M2,M3,MN,DistCalc,Threshold,AllAgree process
    class FlagN,ClearBit complete
    class FlagA,SetBit anomaly

Training & Detection

When you enable ML, Netdata trains an unsupervised model for each of your metrics. By default, this model is a k-means clustering algorithm (with k=2) trained on the last 4 hours of your data. Instead of just analyzing raw values, the model works with preprocessed feature vectors to improve your detection accuracy.
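The training step described above can be sketched as follows. This is a minimal illustration, not Netdata's implementation: the preprocessing shown here (first differences followed by a sliding window of 3 values), the function names, and the plain Lloyd's-iteration k-means are all assumptions made for the example.

```python
import math
import random

def make_feature_vectors(values, lag=3):
    """Turn a raw series into feature vectors: take first differences,
    then group every `lag` consecutive differences into one vector."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return [tuple(diffs[i:i + lag]) for i in range(len(diffs) - lag + 1)]

def kmeans(vectors, k=2, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns k cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: math.dist(v, centers[c]))
            clusters[nearest].append(v)
        # Recompute each center as the mean of its members; keep the old
        # center if a cluster ended up empty.
        centers = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers

# Hypothetical metric values standing in for a training window.
series = [10, 11, 10, 12, 11, 10, 11, 12, 11, 10, 50, 52, 51, 50, 52, 51]
vectors = make_feature_vectors(series)
centers = kmeans(vectors, k=2)
```

Working on differences rather than raw values means the model learns how a metric moves, not just where it sits, which is one plausible reason to preprocess before clustering.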

:::important

To reduce false positives in your environment, Netdata trains multiple models per time-series, which together cover more than two days of data. An anomaly is flagged only if all models agree on it, eliminating 99% of false positives. Requiring consensus across models trained on different time windows makes the system highly resistant to spurious anomalies while keeping it sensitive to real issues.

:::
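The consensus rule can be expressed in a few lines. This is a sketch only; the function name and the list-of-booleans input are hypothetical, chosen to mirror the description above.

```python
def anomaly_bit(model_flags):
    """Return 100 only if every model flagged the observation, else 0.

    `model_flags` holds one boolean per trained model for a single
    time step (True means that model considers the value anomalous).
    """
    return 100 if model_flags and all(model_flags) else 0

unanimous = anomaly_bit([True, True, True])
split_vote = anomaly_bit([True, False, True])
```

A single dissenting model is enough to keep the bit at 0, which is what makes the rule conservative.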

The anomaly detection algorithm uses the Euclidean distance between recent metric patterns and the learned cluster centers. If this distance exceeds a threshold based on the 99th percentile of training data, that model considers the metric anomalous.
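As a rough illustration of this check for a single model, the sketch below measures the distance from a feature vector to its nearest cluster center and compares it with the 99th percentile of the distances seen during training. Using the nearest center, the nearest-rank percentile method, and all function names are assumptions for the example; Netdata's actual scoring may differ.

```python
import math

def nearest_center_distance(vector, centers):
    """Euclidean distance from `vector` to the closest cluster center."""
    return min(math.dist(vector, c) for c in centers)

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def fit_threshold(training_vectors, centers, pct=99):
    """Distance below which `pct` percent of the training vectors fall."""
    distances = sorted(
        nearest_center_distance(v, centers) for v in training_vectors
    )
    return percentile(distances, pct)

def is_anomalous(vector, centers, threshold):
    return nearest_center_distance(vector, centers) > threshold

# Hypothetical two-cluster model and its training data.
centers = [(0.0, 0.0), (10.0, 10.0)]
training = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (9, 10)]
threshold = fit_threshold(training, centers)

far_point = is_anomalous((5, 5), centers, threshold)     # far from both centers
near_point = is_anomalous((0.5, 0), centers, threshold)  # close to the first center
```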

Anomaly Bit

Each trained model assigns an anomaly score at every time step based on how far your data deviates from its learned clusters. If the score exceeds the 99th percentile of training data in every model, the anomaly bit is set to true (100); otherwise, it remains false (0).

Key benefits you'll experience:

No additional storage overhead since the anomaly bit is embedded in Netdata's floating point number format
The query engine automatically computes anomaly rates without requiring extra queries

:::note

The anomaly bit is literally a single bit in Netdata's internal storage representation. For every metric collected, Netdata can therefore also track whether the value is anomalous without increasing storage requirements.

:::
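To see why a flag bit costs no extra storage, consider the sketch below, which reserves one bit of a 32-bit word for the anomaly flag and uses the remaining bits for the value. The layout is purely hypothetical and is not Netdata's actual number format; it only illustrates the general idea of embedding a flag in an existing word.

```python
ANOMALY_FLAG = 1 << 31         # hypothetical: top bit of a 32-bit word
VALUE_MASK = ANOMALY_FLAG - 1  # remaining 31 bits hold the value

def pack(value, anomalous):
    """Store a non-negative integer value and an anomaly flag in one word."""
    if not 0 <= value <= VALUE_MASK:
        raise ValueError("value does not fit in 31 bits")
    return value | (ANOMALY_FLAG if anomalous else 0)

def unpack(word):
    """Recover the (value, anomalous) pair from a packed word."""
    return word & VALUE_MASK, bool(word & ANOMALY_FLAG)

word = pack(1234, anomalous=True)
value, anomalous = unpack(word)
```

The packed word still fits in 32 bits, so storing the flag adds nothing to the size of each sample.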

You can access the anomaly bits through Netdata's API by adding the options=anomaly-bit parameter to your query. For example:

https://your-node/api/v1/data?chart=system.cpu&dimensions=user&after=-10&options=anomaly-bit

This would return anomaly bits for the last 10 seconds of CPU user data, with values of either 0 (normal) or 100 (anomalous).
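If you build such queries from a script, a small helper along these lines may be convenient. The helper name and its arguments are hypothetical; the endpoint and query parameters are the ones from the example URL above.

```python
from urllib.parse import urlencode

def anomaly_bit_url(base, chart, dimension, seconds):
    """Build a data query that returns anomaly bits instead of values."""
    params = {
        "chart": chart,
        "dimensions": dimension,
        "after": -seconds,         # negative: relative to now
        "options": "anomaly-bit",
    }
    return f"{base}/api/v1/data?{urlencode(params)}"

url = anomaly_bit_url("https://your-node", "system.cpu", "user", 10)
```

You can then fetch `url` with any HTTP client that can reach your node.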

Anomaly Rate

Netdata calculates the Node Anomaly Rate (NAR) and Dimension Anomaly Rate (DAR) from the anomaly bits. Here's an example matrix of five dimensions (d1 to d5) over ten time steps (t1 to t10):

Time	d1	d2	d3	d4	d5	NAR
t1	0	0	0	0	0	0%
t2	0	0	0	0	100	20%
t3	0	0	0	0	0	0%
t4	0	100	0	0	0	20%
t5	100	0	0	0	0	20%
t6	0	100	100	0	100	60%
t7	0	100	0	100	0	40%
t8	0	0	0	0	100	20%
t9	0	0	100	100	0	40%
t10	0	0	0	0	0	0%
DAR	10%	30%	20%	20%	30%	*NAR_t1-10 = 22%*

DAR (Dimension Anomaly Rate): Percentage of time steps in which a specific dimension was anomalous (the average of a column)
NAR (Node Anomaly Rate): Percentage of dimensions that were anomalous at a given time step (the average of a row)
Overall anomaly rate: Average across all dimensions and time steps (22% in the example above)
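The figures in the example matrix can be reproduced with a few lines of Python. This is only a worked illustration of the arithmetic; Netdata's query engine computes these rates for you.

```python
# Anomaly bits from the example matrix: one row per time step (t1..t10),
# one column per dimension (d1..d5).
bits = [
    [0,   0,   0,   0,   0],    # t1
    [0,   0,   0,   0,   100],  # t2
    [0,   0,   0,   0,   0],    # t3
    [0,   100, 0,   0,   0],    # t4
    [100, 0,   0,   0,   0],    # t5
    [0,   100, 100, 0,   100],  # t6
    [0,   100, 0,   100, 0],    # t7
    [0,   0,   0,   0,   100],  # t8
    [0,   0,   100, 100, 0],    # t9
    [0,   0,   0,   0,   0],    # t10
]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

nar = [mean(row) for row in bits]           # one rate per time step
dar = [mean(column) for column in zip(*bits)]  # one rate per dimension
overall = mean(nar)                         # same as mean(dar) here

print("NAR per time step:", nar)
print("DAR per dimension:", dar)
print("Overall anomaly rate:", overall)
```

Because each bit is stored as 0 or 100, a plain average already yields a percentage.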

Node-Level Anomaly Detection

Netdata tracks the percentage of anomaly bits over time for you. When the Node Anomaly Rate (NAR) exceeds a set threshold and remains high for a period, a node anomaly event is triggered. These events are recorded in the new_anomaly_event dimension on the anomaly_detection.anomaly_detection chart.
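The trigger logic can be sketched as a simple run-length check. The threshold, the minimum duration, and the function name below are hypothetical values chosen for illustration; check the ML configuration documentation for the settings Netdata actually uses.

```python
def detect_new_events(nar_series, threshold, min_duration):
    """Return one flag per time step: 1 on the step where the NAR has
    stayed above `threshold` for `min_duration` consecutive steps, else 0."""
    flags = []
    run = 0  # length of the current streak above the threshold
    for nar in nar_series:
        run = run + 1 if nar > threshold else 0
        flags.append(1 if run == min_duration else 0)
    return flags

# NAR values taken from the example matrix in the Anomaly Rate section.
nar_series = [0, 20, 0, 20, 20, 60, 40, 20, 40, 0]
events = detect_new_events(nar_series, threshold=10, min_duration=3)
```

With these settings, the brief spike at the second step is ignored and a single event is raised at the sixth step, once the rate has stayed high for three steps in a row.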

Viewing Anomaly Data in Your Netdata Dashboard

Once you enable ML, you'll have access to an Anomaly Detection menu with key charts:

anomaly_detection.dimensions: Number of dimensions flagged as anomalous
anomaly_detection.anomaly_rate: Percentage of anomalous dimensions
anomaly_detection.anomaly_detection: Flags (0 or 1) indicating when an anomaly event occurs

These insights help you quickly assess potential issues and take action before they escalate.

Summary

With Netdata ML, you get reliable, real-time anomaly detection with minimal false positives. By incorporating ML within your existing observability workflows, you can enhance troubleshooting and ensure proactive monitoring without unnecessary alerts.
