Evaluation module

This module contains functionality to evaluate performance of an anomaly detector. It can be imported as follows:

>>> from dtaianomaly import evaluation

Custom evaluation metrics can be implemented by extending BinaryMetric or ProbaMetric. The former expects predicted “decisions” (anomaly or not), the latter predicted “scores” (more or less anomalous). This distinction is important for later use in a Workflow.

class dtaianomaly.evaluation.Metric[source]

compute(y_true: ndarray, y_pred: ndarray) → float[source]

Computes the performance score.

Parameters:

y_true (array-like of shape (n_samples)) – Ground-truth labels.
y_pred (array-like of shape (n_samples)) – Predicted anomaly scores.

Returns:

score – The alignment score of the given ground truth and prediction, according to this metric.

Return type:

float

Raises:

ValueError – If the inputs are not numeric array-likes.
ValueError – If y_true and y_pred do not have identical shapes.
ValueError – If y_true is non-binary.
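The three error conditions above can be illustrated with a small, pure-Python check. This is only a sketch for one-dimensional inputs: `validate_inputs` is a hypothetical helper, not part of dtaianomaly, and the library's actual validation may differ.

```python
from numbers import Real


def validate_inputs(y_true, y_pred):
    """Hypothetical 1D input check mirroring the documented ValueErrors."""
    try:
        y_true, y_pred = list(y_true), list(y_pred)
    except TypeError:
        raise ValueError("Inputs must be array-like.") from None
    # Every element must be a real number.
    if not all(isinstance(v, Real) for v in y_true + y_pred):
        raise ValueError("Inputs must be numeric.")
    # Ground truth and prediction must have the same length.
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have identical shapes.")
    # Ground-truth labels must be 0 or 1.
    if any(v not in (0, 1) for v in y_true):
        raise ValueError("y_true must be binary.")
    return y_true, y_pred


checked_true, checked_pred = validate_inputs([0, 1, 0], [0.2, 0.9, 0.1])
```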

class dtaianomaly.evaluation.BinaryMetric[source]: A metric that takes as input binary anomaly labels.

class dtaianomaly.evaluation.ProbaMetric[source]: A metric that takes as input continuous anomaly scores.

class dtaianomaly.evaluation.ThresholdMetric(thresholder: Thresholding, metric: BinaryMetric)[source]

Wrapper to combine a BinaryMetric object with some thresholding, to make sure that it can take continuous anomaly scores as an input. This is done by first applying some thresholding to the predicted anomaly scores, after which a binary metric can be computed.

Parameters:

thresholder (Thresholding) – Instance of the desired Thresholding class
metric (BinaryMetric) – Instance of the desired BinaryMetric class
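The wrapping idea can be sketched without the library. The snippet below is a minimal, pure-Python illustration: `fixed_threshold`, `precision`, and `threshold_metric` are hypothetical helpers rather than dtaianomaly functions, and the fixed cutoff with a `>=` comparison is an assumption made for the example.

```python
def fixed_threshold(scores, cutoff):
    """Turn continuous scores into binary labels (assumed rule: score >= cutoff)."""
    return [1 if s >= cutoff else 0 for s in scores]


def precision(y_true, y_pred):
    """Binary precision; returns 0.0 when nothing is flagged (assumed convention)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if tp + fp > 0 else 0.0


def threshold_metric(y_true, scores, cutoff, binary_metric):
    """First threshold the scores, then evaluate the binary metric."""
    return binary_metric(y_true, fixed_threshold(scores, cutoff))


y_true = [0, 0, 1, 1, 0]
scores = [0.1, 0.6, 0.8, 0.4, 0.2]
result = threshold_metric(y_true, scores, cutoff=0.5, binary_metric=precision)
```

With a cutoff of 0.5, two observations are flagged, of which one is a true anomaly, so the wrapped precision is 0.5.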

class dtaianomaly.evaluation.Precision[source]

Computes the Precision score.

Precision measures how accurately the model identifies anomalies. It reflects the proportion of detected anomalies that are truly abnormal. This is particularly important when the cost of false positives (normal events incorrectly flagged as anomalies) is high.

Mathematically, precision is the ratio of true positives (correctly identified anomalies) to all predicted positives, which includes both true anomalies and false positives (normal events mistakenly flagged as anomalies). It can be expressed as:

\[\text{Precision} = \frac{\text{True Anomalies}}{\text{True Anomalies} + \text{False Positives}}\]

A high precision in anomaly detection indicates that the model generates few false alarms, ensuring that most flagged anomalies are truly abnormal events. However, it does not measure how many anomalies were actually identified.
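The last point, that precision says nothing about missed anomalies, can be seen in a small worked example. This is a pure-Python sketch with a hypothetical helper `confusion_counts`; it does not use the dtaianomaly implementation.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


# Four true anomalies, but the detector flags only one of them.
y_true = [0, 1, 1, 1, 1, 0, 0, 0]
y_pred = [0, 1, 0, 0, 0, 0, 0, 0]

tp, fp, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
```

Here the single flagged observation is a true anomaly, so the precision is 1.0 even though three of the four anomalies were missed.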

class dtaianomaly.evaluation.PointAdjustedPrecision[source]

Compute the point-adjusted precision: first point-adjust the binary anomaly predictions, after which the precision is computed.

For given binary anomaly predictions and ground truth anomaly labels, point-adjusting will treat any sequence of consecutive ground truth anomalies as anomalous events. If any of the observations in such an event has been detected, then we say that the anomaly has been detected. In this case, all predictions in the anomalous event are set to 1, thereby indicating that the method predicted an anomaly.

metric

The Precision-object used for computing the precision, after the prediction has been point adjusted. Note that the object should not be passed to the constructor.

Type: Precision

See also

Precision: Compute the standard, not point-adjusted precision.
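The point-adjustment step described above can be sketched in plain Python. The function `point_adjust` below is a hypothetical illustration of the procedure as described in the text, not the library's own code.

```python
def point_adjust(y_true, y_pred):
    """Return a copy of y_pred in which every ground-truth event that
    contains at least one detection is marked as fully detected."""
    adjusted = list(y_pred)
    n = len(y_true)
    start = 0
    while start < n:
        if y_true[start] != 1:
            start += 1
            continue
        # Find the end (exclusive) of this run of consecutive anomalies.
        end = start
        while end < n and y_true[end] == 1:
            end += 1
        # If any observation in the event was flagged, flag the whole event.
        if any(y_pred[i] == 1 for i in range(start, end)):
            for i in range(start, end):
                adjusted[i] = 1
        start = end
    return adjusted


y_true = [0, 1, 1, 1, 0, 0, 1, 1, 0]
y_pred = [0, 0, 1, 0, 0, 1, 0, 0, 0]
adjusted = point_adjust(y_true, y_pred)
```

In this example the first event (three observations) contains one detection and is therefore filled in completely, while the second event is never detected and stays untouched. The false positive outside any event is also left as is, so the precision rises from 0.5 (1 of 2 flagged points) to 0.75 (3 of 4 flagged points).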

class dtaianomaly.evaluation.Recall[source]

Computes the Recall score.

Recall measures the model’s ability to correctly identify all actual anomalies. It tells us the proportion of true anomalies that were successfully detected by the model. In an anomaly detection system, recall answers the question: “Of all the anomalies that occurred, how many did the model detect?” A high recall is especially important when missing actual anomalies (false negatives) could have severe consequences.

Mathematically, recall is the ratio of true positives (correctly identified anomalies) to all actual positives, which includes both true anomalies and false negatives (missed anomalies). It can be expressed as:

\[\text{Recall} = \frac{\text{True Anomalies}}{\text{True Anomalies} + \text{False Negatives}}\]

A high recall ensures that most anomalies are detected, but it doesn’t account for how many false positives (normal events incorrectly flagged as anomalies) were generated, which is handled by precision.
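A short worked example shows why recall alone can be misleading. The sketch below is pure Python and independent of dtaianomaly: a trivial detector that flags every observation reaches perfect recall while producing many false alarms.

```python
# Two true anomalies among eight observations.
y_true = [0, 1, 1, 0, 0, 0, 0, 0]
# A trivial detector that flags everything as anomalous.
y_pred = [1] * len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

recall = tp / (tp + fn)
precision = tp / (tp + fp)
```

The recall is 1.0 because no anomaly is missed, but the precision is only 0.25 because six of the eight flagged observations are normal.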

class dtaianomaly.evaluation.PointAdjustedRecall[source]

Compute the point-adjusted recall: first point-adjust the binary anomaly predictions, after which the recall is computed.

For given binary anomaly predictions and ground truth anomaly labels, point-adjusting will treat any sequence of consecutive ground truth anomalies as anomalous events. If any of the observations in such an event has been detected, then we say that the anomaly has been detected. In this case, all predictions in the anomalous event are set to 1, thereby indicating that the method predicted an anomaly.

metric

The Recall-object used for computing the recall, after the prediction has been point adjusted. Note that the object should not be passed to the constructor.

Type: Recall

See also

Recall: Compute the standard, not point-adjusted recall.

class dtaianomaly.evaluation.FBeta(beta: float | int = 1)[source]

Computes the \(F_\beta\) score.

The \(F_\beta\) combines both precision and recall into a single value. It provides a balanced evaluation of a model’s performance, especially in anomaly detection, where there is often a trade-off between catching all anomalies (high recall) and minimizing false alarms (high precision). The parameter \(\beta\) controls the balance between precision and recall. A \(\beta > 1\) gives more weight to recall, useful when missing anomalies is costly, while \(\beta < 1\) emphasizes precision, reducing false positives.

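The \(F_\beta\) score is the weighted harmonic mean of precision and recall. In terms of the number of true positives (tp), false positives (fp), and false negatives (fn), it can be expressed as: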

\[F_\beta = \frac{(1 + \beta^2) \text{tp}} {(1 + \beta^2) \text{tp} + \text{fp} + \beta^2 \text{fn}}\]

A high \(F_\beta\) score indicates a good balance between detecting actual anomalies and minimizing false positives.

Parameters: beta (int or float, default=1) – Desired beta parameter.

See also

Precision: Compute the Precision score.
Recall: Compute the Recall score.
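The count-based formula above is equivalent to the more familiar form \((1 + \beta^2) \cdot \text{precision} \cdot \text{recall} / (\beta^2 \cdot \text{precision} + \text{recall})\). The sketch below checks this numerically and shows the effect of \(\beta\). Both functions are hypothetical helpers written for this illustration, not dtaianomaly functions.

```python
def f_beta_from_counts(tp, fp, fn, beta=1.0):
    """F-beta computed directly from confusion counts."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + fp + b2 * fn
    return (1 + b2) * tp / denom if denom > 0 else 0.0


def f_beta_from_pr(precision, recall, beta=1.0):
    """F-beta computed as the weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom > 0 else 0.0


tp, fp, fn = 6, 2, 4
precision = tp / (tp + fp)  # 0.75
recall = tp / (tp + fn)     # 0.6

f_half = f_beta_from_counts(tp, fp, fn, beta=0.5)  # emphasizes precision
f_one = f_beta_from_counts(tp, fp, fn, beta=1)
f_two = f_beta_from_counts(tp, fp, fn, beta=2)     # emphasizes recall
```

Because precision (0.75) is higher than recall (0.6) in this example, the score decreases as \(\beta\) grows: roughly 0.714 for \(\beta = 0.5\), 0.667 for \(\beta = 1\), and 0.625 for \(\beta = 2\).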

class dtaianomaly.evaluation.PointAdjustedFBeta(beta: float | int = 1)[source]

Compute the point-adjusted \(F_\beta\): first point-adjust the binary anomaly predictions, after which the \(F_\beta\) is computed.

For given binary anomaly predictions and ground truth anomaly labels, point-adjusting will treat any sequence of consecutive ground truth anomalies as anomalous events. If any of the observations in such an event has been detected, then we say that the anomaly has been detected. In this case, all predictions in the anomalous event are set to 1, thereby indicating that the method predicted an anomaly.

Parameters: beta (int or float, default=1) – Desired beta parameter.

metric

The FBeta-object used for computing the \(F_\beta\), after the prediction has been point adjusted. Note that the object should not be passed to the constructor.

Type: FBeta

See also

FBeta: Compute the standard, not point-adjusted \(F_\beta\).

class dtaianomaly.evaluation.AreaUnderPR[source]

Computes the Area Under the Precision-Recall Curve (AUC-PR) score.

The AUC-PR is a performance metric that is especially useful for evaluating models in imbalanced datasets, such as anomaly detection, where the number of normal instances vastly outnumbers the anomalies. The Precision-Recall curve plots precision against recall at various thresholds, providing a detailed view of the trade-off between detecting true anomalies (recall) and minimizing false alarms (precision).

AUC-PR summarizes the curve into a single value, representing the overall ability of the model to identify anomalies while keeping false positives in check. A higher AUC-PR value indicates better performance, meaning the model is effective at detecting true anomalies with fewer false positives.
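To make the construction of the curve concrete, the sketch below computes precision and recall at every distinct score used as a threshold and summarizes the curve with a step-wise sum (often called average precision). This is one common way to approximate the area, chosen here as an assumption; dtaianomaly may integrate the curve differently. The helpers `pr_points` and `average_precision` are hypothetical, and the `>=` thresholding rule is assumed.

```python
def pr_points(y_true, scores):
    """(recall, precision) pairs, one per distinct threshold, from the
    highest threshold to the lowest. Assumes at least one true anomaly."""
    total_pos = sum(y_true)
    points = []
    for thr in sorted(set(scores), reverse=True):
        pred = [1 if s >= thr else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
        fp = sum(pred) - tp
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


def average_precision(y_true, scores):
    """Step-wise area: sum of precision times the increase in recall."""
    area, prev_recall = 0.0, 0.0
    for recall, precision in pr_points(y_true, scores):
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


y_true = [0, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8]
ap = average_precision(y_true, scores)
```

In this example, recall first reaches 0.5 at precision 1.0 and then reaches 1.0 at precision 2/3, giving a value of about 0.833.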

class dtaianomaly.evaluation.AreaUnderROC[source]

Computes the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) score.

The AUC-ROC is a widely used metric to evaluate the performance of a binary classifier, especially in anomaly detection. The ROC-curve plots the true positive rate (recall) against the false positive rate across different classification thresholds. The AUC-ROC represents the likelihood that the model ranks a randomly chosen anomaly higher than a randomly chosen normal instance.

AUC-ROC provides a single number summarizing the model’s ability to distinguish between normal and anomalous instances. A value of 1.0 indicates perfect discrimination, while 0.5 implies the model performs no better than random guessing. It is especially useful when anomalies are rare, as it considers the trade-off between detecting true anomalies (high recall) and minimizing false positives.
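The ranking interpretation can be computed directly by comparing every anomaly with every normal instance. The sketch below is a pure-Python illustration with a hypothetical helper `auc_roc_by_ranking`; counting ties as half a win is the usual convention and is assumed here. It is quadratic in the number of observations and meant for illustration only.

```python
def auc_roc_by_ranking(y_true, scores):
    """Fraction of (anomaly, normal) pairs in which the anomaly scores higher.
    Ties count as half. Assumes both classes are present."""
    anomalies = [s for t, s in zip(y_true, scores) if t == 1]
    normals = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for a in anomalies:
        for n in normals:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(anomalies) * len(normals))


y_true = [0, 1, 0, 1, 0, 0]
scores = [0.2, 0.9, 0.5, 0.4, 0.1, 0.3]
auc = auc_roc_by_ranking(y_true, scores)
```

Of the eight anomaly-normal pairs, seven are ranked correctly (only the anomaly with score 0.4 is ranked below the normal instance with score 0.5), so the result is 0.875.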

class dtaianomaly.evaluation.BestThresholdMetric(metric: BinaryMetric, max_nb_thresholds: int = -1)[source]

Compute the maximum score of a binary metric over all thresholds. This method iterates over the possible thresholds for the given predicted anomaly scores, computes the binary metric for each threshold, and then returns the highest score.

Parameters:

metric (BinaryMetric) – Instance of the desired BinaryMetric class
max_nb_thresholds (int, default=-1) – The maximum number of thresholds to use for computing the best threshold. If max_nb_thresholds = -1, all thresholds will be used. Otherwise, the value indicates the size of the subsample of all possible thresholds that should be used. This subset is created by first sorting the unique candidate thresholds, and then selecting thresholds at regular intervals (e.g., the 3rd, 6th, 9th, …). We recommend using the default value (use all thresholds); a smaller value can be used to reduce resource requirements.

threshold_

The threshold resulting in the best performance.

Type: float
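The threshold search can be sketched as follows. This is a pure-Python illustration under stated assumptions: `f1` and `best_threshold` are hypothetical helpers, observations are flagged when their score is `>=` the threshold, and the regular-interval subsampling is implemented with a simple slice, which may select different thresholds than the library does.

```python
def f1(y_true, y_pred):
    """Binary F1 score; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def best_threshold(y_true, scores, binary_metric, max_nb_thresholds=-1):
    """Return (best score, threshold that achieved it)."""
    thresholds = sorted(set(scores))
    if 0 < max_nb_thresholds < len(thresholds):
        # Keep at most max_nb_thresholds candidates at regular intervals.
        step = -(-len(thresholds) // max_nb_thresholds)  # ceiling division
        thresholds = thresholds[::step]
    best_score, best_thr = -1.0, None
    for thr in thresholds:
        pred = [1 if s >= thr else 0 for s in scores]
        score = binary_metric(y_true, pred)
        if score > best_score:
            best_score, best_thr = score, thr
    return best_score, best_thr


y_true = [0, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8]

full_score, full_thr = best_threshold(y_true, scores, f1)
sub_score, sub_thr = best_threshold(y_true, scores, f1, max_nb_thresholds=2)
```

Using all five candidate thresholds, the best F1 is 0.8 at threshold 0.35. With only two candidates (0.1 and 0.4 in this sketch), the search misses that threshold and reports about 0.571, which illustrates why using all thresholds is the recommended default.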