Cluster Based Local Outlier Factor
- class dtaianomaly.anomaly_detection.ClusterBasedLocalOutlierFactor(window_size: int | str, stride: int = 1, n_clusters: int = 8, alpha: float = 0.9, beta: float = 5.0, **kwargs)[source]
Anomaly detector based on the Cluster-based Local Outlier Factor (CBLOF) [11].
CBLOF is a cluster-based LOF which uses the distance to clusters in the data to compute an outlier score. Specifically, CBLOF first clusters the data using some clustering algorithm (K-means in this implementation). Next, the clusters are separated into so-called ‘large clusters’ \(LC\) and ‘small clusters’ \(SC\), depending on the parameters \(\alpha\) and \(\beta\). Then, the Cluster-based Local Outlier Factor of an observation \(o\) belonging to cluster \(C_i\) is computed as follows:
\[CBLOF(o) = \lvert C_i \rvert \cdot \begin{cases} dist(o, C_i), & \text{if } C_i \in LC, \\ \min_{C_j \in LC} dist(o, C_j), & \text{if } C_i \in SC. \end{cases}\]
Specifically, if \(o\) is part of a large cluster \(C_i\), the size of \(C_i\) is multiplied by the distance of \(o\) to \(C_i\). If \(o\) is in a small cluster, then the size of \(C_i\) is multiplied by the distance to the nearest large cluster \(C_j\).
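The scoring rule above can be illustrated with a minimal, standard-library-only sketch. It is not the PyOD implementation: the helper names `split_clusters` and `cblof_scores` are hypothetical, \(dist(o, C)\) is assumed to be the Euclidean distance to the centroid of \(C\), and the large/small split assumes the rule from the original CBLOF formulation, in which the boundary is placed at the first cluster (ordered by decreasing size) where the clusters so far cover at least a fraction \(\alpha\) of the samples or the next cluster is at least \(\beta\) times smaller.

```python
from collections import Counter
from math import dist


def split_clusters(sizes, alpha=0.9, beta=5.0):
    """Return the ids of the clusters treated as 'large' (hypothetical helper)."""
    order = sorted(sizes, key=sizes.get, reverse=True)  # cluster ids, largest first
    n_samples = sum(sizes.values())
    covered = 0
    for k, cluster_id in enumerate(order):
        covered += sizes[cluster_id]
        is_last = k == len(order) - 1
        # Condition on alpha: the clusters so far hold enough of the data.
        enough_mass = covered >= alpha * n_samples
        # Condition on beta: the next cluster is much smaller than this one.
        big_drop = (not is_last) and sizes[cluster_id] >= beta * sizes[order[k + 1]]
        if enough_mass or big_drop or is_last:
            return set(order[: k + 1])
    return set()


def cblof_scores(points, labels, centroids, alpha=0.9, beta=5.0):
    """Compute CBLOF(o) for every point, given cluster labels and centroids."""
    sizes = Counter(labels)
    large = split_clusters(sizes, alpha, beta)
    scores = []
    for point, label in zip(points, labels):
        if label in large:
            # Large cluster: distance to the point's own cluster.
            distance = dist(point, centroids[label])
        else:
            # Small cluster: distance to the nearest large cluster.
            distance = min(dist(point, centroids[c]) for c in large)
        scores.append(sizes[label] * distance)
    return scores
```

In this sketch `centroids` is a mapping from cluster label to centroid coordinates, so any clustering algorithm that yields labels and centroids could feed it.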
- Parameters:
window_size (int or str) – The window size to use for extracting sliding windows from the time series. This value will be passed to compute_window_size().
stride (int, default=1) – The stride, i.e., the step size for extracting sliding windows from the time series.
n_clusters (int, default=8) – The number of clusters to form and the number of centroids to generate.
alpha (float in [0.5, 1.0], default=0.9) – The ratio for deciding small and large clusters. With the clusters ordered by decreasing size, the large clusters are those that together contain at least a fraction \(\alpha\) of all samples.
beta (float, default=5.0) – The ratio for deciding small and large clusters. \(\beta\) equals a cutoff for the small and large clusters, such that for clusters ordered by size, we have that \(\lvert C_k \rvert / \lvert C_{k+1} \rvert = \beta\).
**kwargs – Arguments to be passed to the PyOD CBLOF.
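How window_size and stride interact can be sketched with a plain sliding-window extraction. The function `sliding_windows` below is a hypothetical helper written for illustration, not the library's own routine; in particular, it simply drops any trailing samples that do not fill a complete window, which the library may handle differently.

```python
def sliding_windows(series, window_size, stride=1):
    """Extract consecutive windows of `window_size` samples, `stride` steps apart."""
    windows = []
    # The last valid start index is len(series) - window_size.
    for start in range(0, len(series) - window_size + 1, stride):
        windows.append(list(series[start:start + window_size]))
    return windows


windows = sliding_windows([1, 2, 3, 4, 5], window_size=3, stride=2)
```

A larger stride yields fewer, less overlapping windows, which reduces the number of instances the detector has to cluster.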
- window_size_
The window size effectively used by this anomaly detector.
- Type:
int
- pyod_detector_
A CBLOF detector of PyOD.
- Type:
CBLOF
Examples
>>> from dtaianomaly.anomaly_detection import ClusterBasedLocalOutlierFactor
>>> from dtaianomaly.data import demonstration_time_series
>>> x, y = demonstration_time_series()
>>> cblof = ClusterBasedLocalOutlierFactor(10).fit(x)
>>> cblof.decision_function(x)
array([0.50321076, 0.5753145 , 0.61938076, ..., 0.29794485, 0.30720306, 0.29857479]...)
Notes
CBLOF inherits from PyodAnomalyDetector.
- check_is_fitted() None
Check whether this anomaly detector is fitted or not.
- Raises:
NotFittedError – If this detector is not fitted yet.
- decision_function(X: ndarray) array
Abstract method, compute anomaly scores.
- Parameters:
X (array-like of shape (n_samples, n_attributes)) – Input time series.
- Returns:
decision_scores – The computed anomaly scores.
- Return type:
array-like of shape (n_samples)
- fit(X: ndarray, y: ndarray = None, **kwargs) BaseDetector
Abstract method, fit this detector to the given data.
- Parameters:
X (array-like of shape (n_samples, n_attributes)) – Input time series.
y (array-like, default=None) – Ground-truth information.
- Returns:
self – Returns the instance itself.
- Return type:
BaseDetector
- is_fitted() bool
Return whether this anomaly detector is fitted.
- Returns:
is_fitted – True if and only if this detector is fitted, and can be used for detecting anomalies.
- Return type:
bool
- predict_confidence(X: ndarray, X_train: ndarray = None, contamination: float = 0.05, decision_scores_given: bool = False)
Predict the confidence of the anomaly scores on the given test data.
This method implements ExCeeD [perini2020quantifying] (Example-wise Confidence of anomaly Detectors) to estimate the confidence. ExCeeD transforms the predicted decision scores to probability estimates using a Bayesian approach, which makes it possible to assign a confidence score to each prediction that captures the uncertainty of the anomaly detector in that prediction.
- Parameters:
X (array-like of shape (n_samples, n_attributes)) – The test time series for which the confidence of anomaly scores should be predicted.
X_train (array-like of shape (n_samples_train, n_attributes), default=None) – The training time series, which can be used as reference. If X_train=None, the test set is used as reference set.
contamination (float, default=0.05) – The (estimated) contamination rate for the data, i.e., the expected percentage of anomalies.
decision_scores_given (bool, default=False) – Whether the given X and X_train represent time series data or decision scores. If decision_scores_given=False (default), then the given arrays are interpreted as time series. Otherwise, they are interpreted as decision scores, as computed by decision_function().
- Returns:
confidence – The confidence of this anomaly detector in each prediction in the given test time series.
- Return type:
array-like of shape (n_samples)
References
[perini2020quantifying] Perini, L., Vercruyssen, V., Davis, J. Quantifying the Confidence of Anomaly Detectors in Their Example-Wise Predictions. In: Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2020. Springer, Cham, doi: 10.1007/978-3-030-67664-3_14.
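The idea behind ExCeeD can be sketched on plain decision scores using only the standard library. This is a simplified illustration based on the description in the referenced paper, not the code used by this method: the names `binomial_cdf` and `exceed_confidence` are hypothetical, and the threshold is assumed to be the training score above which the top `contamination` fraction of scores lies.

```python
from math import comb


def binomial_cdf(k, n, p):
    """P(Y <= k) for Y ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def exceed_confidence(train_scores, test_scores, contamination=0.05):
    """Confidence in each anomalous/normal prediction (illustrative sketch)."""
    n = len(train_scores)
    n_anomalies = int(n * contamination)
    # Assumption: scores strictly above this value are predicted anomalous.
    threshold = sorted(train_scores)[n - n_anomalies - 1]
    confidences = []
    for score in test_scores:
        # Bayesian estimate (uniform prior) of the probability that a
        # reference score is lower than or equal to this score.
        n_lower = sum(1 for s in train_scores if s <= score)
        p = (1 + n_lower) / (2 + n)
        # Probability that more than n - n_anomalies of n reference scores
        # fall below this score, i.e. that it would be flagged as anomalous.
        prob_anomaly = 1.0 - binomial_cdf(n - n_anomalies, n, p)
        is_anomaly = score > threshold
        confidences.append(prob_anomaly if is_anomaly else 1.0 - prob_anomaly)
    return confidences
```

Scores far from the decision threshold on either side receive a confidence close to 1, while scores near it receive a lower confidence.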
- predict_proba(X: ndarray) ndarray
Predict anomaly probabilities.
Estimate the probability of a sample of X being anomalous, based on the anomaly scores obtained from decision_function by rescaling them to the range of [0, 1] via min-max scaling.
- Parameters:
X (array-like of shape (n_samples, n_attributes)) – Input time series.
- Returns:
anomaly_scores – 1D array with the same length as X, with values in the interval [0, 1], in which a higher value implies that the instance is more likely to be anomalous.
- Return type:
array-like of shape (n_samples)
- Raises:
ValueError – If scores is not a valid array.
ValueError – If the prediction scores from ‘decision_function’ are constant, but not in the interval [0, 1], because these values cannot be unambiguously transformed to an anomaly probability.
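The min-max rescaling described above can be sketched as follows. The helper `min_max_scale` is hypothetical, and returning constant scores unchanged when they already lie in [0, 1] is an assumption inferred from the second ValueError condition rather than confirmed behavior.

```python
def min_max_scale(scores):
    """Rescale anomaly scores to [0, 1] via min-max scaling (illustrative sketch)."""
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # Constant scores cannot be rescaled; they are only meaningful as
        # probabilities if they already lie in [0, 1] (assumed behavior).
        if not 0.0 <= lowest <= 1.0:
            raise ValueError("Constant scores outside [0, 1] are ambiguous.")
        return list(scores)
    return [(s - lowest) / (highest - lowest) for s in scores]


probabilities = min_max_scale([2.0, 4.0, 6.0])
```

Because the scaling depends on the minimum and maximum of the given scores, the resulting probabilities are relative to the time series passed in.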
- save(path: str | Path) None
Save the detector to disk as a pickle file with extension .dtai. If the given path contains subdirectories that do not exist yet, they are created.
- Parameters:
path (str or Path) – Location where to store the detector.
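The behavior described for save can be approximated with the standard library, as in the sketch below. The helpers `save_object` and `load_object` are hypothetical stand-ins written for illustration; the sketch assumes the caller passes a path that already ends in .dtai and does not reflect how the library itself loads detectors.

```python
import pickle
from pathlib import Path


def save_object(obj, path):
    """Pickle `obj` to `path`, creating missing parent directories first."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as file:
        pickle.dump(obj, file)
    return path


def load_object(path):
    """Read a pickled object back from disk."""
    with open(path, "rb") as file:
        return pickle.load(file)
```

As with any pickle file, such a file should only be loaded when it comes from a trusted source.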