Extensibility
Even though dtaianomaly already offers a lot of functionality, there is always room
for more enhancements. dtaianomaly is designed with flexibility in mind: it is
extremely easy to integrate a new component in dtaianomaly. These new components
can be either existing methods that haven’t been implemented yet, or new state-of-the-art
time series anomaly detection methods. By implementing your new component in dtaianomaly,
you can seamlessly use the existing tools - such as the Pipeline
and Workflow - as if it were a native part of dtaianomaly.
Below, we illustrate how you can implement your own (1) anomaly detector, (2) neural anomaly detector, (3) dataloader, (4) preprocessor, (5) thresholding strategy, and (6) evaluation metric.
Custom anomaly detector
The core functionality of dtaianomaly - time series anomaly detection - is extended
by subclassing BaseDetector. To achieve
this, you need to implement the _fit()
and _decision_function()
methods. Below, we implement an anomaly detector that flags an observation as anomalous when its
distance to the mean value exceeds a specified number of standard deviations
(with three standard deviations, this is known as the 3-sigma rule).
The methods have the following functionality:
_fit(): learn the mean and standard deviation of the training data. These values are stored in the attributes mean_ and std_.
_decision_function(): compute the values whose distance from the learned mean is larger than nb_sigmas times the learned standard deviation. These values are considered anomalies.
>>> from typing import Optional
>>>
>>> import numpy as np
>>> from dtaianomaly.anomaly_detection import BaseDetector, Supervision
>>>
>>> class NbSigmaAnomalyDetector(BaseDetector):
...     nb_sigmas: float
...     mean_: float
...     std_: float
...
...     def __init__(self, nb_sigmas: float = 3.0):
...         super().__init__(Supervision.UNSUPERVISED)
...         self.nb_sigmas = nb_sigmas
...
...     def _fit(self, X: np.ndarray, y: Optional[np.ndarray] = None, **kwargs) -> 'NbSigmaAnomalyDetector':
...         """ Compute the mean and standard deviation of the given time series. """
...         self.mean_ = X.mean()
...         self.std_ = X.std()
...         return self
...
...     def _decision_function(self, X: np.ndarray) -> np.ndarray:
...         """ Compute which values are too far from the mean. """
...         return np.abs(X - self.mean_) > self.nb_sigmas * self.std_
>>>
>>> detector = NbSigmaAnomalyDetector()
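To make the rule itself concrete, the following dependency-free sketch works through the arithmetic on a handful of hand-picked values. It does not use dtaianomaly at all; the variable names and the helper is_anomaly() are illustrative assumptions, and the population standard deviation is used because that is the default of NumPy's std().

```python
import statistics

# Training data: values that are assumed to be normal.
train = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0]
mean = statistics.fmean(train)
std = statistics.pstdev(train)  # population std, like numpy's default X.std()


def is_anomaly(value, nb_sigmas=3.0):
    """Flag a value whose distance to the mean exceeds nb_sigmas * std."""
    return abs(value - mean) > nb_sigmas * std


# Two values close to the training data and one far away from it.
flags = [is_anomaly(value) for value in [10.1, 9.7, 12.5]]
print(flags)
```

Only the last value is flagged: its distance to the mean (2.5) is far larger than three standard deviations (roughly 0.39), while the first two stay within that band.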
Custom neural anomaly detector
While the API above also allows you to implement neural methods, dtaianomaly offers
several base classes that simplify this process. Specifically, you can extend one
of the following classes, depending on how you want your neural network to detect
anomalies:
BaseNeuralForecastingDetector: detect anomalies by forecasting the data and measuring the difference between the predicted values and the actual observations. An example is MultilayerPerceptron.
BaseNeuralReconstructionDetector: reconstruct windows of the data. Windows that are difficult to reconstruct differ from the patterns seen during training and are therefore considered anomalies. An example is AutoEncoder.
Whatever strategy you choose, you only need to implement the _build_architecture()
function. This function receives the input dimension of the time series, and returns
the architecture of your neural network as a torch.nn.Module.
The code below shows a very simple example of this: detecting anomalies using a perceptron. We
train a perceptron to forecast the data and then measure the deviation, hence we extend
the BaseNeuralForecastingDetector class. Specifically,
given a time series with \(D\) attributes and a window size of \(w\), the input is a
flattened \((D \cdot w)\)-array. If we want to forecast \(h\) values in the future
(i.e., the parameter forecast_length), then the output of the perceptron is a
\((D \cdot h)\)-array. The implementation is given below:
>>> import torch
>>> from dtaianomaly.anomaly_detection import BaseNeuralForecastingDetector
>>>
>>> class Perceptron(BaseNeuralForecastingDetector):
...
...     def _build_architecture(self, n_attributes: int) -> torch.nn.Module:
...         return torch.nn.Linear(
...             in_features=n_attributes * self.window_size_,
...             out_features=n_attributes * self.forecast_length
...         )
>>>
>>> perceptron = Perceptron(window_size=16, forecast_length=1)
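The input and output sizes used above can be illustrated without any neural network. The sketch below is only an assumption about how forecasting windows could be built, not dtaianomaly's actual implementation; the function make_forecasting_pairs() is hypothetical. It slides a window of w observations over a series with D attributes and flattens each window into D * w inputs and each forecast horizon into D * h targets.

```python
def make_forecasting_pairs(series, window_size, forecast_length):
    """Build flattened (input, target) pairs from a multivariate series.

    `series` is a list of T observations, each a list of D attribute values.
    """
    pairs = []
    last_start = len(series) - window_size - forecast_length
    for start in range(last_start + 1):
        split = start + window_size
        window = series[start:split]
        future = series[split:split + forecast_length]
        # Flatten the observations: D * window_size inputs, D * forecast_length targets.
        x = [value for observation in window for value in observation]
        y = [value for observation in future for value in observation]
        pairs.append((x, y))
    return pairs


# A toy series with T = 6 observations and D = 2 attributes.
series = [[float(t), float(10 * t)] for t in range(6)]
pairs = make_forecasting_pairs(series, window_size=3, forecast_length=1)
print(len(pairs), len(pairs[0][0]), len(pairs[0][1]))
```

With D = 2, w = 3, and h = 1, every input has 6 values and every target has 2 values, which matches the in_features and out_features of the linear layer.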
If you want more control over your network, you can also directly extend
BaseNeuralDetector, in which case you must
also implement the creation of a torch.utils.data.Dataset as well as the evaluation
and training on a single batch. It is also possible to further customize the training
process by overriding some of the already implemented methods, or by extending
BaseDetector and implementing your network
from scratch!
Custom data loader
Some dataloaders are provided within dtaianomaly, but often we want to detect anomalies
in our own data. Typically, for such custom data, there is no dataloader available within
dtaianomaly. To address this, you can implement a new dataloader by extending
LazyDataLoader and implementing its _load()
method. Upon initialization of the custom data loader, a path parameter is required,
which points to the location of the data. Optionally, you can pass a do_caching parameter
to prevent reading big files multiple times. The _load()
method performs the actual loading and returns a DataSet
object, which combines the data X and ground truth labels y. The public load()
method either loads the data or returns a cached version of it, depending on the
do_caching property.
Implementing a custom dataloader is especially useful for quantitatively evaluating the anomaly
detectors on your own data, as you can pass the loader to a Workflow
and easily analyze multiple detectors simultaneously.
>>> import pandas as pd
>>> from dtaianomaly.data import LazyDataLoader, DataSet
>>>
>>> class SimpleDataLoader(LazyDataLoader):
...     def _load(self) -> DataSet:
...         """ Read a CSV file with the data in column 'X' and the labels in column 'y'. """
...         df = pd.read_csv(self.path)
...         return DataSet(df['X'].values, df['y'].values)
>>>
>>> data_loader = SimpleDataLoader('data')
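The interplay between load(), _load(), and do_caching described above can be sketched in a few lines. The class below is a hypothetical stand-in written for illustration, not the LazyDataLoader from dtaianomaly; the nb_reads counter and the dictionary returned by _load() are assumptions that only exist to make the caching visible.

```python
class CachingLoaderSketch:
    """Hypothetical loader that mimics lazy loading with optional caching."""

    def __init__(self, path, do_caching=False):
        self.path = path
        self.do_caching = do_caching
        self._cache = None
        self.nb_reads = 0  # counts how often the (expensive) read happens

    def _load(self):
        # Stand-in for actually reading the file at self.path.
        self.nb_reads += 1
        return {"path": self.path}

    def load(self):
        if not self.do_caching:
            return self._load()
        if self._cache is None:
            self._cache = self._load()
        return self._cache


cached = CachingLoaderSketch("data", do_caching=True)
cached.load()
cached.load()

uncached = CachingLoaderSketch("data", do_caching=False)
uncached.load()
uncached.load()

print(cached.nb_reads, uncached.nb_reads)
```

With caching enabled the data is read once and reused; without caching every call to load() reads the data again.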
Custom preprocessor
The preprocessors will perform some processing on the time series, after which the transformed
time series can be used for anomaly detection. Below, we implement a custom preprocessor by
extending the Preprocessor class. Our preprocessor
replaces all missing values (i.e., the NaN values) with the mean of the training data.
Specifically, we need to implement the following methods:
_fit(): learns the mean value of the given time series and stores it as the fill_value_ attribute.
_transform(): replaces all missing values in the given time series with the learned mean value. This method returns both a transformed X and y, because some preprocessors also change the labels y (for example, the SamplingRateUnderSampler).
Notice that we implement the _fit() and
_transform() methods (with a starting underscore),
while we can call the fit() and
transform() methods (without the underscore) on
an instance of our Imputer. This is because the public methods will first check if the input
is valid using the check_preprocessing_inputs() method, and
only then call the protected methods with starting underscores, ensuring that valid data is passed
to these methods.
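>>> from typing import Optional, Tuple
>>>
>>> import numpy as np
>>> from dtaianomaly.preprocessing import Preprocessor
>>>
>>> class Imputer(Preprocessor):
...     fill_value_: float
...
...     def _fit(self, X: np.ndarray, y: Optional[np.ndarray] = None) -> 'Preprocessor':
...         # Mean of the observed (non-NaN) values, computed per attribute.
...         self.fill_value_ = np.nanmean(X, axis=0)
...         return self
...
...     def _transform(self, X: np.ndarray, y: Optional[np.ndarray] = None) -> Tuple[np.ndarray, Optional[np.ndarray]]:
...         # np.where broadcasts the per-attribute means and leaves the input array untouched.
...         X = np.where(np.isnan(X), self.fill_value_, X)
...         return X, y
>>>
>>> imputer = Imputer()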
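The split between public and protected methods can be sketched independently of dtaianomaly. The classes below are hypothetical and heavily simplified: the validation in _check_inputs() is an assumption chosen for illustration and is much stricter and simpler than what check_preprocessing_inputs() is likely to do. The point is only that the public methods validate first and then delegate to the underscore methods.

```python
import math


class SketchPreprocessor:
    """Hypothetical base class; not part of dtaianomaly."""

    def fit(self, X):
        self._check_inputs(X)
        return self._fit(X)

    def transform(self, X):
        self._check_inputs(X)
        return self._transform(X)

    @staticmethod
    def _check_inputs(X):
        # Reject anything that is not a non-empty list of floats.
        if not isinstance(X, list) or not X:
            raise ValueError("X must be a non-empty list")
        if not all(isinstance(value, float) for value in X):
            raise ValueError("X must contain only floats")


class MeanImputer(SketchPreprocessor):
    def _fit(self, X):
        # Learn the mean of the observed (non-NaN) values.
        observed = [value for value in X if not math.isnan(value)]
        self.fill_value_ = sum(observed) / len(observed)
        return self

    def _transform(self, X):
        # Replace every NaN by the learned mean; X itself is not modified.
        return [self.fill_value_ if math.isnan(value) else value for value in X]


imputer = MeanImputer().fit([1.0, float("nan"), 3.0])
filled = imputer.transform([float("nan"), 5.0])
print(filled)
```

Because validation lives in the base class, MeanImputer can assume that it only ever receives a non-empty list of floats.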
Custom thresholding
Many anomaly detectors compute continuous anomaly scores (“how anomalous is the sample?”), while
many practical applications prefer binary labels (“is the sample an anomaly?”). Converting the
continuous scores to binary labels can be done via thresholding. The most common thresholding
strategies have already been implemented in dtaianomaly, but it is possible to add a new
thresholding technique, as we do below. For this, we extend the Thresholding
class and implement the threshold() method. Our custom thresholding technique sets a dynamic
threshold, such that observations with an anomaly score larger than a specified number of standard
deviations above the mean anomaly score are considered anomalous.
>>> import numpy as np
>>> from dtaianomaly.thresholding import Thresholding
>>>
>>> class DynamicThreshold(Thresholding):
...     factor: float
...
...     def __init__(self, factor: float):
...         self.factor = factor
...
...     def threshold(self, scores: np.ndarray) -> np.ndarray:
...         threshold = scores.mean() + self.factor * scores.std()
...         return scores > threshold
>>>
>>> dynamic_threshold = DynamicThreshold(1.0)
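As a worked example of this rule, the dependency-free sketch below applies the same formula to six hand-picked anomaly scores with two different factors. The function dynamic_threshold() and the scores are illustrative assumptions; the population standard deviation is used to match the default of NumPy's std().

```python
import statistics


def dynamic_threshold(scores, factor):
    """Flag scores larger than mean + factor * std of all scores."""
    threshold = statistics.fmean(scores) + factor * statistics.pstdev(scores)
    return [score > threshold for score in scores]


scores = [0.1, 0.2, 0.1, 0.9, 0.2, 0.1]
labels_loose = dynamic_threshold(scores, factor=1.0)   # threshold is about 0.55
labels_strict = dynamic_threshold(scores, factor=3.0)  # threshold is about 1.13

print(labels_loose)
print(labels_strict)
```

With a factor of 1.0 only the score 0.9 is flagged, whereas a factor of 3.0 flags nothing. Because the mean and standard deviation are computed on the same scores that are thresholded, a large factor on a short series may never flag any observation.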
Custom evaluation
Various performance metrics exist to evaluate an anomaly detector. There are two types
of metrics in dtaianomaly:
BinaryMetric: the provided anomaly scores must be binary anomaly labels. An example of such a metric is the precision.
ProbaMetric: the provided anomaly scores are expected to be continuous scores. An example of such a metric is the area under the ROC curve (AUC-ROC).
Custom evaluation metrics can be implemented in dtaianomaly. Below, we implement accuracy
by extending the BinaryMetric class (since accuracy requires
binary labels) and implementing the _compute() method.
Similar to the custom preprocessor above, we implement the _compute()
method with starting underscore, while we call the compute()
method to measure the metric. This is because the public compute()
method performs checks on the input, ensuring that valid data is passed to the _compute()
method.
Warning
Anomaly detection is typically a highly imbalanced problem: anomalies are, by definition, rare. Therefore, it is not recommended to use accuracy for evaluating (time series) anomaly detection!
>>> import numpy as np
>>> from dtaianomaly.evaluation import BinaryMetric
>>>
>>> class Accuracy(BinaryMetric):
...     def _compute(self, y_true: np.ndarray, y_pred: np.ndarray, **kwargs) -> float:
...         """ Compute the accuracy. """
...         return np.nanmean(y_true == y_pred)
>>>
>>> accuracy = Accuracy()
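The warning about accuracy can be made tangible with a small, dependency-free sketch. The labels below are made up for illustration: 100 observations of which a single one is anomalous, evaluated against a trivial detector that never predicts an anomaly.

```python
# Ground truth: 99 normal observations (0) and a single anomaly (1).
y_true = [0] * 99 + [1]
# A trivial detector that labels every observation as normal.
y_pred = [0] * 100

# Fraction of observations with a correct label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Fraction of the true anomalies that were actually detected.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(accuracy, recall)
```

The trivial detector reaches an accuracy of 0.99 while detecting none of the anomalies (recall of 0.0), which is why metrics that focus on the anomalous class are usually preferred.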