Custom models
Even though dtaianomaly already offers a lot of functionality, there is always room
for further enhancements. dtaianomaly is designed with flexibility in mind, so
integrating a new component requires only a small amount of code. These new components
can be either existing methods that haven’t been implemented yet, or new state-of-the-art
time series anomaly detection methods. By implementing your new component in dtaianomaly,
you can seamlessly use the existing tools, such as the Pipeline
and Workflow, as if it were a native part of dtaianomaly.
Below, we illustrate how you can implement your own (1) anomaly detector, (2) dataloader, (3) preprocessor, (4) thresholding strategy, and (5) evaluation metric.
Custom anomaly detector
The core functionality of dtaianomaly, time series anomaly detection, is extended
by subclassing BaseDetector. To achieve
this, you need to implement the fit()
and decision_function()
methods. Below, we implement an anomaly detector that flags an observation as anomalous when the distance
between the observation and the mean value exceeds a specified number of standard deviations
(with three standard deviations, this is known as the 3-sigma rule).
The methods have the following functionality:
- fit(): learn the mean and standard deviation of the training data. These values are stored in the attributes mean_ and std_.
- decision_function(): compute the values that have distance larger than nb_sigmas times the learned standard deviation from the learned mean. These values are considered anomalies.
from typing import Optional

import numpy as np
from dtaianomaly.anomaly_detection import BaseDetector


class NbSigmaAnomalyDetector(BaseDetector):
    nb_sigmas: float
    mean_: float
    std_: float

    def __init__(self, nb_sigmas: float = 3.0):
        self.nb_sigmas = nb_sigmas

    def fit(self, X: np.ndarray, y: Optional[np.ndarray] = None) -> 'NbSigmaAnomalyDetector':
        """ Compute the mean and standard deviation of the given time series. """
        self.mean_ = X.mean()
        self.std_ = X.std()
        return self

    def decision_function(self, X: np.ndarray) -> np.ndarray:
        """ Compute which values are too far from the mean. """
        return np.abs(X - self.mean_) > self.nb_sigmas * self.std_
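To make the two steps concrete, the following standalone sketch reproduces the same computation with plain NumPy on a made-up series, without using dtaianomaly. The toy values and the choice nb_sigmas = 2.0 are assumptions for illustration. A value below 3 is used because, with only ten observations, no single value can lie more than sqrt(9) = 3 population standard deviations from the mean. The last line shows one possible continuous variant (the absolute z-score), which is not part of the detector above.

```python
import numpy as np

# Toy univariate series (assumed data): values near 10 with one outlier at index 5.
X = np.array([10.0, 10.2, 9.8, 10.1, 9.9, 25.0, 10.0, 10.3, 9.7, 10.0])

# What fit() learns: the mean and (population) standard deviation.
mean_ = X.mean()
std_ = X.std()

# What decision_function() computes: a boolean flag per observation.
nb_sigmas = 2.0
is_anomaly = np.abs(X - mean_) > nb_sigmas * std_

# Hypothetical continuous alternative: distance to the mean in units of std.
z_scores = np.abs(X - mean_) / std_

print(np.flatnonzero(is_anomaly))  # indices of the flagged observations
```

Only the observation at index 5 exceeds the two-sigma band in this example.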
Custom data loader
Some dataloaders are provided within dtaianomaly, but often we want to detect anomalies
in our own data. Typically, for such custom data, there is no dataloader available within
dtaianomaly. To address this, you can implement a new dataloader by extending
LazyDataLoader and implementing the _load()
method. Upon initialization of the custom data loader, a path parameter is required,
which points to the location of the data. Optionally, you can pass a do_caching parameter
to avoid reading large files multiple times. The _load()
method performs the actual loading and returns a DataSet
object, which combines the data X and the ground truth labels y. The load()
method either loads the data or returns a cached version of it, depending on the
do_caching property.
Implementing a custom dataloader is especially useful for quantitatively evaluating the anomaly
detectors on your own data, as you can pass the loader to a Workflow
and easily analyze multiple detectors simultaneously.
import pandas as pd
from dtaianomaly.data import LazyDataLoader, DataSet


class SimpleDataLoader(LazyDataLoader):

    def _load(self) -> DataSet:
        """ Read a data frame with the data in column 'X' and the labels in column 'y'. """
        # Read the file at self.path (assumed here to be a CSV file).
        df = pd.read_csv(self.path)
        return DataSet(x=df['X'].values, y=df['y'].values)
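The interplay between load(), _load(), and do_caching described above can be sketched without dtaianomaly. The class below is a hypothetical stand-in, not the library's implementation: the name CachingLoaderSketch, the nb_reads counter, and the dummy data are all assumptions made for illustration.

```python
import numpy as np


class CachingLoaderSketch:
    """Hypothetical stand-in showing lazy loading with optional caching."""

    def __init__(self, path: str, do_caching: bool = False):
        self.path = path
        self.do_caching = do_caching
        self._cache = None
        self.nb_reads = 0  # counts how often the (expensive) read happens

    def _load(self):
        # A real loader would read the file at self.path here.
        self.nb_reads += 1
        return np.arange(5, dtype=float), np.zeros(5, dtype=int)

    def load(self):
        if not self.do_caching:
            return self._load()      # always re-read
        if self._cache is None:
            self._cache = self._load()  # read once, then reuse
        return self._cache


cached = CachingLoaderSketch("data.csv", do_caching=True)
cached.load()
cached.load()

uncached = CachingLoaderSketch("data.csv", do_caching=False)
uncached.load()
uncached.load()

print(cached.nb_reads, uncached.nb_reads)
```

With caching enabled the data is read once; without it, every call to load() triggers a new read.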
Custom preprocessor
The preprocessors will perform some processing on the time series, after which the transformed
time series can be used for anomaly detection. Below, we implement a custom preprocessor by
extending the Preprocessor class. Our preprocessor
replaces all missing values (i.e., the NaN values) with the mean of the training data.
Specifically, we need to implement the following methods:
- _fit(): learns the mean value of the given time series and stores it as the fill_value_ attribute.
- _transform(): fills in all missing values in the given time series with the learned mean value. This method returns both a transformed X and y, because some preprocessors also change the labels y (for example, the SamplingRateUnderSampler).
Notice that we implement the _fit() and
_transform() methods (with a starting underscore),
while we can call the fit() and
transform() methods (without the underscore) on
an instance of our Imputer. This is because the public methods will first check if the input
is valid using the check_preprocessing_inputs() method, and
only then call the protected methods with starting underscores, ensuring that valid data is passed
to these methods.
from typing import Optional, Tuple

import numpy as np
from dtaianomaly.preprocessing import Preprocessor


class Imputer(Preprocessor):
    fill_value_: float

    def _fit(self, X: np.ndarray, y: Optional[np.ndarray] = None) -> 'Preprocessor':
        self.fill_value_ = np.nanmean(X, axis=0)
        return self

    def _transform(self, X: np.ndarray, y: Optional[np.ndarray] = None) -> Tuple[np.ndarray, Optional[np.ndarray]]:
        # Replace NaNs without modifying the input; broadcasting also covers multivariate X.
        X = np.where(np.isnan(X), self.fill_value_, X)
        return X, y
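The split between public and protected methods is a template-method pattern, which can be sketched independently of dtaianomaly. In the sketch below, PreprocessorSketch, MeanImputerSketch, and the checks inside _validated() are hypothetical; the actual checks performed by check_preprocessing_inputs() may differ.

```python
import numpy as np


class PreprocessorSketch:
    """Hypothetical base class: public methods validate, protected methods do the work."""

    def fit(self, X, y=None):
        X = self._validated(X, y)
        return self._fit(X, y)

    def transform(self, X, y=None):
        X = self._validated(X, y)
        return self._transform(X, y)

    @staticmethod
    def _validated(X, y):
        # Stand-in for the library's input checks (the exact checks are assumed).
        X = np.asarray(X)
        if not np.issubdtype(X.dtype, np.number):
            raise ValueError("X must be numeric")
        if y is not None and len(y) != len(X):
            raise ValueError("X and y must have the same length")
        return X

    def _fit(self, X, y):
        raise NotImplementedError

    def _transform(self, X, y):
        raise NotImplementedError


class MeanImputerSketch(PreprocessorSketch):

    def _fit(self, X, y):
        self.fill_value_ = np.nanmean(X, axis=0)
        return self

    def _transform(self, X, y):
        return np.where(np.isnan(X), self.fill_value_, X), y


# Users call the public methods; the subclass only sees validated arrays.
imputer = MeanImputerSketch().fit([1.0, np.nan, 3.0])
X_filled, _ = imputer.transform([np.nan, 5.0])
print(X_filled)
```

Because validation lives in the base class, each new preprocessor only needs to implement the two protected methods.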
Custom thresholding
Many anomaly detectors compute continuous anomaly scores (“how anomalous is the sample?”), while
many practical applications prefer binary labels (“is the sample an anomaly?”). Converting the
continuous scores to binary labels can be done via thresholding. The most common thresholding
strategies have already been implemented in dtaianomaly, but it is possible to add a new
thresholding technique, as we do below. For this, we extend the Thresholding
class and implement the threshold() method. Our custom thresholding technique sets a dynamic
threshold, such that observations with an anomaly score larger than a specified number of standard
deviations above the mean anomaly score are considered anomalous.
import numpy as np
from dtaianomaly.thresholding import Thresholding


class DynamicThreshold(Thresholding):
    factor: float

    def __init__(self, factor: float):
        self.factor = factor

    def threshold(self, scores: np.ndarray) -> np.ndarray:
        # The cut-off depends on the scores themselves: mean + factor * std.
        threshold = scores.mean() + self.factor * scores.std()
        return scores > threshold
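As a quick check of the rule, the same computation can be run on a handful of made-up scores using only NumPy. The function name dynamic_threshold, the scores, and factor = 1.5 are assumptions for illustration.

```python
import numpy as np


def dynamic_threshold(scores: np.ndarray, factor: float) -> np.ndarray:
    """Label a score as anomalous if it exceeds mean + factor * std of all scores."""
    cutoff = scores.mean() + factor * scores.std()
    return scores > cutoff


# Assumed anomaly scores: mostly low, one clearly elevated score at index 4.
scores = np.array([0.1, 0.2, 0.15, 0.1, 0.9, 0.2, 0.1, 0.15])
labels = dynamic_threshold(scores, factor=1.5)
print(labels.astype(int))
```

Note that the cut-off is recomputed from whatever scores are passed in, so the same score can receive a different label when it appears in a different series.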
Custom evaluation
Various performance metrics exist to evaluate an anomaly detector. There are two types
of metrics in dtaianomaly:
- BinaryMetric: the provided anomaly scores must be binary anomaly labels. An example of such a metric is the precision.
- ProbaMetric: the provided anomaly scores are expected to be continuous scores. An example of such a metric is the area under the ROC curve (AUC-ROC).
Custom evaluation metrics can be implemented in dtaianomaly. Below, we implement accuracy
by extending the BinaryMetric class (since accuracy requires
binary labels) and implementing the _compute() method.
Similar to the custom preprocessor above, we implement the _compute()
method with a leading underscore, while we call the compute()
method to measure the metric. This is because the public compute()
method performs checks on the input, ensuring that valid data is passed to the _compute()
method.
Warning
Anomaly detection is typically a highly unbalanced problem: anomalies are, by definition, rare. Therefore, it is not recommended to use accuracy for evaluating (time series) anomaly detection!
import numpy as np
from dtaianomaly.evaluation import BinaryMetric


class Accuracy(BinaryMetric):

    def _compute(self, y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """ Compute the accuracy: the fraction of labels predicted correctly. """
        return np.nanmean(y_true == y_pred)
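The warning above can be illustrated with a small NumPy-only example. The data is made up: 1000 observations of which 10 are anomalies, and a trivial "detector" that never predicts an anomaly.

```python
import numpy as np

# Assumed ground truth: 10 anomalies among 1000 observations.
y_true = np.zeros(1000, dtype=int)
y_true[::100] = 1

# A useless detector that labels everything as normal.
y_pred = np.zeros(1000, dtype=int)

# Accuracy, computed as in the Accuracy metric above.
accuracy = np.mean(y_true == y_pred)

# Recall: fraction of true anomalies that were detected.
true_positives = np.sum((y_true == 1) & (y_pred == 1))
recall = true_positives / np.sum(y_true == 1)

print(accuracy, recall)
```

The detector reaches an accuracy of 0.99 while its recall is 0.0: it finds none of the anomalies, which is why accuracy alone says little on unbalanced data.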