Data module

This module contains functionality to dynamically load data when executing a pipeline or workflow. It can be imported as follows:

>>> from dtaianomaly import data

Custom data loaders can be implemented by extending LazyDataLoader.

class dtaianomaly.data.LazyDataLoader(path: str | Path, do_caching: bool = False)

A lazy data loader for anomaly detection workflows.

This is a data loading utility to point towards a specific data set (with path) and to load it at a later point in time during execution of a workflow.

This way we limit memory usage and allow for virtually unlimited scaling of the number of data sets in a workflow.

Parameters:

path (str or Path) – Path to the relevant data set.
do_caching (bool, default=False) – Whether to cache the loaded data or not.

cache_

Cached version of the loaded data set. Only available if do_caching==True and the data has been loaded before.

Type: DataSet

Raises: FileNotFoundError – If the given path does not point to an existing file or directory.

load() → DataSet

Load the data set. If do_caching==True, the loaded data is stored in the cache if no cache is available yet, and the cached data is returned.

Returns: data_set – The loaded data set.
Return type: DataSet
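The caching behaviour of load() can be illustrated with a small stand-alone sketch. It does not use the library: the class name LazyLoaderSketch, the _read() hook, and the nb_reads counter are hypothetical and exist only for this example, and the data is a plain list of floats rather than a DataSet.

```python
from pathlib import Path


class LazyLoaderSketch:
    """Illustrative stand-in for a lazy data loader; not the library class."""

    def __init__(self, path, do_caching=False):
        # Fail early if the path does not exist, mirroring the documented
        # FileNotFoundError.
        if not Path(path).exists():
            raise FileNotFoundError(f"No such file or directory: {path}")
        self.path = Path(path)
        self.do_caching = do_caching
        self.nb_reads = 0  # bookkeeping for this example only

    def _read(self):
        # Hypothetical hook: a real loader would parse the file into a DataSet.
        self.nb_reads += 1
        return [float(value) for value in self.path.read_text().split()]

    def load(self):
        if not self.do_caching:
            return self._read()
        # Fill the cache on first use, then keep returning the cached object.
        if not hasattr(self, "cache_"):
            self.cache_ = self._read()
        return self.cache_
```

With do_caching=True the file is read once and later calls return the object stored in cache_. Without caching, every call reads the file again and no cache_ attribute is created.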

class dtaianomaly.data.DataSet(X_test: ndarray, y_test: ndarray, X_train: ndarray = None, y_train: ndarray = None)

A class for time series anomaly detection data sets. These consist of the raw data for training and testing anomaly detectors, as well as the respective ground truth labels.

Parameters:

X_test (array-like of shape (n_samples_test, n_attributes)) – The test time series data.
y_test (array-like of shape (n_samples_test)) – The ground truth anomaly labels of the test data.
X_train (array-like of shape (n_samples_train, n_attributes), optional) – The train time series. If not given, then the test data will be used for training and the data is only compatible with unsupervised anomaly detectors.
y_train (array-like of shape (n_samples_train), optional) – The ground truth anomaly labels of the training data. If not given, either the train data should not be given either, or the train data is assumed to consist of only normal data.

static check_is_valid(X_test: ndarray, y_test: ndarray, X_train: ndarray | None, y_train: ndarray | None) → None

Checks if the given elements refer to a valid DataSet. If they would not give a valid DataSet, a ValueError is raised.

Parameters:

X_test (array-like of shape (n_samples_test, n_attributes)) – The test time series data.
y_test (array-like of shape (n_samples_test)) – The ground truth anomaly labels of the test data.
X_train (array-like of shape (n_samples_train, n_attributes) or None) – The train time series data. Note that, even though X_train can be None, it must be provided.
y_train (array-like of shape (n_samples_train) or None) – The ground truth anomaly labels of the train data. Note that, even though y_train can be None, it must be provided.

Raises:

ValueError – If the given variables would not lead to a valid DataSet. This is the case if any of the following holds:

X_test or y_test is not a valid array-like.
y_test is not univariate or contains a value different from 0 or 1.
X_test and y_test consist of a different number of samples.
X_train is not None, but it is not a valid array-like.
X_train is not None and consists of a different number of attributes than X_test.
y_train is not None but X_train is None.
y_train is not None, but it is not a valid array-like.
y_train is not None, but it is not univariate or contains a value different from 0 or 1.
y_train is not None but consists of a different number of samples than X_train.
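The shape and label rules above can be sketched in plain Python. This is not the library implementation: check_is_valid_sketch is a hypothetical name, the inputs are nested lists instead of arrays (one list of attributes per sample), and the "valid array-like" checks are left out.

```python
def check_is_valid_sketch(X_test, y_test, X_train=None, y_train=None):
    """Raise ValueError if the inputs break one of the shape or label rules.

    Simplification: X_* are lists of rows (one list of attributes per sample)
    and y_* are flat lists of labels.
    """

    def check_labels(y, name):
        # A nested (non-univariate) label or any value other than 0/1 fails.
        if any(label not in (0, 1) for label in y):
            raise ValueError(f"{name} must only contain the labels 0 and 1")

    def nb_attributes(X):
        return len(X[0]) if X else 0

    check_labels(y_test, "y_test")
    if len(X_test) != len(y_test):
        raise ValueError("X_test and y_test differ in number of samples")

    if X_train is not None and nb_attributes(X_train) != nb_attributes(X_test):
        raise ValueError("X_train and X_test differ in number of attributes")

    if y_train is not None:
        if X_train is None:
            raise ValueError("y_train was given without X_train")
        check_labels(y_train, "y_train")
        if len(X_train) != len(y_train):
            raise ValueError("X_train and y_train differ in number of samples")
```

The function returns None when all checks pass, matching the documented return type of check_is_valid().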

compatible_supervision() → List[Supervision]

Get the compatible supervision types for this data set.

Returns:

compatible_types – A list containing the compatible types for this data set. The following supervision types can be compatible:

Supervision.UNSUPERVISED: Always compatible.
Supervision.SEMI_SUPERVISED: Compatible if and only if there is some training data given (which is assumed to be normal).
Supervision.SUPERVISED: Only compatible if both training data and training labels are provided.

Return type:

list of Supervision

is_compatible(detector: BaseDetector) → bool

Checks if the given anomaly detector is compatible with this DataSet.

Parameters:

detector (BaseDetector) – The anomaly detector to check if it is compatible with this DataSet.

Returns:

is_compatible – True if and only if the given anomaly detector is compatible with this DataSet. Which detectors are compatible depends on the available training data:

If this DataSet contains neither training data nor training labels, only unsupervised anomaly detectors are compatible.
If this DataSet contains training data but no training labels, unsupervised and semi-supervised anomaly detectors are compatible.
If this DataSet contains training data and training labels, supervised, unsupervised and semi-supervised anomaly detectors are compatible.

Return type:

bool
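The rules documented for compatible_supervision() and is_compatible() amount to a small decision table, sketched below. The names SupervisionSketch, compatible_supervision_sketch and is_compatible_sketch are hypothetical stand-ins that follow the documented rules only. How the library reads a detector's supervision type is not covered here, so the sketch takes it as an argument.

```python
from enum import Enum


class SupervisionSketch(Enum):
    # Stand-in for the library's Supervision enum, for this example only.
    UNSUPERVISED = 1
    SEMI_SUPERVISED = 2
    SUPERVISED = 3


def compatible_supervision_sketch(has_train_data, has_train_labels):
    """Return the supervision types a data set supports, per the rules above."""
    compatible = [SupervisionSketch.UNSUPERVISED]  # always compatible
    if has_train_data:
        # Training data without labels is assumed to be normal.
        compatible.append(SupervisionSketch.SEMI_SUPERVISED)
        if has_train_labels:
            compatible.append(SupervisionSketch.SUPERVISED)
    return compatible


def is_compatible_sketch(detector_supervision, has_train_data, has_train_labels):
    """A detector fits if its supervision type is among the compatible ones."""
    return detector_supervision in compatible_supervision_sketch(
        has_train_data, has_train_labels
    )
```

For example, a supervised detector is rejected for a data set that has training data but no training labels, while an unsupervised detector is accepted for every data set.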

is_valid() → bool

Checks whether this DataSet is valid or not.

Returns: is_valid – True if and only if this instance is valid, i.e., if the attributes X_test, y_test, X_train and y_train of this instance pass all the checks of check_is_valid().
Return type: bool

Synthetic data

dtaianomaly.data.demonstration_time_series() → (ndarray, ndarray)

Generate a time series for demonstration purposes. This is a noisy sine wave with one valley that is deeper than the other ones.

Returns:

x (np.ndarray of shape (nb_samples)) – The raw time series data
y (np.ndarray of shape (nb_samples)) – The ground truth labels

Figure: the demonstration time series (Demonstration-time-series.svg).

dtaianomaly.data.make_sine_wave(nb_samples: int, amplitude: float = 1.0, frequency: float = 5.0, phase: float = 0.0, noise_level: float = 0.2, seed: int = None, **kwargs) → (ndarray, ndarray)

Generate a random sine wave and inject anomalies into it.

Parameters:

nb_samples (int) – The length of the sine wave.
amplitude (float, default=1.0) – The amplitude of the sine wave, the max absolute value of the sine wave.
frequency (float, default=5.0) – The frequency of the sine wave, the number of oscillations.
phase (float, default=0.0) – The phase of the sine wave, where the oscillation starts.
noise_level (float, default=0.2) – The amount of Gaussian noise to add to the time series.
seed (int, default=None) – The seed for generating a random sine wave. If no value is provided, the generated sine wave differs between calls.
**kwargs – Parameters to pass to the inject_anomalies method.

Returns:

x (np.ndarray of shape (nb_samples)) – The raw time series data
y (np.ndarray of shape (nb_samples)) – The ground truth labels
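How the parameters shape the signal can be sketched with the standard library alone. This is not the library code: noisy_sine_sketch is a hypothetical name, it returns a plain list, it injects no anomalies, and it assumes that time runs from 0 to 1 over the series so that frequency counts full oscillations. Treating noise_level as the standard deviation of the Gaussian noise is also an assumption.

```python
import math
import random


def noisy_sine_sketch(nb_samples, amplitude=1.0, frequency=5.0, phase=0.0,
                      noise_level=0.2, seed=None):
    """Return a noisy sine wave as a list of floats (no anomalies injected)."""
    rng = random.Random(seed)  # seed=None gives a different series per call
    series = []
    for i in range(nb_samples):
        # Assumption: time runs from 0 to 1 over the whole series, so
        # `frequency` equals the number of full oscillations.
        t = i / max(nb_samples - 1, 1)
        clean = amplitude * math.sin(2 * math.pi * frequency * t + phase)
        # Assumption: noise_level is the standard deviation of the noise.
        series.append(clean + rng.gauss(0.0, noise_level))
    return series
```

Passing the same seed twice yields the same series, which is what makes experiments on synthetic data repeatable.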

Loading data

class dtaianomaly.data.UCRLoader(path: str | Path, do_caching: bool = False)

Lazy dataloader for the UCR suite of anomaly detection data sets.

This implementation expects the file names to contain the start and stop time stamps of the single anomaly in the time series as: *_<train-test-split>_<start>_<stop>.txt.
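The file name convention can be illustrated with a small parser. parse_ucr_file_name is a hypothetical helper, not part of the library. It only extracts the three trailing numbers and makes no assumption about whether the indices are zero-based or inclusive.

```python
from pathlib import Path


def parse_ucr_file_name(path):
    """Return (train_test_split, start, stop) parsed from a UCR-style name."""
    # The last three underscore-separated fields of the stem hold the numbers.
    fields = Path(path).stem.split("_")
    if len(fields) < 4:
        raise ValueError(f"Unexpected file name: {path}")
    train_test_split, start, stop = (int(field) for field in fields[-3:])
    return train_test_split, start, stop


# A file name following the *_<train-test-split>_<start>_<stop>.txt pattern.
print(parse_ucr_file_name("001_UCR_Anomaly_DISTORTED1sddb40_35000_52000_52620.txt"))
# → (35000, 52000, 52620)
```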

dtaianomaly.data.from_directory(directory: str | Path, dataloader: Type[LazyDataLoader]) → List[LazyDataLoader]

Construct a LazyDataLoader instance for every file in the given directory.

Parameters:

directory (str or Path) – Path to the directory in question.
dataloader (subclass of LazyDataLoader) – The data loader class (not an instance), which is called to construct each data loader instance.

Returns:

data_loaders – A list of the initialized data loaders, one for each data set in the given directory.

Return type:

List[LazyDataLoader]

Raises:

FileNotFoundError – If the directory cannot be found.
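The idea behind from_directory() can be sketched as follows. from_directory_sketch is a hypothetical stand-in, not the library function. It assumes the loader class accepts a path as its first argument, and it sorts the entries for a reproducible order, which the library may or may not do.

```python
from pathlib import Path


def from_directory_sketch(directory, dataloader):
    """Build one `dataloader` instance per entry in `directory`."""
    directory = Path(directory)
    if not directory.is_dir():
        raise FileNotFoundError(f"No such directory: {directory}")
    # Sort for a reproducible order; each entry path is passed to the class.
    return [dataloader(entry) for entry in sorted(directory.iterdir())]
```

Because the loaders are lazy, building the list only records paths. No data is read until load() is called on an individual loader. Going by the signature above, the corresponding library call passes the class itself as the second argument, for example UCRLoader.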