Data module
This module contains functionality to dynamically load data when executing a pipeline or workflow. It can be imported as follows:
>>> from dtaianomaly import data
Custom data loaders can be implemented by extending LazyDataLoader.
- class dtaianomaly.data.LazyDataLoader(do_caching: bool = False)
A lazy dataloader for anomaly detection workflows
This is a data loading utility to point towards a specific data set and to load it at a later point in time during execution of a workflow.
This way we limit memory usage and allow for virtually unlimited scaling of the number of data sets in a workflow.
- Parameters:
do_caching (bool, default=False) – Whether to cache the loaded data or not
- cache_
Cached version of the loaded data set. Only available if do_caching==True and the data has been loaded before.
- Type:
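The lazy loading and optional caching described above can be sketched in plain Python. This is a hypothetical stand-in, not the library's implementation: the names SketchLazyLoader, CountingLoader, load and _load are assumptions made for illustration, and a list of floats stands in for the loaded data set.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class SketchLazyLoader(ABC):
    """Hypothetical stand-in that mimics lazy loading with optional caching."""

    def __init__(self, do_caching: bool = False):
        self.do_caching = do_caching
        # Filled only when caching is enabled and the data has been loaded once.
        self.cache_: Optional[List[float]] = None

    def load(self) -> List[float]:
        # Reuse the cached data if caching is enabled and a previous load filled it.
        if self.do_caching and self.cache_ is not None:
            return self.cache_
        data = self._load()
        if self.do_caching:
            self.cache_ = data
        return data

    @abstractmethod
    def _load(self) -> List[float]:
        """Read the data from its source; implemented by subclasses."""


class CountingLoader(SketchLazyLoader):
    """Toy loader that counts how often the underlying source is read."""

    def __init__(self, do_caching: bool = False):
        super().__init__(do_caching)
        self.reads = 0

    def _load(self) -> List[float]:
        self.reads += 1
        return [0.0, 1.0, 0.5]


cached = CountingLoader(do_caching=True)
cached.load()
cached.load()

uncached = CountingLoader(do_caching=False)
uncached.load()
uncached.load()

print(cached.reads, uncached.reads)
```

Nothing is read when a loader is constructed, so many loaders can be created up front while only the data set currently in use occupies memory.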
- class dtaianomaly.data.DataSet(X_test: ndarray, y_test: array, X_train: ndarray = None, y_train: array = None, feature_names: List[str] = None, time_steps_test: array = None, time_steps_train: array = None)
A class for time series anomaly detection data sets. These consist of the raw data for training and testing anomaly detectors, as well as the respective ground truth labels.
- Parameters:
X_test (array-like of shape (n_samples_test, n_attributes)) – The test time series data.
y_test (array-like of shape (n_samples_test)) – The ground truth anomaly labels of the test data.
X_train (array-like of shape (n_samples_train, n_attributes), default=None) – The train time series. If not given, then the test data will be used for training and the data is only compatible with unsupervised anomaly detectors.
y_train (array-like of shape (n_samples_train), default=None) – The ground truth anomaly labels of the training data. If not given, then either no train data is given, or the train data is assumed to consist of only normal data.
feature_names (list of str, default=None) – The name of each feature in the data. The number of names must be identical to the number of actual features. If None, then the data is assumed to be unnamed.
time_steps_test (array-like of shape (n_samples_test), default=None) – The time steps corresponding to the test data. If None, then no time steps are known.
time_steps_train (array-like of shape (n_samples_train), default=None) – The time steps corresponding to the train data. If None, then no time steps are known. Can only be provided if training data is actually given (X_train != None).
- static check_is_valid(X_test: ndarray, y_test: ndarray, X_train: ndarray | None, y_train: ndarray | None) -> None
Checks if the given elements refer to a valid DataSet. If the elements would not give a valid DataSet, then a ValueError is raised.
- Parameters:
X_test (array-like of shape (n_samples_test, n_attributes)) – The test time series data.
y_test (array-like of shape (n_samples_test)) – The ground truth anomaly labels of the test data.
X_train (array-like of shape (n_samples_train, n_attributes) or None) – The train time series data. Note that, even though X_train can be None, it must be provided.
y_train (array-like of shape (n_samples_train) or None) – The ground truth anomaly labels of the train data. Note that, even though y_train can be None, it must be provided.
- Raises:
ValueError – If the given variables would not lead to a valid DataSet. This is the case when any of the following holds:
X_test or y_test is not a valid array-like.
y_test is not univariate or has a value different from 0 or 1.
X_test and y_test consist of a different number of samples.
X_train is not None, but it is not a valid array-like.
X_train is not None and consists of a different number of attributes than X_test.
y_train is not None but X_train is None.
y_train is not None but it is not a valid array-like.
y_train is not None, but it is not univariate or has a value different from 0 or 1.
y_train is not None but consists of a different number of samples than X_train.
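A subset of these rules can be sketched with plain Python lists. This is an illustrative approximation rather than the library's check: sketch_check_is_valid and _check_labels are hypothetical names, the data is assumed to be a non-empty rectangular list of rows, and the "valid array-like" checks are not reproduced.

```python
from typing import Optional, Sequence

Series = Sequence[Sequence[float]]  # rows are samples, columns are attributes


def _check_labels(name: str, y: Sequence[int], n_samples: int) -> None:
    # Labels must be binary, with exactly one label per sample.
    if any(label not in (0, 1) for label in y):
        raise ValueError(f"{name} has a value different from 0 or 1")
    if len(y) != n_samples:
        raise ValueError(f"{name} does not match the number of samples")


def sketch_check_is_valid(
    X_test: Series,
    y_test: Sequence[int],
    X_train: Optional[Series],
    y_train: Optional[Sequence[int]],
) -> None:
    """Raise ValueError if the four elements break one of the sketched rules."""
    _check_labels("y_test", y_test, len(X_test))
    if X_train is None:
        # Training labels without training data are not allowed.
        if y_train is not None:
            raise ValueError("y_train is given but X_train is None")
        return
    # Train and test data must have the same number of attributes.
    if len(X_train[0]) != len(X_test[0]):
        raise ValueError("X_train and X_test differ in number of attributes")
    if y_train is not None:
        _check_labels("y_train", y_train, len(X_train))


X_test = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
y_test = [0, 1, 0]
X_train = [[0.0, 0.1], [0.2, 0.3]]

# Both calls pass silently: train data without labels is allowed.
sketch_check_is_valid(X_test, y_test, None, None)
sketch_check_is_valid(X_test, y_test, X_train, None)
```

Both X_train and y_train are positional arguments without defaults in this sketch, mirroring the note that they must be provided even when they are None.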
- compatible_supervision() -> List[Supervision]
Get the compatible supervision types for this data set.
- Returns:
compatible_types – A list containing the compatible types for this data set. The following supervision types can be compatible:
Supervision.UNSUPERVISED: Always compatible.
Supervision.SEMI_SUPERVISED: Compatible if and only if there is some training data given (which is assumed to be normal).
Supervision.SUPERVISED: Only compatible if both training data and training labels are provided.
- Return type:
list of Supervision
- is_compatible(detector: BaseDetector) -> bool
Checks if the given anomaly detector is compatible with this DataSet.
- Parameters:
detector (BaseDetector) – The anomaly detector to check for compatibility with this DataSet.
- Returns:
is_compatible – True if and only if the given anomaly detector is compatible with this DataSet. Compatibility depends on the available training data:
If this DataSet contains neither training data nor training labels, only unsupervised anomaly detectors are compatible.
If this DataSet contains training data but no training labels, unsupervised and semi-supervised anomaly detectors are compatible.
If this DataSet contains training data and training labels, supervised, unsupervised and semi-supervised anomaly detectors are compatible.
- Return type:
bool
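The compatibility rules of compatible_supervision() and is_compatible() can be summarized in a small stand-alone sketch. The enum and both functions below are hypothetical stand-ins: the real methods operate on a DataSet and a BaseDetector, whereas this sketch takes two booleans and the detector's supervision type directly.

```python
from enum import Enum
from typing import List


class SketchSupervision(Enum):
    """Hypothetical stand-in for the library's Supervision type."""
    UNSUPERVISED = 1
    SEMI_SUPERVISED = 2
    SUPERVISED = 3


def sketch_compatible_supervision(
    has_train_data: bool, has_train_labels: bool
) -> List[SketchSupervision]:
    # Unsupervised detectors are always compatible.
    compatible = [SketchSupervision.UNSUPERVISED]
    if has_train_data:
        # Training data without labels is assumed to be normal.
        compatible.append(SketchSupervision.SEMI_SUPERVISED)
        if has_train_labels:
            compatible.append(SketchSupervision.SUPERVISED)
    return compatible


def sketch_is_compatible(
    detector_supervision: SketchSupervision,
    has_train_data: bool,
    has_train_labels: bool,
) -> bool:
    # A detector is compatible when its supervision type is in the list.
    return detector_supervision in sketch_compatible_supervision(
        has_train_data, has_train_labels
    )


only_test = sketch_compatible_supervision(False, False)
train_no_labels = sketch_compatible_supervision(True, False)
train_and_labels = sketch_compatible_supervision(True, True)
print(len(only_test), len(train_no_labels), len(train_and_labels))
```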
- is_valid() -> bool
Checks whether this DataSet is valid or not.
- Returns:
is_valid – True if and only if this instance is valid, i.e., if the attributes X_test, y_test, X_train and y_train of this instance pass all the checks of check_is_valid().
- Return type:
bool
Demonstration time series
- dtaianomaly.data.demonstration_time_series() -> (ndarray, ndarray)
Generate a time series for demonstration purposes. This is a noisy sine wave with one valley that is deeper than the other ones.
- Returns:
x (np.ndarray of shape (nb_samples)) – The raw time series data
y (np.ndarray of shape (nb_samples)) – The ground truth labels
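To make the shape of such a series concrete, the sketch below builds a noisy sine wave in which one valley is deeper than the others. It is not the library's generator: the function name, all parameters (period, noise level, which valley is deepened) and the choice to label exactly the deepened valley as anomalous are assumptions, and plain lists are used instead of NumPy arrays.

```python
import math
import random
from typing import List, Tuple


def sketch_demonstration_series(
    nb_samples: int = 1000,
    period: int = 100,
    anomalous_valley: int = 5,
    noise_std: float = 0.05,
    seed: int = 0,
) -> Tuple[List[float], List[int]]:
    """Noisy sine wave whose valley number `anomalous_valley` is twice as deep."""
    rng = random.Random(seed)
    x: List[float] = []
    y: List[int] = []
    for t in range(nb_samples):
        phase = t % period
        value = math.sin(2 * math.pi * phase / period)
        # The sine is negative in the second half of each period (the valley).
        in_deep_valley = (t // period == anomalous_valley) and phase > period // 2
        if in_deep_valley:
            value *= 2.0  # deepen this single valley
        x.append(value + rng.gauss(0.0, noise_std))
        y.append(1 if in_deep_valley else 0)
    return x, y


x, y = sketch_demonstration_series()
deepest_index = x.index(min(x))
print(len(x), sum(y), y[deepest_index])
```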
- class dtaianomaly.data.DemonstrationTimeSeriesLoader(do_caching: bool = False)
A data loader object to load the demonstration time series.
Loading data
- class dtaianomaly.data.PathDataLoader(path: str | Path, do_caching: bool = False)
A dataloader which reads data from a given path. The data loader will load the data that is stored at that path.
- Parameters:
path (str or Path) – The path at which the data set is located.
- Raises:
FileNotFoundError – If the given path does not point to an existing file or directory.
- dtaianomaly.data.from_directory(directory: str | Path, dataloader: Type[PathDataLoader], **kwargs) -> List[PathDataLoader]
Construct a PathDataLoader instance for every file in the given directory.
- Parameters:
directory (str or Path) – Path to the directory in question
dataloader (Type[PathDataLoader]) – The data loader class (not an instance), which is called to construct each data loader instance.
**kwargs – Additional arguments to be passed to the dataloader
- Returns:
data_loaders – A list of the initialized data loaders, one for each data set in the given directory.
- Return type:
List[PathDataLoader]
- Raises:
FileNotFoundError – If directory cannot be found
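The behavior described for from_directory can be approximated with pathlib. The sketch below uses hypothetical stand-ins (SketchPathLoader and sketch_from_directory) instead of the library's classes, creates one loader per directory entry in sorted order, and forwards the extra keyword arguments to each loader; the ordering is an assumption.

```python
import tempfile
from pathlib import Path
from typing import List, Type, Union


class SketchPathLoader:
    """Hypothetical stand-in for a loader that points at a path."""

    def __init__(self, path: Union[str, Path], do_caching: bool = False):
        self.path = Path(path)
        if not self.path.exists():
            raise FileNotFoundError(f"No such file or directory: {self.path}")
        self.do_caching = do_caching


def sketch_from_directory(
    directory: Union[str, Path],
    dataloader: Type[SketchPathLoader],
    **kwargs,
) -> List[SketchPathLoader]:
    directory = Path(directory)
    if not directory.is_dir():
        raise FileNotFoundError(f"No such directory: {directory}")
    # One loader per entry; nothing is read from disk at this point.
    return [dataloader(entry, **kwargs) for entry in sorted(directory.iterdir())]


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt"):
        (Path(tmp) / name).write_text("0\n")
    loaders = sketch_from_directory(tmp, SketchPathLoader, do_caching=True)
    names = [loader.path.name for loader in loaders]

print(names)
```

Passing the class rather than an instance lets the function construct a separate loader for each path while sharing the same keyword arguments.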
- class dtaianomaly.data.UCRLoader(path: str | Path, do_caching: bool = False)
Lazy dataloader for the UCR suite of anomaly detection data sets [20].
This implementation expects the file names to contain the start and stop time stamps of the single anomaly in the time series as:
*_<train-test-split>_<start>_<stop>.txt.
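A file name following this pattern can be parsed as sketched below. The helper sketch_parse_ucr_name and the example file name are made up for illustration; how UCRLoader itself extracts these values is not shown in this section.

```python
from pathlib import Path
from typing import Tuple, Union


def sketch_parse_ucr_name(path: Union[str, Path]) -> Tuple[int, int, int]:
    """Return (train_test_split, start, stop) taken from the file name."""
    # Drop the directory and the .txt extension, then split on underscores.
    parts = Path(path).stem.split("_")
    if len(parts) < 4:
        raise ValueError(f"Unexpected file name: {path}")
    # The last three fields are the split point and the anomaly boundaries.
    train_test_split, start, stop = (int(part) for part in parts[-3:])
    return train_test_split, start, stop


# Hypothetical file name that follows the documented pattern.
parsed = sketch_parse_ucr_name("data/example_series_100_250_300.txt")
print(parsed)
```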