Workflow module

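This module provides the Workflow class for running anomaly detection experiments, together with functions for constructing a Workflow from a configuration file.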

>>> from dtaianomaly import workflow

Below we illustrate how a simple workflow can be initialized, which will apply Matrix Profile and Isolation Forest on a dataset from the UCR archive, and compute the area under the ROC and PR curves:

>>> from dtaianomaly.data import UCRLoader
>>> from dtaianomaly.anomaly_detection import MatrixProfileDetector, IsolationForest
>>> from dtaianomaly.evaluation import AreaUnderROC, AreaUnderPR
>>> my_workflow = workflow.Workflow(
...     dataloaders=[
...         UCRLoader(path='data/UCR-time-series-anomaly-archive/001_UCR_Anomaly_DISTORTED1sddb40_35000_52000_52620.txt'),
...     ],
...     detectors=[MatrixProfileDetector(window_size=100), IsolationForest(15)],
...     metrics=[AreaUnderROC(), AreaUnderPR()]
... )

We refer to the documentation for more information regarding the configuration and use of a Workflow.

class dtaianomaly.workflow.Workflow(dataloaders: LazyDataLoader | List[LazyDataLoader], metrics: Metric | List[Metric], detectors: BaseDetector | List[BaseDetector], preprocessors: Preprocessor | List[Preprocessor] = None, thresholds: Thresholding | List[Thresholding] = None, n_jobs: int = 1, trace_memory: bool = False, error_log_path: str = './error_logs', fit_unsupervised_on_test_data: bool = False, fit_semi_supervised_on_test_data: bool = False, show_progress: bool = False)[source]

Run anomaly detection experiments

Run all combinations of dataloaders, preprocessors, detectors, and metrics. Metrics that require a thresholding operation are combined with every element of thresholds. If an error occurs while loading data or executing an anomaly detector, the error is written to an error file in error_log_path. This file is an executable Python file that reproduces the error.
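The grid of runs described above can be pictured with a small, library-independent sketch. The strings below stand in for real component objects, and enumerate_runs is a hypothetical helper rather than part of dtaianomaly. It only illustrates how every dataloader is paired with every preprocessor and detector.

```python
from itertools import product


def enumerate_runs(dataloaders, preprocessors, detectors):
    """Return every (dataloader, preprocessor, detector) combination.

    Hypothetical helper for illustration only; this is not how
    dtaianomaly is implemented internally.
    """
    return list(product(dataloaders, preprocessors, detectors))


# Plain strings stand in for the real component objects.
runs = enumerate_runs(
    dataloaders=["dataset_a", "dataset_b"],
    preprocessors=["identity", "z_normalize"],
    detectors=["matrix_profile", "isolation_forest"],
)

for dataloader, preprocessor, detector in runs:
    print(dataloader, preprocessor, detector)
```

With two entries per component, the sketch yields 2 x 2 x 2 = 8 runs, so the number of runs grows multiplicatively with each added component.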

Parameters:
  • dataloaders (LazyDataLoader or list of LazyDataLoader) – The dataloaders used to load the data on which the pipelines in this workflow are evaluated.

  • metrics (Metric or list of Metric) – The metrics to evaluate within this workflow.

  • detectors (BaseDetector or list of BaseDetector) – The anomaly detectors to evaluate.

  • thresholds (Thresholding or list of Thresholding, default=None) – The thresholds used to convert continuous anomaly scores into binary anomaly predictions. Each threshold is combined with each BinaryMetric given via the metrics parameter. Thresholds do not apply to a ProbaMetric. If None or an empty list, then every metric given via the metrics parameter must be a ProbaMetric; otherwise, a ValueError is raised.

  • preprocessors (Preprocessor or list of Preprocessor, default=None) – The preprocessors to apply before evaluating the model. If None or an empty list, then no preprocessing is done, which is equivalent to using dtaianomaly.preprocessing.Preprocessor as the preprocessor for each pipeline.

  • n_jobs (int, default=1) – Number of processes to run in parallel while evaluating all combinations.

  • trace_memory (bool, default=False) – Whether or not memory usage of each run is reported. While this might give additional insights into the models, their runtime will be higher due to additional internal bookkeeping.

  • error_log_path (str, default='./error_logs') – The path in which the error logs should be saved.

  • fit_unsupervised_on_test_data (bool, default=False) – Whether to fit the unsupervised anomaly detectors on the test data. If True, then the test data will be used to fit the detector and to evaluate the detector. This is no issue, since unsupervised detectors do not use labels and can deal with anomalies in the training data.

  • fit_semi_supervised_on_test_data (bool, default=False) – Whether to fit the semi-supervised anomaly detectors on the test data. If True, then the test data will be used both to fit and to evaluate the detector. This breaks the assumption of semi-supervised methods that the training data is normal, but it does not leak label information, because these methods do not use the training labels themselves.

  • show_progress (bool, default=False) –

    Whether to show the progress using a TQDM progress bar or not.

    Note

    Ensure that tqdm is installed for this (it is not part of the core dependencies of dtaianomaly). Otherwise, no progress bar will be shown.
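The rule stated for the thresholds parameter can be sketched without the library. The two classes below are local stand-ins that only borrow the names ProbaMetric and BinaryMetric, combine_metrics is a hypothetical helper, and the sketch assumes both arguments are already lists.

```python
from itertools import product


class ProbaMetric:
    """Stand-in for a metric computed directly on continuous anomaly scores."""

    def __init__(self, name):
        self.name = name


class BinaryMetric:
    """Stand-in for a metric that needs binary anomaly predictions."""

    def __init__(self, name):
        self.name = name


def combine_metrics(metrics, thresholds=None):
    """Pair each binary metric with every threshold; leave proba metrics alone.

    Hypothetical helper for illustration only; not dtaianomaly's code.
    """
    thresholds = thresholds or []
    proba = [m for m in metrics if isinstance(m, ProbaMetric)]
    binary = [m for m in metrics if isinstance(m, BinaryMetric)]
    if binary and not thresholds:
        # Mirrors the documented behaviour: binary metrics need a threshold.
        raise ValueError("Binary metrics were given without any thresholds.")
    combos = [(None, m.name) for m in proba]
    combos += [(t, m.name) for t, m in product(thresholds, binary)]
    return combos


combos = combine_metrics(
    metrics=[ProbaMetric("area_under_roc"), BinaryMetric("precision")],
    thresholds=["threshold_a", "threshold_b"],
)
print(combos)
```

Here the single ProbaMetric is evaluated once, while the BinaryMetric is evaluated once per threshold.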

run(**kwargs) DataFrame[source]

Run the experimental workflow. Evaluate each pipeline within this workflow on each dataset within this workflow in a grid-like manner.

Returns:

results – A pandas dataframe with the results of this workflow. Each row represents an execution of an anomaly detector on a given dataset with some preprocessing steps. The columns correspond to the different evaluation metrics, running time and potentially also the memory usage.

Return type:

pd.DataFrame

dtaianomaly.workflow.workflow_from_config(path: str, max_size: int = 1000000)[source]

Construct a Workflow instance from a JSON or TOML file. The file is first parsed and then interpreted to obtain a Workflow.

Parameters:
  • path (str) – Path to the config file

  • max_size (int, optional) – Maximum size of the config file in bytes. Defaults to 1000000 (1 MB).

Returns:

workflow – The parsed workflow from the given config file.

Return type:

Workflow

Raises:
  • TypeError – If the given path is not a string.

  • FileNotFoundError – If the given path does not correspond to an existing file.

  • ValueError – If the given path does not refer to a JSON or TOML file.
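The documented checks can be approximated with the standard library alone. This is a minimal sketch: check_config_path is a hypothetical helper, and raising a ValueError for a file larger than max_size is an assumption, since the exception for that case is not listed above.

```python
import os
import tempfile


def check_config_path(path, max_size=1_000_000):
    """Validate a config path in the order of the documented exceptions.

    Hypothetical helper for illustration only; not dtaianomaly's code.
    """
    if not isinstance(path, str):
        raise TypeError("The path must be a string.")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No such file: {path}")
    extension = os.path.splitext(path)[1].lower()
    if extension not in (".json", ".toml"):
        raise ValueError("The path must refer to a JSON or TOML file.")
    # Assumption: an oversized file is reported with a ValueError.
    if os.path.getsize(path) > max_size:
        raise ValueError(f"The file is larger than {max_size} bytes.")
    return extension


# Write a tiny config to a temporary directory and validate its path.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "workflow.json")
    with open(config_path, "w") as config_file:
        config_file.write("{}")
    checked_extension = check_config_path(config_path)

print(checked_extension)
```

Checking the size before reading the file means an unexpectedly large file is rejected without loading it into memory.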

dtaianomaly.workflow.interpret_config(config: dict)[source]

Interpret a parsed config and build the corresponding Workflow.

Each of the _interpret_* functions checks the config for the corresponding dtaianomaly objects. These functions must be extended whenever new components are added to the package.

Parameters:

config (dict) – The config to parse

Returns:

A Workflow containing all the components specified in the config.

Return type:

Workflow
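The idea of one interpretation function per kind of component can be illustrated with a small registry-based sketch. Everything here is hypothetical: ToyDetector, REGISTRY, interpret_entry, and the config schema with a "type" key are assumptions made for the example, not the documented format.

```python
class ToyDetector:
    """Hypothetical component used only for this example."""

    def __init__(self, window_size):
        self.window_size = window_size


# Hypothetical registry mapping names in the config to classes.
REGISTRY = {"ToyDetector": ToyDetector}


def interpret_entry(entry, registry):
    """Build an object from a config entry such as {"type": ..., **kwargs}.

    The "type" key is an assumed schema, chosen for illustration.
    """
    if "type" not in entry:
        raise ValueError("Each config entry needs a 'type' key.")
    name = entry["type"]
    if name not in registry:
        raise ValueError(f"Unknown component type: {name}")
    # All remaining keys are passed on as constructor arguments.
    kwargs = {key: value for key, value in entry.items() if key != "type"}
    return registry[name](**kwargs)


detector = interpret_entry({"type": "ToyDetector", "window_size": 100}, REGISTRY)
print(type(detector).__name__, detector.window_size)
```

In a design like this, supporting a new component only requires adding an entry to the registry, which matches the note that the interpretation functions grow along with the package.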