Preprocessing module

This module contains preprocessing functionality.

>>> from dtaianomaly import preprocessing

Custom preprocessors can be implemented by extending the base Preprocessor class.

preprocessing.check_preprocessing_inputs(X: ndarray, y: ndarray | None = None) → None

Check if the given X and y arrays are valid.

Parameters:

X (array-like of shape (n_samples, n_attributes)) – Raw time series
y (array-like, default=None) – Ground-truth information

Raises:

ValueError – If inputs are not valid numeric arrays
ValueError – If inputs have a different size in the first dimension (n_samples)
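The two documented checks can be sketched as follows. This is a minimal sketch, not the library's implementation: the helper name `check_inputs_sketch` is hypothetical, and the exact rule for what counts as a numeric array is an assumption.

```python
import numpy as np


def check_inputs_sketch(X, y=None):
    """Hypothetical stand-in that mirrors the two documented checks."""
    X = np.asarray(X)
    # Check 1: inputs must be numeric arrays (assumed: any NumPy number dtype).
    if not np.issubdtype(X.dtype, np.number):
        raise ValueError("X must be a numeric array")
    if y is not None:
        y = np.asarray(y)
        if not np.issubdtype(y.dtype, np.number):
            raise ValueError("y must be a numeric array")
        # Check 2: X and y must agree in the first dimension (n_samples).
        if X.shape[0] != y.shape[0]:
            raise ValueError("X and y must have the same number of samples")


# Valid input passes silently and returns None.
check_inputs_sketch(np.zeros((5, 2)), np.zeros(5))
```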

class dtaianomaly.preprocessing.Preprocessor[source]

Base preprocessor class.

fit(X: ndarray, y: ndarray | None = None) → Preprocessor[source]

First checks the inputs with check_preprocessing_inputs(), and then fits this preprocessor.

Parameters:

X (array-like of shape (n_samples, n_attributes)) – Raw time series
y (array-like, default=None) – Ground-truth information

Returns:

self – Returns the fitted instance self.

Return type:

Preprocessor

fit_transform(X: ndarray, y: ndarray | None = None) → Tuple[ndarray, ndarray | None][source]

First checks the inputs with check_preprocessing_inputs(), and then chains the fit and transform methods: this preprocessor is first fitted on the given X and y, after which the same X and y are transformed.

Parameters:

X (array-like of shape (n_samples, n_attributes)) – Raw time series
y (array-like of shape (n_samples), default=None) – Ground-truth information

Returns:

X_transformed (np.ndarray of shape (n_samples, n_attributes)) – Preprocessed raw time series
y_transformed (np.ndarray of shape (n_samples)) – The transformed ground truth. If no ground truth was provided (y=None), then None will be returned as well.

transform(X: ndarray, y: ndarray | None = None) → Tuple[ndarray, ndarray | None][source]

First checks the inputs with check_preprocessing_inputs(), and then transforms (i.e., preprocesses) the given time series.

Parameters:

X (array-like of shape (n_samples, n_attributes)) – Raw time series
y (array-like of shape (n_samples), default=None) – Ground-truth information

Returns:

X_transformed (np.ndarray of shape (n_samples, n_attributes)) – Preprocessed raw time series
y_transformed (np.ndarray of shape (n_samples)) – The transformed ground truth. If no ground truth was provided (y=None), then None will be returned as well.

class dtaianomaly.preprocessing.ChainedPreprocessor(*base_preprocessors: Preprocessor | List[Preprocessor])[source]

Wrapper chaining multiple Preprocessor objects.

Parameters:

base_preprocessors (list of Preprocessor objects) – The preprocessors to chain. These preprocessors can be passed as a single list argument or as multiple independent arguments to the constructor.
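The idea of chaining can be illustrated with a small sketch. The classes `AddOne` and `Double` and the helper `chain_fit_transform` are hypothetical stand-ins, not part of dtaianomaly. The sketch assumes that the steps are applied in the order given and that the output of each step is the input of the next.

```python
import numpy as np


class AddOne:
    """Hypothetical step exposing the documented fit_transform signature."""

    def fit_transform(self, X, y=None):
        return X + 1.0, y


class Double:
    """Second hypothetical step."""

    def fit_transform(self, X, y=None):
        return X * 2.0, y


def chain_fit_transform(steps, X, y=None):
    # Assumed semantics: apply the steps in order, passing (X, y) along.
    for step in steps:
        X, y = step.fit_transform(X, y)
    return X, y


X = np.array([[0.0], [1.0], [2.0]])
X_out, y_out = chain_fit_transform([AddOne(), Double()], X)
print(X_out.ravel())  # each value is (x + 1) * 2
```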

class dtaianomaly.preprocessing.Identity[source]: Identity preprocessor. A dummy preprocessor which does not do any processing at all.

class dtaianomaly.preprocessing.MinMaxScaler[source]

Rescale raw time series to the range [0, 1] via min-max scaling. The minimum and maximum are computed on a training set, after which these values can be used to transform a new time series. Therefore, there is no guarantee that the values of the transformed test set will actually be in the range [0, 1].

For multivariate time series, each attribute is normalized independently, i.e., the minimum and maximum of each attribute in the transformed training time series will be 0 and 1, respectively.

If the minimum and maximum of an attribute are the same (the attribute consists of only one distinct value), then the transformation leaves that attribute unchanged.

min_

The minimum value in each attribute of the training data.

Type:

array-like of shape (n_attributes)

max_

The maximum value in each attribute of the training data.

Type:

array-like of shape (n_attributes)

Raises:

NotFittedError – If the transform method is called before fitting this MinMaxScaler.
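The behaviour described for min-max scaling can be sketched in plain NumPy. This is an illustrative sketch under stated assumptions, not the library's code: the helper `minmax_transform` and the variables `span` and `offset` are hypothetical names.

```python
import numpy as np

X_train = np.array([[0.0, 5.0], [10.0, 5.0], [5.0, 5.0]])
X_test = np.array([[20.0, 5.0]])

# Statistics come from the training data only, one value per attribute.
min_ = X_train.min(axis=0)
max_ = X_train.max(axis=0)

# Constant attributes (min == max) are left unchanged.
constant = max_ == min_
span = np.where(constant, 1.0, max_ - min_)
offset = np.where(constant, 0.0, min_)


def minmax_transform(X):
    """Hypothetical helper applying the fitted statistics."""
    return (X - offset) / span


print(minmax_transform(X_train))  # first attribute lies in [0, 1]
print(minmax_transform(X_test))   # first attribute exceeds 1
```

The test value 20 lies outside the training range [0, 10], so it maps to 2.0, which illustrates why the transformed test set is not guaranteed to stay within [0, 1].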

class dtaianomaly.preprocessing.StandardScaler(min_std: float = 1e-09)[source]

Standard scale the data: rescale to zero mean and unit variance.

The mean and standard deviation are computed on a training set, after which these values can be used to transform a new time series. Therefore, there is no guarantee that the values of the transformed test set will actually have zero mean and unit variance.

For multivariate time series, each attribute is normalized independently, i.e., the mean and standard deviation of each attribute in the transformed training time series will be 0.0 and 1.0, respectively.

Parameters:

min_std (float, default=1e-9) – The minimum standard deviation required to actually Z-normalize an attribute. If the standard deviation of an attribute is below this value, then no normalization is applied to it. This prevents amplifying noise in the data.

mean_

The mean value in each attribute of the training data.

Type:

array-like of shape (n_attributes)

std_

The standard deviation in each attribute of the training data.

Type:

array-like of shape (n_attributes)

Raises:

NotFittedError – If the transform method is called before fitting this StandardScaler.
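A minimal NumPy sketch of standard scaling with the min_std guard is given below. It is not the library's implementation. In particular, the use of the population standard deviation (NumPy's default, ddof=0) is an assumption, and `scale_mask` is a hypothetical name.

```python
import numpy as np

min_std = 1e-9
X_train = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 3.0]])

mean_ = X_train.mean(axis=0)
std_ = X_train.std(axis=0)  # assumption: population std (ddof=0)

# Only attributes whose std reaches min_std are normalized.
scale_mask = std_ >= min_std
X_scaled = X_train.copy()
X_scaled[:, scale_mask] = (
    X_train[:, scale_mask] - mean_[scale_mask]
) / std_[scale_mask]

print(X_scaled)
```

The second attribute is constant, so its standard deviation is below min_std and the column is left as it is.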

class dtaianomaly.preprocessing.RobustScaler(quantile_range: Tuple[float, float] = (25.0, 75.0))[source]

Scale the time series using robust statistics.

The RobustScaler is similar to StandardScaler, but uses robust statistics rather than the mean and standard deviation. The center of the data is computed as the median, and the scale is computed as the range between two quantiles (the IQR by default). This ensures that the scaling is less affected by outliers.

For a time series \(x\), center \(c\) and scale \(s\), observation \(x_i\) is scaled to observation \(y_i\) using the following equation:

\[y_i = \frac{x_i - c}{s}\]

Notice the similarity with the formula for standard scaling. For multivariate time series, each attribute is scaled independently, each with an independent scale and center.

Parameters:

quantile_range (tuple of (float, float), default=(25.0, 75.0)) – Quantile range used to compute the scale_ of the robust scaler. By default, this is equal to the interquartile range (IQR). The first value corresponds to the lower quantile, the second value to the upper quantile. If the first value is not smaller than the second value, an error is raised. Both values must also be in the range [0, 100].

center_

The median value in each attribute of the training data.

Type:

array-like of shape (n_attributes)

scale_

The quantile range for each attribute of the training data.

Type:

array-like of shape (n_attributes)

Raises:

NotFittedError – If the transform method is called before fitting this RobustScaler.
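The robust scaling formula can be sketched with NumPy's median and percentile functions. This is an illustrative sketch, not the library's code: the default linear interpolation of `np.percentile` is an assumption about how the quantiles are computed.

```python
import numpy as np

quantile_range = (25.0, 75.0)
# One attribute with a single large outlier.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Center: median per attribute.
center_ = np.median(X_train, axis=0)

# Scale: distance between the two quantiles per attribute.
q_low, q_high = np.percentile(X_train, quantile_range, axis=0)
scale_ = q_high - q_low

# y_i = (x_i - c) / s
X_scaled = (X_train - center_) / scale_
print(X_scaled.ravel())
```

The outlier 100 barely influences center_ and scale_, so the four regular observations end up in a small range around zero.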

class dtaianomaly.preprocessing.MovingAverage(window_size: int)[source]

Computes the moving average of a time series. This is the unweighted average of the observations within a window.

To compute the moving average at time \(t\), the window is centered at position \(t\). For an odd window size, the number of measurements taken before and after \(t\) is equal, namely (window_size - 1) / 2. For an even window size, one additional observation is taken before \(t\) to ensure a correct window size.

For multivariate time series, the moving average is computed within each attribute independently.

Parameters:

window_size (int) – Length of the window in which the average should be computed.
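The window placement described above can be sketched for a univariate series as follows. The function name `moving_average_sketch` is hypothetical. How the library treats the boundaries of the series is not stated here, so truncating the window at the edges is an assumption.

```python
import numpy as np


def moving_average_sketch(x, window_size):
    """Centered, unweighted moving average of a 1-D array."""
    before = window_size // 2        # one extra point before t when even
    after = (window_size - 1) // 2
    out = np.empty(len(x), dtype=float)
    for t in range(len(x)):
        # Assumption: the window is truncated at the edges of the series.
        start = max(0, t - before)
        stop = min(len(x), t + after + 1)
        out[t] = x[start:stop].mean()
    return out


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(moving_average_sketch(x, 3))
print(moving_average_sketch(x, 4))
```

For a multivariate series, the same function would be applied to each attribute (column) separately.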

class dtaianomaly.preprocessing.ExponentialMovingAverage(alpha: float)[source]

Compute exponential moving average. For a given input \(x\), the exponential moving average \(y\) is computed as

\[\begin{split}y_0 &= x_0 \\ y_t &= \alpha \cdot x_t + (1 - \alpha) \cdot y_{t-1}\end{split}\]

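with \(0 < \alpha < 1\) the smoothing factor. Lower values of \(\alpha\) result in more smoothing, because more weight is given to the previous average \(y_{t-1}\) and less to the new observation \(x_t\).

Parameters:

alpha (float) – The decaying factor to be used in the exponential moving average.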
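The recursion can be written out directly for a univariate series. This is a minimal sketch and the function name `ema_sketch` is hypothetical.

```python
import numpy as np


def ema_sketch(x, alpha):
    """Exponential moving average of a 1-D array."""
    y = np.empty(len(x), dtype=float)
    y[0] = x[0]  # y_0 = x_0
    for t in range(1, len(x)):
        # y_t = alpha * x_t + (1 - alpha) * y_{t-1}
        y[t] = alpha * x[t] + (1 - alpha) * y[t - 1]
    return y


x = np.array([0.0, 1.0, 1.0, 1.0])
print(ema_sketch(x, 0.5))
```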

class dtaianomaly.preprocessing.SamplingRateUnderSampler(sampling_rate: int)[source]

Undersample time series with sampling rate sampling_rate. This means that every sampling_rate-th element is taken from the time series. After undersampling, only a fraction of roughly 1/sampling_rate of the original samples remains.

Parameters:

sampling_rate (int) – The rate at which the time series should be sampled.
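Undersampling with a fixed rate corresponds to strided slicing in NumPy. The sketch below is illustrative only: it assumes that sampling starts at the first observation and that the ground truth y is undersampled with the same indices.

```python
import numpy as np

sampling_rate = 3
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 1])

# Keep every sampling_rate-th observation, starting at index 0 (assumed).
X_under = X[::sampling_rate]
y_under = y[::sampling_rate]

print(X_under.ravel())
print(y_under)
```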

class dtaianomaly.preprocessing.NbSamplesUnderSampler(nb_samples: int)[source]

Undersample time series such that exactly nb_samples samples remain in the transformed time series. This makes it possible to manually set the size of the transformed time series, independent of the size of the original time series.

Parameters:

nb_samples (int) – The number of samples remaining.
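One way to obtain a fixed number of samples is to select evenly spaced indices. How the library chooses which observations to keep is not specified here, so the index selection with `np.linspace` below is purely an assumption made for illustration.

```python
import numpy as np

nb_samples = 4
X = np.arange(10, dtype=float).reshape(-1, 1)

# Assumption: evenly spaced indices that include the first and last sample.
indices = np.linspace(0, X.shape[0] - 1, nb_samples).round().astype(int)
X_under = X[indices]

print(X_under.ravel())
```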

class dtaianomaly.preprocessing.Differencing(order: int, window_size: int = 1)[source]

Applies differencing to the given time series. For a time series \(x\) and given season \(m\), the difference \(y\) is computed as:

\[y_t = x_t - x_{t-m}\]

This differencing procedure can be applied recursively, order times.

Parameters:

order (int) – The number of times the differencing procedure should be applied. If the order is 0, then no differencing will be applied.
window_size (int, default=1) – The season \(m\) used for differencing, i.e., the lag between the two observations that are subtracted.
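The recursive differencing can be sketched for a univariate series as follows. The function name `difference_sketch` is hypothetical. The sketch assumes that window_size plays the role of the season \(m\), and that the first \(m\) observations are dropped in each pass. The library may handle the start of the series differently, for example to preserve its length.

```python
import numpy as np


def difference_sketch(x, order, window_size=1):
    """Apply y_t = x_t - x_{t-m} recursively, with m = window_size."""
    y = np.asarray(x, dtype=float)
    for _ in range(order):
        # Assumption: the first m points, which have no predecessor, are dropped.
        y = y[window_size:] - y[:-window_size]
    return y


x = np.array([1.0, 4.0, 9.0, 16.0, 25.0])
print(difference_sketch(x, order=1))
print(difference_sketch(x, order=2))
print(difference_sketch(x, order=1, window_size=2))
```

With order=0 the loop does not run and the series is returned unchanged.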

class dtaianomaly.preprocessing.PiecewiseAggregateApproximation(n: int)[source]

Performs piecewise aggregate approximation.

Piecewise Aggregate Approximation (PAA) [keogh2001dimensionality] is a form of dimensionality reduction of time series, originally proposed for fast indexing of time series in large databases. Given a value for \(n\), PAA divides the time series into \(n\) equal-sized frames. Next, each frame is replaced by its mean value. Specifically, for a time series \(x\) of length \(N\), position \(i\) in the transformed time series \(y\) equals:

\[y_i = \frac{n}{N} \displaystyle\sum_{j=\frac{N}{n}(i-1)+1}^{\frac{N}{n}i} x_j\]

For multivariate time series, the dimension of each attribute is reduced independently, but the same frames are used.

Parameters:

n (int) – The number of equi-sized frames to generate.
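For a univariate series whose length is a multiple of \(n\), PAA reduces to a reshape followed by a row-wise mean. The function name `paa_sketch` is hypothetical, and the restriction to lengths divisible by \(n\) is a simplifying assumption of this sketch, not a documented requirement of the library.

```python
import numpy as np


def paa_sketch(x, n):
    """Replace each of n equal-sized frames of a 1-D array by its mean."""
    x = np.asarray(x, dtype=float)
    # Simplifying assumption: the length N is divisible by n.
    if len(x) % n != 0:
        raise ValueError("this sketch requires len(x) to be a multiple of n")
    # Each row holds one frame of N / n consecutive observations.
    return x.reshape(n, -1).mean(axis=1)


x = np.array([1.0, 3.0, 5.0, 7.0, 2.0, 4.0])
print(paa_sketch(x, 3))
```

For a multivariate series, the same frames would be used for every attribute, with the mean taken per attribute.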

References

[keogh2001dimensionality]

Keogh, E., Chakrabarti, K., Pazzani, M. et al. Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Knowledge and Information Systems 3, 263–286 (2001). doi: 10.1007/PL00011669.