feat: make re-windowing after feature extraction optional (#18)
* refactor: move tests to pytest

* feat: make re-windowing after feature extraction optional
tilman151 authored Jan 26, 2023
1 parent 2ce289e commit d51e56b
Showing 3 changed files with 159 additions and 134 deletions.
docs/use_cases/feature_extraction.md: 15 changes (12 additions, 3 deletions)
@@ -3,9 +3,14 @@ It may be useful to extract hand-crafted features, e.g. RMS or P2P, from this vi
The [RulDataModule][rul_datasets.core.RulDataModule] provides the option to use a custom feature extractor on each window of data.

The feature extractor can be anything that can be called as a function.
-It should take a numpy array with the shape `[num_windows, window_size, num_features]` and return an array with the shape `[num_windows, num_new_features]`.
+It should take a numpy array with the shape `[num_windows, window_size, num_features]` and return another array.
+Depending on whether a `window_size` is supplied to the data module, the expected output shape of the feature extractor is:
+
+* `window_size is None`: `[num_new_windows, new_window_size, features]`
+* `window_size is not None`: `[num_windows, features]`
+
An example would be taking the mean of the window with `lambda x: np.mean(x, axis=1)`.
-After applying the feature extractor, the data module extracts new windows of extracted features:
+Because this function reduces the windows to a single feature vector, we set `window_size` to 10 to get new windows of that size:

```pycon
>>> import rul_datasets
@@ -43,4 +48,8 @@ The number of samples will reduce by `num_runs * (window_size - 1)` due to the r
3674
>>> len(dm_extracted.to_dataset("dev"))
3656
```
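
To make the collapsed example above concrete: the extractor `lambda x: np.mean(x, axis=1)` turns each window into a single feature vector, and re-windowing then costs `window_size - 1` samples per run (the 18-sample difference above is exactly `num_runs * (window_size - 1)` with `window_size = 10`). The following plain-NumPy sketch is illustrative only; the array sizes are made up, and the sliding-window step merely mimics the re-windowing the data module is described as doing:

```python
import numpy as np

# One run of windowed data: [num_windows, window_size, num_features].
run = np.random.randn(100, 30, 14)

# The extractor from the example reduces each window to one feature vector.
extracted = np.mean(run, axis=1)
print(extracted.shape)  # (100, 14) -> [num_windows, features]

# Re-windowing with window_size=10 slides a window over the feature vectors,
# so this run loses window_size - 1 = 9 samples.
window_size = 10
rewindowed = np.stack(
    [extracted[i : i + window_size] for i in range(len(extracted) - window_size + 1)]
)
print(rewindowed.shape)  # (91, 10, 14)
```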

+If your feature extractor produces windows itself, you can set `window_size` to `None`.
+This way, no new windows are extracted.
+An example would be extracting multiple sub-windows from the existing windows.
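
For illustration, here is a minimal sketch of such an extractor (an assumption, not code from the library, and simpler than the sub-window example mentioned above): it downsamples each window, so its output is already windowed and the number of windows, and therefore the alignment with the targets, stays unchanged. The constructor call is shown only as a comment because the remaining constructor arguments are not part of this diff:

```python
import numpy as np

def downsample_windows(windows: np.ndarray) -> np.ndarray:
    """Keep every second time step: [num_windows, window_size, num_features]
    becomes [num_windows, window_size // 2, num_features]."""
    return windows[:, ::2, :]

run = np.random.randn(100, 30, 14)
print(downsample_windows(run).shape)  # (100, 15, 14) -- already windowed

# Intended use (only feature_extractor and window_size are shown in this diff):
# dm = rul_datasets.RulDataModule(..., feature_extractor=downsample_windows, window_size=None)
```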
rul_datasets/core.py: 44 changes (27 additions, 17 deletions)
@@ -27,8 +27,13 @@ class RulDataModule(pl.LightningDataModule):
`feature_extractor` and `window_size` arguments to the constructor. The
`feature_extractor` is a callable that takes a windowed time series as a numpy
array with the shape `[num_windows, window_size, num_features]` and returns
-another numpy array with the shape `[num_windows, num_new_features]`. The time
-series of extracted features is then re-windowed with `window_size`.
+another numpy array. Depending on `window_size`, the expected output shapes for
+the `feature_extractor` are:
+* `window_size is None`: `[num_new_windows, new_window_size, features]`
+* `window_size is not None`: `[num_windows, features]`
+If `window_size` is set, the extracted features are re-windowed.
Examples:
Default
@@ -67,8 +72,18 @@ def __init__(
pre-process the dataset. Afterwards, `setup_data` is called to load all
splits into memory.
-If `feature_extractor` and `window_size` are supplied, the data module extracts
-new features from each window of the time series and re-windows it afterwards.
+If a `feature_extractor` is supplied, the data module extracts new features
+from each window of the time series. If `window_size` is `None`,
+it is assumed that the extracted features form new windows themselves. If
+`window_size` is an int, it is assumed that the extracted features are
+single feature vectors and should be re-windowed. The expected output shapes
+for the `feature_extractor` are:
+* `window_size is None`: `[num_new_windows, new_window_size, features]`
+* `window_size is not None`: `[num_windows, features]`
+The expected input shape for the `feature_extractor` is always
+`[num_windows, window_size, features]`.
Args:
reader: The dataset reader for the desired dataset, e.g. CmapssLoader.
@@ -84,10 +99,10 @@ def __init__(
        self.feature_extractor = feature_extractor
        self.window_size = window_size

-        if (self.feature_extractor is not None) != (self.window_size is not None):
+        if (self.feature_extractor is None) and (self.window_size is not None):
            raise ValueError(
-                "feature_extractor and window_size cannot be set without "
-                "the other. Please supply values for both."
+                "A feature extractor has to be supplied "
+                "to set a window size for re-windowing."
            )

        hparams = deepcopy(self.reader.hparams)
@@ -194,7 +209,7 @@ def setup(self, stage: Optional[str] = None) -> None:
If the data module was constructed with a `feature_extractor` argument,
the feature windows are passed to the feature extractor. The resulting,
-new features are re-windowed.
+new features may be re-windowed.
Args:
stage: Ignored. Only for adhering to parent class interface.
@@ -221,20 +236,15 @@ def _setup_split(self, split: str) -> Tuple[torch.Tensor, torch.Tensor]:
    def _apply_feature_extractor_per_run(
        self, features: List[np.ndarray], targets: List[np.ndarray]
    ) -> Tuple[List[np.ndarray], List[np.ndarray]]:
-        if self.feature_extractor is not None and self.window_size is not None:
+        if self.feature_extractor is not None:
+            features = [self.feature_extractor(f) for f in features]
+        if self.window_size is not None:
            cutoff = self.window_size - 1
-            features = [self._apply_feature_extractor(f) for f in features]
+            # cut off because feats are re-windowed
+            features = [utils.extract_windows(f, self.window_size) for f in features]
            targets = [t[cutoff:] for t in targets]

        return features, targets

-    def _apply_feature_extractor(self, features: np.ndarray) -> np.ndarray:
-        features = self.feature_extractor(features) # type: ignore
-        features = utils.extract_windows(features, self.window_size) # type: ignore
-
-        return features
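
For readers who want to run the new per-run logic in isolation, here is a self-contained sketch that mirrors it; `sliding_windows` is a plain-NumPy stand-in for `utils.extract_windows`, whose actual implementation is not part of this diff:

```python
from typing import Callable, List, Optional, Tuple

import numpy as np

def sliding_windows(features: np.ndarray, window_size: int) -> np.ndarray:
    # Stand-in for utils.extract_windows: overlapping windows over the first axis.
    return np.stack(
        [features[i : i + window_size] for i in range(len(features) - window_size + 1)]
    )

def apply_feature_extractor_per_run(
    features: List[np.ndarray],
    targets: List[np.ndarray],
    feature_extractor: Optional[Callable[[np.ndarray], np.ndarray]],
    window_size: Optional[int],
) -> Tuple[List[np.ndarray], List[np.ndarray]]:
    if feature_extractor is not None:
        features = [feature_extractor(f) for f in features]
    if window_size is not None:
        cutoff = window_size - 1  # re-windowing drops the first window_size - 1 targets
        features = [sliding_windows(f, window_size) for f in features]
        targets = [t[cutoff:] for t in targets]
    return features, targets

# Two runs of windowed data with matching RUL targets.
runs = [np.random.randn(100, 30, 14), np.random.randn(80, 30, 14)]
targets = [np.linspace(100, 1, 100), np.linspace(80, 1, 80)]
feats, targs = apply_feature_extractor_per_run(
    runs, targets, lambda x: np.mean(x, axis=1), window_size=10
)
print(feats[0].shape, targs[0].shape)  # (91, 10, 14) (91,)
```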

    def train_dataloader(self, *args: Any, **kwargs: Any) -> DataLoader:
        """
        Create a [data loader][torch.utils.data.DataLoader] for the training split.