feat: make re-windowing after feature extraction optional (#18)
* refactor: move tests to pytest

* feat: make re-windowing after feature extraction optional
tilman151 authored Jan 26, 2023
1 parent 2ce289e commit d51e56b
Showing 3 changed files with 159 additions and 134 deletions.
docs/use_cases/feature_extraction.md: 15 changes (12 additions, 3 deletions)
@@ -3,9 +3,14 @@ It may be useful to extract hand-crafted features, e.g. RMS or P2P, from this vi
The [RulDataModule][rul_datasets.core.RulDataModule] provides the option to use a custom feature extractor on each window of data.

The feature extractor can be anything that can be called as a function.
-It should take a numpy array with the shape `[num_windows, window_size, num_features]` and return an array with the shape `[num_windows, num_new_features]`.
+It should take a numpy array with the shape `[num_windows, window_size, num_features]` and return another array.
+Depending on whether a `window_size` is supplied to the data module, the expected output shape of the feature extractor is:
+
+* `window_size is None`: `[num_new_windows, new_window_size, features]`
+* `window_size is not None`: `[num_windows, features]`
+
An example would be taking the mean of the window with `lambda x: np.mean(x, axis=1)`.
-After applying the feature extractor, the data module extracts new windows of extracted features:
+Because this function reduces the windows to a single feature vector, we set `window_size` to 10 to get new windows of that size:

```pycon
>>> import rul_datasets
@@ -43,4 +48,8 @@ The number of samples will reduce by `num_runs * (window_size - 1)` due to the r
3674
>>> len(dm_extracted.to_dataset("dev"))
3656
```
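
To make the collapsed example above concrete: the extractor `lambda x: np.mean(x, axis=1)` turns each window into a single feature vector, and re-windowing then costs `window_size - 1` samples per run (the 18-sample difference above is exactly `num_runs * (window_size - 1)` with `window_size = 10`). The following plain-NumPy sketch is illustrative only; the array sizes are made up, and the sliding-window step merely mimics the re-windowing the data module is described as doing:

```python
import numpy as np

# One run of windowed data: [num_windows, window_size, num_features].
run = np.random.randn(100, 30, 14)

# The extractor from the example reduces each window to one feature vector.
extracted = np.mean(run, axis=1)
print(extracted.shape)  # (100, 14) -> [num_windows, features]

# Re-windowing with window_size=10 slides a window over the feature vectors,
# so this run loses window_size - 1 = 9 samples.
window_size = 10
rewindowed = np.stack(
    [extracted[i : i + window_size] for i in range(len(extracted) - window_size + 1)]
)
print(rewindowed.shape)  # (91, 10, 14)
```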

+If your feature extractor produces windows itself, you can set `window_size` to `None`.
+This way, no new windows are extracted.
+An example would be extracting multiple sub-windows from the existing windows.
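
For illustration, here is a minimal sketch of such an extractor (an assumption, not code from the library, and simpler than the sub-window example mentioned above): it downsamples each window, so its output is already windowed and the number of windows, and therefore the alignment with the targets, stays unchanged. The constructor call is shown only as a comment because the remaining constructor arguments are not part of this diff:

```python
import numpy as np

def downsample_windows(windows: np.ndarray) -> np.ndarray:
    """Keep every second time step: [num_windows, window_size, num_features]
    becomes [num_windows, window_size // 2, num_features]."""
    return windows[:, ::2, :]

run = np.random.randn(100, 30, 14)
print(downsample_windows(run).shape)  # (100, 15, 14) -- already windowed

# Intended use (only feature_extractor and window_size are shown in this diff):
# dm = rul_datasets.RulDataModule(..., feature_extractor=downsample_windows, window_size=None)
```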
rul_datasets/core.py: 44 changes (27 additions, 17 deletions)
@@ -27,8 +27,13 @@ class RulDataModule(pl.LightningDataModule):
`feature_extractor` and `window_size` arguments to the constructor. The
`feature_extractor` is a callable that takes a windowed time series as a numpy
array with the shape `[num_windows, window_size, num_features]` and returns
-another numpy array with the shape `[num_windows, num_new_features]`. The time
-series of extracted features is then re-windowed with `window_size`.
+another numpy array. Depending on `window_size`, the expected output shapes for
+the `feature_extractor` are:
+* `window_size is None`: `[num_new_windows, new_window_size, features]`
+* `window_size is not None`: `[num_windows, features]`
+If `window_size` is set, the extracted features are re-windowed.
Examples:
Default
@@ -67,8 +72,18 @@ def __init__(
pre-process the dataset. Afterwards, `setup_data` is called to load all
splits into memory.
-If `feature_extractor` and `window_size` are supplied, the data module extracts
-new features from each window of the time series and re-windows it afterwards.
+If a `feature_extractor` is supplied, the data module extracts new features
+from each window of the time series. If `window_size` is `None`,
+it is assumed that the extracted features form new windows themselves. If
+`window_size` is an int, it is assumed that the extracted features are
+single feature vectors and should be re-windowed. The expected output shapes
+for the `feature_extractor` are:
+* `window_size is None`: `[num_new_windows, new_window_size, features]`
+* `window_size is not None`: `[num_windows, features]`
+The expected input shape for the `feature_extractor` is always
+`[num_windows, window_size, features]`.
Args:
reader: The dataset reader for the desired dataset, e.g. CmapssLoader.
@@ -84,10 +99,10 @@ def __init__(
        self.feature_extractor = feature_extractor
        self.window_size = window_size

-        if (self.feature_extractor is not None) != (self.window_size is not None):
+        if (self.feature_extractor is None) and (self.window_size is not None):
            raise ValueError(
-                "feature_extractor and window_size cannot be set without "
-                "the other. Please supply values for both."
+                "A feature extractor has to be supplied "
+                "to set a window size for re-windowing."
            )

        hparams = deepcopy(self.reader.hparams)
@@ -194,7 +209,7 @@ def setup(self, stage: Optional[str] = None) -> None:
If the data module was constructed with a `feature_extractor` argument,
the feature windows are passed to the feature extractor. The resulting,
-new features are re-windowed.
+new features may be re-windowed.
Args:
stage: Ignored. Only for adhering to parent class interface.
@@ -221,20 +236,15 @@ def _setup_split(self, split: str) -> Tuple[torch.Tensor, torch.Tensor]:
    def _apply_feature_extractor_per_run(
        self, features: List[np.ndarray], targets: List[np.ndarray]
    ) -> Tuple[List[np.ndarray], List[np.ndarray]]:
-        if self.feature_extractor is not None and self.window_size is not None:
+        if self.feature_extractor is not None:
+            features = [self.feature_extractor(f) for f in features]
+        if self.window_size is not None:
            cutoff = self.window_size - 1
-            features = [self._apply_feature_extractor(f) for f in features]
+            # cut off because feats are re-windowed
+            features = [utils.extract_windows(f, self.window_size) for f in features]
            targets = [t[cutoff:] for t in targets]

        return features, targets

-    def _apply_feature_extractor(self, features: np.ndarray) -> np.ndarray:
-        features = self.feature_extractor(features) # type: ignore
-        features = utils.extract_windows(features, self.window_size) # type: ignore
-
-        return features
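
For readers who want to run the new per-run logic in isolation, here is a self-contained sketch that mirrors it; `sliding_windows` is a plain-NumPy stand-in for `utils.extract_windows`, whose actual implementation is not part of this diff:

```python
from typing import Callable, List, Optional, Tuple

import numpy as np

def sliding_windows(features: np.ndarray, window_size: int) -> np.ndarray:
    # Stand-in for utils.extract_windows: overlapping windows over the first axis.
    return np.stack(
        [features[i : i + window_size] for i in range(len(features) - window_size + 1)]
    )

def apply_feature_extractor_per_run(
    features: List[np.ndarray],
    targets: List[np.ndarray],
    feature_extractor: Optional[Callable[[np.ndarray], np.ndarray]],
    window_size: Optional[int],
) -> Tuple[List[np.ndarray], List[np.ndarray]]:
    if feature_extractor is not None:
        features = [feature_extractor(f) for f in features]
    if window_size is not None:
        cutoff = window_size - 1  # re-windowing drops the first window_size - 1 targets
        features = [sliding_windows(f, window_size) for f in features]
        targets = [t[cutoff:] for t in targets]
    return features, targets

# Two runs of windowed data with matching RUL targets.
runs = [np.random.randn(100, 30, 14), np.random.randn(80, 30, 14)]
targets = [np.linspace(100, 1, 100), np.linspace(80, 1, 80)]
feats, targs = apply_feature_extractor_per_run(
    runs, targets, lambda x: np.mean(x, axis=1), window_size=10
)
print(feats[0].shape, targs[0].shape)  # (91, 10, 14) (91,)
```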

    def train_dataloader(self, *args: Any, **kwargs: Any) -> DataLoader:
        """
        Create a [data loader][torch.utils.data.DataLoader] for the training split.