
Add data read #85

Draft · wants to merge 69 commits into main
Conversation

@oruebel (Contributor) commented Aug 31, 2024

This PR attempts to implement the proposed approach for data read from #83 to see whether the approach is viable. The PR is experimental right now and should not be merged.

2. Proposed Implementation for reading data arrays

BaseReadData

  • Create new ReadDatasetWrapper and ReadAttributeWrapper classes for reading data (see the sketch below). This is modified from the proposal, which suggested a single BaseReadData class for reading any array (datasets, attributes) from a file.
  • Support conversion to boost multi-dimensional arrays for convenience
  • Note: I did not update BaseRecordingData to inherit from ReadDataWrapper because the two are not compatible right now. BaseReadData uses the io and the path to access the data, whereas BaseRecordingData leaves that choice to the I/O backend and stores references to the dataset. We may or may not want to change this.
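As a rough illustration of the wrapper idea, here is a minimal sketch (class and member names are assumptions made for this description, not the actual AqNWB API) of a read wrapper that stores the I/O object and the object path and defers the actual read to the backend:

#include <memory>
#include <string>
#include <vector>

class BaseIO;  // forward declaration of the I/O interface

template <typename DTYPE>
class ReadDatasetWrapperSketch {
public:
  // Store the I/O object and the path of the dataset; nothing is read yet.
  ReadDatasetWrapperSketch(std::shared_ptr<const BaseIO> io, std::string path)
      : m_io(std::move(io)), m_path(std::move(path)) {}

  // Lazy read: values are only pulled from the file when requested, e.g.,
  // by delegating to a (hypothetical) read method on the I/O backend.
  // std::vector<DTYPE> values() const;

private:
  std::shared_ptr<const BaseIO> m_io;  // I/O used to access the file
  std::string m_path;                  // path of the dataset within the file
};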

BaseIO

  • Note: I did not add abstract methods for lazy reading of objects from a file to BaseIO (or, more accurately, I removed them) because: 1) I wanted to use the shared_ptr to the I/O rather than a raw pointer, which I can't get from within BaseIO, and 2) with the ReadDatasetWrapper this is more appropriately done in the Container class directly.
  • Add a pure virtual method to allow us to get the storage object type (Group, Dataset, Attribute) for a given path
  • Add pure virtual methods to read data values from a Dataset or Attribute that the BaseReadData can call for read (a rough sketch follows below)
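To make the last two bullets concrete, here is a rough sketch of what such pure virtual methods could look like (the signatures are assumptions for illustration; the PR defines the actual ones):

#include <string>

enum class StorageObjectTypeSketch { Group, Dataset, Attribute };

// Stand-in for the type-erased data + shape container discussed later in this thread.
struct DataBlockGenericSketch {};

class BaseIOSketch {
public:
  virtual ~BaseIOSketch() = default;

  // Report whether the object at `path` is a Group, Dataset, or Attribute.
  virtual StorageObjectTypeSketch getObjectType(const std::string& path) const = 0;

  // Read the values of a dataset or attribute at `path` into a generic block.
  virtual DataBlockGenericSketch readDataset(const std::string& path) const = 0;
  virtual DataBlockGenericSketch readAttribute(const std::string& path) const = 0;
};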

HDF5IO

  • Note: In contrast to the proposal, I did not implement a specific version of the BaseReadData for HDF5 but left the read logic to HDF5IO itself so that the ReadDatasetWrapper can remain generic. To make this more manageable, I defined ReadAttributeWrapper separately.
  • Implement the methods for reading data values from a Dataset or Attribute that the HDF5ReadDataSet and HDF5ReadAttribute wrappers can call for read
  • Implement the getObjectType method for getting the storage object type (Group, Dataset, Attribute) for a given path

Container

  • Store the io object on the Container so that we can call io->readDataset and io->readAttribute in the read methods

NWB types: TimeSeries, ElectricalSeries etc.

  • Remove storage of properties from the Container classes and replace them with access methods that return BaseReadData objects instead. This allows for reading in both read and write mode and avoids keeping data in memory that we have already written to disk. For example, in TimeSeries, these member variables would need to be replaced by such access methods:
    /**
    * @brief Base unit of measurement for working with the data. Actual stored
    * values are not necessarily stored in these units. To access the data in
    * these units, multiply ‘data’ by ‘conversion’ and add ‘offset’.
    */
    std::string unit;
    /**
    * @brief The description of the TimeSeries.
    */
    std::string description;
    /**
    * @brief Human-readable comments about the TimeSeries.
    */
    std::string comments;
    /**
    * @brief Size used in dataset creation. Can be expanded when writing if
    * needed.
    */
    SizeArray dsetSize;
    /**
    * @brief Chunking size used in dataset creation.
    */
    SizeArray chunkSize;
    /**
    * @brief Scalar to multiply each element in data to convert it to the
    * specified ‘unit’.
    */
    float conversion;
    /**
    * @brief Smallest meaningful difference between values in data, stored in
    * the units specified by ‘unit’.
    */
    float resolution;
    /**
    * @brief Scalar to add to the data after scaling by ‘conversion’ to finalize
    * its coercion to the specified ‘unit’.
    */
    float offset;
    /**
    * @brief The starting time of the TimeSeries.
    */
    float startingTime = 0.0;
  • Add access methods that return BaseReadData for missing fields (a hypothetical example follows below)
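For illustration only, here is a hypothetical accessor (the names and the attribute path are assumptions, not the actual TimeSeries API) replacing a stored std::string unit with a method that hands back a lazy read wrapper for the corresponding attribute:

#include <memory>
#include <string>

// Stand-in for the ReadAttributeWrapper proposed above.
template <typename VTYPE>
struct ReadAttributeWrapperSketch {
  std::string path;  // path of the attribute within the file
};

class TimeSeriesSketch {
public:
  explicit TimeSeriesSketch(std::string path) : m_path(std::move(path)) {}

  // Instead of keeping `unit` in memory, return a wrapper pointing at the
  // attribute so it can be read lazily in both read and write mode.
  std::unique_ptr<ReadAttributeWrapperSketch<std::string>> unit() const
  {
    return std::make_unique<ReadAttributeWrapperSketch<std::string>>(
        ReadAttributeWrapperSketch<std::string>{m_path + "/data/unit"});
  }

private:
  std::string m_path;  // path of the TimeSeries within the file
};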

3. Proposed implementation for reading whole Containers (e.g., to read an ElectricalSeries)

  • Add access methods on the Container that owns the respective objects (e.g., NWBFile owns ElectricalSeries objects) to retrieve those objects
  • Add an abstract factory method (templated on the return type) to Container to create an instance of the specific Container type using only the io and path for the Container as input. The specific Container classes, such as TimeSeries, will then need to implement a corresponding constructor that takes io and path as input.

Step 1: Define the Template Factory Method in Container

#include <memory>
#include <string>
#include <type_traits>

class BaseIO;  // I/O interface (defined elsewhere)

class Container {
public:
    // Templated factory method: construct any Container subclass from the io
    // object and the path of the object within the file.
    template <typename T>
    static std::unique_ptr<T> create(const BaseIO& io, const std::string& path) {
        static_assert(std::is_base_of<Container, T>::value, "T must be a derived class of Container");
        return std::unique_ptr<T>(new T(path, io));
    }
};

Step 2: Implement the constructors on the specific Container classes (e.g., TimeSeries)

  • Add the necessary constructor
class TimeSeries : public Container {
public:
    TimeSeries(const std::string& path, const BaseIO& io) {
        // Implementation of TimeSeries constructor
    }
};
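A possible usage of the factory, assuming an io object for the open file (the object path here is made up for illustration):

// Construct a TimeSeries for read using only the io object and the object path.
std::unique_ptr<TimeSeries> ts =
    Container::create<TimeSeries>(io, "/acquisition/my_timeseries");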

4. Proposed implementation for reading untyped groups (e.g., /acquisition)

I'm not sure we'll need to do this, since a group by itself does not define data. To access the contents, we could define access methods on the parent Container class (e.g., NWBFile) that owns the untyped group.

TODO

Next steps

@oruebel (Contributor, Author) commented Sep 1, 2024

@stephprince when you get a chance, could you please do a first code review of this PR to make sure this is heading in the right direction? I now have a first outline of one possible solution for how we might implement read. There is still a lot more work to be done before this PR is ready, but it would be useful if you could take a look before I go any further with this approach.

I would start by looking at:

  1. tests/examples/test_ecephys_data_read.cpp, which shows an example of how read works for the user
  2. BaseIO, which defines the main new classes used for reading, and HDF5IO, which implements the actual reading
  3. Container and ElectricalSeries, which also have some relevant changes that allow us to construct Container objects for read and to get specific datasets/attributes

@oruebel (Contributor, Author) commented Sep 1, 2024

@stephprince I just added a documentation page as well, which hopefully helps explain the current proposed design for read so we can review and discuss.

oruebel and others added 6 commits September 8, 2024 23:43
* Merged main into add_read

* Fix docs build for SpikeEventSeries

* Fix code formatting

* Fix segfault due to duplicate declaration of NWBFile.io parameter
@stephprince (Collaborator) left a comment

Finally got a chance to look through this and added some comments/questions! A couple of comments in addition to what I added in code:

  1. I think we want to add tests for reading in string datasets, attributes, and multidimensional datasets. There were a couple of commented-out sections in the docs examples that seemed to be working on testing these things; not sure if those were WIP or resolved elsewhere.
  2. How do links/references work on read?

I'm partway through looking through the follow-up PR, will ping you there when I finish.

@@ -6,8 +6,8 @@ on:
       - main
 
   pull_request:
-    branches:
-      - main
+    #branches:
stephprince (Collaborator): uncomment workflow triggers before merging

@oruebel (Contributor, Author) commented Sep 24, 2024

The reason I had modified this is because it wouldn't trigger workflows if they didn't target the main branch. I think it would be useful to run the tests on all PRs, even if they target branches other than main.

stephprince (Collaborator): Got it, that makes sense. We can remove the branch section entirely then, or yes, just leave it commented out.

@oruebel (Contributor, Author): Either way is fine. I just commented them out here because I wanted to see which tests are working/failing on this PR. We can also change the workflows in a separate PR and remove this change here before merging if you prefer.

template<typename VTYPE = std::any>
inline std::unique_ptr<
    IO::ReadDataWrapper<AQNWB::Types::StorageObjectType::Dataset, VTYPE>>
dataLazy() const
stephprince (Collaborator): Is there an alternative to appending Lazy to the name of every member that is a ReadDataWrapper? I think as a user I would prefer to access data in the same way whether it's in memory or read from the file, and we can specify that data will always be lazy loaded on read. I'm wondering if it's possible to use a single this->data() accessor here with some approach such as

  1. std::variant<BaseRecordingData, ReadDataWrapper>, or
  2. by using a common base class for those two classes.

I can't remember if we had discussed this approach in the past or not

@oruebel (Contributor, Author): In #91 I replaced dataLazy (and other read fields) with the DEFINE_FIELD macro to create these access methods. I changed the names of the functions there to use the naming convention read<FieldPath>, e.g., readData or readTimestamps. To avoid confusion for attributes, I used the full path in naming the field, e.g., readDataUnit and readTimestampsUnit.

I don't think we want to store the read wrappers on the container. They are lightweight, and a user may need to use a data type different from the default, so creating the read wrapper on request is useful. I.e., I don't think it is strictly necessary for BaseRecordingData and ReadDataWrapper to have a common base. We could rename TimeSeries.data to TimeSeries.recordData to make them more consistent, or we could leave it as is and say that members without a prefix are for acquisition and those with the read prefix are for read only.

With all of this said, I think it is worth discussing whether having a common base class for BaseRecordingData and ReadDataWrapper would be useful.
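Purely to illustrate the idea (the real DEFINE_FIELD macro lives in #91; the names, signatures, and paths below are assumptions), such a macro could expand to a small accessor that builds a read wrapper on request:

#include <memory>
#include <string>

// Stand-in for the read wrapper; the real one would also take the io object.
template <typename VTYPE>
struct ReadWrapperSketch {
  std::string path;
};

// Generate a read accessor named `name` for the field at `relativePath`.
#define DEFINE_FIELD_SKETCH(name, relativePath)                         \
  template <typename VTYPE = std::string>                               \
  std::unique_ptr<ReadWrapperSketch<VTYPE>> name() const                \
  {                                                                     \
    return std::make_unique<ReadWrapperSketch<VTYPE>>(                  \
        ReadWrapperSketch<VTYPE>{m_path + "/" + relativePath});         \
  }

class TimeSeriesSketch {
public:
  explicit TimeSeriesSketch(std::string path) : m_path(std::move(path)) {}

  DEFINE_FIELD_SKETCH(readData, "data")           // dataset "data"
  DEFINE_FIELD_SKETCH(readDataUnit, "data/unit")  // attribute "unit" on "data"

private:
  std::string m_path;
};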

/**
 * @brief Generic structure to hold type-erased data and shape
 */
class DataBlockGeneric
stephprince (Collaborator): Would there be a way to only expose DataBlock to the user and use DataBlockGeneric under the hood as needed?

I feel it would be simpler if they could use DataBlock for reading data, and if the type is unknown use the default std::any. I might just be missing reasons to keep it separate at the user level, but was wondering if that could simplify things

@oruebel (Contributor, Author): I like the idea of only using DataBlock, but we'll need to think a bit more about how to do that. One key difference is that DataBlockGeneric stores the entire data as std::any, so we can use a single cast std::any_cast<std::vector<DTYPE>>(genericData.data) to transform the data to the correct type without having to copy the data. DataBlock, on the other hand, stores the data as a std::vector, and casting to a std::vector<std::any> would require copying the data vector and casting each individual element. Similarly, on read we would probably also need to do another copy in order to move the data into the std::vector<std::any>. Happy to chat about how we may be able to simplify this.

Maybe one approach could be to have DataBlock store the data internally as std::any and then access the data via a function instead, so that we do the casting when accessing the data. However, when reading data where the user may not know the data type beforehand, the user still needs to define the type to cast to, since we cannot simply cast std::any to std::vector<std::any>; we need to cast to the correct type.
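A small, self-contained sketch of the difference (the type names are stand-ins, not the actual AqNWB classes): a single std::any holding the whole vector can be recovered with one cast, whereas a std::vector<std::any> forces per-element copies and casts.

#include <any>
#include <cstddef>
#include <cstdint>
#include <vector>

struct DataBlockGenericSketch {
  std::any data;                    // holds, e.g., a std::vector<int32_t>
  std::vector<std::size_t> shape;   // shape of the data
};

int main() {
  DataBlockGenericSketch generic;
  generic.data = std::vector<int32_t>{1, 2, 3};
  generic.shape = {3};

  // One cast recovers the typed vector; casting to a reference avoids a copy.
  const auto& typed = std::any_cast<const std::vector<int32_t>&>(generic.data);

  // A std::vector<std::any> would instead require a per-element copy and cast.
  std::vector<std::any> perElement(typed.begin(), typed.end());
  std::vector<int32_t> back;
  for (const auto& v : perElement) {
    back.push_back(std::any_cast<int32_t>(v));
  }
  return 0;
}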

{
  std::vector<int32_t> testData = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

  SECTION("read 1D data block of a 1D dataset")
stephprince (Collaborator): I think we would want to add additional tests for at least one type of multidimensional data here.

@oruebel (Contributor, Author) commented Sep 24, 2024

The unit tests are still very minimal right now. I totally agree, there are a lot more tests that need to be added. However, I wanted to get your take on the overall design first before diving even deeper into testing all the different cases.

Comment on lines +89 to +95
size_t pos = path.find_last_of('/');
if (pos == std::string::npos) {
  return nullptr;
}

std::string parentPath = path.substr(0, pos);
std::string attrName = path.substr(pos + 1);
stephprince (Collaborator): I think these manipulations, as well as the mergePath functions, are good reasons to switch to using the std::filesystem::path representation?
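For reference, the same split expressed with std::filesystem::path (a sketch assuming '/'-separated HDF5 object paths; the example path is made up):

#include <filesystem>
#include <iostream>
#include <string>

int main() {
  // Example object path: attribute "unit" on the parent object "/acquisition/ts0".
  std::filesystem::path p("/acquisition/ts0/unit");
  std::string parentPath = p.parent_path().generic_string();  // "/acquisition/ts0"
  std::string attrName = p.filename().string();               // "unit"
  std::cout << parentPath << " | " << attrName << "\n";
  return 0;
}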

std::string parentPath = path.substr(0, pos);
std::string attrName = path.substr(pos + 1);

try {
stephprince (Collaborator): could we use getH5ObjectType here instead of trying to open the object as a group or a dataset?

@oruebel (Contributor, Author) commented Sep 24, 2024

Thanks for taking a look at the PR.

  1. I think we want to add tests for reading in string datasets, attributes, and multidimensional datasets. There were a couple of commented-out sections in the docs examples that seemed to be working on testing these things; not sure if those were WIP or resolved elsewhere.

Totally agree. This PR + #91 together outline how read can work, but there are still a lot of tests that need to be added to make sure everything is working as expected. I think in particular in the HDF5IO there are probably still some gremlins hiding. This is part of the todo list in #91. Part of the question for the review is to decide which items we view as essential before merging (e.g., unit tests) vs. items we want to address in separate PRs. I'd suggest just adding any items to the TODO list in the description of #91, and then we can go through and prioritize.

2. How do links/references work on read?

Links should be handled transparently by HDF5 so I think read should work in the same way as for groups and datasets. I have not yet addressed the case of object references that are stored as values of attributes and datasets. I think this will require additional logic in the ReadDataWrapper and HDF5IO.

oruebel and others added 2 commits September 24, 2024 14:58
Co-authored-by: Steph Prince <40640337+stephprince@users.noreply.github.com>
Co-authored-by: Steph Prince <40640337+stephprince@users.noreply.github.com>
@oruebel (Contributor, Author) commented Sep 25, 2024

I'm a bit confused about why the tests are failing after submitting the suggestions from the review. The build in the test fails with:

/Users/runner/work/aqnwb/aqnwb/src/nwb/ecephys/ElectricalSeries.cpp:61:25: error: use of undeclared identifier 'BaseDataType'; did you mean 'IO::BaseDataType'? m_io->createAttribute(BaseDataType::I32,

But that is not what line 61 looks like in this PR.

https://github.com/NeurodataWithoutBorders/aqnwb/actions/runs/11023552242/job/30615439516?pr=85#step:6:249

@oruebel (Contributor, Author) commented Oct 22, 2024

@stephprince Looking at the workflow details (https://github.com/NeurodataWithoutBorders/aqnwb/actions/runs/11023552242/job/31901884513?pr=85), the workflow is checking out:

/opt/homebrew/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +b2fb27fdba42dd5183bce7d2cc44421c50e2f428:refs/remotes/pull/85/merge

However, looking at that commit hash b2fb27f, GitHub says that "This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository."

Digging around a bit more with ChatGPT, it says:

When a GitHub Action runs for a pull request, it often checks out a temporary merge commit. This commit represents the state of the code if the pull request were to be merged into the target branch (usually main). This allows the CI to test the combined code without actually merging it. [...] The commit b2fb27f is likely a temporary merge commit created by GitHub to test the pull request. These commits are not part of the repository's history and exist only in the context of the CI job.

I didn't know that this is what GitHub was doing, but it makes sense that GitHub wants to simulate the state of the code as if the pull request were merged into the target branch. Looking at the code in b2fb27f, it indeed shows the error that the build is reporting (ElectricalSeries.cpp:61:25: error: use of undeclared identifier 'BaseDataType'; did you mean 'IO::BaseDataType'? m_io->createAttribute(BaseDataType::I32,). I.e., something seems to be going wrong when git tries to automatically merge the branches.

We could modify tests.yml to run all the tests with both states, i.e., with the code as it is in the PR as well as with the temporary merge. This would make the CI runtime longer (since all tests would run twice) and make the workflow a bit more complex, but it would help with finding merge errors. So it's a tradeoff between runtime and complexity versus being more rigorous. What do you think? Here is how the modified tests.yml workflow could look, where:

  • test-pr-state Job: Checks out the specific PR branch state using github.event.pull_request.head.sha.
  • test-merged-state Job: Checks out the temporary merge commit using refs/pull/${{ github.event.pull_request.number }}/merge.
  • sanitize Job: Runs after both the test-pr-state and test-merged-state jobs.
  • validate Job: Runs after all previous jobs and performs validation.
name: Run tests

on:
  push:
    branches:
      - main
  pull_request:
  workflow_dispatch:

jobs:
  test-pr-state:  # Job to test the state of the PR branch as it is
    if: github.event_name == 'pull_request'  # Run only on pull request events
    name: Test PR State
    defaults:
      run:
        shell: bash
    concurrency:
      # Group includes the job name so the two parallel test jobs do not cancel each other.
      group: ${{ github.workflow }}-${{ github.ref }}-${{ matrix.os }}-pr-state
      cancel-in-progress: true
    strategy:
      fail-fast: false
      matrix:
        os: [macos-latest, ubuntu-latest]

    runs-on: ${{ matrix.os }}

    steps:
      - name: Checkout PR branch
        uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.head.sha }}  # Checkout the PR branch directly

      - name: Install dependencies - ubuntu
        if: matrix.os == 'ubuntu-latest'
        run: |
          sudo apt-get update
          sudo apt-get install -y libhdf5-dev libboost-all-dev
          git clone https://github.com/catchorg/Catch2.git
          cd Catch2
          git checkout "v3.5.3"
          cmake -Bbuild -H. -DBUILD_TESTING=OFF
          sudo cmake --build build/ --target install

      - name: Install dependencies - macos
        if: matrix.os == 'macos-latest'
        run: brew install hdf5 boost catch2

      - name: Configure
        shell: pwsh
        run: cmake "--preset=ci-$("${{ matrix.os }}".split("-")[0])"

      - name: Build
        run: cmake --build build --config Release -j 2

      - name: Install
        run: cmake --install build --config Release --prefix prefix

      - name: Test
        working-directory: build
        run: ctest --output-on-failure --no-tests=error -C Release -j 2

      - name: Upload artifacts
        uses: actions/upload-artifact@v3
        with:
          name: test-files-${{ matrix.os }}
          path: |
            build/tests/data/*.nwb

  test-merged-state:  # Job to test the state of the code after merging the PR into the target branch
    if: github.event_name == 'pull_request'  # Run only on pull request events
    name: Test Merged State
    defaults:
      run:
        shell: bash
    concurrency:
      # Group includes the job name so the two parallel test jobs do not cancel each other.
      group: ${{ github.workflow }}-${{ github.ref }}-${{ matrix.os }}-merged-state
      cancel-in-progress: true
    strategy:
      fail-fast: false
      matrix:
        os: [macos-latest, ubuntu-latest]

    runs-on: ${{ matrix.os }}

    steps:
      - name: Checkout merged state
        uses: actions/checkout@v4
        with:
          ref: refs/pull/${{ github.event.pull_request.number }}/merge  # Checkout the merge commit

      - name: Install dependencies - ubuntu
        if: matrix.os == 'ubuntu-latest'
        run: |
          sudo apt-get update
          sudo apt-get install -y libhdf5-dev libboost-all-dev
          git clone https://github.com/catchorg/Catch2.git
          cd Catch2
          git checkout "v3.5.3"
          cmake -Bbuild -H. -DBUILD_TESTING=OFF
          sudo cmake --build build/ --target install

      - name: Install dependencies - macos
        if: matrix.os == 'macos-latest'
        run: brew install hdf5 boost catch2

      - name: Configure
        shell: pwsh
        run: cmake "--preset=ci-$("${{ matrix.os }}".split("-")[0])"

      - name: Build
        run: cmake --build build --config Release -j 2

      - name: Install
        run: cmake --install build --config Release --prefix prefix

      - name: Test
        working-directory: build
        run: ctest --output-on-failure --no-tests=error -C Release -j 2

      - name: Upload artifacts
        uses: actions/upload-artifact@v3
        with:
          name: test-files-${{ matrix.os }}
          path: |
            build/tests/data/*.nwb

  sanitize:  # Job to run sanitization tests on the merged state
    if: github.event_name == 'pull_request'  # Run only on pull request events
    needs: [test-pr-state, test-merged-state]

    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          ref: refs/pull/${{ github.event.pull_request.number }}/merge  # Checkout the merge commit

      - name: Install dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y libhdf5-dev libboost-all-dev
          git clone https://github.com/catchorg/Catch2.git
          cd Catch2
          git checkout "v3.5.3"
          cmake -Bbuild -H. -DBUILD_TESTING=OFF
          sudo cmake --build build/ --target install

      - name: Configure
        run: cmake --preset=ci-sanitize

      - name: Build
        run: cmake --build build/sanitize -j 2

      - name: Test
        working-directory: build/sanitize
        env:
          ASAN_OPTIONS: "strict_string_checks=1:\
            detect_stack_use_after_return=1:\
            check_initialization_order=1:\
            strict_init_order=1:\
            detect_leaks=1:\
            halt_on_error=1"
          UBSAN_OPTIONS: "print_stacktrace=1:\
            halt_on_error=1"
        run: ctest --output-on-failure --no-tests=error -j 2

  validate:  # Job to validate the results using the test files
    needs: [test-pr-state, test-merged-state, sanitize]
    defaults:
      run:
        shell: bash
    concurrency:
      group: ${{ github.workflow }}-${{ github.ref }}-${{ matrix.os }}
      cancel-in-progress: true
    strategy:
      fail-fast: false
      matrix:
        os: [macos-latest, ubuntu-latest]

    runs-on: ${{ matrix.os }}

    steps:
      - name: Download test files
        uses: actions/download-artifact@v3
        with:
          name: test-files-${{ matrix.os }}
          path: nwb_files

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install pynwb and run validation
        run: |
          python -m pip install --upgrade pip
          python -m pip install nwbinspector
          nwbinspector nwb_files --threshold BEST_PRACTICE_VIOLATION

@oruebel (Contributor, Author) commented Oct 23, 2024

We could modify tests.yml to run all the tests with both states, i.e., with the code as it is in the PR as well as with the temporary merge. This would make the CI runtime longer (since all tests would run twice) and make the workflow a bit more complex, but it would help with finding merge errors.

We decided not to do this and will continue testing only the merged version. We decided to add a note in the developer docs to clarify this behavior.

Successfully merging this pull request may close these issues: Propose refactor of I/O class organization