-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add data read #85
base: main
Are you sure you want to change the base?
Add data read #85
Conversation
@stephprince when you get a chance ,could you please do a first code review of this PR to make sure this is heading in the right direction. I now have a first outline of one possible solution for how we might implement read. There is still a lot more work to be done before this PR is ready, but it would be useful if you could take a look before I go any further with this approach. I would start by looking at:
|
@stephprince I just added a documentation page as well, which hopefully helps explain the current proposed design for read so we can review and discuss. |
* Merged main into add_read * Fix docs build for SpikeEventSeries * Fix code formatting * Fix segfault due to duplicate declaration of NWBFile.io parameter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally got a chance to look through this and added some comments / questions! A couple of comments In addition to what I added in code:
- I think we want to add tests for reading in string datasets, attributes, and multidimensional datasets. There were a couple of sections in the docs examples maybe working on testing these things that were commented out, not sure if those were WIP or resolved elsewhere.
- How do links/references work on read?
I'm partway through looking through the follow up PR, will ping you there when I finish.
@@ -6,8 +6,8 @@ on: | |||
- main | |||
|
|||
pull_request: | |||
branches: | |||
- main | |||
#branches: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uncomment workflow triggers before merging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I had modified this is because it would'nt trigger workflows if they didn't target the main branch. I think it would be useful to run the tests on all PRs, even if they target branches other than main
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, that makes sense. We can remove the branch section entirely then or yes just leave it commented out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either way is fine. I just commented them out here, because I wanted to see what tests are working/failing on this PR. We can change workflows also in a separate PR and remove this change here before merging if you prefer.
template<typename VTYPE = std::any> | ||
inline std::unique_ptr< | ||
IO::ReadDataWrapper<AQNWB::Types::StorageObjectType::Dataset, VTYPE>> | ||
dataLazy() const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an alternative to appending Lazy
to the name of every member variable that is a readDataWrapper
? I think as a user I would prefer to access data in the same way whether it's in memory or read from the file, and we can specify that data will always be lazy loaded on read. I'm wondering if that's possible to use a single this->data() variable here if we use some approach such as
std::variant<BaseRecordingData, ReadDataWrapper>
, or- by using a common base class for those two classes.
I can't remember if we had discussed this approach in the past or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In #91 I replaced dataLazy
(and other read fields) with the DEFINE_FIELD
macro to create these access methods. I changed the name of the functions there to use the naming convention read<FieldPath>
, e.g., readData
or readTimestamps
. To avoid confusion for attributes I used the full path in naming the field, e.g., readDataUnit
and readTimestampsUnit
.
I don't think we want to store the read wrappers on the container. They are lightweight and a user may need to use a data type different than the default, so creating the read wrapper on request is useful. I.e., I don't think it is strictly necessary for BaseRecordingData
and ReadDataWrapper
to have a common base. We could rename TimeSeries.data
to TimeSeries.recordData
to make them more consistent, or we could leave it as is and say that attributes without a prefix are for acquisition and with the prefix are for read only.
With all of this said, I think it is worth discussing whether having a common base class for BaseRecordingData and ReadDataWrapper would be useful.
/** | ||
* @brief Generic structure to hold type-erased data and shape | ||
*/ | ||
class DataBlockGeneric |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would there be a way to only expose DataBlock
to the user and use DataBlockGeneric
under the hood as needed?
I feel it would be simpler if they could use DataBlock
for reading data, and if the type is unknown use the default std::any
. I might just be missing reasons to keep it separate at the user level, but was wondering if that could simplify things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of only using DataBlock
but we'll need to think a bit more about how to do that. One key difference is that DataBlockGeneric
stores the entire data as std::any
so we can use a single cast std::any_cast<std::vector<DTYPE>>(genericData.data)
to transform the data to the correct type without having to copy the data. DataBlock
on the other hand stores the data as and std::vector
and to cast std::vector<std::any>
required copying the data vector and casting each individual element. Similarly, on read we would probably also need to do another copy in order to move the data into the std::vector<std::any>
. Happy to chat about how we may be able to simplify this.
Maybe one approach could be to have DataBlock store the data internally as std::any
and then access the data via a function instead so that we do the casting when accessing the data. However, when reading data where the user may not know the data type before-hand, the user still needs to then define the type to cast to since we cannot simply cast std::any
to std::vector<std::any>
, but we need to cast to the correct type.
{ | ||
std::vector<int32_t> testData = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}; | ||
|
||
SECTION("read 1D data block of a 1D dataset") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we would want to add additional tests for at least one type of multidimensional data here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unit tests are still very minimal right now. I totally agree, there are a lot more tests that need to be added. However, I wanted to get your take on the overall design first before diving even deeper into testing all the different cases.
size_t pos = path.find_last_of('/'); | ||
if (pos == std::string::npos) { | ||
return nullptr; | ||
} | ||
|
||
std::string parentPath = path.substr(0, pos); | ||
std::string attrName = path.substr(pos + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these manipulations as well as the mergePath
functions I think are good reasons to switch to using the std::filesystem::path representation
?
std::string parentPath = path.substr(0, pos); | ||
std::string attrName = path.substr(pos + 1); | ||
|
||
try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could use getH5ObjectType
here instead of trying to open object as group or a dataset?
Thanks for taking a look at the PR.
Totally agree. This PR + #91 together outline how read can work, but there are still a lot of tests that need to be added to make sure everything is working as expected. I think in particular on the
Links should be handled transparently by HDF5 so I think read should work in the same way as for groups and datasets. I have not yet addressed the case of object references that are stored as values of attributes and datasets. I think this will require additional logic in the |
Co-authored-by: Steph Prince <40640337+stephprince@users.noreply.github.com>
Co-authored-by: Steph Prince <40640337+stephprince@users.noreply.github.com>
Co-authored-by: Steph Prince <40640337+stephprince@users.noreply.github.com>
I'm a bit confused why the tests are failing after submitting the suggestions from the review. The build in the test fails with |
@stephprince looking at the workflow details https://github.com/NeurodataWithoutBorders/aqnwb/actions/runs/11023552242/job/31901884513?pr=85 The workflow is checking out:
However, looking at that commit hash at b2fb27f , GitHub says that "This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository." Digging around a bit more with ChatGPT, it says: When a GitHub Action runs for a pull request, it often checks out a temporary merge commit. This commit represents the state of the code if the pull request were to be merged into the target branch (usually main). This allows the CI to test the combined code without actually merging it. [...] The commit b2fb27f is likely a temporary merge commit created by GitHub to test the pull request. These commits are not part of the repository's history and exist only in the context of the CI job. I didn't know that this is what GitHub was doing, but it makes sense that GitHub wants to simulate the state of the code as if the pull request were merged into the target branch. Looking at the code in b2fb27f it indeed shows the error that the build is showing in line We could modify the
|
We decide not to do this and to continue testing only for the merged version. We decided to add a note in the developer docs to clarify this behavior. |
This PR is to try and implement the proposed approach for data read from #83 to see if this approach is viable. This PR is experimental right now and should not be merged.
2. Proposed Implementation for reading data arrays
BaseReadData
ReadDatasetWrapper
andReadAttributeWrapper
classes for reading data. This is modified from the proposal, which suggested a singleBaseReadData
class for reading any array (datasets, attributes) from a file.BaseRecordingData
to inherit fromReadDataWrapper
because the two are not compatible right now.BaseReadData
uses the io and the path to access the data, whereasBaseRecordingData
leaves that choice to the I/O backend and stores the references to the dataset. We may or may not want to change this.BaseIO
BaseIO
(or more accurately I remove them) because: 1) I wanted to use the shared_ptr to the I/O rather than a raw pointer, which I can't get from BaseIO, and 2) with the ReadDatasetWrapper this is more approbriately done in the Container class directly.HDF5IO
BaseReadData
forHDF5
but left read logic toHDF5IO
itself so that theReadDatasetWrapper
can remain generic. To make this more manageable, I definedReadAttributeWrapper
separatelyHDF5ReadDataSet
andHDF5ReadAttribute
wrappers can call can call for readgetObjectType
method for getting the storage object type (Group, Dataset, Attribute) for a given pathContainer
io
object on theContainer
so that we can callio->readDataset
andio->readAttribute
in the read methodsNWB types:
TimeSeries
,ElectricalSeries
etc.Container
classes and replace them with access methods that returnBaseReadData
objects instead. This allows for reading in both read and write mode and avoids keeping data in memory that we have already written to disk. For example, inTimeSeries
, these variables would need to change to properties:aqnwb/src/nwb/base/TimeSeries.hpp
Lines 91 to 140 in e873d95
BaseReadData
for missing fields3. Proposed implementation for reading whole
Containers
(e.g., to read anElectricalSeries
)Container
that owns the respective objects, e.g.,NWBFile
owningElectricalSeries
objects to retrieve the objectContainer
to create an instance of the specificContainer
type using only theio
andpath
for theContainer
as input. The specificContainer
classes, such asTimeSeries
will then need to implement a corresponding constructor that usesio
andpath
as input.Step 1: Define the Template Factory Method in
Container
Step 2: Implement the constructors on the specific
Container
classes (e.g.,TimeSeries
)4. Proposed implementation for reading untyped groups (e.g.,
/acquisition
)I'm not sure we'll need do this, since a group by itself does not define data. To access the contents we could define access methods on the parent
Container
class (e.g.,NWBFile
) that owns the untyped group to access its contents.TODO
ReadDatasetWrapper
,ReadAttributeWrapper
(See Add read for neurodata_types, e.g., Container, TimeSeries #91)ReadDatasetWrapper
andReadAttributeWrapper
(I think this could be useful)Next steps