Problem
The presence of two accepted data file formats for each data product is often troublesome and creates extra overhead in regular operations. Moving to a single file format would streamline our data flow processes and simplify file I/O immensely in pyDARNio.
Current Formats
Site-structured files
Pros
Data is stored in a separate group for each averaging period, so a crash between writes will not affect the groups already written to the file
Facilitates post-processing of files since each group is an atomic piece of data and metadata to be processed.
Native structure used by borealis_postprocessors
Easily written during radar operation
Cons
Not easy to make range-time plots from
Each group contains fields that are identical across groups (i.e. experiment slice metadata that does not change)
Array-structured files
Pros
Easy to make range-time plots from
Redundant information stored only once
Cons
Requires more care when indexing into arrays, as some arrays are zero-padded to mimic ragged lists (e.g. the field pulse_phase_offset has a first dimension of num_sequences; the number of sequences can vary from group to group, so the array field has dimensions [num_groups, max_num_sequences, ...] to accommodate the ragged arrays).
Would require resizing the arrays as each new data packet comes in if written during radar operation
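To illustrate the indexing pitfall, here is a minimal sketch using NumPy. The data values are made up, the array is reduced to two dimensions for brevity, and the per-group count array is an assumption about how the valid length of each row would be tracked.

```python
import numpy as np

# Hypothetical per-group data: each averaging period has a different
# number of sequences (values are placeholders, not real radar data).
groups = [np.array([0.1, 0.2, 0.3]), np.array([0.4]), np.array([0.5, 0.6])]

# Number of valid sequences in each group.
num_sequences = np.array([len(g) for g in groups])
max_num_sequences = int(num_sequences.max())

# Array-structured layout: [num_groups, max_num_sequences], zero-padded.
padded = np.zeros((len(groups), max_num_sequences))
for i, g in enumerate(groups):
    padded[i, : len(g)] = g

# Reducing over the whole row silently includes the padding zeros.
naive_mean = padded[1].mean()

# Slicing by the per-group count recovers only the real data.
valid_mean = padded[1, : num_sequences[1]].mean()
```

Any reader of the array-structured file has to carry the per-group count alongside the padded array and remember to apply it on every access.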
Recommendations
I think it would be pertinent to adopt a new file format that is easily writable like site-structured files, but reduces the redundant fields of each group within the current site-structured file format. For example, fields like antenna_arrays_order or experiment_id can be stored as attributes of the File object itself. When writing a new record to the file, Borealis could then check if the top-level file attributes exist before writing them, to avoid overwrites as new records are written.
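A rough sketch of what that write path could look like, assuming the files are HDF5 and written with h5py. The function name write_record, the record naming scheme, and all field values are hypothetical placeholders, not the actual Borealis implementation.

```python
import h5py
import numpy as np


def write_record(f, record_name, shared_metadata, record_data):
    """Append one record, storing unchanging fields once at file level."""
    # Write each shared field as a file attribute only if it is not
    # already present, so later records never overwrite it.
    for key, value in shared_metadata.items():
        if key not in f.attrs:
            f.attrs[key] = value

    # Fields that vary per averaging period still get their own group.
    group = f.create_group(record_name)
    for key, value in record_data.items():
        group.create_dataset(key, data=value)


# Placeholder values for illustration only.
shared = {
    "experiment_id": 1234,
    "antenna_arrays_order": np.array([b"main", b"intf"]),
}

# In-memory file so the sketch leaves nothing on disk.
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    for i in range(3):
        write_record(f, f"record_{i}", shared, {"data": np.arange(i + 1)})
    stored_experiment_id = int(f.attrs["experiment_id"])
    record_names = sorted(f.keys())
    attr_names = sorted(f.attrs.keys())
```

Each record is still written atomically as its own group, so this keeps the crash-safety of the site-structured layout while storing the shared fields only once.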
The creation of arrays for range-time plots can then be handled by pyDARNio. This would mean that this operation has to happen on each file read, but this too could be done more succinctly. pyDARNio incurs a lot of overhead in reading the groups of the site-structured file to determine the maximum dimension for many arrays, then allocating an array and re-reading each group to copy the data into the allocated array. This could instead be replaced with list concatenation, creating a data array that is flat along the ragged dimension, with a timestamps array providing easy slicing into the data arrays. This should be much faster than the current method of converting site-structured files to array-structured, and thus not overly inconvenient.
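The concatenation idea could be sketched as follows. This is one possible interpretation, assuming the timestamps array holds one entry per row of the flattened data; the record contents, shapes, and the helper name rows_for are made up for illustration.

```python
import numpy as np

# Hypothetical records as read from a file: (timestamp, data), where
# data has shape [num_sequences, num_ranges] and num_sequences varies.
records = [
    (100, np.full((3, 4), 1.0)),
    (103, np.full((1, 4), 2.0)),
    (106, np.full((2, 4), 3.0)),
]

# One pass over the records: no maximum dimension to find, no padding.
flat = np.concatenate([data for _, data in records], axis=0)

# One timestamp per row of the flat array, repeated for each sequence.
counts = [data.shape[0] for _, data in records]
row_timestamps = np.repeat([t for t, _ in records], counts)


def rows_for(timestamp):
    """Return all rows belonging to the record with this timestamp."""
    return flat[row_timestamps == timestamp]


second_record = rows_for(103)
```

Selecting a time range works the same way with a boolean mask such as (row_timestamps >= start) & (row_timestamps < end).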
@alexchartier @JWiker I would like to know your thoughts on changing the Borealis file format. I find the current two-structure situation rather annoying, but changing formats will bring its own challenges. As far as I can tell, all changes can hopefully be contained to borealis, pyDARNio, data_flow, and borealis-data-utils. I'm not sure if your group uses the latter two repos, but we rely on them for our automated processes.
I'm happy to get your input on this, thanks in advance.