Single data file structure #480

Open
RemingtonRohel opened this issue May 9, 2024 · 1 comment · May be fixed by #502

Comments

@RemingtonRohel
Contributor

Problem

Having two accepted file formats for each data product is often troublesome and adds overhead to regular operations. Moving to a single file format would streamline our data flow processes and greatly simplify file I/O in pyDARNio.

Current Formats

Site-structured files

Pros

  • Data is stored in a separate group for each averaging period, so a program crash between writes will not impact the file
  • Facilitates post-processing of files since each group is an atomic piece of data and metadata to be processed.
  • Native structure used by borealis_postprocessors
  • Easily written during radar operation

Cons

  • Not easy to make range-time plots from
  • Each group contains fields that are identical across groups (i.e. experiment slice metadata that does not change)

Array-structured files

Pros

  • Easy to make range-time plots from
  • Redundant information stored only once

Cons

  • Requires more care when indexing into arrays, since some arrays are zero-padded to emulate ragged lists. For example, the field pulse_phase_offset has a first dimension of num_sequences; because the number of sequences can vary from group to group, the array field has dimensions [num_groups, max_num_sequences, ...] to accommodate the ragged data.
  • Would require resizing arrays as each new data packet comes in if the file were written during radar operation
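
To illustrate the indexing care mentioned above, here is a minimal sketch with NumPy. The array contents and the helper name `valid_rows` are made up for illustration, and it assumes a per-group count of sequences (called `num_sequences` here) is available alongside the padded field.

```python
import numpy as np

# Hypothetical padded field: 3 groups, at most 4 sequences per group.
num_sequences = np.array([4, 2, 3])           # real sequence count per group
padded = np.zeros((3, 4))                     # [num_groups, max_num_sequences]
for g, n in enumerate(num_sequences):
    padded[g, :n] = np.arange(1, n + 1) + 10 * g

def valid_rows(padded, num_sequences, group):
    """Return only the real (non-padding) entries for one group."""
    return padded[group, :num_sequences[group]]

# Boolean mask that is True only where real data lives.
mask = np.arange(padded.shape[1]) < num_sequences[:, None]

naive_mean = padded.mean()         # wrong: padding zeros are averaged in
masked_mean = padded[mask].mean()  # averages real samples only
```

Any statistic taken over the whole padded array silently includes the padding, which is the kind of mistake the mask guards against.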

Recommendations

I think we should adopt a new file format that is as easy to write as site-structured files, but without the redundant fields repeated in every group of the current site-structured format. For example, fields like antenna_arrays_order or experiment_id can be stored as attributes of the File object itself. When writing a new record to the file, Borealis could then check whether the top-level file attributes already exist before writing them, so they are not overwritten as new records are added.
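
A rough sketch of the "check before writing" idea, not an actual Borealis implementation. The function name `write_file_attrs_once` and the metadata values are hypothetical. It assumes the file object exposes a dict-like `attrs` member, as an open `h5py.File` does; a plain stand-in object is used here so the sketch runs without HDF5 installed.

```python
from types import SimpleNamespace

def write_file_attrs_once(h5file, metadata):
    """Write each metadata item as a file-level attribute only if absent.

    Returns the list of keys that were actually written.
    """
    written = []
    for key, value in metadata.items():
        if key not in h5file.attrs:      # skip attributes that already exist
            h5file.attrs[key] = value
            written.append(key)
    return written

# Stand-in for an open file: anything with a dict-like `attrs` works here.
fake_file = SimpleNamespace(attrs={})

# First record: the file-level attributes do not exist yet, so both are written.
first = write_file_attrs_once(
    fake_file,
    {"experiment_id": 1234, "antenna_arrays_order": ["a", "b"]},  # made-up values
)

# Later record: experiment_id already exists, so nothing is overwritten.
second = write_file_attrs_once(fake_file, {"experiment_id": 9999})
```

With a real file this would be called once per record write, before the record's own group is created. Whether a mismatch between the stored and incoming value should raise an error is a design choice this sketch leaves open.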

The creation of arrays for range-time plots can then be handled by pyDARNio. This means the conversion has to happen on every file read, but it could be done more efficiently than it is now. Currently, pyDARNio incurs a lot of overhead: it reads every group of the site-structured file to determine the maximum dimension of many arrays, allocates an array, then re-reads each group to copy the data into the allocated array. This could instead be replaced with list concatenation, creating a data array that is flat along the ragged dimension, with a timestamps array providing easy slicing into the data arrays. This should be much faster than the current method of converting site-structured files to array-structured, and thus not overly inconvenient.
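
The concatenation idea could look roughly like the following sketch. The record layout (a dict with `timestamps` and `data`), the function names, and the sample values are all assumptions for illustration; it also assumes one timestamp per entry along the ragged dimension and that records arrive in time order.

```python
import numpy as np

def flatten_records(records):
    """Concatenate per-record arrays along the ragged (first) dimension."""
    flat_t = np.concatenate([r["timestamps"] for r in records])
    flat_d = np.concatenate([r["data"] for r in records], axis=0)
    return flat_t, flat_d

def select_time_range(flat_t, flat_d, t_start, t_end):
    """Slice the flat data to entries with t_start <= timestamp <= t_end."""
    first = np.searchsorted(flat_t, t_start, side="left")
    last = np.searchsorted(flat_t, t_end, side="right")
    return flat_d[first:last]

# Two hypothetical records with different numbers of sequences (3 and 2).
records = [
    {"timestamps": np.array([0.0, 0.1, 0.2]), "data": np.arange(6).reshape(3, 2)},
    {"timestamps": np.array([1.0, 1.1]), "data": np.arange(6, 10).reshape(2, 2)},
]

flat_t, flat_d = flatten_records(records)
subset = select_time_range(flat_t, flat_d, 0.15, 1.05)
```

Each group is read once and no padded array is allocated, so the two-pass "find maximum dimension, then copy" step goes away.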

@RemingtonRohel RemingtonRohel added this to the Version 1.0 milestone May 9, 2024
@RemingtonRohel
Contributor Author

@alexchartier @JWiker I would like to know your thoughts on changing the Borealis file format. I find the current two-structure situation rather annoying, but changing formats will bring its own challenges. As far as I can tell, all changes can hopefully be contained to borealis, pyDARNio, and our data_flow and borealis-data-utils. I'm not sure if your group uses the latter two of those repos, but we rely on them for our automated processes.

I'm happy to get your input on this, thanks in advance.
