Single data file structure #480

RemingtonRohel · 2024-05-09T20:21:30Z

Problem

The presence of two accepted data file formats for each data product is often troublesome and creates extra overhead in regular operations. Moving to a single file format would streamline our data flow processes and simplify file I/O immensely in pyDARNio.

Current Formats

Site-structured files

Pros

Data is stored in a separate group for each averaging period, so a crash between writes does not affect the groups already written to the file
Facilitates post-processing of files since each group is an atomic piece of data and metadata to be processed.
Native structure used by borealis_postprocessors
Easily written during radar operation

Cons

Not easy to make range-time plots from
Each group contains fields that are identical across groups (i.e. experiment slice metadata that does not change)

Array-structured files

Pros

Easy to make range-time plots from
Redundant information stored only once

Cons

Requires more care when indexing into arrays, as some arrays are zero-padded to emulate ragged lists (e.g. the field pulse_phase_offset has a first dimension of num_sequences; the number of sequences can vary from group to group, so the array field has dimensions [num_groups, max_num_sequences, ...] to accommodate the ragged data).
Would require resizing the arrays as each new data packet comes in if written during radar operation
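To illustrate the indexing pitfall above, here is a minimal sketch using made-up values. The variable names and the per-group sequence counts are assumptions for illustration only and are not taken from an actual Borealis file.

```python
import numpy as np

# Hypothetical number of sequences in each of three groups (assumed values).
num_sequences = np.array([3, 1, 2])
max_num_sequences = int(num_sequences.max())

# Zero-padded array with dimensions [num_groups, max_num_sequences].
padded = np.zeros((len(num_sequences), max_num_sequences))
for group, count in enumerate(num_sequences):
    padded[group, :count] = np.arange(1, count + 1)

# A naive reduction over the sequence axis silently includes the padding.
naive_mean = padded.mean(axis=1)

# Build a mask of valid entries, then reduce over the real data only.
valid = np.arange(max_num_sequences)[None, :] < num_sequences[:, None]
masked_mean = np.array(
    [padded[group, valid[group]].mean() for group in range(len(num_sequences))]
)
```

The two results differ for every group that has fewer than max_num_sequences sequences, which is the extra care the padded layout demands from the reader.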

Recommendations

I think it would be pertinent to adopt a new file format that is easily writable like site-structured files, but reduces the redundant fields of each group within the current site-structured file format. For example, fields like antenna_arrays_order or experiment_id can be stored as attributes of the File object itself. When writing a new record to the file, Borealis could then check if the top-level file attributes exist before writing them, to avoid overwrites as new records are written.
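A rough sketch of the check-before-write idea, assuming the files are HDF5 and written with h5py. The function write_record, the record names, and the metadata values are hypothetical and only stand in for whatever Borealis would actually write.

```python
import h5py
import numpy as np


def write_record(h5file, record_name, file_metadata, record_data):
    """Write one record group, adding file-level attributes only if missing."""
    for key, value in file_metadata.items():
        # Skip attributes already present so later records never overwrite them.
        if key not in h5file.attrs:
            h5file.attrs[key] = value
    group = h5file.create_group(record_name)
    for key, value in record_data.items():
        group.create_dataset(key, data=value)


# In-memory HDF5 file, so nothing is written to disk in this sketch.
with h5py.File("sketch.hdf5", "w", driver="core", backing_store=False) as f:
    # Placeholder values; the real contents and types are not specified here.
    metadata = {"experiment_id": 100, "antenna_arrays_order": np.arange(4)}
    write_record(f, "record_0", metadata, {"data": np.zeros(4)})
    write_record(f, "record_1", metadata, {"data": np.ones(2)})

    stored_attrs = sorted(f.attrs.keys())
    record_names = sorted(f.keys())
    experiment_id = int(f.attrs["experiment_id"])
```

With this layout the shared metadata is stored once on the file, while each record remains its own group and can still be written independently.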

The creation of arrays for range-time plots can then be handled by pyDARNio. This would mean that the operation has to happen on each file read, but it could be done more efficiently than it is now. Currently, pyDARNio incurs a lot of overhead by reading every group of the site-structured file to determine the maximum dimension for many arrays, then allocating an array and re-reading each group to copy the data into the allocated array. This could instead be replaced with list concatenation, creating a data array that is flat along the ragged dimension, with a timestamps array providing easy slicing into the data arrays. This should be much faster than the current method of converting site-structured files to array-structured, and thus not overly inconvenient.
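One way the flat layout could look, as a sketch with NumPy. The record contents, the offsets array, and the helper functions are assumptions made for illustration; they are not pyDARNio code.

```python
import numpy as np

# Hypothetical records, each with a timestamp and a ragged-length data field.
records = [
    {"timestamp": 0.0, "data": np.array([1.0, 2.0, 3.0])},
    {"timestamp": 3.5, "data": np.array([4.0])},
    {"timestamp": 7.0, "data": np.array([5.0, 6.0])},
]

# Single pass over the records: no padding and no pre-scan for a maximum size.
timestamps = np.array([record["timestamp"] for record in records])
flat_data = np.concatenate([record["data"] for record in records])

# Offsets marking where each record starts in the flat array (assumed helper).
counts = np.array([len(record["data"]) for record in records])
offsets = np.concatenate(([0], np.cumsum(counts)))


def record_slice(index):
    """Return the data belonging to a single record."""
    return flat_data[offsets[index]:offsets[index + 1]]


def slice_by_time(start, end):
    """Return data from records with start <= timestamp < end."""
    first, last = np.searchsorted(timestamps, [start, end])
    return flat_data[offsets[first]:offsets[last]]
```

Here slice_by_time relies on the timestamps being sorted in ascending order, which should hold for records appended as the radar runs.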

RemingtonRohel · 2024-05-09T20:28:49Z

@alexchartier @JWiker I would like to know your thoughts on changing the Borealis file format. I find the current two-structure situation to be rather annoying, but changing formats will bring its own challenges. As far as I can tell, all changes can hopefully be contained to borealis, pyDARNio, data_flow, and borealis-data-utils. I'm not sure if your group uses the latter two of those repos, but we rely on them for our automated processes.

I'm happy to get your input on this, thanks in advance.

RemingtonRohel added this to the Version 1.0 milestone May 9, 2024

RemingtonRohel added enhancement question improve-data-quality labels May 9, 2024

RemingtonRohel linked a pull request Sep 10, 2024 that will close this issue

New data format #502

Open