New data format #502

Open
wants to merge 31 commits into
base: develop

RemingtonRohel
Contributor

@RemingtonRohel commented Sep 9, 2024

This PR introduces a new format for Borealis-produced HDF5 files.

Features

  • Single structure for files, removing the "site" and "array" structure distinction that was previously used.
  • All metadata that is static throughout an experiment is stored in a top-level metadata group and linked from each record
  • Records are still written as top-level groups of the file, with all data and metadata included in them
  • Added new fields:
    • antenna_arrays: descriptors ["main", "intf"] for bfiq files (or just ["main"] if intf array not present)
    • antenna_locations: [x, y, z] coordinates of each antenna relative to the midpoint of the main array
    • antennas_iq_data: replaces data for antennas_iq files
    • bfiq_data: replaces data for bfiq files
    • lag_numbers: The lag numbers for each unique pair of pulses in the experiment, in units of tau_spacing
    • lag_pulses: replaces lags, for a more descriptive name for the data inside (which is a 2D array of the unique pulse pairs)
    • pulse_timing: replaces pulse_timing_us
    • range_gates: array of the range gates used for this experiment
    • rawrf_data: replaces data for rawrf files
    • rx_antennas: indices into antenna_locations corresponding to a dimension of antennas_iq_data or rawrf_data
    • rx_intf_antennas: indices into antenna_locations of all interferometer array antennas used for receiving
    • rx_main_antennas: indices into antenna_locations of all main array antennas used for receiving
    • rx_intf_phases: complex phases for beamforming each interferometer-array antenna stream
    • rx_main_phases: complex phases for beamforming each main-array antenna stream
    • station_location: [latitude, longitude, altitude] of the radar
    • tx_antennas: indices into antenna_locations of all antennas used for transmission
  • Removed fields:
    • data
    • lags
    • num_ranges
    • num_samps
    • antenna_arrays_order
    • data_descriptors
    • data_dimensions
    • noise_at_freq
    • pulse_timing_us
  • Datasets may now have associated Dimension Scales that provide extra context for the meaning of the data within
  • Some fields are optional, and are not guaranteed to be included in each file of a given type (e.g. cfs_freqs will not be present if the experiment does not conduct a clear frequency search)
  • rawacf files can now be written directly with the DMAP file format, instead of HDF5.
  • Since no fields have the same name between file types, an antennas_iq file can be processed to rawacf while still maintaining the original antennas_iq datasets. Rather than having two or more files to handle, the single file would then contain both the low-level and higher-level data products, with all accompanying metadata.
  • Record names are now human-readable timestamps rather than milliseconds since epoch (which required a calculator to interpret)
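
The layout described in the feature list above can be sketched with `h5py`. This is only an illustration under stated assumptions, not the Borealis writer itself: the function name `build_example_file`, the record name, the placeholder values, and the toy two-dimensional shape of `bfiq_data` are all made up for the example. It shows a top-level `metadata` group, a record group that hard-links to it, a `description` attribute, and a dimension scale attached to a labelled dimension.

```python
import h5py


def build_example_file():
    """Build a tiny in-memory file that mimics the described layout (hypothetical helper)."""
    # The "core" driver with backing_store=False keeps everything in memory.
    f = h5py.File("layout_sketch.hdf5", "w", driver="core", backing_store=False)

    # Static, experiment-wide values are stored once in a top-level "metadata" group.
    meta = f.create_group("metadata")
    loc = meta.create_dataset("station_location", data=[52.0, -106.0, 500.0])  # placeholder values
    loc.attrs["description"] = "[latitude, longitude, altitude] of the radar"

    # Each record is a top-level group named with a human-readable timestamp.
    rec = f.create_group("20240909-1200-00.000000")

    # Assigning an existing dataset creates a hard link: both paths name the same object.
    rec["station_location"] = loc

    # Toy per-record data with shape (sequence, sample); real files have more dimensions.
    data = rec.create_dataset("bfiq_data", data=[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]])
    sqn = rec.create_dataset("sqn_timestamps", data=[1725883200.0, 1725883200.1])

    # A dimension scale associates sqn_timestamps with the "sequence" dimension.
    sqn.make_scale("sqn_timestamps")
    data.dims[0].label = "sequence"
    data.dims[0].attach_scale(sqn)
    return f
```

Because the record entry is a hard link rather than a copy, a reader can open any record group on its own and still find every static field, while the file stores the value only once.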

Config file changes

The fields related to antenna specifications have been moved under a single field antennas, containing:

  • [main|intf]_locations: a map of { "antenna index" : [x, y, z] coordinates } giving the coordinates of each [main or intf]-array antenna relative to the midpoint of the main antenna array.
  • [main|intf]_antenna_count: number of antennas in the respective array
  • [main|intf]_antenna_spacing: separation in meters between adjacent antennas in the respective array
  • standard_positions: flag indicating standard array positioning (i.e. arrays of equally-spaced antennas arranged on a line parallel to the x-axis)
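
As a rough illustration of the `antennas` field described above, the sketch below parses a made-up config fragment and checks the condition that `standard_positions` is said to imply: antennas on a line parallel to the x-axis, equally spaced. The antenna counts, spacing, and coordinates are invented, and `follows_standard_positions` is a hypothetical helper, not the check Borealis actually runs.

```python
import json

# Invented example values; only the key names come from the PR description.
EXAMPLE_CONFIG = json.loads("""
{
  "antennas": {
    "main_antenna_count": 4,
    "main_antenna_spacing": 15.24,
    "main_locations": {
      "0": [-22.86, 0.0, 0.0],
      "1": [-7.62, 0.0, 0.0],
      "2": [7.62, 0.0, 0.0],
      "3": [22.86, 0.0, 0.0]
    },
    "intf_antenna_count": 2,
    "intf_antenna_spacing": 15.24,
    "intf_locations": {
      "0": [-7.62, -100.0, 0.0],
      "1": [7.62, -100.0, 0.0]
    },
    "standard_positions": true
  }
}
""")


def follows_standard_positions(locations, spacing, tol=1e-6):
    """Return True if antennas sit on a line parallel to the x-axis, `spacing` apart.

    `locations` maps an antenna index (as a string) to [x, y, z] in meters.
    """
    indices = sorted(locations, key=int)
    first_index = int(indices[0])
    x0, y0, z0 = locations[indices[0]]
    for key in indices:
        x, y, z = locations[key]
        # Every antenna must share the first antenna's y and z coordinates.
        if abs(y - y0) > tol or abs(z - z0) > tol:
            return False
        # x must advance by one spacing per index step (index gaps are allowed).
        expected_x = x0 + (int(key) - first_index) * spacing
        if abs(x - expected_x) > tol:
            return False
    return True
```

For the fragment above, both `follows_standard_positions(antennas["main_locations"], antennas["main_antenna_spacing"])` and the equivalent interferometer call return `True`, where `antennas = EXAMPLE_CONFIG["antennas"]`.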

* Store static data as top-level metadata of the file
* `data_write.SliceData` -> `utils.file_formats.SliceData`
* Refactored the general HDF5 file format:
   - Top level "metadata" group
   - All data stored as `Dataset`, with `"description"` metadata attached
   - File-level metadata is hard-linked within each record group to the top-level "metadata" group entry
* Added labels for dimensions of vector fields in the data files, to aid in data interpretation and usage.
* Created HDF5Writer class to handle turning SliceData dataclass into the correct types for writing to Borealis HDF5 files.
* Removed ability to write JSON files
* Started refactoring DataWrite.output_data() to remove the internal functions
* DataWrite.__init__() now instantiates sockets internally (avoids passing sockets to threads, which is explicitly recommended against by zmq documentation)
* Functionality to send rawacf record to realtime is now internal to write_correlations() function
* Added `rx_main_phases` and `rx_intf_phases` fields
* Removed `--file-type` option to data_write.py script (only hdf5 supported)
* Replaced useless `assert` statement (ignored when script run with `-O` flag, which `steamed_hams.py` uses for release mode)
* Added support for writing rawacf files directly as DMAP
* Added dimension scales to certain fields for HDF5 files. These are datasets associated with a dimension of another dataset, e.g. associating the `sqn_timestamps` with the "sequence" dimension of `data`.
* Added units metadata for fields in HDF5 files.
* Added "rawacf_format" field to config files, specifying the default format to use when writing rawacf files.
* Added support for overriding the format that rawacf files are written in, via the `--rawacf-format` argument to steamed_hams.py
* Added `darn-dmap` as a dependency, and fixed numpy and pydarnio versions.
* New `antennas` field which holds all antenna information
   - `main_locations`: {index: [x, y, z]} for each main antenna
   - `intf_locations`: {index: [x, y, z]} for each intf antenna
   - `main_antenna_count`: number of main array antennas
   - `intf_antenna_count`: number of intf array antennas
   - `main_antenna_spacing`: uniform spacing between main-array antennas
   - `intf_antenna_spacing`: uniform spacing between intf-array antennas
   - `standard_positions`: flag indicating whether array antennas follow the standard linear configuration. If so, verifies that positions are parallel to x-axis and equally spaced by [main|intf]_antenna_spacing
* Added tests for the new fields
* Updated config files for each site
* Each type (antennas_iq, bfiq, etc.) has its own data field (antennas_iq_data, bfiq_data, etc.)
* `antenna_locations` field added containing [x, y, z] locations of each antenna
* `antenna_arrays` field added for bfiq files containing descriptors for the array dimension of the data (e.g. ["main", "intf"])
* `required` added to metadata, indicating whether it is an error for a field to be missing or not.
* `data` field removed
* `[main|intf]_antenna_count` fields removed
* `lags` field renamed to `lag_pulses`
* `num_ranges` and `num_samps` fields removed
* `range_gates` field added, simply an array of the range gates for the file (e.g. 0-74)
* `rx_antennas`, `rx_main_antennas`, `rx_intf_antennas`, and `tx_antennas` fields added, giving indices into `antenna_locations` of the antennas used for the experiment
* `station_location` field added, giving lat, lon, altitude of the radar
* Refactored `get_phase_shift()` in `signals.py` to use the antenna positions and interferometer array offsets for beamforming
* also refactored some variable names for simplicity
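
To make the idea of beamforming from antenna positions concrete, here is a minimal sketch of a textbook plane-wave phase calculation. It is not the refactored `get_phase_shift()`: the function name `beam_phases`, the azimuth convention (measured from the +y boresight toward +x), and the sign of the phase are all assumptions made for illustration. It does show how `rx_main_antennas`-style indices into `antenna_locations` could select the positions that feed per-antenna complex phases.

```python
import cmath
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def beam_phases(antenna_locations, antenna_indices, freq_hz, azimuth_deg, elevation_deg=0.0):
    """Return one unit-magnitude complex phase per selected antenna (hypothetical helper).

    antenna_locations: list of [x, y, z] positions in meters.
    antenna_indices: indices into antenna_locations of the antennas in use.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    # Unit vector toward the beam direction (assumed convention: boresight along +y).
    direction = (math.cos(el) * math.sin(az), math.cos(el) * math.cos(az), math.sin(el))
    wavenumber = 2.0 * math.pi * freq_hz / SPEED_OF_LIGHT

    phases = []
    for index in antenna_indices:
        x, y, z = antenna_locations[index]
        # Projection of the antenna position onto the beam direction, in meters.
        path = x * direction[0] + y * direction[1] + z * direction[2]
        phases.append(cmath.exp(1j * wavenumber * path))
    return phases
```

Working from full [x, y, z] positions rather than an index times a uniform spacing is what lets the same calculation cover non-standard array layouts and the interferometer array's offset from the main array.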
@RemingtonRohel linked an issue Sep 10, 2024 that may be closed by this pull request
* Created script file_docs_builder.py to generate .rst files for each file type
* Changed file name when writing DMAP files directly (ensuring slice_id is written as a letter instead of a number)
…ata, bfiq_data, rawrf_data final dimension.

* Given as an array of ints, representing the time of each measurement in microseconds relative to the first pulse in the sequence.
* `tests/simulators/steamed_sham.py` will call a simulator instead of usrp_driver.cpp
* `tests/simulators/driver_sim.py` mocks usrp_driver.cpp, generating noise instead of data (and not currently adding the pulse data to the noise)
* Updated record name format to use hyphens, like `YYYYMMDD-HHMM-SS.fffff`
* Removed `dim_labels` from `lag_pulses` field
* Correctly format SliceData object as DMAP
* Use new pydarnio functions for converting rawacf records
* Fix dmap filename convention
* Serve data for all slices to realtime from data_write
* Replaced test data for realtime sim with single-record dmap
… xarray

* pyDARNio implementation of array-structured fields being added in parallel to this branch. This implementation will only allow in-memory array-structuring, and will not support writing of array-structured data files.
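
For readers converting old record names, the change from milliseconds since epoch to the hyphenated timestamp format mentioned in the commits above can be sketched as follows. The PR gives the pattern as `YYYYMMDD-HHMM-SS.fffff`; this sketch assumes the fractional part is microseconds (six digits, Python's `%f`) and that timestamps are UTC, either of which may differ from the actual implementation. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def record_name_from_epoch_ms(epoch_ms):
    """Convert an old-style record name (ms since epoch) to a hyphenated timestamp."""
    # Integer timedelta arithmetic avoids floating-point rounding of the milliseconds.
    moment = EPOCH + timedelta(milliseconds=epoch_ms)
    # Assumed format: date, hour and minute, then seconds with a microsecond fraction.
    return moment.strftime("%Y%m%d-%H%M-%S.%f")
```

For example, `record_name_from_epoch_ms(1725883200000)` gives `"20240909-1200-00.000000"` under these assumptions.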
@RemingtonRohel marked this pull request as ready for review September 18, 2024 20:51
Development

Successfully merging this pull request may close these issues.

Single data file structure
2 participants