ref dataset ingest fails if a file has no "branch_time_in_child" global NetCDF attribute #38

Closed
bouweandela opened this issue Dec 11, 2024 · 2 comments · Fixed by #42
Labels
bug Something isn't working

@bouweandela
Contributor

Describe the bug

ref datasets ingest fails when a file has no "branch_time_in_child" global NetCDF attribute.

Failing Test

Run ref --log-level INFO datasets ingest --n-jobs 20 --source-type cmip6 ~/climate_data/CMIP6/
where the directory contains the file /home/bandela/climate_data/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/r101i1p1f1/Amon/tas/gr/v20200412/tas_Amon_EC-Earth3_historical_r101i1p1f1_gr_199001-199012.nc.

This results in the following error message:

╭───────────────────────────────────────────────────────────────────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────────────────────────────────────────────────────────────────╮
│ in pandas._libs.lib.maybe_convert_numeric:2374                                                                                                                                                                                      │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Unable to parse string "None"

During handling of the above exception, another exception occurred:

╭───────────────────────────────────────────────────────────────────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────────────────────────────────────────────────────────────────╮
│ /home/bandela/src/cmip-ref/cmip-ref/packages/ref/src/ref/cli/datasets.py:141 in ingest                                                                                                                                              │
│                                                                                                                                                                                                                                     │
│   138 │   │   logger.error(f"File or directory {file_or_directory} does not exist")            ╭──────────────────────────────────────────────────────── locals ────────────────────────────────────────────────────────╮           │
│   139 │   │   raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), file_or_directo │           adapter = <ref.datasets.cmip6.CMIP6DatasetAdapter object at 0x7be6108bd790>                                  │           │
│   140 │                                                                                        │            config = Config(                                                                                            │           │
│ ❱ 141 │   data_catalog = adapter.find_local_datasets(file_or_directory)                        │                     │   paths=Paths(                                                                                   │           │
│   142 │   data_catalog = adapter.validate_data_catalog(data_catalog, skip_invalid=skip_invalid │                     │   │   data=PosixPath('/home/bandela/.config/cmip-ref/data'),                                     │           │
│   143 │                                                                                        │                     │   │   log=PosixPath('/home/bandela/.config/cmip-ref/log'),                                       │           │
│   144 │   logger.info(                                                                         │                     │   │   tmp=PosixPath('/home/bandela/.config/cmip-ref/tmp'),                                       │           │
│                                                                                                │                     │   │   allow_out_of_tree_datasets=True                                                            │           │
│                                                                                                │                     │   ),                                                                                             │           │
│                                                                                                │                     │   db=Db(                                                                                         │           │
│                                                                                                │                     │   │   database_url='sqlite:////home/bandela/.config/cmip-ref/db/ref.db',                         │           │
│                                                                                                │                     │   │   run_migrations=True                                                                        │           │
│                                                                                                │                     │   ),                                                                                             │           │
│                                                                                                │                     │   _raw=None,                                                                                     │           │
│                                                                                                │                     │   _config_file=PosixPath('/home/bandela/.config/cmip-ref/ref.toml')                              │           │
│                                                                                                │                     )                                                                                                  │           │
│                                                                                                │               ctx = <click.core.Context object at 0x7be6108b3910>                                                      │           │
│                                                                                                │                db = <ref.database.Database object at 0x7be610869e50>                                                   │           │
│                                                                                                │           dry_run = False                                                                                              │           │
│                                                                                                │ file_or_directory = PosixPath('/home/bandela/climate_data/CMIP6')                                                      │           │
│                                                                                                │            kwargs = {'n_jobs': 20}                                                                                     │           │
│                                                                                                │            n_jobs = 20                                                                                                 │           │
│                                                                                                │      skip_invalid = False                                                                                              │           │
│                                                                                                │             solve = False                                                                                              │           │
│                                                                                                │       source_type = <SourceDatasetType.CMIP6: 'cmip6'>                                                                 │           │
│                                                                                                ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯           │
│                                                                                                                                                                                                                                     │
│ /home/bandela/src/cmip-ref/cmip-ref/packages/ref/src/ref/datasets/cmip6.py:194 in find_local_datasets                                                                                                                               │
│                                                                                                                                                                                                                                     │
│   191 │   │                                                                                                                                                                                                                         │
│   192 │   │   # Temporary fix for some datasets                                                                                                                                                                                     │
│   193 │   │   # TODO: Replace with a standalone package that contains metadata fixes for CMIP6                                                                                                                                      │
│ ❱ 194 │   │   datasets = _apply_fixes(datasets)                                                                                                                                                                                     │
│   195 │   │                                                                                                                                                                                                                         │
│   196 │   │   return datasets                                                                                                                                                                                                       │
│   197                                                                                                                                                                                                                               │
│                                                                                                                                                                                                                                     │
│ ╭──────────────────────────────────────────────────────────────────────────────────────────── locals ────────────────────────────────────────────────────────────────────────────────────────────╮                                  │
│ │           builder = Builder(paths=['/home/bandela/climate_data/CMIP6'], storage_options={}, depth=10, exclude_patterns=[], include_patterns=['*.nc'], joblib_parallel_kwargs={'n_jobs': 20})   │                                  │
│ │          datasets = │     activity_id branch_method branch_time_in_child  ...                                               path    version                                        instance_id │                                  │
│ │                     0      AerChemMIP      Standard                  0.0  ...  /home/bandela/climate_data/CMIP6/AerChemMIP/BC...  v20201021  CMIP6.AerChemMIP.BCC.BCC-ESM1.ssp370.r1i1p1f1.... │                                  │
│ │                     1      AerChemMIP      Standard                  0.0  ...  /home/bandela/climate_data/CMIP6/AerChemMIP/BC...  v20201021  CMIP6.AerChemMIP.BCC.BCC-ESM1.ssp370.r1i1p1f1.... │                                  │
│ │                     2           C4MIP      standard              24455.0  ...  /home/bandela/climate_data/CMIP6/C4MIP/NCAR/CE...  v20191119  CMIP6.C4MIP.NCAR.CESM2.esm-1pct-brch-1000PgC.r... │                                  │
│ │                     3           C4MIP      standard              24455.0  ...  /home/bandela/climate_data/CMIP6/C4MIP/NCAR/CE...  v20191119  CMIP6.C4MIP.NCAR.CESM2.esm-1pct-brch-1000PgC.r... │                                  │
│ │                     4           C4MIP      standard              24455.0  ...  /home/bandela/climate_data/CMIP6/C4MIP/NCAR/CE...  v20191119  CMIP6.C4MIP.NCAR.CESM2.esm-1pct-brch-1000PgC.r... │                                  │
│ │                     ...           ...           ...                  ...  ...                                                ...        ...                                                ... │                                  │
│ │                     6226  ScenarioMIP      standard             735110.0  ...  /home/bandela/climate_data/CMIP6/ScenarioMIP/N...  v20200528  CMIP6.ScenarioMIP.NCAR.CESM2.ssp585.r11i1p1f1.... │                                  │
│ │                     6227  ScenarioMIP      standard             735110.0  ...  /home/bandela/climate_data/CMIP6/ScenarioMIP/N...  v20200528  CMIP6.ScenarioMIP.NCAR.CESM2.ssp585.r11i1p1f1.... │                                  │
│ │                     6228  ScenarioMIP      standard             735110.0  ...  /home/bandela/climate_data/CMIP6/ScenarioMIP/N...  v20200528  CMIP6.ScenarioMIP.NCAR.CESM2.ssp585.r4i1p1f1.O... │                                  │
│ │                     6229  ScenarioMIP      standard             735110.0  ...  /home/bandela/climate_data/CMIP6/ScenarioMIP/N...  v20200528  CMIP6.ScenarioMIP.NCAR.CESM2.ssp585.r4i1p1f1.O... │                                  │
│ │                     6230  ScenarioMIP      Standard                  0.0  ...  /home/bandela/climate_data/CMIP6/ScenarioMIP/N...  v20190731  CMIP6.ScenarioMIP.NUIST.NESM3.ssp126.r1i1p1f1.... │                                  │
│ │                                                                                                                                                                                                │                                  │
│ │                     [6231 rows x 37 columns]                                                                                                                                                   │                                  │
│ │         drs_items = ['activity_id', 'institution_id', 'source_id', 'experiment_id', 'member_id', 'table_id', 'variable_id', 'grid_label', 'version']                                           │                                  │
│ │ file_or_directory = PosixPath('/home/bandela/climate_data/CMIP6')                                                                                                                              │                                  │
│ │              self = <ref.datasets.cmip6.CMIP6DatasetAdapter object at 0x7be6108bd790>                                                                                                          │                                  │
│ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯                                  │
│                                                                                                                                                                                                                                     │
│ /home/bandela/src/cmip-ref/cmip-ref/packages/ref/src/ref/datasets/cmip6.py:56 in _apply_fixes                                                                                                                                       │
│                                                                                                                                                                                                                                     │
│    53 │   data_catalog = data_catalog.groupby("instance_id").apply(_fix_parent_variant_label).                                                                                                                                      │
│    54 │                                                                                                                                                                                                                             │
│    55 │   # EC-Earth3 uses "D" as a suffix for the branch_time_in_child and branch_time_in_par                                                                                                                                      │
│ ❱  56 │   data_catalog["branch_time_in_child"] = pd.to_numeric(                                                                                                                                                                     │
│    57 │   │   data_catalog["branch_time_in_child"].astype(str).str.replace("D", ""), errors="r                                                                                                                                      │
│    58 │   )                                                                                                                                                                                                                         │
│    59 │   data_catalog["branch_time_in_parent"] = pd.to_numeric(                                                                                                                                                                    │
│                                                                                                                                                                                                                                     │
│ ╭───────────────────────────────────────────────────────────────────────────────────────── locals ──────────────────────────────────────────────────────────────────────────────────────────╮                                       │
│ │ data_catalog = │     activity_id branch_method branch_time_in_child  ...                                               path    version                                        instance_id │                                       │
│ │                0      AerChemMIP      Standard                  0.0  ...  /home/bandela/climate_data/CMIP6/AerChemMIP/BC...  v20201021  CMIP6.AerChemMIP.BCC.BCC-ESM1.ssp370.r1i1p1f1.... │                                       │
│ │                1      AerChemMIP      Standard                  0.0  ...  /home/bandela/climate_data/CMIP6/AerChemMIP/BC...  v20201021  CMIP6.AerChemMIP.BCC.BCC-ESM1.ssp370.r1i1p1f1.... │                                       │
│ │                2           C4MIP      standard              24455.0  ...  /home/bandela/climate_data/CMIP6/C4MIP/NCAR/CE...  v20191119  CMIP6.C4MIP.NCAR.CESM2.esm-1pct-brch-1000PgC.r... │                                       │
│ │                3           C4MIP      standard              24455.0  ...  /home/bandela/climate_data/CMIP6/C4MIP/NCAR/CE...  v20191119  CMIP6.C4MIP.NCAR.CESM2.esm-1pct-brch-1000PgC.r... │                                       │
│ │                4           C4MIP      standard              24455.0  ...  /home/bandela/climate_data/CMIP6/C4MIP/NCAR/CE...  v20191119  CMIP6.C4MIP.NCAR.CESM2.esm-1pct-brch-1000PgC.r... │                                       │
│ │                ...           ...           ...                  ...  ...                                                ...        ...                                                ... │                                       │
│ │                6226  ScenarioMIP      standard             735110.0  ...  /home/bandela/climate_data/CMIP6/ScenarioMIP/N...  v20200528  CMIP6.ScenarioMIP.NCAR.CESM2.ssp585.r11i1p1f1.... │                                       │
│ │                6227  ScenarioMIP      standard             735110.0  ...  /home/bandela/climate_data/CMIP6/ScenarioMIP/N...  v20200528  CMIP6.ScenarioMIP.NCAR.CESM2.ssp585.r11i1p1f1.... │                                       │
│ │                6228  ScenarioMIP      standard             735110.0  ...  /home/bandela/climate_data/CMIP6/ScenarioMIP/N...  v20200528  CMIP6.ScenarioMIP.NCAR.CESM2.ssp585.r4i1p1f1.O... │                                       │
│ │                6229  ScenarioMIP      standard             735110.0  ...  /home/bandela/climate_data/CMIP6/ScenarioMIP/N...  v20200528  CMIP6.ScenarioMIP.NCAR.CESM2.ssp585.r4i1p1f1.O... │                                       │
│ │                6230  ScenarioMIP      Standard                  0.0  ...  /home/bandela/climate_data/CMIP6/ScenarioMIP/N...  v20190731  CMIP6.ScenarioMIP.NUIST.NESM3.ssp126.r1i1p1f1.... │                                       │
│ │                                                                                                                                                                                           │                                       │
│ │                [6231 rows x 37 columns]                                                                                                                                                   │                                       │
│ ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯                                       │
│                                                                                                                                                                                                                                     │
│ /home/bandela/mambaforge/envs/esmvaltool/lib/python3.11/site-packages/pandas/core/tools/numeric.py:222 in to_numeric                                                                                                                │
│                                                                                                                                                                                                                                     │
│   219 │   │   values = ensure_object(values)                                                   ╭─────────────────────────────────────── locals ────────────────────────────────────────╮                                            │
│   220 │   │   coerce_numeric = errors not in ("ignore", "raise")                               │            arg = 0            0.0                                                     │                                            │
│   221 │   │   try:                                                                             │                  1            0.0                                                     │                                            │
│ ❱ 222 │   │   │   values, new_mask = lib.maybe_convert_numeric(  # type: ignore[call-overload] │                  2        24455.0                                                     │                                            │
│   223 │   │   │   │   values,                                                                  │                  3        24455.0                                                     │                                            │
│   224 │   │   │   │   set(),                                                                   │                  4        24455.0                                                     │                                            │
│   225 │   │   │   │   coerce_numeric=coerce_numeric,                                           │                  │   │     ...                                                        │                                            │
│                                                                                                │                  6226    735110.0                                                     │                                            │
│                                                                                                │                  6227    735110.0                                                     │                                            │
│                                                                                                │                  6228    735110.0                                                     │                                            │
│                                                                                                │                  6229    735110.0                                                     │                                            │
│                                                                                                │                  6230         0.0                                                     │                                            │
│                                                                                                │                  Name: branch_time_in_child, Length: 6231, dtype: object              │                                            │
│                                                                                                │ coerce_numeric = False                                                                │                                            │
│                                                                                                │       downcast = None                                                                 │                                            │
│                                                                                                │  dtype_backend = <no_default>                                                         │                                            │
│                                                                                                │         errors = 'raise'                                                              │                                            │
│                                                                                                │       is_index = False                                                                │                                            │
│                                                                                                │     is_scalars = False                                                                │                                            │
│                                                                                                │      is_series = True                                                                 │                                            │
│                                                                                                │           mask = None                                                                 │                                            │
│                                                                                                │       new_mask = None                                                                 │                                            │
│                                                                                                │    orig_values = array(['0.0', '0.0', '24455.0', ..., '735110.0', '735110.0', '0.0'], │                                            │
│                                                                                                │                  │     dtype=object)                                                  │                                            │
│                                                                                                │         values = array(['0.0', '0.0', '24455.0', ..., '735110.0', '735110.0', '0.0'], │                                            │
│                                                                                                │                  │     dtype=object)                                                  │                                            │
│                                                                                                │   values_dtype = dtype('O')                                                           │                                            │
│                                                                                                ╰───────────────────────────────────────────────────────────────────────────────────────╯                                            │
│                                                                                                                                                                                                                                     │
│ in pandas._libs.lib.maybe_convert_numeric:2416                                                                                                                                                                                      │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Unable to parse string "None" at position 967
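
The traceback suggests that a missing attribute reaches _apply_fixes as None, is turned into the string "None" by .astype(str), and is then rejected by pd.to_numeric(..., errors="raise"). A minimal sketch of one way to tolerate missing values while still raising on genuinely malformed ones is shown below. The helper name parse_branch_time and the sample values are hypothetical, and this is not necessarily the change made in #42.

```python
import pandas as pd


def parse_branch_time(values: pd.Series) -> pd.Series:
    """Hypothetical helper: parse branch times, keeping missing values as NaN.

    EC-Earth3 uses "D" as a suffix on branch times, so it is stripped first.
    """
    # Remember which entries are missing before any string conversion,
    # because converting None to text can produce the literal "None".
    missing = values.isna()

    # Strip the "D" suffix from the textual form of every value.
    text = values.astype(str).str.replace("D", "", regex=False)

    # Parse only the values that are present, so malformed entries still raise.
    parsed = pd.to_numeric(text[~missing], errors="raise")

    # Restore the original index; missing entries become NaN.
    return parsed.reindex(values.index).astype("float64")


# Made-up sample: a "D"-suffixed value, a missing attribute, and a plain float.
sample = pd.Series(["0.0D", None, 24455.0], dtype=object)
result = parse_branch_time(sample)
print(result)
```

Masking the missing entries before parsing keeps errors="raise" useful for unexpected strings. Passing errors="coerce" instead would be shorter, but it would also silently turn any other unparsable value into NaN.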

The header of the file looks like this:

$ ncdump -h /home/bandela/climate_data/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/r101i1p1f1/Amon/tas/gr/v20200412/tas_Amon_EC-Earth3_historical_r101i1p1f1_gr_199001-199012.nc 
netcdf tas_Amon_EC-Earth3_historical_r101i1p1f1_gr_199001-199012 {
dimensions:
	time = UNLIMITED ; // (12 currently)
	lat = 256 ;
	lon = 512 ;
	bnds = 2 ;
variables:
	double time(time) ;
		time:bounds = "time_bnds" ;
		time:units = "days since 1850-01-01 00:00:00" ;
		time:calendar = "proleptic_gregorian" ;
		time:axis = "T" ;
		time:long_name = "time" ;
		time:standard_name = "time" ;
	double time_bnds(time, bnds) ;
	double lat(lat) ;
		lat:bounds = "lat_bnds" ;
		lat:units = "degrees_north" ;
		lat:axis = "Y" ;
		lat:long_name = "Latitude" ;
		lat:standard_name = "latitude" ;
	double lat_bnds(lat, bnds) ;
	double lon(lon) ;
		lon:bounds = "lon_bnds" ;
		lon:units = "degrees_east" ;
		lon:axis = "X" ;
		lon:long_name = "Longitude" ;
		lon:standard_name = "longitude" ;
	double lon_bnds(lon, bnds) ;
	double height ;
		height:units = "m" ;
		height:axis = "Z" ;
		height:positive = "up" ;
		height:long_name = "height" ;
		height:standard_name = "height" ;
	float tas(time, lat, lon) ;
		tas:standard_name = "air_temperature" ;
		tas:long_name = "Near-Surface Air Temperature" ;
		tas:comment = "near-surface (usually, 2 meter) air temperature" ;
		tas:units = "K" ;
		tas:cell_methods = "area: time: mean" ;
		tas:cell_measures = "area: areacella" ;
		tas:history = "2019-06-21T19:49:31Z altered by CMOR: Treated scalar dimension: \'height\'. 2019-06-21T19:49:31Z altered by CMOR: Reordered dimensions, original order: lat lon time." ;
		tas:coordinates = "height" ;
		tas:missing_value = 1.e+20f ;
		tas:_FillValue = 1.e+20f ;

// global attributes:
		:Conventions = "CF-1.7 CMIP-6.2" ;
		:activity_id = "CMIP" ;
		:comment = "This experiment is part of SMHI\'s Large Ensemble. All experiments of the ensemble were started from a set of initial conditions in 1970." ;
		:contact = "cmip6-data@ec-earth.org" ;
		:creation_date = "2019-06-21T19:49:31Z" ;
		:data_specs_version = "01.00.30" ;
		:experiment = "all-forcing simulation of the recent past" ;
		:experiment_id = "historical" ;
		:external_variables = "areacella" ;
		:forcing_index = 1 ;
		:frequency = "mon" ;
		:further_info_url = "https://furtherinfo.es-doc.org/CMIP6.EC-Earth-Consortium.EC-Earth3.historical.none.r101i1p1f1" ;
		:grid = "T255L91-ORCA1L75" ;
		:grid_label = "gr" ;
		:initialization_index = 1 ;
		:institution = "AEMET, Spain; BSC, Spain; CNR-ISAC, Italy; DMI, Denmark; ENEA, Italy; FMI, Finland; Geomar, Germany; ICHEC, Ireland; ICTP, Italy; IDL, Portugal; IMAU, The Netherlands; IPMA, Portugal; KIT, Karlsruhe, Germany; KNMI, The Netherlands; Lund University, Sweden; Met Eireann, Ireland; NLeSC, The Netherlands; NTNU, Norway; Oxford University, UK; surfSARA, The Netherlands; SMHI, Sweden; Stockholm University, Sweden; Unite ASTR, Belgium; University College Dublin, Ireland; University of Bergen, Norway; University of Copenhagen, Denmark; University of Helsinki, Finland; University of Santiago de Compostela, Spain; Uppsala University, Sweden; Utrecht University, The Netherlands; Vrije Universiteit Amsterdam, the Netherlands; Wageningen University, The Netherlands. Mailing address: EC-Earth consortium, Rossby Center, Swedish Meteorological and Hydrological Institute/SMHI, SE-601 76 Norrkoping, Sweden" ;
		:institution_id = "EC-Earth-Consortium" ;
		:mip_era = "CMIP6" ;
		:nominal_resolution = "100 km" ;
		:parent_mip_era = "CMIP6" ;
		:physics_index = 1 ;
		:product = "model-output" ;
		:realization_index = 101 ;
		:realm = "atmos" ;
		:source = "EC-Earth3 (2019): \n",
			"aerosol: none\n",
			"atmos: IFS cy36r4 (TL255, linearly reduced Gaussian grid equivalent to 512 x 256 longitude/latitude; 91 levels; top level 0.01 hPa)\n",
			"atmosChem: none\n",
			"land: HTESSEL (land surface scheme built in IFS)\n",
			"landIce: none\n",
			"ocean: NEMO3.6 (ORCA1 tripolar primarily 1 deg with meridional refinement down to 1/3 degree in the tropics; 362 x 292 longitude/latitude; 75 levels; top grid cell 0-1 m)\n",
			"ocnBgchem: none\n",
			"seaIce: LIM3" ;
		:source_id = "EC-Earth3" ;
		:source_type = "AOGCM" ;
		:sub_experiment = "none" ;
		:sub_experiment_id = "none" ;
		:table_id = "Amon" ;
		:table_info = "Creation Date:(09 May 2019) MD5:1d844b3662ef3f929a5d7e5fa67a5d37" ;
		:title = "EC-Earth3 output prepared for CMIP6" ;
		:tracking_id = "hdl:21.14100/83813d98-e07c-4e53-af30-f09fa66fdc98" ;
		:variable_id = "tas" ;
		:variant_label = "r101i1p1f1" ;
		:license = "CMIP6 model data produced by EC-Earth-Consortium is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (https://creativecommons.org/licenses). Consult https://pcmdi.llnl.gov/CMIP6/TermsOfUse for terms of use governing CMIP6 output, including citation requirements and proper acknowledgment. Further information about this data, including some limitations, can be found via the further_info_url (recorded as a global attribute in this file) and at http://www.ec-earth.org. The data producers and data providers make no warranty, either express or implied, including, but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law." ;
		:cmor_version = "3.4.0" ;
		:history = "2019-06-21T19:49:31Z ; CMOR rewrote data to be consistent with CMIP6, CF-1.7 CMIP-6.2 and CF standards.;\n",
			"processed by ece2cmor vv1.1.0, git rev. a8c66d7d6246ddb05d1826d6237df61747d587ca\n",
			"The cmor-fixer version v2.1 script has been applied." ;
}

Expected behavior

I would expect the tool to handle a missing "branch_time_in_child" attribute gracefully instead of failing.
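As a rough illustration of what "graceful" could mean here, the sketch below reads a numeric global attribute from a plain dict of attributes and falls back to NaN when the attribute is absent or cannot be parsed (for example the string "None" from the traceback). The helper name `read_numeric_attr` and the dict-based interface are assumptions made for this example; this is not the code in the ref package, nor the fix that went into #42.

```python
import math


def read_numeric_attr(attrs, name, default=math.nan):
    """Return attrs[name] as a float, or `default` if it is missing or unparsable.

    `attrs` is assumed to be a plain mapping of global attribute names to
    values, e.g. collected from a NetCDF file header (hypothetical interface).
    """
    value = attrs.get(name)
    if value is None:
        # Attribute is not present in the file at all.
        return default
    try:
        return float(value)
    except (TypeError, ValueError):
        # Value such as the string "None" cannot be converted to a number.
        return default


# Attributes loosely mirroring the file above: no branch_time_in_child entry.
attrs = {"experiment_id": "historical", "realization_index": 101}
branch_time = read_numeric_attr(attrs, "branch_time_in_child")
print(math.isnan(branch_time))
```

With a fallback like this, a file without the attribute would be ingested with an empty branch time rather than aborting the whole run.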

Screenshots

None

System

ref package installed at commit 80ce4a3

Linux, Python 3.11

Additional context

None

@bouweandela bouweandela added the bug Something isn't working label Dec 11, 2024
@lewisjared
Contributor

Thanks for the report. I think I fixed this (not very gracefully) in #42.

I 100% support your package of metadata fixes 😁

@bouweandela
Contributor Author

> Thanks for the report. I think I fixed this (not very gracefully) in #42.

Great!

> I 100% support your package of metadata fixes 😁

I wouldn't want to take credit for the idea, but I believe it is a good one, so I am promoting it.
