Composite/array values in string valued global attributes #341
Replies: 37 comments 8 replies
-
Do you mean, can they be arrays of strings?
-
Yes. My understanding, if I have read the NetCDF data model UML correctly, is that attributes are arrays anyhow: https://docs.unidata.ucar.edu/netcdf-c/current/netcdf_data_model.html Assigning Other than the common case above, which is a single element, are arrays of two or more elements allowed in these attributes?
-
I don't remember that we have considered this question. The text was formulated before strings were introduced. Arrays of strings previously could only be represented by two-dimensional character arrays. The document does not mention that two-dimensional character arrays might be expected in these global attributes, so I guess that no-one thought they would be needed. That's not the same thing as prohibiting them, but I imagine that existing programs which access these attributes will expect them to contain a single string. If the program was written using netCDF4 Python, it would expect a string to be returned, and receiving a list of strings would probably lead to some error. Therefore I'm inclined to think that we should recommend that these attributes ( |
-
I agree that we should explicitly say that when a CF attribute value is supposed to be a string, it should not be an array of strings. Of course, a compelling use case might justify relaxing this rule, but, as Jonathan noted, that would likely break some existing software.
-
Agree on all counts -- there's a lot of precedent for using strings to represent multiple items via whitespace, delimiters, etc. As long as you can embed newlines, then there should be no need for an array.
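As a minimal sketch of the delimiter approach described above (not code from this thread), the following packs several items into one scalar string and splits them back out. The names `pack_items` and `unpack_items` are hypothetical, and using a newline as the delimiter is an assumption made for illustration; most CF attributes use blank-separated lists instead.

```python
def pack_items(items, delimiter="\n"):
    """Join several text items into a single scalar string."""
    for item in items:
        # An item containing the delimiter could not be recovered on reading.
        if delimiter in item:
            raise ValueError(f"item contains the delimiter: {item!r}")
    return delimiter.join(items)


def unpack_items(value, delimiter="\n"):
    """Split a delimiter-joined string back into its items."""
    return value.split(delimiter) if value else []


# Hypothetical attribute content holding two entries in one scalar string.
references = pack_items(["First reference", "Second reference"])
print(unpack_items(references))
```

The cost of this approach is that the delimiter becomes a reserved character, which is why the sketch rejects items that contain it.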
-
There is a pretty long discussion in cf-convention/cf-conventions#141. The use cases discussed were for the history attribute and anything in CF that is "space separated", e.g. coordinates, flag definitions, etc.
-
FWIW my understanding of the CF spec is indeed that all attributes essentially are a 1-D "array" of values, but this is only "somewhat like" an array, since it is 1-dimensional at most. However, my further understanding is that attributes containing multiple strings are not fully supported, at least in the Python interface. Also, from experiment, multi-element string arrays can now be assigned to attributes, but these are encoded as variable-length strings, not an array of 'ordinary' strings. So the datatypes of a single string and a string array are effectively different.
I have almost no experience of other language APIs, but I can see that in the C interface, HTH, and I'd be keen to hear the situation in other languages from someone less Python-centric. |
-
Yes, this issue is more or less about > 2 element variable length strings. While the question mentions global attributes, it nonetheless should be extended to cover places where string/char attributes in CF are present in general.
-
@pp-mo what you are seeing there are differences between the netCDF classic and the netCDF enhanced data models. The classic data model has no support for variable length data types (aka strings). The only slightly unusual thing I've seen that might be Python-specific is that the netCDF4 Python library will force the data type of attributes containing character arrays that have any chars outside the ASCII code point range into a string vlen type rather than a char array.
-
As with the global attributes, we don't have statements in the convention about arrays of strings for variable attributes, since they didn't exist when much of the text was written. The CF string-valued attributes are all expected to contain a word or a blank-separated list of words. I'm not aware of any reason for needing to use an array of strings, and this would certainly not work with some existing software. From the above discussion and cf-convention/cf-conventions#141, I'm inclined to think we should insert explicit statements in the convention text and correspondingly in the conformance document to say that any CF string-valued attribute must be either a 1D character array or a scalar string. Would anyone disagree with that? As with all aspects, this could be reconsidered in future.
-
Consistent with my earlier comment, I support Jonathan's comment immediately above.
-
Arrays of vlen strings as attributes have been a thing in netCDF for about 16 years now. It has been 6 years since cf-convention/cf-conventions#141 was started, and 4 years since that conversation went dormant. I think the future to consider supporting these types of values is now (or soon). For my own work, having to parse some CF custom string syntax (see e.g. cell_methods) or mangling my flag definitions has caused more issues than just dealing with features that have been in netCDF itself for my entire professional career. I'd support keeping the Conventions attribute as a 1D char array attribute, since that is what you would need to read to know if you might encounter more "advanced" features in a file. I guess the above is just asking, if not now, when?
-
Hmm -- is there any talk of CF 2.0? In which we could start expecting "modern" netCDF, and stop supporting old stuff from COARDS (at least when talking about it in the docs, etc.)? We wouldn't want to do any major breaking changes, but a chance to remove some of the cruft of the past would be nice, and then we wouldn't need to have these debates ....
-
My feeling is that CF 2.0 gets talked about in a "one day" sense that just stops the conversation. In CF 1.11 right now netCDF4 groups and the vlen string dtype for variables are supported, so it's already "modern" in some ways. Groups can be especially breaking if programs aren't expecting them. I know when I first started working with CF, outdated (dare I say false) statements like "dtype x aren't supported in netCDF" that used to be in the document made me suspicious that the conventions were not being updated or maintained. As an aside, I think for a CF 2.0 it would be neat to break it up into different "independent" parts which are then opted into via the Conventions attribute: e.g. "CF-2.0 CF-DSG" for a data file that is following the "core CF conventions" with the discrete sampling geometry extension, or perhaps some externally defined standard like "CF-2.0 CF-Radial" for radar data.
-
@DocOtak pointed out that it's been six years since cf-convention/cf-conventions#141 was started. I hope we can finish it now! I'm on a mission to bring ancient issues to conclusion, admittedly rather slowly. I propose that we split the preamble of 2.6 on Attributes into two paragraphs, since it's quite a long paragraph, with a new sentence to begin the second paragraph, shown in bold in the following. The rest is all existing text. Also, I show in italic a sentence which I have moved from the very end to near the start and updated it to recognise there's now more than one list.
In section 2.6 of the conformance document, I propose that we insert a new requirement: String-valued attributes defined by CF must be scalar strings or 1D character arrays; arrays of strings are not allowed. Enough support has already been expressed for making this change in principle. What do you all think about the above proposals? |
-
I would rather just keep it simple, and not allow arrays of strings -- it's really hard for me to see the benefit. But I won't put up a fight if others really want them. One comment on that -- it may be unique to Python, but one of the major remaining type issues in Python is that an atomic string is also a Sequence of strings (i.e. a Sequence of length-one strings) -- this means you sometimes have to write convoluted code to make sure you don't confuse them -- so I'd really rather that confusion weren't an option when reading CF data.
-
Well, yes, it certainly is :-( However, it's not just that that makes me wary of this -- I'm wary because it's a bit tricky to correctly write code that will work with either a single string or an array of strings, particularly if all the test cases a user is looking at are only one of those.
OK -- fair enough, so there's a need. What I think would be best would be a clear line, e.g.: "This attribute is always an array of strings -- even if it is a length-1 array" (for CF version > something). However, I'm not sure how we could get from here to there without breaking even more workflows....
Ahh yes -- perhaps we DO define, for every controlled attribute, whether or not it can be (or must be?) an array of strings. And can we disallow multidimensional character arrays please?
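To illustrate the kind of defensive code this comment has in mind, here is a minimal sketch (not taken from any existing CF tool) that turns an attribute value into a list of strings whether it arrives as a single string or as a sequence of strings. The helper name `as_string_list` is hypothetical, and it assumes the reading library hands back a `str`, a `bytes` object, or an iterable of those; it does not call any netCDF library.

```python
def as_string_list(value):
    """Return an attribute value as a list of str.

    A bare str is itself iterable in Python, so it must be checked for
    first; otherwise "abc" would silently become ["a", "b", "c"].
    """
    if isinstance(value, bytes):
        return [value.decode("utf-8")]
    if isinstance(value, str):
        return [value]
    # Anything else is assumed to be an iterable of str or bytes elements.
    return [v.decode("utf-8") if isinstance(v, bytes) else str(v) for v in value]


# Both spellings of the same content give the same result.
print(as_string_list("some text"))
print(as_string_list(["some text"]))
```

Client code that always goes through a helper like this does not need to care which form the data-writer chose, at the price of an extra call on every attribute read.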
-
Dear @ChrisBarker-NOAA My suggestion is that we say in the preamble of Sect 2.6 "Attributes" that string-valued attributes must not be arrays of Could you agree to this? Cheers Jonathan |
-
Is it off the table to require an array of strings (even if of length-1) for these? If I were starting from scratch, that's what I'd do. Or maybe suggest that a length-1 array be used for these?
-
@DocOtak is right that netCDF itself does not allow multidimensional attributes. NUG says "Another difference between attributes and variables is that variables may be multidimensional. Attributes are all either scalars (single-valued) or vectors (a single, fixed dimension)." That's useful to know, thanks - it removes one of @ChrisBarker-NOAA's concerns. 😃 What do others (@benjwadams @DocOtak @sethmcg) think of not allowing In answer to your last question, Chris, I don't think we can prohibit scalar strings; that would be even more backward-incompatible than allowing arrays of strings. We could add explicit advice to software-writers, as well as data-writers, to be aware that the affected attributes could be either scalars or arrays. |
-
Fair enough -- I'll shut up now.
We certainly should do that. But I'd take it one step further for data-writers: suggest that those attributes be written, initially, as len-1 arrays of strings, rather than a single string. The idea being that they may well be added to in the future.
-
Jonathan wrote:
I would be happy with this.
-
This comment contains two independent points: Firstly: To my (not very deep) knowledge all relevant languages have the capability to test for variable/attribute type, so it should not be a problem to distinguish between an array of chars or a scalar vlen string on the one hand, and an array of vlen strings on the other. In Python, type tests may not be very 'pythonic', but that does not carry much weight to my mind. Continuing down the Python road, despite what I first argued for, @pp-mo mentioned that Python does not distinguish between ["some text"] and "some text" when it is saved to a netCDF file and then read back. If I have not misunderstood something, this means that making all arrays of chars or scalar vlen strings into length-1 arrays of vlen strings, as @ChrisBarker-NOAA suggests, would not help much. @sethmcg gave a couple of examples where arrays of vlen strings could be very useful, and wrote
I fully agree with this! Secondly: |
-
Agree about not using a Python-centric perspective. But handling two possible types in an attribute strikes me as harder, rather than easier, in a statically typed language such as C (or even more so, Fortran) -- but I'd like to hear from someone who writes tools to read CF files in other languages if they have any opinion on this.
-
Darn -- IMHO, that's a bug in the Python lib :-(. I may test and submit a PR.
-
Dear Lars
That is all correct, I believe. My proposal is to allow arrays of strings for those attributes, but recommend against them if the data might be read by software that doesn't support CF 1.>=X. Best wishes Jonathan
-
Do we need the recommendation conditional on software support, since we don't do that for other features which we think some software might not support (e.g. groups and strings)? It wouldn't be a recommendation that could be checked, in any case.
-
I think that we need to be a bit more specific than writing ".... software ....". That is, isn't the key software component which version of the netCDF library is used, and which version of the netCDF data model this software implements? Or did I miss something here? Irrespective of this, the conversation has now diverged from the question posed in the initial post. Maybe we should continue in a new discussion with a better title?
-
Key, yes, but I think that's fairly obvious -- you can't use nc4 features without an nc4-aware lib. But client software also needs to be written to expect either a single string or an array of strings -- they have to be treated differently. But yes -- not really the same topic.
-
Dear all Originally I wrote something longer
David commented
It's true, it couldn't be checked, and we don't always make such remarks. For strings, we say in 2.2, "The Perhaps a recommendation is too strong in this case. A recommendation would cause a warning to be emitted by the cf-checker. Nonetheless, I think the remark is worth making, as we did for strings. As a compromise on my compromise, I suggest that we don't include a recommendation in the conformance document, but we do state in the text that, because arrays of strings were not permitted in attributes before CF 1.12, software written before CF 1.12 might fail if it encounters them. How is that? If we agree on this, I will open an issue to propose the text changes. Cheers Jonathan
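For software-writers, the remark above suggests checking the CF version before deciding what to expect. The following is a rough sketch under stated assumptions: the helper names `cf_version` and `may_contain_string_arrays` are hypothetical, the Conventions attribute is assumed to be a blank- or comma-separated list containing a token of the form "CF-major.minor", and 1.12 is only the version named in this proposal, not a confirmed release.

```python
import re


def cf_version(conventions):
    """Extract (major, minor) from a Conventions attribute, or None."""
    for token in re.split(r"[\s,]+", conventions.strip()):
        match = re.fullmatch(r"CF-(\d+)\.(\d+)", token)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


def may_contain_string_arrays(conventions, first_version=(1, 12)):
    """True if the file claims a CF version at or after first_version."""
    version = cf_version(conventions)
    # Tuples compare element-wise, so (1, 9) correctly sorts before (1, 12).
    return version is not None and version >= first_version


print(may_contain_string_arrays("CF-1.9"))
print(may_contain_string_arrays("CF-1.12"))
```

Comparing versions as integer tuples rather than floats matters here, because 1.9 is numerically larger than 1.12 even though CF 1.9 is the earlier release.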
-
Hi, I'm getting a number of questions on whether string valued global attributes such as "references" can be composed of multiple values in https://cfconventions.org/Data/cf-conventions/cf-conventions-1.9/cf-conventions.html#description-of-file-contents. Should such attributes only consist of one value, or are two or more values considered acceptable as well?
Referencing issue (among others):
ioos/compliance-checker#1093