Example datasets for training (human and machine) - for CF roadmap #355

lhmarsden · 2024-09-19T14:21:55Z

lhmarsden
Sep 19, 2024
Collaborator

Topic for discussion

Example datasets would be useful as training datasets for human users. These training datasets could also potentially be used for training a GPT (for example) to 'understand' AI in order to help human users with questions.

This discussion relates to this theme in the CF roadmap publication that we are collectively working on; the 5 year plan.

lhmarsden · 2024-09-19T14:27:02Z

lhmarsden
Sep 19, 2024
Collaborator Author

Could this 'collection' of example datasets be searchable?

0 replies

lhmarsden · 2024-09-25T05:46:03Z

lhmarsden
Sep 25, 2024
Collaborator Author

@agstephens it was great to discuss this with you at the workshop last week. Would be good to hear your thoughts on how we proceed for the purpose of the roadmap publication.

0 replies

sethmcg · 2024-09-25T21:39:40Z

sethmcg
Sep 25, 2024
Collaborator

What are the desirable characteristics of these example datasets?

Are we talking idealized examples where everything is perfect (but perhaps without much real data)? Or real-world datasets that may have imperfections but are widely used? Something else?

0 replies

ChrisBarker-NOAA · 2024-09-26T01:30:13Z

ChrisBarker-NOAA
Sep 26, 2024
Collaborator

I would say "everything is perfect" examples, that's kind of the point :-)

Many of us learn much more bey example, so if someone can find an example that's similar to their use case, it will be much easier to figure out what to do.

perhaps without much real data

Yes, I think the actual data isn't the point, so ideally, they'd be trimmed down and small...

0 replies

lhmarsden · 2024-09-26T06:47:31Z

lhmarsden
Sep 26, 2024
Collaborator Author

I think it depends what we are using these datasets for.

For human training, everything should be perfect. It could be that synethetic examples (no real data) could be used for this. Large language models like ChatGPT could help us make these more quickly than we might've been able to do so a few years ago, though this will still take a lot of manual work. Related to this are profiles of CF, e.g. the CF-WMO profiles. These profiles outline explicitely, with fewer degrees of freedom, how to encode certain types of data. Therefore I think there is some overlap with https://github.com/orgs/cf-convention/discussions/358 here.

For machine training, I don't think everything needs to be perfect if we have enough datasets. It could actually be beneficial include labelled imperfections, so the machine could learn what not to do. Though this hypothesis requires a lot of testing...

0 replies

lhmarsden · 2024-09-26T06:49:46Z

lhmarsden
Sep 26, 2024
Collaborator Author

For machine training, in order to gather a large enough data collection (1000+ ???), we may have to use real datasets. We discussed at the workshop that multiple examples from the same collection (e.g. datasets for different days) could be used.

0 replies

sadielbartholomew · 2024-10-01T14:58:41Z

sadielbartholomew
Oct 1, 2024
Maintainer

If the data itself isn't particularly important, and only the metadata, and I think this is generally true, then how about we have our examples in header-only CDL notation form_ - to keep everything as small as we can i.e. no data arrays included (which are the part that can make files large and therefore difficult to store and transfer), without losing any metadata information? This can be generated from the 'real' datasets quite easily, using ncdump -h and then saving the output as <filename>.cdl>, so with a small script we can batch convert any real datasets into that format.

2 replies

lhmarsden Oct 2, 2024
Collaborator Author

I second that idea

ethanrd Oct 2, 2024
Maintainer

I agree in general. Though, I suspect some CF capabilities would benefit from example datasets with some data. Coordinate subsampling comes to mind but I expect there are others.

agstephens · 2024-10-02T19:30:44Z

agstephens
Oct 2, 2024

Hi everyone, apologies for the delay in getting to this thread. I originally proposed this idea at the CF Workshop (2024) so let me share some high-level thoughts based on the discussions we had in the brief hackathon, reading your comments above, and some further thinking. I'll try to break it down as follows:

What we tried/learnt at the CF Workshop 2024
Some thoughts about how we might progress this work

1. What we tried/learnt at the CF Workshop 2024

1.1. Large Language Models and CF

We talked about training datasets for a Large Language Model (LLM) such as ChatGPT and some of key points in our discussion were:

The language that we want an LLM to know is actually CDL (i.e. the output of ncdump)
ChatGPT already has some knowledge and skills around CF and NetCDF
- It has already seen a lot of open-source material that is relevant (GitHub and documentation)
- With a few sentences of description (about temperature, times, locations etc), it created a valid CF-compliant CDL header
- With more prompting it improved the metadata in the header and added some features more specific to recent CF versions
- Although it made mistakes, further prompting could guide it
It would be more efficient to investigate simplest solutions first:
1. Try the free ChatGPT version
2. Try an OpenAI Assistant (like an AI Agent)
  - Using uploaded files (docs and examples) and a long text prompt (i.e. a set of rules the agent should know)
3. Try more involved approaches:
  - Retrieval Augmented Generation (RAG)
  - Fine-tuning
Every few months, it would be wise to retry the simplest solution to check how well it works with the latest publicly available LLMs
Measuring performance is hard:
- How do we compare different solutions?
- How do we measure progress/accuracy?

1.2. Creating example datasets

Regarding the creation of a high-quality dataset:

We noted that synthetic example data might be useful (and quicker to generate)
There were two distinct use cases:
1. Create a small set of files to accompany the conventions documentation:
- To demonstrate implementation of each CF feature
- To use as best practice examples
- To use as a template for creating a similar file
- For use by the CF-Checker
- Ideally, a unique set of files exists for each CF version
- 10-20 files might be enough to meet this need
1. Create a large set of file headers (CDL) for Machine Learning (ML) applications:
- For training LLMs
- Including data files classified as "high" and "low" quality - so ML classifiers could be training on them
- For other unknown uses
- Probably want >10,000 individual files
- The dataset could be published at: https://paperswithcode.com/datasets
- Could ChatGPT help us create synthetic examples files?

2. Some thoughts about how we might progress this work

Having experimented very briefly with prompting an LLM (ChatGPT) it is clear that there is some promise. There is also a danger that we could put a lot of effort into something that becomes obsolete within months or years. I can think of 6 different areas/approaches that we could take forward. It would be great to hear which people think is the best combination, and in what order we could tackle them:

2.1. Define what we want from a CF-Agent
2.2. Create performance metrics and tests
2.3. Small file set to support the CF documentation
2.4. Large file set to support ML applications
2.5. Try a range of ML approaches
2.6. Constrain the problem

2.1. Define what we want from a CF-Agent

A real danger is that effort is put into playing with a "CF-Agent" but there is no clear goal, leading to a vague outcome. We might benefit from defining a set of use cases and/or requirements, such as:

Able to describe the contents of a file in human language(s)
Able to check for errors
Able to advise on best practice
Able to create CF-compliant files
Other uses?

2.2. Create performance metrics and tests

Any work done with ML models requires us to be able to recognise when the AI is doing a good job. We would need to define performance metrics and repeatable tests that could assess the quality of the model/application. The tests would also enable regression testing when updates were made to parts of the system outside of our control.

2.3. Small file set to support the CF documentation

This has been mentioned above in section 1.2. Ideally there would be a set of files to support:

each version of the CF conventions
each feature/profile/property mentioned in the conventions document

2.4. Large file set to support ML applications

This has been mentioned above in section 1.2. Some of the key considerations would be:

crowd-source data files from many organisations and scientific disciplines
CF-check all files
Maybe check files can be read by at least three libraries/tools (in multiple languages)
Use CDL headers only (although how can we test reading files if we only have the headers?)
Aim for >10,000 files
Need "good" and "bad" files for classification
How do we identify a high-quality file? Is it more than CF-compliant?
Publish to paperswithcode

2.5. Try a range of ML approaches

We should consider a hierarchy of approaches, from the quickest/cheapest first:

Free public LLM
Agent/Assistant on top of (1)
More complex approaches like multi-agent and RAG
Fine-tuning of commercial LLM
Fine-tuning of open source LLM (such as Llama 3.1)

2.6. Constrain the problem

Many of the above options appear to involve a lot of work. How can we constrain the problem, so that we can try something out quickly? Here are some ideas, please add more:

Use NetCDF files from a single project (such as CMIP6) as our universe of possible files - use that to test if our proposed approaches will actually work
Pick an old version of CF (e.g. CF-1.8) and try to train a model to be an expert in only that - we would need less examples, files, rules, etc.

If we find a good solution to these cut-down problems, we can scale them up.

(Apologies this is so long - thanks for staying with it - I look forward to hearing your thoughts and working on it with you :-) )

6 replies

lhmarsden Oct 3, 2024
Collaborator Author

It also just word an award! https://www.gbif.org/news/6aw2VFiEHYlqb48w86uKSf/chatipt-system-wins-the-2024-ebbe-nielsen-challenge

agstephens Oct 3, 2024

@lhmarsden: what is great about this example (ChatIPT) is how focused it is on achieving a clear goal. We can certainly take inspiration from that kind of approach.

lhmarsden Oct 18, 2024
Collaborator Author

@agstephens should we ever want to use AI for a converter, the main challenge is working with unstructured input data.

Perhaps relevant to this, I developed with @filchos a spreadsheet template generator with a configuration for CF-NetCDF. It is designed to help people who prefer to work usually with spreadsheets get their data into some more structured format before they start converting them. The user can select column headers from the full list of CF standard names, and there is a separate sheet for global attributes. I would like to add a separate sheet for variable attributes too at some point.

The template generator is hosted here:
https://www.nordatanet.no/aen/template-generator/config%3DCF-NetCDF

The code is here (anyone reading is welcome to provide feedback using issues or we can work together on developing it further)
https://github.com/SIOS-Svalbard/Nansen_Legacy_template_generator

We published a paper on this over the summer:
https://doi.org/10.5334/dsj-2024-038

agstephens Oct 22, 2024

Hi @lhmarsden, I just had a quick look through your paper and template generator. Looks good! I agree that this kind of tool could be a very useful part of the process for converting user data/information into NetCDF files. The template generator highlights the complexity in moving from the "scientist" view of the dataset to a correctly and appropriately encoded CF-netCDF view of the dataset.

I'm intrigued as to how much of the job you might be able to give to a trained AI model. For example, if you provided lots of example text/csv/excel files, then maybe it could read your data and infer the data structures itself, whilst recommending appropriate attributes.

lhmarsden Oct 22, 2024
Collaborator Author

Thanks @agstephens

I work with a lot of scientists who have unstructured data. Sometimes I can't understand the contents of their files (even after a lot of trying). In such cases, I don't think AI will manage either. Even understanding what column headers are can be tricky. Maybe with some back-and-forth dialogue between the user and the AI some progress could be made. I think this is the approach ChatIPT takes.

But of course, there is a sliding scale from structured to unstructured.

ChrisBarker-NOAA · 2024-10-02T20:00:25Z

ChrisBarker-NOAA
Oct 2, 2024
Collaborator

I suspect some CF capabilities would benefit from example datasets with some data.

+1 on this -- I was recently looking at the ragged array format for trajectories, and it was really hard to wrap. my haed around it without and example with data in it.

2 replies

sadielbartholomew Oct 18, 2024
Maintainer

Fair enough, I can foresee other use cases too where data is helpful. Though of course if they have data we want it to be small in array terms, so that there is enough to understand the general concept but not more than necessary as to create larger files. Is that reasonable as a principle to aim for, do we think?

lhmarsden Oct 21, 2024
Collaborator Author

I think that is reasonable @sadielbartholomew. Enough to be helpful and easily readable. I suppose we could even use your cfplot library to help people visualise the data if the arrays are quite long.

Armin-RS · 2024-10-03T07:47:35Z

Armin-RS
Oct 3, 2024

There seems to be some overlap with this old issue: cf-convention/cf-conventions#348
Especially with 1.2 i)

1 reply

agstephens Oct 3, 2024

There seems to be some overlap with this old issue: cf-convention/cf-conventions#348 Especially with 1.2 i)

@Armin-RS, I agree. Maybe it is most useful to move that part of the discussion into cf-convention/cf-conventions#348, and to focus this current discussion more on the Machine Learning aspect.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CF Conventions

Example datasets for training (human and machine) - for CF roadmap #355

{{title}}

Replies: 10 comments 11 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

CF Conventions

Example datasets for training (human and machine) - for CF roadmap #355

lhmarsden Sep 19, 2024 Collaborator

Topic for discussion

Replies: 10 comments · 11 replies

lhmarsden Sep 19, 2024 Collaborator Author

lhmarsden Sep 25, 2024 Collaborator Author

sethmcg Sep 25, 2024 Collaborator

ChrisBarker-NOAA Sep 26, 2024 Collaborator

lhmarsden Sep 26, 2024 Collaborator Author

lhmarsden Sep 26, 2024 Collaborator Author

sadielbartholomew Oct 1, 2024 Maintainer

lhmarsden Oct 2, 2024 Collaborator Author

ethanrd Oct 2, 2024 Maintainer

agstephens Oct 2, 2024

1. What we tried/learnt at the CF Workshop 2024

1.1. Large Language Models and CF

1.2. Creating example datasets

2. Some thoughts about how we might progress this work

2.1. Define what we want from a CF-Agent

2.2. Create performance metrics and tests

2.3. Small file set to support the CF documentation

2.4. Large file set to support ML applications

2.5. Try a range of ML approaches

2.6. Constrain the problem

lhmarsden Oct 3, 2024 Collaborator Author

agstephens Oct 3, 2024

lhmarsden Oct 18, 2024 Collaborator Author

agstephens Oct 22, 2024

lhmarsden Oct 22, 2024 Collaborator Author

ChrisBarker-NOAA Oct 2, 2024 Collaborator

sadielbartholomew Oct 18, 2024 Maintainer

lhmarsden Oct 21, 2024 Collaborator Author

Armin-RS Oct 3, 2024

agstephens Oct 3, 2024

lhmarsden
Sep 19, 2024
Collaborator

Replies: 10 comments 11 replies

lhmarsden
Sep 19, 2024
Collaborator Author

lhmarsden
Sep 25, 2024
Collaborator Author

sethmcg
Sep 25, 2024
Collaborator

ChrisBarker-NOAA
Sep 26, 2024
Collaborator

lhmarsden
Sep 26, 2024
Collaborator Author

lhmarsden
Sep 26, 2024
Collaborator Author

sadielbartholomew
Oct 1, 2024
Maintainer

lhmarsden Oct 2, 2024
Collaborator Author

ethanrd Oct 2, 2024
Maintainer

agstephens
Oct 2, 2024

lhmarsden Oct 3, 2024
Collaborator Author

lhmarsden Oct 18, 2024
Collaborator Author

lhmarsden Oct 22, 2024
Collaborator Author

ChrisBarker-NOAA
Oct 2, 2024
Collaborator

sadielbartholomew Oct 18, 2024
Maintainer

lhmarsden Oct 21, 2024
Collaborator Author

Armin-RS
Oct 3, 2024