
Managing Datasets

bart-v edited this page Sep 25, 2024 · 3 revisions

One of the main motivations for this implementation is the ability to easily manage the different Bio-ORACLE datasets served by ERDDAP.

To that end, there are Python scripts under the utils directory that do the job for us. Make sure you have installed the Python dependencies before running them:

# Only needed once, for the first installation (Python 3)
apt install python3.12-venv
python3 -m venv .env
source .env/bin/activate
pip3 install lxml

The main script is LoopGenerateDatasetsXml.py. As the name suggests, this script runs GenerateDatasetsXml.sh in a loop over the files in the /data/layers directory. GenerateDatasetsXml.sh is a native ERDDAP script that takes input data files and generates an XML snippet that can be added to the datasets.xml file to configure a dataset. After looping over the data layers, LoopGenerateDatasetsXml.py then uses a second script, CollateGenerateDatasetsXml.py, to glue all the results together into the datasets.xml file.

While LoopGenerateDatasetsXml.py takes a couple of minutes to run, CollateGenerateDatasetsXml.py runs very quickly. So, if you run the first script but have to do some manual editing, you can just run CollateGenerateDatasetsXml.py when you are done, instead of having to run the loop again.

IMPORTANT: At the moment, all paths in the scripts are hardcoded, so be careful when changing any paths in either the scripts or the server.

The scripts must be run from the repository root. For example:

# Make sure you are in the repository root
$ pwd
/data/bio-oracle-erddap

# Activate the virtual environment, then run the loop script
$ source .env/bin/activate
(.env) $ python utils/LoopGenerateDatasetsXml.py

Found 93 layers.
Processing layers with GenerateDatasetsXml.sh. Files will be created in the 'logs/datasets' directory.
  0%|                                                                                          | 0/93 [00:00<?, ?it/s]

This will take a few minutes to run. XML snippets for each dataset are created in the logs/datasets directory. Inside that same directory is another logs directory containing the standard error output of each GenerateDatasetsXml.sh run. Check these files to see what went wrong if an XML snippet was not generated correctly.

Once it's finished running, simply restart the container with:

(.env) $ docker compose restart

Adding individual datasets

The LoopGenerateDatasetsXml.py script regenerates XML snippets for ALL datasets.

If you would like to add an individual dataset, run GenerateDatasetsXml.sh like so:

$ ./GenerateDatasetsXml.sh      \
        EDDGridFromNcFiles      \
        $DATASET_DIRECTORY      \
        $DATASET_FILENAME       \
        $DATASET_DIRECTORY/$DATASET_FILENAME  \
        nothing                 \
        nothing                 \
        nothing                 \
        nothing

Replace $DATASET_DIRECTORY and $DATASET_FILENAME with the proper values. Note that the layers live in /data/layers/ on the host, but this directory is mounted to /datasets/ inside the Docker container, so $DATASET_DIRECTORY should be /datasets/.

You can also use the --include and --exclude flags of the LoopGenerateDatasetsXml.py script to add or remove datasets by regex. This generates XML snippets only for the datasets you select, substantially reducing the script's runtime.
