Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce workflow run crate #32

Merged
merged 70 commits into from
Nov 7, 2023
Merged

Conversation

inutano
Copy link
Collaborator

@inutano inutano commented Nov 9, 2022

Some missing features:

  • lacking mentions in the top-level object to link to a run
  • lacking input and output in the workflow object

Concerns:

  • a property can be an object or an array of objects depending on the number of appended objects
  • log files should be CreativeWork? -> a log file is subjectOfa workflow run

@inutano
Copy link
Collaborator Author

inutano commented Nov 9, 2022

a property can be an object or an array of objects depending on the number of appended objects

solved by removing compact=True

@inutano
Copy link
Collaborator Author

inutano commented Dec 20, 2022

🤔 Format class and format property do not exist in the existing terms?

@inutano
Copy link
Collaborator Author

inutano commented Dec 20, 2022

todo: add an example by a workflow using FileStats objects recording format specific statistics of output files

@stain
Copy link

stain commented Dec 20, 2022

Now need to re-add the sapporo terms like exitCode to the @context.

You can do it like this:

EXTRA_TERMS = {
    "ParameterConnection": "https://w3id.org/ro/terms/workflow-run#ParameterConnection",
    "connection": "https://w3id.org/ro/terms/workflow-run#connection",
    "sourceParameter": "https://w3id.org/ro/terms/workflow-run#sourceParameter",
    "targetParameter": "https://w3id.org/ro/terms/workflow-run#targetParameter"
}
crate = ROCrate()
crate.metadata.extra_terms.update(EXTRA_TERMS)
# ...

@simleo
Copy link

simleo commented Dec 20, 2022

thinking Format class and format property do not exist in the existing terms?

The property is called encodingFormat, see the Bioschemas FormalParameter profile. Example:

{
    "@id": "#in_file",
    "@type": "FormalParameter",
    "additionalType": "File",
    "encodingFormat" : "http://edamontology.org/format_1929",
    "name": "in_file"
}

@simleo
Copy link

simleo commented Dec 22, 2022

Looking at ro-crate-metadata-example.json: the ComputationalWorkflow's input and output point to actual files; they should point to formal parameters instead. The CreateAction's object and result are the properties that should point to actual parameter values, and refer back to the formal parameters via exampleOfWork. For instance:

Formal Parameter:

{
    "@id": "#fastq_out_1",
    "@type": "FormalParameter",
    "additionalType": "File",
    "encodingFormat": "http://edamontology.org/format_1930"
}

Actual file representation:

{
    "@id": "outputs/SRR1274307_1.fastq",
    "@type": "File",
    "exampleOfWork": {
        "@id": "#fastq_out_1"
    },
    ...
}

By the way, determining the workflow's inputs and outputs does not seem straightforward here since the Sapporo engine acts, in a way, as a wrapper around the CWL engine with different input / output slots. The CWL workflow has three inputs (nthreads, repo, run_id) and one output (fastq_files, which is an array of files), while the workflow seen as a Sapporo application seems to have only one input, the JSON workflow parameters file (which I guess contains settings for the CWL parameters) and four outputs, the fastq files as separate items plus the stdout and stderr dumps. I wonder what would be the best representation here. With the single JSON input, the run would only be reproducible on Sapporo, but not in other CWL implementations. The representation could instead stick to the CWL input definitions and add workflow_params.json as a configuration file. Similarly, the stdout.log and stdout.err files could be added as plain files (maybe marking them as Sapporo log files via description) and not be listed among the outputs.

@simleo
Copy link

simleo commented Jan 9, 2023

thinking Format class and format property do not exist in the existing terms?

The property is called encodingFormat, see the Bioschemas FormalParameter profile.

Regarding the other custom terms:

  • many are very specific to the genomic sequence analysis domain (e.g. mappedReads). Since the profiles are meant to be generic, we should not include those in ro-terms. Sapporo could instead move the more domain-specific part to a file in a custom format that's referenced from the RO-Crate metadata. By the way, those terms have FileStats as their domain, but FileStats has been removed from the list. Also, stats has FileStats as range.
  • sha512 is now available
  • uid, gid and mode are not intrinsic properties of the file as a chunk of data, but rather of how it is stored in a specific file system. Sapporo could record those properties for the files in their original paths, but they would be meaningless if the crate contains copies, or references URLs where those files can be downloaded.

@inutano
Copy link
Collaborator Author

inutano commented Jan 16, 2023

Thank you very much @stain and @simleo !! Sorry for not replying earlier.

By the way, determining the workflow's inputs and outputs does not seem straightforward here since the Sapporo engine acts, in a way, as a wrapper around the CWL engine with different input / output slots.
With the single JSON input, the run would only be reproducible on Sapporo, but not in other CWL implementations.

Yes, I think this is a general concern for a WES implementation that aims to generate RO-crate. With the current one that Sapporo generates for each workflow engine, it's only reusable with Sapporo to run it again.

My idea is to let Sapporo use engine's feature to generate RO-crate if possible (like cwltool), but generates a generic one (the current form) for the engines without the RO-crate feature (like the other engines). It'd be hard to parse the input/output file for each engine and continue supporting all of them. Ideally, all engines should support RO-crate and a WES just uses the features.

Thanks @simleo for the advice on the additional terms (thanks for finding the FileStats bug as well!), I'll follow your guide to blush up the terms we should prepare.

@inutano inutano force-pushed the introduce-workflow-run-crate branch from b6df700 to a5f8ebe Compare February 8, 2023 08:48
@inutano
Copy link
Collaborator Author

inutano commented Aug 8, 2023

@simleo Thank you for reviewing! Hiro has fixed most of the issues, but we still have some points which need to be discussed.


The @context entry is updated as you suggested.
17f2222

But I suspect that the initial one was also valid:

"@context": [
    "https://w3id.org/ro/crate/1.1/context",
    "https://w3id.org/ro/terms/sapporo"
]

looks like a valid JSON-LD context. Or am I missing something in the spec? cc. @stain


The log files stdout.log and stderr.log were added to the crate. They were missing because of .gitignore!
85241ea


Three CreativeWorks were added, they simply have been missed. Thanks! 🙏
31ea971


The root data entity now mentions the CreateAction.
aa123d6


For the license, this is not straightforward, for us at least. As Sapporo is a WES and cannot expect what kind of workflows are requested, Sapporo often cannot find the license of the given workflow. If the requested workflow is parsable and has license information in it, we can describe the license in ro-crate-metadata.json. But if the workflow doesn't have a license, or is not parsable by WES like a nextflow workflow, Sapporo cannot add a license.

I assume the other engines might have the same situation - do you have any suggestions or guidelines for this?


description is removed, with the very same reason for the license.
cbb958e


The main workflow file is currently missing in the crate because the request has been made by the workflow URL. But it should include the main workflow description file should be included. This requires the update of Sapporo's workflow runner, so we will work on this.


The workflow's input list is missing the nthreads parameter

This is because the parameter is not specified in the run request sent to Sapporo. The parameter exists in the workflow definition file but is not specified in the request, so the run was made with the default value specified in the workflow definition. This is another thing that WES cannot find.


The workflow's output list needs to reflect the output parameters in the CWL workflow, like it was done for the input list.

This is also another WES-specific issue. A WES can inspect the run request (workflow definition to run, job parameters to be given), but cannot inspect the workflow definition content itself (in other words, nextflow/snakemake files cannot be parsable). Therefore, a WES just can list the files generated by the run without mapping them to the defined (expected) outputs. I understand this is not ideal, but I think this is the limitation when we implement the ro-run-crate function in a WES.

Thank you very much again for your huge effort in helping us!

@simleo
Copy link

simleo commented Aug 28, 2023

@inutano Regarding the @context entry, I've checked with the JSON-LD playground and you are right, the previous @context was valid. Sorry about this, I was referring to a previous version of ro-terms where JSON files were not valid JSON-LD. You can revert to the previous @context, but adding the workflow-run entry (so that sha512 is defined):

"@context": [
    "https://w3id.org/ro/crate/1.1/context",
    "https://w3id.org/ro/terms/sapporo",
    "https://w3id.org/ro/terms/workflow-run"
]

Regarding the license, it's normal for a workflow runner not to know the workflow's license in the general case. Specifying a license is required by Workflow RO-Crate though, and you also need to specify an open license to upload the final example to Zenodo. In the case of Sapporo, maybe the license could be added by the user through an option. There's a similar situation in runcrate: the runcrate convert command does not know the workflow's license, so it has a --license command line option to allow the user to specify it. However, for the purpose of getting a final example, the license entry can be added manually.

As for the workflow's input and output entries, since these are not known in general, it's probably better not to include them at all. They are not required by the specification. An example of engine that's not including them is COMPSs.

@simleo
Copy link

simleo commented Sep 25, 2023

I think it's almost ready. I would remove the workflow's input and output entries for the reasons discussed above.

@inutano
Copy link
Collaborator Author

inutano commented Sep 27, 2023

@simleo Thank you so much for reviewing! @suecharo can help you if needed!

@inutano
Copy link
Collaborator Author

inutano commented Oct 24, 2023

Manually removed the input output and the related files: I think now it's ready to be uploaded to Zenodo @simleo Is there a specific Zenodo repo or do I have to create one for this?

See:
https://github.com/sapporo-wes/sapporo-service/blob/introduce-workflow-run-crate/tests/ro-crate/ro-crate_dir/ro-crate-metadata.json

The directory:
https://github.com/sapporo-wes/sapporo-service/tree/introduce-workflow-run-crate/tests/ro-crate/ro-crate_dir

@simleo
Copy link

simleo commented Oct 24, 2023

Manually removed the input output and the related files

I suggested to remove only the workflow's input and output entries and the corresponding formal parameters, NOT the action's object and result and the corresponding files -- please put those back as they were. They are among the most important things in the crate.

Also, this should not be a manual change: the RO-Crate is meant to give an example of the implementation, so it's Sapporo that has to be changed in order to change the RO-Crate.

@inutano
Copy link
Collaborator Author

inutano commented Oct 25, 2023

@simleo Thank you for your reviewing and explanation at the call yesterday! @suecharo fixed the code to generate the JSON file without input output but still with object result to link to the files used/generated by the workflow. Now the example JSON file looks better I hope.
https://github.com/sapporo-wes/sapporo-service/blob/introduce-workflow-run-crate/tests/ro-crate/ro-crate_dir/ro-crate-metadata.json

One question, is it okay to keep the files not mentioned in the JSON, e.g. cmd.txt in the ro-crate_dir directory? Or should the directory contain the mentioned files only?

@simleo
Copy link

simleo commented Oct 25, 2023

@inutano @suecharo the changes in 0b7e990 also moved the exe/fastqc.cwl and exe/trimmomatic_pe.cwl from the workflow's hasPart (which was the right place) to the CreateAction's input. Please put them back as they were before.

Regarding exe/ERR034597_1.small.fq.gz and exe/ERR034597_2.small.fq.gz: since these are inputs I would expect them to be in a subdirectory called "input", rather than "exe". From the RO-Crate point of view this is not wrong, but using "input" would make the crate more human-readable.

One question, is it okay to keep the files not mentioned in the JSON, e.g. cmd.txt in the ro-crate_dir directory? Or should the directory contain the mentioned files only?

The crate directory can contain files that are not mentioned in the metadata file.

@inutano
Copy link
Collaborator Author

inutano commented Oct 26, 2023

@simleo Thanks for the comments!

the changes in 0b7e990 also moved the exe/fastqc.cwl and exe/trimmomatic_pe.cwl from the workflow's hasPart (which was the right place) to the CreateAction's input.

My apologies, I had to explain/ask about this. Another WES issue here is that a WES does not know which files are workflow dependencies and which are input parameters, as it only can see that the workflow engine downloaded the files into the working directory. In this case, it may look obvious from the file extensions that the *.cwl files are parts of the main workflow, but this may not be true if the files are without the extensions, or simply composed by workflow languages other than CWL. Therefore we put all the files being made available by the engine as 'input'. I know this looks very strange, but is required to keep consistency. What do you think?

Regarding exe/ERR034597_1.small.fq.gz and exe/ERR034597_2.small.fq.gz: since these are inputs I would expect them to be in a subdirectory called "input", rather than "exe". From the RO-Crate point of view this is not wrong, but using "input" would make the crate more human-readable.

This makes sense. This is just inherited from our WES implementation and the internal namespace usage went out for the ro-crate. We will discuss if we can change it.

The crate directory can contain files that are not mentioned in the metadata file.

Cool, thanks!

@simleo
Copy link

simleo commented Oct 26, 2023

Therefore we put all the files being made available by the engine as 'input'. I know this looks very strange, but is required to keep consistency. What do you think?

If the distinction is unknown to Sapporo, I guess adding them to input makes sense. This peculiarity has to be mentioned in the Zenodo description of the example and in the manuscript.

This is just inherited from our WES implementation and the internal namespace usage went out for the ro-crate. We will discuss if we can change it.

If it's complicated, leave that be. Better not delay the PR merge and example publication.

@inutano
Copy link
Collaborator Author

inutano commented Oct 27, 2023

Okay, I'll add it to the manuscript. for the exe we'll make a decision soon

@suecharo
Copy link
Collaborator

suecharo commented Nov 7, 2023

This pytest error will be fixed in the next release.

@suecharo suecharo changed the base branch from main to develop November 7, 2023 00:49
@suecharo suecharo marked this pull request as ready for review November 7, 2023 00:50
@suecharo suecharo merged commit bf7b7ee into develop Nov 7, 2023
3 of 4 checks passed
@suecharo suecharo deleted the introduce-workflow-run-crate branch November 7, 2023 00:51
@simleo
Copy link

simleo commented Nov 7, 2023

@suecharo @inutano congrats on merging the PR! I've noticed the crate at tests/ro-crate/ro-crate_dir/ still does not have a license. I've seen you added a "license": "https://opensource.org/license/apache-2-0/" to tests/ro-crate/ro-crate-metadata.json (whose purpose is not clear to me). You should do the same in tests/ro-crate/ro-crate_dir/ro-crate-metadata.json, then you can upload the crate to Zenodo (as a .crate.zip archive with contents directly at the top level). Also, please tag a release that includes these changes (and possibly publish that to Zenodo as well), so you can reference it in the paper. Looking forward to the crate, Sapporo release and manuscript updates. Thanks!

@suecharo suecharo restored the introduce-workflow-run-crate branch November 15, 2023 10:23
@suecharo
Copy link
Collaborator

Thanks @simleo !!

Fixed at #34,
and released as:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants