From 13f4d361402c5ec803bb5e6afedf37f0d28b66c4 Mon Sep 17 00:00:00 2001
From: Stian Soiland-Reyes
Date: Wed, 8 Apr 2020 16:03:58 +0100
Subject: [PATCH 1/6] Adding @id for external URIs

---
 docs/1.1-DRAFT/index.md | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/docs/1.1-DRAFT/index.md b/docs/1.1-DRAFT/index.md
index 42c3f0b6..8e24e6d7 100644
--- a/docs/1.1-DRAFT/index.md
+++ b/docs/1.1-DRAFT/index.md
@@ -46,7 +46,7 @@ See for further details about RO-Crate.
-This specification is Copyright 2017-2019 University of Technology Sydney,
+This specification is Copyright 2017-2020 University of Technology Sydney,
 The University of Manchester UK and the RO-Crate contributors.
@@ -194,7 +194,7 @@ If present in the root directory, `ro-crate-preview.html` MUST:
 ### Payload files and directories
-These are the actual files and directories that make up the dataset being described.
+These are the actual files and directories that make up the dataset being described.
 The base RO-Crate specification makes no assumptions about the presence of any specific files or folders beyond the reserved RO-Crate files described above.
 Payload files may appear directly in the _RO-Crate Root_ alongside the _RO-Crate Metadata File_, and/or appear in sub-directories of the _RO-Crate Root_. Each file and directory MAY be represented as Data Entities in the _RO-Crate Metadata File_.
@@ -377,7 +377,6 @@ The _RO-Crate Metadata File Descriptor_ MAY contain information such as
 licensing for the _RO-Crate Metadata File_ so metadata can be licensed separately from Data.
-
 The table below outlines the properties that the _Root Data Entity_ MUST have to be minimally valid and additionally highlights properties required to meet other common use-cases, including the minimum metadata necessary to mint a DataCite DOI:
 #### Direct properties of the Root Data Entity
@@ -435,9 +434,13 @@ The following _RO-Crate Metadata File_ represents a minimal description of an _R
 Where files and folders are represented as _Data Entities_ in the RO-Crate JSON-LD, these MUST be linked to, either directly or indirectly, from the Root Data Entity using the [hasPart] property. Directory hierarchies MAY be represented with nested Dataset _Data Entities_, or the Root Dataset MAY refer to files anywhere in the hierarchy using [hasPart].
-_Data Entities_ representing files MUST be of `@type: File`, which is an RO-Crate alias for
+_Data Entities_ representing files MUST be of `@type: File`, which is an RO-Crate alias for . The term _file_ here is liberal, and includes "downloadable" resources where `@id` is an absolute URI.
+
+_Data Entities_ representing directories MUST be of `@type: Dataset`. The term _directory_ here includes HTTP file listings where `@id` is an absolute URI, however "external" directories SHOULD have a programmatic listing of their content (e.g. another RO-Crate).
-_Data Entities_ representing directories MUST be of `@type: Dataset`.
+_Data Entities_ can also be other types, for instance an online database. These SHOULD be of `@type: CreativeWork` and typically have an `@id` which is an absolute URI.
+
+In all cases, `@type` MAY be an array in order to also specify a more specific type, e.g. `@type: [File, Workflow]`
 _Note: as indicated above, there is no requirement to represent every file and folder in an RO-Crate as Data Entities in the RO-Crate JSON-LD._
@@ -564,8 +567,7 @@ The table below outlines the properties that Data Entities, when present, MUST h
 A `File` _Data Entity_ MUST have the following properties:
 * `@type`: MUST be `File`, or an array where `File` is one of the values.
-* `@id` MUST be a _URI Path_ relative to the _RO Crate root_
-
+* `@id` MUST be either a _URI Path_ relative to the _RO Crate root_, or an absolute URI.
 #### Directory File Entity

From 9aa3549debd9a6e8aa3aadf66a9a545fa1e2650d Mon Sep 17 00:00:00 2001
From: Stian Soiland-Reyes
Date: Thu, 9 Apr 2020 08:57:05 +0100
Subject: [PATCH 2/6] More on webby File Data Entities

---
 docs/1.1-DRAFT/index.md | 83 ++++++++++++++++++++++++++++++++++-------
 1 file changed, 70 insertions(+), 13 deletions(-)

diff --git a/docs/1.1-DRAFT/index.md b/docs/1.1-DRAFT/index.md
index 8e24e6d7..839eba00 100644
--- a/docs/1.1-DRAFT/index.md
+++ b/docs/1.1-DRAFT/index.md
@@ -99,7 +99,7 @@ _Contextual Entity_: A JSON-LD representation of an entity associated with a _Da
 _Linked Data_: A data structure where properties, types and resources are identified with _URIs_, which if retrieved over the Web, further describe or provide the identified property/type/resource.
-URI: A _Uniform Resource Identifier_ as defined in [RFC 3986], for example `http://example.com/path/file.html` - commonly known as _URL_. In this document the term _URI_ includes _IRI_, which also permit international Unicode characters.
+_URI_: A _Uniform Resource Identifier_ as defined in [RFC 3986], for example `http://example.com/path/file.html` - commonly known as _URL_. In this document the term _URI_ includes _IRI_, which also permit international Unicode characters.
 _URI Path_: The relative _path_ element of an _URI_ as defined in [RFC3986 section 3.3](https://tools.ietf.org/html/rfc3986#section-3.3), e.g. `path/file.html`
@@ -152,7 +152,7 @@ A valid _RO-Crate JSON-LD_ graph MUST describe:
 It is RECOMMENDED that any referenced _contextual entities_ are also described in the _RO-Crate Metadata File_ with the same identifier. Similarly it is RECOMMENDED that any _contextual entity_ in the _RO-Crate Metadata file_ is linked to from at least one of the other entities using the same identifier.
-The appendix [RO-Crate JSON-LD](#ro-crate-json-ld) details the general structure of the JSON-LD that is expected in the _RO-Crate Metadata File_. In short, the rest of this specification describe the different types of entities that can be added as `{}` objects to the _RO-Crate JSON-LD_ `@graph` below:
+The appendix [RO-Crate JSON-LD](#ro-crate-json-ld) details the general structure of the JSON-LD that is expected in the _RO-Crate Metadata File_. In short, the rest of this specification describe the different types of entities that can be added as `{}` objects to the _RO-Crate JSON-LD_ `@graph` array below:
 ```json
 { "@context": "https://w3id.org/ro/crate/1.1-DRAFT/context",
@@ -203,12 +203,13 @@ The base RO-Crate specification makes no assumptions about the presence of any s
 A minimal RO-Crate is a directory containing a single _RO-Crate Metadata File_.
-At the basic level, an RO-Crate is a collection of files represented as a schema.org [Dataset], that together form a meaningful unit for the purposes of communication, citation, distribution, preservation, etc. The _RO-Crate Metadata File_ describes the RO-Crate, and MUST be stored in the _RO-Crate Root_. Self-containment is a core principle of RO-Crate, i.e. that all _Dataset_ files and relevant metadata SHOULD, as far as possible, be contained by the RO-Crate, rather than referring to external resources. However the RO-Crate MAY also reference external resources which are stored or accessed separately, via URIs, e.g. because these cannot be included for practical or legal reasons.
+At the basic level, an RO-Crate is a collection of files and resources represented as a schema.org [Dataset], that together form a meaningful unit for the purposes of communication, citation, distribution, preservation, etc. The _RO-Crate Metadata File_ describes the RO-Crate, and MUST be stored in the _RO-Crate Root_.
-It is important to note that the _RO-Crate Metadata File_ is not an exhaustive manifest or inventory, that is, it does not necessarily list or describe all files in the package. Rather it is focused on providing sufficient amount of metadata to understand and use the content, and is designed to be compatible with existing and future approaches that _do_ have full inventories / manifest and integrity checks, e.g. by using checksums, such as [BagIt] and Oxford Common File Layout [OCFL Object]s.
+While RO-Crate is well catered for describing a _Dataset_ as files and relevant metadata that are _contained_ by the RO-Crate in the sense of living within the same root directory, RO-Crates can also reference external resources which are stored or accessed separately, via absolute URIs. This is particularly recommended where some resources cannot be co-hosted for practical or legal reasons, or if the RO-Crate itself is primarily web-based.
-The intention is that RO-Crates can work well with a variety of archive file formats, e.g. tar, zip, etc., and approaches to capturing file manifests and file fixity, such as [BagIt], [OCFL] and [git].
+It is important to note that the _RO-Crate Metadata File_ is not an exhaustive manifest or inventory, that is, it does not necessarily list or describe all files in the package. Rather it is focused on providing sufficient amount of metadata to understand and use the content, and is designed to be compatible with existing and future approaches that _do_ have full inventories / manifest and integrity checks, e.g. by using checksums, such as [BagIt] and Oxford Common File Layout [OCFL Object]s.
+The intention is that RO-Crates can work well with a variety of archive file formats, e.g. tar, zip, etc., and approaches to capturing file manifests and file fixity, such as [BagIt], [OCFL] and [git]. An RO-Crate can also be hosted on the web or mainly refer to web resources, although extra care to ensure persistence and consistency should be taken for archiving such RO-Crates.
 ## RO-Crate Metadata
@@ -253,7 +254,7 @@ Generally, the standard keys from [schema.org] should be used. However, RO-Crate
 * `File` is mapped to which was chosen as a compromise as it has many of the properties that are needed to describe a generic file. Future versions of schema.org or a research data extension may re-define `File`.
 * `Journal` is mapped to .
-Note that JSON-LD examples given on website may not be in _flattened_ form; any nested entities in _RO-Crate JSON-LD_ SHOULD be described as separate contextual entities in the flat `@graph` list.
+Note that JSON-LD examples given on the website may not be in _flattened_ form; any nested entities in _RO-Crate JSON-LD_ SHOULD be described as separate contextual entities in the flat `@graph` list.
 To simplify processing and avoid confusion with string values, the _RO-Crate JSON-LD Context_ requires URIs and entity references to be given in the form `"author": {"@id": "http://example.com/alice"}`, even where [schema.org] for some properties otherwise permit shorter forms like `"author": "http://example.com/alice"`.
@@ -280,6 +281,7 @@ The keys `RepositoryObject` and `RepositoryCollection` were chosen to avoid coll
 From [Dublin Core Terms](http://purl.org/dc/terms/) RO-Crate use:
 - `conformsTo` mapped to
+
 ### Summary of Coverage
 RO-Crate is simply a way to make metadata assertions about a set of files and folders that make up a _Dataset_. These assertions can be made at three levels:
@@ -358,7 +360,7 @@ the the _Root Data Entity_ by following this algorithm:
 1. For each entity in `@graph` array
 2. ..if the `conformsTo` property is a URI that starts with `https://w3id.org/ro/crate/`
-3. ....from this entity's `about` object keep the `@id` URI as variable _root_
+3. ....from this entity's `about` object keep the `@id` URI as variable _root_
 4. For each entity in `@graph` array
 5. .. if the entity has an `@id` URI that matches _root_ return it
@@ -383,7 +385,7 @@ The table below outlines the properties that the _Root Data Entity_ MUST have to
 The _Root Data Entity_ MUST have the following properties:
 * `@type`: MUST be [Dataset]
-* `@id`: MUST be a a string of ‘./’
+* `@id`: MUST end with `/` and SHOULD be the string `./`
 * `name`: SHOULD identify the dataset to humans well enough to disambiguate it from other RO-Crates
 * `description`: SHOULD further elaborate on the name to provide a summary of the context in which the dataset is important.
 * `datePublished`: MUST be a string in ISO 8601 date format and SHOULD be specified to at least the precision of a day, MAY be a timestamp down to the millisecond.
@@ -419,18 +421,18 @@ The following _RO-Crate Metadata File_ represents a minimal description of an _R
     "description": "Palliative care planning for nursing home residents with advanced dementia ...",
     "license": {"@id": "https://creativecommons.org/licenses/by-nc-sa/3.0/au/"}
  },
-{
+ {
   "@id": "https://creativecommons.org/licenses/by-nc-sa/3.0/au/",
   "@type": "CreativeWork",
   "description": "This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Australia License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/au/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.",
   "identifier": "https://creativecommons.org/licenses/by-nc-sa/3.0/au/",
   "name": "Attribution-NonCommercial-ShareAlike 3.0 Australia (CC BY-NC-SA 3.0 AU)"
-  }
+ }
  ]
 }
 ```
-### Examples of referencing _Data Entities_ (files and folders) from the _Root Data Entity_
+### Referencing _Data Entities_ (files and folders) from the _Root Data Entity_
 Where files and folders are represented as _Data Entities_ in the RO-Crate JSON-LD, these MUST be linked to, either directly or indirectly, from the Root Data Entity using the [hasPart] property. Directory hierarchies MAY be represented with nested Dataset _Data Entities_, or the Root Dataset MAY refer to files anywhere in the hierarchy using [hasPart].
@@ -561,7 +563,7 @@ If there is no web-accessible description for a file format it SHOULD be describ
 ### Core Metadata for _Data Entities_
-The table below outlines the properties that Data Entities, when present, MUST have to be minimally valid .
+The table below outlines the properties that Data Entities, when present, MUST have to be minimally valid.
 #### File Data Entity
@@ -573,7 +575,62 @@ A `File` _Data Entity_ MUST have the following properties:
 A `Dataset` (directory) _Data Entity_ MUST have the following properties:
 * `@type` MUST be `Dataset` or an array wher `Dataset` is one of the values.
-* `@id` MUST be a _URI Path_ relative to the _RO Crate root; SHOULD end with `/`
+* `@id` MUST be either a _URI Path_ relative to the _RO Crate root_, or an absolute URI. The id SHOULD end with `/`
+
+### Web-based Data Entities
+
+While one use-case of RO-Crates is to describe _files_ contained within the _RO-Crate root_ directory, RO-Crates can also gather resources on the web, identified using _absolute URIs_ instead of relative _URI paths_.
+
+This can be important particularly where a file can't be included in the _RO-Crate root_ because of licensing concerns, large data sizes, or where it is desirable to link to the latest online version.
+
+Example of an RO-Crate where some of the _File Data Entities_ are external to the _RO-Crate root_:
+
+```json
+{ "@context": "https://w3id.org/ro/crate/1.1-DRAFT/context",
+  "@graph": [
+    {
+      "@type": "CreativeWork",
+      "@id": "ro-crate-metadata.jsonld",
+      "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1-DRAFT"},
+      "about": {"@id": "./"}
+    },
+    {
+      "@id": "./",
+      "@type": [
+        "Dataset"
+      ],
+      "hasPart": [
+        {
+          "@id": "survey-responses-2019.csv"
+        },
+        {
+          "@id": "https://zenodo.org/record/3541888/files/ro-crate-1.0.0.pdf"
+        },
+      ],
+    },
+    {
+      "@id": "survey-responses-2019.csv",
+      "@type": "File",
+      "contentSize": "26452",
+      "description": "Survey responses",
+      "encodingFormat": "text/csv"
+    },
+    {
+      "@id": "https://zenodo.org/record/3541888/files/ro-crate-1.0.0.pdf",
+      "@type": "File",
+      "contentSize": "310691",
+      "description": "RO-Crate specification",
+      "encodingFormat": "application/pdf"
+    }
+]
+}
+```
+
+Additional care SHOULD be taken to improve persistence and longevity of web resources included
+in an RO-Crate, as they can be more difficult to archive or move along with the _RO-Crate root_, and
+may change intentionally or unintentionally leaving the RO-Crate with partial or outdated information.
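
Editor's note: a consumer of such a crate needs to tell contained files from web-based ones. The following is a minimal sketch in Python, assuming that an `@id` counts as web-based whenever it parses with a URI scheme. The helper names `is_web_based` and `partition_files` are illustrative only and are not defined by RO-Crate.

```python
from urllib.parse import urlsplit


def is_web_based(entity_id: str) -> bool:
    """True if the @id is an absolute URI (has a scheme), False for a relative URI path."""
    return bool(urlsplit(entity_id).scheme)


def partition_files(graph: list) -> tuple:
    """Split File entities into (contained, web_based) lists of @id strings."""
    contained, web_based = [], []
    for entity in graph:
        types = entity.get("@type", [])
        if isinstance(types, str):  # @type MAY be a single string or an array
            types = [types]
        if "File" not in types:
            continue
        target = web_based if is_web_based(entity["@id"]) else contained
        target.append(entity["@id"])
    return contained, web_based


# The two File entities from the example in this patch
graph = [
    {"@id": "survey-responses-2019.csv", "@type": "File"},
    {"@id": "https://zenodo.org/record/3541888/files/ro-crate-1.0.0.pdf", "@type": "File"},
]
contained, web_based = partition_files(graph)
```

Checking for a scheme rather than for an `http` prefix keeps the sketch aligned with the patch text, which speaks of absolute URIs in general.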
+
+
 ## Representing _Contextual Entities_

From fe05ab5a8c568114b568c71bf751af0bb85e7aea Mon Sep 17 00:00:00 2001
From: Stian Soiland-Reyes
Date: Thu, 9 Apr 2020 13:10:49 +0100
Subject: [PATCH 3/6] more about file metadata when it's external

---
 docs/1.1-DRAFT/index.md | 42 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/docs/1.1-DRAFT/index.md b/docs/1.1-DRAFT/index.md
index 839eba00..34a5ec07 100644
--- a/docs/1.1-DRAFT/index.md
+++ b/docs/1.1-DRAFT/index.md
@@ -581,7 +581,7 @@ A `Dataset` (directory) _Data Entity_ MUST have the following properties:
 While one use-case of RO-Crates is to describe _files_ contained within the _RO-Crate root_ directory, RO-Crates can also gather resources on the web, identified using _absolute URIs_ instead of relative _URI paths_.
-This can be important particularly where a file can't be included in the _RO-Crate root_ because of licensing concerns, large data sizes, or where it is desirable to link to the latest online version.
+Using Web-based data entities can be important particularly where a file can't be included in the _RO-Crate root_ because of licensing concerns, large data sizes, privacy, or where it is desirable to link to the latest online version.
 Example of an RO-Crate where some of the _File Data Entities_ are external to the _RO-Crate root_:
 ```json
@@ -630,7 +630,47 @@ Additional care SHOULD be taken to improve persistence and longevity of web reso
 in an RO-Crate, as they can be more difficult to archive or move along with the _RO-Crate root_, and
 may change intentionally or unintentionally leaving the RO-Crate with partial or outdated information.
+File Data Entities with an `@id` URI outside the _RO-Crate Root_ SHOULD at the time of RO-Crate creation be directly downloadable by a simple retrieval (e.g. HTTP GET), permitting redirections and HTTP authentication. For instance in the example above, and cannot be used as `@id` above as retrieving these URLs gives an HTML landing page rather than the desired PDF as indicated by `encodingFormat`.
+As files on the web may change, the timestamp property `sdDatePublished` SHOULD be included to indicate when the absolute URL was accessed, and derived metadata like `encodingFormat` and `contentSize` were considered to be representative:
+
+```json
+  {
+    "@id": "https://zenodo.org/record/3541888/files/ro-crate-1.0.0.pdf",
+    "@type": "File",
+    "contentSize": "310691",
+    "encodingFormat": "application/pdf",
+    "sdDatePublished": "2020-04-09T13:09:21+01:00"
+  }
+```
+
+#### Embedded data entities that are also on the web
+
+File Data Entities may already have a corresponding web presence, for instance a landing page that describes the file, including persistent identifiers (e.g. DOI) that resolve to an intermediate HTML page instead of the downloadable file directly.
+
+These can be included for File Data Entities as additional metadata, regardless of whether the File is included in the _RO-Crate Root_ directory or exists on the Web, by using properties:
+
+* `identifier` for formal identifier string such as DOIs
+* `url` with a string URL for a *download* link (if not available, a download landing page) for this file
+* `subjectOf` to a `CreativeWork` (or `WebPage`) that mentions this file or its content (but also other resources).
+* `mainEntityOfPage` to a `CreativeWork` (or `WebPage`) that primarily describes this file (or its content)
+
+```json
+  {
+    "@id": "survey-responses-2019.csv",
+    "@type": "File",
+    "encodingFormat": "text/csv",
+    "url": "http://example.com/downloads/2019/survey-responses-2019.csv",
+    "subjectOf": {"@id": "http://example.com/reports/2019/annual-survey.html"}
+  },
+  {
+    "@id": "https://zenodo.org/record/3541888/files/ro-crate-1.0.0.pdf",
+    "@type": "File",
+    "encodingFormat": "application/pdf",
+    "identifier": "https://doi.org/10.5281/zenodo.3541888",
+    "url": "https://zenodo.org/record/3541888"
+  }
+```
 ## Representing _Contextual Entities_

From a7474226a531db269dd0be23587a6a355b1c697d Mon Sep 17 00:00:00 2001
From: Stian Soiland-Reyes
Date: Thu, 9 Apr 2020 13:36:12 +0100
Subject: [PATCH 4/6] downloads of directories? Tricky!

---
 docs/1.1-DRAFT/index.md | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/docs/1.1-DRAFT/index.md b/docs/1.1-DRAFT/index.md
index 34a5ec07..5cbedc26 100644
--- a/docs/1.1-DRAFT/index.md
+++ b/docs/1.1-DRAFT/index.md
@@ -672,6 +672,29 @@ These can be included for File Data Entities as additional metadata, regardless
   }
 ```
+#### Directories on the web; dataset distributions
+
+A _Directory File Entry_ or `Dataset` identified by an absolute URL on the web can be harder to download than a `File` because it consists of multiple resources. It is RECOMMENDED that such directories have a complete listing of their content in `hasPart`, enabling download traversal.
+
+Alternatively a common mechanism to provide downloads of a reasonably sized directory is as an archive file in formats like `.zip` or `.tar.gz`, described as a [DataDownload].
+
+    {
+      "@id": "lots_of_little_files/",
+      "@type": "Dataset",
+      "description": "This directory contains many small files, that we're not going to describe in detail.",
+      "name": "Too many files",
+      "distribution": {"@id": "http://example.com/downloads/2020/lots_of_little_files.zip"}
+    },
+    {
+      "@id": "http://example.com/downloads/2020/lots_of_little_files.zip",
+      "@type": "DataDownload",
+      "encodingFormat": "application/zip",
+      "contentSize": "82818928"
+    }
+
+Similarly, the _RO-Crate root_ entity may also provide a `distribution` URL, in which case the download SHOULD be an archive that contains the _RO-Crate Metadata file_.
+
+In all cases, consumers should be aware that a `DataDownload` is a snapshot that may not reflect the current state of the `Dataset` or RO-Crate.
 ## Representing _Contextual Entities_
 The _RO-Crate JSON-LD_ @graph SHOULD contain additional information about _Contextual Entities_ for the use of both humans (in `ro-crate-preview.html`) and machines (in `ro-crate-metadata.jsonld`). This also helps to maximise the extent to which an _RO-Crate_ is self-contained and self-describing, in that it reduces the need for the consumer of an RO-Crate to refer to external information which may change or become unavailable over time.
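
Editor's note: the `distribution` lookup added in PATCH 4/6 can be sketched as follows. This is a rough Python illustration and not a reference implementation. It locates the _Root Data Entity_ using the algorithm quoted as context in PATCH 2/6 (find the descriptor whose `conformsTo` starts with `https://w3id.org/ro/crate/`, then follow `about`), then resolves any `DataDownload` entities. The helper names and the `crate.zip` URL are hypothetical.

```python
CRATE_PREFIX = "https://w3id.org/ro/crate/"


def as_list(value):
    """JSON-LD values may be a single item or an array; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_root(graph):
    """Locate the Root Data Entity via the metadata file descriptor's `about` link."""
    by_id = {entity["@id"]: entity for entity in graph}
    for entity in graph:
        for ref in as_list(entity.get("conformsTo")):
            if isinstance(ref, dict) and ref.get("@id", "").startswith(CRATE_PREFIX):
                return by_id.get(entity["about"]["@id"])
    return None


def distributions(entity, graph):
    """Return the entities that `entity` references under `distribution`."""
    by_id = {e["@id"]: e for e in graph}
    found = []
    for ref in as_list(entity.get("distribution")):
        target = by_id.get(ref.get("@id")) if isinstance(ref, dict) else None
        if target is not None:
            found.append(target)
    return found


graph = [
    {"@id": "ro-crate-metadata.jsonld", "@type": "CreativeWork",
     "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1-DRAFT"},
     "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset",
     # Hypothetical archive URL, not taken from the specification
     "distribution": {"@id": "http://example.com/downloads/2020/crate.zip"}},
    {"@id": "http://example.com/downloads/2020/crate.zip", "@type": "DataDownload",
     "encodingFormat": "application/zip"},
]

root = find_root(graph)
archives = distributions(root, graph)
```

Since a `DataDownload` is only a snapshot, a consumer would likely treat the returned archives as a convenience and not as the authoritative state of the crate.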

From 5fd261ccb6e52feeae108dc1a32f6254c0aa5839 Mon Sep 17 00:00:00 2001
From: Stian Soiland-Reyes
Date: Thu, 9 Apr 2020 16:58:13 +0100
Subject: [PATCH 5/6] Fix JSON, thanks @proccaserra

Fixes https://github.com/ResearchObject/ro-crate/pull/74#pullrequestreview-390813133
---
 docs/1.1-DRAFT/index.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/1.1-DRAFT/index.md b/docs/1.1-DRAFT/index.md
index 5cbedc26..ab93f9c6 100644
--- a/docs/1.1-DRAFT/index.md
+++ b/docs/1.1-DRAFT/index.md
@@ -606,8 +606,8 @@ Example of an RO-Crate where some of the _File Data Entities_ are external to th
         {
           "@id": "https://zenodo.org/record/3541888/files/ro-crate-1.0.0.pdf"
         },
-      ],
-    },
+      ]
+    },
     {
       "@id": "survey-responses-2019.csv",
       "@type": "File",

From fb132dcb3e0c6dd56cdde53f9260c8da12547b03 Mon Sep 17 00:00:00 2001
From: Stian Soiland-Reyes
Date: Mon, 27 Apr 2020 15:39:20 +0100
Subject: [PATCH 6/6] No longer require 1 or more payload files

---
 docs/1.1-DRAFT/index.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/docs/1.1-DRAFT/index.md b/docs/1.1-DRAFT/index.md
index ab93f9c6..e146569c 100644
--- a/docs/1.1-DRAFT/index.md
+++ b/docs/1.1-DRAFT/index.md
@@ -126,13 +126,15 @@ The structure an _RO-Crate_ MUST follow is:
 |   ro-crate-preview.html            # RO-Crate Website homepage MAY be present
 |   ro-crate-preview_files/          # MAY be present
 |    | [other RO-Crate Website files]
-|   [payload files and directories]  # 1 or more SHOULD be present
+|   [payload files and directories]  # 0 or more
 ```
 The name of the _RO-Crate root_ directory is not defined, but a root directory is identifiable by the presence of `ro-crate-metadata.jsonld`. For instance, if an _RO-Crate_ is archived in a ZIP-file, the ZIP root directory correspond to _RO-Crate root_ directory if it contains `ro-crate-metadata.jsonld`.
+[Data Entities](#core-metadata-for-data-entities) in the RO-Crate MUST either be _payload files/directories_ present within the RO-Crate root directory or its subdirectories, or be [Web-based Data Entities](#web-based-data-entities).
+
 ### RO-Crate Metadata File (`ro-crate-metadata.jsonld`)
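
Editor's note: the rule added in PATCH 6/6 (every Data Entity is either a payload file or directory inside the RO-Crate root, or a Web-based Data Entity) could be checked roughly as below. This is a sketch under stated assumptions: relative `@id` values are treated as percent-encoded URI paths and decoded before the filesystem lookup, web-based entities are accepted without any network access, and the function name `unresolved_data_entities` and the file `notes.txt` are hypothetical.

```python
import tempfile
from pathlib import Path
from urllib.parse import unquote, urlsplit


def unresolved_data_entities(crate_root: Path, graph: list) -> list:
    """Return @ids of File/Dataset entities that are neither on disk nor absolute URIs."""
    unresolved = []
    for entity in graph:
        types = entity.get("@type", [])
        types = [types] if isinstance(types, str) else types
        if not {"File", "Dataset"} & set(types):
            continue  # contextual entities and the metadata descriptor are skipped
        entity_id = entity["@id"]
        if urlsplit(entity_id).scheme:
            continue  # web-based data entity, not checked against the filesystem
        if not (crate_root / unquote(entity_id)).exists():
            unresolved.append(entity_id)
    return unresolved


with tempfile.TemporaryDirectory() as tmp:
    crate_root = Path(tmp)
    (crate_root / "survey-responses-2019.csv").write_text("id,answer\n")
    graph = [
        {"@id": "ro-crate-metadata.jsonld", "@type": "CreativeWork"},
        {"@id": "./", "@type": "Dataset"},
        {"@id": "survey-responses-2019.csv", "@type": "File"},
        {"@id": "https://zenodo.org/record/3541888/files/ro-crate-1.0.0.pdf", "@type": "File"},
        {"@id": "notes.txt", "@type": "File"},  # hypothetical entry with no file on disk
    ]
    unresolved = unresolved_data_entities(crate_root, graph)
```

With zero payload files now permitted, a crate made up only of web-based entities passes this check with an empty result.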