add vignette on metadata
ThierryO committed Aug 28, 2024
1 parent 72f8e00 commit 0d58450
Showing 2 changed files with 100 additions and 0 deletions.
2 changes: 2 additions & 0 deletions _pkgdown.yml
@@ -17,6 +17,8 @@ navbar:
href: articles/plain_text.html
- text: Storing dataframes under version control
href: articles/version_control.html
- text: Metadata
href: articles/metadata.html
- text: Potential workflow
href: articles/workflow.html
- text: Efficiency
98 changes: 98 additions & 0 deletions vignettes/metadata.Rmd
@@ -0,0 +1,98 @@
---
title: "Adding metadata"
author: "Thierry Onkelinx"
output:
  rmarkdown::html_vignette:
    fig_caption: yes
vignette: >
  %\VignetteIndexEntry{Adding metadata}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Introduction

`git2rdata` supports extra metadata since version 0.4.1.
The metadata is stored in a separate YAML file in the same directory as the data file, with the same name but the extension `.yml`.
The metadata file contains a generic section with information about the data file as a whole, and one section per variable with information about that variable.

The generic section contains the following mandatory fields, automatically created by `git2rdata`:

- `git2rdata`: the version of `git2rdata` used to create the metadata.
- `datahash`: the hash of the data file.
- `hash`: the hash of the metadata file.
- `optimize`: a logical indicating whether the data file is stored in the optimized `git2rdata` format.
- `sorting`: a character vector with the names of the variables used to sort the data file.
- `split_by`: a character vector with the names of the variables used to split the data file.
- `NA string`: the string used to represent missing values in the data file.

The generic section can contain the following optional fields:

- `table name`: the name of the dataset.
- `title`: the title of the dataset.
- `description`: a description of the dataset.

The variable sections contain the following mandatory fields, automatically created by `git2rdata`:

- `type`: the type of the variable.
- `class`: the class of the variable.
- `levels`: the levels of the variable (for factors).
- `index`: the index of the variable (for factors).
- `NA string`: the string used to represent missing values in the variable.

The variable sections can contain the following optional fields:

- `description`: a description of the variable.

## Adding metadata

`write_vc()` only stores the mandatory fields in the metadata file.

```{r store-metadata}
library(git2rdata)
root <- tempfile("git2rdata-metadata")
dir.create(root)
write_vc(iris, file = "iris", root = root, sorting = "Sepal.Length")
```
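
As a minimal sketch of what this looks like on disk, the chunk below writes the same data to a separate scratch directory, lists the files, and prints the raw metadata file.
The directory `demo_root` is a hypothetical name used only for this illustration, and the file name `iris.yml` is derived from the naming rule described in the introduction.

```{r inspect-metadata-file}
library(git2rdata)
# hypothetical scratch directory, separate from `root` above
demo_root <- tempfile("git2rdata-inspect")
dir.create(demo_root)
write_vc(iris, file = "iris", root = demo_root, sorting = "Sepal.Length")
# the data file and its metadata file share the base name "iris"
list.files(demo_root)
# print the raw YAML that write_vc() generated
writeLines(readLines(file.path(demo_root, "iris.yml")))
```

Printing the file rather than parsing it keeps the example free of extra dependencies and shows the field names exactly as `git2rdata` writes them.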

## Reading metadata

`read_vc()` reads the metadata file and adds it as attributes to the `data.frame`.
`print()` and `summary()` point the user to `display_metadata()`, which displays the metadata of a `git2rdata` object.
Missing optional metadata results in an `NA` value in the output of `display_metadata()`.

```{r read-metadata}
my_iris <- read_vc("iris", root = root)
str(my_iris)
print(head(my_iris))
summary(my_iris)
display_metadata(my_iris)
```
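
Because the metadata travels as attributes, base R can show which attributes the object carries beyond those of a plain `data.frame`.
The chunk below is a sketch under that assumption; `attr_root`, `attr_iris` and `base_attributes` are hypothetical names, and the attribute names that `git2rdata` uses are not documented here, so inspect the output rather than relying on specific names.

```{r inspect-attributes}
library(git2rdata)
# hypothetical scratch directory, used only for this illustration
attr_root <- tempfile("git2rdata-attributes")
dir.create(attr_root)
write_vc(iris, file = "iris", root = attr_root, sorting = "Sepal.Length")
attr_iris <- read_vc("iris", root = attr_root)
# attributes that every data.frame carries
base_attributes <- c("names", "row.names", "class")
# whatever remains was added on top of a plain data.frame
setdiff(names(attributes(attr_iris)), base_attributes)
```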

## Updating the optional metadata

Use `update_description()` to add or update the optional metadata of a stored `git2rdata` object.
Setting an argument to `NA` or an empty string removes the corresponding field from the metadata.
The function only updates the metadata file, not the data file.
An object read earlier keeps its old metadata, so read the object again before using `display_metadata()` to see the changes.

```{r update-metadata}
update_description(
  file = "iris", root = root, name = "iris", title = "Iris dataset",
  description =
"The Iris dataset is a multivariate dataset introduced by the British
statistician and biologist Ronald Fisher in his 1936 paper The use of multiple
measurements in taxonomic problems.",
  field_description = c(
    Sepal.Length = "The length of the sepal in cm",
    Sepal.Width = "The width of the sepal in cm",
    Petal.Length = "The length of the petal in cm",
    Petal.Width = "The width of the petal in cm",
    Species = "The species of the iris"
  )
)
display_metadata(read_vc("iris", root = root))
```
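
Removing a field works the same way.
The sketch below first sets a title and then clears it again with an empty string, as described above.
It assumes that `update_description()` accepts a call that supplies only some of the optional arguments; `rm_root` is a hypothetical scratch directory.

```{r remove-metadata}
library(git2rdata)
# hypothetical scratch directory, used only for this illustration
rm_root <- tempfile("git2rdata-remove")
dir.create(rm_root)
write_vc(iris, file = "iris", root = rm_root, sorting = "Sepal.Length")
# add a title (assumption: the other optional arguments may be omitted)
update_description(file = "iris", root = rm_root, title = "Iris dataset")
display_metadata(read_vc("iris", root = rm_root))
# an empty string removes the field again
update_description(file = "iris", root = rm_root, title = "")
display_metadata(read_vc("iris", root = rm_root))
```

Comparing the two `display_metadata()` outputs shows whether the title was removed as expected.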
