# Data Observation and Curation
## Data Management
Store all of your research data in the `data` subdirectories. It is recommended that raw data not be altered once downloaded or collected. Maintaining a separate raw data file facilitates reproducibility by preserving a common point of analytical origin. It is similarly recommended that, whenever possible, data processing, transformation, or manipulation be completed with code, as this practice facilitates re-analysis and reduces opportunities for confusion.
Complete the [data_metadata.csv](data_metadata.csv) file indexing each **raw** and **derived** data file, including the fields:
- `path`: the path to the data folder, likely one of: `raw\private`, `raw\public`, `derived\private` or `derived\public`
- `name`: the file name, including extension
- `metadata`: list of metadata files for this data source, stored in the `data\metadata` folder. These may include ISO-191** or FGDC standard `XML` files, data dictionaries, licenses or attributions, user guides, webpage printouts, etc.
- `status`: one of `included` for data included in the repository; `create` or `acquire` for data that must be created or acquired; `derived` for data that will be generated by code from other data files; `simulated` for data that replaces the true research data with simulated data due to confidentiality or legal constraints; and `unavailable` for data that cannot be shared or reproduced in any way.
- `description`: *very* brief description of the dataset.
Researchers are **strongly encouraged** to include additional metadata in the `metadata` folder.
Further information about the procedures used to create data with `status = derived` should be maintained in [procedure_metadata.csv](../procedure/procedure_metadata.csv).
See more about metadata in the engaging with data section of the previous chapter.
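The index file can be checked with a short script before committing. The following is a minimal sketch, not part of the template: it assumes the column names and `status` values listed above, and the function name `check_metadata` and the file location are illustrative assumptions.

```python
import csv
from pathlib import Path

# Field names and status values are taken from the field list above.
REQUIRED_FIELDS = ["path", "name", "metadata", "status", "description"]
ALLOWED_STATUS = {"included", "create", "acquire", "derived", "simulated", "unavailable"}


def check_metadata(csv_path):
    """Return a list of human-readable problems found in the index file."""
    problems = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = [f for f in REQUIRED_FIELDS if f not in (reader.fieldnames or [])]
        if missing:
            return [f"missing column(s): {', '.join(missing)}"]
        for row_number, row in enumerate(reader, start=2):  # row 1 is the header
            status = (row["status"] or "").strip()
            if status not in ALLOWED_STATUS:
                problems.append(f"row {row_number}: unknown status '{status}'")
            if not (row["name"] or "").strip():
                problems.append(f"row {row_number}: empty file name")
    return problems


if __name__ == "__main__":
    index = Path("data_metadata.csv")  # assumed location; adjust to your repository
    if index.exists():
        for problem in check_metadata(index):
            print(problem)
```

An empty result means every row has a file name and a recognized status. The script does not check that the listed files actually exist.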
## Collect preliminary data
- metadata!
- code/scripts for data acquisition
- directory structure for data
  - scratch (not tracked)
  - raw / public
  - raw / private (not tracked)
  - derived / public
  - derived / private (not tracked)
- file size limits for GitHub / GitLab
| Processing | Private access | Public access |
| :--: | :--: | :--: |
| Raw | RPri | RPub |
| Derived | DPri | DPub |
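The folder layout outlined above can be created in one step. This is a minimal sketch under stated assumptions: the folder names mirror the outline, while the function name `create_data_folders` and the `data` root are illustrative and may differ from your repository.

```python
from pathlib import Path

# Folder names follow the scratch, raw/derived, and public/private layout above.
DATA_FOLDERS = [
    "scratch",
    "raw/public",
    "raw/private",
    "derived/public",
    "derived/private",
]


def create_data_folders(data_root):
    """Create each data subfolder under data_root and return the folder paths."""
    created = []
    for folder in DATA_FOLDERS:
        target = Path(data_root) / folder
        # exist_ok makes the function safe to re-run on an existing project
        target.mkdir(parents=True, exist_ok=True)
        created.append(target)
    return created
```

Calling `create_data_folders("data")` from the repository root would build the tree. Existing folders and their contents are left untouched.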
### Raw private data
Store raw data in this folder as it is collected or downloaded if the data cannot be publicly redistributed. For example, data versioning and sharing may be restricted because of large file sizes, licensing, ethics, privacy, or confidentiality. Best practices are to include code to automate the process of downloading or simulating raw private data in the first step of the methods, or to include instructions here for accessing any private or restricted-access data.
*This folder is ignored by Git versioning*, with the exception of this `readme.md` file, by the following lines in `.gitignore`:
```gitignore
# Ignore contents of private folder, with the exception of its readme file
private/**
!private/readme.md
```
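Automating the download of raw private data, as recommended above, might look like the following. It is a minimal sketch using only the Python standard library. The function name `fetch_private_file` is illustrative, and any URL or destination path passed to it is your own assumption about where the data lives.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_private_file(url, destination):
    """Download url to destination unless the file is already present.

    Returns True if a download happened, False if the file already existed.
    """
    destination = Path(destination)
    if destination.exists():
        return False  # keep the existing raw file untouched
    destination.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(destination))
    return True


# Hypothetical usage; replace the URL and file name with your own data source:
# fetch_private_file("https://example.org/survey.csv", "data/raw/private/survey.csv")
```

Skipping files that already exist keeps the raw copy unaltered between runs, consistent with the advice not to modify raw data once it has been collected.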
### Caution: Dealing with large files
Files can come in two flavors: **plain text**, like source code, Markdown, or system logs; and **binary** files, like images, videos, or shapefiles.
As a version management tool, Git is designed to track changes in **plain text** files; it can store changes in **binary** files as well, but it can only record that the whole file changed.
GitHub will warn you of files larger than `50 MiB`, and reject files larger than `100 MiB`.
Therefore, large files should generally be placed in `private` directories so that they are not tracked by Git or uploaded to GitHub.
OSF and Figshare both allow for larger file storage options, so you may store large files on those services and write code for downloading those files to private directories as the analysis runs.
Significant data sources could be registered with their own DOI links.
If version management of large files is required, GitHub provides **paid** hosting for the Git LFS (Large File Storage) program.
However, we still suggest saving large files as separate data resources,
so that downstream researchers attempting to reproduce or replicate your work are not required to
modify your code, or install and pay for the same large file storage options that you have used.
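Before committing, it can help to scan the repository for files that would trigger the size limits described above. The following is a minimal sketch: the thresholds come from the limits stated above, while the function name `find_large_files` and the choice to skip the `.git` folder are illustrative assumptions.

```python
from pathlib import Path

MIB = 1024 * 1024
WARN_BYTES = 50 * MIB     # size above which GitHub warns
REJECT_BYTES = 100 * MIB  # size above which GitHub rejects the file


def find_large_files(root, warn_bytes=WARN_BYTES, reject_bytes=REJECT_BYTES):
    """Return (relative_path, size, label) tuples for files over the warning size."""
    root = Path(root)
    flagged = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        # skip directories and Git's own internal storage
        if not path.is_file() or ".git" in relative.parts:
            continue
        size = path.stat().st_size
        if size > reject_bytes:
            flagged.append((relative, size, "reject"))
        elif size > warn_bytes:
            flagged.append((relative, size, "warn"))
    return flagged
```

Running `find_large_files(".")` from the repository root lists candidates to move into a `private` directory. The scan looks at every file on disk, including files that Git already ignores.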
If you have already committed changes with large files, follow GitHub's instructions here: [Removing files from a repository's history](https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-large-files-on-github?platform=mac#removing-files-from-a-repositorys-history).
On GitHub Desktop, go to the History tab of your repository and **undo** the last commit.
The Changes tab will repopulate with the changes from that commit, where you should be able to uncheck any large files to exclude them from the commit.
Meanwhile, move the large files into the appropriate `private` directory, and the `.gitignore` file should take over and remove them from the list of changes in GitHub Desktop.
## Updating the analysis plan
You will likely encounter unexpected challenges that require changes to your original, pre-registered analysis plan.
This is *normal*: just be diligent about updating your analysis plan, cataloguing deviations from the original plan, and committing changes to the repository.
Document **unplanned deviations** as they occur in the analysis plan.
If the study is a metascience study, then categorize unplanned deviations **for reproduction** if the aim of the deviation is still to reproduce the original methodology and original results.
Categorize deviations as **for reanalysis** if the aim is to alter a methodological parameter of the study to compare results, e.g. as a test of sensitivity, uncertainty, or robustness.
Categorize deviations as **for replication** if the aim is to alter the spatial-temporal coverage of the study or to otherwise repeat the study methodology with new data/observations.
For full transparency, document both the rationale and the form of each deviation.