
Data statement for [Dataset Name]

==============================

This document records the key characteristics of a dataset. This data statement card has been borrowed and adapted from the University of Washington Tech Policy Lab.

Last updated: August 2023

Provide a short summary of the data.

Table of Contents

This is intended to make it easier to navigate the different sections.

1 HEADER


This section describes important background information of the dataset.

  • Dataset Title:
  • Dataset Curator(s): [name, affiliation]
  • Dataset Version: [version, date]
  • Dataset Citation and DOI:
  • Data Statement Author(s): [name, affiliation]
  • Data Statement Version: [version, date]
  • Data Statement Citation:
  • Links to versions of this data statement in other languages:

2 EXECUTIVE SUMMARY


Provide a summary of 60-100 words that addresses at least the following:

  • a one-sentence description of the curation rationale
  • the language(s)
  • an overview of relevant quantitative information such as the dataset size.
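
The 60-100 word target can be checked mechanically before publishing. The following is a minimal sketch, assuming that words are counted by splitting on whitespace; the function name `summary_word_count_ok` is hypothetical and not part of the template.

```python
def summary_word_count_ok(summary: str, low: int = 60, high: int = 100) -> bool:
    """Return True if the summary has between `low` and `high` words.

    Assumption: a "word" is any whitespace-separated token.
    """
    word_count = len(summary.split())
    return low <= word_count <= high


if __name__ == "__main__":
    draft = "This dataset was created to study code-switching in customer chats."
    # A short draft like this one falls below the 60-word minimum.
    print(summary_word_count_ok(draft))
```

Whitespace splitting is only a rough proxy; for languages written without spaces between words, a different counting rule would be needed.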

3 CURATION RATIONALE


Provide a description that addresses the following questions:

  • Why was this dataset created?
  • What is the task or research question the dataset is intended to address?
  • Which texts were included and what were the goals in selecting texts, both in the original collection and in any further sub-selection?
  • What is the internal organization of the dataset?
  • What constitutes a data instance?

4 DOCUMENTATION FOR SOURCE DATASETS


Provide information about any pre-existing datasets that this dataset was built from. Include information such as:

  • Link to the data statement for the source dataset(s). If no data statement is available, link to a publication or other documentation instead.
  • Links to license(s) for source datasets.

5 LANGUAGE VARIETIES


Provide information about all the languages and language varieties present. The languages should be characterised with the following:

  • a language tag from BCP-47 identifying the language variety (e.g., en-US or yue-Hant-HK)
  • a prose description elucidating and elaborating on the BCP-47 tag (e.g., English as spoken in Palo Alto, California)
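
To illustrate how a tag such as `en-US` or `yue-Hant-HK` breaks down, the sketch below splits a tag into language, script, and region subtags. It is a simplified illustration only: it assumes the tag uses just those three subtag types and does not handle the variants, extensions, extended-language, or private-use subtags that BCP-47 also allows. The function name `parse_simple_tag` is hypothetical.

```python
import re
from typing import Optional

# Simplified pattern (an assumption, not the full BCP-47 grammar):
# 2-3 letter language, optional 4-letter script,
# optional region (2 letters or 3 digits).
_SIMPLE_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|\d{3}))?"
)


def parse_simple_tag(tag: str) -> Optional[dict]:
    """Split a simple language tag into its subtags, or return None."""
    match = _SIMPLE_TAG.fullmatch(tag)
    if match is None:
        return None
    return match.groupdict()


if __name__ == "__main__":
    for tag in ("en-US", "yue-Hant-HK"):
        print(tag, parse_simple_tag(tag))
```

A check like this only confirms the shape of a tag. Whether the subtags are registered values still has to be verified against the IANA Language Subtag Registry.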

6 SPEAKER DEMOGRAPHIC


Provide information about all the speaker groups represented in the dataset. Some of the specifications that can be included are:

  • Age
  • Gender
  • Race/ethnicity
  • Socioeconomic status
  • First language(s)
  • Proficiency in the language(s) of the data
  • Number of different speakers represented
  • Presence of disordered speech

7 ANNOTATOR DEMOGRAPHIC


Provide information about all the annotator groups represented in the dataset. Some of the specifications that can be included are:

  • Age
  • Gender
  • Race/ethnicity
  • Socioeconomic status
  • First language(s)
  • Proficiency in the language(s) of the data being annotated
  • Number of different annotators represented
  • Relevant training

8 SPEECH SITUATION AND TEXT CHARACTERISTICS


Provide a description of the cultural context of the language practices collected. Some of the specifications that can be included are:

  • Time and place of linguistic activity
  • Date(s) of data collection
  • Modality (spoken, signed, written)
  • Scripted/edited vs. spontaneous
  • Synchronous (e.g., in-person or live online chatting) vs. asynchronous (e.g., letters, emails, forums) interaction
  • Speakers’ intended audience
  • Genre (e.g., newswire vs. social media)
  • Topic (e.g., entertainment vs. natural disaster)
  • Non-linguistic context (e.g., photos speakers were all looking at; a game participants are playing)
  • Additional details about the cultural context (optional)

9 PREPROCESSING AND DATA FORMATTING


Provide a description of all the preprocessing and data formatting modifications that were made to the data (except for annotations). This should include the following:

  • information about any anonymization procedures
  • any tools used to make the modifications
  • whether or not raw data is included in the dataset

10 CAPTURE QUALITY


Provide a description of any quality issues in the data capture (issues with the collection methodologies).

11 LIMITATIONS


Provide a description of any challenges that could not be fully addressed and the resulting limitations.

12 METADATA


Some of the following information can be provided:

  • License: Link to the license/copyright permissions for use or modification of the dataset
  • Annotation guidelines: Link to the published or online guidelines that annotators used to annotate the data
  • Annotation process: Link to documentation providing metadata about the annotation process, including protections for annotator anonymity, how annotators were compensated, and which aspects of the annotation were produced automatically
  • Dataset quality: Metrics for inter-annotator agreement and/or other numerical scores of dataset quality
  • Errata: Link to the list of known errors and how to report additional ones
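
One common inter-annotator agreement metric for the "Dataset quality" item is Cohen's kappa, which compares the observed agreement between two annotators with the agreement expected by chance. The sketch below is one possible way to compute it, assuming exactly two annotators who each labeled the same items with categorical labels. The function name `cohens_kappa` and the sample labels are illustrative, not taken from any dataset.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty item set.")
    n = len(labels_a)

    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's label proportions.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, and this sketch reports 1.0 by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    annotator_1 = ["pos", "pos", "neg", "neg"]
    annotator_2 = ["pos", "neg", "neg", "neg"]
    print(cohens_kappa(annotator_1, annotator_2))
```

For more than two annotators or for missing labels, other metrics such as Fleiss' kappa or Krippendorff's alpha are usually more appropriate. Whichever metric is reported, name it explicitly in the data statement.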

13 DISCLOSURES AND ETHICAL REVIEW


Some of the following information can be provided:

  • A description of the funding source for the dataset, with relevant information (e.g., grant number)
  • For projects that went through an ethical approval process, a link to the institution (e.g., IRB)
  • A brief description of any consent process used
  • If speakers or annotators were compensated, how compensation rates were determined
  • Any access restrictions on the data, and any potential conflicts of interest

14 OTHER


Provide any further information that may be relevant to the dataset.

15 GLOSSARY


Provide a list of terms and their definitions. This may include technical terms or terms unfamiliar to non-experts.

About this document


Include this information about the document verbatim at the end of your data statement. If you adapt the data statement template, include a note about your changes here.