Data statement for [Dataset Name]
==============================

This document records the characteristic information about a dataset. This data statement card is borrowed and adapted from the University of Washington Tech Policy Lab.

Last updated: August 2023

Provide a short summary of the data.

1 HEADER

This section describes important background information about the dataset.

Dataset Title:
Dataset Curator(s): [name, affiliation]
Dataset Version: [version, date]
Dataset Citation and DOI:
Data Statement Author(s): [name, affiliation]
Data Statement Version: [version, date]
Data Statement Citation:
Links to versions of this data statement in other languages:

2 EXECUTIVE SUMMARY

Provide a summary of 60-100 words that addresses at least the following:

a one-sentence description of the curation rationale
the language(s)
an overview of relevant quantitative information such as the dataset size.

3 CURATION RATIONALE

Provide a description that addresses the following questions:

Why was this dataset created?
What is the task or research question the dataset is intended to address?
Which texts were included and what were the goals in selecting texts, both in the original collection and in any further sub-selection?
What is the internal organization of the dataset?
What constitutes a data instance?

4 DOCUMENTATION FOR SOURCE DATASETS

Provide information about any pre-existing dataset(s) that this dataset was built from. Include information such as:

Link to the data statement for the source dataset(s). If a data statement is not available, link to a publication or other documentation.
Links to license(s) for source datasets.

5 LANGUAGE VARIETIES

Provide information about all the languages and language varieties present. The languages should be characterised with the following:

a language tag from BCP-47 identifying the language variety (e.g., en-US or yue-Hant-HK)
a prose description elucidating and elaborating on the BCP-47 tag (e.g., English as spoken in Palo Alto, California)
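
As a rough illustration of how the tags above break down, the following sketch splits a tag into its language, script, and region subtags. It is a simplified check under stated assumptions, not a full BCP-47 implementation: variants, extensions, and private-use subtags are not handled, and the names `LanguageTag` and `parse_language_tag` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class LanguageTag(NamedTuple):
    """Hypothetical container for the three most common subtags."""
    language: str
    script: Optional[str]
    region: Optional[str]


# Simplified pattern: language (2-3 letters), optional script (4 letters),
# optional region (2 letters or 3 digits). Assumption: other subtags that
# BCP-47 allows (variants, extensions, private use) are out of scope here.
_TAG_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def parse_language_tag(tag: str) -> LanguageTag:
    """Split a tag such as 'yue-Hant-HK' into normalized subtags."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unsupported or malformed tag: {tag!r}")
    script = match.group("script")
    region = match.group("region")
    return LanguageTag(
        language=match.group("language").lower(),     # conventionally lowercase
        script=script.title() if script else None,    # conventionally title case
        region=region.upper() if region else None,    # conventionally uppercase
    )
```

A check like this only confirms the shape of a tag. Whether the subtags are actually registered would need to be verified against the IANA Language Subtag Registry, and the prose description remains necessary in either case.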

6 SPEAKER DEMOGRAPHIC

Provide information about all the speaker groups represented in the dataset. Some of the specifications that can be included are:

Age
Gender
Race/ethnicity
Socioeconomic status
First language(s)
Proficiency in the language(s) of the data
Number of different speakers represented
Presence of disordered speech

7 ANNOTATOR DEMOGRAPHIC

Provide information about all the annotator groups represented in the dataset. Some of the specifications that can be included are:

Age
Gender
Race/ethnicity
Socioeconomic status
First language(s)
Proficiency in the language(s) of the data being annotated
Number of different annotators represented
Relevant training

8 SPEECH SITUATION AND TEXT CHARACTERISTICS

Provide a description of the cultural context of the language practices collected. Some of the specifications that can be included are:

Time and place of linguistic activity
Date(s) of data collection
Modality (spoken, signed, written)
Scripted/edited vs. spontaneous
Synchronous (e.g., in-person or live online chatting) vs. asynchronous (e.g., letters, emails, forums) interaction
Speakers’ intended audience
Genre (e.g., newswire vs. social media)
Topic (e.g., entertainment vs. natural disaster)
Non-linguistic context (e.g., photos speakers were all looking at; a game participants are playing)
Additional details about the cultural context (optional)

9 PREPROCESSING AND DATA FORMATTING

Provide a description of all the preprocessing and data formatting modifications that were made to the data (except for annotations). This should include the following:

information about any anonymization procedures
any tools used to make the modifications
whether or not raw data is included in the dataset

10 CAPTURE QUALITY

Provide a description of any quality issues in the data capture (issues with the collection methodologies).

11 LIMITATIONS

Provide a description of any challenges that could not be fully addressed and the resulting limitations.

12 METADATA

Some of the following information can be provided:

License: Link to the license/copyright permissions for use or modification of the dataset
Annotation guidelines: Link to the published or online guidelines that annotators used to annotate the data
Annotation process: Link to documentation providing metadata about the annotation process, including protections for annotator anonymity, how annotators were compensated, and which aspects of the annotation were produced automatically
Dataset quality: Metrics for inter-annotator agreement and/or other numerical scores of dataset quality
Errata: Link to the list of known errors and how to report additional ones
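
For the "Dataset quality" item, one commonly reported inter-annotator agreement metric for two annotators is Cohen's kappa. The sketch below is a minimal illustration, assuming two annotators who each labeled the same items in the same order; the function name `cohens_kappa` and the example labels are hypothetical, and this template does not prescribe any particular metric.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items in order."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)

    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's label proportions,
    # summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, so this sketch reports full agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up labels for illustration only.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

With more than two annotators or with missing labels, a different statistic (for example Fleiss' kappa or Krippendorff's alpha) would be the more usual choice.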

13 DISCLOSURES AND ETHICAL REVIEW

Some of the following information can be provided:

a description of the funding source for the dataset and relevant information (e.g., grant number)
for projects that went through an ethical approval process, a link to the institution (e.g., IRB)
a brief description of any consent process used
if speakers or annotators were compensated, how compensation rates were determined
any access restrictions to the data
any potential conflicts of interest

14 OTHER

Provide any further information that may be relevant to the dataset.

15 GLOSSARY

Provide a list of terms and their definitions. This may include terms that are technical or unfamiliar to nonexperts.

About this document

Include this information about the document verbatim at the end of your data statement. If you adapt the data statement template, include a note about your changes here.
