This documentation is intended to help provide the characteristic information about a dataset. This data statement card has been borrowed and adapted from the University of Washington Tech Policy Lab.
Last updated: August 2023
Provide a short summary of the data.
This is intended to make it easier to navigate the different sections.
- Header
- Executive Summary
- Curation Rationale
- Documentation For Source Datasets
- Language Varieties
- Speaker Demographic
- Annotator Demographic
- Speech Situation and Text Characteristics
- Preprocessing and Data Formatting
- Capture Quality
- Limitations
- Metadata
- Disclosures and Ethical Review
- Other
- Glossary
This section describes important background information of the dataset.
- Dataset Title:
- Dataset Curator(s): [name, affiliation]
- Dataset Version: [version, date]
- Dataset Citation and DOI:
- Data Statement Author(s): [name, affiliation]
- Data Statement Version: [version, date]
- Data Statement Citation:
- Links to versions of this data statement in other languages:
Provide a summary of 60-100 words that should at least address the following:
- a one-sentence description of the curation rationale
- the language(s)
- an overview of relevant quantitative information such as the dataset size.
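The 60-100 word target above can be checked mechanically before publishing. The following is a minimal sketch, assuming that "words" means whitespace-separated tokens; the function names and default bounds are illustrative and not part of the template.

```python
def summary_word_count(summary: str) -> int:
    """Count whitespace-separated tokens in the summary text."""
    return len(summary.split())


def summary_length_ok(summary: str, low: int = 60, high: int = 100) -> bool:
    """Return True when the word count falls within [low, high], inclusive."""
    return low <= summary_word_count(summary) <= high
```

A check like this only covers length; whether the summary actually addresses the curation rationale, the language(s), and the dataset size still needs a human read.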
Provide a description that addresses the following questions:
- Why was this dataset created?
- What is the task or research question the dataset is intended to address?
- Which texts were included and what were the goals in selecting texts, both in the original collection and in any further sub-selection?
- What is the internal organization of the dataset?
- What constitutes a data instance?
Provide information about any pre-existing dataset(s) that the dataset was built from. Include information such as:
- Link to the data statement for the source dataset(s). If a data statement is not available, provide a link to a publication or any other documentation.
- Links to license(s) for source datasets.
Provide information about all the languages and language varieties present. The languages should be characterised with the following:
- a language tag from BCP-47 identifying the language variety (e.g., en-US or yue-Hant-HK)
- a prose description elucidating and elaborating on the BCP-47 tag (e.g., English as spoken in Palo Alto, California)
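As an illustration of how a tag such as en-US or yue-Hant-HK breaks down, the sketch below splits a tag into language, script, and region subtags. It is a deliberately simplified reading of BCP-47: it does not handle variants, extensions, or private-use subtags, and it does not check subtags against the IANA registry. The names `LanguageTag` and `parse_simple_tag` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Simplified pattern: language, optional script, optional region.
# Assumption: variants, extensions, and private-use subtags are out of scope.
_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


class LanguageTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def parse_simple_tag(tag: str) -> Optional[LanguageTag]:
    """Split a tag into subtags, or return None if it does not fit the pattern."""
    match = _TAG.fullmatch(tag)
    if match is None:
        return None
    script = match.group("script")
    region = match.group("region")
    # Apply the conventional casing: language lower, script title, region upper.
    return LanguageTag(
        match.group("language").lower(),
        script.title() if script else None,
        region.upper() if region else None,
    )
```

A tag that parses is only well-formed under this simplified pattern; the accompanying prose description is still what tells readers which variety is actually meant.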
Provide information about all the speaker groups represented in the dataset. Some of the specifications that can be included are:
- Age
- Gender
- Race/ethnicity
- Socioeconomic status
- First language(s)
- Proficiency in the language(s) of the data
- Number of different speakers represented
- Presence of disordered speech
Provide information about all the annotator groups represented in the dataset. Some of the specifications that can be included are:
- Age
- Gender
- Race/ethnicity
- Socioeconomic status
- First language(s)
- Proficiency in the language(s) of the data being annotated
- Number of different annotators represented
- Relevant training
Provide a description of the cultural context of the language practices collected. Some of the specifications that can be included are:
- Time and place of linguistic activity
- Date(s) of data collection
- Modality (spoken, signed, written)
- Scripted/edited vs. spontaneous
- Synchronous (e.g., in-person or live online chatting) vs. asynchronous (e.g., letters, emails, forums) interaction
- Speakers’ intended audience
- Genre (e.g., newswire vs. social media)
- Topic (e.g., entertainment vs. natural disaster)
- Non-linguistic context (e.g., photos speakers were all looking at; a game participants are playing)
- Additional details about the cultural context (optional)
Provide a description of all the preprocessing and data formatting modifications that were made to the data (except for annotations). This should include the following:
- information about any anonymization procedures
- any tools used to make the modifications
- whether or not raw data is included in the dataset
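When documenting an anonymization procedure, it helps to state the exact rule applied and how many items it changed. The sketch below shows one such rule, replacing email addresses with a placeholder. The regular expression is an approximation that will miss some valid addresses, and the function name and placeholder are assumptions for illustration, not a recommended procedure.

```python
import re
from typing import Tuple

# Approximate email pattern; assumption: good enough for illustration only.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, placeholder: str = "[EMAIL]") -> Tuple[str, int]:
    """Replace email-like strings and return (redacted_text, replacement_count)."""
    return _EMAIL.subn(placeholder, text)
```

Reporting the replacement count alongside the rule gives readers a sense of how much of the data was affected.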
Provide a description of any quality issues in the data capture (issues with the collection methodologies).
Provide a description of any challenges that could not be fully addressed and the resulting limitations.
Some of the following information can be provided:
- License: Link to the license/copyright permissions for use or modification of the dataset
- Annotation guidelines: Link to the published or online guidelines that annotators used to annotate the data
- Annotation process: Link to documentation providing metadata about the annotation process, including protections for annotator anonymity, how annotators were compensated, and which aspects of the annotation were produced automatically
- Dataset quality: Metrics for inter-annotator agreement and/or other numerical scores of dataset quality
- Errata: Link to the list of known errors and how to report additional ones
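For the dataset quality item, one commonly reported inter-annotator agreement metric for two annotators is Cohen's kappa, which corrects observed agreement for agreement expected by chance. The sketch below is one possible way to compute it, assuming exactly two annotators who each labeled the same items in the same order; the function name is illustrative, and other metrics may suit a given dataset better.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items in the same order."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

When reporting a score like this, state which metric was used, how many items and annotators it covers, and how disagreements were resolved.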
Some of the following information can be provided:
- A description of the funding source for the dataset, with relevant information (e.g., grant number)
- For projects that went through an ethical approval process, a link to the institution (e.g., IRB)
- A brief description of any consent process used
- If speakers or annotators were compensated, how compensation rates were determined
- Any access restrictions to the data
- Any potential conflicts of interest
Provide any further information that may be relevant to the dataset.
Provide a list of terms and their definitions. This may include terms that are technical or unfamiliar to non-experts.
Include this information about the document verbatim at the end of your data statement. If you adapt the data statement template, include a note about your changes here.