
CLDF dataset accompanying Roberts et al.'s "Lexical Wordlists of South-West Tibetic Languages" from 2022

lexibank/dhakalsouthwesttibetic

CLDF dataset derived from Dhakal et al.'s "South-Western Tibetic" from 2024

How to cite

If you use these data please cite

Description

This dataset is licensed under a CC-BY-4.0 license

Available online at https://github.com/lexibank/dhakalsouthwesttibetic

Conceptlists in Concepticon:

Notes

Data Collection

Data collection was led by D. N. Dhakal in 2018, using a questionnaire of 243 items. The data as originally collected is available in the folder raw/, in the files ending in .tab (Kagate_240.tab, etc.).

The CLDF conversion was first carried out on this original data. We later converted the first CLDF version to the EDICTOR format, which we needed for curation and annotation. As a result, the data shared in the CLDF repository contains additional modifications, some of them manual. A comparison with the original data remains possible, because the forms of the original collection are preserved in the column Value of the CLDF forms file (cldf/forms.csv).
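The comparison with the original data can be sketched in a few lines of Python. This is a minimal sketch, not part of the repository: it assumes that cldf/forms.csv has the columns ID and Form (both standard in CLDF wordlists) alongside the Value column mentioned above, and the function name changed_forms is hypothetical.

```python
import csv
from pathlib import Path


def changed_forms(forms_csv):
    """Yield (ID, Value, Form) for rows whose curated form differs from the original value.

    Assumes the columns ID, Value and Form exist in the file.
    """
    with Path(forms_csv).open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            value, form = row["Value"].strip(), row["Form"].strip()
            if value != form:
                yield row["ID"], value, form


if __name__ == "__main__":
    path = Path("cldf/forms.csv")  # location named in the text above
    if path.exists():
        for form_id, value, form in changed_forms(path):
            print(form_id, value, form, sep="\t")
```

Run from the repository root, this lists every form whose curated version no longer matches the originally collected value.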

Requirements

We assume that you have Python available in a fresh virtual environment, as well as SQLite and a terminal with a basic shell (e.g. bash).

To install the required Python packages, type:

pip install -e .

Comparison and Extension of the Data

The data was later compared with, and extended by, data for Tibetic languages and Old Chinese from Sagart et al. (2019). The conversion was first carried out in a dedicated Python script that selected the concepts present in both datasets. The CLDF version now provides a combined dataset with both the originally collected data (wordlists of about 240 items) and the comparative wordlist to which the Tibetic languages and Old Chinese from Sagart et al. are added. Both versions (the original version of 8 varieties and the combined version with a limited number of concepts) can be retrieved with commands provided in the Makefile. For the base data, type:

make base-data

This code makes use of the SQLite version of the data provided in the folder sqlite which was created with the help of the pycldf package. The conversion of the data to SQLite can also be carried out with the help of the Makefile by typing:

make db

Accordingly, the full data can also be created:

make full-data

If you install the Python package pyedictor (pip install "pyedictor>=0.4"), you can also extract the base data and the full data with slightly modified commands, which yield the same results:

make base-data-ed
make full-data-ed
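The combined wordlist is restricted to concepts present in both sources. The selection step can be illustrated as follows. This is only a sketch under stated assumptions, not the script used for the dataset: the function names shared_concepts and restrict are hypothetical, and the toy rows are invented for illustration.

```python
def shared_concepts(first, second):
    """Return the concept identifiers that occur in both wordlists, sorted."""
    return sorted(set(first) & set(second))


def restrict(rows, concepts):
    """Keep only the (language, concept, form) rows whose concept is shared."""
    keep = set(concepts)
    return [row for row in rows if row[1] in keep]


# Invented toy rows, not taken from either dataset.
fieldwork = [("VarietyA", "HAND", "form1"), ("VarietyA", "DOG", "form2")]
comparative = [("VarietyB", "HAND", "form3"), ("VarietyB", "SUN", "form4")]

common = shared_concepts(
    (concept for _, concept, _ in fieldwork),
    (concept for _, concept, _ in comparative),
)
combined = restrict(fieldwork + comparative, common)
print(common)
print(combined)
```

Only HAND occurs in both toy lists, so the combined wordlist keeps the two HAND rows and drops DOG and SUN.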

Our phylogenetic analyses are based on the combined data. The Nexus file we used as their basis can also be created automatically with the help of the Makefile.

make nexus-file

The resulting Nexus file is stored in the folder nexus as full-wordlist.nex.

If you want to test TIGER scores, Delta Scores, and Q-Residuals in the data, you can also do this with the Makefile, but you must install additional packages first.

make install
make tiger-et-al

This will print out the scores computed for the base wordlist and the full wordlist.

Wordlist       TIGER    Corrected TIGER     Delta    Q-Residuals
----------  --------  -----------------  --------  -------------
Combined    0.678752           0.379645  0.342274     0.00852446
Tibetic     0.74708            0.193927  0.39871      0.0122

Statistics

Glottolog: 100% Concepticon: 99% Source: 100% BIPA: 100% CLTS SoundClass: 100%

  • Varieties: 14 (linked to 13 different Glottocodes)
  • Concepts: 243 (linked to 240 different Concepticon concept sets)
  • Lexemes: 2,903
  • Sources: 2
  • Synonymy: 1.08
  • Invalid lexemes: 0
  • Tokens: 14,206
  • Segments: 316 (0 BIPA errors, 0 CLTS sound class errors, 314 CLTS modified)
  • Inventory size (avg): 68.57
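To make the synonymy figure easier to interpret, the sketch below computes such an index from a list of forms. It assumes that synonymy means the mean number of lexemes per filled language-concept slot. That definition is an assumption and is not stated in this README, and the function name and toy data are hypothetical.

```python
from collections import Counter


def synonymy(forms):
    """Mean number of lexemes per filled (language, concept) slot (assumed definition)."""
    slots = Counter((language, concept) for language, concept in forms)
    if not slots:
        raise ValueError("no forms given")
    return sum(slots.values()) / len(slots)


# Invented toy data: variety A has two words for HAND.
toy = [("A", "HAND"), ("A", "HAND"), ("A", "DOG"), ("B", "HAND")]
print(round(synonymy(toy), 2))
```

Under this reading, a value of 1.0 would mean exactly one lexeme per slot, and values above 1.0 indicate that some concepts have more than one word in a variety.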

Contributors

Name                GitHub user  Description                        Role
------------------  -----------  ---------------------------------  ------
Dubi Nanda Dhakal                main data collection and analysis  Author
Johann-Mattis List  @lingulist   cognate coding                     Author
Sean Roberts        @seannyD     data cleaning and analysis         Author

CLDF Datasets

The following CLDF datasets are available in cldf:
