If you use these data please cite
- the original source
Dhakal, D. N., List, J.-M., & Roberts, S. G. (2024). A Phylogenetic Study of South-Western Tibetic. Journal of Language Evolution. https://doi.org/10.1093/jole/lzae008
- the derived dataset using the DOI of the particular released version you were using
This dataset is licensed under a CC-BY-4.0 license
Available online at https://github.com/lexibank/dhakalsouthwesttibetic
Conceptlists in Concepticon:
Data collection was led by D. N. Dhakal in 2018 and used a questionnaire of 243 items. The original data as it was collected is available in the folder raw/ in all files ending in .tab (Kagate_240.tab, etc.).
The CLDF conversion was first carried out with this original data, but we later converted the data from the first CLDF version to the EDICTOR format that we needed for the curation and annotation process. As a result, the data shared in the CLDF repository contains additional, at times manual, modifications. A comparison with the original data remains possible, in particular because the forms of the original collection are available in the column Value of the CSV file providing the forms in CLDF (cldf/forms.csv).
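Such a comparison can be done directly on the CSV file. The following is a minimal sketch using only the Python standard library, assuming the standard CLDF column names ID, Value, and Form in cldf/forms.csv:

```python
from csv import DictReader

# List all forms whose curated Form deviates from the originally collected Value.
# Column names ID, Value, and Form follow the standard CLDF FormTable layout.
with open("cldf/forms.csv", encoding="utf8") as f:
    for row in DictReader(f):
        if row["Value"] != row["Form"]:
            print(row["ID"], row["Value"], "->", row["Form"])
```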
We assume that you have Python available in a fresh virtual environment, as well as SQLite and a terminal that offers a basic shell (e.g., bash). To install the required Python packages, type:
pip install -e .
The data was later compared with and extended by data for Tibetic languages and Old Chinese from Sagart et al. (2019). The conversion was first carried out in a dedicated Python script that selected those concepts present in both datasets. The CLDF version now provides a combined dataset with both the originally collected data (wordlists of about 240 items) and the comparative wordlist to which the Tibetic languages and Old Chinese from Sagart et al. were added. Both versions (the original version of 8 varieties and the combined version with a limited number of concepts) can be retrieved with the commands we provide in the Makefile by typing:
make base-data
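Conceptually, the concept selection boils down to intersecting the Concepticon concept sets of the two datasets. The following sketch illustrates the idea (it is not the original conversion script); it assumes that each dataset provides a parameters.csv with a Concepticon_ID column, and the path to the Sagart et al. (2019) data is only a placeholder:

```python
from csv import DictReader

def concepticon_ids(path):
    # Collect the Concepticon IDs listed in a CLDF ParameterTable (parameters.csv).
    with open(path, encoding="utf8") as f:
        return {row["Concepticon_ID"] for row in DictReader(f) if row["Concepticon_ID"]}

# The second path is a placeholder for the Sagart et al. (2019) CLDF data.
shared = concepticon_ids("cldf/parameters.csv") & concepticon_ids(
    "path/to/sagart2019/cldf/parameters.csv")
print(len(shared), "concepts shared between the two datasets")
```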
This code makes use of the SQLite version of the data provided in the folder sqlite, which was created with the help of the pycldf package. The conversion of the data to SQLite can also be carried out with the help of the Makefile by typing:
make db
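Once the database exists, it can be inspected with Python's built-in sqlite3 module. The sketch below only lists the tables and their row counts, since the exact table and column names are determined by the schema that pycldf generates; the file name is a placeholder for the file found in the sqlite folder:

```python
import sqlite3

# The file name is a placeholder; use the actual file in the sqlite/ folder.
con = sqlite3.connect("sqlite/dhakalsouthwesttibetic.sqlite")
tables = con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
for (table,) in tables:
    count = con.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    print(f"{table}: {count} rows")
con.close()
```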
Accordingly, the full data can also be created:
make full-data
If you install the Python package pyedictor (pip install "pyedictor>=0.4"), you can also extract the base data and the full data with slightly modified commands that yield the same results:
make base-data-ed
make full-data-ed
Our phylogenetic analyses are based on the combined data. The Nexus file we used as the basis for these analyses can also be created automatically with the help of the Makefile:
make nexus-file
The resulting Nexus file is stored in the folder nexus as full-wordlist.nex.
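Conceptually, the Nexus file encodes cognate sets as binary characters, i.e. the presence or absence of each cognate set in each variety. The sketch below illustrates this encoding with invented toy data; it does not reproduce the Makefile target:

```python
# Toy illustration of binary cognate-set coding in NEXUS format.
# The cognate judgements below are invented and serve illustration only.
cognates = {
    "VarietyA": {"hand-1", "water-1"},
    "VarietyB": {"hand-1", "water-2"},
    "VarietyC": {"hand-2", "water-2"},
}

characters = sorted(set.union(*cognates.values()))
matrix = {
    taxon: "".join("1" if char in sets else "0" for char in characters)
    for taxon, sets in cognates.items()
}

with open("toy.nex", "w", encoding="utf8") as f:
    f.write("#NEXUS\n\nBEGIN DATA;\n")
    f.write(f"DIMENSIONS NTAX={len(matrix)} NCHAR={len(characters)};\n")
    f.write('FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=? GAP=-;\n')
    f.write("MATRIX\n")
    for taxon, row in sorted(matrix.items()):
        f.write(f"{taxon}  {row}\n")
    f.write(";\nEND;\n")
```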
If you want to compute TIGER scores, delta scores, and Q-residuals for the data, you can also do this with the Makefile, but you must install additional packages first:
make install
make tiger-et-al
This will print out the scores computed for the base wordlist and the full wordlist.
Wordlist | TIGER | Corrected TIGER | Delta | Q-Residuals
---|---|---|---|---
Combined | 0.678752 | 0.379645 | 0.342274 | 0.00852446
Tibetic | 0.74708 | 0.193927 | 0.39871 | 0.0122
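As background, the delta score of a distance matrix measures how tree-like the distances are: for every quartet of taxa, the three possible sums of opposite pairwise distances are compared, and the quartet's score is the difference between the two largest sums divided by the range of all three. The stdlib-only sketch below illustrates this computation with an invented toy matrix; it is not the implementation invoked by the Makefile:

```python
from itertools import combinations

def delta_score(dist, taxa):
    # Mean quartet delta score for a symmetric distance matrix.
    scores = []
    for i, j, k, l in combinations(taxa, 4):
        m1, m2, m3 = sorted(
            [dist[i][j] + dist[k][l], dist[i][k] + dist[j][l], dist[i][l] + dist[j][k]],
            reverse=True,
        )
        scores.append(0.0 if m1 == m3 else (m1 - m2) / (m1 - m3))
    return sum(scores) / len(scores)

# Invented toy distances for illustration only.
taxa = ["A", "B", "C", "D", "E"]
pairs = {("A", "B"): 0.2, ("A", "C"): 0.5, ("A", "D"): 0.6, ("A", "E"): 0.7,
         ("B", "C"): 0.5, ("B", "D"): 0.6, ("B", "E"): 0.7,
         ("C", "D"): 0.3, ("C", "E"): 0.6, ("D", "E"): 0.6}
dist = {a: {b: 0.0 for b in taxa} for a in taxa}
for (a, b), d in pairs.items():
    dist[a][b] = dist[b][a] = d
print(round(delta_score(dist, taxa), 4))
```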
- Varieties: 14 (linked to 13 different Glottocodes)
- Concepts: 243 (linked to 240 different Concepticon concept sets)
- Lexemes: 2,903
- Sources: 2
- Synonymy: 1.08
- Invalid lexemes: 0
- Tokens: 14,206
- Segments: 316 (0 BIPA errors, 0 CLTS sound class errors, 314 CLTS modified)
- Inventory size (avg): 68.57
Name | GitHub user | Description | Role |
---|---|---|---|
Dubi Nanda Dhakal | | main data collection and analysis | Author |
Johann-Mattis List | @lingulist | cognate coding | Author |
Sean Roberts | @seannyD | data cleaning and analysis | Author |
The following CLDF datasets are available in cldf:
- CLDF Wordlist at cldf/cldf-metadata.json
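For programmatic access, the wordlist can be loaded with the pycldf package. The following minimal sketch follows the pycldf documentation and should be run from the repository root:

```python
from collections import Counter
from pycldf import Dataset

# Load the CLDF wordlist via its metadata file.
dataset = Dataset.from_metadata("cldf/cldf-metadata.json")

# Count forms per variety, addressing the column by its CLDF property term.
counts = Counter(
    form["languageReference"]
    for form in dataset.iter_rows("FormTable", "languageReference")
)
for language, n in counts.most_common():
    print(language, n)
```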