Commit
Cleanup, refactorings, re-generated CLDF, added tests.
chrzyki committed Nov 11, 2019
1 parent 288d281 commit c0e6a4b
Showing 12 changed files with 2,855 additions and 2,756 deletions.
27 changes: 27 additions & 0 deletions FORMS.md
@@ -0,0 +1,27 @@
## Specification of form manipulation


Specification of the value-to-form processing in Lexibank datasets:

The value-to-form processing is divided into two steps, implemented as methods:
- `FormSpec.split`: Splits a string into individual form chunks.
- `FormSpec.clean`: Normalizes a form chunk.

These methods use the attributes of a `FormSpec` instance to configure their behaviour.

- `brackets`: `{'(': ')'}`
Pairs of strings that should be recognized as brackets, specified as `dict` mapping opening string to closing string
- `separators`: `(';', '/', ',')`
  Iterable of single-character tokens that should be recognized as word separators
- `missing_data`: `('?', '-')`
Iterable of strings that are used to mark missing data
- `strip_inside_brackets`: `True`
Flag signaling whether to strip content in brackets (**and** strip leading and trailing whitespace)
- `replacements`: `[]`
List of pairs (`source`, `target`) used to replace occurrences of `source` in forms with `target` (before stripping content in brackets)
- `first_form_only`: `False`
Flag signaling whether at most one form should be returned from `split` - effectively ignoring any spelling variants, etc.
- `normalize_whitespace`: `True`
Flag signaling whether to normalize whitespace - stripping leading and trailing whitespace and collapsing multi-character whitespace to single spaces
- `normalize_unicode`: `None`
Unicode normalization form to use for the input of `split` (`None`, `'NFD'`, or `'NFC'`)
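The two-step pipeline above can be sketched as a minimal standalone reimplementation. This is illustrative only: the function names, signatures, and exact semantics here are assumptions for the sketch, not pylexibank's actual `FormSpec` code.

```python
import re
import unicodedata


def split_forms(value, separators=(";", "/", ","), missing_data=("?", "-"),
                first_form_only=False, normalize_unicode=None):
    """Split a raw value into form chunks (sketch of FormSpec.split)."""
    if normalize_unicode:  # e.g. 'NFD' or 'NFC'
        value = unicodedata.normalize(normalize_unicode, value)
    # Split on any of the single-character separators.
    chunks = re.split("[" + re.escape("".join(separators)) + "]", value)
    # Drop empty chunks and missing-data markers.
    forms = [c.strip() for c in chunks
             if c.strip() and c.strip() not in missing_data]
    return forms[:1] if first_form_only else forms


def clean_form(chunk, brackets=None, strip_inside_brackets=True,
               replacements=(), normalize_whitespace=True):
    """Normalize a single form chunk (sketch of FormSpec.clean)."""
    brackets = {"(": ")"} if brackets is None else brackets
    # Replacements are applied before bracket content is stripped.
    for source, target in replacements:
        chunk = chunk.replace(source, target)
    if strip_inside_brackets:
        for open_, close in brackets.items():
            chunk = re.sub(re.escape(open_) + ".*?" + re.escape(close),
                           "", chunk)
    if normalize_whitespace:
        # Collapse runs of whitespace and strip the ends.
        chunk = re.sub(r"\s+", " ", chunk).strip()
    return chunk


print([clean_form(f) for f in split_forms("foo (variant); ?")])
```

With the defaults listed above, `"foo (variant); ?"` splits into one chunk (the `?` marker is dropped) and cleans to `foo`, with the bracketed variant stripped.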
4 changes: 2 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -18,9 +18,9 @@ This dataset was prepared from a collection of original field work data by Timot
## Statistics


[![Build Status](https://travis-ci.org/lingpy/bodtkhobwa/.svg?branch=master)](https://travis-ci.org/lingpy/bodtkhobwa/)
[![Build Status](https://travis-ci.org/lexibank/bodtkhobwa.svg?branch=master)](https://travis-ci.org/lexibank/bodtkhobwa)
![Glottolog: 100%](https://img.shields.io/badge/Glottolog-100%25-brightgreen.svg "Glottolog: 100%")
![Concepticon: 90%](https://img.shields.io/badge/Concepticon-90%25-yellowgreen.svg "Concepticon: 90%")
![Concepticon: 90%](https://img.shields.io/badge/Concepticon-90%25-green.svg "Concepticon: 90%")
![Source: 100%](https://img.shields.io/badge/Source-100%25-brightgreen.svg "Source: 100%")
![BIPA: 100%](https://img.shields.io/badge/BIPA-100%25-brightgreen.svg "BIPA: 100%")
![CLTS SoundClass: 100%](https://img.shields.io/badge/CLTS%20SoundClass-100%25-brightgreen.svg "CLTS SoundClass: 100%")
16 changes: 8 additions & 8 deletions cldf/cldf-metadata.json
@@ -11,37 +11,37 @@
"dc:related": null,
"dc:source": "sources.bib",
"dc:title": "Lexical Cognates in Western Kho-Bwa",
"dcat:accessURL": "https://github.com/lingpy/bodtkhobwa/",
"dcat:accessURL": "https://github.com/lexibank/bodtkhobwa",
"prov:wasDerivedFrom": [
{
"rdf:type": "prov:Entity",
"dc:title": "Repository",
"rdf:about": "https://github.com/lingpy/bodtkhobwa/",
"dc:created": "v1.1-11-ge8b1906"
"rdf:about": "https://github.com/lexibank/bodtkhobwa",
"dc:created": "v1.1-13-g288d281"
},
{
"rdf:type": "prov:Entity",
"dc:title": "Glottolog",
"rdf:about": "https://github.com/lingulist/glottolog-data",
"rdf:about": "https://github.com/glottolog/glottolog",
"dc:created": "v4.0"
},
{
"rdf:type": "prov:Entity",
"dc:title": "Concepticon",
"rdf:about": "https://github.com/LinguList/concepticon-data",
"dc:created": "v2.1.0-86-g556e0e1"
"rdf:about": "https://github.com/concepticon/concepticon-data",
"dc:created": "v2.2.0"
},
{
"rdf:type": "prov:Entity",
"dc:title": "CLTS",
"rdf:about": "https://github.com/cldf-clts/clts/",
"rdf:about": "https://github.com/cldf-clts/clts",
"dc:created": "v1.4"
}
],
"prov:wasGeneratedBy": [
{
"dc:title": "python",
"dc:description": "3.5.2"
"dc:description": "3.7.3"
},
{
"dc:title": "python-packages",
3,444 changes: 1,722 additions & 1,722 deletions cldf/cognates.csv

Large diffs are not rendered by default.

1,840 changes: 920 additions & 920 deletions cldf/forms.csv

Large diffs are not rendered by default.

16 changes: 8 additions & 8 deletions cldf/languages.csv
@@ -1,9 +1,9 @@
ID,Name,Glottocode,Glottolog_Name,ISO639P3code,Macroarea,Latitude,Longitude,Family,ChineseName,SubGroup
Duhumbi,Duhumbi,chug1252,,,,27.42,92.23,,,Western Kho-Bwa
Jerigaon,Jerigaon,jeri1243,,,,27.340629,92.486595,,,Western Kho-Bwa
Khispi,Khispi,lish1235,,,,27.37,92.23,,,Western Kho-Bwa
Khoina,Khoina,khoi1253,,,,27.334981,92.52994,,,Western Kho-Bwa
Khoitam,Khoitam,khoi1252,,,,27.327487,92.439455,,,Western Kho-Bwa
Rahung,Rahung,rahu1234,,,,27.310778,92.395028,,,Western Kho-Bwa
Rupa,Rupa,rupa1234,,,,27.203065,92.398757,,,Western Kho-Bwa
Shergaon,Shergaon,sher1261,,,,27.105018,92.272133,,,Western Kho-Bwa
Duhumbi,Duhumbi,chug1252,Chug,cvg,Eurasia,27.42,92.23,Sino-Tibetan,,Western Kho-Bwa
Jerigaon,Jerigaon,jeri1243,Jerigaon,,,27.340629,92.486595,Sino-Tibetan,,Western Kho-Bwa
Khispi,Khispi,lish1235,Lish,lsh,Eurasia,27.37,92.23,Sino-Tibetan,,Western Kho-Bwa
Khoina,Khoina,khoi1253,Khoina,,,27.334981,92.52994,Sino-Tibetan,,Western Kho-Bwa
Khoitam,Khoitam,khoi1252,Khoitam,,,27.327487,92.439455,Sino-Tibetan,,Western Kho-Bwa
Rahung,Rahung,rahu1234,Rahung,,,27.310778,92.395028,Sino-Tibetan,,Western Kho-Bwa
Rupa,Rupa,rupa1234,Rupa,,,27.203065,92.398757,Sino-Tibetan,,Western Kho-Bwa
Shergaon,Shergaon,sher1261,Shergaon,,,27.105018,92.272133,Sino-Tibetan,,Western Kho-Bwa
8 changes: 4 additions & 4 deletions cldf/parameters.csv
@@ -128,9 +128,9 @@ ID,Name,Concepticon_ID,Concepticon_Gloss
128_daybeforeyesterday,day before yesterday,1180,DAY BEFORE YESTERDAY
129_deaf,deaf,996,DEAF
130_deep,deep,1593,DEEP
131_deerbarking,"deer, barking",1936,DEER
131_deerbarking,"deer, barking",3152,MUNTJACS
132_deersambar,"deer, sambar",,
133_defeatvi,defeat (vi),782,DEFEAT
133_defeatvi,defeat (vi),866,WIN
134_deityghost,"deity, ghost",1944,GOD
135_descend,descend,2014,GO DOWN (DESCEND)
136_die,die,1494,DIE
@@ -173,7 +173,7 @@ ID,Name,Concepticon_ID,Concepticon_Gloss
174_female,female,1551,FEMALE
175_fence,fence,1690,FENCE
176_fernfiddlehead,"fern, fiddlehead",,
177_fetch,fetch,,
177_fetch,fetch,3551,FETCH
178_fewlittle,"few, little",1242,FEW
179_field,field,212,FIELD
180_finger,finger,1303,FINGER
@@ -383,7 +383,7 @@ ID,Name,Concepticon_ID,Concepticon_Gloss
384_rahungpeopleof,"Rahung, people of",,
385_rainn,rain (n),658,RAIN (PRECIPITATION)
386_rainbow,rainbow,1733,RAINBOW
387_reap,reap,1827,HARVEST CROPS
387_reap,reap,190,MOW
388_recognise,recognise,2248,KNOW (SOMEBODY)
389_red,red,156,RED
390_reflexivemarker,reflexive marker,,
74 changes: 35 additions & 39 deletions cldf/requirements.txt
@@ -1,49 +1,45 @@
appdirs==1.4.3
atomicwrites==1.2.1
certifi==2018.8.24
atomicwrites==1.3.0
certifi==2019.9.11
chardet==3.0.4
-e git+https://github.com/cldf/cldfbench.git@f373855e3b9cde029578e77c26136f0df26a82fa#egg=cldfbench
cldfbench==1.0.0
cldfcatalog==1.3.0
colorlog==3.1.4
configparser==3.5.0
-e git+https://github.com/cldf/csvw.git@bcd398856fdfe6408567cc02c7ff8b67ba1c8e38#egg=csvw
decorator==4.3.0
defusedxml==0.5.0
idna==2.7
clldutils==3.3.0
colorlog==4.0.2
csvw==1.6.0
decorator==4.4.1
idna==2.8
isodate==0.6.0
jdcal==1.4.1
-e git+https://github.com/lingpy/bodtkhobwa/@e80544c6a001e3ce30ef197463be03834e5a3704#egg=lexibank_bodtkhobwa
-e git+https://github.com/lexibank/castrosui.git@a2d13314e81682f6619c6af90a6ecf4bff401d00#egg=lexibank_castrosui
-e git+https://github.com/lexibank/lieberherrkhobwa.git@1985e1f912b2454cfcaa45611b488c95e31ff04c#egg=lexibank_lieberherrkhobwa
-e git+https://github.com/digling/sinotibetan-data.git@9488b83bc5cb4d443bf9f3b81c20711a95b772c6#egg=lexibank_sagartst
-e git+https://github.com/LinguList/lingpy.git@2a1671c1b65886e1d33eccd74818b29bc4ce73dd#egg=lingpy
lxml==4.2.4
Markdown==2.6.11
networkx==2.2
newick==0.9.2
numpy==1.15.1
openpyxl==2.6.4
pathlib2==2.3.2
pluggy==0.8.0
lingpy==2.6.5
Markdown==3.1.1
networkx==2.1
newick==1.0.0
numpy==1.17.4
openpyxl==3.0.0
packaging==19.2
pluggy==0.13.0
purl==1.5
py==1.7.0
py==1.8.0
pybtex==0.22.2
-e git+https://github.com/cldf-clts/pyclts@4842f1fd9613de6ef20a917dbc3bd723e8d0ffbb#egg=pyclts
-e git+https://github.com/concepticon/pyconcepticon.git@615a048b11cc6e5f8a3fe92619ad7790b23154db#egg=pyconcepticon
pycountry==18.12.8
-e git+https://github.com/clld/pyglottolog.git@0f24f24a46d1f510c975337e4c0d8c23b357c8bd#egg=pyglottolog
-e git+https://github.com/lexibank/pylexibank/@b2b24aafcbe618f08ef994d8e0ce2d29f601bff1#egg=pylexibank
pytest==4.0.0
regex==2018.8.29
requests==2.20.0
rfc3986==1.1.0
scipy==1.2.0
six==1.11.0
SQLAlchemy==1.3.0
tabulate==0.8.2
pycldf==1.8.2
pyclts==2.0.0
pyconcepticon==2.5.1
pycountry==19.8.18
pyglottolog==2.2.1
pylexibank==2.1.0
pytest==5.2.2
regex==2019.11.1
requests==2.22.0
rfc3986==1.3.2
segments==2.1.2
six==1.13.0
SQLAlchemy==1.3.10
tabulate==0.8.5
termcolor==1.1.0
tqdm==4.25.0
tqdm==4.38.0
uritemplate==3.0.0
urllib3==1.24.2
urllib3==1.25.6
wcwidth==0.1.7
xlrd==1.1.0
xlrd==1.2.0
zipp==0.6.0
131 changes: 97 additions & 34 deletions lexibank_bodtkhobwa.py
@@ -5,7 +5,7 @@
from pylexibank.dataset import Dataset as BaseDataset
from pylexibank import Cognate, Language

from pylexibank.util import pb
from pylexibank.util import progressbar

from lingpy import *
from clldutils.misc import slug
@@ -15,6 +15,7 @@
class CustomCognate(Cognate):
Segment_Slice = attr.ib(default=None)


@attr.s
class CustomLanguage(Language):
Latitude = attr.ib(default=None)
@@ -23,62 +24,124 @@ class CustomLanguage(Language):
SubGroup = attr.ib(default=None)
Family = attr.ib(default=None)


class Dataset(BaseDataset):
id = 'bodtkhobwa'
id = "bodtkhobwa"
dir = Path(__file__).parent
cognate_class = CustomCognate
language_class = CustomLanguage

def cmd_makecldf(self, args):

wl = Wordlist(
self.raw_dir.joinpath('bodt-khobwa-cleaned.tsv').as_posix(),
conf=self.raw_dir.joinpath('wordlist.rc').as_posix()
)
self.raw_dir.joinpath("bodt-khobwa-cleaned.tsv").as_posix(),
conf=self.raw_dir.joinpath("wordlist.rc").as_posix(),
)
args.writer.add_sources()

concept_lookup = {}
for concept in self.conceptlist.concepts.values():
idx = concept.id.split('-')[-1]+'_'+slug(concept.english)
for concept in self.conceptlists[0].concepts.values():
idx = concept.id.split("-")[-1] + "_" + slug(concept.english)
args.writer.add_concept(
ID=idx,
Name=concept.english,
Concepticon_ID=concept.concepticon_id,
Concepticon_Gloss=concept.concepticon_gloss
)
ID=idx,
Name=concept.english,
Concepticon_ID=concept.concepticon_id,
Concepticon_Gloss=concept.concepticon_gloss,
)
concept_lookup[concept.english] = idx

language_lookup = args.writer.add_languages(lookup_factory='Name')

#num = 580
#for concept in wl.rows:
args.writer.add_languages(lookup_factory="Name")

# num = 580
# for concept in wl.rows:
# if not concept in concepts:
# print('"{0}","{1}",,,'.format(num, concept))
# num += 1

mapper = { "pʰl": "pʰ l", "aw": "au", "ɛj": "ɛi", "ɔw": "ɔu", "bl": "b l", "aj": "ai", "ɔj": "ɔi", "(ŋ)": "ŋ", "kʰl": "kʰ l", "kl": "k l", "ej": "ei", "uj": "ui", "bɹ": "b ɹ", "ɐʰ": "ɐʰ/ɐ", "hw": "h w", "ɔːʰ": "ɔːʰ/ɔː", "dʑr": "dʑ r", "ow": "ou", "pl": "p l", "lj": "l j", "tʰj": "tʰ j", "aːʰ": "aːʰ/aː", "bj": "b j", "mp": "m p", "pɹ": "p ɹ", "ɐ̃ʰ": "ɐ̃ʰ/ɐ̃", "ɔ̃ʰ": "ɔ̃ʰ/ɔ̃", "aj~e/aj": "aj~e/ai", "aj~ej/ej": "aj~ej/ei", "kl": "k l", "kʰɹ": "kʰ ɹ", "ɛːʰ":"ɛːʰ/ɛː", "ɔʰ": "ɔʰ/ɔ", "tɹ": "t ɹ", "ɐːʰ": "ɐːʰ/ɐ", "br": "b r", "kɹ": "k ɹ", "kʰj": "kʰ j", "kʰr": "kʰ r", "gɹ": "g ɹ", "hj": "h j", "bl~gl/bl": "bl~gl/b l", "dj": "d j", "ej~i/ej": "ej~i/ei", "e~a/ej": "e~a/ei", "fl": "f l", "kʰw": "kʰ w", "mj": "m j", "pr": "p r", "pʰl~bl/pʰl": "pʰl~bl/pʰ l", "pʰr": "pʰ r", "pʰr~pʰl/pʰr": "pʰr~pʰl/pʰ r", "pʰw": "pʰ w", "pʰɹ": "pʰ ɹ", "tr": "t r", "tɕʰɹ": "tɕʰ ɹ", "tʰr": "tʰ r", "tʰw": "tʰ w", "dɾ": "d ɾ", "tɾ": "t ɾ", "zj": "z j", "ɔj~uj/uj": "ɔj~uj/ui", }
mapper = {
"pʰl": "pʰ l",
"aw": "au",
"ɛj": "ɛi",
"ɔw": "ɔu",
"bl": "b l",
"aj": "ai",
"ɔj": "ɔi",
"(ŋ)": "ŋ",
"kʰl": "kʰ l",
"ej": "ei",
"uj": "ui",
"bɹ": "b ɹ",
"ɐʰ": "ɐʰ/ɐ",
"hw": "h w",
"ɔːʰ": "ɔːʰ/ɔː",
"dʑr": "dʑ r",
"ow": "ou",
"pl": "p l",
"lj": "l j",
"tʰj": "tʰ j",
"aːʰ": "aːʰ/aː",
"bj": "b j",
"mp": "m p",
"pɹ": "p ɹ",
"ɐ̃ʰ": "ɐ̃ʰ/ɐ̃",
"ɔ̃ʰ": "ɔ̃ʰ/ɔ̃",
"aj~e/aj": "aj~e/ai",
"aj~ej/ej": "aj~ej/ei",
"kl": "k l",
"kʰɹ": "kʰ ɹ",
"ɛːʰ": "ɛːʰ/ɛː",
"ɔʰ": "ɔʰ/ɔ",
"tɹ": "t ɹ",
"ɐːʰ": "ɐːʰ/ɐ",
"br": "b r",
"kɹ": "k ɹ",
"kʰj": "kʰ j",
"kʰr": "kʰ r",
"gɹ": "g ɹ",
"hj": "h j",
"bl~gl/bl": "bl~gl/b l",
"dj": "d j",
"ej~i/ej": "ej~i/ei",
"e~a/ej": "e~a/ei",
"fl": "f l",
"kʰw": "kʰ w",
"mj": "m j",
"pr": "p r",
"pʰl~bl/pʰl": "pʰl~bl/pʰ l",
"pʰr": "pʰ r",
"pʰr~pʰl/pʰr": "pʰr~pʰl/pʰ r",
"pʰw": "pʰ w",
"pʰɹ": "pʰ ɹ",
"tr": "t r",
"tɕʰɹ": "tɕʰ ɹ",
"tʰr": "tʰ r",
"tʰw": "tʰ w",
"dɾ": "d ɾ",
"tɾ": "t ɾ",
"zj": "z j",
"ɔj~uj/uj": "ɔj~uj/ui",
}

# add data to cldf
args.writer['FormTable', 'Segments'].separator = ' + '
args.writer['FormTable', 'Segments'].datatype = Datatype.fromvalue({
"base": "string",
"format": "([\\S]+)( [\\S]+)*"
})
for idx in pb(wl, desc='cldfify'):
segments = ' '.join([mapper.get(x, x) for x in wl[idx, 'tokens']])
morphemes = segments.split(' + ')
concept = concept_lookup.get(wl[idx, 'concept'], '')
args.writer["FormTable", "Segments"].separator = " + "
args.writer["FormTable", "Segments"].datatype = Datatype.fromvalue(
{"base": "string", "format": "([\\S]+)( [\\S]+)*"}
)
for idx in progressbar(wl, desc="cldfify"):
segments = " ".join([mapper.get(x, x) for x in wl[idx, "tokens"]])
morphemes = segments.split(" + ")
concept = concept_lookup.get(wl[idx, "concept"], "")
lex = args.writer.add_form_with_segments(
Language_ID=wl[idx, 'doculect'],
Language_ID=wl[idx, "doculect"],
Parameter_ID=concept,
Value=wl[idx, 'form'],
Form=wl[idx, 'form'],
Value=wl[idx, "form"],
Form=wl[idx, "form"],
Segments=morphemes,
Source=['Bodt2019'],
Source=["Bodt2019"],
)
for morpheme_index, cogid in enumerate(wl[idx, 'crossids']):
alignment = wl[idx, 'alignment'].split(' + ')[morpheme_index].split()
alignment = ' '.join([mapper.get(x, x) for x in alignment]).split()
for morpheme_index, cogid in enumerate(wl[idx, "crossids"]):
alignment = wl[idx, "alignment"].split(" + ")[morpheme_index].split()
alignment = " ".join([mapper.get(x, x) for x in alignment]).split()
if int(cogid):
args.writer.add_cognate(
lexeme=lex,