This code base can be installed on all operating systems and supports Python versions >= 3.7. To install, run:
pip install -e .[tests]
For a zsh shell, which is the default shell on newer Macs, you will need to escape the brackets with \:
pip install -e .\[tests\]
This code base uses isort, flake8 and mypy to ensure that the format of the code is consistent and contains type hints. The flake8 settings can be found in setup.cfg and the mypy settings within pyproject.toml. To run these linters:
isort pymusas_models model_function_tests model_creation_tests model_release.py
flake8
mypy
- /pymusas_models - contains the code that creates all of the PyMUSAS models.
- /model_release.py - releases the models, that have been created locally, to GitHub as a GitHub release per model.
- /model_creation_tests
- /model_function_tests - the tests are divided up by language, using each language's BCP 47 language code, and then by model (currently we only have one model, the rule based tagger).
  - /model_function_tests/fr
    - /model_function_tests/fr/test_rule_based_tagger.py
  - /model_function_tests/it
    - /model_function_tests/it/test_rule_based_tagger.py
  - other language codes
- /model_creation_tests/test_create_and_install_models.py - creates and installs the models used within tests and in doing so tests that this part of the code base works. Note that we install the models to a temporary Python virtual environment.
The testing structure of /model_function_tests has been heavily influenced by how spaCy tests their models.
Each model is created using a spaCy configuration and meta file, and there can be more than one model for each language. These configuration and meta files are automatically created using the Command Line Interface (CLI) to pymusas_models and then used to create the PyMUSAS spaCy models with their relevant installation data (distribution files, README, meta data, etc). This process is done per model, and all of a model's data is stored in its own model folder, named according to the model naming convention specified in the main README, within the directory you have specified to store this data in. Each of these spaCy model folders contains the relevant information to create a GitHub release for that model.
The CLI knows which models to create and how to create them by using the meta data stored in the language_resources.json file (for more information on the language_resources.json file see the Language Resource Meta Data section).
To create all of the models and store them in the folder ./models, run the following:
python pymusas_models/__main__.py create-models --models-directory ./models --language-resource-file ./language_resources.json
This will create the following folders:
./models/cmn_dual_upos2usas_contextual-0.3.0
./models/cmn_single_upos2usas_contextual-0.3.0
./models/cy_dual_basiccorcencc2usas_contextual-0.3.0
- other model folders
To automate the release of the models we have created, we are going to use the GitHub REST API. This REST API has a rate limit of 5000 calls per hour when you are running it as an authenticated client; see the GitHub REST API documentation for details on authentication. As we are creating releases, we need a Personal Access Token (PAT) with the public_repo scope for authentication. A PAT can be created from your GitHub account settings; we named our PAT pymusas-models.
Once you have created your PAT, add it to the file GITHUB_TOKEN.json. This file should never be added to the repository as it will contain your PAT, which is sensitive information. The PAT should be added to the JSON file like so:
{"PAT": "YOUR PAT TOKEN"}
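As an illustration of how a script could consume this file, here is a minimal sketch that reads the PAT and builds the headers for an authenticated GitHub REST API request. The helper names load_pat and github_headers are hypothetical and are not taken from model_release.py, which may do this differently.

```python
import json
from pathlib import Path
from typing import Dict


def load_pat(token_file: Path) -> str:
    """Read the PAT from a JSON file of the form {"PAT": "..."}."""
    with token_file.open("r", encoding="utf-8") as fp:
        data = json.load(fp)
    if "PAT" not in data:
        raise KeyError(f"No 'PAT' key found in {token_file}")
    return data["PAT"]


def github_headers(pat: str) -> Dict[str, str]:
    """Build request headers that authenticate with the given PAT.

    Hypothetical helper for illustration only.
    """
    return {
        "Accept": "application/vnd.github+json",
        "Authorization": f"token {pat}",
    }
```

For example, github_headers(load_pat(Path("GITHUB_TOKEN.json"))) returns a dictionary that can be passed as the headers of a request made with an HTTP client of your choice.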
Now, assuming all of the models you would like to release are in the ./models directory, we can release the models to GitHub using the model_release.py script like so:
python model_release.py
Once it has run successfully, it will state the rate limit you had before and have left on the GitHub REST API, like below:
Current rate limit: {'limit': 5000, 'used': 74, 'remaining': 4926, 'reset': 1652299539}
Rate limit after model releases: {'limit': 5000, 'used': 76, 'remaining': 4924, 'reset': 1652299539}
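The reset value in this output is a Unix timestamp (seconds since the epoch, UTC). The sketch below shows one way to turn the two dictionaries into a readable summary; summarise_rate_limit is a hypothetical helper and is not part of model_release.py.

```python
from datetime import datetime, timezone
from typing import Dict


def summarise_rate_limit(before: Dict[str, int], after: Dict[str, int]) -> str:
    """Summarise how many API calls were made and when the limit resets."""
    # Only meaningful if the limit did not reset between the two readings.
    calls_made = after["used"] - before["used"]
    reset_time = datetime.fromtimestamp(after["reset"], tz=timezone.utc)
    return (f"{calls_made} call(s) made, {after['remaining']} remaining, "
            f"limit resets at {reset_time.isoformat()}")


# The two dictionaries printed in the example output above.
before = {'limit': 5000, 'used': 74, 'remaining': 4926, 'reset': 1652299539}
after = {'limit': 5000, 'used': 76, 'remaining': 4924, 'reset': 1652299539}
print(summarise_rate_limit(before, after))
```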
In addition you should see the models you wanted to release to GitHub now on GitHub within the releases section.
Some errors that can occur when running the model_release.py script:
- The model you want to release has already been released. If this occurs and is a mistake then delete the model from the ./models folder. If this is not a mistake then you may need to change the model version of the model (the c element as described in the Model Versioning section of the main README), as each model that is released has to have a unique model name.
- The model did not upload correctly.
Once you have corrected the error re-run the model_release.py script.
If you want to specify the version of the model, i.e. the c part of the model version as described in the Model Versioning section of the main README, use the --model-version command line option (default value "0").
In addition, to specify the version of spaCy that the model will be compatible with, use the --spacy-version command line option (default value ">=3.0,<4.0"). This spaCy version is overridden per language if the language resource file for a given language specifies a spacy version. See python pymusas_models/__main__.py create-models --help for more details.
Below we show how both of these command line options can be used:
python pymusas_models/__main__.py create-models \
--models-directory ./models \
--language-resource-file ./language_resources.json \
--model-version 1 \
--spacy-version ">=3.0,<4.0"
This will create the following folders, assuming we are using PyMUSAS version 0.3.0:
./models/cmn_dual_upos2usas_contextual-0.3.1
./models/cmn_single_upos2usas_contextual-0.3.1
./models/cy_dual_basiccorcencc2usas_contextual-0.3.1
- other model folders
All of these models will enforce a spaCy version of >=3.0,<4.0.
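Judging from the folder names above, the first two parts of the model version appear to follow the PyMUSAS version and the last part comes from --model-version. The sketch below illustrates that reading only; model_version is a hypothetical helper, and the authoritative rules are in the Model Versioning section of the main README.

```python
def model_version(pymusas_version: str, model_version_c: str = "0") -> str:
    """Combine the PyMUSAS major.minor version with the model's c element.

    Assumption for illustration: inferred from the example folder names,
    not taken from the pymusas_models source code.
    """
    major, minor = pymusas_version.split(".")[:2]
    return f"{major}.{minor}.{model_version_c}"


print(model_version("0.3.0"))       # default --model-version of "0"
print(model_version("0.3.0", "1"))  # --model-version 1
```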
To create the overview of the models table from the main README:
- If you have not already done so create all of the models (if you have done this please skip this step):
python pymusas_models/__main__.py create-models --models-directory ./models --language-resource-file ./language_resources.json
- Run the following which will print out the Markdown overview of the models table, which can then be copied into the main README:
python pymusas_models/__main__.py overview-of-models --models-directory ./models
The tests cover two things:
- That the models can be created and installed via pip locally.
- That, once created and installed, the models function as expected.
This has resulted in two test folders, as shown in General Folder Structure: /model_function_tests and /model_creation_tests. The /model_creation_tests folder tests the first bullet point and /model_function_tests tests the second bullet point.
As the /model_function_tests require the installed models that are created from /model_creation_tests, the /model_creation_tests tests are run first, whereby the models created will be installed to a virtual environment that will be saved to ./temp_venv.
NOTE: ./temp_venv is assumed to not exist. An error will occur if the directory does exist, unless you specify the --overwrite flag, which will first delete the directory if it exists and then re-create it.
Linux/Mac
pytest --virtual-env-directory=./temp_venv ./model_creation_tests
Using the overwrite flag, which will first delete ./temp_venv if it exists:
pytest --virtual-env-directory=./temp_venv --overwrite ./model_creation_tests
This last command can be run as a make command:
make model-creation-tests
Note for Mac users: I have found that make might not work if using the make command version that comes as default with your Mac (version 3.81), but the make command you can install through Conda (version 4.2.1) will work.
Windows
pytest --virtual-env-directory=.\temp_venv .\model_creation_tests
Using the overwrite flag, which will first delete .\temp_venv if it exists:
pytest --virtual-env-directory=.\temp_venv --overwrite .\model_creation_tests
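The --overwrite behaviour described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used by the test suite; prepare_venv_directory is a hypothetical name.

```python
import shutil
from pathlib import Path


def prepare_venv_directory(directory: Path, overwrite: bool = False) -> Path:
    """Create an empty directory for the virtual environment.

    Raises FileExistsError if the directory exists and overwrite is False.
    With overwrite=True an existing directory is deleted and then re-created.
    """
    if directory.exists():
        if not overwrite:
            raise FileExistsError(
                f"{directory} already exists, use the overwrite flag to replace it"
            )
        shutil.rmtree(directory)
    directory.mkdir(parents=True)
    return directory
```

Failing by default avoids silently deleting a directory that the user may have created for another purpose.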
Separating these tests into two different test folders allows the virtual environment to be cached, which allows the second set of tests, /model_function_tests, to be run as many times as you like without having to re-create the virtual environment.
Linux/Mac
source ./temp_venv/venv/bin/activate # Used to activate the virtual environment
pytest ./model_function_tests
deactivate
Windows
.\temp_venv\venv\Scripts\Activate.ps1 # Used to activate the virtual environment
pytest .\model_function_tests
deactivate
Linux/Mac
There is a make command that will run all tests:
make run-all-tests
Note for Mac users: I have found that make might not work if using the make command version that comes as default with your Mac (version 3.81), but the make command you can install through Conda (version 4.2.1) will work.
Windows
To run all tests:
pytest --virtual-env-directory=.\temp_venv --overwrite .\model_creation_tests
.\temp_venv\venv\Scripts\Activate.ps1 # Used to activate the virtual environment
pytest .\model_function_tests
deactivate
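If you would rather not maintain separate commands per platform, the two steps can be wrapped in a small Python script. The sketch below is not part of this repository: the function names are hypothetical, it assumes the virtual environment lives in a venv sub-folder of the chosen directory (as the activation paths above suggest), and it assumes pytest is installed inside that environment. Instead of activating the environment, it calls the environment's own Python interpreter.

```python
import os
import subprocess
from pathlib import Path
from typing import List


def venv_python(venv_directory: Path) -> Path:
    """Path to the Python interpreter inside <venv_directory>/venv."""
    if os.name == "nt":  # Windows
        return venv_directory / "venv" / "Scripts" / "python.exe"
    return venv_directory / "venv" / "bin" / "python"


def build_test_commands(venv_directory: Path) -> List[List[str]]:
    """Return the model creation and model function test commands, in order."""
    creation = ["pytest", f"--virtual-env-directory={venv_directory}",
                "--overwrite", "model_creation_tests"]
    # Running pytest through the virtual environment's interpreter stands in
    # for activating the environment first.
    function = [str(venv_python(venv_directory)), "-m", "pytest",
                "model_function_tests"]
    return [creation, function]


def run_all_tests(venv_directory: Path = Path("temp_venv")) -> None:
    """Run both test folders, stopping at the first failure."""
    for command in build_test_commands(venv_directory):
        subprocess.run(command, check=True)
```

Calling run_all_tests() from the root of the repository runs the model creation tests first and then the model function tests.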
Language resource meta data is stored in the language_resources.json file. It is used by the entry points to the main package, pymusas_models, to create the models. The structure of the JSON file is the following:
{
    "Language one BCP 47 code": {
        "resources": [
            {
                "data type": "single",
                "url": "PERMANENT URL TO RESOURCE"
            },
            {
                "data type": "mwe",
                "url": "PERMANENT URL TO RESOURCE"
            }
        ],
        "model information": {
            "POS mapper": "POS TAGSET",
            "spacy version": "VERSION OF SPACY"
        },
        "language data": {
            "description": "LANGUAGE NAME",
            "macrolanguage": "Macrolanguage code",
            "script": "ISO 15924 script code"
        }
    },
    "Language Two BCP 47 code": {
        "resources": [
            {
                "data type": "single",
                "url": "PERMANENT URL TO RESOURCE"
            }
        ],
        "model information": {
            "POS mapper": null
        },
        "language data": {
            "description": "LANGUAGE NAME",
            "macrolanguage": "Macrolanguage code",
            "script": "ISO 15924 script code"
        }
    },
    ...
}
- The BCP 47 code of the language. The BCP47 language subtag lookup tool is a great tool to use to find a BCP 47 code for a language.
  - resources - this is a list of resource files that are associated with the given language. There is no limit on the number of resource files associated with a language.
    - data type - value can be 1 of 2 values:
      - single - The url value has to be of the single word lexicon file format.
      - mwe - The url value has to be of the Multi Word Expression lexicon file format.
    - url - permanent URL link to the associated resource.
  - model information - this is data that helps to create the model given the resources and the assumed NLP models, e.g. POS tagger, that will be used with the PyMUSAS model.
    - POS mapper - A mapper that maps from the POS tagset of the tagged text to the POS tagset used in the lexicons. The mappers used are those from within the PyMUSAS mappers module. We currently assume that each resource associated with the model uses the same POS tagset in the lexicon; this is a limitation of this model creation framework rather than the PyMUSAS package itself.
    - spacy version - Optional. This key is only required if the version of spaCy required has to be more specific than the version specified by PyMUSAS. The value is the version of spaCy required; it should be a string and follow the standard Python pip install syntax of spaCy followed by a version specifier, e.g. spacy>=3.3.
  - language data - this is data that is associated with the BCP 47 language code. To some degree this is redundant as we can look this data up through the BCP 47 code, however we thought it is better to have it in the meta data for easy lookup. All of this data can be easily found by looking up the BCP 47 language code in the BCP47 language subtag lookup tool.
    - description - The description of the language code.
    - macrolanguage - The macrolanguage tag. Note that if this does not exist then give the primary language tag, which could be the same as the whole BCP 47 code. The macrolanguage tag could be useful in future for grouping languages.
    - script - The ISO 15924 script code of the language code. The BCP 47 code by default does not always include the script of the language as the default script for that language is assumed, therefore this data is here to make the default more explicit.
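The structure described above can be checked with a few lines of Python before running the CLI. This is a minimal sketch, not the validation performed by pymusas_models: validate_language_entry is a hypothetical helper, and treating the POS mapper key as mandatory is an assumption based on both template entries containing it.

```python
from typing import Any, Dict, List

VALID_DATA_TYPES = {"single", "mwe"}
LANGUAGE_DATA_KEYS = {"description", "macrolanguage", "script"}


def validate_language_entry(code: str, entry: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one language's meta data."""
    problems: List[str] = []
    if "resources" not in entry:
        problems.append(f"{code}: missing 'resources'")
    for index, resource in enumerate(entry.get("resources", [])):
        if resource.get("data type") not in VALID_DATA_TYPES:
            problems.append(f"{code}: resource {index} has an invalid 'data type'")
        if not resource.get("url"):
            problems.append(f"{code}: resource {index} has no 'url'")
    # Assumption: the key must be present, although its value may be null.
    if "POS mapper" not in entry.get("model information", {}):
        problems.append(f"{code}: missing 'POS mapper'")
    missing = LANGUAGE_DATA_KEYS - set(entry.get("language data", {}))
    for key in sorted(missing):
        problems.append(f"{code}: 'language data' is missing '{key}'")
    return problems
```

An empty list means the entry matches the structure; each string otherwise names the language code and the key that needs attention.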
Below is an extract of ./language_resources.json, given as an example of this JSON structure:
{
    "cmn": {
        "resources": [
            {
                "data type": "single",
                "url": "https://raw.githubusercontent.com/UCREL/Multilingual-USAS/69477221c3feaf8ab2c2033abf430e5c4ae1d5ce/Chinese/semantic_lexicon_chi.tsv"
            },
            {
                "data type": "mwe",
                "url": "https://raw.githubusercontent.com/UCREL/Multilingual-USAS/69477221c3feaf8ab2c2033abf430e5c4ae1d5ce/Chinese/mwe-chi.tsv"
            }
        ],
        "model information": {
            "POS mapper": "UPOS"
        },
        "language data": {
            "description": "Mandarin Chinese",
            "macrolanguage": "zh",
            "script": "Hani"
        }
    },
    "fi": {
        "resources": [
            {
                "data type": "single",
                "url": "https://raw.githubusercontent.com/UCREL/Multilingual-USAS/9b3e7920e7b8e997ec36ca02410cd4f57f5a8835/Finnish/pos_mapped_semantic_lexicon_fin.tsv"
            }
        ],
        "model information": {
            "POS mapper": "UPOS",
            "spacy version": ">=3.3,<4.0"
        },
        "language data": {
            "description": "Finnish",
            "macrolanguage": "fi",
            "script": "Latn"
        }
    },
    ...
}
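In this extract only fi sets a spacy version, so cmn would fall back to the value of the --spacy-version command line option. The sketch below illustrates that override rule using a trimmed copy of the extract; spacy_version_for is a hypothetical helper and the real logic in pymusas_models may differ.

```python
import json
from typing import Any, Dict

DEFAULT_SPACY_VERSION = ">=3.0,<4.0"  # default of --spacy-version

# Trimmed copy of the extract above, keeping only "model information".
EXTRACT: Dict[str, Any] = json.loads("""
{
    "cmn": {"model information": {"POS mapper": "UPOS"}},
    "fi": {"model information": {"POS mapper": "UPOS",
                                 "spacy version": ">=3.3,<4.0"}}
}
""")


def spacy_version_for(language_entry: Dict[str, Any],
                      cli_spacy_version: str = DEFAULT_SPACY_VERSION) -> str:
    """Use the language's own spacy version if given, else the CLI value."""
    model_information = language_entry.get("model information", {})
    return model_information.get("spacy version", cli_spacy_version)


for code, entry in EXTRACT.items():
    print(code, spacy_version_for(entry))
```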