What's new
Added 🎉
- Roadmap added.
- Defined the MWE template and its syntax; this is documented in Notes -> Multi Word Expression Syntax in the Usage section of the documentation. This is the first task of issue #24.
- PEP 561 (Distributing and Packaging Type Information) compatible by adding a `py.typed` file.
- Added srsly as a pip requirement. We use srsly to serialise components to bytes, for example the `pymusas.lexicon_collection.LexiconCollection.to_bytes` function uses `srsly` to serialise the `LexiconCollection` to `bytes`.
- An abstract class, `pymusas.base.Serialise`, that requires sub-classes to create two methods, `to_bytes` and `from_bytes`, so that the class can be serialised.
- `pymusas.lexicon_collection.LexiconCollection` has three new methods: `to_bytes`, `from_bytes`, and `__eq__`. This allows the collection to be serialised and to be compared to other collections.
- A Lexicon Collection class for Multi Word Expressions (MWE), `pymusas.lexicon_collection.MWELexiconCollection`, which allows a user to easily create and/or load from a TSV file an MWE lexicon, like the MWE lexicons from the Multilingual USAS repository. In addition, it can match an MWE template to the templates stored in the `MWELexiconCollection` class following the MWE special syntax rules; this is all done through the `mwe_match` method. It also supports Part Of Speech mapping so that you can map from the lexicon's POS tagset to the tagset of your choice, in both a one-to-one and one-to-many mapping. Like `pymusas.lexicon_collection.LexiconCollection`, it contains `to_bytes`, `from_bytes`, and `__eq__` methods for serialisation and comparisons.
- The rule based taggers have now been componentised so that they are based on a `List` of `Rule`s and a `Ranker`, whereby each `Rule` defines how one or more tokens in a text can be matched to a semantic category. Given the matches from the `Rule`s for each token (a token can have zero or more matches), the `Ranker` ranks each match and finds the global best match for each token in the text. The taggers now support direct match and wildcard Multi Word Expressions. Due to this:
  - `pymusas.taggers.rule_based.USASRuleBasedTagger` has been changed and renamed to `pymusas.taggers.rule_based.RuleBasedTagger` and now only has a `__call__` method.
  - `pymusas.spacy_api.taggers.rule_based.USASRuleBasedTagger` has been changed and renamed to `pymusas.spacy_api.taggers.rule_based.RuleBasedTagger`.
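To illustrate the wildcard Multi Word Expression matching mentioned above, here is a minimal sketch; it is not the PyMUSAS implementation. It assumes that a template is a space-separated sequence of `word_POS` pairs and that `*` stands for zero or more characters that are neither whitespace nor an underscore. `template_to_regex` and `match_templates` are hypothetical helpers, not part of the PyMUSAS API.

```python
import re
from typing import List


def template_to_regex(mwe_template: str) -> "re.Pattern[str]":
    # Hypothetical helper: escape the whole template, then turn the escaped
    # wildcard back into "zero or more characters that are not whitespace
    # or an underscore", so a wildcard cannot run past a word or POS boundary.
    escaped = re.escape(mwe_template)
    return re.compile(escaped.replace(r"\*", r"[^\s_]*"))


def match_templates(candidate: str, templates: List[str]) -> List[str]:
    # Return every stored template that matches the whole candidate string.
    return [
        template
        for template in templates
        if template_to_regex(template).fullmatch(candidate)
    ]


stored_templates = ["ski_noun boot*_noun", "*_noun boot*_noun", "ski_noun *_adj"]
matched = match_templates("ski_noun boots_noun", stored_templates)
```

In this sketch `matched` holds the first two templates only; the third is rejected because its POS tag differs from the candidate's.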
- A Rule system, of which all rules can be found in `pymusas.taggers.rules`:
  - `pymusas.taggers.rules.rule.Rule`, an abstract class that describes how other sub-classes define the `__call__` method and its signature. This abstract class is sub-classed from `pymusas.base.Serialise`.
  - `pymusas.taggers.rules.single_word.SingleWordRule`, a concrete sub-class of `Rule` for finding single word lexicon entry matches.
  - `pymusas.taggers.rules.mwe.MWERule`, a concrete sub-class of `Rule` for finding Multi Word Expression entry matches.
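The `to_bytes` / `from_bytes` contract that these classes inherit can be sketched as follows. This is an illustration under stated assumptions, not the PyMUSAS code: it uses the standard library `json` module in place of srsly so that it has no dependencies, and `Serialisable` and `ToyLexicon` are hypothetical names.

```python
import json
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class Serialisable(ABC):
    # Hypothetical stand-in for an abstract serialisation base class.

    @abstractmethod
    def to_bytes(self) -> bytes:
        ...

    @staticmethod
    @abstractmethod
    def from_bytes(bytes_data: bytes) -> "Serialisable":
        ...


class ToyLexicon(Serialisable):
    # Hypothetical collection mapping a lexicon key to its semantic tags.

    def __init__(self, data: Optional[Dict[str, List[str]]] = None) -> None:
        self.data: Dict[str, List[str]] = dict(data or {})

    def to_bytes(self) -> bytes:
        # json is an assumption made for this sketch, swapped in for srsly.
        return json.dumps(self.data, sort_keys=True).encode("utf-8")

    @staticmethod
    def from_bytes(bytes_data: bytes) -> "ToyLexicon":
        return ToyLexicon(json.loads(bytes_data.decode("utf-8")))

    def __eq__(self, other: object) -> bool:
        # Two collections are equal when their underlying data is equal.
        if not isinstance(other, ToyLexicon):
            return NotImplemented
        return self.data == other.data


lexicon = ToyLexicon({"London|noun": ["Z2"]})
restored = ToyLexicon.from_bytes(lexicon.to_bytes())
```

Having `__eq__` alongside the two serialisation methods makes a round trip easy to check: `restored == lexicon` holds here.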
- A Ranking system, of which all of the components that are linked to ranking can be found in `pymusas.rankers`:
  - `pymusas.rankers.ranking_meta_data.RankingMetaData` describes a lexicon entry match; these are typically generated by calling `pymusas.taggers.rules.rule.Rule` classes. These matches indicate that some part of a text, one or more tokens, matches a lexicon entry, whether that is a Multi Word Expression or single word lexicon entry.
  - `pymusas.rankers.lexicon_entry.LexiconEntryRanker`, an abstract class that describes how other sub-classes should rank each token in the text and the expected output through the class's `__call__` method. This abstract class is sub-classed from `pymusas.base.Serialise`.
  - `pymusas.rankers.lexicon_entry.ContextualRuleBasedRanker`, a concrete sub-class of `LexiconEntryRanker` based on the ranking rules from Piao et al. 2003.
  - `pymusas.rankers.lexical_match.LexicalMatch` describes the lexical match within a `pymusas.rankers.ranking_meta_data.RankingMetaData` object.
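The split between rules that propose matches and a ranker that picks among them can be sketched with a toy example. Everything here is hypothetical and much simpler than PyMUSAS: `Match`, `lookup_rule`, and `tag_tokens` are illustrative names, the tags are made up, and the ranking is a greedy "longest match first" heuristic rather than the rules from Piao et al. 2003.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Match:
    start: int  # index of the first token covered by the match
    end: int    # index one past the last token covered by the match
    tag: str    # made-up semantic category assigned by the match


# A rule takes the tokens of a text and returns every match it can find.
Rule = Callable[[List[str]], List[Match]]


def lookup_rule(lexicon: Dict[Tuple[str, ...], str]) -> Rule:
    # Build a rule that matches exact token sequences found in `lexicon`.
    def rule(tokens: List[str]) -> List[Match]:
        matches: List[Match] = []
        for entry, tag in lexicon.items():
            size = len(entry)
            for start in range(len(tokens) - size + 1):
                if tuple(tokens[start:start + size]) == entry:
                    matches.append(Match(start, start + size, tag))
        return matches
    return rule


def tag_tokens(tokens: List[str], rules: List[Rule],
               default_tag: str = "UNMATCHED") -> List[str]:
    candidates = [match for rule in rules for match in rule(tokens)]
    # Toy ranking: longer matches first; the stable sort keeps rule order on ties.
    candidates.sort(key=lambda match: match.end - match.start, reverse=True)
    tags: List[Optional[str]] = [None] * len(tokens)
    for match in candidates:
        covered = range(match.start, match.end)
        if all(tags[index] is None for index in covered):
            for index in covered:
                tags[index] = match.tag
    return [tag if tag is not None else default_tag for tag in tags]


single_word = lookup_rule({("ski",): "SKI", ("boots",): "BOOTS", ("warm",): "WARM"})
multi_word = lookup_rule({("ski", "boots"): "SKI_BOOTS"})
tagged = tag_tokens(["ski", "boots", "are", "warm"], [single_word, multi_word])
```

Because rules only propose matches, a new kind of rule can be added without touching the ranking step.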
- `pymusas.utils.unique_pos_tags_in_lexicon_entry`, a function that, given a lexicon entry, either Multi Word Expression or single word, returns a `Set[str]` of unique POS tags in the lexicon entry.
- `pymusas.utils.token_pos_tags_in_lexicon_entry`, a function that, given a lexicon entry, either Multi Word Expression or single word, yields a `Tuple[str, str]` of word and POS tag from the lexicon entry.
- A mapping from USAS core to Universal Part Of Speech (UPOS) tagset.
- A mapping from USAS core to basic CorCenCC POS tagset.
- A mapping from USAS core to Penn Chinese Treebank POS tagset.
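A one-to-many POS mapping of the kind these tagset mappings allow can be sketched as below. This is an illustration only: `map_entry_pos` is a hypothetical helper, the entry is assumed to have the form `lemma_pos`, and the mapper contents are made up rather than taken from the PyMUSAS mappings.

```python
from typing import Dict, List


def map_entry_pos(entry: str, pos_mapper: Dict[str, List[str]]) -> List[str]:
    # Split on the last underscore, assuming the entry looks like `lemma_pos`.
    lemma, _, pos = entry.rpartition("_")
    # A POS tag with no mapping is kept unchanged.
    return [f"{lemma}_{mapped}" for mapped in pos_mapper.get(pos, [pos])]


# Made-up mapper for illustration; not the real USAS core to UPOS mapping.
toy_mapper = {"noun": ["NOUN", "PROPN"], "det": ["DET"]}
expanded = map_entry_pos("car_noun", toy_mapper)
```

With a one-to-many mapping a single lexicon entry expands to several entries, one per target tag, so `expanded` contains both `car_NOUN` and `car_PROPN`.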
- `pymusas.lexicon_collection.LexiconMetaData`, an object that contains all of the meta data about a single word or Multi Word Expression lexicon entry.
- `pymusas.lexicon_collection.LexiconType`, which describes the different types of single word and Multi Word Expression (MWE) lexicon entries and templates that PyMUSAS uses, or will use in the case of curly braces.
- The usage documentation for "How-to Tag Text" has been updated so that it includes an Indonesian example that uses the Indonesian TreeTagger instead of spaCy.
- spaCy registered functions for reading in a `LexiconCollection` or `MWELexiconCollection` from a TSV. These can be found in `pymusas.spacy_api.lexicon_collection`.
- spaCy registered functions for creating `SingleWordRule` and `MWERule`. These can be found in `pymusas.spacy_api.taggers.rules`.
- spaCy registered function for creating `ContextualRuleBasedRanker`. This can be found in `pymusas.spacy_api.rankers`.
- spaCy registered function for creating a `List` of `Rule`s. This can be found at `pymusas.spacy_api.taggers.rules.rule_list`.
- `LexiconCollection` and `MWELexiconCollection` open the TSV file downloaded through the `from_tsv` method using `utf-8` encoding by default.
- `pymusas_rule_based_tagger` is now a spaCy registered factory by using an entry point.
- `MWELexiconCollection` warns users that it does not support curly braces MWE template expressions.
- All of the POS mappings can now be called through a spaCy registered function; all of these functions can be found in the `pymusas.spacy_api.pos_mapper` module.
- Updated the `Introduction` and `How-to Tag Text` usage documentation with the new features that PyMUSAS now supports, e.g. MWEs. Also, the `How-to Tag Text` page has been updated so that it uses the pre-configured spaCy components that have been created for each language; these spaCy components can be found and downloaded from the pymusas-models repository.
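The reason for opening TSV files with an explicit `utf-8` encoding, rather than the operating system default, can be seen in a small sketch. This is not the `from_tsv` implementation: `read_lexicon_tsv` is a hypothetical helper, and the column names `lemma`, `pos`, and `semantic_tags` as well as the `lemma|pos` key format are assumptions made for the example.

```python
import csv
from pathlib import Path
from typing import Dict, List


def read_lexicon_tsv(tsv_path: Path) -> Dict[str, List[str]]:
    # Passing encoding="utf-8" makes the result independent of the OS default
    # encoding, which differs between platforms (for example on Windows).
    lexicon: Dict[str, List[str]] = {}
    with tsv_path.open("r", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            # Assumed key format for this sketch: lemma and POS joined by "|".
            key = f"{row['lemma']}|{row['pos']}"
            lexicon[key] = row["semantic_tags"].split()
    return lexicon
```

Without the explicit encoding, a lexicon containing non-ASCII lemmas could be decoded differently, or fail to load, depending on the machine it is read on.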
Removed 🗑
- `pymusas.taggers.rule_based.USASRuleBasedTagger`; this is now replaced with `pymusas.taggers.rule_based.RuleBasedTagger`.
- `pymusas.spacy_api.taggers.rule_based.USASRuleBasedTagger`; this is now replaced with `pymusas.spacy_api.taggers.rule_based.RuleBasedTagger`.
- `Using PyMUSAS` usage documentation page, as it requires updating.
Commits
cc52c6d Added languages that we support
a0f748b Merge pull request #32 from UCREL/mwe
5feb6ef Added the changes to the documentation
39b88ae Added link to MWE syntax notes
9b63279 Updated so that it uses the pre-configured models
91a7089 Added that we support MWE and have models that can be downloaded
61b8265 Needs to be updated before being added back into the documentation
4ff95aa version 0.3.0
2ab0d4b Added spacy registered functions for pos mappers
0b288bb Changed API loading page to the `base` module
6da04a9 MWE Lexicon Collection can handle curly braces being added but will be ignored
5042323 @reader to @misc due to config file format
f186803 isort
1e2d045 spacy factory entry point
17f7821 spacy factory entry point
014f73d Added rule_list spacy registered function
37fb15e No longer use OS default encoding
8c21fc8 CI does not fail on windows when it should, Fixed
67ee480 CI does not fail on windows when it should, DEBUGGING
e48017a isort
745b57a spacy registered function for ContextualRuleBasedRanker
fed00b2 Click issue with version 8.1.0
543b251 spacy registered functions for tagger rules
89d59ec Click issue with version 8.1.0
4b8a22c pytest issue with version 7.1.0
e4b75a5 Click issue with version 8.1.0
787496e spacy registered functions for lexicon collections
6fb5882 Added roadmap link
ca53cc6 ROADMAP from main branch
1626496 update
5a98ccd Now up to date
404da49 PEP 561 compatible by adding py.typed file
c55d991 Added py.typed
2575138 Added srsly as a requirement
bdc84bb Added srsly as a requirement
03ddc79 Moved the new_rule_based tagger into rule_based
d97aa08 Moved the new_rule_based tagger into rule_based
67e60ba flake8
92b43ab Updated examples
ea7fd40 Updated examples
f2a7d47 Added lexicon TSV file that was deleted after removing old tagger
20ba93b Removed old tagger
f316ef5 Serialised methods for custom classes
4135e67 eq methods for the LexiconEntryRanker classes
4ea243b eq methods for the LexiconCollection classes
15a9013 to and from bytes method for the ranker classes
85f96b6 to and from bytes method for SingleWordRule and LexiconCollection
2f1275b Compare meta data directly rather than through a for loop
c71c3b5 to and from bytes method for rule and MWE rule
75e341d to and from bytes methods for MWELexiconCollection
8b862e8 Added srsly as known third party package
dcd90e0 First version of roadmap
e631dd0 ignore abstract method in code coverage results
b017816 update_factory_attributes can update either requires or assigns
0c4b2f2 Corrected docstring
0fe5379 Refactored spacy tagger in doing so created the functions in the utils module
9f4e884 Refactored spacy tagger in doing so created the functions in the utils module
13c2ab3 End of day
f74c887 Corrected docstring
4eacbbf Example conll script
a4e1f0f Added default punctuation and number POS tags
c700642 Added default punctuation and number POS tags
fffa4f6 Reverse POS mappers
e534ae1 New rule based tagger outputs token indexes of MWE
6f1591b pydoc-markdown requirement fixed
b8dd34c Added longest_mwe_template and most_wildcards_in_mwe_template attributes to MWELexiconCollection
fa56b3a End of day
c17e0f0 make wildcard plural
68b5944 Fixed docstring link error
895e7ee Updated version of the tagger using Rules and Ranker
b65da30 MWE Rule match can use a POS Mapper
86c0e6b MWE Rule match can use a POS Mapper
3c4a6f5 MWE lexicon collection can handle POS Mapping
f674577 MWE lexicon collection can handle POS Mapping
a0fa451 MWE lexicon collection can handle POS Mapping
49f631d removed extra whitespace, flake8
6329d79 removed extra whitespace, flake8
b5c8056 unique pos tags from lexicon entry function
cd647db New version of the tagger, works with single word rules
e51bbf1 Added pos mapper
62448b3 ContextualRuleBasedRanker works with global lowest ranking
b4197dd Refactored test data locations and made tests simpler
65e9828 Restructured single word rule test data
6a15b6b Restructured single word rule test data
cc75575 Restructured single word rule test data
e929b76 Single word rule
614f0db Corrected docstring
ed66305 Corrected docstring
2c43738 Added semantic_tags to RankingMetaData object
2649042 Added MWE Lexicon Rules
7f11774 Added MWE Lexicon Rules
0824a3e Better example
2024873 missing assert statement
1f52680 Test empty list parameter for rule based ranker
2051545 Ranker to rank output from single and MWE rules
e57915f Added LexiconMetaData to MWELexiconCollection
526a395 Added Lexicon Meta Data object
ee2fe06 Refactored lexicon entry from collection in the tests
0b3b5bc MWE lexicon collection can detect MWE given an MWE template
1e8f7f6 Corrected python examples
5ead038 Documentation
3119906 isort and flake8 corrected and removed an if statement
e76c719 Made the MWE direct lookup more efficient
036a85e MWE benchmark for MWE direct lookup
5a8b325 MWE direct lookup can handle regular expression special syntax
99ac3f7 Merge branch 'main' into mwe
d381180 Merge pull request #25 from UCREL/indonesian-documentation
f597074 Indonesian example added to the usage documentation
b765d4b isort issue resolved
bf83c87 First version of MWE matching with no special syntax #24
d48656a MWE Lexicon Collection
8823d94 Adds support for raw docstrings
89adf29 Moved LexiconCollection test data into its own folder
dc9675e Moved LexiconCollection test data into its own folder
dc1f2e3 moved lexicon collection tests into a seperate folder
03d3141 Merge branch 'mwe' of github.com:UCREL/pymusas into mwe
9f1512d Update MWE syntax definitions and examples
fc6fbe4 Re-organisation of the test data files/folders
738ff4a Fixed broken link
cd1136c Start of the MWE syntax guide