- revise `pipeline` module:
  - rationale: serialize to JSON as default
  - remove dependency with `brat` code (`conll2standoff`)
  - update tests
- update TreeTagger installation script
  - and provide a version for Mac OS
- evaluation of `MLCitationMatcher` (via `tests/test_eval.py`)
- parallelise train/disambiguate/feature extraction with `dask`
- write test for `FeatureExtractor.extract_nil` (mr)
- write tests for `ned.candidates.CandidatesGenerator` (mr)
- write documentation for feature functions (mf)
- implement `MLCitationMatcher.train`
- implement `MLCitationMatcher.classify`
- improve the code quality/style
- create evaluation `py.tests` for:
  - NER
  - RelEX (compare rule-based and ML-based extraction)
  - NED
- create some stats about the training/test corpus (but not here, on APh corpus repo)
  - number of entities by class
  - number of relations
  - number of tokens
  - language distribution of documents
- remove obsolete functions from `pipeline`
- to streamline installation, try to remove local dependencies:
  - add `pysuffix` to the codebase => `Utils.pysuffix` (or so)
  - add
- change the `LookupDictionary` in `Utils.FastDict` so that it gets the data directly from the Knowledge Base instead of the static file (needs tests)
  - put author names into a dictionary, assuring that the keys are unique
  - this code uses the new KB, not the one in `citation_extractor.ned`

        flat_author_names = {
            "%s$$n%i" % (author.get_urn(), i + 1): name[1]
            for author in kb.get_authors()
            for i, name in enumerate(author.get_names())
            if author.get_urn() is not None
        }

- move `crfpp_templates` to the `data` directory
- re-organise the logging
- rewrite tests for `pipeline` module
- write tests for:
  - creating and running a citation extractor
  - test whether the `citation_extractor` can be pickled
  - use of the several classifiers (not only CRF) i.e. scikitlearnadapter
  - test that the ActiveLearner still works
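
Since the `LookupDictionary` item is marked "needs tests", here is a minimal sketch of how the key scheme for the flattened author names could be checked without a real Knowledge Base. `FakeAuthor` and `flatten_author_names` are hypothetical names introduced only for this illustration. The sketch assumes that `get_names()` returns `(language, name)` pairs, which is inferred from the `name[1]` access in the snippet above and not confirmed, and the URN is just sample data.

```python
class FakeAuthor:
    """Hypothetical stand-in for an author record of the Knowledge Base."""

    def __init__(self, urn, names):
        self._urn = urn
        # Assumed shape: a list of (language, name) tuples.
        self._names = names

    def get_urn(self):
        return self._urn

    def get_names(self):
        return self._names


def flatten_author_names(authors):
    """Map '<urn>$$n<position>' to each name; authors without a URN are skipped."""
    return {
        "%s$$n%i" % (author.get_urn(), i + 1): name[1]
        for author in authors
        for i, name in enumerate(author.get_names())
        if author.get_urn() is not None
    }


# Sample data for illustration only.
authors = [
    FakeAuthor("urn:cts:greekLit:tlg0012", [("en", "Homer"), ("la", "Homerus")]),
    FakeAuthor(None, [("en", "Anonymous")]),
]
flat = flatten_author_names(authors)

# One key per (author, name position); the author without a URN is dropped.
assert len(flat) == 2
assert flat["urn:cts:greekLit:tlg0012$$n2"] == "Homerus"
```

The keys stay unique only as long as the URNs themselves are unique across authors, so a real test would probably also compare the size of the dictionary with the total number of names in the KB.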