MigrateResourcesFromSourceforge
Resources that are still available at Sourceforge download pages
Alternatives for a given resource are:
- offer it at Github releases of appropriate repository
- offer it at Kielipankki downloads (few most recent versions), possibly under test directory
- offer it at IDA (old versions, resources not needed anymore, pure test versions)
- do not migrate the resource at all and just let it be available at Sourceforge
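
The split between Kielipankki (a few most recent versions) and IDA (old versions) could be sketched roughly as follows. This is only an illustration: it assumes version directories are named like `v-2017-12-29`, as in the layout proposed further down this page, and the function name `split_versions` and the default `keep=3` are hypothetical choices, not decisions made here.

```python
from datetime import date


def split_versions(version_dirs, keep=3):
    """Split version directory names into (kielipankki, ida) lists.

    `keep` is an assumed stand-in for "few most recent versions".
    """
    def parse(name):
        # "v-2017-12-29" -> date(2017, 12, 29)
        return date.fromisoformat(name[len("v-"):])

    # Newest first, so the first `keep` entries are the most recent ones.
    ordered = sorted(version_dirs, key=parse, reverse=True)
    return ordered[:keep], ordered[keep:]
```

For example, `split_versions(names, keep=2)` would return the two newest directories as candidates for Kielipankki and the rest as candidates for IDA.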
Do we want the resources visible in Kielipankki resources list?
Kielipankki download pages could have directories
- `hfst/morphologies/fi` for Finnish
- `hfst/morphologies/test` for other morphologies
- `hfst/tokenizers/fi`, `hfst/ner-taggers/fi` etc.?
- `hfst/spellers`, `hfst/hyphenators` etc.?
Directory resources/morphological-transducers
Fresh versions of the packages will be offered at Kielipankki download pages.
The scripts for making the packages are available at Github. PID and metadata entries are needed for the packages. Three Metashare articles: a main article and, under it, Finnish Morphology and Test Morphologies (or just 'Transducers') in multiple languages?

    hfst/
      morphologies/            # Metashare article "Morphologies for HFST"
        fi/                    # Metashare article "Finnish morphology"
          v-2017-12-29/
            hfst-finnish.zip
        test/                  # Metashare article "Test morphologies for various languages"
          hfst-english.zip
          hfst-french.zip
          hfst-german.zip
          hfst-italian.zip
          hfst-swedish.zip
          hfst-turkish.zip

directory | description |
---|---|
hfst | Transducers and other resources for HFST |
hfst/morphologies | Morphologies for HFST |
hfst/morphologies/fi | Finnish morphology |
hfst/morphologies/test | Test morphologies for various languages |
hfst/morphologies/fi/v-2017-12-29 | self-explanatory? |
directory | readme.txt |
---|---|
hfst | not needed? |
hfst/morphologies | Morphology packages for HFST containing binary transducers as well as installable scripts for performing morphological analysis and generation. |
hfst/morphologies/fi | Finnish morphology based on OMorFi (insert link here). The package contains binary transducers and an installable script for performing morphological analysis and generation. Licensed under GPL 3.0. |
hfst/morphologies/test | Test morphologies for various languages. Each package contains binary transducers and an installable script for performing morphological analysis and generation. Licenses vary and are indicated in README/LICENSE files in packages. |
hfst/morphologies/fi/v-2017-12-29 | Finnish morphology based on OMorFi (insert link here). The package contains binary transducers and an installable script for performing morphological analysis and generation. Licensed under GPL 3.0. |
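
As a rough sketch of how the proposed layout and its `readme.txt` files could be generated, assuming the texts from the table above are used verbatim: the function name `create_layout` and the `README_TEXTS` mapping are hypothetical, and `hfst` itself gets no readme because the table marks it as "not needed?".

```python
from pathlib import Path

FI_TEXT = (
    "Finnish morphology based on OMorFi (insert link here). The package "
    "contains binary transducers and an installable script for performing "
    "morphological analysis and generation. Licensed under GPL 3.0."
)

# Directory (relative to the download root) -> readme.txt text from the table.
README_TEXTS = {
    "hfst/morphologies": (
        "Morphology packages for HFST containing binary transducers as well "
        "as installable scripts for performing morphological analysis and "
        "generation."
    ),
    "hfst/morphologies/fi": FI_TEXT,
    "hfst/morphologies/test": (
        "Test morphologies for various languages. Each package contains "
        "binary transducers and an installable script for performing "
        "morphological analysis and generation. Licenses vary and are "
        "indicated in README/LICENSE files in packages."
    ),
    "hfst/morphologies/fi/v-2017-12-29": FI_TEXT,
}


def create_layout(root):
    """Create the directories under `root` and write one readme.txt in each."""
    root = Path(root)
    written = []
    for rel_dir, text in README_TEXTS.items():
        directory = root / rel_dir
        directory.mkdir(parents=True, exist_ok=True)
        readme = directory / "readme.txt"
        readme.write_text(text + "\n", encoding="utf-8")
        written.append(readme)
    return written
```

Calling `create_layout("downloads")` would create the four directories under a (hypothetical) `downloads` root and return the paths of the readme files it wrote.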
Based on omorfi. The README of the tarball just links to the omorfi pages.
The exceptions lexicon and parts of the main lexicon are based on WordNet 2.0. The lexicon has been further expanded by including words appearing in the British National Corpus (BNC). The grammar itself is licensed under the GNU General Public License. The BNC expansion is not available in `src/`, as we don't have a license to distribute it, but a guide to producing something similar may be found on the HFST application tutorial wiki page (old page at KitWiki).
A lexical transducer for English derived from the British National Corpus. See under `/src` for the frequency information these weights were collected from. The final result was generated using this data, hfst-strings2fst and hfst-minimize.
Based on Den stora svenska ordlistan (link not working?). The directory `src` of the tarball contains Krister Lindén's scripts and other source files to build the transducer.
Based on trmorph (or the old version?). The tarball contains the source under directory `trmorph-0.2.1`.
Based on morph-it (will be migrated to new web site). The tarball contains the source under directory `morph-it/current_version`.
Based on morphisto. The README of the tarball has instructions for compiling the morphology. Morphisto hasn't been updated for a while, so there is probably no need to make a new package.
Based on morphalou. The README of the tarball has a link to the morphalou pages but no further instructions. Morphalou is essentially a long list of ready-inflected forms in a ~160 MB XML file.
Erzya, Greenlandic, Sámi languages via Giellatekno? Maybe easiest to extract them from Apertium packages and offer a stable release e.g. once a year?
(Giellatekno is planning a graphical analysis tool where people can just paste in text and get analysis / trees / whatever out in the other end. The tool should autoupdate the analysers at regular intervals so that people would always have the latest and greatest analysers, and would not have to fiddle with setting up hfst and giella packages; e.g. one of the update channels could be the nightly builds; that won’t solve the needs of everyone, but for linguists just wanting to analyse text using the tools, it should be the easiest solution.)
Directory optimized-lookup
Available at Github since the beginning of 2016, when the project was migrated. Older versions are still available at Sourceforge, but they can be offered at IDA.
- hfst-ol.jar (is this updated?): `ant` creates `hfst-ol.jar`, which can be tested with `java -jar hfst-ol.jar INFILE`
- hfst-optimized-lookup-python.tar.gz (is this updated?): remove swig, just hfst_lookup dir (todo: test making a package that contains all files under this dir)
- hfst-optimized-lookup.tar.gz (a fresh release should be offered at Github) (test package making)
- hfst-ospell.tar.gz (available at Github)
Directory hfst
Available at Github since the beginning of 2016, when the project was migrated. Older versions are still available at Sourceforge, but they can be offered at IDA.
Directory resources
File `finnish-process.tbz2` (last updated 2015-02-04, size 143.2 MB) doesn't need to be migrated.
See above.
A Finnish tokenizer is being developed. It could be offered as a `finnish-tokenize` script at Kielipankki. We would also need at least swedish-tokenize and english-tokenize. There may already be tokenization resources available, so they do not need to be implemented from scratch?
Directory resources/spell-transducers
Not updated for a long time, currently available via voikko and giellatekno (`https://gtsvn.uit.no/langtech/trunk/langs/LANG/tools/spellcheckers/fstbased/desktop/hfst`). Is there a need to offer them also via Kielipankki? http://divvun.no/korrektur/otherapps.html
Directory resources/hyphenation-transducers
- hyph-utf8-hfst.2010-04-23.tar.gz (2010-04-23, 627.2 MB)
HFST hyphenation transducers for over 40 languages transformed automatically from TeX CTAN distribution’s hyphenation rulesets. These transducers can be offered at Kielipankki test directory. Should the individual hyphenation transducers be offered as separate packages too? At least the Finnish hyphenator seems to be a test version that doesn't perform very well.