MigrateResourcesFromSourceforge

eaxelson edited this page Dec 19, 2018 · 1 revision

Resources that are still available at Sourceforge download pages

General considerations

Alternatives for a given resource are:

  • offer it at the Github releases of the appropriate repository
  • offer it at Kielipankki downloads (the few most recent versions), possibly under a test directory
  • offer it at IDA (old versions, resources no longer needed, pure test versions)
  • do not migrate the resource at all and leave it available at Sourceforge

Do we want the resources visible in Kielipankki resources list?

Kielipankki download pages could have directories

  • hfst/morphologies/fi for Finnish
  • hfst/morphologies/test for other morphologies
  • hfst/tokenizers/fi, hfst/ner-taggers/fi etc.?
  • hfst/spellers hfst/hyphenators etc.?

Fresh versions of the packages will be offered at Kielipankki download pages. The scripts for making the packages are available at Github. PID and metadata entries are needed for the packages. Three Metashare articles: a main article and, under it, Finnish Morphology and Test Morphologies (or just 'Transducers') in multiple languages?

Proposed directory structure:

hfst/
    morphologies/  # Metashare article "Morphologies for HFST"
        fi/  # Metashare article "Finnish morphology"
            v-2017-12-29/
                hfst-finnish.zip
        test/  # Metashare article "Test morphologies for various languages"
            hfst-english.zip
            hfst-french.zip
            hfst-german.zip
            hfst-italian.zip
            hfst-swedish.zip
            hfst-turkish.zip
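
The proposed layout can be tried out locally before anything is uploaded. The following is a minimal sketch, assuming only the directory and package names listed above; the function name `create_layout` and the use of a temporary root directory are illustrative and not part of any existing packaging script.

```python
from pathlib import Path
import tempfile

# Directory and package names are taken from the proposal above.
# Everything else (names of variables and functions) is hypothetical.
LAYOUT = {
    "hfst/morphologies/fi/v-2017-12-29": ["hfst-finnish.zip"],
    "hfst/morphologies/test": [
        "hfst-english.zip",
        "hfst-french.zip",
        "hfst-german.zip",
        "hfst-italian.zip",
        "hfst-swedish.zip",
        "hfst-turkish.zip",
    ],
}


def create_layout(root: Path) -> list[Path]:
    """Create the proposed directories under root.

    Returns the paths where the packages are expected to be placed;
    the packages themselves are not created.
    """
    expected = []
    for directory, packages in LAYOUT.items():
        target = root / directory
        target.mkdir(parents=True, exist_ok=True)
        expected.extend(target / name for name in packages)
    return expected


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_layout(Path(tmp)):
            print(path.relative_to(tmp))
```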

Proposed descriptions for directories:

directory | description
hfst | Transducers and other resources for HFST
hfst/morphologies | Morphologies for HFST
hfst/morphologies/fi | Finnish morphology
hfst/morphologies/test | Test morphologies for various languages
hfst/morphologies/fi/v-2017-12-29 | self-explanatory?

Proposed readme files (readme.txt) in directories:

directory | readme.txt
hfst | not needed?
hfst/morphologies | Morphology packages for HFST containing binary transducers as well as installable scripts for performing morphological analysis and generation.
hfst/morphologies/fi | Finnish morphology based on OMorFi (insert link here). The package contains binary transducers and an installable script for performing morphological analysis and generation. Licensed under GPL 3.0.
hfst/morphologies/test | Test morphologies for various languages. Each package contains binary transducers and an installable script for performing morphological analysis and generation. Licenses vary and are indicated in README/LICENSE files in packages.
hfst/morphologies/fi/v-2017-12-29 | Finnish morphology based on OMorFi (insert link here). The package contains binary transducers and an installable script for performing morphological analysis and generation. Licensed under GPL 3.0.
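
If the readme texts are accepted, they could be written into the directories by a small script rather than by hand. The sketch below assumes the readme texts proposed above, copied verbatim, including the unresolved "(insert link here)" placeholder; the function name `write_readmes` is hypothetical, and no readme is written for hfst itself since that one is marked "not needed?".

```python
from pathlib import Path
import tempfile

# Readme texts copied from the proposal above; the link placeholder is
# left unresolved on purpose.
FINNISH = (
    "Finnish morphology based on OMorFi (insert link here). The package "
    "contains binary transducers and an installable script for performing "
    "morphological analysis and generation. Licensed under GPL 3.0."
)

READMES = {
    "hfst/morphologies": (
        "Morphology packages for HFST containing binary transducers as well "
        "as installable scripts for performing morphological analysis and "
        "generation."
    ),
    "hfst/morphologies/fi": FINNISH,
    "hfst/morphologies/fi/v-2017-12-29": FINNISH,
    "hfst/morphologies/test": (
        "Test morphologies for various languages. Each package contains "
        "binary transducers and an installable script for performing "
        "morphological analysis and generation. Licenses vary and are "
        "indicated in README/LICENSE files in packages."
    ),
}


def write_readmes(root: Path) -> list[Path]:
    """Write a readme.txt into each directory listed in READMES (hypothetical helper)."""
    written = []
    for directory, text in READMES.items():
        target = root / directory
        target.mkdir(parents=True, exist_ok=True)
        readme = target / "readme.txt"
        readme.write_text(text + "\n", encoding="utf-8")
        written.append(readme)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_readmes(Path(tmp)):
            print(path.relative_to(tmp))
```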

More information about the morphologies

Finnish

Based on omorfi. The README in the tarball just links to the omorfi pages.

English

The exceptions lexicon and parts of the main lexicon are based on WordNet 2.0. The lexicon has been further expanded by including words appearing in the British National Corpus (BNC). The grammar itself is licensed under the GNU General Public License. The BNC expansion is not available in src/ as we don't have a license to distribute it, but a guide to producing something similar may be found on the HFST application tutorial wiki page (old page at KitWiki).

English BNC

A lexical transducer for English derived from the British National Corpus. See under /src for the frequency information these weights were collected from. The final result was generated using this data, hfst-strings2fst and hfst-minimize.
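
The README does not say how the frequencies were turned into weights. As an illustration only, the sketch below assumes a common convention, the negative log of the relative frequency, and produces one tab-separated word/weight line per entry. Both the formula and the line format are assumptions and should be checked against the documentation of hfst-strings2fst before the output is fed to it; the function name is hypothetical.

```python
import math


def frequencies_to_weighted_strings(counts: dict[str, int]) -> list[str]:
    """Turn word counts into 'word<TAB>weight' lines.

    Assumption: weight = -ln(count / total), so frequent words get a
    smaller weight. This is not confirmed to be the scheme used for the
    released transducer.
    """
    total = sum(counts.values())
    lines = []
    for word, count in sorted(counts.items()):
        weight = -math.log(count / total)
        lines.append(f"{word}\t{weight:.4f}")
    return lines


if __name__ == "__main__":
    # Toy counts, not taken from the BNC.
    for line in frequencies_to_weighted_strings({"the": 6, "cat": 3, "sat": 1}):
        print(line)
```

The resulting lines would then be the input for the compilation and minimization steps mentioned above.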

Swedish

Based on Den stora svenska ordlistan (link not working?). The src directory of the tarball contains Krister Lindén's scripts and other source files for building the transducer.

Turkish

Based on trmorph (or the old version?). The tarball contains the source under the directory trmorph-0.2.1.

Italian

Based on morph-it (will be migrated to a new web site). The tarball contains the source under the directory morph-it/current_version.

German

Based on morphisto. The README in the tarball has instructions for compiling the morphology. Morphisto hasn't been updated for a while, so there is probably no need to make a new package.

French

Based on morphalou. The README in the tarball has a link to the morphalou pages but no further instructions. Morphalou is essentially a long list of ready-inflected forms in a ~160 MB XML file.

Other?

Erzya, Greenlandic, Sámi languages via Giellatekno? Maybe easiest to extract them from Apertium packages and offer a stable release e.g. once a year?

(Giellatekno is planning a graphical analysis tool where people can just paste in text and get analyses, trees or other output at the other end. The tool should auto-update the analysers at regular intervals, so that people always have the latest analysers and do not have to fiddle with setting up hfst and giella packages; one of the update channels could be the nightly builds. That won’t solve everyone's needs, but for linguists who just want to analyse text using the tools, it should be the easiest solution.)

Directory optimized-lookup

Available at Github since the beginning of 2016, when the project was migrated. Older versions are still available at Sourceforge, but they can be offered at IDA.

  • hfst-ol.jar (is this updated?): ant creates hfst-ol.jar that can be tested with java -jar hfst-ol.jar INFILE
  • hfst-optimized-lookup-python.tar.gz (is this updated?): remove swig, just hfst_lookup dir (todo: test making a package that contains all files under this dir)
  • hfst-optimized-lookup.tar.gz (a fresh release should be offered at Github) (test package making)
  • hfst-ospell.tar.gz (available at Github)
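
For the "test package making" items above, packing every file under one directory into a tarball can be sketched as follows. This is a minimal sketch using only the Python standard library; the function name `make_package` and the placeholder file `lookup.py` are hypothetical, and the real contents of the hfst_lookup directory are not assumed.

```python
import tarfile
import tempfile
from pathlib import Path


def make_package(source_dir: Path, archive_path: Path) -> list[str]:
    """Pack everything under source_dir into a gzip-compressed tarball.

    Returns the sorted member names so the result can be inspected.
    """
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname keeps only the directory name, not the absolute path.
        tar.add(str(source_dir), arcname=source_dir.name)
    with tarfile.open(str(archive_path), "r:gz") as tar:
        return sorted(tar.getnames())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "hfst_lookup"
        source.mkdir()
        # Placeholder file standing in for the real directory contents.
        (source / "lookup.py").write_text("# placeholder\n", encoding="utf-8")
        archive = Path(tmp) / "hfst-optimized-lookup-python.tar.gz"
        for name in make_package(source, archive):
            print(name)
```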

Directory hfst

Available at Github since the beginning of 2016, when the project was migrated. Older versions are still available at Sourceforge, but they can be offered at IDA.

Directory resources

File finnish-process.tbz2 (last updated 2015-02-04, size 143.2 MB) doesn't need to be migrated.

Morphologies

See above.

Tokenizers

A Finnish tokenizer is being developed. It could be offered as a finnish-tokenize script at Kielipankki. We would also need at least swedish-tokenize and english-tokenize. There may already be tokenization resources available, so that they do not need to be implemented from scratch?

The spellers have not been updated for a long time and are currently available via voikko and giellatekno (https://gtsvn.uit.no/langtech/trunk/langs/LANG/tools/spellcheckers/fstbased/desktop/hfst). Is there a need to offer them via Kielipankki as well? See also http://divvun.no/korrektur/otherapps.html

  • hyph-utf8-hfst.2010-04-23.tar.gz (2010-04-23, 627.2 MB)

HFST hyphenation transducers for over 40 languages, converted automatically from the hyphenation rulesets of the TeX CTAN distribution. These transducers can be offered in the Kielipankki test directory. Should the individual hyphenation transducers be offered as separate packages too? At least the Finnish hyphenator seems to be a test version that doesn't perform very well.