    Repositories list

    • Scripts for parallelized extraction of plain text from WARC archives, aiming at a common and reproducible extraction approach.
      HTML
      0350 Updated Nov 9, 2024
    • OpusPocus
      Public
      Marian machine translation training pipeline for thousands of models
      Python
      02190 Updated Nov 8, 2024
    • Data Analytics Tool
      JavaScript
      1900 Updated Nov 7, 2024
    • Scripts for running bitextor jobs
      Shell
      1010 Updated Nov 6, 2024
    • Set of scripts to run a monotextor-like pipeline on Slurm HPC clusters
      Rust
      GNU General Public License v3.0
      0200 Updated Nov 4, 2024
    • Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca
      Python
      0800 Updated Nov 2, 2024
    • Jupyter Notebook
      5101 Updated Oct 29, 2024
    • Shell
      0130 Updated Oct 17, 2024
    • Shell
      0000 Updated Oct 15, 2024
    • Curriculum training
      Python
      MIT License
      516190 Updated Sep 14, 2024
    • OpusCleaner is a web interface that helps you select, clean, and schedule your data for training machine translation models.
      Python
      1348561 Updated Sep 7, 2024
    • Internet archive downloader
      Jupyter Notebook
      0210 Updated Aug 7, 2024
    • HPLT-WP4
      Public
      Information and pipelines on WP4: language models training
      Python
      Creative Commons Zero v1.0 Universal
      2100 Updated Jul 11, 2024
    • Python port of Moses tokenizer, truecaser and normalizer
      Python
      MIT License
      57487274 Updated May 26, 2024
    • tf/idf-based document aligner from Bitextor
      C++
      Apache License 2.0
      0001 Updated Mar 19, 2024
    • PHP
      MIT License
      1000 Updated Mar 9, 2024
    • Configuration and scripts for HPLT MT model releases.
      Python
      0410 Updated Mar 6, 2024
    • OpusFilter - Parallel corpus processing toolkit
      Python
      MIT License
      18000 Updated Jan 3, 2024
    • clianer
      Public
      A lightweight command-line frontend to OpusCleaner
      Python
      MIT License
      1000 Updated Nov 27, 2023
    • Makeshift interface for managing Paracrawl processing and exploring its outputs
      HTML
      1000 Updated Oct 10, 2023
    • 0100 Updated Feb 7, 2023