cc-extract

A simple tool for extracting low-resource corpora from the Common Crawl.

Download the fastText language identification model into the root directory of the repository:

wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
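As a quick check that the download worked, the model can be loaded with the `fasttext` Python package and asked to classify a sentence. This is only a minimal sketch: the README does not show how `extract_by_tld.py` calls the model, and the helper names `label_to_lang` and `detect_language` are hypothetical, introduced here for illustration.

```python
LABEL_PREFIX = "__label__"


def label_to_lang(label: str) -> str:
    """Strip fastText's label prefix, e.g. '__label__is' -> 'is'."""
    if label.startswith(LABEL_PREFIX):
        return label[len(LABEL_PREFIX):]
    return label


def detect_language(text: str, model_path: str = "lid.176.bin"):
    """Return (language code, confidence) for a piece of text.

    Assumes lid.176.bin sits in the current directory, as downloaded above.
    """
    import fasttext  # imported lazily so label_to_lang works without it

    model = fasttext.load_model(model_path)
    # predict() expects a single line of text, so collapse newlines first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return label_to_lang(labels[0]), float(probs[0])
```

Calling `detect_language("some text")` should return a language code together with a confidence score. For bulk use, load the model once and reuse it rather than reloading it on every call as this sketch does.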

The same script needs to be run several times, once for each step.

usage: extract_by_tld.py [-h] [--search SEARCH]
                         [--search_dir SEARCH_DIR]
                         [--fetch FETCH] [--warc_out WARC_OUT]
                         [--n_proc N_PROC] [--extract EXTRACT]
                         [--extract_out EXTRACT_OUT]
                         [--stoplist_lang STOPLIST_LANG]
                         [--fasttext_lang FASTTEXT_LANG]
                         [--fasttext_lang_ignore FASTTEXT_LANG_IGNORE]

optional arguments:
  -h, --help            show this help message and exit
  --search SEARCH
  --search_dir SEARCH_DIR
  --fetch FETCH
  --warc_out WARC_OUT
  --n_proc N_PROC
  --extract EXTRACT
  --extract_out EXTRACT_OUT
  --stoplist_lang STOPLIST_LANG
  --fasttext_lang FASTTEXT_LANG
  --fasttext_lang_ignore FASTTEXT_LANG_IGNORE
