This tool automatically computes a full range of analytics on monolingual or bilingual data sets to help make informed decisions about them.
It reports corpus details, volumes, language, length, noise and quality score distributions, common n-grams and more, in the spirit of the work carried out in https://www.semanticscholar.org/paper/Documenting-the-English-Colossal-Clean-Crawled-Dodge-Sap/40c3327a6ddb0603b6892344509c7f428ab43d81.
Support for language-dependent components has been added for dozens of languages.
The tool generates automated reports through a web application to which a corpus can be uploaded. Once the corpus is processed, the viewer plots the analysis and automatically generates a PDF report containing the same information.
Icon: https://thenounproject.com/icon/fingerprint-3530285/
To build and start the web application:
- sudo docker-compose build
- sudo docker-compose up
URLs to upload and view a dataset:
- Uploader: localhost:8000/uploader
- Viewer: localhost:8000/viewer
If you need to access the Docker container to run commands inside it:
- sudo docker exec -it dat-webapp /bin/bash
Code and data are located in /work
Aside from uploading through the webapp interface, the runstats.sh script (located in /work/scripts/) can be used to generate stats, running it with parameters as follows:
bash /work/scripts/runstats.sh {CORPUS_PATH} {YAML_FILENAME} {SOURCE_LANGUAGE} {TARGET_LANGUAGE} {FORMAT} {LANGUAGE_FORMAT}
Where:
- CORPUS_PATH: The path to the corpus to be analyzed.
- YAML_FILENAME: The path and filename of the resulting stats YAML.
- SOURCE_LANGUAGE: Source language code (2-letter, ISO 639-1).
- TARGET_LANGUAGE: Target language code (2-letter, ISO 639-1), or `-` for monolingual corpora.
- FORMAT: File format. Currently accepted values are `bitext`, `tmx`, `tsv` and `docs`.
- LANGUAGE_FORMAT: Currently accepted values are `parallel` and `mono`.
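For example, a hypothetical English-Icelandic TMX corpus could be analyzed like this (paths and language codes are purely illustrative):

bash /work/scripts/runstats.sh /work/data/corpus.en-is.tmx /work/data/corpus.en-is.stats.yaml en is tmx parallel

and a monolingual document-format corpus like this:

bash /work/scripts/runstats.sh /work/data/corpus.is.docs /work/data/corpus.is.stats.yaml is - docs mono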
Even though it is not possible to run it this way by default, it is easy to adapt the scripts to generate only "lite" stats (that is, skipping those that are computationally heavy: n-grams, duplicates...). This mode is useful when processing huge corpora (currently only monolingual corpora are supported). In order to generate the lite stats, modify the call to `readcorpus_mono.py` in `runstats.sh` so it includes the `--lite` flag.
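As a rough sketch of that change (the interpreter, path and existing arguments used in runstats.sh may differ and should be left as they are; only the flag is appended):

python readcorpus_mono.py <existing arguments> --lite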
The stats generated with this tool come in a handy YAML format with the following fields (a short example of loading them programmatically is shown after the list):
- `bicleaner_scores`: Distribution of segment pairs with certain Bicleaner AI scores (only for parallel corpora)
- `corpus`: Corpus filename
- `docs_avg_lm`: Distribution of documents having a certain Monocleaner average fluency score of their segments (only for monolingual documents)
- `docs_collections`: Distribution of documents per origin collection (only for monolingual documents)
- `docs_langs`: Distribution of documents having a certain percentage of their segments in the declared document language (only for monolingual documents)
- `docs_segments`: Distribution of documents having a certain amount of segments (only for monolingual documents)
- `docs_segments_mean`: Mean value of `docs_segments` (only for monolingual documents)
- `docs_segments_median`: Median value of `docs_segments` (only for monolingual documents)
- `docs_timestamp`: Unix timestamp indicating when the document part of the stats was obtained (only for monolingual documents)
- `docs_top100_domains`: 100 most common domains, and the amount of documents for each one (only for monolingual documents)
- `docs_top100_tld`: 100 most common top-level domains (not including subdomains), and the amount of documents for each one (only for monolingual documents)
- `docs_total`: Total amount of documents in the corpus (only for monolingual documents)
- `docs_warning`: List of issues encountered while processing documents (only for monolingual documents)
  - `docs_unmatching_xxx`: Some documents (a total of xxx) in the corpus had a different amount of segments and LM scores or language identifications, so they were discarded.
- `docs_wds`: Distribution of documents having a certain Document Score (only for monolingual documents)
- `hardrules_tags`: List of possible issues in the segments, detected by Hardrules:
  - `not_too_long`: Percentage of segments longer than 1024 characters.
  - `not_too_short`: Percentage of segments shorter than 3 tokens.
  - `no_urls`: Percentage of segments containing URLs.
  - `no_bad_encoding`: Percentage of badly encoded segments.
  - `no_porn`: Percentage of segments containing porn content (not available for all languages).
- `monocleaner_scores`: Distribution of segments with a certain Monocleaner score (only for monolingual corpora)
- `sentence_pairs`: Total amount of segments (for monolingual corpora) or segment pairs (for parallel corpora)
- `src_bytes`: Total size of source segments, uncompressed.
- `srclang`: Source language.
- `src_langs`: Distribution of source segment languages, as identified by FastSpell
- `src_ngrams`: Distribution of the 5 most common n-grams of each order (1-grams to 5-grams) in source segments (not computed in monolingual lite stats mode)
- `src_sent_tokens`: Distribution of source segments having a certain amount of tokens (more info on tokenization tools here) (not computed in monolingual lite stats mode)
- `src_sent_tokens_mean`: Mean value of `src_sent_tokens` (not computed in monolingual lite stats mode)
- `src_sent_tokens_median`: Median value of `src_sent_tokens` (not computed in monolingual lite stats mode)
- `src_tokens`: Total amount of tokens in source segments (not computed in monolingual lite stats mode)
- `src_unique_sents`: Distribution of source segments having a certain amount of tokens, after removing duplicated segments (not computed in monolingual lite stats mode)
- `timestamp`: Unix timestamp indicating when the stats were obtained.
- `trg_bytes`: Total size of target segments, uncompressed (only for parallel corpora)
- `trglang`: Target language (only for parallel corpora)
- `trg_langs`: Distribution of target segment languages, as identified by FastSpell (only for parallel corpora)
- `trg_ngrams`: Distribution of the 5 most common n-grams of each order (1-grams to 5-grams) in target segments (only for parallel corpora)
- `trg_sent_tokens`: Distribution of target segments having a certain amount of tokens (more info on tokenization tools here) (only for parallel corpora)
- `trg_sent_tokens_mean`: Mean value of `trg_sent_tokens` (only for parallel corpora)
- `trg_sent_tokens_median`: Median value of `trg_sent_tokens` (only for parallel corpora)
- `trg_tokens`: Total amount of tokens in target segments (only for parallel corpora)
- `trg_unique_sents`: Distribution of target segments having a certain amount of tokens, after removing duplicated segments (only for parallel corpora)
- `ttr_src`: Type-Token Ratio of the source segments (not computed in monolingual lite stats mode)
- `ttr_trg`: Type-Token Ratio of the target segments (only for parallel corpora)
- `unique_sents`: Total amount of segments (for monolingual corpora) or segment pairs (for parallel corpora), after removing duplicated segments or segment pairs (not computed in monolingual lite stats mode)
- `warnings`: List of issues encountered while processing the corpus:
  - `src_warning_tok_xxx_yyy`: The source language is not supported by a dedicated tokenizer, so it falls back to the xxx tokenizer with the yyy language (only for parallel corpora).
  - `trg_warning_tok_xxx_yyy`: Same as the above but for the target language (only for parallel corpora).
  - `ngrams_xxx_nostopwords`: No stopwords available for the xxx language (the language being processed).
  - `ngrams_xxx_freq`: The stopwords used for the xxx language were simply obtained by frequency (top 1% of the corpus).
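As a minimal sketch of how these stats can be consumed programmatically (assuming the YAML top level is a mapping of the fields above; the filename is hypothetical, and some fields only exist for parallel corpora or are skipped in lite mode):

```python
# Minimal sketch: load a stats YAML produced by runstats.sh and print a few fields.
# "corpus.en-is.stats.yaml" is a hypothetical filename; point it at your own output.
import yaml

with open("corpus.en-is.stats.yaml") as f:
    stats = yaml.safe_load(f)

print("Corpus:", stats.get("corpus"))
print("Source language:", stats.get("srclang"))
print("Segments / segment pairs:", stats.get("sentence_pairs"))

# trg_* fields only exist for parallel corpora, and some fields are skipped
# in monolingual lite mode, hence the defensive .get() / membership checks.
if "trglang" in stats:
    print("Target language:", stats["trglang"])
```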
HPLTAnalytics comes with a webapp that displays the generated YAML files in a friendlier, more comfortable interface. It has the following sections:
- General overview:
  - Corpus name
  - Date on which the analysis was performed
  - Language(s)
  - Volumes:
    - Documents (only for monolingual documents)
    - Segments
    - Unique segments (not computed in monolingual lite stats mode)
    - Size in tokens (not computed in monolingual lite stats mode)
    - File size
  - Type Token Ratio: lexical variation indicator, obtained by dividing the total number of different words (types) by the total number of words (tokens). The higher, the better: a high TTR indicates a high degree of lexical variation, while a low TTR indicates the opposite (not computed in monolingual lite stats mode). A small worked example is shown after this list.
  - Top 10 domains (excluding subdomains) (only for monolingual documents)
  - Top 10 TLDs (only for monolingual documents)
  - Document size (in segments): histogram showing the distribution of document sizes (only for monolingual documents)
  - Documents by collection (only for monolingual documents)
- Language distribution:
  - Number of segments: shows the percentage of segments per automatically identified language.
  - Percentage of segments in the declared language, inside documents (only for monolingual documents)
- Quality Score distribution: histogram showing the distribution of segments (monolingual) having a certain language model score, or of sentence pairs (parallel) having a certain Bicleaner score (Bicleaner estimates the likelihood of two sentences being mutual translations)
- Quality Score average distribution: histogram displaying the distribution of the average fluency score of the segments in each document (only for monolingual documents)
- Document Score distribution: histogram showing the distribution of Document Scores (only for monolingual documents)
- Segment length distribution: tokens per segment for each language, showing total, unique and duplicate segments or segment pairs (not computed in monolingual lite stats mode)
- Noise distribution: the result of applying hard rules and computing which percentage of segments is affected by them (too short or too long sentences, sentences being URLs, bad encoding, sentences containing poor language, etc.) (not computed in monolingual lite stats mode)
- Frequent n-grams: the most frequent 1-grams to 5-grams (not computed in monolingual lite stats mode)
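To make the Type Token Ratio above concrete, here is a small generic sketch of how such a ratio can be computed (whitespace tokenization is used for simplicity; the tool relies on its own language-specific tokenizers, so its exact numbers may differ):

```python
# Generic Type-Token Ratio sketch: number of distinct words (types) divided by
# the total number of words (tokens). Whitespace tokenization for simplicity.
segments = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

tokens = [tok.lower() for seg in segments for tok in seg.split()]
types = set(tokens)

ttr = len(types) / len(tokens)
print(f"tokens={len(tokens)} types={len(types)} TTR={ttr:.2f}")  # 12 tokens, 7 types -> 0.58
```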
Some example findings from corpora analyzed with this tool:
- HPLT monolingual documents for Afrikaans: the analysis shows that more than half of the documents come from the same domain, and that a large amount of documents contain less than 30% of segments in Afrikaans. The corpus also contains a lot of short segments. Its generally low quality is also confirmed by its Document Scores.
- Parallel English-Norwegian HPLT corpus from the initial data release: the analysis shows that deduplication needs to be addressed as one of the most important issues.
- Monolingual Basque corpus from HPLT: the analysis shows that at least half of the corpus is not in Basque, and that a very high percentage of segments are very short.