Commit d9fb85a: release 1.0
ricardorei committed Nov 19, 2021 (1 parent bc70efe)
Showing 1 changed file: README.md, with 61 additions and 12 deletions.

<a href="https://github.com/psf/black"><img alt="Code Style" src="https://img.shields.io/badge/code%20style-black-black" /></a>
</p>

> Version 1.0.0 is finally out 🥳! What's new?
> 1) `comet-compare` command for statistical comparison between two models
> 2) `comet-score` with multiple hypothesis/systems
> 3) Embeddings caching for faster inference (thanks to [@jsouza](https://github.com/jsouza)).
> 4) Length Batching for faster inference (thanks to [@CoderPat](https://github.com/CoderPat))
> 5) Integration with SacreBLEU for dataset downloading (thanks to [@mjpost](https://github.com/mjpost))
> 6) Monte Carlo Dropout for uncertainty estimation (thanks to [@glushkovato](https://github.com/glushkovato) and [@chryssa-zrv](https://github.com/chryssa-zrv))
> 7) Some code refactoring

## Quick Installation

Detailed usage examples and instructions can be found in the [Full Documentation](https://unbabel.github.io/COMET/html/index.html).

Simple installation from PyPI:

```bash
pip install unbabel-comet==1.0.0
```

To develop locally, install [Poetry](https://python-poetry.org/docs/#installation) and run the following commands:
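
A minimal sketch of the usual workflow, assuming the standard clone-and-install steps for this repository:

```bash
git clone https://github.com/Unbabel/COMET
cd COMET
poetry install
```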
You can then invoke the CLI scripts directly from the repository root, for example:

```bash
PYTHONPATH=. ./comet/cli/score.py
```

## Scoring MT outputs:

### CLI Usage:

Test examples:

```bash
echo -e "Dem Feuer konnte Einhalt geboten werden\nSchulen und Kindergärten wurden eröffnet." >> src.de
echo -e "The fire could be stopped\nSchools and kindergartens were open" >> hyp.en
echo -e "The fire could be stopped\nSchools and kindergartens were open" >> hyp1.en
echo -e "The fire could have been stopped\nSchools and pre-school were open" >> hyp2.en
echo -e "They were able to control the fire.\nSchools and kindergartens opened" >> ref.en
```

```bash
comet-score -s src.de -t hyp1.en -r ref.en
```

Scoring multiple systems:

```bash
comet-score -s src.de -t hyp1.en hyp2.en -r ref.en
```

WMT test sets via [SacreBLEU](https://github.com/mjpost/sacrebleu):

```bash
comet-score -d wmt20:en-de -t PATH/TO/TRANSLATIONS
```

You can select another model/metric with the `--model` flag; for reference-free (QE-as-a-metric) models you don't need to pass a reference.

```bash
comet-score -s src.de -t hyp1.en --model wmt20-comet-qe-da
```

Following the work on [Uncertainty-Aware MT Evaluation](https://aclanthology.org/2021.findings-emnlp.330/), you can use the `--mc_dropout` flag to get a variance/uncertainty value for each segment score. If this value is high, the metric is less confident in that prediction.

```bash
comet-score -s src.de -t hyp1.en -r ref.en --mc_dropout 30
```

When comparing two MT systems, we encourage you to run the `comet-compare` command to get **statistical significance** with a paired t-test and bootstrap resampling [(Koehn, 2004)](https://aclanthology.org/W04-3250/).

```bash
comet-compare -s src.de -x hyp1.en -y hyp2.en -r ref.en
```

For even more detailed contrastive MT evaluation, please take a look at our new tool [MT-Telescope](https://github.com/Unbabel/MT-Telescope).

#### Multi-GPU Inference:

COMET is optimized to be used on a single GPU by taking advantage of length batching and embedding caching. When using multiple GPUs, data is spread across devices, so we typically get fewer cache hits, and the length-batching sampler is replaced by a [DistributedSampler](https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html#replace-sampler-ddp). Because of that, according to our experiments, using 1 GPU is faster than using 2 GPUs, especially when scoring multiple systems for the same source and reference.

Nonetheless, if your data does not have repetitions and you have more than 1 GPU available, you can **run multi-GPU inference with the following command**:

```bash
comet-score -s src.de -t hyp1.en -r ref.en --gpus 2
```

#### Changing Embedding Cache Size:
You can change the cache size of COMET using the following environment variable:

```bash
export COMET_EMBEDDINGS_CACHE="2048"
```
By default, the COMET cache size is 1024.
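
For example, the cache size can also be set for a single run (the value here is illustrative; the variable only needs to be visible to the scoring process):

```bash
COMET_EMBEDDINGS_CACHE="2048" comet-score -s src.de -t hyp1.en hyp2.en -r ref.en
```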


### Scoring within Python:
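
A minimal sketch of the Python API for this release, assuming the `wmt20-comet-da` model name and that `predict` returns segment scores plus a corpus-level system score (batch size and GPU count are illustrative):

```python
from comet import download_model, load_from_checkpoint

# Download the model checkpoint (cached locally) and load it.
model_path = download_model("wmt20-comet-da")
model = load_from_checkpoint(model_path)

# Each sample is a dict with source, hypothesis, and reference.
data = [
    {
        "src": "Dem Feuer konnte Einhalt geboten werden",
        "mt": "The fire could be stopped",
        "ref": "They were able to control the fire."
    },
    {
        "src": "Schulen und Kindergärten wurden eröffnet.",
        "mt": "Schools and kindergartens were open",
        "ref": "Schools and kindergartens opened"
    }
]

# Returns one score per segment plus a system-level score.
seg_scores, sys_score = model.predict(data, batch_size=8, gpus=1)
print(seg_scores, sys_score)
```

With a reference-free model such as `wmt20-comet-qe-da`, the `ref` field can be omitted.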

We recommend the following two models to evaluate your translations:

These two models were developed to participate in the WMT20 Metrics shared task [(Mathur et al. 2020)](https://aclanthology.org/2020.wmt-1.77.pdf) and, to date, they are the best-performing metrics at segment level on the recently released Google MQM data [(Freitag et al. 2021)](https://arxiv.org/pdf/2104.14478.pdf). Also, in a large-scale study performed by Microsoft Research, these two metrics ranked 1st and 2nd in terms of system-level decision accuracy [(Kocmi et al. 2021)](https://arxiv.org/pdf/2107.10821.pdf).

For more information about the available COMET models, read our metrics descriptions [here](METRICS.md).

## Train your own Metric:

Instead of using pretrained models, you can train your own model with the following command:

```bash
comet-train --cfg configs/models/{your_model_config}.yaml
```

You can then use your own metric to score:

```bash
comet-score -s src.de -t hyp1.en -r ref.en --model PATH/TO/CHECKPOINT
```

**Note:** Please contact ricardo.rei@unbabel.com if you wish to host your own metric among COMET's available metrics!

## unittest:
To run the toolkit tests, use the following command:

```bash
# Run the test suite under coverage, then print the report
# (the discover invocation is assumed from the standard unittest layout).
coverage run --source=comet -m unittest discover
coverage report -m
```

## Publications
If you use COMET, please cite our work! Also, don't forget to say which model you used to evaluate your systems.

- [Are References Really Needed? Unbabel-IST 2021 Submission for the Metrics Shared Task](http://statmt.org/wmt21/pdf/2021.wmt-1.111.pdf)

- [Uncertainty-Aware Machine Translation Evaluation](https://aclanthology.org/2021.findings-emnlp.330/)
