diff --git a/README.md b/README.md
index 8e7cb8a..fea8fe9 100644
--- a/README.md
+++ b/README.md
@@ -48,24 +48,30 @@ Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For S
 
 | Model | #params | Base/Student Model | Teacher Model | Train Dataset | Supervised |
 | --------------------------------------------------------------------------------------------------------------------------- | :-----: | --------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ | :--------: |
-| [SimCSE-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/simcse-indobert-lite-base) | 12M | [IndoBERT Lite Base](https://huggingface.co/indobenchmark/indobert-lite-base-p1) | N/A | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
-| [SimCSE-IndoRoBERTa Base](https://huggingface.co/LazarusNLP/simcse-indoroberta-base) | 125M | [IndoRoBERTa Base](https://huggingface.co/flax-community/indonesian-roberta-base) | N/A | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
 | [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | N/A | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
 | [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 12M | [IndoBERT Lite Base](https://huggingface.co/indobenchmark/indobert-lite-base-p1) | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
 | [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
 | [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 125M | [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
 | [ConGen-Indo-e5 Small](https://huggingface.co/LazarusNLP/congen-indo-e5-small) | 118M | [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
 | [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
-| [S-IndoBERT Base mMARCO](https://huggingface.co/LazarusNLP/s-indobert-base-mmarco) | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | N/A | [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) | ✅ |
 | [all-IndoBERT Base](https://huggingface.co/LazarusNLP/all-indobert-base) | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | N/A | See: [README](./training/all/) | ✅ |
 | [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | N/A | See: [README](./training/all/) | ✅ |
-| [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2) | 125M | [IndoBERT Base p2](https://huggingface.co/indobenchmark/indobert-base-p2) | N/A | See: [README](./training/all/) | ✅ |
+| [all-Indo-e5 Small-v2](https://huggingface.co/LazarusNLP/all-indo-e5-small-v2) | 118M | [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | N/A | See: [README](./training/all/) | ✅ |
 | [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 134M | [DistilBERT Base Multilingual](https://huggingface.co/distilbert-base-multilingual-cased) | mUSE | See: [SBERT](https://www.sbert.net/docs/pretrained_models.html#model-overview) | ✅ |
 | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 125M | [XLM-RoBERTa Base](https://huggingface.co/xlm-roberta-base) | [paraphrase-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-mpnet-base-v2) | See: [SBERT](https://www.sbert.net/docs/pretrained_models.html#model-overview) | ✅ |
 | [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 118M | [Multilingual-MiniLM-L12-H384](https://huggingface.co/microsoft/Multilingual-MiniLM-L12-H384) | See: [arXiv](https://arxiv.org/abs/2212.03533) | See: [🤗](https://huggingface.co/intfloat/multilingual-e5-small) | ✅ |
 | [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 278M | [XLM-RoBERTa Base](https://huggingface.co/xlm-roberta-base) | See: [arXiv](https://arxiv.org/abs/2212.03533) | See: [🤗](https://huggingface.co/intfloat/multilingual-e5-base) | ✅ |
 | [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 560M | [XLM-RoBERTa Large](https://huggingface.co/xlm-roberta-large) | See: [arXiv](https://arxiv.org/abs/2212.03533) | See: [🤗](https://huggingface.co/intfloat/multilingual-e5-large) | ✅ |
+??? example "Deprecated Models"
+
+    | Model | #params | Base/Student Model | Teacher Model | Train Dataset | Supervised |
+    | ---------------------------------------------------------------------------------------- | :-----: | --------------------------------------------------------------------------------- | ------------- | ----------------------------------------------------------------------------- | :--------: |
+    | [SimCSE-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/simcse-indobert-lite-base) | 12M | [IndoBERT Lite Base](https://huggingface.co/indobenchmark/indobert-lite-base-p1) | N/A | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
+    | [SimCSE-IndoRoBERTa Base](https://huggingface.co/LazarusNLP/simcse-indoroberta-base) | 125M | [IndoRoBERTa Base](https://huggingface.co/flax-community/indonesian-roberta-base) | N/A | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
+    | [S-IndoBERT Base mMARCO](https://huggingface.co/LazarusNLP/s-indobert-base-mmarco) | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | N/A | [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) | ✅ |
+    | [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2) | 125M | [IndoBERT Base p2](https://huggingface.co/indobenchmark/indobert-base-p2) | N/A | See: [README](./training/all/) | ✅ |
+
 
 ## Results
 
 ### Semantic Textual Similarity
@@ -74,18 +80,15 @@ Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For S
 
 | Model | Spearman's Correlation (%) ↑ |
 | --------------------------------------------------------------------------------------------------------------------------- | :--------------------------: |
-| [SimCSE-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/simcse-indobert-lite-base) | 44.08 |
-| [SimCSE-IndoRoBERTa Base](https://huggingface.co/LazarusNLP/simcse-indoroberta-base) | 61.26 |
 | [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 70.13 |
 | [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 79.97 |
 | [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 80.47 |
 | [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 81.16 |
 | [ConGen-Indo-e5 Small](https://huggingface.co/LazarusNLP/congen-indo-e5-small) | 80.94 |
 | [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 74.56 |
-| [S-IndoBERT Base mMARCO](https://huggingface.co/LazarusNLP/s-indobert-base-mmarco) | 72.95 |
 | [all-IndoBERT Base](https://huggingface.co/LazarusNLP/all-indobert-base) | 73.84 |
 | [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 76.03 |
-| [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2) | 73.45 |
+| [all-Indo-e5 Small-v2](https://huggingface.co/LazarusNLP/all-indo-e5-small-v2) | 79.57 |
 | [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 75.08 |
 | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | **83.83** |
 | [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 78.89 |
@@ -106,7 +109,7 @@ Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For S
 | [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 40.41 | 47.29 | 40.68 |
 | [all-IndoBERT Base](https://huggingface.co/LazarusNLP/all-indobert-base) | 65.52 | 75.92 | 70.13 |
 | [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 67.18 | 76.59 | 70.16 |
-| [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2) | 60.62 | 71.95 | 66.31 |
+| [all-Indo-e5 Small-v2](https://huggingface.co/LazarusNLP/all-indo-e5-small-v2) | 68.33 | 78.33 | 73.04 |
 | [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 41.35 | 54.93 | 48.79 |
 | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 52.81 | 65.07 | 57.97 |
 | [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 70.20 | 79.61 | 74.80 |
@@ -125,7 +128,7 @@ Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For S
 | [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 76.81 | 83.16 | 85.87 |
 | [all-IndoBERT Base](https://huggingface.co/LazarusNLP/all-indobert-base) | 88.14 | 91.47 | 92.91 |
 | [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 87.61 | 90.91 | 92.31 |
-| [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2) | 87.78 | 91.14 | 92.58 |
+| [all-Indo-e5 Small-v2](https://huggingface.co/LazarusNLP/all-indo-e5-small-v2) | 93.27 | 95.63 | 96.46 |
 | [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 70.44 | 77.94 | 81.56 |
 | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 81.41 | 87.05 | 89.44 |
 | [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 91.50 | 94.34 | 95.39 |
@@ -146,7 +149,7 @@ Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For S
 | [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 55.66 | 54.48 |
 | [all-IndoBERT Base](https://huggingface.co/LazarusNLP/all-indobert-base) | 58.40 | 57.21 |
 | [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 58.31 | 57.11 |
-| [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2) | 60.36 | 59.29 |
+| [all-Indo-e5 Small-v2](https://huggingface.co/LazarusNLP/all-indo-e5-small-v2) | 61.51 | 59.24 |
 | [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 55.99 | 52.44 |
 | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 65.43 | 63.55 |
 | [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 64.16 | 61.33 |
@@ -164,8 +167,8 @@ Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For S
 | [ConGen-Indo-e5 Small](https://huggingface.co/LazarusNLP/congen-indo-e5-small) | 66.92 | 66.29 |
 | [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 61.89 | 60.97 |
 | [all-IndoBERT Base](https://huggingface.co/LazarusNLP/all-indobert-base) | 66.37 | 66.31 |
-| [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 66.02 | 65.s97 |
-| [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2) | 68.90 | 68.88 |
+| [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 66.02 | 65.97 |
+| [all-Indo-e5 Small-v2](https://huggingface.co/LazarusNLP/all-indo-e5-small-v2) | 67.02 | 66.86 |
 | [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 65.25 | 63.45 |
 | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 70.72 | 70.58 |
 | [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 67.92 | 67.23 |
@@ -184,7 +187,7 @@ Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For S
 | [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 61.13 | 61.70 |
 | [all-IndoBERT Base](https://huggingface.co/LazarusNLP/all-indobert-base) | 57.27 | 57.47 |
 | [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 58.86 | 59.31 |
-| [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2) | 57.04 | 57.14 |
+| [all-Indo-e5 Small-v2](https://huggingface.co/LazarusNLP/all-indo-e5-small-v2) | 58.18 | 57.99 |
 | [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 63.63 | 64.13 |
 | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 63.18 | 63.78 |
 | [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 64.54 | 65.04 |
@@ -203,7 +206,7 @@ Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For S
 | [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 82.0 | 76.92 |
 | [all-IndoBERT Base](https://huggingface.co/LazarusNLP/all-indobert-base) | 84.4 | 79.79 |
 | [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 83.4 | 79.04 |
-| [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2) | 84.8 | 80.03 |
+| [all-Indo-e5 Small-v2](https://huggingface.co/LazarusNLP/all-indo-e5-small-v2) | 82.0 | 78.15 |
 | [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 78.8 | 73.64 |
 | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 89.6 | **86.56** |
 | [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 83.6 | 79.51 |
@@ -224,9 +227,9 @@ Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For S
 | [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 59.82 | 53.41 |
 | [all-IndoBERT Base](https://huggingface.co/LazarusNLP/all-indobert-base) | 72.01 | 56.79 |
 | [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 71.36 | 56.83 |
-| [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2) | 69.32 | 54.76 |
+| [all-Indo-e5 Small-v2](https://huggingface.co/LazarusNLP/all-indo-e5-small-v2) | **76.29** | 57.05 |
 | [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 58.48 | 50.50 |
-| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | **74.87** | **57.96** |
+| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 74.87 | **57.96** |
 | [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 63.97 | 51.85 |
 | [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 60.25 | 50.91 |
 | [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 61.39 | 51.62 |
diff --git a/docs/index.md b/docs/index.md
index 8e7cb8a..fea8fe9 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -48,24 +48,30 @@ Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For S
 
 | Model | #params | Base/Student Model | Teacher Model | Train Dataset | Supervised |
 | --------------------------------------------------------------------------------------------------------------------------- | :-----: | --------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ | :--------: |
-| [SimCSE-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/simcse-indobert-lite-base) | 12M | [IndoBERT Lite Base](https://huggingface.co/indobenchmark/indobert-lite-base-p1) | N/A | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
-| [SimCSE-IndoRoBERTa Base](https://huggingface.co/LazarusNLP/simcse-indoroberta-base) | 125M | [IndoRoBERTa Base](https://huggingface.co/flax-community/indonesian-roberta-base) | N/A | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
 | [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | N/A | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
 | [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 12M | [IndoBERT Lite Base](https://huggingface.co/indobenchmark/indobert-lite-base-p1) | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
 | [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
 | [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 125M | [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
 | [ConGen-Indo-e5 Small](https://huggingface.co/LazarusNLP/congen-indo-e5-small) | 118M | [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
 | [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
-| [S-IndoBERT Base mMARCO](https://huggingface.co/LazarusNLP/s-indobert-base-mmarco) | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | N/A | [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) | ✅ |
 | [all-IndoBERT Base](https://huggingface.co/LazarusNLP/all-indobert-base) | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | N/A | See: [README](./training/all/) | ✅ |
 | [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | N/A | See: [README](./training/all/) | ✅ |
-| [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2) | 125M | [IndoBERT Base p2](https://huggingface.co/indobenchmark/indobert-base-p2) | N/A | See: [README](./training/all/) | ✅ |
+| [all-Indo-e5 Small-v2](https://huggingface.co/LazarusNLP/all-indo-e5-small-v2) | 118M | [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | N/A | See: [README](./training/all/) | ✅ |
 | [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 134M | [DistilBERT Base Multilingual](https://huggingface.co/distilbert-base-multilingual-cased) | mUSE | See: [SBERT](https://www.sbert.net/docs/pretrained_models.html#model-overview) | ✅ |
 | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 125M | [XLM-RoBERTa Base](https://huggingface.co/xlm-roberta-base) | [paraphrase-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-mpnet-base-v2) | See: [SBERT](https://www.sbert.net/docs/pretrained_models.html#model-overview) | ✅ |
 | [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 118M | [Multilingual-MiniLM-L12-H384](https://huggingface.co/microsoft/Multilingual-MiniLM-L12-H384) | See: [arXiv](https://arxiv.org/abs/2212.03533) | See: [🤗](https://huggingface.co/intfloat/multilingual-e5-small) | ✅ |
 | [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 278M | [XLM-RoBERTa Base](https://huggingface.co/xlm-roberta-base) | See: [arXiv](https://arxiv.org/abs/2212.03533) | See: [🤗](https://huggingface.co/intfloat/multilingual-e5-base) | ✅ |
 | [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 560M | [XLM-RoBERTa Large](https://huggingface.co/xlm-roberta-large) | See: [arXiv](https://arxiv.org/abs/2212.03533) | See: [🤗](https://huggingface.co/intfloat/multilingual-e5-large) | ✅ |
+??? example "Deprecated Models"
+
+    | Model | #params | Base/Student Model | Teacher Model | Train Dataset | Supervised |
+    | ---------------------------------------------------------------------------------------- | :-----: | --------------------------------------------------------------------------------- | ------------- | ----------------------------------------------------------------------------- | :--------: |
+    | [SimCSE-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/simcse-indobert-lite-base) | 12M | [IndoBERT Lite Base](https://huggingface.co/indobenchmark/indobert-lite-base-p1) | N/A | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
+    | [SimCSE-IndoRoBERTa Base](https://huggingface.co/LazarusNLP/simcse-indoroberta-base) | 125M | [IndoRoBERTa Base](https://huggingface.co/flax-community/indonesian-roberta-base) | N/A | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
+    | [S-IndoBERT Base mMARCO](https://huggingface.co/LazarusNLP/s-indobert-base-mmarco) | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | N/A | [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) | ✅ |
+    | [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2) | 125M | [IndoBERT Base p2](https://huggingface.co/indobenchmark/indobert-base-p2) | N/A | See: [README](./training/all/) | ✅ |
+
 
 ## Results
 
 ### Semantic Textual Similarity
@@ -74,18 +80,15 @@ Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For S
 
 | Model | Spearman's Correlation (%) ↑ |
 | --------------------------------------------------------------------------------------------------------------------------- | :--------------------------: |
-| [SimCSE-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/simcse-indobert-lite-base) | 44.08 |
-| [SimCSE-IndoRoBERTa Base](https://huggingface.co/LazarusNLP/simcse-indoroberta-base) | 61.26 |
 | [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 70.13 |
 | [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 79.97 |
 | [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 80.47 |
 | [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 81.16 |
 | [ConGen-Indo-e5 Small](https://huggingface.co/LazarusNLP/congen-indo-e5-small) | 80.94 |
 | [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 74.56 |
-| [S-IndoBERT Base mMARCO](https://huggingface.co/LazarusNLP/s-indobert-base-mmarco) | 72.95 |
 | [all-IndoBERT Base](https://huggingface.co/LazarusNLP/all-indobert-base) | 73.84 |
 | [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 76.03 |
-| [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2) | 73.45 |
+| [all-Indo-e5 Small-v2](https://huggingface.co/LazarusNLP/all-indo-e5-small-v2) | 79.57 |
 | [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 75.08 |
 | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | **83.83** |
 | [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 78.89 |
@@ -106,7 +109,7 @@ Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For S
 | [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 40.41 | 47.29 | 40.68 |
 | [all-IndoBERT Base](https://huggingface.co/LazarusNLP/all-indobert-base) | 65.52 | 75.92 | 70.13 |
 | [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 67.18 | 76.59 | 70.16 |
-| [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2) | 60.62 | 71.95 | 66.31 |
+| [all-Indo-e5 Small-v2](https://huggingface.co/LazarusNLP/all-indo-e5-small-v2) | 68.33 | 78.33 | 73.04 |
 | [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 41.35 | 54.93 | 48.79 |
 | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 52.81 | 65.07 | 57.97 |
 | [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 70.20 | 79.61 | 74.80 |
@@ -125,7 +128,7 @@ Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For S
 | [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 76.81 | 83.16 | 85.87 |
 | [all-IndoBERT Base](https://huggingface.co/LazarusNLP/all-indobert-base) | 88.14 | 91.47 | 92.91 |
 | [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 87.61 | 90.91 | 92.31 |
-| [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2) | 87.78 | 91.14 | 92.58 |
+| [all-Indo-e5 Small-v2](https://huggingface.co/LazarusNLP/all-indo-e5-small-v2) | 93.27 | 95.63 | 96.46 |
 | [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 70.44 | 77.94 | 81.56 |
 | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 81.41 | 87.05 | 89.44 |
 | [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 91.50 | 94.34 | 95.39 |
@@ -146,7 +149,7 @@ Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For S
 | [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 55.66 | 54.48 |
 | [all-IndoBERT Base](https://huggingface.co/LazarusNLP/all-indobert-base) | 58.40 | 57.21 |
 | [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 58.31 | 57.11 |
-| [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2) | 60.36 | 59.29 |
+| [all-Indo-e5 Small-v2](https://huggingface.co/LazarusNLP/all-indo-e5-small-v2) | 61.51 | 59.24 |
 | [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 55.99 | 52.44 |
 | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 65.43 | 63.55 |
 | [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 64.16 | 61.33 |
@@ -164,8 +167,8 @@ Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For S
 | [ConGen-Indo-e5 Small](https://huggingface.co/LazarusNLP/congen-indo-e5-small) | 66.92 | 66.29 |
 | [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 61.89 | 60.97 |
 | [all-IndoBERT Base](https://huggingface.co/LazarusNLP/all-indobert-base) | 66.37 | 66.31 |
-| [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 66.02 | 65.s97 |
-| [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2) | 68.90 | 68.88 |
+| [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 66.02 | 65.97 |
+| [all-Indo-e5 Small-v2](https://huggingface.co/LazarusNLP/all-indo-e5-small-v2) | 67.02 | 66.86 |
 | [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 65.25 | 63.45 |
 | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 70.72 | 70.58 |
 | [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 67.92 | 67.23 |
@@ -184,7 +187,7 @@ Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For S
 | [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 61.13 | 61.70 |
 | [all-IndoBERT Base](https://huggingface.co/LazarusNLP/all-indobert-base) | 57.27 | 57.47 |
 | [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 58.86 | 59.31 |
-| [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2) | 57.04 | 57.14 |
+| [all-Indo-e5 Small-v2](https://huggingface.co/LazarusNLP/all-indo-e5-small-v2) | 58.18 | 57.99 |
 | [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 63.63 | 64.13 |
 | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 63.18 | 63.78 |
 | [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 64.54 | 65.04 |
@@ -203,7 +206,7 @@ Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For S
 | [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 82.0 | 76.92 |
 | [all-IndoBERT Base](https://huggingface.co/LazarusNLP/all-indobert-base) | 84.4 | 79.79 |
 | [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 83.4 | 79.04 |
-| [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2) | 84.8 | 80.03 |
+| [all-Indo-e5 Small-v2](https://huggingface.co/LazarusNLP/all-indo-e5-small-v2) | 82.0 | 78.15 |
 | [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 78.8 | 73.64 |
 | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 89.6 | **86.56** |
 | [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 83.6 | 79.51 |
@@ -224,9 +227,9 @@ Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For S
 | [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 59.82 | 53.41 |
 | [all-IndoBERT Base](https://huggingface.co/LazarusNLP/all-indobert-base) | 72.01 | 56.79 |
 | [all-IndoBERT Base-v2](https://huggingface.co/LazarusNLP/all-indobert-base-v2) | 71.36 | 56.83 |
-| [all-IndoBERT Base p2](https://huggingface.co/LazarusNLP/all-indobert-base-p2) | 69.32 | 54.76 |
+| [all-Indo-e5 Small-v2](https://huggingface.co/LazarusNLP/all-indo-e5-small-v2) | **76.29** | 57.05 |
 | [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 58.48 | 50.50 |
-| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | **74.87** | **57.96** |
+| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 74.87 | **57.96** |
 | [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 63.97 | 51.85 |
 | [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 60.25 | 50.91 |
 | [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 61.39 | 51.62 |
diff --git a/docs/training/all.md b/docs/training/all.md
index e960181..d7173d9 100644
--- a/docs/training/all.md
+++ b/docs/training/all.md
@@ -34,6 +34,18 @@ python train_all_mnrl.py \
     --learning-rate 2e-5
 ```
 
+### Multilingual e5 Small
+
+```sh
+python train_all_mnrl.py \
+    --model-name intfloat/multilingual-e5-small \
+    --max-seq-length 128 \
+    --num-epochs 5 \
+    --train-batch-size-pairs 384 \
+    --train-batch-size-triplets 256 \
+    --learning-rate 2e-5
+```
+
 ## References
 
 ```bibtex
diff --git a/mkdocs.yml b/mkdocs.yml
index d1d617e..d9cb787 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -47,4 +47,5 @@ markdown_extensions:
       pygments_lang_class: true
   - pymdownx.inlinehilite
   - pymdownx.snippets
+  - pymdownx.details
   - pymdownx.superfences
diff --git a/training/all/README.md b/training/all/README.md
index e960181..d7173d9 100644
--- a/training/all/README.md
+++ b/training/all/README.md
@@ -34,6 +34,18 @@ python train_all_mnrl.py \
     --learning-rate 2e-5
 ```
 
+### Multilingual e5 Small
+
+```sh
+python train_all_mnrl.py \
+    --model-name intfloat/multilingual-e5-small \
+    --max-seq-length 128 \
+    --num-epochs 5 \
+    --train-batch-size-pairs 384 \
+    --train-batch-size-triplets 256 \
+    --learning-rate 2e-5
+```
+
 ## References
 
 ```bibtex
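A note for reviewers: the Semantic Textual Similarity results touched by this patch report Spearman's rank correlation (%), i.e. the Pearson correlation of the ranks of the model's cosine similarities against the gold scores. As a reference, here is a minimal pure-Python sketch of the metric on hypothetical toy scores (no tie handling; the actual evaluation presumably uses a library routine such as `scipy.stats.spearmanr`):

```python
def spearman_corr(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks.

    Sketch only: assumes no tied values, so ranks are a simple permutation.
    """
    def ranks(v):
        # indices sorted by value -> position of each index is its rank
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den


# Hypothetical gold STS scores vs. model cosine similarities
gold = [0.2, 4.5, 3.1, 1.0, 5.0]
pred = [0.10, 0.92, 0.75, 0.30, 0.88]
print(round(spearman_corr(gold, pred) * 100, 2))  # → 90.0, reported as 90.00%
```

Because only the rank ordering matters, the metric is invariant to any monotone rescaling of the cosine similarities, which is why raw (uncalibrated) similarity scores can be compared directly against human ratings.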