Revert "Update tokenizer.py"
This reverts commit 6f1b5d5.

With commit 6f1b5d5 active, underscores weren't being correctly detokenized
as spaces.

https://community.libretranslate.com/t/the-problem-of-translating-symbols-oov-out-of-the-vocabulary/1071/4
PJ-Finlay committed Aug 11, 2024
1 parent 4d6c125 commit 543c50e
Showing 1 changed file with 2 additions and 1 deletion.
3 changes: 2 additions & 1 deletion argostranslate/tokenizer.py
@@ -24,7 +24,8 @@ def encode(self, sentence: str) -> List[str]:
         return tokens
 
     def decode(self, tokens: List[str]) -> str:
-        return self.lazy_processor().decode_pieces(tokens)
+        detokenized = "".join(tokens)
+        return detokenized.replace("▁", " ")
 
 
 class BPETokenizer(Tokenizer):
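
For illustration, the restored decode logic can be exercised without loading a SentencePiece model. The sketch below is a standalone function mirroring the join-then-replace approach in `decode`; the function is a simplified copy rather than the project's class method, and the token list is a made-up example, not output from a real Argos Translate model.

```python
from typing import List

# U+2581, the character SentencePiece uses to mark a word boundary.
# It looks like an underscore but is a different code point from "_".
WORD_BOUNDARY = "\u2581"


def decode(tokens: List[str]) -> str:
    """Join the pieces, then turn every word-boundary marker into a space."""
    detokenized = "".join(tokens)
    return detokenized.replace(WORD_BOUNDARY, " ")


# Hypothetical pieces, chosen by hand for this example.
pieces = ["\u2581Hello", "\u2581wor", "ld", "!"]
print(repr(decode(pieces)))  # ' Hello world!'
```

Because every marker becomes a space, a marker on the first piece leaves a leading space in the result; this sketch does not strip it.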
