Looking for suggestions for XML translation #100

argosopentech · 2021-04-29T11:53:16Z

argosopentech
Apr 29, 2021
Maintainer

I'm looking into adding XML support to Argos Translate (#23).

The difficult part is that tags in the source sentence need to be placed correctly in the target sentence, for example:

Joey is a <b>good</b> dog!
¡Joey es un <b>buen</b> perro!

This clearly needs to be done by the seq2seq model because words within a tag need to be translated in the context of the surrounding words.

I've tried writing some code to normalize tags in the input dataset into a standard format:

Joey likes chasing <i>Sylvilagus floridanus</i> and <i>Marmota monax</i>.
Joey likes chasing <x>Sylvilagus floridanus</x> and <x>Marmota monax</x>.

Joey likes <a href="https://www.wegmans.com">store brand dog food</a>.
Joey likes <x>store brand dog food</x>.

Then, at inference, I could use the standardized tags to place the original tags in the output. The issue is that most of the data I'm currently using for Argos Translate contains only a handful of tags in this format, which is likely not sufficient.
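For illustration, the normalization could look roughly like the sketch below. This is not the code I actually wrote; the regex and the `normalize_tags` name are assumptions made for the example. It only handles paired tags, so self-closing tags such as `<br/>` are left untouched.

```python
import re

# Matches an opening or closing tag such as <i>, </i> or <a href="...">.
# The lookbehind skips self-closing tags like <br/>.
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w:.-]*)(?:\s[^<>]*)?(?<!/)>")


def normalize_tags(text: str, placeholder: str = "x") -> str:
    """Rename every element to `placeholder` and drop its attributes."""
    return TAG_RE.sub(lambda m: f"<{m.group(1)}{placeholder}>", text)


print(normalize_tags('Joey likes <a href="https://www.wegmans.com">store brand dog food</a>.'))
# prints: Joey likes <x>store brand dog food</x>.
```

Dropping the attributes keeps the vocabulary the model has to learn small, at the cost of having to remember the original tags separately so they can be restored after translation.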

My current plan is to try to find/generate more data in this format but any suggestions for better strategies are greatly appreciated!

Reference:
http://www.statmt.org/wmt20/pdf/2020.wmt-1.138.pdf

pierotofy · 2021-04-29T14:28:49Z

pierotofy
Apr 29, 2021

I think the most promising approach is the one described in section 3.1 of the paper.

E.g.

Translate Joey is a good dog! --> ¡Joey es un buen perro!
Pick a random word (e.g. good)
Translate good --> buen
If you find buen in the translation, you can infer that good and buen likely correspond, and you can wrap both of them in tags.
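The steps above could be sketched like this. It is only my reading of the idea, not the paper's implementation: `inject_tags` is a hypothetical helper, and `translate` stands in for whatever model call is available (here a toy dictionary lookup). Matching is done on plain substrings, so a real version would probably want word-boundary checks.

```python
import random
from typing import Callable, Optional, Tuple


def inject_tags(
    source: str,
    target: str,
    translate: Callable[[str], str],  # stand-in for the existing model
    word: Optional[str] = None,
    tag: str = "x",
) -> Optional[Tuple[str, str]]:
    """Wrap one source word and its translation in matching tags.

    Returns None when the correspondence cannot be established.
    """
    if word is None:
        # Pick a random word, ignoring surrounding punctuation.
        candidates = [w.strip(".,;:!?¡¿") for w in source.split()]
        candidates = [w for w in candidates if w]
        if not candidates:
            return None
        word = random.choice(candidates)

    sub_translation = translate(word)
    if not sub_translation:
        return None

    # Require exactly one occurrence on each side so placement is unambiguous.
    if source.count(word) != 1 or target.count(sub_translation) != 1:
        return None

    def wrap(text: str, span: str) -> str:
        return text.replace(span, f"<{tag}>{span}</{tag}>", 1)

    return wrap(source, word), wrap(target, sub_translation)


toy_model = {"good": "buen"}  # toy lookup, not a real translation model
pair = inject_tags(
    "Joey is a good dog!",
    "¡Joey es un buen perro!",
    lambda w: toy_model.get(w, ""),
    word="good",
)
print(pair)
```

Pairs where the sub-translation does not show up in the full translation are simply skipped, which is why this is likely to be slow on real data.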

I can't make sense of what the authors mean by a "pair damage fraction".

1 reply

PJ-Finlay Apr 29, 2021
Collaborator

I had missed that in the paper; good point. I was thinking of using Wiktionary data to find sub translations to inject tags, but we can just use the existing model to identify sub translations.

I think the "pair damage fraction" is injecting invalid tags so the model learns to translate them too. I'm not sure if I want to do that, though; maybe it's useful for HTML, but it seems like it might introduce more problems than it's worth.

PJ-Finlay · 2021-04-30T12:43:15Z

PJ-Finlay
Apr 30, 2021
Collaborator

argosopentech/argos-train@da9e198

I’ve been doing some preliminary testing with tag injection. It’s slow but seems to work well!

Something the paper didn’t mention but appears to be necessary is to incorporate translation confidence (Hypothesis.score) to avoid matching text that the seq2seq model “fixes”.

Data:
Joey was a good dog who was scared of swimming.
Joey era un buen perro que tenía miedo de nadar.

Successful injection:
Joey was a good dog <x>who was scared of swimming</x>.
Joey era un buen perro <x>que tenía miedo de nadar</x>.

"Fix" breaks the injection (but should have a lower translation score):
Joey was a good do<x>g who was scared of swimming</x>.
Joey era un buen perro <x>que tenía miedo de nadar</x>.

This is because "g who was scared of swimming" can be translated by the seq2seq model as "que tenía miedo de nadar".
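A rough sketch of that confidence check, under some assumptions: `translate_with_score` is a hypothetical callable returning the translated text together with a score like `Hypothesis.score` (assumed here to be higher when the model is more confident), and the `min_score` threshold and the scores in the toy table are made up for the example.

```python
from typing import Callable, Optional, Tuple


def confident_sub_translation(
    fragment: str,
    target: str,
    translate_with_score: Callable[[str], Tuple[str, float]],
    min_score: float = -1.0,  # hypothetical threshold, would need tuning
) -> Optional[str]:
    """Return the fragment's translation only if it is confident and
    appears exactly once in the full target sentence."""
    text, score = translate_with_score(fragment)
    if score < min_score:
        # Low score: the model probably "fixed" a broken fragment.
        return None
    if target.count(text) != 1:
        return None
    return text


# Toy scores for illustration only, not real model output.
toy_scores = {
    "who was scared of swimming": ("que tenía miedo de nadar", -0.2),
    "g who was scared of swimming": ("que tenía miedo de nadar", -3.5),
}
target = "Joey era un buen perro que tenía miedo de nadar."

print(confident_sub_translation("who was scared of swimming", target, toy_scores.__getitem__))
print(confident_sub_translation("g who was scared of swimming", target, toy_scores.__getitem__))
```

With these toy numbers the clean fragment is accepted and the "g who was..." fragment is rejected, even though both produce the same Spanish text.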

0 replies

argosopentech · 2021-06-24T13:04:52Z

argosopentech
Jun 24, 2021
Maintainer Author

I was able to generate ~3000 lines of tag-injected data by running on a CPU for a month: https://github.com/argosopentech/tag-injection

2 replies

pierotofy Jun 24, 2021

Nice!

Is that sufficient for training, or is more data needed? I might be able to help generate more.

PJ-Finlay Jun 24, 2021
Collaborator

To get this production ready we would need to:

Train a new model using this data.
Write the code to break up text, run inference on it, and rebuild the XML structure.
Generate data for other languages.
Train new models with tag data.
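As a very rough sketch of the inference step in that list (nothing like this exists yet): the original tags are swapped for `<x>` placeholders, the text is translated, and the tags are put back in order. `translate_markup` is a hypothetical name, `translate` stands in for a model trained on tag-injected data, and the sketch assumes the model keeps the placeholders in the same order, which will not hold when a translation reorders tagged spans.

```python
import re
from typing import Callable, List

# Paired opening/closing tags; self-closing tags are left alone.
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w:.-]*)(?:\s[^<>]*)?(?<!/)>")
PLACEHOLDER_RE = re.compile(r"<(/?)x>")


def translate_markup(text: str, translate: Callable[[str], str]) -> str:
    """Translate text with inline tags and restore the original tags."""
    originals: List[str] = []

    def to_placeholder(match: "re.Match[str]") -> str:
        originals.append(match.group(0))  # remember the tag with attributes
        return f"<{match.group(1)}x>"

    normalized = TAG_RE.sub(to_placeholder, text)
    translated = translate(normalized)

    # The output must contain the same open/close sequence as the input.
    expected = ["/" if tag.startswith("</") else "" for tag in originals]
    if PLACEHOLDER_RE.findall(translated) != expected:
        raise ValueError("translation did not preserve the placeholder tags")

    restored = iter(originals)
    return PLACEHOLDER_RE.sub(lambda m: next(restored), translated)


# Stub standing in for a tag-aware model.
stub = {"Joey is a <x>good</x> dog!": "¡Joey es un <x>buen</x> perro!"}
print(translate_markup("Joey is a <b>good</b> dog!", stub.__getitem__))
# prints: ¡Joey es un <b>buen</b> perro!
```

Raising on a mismatch lets the caller fall back to translating the text without tags rather than emitting broken markup.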

I'm currently planning to do few-shot translation with an API model provider and then come back to this. Since model training is time-consuming and expensive, I'm planning to train new models all at once for Argos Translate 2.0, along with other potentially breaking changes like removing the tokenizer. If anyone is interested in working on this, we could train a test model and test running inference before scaling up to more languages.
