Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision

Status: This week

Authors: Hao Tan, Mohit Bansal

Topic: Image, Text, Transformers

Category: Multimodal

Conference: EMNLP

Year: 2020

Link: https://arxiv.org/abs/2010.06775

Questions

What did the authors try to accomplish?

What were the key elements of the approach?

What can you use yourself from this paper?

What other references should you follow?