
Question about model input at training and inference time. #5

Open
kimwongyuda opened this issue Nov 15, 2022 · 2 comments


kimwongyuda commented Nov 15, 2022

Let's take an example "Allie drove to Boston for a meeting."

When I pre-train UCTopic, the model takes input_ids as [0, 50264, 324, 4024, 7, 2278, 13, 10, 529, 4, 2] (where the "Allie" span may also be left unchanged, with some unchange probability) and entity_ids as [2] (the mask token of the entity embedding).

Then, the model computes the contrastive losses using the hidden state of the entity_ids token [2], roughly as sketched below.
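For concreteness, here is a rough sketch of the pre-training inputs as I understand them (the token ids are the ones from my example above; the span position and variable names are only for illustration, and UCTopic's actual masking code may differ):

```python
MASK_TOKEN_ID = 50264    # RoBERTa <mask> token
MASK_ENTITY_ID = 2       # LUKE [MASK] entity

# "Allie drove to Boston for a meeting." with the "Allie" span replaced by <mask>
input_ids = [0, MASK_TOKEN_ID, 324, 4024, 7, 2278, 13, 10, 529, 4, 2]

# The entity-side input is always the [MASK] entity; the span itself is
# identified only by its token positions, not by an entry in the entity vocab.
entity_ids = [MASK_ENTITY_ID]
entity_position_ids = [[1]]  # token position(s) of the masked phrase (illustrative)
```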

However, the entity embedding from LUKE does not contain all entities. Moreover, it only holds information about entities, not about general noun phrases.

  1. Therefore, I suspect that this hidden state is weak when an unseen entity or a general noun phrase is given as input. Is that right? ("Allie" also does not appear in LUKE's entity vocabulary.)

However, when I analyzed your code, I found that the model always takes entity_ids as [2] at the inference phase (clustering or topic mining) as well as at the training phase.

  1. So, just as the [CLS] token of BERT represents all tokens in a sentence, does the token [2] (the mask token) represent the entity tokens in input_ids?
  2. Also, since the model only uses the mask token from the entity vocabulary, can the model deal with unseen entities or general noun phrases? (So we don't need to worry about the first question?)

Thank you.

@JiachengLi1995 (Owner) commented:

  1. I believe the entity embedding table is only used in the LUKE pre-training stage. In the Hugging Face implementation, we don't use that table; instead, the model uses entity positions to represent entities, so seen vs. unseen entities are not a problem (see the sketch after this list). For more details, please refer to the Hugging Face documentation and the LUKE paper.
  2. The token [2] used in your example represents only the entity 'Allie'.
  3. UCTopic only masks entity tokens during pre-training. For other usages such as inference, we don't mask any tokens in the sentences, so UCTopic is not restricted to the mask token of the entity vocabulary. The model can deal with unseen entities because of this masking strategy during pre-training.
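For example, here is a quick sketch with the plain Hugging Face LUKE classes (not UCTopic's exact code; the model name and span are illustrative). The span is passed by character positions, so whether "Allie" exists in LUKE's entity vocabulary never matters:

```python
from transformers import LukeTokenizer, LukeModel

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
model = LukeModel.from_pretrained("studio-ousia/luke-base")

text = "Allie drove to Boston for a meeting."
entity_spans = [(0, 5)]  # character span of "Allie"

# When entity_spans are given without entity names, the tokenizer fills the
# entity sequence with the [MASK] entity and encodes the span location
# through entity_position_ids.
inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
print(inputs["entity_ids"])  # expected: tensor([[2]]), the [MASK] entity

outputs = model(**inputs)
span_repr = outputs.entity_last_hidden_state[:, 0]  # hidden state for the span
```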

@kimwongyuda (Author) commented:

Thank you for your explanation!
