Let's take an example: "Allie drove to Boston for a meeting."
When I pretrain UCTopic, the model takes input_ids as [0, 50264, 324, 4024, 7, 2278, 13, 10, 529, 4, 2] ("Allie" may also be left unchanged, with the unchanged-token probability) and entity_ids as [2] (the mask token of the entity embedding).
Then, the model computes the contrastive losses using the hidden state of the entity_ids token [2].
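For reference, this is how I reproduce those inputs; a minimal sketch using Hugging Face's LukeTokenizer (I assume UCTopic's preprocessing behaves like the stock tokenizer, and the word-level masking comment is only my reading of the pre-training setup):

```python
from transformers import LukeTokenizer

# Assumption: UCTopic's preprocessing is close to the stock LukeTokenizer.
tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")

text = "Allie drove to Boston for a meeting."
entity_spans = [(0, 5)]  # character span of "Allie"

# When no `entities` argument is given, the tokenizer fills entity_ids with
# the [MASK] entity (id 2 in the LUKE entity vocab).
encoding = tokenizer(text, entity_spans=entity_spans)
print(encoding["input_ids"])   # word-piece ids for the sentence (RoBERTa vocab)
print(encoding["entity_ids"])  # [2] -> the [MASK] entity embedding

# Pre-training (my understanding): the word tokens inside the entity span may
# additionally be replaced by <mask> (id 50264), or kept unchanged with some
# probability, giving input_ids like the example above.
```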
However, the entity embedding from LUKE doesn't cover all entities.
Also, the entity embedding only carries information about entities, not about general noun phrases.
Therefore, I guess that this hidden state is weak when an unseen entity or a general noun phrase is given as input. Is that right? ("Allie" also doesn't appear in LUKE's entity vocab.)
However, when I analyzed your code, the model always takes entity_ids as [2] at the inference phase (clustering or topic mining) as well as at the training phase.
So, just as the CLS token of BERT represents all tokens in a sentence, does the token [2] (the mask token) represent the entity tokens in input_ids?
Also, since the model only uses the mask token from the entity vocab, can the model deal with unseen entities or general noun phrases? (So we don't need to worry about the first question?)
Thank you.
I guess the entity embedding table is only used for the LUKE pre-training stage. In the Hugging Face implementation, we don't use that table; instead, the model uses entity positions to represent entities. Seen or unseen entities are not a problem. For more details, please refer to the Hugging Face docs and the LUKE paper.
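For example, with the Hugging Face LUKE API, the span representation is computed from the [MASK] entity id plus the span's position ids, so no per-entity table lookup is needed (the checkpoint name and span below are just for illustration):

```python
import torch
from transformers import LukeTokenizer, LukeModel

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
model = LukeModel.from_pretrained("studio-ousia/luke-base")

# "Allie" is not in LUKE's entity vocab, but that doesn't matter here:
# entity_ids is just the [MASK] entity, and entity_position_ids tells the
# model which word tokens the span covers.
text = "Allie drove to Boston for a meeting."
encoding = tokenizer(text, entity_spans=[(0, 5)], return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)

# Contextualized representation of the span, computed from its positions
# rather than looked up in the entity embedding table.
span_repr = outputs.entity_last_hidden_state  # shape: (1, 1, hidden_size)
```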
The token used in your example represents only the entity 'Allie'.
In UCTopic, only entity tokens are masked during pre-training. For other usages like inference, we don't mask any tokens in the sentences. Hence, UCTopic is not restricted to the mask tokens of the entity vocab. The model can deal with unseen entities because of the masking strategy during pre-training.
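Conceptually, that masking step looks roughly like the sketch below; the probability value and the per-span keep/mask decision are placeholders, not the exact UCTopic code:

```python
import random

MASK_WORD_ID = 50264  # RoBERTa <mask> token id
KEEP_PROB = 0.15      # placeholder "unchanged" probability, not the paper's value

def mask_entity_span(input_ids, span_positions, training=True):
    """Pre-training: replace the entity's word tokens with <mask>, except that
    with some probability the span is left unchanged. Inference: no masking."""
    ids = list(input_ids)
    if training and random.random() > KEEP_PROB:
        for pos in span_positions:
            ids[pos] = MASK_WORD_ID
    return ids
```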