Hi,
Thanks for sharing your implementation of the RL chatbot.
I might ask stupid questions since I am not an expert in RL nor in NLP, so sorry in advance!
1- In python/RL/train.py
At line 307, saver.restore(sess, os.path.join(model_path, model_name)) seems to initialize the weights of the model with some pretrained params, correct? Are these the ones given by the seq2seq trained as usual in a supervised way? I don't find the 'model-55' you are using for this anywhere... Am I missing something?
2- In python/RL/rl_model.py
Why do we have both build_model and build_generator? They seem to have the same setup but not the same output. Is this RL-specific?
3- In the paper
Also, they specify that for the reward they use a seq2seq model and not the RL model. Is this taken into account in your code?
Thanks a lot for your answers!
If the checkpoint exists, the saver restores the trained parameters from it.
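For reference, this follows the usual TF1 restore pattern; a minimal sketch, assuming model_path and model_name point at the checkpoint produced by the supervised seq2seq phase (e.g. the 'model-55' prefix), with a stand-in variable instead of the real graph:

```python
import os
import tensorflow as tf

# Stand-in for the model's trainable parameters (the real graph is built elsewhere).
weights = tf.get_variable('weights', shape=[128, 128])

model_path = 'model/seq2seq'   # assumed checkpoint directory
model_name = 'model-55'        # assumed checkpoint prefix

saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())       # start from random weights
    ckpt = tf.train.get_checkpoint_state(model_path)
    if ckpt and ckpt.model_checkpoint_path:
        # Overwrite the random weights with the pretrained seq2seq parameters.
        saver.restore(sess, os.path.join(model_path, model_name))
    # Without a checkpoint, RL training starts from the random initialization.
```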
build_model constructs the graph for training, while build_generator constructs the graph for inference.
Most parts of the two graphs are the same; keeping them as two separate graphs makes development easier.
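To illustrate the split, here is a minimal sketch of the two-graph pattern (not the repository's actual rl_model.py code): both functions build on the same variable scope, so they share weights, but one returns a loss and the other returns generated word ids.

```python
import tensorflow as tf

def _decoder_logits(encoder_states, vocab_size, reuse):
    # Shared parameters: both graphs are built inside the same variable scope.
    with tf.variable_scope('seq2seq', reuse=reuse):
        return tf.layers.dense(encoder_states, vocab_size)

def build_model(encoder_states, targets, vocab_size):
    # Training graph: cross-entropy loss against the ground-truth target words.
    logits = _decoder_logits(encoder_states, vocab_size, reuse=False)
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits))

def build_generator(encoder_states, vocab_size):
    # Inference graph: same weights (reuse=True), outputs generated word ids.
    logits = _decoder_logits(encoder_states, vocab_size, reuse=True)
    return tf.argmax(logits, axis=-1)
```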
In the paper, they first train the model with seq2seq until convergence, then use policy gradient to continue training it. The seq2seq and RL graphs are similar, but the reward function is only used for the latter.
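A hedged sketch of what such a policy-gradient (REINFORCE-style) step can look like, with made-up placeholder shapes and names; the reward terms themselves are the ones defined in the paper:

```python
import tensorflow as tf

# Stand-in for the decoder's per-step logits (batch of 4 responses, 7 steps, 1000-word vocab).
logits = tf.get_variable('logits', shape=[4, 7, 1000])
sampled_words = tf.placeholder(tf.int32, [4, 7])   # word ids actually sampled by the policy
reward = tf.placeholder(tf.float32, [4])           # one reward per sampled response

# log p(w_t | history) of the sampled words.
log_probs = -tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=sampled_words, logits=logits)

# REINFORCE: scale the log-likelihood of each sampled response by its reward and maximize it.
pg_loss = -tf.reduce_mean(reward * tf.reduce_sum(log_probs, axis=1))
train_op = tf.train.AdamOptimizer(1e-4).minimize(pg_loss)
```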
Thanks a lot for your answers, but I still don't get number 3.
1 - Which weights are used for the reward? The ones of the Seq2Seq model after convergence, or the ones of the policy that are being updated?
2 - When I test the RL method with model-56-3000, I don't get the same results as you show in the README. Is that normal?
3 - Here you have a file with sentences. You don't have an incoming flow of data; does it act as a replay memory?