RL training #14

Open
estelleaf opened this issue Jun 25, 2018 · 2 comments

estelleaf commented Jun 25, 2018

Hi,
Thanks for sharing your implementation of the RL chatbot.
I may be asking basic questions, since I am not an expert in RL or NLP, so sorry in advance!
1- In python/RL/train.py
At line 307, saver.restore(sess, os.path.join(model_path, model_name)) seems to initialize the model weights from pretrained parameters, correct? Are these the parameters of the seq2seq model trained in the usual supervised way? I can't find the 'model-55' checkpoint you use for this anywhere... Am I missing something?

2- In python/RL/rl_model.py
Why are there both build_model and build_generator? They seem to have the same setup but not the same output. Is this RL-specific?

3- In the paper
Also, the paper specifies that the reward is computed with a seq2seq model and not with the RL model. Is this taken into account in your code?

Thanks a lot for your answers!

Owner

pochih commented Jul 24, 2018

  1. If the checkpoint exists, the saver restores the trained parameters from it.

  2. build_model constructs the graph for training; build_generator constructs the graph for inference.
    Most parts of the two graphs are the same.
    Keeping the two graphs separate makes development easier.

  3. In the paper, they first train the model with seq2seq until convergence, then train it further with policy gradient. The seq2seq and RL graphs are similar, but the reward function is used for the latter.
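
To make points 2 and 3 concrete, here is a minimal, framework-free sketch. It is not the code from this repository: `ToyDecoder`, `token_log_probs`, `generate`, `supervised_loss`, and `policy_gradient_loss` are hypothetical names, and the reward-scaled loss is a generic REINFORCE-style surrogate assumed for illustration only. It shows one set of weights used by a training-style pass (scoring given targets) and an inference-style pass (greedy decoding), and how a reward can scale the same log-likelihood term.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ToyDecoder:
    """Hypothetical stand-in for the real model: weights[prev][next] is a logit."""

    def __init__(self, weights):
        self.weights = weights  # the same parameters serve both passes below

    def token_log_probs(self, target_tokens, start=0):
        """Training-style pass: log-probability of each given target token."""
        prev, log_probs = start, []
        for tok in target_tokens:
            probs = softmax(self.weights[prev])
            log_probs.append(math.log(probs[tok]))
            prev = tok  # teacher forcing: condition on the ground-truth token
        return log_probs

    def generate(self, max_len, start=0):
        """Inference-style pass: greedy decoding, no targets required."""
        prev, out = start, []
        for _ in range(max_len):
            row = self.weights[prev]
            prev = max(range(len(row)), key=row.__getitem__)
            out.append(prev)
        return out


def supervised_loss(log_probs):
    """Mean negative log-likelihood (the usual seq2seq objective)."""
    return -sum(log_probs) / len(log_probs)


def policy_gradient_loss(log_probs, reward):
    """Assumed REINFORCE-style surrogate: the same term, scaled by a scalar reward."""
    return -reward * sum(log_probs) / len(log_probs)


model = ToyDecoder([[0.0, 2.0, 0.0], [0.0, 0.0, 3.0], [1.0, 0.0, 0.0]])
sampled = model.generate(3)               # inference-style pass
log_probs = model.token_log_probs(sampled)  # training-style pass on the same weights
print(sampled, supervised_loss(log_probs), policy_gradient_loss(log_probs, reward=0.5))
```

In this sketch a reward of 1.0 makes the two losses identical, which is one way to read "the graphs are similar, but the reward function is used for the latter". How the reward is actually computed is outside the scope of the sketch.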

Author

estelleaf commented Jul 24, 2018

Thanks a lot for your answers, but I still don't understand point 3.
1 - Which weights are used to compute the reward? Those of the seq2seq model after convergence, or those of the policy that is being updated?
2 - When I test the RL method, I don't get the same results as you show in the README when using model-56-3000. Is that expected?
3 - Here you have a file of sentences rather than an incoming stream of data. Does it act as a replay memory?
