-
Notifications
You must be signed in to change notification settings - Fork 35
API proposal #2
Comments
We should also have a Also we should discuss "mandatory" fields for an environment. |
Do you guys know about https://github.com/JuliaPOMDP/POMDPs.jl ? POMDPs are a pretty related problem to RL. could be some synergy there |
Thanks... I didn't know about that org. Certainly related... I'll have to On Monday, August 8, 2016, Christof Stocker notifications@github.com
|
After reviewing what's in JuliaPOMDP, I'd like to move forward with what we've already discussed, and possibly write some (conditionally included) code to link our abstractions to the JuliaPOMDP world. They have large web of interdependencies, most of which are not registered packages, and their own package manager to handle it. I'm weary of depending on that, and I'd prefer to make our own path. In addition, they have much more of a focus (seemingly) on finite, table-based, offline solvers, and I don't know if there's really that much overlap beyond some of the verbs and test environments. tl;dr Someday we should revisit whether we can merge, but for now I think it's easier to stay arms-length. |
@spencerlyon2 I'm thinking about how/where we could convey that there are no more steps to take (i.e. the episode is done). I wonder if we can fit environments into the iteration pattern, or maybe that an "episode" is a iterator/wrapper around an environment and policy: immutable Episode{E,P}
env::E
policy::P
total_reward::Float64
niter::Int
end
# todo: start, next, done definitions
for (s,a,r) in Episode(env, policy)
# do something, or not?
end |
Also thinking we need the state as input to step, and that it should be mutating: r, s′ = step!(env, s, a) I'd like for a "sarsa" update to make sense: # assume we have (s,a) for this iteration already (it's last iteration's a′)
r, s′ = step!(env, s, a)
done(env) && break # ??
a′ = action(policy, r, s′, actions(env, s′) |
What would mutate in a call to |
I just wanted to say that I am hugely excited about the prospect of machine learning API with different backends similar to the absolutely amazing Plots.jl. I am not good enough to contribute substantially to this except from a users point of view, but hugs to you all for this :-) |
Presumably the call to |
Thanks... yeah I hope all works out well.
If you have time to contribute, we could use help with QA... either writing tests, or just using and reporting back on problems/suggestions! |
I made many of the changes we discussed, and wrote out a bare minimum readme. Please review when you have time @spencerlyon2. No hurry. |
FWIW I think the current API is very easy to understand, and I think it will be quite easy for users to use it for their own extensions. One thing I am not quite sure of is that of |
@pkofod I agree. maybe this line could be |
Yes, something like that. Would check_constraints then potentially be a recursion, or? I mean a new Edit: alternatively something like function Base.next(ep::Episode, i)
env = ep.env
s = state(env)
passed = false
while !passed
a = action(ep.policy, reward(env), s, actions(env))
passed = check_constraints(env, s, a)
end
r, s′ = step!(env, s, a)
ep.total_reward += r
ep.niter = i + 1
(s, a, r, s′), i+1
end |
I'd be ok throwing an error too. In theory the agent should be choosing On Thursday, September 15, 2016, Patrick Kofod Mogensen <
|
When I wrote that bit I was thinking that we we should throw an error. Really I want a way to communicate that the action space maybe current state dependent. |
The agent is given a valid action set in the line before, so I think it's On Thursday, September 15, 2016, Spencer Lyon notifications@github.com
|
Of course. My question is then, is it really necessary? Is it just "to be sure"? I mean, it will only error if the |
@tbreloff I'm probably missing something obvious here, but In that routine I don't see where we get an action set based on the current state. What I have in mind is that each period I think that a better way to do this would be to define Thoughts? |
Makes sense to have env in actions to constrain the actions before picking one . |
In my mind that's already true because I assume the environment knows the The check_constraints could just '@Assert a in A' Also I can't remember if I switched it yet but I added "math sets" to On Friday, September 16, 2016, Spencer Lyon notifications@github.com
|
Ok I didn't understand that. We should make a note somewhere so that is very clear to users (and devs). I can't recall, do we already have a way to describe the action space for the env (the |
Note that the action space is different from the set of appropriate actions, given the state. Sorry for the double message, I just didn't think I made that clear |
I don't remember if we distinguish actions from action_space, but that On Friday, September 16, 2016, Spencer Lyon notifications@github.com
|
I'm going to make the following API changes: actions(env) --> A
done(env) --> bool
# becomes:
actions(env, s) --> A
finished(env, s′) --> bool and I'll change check_constraints into an assert. |
ref: f21b5cf |
I want to link the solvers to the work I'm doing in StochasticOptimization, so there's still a lot of changes to be made here, but hopefully one could develop the environments without too much changing. |
After a chat with @tbreloff, we roughly decided on the following core API methods:
actions(::AbstractEnvironment, s::State) -> A(s)
: This would return the setA(s)
that contains all valid actions from the states
step(::AbstractEnvironment, a::Action) -> s', r
: This method returns the next states'
and the rewardr
associated with taking actiona
. Note that the current states
is a field of theAbstractEnvironment
. This method is also responsible for setting this state field to bes'
action(::AbstractPolicy, s'::State, r::Reward, A(s')::Actions) -> a'
: This method should be implemented by each subtype ofAbstractPolicy
and should observe a state transition (result of the callingstep
method with the previous actiona
) and output the next actiona'
.@tbreloff also mentioned an
episode(::AbstractEnvironment)
method, but I wasn't clear on its purpose so I'll let him fill in the blanks there.The text was updated successfully, but these errors were encountered: