Objective: To solve a prison-break problem using machine learning algorithms
Meet Steve, a prisoner with a reputation for escaping from prison.
He is held in a prison made up of special rooms.
At each step, Steve can move only one square up, down, left, or right, provided he does not hit a wall. (The red lines on the map are walls, and it is impossible to cross them.)
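The movement rule can be sketched as follows. This is only an illustration: the grid size, the `WALLS` set, and the `move` helper are hypothetical and do not reproduce the actual prison map. It also assumes that a blocked move leaves Steve where he is.

```python
# Hypothetical 3x3 grid; the real prison map is not reproduced here.
N_ROWS, N_COLS = 3, 3

# Row/column offsets for the four allowed moves.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# Each wall is stored as the unordered pair of squares it separates
# (illustrative positions only).
WALLS = {frozenset({(0, 0), (0, 1)}), frozenset({(1, 1), (2, 1)})}


def move(square, action):
    """Return the square reached from `square`, or `square` itself if blocked."""
    d_row, d_col = ACTIONS[action]
    target = (square[0] + d_row, square[1] + d_col)
    inside = 0 <= target[0] < N_ROWS and 0 <= target[1] < N_COLS
    if not inside or frozenset({square, target}) in WALLS:
        return square  # assumption: hitting a wall or the border means no move
    return target
```

Storing a wall as a pair of squares makes the check symmetric, so the same entry blocks movement in both directions.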
The heavy weight chained to Steve's foot not only makes it impossible for him to climb the walls, it also hampers his movement across the squares of the map.
Steve first has to get hold of a key.
As the prison map shows, there are two keys in the prison, and either one opens both solitary-confinement doors.
Steve needs to pick up one of these keys and then reach the detainee held in solitary confinement.
Cameras are installed in some squares; if Steve steps into one of these squares, his image is recorded and he is tortured for roaming the prison.
As the map shows, the prison is guarded. If Steve enters a square where the guards are currently located, he is arrested and locked away for the rest of his life, and he will never again get the chance to practice this honorable profession!
At any moment, the guards are in one of the four squares marked with the * sign, each with equal probability.
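One way to read the guard rule is that the guards' position is redrawn uniformly at every step, so a marked square is dangerous with probability 1/4 each time Steve stands on it. A minimal sketch of that reading is shown here; the coordinates in `GUARD_SQUARES` and both helper names are hypothetical placeholders, not the real map.

```python
import random

# Hypothetical coordinates of the four squares marked with * on the map.
GUARD_SQUARES = [(1, 3), (2, 5), (4, 2), (5, 4)]


def sample_guard_square(rng=random):
    """Draw the guards' current square uniformly from the four marked squares."""
    return rng.choice(GUARD_SQUARES)


def is_caught(steve_square, rng=random):
    """True if Steve's square coincides with the guards' square at this step."""
    return steve_square == sample_guard_square(rng)
```

Passing a seeded `random.Random` instance as `rng` makes episodes reproducible when debugging the agent.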
To keep the agent from looping or getting confused, a reward of -1 is assigned to every move. Steve chooses among the four directions with a Boltzmann policy, starting with a high temperature that is then gradually reduced.
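The Boltzmann (softmax) policy with a decreasing temperature can be sketched as below. The function names and the exponential cooling schedule with its parameters (`t_start`, `t_min`, `decay`) are assumptions made for illustration; the text above only says that the temperature starts high and is then reduced.

```python
import math
import random


def boltzmann_probs(q_values, temperature):
    """Softmax over action values; a higher temperature gives a flatter distribution."""
    q_max = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def choose_action(q_values, temperature, rng=random):
    """Sample an action index according to the Boltzmann probabilities."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


def annealed_temperature(episode, t_start=10.0, t_min=0.1, decay=0.99):
    """Hypothetical exponential cooling schedule, floored at t_min."""
    return max(t_min, t_start * decay ** episode)
```

At a high temperature the four directions are chosen almost uniformly (exploration); as the temperature drops, the choice concentrates on the direction with the highest estimated value (exploitation).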
When lambda is zero, SARSA(lambda) and Q(lambda) reduce to ordinary one-step SARSA and Q-learning, which look only at the next state. When lambda is one, the result is the same as the Monte Carlo method, which waits until the end of the episode.
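The role of lambda can be made concrete with a tabular SARSA(lambda) update using accumulating eligibility traces. This is a generic textbook-style sketch, not necessarily the implementation used in this project; the function name, the dictionary-based tables, and the default values of `alpha`, `gamma`, and `lam` are assumptions. A Q-learning variant would bootstrap from the greedy next action instead of the action actually taken.

```python
from collections import defaultdict


def sarsa_lambda_update(Q, E, s, a, r, s_next, a_next, done,
                        alpha=0.1, gamma=0.9, lam=0.5):
    """Apply one SARSA(lambda) step with accumulating eligibility traces.

    Q and E map (state, action) pairs to action values and traces.
    """
    target = r if done else r + gamma * Q[(s_next, a_next)]
    delta = target - Q[(s, a)]      # one-step TD error
    E[(s, a)] += 1.0                # mark the pair that was just visited
    for key in list(E):
        Q[key] += alpha * delta * E[key]
        E[key] *= gamma * lam       # lam = 0 wipes every trace after one step
    return delta


# Usage: tables default to 0.0 for unseen (state, action) pairs.
Q, E = defaultdict(float), defaultdict(float)
sarsa_lambda_update(Q, E, "start", "right", -1, "next", "up", done=False)
```

With `lam=0` only the pair just visited is updated, while with `lam=1` and `gamma=1` every earlier pair in the episode keeps receiving the full error, which is how the update approaches Monte Carlo behavior.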
There is a tradeoff between speed and performance in such problems, and it can be controlled through lambda.
The Monte Carlo method gives the best result, but it requires the most processing power and the longest delay.
With policy iteration, the agent can find the optimal policy.