The code below is an example of policy evaluation for a very simple task. The example is taken from the book "Reinforcement Learning: An Introduction" by Sutton and Barto.
#!/usr/local/bin/python
"""
This is an example of policy evaluation for a random walk policy.
Example 6.2: Random Walk from the book:
"Reinforcement Learning: An Introduction, Sutton and Barto"
The policy is evaluated by dynamic programming and TD(0).
In this example, an episode can start in any of the five states 1, 2, 3, 4, 5 and
ends in one of the two states 0 and 6. The allowed transitions between the states are as follows:
0 <-> 1 <-> 2 <-> 3 <-> 4 <-> 5 <-> 6
The reward for ending in the state 6 is 1.
The reward for ending in the state 0 is 0.
In any state, except the final states, you can take two actions: 'left' and 'right'.
In the final states the policy and episodes end.
Because this example implements the random walk policy, both actions can be
taken with th…
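
The task described above (seven states in a row, reward 1 only for ending in state 6, actions 'left' and 'right' chosen at random) can be illustrated with a minimal sketch of both evaluation methods. This is not the original script: the function names `step`, `evaluate_dp` and `evaluate_td0`, the step size `alpha`, the episode count, the initial value of 0.5, the uniformly random start state, and the absence of discounting are all assumptions made for illustration.

```python
import random

N_STATES = 7          # states 0..6; states 0 and 6 are terminal
TERMINALS = (0, 6)
ACTIONS = (-1, 1)     # 'left' and 'right'


def step(state, action):
    """Move one state left or right; the reward is 1 only on reaching state 6."""
    next_state = state + action
    reward = 1.0 if next_state == 6 else 0.0
    return next_state, reward


def evaluate_dp(theta=1e-10):
    """Iterative policy evaluation for the equiprobable policy (assumed undiscounted)."""
    values = [0.0] * N_STATES
    while True:
        delta = 0.0
        for state in range(1, 6):
            new_value = 0.0
            for action in ACTIONS:
                next_state, reward = step(state, action)
                new_value += 0.5 * (reward + values[next_state])
            delta = max(delta, abs(new_value - values[state]))
            values[state] = new_value
        if delta < theta:
            return values


def evaluate_td0(episodes=10000, alpha=0.01, seed=0):
    """Tabular TD(0) estimate of the same values from sampled episodes."""
    rng = random.Random(seed)
    values = [0.0] * N_STATES
    for state in range(1, 6):
        values[state] = 0.5               # assumed initial guess for non-terminal states
    for _ in range(episodes):
        state = rng.randint(1, 5)         # assumption: start state drawn uniformly from 1..5
        while state not in TERMINALS:
            action = rng.choice(ACTIONS)  # random walk: both actions equally likely
            next_state, reward = step(state, action)
            # TD(0) update: move V(s) toward the one-step target r + V(s')
            values[state] += alpha * (reward + values[next_state] - values[state])
            state = next_state
    return values


if __name__ == "__main__":
    print("DP   :", [round(v, 3) for v in evaluate_dp()[1:6]])
    print("TD(0):", [round(v, 3) for v in evaluate_td0()[1:6]])
```

Under these assumptions the dynamic programming values for states 1 to 5 converge to 1/6, 2/6, 3/6, 4/6 and 5/6, the true values given in the book for this example. The TD(0) estimates fluctuate around those values, and the size of the fluctuation depends on `alpha` and the number of episodes.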