Temporal-Difference Learning Policy Evaluation in Python

The code below is an example of policy evaluation for a very simple task. The example is taken from the book "Reinforcement Learning: An Introduction" by Sutton and Barto.

#!/usr/local/bin/python
"""
This is an example of policy evaluation for a random walk policy.
Example 6.2: Random Walk from the book
"Reinforcement Learning: An Introduction", Sutton and Barto.

The policy is evaluated by dynamic programming and TD(0).

In this example, the policy can start in five states 1, 2, 3, 4, 5
and end in two states 0 and 6. The allowed transitions between the
states are as follows:

    0 <-> 1 <-> 2 <-> 3 <-> 4 <-> 5 <-> 6

The reward for ending in state 6 is 1. The reward for ending in
state 0 is 0. In any state except the final states, you can take two
actions: 'left' and 'right'. In the final states the episode ends.
Because this example implements the random walk policy, both actions
are taken with equal probability.
"""
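A minimal sketch of the two evaluation methods the docstring describes might look like the following. The step size alpha = 0.1, the start state 3, the convergence threshold theta, and the function names td0_evaluate and dp_evaluate are my own assumptions for illustration, not identifiers from the original listing.

import random

N_STATES = 7                       # states 0..6; 0 and 6 are terminal

def td0_evaluate(episodes=1000, alpha=0.1):
    """Estimate state values under the random walk policy with TD(0)."""
    V = [0.5] * N_STATES           # arbitrary initial guess for the values
    V[0] = V[6] = 0.0              # terminal states have value 0
    for _ in range(episodes):
        s = 3                      # assume every episode starts in the middle state
        while s not in (0, 6):
            s_next = s + random.choice((-1, 1))      # 'left' or 'right', each with p = 0.5
            r = 1.0 if s_next == 6 else 0.0          # reward 1 only for reaching state 6
            V[s] += alpha * (r + V[s_next] - V[s])   # TD(0) update, gamma = 1
            s = s_next
    return V

def dp_evaluate(theta=1e-6):
    """Evaluate the same policy by iterative policy evaluation (DP)."""
    V = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(1, 6):                        # sweep the nonterminal states
            left = 0.0 + V[s - 1]                    # moving left never earns a reward
            right = (1.0 if s + 1 == 6 else 0.0) + V[s + 1]
            v_new = 0.5 * left + 0.5 * right         # expectation under the random policy
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:                            # stop once the sweep barely changes V
            return V

if __name__ == "__main__":
    print("TD(0):", ["%.3f" % v for v in td0_evaluate()])
    print("DP:   ", ["%.3f" % v for v in dp_evaluate()])

Both routines converge toward the true values 1/6, 2/6, 3/6, 4/6, 5/6 for states 1 through 5, which is the known solution for this example in the book.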