Skip to main content

Temporal-Difference Learning Policy Evaluation in Python

In the code bellow, is an example of policy evaluation for very simple task. Example is taken from the book: "Reinforcement Learning: An Introduction, Surto and Barto".

#!/usr/local/bin/python

"""
This is an example of policy evaluation for a random walk policy.

Example 6.2: Random Walk from the book:
"Reinforcement Learning: An Introduction, Surto and Barto"

The policy is evaluated by dynamic programing and TD(0).

In this example, the policy can start in five states 1, 2, 3, 4, 5 and end in
two states 0 and 6. The allowed transitions between the states are as follwes:

0 <-> 1 <-> 2 <-> 3 <-> 4 <-> 5 <-> 6

The reward for ending in the state 6 is 1.
The reward for ending in the state 0 is 0.

In any state, except the final states, you can take two actions: 'left' and 'right'.
In the final states the policy and episodes end.

Because this example implements the random walk policy then both actions can be
taken with the probability 0.5 .
"""

import random

# Instead of letters, I use number for the states. The states 0 and 6 are the final states.
# The states 1,...,5 are the states A,...,E

states = range(0,7)
finalStates = [0,6]
reward = [ 1 if s == 6 else 0 for s in states]

def policy(s):
"""Random walk policy: ations are 'go left' nand 'go right'."""

return random.choice(['left', 'right'])

def execute_policy(s, a):
"""Change state based on the taken action."""

if a == 'left':
return s - 1
else:
return s + 1

def TD_0(V_star, alpha, gamma, numOfEpisodes = 10000):
"""Use Temporal-Difference Learning to learn V^*."""

for episode in range(numOfEpisodes):
# select random start state
s = random.randint(min(states)+1, max(states)-1)

endOfEpisode = False
while not endOfEpisode:
if s in finalStates:
# evaluate the value of the final state
V_star[s] = V_star[s] + alpha*(reward[s] - V_star[s])

# because we are in the final state then end the episode
endOfEpisode = True
continue
else:
# get an action for this state from the policy
a = policy(s)

# execute policy => take an action
s_prime = execute_policy(s, a)

# evaluate the action
V_star[s] = V_star[s] + alpha*(reward[s] + gamma*V_star[s_prime] - V_star[s])

s = s_prime

def V(s, d = 0):
"""Value function computed by dynamic programing."""

if d > 20:
return 0

if s in finalStates:
return reward[s]

return 0.5*(V(s-1, d+1) + V(s+1, d+1))

###############################################################################
##
## Experiments
##
###############################################################################

gamma = 1

print("TD(0): alpha = 0.15")

# init the value function
V_star = [0.5 for s in states]
TD_0(V_star, 0.15, gamma, 100000)
for s, V_s in enumerate(V_star):
V_s_star = s/6.0
print("V(%d) = %0.3f err = %.3f" % (s, V_s, abs(V_s_star - V_s)))

print("TD(0): alpha = 0.05")

# init the value function
V_star = [0.5 for s in states]
TD_0(V_star, 0.05, gamma, 100000)
for s, V_s in enumerate(V_star):
V_s_star = s/6.0
print("V(%d) = %0.3f err = %.3f" % (s, V_s, abs(V_s_star - V_s)))

print "Dynamic programing:"
for s in states:
V_s_star = s/6.0
V_s = V(s)
print("V(%d) = %0.3f err = %.3f" % (s, V_s, abs(V_s_star - V_s)))

Comments

Popular posts from this blog

Viterbi Algorithm in C++ and using STL

To practice my C++ and STL skills, I implemented the Viterbi algorithm example from the Wikipedia page:  http://en.wikipedia.org/wiki/Viterbi_algorithm . The original algorithm was implemented in Python. I reimplemented the example in C++ and I used STL (mainly  vector  and  map  classes).  This code is in public-domain. So, use it as you want.  The complete solution for MS Visual C++ 2008 can be found at  http://filip.jurcicek.googlepages.com/ViterbiSTL.rar // ViterbiSTL.cpp : is an C++ and STL implementatiton of the Wikipedia example // Wikipedia: http://en.wikipedia.org/wiki/Viterbi_algorithm#A_concrete_example // It as accurate implementation as it was possible #include "stdafx.h" #include "string" #include "vector" #include "map" #include "iostream" using namespace std; //states = ('Rainy', 'Sunny') //  //observations = ('walk', 'shop', 'clean') //  //start_probability = {'Rainy': 0.6...

how the make HCL and G graphs, and on the fly compositon of HCL and G for KALDI

Well, I had again to do something ;-) The task is to generate/create/update a decoding graph for KALDI on the fly. In my case, I aim at changing a G (grammar) in the context of a dialogue system. One can generate a new HCLG but this would take a lot of time as this involves FST determinization, epsilon-removal, minimization, etc. Therefore, I tried to use on-the-fly composition of statically prepared HCL and G. At first, I struggled with it but later I made it work. See  https://github.com/jpuigcerver/kaldi-decoders/issues/1 Here is a short summary: At the end, I managed to get LabelLookAheadMatcher to work. It is mostly based on the code and examples in opendcd, e.g. https://github.com/opendcd/opendcd/blob/master/script/makegraphotf.sh . First, Here is how I build and prepare the HCL and G. Please not that OpenFST must be compiled with  --enable-lookahead-fsts , see http://www.openfst.org/twiki/bin/view/FST/ReadMe . #--------------- fstdeterminize ${lang}/...

Difference between the Standard Deviation, Standard Error and Confidence Itervals

In its simplest: the standard deviation represents the variability of input values, the standard error (of the mean) represents variability of computed mean, the confidence intervals represents where the 'true' mean value might lie. Computation: the standard deviation is computed from the variance of your data - input values, the standard error is computed from the standard deviation, the confidence intervals are computed from the standard error. Read this: Many people confuse the standard deviation (SD) and the standard error of the mean (SE) and are unsure which, if either, to use in presenting data in graphical or tabular form. The SD is an index of the variability of the original data points and should be reported in all studies. The SE reflects the variability of the mean values, as if the study were repeated a large number of times. By itself, the SE is not particularly useful; however, it is used in constructing 95% and 99% confidence intervals ( CIs ), which indicate ...