
Temporal-Difference Learning Policy Evaluation in Python

The code below is an example of policy evaluation for a very simple task. The example is taken from the book "Reinforcement Learning: An Introduction" by Sutton and Barto.

#!/usr/local/bin/python

"""
This is an example of policy evaluation for a random walk policy.

Example 6.2: Random Walk from the book:
"Reinforcement Learning: An Introduction" by Sutton and Barto.

The policy is evaluated by dynamic programming and TD(0).

In this example, an episode can start in any of the five states 1, 2, 3, 4, 5 and ends in
one of the two states 0 and 6. The allowed transitions between the states are as follows:

0 <-> 1 <-> 2 <-> 3 <-> 4 <-> 5 <-> 6

The reward for ending in the state 6 is 1.
The reward for ending in the state 0 is 0.

In any state except the final states, you can take two actions: 'left' and 'right'.
An episode ends when a final state is reached.

Because this example implements the random walk policy, both actions are
taken with probability 0.5.
"""

import random

# Instead of letters, I use numbers for the states. The states 0 and 6 are the final states.
# The states 1,...,5 correspond to the states A,...,E in the book.

states = range(0,7)
finalStates = [0,6]
reward = [ 1 if s == 6 else 0 for s in states]

def policy(s):
    """Random walk policy: actions are 'go left' and 'go right'."""

    return random.choice(['left', 'right'])

def execute_policy(s, a):
    """Change state based on the taken action."""

    if a == 'left':
        return s - 1
    else:
        return s + 1

def TD_0(V_star, alpha, gamma, numOfEpisodes = 10000):
    """Use TD(0) to estimate the value function of the policy; V_star is updated in place."""

    for episode in range(numOfEpisodes):
        # select a random non-final start state
        s = random.randint(min(states)+1, max(states)-1)

        endOfEpisode = False
        while not endOfEpisode:
            if s in finalStates:
                # evaluate the value of the final state
                V_star[s] = V_star[s] + alpha*(reward[s] - V_star[s])

                # because we are in the final state, end the episode
                endOfEpisode = True
                continue
            else:
                # get an action for this state from the policy
                a = policy(s)

                # execute policy => take an action
                s_prime = execute_policy(s, a)

                # evaluate the action
                V_star[s] = V_star[s] + alpha*(reward[s] + gamma*V_star[s_prime] - V_star[s])

                s = s_prime

def V(s, d = 0):
    """Value function computed by a recursive expansion of the Bellman equation."""

    # cut the recursion off at depth 20, so the result is only approximate
    if d > 20:
        return 0

    if s in finalStates:
        return reward[s]

    return 0.5*(V(s-1, d+1) + V(s+1, d+1))

###############################################################################
##
## Experiments
##
###############################################################################

gamma = 1

print("TD(0): alpha = 0.15")

# init the value function
V_star = [0.5 for s in states]
TD_0(V_star, 0.15, gamma, 100000)
for s, V_s in enumerate(V_star):
    V_s_star = s/6.0
    print("V(%d) = %0.3f err = %.3f" % (s, V_s, abs(V_s_star - V_s)))

print("TD(0): alpha = 0.05")

# init the value function
V_star = [0.5 for s in states]
TD_0(V_star, 0.05, gamma, 100000)
for s, V_s in enumerate(V_star):
    V_s_star = s/6.0
    print("V(%d) = %0.3f err = %.3f" % (s, V_s, abs(V_s_star - V_s)))

print("Dynamic programming:")
for s in states:
    V_s_star = s/6.0
    V_s = V(s)
    print("V(%d) = %0.3f err = %.3f" % (s, V_s, abs(V_s_star - V_s)))
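
The recursive V(s, d) in the listing stops expanding at a fixed depth, so its values are only approximate. For comparison, here is a minimal sketch of iterative policy evaluation for the same seven-state walk. The function name iterative_policy_evaluation and the parameters n_states and tol are my own choices for illustration; they do not come from the book or from the listing above. The sketch assumes, as the listing does, that the value of a final state equals its reward.

```python
def iterative_policy_evaluation(n_states=7, gamma=1.0, tol=1e-12):
    """Sweep the Bellman expectation equation until the values stop changing."""
    # Assumption: final state 0 is worth 0 and the last state is worth 1.
    terminal = {0: 0.0, n_states - 1: 1.0}
    values = [terminal.get(s, 0.5) for s in range(n_states)]

    while True:
        delta = 0.0
        for s in range(1, n_states - 1):
            # Random walk: 'left' and 'right' each with probability 0.5,
            # and no reward is collected in the non-final states.
            new_value = 0.5 * gamma * (values[s - 1] + values[s + 1])
            delta = max(delta, abs(new_value - values[s]))
            values[s] = new_value
        if delta < tol:
            return values


values = iterative_policy_evaluation()
for s, v in enumerate(values):
    print("V(%d) = %0.3f err = %.3f" % (s, v, abs(s / 6.0 - v)))
```

Each sweep costs only a handful of additions, whereas the recursive version revisits the same states an exponential number of times. With gamma = 1 the values should settle close to s/6, the same reference the experiments above compare against.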
