Paweł Cichosz
Jan J. Mulawka
Institute of Electronics Fundamentals
Warsaw University of Technology
Nowowiejska 15/19, 00-665 Warsaw, Poland
cichosz@ipe.pw.edu.pl
Institute of Electronics Fundamentals
Warsaw University of Technology
Nowowiejska 15/19, 00-665 Warsaw, Poland
jml@ipe.pw.edu.pl
The problem of temporal credit assignment in reinforcement learning is
typically solved using algorithms based on the methods of temporal
differences TD(λ). Of those, Q-learning is currently best understood
and most widely used. Using TD-based algorithms with λ > 0 often
allows one to speed up the propagation of credit significantly, but it
involves certain implementational problems. The traditional implementation of
TD(λ) based on eligibility traces suffers from lack
of generality and computational inefficiency. The TTD (Truncated Temporal
Differences) procedure is a simple TD(λ) approximation technique
that appears to overcome these drawbacks of eligibility traces. The paper
outlines this technique, discusses its computational efficiency advantages,
and presents experimental studies with the combination of TTD and Q-learning
in deterministic and stochastic environments. These experiments show that TTD
makes it possible to obtain a significant learning speedup without reducing
reliability at essentially the same computational cost as usual TD(0)
learning. We conclude that the TTD procedure is probably the most promising
way of using TD methods for reinforcement learning, especially for tasks with
large state spaces and a hard temporal credit assignment problem.