Reinforcement Learning in Java


This page contains a simple RL grid path-finding demo, available both as an applet and as an application, implemented using the classes from my rl package. The GUI is rather primitive, but I didn't want to invest much time in the interface, because 1) the AWT appears to be the least stable part of the Java standard library, and 2) in its present form it is still an Awkward Window Toolkit to some extent. I wrote it during two or three weeks of the 1997 summer holidays to learn Java (this is my first experience with the language) and to entertain myself by programming just for fun. I haven't used this code for any purpose other than playing with it a little (all my previous research code was written in C++), but maybe I will. Of course I realize that using Java for numerical computation is problematic, but the "inefficiency" of (the current implementations of) the language is not always an issue.

Both this demo and the underlying reinforcement learning code require Java 1.1 and won't run under Java 1.0.2. The applet has been tested only with HotJava, and there are probably some problems with Netscape. If the applet doesn't work, you can download the code and run it locally as an application.




User's Guide

OK, the demo is mostly self-explanatory, so this is just a very short user's guide. The learner (red) is required to find a shortest path to a goal cell (green). It cannot move into cells occupied by obstacles (black). A trial is a sequence of steps that begins in some fixed or random (see below) initial cell and ends when a goal cell is reached. The learner receives a reward of -1 at each step except for the final one, when the reinforcement is 0.
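The reward scheme described above can be sketched in a few lines. This is only an illustration and not code from the rl package: the class name GridSketch, the four-action move set, and the choice to leave the learner in place when a move is blocked are all assumptions made for the example.

```java
/** Hypothetical minimal grid world; NOT the rl package's actual classes. */
final class GridSketch {
    // Assumed action set: 0 = up, 1 = right, 2 = down, 3 = left.
    static final int[] DX = {0, 1, 0, -1};
    static final int[] DY = {-1, 0, 1, 0};

    final boolean[][] obstacle;  // black cells
    final boolean[][] goal;      // green cells
    int x, y;                    // learner's current cell (red)

    GridSketch(int xSize, int ySize) {
        obstacle = new boolean[xSize][ySize];
        goal = new boolean[xSize][ySize];
    }

    /**
     * Applies one action and returns the reinforcement: 0 if the learner
     * is now in a goal cell, -1 otherwise. A move into an obstacle or off
     * the grid leaves the learner where it is (an assumption).
     */
    double step(int action) {
        int nx = x + DX[action];
        int ny = y + DY[action];
        boolean inside = nx >= 0 && nx < obstacle.length
                && ny >= 0 && ny < obstacle[0].length;
        if (inside && !obstacle[nx][ny]) {
            x = nx;
            y = ny;
        }
        return goal[x][y] ? 0.0 : -1.0;
    }

    /** True once a goal cell has been reached, i.e. the trial is over. */
    boolean atGoal() {
        return goal[x][y];
    }
}
```

With this reward scheme the (undiscounted) return of a trial is minus the number of steps taken before the final one, so maximizing the return is the same as finding a shortest path.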

Library

The "library" demonstrated by this demo consists of two main packages, rl and rl.gui, and a small package called util with some auxiliary stuff that I found useful but missing from the standard Java library. (One day I will probably add some prefix to these package names, such as PL.something.) The reinforcement learning code is quite simple, but flexible. It does not apply any "formal" design patterns (not explicitly, at least), but it is hopefully object-oriented to some extent. It does not follow the "standard" RL interface published on Rich Sutton's home page, but could be easily adapted to conform to it. The sources can be downloaded and freely used.

The demo uses most of the functionality available in the rl and rl.gui packages. The grid environment is used mainly because it is easy to simulate and visualize in a nice way. However, to implement a similar demo for another task, say cart-pole balancing, you only need to define and "plug in" two classes: CartPole for the simulation of the task and CartPoleCanvas for its visualization. Everything else (including all the GUI stuff) should work with absolutely no change.
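To illustrate the "plug in" idea, here is a minimal sketch of how a trial loop can depend only on an abstract task. Everything in it is hypothetical: the TaskSketch interface, its method names, and the toy CorridorTask are invented for this example, and the real interfaces in the rl package may look quite different. A CartPole class would take the place of CorridorTask.

```java
/** Hypothetical task interface; the rl package's real one may differ. */
interface TaskSketch {
    int reset();              // starts a new trial, returns the initial state
    int step(int action);     // applies an action, returns the new state
    double reward();          // reinforcement for the last step
    boolean trialFinished();  // true once the trial has ended
}

/** Toy corridor used only to exercise the interface. */
final class CorridorTask implements TaskSketch {
    private final int length;
    private int position;

    CorridorTask(int length) {
        this.length = length;
    }

    public int reset() {
        position = 0;
        return position;
    }

    public int step(int action) {
        // Action 1 moves right, anything else moves left; walls clamp the move.
        int next = position + (action == 1 ? 1 : -1);
        position = Math.max(0, Math.min(length - 1, next));
        return position;
    }

    public double reward() {
        return trialFinished() ? 0.0 : -1.0;
    }

    public boolean trialFinished() {
        return position == length - 1;
    }
}

final class TrialRunner {
    /** Runs one trial with a fixed "always move right" policy. */
    static double runTrial(TaskSketch task, int maxSteps) {
        task.reset();
        double total = 0.0;
        for (int i = 0; i < maxSteps && !task.trialFinished(); i++) {
            task.step(1);
            total += task.reward();
        }
        return total;
    }
}
```

Because TrialRunner only sees the interface, swapping in another task requires no change to the loop, which is the property the paragraph above describes.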

Algorithms

The learning algorithm used in this demo is Sarsa combined with TTD to implement lambda>0. These two are described, respectively, in:

  1. Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical report CUED/F-INFENG/TR 166, Cambridge University, Engineering Department.
  2. Cichosz, P. (1995). Truncating temporal differences: On the efficient implementation of TD(lambda) for reinforcement learning. Journal of Artificial Intelligence Research, 2:287-318. See my publications page.
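For readers who want a feel for how TTD supplies lambda>0, the sketch below computes a truncated TD(lambda) return over a buffer of the last m steps and applies it as a Sarsa-style update. It uses the usual backward recursion for such a return and is not taken from the rl package; the class and method names are hypothetical, and values[k] is assumed to hold the Q-value of the state-action pair visited right after step k.

```java
/** Hypothetical illustration of a truncated TD(lambda) return. */
final class TtdReturnSketch {
    /**
     * rewards[k] is the reinforcement received at the k-th buffered step and
     * values[k] the predicted value of what followed it, for k = 0..m-1,
     * where m = rewards.length is the buffer length.
     */
    static double truncatedReturn(double[] rewards, double[] values,
                                  double gamma, double lambda) {
        int m = rewards.length;
        // The newest step in the buffer gets a plain one-step return.
        double z = rewards[m - 1] + gamma * values[m - 1];
        // Walk back towards the oldest step, mixing in each prediction.
        for (int k = m - 2; k >= 0; k--) {
            z = rewards[k] + gamma * (lambda * z + (1.0 - lambda) * values[k]);
        }
        return z;
    }

    /** Moves the old Q-value towards the return z with step size beta. */
    static double update(double q, double z, double beta) {
        return q + beta * (z - q);
    }
}
```

With lambda = 0 the recursion collapses to the ordinary one-step return, and with lambda = 1 it becomes the m-step discounted return, so the two parameters lambda and m together control how far back credit is propagated.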

A Boltzmann-distribution action selection mechanism is used. The "library" also provides Q-learning, and it is straightforward to implement other well-known algorithms (say, AHC). You can also implement some other temporal credit assignment mechanism instead of TTD without changing the implementation of these algorithms. But I am going too far into "technical" details, which are covered by the javadoc-generated documentation of the code.
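As a rough illustration of Boltzmann action selection, the sketch below turns the Q-values of one state into selection probabilities proportional to exp(Q/temperature) and samples an action from them. It is a generic textbook formulation written in present-day Java, not the code from the rl package; the class and method names are hypothetical.

```java
import java.util.Random;

/** Hypothetical illustration of Boltzmann (softmax) action selection. */
final class BoltzmannSketch {
    /** Returns selection probabilities proportional to exp(q[i] / temperature). */
    static double[] probabilities(double[] q, double temperature) {
        // Subtracting the maximum leaves the distribution unchanged
        // but keeps exp() from overflowing.
        double max = Double.NEGATIVE_INFINITY;
        for (double value : q) {
            max = Math.max(max, value);
        }
        double[] p = new double[q.length];
        double sum = 0.0;
        for (int i = 0; i < q.length; i++) {
            p[i] = Math.exp((q[i] - max) / temperature);
            sum += p[i];
        }
        for (int i = 0; i < p.length; i++) {
            p[i] /= sum;
        }
        return p;
    }

    /** Samples an action index according to the Boltzmann probabilities. */
    static int select(double[] q, double temperature, Random rng) {
        double[] p = probabilities(q, temperature);
        double u = rng.nextDouble();
        double cumulative = 0.0;
        for (int i = 0; i < p.length; i++) {
            cumulative += p[i];
            if (u < cumulative) {
                return i;
            }
        }
        return p.length - 1;  // guards against rounding in the cumulative sum
    }
}
```

A high temperature makes all actions nearly equally likely, while a temperature close to 0 makes the selection nearly greedy with respect to the Q-values.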

How to use it

Now, how can you play with the demo? If you don't like the layout of obstacles and goal cells, you can modify it using the left and right mouse buttons, respectively. This can be done at any time, either before or in the course of learning. You start the simulation by pressing the Start button. This clears the learner's knowledge, in this case by resetting its Q-values to 0. At any time you can suspend or resume the simulation using the Suspend and Resume buttons, respectively. When you have had enough, or want to be able to start from the beginning, press Stop. Whenever the simulation is suspended or stopped, the two text fields labeled "Trial" and "Step" display, respectively, the number of the current trial and the number of the last step made in the current trial. They are not updated after each step, since (in my Java development environment, at least) doing so turned out to be extremely memory-consuming. If you find the learner moving too fast, you can introduce a delay between steps using the scroll bar labeled "Delay". By default the only delay in the simulation is the time required to redraw (the appropriate part of) the grid.

Properties

Both the learner and the grid environment have some modifiable properties, or parameters. These are, for the learner:

  1. lambda, the recency factor,
  2. m, the TTD experience buffer length,
  3. gamma, the discount factor,
  4. beta, the step-size value used for Q-function update, and
  5. temperature, controlling the randomness of action selection,
and, for the environment,
  1. startX, the x coordinate of the learner's starting cell or -1 if starting in a random cell,
  2. startY, the y coordinate of the learner's starting cell or -1 if starting in a random cell,
  3. xSize, the x size of the grid, and
  4. ySize, the y size of the grid.

You can change any of these parameters whenever you like by typing a new value in the corresponding text field and pressing Return (trying to set an illegal value will have no effect). There is no enforced upper limit on the grid size, but you will probably not want to exceed 50x50, as this is the size of the look-up table used for function representation by the demo (sure, it could be changed dynamically, but it's fixed to keep things simple). You may wonder why the parameters are displayed in some strange order, which is neither logical nor alphabetical -- this is because 1) they are determined at run time and the class that displays them has no idea of what they mean, 2) regrettably the standard Java library has no sort routine, and 3) I didn't want to use some third-party sort (e.g., from the JGL) or to implement my own just to sort the properties alphabetically.
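The "illegal value has no effect" behaviour can be sketched as a setter that parses the text and silently keeps the old value on failure. This is an assumption about how such a check might look, not the demo's actual code: the class name, the default of 0.5, and the legal range [0, 1] for lambda are all chosen for illustration only.

```java
/** Hypothetical holder for one learner parameter. */
final class LearnerPropertiesSketch {
    private double lambda = 0.5;  // arbitrary illustrative default

    double getLambda() {
        return lambda;
    }

    /**
     * Tries to set lambda from the contents of a text field.
     * Returns false and leaves the old value untouched if the text is
     * not a number or falls outside the assumed legal range [0, 1].
     */
    boolean setLambda(String text) {
        try {
            double value = Double.parseDouble(text.trim());
            // Written as a negation so that NaN is rejected as well.
            if (!(value >= 0.0 && value <= 1.0)) {
                return false;
            }
            lambda = value;
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```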

Download

OK, that's probably all. Have fun. You can download the binaries as classes.jar and sources as either src.tar.gz or src.zip. The HTML documentation is available as doc.tar.gz or doc.zip. You can also browse it online. Questions, comments, or criticism welcome.



Pawel Cichosz