Abstracts of My Publications


Wawrzynski, P., Arabas, J., & Cichosz, P. (2008). Predictive control for Artificial Intelligence in computer games. In Proceedings of the Tenth International Conference on Artificial Intelligence and Soft Computing (ICAISC-2008). Springer.

Abstract

The subject of this paper is artificial intelligence (AI) of non-player characters in computer games, i.e. bots. We develop an idea of game AI based on predictive control. Bots activity is defined by a currently realized plan. This plan results from an optimization process in which random plans are continuously generated and reselected. We apply our idea to implement a bot for the game Half-Life. Our bot, Randomly Planning Fighter (RPF), defeats the bot earlier designed for Half-Life with the use of behavior-based techniques. The experiments prove that on-line planning can be feasible in rapidly changing environment of modern computer games.

Walczak, T., & Cichosz, P. (2006). A distributed learning control system for elevator groups. In Proceedings of the Eighth International Conference on Artificial Intelligence and Soft Computing (ICAISC-2006). Springer.

Abstract

Human-designed elevator control policies usually perform sufficiently well, but are costly to obtain and do not easily adapt to changing traffic patterns. This paper describes an adaptive distributed elevator control system based on reinforcement learning. Whereas inspired by prior work, the design of the system is novel, developed with the intention to avoid any unrealistic assumptions that would limit its practical usefulness. Encouraging experimental results are presented with a realistic simulator of an elevator group.

Cichosz, P. (1999). A forwards view of replacing eligibility traces for states and state-action pairs. Mathematical Algorithms, 1:283-297.

Abstract

Reinforcement learning algorithms based on temporal difference (TD) methods may be implemented in their TD(lambda>0) versions by maintaining and incrementally updating eligibility traces for each state. For the classical, accumulating form of eligibility traces, each visit to a state adds to the current value of its trace, so that trace values depend on both the recency and frequency of visits. Recently an alternative, replacing form of eligibility traces has been proposed, where visit frequency has no effect on trace values. Prior work on this novel form of eligibility traces has shown that it may be expected to perform better than classical accumulating traces, particularly for large lambda. Whereas both accumulating and replacing eligibility traces correspond to a backwards view of TD(lambda) with respect to time, the algorithm can be better understood and more efficiently implemented using a forwards view to present its effects. It was done for accumulating traces by prior work. In this paper the effects of replacing eligibility traces for states and state-action pairs are analyzed and a corresponding forwards view is presented. The analysis provides an insightful illustration of the difference between the two forms of eligibility traces and suggests an efficient implementation of replacing TD(lambda) without using traces.

Cichosz, P. (1999). An analysis of experience replay in temporal difference learning. Cybernetics and Systems, 30:341-363.

Abstract

Temporal difference (TD) methods are used by reinforcement learning algorithms for predicting future rewards. This article analyses theoretically and illustrates experimentally the effects of performing TD(lambda) prediction updates backwards for a number of past experiences. More exactly, two related techniques described in the literature are examined, referred to as replayed TD and backwards TD. The former is essentially an online learning method which performs at each time step a regular TD(0) update, and then replays updates backwards for a number of previous states. The latter operates in offline mode, after the end of a trial updating backwards the predictions for all visited states. They are both shown to be approximately equivalent to TD(lambda) with variable lambda values selected in a particular way. This is true even if they perform only TD(0) updates. The experimental results show that replayed TD(0) is competitive to TD(lambda) with regard to learning speed and quality.

Cichosz, P. (1999). TD(lambda) learning without eligibility traces: A theoretical analysis. Journal of Experimental and Theoretical Artificial Intelligence, 11:239-263.

Abstract

A common approach to learning from delayed rewards is to use temporal difference (TD) methods for predicting future reinforcement values. They are parameterized by a recency factor lambda which determines whether and how the outcomes from several consecutive time steps contribute to a single prediction update. TD(lambda>0) has been found to usually yield noticeably faster learning than TD(0), but its standard eligibility traces implementation is associated with some well known deficiencies, in particular significantly increased computational expense. This article investigates theoretically two possible ways of implementing TD(lambda) without eligibility traces, both proposed by prior work. One is the TTD procedure, which efficiently approximates the effects of eligibility traces by the use of truncated TD(lambda) returns. The other is experience replay, which relies on replaying TD prediction updates backwards in time. We provide novel theoretical results related to the former and present an original analysis of the effects of two variations of the latter. The ultimate effect of these investigations is a unified view of the apparently different computational techniques. This contributes to the TD(lambda) research in general, by highlighting interesting relationships between several TD-based algorithms and facilitating their further analysis.

Cichosz, P. (1996). Delayed reinforcement learning of multidimensional control actions. Systems Analysis-Modelling-Simulation, 24:233-248.

Abstract

This paper addresses the problem of reinforcement learning in domains with multidimensional action spaces, occurring, e.g., in applications to control or robotics. Classical reinforcement learning algorithms, based on the methods of temporal differences and related to dynamic programming, such as Q-learning, can be applied to such domains by recoding the action space appropriately (transforming it artificially to a single dimension), but this straightforward recoding approach suffers from significant inefficiencies. Two alternative approaches to applying Q-learning to tasks with vector actions are discussed: a copying approach studied previously by others and a novel algorithm called Q-V-learning. They both maintain separate Q-functions for each element of the action vector, but they differ in the way of using those Q-functions for computing TD errors. The Q-V-learning algorithm is informally argued to be potentially more effective. Experimental results are presented where these two methods clearly outperform the simple recoding approach, while they are associated with a much lower computational expense. In these experiments the novel algorithm proposed in this paper appears to perform best of all the three compared algorithms. The paper concludes with a discussion of the contributions of this work and of the most important directions for future research in this area.

Cichosz, P. (1995). Truncating temporal differences: On the efficient implementation of TD(lambda) for reinforcement learning. Journal of Artificial Intelligence Research, 2:287-318.

Abstract

Temporal difference (TD) methods constitute a class of methods for learning predictions in multi-step prediction problems, parameterized by a recency factor lambda. Currently the most important application of these methods is to temporal credit assignment in reinforcement learning. Well known reinforcement learning algorithms, such as AHC or Q-learning, may be viewed as instances of TD learning. This paper examines the issues of the efficient and general implementation of TD(lambda) for arbitrary lambda, for use with reinforcement learning algorithms optimizing the discounted sum of rewards. The traditional approach, based on eligibility traces, is argued to suffer from both inefficiency and lack of generality. The TTD (Truncated Temporal Differences) procedure is proposed as an alternative, that indeed only approximates TD(lambda), but requires very little computation per action and can be used with arbitrary function representation methods. The idea from which it is derived is fairly simple and not new, but probably unexplored so far. Encouraging experimental results are presented, suggesting that using lambda>0 with the TTD procedure allows one to obtain a significant learning speedup at essentially the same cost as usual TD(0) learning.

Cichosz, P. (1996). Truncated temporal differences with function approximation: Successful examples using CMAC. In Proceedings of the Thirteenth European Meeting on Cybernetics and Systems Research (EMCSR-96). Austrian Society for Cybernetic Studies.

Abstract

Combining reinforcement learning algorithms with function approximators in order to generalize over the state space has recently received particular interest and is widely believed to be one of the crucial issues for scaling reinforcement learning to practically interesting domains. This paper examines the combination of the TTD procedure, a computationally efficient approximate implementation of TD(lambda) methods, with CMAC, a function approximator especially suitable for reinforcement learning due to its computational efficiency and on-line learning capability. Most of previous studies have investigated the combination of CMAC with either TD(0)-based algorithms, which usually learn much slower than for lambda>0, or with the traditional implementation of TD(lambda) based on eligibility traces, associated with high computational costs. This study, by combining CMAC with TTD, attempts to reconcile fast learning with computational efficiency and generalization capabilities. The presented experimental results show the successful performance of the Q-learning algorithm implemented using the TTD procedure and CMAC in two tasks with continuous state spaces.

Cichosz, P. (1997). Integrated learning and planning based on truncating temporal differences. To appear in Proceedings of the Ninth European Conference on Machine Learning (ECML-97). Springer Verlag.

Abstract

Reinforcement learning systems learn to act in an uncertain environment by executing actions and observing their long-term effects. A large number of time steps may be required before this trial-and-error process converges to a satisfactory policy. It is highly desirable that the number of experiences needed by the system to learn to perform its task be minimized, particularly if making errors costs much. One approach to achieve this goal is to use hypothetical experiences, which requires some additional computation, but may reduce the necessary number of much more costly real experiences. This well-known idea of augmenting reinforcement learning by planning is revisited in this paper in the context of truncated TD(lambda), or TTD, a simple computational technique which allows reinforcement learning algorithms based on the methods of temporal differences to learn considerably faster with essentially no additional computational expense. Two different ways of combining TTD with planning are proposed which make it possible to benefit from lambda>0 in both the learning and planning processes. The algorithms are evaluated experimentally on a family of grid path-finding tasks and shown to indeed yield a considerable reduction of the number of real interactions with the environment necessary to converge, as well as an improvement of scaling properties.

Cichosz, P. (1996). Truncated temporal differences and sequential replay: Comparison, integration, and experiments. To appear in Proceedings of the Poster Session of the Ninth International Symposium on Methodologies for Intelligent Systems (ISMIS-96), Zakopane, Poland. Oak Ridge Laboratory.

Abstract

This paper examines two techniques for speeding up reinforcement learning algorithms based on the methods of temporal differences (TD). The first of them, recently developed by the author and known as the TTD procedure, is an approximate implementation of TD(lambda>0), significantly more computationally efficient than the traditional eligibility traces implementation. The second technique, previously used by other authors and called here sequential replay (SR), relies on performing at each step several temporally chained TD(0) updates. In this paper the SR technique is shown to yield roughly mathematically equivalent overall effects as TTD with certain parameter values. It is also demonstrated how sequential replay can be integrated with the TTD procedure, leading to the TTD-SR procedure, covering both the basic techniques as special cases and making it possible to use sequential replay with lambda>0. The results of experimental studies carried out with the combination of TTD-SR and Q-learning, the most popular TD-based reinforcement learning algorithm, are presented.

Cichosz, P., & Mulawka, J. J. (1995). Fast and efficient reinforcement learning with truncated temporal differences. In Proceedings of the Twelfth International Conference on Machine Learning (ML-95), Tahoe City, California. Morgan Kaufmann.

Abstract

The problem of temporal credit assignment in reinforcement learning is typically solved using algorithms based on the methods of temporal differences TD(lambda). Of those, Q-learning is currently best understood and most widely used. Using TD-based algorithms with lambda>0 often allows one to speed up the propagation of credit significantly, but it involves certain implementational problems. The traditional implementation of TD(lambda>0) based on eligibility traces suffers from lack of generality and computational inefficiency. The TTD (Truncated Temporal Differences) procedure is a simple TD(lambda) approximation technique that appears to overcome these drawbacks of eligibility traces. The paper outlines this technique, discusses its computational efficiency advantages, and presents experimental studies with the combination of TTD and Q-learning in deterministic and stochastic environments. These experiments show that TTD makes it possible to obtain a significant learning speedup without reducing reliability at essentially the same computational cost as usual TD(0) learning. We conclude that the TTD procedure is probably the most promising way of using TD methods for reinforcement learning, especially for tasks with large state spaces and a hard temporal credit assignment problem.

Cichosz, P., & Mulawka, J. J. (1995). Integrated architectures for learning, planning, and reacting based on approximating TD(lambda). In Proceedings of the First International Workshop on Intelligent Adaptive Systems (IAS-95), Melbourne Beech, Florida.

Abstract

Sutton's Dyna framework is a simple, but appealing approach to integrating learning, planning, and reacting in autonomous intelligent agents. Previous work on Dyna has concentrated on combining it with reinforcement learning algorithms based on the methods of temporal differences in their simplest TD(0) form. This paper discusses the possibilities of Dyna-like architectures that can use TD(lambda) for general lambda for both learning and planning, so as to reduce the number of the agent's interactions with the environment necessary to identify an optimal decision policy. A simple, but general and efficient technique for the implementation of TD(lambda), called the TTD procedure, is used as a basis of two novel integrated architectures: TTDyna-F, using a predictive forwards environment model, as in original Dyna, and TTDyna-B, using an inverse backwards model. The advantages and drawbacks of both approaches are discussed and directions for future work are outlined.

Cichosz, P., & Mulawka, J. J. (1995). GBQL: A novel genetics-based reinforcement learning architecture. In Proceedings of the Third European Congress on Intelligent Techniques and Soft Computing (EUFIT'95), Aachen, Germany.

Abstract

This research attempts to integrate the existing ideas in two fields: reinforcement learning algorithms based on the methods of temporal differences (TD), in particular Q-learning, and genetics-based machine learning, in particular classifier systems (CS). Close relations between the bucket brigade credit assignment algorithm used in classifier systems and TD methods, several widely realized drawbacks of CS, and good theoretical properties of TD, gave the initial motivation for developing a learning architecture that would combine TD-based temporal credit assignment algorithms with genetics-based adaptive knowledge representation. This paper presents a simple instantiation of this idea, called GBQL (Genetics-Based Q-Learning). This learning architecture may be expected to be a promising alternative for stimulus-response classifier systems on one hand, and for the implementations of Q-learning using other knowledge representation methods (e.g., connectionist networks) on the other hand.

Cichosz, P. (1995). Learning multidimensional control actions from delayed reinforcements. In Proceedings of the Eighth International Symposium on System-Modelling-Control (SMC-8), Zakopane, Poland.

Abstract

This paper addresses the problem of learning multidimensional control actions from delayed rewards. Classical reinforcement learning algorithms can be applied to tasks with multidimensional action spaces by recoding the action space appropriately (transforming it artificially to a single dimension), but this straightforward recoding approach suffers from significant inefficiencies. An alternative approach to applying Q-learning to tasks with vector actions is proposed, called Q-V-learning. Experimental results are presented where this algorithm clearly outperforms the simple recoding approach, while it is associated with a much lower computational expense.

Seredynski, F., Cichosz, P., & Klebus, G. P. (1995). Learning classifier systems in multi-agent environments. In Proceedings of the First IEE/IEEE International Conference on Genetic Algorithms in Engineering Systems: Innovations and Applications (GALESIA'95), Sheffield, UK.

Abstract

The paper is devoted to the problem of learning decision policies in multi-agent games. This problem is a simple, but appealing computational model of several important real-world problems in such domains as parallel computing, optimization, and control on one hand, and economy, social, and political sciences on the other hand. We describe a general framework for studying games of intelligent agents, extending the basic model of games with limited interactions, and its specific realization based on learning classifier systems. Simulation results are presented that illustrate the convergence properties of the resulting system. Avenues for future work in this area are outlined.

Cichosz, P., & Mulawka, J. J. (1996). Faster temporal credit assignment in learning classifier systems. In Proceedings of the First Polish Conference on Evolutionary Algorithms (KAE-96), Murzasichle, Poland.

Abstract

Classifier systems are genetics-based learning systems using the paradigm of reinforcement learning. In the most challenging case of delayed reinforcement, it involves a difficult temporal credit assignment problem. Standard classifier systems solve this problem using the bucket brigade algorithm. In this paper we show how to make the temporal credit assignment process faster by augmenting this algorithm by some refinements borrowed from a related field of reinforcement learning algorithms based on the methods of temporal differences (TD). These algorithms usually converge significantly faster if they are used in combination with TD(lambda). As a natural consequence of the easily noticeable similarity between the bucket brigade and TD(0), the BB(lambda) algorithm is derived, using the standard technique of eligibility traces. The TTD(lambda,m) procedure, which eliminates eligibility traces and implements an approximation of TD(lambda) in a computationally efficient way, has also been ported to the context of classifier systems, yielding the TBB(lambda,m) algorithm. The two resulting novel algorithms provide promising and, strangely enough, completely unexplored so far possibilities of making learning classifier systems learn faster under the conditions of reinforcement delay.

Cichosz, P. (1994). Reinforcement Learning Algorithms Based on the Methods of Temporal Differences. Master's Thesis, Institute of Computer Science, Warsaw University of Technology.

Abstract

The reinforcement learning paradigm differs significantly from the traditional supervised learning paradigm. An agent in each particular input situation must generate an action. Then it receives a reinforcement value from the environment, providing a measure of the agent's performance. The task for the agent is to maximize the reinforcement values it receives in long term.

Reinforcement learning agents are adaptive, reactive, and self-improving. To formulate a particular task as a reinforcement learning task one just has to design an appropriate reinforcement function, specifying the goal of the task. This makes the paradigm widely applicable, especially in such domains as game playing, automatic control, and robotics.

The reinforcement value received by the agent at a particular time step may reflect the positive or negative consequences of actions taken several steps before. In order to deal with such delayed reinforcement one needs some algorithms for temporal credit assignment. This thesis concentrates on a class of algorithms based on Sutton's temporal differences (TD) methods. The AHC and Q-learning algorithms are well known instances of this class. The TTD procedure is proposed for the efficient and general implementation of this class of algorithms, as an alternative to the traditional so called eligibility traces implementation, which is found to suffer from both ineficciency and lack of generality. Important practical issues in using these algorithms are discussed.

The problem of learning with multidimensional actions is addressed and a simple way to generalize appropriately TD-based algorithms is presented. It is argued that existing one-dimensional algorithms are hardly applicable to tasks with vector actions and that the proposed extensions, despite their simplicity, constitute a promising approach to this problem, though they require further work.

A novel genetics-based reinforcement learning architecture is introduced. It combines Q-learning with genetics-based knowledge representation and rule discovery mechanisms. For the class of learning problems considered in this thesis, it can be a promising alternative to Holland's classifier systems with the bucket brigade temporal credit assignment algorithm.

For all described algorithms experimental results are presented, ilustrating their performance.

Several important open problems in reinforcement learning are identified and directions for future research are outlined.

Cichosz, P. (1997). Reinforcement Learning by Truncating Temporal Differences. PhD Thesis, Department of Electronics and Information Technology, Warsaw University of Technology.

Abstract

The paradigm of reinforcement learning provides an appealing framework for developing intelligent adaptive systems. The learner interacts with a possibly unknown and stochastic environment by observing its states and performing actions. It receives scalar reinforcement, or reward values, which provide a relative measure of the quality of the executed actions. The learner's task is to identify an optimal decision policy, i.e., a state-action mapping that leads to the maximization of the rewards it receives in the long term.

Reinforcement values may be sparse and delayed with respect to the actions which contributed to them. A common approach to learning from delayed rewards is to use TD(lambda) methods for predicting future rewards in each state. Q-learning is currently the most popular and best theoretically understood TD-based reinforcement learning algorithm, but a variety of other related algorithms can be used.

There have been a few impressive practical applications of reinforcement learning, but the existing algorithms still suffer from important deficiencies. This thesis examines possible ways of overcoming some of them, and thus making it easier to develop successful intelligent systems based on the reinforcement learning paradigm.

Probably the most painful problem to be addressed is the relatively slow convergence of reinforcement learning algorithms. Although using TD(lambda>0) is known to usually give a considerable learning speedup, in practice TD(0) is still often used, because positive lambda increases the computational expense enormously, particularly for realistic tasks, with large state spaces. This is because TD(lambda) is implemented using eligibility traces, maintained and updated at each time step for all states. In this thesis the effects of the eligibility traces implementation are analyzed and an alternative implementation is derived, called the TTD procedure, which closely approximates TD(lambda) in a computationally efficient way, so that one can use lambda>0 at essentially the same cost as TD(0). This novel technique is theoretically shown to be approximately equivalent to, and empirically demonstrated to perform at least as well as eligibility traces, while it gives impressive computational savings. This is the major contribution of the dissertation around which the remaining contributions are concentrated.

The theoretical analysis of TTD leads to a number of additional interesting results. The proposed technique is shown to be covered by the existing TD(lambda) convergence theory, by proving its error-reduction property. It is extended to variable lambda, to allow one to select lambda values adaptively, which has been suggested by some prior work. Reinforcement learning speedup techniques proposed by other authors, based on experience replay, are shown to be equivalent to special variable lambda forms of TTD. A TTD analog is derived for replacing eligibility traces, recently proposed by other authors and argued to have some important advantages over traditional, accumulating eligibility traces. Finally, a version of TTD is presented for average-reward reinforcement learning, alternative to the standard discounted-reward framework, adopted by this work.

To apply reinforcement learning algorithms to tasks with large, especially continuous state spaces, it is usually necessary to combine them with learning function approximators to generalize over the state space. For a particular, but widely used class of function approximators, known as parameter-estimation methods, TTD is rederived in a gradient form and shown to be equivalent to the corresponding version of eligibility traces. Empirical results are presented for the combination of TTD and CMAC, a function approximator particularly well suited to reinforcement learning, which show that it learns successfully and requires much less computation than eligibility traces. A simple, but novel and probably useful function approximator for multidimensional discrete state spaces is proposed and experimented with as well.

Another way to speed up reinforcement learning examined in the thesis is to combine it with planning, i.e., learning from hypothetical experiences. This well-known idea is instantiated in a novel way, using the framework of TTD. Two hybrid learning and planning algorithms are proposed, which make both the learning and planning processes benefit from positive lambda. In a series of experiments they are shown to considerably reduce the number of real experiences necessary to converge.

The practical usefulness of a learning algorithm strongly depends on how well it scales up with the complexity of the task it is required to solve. The scaling properties of the algorithms described in the thesis are examined in a series of systematic experiments. The experiments show that the novel algorithms and extensions contributed by this work improve not only learning speed, but also scaling properties, and thus bring us closer to more realistic applications.


My home page My publications page

Pawel Cichosz