Simultaneous Deep Model-Based Reinforcement Learning and State Inference
under Partial Observability.
William Cong,
Josiah P. Hanna.
Proceedings of the IEEE International Conference on Robotics and Automation
(ICRA). May 2026.
Abstract
BibTeX
Model-based reinforcement learning (MBRL) is a promising approach to enabling
robots to learn directly from a limited number of real-world interactions. MBRL
is notoriously difficult in settings without full state observability because
algorithms must simultaneously infer state from incomplete observations and use
these inferences to learn environment dynamics. Toward the use of MBRL for
autonomous robots, we introduce EMBRL, an expectation-maximization framework
that combines classical Bayesian state estimation with deep MBRL to jointly
infer states and learn neural network state transition models. This framework
takes advantage of the rich theory and practice of state estimation from the
field of robotics, while enabling behavior learning without a priori known robot
dynamics. Though conceptually straightforward, our instantiation of this
framework for deep MBRL reveals several key challenges when using a learned
transition model both for state inference and policy learning. We introduce a
practical implementation of EMBRL using both particle and extended Kalman
filters and smoothers and discuss key design choices necessary for effective
implementation. Finally, we evaluate different instantiations of the EMBRL
framework on both simulated and real-robot tasks and show that our methods learn
higher performing policies compared to strong MBRL baselines using recurrent
neural networks.
Efficient and Versatile Quadrupedal Skating: Optimal Co-design via
Reinforcement Learning and Bayesian Optimization.
Hanwen Wang, Zhenlong Fang,
Josiah P. Hanna, Xiaobin Xiong.
Proceedings of the IEEE International Conference on Robotics and Automation
(ICRA). May 2026.
Abstract
BibTeX
In this paper, we propose an effective mechanical and algorithmic solution to
enabling skating motion with passive wheels on state-of-the-art quadrupedal
robots. The skating locomotion enables a hybrid combination of wheeled and
legged mobility without the necessity of motorization at the feet, which
simultaneously promote efficiency, speed, and mechanical simplicity. To realize
these potential advantage of skating, we employ a bilevel optimization approach
with an upper level optimization via Bayesian Optimization (BO) to search for
the best mechanical design and a lower level Reinforcement Learning (RL) to find
an optimal motor policy. The end results not only provide optimal mechanical and
control designs but also show versatile locomotion behaviors such as hockey stop
(a rapid braking maneuver by turning sideways to maximize friction) and self-
aligning behavior (automatically adjusts its orientation to move more
efficiently in the commanded direction), providing the first comprehensive study
on quadrupedal robotic skating.
When to Plan: Learning to Select Between Reactive Control and Deliberative
Planning.
Adam Labiosa,
Josiah P. Hanna.
To appear at the Reinforcement Learning Conference (RLC). May 2026.
Abstract
BibTeX
It has long been recognized that humans have the ability to switch between fast,
reactive decision-making and slower, deliberative planning. In this paper, we
study the question of how to learn this ability, known as meta-reasoning, in
artificial agents. We model reactive decision-making as a policy that directly
maps state observations to actions. Such policies can be trained with
reinforcement learning (RL) or imitation learning, but may generalize poorly
outside of their training distribution. Alternatively, model-based decision-time
planning is more likely to produce good actions out-of-distribution but requires
additional computation time, which delays acting. In this work, we introduce an
RL method for training a meta-reasoning policy that allocates computation by
conditioning on an uncertainty score from the reactive policy, enabling it to
predict when the reactive policy is likely to perform poorly and when planning
is needed. We conduct an empirical study on motion planning and navigation
environments, showing that this design enables the meta-reasoning policy to
learn when the reactive policy provides a good-enough action versus when
decision-time planning is needed. Additionally, we show that our design enables
the meta-agent to shift toward fully reactive control as the reactive policy
improves.
A Simple Baseline for Learning Approximate State Abstractions in Factored
State Spaces.
Anshuman Senapti,
Josiah P. Hanna.
To appear at the Reinforcement Learning Conference (RLC). May 2026.
Abstract
BibTeX
The ability to identify and ignore task irrelevant environment variables is
central to intelligent behavior. In reinforcement learning (RL), existing
methods typically rely on auxiliary objectives to learn similar forms of
abstraction and such objectives tend to add significant complexity to the base
RL algorithm. In this work, we take a step back and ask: can selective
abstraction emerge naturally from reward optimization alone, without any
additional objectives? We introduce a surprisingly simple neural network
architecture change: a learnable, observation-independent attention mask applied
to the inputs of the policy and value networks and trained end-to-end using only
the RL objective. Despite its simplicity, this architectural modification
consistently improves sample efficiency and learns to mask out distracting input
variables across 12 continuous control tasks. We analyze the dynamics of
gradient descent using this method on a linear regression task and demonstrate
improved feature credit assignment. Finally, we conduct experiments on toy MDPs
and show that the attention mask leads to accurate Q-value estimation and
induces soft abstractions over a factored state space. Our findings challenge
the need for complex auxiliary objectives to learn state abstractions in deep RL
and suggest a simple baseline for future research.
Centralized Adaptive Sampling for Reliable Co-training of Independent Multi-
Agent Policies.
Nicholas E. Corrado,
Josiah P. Hanna.
To appear at the Reinforcement Learning Conference (RLC). May 2026.
Abstract
BibTeX
Independent on-policy policy gradient algorithms are widely used for multi-agent
reinforcement learning (MARL) in cooperative and no-conflict games, but they are
known to converge sub-optimally when each agent’s individual policy gradient
points away from an optimal joint equilibrium. Going beyond prior work, we
observe that sub-optimal convergence can still arise \textit{even when the
expected individual policy gradients of all agents point toward the optimal
joint solution.} After collecting a finite set of trajectories, stochasticity in
independent action sampling can cause the joint data distribution to deviate
from the expected joint on-policy distribution. This \textit{sampling error}
w.r.t. the joint on-policy distribution produces inaccurate gradient estimates
that can make agents converge sub-optimally. We hypothesize that joint sampling
error can be reduced through coordinated action selection and that doing so will
increase the \textit{reliability} of policy gradient learning in MARL (i.e., the
probability of agents converging to an optimal joint policy). To test this
hypothesis, we first introduce an adaptive action sampling approach to reduce
joint sampling error in the Centralized Training with Decentralized Execution
setting. Our method, Cooperative Sampling Error Reduction (CoSER), continually
adapts a centralized behavior policy to place higher probability on joint
actions that are under-sampled w.r.t. the current joint policy. We then
empirically evaluate CoSER on a diverse set of no-conflict multi-agent games and
demonstrate that (1) CoSER reduces joint sampling error more efficiently than
independent on-policy sampling and (2) this reduction increases the reliability
of independent policy gradient algorithms.
Abstract Sim2Real through Approximate Information States.
Yunfu Deng, Yuhao Li,
Josiah P. Hanna.
IEEE Robotics and Automation Letters (RA-L). April 2026.
This paper contains material that was previously presented at the 2025 NeurIPS
Embodied World Models for Decision-Making.
Abstract
BibTeX
In recent years, reinforcement learning (RL) has shown remarkable success in
robotics when a fast and accurate simulator is available for a given task. When
using RL and simulation, more simulator realism is generally beneficial but
becomes harder to obtain as robots are deployed in increasingly complex and
widescale domains. In such settings, simulators will likely fail to model all
relevant details of a given target task and this observation motivates the study
of sim2real with simulators that leave out key task details. In this paper, we
formalize and study the abstract sim2real problem: given an abstract simulator
that models a target task at a coarse level of abstraction, how can we train a
policy with RL in the abstract simulator and successfully transfer it to the
real-world? Our first contribution is to formalize this problem using the
language of state abstraction from the RL literature. This framing shows that an
abstract simulator can be grounded to match the target task if the grounded
abstract dynamics take the history of states into account. Based on the
formalism, we then introduce a method that uses real-world task data to correct
the dynamics of the abstract simulator. We then show that this method enables
successful policy transfer both in sim2sim and sim2real evaluation.
Adversarial Agents: Black-Box Evasion Attacks with Reinforcement
Learning.
Kyle and Noirot Ferrand, Jean-Charles and Sheatsley, Ryan and Pauley, Eric and
Hanna, Josiah and McDaniel, Patrick Domico.
Findings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR). March 2026.
Abstract
BibTeX
Paper
Attacks on machine learning models have been extensively studied through
stateless optimization. In this paper, we demonstrate how a reinforcement
learning (RL) agent can learn a new class of attack algorithms that generate
adversarial samples. Unlike traditional adversarial machine learning (AML)
methods that craft adversarial samples independently, our RL-based approach
retains and exploits past attack experience to improve the effectiveness and
efficiency of future attacks. We formulate adversarial sample generation as a
Markov Decision Process and evaluate RL's ability to (a) learn effective
and efficient attack strategies and (b) compete with state-of-the-art AML. On
two image classification benchmarks, our agent increases attack success rate by
up to 13.2% and decreases the average number of victim model queries per attack
by up to 16.9% from the start to the end of training. In a head-to-head
comparison with state-of-the-art image attacks, our approach enables an
adversary to generate adversarial samples with 17% more success on unseen inputs
post-training. From a security perspective, this work demonstrates a powerful
new attack vector that uses RL to train agents that attack ML models efficiently
and at scale.
On-Policy Policy Gradient Reinforcement Learning Without On-Policy
Sampling.
Nicholas Corrado,
Josiah P. Hanna.
Transactions on Machine Learning Research (TMLR). January 2026.
Abstract
BibTeX
Paper
On-policy reinforcement learning (RL) algorithms perform policy updates using
i.i.d. trajectories collected by the current policy. However, after observing
only a finite number of trajectories, on-policy sampling may produce data that
fails to match the expected on-policy data distribution. This sampling error
leads to noisy updates and data inefficient on-policy learning. Recent work in
the policy evaluation setting has shown that non-i.i.d., off-policy sampling can
produce data with lower sampling error than on-policy sampling can produce.
Motivated by this observation, we introduce an adaptive, off-policy sampling
method to improve the data efficiency of on-policy policy gradient algorithms.
Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error
by collecting data with a behavior policy that increases the probability of
sampling actions that are under-sampled with respect to the current policy.
Rather than discarding data from old policies -- as is commonly done in on-
policy algorithms -- PROPS uses data collection to adjust the distribution of
previously collected data to be approximately on-policy. We empirically evaluate
PROPS on both continuous-action MuJoCo benchmark tasks as well as discrete-
action tasks and demonstrate that (1) PROPS decreases sampling error throughout
training and (2) improves the data efficiency of on-policy policy gradient
algorithms. Our work improves the RL community's understanding of a nuance
in the on-policy vs off-policy dichotomy: on-policy learning requires on-policy
data, not on-policy sampling.
When Can Model-Free Reinforcement Learning be Enough for Thinking?.
Josiah P. Hanna, Nicholas E. Corrado.
Proceedings of Advances in Neural Information Processing Systems
(NeurIPS). December 2025.
A
shorter
version of this work appeared at the Finding the Frame Workshop at the
Reinforcement Learning Conference (RLC) and received the
"Most Thought-provoking" Paper Award.
Abstract
BibTeX
Paper
Recent work on large language models has demonstrated the use of model-free
reinforcement learning (RL) to train reasoning-like capabilities. The emergence
of "thinking" through model-free RL is interesting as thinking actions
neither produce reward nor change the external world state to one where the
agent is more likely to get reward. This paper seeks to build a domain-
independent understanding of when model-free RL will lead to
"thinking" as a strategy for reward maximization. To build this
understanding, we first introduce a theoretical model which we call a thought
Markov decision process (MDP). Thought MDPs minimally extend the classical MDP
model to include an abstract notion of thought state and thought action. Using
the thought MDP model, we prove the importance of policy initialization in
determining whether or not thinking emerges and show formally that thought
actions are equivalent to the agent choosing to perform a step of policy
improvement before continuing to act. We then show that open-source LLMs satisfy
the conditions that our theory predicts are necessary for model-free RL to
produce thinking-like behavior. Finally, we hypothesize sufficient conditions
that would enable thinking to be learned outside of language generation and
introduce a toy domain where a combination of multi-task pre-training and
designated thought actions enable more data-efficient RL compared to non-
thinking agents.
Abstract Sim2Real through Approximate Information States.
Yunfu Deng, Yuhao Li,
Josiah P. Hanna.
Proceedings of the Workshop on Embodied World Models for Decision-Making at
NeurIPS. December 2025.
Abstract
BibTeX
In recent years, reinforcement learning (RL) has shown remarkable success in
robotics when a fast and accurate simulator is available for a given task. When
using RL and simulation, more simulator realism is generally beneficial but
becomes harder to obtain as robots are deployed in increasingly complex and
widescale domains. In such settings, simulators will likely fail to model all
relevant details of a given target task and this observation motivates the study
of sim2real with simulators that leave out key task details. In this paper, we
formalize and study the abstract sim2real problem: given an abstract simulator
that models a target task at a coarse level of abstraction, how can we train a
policy with RL in the abstract simulator and successfully transfer it to the
real-world? Our first contribution is to formalize this problem using the
language of state abstraction from the RL literature. This framing shows that an
abstract simulator can be grounded to match the target task if the grounded
abstract dynamics take the history of states into account. Based on the
formalism, we then introduce a method that uses real-world task data to correct
the dynamics of the abstract simulator. We then show that this method enables
successful policy transfer both in sim2sim and sim2real evaluation.
Pretraining Decision Transformers with Reward Prediction for In-Context
Multi-task Structured Bandit Learning.
Subhojyoti Mukherjee,
Josiah P. Hanna, Qiaomin Xie, Robert Nowak.
Proceedings of the Reinforcement Learning Conference (RLC). August 2025.
Abstract
BibTeX
Paper
In this paper, we study the multi-task structured bandit problem where the goal
is to learn a near-optimal algorithm that minimizes cumulative regret. The tasks
share a common structure and any optimal algorithm should exploit the shared
structure to minimize the cumulative regret for an unseen but related test task.
We use a transformer as a decision-making algorithm to learn this shared
structure so as to generalize to the unseen test task. The prior work of
pretrained decision transformers like DPT requires access to the optimal action
during training which may be hard in several scenarios. Diverging from these
works, our learning algorithm does not need the knowledge of optimal action per
task during training but predicts a reward vector for each of the actions using
only the observed offline data from the diverse training tasks. Finally, during
inference time, it selects action using the reward predictions employing various
exploration strategies in-context for an unseen test task. We show that our
model outperforms other methods like DPT, and Algorithmic Distillation (AD) and
matches the performance of algorithms that requires privileged information on
the structure of the problem. Interestingly, we show that our algorithm, without
the knowledge of the underlying problem structure, can learn a near-optimal
policy in-context by leveraging the shared structure across diverse tasks. We
show that when the shared structure breaks down with the introduction of new
actions both during training and test time, our proposed algorithm fails to
learn the underlying latent structure. We further show that our algorithm
conducts an implicit two-phase exploration and validate all of these findings
over several experiments spanning linear, non-linear, real-life datasets,
bilinear, and latent bandit settings. Finally, we theoretically analyze the
performance of our algorithm and obtain generalization bounds in the in-context
multi-task learning setting.
Thinking is Another Form of Control.
Josiah P. Hanna, Nicholas E. Corrado.
Proceedings of the Finding the Frame Workshop at the Reinforcement Learning
Conference (RLC). August 2025.
Received the
"Most Thought-provoking" Paper Award.
Most Thought-provoking Paper Award
Abstract
BibTeX
Paper
Recent work on large language models has demonstrated the use of model-free
reinforcement learning (RL) to train reasoning-like capabilities. The emergence
of "thinking" through model-free RL is interesting as thinking actions
neither produce reward nor change the external world state to one where the
agent is more likely to get reward. This paper seeks to build a domain-
independent understanding of when model-free RL will lead to
"thinking" as a strategy for reward maximization. To build this
understanding, we first introduce a theoretical model which we call a thought
Markov decision process (MDP). Thought MDPs minimally extend the classical MDP
model to include an abstract notion of thought state and thought action. Using
the thought MDP model, we prove the importance of policy initialization in
determining whether or not thinking emerges and show formally that thought
actions are equivalent to the agent choosing to perform a step of policy
improvement before continuing to act. We then show that open-source LLMs satisfy
the conditions that our theory predicts are necessary for model-free RL to
produce thinking-like behavior. Finally, we hypothesize sufficient conditions
that would enable thinking to be learned outside of language generation and
introduce a toy domain where a combination of multi-task pre-training and
designated thought actions enable more data-efficient RL compared to non-
thinking agents.
Stable Offline Value Function Learning with Bisimulation-based
Representations.
Brahma S. Pavse, Yudong Chen, Qiaomin Xie,
Josiah P. Hanna.
Proceedings of the International Conference on Machine Learning (ICML).
July 2025.
Abstract
BibTeX
Paper
In reinforcement learning, offline value function learning is the procedure of
using an offline dataset to estimate the expected discounted return from each
state when taking actions according to a fixed target policy. The stability of
this procedure, i.e., whether it converges to its fixed-point, critically
depends on the representations of the state-action pairs. Poorly learned
representations can make value function learning unstable, or even divergent.
Therefore, it is critical to stabilize value function learning by explicitly
shaping the state-action representations. Recently, the class of bisimulation-
based algorithms have shown promise in shaping representations for control.
However, it is still unclear if this class of methods can stabilize value
function learning. In this work, we investigate this question and answer it
affirmatively. We introduce a bisimulation-based algorithm called kernel
representations for offline policy evaluation (KROPE). KROPE uses a kernel to
shape state-action representations such that state-action pairs that have
similar immediate rewards and lead to similar next state-action pairs under the
target policy also have similar representations. We show that KROPE: 1) learns
stable representations and 2) leads to lower value error than baselines. Our
analysis provides new theoretical insight into the stability properties of
bisimulation-based methods and suggests that practitioners can use these methods
to improve the stability and accuracy of offline evaluation of reinforcement
learning agents.
Demystifying the Paradox of Importance Sampling with an Estimated History-
Dependent Behavior Policy in Off-Policy Evaluation.
Hongyi Zhou,
Josiah P. Hanna, Jin Zhu, Ying Yang, Chengchun Shi.
Proceedings of the International Conference on Machine Learning (ICML).
July 2025.
Abstract
BibTeX
Paper
This paper studies off-policy evaluation (OPE) in reinforcement learning with a
focus on behavior policy estimation for importance sampling. Prior work has
shown empirically that estimating a history-dependent behavior policy can lead
to lower mean squared error (MSE) even when the true behavior policy is
Markovian. However, the question of *why* the use of history should lower MSE
remains open. In this paper, we theoretically demystify this paradox by deriving
a bias-variance decomposition of the MSE of ordinary importance sampling (IS)
estimators, demonstrating that history-dependent behavior policy estimation
decreases their asymptotic variances while increasing their finite-sample
biases. Additionally, as the estimated behavior policy conditions on a longer
history, we show a consistent decrease in variance. We extend these findings to
a range of other OPE estimators, including the sequential IS estimator, the
doubly robust estimator and the marginalized IS estimator, with the behavior
policy estimated either parametrically or non-parametrically.
WeRef: An Open-source and Extensible Dataset for Referee Gesture Recognition
in RoboCup.
Zisen Shao,
Josiah P. Hanna.
RoboCup-2025: Robot Soccer World Cup XXVIII. July 2025.
Oral Presentation
Abstract
BibTeX
Visual recognition of referee gestures is essential for fully autonomous robot
soccer, yet progress in deep learning approaches has been hindered by the
absence of a public, standardized dataset and base- line. In this paper, we
present WeRef, an open-source synthetic data generation pipeline for RoboCup
referee gestures built on the Webots simulator, along with a large-scale
synthetic dataset. Our pipeline auto- matically randomizes human models,
backgrounds, lighting conditions, obstacle presence, and camera viewpoints to
produce diverse data with- out manual labeling. We evaluate WeRef on real
competition data us- ing a 2D CNN with GRU classifier. We show that synthetic
samples generated by WeRef effectively augment limited real data, substantially
reducing the need for costly data collection and improving recognition accuracy
when training on a combination of synthetic and real data. By releasing both the
generation software and the resulting dataset, we provide a scalable, open-
source framework to facilitate the devel- opment of referee gesture recognition
in the RoboCup Standard Plat- form League (SPL). The WeRef pipeline and dataset
are available at https://github.com/ZisenShao/WeRef.
Reinforcement Learning Within the Classical Robotics Stack: A Case Study in
Robot Soccer.
Adam Labiosa, Zhihan Wang, Siddhant Agarwal, William Cong, Geethika Hemkumar,
Abhinav Narayan Harish, Benjamin Hong, Josh Kelle, Chen Li, Yuhao Li, Zisen
Shao, Peter Stone,
Josiah P. Hanna.
Proceedings of the IEEE International Conference on Robotics and Automation
(ICRA). May 2025.
A short version of this work appeared at the
Roboletics 2.0 Workshop at
ICRA 2025 and received the Best RoboCup-Themed Paper Award.
Abstract
BibTeX
Paper
Robot decision-making in partially observable, real-time, dynamic, and multi-
agent environments remains a difficult and unsolved challenge. Model-free
reinforcement learning (RL) is a promising approach to learning decision-making
in such domains, however, end-to-end RL in complex environments is often
intractable. To address this challenge in the RoboCup Standard Platform League
(SPL) domain, we developed a novel architecture integrating RL within a
classical robotics stack, while employing a multi-fidelity sim2real approach and
decomposing behavior into learned sub-behaviors with heuristic selection. Our
architecture led to victory in the 2024 RoboCup SPL Challenge Shield Division.
In this work, we fully describe our system's architecture and empirically
analyze key design decisions that contributed to its success. Our approach
demonstrates how RL-based behaviors can be integrated into complete robot
behavior architectures.
Reinforcement Learning Within the Classical Robotics Stack: A Case Study in
Robot Soccer.
Adam Labiosa, Zhihan Wang, Siddhant Agarwal, William Cong, Geethika Hemkumar,
Abhinav Narayan Harish, Benjamin Hong, Josh Kelle, Chen Li, Yuhao Li, Zisen
Shao, Peter Stone,
Josiah P. Hanna.
Proceedings of the Roboletics 2.0 Workshop at ICRA. May 2025.
Best RoboCup-themed Paper
Abstract
BibTeX
Robot decision-making in partially observable, real-time, dynamic, and multi-
agent environments remains a difficult and unsolved challenge. Model-free
reinforcement learning (RL) is a promising approach to learning decision-making
in such domains, however, end-to-end RL in complex environments is often
intractable. To address this challenge in the RoboCup Standard Platform League
(SPL) domain, we developed a novel architecture integrating RL within a
classical robotics stack, while employing a multi-fidelity sim2real approach and
decomposing behavior into learned sub-behaviors with heuristic selection. Our
architecture led to victory in the 2024 RoboCup SPL Challenge Shield Division.
In this work, we fully describe our system's architecture and empirically
analyze key design decisions that contributed to its success. Our approach
demonstrates how RL-based behaviors can be integrated into complete robot
behavior architectures.
Multi-Robot Collaboration through Reinforcement Learning and Abstract
Simulation.
Adam Labiosa,
Josiah P. Hanna.
Proceedings of the IEEE International Conference on Robotics and Automation
(ICRA). May 2025.
Abstract
BibTeX
Paper
Teams of people coordinate to perform complex tasks by forming abstract mental
models of world and agent dynamics. The use of abstract models contrasts with
much recent work in robot learning that uses a high-fidelity simulator and
reinforcement learning (RL) to obtain policies for physical robots. Motivated by
this difference, we investigate the extent to which so-called \textit{abstract
simulators} can be used for multi-agent reinforcement learning (MARL) and the
resulting policies successfully deployed on teams of physical robots. An
abstract simulator models the robot's target task at a high-level of
abstraction and discards many details of the world that could impact optimal
decision-making. Policies are trained in an abstract simulator then transferred
to the physical robot by making use of separately-obtained low-level perception
and motion control modules. We identify three key categories of modifications to
the abstract simulator that enable policy transfer to physical robots:
simulation fidelity enhancements, training optimizations and simulation
stochasticity. We then run an empirical study with extensive ablations to
determine the value of each modification category for enabling policy transfer
in cooperative robot soccer tasks. We also compare the performance of policies
produced by our methodology with a well-tuned non-learning-based behavior
architecture from the annual RoboCup competition and find that our approach
leads to a similar level of performance. Broadly we show that MARL can be use to
train cooperative physical robot behaviors using highly abstract models of the
world.
Adaptive Exploration for Data-Efficient General Value Function
Evaluations.
Arushi Jain,
Josiah P. Hanna, Doina Precup.
Proceedings of Advances in Neural Information Processing Systems
(NeurIPS). December 2024.
Abstract
BibTeX
Paper
General Value Functions (GVFs) (Sutton et al, 2011) are an established way to
represent predictive knowledge in reinforcement learning. Each GVF computes the
expected return for a given policy, based on a unique pseudo-reward. Multiple
GVFs can be estimated in parallel using off-policy learning from a single stream
of data, often sourced from a fixed behavior policy or pre-collected dataset.
This leaves an open question: how can behavior policy be chosen for data-
efficient GVF learning? To address this gap, we propose GVFExplorer, which aims
at learning a behavior policy that efficiently gathers data for evaluating
multiple GVFs in parallel. This behavior policy selects actions in proportion to
the total variance in the return across all GVFs, reducing the number of
environmental interactions. To enable accurate variance estimation, we use a
recently proposed temporal-difference-style variance estimator. We prove that
each behavior policy update reduces the mean squared error in the summed
predictions over all GVFs. We empirically demonstrate our method's
performance in both tabular representations and nonlinear function
approximation.
Data-Efficient Policy Evaluation Through Behavior Policy Search.
Josiah P. Hanna, Yash Chandak, Martha White, Philip Thomas, Scott Niekum,
Peter Stone.
Journal of Machine Learning Research (JMLR). November 2024.
This article contains material that was first published at
ICML 2017.
Abstract
BibTeX
Paper
We consider the task of evaluating a policy for a \textit{Markov decision
process} (MDP). The standard unbiased technique for evaluating a policy is to
deploy the policy and observe its performance. We show that the data collected
from deploying a different policy, commonly called the \textit{behavior policy},
can be used to produce unbiased estimates with lower mean squared error than
this standard technique. We derive an analytic expression for a \textit{minimal
variance behavior policy} -- a behavior policy that minimizes the mean squared
error of the resulting estimates. Because this expression depends on terms that
are unknown in practice, we propose a novel policy evaluation sub-problem,
\textit{behavior policy search}: searching for a behavior policy that reduces
mean squared error. We present two behavior policy search algorithms and
empirically demonstrate their effectiveness in lowering the mean squared error
of policy performance estimates.
Adaptive Exploration for Data-Efficient General Value Function
Evaluations.
Arushi Jain,
Josiah P. Hanna, Doina Precup.
Proceedings of the European Workshop on Reinforcement Learning (EWRL).
October 2024.
Abstract
BibTeX
Paper
General Value Functions (GVFs) (Sutton et al, 2011) are an established way to
represent predictive knowledge in reinforcement learning. Each GVF computes the
expected return for a given policy, based on a unique pseudo-reward. Multiple
GVFs can be estimated in parallel using off-policy learning from a single stream
of data, often sourced from a fixed behavior policy or pre-collected dataset.
This leaves an open question: how can behavior policy be chosen for data-
efficient GVF learning? To address this gap, we propose GVFExplorer, which aims
at learning a behavior policy that efficiently gathers data for evaluating
multiple GVFs in parallel. This behavior policy selects actions in proportion to
the total variance in the return across all GVFs, reducing the number of
environmental interactions. To enable accurate variance estimation, we use a
recently proposed temporal-difference-style variance estimator. We prove that
each behavior policy update reduces the mean squared error in the summed
predictions over all GVFs. We empirically demonstrate our method's
performance in both tabular representations and nonlinear function
approximation.
Reinforcement Learning via Auxiliary Task Distillation.
Abhinav Narayan Harish, Larry Heck,
Josiah P. Hanna, Zsolt Kira, Andrew
Szot.
Proceedings of the European Conference on Computer Vision (ECCV). October
2024.
Abstract
BibTeX
Paper
We present Reinforcement Learning via Auxiliary Task Distillation (AuxDistill),
a new method that enables reinforcement learning (RL) to perform long-horizon
robot control problems by distilling behaviors from auxiliary RL tasks.
AuxDistill achieves this by concurrently carrying out multi-task RL with
auxiliary tasks, which are easier to learn and relevant to the main task. A
weighted distillation loss transfers behaviors from these auxiliary tasks to
solve the main task. We demonstrate that AuxDistill can learn a pixels-to-
actions policy for a challenging multi-stage embodied object rearrangement task
from the environment reward without demonstrations, a learning curriculum, or
pre-trained skills. AuxDistill achieves $2.3 \times$ higher success than the
previous state-of-the-art baseline in the Habitat Object Rearrangement benchmark
and outperforms methods that use pre-trained skills and expert demonstrations.
Toward the confident deployment of real-world reinforcement learning
agents.
Josiah P. Hanna.
AI Magazine. September 2024.
Abstract
BibTeX
Paper
Intelligent learning agents must be able to learn from experience so as to
accomplish tasks that require more ability than could be initially programmed.
Reinforcement learning (RL) has emerged as a potentially powerful class of
solution methods to create agents that learn from trial-and-error interaction
with the world. Despite many prominent success stories, a number of challenges
often stand between the use of RL in real-world problems. As part of the AAAI
New Faculty Highlight Program, in this article, I will describe the work that my
group is doing at the University of Wisconsin—Madison with the intent to remove
barriers to the use of RL in practice. Specifically, I will describe recent work
that aims to give practitioners confidence in learned behaviors, methods to
increase the data efficiency of RL, and work on “challenge” domains that stress
RL algorithms beyond current testbeds.
Guided Data Augmentation for Offline Reinforcement Learning and Imitation
Learning.
Nicholas Corrado, Yuxiao Qu, John U. Balis, Adam Labiosa,
Josiah P.
Hanna.
Proceedings of the Reinforcement Learning Conference (RLC). August 2024.
Abstract
BibTeX
Paper
In offline reinforcement learning (RL), RL agents learn to solve a task using
only a fixed dataset of previously collected data. While offline RL has proven
to be a viable method for learning real-world robot control policies, it
typically requires large amounts of expert-quality data to learn effective
policies that generalize to out-of-distribution states. Unfortunately, such data
is often difficult and expensive to acquire in real-world tasks. Several recent
works have leveraged data augmentation (DA) to inexpensively generate additional
data, but most DA works apply augmentations in a random fashion and ultimately
produce highly suboptimal augmented data. In this work, we propose
\textbf{Gu}ided \textbf{D}ata \textbf{A}ugmentation (GuDA), a human-guided DA
framework that generates expert-quality augmented data. The key insight behind
GuDA is that while it may be difficult to demonstrate the sequence of actions
required to produce expert data, a user can often easily characterize when an
augmented trajectory segment represents progress toward task completion. Thus, a
user can restrict the space of possible augmentations to automatically reject
suboptimal augmented data. To extract a policy from GuDA, we use off-the-shelf
offline reinforcement learning and behavior cloning algorithms. We evaluate GuDA
on a physical robot soccer task as well as simulated D4RL navigation tasks, a
simulated autonomous driving task, and a simulated soccer task. Empirically,
GuDA enables learning given a small initial dataset of potentially suboptimal
experience and outperforms a random DA strategy as well as a model-based DA
strategy.
Replacing Implicit Regression with Classification in Policy Gradient
Reinforcement Learning.
Josiah P. Hanna, Brahma S. Pavse, Abhinav Harish.
Proceedings of the Finding the Frame Workshop at the Reinforcement Learning
Conference (RLC). August 2024.
Abstract
BibTeX
Paper
Stochastic policy gradient methods are a fundamental class of reinforcement
learning algorithms. When using these algorithms for continuous control, it is
common to parameterize the policy using a Gaussian distribution. In this paper,
we show that the policy gradient with Gaussian policies can be viewed as the
gradient of a weighted least-squares objective function. That is, policy
gradient algorithms are implicitly implementing a form of regression. Several
recent works have shown that reformulating regression problems as classification
problems can improve learning. Inspired by these works, we investigate whether
replacing this implicit regression with classification can improve the data
efficiency and stability of policy learning. We introduce a novel policy
gradient surrogate objective for softmax policies over a discretized action
space. This surrogate objective uses a form of cross-entropy loss to replace the
implicit least-squares loss found in the surrogate loss for Gaussian policies.
We extend prior theoretical analysis of this loss to our policy gradient
surrogate objective and provide experiments showing that this novel loss
improves the data efficiency of stochastic policy gradient learning.
Conservative Evaluation of Offline Policy Learning.
Hager Radi Abdelwahed,
Josiah P. Hanna, Matthew E. Taylor.
Transactions on Machine Learning Research (TMLR). August 2024.
Abstract
BibTeX
Paper
The world offers unprecedented amounts of data in real-world domains, from which
we can develop successful decision-making systems. It is possible for
reinforcement learning (RL) to learn control policies offline from such data but
challenging to deploy an agent during learning in safety-critical domains.
Offline RL learns from historical data without access to an environment.
Therefore, we need a methodology for estimating how a newly-learned agent will
perform when deployed in the real environment before actually deploying it. To
achieve this, we propose a framework for conservative evaluation of offline
policy learning (CEOPL). We focus on being conservative so that the probability
that our agent performs below a baseline is approximately $\delta$, where
$\delta$ specifies how much risk we are willing to accept. In our setting, we
assume access to a data stream, split into a train-set to learn an offline
policy, and a test-set to estimate a lower-bound on the offline policy using
off-policy evaluation with bootstrap confidence intervals. A lower-bound
estimate allows us to decide when to deploy our learned policy with minimal risk
of overestimation. We demonstrate CEOPL on a range of tasks as well as real-
world medical data.
Learning to Stabilize Online Reinforcement Learning in Unbounded State
Spaces.
Brahma Pavse, Matthew Zurek, Yudong Chen, Qiaomin Xie,
Josiah P.
Hanna.
Proceedings of the International Conference on Machine Learning (ICML).
July 2024.
Abstract
BibTeX
Paper
In many reinforcement learning (\textsc{rl}) applications, we want policies that
reach desired states and then keep the controlled system within an acceptable
region around the desired states over an indefinite period of time. This latter
objective is called \textit{stability} and is especially important when the
state space is unbounded, such that the states can be arbitrarily far from each
other and the agent can drift far away from the desired states. For example, in
stochastic queuing networks, where queues of waiting jobs can grow without
bound, the desired state is all-zero queue lengths. Here, a stable policy
ensures queue lengths are finite while an optimal policy minimizes queue
lengths. Since an optimal policy is also stable, one would expect that
\textsc{rl} algorithms would implicitly give us stable policies. However, in
this work, we find that deep \textsc{rl} algorithms that directly minimize the
distance to the desired state during online training often result in unstable
policies, i.e., policies that drift far away from the desired state. We
attribute this instability to poor credit-assignment for destabilizing actions.
We then introduce an approach based on two ideas: 1) a Lyapunov-based cost-
shaping technique and 2) state transformations to the unbounded state space. We
conduct an empirical study on various queuing networks and traffic signal
control problems and find that our approach performs competitively against
strong baselines with knowledge of the transition dynamics. Our code is
available here: \url{https://github.com/Badger-RL/STOP}.
SaVeR: Optimal Data Collection Strategy for Safe Policy Evaluation in Tabular
MDP.
Subhojyoti Mukherjee,
Josiah P. Hanna, Robert Nowak.
Proceedings of the International Conference on Machine Learning (ICML).
July 2024.
Abstract
BibTeX
Paper
In this paper, we study safe data collection for the purpose of policy
evaluation in tabular Markov decision processes (MDPs). In policy evaluation, we
are given a target policy and asked to estimate the expected cumulative reward
it will obtain. Policy evaluation requires data and we are interested in the
question of what behavior policy should collect the data for the most accurate
evaluation of the target policy. While prior work has considered behavior policy
selection, in this paper, we additionally consider a safety constraint on the
behavior policy. Namely, we assume there exists a known default policy that
incurs a particular expected cost when run and we enforce that the cumulative
cost of all behavior policies ran is better than a constant factor of the cost
that would be incurred had we always run the default policy. We first show that
there exists a class of intractable MDPs where no safe oracle algorithm with
knowledge about problem parameters can efficiently collect data and satisfy the
safety constraints. We then define the tractability condition for an MDP such
that a safe oracle algorithm can efficiently collect data and using that we
prove the first lower bound for this setting. We then introduce an algorithm
SaVeR for this problem that approximates the safe oracle algorithm and bound the
finite-sample mean squared error of the algorithm while ensuring it satisfies
the safety constraint. Finally, we show in simulations that SaVeR produces low
MSE policy evaluation while satisfying the safety constraint.
Understanding when Dynamics-Invariant Data Augmentations Benefit Model-Free
Reinforcement Learning Updates.
Nicholas Corrado,
Josiah P. Hanna.
Proceedings of the International Conference on Learning Representations
(ICLR). May 2024.
Abstract
BibTeX
Paper
Recently, data augmentation (DA) has emerged as a method for leveraging domain
knowledge to inexpensively generate additional data in reinforcement learning
(RL) tasks, often yielding substantial improvements in data efficiency. While
prior work has demonstrated the utility of incorporating augmented data directly
into model-free RL updates, it is not well-understood when a particular DA
strategy will improve data efficiency. In this paper, we seek to identify
general aspects of DA responsible for observed learning improvements. Our study
focuses on sparse-reward tasks with dynamics-invariant data augmentation
functions, serving as an initial step towards a more general understanding of DA
and its integration into RL training. Experimentally, we isolate three relevant
aspects of DA: state-action coverage, reward density, and the number of
augmented transitions generated per update (the augmented replay ratio). From
our experiments, we draw two conclusions: (1) increasing state-action coverage
often has a much greater impact on data efficiency than increasing reward
density, and (2) decreasing the augmented replay ratio substantially improves
data efficiency.In fact, certain tasks in our empirical study are solvable only
when the replay ratio is sufficiently low.
SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic
Bandits.
Subhojyoti Mukherjee, Qiaomin Xie,
Josiah P. Hanna, Robert Nowak.
Proceedings of the International Conference on Artificial Intelligence and
Statistics (AISTATS). May 2024.
This paper contains material that was previously presented at the
2023
ICML Workshop on the Many Facets of Preference-Based Learning.
Abstract
BibTeX
Paper
In this paper, we study the problem of optimal data collection for policy
evaluation in linear bandits. In policy evaluation, we are given a
\textit{target} policy and asked to estimate the expected cumulative reward it
will obtain when executed in an environment formalized as a multi-armed bandit.
In this paper, we focus on linear bandit setting with heteroscedastic reward
noise. This is the first work that focuses on such an optimal data collection
strategy for policy evaluation involving heteroscedastic reward noise in the
linear bandit setting. We first formulate an optimal design for weighted least
squares estimates in the heteroscedastic linear bandit setting that reduces the
MSE of the target policy. We term this as policy-weighted least square
estimation and use this formulation to derive the optimal behavior policy for
data collection. We then propose a novel algorithm SPEED (\textbf{S}tructured
\textbf{P}olicy \textbf{E}valuation \textbf{E}xperimental \textbf{D}esign) that
tracks the optimal behavior policy and derives its regret with respect to the
optimal behavior policy. Finally, we empirically validate that \sp\ leads to
policy evaluation with mean squared error comparable to the oracle strategy and
significantly lower than simply running the target policy.
Future Prediction Can be a Strong Evidence of Good History Representation in
Partially Observable Environments.
Jeongyeol Kwon, Liu Yang,
Josiah P. Hanna, Robert Nowak.
Arxiv Pre-Print. February 2024.
Abstract
BibTeX
Paper
Learning a good history representation is one of the core challenges of
reinforcement learning (RL) in partially observable environments. Recent works
have shown the advantages of various auxiliary tasks for facilitating
representation learning. However, the effectiveness of such auxiliary tasks has
not been fully convincing, especially in partially observable environments that
require long-term memorization and inference. In this empirical study, we
investigate the effectiveness of future prediction for learning the
representations of histories, possibly of extensive length, in partially
observable environments. We first introduce an approach that decouples the task
of learning history representations from policy optimization via future
prediction. Then, our main contributions are two-fold: (a) we demonstrate that
the performance of reinforcement learning is strongly correlated with the
prediction accuracy of future observations in partially observable environments,
and (b) our approach can significantly improve the overall end-to-end approach
by preventing high-variance noisy signals from reinforcement learning objectives
to influence the representation learning. We illustrate our claims on three
types of benchmarks that necessitate the ability to process long histories for
high returns.
Conditional Mutual Information for Disentangled Representations in
Reinforcement Learning.
Mhairi Dunion, Trevor McInroe, Kevin Sebastian Luck,
Josiah P. Hanna,
Stefano V. Albrecht.
Proceedings of Advances in Neural Information Processing Systems
(NeurIPS). December 2023.
Abstract
BibTeX
Paper
Reinforcement Learning (RL) environments can produce training data with spurious
correlations between features due to the amount of training data or its limited
feature coverage. This can lead to RL agents encoding these misleading
correlations in their latent representation, preventing the agent from
generalising if the correlation changes within the environment or when deployed
in the real world. Disentangled representations can improve robustness, but
existing disentanglement techniques that minimise mutual information between
features require independent features, thus they cannot disentangle
\textit{correlated} features. We propose an auxiliary task for RL algorithms
that learns a disentangled representation of high-dimensional observations with
correlated features by minimising the \textit{conditional} mutual information
between features in the representation. We demonstrate experimentally, using
continuous control tasks, that our approach improves generalisation under
correlation shifts, as well as improving the training performance of RL
algorithms in the presence of correlated features.
Multi-task Representation Learning for Pure Exploration in Bilinear
Bandits.
Subhojyoti Mukherjee, Qiaomin Xie,
Josiah P. Hanna, Robert Nowak.
Proceedings of Advances in Neural Information Processing Systems
(NeurIPS). December 2023.
Abstract
BibTeX
Paper
We study multi-task representation learning for the problem of pure exploration
in bilinear bandits. % In bilinear bandits, an action takes the form of a pair
of arms from two different entity types and the reward is a bilinear function of
the known feature vectors of the arms. % % In the \textit{multi-task bilinear
bandit problem}, we aim to find optimal actions for multiple tasks that share a
common low-dimensional linear representation. The objective is to leverage this
characteristic to expedite the process of identifying the best pair of arms for
all tasks. % We propose the algorithm \gob\ that uses an experimental design
approach to optimize sample allocations for learning the global representation
as well as minimize the number of samples needed to identify the optimal pair of
arms in individual tasks. % To the best of our knowledge, this is the first
study to give sample complexity analysis for pure exploration in bilinear
bandits with shared representation. % Our results demonstrate that by learning
the shared representation across tasks, we achieve significantly improved sample
complexity compared to the traditional approach of solving tasks independently.
State-Action Similarity-Based Representations for Off-Policy
Evaluation.
Brahma S. Pavse,
Josiah P. Hanna.
Proceedings of Advances in Neural Information Processing Systems
(NeurIPS). December 2023.
Abstract
BibTeX
Paper
In reinforcement learning, off-policy evaluation (OPE) is the problem of
estimating the expected return of an evaluation policy given a fixed dataset
that was collected by running one or more different policies. One of the more
empirically successful algorithms for OPE has been the fitted q-evaluation (FQE)
algorithm that uses temporal difference updates to learn an action-value
function, which is then used to estimate the expected return of the evaluation
policy. Typically, the original fixed dataset is fed directly into FQE to learn
the action-value function of the evaluation policy. Instead, in this paper, we
seek to enhance the data-efficiency of FQE by first transforming the fixed
dataset using a learned encoder, and then feeding the transformed dataset into
FQE. To learn such an encoder, we introduce an OPE tailored state-action
behavioral similarity metric, and use this metric and the fixed dataset to learn
an encoder that models this metric. Theoretically, we show that this metric
allows us to bound the error in the resulting OPE estimate. Empirically, we show
that other state-action similarity metrics lead to representations that cannot
represent the action-value function of the evaluation policy, and that our
stateaction representation method boosts the data-efficiency of FQE and lowers
OPE error relative to other OPE-based representation learning methods on
challenging OPE tasks. We also empirically show that the learned representations
significantly mitigate divergence of FQE under varying distribution shifts. Our
code is available here: https://github.com/Badger-RL/ROPE.
SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic
Bandits.
Subhojyoti Mukherjee, Qiaomin Xie,
Josiah P. Hanna, Robert Nowak.
ICML Workshop on the Many Facets of Preference-Based Learning. July 2023.
Abstract
BibTeX
Paper
In this paper, we study the problem of optimal data collection for policy
evaluation in linear bandits. In policy evaluation, we are given a
\textit{target} policy and asked to estimate the expected cumulative reward it
will obtain when executed in an environment formalized as a multi-armed bandit.
In this paper, we focus on linear bandit setting with heteroscedastic reward
noise. This is the first work that focuses on such an optimal data collection
strategy for policy evaluation involving heteroscedastic reward noise in the
linear bandit setting. We first formulate an optimal design for weighted least
squares estimates in the heteroscedastic linear bandit setting that reduces the
MSE of the target policy. We term this as policy-weighted least square
estimation and use this formulation to derive the optimal behavior policy for
data collection. We then propose a novel algorithm SPEED (\textbf{S}tructured
\textbf{P}olicy \textbf{E}valuation \textbf{E}xperimental \textbf{D}esign) that
tracks the optimal behavior policy and derives its regret with respect to the
optimal behavior policy. Finally, we empirically validate that \sp\ leads to
policy evaluation with mean squared error comparable to the oracle strategy and
significantly lower than simply running the target policy.
Temporal Disentanglement of Representations for Improved Generalisation in
Reinforcement Learning.
Mhairi Dunion, Trevor McInroe, Kevin Sebastian Luck,
Josiah P. Hanna,
Stefano V. Albrecht.
Proceedings of the International Conference on Learning Representations
(ICLR). May 2023.
Abstract
BibTeX
Paper
Reinforcement Learning (RL) agents are often unable to generalise well to
environment variations in the state space that were not observed during
training. This issue is especially problematic for image-based RL, where a
change in just one variable, such as the background colour, can change many
pixels in the image. The changed pixels can lead to drastic changes in the
agent's latent representation of the image, causing the learned policy to
fail. To learn more robust representations, we introduce TEmporal
Disentanglement (TED), a self-supervised auxiliary task that leads to
disentangled image representations exploiting the sequential nature of RL
observations. We find empirically that RL algorithms utilising TED as an
auxiliary task adapt more quickly to changes in environment variables with
continued training compared to state-of-the-art representation learning methods.
Since TED enforces a disentangled structure of the representation, our
experiments also show that policies trained with TED generalise better to unseen
values of variables irrelevant to the task (e.g. background colour) as well as
unseen values of variables that affect the optimal policy (e.g. goal positions).
Scaling Marginalized Importance Sampling to High-Dimensional State-Spaces via
State Abstraction.
Brahma S Pavse,
Josiah P Hanna.
Proceedings of the AAAI Conference on Artificial Intelligence. February
2023.
Abstract
BibTeX
Paper
We consider the problem of off-policy evaluation (OPE) in reinforcement learning
(RL), where the goal is to estimate the performance of an evaluation policy, πe,
using a fixed dataset, D, collected by one or more policies that may be
different from πe. Current OPE algorithms may produce poor OPE estimates under
policy distribution shift i.e., when the probability of a particular stateaction
pair occurring under πe is very different from the probability of that same pair
occurring in D (Voloshin et al. 2021; Fu et al. 2021). In this work, we propose
to improve the accuracy of OPE estimators by projecting the high-dimensional
state-space into a low-dimensional state-space using concepts from the state
abstraction literature. Specifically, we consider marginalized importance
sampling (MIS) OPE algorithms which compute state-action distribution correction
ratios to produce their OPE estimate. In the original ground statespace, these
ratios may have high variance which may lead to high variance OPE. However, we
prove that in the lower-dimensional abstract state-space the ratios can have
lower variance resulting in lower variance OPE. We then highlight the challenges
that arise when estimating the abstract ratios from data, identify sufficient
conditions to overcome these issues, and present a minimax optimization problem
whose solution yields these abstract ratios. Finally, our empirical evaluation
on difficult, high-dimensional state-space OPE tasks shows that the abstract
ratios can make MIS OPE estimators achieve lower mean-squared error and more
robust to hyperparameter tuning than the ground ratios.
Robust On-Policy Sampling for Data-Efficient Policy Evaluation in
Reinforcement Learning.
Rujie Zhong, Duohan Zhang, Lukas Schäfer, Stefano V. Albrecht,
Josiah P.
Hanna.
Proceedings of Neural and Information Processing Systems (NeurIPS).
December 2022.
This paper contains material that was previously presented at the NeurIPS
Workshop on Offline Reinforcement Learning (OfflineRL) 2021.
Abstract
BibTeX
Paper
Reinforcement learning (RL) algorithms are often categorized as either on-policy
or off-policy depending on whether they use data from a target policy of
interest or from a different behavior policy. In this paper, we study a subtle
distinction between on-policy data and on-policy sampling in the context of the
RL sub-problem of policy evaluation. We observe that on-policy sampling may fail
to match the expected distribution of on-policy data after observing only a
finite number of trajectories and this failure hinders data-efficient policy
evaluation. Towards improved data-efficiency, we show how non-i.i.d., off-policy
sampling can produce data that more closely matches the expected on-policy data
distribution and consequently increases the accuracy of the Monte Carlo
estimator for policy evaluation. We introduce a method called Robust On-Policy
Sampling and demonstrate theoretically and empirically that it produces data
that converges faster to the expected on-policy distribution compared to on-
policy sampling. Empirically, we show that this faster convergence leads to
lower mean squared error policy value estimates.
Temporal Disentanglement of Representations for Improved Generalisation in
Reinforcement Learning..
Mhairi Dunion, Trevor McInroe, Kevin Sebastian Luck,
Josiah P. Hanna,
Stefano V. Albrecht.
Proceedings of the NeurIPS 2022 Workshop on Deep Reinforcement Learning.
December 2022.
Abstract
BibTeX
Paper
Reinforcement Learning (RL) agents are often unable to generalise well to
environment variations in the state space that were not observed during
training. This issue is especially problematic for image-based RL, where a
change in just one variable, such as the background colour, can change many
pixels in the image. The changed pixels can lead to drastic changes in the
agent's latent representation of the image, causing the learned policy to
fail. To learn more robust representations, we introduce TEmporal
Disentanglement (TED), a self-supervised auxiliary task that leads to
disentangled image representations exploiting the sequential nature of RL
observations. We find empirically that RL algorithms utilising TED as an
auxiliary task adapt more quickly to changes in environment variables with
continued training compared to state-of-the-art representation learning methods.
Since TED enforces a disentangled structure of the representation, our
experiments also show that policies trained with TED generalise better to unseen
values of variables irrelevant to the task (e.g. background colour) as well as
unseen values of variables that affect the optimal policy (e.g. goal positions).
Scaling Marginalized Importance Sampling to High-Dimensional State-Spaces via
State Abstraction.
Brahma S Pavse,
Josiah P. Hanna.
Proceedings of the Offline Reinforcement Learning Workshop at NeurIPS
2022. December 2022.
Abstract
BibTeX
Paper
We consider the problem of off-policy evaluation (OPE) in reinforcement learning
(RL), where the goal is to estimate the performance of an evaluation policy, πe,
using a fixed dataset, D, collected by one or more policies that may be
different from πe. Current OPE algorithms may produce poor OPE estimates under
policy distribution shift i.e., when the probability of a particular stateaction
pair occurring under πe is very different from the probability of that same pair
occurring in D (Voloshin et al. 2021; Fu et al. 2021). In this work, we propose
to improve the accuracy of OPE estimators by projecting the high-dimensional
state-space into a low-dimensional state-space using concepts from the state
abstraction literature. Specifically, we consider marginalized importance
sampling (MIS) OPE algorithms which compute state-action distribution correction
ratios to produce their OPE estimate. In the original ground statespace, these
ratios may have high variance which may lead to high variance OPE. However, we
prove that in the lower-dimensional abstract state-space the ratios can have
lower variance resulting in lower variance OPE. We then highlight the challenges
that arise when estimating the abstract ratios from data, identify sufficient
conditions to overcome these issues, and present a minimax optimization problem
whose solution yields these abstract ratios. Finally, our empirical evaluation
on difficult, high-dimensional state-space OPE tasks shows that the abstract
ratios can make MIS OPE estimators achieve lower mean-squared error and more
robust to hyperparameter tuning than the ground ratios.
Multi-agent Databases via Independent Learning.
Chi Zhang, Olga Papaemmanouil,
Josiah P. Hanna, Aditya Akella.
Proceedings of the 4th International Workshop on Applied AI for Database
Systems and Applications. September 2022.
Abstract
BibTeX
Paper
Machine learning is rapidly being used in database research to improve the
effectiveness of numerous tasks included but not limited to query optimization,
workload scheduling, physical design, etc. Currently, the research focus has
been on replacing a single database component responsible for one task by its
learning-based counterpart. However, query performance is not simply determined
by the performance of a single component, but by the cooperation of multiple
ones. As such, learning based database components need to collaborate during
both training and execution in order to develop policies that meet end
performance goals. Thus, the paper attempts to address the question "Is it
possible to design a database consisting of various learned components that
cooperatively work to improve end-to-end query latency?". To answer this
question, we introduce MADB (Multi-Agent DB), a proof-of-concept system that
incorporates a learned query scheduler and a learned query optimizer. MADB
leverages a cooperative multi-agent reinforcement learning approach that allows
the two components to exchange the context of their decisions with each other
and collaboratively work towards reducing the query latency. Preliminary results
demonstrate that MADB can outperform the non-cooperative integration of learned
components.
Simulation-Acquired Latent Action Spaces for Dynamics Generalization.
Nicholas Corrado, Yuxiao Qu,
Josiah P. Hanna.
Proceedings of the 1st Conference on Lifelong Learning Agents (CoLLAs).
August 2022.
Abstract
BibTeX
Paper
Deep reinforcement learning has shown incredible promise at training high-
performing agents to solve high-dimensional continuous control tasks in a
particular training environment. However, to be useful in real-world settings,
long-lived agents must perform well across a range of environmental conditions.
Naively applying deep RL to a task where environment conditions may vary from
episode to episode can be data inefficient. To address this inefficiency, we
introduce a method that discovers structure in an agent’s high-dimensional
continuous action space to speed up learning across a range of environmental
conditions. Whereas prior work on finding so-called latent action spaces
requires expert demonstrations or on-task experience, we instead propose to
discover the latent, lower-dimensional action space in a simulated environment
and then transfer the learned action space for training over a distribution of
environments. We evaluate our novel method on randomized variants of simulated
MuJoCo environments and find that, when there is a lower-dimensional action
space to exploit, our method significantly increases data efficiency. For
instance, in the Ant environment, our method reduces the 8-dimensional action
space to a 3-dimensional action space and doubles the median return achieved
after a training budget of 2 million timesteps.
ReVar: Strengthening Policy Evaluation via Reduced Variance Sampling.
Subhojyoti Mukherjee,
Josiah P. Hanna, Robert Nowak.
Proceedings of the 38th International Conference on Uncertainty in Artificial
Intelligence (UAI). August 2022.
Abstract
BibTeX
Paper
This paper studies the problem of data collection for policy evaluation in
Markov decision processes (MDPs). In policy evaluation, we are given a target
policy and asked to estimate the expected cumulative reward it will obtain in an
environment formalized as an MDP. We develop theory for optimal data collection
within the class of tree-structured MDPs by first deriving an oracle data
collection strategy that uses knowledge of the variance of the reward
distributions. We then introduce the Reduced Variance Sampling (ReVar) algorithm
that approximates the oracle strategy when the reward variances are unknown a
priori and bound its sub-optimality compared to the oracle strategy. Finally, we
empirically validate that ReVar leads to policy evaluation with mean squared
error comparable to the oracle strategy and significantly lower than simply
running the target policy.
Decoupled Reinforcement Learning to Stabilise Intrinsically-Motivated
Exploration.
Lukas Schäfer,
Josiah P Hanna, Filippos Christiano, Stefano V.
Albrecht.
Proceedings of the International Conference on Autonomous and Multi-agent
Systems (AAMAS). May 2022.
Abstract
BibTeX
Paper
Intrinsic rewards can improve exploration in reinforcement learning, but the
exploration process may suffer from instability caused by non-stationary reward
shaping and strong dependency on hyperparameters. In this work, we introduce
Decoupled RL (DeRL) as a general framework which trains separate policies for
intrinsicallymotivated exploration and exploitation. Such decoupling allows DeRL
to leverage the benefits of intrinsic rewards for exploration while
demonstrating improved robustness and sample efficiency. We evaluate DeRL
algorithms in two sparse-reward environments with multiple types of intrinsic
rewards. Our results show that DeRL is more robust to varying scale and rate of
decay of intrinsic rewards and converges to the same evaluation returns than
intrinsically-motivated baselines in fewer interactions. Lastly, we discuss the
challenge of distribution shift and show that divergence constraint regularisers
can successfully minimise instability caused by divergence of exploration and
exploitation policies.
Behavior Policy Search for Risk Estimators in RL.
Elita Lobo, Yash Chandak, Subramanian Dharmashankar,
Josiah P. Hanna,
Marek Petrik.
Proceedings of the NeurIPS Workshop on Safe and Robust Control of Uncertain
Systems. December 2021.
Abstract
BibTeX
Paper
In real-world sequential decision problems, exploration is expensive, and the
risk of expert decision policies must be evaluated from limited data. In this
setting, Monte Carlo (MC) risk estimators are typically used to estimate the
risk of decision policies. Unfortunately, while these estimators have the
desired low bias property, they often suffer from large variance. In this paper,
we consider the problem of minimizing the asymptotic mean squared error and
hence variance of MC risk estimators. We show that by carefully choosing the
data sampling policy (behavior policy), we can obtain low variance estimates of
the risk of any given decision policy.
Safe Evaluation for Offline Learning: Are We Ready To Deploy?.
Hager Radi,
Josiah P. Hanna, Peter Stone, Matthew E. Taylor.
Proceedings of the NeurIPS Workshop on Deployable Decision Making in Embodied
Systems (DDM). December 2021.
Abstract
BibTeX
Paper
The world currently offers an abundance of data in multiple domains, from which
we can learn reinforcement learning (RL) policies without further interaction
with the environment. RL agents learning offline from such data is possible but
deploying them while learning might be dangerous in domains where safety is
critical. Therefore, it is essential to find a way to estimate how a newly-
learned agent will perform if deployed in the target environment before actually
deploying it and without the risk of overestimating its true performance. To
achieve this, we introduce a framework for safe evaluation of offline learning
using approximate high-confidence off-policy evaluation (HCOPE) to estimate the
performance of offline policies during learning. In our setting, we assume a
source of data, which we split into a train-set, to learn an offline policy, and
a test-set, to estimate a lower-bound on the offline policy using off-policy
evaluation with bootstrapping. A lower-bound estimate tells us how good a newly-
learned target policy would perform before it is deployed in the real
environment, and therefore allows us to decide when to deploy our learned
policy.
Robust On-Policy Sampling for Data-Efficient Policy Evaluation in
Reinforcement Learning.
Rujie Zhong,
Josiah P. Hanna, Lukas Schäfer, Stefano V. Albrecht.
Proceedings of the NeurIPS Workshop on Offline Reinforcement Learning
(OfflineRL). December 2021.
Abstract
BibTeX
Paper
Reinforcement learning (RL) algorithms are often categorized as either on-policy
or off-policy depending on whether they use data from a target policy of
interest or from a different behavior policy. In this paper, we study a subtle
distinction between on-policy data and on-policy sampling in the context of the
RL sub-problem of policy evaluation. We observe that on-policy sampling may fail
to match the expected distribution of on-policy data after observing only a
finite number of trajectories and this failure hinders data-efficient policy
evaluation. Towards improved data-efficiency, we show how non-i.i.d., off-policy
sampling can produce data that more closely matches the expected on-policy data
distribution and consequently increases the accuracy of the Monte Carlo
estimator for policy evaluation. We introduce a method called Robust On-Policy
Sampling and demonstrate theoretically and empirically that it produces data
that converges faster to the expected on-policy distribution compared to on-
policy sampling. Empirically, we show that this faster convergence leads to
lower mean squared error policy value estimates.
Interpretable Goal Recognition in the Presence of Occluded Factors for
Autonomous Vehicles.
Josiah P. Hanna, Arrasy Rahman, Elliot Fosong, Francisco Eiras, Mihai
Dobre, John Redford, Subramanian Ramamoorthy, Stefano V. Albrecht.
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS). October 2021.
Abstract
BibTeX
Paper
Recognising the goals or intentions of observed vehicles is a key step towards
predicting the long-term future behaviour of other agents in an autonomous
driving scenario. When there are unseen obstacles or occluded vehicles in a
scenario, goal recognition may be confounded by the effects of these unseen
entities on the behaviour of observed vehicles. Existing prediction algorithms
that assume rational behaviour with respect to inferred goals may fail to make
accurate long-horizon predictions because they ignore the possibility that the
behaviour is influenced by such unseen entities. We introduce the Goal and
Occluded Factor Inference (GOFI) algorithm which bases inference on inverse-
planning to jointly infer a probabilistic belief over goals and potential
occluded factors. We then show how these beliefs can be integrated into Monte
Carlo Tree Search (MCTS). We demonstrate that jointly inferring goals and
occluded factors leads to more accurate beliefs with respect to the true world
state and allows an agent to safely navigate several scenarios where other
baselines take unsafe actions leading to collisions.
Towards Quantum-Secure Authentication and Key Agreement Via Abstract Multi-
Agent Interaction.
Ibrahim H. Ahmed,
Josiah P. Hanna, Elliot Fosong, Stefano V.
Albrecht.
Proceedings of the International Conference on Practical Applications of
Agents and Multi-Agent Systems (PAAMS). October 2021.
Abstract
BibTeX
Paper
Current methods for authentication and key agreement based on public-key
cryptography are vulnerable to quantum computing. We propose a novel approach
based on artificial intelligence research in which communicating parties are
viewed as autonomous agents which interact repeatedly using their private
decision models. Authentication and key agreement are decided based on the
agents' observed behaviors during the interaction. The security of this
approach rests upon the difficulty of modeling the decisions of interacting
agents from limited observations, a problem which we conjecture is also hard for
quantum computing. We release PyAMI, a prototype authentication and key
agreement system based on the proposed method. We empirically validate our
method for authenticating legitimate users while detecting different types of
adversarial attacks. Finally, we show how reinforcement learning techniques can
be used to train server models which effectively probe a client's decisions
to achieve more sample-efficient authentication.
A Joint Imitation-Reinforcement Learning Framework for Reduced Baseline
Regret.
Sheelabhadra Dey, Sumedh Pendurkar, Guni Sharon,
Josiah P. Hanna.
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS). September 2021.
Abstract
BibTeX
Paper
In various control task domains, existing controllers provide a baseline level
of performance that—though possibly suboptimal—should be maintained.
Reinforcement learning (RL) algorithms that rely on extensive exploration of the
state and action space can be used to optimize a control policy. However, fully
exploratory RL algorithms may decrease performance below a baseline level during
training. In this paper, we address the issue of online optimization of a
control policy while minimizing regret with respect to a baseline policy
performance. We present a joint imitationreinforcement learning framework,
denoted JIRL. The learning process in JIRL assumes the availability of a
baseline policy and is designed with two objectives in mind (a) training while
leveraging demonstrations from the baseline policy to minimize regret with
respect to the baseline policy, and (b) eventually surpassing the baseline
performance. JIRL addresses these objectives by initially learning to imitate
the baseline policy and gradually shifting control from the baseline to an RL
agent. Experimental results show that JIRL effectively accomplishes the
aforementioned objectives in several, continuous action-space domains. The
results demonstrate that JIRL is comparable to a state-of-the-art algorithm in
its final performance while incurring significantly lower baseline regret during
training. Moreover, the results show a reduction factor of up to 21 in baseline
regret over a trust-region based approach that guarantees monotonic policy
improvement.
Decoupled Reinforcement Learning to Stabilise Intrinsically-Motivated
Exploration.
Lukas Schäfer,
Josiah P. Hanna, Filippos Christiano, Stefano V
Albrecht.
Proceedings of the ICML Workshop on Unsupervised Reinforcement Learning
(URL). July 2021.
Abstract
BibTeX
Paper
Intrinsic rewards can improve exploration in reinforcement learning, but the
exploration process may suffer from instability caused by non-stationary reward
shaping and strong dependency on hyperparameters. In this work, we introduce
Decoupled RL (DeRL) as a general framework which trains separate policies for
intrinsicallymotivated exploration and exploitation. Such decoupling allows DeRL
to leverage the benefits of intrinsic rewards for exploration while
demonstrating improved robustness and sample efficiency. We evaluate DeRL
algorithms in two sparse-reward environments with multiple types of intrinsic
rewards. Our results show that DeRL is more robust to varying scale and rate of
decay of intrinsic rewards and converges to the same evaluation returns than
intrinsically-motivated baselines in fewer interactions. Lastly, we discuss the
challenge of distribution shift and show that divergence constraint regularisers
can successfully minimise instability caused by divergence of exploration and
exploitation policies.
Importance sampling in reinforcement learning with an estimated behavior
policy.
Josiah P. Hanna, Scott Niekum, Peter Stone.
Machine Learning. May 2021.
This article contains material that was previously published in an AAMAS 2019
paper and an ICML 2019 paper.
Abstract
BibTeX
Paper
In reinforcement learning, importance sampling is a widely used method for
evaluating an expectation under the distribution of data of one policy when the
data has in fact been generated by a different policy. Importance sampling
requires computing the likelihood ratio between the action probabilities of a
target policy and those of the data-producing behavior policy. In this article,
we study importance sampling where the behavior policy action probabilities are
replaced by their maximum likelihood estimate of these probabilities under the
observed data. We show this general technique reduces variance due to sampling
error in Monte Carlo style estimators. We introduce two novel estimators that
use this technique to estimate expected values that arise in the RL literature.
We find that these general estimators reduce the variance of Monte Carlo
sampling methods, leading to faster learning for policy gradient algorithms and
more accurate off-policy policy evaluation. We also provide theoretical analysis
showing that our new estimators are consistent and have asymptotically lower
variance than Monte Carlo estimators.
Grounded Action Transformation for Sim-to-Real Reinforcement
Learning.
Josiah P. Hanna, Siddharth Desai, Haresh Karnan, Garrett Warnell, Peter
Stone.
Machine Learning (MLJ): Special Issue on Reinforcement Learning for Real
Life. May 2021.
This article contains material that was previously published in an AAAI 2017
paper and an IROS 2020 paper.
Abstract
BibTeX
Paper
Reinforcement learning in simulation is a promising alternative to the
prohibitive sample cost of reinforcement learning in the physical world.
Unfortunately, policies learned in simulation often perform worse than hand-
coded policies when applied on the target, physical system. Grounded simulation
learning (GSL) is a general framework that promises to address this issue by
altering the simulator to better match the real world (Farchy et al. 2013). This
article introduces a new algorithm for GSL – Grounded Action Transformation
(GAT) – and applies it to learning control policies for a humanoid robot. We
evaluate our algorithm in controlled experiments where we show it to allow
policies learned in simulation to transfer to the real world. We then apply our
algorithm to learning a fast bipedal walk on a humanoid robot and demonstrate a
43.27% improvement in forward walk velocity compared to a state-of-the art hand-
coded walk. This striking empirical success notwithstanding, further empirical
analysis shows that GAT may struggle when the real world has stochastic state
transitions. To address this limitation we generalize GAT to the stochastic GAT
(SGAT) algorithm and empirically show that SGAT leads to successful real world
transfer in situations where GAT may fail to find a good policy. Our results
contribute to a deeper understanding of grounded simulation learning and
demonstrate its effectiveness for applying reinforcement learning to learn robot
control policies entirely in simulation.
An Imitation from Observation Approach to Transfer Learning with Dynamics
Mismatch.
Siddharth Desai, Ishan Durugkar, Haresh Karnan, Garrett Warnell,
Josiah
Hanna, Peter Stone.
Proceedings of Advances in Neural Information Processing Systems
(NeurIPS). December 2020.
Abstract
BibTeX
Paper
We examine the problem of transferring a policy learned in a source environment
to a target environment with different dynamics, particularly in the case where
it is critical to reduce the amount of interaction with the target environment
during learning. This problem is particularly important in sim-to-real transfer
because simulators inevitably model real-world dynamics imperfectly. In this
paper, we show that one existing solution to this transfer problem ”grounded
action transformation” is closely related to the problem of imitation from
observation (IfO): learning behaviors that mimic the observations of behavior
demonstrations. After establishing this relationship, we hypothesize that recent
state-of-the-art approaches from the IfO literature can be effectively
repurposed for grounded transfer learning. To validate our hypothesis we derive
a new algorithm, generative adversarial reinforced action transformation (GARAT)
based on adversarial imitation from observation techniques. We run experiments
in several domains with mismatched dynamics, and find that agents trained with
GARAT achieve higher returns in the target environment compared to existing
black-box transfer methods.
On Sampling Error in Batch Action-Value Prediction Algorithms.
Brahma Pavse, Ishan Durugkar,
Josiah P. Hanna, Peter Stone.
Proceedings of the Offline Reinforcement Learning Workshop at Neural
Information Processing Systems (NeurIPS). December 2020.
Abstract
BibTeX
Paper
Temporal difference (TD) learning is one of the main foundations of modern
reinforcement learning. This paper studies the use of TD(0), a canonical TD
algorithm, to estimate the value function of a given policy from a batch of
data. In this batch setting, we show that TD(0) may converge to an inaccurate
value function because the update following an action is weighted according to
the number of times that action occurred in the batch – not the true probability
of the action under the given policy. To address this limitation, we introduce
policy sampling error corrected-TD(0) (PSEC-TD(0)). PSEC-TD(0) first estimates
the empirical distribution of actions in each state in the batch and then uses
importance sampling to correct for the mismatch between the empirical weighting
and the correct weighting for updates following each action. We refine the
concept of a certainty-equivalence estimate and argue that PSEC-TD(0) is a more
data efficient estimator than TD(0) for a fixed batch of data. Finally, we
conduct an empirical evaluation of PSEC-TD(0) on three batch value function
learning tasks, with a hyperparameter sensitivity analysis, and show that PSEC-
TD(0) produces value function estimates with lower mean squared error than
TD(0).
RIDM: Reinforced Inverse Dynamics Modeling for Learning from a Single
Observed Demonstration.
Brahma S. Pavse, Faraz Torabi,
Josiah Hanna, Garrett Warnell, Peter
Stone.
IEEE Robotics and Automation Letters (RA-L). October 2020.
Abstract
BibTeX
Paper
Augmenting reinforcement learning with imitation learning is often hailed as a
method by which to improve upon learning from scratch. However, most existing
methods for integrating these two techniques are subject to several strong
assumptions—chief among them that information about demonstrator actions is
available. In this paper, we investigate the extent to which this assumption is
necessary by introducing and evaluating reinforced inverse dynamics modeling
(RIDM), a novel paradigm for combining imitation from observation (IfO) and
reinforcement learning with no dependence on demonstrator action information.
Moreover, RIDM requires only a single demonstration trajectory and is able to
operate directly on raw (unaugmented) state features. We find experimentally that
RIDM performs favorably compared to a baseline approach for several tasks in
simulation as well as for tasks on a real UR5 robot arm. Experiment videos can
be found at https://sites.google.com/view/ridm-reinforced-inverse-dynami.
Reinforced Grounded Action Transformation for Sim-to-Real Transfer.
Haresh Karnan, Sid Desai,
Josiah Hanna, Garrett Warnell, Peter Stone.
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS). October 2020.
Abstract
BibTeX
Paper
Robots can learn to do complex tasks in simulation, but often, learned behaviors
fail to transfer well to the real world due to simulator imperfections (the
reality gap). Some existing solutions to this sim-to-real problem, such as
Grounded Action Transformation (GAT), use a small amount of real world
experience to minimize the reality gap by grounding the simulator. While very
effective in certain scenarios, GAT is not robust on problems that use complex
function approximation techniques to model a policy. In this paper, we introduce
Reinforced Grounded Action Transformation (RGAT), a new sim-to-real technique
that uses Reinforcement Learning (RL) not only to update the target policy in
simulation, but also to perform the grounding step itself. This novel
formulation allows for end-to-end training during the grounding step, which,
compared to GAT, produces a better grounded simulator. Moreover, we show
experimentally in several MuJoCo domains that our approach leads to successful
transfer for policies modeled using neural networks
Stochastic Grounded Action Transformation for Robot Learning in
Simulation.
Sid Desai, Haresh Karnan,
Josiah Hanna, Garrett Warnell, Peter Stone.
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS). October 2020.
Abstract
BibTeX
Paper
Robot control policies learned in simulation do not often transfer well to the
real world. Many existing solutions to this sim-to-real problem, such as the
Grounded Action Transformation (GAT) algorithm, seek to correct for or ground
these differences by matching the simulator tothe real world. However, the
efficacy of these approaches is limited if they do not explicitly account for
stochasticity inthe target environment. In this work, we analyze the prob-lems
associated with grounding a deterministic simulator in astochastic real world
environment, and we present examples where GAT fails to transfer a good policy
due to stochastic transitions in the target domain. In response, we introduce
the Stochastic Grounded Action Transformation (SGAT) algorithm, which models
this stochasticity when grounding the simulator. We find experimentally for both
simulated and physical target domains that SGAT can find policies that are
robust to stochasticity in the target domain.
Reducing Sampling Error in Batch Temporal Difference Learning.
Brahma Pavse, Ishan Durugkar,
Josiah P. Hanna, Peter Stone.
Proceedings of the 37th International Conference on Machine Learning
(ICML). July 2020.
Abstract
BibTeX
Paper
Temporal difference (TD) learning is one of the main foundations of modern
reinforcement learning. This paper studies the use of TD(0), a canonical TD
algorithm, to estimate the value function of a given policy from a batch of
data. In this batch setting, we show that TD(0) may converge to an inaccurate
value function because the update following an action is weighted according to
the number of times that action occurred in the batch – not the true probability
of the action under the given policy. To address this limitation, we introduce
policy sampling error corrected-TD(0) (PSEC-TD(0)). PSEC-TD(0) first estimates
the empirical distribution of actions in each state in the batch and then uses
importance sampling to correct for the mismatch between the empirical weighting
and the correct weighting for updates following each action. We refine the
concept of a certainty-equivalence estimate and argue that PSEC-TD(0) is a more
data efficient estimator than TD(0) for a fixed batch of data. Finally, we
conduct an empirical evaluation of PSEC-TD(0) on three batch value function
learning tasks, with a hyperparameter sensitivity analysis, and show that PSEC-
TD(0) produces value function estimates with lower mean squared error than
TD(0).
Learning an Interpretable Traffic Signal Control Policy.
James Ault,
Josiah P. Hanna, Guni Sharon.
Proceedings of the 19th International Conference on Autonomous Agents and
Multiagent Systems (AAMAS). May 2020.
Abstract
BibTeX
Paper
Signalized intersections are managed by controllers that assign right of way
(green, yellow, and red lights) to non-conflicting directions. Optimizing the
actuation policy of such controllers is expected to alleviate traffic congestion
and its adverse impact. Given such a safety-critical domain, the affiliated
actuation policy is required to be interpretable in a way that can be understood
and regulated by a human. This paper presents and analyzes several on-line
optimization techniques for tuning interpretable control functions. Although
these techniques are defined in a general way, this paper assumes a specific
class of interpretable control functions (polynomial functions) for analysis
purposes. We show that such an interpretable policy function can be as effective
as a deep neural network for approximating an optimized signal actuation policy.
We present empirical evidence that supports the use of value-based reinforcement
learning for on-line training of the control function. Specifically, we present
and study three variants of the Deep Q-learning algorithm that allow the
training of an interpretable policy function. Our Deep Regulatable Hardmax
Q-learning variant is shown to be particularly effective in optimizing our
interpretable actuation policy, resulting in up to 19.4% reduced vehicles delay
compared to commonly deployed actuated signal controllers.
Importance Sampling Policy Evaluation with an Estimated Behavior
Policy.
Josiah Hanna, Scott Niekum, Peter Stone.
Proceedings of the 36th International Conference on Machine Learning
(ICML). June 2019.
Abstract
BibTeX
Paper
We consider the problem of off-policy evaluation in Markov decision processes.
Off-policy evaluation is the task of evaluating the expected return of one
policy with data generated by a different, behavior policy. Importance sampling
is a technique for off-policy evaluation that re-weights off-policy returns to
account for differences in the likelihood of the returns between the two
policies. In this paper, we study importance sampling with an estimated behavior
policy where the behavior policy estimate comes from the same set of data used
to compute the importance sampling estimate. We find that this estimator often
lowers the mean squared error of off-policy evaluation compared to importance
sampling with the true behavior policy or using a behavior policy that is
estimated from a separate data set. Intuitively, estimating the behavior policy
in this way corrects for error due to sampling in the action-space. Our
empirical results also extend to other popular variants of importance sampling
and show that estimating a non-Markovian behavior policy can further lower
large-sample mean squared error even when the true behavior policy is Markovian.
RIDM: Reinforced Inverse Dynamics Modeling for Learning From a Single
Observed Demonstration.
Brahma S. Pavse, Faraz Torabi,
Josiah P. Hanna, Garrett Warnell, Peter
Stone.
Proceedings of the Imitation, Intent, and Interaction (I3) Workshop at ICML
2019. June 2019.
Abstract
BibTeX
Paper
Imitation learning has long been an approach to alleviate the tractability
issues that arise in reinforcement learning. However, most literature makes
several assumptions such as access to the expert's actions, availability of
many expert demonstrations, and injection of task-specific domain knowledge into
the learning process. We propose reinforced inverse dynamics modeling (RIDM), a
method of combining reinforcement learning and imitation from observation (IfO)
to perform imitation using a single expert demonstration, with no access to the
expert's actions, and with little task-specific domain knowledge. Given
only a single set of the expert's raw states, such as joint angles in a
robot control task, at each time-step, we learn an inverse dynamics model to
produce the necessary low-level actions, such as torques, to transition from one
state to the next such that the reward from the environment is maximized. We
demonstrate that RIDM outperforms other techniques when we apply the same
constraints on the other methods on six domains of the MuJoCo simulator and for
two different robot soccer tasks for two experts from the RoboCup 3D simulation
league on the SimSpark simulator.
Reducing Sampling Error in the Monte Carlo Policy Gradient Estimator.
Josiah Hanna, Peter Stone.
Proceedings of the 18th International Conference on Autonomous Agents and
Multiagent Systems (AAMAS). May 2019.
This paper contains material that was previously presented at the 2018 NeurIPS
Deep Reinforcement Learning Workshop.
Abstract
BibTeX
Paper
This paper studies a class of reinforcement learning algorithms known as policy
gradient methods. Policy gradient methods optimize the performance of a policy
by estimating the gradient of the expected return with respect to the policy
parameters. One of the core challenges of applying policy gradient methods is
obtaining an accurate estimate of this gradient. Most policy gradient methods
rely on Monte Carlo sampling to estimate this gradient. When only a limited
number of environment steps can be collected, Monte Carlo policy gradient
estimates may suffer from sampling error – samples receive more or less weight
than they will in expectation. In this paper, we introduce the Sampling Error
Corrected policy gradient estimator that corrects the inaccurate Monte Carlo
weights. Our approach treats the observed data as if it were generated by a
different policy than the policy that actually generated the data. It then uses
importance sampling between the two – in the process correcting the inaccurate
Monte Carlo weights. Under a limiting set of assumptions we can show that this
gradient estimator will have lower variance than the Monte Carlo gradient
estimator. We show experimentally that our approach improves the learning speed
of two policy gradient methods compared to standard Monte Carlo sampling even
when the theoretical assumptions fail to hold.
Selecting Compliant Agents for Opt-in Microtolling.
Josiah P. Hanna, Guni Sharon, Stephen D. Boyles, Peter Stone.
Proceedings of the 33rd AAAI Conference on Artificial Intelligence
(AAAI). January 2019.
Abstract
BibTeX
Paper
This paper examines the impact of tolls on social welfare in the context of a
transportation network in which only a portion of the agents are subject to
tolls. More specifically, this paper addresses the question: which subset of
agents provides the most system benefit if they are compliant with an
approximate marginal cost tolling scheme? Since previous work suggests this
problem is NP-hard, we examine a heuristic approach. Our experimental results on
three real-world traffic scenarios suggest that evaluating the marginal impact
of a given agent serves as a particularly strong heuristic for selecting an
agent to be compliant. Results from using this heuristic for selecting 7.6% of
the agents to be compliant achieved an increase of up to 10.9% in social welfare
over not tolling at all. The presented heuristic approach and conclusions can
help practitioners target specific agents to participate in an opt-in tolling
scheme.
Towards a Data Efficient Off-Policy Policy Gradient.
Josiah Hanna, Peter Stone.
AAAI Spring Symposium on Data Efficient Reinforcement Learning. April
2018.
Abstract
BibTeX
Paper
The ability to learn from off-policy data – data generated from past interaction
with the environment – is essential to data efficient reinforcement learning.
Recent work has shown that the use of off-policy data not only allows the re-use
of data but can even improve performance in comparison to on-policy
reinforcement learning. In this work we investigate if a recently proposed
method for learning a better data generation policy, commonly called a behavior
policy, can also increase the data efficiency of policy gradient reinforcement
learning. Empirical results demonstrate that with an appropriately selected
behavior policy we can estimate the policy gradient more accurately. The results
also motivate further work into developing methods for adapting the behavior
policy as the policy we are learning changes.
DyETC: Dynamic Electronic Toll Collection for Traffic Congestion
Alleviation.
Haipeng Chen, Bo An, Guni Sharon,
Josiah P. Hanna, Peter Stone, Chunyan
Miao, Yeng Chai Soh.
Proceedings of the 32nd AAAI Conference on Artificial Intelligence
(AAAI). February 2018.
Abstract
BibTeX
Paper
To alleviate traffic congestion in urban areas, electronic toll collection (ETC)
systems are deployed all over the world. Despite the merits, tolls are usually
pre-determined and fixed from day to day, which fail to consider traffic
dynamics and thus have limited regulation effect when traffic conditions are
abnormal. In this paper, we propose a novel dynamic ETC (DyETC) scheme which
adjusts tolls to traffic conditions in realtime. The DyETC problem is formulated
as a Markov decision process (MDP), the solution of which is very challenging
due to its 1) multi-dimensional state space, 2) multi-dimensional, continuous
and bounded action space, and 3) time-dependent state and action values. Due to
the complexity of the formulated MDP, existing methods cannot be applied to our
problem. Therefore, we develop a novel algorithm, PG-β, which makes three
improvements to traditional policy gradient method by proposing 1) time-
dependent value and policy functions, 2) Beta distribution policy function and
3) state abstraction. Experimental results show that, compared with existing ETC
schemes, DyETC increases traffic volume by around $8%$, and reduces travel time
by around $14.6%$ during rush hour. Considering the total traffic volume in a
traffic network, this contributes to a substantial increase to social welfare.
Network-wide Adaptive Tolling for Connected and Automated vehicles.
Guni Sharon, Michael W. Levin,
Josiah P. Hanna, Tarun Rambha, Stephen D.
Boyles, Peter Stone.
Transportation Research Part C. September 2017.
Abstract
BibTeX
Paper
This article proposes Delta-tolling, a simple adaptive pricing scheme which only
requires travel time observations and two tuning parameters. These tolls are
applied throughout a road network, and can be updated as frequently as travel
time observations are made. Notably, Delta-tolling does not require any details
of the traffic flow or travel demand models other than travel time observations,
rendering it easy to apply in real-time. The flexibility of this tolling scheme
is demonstrated in three specific traffic modeling contexts with varying traffic
flow and user behavior assumptions: a day-to-day pricing model using static
network equilibrium with link delay functions; a within-day adaptive pricing
model using the cell transmission model and dynamic routing of vehicles; and a
microsimulation of reservation-based intersection control for connected and
autonomous vehicles with myopic routing. In all cases, D-tolling produces
significant benefits over the no-toll case, measured in terms of average travel
time and social welfare, while only requiring two parameters to be tuned. Some
optimality results are also given for the special case of the static network
equilibrium model with BPR-style delay functions.
Data-Efficient Policy Evaluation Through Behavior Policy Search.
Josiah Hanna, Philip Thomas, Peter Stone, Scott Niekum.
Proceedings of the 34th International Conference on Machine Learning
(ICML). August 2017.
Abstract
BibTeX
Paper
We consider the task of evaluating a policy for a Markov decision process (MDP).
The standard unbiased technique for evaluating a policy is to deploy the policy
and observe its performance. We show that the data collected from deploying a
different policy, commonly called the behavior policy, can be used to produce
unbiased estimates with lower mean squared error than this standard technique.
We derive an analytic expression for the optimal behavior policy—the behavior
policy that minimizes the mean squared error of the resulting estimates. Because
this expression depends on terms that are unknown in practice, we propose a
novel policy evaluation sub-problem, behavior policy search: searching for a
behavior policy that reduces mean squared error. We present a behavior policy
search algorithm and empirically demonstrate its effectiveness in lowering the
mean squared error of policy performance estimates.
Fast and Precise Black and White Ball Detection for RoboCup Soccer.
Jacob Menashe, Josh Kelle, Katie Genter,
Josiah Hanna, Elad Liebman,
Sanmit Narvekar, Ruohan Zhang, Peter Stone.
RoboCup-2017: Robot Soccer World Cup XXI. July 2017.
Abstract
BibTeX
Paper
In 2016, UT Austin Villa claimed the Standard Platform League's second
place position at the RoboCup International Robot Soccer Competition in Leipzig,
Germany as well as first place at both the RoboCup US Open in Brunswick, USA and
the World RoboCup Conference in Beijing, China. This paper describes some of the
key contributions that led to the team's victories with a primary focus on
our techniques for identifying and tracking black and white soccer balls. UT
Austin Villa's ball detection system was overhauled in order to transition
from the league's bright orange ball, used every year of the competition
prior to 2016, to the truncated icosahedral pattern commonly associated with
soccer balls. We evaluated and applied a series of heuristic region-of-interest
identification techniques and supervised machine learning methods to produce a
ball detector capable of reliably detecting the ball's position with no
prior knowledge of the ball's position. In 2016, UT Austin Villa suffered
only a single loss which occurred after regulation time during a penalty kick
shootout. We attribute much of UT Austin Villa's success in 2016 to our
robots' effectiveness at quickly and consistently localizing the ball. In
this work we discuss the specifics of UT Austin Villa's ball detector
implementation which are applicable to the specific problem of ball detection in
RoboCup, as well as to the more general problem of fast and precise object
detection in computationally constrained domains. Furthermore we provide
empirical analyses of our approach to support the conclusion that modern deep
learning techniques can enhance visual recognition tasks even in the face of
these computational constraints.
Bootstrapping with Models: Confidence Intervals for Off-Policy
Evaluation.
Josiah Hanna, Peter Stone, Scott Niekum.
Proceedings of the 16th International Conference on Autonomous Agents and
Multiagent Systems (AAMAS). May 2017.
Abstract
BibTeX
Paper
For an autonomous agent, executing a poor policy may be costly or even
dangerous. For such agents, it is desirable to determine confidence interval
lower bounds on the performance of any given policy without executing said
policy. Current methods for exact high confidence off-policy evaluation that use
importance sampling require a substantial amount of data to achieve a tight
lower bound. Existing model-based methods only address the problem in discrete
state spaces. Since exact bounds are intractable for many domains we trade off
strict guarantees of safety for more data-efficient approximate bounds. In this
context, we propose two bootstrapping off-policy evaluation methods which use
learned MDP transition models in order to estimate lower confidence bounds on
policy performance with limited data in both continuous and discrete state
spaces. Since direct use of a model may introduce bias, we derive a theoretical
upper bound on model bias for when the model transition function is estimated
with i.i.d. trajectories. This bound broadens our understanding of the
conditions under which model-based methods have high bias. Finally, we
empirically evaluate our proposed methods and analyze the settings in which
different bootstrapping off-policy confidence interval methods succeed and fail.
Real-time Adaptive Tolling Scheme for Optimized Social Welfare in Traffic
Networks.
Guni Sharon,
Josiah P. Hanna, Tarun Rambha, Michael W. Levin, Michael
Albert, Stephen D. Boyles, Peter Stone.
Proceedings of the 16th International Conference on Autonomous Agents and
Multiagent Systems (AAMAS-2017). May 2017.
Abstract
BibTeX
Paper
Connected and autonomous vehicle technology has advanced rapidly in recent
years. These technologies create possibilities for advanced AI-based traffic
management techniques. Developing such techniques is an important challenge and
opportunity for the AI community as it requires synergy between experts in game
theory, multiagent systems, behavioral science, and flow optimization. This
paper takes a step in this direction by considering traffic flow optimization
through setting and broadcasting of dynamic and adaptive tolls. Previous tolling
schemes either were not adaptive in real-time, not scalable to large networks,
or did not optimize traffic flow over an entire network. Moreover, previous
schemes made strong assumptions on observable demands, road capacities and users
homogeneity. This paper introduces Delta-tolling, a novel tolling scheme that is
adaptive in real-time and able to scale to large networks. We provide
theoretical evidence showing that under certain assumptions Delta-tolling is
equal to Marginal-Cost Tolling, which provably leads to system-optimal, and
empirical evidence showing that Delta-tolling increases social welfare (by up to
33%) in two traffic simulators with markedly different modeling assumptions.
Grounded Action Transformation for Robot Learning in Simulation.
Josiah Hanna, Peter Stone.
Proceedings of the 31st AAAI Conference on Artificial Intelligence
(AAAI). February 2017.
Abstract
BibTeX
Paper
Robot learning in simulation is a promising alternative to the prohibitive
sample cost of learning in the physical world. Unfortunately, policies learned
in simulation often perform worse than hand-coded policies when applied on the
physical robot. Grounded simulation learning (GSL) promises to address this
issue by altering the simulator to better match the real world. This paper
proposes a new algorithm for GSL – Grounded Action Transformation – and applies
it to learning of humanoid bipedal locomotion. Our approach results in a 43.27%
improvement in forward walk velocity compared to a state-of-the art hand-coded
walk. We further evaluate our methodology in controlled experiments using a
second, higher-fidelity simulator in place of the real world. Our results
contribute to a deeper understanding of grounded simulation learning and
demonstrate its effectiveness for learning robot control policies.
UT Austin Villa: RoboCup 2015 3D Simulation League Competition and Technical
Challenges Champions.
Patrick MacAlpine,
Josiah Hanna, Jason Liang, Peter Stone.
RoboCup-2015: Robot Soccer World Cup XIX. July 2016.
Abstract
BibTeX
Paper
The UT Austin Villa team, from the University of Texas at Austin, won the 2015
RoboCup 3D Simulation League, winning all 19 games that the team played. During
the course of the competition the team scored 87 goals and conceded only 1.
Additionally the team won the RoboCup 3D Simulation League technical challenge
by winning each of a series of three league challenges: drop-in player, kick
accuracy, and free challenge. This paper describes the changes and improvements
made to the team between 2014 and 2015 that allowed it to win both the main
competition and each of the league technical challenges.
Minimum Cost Matching for Autonomous Carsharing.
Josiah P. Hanna, Michael Albert, Donna Chen, Peter Stone.
Proceedings of the 9th IFAC Symposium on Intelligent Autonomous Vehicles (IAV
2016). June 2016.
Abstract
BibTeX
Paper
Carsharing programs provide an alternative to private vehicle ownership.
Combining carsharing programs with autonomous vehicles would improve user access
to vehicles thereby removing one of the main challenges to widescale adoption of
these programs. While the ability to easily move cars to meet demand would be
significant for carsharing programs, if implemented incorrectly it could lead to
worse system performance. In this paper, we seek to improve the performance of a
fleet of shared autonomous vehicles through improved matching of vehicles to
passengers requesting rides. We consider carsharing with autonomous vehicles as
an assignment problem and examine four different methods for matching cars to
users in a dynamic setting. We show how applying a recent algorithm (Scalable
Collision-avoiding Role Assignment with Minimal-makespan or SCRAM) for
minimizing the maximal edge in a perfect matching can result in a more
efficient, reliable, and fair carsharing system. Our results highlight some of
the problems with greedy or decentralized approaches. Introducing a centralized
system creates the possibility for users to strategically mis-report their
locations and improve their expected wait time so we provide a proof
demonstrating that cancellation fees can be applied to eliminate the incentive
to mis-report location.
Operations of a shared, autonomous, electric vehicle fleet: Implications of
vehicle & charging infrastructure decisions.
T Donna Chen, Kara M Kockelman,
Josiah P Hanna.
Transportation Research Part A: Policy and Practice. January 2016.
Abstract
BibTeX
Paper
There are natural synergies between shared autonomous vehicle (AV) fleets and
electric vehicle (EV) technology, since fleets of AVs resolve the practical
limitations of today's non-autonomous EVs, including traveler range
anxiety, access to charging infrastructure, and charging time management. Fleet-
managed AVs relieve such concerns, managing range and charging activities based
on real-time trip demand and established charging-station locations, as
demonstrated in this paper. This work explores the management of a fleet of
shared autonomous electric vehicles (SAEVs) in a regional, discrete-time, agent-
based model. The simulation examines the operation of SAEVs under various
vehicle range and charging infrastructure scenarios in a gridded city modeled
roughly after the densities of Austin, Texas. <br><br> Results based
on 2009 NHTS trip distance and time-of-day distributions indicate that fleet
size is sensitive to battery recharge time and vehicle range, with each 80-mile
range SAEV replacing 3.7 privately owned vehicles and each 200-mile range SAEV
replacing 5.5 privately owned vehicles, under Level II (240-volt AC) charging.
With Level III 480-volt DC fast-charging infrastructure in place, these ratios
rise to 5.4 vehicles for the 80-mile range SAEV and 6.8 vehicles for the
200-mile range SAEV. SAEVs can serve 96‐98% of trip requests with
average wait times between 7 and 10 minutes per trip. However, due to the need
to travel while “empty" for charging and passenger pick-up, SAEV fleets are
predicted to generate an additional 7.1 – 14.0% of travel miles. Financial
analysis suggests that the combined cost of charging infrastructure, vehicle
capital and maintenance, electricity, insurance, and registration for a fleet of
SAEVs ranges from $0.42 to $0.49 per occupied mile traveled, which implies SAEV
service can be offered at the equivalent per-mile cost of private vehicle
ownership for low-mileage households, and thus be competitive with current
manually-driven carsharing services and significantly cheaper than on-demand
driver-operated transportation services. When Austin-specific trip patterns
(with more concentrated trip origins and destinations) are introduced in a final
case study, the simulation predicts a decrease in fleet <q>empty</q>
vehicle-miles (down to 3 – 4% of all SAEV travel) and average wait times
(ranging from 2 to 4 minutes per trip), with each SAEV replacing 5 – 9 privately
owned vehicles.
Approximation of Lorenz-Optimal Solutions in Multiobjective Markov Decision
Processes.
Patrice Perny, Paul Weng, Judy Goldsmith,
Josiah P. Hanna.
Proceedings of the International Conference on Uncertainty in Artificial
Intelligence (UAI). July 2013.
Abstract
BibTeX
Paper
This paper is devoted to fair optimization in Multiobjective Markov Decision
Processes (MOMDPs). A MOMDP is an extension of the MDP model for planning under
uncertainty while trying to optimize several reward functions simultaneously.
This applies to multiagent problems when rewards define individual utility
functions, or in multicriteria problems when rewards refer to different
features. In this setting, we study the determination of policies leading to
Lorenz-nondominated tradeoffs. Lorenz dominance is a refinement of Pareto
dominance that was introduced in Social Choice for the measurement of
inequalities. In this paper, we introduce methods to efficiently approximate the
sets of Lorenz-non-dominated solutions of infinite-horizon, discounted MOMDPs.
The approximations are polynomial-sized subsets of those solutions.
The Academic Advising Planning Domain.
Joshua T. Guerin,
Josiah P. Hanna, Libby Ferland, Nicholas Mattei, Judy
Goldsmith.
Proceedings of the 3rd Workshop on the International Planning Competition at
ICAPS. July 2012.
Abstract
BibTeX
Paper
The International Probabilistic Planning Competition is a leading showcase for
fast stochastic planners. The current domains used in the competition have
raised challenges that the leading deterministic-planner-based MDP solvers have
been able to meet. We argue that in order to continue to raise challenges and
match real world applications, domains must be generated that exhibit true
stochasticity, multi-valued domain variables, and concurrent actions. In this
paper we propose the academic advising domain as a planning competition domain
that exhibits these characteristics. We believe that this domain can build upon
the success of previous contests in pushing the limits of MDP planning research.