Data-Efficient Policy Evaluation Through Behavior Policy Search.
Josiah P. Hanna, Yash Chandak, Martha White, Philip Thomas, Scott Niekum, Peter Stone.
To appear in Journal of Machine Learning Research (JMLR). November 2024.
Abstract
BibTeX
Paper
We consider the task of evaluating a policy for a Markov decision process (MDP). The standard unbiased technique for evaluating a policy is to deploy the policy and observe its performance. We show that the data collected from deploying a different policy, commonly called the behavior policy, can be used to produce unbiased estimates with lower mean squared error than this standard technique. We derive an analytic expression for a minimal variance behavior policy -- a behavior policy that minimizes the mean squared error of the resulting estimates. Because this expression depends on terms that are unknown in practice, we propose a novel policy evaluation sub-problem, behavior policy search: searching for a behavior policy that reduces mean squared error. We present two behavior policy search algorithms and empirically demonstrate their effectiveness in lowering the mean squared error of policy performance estimates.
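The estimator underlying behavior policy search is ordinary importance sampling over trajectories. The sketch below is a minimal illustration, not the authors' implementation; the policy log-probability interfaces are assumptions. It shows how data collected under a behavior policy yields an unbiased estimate of the target policy's performance, whose per-trajectory variance is what behavior policy search then tunes the behavior policy to reduce.

```python
import numpy as np

def importance_sampling_estimate(trajectories, target_logp, behavior_logp, gamma=1.0):
    """Unbiased estimate of the target policy's expected return from
    trajectories sampled with a (possibly different) behavior policy.

    trajectories: list of [(s, a, r), ...] lists.
    target_logp(s, a), behavior_logp(s, a): log action probabilities.
    """
    estimates = []
    for traj in trajectories:
        log_ratio, ret = 0.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            log_ratio += target_logp(s, a) - behavior_logp(s, a)
            ret += gamma ** t * r
        estimates.append(np.exp(log_ratio) * ret)
    # Behavior policy search adjusts behavior_logp to shrink the variance of
    # these per-trajectory estimates while keeping their mean unbiased.
    return np.mean(estimates), np.var(estimates)
```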
Stable Offline Value Function Learning with Bisimulation-based Representations.
Brahma S. Pavse, Yudong Chen, Qiaomin Xie,
Josiah P. Hanna.
Arxiv Pre-Print. October 2024.
Abstract
BibTeX
Paper
In reinforcement learning, offline value function learning is the procedure of using an offline dataset to estimate the expected discounted return from each state when taking actions according to a fixed target policy. The stability of this procedure, i.e., whether it converges to its fixed point, critically depends on the representations of the state-action pairs. Poorly learned representations can make value function learning unstable, or even divergent. Therefore, it is critical to stabilize value function learning by explicitly shaping the state-action representations. Recently, the class of bisimulation-based algorithms has shown promise in shaping representations for control. However, it is still unclear if this class of methods can stabilize value function learning. In this work, we investigate this question and answer it affirmatively. We introduce a bisimulation-based algorithm called kernel representations for offline policy evaluation (KROPE). KROPE uses a kernel to shape state-action representations such that state-action pairs that have similar immediate rewards and lead to similar next state-action pairs under the target policy also have similar representations. We show that KROPE: 1) learns stable representations and 2) leads to lower value error than baselines. Our analysis provides new theoretical insight into the stability properties of bisimulation-based methods and suggests that practitioners can use these methods for stable and accurate evaluation of offline reinforcement learning agents.
Toward the Confident Deployment of Real-world Reinforcement Learning Agents.
Josiah P. Hanna.
AI Magazine. September 2024.
Abstract
BibTeX
Paper
Intelligent learning agents must be able to learn from experience so as to accomplish tasks that require more ability than could be initially programmed. Reinforcement learning (RL) has emerged as a potentially powerful class of solution methods to create agents that learn from trial-and-error interaction with the world. Despite many prominent success stories, a number of challenges often stand in the way of using RL in real-world problems. As part of the AAAI New Faculty Highlight Program, in this article, I will describe the work that my group is doing at the University of Wisconsin–Madison with the intent to remove barriers to the use of RL in practice. Specifically, I will describe recent work that aims to give practitioners confidence in learned behaviors, methods to increase the data efficiency of RL, and work on “challenge” domains that stress RL algorithms beyond current testbeds.
Conservative Evaluation of Offline Policy Learning.
Hager Radi Abdelwahed,
Josiah P. Hanna, Matthew E. Taylor.
Transactions on Machine Learning Research (TMLR). August 2024.
Abstract
BibTeX
Paper
The world offers unprecedented amounts of data in real-world domains, from which we can develop successful decision-making systems. Reinforcement learning (RL) can learn control policies offline from such data, but deploying an agent during learning is challenging in safety-critical domains. Offline RL learns from historical data without access to an environment. Therefore, we need a methodology for estimating how a newly-learned agent will perform when deployed in the real environment before actually deploying it. To achieve this, we propose a framework for conservative evaluation of offline policy learning (CEOPL). We focus on being conservative so that the probability that our agent performs below a baseline is approximately δ, where δ specifies how much risk we are willing to accept. In our setting, we assume access to a data stream, split into a train-set to learn an offline policy and a test-set to estimate a lower bound on the offline policy's performance using off-policy evaluation with bootstrap confidence intervals. A lower-bound estimate allows us to decide when to deploy our learned policy with minimal risk of overestimation. We demonstrate CEOPL on a range of tasks as well as real-world medical data.
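A minimal sketch of the kind of bootstrap lower bound that this framework relies on (not the authors' code; the per-trajectory off-policy return estimates from the test-set are assumed to be given): resample the estimates with replacement and take the δ-quantile of the bootstrap means as a conservative lower bound on the learned policy's value.

```python
import numpy as np

def bootstrap_lower_bound(per_traj_estimates, delta=0.05, n_boot=2000, seed=0):
    """Conservative lower bound on a policy's value from held-out,
    per-trajectory off-policy return estimates.
    With probability roughly 1 - delta, the true value exceeds the bound."""
    rng = np.random.default_rng(seed)
    estimates = np.asarray(per_traj_estimates, dtype=float)
    boot_means = [
        rng.choice(estimates, size=len(estimates), replace=True).mean()
        for _ in range(n_boot)
    ]
    return float(np.quantile(boot_means, delta))

# Deploy only when the lower bound beats the baseline's known performance:
# deploy = bootstrap_lower_bound(test_estimates) >= baseline_value
```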
Pretraining Decision Transformers with Reward Prediction for In-Context Multi-task Structured Bandit Learning.
Subhojyoti Mukherjee,
Josiah P. Hanna, Qiaomin Xie, Robert Nowak.
Arxiv Pre-Print. June 2024.
Abstract
BibTeX
Paper
In this paper, we study the multi-task structured bandit problem where the goal is to learn a near-optimal algorithm that minimizes cumulative regret. The tasks share a common structure and the algorithm exploits the shared structure to minimize the cumulative regret for an unseen but related test task. We use a transformer as a decision-making algorithm to learn this shared structure so as to generalize to the test task. Prior work on pretrained decision transformers, such as DPT, requires access to the optimal action during training, which may be hard to obtain in several scenarios. Diverging from these works, our learning algorithm does not need knowledge of the optimal action per task during training but predicts a reward vector for each of the actions using only the observed offline data from the diverse training tasks. Finally, during inference time, it selects actions using the reward predictions, employing various exploration strategies in-context for an unseen test task. Our model outperforms other SOTA methods, such as DPT and Algorithmic Distillation, over a series of experiments on several structured bandit problems (linear, bilinear, latent, non-linear). Interestingly, we show that our algorithm, without knowledge of the underlying problem structure, can learn a near-optimal policy in-context by leveraging the shared structure across diverse tasks. We further extend the field of pre-trained decision transformers by showing that they can leverage unseen tasks with new actions and still learn the underlying latent structure to derive a near-optimal policy. We validate this over several experiments to show that our proposed solution is very general and has wide applications to potentially emergent online and offline strategies at test time. Finally, we theoretically analyze the performance of our algorithm and obtain generalization bounds in the in-context multi-task learning setting.
Future Prediction Can Be a Strong Evidence of Good History Representation in Partially Observable Environments.
Jeongyeol Kwon, Liu Yang,
Josiah P. Hanna, Robert Nowak.
Arxiv Pre-Print. February 2024.
Abstract
BibTeX
Paper
Learning a good history representation is one of the core challenges of reinforcement learning (RL) in partially observable environments. Recent works have shown the advantages of various auxiliary tasks for facilitating representation learning. However, the effectiveness of such auxiliary tasks has not been fully convincing, especially in partially observable environments that require long-term memorization and inference. In this empirical study, we investigate the effectiveness of future prediction for learning the representations of histories, possibly of extensive length, in partially observable environments. We first introduce an approach that decouples the task of learning history representations from policy optimization via future prediction. Then, our main contributions are two-fold: (a) we demonstrate that the performance of reinforcement learning is strongly correlated with the prediction accuracy of future observations in partially observable environments, and (b) our approach can significantly improve the overall end-to-end approach by preventing high-variance noisy signals from reinforcement learning objectives from influencing the representation learning. We illustrate our claims on three types of benchmarks that necessitate the ability to process long histories for high returns.
Adaptive Exploration for Data-Efficient General Value Function Evaluations.
Arushi Jain,
Josiah P. Hanna, Doina Precup.
Proceedings of Advances in Neural Information Processing Systems (NeurIPS). December 2024.
Abstract
BibTeX
Paper
General Value Functions (GVFs) (Sutton et al., 2011) are an established way to represent predictive knowledge in reinforcement learning. Each GVF computes the expected return for a given policy, based on a unique pseudo-reward. Multiple GVFs can be estimated in parallel using off-policy learning from a single stream of data, often sourced from a fixed behavior policy or pre-collected dataset. This leaves an open question: how can the behavior policy be chosen for data-efficient GVF learning? To address this gap, we propose GVFExplorer, which aims at learning a behavior policy that efficiently gathers data for evaluating multiple GVFs in parallel. This behavior policy selects actions in proportion to the total variance in the return across all GVFs, reducing the number of environmental interactions. To enable accurate variance estimation, we use a recently proposed temporal-difference-style variance estimator. We prove that each behavior policy update reduces the mean squared error in the summed predictions over all GVFs. We empirically demonstrate our method's performance in both tabular representations and nonlinear function approximation.
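As a rough illustration of the action-selection rule described above (a sketch under the assumption that per-GVF return-variance estimates are available; not the authors' implementation), the behavior policy can weight each action by the summed variance across GVFs:

```python
import numpy as np

def behavior_probs(variance_estimates):
    """variance_estimates: array of shape (n_gvfs, n_actions) with the estimated
    variance of each GVF's return for each action in the current state.
    Actions are sampled in proportion to total variance across all GVFs."""
    total = variance_estimates.sum(axis=0)        # (n_actions,)
    total = np.maximum(total, 1e-8)               # keep every action reachable
    return total / total.sum()

probs = behavior_probs(np.array([[0.2, 1.5, 0.1],
                                 [0.4, 0.3, 0.9]]))
action = np.random.choice(len(probs), p=probs)
```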
Guided Data Augmentation for Offline Reinforcement Learning and Imitation Learning.
Nicholas Corrado, Yuxiao Qu, John U. Balis, Adam Labiosa,
Josiah P. Hanna.
Proceedings of the Reinforcement Learning Conference (RLC). August 2024.
Abstract
BibTeX
Paper
In offline reinforcement learning (RL), RL agents learn to solve a task using only a fixed dataset of previously collected data. While offline RL has proven to be a viable method for learning real-world robot control policies, it typically requires large amounts of expert-quality data to learn effective policies that generalize to out-of-distribution states. Unfortunately, such data is often difficult and expensive to acquire in real-world tasks. Several recent works have leveraged data augmentation (DA) to inexpensively generate additional data, but most DA works apply augmentations in a random fashion and ultimately produce highly suboptimal augmented data. In this work, we propose Guided Data Augmentation (GuDA),
a human-guided DA framework that generates expert-quality augmented data. The key insight behind GuDA is that while it may be difficult to demonstrate the sequence of actions required to produce expert data, a user can often easily characterize when an augmented trajectory segment represents progress toward task completion. Thus, a user can restrict the space of possible augmentations to automatically reject suboptimal augmented data. To extract a policy from GuDA, we use off-the-shelf offline reinforcement learning and behavior cloning algorithms. We evaluate GuDA on a physical robot soccer task as well as simulated D4RL navigation tasks, a simulated autonomous driving task, and a simulated soccer task. Empirically, GuDA enables learning given a small initial dataset of potentially suboptimal experience and outperforms a random DA strategy as well as a model-based DA strategy.
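The key mechanism, letting a user restrict augmentations so that only progress-making augmented segments are kept, can be sketched as a simple accept/reject filter (a hedged illustration; the augmentation function and the user-supplied progress check are assumed, hypothetical inputs, not the paper's code):

```python
import random

def guided_augment(segments, augment_fn, makes_progress, n_samples=1000):
    """segments: trajectory segments from the original offline dataset.
    augment_fn(segment)     -> candidate augmented segment (e.g. rotated/translated).
    makes_progress(segment) -> bool, the user's easy-to-specify acceptance rule
                               that rejects augmentations not progressing
                               toward task completion.
    """
    augmented = []
    for _ in range(n_samples):
        candidate = augment_fn(random.choice(segments))
        if makes_progress(candidate):      # user-guided rejection of suboptimal data
            augmented.append(candidate)
    # An off-the-shelf offline RL or behavior cloning algorithm is then
    # trained on the union of original and augmented segments.
    return segments + augmented
```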
SaVeR: Optimal Data Collection Strategy for Safe Policy Evaluation in Tabular MDP.
Subhojyoti Mukherjee,
Josiah P. Hanna, Robert Nowak.
Proceedings of the International Conference on Machine Learning (ICML). July 2024.
Abstract
BibTeX
Paper
In this paper, we study safe data collection for the purpose of policy evaluation in tabular Markov decision processes (MDPs). In policy evaluation, we are given a target policy and asked to estimate the expected cumulative reward it will obtain. Policy evaluation requires data, and we are interested in the question of what behavior policy should collect the data for the most accurate evaluation of the target policy. While prior work has considered behavior policy selection, in this paper, we additionally consider a safety constraint on the behavior policy. Namely, we assume there exists a known default policy that incurs a particular expected cost when run, and we enforce that the cumulative cost of all behavior policies run is no worse than a constant factor of the cost that would be incurred had we always run the default policy. We first show that there exists a class of intractable MDPs where no safe oracle algorithm with knowledge of the problem parameters can efficiently collect data and satisfy the safety constraints. We then define a tractability condition for an MDP such that a safe oracle algorithm can efficiently collect data and, using that condition, we prove the first lower bound for this setting. We then introduce an algorithm, SaVeR, for this problem that approximates the safe oracle algorithm, and we bound the finite-sample mean squared error of the algorithm while ensuring it satisfies the safety constraint. Finally, we show in simulations that SaVeR produces low-MSE policy evaluation while satisfying the safety constraint.
Learning To Stabilize Online Reinforcement Learning in Unbounded State Spaces.
Brahma Pavse, Matthew Zurek, Yudong Chen, Qiaomin Xie,
Josiah P. Hanna.
Proceedings of the International Conference on Machine Learning (ICML). July 2024.
Abstract
BibTeX
Paper
In many reinforcement learning (RL) applications, we want policies that reach desired states and then keep the controlled system within an acceptable region around the desired states over an indefinite period of time.
This latter objective is called stability and is especially important when the state space is unbounded, such that the states can be arbitrarily far from each other and the agent can drift far away from the desired states.
For example, in stochastic queuing networks, where queues of waiting jobs can grow without bound, the desired state is all-zero queue lengths. Here, a stable policy ensures queue lengths are finite while an optimal policy minimizes queue lengths.
Since an optimal policy is also stable, one would expect that RL algorithms would implicitly give us stable policies. However, in this work, we find that deep RL algorithms that directly minimize the distance to the desired state during online training often result in unstable policies, i.e., policies that drift far away from the desired state. We attribute this instability to poor credit assignment for destabilizing actions. We then introduce an approach based on two ideas: 1) a Lyapunov-based cost-shaping technique and 2) state transformations to the unbounded state space.
We conduct an empirical study on various queuing networks and traffic signal control problems and find that our approach performs competitively against strong baselines with knowledge of the transition dynamics. Our code is available here: https://github.com/Badger-RL/STOP.
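To make the first idea concrete, here is a hedged sketch of Lyapunov-based cost shaping for a queueing network. The quadratic Lyapunov function and the per-step shaping form are illustrative assumptions, not the paper's exact construction: the shaped cost charges the agent for increases in a Lyapunov function of the queue lengths, so destabilizing actions receive immediate negative credit.

```python
import numpy as np

def lyapunov(queue_lengths):
    """A standard Lyapunov candidate for queueing networks: sum of squared queue lengths."""
    return float(np.sum(np.square(queue_lengths)))

def shaped_cost(queues, next_queues):
    """Per-step cost = raw congestion cost + Lyapunov drift.
    The drift term penalizes transitions that push the state away from the
    all-zero desired state, giving immediate credit for destabilizing actions."""
    raw_cost = float(np.sum(next_queues))               # e.g. total queue length
    drift = lyapunov(next_queues) - lyapunov(queues)
    return raw_cost + drift

print(shaped_cost(np.array([2, 1]), np.array([3, 1])))  # growing queues -> larger cost
```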
SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits.
Subhojyoti Mukherjee, Qiaomin Xie,
Josiah P. Hanna, Robert Nowak.
Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS). May 2024.
Abstract
BibTeX
Paper
In this paper, we study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a target policy and asked to estimate the expected cumulative reward it will obtain when executed in an environment formalized as a multi-armed bandit. In this paper, we focus on the linear bandit setting with heteroscedastic reward noise. This is the first work that focuses on such an optimal data collection strategy for policy evaluation involving heteroscedastic reward noise in the linear bandit setting. We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting that reduces the MSE of the target policy's value estimate. We term this policy-weighted least squares estimation and use this formulation to derive the optimal behavior policy for data collection. We then propose a novel algorithm, SPEED (Structured Policy Evaluation Experimental Design), that tracks the optimal behavior policy, and we derive its regret with respect to the optimal behavior policy. Finally, we empirically validate that SPEED leads to policy evaluation with mean squared error comparable to the oracle strategy and significantly lower than simply running the target policy.
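A minimal numerical sketch of the weighted least squares estimate at the core of this design (illustrative only; the feature vectors, per-observation noise variances, and target-policy weighting are assumptions): each observation is weighted by the inverse of its noise variance, and the resulting parameter estimate is combined with the target policy's arm distribution to estimate its value.

```python
import numpy as np

def weighted_least_squares(X, y, noise_var):
    """Heteroscedastic WLS estimate of the reward parameter theta.
    X: (n, d) feature vectors of pulled arms; y: (n,) observed rewards;
    noise_var: (n,) per-observation reward noise variances."""
    W = np.diag(1.0 / np.asarray(noise_var))
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

def target_policy_value(theta, arm_features, target_probs):
    """Value of the target policy: expected reward under its arm distribution."""
    return float(np.asarray(target_probs) @ (np.asarray(arm_features) @ theta))
```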
Understanding When Dynamics-Invariant Data Augmentations Benefit Model-Free Reinforcement Learning Updates.
Nicholas Corrado,
Josiah P. Hanna.
Proceedings of the International Conference on Learning Representations (ICLR). May 2024.
Abstract
BibTeX
Paper
Recently, data augmentation (DA) has emerged as a method for leveraging domain knowledge to inexpensively generate additional data in reinforcement learning (RL) tasks, often yielding substantial improvements in data efficiency.
While prior work has demonstrated the utility of incorporating augmented data directly into model-free RL updates,
it is not well-understood when a particular DA strategy will improve data efficiency.
In this paper, we seek to identify general aspects of DA responsible for observed learning improvements.
Our study focuses on sparse-reward tasks with dynamics-invariant data augmentation functions, serving as an initial step towards a more general understanding of DA and its integration into RL training.
Experimentally, we isolate three relevant aspects of DA: state-action coverage, reward density, and the number of augmented transitions generated per update (the augmented replay ratio).
From our experiments, we draw two conclusions: (1) increasing state-action coverage often has a much greater impact on data efficiency than increasing reward density, and (2) decreasing the augmented replay ratio substantially improves data efficiency. In fact, certain tasks in our empirical study are solvable only when the replay ratio is sufficiently low.
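For clarity, the augmented replay ratio studied here can be pinned down with a tiny helper (a plain restatement of the definition given in the abstract, not code from the paper):

```python
def augmented_replay_ratio(num_augmented_transitions, num_gradient_updates):
    """Augmented transitions generated per model-free RL update.
    The study finds that decreasing this ratio substantially improves
    data efficiency on sparse-reward tasks."""
    return num_augmented_transitions / num_gradient_updates

assert augmented_replay_ratio(256, 64) == 4.0
```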
Reinforcement Learning Via Auxiliary Task Distillation.
Abhinav Narayan Harish, Larry Heck,
Josiah P. Hanna, Zsolt Kira, Andrew Szot.
Proceedings of the European Conference on Computer Vision (ECCV). October 2024.
Abstract
BibTeX
Paper
We present Reinforcement Learning via Auxiliary Task Distillation (AuxDistill), a new method that enables reinforcement learning (RL) to perform long-horizon robot control problems by distilling behaviors from auxiliary RL tasks. AuxDistill achieves this by concurrently carrying out multi-task RL with auxiliary tasks, which are easier to learn and relevant to the main task. A weighted distillation loss transfers behaviors from these auxiliary tasks to solve the main task. We demonstrate that AuxDistill can learn a pixels-to-actions policy for a challenging multi-stage embodied object rearrangement task from the environment reward without demonstrations, a learning curriculum, or pre-trained skills. AuxDistill achieves 2.3× higher success than the previous state-of-the-art baseline in the Habitat Object Rearrangement benchmark and outperforms methods that use pre-trained skills and expert demonstrations.
Replacing Implicit Regression with Classification in Policy Gradient Reinforcement Learning.
Josiah P. Hanna, Brahma S. Pavse, Abhinav Harish.
Proceedings of the Finding the Frame Workshop at the Reinforcement Learning Conference (RLC). August 2024.
Abstract
BibTeX
Paper
Stochastic policy gradient methods are a fundamental class of reinforcement learning algorithms. When using these algorithms for continuous control, it is common to parameterize the policy using a Gaussian distribution. In this paper, we show that the policy gradient with Gaussian policies can be viewed as the gradient of a weighted least-squares objective function. That is, policy gradient algorithms are implicitly implementing a form of regression. Several recent works have shown that reformulating regression problems as classification problems can improve learning. Inspired by these works, we investigate whether replacing this implicit regression with classification can improve the data efficiency and stability of policy learning. We introduce a novel policy gradient surrogate objective for softmax policies over a discretized action space. This surrogate objective uses a form of cross-entropy loss to replace the implicit least-squares loss found in the surrogate loss for Gaussian policies. We extend prior theoretical analysis of this loss to our policy gradient surrogate objective and provide experiments showing that this novel loss improves the data efficiency of stochastic policy gradient learning.
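The contrast between the implicit regression of Gaussian policy gradients and the proposed classification view can be sketched with two advantage-weighted surrogate objectives (a hedged illustration with assumed array shapes, not the paper's implementation): the Gaussian surrogate reduces to a weighted squared error in the action, while the discretized surrogate is an advantage-weighted cross-entropy over action bins.

```python
import numpy as np

def gaussian_pg_surrogate(actions, means, sigma, advantages):
    """Gaussian-policy surrogate: advantage-weighted log-likelihood, which is
    (up to constants) a weighted least-squares loss in the action space."""
    sq_err = ((actions - means) ** 2) / (2.0 * sigma ** 2)
    return float(np.mean(advantages * (-sq_err)))

def softmax_pg_surrogate(action_bins, logits, advantages):
    """Discretized-action surrogate: advantage-weighted log-softmax of the
    chosen bin, i.e. a cross-entropy-style loss replacing the implicit regression.
    logits: (n, n_bins); action_bins: (n,) integer indices of taken actions."""
    logp = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
    chosen_logp = logp[np.arange(len(action_bins)), action_bins]
    return float(np.mean(advantages * chosen_logp))
```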
On-Policy Policy Gradient Reinforcement Learning Without On-Policy Sampling.
Nicholas Corrado,
Josiah P. Hanna.
Arxiv Pre-Print. November 2023.
Abstract
BibTeX
Paper
On-policy reinforcement learning (RL) algorithms perform policy updates using i.i.d. trajectories collected by the current policy. However, after observing only a finite number of trajectories, on-policy sampling may produce data that fails to match the expected on-policy data distribution. This sampling error leads to noisy updates and data-inefficient on-policy learning. Recent work in the policy evaluation setting has shown that non-i.i.d., off-policy sampling can produce data with lower sampling error than on-policy sampling. Motivated by this observation, we introduce an adaptive, off-policy sampling method to improve the data efficiency of on-policy policy gradient algorithms. Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error by collecting data with a behavior policy that increases the probability of sampling actions that are under-sampled with respect to the current policy. Rather than discarding data from old policies -- as is commonly done in on-policy algorithms -- PROPS uses data collection to adjust the distribution of previously collected data to be approximately on-policy. We empirically evaluate PROPS on both continuous-action MuJoCo benchmark tasks as well as discrete-action tasks and demonstrate that (1) PROPS decreases sampling error throughout training and (2) improves the data efficiency of on-policy policy gradient algorithms. Our work improves the RL community's understanding of a nuance in the on-policy vs. off-policy dichotomy: on-policy learning requires on-policy data, not on-policy sampling.
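A schematic of the adaptive sampling idea (not the PROPS algorithm itself; the empirical-count bookkeeping and the temperature parameter are assumptions): boost the probability of actions that are currently under-sampled relative to the current policy's probabilities, so that the aggregate dataset drifts toward the expected on-policy distribution.

```python
import numpy as np

def adjusted_sampling_probs(target_probs, empirical_counts, temperature=1.0):
    """Return behavior probabilities that up-weight actions under-sampled
    relative to the current (target) policy."""
    total = max(empirical_counts.sum(), 1)
    empirical_freq = empirical_counts / total
    deficit = target_probs - empirical_freq            # positive = under-sampled
    logits = np.log(target_probs + 1e-8) + temperature * deficit
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Action 1 has been under-sampled, so it gets boosted above 0.5:
probs = adjusted_sampling_probs(np.array([0.5, 0.5]), np.array([8, 2]))
```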
State-Action Similarity-Based Representations for Off-Policy Evaluation.
Brahma S. Pavse,
Josiah P. Hanna.
Proceedings of Advances in Neural Information Processing Systems (NeurIPS). December 2023.
Abstract
BibTeX
Paper
In reinforcement learning, off-policy evaluation (OPE) is the problem of estimating the expected return of an evaluation policy given a fixed dataset that was collected by running one or more different policies. One of the more empirically successful algorithms for OPE has been the fitted Q-evaluation (FQE) algorithm that uses temporal difference updates to learn an action-value function, which is then used to estimate the expected return of the evaluation policy. Typically, the original fixed dataset is fed directly into FQE to learn the action-value function of the evaluation policy. Instead, in this paper, we seek to enhance the data-efficiency of FQE by first transforming the fixed dataset using a learned encoder, and then feeding the transformed dataset into FQE. To learn such an encoder, we introduce an OPE-tailored state-action behavioral similarity metric, and use this metric and the fixed dataset to learn an encoder that models this metric. Theoretically, we show that this metric allows us to bound the error in the resulting OPE estimate. Empirically, we show that other state-action similarity metrics lead to representations that cannot represent the action-value function of the evaluation policy, and that our state-action representation method boosts the data-efficiency of FQE and lowers OPE error relative to other OPE-based representation learning methods on challenging OPE tasks. We also empirically show that the learned representations significantly mitigate divergence of FQE under varying distribution shifts. Our code is available here: https://github.com/Badger-RL/ROPE.
Multi-task Representation Learning for Pure Exploration in Bilinear Bandits.
Subhojyoti Mukherjee, Qiaomin Xie,
Josiah P. Hanna, Robert Nowak.
Proceedings of Advances in Neural Information Processing Systems (NeurIPS). December 2023.
Abstract
BibTeX
Paper
We study multi-task representation learning for the problem of pure exploration in bilinear bandits. In bilinear bandits, an action takes the form of a pair of arms from two different entity types, and the reward is a bilinear function of the known feature vectors of the arms. In the multi-task bilinear bandit problem, we aim to find optimal actions for multiple tasks that share a common low-dimensional linear representation. The objective is to leverage this characteristic to expedite the process of identifying the best pair of arms for all tasks. We propose an algorithm that uses an experimental design approach to optimize sample allocations for learning the global representation as well as minimize the number of samples needed to identify the optimal pair of arms in individual tasks. To the best of our knowledge, this is the first study to give a sample complexity analysis for pure exploration in bilinear bandits with shared representation. Our results demonstrate that by learning the shared representation across tasks, we achieve significantly improved sample complexity compared to the traditional approach of solving tasks independently.
Conditional Mutual Information for Disentangled Representations in Reinforcement Learning.
Mhairi Dunion, Trevor McInroe, Kevin Sebastian Luck,
Josiah P. Hanna, Stefano V. Albrecht.
Proceedings of Advances in Neural Information Processing Systems (NeurIPS). December 2023.
Abstract
BibTeX
Paper
Reinforcement Learning (RL) environments can produce training data with spurious correlations between features due to the amount of training data or its limited feature coverage. This can lead to RL agents encoding these misleading correlations in their latent representation, preventing the agent from generalising if the correlation changes within the environment or when deployed in the real world. Disentangled representations can improve robustness, but existing disentanglement techniques that minimise mutual information between features require independent features, thus they cannot disentangle correlated features.
We propose an auxiliary task for RL algorithms that learns a disentangled representation of high-dimensional observations with correlated features by minimising the conditional mutual information between features in the representation.
We demonstrate experimentally, using continuous control tasks, that our approach improves generalisation under correlation shifts, as well as improving the training performance of RL algorithms in the presence of correlated features.
Temporal Disentanglement of Representations for Improved Generalisation in Reinforcement Learning.
Mhairi Dunion, Trevor McInroe, Kevin Sebastian Luck,
Josiah P. Hanna, Stefano V. Albrecht.
Proceedings of the International Conference on Learning Representations (ICLR). May 2023.
Abstract
BibTeX
Paper
Reinforcement Learning (RL) agents are often unable to generalise well to environment variations in the state space that were not observed during training. This issue is especially problematic for image-based RL, where a change in just one variable, such as the background colour, can change many pixels in the image. The changed pixels can lead to drastic changes in the agent's latent representation of the image, causing the learned policy to fail. To learn more robust representations, we introduce TEmporal Disentanglement (TED), a self-supervised auxiliary task that leads to disentangled image representations exploiting the sequential nature of RL observations. We find empirically that RL algorithms utilising TED as an auxiliary task adapt more quickly to changes in environment variables with continued training compared to state-of-the-art representation learning methods. Since TED enforces a disentangled structure of the representation, our experiments also show that policies trained with TED generalise better to unseen values of variables irrelevant to the task (e.g. background colour) as well as unseen values of variables that affect the optimal policy (e.g. goal positions).
Scaling Marginalized Importance Sampling To High-Dimensional State-Spaces Via State Abstraction.
Brahma S Pavse,
Josiah P. Hanna.
Proceedings of the AAAI Conference on Artificial Intelligence. February 2023.
Abstract
BibTeX
Paper
We consider the problem of off-policy evaluation (OPE) in reinforcement learning (RL), where the goal is to estimate the performance of an evaluation policy, πe, using a fixed dataset, D, collected by one or more policies that may be different from πe. Current OPE algorithms may produce poor OPE estimates under policy distribution shift, i.e., when the probability of a particular state-action pair occurring under πe is very different from the probability of that same pair occurring in D (Voloshin et al. 2021; Fu et al. 2021). In this work, we propose to improve the accuracy of OPE estimators by projecting the high-dimensional state-space into a low-dimensional state-space using concepts from the state abstraction literature. Specifically, we consider marginalized importance sampling (MIS) OPE algorithms which compute state-action distribution correction ratios to produce their OPE estimate. In the original ground state-space, these ratios may have high variance which may lead to high variance OPE. However, we prove that in the lower-dimensional abstract state-space the ratios can have lower variance resulting in lower variance OPE. We then highlight the challenges that arise when estimating the abstract ratios from data, identify sufficient conditions to overcome these issues, and present a minimax optimization problem whose solution yields these abstract ratios. Finally, our empirical evaluation on difficult, high-dimensional state-space OPE tasks shows that the abstract ratios can make MIS OPE estimators achieve lower mean-squared error and be more robust to hyperparameter tuning than the ground ratios.
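The MIS estimator being modified here has a simple form once the (ground or abstract) state-action correction ratios are available. The sketch below uses assumed interfaces, not the paper's code; it shows the estimator and where an abstraction map φ enters.

```python
import numpy as np

def mis_estimate(dataset, ratio_fn, phi=lambda s: s):
    """Marginalized importance sampling OPE estimate.
    dataset: iterable of (s, a, r) transitions from the behavior data D.
    ratio_fn(z, a): state-action distribution correction ratio, learned either
    in the ground space (phi = identity) or in a lower-dimensional abstract
    space (phi = state abstraction), where the ratios can have lower variance."""
    weighted_rewards = [ratio_fn(phi(s), a) * r for (s, a, r) in dataset]
    return float(np.mean(weighted_rewards))
```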
SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits.
Subhojyoti Mukherjee, Qiaomin Xie,
Josiah P. Hanna, Robert Nowak.
ICML Workshop on the Many Facets of Preference-Based Learning. July 2023.
Abstract
BibTeX
Paper
In this paper, we study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a target policy and asked to estimate the expected cumulative reward it will obtain when executed in an environment formalized as a multi-armed bandit. In this paper, we focus on the linear bandit setting with heteroscedastic reward noise. This is the first work that focuses on such an optimal data collection strategy for policy evaluation involving heteroscedastic reward noise in the linear bandit setting. We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting that reduces the MSE of the target policy's value estimate. We term this policy-weighted least squares estimation and use this formulation to derive the optimal behavior policy for data collection. We then propose a novel algorithm, SPEED (Structured Policy Evaluation Experimental Design), that tracks the optimal behavior policy, and we derive its regret with respect to the optimal behavior policy. Finally, we empirically validate that SPEED leads to policy evaluation with mean squared error comparable to the oracle strategy and significantly lower than simply running the target policy.
Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning.
Rujie Zhong, Duohan Zhang, Lukas Schäfer, Stefano V. Albrecht,
Josiah P. Hanna.
Proceedings of Advances in Neural Information Processing Systems (NeurIPS). December 2022.
Abstract
BibTeX
Paper
Reinforcement learning (RL) algorithms are often categorized as either on-policy or off-policy depending on whether they use data from a target policy of interest or from a different behavior policy. In this paper, we study a subtle distinction between on-policy data and on-policy sampling in the context of the RL sub-problem of policy evaluation. We observe that on-policy sampling may fail to match the expected distribution of on-policy data after observing only a finite number of trajectories and this failure hinders data-efficient policy evaluation. Towards improved data-efficiency, we show how non-i.i.d., off-policy sampling can produce data that more closely matches the expected on-policy data distribution and consequently increases the accuracy of the Monte Carlo estimator for policy evaluation. We introduce a method called Robust On-Policy Sampling and demonstrate theoretically and empirically that it produces data that converges faster to the expected on-policy distribution compared to on-policy sampling. Empirically, we show that this faster convergence leads to lower mean squared error policy value estimates.
ReVar: Strengthening Policy Evaluation Via Reduced Variance Sampling.
Subhojyoti Mukherjee,
Josiah P. Hanna, Robert Nowak.
Proceedings of the 38th International Conference on Uncertainty in Artificial Intelligence (UAI). August 2022.
Abstract
BibTeX
Paper
This paper studies the problem of data collection for policy evaluation in Markov decision processes (MDPs). In policy evaluation, we are given a target policy and asked to estimate the expected cumulative reward it will obtain in an environment formalized as an MDP. We develop theory for optimal data collection within the class of tree-structured MDPs by first deriving an oracle data collection strategy that uses knowledge of the variance of the reward distributions. We then introduce the Reduced Variance Sampling (ReVar) algorithm that approximates the oracle strategy when the reward variances are unknown a priori and bound its sub-optimality compared to the oracle strategy. Finally, we empirically validate that ReVar leads to policy evaluation with mean squared error comparable to the oracle strategy and significantly lower than simply running the target policy.
Simulation-Acquired Latent Action Spaces for Dynamics Generalization.
Nicholas Corrado, Yuxiao Qu,
Josiah P. Hanna.
Proceedings of the 1st Conference on Lifelong Learning Agents (CoLLAs). August 2022.
Abstract
BibTeX
Paper
Deep reinforcement learning has shown incredible promise at training high-performing agents to solve high-dimensional continuous control tasks in a particular training environment. However, to be useful in real-world settings, long-lived agents must perform well across a range of environmental conditions. Naively applying deep RL to a task where environment conditions may vary from episode to episode can be data inefficient. To address this inefficiency, we introduce a method that discovers structure in an agent’s high-dimensional continuous action space to speed up learning across a range of environmental conditions. Whereas prior work on finding so-called latent action spaces requires expert demonstrations or on-task experience, we instead propose to discover the latent, lower-dimensional action space in a simulated environment and then transfer the learned action space for training over a distribution of environments. We evaluate our novel method on randomized variants of simulated MuJoCo environments and find that, when there is a lower-dimensional action space to exploit, our method significantly increases data efficiency. For instance, in the Ant environment, our method reduces the 8-dimensional action space to a 3-dimensional action space and doubles the median return achieved after a training budget of 2 million timesteps.
Decoupled Reinforcement Learning To Stabilise Intrinsically-Motivated Exploration.
Lukas Schäfer,
Josiah P. Hanna, Filippos Christiano, Stefano V Albrecht.
Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS). May 2022.
Abstract
BibTeX
Paper
Intrinsic rewards can improve exploration in reinforcement learning, but the exploration process may suffer from instability caused by non-stationary reward shaping and strong dependency on hyperparameters. In this work, we introduce Decoupled RL (DeRL) as a general framework which trains separate policies for intrinsically-motivated exploration and exploitation. Such decoupling allows DeRL to leverage the benefits of intrinsic rewards for exploration while demonstrating improved robustness and sample efficiency. We evaluate DeRL algorithms in two sparse-reward environments with multiple types of intrinsic rewards. Our results show that DeRL is more robust to varying scale and rate of decay of intrinsic rewards and converges to the same evaluation returns as intrinsically-motivated baselines in fewer interactions. Lastly, we discuss the challenge of distribution shift and show that divergence constraint regularisers can successfully minimise instability caused by divergence of exploration and exploitation policies.
Scaling Marginalized Importance Sampling To High-Dimensional State-Spaces Via State Abstraction.
Brahma S Pavse,
Josiah P. Hanna.
Proceedings of the Offline Reinforcement Learning Workshop at NeurIPS 2022. December 2022.
Abstract
BibTeX
Paper
We consider the problem of off-policy evaluation (OPE) in reinforcement learning (RL), where the goal is to estimate the performance of an evaluation policy, πe, using a fixed dataset, D, collected by one or more policies that may be different from πe. Current OPE algorithms may produce poor OPE estimates under policy distribution shift, i.e., when the probability of a particular state-action pair occurring under πe is very different from the probability of that same pair occurring in D (Voloshin et al. 2021; Fu et al. 2021). In this work, we propose to improve the accuracy of OPE estimators by projecting the high-dimensional state-space into a low-dimensional state-space using concepts from the state abstraction literature. Specifically, we consider marginalized importance sampling (MIS) OPE algorithms which compute state-action distribution correction ratios to produce their OPE estimate. In the original ground state-space, these ratios may have high variance which may lead to high variance OPE. However, we prove that in the lower-dimensional abstract state-space the ratios can have lower variance resulting in lower variance OPE. We then highlight the challenges that arise when estimating the abstract ratios from data, identify sufficient conditions to overcome these issues, and present a minimax optimization problem whose solution yields these abstract ratios. Finally, our empirical evaluation on difficult, high-dimensional state-space OPE tasks shows that the abstract ratios can make MIS OPE estimators achieve lower mean-squared error and be more robust to hyperparameter tuning than the ground ratios.
Temporal Disentanglement of Representations for Improved Generalisation in Reinforcement Learning.
Mhairi Dunion, Trevor McInroe, Kevin Sebastian Luck,
Josiah P. Hanna, Stefano V. Albrecht.
Proceedings of the NeurIPS 2022 Workshop on Deep Reinforcement Learning. December 2022.
Abstract
BibTeX
Paper
Reinforcement Learning (RL) agents are often unable to generalise well to environment variations in the state space that were not observed during training. This issue is especially problematic for image-based RL, where a change in just one variable, such as the background colour, can change many pixels in the image. The changed pixels can lead to drastic changes in the agent's latent representation of the image, causing the learned policy to fail. To learn more robust representations, we introduce TEmporal Disentanglement (TED), a self-supervised auxiliary task that leads to disentangled image representations exploiting the sequential nature of RL observations. We find empirically that RL algorithms utilising TED as an auxiliary task adapt more quickly to changes in environment variables with continued training compared to state-of-the-art representation learning methods. Since TED enforces a disentangled structure of the representation, our experiments also show that policies trained with TED generalise better to unseen values of variables irrelevant to the task (e.g. background colour) as well as unseen values of variables that affect the optimal policy (e.g. goal positions).
Multi-agent Databases Via Independent Learning.
Chi Zhang, Olga Papaemmanouil,
Josiah P. Hanna, Aditya Akella.
Proceedings of the 4th International Workshop on Applied AI for Database Systems and Applications. September 2022.
Abstract
BibTeX
Paper
Machine learning is rapidly being used in database research to improve the effectiveness of numerous tasks, including but not limited to query optimization, workload scheduling, and physical design. Currently, the research focus has been on replacing a single database component responsible for one task by its learning-based counterpart. However, query performance is not simply determined by the performance of a single component, but by the cooperation of multiple ones. As such, learning-based database components need to collaborate during both training and execution in order to develop policies that meet end performance goals. Thus, the paper attempts to address the question "Is it possible to design a database consisting of various learned components that cooperatively work to improve end-to-end query latency?". To answer this question, we introduce MADB (Multi-Agent DB), a proof-of-concept system that incorporates a learned query scheduler and a learned query optimizer. MADB leverages a cooperative multi-agent reinforcement learning approach that allows the two components to exchange the context of their decisions with each other and collaboratively work towards reducing the query latency. Preliminary results demonstrate that MADB can outperform the non-cooperative integration of learned components.
Grounded Action Transformation for Sim-to-Real Reinforcement Learning.
Josiah P. Hanna, Sid Desai, Haresh Karnan, Garrett Warnell, Peter Stone.
Machine Learning (MLJ): Special Issue on Reinforcement Learning for Real Life. May 2021.
Abstract
BibTeX
Paper
Reinforcement learning in simulation is a promising alternative to the prohibitive sample cost of reinforcement learning in the physical world. Unfortunately, policies learned in simulation often perform worse than hand-coded policies when applied on the target, physical system. Grounded simulation learning (GSL) is a general framework that promises to address this issue by altering the simulator to better match the real world (Farchy et al. 2013). This article introduces a new algorithm for GSL – Grounded Action Transformation (GAT) – and applies it to learning control policies for a humanoid robot. We evaluate our algorithm in controlled experiments where we show it to allow policies learned in simulation to transfer to the real world. We then apply our algorithm to learning a fast bipedal walk on a humanoid robot and demonstrate a 43.27% improvement in forward walk velocity compared to a state-of-the-art hand-coded walk. This striking empirical success notwithstanding, further empirical analysis shows that GAT may struggle when the real world has stochastic state transitions. To address this limitation we generalize GAT to the stochastic GAT (SGAT) algorithm and empirically show that SGAT leads to successful real world transfer in situations where GAT may fail to find a good policy. Our results contribute to a deeper understanding of grounded simulation learning and demonstrate its effectiveness for applying reinforcement learning to learn robot control policies entirely in simulation.
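At a high level, grounded action transformation wraps the simulator with a learned action transformation so that simulated transitions better match observed real-world transitions. The sketch below is schematic only; the forward model of the real world and the inverse model of the simulator are assumed interfaces, not the paper's code.

```python
def grounded_step(sim_env, state, action, forward_model, inverse_model):
    """One grounded simulator step in the spirit of GAT.
    forward_model(s, a): predicts the next state the *real* system would reach.
    inverse_model(s, s_next): returns the simulator action that moves the
    simulator from s toward s_next.
    The policy's action is replaced by a grounded action so that the
    simulator's transition better matches real-world dynamics."""
    predicted_real_next = forward_model(state, action)
    grounded_action = inverse_model(state, predicted_real_next)
    return sim_env.step(grounded_action)
```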
Importance Sampling in Reinforcement Learning with an Estimated Behavior Policy.
Josiah P. Hanna, Scott Niekum, Peter Stone.
Machine Learning. May 2021.
Abstract
BibTeX
Paper
In reinforcement learning, importance sampling is a widely used method for evaluating an expectation under the distribution of data of one policy when the data has in fact been generated by a different policy. Importance sampling requires computing the likelihood ratio between the action probabilities of a target policy and those of the data-producing behavior policy. In this article, we study importance sampling where the behavior policy action probabilities are replaced by the maximum likelihood estimate of these probabilities under the observed data. We show this general technique reduces variance due to sampling error in Monte Carlo style estimators. We introduce two novel estimators that use this technique to estimate expected values that arise in the RL literature. We find that these general estimators reduce the variance of Monte Carlo sampling methods, leading to faster learning for policy gradient algorithms and more accurate off-policy policy evaluation. We also provide theoretical analysis showing that our new estimators are consistent and have asymptotically lower variance than Monte Carlo estimators.
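The core estimator replaces the true behavior policy's action probabilities with maximum-likelihood estimates fit to the observed data. A small sketch for a discrete-action, bandit-style case follows; the count-based MLE and the per-sample weighting are illustrative simplifications, not the article's estimators.

```python
import numpy as np
from collections import Counter

def estimated_behavior_is(actions, rewards, target_probs):
    """Importance sampling with an *estimated* behavior policy.
    actions: observed discrete actions; rewards: observed rewards;
    target_probs: dict mapping action -> target policy probability.
    The behavior probabilities are replaced by their empirical MLE,
    which reduces variance due to sampling error."""
    counts = Counter(actions)
    n = len(actions)
    mle_behavior = {a: counts[a] / n for a in counts}
    weights = np.array([target_probs[a] / mle_behavior[a] for a in actions])
    return float(np.mean(weights * np.asarray(rewards)))

est = estimated_behavior_is([0, 1, 0, 0], [1.0, 0.0, 0.5, 1.0],
                            target_probs={0: 0.5, 1: 0.5})
```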
Interpretable Goal Recognition in the Presence of Occluded Factors for Autonomous Vehicles.
Josiah P. Hanna, Arrasy Rahman, Elliot Fosong, Francisco Eiras, Mihai Dobre, John Redford, Subramanian Ramamoorthy, Stefano V. Albrecht.
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). October 2021.
Abstract
BibTeX
Paper
Recognising the goals or intentions of observed vehicles is a key step towards predicting the long-term future behaviour of other agents in an autonomous driving scenario. When there are unseen obstacles or occluded vehicles in a scenario, goal recognition may be confounded by the effects of these unseen entities on the behaviour of observed vehicles. Existing prediction algorithms that assume rational behaviour with respect to inferred goals may fail to make accurate long-horizon predictions because they ignore the possibility that the behaviour is influenced by such unseen entities. We introduce the Goal and Occluded Factor Inference (GOFI) algorithm which bases inference on inverse-planning to jointly infer a probabilistic belief over goals and potential occluded factors. We then show how these beliefs can be integrated into Monte Carlo Tree Search (MCTS). We demonstrate that jointly inferring goals and occluded factors leads to more accurate beliefs with respect to the true world state and allows an agent to safely navigate several scenarios where other baselines take unsafe actions leading to collisions.
A Joint Imitation-Reinforcement Learning Framework for Reduced Baseline Regret.
Sheelabhadra Dey, Sumedh Pendurkar, Guni Sharon,
Josiah P. Hanna.
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). September 2021.
Abstract
BibTeX
Paper
In various control task domains, existing controllers provide a baseline level of performance that—though possibly suboptimal—should be maintained. Reinforcement learning (RL) algorithms that rely on extensive exploration of the state and action space can be used to optimize a control policy. However, fully exploratory RL algorithms may decrease performance below a baseline level during training. In this paper, we address the issue of online optimization of a control policy while minimizing regret with respect to a baseline policy performance. We present a joint imitation-reinforcement learning framework, denoted JIRL. The learning process in JIRL assumes the availability of a baseline policy and is designed with two objectives in mind: (a) training while leveraging demonstrations from the baseline policy to minimize regret with respect to the baseline policy, and (b) eventually surpassing the baseline performance. JIRL addresses these objectives by initially learning to imitate the baseline policy and gradually shifting control from the baseline to an RL agent. Experimental results show that JIRL effectively accomplishes the aforementioned objectives in several continuous action-space domains. The results demonstrate that JIRL is comparable to a state-of-the-art algorithm in its final performance while incurring significantly lower baseline regret during training. Moreover, the results show a reduction factor of up to 21 in baseline regret over a trust-region based approach that guarantees monotonic policy improvement.
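The control-shifting idea can be sketched as a simple mixture that begins with the baseline controller and hands control to the RL agent over time (the mixing schedule and names are illustrative assumptions, not the authors' exact scheme):

```python
import random

def select_action(state, baseline_policy, rl_policy, trust):
    """Blend a baseline controller with an RL agent during training.
    trust in [0, 1] starts near 0 (defer to the baseline, keeping regret
    w.r.t. the baseline low) and is raised as the RL agent's estimated
    performance approaches and then surpasses the baseline."""
    if random.random() < trust:
        return rl_policy(state)
    return baseline_policy(state)
```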
Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning.
Rujie Zhong,
Josiah P. Hanna, Lukas Schäfer, Stefano V. Albrecht.
Proceedings of the NeurIPS Workshop on Offline Reinforcement Learning (OfflineRL). December 2021.
Abstract
BibTeX
Paper
Reinforcement learning (RL) algorithms are often categorized as either on-policy or off-policy depending on whether they use data from a target policy of interest or from a different behavior policy. In this paper, we study a subtle distinction between on-policy data and on-policy sampling in the context of the RL sub-problem of policy evaluation. We observe that on-policy sampling may fail to match the expected distribution of on-policy data after observing only a finite number of trajectories and this failure hinders data-efficient policy evaluation. Towards improved data-efficiency, we show how non-i.i.d., off-policy sampling can produce data that more closely matches the expected on-policy data distribution and consequently increases the accuracy of the Monte Carlo estimator for policy evaluation. We introduce a method called Robust On-Policy Sampling and demonstrate theoretically and empirically that it produces data that converges faster to the expected on-policy distribution compared to on-policy sampling. Empirically, we show that this faster convergence leads to lower mean squared error policy value estimates.
Safe Evaluation for Offline Learning: Are We Ready To Deploy?
Hager Radi,
Josiah P. Hanna, Peter Stone, Matthew E. Taylor.
Proceedings of the NeurIPS Workshop on Deployable Decision Making in Embodied Systems (DDM). December 2021.
Abstract
BibTeX
Paper
The world currently offers an abundance of data in multiple domains, from which we can learn reinforcement learning (RL) policies without further interaction with the environment. RL agents learning offline from such data is possible, but deploying them while learning might be dangerous in domains where safety is critical. Therefore, it is essential to find a way to estimate how a newly-learned agent will perform if deployed in the target environment before actually deploying it and without the risk of overestimating its true performance. To achieve this, we introduce a framework for safe evaluation of offline learning using approximate high-confidence off-policy evaluation (HCOPE) to estimate the performance of offline policies during learning. In our setting, we assume a source of data, which we split into a train-set, to learn an offline policy, and a test-set, to estimate a lower bound on the offline policy using off-policy evaluation with bootstrapping. A lower-bound estimate tells us how well a newly-learned target policy would perform before it is deployed in the real environment, and therefore allows us to decide when to deploy our learned policy.
Behavior Policy Search for Risk Estimators in RL.
Elita Lobo, Yash Chandak, Subramanian Dharmashankar,
Josiah P. Hanna, Marek Petrik.
Proceedings of the NeurIPS Workshop on Safe and Robust Control of Uncertain Systems. December 2021.
Abstract
BibTeX
Paper
In real-world sequential decision problems, exploration is expensive, and the risk of expert decision policies must be evaluated from limited data. In this setting, Monte Carlo (MC) risk estimators are typically used to estimate the risk of decision policies. Unfortunately, while these estimators have the desired low bias property, they often suffer from large variance. In this paper, we consider the problem of minimizing the asymptotic mean squared error and hence variance of MC risk estimators. We show that by carefully choosing the data sampling policy (behavior policy), we can obtain low variance estimates of the risk of any given decision policy.
Towards Quantum-Secure Authentication and Key Agreement Via Abstract Multi-Agent Interaction.
Ibrahim H. Ahmed,
Josiah P. Hanna, Elliot Fosong, Stefano V. Albrecht.
Proceedings of the International Conference on Practical Applications of Agents and Multi-Agent Systems (PAAMS). October 2021.
Abstract
BibTeX
Paper
Current methods for authentication and key agreement based on public-key cryptography are vulnerable to quantum computing. We propose a novel approach based on artificial intelligence research in which communicating parties are viewed as autonomous agents which interact repeatedly using their private decision models. Authentication and key agreement are decided based on the agents' observed behaviors during the interaction. The security of this approach rests upon the difficulty of modeling the decisions of interacting agents from limited observations, a problem which we conjecture is also hard for quantum computing. We release PyAMI, a prototype authentication and key agreement system based on the proposed method. We empirically validate our method for authenticating legitimate users while detecting different types of adversarial attacks. Finally, we show how reinforcement learning techniques can be used to train server models which effectively probe a client's decisions to achieve more sample-efficient authentication.
Decoupled Reinforcement Learning To Stabilise Intrinsically-Motivated Exploration.
Lukas Schäfer,
Josiah P. Hanna, Filippos Christiano, Stefano V Albrecht.
Proceedings of the ICML Workshop on Unsupervised Reinforcement Learning (URL). July 2021.
Abstract
BibTeX
Paper
Intrinsic rewards can improve exploration in reinforcement learning, but the exploration process may suffer from instability caused by non-stationary reward shaping and strong dependency on hyperparameters. In this work, we introduce Decoupled RL (DeRL) as a general framework which trains separate policies for intrinsically-motivated exploration and exploitation. Such decoupling allows DeRL to leverage the benefits of intrinsic rewards for exploration while demonstrating improved robustness and sample efficiency. We evaluate DeRL algorithms in two sparse-reward environments with multiple types of intrinsic rewards. Our results show that DeRL is more robust to varying scale and rate of decay of intrinsic rewards and converges to the same evaluation returns as intrinsically-motivated baselines in fewer interactions. Lastly, we discuss the challenge of distribution shift and show that divergence constraint regularisers can successfully minimise instability caused by divergence of exploration and exploitation policies.
RIDM: Reinforced Inverse Dynamics Modeling for Learning From a Single Observed Demonstration.
Brahma S. Pavse, Faraz Torabi,
Josiah P. Hanna, Garrett Warnell, Peter Stone.
IEEE Robotics and Automation Letters. October 2020.
Abstract
BibTeX
Paper
Augmenting reinforcement learning with imitation learning is often hailed as a method by which to improve upon learning from scratch. However, most existing methods for integrating these two techniques are subject to several strong assumptions—chief among them that information about demonstrator actions is available. In this paper, we investigate the extent to which this assumption is necessary by introducing and evaluating reinforced inverse dynamics modeling (RIDM), a novel paradigm for combining imitation from observation (IfO) and reinforcement learning with no dependence on demonstrator action information. Moreover, RIDM requires only a single demonstration trajectory and is able to operate directly on raw (unaugmented) state features. We find experimentally that RIDM performs favorably compared to a baseline approach for several tasks in simulation as well as for tasks on a real UR5 robot arm. Experiment videos can be found at https://sites.google.com/view/ridm-reinforced-inverse-dynami.
An Imitation From Observation Approach To Transfer Learning with Dynamics Mismatch.
Siddharth Desai, Ishan Durugkar, Haresh Karnan, Garrett Warnell,
Josiah P. Hanna, Peter Stone.
Proceedings of the 33rd Advances in Neural Information Processing Systems (NeurIPS). December 2020.
Abstract
BibTeX
Paper
We examine the problem of transferring a policy learned in a source environment to a target environment with different dynamics, particularly in the case where it is critical to reduce the amount of interaction with the target environment during learning. This problem is particularly important in sim-to-real transfer because simulators inevitably model real-world dynamics imperfectly. In this paper, we show that one existing solution to this transfer problem, "grounded action transformation," is closely related to the problem of imitation from observation (IfO): learning behaviors that mimic the observations of behavior demonstrations. After establishing this relationship, we hypothesize that recent state-of-the-art approaches from the IfO literature can be effectively repurposed for grounded transfer learning. To validate our hypothesis we derive a new algorithm, generative adversarial reinforced action transformation (GARAT), based on adversarial imitation from observation techniques. We run experiments in several domains with mismatched dynamics, and find that agents trained with GARAT achieve higher returns in the target environment compared to existing black-box transfer methods.
Stochastic Grounded Action Transformation for Robot Learning in Simulation.
Sid Desai, Haresh Karnan,
Josiah P. Hanna, Garrett Warnell, Peter Stone.
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). October 2020.
Abstract
BibTeX
Paper
Robot control policies learned in simulation do not often transfer well to the real world. Many existing solutions to this sim-to-real problem, such as the Grounded Action Transformation (GAT) algorithm, seek to correct for or ground these differences by matching the simulator to the real world. However, the efficacy of these approaches is limited if they do not explicitly account for stochasticity in the target environment. In this work, we analyze the problems associated with grounding a deterministic simulator in a stochastic real world environment, and we present examples where GAT fails to transfer a good policy due to stochastic transitions in the target domain. In response, we introduce the Stochastic Grounded Action Transformation (SGAT) algorithm, which models this stochasticity when grounding the simulator. We find experimentally for both simulated and physical target domains that SGAT can find policies that are robust to stochasticity in the target domain.
Reinforced Grounded Action Transformation for Sim-to-Real Transfer.
Haresh Karnan, Sid Desai,
Josiah P. Hanna, Garrett Warnell, Peter Stone.
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). October 2020.
Abstract
BibTeX
Paper
Robots can learn to do complex tasks in simulation, but often, learned behaviors fail to transfer well to the real world due to simulator imperfections (the reality gap). Some existing solutions to this sim-to-real problem, such as Grounded Action Transformation (GAT), use a small amount of real world experience to minimize the reality gap by grounding the simulator. While very effective in certain scenarios, GAT is not robust on problems that use complex function approximation techniques to model a policy. In this paper, we introduce Reinforced Grounded Action Transformation (RGAT), a new sim-to-real technique that uses Reinforcement Learning (RL) not only to update the target policy in simulation, but also to perform the grounding step itself. This novel formulation allows for end-to-end training during the grounding step, which, compared to GAT, produces a better grounded simulator. Moreover, we show experimentally in several MuJoCo domains that our approach leads to successful transfer for policies modeled using neural networks.
Learning an Interpretable Traffic Signal Control Policy.
James Ault,
Josiah P. Hanna, Guni Sharon.
Proceedings of the 19th International Conference on Autonomous Agents and Multiagent Systems (AAMAS). May 2020.
Abstract
BibTeX
Paper
Signalized intersections are managed by controllers that assign right of way (green, yellow, and red lights) to non-conflicting directions. Optimizing the actuation policy of such controllers is expected to alleviate traffic congestion and its adverse impact. Given such a safety-critical domain, the affiliated actuation policy is required to be interpretable in a way that can be understood and regulated by a human. This paper presents and analyzes several on-line optimization techniques for tuning interpretable control functions. Although these techniques are defined in a general way, this paper assumes a specific class of interpretable control functions (polynomial functions) for analysis purposes. We show that such an interpretable policy function can be as effective as a deep neural network for approximating an optimized signal actuation policy. We present empirical evidence that supports the use of value-based reinforcement learning for on-line training of the control function. Specifically, we present and study three variants of the Deep Q-learning algorithm that allow the training of an interpretable policy function. Our Deep Regulatable Hardmax Q-learning variant is shown to be particularly effective in optimizing our interpretable actuation policy, resulting in up to 19.4% lower vehicle delay compared to commonly deployed actuated signal controllers.
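To make the interpretable polynomial policy concrete, here is a hypothetical sketch of a hardmax controller that scores each candidate signal phase with a polynomial in that phase's features and actuates the highest-scoring phase. The feature choice, degree, and coefficient layout are my assumptions for illustration, not the paper's parameterization.

```python
import numpy as np

def polynomial_phase_controller(phase_features, coefficients, degree=2):
    """Hypothetical interpretable actuation policy: score each phase with a
    polynomial in its features (e.g., queue lengths, waiting times) and give
    the green light to the highest-scoring phase ("hardmax" selection).

    phase_features: list of feature vectors, one per candidate phase
    coefficients:   flat weight vector of length len(features) * degree
    """
    scores = []
    for x in phase_features:
        x = np.asarray(x, dtype=float)
        # Polynomial feature expansion: [x, x^2, ..., x^degree], concatenated.
        terms = np.concatenate([x ** d for d in range(1, degree + 1)])
        scores.append(float(np.dot(coefficients, terms)))
    return int(np.argmax(scores))   # index of the phase to actuate next
```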
Reducing Sampling Error in Batch Temporal Difference Learning.
Brahma Pavse, Ishan Durugkar,
Josiah P. Hanna, Peter Stone.
Proceedings of the Offline Reinforcement Learning Workshop at Neural Information Processing Systems (NeurIPS). December 2020.
Abstract
BibTeX
Paper
Temporal difference (TD) learning is one of the main foundations of modern reinforcement learning. This paper studies the use of TD(0), a canonical TD algorithm, to estimate the value function of a given policy from a batch of data. In this batch setting, we show that TD(0) may converge to an inaccurate value function because the update following an action is weighted according to the number of times that action occurred in the batch – not the true probability of the action under the given policy. To address this limitation, we introduce policy sampling error corrected-TD(0) (PSEC-TD(0)). PSEC-TD(0) first estimates the empirical distribution of actions in each state in the batch and then uses importance sampling to correct for the mismatch between the empirical weighting and the correct weighting for updates following each action. We refine the concept of a certainty-equivalence estimate and argue that PSEC-TD(0) is a more data efficient estimator than TD(0) for a fixed batch of data. Finally, we conduct an empirical evaluation of PSEC-TD(0) on three batch value function learning tasks, with a hyperparameter sensitivity analysis, and show that PSEC-TD(0) produces value function estimates with lower mean squared error than TD(0).
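Following the abstract's description, a minimal tabular sketch of the PSEC-TD(0) idea looks roughly like the following, assuming finite states and actions, integer state labels, and a known target policy `pi[s][a]`; the full algorithm and its certainty-equivalence analysis are in the paper.

```python
import numpy as np
from collections import defaultdict

def psec_td0(batch, pi, num_states, alpha=0.1, gamma=0.99, epochs=50):
    """Minimal tabular sketch of the PSEC-TD(0) idea (simplifying assumptions,
    not the paper's full algorithm).

    batch: list of transitions (s, a, r, s_next, done) collected by one behavior policy.
    pi:    target policy, pi[s][a] = probability of action a in state s.
    """
    # 1) Empirical (maximum-likelihood) action distribution in each state of the batch.
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, _, _, _ in batch:
        counts[s][a] += 1
    def pi_hat(s, a):
        return counts[s][a] / sum(counts[s].values())

    # 2) TD(0) updates, importance-weighted by pi / pi_hat so each action's update
    #    reflects its probability under the target policy rather than its empirical
    #    frequency in the batch.
    V = np.zeros(num_states)
    for _ in range(epochs):
        for s, a, r, s_next, done in batch:
            rho = pi[s][a] / pi_hat(s, a)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * rho * (target - V[s])
    return V
```

With `rho` fixed to 1 this reduces to ordinary batch TD(0); the correction only changes how much weight each observed action receives.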
Importance Sampling Policy Evaluation with an Estimated Behavior Policy.
Josiah P. Hanna, Scott Niekum, Peter Stone.
Proceedings of the 36th International Conference on Machine Learning (ICML). June 2019.
Abstract
BibTeX
Paper
We consider the problem of off-policy evaluation in Markov decision processes. Off-policy evaluation is the task of evaluating the expected return of one policy with data generated by a different, behavior policy. Importance sampling is a technique for off-policy evaluation that re-weights off-policy returns to account for differences in the likelihood of the returns between the two policies. In this paper, we study importance sampling with an estimated behavior policy where the behavior policy estimate comes from the same set of data used to compute the importance sampling estimate. We find that this estimator often lowers the mean squared error of off-policy evaluation compared to importance sampling with the true behavior policy or using a behavior policy that is estimated from a separate data set. Intuitively, estimating the behavior policy in this way corrects for error due to sampling in the action-space. Our empirical results also extend to other popular variants of importance sampling and show that estimating a non-Markovian behavior policy can further lower large-sample mean squared error even when the true behavior policy is Markovian.
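A minimal sketch of the idea under simplifying assumptions (finite states and actions, a count-based Markovian estimate of the behavior policy, ordinary per-trajectory importance sampling): estimate the behavior policy from the very data being re-weighted, then plug that estimate into the importance weights. The helper below is illustrative, not the paper's estimator.

```python
import numpy as np
from collections import defaultdict

def is_with_estimated_behavior_policy(trajectories, pi, gamma=1.0):
    """Illustrative off-policy value estimate where the behavior policy in the
    importance weights is the count-based estimate from the same data.

    trajectories: list of trajectories, each a list of (s, a, r) tuples,
                  all collected by a single behavior policy.
    pi: target policy, pi[s][a] = probability of action a in state s.
    """
    # Maximum-likelihood behavior policy estimated from the evaluation data itself.
    counts = defaultdict(lambda: defaultdict(int))
    for traj in trajectories:
        for s, a, _ in traj:
            counts[s][a] += 1
    def pi_b_hat(s, a):
        return counts[s][a] / sum(counts[s].values())

    estimates = []
    for traj in trajectories:
        weight, ret, discount = 1.0, 0.0, 1.0
        for s, a, r in traj:
            weight *= pi[s][a] / pi_b_hat(s, a)   # estimated, not true, behavior policy
            ret += discount * r
            discount *= gamma
        estimates.append(weight * ret)
    return float(np.mean(estimates))
```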
Reducing Sampling Error in the Monte Carlo Policy Gradient Estimator.
Josiah P. Hanna, Peter Stone.
Proceedings of the 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS). May 2019.
Abstract
BibTeX
Paper
This paper studies a class of reinforcement learning algorithms known as policy gradient methods. Policy gradient methods optimize the performance of a policy by estimating the gradient of the expected return with respect to the policy parameters. One of the core challenges of applying policy gradient methods is obtaining an accurate estimate of this gradient. Most policy gradient methods rely on Monte Carlo sampling to estimate this gradient. When only a limited number of environment steps can be collected, Monte Carlo policy gradient estimates may suffer from sampling error – samples receive more or less weight than they would in expectation. In this paper, we introduce the Sampling Error Corrected policy gradient estimator that corrects the inaccurate Monte Carlo weights. Our approach treats the observed data as if it were generated by a different policy than the policy that actually generated the data. It then uses importance sampling between the two – in the process correcting the inaccurate Monte Carlo weights. Under a limiting set of assumptions we can show that this gradient estimator will have lower variance than the Monte Carlo gradient estimator. We show experimentally that our approach improves the learning speed of two policy gradient methods compared to standard Monte Carlo sampling even when the theoretical assumptions fail to hold.
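One plausible way to write the corrected estimator, consistent with the abstract (the notation is mine and the paper's exact weighting may differ in detail): re-weight each sampled action by the ratio of its probability under the current policy to its empirical frequency in the data,

$$
\hat{g}_{\text{SEC}}(\theta) \;=\; \frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{T-1}
\frac{\pi_\theta(a_t^i \mid s_t^i)}{\hat{\pi}_{\mathcal{D}}(a_t^i \mid s_t^i)}\,
\nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, G_t^i,
$$

where $\hat{\pi}_{\mathcal{D}}$ is the maximum-likelihood policy estimated from the $m$ sampled trajectories and $G_t^i$ is the return following time $t$ in trajectory $i$.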
Selecting Compliant Agents for Opt-in Microtolling.
Josiah P. Hanna, Guni Sharon, Stephen D. Boyles, Peter Stone.
Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI). January 2019.
Abstract
BibTeX
Paper
This paper examines the impact of tolls on social welfare in the context of a transportation network in which only a portion of the agents are subject to tolls. More specifically, this paper addresses the question: which subset of agents provides the most system benefit if they are compliant with an approximate marginal cost tolling scheme? Since previous work suggests this problem is NP-hard, we examine a heuristic approach. Our experimental results on three real-world traffic scenarios suggest that evaluating the marginal impact of a given agent serves as a particularly strong heuristic for selecting an agent to be compliant. Using this heuristic to select 7.6% of the agents to be compliant achieved an increase of up to 10.9% in social welfare over not tolling at all. The presented heuristic approach and conclusions can help practitioners target specific agents to participate in an opt-in tolling scheme.
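A hypothetical sketch of how a marginal-impact heuristic of this kind could be organized: score each agent by the welfare gain of making that agent alone compliant, then select the top k. The `welfare` callable (e.g., a traffic simulation run over a compliant set) and the one-at-a-time evaluation are my assumptions, not the paper's exact procedure.

```python
def select_compliant_agents(agents, welfare, k):
    """Greedy marginal-impact selection sketch (hypothetical, not the paper's code).

    agents:  iterable of agent identifiers
    welfare: callable mapping a frozenset of compliant agents to a social-welfare value
    k:       number of agents to make compliant
    """
    baseline = welfare(frozenset())                       # welfare with no compliant agents
    gains = {a: welfare(frozenset([a])) - baseline for a in agents}
    # Pick the k agents whose individual compliance yields the largest welfare gain.
    return sorted(agents, key=lambda a: gains[a], reverse=True)[:k]
```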
RIDM: Reinforced Inverse Dynamics Modeling for Learning From a Single Observed Demonstration.
Brahma S. Pavse, Faraz Torabi,
Josiah P. Hanna, Garrett Warnell, Peter Stone.
Proceedings of the Imitation, Intent, and Interaction (I3) Workshop at ICML 2019. June 2019.
Abstract
BibTeX
Paper
Imitation learning has long been an approach to alleviate the tractability issues that arise in reinforcement learning. However, most literature makes several assumptions such as access to the expert's actions, availability of many expert demonstrations, and injection of task-specific domain knowledge into the learning process. We propose reinforced inverse dynamics modeling (RIDM), a method of combining reinforcement learning and imitation from observation (IfO) to perform imitation using a single expert demonstration, with no access to the expert's actions, and with little task-specific domain knowledge. Given only a single set of the expert's raw states, such as joint angles in a robot control task, at each time-step, we learn an inverse dynamics model to produce the necessary low-level actions, such as torques, to transition from one state to the next such that the reward from the environment is maximized. We demonstrate that RIDM outperforms other techniques when we apply the same constraints on the other methods on six domains of the MuJoCo simulator and for two different robot soccer tasks for two experts from the RoboCup 3D simulation league on the SimSpark simulator.
DyETC: Dynamic Electronic Toll Collection for Traffic Congestion Alleviation.
Haipeng Chen, Bo An, Guni Sharon,
Josiah P. Hanna, Peter Stone, Chunyan Miao, Yeng Chai Soh.
Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI). February 2018.
Abstract
BibTeX
Paper
To alleviate traffic congestion in urban areas, electronic toll collection (ETC) systems are deployed all over the world. Despite the merits, tolls are usually pre-determined and fixed from day to day, which fail to consider traffic dynamics and thus have limited regulation effect when traffic conditions are abnormal. In this paper, we propose a novel dynamic ETC (DyETC) scheme which adjusts tolls to traffic conditions in realtime. The DyETC problem is formulated as a Markov decision process (MDP), the solution of which is very challenging due to its 1) multi-dimensional state space, 2) multi-dimensional, continuous and bounded action space, and 3) time-dependent state and action values. Due to the complexity of the formulated MDP, existing methods cannot be applied to our problem. Therefore, we develop a novel algorithm, PG-β, which makes three improvements to traditional policy gradient method by proposing 1) time-dependent value and policy functions, 2) Beta distribution policy function and 3) state abstraction. Experimental results show that, compared with existing ETC schemes, DyETC increases traffic volume by around 8% and reduces travel time by around 14.6% during rush hour. Considering the total traffic volume in a traffic network, this contributes to a substantial increase in social welfare.
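To illustrate the "Beta distribution policy function" ingredient, here is a hypothetical PyTorch policy head that samples a toll from a Beta distribution and rescales it to a bounded toll range. The network sizes, the softplus parameterization, and the class name are assumptions for the sketch, not the paper's PG-β architecture.

```python
import torch
import torch.nn as nn

class BetaTollPolicy(nn.Module):
    """Illustrative Beta-distribution policy head for a bounded toll action
    (hypothetical architecture, not the paper's network)."""

    def __init__(self, state_dim, toll_max, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.alpha_head = nn.Linear(hidden, 1)
        self.beta_head = nn.Linear(hidden, 1)
        self.toll_max = toll_max

    def forward(self, state):
        h = self.body(state)
        # softplus + 1 keeps both concentration parameters above 1 (unimodal density).
        alpha = nn.functional.softplus(self.alpha_head(h)) + 1.0
        beta = nn.functional.softplus(self.beta_head(h)) + 1.0
        dist = torch.distributions.Beta(alpha, beta)
        x = dist.sample()                    # x in (0, 1)
        toll = x * self.toll_max             # rescale to the bounded toll range
        return toll, dist.log_prob(x)
```

A Beta policy keeps all probability mass inside the bounded action interval, which is the usual motivation for preferring it over a clipped Gaussian in settings like this.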
Towards a Data Efficient Off-Policy Policy Gradient.
Josiah P. Hanna, Peter Stone.
AAAI Spring Symposium on Data Efficient Reinforcement Learning. April 2018.
Abstract
BibTeX
Paper
The ability to learn from off-policy data – data generated from past interaction with the environment – is essential to data efficient reinforcement learning. Recent work has shown that the use of off-policy data not only allows the re-use of data but can even improve performance in comparison to on-policy reinforcement learning. In this work we investigate if a recently proposed method for learning a better data generation policy, commonly called a behavior policy, can also increase the data efficiency of policy gradient reinforcement learning. Empirical results demonstrate that with an appropriately selected behavior policy we can estimate the policy gradient more accurately. The results also motivate further work into developing methods for adapting the behavior policy as the policy we are learning changes.
Network-wide Adaptive Tolling for Connected and Automated Vehicles.
Guni Sharon, Michael W. Levin,
Josiah P. Hanna, Tarun Rambha, Stephen D. Boyles, Peter Stone.
Transportation Research Part C. September 2017.
Abstract
BibTeX
Paper
This article proposes Delta-tolling, a simple adaptive pricing scheme that requires only travel time observations and two tuning parameters. These tolls are applied throughout a road network, and can be updated as frequently as travel time observations are made. Notably, Delta-tolling does not require any details of the traffic flow or travel demand models other than travel time observations, rendering it easy to apply in real-time. The flexibility of this tolling scheme is demonstrated in three specific traffic modeling contexts with varying traffic flow and user behavior assumptions: a day-to-day pricing model using static network equilibrium with link delay functions; a within-day adaptive pricing model using the cell transmission model and dynamic routing of vehicles; and a microsimulation of reservation-based intersection control for connected and autonomous vehicles with myopic routing. In all cases, Delta-tolling produces significant benefits over the no-toll case, measured in terms of average travel time and social welfare, while only requiring two parameters to be tuned. Some optimality results are also given for the special case of the static network equilibrium model with BPR-style delay functions.
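A hedged reconstruction of what such a two-parameter, travel-time-only update can look like (the exact update rule is given in the paper; the symbols below are my notation): each link's toll tracks a multiple of its travel-time excess over free flow, smoothed over successive observations,

$$
\Delta_e = t_e - t_e^{\text{free}}, \qquad
\tau_e \;\leftarrow\; (1 - R)\,\tau_e + R\,\beta\,\Delta_e,
$$

where $t_e$ is the latest observed travel time on link $e$, $t_e^{\text{free}}$ its free-flow travel time, $\beta$ scales toll magnitudes, and $R$ controls how quickly tolls respond to new observations.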
Data-Efficient Policy Evaluation Through Behavior Policy Search.
Josiah P. Hanna, Philip Thomas, Peter Stone, Scott Niekum.
Proceedings of the 34th International Conference on Machine Learning (ICML). August 2017.
Abstract
BibTeX
Paper
We consider the task of evaluating a policy for a Markov decision process (MDP). The standard unbiased technique for evaluating a policy is to deploy the policy and observe its performance. We show that the data collected from deploying a different policy, commonly called the behavior policy, can be used to produce unbiased estimates with lower mean squared error than this standard technique. We derive an analytic expression for the optimal behavior policy—the behavior policy that minimizes the mean squared error of the resulting estimates. Because this expression depends on terms that are unknown in practice, we propose a novel policy evaluation sub-problem, behavior policy search: searching for a behavior policy that reduces mean squared error. We present a behavior policy search algorithm and empirically demonstrate its effectiveness in lowering the mean squared error of policy performance estimates.
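For intuition, in the simpler one-shot importance-sampling setting the variance-minimizing proposal for estimating $\mathbb{E}_{x\sim\pi}[g(x)]$ with $g \ge 0$ is the classical

$$
q^{*}(x) \;=\; \frac{\pi(x)\,g(x)}{\sum_{x'}\pi(x')\,g(x')},
$$

which yields a zero-variance estimator; for signed $g$ the optimum is instead proportional to $\pi(x)\,\lvert g(x)\rvert$. Roughly speaking, the minimal-variance behavior policy derived in the paper plays the analogous role for full MDP trajectories, and behavior policy search is needed because that expression depends on quantities that are unknown in practice.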
Real-time Adaptive Tolling Scheme for Optimized Social Welfare in Traffic Networks.
Guni Sharon,
Josiah P. Hanna, Tarun Rambha, Michael W. Levin, Michael Albert, Stephen D. Boyles, Peter Stone.
Proceedings of the 16th International Conference on Autonomous Agents and Multiagent Systems (AAMAS-2017). May 2017.
Abstract
BibTeX
Paper
Connected and autonomous vehicle technology has advanced rapidly in recent years. These technologies create possibilities for advanced AI-based traffic management techniques. Developing such techniques is an important challenge and opportunity for the AI community, as it requires synergy between experts in game theory, multiagent systems, behavioral science, and flow optimization. This paper takes a step in this direction by considering traffic flow optimization through setting and broadcasting of dynamic and adaptive tolls. Previous tolling schemes were either not adaptive in real-time, not scalable to large networks, or unable to optimize traffic flow over an entire network. Moreover, previous schemes made strong assumptions about observable demands, road capacities, and user homogeneity. This paper introduces Delta-tolling, a novel tolling scheme that is adaptive in real-time and able to scale to large networks. We provide theoretical evidence showing that under certain assumptions Delta-tolling is equal to Marginal-Cost Tolling, which provably leads to system-optimal traffic flow, and empirical evidence showing that Delta-tolling increases social welfare (by up to 33%) in two traffic simulators with markedly different modeling assumptions.
Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation.
Josiah P. Hanna, Peter Stone, Scott Niekum.
Proceedings of the 16th International Conference on Autonomous Agents and Multiagent Systems (AAMAS). May 2017.
Abstract
BibTeX
Paper
For an autonomous agent, executing a poor policy may be costly or even dangerous. For such agents, it is desirable to determine confidence interval lower bounds on the performance of any given policy without executing said policy. Current methods for exact high confidence off-policy evaluation that use importance sampling require a substantial amount of data to achieve a tight lower bound. Existing model-based methods only address the problem in discrete state spaces. Since exact bounds are intractable for many domains we trade off strict guarantees of safety for more data-efficient approximate bounds. In this context, we propose two bootstrapping off-policy evaluation methods which use learned MDP transition models in order to estimate lower confidence bounds on policy performance with limited data in both continuous and discrete state spaces. Since direct use of a model may introduce bias, we derive a theoretical upper bound on model bias for when the model transition function is estimated with i.i.d. trajectories. This bound broadens our understanding of the conditions under which model-based methods have high bias. Finally, we empirically evaluate our proposed methods and analyze the settings in which different bootstrapping off-policy confidence interval methods succeed and fail.
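A minimal sketch of a model-based percentile-bootstrap lower bound in the spirit of this paper, assuming user-supplied `fit_model` and `evaluate_policy` callables; these names and the plain percentile bootstrap are my assumptions, while the paper proposes two specific bootstrapping methods and analyzes the bias introduced by the learned model.

```python
import numpy as np

def bootstrap_lower_bound(trajectories, fit_model, evaluate_policy,
                          num_bootstrap=200, confidence=0.95, rng=None):
    """Illustrative model-based percentile-bootstrap lower bound on policy
    performance (a sketch, not the paper's estimator or API).

    trajectories:    list of off-policy trajectories
    fit_model:       callable that learns an MDP transition/reward model from trajectories
    evaluate_policy: callable that estimates the target policy's performance in a model
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(trajectories)
    estimates = []
    for _ in range(num_bootstrap):
        idx = rng.integers(0, n, size=n)                    # resample trajectories with replacement
        model = fit_model([trajectories[i] for i in idx])   # learned model per bootstrap sample
        estimates.append(evaluate_policy(model))            # e.g., rollouts of the target policy
    # Percentile bootstrap: report the lower (1 - confidence) quantile of the estimates.
    return float(np.quantile(estimates, 1.0 - confidence))
```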
Grounded Action Transformation for Robot Learning in Simulation.
Josiah P. Hanna, Peter Stone.
Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI). February 2017.
Abstract
BibTeX
Paper
Robot learning in simulation is a promising alternative to the prohibitive sample cost of learning in the physical world. Unfortunately, policies learned in simulation often perform worse than hand-coded policies when applied on the physical robot. Grounded simulation learning (GSL) promises to address this issue by altering the simulator to better match the real world. This paper proposes a new algorithm for GSL – Grounded Action Transformation – and applies it to learning of humanoid bipedal locomotion. Our approach results in a 43.27% improvement in forward walk velocity compared to a state-of-the-art hand-coded walk. We further evaluate our methodology in controlled experiments using a second, higher-fidelity simulator in place of the real world. Our results contribute to a deeper understanding of grounded simulation learning and demonstrate its effectiveness for learning robot control policies.
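As a point of reference, the grounded action transformation idea is commonly summarized as modifying each action before it reaches the simulator so that the resulting simulated state change matches what a learned model predicts would happen in the real world; written out (in my notation, as a hedged summary rather than the paper's exact equations):

$$
\hat{a} \;=\; g(s, a) \;=\; f_{\text{sim}}^{-1}\!\big(s,\; f_{\text{real}}(s, a)\big),
$$

where $f_{\text{real}}$ is a forward model of real-world dynamics learned from a small amount of physical-robot data and $f_{\text{sim}}^{-1}$ is an inverse dynamics model of the simulator.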
Fast and Precise Black and White Ball Detection for RoboCup Soccer.
Jacob Menashe, Josh Kelle, Katie Genter,
Josiah P. Hanna, Elad Liebman, Sanmit Narvekar, Ruohan Zhang, Peter Stone.
RoboCup-2017: Robot Soccer World Cup XXI. July 2017.
Abstract
BibTeX
Paper
In 2016, UT Austin Villa claimed the Standard Platform League's second place position at the RoboCup International Robot Soccer Competition in Leipzig, Germany as well as first place at both the RoboCup US Open in Brunswick, USA and the World RoboCup Conference in Beijing, China. This paper describes some of the key contributions that led to the team's victories with a primary focus on our techniques for identifying and tracking black and white soccer balls. UT Austin Villa's ball detection system was overhauled in order to transition from the league's bright orange ball, used every year of the competition prior to 2016, to the truncated icosahedral pattern commonly associated with soccer balls. We evaluated and applied a series of heuristic region-of-interest identification techniques and supervised machine learning methods to produce a ball detector capable of reliably detecting the ball's position with no prior knowledge of the ball's position. In 2016, UT Austin Villa suffered only a single loss which occurred after regulation time during a penalty kick shootout. We attribute much of UT Austin Villa's success in 2016 to our robots' effectiveness at quickly and consistently localizing the ball. In this work we discuss the specifics of UT Austin Villa's ball detector implementation which are applicable to the specific problem of ball detection in RoboCup, as well as to the more general problem of fast and precise object detection in computationally constrained domains. Furthermore we provide empirical analyses of our approach to support the conclusion that modern deep learning techniques can enhance visual recognition tasks even in the face of these computational constraints.
Operations of a Shared, Autonomous, Electric Vehicle Fleet: Implications of Vehicle & Charging Infrastructure Decisions.
T Donna Chen, Kara M Kockelman,
Josiah P. Hanna.
Transportation Research Part A: Policy and Practice. January 2016.
Abstract
BibTeX
Paper
There are natural synergies between shared autonomous vehicle (AV) fleets and electric vehicle (EV) technology, since fleets of AVs resolve the practical limitations of today's non-autonomous EVs, including traveler range anxiety, access to charging infrastructure, and charging time management. Fleet-managed AVs relieve such concerns, managing range and charging activities based on real-time trip demand and established charging-station locations, as demonstrated in this paper. This work explores the management of a fleet of shared autonomous electric vehicles (SAEVs) in a regional, discrete-time, agent-based model. The simulation examines the operation of SAEVs under various vehicle range and charging infrastructure scenarios in a gridded city modeled roughly after the densities of Austin, Texas. Results based on 2009 NHTS trip distance and time-of-day distributions indicate that fleet size is sensitive to battery recharge time and vehicle range, with each 80-mile range SAEV replacing 3.7 privately owned vehicles and each 200-mile range SAEV replacing 5.5 privately owned vehicles, under Level II (240-volt AC) charging. With Level III 480-volt DC fast-charging infrastructure in place, these ratios rise to 5.4 vehicles for the 80-mile range SAEV and 6.8 vehicles for the 200-mile range SAEV. SAEVs can serve 96–98% of trip requests with average wait times between 7 and 10 minutes per trip. However, due to the need to travel while "empty" for charging and passenger pick-up, SAEV fleets are predicted to generate an additional 7.1–14.0% of travel miles. Financial analysis suggests that the combined cost of charging infrastructure, vehicle capital and maintenance, electricity, insurance, and registration for a fleet of SAEVs ranges from $0.42 to $0.49 per occupied mile traveled, which implies SAEV service can be offered at the equivalent per-mile cost of private vehicle ownership for low-mileage households, and thus be competitive with current manually-driven carsharing services and significantly cheaper than on-demand driver-operated transportation services. When Austin-specific trip patterns (with more concentrated trip origins and destinations) are introduced in a final case study, the simulation predicts a decrease in fleet "empty" vehicle-miles (down to 3–4% of all SAEV travel) and average wait times (ranging from 2 to 4 minutes per trip), with each SAEV replacing 5–9 privately owned vehicles.
UT Austin Villa: RoboCup 2015 3D Simulation League Competition and Technical Challenges Champions.
Patrick MacAlpine,
Josiah P. Hanna, Jason Liang, Peter Stone.
RoboCup-2015: Robot Soccer World Cup XIX. July 2016.
Abstract
BibTeX
Paper
The UT Austin Villa team, from the University of Texas at Austin, won the 2015 RoboCup 3D Simulation League, winning all 19 games that the team played. During the course of the competition the team scored 87 goals and conceded only 1. Additionally the team won the RoboCup 3D Simulation League technical challenge by winning each of a series of three league challenges: drop-in player, kick accuracy, and free challenge. This paper describes the changes and improvements made to the team between 2014 and 2015 that allowed it to win both the main competition and each of the league technical challenges.
Minimum Cost Matching for Autonomous Carsharing.
Josiah P. Hanna, Michael Albert, Donna Chen, Peter Stone.
Proceedings of the 9th IFAC Symposium on Intelligent Autonomous Vehicles (IAV 2016). June 2016.
Abstract
BibTeX
Paper
Carsharing programs provide an alternative to private vehicle ownership. Combining carsharing programs with autonomous vehicles would improve user access to vehicles, thereby removing one of the main challenges to wide-scale adoption of these programs. While the ability to easily move cars to meet demand would be significant for carsharing programs, if implemented incorrectly it could lead to worse system performance. In this paper, we seek to improve the performance of a fleet of shared autonomous vehicles through improved matching of vehicles to passengers requesting rides. We consider carsharing with autonomous vehicles as an assignment problem and examine four different methods for matching cars to users in a dynamic setting. We show how applying a recent algorithm (Scalable Collision-avoiding Role Assignment with Minimal-makespan, or SCRAM) for minimizing the maximal edge in a perfect matching can result in a more efficient, reliable, and fair carsharing system. Our results highlight some of the problems with greedy or decentralized approaches. Introducing a centralized system creates the possibility for users to strategically mis-report their locations and improve their expected wait time, so we provide a proof demonstrating that cancellation fees can be applied to eliminate the incentive to mis-report location.
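To make the assignment-problem framing concrete, the sketch below matches idle vehicles to ride requests with a standard minimum-total-cost (Hungarian) matching via SciPy. Note that SCRAM, the algorithm highlighted in the abstract, instead minimizes the maximal edge of the matching, so this min-sum version only illustrates the framing, not SCRAM itself; the function name and distance-based cost are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def assign_vehicles(vehicle_xy, request_xy):
    """Illustrative min-sum assignment of vehicles to ride requests.

    vehicle_xy: array of shape (n_vehicles, 2) with vehicle coordinates
    request_xy: array of shape (n_requests, 2) with pickup coordinates
    Returns the list of (vehicle, request) pairs and the longest pickup distance
    in that matching (the quantity SCRAM would minimize directly).
    """
    cost = cdist(vehicle_xy, request_xy)        # pairwise pickup distances
    rows, cols = linear_sum_assignment(cost)    # minimum-total-cost matching
    return list(zip(rows.tolist(), cols.tolist())), float(cost[rows, cols].max())
```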
Approximation of Lorenz-Optimal Solutions in Multiobjective Markov Decision Processes.
Patrice Perny, Paul Weng, Judy Goldsmith,
Josiah P. Hanna.
Proceedings of the International Conference on Uncertainty in Artificial Intelligence (UAI). July 2013.
Abstract
BibTeX
Paper
This paper is devoted to fair optimization in Multiobjective Markov Decision Processes (MOMDPs). A MOMDP is an extension of the MDP model for planning under uncertainty while trying to optimize several reward functions simultaneously. This applies to multiagent problems when rewards define individual utility functions, or in multicriteria problems when rewards refer to different features. In this setting, we study the determination of policies leading to Lorenz-non-dominated tradeoffs. Lorenz dominance is a refinement of Pareto dominance that was introduced in Social Choice for the measurement of inequalities. In this paper, we introduce methods to efficiently approximate the sets of Lorenz-non-dominated solutions of infinite-horizon, discounted MOMDPs. The approximations are polynomial-sized subsets of those solutions.
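For readers unfamiliar with the ordering, the standard social-choice definition the abstract refers to is: sort each value vector in nondecreasing order, take cumulative sums, and compare those with Pareto dominance,

$$
L(v) = \Big(v_{(1)},\; v_{(1)}+v_{(2)},\; \dots,\; \textstyle\sum_{i=1}^{n} v_{(i)}\Big),
\qquad
v \succ_{\text{Lorenz}} w \;\iff\; L(v) \text{ Pareto-dominates } L(w),
$$

where $v_{(1)} \le \dots \le v_{(n)}$ are the components of $v$ sorted in nondecreasing order.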
The Academic Advising Planning Domain.
Joshua T. Guerin,
Josiah P. Hanna, Libby Ferland, Nicholas Mattei, Judy Goldsmith.
Proceedings of the 3rd Workshop on the International Planning Competition at ICAPS. July 2012.
Abstract
BibTeX
Paper
The International Probabilistic Planning Competition is a leading showcase for fast stochastic planners. The current domains used in the competition have raised challenges that the leading deterministic-planner-based MDP solvers have been able to meet. We argue that in order to continue to raise challenges and match real world applications, domains must be generated that exhibit true stochasticity, multi-valued domain variables, and concurrent actions. In this paper we propose the academic advising domain as a planning competition domain that exhibits these characteristics. We believe that this domain can build upon the success of previous contests in pushing the limits of MDP planning research.