Bold titles indicate peer-reviewed conference papers or journal articles.
Jump to Year:
[2024]
[2023]
[2022]
[2021]
[2020]
[2019]
[2018]
[2017]
[2016]
[2013]
[2012]
SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits.
Subhojyoti Mukherjee, Qiaomin Xie,
Josiah P. Hanna, Robert Nowak.
Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS). May 2024.
Abstract
BibTeX
Arxiv Link
In this paper, we study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a target policy and asked to estimate the expected cumulative reward it will obtain when executed in an environment formalized as a multi-armed bandit. In this paper, we focus on the linear bandit setting with heteroscedastic reward noise. This is the first work that focuses on such an optimal data collection strategy for policy evaluation involving heteroscedastic reward noise in the linear bandit setting. We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting that reduces the MSE of the target policy. We term this policy-weighted least squares estimation and use this formulation to derive the optimal behavior policy for data collection. We then propose a novel algorithm, SPEED (Structured Policy Evaluation Experimental Design), that tracks the optimal behavior policy, and we derive its regret with respect to the optimal behavior policy. Finally, we empirically validate that SPEED leads to policy evaluation with mean squared error comparable to the oracle strategy and significantly lower than simply running the target policy.
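The weighted least squares step at the core of this design fits in a few lines. The following is a minimal sketch of ours, not the authors' code: it assumes the per-arm noise variances are known and that a uniform behavior policy has already collected the data, and all variable names and values are hypothetical.

    # Policy evaluation in a heteroscedastic linear bandit via weighted least
    # squares, with known per-arm noise levels (a simplifying assumption).
    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 3, 500
    arms = rng.normal(size=(8, d))            # known feature vectors of 8 arms
    theta_true = np.array([1.0, -0.5, 0.3])   # unknown reward parameter
    sigma = rng.uniform(0.1, 2.0, size=8)     # heteroscedastic noise levels

    idx = rng.integers(0, 8, size=n)          # uniform behavior policy
    X = arms[idx]
    y = X @ theta_true + rng.normal(size=n) * sigma[idx]

    w = 1.0 / sigma[idx] ** 2                 # weight samples by 1/sigma^2
    A = (X * w[:, None]).T @ X
    theta_hat = np.linalg.solve(A, (X * w[:, None]).T @ y)

    pi_target = np.full(8, 1 / 8)             # hypothetical target policy
    print("estimate:", pi_target @ (arms @ theta_hat),
          "truth:", pi_target @ (arms @ theta_true))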
Understanding When Dynamics-Invariant Data Augmentations Benefit Model-Free Reinforcement Learning Updates.
Nicholas Corrado,
Josiah P. Hanna.
Proceedings of the International Conference on Learning Representations (ICLR). May 2024.
Abstract
BibTeX
Arxiv Link
Recently, data augmentation (DA) has emerged as a method for leveraging domain knowledge to inexpensively generate additional data in reinforcement learning (RL) tasks, often yielding substantial improvements in data efficiency.
While prior work has demonstrated the utility of incorporating augmented data directly into model-free RL updates,
it is not well-understood when a particular DA strategy will improve data efficiency.
In this paper, we seek to identify general aspects of DA responsible for observed learning improvements.
Our study focuses on sparse-reward tasks with dynamics-invariant data augmentation functions, serving as an initial step towards a more general understanding of DA and its integration into RL training.
Experimentally, we isolate three relevant aspects of DA: state-action coverage, reward density, and the number of augmented transitions generated per update (the augmented replay ratio).
From our experiments, we draw two conclusions: (1) increasing state-action coverage often has a much greater impact on data efficiency than increasing reward density, and (2) decreasing the augmented replay ratio substantially improves data efficiency. In fact, certain tasks in our empirical study are solvable only when the replay ratio is sufficiently low.
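The augmented replay ratio can be made concrete with a toy sketch (our construction; augment is a placeholder for a dynamics-invariant function such as translating a goal):

    import random

    def augment(t):
        s, a, r, s2 = t
        return (s, a, r, s2)    # placeholder: a real function would alter s, s2

    buffer = []

    def collect(real_transition, augmented_per_real=1):
        # augmented_per_real is the augmented replay ratio knob; the study
        # found that lower ratios often improve data efficiency.
        buffer.append(real_transition)
        for _ in range(augmented_per_real):
            buffer.append(augment(real_transition))
        return random.sample(buffer, min(32, len(buffer)))  # batch for the update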
Conditional Mutual Information for Disentangled Representations in Reinforcement Learning.
Mhairi Dunion, Trevor McInroe, Kevin Sebastian Luck,
Josiah P. Hanna, Stefano V. Albrecht.
Proceedings of Advances in Neural Information Processing Systems (NeurIPS). December 2023.
Abstract
BibTeX
Reinforcement Learning (RL) environments can produce training data with spurious correlations between features due to the amount of training data or its limited feature coverage. This can lead to RL agents encoding these misleading correlations in their latent representation, preventing the agent from generalising if the correlation changes within the environment or when deployed in the real world. Disentangled representations can improve robustness, but existing disentanglement techniques that minimise mutual information between features require independent features, thus they cannot disentangle correlated features. We propose an auxiliary task for RL algorithms that learns a disentangled representation of high-dimensional observations with correlated features by minimising the conditional mutual information between features in the representation. We demonstrate experimentally, using continuous control tasks, that our approach improves generalisation under correlation shifts, as well as improving the training performance of RL algorithms in the presence of correlated features.
Multi-task Representation Learning for Pure Exploration in Bilinear Bandits.
Subhojyoti Mukherjee, Qiaomin Xie,
Josiah P. Hanna, Robert Nowak.
Proceedings of Advances in Neural Information Processing Systems (NeurIPS). December 2023.
Abstract
BibTeX
We study multi-task representation learning for the problem of pure exploration in bilinear bandits. In bilinear bandits, an action takes the form of a pair of arms from two different entity types and the reward is a bilinear function of the known feature vectors of the arms. In the multi-task bilinear bandit problem, we aim to find optimal actions for multiple tasks that share a common low-dimensional linear representation. The objective is to leverage this characteristic to expedite the process of identifying the best pair of arms for all tasks. We propose the algorithm GOBLIN that uses an experimental design approach to optimize sample allocations for learning the global representation as well as minimize the number of samples needed to identify the optimal pair of arms in individual tasks. To the best of our knowledge, this is the first study to give sample complexity analysis for pure exploration in bilinear bandits with shared representation. Our results demonstrate that by learning the shared representation across tasks, we achieve significantly improved sample complexity compared to the traditional approach of solving tasks independently.
State-Action Similarity-Based Representations for Off-Policy Evaluation.
Brahma S. Pavse,
Josiah P. Hanna.
Proceedings of Advances in Neural Information Processing Systems (NeurIPS). December 2023.
Abstract
BibTeX
In reinforcement learning, off-policy evaluation (OPE) is the problem of estimating the expected return of an evaluation policy given a fixed dataset that was collected by running one or more different policies. One of the more empirically successful algorithms for OPE has been the fitted q-evaluation (FQE) algorithm that uses temporal difference updates to learn an action-value function, which is then used to estimate the expected return of the evaluation policy. Typically, the original fixed dataset is fed directly into FQE to learn the action-value function of the evaluation policy. Instead, in this paper, we seek to enhance the data-efficiency of FQE by first transforming the fixed dataset using a learned encoder, and then feeding the transformed dataset into FQE. To learn such an encoder, we introduce an OPE-tailored state-action behavioral similarity metric, and use this metric and the fixed dataset to learn an encoder that models this metric. Theoretically, we show that this metric allows us to bound the OPE error. Empirically, we show that other state-action similarity metrics lead to representations that cannot represent the action-value function of the evaluation policy, and that our state-action representation method boosts the data-efficiency of FQE and lowers OPE error relative to other OPE-based representation learning methods on challenging OPE tasks.
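For reference, the FQE baseline the paper builds on can be sketched with a linear Q-function on a random tabular MDP; everything below is a toy setup of ours, and the paper's learned encoder would replace the one-hot featurizer phi.

    import numpy as np

    rng = np.random.default_rng(1)
    nS, nA, gamma = 5, 2, 0.9
    P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = next-state probs
    R = rng.normal(size=(nS, nA))
    pi_e = np.full((nS, nA), 1 / nA)                # evaluation policy

    D, s = [], 0                                    # fixed dataset collected by
    for _ in range(2000):                           # a uniform behavior policy
        a = rng.integers(nA)
        s2 = rng.choice(nS, p=P[s, a])
        D.append((s, a, R[s, a], s2))
        s = s2

    def phi(s, a):                                  # one-hot state-action features
        x = np.zeros(nS * nA)
        x[s * nA + a] = 1.0
        return x

    X = np.array([phi(s, a) for s, a, _, _ in D])
    w = np.zeros(nS * nA)
    for _ in range(50):                             # iterated TD-style regression
        y = np.array([r + gamma * sum(pi_e[s2, a2] * (w @ phi(s2, a2))
                                      for a2 in range(nA))
                      for _, _, r, s2 in D])
        w = np.linalg.lstsq(X, y, rcond=None)[0]

    print("FQE estimate of V(0):",
          sum(pi_e[0, a] * (w @ phi(0, a)) for a in range(nA)))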
Temporal Disentanglement of Representations for Improved Generalisation in Reinforcement Learning.
Mhairi Dunion, Trevor McInroe, Kevin Sebastian Luck,
Josiah P. Hanna, Stefano V. Albrecht.
Proceedings of the International Conference on Learning Representations (ICLR). April 2023.
This work previously appeared at the
2022 NeurIPS Workshop on Deep Reinforcement Learning
Abstract
BibTeX
Paper (pdf)
Reinforcement Learning (RL) agents are often unable to generalise well to
environment variations in the state space that were not observed during training.
This issue is especially problematic for image-based RL, where a change in just one
variable, such as the background colour, can change many pixels in the image. The
changed pixels can lead to drastic changes in the agent’s latent representation of
the image, causing the learned policy to fail. To learn more robust representations,
we introduce TEmporal Disentanglement (TED), a self-supervised auxiliary task
that leads to disentangled image representations exploiting the sequential nature
of RL observations. We find empirically that RL algorithms utilising TED as
an auxiliary task adapt more quickly to changes in environment variables with
continued training compared to state-of-the-art representation learning methods.
Since TED enforces a disentangled structure of the representation, our experiments
also show that policies trained with TED generalise better to unseen values of
variables irrelevant to the task (e.g. background colour) as well as unseen values of
variables that affect the optimal policy (e.g. goal positions).
Scaling Marginalized Importance Sampling To High-Dimensional State-Spaces Via State Abstraction.
Brahma Pavse,
Josiah P. Hanna.
Proceedings of the 37th AAAI Conference on Artificial Intelligence (AAAI). February 2023.
A short version of this paper was published at the
Workshop on Offline RL (OfflineRL) at NeurIPS 2022.
Abstract
BibTeX
Paper (pdf)
We consider the problem of off-policy evaluation (OPE) in reinforcement
learning (RL), where the goal is to estimate the performance of an
evaluation policy, πe, using a fixed dataset, D, collected by one or more
policies that may be different from πe. Current OPE algorithms may produce
poor OPE estimates under policy distribution shift, i.e., when the
probability of a particular state-action pair occurring under πe is very
different from the probability of that same pair occurring in D
(Voloshin et al. 2021; Fu et al. 2021). In this work, we propose to improve
the accuracy of OPE estimators by projecting the high-dimensional
state-space into a low-dimensional state-space using concepts from the state
abstraction literature. Specifically, we consider marginalized importance
sampling (MIS) OPE algorithms which compute state-action distribution
correction ratios to produce their OPE estimate. In the original ground
state-space, these ratios may have high variance which may lead to high
variance OPE. However, we prove that in the lower-dimensional abstract
state-space the ratios can have lower variance resulting in lower variance
OPE. We then highlight the challenges that arise when estimating the abstract
ratios from data, identify sufficient conditions to overcome these issues,
and present a minimax optimization problem whose solution yields these abstract ratios. Finally, our empirical evaluation on difficult, high-dimensional state-space OPE tasks shows that the abstract ratios can make MIS OPE estimators achieve lower mean-squared error and be more robust to hyperparameter tuning than the ground ratios.
SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits.
Subhojyoti Mukherjee, Qiaomin Xie,
Josiah P. Hanna, Robert Nowak.
ICML Workshop on the Many Facets of Preference-Based Learning. July 2023.
Abstract
BibTeX
Arxiv Link
In this paper, we study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a target policy and asked to estimate the expected cumulative reward it will obtain when executed in an environment formalized as a multi-armed bandit. In this paper, we focus on the linear bandit setting with heteroscedastic reward noise. This is the first work that focuses on such an optimal data collection strategy for policy evaluation involving heteroscedastic reward noise in the linear bandit setting. We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting that reduces the MSE of the target policy. We term this policy-weighted least squares estimation and use this formulation to derive the optimal behavior policy for data collection. We then propose a novel algorithm, SPEED (Structured Policy Evaluation Experimental Design), that tracks the optimal behavior policy, and we derive its regret with respect to the optimal behavior policy. Finally, we empirically validate that SPEED leads to policy evaluation with mean squared error comparable to the oracle strategy and significantly lower than simply running the target policy.
Guided Data Augmentation for Offline Reinforcement Learning and Imitation Learning.
Nicholas Corrado, Yuxiao Qu, John U. Balis, Adam Labiosa,
Josiah P. Hanna.
Arxiv Pre-print. September 2023.
Abstract
BibTeX
Arxiv Link
Learning from demonstration (LfD) is a popular technique that uses expert demonstrations to learn robot control policies. However, the difficulty in acquiring expert-quality demonstrations limits the applicability of LfD methods: real-world data collection is often costly and the quality of the demonstrations depends greatly on the demonstrator's abilities and safety concerns. A number of works have leveraged data augmentation (DA) to inexpensively generate additional demonstration data, but most DA works generate augmented data in a random fashion and ultimately produce highly suboptimal data. In this work, we propose Guided Data Augmentation (GuDA), a human-guided DA framework that generates expert-quality augmented data. The key insight of GuDA is that while it may be difficult to demonstrate the sequence of actions required to produce expert data, a user can often easily identify when an augmented trajectory segment represents task progress. Thus, the user can impose a series of simple rules on the DA process to automatically generate augmented samples that approximate expert behavior. To extract a policy from GuDA, we use off-the-shelf offline reinforcement learning and behavior cloning algorithms. We evaluate GuDA on a physical robot soccer task as well as simulated D4RL navigation tasks, a simulated autonomous driving task, and a simulated soccer task. Empirically, we find that GuDA enables learning from a small set of potentially suboptimal demonstrations and substantially outperforms a DA strategy that samples augmented data randomly.
On-Policy Policy Gradient Reinforcement Learning Without On-Policy Sampling.
Nicholas Corrado,
Josiah P. Hanna.
Arxiv Pre-print. September 2023.
Abstract
BibTeX
Arxiv Link
On-policy reinforcement learning (RL) algorithms perform policy updates using i.i.d. trajectories collected by the current policy. However, after observing only a finite number of trajectories, on-policy sampling may produce data that fails to match the expected on-policy data distribution. This sampling error leads to noisy updates and data-inefficient on-policy learning. Recent work in the policy evaluation setting has shown that non-i.i.d., off-policy sampling can produce data with lower sampling error than on-policy sampling [Zhong et al. 2022]. Motivated by this observation, we introduce an adaptive, off-policy sampling method to improve the data efficiency of on-policy policy gradient algorithms. Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error by collecting data with a behavior policy that increases the probability of sampling actions that are under-sampled with respect to the current policy. Rather than discarding data from old policies -- as is commonly done in on-policy algorithms -- PROPS uses data collection to adjust the distribution of previously collected data to be approximately on-policy. We empirically evaluate PROPS on both continuous-action MuJoCo benchmark tasks as well as discrete-action tasks and demonstrate that (1) PROPS decreases sampling error throughout training and (2) PROPS improves the data efficiency of on-policy policy gradient algorithms. Our work improves the RL community's understanding of a nuance in the on-policy vs. off-policy dichotomy: on-policy learning requires on-policy data, not on-policy sampling.
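The core sampling idea can be illustrated in a single-state, discrete-action toy (our construction; PROPS itself adapts a parameterized behavior policy with a proximal update):

    import numpy as np

    rng = np.random.default_rng(0)
    pi = np.array([0.5, 0.3, 0.2])                  # current policy
    counts = np.zeros(3)

    for n in range(1, 1001):
        empirical = counts / max(n - 1, 1)          # action frequencies so far
        b = np.clip(pi + 2.0 * (pi - empirical), 1e-6, None)
        b /= b.sum()                                # upweight under-sampled actions
        counts[rng.choice(3, p=b)] += 1

    print("empirical:", counts / 1000, "target:", pi)  # closer than i.i.d. draws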
Tackling Unbounded State Spaces in Continuing Task Reinforcement Learning.
Brahma Pavse, Matthew Zurek, Yudong Chen, Qiaomin Xie,
Josiah P. Hanna.
Arxiv Pre-print. June 2023.
Abstract
BibTeX
Arxiv Link
While deep reinforcement learning (RL) algorithms have been successfully applied to many tasks, their inability to extrapolate and strong reliance on episodic resets inhibits their applicability to many real-world settings. For instance, in stochastic queueing problems, the state space can be unbounded and the agent may have to learn online without the system ever being reset to states the agent has seen before. In such settings, we show that deep RL agents can diverge into unseen states from which they can never recover due to the lack of resets, especially in highly stochastic environments. Towards overcoming this divergence, we introduce a Lyapunov-inspired reward shaping approach that encourages the agent to first learn to be stable (i.e. to achieve bounded cost) and then to learn to be optimal. We theoretically show that our reward shaping technique reduces the rate of divergence of the agent and empirically find that it prevents it. We further combine our reward shaping approach with a weight annealing scheme that gradually introduces optimality and log-transform of state inputs, and find that these techniques enable deep RL algorithms to learn high performing policies when learning online in unbounded state space domains.
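The shaping idea fits in a couple of lines; the particular Lyapunov-like function f and coefficient below are our assumptions, not the paper's exact construction.

    def f(state):
        return float(sum(state))           # e.g., total jobs across all queues

    def shaped_reward(cost, state, next_state, lam=1.0):
        # Original reward is -cost; the shaping term pays a bonus when f shrinks,
        # pushing the agent toward stability before optimality.
        return -cost + lam * (f(state) - f(next_state))

    print(shaped_reward(cost=2.0, state=[4, 3], next_state=[3, 3]))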
Robust On-Policy Sampling for Data-Efficient Policy Evaluation.
Rujie Zhong, Duohan Zhang, Lukas Schäfer, Stefano V. Albrecht,
Josiah P. Hanna.
Proceedings of Advances in Neural Information Processing Systems (NeurIPS). December 2022.
An earlier version of this paper was published at the
Workshop on Offline RL (OfflineRL) at NeurIPS 2021.
Abstract
BibTeX
Paper (pdf)
Reinforcement learning (RL) algorithms are often categorized as either on-policy or off-policy depending on whether they use data from a target policy of interest or from a different behavior policy. In this paper, we study a subtle distinction between on-policy data and on-policy sampling in the context of the RL sub-problem of policy evaluation. We observe that on-policy sampling may fail to match the expected distribution of on-policy data after observing only a finite number of trajectories and this failure hinders data-efficient policy evaluation. Towards improved data-efficiency, we show how non-i.i.d., off-policy sampling can produce data that more closely matches the expected on-policy data distribution and consequently increases the accuracy of the Monte Carlo estimator for policy evaluation. We introduce a method called Robust On-policy Sampling and demonstrate theoretically and empirically that it produces data that converges faster to the expected on-policy distribution compared to on-policy sampling. Empirically, we show that this faster convergence leads to lower mean squared error policy value estimates.
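The distinction can be seen in a toy, one-state sketch (our construction, not the paper's algorithm, which perturbs the sampling policy's parameters): deterministically choosing the most under-sampled action drives the empirical action distribution toward the target policy faster than i.i.d. draws.

    import numpy as np

    pi = np.array([0.6, 0.3, 0.1])
    counts = np.zeros(3)
    for n in range(1, 101):
        a = int(np.argmax(pi - counts / max(n - 1, 1)))  # most under-sampled
        counts[a] += 1
    print(counts / 100)        # matches pi almost exactly after 100 samples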
ReVar: Strengthening Policy Evaluation Via Reduced Variance Sampling.
Subhojyoti Mukherjee,
Josiah P. Hanna, Robert Nowak.
Proceedings of the 38th International Conference on Uncertainty in Artificial Intelligence (UAI). August 2022.
Abstract
BibTeX
Paper (pdf)
This paper studies the problem of data collection for policy evaluation in
Markov decision processes (MDPs). In policy evaluation, we are given a
target policy and asked to estimate the expected cumulative reward it will
obtain in an environment formalized as an MDP. We develop theory for optimal
data collection within the class of tree-structured MDPs by first deriving
an oracle data collection strategy that uses knowledge of the variance of
the reward distributions. We then introduce the Reduced Variance Sampling
(ReVar) algorithm that approximates the oracle strategy when the reward
variances are unknown a priori and bound its sub-optimality compared to the
oracle strategy. Finally, we empirically validate that ReVar leads to policy
evaluation with mean squared error comparable to the oracle strategy and
significantly lower than simply running the target policy.
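In the one-step (bandit) special case, the oracle strategy reduces to a Neyman-style allocation; the sketch below is our simplification of the tree-MDP theory, with made-up numbers.

    import numpy as np

    pi = np.array([0.5, 0.4, 0.1])     # target policy
    sigma = np.array([2.0, 0.5, 1.0])  # per-arm reward std-dev (oracle knowledge)
    n = 300

    # Variance of the stratified estimate sum_a pi(a) * mean_a is
    # sum_a pi(a)^2 sigma(a)^2 / n_a, minimized by n_a ~ pi(a) * sigma(a).
    n_a = np.round(pi * sigma / (pi * sigma).sum() * n).astype(int)
    print("oracle sample counts:", n_a)
    print("resulting variance:", np.sum(pi**2 * sigma**2 / np.maximum(n_a, 1)))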
Simulation-Acquired Latent Action Spaces for Dynamics Generalization.
Nicholas Corrado, Yuxiao Qu,
Josiah P. Hanna.
Proceedings of the 1st Conference on Lifelong Learning Agents (CoLLAs). August 2022.
Video Presentation
Abstract
BibTeX
Paper (pdf)
Deep reinforcement learning has shown incredible promise at training
high-performing agents to solve high-dimensional continuous control tasks in
a particular training environment. However, to be useful in real-world
settings, long-lived agents must perform well across a range of environmental
conditions. Naively applying deep RL to a task where environment conditions
may vary from episode to episode can be data inefficient. To address this
inefficiency, we introduce a method that discovers structure in an agent’s
high-dimensional continuous action space to speed up learning across a range
of environmental conditions. Whereas prior work on finding so-called latent
action spaces requires expert demonstrations or on-task experience, we instead
propose to discover the latent, lower-dimensional action space in a simulated
environment and then transfer the learned action space for training over a
distribution of environments. We evaluate our novel method on randomized
variants of simulated MuJoCo environments and find that, when there is a
lower-dimensional action space to exploit, our method significantly increases
data efficiency. For instance, in the Ant environment, our method reduces the
8-dimensional action space to a 3-dimensional action space and doubles the
median return achieved after a training budget of 2 million timesteps.
Decoupled Reinforcement Learning To Stabilise Intrinsically-Motivated Exploration.
Lukas Schäfer, Filippos Christianos,
Josiah P. Hanna, Stefano V. Albrecht.
Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems (AAMAS). May 2022.
This paper contains material previously published at the
Unsupervised Reinforcement Learning (URL) Workshop at ICML 2021.
Abstract
BibTeX
Paper (pdf)
Intrinsic rewards can improve exploration in reinforcement learning, but the exploration process may suffer from instability caused by non-stationary reward shaping and strong dependency on hyperparameters. In this work, we introduce Decoupled RL (DeRL) as a general framework which trains separate policies for intrinsically-motivated exploration and exploitation. Such decoupling allows DeRL to leverage the benefits of intrinsic rewards for exploration while demonstrating improved robustness and sample efficiency. We evaluate DeRL algorithms in two sparse-reward environments with multiple types of intrinsic rewards. Our results show that DeRL is more robust to varying scale and rate of decay of intrinsic rewards and converges to the same evaluation returns as intrinsically-motivated baselines in fewer interactions. Lastly, we discuss the challenge of distribution shift and show that divergence constraint regularisers can successfully minimise instability caused by divergence of exploration and exploitation policies.
Scaling Marginalized Importance Sampling To High-Dimensional State-Spaces Via State Abstraction.
Brahma Pavse,
Josiah P. Hanna.
NeurIPS Workshop on Offline Reinforcement Learning (OfflineRL). December 2022.
Abstract
BibTeX
Paper (pdf)
We consider the problem of off-policy evaluation (OPE) in reinforcement
learning (RL), where the goal is to estimate the performance of an
evaluation policy, πe, using a fixed dataset, D, collected by one or more
policies that may be different from πe. Current OPE algorithms may produce
poor OPE estimates under policy distribution shift, i.e., when the
probability of a particular state-action pair occurring under πe is very
different from the probability of that same pair occurring in D
(Voloshin et al. 2021; Fu et al. 2021). In this work, we propose to improve
the accuracy of OPE estimators by projecting the high-dimensional
state-space into a low-dimensional state-space using concepts from the state
abstraction literature. Specifically, we consider marginalized importance
sampling (MIS) OPE algorithms which compute state-action distribution
correction ratios to produce their OPE estimate. In the original ground
state-space, these ratios may have high variance which may lead to high
variance OPE. However, we prove that in the lower-dimensional abstract
state-space the ratios can have lower variance resulting in lower variance
OPE. We then highlight the challenges that arise when estimating the abstract
ratios from data, identify sufficient conditions to overcome these issues,
and present a minimax optimization problem whose solution yields these abstract ratios. Finally, our empirical evaluation on difficult, high-dimensional state-space OPE tasks shows that the abstract ratios can make MIS OPE estimators achieve lower mean-squared error and be more robust to hyperparameter tuning than the ground ratios.
Temporal Disentanglement of Representations for Improved Generalisation in Reinforcement Learning.
Mhairi Dunion, Trevor McInroe, Kevin Sebastian Luck,
Josiah P. Hanna, Stefano V. Albrecht.
NeurIPS Workshop on Deep Reinforcement Learning. December 2022.
Abstract
BibTeX
Paper (pdf)
Reinforcement Learning (RL) agents are often unable to generalise well to environment variations in the state space that were not observed during training. This issue is especially problematic for image-based RL, where a change in just one variable, such as the background colour, can change many pixels in the image, which can lead to drastic changes in the agent's latent representation of the image, causing the learned policy to fail. To learn more robust representations, we introduce TEmporal Disentanglement (TED), a self-supervised auxiliary task that leads to disentangled image representations exploiting the sequential nature of RL observations. We find empirically that RL algorithms utilising TED as an auxiliary task adapt more quickly to changes in environment variables with continued training compared to state-of-the-art representation learning methods. Since TED enforces a disentangled structure of the representation, we also find that policies trained with TED generalise better to unseen values of variables irrelevant to the task (e.g. background colour) as well as unseen values of variables that affect the optimal policy (e.g. goal positions).
Multi-agent Databases Via Independent Learning.
Chi Zhang, Olga Papaemmanouil,
Josiah P. Hanna, Aditya Akella.
Proceedings of the 4th International Workshop on Applied AI for Database Systems and Applications. September 2022.
Abstract
BibTeX
Paper (pdf)
Machine learning is rapidly being used in database research to
improve the effectiveness of numerous tasks, including but not limited to
query optimization, workload scheduling, and physical design. Currently,
the research focus has been on replacing a single database component
responsible for one task by its learning-based counterpart. However, query
performance is not simply determined by the performance of a single
component, but by the cooperation of multiple ones. As such, learning-based
database components need to collaborate during both training and execution
in order to develop policies that meet end performance goals. Thus, the
paper attempts to address the question "Is it possible to design a database
consisting of various learned components that cooperatively work to improve
end-to-end query latency?". To answer this question, we introduce MADB
(Multi-Agent DB), a proof-of-concept system that incorporates a learned query
scheduler and a learned query optimizer. MADB leverages a cooperative
multi-agent reinforcement learning approach that allows the two components
to exchange the context of their decisions with each other and
collaboratively work towards reducing the query latency. Preliminary results
demonstrate that MADB can outperform the non-cooperative integration of
learned components.
Importance Sampling in Reinforcement Learning with an Estimated Behavior Policy.
Josiah P. Hanna, Scott Niekum, Peter Stone.
Machine Learning (MLJ). May 2021.
This article contains material that was previously published in an
AAMAS 2019 paper and an
ICML 2019 paper.
Abstract
BibTeX
Paper (pdf)
In reinforcement learning, importance sampling is a widely used method for
evaluating an expectation under the distribution of data of one policy when
the data has in fact been generated by a different policy.
Importance sampling requires computing the likelihood ratio between the action
probabilities of a target policy and those of the data-producing behavior
policy. In this article, we study importance sampling where the behavior
policy action probabilities are replaced by their maximum likelihood
estimates under the observed data.
We show this general technique reduces variance due to sampling error in Monte
Carlo style estimators. We introduce two novel estimators that use this
technique to estimate expected values that arise in the RL literature. We find
that these general estimators reduce the variance of Monte Carlo sampling
methods, leading to faster learning for policy gradient algorithms and more
accurate off-policy policy evaluation. We also provide theoretical analysis
showing that our new estimators are consistent and have asymptotically lower
variance than Monte Carlo estimators.
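In a two-armed bandit the estimator is a few lines; the toy sketch below (ours) contrasts ordinary importance sampling against the estimated-behavior-policy variant, where the behavior probabilities are replaced by empirical frequencies from the same data.

    import numpy as np

    rng = np.random.default_rng(0)
    pi_b = np.array([0.7, 0.3])          # true behavior policy
    pi_e = np.array([0.2, 0.8])          # target policy
    r_mean = np.array([1.0, 2.0])

    a = rng.choice(2, p=pi_b, size=200)
    r = r_mean[a] + rng.normal(size=200)

    w_ois = pi_e[a] / pi_b[a]                        # true behavior probabilities
    pi_b_hat = np.bincount(a, minlength=2) / len(a)  # MLE from the same data
    w_mle = pi_e[a] / pi_b_hat[a]

    print("OIS:", (w_ois * r).mean(), "MLE-IS:", (w_mle * r).mean(),
          "truth:", pi_e @ r_mean)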
Grounded Action Transformation for Sim-to-Real Reinforcement Learning.
Josiah P. Hanna, Sid Desai, Haresh Karnan, Garrett Warnell, Peter Stone.
Machine Learning (MLJ): Special Issue on Reinforcement Learning for Real Life. May 2021.
This article contains material that was previously published in an
AAAI 2017 paper and an
IROS 2020 paper.
Abstract
BibTeX
Paper (pdf)
Reinforcement learning in simulation is a promising alternative to the prohibitive sample cost of reinforcement learning in the physical world.
Unfortunately, policies learned in simulation often perform worse than hand-coded policies when applied on the target, physical system.
Grounded simulation learning (GSL) is a general framework that promises to address this issue by altering the simulator to better match the real world (Farchy et al. 2013). This article introduces a new algorithm for GSL -- Grounded Action Transformation (GAT) -- and applies it to learning control policies for a humanoid robot.
We evaluate our algorithm in controlled experiments where we show it to allow policies learned in simulation to transfer to the real world.
We then apply our algorithm to learning a fast bipedal walk on a humanoid robot and demonstrate a 43.27% improvement in forward walk velocity compared to a state-of-the-art hand-coded walk.
This striking empirical success notwithstanding, further empirical analysis shows that GAT may struggle when the real world has stochastic state transitions.
To address this limitation we generalize GAT to the stochastic GAT (SGAT) algorithm and empirically show that SGAT leads to successful real world transfer in situations where GAT may fail to find a good policy.
Our results contribute to a deeper understanding of grounded simulation learning and demonstrate its effectiveness for applying reinforcement learning to learn robot control policies entirely in simulation.
Interpretable Goal Recognition in the Presence of Occluded Factors for Autonomous Vehicles.
Josiah P. Hanna, Arrasy Rahman, Elliot Fosong, Francisco Eiras, Mihai Dobre, John Redford, Subramanian Ramamoorthy, Stefano V. Albrecht.
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). September 2021.
Abstract
BibTeX
Paper (pdf)
Recognising the goals or intentions of observed vehicles is a key step towards predicting the long-term future behaviour of other agents in an autonomous driving scenario. When there are unseen obstacles or occluded vehicles in a scenario, goal recognition may be confounded by the effects of these unseen entities on the behaviour of observed vehicles. Existing prediction algorithms that assume rational behaviour with respect to inferred goals may fail to make accurate long-horizon predictions because they ignore the possibility that the behaviour is influenced by such unseen entities. We introduce the Goal and Occluded Factor Inference (GOFI) algorithm which bases inference on inverse-planning to jointly infer a probabilistic belief over goals and potential occluded factors. We then show how these beliefs can be integrated into Monte Carlo Tree Search (MCTS). We demonstrate that jointly inferring goals and occluded factors leads to more accurate beliefs with respect to the true world state and allows an agent to safely navigate several scenarios where other baselines take unsafe actions leading to collisions.
A Joint Imitation-Reinforcement Learning Framework for Reduced Baseline Regret.
Sheelabhadra Dey, Sumedh Pendurkar, Guni Sharon,
Josiah P. Hanna.
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). September 2021.
Abstract
BibTeX
Paper (pdf)
In various control task domains, existing controllers provide a baseline level of performance that—though
possibly suboptimal—should be maintained. Reinforcement
learning (RL) algorithms that rely on extensive exploration
of the state and action space can be used to optimize a
control policy. However, fully exploratory RL algorithms may
decrease performance below a baseline level during training.
In this paper, we address the issue of online optimization
of a control policy while minimizing regret with respect to
a baseline policy's performance. We present a joint imitation-reinforcement learning framework, denoted JIRL. The learning
process in JIRL assumes the availability of a baseline policy
and is designed with two objectives in mind (a) training while
leveraging demonstrations from the baseline policy to minimize
regret with respect to the baseline policy, and (b) eventually
surpassing the baseline performance. JIRL addresses these
objectives by initially learning to imitate the baseline policy and
gradually shifting control from the baseline to an RL agent. Experimental results show that JIRL effectively accomplishes the
aforementioned objectives in several, continuous action-space
domains. The results demonstrate that JIRL is comparable
to a state-of-the-art algorithm in its final performance while
incurring significantly lower baseline regret during training.
Moreover, the results show a reduction factor of up to 21
in baseline regret over a trust-region based approach that
guarantees monotonic policy improvement.
Behavior Policy Search for Risk Estimators in RL.
Elita Lobo, Yash Chandak, Dharmashankar Subramanian,
Josiah P. Hanna, Marek Petrik.
NeurIPS Workshop on Safe and Robust Control of Uncertain Systems. December 2021.
Abstract
BibTeX
Paper (pdf)
In real-world sequential decision problems, exploration is expensive, and the risk
of expert decision policies must be evaluated from limited data. In this setting,
Monte Carlo (MC) risk estimators are typically used to estimate the risk of decision
policies. Unfortunately, while these estimators have the desired low bias property,
they often suffer from large variance. In this paper, we consider the problem of
minimizing the asymptotic mean squared error and hence variance of MC risk
estimators. We show that by carefully choosing the data sampling policy (behavior
policy), we can obtain low variance estimates of the risk of any given decision
policy.
Robust On-Policy Data Collection for Data-Efficient Policy Evaluation.
Rujie Zhong,
Josiah P. Hanna, Lukas Schäfer, Stefano V. Albrecht.
NeurIPS Workshop on Offline Reinforcement Learning (OfflineRL). December 2021.
Abstract
BibTeX
Paper (pdf)
This paper considers how to complement offline reinforcement learning (RL) data with additional data collection for the task of policy evaluation. In policy evaluation, the task is to estimate the expected return of an evaluation policy on an environment of interest. Prior work on offline policy evaluation typically only considers a static dataset. We consider a setting where we can collect a small amount of additional data to combine with a potentially larger offline RL dataset. We show that simply running the evaluation policy – on-policy data collection – is sub-optimal for this setting. We then introduce two new data collection strategies for policy evaluation, both of which consider previously collected data when collecting future data so as to reduce distribution shift (or sampling error) in the entire dataset collected. Our empirical results show that compared to on-policy sampling, our strategies produce data with lower sampling error and generally lead to lower mean-squared error in policy evaluation for any total dataset size. We also show that these strategies can start from initial off-policy data, collect additional data, and then use both the initial and new data to produce low mean-squared error policy evaluation without using off-policy corrections.
Safe Evaluation for Offline Learning: Are We Ready To Deploy?
Hager Radi,
Josiah P. Hanna, Peter Stone, Matthew E. Taylor.
NeurIPS Workshop on Deployable Decision Making in Embodied Systems (DDM). December 2021.
Abstract
BibTeX
Paper (pdf)
The world currently offers an abundance of data in multiple domains, from which
we can learn reinforcement learning (RL) policies without further interaction
with the environment. RL agents can learn offline from such data, but
deploying them while learning might be dangerous in domains where safety is
critical. Therefore, it is essential to find a way to estimate how a newly-learned
agent will perform if deployed in the target environment before actually deploying
it and without the risk of overestimating its true performance. To achieve this, we
introduce a framework for safe evaluation of offline learning using approximate
high-confidence off-policy evaluation (HCOPE) to estimate the performance of
offline policies during learning. In our setting, we assume a source of data, which
we split into a train-set, to learn an offline policy, and a test-set, to estimate a
lower-bound on the offline policy using off-policy evaluation with bootstrapping.
A lower-bound estimate tells us how well a newly-learned target policy would
perform before it is deployed in the real environment, and therefore allows us to
decide when to deploy our learned policy.
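The bootstrap lower-bound step can be sketched in a few lines; the importance-sampled returns below are placeholders of ours, not real data.

    import numpy as np

    rng = np.random.default_rng(0)
    is_returns = rng.lognormal(sigma=1.0, size=100)   # placeholder IS returns

    boot_means = np.array([rng.choice(is_returns, size=is_returns.size).mean()
                           for _ in range(2000)])
    print("~95% lower bound:", np.percentile(boot_means, 5))
    print("point estimate:", is_returns.mean())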
Towards Quantum-Secure Authentication and Key Agreement Via Abstract Multi-Agent Interaction.
Ibrahim H. Ahmed,
Josiah P. Hanna, Elliot Fosong, Stefano V. Albrecht.
International Conference on Practical Applications of Agents and Multi-Agent Systems (PAAMS). October 2021.
Abstract
BibTeX
Paper (pdf)
Current methods for authentication and key agreement based on public-key cryptography are vulnerable to quantum computing. We propose a novel approach based on artificial intelligence research in which communicating parties are viewed as autonomous agents which interact repeatedly using their private decision models. Authentication and key agreement are decided based on the agents' observed behaviors during the interaction. The security of this approach rests upon the difficulty of modeling the decisions of interacting agents from limited observations, a problem which we conjecture is also hard for quantum computing. We release PyAMI, a prototype authentication and key agreement system based on the proposed method. We empirically validate our method for authenticating legitimate users while detecting different types of adversarial attacks. Finally, we show how reinforcement learning techniques can be used to train server models which effectively probe a client's decisions to achieve more sample-efficient authentication.
Decoupling Exploration and Exploitation in Reinforcement Learning.
Lukas Schäfer, Filippos Christianos,
Josiah P. Hanna, Stefano V. Albrecht.
ICML Workshop on Unsupervised Reinforcement Learning (URL). July 2021.
Abstract
BibTeX
Paper (pdf)
Intrinsic rewards are commonly applied to improve exploration in reinforcement learning. However, these approaches suffer from instability caused by non-stationary reward shaping and strong dependency on hyperparameters. In this work, we propose Decoupled RL (DeRL) which trains separate policies for exploration and exploitation. DeRL can be applied with on-policy and off-policy RL algorithms. We evaluate DeRL algorithms in two sparse-reward environments with multiple types of intrinsic rewards. We show that DeRL is more robust to scaling and speed of decay of intrinsic rewards and converges to the same evaluation returns as intrinsically motivated baselines in fewer interactions.
RIDM: Reinforced Inverse Dynamics Modeling for Learning From a Single Observed Demonstration.
Brahma S. Pavse, Faraz Torabi,
Josiah P. Hanna, Garrett Warnell, Peter Stone.
IEEE Robotics and Automation Letters. July 2020.
This article was jointly published as an IROS 2020 paper
and previously published at the
Imitation, Intent, and Interaction (I3) Workshop at ICML 2019.
Abstract
BibTeX
Paper (pdf)
Imitation learning has long been an approach to alleviate the tractability
issues that arise in reinforcement learning. However, most literature makes
several assumptions such as access to the expert's actions, availability of
many expert demonstrations, and injection of task-specific domain knowledge
into the learning process. We propose reinforced inverse dynamics modeling
(RIDM), a method of combining reinforcement learning and imitation from
observation (IfO) to perform imitation using a single expert demonstration,
with no access to the expert's actions, and with little task-specific domain
knowledge. Given only a single set of the expert's raw states, such as joint
angles in a robot control task, at each time-step, we learn an inverse
dynamics model to produce the necessary low-level actions, such as torques,
to transition from one state to the next such that the reward from the
environment is maximized. We demonstrate that RIDM outperforms other
techniques when we apply the same constraints on the other methods on six
domains of the MuJoCo simulator and for two different robot soccer tasks for
two experts from the RoboCup 3D simulation league on the SimSpark simulator.
An Imitation From Observation Approach To Transfer Learning with Dynamics Mismatch.
Siddharth Desai, Ishan Durugkar, Haresh Karnan, Garrett Warnell,
Josiah P. Hanna, Peter Stone.
Proceedings of the 33rd Advances in Neural Information Processing Systems (NeurIPS). December 2020.
Abstract
BibTeX
Paper (pdf)
We examine the problem of transferring a policy learned in a source environment to a target environment with different dynamics, particularly in the case where it is critical to reduce the amount of interaction with the target environment during learning. This problem is particularly important in sim-to-real transfer because simulators inevitably model real-world dynamics imperfectly. In this paper, we show that one existing solution to this transfer problem, "grounded action transformation," is closely related to the problem of imitation from observation (IfO): learning behaviors that mimic the observations of behavior demonstrations. After establishing this relationship, we hypothesize that recent state-of-the-art approaches from the IfO literature can be effectively repurposed for grounded transfer learning. To validate our hypothesis we derive a new algorithm, generative adversarial reinforced action transformation (GARAT), based on adversarial imitation from observation techniques. We run experiments in several domains with mismatched dynamics, and find that agents trained with GARAT achieve higher returns in the target environment compared to existing black-box transfer methods.
Reinforced Grounded Action Transformation for Sim-to-Real Transfer.
Haresh Karnan, Sid Desai,
Josiah P. Hanna, Garrett Warnell, Peter Stone.
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). October 2020.
Abstract
BibTeX
Paper (pdf)
Robots can learn to do complex tasks in simulation, but often, learned behaviors fail to transfer well to the real world due to simulator imperfections (the reality gap). Some existing solutions to this sim-to-real problem, such as Grounded Action Transformation (GAT), use a small amount of real world experience to minimize the reality gap by grounding the simulator. While very effective in certain scenarios, GAT is not robust on problems that use complex function approximation techniques to model a policy. In this paper, we introduce Reinforced Grounded Action Transformation (RGAT), a new sim-to-real technique that uses Reinforcement Learning (RL) not only to update the target policy in simulation, but also to perform the grounding step itself. This novel formulation allows for end-to-end training during the grounding step, which, compared to GAT, produces a better grounded simulator. Moreover, we show experimentally in several MuJoCo domains that our approach leads to successful transfer for policies modeled using neural networks.
Stochastic Grounded Action Transformation for Robot Learning in Simulation.
Sid Desai, Haresh Karnan,
Josiah P. Hanna, Garrett Warnell, Peter Stone.
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). October 2020.
Abstract
BibTeX
Paper (pdf)
Robot control policies learned in simulation do not often transfer well to the real world. Many existing solutions to this sim-to-real problem, such as the Grounded Action Transformation (GAT) algorithm, seek to correct for or ground these differences by matching the simulator to the real world. However, the efficacy of these approaches is limited if they do not explicitly account for stochasticity in the target environment. In this work, we analyze the problems associated with grounding a deterministic simulator in a stochastic real world environment, and we present examples where GAT fails to transfer a good policy due to stochastic transitions in the target domain. In response, we introduce the Stochastic Grounded Action Transformation (SGAT) algorithm, which models this stochasticity when grounding the simulator. We find experimentally for both simulated and physical target domains that SGAT can find policies that are robust to stochasticity in the target domain.
Reducing Sampling Error in Batch Temporal Difference Learning.
Brahma Pavse, Ishan Durugkar,
Josiah P. Hanna, Peter Stone.
Proceedings of the 37th International Conference on Machine Learning (ICML). July 2020.
Abstract
BibTeX
Paper (pdf) Slides (pdf)
Temporal difference (TD) learning is one of the main foundations of modern
reinforcement learning. This paper studies the use of TD(0), a canonical
TD algorithm, to estimate the value function of a given policy from a batch
of data. In this batch setting, we show that TD(0) may converge to an
inaccurate value function because the update following an action is weighted
according to the number of times that action occurred in the batch – not the
true probability of the action under the given policy. To address this limitation,
we introduce policy sampling error corrected-TD(0) (PSEC-TD(0)). PSEC-TD(0)
first estimates the empirical distribution of actions in each state in the
batch and then uses importance sampling to correct for the mismatch between
the empirical weighting and the correct weighting for updates following each
action. We refine the concept of a certainty-equivalence estimate and argue
that PSEC-TD(0) is a more data efficient estimator than TD(0) for a fixed batch
of data. Finally, we conduct an empirical evaluation of PSEC-TD(0) on three
batch value function learning tasks, with a hyperparameter sensitivity analysis,
and show that PSEC-TD(0) produces value function estimates with lower mean
squared error than TD(0).
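The correction can be demonstrated on a tiny batch. In the sketch below (our construction, assuming a single state and a uniform target policy), the PSEC weight pi(a|s)/pi_hat(a|s) makes TD(0) converge to the certainty-equivalence value 5.0, whereas uncorrected TD(0) converges to about 6.67 on the same batch.

    import numpy as np
    from collections import Counter

    gamma, alpha = 0.9, 0.1
    pi = {0: {0: 0.5, 1: 0.5}}                     # target policy, one state
    batch = [(0, 0, 1.0, 0), (0, 0, 1.0, 0), (0, 1, 0.0, 0)]  # (s, a, r, s')

    n_sa = Counter((s, a) for s, a, _, _ in batch)
    n_s = Counter(s for s, _, _, _ in batch)
    pi_hat = {sa: c / n_s[sa[0]] for sa, c in n_sa.items()}   # empirical policy

    V = np.zeros(1)
    for _ in range(500):
        for s, a, r, s2 in batch:
            rho = pi[s][a] / pi_hat[(s, a)]        # PSEC correction
            V[s] += alpha * rho * (r + gamma * V[s2] - V[s])
    print(V[0])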
Learning an Interpretable Traffic Signal Control Policy.
James Ault,
Josiah P. Hanna, Guni Sharon.
Proceedings of the 19th International Conference on Autonomous Agents and Multiagent Systems (AAMAS). May 2020.
Abstract
BibTeX
Paper (pdf)
Signalized intersections are managed by controllers that assign right of way
(green, yellow, and red lights) to non-conflicting directions. Optimizing the
actuation policy of such controllers is expected to alleviate traffic
congestion and its adverse impact. Given such a safety-critical domain, the
affiliated actuation policy is required to be interpretable in a way that can
be understood and regulated by a human. This paper presents and analyzes
several on-line optimization techniques for tuning interpretable control
functions. Although these techniques are defined in a general way, this paper
assumes a specific class of interpretable control functions (polynomial functions)
for analysis purposes. We show that such an interpretable policy function can
be as effective as a deep neural network for approximating an optimized signal
actuation policy. We present empirical evidence that supports the use of
value-based reinforcement learning for on-line training of the control function.
Specifically, we present and study three variants of the Deep Q-learning algorithm
that allow the training of an interpretable policy function. Our Deep Regulatable
Hardmax Q-learning variant is shown to be particularly effective in optimizing
our interpretable actuation policy, resulting in up to 19.4% reduced vehicle
delay compared to commonly deployed actuated signal controllers.
On Sampling Error in Batch Action-Value Prediction Algorithms.
Brahma S. Pavse,
Josiah P. Hanna, Ishan Durugkar, Peter Stone.
In the Offline Reinforcement Learning Workshop at Neural Information Processing Systems (NeurIPS). December 2020.
5-mins Video Presentation
Abstract
BibTeX
Paper (pdf)
Estimating a policy's action-values is a fundamental aspect of reinforcement
learning. In this work, we study the application of TD methods for learning
action-values in an offline setting with a fixed batch of data. Motivated by
recent work (Pavse et al. 2020), we observe that a fixed batch of offline
data may contain two forms of distribution shift: the data may be collected
from a different behavior policy than the target policy (off-policy data)
and the empirical distribution of the data may differ from the sampling
distribution of the data (sampling error). In this work, we focus
on the second problem by analyzing the sampling error that arises due to
variance in sampling from a finite-sized batch of data in the
on-policy offline RL setting. We study how action-value learning
algorithms suffer from this sampling error by considering their so-called
certainty-equivalence estimates (Sutton 1988; Pavse et al. 2020).
We prove that each algorithm uses its certainty-equivalence estimates of the
policy and transition dynamics to converge to its respective fixed-point. We
then empirically evaluate each algorithm's performance by measuring the
mean-squared value error on Gridworld. Ultimately, we find that by reducing
sampling error, an algorithm can produce significantly more accurate
action-value estimates.
Importance Sampling Policy Evaluation with an Estimated Behavior Policy.
Josiah P. Hanna, Scott Niekum, Peter Stone.
Proceedings of the 36th International Conference on Machine Learning (ICML). June 2019.
Abstract
BibTeX
Paper (pdf) Slides (pdf)
We consider the problem of off-policy evaluation in Markov decision processes. Off-policy evaluation is the task of evaluating the expected return of one policy with data generated by a different, behavior policy. Importance sampling is a technique for off-policy evaluation that re-weights off-policy returns to account for differences in the likelihood of the returns between the two policies. In this paper, we study importance sampling with an estimated behavior policy where the behavior policy estimate comes from the same set of data used to compute the importance sampling estimate. We find that this estimator often lowers the mean squared error of off-policy evaluation compared to importance sampling with the true behavior policy or using a behavior policy that is estimated from a separate data set. Intuitively, estimating the behavior policy in this way corrects for error due to sampling in the action-space. Our empirical results also extend to other popular variants of importance sampling and show that estimating a non-Markovian behavior policy can further lower large-sample mean squared error even when the true behavior policy is Markovian.
Reducing Sampling Error in the Monte Carlo Policy Gradient Estimator.
Josiah P. Hanna, Peter Stone.
Proceedings of the 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS). May 2019.
This paper contains material that was previously presented at the 2018 NeurIPS Deep Reinforcement Learning Workshop.
Abstract
BibTeX
Paper (pdf) Slides (pdf)
This paper studies a class of reinforcement learning algorithms known as policy gradient methods. Policy gradient methods optimize the performance of a policy by estimating the gradient of the expected return with respect to the policy parameters. One of the core challenges of applying policy gradient methods is obtaining an accurate estimate of this gradient. Most policy gradient methods rely on Monte Carlo sampling to estimate this gradient. When only a limited number of environment steps can be collected, Monte Carlo policy gradient estimates may suffer from sampling error -- samples receive more or less weight than they will in expectation. In this paper, we introduce the Sampling Error Corrected policy gradient estimator that corrects the inaccurate Monte Carlo weights. Our approach treats the observed data as if it were generated by a different policy than the policy that actually generated the data. It then uses importance sampling between the two -- in the process correcting the inaccurate Monte Carlo weights. Under a limiting set of assumptions we can show that this gradient estimator will have lower variance than the Monte Carlo gradient estimator. We show experimentally that our approach improves the learning speed of two policy gradient methods compared to standard Monte Carlo sampling even when the theoretical assumptions fail to hold.
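In a single-state softmax example (our simplification), the corrected weights pi(a)/pi_hat(a) make the batch gradient equal the exact expected gradient, since each action's empirical weight is replaced by its true probability.

    import numpy as np

    rng = np.random.default_rng(0)
    theta = np.zeros(2)                              # softmax policy logits

    def pi(theta):
        e = np.exp(theta - theta.max())
        return e / e.sum()

    p = pi(theta)
    a = rng.choice(2, p=p, size=50)
    R = np.where(a == 0, 1.0, 2.0)                   # toy returns

    p_hat = np.bincount(a, minlength=2) / len(a)     # empirical frequencies
    w = p[a] / p_hat[a]                              # sampling error correction

    grad_log = np.eye(2)[a] - p                      # grad log pi for softmax
    g_mc = np.mean(grad_log * R[:, None], axis=0)
    g_sec = np.mean(w[:, None] * grad_log * R[:, None], axis=0)
    print("MC grad:", g_mc, "SEC grad:", g_sec)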
Selecting Compliant Agents for Opt-in Microtolling.
Josiah P. Hanna, Guni Sharon, Stephen D. Boyles, Peter Stone.
Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI). January 2019.
Abstract
BibTeX
Paper (pdf) Poster (pdf)
This paper examines the impact of tolls on social welfare in the context of a
transportation network in which only a portion of the agents are subject to
tolls. More specifically, this paper addresses the question: which subset of
agents provides the most system benefit if they are compliant with an approximate
marginal cost tolling scheme? Since previous work suggests this problem is NP-hard,
we examine a heuristic approach. Our experimental results on three real-world
traffic scenarios suggest that evaluating the marginal impact of a given agent
serves as a particularly strong heuristic for selecting an agent to be
compliant. Using this heuristic to select 7.6% of the agents to be compliant
achieved an increase of up to 10.9% in social welfare over not tolling at
all. The presented heuristic approach and conclusions can help
practitioners target specific agents to participate in an opt-in tolling scheme.
RIDM: Reinforced Inverse Dynamics Modeling for Learning From a Single Observed Demonstration.
Brahma S. Pavse, Faraz Torabi,
Josiah P. Hanna, Garrett Warnell, Peter Stone.
Imitation, Intent, and Interaction (I3) Workshop at ICML 2019. June 2019.
Abstract
BibTeX
Paper (pdf)
Imitation learning has long been an approach to alleviate the tractability
issues that arise in reinforcement learning. However, most literature makes
several assumptions such as access to the expert's actions, availability of
many expert demonstrations, and injection of task-specific domain knowledge
into the learning process. We propose reinforced inverse dynamics modeling
(RIDM), a method of combining reinforcement learning and imitation from
observation (IfO) to perform imitation using a single expert demonstration,
with no access to the expert's actions, and with little task-specific domain
knowledge. Given only the expert's raw states at each time step, such as
joint angles in a robot control task, we learn an inverse dynamics model to
produce the low-level actions, such as torques, needed to transition from
one state to the next while maximizing the reward from the environment. We
demonstrate that, under the same constraints on all methods, RIDM
outperforms other techniques on six MuJoCo simulator domains and on two
robot soccer tasks, each with two experts from the RoboCup 3D simulation
league, in the SimSpark simulator.
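The control loop is compact. A hedged sketch, where the toy environment, the linear inverse model, and the random-search optimizer are all assumptions for illustration (the paper's own choices differ):

import numpy as np

class Point1D:
    """Toy 1-D environment: state is (position, velocity); action is a
    force; reward is forward progress."""
    def reset(self):
        self.x, self.v = 0.0, 0.0
        return np.array([self.x, self.v])
    def step(self, a):
        self.v += float(a)
        self.x += self.v
        return np.array([self.x, self.v]), self.v, False, {}

def linear_inverse(params, s, s_target):
    """Assumed inverse dynamics model: a linear map from (s, s_target)."""
    return params @ np.concatenate([s, s_target])

def rollout(env, expert_states, params):
    """Track the expert's state sequence; return the environment return."""
    s, total = env.reset(), 0.0
    for s_target in expert_states[1:]:
        s, r, _, _ = env.step(linear_inverse(params, s, s_target))
        total += r
    return total

def ridm(env, expert_states, dim=4, iters=200, sigma=0.2, seed=0):
    """Black-box (random-search) optimization of the inverse model's
    parameters against environment reward."""
    rng = np.random.default_rng(seed)
    best, best_r = np.zeros(dim), rollout(env, expert_states, np.zeros(dim))
    for _ in range(iters):
        cand = best + sigma * rng.standard_normal(dim)
        r = rollout(env, expert_states, cand)
        if r > best_r:
            best, best_r = cand, r
    return best, best_r

expert = [np.array([t, 1.0]) for t in range(10)]  # expert advances +1 per step
params, ret = ridm(Point1D(), expert)
print(ret)  # approaches 9.0 as the inverse model learns to track the expert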
DyETC: Dynamic Electronic Toll Collection for Traffic Congestion Alleviation.
Haipeng Chen, Bo An, Guni Sharon,
Josiah P. Hanna, Peter Stone, Chunyan Miao, Yeng Chai Soh.
Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI). February 2018.
Abstract
BibTeX
Paper (pdf)
To alleviate traffic congestion in urban areas, electronic toll collection
(ETC) systems are deployed all over the world. Despite their merits, tolls
are usually pre-determined and fixed from day to day; they fail to
account for traffic dynamics and thus have limited regulatory effect when
traffic conditions are abnormal. In this paper, we propose a novel
dynamic ETC (DyETC) scheme that adjusts tolls to traffic conditions in
real time. The DyETC problem is formulated as a Markov decision process
(MDP), the solution of which is very challenging due to its 1)
multi-dimensional state space, 2) multi-dimensional, continuous and
bounded action space, and 3) time-dependent state and action values. Due
to the complexity of the formulated MDP, existing methods cannot be
applied to our problem. Therefore, we develop a novel algorithm,
PG-$\beta$, which makes three improvements to the traditional policy
gradient method: 1) time-dependent value and policy functions, 2) a
Beta-distribution policy function, and 3) state abstraction. Experimental
results show that, compared with existing ETC schemes, DyETC increases
traffic volume by around $8\%$, and reduces travel time by around
$14.6\%$ during rush hour. Considering the total traffic volume in a
traffic network, this amounts to a substantial increase in social
welfare.
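Of the three ingredients, the Beta-distribution policy is the easiest to illustrate: mapping state features to Beta parameters keeps the sampled toll inside a bounded interval by construction. A hedged sketch, where the linear features and softplus mapping are assumptions for illustration:

import numpy as np
from scipy.stats import beta as beta_dist

def beta_policy(theta, state, toll_max=5.0, seed=None):
    """Sample a toll in [0, toll_max] and return its log-density."""
    z = theta @ state                    # two linear heads (alpha, beta)
    a, b = np.log1p(np.exp(z)) + 1.0     # softplus + 1 keeps parameters > 1
    x = beta_dist.rvs(a, b, random_state=seed)
    logp = beta_dist.logpdf(x, a, b) - np.log(toll_max)  # change of variables
    return toll_max * x, logp

theta = np.zeros((2, 3))                 # 2 heads x 3 state features
toll, logp = beta_policy(theta, np.array([1.0, 0.2, -0.5]), seed=0)
print(toll, logp)                        # the toll is bounded in [0, 5]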
Towards a Data Efficient Off-Policy Policy Gradient.
Josiah P. Hanna, Peter Stone.
AAAI Spring Symposium on Data Efficient Reinforcement Learning. March 2018.
Abstract
BibTeX
Paper (pdf)
The ability to learn from off-policy data -- data generated from past interaction with the environment -- is essential to data-efficient reinforcement learning. Recent work has shown that the use of off-policy data not only allows the re-use of data but can even improve performance relative to on-policy reinforcement learning. In this work, we investigate whether a recently proposed method for learning a better data-generation policy, commonly called a behavior policy, can also increase the data efficiency of policy gradient reinforcement learning. Empirical results demonstrate that with an appropriately selected behavior policy we can estimate the policy gradient more accurately. The results also motivate further work on methods for adapting the behavior policy as the learned policy changes.
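For reference, the basic off-policy (importance-weighted) policy-gradient estimate looks as follows; this is the plain REINFORCE form with assumed callable arguments, not any specific method from the paper.

import numpy as np

def off_policy_gradient(trajs, logp_target, logp_behavior, grad_logp_target):
    """Average over behavior-policy trajectories of
    (importance weight) x (return) x (score of the target policy)."""
    g = 0.0
    for states, actions, rewards in trajs:
        logw = sum(logp_target(s, a) - logp_behavior(s, a)
                   for s, a in zip(states, actions))
        score = sum(grad_logp_target(s, a) for s, a in zip(states, actions))
        g += np.exp(logw) * sum(rewards) * score
    return g / len(trajs)

When the behavior policy equals the target policy, every weight is one and this reduces to the on-policy Monte Carlo estimator.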
Network-wide Adaptive Tolling for Connected and Automated Vehicles.
Guni Sharon, Michael W. Levin,
Josiah P. Hanna, Tarun Rambha, Stephen D. Boyles, Peter Stone.
Transportation Research Part C. September 2017.
This article contains material that was previously published in an
AAMAS 2017 paper.
Abstract
BibTeX
Paper (pdf)
This article proposes Delta-tolling, a simple adaptive pricing scheme that requires only travel
time observations and two tuning parameters. These tolls are applied throughout a road
network, and can be updated as frequently as travel time observations are made.
Notably, Delta-tolling requires no details of the traffic flow or travel demand models
beyond travel time observations, rendering it easy to apply in real time. The flexibility
of this tolling scheme is demonstrated in three specific traffic modeling contexts with varying
traffic flow and user behavior assumptions: a day-to-day pricing model using static
network equilibrium with link delay functions; a within-day adaptive pricing model using
the cell transmission model and dynamic routing of vehicles; and a microsimulation of
reservation-based intersection control for connected and autonomous vehicles with myopic
routing. In all cases, Delta-tolling produces significant benefits over the no-toll case, measured
in terms of average travel time and social welfare, while only requiring two
parameters to be tuned. Some optimality results are also given for the special case of the
static network equilibrium model with BPR-style delay functions.
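The update itself fits in a few lines. A hedged sketch of one common reading of the scheme -- the exact functional form should be checked against the article, and beta and R below stand in for its two tuning parameters:

def update_tolls(tolls, travel_time, free_flow_time, beta=4.0, R=0.1):
    """One toll update per new travel-time observation: the toll on each
    link tracks beta times the gap between observed and free-flow travel
    time, smoothed at rate R."""
    return {link: R * beta * (travel_time[link] - free_flow_time[link])
                  + (1 - R) * tolls[link]
            for link in tolls}

tolls = {"e1": 0.0, "e2": 0.0}
tolls = update_tolls(tolls, travel_time={"e1": 12.0, "e2": 5.0},
                     free_flow_time={"e1": 10.0, "e2": 5.0})
print(tolls)  # congested link e1 picks up a toll; free-flowing e2 does not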
Data-Efficient Policy Evaluation Through Behavior Policy Search.
Josiah P. Hanna, Philip Thomas, Peter Stone, Scott Niekum.
Proceedings of the 34th International Conference on Machine Learning (ICML). August 2017.
Abstract
BibTeX
Paper (pdf) Slides (pdf) Poster (pdf)
We consider the task of evaluating a policy for a Markov decision process
(MDP). The standard unbiased technique for evaluating a policy is to deploy
the policy and observe its performance. We show that the data collected from
deploying a different policy, commonly called the behavior policy, can be
used to produce unbiased estimates with lower mean squared error than this
standard technique. We derive an analytic expression for the optimal behavior
policy---the behavior policy that minimizes the mean squared error of the
resulting estimates. Because this expression depends on terms that are
unknown in practice, we propose a novel policy evaluation sub-problem,
behavior policy search: searching for a behavior policy that reduces mean
squared error. We present a behavior policy search algorithm and empirically
demonstrate its effectiveness in lowering the mean squared error of policy
performance estimates.
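A hedged sketch of the estimator this search builds on; the second-moment objective shown is a simplification for illustration, not the paper's exact algorithm:

import numpy as np

def is_estimate(traj, logp_eval, logp_behavior):
    """Unbiased estimate of the evaluation policy's return from one
    trajectory collected under the behavior policy."""
    states, actions, rewards = traj
    logw = sum(logp_eval(s, a) - logp_behavior(s, a)
               for s, a in zip(states, actions))
    return np.exp(logw) * sum(rewards)

def bps_objective(trajs, logp_eval, logp_behavior):
    """Empirical second moment of the estimates. Because the estimator is
    unbiased for any behavior policy, lowering this over the behavior
    policy's parameters lowers the mean squared error."""
    ests = [is_estimate(t, logp_eval, logp_behavior) for t in trajs]
    return float(np.mean(np.square(ests)))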
Real-time Adaptive Tolling Scheme for Optimized Social Welfare in Traffic Networks.
Guni Sharon,
Josiah P. Hanna, Tarun Rambha, Michael W. Levin, Michael Albert, Stephen D. Boyles, Peter Stone.
Proceedings of the 16th International Conference on Autonomous Agents and Multiagent Systems (AAMAS-2017). May 2017.
Based on a paper that appeared in the ATT 2016 workshop:
Ninth International Workshop on Agents in Traffic and Transportation
Abstract
BibTeX
Paper (pdf)
Connected and autonomous vehicle technology has advanced rapidly in recent
years. These technologies create possibilities for advanced AI-based
traffic management techniques. Developing such techniques is an important
challenge and opportunity for the AI community as it requires synergy
between experts in game theory, multiagent systems, behavioral science,
and flow optimization. This paper takes a step in this direction by
considering traffic flow optimization through setting and broadcasting of
dynamic and adaptive tolls. Previous tolling schemes were either not
adaptive in real time, not scalable to large networks, or did not
optimize traffic flow over an entire network. Moreover, previous schemes
made strong assumptions about observable demands, road capacities, and
user homogeneity. This paper introduces Delta-tolling, a novel tolling
scheme that is adaptive in real time and able to scale to large networks.
We provide theoretical evidence showing that, under certain assumptions,
Delta-tolling is equal to Marginal-Cost Tolling, which provably leads to
system-optimal traffic flow, and empirical evidence showing that Delta-tolling
increases social welfare (by up to 33\%) in two traffic simulators with
markedly different modeling assumptions.
Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation.
Josiah P. Hanna, Peter Stone, Scott Niekum.
Proceedings of the 16th International Conference on Autonomous Agents and Multiagent Systems (AAMAS). May 2017.
Abstract
BibTeX
Paper (pdf) Slides (pdf) Poster (pdf)
For an autonomous agent, executing a poor policy may be costly or even
dangerous. For such agents, it is desirable to determine confidence
interval lower bounds on the performance of any given policy without
executing said policy. Current methods for exact high-confidence off-policy
evaluation that use importance sampling require a substantial amount of
data to achieve a tight lower bound. Existing model-based methods only
address the problem in discrete state spaces. Since exact bounds are
intractable for many domains, we trade off strict guarantees of safety for
more data-efficient approximate bounds. In this context, we propose two
bootstrapping off-policy evaluation methods which use learned MDP
transition models in order to estimate lower confidence bounds on policy
performance with limited data in both continuous and discrete state spaces.
Since direct use of a model may introduce bias, we derive a theoretical
upper bound on model bias for the case in which the transition model is
estimated from i.i.d. trajectories. This bound broadens our understanding
of the conditions under which model-based methods have high bias. Finally,
we empirically evaluate our proposed methods and analyze the settings in
which different bootstrapping off-policy confidence interval methods
succeed and fail.
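A minimal sketch of the percentile-bootstrap construction, where `estimate` stands in for a model-based off-policy estimate fit to each resampled set of trajectories (an assumption for illustration):

import numpy as np

def bootstrap_lower_bound(trajectories, estimate, n_boot=2000,
                          confidence=0.95, seed=None):
    """Percentile-bootstrap lower confidence bound on an off-policy
    performance estimate."""
    rng = np.random.default_rng(seed)
    n = len(trajectories)
    boot = [estimate([trajectories[i] for i in rng.integers(0, n, n)])
            for _ in range(n_boot)]
    return np.percentile(boot, 100 * (1 - confidence))

# Toy usage: here "estimate" is just a mean return; in the paper's setting
# it would be a model-based estimate fit to the resampled trajectories.
returns = list(np.random.default_rng(0).normal(1.0, 0.5, size=50))
print(bootstrap_lower_bound(returns, np.mean, seed=1))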
Grounded Action Transformation for Robot Learning in Simulation.
Josiah P. Hanna, Peter Stone.
Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI). February 2017.
Abstract
BibTeX
Paper (pdf) Slides (pdf) Poster (pdf)
Robot learning in simulation is a promising alternative to the prohibitive sample cost of learning in the physical world. Unfortunately, policies learned in simulation often perform worse than hand-coded policies when applied on the physical robot. Grounded simulation learning (GSL) promises to address this issue by altering the simulator to better match the real world. This paper proposes a new algorithm for GSL -- Grounded Action Transformation -- and applies it to learning humanoid bipedal locomotion. Our approach results in a 43.27\% improvement in forward walk velocity compared to a state-of-the-art hand-coded walk. We further evaluate our methodology in controlled experiments using a second, higher-fidelity simulator in place of the real world. Our results contribute to a deeper understanding of grounded simulation learning and demonstrate its effectiveness for learning robot control policies.
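The core transformation is small enough to state directly. A hedged sketch, where forward_real (a learned forward model of the real robot) and inverse_sim (a learned inverse model of the simulator) are assumed stubs:

def grounded_action(s, a, forward_real, inverse_sim):
    """g(s, a) = f_sim^{-1}(s, f_real(s, a)): execute in simulation the
    action whose simulated outcome matches the predicted real outcome."""
    s_next_pred = forward_real(s, a)    # where the real robot would end up
    return inverse_sim(s, s_next_pred)  # sim action reaching that state

# Toy check with 1-D linear dynamics: the "real" gain is 0.8, the
# simulator's is 1.0, so grounded actions are scaled down to compensate.
forward_real = lambda s, a: s + 0.8 * a
inverse_sim = lambda s, s_next: s_next - s   # exact inverse of s' = s + a
print(grounded_action(0.0, 1.0, forward_real, inverse_sim))  # 0.8

In the toy check, the grounded simulator under-responds exactly as the "real" dynamics would, which is the grounding property GAT aims for.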
Fast and Precise Black and White Ball Detection for RoboCup Soccer.
Jacob Menashe, Josh Kelle, Katie Genter,
Josiah P. Hanna, Elad Liebman, Sanmit Narvekar, Ruohan Zhang, Peter Stone.
RoboCup-2017: Robot Soccer World Cup XXI. July 2017.
Abstract
BibTeX
Paper (pdf)
In 2016, UT Austin Villa claimed the Standard Platform League's second place
position at the RoboCup International Robot Soccer Competition in Leipzig,
Germany as well as first place at both the RoboCup US Open in Brunswick, USA
and the World RoboCup Conference in Beijing, China. This paper describes some
of the key contributions that led to the team's victories with a primary
focus on our techniques for identifying and tracking black and white soccer
balls. UT Austin Villa's ball detection system was overhauled in order to
transition from the league's bright orange ball, used every year of the
competition prior to 2016, to the truncated icosahedral pattern commonly
associated with soccer balls.
We evaluated and applied a series of heuristic region-of-interest
identification techniques and supervised machine learning methods to produce a
ball detector capable of reliably detecting the ball's position with no prior
knowledge of the ball's position. In 2016, UT Austin Villa suffered only a
single loss, which occurred after regulation time during a penalty-kick
shootout. We attribute much of UT Austin Villa's success in 2016 to our robots'
effectiveness at quickly and consistently localizing the ball.
In this work, we discuss the specifics of UT Austin Villa's ball detector
implementation, which are applicable both to the problem of ball detection
in RoboCup and to the more general problem of fast and precise object
detection in computationally constrained domains. Furthermore, we provide
empirical analyses of our approach to support the conclusion that modern deep
learning techniques can enhance visual recognition tasks even in the face of
these computational constraints.
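The detector's two-stage structure can be sketched as follows; the stub proposer and scorer are assumptions for illustration, standing in for the heuristics and learned models described in the paper.

def detect_ball(frame, propose_rois, score_roi, threshold=0.5):
    """Two-stage detection: cheap region proposals first, then a more
    expensive classifier scored only on the candidates."""
    best, best_score = None, threshold
    for roi in propose_rois(frame):   # fast heuristics: many candidates
        s = score_roi(frame, roi)     # slow model: few invocations
        if s > best_score:
            best, best_score = roi, s
    return best

# Toy usage: regions are grid cells; the stand-in "classifier" prefers
# brighter cells.
frame = {(0, 0): 0.1, (1, 0): 0.9, (0, 1): 0.4}
print(detect_ball(frame, lambda f: list(f), lambda f, r: f[r]))  # (1, 0)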
Operations of a Shared, Autonomous, Electric Vehicle Fleet: Implications of Vehicle \& Charging Infrastructure Decisions.
T Donna Chen, Kara M Kockelman,
Josiah P. Hanna.
Transportation Research Part A: Policy and Practice. January 2016.
Official version from
Publisher's Webpage
Abstract
BibTeX
Paper (pdf)
There are natural synergies between shared autonomous vehicle (AV) fleets and
electric vehicle (EV) technology, since fleets of AVs resolve the practical
limitations of today's non-autonomous EVs, including traveler range anxiety,
access to charging infrastructure, and charging time management. Fleet-managed
AVs relieve such concerns, managing range and charging activities based on
real-time trip demand and established charging-station locations, as
demonstrated in this paper. This work explores the management of a fleet of
shared autonomous electric vehicles (SAEVs) in a regional, discrete-time,
agent-based model. The simulation examines the operation of SAEVs under
various vehicle range and charging infrastructure scenarios in a gridded city
modeled roughly after the densities of Austin, Texas.
Results based on 2009 NHTS trip distance and time-of-day distributions
indicate that fleet size is sensitive to battery recharge time and vehicle
range, with each 80-mile range SAEV replacing 3.7 privately owned vehicles
and each 200-mile range SAEV replacing 5.5 privately owned vehicles, under
Level II (240-volt AC) charging. With Level III 480-volt DC fast-charging
infrastructure in place, these ratios rise to 5.4 vehicles for the 80-mile
range SAEV and 6.8 vehicles for the 200-mile range SAEV. SAEVs can serve
96 -- 98\% of trip requests with average wait times between 7 and 10
minutes per trip. However, due to the need to travel while ``empty'' for
charging and passenger pick-up, SAEV fleets are predicted to generate an
additional 7.1 -- 14.0\% of travel miles. Financial analysis suggests that
the combined cost of charging infrastructure, vehicle capital and maintenance,
electricity, insurance, and registration for a fleet of SAEVs ranges from
$0.42 to $0.49 per occupied mile traveled, which implies SAEV service can be
offered at the equivalent per-mile cost of private vehicle ownership for
low-mileage households, and thus be competitive with current manually-driven
carsharing services and significantly cheaper than on-demand driver-operated
transportation services. When Austin-specific trip patterns (with more
concentrated trip origins and destinations) are introduced in a final case
study, the simulation predicts a decrease in fleet empty vehicle-miles
(down to 3 -- 4\% of all SAEV travel) and average wait times (ranging
from 2 to 4 minutes per trip), with each SAEV replacing 5 -- 9 privately
owned vehicles.
UT Austin Villa: RoboCup 2015 3D Simulation League Competition and Technical Challenges Champions.
Patrick MacAlpine,
Josiah P. Hanna, Jason Liang, Peter Stone.
RoboCup-2015: Robot Soccer World Cup XIX. July 2016.
Accompanying videos at
http://www.cs.utexas.edu/~AustinVilla/sim/3dsimulation/#2015
Abstract
BibTeX
Paper (pdf)
The UT Austin Villa team, from the University of Texas at Austin, won the
2015 RoboCup 3D Simulation League, winning all 19 games that the team played.
During the course of the competition, the team scored 87 goals and conceded
only one. Additionally, the team won the RoboCup 3D Simulation League technical
challenge by winning each of a series of three league challenges: drop-in
player, kick accuracy, and free challenge. This paper describes the changes and
improvements made to the team between 2014 and 2015 that allowed it to win both
the main competition and each of the league technical challenges.
Minimum Cost Matching for Autonomous Carsharing.
Josiah P. Hanna, Michael Albert, Donna Chen, Peter Stone.
Proceedings of the 9th IFAC Symposium on Intelligent Autonomous Vehicles (IAV 2016). June 2016.
Abstract
BibTeX
Paper (pdf) Slides (pdf)
Carsharing programs provide an alternative to private vehicle ownership. Combining carsharing programs with autonomous vehicles would improve user access to vehicles, thereby removing one of the main challenges to wide-scale adoption of these programs. While the ability to easily move cars to meet demand would be significant for carsharing programs, if implemented incorrectly, it could lead to worse system performance. In this paper, we seek to improve the performance of a fleet of shared autonomous vehicles through improved matching of vehicles to passengers requesting rides. We consider carsharing with autonomous vehicles as an assignment problem and examine four different methods for matching cars to users in a dynamic setting.
We show how applying a recent algorithm (Scalable Collision-avoiding Role Assignment with Minimal-makespan or SCRAM) for minimizing the maximal edge in a perfect matching can result in a more efficient, reliable, and fair carsharing system.
Our results highlight some of the problems with greedy or decentralized approaches.
Introducing a centralized system creates the possibility for users to strategically mis-report their locations to improve their expected wait time, so we provide a proof demonstrating that cancellation fees can be applied to eliminate the incentive to mis-report location.
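To illustrate the min-max ("minimal makespan") objective that SCRAM optimizes, here is a hedged sketch using a threshold search; min_max_assignment is our name for it, and the full SCRAM algorithm does more (e.g., avoiding agent path collisions):

import numpy as np
from scipy.optimize import linear_sum_assignment

def min_max_assignment(cost):
    """Smallest cap C such that every car can be matched to a distinct
    rider using only edges of cost <= C."""
    for c in np.sort(np.unique(cost)):
        over = (cost > c).astype(float)     # 1 marks a forbidden edge
        rows, cols = linear_sum_assignment(over)
        if over[rows, cols].sum() == 0:     # perfect matching within cap
            return c
    raise ValueError("unreachable for a finite square cost matrix")

cost = np.array([[1.0, 2.0],     # cars x riders travel costs
                 [1.0, 100.0]])
print(min_max_assignment(cost))  # 2.0; a greedy nearest-first match could
                                 # strand the second rider at cost 100.0

Minimizing the worst edge rather than the total is what makes the resulting system more reliable and fair: no single user is left with an extreme wait.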
Approximation of Lorenz-Optimal Solutions in Multiobjective Markov Decision Processes.
Patrice Perny, Paul Weng, Judy Goldsmith,
Josiah P. Hanna.
Proceedings of the International Conference on Uncertainty in Artificial Intelligence (UAI). July 2013.
Abstract
BibTeX
Paper (pdf)
This paper is devoted to fair optimization in Multiobjective Markov Decision
Processes (MOMDPs). A MOMDP is an extension of the MDP model for planning
under uncertainty while trying to optimize several reward functions
simultaneously. This applies to multiagent problems when rewards define
individual utility functions, or in multicriteria problems when rewards
refer to different features. In this setting, we study the determination of
policies leading to Lorenz-nondominated tradeoffs. Lorenz dominance is a
refinement of Pareto dominance that was introduced in Social Choice for the
measurement of inequalities. In this paper, we introduce methods to
efficiently approximate the sets of Lorenz-nondominated solutions of
infinite-horizon, discounted MOMDPs. The approximations are polynomial-sized
subsets of those solutions.
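To make Lorenz dominance concrete: sort each reward vector in ascending order, take cumulative sums, and compare componentwise. A short illustration, assuming the standard definition:

import numpy as np

def lorenz(v):
    """Lorenz vector: cumulative sums of the components sorted ascending."""
    return np.cumsum(np.sort(v))

def lorenz_dominates(v, w):
    Lv, Lw = lorenz(v), lorenz(w)
    return bool((Lv >= Lw).all() and (Lv > Lw).any())

# (5, 5) and (1, 9) are Pareto-incomparable, but the even split
# Lorenz-dominates the uneven one, capturing the fairness preference.
print(lorenz_dominates([5, 5], [1, 9]))  # True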
The Academic Advising Planning Domain.
Joshua T. Guerin,
Josiah P. Hanna, Libby Ferland, Nicholas Mattei, Judy Goldsmith.
Proceedings of the 3rd Workshop on the International Planning Competition at ICAPS. July 2012.
Abstract
BibTeX
Paper (pdf)
The International Probabilistic Planning Competition is a leading showcase
for fast stochastic planners. The current domains used in the competition
have raised challenges that the leading deterministic-planner-based MDP
solvers have been able to meet. We argue that, in order to continue to raise
challenges and match real-world applications, domains must be generated that
exhibit true stochasticity, multi-valued domain variables, and concurrent
actions. In this paper, we propose the academic advising domain as a
planning competition domain that exhibits these characteristics. We believe
that this domain can build upon the success of previous contests in pushing
the limits of MDP planning research.