|
Off-Policy Evaluation for Action-Dependent Non-Stationary Environments
Yash Chandak,
Shiv Shankar,
Nathaniel D. Bastian,
Bruno Castro da Silva,
Emma Brunskill,
Philip Thomas
Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2022)
Abstract | Arxiv | Code | Video
Methods for sequential decision-making are often built upon a foundational assumption that the underlying decision process is stationary. This limits the application of such methods because real-world problems are often subject to changes due to external factors (passive non-stationarity), changes induced by interactions with the system itself (active non-stationarity), or both (hybrid non-stationarity). In this work, we take the first steps towards the fundamental challenge of on-policy and off-policy evaluation amidst structured changes due to active, passive, or hybrid non-stationarity. Towards this goal, we make a higher-order stationarity assumption such that non-stationarity results in changes over time, but the way those changes happen is fixed. We propose OPEN, an algorithm that uses a double application of counterfactual reasoning and a novel importance-weighted instrument-variable regression to obtain both a lower-bias and a lower-variance estimate of the structure in the changes of a policy's past performances. Finally, we show promising results on how OPEN can be used to predict future performances for several domains inspired by real-world applications that exhibit non-stationarity.
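As a rough illustration of the forecasting step described above (a minimal sketch under simplifying assumptions, not the OPEN algorithm itself, which additionally relies on the double counterfactual reasoning and importance-weighted instrument-variable regression mentioned in the abstract): per-episode importance-sampled performance estimates are fit with a simple trend over the episode index and extrapolated one step ahead. The function names and the ordinary importance-sampling estimator are illustrative choices.

import numpy as np

def per_episode_is_estimates(behavior_probs, eval_probs, rewards):
    # Ordinary importance-sampling estimate of the evaluation policy's
    # performance for each past episode; inputs are per-episode arrays of
    # action probabilities (for the actions actually taken) and rewards.
    estimates = []
    for b, e, r in zip(behavior_probs, eval_probs, rewards):
        rho = np.prod(np.asarray(e) / np.asarray(b))  # trajectory importance weight
        estimates.append(rho * np.sum(r))             # importance-weighted return
    return np.asarray(estimates)

def forecast_next_performance(estimates, degree=1):
    # Fit a polynomial trend over the episode index and predict the next episode.
    x = np.arange(len(estimates))
    coeffs = np.polyfit(x, estimates, degree)
    return np.polyval(coeffs, len(estimates))  # one-step-ahead forecast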
|
|
Universal Off-Policy Evaluation
Yash Chandak,
Scott Niekum,
Bruno Castro da Silva,
Erik Learned-Miller,
Emma Brunskill,
Philip Thomas
Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021)
5th Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM 2022)
Best Paper Award at RLDM.
Abstract | Arxiv | Code | Video | Tutorial
When faced with sequential decision-making problems, it is often useful to be able to predict what would happen if decisions were made using a new policy. Those predictions must often be based on data collected under some previously used decision-making rule. Many previous methods enable such off-policy (or counterfactual) estimation of the expected value of a performance measure called the return. In this paper, we take the first steps towards a universal off-policy estimator (UnO) -- one that provides off-policy estimates and high-confidence bounds for any parameter of the return distribution. We use UnO for estimating and simultaneously bounding the mean, variance, quantiles/median, inter-quantile range, CVaR, and the entire cumulative distribution of returns. Finally, we also discuss UnO's applicability in various settings, including fully observable, partially observable (i.e., with unobserved confounders), Markovian, non-Markovian, stationary, smoothly non-stationary, and discrete distribution shifts.
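For intuition about the kind of estimate involved (a minimal sketch assuming given per-trajectory importance weights in the fully observable, stationary setting; this is not the UnO estimator or its confidence bounds): an importance-weighted empirical CDF of returns under the evaluation policy, from which quantiles and CVaR can be read off. The helper names are illustrative.

import numpy as np

def weighted_return_cdf(returns, importance_weights):
    # Sort returns and accumulate their normalized importance weights to get
    # an off-policy estimate of the return distribution's CDF.
    order = np.argsort(returns)
    g = np.asarray(returns, dtype=float)[order]
    w = np.asarray(importance_weights, dtype=float)[order]
    cdf = np.cumsum(w) / np.sum(w)  # self-normalized weighted empirical CDF
    return g, cdf

def quantile_from_cdf(g, cdf, alpha):
    # Smallest return whose estimated CDF value reaches alpha.
    return g[np.searchsorted(cdf, alpha)]

def cvar_from_cdf(g, cdf, alpha):
    # Expected return conditioned on falling at or below the alpha-quantile.
    q = quantile_from_cdf(g, cdf, alpha)
    probs = np.diff(np.concatenate(([0.0], cdf)))  # per-sample probability mass
    mask = g <= q
    return np.sum(probs[mask] * g[mask]) / np.sum(probs[mask])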
|
|
Towards Safe Policy Improvement for Non-Stationary MDPs
Yash Chandak,
Scott Jordan,
Georgios Theocharous,
Martha White,
Philip Thomas
(Spotlight) Thirty-fourth Conference on Neural Information Processing Systems (NeurIPS 2020)
Abstract | Arxiv | Blogpost | Code | Video
Many real-world sequential decision-making problems involve critical systems that present both human-life and financial risks. While several works in the past have proposed methods that are safe for deployment, they assume that the underlying problem is stationary. However, many real-world problems of interest exhibit non-stationarity, and when stakes are high, the cost associated with a false stationarity assumption may be unacceptable. Addressing safety in the presence of non-stationarity remains an open question in the literature. We present a type of Seldonian algorithm (Thomas et al., 2019), taking the first steps towards ensuring safety, with high confidence, for smoothly varying non-stationary decision problems, through a synthesis of model-free reinforcement learning algorithms with methods from time-series analysis.
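As a toy illustration of the synthesis mentioned above (a hedged sketch under simplifying assumptions, not the paper's Seldonian algorithm): past counterfactual estimates of a candidate policy's performance are extrapolated with an ordinary-least-squares trend, a t-based lower prediction bound is formed for the forecast, and deployment happens only if that bound clears a safety threshold. The function name and the specific bound are illustrative.

import numpy as np
from scipy import stats

def safe_to_deploy(estimates, safety_threshold, delta=0.05, horizon=1):
    # Deploy only if the (1 - delta) lower prediction bound on the forecasted
    # performance exceeds the safety threshold.
    estimates = np.asarray(estimates, dtype=float)
    n = len(estimates)
    x = np.arange(n)
    X = np.column_stack([np.ones(n), x])          # design matrix for a linear trend
    beta, *_ = np.linalg.lstsq(X, estimates, rcond=None)
    x_future = np.array([1.0, n - 1 + horizon])   # forecast point
    forecast = x_future @ beta
    dof = n - 2
    sigma2 = np.sum((estimates - X @ beta) ** 2) / dof
    # standard error of a new prediction at the forecast point
    se = np.sqrt(sigma2 * (1.0 + x_future @ np.linalg.inv(X.T @ X) @ x_future))
    lower_bound = forecast - stats.t.ppf(1 - delta, dof) * se
    return lower_bound >= safety_threshold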
|
|
Optimizing for the Future in Non-Stationary MDPs
Yash Chandak,
Georgios Theocharous,
Shiv Shankar,
Martha White,
Sridhar Mahadevan,
Philip Thomas
Thirty-seventh International Conference on Machine Learning (ICML 2020)
Abstract | Arxiv | Blogpost | Code | Video
Most reinforcement learning methods are based upon the key assumption that the transition dynamics and reward functions are fixed, that is, the underlying Markov decision process is stationary. However, in many real-world applications, this assumption is violated, and using existing algorithms may result in a performance lag. To proactively search for a good future policy, we present a policy gradient algorithm that maximizes a forecast of future performance. This forecast is obtained by fitting a curve to the counterfactual estimates of policy performance over time, without explicitly modeling the underlying non-stationarity. The resulting algorithm amounts to a non-uniform reweighting of past data, and we observe that minimizing performance over some of the data from past episodes can be beneficial when searching for a policy that maximizes future performance. We show that our algorithm, called Prognosticator, is more robust to non-stationarity than two online adaptation techniques on three simulated problems motivated by real-world applications.
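To make the "non-uniform reweighting" view above concrete (a minimal sketch under illustrative assumptions, not the authors' Prognosticator implementation): with a polynomial least-squares fit over the episode index, the one-step-ahead forecast of performance is exactly a fixed weighted sum of the past per-episode estimates, and some of those weights come out negative, which is where decreasing performance on parts of the past data can raise the forecast. Function names are hypothetical.

import numpy as np

def forecast_weights(num_episodes, degree=2, horizon=1):
    # Weights w such that w @ past_estimates equals the least-squares
    # polynomial extrapolation of performance 'horizon' steps ahead.
    x = np.arange(num_episodes)
    X = np.vander(x, degree + 1, increasing=True)  # polynomial basis over episode index
    x_future = np.vander([num_episodes - 1 + horizon], degree + 1, increasing=True)
    # hat-row for the forecast point: x_future (X^T X)^{-1} X^T
    return (x_future @ np.linalg.solve(X.T @ X, X.T)).ravel()

# Example: with a quadratic fit over 10 episodes, some weights are negative,
# so the forecast effectively penalizes performance on some past episodes.
w = forecast_weights(num_episodes=10, degree=2)
print(w.round(3), w.sum())  # weights sum to 1 because the basis includes a constant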
|
|
Lifelong Learning with a Changing Action Set
Yash Chandak,
Georgios Theocharous,
Chris Nota,
Philip Thomas
(Oral) Thirty-fourth AAAI Conference on Artificial Intelligence (AAAI 2020)
Outstanding Student Paper Honorable Mention.
Abstract | Arxiv | Code
In many real-world sequential decision-making problems, the number of available actions (decisions) can vary over time. While problems like catastrophic forgetting, changing transition dynamics, changing reward functions, etc., have been well-studied in the lifelong learning literature, the setting where the action set changes remains unaddressed. In this paper, we present an algorithm that autonomously adapts to an action set whose size changes over time. To tackle this open problem, we break it into two problems that can be solved iteratively: inferring the underlying, unknown structure in the space of actions, and optimizing a policy that leverages this structure. We demonstrate the efficiency of this approach on large-scale real-world lifelong learning problems.
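A toy sketch of the two-part decomposition described above (hypothetical class and method names, not the paper's implementation): the policy acts in a learned action-embedding space, and concrete actions are chosen by nearest-neighbor lookup, so a newly available action only needs an inferred embedding rather than a new policy output layer.

import numpy as np

class EmbeddingPolicy:
    # Toy policy over an action-embedding space; a linear state-to-embedding
    # map stands in for a learned policy network.
    def __init__(self, state_dim, embed_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(embed_dim, state_dim))
        self.action_embeddings = {}  # action id -> inferred embedding

    def add_action(self, action_id, embedding):
        # Register a newly available action by its inferred embedding.
        self.action_embeddings[action_id] = np.asarray(embedding, dtype=float)

    def act(self, state):
        # Map the state to a target embedding, then pick the nearest
        # currently available action.
        target = self.W @ np.asarray(state, dtype=float)
        ids = list(self.action_embeddings)
        dists = [np.linalg.norm(self.action_embeddings[a] - target) for a in ids]
        return ids[int(np.argmin(dists))]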
|