I will be starting my work as a postdoc for Prof. Emma Brunskill at Stanford University. My research interests are mostly at the intersection of reinforcement learning and machine learning, specifically on the sub-topics related to continual learning, non-stationarity, safety, off-policy data, and other challenges stemming from real-world applications in education and healthcare.

I received my PhD at the University of Massachusetts, where I was a member of the Autonomous Learning Lab (ALL) and was fortunate to be advised by Prof. Philip Thomas.


  • Our papers on (a) Off-Policy evaluation for action-dependent non-stationary environments, and (b) Factored distributionally robust policies for contextual bandits, got accepted at NeurIPS’22.
  • PhD done. Graduated! Thanks to Phil for making my PhD journey amazing!
  • RLDM 2022 best paper award for our work on universal off-policy evaluation.
  • I will be interning at DeepMind London from May.
  • Outstanding Reviewer at NeurIPS’21 and Top Reviewer at ICML’21.

Selected Publications

Click here for all the publications.


Universal Off-Policy Evaluation
Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, Philip Thomas
Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021)
5th Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM 2022)
Best Paper Award at RLDM.

Abstract | Arxiv | Code | Video | Tutorial

When faced with sequential decision-making problems, it is often useful to be able to predict what would happen if decisions were made using a new policy. Those predictions must often be based on data collected under some previously used decision-making rule. Many previous methods enable such off-policy (or counterfactual) estimation of the expected value of a performance measure called the return. In this paper, we take the first steps towards a universal off-policy estimator (UnO) -- one that provides off-policy estimates and high-confidence bounds for any parameter of the return distribution. We use UnO for estimating and simultaneously bounding the mean, variance, quantiles/median, inter-quantile range, CVaR, and the entire cumulative distribution of returns. Finally, we also discuss Uno's applicability in various settings, including fully observable, partially observable (i.e., with unobserved confounders), Markovian, non-Markovian, stationary, smoothly non-stationary, and discrete distribution shifts.


Towards Safe Policy Improvement for Non-Stationary MDPs
Yash Chandak, Scott Jordan, Georgios Theocharous, Martha White, Philip Thomas
(Spotlight) Thirty-fourth Conference on Neural Information Processing Systems (NeurIPS 2020)

Abstract | Arxiv | Blogpost | Code | Video

Many real-world sequential decision-making problems involve critical systems that present both human-life and financial risks. While several works in the past have proposed methods that are safe for deployment, they assume that the underlying problem is stationary. However, many real-world problems of interest exhibit non-stationarity, and when stakes are high, the cost associated with a false stationarity assumption may be unacceptable. Addressing safety in the presence of non-stationarity remains an open question in the literature. We present a type of Seldonian algorithm (Thomas et al., 2019), taking the first steps towards ensuring safety, with high confidence, for smoothly varying non-stationary decision problems, through a synthesis of model-free reinforcement learning algorithms with methods from time-series analysis.


Optimizing for the Future in Non-Stationary MDPs
Yash Chandak, Georgios Theocharous, Shiv Shankar, Martha White, Sridhar Mahadevan, Philip Thomas
Thirty-seventh International Conference on Machine Learning (ICML 2020)

Abstract | Arxiv | Blogpost | Code | Video

Most reinforcement learning methods are based upon the key assumption that the transition dynamics and reward functions are fixed, that is, the underlying Markov decision process is stationary. However, in many real-world applications, this assumption is violated, and using existing algorithms may result in a performance lag. To proactively search for a good future policy, we present a policy gradient algorithm that maximizes a forecast of future performance. This forecast is obtained by fitting a curve to the counter-factual estimates of policy performance over time, without explicitly modeling the underlying non-stationarity. The resulting algorithm amounts to a non-uniform reweighting of past data, and we observe that minimizing performance over some of the data from past episodes can be beneficial when searching for a policy that maximizes future performance. We show that our algorithm, called Prognosticator, is more robust to non-stationarity than two online adaptation techniques, on three simulated problems motivated by real-world applications.


Lifelong Learning with a Changing Action Set
Yash Chandak, Georgios Theocharous, Chris Nota, Philip Thomas
(Oral) Thirty-fourth AAAI Conference on Artificial Intelligence (AAAI 2020)
Outstanding Student Paper Honorable Mention.

Abstract | Arxiv | Code

In many real-world sequential decision making problems, the number of available actions (decisions) can vary over time. While problems like catastrophic forgetting, changing transition dynamics, changing rewards functions, etc. have been well-studied in the lifelong learning literature, the setting where the action set changes remains unaddressed. In this paper, we present an algorithm that autonomously adapts to an action set whose size changes over time. To tackle this open problem, we break it into two problems that can be solved iteratively: inferring the underlying, unknown, structure in the space of actions and optimizing a policy that leverages this structure. We demonstrate the efficiency of this approach on large-scale real-world lifelong learning problems.