I’m a PhD candidate at the University of Massachusetts and a member of the Autonomous Learning Lab (ALL), where I am fortunate to be advised by Prof. Philip Thomas. My primary interest is continual learning, a branch of artificial intelligence that aims to teach machines to adapt to changing scenarios and acquire new concepts over time. My research lies mostly at the intersection of reinforcement learning and machine learning, with a focus on the challenges of real-world applications. I also enjoy reading about neuroscience and drawing inspiration from it.


  • Outstanding Reviewer at NeurIPS’21 and Top Reviewer at ICML’21.
  • Our papers on (a) Universal Off-Policy Evaluation and (b) SOPE: Spectrum of Off-Policy Estimators were accepted at NeurIPS’21.
  • Our paper on providing high-confidence generalization for reinforcement learning was accepted at ICML’21.
  • Our paper on providing high-confidence off-policy variance estimates was accepted at AAAI’21.

Selected Publications

Click here for all the publications.


Towards Safe Policy Improvement for Non-Stationary MDPs
Yash Chandak, Scott Jordan, Georgios Theocharous, Martha White, Philip Thomas
(Spotlight) Thirty-fourth Conference on Neural Information Processing Systems (NeurIPS 2020)

Abstract | Arxiv | Blogpost | Code | Video

Many real-world sequential decision-making problems involve critical systems that present both human-life and financial risks. While several works in the past have proposed methods that are safe for deployment, they assume that the underlying problem is stationary. However, many real-world problems of interest exhibit non-stationarity, and when stakes are high, the cost associated with a false stationarity assumption may be unacceptable. Addressing safety in the presence of non-stationarity remains an open question in the literature. We present a type of Seldonian algorithm (Thomas et al., 2019), taking the first steps towards ensuring safety, with high confidence, for smoothly varying non-stationary decision problems, through a synthesis of model-free reinforcement learning algorithms with methods from time-series analysis.


Optimizing for the Future in Non-Stationary MDPs
Yash Chandak, Georgios Theocharous, Shiv Shankar, Martha White, Sridhar Mahadevan, Philip Thomas
Thirty-seventh International Conference on Machine Learning (ICML 2020)

Abstract | Arxiv | Blogpost | Code | Video

Most reinforcement learning methods are based upon the key assumption that the transition dynamics and reward functions are fixed, that is, the underlying Markov decision process is stationary. However, in many real-world applications, this assumption is violated, and using existing algorithms may result in a performance lag. To proactively search for a good future policy, we present a policy gradient algorithm that maximizes a forecast of future performance. This forecast is obtained by fitting a curve to the counter-factual estimates of policy performance over time, without explicitly modeling the underlying non-stationarity. The resulting algorithm amounts to a non-uniform reweighting of past data, and we observe that minimizing performance over some of the data from past episodes can be beneficial when searching for a policy that maximizes future performance. We show that our algorithm, called Prognosticator, is more robust to non-stationarity than two online adaptation techniques, on three simulated problems motivated by real-world applications.
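To illustrate the "non-uniform reweighting of past data" mentioned in the abstract, here is a minimal sketch rather than the paper's implementation. It assumes the curve is an ordinary least-squares polynomial fit over episode indices and that per-episode performance estimates are already available as plain numbers; the function name `forecast_weights` and its parameters are hypothetical. Under those assumptions, the forecast of future performance is a fixed weighted sum of past estimates, and the oldest episodes can receive negative weights.

```python
import numpy as np

def forecast_weights(num_past, degree=1, steps_ahead=1):
    """Return weights w such that w @ past_estimates equals the
    least-squares polynomial forecast `steps_ahead` episodes ahead.

    Assumption: episodes are indexed 1..num_past and the trend is fit
    with a polynomial basis of the given degree (illustrative choice).
    """
    t = np.arange(1, num_past + 1, dtype=float)
    # Design matrix with columns [1, t, t^2, ...]
    Phi = np.vander(t, degree + 1, increasing=True)
    # Basis features of the future episode we want to forecast
    t_future = np.array([float(num_past + steps_ahead)])
    phi_future = np.vander(t_future, degree + 1, increasing=True)[0]
    # OLS forecast = phi_future^T (Phi^T Phi)^{-1} Phi^T y, so the weights on y are:
    return phi_future @ np.linalg.pinv(Phi)

# Placeholder performance estimates for five past episodes (made-up numbers).
past_estimates = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = forecast_weights(num_past=5)
print(np.round(w, 3))      # weights are approximately -0.4, -0.1, 0.2, 0.5, 0.8
print(w @ past_estimates)  # forecast for episode 6, approximately 6.0
```

In this toy setting the two oldest episodes get negative weights, which matches the abstract's observation that minimizing performance on some past data can help when maximizing forecast future performance.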


Lifelong Learning with a Changing Action Set
Yash Chandak, Georgios Theocharous, Chris Nota, Philip Thomas
(Oral) Thirty-fourth AAAI Conference on Artificial Intelligence (AAAI 2020)
Outstanding Student Paper Honorable Mention.

Abstract | Arxiv | Code

In many real-world sequential decision-making problems, the number of available actions (decisions) can vary over time. While problems like catastrophic forgetting, changing transition dynamics, changing reward functions, etc. have been well studied in the lifelong learning literature, the setting where the action set changes remains unaddressed. In this paper, we present an algorithm that autonomously adapts to an action set whose size changes over time. To tackle this open problem, we break it into two problems that can be solved iteratively: inferring the underlying, unknown structure in the space of actions and optimizing a policy that leverages this structure. We demonstrate the efficiency of this approach on large-scale real-world lifelong learning problems.