Publications

2023
Sharma A, Parbhoo S, Gottesman O, Doshi-Velez F. Robust Decision-Focused Learning for Reward Transfer. Preprint. 2023.
Decision-focused (DF) model-based reinforcement learning has recently been introduced as a powerful algorithm that can focus on learning the MDP dynamics most relevant for obtaining high rewards. While this approach increases the performance of agents by directing learning toward optimizing the reward directly, it does so by learning less accurate dynamics (from an MLE standpoint), and may thus be brittle to changes in the reward function. In this work, we develop the robust decision-focused (RDF) algorithm, which leverages the non-identifiability of DF solutions to learn models that maximize expected returns while simultaneously remaining robust to changes in the reward function. We demonstrate on a variety of toy examples and healthcare simulators that RDF significantly increases the robustness of DF to changes in the reward function, without decreasing the overall return the agent obtains.
Fu H, Yao J, Gottesman O, Doshi-Velez F, Konidaris G. Performance Bounds for Model and Policy Transfer in Hidden-parameter MDPs, in The Eleventh International Conference on Learning Representations; 2023.
In the Hidden-Parameter MDP (HiP-MDP) framework, a family of reinforcement learning tasks is generated by varying hidden parameters specifying the dynamics and reward function for each individual task. The HiP-MDP is a natural model for families of tasks in which meta- and lifelong-reinforcement-learning approaches can succeed. Given a learned context encoder that infers the hidden parameters from previous experience, most existing algorithms fall into two categories: model transfer and policy transfer, depending on which function the hidden parameters are used to parameterize. We characterize the robustness of model and policy transfer algorithms with respect to hidden-parameter estimation error. We first show that the value function of HiP-MDPs is Lipschitz continuous under certain conditions. We then derive regret bounds for both settings through the lens of Lipschitz continuity. Finally, we empirically corroborate our theoretical analysis by experimentally varying the hyper-parameters governing the Lipschitz constants of two continuous control problems; the resulting performance is consistent with our predictions.
Sharma A, Zhang J, Nikovski D, Doshi-Velez F. Travel-time prediction using neural-network-based mixture models. The 14th International Conference on Ambient Systems, Networks and Technologies (ANT 2022) and The 6th International Conference on Emerging Data and Industry 4.0 (EDI40). 2023;220 :1033–1038.
Accurate estimation of travel times is an important step in smart transportation and smart building systems. Poor estimation of travel times results in both frustrated users and wasted resources. Current methods that estimate travel times usually only return point estimates, losing important distributional information necessary for accurate decision-making. We propose using neural-network-based mixture distributions to predict a user's travel times given their origin and destination coordinates. We show that our method correctly estimates the travel time distribution, maximizes utility in a downstream elevator scheduling task, and is easy to retrain, making it a versatile and inexpensive-to-maintain module when deployed in smart crowd management systems.
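The mixture-distribution idea above can be pictured as a small mixture density network that maps trip features to the parameters of a Gaussian mixture over travel time. The architecture, dimensions, and feature layout below are illustrative assumptions for a sketch, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MixtureDensityHead:
    """Maps a feature vector (here: origin/destination coordinates) to the
    parameters of a K-component Gaussian mixture over travel time."""
    def __init__(self, in_dim, hidden, K):
        self.W1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        # Output layer produces mixture logits, means, and log-stds.
        self.W2 = rng.normal(0.0, 0.1, (hidden, 3 * K))
        self.b2 = np.zeros(3 * K)

    def forward(self, x):
        h = np.tanh(x @ self.W1 + self.b1)
        logits, mu, log_sigma = np.split(h @ self.W2 + self.b2, 3, axis=-1)
        return softmax(logits), mu, np.exp(log_sigma)

    def pdf(self, x, t):
        """Density of travel time t under the predicted mixture."""
        pi, mu, sigma = self.forward(x)
        comp = np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        return float((pi * comp).sum())

head = MixtureDensityHead(in_dim=4, hidden=16, K=3)
x = np.array([0.1, 0.2, 0.8, 0.9])   # origin (x, y), destination (x, y)
density = head.pdf(x, t=5.0)
```

Training such a head would minimize the negative log-likelihood of observed travel times under the predicted mixture; only the forward pass is sketched here, which is what a downstream scheduler would query.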
2022
Havasi M, Parbhoo S, Doshi-Velez F. Addressing Leakage in Concept Bottleneck Models. Preprint. 2022.
Zhang KW, Gottesman O, Doshi-Velez F. A Bayesian Approach to Learning Bandit Structure in Markov Decision Processes. Preprint. 2022.
In the reinforcement learning literature, there are many algorithms developed for either Contextual Bandit (CB) or Markov Decision Process (MDP) environments. However, when deploying reinforcement learning algorithms in the real world, even with domain expertise, it is often difficult to know whether it is appropriate to treat a sequential decision-making problem as a CB or an MDP. In other words, do actions affect future states, or only the immediate rewards? Making the wrong assumption about the nature of the environment can lead to inefficient learning, or even prevent the algorithm from ever learning an optimal policy, even with infinite data. In this work we develop an online algorithm that uses a Bayesian hypothesis testing approach to learn the nature of the environment. Our algorithm allows practitioners to incorporate prior knowledge about whether the environment is a CB or an MDP, and effectively interpolates between classical CB and MDP-based algorithms to mitigate the effects of misspecifying the environment. We perform simulations and demonstrate that in CB settings our algorithm achieves lower regret than MDP-based algorithms, while in non-bandit MDP settings our algorithm is able to learn the optimal policy, often achieving regret comparable to MDP-based algorithms.
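One way to picture the CB-vs-MDP question is as a comparison of Bayesian evidence for "the next state is drawn i.i.d." (bandit-like) versus "the next state depends on the current state and action" (MDP-like). The sketch below uses Dirichlet-multinomial marginal likelihoods over a toy discrete environment; it is a simplified illustration of the general idea, not the paper's algorithm:

```python
import math

def log_dirichlet_marginal(counts, alpha):
    """Log marginal likelihood of categorical draws under a symmetric
    Dirichlet(alpha) prior (the Dirichlet-multinomial evidence)."""
    K, n = len(counts), sum(counts)
    out = math.lgamma(K * alpha) - math.lgamma(K * alpha + n)
    for c in counts:
        out += math.lgamma(alpha + c) - math.lgamma(alpha)
    return out

def bandit_vs_mdp_evidence(transitions, n_states, alpha=1.0):
    """transitions: list of (s, a, s_next). Returns log evidence under the
    CB-like hypothesis (pooled next-state distribution) and the MDP-like
    hypothesis (one next-state distribution per (s, a) pair)."""
    pooled = [0] * n_states
    for _, _, s2 in transitions:
        pooled[s2] += 1
    log_ev_cb = log_dirichlet_marginal(pooled, alpha)
    per_sa = {}
    for s, a, s2 in transitions:
        per_sa.setdefault((s, a), [0] * n_states)[s2] += 1
    log_ev_mdp = sum(log_dirichlet_marginal(c, alpha) for c in per_sa.values())
    return log_ev_cb, log_ev_mdp

# Deterministic MDP-like data: the action flips the state.
data = [(s, a, (s + a) % 2) for s in (0, 1) for a in (0, 1)] * 10
cb, mdp = bandit_vs_mdp_evidence(data, n_states=2)
```

On this data the evidence favours the MDP hypothesis; on data where the next state is genuinely independent of (s, a), the pooled model's smaller parameter count would win instead.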
Liao QV, Zhang Y, Luss R, Doshi-Velez F, Dhurandhar A. Connecting Algorithmic Research and Usage Contexts: A Perspective of Contextualized Evaluation for Explainable AI. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing. 2022;10 :147–159.
Recent years have seen a surge of interest in the field of explainable AI (XAI), with a plethora of algorithms proposed in the literature. However, a lack of consensus on how to evaluate XAI hinders the advancement of the field. We highlight that XAI is not a monolithic set of technologies: researchers and practitioners have begun to leverage XAI algorithms to build XAI systems that serve different usage contexts, such as model debugging and decision support. Algorithmic research on XAI, however, often does not account for these diverse downstream usage contexts, resulting in limited effectiveness or even unintended consequences for actual users, as well as difficulties for practitioners in making technical choices. We argue that one way to close the gap is to develop evaluation methods that account for different user requirements in these usage contexts. Towards this goal, we introduce a perspective of contextualized XAI evaluation by considering the relative importance of XAI evaluation criteria for prototypical usage contexts of XAI. To explore the context dependency of XAI evaluation criteria, we conduct two survey studies, one with XAI topical experts and another with crowd workers. Our results call for responsible AI research with usage-informed evaluation practices, and provide a nuanced understanding of user requirements for XAI in different usage contexts.
Trella AL, Zhang KW, Nahum-Shani I, Shetty V, Doshi-Velez F, Murphy SA. Designing Reinforcement Learning Algorithms for Digital Interventions: Pre-Implementation Guidelines. Algorithms. 2022;15 (8) :255.
Online reinforcement learning (RL) algorithms are increasingly used to personalize digital interventions in the fields of mobile health and online education. Common challenges in designing and testing an RL algorithm in these settings include ensuring the RL algorithm can learn and run stably under real-time constraints, and accounting for the complexity of the environment, e.g., a lack of accurate mechanistic models for the user dynamics. To guide how one can tackle these challenges, we extend the PCS (predictability, computability, stability) framework, a data science framework that incorporates best practices from machine learning and statistics in supervised learning to the design of RL algorithms for the digital interventions setting. Furthermore, we provide guidelines on how to design simulation environments, a crucial tool for evaluating RL candidate algorithms using the PCS framework. We show how we used the PCS framework to design an RL algorithm for Oralytics, a mobile health study aiming to improve users’ tooth-brushing behaviors through the personalized delivery of intervention messages. Oralytics will go into the field in late 2022.
Lage I, Pradier MF, McCoy TH, Perlis RH, Doshi-Velez F. Do clinicians follow heuristics in prescribing antidepressants? Journal of Affective Disorders. 2022;311 :110–114.
Background: While clinicians commonly learn heuristics to guide antidepressant treatment selection, surveys suggest real-world prescribing practices vary widely. We aimed to determine the extent to which antidepressant prescriptions were consistent with commonly-advocated heuristics for treatment selection. Methods: This retrospective longitudinal cohort study examined electronic health records from psychiatry and non-psychiatry practice networks affiliated with two large academic medical centers between March 2008 and December 2017. Patients included 45,955 individuals with a major depressive disorder or depressive disorder not otherwise specified diagnosis who were prescribed at least one of 11 common antidepressant medications. Specific clinical features that may impact prescribing choices were extracted from coded data, and analyzed for association with index prescription in logistic regression models adjusted for sociodemographic variables and provider type. Results: Multiple clinical features yielded 10% or greater change in odds of prescribing, including overweight and underweight status and sexual dysfunction. These heuristics were generally applied similarly across hospital systems and psychiatrist and non-psychiatrist providers. Limitations: These analyses rely on coded clinical data, which is likely to substantially underestimate the prevalence of particular clinical features. Additionally, numerous other features that may impact prescribing choices could not be modeled. Conclusion: Our results confirm the hypothesis that clinicians apply heuristics on the basis of clinical features to guide antidepressant prescribing, although the magnitude of these effects is modest, suggesting other patient- or clinician-level factors have larger effects. Funding: This work was funded by NSF GRFP (grant no. DGE1745303), Harvard SEAS, the Center for Research on Computation and Society at Harvard, the Harvard Data Science Initiative, and a grant from the National Institute of Mental Health (grant no. 1R01MH106577).
Lage I, McCoy Jr TH, Perlis RH, Doshi-Velez F. Efficiently identifying individuals at high risk for treatment resistance in major depressive disorder using electronic health records. Journal of Affective Disorders. 2022;306 :254–259.
Background: With the emergence of evidence-based treatments for treatment-resistant depression, strategies to identify individuals at greater risk for treatment resistance early in the course of illness could have clinical utility. We sought to develop and validate a model to predict treatment resistance in major depressive disorder using coded clinical data from the electronic health record. Methods: We identified individuals from a large health system with a diagnosis of major depressive disorder receiving an index antidepressant prescription, and used a tree-based machine learning classifier to build a risk stratification model to identify those likely to experience treatment resistance. The resulting model was validated in a second health system. Results: In the second health system, the extra trees model yielded an AUC of 0.652 (95% CI: 0.623–0.682); with sensitivity constrained at 0.80, specificity was 0.358 (95% CI: 0.300–0.413). Lift in the top quintile was 1.99 (95% CI: 1.76–2.22). Including additional data for the 4 weeks following treatment initiation did not meaningfully improve model performance. Limitations: The extent to which these models generalize across additional health systems will require further investigation. Conclusion: Electronic health records facilitated stratification of risk for treatment-resistant depression and demonstrated generalizability to a second health system. Efforts to improve upon such models using additional measures, and to understand their performance in real-world clinical settings, are warranted.
Zeng X, Yao J, Doshi-Velez F, Pan W. From Soft Trees to Hard Trees: Gains and Losses. Preprint. 2022.
Trees are widely used as interpretable models. However, when they are greedily trained they can yield suboptimal predictive performance. Training soft trees, with probabilistic splits rather than deterministic ones, provides a way to supposedly globally optimize tree models. For interpretability purposes, a hard tree can be obtained from a soft tree by binarizing the probabilistic splits, a process called hardening. Unfortunately, the good performance of the soft model is often lost after hardening. We systematically study two factors contributing to the performance drop: first, the loss surface of the soft tree loss has many local optima (and thus the logic for using the soft tree loss becomes less clear), and second, the relative values of the soft tree loss do not correspond to relative values of the hard tree loss. We also demonstrate that simple mitigation methods in the literature do not fully mitigate the performance drop.
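The soft-split-versus-hardened-split contrast can be seen in a toy depth-1 tree: a sigmoid gate blends two leaves, and hardening commits fully to one of them, so predictions near the decision boundary change. This is a hypothetical illustration, not code from the paper:

```python
import numpy as np

def soft_tree_predict(x, w, b, leaf_left, leaf_right, temp=1.0):
    """Depth-1 soft tree: a sigmoid gate routes each input
    probabilistically between two leaf values."""
    p_right = 1.0 / (1.0 + np.exp(-(x @ w + b) / temp))
    return (1.0 - p_right) * leaf_left + p_right * leaf_right

def hard_tree_predict(x, w, b, leaf_left, leaf_right):
    """The hardened tree: binarize the probabilistic split and
    commit each input to exactly one leaf."""
    return np.where(x @ w + b > 0, leaf_right, leaf_left)

w, b = np.array([1.0, -1.0]), 0.0
x = np.array([[0.1, 0.0],    # near the decision boundary
              [2.0, 0.0]])   # far from it
soft = soft_tree_predict(x, w, b, 0.0, 1.0)
hard = hard_tree_predict(x, w, b, 0.0, 1.0)
# Near the boundary the soft tree blends both leaves (~0.52), while the
# hard tree outputs exactly 1.0, so hardening changes the prediction.
```

The gap between `soft` and `hard` for boundary points is a miniature version of the performance drop the paper studies: a loss minimized under the soft blend need not be small for the binarized routing.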
Littman ML, Ajunwa I, Berger G, Boutilier C, Currie M, Doshi-Velez F, Hadfield G, Horowitz MC, Isbell C, Kitano H, et al. Gathering Strength, Gathering Storms: The One Hundred Year Study on Artificial Intelligence (AI100) 2021 Study Panel Report. Preprint. 2022.
In September 2021, the "One Hundred Year Study on Artificial Intelligence" project (AI100) issued the second report of its planned long-term periodic assessment of artificial intelligence (AI) and its impact on society. It was written by a panel of 17 study authors, each of whom is deeply rooted in AI research, chaired by Michael Littman of Brown University. The report, entitled "Gathering Strength, Gathering Storms," answers a set of 14 questions probing critical areas of AI development addressing the major risks and dangers of AI, its effects on society, its public perception and the future of the field. The report concludes that AI has made a major leap from the lab to people's lives in recent years, which increases the urgency to understand its potential negative effects. The questions were developed by the AI100 Standing Committee, chaired by Peter Stone of the University of Texas at Austin, consisting of a group of AI leaders with expertise in computer science, sociology, ethics, economics, and other disciplines.
Parbhoo S, Joshi S, Doshi-Velez F. Generalizing Off-Policy Evaluation From a Causal Perspective For Sequential Decision-Making. Preprint. 2022.
Assessing the effects of a policy based on observational data from a different policy is a common problem across several high-stakes decision-making domains, and several off-policy evaluation (OPE) techniques have been proposed. However, these methods largely formulate OPE as a problem disassociated from the process used to generate the data (i.e., structural assumptions in the form of a causal graph). We argue that explicitly highlighting this association has important implications for our understanding of the fundamental limits of OPE. First, this implies that the current formulation of OPE corresponds to a narrow set of tasks, i.e., a specific causal estimand focused on prospective evaluation of policies over populations or sub-populations. Second, we demonstrate how this association motivates natural desiderata for considering a general set of causal estimands, particularly extending the role of OPE to counterfactual off-policy evaluation at the level of individuals of the population. A precise description of the causal estimand highlights which OPE estimands are identifiable from observational data under the stated generative assumptions. For those OPE estimands that are not identifiable, the causal perspective further highlights where more experimental data is necessary, and highlights situations where human expertise can aid identification and estimation. Furthermore, many formalisms of OPE overlook the role of uncertainty entirely in the estimation process. We demonstrate how specifically characterising the causal estimand highlights the different sources of uncertainty and when human expertise can naturally manage this uncertainty. We discuss each of these aspects as actionable desiderata for future OPE research at scale and in line with practical utility.
Keramati R, Gottesman O, Celi LA. Identification of Subgroups With Similar Benefits in Off-Policy Policy Evaluation. Preprint. 2022.
Chin Z, Raval S, Doshi-Velez F, Wattenberg M, Celi LA. Identifying Structure in the MIMIC ICU Dataset, in Proceedings of the Conference on Health, Inference, and Learning; 2022.
The MIMIC-III dataset, containing trajectories of 40,000 ICU patients, is one of the most popular datasets in the machine learning for health space. However, there has been very little systematic exploration to understand the natural structure of these data: most analyses enforce some type of top-down clustering or embedding. We take a bottom-up approach, identifying consistent structures that are robust across a range of embedding choices. We identified two dominant structures, sorted by either fraction of inspired oxygen or creatinine, both of which were validated as key features by our clinical co-author. Our bottom-up approach to studying the macro-structure of a dataset can also be adapted for other datasets.
Yacoby Y, Green B, Jr CLG, Doshi-Velez F. “If it didn’t happen, why would I change my decision?”: How Judges Respond to Counterfactual Explanations for the Public Safety Assessment. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing. 2022;10 :219–230.
Many researchers and policymakers have expressed excitement about algorithmic explanations enabling more fair and responsible decision-making. However, recent experimental studies have found that explanations do not always improve human use of algorithmic advice. In this study, we shed light on how people interpret and respond to counterfactual explanations (CFEs), explanations that show how a model's output would change with marginal changes to its input(s), in the context of pretrial risk assessment instruments (PRAIs). We ran think-aloud trials with eight sitting U.S. state court judges, providing them with recommendations from a PRAI that includes CFEs. We found that the CFEs did not alter the judges' decisions. At first, judges misinterpreted the counterfactuals as real, rather than hypothetical, changes to defendants. Once judges understood what the counterfactuals meant, they ignored them, stating that their role is only to make decisions regarding the actual defendant in question. The judges also expressed a mix of reasons for ignoring or following the advice of the PRAI without CFEs. These results add to the literature detailing the unexpected ways in which people respond to algorithms and explanations. They also highlight new challenges associated with improving human-algorithm collaborations through explanations.
Zhang K, Wang H, Du J, Chu B, Robles Arévalo A, Kindle R, Celi LA, Doshi-Velez F. An interpretable RL framework for pre-deployment modeling in ICU hypotension management. npj Digital Medicine. 2022;5 (1) :1–10.
Computational methods from reinforcement learning have shown promise in inferring treatment strategies for hypotension management and other clinical decision-making challenges. Unfortunately, the resulting models are often difficult for clinicians to interpret, making clinical inspection and validation of these computationally derived strategies challenging in advance of deployment. In this work, we develop a general framework for identifying succinct sets of clinical contexts in which clinicians make very different treatment choices, tracing the effects of those choices, and inferring a set of recommendations for those specific contexts. By focusing on these few key decision points, our framework produces succinct, interpretable treatment strategies that can each be easily visualized and verified by clinical experts. This interrogation process allows clinicians to leverage the model’s use of historical data in tandem with their own expertise to determine which recommendations are worth investigating further, e.g., at the bedside. We demonstrate the value of this approach via application to hypotension management in the ICU, an area with critical implications for patient outcomes that lacks data-driven individualized treatment strategies; that said, our framework has broad implications for how to use computational methods to assist with decision-making challenges across a wide range of clinical domains.
Chiu J, Mittal R, Tumma N, Sharma A, Doshi-Velez F. A Joint Learning Approach for Semi-supervised Neural Topic Modeling, in Proceedings of the Sixth Workshop on Structured Prediction for NLP. Dublin, Ireland: Association for Computational Linguistics; 2022 :40–51.
Topic models are some of the most popular ways to represent textual data in an interpretable manner. Recently, advances in deep generative models, specifically auto-encoding variational Bayes (AEVB), have led to the introduction of unsupervised neural topic models, which leverage deep generative models as opposed to traditional statistics-based topic models. We extend these neural topic models by introducing the Label-Indexed Neural Topic Model (LI-NTM), which is, to the best of our knowledge, the first effective upstream semi-supervised neural topic model. We find that LI-NTM outperforms existing neural topic models in document reconstruction benchmarks, with the most notable results in low labeled-data regimes and for datasets with informative labels; furthermore, our jointly learned classifier outperforms baseline classifiers in ablation studies.
Wu C, Parbhoo S, Havasi M, Doshi-Velez F. Learning Optimal Summaries of Clinical Time-series with Concept Bottleneck Models, in Proceedings of the 7th Machine Learning for Healthcare Conference. PMLR; 2022 :648–672.
Despite machine learning models’ state-of-the-art performance in numerous clinical prediction and intervention tasks, their complex black-box processes pose a great barrier to their real-world deployment. Clinical experts must be able to understand the reasons behind a model’s recommendation before taking action, as it is crucial to assess for criteria other than accuracy, such as trust, safety, fairness, and robustness. In this work, we enable human inspection of clinical time-series prediction models by learning concepts, or groupings of features into high-level clinical ideas such as illness severity or kidney function. We also propose an optimization method which then selects the most important features within each concept, learning a collection of sparse prediction models that are sufficiently expressive for examination. On a real-world task of predicting vasopressor onset in the ICU, our algorithm achieves predictive performance comparable to state-of-the-art deep learning models while learning concise groupings conducive to clinical inspection.
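The features-to-concepts-to-prediction structure can be sketched as a two-stage linear model with a bottleneck layer. The feature groupings, concept names, and sizes below are hypothetical stand-ins (e.g. a "kidney function" style concept built from a few columns), not the paper's learned model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pipeline: clinical features -> interpretable concepts -> outcome risk.
n_patients, n_features, n_concepts = 100, 6, 2
X = rng.normal(size=(n_patients, n_features))

# Each concept is a sparse linear summary of a disjoint feature group,
# so a clinician can inspect exactly which features feed each concept.
concept_weights = np.zeros((n_features, n_concepts))
concept_weights[:3, 0] = [0.5, 0.3, 0.2]   # concept 0 uses features 0-2 only
concept_weights[3:, 1] = [0.4, 0.4, 0.2]   # concept 1 uses features 3-5 only

def predict(X, concept_weights, head_weights):
    """Two-stage model: the concept layer is the bottleneck; the final
    prediction depends on features only through the concept scores."""
    concepts = X @ concept_weights
    logits = concepts @ head_weights
    return concepts, 1.0 / (1.0 + np.exp(-logits))

head = np.array([1.0, -1.0])
concepts, risk = predict(X, concept_weights, head)
```

Because the prediction is mediated entirely by the two concept scores, inspecting `concept_weights` and `head` is enough to audit the model, which is the property the paper exploits for clinical review.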
Tang S, Makar M, Sjoding M, Doshi-Velez F, Wiens J. Leveraging Factored Action Spaces for Efficient Offline Reinforcement Learning in Healthcare, in Decision Awareness in Reinforcement Learning Workshop at ICML 2022; 2022.
Many reinforcement learning (RL) applications have combinatorial action spaces, where each action is a composition of sub-actions. A standard RL approach ignores this inherent factorization structure, resulting in a potential failure to make meaningful inferences about rarely observed sub-action combinations; this is particularly problematic for offline settings, where data may be limited. In this work, we propose a form of linear Q-function decomposition induced by factored action spaces. We study the theoretical properties of our approach, identifying scenarios where it is guaranteed to lead to zero bias when used to approximate the Q-function. Outside the regimes with theoretical guarantees, we show that our approach can still be useful because it leads to better sample efficiency without necessarily sacrificing policy optimality, allowing us to achieve a better bias-variance trade-off. Across several offline RL problems using simulators and real-world datasets motivated by healthcare problems, we demonstrate that incorporating factored action spaces into value-based RL can result in better-performing policies. Our approach can help an agent make more accurate inferences within under-explored regions of the state-action space when applying RL to observational datasets.
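The linear Q-function decomposition over a factored action space can be sketched in a few lines: one small Q-table per sub-action dimension, with the joint Q-value as their sum. The tabular setting and sizes below are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, sub_action_sizes = 5, (2, 3)   # action = (a1, a2), 6 joint combos

# One Q-table per sub-action dimension (5x2 and 5x3 entries) instead of
# one table over the full combinatorial action space (5x6 entries).
q_factors = [rng.normal(size=(n_states, k)) for k in sub_action_sizes]

def q_value(s, action):
    """Linear decomposition: Q(s, (a1, a2)) = Q1(s, a1) + Q2(s, a2)."""
    return sum(q[s, a] for q, a in zip(q_factors, action))

def greedy_action(s):
    """Under the additive form, each sub-action can be maximized
    independently, avoiding a search over all joint combinations."""
    return tuple(int(q[s].argmax()) for q in q_factors)

s = 0
a_star = greedy_action(s)
all_actions = [(a1, a2) for a1 in range(2) for a2 in range(3)]
```

The efficiency gain is that rarely observed sub-action *combinations* still get value estimates, because each sub-action's table is updated whenever that sub-action appears in any combination; the paper characterizes when this additive form introduces zero bias.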
