Publications
Scroll to papers in year: 2024 2023 2022 2021 2020 2019 2018 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002 2001
Accepted Papers and Preprints
GLOW.AI members† Bryan's advisors+ Bryan's collaborators*Localized Zeroth-Order Prompt Optimization
Wenyang Hu†, Yao Shu†, Zongmin Yu, Zhaoxuan Wu†, Xiaoqiang Lin†, Zhongxiang Dai†, See-Kiong Ng*, and Bryan Kian Hsiang Low†
Annual Conference on Neural Information Processing Systems (NeurIPS), 2024
25.8% acceptance rate (spotlight presentation); also appeared in ICML 2024 Workshop on In-Context Learning
Abstract. The efficacy of large language models (LLMs) in understanding and generating natural language has aroused a wide interest in developing prompt-based methods to harness the power of black-box LLMs. Existing methodologies usually prioritize a global optimization for finding the global optimum, which however will perform poorly in certain tasks. This thus motivates us to re-think the necessity of finding a global optimum in prompt optimization. To answer this, we conduct a thorough empirical study on prompt optimization and draw two major insights. Contrasting with the rarity of global optimum, local optima are usually prevalent and well-performed, which can be more worthwhile for efficient prompt optimization (Insight I). The choice of the input domain, covering both the generation and the representation of prompts, affects the identification of well-performing local optima (Insight II). Inspired by these insights, we propose a novel algorithm, namely localized zeroth-order prompt optimization (ZOPO), which incorporates a Neural Tangent Kernel-based derived Gaussian process into standard zeroth-order optimization for an efficient search of well-performing local optima in prompt optimization. Remarkably, ZOPO outperforms existing baselines in terms of both the optimization performance and the query efficiency, which we demonstrate through extensive experiments.
DETAIL: Task DEmonsTration Attribution for Interpretable In-context Learning
Zijian Zhou†, Xiaoqiang Lin†, Xinyi Xu†, Alok Prakash*, Daniela Rus*, and Bryan Kian Hsiang Low†
Annual Conference on Neural Information Processing Systems (NeurIPS), 2024
25.8% acceptance rate; also appeared in ICML 2024 Workshop on In-Context Learning
Abstract. In-context learning (ICL) allows transformer-based language models that are pre-trained on general text to quickly learn a specific task with a few "task demonstrations" without updating their parameters, significantly boosting their flexibility and generality. ICL possesses many distinct characteristics from conventional machine learning, thereby requiring new approaches to interpret this learning paradigm. Taking the viewpoint of recent works showing that transformers learn in context by formulating an internal optimizer, we propose an influence function-based attribution technique, DETAIL, that addresses the specific characteristics of ICL. We empirically verify the effectiveness of our approach for demonstration attribution while being computationally efficient. Leveraging the results, we then show how DETAIL can help improve model performance in real-world scenarios through demonstration reordering and curation. Finally, we experimentally prove the wide applicability of DETAIL by showing our attribution scores obtained on white-box models are transferable to black-box models in improving model performance.
Prompt Optimization with EASE? Efficient Ordering-aware Automated Selection of Exemplars
Zhaoxuan Wu†, Xiaoqiang Lin†, Zhongxiang Dai†, Wenyang Hu†, Yao Shu†, See-Kiong Ng*, Patrick Jaillet*, and Bryan Kian Hsiang Low†
Annual Conference on Neural Information Processing Systems (NeurIPS), 2024
25.8% acceptance rate; also appeared in ICML 2024 Workshop on In-Context Learning
Abstract. Large language models (LLMs) have shown impressive capabilities in real-world applications. The capability of in-context learning (ICL) allows us to adapt an LLM to downstream tasks by including input-label exemplars in the prompt without model fine-tuning. However, the quality of these exemplars in the prompt greatly impacts performance, highlighting the need for an effective automated exemplar selection method. Recent studies have explored retrieval-based approaches to select exemplars tailored to individual test queries, which can be undesirable due to extra test-time computation and an increased risk of data exposure. Moreover, existing methods fail to adequately account for the impact of exemplar ordering on the performance. On the other hand, the impact of the instruction, another essential component in the prompt given to the LLM, is often overlooked in existing exemplar selection methods. To address these challenges, we propose a novel method named EASE, which leverages the hidden embedding from a pre-trained language model to represent ordered sets of exemplars and uses a neural bandit algorithm to optimize the sets of exemplars while accounting for exemplar ordering. Our EASE can efficiently find an ordered set of exemplars that performs well for all test queries from a given task, thereby eliminating test-time computation. Importantly, EASE can be readily extended to jointly optimize both the exemplars and the instruction. Through extensive empirical evaluations (including novel tasks), we demonstrate the superiority of EASE over existing methods, and reveal practical insights about the impact of exemplar selection on ICL, which may be of independent interest.
Data Distribution Valuation
Xinyi Xu†, Shuaiqi Wang, Chuan Sheng Foo*, Bryan Kian Hsiang Low†, and Giulia Fanti*
Annual Conference on Neural Information Processing Systems (NeurIPS), 2024
25.8% acceptance rate
Abstract. Data valuation is a class of techniques for quantitatively assessing the value of data for applications like pricing in data marketplaces. Existing data valuation methods define a value for a discrete dataset. However, in many use cases, users are interested in not only the value of the dataset, but that of the distribution from which the dataset was sampled. For example, consider a buyer trying to evaluate whether to purchase data from different vendors. The buyer may observe (and compare) only a small preview sample from each vendor, to decide which vendor's data distribution is most useful to the buyer and purchase. The core question is how should we compare the values of data distributions from their samples? Under a Huber characterization of the data heterogeneity across vendors, we propose a maximum-mean discrepancy (MMD)-based valuation method which enables theoretically principled and actionable policies for comparing data distributions from samples. We empirically demonstrate that our method is sample-efficient and effective in identifying valuable data distributions against several existing baselines, on multiple real-world datasets (e.g., network intrusion detection, credit card fraud detection) and downstream applications (classification, regression).
Gradient-Free Methods for Nonconvex Nonsmooth Stochastic Compositional Optimization
Zhuanghua Liu†, Luo Luo, and Bryan Kian Hsiang Low†
Annual Conference on Neural Information Processing Systems (NeurIPS), 2024
25.8% acceptance rate
Abstract. The stochastic compositional optimization (SCO) is popular in many real-world applications, including risk management, reinforcement learning, and meta-learning. However, most of the previous methods for SCO require the smoothness assumption on both the outer and inner functions, which limits their applications to a wider range of problems. In this paper, we study the SCO problem in that both the outer and inner functions are Lipschitz continuous but possibly nonconvex and nonsmooth. In particular, we propose gradient-free stochastic methods for finding the (_,_)-Goldstein stationary points of such problems with non-asymptotic convergence rates. Our results also lead to an improved convergence rate for the convex nonsmooth SCO problem. Furthermore, we conduct numerical experiments to demonstrate the effectiveness of the proposed methods.
Active Set Ordering
Quoc Phong Nguyen†, Sunil Gupta, Svetha Venkatesh, Bryan Kian Hsiang Low†, and Patrick Jaillet*
Annual Conference on Neural Information Processing Systems (NeurIPS), 2024
25.8% acceptance rate
Abstract. In this paper, we formalize the active set ordering problem, which involves actively discovering a set of inputs based on their orderings determined by expensive evaluations of a blackbox function. We then propose the mean prediction (MP) algorithm and theoretically analyze it in terms of the regret of predicted pairwise orderings between inputs. Notably, as a special case of this framework, we can cast Bayesian optimization as an active set ordering problem by recognizing that maximizers can be identified solely by comparison rather than by precisely estimating the function evaluations. As a result, we are able to construct the popular Gaussian process upper confidence bound (GP-UCB) algorithm through the lens of ordering with several nuanced insights. We empirically validate the performance of our proposed solution using various synthetic functions and real-world datasets.
Position Paper: Data-Centric AI in the Age of Large Language Models
Xinyi Xu†, Zhaoxuan Wu†, Rui Qiao†, Arun Verma†, Yao Shu†, Jingtan Wang†, Xinyuan Niu†, Zhenfeng He†, Jiangwei Chen†, Zijian Zhou†, Gregory Kang Ruey Lau†, Hieu Dao†, Lucas Agussurja†, Rachael Hwee Ling Sim†, Xiaoqiang Lin†, Wenyang Hu†, Zhongxiang Dai†, Pang Wei Koh*, and Bryan Kian Hsiang Low†
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
% acceptance rate (findings)
Abstract. This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs). We start by making the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs, and yet it receives disproportionally low attention from the research community. We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization. In each scenario, we underscore the importance of data, highlight promising research directions, and articulate the potential impacts on the research community and, where applicable, the society as a whole. For instance, we advocate for a suite of data-centric benchmarks tailored to the scale and complexity of data for LLMs. These benchmarks can be used to develop new data curation methods and document research efforts and results, which can help promote openness and transparency in AI and LLM research.
Waterfall: Framework for Robust and Scalable Text Watermarking
Gregory Kang Ruey Lau†, Xinyuan Niu†, Hieu Dao†, Jiangwei Chen†, Chuan Sheng Foo*, and Bryan Kian Hsiang Low†
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
% acceptance rate (main track); also appeared in ICML 2024 Workshop on Generative AI and Law; also appeared in ICML 2024 Workshop on Foundation Models in the Wild
Abstract. Protecting intellectual property (IP) of text such as articles and code is increasingly important, especially as sophisticated attacks become possible, such as paraphrasing by large language models (LLMs) or even unauthorized training of LLMs on copyrighted text to infringe such IP. However, existing text watermarking methods are not robust enough against such attacks nor scalable to millions of users for practical implementation. In this paper, we propose Waterfall, the first training-free framework for robust and scalable text watermarking applicable across multiple text types (e.g., articles, code) and languages supportable by LLMs, for general text and LLM data provenance. Waterfall comprises several key innovations, such as being the first to use LLM as paraphrasers for watermarking along with a novel combination of techniques that are surprisingly effective in achieving robust verifiability and scalability. We empirically demonstrate that Waterfall achieves significantly better scalability, robust verifiability, and computational efficiency compared to SOTA article-text watermarking methods, and also showed how it could be directly applied to the watermarking of code.
PIED: Physics-Informed Experimental Design For Inverse Problems
Apivich Hemachandra†, Gregory Kang Ruey Lau†, See-Kiong Ng*, and Bryan Kian Hsiang Low†
ICML Workshop on AI for Science: Scaling in AI for Scientific Discovery, 2024
Abstract. In many inverse problems (IPs) in science and engineering, optimization of design parameters (e.g., sensor placement) with experimental design (ED) methods is performed due to high data acquisition costs when conducting physical experiments, and often has to be done up front due to practical constraints on sensor deployments. However, existing ED methods are often challenging to use in practical PDE-based inverse problems due to significant computational bottlenecks during forward simulation and inverse parameter estimation. This paper presents Physics-Informed Experimental Design (PIED), the first ED framework that makes use of PINNs in a fully differentiable architecture to perform continuous optimization of design parameters for IPs. PIED utilizes techniques such as learning of a shared NN parameter initialization, and approximation of PINN training dynamics during the ED process, for better estimation of the inverse parameters. PIED selects the optimal design parameters for one-shot deployment, allows exploitation of parallel computation unlike existing methods, and is empirically shown to significantly outperform existing ED benchmarks in IPs for both finite-dimensional and function-valued inverse parameters given limited budget for observations.
Neural Dueling Bandits
Arun Verma†, Zhongxiang Dai†, Xiaoqiang Lin†, Patrick Jaillet*, and Bryan Kian Hsiang Low†
ICML Workshop on Foundations of Reinforcement Learning and Control: Connections and Perspectives, 2024
Abstract. Contextual dueling bandit is used to model the bandit problems, where a learner's goal is to find the best arm for a given context using observed noisy preference feedback over the selected arms for the past contexts. However, existing algorithms assume the reward function is linear, which can be complex and non-linear in many real-life applications like online recommendations or ranking web search results. To overcome this challenge, we use a neural network to estimate the reward function using preference feedback for the previously selected arms. We propose upper confidence bound- and Thompson sampling-based algorithms with sub-linear regret guarantees that efficiently select arms in each round. We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution. Experimental results on the problem instances derived from synthetic datasets corroborate our theoretical results.
Heterogeneous Federated Zeroth-Order Optimization using Gradient Surrogates
Yao Shu†, Xiaoqiang Lin†, Zhongxiang Dai†, and Bryan Kian Hsiang Low†
ICML Workshop on Differentiable Almost Everything: Differentiable Relaxations, Algorithms, Operators, and Simulators, 2024
Abstract. Federated optimization, an emerging paradigm that finds wide applications, e.g., federated learning, enables multiple clients (e.g., edge devices) to collaboratively optimize a global function by sharing their local gradients. However, the gradient information is not available in many applications, giving rise to the paradigm of federated zeroth-order optimization (ZOO). Existing federated ZOO algorithms typically suffer from the limitations of query and communication round inefficiency, which can be attributed to (a) their reliance on a substantial number of function queries for gradient estimation and (b) the significant disparity between their realized local updates and the intended global updates caused by client heterogeneity. To this end, we (a) introduce trajectory-informed gradient surrogates which are capable of using the history of function queries during optimization for accurate and query-efficient gradient estimation, and (b) develop the technique of adaptive gradient correction using these surrogates to mitigate the aforementioned disparity. With these, we propose the federated zeroth-order optimization using gradient surrogates (FZooS) algorithm for query- and communication round-efficient heterogeneous federated ZOO, which is supported by our theoretical analyses and extensive experiments.
TRACE: TRansformer-based Attribution using Contrastive Embeddings in LLMs
Cheng Wang, Xinyang Lu†, See-Kiong Ng*, and Bryan Kian Hsiang Low†
arXiv:2407.04981, 2024
Abstract. The rapid evolution of large language models (LLMs) represents a substantial leap forward in natural language understanding and generation. However, alongside these advancements come significant challenges related to the accountability and transparency of LLM responses. Reliable source attribution is essential to adhering to stringent legal and regulatory standards, including those set forth by the General Data Protection Regulation. Despite the well-established methods in source attribution within the computer vision domain, the application of robust attribution frameworks to natural language processing remains underexplored. To bridge this gap, we propose a novel and versatile TRansformer-based Attribution framework using Contrastive Embeddings called TRACE that, in particular, exploits contrastive learning for source attribution. We perform an extensive empirical evaluation to demonstrate the performance and efficiency of TRACE in various settings and show that TRACE significantly improves the ability to attribute sources accurately, making it a valuable tool for enhancing the reliability and trustworthiness of LLMs.
Prompt Optimization with Human Feedback
Xiaoqiang Lin†, Zhongxiang Dai†, Arun Verma†, See-Kiong Ng*, Patrick Jaillet*, and Bryan Kian Hsiang Low†
ICML Workshop on Models of Human Feedback for AI Alignment, 2024
(oral presentation)
Abstract. Large language models (LLMs) have shown impressive capabilities in real-world applications. The capability of in-context learning (ICL) allows us to adapt an LLM to downstream tasks by including input-label exemplars in the prompt without model fine-tuning. However, the quality of these exemplars in the prompt greatly impacts performance, highlighting the need for an effective automated exemplar selection method. Recent studies have explored retrieval-based approaches to select exemplars tailored to individual test queries, which can be undesirable due to extra test-time computation and an increased risk of data exposure. Moreover, existing methods fail to adequately account for the impact of exemplar ordering on the performance. On the other hand, the impact of the instruction, another essential component in the prompt given to the LLM, is often overlooked in existing exemplar selection methods. To address these challenges, we propose a novel method named EASE, which leverages the hidden embedding from a pre-trained language model to represent ordered sets of exemplars and uses a neural bandit algorithm to optimize the sets of exemplars while accounting for exemplar ordering. Our EASE can efficiently find an ordered set of exemplars that performs well for all test queries from a given task, thereby eliminating test-time computation. Importantly, EASE can be readily extended to jointly optimize both the exemplars and the instruction. Through extensive empirical evaluations (including novel tasks), we demonstrate the superiority of EASE over existing methods, and reveal practical insights about the impact of exemplar selection on ICL, which may be of independent interest.
Deletion-Anticipative Data Selection with a Limited Budget
Rachael Hwee Ling Sim†, Jue Fan, Xiao Tian, Patrick Jaillet*, and Bryan Kian Hsiang Low†
International Conference on Machine Learning (ICML), 2024
27.5% acceptance rate
Abstract. Supervised data subset selection and active learning have often been used in data acquisition to select a smaller training set and reduce the time and cost of training machine learning (ML) models. These methods assume the selected training set to remain available throughout the ML model deployment. Such an assumption is increasingly being challenged as data owners, enabled by the GDPR's right to erasure, request deletions of their data. This raises an important question: During data acquisition of a training set of size k, how can a learner proactively maximize the data utility after future unknown deletions? We propose that the learner anticipates/estimates the probability that (i) each data owner in the feasible set will independently delete its data or (ii) a number of deletions occur out of k, and justify our proposal with concrete real-world use cases. Then, instead of directly maximizing the data utility function, the learner should maximize the expected or risk-averse utility based on the anticipated probabilities. We further propose how to construct these deletion-anticipative data selection (DADS) maximization objectives to preserve properties like monotone submodularity and near-optimality of greedy solutions, optimize the objectives efficiently, and empirically evaluate DADS' performance on real-world datasets.
Helpful or Harmful Data? Fine-tuning-free Shapley Attribution for Explaining Language Model Predictions
Jingtan Wang†, Xiaoqiang Lin†, Rui Qiao†, Chuan Sheng Foo*, and Bryan Kian Hsiang Low†
International Conference on Machine Learning (ICML), 2024
27.5% acceptance rate; also appeared in ICML 2024 Workshop on Data-Centric Machine Learning Research
Abstract. The increasing complexity of foundational models underscores the necessity for explainability, particularly for fine-tuning, the most widely used training method for adapting models to downstream tasks. Instance attribution, one type of explanation, attributes the model prediction to each training example by an instance score. However, the robustness of instance scores, specifically towards dataset resampling, has been overlooked. To bridge this gap, we propose a notion of robustness on the sign of the instance score. We theoretically and empirically demonstrate that the popular leave-one-out-based methods lack robustness, while the Shapley value behaves significantly better, but at a higher computational cost. Accordingly, we introduce an efficient fine-tuning-free approximation of the Shapley value (FreeShap) for instance attribution based on the neural tangent kernel. We empirically demonstrate that FreeShap outperforms other methods for instance attribution and other data-centric applications such as data removal, data selection, and wrong label detection, and further generalize our scale to large language models (LLMs).
Towards AutoAI: Optimizing a Machine Learning System with Black-box and Differentiable Components
Zhiliang Chen†, Chuan Sheng Foo*, and Bryan Kian Hsiang Low†
International Conference on Machine Learning (ICML), 2024
27.5% acceptance rate
Abstract. Machine learning (ML) models in the real world typically do not exist in isolation. They are usually part of a complex system (e.g., healthcare systems, self-driving cars) containing multiple ML and black-box components. The problem of optimizing such systems, which we refer to as automated AI (AutoAI), requires us to jointly train all ML components together and presents a significant challenge because the number of system parameters is extremely high and the system has no analytical form. To circumvent this, we introduce a novel algorithm called A-BAD-BO which uses each ML component's local loss as an auxiliary indicator for system performance. A-BAD-BO uses Bayesian optimization (BO) to optimize the local loss configuration of a system in a smaller dimensional space and exploits the differentiable structure of ML components to recover optimal system parameters from the optimized configuration. We show A-BAD-BO converges to optimal system parameters by showing that it is asymptotically no regret. We use A-BAD-BO to optimize several synthetic and real-world complex systems, including a prompt engineering pipeline for large language models containing millions of system parameters. Our results demonstrate that A-BAD-BO yields better system optimality than gradient-driven baselines and is more sample-efficient than pure BO algorithms.
Zeroth-Order Methods for Constrained Nonconvex Nonsmooth Stochastic Optimization
Zhuanghua Liu†, Cheng Chen, Luo Luo, and Bryan Kian Hsiang Low†
International Conference on Machine Learning (ICML), 2024
1.52% acceptance rate (oral presentation)
Abstract. This paper studies the problem of solving nonconvex nonsmooth optimization over a closed convex set. Most previous works tackle such problems by transforming the constrained problem into an unconstrained problem that can be solved by the techniques developed in the unconstrained setting. However, they only provide asymptotic convergence analysis for their methods. In this work, we provide the non-asymptotic analysis for solving constrained nonconvex nonsmooth optimization. We first generalize classical gradient mapping and the Frank-Wolfe gap in the nonsmooth setting. Then we introduce novel notions of approximate stationarity concerning such generalized quantities. We also propose several stochastic zeroth-order algorithms for the problem, along with their non-asymptotic convergence guarantees of obtaining the proposed approximate stationarity. Finally, we conduct numerical experiments that demonstrate the effectiveness of our algorithms.
Distributionally Robust Data Valuation
Xiaoqiang Lin†, Xinyi Xu†, Zhaoxuan Wu†, See-Kiong Ng*, and Bryan Kian Hsiang Low†
International Conference on Machine Learning (ICML), 2024
27.5% acceptance rate
Abstract. Data valuation quantifies the contribution of each data point to the performance of a machine learning model. Existing works typically define the value of data by its improvement of the validation performance of the trained model. However, this approach can be impractical to apply in collaborative machine learning and data marketplace since it is difficult for the parties/buyers to agree on a common validation dataset or determine the exact validation distribution a priori. To address this, we propose a distributionally robust data valuation approach to perform data valuation without known/fixed validation distributions. Our approach defines the value of data by its improvement of the distributionally robust generalization error (DRGE), thus providing a worst-case performance guarantee without a known/fixed validation distribution. However, since computing DRGE directly is infeasible, we propose using model deviation as a proxy for the marginal improvement of DRGE (for kernel regression and neural networks) to compute data values. Furthermore, we identify a notion of uniqueness where low uniqueness characterizes low-value data. We empirically demonstrate that our approach outperforms existing data valuation approaches in data selection and data removal tasks on real-world datasets (e.g., housing price prediction, diabetes hospitalization prediction).
Use Your INSTINCT: INSTruction optimization for LLMs usIng Neural bandits Coupled with Transformers
Xiaoqiang Lin†, Zhaoxuan Wu†, Zhongxiang Dai†, Wenyang Hu†, Yao Shu†, See-Kiong Ng*, Patrick Jaillet*, and Bryan Kian Hsiang Low†
International Conference on Machine Learning (ICML), 2024
27.5% acceptance rate; also appeared in NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following
Abstract. Large language models (LLMs) have shown remarkable instruction-following capabilities and achieved impressive performances in various applications. However, the performances of LLMs depend heavily on the instructions given to them, which are typically manually tuned with substantial human efforts. Recent work has used the query-efficient Bayesian optimization (BO) algorithm to automatically optimize the instructions given to black-box LLMs. However, BO usually falls short when optimizing highly sophisticated (e.g., high-dimensional) objective functions, such as the functions mapping an instruction to the performance of an LLM. This is mainly due to the limited expressive power of the Gaussian process (GP) which is used by BO as a surrogate to model the objective function. Meanwhile, it has been repeatedly shown that neural networks (NNs), especially pre-trained transformers, possess strong expressive power and can model highly complex functions. So, we adopt a neural bandit algorithm which replaces the GP in BO by an NN surrogate to optimize instructions for black-box LLMs. More importantly, the neural bandit algorithm allows us to naturally couple the NN surrogate with the hidden representation learned by a pre-trained transformer (i.e., an open-source LLM), which significantly boosts its performance. These motivate us to propose our INSTruction optimization usIng Neural bandits Coupled with Transformers (INSTINCT) algorithm. We perform instruction optimization for ChatGPT and use extensive experiments to show that INSTINCT consistently outperforms baselines in different tasks, e.g., various instruction induction tasks and the task of improving zero-shot chain-of-thought instructions.
On Newton's Method to Unlearn Neural Networks
Nhung Bui, Xinyang Lu†, See-Kiong Ng*, and Bryan Kian Hsiang Low†
arXiv:2406.14507, 2024
Abstract. Machine unlearning facilitates personal data ownership, including the "right to be forgotten". The proliferation of applications of neural networks (NNs) trained on users' personal data calls for the need to develop algorithms to unlearn an NN. Since retraining is costly, efficiency is often achieved through approximate unlearning which aims to unlearn a trained NN to be close to the retrained one (in distribution). Though the Newton's method has been used by previous works to approximately unlearn linear models, adapting it for unlearning an NN often encounters degenerate Hessians that make computing the Newton's update impossible. In this paper, we will first show that when coupled with naive yet often effective solutions to mitigate the degeneracy issue for unlearning, the Newton's method surprisingly suffers from catastrophic forgetting. To overcome this difficulty, we revise the Newton's method to include a theoretically justified regularizer and propose a cubic-regularized Newton's method for unlearning an NN. The cubic regularizer comes with the benefits of not requiring manual finetuning and affording a natural interpretation. Empirical evaluation on several models and real-world datasets shows that our method is more resilient to catastrophic forgetting and performs better than the baselines, especially in sequential unlearning.
PINNACLE: PINN Adaptive ColLocation and Experimental points selection
Gregory Kang Ruey Lau†, Apivich Hemachandra†, See-Kiong Ng*, and Bryan Kian Hsiang Low†
International Conference on Learning Representations (ICLR), 2024
5% acceptance rate (spotlight presentation); also appeared in ICML 2024 Workshop on AI for Science: Scaling in AI for Scientific Discovery (best paper award, oral presentation); also appeared in NeurIPS 2023 Workshop on Adaptive Experimental Design and Active Learning in the Real World
Abstract. Physics-Informed Neural Networks (PINNs), which incorporate PDEs as soft constraints, train with a composite loss function that contains multiple training point types: different types of collocation points chosen during training to enforce each PDE and initial/boundary conditions, and experimental points which are usually costly to obtain via experiments or simulations. Training PINNs using this loss function is challenging as it typically requires selecting large numbers of points of different types, each with different training dynamics. Unlike past works that focused on the selection of either collocation or experimental points, this work introduces PINN Adaptive ColLocation and Experimental points selection (PINNACLE), the first algorithm that jointly optimizes the selection of all training point types, while automatically adjusting the proportion of collocation point types as training progresses. PINNACLE uses information on the interactions among training point types, which had not been considered before, based on an analysis of PINN training dynamics via the Neural Tangent Kernel (NTK). We theoretically show that the criterion used by PINNACLE is related to the PINN generalization error, and empirically demonstrate that PINNACLE is able to outperform existing point selection methods for forward, inverse, and transfer learning problems.
Robustifying and Boosting Training-Free Neural Architecture Search
Zhenfeng He†, Yao Shu†, Zhongxiang Dai†, and Bryan Kian Hsiang Low†
International Conference on Learning Representations (ICLR), 2024
31% acceptance rate
Abstract. Neural architecture search (NAS) has become a key component of AutoML and a standard tool to automate the design of deep neural networks. Recently, training-free NAS as an emerging paradigm has successfully reduced the search costs of standard training-based NAS by estimating the true architecture performance with only training-free metrics. Nevertheless, the estimation ability of these metrics typically varies across different tasks, making it challenging to achieve robust and consistently good search performance on diverse tasks with only a single training-free metric. Meanwhile, the estimation gap between training-free metrics and the true architecture performances limits training-free NAS to achieve superior performance. To address these challenges, we propose the robustifying and boosting training-free NAS (RoBoT) algorithm which (a) employs the optimized combination of existing training-free metrics explored from Bayesian optimization to develop a robust and consistently better-performing metric on diverse tasks, and (b) applies greedy search, i.e., the exploitation, on the newly developed metric to bridge the aforementioned gap and consequently to boost the search performance of standard training-free NAS further. Remarkably, the expected performance of our RoBoT can be theoretically guaranteed, which improves over the existing training-free NAS under mild conditions with additional interesting insights. Our extensive experiments on various NAS benchmark tasks yield substantial empirical evidence to support our theoretical results.
Incentive-Aware Federated Learning with Training-Time Model Rewards
Zhaoxuan Wu†, Mohammad Mohammadi Amiri, Ramesh Raskar, and Bryan Kian Hsiang Low†
International Conference on Learning Representations (ICLR), 2024
31% acceptance rate
Abstract. In federated learning (FL), incentivizing contributions of training resources (e.g., data, compute) from potentially competitive clients is crucial. Existing incentive mechanisms often distribute post-training monetary rewards, which suffer from practical challenges of timeliness and feasibility of the rewards. Rewarding the clients after the completion of training may incentivize them to abort the collaboration, and monetizing the contribution is challenging in practice. To address these problems, we propose an incentive-aware algorithm that offers differentiated training-time model rewards for each client at each FL iteration. We theoretically prove that such a local design ensures the global objective of client incentivization. Through theoretical analyses, we further identify the issue of error propagation in model rewards and thus propose a stochastic reference-model recovery strategy to ensure theoretically that all the clients eventually obtain the optimal model in the limit. We perform extensive experiments to demonstrate the superior incentivizing performance of our method compared to existing baselines.
Understanding Domain Generalization: A Noise Robustness Perspective
Rui Qiao†, and Bryan Kian Hsiang Low†
International Conference on Learning Representations (ICLR), 2024
31% acceptance rate
Abstract. Despite the rapid development of machine learning algorithms for domain generalization (DG), there is no clear empirical evidence that the existing DG algorithms outperform the classic empirical risk minimization (ERM) across standard benchmarks. To better understand this phenomenon, we investigate whether there are benefits of DG algorithms over ERM through the lens of label noise. Specifically, our finite-sample analysis reveals that label noise exacerbates the effect of spurious correlations for ERM, undermining generalization. Conversely, we illustrate that DG algorithms exhibit implicit label-noise robustness during finite-sample training even when spurious correlation is present. Such desirable property helps mitigate spurious correlations and improve generalization in synthetic experiments. However, additional comprehensive experiments on real-world benchmark datasets indicate that label-noise robustness does not necessarily translate to better performance compared to ERM. We conjecture that the failure mode of ERM arising from spurious correlations may be less prevalent in practice.
A Unified Framework for Bayesian Optimization under Contextual Uncertainty
Sebastian Shenghong Tay†, Chuan Sheng Foo*, Daisuke Urano*, Richalynn Leong*, and Bryan Kian Hsiang Low†
International Conference on Learning Representations (ICLR), 2024
31% acceptance rate
Abstract. Bayesian optimization under contextual uncertainty (BOCU) is a family of BO problems in which the learner makes a decision prior to observing the context and must manage the risks involved. Distributionally robust BO (DRBO) is a subset of BOCU that affords robustness against context distribution shift, and includes the optimization of expected values and worst-case values as special cases. By considering the first derivatives of the DRBO objective, we generalize DRBO to one that includes several other uncertainty objectives studied in the BOCU literature such as worst-case sensitivity (and thus notions of risk such as variance, range, and conditional value-at-risk) and mean-risk tradeoffs. We develop a general Thompson sampling algorithm that is able to optimize any objective within the BOCU framework, analyze its theoretical properties, and compare it to suitable baselines across different experimental settings and uncertainty objectives.
Optimistic Bayesian Optimization with Unknown Constraints
Quoc Phong Nguyen†, Wan Theng Ruth Chew, Le Song, Bryan Kian Hsiang Low†, and Patrick Jaillet*
International Conference on Learning Representations (ICLR), 2024
31% acceptance rate
Abstract. Though some research efforts have been dedicated to constrained Bayesian optimization (BO), there remains a notable absence of a principled approach with a theoretical performance guarantee in the decoupled setting. Such a setting involves independent evaluations of the objective function and constraints at different inputs, and is hence a relaxation of the commonly-studied coupled setting where functions must be evaluated together. As a result, the decoupled setting requires an adaptive selection between evaluating either the objective function or a constraint, in addition to selecting an input (in the coupled setting). This paper presents a novel constrained BO algorithm with a provable performance guarantee that can address the above relaxed setting. Specifically, it considers the fundamental trade-off between exploration and exploitation in constrained BO, and, interestingly, affords a noteworthy connection to active learning. The performance of our proposed algorithms is also empirically evaluated using several synthetic and real-world optimization problems.
Meta-VBO: Utilizing Prior Tasks in Optimizing Risk Measures with Gaussian Processes
Quoc Phong Nguyen†, Bryan Kian Hsiang Low†, and Patrick Jaillet*
International Conference on Learning Representations (ICLR), 2024
31% acceptance rate
Abstract. Research on optimizing the risk measure of a blackbox function using Gaussian processes, especially Bayesian optimization (BO) of risk measures, has become increasingly important due to the inevitable presence of uncontrollable variables in real-world applications. Nevertheless, existing works on BO of risk measures start the optimization from scratch for every new task without considering the results of previous tasks. In contrast, its vanilla BO counterpart has received a thorough investigation on utilizing previous tasks to speed up the current task through the body of works on meta-BO which, however, have not considered risk measures. To bridge this gap, this paper presents the first algorithm for meta-BO of risk measures (i.e., value-at-risk (VaR) and the conditional VaR) by introducing a novel adjustment to the upper confidence bound acquisition function. Our proposed algorithm exhibits two desirable properties: (i) invariance to scaling and vertical shifting of the blackbox function and (ii) robustness to previous harmful tasks. We provide a theoretical performance guarantee for our algorithm and empirically demonstrate its performance using several synthetic function benchmarks and real-world objective functions.
Fairness in Federated Learning
Xiaoqiang Lin†, Xinyi Xu†, Zhaoxuan Wu†, Rachael Hwee Ling Sim†, See-Kiong Ng*, Chuan Sheng Foo*, Patrick Jaillet*, Trong Nghia Hoang†, and Bryan Kian Hsiang Low†
L. M. Nguyen, T. N. Hoang, P.-Y. Chen, editors, Federated Learning: Theory and Practice, chapter 8, Academic Press, pages 143-160, 2024
Abstract. Federated learning (FL) enables a form of collaboration among multiple clients in jointly learning a machine learning (ML) model without centralizing their local datasets. Like in any collaboration, it is imperative to guarantee fairness so that the clients are willing to participate. For instance, it is unfair if one client benefits significantly more than others, or if some client benefits disproportionately to its contribution in the collaboration. Additionally, it is also unfair if the ML model makes biased predictions against certain groups of clients. This chapter discusses three specific notions of fairness by highlighting their motivations from real-world use-cases, examining several specific definitions for each notion and lastly describing the corresponding algorithms proposed to achieve each notion of fairness. At the end, this chapter will also summarize the identified gaps in current research efforts into open problems.
Federated Sequential Decision Making: Bayesian Optimization, Reinforcement Learning, and Beyond
Zhongxiang Dai†, Flint Xiaofeng Fan, Cheston Tan*, Trong Nghia Hoang†, Bryan Kian Hsiang Low†, and Patrick Jaillet*
L. M. Nguyen, T. N. Hoang, P.-Y. Chen, editors, Federated Learning: Theory and Practice, chapter 14, Academic Press, pages 257-279, 2024
Abstract. Federated learning (FL) in its classic form involves the collaborative training of supervised learning models (e.g., neural networks) among multiple agents/clients. However, in addition to supervised learning, many other machine learning tasks which are inherently sequential decision-making problems, such as Bayesian optimization (BO) and reinforcement learning (RL), also find important applications in the federated setting. For example, the crucial problem of hyperparameter tuning of neural networks in the federated setting calls for algorithms for federated BO; collaborative clinical treatment recommendation among multiple hospitals is a natural application for federated RL. However, the extension of these classic sequential decision-making algorithms into the federated setting is faced with immense challenges. Firstly, these algorithms (e.g., BO and RL) have to be adapted to satisfy the core principles of FL. For example, consistent with the requirement of FL, the raw data (e.g., the history of observations in BO and the trajectories in RL) of every agent can never be shared with the other agents. Next, it is challenging to preserve the rigorous theoretical guarantees of these classic sequential decision-making algorithms (e.g., the sub-linear regret upper bound of classic BO algorithms and the sample complexity of classic policy gradient algorithms for RL) and at the same time consistently improve their empirical performances by leveraging the federation of multiple agents. In this regard, a number of recent works have tackled these challenges and hence introduced federated versions of classic sequential decision-making algorithms (e.g., federated BO and federated RL algorithms) which satisfy the core principles of FL and are both theoretically grounded and practically effective. In light of these recent advances, this chapter discusses federated sequential decision-making problems with a focus on recent representative works on federated BO and federated RL, and describes open problems and potential future directions in these areas.
Data Valuation in Federated Learning
Zhaoxuan Wu†, Xinyi Xu†, Rachael Hwee Ling Sim†, Yao Shu†, Xiaoqiang Lin†, Lucas Agussurja†, Zhongxiang Dai†, See-Kiong Ng*, Chuan Sheng Foo*, Patrick Jaillet*, Trong Nghia Hoang†, and Bryan Kian Hsiang Low†
L. M. Nguyen, T. N. Hoang, P.-Y. Chen, editors, Federated Learning: Theory and Practice, chapter 15, Academic Press, pages 281-296, 2024
Abstract. Federated learning (FL) has become an increasingly popular solution paradigm for enabling collaborative machine learning (CML) in which multiple clients can collaboratively train a common model without sharing their private training data with others. However, broad adoption of FL in practice is still limited, as clients can often be reluctant to participate in such federated effort unless their contributions are accurately recognized and fairly compensated. Data valuation is thus extensively required to measure the relative contributions among clients. In this chapter, we review data valuation methods in the conventional supervised CML setting, followed by extensions to the FL paradigm. To better address the challenge that the private data from local clients cannot be made available to the server, we further discuss many specialized data valuation methods developed for both horizontal and vertical FL in detail. Overall, this chapter aims to provide a comprehensive suite of data valuation tools to empower FL practitioners in various practical scenarios.
Incentives in Federated Learning
Rachael Hwee Ling Sim†, Sebastian Shenghong Tay†, Xinyi Xu†, Yehong Zhang†, Zhaoxuan Wu†, Xiaoqiang Lin†, See-Kiong Ng*, Chuan Sheng Foo*, Patrick Jaillet*, Trong Nghia Hoang†, and Bryan Kian Hsiang Low†
L. M. Nguyen, T. N. Hoang, P.-Y. Chen, editors, Federated Learning: Theory and Practice, chapter 16, Academic Press, pages 299-309, 2024
Abstract. This chapter explores incentive schemes that encourage clients to participate in federated learning (FL) and contribute more valuable data. Such schemes are important to enable collaboration in competitive situations where clients need justifiable incentives to participate and benefit others with information acquired at significant costs and resources, such as collecting and processing data, computing and communicating model updates, risking the privacy of data via shared model updates. Incentivization addresses these concerns through three key components: (1) fair contribution evaluation of each client's data, (2) client selection to maximize the utility of the global model, and (3) reward allocation to clients. Intuitively, clients desire higher valued rewards which should at least outweigh their costs. These and other requirements will be formally described as incentives. The chapter will also discuss some recent solutions and open problems to achieve these incentives in various settings, which includes settings where the contribution evaluation is declared or measured while the rewards can be monetary- or model-based.
DeRDaVa: Deletion-Robust Data Valuation for Machine Learning
Xiao Tian, Rachael Hwee Ling Sim†, Jue Fan, and Bryan Kian Hsiang Low†
AAAI Conference on Artificial Intelligence, pages 15373-15381, 2024
23.75% acceptance rate
Abstract. Data valuation is concerned with determining a fair valuation of data from data sources to compensate them or to identify training examples that are the most or least useful for predictions. With the rising interest in personal data ownership and data protection regulations, model owners will likely have to fulfil more data deletion requests. This raises issues that have not been addressed by existing works: Are the data valuation scores still fair with deletions? Must the scores be expensively recomputed? The answer is no. To avoid recomputations, we propose using our data valuation framework DeRDaVa upfront for valuing each data source's contribution to preserving robust model performance after anticipated data deletions. DeRDaVa can be efficiently approximated and will assign higher value to data that are more useful or less likely to be deleted. We further generalize DeRDaVa to Risk-DeRDaVa to cater to risk-averse/seeking model owners who are concerned with the worst/best-cases model utility. We also empirically demonstrate the practicality of our solutions.
Incremental Quasi-Newton Methods with Faster Superlinear Convergence Rates
Zhuanghua Liu†, Luo Luo, and Bryan Kian Hsiang Low†
AAAI Conference on Artificial Intelligence, pages 14097-14105, 2024
23.75% acceptance rate
Abstract. We consider the finite-sum optimization problem of the form minx ∈ Rd f(x) = (1/n) (f1(x)+ ... + fi(x) + ... + fn(x)) where each fi(.) is strongly convex and has Lipschitz continuous gradient and Hessian. The recently proposed incremental quasi-Newton method is based on BFGS update and achieves a local superlinear convergence rate of O((1 − (dκ)-1)⌈t/n⌉2) where κ is the condition number and t is the number of iterations. This paper proposes a more efficient quasi-Newton method by incorporating the Symmetric Rank-1 update into the incremental framework, which results in the better local convergence rate of O((1 − d-1)⌈t/n⌉2). Furthermore, we can boost our method by applying the block update on the Hessian approximation, which leads to the even faster convergence rate of O((1 − (kd)-1)⌈t/n⌉2) where k < d is the rank of update matrix. The numerical experiments show the proposed methods significantly outperform the baseline methods.
Decentralized Sum-of-Nonconvex Optimization
Zhuanghua Liu†, and Bryan Kian Hsiang Low†
AAAI Conference on Artificial Intelligence, pages 14088-14096, 2024
23.75% acceptance rate
Abstract. We consider the optimization problem of minimizing the sum-of-nonconvex function, i.e., a convex function that is the average of nonconvex components. The existing stochastic algorithms for such a problem only focus on a single machine and the centralized scenario. In this paper, we study the sum-of-nonconvex optimization in the decentralized setting. We present a new theoretical analysis of the PMGT-SVRG algorithm to this problem and prove the linear convergence of their approach. However, the convergence rate of the PMGT-SVRG algorithm has a linear dependency on the condition number, which is undesirable for the ill-conditioned problem. To remedy this issue, we propose an accelerated stochastic decentralized first-order algorithm by incorporating the techniques of acceleration, gradient tracking, and multi-consensus mixing into the SVRG algorithm. The convergence rate of the proposed method has a square-root dependency on the condition number. The numerical experiments validate the theoretical guarantee of our proposed algorithms on both synthetic and real-world datasets.
Quantum Bayesian Optimization
Zhongxiang Dai†, Gregory Kang Ruey Lau†, Arun Verma†, Yao Shu†, Bryan Kian Hsiang Low†, and Patrick Jaillet*
Annual Conference on Neural Information Processing Systems (NeurIPS), pages 20179-20207, 2023
26.1% acceptance rate
Abstract. Kernelized bandits, also known as Bayesian optimization (BO), has been a prevalent method for optimizing complicated black-box reward functions. Various BO algorithms have been theoretically shown to enjoy upper bounds on their cumulative regret which are sub-linear in the number T of iterations, and a regret lower bound of Omega(sqrt(T)) has been derived which represents the unavoidable regrets for any classical BO algorithm. Recent works on quantum bandits have shown that with the aid of quantum computing, it is possible to achieve tighter regret upper bounds better than their corresponding classical lower bounds. However, these works are restricted to either multi-armed or linear bandits, and are hence not able to solve sophisticated real-world problems with non-linear reward functions. To this end, we introduce the quantum-Gaussian process-upper confidence bound (Q-GP-UCB) algorithm. To the best of our knowledge, our Q-GP-UCB is the first BO algorithm able to achieve a regret upper bound of O(poly log T), which is significantly smaller than its regret lower bound of Omega(sqrt(T)) in the classical setting. Moreover, thanks to our novel analysis of the confidence ellipsoid, our Q-GP-UCB with the linear kernel achieves a smaller regret than the quantum linear UCB algorithm from the previous work. We use simulations to verify that the theoretical quantum speedup achieved by our Q-GP-UCB is also potentially relevant in practice.
Batch Bayesian Optimization for Replicable Experimental Design
Zhongxiang Dai†, Quoc Phong Nguyen†, Sebastian Shenghong Tay†, Daisuke Urano*, Richalynn Leong*, Bryan Kian Hsiang Low†, and Patrick Jaillet*
Annual Conference on Neural Information Processing Systems (NeurIPS), pages 36476-36506, 2023
26.1% acceptance rate
Abstract. Many real-world experimental design problems (a) evaluate multiple experimental conditions in parallel and (b) replicate each condition multiple times due to large and heteroscedastic observation noise. Given a fixed total budget, this naturally induces a trade-off between evaluating more unique conditions while replicating each of them fewer times vs. evaluating fewer unique conditions and replicating each more times. Moreover, in these problems, practitioners may be risk-averse and hence prefer an input with both good average performance and small variability. To tackle both challenges, we propose the Batch Thompson Sampling for Replicable Experimental Design (BTS-RED) framework, which encompasses three algorithms. Our BTS-RED-Known and BTS-RED-Unknown algorithms, for, respectively, known and unknown noise variance, choose the number of replications adaptively rather than deterministically such that an input with a larger noise variance is replicated more times. As a result, despite the noise heteroscedasticity, both algorithms enjoy a theoretical guarantee and are asymptotically no-regret. Our Mean-Var-BTS-RED algorithm aims at risk-averse optimization and is also asymptotically no-regret. We also show the effectiveness of our algorithms in two practical real-world applications: precision agriculture and AutoML.
Exploiting Correlated Auxiliary Feedback in Parameterized Bandits
Arun Verma†, Zhongxiang Dai†, Yao Shu†, and Bryan Kian Hsiang Low†
Annual Conference on Neural Information Processing Systems (NeurIPS), pages 4430-4451, 2023
26.1% acceptance rate
Abstract. We study a novel variant of the parameterized bandits problem in which the learner can observe auxiliary feedback that is correlated with the observed reward. The auxiliary feedback is readily available in many real-life applications, e.g., an online platform that wants to recommend the best-rated services to its users can observe the user's rating of service (rewards) and collect additional information like service delivery time (auxiliary feedback). We first develop a method that exploits auxiliary feedback to build a reward estimator with tight confidence bounds, leading to a smaller regret. We then characterize the regret reduction in terms of the correlation coefficient between reward and auxiliary feedback. Experimental results in different settings also verify the performance gain achieved by our proposed method.
Incentives in Private Collaborative Machine Learning
Rachael Hwee Ling Sim†, Yehong Zhang†, Trong Nghia Hoang†, Xinyi Xu†, Bryan Kian Hsiang Low†, and Patrick Jaillet*
Annual Conference on Neural Information Processing Systems (NeurIPS), pages 7555-7593, 2023
26.1% acceptance rate
Abstract. Collaborative machine learning involves training models on data from multiple parties but must incentivize their participation. Existing data valuation methods fairly value and reward each party based on shared data or model parameters but neglect the privacy risks involved. To address this, we introduce differential privacy (DP) as an incentive. Each party can select its required DP guarantee and perturb its sufficient statistic (SS) accordingly. The mediator values the perturbed SS by the Bayesian surprise it elicits about the model parameters. As our valuation function enforces a privacy-valuation trade-off, parties are deterred from selecting excessive DP guarantees that reduce the utility of the grand coalition's model. Finally, the mediator rewards each party with different posterior samples of the model parameters. Such rewards still satisfy existing incentives like fairness but additionally preserve DP and a high similarity to the grand coalition's posterior. We empirically demonstrate the effectiveness and practicality of our approach on synthetic and real-world datasets.
Model Shapley: Equitable Model Valuation with Black-box Access
Xinyi Xu†, Chi Thanh Lam†, Chuan Sheng Foo*, and Bryan Kian Hsiang Low†
Annual Conference on Neural Information Processing Systems (NeurIPS), pages 43254-43283, 2023
26.1% acceptance rate
Abstract. Valuation methods of data and machine learning (ML) models are essential to the establishment of AI marketplaces. Importantly, certain practical considerations (e.g., operational constraints, legal restrictions) favor the use of model valuation over data valuation. Also, existing marketplaces that involve trading of pre-trained ML models call for an equitable model valuation method to price them. In particular, we investigate the black-box access setting which allows querying a model (to observe predictions) without disclosing model-specific information (e.g., architecture and parameters). By exploiting a Dirichlet abstraction of a model's predictions, we propose a novel and equitable model valuation method called model Shapley. We also leverage a formal connection between the similarity in models and that in model Shapley values (MSVs) to devise a learning approach for predicting MSVs of many vendors' models (e.g., 150) in a large-scale marketplace. We perform extensive empirical validation on the effectiveness of model Shapley using various real-world datasets and heterogeneous model types.
Bayesian Optimization with Cost-varying Variable Subsets
Sebastian Shenghong Tay†, Chuan Sheng Foo*, Daisuke Urano*, Richalynn Leong*, and Bryan Kian Hsiang Low†
Annual Conference on Neural Information Processing Systems (NeurIPS), pages 3008-3031, 2023
26.1% acceptance rate
Abstract. We introduce the problem of Bayesian optimization with cost-varying variable subsets (BOCVS) where in each iteration, the learner chooses a subset of query variables and specifies their values while the rest are randomly sampled. Each chosen subset has an associated cost. This presents the learner with the novel challenge of balancing between choosing more informative subsets for more directed learning versus leaving some variables to be randomly sampled to reduce incurred costs. This paper presents a novel Gaussian process upper confidence bound-based algorithm for solving the BOCVS problem that is provably no-regret. We analyze how the availability of cheaper control sets helps in exploration and reduces overall regret. We empirically show that our proposed algorithm can find significantly better solutions than comparable baselines with the same budget.
WASA: WAtermark-based Source Attribution for Large Language Model-Generated Data
Jingtan Wang†, Xinyang Lu†, Zitong Zhao†, Zhongxiang Dai†, Chuan Sheng Foo*, See-Kiong Ng*, and Bryan Kian Hsiang Low†
arXiv:2310.00646, 2023
Abstract. The impressive performances of large language models (LLMs) and their immense potential for commercialization have given rise to serious concerns over the intellectual property (IP) of their training data. In particular, the synthetic texts generated by LLMs may infringe the IP of the data being used to train the LLMs. To this end, it is imperative to be able to (a) identify the data provider who contributed to the generation of a synthetic text by an LLM (source attribution) and (b) verify whether the text data from a data provider has been used to train an LLM (data provenance). In this paper, we show that both problems can be solved by watermarking, i.e., by enabling an LLM to generate synthetic texts with embedded watermarks that contain information about their source(s). We identify the key properties of such watermarking frameworks (e.g., source attribution accuracy, robustness against adversaries), and propose a WAtermarking for Source Attribution (WASA) framework that satisfies these key properties due to our algorithmic designs. Our WASA framework enables an LLM to learn an accurate mapping from the texts of different data providers to their corresponding unique watermarks, which sets the foundation for effective source attribution (and hence data provenance). Extensive empirical evaluations show that our WASA framework achieves effective source attribution and data provenance.
Training-Free Neural Active Learning with Initialization-Robustness Guarantees
Apivich Hemachandra†, Zhongxiang Dai†, Jasraj Singh†, See-Kiong Ng*, and Bryan Kian Hsiang Low†
International Conference on Machine Learning (ICML), pages 12931-12971, 2023
27.9% acceptance rate
Abstract. Existing neural active learning algorithms have aimed to optimize the predictive performance of neural networks (NNs) by selecting data for labeling. However, other than a good predictive performance, being robust against random parameter initializations is also a crucial requirement in safety-critical applications. To this end, we introduce our expected variance with Gaussian processes (EV-GP) criterion for neural active learning, which is theoretically guaranteed to select data points that lead to trained NNs with both good predictive performances and initialization robustness. Importantly, our EV-GP criterion is training-free, i.e., it does not require any training of the NN during data selection, which makes it computationally efficient. We empirically demonstrate that our EV-GP criterion is highly correlated with both initialization robustness and generalization performance, and show that it consistently outperforms baseline methods in terms of both desiderata, especially in situations with limited initial data or large batch sizes.
Fair yet Asymptotically Equal Collaborative Learning
Xiaoqiang Lin†, Xinyi Xu†, See-Kiong Ng*, Chuan Sheng Foo*, and Bryan Kian Hsiang Low†
International Conference on Machine Learning (ICML), pages 21223-21259, 2023
27.9% acceptance rate
Abstract. In collaborative learning with streaming data, nodes (e.g., organizations) jointly and continuously learn a machine learning model by sharing the latest model updates computed from their latest streaming data. For the more resourceful nodes to be willing to share their model updates, they need to be fairly incentivized. This paper explores an incentive design that guarantees fairness so that nodes receive rewards commensurate to their contributions. Our approach leverages an explore-then-exploit formulation to estimate the nodes' contributions (i.e., exploration) for realizing our theoretically guaranteed fair incentives (i.e., exploitation). However, we observe a "rich get richer" phenomenon arising from the existing approaches to guarantee fairness and it discourages the participation of the less resourceful nodes. To remedy this, we additionally preserve asymptotic equality, i.e., less resourceful nodes achieve equal performance eventually to the more resourceful/"rich" nodes. We empirically demonstrate in two settings with real-world streaming data: federated online incremental learning and federated reinforcement learning, that our proposed approach outperforms existing baselines in fairness and learning performance while remaining competitive in preserving equality.
Collaborative Causal Inference with Fair Incentives
Rui Qiao†, Xinyi Xu†, and Bryan Kian Hsiang Low†
International Conference on Machine Learning (ICML), pages 28300-28320, 2023
27.9% acceptance rate
Abstract. Collaborative causal inference (CCI) aims to improve the estimation of the causal effect of treatment variables by utilizing data aggregated from multiple self-interested parties. However, since their source data are valuable proprietary assets that can be costly or tedious to obtain, every party has to be incentivized to be willing to contribute to the collaboration, for example, with a guaranteed fair and sufficiently valuable reward (than performing causal inference on its own). This paper presents a reward scheme designed using the unique statistical properties required by causal inference to guarantee certain desirable incentive criteria (such as fairness and benefit) for the parties based on their contributions. To achieve this, we first propose a data valuation method to value parties' data for CCI based on the distributional closeness of its resulting treatment effect estimate to that utilizing the aggregated data from all parties. Then, we show how to value the parties' rewards fairly based on a modified variant of the Shapley value arising from our proposed data valuation for CCI. Finally, the Shapley fair rewards are realized in the form of improved and stochastically perturbed treatment effect estimates to be returned to the parties. We empirically demonstrate the effectiveness of our reward scheme using simulated and real-world datasets.
FedHQL: Federated Heterogeneous Q-Learning
Flint Xiaofeng Fan, Yining Ma, Zhongxiang Dai†, Cheston Tan*, and Bryan Kian Hsiang Low†
International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 2810-2812, 2023
Abstract. Federated Reinforcement Learning (FedRL) encourages distributed agents to learn collectively from each other's experience to improve their performance without exchanging their raw trajectories. The existing work on FedRL assumes that all participating agents are homogeneous, which requires all agents to share the same policy parameterization (e.g., network architectures and training configurations). However, in real-world applications, agents are often in disagreement about the architecture and the parameters, possibly also because of disparate computational budgets. Because homogeneity is not given in practice, we introduce the problem setting of Federated Reinforcement Learning with Heterogeneous And bLack-box agEnts (FedRL-HALE). We present the unique challenges this new setting poses and propose the Federated Heterogeneous Q-Learning (FedHQL) algorithm that principally addresses these challenges. We empirically demonstrate the efficacy of FedHQL in boosting the sample efficiency of heterogeneous agents with distinct policy parameterization using standard RL tasks.
Federated Neural Bandits
Zhongxiang Dai†, Yao Shu†, Arun Verma†, Flint Xiaofeng Fan, Bryan Kian Hsiang Low†, and Patrick Jaillet*
International Conference on Learning Representations (ICLR), 2023
31.8% acceptance rate
Abstract. Recent works on neural contextual bandits have achieved compelling performances due to their ability to leverage the strong representation power of neural networks (NNs) for reward prediction. Many applications of contextual bandits involve multiple agents who collaborate without sharing raw observations, thus giving rise to the setting of federated contextual bandits}. Existing works on federated contextual bandits rely on linear or kernelized bandits, which may fall short when modeling complex real-world reward functions. So, this paper introduces the federated neural-upper confidence bound (FN-UCB) algorithm. To better exploit the federated setting, FN-UCB adopts a weighted combination of two UCBs: UCBa allows every agent to additionally use the observations from the other agents to accelerate exploration (without sharing raw observations), while UCBb uses an NN with aggregated parameters for reward prediction in a similar way to federated averaging for supervised learning. Notably, the weight between the two UCBs required by our theoretical analysis is amenable to an interesting interpretation, which emphasizes UCBa initially for accelerated exploration and relies more on UCBb later after enough observations have been collected to train the NNs for accurate reward prediction (i.e., reliable exploitation). We prove sub-linear upper bounds on both the cumulative regret and the number of communication rounds of FN-UCB, and empirically demonstrate its competitive performance.
Risk-Aware Reinforcement Learning with Coherent Risk Measures and Non-linear Function Approximation
Thanh Lam, Arun Verma†, Bryan Kian Hsiang Low†, and Patrick Jaillet*
International Conference on Learning Representations (ICLR), 2023
31.8% acceptance rate
Abstract. We study the risk-aware reinforcement learning (RL) problem in the episodic finite-horizon Markov decision process with unknown transition and reward functions. In contrast to the risk-neutral RL problem, we consider minimizing the risk of having low rewards, which arise due to the intrinsic randomness of the MDPs and imperfect knowledge of the model. Our work provides a unified framework to analyze the regret of risk-aware RL policy with coherent risk measures in conjunction with non-linear function approximation, which gives the first sub-linear regret bounds in the setting. Finally, we validate our theoretical results via empirical experiments on synthetic and real-world data.
Zeroth-Order Optimization with Trajectory-Informed Derivative Estimation
Yao Shu†, Zhongxiang Dai†, Weicong Sng†, Arun Verma†, Patrick Jaillet*, and Bryan Kian Hsiang Low†
International Conference on Learning Representations (ICLR), 2023
31.8% acceptance rate
Abstract. Zeroth-order (ZO) optimization, in which the derivative is unavailable, has recently succeeded in many important machine learning applications. Existing algorithms rely on finite difference (FD) methods for derivative estimation and gradient descent (GD)-based approaches for optimization. However, these algorithms suffer from query inefficiency because additional function queries are required for derivative estimation in their every GD update, which typically hinders their deployment in applications where every function query is expensive. To this end, we propose a trajectory-informed derivative estimation method which only uses the optimization trajectory (i.e., the history of function queries during optimization) and hence eliminates the need for additional function queries to estimate a derivative. Moreover, based on our derivative estimation, we propose the technique of dynamic virtual updates, which allows us to reliably perform multiple steps of GD updates without reapplying derivative estimation. Based on these two contributions, we introduce the zeroth-order optimization with trajectory-informed derivative estimation (ZoRD) algorithm for query-efficient ZO optimization. We theoretically demonstrate that our trajectory-informed derivative estimation and our ZoRD algorithm improve over existing approaches, which is then supported by our real-world experiments such as black-box adversarial attack, non-differentiable metric optimization and derivative-free reinforcement learning.
Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks
Tiedong Liu†, and Bryan Kian Hsiang Low†
arXiv:2305.14201, 2023
Abstract. We introduce Goat, a fine-tuned LLaMA model that significantly outperforms GPT-4 on a range of arithmetic tasks. Fine-tuned on a synthetically generated dataset, Goat achieves state-of-the-art performance on BIG-bench arithmetic sub-task. In particular, the zero-shot Goat-7B matches or even surpasses the accuracy achieved by the few-shot PaLM-540B. Surprisingly, Goat can achieve near-perfect accuracy on large-number addition and subtraction through supervised fine-tuning only, which is almost impossible with previous pretrained language models, such as Bloom, OPT, GPT-NeoX, etc. We attribute Goat's exceptional performance to LLaMA's consistent tokenization of numbers. To tackle more challenging tasks like large-number multiplication and division, we propose an approach that classifies tasks based on their learnability, and subsequently decomposes unlearnable tasks, such as multi-digit multiplication and division, into a series of learnable tasks by leveraging basic arithmetic principles. We thoroughly examine the performance of our model, offering a comprehensive evaluation of the effectiveness of our proposed decomposition steps. Additionally, Goat-7B can be easily trained using LoRA on a 24GB VRAM GPU, facilitating reproducibility for other researchers. We release our model, dataset, and the Python script for dataset generation.
No-regret Sample-efficient Bayesian Optimization for Finding Nash Equilibria with Unknown Utilities
Sebastian Shenghong Tay†, Quoc Phong Nguyen†, Chuan Sheng Foo*, and Bryan Kian Hsiang Low†
International Conference on Artificial Intelligence and Statistics (AISTATS), pages 3591-3619, 2023
29% acceptance rate
Abstract. The Nash equilibrium (NE) is a classic solution concept for normal-form games that is stable under potential unilateral deviations by self-interested agents. Bayesian optimization (BO) has been used to find NE in continuous general-sum games with unknown costly-to-sample utility functions in a sample-efficient manner. This paper presents the first no-regret BO algorithm that is sample-efficient in finding pure NE by leveraging theory on high probability confidence bounds with Gaussian processes and the maximum information gain of kernel functions. Unlike previous works, our algorithm is theoretically guaranteed to converge to the optimal solution (i.e., NE). We also introduce the novel setting of applying BO to finding mixed NE in unknown discrete general-sum games and show that our theoretical framework is general enough to be extended naturally to this setting by developing a no-regret BO algorithm that is sample-efficient in finding mixed NE. We empirically show that our algorithms are competitive w.r.t. suitable baselines in finding NE.
FAIR: Fair Collaborative Active Learning with Individual Rationality for Scientific Discovery
Xinyi Xu†, Zhaoxuan Wu†, Arun Verma†, Chuan Sheng Foo*, and Bryan Kian Hsiang Low†
International Conference on Artificial Intelligence and Statistics (AISTATS), pages 4033-4057, 2023
29% acceptance rate
Abstract. Scientific discovery aims to find new patterns and test specific hypotheses by analysing large-scale experimental data. However, various practical limitations, like high experimental costs or the inability to do some experiments, make it challenging for researchers to collect sufficient experimental data for successful scientific discovery. To this end, we propose a framework named collaborative active learning (CAL) that enables researchers to share their experimental data for mutual benefit. Specifically, our proposed coordinated acquisition function sets out to achieve individual rationality and fairness so that everyone can equitably benefit from collaboration. Finally, we empirically demonstrate that our method outperforms existing batch active learning methods adapted to the CAL setting in terms of both learning performance and fairness on various real-world scientific discovery datasets (biochemistry, material science, and physics).
Pruning during Training by Network Efficacy Modeling
Mohit Rajpal†, Yehong Zhang†, and Bryan Kian Hsiang Low†
Machine Learning (Special Issue on ECML-PKDD 2022 Journal Track), volume 112, issue 7, pages 2653-2684, 2023
Abstract. Deep neural networks (DNNs) are costly to train. Pruning, an approach to alleviate model complexity by zeroing out or pruning DNN elements, has shown promise in reducing training costs for DNNs with little to no efficacy at a given task. This paper presents a novel method to perform early pruning of DNN elements (e.g., neurons or convolutional filters) during the training process while minimizing losses to model performance. To achieve this, we model the efficacy of DNN elements in a Bayesian manner conditioned upon efficacy data collected during the training and prune DNN elements with low predictive efficacy after training completion. Empirical evaluations show that the proposed Bayesian early pruning improves the computational efficiency of DNN training while better preserving model performance compared to other tested pruning approaches.
Recursive Reasoning-based Training-time Adversarial Machine Learning
Yizhou Chen†, Zhongxiang Dai†, Haibin Yu†, Bryan Kian Hsiang Low†, and Teck-Hua Ho
Artificial Intelligence (Special Issue on Risk-Aware Autonomous Systems: Theory and Practice), volume 315, pages 103837, 2023
Abstract. The training process of a machine learning (ML) model may be subject to adversarial attacks from an attacker who attempts to undermine the test performance of the ML model by perturbing the training minibatches, and thus needs to be protected by a defender. Such a problem setting is referred to as training-time adversarial ML. We formulate it as a two-player game and propose a principled Recursive Reasoning-based Training-Time adversarial ML (R2T2) framework to model this game. R2T2 models the reasoning process between the attacker and the defender and captures their bounded reasoning capabilities (due to bounded computational resources) through the recursive reasoning formalism. In particular, we associate a deeper level of recursive reasoning with the use of a higher-order gradient to derive the attack (defense) strategy, which naturally improves its performance while requiring greater computational resources. Interestingly, our R2T2 framework encompasses a variety of existing adversarial ML methods which correspond to attackers (defenders) with different recursive reasoning capabilities. We show how an R2T2 attacker (defender) can utilize our proposed nested projected gradient descent-based method to approximate the optimal attack (defense) strategy at an arbitrary level of reasoning. R2T2 can empirically achieve state-of-the-art attack and defense performances on benchmark image datasets.
Probably Approximate Shapley Fairness with Applications in Machine Learning
Zijian Zhou†, Xinyi Xu†, Rachael Hwee Ling Sim†, Chuan Sheng Foo*, and Bryan Kian Hsiang Low†
AAAI Conference on Artificial Intelligence, pages 5910-5918, 2023
19.6% acceptance rate (oral presentation)
Abstract. The Shapley value (SV) is adopted in various scenarios in machine learning including data valuation, agent valuation, feature attribution etc. because it guarantees fairness (i.e., satisfying several fairness axioms) and that fairness is important in such valuations (e.g., for pricing in a marketplace). However, the exact calculation of SVs (with exponential time complexity) is infeasible in practice, so estimations are necessary. Consequently, it raises the following important question: The exact SVs guarantee fairness, but do the SV estimates also guarantee the same fairness? We show the fairness guaranteed by exact SVs is too restrictive (for SV estimates) and generalise it to a probably approximate fairness. We propose fidelity score, a metric to measure the variation of SV estimates and establish theoretically the relationship between fidelity score and fairness guarantee. We exploit the relationship between fidelity score and fairness guarantee to propose a novel greedy active estimation (GAE) with theoretical guarantees. We theoretically show GAE achieves better fairness guarantee than the de facto Monte-Carlo estimation and empirically verify GAE outperforms several existing methods in guaranteeing fairness while remaining competitive in estimation accuracy in various specific ML scenarios using real-world datasets.
Sample-Then-Optimize Batch Neural Thompson Sampling
Zhongxiang Dai†, Yao Shu†, Bryan Kian Hsiang Low†, and Patrick Jaillet*
Annual Conference on Neural Information Processing Systems (NeurIPS), pages 23331-23344, 2022
25.6% acceptance rate
Abstract. Bayesian optimization (BO), which uses a Gaussian process (GP) as a surrogate to model its objective function, is popular for black-box optimization. However, due to the limitations of GPs, BO underperforms in some problems such as those with categorical, high-dimensional or image inputs. To this end, recent works have used the highly expressive neural networks (NNs) as the surrogate model and derived theoretical guarantees using the theory of neural tangent kernel (NTK). However, these works suffer from the limitations of the requirement to invert an extremely large parameter matrix and the restriction to the sequential (rather than batch) setting. To overcome these limitations, we introduce two algorithms based on the Thompson sampling (TS) policy named Sample-Then-Optimize Batch Neural TS (STO-BNTS) and STO-BNTS-Linear. To choose an input query, we only need to train an NN (resp. a linear model) and then choose the query by maximizing the trained NN (resp. linear model), which is equivalently sampled from the GP posterior with the NTK as the kernel function. As a result, our algorithms sidestep the need to invert the large parameter matrix yet still preserve the validity of the TS policy. Next, we derive regret upper bounds for our algorithms with batch evaluations, and use insights from batch BO and NTK to show that they are asymptotically no-regret under certain conditions. Finally, we verify their empirical effectiveness using practical AutoML and reinforcement learning experiments.
Trade-off between Payoff and Model Rewards in Shapley-Fair Collaborative Machine Learning
Quoc Phong Nguyen†, Bryan Kian Hsiang Low†, and Patrick Jaillet*
Annual Conference on Neural Information Processing Systems (NeurIPS), pages 30542-30553, 2022
25.6% acceptance rate
Abstract. This paper investigates the problem of fairly trading off between payoff and model rewards in collaborative machine learning (ML) where parties aggregate their datasets together to obtain improved ML models over that of each party. Supposing parties can afford the optimal model trained on the aggregated dataset, we propose an allocation scheme that distributes the payoff fairly. Notably, the same scheme can be derived from two different approaches based on (1) desirable properties of the parties' payoffs or (2) that of the underlying payoff flows from one party to another. While the former is conceptually simpler, the latter can be used to handle the practical constraint on the budgets of parties. In particular, we propose desirable properties for achieving a fair adjustment of the payoff flows that can trade off between the model reward's performance and the payoff reward. We empirically demonstrate that our proposed scheme is a sensible solution in several scenarios of collaborative ML with different budget constraints.
Unifying and Boosting Gradient-Based Training-Free Neural Architecture Search
Yao Shu†, Zhongxiang Dai†, Zhaoxuan Wu†, and Bryan Kian Hsiang Low†
Annual Conference on Neural Information Processing Systems (NeurIPS), pages 33001-33015, 2022
25.6% acceptance rate
Abstract. Neural architecture search (NAS) has gained immense popularity owing to its ability to automate neural architecture design. A number of training-free metrics are recently proposed to realize NAS without training, hence making NAS more scalable. Despite their competitive empirical performances, a unified theoretical understanding of these training-free metrics is lacking. As a consequence, (a) the relationships among these metrics are unclear, (b) there is no theoretical interpretation for their empirical performances, and (c) there may exist untapped potential in existing training-free NAS, which probably can be unveiled through a unified theoretical understanding. To this end, this paper presents a unified theoretical analysis of gradient-based training-free NAS, which allows us to (a) theoretically study their relationships, (b) theoretically guarantee their generalization performances, and (c) exploit our unified theoretical understanding to develop a novel framework named hybrid NAS (HNAS) which consistently boosts training-free NAS in a principled way. Remarkably, HNAS can enjoy the advantages of both training-free (i.e., superior search efficiency) and training-based (i.e., remarkable search effectiveness) NAS, which we have demonstrated through extensive experiments.
On Provably Robust Meta-Bayesian Optimization
Zhongxiang Dai†, Yizhou Chen†, Haibin Yu†, Bryan Kian Hsiang Low†, and Patrick Jaillet*
Conference on Uncertainty in Artificial Intelligence (UAI), pages 475-485, 2022
32.3% acceptance rate
Abstract. Bayesian optimization (BO) has become popular for sequential optimization of black-box functions. When BO is used to optimize a target function, we often have access to previous evaluations of potentially related functions. This begs the question as to whether we can leverage these previous experiences to accelerate the current BO task through meta-learning (meta-BO), while ensuring robustness against potentially harmful dissimilar tasks that could sabotage the convergence of BO. This paper introduces two scalable and provably robust meta-BO algorithms: robust meta-Gaussian process-upper confidence bound (RM-GP-UCB) and RM-GP-Thompson sampling (RM-GP-TS). We prove that both algorithms are asymptotically no-regret even when some or all previous tasks are dissimilar to the current task, and show that RM-GP-UCB enjoys a better theoretical robustness than RM-GP-TS. We also exploit the theoretical guarantees to optimize the weights assigned to individual previous tasks through regret minimization via online learning, which diminishes the impact of dissimilar tasks and hence further enhances the robustness. Empirical evaluations show that (a) RM-GP-UCB performs effectively and consistently across various applications, and (b) RM-GP-TS, despite being less robust than RM-GP-UCB both in theory and in practice, performs competitively in some scenarios with less dissimilar tasks and is more computationally efficient.
Neural Ensemble Search via Bayesian Sampling
Yao Shu†, Yizhou Chen†, Zhongxiang Dai†, and Bryan Kian Hsiang Low†
Conference on Uncertainty in Artificial Intelligence (UAI), pages 1803-1812, 2022
32.3% acceptance rate
Abstract. Recently, neural architecture search (NAS) has been applied to automate the design of neural networks in real-world applications. A large number of algorithms have been developed to improve the search cost or the performance of the final selected architectures in NAS. Unfortunately, these NAS algorithms aim to select only one single well-performing architecture from their search spaces and thus have overlooked the capability of neural network ensemble (i.e., an ensemble of neural networks with diverse architectures) in achieving improved performance over a single final selected architecture. To this end, we introduce a novel neural ensemble search algorithm, called neural ensemble search via Bayesian sampling (NESBS), to effectively and efficiently select well-performing neural network ensembles from a NAS search space. In our extensive experiments, NESBS algorithm is shown to be able to achieve improved performance over state-of-the-art NAS algorithms while incurring a comparable search cost, indicating the superior performance of our NESBS algorithm over these conventional NAS algorithms in practice.
On the Convergence of the Shapley Value in Parametric Bayesian Learning Games
Lucas Agussurja†, Xinyi Xu†, and Bryan Kian Hsiang Low†
International Conference on Machine Learning (ICML), pages 180-196, 2022
21.9% acceptance rate
Abstract. Measuring contributions is a classical problem in cooperative game theory where the Shapley value is the most well-known solution concept. In this paper, we establish the convergence property of the Shapley value in parametric Bayesian learning games where players perform a Bayesian inference using their combined data, and the posterior-prior KL divergence is used as the characteristic function. We show that for any two players, under some regularity conditions, their difference in Shapley value converges in probability to the difference in Shapley value of a limiting game whose characteristic function is proportional to the log-determinant of the joint Fisher information. As an application, we present an online collaborative learning framework that is asymptotically Shapley-fair. Our result enables this to be achieved without any costly computations of posterior-prior KL divergences. Only a consistent estimator of the Fisher information is needed. The framework's effectiveness is demonstrated with experiments using real-world data.
Efficient Distributionally Robust Bayesian Optimization with Worst-case Sensitivity
Sebastian Shenghong Tay†, Chuan Sheng Foo*, Daisuke Urano*, Richalynn Leong*, and Bryan Kian Hsiang Low†
International Conference on Machine Learning (ICML), pages 21180-21204, 2022
21.9% acceptance rate
Abstract. In distributionally robust Bayesian optimization (DRBO), an exact computation of the worst-case expected value requires solving an expensive convex optimization problem. We develop a fast approximation of the worst-case expected value based on the notion of worst-case sensitivity that caters to arbitrary convex distribution distances. We provide a regret bound for our novel DRBO algorithm with the fast approximation, and empirically show it is competitive with that using the exact worst-case expected value while incurring significantly less computation time. In order to guide the choice of distribution distance to be used with DRBO, we show that our approximation implicitly optimizes an objective close to an interpretable risk-sensitive value.
Bayesian Optimization under Stochastic Delayed Feedback
Arun Verma†, Zhongxiang Dai†, and Bryan Kian Hsiang Low†
International Conference on Machine Learning (ICML), pages 22145-22167, 2022
21.9% acceptance rate
Abstract. Bayesian optimization (BO) is a widely-used sequential method for zeroth-order optimization of complex and expensive-to-compute black-box functions. The existing BO methods assume that the function evaluation (feedback) is available to the learner immediately or after a fixed delay. Such assumptions may not be practical in many real-life problems like clinical trials, online recommendations, and hyper-parameters tuning, where feedback is available after a random delay. To benefit from the experimental parallelization in these problems, the learner needs to start new function evaluations without waiting for delayed feedback. In this paper, we consider the BO under stochastic delayed feedback problem. We propose algorithms with sub-linear regret guarantees that efficiently address the dilemma of selecting new function queries while waiting for randomly delayed feedback. Building on our results, we also make novel contributions to batch BO and contextual Gaussian process bandits. Our experiments on synthetic and real-life datasets verify the performance of proposed algorithms.
DAVINZ: Data Valuation using Deep Neural Networks at Initialization
Zhaoxuan Wu†, Yao Shu†, and Bryan Kian Hsiang Low†
International Conference on Machine Learning (ICML), pages 24150-24176, 2022
21.9% acceptance rate
Abstract. Recent years have witnessed a surge of interest in developing trustworthy methods to evaluate the value of data in many real-world applications, e.g., collaborative machine learning, data marketplaces, etc. Existing data valuation methods typically valuate data using the generalization performance of converged machine learning models after their long-term model training, making data valuation on large complex deep neural networks (DNNs) unaffordable. To this end, we theoretically derive a domain-aware generalization bound to estimate the generalization performance of DNNs without model training. We then exploit this theoretically derived generalization bound to develop a novel training-free data valuation method named data valuation at initialization (DAVINZ) on DNNs, which consistently achieves remarkable effectiveness and efficiency in practice. Moreover, our training-free DAVINZ, surprisingly, can even theoretically and empirically enjoy the desirable properties that training-based data valuation methods usually attain, making it more trustworthy in practice.
Data Valuation in Machine Learning: "Ingredients", Strategies, and Open Challenges
Rachael Hwee Ling Sim†, Xinyi Xu†, and Bryan Kian Hsiang Low†
International Joint Conference on Artificial Intelligence (IJCAI), pages 5607-5614, 2022
18.2% acceptance rate
Abstract. Data valuation in machine learning (ML) is an emerging research area that studies the worth of data in ML. Data valuation is used in collaborative ML to determine a fair compensation for every data owner and in interpretable ML to identify the most responsible, noisy, or misleading training examples. This paper presents a comprehensive technical survey that provides a new formal study of data valuation in ML through its "ingredients" and the corresponding properties, grounds the discussion of common desiderata satisfied by existing data valuation strategies on our proposed ingredients, and identifies open research challenges for designing new ingredients, data valuation strategies, and cost reduction techniques.
Markov Chain Monte Carlo-Based Machine Unlearning: Unlearning What Needs to be Forgotten
Quoc Phong Nguyen†, Ryutaro Oikawa, Dinil Mon Divakaran*, Mun Choon Chan*, and Bryan Kian Hsiang Low†
ACM ASIA Conference on Computer and Communications Security (ACM ASIACCS), pages 351-363, 2022
18.4% acceptance rate
Abstract. As the use of machine learning (ML) models is becoming increasingly popular in many real-world applications, there are practical challenges that need to be addressed for model maintenance. One such challenge is to 'undo' the effect of a specific subset of dataset used for training a model. This specific subset may contain malicious or adversarial data injected by an attacker, which affects the model performance. Another reason may be the need for a service provider to remove data pertaining to a specific user to respect the user's privacy. In both cases, the problem is to 'unlearn' a specific subset of the training data from a trained model without incurring the costly procedure of retraining the whole model from scratch. Towards this goal, this paper presents a Markov chain Monte Carlo-based machine unlearning (MCU) algorithm. MCU helps to effectively and efficiently unlearn a trained model from subsets of training dataset. Furthermore, we show that with MCU, we are able to explain the effect of a subset of a training dataset on the model prediction. Thus, MCU is useful for examining subsets of data to identify the adversarial data to be removed. Similarly, MCU can be used to erase the lineage of a user's personal data from trained ML models, thus upholding a user's "right to be forgotten". We empirically evaluate the performance of our proposed MCU algorithm on real-world phishing and diabetes datasets. Results show that MCU can achieve a desirable performance by efficiently removing the effect of a subset of training dataset and outperform an existing algorithm that utilizes the remaining dataset.
Adjusted Expected Improvement for Cumulative Regret Minimization in Noisy Bayesian Optimization
Shouri Hu, Haowei Wang, Zhongxiang Dai†, Bryan Kian Hsiang Low†, and Szu Hui Ng
arXiv:2205.04901, 2022
Abstract. The expected improvement (EI) is one of the most popular acquisition functions for Bayesian optimization (BO) and has demonstrated good empirical performances in many applications for the minimization of simple regret. However, under the evaluation metric of cumulative regret, the performance of EI may not be competitive, and its existing theoretical regret upper bound still has room for improvement. To adapt the EI for better performance under cumulative regret, we introduce a novel quantity called the evaluation cost which is compared against the acquisition function, and with this, develop the expected improvement-cost (EIC) algorithm. In each iteration of EIC, a new point with the largest acquisition function value is sampled, only if that value exceeds its evaluation cost. If none meets this criteria, the current best point is resampled. This evaluation cost quantifies the potential downside of sampling a point, which is important under the cumulative regret metric as the objective function value in every iteration affects the performance measure. We further establish in theory a tight regret upper bound of EIC for the squared-exponential covariance kernel under mild regularity conditions, and perform experiments to illustrate the improvement of EIC over several popular BO algorithms.
NASI: Label- and Data-agnostic Neural Architecture Search at Initialization
Yao Shu†, Shaofeng Cai, Zhongxiang Dai†, Beng Chin Ooi*, and Bryan Kian Hsiang Low†
International Conference on Learning Representations (ICLR), 2022
32.29% acceptance rate
Abstract. Recent years have witnessed a surging interest in Neural Architecture Search (NAS). Various algorithms have been proposed to improve the search efficiency and effectiveness of NAS, i.e., to reduce the search cost and improve the generalization performance of the selected architectures, respectively. However, the search efficiency of these algorithms is severely limited by the need for model training during the search process. To overcome this limitation, we propose a novel NAS algorithm called NAS at Initialization (NASI) that exploits the capability of a Neural Tangent Kernel in being able to characterize the performance of candidate architectures at initialization, hence allowing model training to be completely avoided to boost the search efficiency. Besides the improved search efficiency, NASI also achieves competitive search effectiveness on various datasets like CIFAR-10/100 and ImageNet. Further, NASI is shown to be label- and data-agnostic under mild conditions, which guarantees the transferability of architectures selected by our NASI over different datasets.
Near-Optimal Task Selection for Meta-Learning with Mutual Information and Online Variational Bayesian Unlearning
Yizhou Chen†, Shizhuo Zhang, and Bryan Kian Hsiang Low†
International Conference on Artificial Intelligence and Statistics (AISTATS), pages 9091-9113, 2022
29.2% acceptance rate
Abstract. This paper addresses the problem of active task selection which involves selecting the most informative tasks for meta-learning. We propose a novel active task selection criterion based on the mutual information between latent task vectors. Unfortunately, such a criterion scales poorly in the number of candidate tasks when optimized. To resolve this issue, we exploit the submodularity property of our new criterion for devising the first active task selection algorithm for meta-learning with a near-optimal performance guarantee. To further improve the efficiency of our algorithm, we propose an online variant of the Stein variational gradient descent to perform fast belief updates of the meta-parameters via maintaining a set of forward (and backward) particles when learning (or unlearning) from each selected task. We empirically demonstrate the performance of our proposed algorithm on real-world datasets.
Incentivizing Collaboration in Machine Learning via Synthetic Data Rewards
Sebastian Shenghong Tay†, Xinyi Xu†, Chuan Sheng Foo*, and Bryan Kian Hsiang Low†
AAAI Conference on Artificial Intelligence, pages 9448-9456, 2022
4.26% acceptance rate (oral presentation)
Abstract. This paper presents a novel collaborative generative modeling (CGM) framework that incentivizes collaboration among self-interested parties to contribute data to a pool for training a generative model (e.g., GAN), from which synthetic data are drawn and distributed to the parties as rewards commensurate to their contributions. Distributing synthetic data as rewards (instead of trained models or money) offers task- and model-agnostic benefits for downstream learning tasks and is less likely to violate data privacy regulation. To realize the framework, we firstly propose a data valuation function using maximum mean discrepancy (MMD) that values data based on its quantity and quality in terms of its closeness to the true data distribution and provide theoretical results guiding the kernel choice in our MMD-based data valuation function. Then, we formulate the reward scheme as a linear optimization problem that when solved, guarantees certain incentives such as fairness in the CGM framework. We devise a weighted sampling algorithm for generating synthetic data to be distributed to each party as reward such that the value of its data and the synthetic data combined matches its assigned reward value by the reward scheme. We empirically show using simulated and real-world datasets that the parties' synthetic data rewards are commensurate to their contributions.
Rectified Max-Value Entropy Search for Bayesian Optimization
Quoc Phong Nguyen†, Bryan Kian Hsiang Low†, and Patrick Jaillet*
arXiv:2202.13597, 2022
Abstract. Although the existing max-value entropy search (MES) is based on the widely celebrated notion of mutual information, its empirical performance can suffer due to two misconceptions whose implications on the exploration-exploitation trade-off are investigated in this paper. These issues are essential in the development of future acquisition functions and the improvement of the existing ones as they encourage an accurate measure of the mutual information such as the rectified MES (RMES) acquisition function we develop in this work. Unlike the evaluation of MES, we derive a closed-form probability density for the observation conditioned on the max-value and employ stochastic gradient ascent with reparameterization to efficiently optimize RMES. As a result of a more principled acquisition function, RMES shows a consistent improvement over MES in several synthetic function benchmarks and real-world optimization problems.
Fault-Tolerant Federated Reinforcement Learning with Theoretical Guarantee
Flint Xiaofeng Fan, Yining Ma, Zhongxiang Dai†, Wei Jing, Cheston Tan*, and Bryan Kian Hsiang Low†
Annual Conference on Neural Information Processing Systems (NeurIPS), pages 1007-1021, 2021
25.6% acceptance rate
Abstract. The growing literature of Federated Learning (FL) has recently inspired Federated Reinforcement Learning (FRL) to encourage multiple agents to federatively build a better decision-making policy without sharing raw trajectories. Despite its promising applications, existing FRL work fails to I) provide theoretical analysis on its convergence; II) account for random system failures and adversarial attacks. Towards this end, we propose the first FRL framework, the convergence of which is tolerant to less than half of participating agents being random system failures or adversarial attackers. We prove that the sample efficiency of the proposed framework is guaranteed to scale with the number of agents, accounting for such potential failures or attacks. We empirically verify all theoretical results on various RL benchmarking tasks.
Optimizing Conditional Value-At-Risk of Black-Box Functions
Quoc Phong Nguyen†, Zhongxiang Dai†, Bryan Kian Hsiang Low†, and Patrick Jaillet*
Annual Conference on Neural Information Processing Systems (NeurIPS), pages 4170-4180, 2021
25.6% acceptance rate
Abstract. This paper presents two Bayesian optimization (BO) algorithms with theoretical performance guarantee to maximize the conditional value-at-risk (CVaR) of a black-box function: CV-UCB and CV-TS which are based on the well-established principle of optimism in the face of uncertainty and Thompson sampling, respectively. To achieve this, we develop an upper confidence bound of CVaR and prove the no-regret guarantee of CV-UCB by utilizing an interesting connection between CVaR and value-at-risk (VaR). For CV-TS, though it is straightforwardly performed with Thompson sampling, bounding its Bayesian regret is non-trivial because it requires a tail expectation bound for the distribution of CVaR of a black-box function, which has not been shown in the literature. The performances of both CV-UCB and CV-TS are empirically evaluated in optimizing CVaR of synthetic benchmark functions and simulated real-world optimization problems.
Differentially Private Federated Bayesian Optimization with Distributed Exploration
Zhongxiang Dai†, Bryan Kian Hsiang Low†, and Patrick Jaillet*
Annual Conference on Neural Information Processing Systems (NeurIPS), pages 9125-9139, 2021
25.6% acceptance rate
Abstract. Bayesian optimization (BO) has recently been extended to the federated learning (FL) setting by the federated Thompson sampling (FTS) algorithm. However, FTS is not equipped with a rigorous privacy guarantee which is an important consideration in FL. Recent works have incorporated differential privacy (DP) into the training of deep neural networks through a general framework for adding DP to iterative algorithms. Following this general DP framework, our work here integrates DP into FTS to preserve user-level privacy. We also leverage the ability of this general DP framework to handle different parameter vectors, as well as the technique of local modeling for BO, to further improve the utility of our algorithm through distributed exploration (DE). The resulting differentially private FTS with DE (DP-FTS-DE) algorithm is endowed with theoretical guarantees for both the privacy and utility and is amenable to interesting theoretical insights about the privacy-utility trade-off. We also use real-world experiments to show that DP-FTS-DE achieves high utility (competitive performance) with a strong privacy guarantee (small privacy loss) and induces a practical trade-off between privacy and utility.
Validation Free and Replication Robust Volume-based Data Valuation
Xinyi Xu†, Zhaoxuan Wu†, Chuan Sheng Foo*, and Bryan Kian Hsiang Low†
Annual Conference on Neural Information Processing Systems (NeurIPS), pages 10837-10848, 2021
25.6% acceptance rate
Abstract. Data valuation arises as a non-trivial challenge in use cases such as collaborative data sharing, data markets and etc. The value of data is often related to the learning performance, e.g. validation accuracy, of the model trained on the data. While intuitive, this methodology introduces a high coupling between data valuation and validation, which may not be desirable in practice. For instance, data providers may disagree on the choice of the validation set, or the validation set may be (statistically) different from the actual application. A separate but practical issue is data replication. If some data points are valuable, a dishonest data provider may offer a dataset containing replications of these data points, trying to exploit the valuation to get a higher reward/payment. Based on the ordinary least squares framework, our data valuation method does not require validation, and still provides a useful connection between the value of data and learning performance. In particular, we utilize the volume of the data matrix (determinant of its left Gram), thus able to provide an intuitive interpretation of the value of data via the diversity in the data. Furthermore, we formalize the robustness to data replication, and propose a robust volume valuation with robustness guarantees. We conduct extensive experiments to demonstrate its consistency and practical advantages over existing baselines.
Gradient Driven Rewards to Guarantee Fairness in Collaborative Machine Learning
Xinyi Xu†, Lingjuan Lyu, Xingjun Ma, Chenglin Miao, Chuan Sheng Foo*, and Bryan Kian Hsiang Low†
Annual Conference on Neural Information Processing Systems (NeurIPS), pages 16104-16117, 2021
25.6% acceptance rate
Abstract. In collaborative machine learning (CML), multiple agents pool their resources (e.g., data) together for a common learning task. In realistic CML settings where the agents are self-interested and not altruistic, they may be unwilling to share data or model information without adequate rewards. Furthermore, as the data/model information shared by the agents may differ in quality, designing rewards which are fair to them is important so that they would not feel exploited nor discouraged from sharing. In this paper, we adopt federated learning as the CML paradigm, propose a novel cosine gradient Shapley value (CGSV) to fairly evaluate the expected marginal contribution of each agent's uploaded model parameter update/gradient without needing an auxiliary validation dataset, and based on the CGSV, design a novel training-time gradient reward mechanism with a fairness guarantee by sparsifying the aggregated parameter update/gradient downloaded from the server as reward to each agent such that its resulting quality is commensurate to that of the agent’s uploaded parameter update/gradient. We empirically demonstrate the effectiveness of our fair gradient reward mechanism on multiple benchmark datasets in terms of fairness, predictive performance, and time overhead.
Model Fusion for Personalized Learning
Thanh Chi Lam, Trong Nghia Hoang†, Bryan Kian Hsiang Low†, and Patrick Jaillet*
International Conference on Machine Learning (ICML), pages 5948-5958, 2021
21.5% acceptance rate
Abstract. Production systems operating on a growing domain of analytic services often require generating warm-start solution models for emerging tasks with limited data. One potential approach to address this challenge is to adopt meta learning to generate a base model that can be adapted to solve unseen tasks with minimal fine-tuning. This however requires the training processes of previous solution models of existing tasks to be synchronized. This is not possible if these models were pre-trained separately on private data owned by different entities and cannot be synchronously re-trained. To accommodate for such scenarios, we develop a new personalized learning framework that synthesizes customized models for unseen tasks via fusion of independently pre-trained models of related tasks. We establish performance guarantee for the proposed framework and demonstrate its effectiveness on both synthetic and real datasets.
Value-at-Risk Optimization with Gaussian Processes
Quoc Phong Nguyen†, Zhongxiang Dai†, Bryan Kian Hsiang Low†, and Patrick Jaillet*
International Conference on Machine Learning (ICML), pages 8063-8072, 2021
21.5% acceptance rate
Abstract. Value-at-risk (VaR) is an established measure to assess risks in critical real-world applications with random environmental factors. This paper presents a novel VaR upper confidence bound (V-UCB) algorithm for maximizing the VaR of a black-box objective function with the first no-regret guarantee. To realize this, we first derive a confidence bound of VaR and then prove the existence of values of the environmental random variable (to be selected to achieve no regret) such that the confidence bound of VaR lies within that of the objective function evaluated at such values. Our V-UCB algorithm empirically demonstrates state-of-the-art performance in optimizing synthetic benchmark functions, a portfolio optimization problem, and a simulated robot task.
Collaborative Bayesian Optimization with Fair Regret
Rachael Hwee Ling Sim†, Yehong Zhang†, Bryan Kian Hsiang Low†, and Patrick Jaillet*
International Conference on Machine Learning (ICML), pages 9691-9701, 2021
21.5% acceptance rate
Abstract. Bayesian optimization (BO) is a popular tool for optimizing complex and costly-to-evaluate black-box objective functions. To further reduce the number of function evaluations, any party performing BO may be interested to collaborate with others to optimize the same objective function concurrently. To do this, existing BO algorithms have considered optimizing a batch of input queries in parallel and provided theoretical bounds on their cumulative regret reflecting inefficiency. However, when the objective function values are correlated with real-world rewards (e.g., money), parties may be hesitant to collaborate if they risk incurring larger cumulative regret (i.e., smaller real-world reward) than others. This paper shows that fairness and efficiency are both necessary for the collaborative BO setting. Inspired by social welfare concepts from economics, we propose a new notion of regret capturing these properties and a collaborative BO algorithm whose convergence rate can be theoretically guaranteed by bounding the new regret, both of which share an adjustable parameter for trading off between fairness vs. efficiency. We empirically demonstrate the benefits (e.g., increased fairness) of our algorithm using synthetic and real-world datasets.
Convolutional Normalizing Flows for Deep Gaussian Processes
Haibin Yu†, Dapeng Liu, Bryan Kian Hsiang Low†, and Patrick Jaillet*
International Joint Conference on Neural Networks (IJCNN), 2021
Abstract. Deep Gaussian processes (DGPs), a hierarchical composition of GP models, have successfully boosted the expressive power of their single-layer counterpart. However, it is impossible to perform exact inference in DGPs, which has motivated the recent development of variational inference-based methods. Unfortunately, either these methods yield a biased posterior belief or it is difficult to evaluate their convergence. This paper introduces a new approach for specifying flexible, arbitrarily complex, and scalable approximate posterior distributions. The posterior distribution is constructed through a normalizing flow (NF) which transforms a simple initial probability into a more complex one through a sequence of invertible transformations. Moreover, a novel convolutional normalizing flow (CNF) is developed to improve the time efficiency and capture dependency between layers. Empirical evaluation shows that CNF DGP outperforms the state-of-the-art approximation methods for DGPs.
Learning to Learn with Gaussian Processes
Quoc Phong Nguyen†, Bryan Kian Hsiang Low†, and Patrick Jaillet*
Conference on Uncertainty in Artificial Intelligence (UAI), pages 1466-1475, 2021
26.5% acceptance rate
Abstract. This paper presents Gaussian process meta-learning (GPML) for few-shot regression, which explicitly exploits the distance between regression problems/tasks using a novel task kernel. It contrasts sharply with the popular metric-based meta-learning approach which is based on the distance between data inputs or their embeddings in the few-shot learning literature. Apart from the superior predictive performance by capturing the diversity of different tasks, GPML offers a set of representative tasks that are useful for understanding the task distribution. We empirically demonstrate the performance and interpretability of GPML in several few-shot regression problems involving a multimodal task distribution and real-world datasets.
Trusted-Maximizers Entropy Search for Efficient Bayesian Optimization
Quoc Phong Nguyen†, Zhaoxuan Wu†, Bryan Kian Hsiang Low†, and Patrick Jaillet*
Conference on Uncertainty in Artificial Intelligence (UAI), pages 1486-1495, 2021
26.5% acceptance rate
Abstract. Information-based Bayesian optimization (BO) algorithms have achieved state-of-the-art performance in optimizing a black-box objective function. However, they usually require several approximations or simplifying assumptions (without clearly understanding their effects on the BO performance) and/or their generalization to batch BO is computationally unwieldy, especially with an increasing batch size. To alleviate these issues, this paper presents a novel trusted-maximizers entropy search (TES) acquisition function: It measures how much an input query contributes to the information gain on the maximizer over a finite set of trusted maximizers, i.e., inputs optimizing functions that are sampled from the Gaussian process posterior belief of the objective function. Evaluating TES requires either only a stochastic approximation with sampling or a deterministic approximation with expectation propagation, both of which are investigated and empirically evaluated using synthetic benchmark objective functions and real-world optimization problems, e.g., hyperparameter tuning of a convolutional neural network and synthesizing ‘physically realizable’ faces to fool a black-box face recognition system. Though TES can naturally be generalized to a batch variant with either approximation, the latter is amenable to be scaled to a much larger batch size in our experiments.
AID: Active Distillation Machine to Leverage Pre-Trained Black-Box Models in Private Data Settings
Trong Nghia Hoang†, Shenda Hong, Cao Xiao, Bryan Kian Hsiang Low†, and Jimeng Sun
The Web Conference (WWW), pages 3569-3581, 2021
20.6% acceptance rate
Abstract. This paper presents an active distillation method for a local institution (e.g., hospital) to find the best queries within its given budget to distill an on-server black-box model's predictive knowledge into a local surrogate with transparent parameterization. This allows local institutions to understand better the predictive reasoning of the black-box model in its own local context or to further customize the distilled knowledge with its private dataset that cannot be centralized and fed into the server model. The proposed method thus addresses several challenges of deploying machine learning in many industrial settings (e.g., healthcare analytics) with strong proprietary constraints. These include: (1) the opaqueness of the server model’s architecture which prevents local users from understanding its predictive reasoning in their local data contexts; (2) the increasing cost and risk of uploading local data on the cloud for analysis; and (3) the need to customize the server model with private onsite data. We evaluated the proposed method on both benchmark and real-world healthcare data where significant improvements over existing local distillation methods were observed. A theoretical analysis of the proposed method is also presented.
An Information-Theoretic Framework for Unifying Active Learning Problems
Quoc Phong Nguyen†, Bryan Kian Hsiang Low†, and Patrick Jaillet*
AAAI Conference on Artificial Intelligence, pages 9126-9134, 2021
21.4% acceptance rate
Abstract. This paper presents an information-theoretic framework for unifying active learning problems: level set estimation (LSE), Bayesian optimization (BO), and their generalized variant. We first introduce a novel active learning criterion that subsumes an existing LSE algorithm and achieves state-of-the-art performance in LSE problems with a continuous input domain. Then, by exploiting the relationship between LSE and BO, we design a competitive information-theoretic acquisition function for BO that has interesting connections to upper confidence bound and max-value entropy search (MES). The latter connection reveals a drawback of MES which has important implications on not only MES but also on other MES-based acquisition functions. Finally, our unifying information-theoretic framework can be applied to solve a generalized problem of LSE and BO involving multiple level sets in a data-efficient manner. We empirically evaluate the performance of our proposed algorithms using synthetic benchmark functions, a real-world dataset, and in hyperparameter tuning of machine learning models.
Top-k Ranking Bayesian Optimization
Quoc Phong Nguyen†, Sebastian Tay, Bryan Kian Hsiang Low†, and Patrick Jaillet*
AAAI Conference on Artificial Intelligence, pages 9135-9143, 2021
21.4% acceptance rate
Abstract. This paper presents a novel approach to top-k ranking Bayesian optimization (top-k ranking BO) which is a practical and significant generalization of preferential BO to handle top-k ranking and tie/indifference observations. We first design a surrogate model that is not only capable of catering to the above observations, but is also supported by a classic random utility model. Another equally important contribution is the introduction of the first information-theoretic acquisition function in BO with preferential observation called multinomial predictive entropy search (MPES) which is flexible in handling these observations and optimized for all inputs of a query jointly. MPES possesses superior performance compared with existing acquisition functions that select the inputs of a query one at a time greedily. We empirically evaluate the performance of MPES using several synthetic benchmark functions, CIFAR-10 dataset, and SUSHI preference dataset.
FCM-Sketch: Generic Network Measurements with Data Plane Support
Cha Hwan Song†, Pravein Govindan Kannan, Bryan Kian Hsiang Low†, and Mun Choon Chan*
International Conference on emerging Networking EXperiments and Technologies (CoNEXT), pages 78-92, 2020
24% acceptance rate
Abstract. Sketches have successfully provided accurate and fine-grained measurements (e.g., flow size and heavy hitters) which are imperative for network management. In particular, Count-Min (CM) sketch is widely utilized in many applications due to its simple design and ease of implementation. There have been many efforts to build monitoring frameworks based on Count-Min sketch. However, these frameworks either support very specific measurement tasks or they cannot be implemented on high-speed programmable hardware (PISA). In this work, we propose FCM, a framework that is designed to support generic network measurement with high accuracy. Our key contribution is FCM-Sketch, a data structure that has a lightweight implementation on the emerging PISA programmable switches. FCM-Sketch can also be used as a substitute for CM-Sketch in applications that use CM-Sketch. We have implemented FCM-Sketch on a commodity programmable switch (Barefoot Tofino) using the P4 language. Our evaluation shows that FCM-Sketch can reduce the errors in many measurement tasks by 50% to 80% compared to CM-Sketch and other state-of-the-art approaches.
Efficient Exploration of Reward Functions in Inverse Reinforcement Learning via Bayesian Optimization
Sreejith Balakrishnan†, Quoc Phong Nguyen†, Bryan Kian Hsiang Low†, and Harold Soh*
Annual Conference on Neural Information Processing Systems (NeurIPS), pages 4187-4198, 2020
20.1% acceptance rate
Abstract. In this paper, we focus on the problem of Inverse Reinforcement Learning (IRL), which is relevant for a variety of tasks including value alignment and robot learning from demonstration. Despite significant algorithmic contributions in recent years, IRL remains an ill-posed problem at its core; multiple reward functions coincide with the observed behavior, and the actual reward function is not identifiable without prior knowledge or supplementary information. Here, we propose Bayesian Optimization-IRL (BO-IRL), an IRL framework that identifies multiple solutions that are consistent with the expert demonstrations by efficiently exploring the reward function space. BO-IRL achieves this by utilizing Bayesian Optimization along with our newly proposed kernel that (a) projects the parameters of policy invariant reward functions to a single point in a latent space, and (b) ensures that nearby points in the latent space correspond to reward functions that yield similar likelihoods. This projection allows for the use of standard stationary kernels in the latent space to capture the correlations present across the reward function space. Empirical results on synthetic and real-world environments (model-free and model-based) show that BO-IRL discovers multiple reward functions while minimizing the number of expensive exact policy optimizations.
Federated Bayesian Optimization via Thompson Sampling
Zhongxiang Dai†, Bryan Kian Hsiang Low†, and Patrick Jaillet*
Annual Conference on Neural Information Processing Systems (NeurIPS), pages 9687-9699, 2020
20.1% acceptance rate
Abstract. Bayesian optimization (BO) is a prominent method for optimizing expensive-to-compute black-box functions. The massive computational capability of edge devices such as mobile phones, coupled with privacy concerns, has led to immense recent interest in federated learning (FL), which focuses on collaborative training of deep neural networks (DNN) via first-order optimization techniques. However, some common machine learning tasks such as hyperparameter tuning of DNN lack access to gradients and thus require zeroth-order optimization (black-box optimization). This hints at the considerable potential of extending BO to the FL setting (FBO), to allow agents to collaborate in these black-box optimization tasks. Here, we introduce federated Thompson sampling (FTS), which overcomes a number of key challenges of FBO and FL in a principled way: We (a) use random Fourier features to approximate the Gaussian process surrogate model used in BO which naturally produces the parameters to be exchanged between agents, (b) design FTS based on Thompson sampling which significantly reduces the number of parameters to be exchanged, and (c) provide a theoretical convergence guarantee that is robust against heterogeneous agents which is a major challenge in FL and FBO. We empirically demonstrate the effectiveness of FTS in terms of communication efficiency, computational efficiency and practical performance.
Variational Bayesian Unlearning
Quoc Phong Nguyen†, Bryan Kian Hsiang Low†, and Patrick Jaillet*
Annual Conference on Neural Information Processing Systems (NeurIPS), pages 16025-16036, 2020
20.1% acceptance rate
Abstract. This paper studies the problem of approximately unlearning a Bayesian model from a small subset of the training data to be erased. We frame this problem as one of minimizing the Kullback-Leibler distance between the approximate posterior belief of model parameters after directly unlearning from the erased data vs. the exact posterior belief from retraining with remaining data. Using the variational inference (VI) framework, we show that it is equivalent to minimizing an evidence upper bound which trades off between fully unlearning from erased data vs. not entirely forgetting the posterior belief given the full data (i.e., including the remaining data); the latter prevents catastrophic unlearning that can render the model useless. In model training with VI, only an approximate (instead of exact) posterior belief given the full data can be obtained, which makes unlearning even more challenging. We propose two novel tricks to tackle this challenge. We empirically demonstrate our unlearning methods on Bayesian models such as sparse Gaussian process and logistic regression using synthetic and real-world datasets.
Nonmyopic Gaussian Process Optimization with Macro-Actions
Dmitrii Kharkovskii†, Chun Kai Ling†, and Bryan Kian Hsiang Low†
International Conference on Artificial Intelligence and Statistics (AISTATS), pages 4593-4604, 2020
28.7% acceptance rate
Abstract. This paper presents a multi-staged approach to nonmyopic adaptive Gaussian process optimization (GPO) for Bayesian optimization (BO) of unknown, highly complex objective functions that, in contrast to existing nonmyopic adaptive BO algorithms, exploits the notion of macro-actions for scaling up to a further lookahead to match up to a larger available budget. To achieve this, we generalize GP upper confidence bound to a new acquisition function defined w.r.t. a nonmyopic adaptive macro-action policy, which is intractable to be optimized exactly due to an uncountable set of candidate outputs. The contribution of our work here is thus to derive a nonmyopic adaptive ϵ-Bayes-optimal macro-action GPO (ϵ-Macro-GPO) policy. To perform nonmyopic adaptive BO in real time, we then propose an asymptotically optimal anytime variant of our ϵ-Macro-GPO policy with a performance guarantee. We empirically evaluate the performance of our ϵ-Macro-GPO policy and its anytime variant in BO with synthetic and real-world datasets.
R2-B2: Recursive Reasoning-Based Bayesian Optimization for No-Regret Learning in Games
Zhongxiang Dai†, Yizhou Chen†, Bryan Kian Hsiang Low†, Patrick Jaillet*, and Teck-Hua Ho
International Conference on Machine Learning (ICML), pages 2291-2301, 2020
21.8% acceptance rate
Abstract. This paper presents a recursive reasoning formalism of Bayesian optimization (BO) to model the reasoning process in the interactions between boundedly rational, self-interested agents with unknown, complex, and costly-to-evaluate payoff functions in repeated games, which we call Recursive Reasoning-Based BO (R2-B2). Our R2-B2 algorithm is general in that it does not constrain the relationship among the payoff functions of different agents and can thus be applied to various types of games such as constant-sum, general-sum, and common-payoff games. We prove that by reasoning at level 2 or more and at one level higher than the other agents, our R2-B2 agent can achieve faster asymptotic convergence to no regret than that without utilizing recursive reasoning. We also propose a computationally cheaper variant of R2-B2 called R2-B2-Lite at the expense of a weaker convergence guarantee. The performance and generality of our R2-B2 algorithm are empirically demonstrated using synthetic games, adversarial machine learning, and multi-agent reinforcement learning.
Learning Task-Agnostic Embedding of Multiple Black-Box Experts for Multi-Task Model Fusion
Trong Nghia Hoang†, Thanh Lam, Bryan Kian Hsiang Low†, and Patrick Jaillet*
International Conference on Machine Learning (ICML), pages 4282-4292, 2020
21.8% acceptance rate
Abstract. Model fusion is an emerging study in collective learning where heterogeneous experts with private data and learning architectures need to combine their black-box knowledge for better performance. Existing literature achieves this via a local knowledge distillation scheme that transfuses the predictive patterns of each pre-trained expert onto a white-box imitator model, which can be incorporated efficiently into a global model. This scheme however does not extend to multi-task scenarios where different experts were trained to solve different tasks and only part of their distilled knowledge is relevant to a new task. To address this multi-task challenge, we develop a new fusion paradigm that represents each expert as a distribution over a spectrum of predictive prototypes, which are isolated from task-specific information encoded within the prototype distribution. The task-agnostic prototypes can then be reintegrated to generate a new model that solves a new task encoded with a different prototype distribution. The fusion and adaptation performance of the proposed framework is demonstrated empirically on several real-world benchmark datasets.
Private Outsourced Bayesian Optimization
Dmitrii Kharkovskii†, Zhongxiang Dai†, and Bryan Kian Hsiang Low†
International Conference on Machine Learning (ICML), pages 5231-5242, 2020
21.8% acceptance rate
Abstract. This paper presents the private-outsourced-Gaussian process-upper confidence bound (PO-GP-UCB) algorithm, which is the first algorithm for privacy-preserving Bayesian optimization (BO) in the outsourced setting with a provable performance guarantee. We consider the outsourced setting where the entity holding the dataset and the entity performing BO are represented by different parties, and the dataset cannot be released non-privately. For example, a hospital holds a dataset of sensitive medical records and outsources the BO task on this dataset to an industrial AI company. The key idea of our approach is to make the BO performance of our algorithm similar to that of non-private GP-UCB run using the original dataset, which is achieved by using a random projection-based transformation that preserves both privacy and the pairwise distances between inputs. Our main theoretical contribution is to show that a regret bound similar to that of the standard GP-UCB algorithm can be established for our PO-GP-UCB algorithm. We empirically evaluate the performance of our PO-GP-UCB algorithm with synthetic and real-world datasets.
Collaborative Machine Learning with Incentive-Aware Model Rewards
Rachael Hwee Ling Sim†, Yehong Zhang†, Mun Choon Chan*, and Bryan Kian Hsiang Low†
International Conference on Machine Learning (ICML), pages 8927-8936, 2020
21.8% acceptance rate
Abstract. Collaborative machine learning (ML) is an appealing paradigm to build high-quality ML models by training on the aggregated data from many parties. However, these parties are only willing to share their data when given enough incentives, such as a guaranteed fair reward based on their contributions. This motivates the need for measuring a party's contribution and designing an incentive-aware reward scheme accordingly. This paper proposes to value a party's reward based on Shapley value and information gain on model parameters given its data. Subsequently, we give each party a model as a reward. To formally incentivize the collaboration, we define some desirable properties (e.g., fairness and stability) which are inspired by cooperative game theory but adapted for our model reward that is uniquely freely replicable. Then, we propose a novel model reward scheme to satisfy fairness and trade off between the desirable properties via an adjustable parameter. The value of each party's model reward determined by our scheme is attained by injecting Gaussian noise to the aggregated training data with an optimized noise variance. We empirically demonstrate interesting properties of our scheme and evaluate its performance using synthetic and real-world datasets.