Catalogue Search | MBRL
Explore the vast range of titles available.
35 result(s) for "contextual bandits"
A Systematic Study on Reinforcement Learning Based Applications
by Aljafari, Belqasem; Rajasekar, Elakkiya; Nikolovski, Srete in Algorithms, Analysis, Clustering
2023
We have analyzed 127 publications for this review paper, which discuss applications of Reinforcement Learning (RL) in marketing, robotics, gaming, automated cars, natural language processing (NLP), internet of things security, recommendation systems, finance, and energy management. The optimization of energy use is critical in today’s environment. We mainly focus on the RL application for energy management. Traditional rule-based systems have a set of predefined rules. As a result, they may become rigid and unable to adjust to changing situations or unforeseen events. RL can overcome these drawbacks. RL learns by exploring the environment randomly and, based on experience, continues to expand its knowledge. Many researchers are working on RL-based energy management systems (EMS). RL is utilized in energy applications such as optimizing energy use in smart buildings, hybrid automobiles, smart grids, and managing renewable energy resources. RL-based energy management in renewable energy contributes to achieving net zero carbon emissions and a sustainable environment. In the context of energy management technology, RL can be utilized to optimize the regulation of energy systems, such as building heating, ventilation, and air conditioning (HVAC) systems, to reduce energy consumption while maintaining a comfortable atmosphere. EMS can be accomplished by teaching an RL agent to make judgments based on sensor data, such as temperature and occupancy, to modify the HVAC system settings. RL has proven beneficial in lowering energy usage in buildings and is an active research area in smart buildings. RL can be used to optimize energy management in hybrid electric vehicles (HEVs) by learning an optimal control policy to maximize battery life and fuel efficiency. RL has acquired a remarkable position in robotics, automated cars, and gaming applications. The majority of security-related applications operate in a simulated environment. RL-based recommender systems provide good suggestion accuracy and diversity. This article assists the novice in comprehending the foundations of reinforcement learning and its applications.
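As a toy illustration of the RL-for-energy-management idea this abstract surveys, the sketch below runs tabular Q-learning on a hypothetical three-state thermostat problem; the states, actions, and reward are invented for illustration and are not drawn from the paper.

```python
import random

# Toy tabular Q-learning sketch for HVAC setpoint control (hypothetical
# illustration of the RL-for-EMS idea, not code from the reviewed paper).
STATES = ["cold", "comfortable", "hot"]   # discretized room temperature
ACTIONS = ["heat", "off", "cool"]         # HVAC control actions
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def step(state, action):
    """Toy dynamics: comfort reward minus an energy cost for running the HVAC."""
    energy_cost = 0.0 if action == "off" else 1.0
    if state == "cold":
        next_state = "comfortable" if action == "heat" else "cold"
    elif state == "hot":
        next_state = "comfortable" if action == "cool" else "hot"
    else:  # comfortable
        next_state = {"heat": "hot", "cool": "cold", "off": "comfortable"}[action]
    comfort = 2.0 if next_state == "comfortable" else -1.0
    return next_state, comfort - energy_cost

state = "cold"
for t in range(20_000):
    if random.random() < EPSILON:                        # explore
        action = random.choice(ACTIONS)
    else:                                                # exploit
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
    next_state, reward = step(state, action)
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
    state = next_state

print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES})
```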
Journal Article
Doubly Robust Policy Evaluation and Optimization
2014
We study sequential decision making in environments where rewards are only partially observed, but can be modeled as a function of observed contexts and the chosen action by the decision maker. This setting, known as contextual bandits, encompasses a wide variety of applications such as health care, content recommendation and Internet advertising. A central task is evaluation of a new policy given historic data consisting of contexts, actions and received rewards. The key challenge is that the past data typically does not faithfully represent proportions of actions taken by a new policy. Previous approaches rely either on models of rewards or models of the past policy. The former are plagued by a large bias whereas the latter have a large variance. In this work, we leverage the strengths and overcome the weaknesses of the two approaches by applying the doubly robust estimation technique to the problems of policy evaluation and optimization. We prove that this approach yields accurate value estimates when we have either a good (but not necessarily consistent) model of rewards or a good (but not necessarily consistent) model of past policy. Extensive empirical comparison demonstrates that the doubly robust estimation uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies. As such, we expect the doubly robust approach to become common practice in policy evaluation and optimization.
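A minimal sketch of the doubly robust estimator the abstract describes, for a deterministic target policy; the function and argument names (reward_model, behavior_prob, target_policy) are assumptions for illustration, not the authors' code.

```python
import numpy as np

def dr_value(logged_data, reward_model, behavior_prob, target_policy):
    """Doubly robust value estimate of target_policy from logged (context, action, reward).

    reward_model(x, a)   -> estimated reward r_hat for action a in context x
    behavior_prob(x, a)  -> probability the logging policy chose a in context x
    target_policy(x)     -> action the new (deterministic) policy would take
    """
    estimates = []
    for x, a, r in logged_data:
        pi_a = target_policy(x)
        # Direct-method term: model-based reward for the target action.
        dm = reward_model(x, pi_a)
        # Importance-weighted correction, nonzero only when the logged action matches.
        correction = 0.0
        if a == pi_a:
            correction = (r - reward_model(x, a)) / behavior_prob(x, a)
        estimates.append(dm + correction)
    return float(np.mean(estimates))
```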
Journal Article
Feature-Based Dynamic Pricing
by Lobel, Ilan; Paes Leme, Renato; Cohen, Maxime C. in Algorithms, Clothing industry, Consumer goods
2020
We consider the problem faced by a firm that receives highly differentiated products in an online fashion. The firm needs to price these products to sell them to its customer base. Products are described by vectors of features and the market value of each product is linear in the values of the features. The firm does not initially know the values of the different features, but can learn the values of the features based on whether products were sold at the posted prices in the past. This model is motivated by applications such as online marketplaces, online flash sales, and loan pricing. We first consider a multidimensional version of binary search over polyhedral sets and show that it has a worst-case regret which is exponential in the dimension of the feature space. We then propose a modification of the prior algorithm where uncertainty sets are replaced by their Löwner-John ellipsoids. We show that this algorithm has a worst-case regret which is quadratic in the dimension of the feature space and logarithmic in the time horizon. We also show how to adapt our algorithm to the case where valuations are noisy. Finally, we present computational experiments to illustrate the performance of our algorithm.
This paper was accepted by Yinyu Ye, optimization.
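A simplified sketch of the ellipsoid idea the abstract describes: keep the unknown feature weights inside an ellipsoid and shrink it with each sale/no-sale observation. Pricing at the center's valuation (so every outcome yields a central cut) is a simplification; the paper's actual algorithm is more refined.

```python
import numpy as np

class EllipsoidPricer:
    """Keeps theta inside {th: (th - c)^T A^-1 (th - c) <= 1} and cuts it each round."""

    def __init__(self, dim, radius=10.0):
        self.d = dim
        self.c = np.zeros(dim)                 # ellipsoid center (estimate of theta)
        self.A = (radius ** 2) * np.eye(dim)   # ellipsoid shape matrix

    def price(self, x):
        return float(self.c @ x)               # post the center's valuation

    def update(self, x, sold):
        # Sale means theta @ x >= price, i.e. keep the halfspace g @ theta <= g @ c
        # with g = -x; no sale keeps g = +x.
        g = -x if sold else x
        denom = np.sqrt(g @ self.A @ g)
        if denom < 1e-12:
            return
        b = (self.A @ g) / denom
        d = self.d
        self.c = self.c - b / (d + 1)
        self.A = (d ** 2) / (d ** 2 - 1.0) * (self.A - (2.0 / (d + 1)) * np.outer(b, b))

# Hypothetical usage: observe a sale whenever the true valuation exceeds the price.
rng = np.random.default_rng(0)
theta = rng.uniform(-1, 1, size=5)
pricer = EllipsoidPricer(dim=5)
for _ in range(200):
    x = rng.uniform(0, 1, size=5)
    p = pricer.price(x)
    pricer.update(x, sold=(theta @ x >= p))
print("center estimate:", np.round(pricer.c, 2), "true weights:", np.round(theta, 2))
```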
Journal Article
Multi-armed bandits with episode context
2011
A multi-armed bandit episode consists of n trials, each allowing selection of one of K arms, resulting in payoff from a distribution over [0,1] associated with that arm. We assume contextual side information is available at the start of the episode. This context enables an arm predictor to identify possible favorable arms, but predictions may be imperfect so that they need to be combined with further exploration during the episode. Our setting is an alternative to classical multi-armed bandits which provide no contextual side information, and is also an alternative to contextual bandits which provide new context each individual trial. Multi-armed bandits with episode context can arise naturally, for example in computer Go where context is used to bias move decisions made by a multi-armed bandit algorithm. The UCB1 algorithm for multi-armed bandits achieves worst-case regret bounded by O(√(Kn log n)). We seek to improve this using episode context, particularly in the case where K is large. Using a predictor that places weight M_i > 0 on arm i with weights summing to 1, we present the PUCB algorithm which achieves regret O(√(n log n)/M_∗), where M_∗ is the weight on the optimal arm. We illustrate the behavior of PUCB with small simulation experiments, present extensions that provide additional capabilities for PUCB, and describe methods for obtaining suitable predictors for use with PUCB.
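For intuition, the sketch below contrasts the UCB1 index with a predictor-weighted index in the spirit of PUCB; the M_i-scaled bonus is only an illustration, not the exact penalty term used in the paper.

```python
import math

def ucb1_index(mean, n_i, t):
    # Standard UCB1 index; assumes every arm has been pulled at least once.
    return mean + math.sqrt(2.0 * math.log(t) / n_i)

def pucb_like_index(mean, n_i, t, m_i, c=1.0):
    # Arms the predictor favors (large m_i) receive a larger exploration bonus,
    # so they are tried earlier; the paper's PUCB uses a different correction.
    return mean + c * m_i * math.sqrt(math.log(t) / (1 + n_i))

def select_arm(means, counts, t, weights):
    return max(range(len(means)),
               key=lambda i: pucb_like_index(means[i], counts[i], t, weights[i]))
```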
Journal Article
THE MULTI-ARMED BANDIT PROBLEM WITH COVARIATES
2013
We consider a multi-armed bandit problem in a setting where each arm produces a noisy reward realization which depends on an observable random covariate. As opposed to the traditional static multi-armed bandit problem, this setting allows for dynamically changing rewards that better describe applications where side information is available. We adopt a nonparametric model where the expected rewards are smooth functions of the covariate and where the hardness of the problem is captured by a margin parameter. To maximize the expected cumulative reward, we introduce a policy called Adaptively Binned Successive Elimination (ABSE) that adaptively decomposes the global problem into suitably "localized" static bandit problems. This policy constructs an adaptive partition using a variant of the Successive Elimination (SE) policy. Our results include sharper regret bounds for the SE policy in a static bandit problem and minimax optimal regret bounds for the ABSE policy in the dynamic problem.
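A sketch of the static Successive Elimination routine the abstract builds on: sample all surviving arms in rounds and drop an arm once its upper confidence bound falls below the best lower confidence bound. The adaptive binning of the covariate space that defines ABSE is not shown.

```python
import math

def successive_elimination(pull, n_arms, horizon, delta=0.05):
    """pull(i) returns a stochastic reward in [0, 1] for arm i."""
    active = list(range(n_arms))
    sums = [0.0] * n_arms
    counts = [0] * n_arms
    t = 0
    while t < horizon and len(active) > 1:
        for i in active:                      # sample every surviving arm once
            sums[i] += pull(i)
            counts[i] += 1
            t += 1
        means = {i: sums[i] / counts[i] for i in active}
        radius = {i: math.sqrt(math.log(2 * n_arms * horizon / delta) / (2 * counts[i]))
                  for i in active}
        best_lcb = max(means[i] - radius[i] for i in active)
        # Eliminate arms whose UCB is below the best arm's LCB.
        active = [i for i in active if means[i] + radius[i] >= best_lcb]
    return active  # surviving candidate(s) for the best arm
```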
Journal Article
Multi-objective contextual bandits in recommendation systems for smart tourism
2025
In the context of smart tourism, recommender systems play a pivotal role in enhancing the personalization and quality of travel experiences. Tourists often face challenges in decision-making due to information overload. While context-aware recommender systems provide promising solutions by utilizing dynamic contextual data such as time, weather, and location, they struggle to adapt to real-time changes and to balance multiple objectives effectively. To address these challenges, this paper introduces a novel multi-objective contextual multi-armed bandit (MOC-MAB)-based recommender system. This approach integrates the strengths of contextual bandit algorithms with multi-objective optimization to provide personalized recommendations while simultaneously considering relevance and fairness. The proposed system dynamically learns from user feedback to optimize multi-objective recommendations. Extensive experiments conducted on a designed dataset simulating real-world scenarios and the TripAdvisor dataset demonstrate the approach’s superior performance in terms of cumulative reward, click-through rate, and regret minimization when compared to baseline methods. This study also illustrates its practical application in the smart tourism context of Marrakesh, showcasing its potential to enhance tourism experiences in smart cities.
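One simple way to reconcile two objectives in a contextual bandit is to scalarize them with fixed weights and run LinUCB on the combined reward, as sketched below; this is a baseline-style illustration under assumed weights, not the MOC-MAB algorithm itself.

```python
import numpy as np

class ScalarizedLinUCB:
    def __init__(self, dim, alpha=1.0, w_relevance=0.7, w_fairness=0.3):
        self.alpha = alpha
        self.w = (w_relevance, w_fairness)   # assumed objective weights
        self.A = np.eye(dim)                 # ridge-regression design matrix
        self.b = np.zeros(dim)               # reward-weighted feature sum

    def choose(self, arm_features):
        theta = np.linalg.solve(self.A, self.b)
        scores = [x @ theta + self.alpha * np.sqrt(x @ np.linalg.solve(self.A, x))
                  for x in arm_features]
        return int(np.argmax(scores))

    def update(self, x, relevance_reward, fairness_reward):
        # Scalarize the two objectives into one learning signal.
        r = self.w[0] * relevance_reward + self.w[1] * fairness_reward
        self.A += np.outer(x, x)
        self.b += r * x
```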
Journal Article
Bandit algorithms to personalize educational chatbots
by Sheng, Hao; Wei, Johnny Tian-Zheng; Williams, Joseph Jay in Algorithms, Chatbots, Customization
2021
To emulate the interactivity of in-person math instruction, we developed MathBot, a rule-based chatbot that explains math concepts, provides practice questions, and offers tailored feedback. We evaluated MathBot through three Amazon Mechanical Turk studies in which participants learned about arithmetic sequences. In the first study, we found that more than 40% of our participants indicated a preference for learning with MathBot over videos and written tutorials from Khan Academy. The second study measured learning gains, and found that MathBot produced comparable gains to Khan Academy videos and tutorials. We solicited feedback from users in those two studies to emulate a real-world development cycle, with some users finding the lesson too slow and others finding it too fast. We addressed these concerns in the third and main study by integrating a contextual bandit algorithm into MathBot to personalize the pace of the conversation, allowing the bandit to either insert extra practice problems or skip explanations. We randomized participants between two conditions in which actions were chosen uniformly at random (i.e., a randomized A/B experiment) or by the contextual bandit. We found that the bandit learned a similarly effective pedagogical policy to that learned by the randomized A/B experiment while incurring a lower cost of experimentation. Our findings suggest that personalized conversational agents are promising tools to complement existing online resources for math education, and that data-driven approaches such as contextual bandits are valuable tools for learning effective personalization.
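As a rough sketch of the kind of pacing bandit described above: per-action ridge regression over learner-state features with epsilon-greedy exploration. The feature set and action names are hypothetical, not MathBot's implementation.

```python
import numpy as np

ACTIONS = ["extra_practice", "continue", "skip_explanation"]  # assumed pacing actions

class PacingBandit:
    def __init__(self, dim, epsilon=0.1, lam=1.0):
        self.epsilon = epsilon
        self.A = {a: lam * np.eye(dim) for a in ACTIONS}   # per-action design matrices
        self.b = {a: np.zeros(dim) for a in ACTIONS}       # per-action reward sums

    def choose(self, x, rng=np.random):
        if rng.random() < self.epsilon:
            return rng.choice(ACTIONS)
        preds = {a: x @ np.linalg.solve(self.A[a], self.b[a]) for a in ACTIONS}
        return max(preds, key=preds.get)

    def update(self, x, action, reward):
        self.A[action] += np.outer(x, x)
        self.b[action] += reward * x

# x might encode recent quiz accuracy, response latency, and message count (assumed features).
bandit = PacingBandit(dim=3)
x = np.array([0.4, 1.2, 7.0])
action = bandit.choose(x)
bandit.update(x, action, reward=1.0)
```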
Journal Article
Learning to bid and rank together in recommendation systems
by Fahid, Fahmid Morshed; Xiao, Jun; Ji, Geng in Artificial Intelligence, Artificial neural networks, Bids
2024
Many Internet applications adopt real-time bidding mechanisms to ensure different services (types of content) are shown to the users through fair competitions. The service offering the highest bid price gets the content slot to present a list of items in its candidate pool. Through user interactions with the recommended items, the service obtains the desired engagement activities. We propose a contextual-bandit framework to jointly optimize the price to bid for the slot and the order to rank its candidates for a given service in this type of recommendation systems. Our method can take as input any feature that describes the user and the candidates, including the outputs of other machine learning models. We train reinforcement learning policies using deep neural networks, and compute top-K Gaussian propensity scores to exclude the variance in the gradients caused by randomness unrelated to the reward. This setup further facilitates us to automatically find accurate reward functions that trade off between budget spending and user engagements. In online A/B experiments on two major services of Facebook Home Feed, Groups You Should Join and Friend Requests, our method statistically significantly boosted the number of groups joined by 14.7%, the number of friend requests accepted by 7.0%, and the number of daily active Facebook users by about 1 million, against strong hand-tuned baselines that have been iterated in production over years.
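A rough sketch of how logged propensities can enter a policy-gradient update for a Gaussian bid policy; the production system's deep networks, ranking head, and top-K propensity computation are not reproduced here, and the names below are illustrative assumptions.

```python
import numpy as np

def gaussian_logpdf(a, mean, sigma):
    return -0.5 * ((a - mean) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def reinforce_step(theta, logged, sigma=1.0, lr=0.01):
    """theta: weights of a linear bid policy (bid_mean = theta @ x), as a numpy array.
    logged: list of (x, bid, reward, behavior_propensity) tuples from the logging policy."""
    grad = np.zeros_like(theta)
    for x, bid, reward, behavior_p in logged:
        mean = theta @ x
        # Importance weight corrects for the logging policy's action distribution.
        iw = np.exp(gaussian_logpdf(bid, mean, sigma)) / behavior_p
        # Score-function gradient of the Gaussian log-density w.r.t. theta.
        score = (bid - mean) / (sigma ** 2) * x
        grad += iw * reward * score
    return theta + lr * grad / len(logged)
```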
Journal Article
Thompson Sampling for Stochastic Bandits with Noisy Contexts: An Information-Theoretic Regret Analysis
2024
We study stochastic linear contextual bandits (CB) where the agent observes a noisy version of the true context through a noise channel with unknown channel parameters. Our objective is to design an action policy that can “approximate” that of a Bayesian oracle that has access to the reward model and the noise channel parameter. We introduce a modified Thompson sampling algorithm and analyze its Bayesian cumulative regret with respect to the oracle action policy via information-theoretic tools. For Gaussian bandits with Gaussian context noise, our information-theoretic analysis shows that under certain conditions on the prior variance, the Bayesian cumulative regret scales as Õ(m√T), where m is the dimension of the feature vector and T is the time horizon. We also consider the problem setting where the agent observes the true context with some delay after receiving the reward, and show that delayed true contexts lead to lower regret. Finally, we empirically demonstrate the performance of the proposed algorithms against baselines.
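For reference, a minimal linear Gaussian Thompson sampling loop; the paper's modified algorithm additionally models the noise channel on the observed context, which this sketch omits.

```python
import numpy as np

class LinearTS:
    def __init__(self, dim, sigma=1.0, prior_var=1.0):
        self.sigma2 = sigma ** 2
        self.precision = np.eye(dim) / prior_var   # posterior precision of theta
        self.b = np.zeros(dim)                     # precision-weighted mean term

    def choose(self, arm_contexts, rng):
        cov = np.linalg.inv(self.precision)
        mean = cov @ self.b
        theta = rng.multivariate_normal(mean, cov)  # posterior sample of theta
        return int(np.argmax([x @ theta for x in arm_contexts]))

    def update(self, x, reward):
        self.precision += np.outer(x, x) / self.sigma2
        self.b += reward * x / self.sigma2

# Hypothetical usage with randomly drawn arm contexts.
rng = np.random.default_rng(0)
agent = LinearTS(dim=4)
arms = [rng.normal(size=4) for _ in range(5)]
i = agent.choose(arms, rng)
agent.update(arms[i], reward=1.0)
```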
Journal Article
Lessons on off-policy methods from a notification component of a chatbot
2021
This work serves as a review of our experience applying off-policy techniques to train and evaluate a contextual bandit model powering a troubleshooting notification in a chatbot. First, we demonstrate the effectiveness of off-policy evaluation when data volume is orders of magnitude less than typically found in the literature. We present our reward function and choices behind its design, as well as how we construct our logging policy to balance exploration and performance on key metrics. Next, we present a guided framework to update a model post-training called Post-Hoc Reward Distribution Hacking, which we employed to improve model performance and correct deficiencies in trained models stemming from the existence of a null action and a noisy reward signal. Throughout the work, we include discussions of various practical pitfalls encountered while using off-policy methods in the hope of expediting other applications of these techniques.
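A sketch of the basic inverse propensity scoring (IPS) and self-normalized IPS estimators that underlie this kind of off-policy evaluation; the authors' reward function and logging policy are not reproduced, and the names below are assumptions.

```python
import numpy as np

def ips(logged, target_prob):
    """logged: iterable of (context, action, reward, logging_propensity).
    target_prob(x, a): probability the candidate policy assigns to action a in context x."""
    weights, weighted_rewards = [], []
    for x, a, r, p_log in logged:
        w = target_prob(x, a) / p_log          # importance weight
        weights.append(w)
        weighted_rewards.append(w * r)
    ips_value = float(np.mean(weighted_rewards))
    snips_value = float(np.sum(weighted_rewards) / np.sum(weights))  # self-normalized variant
    return ips_value, snips_value
```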
Journal Article