1,356 result(s) for "Multi-armed bandit problems"
KULLBACK-LEIBLER UPPER CONFIDENCE BOUNDS FOR OPTIMAL SEQUENTIAL ALLOCATION
We consider optimal sequential allocation in the context of the so-called stochastic multi-armed bandit model. We describe a generic index policy, in the sense of Gittins [J. R. Stat. Soc. Ser. B Stat. Methodol. 41 (1979) 148–177], based on upper confidence bounds of the arm payoffs computed using the Kullback-Leibler divergence. We consider two classes of distributions for which instances of this general idea are analyzed: the kl-UCB algorithm is designed for one-parameter exponential families and the empirical KL-UCB algorithm for bounded and finitely supported distributions. Our main contribution is a unified finite-time analysis of the regret of these algorithms that asymptotically matches the lower bounds of Lai and Robbins [Adv. in Appl. Math. 6 (1985) 4–22] and Burnetas and Katehakis [Adv. in Appl. Math. 17 (1996) 122–142], respectively. We also investigate the behavior of these algorithms when used with general bounded rewards, showing in particular that they provide significant improvements over the state-of-the-art.
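The index described in this abstract can be illustrated for the simplest one-parameter exponential family, Bernoulli rewards. The following is a minimal sketch, not the authors' implementation: the function names, the bisection tolerance `tol`, and the exploration budget `log t + c log log t` (with the default `c = 0`) are assumptions made for the example.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence KL(Ber(p) || Ber(q)); inputs are clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ucb_index(mean, pulls, t, c=0.0, tol=1e-6):
    """Upper confidence bound for one arm (illustrative sketch).

    Returns (up to `tol`) the largest q in [mean, 1] such that
    pulls * KL(mean, q) <= log(t) + c * log(log(t)).
    The budget formula and the default c are assumptions of this sketch.
    """
    budget = math.log(t) + c * math.log(max(math.log(t), 1.0))
    lo, hi = mean, 1.0
    # KL(mean, q) is increasing in q on [mean, 1], so bisection applies.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if pulls * bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```

An index policy would compute `kl_ucb_index` for every arm at each round and pull the arm with the largest value; the index shrinks toward the empirical mean as an arm accumulates pulls.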
Combinatorial MAB-Based Joint Channel and Spreading Factor Selection for LoRa Devices
Long-Range (LoRa) devices have been deployed in many Internet of Things (IoT) applications due to their ability to communicate over long distances with low power consumption. The scalability and communication performance of LoRa systems are highly dependent on the spreading factor (SF) and channel allocations. In particular, it is important to set the SF appropriately according to the distance between the LoRa device and the gateway, since the signal reception sensitivity and bit rate both depend on the SF used and are in a trade-off relationship. In addition, given the recent surge in the number of LoRa devices, the scalability of LoRa systems is also greatly affected by the channels that the devices use for communication. Our previous study demonstrated that lightweight decentralized learning-based joint channel and SF-selection methods can make appropriate decisions with low computational complexity and power consumption. However, the effect of the location of the LoRa devices on communication performance in a practical larger-scale LoRa system has not been studied. Hence, to clarify this effect, in this paper we implemented and evaluated the learning-based joint channel and SF-selection methods in a practical LoRa system. In the learning-based methods, the channel and SF are decided based only on acknowledgement (ACK) information. The learning methods evaluated in this paper were the Tug of War dynamics, Upper Confidence Bound 1, and ϵ-greedy algorithms. Moreover, to account for the relationship between the channel and SF, we propose a combinatorial multi-armed bandit-based joint channel and SF-selection method. In the combinatorial methods, each combination of channel and SF is set as an arm, whereas in the independent methods evaluated in our previous work, the SF and channel are set as independent arms. The experimental results show the following. First, the combinatorial methods can achieve a higher frame success rate (FSR) and fairness than the independent methods. In addition, the FSR can be improved by joint channel and SF selection compared to SF selection only. Moreover, the channel and SF selection depends on the location of the devices to a great extent.
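The idea of treating each (channel, SF) combination as a single arm can be sketched with the standard UCB1 rule, using only ACK feedback as the reward. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class name `JointUCB1`, its methods, and the 0/1 ACK reward are hypothetical choices for the example.

```python
import itertools
import math


class JointUCB1:
    """UCB1 over combined (channel, spreading factor) arms (illustrative)."""

    def __init__(self, channels, spreading_factors):
        # Every (channel, SF) pair is one arm of the bandit.
        self.arms = list(itertools.product(channels, spreading_factors))
        self.counts = [0] * len(self.arms)     # transmissions per arm
        self.successes = [0] * len(self.arms)  # ACKs received per arm
        self.t = 0                             # total selections so far

    def select(self):
        """Return the index of the arm to use for the next transmission."""
        self.t += 1
        # Try every arm once before applying the confidence bound.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        scores = [
            s / n + math.sqrt(2 * math.log(self.t) / n)
            for s, n in zip(self.successes, self.counts)
        ]
        return max(range(len(self.arms)), key=scores.__getitem__)

    def update(self, arm, ack):
        """Record whether the transmission on `arm` was acknowledged."""
        self.counts[arm] += 1
        self.successes[arm] += 1 if ack else 0
```

An independent variant would instead run one such learner over channels and another over SFs; the combined-arm version has more arms to explore but can capture interactions between the two settings.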
Reducing Computational Time in Pixel-Based Path Planning for GMA-DED by Using Multi-Armed Bandit Reinforcement Learning Algorithm
This work presents an artificial intelligence technique to minimise the computer processing time of path planning for successful GMA-DED 3D printing. An advanced version of the Pixel space-filling-based strategy family is proposed and developed, using a Reinforcement Learning technique, new to GMA-DED, to optimise its heuristics. The initial concept was to boost the preceding Enhanced-Pixel version of the Pixel planning strategy by applying the solution of the Multi-Armed Bandit problem in the algorithms. Computational validation was initially performed to evaluate the Advanced-Pixel improvements systematically and in comparison with the Enhanced-Pixel strategy. A testbed was set up to compare experimentally the performance of both algorithm versions. The results showed that the reduced processing time reached with the Advanced-Pixel strategy did not affect the performance gains of the Pixel strategy. A larger build was printed as a case study to conclude the study. The results highlight the role of the Reinforcement Learning technique in printing functional structures more efficiently.
Exploring Multi-Armed Bandit (MAB) as an AI Tool for Optimising GMA-WAAM Path Planning
Conventional path-planning strategies for GMA-WAAM may encounter challenges related to geometrical features when printing complex-shaped builds. One alternative to mitigate geometry-related flaws is to use algorithms that optimise trajectory choices, for instance, using heuristics to find the most efficient trajectory. The algorithm can assess several trajectory strategies, such as contour, zigzag, raster, and even space-filling, to search for the best strategy according to the case. However, handling complex geometries by this means poses computational efficiency concerns. This research aimed to explore the potential of machine learning techniques as a solution to increase the computational efficiency of such algorithms. First, reinforcement learning (RL) concepts are introduced and compared with supervised machine learning concepts. The Multi-Armed Bandit (MAB) problem is explained and justified as a choice within the RL techniques. As a case study, a space-filling strategy was chosen to have this machine learning optimisation artifice in its algorithm for GMA-AM printing. Computational and experimental validations were conducted, demonstrating that adding MAB in the algorithm helped to achieve shorter trajectories, using fewer iterations than the original algorithm, potentially reducing printing time. These findings position the RL techniques, particularly MAB, as a promising machine learning solution to address setbacks in the space-filling strategy applied.
Reinforcement Learning in Economics and Finance
Reinforcement learning algorithms describe how an agent can learn an optimal action policy in a sequential decision process, through repeated experience. In a given environment, the agent's policy provides him some running and terminal rewards. As in online learning, the agent learns sequentially. As in multi-armed bandit problems, when an agent picks an action, he cannot infer ex-post the rewards induced by other action choices. In reinforcement learning, his actions have consequences: they influence not only rewards, but also future states of the world. The goal of reinforcement learning is to find an optimal policy – a mapping from the states of the world to the set of actions, in order to maximize cumulative reward, which is a long-term strategy. Exploring might be sub-optimal on a short-term horizon but could lead to optimal long-term outcomes. Many problems of optimal control, popular in economics for more than forty years, can be expressed in the reinforcement learning framework, and recent advances in computational science, provided in particular by deep learning algorithms, can be used by economists in order to solve complex behavioral problems. In this article, we survey the state of the art of reinforcement learning techniques, and present applications in economics, game theory, operations research and finance.
Bandit Learning with Concurrent Transmissions for Energy-Efficient Flooding in Sensor Networks
Concurrent transmissions, a novel communication paradigm, has been shown to effectively accomplish a reliable and energy-efficient flooding in low-power wireless networks. With multiple nodes exploiting a receive-and-forward scheme in the network, this technique inevitably introduces communication redundancy and consequently raises the energy consumption of the nodes. In this article, we propose Less is More (LiM), an energy-efficient flooding protocol for wireless sensor networks. LiM builds on concurrent transmissions, exploiting constructive interference and the capture effect to achieve high reliability and low latency. Moreover, LiM is equipped with a machine learning capability to progressively reduce redundancy while maintaining high reliability. As a result, LiM is able to significantly reduce the radio-on time and therefore the energy consumption. We compare LiM with our baseline protocol Glossy by extensive experiments in the 30-node testbed FlockLab. Experimental results show that LiM highly reduces the broadcast redundancy in flooding. It outperforms the baseline protocol in terms of radio-on time, while attaining a high reliability of over 99.50% and an average end-to-end latency around 2 milliseconds in all experimental scenarios.
BATCHED BANDIT PROBLEMS
Motivated by practical applications, chiefly clinical trials, we study the regret achievable for stochastic bandits under the constraint that the employed policy must split trials into a small number of batches. We propose a simple policy, and show that a very small number of batches gives close to minimax optimal regret bounds. As a byproduct, we derive optimal policies with low switching cost for stochastic bandits.
Good arm identification via bandit feedback
We consider a novel stochastic multi-armed bandit problem called good arm identification (GAI), where a good arm is defined as an arm with expected reward greater than or equal to a given threshold. GAI is a pure-exploration problem in which a single agent repeats a process of outputting an arm as soon as it is identified as a good one before confirming the other arms are actually not good. The objective of GAI is to minimize the number of samples for each process. We find that GAI faces a new kind of dilemma, the exploration-exploitation dilemma of confidence, which is different from the best arm identification. As a result, an efficient design of algorithms for GAI is quite different from that for the best arm identification. We derive a lower bound on the sample complexity of GAI that is tight up to the logarithmic factor $\mathrm{O}(\log \frac{1}{\delta})$ for acceptance error rate $\delta$. We also develop an algorithm whose sample complexity almost matches the lower bound. We also confirm experimentally that our proposed algorithm outperforms naive algorithms in synthetic settings based on a conventional bandit problem and clinical trial research for rheumatoid arthritis.
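To make the problem setting concrete, the following sketch shows a naive baseline for good arm identification, not the algorithm proposed in the paper. It assumes rewards in [0, 1], samples the remaining arms in round-robin order, and uses a Hoeffding confidence radius with a union bound over arms and sample counts; the function name, the `pull_fns` interface, and `max_pulls` are hypothetical.

```python
import math


def naive_good_arm_identification(pull_fns, threshold, delta, max_pulls=100_000):
    """Naive round-robin baseline for good arm identification (illustrative).

    pull_fns:  list of zero-argument callables returning a reward in [0, 1].
    threshold: an arm is "good" if its expected reward is >= threshold.
    delta:     target error rate used in the confidence radius.
    Returns (good_arms_in_output_order, total_number_of_pulls).
    """
    k = len(pull_fns)
    sums = [0.0] * k
    counts = [0] * k
    active = set(range(k))
    good = []
    total = 0
    while active and total < max_pulls:
        for arm in sorted(active):  # sorted() copies, so removal is safe
            sums[arm] += pull_fns[arm]()
            counts[arm] += 1
            total += 1
            n = counts[arm]
            mean = sums[arm] / n
            # Hoeffding radius with a union bound over k arms and all n.
            radius = math.sqrt(math.log(4 * k * n * n / delta) / (2 * n))
            if mean - radius >= threshold:
                good.append(arm)      # output as soon as identified as good
                active.discard(arm)
            elif mean + radius < threshold:
                active.discard(arm)   # confidently not good; stop sampling
    return good, total
```

Uniform sampling of this kind ignores the dilemma the abstract describes: it spends as many samples on clearly bad arms as on promising ones, which delays the output of good arms.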
Covariate-Adjusted Response-Adaptive Randomization for Multi-Arm Clinical Trials Using a Modified Forward Looking Gittins Index Rule
We introduce a non-myopic, covariate-adjusted response adaptive (CARA) allocation design for multi-armed clinical trials. The allocation scheme is a computationally tractable procedure based on the Gittins index solution to the classic multi-armed bandit problem and extends the procedure recently proposed in Villar et al. (2015). Our proposed CARA randomization procedure is defined by reformulating the bandit problem with covariates into a classic bandit problem in which there are multiple combination arms, considering every arm per each covariate category as a distinct treatment arm. We then apply a heuristically modified Gittins index rule to solve the problem and define allocation probabilities from the resulting solution. We report the efficiency, balance, and ethical performance of our approach compared to existing CARA methods using a recently published clinical trial as motivation. The net savings in terms of expected number of treatment failures is considerably larger and probably enough to make this design attractive for certain studies where known covariates are expected to be important, stratification is not desired, treatment failures have a high ethical cost, and the disease under study is rare. In a two-armed context, this patient benefit advantage comes at the expense of increased variability in the allocation proportions and a reduction in statistical power. However, in a multi-armed context, simple modifications of the proposed CARA rule can be incorporated so that an ethical advantage can be offered without sacrificing power in comparison with balanced designs.
Multi-armed bandits with dependent arms
We study a variant of the multi-armed bandit problem (MABP) which we call MABs with dependent arms. Multiple arms are grouped together to form a cluster, and the reward distributions of arms in the same cluster are known functions of an unknown parameter that is a characteristic of the cluster. Thus, pulling an arm i not only reveals information about its own reward distribution, but also about all arms belonging to the same cluster. This “correlation” among the arms complicates the exploration–exploitation trade-off that is encountered in the MABP because the observation dependencies allow us to test simultaneously multiple hypotheses regarding the optimality of an arm. We develop learning algorithms based on the principle of optimism in the face of uncertainty (Lattimore and Szepesvári in Bandit algorithms, Cambridge University Press, 2020), which know the clusters, and hence utilize these additional side observations appropriately while performing the exploration–exploitation trade-off. We show that the regret of our algorithms grows as O(K log T), where K is the number of clusters. In contrast, for an algorithm such as the vanilla UCB that does not utilize these dependencies, the regret scales as O(M log T), where M is the number of arms. When K ≪ M, i.e. there are many dependencies among arms, our proposed algorithm drastically reduces the dependence of regret on the number of arms.