Knowledge

Multi-armed bandit

Source 📝

1499:: An adaptive epsilon adaptation strategy for reinforcement learning similar to VBDE, with monotone convergence guarantees. In this framework, the epsilon parameter is viewed as the expectation of a posterior distribution weighting a greedy agent (that fully trusts the learned reward) and uniform learning agent (that distrusts the learned reward). This posterior is approximated using a suitable Beta distribution under the assumption of normality of observed rewards. In order to address the possible risk of decreasing epsilon too quickly, uncertainty in the variance of the learned reward is also modeled and updated using a normal-gamma model. (Gimelfarb et al., 2019). 1675: 1266:
techniques, this work provided practically applicable optimal solutions for Bernoulli bandits provided that time horizons and numbers of arms did not become excessively large. Pilarski et al. later extended this work in "Delayed Reward Bernoulli Bandits: Optimal Policy and Predictive Meta-Algorithm PARDI" to create a method of determining the optimal policy for Bernoulli bandits when rewards may not be immediately revealed following a decision and may be delayed. This method relies upon calculating expected values of reward outcomes which have not yet been revealed and updating posterior probabilities when rewards are revealed.
1262:
constructed an explicit form for a class of adaptive policies with uniformly maximum convergence rate properties for the total expected finite horizon reward under sufficient assumptions of finite state-action spaces and irreducibility of the transition law. A main feature of these policies is that the choice of actions, at each state and time period, is based on indices that are inflations of the right-hand side of the estimated average reward optimality equations. These inflations have recently been called the optimistic approach in the work of Tewari and Bartlett, Ortner Filippi, Cappé, and Garivier, and Honda and Takemura.
130: 1711:. In this example, each adversary has two arms to pull. They can either Deny or Confess. Standard stochastic bandit algorithms don't work very well with these iterations. For example, if the opponent cooperates in the first 100 rounds, defects for the next 200, then cooperate in the following 300, etc. then algorithms such as UCB won't be able to react very quickly to these changes. This is because after a certain point sub-optimal arms are rarely pulled to limit exploration and focus on exploitation. When the environment changes the algorithm is unable to adapt or may not even detect the change. 1689:: The framework of UCB-ALP is shown in the right figure. UCB-ALP is a simple algorithm that combines the UCB method with an Adaptive Linear Programming (ALP) algorithm, and can be easily deployed in practical systems. It is the first work that show how to achieve logarithmic regret in constrained contextual bandits. Although is devoted to a special case with single budget constraint and fixed cost, the results shed light on the design and analysis of algorithms for more general CCB problems. 20: 7893: 7873: 3981:
algorithm is the f-Discounted-Sliding-Window Thompson Sampling (f-dsw TS) proposed by Cavenaghi et al. The f-dsw TS algorithm exploits a discount factor on the reward history and an arm-related sliding window to contrast concept drift in non-stationary environments. Another work by Burtini et al. introduces a weighted least squares Thompson sampling approach (WLS-TS), which proves beneficial in both the known and unknown non-stationary cases.
720: 6564: 4019:
2014 by the work on the CLUB algorithm. Following this work, several other researchers created algorithms to learn multiple models at the same time under bandit feedback. For example, COFIBA was introduced by Li and Karatzoglou and Gentile (SIGIR 2016), where the classical collaborative filtering, and content-based filtering methods try to learn a static recommendation model given training data.
50:) is a problem in which a decision maker iteratively selects one of multiple fixed choices (i.e., arms or actions) when the properties of each choice are only partially known at the time of allocation, and may become better understood as time passes. A fundamental aspect of bandit problems is that choosing an arm does not affect the properties of the arm or other arms. 1489:: Similar to the epsilon-decreasing strategy, except that epsilon is reduced on basis of the learning progress instead of manual tuning (Tokic, 2010). High fluctuations in the value estimates lead to a high epsilon (high exploration, low exploitation); low fluctuations to a low epsilon (low exploration, high exploitation). Further improvements can be achieved by a 1736:
information and to multi-armed bandits in the mixed stochastic-adversarial setting . The paper presented an empirical evaluation and improved analysis of the performance of the EXP3 algorithm in the stochastic setting, as well as a modification of the EXP3 algorithm capable of achieving "logarithmic" regret in stochastic environment.
4018:
Approaches using multiple bandits that cooperate sharing knowledge in order to better optimize their performance started in 2013 with "A Gang of Bandits", an algorithm relying on a similarity graph between the different bandit problems to share knowledge. The need of a similarity graph was removed in
3997:
The dueling bandit variant was introduced by Yue et al. (2012) to model the exploration-versus-exploitation tradeoff for relative feedback. In this variant the gambler is allowed to pull two levers at the same time, but they only get a binary feedback telling which lever provided the best reward. The
137:
The multi-armed bandit problem models an agent that simultaneously attempts to acquire new knowledge (called "exploration") and optimize their decisions based on existing knowledge (called "exploitation"). The agent attempts to balance these competing tasks in order to maximize their total value over
1244:
In the paper "Asymptotically efficient adaptive allocation rules", Lai and Robbins (following papers of Robbins and his co-workers going back to Robbins in the year 1952) constructed convergent population selection policies that possess the fastest rate of convergence (to the population with highest
698:
Another formulation of the multi-armed bandit has each arm representing an independent Markov machine. Each time a particular arm is played, the state of that machine advances to a new one, chosen according to the Markov state evolution probabilities. There is a reward depending on the current state
1637:
In practice, there is usually a cost associated with the resource consumed by each action and the total cost is limited by a budget in many applications such as crowdsourcing and clinical trials. Constrained contextual bandit (CCB) is such a model that considers both the time and budget constraints
1540:
A useful generalization of the multi-armed bandit is the contextual multi-armed bandit. At each iteration an agent still has to choose between arms, but they also see a d-dimensional feature vector, the context vector they can use together with the rewards of the arms played in the past to make the
1265:
For Bernoulli multi-armed bandits, Pilarski et al. studied computation methods of deriving fully optimal solutions (not just asymptotically) using dynamic programming in the paper "Optimal Policy for Bernoulli Bandits: Computation and Algorithm Gauge." Via indexing schemes, lookup tables, and other
1261:
Later in "Optimal adaptive policies for Markov decision processes" Burnetas and Katehakis studied the much larger model of Markov Decision Processes under partial information, where the transition law and/or the expected one period rewards may depend on unknown parameters. In this work, the authors
1735:
EXP3 is a popular algorithm for adversarial multiarmed bandits, suggested and analyzed in this setting by Auer et al. . Recently there was an increased interest in the performance of this algorithm in the stochastic setting, due to its new applications to stochastic multi-armed bandits with side
1698:
Another variant of the multi-armed bandit problem is called the adversarial bandit, first introduced by Auer and Cesa-Bianchi (1998). In this variant, at each iteration, an agent chooses an arm and an adversary simultaneously chooses the payoff structure for each arm. This is one of the strongest
4027:
The Combinatorial Multiarmed Bandit (CMAB) problem arises when instead of a single discrete variable to choose from, an agent needs to choose values for a set of variables. Assuming each variable is discrete, the number of possible choices per iteration is exponential in the number of variables.
1269:
When optimal solutions to multi-arm bandit tasks are used to derive the value of animals' choices, the activity of neurons in the amygdala and ventral striatum encodes the values derived from these policies, and can be used to decode when the animals make exploratory versus exploitative choices.
1257:
in the paper "Optimal adaptive policies for sequential allocation problems", where index based policies with uniformly maximum convergence rate were constructed, under more general conditions that include the case in which the distributions of outcomes from each population depend on a vector of
3980:
Garivier and Moulines derive some of the first results with respect to bandit problems where the underlying model can change during play. A number of algorithms were presented to deal with this case, including Discounted UCB and Sliding-Window UCB. A similar approach based on Thompson Sampling
4009:
More recently, researchers have generalized algorithms from traditional MAB to dueling bandits: Relative Upper Confidence Bounds (RUCB), Relative EXponential weighing (REX3), Copeland Confidence Bounds (CCB), Relative Minimum Empirical Divergence (RMED), and Double Thompson Sampling (DTS).
3976: 2400: 95:. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls. The crucial tradeoff the gambler faces at each trial is between "exploitation" of the machine that has the highest expected payoff and "exploration" to get more 3998:
difficulty of this problem stems from the fact that the gambler has no way of directly observing the reward of their actions. The earliest algorithms for this problem are InterleaveFiltering, Beat-The-Mean. The relative feedback of dueling bandits can also lead to
80:"), who has to decide which machines to play, how many times to play each machine and in which order to play them, and whether to continue with the current machine or try a different machine. The multi-armed bandit problem also falls into the broad category of 2008: 1531:
for each lever. For example, as illustrated with the POKER algorithm, the price can be the sum of the expected reward plus an estimation of extra future rewards that will gain through the additional knowledge. The lever of highest price is always pulled.
1270:
Moreover, optimal policies better predict animals' choice behavior than alternative strategies (described below). This suggests that the optimal solutions to multi-arm bandit problems are biologically plausible, despite being computationally demanding.
699:
of the machine. In a generalization called the "restless bandit problem", the states of non-played arms can also evolve over time. There has also been discussion of systems where the number of choices (about which arm to play) increases over time.
1541:
choice of the arm to play. Over time, the learner's aim is to collect enough information about how the context vectors and rewards relate to each other, so that it can predict the next best arm to play by looking at the feature vectors.
658:
tends to zero with probability 1 when the number of played rounds tends to infinity. Intuitively, zero-regret strategies are guaranteed to converge to a (not necessarily unique) optimal strategy if enough rounds are played.
348:
be the mean values associated with these reward distributions. The gambler iteratively plays one lever per round and observes the associated reward. The objective is to maximize the sum of the collected rewards. The horizon
1258:
unknown parameters. Burnetas and Katehakis (1996) also provided an explicit solution for the important case in which the distributions of outcomes follow arbitrary (i.e., non-parametric) discrete, univariate distributions.
2515: 1111: 171:
The model has also been used to control dynamic allocation of resources to different projects, answering the question of which project to work on, given uncertainty about the difficulty and payoff of each possibility.
6036:
Besbes, O.; Gur, Y.; Zeevi, A. Stochastic multi-armed-bandit problem with non-stationary rewards. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp.
3759: 1229:
A major breakthrough was the construction of optimal population selection strategies, or policies (that possess uniformly maximum convergence rate to the population with highest mean) in the work described below.
495: 57:. A notable alternative setup for the multi-armed bandit problem include the "best arm identification" problem where the goal is instead to identify the best choice by the end of a finite number of rounds. 3618: 1699:
generalizations of the bandit problem as it removes all assumptions of the distribution and a solution to the adversarial bandit problem is a generalized solution to the more specific bandit problems.
160:
In these practical examples, the problem requires balancing reward maximization based on the knowledge already acquired with attempting new actions to further increase knowledge. This is known as the
2259: 2920: 3033: 1199: 1045: 3420: 6005:
Seldin, Y., Szepesvári, C., Auer, P. and Abbasi-Yadkori, Y., 2012, December. Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments. In EWRL (pp. 103–116).
2793: 3252: 581: 53:
Instances of the multi-armed bandit problem include the task of iteratively allocating a fixed, limited set of resources between competing (alternative) choices in a way that minimizes the
4028:
Several CMAB settings have been studied in the literature, from settings where the variables are binary to more general setting where each variable can take an arbitrary set of values.
346: 265: 1879: 1638:
in a multi-armed bandit setting. A. Badanidiyuru et al. first studied contextual bandits with budget constraints, also referred to as Resourceful Contextual Bandits, and show that a
2636: 2149: 300: 6072:
Cavenaghi, E.; Sottocornola, G.; Stella, F.; Zanker, M. Non Stationary Multi-Armed Bandit: Empirical Evaluation of a New Concept Drift-Aware Algorithm. Entropy 2021, 23, 380. <
2724: 113:
in 1952, realizing the importance of the problem, constructed convergent population selection strategies in "some aspects of the sequential design of experiments". A theorem, the
3457: 3196: 6901:
Python implementation of bandit strategies that supports context-free, parametric and non-parametric contextual policies with built-in parallelization and simulation capability.
1628:: The algorithm reduces the contextual bandit problem into a series of supervised learning problem, and does not rely on typical realizability assumption on the reward function. 922: 617: 3712: 3522: 854: 2210: 1827: 1783: 989: 3659: 1669: 1455: 1371: 1253:
simplifications of the policy and the main proof were given for the case of normal populations with known variances. The next notable progress was obtained by Burnetas and
3489: 3284: 2554: 1325: 7767: 1420: 2254: 2052: 1871: 1481: 1345: 3331: 525: 2574: 1565:: the authors assume a linear dependency between the expected reward of an action and its context and model the representation space using a set of linear predictors. 1219: 1143: 656: 6444:
Gai, Y.; Krishnamachari, B.; Jain, R. (2010), "Learning multiuser channel allocations in cognitive radio networks: A combinatorial multi-armed bandit formulation",
5795:
Branislav Kveton; Manzil Zaheer; Csaba Szepesvári; Lihong Li; Mohammad Ghavamzadeh; Craig Boutilier (2020), "Randomized exploration in generalized linear bandits",
4464: 3092:
In the original specification and in the above variants, the bandit problem is specified with a discrete and finite number of arms, often indicated by the variable
2671: 2079: 942: 880: 394: 3752: 3679: 2822: 702:
Computer science researchers have studied multi-armed bandits under worst-case assumptions, obtaining algorithms to minimize regret in both finite and infinite (
3732: 3542: 3351: 3304: 3166: 3130: 3110: 1457:
trials. During the exploration phase, a lever is randomly selected (with uniform probability); during the exploitation phase, the best lever is always selected.
1397: 1286:
Semi-uniform strategies were the earliest (and simplest) strategies discovered to approximately solve the bandit problem. All those strategies have in common a
693: 414: 367: 103:. In practice, multi-armed bandits have been used to model problems such as managing research projects in a large organization, like a science foundation or a 3429:
represents the optimal policy to be compared with other policies in the non-stationary setting. The dynamic oracle optimises the expected reward at each step
4156:(2012), "Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems", Foundations and Trends® in Machine Learning: Vol. 5: No. 1, pp 1–122. 2576:
to uniformly randomly explore. After receiving the rewards the weights are updated. The exponential growth significantly increases the weight of good arms.
7952: 68:. In contrast to general RL, the selected actions in bandit problems do not affect the reward distribution of the arms. The name comes from imagining a 7015: 6936: 5950: 1549:
Many strategies exist that provide an approximate solution to the contextual bandit problem, and can be put into two broad categories detailed below.
2405: 5988:
Burtini, Giuseppe, Jason Loeppky, and Ramon Lawrence. "A survey of online experiment design with the stochastic multi-armed bandit." arXiv preprint
7609: 4367:
Press, William H. (2009), "Bandit solutions provide unified ethical models for randomized clinical trials and comparative effectiveness research",
1278:
Many strategies exist which provide an approximate solution to the bandit problem, and can be put into the four broad categories detailed below.
1050: 1600:. Then, UCB is employed on each constant piece. Successive refinements of the partition of the context space are scheduled or chosen adaptively. 3971:{\displaystyle \rho ^{\pi }(T)=\sum _{t=1}^{T}{\mu _{t}^{*}}-\mathbb {E} _{\pi }^{\mu }\left={\mathcal {D}}(T)-\mathbb {E} _{\pi }^{\mu }\left} 1483:
decreases as the experiment progresses, resulting in highly explorative behaviour at the start and highly exploitative behaviour at the finish.
416:
rounds is defined as the expected difference between the reward sum associated with an optimal strategy and the sum of the collected rewards:
6676: 5418: 5436:
Lihong Li; Wei Chu; John Langford; Robert E. Schapire (2010), "A contextual-bandit approach to personalized news article recommendation",
788:. This problem is crucial in various applications, including clinical trials, adaptive routing, recommendation systems, and A/B testing. 1516:
or Bayesian Bandits, and are surprisingly easy to implement if you can sample from the posterior for the mean value of each alternative.
4972:
Filippi, S. and Cappé, O. and Garivier, A. (2010). "Online regret bounds for Markov decision processes with deterministic transitions",
4891: 7947: 421: 6700: 5756: 3549: 6468: 6085:
Improving Online Marketing Experiments with Drifting Multi-armed Bandits, Giuseppe Burtini, Jason Loeppky, Ramon Lawrence, 2015 <
791:
In BAI, the objective is to identify the arm having the highest expected reward. An algorithm in this setting is characterized by a
6290: 4535: 5860: 7125: 6223: 1671:
regret is achievable. However, their work focuses on a finite set of policies, and the algorithm is computationally inefficient.
6188: 6153: 4986:
Honda, J.; Takemura, A. (2011). "An asymptotically optimal policy for finite support models in the multi-armed bandit problem".
2395:{\displaystyle {\hat {x}}_{j}(t)={\begin{cases}x_{j}(t)/p_{j}(t)&{\text{if }}j=i_{t}\\0,&{\text{otherwise}}\end{cases}}} 7008: 6829: 6754: 6710: 6661: 6629:
Dayanik, S.; Powell, W.; Yamazaki, K. (2008), "Index policies for discounted bandit problems with availability constraints",
5318: 5286: 4226: 4192: 6038: 5253: 2827: 7798: 6086: 2925: 6977: 1148: 994: 7899: 7450: 7187: 3356: 741: 4472: 4113:
Katehakis, Michael N.; Veinott, Jr., Arthur F. (1987). "The Multi-Armed Bandit Problem: Decomposition and Computation".
6852: 5533: 4863: 4115: 3043:
We follow the arm that we think has the best performance so far adding exponential noise to it to provide exploration.
2732: 6953: 6729:
Neural Information Processing – 21st International Conference, ICONIP 2014, Malaisia, November 03-06,2014, Proceedings
99:
about the expected payoffs of the other machines. The trade-off between exploration and exploitation is also faced in
7711: 7338: 7145: 7001: 5574: 5471: 3201: 767: 749: 530: 7937: 7666: 6917:
package facilitating the simulation and evaluation of both context-free and contextual Multi-Armed Bandit policies.
6467:
Chen, Wei; Wang, Yajun; Yuan, Yang (2013), "Combinatorial multi-armed bandit: General framework and applications",
65: 5697: 7932: 6102:
Yue, Yisong; Broder, Josef; Kleinberg, Robert; Joachims, Thorsten (2012), "The K-armed dueling bandits problem",
5557:
Hong, Tzung-Pei; Song, Wei-Ping; Chiu, Chu-Tien (November 2011). "Evolutionary Composite Attribute Clustering".
5395: 4240:
Soare, M., Lazaric, A., & Munos, R. (2014). Best-Arm Identification in Linear Bandits. ArXiv, abs/1409.6110.
2003:{\displaystyle p_{i}(t)=(1-\gamma ){\frac {\omega _{i}(t)}{\sum _{j=1}^{K}\omega _{j}(t)}}+{\frac {\gamma }{K}}} 7853: 7793: 7391: 4668:
Auer, P.; Cesa-Bianchi, N.; Freund, Y.; Schapire, R. E. (2002). "The Nonstochastic Multiarmed Bandit Problem".
1708: 1245:
mean) for the case that the population reward distributions are the one-parameter exponential family. Then, in
745: 305: 212: 7386: 7075: 6776: 4637: 4584: 4530: 4494: 4252: 180: 6579: 7828: 7225: 7182: 7135: 7130: 6937:
Leslie Pack Kaelbling and Michael L. Littman (1996). Exploitation versus Exploration: The Single-State Case
2587: 2084: 1622:
is built and analyzed w.r.t the random forest built knowing the joint distribution of contexts and rewards.
1572: 369:
is the number of rounds that remain to be played. The bandit problem is formally equivalent to a one-state
270: 2678: 1674: 1294:
lever (based on previous observations) is always pulled except when a (uniformly) random action is taken.
7927: 7879: 7175: 7101: 3432: 3171: 6257:
Zoghi, Masrour; Karnin, Zohar S; Whiteson, Shimon; Rijke, Maarten D (2015), "Copeland Dueling Bandits",
5336:"ε-BMC: A Bayesian Ensemble Approach to Epsilon-Greedy Exploration in Model-Free Reinforcement Learning" 5242:
Sutton, R. S. & Barto, A. G. 1998 Reinforcement learning: an introduction. Cambridge, MA: MIT Press.
891: 586: 7503: 7438: 7039: 6983: 6382:
The 31st International Conference on Machine Learning, Journal of Machine Learning Research (ICML 2014)
6059:
On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems, Garivier and Moulines, 2008 <
3684: 3494: 1612:: a kernelized non-linear version of linearUCB, with efficient implementation and finite-time analysis. 813: 138:
the period of time considered. There are many practical applications of the bandit model, for example:
6959: 6814:, Institute of Mathematical Statistics Lecture Notes - Monograph Series, vol. 8, pp. 29–39, 2154: 1790: 1750: 956: 7942: 7904: 7762: 7401: 7232: 7055: 3628: 1512:
its actual probability of being the optimal lever. Probability matching strategies are also known as
107:. In early versions of the problem, the gambler begins with no initial knowledge about the machines. 6965: 6116: 5797:
Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS)
5503:
Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS)
4508: 3112:. In the infinite armed case, introduced by Agrawal (1995), the "arms" are a continuous variable in 2299: 1641: 1425: 7803: 7060: 6914: 6850:; Veinott, Jr., Arthur F. (1987), "The multi-armed bandit problem: decomposition and computation", 6504: 5269: 4911: 4683: 4214: 4187:, Wiley-Interscience Series in Systems and Optimization., Chichester: John Wiley & Sons, Ltd., 1597: 1508:
Probability matching strategies reflect the idea that the number of pulls for a given lever should
1350: 730: 374: 207: 88: 54: 5631: 5303: 3462: 3257: 7848: 7833: 7486: 7481: 7381: 7249: 7030: 6591:
Guha, S.; Munagala, K.; Shi, P. (2010), "Approximation algorithms for restless bandit problems",
2527: 1304: 734: 370: 6412:
Li, Shuai; Alexandros, Karatzoglou; Gentile, Claudio (2016), "Collaborative Filtering Bandits",
5926: 5889:
Alekh Agarwal; Daniel J. Hsu; Satyen Kale; John Langford; Lihong Li; Robert E. Schapire (2014),
1402: 7808: 7568: 7287: 7282: 6111: 6027:
Agrawal, Rajeev. The Continuum-Armed Bandit Problem. SIAM J. of Control and Optimization. 1995.
5650:
Perchet, Vianney; Rigollet, Philippe (2013), "The multi-armed bandit problem with covariates",
5264: 4906: 4678: 4503: 2215: 2013: 1832: 1466: 1330: 145:
investigating the effects of different experimental treatments while minimizing patient losses,
104: 61: 5754: 4153: 3309: 1519:
Probability matching strategies also admit solutions to so-called contextual bandit problems.
503: 7838: 7823: 7788: 7476: 7376: 7244: 6015: 4916: 4058: 2559: 1606:: The reward distribution follows a generalized linear model, an extension to linear bandits. 1592:: The nonlinear reward functions are estimated using a piecewise constant estimator called a 1327:
of the trials, and a lever is selected at random (with uniform probability) for a proportion
1204: 1122: 633: 129: 81: 7706: 133:
How must a given budget be distributed among these research departments to maximize results?
77: 7858: 7813: 7259: 7204: 7050: 7045: 6847: 6807: 6727:
Allesiardo, Robin (2014), "A Neural Networks Committee for the Contextual Bandit Problem",
6716: 6526: 6445: 6427: 6395: 6339: 6272: 5972: 5908: 5842: 5810: 5778: 5738: 5695: 5652: 5613: 5451: 5148: 4778: 4616: 4446: 4376: 2656: 2057: 927: 865: 379: 5831:, 29th Conference on Uncertainty in Artificial Intelligence (UAI 2013) and (JFPDA 2013)., 3737: 3664: 2798: 784:
problem in multi-armed bandits is the one of Best Arm Identification (BAI), also known as
8: 7433: 7411: 7160: 7155: 7113: 7065: 6477: 6224:"A Relative Exponential Weighing Algorithm for Adversarial Utility-based Dueling Bandits" 6199: 6164: 6039:
https://proceedings.neurips.cc/paper/2014/file/903ce9225fca3e988c2af215d4e544d3-Paper.pdf
5559:
2011 International Conference on Technologies and Applications of Artificial Intelligence
5304:"Value-Difference Based Exploration: Adaptive Control Between Epsilon-Greedy and Softmax" 4974:
Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on
2729:
t = 1,2,...,T 1. For each arm generate a random noise from an exponential distribution
154: 6810:; C. Derman (1986), "Computing optimal sequential allocation rules in clinical trials", 6569: 6530: 6431: 6399: 6380:
Gentile, Claudio; Li, Shuai; Zappella, Giovanni (2014), "Online Clustering of Bandits",
6343: 6301: 6276: 6087:
http://www.scitepress.org/DigitalLibrary/PublicationsDetail.aspx?ID=Dx2xXEB0PJE=&t=1
5976: 5912: 5846: 5814: 5782: 5742: 5617: 5455: 5313:, Lecture Notes in Computer Science, vol. 7006, Springer-Verlag, pp. 335–346, 5263:, Lecture Notes in Computer Science, vol. 6359, Springer-Verlag, pp. 203–210, 5152: 4782: 4450: 4380: 4149: 7818: 7396: 6877: 6869: 6835: 6795: 6760: 6732: 6618: 6600: 6542: 6516: 6417: 6385: 6361: 6329: 6262: 5989: 5962: 5898: 5868: 5832: 5800: 5768: 5728: 5679: 5661: 5603: 5580: 5539: 5477: 5441: 5377: 5220: 5195: 5171: 5136: 5117: 5065: 5013: 4995: 4696: 4620: 4604: 4548: 4436: 4399: 4349: 4341: 4337: 4271: 4132: 3717: 3527: 3336: 3289: 3151: 3115: 3095: 1382: 678: 399: 352: 28: 6234: 5794: 5088:"Delayed Reward Bernoulli Bandits: Optimal Policy and Predictive Meta-Algorithm PARDI" 4565: 7884: 7872: 7676: 7328: 7199: 7192: 6987: 6956:, a classic example (with known answer) of the exploitation vs. exploration tradeoff. 6825: 6750: 6706: 6657: 5570: 5529: 5495: 5481: 5467: 5335: 5314: 5282: 5225: 5176: 5121: 5109: 5069: 5057: 4858: 4806: 4801: 4766: 4750: 4733: 4624: 4469:
Proceedings of International Joint Conferences on Artificial Intelligence (IJCAI2015)
4404: 4222: 4188: 1513: 1254: 1246: 96: 6764: 6690: 5951:"Algorithms with Logarithmic or Sublinear Regret for Constrained Contextual Bandits" 5683: 5584: 5543: 5343:
Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence
5254:"Adaptive ε-greedy exploration in reinforcement learning based on value differences" 4700: 4492:
Farias, Vivek F; Ritesh, Madan (2011), "The irrevocable multiarmed bandit problem",
4353: 4306: 4289: 3148:). In the non-stationary setting, it is assumed that the expected reward for an arm 7629: 7619: 7426: 7220: 7170: 7165: 7108: 7096: 6910: 6861: 6815: 6785: 6742: 6685: 6638: 6622: 6610: 6546: 6534: 6121: 5890: 5671: 5562: 5521: 5459: 5369: 5274: 5215: 5207: 5166: 5156: 5099: 5047: 5005: 4953: 4942:"Online regret bounds for Markov decision processes with deterministic transitions" 4872: 4837: 4796: 4786: 4745: 4688: 4649: 4596: 4544: 4513: 4394: 4384: 4333: 4301: 4261: 4157: 4124: 4093: 4048: 4043: 4003: 3999: 1576: 1493:-weighted action selection in case of exploratory actions (Tokic & Palm, 2011). 1287: 165: 148: 100: 32: 6881: 5888: 5827:
Michal Valko; Nathan Korda; Rémi Munos; Ilias Flaounas; Nello Cristianini (2013),
5381: 5017: 4136: 7742: 7686: 7508: 7150: 7070: 6746: 6731:, Lecture Notes in Computer Science, vol. 8834, Springer, pp. 374–381, 6671: 6414:
The 39th International ACM SIGIR Conference on Information Retrieval (SIGIR 2016)
5278: 5211: 5161: 4670: 4612: 4321: 4180: 3286:
no longer represents the whole sequence of expected (stationary) rewards for arm
2922:
Add noise to each arm and pull the one with the highest value 3. Update value:
1250: 191: 118: 110: 6470:
Proceedings of the 30th International Conference on Machine Learning (ICML 2013)
5826: 4221:, Monographs on Statistics and Applied Probability, London: Chapman & Hall, 7716: 7681: 7671: 7496: 7254: 7080: 6125: 5435: 4771:
Proceedings of the National Academy of Sciences of the United States of America
2510:{\displaystyle \omega _{j}(t+1)=\omega _{j}(t)\exp(\gamma {\hat {x}}_{j}(t)/K)} 142: 6820: 6790: 6231:
Proceedings of the 32nd International Conference on Machine Learning (ICML-15)
6196:
Proceedings of the 31st International Conference on Machine Learning (ICML-14)
6161:
Proceedings of the 30th International Conference on Machine Learning (ICML-13)
5720: 5009: 4958: 4941: 4692: 4266: 4098: 4081: 267:, each distribution being associated with the rewards delivered by one of the 7921: 7661: 7641: 7558: 7237: 6643: 6324:
Wu, Huasen; Liu, Xin (2016), "Double Thompson Sampling for Dueling Bandits",
5755:
Kwang-Sung Jun; Aniruddha Bhargava; Robert D. Nowak; Rebecca Willett (2017),
5525: 5113: 5104: 5087: 5061: 5052: 5035: 4892:"Optimistic linear programming gives logarithmic regret for irreducible MDPs" 4654: 4053: 4037: 3145: 1619: 1239: 114: 73: 6614: 6289:
Komiyama, Junpei; Honda, Junya; Kashima, Hisashi; Nakagawa, Hiroshi (2015),
6189:"Relative Upper Confidence Bound for the $ K$ -Armed Dueling Bandit Problem" 5493: 5463: 4791: 4389: 7747: 7578: 6993: 6990:, and Upper Confidence Bound exploration/exploitation balancing strategies. 5895:
Proceedings of the 31st International Conference on Machine Learning (ICML)
5859:
Féraud, Raphaël; Allesiardo, Robin; Urvoy, Tanguy; Clérot, Fabrice (2016).
5725:
Proceedings of the 34th International Conference on Machine Learning (ICML)
5229: 5180: 4842: 4825: 4810: 4517: 4408: 2556:
it prefers arms with higher weights (exploit), it chooses with probability
1106:{\displaystyle \mathbb {P} ({\hat {a}}_{\tau }\neq a^{\star })\leq \delta } 176: 6865: 6326:
The 30th Annual Conference on Neural Information Processing Systems (NIPS)
5955:
The 29th Annual Conference on Neural Information Processing Systems (NIPS)
5696:
Sarah Filippi; Olivier Cappé; Aurélien Garivier; Csaba Szepesvári (2010),
4876: 4128: 1497:
Adaptive epsilon-greedy strategy based on Bayesian ensembles (Epsilon-BMC)
882:
is a (random) stopping time which suggests when to stop collecting samples
19: 7843: 7614: 7523: 7518: 7140: 7118: 6930: 6926: 6898: 5566: 1379:: A pure exploration phase is followed by a pure exploitation phase. For 121:, gives an optimal policy for maximizing the expected discounted reward. 6187:
Zoghi, Masrour; Whiteson, Shimon; Munos, Remi; Rijke, Maarten D (2014),
5891:"Taming the monster: A fast and simple algorithm for contextual bandits" 5360:
Scott, S.L. (2010), "A modern Bayesian look at the multi-armed bandit",
5034:
Pilarski, Sebastian; Pilarski, Slawomir; Varró, Dániel (February 2021).
4573:, In European Conference on Machine Learning, Springer, pp. 437–448 1373:, but this can vary widely depending on circumstances and predilections. 1145:, the objective is to identify the arm with the highest expected reward 991:, the objective is to identify the arm with the highest expected reward 7737: 7696: 7691: 7604: 7513: 7421: 7333: 7313: 6907:, open-source implementation of bandit strategies in Python and Matlab. 6904: 6873: 6839: 6799: 6152:
Urvoy, Tanguy; Clérot, Fabrice; Féraud, Raphaël; Naamane, Sami (2013),
5721:"Provably optimal algorithms for generalized linear contextual bandits" 5675: 5036:"Optimal Policy for Bernoulli Bandits: Computation and Algorithm Gauge" 4608: 4431:
Brochu, Eric; Hoffman, Matthew W.; de Freitas, Nando (September 2010),
4345: 4275: 4161: 1047:
with the least possible amount of trials and with probability of error
703: 6894: 6774:
Weber, Richard (1992), "On the Gittins index for multiarmed bandits",
6538: 5516:
Auer, P. (2000). "Using upper confidence bounds for online learning".
5137:"Theory of choice in bandit, information sampling, and foraging tasks" 4250:
Weber, Richard (1992), "On the Gittins index for multiarmed bandits",
3078:
Might be computationally expensive (calculating the exponential terms)
7732: 7701: 7599: 7443: 7406: 7343: 7297: 7292: 7277: 6654:
Approximate Dynamic Programming: Solving the Curses of Dimensionality
5757:"Scalable generalized linear bandits: Online computation and hashing" 5373: 4587:(1988), "Restless bandits: Activity allocation in a changing world", 6291:"Regret Lower Bound and Optimal Algorithm in Dueling Bandit Problem" 6073: 5518:
Proceedings 41st Annual Symposium on Foundations of Computer Science
4600: 719: 706:) time horizons for both stochastic and non-stochastic arm payoffs. 7634: 7466: 6971: 6947: 6943: 6521: 6422: 6360:, Advances in Neural Information Processing Systems 26, NIPS 2013, 6356:
Cesa-Bianchi, Nicolo; Gentile, Claudio; Zappella, Giovanni (2013),
6334: 6267: 5993: 5967: 5805: 5773: 5733: 4861:(1997). "Optimal adaptive policies for Markov decision processes". 4715: 4326:
Journal of the Royal Statistical Society. Series B (Methodological)
4040: – a powerful, general strategy for analyzing bandit problems. 1463:: Similar to the epsilon-greedy strategy, except that the value of 190:
The version of the problem now commonly analyzed was formulated by
6737: 6605: 6390: 6366: 6060: 5903: 5837: 5666: 5608: 5446: 5438:
Proceedings of the 19th international conference on World wide web
5000: 4441: 1487:
Adaptive epsilon-greedy strategy based on value differences (VDBE)
924:
is a guess on the best arm based on the data collected up to time
7757: 7594: 7548: 7471: 7371: 7366: 7318: 6443: 5196:"Subcortical Substrates of Explore-Exploit Decisions in Primates" 3989:
Many variants of the problem have been proposed in recent years.
1490: 490:{\displaystyle \rho =T\mu ^{*}-\sum _{t=1}^{T}{\widehat {r}}_{t}} 184: 69: 6920: 6674:(1952), "Some aspects of the sequential design of experiments", 6505:"Combinatorial Multi-armed Bandits for Real-Time Strategy Games" 6138:
Yue, Yisong; Joachims, Thorsten (2011), "Beat the Mean Bandit",
7772: 7752: 7624: 7416: 5419:"The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits" 5086:
Pilarski, Sebastian; Pilarski, Slawomir; Varro, Daniel (2021).
4463:
Shen, Weiwei; Wang, Jun; Jiang, Yu-Gang; Zha, Hongyuan (2015),
3613:{\displaystyle {\mathcal {D}}(T)=\sum _{t=1}^{T}{\mu _{t}^{*}}} 3062:
Maintains weights for each arm to calculate pulling probability
6466: 4826:"Optimal adaptive policies for sequential allocation problems" 4667: 2640: 1544: 7573: 7553: 7543: 7538: 7533: 7528: 7491: 7323: 3140:
This framework refers to the multi-armed bandit problem in a
187:
so that German scientists could also waste their time on it.
87:
In the problem, each machine provides a random reward from a
6978:
Blog post on multi-armed bandit strategies, with Python code
6288: 6016:
Adaptive online prediction by following the perturbed leader
5924: 4713: 1569:
LinRel (Linear Associative Reinforcement Learning) algorithm
7563: 6498: 6496: 6355: 6018:. Journal of Machine Learning Research, 6(Apr), pp.639–660. 5858: 5718: 5494:
Wei Chu; Lihong Li; Lev Reyzin; Robert E. Schapire (2011),
4324:(1979). "Bandit Processes and Dynamic Allocation Indices". 3073:
The standard FPL does not have good theoretical guarantees
2388: 1707:
An example often considered for adversarial bandits is the
1682:
A simple algorithm with logarithmic regret is proposed in:
6942:
Tutorial: Introduction to Bandits: Algorithms and Theory.
6259:
Advances in Neural Information Processing Systems, NIPS'15
6101: 3459:
by always selecting the best arm, with expected reward of
5425:, vol. 20, Curran Associates, Inc., pp. 817–824 5393: 6493: 6462: 6460: 6447:
2010 IEEE Symposium on New Frontiers in Dynamic Spectrum
6256: 6151: 5949:
Wu, Huasen; Srikant, R.; Liu, Xin; Jiang, Chong (2015),
5334:
Gimelfarb, Michel; Sanner, Scott; Lee, Chi-Guhn (2019),
6222:
Gajane, Pratik; Urvoy, Tanguy; Clérot, Fabrice (2015),
6186: 4716:"Optimal Best Arm Identification with Fixed Confidence" 4430: 4082:"Finite-time Analysis of the Multiarmed Bandit Problem" 4079: 2584:
The (external) regret of the Exp3 algorithm is at most
1702: 1678:
Framework of UCB-ALP for constrained contextual bandits
6806: 6050:
Discounted UCB, Levente Kocsis, Csaba Szepesvári, 2006
4567:
Multi-armed bandit algorithms and empirical evaluation
4424: 4290:"Some aspects of the sequential design of experiments" 2915:{\displaystyle I(t)=arg\max _{i}\{R_{i}(t)+Z_{i}(t)\}} 6846: 6628: 6457: 6298:
Proceedings of the 28th Conference on Learning Theory
5925:
Badanidiyuru, A.; Langford, J.; Slivkins, A. (2014),
5829:
Finite-Time Analysis of Kernelised Contextual Bandits
5085: 5033: 4485: 4219:
Bandit problems: Sequential allocation of experiments
4112: 3762: 3740: 3720: 3687: 3667: 3631: 3552: 3530: 3497: 3465: 3435: 3359: 3339: 3312: 3292: 3260: 3204: 3174: 3154: 3118: 3098: 3065:
Doesn't need to know the pulling probability per arm
3028:{\displaystyle R_{I(t)}(t+1)=R_{I(t)}(t)+x_{I(t)}(t)} 2928: 2830: 2801: 2735: 2681: 2659: 2590: 2562: 2530: 2408: 2262: 2218: 2157: 2087: 2060: 2016: 1882: 1835: 1793: 1753: 1644: 1469: 1428: 1405: 1385: 1353: 1333: 1307: 1207: 1151: 1125: 1053: 997: 959: 930: 894: 868: 816: 681: 636: 589: 533: 506: 424: 402: 382: 355: 308: 273: 215: 6145: 5918: 4734:"Asymptotically efficient adaptive allocation rules" 1194:{\displaystyle a^{\star }\in \arg \max _{k}\mu _{k}} 1040:{\displaystyle a^{\star }\in \arg \max _{k}\mu _{k}} 6411: 6282: 6215: 4465:"Portfolio Choices with Orthogonal Bandit Learning" 3415:{\displaystyle \mu ^{k}=\{\mu _{t}^{k}\}_{t=1}^{T}} 6923:, open-source implementation of bandit strategies. 6812:Adaptive statistical procedures and related topics 6180: 5934:Proceeding of Conference on Learning Theory (COLT) 5362:Applied Stochastic Models in Business and Industry 5333: 3970: 3746: 3726: 3706: 3673: 3653: 3612: 3536: 3516: 3483: 3451: 3414: 3345: 3325: 3298: 3278: 3246: 3190: 3160: 3124: 3104: 3027: 2914: 2816: 2787: 2718: 2665: 2630: 2568: 2548: 2509: 2394: 2248: 2204: 2143: 2073: 2046: 2002: 1865: 1821: 1777: 1663: 1503: 1475: 1449: 1414: 1391: 1365: 1339: 1319: 1213: 1193: 1137: 1105: 1039: 983: 936: 916: 874: 848: 687: 650: 611: 575: 519: 489: 408: 388: 361: 340: 294: 259: 6966:S. Bubeck and N. Cesa-Bianchi A Survey on Bandits 6379: 6250: 6221: 5948: 5861:"Random Forest for the Contextual Bandit Problem" 5761:Advances in Neural Information Processing Systems 5702:Advances in Neural Information Processing Systems 5698:"Parametric Bandits: The Generalized Linear Case" 5496:"Contextual bandits with linear payoff functions" 5423:Advances in Neural Information Processing Systems 5400:Advances in Neural Information Processing Systems 4899:Advances in Neural Information Processing Systems 4856: 4823: 3333:denotes the sequence of expected rewards for arm 2788:{\displaystyle \forall i:Z_{i}(t)\sim Exp(\eta )} 7919: 6502: 6154:"Generic Exploration and K-armed Voting Bandits" 4764: 4462: 4456: 4080:Auer, P.; Cesa-Bianchi, N.; Fischer, P. (2002). 2856: 1632: 1399:trials in total, the exploration phase occupies 1172: 1018: 548: 6590: 6437: 6131: 6097: 6095: 5944: 5942: 5649: 5633:Contextual bandits with similarity information. 4559: 4557: 4369:Proceedings of the National Academy of Sciences 3247:{\displaystyle \mu _{t-1}^{k}\neq \mu _{t}^{k}} 2524:Exp3 chooses an arm at random with probability 197: 66:exploration–exploitation tradeoff dilemma 5396:"An empirical evaluation of Thompson sampling" 4889: 4433:Portfolio Allocation for Bayesian Optimization 1301:: The best lever is selected for a proportion 675:which issues a reward of one with probability 576:{\displaystyle \mu ^{*}=\max _{k}\{\mu _{k}\}} 183:, the problem was proposed to be dropped over 179:, it proved so intractable that, according to 175:Originally considered by Allied scientists in 7009: 6677:Bulletin of the American Mathematical Society 6405: 5597: 5416: 5327: 5295: 5193: 4985: 4563: 4320: 4294:Bulletin of the American Mathematical Society 4213: 630:is a strategy whose average regret per round 91:specific to that machine, that is not known 7023: 6092: 5939: 5602:, Conference on Learning Theory, COLT 2010, 5556: 5487: 5429: 5410: 5387: 5355: 5353: 5311:KI 2011: Advances in Artificial Intelligence 5261:KI 2010: Advances in Artificial Intelligence 5092:IEEE Transactions on Artificial Intelligence 5081: 5079: 5040:IEEE Transactions on Artificial Intelligence 5029: 5027: 4767:"Sequential choice from several populations" 4554: 4491: 3392: 3373: 2909: 2865: 1583: 570: 557: 254: 222: 60:The multi-armed bandit problem is a classic 6974:, a survey/tutorial for Contextual Bandits. 6933:implementation of bandit strategies in C++. 6698: 6509:Journal of Artificial Intelligence Research 6373: 6137: 5882: 5788: 4720:Journal of Machine Learning Research (JMLR) 4533:(1979), "Discussion of Dr Gittins' paper", 4175: 4173: 4171: 4169: 3714:and the cumulative expected reward at step 2641:Follow the perturbed leader (FPL) algorithm 1545:Approximate solutions for contextual bandit 947:There are two predominant settings in BAI: 748:. Unsourced material may be challenged and 151:efforts for minimizing delays in a network, 7953:Science and technology during World War II 7016: 7002: 6972:A Survey on Contextual Multi-armed Bandits 6726: 5639:, Conference on Learning Theory, COLT 2011 5623: 4731: 4707: 4564:Vermorel, Joannes; Mohri, Mehryar (2005), 4314: 4075: 4073: 3524:for the dynamic oracle at final time step 1714: 1281: 856:is a sequence of actions at each time step 709: 6819: 6789: 6736: 6689: 6642: 6604: 6520: 6421: 6389: 6365: 6333: 6317: 6266: 6115: 5966: 5902: 5836: 5820: 5804: 5772: 5748: 5732: 5689: 5665: 5607: 5598:Rigollet, Philippe; Zeevi, Assaf (2010), 5445: 5350: 5301: 5268: 5245: 5219: 5170: 5160: 5103: 5076: 5051: 5024: 4999: 4957: 4910: 4841: 4800: 4790: 4749: 4682: 4653: 4630: 4577: 4523: 4507: 4440: 4398: 4388: 4305: 4265: 4097: 3910: 3828: 1055: 768:Learn how and when to remove this message 282: 6652:Powell, Warren B. (2007), "Chapter 10", 5719:Lihong Li; Yu Lu; Dengyong Zhou (2017), 5712: 5643: 5629: 5591: 5134: 4824:Burnetas, A.N.; Katehakis, M.N. (1996). 4536:Journal of the Royal Statistical Society 4166: 3135: 3087: 2081:randomly according to the probabilities 1673: 1552: 1273: 780:An important variation of the classical 341:{\displaystyle \mu _{1},\dots ,\mu _{K}} 260:{\displaystyle B=\{R_{1},\dots ,R_{K}\}} 128: 18: 6699:Sutton, Richard; Barto, Andrew (1998), 6670: 6104:Journal of Computer and System Sciences 4636: 4583: 4529: 4287: 4179: 4070: 4022: 4013: 3491:. Thus, the cumulative expected reward 124: 7920: 6651: 5520:. IEEE Comput. Soc. pp. 270–279. 4939: 3681:is computed as the difference between 6997: 6921:bandit.sourceforge.net Bandit project 6773: 6323: 5600:Nonparametric Bandits with Covariates 5359: 5302:Tokic, Michel; Palm, Günther (2011), 5251: 4765:Katehakis, M.N.; Robbins, H. (1995). 4366: 4249: 4185:Multi-armed bandit allocation indices 2631:{\displaystyle O({\sqrt {KTlog(K)}})} 2144:{\displaystyle p_{1}(t),...,p_{K}(t)} 2010:       1693: 1522: 1347:. A typical parameter value might be 295:{\displaystyle K\in \mathbb {N} ^{+}} 206:or MAB) can be seen as a set of real 162:exploitation vs. exploration tradeoff 7854:Generative adversarial network (GAN) 5515: 5417:Langford, John; Zhang, Tong (2008), 5394:Olivier Chapelle; Lihong Li (2011), 5194:Costa, V.D.; Averbeck, B.B. (2019). 4714:Aurelien Garivier; Emilie Kaufmann, 4243: 4209: 4207: 4205: 4203: 3070:Has efficient theoretical guarantees 2719:{\displaystyle \forall i:R_{i}(1)=0} 1703:Example: Iterated prisoner's dilemma 1579:to obtain an estimate of confidence. 1535: 1233: 1224: 746:adding citations to reliable sources 713: 16:Resource problem in machine learning 6349: 4890:Tewari, A.; Bartlett, P.L. (2008). 4360: 3452:{\displaystyle t\in {\mathcal {T}}} 3191:{\displaystyle t\in {\mathcal {T}}} 23:A row of slot machines in Las Vegas 13: 6853:Mathematics of Operations Research 6555: 6476:, pp. 151–159, archived from 4864:Mathematics of Operations Research 4549:10.1111/j.2517-6161.1979.tb01069.x 4415: 4338:10.1111/j.2517-6161.1979.tb01068.x 4116:Mathematics of Operations Research 3891: 3690: 3555: 3500: 3444: 3183: 2736: 2682: 2579: 1571:: Similar to LinUCB, but utilizes 1422:trials and the exploitation phase 917:{\displaystyle {\hat {a}}_{\tau }} 695:, and otherwise a reward of zero. 612:{\displaystyle {\widehat {r}}_{t}} 14: 7964: 7948:Metaphors referring to body parts 6960:Bandit algorithms vs. A-B testing 6888: 6656:, New York: John Wiley and Sons, 6074:https://doi.org/10.3390/e23030380 6014:Hutter, M. and Poland, J., 2005. 4640:(1981), "Arm-acquiring bandits", 4200: 3992: 3984: 3707:{\displaystyle {\mathcal {D}}(T)} 3517:{\displaystyle {\mathcal {D}}(T)} 849:{\displaystyle (a_{t})_{t\geq 1}} 7892: 7891: 7871: 6562: 5927:"Resourceful contextual bandits" 3081:Computationally quite efficient 3038: 2205:{\displaystyle x_{i_{t}}(t)\in } 1822:{\displaystyle \omega _{i}(1)=1} 1778:{\displaystyle \gamma \in (0,1]} 1201:minimizing probability of error 984:{\displaystyle \delta \in (0,1)} 718: 6691:10.1090/S0002-9904-1952-09620-8 6631:Advances in Applied Probability 6079: 6066: 6061:https://arxiv.org/abs/0805.3415 6053: 6044: 6030: 6021: 6008: 5999: 5982: 5852: 5550: 5509: 5236: 5187: 5128: 4979: 4966: 4933: 4883: 4850: 4830:Advances in Applied Mathematics 4817: 4758: 4738:Advances in Applied Mathematics 4732:Lai, T.L.; Robbins, H. (1985). 4725: 4661: 4307:10.1090/S0002-9904-1952-09620-8 3654:{\displaystyle \rho ^{\pi }(T)} 2256:set:      1527:Pricing strategies establish a 1504:Probability matching strategies 202:The multi-armed bandit (short: 7804:Recurrent neural network (RNN) 7794:Differentiable neural computer 5961:, Curran Associates: 433–441, 5406:, Curran Associates: 2249–2257 4589:Journal of Applied Probability 4281: 4234: 4143: 4106: 3902: 3896: 3779: 3773: 3701: 3695: 3648: 3642: 3566: 3560: 3511: 3505: 3168:can change at every time step 3144:setting (i.e., in presence of 3046: 3022: 3016: 3011: 3005: 2991: 2985: 2980: 2974: 2960: 2948: 2943: 2937: 2906: 2900: 2884: 2878: 2840: 2834: 2811: 2805: 2782: 2776: 2761: 2755: 2707: 2701: 2645: 2625: 2620: 2614: 2594: 2543: 2531: 2519: 2504: 2493: 2487: 2475: 2462: 2453: 2447: 2431: 2419: 2342: 2336: 2318: 2312: 2288: 2282: 2270: 2199: 2187: 2181: 2175: 2138: 2132: 2104: 2098: 1981: 1975: 1939: 1933: 1917: 1905: 1899: 1893: 1810: 1804: 1772: 1760: 1664:{\displaystyle O({\sqrt {T}})} 1658: 1648: 1450:{\displaystyle (1-\epsilon )N} 1441: 1429: 1094: 1069: 1059: 978: 966: 902: 831: 817: 1: 7849:Variational autoencoder (VAE) 7809:Long short-term memory (LSTM) 7076:Computational learning theory 6986:illustrating Epsilon-greedy, 6777:Annals of Applied Probability 5767:, Curran Associates: 99–109, 5630:Slivkins, Aleksandrs (2011), 4253:Annals of Applied Probability 4064: 1633:Constrained contextual bandit 1604:Generalized linear algorithms 1366:{\displaystyle \epsilon =0.1} 673:Bernoulli multi-armed bandit, 662: 64:problem that exemplifies the 7829:Convolutional neural network 6954:Feynman's restaurant problem 6747:10.1007/978-3-319-12637-1_47 5708:, Curran Associates: 586–594 5279:10.1007/978-3-642-16111-7_23 5212:10.1016/j.neuron.2019.05.017 5162:10.1371/journal.pcbi.1004164 4946:Theoretical Computer Science 4751:10.1016/0196-8858(85)90002-8 4002:. A solution is to take the 3484:{\displaystyle \mu _{t}^{*}} 3279:{\displaystyle \mu _{t}^{k}} 1739: 1573:Singular-value decomposition 667:A common formulation is the 527:is the maximal reward mean, 198:The multi-armed bandit model 7: 7824:Multilayer perceptron (MLP) 6984:Animated, interactive plots 4031: 2549:{\displaystyle (1-\gamma )} 1709:iterated prisoner's dilemma 1461:Epsilon-decreasing strategy 1320:{\displaystyle 1-\epsilon } 10: 7969: 7900:Artificial neural networks 7814:Gated recurrent unit (GRU) 7040:Differentiable programming 6126:10.1016/j.jcss.2011.12.028 5561:. IEEE. pp. 305–308. 5141:PLOS Computational Biology 3035:The rest remains the same 1876:t = 1, 2, ..., T 1. Set 1415:{\displaystyle \epsilon N} 1237: 155:financial portfolio design 37:multi-armed bandit problem 7867: 7781: 7725: 7654: 7587: 7459: 7359: 7352: 7306: 7270: 7233:Artificial neural network 7213: 7089: 7056:Automatic differentiation 7029: 6503:Santiago Ontañón (2017), 5345:, AUAI Press, p. 162 5010:10.1007/s10994-011-5257-4 4959:10.1016/j.tcs.2010.04.005 4693:10.1137/S0097539701398375 4217:; Fristedt, Bert (1985), 2249:{\displaystyle j=1,...,K} 2047:{\displaystyle i=1,...,K} 1866:{\displaystyle i=1,...,K} 1584:Online non-linear bandits 1476:{\displaystyle \epsilon } 1340:{\displaystyle \epsilon } 953:Given a confidence level 951:Fixed confidence setting: 803:, described as follows: 669:Binary multi-armed bandit 7061:Neuromorphic engineering 7024:Differentiable computing 5867:: 93–101. Archived from 5526:10.1109/sfcs.2000.892116 5105:10.1109/TAI.2021.3117743 5053:10.1109/TAI.2021.3074122 4857:Burnetas, Apostolos N.; 3326:{\displaystyle \mu ^{k}} 2402:     1598:nonparametric regression 1561:(Upper Confidence Bound) 520:{\displaystyle \mu ^{*}} 89:probability distribution 7938:Stochastic optimization 7834:Residual neural network 7250:Artificial Intelligence 6821:10.1214/lnms/1215540286 6791:10.1214/aoap/1177005588 6615:10.1145/1870103.1870106 5464:10.1145/1772690.1772758 5135:Averbeck, B.B. (2015). 4792:10.1073/pnas.92.19.8584 4390:10.1073/pnas.0912378106 4267:10.1214/aoap/1177005588 4099:10.1023/A:1013689704352 2569:{\displaystyle \gamma } 1719: 1616:Bandit Forest algorithm 1299:Epsilon-greedy strategy 1282:Semi-uniform strategies 1214:{\displaystyle \delta } 1138:{\displaystyle T\geq 1} 710:Best Arm Identification 651:{\displaystyle \rho /T} 619:is the reward in round 371:Markov decision process 7933:Sequential experiments 6702:Reinforcement Learning 6644:10.1239/aap/1214950209 6140:Proceedings of ICML'11 5252:Tokic, Michel (2010), 4843:10.1006/aama.1996.0007 4655:10.1214/aop/1176994469 4518:10.1287/opre.1100.0891 3972: 3950: 3868: 3805: 3748: 3728: 3708: 3675: 3655: 3614: 3592: 3538: 3518: 3485: 3453: 3416: 3347: 3327: 3300: 3280: 3248: 3192: 3162: 3126: 3106: 3029: 2916: 2818: 2789: 2720: 2667: 2632: 2570: 2550: 2511: 2396: 2250: 2206: 2145: 2075: 2048: 2004: 1964: 1867: 1823: 1779: 1679: 1665: 1626:Oracle-based algorithm 1477: 1451: 1416: 1393: 1377:Epsilon-first strategy 1367: 1341: 1321: 1215: 1195: 1139: 1107: 1041: 985: 938: 918: 876: 850: 689: 652: 613: 577: 521: 491: 467: 410: 390: 363: 342: 296: 261: 134: 105:pharmaceutical company 62:reinforcement learning 39:(sometimes called the 24: 7789:Neural Turing machine 7377:Human image synthesis 6866:10.1287/moor.12.2.262 6848:Katehakis, Michael N. 4877:10.1287/moor.22.1.222 4859:Katehakis, Michael N. 4642:Annals of Probability 4129:10.1287/moor.12.2.262 4059:Stochastic scheduling 3973: 3930: 3848: 3785: 3749: 3729: 3709: 3676: 3656: 3615: 3572: 3539: 3519: 3486: 3454: 3417: 3348: 3328: 3301: 3281: 3249: 3193: 3163: 3136:Non-stationary bandit 3127: 3107: 3088:Infinite-armed bandit 3030: 2917: 2819: 2790: 2721: 2668: 2666:{\displaystyle \eta } 2633: 2571: 2551: 2512: 2397: 2251: 2207: 2146: 2076: 2074:{\displaystyle i_{t}} 2049: 2005: 1944: 1868: 1824: 1780: 1715:Approximate solutions 1677: 1666: 1553:Online linear bandits 1478: 1452: 1417: 1394: 1368: 1342: 1322: 1274:Approximate solutions 1238:Further information: 1216: 1196: 1140: 1119:Given a time horizon 1117:Fixed budget setting: 1108: 1042: 986: 939: 937:{\displaystyle \tau } 919: 877: 875:{\displaystyle \tau } 851: 690: 653: 614: 578: 522: 492: 447: 411: 391: 389:{\displaystyle \rho } 364: 343: 297: 262: 132: 117:, first published by 82:stochastic scheduling 76:(sometimes known as " 48:-armed bandit problem 22: 7880:Computer programming 7859:Graph neural network 7434:Text-to-video models 7412:Text-to-image models 7260:Large language model 7245:Scientific computing 7051:Statistical manifold 7046:Information geometry 5653:Annals of Statistics 5567:10.1109/taai.2011.59 5440:, pp. 661–670, 4288:Robbins, H. (1952). 4023:Combinatorial bandit 4014:Collaborative bandit 3760: 3747:{\displaystyle \pi } 3738: 3718: 3685: 3674:{\displaystyle \pi } 3665: 3629: 3550: 3528: 3495: 3463: 3433: 3357: 3337: 3310: 3290: 3258: 3202: 3172: 3152: 3116: 3096: 2926: 2828: 2817:{\displaystyle I(t)} 2799: 2733: 2679: 2657: 2588: 2560: 2528: 2406: 2260: 2216: 2155: 2085: 2058: 2014: 1880: 1833: 1791: 1751: 1642: 1467: 1426: 1403: 1383: 1351: 1331: 1305: 1205: 1149: 1123: 1051: 995: 957: 928: 892: 866: 814: 742:improve this section 679: 634: 628:zero-regret strategy 587: 531: 504: 422: 400: 380: 353: 306: 271: 213: 125:Empirical motivation 7226:In-context learning 7066:Pattern recognition 6531:2017arXiv171004805O 6432:2015arXiv150203473L 6400:2014arXiv1401.8257G 6344:2016arXiv160407101W 6277:2015arXiv150600312Z 5977:2015arXiv150406937W 5913:2014arXiv1402.0555A 5847:2013arXiv1309.6869V 5815:2019arXiv190608947K 5783:2017arXiv170600136J 5743:2017arXiv170300048L 5618:2010arXiv1003.1630R 5456:2010arXiv1003.0146L 5153:2015PLSCB..11E4164A 4940:Ortner, R. (2010). 4783:1995PNAS...92.8584K 4495:Operations Research 4451:2010arXiv1009.5419B 4381:2009PNAS..10622387P 4375:(52): 22387–22392, 4154:Nicolò Cesa-Bianchi 3924: 3842: 3821: 3608: 3480: 3411: 3390: 3275: 3243: 3225: 1610:KernelUCB algorithm 1290:behavior where the 782:regret minimization 7928:Sequential methods 7819:Echo state network 7707:Jürgen Schmidhuber 7402:Facial recognition 7397:Speech recognition 7307:Software libraries 6593:Journal of the ACM 6580:Multi-armed bandit 5676:10.1214/13-aos1101 4162:10.1561/2200000024 3968: 3908: 3826: 3807: 3744: 3724: 3704: 3671: 3651: 3610: 3594: 3534: 3514: 3481: 3466: 3449: 3412: 3391: 3376: 3343: 3323: 3296: 3276: 3261: 3244: 3229: 3205: 3188: 3158: 3122: 3102: 3025: 2912: 2864: 2814: 2785: 2716: 2663: 2628: 2566: 2546: 2507: 2392: 2387: 2246: 2202: 2151:3. Receive reward 2141: 2071: 2044: 2000: 1863: 1819: 1775: 1694:Adversarial bandit 1680: 1661: 1590:UCBogram algorithm 1523:Pricing strategies 1473: 1447: 1412: 1389: 1363: 1337: 1317: 1211: 1191: 1180: 1135: 1103: 1037: 1026: 981: 934: 914: 872: 846: 685: 648: 609: 573: 556: 517: 487: 406: 386: 359: 338: 292: 257: 135: 29:probability theory 25: 7915: 7914: 7677:Stephen Grossberg 7650: 7649: 6988:Thompson sampling 6831:978-0-940600-09-6 6756:978-3-319-12636-4 6712:978-0-262-19398-6 6663:978-0-470-17155-4 6539:10.1613/jair.5398 6358:A Gang of Bandits 5320:978-3-642-24455-1 5288:978-3-642-16110-0 4952:(29): 2684–2695. 4228:978-0-412-24810-8 4194:978-0-471-92059-5 3727:{\displaystyle T} 3537:{\displaystyle T} 3346:{\displaystyle k} 3299:{\displaystyle k} 3161:{\displaystyle k} 3125:{\displaystyle K} 3105:{\displaystyle K} 3085: 3084: 2855: 2623: 2478: 2383: 2350: 2273: 1998: 1985: 1687:UCB-ALP algorithm 1656: 1536:Contextual bandit 1514:Thompson sampling 1392:{\displaystyle N} 1234:Optimal solutions 1225:Bandit strategies 1171: 1072: 1017: 905: 778: 777: 770: 688:{\displaystyle p} 600: 547: 478: 409:{\displaystyle T} 362:{\displaystyle H} 78:one-armed bandits 7960: 7943:Machine learning 7905:Machine learning 7895: 7894: 7875: 7630:Action selection 7620:Self-driving car 7427:Stable Diffusion 7392:Speech synthesis 7357: 7356: 7221:Machine learning 7097:Gradient descent 7018: 7011: 7004: 6995: 6994: 6884: 6843: 6823: 6802: 6793: 6784:(4): 1024–1033, 6767: 6740: 6720: 6715:, archived from 6694: 6693: 6666: 6647: 6646: 6625: 6608: 6566: 6565: 6550: 6549: 6524: 6500: 6491: 6490: 6489: 6488: 6482: 6475: 6464: 6455: 6454: 6452: 6441: 6435: 6434: 6425: 6409: 6403: 6402: 6393: 6377: 6371: 6370: 6369: 6353: 6347: 6346: 6337: 6321: 6315: 6314: 6313: 6312: 6306: 6300:, archived from 6295: 6286: 6280: 6279: 6270: 6254: 6248: 6247: 6246: 6245: 6239: 6233:, archived from 6228: 6219: 6213: 6212: 6211: 6210: 6204: 6198:, archived from 6193: 6184: 6178: 6177: 6176: 6175: 6169: 6163:, archived from 6158: 6149: 6143: 6142: 6135: 6129: 6128: 6119: 6110:(5): 1538–1556, 6099: 6090: 6083: 6077: 6070: 6064: 6057: 6051: 6048: 6042: 6034: 6028: 6025: 6019: 6012: 6006: 6003: 5997: 5986: 5980: 5979: 5970: 5946: 5937: 5936: 5931: 5922: 5916: 5915: 5906: 5886: 5880: 5879: 5877: 5876: 5856: 5850: 5849: 5840: 5824: 5818: 5817: 5808: 5792: 5786: 5785: 5776: 5752: 5746: 5745: 5736: 5716: 5710: 5709: 5693: 5687: 5686: 5669: 5647: 5641: 5640: 5638: 5627: 5621: 5620: 5611: 5595: 5589: 5588: 5554: 5548: 5547: 5513: 5507: 5506: 5500: 5491: 5485: 5484: 5449: 5433: 5427: 5426: 5414: 5408: 5407: 5391: 5385: 5384: 5374:10.1002/asmb.874 5357: 5348: 5346: 5340: 5331: 5325: 5323: 5308: 5299: 5293: 5291: 5272: 5258: 5249: 5243: 5240: 5234: 5233: 5223: 5191: 5185: 5184: 5174: 5164: 5132: 5126: 5125: 5107: 5083: 5074: 5073: 5055: 5031: 5022: 5021: 5003: 4988:Machine Learning 4983: 4977: 4970: 4964: 4963: 4961: 4937: 4931: 4930: 4928: 4927: 4921: 4915:. Archived from 4914: 4896: 4887: 4881: 4880: 4854: 4848: 4847: 4845: 4821: 4815: 4814: 4804: 4794: 4762: 4756: 4755: 4753: 4729: 4723: 4722: 4711: 4705: 4704: 4686: 4665: 4659: 4658: 4657: 4634: 4628: 4627: 4581: 4575: 4574: 4572: 4561: 4552: 4551: 4527: 4521: 4520: 4511: 4489: 4483: 4482: 4481: 4480: 4471:, archived from 4460: 4454: 4453: 4444: 4428: 4422: 4419: 4413: 4412: 4402: 4392: 4364: 4358: 4357: 4318: 4312: 4311: 4309: 4285: 4279: 4278: 4269: 4260:(4): 1024–1033, 4247: 4241: 4238: 4232: 4231: 4215:Berry, Donald A. 4211: 4198: 4197: 4177: 4164: 4150:Sébastien Bubeck 4147: 4141: 4140: 4110: 4104: 4103: 4101: 4092:(2/3): 235–256. 4086:Machine Learning 4077: 4049:Optimal stopping 4044:Greedy algorithm 4006:as a reference. 4004:Condorcet winner 4000:voting paradoxes 3977: 3975: 3974: 3969: 3967: 3963: 3962: 3961: 3960: 3949: 3944: 3923: 3918: 3913: 3895: 3894: 3885: 3881: 3880: 3879: 3878: 3867: 3862: 3841: 3836: 3831: 3822: 3820: 3815: 3804: 3799: 3772: 3771: 3753: 3751: 3750: 3745: 3733: 3731: 3730: 3725: 3713: 3711: 3710: 3705: 3694: 3693: 3680: 3678: 3677: 3672: 3660: 3658: 3657: 3652: 3641: 3640: 3619: 3617: 3616: 3611: 3609: 3607: 3602: 3591: 3586: 3559: 3558: 3543: 3541: 3540: 3535: 3523: 3521: 3520: 3515: 3504: 3503: 3490: 3488: 3487: 3482: 3479: 3474: 3458: 3456: 3455: 3450: 3448: 3447: 3421: 3419: 3418: 3413: 3410: 3405: 3389: 3384: 3369: 3368: 3352: 3350: 3349: 3344: 3332: 3330: 3329: 3324: 3322: 3321: 3305: 3303: 3302: 3297: 3285: 3283: 3282: 3277: 3274: 3269: 3253: 3251: 3250: 3245: 3242: 3237: 3224: 3219: 3197: 3195: 3194: 3189: 3187: 3186: 3167: 3165: 3164: 3159: 3131: 3129: 3128: 3123: 3111: 3109: 3108: 3103: 3051: 3050: 3034: 3032: 3031: 3026: 3015: 3014: 2984: 2983: 2947: 2946: 2921: 2919: 2918: 2913: 2899: 2898: 2877: 2876: 2863: 2823: 2821: 2820: 2815: 2794: 2792: 2791: 2786: 2754: 2753: 2725: 2723: 2722: 2717: 2700: 2699: 2672: 2670: 2669: 2664: 2637: 2635: 2634: 2629: 2624: 2598: 2575: 2573: 2572: 2567: 2555: 2553: 2552: 2547: 2516: 2514: 2513: 2508: 2500: 2486: 2485: 2480: 2479: 2471: 2446: 2445: 2418: 2417: 2401: 2399: 2398: 2393: 2391: 2390: 2384: 2381: 2367: 2366: 2351: 2348: 2335: 2334: 2325: 2311: 2310: 2281: 2280: 2275: 2274: 2266: 2255: 2253: 2252: 2247: 2211: 2209: 2208: 2203: 2174: 2173: 2172: 2171: 2150: 2148: 2147: 2142: 2131: 2130: 2097: 2096: 2080: 2078: 2077: 2072: 2070: 2069: 2053: 2051: 2050: 2045: 2009: 2007: 2006: 2001: 1999: 1991: 1986: 1984: 1974: 1973: 1963: 1958: 1942: 1932: 1931: 1921: 1892: 1891: 1872: 1870: 1869: 1864: 1828: 1826: 1825: 1820: 1803: 1802: 1784: 1782: 1781: 1776: 1732: 1731: 1727: 1670: 1668: 1667: 1662: 1657: 1652: 1577:Ridge regression 1482: 1480: 1479: 1474: 1456: 1454: 1453: 1448: 1421: 1419: 1418: 1413: 1398: 1396: 1395: 1390: 1372: 1370: 1369: 1364: 1346: 1344: 1343: 1338: 1326: 1324: 1323: 1318: 1220: 1218: 1217: 1212: 1200: 1198: 1197: 1192: 1190: 1189: 1179: 1161: 1160: 1144: 1142: 1141: 1136: 1112: 1110: 1109: 1104: 1093: 1092: 1080: 1079: 1074: 1073: 1065: 1058: 1046: 1044: 1043: 1038: 1036: 1035: 1025: 1007: 1006: 990: 988: 987: 982: 943: 941: 940: 935: 923: 921: 920: 915: 913: 912: 907: 906: 898: 881: 879: 878: 873: 855: 853: 852: 847: 845: 844: 829: 828: 786:pure exploration 773: 766: 762: 759: 753: 722: 714: 694: 692: 691: 686: 657: 655: 654: 649: 644: 618: 616: 615: 610: 608: 607: 602: 601: 593: 582: 580: 579: 574: 569: 568: 555: 543: 542: 526: 524: 523: 518: 516: 515: 496: 494: 493: 488: 486: 485: 480: 479: 471: 466: 461: 443: 442: 415: 413: 412: 407: 395: 393: 392: 387: 368: 366: 365: 360: 347: 345: 344: 339: 337: 336: 318: 317: 301: 299: 298: 293: 291: 290: 285: 266: 264: 263: 258: 253: 252: 234: 233: 166:machine learning 149:adaptive routing 101:machine learning 33:machine learning 7968: 7967: 7963: 7962: 7961: 7959: 7958: 7957: 7918: 7917: 7916: 7911: 7863: 7777: 7743:Google DeepMind 7721: 7687:Geoffrey Hinton 7646: 7583: 7509:Project Debater 7455: 7353:Implementations 7348: 7302: 7266: 7209: 7151:Backpropagation 7085: 7071:Tensor calculus 7025: 7022: 6891: 6832: 6757: 6713: 6664: 6587: 6586: 6585: 6567: 6563: 6558: 6556:Further reading 6553: 6501: 6494: 6486: 6484: 6480: 6473: 6465: 6458: 6450: 6442: 6438: 6410: 6406: 6378: 6374: 6354: 6350: 6322: 6318: 6310: 6308: 6304: 6293: 6287: 6283: 6255: 6251: 6243: 6241: 6237: 6226: 6220: 6216: 6208: 6206: 6202: 6191: 6185: 6181: 6173: 6171: 6167: 6156: 6150: 6146: 6136: 6132: 6117:10.1.1.162.2764 6100: 6093: 6084: 6080: 6071: 6067: 6058: 6054: 6049: 6045: 6035: 6031: 6026: 6022: 6013: 6009: 6004: 6000: 5987: 5983: 5947: 5940: 5929: 5923: 5919: 5887: 5883: 5874: 5872: 5857: 5853: 5825: 5821: 5793: 5789: 5753: 5749: 5717: 5713: 5694: 5690: 5648: 5644: 5636: 5628: 5624: 5596: 5592: 5577: 5555: 5551: 5536: 5514: 5510: 5498: 5492: 5488: 5474: 5434: 5430: 5415: 5411: 5392: 5388: 5358: 5351: 5338: 5332: 5328: 5321: 5306: 5300: 5296: 5289: 5256: 5250: 5246: 5241: 5237: 5192: 5188: 5147:(3): e1004164. 5133: 5129: 5084: 5077: 5032: 5025: 4984: 4980: 4971: 4967: 4938: 4934: 4925: 4923: 4919: 4894: 4888: 4884: 4855: 4851: 4822: 4818: 4763: 4759: 4730: 4726: 4712: 4708: 4671:SIAM J. Comput. 4666: 4662: 4635: 4631: 4601:10.2307/3214163 4582: 4578: 4570: 4562: 4555: 4528: 4524: 4509:10.1.1.380.6983 4490: 4486: 4478: 4476: 4461: 4457: 4429: 4425: 4420: 4416: 4365: 4361: 4319: 4315: 4286: 4282: 4248: 4244: 4239: 4235: 4229: 4212: 4201: 4195: 4178: 4167: 4148: 4144: 4111: 4107: 4078: 4071: 4067: 4034: 4025: 4016: 3995: 3987: 3956: 3952: 3951: 3945: 3934: 3929: 3925: 3919: 3914: 3909: 3890: 3889: 3874: 3870: 3869: 3863: 3852: 3847: 3843: 3837: 3832: 3827: 3816: 3811: 3806: 3800: 3789: 3767: 3763: 3761: 3758: 3757: 3739: 3736: 3735: 3719: 3716: 3715: 3689: 3688: 3686: 3683: 3682: 3666: 3663: 3662: 3636: 3632: 3630: 3627: 3626: 3603: 3598: 3593: 3587: 3576: 3554: 3553: 3551: 3548: 3547: 3544:is defined as: 3529: 3526: 3525: 3499: 3498: 3496: 3493: 3492: 3475: 3470: 3464: 3461: 3460: 3443: 3442: 3434: 3431: 3430: 3406: 3395: 3385: 3380: 3364: 3360: 3358: 3355: 3354: 3338: 3335: 3334: 3317: 3313: 3311: 3308: 3307: 3291: 3288: 3287: 3270: 3265: 3259: 3256: 3255: 3238: 3233: 3220: 3209: 3203: 3200: 3199: 3182: 3181: 3173: 3170: 3169: 3153: 3150: 3149: 3138: 3117: 3114: 3113: 3097: 3094: 3093: 3090: 3049: 3041: 3036: 3001: 2997: 2970: 2966: 2933: 2929: 2927: 2924: 2923: 2894: 2890: 2872: 2868: 2859: 2829: 2826: 2825: 2800: 2797: 2796: 2749: 2745: 2734: 2731: 2730: 2695: 2691: 2680: 2677: 2676: 2674:Initialisation: 2658: 2655: 2654: 2648: 2643: 2597: 2589: 2586: 2585: 2582: 2580:Regret analysis 2561: 2558: 2557: 2529: 2526: 2525: 2522: 2517: 2496: 2481: 2470: 2469: 2468: 2441: 2437: 2413: 2409: 2407: 2404: 2403: 2386: 2385: 2380: 2378: 2369: 2368: 2362: 2358: 2347: 2345: 2330: 2326: 2321: 2306: 2302: 2295: 2294: 2276: 2265: 2264: 2263: 2261: 2258: 2257: 2217: 2214: 2213: 2167: 2163: 2162: 2158: 2156: 2153: 2152: 2126: 2122: 2092: 2088: 2086: 2083: 2082: 2065: 2061: 2059: 2056: 2055: 2015: 2012: 2011: 1990: 1969: 1965: 1959: 1948: 1943: 1927: 1923: 1922: 1920: 1887: 1883: 1881: 1878: 1877: 1834: 1831: 1830: 1798: 1794: 1792: 1789: 1788: 1786:Initialisation: 1752: 1749: 1748: 1742: 1733: 1729: 1725: 1723: 1722: 1717: 1705: 1696: 1651: 1643: 1640: 1639: 1635: 1586: 1555: 1547: 1538: 1525: 1506: 1468: 1465: 1464: 1427: 1424: 1423: 1404: 1401: 1400: 1384: 1381: 1380: 1352: 1349: 1348: 1332: 1329: 1328: 1306: 1303: 1302: 1284: 1276: 1242: 1236: 1227: 1206: 1203: 1202: 1185: 1181: 1175: 1156: 1152: 1150: 1147: 1146: 1124: 1121: 1120: 1088: 1084: 1075: 1064: 1063: 1062: 1054: 1052: 1049: 1048: 1031: 1027: 1021: 1002: 998: 996: 993: 992: 958: 955: 954: 929: 926: 925: 908: 897: 896: 895: 893: 890: 889: 867: 864: 863: 834: 830: 824: 820: 815: 812: 811: 774: 763: 757: 754: 739: 723: 712: 680: 677: 676: 665: 640: 635: 632: 631: 603: 592: 591: 590: 588: 585: 584: 564: 560: 551: 538: 534: 532: 529: 528: 511: 507: 505: 502: 501: 481: 470: 469: 468: 462: 451: 438: 434: 423: 420: 419: 401: 398: 397: 381: 378: 377: 354: 351: 350: 332: 328: 313: 309: 307: 304: 303: 286: 281: 280: 272: 269: 268: 248: 244: 229: 225: 214: 211: 210: 200: 192:Herbert Robbins 143:clinical trials 127: 119:John C. Gittins 111:Herbert Robbins 17: 12: 11: 5: 7966: 7956: 7955: 7950: 7945: 7940: 7935: 7930: 7913: 7912: 7910: 7909: 7908: 7907: 7902: 7889: 7888: 7887: 7882: 7868: 7865: 7864: 7862: 7861: 7856: 7851: 7846: 7841: 7836: 7831: 7826: 7821: 7816: 7811: 7806: 7801: 7796: 7791: 7785: 7783: 7779: 7778: 7776: 7775: 7770: 7765: 7760: 7755: 7750: 7745: 7740: 7735: 7729: 7727: 7723: 7722: 7720: 7719: 7717:Ilya Sutskever 7714: 7709: 7704: 7699: 7694: 7689: 7684: 7682:Demis Hassabis 7679: 7674: 7672:Ian Goodfellow 7669: 7664: 7658: 7656: 7652: 7651: 7648: 7647: 7645: 7644: 7639: 7638: 7637: 7627: 7622: 7617: 7612: 7607: 7602: 7597: 7591: 7589: 7585: 7584: 7582: 7581: 7576: 7571: 7566: 7561: 7556: 7551: 7546: 7541: 7536: 7531: 7526: 7521: 7516: 7511: 7506: 7501: 7500: 7499: 7489: 7484: 7479: 7474: 7469: 7463: 7461: 7457: 7456: 7454: 7453: 7448: 7447: 7446: 7441: 7431: 7430: 7429: 7424: 7419: 7409: 7404: 7399: 7394: 7389: 7384: 7379: 7374: 7369: 7363: 7361: 7354: 7350: 7349: 7347: 7346: 7341: 7336: 7331: 7326: 7321: 7316: 7310: 7308: 7304: 7303: 7301: 7300: 7295: 7290: 7285: 7280: 7274: 7272: 7268: 7267: 7265: 7264: 7263: 7262: 7255:Language model 7252: 7247: 7242: 7241: 7240: 7230: 7229: 7228: 7217: 7215: 7211: 7210: 7208: 7207: 7205:Autoregression 7202: 7197: 7196: 7195: 7185: 7183:Regularization 7180: 7179: 7178: 7173: 7168: 7158: 7153: 7148: 7146:Loss functions 7143: 7138: 7133: 7128: 7123: 7122: 7121: 7111: 7106: 7105: 7104: 7093: 7091: 7087: 7086: 7084: 7083: 7081:Inductive bias 7078: 7073: 7068: 7063: 7058: 7053: 7048: 7043: 7035: 7033: 7027: 7026: 7021: 7020: 7013: 7006: 6998: 6992: 6991: 6981: 6975: 6969: 6963: 6957: 6951: 6940: 6934: 6924: 6918: 6913:, open-source 6908: 6902: 6890: 6889:External links 6887: 6886: 6885: 6860:(2): 262–268, 6844: 6830: 6804: 6770: 6769: 6755: 6723: 6722: 6711: 6696: 6684:(5): 527–535, 6668: 6662: 6649: 6637:(2): 377–400, 6626: 6568: 6561: 6560: 6559: 6557: 6554: 6552: 6551: 6492: 6456: 6453:, pp. 1–9 6436: 6404: 6372: 6348: 6316: 6281: 6249: 6214: 6179: 6144: 6130: 6091: 6078: 6065: 6052: 6043: 6029: 6020: 6007: 5998: 5981: 5938: 5917: 5881: 5851: 5819: 5787: 5747: 5711: 5688: 5660:(2): 693–721, 5642: 5622: 5590: 5575: 5549: 5535:978-0769508504 5534: 5508: 5486: 5472: 5428: 5409: 5386: 5368:(2): 639–658, 5349: 5326: 5319: 5294: 5287: 5270:10.1.1.458.464 5244: 5235: 5206:(3): 533–535. 5186: 5127: 5098:(2): 152–163. 5075: 5023: 4994:(3): 361–391. 4978: 4965: 4932: 4912:10.1.1.69.5482 4882: 4871:(1): 222–255. 4849: 4836:(2): 122–142. 4816: 4777:(19): 8584–5. 4757: 4724: 4706: 4684:10.1.1.130.158 4660: 4648:(2): 284–292, 4638:Whittle, Peter 4629: 4585:Whittle, Peter 4576: 4553: 4543:(2): 148–177, 4531:Whittle, Peter 4522: 4502:(2): 383–399, 4484: 4455: 4423: 4414: 4359: 4332:(2): 148–177. 4313: 4300:(5): 527–535. 4280: 4242: 4233: 4227: 4199: 4193: 4181:Gittins, J. C. 4165: 4142: 4123:(2): 262–268. 4105: 4068: 4066: 4063: 4062: 4061: 4056: 4051: 4046: 4041: 4033: 4030: 4024: 4021: 4015: 4012: 3994: 3993:Dueling bandit 3991: 3986: 3985:Other variants 3983: 3966: 3959: 3955: 3948: 3943: 3940: 3937: 3933: 3928: 3922: 3917: 3912: 3907: 3904: 3901: 3898: 3893: 3888: 3884: 3877: 3873: 3866: 3861: 3858: 3855: 3851: 3846: 3840: 3835: 3830: 3825: 3819: 3814: 3810: 3803: 3798: 3795: 3792: 3788: 3784: 3781: 3778: 3775: 3770: 3766: 3743: 3723: 3703: 3700: 3697: 3692: 3670: 3650: 3647: 3644: 3639: 3635: 3606: 3601: 3597: 3590: 3585: 3582: 3579: 3575: 3571: 3568: 3565: 3562: 3557: 3533: 3513: 3510: 3507: 3502: 3478: 3473: 3469: 3446: 3441: 3438: 3427:dynamic oracle 3409: 3404: 3401: 3398: 3394: 3388: 3383: 3379: 3375: 3372: 3367: 3363: 3342: 3320: 3316: 3295: 3273: 3268: 3264: 3241: 3236: 3232: 3228: 3223: 3218: 3215: 3212: 3208: 3185: 3180: 3177: 3157: 3142:non-stationary 3137: 3134: 3121: 3101: 3089: 3086: 3083: 3082: 3079: 3075: 3074: 3071: 3067: 3066: 3063: 3059: 3058: 3055: 3048: 3045: 3040: 3037: 3024: 3021: 3018: 3013: 3010: 3007: 3004: 3000: 2996: 2993: 2990: 2987: 2982: 2979: 2976: 2973: 2969: 2965: 2962: 2959: 2956: 2953: 2950: 2945: 2942: 2939: 2936: 2932: 2911: 2908: 2905: 2902: 2897: 2893: 2889: 2886: 2883: 2880: 2875: 2871: 2867: 2862: 2858: 2854: 2851: 2848: 2845: 2842: 2839: 2836: 2833: 2813: 2810: 2807: 2804: 2784: 2781: 2778: 2775: 2772: 2769: 2766: 2763: 2760: 2757: 2752: 2748: 2744: 2741: 2738: 2715: 2712: 2709: 2706: 2703: 2698: 2694: 2690: 2687: 2684: 2662: 2649: 2647: 2644: 2642: 2639: 2627: 2622: 2619: 2616: 2613: 2610: 2607: 2604: 2601: 2596: 2593: 2581: 2578: 2565: 2545: 2542: 2539: 2536: 2533: 2521: 2518: 2506: 2503: 2499: 2495: 2492: 2489: 2484: 2477: 2474: 2467: 2464: 2461: 2458: 2455: 2452: 2449: 2444: 2440: 2436: 2433: 2430: 2427: 2424: 2421: 2416: 2412: 2389: 2379: 2377: 2374: 2371: 2370: 2365: 2361: 2357: 2354: 2346: 2344: 2341: 2338: 2333: 2329: 2324: 2320: 2317: 2314: 2309: 2305: 2301: 2300: 2298: 2293: 2290: 2287: 2284: 2279: 2272: 2269: 2245: 2242: 2239: 2236: 2233: 2230: 2227: 2224: 2221: 2201: 2198: 2195: 2192: 2189: 2186: 2183: 2180: 2177: 2170: 2166: 2161: 2140: 2137: 2134: 2129: 2125: 2121: 2118: 2115: 2112: 2109: 2106: 2103: 2100: 2095: 2091: 2068: 2064: 2043: 2040: 2037: 2034: 2031: 2028: 2025: 2022: 2019: 1997: 1994: 1989: 1983: 1980: 1977: 1972: 1968: 1962: 1957: 1954: 1951: 1947: 1941: 1938: 1935: 1930: 1926: 1919: 1916: 1913: 1910: 1907: 1904: 1901: 1898: 1895: 1890: 1886: 1862: 1859: 1856: 1853: 1850: 1847: 1844: 1841: 1838: 1818: 1815: 1812: 1809: 1806: 1801: 1797: 1774: 1771: 1768: 1765: 1762: 1759: 1756: 1743: 1741: 1738: 1721: 1718: 1716: 1713: 1704: 1701: 1695: 1692: 1691: 1690: 1660: 1655: 1650: 1647: 1634: 1631: 1630: 1629: 1623: 1613: 1607: 1601: 1585: 1582: 1581: 1580: 1566: 1554: 1551: 1546: 1543: 1537: 1534: 1524: 1521: 1505: 1502: 1501: 1500: 1494: 1484: 1472: 1458: 1446: 1443: 1440: 1437: 1434: 1431: 1411: 1408: 1388: 1374: 1362: 1359: 1356: 1336: 1316: 1313: 1310: 1283: 1280: 1275: 1272: 1235: 1232: 1226: 1223: 1210: 1188: 1184: 1178: 1174: 1170: 1167: 1164: 1159: 1155: 1134: 1131: 1128: 1102: 1099: 1096: 1091: 1087: 1083: 1078: 1071: 1068: 1061: 1057: 1034: 1030: 1024: 1020: 1016: 1013: 1010: 1005: 1001: 980: 977: 974: 971: 968: 965: 962: 945: 944: 933: 911: 904: 901: 883: 871: 857: 843: 840: 837: 833: 827: 823: 819: 797:decision rule, 776: 775: 726: 724: 717: 711: 708: 684: 664: 661: 647: 643: 639: 606: 599: 596: 572: 567: 563: 559: 554: 550: 546: 541: 537: 514: 510: 484: 477: 474: 465: 460: 457: 454: 450: 446: 441: 437: 433: 430: 427: 405: 385: 358: 335: 331: 327: 324: 321: 316: 312: 289: 284: 279: 276: 256: 251: 247: 243: 240: 237: 232: 228: 224: 221: 218: 199: 196: 158: 157: 152: 146: 126: 123: 15: 9: 6: 4: 3: 2: 7965: 7954: 7951: 7949: 7946: 7944: 7941: 7939: 7936: 7934: 7931: 7929: 7926: 7925: 7923: 7906: 7903: 7901: 7898: 7897: 7890: 7886: 7883: 7881: 7878: 7877: 7874: 7870: 7869: 7866: 7860: 7857: 7855: 7852: 7850: 7847: 7845: 7842: 7840: 7837: 7835: 7832: 7830: 7827: 7825: 7822: 7820: 7817: 7815: 7812: 7810: 7807: 7805: 7802: 7800: 7797: 7795: 7792: 7790: 7787: 7786: 7784: 7782:Architectures 7780: 7774: 7771: 7769: 7766: 7764: 7761: 7759: 7756: 7754: 7751: 7749: 7746: 7744: 7741: 7739: 7736: 7734: 7731: 7730: 7728: 7726:Organizations 7724: 7718: 7715: 7713: 7710: 7708: 7705: 7703: 7700: 7698: 7695: 7693: 7690: 7688: 7685: 7683: 7680: 7678: 7675: 7673: 7670: 7668: 7665: 7663: 7662:Yoshua Bengio 7660: 7659: 7657: 7653: 7643: 7642:Robot control 7640: 7636: 7633: 7632: 7631: 7628: 7626: 7623: 7621: 7618: 7616: 7613: 7611: 7608: 7606: 7603: 7601: 7598: 7596: 7593: 7592: 7590: 7586: 7580: 7577: 7575: 7572: 7570: 7567: 7565: 7562: 7560: 7559:Chinchilla AI 7557: 7555: 7552: 7550: 7547: 7545: 7542: 7540: 7537: 7535: 7532: 7530: 7527: 7525: 7522: 7520: 7517: 7515: 7512: 7510: 7507: 7505: 7502: 7498: 7495: 7494: 7493: 7490: 7488: 7485: 7483: 7480: 7478: 7475: 7473: 7470: 7468: 7465: 7464: 7462: 7458: 7452: 7449: 7445: 7442: 7440: 7437: 7436: 7435: 7432: 7428: 7425: 7423: 7420: 7418: 7415: 7414: 7413: 7410: 7408: 7405: 7403: 7400: 7398: 7395: 7393: 7390: 7388: 7385: 7383: 7380: 7378: 7375: 7373: 7370: 7368: 7365: 7364: 7362: 7358: 7355: 7351: 7345: 7342: 7340: 7337: 7335: 7332: 7330: 7327: 7325: 7322: 7320: 7317: 7315: 7312: 7311: 7309: 7305: 7299: 7296: 7294: 7291: 7289: 7286: 7284: 7281: 7279: 7276: 7275: 7273: 7269: 7261: 7258: 7257: 7256: 7253: 7251: 7248: 7246: 7243: 7239: 7238:Deep learning 7236: 7235: 7234: 7231: 7227: 7224: 7223: 7222: 7219: 7218: 7216: 7212: 7206: 7203: 7201: 7198: 7194: 7191: 7190: 7189: 7186: 7184: 7181: 7177: 7174: 7172: 7169: 7167: 7164: 7163: 7162: 7159: 7157: 7154: 7152: 7149: 7147: 7144: 7142: 7139: 7137: 7134: 7132: 7129: 7127: 7126:Hallucination 7124: 7120: 7117: 7116: 7115: 7112: 7110: 7107: 7103: 7100: 7099: 7098: 7095: 7094: 7092: 7088: 7082: 7079: 7077: 7074: 7072: 7069: 7067: 7064: 7062: 7059: 7057: 7054: 7052: 7049: 7047: 7044: 7042: 7041: 7037: 7036: 7034: 7032: 7028: 7019: 7014: 7012: 7007: 7005: 7000: 6999: 6996: 6989: 6985: 6982: 6979: 6976: 6973: 6970: 6967: 6964: 6961: 6958: 6955: 6952: 6949: 6945: 6941: 6938: 6935: 6932: 6928: 6925: 6922: 6919: 6916: 6912: 6909: 6906: 6903: 6900: 6896: 6893: 6892: 6883: 6879: 6875: 6871: 6867: 6863: 6859: 6855: 6854: 6849: 6845: 6841: 6837: 6833: 6827: 6822: 6817: 6813: 6809: 6808:Katehakis, M. 6805: 6801: 6797: 6792: 6787: 6783: 6779: 6778: 6772: 6771: 6766: 6762: 6758: 6752: 6748: 6744: 6739: 6734: 6730: 6725: 6724: 6719:on 2013-12-11 6718: 6714: 6708: 6705:, MIT Press, 6704: 6703: 6697: 6692: 6687: 6683: 6679: 6678: 6673: 6669: 6665: 6659: 6655: 6650: 6645: 6640: 6636: 6632: 6627: 6624: 6620: 6616: 6612: 6607: 6602: 6598: 6594: 6589: 6588: 6583: 6582: 6581: 6575: 6571: 6548: 6544: 6540: 6536: 6532: 6528: 6523: 6518: 6514: 6510: 6506: 6499: 6497: 6483:on 2016-11-19 6479: 6472: 6471: 6463: 6461: 6449: 6448: 6440: 6433: 6429: 6424: 6419: 6415: 6408: 6401: 6397: 6392: 6387: 6383: 6376: 6368: 6363: 6359: 6352: 6345: 6341: 6336: 6331: 6327: 6320: 6307:on 2016-06-17 6303: 6299: 6292: 6285: 6278: 6274: 6269: 6264: 6260: 6253: 6240:on 2015-09-08 6236: 6232: 6225: 6218: 6205:on 2016-03-26 6201: 6197: 6190: 6183: 6170:on 2016-10-02 6166: 6162: 6155: 6148: 6141: 6134: 6127: 6123: 6118: 6113: 6109: 6105: 6098: 6096: 6088: 6082: 6075: 6069: 6062: 6056: 6047: 6040: 6033: 6024: 6017: 6011: 6002: 5995: 5991: 5985: 5978: 5974: 5969: 5964: 5960: 5956: 5952: 5945: 5943: 5935: 5928: 5921: 5914: 5910: 5905: 5900: 5897:: 1638–1646, 5896: 5892: 5885: 5871:on 2016-08-10 5870: 5866: 5862: 5855: 5848: 5844: 5839: 5834: 5830: 5823: 5816: 5812: 5807: 5802: 5798: 5791: 5784: 5780: 5775: 5770: 5766: 5762: 5758: 5751: 5744: 5740: 5735: 5730: 5727:: 2071–2080, 5726: 5722: 5715: 5707: 5703: 5699: 5692: 5685: 5681: 5677: 5673: 5668: 5663: 5659: 5655: 5654: 5646: 5635: 5634: 5626: 5619: 5615: 5610: 5605: 5601: 5594: 5586: 5582: 5578: 5576:9781457721748 5572: 5568: 5564: 5560: 5553: 5545: 5541: 5537: 5531: 5527: 5523: 5519: 5512: 5504: 5497: 5490: 5483: 5479: 5475: 5473:9781605587998 5469: 5465: 5461: 5457: 5453: 5448: 5443: 5439: 5432: 5424: 5420: 5413: 5405: 5401: 5397: 5390: 5383: 5379: 5375: 5371: 5367: 5363: 5356: 5354: 5344: 5337: 5330: 5322: 5316: 5312: 5305: 5298: 5290: 5284: 5280: 5276: 5271: 5266: 5262: 5255: 5248: 5239: 5231: 5227: 5222: 5217: 5213: 5209: 5205: 5201: 5197: 5190: 5182: 5178: 5173: 5168: 5163: 5158: 5154: 5150: 5146: 5142: 5138: 5131: 5123: 5119: 5115: 5111: 5106: 5101: 5097: 5093: 5089: 5082: 5080: 5071: 5067: 5063: 5059: 5054: 5049: 5045: 5041: 5037: 5030: 5028: 5019: 5015: 5011: 5007: 5002: 4997: 4993: 4989: 4982: 4976:, pp. 115–122 4975: 4969: 4960: 4955: 4951: 4947: 4943: 4936: 4922:on 2012-05-25 4918: 4913: 4908: 4904: 4900: 4893: 4886: 4878: 4874: 4870: 4866: 4865: 4860: 4853: 4844: 4839: 4835: 4831: 4827: 4820: 4812: 4808: 4803: 4798: 4793: 4788: 4784: 4780: 4776: 4772: 4768: 4761: 4752: 4747: 4743: 4739: 4735: 4728: 4721: 4717: 4710: 4702: 4698: 4694: 4690: 4685: 4680: 4676: 4673: 4672: 4664: 4656: 4651: 4647: 4643: 4639: 4633: 4626: 4622: 4618: 4614: 4610: 4606: 4602: 4598: 4594: 4590: 4586: 4580: 4569: 4568: 4560: 4558: 4550: 4546: 4542: 4538: 4537: 4532: 4526: 4519: 4515: 4510: 4505: 4501: 4497: 4496: 4488: 4475:on 2021-12-04 4474: 4470: 4466: 4459: 4452: 4448: 4443: 4438: 4434: 4427: 4418: 4410: 4406: 4401: 4396: 4391: 4386: 4382: 4378: 4374: 4370: 4363: 4355: 4351: 4347: 4343: 4339: 4335: 4331: 4327: 4323: 4322:J. C. Gittins 4317: 4308: 4303: 4299: 4295: 4291: 4284: 4277: 4273: 4268: 4263: 4259: 4255: 4254: 4246: 4237: 4230: 4224: 4220: 4216: 4210: 4208: 4206: 4204: 4196: 4190: 4186: 4182: 4176: 4174: 4172: 4170: 4163: 4159: 4155: 4151: 4146: 4138: 4134: 4130: 4126: 4122: 4118: 4117: 4109: 4100: 4095: 4091: 4087: 4083: 4076: 4074: 4069: 4060: 4057: 4055: 4054:Search theory 4052: 4050: 4047: 4045: 4042: 4039: 4038:Gittins index 4036: 4035: 4029: 4020: 4011: 4007: 4005: 4001: 3990: 3982: 3978: 3964: 3957: 3953: 3946: 3941: 3938: 3935: 3931: 3926: 3920: 3915: 3905: 3899: 3886: 3882: 3875: 3871: 3864: 3859: 3856: 3853: 3849: 3844: 3838: 3833: 3823: 3817: 3812: 3808: 3801: 3796: 3793: 3790: 3786: 3782: 3776: 3768: 3764: 3755: 3741: 3721: 3698: 3668: 3645: 3637: 3633: 3625: 3620: 3604: 3599: 3595: 3588: 3583: 3580: 3577: 3573: 3569: 3563: 3545: 3531: 3508: 3476: 3471: 3467: 3439: 3436: 3428: 3423: 3407: 3402: 3399: 3396: 3386: 3381: 3377: 3370: 3365: 3361: 3353:, defined as 3340: 3318: 3314: 3293: 3271: 3266: 3262: 3239: 3234: 3230: 3226: 3221: 3216: 3213: 3210: 3206: 3178: 3175: 3155: 3147: 3146:concept drift 3143: 3133: 3119: 3099: 3080: 3077: 3076: 3072: 3069: 3068: 3064: 3061: 3060: 3056: 3053: 3052: 3044: 3019: 3008: 3002: 2998: 2994: 2988: 2977: 2971: 2967: 2963: 2957: 2954: 2951: 2940: 2934: 2930: 2903: 2895: 2891: 2887: 2881: 2873: 2869: 2860: 2852: 2849: 2846: 2843: 2837: 2831: 2808: 2802: 2779: 2773: 2770: 2767: 2764: 2758: 2750: 2746: 2742: 2739: 2728: 2713: 2710: 2704: 2696: 2692: 2688: 2685: 2675: 2660: 2652: 2638: 2617: 2611: 2608: 2605: 2602: 2599: 2591: 2577: 2563: 2540: 2537: 2534: 2501: 2497: 2490: 2482: 2472: 2465: 2459: 2456: 2450: 2442: 2438: 2434: 2428: 2425: 2422: 2414: 2410: 2375: 2372: 2363: 2359: 2355: 2352: 2339: 2331: 2327: 2322: 2315: 2307: 2303: 2296: 2291: 2285: 2277: 2267: 2243: 2240: 2237: 2234: 2231: 2228: 2225: 2222: 2219: 2196: 2193: 2190: 2184: 2178: 2168: 2164: 2159: 2135: 2127: 2123: 2119: 2116: 2113: 2110: 2107: 2101: 2093: 2089: 2066: 2062: 2041: 2038: 2035: 2032: 2029: 2026: 2023: 2020: 2017: 1995: 1992: 1987: 1978: 1970: 1966: 1960: 1955: 1952: 1949: 1945: 1936: 1928: 1924: 1914: 1911: 1908: 1902: 1896: 1888: 1884: 1875: 1860: 1857: 1854: 1851: 1848: 1845: 1842: 1839: 1836: 1816: 1813: 1807: 1799: 1795: 1787: 1769: 1766: 1763: 1757: 1754: 1746: 1737: 1728: 1712: 1710: 1700: 1688: 1685: 1684: 1683: 1676: 1672: 1653: 1645: 1627: 1624: 1621: 1620:random forest 1617: 1614: 1611: 1608: 1605: 1602: 1599: 1595: 1591: 1588: 1587: 1578: 1574: 1570: 1567: 1564: 1562: 1557: 1556: 1550: 1542: 1533: 1530: 1520: 1517: 1515: 1511: 1498: 1495: 1492: 1488: 1485: 1470: 1462: 1459: 1444: 1438: 1435: 1432: 1409: 1406: 1386: 1378: 1375: 1360: 1357: 1354: 1334: 1314: 1311: 1308: 1300: 1297: 1296: 1295: 1293: 1289: 1279: 1271: 1267: 1263: 1259: 1256: 1252: 1248: 1241: 1240:Gittins index 1231: 1222: 1208: 1186: 1182: 1176: 1168: 1165: 1162: 1157: 1153: 1132: 1129: 1126: 1118: 1114: 1100: 1097: 1089: 1085: 1081: 1076: 1066: 1032: 1028: 1022: 1014: 1011: 1008: 1003: 999: 975: 972: 969: 963: 960: 952: 948: 931: 909: 899: 887: 886:Decision rule 884: 869: 861: 860:Stopping rule 858: 841: 838: 835: 825: 821: 809: 808:Sampling rule 806: 805: 804: 802: 801:stopping rule 798: 794: 793:sampling rule 789: 787: 783: 772: 769: 761: 751: 747: 743: 737: 736: 732: 727:This section 725: 721: 716: 715: 707: 705: 700: 696: 682: 674: 670: 660: 645: 641: 637: 629: 624: 622: 604: 597: 594: 565: 561: 552: 544: 539: 535: 512: 508: 498: 482: 475: 472: 463: 458: 455: 452: 448: 444: 439: 435: 431: 428: 425: 417: 403: 383: 376: 372: 356: 333: 329: 325: 322: 319: 314: 310: 287: 277: 274: 249: 245: 241: 238: 235: 230: 226: 219: 216: 209: 208:distributions 205: 195: 193: 188: 186: 182: 181:Peter Whittle 178: 173: 169: 167: 163: 156: 153: 150: 147: 144: 141: 140: 139: 131: 122: 120: 116: 115:Gittins index 112: 108: 106: 102: 98: 94: 90: 85: 83: 79: 75: 74:slot machines 71: 67: 63: 58: 56: 51: 49: 47: 43: 38: 34: 30: 21: 7748:Hugging Face 7712:David Silver 7360:Audio–visual 7214:Applications 7193:Augmentation 7038: 6857: 6851: 6811: 6781: 6775: 6728: 6717:the original 6701: 6681: 6675: 6653: 6634: 6630: 6596: 6592: 6578: 6577: 6576:profile for 6573: 6512: 6508: 6485:, retrieved 6478:the original 6469: 6446: 6439: 6413: 6407: 6381: 6375: 6357: 6351: 6325: 6319: 6309:, retrieved 6302:the original 6297: 6284: 6258: 6252: 6242:, retrieved 6235:the original 6230: 6217: 6207:, retrieved 6200:the original 6195: 6182: 6172:, retrieved 6165:the original 6160: 6147: 6139: 6133: 6107: 6103: 6081: 6068: 6055: 6046: 6032: 6023: 6010: 6001: 5984: 5958: 5954: 5933: 5920: 5894: 5884: 5873:. Retrieved 5869:the original 5864: 5854: 5828: 5822: 5796: 5790: 5764: 5760: 5750: 5724: 5714: 5705: 5701: 5691: 5657: 5651: 5645: 5632: 5625: 5599: 5593: 5558: 5552: 5517: 5511: 5502: 5489: 5437: 5431: 5422: 5412: 5403: 5399: 5389: 5365: 5361: 5342: 5329: 5310: 5297: 5260: 5247: 5238: 5203: 5199: 5189: 5144: 5140: 5130: 5095: 5091: 5043: 5039: 4991: 4987: 4981: 4973: 4968: 4949: 4945: 4935: 4924:. Retrieved 4917:the original 4902: 4898: 4885: 4868: 4862: 4852: 4833: 4829: 4819: 4774: 4770: 4760: 4741: 4737: 4727: 4719: 4709: 4677:(1): 48–77. 4674: 4669: 4663: 4645: 4641: 4632: 4592: 4588: 4579: 4566: 4540: 4539:, Series B, 4534: 4525: 4499: 4493: 4487: 4477:, retrieved 4473:the original 4468: 4458: 4432: 4426: 4421:Press (1986) 4417: 4372: 4368: 4362: 4329: 4325: 4316: 4297: 4293: 4283: 4257: 4251: 4245: 4236: 4218: 4184: 4145: 4120: 4114: 4108: 4089: 4085: 4026: 4017: 4008: 3996: 3988: 3979: 3756: 3623: 3621: 3546: 3426: 3424: 3141: 3139: 3132:dimensions. 3091: 3042: 2795:2. Pull arm 2726: 2673: 2650: 2583: 2523: 1873: 1785: 1744: 1734: 1706: 1697: 1686: 1681: 1636: 1625: 1615: 1609: 1603: 1594:regressogram 1593: 1589: 1575:rather than 1568: 1560: 1558: 1548: 1539: 1528: 1526: 1518: 1509: 1507: 1496: 1486: 1460: 1376: 1298: 1291: 1285: 1277: 1268: 1264: 1260: 1243: 1228: 1116: 1115: 950: 949: 946: 885: 859: 807: 800: 796: 792: 790: 785: 781: 779: 764: 755: 740:Please help 728: 701: 697: 672: 668: 666: 627: 625: 620: 499: 418: 302:levers. Let 203: 201: 189: 177:World War II 174: 170: 161: 159: 136: 109: 92: 86: 72:at a row of 59: 52: 45: 41: 40: 36: 26: 7896:Categories 7844:Autoencoder 7799:Transformer 7667:Alex Graves 7615:OpenAI Five 7519:IBM Watsonx 7141:Convolution 7119:Overfitting 6931:open-source 6905:PyMaBandits 6899:open-source 6672:Robbins, H. 6515:: 665–702, 6037:199–207< 5046:(1): 2–17. 4744:(1): 4–22. 4595:: 287–298, 3734:for policy 3661:for policy 3622:Hence, the 3306:. Instead, 3047:Exp3 vs FPL 3039:Explanation 2651:Parameters: 2520:Explanation 1745:Parameters: 758:August 2024 97:information 7922:Categories 7885:Technology 7738:EleutherAI 7697:Fei-Fei Li 7692:Yann LeCun 7605:Q-learning 7588:Decisional 7514:IBM Watson 7422:Midjourney 7314:TensorFlow 7161:Activation 7114:Regression 7109:Clustering 6911:Contextual 6522:1710.04805 6487:2019-06-14 6423:1502.03473 6335:1604.07101 6311:2016-04-27 6268:1506.00312 6244:2016-04-29 6209:2016-04-27 6174:2016-04-29 5994:1510.00757 5968:1504.06937 5875:2016-06-10 5806:1906.08947 5774:1706.00136 5734:1703.00048 4926:2012-10-12 4479:2016-03-20 4065:References 704:asymptotic 663:Variations 7768:MIT CSAIL 7733:Anthropic 7702:Andrew Ng 7600:AlphaZero 7444:VideoPoet 7407:AlphaFold 7344:MindSpore 7298:SpiNNaker 7293:Memristor 7200:Diffusion 7176:Rectifier 7156:Batchnorm 7136:Attention 7131:Adversary 6927:Banditlib 6738:1409.8191 6606:0711.3861 6391:1401.8257 6367:1306.0811 6112:CiteSeerX 5904:1402.0555 5838:1309.6869 5667:1110.6084 5609:1003.1630 5505:: 208–214 5482:207178795 5447:1003.0146 5265:CiteSeerX 5122:247682940 5114:2691-4581 5070:235475602 5062:2691-4581 5001:0905.2776 4907:CiteSeerX 4679:CiteSeerX 4625:202109695 4504:CiteSeerX 4442:1009.5419 3932:∑ 3921:μ 3916:π 3906:− 3850:∑ 3839:μ 3834:π 3824:− 3818:∗ 3809:μ 3787:∑ 3769:π 3765:ρ 3742:π 3669:π 3638:π 3634:ρ 3605:∗ 3596:μ 3574:∑ 3477:∗ 3468:μ 3440:∈ 3378:μ 3362:μ 3315:μ 3263:μ 3231:μ 3227:≠ 3214:− 3207:μ 3179:∈ 2780:η 2765:∼ 2737:∀ 2683:∀ 2661:η 2646:Algorithm 2564:γ 2541:γ 2538:− 2476:^ 2466:γ 2460:⁡ 2439:ω 2411:ω 2382:otherwise 2271:^ 2185:∈ 1993:γ 1967:ω 1946:∑ 1925:ω 1915:γ 1912:− 1796:ω 1758:∈ 1755:γ 1740:Algorithm 1563:algorithm 1471:ϵ 1439:ϵ 1436:− 1407:ϵ 1355:ϵ 1335:ϵ 1315:ϵ 1312:− 1255:Katehakis 1247:Katehakis 1209:δ 1183:μ 1169:⁡ 1163:∈ 1158:⋆ 1130:≥ 1101:δ 1098:≤ 1090:⋆ 1082:≠ 1077:τ 1070:^ 1029:μ 1015:⁡ 1009:∈ 1004:⋆ 964:∈ 961:δ 932:τ 910:τ 903:^ 870:τ 839:≥ 729:does not 638:ρ 598:^ 562:μ 540:∗ 536:μ 513:∗ 509:μ 476:^ 449:∑ 445:− 440:∗ 436:μ 426:ρ 384:ρ 330:μ 323:… 311:μ 278:∈ 239:… 194:in 1952. 7876:Portals 7635:Auto-GPT 7467:Word2vec 7271:Hardware 7188:Datasets 7090:Concepts 6895:MABWiser 6765:14155718 6599:: 1–50, 5684:14258665 5585:14125100 5544:28713091 5230:31196672 5181:25815510 4811:11607577 4701:13209702 4409:20018711 4354:17724147 4183:(1989), 4032:See also 3254:. Thus, 2727:For each 2349:if  2054:2. Draw 1874:For each 93:a priori 7758:Meta AI 7595:AlphaGo 7579:PanGu-Σ 7549:ChatGPT 7524:Granite 7472:Seq2seq 7451:Whisper 7372:WaveNet 7367:AlexNet 7339:Flux.jl 7319:PyTorch 7171:Sigmoid 7166:Softmax 7031:General 6874:3689689 6840:4355518 6800:2959678 6623:1654066 6570:Scholia 6547:8517525 6527:Bibcode 6428:Bibcode 6396:Bibcode 6340:Bibcode 6273:Bibcode 5996:(2015). 5973:Bibcode 5909:Bibcode 5865:Aistats 5843:Bibcode 5811:Bibcode 5779:Bibcode 5739:Bibcode 5614:Bibcode 5452:Bibcode 5221:6687547 5172:4376795 5149:Bibcode 4779:Bibcode 4617:0974588 4609:3214163 4447:Bibcode 4400:2793317 4377:Bibcode 4346:2985029 4276:2959678 2212:4. For 1559:LinUCB 1491:softmax 1251:Robbins 750:removed 735:sources 185:Germany 70:gambler 7773:Huawei 7753:OpenAI 7655:People 7625:MuZero 7487:Gemini 7482:Claude 7417:DALL-E 7329:Theano 6882:656323 6880:  6872:  6838:  6828:  6798:  6763:  6753:  6709:  6660:  6621:  6572:has a 6545:  6114:  5682:  5583:  5573:  5542:  5532:  5480:  5470:  5382:573750 5380:  5317:  5285:  5267:  5228:  5218:  5200:Neuron 5179:  5169:  5120:  5112:  5068:  5060:  5018:821462 5016:  4909:  4809:  4799:  4699:  4681:  4623:  4615:  4607:  4506:  4407:  4397:  4352:  4344:  4274:  4225:  4191:  4137:656323 4135:  3624:regret 1724:": --> 1288:greedy 799:and a 583:, and 500:where 396:after 375:regret 373:. The 204:bandit 55:regret 35:, the 7839:Mamba 7610:SARSA 7574:LLaMA 7569:BLOOM 7554:GPT-J 7544:GPT-4 7539:GPT-3 7534:GPT-2 7529:GPT-1 7492:LaMDA 7324:Keras 6948:Part2 6944:Part1 6878:S2CID 6870:JSTOR 6836:JSTOR 6796:JSTOR 6761:S2CID 6733:arXiv 6619:S2CID 6601:arXiv 6574:topic 6543:S2CID 6517:arXiv 6481:(PDF) 6474:(PDF) 6451:(PDF) 6418:arXiv 6386:arXiv 6362:arXiv 6330:arXiv 6305:(PDF) 6294:(PDF) 6263:arXiv 6238:(PDF) 6227:(PDF) 6203:(PDF) 6192:(PDF) 6168:(PDF) 6157:(PDF) 5990:arXiv 5963:arXiv 5930:(PDF) 5899:arXiv 5833:arXiv 5801:arXiv 5769:arXiv 5729:arXiv 5680:S2CID 5662:arXiv 5637:(PDF) 5604:arXiv 5581:S2CID 5540:S2CID 5499:(PDF) 5478:S2CID 5442:arXiv 5378:S2CID 5339:(PDF) 5307:(PDF) 5257:(PDF) 5118:S2CID 5066:S2CID 5014:S2CID 4996:arXiv 4920:(PDF) 4895:(PDF) 4802:41010 4697:S2CID 4621:S2CID 4605:JSTOR 4571:(PDF) 4437:arXiv 4350:S2CID 4342:JSTOR 4272:JSTOR 4133:S2CID 2653:Real 1747:Real 1529:price 1510:match 44:- or 7763:Mila 7564:PaLM 7497:Bard 7477:BERT 7460:Text 7439:Sora 6826:ISBN 6751:ISBN 6707:ISBN 6658:ISBN 6089:> 6076:> 6063:> 6041:> 5571:ISBN 5530:ISBN 5468:ISBN 5315:ISBN 5283:ISBN 5226:PMID 5177:PMID 5110:ISSN 5058:ISSN 4807:PMID 4405:PMID 4223:ISBN 4189:ISBN 4152:and 3057:FPL 3054:Exp3 1829:for 1726:edit 1720:Exp3 1618:: a 1292:best 1249:and 795:, a 733:any 731:cite 31:and 7504:NMT 7387:OCR 7382:HWR 7334:JAX 7288:VPU 7283:TPU 7278:IPU 7102:SGD 6862:doi 6816:doi 6786:doi 6743:doi 6686:doi 6639:doi 6611:doi 6535:doi 6122:doi 5672:doi 5563:doi 5522:doi 5460:doi 5370:doi 5275:doi 5216:PMC 5208:doi 5204:103 5167:PMC 5157:doi 5100:doi 5048:doi 5006:doi 4954:doi 4950:411 4873:doi 4838:doi 4797:PMC 4787:doi 4746:doi 4689:doi 4650:doi 4597:doi 4593:25A 4545:doi 4514:doi 4395:PMC 4385:doi 4373:106 4334:doi 4302:doi 4262:doi 4158:doi 4125:doi 4094:doi 2857:max 2457:exp 1596:in 1361:0.1 1173:max 1166:arg 1113:. 1019:max 1012:arg 744:by 671:or 549:max 164:in 27:In 7924:: 6946:. 6929:, 6897:, 6876:, 6868:, 6858:12 6856:, 6834:, 6824:, 6794:, 6780:, 6759:, 6749:, 6741:, 6682:58 6680:, 6635:40 6633:, 6617:, 6609:, 6597:58 6595:, 6541:, 6533:, 6525:, 6513:58 6511:, 6507:, 6495:^ 6459:^ 6426:, 6416:, 6394:, 6384:, 6338:, 6328:, 6296:, 6271:, 6261:, 6229:, 6194:, 6159:, 6120:, 6108:78 6106:, 6094:^ 5971:, 5959:28 5957:, 5953:, 5941:^ 5932:, 5907:, 5893:, 5863:. 5841:, 5809:, 5799:, 5777:, 5765:30 5763:, 5759:, 5737:, 5723:, 5706:23 5704:, 5700:, 5678:, 5670:, 5658:41 5656:, 5612:, 5579:. 5569:. 5538:. 5528:. 5501:, 5476:, 5466:, 5458:, 5450:, 5421:, 5404:24 5402:, 5398:, 5376:, 5366:26 5364:, 5352:^ 5341:, 5309:, 5281:, 5273:, 5259:, 5224:. 5214:. 5202:. 5198:. 5175:. 5165:. 5155:. 5145:11 5143:. 5139:. 5116:. 5108:. 5094:. 5090:. 5078:^ 5064:. 5056:. 5042:. 5038:. 5026:^ 5012:. 5004:. 4992:85 4990:. 4948:. 4944:. 4905:. 4903:20 4901:. 4897:. 4869:22 4867:. 4834:17 4832:. 4828:. 4805:. 4795:. 4785:. 4775:92 4773:. 4769:. 4740:. 4736:. 4718:, 4695:. 4687:. 4675:32 4644:, 4619:, 4613:MR 4611:, 4603:, 4591:, 4556:^ 4541:41 4512:, 4500:59 4498:, 4467:, 4445:, 4435:, 4403:, 4393:, 4383:, 4371:, 4348:. 4340:. 4330:41 4328:. 4298:58 4296:. 4292:. 4270:, 4256:, 4202:^ 4168:^ 4131:. 4121:12 4119:. 4090:47 4088:. 4084:. 4072:^ 3754:: 3425:A 3422:. 3198:: 2824:: 1221:. 888:: 862:: 810:: 626:A 623:. 497:, 168:. 84:. 7017:e 7010:t 7003:v 6980:. 6968:. 6962:. 6950:. 6939:. 6915:R 6864:: 6842:. 6818:: 6803:. 6788:: 6782:2 6768:. 6745:: 6735:: 6721:. 6695:. 6688:: 6667:. 6648:. 6641:: 6613:: 6603:: 6584:. 6537:: 6529:: 6519:: 6430:: 6420:: 6398:: 6388:: 6364:: 6342:: 6332:: 6275:: 6265:: 6124:: 5992:: 5975:: 5965:: 5911:: 5901:: 5878:. 5845:: 5835:: 5813:: 5803:: 5781:: 5771:: 5741:: 5731:: 5674:: 5664:: 5616:: 5606:: 5587:. 5565:: 5546:. 5524:: 5462:: 5454:: 5444:: 5372:: 5347:. 5324:. 5292:. 5277:: 5232:. 5210:: 5183:. 5159:: 5151:: 5124:. 5102:: 5096:3 5072:. 5050:: 5044:2 5020:. 5008:: 4998:: 4962:. 4956:: 4929:. 4879:. 4875:: 4846:. 4840:: 4813:. 4789:: 4781:: 4754:. 4748:: 4742:6 4703:. 4691:: 4652:: 4646:9 4599:: 4547:: 4516:: 4449:: 4439:: 4411:. 4387:: 4379:: 4356:. 4336:: 4310:. 4304:: 4264:: 4258:2 4160:: 4139:. 4127:: 4102:. 4096:: 3965:] 3958:t 3954:r 3947:T 3942:1 3939:= 3936:t 3927:[ 3911:E 3903:) 3900:T 3897:( 3892:D 3887:= 3883:] 3876:t 3872:r 3865:T 3860:1 3857:= 3854:t 3845:[ 3829:E 3813:t 3802:T 3797:1 3794:= 3791:t 3783:= 3780:) 3777:T 3774:( 3722:T 3702:) 3699:T 3696:( 3691:D 3649:) 3646:T 3643:( 3600:t 3589:T 3584:1 3581:= 3578:t 3570:= 3567:) 3564:T 3561:( 3556:D 3532:T 3512:) 3509:T 3506:( 3501:D 3472:t 3445:T 3437:t 3408:T 3403:1 3400:= 3397:t 3393:} 3387:k 3382:t 3374:{ 3371:= 3366:k 3341:k 3319:k 3294:k 3272:k 3267:t 3240:k 3235:t 3222:k 3217:1 3211:t 3184:T 3176:t 3156:k 3120:K 3100:K 3023:) 3020:t 3017:( 3012:) 3009:t 3006:( 3003:I 2999:x 2995:+ 2992:) 2989:t 2986:( 2981:) 2978:t 2975:( 2972:I 2968:R 2964:= 2961:) 2958:1 2955:+ 2952:t 2949:( 2944:) 2941:t 2938:( 2935:I 2931:R 2910:} 2907:) 2904:t 2901:( 2896:i 2892:Z 2888:+ 2885:) 2882:t 2879:( 2874:i 2870:R 2866:{ 2861:i 2853:g 2850:r 2847:a 2844:= 2841:) 2838:t 2835:( 2832:I 2812:) 2809:t 2806:( 2803:I 2783:) 2777:( 2774:p 2771:x 2768:E 2762:) 2759:t 2756:( 2751:i 2747:Z 2743:: 2740:i 2714:0 2711:= 2708:) 2705:1 2702:( 2697:i 2693:R 2689:: 2686:i 2626:) 2621:) 2618:K 2615:( 2612:g 2609:o 2606:l 2603:T 2600:K 2595:( 2592:O 2544:) 2535:1 2532:( 2505:) 2502:K 2498:/ 2494:) 2491:t 2488:( 2483:j 2473:x 2463:( 2454:) 2451:t 2448:( 2443:j 2435:= 2432:) 2429:1 2426:+ 2423:t 2420:( 2415:j 2376:, 2373:0 2364:t 2360:i 2356:= 2353:j 2343:) 2340:t 2337:( 2332:j 2328:p 2323:/ 2319:) 2316:t 2313:( 2308:j 2304:x 2297:{ 2292:= 2289:) 2286:t 2283:( 2278:j 2268:x 2244:K 2241:, 2238:. 2235:. 2232:. 2229:, 2226:1 2223:= 2220:j 2200:] 2197:1 2194:, 2191:0 2188:[ 2182:) 2179:t 2176:( 2169:t 2165:i 2160:x 2139:) 2136:t 2133:( 2128:K 2124:p 2120:, 2117:. 2114:. 2111:. 2108:, 2105:) 2102:t 2099:( 2094:1 2090:p 2067:t 2063:i 2042:K 2039:, 2036:. 2033:. 2030:. 2027:, 2024:1 2021:= 2018:i 1996:K 1988:+ 1982:) 1979:t 1976:( 1971:j 1961:K 1956:1 1953:= 1950:j 1940:) 1937:t 1934:( 1929:i 1918:) 1909:1 1906:( 1903:= 1900:) 1897:t 1894:( 1889:i 1885:p 1861:K 1858:, 1855:. 1852:. 1849:. 1846:, 1843:1 1840:= 1837:i 1817:1 1814:= 1811:) 1808:1 1805:( 1800:i 1773:] 1770:1 1767:, 1764:0 1761:( 1730:] 1659:) 1654:T 1649:( 1646:O 1445:N 1442:) 1433:1 1430:( 1410:N 1387:N 1358:= 1309:1 1187:k 1177:k 1154:a 1133:1 1127:T 1095:) 1086:a 1067:a 1060:( 1056:P 1033:k 1023:k 1000:a 979:) 976:1 973:, 970:0 967:( 900:a 842:1 836:t 832:) 826:t 822:a 818:( 771:) 765:( 760:) 756:( 752:. 738:. 683:p 646:T 642:/ 621:t 605:t 595:r 571:} 566:k 558:{ 553:k 545:= 483:t 473:r 464:T 459:1 456:= 453:t 432:T 429:= 404:T 357:H 334:K 326:, 320:, 315:1 288:+ 283:N 275:K 255:} 250:K 246:R 242:, 236:, 231:1 227:R 223:{ 220:= 217:B 46:N 42:K

Index


probability theory
machine learning
regret
reinforcement learning
exploration–exploitation tradeoff dilemma
gambler
slot machines
one-armed bandits
stochastic scheduling
probability distribution
information
machine learning
pharmaceutical company
Herbert Robbins
Gittins index
John C. Gittins

clinical trials
adaptive routing
financial portfolio design
machine learning
World War II
Peter Whittle
Germany
Herbert Robbins
distributions
Markov decision process
regret
asymptotic

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.