多腕バンディット問題

多腕バンディット問題（たわんばんでぃっともんだい、Multi-armed bandit problem）は、確率論と機械学習において、一定の限られた資源のセットを競合する選択肢間で、期待利得を最大化するように配分しなければならない問題。それぞれの選択肢の特性が、配分時には一部しか分かっておらず、時間が経過したり選択肢に資源が配分されることで理解できる可能性がある^[1]^[2]。これは、探索(exploration)と活用(exploitation)のトレードオフのジレンマを例証する古典的な強化学習の問題である。この名前は、スロットマシン（単腕バンディットとも呼ばれる）の列で、どのマシンをプレイするか、各マシンを何回プレイするか、どの順番でプレイするか、現在のマシンを続けるか別のマシンを試すかを決めなければならないギャンブラーを想像することに由来している^[3]。多腕バンディット問題も、広義の確率的スケジューリングに分類される。

経験的動機

多腕バンディット問題は、新しい知識の取得（探索 exploration）と既存の知識に基づいた意思決定の最適化（活用 exploitation）を同時に試みるエージェントをモデル化したものである。エージェントは、これらの競合するタスクのバランスをとりながら、考慮される期間中の総価値を最大化しようとする。以下のような例がある。

患者の損失を最小限に抑えながら、さまざまな実験的治療の効果を調査する臨床試験^[1] ^[4]
ネットワークの遅延を最小化するための適応的なルーティングの取り組み
金融ポートフォリオの設計^[5]^[6]

このような実用例では、すでに獲得した知識に基づく報酬の最大化と、さらに知識を増やすための新しい行動の思考とのバランスが問題となる。これは、機械学習における探索 exploration と活用 exploitation のトレードオフとして知られる。

このモデルは、さまざまなプロジェクトへのリソースの動的な配分を制御するために使用されており、それぞれの可能性の難易度と報酬に関する不確実性がある場合、どのプロジェクトに取り組むかという問題に答えている^[7]。

第二次世界大戦で連合国の科学者によって検討されたが、それはあまりに難解なため、ピーター・ホイットルによれば、ドイツの科学者も時間を浪費できるようにと、この問題をドイツに投下することが提案されたのだという^[8]。

現在一般的に分析されているのは、1952年にハーバート・ロビンスによって定式されたバージョンである。

多腕バンディットモデル

多腕バンディット（略称：バンディットまたは MAB）は、確率分布 $B=\{R_{1},\dots ,R_{K}\}$ の集合と見做すことができる。各確率分布は、 $K\in \mathbb {N} ^{+}$ 個のレバーのそれぞれによって配分される報酬に関連する。 $\mu _{1},\dots ,\mu _{K}$ を報酬分布の平均値とする。ギャンブラーは各ラウンドに1つのレバーを操作し、報酬を観察する。収集された報酬の合計を最大化することが目的である。地平線 $H$ は残りのラウンド数である。バンディット問題は、形式的には1状態のマルコフ決定過程と同等である。 $T$ ラウンド後の後悔 $\rho$ は、最適な戦略による報酬の合計と収集された報酬の合計との間の差の期待値として定義される。

\rho =T\mu ^{*}-\sum _{t=1}^{T}{\widehat {r}}_{t}

ここで、最大報酬平均 $\mu ^{*}$ は $\mu ^{*}=\max _{k}\{\mu _{k}\}$ を満たす。 ${\widehat {r}}_{t}$ はラウンド t の報酬である。

ゼロ後悔戦略とは、ラウンドごとの平均後悔が $\rho /T$ が確率 1 でゼロになる戦略である^[9]。直感的には、十分なラウンドがプレイされれば、後悔ゼロの戦略は最適な戦略に収束することが保証される。

脚注

^ ^a ^b John C. Gittins (1989), Multi-armed bandit allocation indices, Wiley-Interscience Series in Systems and Optimization., Chichester: John Wiley & Sons, Ltd., ISBN 978-0-471-92059-5
^ Don Berry; Fristedt, Bert (1985), Bandit problems: Sequential allocation of experiments, Monographs on Statistics and Applied Probability, London: Chapman & Hall, ISBN 978-0-412-24810-8
^ Weber, Richard (1992), “On the Gittins index for multiarmed bandits”, Annals of Applied Probability 2 (4): 1024-1033, doi:10.1214/aoap/1177005588, JSTOR 2959678, https://jstor.org/stable/2959678
^ Press, William H. (2009), “Bandit solutions provide unified ethical models for randomized clinical trials and comparative effectiveness research”, Proceedings of the National Academy of Sciences 106 (52): 22387-22392, Bibcode: 2009PNAS..10622387P, doi:10.1073/pnas.0912378106, PMC 2793317, PMID 20018711, http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=2793317.
^ Brochu, Eric; Hoffman, Matthew W.; de Freitas, Nando (2010-09), Portfolio Allocation for Bayesian Optimization, arXiv:1009.5419, Bibcode: 2010arXiv1009.5419B
^ Shen, Weiwei; Wang, Jun; Jiang, Yu-Gang; Zha, Hongyuan (2015), “Portfolio Choices with Orthogonal Bandit Learning”, Proceedings of International Joint Conferences on Artificial Intelligence (IJCAI2015), http://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/paper/viewPDFInterstitial/10972/10798
^ Farias, Vivek F; Ritesh, Madan (2011), “The irrevocable multiarmed bandit problem”, Operations Research 59 (2): 383-399, doi:10.1287/opre.1100.0891
^ Peter Whittle (1979), “Discussion of Dr Gittins' paper”, Journal of the Royal Statistical Society, Series B 41 (2): 148-177, doi:10.1111/j.2517-6161.1979.tb01069.x
^ Vermorel, Joannes; Mohri, Mehryar (2005), Multi-armed bandit algorithms and empirical evaluation, In European Conference on Machine Learning, Springer, pp. 437-448, http://bandit.sourceforge.net/Vermorel2005poker.pdf

参考文献

Guha, S.; Munagala, K.; Shi, P. (2010), “Approximation algorithms for restless bandit problems”, Journal of the ACM 58: 1-50, arXiv:0711.3861, doi:10.1145/1870103.1870106
Dayanik, S.; Powell, W.; Yamazaki, K. (2008), “Index policies for discounted bandit problems with availability constraints”, Advances in Applied Probability 40 (2): 377-400, doi:10.1239/aap/1214950209 .
Powell, Warren B. (2007), “Chapter 10”, Approximate Dynamic Programming: Solving the Curses of Dimensionality, New York: John Wiley and Sons, ISBN 978-0-470-17155-4 .
Herbert Robbins (1952), “Some aspects of the sequential design of experiments”, Bulletin of the American Mathematical Society 58 (5): 527-535, doi:10.1090/S0002-9904-1952-09620-8 .
Sutton, Richard; Barto, Andrew (1998), Reinforcement Learning, MIT Press, ISBN 978-0-262-19398-6, オリジナルの2013-12-11時点におけるアーカイブ。, https://web.archive.org/web/20131211192714/http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html

外部リンク

MABWiser, open source Python implementation of bandit strategies that supports context-free, parametric and non-parametric contextual policies with built-in parallelization and simulation capability.
PyMaBandits, open source implementation of bandit strategies in Python and Matlab.
Contextual, open source R package facilitating the simulation and evaluation of both context-free and contextual Multi-Armed Bandit policies.
bandit.sourceforge.net Bandit project, open source implementation of bandit strategies.
Banditlib, Open-Source implementation of bandit strategies in C++.
Leslie Pack Kaelbling and Michael L. Littman (1996). Exploitation versus Exploration: The Single-State Case.
Tutorial: Introduction to Bandits: Algorithms and Theory. Part1. Part2.
Feynman's restaurant problem, a classic example (with known answer) of the exploitation vs. exploration tradeoff.
Bandit algorithms vs. A-B testing.
S. Bubeck and N. Cesa-Bianchi A Survey on Bandits.
A Survey on Contextual Multi-armed Bandits, a survey/tutorial for Contextual Bandits.
Blog post on multi-armed bandit strategies, with Python code.
Animated, interactive plots illustrating Epsilon-greedy, Thompson sampling, and Upper Confidence Bound exploration/exploitation balancing strategies.