Winning in the Casino by solving the Multi-Armed Bandit Problem

Hey upcoming data science expert,

This week's topic caught my interest when I overheard some colleagues from the machine learning department. Multi-armed bandit theory deals with a fairly old math problem, studied since at least the 1950s, that is still relevant in data science today. It is about repeatedly choosing among several options with unknown payoffs, like slot machines in a casino, so that the total reward is as high as possible. As I am not a data science person, I first had to learn the basics from this YouTube video. The paper presented here builds upon this problem and gives a general framework for solving combinatorial multi-armed bandit scenarios. To be 100% honest, I did not completely follow all the math in the paper. If any of you do, it would be awesome if you could send a small “Combinatorial Multi-Armed Bandit for Dummies” to the Telegram group. Thanks for your help!

Abstract:

We define a general framework for a large class of combinatorial multi-armed bandit (CMAB) problems, where simple arms with unknown distributions form super arms. In each round, a super arm is played and the outcomes of its related simple arms are observed, which helps the selection of super arms in future rounds. The reward of the super arm depends on the outcomes of played arms, and it only needs to satisfy two mild assumptions, which allow a large class of nonlinear reward instances. We assume the availability of an (α,β)-approximation oracle that takes the means of the distributions of arms and outputs a super arm that with probability β generates an α fraction of the optimal expected reward. The objective of a CMAB algorithm is to minimize (α,β)-approximation regret, which is the difference in total expected reward between the αβ fraction of expected reward when always playing the optimal super arm, and the expected reward of playing super arms according to the algorithm. We provide CUCB algorithm that achieves O(log n) regret, where n is the number of rounds played, and we further provide distribution-independent bounds for a large class of reward functions. Our regret analysis is tight in that it matches the bound for classical MAB problem up to a constant factor, and it significantly improves the regret bound in a recent paper on combinatorial bandits with linear rewards. We apply our CMAB framework to two new applications, probabilistic maximum coverage (PMC) for online advertising and social influence maximization for viral marketing, both having nonlinear reward structures.
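
To make the abstract a bit more concrete, here is a minimal Python sketch of how I understand the CUCB idea: keep an empirical mean for every simple arm, add an optimism bonus to each, and hand the adjusted means to the oracle, which picks the super arm to play. Everything specific here is my own assumption for illustration and not taken from the paper: the toy problem (play k out of m Bernoulli arms, reward is the sum of the outcomes), the exact top-k oracle (so α = β = 1), and the names `cucb` and `top_k_oracle`. The bonus term sqrt(3 ln t / (2 T_i)) follows my reading of the paper's algorithm, so please check it against the original before relying on it.

```python
import math
import random


def top_k_oracle(means, k):
    """Hypothetical exact oracle for the toy problem: the k arms with the largest means."""
    return sorted(range(len(means)), key=lambda i: means[i], reverse=True)[:k]


def cucb(true_means, k, rounds, seed=0):
    """Toy CUCB-style run: each round plays k Bernoulli arms and observes all of them."""
    rng = random.Random(seed)
    m = len(true_means)
    counts = [0] * m       # T_i: how often arm i has been observed so far
    estimates = [0.0] * m  # empirical mean outcome of arm i

    def play(super_arm):
        """Play every arm in the super arm, update statistics, return the summed reward."""
        reward = 0
        for i in super_arm:
            outcome = 1 if rng.random() < true_means[i] else 0
            counts[i] += 1
            estimates[i] += (outcome - estimates[i]) / counts[i]  # running mean
            reward += outcome
        return reward

    total = 0
    # Initialization: one round per arm, so every arm is observed at least once.
    for i in range(m):
        others = [j for j in range(m) if j != i][: k - 1]
        total += play([i] + others)

    # Main loop: optimistic estimates go into the oracle, the result is played.
    for t in range(m + 1, rounds + 1):
        adjusted = [
            estimates[i] + math.sqrt(3 * math.log(t) / (2 * counts[i]))
            for i in range(m)
        ]
        total += play(top_k_oracle(adjusted, k))

    return counts, estimates, total
```

Calling, for example, `cucb([0.1, 0.2, 0.5, 0.8, 0.9], k=2, rounds=2000)` should end up playing the last two arms far more often than the others, because the bonus for rarely played arms shrinks as they get observed. In the paper's setting the oracle may only be approximate and the reward may be nonlinear; this sketch leaves both out to stay short.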

Download Link:

http://proceedings.mlr.press/v28/chen13a.pdf