Bandit Algorithms

Lecture notes that I scribbled for Bandit Algorithm (Online Machine Learning).

The textbook is Bandit Algorithms by Tor Lattimore and Csaba Szepesvari.

\( \renewcommand{\vec}[1]{\boldsymbol{#1}} \DeclareMathOperator*{\E}{\mathbb{E}} \DeclareMathOperator*{\Var}{\mathrm{Var}} \DeclareMathOperator*{\Cov}{\mathrm{Cov}} \DeclareMathOperator*{\argmin}{\mathrm{arg\,min\;}} \DeclareMathOperator*{\argmax}{\mathrm{arg\,max\;}} \def\ZZ{{\mathbb Z}} \def\NN{{\mathbb N}} \def\RR{{\mathbb R}} \def\CC{{\mathbb C}} \def\QQ{{\mathbb Q}} \def\FF{{\mathbb FF}} \def\EE{{\mathbb E}} \newcommand{\tr}{{\rm tr}} \newcommand{\sign}{{\rm sign}} \newcommand{\1}{đťź™} \newcommand{\inprod}[2]{\left\langle #1, #2 \right\rangle} \newcommand{\set}[1]{\left\{#1\right\}} \require{physics} \)

Table of Contents

1 Supervised Learning #

1.1 A statistical learning framework #

  1. Domain (feature) set \(\mathcal{X}\): set of objects that we wish to label.

  2. Label set \(\mathcal{Y}=\{0,1\}\): set of outcomes.

  3. Hypothesis class \(\mathcal{H}=\{h: \mathcal{X} \to \mathcal{Y}\}\).

  4. Training data \(S=\{(x_1,y_1), (x_2,y_2), \dots, (x_n,y_n)\}, x_i \in \mathcal{X}, y_i \in \mathcal{Y}\).

  5. A simple data-generation model

    Assuming data is sampled according to an unknown distribution \(D\): \(X \sim D\). And there is a fixed but unknown function \(f: \mathcal{X}\to \mathcal{Y}\), such that \(y_i = f(x_i)\). \((D,f)\) is called “environment”.

  6. Measure of success

    Prediction error/test error:

    \[L_{(D,f)}(h) = \Pr_{X \sim D} [h(X)\neq f(X)]\]

    We want to minimize the error:

    \[\min_{h \in \mathcal{H}} L_{(D,f)} (h)\]

1.2 Empirical Risk Minimization (ERM) #

\[L_S(h)=\frac{1}{m} | \{i \in [m]: h(x_i)\neq y_i\} |\]

\[ERM_{\mathcal{H}}(S) \triangleq \argmin_{h \in \mathcal{H}} L_S(h)\]

2 Online Learning Algorithms #

Basic online classification setting:

The total loss/mistakes have an obvious bound:

\[\sum_{t=1}^{T} \1_{\{\hat{y}_t \neq y_t\}}\leq T\]

The bound \(T\) is useless, we want a better bound. To do that, we must restrict the ability of the environment. Because if the environment is an adversary, which will always reveal the true label exactly opposite to the learner’s prediction, then there is no way to do a good job. So we must assume that the true label is not arbitrary, but follows a fixed rule.

We therefore assume the “realizability condition” holds,

Realizability Condition \(\exists h^{*} \in \mathcal{H}\) such that \(h^{*}(x_t)=y_t\), for all \(t\). I.e., the “true” hypothesis is in the hypothesis class \(\mathcal{H}\). And \(\mathcal{H}\) is known to the learner.

How to quickly find the right hypothesis? There are some algorithms.

2.1 Consistent Algorithm #

The only purpose of the prediction \(p_t\) is to indicate whether the algorithm makes a mistake. The update will remove all the “wrong” hypotheses. For each iteration \(t\), if there is a mistake, at least one hypothesis is removed from \(V_t\), therefore \(|V_{t+1}| \leq |V_t| -1\). Assume after the algorithm terminates, \(M\) mistakes are made, we have \(1\leq |V_T| \leq |\mathcal{H}|-M\). So the number of mistakes has bound:

\[M \leq |\mathcal{H}|-1\]

2.2 Halving Algorithm #

If there is a mistake, at least half of the hypotheses will be discarded, so \(|V_{t+1}| \leq |V_t|/2\). After making \(M\) mistakes, we have

\[1\leq |V_{T}| \leq \frac{|\mathcal{H}|}{2^M}\]

So \(M\) has bound: \(M \leq \log_2(|\mathcal{H}|)\).

Obviously, the halving algorithm is exponentially better than the consistent algorithm.

3 Online Learnability #

Is the halving algorithm the best we can do? It depends on the complexity of \(\mathcal{H}\).

Every time we run an algorithm \(A\), the environment selects a different \(S=\{(x_1, h^{*}(x_1)), (x_2, h^{*}(x_2)), \dots, (x_n, h^{*}(x_n))\}\). Let \(M_A(S)\) denotes the number of mistakes on \(S\). Consider the worst case scenario, define

\[M_A(\mathcal{H}) = \sup_S M_A(S)\]

(The length of \(S\) is arbitrary.)

Definition We say that class \(\mathcal{H}\) is online learnable if there exists an algorithm \(A\) such that \(M_A(\mathcal{H})\leq B < \infty\), where \(B\) is a constant independent of the length of the sequence.

Obviously, under the realizability assumption, any finite \(\mathcal{H}\) is online learnable, e.g., by using the halving algorithm.

We want a lower bound of mistakes, so consider the environment as an adversary, how can it force more mistakes to the learner? This can be modeled as a full binary tree.

The environment can build a full binary tree of depth \(n\) (\(2^n-1\) nodes \(v_1, \dots, v_{2^n-1}\)) (note: this depth differs from the standard definition). Then

  1. Let \(x_1=v_1\)
  2. At round \(t\), denote \(i_t\) as the index of the current node, i.e., \(x_t = v_{i_t}\), \(i_1 = 1\). The environment chooses the left child if \(y_t=0\), and right child if \(y_t=1\).

So \(i_{t+1}=2 i_t + y_t\), we can get \(i_{t+1}=2^t+\sum_{j=1}^{t} y_j 2^{t-j}\).

For realizability, we must assume the tree is shattered.

Definition (Shattered Tree) Consider a full binary tree of depth \(d\) such that the nodes are \(v_1, \dots, v_{2^d-1} \in \mathcal{X}\). The tree is said to be shattered by \(\mathcal{H}\) if for every labeling \(\{y_1, \dots, y_d\} \in \{0,1\}^d\), there exists some \(h \in \mathcal{H}\) such that for all \(t \in [d]\) we have \(h(v_{i_t})=y_t\).

Definition (Littlestone Dimension) \(\mathrm{LDim}(\mathcal{H})\) is the maximal depth of a tree that is shattered by \(\mathcal{H}\).

Theorem \(\mathrm{LDim}(\mathcal{H}) \leq \log_2(|\mathcal{H}|)\).


Because each labelling constructs a different path, each path requires a different hypothesis \(h \in \mathcal{H}\). Therefore, the number of paths cannot exceed \(|\mathcal{H}|\), and \(\log_2(\text{number of paths}) = \mathrm{LDim}(\mathcal{H})\leq \log_2(|\mathcal{H}|)\).

Theorem Any algorithm \(A\) for learning \(\mathcal{H}\) has mistake bound \(M_A(\mathcal{H}) \geq \mathrm{LDim}(\mathcal{H})\).


The sketch of the proof is as follows: take a shattered tree and start at \(v_1\). Let \(y_t = 1- \hat{y}_{t}\) and take the path defined by \(y_1, \dots, y_d\). Along this path, there are \(d\) mistakes that are made. Because by definition, each path is consistent with an \(h \in \mathcal{H}\), there exists an \(h\) that achieves \(d\) mistakes.

3.1 Standard Optimal Algorithm #

Lemma Standard Optimal Algorithm is optimal: \(M_{SOA}(\mathcal{H})\leq \mathrm{LDim}(\mathcal{H})\).


If a mistake is made in round \(t\), then \(\mathrm{LDim}(V_{t+1})\leq \mathrm{LDim}(V_t)-1\).

3.2 Unrealizable Case #

If the true hypothesis \(h^{*} \not\in \mathcal{H}\), then we want to make as few mistakes as possible. To quantify our performance, we introduce “regret”.

For a sequence of data \((x_1,y_1), (x_2,y_2), \dots, (x_n,y_n)\), and a fixed hypothesis \(h \in \mathcal{H}\), the following formula quantifies how much is our algorithm worse than \(h\).

\[\sum_{t=1}^{n} \left| \hat{y}_t -y_t \right| - \sum_{t=1}^{n} \left| h(t)-y_t \right|\]

We consider the worst case, get the supremum so the quantity does not depend on the sequence.

\[R_A(h,n) = \sup_{(x_1,y_1)\dots(x_n,y_n)} \left[\sum_{t=1}^{n} \left| \hat{y}_t -y_t \right| - \sum_{t=1}^{n} \left| h(x_t)-y_t \right|\right]\]

We want to compare our algorithm with the best \(h\), which minimizes \(\sum_{t=1}^{n} \left| h(x_t)-y_t \right|\) (Hence maximize the whole formula). So we get the “regret”:

\[\begin{split} R_A(\mathcal{H},n) &= \sup_{(x_1,y_1)\dots(x_n,y_n)} \left[\sum_{t=1}^{n} \left| \hat{y}_t -y_t \right| - \inf_{h \in \mathcal{H}}\sum_{t=1}^{n} \left| h(x_t)-y_t \right|\right]\\ &= \sup_{h \in \mathcal{H}}\sup_{(x_1,y_1)\dots(x_n,y_n)} \left[\sum_{t=1}^{n} \left| \hat{y}_t -y_t \right| - \sum_{t=1}^{n} \left| h(x_t)-y_t \right|\right] \end{split} \]

The regret is comparing our algorithm to an algorithm that knows all the information (i.e., the best hypothesis in \(\mathcal{H}\)).

Definition We say \(\mathcal{H}\) is learnable if \(R_A(\mathcal{H}, n)\) is sublinear, i.e.,

\[\frac{R_A(\mathcal{H},n)}{n}\to 0 \text{ as } n\to \infty\]

Intuitively, this means as we learn more and more data, we will make fewer and fewer mistakes.

3.3 Covers Impossibility Result #

Is \(\mathcal{H}\) always learnable? Not always if the environment is adversarial, e.g., \(\mathcal{H}=\{h_0,h_1\}\), where \(h_0 =0, h_1=1\). If the environment can always reveal \(y_t\) exactly opposite to \(\hat{y}_t\), then \(\sum_{t=1}^{n} |\hat{y}_t-y_t| = n\). And \(\inf_{(x_1,y_1) \dots (x_n,y_n)} \sum_{t=1}^{n} |h(x_t)-y_t| \leq \frac{n}{2}\). So \(R_A(\mathcal{H},n) \geq n-\frac{n}{2} = \frac{n}{2}\), which is not sublinear.

Therefore we have to restrict the ability of the environment again. At the same time, we want to give the learner more flexibility: allowing the learner to choose probabilistic strategies. And the environment generates the true label \(y_t\) before the learner makes predictions (so the environment cannot always reveal the opposite labels).

In summary:

  1. Learner’s predictions are probabilistic.
  2. Adversary does not know the actual value of predicted labels.

Since the predictions \(\hat{y}_t\) is randomized, we need to redefine the regret, i.e., the expected regret.

Let \(p_t\) be the probability that the learner chooses label 1: \(p_t = \Pr[\hat{y}_t=1]\).

\[\begin{split} R_A(\mathcal{H},n) &= \sup_{h \in \mathcal{H}}\sup_{(x_1,y_1)\dots(x_n,y_n)} \E \left[\sum_{t=1}^{n} \left| \hat{y}_t -y_t \right| - \sum_{t=1}^{n} \left| h(x_t)-y_t \right|\right]\\ &=\sup_{h \in \mathcal{H}}\sup_{(x_1,y_1)\dots(x_n,y_n)} \E \left[\sum_{t=1}^{n} \left| \hat{y}_t -y_t \right| \right] - \sum_{t=1}^{n} \left| h(x_t)-y_t \right|\\ &=\sup_{h \in \mathcal{H}}\sup_{(x_1,y_1)\dots(x_n,y_n)} \E \left[\sum_{t=1}^{n} \1_{\{\hat{y}_t \neq y_t\}} \right] - \sum_{t=1}^{n} \left| h(x_t)-y_t \right|\\ &=\sup_{h \in \mathcal{H}}\sup_{(x_1,y_1)\dots(x_n,y_n)} \sum_{t=1}^{n} \Pr[\hat{y}_t \neq y_t] - \sum_{t=1}^{n} \left| h(x_t)-y_t \right|\\ &=\sup_{h \in \mathcal{H}}\sup_{(x_1,y_1)\dots(x_n,y_n)} \sum_{t=1}^{n} |p_t-y_t| - \sum_{t=1}^{n} \left| h(x_t)-y_t \right|\\ \end{split} \]

Comparing to the original regret definition, we only change \(\hat{y}_t \in \{0,1\}\) to \(p_t \in [0,1]\).

Now we can say under these assumptions, \(\mathcal{H}\) is always learnable, i.e., there exists an algorithm that can learn \(\mathcal{H}\).

Theorem For every hypothesis class \(\mathcal{H}\), there exists an algorithm for online classification, whose predictions come from \([0,1]\), such that \(\forall h \in \mathcal{H}\),

\[\sum_{t=1}^{n} |p_t-y_t| - \sum_{t=1}^{n} \left| h(x_t)-y_t \right| \leq \sqrt{2\cdot \min \{\log(|\mathcal{H}|), \mathrm{LDim}(\mathcal{H}) \log (e n)\} n}\]

Note that this bound is about \(O(\sqrt{n})\), which is sublinear.

4 Weighted Majority #

Predictions with expert advice.

Assume we have \(d\) experts. At round \(t\), the environment assigns losses to experts \(\vec{v}_t \in [0,1]^d\), where \(v_{ti}\) is the loss assigned to export \(i\). The learner assigns weights (probabilities) to experts \(\vec{w}_t \in [0,1]^d\), where \(w_{ti}\) is the weight assigned to expert \(i\), such that \(\sum_{i=1}^{d} w_{ti} = 1\).

Note that the losses of all experts are revealed (known by the learner). This is called the “full information” case, which is different from the bandit setting, where the learner only learns the outcome of the expert he follows.

The expected loss for round \(t\) is \(\sum_{i=1}^{d} w_{ti} v_{ti} = \langle \vec{w}_t,\vec{v}_t\rangle\). The expected loss after \(n\) rounds is \(\sum_{t=1}^{n} \langle \vec{w}_t,\vec{v}_t\rangle\). If the leaner always choose expert \(i\), the loss after \(n\) rounds is \(\sum_{t=1}^{n} v_{ti}\).

We define our regret as:

\[\sum_{t=1}^{n} \langle \vec{w}_t,\vec{v}_t\rangle - \min_{i \in [N]} \sum_{t=1}^{n} v_{ti}\]

We want to minimize the regret. One classical algorithm for this is called “weighted majority”:

If \(\vec{w}_t\) is chosen from this algorithm, then the regret has bound:

\[\sum_{t=1}^{n} \langle \vec{w}^{(t)},\vec{v}_t\rangle - \min_{i \in [d]} \sum_{t=1}^{n} v_{ti} \leq \sqrt{2 \log(d)n}\]

Using this algorithm on the online classification problem, we can prove that the regret is sublinear.


We only consider the case when \(\mathcal{H}\) is finite (infinite case is a bit involved), \(|\mathcal{H}|=d\). \(\mathcal{H} = \{h_1, \dots, h_d\}\). And define

\[v_{ti} = |h_i(x_t)-y_t|\]

\[p_t = \sum_{i=1}^{d} w_i^{(t)} h_i(x_t)\]


\[\begin{split} \left| p_t-y_t \right| &= \left| \sum_{i=1}^{d} w_i^{(t)}h_i(x_t)-y_t \right|\\ &= \left| \sum_{i=1}^{d} w_i^{(t)} \left(h_i(x_t)-y_t\right) \right|\\ &= \sum_{i=1}^{d} w_i^{(t)} \left|h_i(x_t)-y_t\right|\\ &= \sum_{i=1}^{d} w_i^{(t)} v_{ti}\\ &= \langle \vec{w}^{(t)}, \vec{v}_t \rangle \end{split} \]

The third equality is because all the terms in \(| \cdot |\) are either all positive or all negative, so the absolute value and summation can be exchanged.

Next we proof the regret bound of weighted majority algorithm.

Theorem If \(n\geq 2\log(d)\) then the weighted majority algorithm has regret bounds:

\[\sum_{t=1}^{n} \langle \vec{w}^{(t)},\vec{v}_t\rangle - \min_{i \in [d]} \sum_{t=1}^{n} v_{ti} \leq \sqrt{2 \log(d)n}\]


\[\begin{split} \log \frac{z_{t+1}}{z_t} &= \log \frac{\sum_{i=1}^{d} \tilde{w}_i^{(t+1)}}{z_t}\\ &=\log \frac{\sum_{i=1}^{d} \tilde{w}_i^{(t)} e^{-\eta v_{ti}}}{z_t}\\ &=\log \sum_{i=1}^{d} w_i^{(t)} e^{-\eta v_{ti}}\\ &\leq \log \sum_{i=1}^{d} w_i^{(t)} \left( 1-\eta v_{ti} + \frac{\eta^2 v_{ti}^2}{2} \right)\\ &= \log \left( 1- \sum_{i=1}^{d} w_i^{(t)} \left( \eta v_{ti} - \frac{\eta^2 v_{ti}^2}{2} \right) \right)\\ &\leq \log e^{- \sum_{i=1}^{d} w_i^{(t)} \left( \eta v_{ti} - \frac{\eta^2 v_{ti}^2}{2} \right)}\\ &= - \sum_{i=1}^{d} w_i^{(t)} \left( \eta v_{ti} - \frac{\eta^2 v_{ti}^2}{2} \right) \end{split} \]

The two inequalities are the right and left side of the following fact:

\[1-x \leq e^{-x} \leq 1-x+\frac{x^2}{2}, \quad \forall x \in [0,1]\]

\(0\leq\eta v_{ti} \leq 1\) because \(\eta = \sqrt{\frac{2\log(d)}{n}} \leq 1\) and \(0\leq v_{ti}\leq 1\).

Also \(\sum_{i=1}^{d} w_i^{(t)} \left( \eta v_{ti} - \frac{\eta^2 v_{ti}^2}{2} \right) \in [0,1]\).


\[\begin{split} &\log \frac{z_{t+1}}{z_t} \\ \leq& - \sum_{i=1}^{d} w_i^{(t)} \left( \eta v_{ti} - \frac{\eta^2 v_{ti}^2}{2} \right)\\ =& -\eta \langle \vec{w}^{(t)}, \vec{v}_t \rangle + \frac{\eta^2}{2} \underbrace{\sum_{i=1}^{d} w_i^{(t)} v_{ti}^2}_{\leq 1}\\ \leq& -\eta \langle \vec{w}^{(t)}, \vec{v}_t \rangle + \frac{\eta^2}{2} \end{split} \]


\[\begin{split} \sum_{t=1}^{n} \log \frac{z_{t+1}}{z_t} \leq -\eta \sum_{t=1}^{n} \langle \vec{w}^{(t)}, \vec{v}_t \rangle + n\frac{\eta^2}{2} \end{split} \]


\[\sum_{t=1}^{n} \log \frac{z_{t+1}}{z_t} = \log \frac{z_{n+1}}{z_1} = \log z_{n+1} - \log z_1 = \log z_{n+1} - \log d\]


\[\log z_{n+1} \leq -\eta \sum_{t=1}^{n} \langle \vec{w}^{(t)}, \vec{v}_t \rangle + n\frac{\eta^2}{2} + \log d\]

Now compute a lower bound for \(\log z_{n+1}\).

\[\begin{split} \log z_{n+1} &= \log \left( \sum_{i=1}^{d} \tilde{w}_i^{(n+1)} \right)\\ &= \log \left( \sum_{i=1}^{d} \tilde{w}_i^{(n)} e^{-\eta v_{ni}} \right)\\ &= \log \left( \sum_{i=1}^{d} e^{-\eta \sum_{t=1}^{n} v_{ti}} \right)\\ &\geq \log \left( \max_i e^{-\eta \sum_{t=1}^{n} v_{ti}} \right)\\ &= \log \left( e^{- \min_i \eta \sum_{t=1}^{n} v_{ti}} \right)\\ &= - \eta\min_i \sum_{t=1}^{n} v_{ti} \end{split} \]

Therefore we get

\[- \eta\min_i \sum_{t=1}^{n} v_{ti} \leq -\eta \sum_{t=1}^{n} \langle w^{(t)}, v_t \rangle + n\frac{\eta^2}{2} + \log d\]

Rearrange and get

\[\sum_{t=1}^{n} \langle \vec{w}^{(t)}, \vec{v}_t \rangle - \min_i \sum_{t=1}^{n} v_{ti} \leq n\frac{\eta}{2} + \frac{\log d}{\eta}\]

The LHS is exactly what we want. By minimizing the RHS with respect to \(\eta\), we get the RHS.

5 Bandit Setting #

Different from the above full information case, in bandit setting the learner can only observe the loss incurred from the action he played.


Example of bandit setting: playing a bandit machine in a casino, you can only see whether you win or lose on the machine you play, but not other machines.

Example of full information setting: buying stocks in a stock market, you can see the ups and downs of every stock.

  1. \(K\) actions, denoted as \([K]=\{1,2, \dots, K\}\).
  2. The environment assigns loss to each action \(\vec{x}_t \in [0,1]^K\).
  3. In total \(n\) rounds, the player selects action \(I_t \in [K]\) at round \(t\).
  4. The player observe the loss incurred by playing action \(I_t\) at round \(t\), denoted by \(x_{t, I_t}\).

The player selects actions based on his past observations. Denote \(H_t = \{I_1, x_{1,I_1}, I_2, x_{2,I_2}, \dots, I_{t-1}, x_{t-1, I_{t-1}}\}\) as the history till round \(t\). The player’s policy is to choose an action in round \(t\) given \(H_t\). So we define a policy as a probability distribution that the player comes up with in each round:

\[\pi_t(i) = \Pr[I_t=i \mid H_t]\]

We define the regret of a policy \(\pi\) given the loss vector sequence \(\{\vec{x}_t\}_{t=1}^n\) for each round:

\[R(n, \pi, \{\vec{x}_t\}_{t=1}^n) = \sum_{t=1}^{n} x_{t,I_t} - \min_{i \in [K]} \sum_{t=1}^{n} x_{t,i}\]

Since \(I_t\) is random (governed by the policy \(\pi\)), so the regret \(R(n,\pi)\) is a random quantity. When considering its bound, we may be interested in its expectation, called expected regret.

\[\begin{split} \E[R(n, \pi, \{\vec{x}_t\}_{t=1}^n)] &= \E \left[ \sum_{t=1}^{n} x_{t,I_t} \right] - \E \left[ \min_{i \in [K]} \sum_{t=1}^{n} x_{t,i} \right]\\ &= \E \left[ \sum_{t=1}^{n} x_{t,I_t} \right] - \min_{i \in [K]} \sum_{t=1}^{n} x_{t,i} \end{split} \]

The expectation is with respect to the randomness of the arm selection \(I_t\). The second expectation can be removed because here we assume the sequence \(\{\vec{x}_t\}_{t=1}^n\) is given, so there is no randomness in the second term.

To bound the expected regret, we need to consider the supremum with respect to all possible loss sequences:

\[\sup_{(\vec{x}_1, \dots, \vec{x}_n)} \E[R(n, \pi, \{\vec{x}_t\}_{t=1}^n)]\]

Now assume the loss vector sequence is not given but is stochastic, i.e., it follows some probability distribution. Now it becomes a sequence of random vectors \(\{\vec{X}_t\}_{t=1}^n\). The expected regret is defined as:

\[\E[R(n, \pi)] = \E \left[ \sum_{t=1}^{n} X_{t,I_t} \right] - \E \left[ \min_{i \in [K]} \sum_{t=1}^{n} X_{t,i} \right]\]

The expectation is with respect to the randomness of both the policy of the learner (arm selection) and the adversary (loss sequence generation).

The regret definition is too stringent, so people also define pseudo-regret:

\[\E[\overline{R}(n, \pi)] = \E \left[ \sum_{t=1}^{n} X_{t,I_t} \right] - \min_{i \in [K]}\E \left[ \sum_{t=1}^{n} X_{t,i} \right]\]

It is easier to bound the expected pseudo-regret because it is smaller than the actual expected regret. It is easy to see by using Jensen’s Inequality:

\[\E \left[ \min_{i \in [K]} \sum_{t=1}^{n} X_{t,i} \right] \leq \min_{i \in [K]} \E \left[ \sum_{t=1}^{n} X_{t,i} \right]\]

6 Adversarial Bandit Setting #

6.1 Importance sampling #

Select actions according to the distribution \(\vec{P}_t = (P_{t,1}, \dots, P_{t, K})\). And assume again the loss vector sequence is given.

For all \(j \in [K]\), define an estimator for loss values when playing arm \(j\) in round \(t\):

\[\hat{X}_{t,j} = \frac{x_{t,j}}{P_{t,j}} \1_{\{I_t=j\}}\]

The estimator is unbiased because \(\E[\hat{X}_{t,j}] = \frac{x_{t,j}}{P_{t,j}} P_{t,j} = x_{t,j}\).

We can use these estimates as the true loss so to update the weights.

6.2 Exp3 (Exponential Weights for Exploration and Exploitation) Algorithm #


  1. If Exp3 is run with \(\eta = \eta_t = \sqrt{\frac{2\log K}{nK}}\) (a constant), then

    \[\overline{R}(n, \text{Exp3}) \leq \sqrt{2nK \log K}\]

  2. If \(\eta_t = \sqrt{\frac{\log K}{tK}}\), then

    \[\overline{R}(n,\text{Exp3}) \leq 2 \sqrt{nK \log K}\]

In the first case, you need to know the number of rounds \(n\) in prior. In the second case, you do not, but you can stop at any round and give the bound. The second case is called “anytime algorithm”.


There are usually two ways to transform an algorithm into anytime algorithm:

  1. Doubling Trick: This is almost a standard technique. We can use it whenever we have an algorithm with a regret of order \(O(n^{\alpha})\) for some \(\alpha>0\) with a known horizon \(n\) to turn it into an algorithm with a regret \(O(n^{\alpha})\) for all \(n\geq1\).
  2. Using time-varying parameters \(\eta_t\), just like the above. The analysis will be less straightforward.


Assume the number of rounds \(n\) is known. Recall the weighted majority algorithm, the regret bound is \(\sqrt{2n \log K}\). The Exp3 algorithm’s regret bound is larger than this by a factor of \(\sqrt{K}\).

Because in weighted majority, in each round we know the loss for all the \(K\) arms. But in Exp3, we only know the loss of one arm. Intuitively, by playing \(Kn\) rounds in Exp3, we may get the same amount of information as playing \(n\) rounds in weighted majority. So by replacing \(n\) by \(nK\), we get the bound for Exp3.


Claim: we will show that

\[\overline{R}(n, Exp3) \leq \frac{K}{2} \sum_{t=1}^{n} \eta_t + \frac{\ln K}{\eta_{n}}\]

(1) By plugging in \(\eta_t= \sqrt{\frac{2\log K}{nK}}\), we can get the corresponding bound (easy to verify).

(2) By plugging in \(\eta_t = \sqrt{\frac{\log K}{tK}}\), and using the following inequality, we can get the corresponding bound.

\[\sum_{t=1}^n \frac{1}{\sqrt{t}} \leq \int_{0}^{n} \frac{1}{\sqrt{t}} \dd{t} = 2 \sqrt{t} \Bigr\rvert^n_0 = 2 \sqrt{n}\]

Now we proof the claim.

First proof some useful equalities:

\[\E_{I_t \sim \vec{P}_t} \left[ \hat{l}_{t,i} \right] = \frac{l_{t,i}}{P_{t,i}} P_{t,i} = l_{t,i}\]

\[\E_{i \sim \vec{P}_t} \left[ \hat{l}_{t,i} \right] = \sum_{i=1}^{K} \frac{l_{t,i}}{P_{t,i}} \1_{\{I_t=i\}} P_{t,i} = l_{t,I_t}\]

Step 1: Consider the benchmark that always plays arm \(k\).

\[\sum_{t=1}^n l_{t, I_t} - \sum_{t=1}^{n} l_{t,k} = \sum_{t=1}^{n} \E_{i \sim \vec{P}_t} \left[ \hat{l}_{t,i} \right] - \sum_{t=1}^{n} \E_{I_t \sim \vec{P}_t} \left[ \hat{l}_{t,k} \right]\]

One can verify that

\[\begin{split} \E_{i \sim \vec{P}_t} \left[ \hat{l}_{t,i} \right] &= \frac{1}{\eta_{t}} \ln \left( \E_{i \sim \vec{P}_t} \exp \left( -\eta_t \left( \hat{l}_{t,i} - \E_{k \sim \vec{P}_t} \hat{l}_{t,k} \right) \right) \right) \quad \text{(i)}\\ &- \frac{1}{\eta_{t}} \ln \left( \E_{i \sim \vec{P}_t} \exp \left( -\eta_t \hat{l}_{t,i} \right) \right) \quad \text{(ii)} \end{split} \]

Step 2: Bound (i)

\[\begin{split} &\ln \left( \E_{i \sim \vec{P}_t} \exp \left( -\eta_t \left( \hat{l}_{t,i} - \E_{k \sim \vec{P}_t} \hat{l}_{t,k} \right) \right) \right)\\ =& \ln \left( \E_{i \sim \vec{P}_t} \left[\exp \left( -\eta_t \hat{l}_{t,i} \right)\right] \right) + \eta_t \E_{k \sim \vec{P}_t} \left[ \hat{l}_{t,k} \right]\\ \leq& \E_{i \sim \vec{P}_t} \left[\exp \left( -\eta_t \hat{l}_{t,i} \right)\right] - 1 + \eta_t \E_{k \sim \vec{P}_t} \left[ \hat{l}_{t,k} \right]\\ =& \E_{i \sim \vec{P}_t} \left[\exp \left( -\eta_t \hat{l}_{t,i} \right) - 1 + \eta_t \hat{l}_{t,i}\right]\\ \leq& \E_{i \sim \vec{P}_t} \left[ \frac{\eta_t^2 \hat{l}_{t,i}^2}{2} \right]\\ =& \frac{\eta_t^2}{2} \sum_{i=1}^{K} \hat{l}_{t,i}^2 P_{t,i}\\ =& \frac{\eta_t^2}{2} \sum_{i=1}^{K} \left( \frac{l_{t,i} \1_{\{I_t=i\}}}{P_{t,i}} \right)^2 P_{t,i}\\ =& \frac{\eta_t^2}{2} \frac{l_{t,I_t}^2}{P_{t,I_t}}\\ \leq& \frac{\eta_t^2}{2} \frac{1}{P_{t,I_t}} \end{split} \]

The first two inequalities are \(\ln(x) \leq x-1\) and \(e^{-x} \leq 1-x + \frac{x^2}{2}\) respectively. The third inequality is because the loss \(l_{t,I_t} \in [0,1]\).

Step 3: Bound (ii)

\[\begin{split} &-\frac{1}{\eta_{t}} \ln \left( \E_{i \sim \vec{P}_t} \left[ \exp \left( -\eta_t \hat{l}_{t,i} \right) \right] \right)\\ =& -\frac{1}{\eta_{t}} \ln \left( \sum_{i=1}^{K} \exp \left( -\eta_t \hat{l}_{t,i} \right) P_{t,i} \right)\\ =& -\frac{1}{\eta_{t}} \ln \left( \sum_{i=1}^{K} \exp \left( -\eta_t \hat{l}_{t,i} \right) \frac{\exp \left( -\eta_{t} \hat{L}_{t-1,i} \right)}{\sum_{k=1}^{K} \exp \left( -\eta_{t} \hat{L}_{t-1,k} \right)} \right)\\ =& -\frac{1}{\eta_{t}} \ln \left(\frac{\sum_{i=1}^{K} \exp \left( -\eta_t \hat{L}_{t,i} \right)}{\sum_{k=1}^{K} \exp \left( -\eta_t \hat{L}_{t-1,k} \right)} \right)\\ =& \Phi_{t-1}(\eta_t)-\Phi_t(\eta_t) \end{split} \]

where \(\Phi_t(\eta) = \frac{1}{\eta} \log \left( \frac{1}{K} \sum_{i=1}^{K} \exp \left( -\eta \hat{L}_{t,i} \right) \right)\).

Now we get

\[\sum_{t=1}^n l_{t, I_t} - \sum_{t=1}^{n} l_{t,k} \leq \sum_{t=1}^{n} \frac{\eta_t}{2} \frac{1}{P_{t,I_t}} + \sum_{t=1}^{n} \Phi_{t-1}(\eta_t)-\sum_{t=1}^{n} \Phi_t(\eta_t) - \sum_{t=1}^{n} \E_{I_t \sim \vec{P}_t} \left[ \hat{l}_{t,k} \right]\]

We are interested in the expected regret, so we get

\[\begin{split} &\E \left[\sum_{t=1}^n l_{t, I_t} - \sum_{t=1}^{n} l_{t,k} \right]\\ \leq& \E \left[\sum_{t=1}^{n}\frac{\eta_t}{2P_{t,I_t}} \right] + \E \left[\sum_{t=1}^{n} \Phi_{t-1}(\eta_t)- \Phi_t(\eta_t)\right] - \E \left[\sum_{t=1}^{n} \E_{I_t \sim \vec{P}_t} \left[ \hat{l}_{t,k} \right]\right] \end{split} \]

Do the computation for each term:

In the first term, the expectation is respect to both the adversary and the learner, so we can write it as a conditional expectation. The inner expectation is conditioned on the adversary, so the learner can compute \(\vec{P}_t\) and do the arm selection.

\[\begin{split} &\E \left[\sum_{t=1}^{n}\frac{\eta_t}{2P_{t,I_t}} \right]\\ =& \E \left[\sum_{t=1}^{n} \E_{I_t \sim \vec{P}_t} \left[\frac{\eta_t}{2P_{t,I_t}}\right] \right]\\ =& \E \left[\sum_{t=1}^{n} \E_{I_t \sim \vec{P}_t} \left[\frac{\eta_t}{2P_{t,I_t}}\right] \right]\\ =& \E \left[\sum_{t=1}^{n} \sum_{j=1}^{K} \frac{\eta_t}{2P_{t,j}} P_{t,j}\right]\\ =& \E \left[ \sum_{t=1}^{n} \frac{\eta_t K}{2} \right]\\ =& \frac{K}{2}\sum_{t=1}^{n} \eta_t \end{split} \]

For the second term, first we do not consider the expectation:

\[\begin{split} &\sum_{t=1}^{n} \Phi_{t-1}(\eta_t)- \Phi_t(\eta_t)\\ =& \Phi_0(\eta_1) - \Phi_1(\eta_1) + \Phi_1(\eta_2) - \Phi_2(\eta_2) + \dots + \Phi_{n-1}(\eta_n) - \Phi_n(\eta_n)\\ =& \sum_{t=1}^{n-1} \left[ \Phi_t(\eta_{t+1}) - \Phi_t(\eta_t) \right] + \Phi_0(\eta_1) - \Phi_n(\eta_n)\\ =& \sum_{t=1}^{n-1} \left[ \Phi_t(\eta_{t+1}) - \Phi_t(\eta_t) \right] - \Phi_n(\eta_n) \end{split} \]

where \[\begin{split} -\Phi_n(\eta_n) &= \frac{\log K}{\eta_n} - \frac{1}{\eta_{n}} \log \left( \sum_{i=1}^{K} \exp \left( -\eta_n \hat{L}_{n,i} \right) \right)\\ &\leq \frac{\log K}{\eta_n} - \frac{1}{\eta_{n}} \log \left( \exp \left( -\eta_n \hat{L}_{n,k} \right) \right)\\ &= \frac{\log K}{\eta_n} + \sum_{t=1}^{n} \hat{l}_{t,k} \end{split} \]

So considering the expectation, the same as above, the expectation of \(\hat{l}_{t,k}\) can be divided into two parts:

\[\begin{split} \E \left[\sum_{t=1}^{n} \Phi_{t-1}(\eta_t)- \Phi_t(\eta_t)\right] &\leq \E \left[ \sum_{t=1}^{n-1} \left[ \Phi_t(\eta_{t+1}) - \Phi_t(\eta_t) \right] \right] + \frac{\log K}{\eta_n} + \sum_{t=1}^{n} \E \left[ \hat{l}_{t,k} \right]\\ &= \E \left[ \sum_{t=1}^{n-1} \left[ \Phi_t(\eta_{t+1}) - \Phi_t(\eta_t) \right] \right] + \frac{\log K}{\eta_n} + \E \left[\sum_{t=1}^{n} \E_{I_t \sim \vec{P}_t} \left[ \hat{l}_{t,k} \right]\right] \end{split} \]

Putting all of these together, we get

\[\begin{split} &\E \left[\sum_{t=1}^n l_{t, I_t} - \sum_{t=1}^{n} l_{t,k} \right]\\ \leq& \frac{K}{2}\sum_{t=1}^{n} \eta_t + \E \left[ \sum_{t=1}^{n-1} \left[ \Phi_t(\eta_{t+1}) - \Phi_t(\eta_t) \right] \right] + \frac{\log K}{\eta_n} + \E \left[\sum_{t=1}^{n} \E_{I_t \sim \vec{P}_t} \left[ \hat{l}_{t,k} \right]\right] - \E \left[\sum_{t=1}^{n} \E_{I_t \sim \vec{P}_t} \left[ \hat{l}_{t,k} \right]\right]\\ =& \frac{K}{2}\sum_{t=1}^{n} \eta_t + \E \left[ \sum_{t=1}^{n-1} \left[ \Phi_t(\eta_{t+1}) - \Phi_t(\eta_t) \right] \right] + \frac{\log K}{\eta_n} \end{split} \]

Now we only need to show

\[\E \left[ \sum_{t=1}^{n-1} \left[ \Phi_t(\eta_{t+1}) - \Phi_t(\eta_t) \right] \right]\leq 0\]

Since \(\eta_{t+1}\leq \eta_t\), we only need to show \(\Phi_t(\eta)\) is increasing by taking the derivative (skipped).

6.3 Exp3.P and Exp3.IX #

7 Online Convex Optimization #

The setup we consider:

  1. Input: a convex set \(S \subset \RR^d\).
  2. For each round \(t=1,2, \dots\):
    1. Predict vector \(\vec{w}_t \in S\).
    2. Receive convex loss function \(f_t: S \to \RR\).
    3. Suffer loss \(f_t(w_t)\).

The regret is defined as

\[R(n,\pi) = \sum_{t=1}^{n} f_t(\vec{w}_t) - \min_{\vec{u} \in U} \sum_{t=1}^{n} f_t(\vec{u})\]

where \(U \subset S\).


For each round, the learner knows \(f_t\), not just the loss.

The full information setting can be fit into this online convex optimization setting.

Let \(S = \{\vec{w} \in \RR^d: w_i \geq0, \sum_{i=1}^{d} w_i=1\}\), where \(d\) is the number of experts. Then define \(f_t\) as the loss:

\[f_t(\vec{w}) = \langle \vec{w}, \vec{v}_t \rangle\]

Note that \(f_t\) is indeed convex (since it is linear).

Let \(U = \{\vec{e}_1,\vec{e}_2, \dots, \vec{e}_d\} \subset S\), so \(f_t(\vec{e}_i) = v_{t,i}\). The regret becomes:

\[R(n,\pi) = \sum_{t=1}^{n} \langle \vec{w}_t, \vec{v}_t \rangle - \min_{i \in [d]} \sum_{t=1}^{n} v_{t,i}\]

This is exactly the regret we defined for the full information setting. So the full information setting is a special case of online convex optimization problems.

Now we focus on the online convex optimization setting and give some algorithms.

7.1 Follow the Leader Algorithm (FTL) #

For round \(t\), since \(f_1, f_2, \dots, f_{t-1}\) are known, the learner simply choose \(\vec{w}_t\) which minimizes the current observations (assuming \(U=S\)):

\[\vec{w}_t = \argmin_{\vec{w} \in S} \sum_{i=1}^{t-1} f_i(\vec{w})\]

Next we analyze the regret of FTL.

Lemma Let \(w_1, w_2, \dots\) be the sequence produced by FTL, then for all \(\vec{u} \in S\), we have

\[R(n, \text{FTL}, \vec{u}) = \sum_{t=1}^{n} f_t(\vec{w}_t) - \sum_{t=1}^{n} f_t(\vec{u}) \leq \sum_{t=1}^{n} f_t(\vec{w}_t) - \sum_{t=1}^{n} f_t(\vec{w}_{t+1})\]


We will show

\[\sum_{t=1}^{n} f_t(\vec{w}_{t+1}) \leq \sum_{t=1}^{n} f_t(\vec{u})\]

For \(n=1\), we need to show \(f_1(\vec{w}_2) \leq f_1(\vec{u})\). By definition of \(\vec{w}_t\), this holds.

Assume the inequality holds for \(n-1\) rounds:

\[\sum_{t=1}^{n-1} f_t(\vec{w}_{t+1}) \leq \sum_{t=1}^{n-1} f_t(\vec{u})\]

Add \(f_n(\vec{w}_{n+1})\) on both sides, we have

\[\sum_{t=1}^{n} f_t(\vec{w}_{t+1}) \leq f_n(\vec{w}_{n+1}) + \sum_{t=1}^{n-1} f_t(\vec{u})\]

Since this holds for all \(\vec{u}\), we choose \(\vec{u}=\vec{w}_{n+1}\), so

\[\sum_{t=1}^{n} f_t(\vec{w}_{t+1}) \leq f_n(\vec{w}_{n+1}) + \sum_{t=1}^{n-1} f_t(\vec{w}_{n+1}) = \sum_{t=1}^{n} f_t(\vec{w}_{n+1})\]

By definition of \(\vec{w}_{n+1}\),

\[\sum_{t=1}^{n} f_t(\vec{w}_{n+1}) = \min_{\vec{u} \in S} \sum_{t=1}^{n} f_t(\vec{u})\]


\[\sum_{t=1}^{n} f_t(\vec{w}_{t+1}) \leq \min_{\vec{u} \in S} \sum_{t=1}^{n} f_t(\vec{u})\]


So if we want to bound the regret of FTL, we can simply bound the RHS of the above lemma.

Next we compute the bound when \(f_t(\vec{w}) = \frac{1}{2} \| \vec{w}-\vec{v}_t\|_2^2\). By definition, we have

\[\vec{w}_t = \argmin_{\vec{w} \in S} \sum_{i=1}^{t-1} \frac{1}{2} \| \vec{w}-\vec{v}_i\|_2^2\]

Taking the derivative, we get \(\vec{w}_t\) is the average of all \(\vec{v}_i\).

\[\vec{w}_t = \frac{1}{t-1} \sum_{i=1}^{t-1} \vec{v}_i\]


\[\vec{w}_{t+1} = \frac{1}{t} \sum_{i=1}^{t} \vec{v}_i = \left( 1-\frac{1}{t} \right) \vec{w}_t + \frac{1}{t} \vec{v}_t\]


\[\vec{w}_{t+1} - \vec{v}_t = \left( 1-\frac{1}{t} \right) (\vec{w}_t-\vec{v}_t)\]


\[\begin{split} f_t(\vec{w}_t)-f_t(\vec{w}_{t+1}) &= \frac{1}{2} \| \vec{w}_t-\vec{v}_t\|_2^2 - \frac{1}{2} \| \vec{w}_{t+1}-\vec{v}_t\|_2^2\\ &= \frac{1}{2} \left( 1- \left( 1- \frac{1}{t}\right)^2 \right) \| \vec{w}_t-\vec{v}_t\|_2^2\\ &= \frac{1}{2} \left( \frac{2}{t} -\frac{1}{t^2} \right) \| \vec{w}_t-\vec{v}_t\|_2^2\\ &\leq \frac{1}{t} \| \vec{w}_t-\vec{v}_t\|_2^2\\ &\leq \frac{1}{t} \left(\| \vec{w}_t\|_2+\|\vec{v}_t\|_2 \right)^2\\ \end{split} \]

We can assume the norm of the loss vector is bounded, i.e., \(\|\vec{v}_t\|_2^2 \leq L\). And because \(\vec{w}_t\) is the average of \(\vec{v}_t\), we also have \(\|\vec{w}_t\|_2^2 \leq L\). So we have

\[\begin{split} &f_t(\vec{w}_t)-f_t(\vec{w}_{t+1})\\ \leq& \frac{1}{t} \left(\| \vec{w}_t\|_2+\|\vec{v}_t\|_2 \right)^2\\ \leq& \frac{4}{t} L \end{split} \]

Therefore we have the bound:

\[\sum_{t=1}^{n} f_t(\vec{w}_t) - \sum_{t=1}^{n} f_t(\vec{w}_{t+1}) \leq \sum_{t=1}^{n} \frac{4}{t} L \leq 4L (\log(n)+1)\]

Note that this is sublinear in \(n\). This is just an example when the loss function is quadratic. In fact, if we just apply FTL algorithm on the loss function \(f_t(\vec{w}) = \langle \vec{w}, \vec{v}_t \rangle\), we will not get a good bound. We will actually get a linear bound.

7.2 Follow the Regularized Leader (FoReL) #

Now consider \(f_t(\vec{w}) = \langle \vec{w}, \vec{v}_t \rangle\), we will show that by blindly applying FTL, we may get bad performance.

For example, let \(S=[-1,1]\) (1 dimension), now \(f_t(w) = w v_t\). And let the loss be

\[v_t = \left\{ \begin{array}{[email protected]{\quad:\quad}l} -0.5 & \text{for } t=1\\ 1 & \text{if } t \text{ is even}\\ -1 & \text{if } t \text{ is odd} \end{array}\right. \]

It is easy to verify that

\[w_2 = \argmin_{w \in [-1,1]} wv_1 = \argmin_{w \in [-1,1]} -0.5w = 1\]

And similarly \(w_3 = -1\), \(w_4=1\), \(w_5=-1\), etc.

The regret is

\[\begin{split} &\sum_{t=1}^{n} f_t(w_t) - \min_{u \in [-1,1]}\sum_{t=1}^{n} f_t(u)\\ =& f_1(w_1) + (n-1) - \min_{u \in [-1,1]} u \sum_{t=1}^{n} v_t\\ =& f_1(w_1) + (n-1) + 0.5 = O(n) \end{split} \]

Why FTL does not work for linear loss function

For quadratic loss function, \(\vec{w}_t\) is the average of \(\vec{v}_i\), so it does not change abruptly. Whereas for the linear loss function, the \(w_t\) can be fluctuated as we showed above.

So the FTL algorithm is not stable for the linear loss function.

Therefore, the weights should not change abruptly. We can do this by adding a regularization term.

Follow the Regularized Leader (FoReL)

\[\vec{w}_t = \argmin_{\vec{w} \in S} \sum_{i=1}^{t-1} f_i(\vec{w}) + R(\vec{w})\]

where \(R(\vec{w})\) is the regularizer.

Next, we consider \(f_t(\vec{w}) = \langle \vec{w}, \vec{v}_t \rangle\), and \(R(\vec{w}) = \frac{1}{2\eta} \|\vec{w}\|_2^2\).

By taking the derivative, we get

\[\vec{w}_t = -\eta \sum_{i=1}^{t-1} \vec{v}_i = \vec{w}_{t-1} - \eta \vec{v}_{t-1}\]

So this time, the weights do not change drastically, but depend on the previous update.

Actually, this is exactly gradient descent:

\[\vec{w}_t = \vec{w}_{t-1} -\eta \gradient f_{t-1}(\vec{w})\]

Next we proof the regret bound of FoReL algorithm. First we need a similar Lemma as above.

Lemma Let \(w_1, w_2, \dots\) be the sequence produced by FoReL, then for all \(\vec{u} \in S\), we have

\[R(n, \text{FoReL}, \vec{u}) = \sum_{t=1}^{n} f_t(\vec{w}_t) - \sum_{t=1}^{n} f_t(\vec{u}) \leq R(\vec{u})-R(\vec{w}_1) + \sum_{t=1}^{n} f_t(\vec{w}_t) - \sum_{t=1}^{n} f_t(\vec{w}_{t+1})\]


If we regard the regularizer as a loss function, i.e., \(R(\vec{w}) = f_0(\vec{w})\), and start the algorithm from time 0. Then FoReL becomes FTL, except that time starts with 0 instead of 1. So we have the lemma:

\[R(n, \text{FoReL}, \vec{u}) = \sum_{t=0}^{n} f_t(\vec{w}_t) - \sum_{t=0}^{n} f_t(\vec{u}) \leq \sum_{t=0}^{n} f_t(\vec{w}_t) - \sum_{t=0}^{n} f_t(\vec{w}_{t+1})\]

The proof follows by separating the terms for \(t=0\).

\[\begin{split} R(n, \text{FoReL}, \vec{u}) &\leq R(\vec{u})-R(\vec{w}_1) + \sum_{t=1}^{n} f_t(\vec{w}_t) - \sum_{t=1}^{n} f_t(\vec{w}_{t+1})\\ &= \frac{1}{2\eta} \|\vec{w}\|_2^2 - \argmin_{\vec{w} \in S} \|w\|_2^2 + \sum_{t=1}^{n} f_t(\vec{w}_t) - \sum_{t=1}^{n} f_t(\vec{w}_{t+1})\\ &= \frac{1}{2\eta} \|\vec{w}\|_2^2 - 0 + \sum_{t=1}^{n} \langle \vec{w}_t, \vec{v}_t \rangle - \sum_{t=1}^{n} \langle \vec{w}_{t+1}, \vec{v}_t \rangle\\ &= \frac{1}{2\eta} \|\vec{w}\|_2^2 + \sum_{t=1}^{n} \langle \vec{w}_t- \vec{w}_{t+1}, \vec{v}_t \rangle \\ &= \frac{1}{2\eta} \|\vec{w}\|_2^2 + \sum_{t=1}^{n} \eta \|\vec{v}_t\|_2^2 \end{split} \]

Assume again \(\|\vec{v}_t\|_2^2\leq L\), and assume \(\vec{u}\) is selected from \(U=\{\vec{u}: \|\vec{u}\|_2^2 \leq B\}\). We get the bound:

\[R(n, \text{FoReL}, \vec{u}) \leq \frac{1}{2\eta} B + \eta \sum_{t=1}^{n} L = \frac{B}{2\eta} + L n \eta\]

Minimizing the RHS with respect to \(\eta\), we get \(\eta = \sqrt{\frac{B}{2Ln}}\), and

\[R(n, \text{FoReL}, \vec{u}) \leq \sqrt{2BLn} = O(\sqrt{n})\]

7.3 Linearization of Convex Functions #

In the above, we only consider some very specific examples (quadratic, linear loss functions). We will show that handling general convex functions is almost as easy as these.

If a function is convex, then for any point on the function, by taking the tangent line, we can get a lower bound of the convex function. As the following lemma formulates.

Lemma Let \(S\) be a convex set. A function \(f: S\to \RR\) is convex if and only if for all \(\vec{w} \in S\), there exists \(\vec{z}\) such that

\[f(\vec{u}) \geq f(\vec{w}) + \langle \vec{u}-\vec{w},\vec{z} \rangle \quad \forall u \in S\]

\(\vec{z}\) is called sub-gradient of \(f\) at \(\vec{w}\). Note that \(\vec{z}\) may not be unique, so we define a set \(\partial{f(\vec{w})} = \{\vec{g}: \vec{g} \text{ is a subgradient of } f \text{ at } \vec{w}\}\). When \(f\) is differentiable, \(z\) is unique, it is the gradient \(\grad f(\vec{w})\).

So we get a upper bound of the regret:

\[R(n,\pi) = \sum_{t=1}^{n} f_t(\vec{w}_t) - \min_{\vec{u} \in U} \sum_{t=1}^{n} f_t(\vec{u}) \leq \sum_{t=1}^{n} \langle \vec{w}_t-\vec{u}^{*},\vec{z}_t \rangle\]

where \(\vec{u}^{*} = \argmin_{\vec{u} \in U} \sum_{t=1}^{n} f_t(\vec{u})\), \(\vec{z}_t \in \partial{f(\vec{w}_t)}\).

The RHS can be regard as the regret for the case where the loss function is linear (i.e., \(f_t(\vec{w}) = \langle \vec{w}, \vec{z}_t \rangle\)).

The Online Gradient Descent Algorithm (OGD):

\[\vec{w}_t = \vec{w}_{t-1} -\eta \vec{z}_{t-1}, \quad \vec{z}_{t-1} \in \partial f_{t-1}(\vec{w}_{t-1})\]

Now we know for a convex function, we can linearize it by getting its sub-gradient.

Note that in the proof of the linear case, we assume the norm of the gradient is bounded: \(\|\vec{v}_t\|_2^2\leq L\), and therefore in the convex case, the sub-gradient should be bounded: \(\|\vec{z}_t\|_2^2\leq L\). We will show that this requirement is actually \(L\)-Lipschitz.

Definition (L-Lipschitz)

A function \(f\) is called \(L\)-Lipschitz over a set \(S\) with respect to a norm \(\| \cdot \|\) if for all \(x,y \in S\) we have:

\[|f(x)-f(y)| \leq L \|x-y\|\]

Definition (Dual Norm)

Let \(\| \cdot \|\) be a norm on \(\RR^n\). The associated dual norm is defined aspell

\[\|\vec{z}\|_{*} = \sup \{ \langle \vec{z}, \vec{x}\rangle \mid \|\vec{x}\| \leq 1\}\]

From the definition we have the inequality

\[\langle \vec{z},\vec{x} \rangle = \|\vec{x}\| \left\langle \vec{z}, \frac{\vec{x}}{\|\vec{x}\|} \right\rangle \leq \|\vec{x}\| \|z\|_{*}\]

Theorem If \(p,q\geq1\) such that \(\frac{1}{p}+\frac{1}{q}=1\), then \(l_p\) norm and \(l_q\) norms are dual.

Lemma Let \(f: S \to \RR\) be convex. Then \(f\) is \(L\)-Lipschitz with respect to a norm \(\| \cdot \|\) if and only if for all \(\vec{w} \in S\) and \(\vec{z} \in \partial f(\vec{w})\), we have

\[\|\vec{z}\|_{*} \leq L\]

7.4 FoReL with Strongly Convex Regularizer #

Definition (Strongly Convex Functions)

A function \(f: S\to \RR\) is \(\sigma\)-strongly convex over \(S\) with respect to norm \(\| \cdot \|\) if \(\forall \vec{u},\vec{w} \in S\), we have

\[f(\vec{u})\geq f(\vec{w}) + \langle \vec{z}, \vec{u}-\vec{w}\rangle + \frac{\sigma}{2} \| \vec{u} -\vec{w}\|^2, \quad \forall \vec{z} \in \partial f(\vec{w})\]

If \(f\) is twice differentiable, then it is \(\sigma\)-strongly convex if and only if

\[\left\langle \laplacian{f(\vec{x})} \vec{x}, \vec{x} \right\rangle \geq \sigma \|x\|^2\]

(Note that the notation \(\laplacian\) means Hessian, not the Laplacian.)

Example (Euclidean Regularizer)

\[R(\vec{w}) = \frac{1}{2} \|\vec{w}\|_2^2\]

We can verify that \(\laplacian{R(\vec{w})} = I\).

So \(R(\vec{w})\) is 1-strongly convex with respect to \(l_2\) norm.

Example (Entropy Regularizer)

\[R(\vec{w}) = \sum_{i=1}^{d} w_i \log w_i\]

where \(\vec{w} \in S = \{\vec{x} \in \RR^d: x_i\geq0, \forall i, \|x\|_1 \leq B\}\).

\(R(\vec{w})\) is \(\frac{1}{B}\)-strongly convex with respect to \(l_1\) norm.

Lemma Let \(\vec{w} = \argmin_{\vec{x} \in S} f(\vec{x})\), then for all \(\vec{u} \in S\),

\[f(\vec{u})-f(\vec{w}) \geq \frac{\sigma}{2} \| \vec{u} -\vec{w}\|^2\]

Lemma Let \(R : S\to \RR\) be a \(\sigma\)-strongly convex function over \(S\) w.r.t. \(\| \cdot \|\). Let \(f_t\) be \(L_t\)-Lipschitz w.r.t. \(\| \cdot \|\) for all \(t\). If \(\vec{w}_1, \vec{w}_2, \dots\) are the predictions of the FoReL, then

\[f_t(\vec{w}_t) - f_t(\vec{w}_{t+1}) \leq L_t \|\vec{w}_t-\vec{w}_{t+1}\| \leq \frac{L_t^2}{\sigma}\]


The first inequality has been proved previously, so we focus on the second one.

By FoReL, at time \(t\), we try to minimize:

\[F_t(\vec{w}) = \sum_{i=1}^{t-1} f_i(\vec{w}) + R(\vec{w})\]

And get

\[\vec{w}_t = \argmin_{\vec{w} \in S} F_t(\vec{w})\]

One can verify that since \(f_i\) are all convex and \(R(\vec{w})\) is \(\sigma\)-strongly convex, so \(F_t(\vec{w})\) is also \(\sigma\)-strongly convex.

According to the above lemma, we have

\[F_t(\vec{w}_{t+1}) \geq F_t(\vec{w}_t) + \frac{\sigma}{2} \|\vec{w}_t-\vec{w}_{t+1}\|^2\]


\[F_{t+1}(\vec{w}_t) \geq F_{t+1}(\vec{w}_{t+1}) + \frac{\sigma}{2} \|\vec{w}_t-\vec{w}_{t+1}\|^2\]

Adding them together,

\[F_t(\vec{w}_{t+1})-F_t(\vec{w}_t) + F_{t+1}(\vec{w}_t) - F_{t+1}(\vec{w}_{t+1})\geq \sigma \|\vec{w}_t-\vec{w}_{t+1}\|^2\]

By the definition of \(F_t\), we have

\[f_t(\vec{w}_t) - f_t(\vec{w}_{t+1})\geq \sigma\|\vec{w}_t-\vec{w}_{t+1}\|^2\]

By the Lipschitz property of \(f_t\), we have

\[f_t(\vec{w}_t) - f_t(\vec{w}_{t+1}) \leq L_t \|\vec{w}_t-\vec{w}_{t+1}\|\]


\[\|\vec{w}_t-\vec{w}_{t+1}\| \leq \frac{L_t}{\sigma}\]

The proof follows by plugging this back to the above formula.

Theorem Let \(f_1,f_2, \dots, f_n\) are convex function s.t. \(f_t\) is \(L\)-Lipschitz w.r.t \(\| \cdot \|\). Then if FoReL is used with a \(\sigma\)-strongly convex regularizer w.r.t. the same norm, then \[\begin{split} R(n, \text{FoReL}, \vec{u}) &\leq R(\vec{u})-R(\vec{w}_1) + \sum_{t=1}^{n} f_t(\vec{w}_t) - \sum_{t=1}^{n} f_t(\vec{w}_{t+1})\\ &\leq R(\vec{u})-\min_{\vec{w} \in S} R(\vec{w}) + \frac{nL^2}{\sigma} \end{split} \]

7.5 Euclidean and Entropy Regularizer (Skipped) #

8 Stochastic Bandits #

8.1 Settings #

  1. Input: Number of arms \(K\), number of rounds \(n\).
  2. In each round \(t=1,2, \dots, n\):
    1. Learner selects an arm \(I_t \in [K]\).
    2. Environment assigns a reward \(X_{t}\) to arms drawn from a distribution \(v_{I_t}\) (associated with arm), i.e., \(X_t \sim v_{I_t}\).

The policy of a learner: given the history \(\mathcal{H}_t = (I_1, X_1, \dots, I_{t-1}, X_{t-1})\), decide which arm to play: \(\pi_t(\cdot \mid \mathcal{H}_t)\).

The total reward is \(S_n = \sum_{t=1}^{n} X_t\), and the goal is to maximize \(\E[S_n]\).

Denote \(\mu_a\) as the mean of distribution \(v_a\) for each arm \(a \in [K]\). And \(\mu^{*} = \max_{a \in [K]} \mu_a\). Then if all the distributions are known, \(\max \E[S_n] = n \mu^{*}\). In reality, these distributions are unknown, so we just use this as our benchmark in the regret definition.

\[Regret(\pi, n) = n\mu^{*} - \E[S_n] = n \max_{i \in [K]} \mu_i - \E \left[ \sum_{t=1}^{n} X_t \right]\]

A bandit instance is a list of \(K\) distributions: \(v = \{v_a\}_{a \in [K]}\).

The environment class is the collection of all possible bandit instances:

\[\mathcal{E} = \{ v = \{v_a\}_{a \in [K]}: v_a \in M_a, \forall a \in [K]\}\]

where \(M_a\) is the set of distributions.

Typical environment classes:

\[\mathcal{E} = \left\{ \{\mathrm{Bernoulli}(\nu_i)\}_{i \in [K]}: \mu_i \in [0,1], \forall i \in [K]\right\}\]

We assume that the arm sample is independent of the past pulls from that arm, and also pulls of the other arms.

8.2 Subgaussian Random Variable #

A random variable is \(\sigma\)-sub-Gaussian if for all \(\lambda \in \RR\), it holds that

\[\E \left[ e^{\lambda X} \right] \leq \exp \left( \frac{\lambda^2\sigma^2}{2} \right) \]


\[\log \left( \E \left[ e^{\lambda X} \right] \right) \leq \frac{\lambda^2\sigma^2}{2}\]


The Gaussian random variable with 0 mean: \(X \sim N(0,\sigma^2)\), compute its moment generation variable, we get

\[\E \left[ e^{\lambda X} \right] = \exp \left( \frac{\lambda^2\sigma^2}{2} \right)\]

So it is sub-Gaussian.

Definition (Heavy Tailed) A R.V. \(X\) is heavy tailed if for all \(\lambda >0\),

\[\log \left( \E \left[ e^{\lambda X} \right] \right) = \infty\]

Otherwise it is called light tailed.


Subgaussian random variables are obviously light tailed.

If R.V. is zero-mean and bounded, i.e., \(X \in [a,b]\), then \(X\) is \(\frac{b-a}{2}\)-subgaussian. (proof skipped)

\(\sigma\)-Subgaussian implies zero mean and bounded variance, i.e., \(\Var[X]\leq \sigma^2\).

It is common to assume the reward distributions to be “subgaussian”, but obviously with non-zero means. This is because bandit literature abuses notation by considering a random variable \(X\) to be \(\sigma\)-subgaussian if the noise \(X-\E[X]\) is \(\sigma\)-subgaussian. See page 78 of the textbook.

Theorem If \(X\) is \(\sigma\)-sub-Gaussian, then for all \(\epsilon\geq0\),

\[\begin{split} \Pr[X\geq \epsilon] &\leq \exp \left( -\frac{\epsilon^2}{2\sigma^2} \right)\\ \Pr[X \leq -\epsilon] &\leq \exp \left( -\frac{\epsilon^2}{2\sigma^2} \right) \end{split} \]


\[\begin{split} \Pr\left[X\geq\epsilon\right] &= \Pr\left[ e^{\lambda X} \geq e^{\lambda \epsilon}\right]\\ &\leq \frac{\E \left[ e^{\lambda X} \right]}{e^{\lambda \epsilon}}\\ &\leq \exp \left( \frac{\lambda^2\sigma^2}{2} - \lambda \epsilon \right) \end{split} \]

The RHS is minimized when \(\lambda = \frac{\epsilon}{\sigma^2}\). The proof follows by minimizing the RHS. The other inequality is similar.

Let \(\epsilon = \sqrt{2\sigma^2 \log(\frac{1}{\delta})}\), we have

\[\Pr[X \geq \epsilon] \leq \delta\]

Therefore, \(X \in \left(-\sqrt{2\sigma^2 \log(\frac{1}{\delta})}, \sqrt{2\sigma^2 \log(\frac{1}{\delta})}\right)\) with probability at least \(1-2\delta\).

Lemma Suppose \(X\) is \(\sigma\)-sub-Gaussian, \(X_1,X_2\) are \(\sigma_1\)-sub-Gaussian and \(\sigma_2\)-sub-Gaussian respectively, then

  1. \(\E[X]=0\), \(\Var[X]\leq\sigma^2\).
  2. \(cX\) is \(|c|\sigma\)-sub-Gaussian for all \(c \in \RR\).
  3. \(X_1+X_2\) is \(\sqrt{\sigma_1^2+\sigma_2^2}\)-sub-Gaussian.

Theorem Assume \(X_i-\mu\) are independent and \(\sigma\)-sub-Gaussian random variables. Then for all \(\epsilon \geq0\),

\[\begin{split} \Pr\left[\hat{\mu}\geq \mu+\epsilon\right] &\leq \exp \left( - \frac{n\epsilon^2}{2\sigma^2} \right)\\ \Pr\left[\hat{\mu}\leq \mu-\epsilon\right] &\leq \exp \left( - \frac{n\epsilon^2}{2\sigma^2} \right) \end{split} \]

where \(\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i\).


\[\Pr\left[\hat{\mu}\geq \mu+\epsilon\right] = \Pr\left[\frac{1}{n}\sum_{i=1}^{n} (X_i-\mu) \geq \epsilon\right]\]

Since \(X_i-\mu\) is \(\sigma\)-sub-Gaussian, by the above lemma, \(\frac{1}{n}\sum_{i=1}^{n} (X_i-\mu)\) is \(\frac{\sigma}{\sqrt{n}}\)-sub-Gaussian. The proof follows by applying the theorem above.

Let \(\epsilon = \sqrt{\frac{2\sigma^2 \log(\frac{1}{\delta})}{n}}\), we have

\[\begin{split} \Pr\left[\hat{\mu}\geq \mu+\epsilon\right] &\leq \delta\\ \Pr\left[\hat{\mu}\leq \mu-\epsilon\right] &\leq \delta \end{split} \]

Therefore, \(\hat{\mu} \in [\mu-\epsilon, \mu+\epsilon]\) with probability at least \(1-2\delta\).


This called Hoeffding Bound, it is a tighter bound than by using Chebyshev Inequality.

By Chebyshev, we can get

\[\Pr\left[\hat{\mu}\geq \mu+\epsilon\right] \leq \frac{\sigma^2}{n\epsilon^2}\]

By using the inequality: \(e^{-x} \leq \frac{1}{ex}, \forall x>0\), we have

\[\exp \left( - \frac{n\epsilon^2}{2\sigma^2} \right) \leq \frac{2\sigma^2}{en\epsilon^2} < \frac{\sigma^2}{n \epsilon^2}\]

Example of sub-Gaussian R.V.s

  1. If \(X\) has 0 mean and \(|X|\leq B\) almost surely (a.s.) for some \(B\), then \(X\) is \(B\)-subgaussian.
  2. If \(X\) has 0 mean and \(X \in [a,b]\) a.s., then \(X\) is \(\frac{b-a}{2}\)-subgaussian. E.g., it’s common to assume rewards are from \([0,1]\), so they are \(\frac{1}{2}\)-subgaussian.

We will assume our environment class is

\[\mathcal{E} = \left\{ v=\{v_a\}_{a \in [K]}: v_a-\E_{X \sim v_a}[X] \text{ is } \sigma-\text{sub-Gaussian}\right\}\]

8.3 Regret Definition and Regret Decomposition #

The pseudo-regret is defined as

\[\begin{split} \overline{R}(\pi,n) &= n\mu^{*} - \E \left[ \sum_{t=1}^{n} X_t \right]\\ &= n\mu^{*} - \E \left[ \sum_{t=1}^{n} \mu_{I_t} \right] \end{split} \]

The first expectation is w.r.t. the randomness of both the distributions of the arms (environment) and the randomness of arm selection (learner). The second expectation is w.r.t. the randomness only of the learner.

Define \(T_i(n)=\sum_{t=1}^{n} \1_{\{I_t=i\}}\), the number of times that arm \(i\) is pulled over \(n\) rounds. Then obviously \(\sum_{i=1}^{K} \E \left[ T_i(n) \right] = n\). So we can decompose the pseudo-regret as follows.

\[\begin{split} \overline{R}(\pi,n) &= n\mu^{*} - \E \left[ \sum_{t=1}^{n} \mu_{I_t} \right]\\ &= \mu^{*} \sum_{i=1}^{K} \E \left[ T_i(n) \right] - \E \left[ \sum_{t=1}^{n} \sum_{i=1}^{K} \mu_i \cdot \1_{\{I_t=i\}} \right]\\ &= \mu^{*} \sum_{i=1}^{K} \E \left[ T_i(n) \right] - \E \left[ \sum_{i=1}^{K} \sum_{t=1}^{n} \mu_i \cdot \1_{\{I_t=i\}} \right]\\ &= \mu^{*} \sum_{i=1}^{K} \E \left[ T_i(n) \right] - \E \left[ \sum_{i=1}^{K} \mu_i \cdot T_i(n) \right]\\ &= \mu^{*} \sum_{i=1}^{K} \E \left[ T_i(n) \right] - \sum_{i=1}^{K} \mu_i \cdot \E \left[ T_i(n) \right]\\ &= \sum_{i=1}^{K} (\mu^{*}-\mu_i) \E \left[ T_i(n) \right] \\ &= \sum_{i=1}^{K} \Delta_i \E \left[ T_i(n) \right] \quad \text{(Denote } \mu^{*}-\mu_i = \Delta_i \text{)} \end{split} \]

8.4 Explore-Then-Commit (ETC) Algorithm #

Assume the optimal arm is unique and with out loss of generality, the optimal arm is arm 1. I.e., \(\max_{a \in [K]} \mu_a = \mu_1\), and \(\Delta_1=0\), \(\Delta_i>0\) for \(i>1\).

We bound the number of times that arm \(i\) is pulled \(T_i(n)\):

\[\begin{split} T_i(n) &= m + \sum_{s=mK+1}^n \1_{\{I_s=i\}} \\ &= m + (n-mK) \1_{\{\hat{\mu}_i(mK) \geq \max_{j\neq i} \hat{\mu}_j(mK)\}} \end{split} \]


\[\E[T_i(n)] = m + (n-mK) \cdot \Pr \left[ \hat{\mu}_i(mK) \geq \max_{j\neq i} \hat{\mu}_j(mK) \right]\]


\[\begin{split} &\Pr \left[ \hat{\mu}_i(mK) \geq \max_{j\neq i} \hat{\mu}_j(mK) \right]\\ \leq& \Pr \left[ \hat{\mu}_i(mK) \geq \hat{\mu}_1(mK) \right]\\ =& \Pr \left[ \hat{\mu}_i(mK) - \mu_i -(\hat{\mu}_1(mK)- \mu_1) \geq \mu_1-\mu_i \right]\\ \end{split} \]

Note that

\[\hat{\mu}_i(mK) -\mu_i = \frac{\sum_{s=1}^{m} X_{i,s}}{m} - \mu_i = \frac{\sum_{s=1}^{m} (X_{i,s} - \mu_i)}{m}\]

Assume each \(X_{i,s}-\mu_i\) is 1-sub-Gaussian, then the above formula is \(\frac{1}{\sqrt{m}}\)-sub-Gaussian.

So \(\hat{\mu}_i(mK) - \mu_i -(\hat{\mu}_1(mK)- \mu_1)\) is \(\frac{2}{\sqrt{m}}\)-sub-Gaussian. According to the lemma above, we get

\[\Pr \left[ \hat{\mu}_i(mK) - \mu_i -(\hat{\mu}_1(mK)- \mu_1) \geq \Delta_i \right] \leq \exp \left( -\frac{\Delta_i^2 m}{4} \right)\]

Finally we get the result

\[\E[T_i(n)]\leq m + (n-mK)\exp \left( -\frac{\Delta_i^2 m}{4} \right)\]

And the pseudo-regret has bound

\[\begin{split} \overline{R}(\text{ETC},n) &= \sum_{i=1}^K \E[T_i(n)]\Delta_i \\ &= \underbrace{m \sum_{i=1}^{K} \Delta_i}_{\text{Exploration}} + \underbrace{(n-mK)\sum_{i=1}^{K} \Delta_i\exp \left( -\frac{\Delta_i^2 m}{4} \right)}_{\text{Commit}} \end{split} \]

Obviously, if \(m\) is small, the regret in exploration will be small, but the regret in commit will be large. If \(m\) is large, then the regret in commit will be small, but we waste too much time in exploration.

To decide what \(m\) we should choose, we can minimize the above regret w.r.t. \(m\), and the resulting regret will be about \(\log(n)\), which is indeed sublinear. However, this minimization requires that \(\Delta_i\) are known in prior, which is usually not the case.

8.5 Optimism in the Face of Uncertainty #

Greedy algorithm may stuck to the sub-optimal arm. So we consider \(\epsilon\)-greedy, which selects arm as follows:

\[I_t= \left\{ \begin{array}{[email protected]{\quad:\quad}l} \argmax_{i \in [K]} \hat{\mu}_i(t-1) & \text{with probability } 1-\epsilon\\ \text{select an arm uniformly at random} & \text{with probability } \epsilon \end{array}\right. \]

The problem is how to choose \(\epsilon\) (or even choose different \(\epsilon_t\) at time \(t\)).

UCB (Upper Confidence Bound) algorithm (proposed ~2002).

Recall the concentration bounds for sub-Gaussian:

Theorem Assume \(X_i-\mu\) are independent and \(\sigma\)-sub-Gaussian random variables. Define \(\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i\), then for all \(\epsilon \geq0\),

\(\hat{\mu} \in \left(\mu-\sqrt{\frac{2\sigma^2 \log(\frac{1}{\delta})}{n}}, \mu+ \sqrt{\frac{2\sigma^2 \log(\frac{1}{\delta})}{n}}\right)\) with probability at least \(1-2\delta\).

Also, if the estimator \(\hat{\mu}\) is within \((\mu-\epsilon, \mu+\epsilon)\) with high probability, then we can also say the true value \(\mu\) is within the estimation \((\hat{\mu}-\epsilon, \hat{\mu}+\epsilon)\) with high probability.

The UCB algorithm will choose the upper bound of the estimator, and apply greedy strategy.

In the above algorithm, we assume \(\sigma=1\), and we replace \(\frac{1}{\delta}\) with \(t\).

  1. If \(T_i(t-1)\) is small, i.e., arm \(i\) is not explored too much, then the second term dominates, so the algorithm tends to play it.
  2. When all the arms are fully explored, the first term dominates, the algorithm tends to play the best arm so far.
  3. Since \(t\) is increasing, after some time the second term will again dominate. So unlike ETC algorithm, UCB algorithm will not stick to one arm, it will never stop playing sub-optimal arms. The exploration never stops, but at a very slow rate since \(\log\) increases very slow.

8.6 Upper Confidence Bound Algorithm #

We assume arm 1 is optimal and unique. Define \(\Delta=\min_{i>1} \Delta_i\), the gap between the optimal mean value and the second-optimal mean value.

We define the UCB index as

\[UCB_i(t) = \underbrace{\hat{\mu}_i(t-1)}_{\text{estimate}} + \underbrace{\sqrt{\frac{2\log(t)}{T_i(t-1)}}}_{\text{condifence bound}}\]

where \(T_i(t-1) = \sum_{s=1}^{t-1} \1_{\{I_s=i\}}\) and \(\hat{\mu}_i(t-1) = \frac{\sum_{s=1}^{t-1} \1_{\{I_s=i\}} X_{i,s}}{T_i(t-1)}\).

Theorem If UCB is run on \(K\) arms having arbitrary reward distributions with common support \([0,1]\), then UCB pulls each sub-optimal arm \(k\) in expectation at most

\[\E \left[ T_k(n) \right] \leq \frac{8\ln n}{\Delta_k^2} + \frac{\pi^2}{3} +1\]

By the regret decomposition theorem, the pseudo-regret is bounded as

\[\begin{split} \overline{R}_n &\leq 8 \sum_{i: \Delta_{i}>0} \left( \frac{\ln n}{\Delta_i} \right) + \sum_{i: \Delta_{i}>0} \left( \frac{\pi^2}{3} +1 \right) \Delta_i\\ &\leq 8(K-1) \frac{\ln n}{\Delta} +\sum_{i: \Delta_{i}>0} \left( \frac{\pi^2}{3} +1 \right) \Delta_i \end{split} \]

which is sublinear.


Here since the reward distributions have support \([0,1]\), so each arm is actually \(\frac{1}{2}\)-subgaussian. Don’t be confused with the proof on the textbook, where they assume 1-subgaussian arms. The bounds of these two settings only differ in constant factors though.

Denote the optimal arm as \(i^{*}=\argmax_{i \in [K]} \mu_{i} = 1\), and \(\mu^{*} = \max_i \mu_i\).

Assume that a suboptimal arm \(i\) is played in round \(t\), i.e., \(I_t=i\), \(i \neq i^{*}\). There are 3 reasons that this would happen:

  1. \(\hat{\mu}_i(t-1) - \sqrt{\frac{2\log(t)}{T_i(t-1)}} > \mu_i\) (The suboptimal arm is overestimated)
  2. \(\hat{\mu}_1(t-1) + \sqrt{\frac{2\log(t)}{T_1(t-1)}} \leq \mu_1\) (The optimal arm is underestimated)
  3. \(T_i(t-1)\leq \frac{8 \log n}{\Delta_i^2}\), or equivalently, \(\mu_1-\mu_i=\Delta_i \leq 2 \sqrt{\frac{2\log(t)}{T_i(t-1)}}\)

Claim If a sub-optimal arm \(i\) is played at round \(t\), then at least 1 of these 3 conditions must hold.


Assume none of them holds, then

\[\begin{split} UCB_1(t) &= \hat{\mu}_1(t-1) + \sqrt{\frac{2\log t}{T_1(t-1)}}\\ &> \mu_1\\ &= \mu_i + \Delta_i\\ &> \mu_i + 2 \sqrt{\frac{2 \log n}{T_i(t-1)}}\\ &> \mu_i + 2 \sqrt{\frac{2 \log t}{T_i(t-1)}}\\ &\geq \hat{\mu}_i(t-1) + \sqrt{\frac{2 \log t}{T_i(t-1)}}\\ &= UCB_i(t) \end{split} \]

Therefore arm \(i\) cannot be chosen, contradiction.

\[\begin{split} \E[T_i(n)] &= \E\left[T_i(n) \1_{\{T_i(t-1)\leq \ell\}}\right] + \E\left[T_i(n) \1_{\{T_i(t-1) > \ell\}}\right]\\ &\leq \ell + \E\left[\sum_{t=1}^{n} \1_{\{I_t=i\}} \1_{\{T_i(t-1) > \ell\}}\right]\\ &= \ell + \E\left[\sum_{t=1}^{n} \1_{\{I_t=i, T_i(t-1) > \ell\}} \right]\\ &= \ell + \E\left[\sum_{t=1}^{n} \1_{\{I_t=i, \text{(1) or (2) holds}\}} \right] \quad \text{(By letting } \ell = \left\lceil \frac{8 \log n}{\Delta_i^2} \right\rceil \text{)}\\ &\leq \ell + \E\left[\sum_{t=1}^{n} \1_{\{I_t=i, \text{(1) holds}\}} \right] + \E\left[\sum_{t=1}^{n} \1_{\{I_t=i, \text{(2) holds}\}} \right] \quad \text{(Union Bound)}\\ &= \ell + \sum_{t=1}^{n} \Pr[I_t=i, \text{(1) holds}] + \sum_{t=1}^{n} \Pr[I_t=i, \text{(2) holds}]\\ &\leq \ell + \sum_{t=1}^{n} \Pr[\text{(1) holds}] + \sum_{t=1}^{n} \Pr[\text{(2) holds}]\\ \end{split} \]


\[\begin{split} & \Pr[\text{(1) holds}]\\ =& \Pr\left[\hat{\mu}_i(t-1) - \sqrt{\frac{2\log t}{T_i(t-1)}} > \mu_i\right]\\ =& \Pr\left[\hat{\mu}_i(t-1) - \mu_i > \sqrt{\frac{2\log t}{T_i(t-1)}}\right]\\ \leq& \Pr\left[\exists 1\leq s \leq t-1 , \hat{\mu}_{i,s} - \mu_i > \sqrt{\frac{2\log t}{s}}\right]\\ \leq& \sum_{s=1}^{t-1} \Pr\left[\hat{\mu}_{i,s} - \mu_i > \sqrt{\frac{2\log t}{s}}\right] \quad \text{(Union Bound)}\\ \leq& \sum_{s=1}^{t-1} e^{-4 \log t} \quad \text{(Hoeffding Bound)}\\ =& \sum_{s=1}^{t-1} t^{-4}\\ \leq& t^{-4} \cdot t \leq t^{-2} \end{split} \]

We cannot directly use the concentration bound on the second equality, because \(T_i(t-1)\) is a random quantify. Instead, we assume arm \(i\) is played \(s\) times, and denote \(\hat{\mu}_{i,s}\) as the estimate of arm \(i\) using \(s\) samples.

So \(\sum_{t=1}^{n} \Pr[\text{(1) holds}] \leq \sum_{t=1}^{n} t^{-2} \leq \sum_{t=1}^{\infty} t^{-2} = \frac{\pi^2}{6}\).

By almost the same arguments, \(\sum_{t=1}^{n} \Pr[\text{(2) holds}] \leq \frac{\pi^2}{6}\).

In summary,

\[\E[T_i(n)] \leq \left\lceil \frac{8 \log n}{\Delta_i^2} \right\rceil + \frac{\pi^2}{3} \leq \frac{8 \log n}{\Delta_i^2} + \frac{\pi^2}{3} +1\]

Finally we get the bound of the pseudo-regret:

\[\begin{split} \overline{R}(\text{UCB},n) &= \sum_{i=1}^{K} \E[T_i(n)] \Delta_i\\ &= \sum_{i \neq i^{*}} \frac{8 \log n}{ \Delta_i} + \sum_{i \neq i^{*}} \Delta_i \left( \frac{\pi^2}{3}+1 \right)\\ &= 8 \sum_{i\neq i^{*}} \frac{\log n}{ \Delta_i} + \sum_{i \neq i^{*}} \Delta_i \left( \frac{\pi^2}{3}+1 \right)\\ &\leq 8 (K-1) \frac{\log n}{ \Delta_i} + \sum_{i \neq i^{*}} \Delta_i \left( \frac{\pi^2}{3}+1 \right)\\ &= O \left( \frac{K\log n}{\Delta} \right) \end{split} \]


The constant factor in the UCB confidence radius varies in difference literature. In the above proof, we can actually use \(\sqrt{\frac{3 \log t}{2 T_i(t-1)}}\), and this will improve the constant factor in the bound to 6 instead of 8.

8.7 Problem Independent Bounds of UCB #

The above bound is problem dependent, meaning that all the mean values are fixed (i.e., the bandit instance is fixed). And the regret bound depends on the reciprocal of the gaps (\(\Delta_i^{-1}\)), which may be very large when the gap between the best and the second-best arm is very small. However, as the gap becomes small, we would imagine that playing the second-best arm becomes a decent strategy.

It is possible to derive a problem independent bound, which is still sublinear and holds for all possible bandit instances, i.e., it does not depend on the reciprocal of the gaps. So it is also called a worst-case regret.

\[\begin{split} \overline{R}(\pi,n) &= \sum_{i=1}^{K} \E[T_i(n)] \Delta_i\\ &= \sum_{i=1}^{K} \sqrt{\E[T_i(n)]} \left(\sqrt{\E[T_i(n)]}\Delta_i\right)\\ &\leq \sqrt{\sum_{i=1}^{K} \E[T_i(n)] \sum_{i=1}^{K} \left(\E[T_i(n)] \Delta_i^2\right)} \quad \text{(Cauchy–Schwarz)}\\ &= \sqrt{n \sum_{i=1}^{K} \left(\E[T_i(n)] \Delta_i^2 \right)}\\ &\leq \sqrt{n \sum_{i\neq i^{*}} \left( \frac{4\alpha \log n}{\Delta_i^2} + \frac{\pi^2}{3} +1\right)\Delta_i^2 }\\ &= \sqrt{n (K-1) 4\alpha \log n + n \sum_{i\neq i^{*}} \left(\frac{\pi^2}{3} +1\right)\Delta_i^2 }\\ \end{split} \]

Assume the support of the distributions of arms is \([0,1]\), so \(\mu_i \in [0,1]\) and \(\Delta_i \in [0,1]\). We have

\[\begin{split} \overline{R}(\pi,n) &\leq \sqrt{n (K-1) 4\alpha \log n + n (K-1) \left(\frac{\pi^2}{3} +1\right) }\\ &= O \left( \sqrt{n(K-1) \log n} \right)\\ &= \widetilde{O} \left( \sqrt{n(K-1)} \right) \end{split} \]

This bound does not depend on a particular Bandit instance, so it is a problem independent bound, or worst-case regret.


The above regret bound is \(O(\sqrt{nK \log n})\). We can remove the log factor entirely by using the MOSS (minimax optimal strategy in the stochastic case) algorithm. It is minimax optimal, meaning that except for constant factors, the worst-case bound cannot be improved by any algorithm.

Therefore the worst-case regret for UCB algorithm is \(\Theta(\sqrt{nK})\).

8.8 KL-UCB Algorithm #

The same algorithm but the UCB indices are computed based on different confidence terms.

Definition (KL-divergence)

If \(\vec{p}, \vec{q}\) are two probability distributions (vectors): \[d(\vec{p},\vec{q})=\sum_{i=1}^{N} p_i \log \frac{p_i}{q_i} \geq0\]

If the two distributions are Bernoulli with parameters \(p\) and \(q\), we abuse the notation \[d(p,q) = p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}\]

Theorem Consider a \(K\)-armed stochastic bandit with support in \([0,1]\). Let \(\epsilon>0\) and set \(C=3\). For any \(T\), the number of time KL-UCB chooses a sub-optimal arm \(k \neq k^{*}\) is bounded as

\[\E\left[N_k(T)\right] \leq \frac{\log(T) (1+\epsilon)}{d(\mu_k, \mu^{*})} + C_1 \log(\log T) + \frac{c_2(\epsilon)}{T^{\beta(\epsilon)}}\]

where \(C_1\) is constant, \(c_2, \beta\) are positive functions of \(\epsilon\).

Lemma (Pinsker’s Inequality) For all \(p,q \in [0,1]\),

\[d(p,q)\geq 2(p-q)^2\]

By Pinkser’s inequality, \(d(\mu_k, \mu^{*})\geq 2 \Delta_k^2\). Therefore comparing the the bound in UCB, we can expect a better bound by using KL-UCB.

8.9 Thompson Sampling #

Based on Bayesian approach.

9 Lower Bounds #

9.1 Proof Ideas #

We want to answer: What is the best we can get about the regret? Are the above algorithms optimal?

For a policy \(\pi\), define the worst-case regert on bandit environments \(\mathcal{E}\) as

\[R_T(\pi, \mathcal{E}) = \sup_{\nu \in \mathcal{E}} R_T(\pi,\nu)\]

Let \(\Pi\) be the set of all policies, consider the minimax regret:

\[R_T^{*}(\mathcal{E}) = \inf_{\pi \in \Pi}\left[\sup_{\nu \in \mathcal{E}} R_T(\pi,\nu)\right]\]

(Intuitively, this is the regret achieved by the best algorithm in the worst scenario.)

A policy \(\pi\) is called minimax optimal for \(\mathcal{E}\) if \(R_T^{*}(\mathcal{E}) = R_T(\pi,\mathcal{E})\).

Theorem Let \(\mathcal{E}^K\) be the \(K\)-armed stochastic bandits with finite support (\(\epsilon\)-Sub-Gaussian). Then there exists a universal constant \(C>0\) such that for all \(K>1, T>K\), it holds that \(R^{*}_T(\mathcal{E}^K) \geq C \sqrt{KT}\).

Proof idea: Let \(X_1,X_2, \dots, X_n\) be an i.i.d. sequence of R.V.s with mean \(\mu\) and variance 1. Assume that \(\mu=0\) or \(\Delta\). \(\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i\).

Select two bandit instances such that

  1. Optimal action in the two instances are different
  2. The instance are “close” enough that the policy cannot statistically differentiate between them.

So any algorithms will be “fooled”, and at least make some mistakes.

Next we will construct such problem instances.

Consider \(\nu=\{p_i\}_{i \in [K]}\) where \(p_i = N(\mu_i, 1)\), and \(\nu’=\{p’_i\}_{i \in [K]}\) where \(p’_i = N(\mu’_i, 1)\). Then it suffices to show that for every policy \(\pi\), there exists \(\vec{\mu}, \vec{\mu}’ \in [0,1]^{K}\) such that the policy suffers a large regret in at least one instance, i.e.,

\[\max \left\{ R_T(\pi, \nu), R_T(\pi, \nu’) \right\} \geq C \sqrt{KT}\]

Construct \(\vec{\mu}\) as \[\mu_i = \left\{ \begin{array}{[email protected]{\quad:\quad}l} \Delta & i=1\\ 0 & i\neq 1 \end{array}\right. \] Then the regret is \(R_T(\pi, \nu) = (T-\E[N_1(T)])\Delta\).

There exists \(i\neq 1\) such that \(\E[N_i(T)] \leq \frac{T-\E[N_1(T)]}{K-1} \leq \frac{T}{K-1}\). For this fixed \(i\), define \(\vec{\mu}’\) as \[\mu_j’ = \left\{ \begin{array}{[email protected]{\quad:\quad}l} \mu_j & j\neq i\\ 2\Delta & j=i \end{array}\right. \] E.g., if \(i=4\), then \(\vec{\mu}=(\Delta,0,0,0,0,0, \dots)\), and \(\vec{\mu}’=(\Delta, 0,0, 2\Delta, 0, \dots)\).

The regret if \(R_T(\pi, \nu’) = \E[N_1(T)] \Delta + \sum_{j\neq 1,i} \E[N_j(T)]2\Delta \geq \E[N_1(T)] \Delta\).

Now since the two instances only differ in arm \(i\), and the number of times that arm \(i\) is played is bounded. We want to tune \(\Delta\) such that distinguishing them two requires playing arm \(i\) more than \(\frac{T}{K-1}\) times. We will set \(\Delta=\sqrt{\frac{1}{\E[N_i(T)]}} \geq \sqrt{\frac{K-1}{T}}\), we want to show that the expected number of pulls for the optimal arm (arm 1) are approximately the same for these two bandit instances.

9.2 Some Definitions from Information Theory #

Definition (Entropy) For a discrete probability distribution \(P=(p_1,p_2, \dots, p_k)\), define the entropy as \[H(P)=\sum_{i=1}^{k} p_i \log \frac{1}{p_i}\]

Definition (Divergence/Relative entropy) Assume \(P\) and \(Q\) are defined in the same probability space. Divergence is defined as \[D(P \parallel Q) = \sum_{i=1}^{k} p_i \log \frac{p_i}{q_i}= -H(P)+\sum_{i=1}^{k} p_i \log \frac{1}{q_i} \geq 0\]

For continuous case,

\[D(P \parallel Q) = \int_{-\infty}^{\infty} p(x) \log \left( \frac{p(x)}{q(x)} \right) \dd{x}\]

where \(p,q\) are probability densities of \(P,Q\).

More generally, if \(P\) is absolutely continuous with respect to \(Q\), then

\[\begin{split} D(P \parallel Q) &= \int_{\Omega} \log \left( \dv{P}{Q} \right) \dd{P} = \textstyle{\E_P} \left[ \log \dv{P}{Q} \right]\\ &= \int_{\Omega} \log \left( \dv{P}{Q} \right) \dv{P}{Q} \dd{Q} = \textstyle{\E_Q} \left[\dv{P}{Q} \log \dv{P}{Q} \right] \end{split} \]

where \(\dv{P}{Q}\) is the Radon-Nikodym derivative of \(P\) w.r.t. \(Q\) (understand it as a random variable).


The Radon-Nikodym derivative is the ratio of the probability densities:

If \(X_0, X_1\) are two R.V.s with densities \(p(x), q(x)\), then they have probability measures \(\dd{P(x)} = p(x) \dd{x}\), and \(\dd{Q(x)} = q(x) \dd{x}\). Therefore

\[\dv{P}{Q} \left( x \right) =\dv{P(x)}{Q(x)}= \frac{p(x)\dd{x}}{q(x) \dd{x}} = \frac{p(x)}{q(x)}\]

Canceling \(\dd{x}\) from the numerator and denominator is not rigorous, but the formula \(\dv{P}{Q} = \frac{p(x)}{q(x)}\) is easy to check in the integral definition of expectation.

Theorem If \(P \sim N(\mu_1, \sigma^2)\), \(Q \sim N(\mu_2, \sigma^2)\), then \(D(P \parallel Q)=\frac{(\mu_1-\mu_2)^2}{2\sigma^2}\).

Theorem (Bretagnolle–Huber inequality) Let \(P,Q\) be probability measures on the same measurable space \((\Omega, \mathcal{F})\). Let \(A \in \mathcal{F}\) be an arbitrary event. Then \[P(A)+Q(A^c) \geq \frac{1}{2} \exp \left( -D(P \parallel Q) \right)\]

9.3 Proof of Lower Bound #

The sequence of interaction between the learner and the environment is drawn from distribution (See Chapter 4.6 in Bandit Algorithms):

\[p_{\nu,\pi}(a_1,x_1, \dots, a_n, x_n)=\prod_{t=1}^n \pi(a_t\mid a_1,x_1, \dots, a_{t-1},x_{t-1}) p_{a_t}(x_t)\]

Lemma (Divergence decomposition) Let \(\nu=(P_1, \dots, P_k)\) be the reward distributions of one \(k\)-armed bandit, and let \(\nu’=(P’_1, \dots, P’_k)\) be the reward distributions of another bandit. Fix a policy \(\pi\) and let \(\mathbb{P}_\nu=p_{\nu,\pi}\) and \(\mathbb{P}_{\nu’}=p_{\nu’,\pi}\), then

\[D(\mathbb{P}_\nu \parallel \mathbb{P}_{\nu’}) = \sum_{i=1}^{k} \textstyle{\E_\nu}[T_i(n)] D(P_i \parallel P’_i)\]


By definition,

\[D(\mathbb{P}_\nu \parallel \mathbb{P}_{\nu’}) = \textstyle{\E_\nu} \displaystyle\left[ \log \left( \dv{\mathbb{P}_\nu}{\mathbb{P}_{\nu’}} \right) \right]\]


\[\log \frac{\dd{\mathbb{P}_{\nu}}}{\dd{\mathbb{P}_{\nu’}}} (a_1, x_1, \dots, a_n, x_n) = \sum_{t=1}^{n} \log \frac{p_{a_t}(x_t)}{p’_{a_t}(x_t)}\]

Taking expectations on both sides,

\[\textstyle{\E_{\nu}} \displaystyle \left[ \log \frac{\dd{\mathbb{P}_{\nu}}}{\dd{\mathbb{P}_{\nu’}}} (A_1, X_1, \dots, A_n, X_n) \right] = \sum_{t=1}^{n} \textstyle{\E_\nu} \displaystyle\left[ \log \frac{p_{A_t}(X_t)}{p’_{A_t}(X_t)} \right]\]


\[\textstyle{\E_\nu} \displaystyle\left[ \log \frac{p_{A_t}(X_t)}{p’_{A_t}(X_t)} \right] = \textstyle{\E_\nu} \displaystyle\left[\textstyle{\E_\nu} \displaystyle\left[ \log \frac{p_{A_t}(X_t)}{p’_{A_t}(X_t)} \biggr\rvert A_t \right]\right] = \textstyle{\E_\nu} [D(P_{A_t} \parallel P’_{A_t})]\]

Plugging back, we get

\[\begin{split} D(\mathbb{P}_\nu \parallel \mathbb{P}_{\nu’}) &= \sum_{t=1}^{n} \textstyle{\E_\nu} [D(P_{A_t} \parallel P’_{A_t})] \\ &= \sum_{i=1}^{k} \textstyle{\E_\nu} \displaystyle \left[ \sum_{t=1}^{n} \1_{A_t=i} D(P_{A_t} \parallel P’_{A_t}) \right] \\ &= \sum_{i=1}^{k} \mathbb{E}_{\nu} [T_i(n)] D(P_i \parallel P’_i) \end{split} \]

What we want to prove is the following theorem.

Theorem Let \(K>1\) and \(n>k-1\). Then for any policy \(\pi\), there exists a mean vector \(\mu \in [0,1]^K\), such that

\[R_n(\pi, \nu_{\mu}) \geq \frac{1}{27} \sqrt{(K-1)n}\]

where \(\nu_{\mu}\) is a Gaussian bandit for which the \(i\)th arm has reward distribution \(N(\mu_i, 1)\).


Construct a Gaussian bandit instance \(\nu_{\mu}\) with variance 1 and mean vector \(\mu=(\Delta, 0,0, \dots, 0)\), \(\Delta \in [0,0.5]\). Let \(i=\argmin_{j>1} \E_{\mu}[N_j(T)]\), then \(\E[N_i(T)]\leq \frac{T}{K-1}\).

Construct the second Gaussian bandit instance with unit variance and mean vector

\[\mu’ = (\Delta, 0, 0, \dots, \underbrace{2\Delta}_{i\text{th}}, \dots, 0)\]

We abbreviate \(\mathbb{P}_{\mu} = p_{\nu_{\mu}, \pi}\), and denote expectations under \(\mathbb{P}_{\mu}\) by \(\E_{\mu}\).


\[R_T(\pi, \nu_{\mu}) = \sum_{i=1}^{K} \Delta_i \E_{\mu}[N_i(T)]\]

Using \(T = \sum_{i=1}^{K} \E_{\mu}[N_i(T)]\) and Markov inequality, we have

\[R_T(\pi, \nu_{\mu}) = \Delta\sum_{i=2}^{K} \E_{\mu}[N_i(T)] = \Delta \left( T-\E_{\mu}[N_1(T)]\right) \geq \Delta \left(\Pr\left[T-N_1(T) \geq \frac{T}{2}\right]\frac{T}{2} \right)\]

For \(\nu_{\mu’}\), we only consider the term where \(i=1\), we have

\[R_T(\pi, \nu_{\mu’}) = \sum_{i=1}^{K} \Delta_i \E_{\mu’}[N_i(T)] > \Delta\E_{\mu’}[N_1(T)] \geq \mathbb{P}_{\mu’}\left[N_1(T) > \frac{T}{2}\right] \frac{T\Delta}{2}\]

In summary, we get

\[R_T(\pi, \nu_{\mu}) \geq \mathbb{P}_{\mu}\left[N_1(T)\leq \frac{T}{2}\right] \frac{T\Delta}{2}, \quad \text{and } R_T(\pi, \nu_{\mu’}) > \mathbb{P}_{\mu’}\left[N_1(T) > \frac{T}{2}\right] \frac{T\Delta}{2}\]

Applying the Bretagnolle–Huber inequality,

\[\begin{split} R_T(\pi, \nu_{\mu}) + R_T(\pi, \nu_{\mu’}) &> \frac{T\Delta}{2} \left(\mathbb{P}_{\mu}\left[N_1(T)\leq \frac{T}{2}\right] + \mathbb{P}_{\mu’}\left[N_1(T) > \frac{T}{2}\right] \right)\\ &\geq \frac{T\Delta}{4} \exp \left( -D(\mathbb{P}_{\mu} \parallel \mathbb{P}_{\mu’}) \right) \end{split} \]

Then we use divergence decomposition,

\[D(\mathbb{P}_{\mu} \parallel \mathbb{P}_{\mu’}) = \E_{\mu}[N_i(T)] D(N(0,1) \parallel N(2\Delta,1)) = \E_{\mu}[N_i(T)] \frac{(2\Delta)^2}{2} \leq \frac{2T\Delta^2}{K-1}\]

Plugging it back, we get

\[R_T(\pi, \nu_{\mu}) + R_T(\pi, \nu_{\mu’}) \geq \frac{T\Delta}{4}\exp \left( -\frac{2T\Delta^2}{K-1} \right)\]

The proof follows by choosing \(\Delta=\sqrt{\frac{K-1}{4T}}\).

10 Stochastic Contextual Bandits #

10.1 Introduction to Stochastic Contextual & Linear Bandits #

We now consider the case where the best arm may be different in different rounds. E.g., for a e-commerce website, it may recommend different ads for different users, according to their “context” (age, gender, etc.). The interaction is like the following.

  1. The environment chooses context \(\{C_t\}_{t=1}^n\) with \(C_t \in \mathcal{C}\)
  2. For each round \(t=1,2, \dots, n\):
    1. Learner observes context \(C_t\) and chooses an arm \(A_t \in [K]\).
    2. Learner observes noisy reward \(X_t = r(C_t,A_t) + \eta_t\)

where \(r: \mathcal{C} \times [K] \to \RR\) is the reward function, unknown for the learner.

We also assume the noise \(\eta_t\) is conditionally \(\sigma\)-sub-Gaussian on \(\mathcal{F}_t\):

\[\E[e^{\lambda \eta_t} \mid \mathcal{F}_t] \leq e^{\frac{\lambda^2\sigma^2}{2}}\]

where \(\mathcal{F}_t = \sigma(C_1, A_1, C_2, A_2, \dots, C_t, A_t)\), the sigma-algebra generated by the observations so far.

Since the mean of the noise is conditionally 0, we have

\[\E[X_t \mid \mathcal{F}_t] = r(C_t, A_t)\]

The learner wants to maximize the reward by learning a policy \(\pi: \mathcal{C} \to [K]\). If he knows the reward function \(r\), he can simply calculate \(A_t^{* } = \argmax_{a \in [K]} r(C_t, a)\) and play arm \(A_t^{*}\) at round \(t\). Note that \(A_t^{ *}\) is a random variable because it depends on the context \(C_t\).

However, this is not possible because the learner doesn’t know \(r\). But we can still use this as our benchmark to define our regret.

\[R_T(\pi) = \E \left[ \sum_{t=1}^{n} \max_{a \in [K]} r(C_t, a) - \sum_{t=1}^{n} X_t \right]\]

One naive solution is to regard a context-action pair \((c, a)\) as an arm, so we will have \(|\mathcal{C}|\cdot K\) arms, then use stochastic bandit algorithms we discussed before. But the problem is that the number of arms may be too large.

The above setting is called stochastic contextual bandits.

Intuitively, if there is some correlations between different \(r(c, \cdot)\) and \(r(c’, \cdot)\), then once we know a reward for an arm, then we can expect to get similar reward for another similar arm. We will assume that:

\[r(c,a) = \langle \psi(c,a), \theta^{*} \rangle\]

where \(\psi: \mathcal{C} \times [K] \to \RR^d\) is called a feature map, which the learner knows. And \(\theta^{* } \in \RR^d\) is an unknown parameter vector, with \(\| \theta^{*} \|_2^2 \leq L\) (\(L\) is known).

This means \(R\) is Lipschitz:

\[|R(x,a)-R(x’,a’)| \leq \| \theta^{*} \|_2 \cdot \| R(x,a)-R(x’,a’) \|_2\]

This setting is called stochastic contextual linear bandits.

People realize that when the reward is linear in the feature vectors \(\set{\psi(c,a)}_{c,a}\), what is really important is not the action \(a\) itself, but the resulting feature vector \(\psi(c,a)\). So we would rather let the algorithms choose the feature vectors directly.

This results in the following setting, called stochastic linear bandit. We assume for each time \(t\), after the learner observes the context \(C_t\), a decision set \(\mathcal{A}_t \subset \RR^d\) is revealed. The “arm” that the learner has to choose is \(A_t \in \mathcal{A}_t\), and the learner receives reward:

\[X_t = \langle \theta^{*}, A_t \rangle + \eta_t\]

The regret is:

\[R_T(\pi) = \E\left[ \sum_{t=1}^{n} \max_{a \in \mathcal{A}_t} \langle a, \theta^{*} \rangle - \sum_{t=1}^{n} \langle A_t,\theta^{*}\rangle\right]\]


Different choices of \(\mathcal{A}_t\) lead to different settings. E.g., if \(\mathcal{A}_t = \{e_1, e_2, \dots, e_d\}\), the problem reduces to \(d\)-arm stochastic bandits, where each component of \(\theta^{*}\) is the mean of each arm. If \(\mathcal{A}_t = \{\psi(C_t, i): i \in [K]\}\). This is exactly stochastic contextual linear bandits. If \(\mathcal{A}_t \subseteq \set{0,1}^d\), this is combinatorial bandits/.

10.2 Regret Analysis of Stochastic Linear Bandits #

In finite-action stochastic bandits, we choose the action with the largest upper confidence bound. In the case of linear bandits, the idea remains the same. We will:

  1. Estimate \(\theta^{*}\)
  2. Find a confidence ball \(\theta^{*}\) in which lies with high probability

To estimate \(\theta^{*}\), we use the regularized least-squares estimator, which is

\[\hat{\theta}_t = \argmin_{\theta \in \RR^d} \left( \sum_{s=1}^{t} (X_s- \langle A_s, \theta\rangle)^2 + \lambda \|\theta\|_2^2\right)\]

where \(X_1, \dots, X_t\) with \(X_s = \langle A_s, \theta^{*}\rangle + \eta_t\) are the received rewards.

Solve this equation and get

\[\hat{\theta}_t = V_t^{-1} \sum_{s=1}^{t} X_s A_s\]

where \(V_0 = \lambda I\) and \(V_t = \lambda I + \sum_{s=1}^{t} A_s A_s^\mathsf{T}\) for \(t\geq1\).


To solve the equation, take the derivative with respect to vector \(\theta\).

Let \(f: \RR^n \to \RR\), the derivative of \(f(\vec{x})\) with respect to \(\vec{x}\) is a row vector:

\[\pdv{f}{\vec{x}} = \left( \pdv{f}{x_1}, \pdv{f}{x_1}, \dots, \pdv{f}{x_n} \right) = (\grad f )^\mathsf{T}\]

which is exactly the transpose of the gradient.

Now we have the estimator, next we need to find a confidence ball \(\mathcal{C}_t\) of it. \(\mathcal{C}_t \subset \RR^d\) should be constructed based on \((A_1, X_1, \dots, A_{t-1}, X_{t-1})\) and contain the unknown \(\theta^{*}\) with high probability. We construct as follows:

\[\mathcal{C}_t \subseteq \mathcal{E}_t = \left\{ \theta \in \RR^d: \left\|\theta-\hat{\theta}_{t-1}\right\|^2_{V_{t-1}} \leq \beta_t \right\}\]

where \(\|v\|_W^2 \triangleq v^\mathsf{T} W v\) is the squared norm of \(v\) with respect to metric \(W\). And \(\{\beta_t\}_t\) is an increasing sequence of constants with \(\beta_1\geq1\). The set \(\mathcal{E}_t\) is an ellipsoid centered at \(\hat{\theta}_{t-1}\) and with principle axis being the eigenvectors of \(V_{t-1}\) with corresponding lengths being the reciprocal of the eigenvalues. Notice that as \(t\) grows, the matrix \(V_t\) has increasing eigenvalues, which means the volume of the ellipse is also shrinking (at least, provided \(\beta_t\) does not grow too fast).

We make the following assumptions:

  1. \(|\langle d, \theta^{*}\rangle|\leq1, \forall d \in \bigcup_{t=1}^T \mathcal{D}_t\)
  2. \(\| a \|_2 < L, \forall a \in \bigcup_{t=1}^T \mathcal{A}_t\)
  3. \(\theta^{*} \in \mathcal{C}_t, \forall t\) with high probability
  4. \(\{\beta_t\}_t\) is increasing with \(\beta_1\geq1\)

The algorithm is similar to UCB: In round \(t\), the decision set \(\mathcal{A}_t\) is revealed. For all \(a \in \mathcal{A}_t\), define the UCB index as

\[\text{UCB}_t(a) = \max_{\theta \in \mathcal{C}_t} \langle a, \theta \rangle\]

Then choose \(A_t\) such that:

\[A_t = \argmax_{a \in A_t} \text{UCB}_t(a)\]

Note that there are two maximums in the RHS. We denote \((d_t, \tilde{\theta}_t)\) as the pair that maximizes it.

This algorithm is called LinUCB, LinREL, OFUL, etc.

Next we analyze the bound. We want to bound the pseudo-regret, which is a random variable.

\[\hat{R}_T(\pi) = \sum_{t=1}^{n} \max_{a \in \mathcal{A}_t} \langle a, \theta^{*} \rangle - \sum_{t=1}^{n} \langle A_t,\theta^{*}\rangle\]

We want to prove the bound on the event that \(\theta^{*} \in \mathcal{C}_t\) for all rounds \(t \in [n]\). This is called “high probability bound”.

We denote the regret incurred at round \(t\) as

\[\begin{split} R_t &= \max_{a \in \mathcal{A}_t} \inprod{a}{\theta^{*}} - \inprod{A_t}{\theta^{*}}\\ &\leq \inprod{A_t}{\tilde{\theta}_t} - \inprod{A_t}{\theta^{*}} \\ &= \inprod{A_t}{\tilde{\theta}_t - \theta^{*}}\\ &= \inprod{A_t}{\tilde{\theta}_t - \theta^{*} + \hat{\theta}_{t-1} - \hat{\theta}_{t-1}}\\ &= \inprod{A_t}{\tilde{\theta}_t - \hat{\theta}_{t-1}} + \inprod{A_t}{\hat{\theta}_{t-1} - \theta^{*}}\\ &= A_t^\mathsf{T} \left(V_{t-1}^{\frac{1}{2}}\right)^{-1} V_{t-1}^{\frac{1}{2}}\left( \tilde{\theta}_t - \hat{\theta}_{t-1} \right) + A_t^\mathsf{T} \left(V_{t-1}^{\frac{1}{2}}\right)^{-1} V_{t-1}^{\frac{1}{2}}\left( \hat{\theta}_{t-1} - \theta^{*} \right)\\ &\leq \left\| A_t^\mathsf{T} V_{t-1}^{-\frac{1}{2}} \right\| \left\| V_{t-1}^{\frac{1}{2}} \left( \tilde{\theta}_t - \hat{\theta}_{t-1} \right) \right\| + \left\| A_t^\mathsf{T} V_{t-1}^{-\frac{1}{2}} \right\| \left\| V_{t-1}^{\frac{1}{2}} \left( \hat{\theta}_{t-1} - \theta^{*} \right) \right\|\\ &= \left\| A_t \right\|_{V_{t-1}^{-1}} \cdot \left\| \tilde{\theta}_t - \hat{\theta}_{t-1} \right\|_{V_{t-1}} + \left\| A_t \right\|_{V_{t-1}^{-1}} \cdot \left\| \hat{\theta}_{t-1} - \theta^{*} \right\|_{V_{t-1}}\\ &\leq 2\left\| A_t \right\|_{V_{t-1}^{-1}} \sqrt{\beta_t} \end{split} \]

Where the first inequality is because according to the algorithm, \(A_t\) maximizes the UCB index, let \(\tilde{\theta}_t\) be the parameter for which \(\inprod{\tilde{\theta}_t}{A_t} = \text{UCB}_t(A_t)\).

On the other hand, since \(|\langle d, \theta^{*}\rangle|\leq1\), so \(R_t \leq 2\). Also because \(\beta_t\geq1\), we get

\[\begin{split} R_t &\leq 2\min \left\{1, \left\| d_t \right\|_{V_{t-1}^{-1}} \sqrt{\beta_t}\right\}\\ &\leq 2\min \left\{\sqrt{\beta_t}, \left\| d_t \right\|_{V_{t-1}^{-1}} \sqrt{\beta_t}\right\}\\ &= 2 \sqrt{\beta_t} \min \left\{1, \left\| d_t \right\|_{V_{t-1}^{-1}} \right\} \end{split} \]

So the pseudo-regret has bound:

\[\begin{split} \hat{R}_T(\pi)&= \sum_{t=1}^T R_t\\ &\leq \sqrt{T \sum_{t=1}^{T} R_t^2} \quad\text{(Cauchy-Schwarz with all-one vector)}\\ &\leq 2 \sqrt{T\beta_T \sum_{t=1}^{T} \min \left\{1, \left\| d_t \right\|^2_{V_{t-1}^{-1}} \right\}} \end{split} \]

Claim \(\sum\limits_{t=1}^{T} \min \left\{1, \left\| d_t \right\|^2_{V_{t-1}^{-1}} \right\} \leq 2\ln \left( \frac{\det(V_T)}{\det(V_0)} \right) \leq 2 d\ln \left( 1 + \frac{TL^2}{\lambda d} \right) \)


Using the fact: for all \(u \geq0\), \(\min\{u, 1\} \leq 2\ln(1+u)\), we get

\[\begin{split} \sum\limits_{t=1}^{T} \min \left\{1, \left\| d_t \right\|^2_{V_{t-1}^{-1}} \right\} &\leq 2\sum_{t=1}^{T} \ln \left( 1+ \left\| d_t \right\|^2_{V_{t-1}^{-1}}\right) \\ &= 2 \ln \prod_{t=1}^{T} \left( 1+ \left\| d_t \right\|^2_{V_{t-1}^{-1}} \right) \end{split} \]


\[V_t = V_{t-1} + d_t d_t^\mathsf{T} = V_{t-1}^{\frac{1}{2}} \left( I+V_{t-1}^{-\frac{1}{2}} d_t d_t^\mathsf{T} V_{t-1}^{-\frac{1}{2}} \right) V_{t-1}^{\frac{1}{2}}\]

We have

\[\det(V_t) = \det(V_{t-1}) \det\left( I+V_{t-1}^{-\frac{1}{2}} d_t d_t^\mathsf{T} V_{t-1}^{-\frac{1}{2}} \right) = \det(V_{t-1}) \left( 1+ \left\| d_t \right\|^2_{V_{t-1}^{-1}} \right)\]


\[\det(V_T) = \det(V_0) \prod_{t=1}^T \left( 1+ \left\| d_t \right\|^2_{V_{t-1}^{-1}} \right)\]

The first inequality follows by plugging this back.

For the second inequality, by the Inequality of arithmetic and geometric means,

\[\det(V_T) = \prod_{i=1}^d \lambda_i \leq \left( \frac{\sum_{i=1}^{d} \lambda_i}{d}\right)^d = \left( \frac{\tr(V_T)}{d}\right)^d \leq \left( \frac{\tr(V_0)+TL^2}{d} \right)^{d}\] where \(\lambda_1, \dots, \lambda_d\) are the eigenvalues of \(V_T\).

The proof follows by plugging in \(\det(V_0) = \lambda^{d}\).

Finally, we get the bound of pseudo-regret:

\[\hat{R}_T(\pi) \leq 2 \sqrt{2T \beta_T d \ln \left( 1+ \frac{TL^2}{\lambda d} \right)}\]

10.3 Construction of Confidence Ellipsoid #

We now construct the confidence ellipsoid \(\mathcal{C}_t\) that satisfies our assumptions.

We first build intuition by making some simplifying assumptions:

  1. No regularization: \(\lambda=0\). \(V_t\) are invertible.
  2. Independent subgaussian noise: \(\{\eta_s\}_{s\geq1}\) are independent and 1-subgaussian.
  3. Fixed design: \(d_1, \dots, d_t\) are deterministically chosen without the knowledge of \(r_1, \dots, r_t\).

Under these assumptions, we know that \[\begin{split} \hat{\theta}_t &= V_t^{-1} \sum_{s=1}^{t} d_s r_s\\ &= \left( \sum_{s=1}^{t} d_s d_s^\mathsf{T} \right)^{-1} \sum_{s=1}^{t} d_s \left( \inprod{d_s}{\theta^{*}}+\eta_s \right) \end{split} \]

And then for some vector \(d \in \RR^d\),

\[\begin{split} \inprod{d}{\hat{\theta}_t-\theta^{*}} &= \inprod{d}{\left( \sum_{s=1}^{t} d_s d_s^\mathsf{T} \right)^{-1} \left(\sum_{s=1}^{t} (d_s d_s^\mathsf{T}\theta^{*}+d_s \eta_s)\right) - \theta^{*}}\\ &= \inprod{d}{\left(\sum_{s=1}^{t} d_s d_s^\mathsf{T}\right)^{-1} \sum_{s=1}^{t} d_s\eta_s}\\ &= \sum_{s=1}^{t} \inprod{d}{V_t^{-1}d_s} \eta_s \end{split} \]

This is just a linear combination of the noise \(\set{\eta_s}_{s=1}^{t}\), which are 1-subgaussian. By the concentration bound of subgaussian R.V.s,

\[\Pr\left[ \inprod{d}{\hat{\theta}_t-\theta^{*}} \geq \sqrt{2\sum_{s=1}^{t}\inprod{d}{V_t^{-1}d_s}^2 \ln \left( \frac{1}{\delta} \right) } \right] \leq \delta\]

Note that \[\sum_{s=1}^{t} \inprod{d}{V_t^{-1}d_s}^2 = \sum_{s=1}^t d^\mathsf{T} V_t^{-1} d_s d_s^\mathsf{T} \left( V_t^{-1} \right)^\mathsf{T} d = d^\mathsf{T} V_t^{-1} V_t \left( V_t^{-1} \right)^\mathsf{T} d = \|d\|_{V_t^{-1}}^2\]


\[\Pr\left[ \inprod{d}{\hat{\theta}_t-\theta^{*}} \geq \sqrt{2 \|d\|_{V_t^{-1}}^2 \ln \left( \frac{1}{\delta} \right) } \right] \leq \delta\]

We want to construct \(\mathcal{C}_t\) such that \(\theta^{*} \in \mathcal{C}_t\) with high probability. By the definition of \(\mathcal{C}_t\), that means we want to bound \(\|\hat{\theta}_t-\theta^{*}\|_{V_t}\). Notice that

\[\|\hat{\theta}_t-\theta^{*}\|_{V_t} = \inprod{\hat{\theta}_t-\theta^{*}}{V_t^{\frac{1}{2}}X}\] where \(X = \frac{V_t^{\frac{1}{2}}(\hat{\theta}_t-\theta^{*})}{\| \hat{\theta}_t-\theta^{*}\|_{V_t}}\).

It seems like we can simply use the above bound by replacing \(d\) with \(V_t^{\frac{1}{2}}X\). However, \(X\) is a random quantity while what we have proven above is for deterministic \(d\). We will use a standard technique called covering argument.

We first identify finite set \(\mathcal{C}_{\epsilon} \subset \RR^{d}\), such that for whatever value \(X\) takes, there exists some \(y \in \mathcal{C}_{\epsilon}\) that is \(\epsilon\)-close to \(X\), i.e., \(\|X-y\|_2 \leq \epsilon\).

Definition \(S_d = \set{x \in \RR^{d}: \|x\|_2^2 = 1}\)

Lemma There exists a set \(\mathcal{C}_{\epsilon} \subset \RR^d\) with \(|\mathcal{C}_{\epsilon}| \leq \left( \frac{3}{\epsilon} \right)^d\) such that \(\forall x \in S_d\), \(\exists y \in \mathcal{C}_\epsilon\), such that \(\|x-y\|_2 \leq \epsilon\). (Proof skipped)

To use this lemma, we need to verify that \(X \in S_d\). This is easy to check that \(\|X\|_2^2=X^\mathsf{T}X=1\).

We define an event:

\[E = \set{\exists x \in \mathcal{C}_{\epsilon}: \inprod{V_t^{\frac{1}{2}}x}{\hat{\theta}_t-\theta^{*}} \geq \sqrt{2 \ln \left( \frac{|\mathcal{C}_{\epsilon}|}{\delta} \right)}}\]

By the union bound, and the fact that \(\|V_t^{\frac{1}{2}}x\|_{V_t^{-1}} = \|x\|_2 = 1\), we show \(E\) happens with low probability.

\[\begin{split} \Pr \left[ E \right] &\leq \sum_{x \in \mathcal{C}_{\epsilon}} \Pr\left[\inprod{V_t^{\frac{1}{2}}x}{\hat{\theta}_t-\theta^{*}} \geq \sqrt{2 \ln \left( \frac{|\mathcal{C}_{\epsilon}|}{\delta} \right)}\right] \\ &\leq \sum_{x \in \mathcal{C}_{\epsilon}} \frac{\delta}{|\mathcal{C}_{\epsilon}|} = \delta \end{split} \]

Cauchy-Schwarz shows that

\[\begin{split} \inprod{V_t^{\frac{1}{2}}x}{\hat{\theta}_t-\theta^{*}} &= \inprod{x}{\left(V_t^{\frac{1}{2}}\right)^\mathsf{T}\left(\hat{\theta}_t-\theta^{*}\right)}\\ &\leq \|x\| \left\|\left(V_t^{\frac{1}{2}}\right)^\mathsf{T}\left(\hat{\theta}_t-\theta^{*}\right)\right\| \end{split} \]

with equality only if \(x = t \cdot \left(V_t^{\frac{1}{2}}\right)^\mathsf{T}\left(\hat{\theta}_t-\theta^{*}\right)\) for some \(t >0\). Since \(\|x\|=1\), we can get \(x = \frac{V_t^{\frac{1}{2}} \left(\hat{\theta}_t-\theta^{*}\right)}{\left\|V_t^{\frac{1}{2}} \left(\hat{\theta}_t-\theta^{*}\right)\right\|}\), which maximizes the above equation.

Therefore when \(E\) doesn’t occur (which is the case with probability at least \(1-\delta\)), we have

\[\begin{split} \left\|\hat{\theta}_t-\theta^{*}\right\|_{V_t} &= \max_{x \in S_d} \inprod{V_t^{\frac{1}{2}}x}{\hat{\theta}_t-\theta^{*}}\\ &= \max_{x \in S_d} \min_{y \in \mathcal{C}_{\epsilon}} \left[\inprod{V_t^{\frac{1}{2}}(x-y)}{\hat{\theta}_t-\theta^{*}} + \inprod{V_t^{\frac{1}{2}}y}{\hat{\theta}_t-\theta^{*}}\right]\\ &< \max_{x \in S_d} \min_{y \in \mathcal{C}_{\epsilon}} \left[ \left\|\hat{\theta}_t-\theta^{*}\right\|_{V_t} \|x-y\|_2 + \sqrt{2 \ln \left( \frac{|\mathcal{C}_{\epsilon}|}{\delta} \right)}\right]\\ &\leq \epsilon\left\|\hat{\theta}_t-\theta^{*}\right\|_{V_t} + \sqrt{2 \ln \left( \frac{|\mathcal{C}_{\epsilon}|}{\delta} \right)} \end{split} \]

Rearranging yields the following holds with high probability (at least \(1-\delta\)).

\[\left\|\hat{\theta}_t-\theta^{*}\right\|_{V_t} \leq \frac{1}{1-\epsilon}\sqrt{2 \ln \left( \frac{|\mathcal{C}_{\epsilon}|}{\delta} \right)}\]

11 Adversarial Contextual Bandits #

11.1 Contextual Bandits: One Bandit per Context #

For \(t=1,2, \dots, T\)

  1. Learner observes context \(c_t \in \mathcal{C}\) (\(\mathcal{C}\) is finite)

  2. Environment assigns rewards \(\set{x_{t,i}}_{i \in [K]}\)

  3. Learner selects distribution \(P_t\) on \([K]\), and sample \(I_t \sim P_t\)

  4. Learner observes reward \(X_t = x_{t,I_t}\)

    Assuming that the set \(\mathcal{C}\) of all possible contexts is finite, one simple approach then is to set up a separate bandit algorithm, such as Exp3, for each context.

The regret is defined as comparing the rewards collected by the learner with the rewards collected by the best context-dependent policy in hindsight:

\[\begin{split} R_T(\pi) &= \E\left[\sum_{c \in \mathcal{C}} \max_{i \in [K]} \sum_{\substack{t=1\\c_t=c}}^{T} x_{t,i} - \sum_{t=1}^{T} X_t \right]\\ &= \E\left[\sum_{c \in \mathcal{C}} \max_{i \in [K]} \sum_{\substack{t=1\\c_t=c}}^{T} \left(x_{t,i} - X_t\right) \right] \end{split} \]

For a particular context \(c\), we define

\[R_{T,c}(\pi) = \E\left[\max_{i \in [K]} \sum_{\substack{t=1\\c_t=c}}^{T} \left(x_{t,i} - X_t\right) \right]\]

Use the results of Exp3 algorithm, we can bound

\[R_{T,c}(\text{Exp3}) \leq 2 \sqrt{K\log(K) \sum_{t=1}^{T} \1_{\set{c_t=c}}}\]

So for the regret

\[\begin{split} R_T(\text{Exp3}) &= \sum_{c \in \mathcal{C}} R_{T,c}(\text{Exp3})\\ &=2 \sum_{c \in \mathcal{C}}\sqrt{K\log(K) \sum_{t=1}^{T} \1_{\set{c_t=c}}}\\ &\leq 2 \sqrt{|\mathcal{C}| \sum_{c \in \mathcal{C}} \left( K\log(K) \sum_{t=1}^{T} \1_{\set{c_t=c}} \right) }\\ &= 2\sqrt{|\mathcal{C}| TK \log(K)} \end{split} \]

The inequality is due to Cauchy–Schwarz Inequality with an all-one vector. Although it is indeed sublinear in \(T\), if \(|\mathcal{C}|\) is large, the regret bound can be very bad. Also, this algorithm needs to run a separate instance of Exp3 for each context, which is expensive.

Fortunately, however, it is seldom the case that the context set is both large and unstructured. Very often the contexts themselves have some kind of internal structure that we may hope to exploit. In the following, we will add some assumptions to the context set.

We first rewrite the regret. Let \(\Phi\) be the set of all functions from \(\mathcal{C}\to[K]\), then

\[R_T(\pi) = \E\left[\max_{\phi \in \Phi} \sum_{t=1}^{T} \left( x_{t, \phi(c_t)}-X_t \right)\right]\]

Method 1 (Partitions) Let \(\mathcal{P} \subset 2^{\mathcal{C}}\) be a partition of \(\mathcal{C}\), then the sets in \(\mathcal{P}\) are disjoint and \(\bigcup_{P \in \mathcal{P}} P = \mathcal{C}\). And we assume that functions in \(\Phi\) are constant on each part in \(\mathcal{P}\).

In this case, we can run a version of Exp3 for each part, which means the regret depends on the number of parts \(|\mathcal{P}|\) rather than on the number of contexts. So the regret becomes:

\[R_T(\text{Exp3}) \leq 2 \sqrt{|\mathcal{P}| TK \log(K)}\]

Method 2 (Similarity Functions) The intuition is that if two contexts are similar, then probably they prefer the same arm.

Define similarity functions: \(s: \mathcal{C} \times \mathcal{C} \to [0,1]\), which measure the similarity between pairs of contexts. Then for a particular \(\phi \in \Phi\), define the average dissimilarity:

\[D_s(\phi) = \frac{1}{|\mathcal{C}|^2} \sum_{c,d \in \mathcal{C}} (1-s(c,d)) \1_{\set{\phi( c) \neq \phi(d)}}\]

And we assume that \(\Phi = \set{\phi: \mathcal{C}\to [K]: D_s(\phi)\leq \theta}\), for some user-tuned threshold \(\theta \in (0,1)\).

Method 3 (Bandits with Expert Advice) Find a collection of predictors \(\Phi=\set{\phi_1, \dots, \phi_M}\). Given a context, each expert reports a probability distribution over the arms. In each round, the predictions of the \(M\) experts can be represented by a \(M \times K\) matrix \(E^{(t)}\), where the \(m\)-th row \(E^{(t)}_m\) is the probability vector suggested by expert \(m\). The settings:

For \(t=1,2, \dots, T\)

  1. Learner observes prediction of all experts \(E^{(t)} \in [0,1]^{M \times K}\)
  2. Learner selects a distribution \(P_t\) on \([K]\)
  3. Action \(I_t\) is sampled from \(P_t\) and the reward is \(X_t=x_{t,I_t}\)

The regret is defined as

\[R_T= \E\left[\max_{m \in [M]} \sum_{t=1}^{T} E_m^{(t)} x_t - \sum_{t=1}^{T} X_t\right]\]

We will focus on this method. The algorithm we will use is Exp4.

11.2 Exp4 (Exponential Weights for Exploration and Exploitation for Experts) Algorithm #

The accumulation of the experts’ rewards is not very obvious. Note that the denominator of \(Q_{t,i}\) only depends on \(t\), so

\[\begin{split} Q_{t+1, i} &= \frac{\exp\left(\eta \tilde{X}_{t,i}\right)Q_{t,i}}{\sum_{j=1}^{M} \exp\left(\eta \tilde{X}_{t,j}\right)Q_{t,j}}\\ &= \frac{\exp\left(\eta \tilde{X}_{t,i}\right)\exp\left(\eta \tilde{X}_{t-1,i}\right)Q_{t-1,i}}{\sum_{j=1}^{M} \exp\left(\eta \tilde{X}_{t,j}\right)\exp\left(\eta \tilde{X}_{t-1,j}\right)Q_{t-1,j}}\\ &= \dots\\ &= \frac{\exp\left(\eta \sum_{s=1}^{t} \tilde{X}_{s,i}\right)}{\sum_{j=1}^{M} \exp\left(\eta \sum_{s=1}^{t} \tilde{X}_{s,j}\right)} \end{split} \]

The regret bound:

Theorem Let \(\gamma=0\) and \(\eta=\sqrt{\frac{2 \log(M)}{TK}}\). Assume that the experts are deterministic and oblivious, then \[R_T(\text{Exp4})\leq \sqrt{2TK \log (M)}\]

Deterministic: Every time an expert is given the same context, it will output the same probability distribution.

Oblivious: Whatever the regard we get, the experts will not change their probability distributions.

Lemma For any \(m^{*} \in [M]\), it holds that \[\sum_{t=1}^{T} \tilde{X}_{t,m^{*}}-\sum_{t=1}^{T} \sum_{m=1}^{M} Q_{t,m}\tilde{X}_{t,m} \leq \frac{\log(M)}{\eta}+\frac{\eta}{2} \sum_{t=1}^{T} \sum_{m=1}^{M} Q_{t,m} \left( 1-\tilde{X}_{t,m} \right)^2\]


Let \(\tilde{Y}_{t,i}=\sum_{s=1}^{t} \tilde{X}_{s,i}\) and \(W_t=\sum_{j=1}^{M} \exp \left( \eta \tilde{Y}_{t,j} \right)\). The remaining proof is same as Theorem 11.2 in the textbook.

12 Adversarial Linear Bandits #

The learner is given an action set \(\mathcal{A} \subset \RR^d\). The environment selects a sequence of loss vectors \(y_1, \dots, y_T \in \RR^d\).

For \(t=1,2, \dots, T\)

  1. Learner selects an action \(A_t \in \mathcal{A}\)
  2. Learner observes loss \(Y_t = \inprod{A_t}{y_t}\)

The learner doesn’t know the loss vector \(y_t\). The regret is

\[R_T = \E \left[ \sum_{t=1}^{T} Y_t \right]-\min_{a \in \mathcal{A}} \sum_{t=1}^{T} \inprod{a}{y_t}\]

If we choose \(\mathcal{A}=\set{e_1,e_2, \dots, e_d}\), this is exactly the finite-armed adversarial bandits.

We make some assumptions:

  1. For any \(y \in \RR^d\), \(\sup_{a \in \mathcal{A}} |\inprod{a}{y}|\leq1\)
  2. The action set \(\mathcal{A}\) spans \(\RR^d\)

12.1 Exponential Weights for Linear Bandits #

Similar to Exp3, we need to estimate the loss for each action. Assume the loss estimate for action \(a \in \mathcal{A}\) in round \(s \in [T]\) is \(\hat{Y}_s(a)\), then the probability distribution proposed by exponential weights \(\tilde{P}_t:\mathcal{A}\to[0,1]\), is given by \[\tilde{P}_t(a) \propto \exp \left( -\eta \sum_{s=1}^{t-1} \hat{Y}_s(a) \right)\] where \(\eta>0\) is the learning rate. To control the variance of the loss estimates, it will be useful to mix this distribution with an exploration distribution \(\pi\). The mixture distribution is

\[P_t(a)=(1-\gamma)\tilde{P}_t(a)+\gamma \pi(a)\]

Our \(A_t\) will be drawn from \(P_t\).

We estimate the loss vector \(y_t\) using least squares: \(\hat{Y}_t=R_tA_tY_t=R_tA_t A_t^\mathsf{T}y_t\), where \(R_t \in \RR^{d \times d}\) is selected deterministically so that this is an unbiased estimator given the history. Denote \(\E_t[\cdot] = \E[\cdot \mid P_t]\).

\[\begin{split} \E_t \left[ \hat{Y}_t \right] &= R_t \E_t \left[ A_tA_t^\mathsf{T} \right] y_t\\ &=R_t \underbrace{\left( \sum_{a \in \mathcal{A}} P_t(a) a a^\mathsf{T} \right)}_{Q_t} y_t \end{split} \]

Letting \(R_t = Q_t^{-1}\) makes it a unbiased estimator.

Theorem There exists an exploration distribution \(\pi\) and parameter \(\gamma,\eta\), such that for any \(\set{y_t}_{t \in [T]}\):

\[R_T(\text{Exp3-Lin})\leq 2 \sqrt{3dT \log(K)}\]

Theorem (Kiefer–Wolfowitz) Assume \(\rm span(\mathcal{A}) = \RR^d\), the followings are equivalent.

  1. \(\pi^{*}\) is a minimizer of \(g\)
  2. \(\pi^{*}\) is a maximizer of \(f(\pi)=\log (\det(V(\pi)))\)
  3. \(g(\pi^{*})=d\)

Furthermore, there exists a minimizer \(\pi^{*}\) of \(g\) such that \(|\rm Supp (\pi^{*})|\leq \frac{d(d+1)}{2}\).

13 Pure Exploration #

E.g., Researchers want to find the most promising drug. They do experiments on mice, and they don’t care how many mice they cost. In this case, we no longer need to balance exploration against exploitation, since exploring is free.

13.1 Simple Regret #

Since the exploration in round 1 to round \(T\) is free, we only care about round \(T+1\). We define simple regret:

\[R_T^{S}(\pi, \nu) = \E \left[ \mu^{*}-\mu_{I_{T+1}} \right]\]

We start by analysing the Uniform Exploration (UE) policy, where for \(t=1, \dots, T\) it explores deterministically by choosing \(I_t = 1+(t \mod K)\) and recommends the empirically best arm in round \(T+1\), i.e., \(I_{T+1} = \argmax_{i \in [K]} \hat{\mu}_i(T)\).

We again assume arm 1 is the optimal arm and \(X_{i,s}-\mu_i\) is 1-subgaussian. If arm \(i\) is chosen in round \(T+1\), that means \(\hat{\mu}_i(T) \geq \hat{\mu}_1(T)\). The probability is

\[\begin{split} &\Pr\left[\hat{\mu}_i(T) - \hat{\mu}_1(T) \geq 0\right]\\ =& \Pr\left[\frac{\sum_{s=1}^{T} \1_{\set{I_s=i}}X_{I_s,s}}{N_i(T)} - \frac{\sum_{s=1}^{T} \1_{\set{I_s=1}}X_{I_s,s}}{N_1(T)} + \Delta_i\geq \Delta_i\right]\\ =& \Pr\left[\frac{\sum_{s=1}^{T} \1_{\set{I_s=i}}X_{I_s,s} - N_i(T)\mu_i}{N_i(T)} - \frac{\sum_{s=1}^{T} \1_{\set{I_s=1}}X_{I_s,s}-N_1(T)\mu_1}{N_1(T)} \geq \Delta_i\right]\\ =& \Pr\left[\underbrace{\frac{\sum_{s=1}^{T} \1_{\set{I_s=i}}\left(X_{I_s,s} - \mu_i \right)}{N_i(T)} - \frac{\sum_{s=1}^{T} \1_{\set{I_s=1}}\left(X_{I_s,s}-\mu_1\right)}{N_1(T)}}_{\sqrt{\frac{1}{N_i(T)}+\frac{1}{N_1(T)}}-\text{subgaussian}} \geq \Delta_i\right]\\ \end{split} \]

Using the property of subgaussian and \(N_i(T) \geq \lfloor \frac{T}{K} \rfloor\), we have

\[\Pr\left[\hat{\mu}_i(T) - \hat{\mu}_1(T) \geq 0\right] \leq \exp \left( -\frac{\lfloor \frac{T}{K} \rfloor \Delta_i^2}{4} \right)\]

For some \(\Delta\geq0\), the definition of the simple regret yields

\[\begin{split} R_T^S(\pi,\nu)&=\sum_{i=1}^{K} \Delta_i \Pr\left[I_{T+1}=i\right]\\ &= \sum_{i: \Delta_i\leq \Delta} \Delta_i \Pr\left[I_{T+1}=i\right] + \sum_{i: \Delta_i > \Delta} \Delta_i \Pr\left[I_{T+1}=i\right]\\ &\leq \Delta + \sum_{i: \Delta_i > \Delta} \Delta_i \Pr\left[I_{T+1}=i\right]\\ &\leq \Delta + \sum_{i: \Delta_i > \Delta} \exp \left( -\frac{\lfloor \frac{T}{K} \rfloor \Delta_i^2}{4} \right)\Delta_i\\ &\leq \Delta + \sum_{i: \Delta_i > \Delta} \exp \left( -\frac{\lfloor \frac{T}{K} \rfloor \Delta^2}{4} \right)\Delta_i \end{split} \]

Taking the minimum over all \(\Delta\geq0\), we get

\[R_T^S(\text{UE},\nu) \leq \min_{\Delta\geq0}\left(\Delta + \sum_{i: \Delta_i > \Delta} \exp \left( -\frac{\lfloor \frac{T}{K} \rfloor \Delta^2}{4} \right)\Delta_i \right)\]

So as \(T \to \infty\), the simple regret converges to zero exponentially fast.

On the other hand, if \(T\) is fixed and \(\nu\) can vary, then we can derive a problem-independent bound by choosing \(\Delta=2 \sqrt{\log(K)/\lfloor T/K \rfloor}\). There exists a universal constant \(C>0\) such that for all \(\nu\)

\[R_T^S(\text{UE}, \nu) \leq C \sqrt{\frac{K \log(K)}{T}}\]

There exists another policy that can remove the \(\log(K)\) factor in the simple regret bound.

Proposition Let \(\pi=\set{\pi_t}_{t=1}^T\) be a policy. Define

\[\pi_{T+1}(i \mid a_1, x_1, \dots, a_T, x_T)=\frac{1}{T}\sum_{t=1}^{T} \1_{\set{I_t=i}}\] Then the simple regret of \(\set{\pi_t}_{t=1}^{T+1}\) satisfies

\[R_T^S(\set{\pi_t}_{t=1}^{T+1}, \nu) = \frac{R_T(\pi, \nu)}{T}\]

where \(R_T(\pi,\nu)\) is the cumulative regret.


\[\begin{split} R_T(\pi, \nu) &= \E\left[\sum_{i=1}^{T} \Delta_i N_i(T)\right]\\ &=\E\left[\sum_{i=1}^{T} \Delta_i \frac{N_i(T)}{T}\right]T\\ &=\E\left[\sum_{i=1}^{T} \Delta_i \pi_{T+1}(i \mid a_1, x_1, \dots, a_T, x_T)\right]T\\ &= \E\left[\E\left[\Delta_{I_{T+1}}\right]\right]T\\ &= n R_T^S(\set{\pi_t}_{t=1}^{T+1}, \nu) \end{split} \]

Corollary For all \(T\) there exists a policy \(\pi\) such that for all \(\nu\),

\[R_T^S(\pi,\nu)\leq C \sqrt{\frac{K}{T}}\]

13.2 Best-Arm Identification with a Fixed Confidence #

Best-arm identification is a variant of pure exploration where the learner is rewarded only for identifying an exactly optimal arm. In the fixed confidence setting, the learner is given a confidence level \(\delta \in (0,1)\) and should use as few samples as possible to output an arm that is optimal with probability at least \(1-\delta\).

In the fixed confidence setting, the learner chooses a policy \(\pi=\set{\pi_t}_{t=1}^{\infty}\) as normal. The number of rounds is not fixed in advance, however, the learner chooses a stopping time \(\tau\) adapted to filtration \(\mathbb{F} = \set{\mathcal{F}_t}_{t=0}^{\infty}\) with \(\mathcal{F}_t=\sigma(I_1, X_1, \dots, I_t, X_t)\). The learner also chooses a \(\mathcal{F}_{\tau}\)-measurable random variable \(\psi \in [K]\). The stopping time represents the time when the learner halts and \(\psi\) is the recommended action, which by the measurability assumption only depends on \(I_1, X_1, \dots, I_{\tau}, X_{\tau}\).

Definition A triple \((\pi, \tau, \psi)\) is sound at confidence level \(\delta \in (0,1)\) for environment class \(\mathcal{E}\) if for all \(\nu \in \mathcal{E}\),

\[\Pr_{\nu,\pi}[\tau<\infty \text{ and } \Delta_{\psi}(\nu)>0]\leq \delta\]

The objective in fixed confidence best-arm identification is to find a sound learner for which \(\E_{\nu,\pi}[\tau]\) is minimized over environments \(\nu \in \mathcal{E}\).

Two algorithms: KL-LUCB and lil-UCB (lil stands for “Law of Iterated Logarithm). Details are skipped.

13.3 Lower Bounds #

Let \(\mathcal{E}\) be an arbitrary set of \(K\)-armed stochastic bandit environments, and for \(\nu \in \mathcal{E}\) define the optimal arm:

\[i^{*}(\nu) = \argmax_{i \in [K]} \mu_i(v)\] and the set of bandits in \(\mathcal{E}\) with different optimal arms than \(\nu\):

\[\mathcal{E}_{\text{alt}}(\nu) = \set{\nu’ \in \mathcal{E}: i^{*}(\nu’) \cap i^{*}(\nu) = \emptyset}\]

Theorem Assume that \((\pi, \tau, \psi)\) is sound for \(\mathcal{E}\) at confidence level \(\delta \in (0,1)\), and let \(\nu \in \mathcal{E}\). Then \(\E_{\nu,\pi}[\tau]\geq c^{*}(\nu) \log \left( \frac{1}{4\delta} \right)\), where

\[\frac{1}{c^{*}(\nu)} = \sup_{\alpha \in \mathcal{P}_{K-1}} \left( \inf_{\nu’ \in \mathcal{E}_{\text{alt}}(\nu)} \left( \sum_{i=1}^{K} \alpha_i D(\nu_i, \nu_i’) \right) \right)\]

with \(c^{*}(\nu)=\infty\) when \(c^{*}(\nu)^{-1}=0\).


If \(\E_{\nu,\pi}[\tau]=\infty\), the result is trivial. So next we assume \(\E_{\nu,\pi}[\tau] < \infty\).

Let \(\nu’ \in \mathcal{E}_{\text{alt}}(\nu)\), and define event \(E=\set{\tau <\infty \text{ and } \psi \not\in i^{*}(\nu’)}\). Then by the definition of soundness,

\[\begin{split} 2\delta &\geq \Pr_{\nu\pi}\left[\tau<\infty\text{ and } \psi \not\in i^{*}(\nu)\right] + \Pr_{\nu’\pi}\left[\tau<\infty\text{ and } \psi \not\in i^{*}(\nu’)\right]\\ &\geq \Pr_{\nu\pi}\left[ E^c \right] + \Pr_{\nu’\pi} \left[ E \right]\\ &\geq \frac{1}{2} \exp \left( -\sum_{i=1}^{K} \E_{\nu\pi}[N_i(\tau)] D(\nu_i, \nu_i’) \right) \end{split} \] The second inequality holds because

\[\begin{split} E^c &= \set{\tau=\infty} \cup \set{\psi \in i^{*}(\nu’)}\\ &\subseteq \set{\tau=\infty} \cup \set{\psi \not\in i^{*}(\nu)}\\ &= \set{\tau<\infty \text{ and }\psi \not\in i^{*}(\nu)} \end{split} \]

where the last equality holds because \(\E_{\nu,\pi}[\tau] < \infty \Rightarrow \Pr_{\nu\pi}\left[\tau<\infty\right]=1\).

Therefore we have

\[\begin{split} \frac{\E_{\nu\pi}[\tau]}{c^{*}(\nu)} &= \E_{\nu\pi}[\tau]\sup_{\alpha \in \mathcal{P}_{K-1}} \left( \inf_{\nu’ \in \mathcal{E}_{\text{alt}}(\nu)} \left( \sum_{i=1}^{K} \alpha_i D(\nu_i, \nu_i’) \right) \right)\\ &\geq \E_{\nu\pi}[\tau] \inf_{\nu’ \in \mathcal{E}_{\text{alt}}(\nu)} \left( \sum_{i=1}^{K} \frac{\E_{\nu\pi}[N_i(\tau)]}{\E_{\nu\pi}[\tau]} D(\nu_i, \nu_i’) \right)\\ &= \inf_{\nu’ \in \mathcal{E}_{\text{alt}}(\nu)} \left( \sum_{i=1}^{K} \E_{\nu\pi}[N_i(\tau)] D(\nu_i, \nu_i’) \right)\\ &\geq \log(\frac{1}{4\delta}) \end{split} \]

14 Some Questions #

Some of my own questions that I came up with.

14.1 Relationship between \(\sigma\)-fields and “knowledge” #

There is a relationship between \(\sigma\)-algebras and the intuitive idea of “knowledge”. We think of \(\mathcal{F}\) as describing the “information” we have at our disposal for each \(A \in \mathcal{F}\). After the observation, we know whether or not each \(A\) has occurred, \(\E\left[X \mid \mathcal{F}\right]\) is then our “best guess” of the value of \(X\) given the information we have.

Note that the knowledge of the elements of a \(\sigma\)-field will not help you in knowing what has happened, it only provides all possible events for which you will be able to decide whether they happened or not. E.g., if a \(\sigma\)-field contains an event “Monday will be warm”, then it must also contains the complementary event “Monday will not be warm”. But there is no contradiction, only after the observation on Monday will you decide whether it was warm or not.

14.2 What is \(\sigma(I_1, X_1, \dots, I_t, X_t)\)? #

A sigma field generated by a R.V. \(\sigma(X)\) is the smallest \(\sigma\)-field on which \(X\) is measurable:

\[\sigma(X) = \{\{\omega \in \Omega, X(\omega) \in B\}: B \in \mathcal{B}\}\]

Intuitively, \(\sigma(I_1, X_1, \dots, I_t, X_t)\) is just a sigma field that contains all possible events that can happen from time 1 to \(t\). A R.V. is measurable in this sigma-field means that the R.V. is well-defined after time \(t\).

14.3 What is the expectation w.r.t. the learner/environment #

In Chapter 4.4 The Regret, the regret is defined as \[R_n(\pi, \nu) = n\mu^{*}(\nu) - \E\left[\sum_{t=1}^{n} X_t\right]\] where the expectation is taken with respect to the probability measure on outcomes induced by the interaction of the learner’s policy \(\pi\) and the bandit randomness \(\nu\). So this is a real number.

We can change the expectation to be taken w.r.t. only \(\pi\), i.e., the learner:

\[R_n(\pi, \nu) = n\mu^{*}(\nu) - \E\left[\sum_{t=1}^{n} \mu_{A_t}\right]\]


\[\begin{split} \E\left[\sum_{t=1}^{n} X_t\right] &= \E\left[\sum_{t=1}^{n} \sum_{a \in \mathcal{A}} X_t \1_{\set{A_t=a}} \right]\\ &= \sum_{t=1}^{n} \sum_{a \in \mathcal{A}} \E\left[X_t \1_{\set{A_t=a}} \right]\\ &= \sum_{t=1}^{n} \sum_{a \in \mathcal{A}} \E\left[X_t \mid A_t=a\right] \Pr\left[A_t=a\right]\\ &= \sum_{t=1}^{n} \sum_{a \in \mathcal{A}} \mu_a \Pr\left[A_t=a\right]\\ &= \sum_{t=1}^{n} \E\left[\mu_{A_t}\right] = \E\left[\sum_{t=1}^{n} \mu_{A_t}\right] \end{split} \]

The pseudo-regret is defined without the expectation, so it is a random variable.

\[\bar{R}_n(\pi, \nu) = n\mu^{*}(\nu) - \sum_{t=1}^{n} \mu_{A_t}\]

14.4 What is conditional probability conditioning on a discrete R.V.? #

In Chapter 11.2 Importance-Weighted Estimators, the probability of playing arm \(i\) in round \(t\) is defined as a conditional probability:

\[P_{ti} = \Pr\left[ I_t=i \mid I_1,X_1, \dots, I_{t-1}, X_{t-1}\right]\]

This is in fact a random variable, see wikipedia.

Basically, \(\Pr\left[A \mid X\right]\) is a R.V. that represents an outcome of \(\Pr[A\mid X=x]\) whenever a value \(x\) of \(X\) is observed.

15 Regret Bounds Summary #

MAB: Stochastic Bandits

SLB: Stochastic Linear Bandits

\(n\): time horizon

\(K\): number of arms

\(d\): dimension

Model Assumption Regret
MAB subgaussian \(\Theta(\sqrt{nK})\)
SLB subgaussian \(\widetilde{\Theta}(d\sqrt{n})\)

Links to this note