Tuesday, January 8, 2013

Unified Algorithms for Stochastic and Adversarial Online Learning

Algorithms for online decision making independently have been developed in three different settings as:
  • Stochastic Rewards
  • Adversarial Rewards
  • Variation based Regret Bound
Usually, algorithms such as UCB and Hedge attain $\log T$ and $O(\sqrt{T})$ regret bound in stochastic and adversarial settings, respectively, which are known to be optimal and tight. But, unfortunately all these algorithms are developed for that specific setting and can not be applied to other setting. For example, when UCB is provided with adversarial rewards or Hedge with stochastic rewards, they can not obtain the optimal bound known for that setting. 

Algorithms with variation based bound tries to bridge the gap from adversarial setting to stochastic setting by developing algorithms with regret bound bounded by the variation of cost/reward vectors (check Hazan and Kale's work at COLT 2008 and our work at COLT 2012). So, it has been an unresolved question that is there any algorithm that works well for both settings. Such an algorithm, needs to be adaptive to the uncertainty in the sequence of rewards and is expected to attain $\log T$ regret bound when encountered with stochastic rewards and $O(\sqrt{T})$ regret bound for adversarial rewards. 

Recently, the FlipFlop algorithm which is a combination of Follow-the-Leader (FTL) and AdaptiveHedge algorithms made a progress towards solving this issue but still there is a room to develop better algorithm. The FlipFlop algorithm retains $O(\sqrt{T})$ regret bound in worst case, but has better guarantees for stochastic setting and it has been shown that its performance should never be substantially worse than that of FTL algorithm.

Challenging Problem: developing efficient unified algorithm that attains $O(\log T)$ and $O(\sqrt{T})$ regret bounds in stochastic and adversarial settings, respectively. The proposed algorithm is not necessarily a combination of algorithms from both settings and would be a novel algorithm with guaranteed bounds as mentioned.