Tuesday, October 23, 2012

Stochastic Gradient Descent with only ONE Projection

With the increasing amount of data that is available for training, it becomes an urgent task to devise efficient algorithms for optimization/learning problems with unprecedented sizes. Online learning algorithms, such as celebrated Stochastic Gradient Descent (SGD) and its counterpart Online Gradient Descent (OGD), despite of their slow rate of convergence compared with the batch methods, have shown to be very effective for large scale and online learning problems, both theoretically and empirically. Although a large number of iterations is usually needed to obtain a solution of desirable accuracy, the lightweight computation per iteration makes SGD attractive for many large-scale learning problems. To find a solution within the domain $\K$ that optimizes the given objective function $f(\x)$, SGD computes an unbiased estimate of the gradient of $f(\x)$, and updates the solution %for complex domains, such as a positive semidefinite (PSD) cone in metric learning and bounded trace norm matrices in matrix completion by moving it in the opposite direction of the estimated gradient. To ensure that the solution stays within the domain $\mathcal{K}$, SGD has to project the updated solution into the $\K$ at {\it every iteration}. For complex domains, such as a positive semidefinite (PSD) cone in metric learning and bounded trace norm matrices in matrix completion, the projection step requires solving an expensive convex optimization, leading to a high computational cost per iteration and consequently making SGD unappealing for large-scale optimization problems over such domains. For instance, projecting a matrix into a PSD cone requires computing the full eigen-decomposition of the matrix, whose complexity is cubic in the size of the matrix. The question is can we somehow reduce the number of projection steps? Our recent work solves this problem and will appear in NIPS 2012.

No comments:

Post a Comment