Valentin Reis

Online Convex Optimization with Haskell

This page illustrates the basics of Online Convex Optimization (OCO) with a Haskell implementation. More precisely, we will use the Online Gradient Descent (OGD) algorithm [5] to build an L2-constrained online regressor. In turn, we will apply this regressor to the classic iris [2] dataset that ships with the standard R distribution. If you wish to read the full source code, you may open OCO.hs for the learner implementation and r.hs for the experiments. The code snippets below are taken from these sources. The important snippets (OCO.hs) are written in Haskell 2010 with no extra extensions and few imports; as such, there are no prerequisites beyond basic familiarity with Haskell. The file r.hs is written in more modern Haskell, but its content is of lesser interest.

This page goes from the generic to the specific. We’ll start with a generic, abstract learning algorithm encoded with a polymorphic implementation, and finish with experiments. Let’s dive in.

The Online Convex Programming framework [5] (the terms “Optimization” and “Programming” are interchangeable in this context, owing to the military history of Linear Programming in the western bloc) is the following:

Given a known convex set F \subset \mathbb{R}^n , at every round t=0,\ldots a learner:
\quad Selects an action w^t \in F
\quad Observes a convex loss function c^t : F \rightarrow \mathbb{R}
\quad Incurs a loss c^t(w^t)

This broad setting may be specialized to many tasks, such as regression, classification, clustering, ranking, collaborative filtering, structured prediction, principal component analysis, and others. The action w can sometimes be thought of as the parameter of the model - although there will sometimes be a distinction. Let’s go ahead and take a look at the standard algorithm for OCO in the abstract; we’ll choose a modeling strategy for F and c^t later.

Online Gradient Descent

This algorithm is a non-adaptive constrained online gradient descent that uses a projection step. More formally, the original OGD algorithm works by picking w^0 arbitrarily in F and choosing:

w^{t+1} = P(w^t - \eta_t \nabla c^t (w^t)) \qquad(1)

Where P(w) = \text{argmin}_{y \in F} \left\| w - y \right\| is the projection operator onto F .

As long as we can encode a problem by specifying a set F for which we can implement a tractable projection operator P , and a convex function c^t whose gradient \nabla c^t we can compute as a function of w^t , we can use the OGD algorithm. It then remains to choose a sequence \eta_t of learning rates. The rate \eta_t is essentially the sole hyperparameter of the algorithm. Often, some variables involved in the problem construction (more precisely, in the choice of F or c^t ) may be referred to as hyperparameters as well.

Let’s start encoding this gradient step in Haskell right away, before discussing its behavior. We’ll make the non-experimental code as generic as reasonably feasible, and the typeclasses from the linear Haskell package will help. These typeclasses allow us to use the code from this post seamlessly on vectors, association lists or any custom data structure that represents a vector space.

Linear.Vector exports Additive, a typeclass exposing vector space operations for various vector-like memory representations. Linear.Metric exports Metric, a typeclass exposing dot products, norms and the like.

import Linear.Vector (Additive, (^-^), (^/), (*^), (^*))
import Linear.Metric (Metric, norm, dot)

Let’s begin by writing out useful type synonyms for the types of \nabla c^t , P and \eta_t , respectively naming them LossDerivative, Projection and LearningRate. We’ll use those throughout the rest of the code.

-- | Derivative of a loss function w.r.t an action.
type LossDerivative v a =
 v a -> -- ^ action/parameter 
 v a    -- ^ descent direction and magnitude

-- | Projection operator.
type Projection v a = 
  v a ->  -- ^ vector to project
  v a     -- ^ projection

-- | Learning rate schedule.
type LearningRate a = 
  Int ->  -- ^ Round counter
  a       -- ^ Rate value

We are now ready to write the actual OGD gradient step. We need a datatype to maintain the state of the algorithm: we’ll call it Ogdstate.

-- | Algorithm state for OGD. Encodes both the round count and the last action.
data Ogdstate v a = Ogdstate  
  { t :: Int         -- ^ round counter
  , weights :: (v a) -- ^ parameter/action weights
  }

-- | Smart constructor
initialOGDState :: v a -> Ogdstate v a
initialOGDState w = Ogdstate 1 w

The code for the OGD update step in eq. 1 is rather simple:

-- | The OGD algorithm.
ogd :: (Additive v, Floating a) =>
  (LearningRate a)       -> -- ^ Learning rate schedule
  (Projection v a)       -> -- ^ Projection operator
  (Ogdstate v a)         -> -- ^ Last state/action 
  (LossDerivative v a)   -> -- ^ Derivative of the loss at this round
  (Ogdstate v a)            -- ^ New state/action
ogd rate projection (Ogdstate t w) dc = Ogdstate {
    t = t + 1,
    weights = projection (w ^-^ ( rate t *^ dc w)) 
  }
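Before moving on, here is a tiny, hypothetical usage sketch of this polymorphic interface (every value in it is made up for illustration): ten rounds of unconstrained descent on list-based vectors, using the identity as a trivial projection and the gradient of the fixed quadratic loss c(w) = \left\| w - (1,1) \right\|^2 / 2 .

-- | Ten rounds of OGD on a made-up quadratic loss centered at [1,1],
-- with no constraint (identity projection); the iterate converges to [1,1].
demoOgd :: Ogdstate [] Double
demoOgd = foldl step (initialOGDState [0, 0]) (replicate 10 ())
  where
    step s () = ogd (\t -> 1 / fromIntegral t) id s (\w -> w ^-^ [1, 1])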

We now have the ogd gradient descent step nailed down in its most general form. Before moving on with an application to a concrete problem, we still have to finish covering the essentials of the original OGD paper by discussing the choice of \eta_t .

OGD-based learners work by minimizing their regret at round T against all adversaries in a comparison class. There are various ways to refer to the “comparator”: some texts use the game-theory influenced term adversary, while others speak of a comparison class; this page uses the terms best action in hindsight or best parameter in hindsight. In other words, OGD-based learners achieve good performance with respect to the single best action in hindsight (meaning: observed a-posteriori). The original OGD paper [5] defines the regret (this is what is usually referred to as the external regret) as:

R_F(T) = C_T - C_{\text{best}} \qquad(2)

Where:

C_T = \sum_{t=1}^{T} c^t (w^t) \\ C_{\text{best}} = \min_{w \in F} \sum_{t=1}^{T} c^t (w)

If F is bounded, closed, and nonempty, and the norm of the gradient of c is bounded on F , the main theorem from the paper gives a regret bound against a static adversary, stated below as eq. 5. Before getting to it, a short aside: using the fixed learning rate \eta_t = \eta , we can instead minimize a notion of dynamic regret. For this specific bound, the comparison class is the set A(T,L) of slowly moving action sequences. It is defined in terms of the path length of a sequence, which is just the sum of the displacement distances between subsequent rounds; the comparison class is then the set of action sequences whose path length is smaller than L :

R_A(T,L) = \sum_{t=1}^{T} c^t (w^t) - \min_{(w_a^t)_{t=1 \ldots T} \in A(T,L)} \sum_{t=1}^{T} c^t (w_a^t) \qquad(3)

The fixed learning rate gives us the following bound:

R_A(T,L) \leq \frac{7 \left\|F\right\|^2 }{4 \eta} + \frac{L \left\|F\right\| }{\eta} + \frac{ T \eta \left\|\nabla c \right\|^2}{2} \qquad(4)

The learning rate implementation we’d need in this case is simply const.
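A minimal sketch wrapping that constant schedule to match the LearningRate synonym above (the name constRate is ours; Prelude’s const already has the right shape):

-- | Constant learning rate schedule: ignores the round counter.
constRate :: a -> LearningRate a
constRate = const

With that aside done, here is the static-adversary bound from the main theorem: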

R_F(T) \leq \frac{\left\|F\right\|^2 }{2 \eta_T} + \frac{\left\|\nabla c \right\|^2}{2} \sum_{t=1}^{T} \eta_t \qquad(5)

Where \left\|F\right\| is the diameter of F and \left\|\nabla c\right\| is the maximum norm of the gradient of c on F . Notice the summation: we need a rate schedule whose partial sums grow sublinearly in T . A simple choice is \eta_t= \frac{1}{\sqrt{t}} , whose partial sums are easily bounded by comparison with an integral: \sum_{t=1}^{T} \frac{1}{\sqrt{t}} \leq 1 + \int_{1}^{T} \frac{dt}{\sqrt{t}} = 2\sqrt{T} - 1 . Replacing the summation with this bound illustrates the sublinearity of the regret bound:

R_F(T) \leq \frac{\left\|F\right\|^2 \sqrt{T}}{2} + \left( \sqrt{T} - \frac{1}{2} \right) \left\|\nabla c \right\|^2 \qquad(6)

Assuming \left\|F\right\| and \left\|\nabla c\right\| to be independent of T , this bound is essentially the reason why OGD is a useful algorithm. Indeed, notice that the average regret R_F(T)/T vanishes as T grows. In other words, the performance gap with respect to the best action in hindsight is upper-bounded by a quantity that diminishes with every round.

We’ll use the scaled rate \eta_t = \frac{\eta}{\sqrt{t}} , which implies a more tunable bound. Its implementation is simple:

-- | Inverse square root learning rate schedule.
rootRate :: (Floating a) => a -> LearningRate a
rootRate eta t = eta / sqrt (fromIntegral t)
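
As a quick sanity check of this schedule (made-up values), the rates at rounds 4 and 100 are easy to verify by hand: \eta / \sqrt{4} = \eta / 2 and \eta / \sqrt{100} = \eta / 10 .

-- | Sanity check of the schedule on made-up values.
checkRootRate :: Bool
checkRootRate = rootRate (1.0 :: Double) 4 == 0.5
             && rootRate (1.0 :: Double) 100 == 0.1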

This rate gives rise to a slight modification of eq. 6, showcasing the \eta hyperparameter:

R_F(T) \leq \frac{\left\|F\right\|^2 \sqrt{T}}{2\eta} + \frac{\eta \left\|\nabla c \right\|^2 \left( 2\sqrt{T} - 1 \right) }{2} \qquad(7)

We now have enough OGD theory; let’s carry on with some modeling work.

Linear Regression

Let’s keep the model simple - we use a classical task, namely linear regression. The online regression set-up is the following:

\forall t=0,\ldots the learner:
\quad Observes a feature vector x^t \in \mathbb{R}^n
\quad Selects a prediction \widehat{y^t} \in \mathbb{R}
\quad Observes the regression target y^t \in \mathbb{R}
\quad Incurs a loss l(\widehat{y^t},y^t) convex in both its arguments.

We can already write a type for the derivative of that loss function l :

-- | Differential of a regression loss
type RegressionLossDerivative v a =
  v a ->               -- ^ x^t
  a   ->               -- ^ y^t
  (LossDerivative v a) -- ^ derivative of c^t

We can also immediately provide a more regression-friendly wrapper for the previously implemented ogd interface:

-- | Regression fit call
type FitReg v a =
  v a          ->  -- ^ Feature vector x^t
  a            ->  -- ^ Regression target y^t
  Ogdstate v a ->  -- ^ Old learner state
  Ogdstate v a 

-- | Generates a fitter for one regression data point using OGD.
fitOGDReg :: (Additive v, Floating a) =>
  (LearningRate a) ->               -- ^ Learning rate
  (Projection v a)  ->              -- ^ Projection operator
  (RegressionLossDerivative v a) -> -- ^ Regression loss derivative
  FitReg v a                        -- ^ Fitting function
fitOGDReg rate projection differential x y state =
  ogd rate projection state $ differential x y

Still rather polymorphic, but this code clearly shows that this online regression set-up is just disguised OCO.

Linear predictor and Gaussian losses, a.k.a picking c^t

We make the classical restrictive design choices for the learner. First, we choose a constrained linear model, with a parameter vector in a set F \subset \mathbb{R}^n . Second, we commit to choosing w^t before observing x^t . Under those restrictions, our online regression framework is:

\forall t=0,\ldots the learner
\quad Selects an action w^t \in F
\quad Observes a feature vector x^t \in \mathbb{R}^n
\quad Predicts \widehat{y^t} = {w^{t}}^T x^t
\quad Observes regression target y^t \in \mathbb{R}
\quad Incurs loss c^t(w^t) = l( {w^{t}}^T x^t ,y^t)

First, we implement our linear predictor \widehat{y^t} = {w^{t}}^T x^t , which gives rise to the predictReg prediction function:

-- | Prediction function
predictReg :: (Metric v, Floating a) =>
  v a ->            -- ^ Feature vector
  Ogdstate v a ->   -- ^ Learner
  a                 -- ^ Prediction
predictReg x (Ogdstate _ w) = w `dot` x

Second, we pick the convex loss we’re interested in to be the “Gaussian” square loss function.

c^t(w) = \left( y^t - w^{T} x^t \right)^2 \quad y \in \mathbb{R}, x \in \mathbb{R}^n

In practice, OGD only needs the gradient of that function. Because we’re using a linear predictor, that is simply (dropping the constant factor 2, which can be absorbed into the learning rate):

\nabla c^t (w) = x^t \left( w^T x^t - y^t \right)

-- | Derivative of the square loss function.
mseDiff :: (Metric v, Floating a) => RegressionLossDerivative v a
mseDiff x y w = x ^* ( w `dot` x - y)
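
As a quick sanity check with made-up numbers (relying on the list instances from linear): with x = (1,2) , y = 1 and w = (0.5, 0.5) , the prediction is w^T x = 1.5 , so the gradient should be 0.5 \, x = (0.5, 1) .

-- | Sanity check of the square loss gradient on made-up numbers.
checkMseDiff :: Bool
checkMseDiff = mseDiff [1, 2] 1 [0.5, 0.5] == ([0.5, 1.0] :: [Double])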

That’s it: we now have an interface for feeding an instance to a regression model (FitReg v a) and for predicting a value (predictReg).

The ridge constraint, a.k.a picking F

We’ll consider the \ell_2 ball of radius \lambda as a constraint set.

We’re not really going to reap the systematic benefits of this \ell_2 ball (which mostly come about with a large number of features) in our experiments. An appropriately chosen radius will, however, help control the initial gradient jumps.

F = \left\{ x \in \mathbb{R}^n \; \middle| \; \left\|x\right\| \leq \lambda \right\}

This choice of F is a breeze to implement. The \ell_2 projection is straightforward:

P(w) = \begin{cases} w & \text{if}\ \left\|w\right\| \leq \lambda \\ \lambda \frac{w}{\left\|w\right\|} & \text{otherwise} \end{cases}

-- | Argument name for Lambda.
newtype Lambda a = Lambda a

-- | Projection on a L2 Ball.
l2Projection :: (Metric v, Ord a, Floating a) =>
  Lambda a ->    -- ^ radius
  Projection v a -- ^ projection operator
l2Projection (Lambda lam) w 
   | q <= lam = w
   | otherwise = lam *^ w ^/ q
 where q = norm w
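
A quick sanity check on made-up numbers: (3,4) has norm 5, so its projection onto the unit \ell_2 ball should be (0.6, 0.8) , while a vector already inside the ball is left untouched.

-- | Sanity check of the L2 projection on made-up numbers.
checkL2Projection :: Bool
checkL2Projection =
     l2Projection (Lambda 1) [3, 4]     == ([0.6, 0.8] :: [Double])
  && l2Projection (Lambda 1) [0.3, 0.4] == ([0.3, 0.4] :: [Double])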

Now that we have all the necessary ingredients, we can write fitRidge, an interface wrapper that builds an online ridge regressor.

-- | Online ridge regressor via OGD.
fitRidge :: (Metric v, Ord a, Floating a) =>
  LearningRate a -> -- ^ Rate schedule
  Lambda a ->       -- ^ Constraint set radius
  FitReg v a
fitRidge rate lambda = fitOGDReg rate (l2Projection lambda) mseDiff
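
As a small, hypothetical usage sketch (the data points and hyperparameter values below are made up; the list instances of Additive and Metric come from linear), a ridge regressor is driven by folding fitRidge over a stream of (feature vector, target) pairs and querying it with predictReg:

-- | Two made-up rounds of online ridge regression on 2-dimensional
-- list-based vectors, followed by a prediction on a new feature vector.
demoRidge :: (Ogdstate [] Double, Double)
demoRidge = (finalState, predictReg [1, 1] finalState)
  where
    fit        = fitRidge (rootRate 0.7) (Lambda 1)
    stream     = [([1, 0], 0.5), ([0, 1], 0.2)]
    finalState = foldl (\s (x, y) -> fit x y s) (initialOGDState [0, 0]) stream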

Let’s conclude the modeling part by using everything we know to make eq. 7 more concrete. The diameter of F depends on our choice of \lambda (namely \left\|F\right\|=2\lambda ) and the magnitude of \left\|\nabla c\right\| depends on the regression data:

\left\|\nabla c\right\| = \max_{w \in F,\, t} \left\| x^t \left( w^T x^t - y^t \right) \right\| \leq \left\|x\right\| \left(\lambda \left\|x\right\| + \left\|y\right\| \right)

where \left\|x\right\| = \max_t \left\|x^t\right\| and \left\|y\right\| = \max_t \left\|y^t\right\| are data-dependent quantities. Substituting, we have the following data-dependent regret bound for this regression task:

R_F(T) \leq \frac{2 \lambda^2 \sqrt{T}}{\eta} + \frac{ \eta \left( 2\sqrt{T} - 1 \right) (\lambda \left\|x\right\|^2 + \left\|x\right\|\left\|y\right\| )^2 }{2} \qquad(8)

The learner is now fully implemented; let’s play with a toy dataset.

Experiments with iris

Let’s play with this learner using a subset of the iris dataset, which is available with the standard R distribution. Some of the code below involves the use of the inline-r package, so the snippets are slightly verbose, but the point should get across fine. We’ll try to learn petal width in an online fashion, based on the other numerical attributes from the dataset. Let’s peek.

  iris <- select(iris,-c(Species))
  nrow(iris)
  summary(iris)
[1] 150
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  

Of course, we’re not really supposed to look at the data in advance, since we are considering an online learning process. In practice, however, we have to pick values for \eta and \lambda . We’ll proceed in the following way:

  • \lambda : That one is tricky. If we expect that the model shouldn’t be too crazy, the norm of the unit vector in the dimension at hand should suffice: \lambda = 1 .

  • \eta : Even more tricky. You surely noticed that eq. 8 involves some parameters that depend on the dataset. We might be tempted to minimize the value of the bound at the end of the learning process by picking the right \eta . We’d need values for T , \left\|x\right\| , \left\|y\right\| , and \lambda . We can pick T = 150 if we take the length of our online learning process to be the entire dataset. We can also pick \left\|x\right\| = \left\|y\right\| = 1 by assuming the dataset to be normalized (which we will enforce below). If the data source weren’t normalized, a domain expert could help determine those. If we minimize that bound in \eta by finding the root of the derivative of the expression in eq. 8, we arrive at the following expression:

\eta = \frac{2 \lambda \sqrt{\frac{\sqrt{T}}{2 \sqrt{T} - 1}}}{\left\|{x}\right\| \left(\lambda \left\|{x}\right\| + \left\|{y}\right\|\right)} = 0.721998072401013

Good to know we can select \eta based on a “worst case” bound in a more or less principled manner - but as we’ll see below, this value will not be so helpful in practice.
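
To make that computation reproducible, here is a small sketch evaluating the closed form above (the helper name etaStar is ours):

-- | Bound-minimizing learning rate from the closed form above.
-- With lambda = 1, |x| = |y| = 1 and T = 150 it evaluates to
-- roughly 0.722, matching the value quoted above.
etaStar :: Double -> Double -> Double -> Double -> Double
etaStar lam normX normY bigT =
  2 * lam * sqrt (sqrt bigT / (2 * sqrt bigT - 1))
    / (normX * (lam * normX + normY))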

Here’s the value of C_{\text{best}} we can achieve. Note the -1 at the end of the symbolic form inside the lm call: it removes the intercept from the model; for simplicity’s sake, all models on this page are truly linear (not affine). Also note that the snippet rescales the dataset to the \left[0,1\right] range:

  iris = as.data.frame(reshape::rescaler(iris,type="range"))
  sm = summary(lm(
    formula = Petal.Width ~ Sepal.Width + Sepal.Length + Petal.Length - 1, 
    data = iris
    ))
  mean((sm$residuals)^2)
[1] 0.006941823

We’ll use this average squared error when showing the behavior of our algorithm (the plots below use it as a red dashed line). But first, let’s write a state monad for a one-pass learning process that saves all predictions into a sequence:

-- | The "game state" that we will use to run our one-pass procedure
data GameState v a = GameState 
  { predicted :: Seq a
  , ogdState :: Ogdstate v a
  } deriving (Generic)

-- | a one pass procedure: given a fitting function and a dataset,
-- this function returns a state monad that performs one pass of 
-- online learning on the dataset. 
onePass :: 
  ( Metric v      -- ^ The container for the vector space
  , Traversable t -- ^ The container for the dataset
  , Floating a )  -- ^ The floating point type
  => (FitReg v a)              -- ^ the fitting function
  -> t (v a, a)                -- ^ the dataset
  -> State (GameState v a) ()  -- ^ the output state monad
onePass fit dataset = for_ dataset $ \(x,y) -> do
  #ogdState %= fit x y                  -- descent
  yhat <- uses #ogdState (predictReg x) -- prediction
  #predicted <>= pure yhat              -- history update

Second, we write the plot1pass function that feeds one pass of the dataset to OGD and generates a plot of the running average squared regression error C_t/t :

plot1pass :: (MonadR m) => 
  (Double,Double,Double) -- ^ a-posteriori best loss, eta, lambda (for plotting)
  -> (FitReg [] Double)  -- ^ fit function
  -> [([Double],Double)] -- ^ dataset
  ->  m (SomeSEXP (PrimState m))
plot1pass (best, eta, lam) fit dataset = [r| 
    d <- data.frame(seq(1, length(ys_hs), 1), ys_hs, predicted_hs)
    names(d) = c("t", "label", "predicted")
    ggplot(d, aes(x = t, y = cummean((predicted-label)^2))) +
      geom_point(size=0.4) + ylim(0,0.05)+ xlab("$t$") + ylab("$C_t/t$") +
      geom_hline(yintercept = best_hs, linetype="dashed", color = "red")
  |]
 where
   ys = snd <$> dataset
   (GameState (toList -> predicted) _) =
       execState (onePass fit dataset) $ GameState 
         { predicted = Data.Sequence.Empty
         , ogdState = initialOGDState [0,0,0]
         }

Third, we write an experiment function that acquires data and feeds it to plot1pass. We’ll use Sepal.Length, Sepal.Width, and Petal.Length to build the x^t vectors (xs in the snippet below), and Petal.Width to build y^t (ys in the snippet below). Note that this code shuffles the dataset and normalizes all quantities to \left[0,1\right] .

experiment :: (MonadR m) => 
  (Double,Double,Double) -- ^ a-posteriori best loss, eta, lambda (for plotting)
  -> (FitReg [] Double)  -- ^ fit function
  ->  m (SomeSEXP (PrimState m))
experiment meta fit = do
  sepalLength <- [r|iris$Sepal.Length|]
  sepalWidth  <- [r|iris$Sepal.Width |] 
  petalLength <- [r|iris$Petal.Length|]
  petalWidth  <- [r|iris$Petal.Width |]
  let xs = (\a b c -> [a,b,c] :: [Double])
        <$> p sepalLength
        <*> p sepalWidth 
        <*> p petalLength
  let (ZipList dataset) = (,) <$> coerce xs <*> ZipList (shoveSEXP petalWidth)
  plot1pass meta fit (shuffle' dataset (length dataset) (mkStdGen 1))
 where p = ZipList . shoveSEXP
       shoveSEXP (fromSomeSEXP -> l) = 
         l <&> \e -> (e - minimum l) / (maximum l - minimum l)

We run the one-pass learning experiment using the value for \eta we selected above and plot the resulting evolution of the running average squared error. The red dashed line represents the best attainable value C_{\text{best}}/T , as given by the squared residuals of the batch linear regression performed above.

plotRegret :: (MonadR m) => Double -> m (SomeSEXP (PrimState m))
plotRegret bestLoss = 
  experiment meta $ fitRidge (rootRate eta) (Lambda lam)
    where meta@(_,eta,lam) = (bestLoss, 0.721, 1)

As hinted at previously, it turns out that the value for \eta coming from the bound above is not optimal for this dataset in terms of the performance achievable a-posteriori. Indeed, the bound we’ve been manipulating is not tight; it is in fact so loose that we couldn’t really plot it alongside the regret here without squeezing the axis too hard for legibility. There exist values of \eta that perform better; see for instance the behavior of \eta = 2.5 :

plotRegret2 :: (MonadR m) => Double -> m (SomeSEXP (PrimState m))
plotRegret2 bestLoss = 
  experiment meta $ fitRidge (rootRate eta) (Lambda lam)
    where meta@(_,eta,lam) = (bestLoss, 2.5, 1)

Why is this better? With no further assumptions, regret bounds describe the worst-case behavior. This dataset, however, is very well behaved - adding stochastic assumptions would lead to a tighter analysis in the form of convergence rates. Principled learning rate selection is a difficult problem, for which there are many methods, often referred to under the umbrella terms of adaptive or parameter-free methods. As a conclusion, here is the list of operations that were performed to make the algorithm work.

  • \lambda was hand-picked to control the contribution of the initial jumps to the average objective. Note the role the \ell_2 ball plays here: make it too large, and the initial iterations become jumpy enough to degrade the overall performance. See for instance \lambda=100 ; fitRidge (rootRate 2.5) (Lambda 100):
  • \eta was hand-picked to make the algorithm converge fast. Choosing the rate is one of the painful points when running gradient descent algorithms. See [1] and Francesco Orabona’s webpage.
  • Scaling: All features have been scaled to [0,1] ; a minimal sketch of this rescaling is given after this list. Without this operation, the algorithm’s performance would be degraded. That wasn’t really crucial here, since the features from iris more or less all have the same range. It is worth noting, however, that the convergence of OGD is greatly slowed down when features have different magnitudes. See [4], [3] and [1] for a discussion of this aspect.
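
For reference, here is a minimal Haskell sketch of this per-feature min-max rescaling (it mirrors what the shoveSEXP helper in r.hs does; the name rescale01 is ours and it assumes a non-constant column):

-- | Rescale a feature column to the [0,1] range (min-max normalization).
-- Assumes the column is not constant.
rescale01 :: [Double] -> [Double]
rescale01 l = map (\e -> (e - lo) / (hi - lo)) l
  where
    lo = minimum l
    hi = maximum l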

Code

Here are the source files used to generate this page:

  • OCO.hs for the learner implementation part.
  • r.hs for the experiments and graphs.

References

[1] Duchi, J. et al. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research. 12, Jul (2011), 2121–2159.

[2] Henderson, H.V. and Velleman, P.F. 1981. Building multiple regression models interactively. Biometrics. (1981), 391–411.

[3] Luo, H. et al. 2016. Efficient second order online learning via sketching. CoRR. abs/1602.02202, (2016).

[4] Ross, S. et al. 2013. Normalized online learning. CoRR. abs/1305.6646, (2013).

[5] Zinkevich, M. 2003. Online convex programming and generalized infinitesimal gradient ascent. Proceedings of the 20th international conference on machine learning. (2003), 928–936.