This post progresses from the basic Online Convex Optimization (OCO) ideas to an online learner implementation. More precisely, we'll use the Online Gradient Descent (OGD) algorithm [5] to obtain an online ridge regression technique, which we will apply to the classic `iris` [2] dataset. Code snippets are tangled and evaluated to produce this page; see the "Code" section at the end of this post.

The Online Convex Programming (OCP) [5] framework is the following:

\(\forall t=0,\cdots\) the learner

\(\quad\) Selects an action \(w^t \in F \subset \mathbb{R}^n\), a convex set.

\(\quad\) Observes a convex loss function \(c^t : F \rightarrow \mathbb{R}\)

\(\quad\) Incurs a loss \(c^t(w^t)\)

This setting may be specialized to many tasks, such as regression, classification, clustering, ranking, collaborative filtering, structured prediction, principal component analysis, etc. The *action* \(w\) can sometimes be thought of as the *parameter* of the model - although there will sometimes be a distinction. Let’s go ahead and take a look at our first algorithm for OCP - we’ll specialize the framework to something useful afterwards.

There are many useful variants of M. Zinkevich's original Online Gradient Descent (OGD) algorithm. This post keeps it simple and implements the "canonical" OGD method itself: a non-adaptive, constrained online gradient descent that uses a projection step. More formally, the original OGD algorithm works by picking \(w^0\) arbitrarily in \(F\) and choosing:

\[ w^{t+1} = P(w^t - \eta_t \nabla c^t (w^t)) \qquad(1)\]

Where \(P(w) = \text{argmin}_{y \in F} \| w - y \|\) is the projection operator onto \(F\).

As long as we can encode a problem by specifying (and computing) a projection operator \(P\) and a gradient of \(c^t\) at point \(w^t\) for all \(t=1,...\), we can use the OGD algorithm. Then, it remains to choose a sequence \(\eta_t\) of learning rates.
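To make update (1) concrete, here is a minimal standalone sketch (hypothetical names, plain `Double`s rather than the typeclasses used below) of one projected step on the 1-D loss \(c(w) = (w - 3)^2\) with \(F = [-1, 1]\):

```haskell
-- One projected gradient step, specialized to 1-D for illustration.
-- 'projectInterval' plays the role of P; the loss is c(w) = (w - 3)^2.
projectInterval :: (Double, Double) -> Double -> Double
projectInterval (lo, hi) w = max lo (min hi w)

-- | w^{t+1} = P(w^t - eta * dc/dw), cf. equation (1).
ogdStep1D :: Double -> Double -> Double
ogdStep1D eta w = projectInterval (-1, 1) (w - eta * grad w)
  where grad v = 2 * (v - 3)  -- gradient of (w - 3)^2
```

With \(\eta = 0.1\) and \(w = 0.5\), the unconstrained step lands exactly on the boundary of \(F\); with a larger rate, the projection clips the iterate back to 1.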

I’m trying to make the non-experimental code for these posts as generic as reasonably feasible, and the typeclasses from the `linear` Haskell package will help. These typeclasses allow us to use the code from this post seamlessly on vectors, association lists, or any custom data structure that represents a vector space.

```
{-# LANGUAGE RankNTypes #-}
module OCO where
```

```
-- From Linear.Vector, we get the Additive typeclass.
-- Additive exposes vector subtraction and scalar multiplication/division
-- for arbitrary memory representations.
import Linear.Vector (Additive, (^-^), (^/), (*^), (^*))
-- We'll need Linear.Metric later on for dot products and squared norms.
import Linear.Metric (Metric, norm, dot)
```

We’ll call the type of a loss function differential `LossDerivative` and the type for projections `Projection`. We’ll essentially need these three types in order to define a learner:

```
-- | Derivative of a loss function w.r.t the action w.
type LossDerivative v a =
  v a -> -- ^ w
  v a    -- ^ dc/dw

-- | Projection operator.
type Projection v a =
  v a -> -- ^ input vector
  v a    -- ^ output vector

-- | Learning rate schedule.
type LearningRate a =
  Int -> -- ^ Round counter
  a      -- ^ Rate value
```

The goal of the learner is to minimize its regret at round \(T\) against all adversaries in a comparison class. This means that we’ll be benchmarking the algorithm with respect to a single best adversary in *hindsight*. The original OGD paper [5] introduces the static and the dynamic cases. Let’s go over those.

```
-- | Inverse square root learning rate.
rootRate :: (Floating a) => a -> LearningRate a
rootRate eta t = eta / sqrt (fromIntegral t)
```
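As a quick standalone sanity check on the schedule's shape (a local monomorphic copy, just for illustration), the rate halves whenever the round counter quadruples:

```haskell
-- Standalone copy of the schedule for illustration: eta / sqrt t.
rootRate' :: Double -> Int -> Double
rootRate' eta t = eta / sqrt (fromIntegral t)
-- map (rootRate' 1) [1, 4, 16] evaluates to [1.0, 0.5, 0.25]
```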

Using the learning rate \(\eta_t = \frac{\eta}{\sqrt{t}}\), we can minimize the average regret relative to a static action:

\[ R_F(T) = \sum_{t=1}^{T} c^t (w^t) - \min_{w \in F} \sum_{t=1}^{T} c^t (w) \qquad(2)\]

If \(F\) is bounded, closed, nonempty, and the norm of the gradient of \(c\) is bounded on \(F\), we get the following regret bound.

\[ R_F(T) \leq \frac{\|F\|^2 \sqrt{T}}{2} + \left( \sqrt{T} - \frac{1}{2} \right) \|\nabla c \|^2 \qquad(3)\]

Where \(\|F\|\) is the diameter of \(F\) and \(\|\nabla c\|\) is the maximum norm of the gradient of \(c\) on \(F\). That bound is sub-linear in \(T\), so the average regret vanishes: the algorithm actually learns the best static action.
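To see the sub-linearity numerically, here is a standalone evaluation of bound (3) with illustrative constants \(\|F\| = 2\) and \(\|\nabla c\| = 1\) (arbitrary choices, just for this sketch); dividing by \(T\) shows the bound on the average regret shrinking:

```haskell
-- Bound (3) with ||F|| = 2 and ||grad c|| = 1 (illustrative values).
staticRegretBound :: Double -> Double
staticRegretBound t = (2 ** 2) * sqrt t / 2 + (sqrt t - 0.5) * 1 ** 2

-- Average-regret bound at two horizons:
-- staticRegretBound 100 / 100     ~ 0.295
-- staticRegretBound 10000 / 10000 ~ 0.02995
```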

Using the fixed learning rate \(\eta_t = \eta\), we can minimize a notion of dynamic regret. For this specific bound, the comparison class is the set \(A(T,L)\) of slowly moving action sequences. It is defined in terms of the path length of a sequence, which is just the sum of the distances between actions at subsequent rounds. The comparison class is the set of sequences of actions whose path length is smaller than \(L\).

More precisely, this dynamic regret is:

\[ R_A(T,L) = \sum_{t=1}^{T} c^t (w^t) - \min_{(w_a^t)_{t=1 \cdots T} \in A(T,L)} \sum_{t=1}^{T} c^t (w_a^t) \qquad(4)\]

And the fixed learning rate gives us the following bound:

\[ R_A(T,L) \leq \frac{7 \|F\|^2 }{4 \eta} + \frac{L \|F\| }{\eta} + \frac{ T \eta \|\nabla c \|^2}{2} \qquad(5)\]
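For a concrete feel of the comparison class, the path length of an action sequence is easy to compute; a standalone 1-D sketch:

```haskell
-- Path length of a 1-D action sequence: summed displacement
-- between consecutive rounds.
pathLength :: [Double] -> Double
pathLength ws = sum (zipWith (\a b -> abs (b - a)) ws (drop 1 ws))
-- pathLength [0, 0.5, 0.25, 1] = 0.5 + 0.25 + 0.75 = 1.5
```

A sequence belongs to \(A(T,L)\) when this quantity stays below \(L\).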

We now are ready to write the actual OGD gradient step. We need a datatype to maintain the state of the algorithm, which we’ll call `Ogdstate`:

```
-- | Algorithm state for OGD. Encodes both the round count and the last
-- action.
data Ogdstate v a = Ogdstate
  Int   -- ^ round counter
  (v a) -- ^ weights

-- | The OGD algorithm.
ogd :: (Additive v, Floating a) =>
  LearningRate a ->     -- ^ Learning rate schedule
  Projection v a ->     -- ^ Projection operator
  Ogdstate v a ->       -- ^ Last state/action
  LossDerivative v a -> -- ^ Derivative of the loss at this round
  Ogdstate v a          -- ^ New state/action
ogd rate projection (Ogdstate t w) dc = Ogdstate (t+1) $
  projection $ w ^-^ (rate t *^ dc w)

-- | Smart constructor for @Ogdstate@.
initialOGDState :: v a -> Ogdstate v a
initialOGDState w = Ogdstate 1 w
```
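To watch the update contract without pulling in `linear`, here is a list-specialized mirror of the step (hypothetical names) on \(c(w) = \|w\|^2\) with the identity projection; since \(\nabla c = 2w\), each step multiplies the iterate by \(1 - 2\eta\):

```haskell
-- List-specialized OGD step on c(w) = ||w||^2 with identity projection.
-- grad c = 2w, so w' = w - eta * 2w = (1 - 2 eta) w.
stepL :: Double -> [Double] -> [Double]
stepL eta w = zipWith (-) w (map (eta *) (map (2 *) w))
-- iterate (stepL 0.25) [1, -2] !! 3 evaluates to [0.125, -0.25]
```

With \(\eta = 1/4\) each step halves the iterate, which converges to the minimizer \(0\).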

We now have the `ogd` gradient descent step nailed down. I’d like to tackle a regression problem next, so let’s introduce some more common vocabulary in our code first. We’ll abstract over this gradient step and expose an interface for online linear regression.

For this first post, I want to use something quite straightforward. Accordingly, we’ll avoid all modeling work and use a basic setup. I’m choosing one of the most classical tasks: linear regression.

The online regression framework is the following.

\(\forall t=0,\cdots\) the learner

\(\quad\) Observes a feature vector \(x^t \in \mathbb{R}^n\)

\(\quad\) Selects a prediction \(\widehat{y^t} \in \mathbb{R}\)

\(\quad\) Observes the regression target \(y^t \in \mathbb{R}\)

\(\quad\) Incurs a loss \(l(\widehat{y^t},y^t)\)

We can already write a type for that loss function \(l\):

```
-- | Differential of a regression loss.
type RegressionLossDerivative v a =
  v a ->             -- ^ x^t
  a ->               -- ^ y^t
  LossDerivative v a -- ^ c^t
```

We make the classical restrictive design choices on the learner. First, we use a constrained linear model, with a parameter vector in a set \(F \subset \mathbb{R}^n\). Second, we commit to choose \(w^t\) before observing \(x^t\). Under this design, the regression framework “reduces” to OCP:

\(\forall t=0,...\) the learner

\(\quad\) Selects an action \(w^t \in F\)

\(\quad\) Observes a feature vector \(x^t \in \mathbb{R}^n\)

\(\quad\) Predicts \(\widehat{y^t} = {w^{t}}^T x^t\)

\(\quad\) Observes regression target \(y^t \in \mathbb{R}\)

\(\quad\) Incurs loss \(c^t(w^t) = l( {w^{t}}^T x^t, y^t)\)

This means that we need to use the following prediction function:

```
-- | Prediction function.
predictReg :: (Metric v, Floating a) =>
  v a ->          -- ^ Feature vector
  Ogdstate v a -> -- ^ Learner
  a               -- ^ Prediction
predictReg x (Ogdstate _ w) = w `dot` x
```

Let’s wrap the previously implemented `ogd` in order to expose a regression interface. We’ll do this without making our choice for \(l\) or \(F\) explicit:

```
-- | Regression fit call.
type FitReg v a =
  v a ->          -- ^ Feature vector x^t
  a ->            -- ^ Regression target y^t
  Ogdstate v a -> -- ^ Old learner state
  Ogdstate v a

-- | Generates a fitter for one regression data point using OGD.
fitOGDReg :: (Additive v, Floating a) =>
  LearningRate a ->               -- ^ Learning rate
  Projection v a ->               -- ^ Projection operator
  RegressionLossDerivative v a -> -- ^ Regression loss derivative
  FitReg v a                      -- ^ Fitting function
fitOGDReg rate projection differential x y state =
  ogd rate projection state $ differential x y
```

That’s it, we now have an interface for predicting a regression value (`predictReg`) and for feeding an instance to a regression model (`FitReg v a`). Let’s choose and implement a linear regression model.

We’ll take \(F\) to be an \(\ell_2\) ball of radius \(\lambda\) and use the square loss for \(c^t\). Here, we’re not really going to reap the statistical benefits of this \(\ell_2\) ball (which mostly come about with a large number of features: see this Wikipedia page). But it will be useful to keep large gradient jumps in check.

\[ c^t(w) = \left( y^t - w^{T} x^t \right)^2 \quad y^t \in \mathbb{R}, x^t \in \mathbb{R}^n \\ F = \left\{ w \in \mathbb{R}^n \; \middle| \; \|w\| \leq \lambda \right\} \]

This leads to the following OGD parametrization:

\[ \nabla c^t (w) = x^t \left( w^T x^t - y^t \right) \\ P(w) = \begin{cases} w & \text{if}\ \|w\| \leq \lambda \\ \lambda \frac{w}{\|w\|} & \text{otherwise} \end{cases} \]

The diameter of \(F\) depends on our choice of \(\lambda\) (\(\|F\|=2\lambda\)), and the magnitude of \(\|\nabla c\|\) depends on the regression data. In the case of \(\eta_t=\frac{\eta}{\sqrt{t}}\), we get a regret bound that scales both with \(\lambda^2\) and this data-dependent term:

\[ R_F(T) \leq \frac{\|F\|^2 \sqrt{T}}{2} + \left( \sqrt{T} - \frac{1}{2} \right) \|\nabla c \|^2 \\ \quad = 2 \lambda^2 \sqrt{T} + \left( \sqrt{T} - \frac{1}{2} \right) \|\nabla c \|^2 \qquad(6)\]

where \(\|\nabla c \| = \max_{w,x,y} \left\| x \left( w^T x - y \right) \right\|\).

Let’s implement the projection operator and the differential of the square loss. We’ll then use `fitOGDReg` and get our learner.

```
-- | Argument name for Lambda.
newtype Lambda a = Lambda a

-- | Projection on a L2 Ball.
l2Projection :: (Metric v, Ord a, Floating a) =>
  Lambda a ->    -- ^ radius
  Projection v a -- ^ projection operator
l2Projection (Lambda lam) w
  | q <= lam = w
  | otherwise = lam *^ w ^/ q
  where q = norm w

-- | Derivative of the square loss function.
mseDiff :: (Metric v, Floating a) => RegressionLossDerivative v a
mseDiff x y w = x ^* (w `dot` x - y)

-- | Online ridge regressor via OGD.
fitRidge :: (Metric v, Ord a, Floating a) =>
  LearningRate a -> -- ^ Rate schedule
  Lambda a ->       -- ^ Constraint set radius
  FitReg v a
fitRidge rate lambda = fitOGDReg rate (l2Projection lambda) mseDiff
```
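As a standalone check on the composition (list-only, with hypothetical names mirroring the `linear`-based code above): one update from \(w = 0\) moves along \(x^t\) scaled by the prediction error, then gets clipped to the ball:

```haskell
-- One list-only ridge update: square-loss gradient step, then
-- projection on the l2 ball of radius lam.
ridgeStep :: Double -> Double -> [Double] -> Double -> [Double] -> [Double]
ridgeStep eta lam x y w = proj (zipWith (\wi gi -> wi - eta * gi) w grad)
  where
    grad = map (* (sum (zipWith (*) w x) - y)) x  -- x (w.x - y)
    proj v
      | q <= lam  = v
      | otherwise = map (\vi -> lam * vi / q) v
      where q = sqrt (sum (map (^ 2) v))
-- ridgeStep 0.5 10 [1, 0] 2 [0, 0] evaluates to [1.0, 0.0]
```

With a small radius the projection kicks in: shrinking `lam` to 0.5 in the example above clips the same update to `[0.5, 0.0]`.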

The learner is now fully implemented; let’s play with a toy dataset.

```
{-# LANGUAGE QuasiQuotes #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TypeApplications #-}
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE ViewPatterns #-}
import Protolude
import OCO
import Linear.Metric
import H.Prelude as R
import Data.Sequence.Lens
import Control.Monad.Primitive
import Control.Lens hiding ((:>))
import Data.Sequence hiding (replicate,length)
import System.Random
import Data.Coerce
import Data.Generics.Product
import System.Random.Shuffle
rrequire :: (MonadR m, Literal lib a) => lib -> m ()
rrequire p = void [r| suppressMessages(require(p_hs,character.only=TRUE)) |]
rpackages :: [Text]
rpackages = [ "ggplot2", "plotly", "latex2exp", "dplyr" ]
data OutputType = Plotly | Pdf | Png | Xfig deriving (Read, Show, Enum, Bounded)
saveTsPlot
:: (MonadR m, Literal ggplot a)
=> ggplot
-> Text
-> OutputType
-> m (SomeSEXP (PrimState m))
saveTsPlot ggplot path = \case
Plotly -> rggplotly ggplot >>= ras_widget >>= rsaveWidget (path <> ".html")
Pdf -> ggsave (path <> ".pdf") ggplot
Png -> ggsave (path <> ".png") ggplot
Xfig -> xfig (path <> ".fig") ggplot
where
ggsave pa pl = [r| ggsave(pa_hs,pl_hs, width = 6, height = 3) |]
xfig pa pl = [r|
xfig(pa_hs ,width = 6, height = 3, onefile=FALSE
, textspecial=TRUE, pointsize=1,paper="A4")
print(pl_hs)
dev.off()
|]
rggplotly x = [r| ggplotly(x_hs) |]
ras_widget x = [r| as_widget(x_hs) |]
rsaveWidget p x = [r| htmlwidgets::saveWidget(x_hs,p_hs) |]
```

Let’s play with this learner using a subset of the “iris” dataset, which is available with the standard R distribution. This post uses the `inline-r` Haskell package to interface with R, so the snippets below are a little verbose, but the point should get across fine. We’ll try to learn iris petal width in an online fashion, based on the other numerical attributes from the dataset. For reference, here’s the corresponding R linear model fit. Note the `-1` at the end of the model specified in symbolic form in the `lm` call; this is used to remove the intercept from the model. For simplicity’s sake, all models in this post are truly linear instead of affine. Also note that the following code scales all numerical columns of the dataset.

```
-- | This R section throws away the "species" categorical variable from the
-- `iris` dataset, normalizes it and returns the best a-posteriori average
-- squared error on petal width obtained with a linear model fitted on the
-- three remaining attributes.
preamble :: (MonadR m) => m (SomeSEXP (PrimState m))
preamble = [r|
iris <- select(iris,-c(Species))
iris <- as.data.frame(scale(iris))
sm <- summary(lm(
formula = Petal.Width ~ Sepal.Width + Sepal.Length + Petal.Length - 1,
data = iris
))
mean((sm$residuals)^2)
|]
```

We’ll use this average squared error value later when evaluating OGD. First, we write a state monad for a one-pass learning game that saves all predictions into a sequence:

```
-- | The "game state" that we will use to run our one-pass procedure
data GameState v a = GameState
{ predicted:: Seq a
, ogdState :: Ogdstate v a
} deriving (Generic)
-- | a one pass procedure: given a fitting function and a dataset,
-- this function returns a state monad that performs one pass of
-- online learning on the dataset.
onePass ::
( Metric v -- ^ The container for the vector space
, Traversable t -- ^ The container for the dataset
, Floating a ) -- ^ The floating point type
=>
(FitReg v a) -- ^ the fitting function
-> t (v a, a) -- ^ the dataset
-> State (GameState v a) () -- ^ the output state monad
onePass fit dataset = for_ dataset $ \(x,y) -> do
-- Fitting the new data point
field @"ogdState" %= fit x y
-- Calling the prediction function on the learner state
yhat <- uses (field @"ogdState") (predictReg x)
-- Updating the history
field @"predicted" <>= pure yhat
```
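The State-monad loop above is equivalent to a plain left fold; here is a standalone sketch with a toy "remember the last label" learner (purely hypothetical) that makes the fit-then-predict ordering explicit:

```haskell
-- One pass as a fold: fit on (x, y), then record the prediction made
-- by the *updated* state, mirroring the State-based loop.
onePassFold
  :: (x -> Double -> s -> s)  -- fit
  -> (x -> s -> Double)       -- predict
  -> s -> [(x, Double)] -> [Double]
onePassFold fit predict s0 = snd . foldl go (s0, [])
  where
    go (s, preds) (x, y) =
      let s' = fit x y s in (s', preds ++ [predict x s'])
-- With the toy learner fit _ y _ = y; predict _ s = s:
-- onePassFold (\_ y _ -> y) (\_ s -> s) 0 [((), 1), ((), 2)] = [1.0, 2.0]
```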

Second, we write the `plot1pass` function that feeds one pass of the dataset to OGD and generates a plot of the cumulative average squared regression error:

```
-- | plot1pass executes a one-pass procedure on a dataset and
-- produces an R section that returns a plot of the evolution of the
-- average squared error, with a red dashed line as a comparator.
plot1pass :: (MonadR m) =>
Double -- ^ a-posteriori best average squared error
-> (FitReg [] Double) -- ^ fit function
-> [([Double],Double)] -- ^ dataset
-> m (SomeSEXP (PrimState m))
plot1pass comparator fit dataset = [r|
d <- data.frame(seq(1, length(ys_hs), 1), ys_hs, predicted_hs)
names(d)=c("t","label","predicted")
d$c = cumsum((d$predicted - d$label)^2)
print(head(d))
ggplot(d,aes(x = t, y = (cumsum((predicted-label)^2))/t)) +
geom_point(size=0.4) +
ylim(c(0,max(d$c/d$t))) +
geom_hline(yintercept = comparator_hs, linetype="dashed",
color = "red") +
xlab("$t$") +
ylab("$\\frac{1}{t} \\sum_{i=1}^{i=t} c^i (x^i)$")
|]
where
ys = snd <$> dataset
(GameState (toList -> predicted) _) =
execState (onePass fit dataset) $ GameState
{ predicted = Data.Sequence.Empty
, ogdState = initialOGDState [0,0,0]
}
```

Third, we write an `experiment` function that acquires data and feeds it to `plot1pass`. We’ll use `Sepal.Length`, `Sepal.Width`, and `Petal.Length` to build the \(x^t\) vectors (`xs` in the snippet below), and `Petal.Width` to build \(y^t\) (`ys` in the snippet below). Note how this code sneakily shuffles the dataset (the features were already normalized on the R side).

```
-- | `experiment` acquires `iris` data from R, and passes it to `plot1pass`.
experiment :: (MonadR m) =>
Double -- ^ a-posteriori best average squared error
-> (FitReg [] Double) -- ^ fit function
-> m (SomeSEXP (PrimState m))
experiment comparator fit = do
sepalLength <- [r| iris$Sepal.Length|]
sepalWidth <- [r| iris$Sepal.Width |]
petalLength <- [r| iris$Petal.Length|]
petalWidth <- [r| iris$Petal.Width |]
let xs = coerce $ (\a b c -> [a,b,c] :: [Double])
<$> p sepalLength
<*> p sepalWidth
<*> p petalLength
let ys = fromSomeSEXP petalWidth :: [Double]
let (ZipList dataset) = (,) <$> ZipList xs <*> ZipList ys
plot1pass comparator fit (shuffle' dataset (length dataset) (mkStdGen 1))
where p = ZipList . fromSomeSEXP
```

The main source of difficulty with the practical application of OGD for this problem is the selection of its two hyperparameters:

- The learning rate \(\eta_t\) (schedule and magnitude).
- The size of the regularizer \(\lambda\).

Let’s try some arbitrary numbers. We’ll assume the comparator is static (hence `rootRate` from earlier, with \(\eta = 1\)) and use \(\lambda = 1\).

```
plotRidgeArbitrary :: (MonadR m) => Double -> m (SomeSEXP (PrimState m))
plotRidgeArbitrary comparator =
experiment comparator $ fitRidge (rootRate 1) (Lambda 1)
```

We run the one-pass learning experiment and plot the resulting evolution of the cumulative average squared error. The red dashed line represents the best attainable value, as obtained by the “a-posteriori” linear regression performed above.

Not exactly the performance we are looking for. This blog post is not going to cover the selection of those hyperparameters; they will be discussed in another post in this series. Indeed, it is difficult to estimate the quantities in eq. 6. In particular, quantities such as \(\|\nabla c \|^2\) that could inform the choice of \(\lambda\) and \(\eta\) are usually not accessible a-priori. I’ll just show that the algorithm works when we magically select the correct values.

```
plotRidgeMagic :: (MonadR m) => Double -> m (SomeSEXP (PrimState m))
plotRidgeMagic comparator =
experiment comparator $ fitRidge (rootRate 0.045) (Lambda 0.3)
```

Much better. It seems that we are able to reduce the average error to the a-posteriori best value quite early in our one-pass procedure. Note the role the \(\ell_2\) ball plays here: make it too large, and the initial iterations become jumpy enough to degrade the overall performance:

```
plotRidgeMagicNoBall :: (MonadR m) => Double -> m (SomeSEXP (PrimState m))
plotRidgeMagicNoBall comparator =
experiment comparator $ fitRidge (rootRate 0.045) (Lambda 20)
```

Note the difference in the y axis. Before leaving, let’s review, without elaborating, the various operations that made this algorithm work:

- Scaling: `iris=as.data.frame(scale(iris))` was used to scale all features to the same range. Without this operation, the algorithm’s performance would be degraded. See [4], [3] and [1] for a discussion of this aspect.
- \(\eta\) was hand-picked to make the algorithm converge fast. See [4].
- \(\lambda\) was hand-picked to control the contribution of initial jumps to the average objective.
- Shuffling: `shuffle'` was used to shuffle the dataset.

I hope to cover those aspects in follow-up posts.

Here are the tangled source files for this post:

```
main :: IO ()
main =
R.withEmbeddedR defaultConfig
$ R.runRegion $ do
for_ rpackages rrequire
[r| theme_set(
theme_bw() +
theme(
panel.background = element_rect(fill = "transparent"),
plot.background = element_rect(fill = "transparent", color = NA),
legend.background = element_rect(fill = "transparent"),
legend.box.background = element_rect(fill = "transparent")
)) |]
bestSquaredError <- preamble
let bse :: Double
bse = fromSomeSEXP bestSquaredError
liftIO $ writeFile "out_preamble.txt" (show bestSquaredError)
plotRidgeMagic bse >>= \x -> void $ saveTsPlot x "ridgeMagic" Xfig
plotRidgeMagicNoBall bse >>= \x -> void $ saveTsPlot x "ridgeMagicNoBall" Xfig
plotRidgeArbitrary bse >>= \x -> void $ saveTsPlot x "ridgeArbitrary" Xfig
```


[1] Duchi, J. et al. 2011. Adaptive subgradient methods for online learning and stochastic optimization. *Journal of machine learning research*. 12, Jul (2011), 2121–2159.

[2] Henderson, H.V. and Velleman, P.F. 1981. Building multiple regression models interactively. *Biometrics*. (1981), 391–411.

[3] Luo, H. et al. 2016. Efficient second order online learning via sketching. *CoRR*. abs/1602.02202, (2016).

[4] Ross, S. et al. 2013. Normalized online learning. *CoRR*. abs/1305.6646, (2013).

[5] Zinkevich, M. 2003. Online convex programming and generalized infinitesimal gradient ascent. *Proceedings of the 20th international conference on machine learning (icml-03)* (2003), 928–936.