Valentin Reis

online (machine) learning with haskell and friends

This post progresses from basic Online Convex Optimization (OCO) ideas to an online learner implementation. More precisely, we’ll use the Online Gradient Descent (OGD) algorithm [5] to obtain an online ridge regression technique, which we will apply to the classic iris [2] dataset. Code snippets are tangled and evaluated to produce this page; see the “Code” section at the end of this post.

The Online Convex Programming (OCP) [5] framework is the following:

\(\forall t=0,\cdots\) the learner
\(\quad\) Selects an action \(w^t \in F \subset \mathbb{R}^n\), a convex set.
\(\quad\) Observes a convex loss function \(c^t : F \rightarrow \mathbb{R}\)
\(\quad\) Incurs a loss \(c^t(w^t)\)

This setting may be specialized to many tasks, such as regression, classification, clustering, ranking, collaborative filtering, structured prediction, principal component analysis, and more. The action \(w\) can often be thought of as the parameter of the model, although the two notions do not always coincide. Let’s go ahead and take a look at our first algorithm for OCP; we’ll specialize the framework to something useful afterwards.

Online Gradient Descent

There are many useful variants of M. Zinkevich’s original Online Gradient Descent (OGD) algorithm. This post keeps it simple and implements the “canonical” OGD method itself: a non-adaptive, constrained online gradient descent that uses a projection step. More formally, the original OGD algorithm works by picking \(w^0\) arbitrarily in \(F\) and choosing:

\[ w^{t+1} = P(w^t - \eta_t \nabla c^t (w^t)) \qquad(1)\]

Where \(P(x) = \text{argmin}_{y \in F} \| x - y \|\) is the projection operator onto \(F\).

As long as we can encode a problem by specifying (and computing) a projection operator \(P\) and a gradient of \(c^t\) at point \(w^t\) for all \(t=1,...\), we can use the OGD algorithm. Then, it remains to choose a sequence \(\eta_t\) of learning rates.

I’m trying to make the non-experimental code for these posts as generic as reasonably feasible, and the typeclasses from the linear Haskell package will help. These typeclasses allow us to use the code from this post seamlessly on vectors, association lists, or any custom data structure that represents a vector space.

-- From Linear.Vector, we get the Additive typeclass.
-- Additive exposes vector subtraction and scalar multiplication/division
-- for arbitrary memory representations.
import Linear.Vector  (Additive, (^-^), (^/), (*^), (^*))
-- We'll need Linear.Metric later on for dot products and squared norms.
import Linear.Metric (Metric, norm, dot)

We’ll call the type of a loss function differential LossDerivative and the type for projections Projection. We’ll essentially need these three types in order to define a learner:

-- | Derivative of a loss function w.r.t the action w.
type LossDerivative v a =
  v a ->  -- ^ w
  v a     -- ^ dc/dw

-- | Projection operator.
type Projection v a = 
  v a ->  -- ^ input vector
  v a     -- ^ output vector

-- | Learning rate schedule.
type LearningRate a = 
  Int ->  -- ^ Round counter
  a       -- ^ Rate value
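
To make these types concrete, here is a hypothetical instantiation (illustration only, not used in the rest of the post): the differential of the squared distance to a fixed target, and the identity projection for the unconstrained case \(F = \mathbb{R}^n\).

-- | Hypothetical example: derivative of w -> 0.5*||w - target||^2,
-- which pulls the action towards a fixed target.
towardsTarget :: (Additive v, Num a) => v a -> LossDerivative v a
towardsTarget target w = w ^-^ target

-- | Hypothetical example: identity projection, for when F is all of R^n.
unconstrained :: Projection v a
unconstrained = id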

The goal of the learner is to minimize its regret at round \(T\) against all adversaries in a comparison class. This means that we’ll be benchmarking the algorithm with respect to a single best adversary in hindsight. The original OGD paper [5] introduces the static and the dynamic cases. Let’s go over those.

Static adversary

-- | Inverse square root learning rate.
rootRate :: (Floating a) => a -> LearningRate a
rootRate eta t = eta / sqrt (fromIntegral t)

Using the learning rate \(\eta_t = \frac{\eta}{\sqrt{t}}\), we can minimize the average regret against a static action:

\[ R_F(T) = \sum_{t=1}^{T} c^t (w^t) - \min_{w \in F} \sum_{t=1}^{T} c^t (w) \qquad(2)\]

If \(F\) is bounded, closed, nonempty, and the norm of the gradient of \(c\) is bounded on \(F\), we get the following regret bound.

\[ R_F(T) \leq \frac{\|F\|^2 \sqrt{T}}{2} + \left( \sqrt{T} - \frac{1}{2} \right) \|\nabla c \|^2 \qquad(3)\]

Where \(\|F\|\) is the diameter of \(F\) and \(\|\nabla c\|\) bounds the norm of the gradient of \(c\) on \(F\). This is sublinear in \(T\), so the average regret vanishes and the learner actually converges on the best static target.

Dynamic adversary

-- | Fixed learning rate.
fixedRate :: a -> LearningRate a
fixedRate = const

Using the fixed learning rate \(\eta_t = \eta\), we can minimize a notion of dynamic regret. For this specific bound, the comparison class is the set \(A(T,L)\) of slowly moving action sequences. This is defined in terms of the path length of a sequence, which is just the sum of the distances between actions at subsequent rounds. The comparison class is the set of sequences of actions whose path length is at most \(L\).

More precisely, this dynamic regret is:

\[ R_A(T,L) = \sum_{t=1}^{T} c^t (w^t) - \min_{(w_a^t)_{t=1 \cdots T} \in A(T,L)} \sum_{t=1}^{T} c^t (w_a^t) \qquad(4)\]

And the fixed learning rate gives us the following bound:

\[ R_A(T,L) \leq \frac{7 \|F\|^2 }{4 \eta} + \frac{L \|F\| }{\eta} + \frac{ T \eta \|\nabla c \|^2}{2} \qquad(5)\]
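
For instance, if the horizon \(T\) is known in advance, picking \(\eta = 1/\sqrt{T}\) makes this bound sublinear as well:

\[ R_A(T,L) \leq \left( \frac{7 \|F\|^2}{4} + L \|F\| + \frac{\|\nabla c \|^2}{2} \right) \sqrt{T} \]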

Gradient Step

We are now ready to write the actual OGD gradient step. We need a datatype to maintain the state of the algorithm, which we’ll call Ogdstate:

-- | Algorithm state for OGD. Encodes both the round count and the last
-- action.
data Ogdstate v a = Ogdstate 
  Int   -- ^ round counter
  (v a) -- ^ weights

-- | The OGD algorithm.
ogd :: (Additive v, Floating a) =>
  (LearningRate a)       -> -- ^ Learning rate schedule
  (Projection v a)       -> -- ^ Projection operator
  (Ogdstate v a)         -> -- ^ Last state/action 
  (LossDerivative v a)   -> -- ^ Derivative of the loss at this round
  (Ogdstate v a)            -- ^ New state/action
ogd rate projection (Ogdstate t w) dc = Ogdstate (t+1) $
  projection $ w ^-^ ( rate t *^ dc w)

-- | Smart constructor for @Ogdstate@.
initialOGDState :: v a -> Ogdstate v a
initialOGDState w = Ogdstate 1 w
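
As a quick sanity check (a hypothetical snippet, not part of the learner), consider one step on the loss \(c(w) = \frac{1}{2}\|w\|^2\), whose gradient at \(w\) is \(w\) itself. With a fixed rate of \(0.5\) and no projection, the step should halve the weights:

-- One OGD step on c(w) = 0.5*||w||^2 starting from [1,1]: the update
-- w - 0.5*w lands on [0.5,0.5] and the round counter ticks to 2.
ogdSanityCheck :: Ogdstate [] Double
ogdSanityCheck = ogd (const 0.5) id (initialOGDState [1, 1]) id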

We now have the ogd gradient descent step nailed down. I’d like to move on to a regression problem next, so let’s introduce some more common vocabulary in our code first. We’ll abstract over this gradient step and expose an interface for online linear regression.

Linear Regression with OGD

For this first post, I want to use something quite straightforward. Accordingly, we’ll avoid all modeling work and use a basic setup. I’m choosing one of the most classical tasks: linear regression.

Online Regression

The online regression framework is the following.

\(\forall t=0,\cdots\) the learner
\(\quad\) Observes a feature vector \(x^t \in \mathbb{R}^n\)
\(\quad\) Selects a prediction \(\widehat{y^t} \in \mathbb{R}\)
\(\quad\) Observes the regression target \(y^t \in \mathbb{R}\)
\(\quad\) Incurs a loss \(l(\widehat{y^t},y^t)\)

We can already write a type for that loss function \(l\):

-- | Differential of a regression loss
type RegressionLossDerivative v a =
  v a ->               -- ^ x^t
  a   ->               -- ^ y^t
  (LossDerivative v a) -- ^ c^t

The Linear Predictor

We make the classical restrictive design choices on the learner. First, we use a constrained linear model, with a parameter vector in a set \(F \subset \mathbb{R}^n\). Second, we commit to choosing \(w^t\) before observing \(x^t\). Under this design, the regression framework “reduces” to OCP:

\(\forall t=0,...\) the learner
\(\quad\) Selects an action \(w^t \in F\)
\(\quad\) Observes a feature vector \(x^t \in \mathbb{R}^n\)
\(\quad\) Predicts \(\widehat{y^t} = {w^{t}}^T x^t\)
\(\quad\) Observes regression target \(y^t \in \mathbb{R}\)
\(\quad\) Incurs loss \(c^t(w^t) = l({w^{t}}^T x^t, y^t)\)

This means that we need to use the following prediction function:

-- | Prediction function
predictReg :: (Metric v, Floating a) =>
  v a ->            -- ^ Feature vector
  Ogdstate v a ->   -- ^ Learner
  a                 -- ^ Prediction
predictReg x (Ogdstate _ w) = w `dot` x
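
By the chain rule, the gradient of the round loss is the derivative of \(l\) in its first argument, scaled along the feature vector:

\[ \nabla_w c^t(w) = \frac{\partial l}{\partial \widehat{y}} \left( {w}^T x^t, y^t \right) x^t \]

This is exactly what a RegressionLossDerivative (defined above) computes once \(x^t\) and \(y^t\) are observed.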

Let’s wrap the previously implemented ogd in order to expose a regression interface. We’ll do this without making our choice of \(l\) or \(F\) explicit:

-- | Regression fit call
type FitReg v a =
  v a          ->  -- ^ Feature vector x^t
  a            ->  -- ^ Regression target y^t
  Ogdstate v a ->  -- ^ Old learner state
  Ogdstate v a 

-- | Generates a fitter for one regression data point using OGD.
fitOGDReg :: (Additive v, Floating a) =>
  (LearningRate a) ->               -- ^ Learning rate
  (Projection v a) ->               -- ^ Projection operator
  (RegressionLossDerivative v a) -> -- ^ Regression loss derivative
  FitReg v a                        -- ^ Fitting function
fitOGDReg rate projection differential x y state =
  ogd rate projection state $ differential x y

That’s it: we now have an interface for predicting a regression value (predictReg) and for feeding an instance to a regression model (FitReg v a). Let’s choose and implement a linear regression model.

Ridge Regression

We’ll constrain the action to an \(\ell_2\) ball \(F\) of radius \(\lambda\) and use the square loss for \(c^t\). Here, we’re not really going to reap the systematic benefits of this \(\ell_2\) ball (which mostly come about with a large number of features), but it will be useful to keep large gradient jumps in check.

\[ c^t(w) = \left( y^t - w^{T} x^t \right)^2 \quad y^t \in \mathbb{R}, x^t \in \mathbb{R}^n \\ F = \left\{ w \in \mathbb{R}^n \; \middle| \; \|w\| \leq \lambda \right\} \]

This leads to the following OGD parametrization (we drop the constant factor 2 from the square-loss gradient; it merely rescales the learning rate):

\[ \nabla c^t (w) = x^t \left( w^T x^t - y^t \right) \\ P(w) = \begin{cases} w & \text{if}\ \|w\| \leq \lambda \\ \lambda \frac{w}{\|w\|} & \text{otherwise} \end{cases} \]

The diameter of \(F\) depends on our choice of \(\lambda\) (\(\|F\|=2\lambda\)), and the magnitude of \(\|\nabla c\|\) depends on the regression data. In the case of \(\eta_t=\frac{\eta}{\sqrt{t}}\), we get a regret bound that scales both with \(\lambda^2\) and this data-dependent term:

\[ R_F(T) \leq \frac{\|F\|^2 \sqrt{T}}{2} + \left( \sqrt{T} - \frac{1}{2} \right) \|\nabla c \|^2 \\ \quad \leq 2 \lambda^2 \sqrt{T} + \left( \sqrt{T} - \frac{1}{2} \right) \|\nabla c \|^2 \qquad(6)\]

where \(\|\nabla c \| = \max_{w,x,y} \left\| x \left( w^T x - y \right) \right\|\).

Let’s implement the projection operator and the differential of the square loss. We’ll then use fitOGDReg and get our learner.

-- | Argument name for Lambda.
newtype Lambda a = Lambda a

-- | Projection on a L2 Ball.
l2Projection :: (Metric v, Ord a, Floating a) =>
  Lambda a ->    -- ^ radius
  Projection v a -- ^ projection operator
l2Projection (Lambda lam) w 
   | q <= lam = w
   | otherwise = lam *^ w ^/ q
 where q = norm w

-- | Derivative of the square loss function.
mseDiff :: (Metric v, Floating a) => RegressionLossDerivative v a
mseDiff x y w = x ^* ( w `dot` x - y)

-- | Online ridge regressor via OGD.
fitRidge :: (Metric v, Ord a, Floating a) =>
  LearningRate a -> -- ^ Rate schedule
  Lambda a ->       -- ^ Constraint set radius
  FitReg v a
fitRidge rate lambda = fitOGDReg rate (l2Projection lambda) mseDiff
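
As a small usage sketch (with made-up numbers, separate from the experiment below), one round of fit-then-predict on list-based vectors looks like this:

-- Feed a single (x, y) observation to a fresh ridge learner with
-- rate eta = 0.1 and radius lambda = 1, then predict on the same x.
ridgeDemo :: Double
ridgeDemo = predictReg x s1
 where
  x  = [1.0, 2.0, 3.0]
  s1 = fitRidge (rootRate 0.1) (Lambda 1) x 0.5 (initialOGDState [0, 0, 0])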

The learner is now fully implemented; let’s play with a toy dataset.

iris

Let’s play with this learner using a subset of the “iris” dataset, which is available with the standard R distribution. This post uses the inline-r haskell package to interface with R, so the snippets below are a little verbose, but the point should get across fine. We’ll try to learn iris petal width in an online fashion, based on the other numerical attributes from the dataset. For reference, here’s the corresponding R linear model fit. Note the -1 at the end of the model specified in symbolic form in the lm call: it removes the intercept from the model. For simplicity’s sake, all models in this post are truly linear instead of affine. Also note that the following code scales all numerical columns of the dataset.

-- | This R section throws away the "species" categorical variable from the
-- `iris` dataset, normalizes it and returns the best a-posteriori average
-- squared error on petal width obtained with a linear model fitted on the
-- three remaining attributes.
preamble :: (MonadR m) => m (SomeSEXP (PrimState m))
preamble = [r|
  # select() comes from dplyr, which this post's R setup is assumed to load.
  iris <- data.frame(scale(select(iris, -c(Species))))
  sm <- summary(lm(
    formula = Petal.Width ~ Sepal.Width + Sepal.Length + Petal.Length - 1, 
    data = iris
    ))
  mean((sm$residuals)^2)
|]

We’ll use this average squared error value later when evaluating OGD. First, we write a state monad for a one-pass learning game that saves all predictions into a sequence:

-- | The "game state" that we will use to run our one-pass procedure
data GameState v a = GameState 
  { predicted:: Seq a
  , ogdState :: Ogdstate v a
  } deriving (Generic)

-- | a one pass procedure: given a fitting function and a dataset,
-- this function returns a state monad that performs one pass of 
-- online learning on the dataset. 
onePass :: 
  ( Metric v      -- ^ The container for the vector space
  , Traversable t -- ^ The container for the dataset
  , Floating a )  -- ^ The floating point type
  =>
  (FitReg v a)                 -- ^ the fitting function
  -> t (v a, a)                -- ^ the dataset
  -> State (GameState v a) ()  -- ^ the output state monad
onePass fit dataset = for_ dataset $ \(x,y) -> do
  -- Fitting the new data point
  field @"ogdState" %= fit x y
  -- Calling the prediction function on the learner state
  yhat <- uses (field @"ogdState") (predictReg x)
  -- Updating the history 
  field @"predicted" <>= pure yhat

Second, we write the plot1pass function that feeds one pass of the dataset to OGD and generates a plot of the running average of the squared regression error:

-- | plot1pass executes a one-pass procedure on a dataset and
-- produces an R section that returns a plot of the evolution of the
-- average squared error, with a red dashed line as a comparator.
plot1pass :: (MonadR m) =>
  Double                 -- ^ a-posteriori best average squared error
  -> (FitReg [] Double)  -- ^ fit function
  -> [([Double],Double)] -- ^ dataset
  ->  m (SomeSEXP (PrimState m))
plot1pass comparator fit dataset = [r| 
    d <- data.frame(seq(1, length(ys_hs), 1), ys_hs, predicted_hs)
    names(d)=c("t","label","predicted")
    d$c = cumsum((d$predicted - d$label)^2)
    print(head(d))
    ggplot(d,aes(x = t, y = (cumsum((predicted-label)^2))/t)) +
      geom_point(size=0.4) +
      ylim(c(0,max(d$c/d$t))) +
      geom_hline(yintercept = comparator_hs, linetype="dashed", 
                color = "red") +
      xlab("$t$") +
      ylab("$\\frac{1}{t} \\sum_{i=1}^{i=t} c^i (x^i)$") 
  |]
 where
   ys = snd <$> dataset
   (GameState (toList -> predicted) _) =
       execState (onePass fit dataset) $ GameState 
         { predicted = Data.Sequence.Empty
         , ogdState = initialOGDState [0,0,0]
         }

Third, we write an experiment function that acquires data and feeds it to plot1pass. We’ll use Sepal.Length, Sepal.Width, and Petal.Length to build the \(x^t\) vectors (xs in the snippet below), and Petal.Width to build \(y^t\) (ys in the snippet below). Note how this code sneakily relies on the dataset already normalized by the R preamble, and shuffles it.

-- | `experiment` acquires `iris` data from R, and passes it to `plot1pass`.
experiment :: (MonadR m) => 
  Double                -- ^ a-posteriori best average squared error
  -> (FitReg [] Double) -- ^ fit function
  ->  m (SomeSEXP (PrimState m))
experiment comparator fit = do
  sepalLength <- [r| iris$Sepal.Length|]
  sepalWidth  <- [r| iris$Sepal.Width |] 
  petalLength <- [r| iris$Petal.Length|]
  petalWidth  <- [r| iris$Petal.Width |]
  let xs = coerce $ (\a b c -> [a,b,c] :: [Double])
        <$> p sepalLength
        <*> p sepalWidth 
        <*> p petalLength
  let ys = fromSomeSEXP petalWidth :: [Double]
  let (ZipList dataset) = (,) <$> ZipList xs <*> ZipList ys
  plot1pass comparator fit  (shuffle' dataset (length dataset) (mkStdGen 1))
 where p = ZipList . fromSomeSEXP

The main source of difficulty with the practical application of OGD to this problem is the selection of its two hyperparameters: the learning rate scale \(\eta\) and the constraint set radius \(\lambda\).

Let’s try some arbitrary numbers. We’ll assume a static adversary (rootRate from earlier with \(\eta = 1\)) and use \(\lambda = 1\).

plotRidgeArbitrary :: (MonadR m) => Double -> m (SomeSEXP (PrimState m))
plotRidgeArbitrary comparator = 
  experiment comparator $ fitRidge (rootRate 1) (Lambda 1)

We run the one-pass learning experiment and plot the resulting evolution of the running average of the squared error. The red dashed line represents the best attainable value, as obtained by the “a-posteriori” linear regression performed above.

Not exactly the performance we are looking for. This blog post is not going to cover the selection of those hyperparameters; they will be discussed in another post in this series. Indeed, it is difficult to estimate the quantities in eq. 6. In particular, quantities such as \(\|\nabla c \|^2\) that could inform the choice of \(\lambda\) and \(\eta\) are usually not accessible a-priori. I’ll just show that the algorithm works when we magically select correct values.

plotRidgeMagic :: (MonadR m) => Double -> m (SomeSEXP (PrimState m))
plotRidgeMagic comparator =
  experiment comparator $ fitRidge (rootRate 0.045) (Lambda 0.3)

Much better. It seems that we are able to reduce the average error to the a-posteriori best value quite early in our one-pass procedure. Note the role the \(\ell_2\) ball plays here: make it too large, and the initial iterations become jumpy enough to degrade the overall performance:

plotRidgeMagicNoBall :: (MonadR m) => Double -> m (SomeSEXP (PrimState m))
plotRidgeMagicNoBall comparator = 
  experiment comparator $ fitRidge (rootRate 0.045) (Lambda 20)

Note the difference in the y axis. Before leaving, let’s review, without elaborating, the various operations that were done to make this algorithm work: scaling the dataset’s columns, shuffling the dataset, and hand-picking the hyperparameters \(\eta\) and \(\lambda\).

I hope to cover those aspects in follow-up posts.

Code

Here are the tangled source files for this post:

References

[1] Duchi, J. et al. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research. 12, Jul (2011), 2121–2159.

[2] Henderson, H.V. and Velleman, P.F. 1981. Building multiple regression models interactively. Biometrics. (1981), 391–411.

[3] Luo, H. et al. 2016. Efficient second order online learning via sketching. CoRR. abs/1602.02202, (2016).

[4] Ross, S. et al. 2013. Normalized online learning. CoRR. abs/1305.6646, (2013).

[5] Zinkevich, M. 2003. Online convex programming and generalized infinitesimal gradient ascent. Proceedings of the 20th international conference on machine learning (icml-03) (2003), 928–936.