Applied Online Learning with Haskell and friends, Part 1

Posted on June 17, 2018 under tag(s) machine learning, functional programming, haskell, online learning, regression, theory

Online machine learning techniques can be used to solve many problems. In this series of blog posts, I’ll take a look at the different settings, algorithms, and models, occasionally going all the way to deploying toy services. Expect some theory talk and Haskell implementations, but also some Python/R code when it makes sense. I’ll talk about the infrastructure aspects, and that will probably mean writing some Nix too. The accompanying code will not only illustrate the dialogue. It will also be coherent and used to build something functional.

The three first posts in the series are a journey through the anatomy of an exchange-rate prediction service based on Online Convex Optimization (OCO). Each post deals with a different aspect of this system. There is Haskell, R and Python code associated. Aside from code snippets, the full supporting Haskell implementation for these three posts is documented via Haddock here with hyperlinked sources. The first three posts are packaged via Nix here so you can play with the source as you like on any Linux or MacOs box. Check out in the archive file if you want to play around.

This is the first post of the three, so let’s start simple. We will discuss and implement a basic OCO algorithm in Haskell and show how to use it to build a learner. More precisely, we’ll use the Online Gradient Descent (OGD) algorithm to solve an online ridge regression problem.

The online convex programming (OCP) framework is the following:

\(\forall t=0,\cdots\) the learner
\(\quad\) Selects an action \(w^t \in F \subset \mathbb{R}\), a convex set.
\(\quad\) Observes a convex loss function \(c^t : F \rightarrow \mathbb{R}\)
\(\quad\) Incurs a loss \(c^t(w^t)\)

This setting can be adapted for many tasks: regression, classification, clustering, ranking, collaborative filtering, structured prediction, principal component analysis, and more. The action can sometimes be thought of as the parameter of the model - although there will sometimes be a distinction. Let’s go ahead and take a look at our first algorithm for OCP.

Online Gradient Descent (OGD )

There are many useful variants of M.Zinkevich’s original OGD. For this first post, we’ll keep it simple and implement the OGD method itself, since it is the “canonical” algorithm for OCO. This algorithm is a non-adaptive constrained online gradient descent achieved through a projection step. More formally, the original OGD algorithm works by picking \(w^0\) arbitrarily in \(F\) and choosing:

\[\begin{equation} w^{t+1} = P(w^t - \eta_t \nabla c^t (w^t)) \end{equation} \]

Where \(P(w) = \text{argmin}_{y \in F} \| x,y \|\) is the projection operator on \(F\).

As long as we can encode a problem by specifying (and computing) a projection operator \(P\) and a gradient of \(c^t\) at point \(w^t\) for all $t=1,…$, we can use the OGD algorithm. Then, it remains to choose a sequence \(\eta_t\) of learning rates.

In the first few posts, we’ll be using only dense vectors. That being said, I will later show some applications using sparse data. Therefore, we’ll write as much code as possible in a generic way in this regard. The Typeclasses from the Linear Haskell package will help us achieve that. These Typeclasses allow us to use the code from this post seamlessly on vectors, association lists or any custom data structure that represent vector spaces. We’ll call the type of a loss function differential LossDerivative and the type for projections Projection. Should you want to look at the details, the source file for this post is OCO.hs (source, haddock, raw). We’ll essentially need these three types in order to define a learner:

-- | Derivative of a loss function w.r.t the action w.
type LossDerivative v x = (Additive v, Floating x) =>
  v x ->  -- w
  v x     -- dc/dw

-- | Projection operator.
type Projection v x = (Additive v, Floating x) => v x -> v x

-- | Learning Rate.
type LearningRate x = (Floating x) => Int -> x

The goal of the learner is to minimize its regret at round \(T\) against all adversaries in a comparison class. This means that we’ll be benchmarking the algorithm with respect to a single best adversary in hindsight. The original OGD publication introduces two cases. Let’s go over them.

Static adversary

-- | Inverse squared learning rate.
squaredRate :: (Floating x) => x -> LearningRate x
squaredRate eta t = eta / fromIntegral (t * t)

Using the learning rate \(\eta_t = \frac{1}{t^2}\), we can minimize the average regret to a static action:

\[\begin{equation} R_F(T) = \sum_{t=1}^{t=N} c^t (x^t) - \min_{w \in F} \sum_{t=1}^{t=N} c^t (w) \end{equation} \]

If \(F\) is bounded, closed, nonempty, and the norm of the gradient of \(c\) is bounded on \(F\), we get the following regret bound.

\[\begin{equation} R_F(T) \leq \frac{\|F\|^2 \sqrt{T}}{2} + \left( \sqrt{T} - \frac{1}{2} \right) \|\nabla c \|^2 \end{equation} \]

Where \(\|F\|\) is the diameter of \(F\) and \(\|\nabla c\|\) is the norm of the maximum value of the gradient of \(c\) in \(F\). That’s sub-linear and will learn static targets quite fast.

Dynamic adversary

-- | Fixed learning rate.
fixedRate :: (Floating x) => x -> LearningRate x
fixedRate eta  _ = eta

Using the fixed learning rate \(\eta_t = \eta\), we can minimize a notion of dynamic regret. For this specific bound, the comparison class is the set \(A(T,L)\) of slowly moving actions. This is defined in terms of the path length of a sequence, which just sum of the distance of displacement between subsequent rounds. The comparison class is the set of sequences of actions whose path length is smaller than \(L\).

More precisely, this dynamic regret regret is:

\[\begin{equation} R_A(T,L) = \sum_{t=1}^{t=N} c^t (x^t) - \min_{(w_a)^{t=1 \cdots T} \in A(T,L)} \sum_{t=1}^{t=N} c^t (w_a^t) \end{equation} \]

And the fixed learning rate gives us the following bound:

\[\begin{equation} R_A(T,L) \leq \frac{7 \|F\|^2 }{4 \eta} + \frac{L \|F\| }{\eta} + \frac{ T \eta \|\nabla c \|^2}{2} \end{equation} \]

Gradient Step

We now are ready to write the actual OGD gradient step. We need a type to maintain the state of the algorithm, which we’ll call Ogdstate:

-- | Algorithm state for OGD. Encodes both the round
-- count and the last action w.
data Ogdstate v x = Ogdstate Int (v x)

-- | The OGD algorithm.
ogd :: (Additive v, Floating x) =>
  (LearningRate x)       ->
  (Projection v x)       ->
  (Ogdstate v x)         -> -- Last state/action w^t
  (LossDerivative v x)   -> -- Derivative of c^t
  (Ogdstate v x)            -- New state/action w^(t+1)
ogd rate projection (Ogdstate t w) dc = Ogdstate (t+1) $
  projection $ w ^-^ ( rate t *^ dc w)

-- | Helper. Builds an initial state for 'ogd'.
initialOGDState :: (Additive v, Floating x) => v x -> Ogdstate v x
initialOGDState w = Ogdstate 1 w

We now have the ogd gradient descent step nailed down. I’d like to think of a regression problem next, so let’s introduce some more common vocabulary in our code first. We’ll abstract over this gradient step and expose an interface for online linear regression.

Linear Regression with OGD

For this first post, I want to use something quite straightforward. Accordingly, we’ll avoid all modeling work and use a basic setup. I’m choosing one of the most classical task: linear regression.

Online Regression

The online regression framework is the following.

\(\forall t=0,\cdots\) the learner
\(\quad\) Observes a feature vector \(x^t \in \mathbb{R}^n\)
\(\quad\) Selects a prediction \(\widehat{y^t} \in \mathbb{R}\)
\(\quad\) Observes the regression target \(y^t \in \mathbb{R}\)
\(\quad\) Incurs a loss \(l(\widehat{y^t},y^t)\)

Well, we can already write a type for that loss function \(l\):

-- | Differential of a regression loss
type RegressionLossDerivative v x =
  v x ->                  -- x^t
  x   ->                  -- y^t
  (LossDerivative v x)  -- c^t

The Linear Predictor

We make the classical restrictive design choices on the learner. First, we use a constrained linear model, with a parameter vector in a set \(F \subset \mathbb{R}^n\). Second, we commit to choose \(w^t\) before observing \(x^t\). Under this design, the regression framework “reduces” to OCP:

$∀ t=0,…$ the learner
\(\quad\) Selects an action $wt\(F\)
\(\quad\) Observes a feature vector \(x^t \in \mathbb{R}^n\) and predicts \(\widehat{y^t} = {w^{t}}^T x\)
\(\quad\) Observes the regression target \(y^t \in \mathbb{R}\) and incurs a loss \(c^t(w^t) = l( {w^{t}}^T x,y^t)\)

This means that we need to use the following prediction function:

-- | Prediction function
predictReg :: (Metric v, Floating x) => Ogdstate v x -> v x -> x
predictReg (Ogdstate _ w) x = w `dot` x

Let’s wrap the previously implemented ogd in order to expose a regression interface. We’ll do this without making the our choice for \(l\) nor \(F\) explicit:

-- | Regression fit call
type FitReg v x =
  Ogdstate v x                     ->  --the old learner state
  v x                              ->  --the x^t feature vector
  x                                ->  --the y^t regression target
  Ogdstate v x                         --the new learner state

-- | Fitting one regression data point using OGD.
-- | The first three arguments of this function parametrize the problem.
fitOGDReg :: (Additive v, Floating x) =>
  (LearningRate x ) ->
  (Projection v x) ->
  (RegressionLossDerivative v x) ->
  FitReg v x
fitOGDReg rate projection differential state x y =
  ogd rate projection state $ differential x y

That’s it, we now have an interface for predicting a regression value (predictReg) and for feeding an instance to a regression model (FitReg v x). Let’s choose and implement a regression model.

Ridge Regression

We’ll consider the Ridge Regression approach, also called Thikonov regularization. In that setup, the set \(F\) is an \(\ell_2\) ball of radius \(\lambda\) and \(c^t\) is the square loss:

\[\begin{equation} c^t(w) = \| y^t - w^{T} x^t \|^2 \quad y \in \mathbb{R}, x \in \mathbb{R}^n \\ F = \left\{ x \in \mathbb{R} \; \middle| \; \|x\| \leq \lambda \right\} \end{equation} \]

This leads leads to the following OGD parametrization:

\[\begin{equation} \nabla c^t (w) = x^t \left( w^T x^t - y^t \right) \\ P(w) = \begin{cases} w & \text{if}\ \|w\| \leq \lambda \\ \lambda \frac{w}{\|w\|} & \text{otherwise} \end{cases} \end{equation} \]

The diameter of \(F\) depends on our choice of \(\lambda\) (\(\|F\|=2\lambda\)) and the magnitude of $\| ∇ c \| $ depends on the regression data. In the case of \(\eta_t=\frac{1}{t^2}\), we get a regret bound that scales both with \(\lambda^2\) and this data-dependent term:

\[\begin{equation} R_F(T) \leq \frac{\|F\|^2 \sqrt{T}}{2} + \left( \sqrt{T} - \frac{1}{2} \right) \|\nabla c \|^2 = 2 \lambda^2 \sqrt{T} + \left( \sqrt{T} - \frac{1}{2} \right) \|\nabla c \|^2 \end{equation} \]

where \(\|\nabla c \| = \max_{w,x,y} x^t \left( w^T x^t - y^t \right)\).

Let’s implement the projection operator and the differential of the Gaussian loss. We’ll then use fitOGDReg and get our learner.

-- | Argument name for Lambda.
newtype Lambda x = Lambda x
-- | Projection on a L2 Ball of radius lambda.
l2Projection :: (Metric v, Ord x, Floating x) =>
  Lambda x ->
  Projection v x
l2Projection (Lambda lambda) w = if q <= lambda
                                 then w
                                 else lambda *^ w ^/ q
                                 where q = norm w

-- | Derivative of the square loss function.
mseDiff :: (Metric v) => RegressionLossDerivative v x
mseDiff x y w = x ^* ( w `dot` x - y)

-- | Online ridge regressor via OGD.
fitRidge :: (Metric v, Ord x, Floating x) =>
  LearningRate x ->
  Lambda x ->
  FitReg v x
fitRidge rate lambda = fitOGDReg rate (l2Projection lambda) mseDiff

The learner is now fully implemented. In the next post in this series, we’ll take a look at how we can use this implementation to predict the EUR/USD exchange rate.