Integrating R&D tooling in the Obandit machine learning package

Posted on May 19, 2018 under tag(s) machine learning, functional programming, opam, ocaml, nix

This first post is long overdue. Here, I discuss software choices for writing machine learning code. The content may strike you as odd if you don’t know about the Nix project, but there is no other prerequisite.

The post details how I experimented with the OCaml ecosystem by writing and publishing the Obandit multi-armed bandit library. The goals were specific: machine learning code has to be benchmarked before it is released, and it is routinely used in interactive analytics. More precisely, I wanted to achieve the following:

Opam itself is quite straightforward. A single opam file for the project describes the packaging:

opam-version: "1.2"
maintainer: "Valentin Reis <fre@freux.fr>"
authors: ["Valentin Reis <fre@freux.fr>"]
homepage: "http://freux.fr/oss/obandit.html"
doc: "http://freux.fr/oss/obandit/documentation/index.html"
license: "ISC"
dev-repo: "https://github.com/freuk/obandit.git"
bug-reports: "https://github.com/freuk/obandit/issues"
tags: []
available: [ ocaml-version >= "4.01.0"]
depends: [
  "ocamlfind" {build}
  "ocamlbuild" {build}
  "topkg" {build}
  "cmdliner"
  "batteries" ]
depopts: []
build: [ "ocaml" "pkg/pkg.ml" "build" "--pinned" "%{pinned}%" ]
build-test: [
 [ "ocaml" "pkg/pkg.ml" "build" "--pinned" "%{pinned}%" "--tests" "true" ]
 [ "ocaml" "pkg/pkg.ml" "test" ]]

Managing opam projects is pleasant, as the company behind opam and the opam package repository does a stellar job. There are modern tools such as topkg, the ‘transitory package manager’, which handles most routine opam publication tasks: building the documentation, tagging the git repository, producing the release tarballs, pushing them to GitHub, and submitting the corresponding opam-repository pull request. The author gives an interesting presentation of this tool here.

The OCaml build toolchain is flexible. The ocamlbuild build tool permits the use of compilation plugins, which makes it possible to build the documentation in a highly customizable way. Here, I was able to use a custom MathJax documentation generator. Any opam user can call this custom documentation builder using the odig tool, from the same author as topkg. There are existing tools for this, such as ltxhtml, but here the plugin simply adds some HTML headers that enable MathJax. Ocamlbuild is configured in the myocamlbuild.ml file, where "doc/plugin.cmxs" is the compiled plugin file:

open Ocamlbuild_plugin

let () =
  dispatch (function
    | After_options ->
       let ocamldoc tags deps docout docdir =
         let tags = tags -- "extension:html" in
         Ocamlbuild_pack.Ocaml_tools.ocamldoc_l_dir tags deps docout docdir in

       rule "ocamldoc with plugin"
         ~prod:"%.docdir/index.html"
         ~stamp:"%.docdir/html.stamp"
         ~deps:["%.odocl"; "doc/plugin.cmxs"]
         (Ocamlbuild_pack.Ocaml_tools.document_ocaml_project
            ~ocamldoc "%.odocl" "%.docdir/index.html" "%.docdir");

      pflag ["ocaml"; "doc"; "docdir"] "plugin"
        (fun plugin -> S [A "-g"; A plugin])
    | _ -> ())

The resulting documentation can be found here.

Fortunately, the library has few OCaml dependencies. This made the Nix packaging a breeze, which matters because the nixpkgs OCaml package collection is not extensive. There are some experimental tools for bridging Nix and opam, such as opam2nix, but these were too unstable for my taste at the time of writing this code. The library is packaged in my personal package repository on GitHub. The Nix package itself is described as follows in the obandit.nix file:

{ stdenv, fetchzip, ocaml, findlib, ocamlbuild, opam, ocamlPackages, topkg }:

stdenv.mkDerivation rec {
  name = "obandit-${version}";
  version = "0.3.4";

  src = fetchzip {
    url = "https://github.com/freuk/obandit/archive/v${version}.tar.gz";
    sha256 = "0jzfn6jgmflcfinw4izlwjmahm3g49an7yn0675f7f2zvgchl0nr";
  };

  buildInputs = [ ocaml findlib ocamlbuild topkg opam
  ocamlPackages.ocaml_batteries
  ocamlPackages.cmdliner
  ];

  inherit (topkg) buildPhase;
  patchPhase=''
    substituteInPlace src/obandit.mli --replace %%VERSION%% ${version}
  '';
  installPhase = ''
    opam-installer -i --prefix=$out --libdir=$OCAMLFIND_DESTDIR
    mkdir -p $out/doc
    ${ocaml}/bin/ocamlc -I +ocamldoc -c doc/plugin.ml
    cp doc/style.css $out/doc
    ${ocaml}/bin/ocamldoc -g doc/plugin.cmo -d $out/doc src/obandit.mli
  '';
  meta = {
    license = stdenv.lib.licenses.isc;
    homepage = "https://github.com/freuk/obandit";
    description = "OCaml module for multi-armed bandits";
    inherit (ocaml.meta) platforms;
  };
}

To integrate the web description of the command-line tool with the documentation, we tie three packages (obandit.nix, validation.nix, web.nix) together in a default.nix file:

{ pkgs, zymake }:
let
  latest = pkgs.ocamlPackages.callPackage ./obandit.nix {};
  makeSet = obandit: {
    inherit obandit;
    validation = pkgs.callPackage ./validation.nix {
      inherit obandit zymake;
      rstudioWrapper=pkgs.rstudioWrapper.overrideAttrs (oldAttrs : {
        propagatedBuildInputs = with pkgs;[ qt5.full qt5.qtwebkit qt5.qtwebchannel ];
      });
    };
    web = pkgs.callPackage ./web.nix { inherit obandit; };
  };
in makeSet latest

The web.nix file builds the description of the obandit command-line tool. It uses the obandit package and simply runs a minimal knitr file. A Makefile calls the appropriate command on a knitr report, which showcases the relevant obandit --help sections by calling the binary directly.

{ stdenv, R, rPackages, obandit, ...}:
let
  rPackList= with rPackages; [
    knitr
  ];
in stdenv.mkDerivation {
  name="cli-webpage";
  src = builtins.toPath "${obandit.src}/web";
  buildInputs = [ R obandit ]++rPackList;

  installPhase =''
    mkdir -p $out
    cp web.md $out
  '';

  obanditversion=obandit.version;
}

The workflow presented in this post is just a proof of concept and doesn’t do much more than plot the regret of a few algorithms from the library. However, this workflow will evolve with new versions and can become complex, so we want to use classical data science tools to design it. Moreover, it provides an entry point for anyone who wants to run data science experiments using the library. I chose a minimalistic solution based on zymake, a workflow system for parallel experiments, along with some interactive R scripts.
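
To make the plotted quantity concrete: the cumulative (pseudo-)regret after n pulls is the sum, over rounds, of the gap between the best arm's expected reward and the expected reward of the arm actually pulled. The following is a minimal sketch in plain OCaml, assuming the arm means are known (as they are for synthetic data). The function name and its signature are made up for this illustration and are not part of the Obandit API.

```ocaml
(* Cumulative pseudo-regret of a sequence of pulls.
   [means.(a)] is the expected reward of arm [a]; [actions] lists the arm
   pulled at each round. Illustrative names only, not Obandit's API. *)
let cumulative_regret (means : float array) (actions : int list) : float list =
  (* Expected reward of the best arm. *)
  let best = Array.fold_left max neg_infinity means in
  (* Add up the per-round gap, keeping the running total at every step. *)
  let _, rev_curve =
    List.fold_left
      (fun (total, acc) a ->
        let total = total +. (best -. means.(a)) in
        (total, total :: acc))
      (0.0, []) actions
  in
  List.rev rev_curve

let () =
  (* Three arms; the policy pulls arm 0, then arm 2, then the best arm twice. *)
  let means = [| 0.25; 0.75; 0.5 |] in
  let actions = [ 0; 2; 1; 1 ] in
  cumulative_regret means actions
  |> List.iter (fun r -> Printf.printf "%.2f\n" r)
```

The resulting curve is flat whenever the policy plays the best arm, which is why a good policy shows a regret curve that grows more and more slowly.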

The workflow uses an Rscript file to generate CSV data, runs the obandit command-line wrapper, massages the data and executes a rudimentary knitr report script. This zymakefile defines the chain of dependencies:

#arms
k = 8

#number of pulls in each experiment
n = 3000

#number of experiments
isamples = $(range 1 30)

policies = exp3 ucb1 expgreedy

rate = 0.1

rstudo 0 ./generate.R $(i) $(k) $(n) $(> data="bernouilli").csv

obandit csv $(<).csv $(>).actions $(k) $(rate) $(policy)

printf "$(data)-$(policy) \n$(<i=*is).csv " > $(>).list_csv

printf "$(data)-$(policy) \n$(<i=*is).actions " > $(>).list_actions

printf "$(policy)\n  $(<).list_csv \n $(<).list_actions " > $(>).item_policy

printf " \n$(< policy=*policies).item_policy " > $(>).list_policies

rstudo 1 ./analyze.R $(<).list_policies $(> ).md

pandoc -o $(>).html -s $(< ).md

echo "
$( data="bernouilli" is=isamples ).html
"
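
For readers unfamiliar with the policies named in the zymakefile, here is a rough idea of what one of them does when it turns a table of rewards into a list of actions. This is a self-contained sketch of the classical UCB1 rule in plain OCaml, not Obandit's implementation or API. The function names and the table layout (one row per round, one column per arm) are assumptions made for this example.

```ocaml
(* UCB1 index of one arm: empirical mean plus an exploration bonus.
   [t] is the current round (1-based), [sum] the total reward collected from
   the arm so far, [n] the number of times it was pulled. *)
let ucb1_index ~t ~sum ~n =
  if n = 0 then infinity (* unexplored arms are tried first *)
  else
    let n = float_of_int n in
    (sum /. n) +. sqrt (2.0 *. log (float_of_int t) /. n)

(* Replay UCB1 over a reward table: [rewards.(t).(a)] is the reward arm [a]
   would yield at round [t] (an assumed layout; rows are assumed non-empty
   and of equal length). Returns the arm pulled at each round. *)
let run_ucb1 (rewards : float array array) : int list =
  if Array.length rewards = 0 then []
  else begin
    let k = Array.length rewards.(0) in
    let sums = Array.make k 0.0 and counts = Array.make k 0 in
    let index t a = ucb1_index ~t ~sum:(sums.(a)) ~n:(counts.(a)) in
    let actions = ref [] in
    Array.iteri
      (fun i row ->
        let t = i + 1 in
        (* Pick the arm with the highest index; ties go to the lowest arm. *)
        let best = ref 0 in
        for a = 1 to k - 1 do
          if index t a > index t !best then best := a
        done;
        (* Only the reward of the pulled arm is observed. *)
        sums.(!best) <- sums.(!best) +. row.(!best);
        counts.(!best) <- counts.(!best) + 1;
        actions := !best :: !actions)
      rewards;
    List.rev !actions
  end

let () =
  (* Toy table: arm 1 always pays 1.0, arm 0 never pays. *)
  let rewards = Array.make 100 [| 0.0; 1.0 |] in
  let actions = run_ucb1 rewards in
  let pulls a = List.length (List.filter (fun x -> x = a) actions) in
  Printf.printf "arm 0: %d pulls, arm 1: %d pulls\n" (pulls 0) (pulls 1)
```

On this toy table the policy tries each arm once and then concentrates on the paying arm, returning to the other one only occasionally as the exploration bonus grows.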

The validation.nix file provides a replicable way to set up an experimental environment, run a one-shot computational workflow, and run or edit a notebook-like analysis workflow in RStudio. One can simply build the resulting page (in markdown format) using nix-build -A validation, or use a nix-shell to edit the analytics interactively.

Here’s an interactive look at this process:

We build the package using nix-build, then run the build chain again manually. The build tool then asks whether we want to start the analysis in RStudio. The analyze.R script manages the argument passing for this setup.

The resulting markdown page is then ready for processing by the build chain of this website. The result can be seen here. The package dependency logic is defined as:

#this imports the default.nix file from above.
obanditPkgs = import ./obandit {
  pkgs = pkgs-stable;
  inherit zymake;
};
obandit = obanditPkgs.obandit;
frehk-generator = pkgs-stable.callPackage ./frehk-generator { };
frehk-site = pkgs-stable.callPackage ./frehk-site
  { inherit publis obanditPkgs cv frehk-generator; };

Here, frehk-generator and frehk-site contain the Nix derivations which build the generator and the website itself. This is available on GitHub. The frehk-site package retrieves the arguments from obanditPkgs and extracts the documentation, the command-line interface description, and the experimental validation for web processing.

This small example shows how a highly programmable package manager can be an asset when writing production ML software. The notebook-like R&D aspect can be programmatically tied to the codebase. Here, the separation between the R&D activity and the production deployment is still quite large, because we use a command-line wrapper and change language. In the future, I want to explore how language bindings and the IOCaml and IHaskell Jupyter kernels can help shorten the distance between data science tasks and library development.

