R and Shark Machine Learning

In this post we give a brief introduction to Shark ML and how to interface it from R. Shark is a very powerful and feature-rich machine learning library written in C++. Shark is not only fast (the project publishes benchmarks) but also offers a great balance between speed and ease of use. Interfacing Shark from R using Rcpp makes training and prototyping models a breeze.

Requirements:

  • R installed, plus familiarity with writing R extensions (Windows users will need to install the Rtools build toolchain).
  • A working Rcpp installation (a quick check is shown below).

RStudio is also highly recommended. It really shines when you are creating R extensions containing small C++ snippets.
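
A quick way to verify that Rcpp and the compiler toolchain work (a minimal sanity check, not part of the original setup) is to compile a trivial C++ expression from the R console:

Rcpp::evalCpp("2 + 2")  # compiles a one-line C++ snippet on the fly; should return 4

If this returns 4, everything is in place.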

Getting started:

The RcppShark package offers a way to quickly launch and run our first Shark model from R. First we use it to create an RcppShark package skeleton, which we later extend:

install.packages("RcppShark")
library(RcppShark)
RcppShark.package.skeleton("mySharkPackage")
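
The call generates the usual R package layout (the exact contents depend on the RcppShark version, so treat this as a rough sketch):

mySharkPackage/
  DESCRIPTION
  NAMESPACE
  R/
  src/          <- our C++ code goes here, next to the bundled utils.h conversion helpers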

We then edit the DESCRIPTION file of our newly created “mySharkPackage” package to enable C++11. We add the line

SystemRequirements: C++11

to the file.
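
After the edit, the relevant part of the DESCRIPTION might look as follows. The LinkingTo entries are an assumption based on the headers our code uses; the skeleton sets the actual values:

Package: mySharkPackage
Version: 1.0
Imports: Rcpp
LinkingTo: Rcpp, RcppShark, BH
SystemRequirements: C++11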

Training a model:

In the following, we will train a simple classification model from the Shark model inventory. This example uses Shark's implementation of the Random Forest by Breiman [L. Breiman. Random Forests. Machine Learning, vol. 45, issue 1, p. 5-32, 2001]. We create the file “RandomForest.cpp” in the package's “src” folder, containing the following piece of code:

#include <shark/Algorithms/Trainers/RFTrainer.h> //the random forest trainer
#include <shark/ObjectiveFunctions/Loss/ZeroOneLoss.h> //zero one loss for evaluation
#include <shark/Data/Dataset.h>
#include <Rcpp.h>
#include "utils.h"

using namespace shark;
using namespace std;
using namespace Rcpp;

//' @export
//'
// [[Rcpp::depends(BH)]]
// [[Rcpp::export]]
List SharkRFTrain (NumericMatrix X, NumericVector Y, int nTrees = -1) {

  // Convert the R data into a ClassificationDataset
  UnlabeledData<RealVector> inputs = NumericMatrixToUnlabeledData(X);
  Data<unsigned int> labels = NumericVectorToLabels(Y);
  ClassificationDataset trainData, testData;
  trainData.inputs() = inputs;
  trainData.labels() = labels;

  // Split data: the first 80 pct. stay in trainData, the rest become testData
  testData = splitAtElement(trainData, 80 * trainData.numberOfElements() / 100);

  // Train the random forest:
  RFTrainer trainer;
  if (nTrees > 0) trainer.setNTrees(nTrees);
  RFClassifier model;
  trainer.train(model, trainData);

  // Make predictions
  Data<RealVector> predsTrain = model(trainData.inputs()); // evaluate on the training set
  Data<RealVector> predsTest = model(testData.inputs());   // evaluate on the test set

  // Compute the error on the test data:
  ZeroOneLoss<unsigned int, RealVector> loss; // 0-1 loss
  double testError = loss.eval(testData.labels(), predsTest);

  // Return the components:
  return Rcpp::List::create(
    Rcpp::Named("error") = testError,
    Rcpp::Named("predsTrain") = DataRealVectorToNumericMatrix(predsTrain),
    Rcpp::Named("predsTest") = DataRealVectorToNumericMatrix(predsTest)
  );
}

where we

  1. convert the Rcpp matrices into a ClassificationDataset,
  2. use Shark to split the data into training and test sets (80/20),
  3. train the model on the training data and use Shark to evaluate it on the test data, and
  4. finally convert types before returning the predictions and the test error.
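
Before we can call the new function, the package has to be compiled and installed. A minimal sketch, assuming the devtools package is available (running R CMD INSTALL mySharkPackage from the command line works just as well):

# compile the C++ sources and install the package into the R library
devtools::install("mySharkPackage")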

Test:

After building the package, we use the spam data from “kernlab” to test our code:

data(spam, package = "kernlab")
spam <- spam[sample.int(nrow(spam)), ]        # shuffle rows so the 80/20 split is random
X <- as.matrix(spam[, names(spam) != "type"])
Y <- as.numeric(spam$type == "spam")          # 1 = spam, 0 = nonspam

library(mySharkPackage)
sharkFit <- SharkRFTrain(X, Y, nTrees = 100)
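
The returned list holds the test error together with the raw predictions. The predictions come back as a numeric matrix of per-class scores; for this two-class problem we expect one column per class (an assumption about the converter's layout), so the predicted class of an observation is the column with the largest score:

sharkFit$error                                # 0-1 loss on the 20 pct. hold-out
head(sharkFit$predsTest)                      # per-class scores for the test observations
predClass <- max.col(sharkFit$predsTest) - 1  # map column indices back to 0/1 labels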

In a later post we will look into the process of tuning hyper-parameters using Shark. Typically one focuses most of the attention on the feature-bagging parameter, or “mtry”, when tuning a Random Forest. For now we will simply leverage R's powerful visualization tools to investigate the relationship between the error rate and the number of trees:

treeCount <- seq(1, 100, 3)                   # forest sizes to try
errorRate <- sapply(treeCount, function(x) SharkRFTrain(X, Y, x)$error)  # test error per size
library(ggplot2)
gg1 <- ggplot(data.frame(treeCount, errorRate), aes(treeCount, errorRate))
gg1 + geom_point() + geom_smooth()

[Plot: test error rate versus number of trees, points with a smoothed trend line]
The error rate decreases with the number of trees, since averaging over more trees reduces variance. It settles at around 5 pct. after roughly 75 trees.
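
To check the 5 pct. figure numerically rather than reading it off the smoother:

errorRate[treeCount >= 75]  # measured test errors for forests of 75 or more trees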