Benchmark experiments

To get an unbiased estimate of the performance on new data, it is generally not enough to simply run repeated cross-validations over a given set of hyperparameters and methods (see tuning) and report the best result, as that estimate is likely to be overly optimistic.

A better (although more time-consuming) approach is to nest two resampling methods. To keep the explanation simple, let's use cross-validation for both; this case is also called "double cross-validation". In the so-called "outer" cross-validation, the data is split repeatedly into a (larger) training set and a (smaller) test set in the usual way. In every outer iteration, the learner is then tuned on the training set by performing an "inner" cross-validation. The best hyperparameters found are selected, the learner is fitted with them to the complete "outer" training set, and the resulting model is evaluated on the (outer) test set. This yields much more reliable estimates of the true performance distribution of the learner on unseen data. These estimates can then be used to estimate location parameters (e.g. the mean or median performance value) and to compare learning methods fairly.
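
The procedure described above can be sketched in plain base R, without mlr. This is only an illustration under stated assumptions: the hand-written k-nearest-neighbour classifier and the helper names `knn_predict`, `make_folds` and `cv_error` are hypothetical stand-ins and not part of mlr, and the misclassification rate is assumed as the performance measure.

```r
set.seed(1)  # reproducible fold assignment

# Hypothetical stand-in learner: k-nearest-neighbour by majority vote
knn_predict <- function(train_x, train_y, test_x, k) {
  apply(test_x, 1, function(row) {
    d <- colSums((t(train_x) - row)^2)      # squared distance to every training point
    nearest <- order(d)[seq_len(k)]
    names(which.max(table(train_y[nearest])))
  })
}

# Randomly assign each of n observations to one of `iters` folds
make_folds <- function(n, iters) sample(rep(seq_len(iters), length.out = n))

# Mean misclassification error of k-NN over the given folds
cv_error <- function(x, y, k, folds) {
  mean(sapply(sort(unique(folds)), function(i) {
    test <- folds == i
    pred <- knn_predict(x[!test, , drop = FALSE], y[!test],
                        x[test, , drop = FALSE], k)
    mean(pred != as.character(y[test]))
  }))
}

x <- as.matrix(iris[, 1:4])
y <- iris$Species
ks <- 1:5                                   # candidate hyperparameter values
outer_folds <- make_folds(nrow(x), 5)

nested <- t(sapply(1:5, function(i) {
  test <- outer_folds == i
  x_train <- x[!test, , drop = FALSE]
  y_train <- y[!test]

  # Inner cross-validation: tune k using the outer training set only
  inner_folds <- make_folds(nrow(x_train), 3)
  inner_err <- sapply(ks, function(k) cv_error(x_train, y_train, k, inner_folds))
  best_k <- ks[which.min(inner_err)]

  # Refit on the complete outer training set, evaluate on the outer test set
  pred <- knn_predict(x_train, y_train, x[test, , drop = FALSE], best_k)
  c(k = best_k, tune.perf = min(inner_err),
    test.perf = mean(pred != as.character(y[test])))
}))

print(nested)
cat("mean:", mean(nested[, "test.perf"]), "sd:", sd(nested[, "test.perf"]), "\n")
```

Each row of `nested` belongs to one outer iteration. The `tune.perf` column comes from the inner cross-validation and is the potentially optimistic value; `test.perf` is computed on data that played no part in the tuning.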

Using mlr, setting up such an experiment becomes very easy:

Example 1

	# Classification task with iris data set 
	ct <- make.classif.task(data = iris, target = "Species")

	# Very small grid for svm hyperparameters  
	r <- list(C = 2^seq(-1,1), sigma = 2^seq(-1,1))

	# Define "inner" cross-validation indices
	inner.res <- make.res.desc("cv", iters = 3)   

	# Tune an SVM
	svm.tuner <- make.tune.wrapper("kernlab.svm.classif", method = "grid", 
							      resampling = inner.res, 
							      control = grid.control(ranges=r))

	# Three learners to be compared
	learners <- c("lda", "qda", svm.tuner)

	# Define "outer" cross-validation indices 
	res <- make.res.desc("cv", iters = 5)

	# Combine everything into a benchmark experiment
	result <- bench.exp(learners, ct, res)
	
	Benchmark result
	                mean         sd
	LDA       0.02000000 0.02981424
	qda       0.03333333 0.04714045
	tuned-svm 0.05333333 0.03800585
	

The above code should be largely self-explanatory. In the result, every row corresponds to one learner. The entries show the mean test error over the outer resampling iterations and its standard deviation.
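
To get a feeling for the cost of the nested setup, the grid from Example 1 can be enumerated in base R. This is a rough sketch: it assumes that the grid search evaluates every combination once per inner fold and that the model is refitted once per outer iteration; the variable names `grid`, `inner_iters`, `outer_iters` and `n_fits` are made up for this illustration.

```r
# Hyperparameter ranges as defined in Example 1
r <- list(C = 2^seq(-1, 1), sigma = 2^seq(-1, 1))

# All combinations that a grid search would evaluate
grid <- expand.grid(r)
print(grid)

inner_iters <- 3
outer_iters <- 5

# Inner fits for tuning, plus one refit per outer iteration with the selected setting
n_fits <- nrow(grid) * inner_iters * outer_iters + outer_iters
print(n_fits)
```

Under these assumptions, even this very small 3 x 3 grid leads to 140 SVM fits, which is why nested resampling is noticeably more time-consuming than a single cross-validation.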

However, the benchmark result contains much more information, which you can access if you want to see the details. Let's have a look at the benchmark result from the example above:

Example 1 (continued)

	# Access further information 
	# The individual performances in each outer cross-validation iteration
	result["perf"]
	
	         LDA        qda  tuned-svm
	1 0.03333333 0.10000000 0.06666667
	2 0.00000000 0.00000000 0.00000000
	3 0.00000000 0.00000000 0.06666667
	4 0.06666667 0.06666667 0.10000000
	5 0.00000000 0.00000000 0.03333333
	
	# A list of the tuned parameters with tune- and test-performance 
	result["tuned.pars"]
	
	[[1]]$LDA
	[1] NA

	[[1]]$qda
	[1] NA

	[[1]]$`tuned-svm`
	    C sigma  tune.perf  test.perf
	1 1.0   0.5 0.02500000 0.06666667
	2 2.0   0.5 0.03333333 0.00000000
	3 1.0   0.5 0.04166667 0.06666667
	4 0.5   0.5 0.01666667 0.10000000
	5 0.5   0.5 0.04166667 0.03333333
	
	# Confusion matrices - one for each learner
	result["conf.mats"]
	
	[[1]]$LDA
	            predicted
	true         setosa versicolor virginica -SUM-
	  setosa         50          0         0     0
	  versicolor      0         48         2     2
	  virginica       0          1        49     1
	  -SUM-           0          1         2     3

	[[1]]$qda
	            predicted
	true         setosa versicolor virginica -SUM-
	  setosa         50          0         0     0
	  versicolor      0         46         4     4
	  virginica       0          1        49     1
	  -SUM-           0          1         4     5

	[[1]]$`tuned-svm`
	            predicted
	true         setosa versicolor virginica -SUM-
	  setosa         50          0         0     0
	  versicolor      0         46         4     4
	  virginica       0          4        46     4
	  -SUM-           0          4         4     8
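
The different views of the result are consistent with each other, which can be checked by hand. The following base-R sketch recomputes the summary statistics from the per-fold test errors printed by `result["perf"]`; the numbers are copied from the output above, and the object names `perf`, `summary_tab` and `svm_error` are chosen only for this illustration.

```r
# Per-fold test errors copied from the result["perf"] output above
perf <- data.frame(
  LDA = c(0.03333333, 0, 0, 0.06666667, 0),
  qda = c(0.10000000, 0, 0, 0.06666667, 0),
  `tuned-svm` = c(0.06666667, 0, 0.06666667, 0.10000000, 0.03333333),
  check.names = FALSE
)

# Mean and standard deviation over the five outer iterations
summary_tab <- data.frame(mean = colMeans(perf), sd = sapply(perf, sd))
print(summary_tab)

# Error rate implied by the tuned-svm confusion matrix: 8 errors in 150 predictions
svm_error <- 8 / 150
print(svm_error)
```

For `tuned-svm`, the mean of the per-fold errors coincides with the error rate from the confusion matrix because all five test folds have the same size (30 observations).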
	

Of course, everything works the same way if you exchange the resampling strategy in either the outer or the inner run; the strategies can be freely mixed.
The following example uses an outer bootstrap and an inner cross-validation, with k-nearest-neighbor as the learner.

Example 2

	# Classification task with iris data set 
	ct <- make.classif.task(data = iris, target = "Species")

	# Range of hyperparameter k  
	r <- list(k = 1:5)

	# Define "inner" cross-validation indices
	inner.res <- make.res.desc("cv", iters = 3)   

	# Tune a k-nearest-neighbor learner
	knn.tuner <- make.tune.wrapper("kknn.classif", method = "grid", 
						       resampling = inner.res, 
						       control = grid.control(ranges=r))

	# Define "outer" bootstrap indices 
	res <- make.res.desc("bs", iters = 5)

	# Combine everything into a benchmark experiment
	result <- bench.exp(knn.tuner, ct, res)
	
	Benchmark result
	           mean         sd
	[1,] 0.05747409 0.02422707
	
	
	# Which performances did we get in the single runs? 
	result["perf"]
	
	   tuned-knn
	1 0.07272727
	2 0.08000000
	3 0.05263158
	4 0.06382979
	5 0.01818182
	

	# Which parameters belong to these performances? 
	result["tuned.pars"]
	
	$`tuned-knn`
	  k   tune.perf  test.perf
	1 2 0.013333333 0.07272727
	2 4 0.006666667 0.08000000
	3 1 0.026666667 0.05263158
	4 1 0.013333333 0.06382979
	5 5 0.026666667 0.01818182
		

	# What does the confusion matrix look like? 
	result["conf.mats"]
	
	$`tuned-knn`
	            predicted
	true         setosa versicolor virginica -SUM-
	  setosa         87          1         0     1
	  versicolor      0         89         5     5
	  virginica       0          9        73     9
	  -SUM-           0         10         5    15
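
The confusion matrix above contains 264 predictions, more than the 150 observations in iris. A plausible explanation is that each bootstrap iteration is tested on its out-of-bag observations, so test sets vary in size and may overlap between iterations. This is an assumption and is not stated in the text above. The following base-R sketch only illustrates the idea; the names `oob_sizes`, `train` and `test` are hypothetical.

```r
set.seed(1)  # reproducible draws
n <- nrow(iris)
iters <- 5

# Size of the out-of-bag test set in each bootstrap iteration
oob_sizes <- sapply(seq_len(iters), function(i) {
  train <- sample(n, n, replace = TRUE)   # bootstrap sample: n draws with replacement
  test <- setdiff(seq_len(n), train)      # observations never drawn (out-of-bag)
  length(test)
})

print(oob_sizes)
print(sum(oob_sizes))
```

On average, roughly 36.8% of the observations are out-of-bag in each iteration, so five iterations on 150 observations give a total test size in the same range as the 264 predictions shown above.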
	

Adding another learner to an existing benchmark experiment is easy in mlr. The big advantage is that the same resampling splits are used as for the other learners, so the results remain directly comparable.
Let's take Example 1 and add another learner, Naive Bayes.

Example 1 (continued)

	new.result <- bench.add(learner = "naiveBayes", task = ct, result = result)
	
	(output to be inserted)
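
Why reusing the same splits matters can be illustrated with a small base-R sketch that does not use mlr. The fold assignment is created once and stored; a learner added later is evaluated on exactly those folds, so per-fold differences between learners are paired. The two learners (`centroid_predict`, `majority_predict`) and the helper `evaluate` are hypothetical stand-ins chosen for brevity.

```r
set.seed(1)
x <- as.matrix(iris[, 1:4])
y <- iris$Species

# Created once and stored: the fold membership of every observation
folds <- sample(rep(1:5, length.out = nrow(x)))

# Hypothetical stand-in learner 1: nearest class centroid
centroid_predict <- function(train_x, train_y, test_x) {
  centers <- apply(train_x, 2, function(col) tapply(col, train_y, mean))
  apply(test_x, 1, function(row) {
    d <- rowSums((centers - rep(row, each = nrow(centers)))^2)
    rownames(centers)[which.min(d)]
  })
}

# Hypothetical stand-in learner 2: always predict the most frequent training class
majority_predict <- function(train_x, train_y, test_x) {
  rep(names(which.max(table(train_y))), nrow(test_x))
}

# Test error of a learner in every fold of the stored split
evaluate <- function(predict_fun) {
  sapply(1:5, function(i) {
    test <- folds == i
    pred <- predict_fun(x[!test, , drop = FALSE], y[!test],
                        x[test, , drop = FALSE])
    mean(pred != as.character(y[test]))
  })
}

perf <- data.frame(centroid = evaluate(centroid_predict))

# Added later, but on exactly the same splits, so the rows stay comparable
perf$majority <- evaluate(majority_predict)

print(perf)
print(perf$majority - perf$centroid)  # paired per-fold differences
```

Because both columns refer to identical training and test sets, the row-wise differences reflect the learners only and not the randomness of the split.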