Resampling strategies describe how to repeatedly draw training and test sets from the data set D under examination,
so that the learning method can be fitted on the training sets and validated on the test sets.
Here it is assumed that every resampling strategy consists of a number of iterations, where each iteration
is defined by indices into D that specify the respective training and test sets. These iterations are implemented by
storing the index sets in a so-called resample.instance.
The packages come with a couple of predefined strategies:
Subsampling
In each iteration i the data set D is randomly partitioned into a training and a test set according to a
given percentage (for example 2/3 training set, 1/3 test set). If there is just one iteration, the strategy is commonly
called holdout.
# split is the training set percentage
rin <- make.subsample.instance(iters=10, size=nrow(iris), split=2/3)
# holdout
rin <- make.subsample.instance(iters=1, size=nrow(iris), split=2/3)
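To make the content of such an instance concrete, the following base R sketch draws subsampling index sets by hand. It does not use the package: the helper name `subsample_indices` and the layout of its return value are assumptions made only for illustration.

```r
# Hypothetical helper (not part of the package): draw subsampling index sets.
subsample_indices <- function(iters, size, split) {
  n_train <- round(size * split)
  lapply(seq_len(iters), function(i) {
    train <- sort(sample.int(size, n_train))  # random training indices
    test <- setdiff(seq_len(size), train)     # all other rows form the test set
    list(train.inds = train, test.inds = test)
  })
}

set.seed(1)
inds <- subsample_indices(iters = 10, size = nrow(iris), split = 2/3)
length(inds[[1]]$train.inds)  # 100 training observations
length(inds[[1]]$test.inds)   # 50 test observations
```

Calling the helper with iters = 1 gives a single split, which corresponds to the one-iteration case described above.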
k-fold Crossvalidation
The data set is partitioned into k subsets of (nearly) equal size. In the i-th of the k iterations, the i-th subset is used as the test set, while the remaining subsets form the training set.
rin <- make.cv.instance(iters=10, size=nrow(iris))
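The partitioning can be sketched in base R as follows. This is an illustration under the assumption that folds are assigned by shuffling fold labels; the helper name `cv_indices` is hypothetical and the package may draw its folds differently.

```r
# Hypothetical helper (not part of the package): k-fold cross-validation indices.
cv_indices <- function(iters, size) {
  # Shuffle the fold labels 1..k; fold sizes then differ by at most one
  fold <- sample(rep(seq_len(iters), length.out = size))
  lapply(seq_len(iters), function(i) {
    list(train.inds = which(fold != i), test.inds = which(fold == i))
  })
}

set.seed(1)
folds <- cv_indices(iters = 10, size = nrow(iris))
sapply(folds, function(f) length(f$test.inds))  # 15 per fold for 150 rows
```

Every observation appears in exactly one test set, so each row is used for testing once and for training k-1 times.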
Bootstrapping
B new data sets D1, ..., DB are drawn from D with replacement, each of the same size as D. In the i-th iteration, Di forms the training set, while the elements of D that are not contained in Di form the test set.
rin <- make.bs.instance(iters=10, size=nrow(iris))
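A base R sketch of the bootstrap draw, assuming the test set of an iteration consists of the observations that were not drawn into the training set (the out-of-bag observations). The helper name `bs_indices` is hypothetical and not part of the package.

```r
# Hypothetical helper (not part of the package): bootstrap index sets.
bs_indices <- function(iters, size) {
  lapply(seq_len(iters), function(i) {
    train <- sample.int(size, size, replace = TRUE)  # same size as D, with repeats
    test <- setdiff(seq_len(size), train)            # out-of-bag observations
    list(train.inds = train, test.inds = test)
  })
}

set.seed(1)
bs <- bs_indices(iters = 10, size = nrow(iris))
sapply(bs, function(b) length(b$test.inds))  # test set size varies per iteration
```

Unlike subsampling and cross-validation, the training indices contain duplicates here, and the size of the test set is random rather than fixed.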
For every resampling strategy there is a description class inheriting from resample.desc, which completely characterizes the necessary parameters, and a class inheriting from resample.instance. The latter takes the description object and performs the random drawing of indices. While this may seem overly complicated, the separation is necessary: sometimes one only wants to describe the drawing process, while in other cases one wants to create the concrete index sets. There are also convenience methods that make the construction process as easy as possible. Here is an example for cross-validation:
# create a description for 10-fold CV
desc <- new("cv.desc", iters=10)
# create the resample.instance, which defines the train/test indices
rin <- new("cv.instance", desc=desc, size=nrow(iris))
# get the cv.instance directly
rin <- make.cv.instance(iters=10, size=nrow(iris))
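The reason for separating description and instance can be sketched with plain lists in base R. The constructors `make_cv_desc` and `make_cv_instance` below are hypothetical stand-ins, not the package's S4 classes: the description only stores parameters, and randomness enters only when an instance is created from it.

```r
# Hypothetical stand-ins (not the package's S4 classes).
# The description only stores parameters; nothing random happens here.
make_cv_desc <- function(iters) list(type = "cv", iters = iters)

# The instance combines a description with a data set size and draws the indices.
make_cv_instance <- function(d, size) {
  fold <- sample(rep(seq_len(d$iters), length.out = size))
  list(desc = d, size = size,
       test.inds = lapply(seq_len(d$iters), function(i) which(fold == i)))
}

d <- make_cv_desc(iters = 10)
set.seed(1); inst_a <- make_cv_instance(d, size = nrow(iris))
set.seed(2); inst_b <- make_cv_instance(d, size = nrow(iris))
identical(inst_a$desc, inst_b$desc)  # TRUE: both share the same description
# The index sets are drawn independently, so they are expected to differ
identical(inst_a$test.inds, inst_b$test.inds)
```

One description can thus be reused to generate many concrete instances, for example for data sets of different sizes.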
Querying the desc or resample.instance object for further information is easy: use [ ] as the generic getter operator.
# description object
# number of iters
desc["iters"]
# resample.instance object
# number of iters
rin["iters"]
# train/test indices for 3rd iteration
rin["train.inds", 3]
rin["test.inds", 3]
# train/test indices for 1st and 3rd iteration
rin["train.inds", c(1,3)]
rin["test.inds", c(1,3)]
Please refer to the help pages of the specific classes for a complete list of getters.
If you want to validate your classification method using a certain resampling strategy, simply call resample.fit.
For the example code, we use the standard iris data set and use cross-validation to compare a
decision tree with linear discriminant analysis:
# Classification task
ct <- make.classif.task(data=iris, formula=Species~.)
# Resample instance for cross-validation
rin <- make.res.instance("cv", ct, iters=3)
# Combine the learner (a decision tree), the classification task ct and the resample instance rin
f1 <- resample.fit("rpart.classif", ct, rin)
# Let's set a couple of hyperparameters for rpart
f1 <- resample.fit("rpart.classif", ct, rin, parset=list(minsplit=10, cp=0.03))
# Second resample.fit with LDA as learner
f2 <- resample.fit("lda", ct, rin)
# Let's see how well both classifiers did
resample.performance(ct, f1)
$aggr1
[1] 0.07333333

$spread
[1] 0.01154701

$aggr2
[1] 0.08 0.06 0.08

$vals
$vals[[1]]
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1
[39] 0 0 0 0 0 0 0 0 0 0 0 0

$vals[[2]]
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[39] 0 1 0 0 0 1 0 0 0 0 0 0

$vals[[3]]
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
[39] 0 0 0 1 0 0 0 1 1 0 0 0

resample.performance(ct, f2)
$aggr1
[1] 0.02

$spread
[1] 0

$aggr2
[1] 0.02 0.02 0.02

$vals
$vals[[1]]
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
[39] 0 0 0 0 0 0 0 0 0 0 0 0

$vals[[2]]
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[39] 0 0 0 0 0 0 0 0 0 0 0 0

$vals[[3]]
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[39] 0 0 0 0 0 0 0 1 0 0 0 0
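The rpart output above can be checked by hand. The numbers are consistent with $vals holding a 0/1 loss per test observation, $aggr2 being the mean loss per iteration, $aggr1 the mean of those values and $spread their standard deviation. This reading is inferred from the printed values, not taken from the package documentation.

```r
# Number of misclassified test observations per iteration, counted from the
# $vals vectors printed above for the rpart learner (50 test cases each)
errors <- c(4, 3, 4)
n_test <- 50

per_iter <- errors / n_test  # 0.08 0.06 0.08, matches $aggr2
mean(per_iter)               # 0.07333333, matches $aggr1
sd(per_iter)                 # 0.01154701, matches $spread
```

The same calculation for LDA, with one error in each of the three iterations, gives 0.02 per iteration and a standard deviation of 0.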