A short while ago I had a chance to perform analysis using the **caret** package. One of the requirements is to run it parallelly and to work in both Windows and Linux. The requirement can be met by using the **parallel** and **doParallel** packages as the **caret** package trains a model using the **foreach** package if clusters are registered by the **doParallel** package - further details about how to implement parallel processing on a single machine can be found in earlier posts (Link 1, Link 2 and Link 3). While it is relatively straightforward to train a model across multiple clusters using the **caret** package, setting up random seeds may be a bit tricky. As analysis can be more reproducible by random seeds, a way of setting them up is illustrated using a simple function in this post.

The following packages are used.

```
1library(parallel)
2library(doParallel)
3library(randomForest)
4library(caret)
```

In the **caret** package, random seeds are set up by adjusting the argument of *seeds* in `trainControl()`

and the object document illustrates it as following.

**seeds**- an optional set of integers that will be used to set the seed at each resampling iteration. This is useful when the models are run in parallel. A value of**NA**will stop the seed from being set within the worker processes while a value of**NULL**will set the seeds using a random set of integers. Alternatively,**a list can be used**. The list should have*B+1*elements where*B is the number of resamples*. The first B elements of the list should be vectors of integers of length M where*M is the number of models being evaluated*. The last element of the list only needs to be a single integer (for the final model). See the Examples section below and the Details section.

Setting *seeds* to either *NA* or *NULL* wouldn’t guarantee a full control of resampling so that a *custom list* would be necessary. Here `setSeeds()`

creats the custom list and it handles only (repeated) cross validation and it returns *NA* if a different resampling method is specified - this function is based on the source code. Specifically *B* is determined by the number of folds (*numbers*) or the number of repeats of it (*numbers x repeats*). Then a list of *B* elements are generated where each element is an integer vector of length *M*. *M* is the sum of the number of folds (*numbers*) and the length of the tune grid (*tunes*). Finally an integer vector of length 1 is added to the list.

```
1# function to set up random seeds
2setSeeds <- function(method = "cv", numbers = 1, repeats = 1, tunes = NULL, seed = 1237) {
3 #B is the number of resamples and integer vector of M (numbers + tune length if any)
4 B <- if (method == "cv") numbers
5 else if(method == "repeatedcv") numbers * repeats
6 else NULL
7
8 if(is.null(length)) {
9 seeds <- NULL
10 } else {
11 set.seed(seed = seed)
12 seeds <- vector(mode = "list", length = B)
13 seeds <- lapply(seeds, function(x) sample.int(n = 1000000, size = numbers + ifelse(is.null(tunes), 0, tunes)))
14 seeds[[length(seeds) + 1]] <- sample.int(n = 1000000, size = 1)
15 }
16 # return seeds
17 seeds
18}
```

Below shows the control variables of the resampling methods used in this post: k-fold cross validation and repeated k-fold cross validation. Here (5 repeats of) 3-fold cross validation is chosen. Also a grid is set up to tune *mtry* of `randomForest()`

(*cvTunes*) and *rcvTunes* is for tuning the number of nearest neighbours of `knn()`

.

```
1# control variables
2numbers <- 3
3repeats <- 5
4cvTunes <- 4 # tune mtry for random forest
5rcvTunes <- 12 # tune nearest neighbor for KNN
6seed <- 1237
```

Random seeds for 3-fold cross validation are shown below - *B + 1* and *M* equal to 4 (3 + 1) and 7 (3 + 4) respectively.

```
1# cross validation
2cvSeeds <- setSeeds(method = "cv", numbers = numbers, tunes = cvTunes, seed = seed)
3c('B + 1' = length(cvSeeds), M = length(cvSeeds[[1]]))
```

```
## B + 1 M
## 4 7
```

```
1cvSeeds[c(1, length(cvSeeds))]
```

```
## [[1]]
## [1] 328213 983891 24236 118267 196003 799071 956708
##
## [[2]]
## [1] 710446
```

Random seeds for 5 repeats of 3-fold cross validation are shown below - *B + 1* and *M* equal to 16 (3 x 5 + 1) and 15 (3 + 12) respectively.

```
1# repeated cross validation
2rcvSeeds <- setSeeds(method = "repeatedcv", numbers = numbers, repeats = repeats,
3 tunes = rcvTunes, seed = seed)
4c('B + 1' = length(rcvSeeds), M = length(rcvSeeds[[1]]))
```

```
## B + 1 M
## 16 15
```

```
1rcvSeeds[c(1, length(rcvSeeds))]
```

```
## [[1]]
## [1] 328213 983891 24236 118267 196003 799071 956708 395670 826198 863837
## [11] 119602 864078 177322 382935 513432
##
## [[2]]
## [1] 751240
```

Given the random seeds, train controls are set up as shown below.

```
1# cross validation
2cvCtrl <- trainControl(method = "cv", number = numbers, classProbs = TRUE,
3 savePredictions = TRUE, seeds = cvSeeds)
4# repeated cross validation
5rcvCtrl <- trainControl(method = "repeatedcv", number = numbers, repeats = repeats,
6 classProbs = TRUE, savePredictions = TRUE, seeds = rcvSeeds)
```

They are tested by comparing the two learners: *knn* and *randomForest*. Each of the two sets of objects are the same except for *times* elements, which are not related to model reproduciblility.

```
1# repeated cross validation
2cl <- makeCluster(detectCores())
3registerDoParallel(cl)
4set.seed(1)
5KNN1 <- train(Species ~ ., data = iris, method = "knn", tuneLength = rcvTunes, trControl = rcvCtrl)
6stopCluster(cl)
7
8cl <- makeCluster(detectCores())
9registerDoParallel(cl)
10set.seed(1)
11KNN2 <- train(Species ~ ., data = iris, method = "knn", tuneLength = rcvTunes, trControl = rcvCtrl)
12stopCluster(cl)
13
14# same outcome, difference only in times element
15all.equal(KNN1$results[KNN1$results$k == KNN1$bestTune[[1]],],
16 KNN2$results[KNN2$results$k == KNN2$bestTune[[1]],])
```

```
## [1] TRUE
```

```
1all.equal(KNN1, KNN2)
```

```
## [1] "Component \"times\": Component \"everything\": Mean relative difference: 0.02158273"
```

```
1# cross validation
2cl <- makeCluster(detectCores())
3registerDoParallel(cl)
4set.seed(1)
5rf1 <- train(Species ~ ., data = iris, method = "rf", ntree = 100,
6 tuneGrid = expand.grid(mtry = seq(1, 2 * as.integer(sqrt(ncol(iris) - 1)), by = 1)),
7 importance = TRUE, trControl = cvCtrl)
8stopCluster(cl)
9
10cl <- makeCluster(detectCores())
11registerDoParallel(cl)
12set.seed(1)
13rf2 <- train(Species ~ ., data = iris, method = "rf", ntree = 100,
14 tuneGrid = expand.grid(mtry = seq(1, 2 * as.integer(sqrt(ncol(iris) - 1)), by = 1)),
15 importance = TRUE, trControl = cvCtrl)
16stopCluster(cl)
17
18# same outcome, difference only in times element
19all.equal(rf1$results[rf1$results$mtry == rf1$bestTune[[1]],],
20 rf2$results[rf2$results$mtry == rf2$bestTune[[1]],])
```

```
## [1] TRUE
```

```
1all.equal(rf1, rf2)
```

```
## [1] "Component \"times\": Component \"everything\": Mean relative difference: 0.01735016"
## [2] "Component \"times\": Component \"final\": Mean relative difference: 0.6666667"
```

I hope this post may be helpful to improve reproducibility of analysis using the **caret** package.

## Comments