This is the first article about tree-based methods using R. The Carseats data from the Chapter 8 lab of ISLR is used to perform a classification analysis. Unlike the lab example, the rpart package is used to fit the CART model and the caret package is used to tune the pruning parameter (cp).

The bold-faced sections of the tutorial are covered in this article.

  • Visualizations
  • Pre-Processing
  • Data Splitting
  • Miscellaneous Model Functions
  • Model Training and Tuning
  • Using Custom Models
  • Variable Importance
  • Feature Selection: RFE, Filters, GA, SA
  • Other Functions
  • Parallel Processing
  • Adaptive Resampling

The pruning parameter in the rpart package is scaled so that its values range from 0 to 1. Specifically, the formula is

$$ R_{cp}\left(T\right)\equiv R\left(T\right) + cp*|T|*R\left(T_{1}\right) $$

where $$T_{1}$$ is the tree with no splits, $$\mid T\mid$$ is the number of splits of the tree, and $$R$$ is the risk.

Due to the inclusion of $$R\left(T_{1}\right)$$, the tree has no splits when cp=1 and is not pruned at all when cp=0. In the original cost-complexity setup without this term, the pruning parameter ($$\alpha$$) ranges from 0 to infinity.
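
As a quick illustration of this scaling, the CP values reported by rpart always lie in this range. Below is a minimal standalone sketch using the kyphosis data that ships with rpart (fit.tmp is just an illustrative name):

# grow a deep tree (cp = 0) and inspect the complexity parameter table
library(rpart)
fit.tmp = rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                control = rpart.control(cp = 0))
printcp(fit.tmp)                # columns: CP, nsplit, rel error, xerror, xstd
range(fit.tmp$cptable[, "CP"])  # always within [0, 1]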

Let’s get started.

The following packages are used.

library(dplyr)      # data manipulation
library(rpart)      # fit tree
library(rpart.plot) # plot tree
library(caret)      # tune model

The Carseats data is created as follows, with the response (Sales) converted into a binary variable (High).

require(ISLR)
data(Carseats)
Carseats = Carseats %>% 
  mutate(High=factor(ifelse(Sales<=8,"No","High"),labels=c("High","No")))
# structure of predictors
str(subset(Carseats,select=c(-High,-Sales)))
## 'data.frame':	400 obs. of  10 variables:
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
# classification response summary
with(Carseats,table(High))
## High
## High   No 
##  164  236
with(Carseats,table(High) / length(High))
## High
## High   No 
## 0.41 0.59

The training and test data sets are created using createDataPartition().

## Data Splitting
set.seed(1237)
trainIndex.cl = createDataPartition(Carseats$High, p=.8, list=FALSE, times=1)
trainData.cl = subset(Carseats, select=c(-Sales))[trainIndex.cl,]
testData.cl = subset(Carseats, select=c(-Sales))[-trainIndex.cl,]

Stratified sampling is performed by default.

# training response summary
with(trainData.cl,table(High))
## High
## High   No 
##  132  189
with(trainData.cl,table(High) / length(High))
## High
##     High       No 
## 0.411215 0.588785
# test response summary
with(testData.cl,table(High))
## High
## High   No 
##   32   47
with(testData.cl,table(High) / length(High))
## High
##      High        No 
## 0.4050633 0.5949367

The following resampling strategies are considered: cross-validation, repeated cross-validation and bootstrap.

## train control
trControl.cv = trainControl(method="cv",number=10)
trControl.recv = trainControl(method="repeatedcv",number=10,repeats=5)
trControl.boot = trainControl(method="boot",number=50)

There are two CART methods in the caret package: rpart and rpart2. The former tunes the pruning parameter (cp) while the latter tunes the maximum tree depth (maxdepth). The tune grid is not set up explicitly here; it is controlled by tuneLength, for which equally spaced cp values from 0 to 0.3 are created by the package.

set.seed(12357)
fit.cl.cv = train(High ~ .
                  ,data=trainData.cl
                  ,method="rpart"
                  ,tuneLength=20
                  ,trControl=trControl.cv)

fit.cl.recv = train(High ~ .
                    ,data=trainData.cl
                    ,method="rpart"
                    ,tuneLength=20
                    ,trControl=trControl.recv)

fit.cl.boot = train(High ~ .
                    ,data=trainData.cl
                    ,method="rpart"
                    ,tuneLength=20
                    ,trControl=trControl.boot)
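
Instead of tuneLength, the grid can also be supplied explicitly via tuneGrid as a data frame with a cp column; a minimal sketch assuming the same 0 to 0.3 range (cpGrid and fit.cl.grid are illustrative names):

# explicit tuning grid (method="rpart2" would tune maxdepth instead)
cpGrid = expand.grid(cp = seq(0, 0.3, length.out = 20))
fit.cl.grid = train(High ~ .
                    ,data=trainData.cl
                    ,method="rpart"
                    ,tuneGrid=cpGrid
                    ,trControl=trControl.recv)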

Repeated cross-validation and bootstrap produce the same best-tuned cp value, while cross-validation returns a higher value.

# results at best tuned cp
subset(fit.cl.cv$results,subset=cp==fit.cl.cv$bestTune$cp)
##           cp  Accuracy     Kappa AccuracySD   KappaSD
## 3 0.03110048 0.7382148 0.4395888 0.06115425 0.1355713
subset(fit.cl.recv$results,subset=cp==fit.cl.recv$bestTune$cp)
##           cp  Accuracy    Kappa AccuracySD   KappaSD
## 2 0.01555024 0.7384537 0.456367 0.08602102 0.1821254
subset(fit.cl.boot$results,subset=cp==fit.cl.boot$bestTune$cp)
##           cp  Accuracy     Kappa AccuracySD    KappaSD
## 2 0.01555024 0.7178692 0.4163337 0.04207806 0.08445416
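
The best-tuned values can also be pulled directly from each fit's bestTune element, for example:

# best cp by resampling strategy
c(cv   = fit.cl.cv$bestTune$cp,
  recv = fit.cl.recv$bestTune$cp,
  boot = fit.cl.boot$bestTune$cp)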

The cp value from repeated cross-validation is used to refit the model to the entire training data.

Updated on Feb 10, 2015

  • When a value of cp is supplied to rpart(), the function grows the tree only up to that value, so the result is already a pruned tree.
  • If cp is not set, or is set to a low value (e.g. 0), pruning can be done afterwards with the prune() function (see the sketch after the code below).
## refit the model to the entire training data
cp.cl = fit.cl.recv$bestTune$cp
fit.cl = rpart(High ~ ., data=trainData.cl, control=rpart.control(cp=cp.cl))

# Updated on Feb 10, 2015
# prune if cp not set or too low
# fit.cl = prune(tree=fit.cl, cp=cp.cl)
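
To illustrate the second bullet, a tree can be grown essentially unpruned (cp = 0) and then cut back to the tuned value with prune(); a sketch with illustrative object names:

# grow a full tree, then prune back to the tuned cp
fit.cl.full = rpart(High ~ ., data=trainData.cl, control=rpart.control(cp=0))
fit.cl.pruned = prune(tree=fit.cl.full, cp=cp.cl)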

The resulting tree is shown below. The plot displays the expected loss and the node probability in each terminal node. For example, the leftmost node has

  • an expected loss of 0.13 (= 8/60 ≈ 0.133)
  • a node probability of 19% (= 60/321)
# plot tree
cols <- ifelse(fit.cl$frame$yval == 1,"green4","darkred") # green if "High"
prp(fit.cl
    ,main="CART model tree"
    ,extra=106           # display class probability and percent of obs
    ,nn=TRUE             # display the node numbers
    ,fallen.leaves=TRUE  # put the leaves on the bottom of the page
    ,branch=.5           # change angle of branch lines
    ,faclen=0            # do not abbreviate factor levels
    ,trace=1             # print the automatically calculated cex
    ,shadow.col="gray"   # shadows under the leaves
    ,branch.lty=3        # draw branches using dotted lines
    ,split.cex=1.2       # make the split text larger than the node text
    ,split.prefix="is "  # put "is " before split text
    ,split.suffix="?"    # put "?" after split text
    ,col=cols, border.col=cols   # green if "High"
    ,split.box.col="lightgray"   # lightgray split boxes (default is white)
    ,split.border.col="darkgray" # darkgray border on split boxes
    ,split.round=.5)
## cex 0.7   xlim c(0, 1)   ylim c(-0.1, 1.1)
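
The leaf-level figures quoted above can also be recovered from the fitted object's frame; this sketch relies on the fact that, for a classification tree, frame$dev holds the number of misclassified observations in a node:

# expected loss and node probability of the terminal nodes
leaves = fit.cl$frame$var == "<leaf>"
data.frame(expected.loss = fit.cl$frame$dev[leaves] / fit.cl$frame$n[leaves],
           node.prob     = fit.cl$frame$n[leaves] / nrow(trainData.cl))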

The fitted model is applied to the test data.

## apply to the test data
pre.cl = predict(fit.cl, newdata=testData.cl, type="class")

# confusion matrix
cMat.cl = table(data.frame(response=pre.cl,truth=testData.cl$High))
cMat.cl
##         truth
## response High No
##     High   21  4
##     No     11 43
# mean misclassification error
mmse.cl = 1 - sum(diag(cMat.cl))/sum(cMat.cl)
mmse.cl
## [1] 0.1898734
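
As a quick cross-check, caret's confusionMatrix() reports the same error rate (as 1 minus accuracy) along with additional statistics:

# cross-check with caret
confusionMatrix(data=pre.cl, reference=testData.cl$High)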

More models will be implemented and compared in subsequent articles.