The caret Package

createDataPartition

createDataPartition splits the data into two groups: a training set and a test set.

trainControl

To change the resampling method, use the trainControl function.
Its method option controls the type of resampling and defaults to "boot" (the bootstrap).
Another method, "repeatedcv", specifies repeated K-fold cross-validation; the repeats argument controls the number of repetitions. K is controlled by the number argument and defaults to 10.
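
As a small illustration of the arguments described above, the following sketch sets up repeated 10-fold cross-validation. The choice of 3 repetitions and the object name ctrl are assumptions made for this example, not values taken from the text.

```r
library(caret)

# Repeated K-fold cross-validation:
#   method  = "repeatedcv" selects repeated K-fold CV
#   number  = K, the number of folds (10 is also the default)
#   repeats = how many times the K-fold procedure is repeated
#             (3 is an arbitrary choice for this sketch)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# The settings are stored in the returned list
ctrl$method
ctrl$number
ctrl$repeats
```

An object like ctrl would typically be passed to train() through its trControl argument.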


Sample

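install.packages("caret")  # only needed once
library(caret)
data(iris)
# Row indices of a stratified 80% sample; despite the name,
# these rows are the ones kept for training
validation <- createDataPartition(iris$Species, p=0.80, list=FALSE)
# Hold out the remaining 20% of rows for validation
validation20 <- iris[-validation,]
# Keep the selected 80% for training and testing the models
iris <- iris[validation,]

# 10-fold cross-validation
control <- trainControl(method="cv", number=10)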

Keywords

createDataPartition, trainControl, 10-fold cross-validation

cross-validation

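Cross-validation, sometimes called rotation estimation, is a practical statistical method for partitioning a data sample into smaller subsets.
The analysis is first performed on one subset, while the other subsets are used to confirm and validate that analysis.
The initial subset is called the training set.
The other subsets are called the validation set or test set.
The goal of cross-validation is to set aside data for "testing" the model during the training phase, in order to limit problems such as overfitting and to give an indication of how the model will generalize to an independent dataset.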

K-fold cross-validation

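The original sample is partitioned into K subsamples. A single subsample is retained as the data for validating the model, and the other K-1 subsamples are used for training.
The cross-validation process is repeated K times, with each subsample used exactly once for validation. The K results are then averaged, or combined in some other way, to produce a single estimate.
The advantage of this method is that every observation is used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is the most commonly used.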
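
The procedure can be sketched in base R without caret. This is only an illustration under stated assumptions: the built-in mtcars data, the model mpg ~ wt, the seed, and all variable names are chosen for the example and do not come from the text.

```r
# Minimal K-fold cross-validation sketch in base R (no caret needed).
set.seed(1)                 # arbitrary seed, for reproducibility only
k <- 10                     # number of folds
n <- nrow(mtcars)

# Randomly assign every row to exactly one of the k folds
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_mse <- numeric(k)
for (i in seq_len(k)) {
  train <- mtcars[fold_id != i, ]   # the other K-1 folds
  valid <- mtcars[fold_id == i, ]   # the single held-out fold
  fit   <- lm(mpg ~ wt, data = train)
  pred  <- predict(fit, newdata = valid)
  fold_mse[i] <- mean((valid$mpg - pred)^2)
}

# Average the K results into a single estimate
cv_mse <- mean(fold_mse)
cv_mse
```

Each row appears in exactly one validation fold, so every observation is used once for validation and K-1 times for training.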

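Leave-one-out cross-validation

Leave-one-out cross-validation (LOOCV) uses a single observation from the original sample as the validation data and the remaining observations as the training data.
This step is repeated until every observation has been used once as the validation data.
In fact, this is equivalent to K-fold cross-validation with K equal to the number of observations in the original sample.
In some cases efficient algorithms exist, for example with kernel regression and Tikhonov regularization.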
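
A base R sketch of LOOCV follows. Kernel regression and Tikhonov regularization are not implemented here; as a stand-in, the sketch uses ordinary least squares, for which the leave-one-out residual can be obtained from one fit as e_i / (1 - h_ii), where h_ii is the leverage of observation i. The mtcars data, the model mpg ~ wt, and the variable names are assumptions made for the example.

```r
# Leave-one-out cross-validation: K-fold CV with K = n.
n <- nrow(mtcars)

# Brute force: refit the model n times, leaving out one row each time
loo_err <- numeric(n)
for (i in seq_len(n)) {
  fit <- lm(mpg ~ wt, data = mtcars[-i, ])
  loo_err[i] <- mtcars$mpg[i] - predict(fit, newdata = mtcars[i, , drop = FALSE])
}

# Shortcut for ordinary least squares: a single fit on all rows,
# with each residual rescaled by its leverage (hat value)
full <- lm(mpg ~ wt, data = mtcars)
shortcut_err <- unname(residuals(full) / (1 - hatvalues(full)))

# Both approaches give the same leave-one-out errors
all.equal(loo_err, shortcut_err)
loo_mse <- mean(loo_err^2)
```

The shortcut needs one model fit instead of n, which is the kind of saving the efficient algorithms mentioned above provide.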

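Error estimation

Cross-validation makes it possible to compute an estimate of the error.
Common error measures are the mean squared error (MSE) and the root mean squared error (RMSE). RMSE is the square root of MSE, so the two play the roles of the variance and the standard deviation of the cross-validation errors, respectively.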
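
The two measures can be computed directly from held-out predictions. The numbers below are made up purely to show the arithmetic.

```r
# Hypothetical observed values and predictions on held-out data
actual    <- c(3, 5, 7, 9)
predicted <- c(2, 5, 8, 11)

err  <- actual - predicted   # 1, 0, -1, -2
mse  <- mean(err^2)          # (1 + 0 + 1 + 4) / 4 = 1.5
rmse <- sqrt(mse)            # square root of the MSE

mse
rmse
```

RMSE is expressed in the same units as the outcome, which often makes it easier to interpret than MSE.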

createDataPartition

First, we split the data into two groups: a training set and a test set. To do this, the createDataPartition function is used:

> library(caret)
> library(mlbench)
> data(Sonar)
> set.seed(107)
> inTrain <- createDataPartition(y = Sonar$Class,
+ ## the outcome data are needed
+ p = .75,
+ ## The percentage of data in the
+ ## training set
+ list = FALSE)

> ## The format of the results
>
> ## The output is a set of integers for the rows of Sonar
> ## that belong in the training set.
> str(inTrain)
 int [1:157, 1] 1 2 3 6 7 9 10 11 12 13 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr "Resample1"

By default, createDataPartition does a stratified random split of the data. To partition the data:

> training <- Sonar[ inTrain,]
> testing <- Sonar[-inTrain,]
> nrow(training)
> nrow(testing)
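
To show what a stratified random split means, here is a rough base R sketch that samples the same fraction of rows within each class. It is not caret's implementation of createDataPartition. It uses the built-in iris data instead of Sonar so that it runs without mlbench, and the rounding rule (ceiling) and variable names are assumptions made for the example.

```r
# Stratified split sketch: sample the same fraction within each class.
set.seed(107)
p <- 0.75

# Group row indices by class, then draw a fraction p from each group
idx_by_class <- split(seq_len(nrow(iris)), iris$Species)
in_train <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
}), use.names = FALSE))

training <- iris[in_train, ]
testing  <- iris[-in_train, ]

# Class proportions in the training set mirror the full data
table(training$Species)
nrow(training)
nrow(testing)
```

Because every class contributes the same share of its rows, the class balance of the outcome is preserved in both the training and the test set.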