Jackknife variance estimates for random forest

In statistics, jackknife variance estimates for random forest are a way to estimate the variance of predictions from random forest models, with the aim of eliminating the Monte Carlo effects of bootstrap resampling.

Jackknife variance estimates

The sampling variance of bagged learners is:

$V(x)=\operatorname{Var}\!\left[\hat{\theta}(x)\right]$
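
As a concrete illustration, $V(x)$ can be approximated by refitting a bagged learner on many independent training sets drawn from the same distribution and taking the empirical variance of the predictions at a fixed point $x$. Below is a minimal Python sketch using scikit-learn; the data-generating process and all parameter choices are illustrative assumptions, not part of the method:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    x_test = np.array([[0.5]])    # fixed query point x
    n, n_repeats = 200, 50        # training-set size, number of independent training sets

    predictions = []
    for _ in range(n_repeats):
        # Draw a fresh training set from the same (assumed) distribution.
        X = rng.uniform(-1, 1, size=(n, 1))
        y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.3, size=n)
        forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
        predictions.append(forest.predict(x_test)[0])

    # Empirical sampling variance V(x) = Var[theta_hat(x)] across training sets.
    print(np.var(predictions, ddof=1))

In practice only one training set is available, which is why the jackknife estimators below are needed.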

Jackknife estimates can be used to eliminate the bootstrap effects. The jackknife variance estimator is defined as:[1]

$\hat{V}_{J}=\frac{n-1}{n}\sum_{i=1}^{n}\left(\hat{\theta}_{(i)}-\bar{\theta}\right)^{2}$

where $\hat{\theta}_{(i)}$ is the estimate computed with the $i$th observation removed and $\bar{\theta}$ is the average of the leave-one-out estimates $\hat{\theta}_{(i)}$.
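
For a simple statistic such as the sample mean, this delete-one recomputation can be coded directly; the following generic Python sketch (the function name and data are illustrative choices) is not yet specific to random forests:

    import numpy as np

    def jackknife_variance(sample, estimator):
        # Delete-one jackknife variance of estimator(sample).
        n = len(sample)
        # theta_hat_(i): the estimate with the ith observation left out.
        theta_loo = np.array([estimator(np.delete(sample, i)) for i in range(n)])
        theta_bar = theta_loo.mean()
        return (n - 1) / n * np.sum((theta_loo - theta_bar) ** 2)

    sample = np.random.default_rng(1).normal(size=100)
    # For the mean, this equals sample.var(ddof=1) / len(sample).
    print(jackknife_variance(sample, np.mean))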

In some classification problems, when a random forest is used to fit models, the jackknife estimate of variance at a point $x$ is defined as:

$\hat{V}_{J}=\frac{n-1}{n}\sum_{i=1}^{n}\left(\bar{t}_{(i)}(x)-\bar{t}(x)\right)^{2}$

Here, $\bar{t}(x)$ denotes the average prediction at $x$ of the trained decision trees, and $\bar{t}_{(i)}(x)$ denotes the average prediction of the trees grown on bootstrap samples that do not contain the $i$th observation.
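
The following Python sketch shows one way to compute this quantity for a hand-rolled bagged ensemble, where the bootstrap membership of every tree is tracked explicitly so that the trees omitting observation $i$ can be identified; the bagging loop and the toy data are illustrative assumptions:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    n, B = 100, 500
    X = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.3, size=n)
    x_test = np.array([[0.5]])    # point x at which the variance is estimated

    tree_preds = np.empty(B)               # t_b(x) for each tree
    in_bag = np.zeros((B, n), dtype=bool)  # which observations each tree saw
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # bootstrap sample for tree b
        in_bag[b, idx] = True
        tree = DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx])
        tree_preds[b] = tree.predict(x_test)[0]

    t_bar = tree_preds.mean()
    # t_(i)(x): average over the trees whose bootstrap sample omits observation i.
    t_loo = np.array([tree_preds[~in_bag[:, i]].mean() for i in range(n)])
    V_J = (n - 1) / n * np.sum((t_loo - t_bar) ** 2)
    print(V_J)

Because each observation is left out of roughly a fraction $e^{-1}$ of the bootstrap samples, every leave-one-out average is taken over a substantial subset of the $B$ trees.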

Examples

E-mail spam detection is a common classification problem in which 57 features are used to classify e-mails as spam or non-spam. The IJ-U variance formula can be applied to evaluate the accuracy of random forest models with m = 5, 19, and 57 features considered at each split. The results reported in the paper "Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife"[1] show that the m = 57 random forest appears to be quite unstable, while predictions made by the m = 5 random forest appear to be quite stable. This is consistent with the evaluation by error percentage, in which the accuracy of the model with m = 5 is high and that of the model with m = 57 is low.

Here, accuracy is measured by the error rate, which is defined as:

$\text{ErrorRate}=\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M}y_{ij}\,\mathbf{1}\{\hat{y}_{i}\neq j\}$

Here $N$ is the number of samples, $M$ is the number of classes, $y_{ij}$ is the indicator function which equals 1 when the $i$th observation is in class $j$ and 0 otherwise, and $\hat{y}_{i}$ is the class predicted for the $i$th observation, so the double sum counts the fraction of misclassified observations. No predicted probability is involved here. There is another method, similar to the error rate, for measuring accuracy:

$\text{logloss}=-\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M}y_{ij}\log(p_{ij})$

Here $N$ is the number of samples, $M$ is the number of classes, $y_{ij}$ is the same indicator function as above, and $p_{ij}$ is the predicted probability that the $i$th observation is in class $j$. This metric is used in Kaggle competitions.[2] The two measures are very similar.
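
Both measures are straightforward to compute from the indicators $y_{ij}$ and the predicted probabilities $p_{ij}$; the Python sketch below uses made-up labels and probabilities purely for illustration:

    import numpy as np

    # One-hot labels: y[i, j] = 1 iff observation i is in class j (N = 4, M = 3).
    y = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])
    # p[i, j]: predicted probability that observation i is in class j (made up).
    p = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.2, 0.5, 0.3]])

    # Error rate: fraction of observations whose most probable class is wrong.
    error_rate = np.mean(p.argmax(axis=1) != y.argmax(axis=1))

    # Multi-class log loss, clipping probabilities to avoid log(0).
    eps = 1e-15
    logloss = -np.mean(np.sum(y * np.log(np.clip(p, eps, 1.0)), axis=1))
    print(error_rate, logloss)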

Modification for bias

When the estimates $\hat{V}_{IJ}$ and $\hat{V}_{J}$ are computed from a finite number $B$ of bootstrap replicates (Monte Carlo approximation), a Monte Carlo bias must be taken into account, and it grows with $n$:

$\mathbb{E}\!\left[\hat{V}_{IJ}^{B}\right]-\hat{V}_{IJ}\approx\frac{n}{B^{2}}\sum_{b=1}^{B}\left(t_{b}(x)-\bar{t}(x)\right)^{2}$

where $t_{b}(x)$ is the prediction of the $b$th tree.

To eliminate this influence, bias-corrected modifications are suggested:

$\hat{V}_{IJ\text{-}U}^{B}=\hat{V}_{IJ}^{B}-\frac{n}{B^{2}}\sum_{b=1}^{B}\left(t_{b}(x)-\bar{t}(x)\right)^{2}$

$\hat{V}_{J\text{-}U}^{B}=\hat{V}_{J}^{B}-(e-1)\,\frac{n}{B^{2}}\sum_{b=1}^{B}\left(t_{b}(x)-\bar{t}(x)\right)^{2}$
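
Continuing the hand-rolled bagging sketch from above, the bias correction amounts to subtracting a term computed from the spread of the per-tree predictions $t_{b}(x)$. The infinitesimal jackknife estimate below uses the covariance form $\hat{V}_{IJ}^{B}=\sum_{i=1}^{n}\operatorname{Cov}_{b}\!\left[N_{bi},t_{b}(x)\right]^{2}$ from Wager et al.,[1] where $N_{bi}$ counts how often observation $i$ appears in bootstrap sample $b$; the toy data are again an assumption:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    n, B = 100, 500
    X = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.3, size=n)
    x_test = np.array([[0.5]])

    tree_preds = np.empty(B)   # t_b(x)
    counts = np.zeros((B, n))  # N_bi: times observation i enters bootstrap sample b
    for b in range(B):
        idx = rng.integers(0, n, size=n)
        np.add.at(counts[b], idx, 1)
        tree = DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx])
        tree_preds[b] = tree.predict(x_test)[0]

    t_centered = tree_preds - tree_preds.mean()
    # Infinitesimal jackknife: V_IJ^B = sum_i Cov_b[N_bi, t_b(x)]^2.
    cov_i = (counts - counts.mean(axis=0)).T @ t_centered / B
    V_IJ_B = np.sum(cov_i ** 2)

    # Monte Carlo bias correction: subtract n / B^2 * sum_b (t_b - t_bar)^2.
    V_IJ_U = V_IJ_B - n / B**2 * np.sum(t_centered ** 2)
    print(V_IJ_B, V_IJ_U)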

References

[1] Wager, Stefan; Hastie, Trevor; Efron, Bradley (2014). "Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife". Journal of Machine Learning Research 15: 1625–1650.