Wednesday, March 18, 2009

Data Mining

Misclassification Error
Gini Index
Entropy

P(i|t) = the fraction of observations at node t that are in class i

Slide 13
MC_a = 25%
MC_b = 40%
MC_c = 50%: either way we misclassify 50% (note: take 25 from each branch)

First split: we split on A because it has the lowest misclassification error.
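
A minimal sketch of how the misclassification error of a split could be computed, assuming the usual definition (error of a node = 1 - max_i P(i|t), weighted by node size). The function names and the class counts in the usage lines are hypothetical and are not taken from the slide.

```python
def misclassification_error(counts):
    """Error of one node: 1 - max_i P(i|t), given per-class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0  # an empty node contributes no error
    return 1.0 - max(counts) / total


def split_error(children):
    """Weighted error of a split; children is a list of per-class count lists."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * misclassification_error(c) for c in children)


# Hypothetical counts, for illustration only (not the slide's data).
node_error = misclassification_error([3, 1])
whole_split = split_error([[3, 1], [1, 3]])
print(node_error, whole_split)
```

The attribute with the smallest split_error would be the one chosen for the split.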

Gini Index

Gini_a1 = 0.3444
Gini_a2 = 0.489

T = 2+, 3-: Gini = 1 - (2/5)^2 - (3/5)^2 = 0.48, weighted: 5/9 * 0.48
F = 2+, 2-: Gini = 1 - (2/4)^2 - (2/4)^2 = 0.5, weighted: 4/9 * 0.5
Gini_a2 = 5/9 * 0.48 + 4/9 * 0.5 = 0.489
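
The T/F computation above can be checked with a short sketch. It assumes the standard Gini definition, 1 - sum_i P(i|t)^2, weighted by branch size. The function names are made up for illustration, and the counts are the ones from the notes (T = 2+, 3- and F = 2+, 2-).

```python
def gini(counts):
    """Gini index of one node: 1 - sum_i P(i|t)^2, given per-class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_gini(children):
    """Size-weighted Gini of a split; children is a list of per-class count lists."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)


t_branch = [2, 3]  # T = 2+, 3-
f_branch = [2, 2]  # F = 2+, 2-
print(round(gini(t_branch), 2))                    # Gini of the T branch
print(round(gini(f_branch), 2))                    # Gini of the F branch
print(round(split_gini([t_branch, f_branch]), 3))  # weighted Gini of the split
```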

The Gini index is better on the first split, while misclassification error shows the same result on the first split.

Entropy
Entropy(t) = - sum over j of P(j|t) * log_2 P(j|t), where log_2 P(j|t) = log P(j|t) / log 2

+:4 +:0
-:3 -:3

7/10 * (-(4/7)*log_2(4/7) - (3/7)*log_2(3/7))
+
3/10 * (-0*log_2(0) - 1*log_2(1))
---------------------------------------
0.69

Split on A: Entropy = 0.69, InfoGain: 0.28
Split on B: Entropy = 0.71, InfoGain: 0.26
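
A rough sketch of the entropy and information gain arithmetic for the split on A, using the counts from the notes (children +:4/-:3 and +:0/-:3). The parent counts (4+, 6-) are just the sum of the two children. The function names are hypothetical, and the code assumes the usual convention that 0*log_2(0) is treated as 0.

```python
import math


def entropy(counts):
    """Entropy of one node in bits: -sum_j P(j|t) * log_2 P(j|t)."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:  # skip empty classes, i.e. treat 0*log_2(0) as 0
            p = c / total
            h -= p * math.log2(p)
    return h


def split_entropy(children):
    """Size-weighted entropy of a split."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * entropy(c) for c in children)


def info_gain(parent, children):
    """Information gain = parent entropy - weighted child entropy."""
    return entropy(parent) - split_entropy(children)


children_a = [[4, 3], [0, 3]]  # (+, -) counts in each child node
parent = [4, 6]                # sum of the children
print(round(split_entropy(children_a), 2))
print(round(info_gain(parent, children_a), 2))
```

The split with the larger information gain (equivalently, the smaller weighted entropy) is preferred, which is why A wins over B here.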
