Monday, May 04, 2009

Data mining

Detecting outliers:
Method 1
|z| >= 3

ex. 60
Use a pivot table in Excel.
Count the IP addresses:
how many times each address shows up. Take the mean and standard deviation of those counts, then calculate the z-score for each address.

z = (x - mu) / sigma
If |z| >= 3, it's an outlier.

mu = mean or average = 11.6
Standard deviation = 141.1525
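
A minimal sketch of Method 1 in R, assuming made-up data (the addresses and counts below are hypothetical, not the ones from the exercise). table() stands in for the Excel pivot table, and sd() gives the sample standard deviation.

```r
# Hypothetical request log: 19 addresses seen twice, one address seen 500 times.
counts_in <- c(rep(2, 19), 500)
ips <- rep(sprintf("10.0.0.%d", 1:20), times = counts_in)

# Equivalent of the pivot table: how many times each address shows up.
counts <- table(ips)

x <- as.numeric(counts)
mu <- mean(x)             # mean of the counts
sigma <- sd(x)            # sample standard deviation of the counts
z <- (x - mu) / sigma     # z-score of each address

# Addresses whose count is at least 3 standard deviations from the mean.
outlier_ips <- names(counts)[abs(z) >= 3]
print(outlier_ips)
```

Note that with very few distinct addresses no z-score can reach 3, so this rule needs a reasonably long list of counts to flag anything.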

Method 2 (another popular method)

IQR = 3rd quartile - 1st quartile (Q3 - Q1)
Median = 2nd quartile
More than 1.5 x IQR below Q1 or above Q3 = outlier

 outliers 1st  2nd  3rd  outliers
|---------|====|====|-----------|
  1.5xIQR   3rd-1st
    25%       50%         25%


q1<-quantile(data[,3],.25,na.rm=TRUE)
q3<-quantile(data[,3],.75,na.rm=TRUE)

Q1 = 150
Q3 = 176
interquartile range = 176 - 150 = 26
iqr<-q3-q1
iqr = 26

26*1.5 = 39

lower limit = Q1 - 39 = 150 - 39 = 111 (any score below 111 is an outlier)
upper limit = Q3 + 39 = 176 + 39 = 215 (any score above 215 is an outlier)
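
The same steps as a sketch in R. The scores vector is made up, chosen so the quartiles come out to 150 and 176 as in the notes. The names lower and upper are my own labels for the two cutoffs.

```r
# Hypothetical scores (made-up data) with one low and one high extreme value.
scores <- c(100, 140, 150, 155, 160, 170, 176, 180, 300)

q1 <- unname(quantile(scores, 0.25, na.rm = TRUE))   # 1st quartile
q3 <- unname(quantile(scores, 0.75, na.rm = TRUE))   # 3rd quartile
iqr <- q3 - q1                                       # interquartile range

lower <- q1 - 1.5 * iqr   # anything below this is an outlier
upper <- q3 + 1.5 * iqr   # anything above this is an outlier

outliers <- scores[scores < lower | scores > upper]
print(outliers)
```

quantile() uses its default interpolation method here; other methods can give slightly different quartiles on small samples.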


Proximity-Based Outlier Detection.
K-nearest neighbor
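
The notes only name the method, so this is one common way to use it, not necessarily the one from class: score each point by the distance to its k-th nearest neighbor and treat the largest scores as outlier candidates. The points and k = 2 are assumptions for illustration.

```r
# Hypothetical 2-D points (made-up data); the last point sits far from the rest.
pts <- rbind(c(1, 1), c(1, 2), c(2, 1), c(2, 2), c(1.5, 1.5), c(10, 10))

k <- 2                       # number of neighbors (an assumed value)
d <- as.matrix(dist(pts))    # pairwise Euclidean distances

# Distance from each point to its k-th nearest neighbor.
# After sorting a row, position 1 is the point itself (distance 0).
knn_dist <- apply(d, 1, function(row) sort(row)[k + 1])

# The point with the largest k-th neighbor distance is the most isolated.
most_isolated <- unname(which.max(knn_dist))
print(most_isolated)
```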

ex.62
lm - regression
a "linear model"
x_2 = beta_0 + beta_1 * x_1 + epsilon

model<-lm(data[,3]~data[,2])
predicts column 3 of data from column 2

x_2 = -36+1.1x_1
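
A sketch of fitting and using such a model in R. The data is made up, generated to follow x_2 = -36 + 1.1 * x_1 plus a little noise, so the fitted coefficients land close to those values. The data frame df and its column names are my own. Named columns are used instead of data[,3]~data[,2] so that predict() can be given new values.

```r
# Hypothetical data following x_2 = -36 + 1.1 * x_1 plus small made-up noise.
x1 <- c(140, 150, 160, 170, 180, 190, 200)
noise <- c(0.5, -0.5, 0.3, -0.3, 0.2, -0.2, 0)
df <- data.frame(x1 = x1, x2 = -36 + 1.1 * x1 + noise)

# Fit the linear model: predict x2 using x1.
model <- lm(x2 ~ x1, data = df)

beta <- unname(coef(model))   # beta[1] = intercept, beta[2] = slope
print(beta)

# Predicted x2 for a new x1 value.
predicted <- unname(predict(model, newdata = data.frame(x1 = 165)))
print(predicted)
```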
