data mining hw 3

Introduction to Data Mining
Summer, 2012
Homework 3
Due Monday, June 11, 11:59pm
May 22, 2012

In homework 3, you are asked to compare four methods on three different data sets. The four methods are:

• Indicator Response Matrix
Linear regression on the indicator response matrix. You need to implement ridge regression and tune the regularization parameter.
The material for this method can be found on pages 103 to 106 of the book "The Elements of Statistical Learning"
(http://www-stat.stanford.edu/~tibs/ElemStatLearn/).
• Naïve Bayes
You need to try Naïve Bayes both without smoothing and with smoothing.
• k-Nearest Neighbor
For kNN, k is a parameter. You need to report two results: one for k = 1 and one for k = p, where you can choose an appropriate p for each dataset.
• Support Vector Machine
Use both LibSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and LibLinear (http://www.csie.ntu.edu.tw/~cjlin/liblinear/).
Use LibSVM with a linear kernel and with a Gaussian kernel (tune the parameters).
LibLinear is always linear; you need to compare the speed of LibSVM and LibLinear.
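
As a rough illustration of the first method, the sketch below fits ridge regression to a one-hot indicator response matrix and classifies by the largest fitted response. It is a minimal sketch under assumptions, not the required implementation: the function names, the use of NumPy, the choice to center the data so the intercept is not penalized, and the value of `lam` are all illustrative and do not come from the handout.

```python
import numpy as np

def fit_indicator_ridge(X, y, lam):
    """Ridge regression onto a one-hot indicator response matrix (illustrative)."""
    classes = np.unique(y)
    # Y[i, k] = 1 if sample i belongs to class k, else 0
    Y = (y[:, None] == classes[None, :]).astype(float)
    # Center X and Y so that the intercept is left unpenalized (an assumption)
    x_mean = X.mean(axis=0)
    y_mean = Y.mean(axis=0)
    Xc = X - x_mean
    Yc = Y - y_mean
    d = X.shape[1]
    # Solve (Xc^T Xc + lam * I) B = Xc^T Yc for the coefficient matrix B
    B = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ Yc)
    b0 = y_mean - x_mean @ B
    return classes, B, b0

def predict_indicator_ridge(model, X):
    """Assign each row of X to the class with the largest fitted response."""
    classes, B, b0 = model
    scores = X @ B + b0
    return classes[np.argmax(scores, axis=1)]
```

One way to tune the regularization parameter would be to hold out part of the training set, fit with several candidate values of `lam`, and keep the value with the best held-out accuracy.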

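For the k-nearest-neighbor method, a brute-force classifier might look like the following. This is only a sketch under assumptions: the handout does not specify a distance metric or a tie-breaking rule, so Euclidean distance and "ties go to the smallest label" are choices made here, and the function name `knn_predict` is hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Brute-force kNN with Euclidean distance and majority vote (illustrative)."""
    # Squared Euclidean distance between every test point and every training point
    d2 = (
        (X_test ** 2).sum(axis=1)[:, None]
        - 2.0 * X_test @ X_train.T
        + (X_train ** 2).sum(axis=1)[None, :]
    )
    # Indices of the k nearest training points for each test point
    nearest = np.argsort(d2, axis=1)[:, :k]
    preds = []
    for row in y_train[nearest]:
        labels, counts = np.unique(row, return_counts=True)
        # Majority vote; on a tie, argmax picks the smallest label (an assumption)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)
```

Calling the function once with k = 1 and once with k = p on the same training and test arrays gives the two results to report.
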
The test datasets are as follows:

• ORL database
Ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement).
A random subset with 7 images per individual was taken with labels to form the training set, and the rest of the database was considered to be the test set.
You will be given ORL train.mat and ORL test.mat.
• USPS database
The USPS handwritten digit database. We provide a popular subset that contains 9298 16x16 handwritten digit images in total, split into 7291 training images and 2007 test images.
You will be given
