Introduction to Data Mining
Due Monday, June 11, 11:59pm
May 22, 2012
In homework 3, you are asked to compare four methods on three different data sets. The four methods are:
• Indicator Response Matrix
Linear regression on the indicator response matrix. You need to implement ridge regression and tune the regularization parameter. The material for this method can be found on pages 103 to 106 of the book "The Elements of Statistical Learning".
• Naive Bayes
You need to try Naive Bayes both without smoothing and with smoothing.
• k-Nearest Neighbor
For kNN, k is a parameter. You need to report two results, one for k = 1 and one for k = p, where you can choose an appropriate p for each dataset.
• Support Vector Machine
Use both LibSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and LibLinear (http://www.csie.ntu.edu.tw/~cjlin/liblinear/). Use LibSVM with a linear kernel and with a Gaussian kernel (tune the parameters). LibLinear is always linear; you need to compare the speed of LibSVM and LibLinear.
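As an illustration of the first method, the following is a minimal sketch of ridge regression on an indicator response matrix, written in Python with NumPy. It is not a reference solution. The function names (fit_ridge_indicator, predict_ridge_indicator, tune_lambda) are hypothetical, centering the data so that the intercept is not penalized is one common convention, and tuning on a held-out validation set is an assumption, since the assignment does not say how the regularization parameter should be chosen.

```python
import numpy as np


def fit_ridge_indicator(X, y, lam):
    """Ridge regression of the indicator response matrix on X.

    X: (n, d) float array of features; y: (n,) array of class labels;
    lam: ridge penalty (use lam > 0 when d > n, e.g. for raw image pixels).
    Returns (classes, W, b) with W of shape (d, K) and b of shape (K,).
    """
    classes = np.unique(y)
    # Indicator response matrix: Y[i, k] = 1 if sample i has class classes[k].
    Y = (y[:, None] == classes[None, :]).astype(float)
    # Center X and Y so that the intercept is left unpenalized.
    x_mean = X.mean(axis=0)
    y_mean = Y.mean(axis=0)
    Xc = X - x_mean
    Yc = Y - y_mean
    d = X.shape[1]
    # Solve the regularized normal equations (Xc^T Xc + lam I) W = Xc^T Yc.
    W = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return classes, W, b


def predict_ridge_indicator(model, X):
    """Assign each row of X to the class with the largest fitted response."""
    classes, W, b = model
    scores = X @ W + b
    return classes[np.argmax(scores, axis=1)]


def tune_lambda(X_tr, y_tr, X_val, y_val, lams):
    """Pick the penalty with the best validation accuracy (first one on ties)."""
    best_lam, best_acc = None, -1.0
    for lam in lams:
        model = fit_ridge_indicator(X_tr, y_tr, lam)
        acc = np.mean(predict_ridge_indicator(model, X_val) == y_val)
        if acc > best_acc:
            best_lam, best_acc = lam, acc
    return best_lam, best_acc


# Toy data made up for illustration only; not one of the homework datasets.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                  [5, 5], [5, 6], [6, 5], [6, 6]], dtype=float)
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
toy_model = fit_ridge_indicator(X_toy, y_toy, lam=0.1)
toy_pred = predict_ridge_indicator(toy_model, np.array([[0.5, 0.5], [5.5, 5.5]]))
```

On the real datasets, the validation set would have to be carved out of the training data, so that the test set is used only for the final comparison.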
The test datasets are as follows:
• ORL database
Ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement). A random subset with 7 images per individual was taken with labels to form the training set, and the rest of the database was considered to be the test set.
You will be given ORL train.mat and ORL test.mat.
• USPS database
The USPS handwritten digit database. We provide here a popular subset containing 9298 16x16 handwritten digit images in total, which is split into 7291 training images and 2007 test images. You will be given USPS train.mat...