"Iris" Data Set Contains 3 Classes, Iris Setosa, Iris Versicolour, and Iris Virginica, Each with 50 Instances
Paul Perez
ID: 2247878
November 9th, 2010
Project Two

Analysis

The "Iris" data set contains 3 classes, Iris Setosa, Iris Versicolour, and Iris Virginica, each with 50 instances. Each instance records 4 attributes: the iris's sepal length and width and its petal length and width, in centimeters.

The "Adult" data set contains 48,842 records, each with 15 variables to analyze. Those fields include age, work class, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, final weight, hours-per-week, native-country, and income.

Finally, the "Zoo" data set is a simple one with 7 classes, which are animal groups, and a total of 101 instances. Each animal instance contains 18 attributes: the animal's name, 2 numeric attributes for its number of legs and its type, and 15 Boolean-valued attributes, i.e., simple yes-or-no answers.

The following is an analysis of 4 classification algorithms that can be usefully applied to these data sets.

Naive Bayes

The Naive Bayes classifier is a good fit for many user-modeling situations, such as the "Iris" data set, given its advantages of fast learning and low structural cost. It works the following way: suppose your data consisted of vegetables, described by their color and shape. The classifier would reason, "If you see a vegetable that is green and spherical, what type of vegetable is it most likely to be, based on the data? In the future, classify green and spherical vegetables as that type of vegetable." The advantages are that it works well on both text and numerical data and is easy to implement and cheap to compute compared to other classification algorithms. The disadvantages are that it performs poorly when features are highly dependent, and, in its simplest form, it does not account for multiples or repeats of the same word or value.
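The vegetable analogy above can be sketched as a tiny categorical Naive Bayes classifier. This is a minimal illustration, not a production implementation: the training data, vegetable labels, and the helper functions `train` and `classify` are all hypothetical examples invented for this sketch, not part of the data sets discussed.

```python
from collections import Counter, defaultdict

# Hypothetical training data for the vegetable analogy:
# each sample is (color, shape) with a known vegetable label.
training = [
    (("green", "spherical"), "pea"),
    (("green", "spherical"), "pea"),
    (("green", "long"), "cucumber"),
    (("orange", "long"), "carrot"),
    (("orange", "long"), "carrot"),
]

def train(samples):
    """Estimate class priors and per-feature likelihoods by counting."""
    priors = Counter(label for _, label in samples)
    likelihoods = defaultdict(Counter)  # (label, feature index) -> value counts
    for features, label in samples:
        for i, value in enumerate(features):
            likelihoods[(label, i)][value] += 1
    return priors, likelihoods

def classify(features, priors, likelihoods):
    """Pick the label maximizing P(label) * prod_i P(feature_i | label)."""
    total = sum(priors.values())
    best_label, best_score = None, -1.0
    for label, count in priors.items():
        score = count / total  # class prior
        for i, value in enumerate(features):
            counts = likelihoods[(label, i)]
            # Laplace smoothing so unseen feature values don't zero the score.
            score *= (counts[value] + 1) / (count + len(counts) + 1)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

priors, likelihoods = train(training)
print(classify(("green", "spherical"), priors, likelihoods))  # pea
```

Note the "naive" independence assumption in `classify`: each feature's likelihood is multiplied in separately, which is exactly why the method degrades when features are highly dependent, as noted above.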
