Data mining the Titanic dataset
Author: Dated: 29/12/2012
The dataset corresponds to the sinking of the Titanic on 15 April 1912. It is part of a larger database containing the passengers and crew who were aboard the ship, and various attributes relating to them. The purpose of this task is to apply the CRISP-DM methodology and follow the phases and tasks of that model. Using the classification method in RapidMiner with both the decision tree and KNN algorithms, I will build a model on training data and try to predict the class label: survived or didn’t survive. If I apply a decision tree to the dataset as it is, I get a prediction rate of 78%. I will try various techniques throughout this report to increase the overall prediction rate.
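To make the KNN idea concrete, here is a minimal sketch in plain Python rather than RapidMiner, which is the tool actually used in this report. The function name `knn_predict`, the two-feature encoding and the toy rows are all assumptions made for illustration. The rows are made up and are not taken from the Titanic data.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training rows.

    `train` is a list of (features, label) pairs; features are numeric tuples.
    """
    # Sort training rows by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    # Take the labels of the k closest rows and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy rows, NOT taken from the Titanic data:
# features are (passenger class, sex encoded as 0 = male / 1 = female),
# label is 1 = survived, 0 = died.
toy_train = [
    ((1, 1), 1),
    ((1, 1), 1),
    ((2, 1), 1),
    ((3, 0), 0),
    ((3, 0), 0),
    ((2, 0), 0),
]

prediction = knn_predict(toy_train, (3, 0), k=3)
```

With these toy rows, the three nearest neighbours of `(3, 0)` all carry the label 0, so the vote is unanimous. On real data the choice of `k` and the scaling of each attribute would both affect the result.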
Data mining objectives:
I would like to explore the preconceived ideas I have about the sinking of the Titanic, and test whether they are correct.
Did the majority of 3rd class passengers die? What was the ratio of male to female passengers among those who died?
Did the location of cabins make a difference as to who survived?
Did chivalry prevail, and did ‘women and children first’ actually happen?
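Questions like these come down to comparing survival rates between groups. The sketch below shows one way to compute such rates in plain Python. The helper `survival_rate_by` and the attribute names `Sex` and `Survived` are assumptions, and the rows are hypothetical placeholders, not real passenger records.

```python
from collections import defaultdict

def survival_rate_by(rows, key):
    """Return {group value: fraction survived} for the given attribute name."""
    survived = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        total[row[key]] += 1
        survived[row[key]] += row["Survived"]  # 1 = survived, 0 = died
    return {group: survived[group] / total[group] for group in total}

# Hypothetical placeholder rows, NOT real passenger records.
toy_rows = [
    {"Pclass": 1, "Sex": "female", "Survived": 1},
    {"Pclass": 1, "Sex": "male", "Survived": 0},
    {"Pclass": 3, "Sex": "female", "Survived": 1},
    {"Pclass": 3, "Sex": "male", "Survived": 0},
    {"Pclass": 3, "Sex": "male", "Survived": 0},
    {"Pclass": 3, "Sex": "female", "Survived": 0},
]

rate_by_sex = survival_rate_by(toy_rows, "Sex")
rate_by_class = survival_rate_by(toy_rows, "Pclass")
```

Run against the full dataset, the same grouping by sex and by passenger class would give a direct answer to the first two questions.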
Describe the data:
Class label: Survive (1 or 0), where 1 = survived and 0 = died. Type: binominal. Total: 891. Survived: 342. Died: 549.
10 attributes, 891 rows.
The attributes in the dataset are primarily categorical, so the information content is low. This suggests that a decision tree may be an appropriate model to use.
The number of rows in the dataset (891) is well over 10 to 20 times the number of columns (10), so the number of instances is adequate.
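The figures quoted above allow two quick sanity checks, sketched here in Python: the number of rows per attribute, and the accuracy obtained by always predicting the majority class. The majority-class baseline is not a figure from the original analysis. It is added here only as a reference point for the 78% decision tree result.

```python
# Figures quoted in this report.
n_rows = 891
n_attributes = 10
n_survived = 342
n_died = 549

# The two classes should account for every row.
assert n_survived + n_died == n_rows

# Rows per attribute, to compare against the 10-to-20 rule of thumb.
rows_per_attribute = n_rows / n_attributes

# Accuracy of always predicting the majority class ("died").
majority_baseline = max(n_survived, n_died) / n_rows

print(f"rows per attribute: {rows_per_attribute:.1f}")        # 89.1
print(f"majority-class baseline: {majority_baseline:.1%}")    # 61.6%
```

Any model worth keeping should beat the majority-class baseline, which the 78% decision tree does.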
There don’t seem to be any inconsistencies in the data.
Pclass: 1st, 2nd, or 3rd class. Type: polynominal. Categorical, 3rd class: 491, 2nd