Wine was once viewed as a luxury good, but it is now enjoyed by an increasingly wide range of consumers. Wine prices vary considerably with quality, so when wine sellers buy from wine makers they need to understand wine quality, which is affected to some degree by chemical attributes. Being able to classify or predict the quality of a sample accurately has a direct effect on a seller's profit. Our goal is therefore to model wine quality from physicochemical tests and give wine sellers a reference for selecting high-, moderate- and low-quality wines. We downloaded the wine quality data set from the UC Irvine Machine Learning Repository; it consists of white vinho verde wine samples from the north of Portugal. The white wine data set includes 4898 observations and 12 variables. Quality is the dependent variable, and the other 11 attributes (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol) are the independent variables.
Technical summary
1. Data pre-process
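As an illustration of the kind of pre-processing this section describes, the sketch below removes outliers from one column and measures the correlation between two columns. The report does not say which outlier rule was used, so the 1.5 × IQR fence is an assumption, and the function names and toy rows are hypothetical rather than taken from the data set.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outliers(rows, column, k=1.5):
    """Keep only rows whose value in `column` lies inside the IQR fences."""
    low, high = iqr_bounds([row[column] for row in rows], k)
    return [row for row in rows if low <= row[column] <= high]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical toy rows (illustration only, not real samples).
samples = [
    {"density": 0.991, "alcohol": 12.5},
    {"density": 0.993, "alcohol": 11.4},
    {"density": 0.994, "alcohol": 10.9},
    {"density": 0.996, "alcohol": 10.1},
    {"density": 0.998, "alcohol": 9.4},
    {"density": 1.039, "alcohol": 11.7},  # implausibly high density
]

cleaned = drop_outliers(samples, "density")
r = pearson([s["density"] for s in cleaned],
            [s["alcohol"] for s in cleaned])
print(len(samples), len(cleaned), round(r, 3))
```

A correlated pair such as the one in this toy example would still be kept for a purely predictive model, which matches the choice made in this section.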
The first step in the analysis is to pre-process the data. Inspecting the data, we found several outliers and removed them. We also found that the independent variables are all numerical and that some of them fall within a narrow range (density, for example, ranges from 0.98 to 1.02), so we decided not to bin them in the initial analysis. We also examined the correlations among the variables; because our main goal is prediction, we kept all of them even though some are correlated. Overall, the only change we made to the data set was to remove several outliers.
2. Preliminary Models
We use three models for classification and prediction: multiple linear regression, a classification tree and a neural network.
2.1 Multiple linear regressions
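The assessment in this subsection rests on two quantities: R-squared and the range of the predicted scores. The sketch below shows how each can be computed and how rounding continuous predictions to integer quality scores can leave the extreme classes empty. The numbers are hypothetical toy values, not the report's results, and the helper names are our own.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def to_classes(predicted):
    """Round positive continuous predictions to the nearest integer score."""
    return [int(p + 0.5) for p in predicted]

# Hypothetical quality scores and regression outputs (illustration only).
actual = [3, 5, 5, 6, 6, 6, 7, 8, 9]
predicted = [5.2, 5.4, 5.6, 5.8, 5.9, 6.1, 6.2, 6.4, 6.6]

r2 = r_squared(actual, predicted)
classes = to_classes(predicted)

# Compare the set of true classes with the set the model can reach.
print(round(r2, 2), sorted(set(actual)), sorted(set(classes)))
```

When the predictions cluster around the mean, as in this toy case, the rounded classes cover only the middle of the scale, so the lowest and highest qualities are never predicted.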
Using the pre-processed data set, we applied multiple linear regression to check whether the factors in the data set can predict the quality of white wine. The result is not very useful. According to the regression model attached as Exhibit 1, the multiple R-squared is only about 0.27, which suggests that the independent variables explain little of the variation in quality. In addition, the Validation Score Summary attached as Exhibit 2 shows that the predicted quality values are continuous and range only between 3 and 7, whereas the quality values in the original data set are discrete and range from 3 to 9. Because the predicted values are so close together and nearly continuous, it is hard to assign a class accurately. Thus, multiple linear regression does not offer a reasonable result or interpretation for white wine quality.
2.2 Neural Network
The neural network result is poor. In the error report attached as Exhibit 3, the overall error rate is 96.05%. Only class 8 is classified correctly, while the other classes are misclassified. The neural network (with the default setting of 4 tiers and 25 nodes) would not help wine sellers at all.
2.3 Classification tree
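The evaluation in this subsection uses a 70%/30% partition and per-class error rates. The sketch below shows one way to compute both. The labels are hypothetical, the shuffle seed is arbitrary, and the function names are our own; it does not reproduce the tool or the exhibits used in this report.

```python
import random
from collections import Counter

def partition(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of `rows` and split it into (training, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def error_rates(actual, predicted):
    """Return (overall error rate, {class: error rate for that class})."""
    totals = Counter(actual)
    wrong = Counter(a for a, p in zip(actual, predicted) if a != p)
    per_class = {c: wrong[c] / totals[c] for c in sorted(totals)}
    overall = sum(wrong.values()) / len(actual)
    return overall, per_class

# Hypothetical labels: the common classes do well, the rare ones do not.
actual = [3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8]
predicted = [5, 5, 5, 5, 6, 6, 6, 6, 5, 6, 6]

overall, per_class = error_rates(actual, predicted)
train, valid = partition(range(100))
print(len(train), len(valid), round(overall, 2), per_class)
```

Reporting the error rate per class matters here because an acceptable overall rate can hide the fact that the rare high and low qualities, the ones sellers care about most, are almost always misclassified.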
For the classification tree, we first partitioned the original data into 70% training data and 30% validation data. The result is somewhat informative: the tree makes usable predictions for some classes, even though the overall error rate on the validation data is as high as 42.89%, which is not ideal in our view. Moreover, Exhibit 4 shows that the error rates for qualities 3, 4, 7 and 8 are too high for the predictions to be useful. For wine sellers, telling high-quality wines from low-quality ones makes a big difference in profits. Therefore, the classification tree cannot be used to predict which variables influence the quality.
3. Data Exploration and Re-modeling
Since the results above do not give a reasonable reference for classification, we explored our data and results further. We observed the...