This was my final project for PSTAT 131, Data Mining, Spring 2017. A version of this project written in Python can be found on my GitHub here.


Introduction

We use this white wine quality dataset, and all of its attributes (e.g. sulfur dioxide content, pH), to determine what constitutes a “good” (or above-average) quality wine. We used the R statistical computing language to conduct the analyses in this report. The data come from the UC Irvine Machine Learning Repository.

Knowing what makes a good wine is an interesting question because it allows us to look at “taste” in a different way: without formal training in wine tasting, we can determine algorithmically how and why a wine is good using machine learning.

We used randomForest, decision trees, and k-nearest neighbors to classify each observation as either good or bad based on these attributes, varying the number of variables, the tree size (number of nodes), and the number of neighbors, and comparing performance both within and between these methods.

In this report, we compute the accuracy rate, error rate, and area under the ROC curve (AUC) for each of the three methods and ultimately determine that randomForest is the most effective in terms of accuracy and AUC. It works well as a generalization of the decision tree method and is robust as an algorithm, but it falls short in terms of interpretability. kNN, on the other hand, is slow to compute on a dataset of this size and is weaker in accuracy, both in absolute terms and relative to the other methods tested here.

Main Body

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.0             0.27        0.36           20.7     0.045
## 2           6.3             0.30        0.34            1.6     0.049
## 3           8.1             0.28        0.40            6.9     0.050
## 4           7.2             0.23        0.32            8.5     0.058
## 5           7.2             0.23        0.32            8.5     0.058
## 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6

A preliminary look at the dataset reveals that there are no missing values.

First we can look at the correlation matrix of the dataset, to see if any predictors are highly correlated with one another. We may have to take out these predictors in order to avoid multicollinearity, which can invalidate results. Having said this, multicollinearity is less of an issue with decision trees, and even less so with randomForest, both of which are going to be used in this analysis.
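As a rough sketch of this step (assuming the semicolon-delimited UCI file and a data frame we call wine; both names are placeholders for whatever the original analysis used):

wine <- read.csv("winequality-white.csv", sep = ";")  # the UCI file is semicolon-delimited
round(cor(wine), 2)                                   # correlation matrix of all numeric columns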

Taking a look at the correlation coefficients \(r\) for the predictor variables, we see that density is strongly correlated with residual.sugar (\(r = 0.84\)) and alcohol (\(r = -0.78\)), and moderately correlated with total.sulfur.dioxide (\(r = 0.53\)). free.sulfur.dioxide and total.sulfur.dioxide are also moderately correlated with each other (\(r = 0.62\)), which is unsurprising since free sulfur dioxide is a component of total sulfur dioxide.

Aside from these, the correlations are all quite low, including (and especially) those between the response variable quality and the predictors.

So we should remove the variables residual.sugar and density, as well as total.sulfur.dioxide because of its direct relationship with free.sulfur.dioxide, in order to address multicollinearity. We will hold off on removing alcohol, to see if removing just these three correlated variables is enough to address the issue.
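A sketch of the removal (again assuming the data frame is called wine):

wine <- subset(wine, select = -c(residual.sugar, density, total.sulfur.dioxide))
round(cor(wine), 2)   # recheck the correlations among the remaining variables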

The new correlation matrix shows that none of the remaining predictors have too high a correlation with one another, so we can conclude that multicollinearity is no longer an issue.

From here on out, we also want to convert the quality response variable into a binary factor so that we can use the predictors to classify the observations. We do this by labeling every observation with an above-average quality score (greater than 5 out of 10) as “good”, and the rest as “bad”, where “bad” really means “not good”. This factor of good and bad goes into a new column titled label.

We then remove the quality variable, since using it as a predictor would skew the results: it determines the label directly.

It’s important to note that the numeric predictor variables (fixed.acidity, volatile.acidity, citric.acid, chlorides, free.sulfur.dioxide, pH, sulphates, alcohol) are not all on the same scale. As such, it’s appropriate to scale them before running any analyses.
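A minimal sketch of these two steps, where the quality > 5 cutoff is our reading of “above average” and the object names are placeholders:

wine$label <- factor(ifelse(wine$quality > 5, "good", "bad"))  # binary response
wine$quality <- NULL                     # drop quality so it cannot leak into the predictors
num.cols <- sapply(wine, is.numeric)     # the 8 remaining numeric predictors
wine[num.cols] <- scale(wine[num.cols])  # standardize to mean 0, sd 1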

Now we have 8 numeric predictor variables and one two-level categorical variable (label). We are going to apply a few different classification methods, first to determine which model predicts best in terms of the relevant variables, and second to find the best classification algorithm for this data.

We initialize a matrix to easily compare the quality of the different classification methods used going forward, namely decision trees (with k-fold cross-validation to prune the tree), k-nearest neighbors, and randomForest. The ‘full randomForest’ refers to the model using all 8 predictors, whereas the ‘small randomForest’ uses a subset of these predictors; the reason for this subset will become clear when discussing decision trees.
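One way such a comparison matrix might be set up (the object name records and the row labels simply mirror the printout below):

methods <- c("tree", "pruned.tree", "k=10 kNN", "k=35 kNN",
             "full.randomForest", "small.randomForest")
records <- matrix(NA, nrow = length(methods), ncol = 3,
                  dimnames = list(methods, c("Accuracy Rate", "Error Rate", "AUC")))
records   # starts out all NA and is filled in as each method is evaluated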

##                    Accuracy Rate Error Rate AUC
## tree                          NA         NA  NA
## pruned.tree                   NA         NA  NA
## k=10 kNN                      NA         NA  NA
## k=35 kNN                      NA         NA  NA
## full.randomForest             NA         NA  NA
## small.randomForest            NA         NA  NA

In order to apply machine learning algorithms to this dataset, we need to split it into a training set and a test set. The training set is used to fit the classification model; we then apply the fitted model to the test set and see how accurate the classification is.
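A sketch of one such split; the 1,000-observation test set is inferred from the confusion matrices reported later, and the seed and object names are placeholders:

set.seed(1)                           # placeholder seed, for reproducibility
test.idx <- sample(nrow(wine), 1000)  # hold out 1,000 observations for testing
test  <- wine[test.idx, ]
train <- wine[-test.idx, ]            # the remaining 3,898 observations are used for training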

Decision Tree

The first method we apply to this dataset is the decision tree. A decision tree is a non-parametric classification method that uses a set of splitting rules to assign each observation to the most commonly occurring class label among the training observations in its region.

Of course, we’re going to use label as a response variable, and each of the now 8 remaining numeric attributes as predictors.
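The fit summarized below can be reproduced roughly as follows, using the tree package (tree.fit is a placeholder name; the control settings mirror the printed call):

library(tree)
tree.fit <- tree(label ~ ., data = train,
                 control = tree.control(nobs = nrow(train),
                                        mincut = 5, minsize = 10, mindev = 0.003),
                 method = "class")
summary(tree.fit)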

## 
## Classification tree:
## tree(formula = label ~ ., data = train, control = tree.control(nobs = nrow(train), 
##     mincut = 5, minsize = 10, mindev = 0.003), method = "class")
## Variables actually used in tree construction:
## [1] "alcohol"             "volatile.acidity"    "free.sulfur.dioxide"
## [4] "chlorides"           "sulphates"           "fixed.acidity"      
## [7] "citric.acid"        
## Number of terminal nodes:  16 
## Residual mean deviance:  0.9588 = 3722 / 3882 
## Misclassification error rate: 0.2324 = 906 / 3898

So we can see from this summary that 7 out of the 8 predictors were used in constructing this tree: alcohol, volatile.acidity, free.sulfur.dioxide, chlorides, sulphates, fixed.acidity, and citric.acid. Now we are going to plot the tree to visualize this.

Looking at the tree, we can see how often alcohol appears, and infer that the amount of alcohol, whether high or low, plays at least some part in the model’s classification of a good wine.

We can build a confusion matrix after using the fitted tree to predict on the test set, and then find the accuracy rate and the error rate.
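A sketch of these diagnostics (object names are placeholders):

tree.pred <- predict(tree.fit, test, type = "class")  # predicted class labels on the test set
conf <- table(pred = tree.pred, true = test$label)    # confusion matrix
conf
accuracy <- sum(diag(conf)) / sum(conf)               # proportion classified correctly
round(accuracy, 3)
round(1 - accuracy, 3)                                # error rate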

##       true
## pred   bad good
##   bad  225  118
##   good 133  524
## [1] 0.749
## [1] 0.251

With an accuracy rate of 0.749, this decision tree model is not superb, but it still classifies correctly about 3 out of 4 times.

As an alternative metric to quantify the quality of this classifier, we can use the Receiver Operating Characteristic (ROC) curve and the area underneath it (AUC). The ROC curve plots the true positive rate against the false positive rate, and the area underneath it typically falls between 0.5 and 1, where 0.5 corresponds to random classification and 1 to perfect classification.
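The ROC curve and AUC can be computed with the ROCR package (cited in the references); this sketch assumes we feed it the tree’s predicted probability of the “good” class:

library(ROCR)
tree.probs <- predict(tree.fit, test)[, "good"]       # P(good) for each test observation
pred.obj <- prediction(tree.probs, test$label)        # "good" is the positive (second) level
roc.perf <- performance(pred.obj, measure = "tpr", x.measure = "fpr")
plot(roc.perf)                                        # ROC curve
performance(pred.obj, measure = "auc")@y.values[[1]]  # area under the curve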

## [1] 0.7885253
##                    Accuracy Rate Error Rate       AUC
## tree                       0.749      0.251 0.7885253
## pruned.tree                   NA         NA        NA
## k=10 kNN                      NA         NA        NA
## k=35 kNN                      NA         NA        NA
## full.randomForest             NA         NA        NA
## small.randomForest            NA         NA        NA

So the area under the curve is \(0.789\), which is somewhat closer to 1 than to 0.5. That is to say the classifier is more good than bad, but only modestly so.

k-fold Cross Validation

We can use k-fold cross-validation, which randomly partitions the dataset into folds of similar size, to see whether pruning the tree can improve the model’s accuracy as well as make it more interpretable.

In k-fold cross-validation, we divide the sample into k subsamples, train the model on k - 1 of them, and leave one out as a holdout set. We compute the validation error on each held-out fold, then average the validation errors across folds.
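In symbols, if \(\mathrm{Err}_i\) denotes the misclassification rate on the \(i\)-th held-out fold, the k-fold cross-validation estimate of the error is \(\mathrm{CV}_{(k)} = \frac{1}{k}\sum_{i=1}^{k} \mathrm{Err}_i\).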

The idea of cross-validation is that it samples multiple times from the training set, with different partitions. Ultimately, this produces a more robust model, i.e. a tree that is not overfit.

Cross validation will help us find the optimal size for the tree (in terms of number of nodes). We can plot the size against misclassification error to visualize this as well.
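A sketch using cv.tree() from the tree package (10 folds; object names and the seed are placeholders):

set.seed(1)                                                # placeholder seed
cv.fit <- cv.tree(tree.fit, FUN = prune.misclass, K = 10)  # CV error across candidate tree sizes
cv.fit
best.size <- cv.fit$size[which.min(cv.fit$dev)]            # size with lowest CV misclassification
best.size
plot(cv.fit$size, cv.fit$dev, type = "b",
     xlab = "Tree size (terminal nodes)", ylab = "CV misclassification count")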

## $size
## [1] 16 11  8  7  6  5  4  3  1
## 
## $dev
## [1] 1004 1004 1004 1013 1013 1007  997 1057 1205
## 
## $k
## [1]       -Inf   0.000000   1.333333   6.000000   7.000000  10.000000  40.000000
## [8]  77.000000 116.000000
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

## [1] 4

So after running cross-validation, we see that we should prune the tree down to 4 terminal nodes. With this knowledge we can prune the tree and run the same diagnostics we ran on the unpruned model to see if any improvements are apparent.
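One way to prune to that size is with prune.misclass() from the tree package (names are placeholders); the summary below shows the resulting pruned tree.

pruned.fit <- prune.misclass(tree.fit, best = best.size)  # keep only 4 terminal nodes
summary(pruned.fit)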

## 
## Classification tree:
## snip.tree(tree = tree, nodes = c(10L, 3L, 4L, 11L))
## Variables actually used in tree construction:
## [1] "alcohol"          "volatile.acidity"
## Number of terminal nodes:  4 
## Residual mean deviance:  1.054 = 4103 / 3894 
## Misclassification error rate: 0.2496 = 973 / 3898

Note that after pruning, the only variables used in tree construction are alcohol and volatile.acidity, and the tree has only 4 terminal nodes, the best tree size we determined using cross-validation.

Now we can apply the same diagnostics as before, for the sake of comparison: the confusion matrix, the accuracy and error rates, and the ROC curve with the area underneath it.

##       true
## pred   bad good
##   bad  179   88
##   good 179  554
## [1] 0.749
## [1] 0.251

We see that pruning the tree did not really improve the accuracy rate of the model at all, although it did condense the number of relevant variables. Initially, seeing that accuracy did not improve might give the impression that pruning was not meaningful; on the contrary, the fact that we were able to prune the tree without losing any accuracy shows that the few variables that remain (alcohol and volatile.acidity) classify about as well in a decision tree as the full set of 8 predictors.

The original model, being rather complex with as many as 7 predictors, runs the risk of over-fitting, which is to say that the model follows the training data too closely and cannot be generalized well to new data. This is why we are inclined to favor a simpler model such as the one we found after pruning with cross-validation.

## [1] 0.7604509
##                    Accuracy Rate Error Rate       AUC
## tree                       0.749      0.251 0.7885253
## pruned.tree                0.749      0.251 0.7604509
## k=10 kNN                      NA         NA        NA
## k=35 kNN                      NA         NA        NA
## full.randomForest             NA         NA        NA
## small.randomForest            NA         NA        NA

So while the accuracy and error rates are virtually unchanged, the area under the curve (AUC) has slightly decreased. It is not a substantial decrease, but one could argue that it has made the model slightly worse overall. Conversely, one could argue that the strength of the model is largely preserved while the number of variables is reduced. This is good because it gives us a better idea of which variables matter most when classifying the wines.

Now that we have added both the original and the pruned tree’s respective error rates and AUCs to the records matrix, we can proceed to the next method of classification.

k-Nearest Neighbors (kNN)

We now apply the k-nearest neighbors (kNN) method of classification, another non-parametric method. kNN is called a “lazy learning” technique because it scans the training set every time it predicts a test sample’s label. It finds this label by placing the test sample in the same feature space as the training data and classifying it according to its k nearest neighbors; e.g. if k = 10, an observation is assigned the majority label among its 10 nearest neighbors in the training data.

Distance can be measured in different ways, but by default the knn() function uses Euclidean distance.

This is somewhat problematic because the distance calculation implicitly treats every attribute as equally important, which is not generally true. The distance metric (Euclidean distance here) also does not take the attributes’ relationships with each other into account, which can result in misclassification. So we have identified a shortcoming of kNN before even applying it. That said, we already dropped the predictors that were highly correlated with each other, and we scaled the remaining numeric predictors, which goes some way toward addressing this.
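A sketch of the kNN fit using knn() from the class package, which takes the predictor columns and training labels directly (object names are placeholders):

library(class)
X.train <- train[, setdiff(names(train), "label")]  # scaled numeric predictors only
X.test  <- test[,  setdiff(names(test),  "label")]
knn.pred <- knn(train = X.train, test = X.test, cl = train$label, k = 10)
table(pred = knn.pred, true = test$label)           # confusion matrix
mean(knn.pred == test$label)                        # accuracy rate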

##       true
## pred   bad good
##   bad  205   91
##   good 153  551
## [1] 0.756
## [1] 0.244

Using 10 nearest neighbors was just an arbitrary first choice, and it produced another mediocre accuracy rate (\(0.756\)), but we can look at the area under the ROC curve (AUC) to judge the strength of the test relative to the methods we have tried so far.

## [1] 0.6893241
##                    Accuracy Rate Error Rate       AUC
## tree                       0.749      0.251 0.7885253
## pruned.tree                0.749      0.251 0.7604509
## k=10 kNN                   0.756      0.244 0.6893241
## k=35 kNN                      NA         NA        NA
## full.randomForest             NA         NA        NA
## small.randomForest            NA         NA        NA

So with an AUC of \(0.689\), this test is not very good. We can try different values of \(k\) to find the best one to use, and then compare those results with these.
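As a sketch of how that comparison might be run (reusing the placeholder objects from above):

k.values <- 1:50                                     # candidate neighborhood sizes
acc <- sapply(k.values, function(k) {
  pred <- knn(train = X.train, test = X.test, cl = train$label, k = k)
  mean(pred == test$label)                           # test accuracy for this k
})
plot(k.values, acc, type = "b", xlab = "k", ylab = "Test accuracy")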

This is interesting because accuracy seems to follow a slight negative trend as \(k\) grows, yet there are large jumps in accuracy when incrementing \(k\) by only 1. We know that using \(k = 1\) results in very low bias and high variance, which means we are fitting too closely to the training dataset and therefore overfitting. This makes for a model that cannot be generalized well to new data.

Here is the ROC curve demonstrating this.

So we think it better not to opt for \(k = 1\), and rather to choose a larger \(k\), such as 35, which is still decently accurate and far less prone to overfitting.

##       pred
## true   bad good
##   bad  197  161
##   good  84  558
## [1] 0.755
## [1] 0.245

Using \(k=35\) gives essentially the same test accuracy as \(k=10\) (in fact marginally lower). Now let’s look at the ROC curve and the AUC to make our final comparison, both with the \(k=10\) model and with the decision trees.

## [1] 0.7414004
##                    Accuracy Rate Error Rate       AUC
## tree                       0.749      0.251 0.7885253
## pruned.tree                0.749      0.251 0.7604509
## k=10 kNN                   0.756      0.244 0.6893241
## k=35 kNN                   0.755      0.245 0.7414004
## full.randomForest             NA         NA        NA
## small.randomForest            NA         NA        NA

So although we have sacrificed a sliver of accuracy, the area under the curve has increased noticeably, so it can be argued that the test has improved. What’s more, with a dataset of this size it is generally better to use more neighbors if one can, because otherwise we run the risk of overfitting to the training data (which is why we did not opt for \(k=1\)).

So we see that while kNN is slightly more accurate than the decision trees, its area under the ROC curve is worse, which makes it a weaker test. kNN is also computationally expensive and becomes unwieldy with larger datasets (this one has nearly 5,000 rows), so we are inclined to rule out k-nearest neighbors when deciding which classification method is best.

We can now move on to the final method of classification, randomForest.

randomForest

randomForest is similar to the decision tree method in that it builds trees, hence the name ‘random forest’. It is an ensemble learning method that grows a multitude of decision trees and outputs the class that occurs most frequently among them. The advantage randomForest has over a single decision tree is the element of randomness, which guards against the overfitting that decision trees run into on their own.
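A sketch mirroring the call shown below (randomForest package; mtry = 8 means all 8 predictors are candidates at every split; the seed and object names are placeholders):

library(randomForest)
set.seed(1)                                                # placeholder seed
rf.fit <- randomForest(label ~ ., data = train, mtry = 8)  # grows 500 trees by default
rf.fit                                                     # prints the OOB error estimate and confusion matrix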

## 
## Call:
##  randomForest(formula = label ~ ., data = train, mtry = 8) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 8
## 
##         OOB estimate of  error rate: 16.96%
## Confusion matrix:
##      bad good class.error
## bad  896  386   0.3010920
## good 275 2341   0.1051223

MeanDecreaseGini refers to the “mean decrease in node impurity”. Impurity is the criterion used to decide the optimal splits of a tree, and this plot shows how much each variable individually reduces the weighted impurity of the trees in the forest.
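The plot being described is the one produced by the package’s importance functions; a brief sketch:

importance(rf.fit)  # MeanDecreaseGini for each predictor
varImpPlot(rf.fit)  # dot plot of variable importance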

randomForest used all 8 of the predictor variables. The variable importance plot shows how ‘important’ each variable was in determining the classification. We can see that, consistent with the decision tree results, alcohol, volatile.acidity, and free.sulfur.dioxide are the three most important predictors.

##       pred
## true   bad good
##   bad  251  107
##   good  67  575
## [1] 0.826
## [1] 0.174

With an accuracy rate of 0.826, this randomForest model is looking pretty good so far; it is already more accurate than any method we have tried thus far.

Let’s take a look at the ROC curve and the area underneath it.

## [1] 0.8891906
##                    Accuracy Rate Error Rate       AUC
## tree                       0.749      0.251 0.7885253
## pruned.tree                0.749      0.251 0.7604509
## k=10 kNN                   0.756      0.244 0.6893241
## k=35 kNN                   0.755      0.245 0.7414004
## full.randomForest          0.826      0.174 0.8891906
## small.randomForest            NA         NA        NA

The area under the ROC curve for randomForest is 0.889, which is also a strong AUC for a classification model.

So we see that randomForest stands head and shoulders above the other two methods, decision trees and k-nearest neighbors: both its accuracy rate and its AUC are the highest. Judging from this, we can expect randomForest to be the most likely to correctly classify a wine based on the attributes and data given.

Recall that earlier, the pruned decision tree and the variable importance plot pointed to alcohol, volatile.acidity, and free.sulfur.dioxide as the most relevant variables. While this randomForest model was quite effective using all 8 predictors, we can also take a look at a model using only these 3, for the sake of comparison.

We have established by now that simpler models have higher bias but lower variance and a greater chance of underfitting, whereas complex models (such as the full model) have lower bias but higher variance and a greater chance of overfitting. The good thing about randomForest is that it inherently addresses this bias-variance tradeoff by introducing randomness through bagging (bootstrap aggregating).

The question here is whether making the model simpler is worthwhile; we can build the simpler model and compare its metrics to find out.
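A sketch of this reduced model, restricted to the three predictors identified above (object names are placeholders):

set.seed(1)                                          # placeholder seed
rf.small <- randomForest(label ~ alcohol + volatile.acidity + free.sulfur.dioxide,
                         data = train)
rf.pred2 <- predict(rf.small, test)
table(test$label, rf_pred2 = rf.pred2)               # confusion matrix, as printed below
mean(rf.pred2 == test$label)                         # accuracy rate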

##       rf_pred2
##        bad good
##   bad  232  126
##   good  71  571
## [1] 0.803
## [1] 0.197

## [1] 0.8515354
##                    Accuracy Rate Error Rate       AUC
## tree                       0.749      0.251 0.7885253
## pruned.tree                0.749      0.251 0.7604509
## k=10 kNN                   0.756      0.244 0.6893241
## k=35 kNN                   0.755      0.245 0.7414004
## full.randomForest          0.826      0.174 0.8891906
## small.randomForest         0.803      0.197 0.8515354

The accuracy rate has decreased, as has the area under the curve, but not by much. We have managed to largely preserve the strength of the model, both relative to the tree and kNN methods and relative to the original randomForest fit with all of the predictors.

As such, we can opt to use this much smaller model for classification if we are concerned about complexity. Having said that, because of the randomization introduced in randomForest, the full model is already quite robust, so subsetting in this manner may even be unnecessary.

Conclusion

Judging from all of our findings, we have seen that randomForest is, in this case, the best of the three algorithms we compared for classifying this wine dataset. This answers the question of which of these three classification algorithms is best here.

The decision tree algorithm is useful, but randomForest is ultimately a superior version of it, since it aggregates many decision trees to create an optimized model that is far less susceptible to overfitting. When it comes to interpretability, however, a decision tree is preferred; and when using a decision tree, it is important to use cross-validation to prune the tree down to the most important variables.

Compared to decision trees, the k-nearest neighbors algorithm has a slightly higher accuracy rate but a worse AUC. The decision tree method did, however, help narrow down the most relevant attributes, alcohol and volatile.acidity chief among them, a finding consistent with the most important variables in the randomForest model (alcohol, volatile.acidity, and free.sulfur.dioxide).

We were able to apply this subset of attributes to the randomForest algorithm and come out with a strong model that uses only a few independent variables yet classifies at a high success rate. This lends strength to the argument that these three variables are the most relevant when it comes to determining what makes a good wine.

As for what these variables mean in practice: sulfur dioxide is crucial for killing bacteria during winemaking, while volatile acidity is an undesirable trait that affects flavor and can be caused by such bacteria. So it makes sense that a wine high in (free) sulfur dioxide and low in volatile acidity is considered good.

Some questions remain: did we overfit or underfit the training data when testing these different classification methods? It would also be worth determining thresholds for these variables, for example the optimal alcohol content for a good wine.

We would also like to delve more into how best to select a \(k\) for kNN that maintains a high level of accuracy while balancing bias and variance, without either over- or underfitting. A similar question applies to the number of nodes in a decision tree. Finally, is dropping variables in randomForest really necessary if the randomization inherent in it already guards against overfitting?

If we restrict ourselves to comparing models that use the same (or nearly the same) set of predictors, we should look at the pruned classification tree against the randomForest model built on those few attributes. Even there, the randomForest model is superior.

In conclusion we have found that randomForest is best for binary classification and that alcohol, volatile acidity, and free sulfur dioxide are the most important predictors when attempting to classify a good wine.

References

Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. Introduction To Data Mining. 1st ed. Addison Wesley: Pearson, 2005. Print.

R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

RStudio Team (2016). RStudio: Integrated Development for R. RStudio, Inc., Boston, MA URL http://www.rstudio.com/.

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Forina, M. et al, PARVUS - An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.

Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0

Sing T, Sander O, Beerenwinkel N and Lengauer T (2005). “ROCR: visualizing classifier performance in R.” Bioinformatics, 21(20), pp. 7881. <URL: http://rocr.bioinf.mpi-sb.mpg.de>.

A. Liaw and M. Wiener (2002). Classification and Regression by randomForest. R News 2(3), 18–22.

Brian Ripley (2016). tree: Classification and Regression Trees. R package version 1.0-37. https://CRAN.R-project.org/package=tree

Hadley Wickham and Romain Francois (2016). dplyr: A Grammar of Data Manipulation. R package version 0.5.0. https://CRAN.R-project.org/package=dplyr

Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. URL http://www.jstatsoft.org/v40/i01/.