While prepping for my YOW! talk, I was looking for a clean, manageable dataset to demo predictive analytics using MAML. That's when I came across the P. Cortez wine quality dataset: it contains samples of 4898 white and 1599 red wines, each described by 11 physicochemical properties (such as alcohol, pH, and acidity) and a quality rating from wine connoisseurs (0 = very bad to 10 = very excellent).
My goal was to create a model that could predict the quality of a wine based on its physiochemical properties.
The most straightforward approach is to use a multiclass classification algorithm to predict the rating directly. Here is the experiment I used to train and evaluate the model:
I used a random 50% split between the training and the validation data and trained the model on the quality label. Looking at the model's performance, it becomes clear that we do an OK job of predicting good wines (5, 6, 7) but a poor job of predicting bad and great wines:
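The original experiment was built in MAML, but the same 50/50 split and multiclass training can be sketched with scikit-learn. Note that the synthetic data below (and the choice of `RandomForestClassifier`) are my own stand-ins for the real dataset and module, not the original setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in: 11 physicochemical features, skewed 3-9 ratings
X = rng.normal(size=(600, 11))
y = rng.choice([3, 4, 5, 6, 7, 8, 9], size=600,
               p=[0.02, 0.05, 0.30, 0.40, 0.18, 0.04, 0.01])

# Random 50% split between training and validation data
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```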
A likely cause for the inaccurate predictions of bad and great wines is that we simply don't have enough of them to train the model – which becomes obvious if we look at the histogram of wine quality:
As we can see, there are only a few wines rated <=4 or >=8, which makes it very hard to build a model that is well trained on bad and great wines – especially since the training algorithm only sees 50% of the data, as we use the other 50% to test the model.
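To make the imbalance concrete, here is a quick way to print such a histogram in Python. The rating proportions below are invented for illustration and only roughly mimic the shape of the real distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical skewed ratings (not the actual dataset counts)
ratings = rng.choice([3, 4, 5, 6, 7, 8, 9], size=6497,
                     p=[0.01, 0.04, 0.30, 0.43, 0.17, 0.04, 0.01])

values, counts = np.unique(ratings, return_counts=True)
for v, c in zip(values, counts):
    print(f"{v}: {'#' * (c // 50)} ({c})")
```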
One possible approach to training models when only limited data is available is cross-validation. This divides the data into n folds; for each fold, a model is trained on the remaining n-1 folds and validated on the held-out fold:
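In scikit-learn terms this corresponds to something like the sketch below – again with synthetic stand-in data, since the original experiment uses MAML's cross-validation module:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 11))
y = rng.choice([3, 4, 5, 6, 7, 8, 9], size=600,
               p=[0.02, 0.05, 0.30, 0.40, 0.18, 0.04, 0.01])

# 5 folds: each model trains on 4 folds and validates on the held-out one
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=cv)
print("per-fold accuracy:", scores, "mean:", scores.mean())
```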
It looks like the performance of the model has increased without really overfitting it:
While we increased the accuracy of predictions for wines in the 4-8 range, we still have a hard time predicting the 3s and 9s (there are no wines rated 1, 2, or 10) because we just don't have enough data points to learn from.
This is why I moved away from the 1-10 rating and instead quantized the wines into three buckets:
- great (8,9,10)
- good (6,7)
- bad (1,2,3,4,5)
To do so, I used the quantize module and defined the bin edges as 5 and 7:
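Outside of MAML, the same bucketing can be done with `np.digitize`. The bucket labels 1/2/3 (bad/good/great) are how I'll refer to them below:

```python
import numpy as np

ratings = np.array([3, 5, 6, 7, 8, 9])
# Bin edges 5 and 7, inclusive on the right:
#   quality <= 5      -> bucket 1 (bad)
#   5 < quality <= 7  -> bucket 2 (good)
#   quality > 7       -> bucket 3 (great)
buckets = np.digitize(ratings, bins=[5, 7], right=True) + 1
print(dict(zip(ratings.tolist(), buckets.tolist())))
```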
This gives us the following distribution across the three quality buckets:
The performance of the resulting model looked like the following:
While we still struggle to accurately predict good and great wines, we do a good job of predicting the bad ones. So we can make the following two predictions fairly accurately:
- Predicted as 3: these wines have a very high probability of being great
- Predicted as 1: these wines have a high probability of being bad
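Those two statements are really per-class precision, which can be read off a confusion matrix. A minimal sketch with made-up predictions (the actual numbers come from the MAML evaluation, not from this code):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up bucket labels: 1 = bad, 2 = good, 3 = great
y_true = [1, 1, 1, 2, 2, 2, 3, 3, 3, 1]
y_pred = [1, 1, 2, 2, 2, 1, 3, 2, 3, 1]

cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])  # rows: true, cols: predicted
# Precision for a class = correct predictions / all predictions of that class
precision_bad = cm[0, 0] / cm[:, 0].sum()
precision_great = cm[2, 2] / cm[:, 2].sum()
print("P(bad | predicted 1) =", precision_bad)
print("P(great | predicted 3) =", precision_great)
```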
I’m fairly sure that we could dramatically improve the performance of the model if we had additional features such as grape type, wine brand, or vineyard orientation. Unfortunately, these are unavailable due to privacy and logistical issues – but they would obviously be available to wineries that leverage such an approach to fine-tune their winemaking process.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: http://dx.doi.org/10.1016/j.dss.2009.05.016
Pre-press (pdf): http://www3.dsi.uminho.pt/pcortez/winequality09.pdf