Machine learning: predicting wine quality ratings based on physicochemical properties

While prepping for my YOW! talk, I was looking for a clean, easily digestible dataset to demo predictive analytics using MAML. That's when I came across the P. Cortez wine quality dataset: it contains samples of 4898 white and 1599 red wines, each described by 11 physicochemical properties (such as alcohol, pH, and acidity) together with a quality rating from wine connoisseurs (0 = very bad to 10 = very excellent).
My goal was to create a model that could predict the quality of a wine based on its physicochemical properties.
The most straightforward approach was to use a multiclass classification algorithm to predict the rating. Here is the experiment I used to train and evaluate the model:

wine quality prediction using training and validation data

I used a random 50% split between the training and the validation data and trained the model on the quality label. Looking at the performance of the model, it becomes clear that we do an OK job predicting mid-range wines (5, 6, 7) but a poor job predicting bad and great wines:

model performance using training and validation data
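The MAML experiment itself is a drag-and-drop pipeline, but the same 50/50 split and multiclass training step can be sketched in scikit-learn terms. The data below is a synthetic stand-in for the actual wine dataset, and the choice of classifier is an assumption, not what the Studio module uses internally:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data: 11 physicochemical
# features and integer quality labels (3..9, as in the dataset).
rng = np.random.default_rng(42)
X = rng.normal(size=(6497, 11))        # 4898 white + 1599 red wines
y = rng.integers(3, 10, size=6497)     # quality ratings

# Random 50% split between training and validation data.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=42)

# Multiclass classifier predicting the quality label directly.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```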

A likely cause of the inaccurate predictions for bad and great wines is that we simply don't have enough bad and great wines to train the model on, which becomes obvious if we look at the histogram of wine quality:

wine quality histogram

As we can see, there are only a few wines rated <=4 or >=8, which makes it very hard to build a model that is well trained on bad and great wines, especially considering that the training algorithm only sees 50% of the data, since we use the other 50% to test the model.
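The imbalance is easy to quantify with a simple label count. The numbers below are hypothetical, chosen only to mirror the shape of the real histogram:

```python
from collections import Counter

# Hypothetical quality labels mirroring the dataset's shape:
# most wines sit at 5-7, with very thin tails below 4 and above 8.
qualities = [5] * 1457 + [6] * 2198 + [7] * 880 + [4] * 216 + [8] * 175 + [3] * 30 + [9] * 5

histogram = Counter(qualities)
for rating in sorted(histogram):
    print(rating, histogram[rating])
```

With so few examples in the tails, a random 50% split leaves only a handful of bad and great wines for the learner to see.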
One possible approach to training models when only limited data is available is cross-validation. This divides the data into n folds and trains the model n times, each time holding out one fold for validation and training on the remaining folds:

wine quality prediction using cross validation

It looks like the performance of the model has increased without really overfitting it:

model performance using cross validation
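In scikit-learn terms, cross-validation replaces the single train/validate split with n scored folds. Again, the data here is a synthetic stand-in and the classifier choice is an assumption:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 11))      # stand-in features
y = rng.integers(3, 10, size=1000)   # stand-in quality labels

# 10-fold cross validation: the model is trained ten times, each
# time validating on the held-out fold and training on the rest.
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0), X, y, cv=10)
print("mean accuracy over 10 folds:", scores.mean())
```

Every data point gets used for validation exactly once, which is what makes this attractive when data is scarce.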

While we increased the prediction accuracy for wines in the 4-8 range, we still have a hard time predicting 3s and 9s (there are no 1s, 2s, or 10s) because we simply don't have enough data points to learn them.
This is why I moved away from the 1-10 rating and instead quantized the wines into three buckets:

  • great (8,9,10)
  • good (6,7)
  • bad (1,2,3,4,5)

To do so, I used the quantize module and defined the bin edges as 5 and 7:

wine quality prediction using quantized values and cross validation
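Outside of MAML's quantize module, the same binning can be expressed with `np.digitize`. The bucket indices 0, 1, 2 below are this sketch's labels rather than Studio's; the edges 5 and 7 come straight from the text above:

```python
import numpy as np

qualities = np.array([3, 4, 5, 6, 7, 8, 9])   # ratings present in the data

# Bin edges 5 and 7 (inclusive on the right):
#   <= 5 -> bucket 0 (bad), 6-7 -> bucket 1 (good), >= 8 -> bucket 2 (great)
buckets = np.digitize(qualities, bins=[5, 7], right=True)
print(dict(zip(qualities.tolist(), buckets.tolist())))
```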

This gives us the following distribution across the three quality buckets:

quantized wine quality histogram

The performance of the above model looked like the following:

model performance using binned quality values

While we still struggle to accurately predict good and great wines, we do a good job predicting the bad ones. So we can make the following two predictions fairly accurately:

  • Predicted as 3: these wines have a very high probability of being great
  • Predicted as 1: these wines have a high probability of being bad
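Both statements are really claims about per-bucket precision: when the model says "bad" or "great", how often is it right? A minimal sketch with made-up validation labels (1 = bad, 2 = good, 3 = great):

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical validation results; bucket 1 = bad, 2 = good, 3 = great.
y_true = np.array([1, 1, 1, 2, 2, 2, 3, 3, 1, 2])
y_pred = np.array([1, 1, 2, 2, 2, 3, 3, 3, 1, 2])

# Precision per bucket: of the wines predicted as bucket k,
# the fraction that truly belong to bucket k.
for bucket in (1, 3):
    p = precision_score(y_true, y_pred, labels=[bucket], average=None)[0]
    print("precision for bucket", bucket, "=", p)
```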

I'm fairly sure that we could dramatically improve the performance of the model if we had additional features such as grape type, wine brand, or vineyard orientation. Unfortunately, these are unavailable due to privacy and logistical issues, but they would obviously be available to wineries that leverage such an approach to fine-tune their winemaking process.

Credits:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: [Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

Deployment and release strategies for Microsoft Azure Web Sites

One of the easiest ways to implement continuous deployment with web sites is to use Git. Developers can write Git hooks that push the deployable repository to the web site repository. When we take this approach, it is important to fully script the creation and configuration of the web site. It is not a good practice to “manually” create and configure it. This might not be apparent, but it is crucial for supporting disaster recovery, creating parallel versions of different releases, or deploying releases to additional data centers. Further, the separation of configuration and settings from the deployable artifacts makes it easy to guard certificates and other secrets, such as connection strings.
The proposed approach is to create a web site (including a staging slot) for each releasable branch. This allows deployment of new release candidates by simply pushing the Git repository to the staging web site. After testing, the staging slot can be swapped into production.
As described above, it is recommended that we create two repositories: one for the creation and configuration of the web site and one for the deployable artifacts. This allows us to restrict access to sensitive data stored in the configuration repository. The configuration script must be idempotent, so it produces the same outcome regardless of whether it runs for the first or the hundredth time. Once the web site has been created and configured, the deployable artifacts can be deployed with a Git push to the staging web site's Git repository. This push should take place with every commit to the release repository.
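Idempotency here is the check-then-converge pattern: the script describes the desired end state, and converging to it is safe to repeat. The sketch below is purely illustrative; the dict stands in for whatever management API (for example, the Azure PowerShell or CLI tooling) actually creates the web site, and the site name and setting are hypothetical:

```python
# In-memory stand-in for the cloud resource store.
sites = {}

def ensure_site(name, settings):
    """Create the web site if it is missing, then (re)apply its
    settings. Running this once or a hundred times yields the
    same final state - the definition of idempotency."""
    site = sites.setdefault(name, {"settings": {}})
    site["settings"].update(settings)
    return site

# Hypothetical site name and setting, just for the example.
ensure_site("frontend-staging", {"BACKEND_URL": "https://backend.example.net"})
ensure_site("frontend-staging", {"BACKEND_URL": "https://backend.example.net"})
print(len(sites), "site(s) after two runs")
```

A script written this way can double as a disaster-recovery tool: re-running it against an empty environment rebuilds the exact same configuration.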
It is important that all web site dependencies, such as connection strings and URLs, are sourced from the web site’s application and connection string settings. (Do not make them part of the deployable artifacts!) This allows us to deploy the same artifacts across different web sites without interference. For this example, assume we have an application that consists of two sites, one serving as the frontend and the other as the backend. The backend site also uses storage services (Figure 1).
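In practice that means the artifact reads its dependencies from the environment at runtime; Azure Web Sites exposes application settings to the process as environment variables. `BACKEND_URL` below is a hypothetical setting name for this example:

```python
import os

# Fall back to a default only for local development; on the web
# site itself, the value comes from the application settings.
os.environ.setdefault("BACKEND_URL", "https://backend.example.net")

backend_url = os.environ["BACKEND_URL"]
print("talking to backend at", backend_url)
```

Because the URL lives in the site's settings rather than in the code, the same artifact can be pushed to staging, production, or a parallel release without modification.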

Figure 1: Application consisting of two sites

The first step is to split the application into independent deployable components, each with its own source repository. Because the backend is the only component that accesses the storage service, we can group them together. The configuration script creates the web site for each component as well as the contained resources, such as storage accounts or databases. Further, it configures all dependencies. In the example below, the script for site 1 configures the site 2 URL as an application setting (Figure 2).

Figure 2: Splitting an application into independent deployable components

There are different strategies to handle code branches when releasing new functionality. The following two are commonly used:

  • Keep the master always deployable and use short-lived branches for feature work.
  • Create long-lived branches for releases and integrate feature work directly into the master.

In this series of posts I will focus on the second approach: creating long-lived branches for every new release. The benefit of this approach is that there is a 1:1 relationship between a specific release and its corresponding web site creation and configuration script. This makes deploying previous versions extremely simple: we just run the respective script and then deploy the component. It also allows us to easily run multiple releases of the same component in parallel, which is great for A/B testing.

The next posts will cover how to manage long-lived branches for releases while working on features on master. So stay tuned…