A while back, I completed Udacity’s Data Analyst Nanodegree. As part of the coursework, I worked on a project on Exploratory Data Analysis (EDA): the numerical and graphical examination of data characteristics and relationships before applying more formal, rigorous statistical analysis. In this project, I explored a dataset on red wine quality (using R & ggplot2) based on its physicochemical properties. The objective was to identify the physicochemical properties that distinguish good quality wines from lower quality ones. I felt a great sense of satisfaction when I completed the work, and having already uploaded the source code on GitHub, I decided to write about the thought process behind the whole study. If you were looking for a qualitative explanation of the subject and accidentally ended up on this post, I suggest you read this article.
Before beginning my investigation, I spent time learning about the dataset and the terminology associated with it. The curated dataset contains information about 1,599 red wines, with 11 variables on the chemical properties of each wine along with a variable for its quality, as rated by wine experts. The preparation of the dataset is described in this link.
Before beginning my analysis, I needed a starting point. To lead the univariate analysis, I chose to build a grid of histograms that represent the distributions of each variable in the dataset, hoping to distinguish the most interesting attributes.
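A grid like this can be sketched in a few lines. The version below is only an illustration: it uses base R graphics rather than ggplot2 to stay dependency-free, the data frame `wine_toy` holds made-up values, and the column names other than `quality` are assumptions about how the dataset labels its variables.

```r
# Toy stand-in for the wine data; the values are made up for illustration only.
set.seed(1)
wine_toy <- data.frame(
  fixed.acidity    = rnorm(200, mean = 8,    sd = 1.5),
  volatile.acidity = rnorm(200, mean = 0.5,  sd = 0.15),
  alcohol          = rnorm(200, mean = 10.4, sd = 1)
)

# Draw one histogram per numeric column on a shared grid.
plot_histogram_grid <- function(df, ncol = 3) {
  num_cols <- names(df)[sapply(df, is.numeric)]
  nrow <- ceiling(length(num_cols) / ncol)
  # Remember the previous layout and restore it when the function exits.
  old_par <- par(mfrow = c(nrow, ncol), mar = c(4, 4, 2, 1))
  on.exit(par(old_par))
  for (col in num_cols) {
    hist(df[[col]], main = col, xlab = col, col = "grey80", border = "white")
  }
  invisible(num_cols)
}

plotted <- plot_histogram_grid(wine_toy)
```

The function returns the names of the columns it plotted, which is handy for checking that no variable was silently skipped.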
There were some really interesting variations in the distributions here. Working from the top-left to the right, I analyzed selected plots in more detail, as described below.
The first feature that I investigated was acidity. When reading about wines, I learned that fixed acidity is determined by acids that do not evaporate readily, such as tartaric acid. It contributes to many other attributes, including taste, pH, color, and stability to oxidation, i.e., it prevents the wine from tasting flat. On the other hand, volatile acidity is responsible for the sour taste in wine. A very high value can lead to sour-tasting wine, while a low amount can make the wine seem dense.
There is a slight positive skew in the data because a few wines possess a very high fixed acidity, which stretches the right tail of the distribution. Given the importance of this factor and an indication of a standard range of acidity for good wines, this attribute was examined again during bivariate analysis.
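One way to check the direction of skew numerically, rather than by eye, is the moment-based sample skewness $g_1 = m_3 / m_2^{3/2}$, where $m_k$ is the k-th central moment. The sketch below is a hand-rolled helper (the name `skewness` and the sample values are made up for illustration); a long right tail gives a positive value and a long left tail a negative one.

```r
# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 3/2.
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  m2 <- mean(centered^2)
  m3 <- mean(centered^3)
  m3 / m2^1.5
}

# Made-up acidity-like values with a few unusually high readings.
right_tailed <- c(7.0, 7.2, 7.5, 7.8, 8.0, 8.1, 8.3, 12.0, 15.0)

skew_right <- skewness(right_tailed)    # long right tail: positive
skew_left  <- skewness(-right_tailed)   # mirrored data: negative
```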
Next, sulfur dioxide & sulphates were studied. Free sulfur dioxide is the free form of SO2, which exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. Sulphates are a wine additive that can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant, overall keeping the wine fresh. The distributions did not provide any exciting inference, so I have left them out of this section.
Finally, alcohol was examined, as it is what adds that special something that turns rotten grape juice into a drink many people love. Hence, intuitively, it should be crucial in determining wine quality.
print("Summary statistics for alcohol %age.")
## [1] "Summary statistics for alcohol %age."
summary(wine$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The mean alcohol content of the wines is 10.42% and the median is 10.2%. The distribution also suggests that the majority of wines have lower alcohol content, with a thinner tail of stronger wines pulling the mean above the median. This attribute’s impact on quality is discussed later as part of the bivariate analysis.
What about quality? Quality is a very subjective measure, which made me doubtful for a moment until I studied how the attribute was measured. For this dataset, each wine was rated by at least three experts on a scale of 0-10 and the median value was chosen.
Overall, ‘quality’ has a roughly normal shape, with very few exceptionally high or low ratings. The minimum rating is 3 and the maximum is 8, with very few wines rated at the extremes. Hence, to group the scores into broader classes, a variable called ‘rating’ was created from the variable ‘quality’.
# Dividing the quality into 3 rating levels:
# 'C' for quality below 5, 'B' for 5 or 6, 'A' for 7 and above
wine$rating <- ifelse(wine$quality < 5, 'C',
                      ifelse(wine$quality < 7, 'B', 'A'))
# Changing it into an ordered factor
wine$rating <- ordered(wine$rating,
                       levels = c('C', 'B', 'A'))
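As a sanity check on the binning, the same three levels can be produced with `cut()` and cross-tabulated against the raw scores. This is only an alternative sketch, not the code used in the study; the `quality` vector below is made up to span the 3 to 8 range seen in the data.

```r
# Made-up quality scores covering the observed 3-8 range.
quality <- c(3, 4, 5, 5, 6, 6, 6, 7, 8)

# Intervals are (-Inf, 4], (4, 6], (6, Inf): the same split as the
# nested ifelse for integer scores.
rating <- cut(quality,
              breaks = c(-Inf, 4, 6, Inf),
              labels = c("C", "B", "A"),
              ordered_result = TRUE)

# Each quality score should land in exactly one rating column.
rating_table <- table(quality, rating)
print(rating_table)
```

Because the result is an ordered factor, comparisons such as `rating[1] < rating[9]` work as expected, with ‘C’ lowest and ‘A’ highest.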
‘B’ rated wines dominate the distribution of ‘rating’, as already seen in the quality distribution, which is likely to cause overplotting. Therefore, in a lot of areas only the ‘C’ and ‘A’ wines were compared, to find distinctive properties that separate these two categories of wine. At first, I compared the summary statistics of the two classes of wine. The differences were suitable only for estimating which variables might affect quality and for setting a direction for further analysis; no conclusion could be drawn from them.
The distributions studied in this section were primarily used to identify the trends in the variables present in the dataset. This helps in setting up a track for moving towards bivariate and multivariate analysis. An interesting measurement was the wine quality, since it is the subjective measurement of how attractive the wine might be to a consumer. The goal here was then to try to correlate non-subjective wine properties with quality. At first, the absence of an age metric felt like a gap, since age is commonly a factor in quick assumptions about wine quality. However, since the actual effect of wine age is on the wine’s measurable chemical properties, its inclusion here might not be necessary.
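A comparison of the two extreme classes can be sketched as below. The data frame `wine_toy` contains made-up values, and the column names `alcohol` and `volatile.acidity` are assumptions; the point is only the shape of the computation: drop the ‘B’ wines, then summarise each remaining class.

```r
# Made-up rows for illustration; not taken from the actual dataset.
wine_toy <- data.frame(
  rating           = c("A", "A", "A", "B", "B", "C", "C", "C"),
  alcohol          = c(12.0, 11.5, 12.5, 10.0, 10.4, 9.5, 10.0, 9.0),
  volatile.acidity = c(0.30, 0.35, 0.40, 0.50, 0.55, 0.80, 0.90, 0.70)
)

# Keep only the two extreme classes to avoid the crowded 'B' group.
extremes <- subset(wine_toy, rating %in% c("A", "C"))

# Median of each property within each rating class.
class_medians <- aggregate(cbind(alcohol, volatile.acidity) ~ rating,
                           data = extremes, FUN = median)
print(class_medians)
```

Medians are used here instead of means so that a handful of unusual wines does not dominate the comparison; that is a design choice of this sketch, not something the original analysis prescribes.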
Before beginning this section of the investigation, a correlation matrix was plotted.
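Before reading a full correlation matrix, it can help to pull out just the column for quality and rank the other variables by the strength of their correlation with it. The helper `rank_by_correlation` below is a hypothetical name introduced for this sketch, and the toy values and column names are made up for illustration.

```r
# Rank every other numeric column by |correlation| with the target column.
rank_by_correlation <- function(df, target = "quality", method = "pearson") {
  cors <- cor(df, method = method)[, target]
  cors <- cors[names(cors) != target]          # drop the trivial self-correlation
  cors[order(abs(cors), decreasing = TRUE)]    # strongest first, sign preserved
}

# Made-up values for illustration only.
wine_toy <- data.frame(
  alcohol          = c(9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5),
  volatile.acidity = c(0.70, 0.55, 0.80, 0.50, 0.60, 0.40, 0.45, 0.30),
  quality          = c(4, 5, 5, 5, 6, 6, 7, 7)
)

ranked <- rank_by_correlation(wine_toy)
print(round(ranked, 2))
```

Since quality is an ordinal score, passing `method = "spearman"` is a reasonable alternative to try; whether it changes the ranking on the real data is not something this sketch can tell.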
Since no single property appeared to correlate strongly with quality, I had to explore the bivariate relations graphically. I got curious about a few trends in particular: ‘Sulphates vs. Quality’, as low-sulphate wine has a reputation for not causing hangovers; ‘Acidity vs. Quality’, since acidity impacts many factors like pH, taste, and color, so it was compelling to see whether it affects quality; and ‘Alcohol vs. Quality’, which seemed to be an interesting measurement.
Acidity vs. Rating/Quality
The boxplots by quality also show how the wines are distributed, and we can again see that wines with a quality of ‘5’ and ‘6’ make up the largest share. The red dot is the mean, and the middle line shows the median of the acidity levels. The plots show that acidity decreases as the quality of the wines improves. However, the difference is not very noticeable. This is most likely because most wines tend to maintain a similar acidity level (as volatile acidity is responsible for the sour taste in wine). Hence, a density plot of volatile acidity was drawn to investigate the data further.
Red wines of quality ‘7’ and ‘8’ have their peaks for ‘Volatile Acidity’ well below the 0.4 mark. Wine with quality ‘3’ has its peak further to the right (towards higher volatile acidity levels), which shows that the better quality wines are less sour and, in general, have lower acidity.
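The location of each density peak can also be read off numerically instead of from the plot. The sketch below uses base R's `density()` on made-up data for two quality levels; `density_peak` is a hypothetical helper name, and the simulated values only imitate the kind of separation described above.

```r
# Return the x position where the kernel density estimate is highest.
density_peak <- function(x) {
  d <- density(x)
  d$x[which.max(d$y)]
}

# Simulated volatile acidity for two quality levels (illustrative only).
set.seed(7)
toy <- data.frame(
  quality          = rep(c(3, 8), each = 100),
  volatile.acidity = c(rnorm(100, mean = 0.85, sd = 0.10),
                       rnorm(100, mean = 0.35, sd = 0.05))
)

# One peak location per quality level.
peaks <- tapply(toy$volatile.acidity, toy$quality, density_peak)
print(round(peaks, 2))
```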
Alcohol vs. Quality
The plot of residual sugar against alcohol content suggests that there is no clear relation between the two, which is surprising, as alcohol is a byproduct of yeast feeding on sugar during fermentation. That inference could not be established here. Alcohol and quality appear to be somewhat correlated: lower quality wines tend to have lower alcohol content. This made me curious about whether there is an upper bound to the useful alcohol concentration, or whether adding more alcohol would simply give better wine.
Aha! The above line plot indicates a nearly linear increase up to 13% alcohol concentration, followed by a steep downward trend.
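A trend line of this kind can be approximated by binning alcohol content and averaging quality within each bin. The values below are made up and deliberately shaped to rise and then fall; they are not the study's data, and the one-percentage-point bin width is an arbitrary choice for this sketch.

```r
# Made-up values shaped to rise and then fall; not the study's data.
toy <- data.frame(
  alcohol = c(9.2, 9.8, 10.4, 10.9, 11.3, 11.8, 12.4, 12.9, 13.5, 14.2),
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 6, 5)
)

# Left-closed bins: [9,10), [10,11), ..., [14,15).
toy$alcohol_bin <- cut(toy$alcohol, breaks = seq(9, 15, by = 1), right = FALSE)

# Average quality within each alcohol bin.
mean_quality <- tapply(toy$quality, toy$alcohol_bin, mean)

# Bin with the highest average quality.
best_bin <- names(mean_quality)[which.max(mean_quality)]
print(mean_quality)
```

On real data, bins at the high-alcohol end are likely to contain few wines, so a drop there should be read with the bin counts in mind.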
Sulphates vs. Quality
There is a slight trend implying a relationship between sulphates and wine quality, mainly if extreme sulphate values are ignored: disregarding measurements where sulphates > 1.0 is the same as disregarding the positive tail of the distribution and keeping just the normal-looking portion. Even though it can be said that good wines have higher sulphate values than bad wines, the difference is not that wide.
I was more or less satisfied with the results of the bivariate analysis section, so I decided to include visualizations that take the bivariate analysis a step further, i.e., that help to understand the earlier patterns better or strengthen the arguments presented earlier.
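The effect of dropping the tail can be illustrated by computing the correlation with and without the extreme sulphate values. The numbers below are made up so that two high-sulphate wines sit against the overall trend; the 1.0 cutoff is the one mentioned above.

```r
# Made-up values: the last two rows are high-sulphate outliers.
toy <- data.frame(
  sulphates = c(0.45, 0.55, 0.60, 0.70, 0.75, 0.85, 1.60, 1.95),
  quality   = c(4, 5, 5, 6, 7, 7, 5, 4)
)

# Keep only the normal-looking portion of the distribution.
trimmed <- subset(toy, sulphates <= 1.0)

cor_all     <- cor(toy$sulphates, toy$quality)
cor_trimmed <- cor(trimmed$sulphates, trimmed$quality)
print(c(all = cor_all, trimmed = cor_trimmed))
```

In this toy example the trend only becomes visible after trimming, which is why the cutoff should be stated whenever such a correlation is reported.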
Nearly every wine had a volatile acidity of less than 0.8. As discussed earlier, the ‘A’ rated wines all have a volatile acidity of less than 0.6. For wines with rating ‘B’, the volatile acidity is between 0.4 and 0.8. Some ‘C’ rated wines have a volatile acidity of more than 0.8. Also, most ‘A’ rated wines have a citric acid value of 0.25 to 0.75, while the ‘B’ rated wines have a citric acid value below 0.50.
It is incredible to see that nearly all wines lay below a sulphates level of 1.0. Due to overplotting, wines with rating ‘B’ were removed. Rating ‘A’ wines mostly had sulphate values between 0.5 and 1, and the best-rated wines had sulphate values between 0.6 and 1.
I realized halfway through the study that, because wine rating is a subjective measure, statistical correlation values were not a very suitable metric for finding important factors. The graphs aptly depict that there is a suitable range for each property and that it is some combination of chemical factors that contributes to the flavor of the wine.
In this project, I was able to examine the relationships between physicochemical properties and identify the key variables that determine red wine quality, which are alcohol content, volatile acidity, and sulphate levels. The dataset was quite interesting, though limited in large-scale implications. If this dataset had included ‘price’, I could have targeted the best wines within price categories and looked at which aspects correlate with a high-performing wine in any price bracket. Overall, I was initially surprised by the seemingly dispersed nature of the wine data. Nothing immediately stood out as an inherent property of good wines. However, upon reflection, this is a sensible finding. Winemaking is still less of a science and more of an art, and if there were one single property or process that continually yielded high-quality wines, the field wouldn’t be what it is.
Additionally, having the wine type would be helpful for further analysis. Sommeliers might prefer certain types of wines to have different properties and behaviors. For example, a Port (a sweet dessert wine) surely is rated differently from a dark and robust Cabernet Sauvignon, which is rated differently from a bright and fruity Syrah. Without knowing the type of wine, it is entirely possible that we are almost literally comparing apples to oranges and can’t find a correlation.
With my amateurish knowledge of wine tasting, I tried my best to relate the ratings to how I would rate a bottle of wine at dinner. However, in the future, I would like to do some research into the winemaking process (maybe this MOOC). Some winemakers might actively aim for particular property values or combinations, and finding those combinations (of 3 or more properties) might be the key to truly predicting wine quality. This investigation was not able to find a robust, generalized model that would consistently predict wine quality with any degree of certainty. If I were to continue with this specific dataset, I would aim to train a classifier to correctly predict the wine category, in order to better grasp the fine details of what makes a good wine.
According to the study, it can be concluded that the best wines are those with an alcohol concentration of about 13%, low volatile acidity, and a high sulphates level (with an upper cap of 1.0 g/dm3).