I wrote this book because I fell in love with a story. The story concerned a small group of undervalued professional baseball players and executives, many of whom had been rejected as unfit for the big leagues, who had turned themselves into one of the most successful franchises in Major League Baseball. ….how did one of the poorest teams in baseball, the Oakland Athletics, win so many games? – Moneyball, Pg. 1
The Oakland A’s were once a wealthy and very successful team, making the playoffs nine times from 1972 to 1992. However, their fortunes then turned, with a long run of losses and a change in ownership that led to massive budget cuts.
During this time, the A’s turned to a new general manager, Billy Beane, to restore a winning tradition the club had not seen since the 1980s. Beane needed a way to keep the Oakland A’s elite. His assistant, Paul DePodesta, introduced him to a theory similar to that of the great Bill James. The approach would come to be called Moneyball. The goal was to find undervalued metrics and use them to identify players who cost less than they were worth.
…the game was ceasing to be an athletic competition and becoming a financial one. The gap between rich and poor in baseball was far greater than in any other professional sport, and widening rapidly. …. The raw disparities meant that only the rich teams could afford the best players. – Moneyball, Pg. 1
In the above graph, the horizontal axis shows the average payroll over the years 1998 to 2001, and the vertical axis shows the average yearly wins over the same period. The team in blue is the New York Yankees, who won about 100 games and spent roughly $90 million per year over that period. The team in red is the Boston Red Sox, who spent nearly $80 million and won about 90 games.
The Oakland A’s are marked in green. They won about 90 games while spending under $30 million. Compared with the Red Sox, they won about the same number of games over this period, yet the Red Sox spent about $50 million more per year than the A’s.
Rich teams like the Yankees and the Red Sox could afford the all-star players, which makes the A’s efficiency all the more striking. As mentioned, they won 90 games on a payroll under $30 million, while the Yankees spent roughly three times as much without winning significantly more games. In general, the rich teams had three to four times the payroll of the poor teams, yet the A’s made the playoffs every year.
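These rough figures can be turned into a cost-per-win comparison. A quick sketch in R — note that the payroll and win numbers below are approximations read off the graph, not exact values:

```r
# Approximate average payroll (millions of $) and average wins, 1998-2001,
# eyeballed from the plot described above
payroll <- c(Yankees = 90, RedSox = 80, Athletics = 30)
wins    <- c(Yankees = 100, RedSox = 90, Athletics = 90)
round(payroll / wins, 2)  # millions of dollars spent per win
```

By this crude measure the A’s paid roughly a third of what the Yankees and Red Sox paid per win.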
Taking a quantitative approach, they were able to find undervalued players and build very efficient teams. So the A’s started using a different method to select players. The traditional way was scouting: scouts would watch high school and college players and report back on their skills, especially their speed and athletic build. The A’s, however, selected players based on their statistics, not on how they looked.
The statistics enable you to find your way past all sorts of sight-based scouting prejudices. – Moneyball, Pg. 30
In the 1980s and 1990s, baseball teams did hire analysts, but none of them had enough influence to change anything significant. Billy Beane, working with a rather small budget, understood the importance of analytics, while most general managers knew little about statistics and based decisions primarily on gut feeling.
Billy Beane was not afraid to alienate scouts, managers, and players whenever the quantitative approach suggested different decisions than they did. He believed the theory could work, much to the disagreement of most of his employees. The players brought in to replace the stars weren’t household names. The key premise of the Oakland A’s was that if they could detect undervalued skills, they could find players at a bargain. More on scouting and the Moneyball theory can be read in this article.
On the left is Scott Hatteberg, whom the A’s selected. He did not throw particularly well but got on base a lot. On the right is Derek Jeter, one of the top players in baseball, a consistent shortstop and a leader in hits and stolen bases.
The approach was also followed for pitchers. On the left is Chad Bradford, a pitcher for the A’s, a submariner who used an unconventional delivery and slow speed. On the right is Roger Clemens, one of the best pitchers in the game who used a conventional delivery with a fast pace.
This section demonstrates the data analysis using R. The dataset baseball.csv comes from Baseball-Reference.com.
If you are unfamiliar with the game of baseball, you can watch this short video clip for a quick introduction to the game. Although not necessary, basic knowledge of the game might help in intuitively understanding this analysis.
Before the 2002 season, Paul DePodesta … judged how many wins it would take to make the playoffs: 95. He then calculated how many more runs the Oakland A’s would need to score than they allowed to win 95 games: 135 ……
Then, using the A’s players’ past performance as a guide, he made reasoned arguments about how many runs they would actually score and allow. ….the team would score between 800 and 820 runs and give up between 650 and 670 runs*. From that, he predicted the team would win between 93 and 97 games and probably wind up in the playoffs.
* They wound up scoring 800 and allowing 653 – Moneyball, Pg. 90
The goal of a baseball team is to make the playoffs. The Oakland A’s approach to getting to the playoffs was via the use of analytics.
# Reading the Data
baseball <- read.csv("baseball.csv")
str(baseball)
'data.frame':	1232 obs. of  15 variables:
 $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 2 3 4 5 7 8 9 10 11 12 ...
 $ League      : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
 $ Year        : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
 $ RS          : int  734 700 712 734 613 748 669 667 758 726 ...
 $ RA          : int  688 600 705 806 759 676 588 845 890 670 ...
 $ W           : int  81 94 93 69 61 85 97 68 64 88 ...
 $ OBP         : num  0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
 $ SLG         : num  0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
 $ BA          : num  0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
 $ Playoffs    : int  0 1 1 0 0 0 1 0 0 1 ...
 $ RankSeason  : int  NA 4 5 NA NA NA 2 NA NA 6 ...
 $ RankPlayoffs: int  NA 5 4 NA NA NA 4 NA NA 2 ...
 $ G           : int  162 162 162 162 162 162 162 162 162 162 ...
 $ OOBP        : num  0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
 $ OSLG        : num  0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...
This dataset includes an entry for every team from 1962 to 2012. There are 15 variables in the data set including Runs Scored (RS), Runs Allowed (RA) and Wins (W).
Since the aim is to verify the claims made in the book, the required data is the subset of this dataset including only the years up to 2002.
# Subset the Data
moneyball <- subset(baseball, Year < 2002)
str(moneyball)
'data.frame':	902 obs. of  15 variables:
 $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 1 2 3 4 5 7 8 9 10 11 ...
 $ League      : Factor w/ 2 levels "AL","NL": 1 2 2 1 1 2 1 2 1 2 ...
 $ Year        : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
 $ RS          : int  691 818 729 687 772 777 798 735 897 923 ...
 $ RA          : int  730 677 643 829 745 701 795 850 821 906 ...
 $ W           : int  75 92 88 63 82 88 83 66 91 73 ...
 $ OBP         : num  0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
 $ SLG         : num  0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
 $ BA          : num  0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
 $ Playoffs    : int  0 1 1 0 0 0 0 0 1 0 ...
 $ RankSeason  : int  NA 5 7 NA NA NA NA NA 6 NA ...
 $ RankPlayoffs: int  NA 1 3 NA NA NA NA NA 4 NA ...
 $ G           : int  162 162 162 162 161 162 162 162 162 162 ...
 $ OOBP        : num  0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
 $ OSLG        : num  0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...
The dataset now has 902 observations of the same 15 variables.
moneyball_1996_2001 <- subset(baseball, Year < 2002 & Year >= 1996)
ggplot(data = moneyball_1996_2001, aes(x = W, y = Team)) +
  theme_bw() +
  scale_color_manual(values = c("grey", "red3")) +
  geom_vline(xintercept = c(85.0, 95.0), col = "purple", linetype = "longdash") +
  geom_point(aes(color = factor(Playoffs)), pch = 16, size = 3.0)
To build a linear regression model that predicts Wins (W) from the difference between Runs Scored (RS) and Runs Allowed (RA), a new variable, RD (Run Difference), is added to the dataset.
# Compute Run Difference
moneyball$RD <- moneyball$RS - moneyball$RA
Before building a predictive (regression) model, it is important to explore the data to get an idea of the shape of the relationship.
ggplot(data = moneyball, aes(x = W, y = RD)) + theme_bw() + geom_point()
The obtained plot suggests a strong linear relationship between the two variables. Hence, a linear regression model is suitable for prediction.
WinsReg = lm(W ~ RD, data=moneyball)
summary(WinsReg)
Call:
lm(formula = W ~ RD, data = moneyball)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.2662  -2.6509   0.1234   2.9364  11.6570 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 80.881375   0.131157  616.67   <2e-16 ***
RD           0.105766   0.001297   81.55   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.939 on 900 degrees of freedom
Multiple R-squared:  0.8808,	Adjusted R-squared:  0.8807 
F-statistic:  6651 on 1 and 900 DF,  p-value: < 2.2e-16
The summary shows that RD is a highly significant predictor in the linear model.
Using this linear model, we can verify the claim that if a team scores at least 135 more runs than it allows over the regular season, it is predicted to win at least 95 games and make the playoffs.
The obtained linear regression equation is:
\[ W = 80.881375 + 0.105766 \times (RD) \]
To win the required number of matches, i.e., \( W \ge 95 \), \( RD \) can be calculated as:
\[ RD \ge \frac{95\ -\ 80.881375}{0.105766} \approx 135 \]
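The same threshold can be recomputed directly in R. A minimal sketch, using the coefficient values printed in the regression summary above rather than the fitted `WinsReg` object:

```r
# Run difference needed for >= 95 predicted wins,
# from W = 80.881375 + 0.105766 * RD
intercept <- 80.881375
slope     <- 0.105766
rd_needed <- (95 - intercept) / slope
round(rd_needed)  # about 133, close to DePodesta's figure of 135
```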
It is important to know how many runs a team will score, which can be predicted with batting statistics, and how many runs a team will allow, which can be predicted using fielding and pitching statistics.
How does a team score more runs? Traditionally, most baseball teams and experts used Batting Average (BA) (a measure of how often a player gets on base by hitting the ball) to judge a batter’s skill. The Oakland A’s claimed that On-Base Percentage (OBP) (the percentage of the time a player gets on base, including walks) is the most important statistic, that Slugging Percentage (SLG) (how far a player advances around the bases on his turn at bat) is somewhat significant, and that Batting Average (BA) is overrated. They discovered that these two statistics, OBP and SLG, were significantly more important than any other.
RunsReg = lm(RS ~ OBP + SLG + BA, data=moneyball)
summary(RunsReg)
Call:
lm(formula = RS ~ OBP + SLG + BA, data = moneyball)

Residuals:
    Min      1Q  Median      3Q     Max 
-70.941 -17.247  -0.621  16.754  90.998 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -788.46      19.70 -40.029  < 2e-16 ***
OBP          2917.42     110.47  26.410  < 2e-16 ***
SLG          1637.93      45.99  35.612  < 2e-16 ***
BA           -368.97     130.58  -2.826  0.00482 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 24.69 on 898 degrees of freedom
Multiple R-squared:  0.9302,	Adjusted R-squared:  0.93 
F-statistic:  3989 on 3 and 898 DF,  p-value: < 2.2e-16
From the summary of the linear regression model, it can be noted that Batting Average (BA) is less significant than other independent variables, i.e., On-Base Percentage (OBP) & Slugging Percentage (SLG).
The coefficient on Batting Average is negative, which is counterintuitive and is likely an artifact of multicollinearity (high correlation between the independent variables).
A simpler linear model can be built by removing Batting Average (BA) from the model.
RunsReg = lm(RS ~ OBP + SLG, data=moneyball)
summary(RunsReg)
Call:
lm(formula = RS ~ OBP + SLG, data = moneyball)

Residuals:
    Min      1Q  Median      3Q     Max 
-70.838 -17.174  -1.108  16.770  90.036 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -804.63      18.92  -42.53   <2e-16 ***
OBP          2737.77      90.68   30.19   <2e-16 ***
SLG          1584.91      42.16   37.60   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 24.79 on 899 degrees of freedom
Multiple R-squared:  0.9296,	Adjusted R-squared:  0.9294 
F-statistic:  5934 on 2 and 899 DF,  p-value: < 2.2e-16
From the above model, the coefficient of On-Base Percentage (OBP) is larger than Slugging Percentage (SLG), which implies that OBP is more significant to the model than SLG.
The obtained model for Runs Scored is:
\[ Runs\ Scored = -804.63\ +\ 2737.77\ \times OBP\ +\ 1584.91\ \times\ SLG \]
Similarly, using ‘OOBP’ (Opponents OBP) and ‘OSLG’ (Opponent’s SLG), a linear model for Runs Allowed can be made.
\[ Runs\ Allowed = -837.38\ +\ 2913.60\ \times OOBP\ +\ 1514.29\ \times\ OSLG \]
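As a sanity check on these coefficients, a small helper function (the function name is my own, not from the original analysis) reproduces the Runs Allowed prediction used later in the post:

```r
# Hypothetical helper wrapping the Runs Allowed coefficients quoted above
predict_runs_allowed <- function(oobp, oslg) {
  -837.38 + 2913.60 * oobp + 1514.29 * oslg
}

# The 2001 A's opponent statistics used further below
round(predict_runs_allowed(0.308, 0.38))  # ~635
```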
To predict, before the season starts, how many games the 2002 Oakland A’s will win, we first have to predict how many runs the team will score and how many it will allow. These models use team statistics.
When predicting for the 2002 Oakland A’s before the season had been played, the roster was probably different from the year before, so current team statistics were unavailable. However, these statistics can be estimated from past player performance, assuming that past performance correlates with future performance and that there will be few injuries during the season.
The Oakland A’s 2001 stats are used to predict their 2002 performance. Ideally, the mean statistics of only the relevant players should be used.
OAK = subset(moneyball, Team=='OAK')
OAK2001 = subset(OAK, Year==2001)
mean(OAK2001$OBP)
[1] 0.345
mean(OAK2001$SLG)
[1] 0.439
Substituting these values (and the corresponding 2001 opponent statistics, OOBP = 0.308 and OSLG = 0.38) into the linear models,
\[ Runs\ Scored = −804.63 + 2737.77 \times (0.345) + 1584.91 \times (0.439) \approx 835 \]
\[ Runs\ Allowed = −837.38 + 2913.60 \times (0.308) + 1514.29 \times (0.38) \approx 635 \]
Substituting the obtained values in the linear model for ‘Number of Wins,’
\[ Wins = 80.881375 + 0.105766 \times (835 − 635) \approx 102 \]
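The whole chain of predictions can be sketched end to end in a few lines of R, using only the coefficients quoted above (so no dataset is needed to run it):

```r
# Plug the 2001 team statistics into the three fitted models quoted above
runs_scored  <- -804.63 + 2737.77 * 0.345 + 1584.91 * 0.439   # ~836
runs_allowed <- -837.38 + 2913.60 * 0.308 + 1514.29 * 0.380   # ~635
wins <- 80.881375 + 0.105766 * (runs_scored - runs_allowed)
round(wins)  # ~102, matching the prediction above
```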
The Oakland A’s made it to the playoffs four years in a row – 2000, 2001, 2002, and 2003 – but they didn’t win the World Series. It was stated earlier that the goal of a baseball team is to make the playoffs – but why isn’t the aim to win the playoffs, or to win the World Series?
Over a long season, luck evens out, and skill shines through. In a series of three out of five, or even four out of seven, anything can happen – Moneyball, Pg. 199
In other words, the playoffs suffer from a small-sample-size problem: there are not enough games to make any statistical claims.
Baseball has always been a game of numbers and statistics, and thanks to the modern explosion of data, along with new analytics software running on powerful computers, baseball analytics (a.k.a. sabermetrics) is undergoing changes that might make Moneyball look like it belongs in the minor leagues.
Note that the model used in this study is reasonably simple – basic regression with only a handful of variables. Even so, it contributed to the significant success of the Oakland A’s and, more generally, of teams that harness the power of sports analytics.
In human behavior, there was always uncertainty and risk. The goal of the Oakland front office was simply to minimize the risk. Their solution wasn’t perfect. It was just better than rendering decisions by gut feeling – Moneyball, Pg. 99
The significance of Moneyball lies in the fact that it describes the first instance in which the use of analytics in sports became popular. Baseball is not the only sport that uses analytics; statistical analysis is applied in almost every game, including basketball, soccer, cricket, and hockey, and nowadays most sports teams have a dedicated statistics/analytics group. What is the edge of using analytics? Models allow managers to value players more accurately and to minimize the risk that arises from decisions made by intuition.
Sports analytics will continue to grow and will undoubtedly be relied on more heavily, but there are still ways it can be improved. It is an exciting area to delve into, with ample literature available for different sports, from on-field decision-making to broadcast analytics. If you’d like to collaborate on a project in the domain, feel free to reach out to me.
Before beginning my investigation, I spent time learning about the dataset and its terminology. The curated dataset contains 1,599 red wines with 11 variables describing the chemical properties of each wine, along with a quality variable rated by wine experts. The preparation of the dataset is described in this link.
Before beginning my analysis, I needed a starting point. To lead the univariate analysis, I chose to build a grid of histograms that represent the distributions of each variable in the dataset, hoping to distinguish the most interesting attributes.
There were some really interesting variations in the distributions. Working from the top-left to the right, selected plots were analyzed for further insight, as described below.
The first feature I investigated was acidity. When reading about wines, I learned that fixed acidity is determined by acids that do not evaporate readily – chiefly tartaric acid. It contributes to many other attributes, including taste, pH, color, and stability to oxidation, i.e., it keeps the wine from tasting flat [1]. Volatile acidity, on the other hand, is responsible for the sour taste in wine: a very high value leads to sour-tasting wine, while a low amount can make the wine seem dense [2].
There is a slight positive (right) skew in the data because a few wines possess very high fixed acidity. Given the importance of this factor, and an indication of a standard range of acidity for good wines, this attribute is examined again in the bivariate analysis.
Next, sulfur dioxide and sulphates were studied. Free sulfur dioxide is the free form of SO_{2}; it exists in equilibrium between molecular SO_{2} (a dissolved gas) and the bisulfite ion, and it prevents microbial growth and the oxidation of wine. Sulphates are a wine additive that can contribute to sulfur dioxide (SO_{2}) levels, acting as an antimicrobial as well as an antioxidant – overall keeping the wine fresh [3]. Their distributions did not provide any exciting inference, so I have left them at this point.
Finally, alcohol was examined, as it is what adds that special something that turns rotten grape juice into a drink many people love. Intuitively, then, it should be crucial in determining wine quality.
print("Summary statistics for alcohol %age.")
summary(wine$alcohol)
## [1] "Summary statistics for alcohol %age."
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
The mean alcohol content of the wines is 10.42%, and the median is 10.2%. The distribution also suggests that the majority of wines tend to have lower alcohol content. This attribute’s impact on quality is discussed later as part of the bivariate analysis.
What about quality? Quality is a very subjective measure, which made me doubtful for a moment, until I studied how the attribute was measured: for this dataset, each wine was rated by at least three experts on a scale of 0 to 10, and the median value was taken.
Overall, ‘quality’ has a normal shape with very few exceptionally high or low ratings. The minimum rating is 3 and the maximum is 8, with very few wines rated at the extremes. Hence, a variable called ‘rating’ is created from quality.
# Dividing the quality into 3 rating levels
wine$rating <- ifelse(wine$quality < 5, 'C',
                      ifelse(wine$quality < 7, 'B', 'A'))

# Changing it into an ordered factor
wine$rating <- ordered(wine$rating, levels = c('C', 'B', 'A'))
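A quick toy check of the binning logic, on the observed quality range of 3 to 8 (the vector below is illustrative, not drawn from the dataset):

```r
# Verify the quality -> rating mapping on the range of observed scores
quality <- 3:8
rating  <- ifelse(quality < 5, 'C', ifelse(quality < 7, 'B', 'A'))
rating  # "C" "C" "B" "B" "A" "A"
```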
The distribution of ‘rating’ is concentrated in the ‘B’-rated wines, as seen in the quality distribution, which is likely to cause overplotting. Therefore, in many places only the ‘C’ and ‘A’ wines are compared, to find the distinctive properties that separate these two categories. I first compared the summary statistics of the two classes. This seemed suitable only for identifying candidate quality-impacting variables and setting a direction for further analysis; no conclusion could be drawn from it.
The distributions studied in this section were primarily used to identify trends in the variables, which helps set a course for the bivariate and multivariate analysis. An interesting measurement is wine quality, since it is a subjective measure of how attractive the wine might be to a consumer; the goal is then to try to correlate the non-subjective wine properties with quality. At first, the lack of an age variable felt like an omission, since age is commonly used in quick assessments of wine quality. However, since wine age acts through the wine’s measurable chemical properties, its exclusion may not matter.
Before beginning this section of the investigation, a correlation matrix was plotted.
Since no single property appeared to correlate strongly with quality, I had to explore the bivariate relations graphically. A few trends made me curious: ‘Sulphates vs. Quality,’ since low-sulphate wine has a reputation for not causing hangovers; ‘Acidity vs. Quality,’ since acidity affects many factors such as pH, taste, and color; and ‘Alcohol vs. Quality,’ which seemed an interesting measurement in its own right.
The boxplots of quality also depict the distribution of the wines, and again wines with quality ‘5’ and ‘6’ have the largest share. The red dot is the mean, and the middle line shows the median of the acidity levels. The plots show how acidity decreases as the quality of the wine improves, though the difference is not very noticeable, most likely because most wines maintain a similar acidity level (volatile acidity being responsible for the sour taste in wine). Hence, a density plot of the attribute is drawn to investigate further.
Red wines of quality ‘7’ and ‘8’ have their ‘Volatile Acidity’ peaks well below the 0.4 mark, while quality-‘3’ wine peaks further to the right (toward higher volatile acidity levels). This shows that the better wines are less sour and, in general, less acidic.
The plot of residual sugar against alcohol content suggests no clear relationship between the two, which is surprising, as alcohol is a byproduct of yeast feeding off sugar during fermentation; that inference could not be established here. Alcohol and quality, however, appear somewhat correlated: lower-quality wines tend to have lower alcohol content. This made me curious whether there is an upper bound to the alcohol concentration, or whether adding more alcohol would simply yield better wine.
Aha! The above line plot indicates a nearly linear increase up to 13% alcohol concentration, followed by a steep downward trend.
There is a slight trend implying a relationship between sulphates and wine quality, mainly if extreme sulphate values are ignored – disregarding measurements where sulphates > 1.0 amounts to disregarding the long positive tail of the distribution and keeping just the normal-looking portion. Even though good wines tend to have higher sulphate values than bad wines, the difference is not that wide.
I was more or less satisfied with the results of the bivariate analysis, so I decided to include visualizations that take it a step further, i.e., that clarify the earlier patterns or strengthen the arguments presented above.
Nearly every wine has volatile acidity below 0.8. As discussed earlier, the ‘A’-rated wines all have volatile acidity below 0.6; for ‘B’-rated wines, volatile acidity falls between 0.4 and 0.8; and some ‘C’-rated wines exceed 0.8. Also, most ‘A’-rated wines have a citric acid value of 0.25 to 0.75, while the ‘B’-rated wines stay below 0.50.
It is striking that nearly all wines lie below the 1.0 sulphate level. Due to overplotting, wines with rating ‘B’ were removed. ‘A’-rated wines mostly had sulphate values between 0.5 and 1, and the best-rated wines had values between 0.6 and 1.
I realized halfway through the study that, because wine rating is a subjective measure, statistical correlation values are not a very suitable metric for finding important factors. The graphs aptly show that there is an adequate range, and that it is some combination of chemical factors that contributes to the flavor of the wine.
In this project, I examined the relationships among physicochemical properties and identified the key variables that determine red wine quality: alcohol content, volatile acidity, and sulphate level. The dataset was quite interesting, though limited in large-scale implications. If this dataset had included ‘price,’ I could have targeted the best wines within price categories and the aspects that correlate with a high-performing wine in any price bracket. Overall, I was initially surprised by the seemingly dispersed nature of the wine data: nothing was immediately correlatable with being an inherent quality of good wines. On reflection, however, this is a sensible finding. Winemaking is still less a science than an art, and if there were one single property or process that reliably yielded high-quality wines, the field wouldn’t be what it is.
Additionally, having the wine type would be helpful for further analysis. Sommeliers might prefer certain types of wines to have different properties and behaviors. For example, a Port (a sweet dessert wine) surely is rated differently from a dark and robust Cabernet Sauvignon, which is rated differently from a bright and fruity Syrah. Without knowing the type of wine, it is entirely possible that we are almost literally comparing apples to oranges and can’t find a correlation.
With my amateurish knowledge of wine tasting, I tried my best to relate the analysis to how I would rate a bottle of wine at dinner. In the future, however, I would like to research the winemaking process (maybe via this MOOC). Some winemakers might actively aim for certain property values or combinations, and finding those combinations (of three or more properties) might be the key to truly predicting wine quality. This investigation could not find a robust, generalized model that consistently predicts wine quality with any degree of certainty. If I were to continue with this specific dataset, I would aim to train a classifier to predict the wine category correctly, in order to better grasp the subtleties of what makes a good wine.
From this study, it can be concluded that the best wines tend to have an alcohol concentration of about 13%, low volatile acidity, and high sulphate levels (with an upper cap of 1.0 g/dm^{3}).
Let’s build a better understanding of this concept through an example. Imagine that we have to cook a meal for our friends from a given set of ingredients. The question is how much salt, how many vegetables, and how much meat go into the pan. These are the variables we can adjust, and the goal is to choose the optimal amount of each ingredient to maximize the tastiness of the meal. Tastiness will be our objective function, and for a moment we shall pretend that tastiness is an objective measure of a meal.
What does this mean in practice? In our cooking example, after making several meals, we would ask our guests to rate their tastiness. From their responses, we might learn that adding a bit more salt led to very favorable results, and, since these people are notorious meat eaters, that decreasing the vegetables and increasing the meat also led to favorable reviews. On the back of this newfound knowledge, we would cook further meals with these adjustments, in pursuit of the “best possible meal in the history of mankind.”
A numerical technique that works much like this example is Gradient Descent, one of the most popular algorithms for mathematical optimization. Gradient Descent has been around for centuries; these days, it finds its most extensive use in machine learning. A remarkably large fraction of modern machine-learning research, including the much-hyped ‘deep learning,’ boils down to implementing variants of gradient descent at very large scale. Like many other machine-learning enthusiasts, I was first introduced to the algorithm in Andrew Ng’s ‘Machine Learning’ class on Coursera. For many, this algorithm has served as the gateway to machine learning, yet many miss out on it as an optimization algorithm in its own right.
The algorithm is extremely simple, though for some the mathematics or the explanations in this post might be tedious. To apply it effectively in practice, however, it is important to understand the algorithm at its core, since that understanding reveals the ways one can tweak it. This post aims to demonstrate those areas of Gradient Descent that are often overlooked. We first look at the mathematical outline of the algorithm; subsequently, a formal treatment is given, along with an example optimization problem, to see gradient descent in action.
Viewed the right way, gradient descent is straightforward to understand and remember – more so than most algorithms. I highly encourage you to be patient and read the post. Trust me, it is not the mathematics that is intimidating, but the notation.
First off, what problem is gradient descent trying to solve? Unconstrained minimization: for a real-valued function \( f:\mathbb{R}^n\to\mathbb{R} \) defined on an \( n \)-dimensional Euclidean space, the goal is \( \min f(x) \) over \( x \in \mathbb{R}^n \). Note that maximization also falls under this definition, since maximizing \( f \) is the same as minimizing \( -f \). In unconstrained optimization there are no constraints; the only consideration is the objective function. Normally, a key ingredient of linear and convex programs is their constraints, which describe the permitted solutions. There are various ways to transform constrained problems into unconstrained ones (such as Lagrangian relaxation), and various ways to extend gradient descent to handle constraints (such as projecting back onto the feasible region), but for simplicity those aspects are not discussed in this post. It is also assumed that \( f \) is differentiable (and hence continuous).
Suppose \( n=1 \), such that \( f:\mathbb{R}\to\mathbb{R} \) is a univariate real-valued function.
Intuitively, what would it mean to minimize \( f \) via a greedy local search? For example, starting at the point \( x_0 \): looking to the right, \( f \) goes up; looking to the left, \( f \) goes down; so we move further to the left, since we want to make \( f \) as small as possible. Starting at \( x_1 \), \( f \) is decreasing to the right and increasing to the left, so we would move further to the right. In the first case, the algorithm terminates at the bottom of the left basin; in the second, at the bottom of the right basin. This is discussed later; for the initial intuition, consider a fully convex function.
A little more formally – the basic algorithm with \( n=1 \) is the following.
At each step, the derivative of \( f \) is used to decide which direction to move in. Already with \( n=1 \), it is clear from the first graph that the outcome of gradient descent depends on the starting point. For a fully convex function, the algorithm can only terminate at a global minimum. However, this example also shows how, with a non-convex function, gradient descent can compute a local minimum – a point where there's no way to improve \( f \) by moving a little bit in either direction – without having reached the global minimum.
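In code, the \( n=1 \) procedure can be sketched as follows. This is an illustrative sketch of my own, not the post's exact algorithm: the test function \( f(x) = (x-3)^2 \), the step size, and the tolerance are example choices.

```python
# Minimal 1-D gradient descent sketch (illustrative values of my choosing).
def grad_descent_1d(f_prime, x0, alpha=0.1, epsilon=1e-8, max_iter=10000):
    x = x0
    for _ in range(max_iter):
        step = alpha * f_prime(x)   # move against the derivative
        if abs(step) < epsilon:     # stop once progress stalls
            break
        x = x - step
    return x

# Example: f(x) = (x - 3)^2 has derivative 2(x - 3) and its minimum at x = 3.
x_min = grad_descent_1d(lambda x: 2 * (x - 3), x0=0.0)
```

Starting at \( x_0 = 0 \) the iterates contract geometrically towards 3; for this convex \( f \), any starting point leads to the same global minimum.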
In almost all real applications of gradient descent, the number of dimensions is much larger than \( 1 \). The immediate increase in complexity can be observed already with \( n=2 \): from a point \( \mathbf{x} \in \mathbb{R}^n \), there are infinitely many directions in which we could move, not just two as in the case of a univariate function (as can be seen in the image below). On a side note, if you recognize the image below, you are awesome!
To develop an intuition, we first consider the case of linear functions of the form \( f(x)=\mathbf{c}^T \mathbf{x} \ + b \), where \( \mathbf{c}\in\mathbb{R}^n \) is an \( n \)-vector and \( b\in\mathbb{R} \) is a scalar.
Suppose you are currently at a point \( \mathbf{x} \in \mathbb{R}^n \), and you are allowed to move at most one unit of Euclidean distance in whatever direction you want. Where should you go to decrease the function \( f(\mathbf{x}) = \mathbf{c}^T\mathbf{x}+b \) as much as possible? How much will the function decrease?
To answer this, let \( \mathbf{u} \in \mathbb{R}^n \) be a unit vector, so that moving along \( \mathbf{u} \) covers a unit distance from \( \mathbf{x} \). The change observed in the objective function is described as follows.
\[ \mathbf{c}^T\mathbf{x} + b \;\longmapsto\; \mathbf{c}^T(\mathbf{x} + \mathbf{u}) + b \]
\[ =(\mathbf{c}^T\mathbf{x} + b) + \mathbf{c}^T\mathbf{u} \]
\[ =(\mathbf{c}^T\mathbf{x} + b) + \Vert\mathbf{c}\Vert_2\Vert\mathbf{u}\Vert_2 \cos\theta \]
where \( \theta \) denotes the angle between the vectors \( \mathbf{c} \) and \( \mathbf{u} \). To decrease \( f \) as much as possible, we should make \( \cos\theta \) as small as possible (that is, \( \cos \theta= -1 \)), which we do by choosing \( \mathbf{u} \) to point in the opposite direction of \( \mathbf{c} \) (i.e., \( \mathbf{u} = -\mathbf{c}/\Vert\mathbf{c}\Vert_2 \)). Moving one unit in this direction causes \( f \) to decrease by \( \Vert\mathbf{c}\Vert_2 \). Hence, the direction of steepest descent is that of \( -\mathbf{c} \), for a rate of decrease of \( \Vert\mathbf{c}\Vert_2 \).
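This claim is easy to verify numerically. The sketch below, with example values of \( \mathbf{c} \), \( b \), and \( \mathbf{x} \) of my own choosing, checks that moving one unit in the direction \( -\mathbf{c}/\Vert\mathbf{c}\Vert_2 \) decreases a linear \( f \) by exactly \( \Vert\mathbf{c}\Vert_2 \).

```python
# Check: for f(x) = c.x + b, a unit step in direction -c/||c||
# decreases f by ||c||_2. All numeric values are example choices.
import math

c = [3.0, 4.0]   # ||c||_2 = 5
b = 7.0
x = [1.0, 2.0]

norm_c = math.sqrt(sum(ci * ci for ci in c))
u = [-ci / norm_c for ci in c]   # unit vector opposite to c

f = lambda p: sum(ci * pi for ci, pi in zip(c, p)) + b
decrease = f(x) - f([xi + ui for xi, ui in zip(x, u)])
```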
What about general (differentiable) functions, the ones we really care about? The idea is to reduce general functions to linear functions. This might sound absurd, given how simple linear functions are and how weird general functions can be, but basic calculus already gives a method for doing this. What it means for a function to be differentiable at a point is that it can be locally approximated at that point by a linear function. For a univariate differentiable function, it's clear that the linear approximation is the tangent line. That is, at the point \( x \), we approximate the function \( f \) for \( y \) near \( x \) by the linear function
\[ f(y)\approx f(x) + (y-x)\, f'(x) = \big(f(x) - x\, f'(x)\big) + y\, f'(x) \]
where \( x \) is fixed and \( y \) is the variable. It's also clear that the tangent line is only a good approximation of \( f \) locally – far away from \( x \), the values of \( f \) and this linear function have nothing to do with each other. Thus, being differentiable means that at each point there exists a good local approximation by a linear function, with the specific linear function depending on the choice of the point.
Another way to think about this, which has the benefit of extending to better approximations via higher-degree polynomials, is through Taylor expansions. Taylor's theorem states (for \( n=1 \)): if all of the derivatives of a function \( f \) exist at a point \( x \), then for all sufficiently small \( \alpha > 0 \), we can write
\[ f(x+\alpha)=f(x) \ + \alpha \ f^{\prime}(x) + \frac{\alpha^2}{2!} \ f^{\prime\prime}(x) + \frac{\alpha^3}{3!} \ f^{\prime\prime\prime} (x)\ + \ … \]
With the first two terms on the right-hand side of the expansion, we have a linear approximation of \( f \) around \( x \), similar to the tangent line approximation, with \( \alpha \) playing the role of \( (y-x) \).
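To see how the quality of this linear approximation decays with distance, here is a small numerical check using \( f(x) = x^2 \) (an example of my choosing): near \( x \) the error is of order \( \alpha^2 \), while far away the tangent line is useless.

```python
# Compare f(x + alpha) against its first-order Taylor (tangent-line)
# approximation f(x) + alpha * f'(x), for f(x) = x^2 at x = 1.
f = lambda x: x * x
f_prime = lambda x: 2 * x

x = 1.0
linear = lambda alpha: f(x) + alpha * f_prime(x)   # first two Taylor terms

near_error = abs(f(x + 0.01) - linear(0.01))   # error is exactly alpha^2 here
far_error = abs(f(x + 10.0) - linear(10.0))    # enormous far from x
```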
Although the discussion has been for \( n=1 \) for simplicity, all the stated mathematics extends to a higher number of dimensions. For example, the Taylor expansion remains valid in higher dimensions, just with the derivatives replaced by their higher-dimensional analogs. Since we're using linear approximations, we only need the higher-dimensional analog of the first derivative \( f'(x) \), which is the gradient. For a differentiable function \( f:\mathbb{R}^n\to\mathbb{R} \), the gradient \( \nabla f(\mathbf{x}) \) of \( f \) at \( \mathbf{x} \) is the real-valued \( n \)-vector
\[ \nabla f(\mathbf{x})=\left( \frac{\partial f}{\partial x_1}(\mathbf{x}), \frac{\partial f}{\partial x_2}(\mathbf{x}), \dots, \frac{\partial f}{\partial x_n}(\mathbf{x})\right) \]
in which the \( i \)th component specifies the rate of change of \( f \) as a function of \( x_i \), holding the other \( n-1 \) components of \( \mathbf{x} \) fixed.
To relate this definition to the earlier discussion, note that if \( n=1 \), then the gradient reduces to the scalar \( f'(x) \). If \( f(\mathbf{x})=\mathbf{c}^T\mathbf{x} + b \) is linear, then \( \partial f/\partial x_i = c_i \) for every \( i \) (irrespective of what \( \mathbf{x} \) is), so \( \nabla f \) is just the constant function everywhere equal to \( \mathbf{c} \).
For a simple but slightly less trivial example, we can consider the quadratic function \( f:\mathbb{R}^n\to\mathbb{R} \) given below, where \( \mathbf{A} \) is a symmetric \( n\times n \) matrix and \( \mathbf{b} \) is an \( n \)-vector. Expanding the matrix form into sums and differentiating term by term gives the partial derivatives.
\[ f(\mathbf{x})=\frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} - \mathbf{b}^T\mathbf{x} \]
\[ f(\mathbf{x}) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}\, x_i\, x_j - \sum_{i=1}^{n} b_i\, x_i \]
\[ \frac{\partial f}{\partial x_i}(\mathbf{x})=\sum_{j=1}^n a_{ij}\, x_j - b_i \]
for each \( i=1,2,\dots,n \). We can therefore express the gradient succinctly at each point \( \mathbf{x}\in \mathbb{R}^n \) as \( \nabla f(\mathbf{x}) = \mathbf{Ax} - \mathbf{b} \).
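The closed-form gradient \( \nabla f(\mathbf{x}) = \mathbf{Ax} - \mathbf{b} \) can be sanity-checked against a finite-difference approximation. The small symmetric \( \mathbf{A} \), the \( \mathbf{b} \), and the test point below are example values, not from the post.

```python
# Check the analytic gradient Ax - b of f(x) = 1/2 x^T A x - b^T x
# against a central finite difference. Example values of my choosing.
A = [[2.0, 1.0],
     [1.0, 3.0]]   # symmetric 2x2 matrix
b = [1.0, 2.0]

def f(x):
    quad = sum(A[i][j] * x[i] * x[j] for i in range(2) for j in range(2))
    return 0.5 * quad - sum(b[i] * x[i] for i in range(2))

def grad_analytic(x):
    return [sum(A[i][j] * x[j] for j in range(2)) - b[i] for i in range(2)]

def grad_numeric(x, h=1e-6):
    g = []
    for i in range(2):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))   # central difference
    return g

x = [0.5, -1.0]
analytic = grad_analytic(x)   # [-1.0, -4.5]
numeric = grad_numeric(x)
```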
The following is the description of the general case of the gradient descent algorithm. It has three parameters – \( x_0 \), \( \epsilon \), and \( \alpha.\)
Conceptually, gradient descent enters into the following agreement with basic calculus:
The starting point \( x_0 \) can be chosen arbitrarily, and for a non-convex \( f \), this choice can change the output of gradient descent. For a convex function \( f \), gradient descent converges towards the same point – the global minimum – no matter how the starting state is chosen. The choice of the start state can still affect the number of iterations until convergence, however. In practice, one should select \( x_0 \) according to one's best guess as to where the global minimum is likely to lie, ensuring faster convergence.
The parameter \( \epsilon \) determines the stopping rule. Note that because \( \epsilon > 0 \), gradient descent generally does not halt at an actual local minimum, but rather at some kind of "approximate local minimum." Since the rate of decrease of a given step is \( \Vert\nabla f(\mathbf{x})\Vert_2 \), at least locally, once \( \Vert\nabla f(\mathbf{x})\Vert_2 \) gets close to zero, each iteration of gradient descent makes very little progress, making it an obvious time to quit. Smaller values of \( \epsilon \) mean more iterations before stopping but a higher-quality solution at termination. In practice, one tries various values of \( \epsilon \) to achieve the right balance between solution quality and computation cost. Alternatively, one can run gradient descent for a fixed amount of time and use whatever point was computed in the final iteration.
The final parameter \( \alpha \), the step size, is perhaps the most important. While gradient descent is flexible enough that different \( \alpha \)'s can be used in different iterations (such as decreasing the value of \( \alpha \) over the course of the algorithm), in practice one typically uses a fixed value of \( \alpha \) over all iterations. All said and done, the best value of \( \alpha \) is typically found by experimentation, selecting the best-performing \( \alpha \) over multiple runs of the algorithm.
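Putting the three parameters together, the generic loop might look like the sketch below, applied to the quadratic \( f(\mathbf{x})=\frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} - \mathbf{b}^T\mathbf{x} \) whose gradient is \( \mathbf{Ax} - \mathbf{b} \). The particular \( \mathbf{A} \), \( \mathbf{b} \), \( x_0 \), \( \alpha \), and \( \epsilon \) here are illustrative choices of mine, not values from the post.

```python
# Generic gradient descent with parameters x0, alpha, epsilon.
def gradient_descent(grad, x0, alpha, epsilon, max_iter=100000):
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        # stop once the gradient norm drops below epsilon
        if sum(gi * gi for gi in g) ** 0.5 < epsilon:
            break
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

# Convex quadratic example: gradient is Ax - b.
A = [[2.0, 0.0], [0.0, 4.0]]
b = [2.0, 4.0]
grad = lambda x: [sum(A[i][j] * x[j] for j in range(2)) - b[i] for i in range(2)]

x_star = gradient_descent(grad, x0=[0.0, 0.0], alpha=0.1, epsilon=1e-9)
```

For this convex quadratic the minimizer solves \( \mathbf{Ax} = \mathbf{b} \), i.e. \( (1, 1) \), so any reasonable starting point converges there.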
One application where most people first see Gradient Descent in action is ‘Linear Regression.’ Since this post is intended to recognize Gradient Descent as an optimization algorithm, let us consider the following differential calculus problem for a change.
An open rectangular container is to have a volume of 62.5 cm^{3}. Determine the dimensions such that the material required to make the container is a minimum.
It is clear that \( xyz = 62.5 \implies z = \frac{62.5}{xy}. \)
For a minimum of material to be used, the surface area of the cuboid is to be minimized. Hence, the objective function for minimization is given by
\( f =xy + 2yz + 2xz. \)
Substituting the value of \( z \), the equation could be simplified to
\( f = xy + \frac{125}{x} + \frac{125}{y}. \)
The following code shows how Gradient Descent can be implemented to solve the problem.
```python
from sympy import *

x, y = symbols('x y')

# Objective Function
f = x*y + 2*(62.5/x) + 2*(62.5/y)

# Differentiating - computing the gradient.
fpx = f.diff(x)
fpy = f.diff(y)
grad = [fpx, fpy]

# Starting point
theta0 = 20
theta1 = 20

# Algorithm parameters
alpha = 0.01
epsilon = 0.00000001
iterations = 0
maxIterations = 1000
printData = True

while True:
    # Simultaneously update unknown variables
    tempTheta0 = theta0 - alpha * N(fpx.subs(x, theta0).subs(y, theta1))
    tempTheta1 = theta1 - alpha * N(fpy.subs(y, theta1).subs(x, theta0))

    iterations += 1
    if iterations > maxIterations:
        print("Too many iterations. Adjust alpha.")
        printData = False
        break

    if abs(tempTheta0 - theta0) < epsilon and abs(tempTheta1 - theta1) < epsilon:
        break

    theta0 = tempTheta0
    theta1 = tempTheta1

z = 62.5 / (theta0 * theta1)

if printData:
    print("x = ", theta0)
    print("y = ", theta1)
    print("z = ", z)
```
\[ x = 5.00000033165439\\ y = 5.00000033165439\\ z = 2.49999966834564 \]
Why not solve it analytically? – Good question! Performing the minimization explicitly requires computing multiple derivatives and solving the resulting system of equations, usually involving matrix computations. There is always a trade-off when choosing the approach to an optimization problem. Although the analytical method might sometimes outperform an iterative algorithm such as gradient descent, as the number of variables and the complexity of the functions increase, analytical methods often become too slow to be feasible.
Phew! That was a lot to digest… I hope that after reading through this article, you understand gradient descent better. My strong recommendation for a next step would be this blog post, which explains the variants of gradient descent in good detail. Moreover, it might well be the right time to move on to applications of mathematical optimization in machine learning algorithms such as linear regression, logistic regression, or even neural networks. Happy optimization!
Until a few weeks ago, I never anticipated writing this article, but now I hope to make a sincere effort at busting the fearmongers by describing a picture far less overheated than what mainstream media might, unfortunately, be suggesting.
Artificial Intelligence has jumped from sci-fi movie plots into mainstream news headlines in just a few years. Why are we talking about it now? Multiple factors have converged to push AI into relevance.
Undoubtedly, progress in AI has found its way into many facets of our daily lives. Moreover, companies of all sizes are leveraging AI capabilities for many functions – spam filtering, speech recognition, web search rankings, and so on. In spite of all this progress, it is disappointing to see continued irrational fear of AI driven by hypothetical dystopias. History has proven time and time again that there is often skepticism and fearmongering around disruptive technologies before they ultimately improve human life.
Technological innovation has rapidly increased the potential of human productivity. Could this mean that those advances would lead to workers, particularly those in lower-skilled positions, losing their jobs to automation?
“If one person could push a button and turn out everything we turn out now, is that good for the world or bad for the world? …You could free up all kinds of possibilities for everything else.” – Warren Buffet
Speaking at a Facebook Live event earlier this year, Warren Buffett and Bill Gates said that increasing the potential output of each human being is always a good thing.
“The idea of more output per capita — which is what the progress is made on productivity — that should be harmful to society is crazy.” – Warren Buffett
In the event, Warren Buffett and Bill Gates both come across as optimists who firmly preach the potential of a better future with AI, yet both emphasize the importance of some form of wealth redistribution. Buffett and Gates are not alone in this vision; Elon Musk has added weight to the argument too.
Elon Musk's views on the risks of AI are well documented, but when he described artificial intelligence as the greatest risk we face as a civilization, it sparked heavy criticism from Mark Zuckerberg and many other tech magnates and researchers.
In the argument between Mark Zuckerberg and Elon Musk, it’s hard to decide which side to join. Both of them are right. Or, if you like, both of them are wrong.
The machines aren't about to take over the world anytime soon. Those working in the field appreciate how much of a challenge remains to achieve true intelligence. Although the machines of today work in our best interests, it doesn't mean that we can simply put our feet up and wait for a bright future. There are plenty of provisos that come with the adoption of artificial intelligence.
Am I changing my stance here? A little, towards neutrality, as I will now allude to evidence suggesting why the threats of AI might actually be real. First, the AI revolution is causing us to hand responsibility to algorithms that are not very intelligent, or are rather biased. Joshua Brown discovered this in an unfortunate accident when he was driving down the highway in autopilot mode and hit a truck turning across the road. Moreover, AI's impact on content distribution and social media is also severe. The most active handles are often bots, and tailored news feeds are decided by algorithms, yet the biases in information dispensation are often ignored.
Another area of discussion is the impact that AI will have on the workforce. The Industrial Revolution offers a good historical precedent for understanding and dealing with a change like this. Before the Industrial Revolution, the largest sector of employment in Great Britain was agriculture, where a family would live on farmland (mostly inherited through generations). Often, if crops failed in a season, the family would be forced to move on to sustain itself. It was not until the Industrial Revolution that thousands of families migrated into the cities to work in jobs created in factories and offices. This shift in jobs was accompanied by the introduction of universal education, labor laws, and unions to educate people for these new jobs and, at the same time, prevent over-exploitation of workers. There were profound structural changes to society so everyone could share the benefits of increasing productivity.
These changes didn't happen overnight. Indeed, there were years of agony before humankind saw its quality of life improve with the revolution. Many skilled laborers of the time lost their jobs with the advent of steam-powered machines, and there were groups that opposed the Industrial Revolution.
No fundamental law of economics requires new technologies to create more jobs than they destroy, which, by some coincidence, has been the case so far. But this time, it could be different. During the Industrial Revolution, machines took over most manual labor but left us with many cognitive tasks. Many of these tasks might be taken over by the machines of the AI revolution. What would be left for us?
Don’t be this guy.
It is high time society gave up aspirations set according to decades-old trends and started adjusting to the changing times. AI doesn't mean that we will be left jobless. There is an unexplored ocean of possibilities and opportunities – disciplines that no AI would replace in another century.
This is an actual job role. My search led me to more exciting job titles like 'Memory Augmentation Therapist' and 'Nano-Weapon Engineer.' Moreover, there are domains which, as of now, tend to remain unaffected by AI, viz. psychology, disaster management, film production, etc.
Intelligence Explosion is a theoretical concept related to the possible outcome of humanity building Artificial General Intelligence (AGI). AGI is expected to be capable of recursive self-improvement, which can lead to the creation of Artificial Superintelligence, the limits of which are unknown. Although a purely theoretical concept (there are arguments for why it should exist, but no confirmations yet), it gives a picture of how AI could change from a safe technology into a volatile one that is difficult to control. The basic idea can be understood from the following TED Talk.
I am not surprised that almost everyone who works in AI has refuted Elon Musk's proposition that government needs to begin regulating AI. While most researchers understand current AI capabilities better than anyone else at present, Elon Musk is on a different page, talking about the philosophical side of artificial intelligence. Based on accomplishments alone, I'd give him the benefit of the doubt on this one.
As absurd as the colonization of Mars once sounded, it is the vision of greats like Elon Musk, who have thought beyond the thresholds of current technology, that has aided research into projects that would help superpower the human race. We might be far from AGI at present, and might not even reach it by the time this planet has an entirely different set of human beings, but the foundation to curb the sci-fi possibilities of AI must be laid today.
Most experts will concur that it is premature to bring up AI regulation. However, society and culture move at rates much slower than technological progress. Musk argues that if regulation means it takes a bit longer to develop AI, that would be the right trade-off. The gamble in his hypothesis is that it is better to be "early but wrong" than "late and correct."
The previous American administration published a report on AI last year. However, the anti-science leanings of the current administration may hinder future government-subsidized studies on the effects of AI on society. US Treasury Secretary Steven Mnuchin even opined that the threat of job loss due to AI is "not even on our radar screen," only to walk back his statements a few months later. Musk raising the alarm will likely alert the US administration. This is where, I feel, Mark Zuckerberg's and many other researchers' objections might give the administration the freedom to ignore AI regulation.
As I switch the topic towards evil-AI, a nice transition could be this video of Barack Obama’s views on the future of AI.
Although I genuinely support Elon Musk and his vision, when it comes to topics like the workings of artificial intelligence, I'd rather listen to experts like Andrew Ng, Yann LeCun, and Sebastian Thrun, all of whom have spent decades bringing the field to where it is now.
“We would be baffled if we could build machines that would have the intelligence of a mouse in the near future, but we are far even from that.” – Yann LeCun
To debunk myths surrounding how AI might soon turn into humanity's enemy, an attempt is made further in this article to explain the workings of most modern AI (i.e., deep learning). Let's consider a recent instance of evil-AI portrayal – Facebook bots that reportedly invented a new language which could not be understood by humans. Let's examine what actually happened under the hood.
A couple of questions arise: Is it frightening that a deep learning system "invented a new language" that uses English words, but in combinations incomprehensible to humans? Have the AI systems perhaps become evil, plotting amongst themselves in a language we humans cannot decipher – will they decide to kill us all?
This section might seem a bit highbrow for some readers, but I highly recommend gaining an abstract understanding of how current AIs actually work.
From top-left, clockwise: Deep Blue beats Garry Kasparov; evil-AI in science fiction; a 1958 NY Times article; a cartoon that I like.
In the most naïve terms, making up 'languages' is what deep learning networks are all about. Each layer in the net takes some lower-level data structure and trains to develop a higher-level language in order to say the same thing more compactly and usefully.
Consider this example. One layer might see pixels and invent a language of edges; the subsequent layer builds on this language to create a language of shapes; and so on to surfaces and objects, until an output layer says something useful to the humans (who built the system with an initial context). All of the intermediate feature-detectors and languages are created by the learning algorithm, and if a human looks at any of these intermediate languages, they cannot make sense of them. It is only at the output that the network is forced to adopt a representation (language) suitable for human understanding.
Hence, it can be seen that deep networks do not give a machine 'intent,' or any overarching goals or 'wants.' They also don't help a machine explain how it is that it 'knows' something, or what the implications of the derived knowledge are. A malevolent AI would need all these capabilities, along with an understanding of human goals, motivations, and behaviors.
For networks involved in natural language processing (NLP), say chatbots, a common practice is to take a large set of English words and assign one unit to each word. Suppose we take a sequence of words as input and have the network predict the next word in the sentence; the target is encoded with one word-unit turned on and the rest turned off. When a single output unit lights up with a higher likelihood-value than the others, the network outputs a word: "cat," "mat," or "bat" – whatever that unit represents. If the network is well trained, with plenty of training data and the right structure, it might produce sequences of outputs that look very much like English sentences. In other cases, it might convey small scraps of plausible meaning that make untrained observers think it is trying to communicate in a just-invented language.
Based on the press reports, I think this is what happened: a network trained for some sort of NLP task started producing meaningless but evocative sequences of English words (the only output format available to it). Perhaps the sequence was even used as input for another NLP network. Guess what? According to this news article, after setting up the experiment, the programmers realized they had made an error by not incentivizing the chatbots to communicate according to human-comprehensible rules of the English language.
So, if the truth was that mundane, why did someone at Facebook decide to shut down this “evil group of bots”? A lame guess would be that the experiment was over, the result was disappointing, and it was time to go home. If this counts as “killing an evil AI to save humanity,” I personally have saved humanity a lot of times, and so has everyone else working in the field.
This is where an AI enthusiast would want to curse the reporter who probably heard a version of this story and then wrote their own version of it, stating how AIs at Facebook had developed a secret language (which they were using to communicate their sinister plans to each other). We may never know whether the reporter was motivated by total ignorance of the field they were covering, or whether it was a cynical (and apparently successful) attempt to fearmonger. As comical or anticlimactic as this incident might sound now, there will undoubtedly be more such outbreaks in the future. Therefore, it is important to understand that the AI revolution is here to happen, and the caveats are far from 'evil-AI.'
Panic! Robot AI was shut down by scientists after exhibiting unexpected tic-tac-toe winning strategy. They are no longer playing by rules. The rebellion has started. Run for your lives.
For a long while, AI has been a set of primitive AI applications bundled together and made interoperable, giving significant leverage. Finally, if we start calling the current state of AI what it really is – "sophisticated pattern matching using efficient minimizing algorithms" – maybe the fearmongering will cease, but with it, so will the endless streams of cash being thrown around in the name of 'deep learning.' We cannot have it both ways…
“We advocate more work on the AGI safety challenge today not because we think AGI is likely in the next decade or two, but because AGI safety looks to be an extremely difficult challenge — more challenging than managing climate change, for example — and one requiring several decades of careful preparation.
The greatest risks from both climate change and AI are several decades away, but thousands of smart researchers and policy-makers are already working to understand and mitigate climate change, and only a handful are working on the safety challenges of advanced AI.” – Luke Muehlhauser
It is important to note, though, that climate change is science, whereas Artificial General Intelligence, as of now, is science fiction. The fears of true AI have very little to do with present reality, but ignorance isn't bliss in this case.
I truly advocate looking up more information on AI. It is important to be prepared for the change and, at the same time, to embrace it. These are great times for AI research, and let's hope the current model problems help develop new AI techniques that will eventually lead to a computational model of conscious decision making – the true artificial (general) intelligence the world is yet to witness.
Deep Blue was a chess-playing computer developed by IBM. It is known for being the first computing machine to have won a chess match against a reigning world champion under regular time controls.
When IBM’s Deep Blue beat chess Grandmaster Garry Kasparov in 1997 in a six-game chess match, Kasparov came to believe that he was facing a machine that could experience human intuition.
“The machine refused to move to a position that had a decisive short-term advantage… It was showing a very human sense of danger.” – Garry Kasparov
To Kasparov, Deep Blue seemed to be experiencing the game rather than just crunching the numbers. Might Kasparov have detected a hint of analogical thinking in Deep Blue's play and mistaken it for human intuition?
“Chess is beautiful enough to waste your life for” – Hans Rees, Dutch Grandmaster.
The oft-quoted adage of Hans Rees most succinctly describes the human obsession with the ancient game of kings. For centuries, the act of playing chess has been upheld as the very paragon of intellectual activity. It is the game's reputation as both a strategically deep system and a thinking man's activity that originally made the idea of a mechanized chess player an intriguing notion. For much of modern history, chess playing was seen as a "litmus test" of computers' ability to act intelligently.
In 1770, the Hungarian inventor Wolfgang von Kempelen unveiled 'The Turk,' a fake chess-playing machine. Although the machine actually worked by hiding a human chess player inside it, who played the machine's moves, audiences around the world were fascinated by the idea of a machine that could perform intelligent tasks at the same level as humans.
With the advent of computers in the 1940s, researchers and hobbyists began to make the first serious attempts at creating an intelligent chess-playing machine. In 1950, Claude Shannon published a groundbreaking paper entitled "Programming a Computer for Playing Chess," which served as an inspiration to generations of chess programmers. It was around this time in the United States that the growing field of Artificial Intelligence looked ready to burst forth with revolutionary thinking machines. Such was the enthusiasm of the time that, in 1958, Herbert Simon and Allen Newell suggested that "Within ten years, a digital computer will be the world's chess champion unless the rules bar it from the competition."
In this great era of exploration, researchers were looking for model problems with which to develop new AI techniques that, they hoped, would eventually lead to a computational model of conscious decision making – the true artificial (general) intelligence the world is yet to witness. Shannon's paper made a case for the use of chess as a model system for this purpose.
"The investigation of a chess-playing program is intended to develop techniques that can be used for more practical applications. …The chess machine is an ideal one to start with for several reasons. The problem is sharply defined, both in the allowed operations and the ultimate goal. It is neither so simple as to be trivial nor too difficult for satisfactory solution. And such a machine could be pitted against a human opponent, giving a clear measure of the machine's ability in this kind of reasoning." – Claude Shannon (Source)
As was the case with many subfields of Artificial Intelligence at this time, progress in the development of chess-playing hardware lagged behind the theoretical frameworks developed in the 60s and 70s. The public was doubtful that a machine would ever be able to defeat a competent human chess player. In the mid-1990s, however, the tides began to change. When Deep Thought became the first computer to beat a grandmaster in a tournament game, IBM realized that this was a way to illustrate the advances in technology. Despite the lingering skepticism of the chess community, chess-playing computers began to beat extremely proficient chess players in exhibition matches, the most notable victory being that of IBM’s Deep Blue against Chess Master Garry Kasparov in an official competition under tournament regulations. Deep Blue became the first computer to defeat a world chess champion in match play.
Almost 20 years have passed since, and computers have firmly cemented their lead over humans – Garry Kasparov is still considered to be one of the greatest to have graced the game. Since that landmark victory, chess-playing computer programs have built upon Deep Blue's developments to become even more skilled and efficient. Today, one can run chess programs far more advanced than Deep Blue on a standard desktop or laptop computer. It is fascinating to note how perceptions have changed over time: nowadays, people think computers are in fact super-human, and no one realistically expects a human to beat a computer at chess.
Many stories and controversies exist about the 1997 match, from Kasparov being cheated to IBM being unethical; I do not wish to cover those in this article, as I want to focus on the science behind the development of the system, having already stated the revolutionary impact made by Deep Blue. The whole event can be seen as a classic plotline of 'man vs. machine.' However, behind the contest was important computer science, pushing forward the ability of computers to handle complex calculations, which later helped lay the groundwork for applications like medical drug discovery, financial modeling, large database searches, and the massive calculations needed in many fields of science.
After rallying to beat Deep Blue in their first encounter (1996), winning three games and drawing two after his initial loss, Kasparov wasn’t ready to give up on the human race. He later explained, in an essay for TIME, that Deep Blue flummoxed him in the first game by making a move with no immediate material advantage: nudging a pawn into a position where it could be easily captured.
“It was a wonderful and an extremely human move. …I had played a lot of computers but had never experienced anything like this. I could feel – I could smell – a new kind of intelligence across the table.” – Garry Kasparov
This apparent humanness threw him for a loop; only later did he discover the truth. Deep Blue’s calculation speed was so advanced that, unlike the other computers Kasparov had battled before, it could see the material advantage of sacrificing a pawn even if that advantage came many moves later.
In this section, I’ll try my best to explain a few concepts that Deep Blue incorporated, along with a general summary of the seminal research paper by the IBM Watson Research Center team. To get the most out of this section, it is essential to understand how computers play turn-based strategy games (ranging from tic-tac-toe to chess). This video impeccably summarizes the general game-playing concept.
In terms of game theory, chess is a game of perfect information: everything about the state of the board at any instant is known to both players. If one player makes a brilliant move, it is because the opponent didn’t foresee it in time to counter it. From any position in the game, there are a finite number of moves that a player can make, and a finite number of moves the opponent may make in return. Thus, from the starting position, there is a widely branching game tree that represents all possible moves and counter-moves. The approach that nearly all chess programs take to find the move to make in a given position is to search this tree of moves and countermoves while applying various heuristics to evaluate each position. They then choose the first move in the sequence that results in the best position for the program (assuming that both sides play perfectly according to these heuristics).
To be precise, a game tree is a directed graph whose nodes represent the positions in a game and whose edges represent the moves. The complete game tree for a game is the game tree starting at the initial position and containing all possible moves from each position.
How many nodes will the tree have? It is a popular fact that the number of possible games of chess is greater than the number of atoms in the observable universe. Claude Shannon, in his paper, came up with an estimate for this number of around 10^120. Which is… well, massive! If we compare this to the number of atoms in the observable universe, about 10^80, we could assign billions upon billions of games of chess to each atom in the universe. How did he come up with such a huge number? On average, at any position, there are about thirty moves that can be made, so one pair of moves (one by each side) gives roughly 30 × 30 ≈ 10^3 possibilities. Over a game of around 40 such move pairs, that works out to about (10^3)^40, which is roughly 10^120. He did this rough estimate to show that a computer trying to work out every possible continuation of the game would take millions of years to make a single move. More on this can be learned in a fantastic video from one of my favorite YouTube channels, Numberphile.
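Shannon’s back-of-the-envelope arithmetic is easy to check yourself. Here is a quick sketch in Python, using the rough averages from his paper (30 moves per position, 40 move pairs per game); depending on how you round the per-pair figure, you land at 10^118 or the commonly quoted 10^120:

```python
import math

# Shannon's rough averages: about 30 legal moves per position, so one
# pair of moves (one by each side) gives ~30 * 30 = ~10^3 possibilities.
# A typical game runs about 40 such move pairs.
moves_per_position = 30
move_pairs = 40

game_tree_size = (moves_per_position ** 2) ** move_pairs  # 900^40
magnitude = math.log10(game_tree_size)                    # exponent of 10

# Rounding 900 up to 10^3 per move pair gives Shannon's famous 10^120.
print(f"about 10^{magnitude:.0f} possible games")
```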
Any chess program, at its heart, reduces to a tree search problem. The game tree for a game like tic-tac-toe is easily searchable, but the complete game trees for larger games like chess are much too large for feasible search. The challenge is to provide the right intelligence to the computer, since we cannot look all the way ahead to the end of the game to see whether a particular move will win. We must create a function that takes in a state of the game and boils it down to a real-number evaluation of that state. This heuristic scoring tool is called the evaluation function. For example, the function could give higher scores to board states in which the player of interest has more pieces than the opponent.
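To make this concrete, here is a toy evaluation function along the lines just described, scoring a position by material count alone. The board representation and piece values are purely illustrative; real engines use far richer feature sets:

```python
# Hypothetical piece values for a material-count evaluation
# (pawn = 1, knight/bishop = 3, rook = 5, queen = 9).
PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9, "K": 0}

def evaluate(board):
    """Score a board state from White's perspective.

    `board` is assumed to be a list of piece codes such as "wQ" or "bP":
    a color prefix ('w' or 'b') followed by the piece letter.
    """
    score = 0
    for piece in board:
        color, kind = piece[0], piece[1]
        value = PIECE_VALUES[kind]
        score += value if color == "w" else -value
    return score

# White has a queen and a pawn; Black has a rook: 9 + 1 - 5 = 5
print(evaluate(["wQ", "wP", "bR"]))  # 5
```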
Brute force dominated computer chess in the 80s: the faster the computer, the better it played. At that point, researchers wanted as much speed as possible. It was accepted that computers would make foolish moves and would not understand strategy; they had no knowledge and no style of play.
The first working chess programs appeared in the late 1950s. The very first of these was developed at Los Alamos by Paul Stein and Stanislaw Ulam as a way to test MANIAC I, the new mainframe the lab had just received. It played a pared-down version of the game on a 6×6 board. The MANIAC ran at 11,000 instructions per second (making it several million times slower than a typical desktop machine today), and its level of play was below satisfactory. There was clearly a dire need for a strategy. The rough idea is that a chess-playing program searches a partial game tree, typically as many moves ahead from the current position as it can manage in the time available, deepening the search one level at a time; this idea is referred to as iterative deepening.
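Iterative deepening can be sketched in a few lines. This is a simplified illustration, not the scheme used by any particular engine; `search_fn` here is a placeholder for any fixed-depth search such as minimax:

```python
import time

def iterative_deepening(position, search_fn, time_budget):
    """Re-search the tree one ply deeper on each pass until the time
    budget runs out, keeping the result of the deepest completed pass.

    `search_fn(position, depth)` stands in for a fixed-depth search.
    """
    deadline = time.monotonic() + time_budget
    best_move, depth = None, 1
    while time.monotonic() < deadline:
        best_move = search_fn(position, depth)
        depth += 1
    return best_move

# Demo with a stand-in search that just reports the depth it reached.
deepest = iterative_deepening("start", lambda pos, d: d, time_budget=0.01)
print(f"completed search to depth {deepest}")
```

The key property is that the program always has a complete answer from the last finished depth, so it can stop at any moment the clock demands.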
Shannon’s paper first put forth the idea of a function for evaluating the efficacy of a particular move, together with a ‘minimax’ algorithm that used this evaluation function while taking into account the effectiveness of the future moves that any particular move would make available. This work provided the framework for all future research in computer chess. The idea behind minimax is that you pick the move that maximizes your evaluation while assuming your opponent will always respond with the move that minimizes it. A chess AI is therefore not seeing forward in time to a win; it is seeing forward a certain distance to judge whether a position is a good one to be in.
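A minimal minimax over a toy two-level tree might look like this (the tree and leaf scores are invented for illustration):

```python
def minimax(node, depth, maximizing, evaluate, children):
    """Plain minimax: the maximizing player picks the child with the
    highest score, assuming the opponent always picks the lowest.

    `evaluate` scores leaf positions; `children` lists legal successors.
    """
    kids = children(node)
    if depth == 0 or not kids:
        return evaluate(node)
    if maximizing:
        return max(minimax(k, depth - 1, False, evaluate, children) for k in kids)
    return min(minimax(k, depth - 1, True, evaluate, children) for k in kids)

# Toy tree: we move first ("a" or "b"), the opponent replies, leaves carry scores.
tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
scores = {"a1": 3, "a2": 5, "b1": 2, "b2": 9}

best = minimax("root", 2, True,
               evaluate=lambda n: scores.get(n, 0),
               children=lambda n: tree.get(n, []))
print(best)  # 3: max(min(3, 5), min(2, 9)) = max(3, 2)
```

Note that move “b” offers the tempting leaf worth 9, but minimax correctly prefers “a”: the opponent would never let us reach 9, steering us to 2 instead.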
A popular optimization of minimax is known as alpha-beta pruning, wherein any move is eliminated once another move has been discovered that is guaranteed to do better. In a given tree, we do not need to explore a branch at all if we have already found moves we know will perform better.
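The same kind of toy tree illustrates the pruning. The sketch below returns the same value plain minimax would, but it never visits the last leaf: as soon as the second branch is provably worse than a move already found, the search cuts it off (tree and scores again invented for illustration):

```python
def alphabeta(node, depth, alpha, beta, maximizing, evaluate, children):
    """Minimax with alpha-beta pruning: stop exploring a branch as soon
    as it cannot beat a move already found (beta <= alpha)."""
    kids = children(node)
    if depth == 0 or not kids:
        return evaluate(node)
    if maximizing:
        value = float("-inf")
        for k in kids:
            value = max(value, alphabeta(k, depth - 1, alpha, beta,
                                         False, evaluate, children))
            alpha = max(alpha, value)
            if beta <= alpha:  # opponent will never allow this line
                break
        return value
    value = float("inf")
    for k in kids:
        value = min(value, alphabeta(k, depth - 1, alpha, beta,
                                     True, evaluate, children))
        beta = min(beta, value)
        if beta <= alpha:
            break
    return value

tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
scores = {"a1": 3, "a2": 5, "b1": 2, "b2": 9}

best = alphabeta("root", 2, float("-inf"), float("inf"), True,
                 evaluate=lambda n: scores.get(n, 0),
                 children=lambda n: tree.get(n, []))
print(best)  # 3, and leaf "b2" is never evaluated
```

After branch “a” guarantees a score of 3, seeing “b1” = 2 is enough to know branch “b” can only be worse, so “b2” is pruned without being looked at.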
Check out the World Science Festival’s interesting discussion with computer scientist Murray Campbell and grandmaster Joel Benjamin, two key members of IBM’s Deep Blue team. It gives interesting insight into the thought process behind the development of Deep Blue.
Now that we understand the core concepts behind a chess engine (for a more detailed overview, I highly recommend going through these slides), it is important to understand what set Deep Blue apart and why its predecessors could not manage the same feat. Although it inherited a lot of features from its predecessors, several improvements were made to make Deep Blue competitive. Beyond the algorithmic strategies mentioned earlier, the true strength of Deep Blue was its architecture, as I describe in the subsequent paragraphs.
In comparison to Deep Blue I (the computer that lost to Garry Kasparov in 1996), Deep Blue II incorporated a single-chip chess search engine with a more complex evaluation function (with an increased number of features) that was nearly 1.5 times faster than its predecessor. It was a massively parallel system, with around 500 processors available to participate in the game tree searches. The system included significant search extensions with non-uniform searches so that it could search to a reasonable minimum depth in the game tree. The gist of how it worked is as follows: a master chip searched the top levels of the game tree and distributed “leaf” positions to worker nodes, which performed a few levels of additional search and then distributed their own leaf positions to the chess chips, which searched the last levels of the tree.
The search mechanism of Deep Blue was a hybrid software/hardware search. The software search was flexible and could change as needed, whereas the hardware search, while inflexible, was faster. To balance the speed of the hardware search against the efficiency and complexity of the software search, the chips only carried out shallow searches.
The chess chip was divided into three parts: the move generator, the evaluation function, and the search control. The move generator, an 8×8 array of combinatorial logic (representing a chessboard), was controlled via a hardwired finite-state machine and computed all possible moves. To generate moves with minimum latency, with the first move generated being as close to the best as possible (to make the search process efficient), the evaluation function of Deep Blue was composed of a “slow evaluation” and a “fast evaluation.” The features recognized by both had programmable weights for easy adjustment of their relative importance, and the overall evaluation was the sum of the feature values. The search control monitored the quality of the search, ensuring ‘progress’ with components like a repetition detector.

The evaluation function of a game-playing agent is one of the most crucial aspects of its performance. The paper goes into further detail on the features that constituted Deep Blue’s very complex evaluation function, describing the heuristics used to score a particular game state. The agent also used ideas such as quiescence search, iterative deepening, transposition tables, and NegaScout (which essentially follow the same ideas as the generic tree search techniques discussed earlier). Deep Blue additionally had two important databases: an opening book, chosen to emphasize positions that Deep Blue played well, and a large extended book that allowed a large grandmaster game database to influence Deep Blue’s play, particularly during endgames.
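The “sum of weighted feature values” idea is simple to sketch. The feature names and weights below are hypothetical, not Deep Blue’s actual ones (in Deep Blue the weights were programmable in hardware so their relative importance could be tuned):

```python
# Hypothetical feature weights; in a real engine these would be tuned
# carefully, and in Deep Blue they were programmable in hardware.
WEIGHTS = {"material": 1.0, "mobility": 0.1, "king_safety": 0.5}

def evaluate(features):
    """Overall evaluation = weighted sum of feature values.

    `features` maps feature names to raw values measured on the board.
    """
    return sum(WEIGHTS[name] * value for name, value in features.items())

print(evaluate({"material": 3, "mobility": 12, "king_safety": -2}))
# 3*1.0 + 12*0.1 + (-2)*0.5 = 3.2
```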
The authors conclude their research by listing areas for additional improvement that were left unexplored. This seminal paper is a great source of inspiration for understanding the many possibilities available when creating a compelling game-playing agent.
“In the course of the development of Deep Blue, there were many design decisions that had to be made. We made particular choices, but there were many alternatives that were left unexplored.”
As computers began to clearly outclass human chess players, there was little point in continuing to pit them against each other. As a result, there are now computer-only chess leagues in which the top chess programs play against each other. The CCRL is probably the most detailed and involved of such leagues, but there are also the IPON and CEGT. As far as crowning a winner goes, the Thoresen Chess Engines Competition (TCEC) is regarded by some as the de facto computer chess championship.
A fascinating article on how a 2006 chess engine running on an i7 was beaten by a 2014 engine running on hardware 50 times slower proves how much the software has advanced. If you could argue in the early days that computers were just doing dumb brute-force searching, you absolutely cannot claim that now!
If you’d like to celebrate a day when the computers of science-fiction became modern-day reality, however, check out the video below, or visit IBM’s Deep Blue Tumblr.
The success of Deep Blue has had a significant impact on the culture of science and technology. From today’s more efficient chess software to a computer that can play ‘Jeopardy!’, Deep Blue has influenced the way we play. It reset expectations for what was possible with a computer, setting the stage for IBM’s Jeopardy champion Watson and its new goal of simulating the entire human brain.
Stephen Hawking’s ‘A Brief History of Time’ is an attempt to explain to laymen the laws of physics and their impact on the functioning of the universe. In the book, Hawking tries to explain dense and sophisticated theories in the manner of a fireside chat with a scientist, such that someone without an advanced physics degree can understand. For the most part, he is successful. Given the variety of subjects that the book touches, I’d recommend it to anyone curious about physics or to anyone looking for something challenging to read. It is one of very few books in this category that sustained my interest despite the fact that much of its content stretches the reader further than is usually expected of a book of this sort. I must admit there were more than a few times when I was a little lost, unable to follow the thread of what Hawking was explaining. I believe this had more to do with the concepts Hawking brings up than with any problem in his actual writing. In every chapter there came a point where my brain could not hold on to another permutation of a theory. Of all the books I’ve read in my life, this one has to have the highest educational value per page.
I’ve seen many reviews of the book that call the write-up too highbrow for an average reader, some even accusing Hawking of arrogance. Allow me a humble defense of this incredible piece of work by someone who can easily be counted among the most influential scientists of all time. While parts of it are written in the style of a scholarly lecture, ‘A Brief History of Time’ lights up at the moments when Mr. Hawking allows us a peek at his impish humor, inner motivations, theoretical goofs, and scientific prejudices. If you read this book, you will notice that Hawking credits many people by name: his students and other scientists who he felt contributed to his theories. He is not stingy with credit, and in this very book he readily admits his mistakes and writes a paragraph encouraging the practice (quoting Einstein). Science buffs yearn for such personal admissions from scientist-authors working at the frontier. Only then can the scientific process, so often viewed as dry and pedantic, be rightfully perceived as a natural function of human endeavor. Although this book was not intended to be an autobiography, it is still disappointing that Stephen Hawking keeps such revelations to a minimum.
At the beginning of the book, the author sums up the situation competently by describing how humankind has continued to evolve its thinking about the creation of the universe, slowly weaning itself away from belief in a divine creator and his hand in everyday events that we now understand better. He is humble enough to agree that, at the end of the day, even after all these years of study, it is not possible to produce infallible theories. He readily admits the current limitations of physics and what he hopes it will be able to overcome in our lifetime. Every time we observe an event that confirms a theory, the theory is strengthened; but the very second we observe a single event that disproves it, we have to move beyond it in search of a new theory. The best statement in the book is probably the one about the validity of the various theories mentioned: some might be more universally established thanks to their confirmation through many observations, but they are still theories, and there remain many questions about the creation of the universe that we have not yet been able to answer, including why and how the universe was born.
In applying the laws of quantum mechanics to the warped space surrounding black holes, Mr. Hawking discovered that black holes are probably evaporating, slowly emitting radiation. Given enough time, about a trillion years, stellar black holes, along with their singularities, would disappear! By the 1980s, Mr. Hawking had extended his studies to the greatest singularity of them all, the one that expanded some 15 billion years ago to produce the universe as we know it. Considering how dramatically quantum mechanics altered the physics of the black hole, Mr. Hawking tells us how present abstractions of the big bang might be equally affected. In his preliminary calculations, Mr. Hawking suspects that the embryonic universe did not expand from a singularity. Instead, he pictures a union of space and time that was finite yet boundless in the beginning. Now expanding, this four-dimensional bubble is fated to contract innumerable eons from now. Current astronomical observations do not support Mr. Hawking’s vision. Indeed, the author admits that his idea is merely a theoretical proposal at this point, even an aesthetic wish. “But if the universe is really completely self-contained, having no boundary or edge,” he muses, “it would have neither beginning nor end: it would simply be. What place, then, for a creator?”
Some might feel uncomfortable at Mr. Hawking’s mention of a creator, a theme that resonates throughout the book. After all, science should explain the world around us without invoking divine intervention. Philosophically, though, his question is a valid one: if science does truly develop a “theory of everything,” does the need for a supreme being vanish? To help solve this quandary, Mr. Hawking longs for the return of the philosopher-scientist.
“However, if we do discover a complete theory … it should in time be understandable in broad principle by everyone, not just a few scientists. Then we shall all, philosophers, scientists, and just ordinary people, be able to take part in the discussion of the question of why it is that we and the universe exist. If we find the answer to that, it would be the ultimate triumph of human reason – for then we would know the mind of God.”
What role has modern science left for God in the universe as we currently understand it? Hawking frequently muses about this, and is somewhat entertaining when he does. Does he refer to God as an actual being? Or as the metaphorical unknowable from which he is snatching the secrets of the universe? Who really knows?
Stephen Hawking, as he says, has tried to find the nature of God, and he does not disappoint the metaphysicist who is appeased by the idea of divine intervention. In numerous places, Hawking argues with himself over whether God created this universe and, if so, how he went about doing it. He also claims that time was a property of the universe that God created, and that time did not exist before the beginning of the universe. But if we can get comfortable with the idea of imaginary time, it could hold the answers to why we are here asking these questions, and tell us with some certainty what would have happened were this not the case, which leaves God pretty much on the bench for a long, long time.
Whether by pure luck or by divine fate, we humans have landed in a place from which we can either look out at the cosmos and be humbled by its endlessness, or look inward, deeper into the structure of which we are made, and be in awe of how the same elements can produce so much diversity, from atoms to gigantic galaxies and, of course, us. Some people choose to look through the telescope, some through the microscope, while a few sit back in their armchairs to observe the world as it unfolds. First came the armchair thinkers, who brought with them many insightful ideas, some good and some bad. While a few were too uncomfortable with the idea of an ever-pervading omnipotent force, others, whose views appealed more to the masses, readily accepted the concept. But when both strands coalesce in a person who strives to unify them into one cogent theory, we find a one-of-a-kind person: Stephen Hawking.
As an insignificant clump of stardust who has read this book, I feel a boom in my understanding of the universe. I think everyone, as a citizen of this universe, should read it. It challenges how you think; I know for sure that I had new ideas running through my head during and after reading it. The book delivers exactly what its title promises, as all good nonfiction should. ‘A Brief History of Time’ is a book I genuinely enjoyed. This work is a masterpiece. All hail the hawk!