This post is about a case study that made me fall in love with data analytics. Moneyball, a 2003 book by Michael Lewis that was adapted into a 2011 movie starring Brad Pitt, describes how sports analytics changed baseball through the story of the Oakland A’s, a team based near San Francisco, California. In this post, I will try to recreate the numeric figures mentioned in the book (as studied in MIT’s MOOC ‘The Analytics Edge’ on edX).

I wrote this book because I fell in love with a story. The story concerned a small group of undervalued professional baseball players and executives, many of whom had been rejected as unfit for the big leagues, who had turned themselves into one of the most successful franchises in Major League Baseball. ….how did one of the poorest teams in baseball, the Oakland Athletics, win so many games? – Moneyball, Pg. 1

The Oakland A’s were once a wealthy and very successful team, making the playoffs nine times from 1972 to 1992. However, their fortunes turned with a large number of losses and an acquisition that led to massive budget cuts.

During this time, the A’s turned to a new general manager, Billy Beane, to restore a winning tradition the club had not seen since the 80s. Beane needed to find a way to keep the Oakland A’s elite. His assistant, Paul DePodesta, introduced him to a theory similar to that of the great Bill James. The approach would be called Moneyball. The goal was to find undervalued metrics and then use them to identify players who cost less than they should.

…the game was ceasing to be an athletic competition and becoming a financial one. The gap between rich and poor in baseball was far greater than in any other professional sport, and widening rapidly. ….

The raw disparities meant that only the rich teams could afford the best players. –Moneyball, Pg. 1

In the above graph, the horizontal axis shows the average payroll during the years 1998 to 2001, and the vertical axis shows the average yearly wins over the same years. The team in blue is the New York Yankees, who won about 100 games and spent roughly $90 million per year in this period. The team in red is the Boston Red Sox, who spent nearly $80 million and won about 90 games.

The Oakland A’s are marked in green. They won about 90 games, and they spent under $30 million. On comparing them with the Red Sox, they won about the same number of games during this period, but the Red Sox spent about $50 million more per year than the A’s.

Rich teams like the Yankees and the Red Sox could afford the all-star players, so it is important to observe how efficient the A’s were. As mentioned, they won about 90 games with a payroll under $30 million, while the Yankees spent three to four times as much without winning significantly more games. In other words, the rich teams had three to four times the payroll of the poor teams, yet the A’s still made the playoffs.
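The efficiency gap can be made concrete with a rough cost-per-win calculation. The snippet below is only a sketch based on the approximate figures quoted above; the data frame and column names are illustrative, and the A’s payroll of 30 is an upper bound since the text only says "under $30 million".

```r
# Approximate figures read from the text above (payroll in millions of USD)
teams <- data.frame(
  team    = c("Yankees", "Red Sox", "A's"),
  payroll = c(90, 80, 30),   # the A's spent under $30M, so 30 is an upper bound
  wins    = c(100, 90, 90)
)

# Payroll (in millions of USD) spent per win
teams$cost_per_win <- teams$payroll / teams$wins
print(teams)
```

By this rough measure, the A’s paid about a third of a million dollars per win, against roughly $0.9 million for both the Yankees and the Red Sox.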

Taking a quantitative approach, the A’s were able to find undervalued players and form very efficient teams. To do so, they used a different method to select players. The traditional way of picking players was through scouting. Scouts would watch high school and college players and report back about their skills, especially their speed and athletic build. The A’s, however, selected players based on their statistics, not on their looks.

The statistics enable you to find your way past all sorts of sight-based scouting prejudices. – Moneyball, Pg. 30

In the 1980s and 1990s, analysts were hired by baseball teams, but none of them had enough power to affect anything significant. Billy Beane, with a rather small budget, understood the importance of analytics but most general managers didn’t know much about statistics and based decisions primarily on feelings.

Billy Beane was not afraid to alienate scouts, managers, and players if the quantitative approach suggested decisions that differed from what they recommended. He believed that this theory could work, much to the disagreement of most of his employees. Players who were brought in to replace the stars weren’t household names. The key premise of the Oakland A’s was that if they could detect undervalued skills, they could find players at a bargain. More on the scouting and Moneyball theory can be read in this article.

*On the left is Scott Hatteberg, whom the A’s selected. He could not throw particularly well but got on base a lot. On the right is Derek Jeter, one of the top players in baseball, a consistent shortstop and the leader in hits and stolen bases.*

*The approach was also followed for pitchers. On the left is Chad Bradford, a pitcher for the A’s, a submariner who used an unconventional delivery and slow speed. On the right is Roger Clemens, one of the best pitchers in the game who used a conventional delivery with a fast pace.*

### Statistical Analysis

This section demonstrates the data analysis using R. The dataset baseball.csv comes from Baseball-Reference.com.

If you are unfamiliar with the game of baseball, you can watch this short video clip for a quick introduction to the game. Although not necessary, basic knowledge of the game might help in intuitively understanding this analysis.

Before the 2002 season, Paul DePodesta … judged how many wins it would take to make the playoffs: 95. He then calculated how many more runs the Oakland A’s would need to score than they allowed to win 95 games: 135. … Then, using the A’s players’ past performance as a guide, he made reasoned arguments about how many runs they would actually score and allow. … the team would score between 800 and 820 runs and give up between 650 and 670 runs. From that, he predicted the team would win between 93 and 97 games and probably wind up in the playoffs. (They wound up scoring 800 and allowing 653.) – Moneyball, Pg. 90

The goal of a baseball team is to make the playoffs. The Oakland A’s approach to getting to the playoffs was via the use of analytics.

```r
# Reading the Data
baseball <- read.csv("baseball.csv")
str(baseball)
```

```
'data.frame':	1232 obs. of  15 variables:
 $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 2 3 4 5 7 8 9 10 11 12 ...
 $ League      : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
 $ Year        : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
 $ RS          : int  734 700 712 734 613 748 669 667 758 726 ...
 $ RA          : int  688 600 705 806 759 676 588 845 890 670 ...
 $ W           : int  81 94 93 69 61 85 97 68 64 88 ...
 $ OBP         : num  0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
 $ SLG         : num  0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
 $ BA          : num  0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
 $ Playoffs    : int  0 1 1 0 0 0 1 0 0 1 ...
 $ RankSeason  : int  NA 4 5 NA NA NA 2 NA NA 6 ...
 $ RankPlayoffs: int  NA 5 4 NA NA NA 4 NA NA 2 ...
 $ G           : int  162 162 162 162 162 162 162 162 162 162 ...
 $ OOBP        : num  0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
 $ OSLG        : num  0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...
```

This dataset includes an entry for every team from 1962 to 2012. There are 15 variables in the data set including Runs Scored (RS), Runs Allowed (RA) and Wins (W).

Since the aim is to verify the claims made in the book, the required data is the subset of this dataset that includes only the years before 2002, i.e., the seasons available to the A’s when planning for 2002.

```r
# Subset the Data
moneyball <- subset(baseball, Year < 2002)
str(moneyball)
```

```
'data.frame':	902 obs. of  15 variables:
 $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 1 2 3 4 5 7 8 9 10 11 ...
 $ League      : Factor w/ 2 levels "AL","NL": 1 2 2 1 1 2 1 2 1 2 ...
 $ Year        : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
 $ RS          : int  691 818 729 687 772 777 798 735 897 923 ...
 $ RA          : int  730 677 643 829 745 701 795 850 821 906 ...
 $ W           : int  75 92 88 63 82 88 83 66 91 73 ...
 $ OBP         : num  0.327 0.341 0.324 0.319 0.334 0.336 0.334 0.324 0.35 0.354 ...
 $ SLG         : num  0.405 0.442 0.412 0.38 0.439 0.43 0.451 0.419 0.458 0.483 ...
 $ BA          : num  0.261 0.267 0.26 0.248 0.266 0.261 0.268 0.262 0.278 0.292 ...
 $ Playoffs    : int  0 1 1 0 0 0 0 0 1 0 ...
 $ RankSeason  : int  NA 5 7 NA NA NA NA NA 6 NA ...
 $ RankPlayoffs: int  NA 1 3 NA NA NA NA NA 4 NA ...
 $ G           : int  162 162 162 162 161 162 162 162 162 162 ...
 $ OOBP        : num  0.331 0.311 0.314 0.337 0.329 0.321 0.334 0.341 0.341 0.35 ...
 $ OSLG        : num  0.412 0.404 0.384 0.439 0.393 0.398 0.427 0.455 0.417 0.48 ...
```

The dataset now has 902 observations of the same 15 variables.

```r
library(ggplot2)

# Wins per team for 1996-2001, colored by playoff appearance
moneyball_1996_2001 <- subset(baseball, Year < 2002 & Year >= 1996)
ggplot(data = moneyball_1996_2001, aes(x = W, y = Team)) +
  theme_bw() +
  scale_color_manual(values = c("grey", "red3")) +
  geom_vline(xintercept = c(85.0, 95.0), col = "purple", linetype = "longdash") +
  geom_point(aes(color = factor(Playoffs)), pch = 16, size = 3.0)
```

To make a linear regression model to predict ‘Wins (W)’ using the difference between ‘Runs Scored (RS)’ & ‘Runs Allowed (RA),’ a new variable is added to the dataset, i.e., ‘RD’ (Runs Difference).

```r
# Compute Run Difference
moneyball$RD <- moneyball$RS - moneyball$RA
```

Before building a predictive (regression) model, it is important to explore the data to get an idea of which kind of fit is appropriate.

```r
ggplot(data = moneyball, aes(x = W, y = RD)) +
  theme_bw() +
  geom_point()
```

The obtained plot suggests a strong linear relationship between the two variables. Hence, a linear regression model is suitable for prediction.

```r
WinsReg <- lm(W ~ RD, data = moneyball)
summary(WinsReg)
```

```
Call:
lm(formula = W ~ RD, data = moneyball)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.2662  -2.6509   0.1234   2.9364  11.6570 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 80.881375   0.131157  616.67   <2e-16 ***
RD           0.105766   0.001297   81.55   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.939 on 900 degrees of freedom
Multiple R-squared:  0.8808,	Adjusted R-squared:  0.8807 
F-statistic:  6651 on 1 and 900 DF,  p-value: < 2.2e-16
```

The above model suggests that variable ‘RD’ is very significant in the linear model.

Using the developed linear model, the claim that ‘if a team scores at least 135 more runs than their opponent throughout the regular season, then we predict that the team will win at least 95 games and make the playoffs’, can be verified.

The obtained linear regression equation is:

\[ W = 80.881375 + 0.105766 \times (RD) \]

To win the required number of games, i.e., \( W \ge 95 \), the required \( RD \) can be calculated as:

\[ RD \ge \frac{95 - 80.881375}{0.105766} \approx 133.5 \]

which is very close to the 135 claimed in the book.
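The threshold can also be checked numerically. The snippet below is a minimal sketch that plugs the intercept and slope reported in the regression summary into the rearranged equation; the variable names are illustrative.

```r
# Coefficients copied from summary(WinsReg) above
intercept   <- 80.881375
slope       <- 0.105766
target_wins <- 95

# Solve target_wins = intercept + slope * RD for RD
rd_needed <- (target_wins - intercept) / slope
print(round(rd_needed, 1))
```

With the fitted model in memory, the same quantity could be obtained without copying numbers by hand via `(95 - coef(WinsReg)[1]) / coef(WinsReg)[2]`. The result lands within a couple of runs of the 135 quoted in the book.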

#### Predicting Runs Scored & Runs Allowed

It is important to know how many runs a team will score, which can be predicted with batting statistics, and how many runs a team will allow, which can be predicted using fielding and pitching statistics.

How does a team score more runs? Traditionally, most baseball teams and experts used Batting Average (BA), a measure of how often a player gets on base by hitting the ball, to determine a batter’s skill. The Oakland A’s claimed that On-Base Percentage (OBP), the percentage of time a player gets on base, including walks, is the most important; that Slugging Percentage (SLG), which measures how far a player gets around the bases on his turn, is somewhat significant; and that Batting Average (BA) is overrated. In other words, they found that two baseball statistics were significantly more important than any other.

```r
RunsReg <- lm(RS ~ OBP + SLG + BA, data = moneyball)
summary(RunsReg)
```

```
Call:
lm(formula = RS ~ OBP + SLG + BA, data = moneyball)

Residuals:
    Min      1Q  Median      3Q     Max 
-70.941 -17.247  -0.621  16.754  90.998 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -788.46      19.70 -40.029  < 2e-16 ***
OBP          2917.42     110.47  26.410  < 2e-16 ***
SLG          1637.93      45.99  35.612  < 2e-16 ***
BA           -368.97     130.58  -2.826  0.00482 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 24.69 on 898 degrees of freedom
Multiple R-squared:  0.9302,	Adjusted R-squared:   0.93 
F-statistic:  3989 on 3 and 898 DF,  p-value: < 2.2e-16
```

From the summary of the linear regression model, it can be noted that Batting Average (BA) is less significant than other independent variables, i.e., On-Base Percentage (OBP) & Slugging Percentage (SLG).

The coefficient of Batting Average is negative, which is counterintuitive (a higher batting average should not reduce runs scored) and is likely due to multicollinearity, i.e., high correlation between the independent variables.
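A quick way to look for multicollinearity is to inspect the pairwise correlations between the predictors. The sketch below is illustrative only: to stay self-contained it uses just the first ten 2001 rows visible in the `str(moneyball)` output rather than the full dataset, and `sample_2001` is a made-up name.

```r
# First ten 2001 rows, copied from the str(moneyball) output above
sample_2001 <- data.frame(
  OBP = c(0.327, 0.341, 0.324, 0.319, 0.334, 0.336, 0.334, 0.324, 0.350, 0.354),
  SLG = c(0.405, 0.442, 0.412, 0.380, 0.439, 0.430, 0.451, 0.419, 0.458, 0.483),
  BA  = c(0.261, 0.267, 0.260, 0.248, 0.266, 0.261, 0.268, 0.262, 0.278, 0.292)
)

# Pairwise Pearson correlations between the three predictors
cor_matrix <- cor(sample_2001)
print(round(cor_matrix, 2))
```

On the full data the same check would be `cor(moneyball[, c("OBP", "SLG", "BA")])`. Strongly correlated predictors make individual coefficients unstable, which is consistent with the odd sign on BA.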

A simpler linear model can be built by removing Batting Average (BA) from the model.

```r
RunsReg <- lm(RS ~ OBP + SLG, data = moneyball)
summary(RunsReg)
```

```
Call:
lm(formula = RS ~ OBP + SLG, data = moneyball)

Residuals:
    Min      1Q  Median      3Q     Max 
-70.838 -17.174  -1.108  16.770  90.036 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -804.63      18.92  -42.53   <2e-16 ***
OBP          2737.77      90.68   30.19   <2e-16 ***
SLG          1584.91      42.16   37.60   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 24.79 on 899 degrees of freedom
Multiple R-squared:  0.9296,	Adjusted R-squared:  0.9294 
F-statistic:  5934 on 2 and 899 DF,  p-value: < 2.2e-16
```

From the above model, the coefficient of On-Base Percentage (OBP) is larger than that of Slugging Percentage (SLG). Since the two variables are on roughly the same scale, this suggests that OBP matters more for runs scored than SLG.
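One way to read the two coefficients side by side, assuming OBP and SLG are on comparable scales, is to ask how many extra runs the model predicts for the same small improvement in each statistic. The step size of 0.010 below is an arbitrary illustrative choice.

```r
# Coefficients copied from the RS ~ OBP + SLG summary above
coef_obp <- 2737.77
coef_slg <- 1584.91

# Predicted change in runs scored for a 0.010 increase in each statistic
step <- 0.010
runs_from_obp <- coef_obp * step
runs_from_slg <- coef_slg * step
print(c(OBP = runs_from_obp, SLG = runs_from_slg))
```

Under this model, a ten-point gain in OBP is worth roughly 27 runs over a season, against roughly 16 runs for the same gain in SLG.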

The obtained model for Runs Scored is:

\[ Runs\ Scored = -804.63\ +\ 2737.77\ \times OBP\ +\ 1584.91\ \times\ SLG \]

Similarly, using ‘OOBP’ (Opponent’s OBP) and ‘OSLG’ (Opponent’s SLG), a linear model for Runs Allowed can be built.

\[ Runs\ Allowed = -837.38\ +\ 2913.60\ \times OOBP\ +\ 1514.29\ \times\ OSLG \]
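The code for this second regression is not shown in the post, so the following is only a sketch of how such a model could be fit. To stay self-contained it uses the first ten 2001 rows visible in the `str(moneyball)` output, which means its coefficients will not match the equation above; `sample_2001` and `RunsAllowedReg` are illustrative names.

```r
# First ten 2001 rows, copied from the str(moneyball) output above
sample_2001 <- data.frame(
  RA   = c(730, 677, 643, 829, 745, 701, 795, 850, 821, 906),
  OOBP = c(0.331, 0.311, 0.314, 0.337, 0.329, 0.321, 0.334, 0.341, 0.341, 0.350),
  OSLG = c(0.412, 0.404, 0.384, 0.439, 0.393, 0.398, 0.427, 0.455, 0.417, 0.480)
)

# Same model form as the Runs Scored regression, with opponent statistics
RunsAllowedReg <- lm(RA ~ OOBP + OSLG, data = sample_2001)
print(coef(RunsAllowedReg))
```

On the full data, `sample_2001` would be replaced with `moneyball`. If OOBP or OSLG contain missing values for some seasons, `lm` drops those rows under its default `na.action`.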

To predict before the season starts how many games the 2002 Oakland A’s will win, first the ‘number of runs that will be scored by the team’ and ‘how many runs they will allow’ have to be predicted. These models use team statistics.

When predicting for the 2002 Oakland A’s before the season has started, the roster is probably different from the year before, so team statistics for 2002 are not yet available. However, these statistics can be estimated from past player performance, assuming that past performance correlates with future performance and that there will be few injuries during the season.

Here, the Oakland A’s 2001 team statistics are used to predict the performance for 2002. Ideally, the mean statistics for only the relevant players should be used.

```r
OAK <- subset(moneyball, Team == "OAK")
OAK2001 <- subset(OAK, Year == 2001)
```

```
mean(OAK2001$OBP)
[1] 0.345
mean(OAK2001$SLG)
[1] 0.439
```

Using the above values and substituting in the linear models,

\[ Runs\ Scored = -804.63 + 2737.77 \times (0.345) + 1584.91 \times (0.439) \approx 836 \]

\[ Runs\ Allowed = -837.38 + 2913.60 \times (0.308) + 1514.29 \times (0.38) \approx 635 \]

Substituting the obtained values in the linear model for ‘Number of Wins,’

\[ Wins = 80.881375 + 0.105766 \times (836 - 635) \approx 102 \]
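The three substitutions can be chained in code. The sketch below wraps each fitted equation in a small helper function, using the coefficients and inputs quoted in this post; the function names are made up for illustration.

```r
# Coefficients copied from the three fitted models in this post
predict_runs_scored <- function(obp, slg) {
  -804.63 + 2737.77 * obp + 1584.91 * slg
}
predict_runs_allowed <- function(oobp, oslg) {
  -837.38 + 2913.60 * oobp + 1514.29 * oslg
}
predict_wins <- function(rs, ra) {
  80.881375 + 0.105766 * (rs - ra)
}

# Oakland A's 2001 inputs used in the equations above
rs   <- predict_runs_scored(obp = 0.345, slg = 0.439)
ra   <- predict_runs_allowed(oobp = 0.308, oslg = 0.38)
wins <- predict_wins(rs, ra)

print(round(c(runs_scored = rs, runs_allowed = ra, wins = wins), 1))
```

Keeping the intermediate values unrounded avoids compounding rounding error; the predicted win total still rounds to 102.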

The Oakland A’s made it to the playoffs four years in a row (2000, 2001, 2002, and 2003), but they didn’t win the World Series. It was stated earlier that the goal of a baseball team is to make the playoffs, but why isn’t the aim to win the playoffs or the World Series?

Over a long season, luck evens out, and skill shines through. In a series of three out of five, or even four out of seven, anything can happen – Moneyball, Pg. 199

In other words, the playoffs suffer from the sample size problem. There are not enough games to make any statistical claims.
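The sample size problem can be illustrated with a simple binomial calculation. The sketch below assumes, purely for illustration, that the stronger team wins any single game with probability 0.55 and that games are independent; neither assumption comes from the book. The helper `p_majority` is a made-up name.

```r
# Assumed single-game win probability for the stronger team (illustrative only)
p <- 0.55

# Probability of winning a majority of n games, which for odd n equals
# the probability of winning a best-of-n series
p_majority <- function(n, p) {
  1 - pbinom(floor(n / 2), size = n, prob = p)
}

best_of_5 <- p_majority(5, p)
best_of_7 <- p_majority(7, p)
season    <- p_majority(162, p)   # more than 81 wins in 162 games

print(round(c(best_of_5 = best_of_5, best_of_7 = best_of_7, season_162 = season), 3))
```

Under these assumptions the better team wins a short series only about six times in ten, while it finishes a 162-game season with a winning record far more reliably.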

### Baseball and Sports Analytics

Baseball has always been a game of numbers and statistics. Thanks to an explosion of data in modern times, along with new analytics software running on powerful computers, baseball analytics (a.k.a. sabermetrics) has seen a wave of changes that might make Moneyball look like it belongs in the minor leagues.

It can be observed that the models used in this study are reasonably simple: they rely on linear regression and involve only a few variables. Even so, this approach led to significant success for the Oakland A’s and, more generally, for teams that use the power of sports analytics.

In human behavior, there was always uncertainty and risk. The goal of the Oakland front office was simply to minimize the risk. Their solution wasn’t perfect. It was just better than rendering decisions by gut feeling – Moneyball, Pg. 99

The significance of Moneyball lies in the fact that it talks about the first instance in which the use of analytics in sports became popular. It should be noted that baseball is not the only sport for which analytics is used; statistical analysis is used in almost every game, including basketball, soccer, cricket, and hockey. Moreover, nowadays, most sports teams have a dedicated statistics/analytics group. What is the edge of using analytics? Models allow managers to more accurately value players and minimize risk, risk that arises from decisions made by intuition.

Sports analytics will continue to grow and undoubtedly become more heavily relied on, but there are still ways they can be improved. It is an exciting area to delve into with ample literature available out there for different sports, from analytical decisions on the play-field to broadcasting related analytics. If you’d like to collaborate on a project in the domain, feel free to reach out to me.