Baseball fan? Excel fan? Of course there’s a way to combine our nation’s two pastimes.
Michael Lewis’s Moneyball popularized Bill James and the “sabermetrics” school of applying statistical methods to baseball analysis.
One of the most popular statistics developed by James is the Pythagorean expectation.
From Wikipedia, the Pythagorean expectation is “a sports analytics formula devised by Bill James to estimate the percentage of games a baseball team ‘should’ have won based on the number of runs they scored and allowed.”
This is an empirically derived formula and there is some controversy on how best to estimate this relations between run differential and winning percentage.
We will explore how to calculate the Pythagorean expectation using Excel’s Power Query and statistical functionality.
Make your own workbook following the steps below or download the exercise file here.
This is a public dataset of baseball statistics (going back to the 1890s!) maintained by sports journalist Sean Lahman. Download the latest dataset here. We want the teams.csv file in the comma-delimited version.
2. Import into Get & Transform
This exercise requires some filtering and measure calculations which we will do in Power Query. This amazing new tool shipped with Excel 2016 and can be downloaded as an add-in for previous versions of Excel.
Want to learn more Power Query? I suggest MyExcelOnline’s Ultimate Power Query course.
a. Import CSV into Power Query
In a blank workbook, select Data – New Query – From CSV. Then browse to the location of your teams.csv file.
This will open the Query Editor from which we will be transforming the dataset.
b. Filter data before 2000
Baseball was a very different game 100 years ago! To provide results more consistent with today’s ball play, we will only use data for the year 2000 and later. Meanwhile, for those who are struggling to find sites to bet on baseball, then standardsuk.com may be the solution.
To do this, click the filter button on the “yearID” column. This will bring down a filter menu. Select Filter Rows.
From here, you can filter rows where the yearID is greater or equal to 2000.
c. Remove unnecessary columns
This dataset contains a lot of columns. We only need the following:
- teamID
- yearID
- lgID
- G = number of games played
- W = number of team wins
- L = number of team losses
- R = total number of runs scored
- RA = total number of runs allowed
To remove other columns, click on teamID. Then, hold down the control key as you select the other columns you want. When you have selected them all, right-click on any column. Select “Remove Other Columns” from the drop-down menu.
3. Get additional columns
We need additional measures to conduct our analysis. On the Query Editor ribbon, select Add Columns. An “Add Custom Column” dialogue box will appear.
a. Calculate run differential, RD
This is the difference between the runs scored and the runs allowed by a team. Naming your column RD, you calculate the run differential as [R] - [RA]
.
b. Calculate winning percentage, Wpct
This is the fraction of games won by a team. Naming the column “Wpct,” calculate the column as [W] / ([W] + [L])
.
4. Basic Regression: Winning Percentage on Run Differential
One simple way to predict a team’s winning percentage using runs scored and allowed is with linear regression. This is a so-called “straight-line” estimate of winning percentage using the run differential.
a. Make a scatter plot
With cells I2:J541
selected, select Insert on your ribbon and then Recommended Charts. Excel will recommend a scatter plot which is what we want. This chart plots the run differential on the X axis and winning percentage on the Y axis.
b. Get the coefficients
This is a helpful visualization. We also want the coefficients, which we can retrieve from the chart by clicking on our chart, then clicking the plus sign toward the right of our chart.
Go ahead and check off the last option to add a trendline, then click the arrow to the right of this option and head to “More Options.”
From this menu, scroll to the bottom and check on “Display Equation on Chart.” You will then see the following equation on your chart:
Wpct = .500+ 0.0006*RD
This formula tells us that a team with a run differential of zero (RD = 0) will win half of its games (estimated intercept ≈ .500) which is reasonable.
This simple equation explains the ten-runs-equal-a-win formula.
You may have heard this rule of thumb before that an extra ten runs in a season means an additional win. Well, this formula bears out that observation. How?
A one-unit increase in run differential corresponds to an increase of 0.0006 in winning percentage. To put this into context of a season, a team scoring 750 and allowing 750 runs is predicted to win half of its games, which is 81 games in a typical MLB season of 162 games.
However, a team scoring 760 runs and allowing 750 has a run differential of +10 and is predicted to have a winning percentage of 0.5 + 10 * 0.0006 = 0.506.
A winning percentage of 0.506 in a 162-game schedule corresponds to 82 wins.
Thus an increase of 10 runs in the run differential of a team corresponds in this model to one additional win.
5. The Pythagorean Formula for Winning Percentage
From this linear relationship between run differential and winning percentage, Bill James empirically derived a non-linear formula to estimate winning percentage, called the Pythagorean expectation. The standard equation is as follows:
Where R
= runs and RA
= runs allowed.
James and other analysts have aimed to find an exponent which would give a better fit relative to the originally proposed exponent value of 2. We will take a shot at doing this in Excel. But first, some algebra! Instead of the exponent 2 as used above, we will solve for the exponent K
.
Keep in mind that unlike the above example, the Pythagorean expectation uses the win ratio rather than the win percentage as its dependent variable. Whereas before we were predicting W / W + L
, we are now predicting the less common W / L
.
The value of k can now be estimated using linear regression. Our dependent variable is now the log of winning percentage, and our independent is the log of the ratio of runs to runs allowed. We will also set the intercept to zero to match the above equation.
Go to the Query tab on the home ribbon and select Edit. This will take us back to the Query Editor so we can add these measures.
So we compute some new variables. Go back to Power Query:
a. Calculate the LogRratio
This is what’s on the right-hand side of the equation above.
Adding a custom column with logarithms is a bit trickier. Our formula for this is Number.Log10[R] / [RA]
.
b. Calculate the logWratio
This is what’s on the left-hand side of the equation above. Our formula is Number.Log10([W] / [L])
.
c. Plot and show equation for our new variables
Follow the steps above to plot this relationship. In addition to showing the equation on our chart, we want to set the intercept to zero. To do this, check on “Set Intercept” above “Display Equation on chart” and choose zero. This way, our regression weight can be interpreted without an intercept, as in the formula above.
The plot suggests a Pythagorean exponent of 1.8508, smaller than the value 2.
This suggests that the run differential is a bit off in predicting how many games a team will win. While teams may have a large run differential, it may not really lead to as many extra game wins.
According to Wikipedia: “1.83 is the most accurate, and the one used by baseball-reference.com.” So way to go! Pretty daggone close for our first try.
This makes our “new and improved” Pythagorean Expectation formula:
Kevin Lehrbass
Hi George,
I enjoyed reading your post! Nice mix of baseball stats and Excel functionality.
Bill James had a humble beginning but believed in his ideas and it’s changed everything! Thanks for the link to Sean Lahman’s site.
Cheers,
Kevin Lehrbass
https://www.myspreadsheetlab.com/blog/
George Mount
Thanks, Kevin!
Paul Dolores
Hi George! This is a great read. I’m just curious how you would treat cases for W/L where L=0, i.e., the team is undefeated. Your response is greatly appreciated!
George Mount
Thanks for reading Paul! Interesting question. In this case a team has a winning percentage of 100%. Using our new equation for the expectation [latex] \frac{W}{W+L} = \frac{ R^{1.85} }{ R^{1.85} + RA^{1.85} }[/latex], the only way a team with a 100% winning percentage (or any team) will have a 100% expectation is if they do not give up any runs.