Bayes-Ball

Suppose you are a longtime Red Sox fan and you just bought a ticket to the fourth of a four-game series against the Yankees. The Red Sox have won two of the three previous games in the series. As you settle into your seat at Fenway, everyone around you is confident the Red Sox will win. At least that's the best bet. But what are the odds on that bet? Even if you are loath to bet on the Yankees, what is the chance the Red Sox lose?

The Red Sox have won two out of the three previous games, so it is reasonable to think that the probability that they win the next game is $2/3 \approx 66.6$ percent. But maybe the Red Sox' win probability against the Yankees is actually higher, say $75$ percent, and they just got unlucky in the first game. Or maybe the Red Sox' win probability is actually lower, say $50$ or even $40$ percent, and they just happened to be on a hot streak. Guessing $66.6$ percent seems reasonable, but how do we know this is even a good estimation scheme?

Suppose we only watched the first game, which the Red Sox lost. If we used the same estimation scheme, then we'd be forced to say that the Red Sox have a $0/1 = 0$ percent win chance in any future game. This is clearly not a reasonable guess since there has to be some chance that the Red Sox win. One game does not provide enough information to make this estimator reasonable. Why? Because we've seen the Red Sox win other games and we know even the best MLB teams don't have a $100$ percent win probability against the worst teams.

This statement, that no team has a $100$ percent win probability against any other team, is an example of prior knowledge. We know something about baseball before we see any games. This prior knowledge lets us reject simple estimation schemes when they give answers that are wildly off-base. Can we use prior knowledge to improve our estimator so that it doesn't go foul?

Questions

The story above raises three main questions:

  1. How should we estimate the win probability of team 1 against team 2 if we see team 1 win $w$ out of $n$ games against team 2?
  2. How can we incorporate prior information about baseball to improve this estimate?
  3. What prior information do we actually have about win probabilities in baseball?

The answers to all three of these questions depend on one number, $\beta$. The value of $\beta$ represents our prior knowledge about baseball. It is large when teams are often evenly matched (win probabilities near $50$ percent) and small when teams are often unevenly matched (win probabilities near $0$ or $100$ percent). A large $\beta$ means most games are unpredictable; a small $\beta$ means most games are predictable.

Given $\beta$, there is a surprisingly simple way to estimate win probabilities that accounts for prior knowledge. This method can be motivated rigorously while remaining easy to compute. It is so easy to compute that you could use it while sitting in the bleachers with Cracker Jacks in one hand and a hot dog in the other.

But to use this method you need to know $\beta$. In this article, we fit for $\beta$ based on the history of baseball. By estimating $\beta$, we gain insight into the nature of the sport and how that nature has changed over the league's 148-year history.

How to Incorporate Prior Information

Before we get to $\beta$, we need to understand the question we are asking: specifically, how to balance prior knowledge with observed games and arrive at a probability.

To do this we need some notation. Capital letters are used for quantities that are either random or unknown while lower case letters are used for quantities that are either fixed or known.

Let $n$ be the number of games observed and $w$ be the number of games team 1 won. Since there is a chance that either team could win or lose any game, observing $w$ wins is a specific outcome out of many possible outcomes. Let $W$ be a random variable representing the number of wins if we could run the $n$ games again. This is analogous to seeing a coin land heads $w$ times out of $n$ flips; $W$ is the number of times it lands heads if we flip $n$ more times. Let $P$ be the probability that team 1 beats team 2. Assume that this probability is independent of the outcome of all previous games and is constant over the course of the $n$ games.

In order to estimate $P$, we want to know the likelihood that $P = p$ given that $W = w$, for any possible outcome $w$. We could then find which win probability is most likely given the observed outcome. The likelihood that $P = p$ given $W = w$ is the conditional probability $\text{Pr}\{P = p|W = w,n\}$.


To find this conditional probability, we will use Bayes’ rule.

Bayes' rule expresses the probability that a random event $A$ occurs given that $B$ occurred in terms of the probability that $B$ occurs given that $A$ occurred. It reverses the direction of the conditioning. This reversal is useful because conditional probabilities are often easier to work out in one direction than in the other. Bayes' rule comes from the following pair of equalities:

$$\text{Pr}\{A\cap B\} = \text{Pr}\{A|B\} \text{Pr}\{B\}, \quad \text{Pr}\{A\cap B\} = \text{Pr}\{B|A\} \text{Pr}\{A\}.$$

Setting the two equal to each other and solving for $\text{Pr}\{A|B\}$ yields:

$$\text{Pr}\{A|B\} = \frac{\text{Pr}\{A\cap B\}}{\text{Pr}\{B\}} = \frac{\text{Pr}\{B|A\} \text{Pr}\{A\}}{\text{Pr}\{B\}}.$$

Applying Bayes’ rule to our problem gives:

$$\text{Pr}\{P = p|W = w,n\} = \frac{\text{Pr}\{W = w|P = p,n\} \text{Pr}\{P = p\}}{\text{Pr}\{W = w|n\}}.$$

The conditional probability on the left hand side, $\text{Pr}\{P = p|W = w,n\}$, is the posterior. This is the probability that $P = p$ given that team 1 won $w$ out of $n$ games. It is called the posterior because it is the distribution of win probabilities after observing data. Our goal is to find the probability $p$ that maximizes the posterior.

On the right hand side, $\text{Pr}\{W = w|P = p,n\}$ is the likelihood, so named because it is the likelihood of observing the data given a win probability $p$. The probability $\text{Pr}\{P = p\}$ is the prior. This is the probability that team 1 has win probability $p$ against team 2 before we observe any games between them. We've done it! This is how we incorporate prior knowledge.
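To make the machinery concrete, here is a small numerical sketch (our own illustration, not one of the article's demos) that builds a posterior on a grid of candidate win probabilities by multiplying a likelihood by a prior and normalizing. The uniform prior and grid size are arbitrary choices.

```python
# A minimal grid-based illustration of Bayes' rule for the win probability.
# The uniform prior and the grid are illustrative choices, not the article's prior.
import numpy as np

w, n = 2, 3                            # observed wins and games (the Red Sox example)
p = np.linspace(0.001, 0.999, 999)     # grid of candidate win probabilities

likelihood = p**w * (1 - p)**(n - w)   # Pr{W = w | P = p, n}, up to a constant in p
prior = np.ones_like(p)                # uniform prior: all win probabilities equally likely

posterior = likelihood * prior         # numerator of Bayes' rule
posterior /= posterior.sum()           # normalize (plays the role of Pr{W = w | n})

print("Most likely p on the grid:", p[np.argmax(posterior)])   # ~ 2/3
```

With a flat prior the posterior peaks at the raw win frequency; the next section replaces the flat prior with a beta prior and shows what changes.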

The Prior

Now that we know how to incorporate prior knowledge, what do we want it to be? What distribution should we use to model the probability that a baseball team has win probability $P = p$ against another team?

We will use a symmetric beta distribution since the beta distribution is the standard choice for this type of estimation problem (don't worry, the reasons will become apparent once we start the analysis). This is where $\beta$ enters; it is the parameter that defines this distribution.

When $\beta = 0$, the distribution is uniform: all win probabilities are equally likely. This is the same as knowing nothing about the win probabilities. When $\beta = 1$, the distribution is a downward-facing parabola. When $\beta > 1$, the distribution is bell-shaped. The larger $\beta$, the more the distribution concentrates about $p = 1/2$. In essence, a large $\beta$ means less predictable games and a more competitive league. You can experiment with different $\beta$s below.

Demo 1: The symmetric beta distribution as a function of its parameter, $\beta$.

The more concentrated the distribution, the less likely it is that team 1 has a large (or small) win probability against team 2. Therefore, by tuning $\beta$ we can express our expectation about how even baseball teams are. We will use past baseball data to fit for this parameter. Note that this prior does not incorporate prior information about specific teams, only about the nature of baseball as a whole.
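If you want to reproduce the shapes in Demo 1, a short sketch using SciPy follows. Note the shift in convention: the prior with parameter $\beta$ used here is proportional to $(p(1-p))^{\beta}$, which in SciPy's standard parameterization is Beta($\beta + 1$, $\beta + 1$). The function name is ours.

```python
# Evaluating the symmetric beta prior of Demo 1 with SciPy's standard parameterization.
import numpy as np
from scipy.stats import beta as beta_dist

def prior_pdf(p, beta_param):
    """Density of the symmetric beta prior with the article's parameter beta_param."""
    return beta_dist.pdf(p, beta_param + 1, beta_param + 1)

p = np.linspace(0, 1, 5)               # a few points between 0 and 1
for b in [0, 1, 20]:
    print(f"beta = {b:2d}:", np.round(prior_pdf(p, b), 3))
# beta = 0 is flat, beta = 1 is a parabola, beta = 20 is sharply peaked at p = 1/2.
```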

The Estimator

Now we can state our original question formally. If team 1 wins $w$ out of $n$ games against team 2, and if the win probabilities are sampled from a beta distribution with parameter $\beta$, then what is the most likely win probability for team 1?

This may seem like a hard question to answer cleanly—and in general it is. However, by picking the beta distribution as our prior, the answer becomes both elegant and intuitive.

The most likely win probability $p_*$, having seen $w$ wins out of $n$ games, given prior parameter $\beta$, is:

$$p_*(w,n|\beta) = \frac{w + \beta}{n + 2 \beta}.$$

This win probability maximizes $\text{Pr}\{P = p|W = w,n\}$. That's all there is to it!
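In code, the estimator is a one-liner; here is a minimal sketch (the function name is ours):

```python
# The MAP win-probability estimator: the win frequency after adding beta fictitious
# wins and beta fictitious losses to the record.
def map_win_probability(w, n, beta):
    """Most likely win probability after w wins in n games, with prior parameter beta."""
    return (w + beta) / (n + 2 * beta)

# The Red Sox example (2 wins in 3 games) under a few different priors.
for b in [0, 1, 4, 21.9]:
    print(f"beta = {b:>4}: p* = {map_win_probability(2, 3, b):.3f}")
# beta = 0 recovers the raw win frequency 2/3; larger beta pulls the estimate toward 1/2.
```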

What makes this so brilliant is that the parameter $\beta$ can be treated as a number of fictitious games. Think back to our original method for estimating the win probability: the win frequency (ratio of wins to games played). The most likely win probability is just the win frequency if we pretend that our team wins $\beta$ more games and loses $\beta$ more games.

For example, suppose that $\beta = 1$, $w = 2$, and $n = 3$. Instead of estimating that the Red Sox have a $2/3$ win probability, we estimate that they have win probability $(2 + 1)/(3 + 2) = 3/5$, just as if they had won and lost one more game. If instead $\beta = 4$, then our estimated win probability would be the Red Sox' win frequency had they won an additional 4 games and lost an additional 4 games: $(2 + 4)/(3 + 8) = 6/11$.

Notice that as $\beta$ grows, the estimated win probability approaches $1/2$. As a consequence, if $\beta$ is large, then our estimator is conservative, and it will take a lot of games to move the estimated win probability far away from $1/2$.

You can experiment with different $w$, $n$, and $\beta$s below. While you experiment, consider how large you think $\beta$ should be for baseball.


Demo 2: Experiment with large and small $\beta$. See how many wins you need to get to a large win probability while keeping $\beta$ fixed. Try this for $\beta = 0$, $\beta = 1$, and $\beta = 20$.

The proof that $p_*(w,n|\beta)$ maximizes the posterior is included below (in two pieces). The proof is not necessary to understand the role of $\beta$ in the estimator; however, it is important for understanding why the estimator takes this form. In particular, the proof below is needed to understand exactly what is assumed a priori about baseball win probabilities given a particular $\beta$. The motivation for the beta prior is explained in the second half of the proof. That said, if you are only interested in how $\beta$ has changed over time and how $\beta$ differs between different sports, skip to the end of the proof and continue from there.

Proof Without Prior ($\beta = 0$)

Why does the most likely win probability take this form?

If we do not incorporate prior information, then the posterior is proportional to the likelihood and maximizing the posterior is equivalent to maximizing the likelihood. To find the likelihood, condition on $P = p$ (assume that the win probability is $p$). Then the number of wins $W$ is binomially distributed, which means that:

$$\text{Pr}\{W = w|n,p\} = \text{bin}(w|n,p) = \binom{n}{w} p^{w} (1-p)^{n-w}.$$


Figure 1: Binomial distribution for 30 events with win probability $p$ ranging from 0.25 (blue) to 0.75 (red).
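As a quick check of this formula, the binomial probabilities can be evaluated directly with SciPy (the example values are arbitrary):

```python
# Evaluating the binomial likelihood Pr{W = w | n, p} with scipy.stats.binom.
from scipy.stats import binom

n, p = 3, 2 / 3
for w in range(n + 1):
    print(f"Pr(W = {w} | n = {n}, p = {p:.2f}) = {binom.pmf(w, n, p):.3f}")
# The four probabilities sum to 1, and w = 2 is the most likely outcome when p = 2/3.
```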

So in the absence of prior information:

$$\text{Pr}\{P = p|w,n\} \propto \text{Pr}\{W = w|n,p\} = \binom{n}{w} p^{w} (1-p)^{n-w}.$$

Note that $P$ is not binomially distributed because the posterior is a distribution over $p$, not $w$. When using the expression on the right as a posterior, it is important to remember that $p$ is the variable that can change and $w$ and $n$ are fixed by what happened in the observed games.


Figure 2: The likelihood of $p$ for a varying number of wins and games. From top left to bottom right the panels correspond to 5, 10, 50, and 100 games played. The distributions correspond to wins ranging from 0 to the number of games played. Notice that these distributions become more tightly concentrated around the ratio of wins to games as the number of games increases. This reflects the expectation that the number of wins divided by the number of games will approach the win probability when many games are played. Also notice that even if a team wins none or all of the games, there is still a chance they have a win probability between 0 and 1.

This distribution is an example of a beta distribution. A beta distribution is a probability distribution over probabilities; it is often used to model the probability that $P = p$. The beta distribution depends on two parameters, $\alpha$ and $\beta$. It has the form[1]:

$$\text{beta}(p|\alpha,\beta) = \frac{p^{\alpha}(1-p)^{\beta}}{B(\alpha+1,\beta+1)}$$

where:

$$B(\alpha+1,\beta+1) = \frac{\Gamma(\alpha+1)\Gamma(\beta+1)}{\Gamma(\alpha+\beta+2)}$$

and $\Gamma(x)$ is the gamma function. The gamma function is the continuous extension of the factorial. That is, $\Gamma(n+1) = n!$ for any nonnegative integer $n$, but, unlike the factorial, $\Gamma(x)$ is defined for all positive real $x$. It follows that if $\alpha$ and $\beta$ are integers, then $1/B(\alpha+1,\beta+1) = (\alpha + \beta + 1)\binom{\alpha+\beta}{\alpha}$, so the normalizing constant is closely related to a binomial coefficient. Therefore, our likelihood is beta-distributed with parameters equal to the number of wins, $w$, and losses, $n-w$:

$$\text{Pr}\{P = p|w,n\} = \text{beta}(p|w,n-w)$$
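A quick numerical check of the gamma-function facts above, using SciPy (the particular parameter values are arbitrary):

```python
# Gamma(n + 1) = n!, and the beta-function normalizer is built from a binomial coefficient.
from math import comb, factorial
from scipy.special import beta as B, gamma

print(gamma(6), factorial(5))     # 120.0 and 120: Gamma(n + 1) = n!

a, b = 2, 3                       # stand-ins for integer alpha and beta
# 1 / B(alpha + 1, beta + 1) = (alpha + beta + 1) * (alpha + beta choose alpha)
print(1 / B(a + 1, b + 1), (a + b + 1) * comb(a + b, a))   # 60.0 and 60
```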

The likelihood allows us to calculate the probability that $P = p$ having observed $w$ wins out of $n$ games (in the absence of prior information). What win probability, $p_*$, maximizes the likelihood?

To maximize the likelihood we maximize the log-likelihood. Since a logarithm is monotonically increasing in its argument the log-likelihood is maximized wherever the likelihood is maximized. The log-likelihood is:

$$\log(\text{Pr}\{P = p|w,n\}) = \log \binom{n}{w} + w \log(p) + (n-w) \log(1 - p).$$

Differentiating with respect to $p$:

$$\frac{d}{dp} \log(\text{Pr}\{P = p|w,n\}) = 0 + \frac{w}{p} - \frac{n-w}{1-p}.$$

Setting the derivative to zero requires $\frac{w}{p_*} = \frac{n-w}{1-p_*}$, or:

$$\frac{w}{n - w} = \frac{p_*}{1-p_*}.$$

This is solved when $p_* = w/n$. The win frequency is the only solution to this equation, so the log-likelihood has one extremum. The log-likelihood is a concave function, so this extremum is a maximum. Thus the win probability that maximizes the likelihood is the win frequency:

$$p_*(w,n) = \frac{w}{n}.$$

This is a good motivation for our naïve estimation scheme. If the Red Sox win two out of three games then we guess that their win probability is $2/3$. This, however, is the solution without any prior information. It is equivalent to the solution when we assume that all win probabilities are equally likely. As noted before, this solution breaks down if $w = 0$ or $w = n$ since it returns either a $0$ or $100$ percent chance of victory in future games. What we need is a prior distribution that can be fit to baseball data so the estimator does not return extreme win probabilities.
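If you prefer a numerical check to the calculus, the same maximum can be found by optimizing the log-likelihood directly; here is a small sketch (the optimizer and bounds are our choices):

```python
# Numerically confirming that the no-prior log-likelihood peaks at the win frequency w/n.
import numpy as np
from scipy.optimize import minimize_scalar

w, n = 2, 3

def negative_log_likelihood(p):
    # The log of the binomial coefficient is constant in p, so it is dropped.
    return -(w * np.log(p) + (n - w) * np.log(1 - p))

result = minimize_scalar(negative_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x, w / n)     # both are approximately 0.667
```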

Proof Incorporating the Prior ($\beta \geq 0$)

In order to incorporate a prior we need to pick a form for the prior distribution, $\text{Pr}\{P = p\}$. The classic choice of prior for this problem is our home favorite, the beta distribution. The beta distribution is the standard choice of prior for this problem because it is the conjugate prior to the binomial distribution.

A conjugate prior is a distribution that, when chosen as a prior, ensures that the posterior distribution belongs to the same family as the prior, only with different parameters. In our case, if we use a beta distribution as a prior, then the posterior is also beta-distributed, only with different parameter values. These parameters depend on both the data observed and the parameters of the prior.

Recall that the beta distribution depends on two parameters, $\alpha$ and $\beta$. In our case we require that our prior is symmetric about $p = 1/2$, because if team 1 has a win probability of $p$ against team 2, then team 2 has a win probability of $1 - p$ against team 1. If we did not assume the prior was symmetric in this way, then the probability a team wins against another would depend on the order we listed them in. Requiring that the beta distribution is symmetric requires $\alpha = \beta$. This means that our prior only depends on one parameter and has the form:

$$\text{Pr}\{P = p|\beta\} = \text{beta}(p|\beta,\beta) = \frac{1}{B(\beta + 1,\beta + 1)} (p(1-p))^{\beta}.$$

The product $p(1-p)$ is a downward-facing parabola that is maximized at $p = 1/2$ and equals zero at $p = 0$ and $p = 1$. Changing $\beta$ bends this parabola. For $\beta > 1$ the parabola gets bent down into a bell shape. For $0 < \beta < 1$ the parabola is bent up towards a box. At $\beta = 0$ the box becomes the uniform distribution. This effect was illustrated in Demo 1.

When symmetric, the beta distribution has mean $1/2$ and variance $1/(8\beta + 12)$ (the variance of a standard Beta($\beta+1$, $\beta+1$) distribution). Therefore, as $\beta$ goes to infinity the variance in the beta distribution vanishes. In that case almost all of the win probabilities are expected to be close to $1/2$. This leads to the simple intuition:

  1. Large $\beta$ $\rightarrow$ win probabilities near $1/2$ $\rightarrow$ unpredictable games, even teams, competitive league
  2. Small $\beta$ $\rightarrow$ win probabilities can be near $0$ or $1$ $\rightarrow$ more predictable games, uneven teams, less competitive league

How does introducing this prior change our estimator?

The maximum likelihood estimator (MLE) is the probability $p$ that maximizes the likelihood. The maximum a posteriori estimator (MAP) is the probability $p$ that maximizes the posterior. Using a symmetric beta distribution with parameter $\beta$ sets the posterior proportional to:

$$\begin{aligned} \text{Pr}\{P = p|w,n,\beta\} & \propto \text{beta}(p|w,n-w)\, \text{beta}(p|\beta,\beta) \\ & \propto p^{w} (1-p)^{n-w} \, p^{\beta}(1-p)^{\beta} \\ & = p^{w + \beta} (1-p)^{n - w + \beta} \\ & \propto \text{beta}(p|w + \beta, n - w + \beta). \end{aligned}$$

As promised, the posterior has the same form as the likelihood, only now the parameters are $w + \beta$ and $n - w + \beta$. Therefore, by using a beta prior, the posterior distribution takes the same form as the likelihood had we seen an additional $\beta$ wins and $\beta$ losses. The parameter $\beta$ is, in effect, the number of fictitious games added to the record when accounting for prior information. Symbolically:

$$\text{posterior}(p|w,n,\beta) = \text{likelihood}(p|w+\beta,n + 2 \beta) = \text{beta}(p|w + \beta,n - w + \beta).$$
Demo 3: The beta distribution as a function of its two parameters. These can be thought of as either the parameters of the prior, in which case the two parameters are identical and the distribution is symmetric (try it!), or the parameters of the posterior ($w + \beta,n - w + \beta$) that also depend on the number of observed wins and losses.

Since we already know how to maximize the likelihood, we also already know how to maximize the posterior. The likelihood is maximized at the win frequency, so the posterior is maximized at the win frequency after $\beta$ wins and $\beta$ losses have been added to the record:

$$p_*(w,n|\beta) = \frac{w + \beta}{n + 2 \beta}. \quad \blacksquare$$
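A brief numerical sanity check of the result (the grid size and example values are arbitrary choices): multiplying the binomial likelihood by the symmetric beta prior on a grid reproduces the beta posterior above, and the grid maximum lands on $(w + \beta)/(n + 2\beta)$.

```python
# Checking conjugacy and the MAP formula numerically.
import numpy as np
from scipy.stats import beta as beta_dist

w, n, b = 2, 3, 4.0
p = np.linspace(0.0005, 0.9995, 1000)

posterior = p**w * (1 - p)**(n - w) * (p * (1 - p))**b    # likelihood times prior
posterior /= posterior.sum() * (p[1] - p[0])              # normalize on the grid

# The same distribution in SciPy's convention (shape parameters shifted by one).
closed_form = beta_dist.pdf(p, w + b + 1, n - w + b + 1)

print(np.allclose(posterior, closed_form, atol=1e-3))      # True
print(p[np.argmax(posterior)], (w + b) / (n + 2 * b))      # both ~ 6/11 = 0.545
```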


$\beta$ and the Nature of Baseball

We now have an intuitive way to incorporate prior information about win probabilities into our estimate for the win probability given data. What remains is to estimate $\beta$ (now that we know how to use prior information, we need to gather the prior information).

In order to estimate $\beta$ we need to:

  1. gather historical baseball data to use in the estimate
  2. work out a framework for estimating $\beta$ given the data

FiveThirtyEight provides the scores of all 220,010 MLB games played since 1871.[3] This data can be used to estimate $\beta$ for every year of Major League Baseball. In order to smooth the results, we fit for $\beta$ on sliding ten-year intervals (i.e., we find the best fit for $\beta$ for 2010 to 2019, 2009 to 2018, and so on back to the decade spanning from 1871 to 1880). These intervals are indexed by the last year in the interval, so that our estimate for $\beta$ in 2019 is based on all the years between 2010 and 2019. This way the estimate for $\beta$ in a given year does not depend on games that have not yet occurred.

To fit for $\beta$ we used a Bayesian estimation framework much like the framework derived in this article for estimating the win probability. We did not assume any prior information about $\beta$ and solved for the value of $\beta$ that maximized the likelihood of $\beta$ given the corresponding decade of baseball. As before, this likelihood was found by first writing the likelihood of sampling the observed decade of baseball conditioned on a given $\beta$, then computing the likelihood of $\beta$ given the decade using Bayes' rule. Formally, this likelihood is the product of a series of beta-binomial distributions. Details are included in the expandable box below.
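As a rough sketch of what such a beta-binomial fit can look like in code (an illustration under our own assumptions, not the exact procedure in the box): games within a window are assumed to already be aggregated into (wins, games) pairs per matchup, and the function names, example data, and optimizer settings below are ours.

```python
# A sketch of fitting beta by maximizing a beta-binomial likelihood over matchups.
# Each matchup contributes the binomial likelihood integrated against the symmetric
# beta prior: Pr{W = w | n, beta} = C(n, w) * B(w + beta + 1, n - w + beta + 1) / B(beta + 1, beta + 1).
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import betaln

def negative_log_likelihood(beta, wins, games):
    """Minus the log-likelihood of beta; binomial coefficients are constant in beta and dropped."""
    wins, games = np.asarray(wins), np.asarray(games)
    log_marginals = (betaln(wins + beta + 1, games - wins + beta + 1)
                     - betaln(beta + 1, beta + 1))
    return -np.sum(log_marginals)

# Hypothetical aggregated data: wins by one team and total games for a few matchups.
wins = [6, 1, 5, 12, 3]
games = [7, 6, 11, 19, 12]

fit = minimize_scalar(negative_log_likelihood, args=(wins, games),
                      bounds=(1e-3, 200), method="bounded")
print("MLE for beta:", fit.x)
```

Matchups with only a single game carry no information about $\beta$ (as noted later for the NFL), so in practice only pairs that play two or more games move the estimate.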


We also computed the 95 percent confidence interval on our estimate for $\beta$ using the same likelihood. If baseball win probabilities are beta-distributed, then a value of $\beta$ sampled from its posterior has a 95 percent chance of lying in this interval.

The results of this analysis are shown in Figure 3.


Figure 3: Estimates for $\beta$ for each decade of Major League Baseball. The solid blue curve is the MLE estimate, and the shaded blue region is the 95 percent confidence interval. Note the gradual increase in $\beta$ over time.

The average results for each decade are provided in Table 1.

| Decade | 95% Lower | MLE | 95% Upper |
|---|---|---|---|
| 2010–2019 | 17.4 | 21.9 | 30 |
| 2000–2009 | 14.9 | 18.8 | 25.4 |
| 1990–1999 | 20.6 | 27.3 | 41.2 |
| 1980–1989 | 15.2 | 19.3 | 26.2 |
| 1970–1979 | 13.5 | 16.8 | 22.2 |
| 1960–1969 | 10.6 | 13 | 16.6 |
| 1950–1959 | 7.1 | 8.6 | 10.7 |
| 1940–1949 | 6.8 | 8.2 | 10.1 |
| 1930–1939 | 6.5 | 7.8 | 9.6 |
| 1920–1929 | 7.6 | 9.3 | 11.7 |
| 1910–1919 | 5.1 | 6.1 | 7.3 |
| 1900–1909 | 4.2 | 5.2 | 6.4 |
| 1890–1899 | 3.3 | 4 | 4.9 |
| 1880–1889 | 1.4 | 1.9 | 2.5 |
Table 1: Estimates for $\beta$ averaged over each decade of baseball.

Using this table we can easily compute the MAP estimate for the win probability of a baseball team that won $w$ out of $n$ games in any decade. For example, if the Red Sox beat the Yankees 2 out of 3 games in 1885 (ignoring that neither team existed at the time), then the MAP estimator for the win probability would be $(2 + 1.9)/(3 + 3.8) = 3.9/6.8 = 0.57$. Alternatively, if we saw the same thing today, the MAP estimator would be $(2 + 21.9)/(3 + 43.8) = 0.511$. Seeing the Red Sox win 2 out of 3 games only makes our estimate $1$ percent larger than $50$ percent!

Tragically, the Yankees won 14 out of 19 games against the Red Sox in 2019, which means our best estimate for the Red Sox win probability against the Yankees in 2019 is $(5 + 21.9)/(19 + 43.8) \approx 0.43$. This is still shockingly close to $50$ percent given that the Red Sox' win frequency against the Yankees in 2019 was only $0.26$.

These examples illustrate how astonishingly large $\beta$ is for modern baseball. Modern baseball teams have proven to be fairly even, so we expect most team-against-team win probabilities to be near $0.5$. The beta distribution with $\beta = 21.9$ is shown below in Figure 4. Note that very few teams are expected to have a win probability greater than $60$ percent against any other team. This means that most matchups are relatively fair and fairly unpredictable.


Figure 4: Prior distributions by decade. The highlighted decades are 1880–1889, 1920–1929, 1960–1969, and 2010–2019. Older decades are colored blue while newer decades are colored red. Notice the gradual tightening of the prior distribution over time. This corresponds to the gradual increase in $\beta$.

The astonishing evenness of modern baseball is reflected in how large $\beta$ is. With $\beta = 21.9$ it would take winning eleven games straight to push the estimated win probability above $0.6$ and forty-four games straight to push it past $0.75$. The median number of games played between pairs of teams who competed in 2019 was seven games, and the max was nineteen, including playoffs. Few teams even play enough games to push the estimate past $60$ percent or beneath $40$ percent. Note that $\beta$ is so large that in all cases we add more fictitious games to the estimator than we do real games.
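As a quick check of those streak lengths, a few lines of code suffice (the function name is ours):

```python
# How many consecutive wins, with no losses, push the MAP estimate past a threshold?
def streak_needed(threshold, beta):
    w = 0
    while (w + beta) / (w + 2 * beta) <= threshold:   # here n = w, so n + 2*beta = w + 2*beta
        w += 1
    return w

print(streak_needed(0.60, 21.9))   # 11
print(streak_needed(0.75, 21.9))   # 44
```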

Baseball wasn't always this even. Take a look back at Figure 3 and notice the gradual increase in $\beta$ over time. At the start of the league, $\beta$ is less than one. In fact, in 1880 the MLE estimate for $\beta$ is $0.62$ (with a confidence interval $[0.35, 1.02]$). The gradual increase in $\beta$ over the history of the league reflects the league becoming more competitive. Over time teams have become more even and the playing field more level. It is now so level that it takes over twenty games (more games than are played between almost any pair of teams) before observed data has a bigger influence on our estimate than our prior expectation does.

To see how truly large $\beta$ is for baseball, it is informative to estimate $\beta$ for other professional sports. Game-by-game data has also been made available by FiveThirtyEight for both the NBA and the NFL. Results from applying these same estimation techniques to basketball and football are shown below in Figure 5.


Figure 5: Estimates for $\beta$ per decade for MLB, modern NFL (after merging with the AFL), and NBA. The solid blue curves are the MLE estimates for $\beta$ and the shaded regions are 95 percent confidence intervals. Blue represents the MLB, purple the NBA, and red the NFL. Notice how much larger beta is for the MLB than for either the NBA or the NFL.

Unlike the MLB, both the NBA and NFL have had close to constant $\beta$s over time. Both also have much smaller $\beta$s than the MLB. For example, in the 2010s the NBA and NFL have $\beta$s equal to $2.0$ and $1.7$ games respectively. This indicates that both the NFL and NBA are far more predictable than baseball, and football and basketball teams are far less evenly matched than baseball teams.

For example, if we saw the Celtics beat the Knicks 2 out of 3 games, we would estimate that the Celtics' win probability was $(2 + 2)/(3 + 4) = 4/7 = 0.57$. The same situation in baseball would lead to a predicted win probability of $0.51$. For a more extreme example, imagine the Celtics and Red Sox each win 9 out of 10 games against the Knicks and Yankees respectively. The best estimate of the Celtics' win probability is $(9 + 2)/(10 + 4) = 0.79$ while the best estimate for the Red Sox' is $(9 + 21.9)/(10 + 43.8) = 0.57$. The Celtics' win probability is only $0.11$ less than their win frequency, $0.9$, while the Red Sox' is $0.33$ less than $0.9$.

It is also interesting to note that the likelihood distribution for $\beta$ skews much more positive for football than for basketball or baseball. This is because football teams play far fewer games per pair and much less data is available for fitting $\beta$ to any decade of NFL games. The median number of games played between any pair of NFL teams who competed with each other in 2018 was one game and the max was three. Any pair of teams who play only one game do not provide any information for estimating $\beta$, so the estimate of $\beta$ is limited to pairs of teams who play two or more games (teams within a division play twice against each possible opponent, and teams may meet more than once if they play in the regular season and in the playoffs). The uncertainty in football is most dramatic in the very early years of the league, when there were more teams and fewer games per pair, and in the 1960s, when the AFL joined the NFL (see Figure 6).


Figure 6: Estimates for $\beta$ per decade for the NFL. The solid red curve is the MLE estimates for beta and the shaded region is the 95 percent confidence interval. Notice the large uncertainty in $\beta$ as the league formed (1920s–1930s) and during the merger with the AFL.

Full results for basketball and football are provided below in Table 2.

| Decade | 95% Lower NBA | MLE NBA | 95% Upper NBA | 95% Lower NFL | MLE NFL | 95% Upper NFL |
|---|---|---|---|---|---|---|
| 2010–2019 | 2.1 | 2.5 | 3.2 | 0.9 | 2.1 | 8.1 |
| 2000–2009 | 2 | 2.5 | 3.1 | 0.6 | 1.4 | 3.9 |
| 1990–1999 | 2.1 | 2.5 | 3.1 | 0.6 | 1.4 | 4 |
| 1980–1989 | 3.6 | 4.5 | 5.7 | 0.7 | 1.6 | 5.4 |
| 1970–1979 | 2.7 | 3.3 | 4.1 | 0.3 | 0.9 | 2.6 |
| 1960–1969 | 2.9 | 3.9 | 5.5 | 0.6 | 1.6 | 5.3 |
| 1950–1959 | 2.4 | 3.3 | 5 | -0.1 | 0.3 | 1.8 |
| 1940–1949 | 0.9 | 2.1 | 7.2 | -0.3 | 0.1 | 1.1 |
| 1930–1939 | | | | 0.6 | 1.6 | 5.4 |
Table 2: Estimates for $\beta$ averaged over each decade for the NBA and the NFL.

Taken together, these estimates of $\beta$ tell a story about the nature of these leagues. It is clear that baseball is an outlier when compared to the NFL and the NBA. Not only is the league much older and not only do baseball teams play far more games, but baseball teams are more evenly matched than football or basketball teams, and baseball teams have gotten more evenly matched over time. You would have to go back to 1880 (or at least 1900) for the teams in MLB to be as unevenly matched as the teams in professional football or basketball. Baseball is now so even that it is difficult to estimate teams' win probabilities against each other without seeing close to twenty games between them. Until then, your estimates are more informed by $\beta$ than by the games you watched. Until then, your estimates are more Bayes'-ball than baseball.


  1. This is not quite the standard parameterization of the beta distribution. The beta distribution is usually defined with $\alpha$ set to what we call $\alpha + 1$ and $\beta$ set to what we call $\beta + 1$. We chose to modify the convention to ease interpretation of the parameters.

  2. Laplace, Pierre-Simon. A Philosophical Essay on Probabilities. 2nd ed. Paris: Mme Ve Courcier, 1814.

  3. All data was collected from https://data.fivethirtyeight.com/