Tango on Baseball Archives

© Tangotiger

A method for determining the probability that a given team was the true best team in some particular year (January 6, 2004)

Bradley-Terry, Bayes, Markov, probability distributions. It's all here.
--posted by TangoTiger at 10:34 AM EDT


Posted 11:35 a.m., January 6, 2004 (#1) - Guy
  Cool stuff. I wonder: what is the cumulative probability that so many teams with relatively small chances of being the best team actually won the WS (or that the most-likely-best-team won so few)? Can we say, with a high degree of confidence, that the playoff/WS system is not an efficient system for identifying the best regular-season team? (For whatever reason: favoring top 3 starters, something else.)

Posted 11:56 a.m., January 6, 2004 (#2) - Tom T
  Guy,

One way to start to get at your question is to calculate the expected number of WS champs with the best record. The most straightforward way to do that is just to calculate the expected WS champs for the best team year-by-year. For example, the 1990 A's would have been expected to win 0.543 World Series that year.

If you add these up for every year from 1990 - 2002 (except 1994), you would expect the best team to have won 5.245 World Series over this time period. In fact, they won 1 ('98 Yanks). On the other hand, 5 of them did participate in the World Series (although we would have expected more, since the probability of making the World Series is greater than the probability of winning it).

I'm not sure what to do with those numbers except to say that the "best teams" in baseball do certainly look like they win the WS less than you'd expect.
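To put a rough number on "less than you'd expect", here is a minimal sketch of the calculation described above: sum the per-season probabilities to get the expected title count, then build the full distribution of title counts to see how unlikely "at most one" would be. Only the 0.543 figure for the 1990 A's comes from the post; every other value in `season_probs` is a made-up placeholder, and the seasons are assumed independent.

```python
def title_count_distribution(probs):
    """Return dist where dist[k] = P(exactly k titles), assuming independent seasons."""
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1.0 - p)   # best team fails to win this season
            new[k + 1] += mass * p       # best team wins this season
        dist = new
    return dist

# Hypothetical per-season title probabilities for the best team.
# Only the first value (1990 A's, 0.543) is taken from the post above.
season_probs = [0.543, 0.40, 0.45, 0.35, 0.42, 0.38,
                0.47, 0.50, 0.41, 0.44, 0.39, 0.49]

expected_titles = sum(season_probs)
dist = title_count_distribution(season_probs)
p_at_most_one = dist[0] + dist[1]

print("expected titles:", round(expected_titles, 3))
print("P(at most one title):", p_at_most_one)
```

With the real year-by-year probabilities substituted in, `p_at_most_one` would be the number to look at when asking whether one title in twelve seasons is surprising.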

Posted 3:16 p.m., January 6, 2004 (#3) - AED (homepage)
  I've never understood the use of the B-T model for team rankings. Random variations in team performance are Gaussian, so the standard error function should be used instead. I've been publishing computer rankings for quite some time that are based on probable superiority between teams, with the added bonus that I calculate it analytically instead of with a Monte Carlo approach. See homepage link for a fairly comprehensive guide to statistics of rankings (in the "rating details" section).

Note that the probability that a team will win the world series is completely different from the probability that it is the best team. If X is the probability that team A is better than team B after a 162-game season, the odds that team A will beat team B in a 7-game series is roughly 0.5*[(X/0.5)^0.30]. The odds that team A will beat team B in a 5-game series is roughly 0.5*[(X/0.5)^0.26]. Skipping through lots of math, the probability of a team winning the world series is roughly proportional to the square root of its probability of being the best team, perhaps an even lower power.
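The two rules of thumb above are easy to tabulate. This is only a sketch of the approximation as quoted, with the exponents 0.30 and 0.26 taken from the post. One assumption is added here: the formula as written is not symmetric around X = 0.5, so the sketch applies it to the favorite and uses the complement for the underdog.

```python
SEVEN_GAME_EXPONENT = 0.30  # from the post
FIVE_GAME_EXPONENT = 0.26   # from the post

def series_win_prob(x, exponent):
    """Rule of thumb 0.5 * (X / 0.5) ** exponent, applied to the favorite."""
    if not 0.0 < x < 1.0:
        raise ValueError("x must be a probability strictly between 0 and 1")
    if x < 0.5:
        # Assumption of this sketch: mirror the formula for the underdog.
        return 1.0 - series_win_prob(1.0 - x, exponent)
    return 0.5 * (x / 0.5) ** exponent

for x in (0.5, 0.6, 0.75, 0.9):
    seven = series_win_prob(x, SEVEN_GAME_EXPONENT)
    five = series_win_prob(x, FIVE_GAME_EXPONENT)
    print(f"X={x:.2f}  7-game={seven:.3f}  5-game={five:.3f}")
```

Even a team that is 90% likely to be the better one comes out well short of 90% to win the series, which is the point being made about short series.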

The way to make it optimally efficient would be to replace the three rounds of playoffs with a best-of-19 series (19 to keep a team's maximum number of games unchanged) between the top teams from each league. Better yet, give each team 5-6 games against every other MLB team during the season to give a balanced schedule, and do away with playoffs altogether.

Posted 3:20 p.m., January 6, 2004 (#4) - MGL
  Interesting premise! Once again, no one who doesn't have a degree in statistics is going to be able to follow the methodology, which, as is often the case with an interesting topic, is a shame.

There is actually a much easier-to-follow method of achieving the same results using Bayesian probability. I'm not sure why the author went the more complicated route. Speaking of, does the author have a name?

Sample size and the attenuation between regular season record and WS results is not going to allow you to make any reliable inferences about why the best teams did or didn't win the number of WS they were expected to. At the very least you would have to adjust for teams making or not making the playoffs. Even then, I think you can safely assume that the best team in the reg season is not always the favorite in the post-season because of the increased value of a team's top 3 starters, as well as how the home teams are figured in the post-season...

Posted 4:45 p.m., January 6, 2004 (#5) - Jesse Frey
  AED,

I used the Bradley-Terry model because of its simplicity. It also, from everything I've seen, fits baseball game results quite well. I'm not quite sure what you mean when you say that 'random variations in team performance are Gaussian,' but I have no doubt that a model which used the Gaussian CDF as in your homepage link would give results similar to those I obtained. Is there a way, using your methods, to find analytically the probability that a given team is the best team?

MGL,

I'm not aware of an easier method to find the probability that a given team is the best team. Certainly the rankings of the teams could have been obtained with less effort. Is there a reference you could point me to?

Posted 5:51 p.m., January 6, 2004 (#6) - AED
  If team A has a merit value of a, and team B has a merit value of b, the probability that team A will beat team B equals the fraction of the Gaussian error function that is less than a-b (assuming you choose your scale carefully, of course). This function is similar, but not identical, to 1/[1+exp(b-a)], which corresponds to a B-T model (taking ln(m) as the scale on which your merit rankings are measured).

Comparing the two, the B-T model's different shape will affect the determined rankings. This is especially problematic in college sports, where B-T predicts too many major upsets and thus ranks teams too high from beating up on cupcakes, but the difference in shape will produce subtle problems any time it's used. You're right that it's not grossly in error, but since the Gaussian error function is part of the standard C math library, I don't quite see why people opt for a less accurate approach that is also more work to program. Just a pet peeve of mine, I guess.
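The difference in shape can be seen with a few lines of code. This is a sketch, not AED's actual ranking code: it compares the cumulative normal of a-b with the logistic form 1/[1+exp(b-a)]. To make the comparison about shape rather than scale, the logistic is rescaled so both curves have the same slope at a-b = 0; that slope-matching constant is an assumption of this sketch.

```python
import math

def gaussian_win_prob(a, b):
    """P(A beats B) under the Gaussian model: standard normal CDF of (a - b)."""
    return 0.5 * (1.0 + math.erf((a - b) / math.sqrt(2.0)))

# Assumption: rescale the logistic so its slope at zero matches the normal CDF.
SLOPE_MATCH = 4.0 / math.sqrt(2.0 * math.pi)

def bt_win_prob(a, b, scale=SLOPE_MATCH):
    """P(A beats B) under a Bradley-Terry (logistic) model on the same merit scale."""
    return 1.0 / (1.0 + math.exp(-scale * (a - b)))

for diff in (0.0, 0.5, 1.0, 2.0, 3.0):
    g = gaussian_win_prob(diff, 0.0)
    l = bt_win_prob(diff, 0.0)
    print(f"a-b={diff:.1f}  upset odds: Gaussian={1 - g:.4f}  B-T={1 - l:.4f}")
```

Under this matching, the logistic curve has heavier tails, so it assigns the underdog a larger chance at big merit gaps, which is the "too many major upsets" effect described above.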

I'm more interested in team-by-team comparisons, since my goal is to create an optimal ranking order. So I calculate the odds that each team is better than each other team. However, it would be trivial to instead calculate the odds that each team is better than all other teams. One correction -- I meant to say "calculate directly"; the calculation is numerical rather than analytical (although it takes negligible time to compute to arbitrarily high precision). It can also be done completely analytically (and more accurately) if you choose to use game scores rather than wins and losses.

Posted 6:30 p.m., January 6, 2004 (#7) - Ryan
  Just for Devil's advocacy and a non-SABR approach, the best team is the one that can conquer two separate seasons, regular and playoff. I really do appreciate the statistical approach to baseball, and certainly am glad it has become so prominent, but I want to ask you guys something as someone not as familiar with statistical methods. Is there an equilibrium between statistical and traditional evaluation of baseball?

Posted 9:14 p.m., January 6, 2004 (#8) - MGL
  "MGL,

I'm not aware of an easier method to find the probability that a given team is the best team. Certainly the rankings of the teams could have been obtained with less effort. Is there a reference you could point me to?"

Honestly, I may have been speaking out of my posterior. I only skimmed your study the first time. If I had the distribution of true talent in the league, of course the rest (calculating the P that any given team is the best team, given their sample w/l record for the year) is trivial.

I don't know how to calculate that (the true talent distribution of baseball teams in an average season, e.g., 15% of all teams are .500 teams, 10% are .510 or .490, etc.). I also don't know if this distribution can be defined by an SD only (we know that the mean is .500), e.g., that it is normal. I also don't know whether you are assuming that this distribution changes from year to year. If so, then you must be using the actual distribution of sample w/l records to determine the "true" talent distribution of teams in that year.

If not (if you are assuming that this distribution is always the same, at least for your test sample of 13 years), then what I meant was why not list the talent distribution or explain the properties of the distribution (again, e.g., it is normal with a SD of .05), and then go through the simple math for one of those years to come up with the P that each team is the best team?
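The "simple math" being asked for could look something like the sketch below. It is not the method from the study; it is a plain normal-normal approximation under loud assumptions: true talent is normal with mean .500 and SD .05 (the example figure above), each team's record is a noisy 162-game sample of its talent, schedules are ignored, and the four records are invented for illustration.

```python
import random

def prob_best(records, games=162, talent_sd=0.05, draws=20000, seed=0):
    """Monte Carlo estimate of P(team has the highest true talent)."""
    rng = random.Random(seed)
    noise_var = 0.25 / games                      # sampling variance of a win pct near .500
    shrink = talent_sd ** 2 / (talent_sd ** 2 + noise_var)
    post_sd = (shrink * noise_var) ** 0.5         # posterior SD, same for every team
    post_mean = {t: 0.5 + shrink * (w - 0.5) for t, w in records.items()}
    wins = dict.fromkeys(records, 0)
    for _ in range(draws):
        # Draw one plausible true talent per team and see who comes out on top.
        best = max(records, key=lambda t: rng.gauss(post_mean[t], post_sd))
        wins[best] += 1
    return {t: n / draws for t, n in wins.items()}

# Hypothetical winning percentages, not real team records.
records = {"A": 0.630, "B": 0.600, "C": 0.550, "D": 0.500}
probs = prob_best(records)
for team, p in probs.items():
    print(team, p)
```

The regression toward .500 (`shrink`) is what keeps the team with the best record from being an overwhelming favorite to be the true best team.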

These are semi-rhetorical questions. You don't need to answer them. They probably make little sense anyway. As I said, very interesting premise. If you increase your sample size, and control for "making the playoffs," could we not get some idea as to whether there are indeed other significant factors that contribute to winning in the post-season that don't during the regular season, or vice versa?

Posted 9:22 p.m., January 6, 2004 (#9) - Alan Jordan
  AED

When you say Gaussian error function, do you mean the cumulative as in probit (normit)?

Also how do you KNOW that errors are normally distributed? That strikes me as more of an assumption since they are unobservable.
Have you run some sort of specification tests to assess that?

I'm not sure that the usual B-T or logistic rankings that are done in a non-Bayesian system are a fair comparison to your system or one that allows priors, because most of the priors that I've seen push the estimates toward the mean. I think if you ran a probit model the way they usually run the B-T or logistic you would get the same effect.

Posted 9:55 p.m., January 6, 2004 (#10) - Guy
  Back to Tom:
Another way to look at this is to assume the 8 playoff teams always include the likely best team, and the cumulative percentage for the 8 teams is close to 100%. Then if the WS had been perfect at identifying the best team in these 12 seasons, the WS champs would have an average p of .44. If the WS were no better than a coin toss, the WS champs would have an average p of about .12. In these seasons it was actually 15% -- just a little better than the coin.

Posted 12:18 a.m., January 7, 2004 (#11) - AED (homepage)
  Alan, actually I have made extensive tests. You can see one such plot on my homepage, in 'predicting games', which shows actual winning percentages as a function of ranking difference, plotted against the prediction from the Gaussian model. In all sports, the B-T model overpredicts the number of major upsets. This may not matter as much in pro sports, where there is more parity, but it indicates that the Gaussian distribution is more accurate in general.

Whether you use the normal distribution or the (cumulative) error function depends on which problem you're answering. The odds of one team beating another would use the cumulative function; the odds of getting a specific game result would be based on the Gaussian probability. I tend to use the term 'error function' as synonymous with the cumulative because that's how it's named in the C library (erf() and erfc()).

Again, there isn't a huge difference between Gaussian and B-T models in pro sports; it's just that the Gaussian *is* easier to program and it's also more accurate. Like I said, a pet peeve.

Posted 9:31 a.m., January 7, 2004 (#12) - Tom T
  Guy,

One thing to remember is that, prior to 1997, the AL and NL teams never played each other and had no common opponents, so, really, all you can do in that case is identify the probability of being the best team in the AL and the probability of being the best team in the NL, unless you have some basis for making an assumption about the relative strengths of the two leagues.

In fact, from 1990 - 1996, 7 of the 12 World Series participants were identified by this system as having the highest probability of being the best team in their own league. I think that's actually pretty good.

Posted 10:42 a.m., January 7, 2004 (#13) - Alan Jordan
  AED,

Frey's model uses game wins as a dependent variable and the discussion centers on his model, so that's all I'm interested in at the moment.

Given the graph that I saw
http://www.dolphinsim.com/ratings/info/predicting.html

There's nothing wrong with the specification of the cumulative normal distribution, but what would the graph look like if you used the logistic distribution which is what the b-t model uses? The two distributions are so similar, that you might get approximately the same fit.

Yes, I understand that the calculations are simpler, but I don't see that the logistic is wrong.

Also the b-t models that I've seen don't use priors or any sort of Bayesian approach, so I would expect them to overfit the data. My question is whether a b-t (logistic) and a cumulative model (probit) that both don't use any priors would overpredict upsets. Also, if they both used priors, would either overpredict the number of upsets?

It seems to me that Frey's model shouldn't overpredict major upsets as much as the one I usually use (simple logistic using dummy variables for teams and no priors). My estimates would be too extreme and therefore predictions would be too extreme (I don't care because I'm focusing on rankings, not probability estimations per se). I doubt my model would improve by switching to the probit function.

If I wanted to focus on probability estimation, I would have to add some sort of prior distribution or Stein estimator or Bayesian model averaging or something to tone down the extreme predictions especially in the early part of the season.

Posted 11:29 a.m., January 7, 2004 (#14) - AED
  The line in that graph is not a "fit" to the data, it is the model prediction plotted against the data. The B-T model would not fit it. There are several other tests you can do. For example, using sets of three teams that have played each other, the winning percentages are better fit with the Gaussian than with the B-T. I spent quite a bit of time deciding which model to use when setting up my system, and the Gaussian consistently outperformed the others. I've always assumed the B-T model is common solely because it is similar to the Gaussian but conceptually simpler.

Using a B-T model with a prior is better than using a Gaussian model without a prior. But using a Gaussian model with a prior is better still.

"I don't care because I'm focusing on rankings, not probability estimations per se"

Actually, for accurate rankings you should still use a prior. Otherwise you are prone to overranking teams with easier schedules.

Posted 3:05 p.m., January 7, 2004 (#15) - Steve Rohde
  One problem that I think would make an accurate analysis of the problem more complicated than is presented is that any team that begins the playoffs is generally not really the same team that started the season. There are almost always in-season trades, minor league call-ups, and injuries. For example, with the trading that occurs around the trading deadline, it has become institutionalized for teams in contention to attempt to improve themselves in the short run. So when you ask the question, who was really the best team, what does that mean -- the best team when?

Under the current system, a team needs to be good enough to make the playoffs, and especially must strive to be the best team as of the beginning of the post season. Moreover, as has already been noted, there are differences in what makes the best team in the post season as opposed to the regular season - for example, having a good 5th starter is irrelevant in the post season.

Posted 3:19 p.m., January 7, 2004 (#16) - Steve Rohde
  Here is another complication. Some years ago, I can't remember when but it may have come from Bill James, I remember seeing a study that suggested that even teams playing identical schedules could have significantly different effective schedule strengths, because of random fluctuations in which opposing starters they faced. There are not necessarily enough games in a season to effectively even out those chance variations.

Posted 12:36 a.m., January 8, 2004 (#17) - Alan Jordan
  "The line in that graph is not a "fit" to the data, it is the model prediction plotted against the data."

When you overlay actual over predicted values, you are visually inspecting goodness of fit. We could divide the data into groups and do a chi-square test. I know you know how to do tests like that, and the chi-square version is called a goodness of fit test. Why do you object to the word fit?

"Actually, for accurate rankings you should still use a prior. Otherwise you are prone to overranking teams with easier schedules."

I don't do any rankings at all until the teams have played at least 30 games. That's more than two years' worth of data for a college football team. By that time even the Tigers have won at least one game and even the Yankees have lost 7, so there's no complete or quasi separation of data points. Teams have generally played each other enough (within the league at least) so that the matrix is invertible without resorting to a generalized inverse or setting a team coefficient to 0 (o.k., I have to set one National League team and one American League team coefficient to 0 until interleague play). Also the effect of scheduling (i.e., a good team does better playing against mediocre teams than against half really good and half really bad teams) doesn't hold up when winning percentages are between 25% and 75% and you have a larger sample size (I ran a Monte Carlo simulation at 100). College football has more unforgiving conditions than baseball.

Posted 12:47 a.m., January 8, 2004 (#18) - Alan Jordan
  "One problem that I think would make an accurate analysis of the problem more complicated than is presented is that any team that begins the playoffs is generally not really the same team that started the season."

That's a more interesting problem. Assuming we can come up with a reasonable model for calculating team strength at different points in the season, do we grade a team on its overall average throughout the season or do we grade them on their strength at the end of the season? Both have merit.

Playoffs tend to reward the team that's strongest at the end of the season while getting into the playoffs depends on your average strength across the regular season. If teams have stable strengths across the season, then there's no conflict. Of course we don't really believe that teams don't change strength across the season.

Posted 12:58 a.m., January 8, 2004 (#19) - Alan Jordan
  "I remember seeing a study that suggested that even teams playing identical schedules could have significantly different effective schedule strengths, because of random fluctuations in which opposing starters they faced."

If the problem is only the starting pitchers then you add coefficients for starting pitchers to the model. At this point using priors becomes more important. But it's not catastrophic statistically. To rank teams you treat each pitcher and team combination as if it were a separate team and then take a weighted average of these for each team to get a team strength.

On the other hand, injuries are harder to deal with. What if 2 or 3 star players are injured? I don't have a problem with docking that team for their poor performance, but a team that beats them shouldn't get many points either. I'm not sure it's big enough to worry about in baseball. In ranking the NFL, though, I would definitely have a Falcons team with Vick and one without Vick.

Posted 4:19 a.m., January 8, 2004 (#20) - AED
  You can't treat the pitcher/team combinations exactly as if they are different teams, since the offense is fairly constant regardless of who is pitching. You would also want to ensure that the prior was not applied separately for each pitcher/team combination, lest the team's offense and fielding be regressed multiple times. But I think if you're going to start looking at who was playing, why not just do the full-bore sabermetric approach by measuring a team's player contributions relative to league average and add them up to give a team ranking? This would be way more accurate than anything done using game scores.

Regarding injuries, I have generally found injuries to be less significant than is commonly thought. If a team loses a key player and starts losing, it is attributed to the injury. If they win without him, it was heart, courage, and/or great coaching. (It's kind of like clutch hitting -- you can always attribute the key hit after the fact but can rarely predict it.) For the season, the Falcons showed about as much game-to-game fluctuation as other teams, and if you really want to read into the weekly performances (at the risk of interpreting noise as signal) you would conclude the turnaround happened during the bye, not at Vick's return.

Posted 8:50 p.m., January 8, 2004 (#21) - Alan Jordan
  "You can't treat the pitcher/team combinations exactly as if they are different teams, since the offense is fairly constant regardless of who is pitching."

I have a set of dummy variables for each team and a set of dummy variables for each pitcher/team. Pitcher/teams that have less than 5 starts get grouped together by team. I have to admit this set up needs priors because there are still quasi separations even at the end of the season and the matrix requires a generalized inverse (some parameters get set to 0).

"Regarding injuries, I have generally found injuries to be less significant than is commonly thought."

I agree and I've never factored injuries into a model. I have left them out of the model because I don't see a simple way of factoring it in without throwing subjectivity into it. If a team is playing better or worse at the beginning or end of the season, an opponent who plays them when they are better should get more points than the team that plays them when they are worse.

Whether a team actually changes strength by any appreciable amount is another question.

"(at the risk of interpreting noise as signal) you would conclude the turnaround happened during the bye, not at Vick's return."

Hypothetically yes, but I'm sure you've studied the effect of bye weeks on performance the next and following weeks. I don't know if you found an effect for the week after the bye, but I doubt you found an effect for the second week after the bye. I did a check a couple of years ago where the dependent variable was win/loss and the independent variable was bye/no. I didn't find anything, but it was only one year's worth of data so maybe the sample was too small.

Luck may be enough to explain a difference between 2-10 without Vick and 3-1 with him (Fisher's exact gives a p value of .063; chi-square is invalid because of the sample size), but the bye week isn't.
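For anyone who wants to check the Fisher's exact figure, here is a minimal sketch that computes it from the hypergeometric distribution using only the standard library. The table is the one quoted above: 2-10 without Vick, 3-1 with him. The function name is just for illustration.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the first row holds x of the col1 "wins".
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table at least as unlikely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= observed * (1 + 1e-9))

p_value = fisher_exact_two_sided(2, 10, 3, 1)
print(round(p_value, 3))  # 0.063
```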

Posted 1:29 a.m., January 9, 2004 (#22) - AED
  Actually, the Falcons' first really good game was the second week after the bye. This is why I said it was risking interpreting noise as signal, since I know of no reason why any statistically significant change in performance might have been associated with that game.

Posted 9:54 p.m., January 9, 2004 (#23) - Alan Jordan
  Here is an interesting model. It's done for College Football and is primarily intended to handle wins/losses, but the author puts forward a modification for handling margin of victory. It has priors, but still uses iteratively reweighted least squares (in this case a penalized maximum likelihood).

AED, I'd be interested in your comments.

http://members.accesstoledo.com/measefam/paper.pdf

He uses a probit. The theoretical justification is that if the process is additive then the Central Limit Theorem kicks in and forces a Gaussian distribution (if I recall correctly, there is a version of the CLT that says that they don't even have to come from the same distribution).

I think also that if Y is the product of a series of variables then Y should be distributed in an exponential? distribution.

Maybe that's a difference in logistic and probit. Probit assumes Y is a sum and logistic assumes that Y is a product. I don't know.

Posted 11:22 p.m., January 9, 2004 (#24) - Steve Rohde (homepage)
  As it turns out, the Bill James study that I referred to in post #16 comes from the 1986 abstract. Rob Neyer refers to it in his column today. (Click on homepage.)

Posted 1:35 a.m., January 10, 2004 (#25) - AED
  Alan, his overall probability equation appears to have been adapted from my own ratings pages, so yes, I think that's a reasonable approach. A few quibbles though.

About the rankings, he fails to account for the distribution of IAA team strengths. Instead, you create a built-in uncertainty for IAA team rankings, a standard deviation of 0.8 or 1.0, and instead of
phi( theta - thetaIAA )
in his 'part 3', it should be
phi[ ( theta - thetaIAA ) / sqrt( 1 + sd^2 ) ]
The effective difference is that, by assuming you know all IAA teams are identically strong, you penalize teams too much for IAA losses. Of course, it would actually be better for him to go ahead and rank IAA, DII, DIII, and NAIA teams so that this isn't necessary. Wolfe's website has full data for all games played.
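The sqrt(1 + sd^2) correction can be checked numerically. The sketch below takes phi to be the cumulative normal (as in a probit), treats the unknown IAA opponent's strength as normally distributed around thetaIAA with standard deviation sd, averages phi(theta - t) over that distribution by brute force, and compares the result with the closed form given above. The particular values of theta, thetaIAA and sd are arbitrary illustrations.

```python
import math

def phi(z):
    """Cumulative standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def averaged_win_prob(theta, theta_iaa, sd, steps=4000, width=8.0):
    """Average phi(theta - t) over t ~ Normal(theta_iaa, sd^2) with a midpoint rule."""
    lo = theta_iaa - width * sd
    h = 2.0 * width * sd / steps
    total = 0.0
    for i in range(steps):
        t = lo + (i + 0.5) * h
        density = math.exp(-0.5 * ((t - theta_iaa) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        total += phi(theta - t) * density * h
    return total

def closed_form(theta, theta_iaa, sd):
    """The corrected expression: phi[(theta - thetaIAA) / sqrt(1 + sd^2)]."""
    return phi((theta - theta_iaa) / math.sqrt(1.0 + sd * sd))

theta, theta_iaa, sd = 0.7, -1.0, 0.8   # illustrative values only
print(averaged_win_prob(theta, theta_iaa, sd))
print(closed_form(theta, theta_iaa, sd))
print(phi(theta - theta_iaa))           # uncorrected version, for comparison
```

The uncorrected value is further from 0.5 than the corrected one, which is why treating all IAA teams as identical makes a loss to one of them look worse than it should.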

Also, his chosen prior isn't particularly well-chosen. One should actually use the distribution of team strengths rather than an arbitrary assumption that the distribution equals phi(x)*phi(-x).

About the potential modifications... Giving different weights to games in a win-loss system is a recipe for disaster in college football. You don't have enough games to have much constraint as it is. I don't think his suggested way for treating margin of victory would work. However, one can convert the game scores into a sigma (team A played X sigma better than team B) and use a Gaussian probability instead of the cumulative probability using the prescription in my info pages.

Posted 1:26 p.m., January 10, 2004 (#26) - Alan Jordan
  I personally love the system because it fits my limitations in mathematics and programming. My knowledge of calculus and matrix algebra is weak. I can program some basic linear algebra in SAS's IML, but I'm lost the minute you begin some nonlinear optimization for maximum likelihood.

I'm particularly good at milking SAS's procs in such a way as to test or estimate stuff.

This system allows me to use proc logistic to estimate team strengths. That I can do.

* I agree about using all the football teams instead of just lumping the IAA together.

* His prior works by popping pseudo games into the data. If you can show me how to do that with your prior, I'll do it. I just don't see how to do it.

* His priors seem to me to function as a shrinkage estimator like the Stein. The strength estimates should be pushed toward 0. I find that attractive.

* I was wondering especially about his margin of victory method.

* I don't see how to do margin of victory your way in SAS. I was thinking you were using the cumulative normal. Could I use a linear model with your margin of victory?

Posted 6:54 p.m., January 10, 2004 (#27) - AED
  As I said, his system isn't all that bad, and appears to be virtually identical to what Wolfe used in 2002 and very similar in design to my own win-loss system. I don't know what SAS is capable of; I do all of my own programming with help from numerical recipes when I need to do minimization.

The prior should be the actual distribution of team ratings from a margin-of-victory rating system. It's close to Gaussian, so in college football you can use a Gaussian with standard deviation 0.9 instead of the single win and single loss against a hypothetical zero team. Better yet, use 1/(1+(x/0.9)^2)^2, which has the same standard deviation but fits the shape of team ranking distributions in all sports pretty well. This requires that you can plug in functions of your own choice within SAS, of course.
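The claim that 1/(1+(x/0.9)^2)^2 has the same standard deviation as a Gaussian with standard deviation 0.9 can be verified numerically. Because the tails fall off slowly, the sketch below uses the substitution x = scale * tan(angle), which turns both integrals into integrals of cos^2 and sin^2 over a finite range; that substitution is a choice made for this sketch, not something from the post.

```python
import math

def prior_sd(scale=0.9, steps=10000):
    """Standard deviation of the density proportional to 1 / (1 + (x/scale)^2)^2."""
    h = math.pi / steps
    norm = 0.0     # integral of the unnormalized density
    second = 0.0   # integral of x^2 times the unnormalized density
    for i in range(steps):
        angle = -math.pi / 2.0 + (i + 0.5) * h
        # With x = scale * tan(angle): density * dx = scale * cos^2(angle) d(angle)
        norm += scale * math.cos(angle) ** 2 * h
        # and x^2 * density * dx = scale^3 * sin^2(angle) d(angle)
        second += scale ** 3 * math.sin(angle) ** 2 * h
    return math.sqrt(second / norm)

print(prior_sd(0.9))
```

The mean is zero by symmetry, so the square root of the second moment is the standard deviation, and it comes out equal to the scale parameter.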

A margin of victory ranking can be accurately approximated as a linear problem.

Posted 8:09 p.m., January 11, 2004 (#28) - Alan Jordan
  SAS is strictly a frequentist statistical program when it comes to its procedures. The procedures for general linear models allow specification of independent variables, dependent variables, weights (or counts) for observations, interaction terms and stepwise parameters. But not priors. One of the selling points of that article is that his method spells out a way of getting frequentist software to do Bayesian estimation.

Even if his method were exactly like yours or other people's, he would still get points for translating such a system into frequentist software. It's really more of a teaching journal than a cutting edge statistical journal. It is consequently more readable to people of lower math background like me.

The whole idea of using penalized maximum likelihood or least squares to do bayesian estimation only works if you know how to add an augmented pseudo data set of a correct or plausible form.

His way (augmented pseudo data set) is pretty simple for wins/losses, but I'm not sure how to do it for margin of victory. I understand that his way for wins/losses translates to a beta distribution with mean a/(a+b) where a & b are the pseudo wins and losses that are added to the augmented data matrix.

Posted 6:57 p.m., January 12, 2004 (#29) - AED
  OK. His "prior" equals phi(x)*phi(-x), which by sheer luck matches the spread of IA football teams pretty well. Can this be made any arbitrary power of that, such as [phi(x)^0.8]*[phi(-x)^0.8]? To adapt this system to baseball, you would want to use [phi(x)^31]*[phi(-x)^31]. If powers can be floating point, you should also use the square root of the probabilities for IAA games (or if not, square everything else).

I seriously doubt his technique could be easily adapted for margin of victory, but as noted earlier it can be done easily as a linear system.

Posted 9:21 p.m., January 12, 2004 (#30) - Alan Jordan
  Phi(x) can be raised to any arbitrary power. Alpha-1 represents a win added to the data matrix and beta-1 represents a loss. If t is the number of teams then you need a matrix of 2tXt to add to the bottom of the data matrix.

Mease acts as if alpha-1 and beta-1 must be expressed in whole numbers. This isn't necessary. You can assign weights of any rational value.

Where he suggests that ties be entered once and wins and losses be entered twice, one can represent ties by adding a win and a loss for that combination of teams and giving them a weight of 1/2. I don't see how he missed that. The two biggest stat packages (and countless others) allow for weights.

Phi^31 would be represented by adding 30 wins and 30 losses to the data matrix for each combination of teams. Again, only a 2tXt matrix needs to be entered. These pseudo games need to be given a weight of 30.

Basically alpha-1 and beta-1 are the weights.
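As a rough illustration of the augmented data described above, the sketch below builds the 2t-by-t block of pseudo games: for each team, one pseudo win and one pseudo loss against a hypothetical zero-strength opponent, each carrying the same weight. The function name, the row layout and the weight of 30 are illustrative assumptions; in practice these rows would be appended to the real game rows and passed, with the weights, to whatever weighted logistic or probit routine is being used.

```python
def pseudo_game_rows(num_teams, weight):
    """Build the 2t-by-t pseudo-game block plus matching outcomes and weights."""
    rows, outcomes, weights = [], [], []
    for team in range(num_teams):
        for outcome in (1, 0):          # one pseudo win, then one pseudo loss
            row = [0.0] * num_teams
            row[team] = 1.0             # opponent has strength 0, so it needs no column
            rows.append(row)
            outcomes.append(outcome)
            weights.append(weight)      # fractional weights are allowed
    return rows, outcomes, weights

rows, outcomes, weights = pseudo_game_rows(num_teams=4, weight=30.0)
for row, outcome, w in zip(rows, outcomes, weights):
    print(row, outcome, w)
```

A tie between two real teams could be handled the same way, as one win row and one loss row for that pairing with a weight of 1/2 each.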

I think I may have found someone who has written a program in SAS's matrix language to handle the linear bayesian estimation for the margin of victory.

I wonder what the prior would mean if I just added pseudo games where each team played

2 std below
1 std below
equal
1 std above
2 std above

where each game was weighted 12 apiece (to add up to 60 as in the wins/losses model).

I wonder if this would be a valid penalized least squares solution for baseball.