Tango on Baseball Archives

© Tangotiger

Clutch Hits - Tango's 11 points to think about --- to understand why we regress towards the mean (February 12, 2004)

I think all those points are accurate (if not precise). If someone wants to correct, or add, feel free.
--posted by TangoTiger at 10:22 AM EDT


Posted 12:59 p.m., February 12, 2004 (#1) - BirdWatcher
  Tango, I admire your courage but question your judgement in reopening the regression to the mean can of worms!! Your 11 Commandments are fine, insofar as we are searching out a best estimate for a group, but as soon as we apply these principles to one specific individual, there are still some fundamental issues which the 2002 thread left unresolved - as summed up in "Late's" closing comment in that thread:

"I'm talking about understanding the true talent of the regulars, the guys who contribute to the game by accumulating 600-700 plate appearances for a number of seasons. I see no reason why their true talent is dependent upon the performance of others."

The bottom line is your methodology will always "drag down" a Barry Bonds and "drag up" a Rey Ordonez when there is no necessary reason why they shouldn't go in the other direction. Your Commandments simply don't address the issue of why, for these specific ball players, it is preferable to regress against the MLB "population" mean instead of their own career "sample" mean.

Even in the case of more limited sample sizes, for example, predicting performance for players entering their second full season who had 400+ PAs in year one, why wouldn't you use as a population mean say, the average change in performance from year one to year two for all first year players in the last 10/20/30/40 years?? Perhaps this is a better estimator than the overall MLB population mean, perhaps not. But, hey, let's talk about it (i.e., selection of the population mean) each time we do one of these runs as opposed to treating the MLB population as some sort of Aristotelian "truth."

Posted 1:08 p.m., February 12, 2004 (#2) - tangotiger
  why wouldn't you use as a population mean say, the average change in performance from year one to year two for all first year players in the last 10/20/30/40 years

I never said not to. I clearly said that if you wanted to take a subpopulation of players (like Arod, Giambi, Delgado), then you can do so, with my provision.

Choosing other similar-aged and experienced players is also a good criterion, and this particular combination was specifically cited a few weeks ago when I presented the "Forecasting Pitchers" article.

Posted 1:09 p.m., February 12, 2004 (#3) - tangotiger
  of their own career "sample" mean.

Again, if you have a large sample mean, then you will regress barely at all towards the pop mean. A player with 1800 PAs will be regressed 10% towards the pop mean. That means a player who performed at +50 runs per year for 3 years is, at our best guess, a +45 runs per year player (with a margin of error).
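
For anyone who wants to check the arithmetic, here is a minimal sketch using the x/(x+PA) regression formula that comes up later in this thread. It assumes x = 200, a value inferred here from the "1800 PAs, 10%" figure rather than one stated in the post, and the function name is made up for illustration.

```python
def regress_to_mean(observed, pop_mean, pa, x=200.0):
    """Shrink an observed rate toward the population mean.

    The fraction regressed is x / (x + pa). The default x = 200 is an
    assumption, chosen because it reproduces "1800 PAs -> 10%".
    """
    regression = x / (x + pa)
    estimate = observed + regression * (pop_mean - observed)
    return regression, estimate


# +50 runs per year over 1800 PAs, measured against a population mean of 0.
regression, estimate = regress_to_mean(observed=50.0, pop_mean=0.0, pa=1800)
print(f"regress {regression:.0%} toward the mean -> {estimate:+.0f} runs per year")
```

With fewer PAs the same function regresses much harder: at 200 PAs the regression would be 50% under this assumed x.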

Posted 2:02 p.m., February 12, 2004 (#4) - dr feelgood
  What is the equation for figuring out how much to regress?

Posted 2:19 p.m., February 12, 2004 (#5) - MGL
  9. Every time you establish a player's true talent level you are actually establishing his probable true talent level. Implicit with this comes a variance. To properly present a player's true talent level, you should provide the mean and the variance. However, because of #3, no one should be silly enough to think that the variance is zero. And, because of #2, no one should be silly enough to think that this won't change day-to-day.

The bottom line is your methodology will always "drag down" a Barry Bonds and "drag up" a Rey Ordonez when there is no necessary reason why they shouldn't go in the other direction.

Tango, you are a brave soul! The only thing that is going to keep you from being blasted for purveying such heresy is that this is Primate Studies and not Clutch Hits and there is no link from a popular, mainstream web-site. ;)

Of course, the above quote is completely wrong and virtually everything you said was completely right. The notion that there is a finite chance that Barry Bonds is not really that good and just got lucky for 10 or 20 years (you know what I mean) or that Rey Ordonez is really a pretty good hitter and just got unlucky for 10 years or so, creates so much cognitive dissonance in most people's minds that no matter what anyone says, some people are just going to think that you are anywhere from somewhat wrong to completely nuts.

Of course, in those 2 cases, the key to the "regression" or the Bayesian analysis is what is the mean and the distribution of true talent in the population of big, strong left-handed hitters, etc., or that of small, wiry, slap-hitting shortstops from the Dominican Republic (or wherever Ordonez is from)?

It is a very, very tricky thing trying to establish the "population" from whence a player is selected or, even if we can, to try and establish the mean and distribution of true talent within that population. For example, even though we can do a pretty good job of figuring out the mean and distribution of true talent in the entire population of MLB players, which is all we need to be able to regress a "random" MLB player's sample stats properly, how do we go about figuring out the true mean and distribution of all "big, strong, smart, left-handed hitters," etc.? That ain't so easy!

Fortunately, as Tango says, if we have lots of PA's for a player, the regressions are not very important anyway. As well, for players with few PA's, we usually don't know much about them anyway, which makes the regressions for their sample stats a lot easier also!

What has never been explained adequately, I don't think, is the relationship between regressing a player's stats toward the mean (in order to estimate his true stats, or at least come up with a weighted average of all his possible true stats), using one of Tango's standard regression equations, x/(x+PA), and doing the same thing (estimating true talent) through a complete, thorough Bayesian analysis. The Bayesian analysis is THE complete, correct way to solve the problem. It basically goes like this: Given that 1% of all MLB players have a true BA of .200, 2% have a true BA of .220, 10%, .230, 3% .300, 1% .330, etc., if we have a random MLB player who is allowed to bat 100 times no matter what, and hits .230, what are the chances that he is a true .200 hitter, a true .220 hitter, a true .314 hitter, etc.? There is one and only one answer and it is as precise as a human being can get. You can express it as "given that he hit .230 in those 100 AB's, there is a 10% chance that he is a true .230 hitter, 1% chance that he is a .290 hitter, etc.," or "there is a 20% chance he is worse than a .230 hitter, 30% chance he is better than a .270 hitter, etc.," or, you can express it as "his most likely (the weighted average of #1 above) true BA is .244, but there is a 50% chance that it is between .234 and .254, and a 70% chance it is between .220 and .260, etc." BTW, those "intervals" I made up do NOT have to be symmetrical around "the weighted mean" or the "best single estimate of his true BA."
The analogy is if we flipped a coin 100 times, we could say that the best estimate of the number of heads that will come up (this is the above in reverse - here we know the true value and are estimating the possible sample values, but it is really the same thing as looking at the sample value and estimating the possible true values) is 50, or we can say that there is a 70% chance that between 40 and 60 heads will come up, or we can say that there is a 3% chance that 50 heads will come up, a 2% chance that 49 or 51 heads will come up, etc.
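
The percentages in the coin analogy read as off-the-cuff illustrations. For a fair coin the exact figures follow from the binomial distribution, and a minimal sketch (standard library only, helper name made up) looks like this:

```python
from math import comb


def binom_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n = 100
p_exactly_50 = binom_pmf(50, n)
p_49_or_51 = binom_pmf(49, n) + binom_pmf(51, n)
p_40_to_60 = sum(binom_pmf(k, n) for k in range(40, 61))

print(f"P(exactly 50 heads) = {p_exactly_50:.3f}")
print(f"P(49 or 51 heads)   = {p_49_or_51:.3f}")
print(f"P(40 to 60 heads)   = {p_40_to_60:.3f}")
```

The exact values come out at roughly 8% for exactly 50 heads and roughly 96% for 40 to 60 heads, so the spread is tighter than the made-up numbers suggest, but the point of the analogy is unchanged.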

The above Bayesian analysis (or some equivalent version of it) is the ONLY perfect way to solve the problem of what is the true BA of a random player who hits .xxx in y number of PA's. And it is perfect. No one can do better, given the conditions. In this case, the conditions are that the player is a random player from the population - we know nothing else about him, and that we know the exact distribution of true BA's among players in the population (1% of all players are true .330 hitters, 5% are true .290 hitters, etc.). Again, if we know the exact distribution of true BA's in the population and we draw a player from that population and "sample" his BA in x number of AB's, we can come up with an exact, precise, and perfect (given what we know) model (the weighted mean and all the various possibilities) for his likely true BA. What you usually see in a projection, which is simply the result of the above Bayesian analysis, of course, is the player's "weighted mean." As I said above, you can see it represented in lots of other ways. Which ways are more useful is a personal choice I guess, but one is not more right than the other. I suppose it is better to have all the information (the weighted mean, plus the confidence intervals, or the weighted mean plus the entire distribution of possibilities), but what you usually see is just the weighted mean (a typical projection). To tell you the truth, the weighted mean is really all you usually need as that is the most useful piece of information, and in any case, the rest (the variance around that mean, or the confidence intervals, or even the entire distribution of possible true values) can usually be inferred.
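
The "long version" described above can be sketched in a few lines. The prior below is invented for illustration (the shares in the post are partial and were made up on the spot), and the function name is hypothetical. It applies Bayes' rule with a binomial likelihood to a player who hits .230 in 100 AB's.

```python
from math import comb

# Hypothetical prior: share of MLB players at each true BA.
# These shares are invented for illustration and sum to 1.
prior = {
    0.200: 0.01, 0.220: 0.04, 0.240: 0.15, 0.260: 0.30,
    0.280: 0.30, 0.300: 0.15, 0.320: 0.04, 0.340: 0.01,
}


def posterior_true_ba(hits, at_bats, prior):
    """Bayes' rule on a discrete prior with a binomial likelihood."""
    weights = {
        ba: share * comb(at_bats, hits) * ba**hits * (1 - ba) ** (at_bats - hits)
        for ba, share in prior.items()
    }
    total = sum(weights.values())
    return {ba: w / total for ba, w in weights.items()}


post = posterior_true_ba(hits=23, at_bats=100, prior=prior)
prior_mean = sum(ba * share for ba, share in prior.items())
weighted_mean = sum(ba * p for ba, p in post.items())

for ba, p in post.items():
    print(f"chance he is a true {ba:.3f} hitter: {p:6.1%}")
print(f"population mean {prior_mean:.3f}, sample .230, weighted mean {weighted_mean:.3f}")
```

The weighted mean lands between the sample BA and the population mean, which is the same behavior a regression toward the mean produces. The full posterior is what lets you quote the intervals as well as the single number.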

So how does this rigorous Bayesian analysis, which I just said is the only perfect way to solve this problem, relate to "regressions" and Tango's handy regression formula? This is important, and if I get anything wrong, perhaps someone like Jordan or AED (or Davis or Hsu) can correct me.

If the distribution of true talent in the population is "normal" (or somewhat normal for all practical purposes), AND the sample distribution (in this case of BA in x number of AB's) is also normal, then we can forego the "long version" of the above Bayesian analysis and use the shortcut version, which is Tango's regression formula. We know that the second part is true for sample BA's, since a batting average is basically a binomial and each event is independent (somewhat), etc., and a distribution of a binomial approximates a normal distribution.

As far as the first part, that the distribution of true BA's in the population is normal or somewhat normal, that's the tricky part. Tango says that it is (if we incorporate playing time). For purposes of doing a Bayesian analysis on a player who has a certain sample BA in order to estimate his true BA or the distribution of possible true BA's, I honestly don't know if we can or should assume a normal distribution of true BA's in the population. If there isn't (a normal distribution), then using a regression formula like Tango's handy one is NOT equivalent to using the rigorous (perfect) Bayesian approach. I know that it comes close, but I don't know how close. The closer the distribution of true BA's in the population is to a normal distribution, the closer the "regression" model is to the real Bayesian model.
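
One way to get a feel for "how close" is to run both methods side by side. The sketch below assumes a normal talent distribution with mean .265 and SD .020 (hypothetical values), computes the full Bayesian weighted mean on a fine grid with a binomial likelihood, and compares it with the x/(x+AB) shortcut, taking x = mean*(1-mean)/SD^2 (the value that falls out of the normal approximation, not a number given in the thread).

```python
from math import exp, log

PRIOR_MEAN, PRIOR_SD = 0.265, 0.020   # hypothetical talent distribution


def grid_posterior_mean(hits, at_bats, steps=3500):
    """Full Bayesian weighted mean: normal prior, binomial likelihood, on a grid."""
    grid = [0.100 + 0.350 * i / steps for i in range(steps + 1)]
    log_w = [
        -0.5 * ((p - PRIOR_MEAN) / PRIOR_SD) ** 2
        + hits * log(p) + (at_bats - hits) * log(1 - p)
        for p in grid
    ]
    top = max(log_w)                      # subtract the max to avoid underflow
    w = [exp(v - top) for v in log_w]
    return sum(p * wi for p, wi in zip(grid, w)) / sum(w)


def shrinkage_estimate(hits, at_bats):
    """Shortcut: regress the sample BA toward the mean by x / (x + AB)."""
    x = PRIOR_MEAN * (1 - PRIOR_MEAN) / PRIOR_SD ** 2
    observed = hits / at_bats
    return observed + x / (x + at_bats) * (PRIOR_MEAN - observed)


results = []
for hits, at_bats in [(30, 100), (150, 500), (330, 1000)]:
    full = grid_posterior_mean(hits, at_bats)
    short = shrinkage_estimate(hits, at_bats)
    results.append((full, short))
    print(f"{hits}/{at_bats}: full Bayes {full:.4f}, shortcut {short:.4f}")
```

Under these assumptions the two agree to within a point or two of batting average. Swapping in a skewed prior for the normal one in `grid_posterior_mean` is a straightforward way to see how the gap grows when the normality assumption fails.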

That's about all I have to say for now. The only other question I have which no one had answered yet, but I know that AED or someone like him has the answer to is this:

If the distribution of true talent IS normal, then we know for a fact that the true correlation, "r", between any 2 sets of independent samples drawn from that population tells us EXACTLY how much to regress to the mean. For example, if we draw 2 sets of 500 AB samples and the true "r" when regressing one sample on the other is .500, then we know that the exact regression is 50% (1-r). Therefore, if a player's sample BA in 500 AB's is .300 and the mean (true) BA in the population is .250, our absolute best estimate of this player's true BA is .275 (.300 regressed 50% towards .250). That we know for a fact. We also know for a fact, that we can do the same thing for any number of AB's as long as we know the true "r" for that number of AB's. We also know (I think) that since all distributions are normal, that we can exactly infer (interpolate) the "r", and thus the proper regression (1-r), for any number of AB's, if we know the true "r" for any one number of AB's.
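
The .300/.250 example and the interpolation to other AB totals can be sketched as below. This assumes the x/(x+PA) form, which AED confirms later in the thread for the Gaussian case, with r equal to one minus the regression. The helper names are made up for illustration.

```python
def x_from_r(r, ab):
    """Solve r = ab / (ab + x) for x, given a known r at one AB total."""
    return ab * (1 - r) / r


def r_for(ab, x):
    """Implied split-sample correlation at any other AB total."""
    return ab / (ab + x)


x = x_from_r(r=0.5, ab=500)   # r = .500 at 500 AB implies x = 500

sample_ba, pop_mean = 0.300, 0.250
estimates = {}
for ab in (100, 250, 500, 1000, 2000):
    r = r_for(ab, x)
    estimates[ab] = sample_ba + (1 - r) * (pop_mean - sample_ba)
    print(f"{ab:5d} AB: r = {r:.3f}, regress {1 - r:.0%}, best estimate {estimates[ab]:.3f}")
```

At 500 AB this reproduces the .275 estimate from the example; at 1000 AB the same x gives r = 2/3, so only a third of the gap to the mean is taken back.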

The question is: is Tango's formula, x/(x+PA), the EXACT way to infer these other "r" values or is it just a quick and dirty shortcut? I have a feeling that it is the latter, and if it is, I'd like to know the true equation.

The other question is, is it true that, given the same parameters as above, that we don't have to do a regression and calculate an "r" as described above - that we can take the expected variance for any number of AB's, assuming no spread of BA talent, and divide that by the observed variance, and that also gives us an EXACT estimate of the "r" and regression (in this case expected variance divided by observed variance equals regression, and "r" equals one minus regression)? And if that is true, how do we calculate the confidence interval (i.e., the "standard error") around that ratio?

If I use the regression method above, I know I can look at a table on the web that tells me what my confidence intervals are on the "r" that I get. If I use the "expected variance divided by observed variance method," assuming that they will yield the exact same results (not counting sample error of course), I don't know how to find out the confidence interval around that result...
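
A quick simulation can at least show the two routes landing in the same place. This is a sketch under stated assumptions, not an answer to the confidence-interval question: talent is drawn from a hypothetical normal distribution (mean .265, SD .020), and each 500 AB sample BA is drawn with the normal approximation to the binomial.

```python
import random
from statistics import mean, pvariance

random.seed(0)
N_PLAYERS, AB = 5000, 500
TALENT_MEAN, TALENT_SD = 0.265, 0.020   # hypothetical talent distribution


def sample_ba(true_ba, ab):
    """One sample BA, using the normal approximation to the binomial."""
    return random.gauss(true_ba, (true_ba * (1 - true_ba) / ab) ** 0.5)


def correlation(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    ss_a = sum((u - ma) ** 2 for u in a)
    ss_b = sum((v - mb) ** 2 for v in b)
    return cov / (ss_a * ss_b) ** 0.5


talent = [random.gauss(TALENT_MEAN, TALENT_SD) for _ in range(N_PLAYERS)]
first = [sample_ba(t, AB) for t in talent]    # one 500 AB sample per player
second = [sample_ba(t, AB) for t in talent]   # an independent second sample

# Route 1: regress one sample on the other; regression = 1 - r.
regression_from_r = 1 - correlation(first, second)

# Route 2: luck-only variance (no spread of talent) over observed variance.
league_ba = mean(first)
expected_var = league_ba * (1 - league_ba) / AB
regression_from_var = expected_var / pvariance(first)

print(f"1 - r                       = {regression_from_r:.3f}")
print(f"expected var / observed var = {regression_from_var:.3f}")
```

Both figures should sit near 0.49 for these inputs, differing only by sampling noise. Rerunning with different seeds gives a rough sense of how much each estimate wobbles.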

Posted 2:31 p.m., February 12, 2004 (#6) - tangotiger (homepage)
  MGL, your question in your last paragraph was asked/answered at the above homepage link.

Posted 4:02 p.m., February 12, 2004 (#7) - Jesus Christ Himself
  MGL, you do a lot of great work, but come on...

Of course, the above quote is completely wrong and virtually everything you said was completely right. The notion that there is a finite chance that Barry Bonds is not really that good and just got lucky for 10 or 20 years (you know what I mean) or that Rey Ordonez is really a pretty good hitter and just got unlucky for 10 years or so, creates so much cognitive dissonance in most people's minds that no matter what anyone says, some people are just going to think that you are anywhere from somewhat wrong to completely nuts.

The point is not that cognitive dissonance arises from regressing these players to the mean. The point is that why should other players' performance lead us to conclude that Bonds' current stats are better than his true talent, and that Ordonez' current stats are worse than his true talent? The average of other players' performances should not influence how we value any particular player.
And it's not just a matter of Bonds and Ordonez. This type of regression is basically saying:
Every above-average player is overrated by their stats, and every below average player is underrated by their stats.

I'm sure you wouldn't endorse this statement, yet that is what the regression does!!!

Posted 4:14 p.m., February 12, 2004 (#8) - tangotiger
  No, because you are forgetting about the confidence interval.

Say you have a league average OBA of .340 and Bonds has a .440 OBA. Because that's a sample, maybe Bonds' true OBA is .430. But, that comes with a confidence interval. It would be better to say that Bonds' true OBA is .430 +/- .030 95% of the time. As you can see, there is a chance, in this example, that Bonds' true OBA is actually higher than his sample OBA.

What is being said is that, on average, players above the mean got more good luck than bad luck. Not all players, of course. Again, in my above example, there is a chance that Bonds got more bad luck than good luck.

Perhaps what we should show is a line like:

Bonds: player
.440: sample OBA
.430: true OBA
.015: 1 standard deviation
20%: chance that true OBA is greater than sample OBA

(Numbers for illustration only.)

Would this make it easier to swallow?
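
The last entry in that display can be computed from the others if you are willing to assume the uncertainty around the true OBA is normal (an assumption made here, not something stated in the post). A minimal sketch:

```python
from math import erf, sqrt


def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal distribution with mean mu and SD sigma."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))


sample_oba, true_oba, sd = 0.440, 0.430, 0.015   # illustrative numbers from the post

p_true_above_sample = 1 - normal_cdf(sample_oba, true_oba, sd)
low, high = true_oba - 2 * sd, true_oba + 2 * sd

print(f"true OBA estimate {true_oba:.3f}, roughly 95% between {low:.3f} and {high:.3f}")
print(f"chance true OBA is above the sample OBA: {p_true_above_sample:.0%}")
```

Under the normal assumption the chance comes out near 25% rather than the illustrative 20%, which only reinforces the point: a regressed estimate still leaves real room for the player to be better than his sample.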

Posted 4:38 p.m., February 12, 2004 (#9) - AED
  Every above-average player is overrated by their stats, and every below average player is underrated by their stats.

Such a statement isn't entirely inaccurate. It's easy to get confused here. A player whose "true talent" OBA is 0.400 is most likely to have an OBA of 0.400. However, because of sample biases (there are more mediocre players than great ones), a player who had an OBA of 0.400 probably has a "true talent" that is less than 0.400.

MGL, the x/(x+PA) equation is indeed the correct solution in the case that the Gaussian approximations are valid (or are reasonably close to valid).

Posted 6:04 p.m., February 12, 2004 (#10) - MGL
  Every above-average player is overrated by their stats, and every below average player is underrated by their stats.

JC, yes, if I may re-state what AED just said, the above is almost, but not quite, a brilliant statement! I mention it about a dozen times a year.

Let me make it right:

Any player who has stats higher than the mean of whatever population they come from is overrated, on the average, and every player who has stats lower than that mean is underrated, on the average!

That is one of the most important things to remember in analysis of baseball stats (and of course in many other areas). Oh, and that assumes a Gaussian (normal) or at least a symmetrical distribution of talent in the population. I'm not sure if that statement is always correct for "weird" distributions of talent.

As Tango is trying to explain, that doesn't mean that EVERY player is over or underrated. It means that for EVERY given player, our best estimate of his true stats is ALWAYS between his actual stats and the mean stats from his population. It is a certainty, given the parameters I have just described. Of course, it is up to you to try and figure out what his population is, and then what the distribution of true talent in that population is. Sometimes it is easy and sometimes it is not. Most of the time in baseball, we can make a reasonable estimate.

Tango also keeps trying to explain that it might be better (easier to understand?) to stop presenting just one number and one number only to represent the best estimate of a player's true stats, although it IS correct to present that number to answer the question "what is the best estimate..."

So another way to present that above statement in bold - one that might be easier to digest for those not statistically savvy - is:

Any player who has stats higher than the mean of whatever population they come from is more likely to be overrated than underrated and every player who has stats lower than that mean is more likely to be underrated than overrated!
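
The "more likely overrated than underrated" wording is easy to check by simulation. The sketch below assumes normally distributed talent and luck with made-up spreads (population OBA .340, talent SD .025, luck SD .030), then looks only at players whose sample OBA is well above the mean.

```python
import random
from statistics import mean

random.seed(1)
POP_MEAN, TALENT_SD, LUCK_SD = 0.340, 0.025, 0.030   # hypothetical OBA values

players = []
for _ in range(20000):
    true_oba = random.gauss(POP_MEAN, TALENT_SD)
    sample_oba = random.gauss(true_oba, LUCK_SD)      # talent plus luck
    players.append((true_oba, sample_oba))

# Players whose sample OBA is at least 30 points above the population mean.
above = [(t, s) for t, s in players if s > POP_MEAN + 0.030]

share_overrated = mean(1.0 if s > t else 0.0 for t, s in above)
mean_true_above = mean(t for t, s in above)
mean_sample_above = mean(s for t, s in above)

print(f"players in group:         {len(above)}")
print(f"share with sample > true: {share_overrated:.0%}")
print(f"mean sample OBA {mean_sample_above:.3f}, mean true OBA {mean_true_above:.3f}")
```

Most of the group is overrated by its sample stats but not all of it, and the group's true talent is still comfortably above the population mean. Both halves of the statement hold at once.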

Statisticians have a tendency to speak about probabilities as if they were certainties, or at least they are often interpreted as such. In sampling statistics, there are NO certainties (so to speak). Every "conclusion" in sampling statistics is a probability or series of probabilities. Statisticians also have a tendency to speak of the "mean" when they really mean a set of probabilities. (I am not a statistician, but I tend to do that also.)

When we say that our estimate of a player's true OBA is .330 (given his .340 sample OBA), we really mean that there is a 10% chance it is between .325 and .335, a 5% chance it is between .320 and .325 or .335 and .340, etc. (I'm making those numbers up).

When we "say" .330, we mean the "weighted average of all those probabilities." We also mean, as I stated, that .330 is the "best estimate (if we had to pick one number) of that player's true OBA" (which we don't really know).

I can't explain things any better than that...

Posted 7:03 p.m., February 12, 2004 (#11) - Jesus Christ Himself
  I was also posting as 'Late to the party' in the other thread...

Tango, MGL, and AED: these last three posts are excellent. I completely understand where you are coming from. That doesn't mean that I'm totally convinced, but now I understand your theory and your method (almost) completely. The only problem that remains is with the following statement made by MGL and alluded to by others:

"Any player who has stats higher than the mean of whatever population they come from is overrated, on the average, and every player who has stats lower than that mean is underrated, on the average![italics added for emphasis]"

The "on the average" statement is what I have a problem with. Like I said, it works for the group, but is meaningless at the individual level. It does nothing to improve our knowledge about one particular player. In which case, I would prefer to believe that a player's career stats are a better estimate of true talent than his regressed stats are (with enough PAs of course).

You guys have certainly made an intelligent attempt to defend your point and I am still not 100% convinced. Maybe it's just me.

By the way - tango: I would love to see all data presented as such.

Posted 11:08 p.m., February 12, 2004 (#12) - tangotiger
  I think that from now on, I will do my best to present stats with the variance.

Posted 1:56 p.m., February 21, 2004 (#13) - Scott de B(e-mail)
  Not to get all metaphysical, but the first three points posit the existence of a Platonic concept of 'true talent'. We can't measure it exactly, nor can we ever know what it is, it's always changing, but nevertheless it exists.

As a way of visualising how the world works, it might be fine. But it seems to me to be an unprovable claim. How do we get around this?

Posted 2:42 p.m., February 21, 2004 (#14) - tangotiger
  With regression.

Posted 10:51 p.m., February 22, 2004 (#15) - ColinM
  Not that I disagree with you Tango, but I'm laughing my ass off right now!

I'm just picturing one of my long discussions, over a bottle of scotch, with my good buddy who's a PhD in theology. He asks how we can accept the concept of a Platonic universe given all we know about quantum physics, etc...

And I answer, "With regression". The answer to all of cosmology's problems in two words!

Posted 8:01 p.m., February 23, 2004 (#16) - Mike Emeigh(e-mail)
  If the distribution of true talent IS normal, then we know for a fact that the true correlation, "r", between any 2 sets of independent samples drawn from that population tells us EXACTLY how much to regress to the mean.

The distribution of true talent across the entire population of baseball-playing individuals may be normal, but in professional baseball it's some form of a truncated normal (there's a minimum level of sustainable performance below which a player cannot succeed in pro baseball). At the major league level, the distribution of true talent is probably closer to a gamma distribution than anything else; it resembles the tail of a normal distribution (this is supported by AED's observation that there are many more mediocre players than great players, and the general observation that there are no players who are as far below the mean as the best players are above the mean). For that reason, I'm not convinced that the normal distribution is an appropriate model of the prior distribution of true talent at the major league level.

-- MWE
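
The truncation argument can be illustrated with a toy simulation. This is only a sketch: talent is put on an arbitrary scale where everyone who plays baseball is N(0, 1), and the +2 SD cutoff for reaching the majors is an assumption chosen for illustration. It shows the shape that results, not a fit to real MLB data.

```python
import random
from statistics import mean, pstdev

random.seed(2)
CUTOFF = 2.0   # assumed minimum talent (in SDs) needed to stay in the majors

everyone = [random.gauss(0.0, 1.0) for _ in range(200_000)]
majors = [t for t in everyone if t > CUTOFF]


def skewness(values):
    """Sample skewness: 0 for a symmetric distribution, > 0 for a long right tail."""
    m, s = mean(values), pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)


m = mean(majors)
print(f"skewness, whole population: {skewness(everyone):+.2f}")
print(f"skewness, majors only:      {skewness(majors):+.2f}")
print(f"best player is {max(majors) - m:.2f} above the majors mean; "
      f"worst is {m - min(majors):.2f} below it")
```

The truncated group is clearly right-skewed, with the best players much farther above the group mean than the worst are below it, which is the pattern described in the post.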

Posted 9:40 a.m., February 24, 2004 (#17) - tangotiger (homepage)
  If you include the playing time component, then you are pretty close to it.

See above homepage link.