Tango on Baseball Archives

© Tangotiger

BGIM : Maximum Likelihood Estimation Primer (December 26, 2003)

Worth reading, again and again, for those who are math-inclined.
--posted by TangoTiger at 05:32 PM EDT


Posted 11:47 p.m., January 2, 2004 (#1) - Tangotiger
  I'm hoping some of the usual stat-savvy suspects chime in here:

Let's say it's a given that I know the true talent distribution of players in a league: it is centered at .340 OBA, with 1 standard deviation = .040.

Now, you have a player who has a .440 OBA in 600 PA, and I ask the question: what's his most likely true talent OBA?

I can think of 2 ways to find that answer:
1 - Find the observed OBA of all the players, establish the variance of those samples, compare it to the known true variance, and regress such that sigma_true^2 + sigma_luck^2 = sigma_observed^2

2 - Choose the a priori value throughout the true known distribution, and come up with a best guess true OBA based on the observed of .440 over 600 PA (say with a 95% confidence level). Take the weighted average of the results based on the true known distribution. (Is this known as MLE?)

Is what I am saying making sense? If #2 is correct, how do you set up your equations to find the answer? And are the results of #2 going to be close to #1?

Posted 3:04 a.m., January 3, 2004 (#2) - AED
  If you actually know the true talent distribution, it's pretty simple. The probability of any player having an ability of x is a Gaussian centered on 0.340 with sigma of 0.040. The probability of a player with ability x having 0.440 in 600 PA is proportional to:
x^264 * (1-x)^336
which is roughly a Gaussian centered on 0.440 with sigma of 0.0237. Multiplying the two probabilities together, you get a Gaussian centered on 0.414 with a sigma of 0.020. (Basically you're weighting each factor by 1/sigma^2.)

Your second option will produce a very similar result. In fact, if the Gaussian approximations are correct, the expectation value will exactly equal what you get from #1. If you are dealing with highly non-Gaussian situations, you have to choose whether to take the mean (as you suggest here) of the probability distribution, the median (which I prefer since it's scale-invariant) or the mode (which is MLE).
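
For anyone who wants to check option #2 numerically, here is a minimal sketch. It assumes the Gaussian prior (mean 0.340, sigma 0.040) and the exact binomial likelihood x^264 * (1-x)^336 from the post above, evaluated on an evenly spaced grid of candidate true OBAs. The function names and the grid size are illustrative choices, not anything from the thread.

```python
import math

def grid_posterior(prior_mean, prior_sigma, successes, trials, steps=10000):
    """Return candidate abilities and their normalized posterior weights."""
    xs = [i / steps for i in range(1, steps)]  # skip 0 and 1 to keep log() finite
    failures = trials - successes
    # Work in logs: x^264 * (1-x)^336 underflows if computed directly.
    log_w = [
        successes * math.log(x)
        + failures * math.log(1.0 - x)
        - (x - prior_mean) ** 2 / (2.0 * prior_sigma ** 2)
        for x in xs
    ]
    peak = max(log_w)
    weights = [math.exp(lw - peak) for lw in log_w]
    total = sum(weights)
    return xs, [w / total for w in weights]

def summarize(xs, probs):
    """Mean, median and mode of a discrete distribution on the grid."""
    mean = sum(x * p for x, p in zip(xs, probs))
    mode = xs[max(range(len(xs)), key=lambda i: probs[i])]
    running = 0.0
    median = xs[-1]
    for x, p in zip(xs, probs):
        running += p
        if running >= 0.5:
            median = x
            break
    return mean, median, mode

xs, probs = grid_posterior(0.340, 0.040, successes=264, trials=600)
mean, median, mode = summarize(xs, probs)
print(f"mean={mean:.4f} median={median:.4f} mode={mode:.4f}")
```

Because the posterior here is close to Gaussian, the three summaries should land almost on top of each other. They would only separate noticeably in the highly non-Gaussian cases mentioned above.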

Posted 5:15 p.m., January 5, 2004 (#3) - tangotiger
  AED, this is great stuff! I'm not sure where you get all your numbers, but walking through it, here's what I've got:

Weighting by 1/sigma^2, we end up weighting the population mean by 26% and the observed mean by 74%. That gets us to .414. The standard deviation would be sqrt(.414*.586/600) = .020.

I get where you get this:
x^264 * (1-x)^336
where 264 = 600*.44, and 336 = 600*.56
but I don't get where the sigma of .0237 comes from.

Thanks again... it's been very enlightening!

Posted 5:55 p.m., January 5, 2004 (#4) - J Cross
  Tango, did you get 1 SD = .040 from somewhere? That's higher than I'd expect.

Posted 8:26 p.m., January 5, 2004 (#5) - Tangotiger
  Numbers for illustration only.

Posted 2:52 a.m., January 6, 2004 (#6) - AED
  You're right, I messed up. The variance from the binomial approximately equals OBA*(1-OBA)/PA, which gives a standard deviation of 0.0203, not 0.0237.

Putting in the correct number, the most likely true talent is the average weighted by 1/sigma^2, or 0.420.

The standard deviation equals 1/sqrt(1/sigma1^2 + 1/sigma2^2) = 0.018.
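
A short sketch of the corrected arithmetic, for anyone following along. It assumes both the talent distribution and the binomial "luck" can be treated as Gaussians, as in the posts above. The helper names are made up for illustration.

```python
import math

def binomial_sigma(rate, trials):
    """Standard deviation of an observed rate: sqrt(p * (1 - p) / n)."""
    return math.sqrt(rate * (1.0 - rate) / trials)

def combine_gaussians(mean1, sigma1, mean2, sigma2):
    """Combine two Gaussian estimates, weighting each by 1/sigma^2."""
    w1 = 1.0 / sigma1 ** 2
    w2 = 1.0 / sigma2 ** 2
    mean = (w1 * mean1 + w2 * mean2) / (w1 + w2)
    sigma = 1.0 / math.sqrt(w1 + w2)  # same as 1/sqrt(1/sigma1^2 + 1/sigma2^2)
    return mean, sigma

prior_mean, prior_sigma = 0.340, 0.040   # known talent distribution
observed, pa = 0.440, 600                # the player's season

luck_sigma = binomial_sigma(observed, pa)
post_mean, post_sigma = combine_gaussians(prior_mean, prior_sigma,
                                          observed, luck_sigma)

print(f"luck sigma      = {luck_sigma:.4f}")
print(f"posterior mean  = {post_mean:.3f}")
print(f"posterior sigma = {post_sigma:.3f}")
```

Run as written, this should reproduce the 0.0203, 0.420 and 0.018 figures given in this post.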

Keep in mind that you shouldn't do this for every season's worth of stats if you're trying to make projections from 3 years of data.

Posted 9:56 a.m., January 6, 2004 (#7) - tangotiger
  This is some good stuff! Thanks again.

I looked at the OBA, and the stdev is around .030. The regression towards the mean equation, to match the above, would be rr=270/PA, r=rr/(1+rr). I've been using 250 instead of 270. Interestingly, when I use .340 as my observed OBA, the numerator in the regression equation becomes 250. In fact, it seems that how much you would regress (to best-fit to the above) depends on how far from .500 the observed OBA is (as well as on the population variance).

In any case, given the league mean of .340, and having an observed OBA of between .250 and .500 (and regardless of PA), the "270" works out to between 210 and 280.

Therefore, if you are looking for a quick regression towards the mean equation, use 250.

A .340 OBA with 250 PAs will give you a standard deviation of .030. This will match the league population standard deviation. Since the two sigmas now match, this means the population mean and the observed mean will be equally weighted. That is, you would regress the observed by 50%.

rr=250/PA, or in this case rr=1
regression rate = rr/(1+rr) = .50
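
The quick equation can be sketched in a few lines. The function names are hypothetical, and the default numerator of 250 is only the rule-of-thumb value for OBA from this post.

```python
def regression_rate(pa, numerator=250.0):
    """Fraction to regress toward the league mean: rr / (1 + rr), rr = numerator / PA."""
    rr = numerator / pa
    return rr / (1.0 + rr)

def regress_to_mean(observed, pa, league_mean, numerator=250.0):
    """Pull the observed rate toward the league mean by the regression rate."""
    r = regression_rate(pa, numerator)
    return observed - r * (observed - league_mean)

# At 250 PA the two sigmas match, so the observed rate is regressed halfway.
print(regression_rate(250))

# The .440 hitter over 600 PA, with a league mean of .340.
print(round(regress_to_mean(0.440, 600, 0.340), 3))
```

Note that rr/(1+rr) is the same as numerator/(numerator + PA), so the numerator acts like a number of league-average PAs added to the player's line.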

I suppose then it would be rather simple to come up with regression equations for a whole bunch of metrics. Figure out what the population variance is. Figure out how many PAs it takes for the binomial variance to equal that population variance. That would be your numerator in the rr equation.

(With all the provisions I've already noted.)

For example, if the league BA is .27 and the sigma is .025, then the numerator would be 316. If the SLG is .4 and the sigma is .04, then the numerator is 153.
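
A rough sketch of that recipe: setting the binomial variance p*(1-p)/PA equal to the population variance sigma^2 and solving for PA gives numerator = p*(1-p)/sigma^2. The function name is made up, and treating SLG as a plain binomial rate is a simplification, so the output should only be expected to land near the numerators quoted above, not to match them exactly.

```python
def regression_numerator(league_rate, population_sigma):
    """PAs at which binomial variance p*(1-p)/PA equals the population variance."""
    return league_rate * (1.0 - league_rate) / population_sigma ** 2

# League rate and population sigma for each metric, taken from the posts above.
metrics = [
    ("OBA", 0.340, 0.030),
    ("BA", 0.270, 0.025),
    ("SLG", 0.400, 0.040),
]

for label, rate, sigma in metrics:
    print(f"{label}: numerator ~ {regression_numerator(rate, sigma):.0f}")
```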

AED, thanks for the sabermetric orgasm!