Tango on Baseball Archives

Marcel, The Monkey, Forecasting System (December 1, 2003)

Someone asked, so here it is...

The Marcel the Monkey Forecasting System:

I use 5/4/3/2 for hitters and 3/2/1/2 for pitchers.

(I defy anyone to come up with a forecasting system that can be explained in 1 line or less! And, this system is as accurate as just about anything else, and it still leaves you with enough time to have fun every weekend during the off-season.)

Explanation:
For 2004, this corresponds to:
2003 / 2002 / 2001 / LgAvg

(This last component is regression towards the mean.)

So, weight the 2003 performance at "5", the 2002 at "4", the 2001 at "3", and use the league average and set the weight of that at "2". For league average, use 600 PA for hitters and pitchers.

If you want to include age, calculate age as 2004 minus year of birth. If it's 28 or 29, don't adjust. If he's 27 or younger, add 5% to each component (and 10% for ER,R of pitchers). If he's 30 or older, reduce 5% of each component (and 10% for ER,R of pitchers).
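A minimal Python sketch of the age rule above. The function name `age_factor` and the `pitcher_runs` flag are hypothetical, and the rule is read literally as a multiplier on each projected component; the post does not spell out whether "add" should raise or lower a pitcher's ER and R, so treat that direction as an assumption.

```python
def age_factor(birth_year, forecast_year=2004, pitcher_runs=False):
    """Multiplier for one projected component under the stated age rule."""
    age = forecast_year - birth_year
    # Assumption: 10% applies only to a pitcher's ER and R, 5% to everything else.
    step = 0.10 if pitcher_runs else 0.05
    if age <= 27:
        return 1.0 + step
    if age >= 30:
        return 1.0 - step
    return 1.0  # ages 28 and 29: no adjustment


print(age_factor(1978))                     # age 26
print(age_factor(1975))                     # age 29
print(age_factor(1970, pitcher_runs=True))  # age 34
```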

Provision:

I have plenty of backing for the hitters portion. You can look at the Banner Years article on Primer.

For pitchers, it's just an educated guess. It seems reasonable, FWIW.

I would of course prefer to regress each component differently. But, that would take an extra paragraph, and I don't really think people would care that much.
--posted by TangoTiger at 02:03 PM EDT

Posted 3:31 p.m., December 1, 2003 (#1) - J Cross
In other words:

hitters: .36, .29, .21, .14
pitchers: .375, .25, .125, .25
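These fractions are just the 5/4/3/2 and 3/2/1/2 weights divided by their sums, which can be checked with a short Python sketch (the helper name `normalize` is hypothetical). They describe the actual share of each season only when every season, and the league line, has the same number of PA.

```python
def normalize(weights):
    """Turn raw weights into shares that sum to 1, rounded to 3 places."""
    total = sum(weights)
    return [round(w / total, 3) for w in weights]


print(normalize((5, 4, 3, 2)))  # hitters
print(normalize((3, 2, 1, 2)))  # pitchers
```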

Does a pitcher's 2001 really have less relevance (compared to 2003) than a hitter's 2001?

I, for one, would like to see the components regressed separately.

Posted 3:34 p.m., December 1, 2003 (#2) - tangotiger (homepage)
Jay, click the above link, and that should get you started.

(Note: I didn't know about regression towards the mean at the time I did that, and therefore, the numbers are a little unreliable. When I did the same process for pitchers, the pitchers peaked at age 24, and that's ridiculous. Again, all explained by regression towards the mean.)

Posted 4:01 p.m., December 1, 2003 (#3) - J Cross
Oh, I was just thinking about regressing each component to the mean individually but I guess one should age adjust each component individually as well. I've been to that link before and even considered using it to construct a system that guessed a hitter's "Real Age" based on their hitting components. Just for fun... if that sounds like fun to anyone who isn't deranged.

Posted 4:05 p.m., December 1, 2003 (#4) - Michael
Tango, do you also weight each season by sq_root(PATBF)? Seems like you should.

I also have to say Marcel is one damn smart monkey.

Posted 4:11 p.m., December 1, 2003 (#5) - Michael
That should be PA or TBF. I tried to use the vertical bar character for an or, but it seems to have been filtered from the message.

Posted 4:18 p.m., December 1, 2003 (#6) - J Cross
Michael, I think the seasons should be weighted by PA, not sqrt(PA). Otherwise, if a hitter had 400 PA in 2003 and 100 PA in 2002, each PA in 2002 would be weighted more heavily than each PA in 2003, which doesn't make sense. The weight of each PA in year 1 should not change relative to PAs in year 2 depending on the number of PAs in each year. The weight of all PAs compared to the weight of league average, however, should probably increase with the sqrt(PA).

Posted 4:18 p.m., December 1, 2003 (#7) - tangotiger
I'm not sure I follow. Given the following as an example:
YR,PA,timesOnBase
2003,600,250
2002,400,100
2001,500,150
lg,1,0.333

So, first thing we convert that last line into 600 PA, so now we have
lg,600,200

Then, we multiply by the weight:
2003,600*5,250*5
2002,400*4,100*4
2001,500*3,150*3
lg,600*2,200*2

And we add to get:
total,7100,2500 (with 14 weights)
average,507,179

which gives you an OBA of .353

Now, are you saying instead to use SQRT(600*5) for the first record, and so on? And then, after I get a total, take the square of that? And same for the "timesOnBase"?

If you are, then I wouldn't use 5/4/3/2, but maybe 25/16/9/4, or something else. That is, on my scale, my best-fit was with the 5/4/3/2 system. I'm not sure what my best-fit would be if I used SQRT.

Again, seeing that you have 2 great systems with DMB and PECOTA, and they are very intricate and pay a lot of attention to detail... and since Marcel does just as well, I don't see much benefit to adding all that complexity.... unless of course I would charge you for it, and therefore, precision would be a nice thing.

(I also didn't mention anything about putting stuff to league average, but again, no big deal if you do it or not.)
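The worked example in this post can be sketched in Python as below. The function name `marcel_rate` and its parameters are hypothetical; the weights (5/4/3 for the seasons, 2 for a 600 PA league line) are the ones described in the post. Summing the four weighted lines this way gives 7300 weighted PA and an OBA of about .342.

```python
def marcel_rate(seasons, lg_rate, weights=(5, 4, 3), lg_weight=2, lg_pa=600):
    """seasons: (PA, times_on_base) tuples, most recent season first."""
    # Start from the league line: lg_weight seasons of lg_pa at the league rate.
    pa_total = lg_weight * lg_pa
    on_base_total = lg_weight * lg_pa * lg_rate
    for w, (pa, on_base) in zip(weights, seasons):
        pa_total += w * pa
        on_base_total += w * on_base
    return pa_total, on_base_total, on_base_total / pa_total


pa, on_base, oba = marcel_rate([(600, 250), (400, 100), (500, 150)],
                               lg_rate=200 / 600)
print(pa, round(on_base), round(oba, 3))
```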

Posted 4:29 p.m., December 1, 2003 (#8) - J Cross
I'm thinking the weights for hitters should be:

5*2003PA, 4*2002PA, 3*2001PA, 2*600*sqrt[1800/(2003PA+2002PA+2001PA)]

(I also didn't mention anything about putting stuff to league average, but again, no big deal if you do it or not.)

You said that the final weight is for the league average and that the point is to regress to the mean, right?

Posted 4:36 p.m., December 1, 2003 (#9) - tangotiger
I'm thinking the weights for hitters should be:

5*2003PA, 4*2002PA, 3*2001PA, 2*600*sqrt[1800/(2003PA+2002PA+2001PA)]

So, for 1800 PAs, the regression towards the mean remains at 2/14, or 14%. But, for 900 PAs, the regression towards the mean would be 40% under your system, and 25% for Marcel. If I remember right, I would regress about 30-35% with 500 PAs. So, I don't think we need any further adjustment beyond what Marcel tells us.

(I also didn't mention anything about putting stuff to league average, but again, no big deal if you do it or not.)

You said that the final weight is for the league average and that the point is to regress to the mean, right?

I meant to say if you have something like 1987, where the overall league average was much different from 1986 and 1988, that maybe you'd want to adjust against that first. But, again, no big deal for the most part, since most players will be affected the same way, more or less.

Posted 4:38 p.m., December 1, 2003 (#10) - tangotiger
I must have calculated wrong... I think your system says 32% regression. Much better, but I'm not sure that that's correct.

Posted 4:49 p.m., December 1, 2003 (#11) - J Cross
okay, I goofed on the formula. I meant to set it up so that for 1800 PA you'd regress 2/14 (14%) but for 900 PA you'd regress 2*sqrt(1800/900)/14 or about 20%. This would result in regressing 27% for 500 AB's.

Posted 4:58 p.m., December 1, 2003 (#12) - tangotiger
Ahh, but don't forget the number of PAs in the other 3 years are also different.

So, you'd have:
5*300 + 4*300 + 3*300 + 2*1.4*600

So, the regression towards the mean is effectively 32%.

For 500 PAs, that's now:
5*167 + 4*167 + 3*166 + 2*1.9*600, with an effective regression towards the mean of 53%.
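A rough Python check of these effective regression figures. The function name and parameters are hypothetical; `scale_by_sqrt=True` applies the sqrt(1800/totalPA) factor to the league line as proposed in post #8, and `False` leaves plain Marcel. Using the unrounded square root instead of 1.4 and 1.9 gives about 32% and 53%.

```python
from math import sqrt


def effective_regression(pas, weights=(5, 4, 3), lg_weight=2, lg_pa=600,
                         ref_pa=1800, scale_by_sqrt=True):
    """Share of the total weight carried by the league-average line.

    pas: PA per season, most recent first.
    """
    player = sum(w * pa for w, pa in zip(weights, pas))
    scale = sqrt(ref_pa / sum(pas)) if scale_by_sqrt else 1.0
    league = lg_weight * lg_pa * scale
    return league / (player + league)


print(round(effective_regression((300, 300, 300)), 3))   # 900 PA, sqrt-scaled
print(round(effective_regression((167, 167, 166)), 3))   # 500 PA, sqrt-scaled
print(round(effective_regression((300, 300, 300), scale_by_sqrt=False), 3))
```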

Posted 5:03 p.m., December 1, 2003 (#13) - tangotiger
That's the really cool part about the "2" part of the equation. It's really "2*600". So, if the player's total PA over the 3 years is around 300 (5*100 + 4*100 + 3*100), the regression towards the mean is 50%. Put the PAs at 600, and regression towards the mean is now 33%. Put the PAs at 1200, and regression is 20%.

I.e., put things in terms of the ratio of the player to the regression total:

PA*4 : 1200
300*4 : 1200 = 1:1 ratio = 50% regression towards mean
600*4 : 1200 = 2:1 ratio = 33% regression
1200*4: 1200 = 4:1 ratio = 20% regression
1800*4: 1200 = 6:1 ratio = 14% regression

(the "4" is the average of 5/4/3)
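The ratio table converts to a regression share as 1/(1 + ratio). A short Python sketch, assuming the PA are spread evenly over the three seasons so that the average weight of 4 applies (`regression_share` is a hypothetical name):

```python
def regression_share(total_pa, avg_weight=4, lg_weight=2, lg_pa=600):
    """Share of the forecast that comes from the league-average line."""
    ratio = (avg_weight * total_pa) / (lg_weight * lg_pa)  # player : league
    return 1 / (1 + ratio)


for total_pa in (300, 600, 1200, 1800):
    print(total_pa, f"{regression_share(total_pa):.0%}")
```

This prints 50%, 33%, 20% and 14% for the four rows.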

Posted 5:04 p.m., December 1, 2003 (#14) - Michael
Well, the square root comes about if you make the simplifying assumption that a player has some true likelihood of reaching base in each plate appearance (.350, say). Then each plate appearance is a single observation of an independent, identically distributed random variable. This simplifying assumption, while obviously false, is useful to model the data. And the error in the observed on base percentage, relative to the actual on base percentage, should shrink in proportion to one over the square root of the number of observations. I.e., to get a twice as accurate measure you need four times as many observations.
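As a small illustration of that square-root rule, here is a Python sketch using the binomial standard error sqrt(p*(1-p)/n). The .350 figure is the one from the paragraph above; the PA counts and the function name are made up for the example.

```python
from math import sqrt


def obp_standard_error(true_obp, pa):
    """Standard error of an observed OBP under the i.i.d. binomial assumption."""
    return sqrt(true_obp * (1 - true_obp) / pa)


se_150 = obp_standard_error(0.350, 150)
se_600 = obp_standard_error(0.350, 600)
# Four times the plate appearances should halve the standard error.
print(round(se_150, 4), round(se_600, 4), round(se_150 / se_600, 2))
```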

To use Tango's numbers: first of all, there is a calculation error. You should get 7300, not 7100, and as a result an OBP of .342, not .353. (Maybe the formula is already too complicated? :)

A different way to use the weights to estimate obp would be to just use the obp each year. That is do:

2003, .417
2002, .25
2001, .3
lg, .333

and use the 5,4,3,2 on that so (5*2003obp+4*2002obp+3*2001obp+2*lgobp)/(5+4+3+2).
This gives you an estimate of .332 OBP for the numbers above.
At first that is what I thought Tango meant by the weights above.

What Tango actually means (based on message 7) is:

(5*2003obp*2003PA+4*2002obp*2002PA+3*2001obp*2001PA+2*lgobp*600)/(5*2003PA+4*2002PA+3*2001PA+2*600).
This gives you an estimate of .342 OBP for the numbers above.

What I am suggesting is you instead use sq_root(PA) for each of those numbers above:
(5*2003obp*sr(2003PA)+4*2002obp*sr(2002PA)+3*2001obp*sr(2001PA)+2*lgobp*sr(600))/(5*sr(2003PA)+4*sr(2002PA)+3*sr(2001PA)+2*sr(600)), where sr is sq_root.
This gives you an estimate of .337 OBP and is more consistent with the math behind an observed variable. Using the linear PA in this case overweights the 600 observed PA seasons, as it assumes they are 50% more accurate than the season with 400 observed PA when in reality they are only 22.5% more accurate.
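The three ways of applying the 5/4/3/2 weights discussed here can be compared with a short Python sketch. The names `LINES`, `WEIGHTS` and `weighted_obp` are hypothetical; the data are the example numbers from post #7, with the league line entered as 600 PA at 200/600.

```python
from math import sqrt

# (PA, OBP) per line: 2003, 2002, 2001, then the league line at 600 PA.
LINES = [(600, 250 / 600), (400, 100 / 400), (500, 150 / 500), (600, 200 / 600)]
WEIGHTS = (5, 4, 3, 2)


def weighted_obp(pa_transform):
    """Weighted OBP where each line counts as weight * pa_transform(PA)."""
    num = sum(w * obp * pa_transform(pa) for w, (pa, obp) in zip(WEIGHTS, LINES))
    den = sum(w * pa_transform(pa) for w, (pa, _) in zip(WEIGHTS, LINES))
    return num / den


flat = weighted_obp(lambda pa: 1)     # weights applied to each season's OBP
linear = weighted_obp(lambda pa: pa)  # weights times PA, as in post #7
root = weighted_obp(sqrt)             # weights times the square root of PA
print(round(flat, 3), round(linear, 3), round(root, 3))
```

The three results come out near .332, .342 and .337 respectively.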

Even though Tango did the regression without the square root normalization it wouldn't surprise me if the square root normalization was at least as good as the linear PA weighting (unless the data set was pruned in a systematic way for short seasons) since for many players they will be balanced where all the years will have about the same number of PA. And where they are unbalanced they should be unbalanced in an unsystematic way so the weights shouldn't suffer. The only thing that might be incorrectly set for the square root calculation is the 600 PA for league average, but I doubt that will be off by much.

Posted 5:08 p.m., December 1, 2003 (#15) - J Cross
Yep, that's where I goofed with the equation.

I should figure the three year prediction:

the weights should be:

2003:

[5*2003PA/(5*2003PA+4*2002PA+3*2001PA)]*[1-.14*sqrt(1800/totalPA)]

2002:

[4*2002PA/(5*2003PA+4*2002PA+3*2001PA)]*[1-.14*sqrt(1800/totalPA)]

2001:
[3*2001PA/(5*2003PA+4*2002PA+3*2001PA)]*[1-.14*sqrt(1800/totalPA)]

lgAvg:

.14*sqrt(1800/totalPA)

I hope that works out.
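A Python sketch of the four weights as written in this post, mainly to confirm that they sum to 1. The function name `cross_weights` and its parameters are hypothetical. Note that for a very small totalPA the league weight from this formula would exceed 1, a case the post does not address.

```python
from math import sqrt


def cross_weights(pa_2003, pa_2002, pa_2001, base=0.14, ref_pa=1800):
    """Return (w2003, w2002, w2001, w_league) for the proposed scheme."""
    total_pa = pa_2003 + pa_2002 + pa_2001
    league = base * sqrt(ref_pa / total_pa)
    weighted = (5 * pa_2003, 4 * pa_2002, 3 * pa_2001)
    denom = sum(weighted)
    # The seasons split whatever weight is left after the league share.
    seasons = tuple(w / denom * (1 - league) for w in weighted)
    return seasons + (league,)


for pas in ((600, 600, 600), (300, 300, 300), (167, 167, 166)):
    print(sum(pas), [round(w, 3) for w in cross_weights(*pas)])
```

With 1800, 900 and 500 total PA the league weight comes out near 14%, 20% and 27%.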

Posted 6:02 p.m., December 1, 2003 (#16) - studes (homepage)
That is one helluva monkey.

Posted 6:50 p.m., December 1, 2003 (#17) - David Smyth
So Marcus performs about as well as Mickey? (Yes, I am aware MGL didn't participate; it simply had a nice ring.) Not surprising, because the data necessary to separate *real* changes in ability (apparently Bonds in 2001, perhaps Loaiza in 2003) simply is not available, before the fact, and perhaps will never be. So all of the "small" adjustments made by the "systematic forecasters" (or whatever Tango called them) are drowned out in an ocean of noise. I do believe, however, that the systematic forecasts can be improved from what they are now, using the available data, by more analysis on exactly what "goes into" real unexpected changes in ability, and how to "detect" that from the available data. Or it may turn out that, even with such refinements, the noise is still overwhelming. But I do know, at least, that more work can be done in this area...

Posted 7:09 p.m., December 1, 2003 (#18) - Ryan
Should certain components (BABIP) be regressed more than others (K, HR)?

Posted 9:26 a.m., December 2, 2003 (#19) - Tom N
David Smyth's post is a nice insight that articulates some of my own incomplete thinking on forecasting and sabermetrics. Marcel the monkey works well not because forecasting is inherently simple, but because a simple approach addresses the manageable cases without really worrying about the difficult cases.

PECOTA seems designed specifically to spot breakout and collapse candidates. One of the important assumptions that underlies PECOTA is that adding in non-performance factors like height and weight is going to suddenly reveal something not evident in the performance record so as to be able to spot a breakout by Milton Bradley. Given the inputs though it seems silly to put much faith in PECOTA telling us anything other than that pitchers get injured and that good hitters tend to be good and bad hitters bad.

Imagining that there will be some great statistical means by which one can identify out-of-the-blue breakout candidates is fun, but it's highly improbable. Especially since the breakouts themselves are highly improbable given the previous performance. Meanwhile, mild breakouts pegged by PECOTA, like Kerry Wood in 2003, actually were fairly likely anyhow, since a pitcher that strikes out as many people as Wood and is young is a good possibility of breaking out.

Posted 9:38 a.m., December 2, 2003 (#20) - tangotiger
By the way, in the Forecast Experiment, with the 28 players that were the hardest to forecast based on their up-and-down last 3 years, Marcel beat each of the other 6 systematic forecasters. Of course, since I never put out the equation until after the season was over, it's kind of "back testing", which isn't really fair. We'll see how it does in 2004.

My personal opinion is that the amount of precision that Diamond-Mind, PECOTA, ZiPS et al add is so slight that we don't learn too much. Those systems do have the advantage in using minor league data for the youngsters, and park factors for Coors hitters. You can also argue, slightly, that speedsters and power hitters might follow a different career path. But, all in all, for purposes of say fantasy drafts, Marcel should fit the bill.

Posted 11:16 a.m., December 2, 2003 (#21) - Alan Jordan
Another way of doing a weighted average is to weight them by the inverse of the squared standard error of their mean (inverse of their UNreliabilities). This gives the best (in terms of smallest mean squared error) estimator of the mean.

The standard error of the mean for a batting average is

sqrt(BA*(1-BA)/AB)

Where BA is batting average and AB is # of at bats

and the square of that is simply

BA*(1-BA)/AB

let's call that variance of the error or VE

If you wanted to weight two seasons then you could weight them by the inverse of the variance of their errors. For example if season 1 had BA of 350 and AB of 20, while season 2 had BA of 270 and AB of 300 then the variances would be

VE1=.350*(1-.350)/20=0.011375
W1= 1/VE1 = 1/0.011375 =87.91

VE2=.270*(1-.270)/300=0.000657
W2= 1/VE2 = 1/0.000657 =1522.07

The weighted Mean is
WM1=(1/VE1*BA1 + 1/VE2*BA2)/(1/VE1+1/VE2)=

(1/0.011375*.350 + 1/0.000657*.270)/
(1/0.011375+1/0.000657)=

.274

Now what about weighting season 2 more than season 1?

Assume that you want to weight season 1 by 3 and season 2 by 5 (pick any weights you think appropriate). Then just modify the weights from 1/VE1 and 1/VE2 to 3/VE1 and 5/VE2. The resulting weighted mean is

WM2=(3/VE1*BA1 + 5/VE2*BA2)/(3/VE1+5/VE2)=.273

These two weighted means both factor in the number of at bats and the amount error proportional to their batting average (proportion). You could simplify these by dropping the BA*(1-BA) term. This would leave you with

WM3=(AB1*BA1 + AB2*BA2)/(AB1 + AB2) or

WM4=(3*AB1*BA1 + 5*AB2*BA2)/(3*AB1 + 5*AB2)

Since the variance of the error is proportional to 1/AB, its inverse gives a weight proportional to AB itself, not to the square root of AB.
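The two inverse-variance means from this post (WM1 and WM2) can be reproduced with a short Python sketch. The function names are hypothetical; the numbers are the two example seasons above.

```python
def binomial_variance(ba, ab):
    """Variance of the error (VE) of a batting average over ab at bats."""
    return ba * (1 - ba) / ab


def inverse_variance_mean(seasons, recency=None):
    """seasons: (BA, AB) tuples; recency: optional extra weight per season."""
    recency = recency or [1] * len(seasons)
    weights = [k / binomial_variance(ba, ab)
               for k, (ba, ab) in zip(recency, seasons)]
    return sum(w * ba for w, (ba, _) in zip(weights, seasons)) / sum(weights)


seasons = [(0.350, 20), (0.270, 300)]
wm1 = inverse_variance_mean(seasons)
wm2 = inverse_variance_mean(seasons, recency=[3, 5])
print(round(wm1, 3), round(wm2, 3))
```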

Posted 9:14 a.m., December 3, 2003 (#22) - PhillyBooster
Question:

How would you regress a player with only 1 year of data available?

If Player X emerged fully formed out of Zeus's forehead and hits .360 in 2004, without knowing anything else about him, how would you predict he would hit in 2005?

Posted 9:25 a.m., December 3, 2003 (#23) - tangotiger
The same you would with anyone else, even if you knew his name was Barry Bonds. Because you've decided to only use 1 year of data, even if you have more, you always regress based on the sample size.

In your case, with say 600 PAs, you'd regress about 30-35% (a little less for HR,BB,K, and a lot more for nonHR hits).

Posted 9:40 a.m., December 3, 2003 (#24) - PhillyBooster
Thanks!

Posted 11:38 a.m., December 3, 2003 (#25) - Rally Monkey
Would you give the same age adjustment to a player going from age 25 to 26 as you would for a player going 20 to 21? Is there any reason to think the younger players are making bigger gains from year to year?

Posted 12:04 p.m., December 3, 2003 (#26) - tangotiger (homepage)
If you go to my post#2, you will see that the younger player should get a bigger adjustment.

You may also want to read the article at the homepage link. At the end of the article you will get a great chart to use.

Posted 2:02 p.m., December 4, 2003 (#27) - mathteamcoach
Who is Marcel named after: Marcel Lachemann, Marcelino Lopez or Marcelino Solis?

Posted 2:17 p.m., December 4, 2003 (#28) - tangotiger
Marcel, the Monkey... the monkey from Friends.

Marcel could also be Marcel Dionne.

Posted 3:26 p.m., December 4, 2003 (#29) - mathteamcoach
Koko, I know. Marcel? I would never have known that. I am an element of the set of people who have never sat through an entire episode of Friends. I wonder how many of us there are?

Anyway, now when I play the 20th anniv. edition of Trivial Pursuit, I'll know the answer to the corresponding question.

Posted 5:11 p.m., December 4, 2003 (#30) - Rally Monkey
Marcel from Friends is also the monkey actor who plays me on the Angels' scoreboard.

Posted 2:03 a.m., December 5, 2003 (#31) - MGL
The same you would with anyone else, even if you knew his name was Barry Bonds. Because you've decided to only use 1 year of data, even if you have more, you always regress based on the sample size.

Tango, whoa Nellie on that statement! You may be misleading people. It was significant that PhillyBooster said "out of Zeus' head" - IOW, that we know nothing about the player. These formulas (e.g. Marcel the Monkey) are predicated on not knowing anything else about the player other than his 3-year (or whatever time period) stats. If we know that it is Bonds, then we have more info, so the formula will not necessarily work well as it stands in terms of predicting future performance, if that is the intention. Anytime we know something else about our player we would like to tweak the formula if we can.

People should not forget that these formulas are based on regression analyses or at least something loosely analogous to a regression analysis. The precise way to come up with a "best estimate" of a player's true performance level is to know or estimate the true distributions of talent in the population from which the player comes, and then do a Bayesian analysis based on that and a binomial distribution which models sample error in the sample stats you are working with. A regression equation and these simple equations (like Marcel) work great, but they are still VERY shorthand versions of the more precise Bayesian analysis.

BTW, when you regress to the league average in the Marcel or any other projection formula, if you know some characteristics about the player, such as ht and wt, or even defensive position alone, you can regress to those subsets of players and not the league average as a whole. You can also do something similar for your observation of a player's athleticism or the "sweetness" of his swing. For example, if a player with a lousy swing hits .340 in 300 AB's, it is more likely to be fluke than for a player with a nice swing. Again, you can handle this by regressing the first player's sample BA to the league average for players with lousy swings, etc.

And yes, each component very definitely has different regression coefficients and different age adjustments, just as they have different park factors...

Posted 8:46 a.m., December 5, 2003 (#32) - Tangotiger
MGL, I don't think you read me right. Suppose you are a long-time baseball fan, but you can't remember a number to save your life. And all you have are the 2003 batting stats.

It doesn't matter what his name is... you would still regress the 600 PA hitters around the same way.

Posted 6:14 p.m., December 5, 2003 (#33) - AED
I think the intention here is to make the system as simple as possible. As such, Tango has done a great job.

The variance measurement for batting averages should use the average batting average, not the per-season batting averages. The reason being that random variations significantly outweigh actual changes in ability (aside from the basic age adjustment). The beauty of this is that the same BA*(1-BA) factor appears in all variances, and thus divides out to give exactly
average = sum(AB*BA) / sum(AB).

Strictly speaking, the weighting factors for various years should themselves be functions of the number of at bats. The reason being that this is just an additional source of variance: the year-to-year fluctuation in batting average is a combination of random noise, changes due to average aging, and unmodeled (effectively random) changes. If the random noise is small (many at bats), the amount of variance due to the unmodeled changes is proportionally quite large. If it is large (few at bats), the variance due to unmodeled changes is negligible. In general, the weight for one year's stats would equal:
AB/(1+x*AB*dy)
where AB is the number of at bats, x is related to the year-to-year variance of the skill in question, and dy is the number of years between the year being projected and the year whose stats are being looked at. This is something to worry about in a more complex system, of course; Marcel works fine because the situation most problematic (player doesn't have many at bats in any season) is mitigated by the use of the prior.
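A small Python sketch of this weight, to show how it behaves. The value of x below is purely hypothetical (the post does not give one), as is the function name. With x = 0 the weight is just AB; with x > 0 it flattens out toward 1/(x*dy) as AB grows, so older seasons and very large samples gain less.

```python
def season_weight(ab, years_back, x):
    """Weight AB/(1 + x*AB*dy) for one past season."""
    return ab / (1 + x * ab * years_back)


X = 0.0005  # hypothetical value, chosen only for illustration
for ab in (100, 300, 600):
    print(ab, [round(season_weight(ab, dy, X), 1) for dy in (1, 2, 3)])
```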

The regression of players in the way used here can be very accurate. If player abilities and random errors are distributed normally, in fact, weighting in this way is exactly the same as making a probability analysis. Since the distributions are moderately close to Gaussian, you won't improve the projections noticeably using a more formal analysis.

The one catch is that, by regressing to the league average, Marcel only works correctly for starters. Overall, MLB players regress to the average batting average of MLB players, not to the league average. This is significant, because there are more players with 100-200 at bats (averaging 0.240) than there are with over 500 at bats (averaging 0.285). Because the regression effect is most significant for players with few at bats, I would strongly suggest changing this factor. At minimum, you should determine the relationship between batting average and number of at-bats per season, and use the appropriate value to regress. That makes things a little more complicated, but will greatly increase the accuracy of projections of part-time players.

Posted 7:59 p.m., December 5, 2003 (#34) - Tangotiger
Marcel only works correctly for starters

... but, since we usually only validate for players in year x+1 who have at least 300 PA.... well, you see where I'm going, right? This is just like those MLE programs that try to forecast MLB performance using minor league data. The better the rookie happens to perform, the more PAs he gets... the worse he performs, he gets sent down. And so, the guys who play the most do better and get the higher weight.

No question, if you look at my "Talent Distribution" article, you can create a rather simple function between PAs and true talent level, and you can regress based on that.

I was thinking of changing Marcel so that you would regress to 90% of league average. While this would be a little less accurate for the regulars (they should regress to 105%), it would be more accurate for rookies and sophs.

Posted 8:13 p.m., December 5, 2003 (#35) - Tangotiger
To continue...

If I have a regular with 1800 PAs over the last 3 years, and I have his OBA at .400, and the league is .333 (at 600 PA), this would work out to a forecasted OBA of .390.

For a rookie, he'd by default get .333.

However, if I use say .300 as my "league" or "regression towards the mean", my regular comes in at .386, and my rookie comes in, again by default, at .300.

Against this though is the few number of rookies. So, the regulars should regress to .340, the rookies to .300, but my guess is that if I try to minimize the differences, I'd probably end up using .333 for all players. I think.

Posted 11:43 p.m., December 5, 2003 (#36) - MGL
MGL, I don't think you read me right. Suppose you are a long-time baseball fan, but you can't remember a number to save your life. And all you have are the 2003 batting stats.

It doesn't matter what his name is... you would still regress the 600 PA hitters around the same way.

I know that's what you meant, but I didn't want anyone to be misled. I'm not sure you aren't misleading (some) people again.

Let's take your first sentence above. You say you "can't remember a number," but presumably you have a name (Bonds). Do you know anything about that name, even though you don't know any numbers? Do you know that he is a good player? A great player? His father was a great player? All of these things change the number towards which you regress.

If you know literally nothing about this player named Bonds (he could be named Savkjgbjd for all you know), then isn't it so obvious that it won't affect the regression that it's not worth even mentioning?

There is a tenet in statute interpretation (in law) that says something like "If there are two ways to interpret something and one way (way "A") means that there was absolutely no reason to mention "X" and "X" was in fact mentioned, then interpretation "B" must be assumed."

I assume that you mentioned Bonds with the assumption that our "person" must know something about him otherwise there was no reason to even bring up the fact that he knew the name of the player...

Posted 12:29 a.m., December 6, 2003 (#37) - Alan Jordan
AED, Where do you get this:

AB/(1+x*AB*dy)?

The efficient (minimum variance) weight for an observation when there is heteroskedasticity (systematically unequal variances for observations) is 1/VE, where VE is the variance for that observation.

See Econometric Models & Economic forecasting 3rd ed. by Pindyck & Rubinfeld on page 149-153.

Assuming that we can use the average BA to estimate variance due to the binomial distribution, you get a weight of 1/AB.

Adding in year to year variance, x and a term for lags, dy should get you 1/(AB + x + dy). Where do you get the AB in the numerator and the 1 in the denominator?

Can you expand on this too?

"If player abilities and random errors are distributed normally, in fact, weighting in this way is exactly the same as making a probability analysis."

Posted 10:12 a.m., December 6, 2003 (#38) - Tangotiger
If you know literally nothing about this player named Bonds (he could be named Savkjgbjd for all you know), then isn't it so obvious that it won't affect the regression that it's not worth even mentioning?

No, it is not obvious, considering that someone asked the question. The original question was:

How would you regress a player with only 1 year of data available?

If Player X emerged fully formed out of Zeus's forehead and hits .360 in 2004, without knowing anything else about him, how would you predict he would hit in 2005?

Posted 4:18 a.m., December 7, 2003 (#39) - AED
Where do you get this:

AB/(1+x*AB*dy)?

Alan, I assume that year-to-year variations of a player's true ability, after making average age corrections, are a random walk. So the variance between the 2003 and 2004 abilities will equal x, that between 2002 and 2004 abilities equals 2x, and that between 2001 and 2004 abilities equals 3x. I also multiply the variance by r(1-r), where r is the rate in question, which makes things easier and intuitively makes smaller year-to-year variations when r is close to zero or one.

Putting this together with the variance from the binomial distribution, I get a total random variance of
r*(1-r)/AB + r*(1-r)*x*dy = r*(1-r)/AB * ( 1 + x*AB*dy) ,
where the first term is from the binomial distribution and the second is the random walk in unmodeled year-to-year variations. If there is no prior, the r*(1-r) factor cancels out since it is present in all weights, and the individual terms are weighted by
AB/(1+x*AB*dy).

If there is a prior and you want to keep Marcel simple, you need to fudge by assuming some value of r*(1-r) in the weighting. I'm also fudging the corrections for average aging, of course, which technically should be done on the model value rather than an adjustment to the stats. (Doing the latter screws up the variances ever so slightly.) If you're trying to project stats as accurately as possible to run a major league team, you'd worry about this. If you want a reasonably accurate forecasting system that can be explained in a few sentences, you don't.

Can you expand on this too?

"If player abilities and random errors are distributed normally, in fact, weighting in this way is exactly the same as making a probability analysis."

This is pretty straightforward. Paraphrasing Bayes' theorem, the probability of a player's true (inherent) batting average skill being x equals the probability of his having had his particular batting stats over the past N seasons times the probability of any player like him having a skill of x. If both of these are approximated using Gaussian probability distributions, -2*ln(P) equals:
(x-m)^2/s^2 + sum(i=years) (x-xi)^2/Vi
where m is the mean of "players like him", s is the standard deviation of that group, xi is the player's rate stat (adjusting for average aging) in year i, and Vi is the random variance (Binomial plus random walk aging) in year i. This, of course, simplifies to
x^2*(1/s^2+sum_i(1/Vi)) - 2x*(m/s^2 + sum_i(xi/Vi)) + constants
Solving for the value of x that maximizes the probability (or minimizes this function of -2*ln(P)), you get:
x = (m/s^2 + sum_i(xi/Vi)) / (1/s^2+sum_i(1/Vi))
which is *exactly* a weighted average of the player's past stats and a regression factor. The weights equal 1/Vi for the past stats and 1/s^2 for the regression to the mean.
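The closed form can be sketched in Python and checked against the quadratic it minimizes. The prior term is taken as (x-m)^2/s^2, which is what the expanded form implies. Function names and all input numbers below are hypothetical, chosen only for illustration.

```python
def posterior_mode(prior_mean, prior_sd, rates, variances):
    """x = (m/s^2 + sum(xi/Vi)) / (1/s^2 + sum(1/Vi))."""
    prior_weight = 1 / prior_sd ** 2
    num = prior_mean * prior_weight + sum(x / v for x, v in zip(rates, variances))
    den = prior_weight + sum(1 / v for v in variances)
    return num / den


def neg2_log_p(x, prior_mean, prior_sd, rates, variances):
    """-2*ln(P) up to a constant, under the Gaussian approximations."""
    return ((x - prior_mean) ** 2 / prior_sd ** 2
            + sum((x - xi) ** 2 / v for xi, v in zip(rates, variances)))


# Hypothetical inputs: prior .270 with sd .025, two seasons with their variances.
args = (0.270, 0.025, [0.300, 0.280], [0.0004, 0.0005])
best = posterior_mode(*args)
print(round(best, 4))
# Moving away from the weighted average in either direction increases -2*ln(P).
print(neg2_log_p(best, *args) <= neg2_log_p(best + 0.001, *args))
print(neg2_log_p(best, *args) <= neg2_log_p(best - 0.001, *args))
```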

Posted 11:15 p.m., December 7, 2003 (#40) - Alan Jordan
AED, I screwed up in not one but two places. I got my weights and errors mixed up and I put x+dy instead of x*dy.

If we ignore r*(1-r) then my VE would be 1/AB + x*dy. Just taking the reciprocal gets:

1/(1/AB + x*dy).

multiplying each term by AB gives you:

AB/(1+AB*x*dy)

Which is exactly what you got, so you were right.

This is the part that I *really*, *really* want to talk about.

"This is pretty straightforward. Paraphrasing Bayes' theorem,...
you get:
x = (m/s^2 + sum_i(xi/Vi)) / (1/s^2+sum_i(1/Vi))"

I have had my doubts about how valid the standard way of regressing a rate like batting average to the mean is. The only way that I've ever seen it done is to take a batting average, subtract the mean for that year, and then multiply it by the year to year correlation. This has obvious problems if everyone has different numbers of ABs or PAs, but there is something more insidious that people don't notice. The validity of the approach is based on the idea that the correlation equals the variance of true abilities divided by the total variance. The proof goes something like this:

Assume
P1=mu+e1,
P2=mu+e2,
cov(mu,e1)=cov(mu,e2)=cov(e1,e2)=0,
var(e1)=var(e2)=var(e)

where P1 and P2 are performance for year 1 and year 2, and mu represents the true rate or average for the person, and e1 and e2 represent the error for year 1 and 2.

1. r=cov(P1,P2)/(std(P1)*std(P2))

2. cov(P1,P2)=cov(mu+e1,mu+e2)=
cov(mu,mu)+ cov(mu,e1) + cov(mu,e2) + cov(e1,e2)=
cov(mu,mu)=
var(mu)

3. std(P1)*std(P2)=
sqrt(var(mu) + var(e1))*sqrt(var(mu) + var(e2))=
sqrt(var(mu) + var(e))*sqrt(var(mu) + var(e))=
var(mu) + var(e)

plugging the end results of 2. and 3. back into 1. you get:

4. var(mu)/(var(mu) + var(e))

which is by definition the ratio of the variance of true abilities over total variance. The problem comes when you assume that the process has autoregressive elements to it. If you assume

P1=mu+e1,
P2=mu+u1+e2,
u1 is uncorrelated with mu, e1, and e2

where u1 represents the autoregressive component of the error, then the whole thing falls apart. Cov(P1,P2) still equals var(mu), but the bottom part is:

sqrt(var(mu) + var(e)) * sqrt(var(mu) + var(e) + var(u1))

The bottom part no longer equals total variance. As you can see we have a problem. Using the correlation to forecast isn't a problem, but estimating true ability is.
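To see the size of the gap, here is a small sketch that evaluates both expressions directly from the variances. The function names and the example variances are hypothetical; the formulas are the ones derived above, with var_u standing for var(u1).

```python
import math


def year_to_year_correlation(var_mu, var_e, var_u=0.0):
    """Correlation of P1 and P2 when P2 carries an extra component u1.

    cov(P1, P2) = var(mu); the denominator is std(P1) * std(P2).
    """
    sd_p1 = math.sqrt(var_mu + var_e)
    sd_p2 = math.sqrt(var_mu + var_e + var_u)
    return var_mu / (sd_p1 * sd_p2)


def true_variance_share(var_mu, var_e):
    """Ratio of true-ability variance to total variance in year 1."""
    return var_mu / (var_mu + var_e)


# Made-up variances chosen so the square roots come out even.
share = true_variance_share(4.0, 12.0)                # 4/16
r_plain = year_to_year_correlation(4.0, 12.0)         # no u1 term
r_autoreg = year_to_year_correlation(4.0, 12.0, 9.0)  # with u1
```

With var(u1) = 0 the correlation matches the variance share; once var(u1) is positive the correlation drops below it, so treating r as the share of true variance would understate it.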

Your system seems to be a valid replacement. Even though you take a few shortcuts such as assuming a uniform rate for all players and assuming that errors are normally distributed, this seems to do the job. I suggest you publish it here or at BTN.

Posted 9:39 a.m., December 8, 2003 (#41) - tangotiger (homepage)
There was a good discussion on regression towards the mean at the above link.

I concluded that the regression towards the mean should not be based on rates, but on RATIOS (odds).

If you look at Rob Wood's post near the end of that, I looked for a best-fit to match it, and I end up with the following process:

======================
1 - Take 318 / PA. Call this the "RegressionRatio",
or rr.
2 - RegressionRate = rr / (1+rr)

The 318 would change based on whatever it is you are looking at. I'm not sure of the significance of 318, other than it's close to SQRT(100,000).

I'm sure I stumbled upon something really cool, but as an amateur, I really don't know why it works so well.
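The two steps above can be sketched as follows. The function names are hypothetical, and the second function is an assumption about how the rate would be applied (as the fraction of the way an observed figure is pulled toward the league mean); the post itself only gives the two steps and the 318 constant.

```python
def regression_rate(pa, constant=318.0):
    """Steps 1 and 2: rr = constant / PA, rate = rr / (1 + rr)."""
    rr = constant / pa
    return rr / (1.0 + rr)


def regress_to_mean(observed, league_mean, pa, constant=318.0):
    """Assumed usage: move `observed` toward `league_mean` by the rate."""
    rate = regression_rate(pa, constant)
    return rate * league_mean + (1.0 - rate) * observed


half = regression_rate(318)                  # rr = 1, so the rate is one half
example = regress_to_mean(0.300, 0.260, 318)
```

Algebraically rr/(1+rr) is the same as constant/(constant+PA), so at PA equal to the constant the estimate sits halfway between the observed figure and the mean, and the regression shrinks as PA grows.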

Posted 10:55 a.m., December 8, 2003 (#42) - Alan Jordan
Tango

Rates and odds ratios (along with logits, which are ln(odds ratios)) are usually a different way of saying the same thing. They each have their advantages and disadvantages in this case. AED's method could be modified to use odds ratios, but I suspect it would be more complicated.

AED's method only requires a program that can run an ARIMA (p=1). This is also called autoregression, where residuals are allowed to correlate with themselves as part of the model. It's one of the simplest forms of ARIMA. Actually we need it to include other independent variables, which is called a transfer function.

The basic idea of a transfer function goes like this.

1. Perform a regression using a set of independent variables such as age and possibly injuries or whatever you think appropriate. Save the residuals.

2. Perform a second regression where the residuals are predicted by the residuals of the year before.

I believe the square of the regression coefficient will give you x for the weight AB/(1+AB*x*dy). The way I described is unbiased in large samples, but inefficient (not the most precise). Maximum likelihood is used to solve regressions 1 & 2 at the same time to get estimates that are efficient, but only unbiased in large samples.
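A rough sketch of the two-step version described above, in plain Python with a single independent variable (age) to keep it short. All names and the toy numbers are hypothetical; a real run would use more variables and a proper ARIMA or maximum likelihood routine rather than two separate least-squares fits.

```python
def ols_fit(xs, ys):
    """Simple least-squares line; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(xs, ys):
    """Step 1: regress ys on xs and keep what is left over."""
    intercept, slope = ols_fit(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def lag1_coefficient(resid_prev, resid_curr):
    """Step 2: regress this year's residuals on last year's."""
    _, slope = ols_fit(resid_prev, resid_curr)
    return slope


# Toy example: residuals that carry over at exactly half strength.
prev = [1.0, -1.0, 2.0, -2.0]
curr = [0.5, -0.5, 1.0, -1.0]
ar_coef = lag1_coefficient(prev, curr)
```

The residual lists passed to step 2 have to be paired by player, one entry per player with consecutive seasons.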

I have a strong hunch that it will be a lot simpler to work with means than odds ratios. I'm not even sure what the variance of r/(1-r) is.

Also note that you and Rob Wood were assuming that the only error involved was from the binomial process. AED doesn't make that assumption. He effectively allows a modified version of true ability to move up and down each year. That's probably more realistic. Injuries, one time learning/adjustments to swing, and other temporary changes can't be represented by the binomial part of the error. They are probably swept under that rug, but they don't really belong there.

AED's method is probably a lot more realistic than using the common correlation coefficient.

Posted 11:20 a.m., December 8, 2003 (#43) - tangotiger
To just clarify, while Rob's process did only involve the error from the binomial process, my contention is that if I modify that "318" constant to tailor what I want to regress (HR/PA, nonHRhits / BIP, etc), that I get a function that would work for any type of process. (I'd do that by looking at empirical year-to-year data.)

The complaint by the readers is that you should not regress a guy who went say 90 for 100 to the same "rate" degree as the one who went 40 for 100. The ratio process takes care of this in a more realistic fashion.

That said, I look forward to you and AED discussing this regression topic, and I'll be happy to watch.

Posted 9:40 p.m., December 8, 2003 (#44) - Alan Jordan
Tango,

Several months ago you forwarded an email to me from someone who wanted one equation for regression to the mean that would handle various lags and various ABs/PAs. This may well be it.

If this works, then it can be applied to Pinto's model so the park estimates can be regressed (assuming two or more years of data).

Posted 10:27 p.m., December 8, 2003 (#45) - AED
Tango, the rr/(1+rr) formula is the same as the one I posted on the DRA part 3 thread. rr equals the variance of the player ability distribution divided by the variance from random noise, the latter being proportional to the number of chances.

I tend to think in terms of probabilities, so it doesn't really matter if you use rates, ratios, ln(ratios), or any arbitrary function of the rate. The only difference is that calculations are easier in some scales than others. In this case, I think that it's easiest to stick with the rates, since the binomial statistics dominate the variance, and variance is trivial to calculate if your scale is a rate.

Posted 10:52 p.m., December 8, 2003 (#46) - AED
Alan, about your earlier post... I think the difference is that I'm modeling the autoregression component as a combination of average aging trends and an unmodeled random walk. I don't attempt to model the random walk element, instead treating it as an additional source of variance equal to x*dy in the performance for the year in question (as viewed from the year being forecast). I don't think you would measurably improve the forecast accuracy by using a more sophisticated approach, since it is a small factor compared with the random noise. Also, by modeling in this way, I can treat the yearly stats as being statistically independent.

Posted 12:22 a.m., December 10, 2003 (#47) - Alan Jordan
AED,

What I posted above about the correlation being the ratio of true variance to total variance is based on the true score theory. That's different from what you're doing. One of the problems that I didn't mention in that post is that the true score model predicts that the correlation between 2003 ba and 2002 ba is the same as the correlation between the 2003 and the 2001 ba or for that matter 1990. I can't imagine how that could possibly be the case.

You say that you don't model the random walk part. That's kind of puzzling. Just allowing errors to correlate is a form of modeling them.

BTW, are you estimating the autoregressive coefficient or are you setting it equal to 1? Strictly speaking, if the ar coefficient is anything other than 1, it's not a random walk. I've been assuming that you're estimating the ar coefficient rather than setting it to 1.

I think that this is a superior model for doing regression to the mean than the method currently in circulation. I'm also interested in it for uses other than simple forecasting. For example you can do a logistic regression where the dependent variable is whether the batter gets on base. One set of independent variables is who is batting. The other set is what park it is in. The coefficients for batters give you OBA corrected for park effects in logit form. With a little manipulation they can be transformed from logit to a rate conditional on a certain park, or "average park". I didn't have a way of regressing those coefficients to the mean other than the traditional correlation coefficient. With your method, I can plug in the square of the standard errors where you have the binomial portion. I could also factor out age first. I could also do the same for park factors and pitchers. You get the idea. The point isn't forecasting, but estimating talent/ability.
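The logit-to-rate manipulation can be sketched like this. The function names are hypothetical, and the model form (intercept plus batter coefficient plus park coefficient on the logit scale, with an "average park" taken as a park coefficient of zero) is an assumption about how such a regression would be set up, not something spelled out above.

```python
import math


def logit(rate):
    """Rate -> ln(odds)."""
    return math.log(rate / (1.0 - rate))


def inv_logit(z):
    """ln(odds) -> rate."""
    return 1.0 / (1.0 + math.exp(-z))


def on_base_rate(batter_coef, intercept=0.0, park_coef=0.0):
    """Rate implied by the logistic model for a given batter and park.

    Assumes park coefficients are centered, so park_coef=0.0 stands in
    for an average park.
    """
    return inv_logit(intercept + batter_coef + park_coef)


# A batter whose coefficient corresponds to a .340 OBA in an average park.
coef = logit(0.340)
neutral = on_base_rate(coef)
hitter_friendly = on_base_rate(coef, park_coef=0.05)
```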

Again, I recommend that you publish this as a more robust way of doing regression to the mean.

Posted 12:10 p.m., December 10, 2003 (#48) - AED
Alan, this is what the x*dy term accomplishes. Recall that my random variance is set equal to r*(1-r)*(1/AB+x*dy), where again r is the "true score" and x is an arbitrary constant. (Actually I use something a little more complex than x*dy to make that term a little more accurate, but that's not the point here and x*dy is a pretty good approximation.) The other element of the correlation coefficient is the variance in inherent abilities, which is a constant. The correlation coefficient, of course, is related to the ratio of the two variances, so it is indeed a function of dy.

There are several ways you can handle the element I model as x*dy. One, you can ignore it altogether, which as you note is quite unrealistic. Two, you can actually model it as some sort of random walk whose year-to-year variance equals x. Three, you can model the dy-year variance term as I do (x*dy), which correctly models the variance but doesn't attempt to model the walk itself. In other words, I just care that the variance between 2000 and 2004 is 4x and that between 2001 and 2004 is 3x; I don't take advantage of the additional knowledge that the variance between 2000 and 2001 is only x. Because the variance from random noise is much larger than x, I don't mind the tradeoff. I could be wrong though; I'll have to take a look at it.
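For concreteness, here is the simple x*dy version of the variance term as a small sketch. The function name and the value of x in the example are invented for illustration; the formula is the r*(1-r)*(1/AB+x*dy) approximation quoted above, not the more complex term actually used.

```python
def noise_variance(r, ab, x, dy):
    """Random variance of one season's rate as seen from the forecast year.

    r  -- the player's true rate
    ab -- at-bats in that season
    x  -- per-year variance constant (arbitrary, to be fitted)
    dy -- years between that season and the year being forecast
    """
    return r * (1.0 - r) * (1.0 / ab + x * dy)


# Same 500 AB season viewed from 1 and 3 years later (x is made up).
v_recent = noise_variance(0.300, 500, 0.0005, 1)
v_older = noise_variance(0.300, 500, 0.0005, 3)
binomial_only = noise_variance(0.300, 500, 0.0, 1)
```

Setting x to zero recovers the pure binomial variance r*(1-r)/AB, and a larger dy inflates the variance, so older seasons count for less.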

Posted 12:25 p.m., December 10, 2003 (#49) - tangotiger
AED: I'd be interested in seeing the results of your methodology against Marcel. How much extra accuracy does your process add?

Posted 3:24 p.m., December 10, 2003 (#50) - AED
I didn't enter the 2003 forecasting contest, but running the numbers my average "accuracy" for the 28 players (12*relative OPS error or relative ERA error) was 0.623 using a weighted average with a career baseline and 0.633 using a PECOTA-like comparable player test with a 5-year baseline. Part of the increased accuracy is probably because I include park and league adjustment factors, although I didn't pay attention to see how many players had changed teams. Using a longer baseline also helps the accuracy. On the flip side, I didn't regress to the mean in those projections, which probably adds enough inaccuracy to offset the accuracy gained from the park/league adjustments and extended baseline.

Regardless, my projection isn't significantly more accurate than Marcel, since the dominant source of projection inaccuracy is random noise in the 2003 stats (about +/-0.054 in OPS for 600 PA; about +/-0.63 in ERA for 200 IP). I think a perfect projection, with all abilities known exactly, would have had an average accuracy of around 0.55.

Posted 10:37 p.m., December 10, 2003 (#51) - Blixa Bargeld
is the league average component the league average in 2003, or the average of the 3 seasons?

Posted 10:08 a.m., December 11, 2003 (#52) - tangotiger
Technically, it should be weighted by 5*PA(yrX) + 4*PA(yrX-1) + 3*PA(yrX-2).
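One way to read that weighting, as a sketch: each season's league average gets a weight of 5, 4 or 3 times the player's PA in that season. The function name and the example numbers are hypothetical, and treating PA as the player's own PA per season is an assumption.

```python
def weighted_league_average(league_rates, player_pa, weights=(5, 4, 3)):
    """League average blended across seasons, most recent season first.

    league_rates -- league rate in yrX, yrX-1, yrX-2
    player_pa    -- the player's PA in those same seasons
    weights      -- Marcel season weights for hitters
    """
    season_weights = [w * pa for w, pa in zip(weights, player_pa)]
    total = sum(season_weights)
    if total == 0:
        raise ValueError("player has no PA in any of the seasons")
    blended = sum(sw * rate for sw, rate in zip(season_weights, league_rates))
    return blended / total


# Made-up league rates; equal PA reduces to a plain 5/4/3 weighting.
even = weighted_league_average([0.270, 0.260, 0.250], [600, 600, 600])
rookie = weighted_league_average([0.270, 0.260, 0.250], [600, 0, 0])
```

A player with PA only in the most recent season ends up compared against that season's league average alone.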