Tango on Baseball Archives

© Tangotiger


Accuracy of Run Estimators (September 12, 2003)

Patriot takes a look at a large group of run estimators to establish their accuracy at the team-seasonal level, USING the same inputs. This is great and tremendous work. As expected, virtually all the estimators have the same accuracy at this level.

Now, what's wrong with this picture? Easy. A large set of these estimators were established using the same sample data that Patriot is using. This is most easily shown with the most accurate equation, the regression equation:

.509S+.674D+1.167T+1.487HR+.335W+.211SB-.262CS-.0993(AB-H)

This is so wrong for a couple of these events. Using the 1974-1990 data that I published in the "Runs Created" series, or using the 1999-2002 data that I published on my site, or using a sim, or using a math model, we know that the difference between a single and a double is about .30 runs.
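The gap can be sketched in a few lines of Python. The coefficients are copied from the regression equation above; the ~.30-run empirical gap is the figure cited from the 1974-1990 and 1999-2002 data:

```python
# Coefficients from the best-fit regression equation quoted above.
REG = {"S": 0.509, "D": 0.674, "T": 1.167, "HR": 1.487,
       "W": 0.335, "SB": 0.211, "CS": -0.262, "OUT": -0.0993}

# Gap between a double and a single implied by the regression...
reg_gap = REG["D"] - REG["S"]   # 0.165 runs
# ...versus the ~0.30-run gap from empirical data, sims, and math models.
empirical_gap = 0.30

print(f"regression gap: {reg_gap:.3f}, empirical gap: {empirical_gap:.2f}")
```

The regression puts barely half the empirical distance between a single and a double.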

To say otherwise is to:
1 - Force the numbers to get a better fit
2 - Try to capture other effects that properly belong to variables missing from the equation

What Patriot has demonstrated very well is that all run estimators have a similar best-fit. But, he has not demonstrated whether the equations themselves make any sense (nor was it his intent to do so).

To properly create run estimators, and to properly test them, you have to:
1 - tell the readers what years of data you used to establish your equations
2 - allow your readers to use the data that is NOT part of your sample in #1 to test against

This applies to PECOTA and sim scores and all that as well. Without a sample to test against, all you are showing is that you've gotten a best-fit.

So, what can we do here? I would suggest constructing equations using, say, the odd-numbered years from 1947 to 2002, and testing those equations against the even-numbered years from 1947 to 2002. You can also try the reverse.
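The odd/even scheme can be sketched as follows. This is illustrative Python, not anyone's actual code: `seasons` is a hypothetical dict mapping a year to a matrix of team-season event counts and a vector of actual runs scored.

```python
# Illustrative sketch of the odd/even split-sample test described above.
# `seasons` maps year -> (X, y): team-season event counts and actual runs.
import numpy as np

def fit_weights(X, y):
    """Ordinary least-squares linear weights from events X to runs y."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def split_sample_test(seasons):
    """Fit on the odd-numbered years, report RMSE on the even-numbered years."""
    odd = [yr for yr in seasons if yr % 2 == 1]
    even = [yr for yr in seasons if yr % 2 == 0]
    X_fit = np.vstack([seasons[yr][0] for yr in odd])
    y_fit = np.concatenate([seasons[yr][1] for yr in odd])
    X_test = np.vstack([seasons[yr][0] for yr in even])
    y_test = np.concatenate([seasons[yr][1] for yr in even])
    w = fit_weights(X_fit, y_fit)
    # Out-of-sample error: the fit never saw the even-numbered years.
    return float(np.sqrt(np.mean((X_test @ w - y_test) ** 2)))
```

The point of the structure is that the RMSE reported comes entirely from seasons the regression never saw.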

In the end, I would stick with what has a best chance to work in the future, and not what was best-fitted to work in the past. And to do that, you just have to construct an RE matrix, and derive your LWTS figures from there.

--posted by TangoTiger at 10:13 AM EDT


Posted 11:51 a.m., September 12, 2003 (#1) - tangotiger
Patriot tried more best-fit equations for BsR and RC, and he reports the following:

For RC:
1st--24.89
2nd--23.02
3rd--22.60
BsR:
1st--22.66
2nd--22.48
3rd--22.44

So, RC vaults from worst to 3rd best, and BsR jumps to best.

Like I said, all we are doing is best-fitting. It doesn't prove anything.

Posted 11:56 a.m., September 12, 2003 (#2) - Patriot
I agree with all of Tango's comments about not wanting to test on the data used to develop the formulas, etc. The main purpose, though, was to test BsR against XR, ERP, RC, and other formulas that aren't based on that sample. BsR acquits itself quite nicely.

The last tests were requested by David to see if the BsR structure could produce a more accurate estimator than the linear best-fit, within the data that it was developed for, and the answer, at least there, is yes. I agree, though, it doesn't prove anything.

Posted 12:22 p.m., September 12, 2003 (#3) - Patriot
I did a quick test on Tango's even/odd idea. I used the odd years to get the linear best fit and applied it to the evens--RMSE was 23.76. For comparison, ERP in the same sample, not customized for that specific period (although I'm not sure where I got the ".322" multiplier in the first place; it probably overlapped with that somehow), came in at 23.98. BsR (again, not a fitted version) came in at 23.64.

Posted 1:39 p.m., September 12, 2003 (#4) - tangotiger
  Cool, good stuff!

What I wouldn't mind seeing (if not from Patriot, then from some aspiring sabermetrician) is using the 1974-1990 data as the "sample" data to fix all your equations. What's good about this is that I already give you the "plus 1 method" true values to fix against (at the bottom of article 1, or at the bottom of the page of article 3, which links to the BaseRuns addendum). You can limit it to the fields you've been using (ab,h,2b,3b,hr,bb,sb).

Once you've fitted all the equations against this data, you then apply it to the 1961-1973 and 1991-2002 data.

As Patriot is starting to show here, I would guess that BsR would come up with better estimate than the best-fit linear equation, and probably anything else.

What's really cool about the time periods I am showing is that they each have their own peculiarities, and so should be a good test against extreme-type team-seasons.

Posted 2:43 p.m., September 12, 2003 (#5) - Patriot
I sort of did what Tango suggested. I didn't fit the formulas to match the actual empirical weights he published; I just fit them to equal overall runs for 1974-1990. Then I found the RMSE of them against the 61-72 and 91-02 data for ERP, BsR, Tango's empirical weights (the out value was the absolute needed for 74-90; the only problem here is that Tango's CS coefficient is for RAA, I think), and the linear best fit for 74-90. Anyway:
61-72:
Tango--22.87
BsR----23.16
ERP----23.60
Reg----23.82
91-02:
BsR----23.17
Reg----23.25
ERP----23.38
Emp----24.17

Posted 2:49 p.m., September 12, 2003 (#6) - Patriot
  BTW, the "Emp" in the 91-02 listing is Tango's empirical weights, identified as "Tango" in the 61-72 list.

Probably doesn't mean anything, but I find it interesting that BsR's accuracy is virtually identical for both periods.

Posted 3:24 p.m., September 12, 2003 (#7) - tangotiger (homepage)
  Great stuff again Patriot!!

You will find the "absolute" (along with the "marginal") event values of empirical data from 1974-1990 at the above link. The CS value is something like -.28 runs.

Posted 3:51 p.m., September 12, 2003 (#8) - Michael Humphreys
  Tango,

Thanks for posting Patriot's article. Patriot mentions John Jarvis' excellent survey of approximately 30 run estimators.

I agree that simulators, when the data are available, are generally better than simple regression analyses.

The book "Curve Ball", by two former Chairs of the Sports Section of the American Statistical Association, has an excellent explanation of the weaknesses, and strengths, of regression analysis as applied to baseball offensive statistics. Curve Ball points out that the regression weights for doubles tend to be too low, and the weights for home runs slightly too high, probably because of the cross-correlation between both variables. That being said, I don't think any player would be significantly mis-rated if the regression weights were used instead of the simulator weights.

For example, let's say you've got a high doubles/moderate home run guy like Kirby Puckett, say 40 doubles and 20 homers, and a lower doubles, higher homers guy like Mickey Mantle, say 20 doubles and 40 homers. If you use the Curve Ball weight for doubles (.67) and homers (1.5), the doubles/homers runs for Kirby are 57. Using the Pete Palmer linear weights (.80/1.40), the Kirby doubles/homers runs are 60. For "Mickey", the regression runs are 73; the Palmer runs are 72.
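The arithmetic above checks out; here is a tiny sketch using the weights as quoted (Curve Ball regression weights vs. Palmer linear weights):

```python
def xb_runs(doubles, homers, d_wt, hr_wt):
    """Runs from doubles and homers alone, under a given weight pair."""
    return round(doubles * d_wt + homers * hr_wt)

# Kirby Puckett-type line: 40 doubles, 20 homers
print(xb_runs(40, 20, 0.67, 1.5))   # Curve Ball regression weights -> 57
print(xb_runs(40, 20, 0.80, 1.4))   # Palmer linear weights         -> 60
# Mickey Mantle-type line: 20 doubles, 40 homers
print(xb_runs(20, 40, 0.67, 1.5))   # -> 73
print(xb_runs(20, 40, 0.80, 1.4))   # -> 72
```

As the comment notes, the two weight sets never disagree by more than 3 runs on these profiles.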

The SB/CS weights have more significant errors, for reasons explained in Curve Ball, but even using Mitchell's UZR weights, a Rickey Henderson wasn't creating more than a dozen runs or so in a typical season with his basestealing.

Curve Ball directly addresses the important issue you've identified of *validating* regression weights derived from Sample A seasons by applying them to Sample B seasons. Curve Ball found the regression weights from a 1954-1999 sample worked well "out of sample". See pages 181-82. In my pitching and fielding rating system, which uses regression analysis, the weights derived from various sub-samples of the data were virtually indistinguishable, except the weight for outs, which moved in sync with Pete Palmer's and Mitchell Lichtman's results. Major league baseball is a remarkably stable "system".

I think the larger point Patriot makes is worth making: there are many offensive formulas based on counting stats that are well-developed and reliable. I think the next major advances for offensive evaluation will build upon your work on Win Expectancy; i.e., the actual change in expected runs and wins created by a player, based upon his actual plate-appearance-by-plate-appearance data. One of the co-authors of Curve Ball, Jim Albert, has run a similar model for run-creation using 1987 Retrosheet data. The play-by-play system, customized per player, did reveal some differences not captured by aggregate data, but even the highly simplistic and flawed OPS statistic had a very strong straight-line relationship to the customized runs created data. The other issue in using PA-by-PA data is whether it captures real player skills in maximizing their positive impact, given their base-out scenario opportunities, or just random *measured* contextual impact. It's a subtle and more complete analysis of the old "clutch hitting" question.

Posted 4:00 p.m., September 12, 2003 (#9) - tangotiger (homepage)
  You may be interested by the great work by Tom Ruane, who uses the "runs value-added" approach, on a PA-by-PA basis for all players from 1980 to 1999.

The next step is of course the Mills' brothers approach on WPA. That will come from me next year, unless someone else beats me to it.

Posted 4:19 p.m., September 12, 2003 (#10) - Michael Humphreys
  Tango,

Thanks for the link to Tom's article. It doesn't provide an overall evaluation of the difference between "counting stat" run estimates and PA-by-PA run estimates. I'll write to Tom, but do you know offhand the improvement Tom's system makes in terms of long-term career assessments?

Posted 9:02 p.m., September 12, 2003 (#11) - Michael Humphreys
  Tango,

I had an idea that might be interesting to evaluate while you are developing your Win Expectancy model. Your article on your website persuaded me that "Runs Produced" (R + RBI *minus* HR) is a better "quick and dirty" evaluation statistic for batters than R+RBI. Runs Produced obviously reflects the contextual opportunities of the batter, his own "run-creating" production, and his clutch "performance". But it's a great stat for treating different kinds of batters fairly: on-base guys and power guys.

Do you think that some sort of weighted average of Runs Produced (perhaps scaled to league-average based on PA or outs) and BaseRuns (all of which use counting stats) might have a good match with a PA-by-PA "runs value added" (the last step before Win Expectancy)?

Also, not to beat a dead horse, but I did some quick calculations regarding the potential scale of error if one were to use regression-based SB/CS weights instead of the (correct) BaseRuns or UZR weights. I looked up Rickey Henderson's career SB/CS data. If you apply the regression weights from Patriot (+.211 per SB; -.262 per CS), Rickey's career stolen base runs is +208. I believe the BaseRuns weights, derived from 1974-1990 data, are +.193 per SB and -.437 per CS. If you apply those weights, that makes +124 runs for Rickey. Now 84 runs is a large number, although the difference occurs over the course of the equivalent of 19 162-game seasons, so that's 4.4 runs per 162-game season. And Rickey is the *most* extreme case.

Another way to see it is that the SB runs are roughly equivalent, but the CS run weights differ by just slightly less than .2 runs. The most CS Rickey ever had in one season was about 40 (when he set the record for SB). So that's an 8-run error; not nothing, but that is the most extreme single season outcome I can think of.
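A sketch of the arithmetic. The career totals used here (~1406 SB, ~335 CS) are approximate figures chosen only to roughly reproduce the numbers in the post:

```python
def steal_runs(sb, cs, sb_wt, cs_wt):
    """Run value of a stolen-base line under a given SB/CS weight pair."""
    return sb * sb_wt + cs * cs_wt

# Approximate career totals, not official figures.
SB, CS = 1406, 335

reg = steal_runs(SB, CS, 0.211, -0.262)   # regression weights: about 209
bsr = steal_runs(SB, CS, 0.193, -0.437)   # BaseRuns weights:   about 125

# The gap of roughly 84 runs matches the post; it almost all comes
# from the ~.175-run difference in the CS weight.
print(round(reg), round(bsr), round(reg - bsr))
```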

UZR weights differ slightly more, but I think that's because they're derived from late 1990s/early 2000s seasons, when outs were more costly than in 1974-1990.

Posted 9:24 p.m., September 12, 2003 (#12) - Robert Dudek
  Michael,

It's important to develop the most accurate and logically sound method one can using a given dataset. Even if it only amounts to a couple of runs here or there, it's a matter of principle.

Posted 11:07 p.m., September 12, 2003 (#13) - Tangotiger
  I agree with Robert's general sentiment. Let's get it right first, and let others worry about how accurate they need something.

As for the CS, please note the difference between an "absolute" method and a "marginal" method. When the out value is set to -.10 runs or thereabouts, you are employing an absolute method. When the out value is set to about -.27 runs, you are using a marginal method.

The same applies to CS. -.28 runs? Absolute. -.45 runs? Marginal. Check out article 2 of "How runs are created" for more on this.
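A quick arithmetic check on the values quoted above: the absolute and marginal scales differ by the same shift for both the generic out and the CS, which is the point of the distinction:

```python
# Approximate values as quoted in the post above.
out_absolute, out_marginal = -0.10, -0.27
cs_absolute, cs_marginal = -0.28, -0.45

# The absolute-to-marginal shift is the same ~.17 runs in both cases.
print(round(out_marginal - out_absolute, 2))   # -0.17
print(round(cs_marginal - cs_absolute, 2))     # -0.17
```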

Posted 11:24 p.m., September 12, 2003 (#14) - Michael Humphreys
  Robert,

Yes, we should use the best methods available--when they are available. I just wanted to clarify that regression analysis, used carefully, is a good tool that provides good estimates. I believe that we have the necessary data to derive "change-in-state" results or to run simulations for offense throughout baseball history, or that at least Pete Palmer has it. But when we don't have it, regression analysis is a good back-up; indeed, sometimes it can reveal new relationships.

Sometimes these relationships are spurious and misleading. As explained in "Curve Ball", the fact that regression analysis only provides a measure of a statistical "association" between discrete offensive events and runs scored means that if you include Sac Flies in the regression, they get *way* overweighted, because Sac Flies "carry" information about the run-scoring context in which they occur: you generally won't have a lot of Sac Flies unless you have a lot of runners on third, and having a lot of runners on third is statistically associated with scoring more runs, whether or not they come home via a Sac Fly.

Sometimes, however, you can *apply* regression analysis to *limit* regression analysis errors of the Sac Fly kind, and reveal new information.

One thing I've been meaning to do is a regression analysis on the factors that are associated with stolen bases; i.e., how do team strikeouts, walks, homeruns, etc., impact, on average, the number of bases a team steals and the number of runners caught stealing.

It would make sense, given what we know about the game, that more strikeouts would be associated (through regression analysis) with more stolen bases (due to longer pitch counts), more walks would be associated with more stolen bases (due to more runners on base and longer pitch counts), more homeruns would be associated with *fewer* stolen bases (since teams with power would not risk the out, and homeruns clear the bases).

After determining the number of stolen bases per team *not* explained by these factors, you would have a better idea of the *context-adjusted* stolen bases by a team. That might form the basis for more refined estimates of the "real" stolen base contribution of a player, given the stolen base context of his team.

Furthermore, if such "context-adjusted" stolen base numbers were used instead of plain stolen base numbers in a regression analysis of team runs scored "onto" walks, singles, doubles, etc., you would get an estimate of the number of runs associated with SB/CS *after* taking into account the impact of factors that supersede or "control" stolen bases. In other words, you could find out the number of runs associated with stolen bases not accounted for by the "context" in which stolen base attempts tend to occur.
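The first stage of that two-stage idea could be sketched like this. This is illustrative Python, not an existing implementation; `context_adjusted_sb` is a hypothetical helper:

```python
# Sketch of stage 1: regress team SB on team K, BB, and HR; the residual
# is the "context-adjusted" stolen base figure.
import numpy as np

def context_adjusted_sb(so, bb, hr, sb):
    """Team SB left unexplained after removing the K/BB/HR context."""
    # Design matrix: strikeouts, walks, homers, plus an intercept column.
    X = np.column_stack([so, bb, hr, np.ones_like(so, dtype=float)])
    coefs, *_ = np.linalg.lstsq(X, np.asarray(sb, dtype=float), rcond=None)
    return sb - X @ coefs   # residual SB, net of context

# Stage 2 would substitute these residuals for raw SB in the team
# runs-scored regression, estimating the run value of stolen bases net
# of the context in which attempts tend to occur.
```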

We might find that that run value has a surprising weight. We might not. Even if we didn't, we might learn something that would help our simulator models.

We need to remember that change-in-state data and simulators *also* work off of averages. They tell us the *average* weight of a stolen base, given an offense that is average in all respects. If we discover, through regression analysis, that a lot of stolen base numbers are explained by the contexts I've described, perhaps it would support refining the *simulator* to answer the question: "What is the change in run expectation when there is a stolen base given a high strikeout/walk/homerun context?" We might not find anything. But regression analysis is an easy-to-use method for discovering new relationships, which can then be tested using simulation models or more detailed play-by-play data.

This is precisely the tack taken in Curve Ball, which first introduces the idea of a batting Linear Weights equation through regression analysis, identifies the Sac Fly "carrier" problem, and then uses Lindsay-Palmer change-in-state models to get at the best answer.

I suppose the other point I was trying to make is that all of the "counting stat" models for offense are so similar in their degree of accuracy (including, yes, simple regression analysis) that it is time to take the next step, as Tango is doing, to try to get "runs value added" or Win Expectancy systems developed. The Jim Albert system I mentioned has something like *half* the standard error of the counting stat models. *That* is the kind of big improvement that best repays the extra effort of using much more complicated data sets and analytical techniques.

Posted 12:44 a.m., September 13, 2003 (#15) - Michael Humphreys
  Tango,

I'm heading out for the rest of the weekend, but I just saw your post. You're right. The BaseRuns "absolute" weight for CS is -.28, not -.44. Patriot's regression result is also on an absolute scale. His CS weight is -.26. (The SB weights are .19 and .21, respectively.)

Patriot's post #3 was indeed good stuff. The regression model standard error using the whole sample was 22.6. When he re-ran the regression on half of the seasons and applied the weights to the other half of the sample, the standard error went up by only 1.1 runs, to 23.7 runs per team per season.

Posted 7:44 p.m., September 13, 2003 (#16) - David Smyth
  If you want a run estimator which tries to model the process, them BsR seems to be the best one available now.

If you don't care about models, and are more into powerful statistical techniques, then use the regression formula.

All of the other variations, including the ones tested by Patriot, and certain other formulas, are pretty much superfluous, IMO.

To say that all run estimators are equal, simply because the seasonal RMSEs are similar, is to try to be politically correct. Logic dictates otherwise. Since that is the prevailing sentiment in our society on a variety of important subjects, it is not surprising that it would tend to filter into the area of baseball stats. Nobody wants to hurt anyone else's feelings.

I am not completely immune to these 'pressures'. As the person who came up with one of these formulas, and as a person of moderate technical sabermetric ability, I tend to defer to the James's and the Davenports', etc., in my written statements.

But no more. Give me BsR, give me the regression formula, and give me the PBP coefficients. They all have their place. But let the RCs, the EQAs, the ERPs, etc., be consigned to the historical stat museum which contains the Lindsay formula, the Gimbel formula, the TA formula, and others.

OK, I have my helmet on. Let the bashing begin...

Posted 8:54 p.m., September 13, 2003 (#17) - Patriot
No bashing, but healthy disagreement:)...LW are LW. I prefer the empirical ones--there's no reason not to. But I still see the room for an ERP or an XR or an EQR (since it's BASICALLY LW) when you don't have the PBP data for the sample, or you want one formula that will work over a large set of conditions (not that the 1980 empirical LW will do that badly in 1990 or anything though). ERP is nice and easy, and you don't need to remember whether it's .52S or .48S.

And now this is flat out nitpicking, but Lindsey is actually the one who invented LW. His formula was empirical LW, IIRC (minus walks and outs).

Posted 10:15 p.m., September 13, 2003 (#18) - Tangotiger
All of Patriot's testing shows that there is an enormous number of teams with not much distinction between them. Essentially, most of the teams are .320 to .340 OBA and .390 to .420 SLG (or whatever).

So, what his testing shows is that all these run estimators "work", not for any logical reason, but simply because every team in the sample group also has a matching team in the testing group (more or less).

However, to extend these things beyond your sample group, to pitchers like Gibson for example, you need to be grounded in logic. And for that, you need a non-linear interdependent model. And that can be generated using a math model or sim or custom-RE matrix (and looking at change-in-states). Or, you can use the thing that most closely matches what you really want: BaseRuns, or the custom-LWTS generated from BaseRuns.

Can you use ERP or XR or LWTS or RC or EqR or ....? Sure thing. As it turns out, while run creation is non-linear interdependent, you can assume a linear independent process and you will be pretty close (say within 3 runs / 600 PA for a hitter).

Much ado about nothing for most people. But, if someone says that EqR or RC or basic LWTS is fatally flawed, you can't argue with that either.

The proper thing would be for say Bill James or Clay Davenport to say "hey, this is how accurate it is... it won't work for these kinds of teams or players.... so, it's up to you to decide if this is good enough".

Posted 10:17 p.m., September 13, 2003 (#19) - David Smyth
  Your disagreement is noted and respected, Patriot. But nothing you wrote comes anywhere near to refuting my post...

Posted 3:32 p.m., September 15, 2003 (#20) - tangotiger (homepage)
  I made a comment at battersbox.ca at the above link, and again 2 posts later.

Posted 11:55 p.m., September 16, 2003 (#21) - Michael Humphreys
  Tango,

I followed your link above and thought the following comment of yours was worth quoting here:

"When looking at exteme cases, that is those cases where EqR and xR and RC are limited to and fail at, you might unearth say an extra 10 runs in there. 10 runs = 1 win. And a team will pay 2 million$ / win.

"So, in the case of say Delgado, you really need the absolute best most precise measure you can get. Even BsR by itself is not enough. You'd need to create a sim model that will show exactly how Delgado will interact, or expect to interact, with his teammates, and the impact of Delgado specifically. You might also find that putting him in the #2 spot will further add more runs and wins to the team.

"Sign him for 5 years, and you now need an "aging" model, similar to the (unverified but seemingly sharp) PECOTA. Add all that up, and you can find say 10 million$ of value in a player. You can make a case say that while the market may value Delgado at 68 million over the next 4 years, you might figure out that he's really worth 52 million$. (Say like what the A's did with Giambi.)

"Use the right tool for the right job. And, for the casual fan, OPS is fine. For the more devoted fans, EqR, xR, RC, and Win Shares are fine too. But, don't try selling these things as more than they are. Limits. That's what it's about.

I thought this was spot on. Using more sophisticated techniques has potentially greater value when you're dealing with outliers--both because the better models get a better answer and because the financial relevance is much greater.

I also re-read your "How Runs Are Really Created" article and agree that BsR is truly a *model* of run scoring--not just a statistical "match" with typical run-scoring environments.

I'm very much sold on the relevance of BsR for extreme pitchers, who create their own run-context. Have you done any analyses that show how BsR improves the valuations for the Gagnes and Pedros? Also, have you found some hitters that have been significantly and persistently misvalued under "linear" formulas? Say, Bonds, or, going further back, Williams v. Ruth?

Posted 7:19 a.m., September 17, 2003 (#22) - Tangotiger
I have not done any of that work, but the pitcher thing is, I think, the most worthwhile to pursue.

The results of that will make it clear as to the relevance of BsR compared to the others. And, since we've got pbp for the last 30 years, we'd be in great shape to get all the data we want at the pitcher level.