Tango on Baseball Archives

MGL - Component Regression Values (PDF) (January 8, 2004)

MGL presents component regression values.
--posted by TangoTiger at 09:50 AM EDT

Posted 9:52 a.m., January 8, 2004 (#1) - tangotiger
One of the problems with what you are doing with component regression is that you treat each of the components as independent of each other.

Say that each component has the same flat regression of 40% for 600 PA.

However, if you were to take LWTS of a player, I'll bet you the regression should be 30% for 600 PA.

By keeping the components separate, you overstate the OVERALL regression, while correctly stating the component regression.

A good way to test this is to do your component regression prediction (and AFTER convert that into LWTS), and also come up with a LWTS regression prediction. Do the two match? Yes? Then, you can ignore what I said, and all the components are independent of one another. No match? Hmmmm....

Posted 5:16 p.m., January 8, 2004 (#2) - tangotiger
Here are the regression towards the mean figures to use for 600 PAs, as well as the "x" value to use in
regression = x / (x+PA)

         obaLWTS   1B     2B   3B   HR  NIBB  HBP   RBOE   SO
x            209  298  1,101  571  131    96  255  1,627   62
600 PA       26%  33%    65%  49%  18%   14%  30%    73%   9%

obaLWTS is that "effective OBA" that I mentioned a few weeks ago, one that weights the single as 0.9 and the HR as 2.0, etc, etc.

All figures are per PA, though that's not necessarily the way I'd do it.
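
As a quick check of the figures above, the following Python sketch plugs each posted "x" into regression = x / (x+PA) at 600 PA and compares the result with the posted percentage. The dictionary and function names are made up for illustration; only the numbers come from the post.

```python
# Illustrative check only: the names are hypothetical, the numbers are the posted ones.
X_VALUES = {
    "obaLWTS": 209, "1B": 298, "2B": 1101, "3B": 571, "HR": 131,
    "NIBB": 96, "HBP": 255, "RBOE": 1627, "SO": 62,
}
POSTED_PCT_AT_600 = {
    "obaLWTS": 26, "1B": 33, "2B": 65, "3B": 49, "HR": 18,
    "NIBB": 14, "HBP": 30, "RBOE": 73, "SO": 9,
}


def regression(x, pa):
    """Fraction of the way to regress toward the mean: x / (x + PA)."""
    return x / (x + pa)


def check_table(pa=600):
    """Return {event: (computed percent, posted percent)} for every event."""
    return {
        event: (round(100 * regression(x, pa)), POSTED_PCT_AT_600[event])
        for event, x in X_VALUES.items()
    }


if __name__ == "__main__":
    for event, (computed, posted) in check_table().items():
        print(f"{event:8s} computed {computed:3d}%  posted {posted:3d}%")
```

The same `regression` function can be called with any other PA total, which is the point of publishing "x" rather than only the 600 PA percentages.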

Posted 6:12 p.m., January 8, 2004 (#3) - MGL
You mean you reduced my article to 3 lines? Where did you get these? Are you figuring all the rates as per PA, or are you assuming the "traditional" denominators (SO=SO/(PA-BB), etc.)? If you are using per PA, do you think it makes a big difference that these are not the best denominators to use? Do these include the possible "inter-dependencies" you mention in your first response?

I assume these are for batters. Do you have similar numbers for pitchers?

Your numbers are very close to mine, except for triples. His lower values may reflect the "inter-dependency." Tango, why do you think our triples are so different? I was surprised how high mine was, as triples seem to be a good reflection of speed. Also remember I use park adjusted stats. If you did not, the persistency of triples rate may reflect the park more than the batter. Also do you have values for SB/CS (or attempts per 1B+BB and success rate, which is what I would use)?

Here is the comparison:

        obaLWTS   1B   2B   3B   HR  NIBB  HBP  RBOE   SO
Tango:      26%  33%  65%  49%  18%   14%  30%   73%   9%
MGL:        N/A  N/A  70%  60%  60%   15%  N/A   N/A  15%

Also, you might want to explain how to use the x/(x+PA). It may be a little confusing for those who are not familiar with regression and the "role" the # of PA's play. Also I have a little problem (or at least a caveat) with the quick and dirty formula: x/(x+PA). I assume that technically that is not the correct true "relationship" (curve) between the regression coefficient and PA.

I did a little preliminary work on the "inter-dependency" thing. Indeed it appears that some of the components inform the regression of other components (such as $SO and $HR for batters), or at least change the regression constant. However, it also appears as if there is little or no dependency among some of the components, such that you can safely regress each one separately, without worrying about the other. Tango suggests that the best "Q&D" way to handle this potential problem is to just reduce the regressions for all of the components by some amount. That may be OK, but I would like to come up with a better solution. It may require a regression equation for each component, which includes all the other components (as independent variables)...

Posted 11:34 p.m., January 8, 2004 (#4) - Tangotiger
You mean you reduced my article to 3 lines? Where did you get these?

Just continuing work that I brought up a few weeks ago, which is being explained with great thoroughness by the stat-savvy here, like AED, Alan, Arvin, et al.

It's all based on comparing the observed variance to the expected variance based on luck, and attributing the difference to the true variance.

Are you figuring all the rates as per PA

Read the last line of my last post.

If you are using per PA, do you think it makes a big difference that these are not the best denominators to use?

Yes, I believe that it is wrong.

Do these include the possible "inter-dependencies" you mention in your first response?

Yes, though I don't know the degree to which this is wrong.

I assume these are for batters. Do you have similar numbers for pitchers?

I did those quick on my way out. I'll do the pitchers tomorrow morning.

Your numbers are very close to mine, except for triples. His lower values may reflect the "inter-dependency." Tango, why do you think our triples are so different?

I believe it's because of the sample size. Your sample is 2B+3B, and mine is PA. However, your spread of 3B/(2B+3B) is much larger than my spread of 3B/PA. I'm not sure which of the two variables has more of an effect. (They might cancel out.)

I use park adjusted stats. If you did not, the persistency of triples rate may reflect the park more than the batter.

That's possible, but I don't measure persistency per se. I just measure the spread of triples. However, I should have done (all numbers are variances):
observed = true + park + pitchers + fielders + luck
I think pitchers and fielders would have a variance of 0 from the perspective of the batter. I should have included park, and maybe that has more of an effect on triples. But, I don't really want to get into nuances.

Also do you have values for SB/CS (or attempts per 1B+BB and success rate, which is what I would use)?

No, though I could have done SB,CS,WP,PB,BK,PO as well.

Also, you might want to explain how to use the x/(x+PA)...with the quick and dirty formula: x/(x+PA). I assume that technically that is not the correct true "relationship" (curve) between the regression coefficient and PA.

Technically, I believe it IS correct, but the stat-savvy can chime in with their expertise.
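
One way to see why x/(x+PA) can be exact rather than quick and dirty: if the luck variance of a rate is taken to be binomial, p(1-p)/PA, and regression is the luck share of observed variance, the algebra reduces to x/(x+PA) with x = p(1-p)/true variance. The sketch below checks this numerically. The binomial model is an assumption made here, not something stated in the thread (and a weighted stat like obaLWTS is not strictly binomial); the league rate and talent spread are hypothetical values chosen only for illustration.

```python
def luck_variance(p, pa):
    """Binomial sampling variance of an observed rate over `pa` chances (assumed model)."""
    return p * (1.0 - p) / pa


def regression_from_variances(p, true_var, pa):
    """Regression toward the mean as luck variance / observed variance."""
    luck = luck_variance(p, pa)
    observed = true_var + luck  # observed = true + luck
    return luck / observed


def x_from_variances(p, true_var):
    """The PA-independent constant implied by the binomial assumption."""
    return p * (1.0 - p) / true_var


def regression_from_x(x, pa):
    """The shortcut formula: x / (x + PA)."""
    return x / (x + pa)


if __name__ == "__main__":
    p, true_sd = 0.330, 0.030  # hypothetical league rate and spread of true talent
    x = x_from_variances(p, true_sd ** 2)
    for pa in (100, 600, 1000):
        a = regression_from_variances(p, true_sd ** 2, pa)
        b = regression_from_x(x, pa)
        print(f"PA={pa:5d}  variance ratio={a:.4f}  x/(x+PA)={b:.4f}")
```

Under these assumptions the two columns agree at every PA total, so the curve is not an approximation; any inexactness would come from the luck model, not from the formula's shape.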

That formula is probably the most important one to remember out of any formula you will read. In the "obaLWTS" I put out, "x" = 209. Think of obaLWTS as OBA, but "weighted". Not important. Anyway, if you have an OBA of .400 with 1000 PAs, and the league is .300, how much do you regress?

Regression = 209 / (209 + 1000) = .17

So, you regress the .400 17% of the way towards .300, or .383. That's your best guess as to his true obaLWTS. There's also a simple equation that AED put out to figure out the confidence interval, but that escapes me right now.
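
The worked example can be written as a small Python function. This is only a sketch: the function and parameter names are invented here, and the inputs in the usage lines are the ones from the example (x = 209, 1000 PA, observed .400, league .300).

```python
def regress_toward_mean(observed, league_mean, pa, x):
    """Best-guess true rate: move `observed` toward `league_mean` by x / (x + PA)."""
    regression = x / (x + pa)
    return observed - regression * (observed - league_mean)


if __name__ == "__main__":
    estimate = regress_toward_mean(observed=0.400, league_mean=0.300, pa=1000, x=209)
    print(f"regression = {209 / (209 + 1000):.3f}")
    print(f"estimated true obaLWTS = {estimate:.3f}")
```

With very few PA the estimate sits almost on the league mean, and as PA grows it approaches the observed rate.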

Tango suggests that the best "Q&D" way to handle this potential problem is to just reduce the regressions for all of the components by some amount.

I'll run the above regressions against my players, by component and then converted to obaLWTS as well as directly on obaLWTS, and see if they match. If they do, then we can assume that interdependency does not play a role. Otherwise, I'll just factor in a blanket interdependency factor across the board to get a fit.

Posted 10:04 a.m., January 9, 2004 (#5) - studes (homepage)
THANK YOU! This is one of the areas I've wanted to better understand. MGL's article, and Tango's comments, are filling in one of the (many) holes in my understanding of some key concepts.

BTW, this has been a great offseason for Primate Studies. Nice job, Tango.

Posted 10:28 a.m., January 9, 2004 (#6) - tangotiger
Here are the regression values for hitters and for pitchers, on a per 600 PA basis:

Bat   Pit   Event
26%   39%   All
33%   44%   1B
65%   64%   2B
49%   83%   3B
18%   56%   HR
14%   24%   NIBB
30%   57%   HBP
73%   75%   RBOE
 9%   11%   SO

"All" refers to the Linear Weights-based OBA.

Like I said, I WOULDN'T do it this way, per PA, because of the interdependency, but this is good enough for now.

Check out the RBOE. There's a similar impact based on the hitter and pitcher. Now, I *know* that the distribution of batters faced from the pitcher's perspective does NOT have a variance of zero, especially as it relates to handedness. The LH/RH split for a LP and RP are far different.

That the 3B rates regress much more for pitchers than batters is probably due to the batter's speed. The park effect, if the variance is not zero from each of the hitter's and pitcher's perspective, is probably the same for both. So, the regression differentials are probably the same, but the amount of regression might be different.

Check out how much a pitcher's HR has to regress... right in line with his doubles, and MORE than his singles. This does NOT mean that a pitcher has less control on HR than singles, or whatever. It just means that our ability to figure out how much HR skill the pitcher has is limited by the sample available.

In virtually all cases, the hitter's performance is more indicative of his skill level than a pitcher's performance. Again, this is not to say, necessarily, that a hitter has more influence on a PA (they probably do), but rather that the individual performance lines AND the distribution of these performance lines are such that we can tell more about a player if he's a hitter than if he's a pitcher.

***
Incidentally, these numbers kind of support my off-the-cuff Marcel for pitchers to be 3/2/1/2, where the last value is for regression towards the mean, compared to the hitter's 5/4/3/2.

Posted 11:18 a.m., January 9, 2004 (#7) - FJM
TT: Can you explain why the 2B regression for the batters is so high, just as high as it is for the pitchers? If you combine 2B and 3B, does it change much?

Posted 11:36 a.m., January 9, 2004 (#8) - tangotiger
Good question: 67% for hitters and 63% for pitchers.

If you count XBH as 2b+3b+hr, 30% for hitters (similar as for singles) and 45% for pitchers (similar as for singles).

Posted 12:26 p.m., January 9, 2004 (#9) - tangotiger
I just want to reiterate, and it's important, that from a pitcher's perspective, this is the equation for variance:

Obs = True + Batters + Fielders + Park + Luck

From a pitcher's perspective, batters (in terms of talent) will have a variance of 0 but not in terms of handedness. Fielders will definitely not have a variance of 0 for BIP, but would for BB,K,HR. Park would not have a variance of 0.

So, those regression equations I have listed assume that all these variances ARE zero.

To bring back the famous equation from the Solving DIPS document:

.012 ^ 2 = pitch ^ 2 + field ^ 2 + park ^ 2 + luck ^ 2

After you figure out field, park, luck, you are left with a pitch variance of .009 or .010, depending on the other values.
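
To make the arithmetic concrete, here is a small Python sketch of that equation, treating the terms as standard deviations (as the squares suggest). Since the field, park and luck values are not given here, it only backs out their combined size from the two pitch values quoted above; the function name is made up for illustration.

```python
import math


def combined_other_sd(observed_sd, pitch_sd):
    """Combined SD of field + park + luck implied by observed^2 = pitch^2 + others^2."""
    return math.sqrt(observed_sd ** 2 - pitch_sd ** 2)


if __name__ == "__main__":
    for pitch_sd in (0.009, 0.010):
        others = combined_other_sd(0.012, pitch_sd)
        print(f"pitch = {pitch_sd:.3f} -> field/park/luck combined = {others:.4f}")
```

Because the terms add in squares, a fairly large combined field/park/luck term still leaves most of the .012 attributed to the pitcher.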

So, just be careful in trying to conclude anything with the numbers I published.

To continue the work that I did, you want to:
1 - figure out the best denominator for each component
2 - figure out the variance of field,park,bat on each of these components, for the pitcher (and similarly for the batter)
3 - figure out the interdependent relationship between these components

Posted 2:29 p.m., January 9, 2004 (#10) - MGL
The denominator thing is VERY important. If you use the "right" denominators (the right ones are the denominators which tend to do two things: one, reduce the interdependence of the components, i.e., make sure that if one goes up, another one doesn't automatically go down (or up also), or something like that, and two, reflect the greatest proportion of "skill"), you will see some of the regressions change quite a bit. For example, for a pitcher, triples per PA are very dependent on the other components, but triples per extra base hit are basically the same for all pitchers of the same handedness. For batters, a triple is basically a "trouble" double hit to RF by a speedy batter. If you use triples per PA for a batter, that will automatically go up as doubles go up, but if you use triples per doubles and triples, this should reflect the true triples speed of the batter.
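
A tiny Python illustration of the triples point, using two invented stat lines (not real players): both batters turn the same share of their doubles-plus-triples into triples, but the one who hits more doubles shows twice the triples rate per PA.

```python
def triples_per_pa(pa, doubles, triples):
    """Triples rate with PA as the denominator."""
    return triples / pa


def triples_per_2b3b(pa, doubles, triples):
    """Triples rate with doubles + triples as the denominator."""
    return triples / (doubles + triples)


# Hypothetical batters: (PA, 2B, 3B). Same "speed" share, different gap power.
BATTERS = {
    "few doubles": (600, 20, 5),
    "many doubles": (600, 40, 10),
}

if __name__ == "__main__":
    for name, line in BATTERS.items():
        print(f"{name:13s} 3B/PA = {triples_per_pa(*line):.4f}   "
              f"3B/(2B+3B) = {triples_per_2b3b(*line):.3f}")
```

The per-PA rate mixes doubles power into the triples number, while the 3B/(2B+3B) rate is identical for the two lines.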

I agree that Primate Studies this off-season has been fantastic! Props to Tango for all the work he puts in!

Posted 4:45 p.m., January 9, 2004 (#11) - FJM
Let's set aside Fielding, Park and Luck for now. If all you wanted to know is the relative importance of the batter vs. the pitcher in determining the expected frequency of each outcome, you would calculate it as (1-B)/(1-P), correct? For example, for singles it would be .67 / .55 = 1.22; i.e., the batter's influence is 22% greater than the pitcher's. Alternatively, you could express the batter's influence as a percentage by (1-B)/(1-B+1-P). Then you have .67 /1.22 = 55%.

Here then are the 2 stats, sorted low to high:

2B+3B: 0.89 and 47%

SO: 1.02 and 51%
RBOE: 1.08 and 52%
NIBB: 1.13 and 53%
1B: 1.22 and 55%

HBP: 1.63 and 62%
HR: 1.86 and 65%

Now you can see the reason for my earlier question. The batter has a little more influence than the pitcher in every category except HBP and HR, where he has a LOT more influence. But the pitcher apparently has more influence than the batter over doubles and triples. Not only does this run contrary to DIPS, but it seems to go against common sense. Shouldn't the batter's influence over doubles and triples fall somewhere in between his effect on singles and his impact on HR's?
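
The two stats can be recomputed from the posted regression values with a short Python sketch. The names are invented for illustration; the inputs are the per-600-PA batter and pitcher regressions posted earlier, plus the 67%/63% figures for 2B+3B. One caveat: with the table's 44% for pitcher singles the ratio comes out near 1.20, while the .55 used above corresponds to 45%.

```python
# Posted regression values per 600 PA (batter, pitcher); names are illustrative.
REGRESSIONS = {
    "2B+3B": (0.67, 0.63),
    "SO": (0.09, 0.11),
    "RBOE": (0.73, 0.75),
    "NIBB": (0.14, 0.24),
    "1B": (0.33, 0.44),
    "HBP": (0.30, 0.57),
    "HR": (0.18, 0.56),
}


def influence_ratio(bat_reg, pit_reg):
    """(1 - B) / (1 - P): batter's retained share relative to the pitcher's."""
    return (1 - bat_reg) / (1 - pit_reg)


def batter_share(bat_reg, pit_reg):
    """(1 - B) / ((1 - B) + (1 - P)): batter's influence as a fraction of the total."""
    return (1 - bat_reg) / ((1 - bat_reg) + (1 - pit_reg))


if __name__ == "__main__":
    ordered = sorted(REGRESSIONS.items(), key=lambda kv: influence_ratio(*kv[1]))
    for event, (b, p) in ordered:
        print(f"{event:6s} {influence_ratio(b, p):.2f} and {batter_share(b, p):.0%}")
```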

Posted 5:07 p.m., January 9, 2004 (#12) - MGL
FJM, again, the big problem is with the denominators of the rates. You cannot use per PA for both batters and pitchers and expect to get reasonable "responsibility" percentages, especially with the singles. The reason you get such a discrepancy with DIPS is that DIPS specifically refers to hits/BIP and NOT hits/PA. In fact, that's the whole point of DIPS. If you look at hits/PA for pitchers, it will appear as if pitchers have huge control over that, but that is simply because of their BB and K totals, which change the hits/PA...
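
A short Python sketch of this effect, with two invented pitchers who allow hits on exactly the same share of balls in play but differ in strikeouts. The BIP definition used here (PA minus BB, HBP, SO and HR) is an assumption for illustration, as are all the stat lines.

```python
def balls_in_play(pa, bb, hbp, so, hr):
    """Assumed BIP definition for this sketch: PA - BB - HBP - SO - HR."""
    return pa - bb - hbp - so - hr


def hits_per_bip(line):
    """Non-HR hits divided by balls in play."""
    bip = balls_in_play(line["pa"], line["bb"], line["hbp"], line["so"], line["hr"])
    return line["hits_in_play"] / bip


def hits_per_pa(line):
    """Non-HR hits divided by all plate appearances."""
    return line["hits_in_play"] / line["pa"]


# Hypothetical pitchers with identical hits/BIP (.300) but different SO totals.
PITCHERS = {
    "high K": {"pa": 800, "bb": 60, "hbp": 0, "so": 200, "hr": 20, "hits_in_play": 156},
    "low K": {"pa": 800, "bb": 60, "hbp": 0, "so": 100, "hr": 20, "hits_in_play": 186},
}

if __name__ == "__main__":
    for name, line in PITCHERS.items():
        print(f"{name:7s} hits/BIP = {hits_per_bip(line):.3f}   hits/PA = {hits_per_pa(line):.3f}")
```

The hits/PA gap between the two is produced entirely by the strikeout difference, which is why a per-PA denominator makes pitchers look as if they control hits.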

Posted 6:21 p.m., January 9, 2004 (#13) - FJM
True. But if you divide all 3 hit categories by (PA-NIBB-HBP-SO) instead of by PA, won't 2B+3B still indicate less batter influence than 1B?

Posted 10:24 p.m., January 9, 2004 (#14) - Mike Emeigh(e-mail)
But the pitcher apparently has more influence than the batter over doubles and triples.

Most doubles and triples are hit in the air. From the PBP evidence, the pitcher has more impact as to whether a particular BIP is a ground ball or a fly ball than does the batter. Thus the pitcher should influence 2B+3B more, and singles (which are much closer to having the same distribution as the overall ratio of fly balls to ground balls) less.

-- MWE

Posted 11:00 p.m., January 9, 2004 (#15) - Tangotiger
Great point, Mike!

FJM: you might also be mixing up 2 separate things.

1 - A .300 true OBA pitcher facing a .400 true OBA hitter will have virtually the same result as a .400 pitcher against a .300 hitter. (I've yet to publish the study, but that's pretty much it.) So, the hitter doesn't have greater influence.

2 - The spread of talent is greater with hitters than pitchers. So, the likelihood is that the result is more based on the hitter than the pitcher. If all pitchers were like Pedro, RJ, Maddux and Clemens (variance close to zero), then the result of the matchup would depend almost entirely on the hitter.

In your case, you are seeing #2.