Tango on Baseball Archives

© Tangotiger


Clutch Hitting: Logistic Regression (PDF) (February 10, 2004)

Alan Jordan checks in with his take on clutch hitting, with conclusions that are in contrast with those presented by me and Andy.
--posted by TangoTiger at 11:05 AM EDT


Posted 1:31 p.m., February 10, 2004 (#1) - Very small nit
  Lig represents the test for the five groups of leverage index. The last column is of
particular interest. It represents the probability of getting a result this extreme due to
chance. Usually if this value is below .05 then we say that it is statistically significant
meaning that it is probably a real effect.

Actually, it represents the probability - if the null is true - that this analysis would yield the current results.

This indicates that, on average, players' performances aren't affected by the leverage of the situation. However, it does not refute the possibility that a smaller subset of players do perform better in clutch situations. Some players might do better in the clutch, some might do worse, and some might perform the same as in other situations. Regarding those who do better: is it better than would be expected by chance?

Also, I admit I didn't read through the past threads on clutch ability, so I may have missed this, but wouldn't clutch hitting (hitting in high leverage situations) be affected by clutch pitching (pitching in high leverage situations)? I would imagine that the two would cancel each other out, so that even if clutch does exist, it would be masked by this.

Posted 2:08 p.m., February 10, 2004 (#2) - Alan Jordan
  "Actually, it represents the probability - if the null is true - that this analysis would yield the current results."

That's a much better way of stating it. I wrote it up pretty quickly last night, and I wouldn't be surprised if there are whole words missing in places.

"However, this does not refute the possibility that there are a smaller subset of players who do perform better in clutch situations."

No, it doesn't. There may be a small subset that is affected, or there may just be a really small effect. Testing for a small subset is problematic because statistically it's cheating to look at the results to identify who has the biggest differences in the clutch and then select those batters out for analysis. You could use one year's worth of data to select and another year's worth to test. Actually, you could divide your data into two groups any number of ways, such as odd days versus even days. The trick is to select on one group and test on the other. My gut feeling is that if there were something there, half a million cases would have found it already.
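
A minimal sketch of that split-sample idea, assuming hypothetical player-level data with an arbitrary odd/even split (the column names and numbers are made up for illustration):

import pandas as pd

# Hypothetical data: one row per player per half, where the halves come from an
# arbitrary split such as odd days versus even days.
df = pd.DataFrame({
    "player":        ["A", "A", "B", "B", "C", "C"],
    "half":          [1, 2, 1, 2, 1, 2],
    "clutch_oba":    [.38, .35, .30, .33, .36, .31],
    "nonclutch_oba": [.34, .34, .32, .32, .35, .35],
})

# Step 1: use half 1 only to pick the apparent clutch hitters.
half1 = df[df["half"] == 1].copy()
half1["diff"] = half1["clutch_oba"] - half1["nonclutch_oba"]
selected = half1.nlargest(2, "diff")["player"]

# Step 2: test those players on half 2, which played no role in the selection.
half2 = df[(df["half"] == 2) & (df["player"].isin(selected))]
print((half2["clutch_oba"] - half2["nonclutch_oba"]).mean())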

"Also, I admit I didn't read through the past threads on clutch ability, so I may have missed this, but wouldn't clutch hitting (hitting in high leverage situations) be affected by clutch pitching (pitching in high leverage situations)?"

Perhaps in this data set, because we are only looking at it from the batter's perspective. If you had data at the PA level, then you could control for that by factoring in who was pitching and adding terms for clutch pitching. I wasn't able to control for that in this data set.

Posted 2:18 p.m., February 10, 2004 (#3) - AED
  ...the results of this analysis stand in strong contradiction to the results of Andrew Dolphin...

Well, hold on a minute... As I noted in post #104 of the clutch hitting thread, I found a chi^2 of 1.04 for Tango's data, with random S.D. of 0.077 in chi^2. This gives a 30% likelihood that these data (or less consistent data) could have been produced without a clutch factor. You find a 31% chance. Therefore you are confirming my analysis of Tango's data, not contradicting it.
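
For reference, the ~30% figure follows if the 1.04 is read as a chi^2 per degree of freedom with an expected value of 1 and the quoted random S.D. of 0.077 (a quick check, not AED's actual calculation):

from scipy.stats import norm

z = (1.04 - 1.0) / 0.077   # about half a standard deviation above expectation
p = norm.sf(z)             # one-sided tail probability
print(round(z, 2), round(p, 2))   # ~0.52, ~0.30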

In regards to my study, your statements are misleading. You did not determine that players do not perform differently in the clutch; rather, you determined that any clutch factor was sufficiently small that it could not be definitively detected in four years of data. This says absolutely nothing about whether or not it can be detected at the 2-sigma level using 24 years of data. Given that our techniques give the same results on Tango's data, if anything your calculations show that mine are right, and thus that the results from my larger data sample (analyzed similarly) are probably also right.

Posted 2:28 p.m., February 10, 2004 (#4) - Alan Jordan
  Tango, you asked why LI groups have to be factored into the logistic regression.

Excerpt from the email:

"What I mean to suggest is that you don't have the
extra term, because you can simply normalize each LI
against the league.

Let's say we have:
Giambi: .400,.410,.420,.430,.520
league: .340,.350,.360,.370,.380

why not have this as:
Giambi,LI0,+.06
Giambi,LI1,+.06
Giambi,LI2,+.06
Giambi,LI3,+.06
Giambi,LI4,+.14

(and do this for all players), and run the regression
based on playerid and LI only?"

The main reason is that we already have a way of factoring in LI through our function prob(Y given X) = exp(X)/(1 + exp(X)). By trying to adjust the data in the way you are talking about, you are treating the probability of Y as if it were linear with respect to LI, but logistic with respect to batters and clutch hitting. I wouldn't treat probabilities as linear unless I absolutely had to, because it's a misspecification.
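
A small sketch of the distinction, using the numbers from the quoted email (illustrative only): the additive adjustment works on the probability scale, while the logistic model works on the log-odds scale.

import numpy as np

def logit(p):
    return np.log(p / (1 - p))

giambi = np.array([.400, .410, .420, .430, .520])
league = np.array([.340, .350, .360, .370, .380])

# The proposed adjustment: flat differences on the probability scale.
print(giambi - league)            # +.06, +.06, +.06, +.06, +.14

# What the logistic model actually estimates: differences on the log-odds scale,
# where prob(Y given X) = exp(X)/(1 + exp(X)) maps back to a probability.
print(logit(giambi) - logit(league))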

The other reason is that I need to work with events (times on base) and chances (PA). These have to be positive because you can't get on base -5 times out of 20. You posted OBAs and PAs, so I calculated the number of times on base as round(OBA*PA). I wouldn't have even been able to run a linear regression using the method that you describe. In order to do a linear regression (actually an ANOVA), I would need the variance for each group of PAs and the variance of all the data combined in one group (I think that's sufficient). I might actually need all the data PA by PA. That's why I can't analyze your lwts data in the same file.

Posted 3:17 p.m., February 10, 2004 (#5) - tangotiger
  you are treating the probability of Y as if it were linear with respect to LI,

Agreed. I really meant:

x= .400/(1-.400) all divided by .340/(1-.340)
newOBA = x/(x+1)

I see the point about having whole numbers for PA*OBA.

Posted 4:41 p.m., February 10, 2004 (#6) - Alan Jordan
  "Well, hold on a minute... As I noted in post #104 of the clutch hitting thread, I found a chi^2 of 1.04 for Tango's data, with random S.D. of 0.077 in chi^2. This gives a 30% likelihood that these data (or less consistent data) could have been produced without a clutch factor. You find a 31% chance. Therefore you are confirming my analysis of Tango's data, not contradicting it."

I have to admit that I missed that post. My understanding from post #43 was that Tango, at least, was convinced that an effect for clutch hitting was detectable in the OBA (not OBSlwts) data from 1999-2002.

"In regards to my study, your statements are misleading. You did not determine that players do not perform differently in the clutch; rather you determined that any clutch factor was sufficiently small that it could not be definitively detected in four years' of data."

I took pains to say that I hadn't proven the nonexistence of clutch hitting.

"This says absolutely nothing about whether or not it can be detected at the 2-sigma level using 24 years' of data."

This is where I think you have a valid beef.

"Given that our techniques give the same results on Tango's data, if anything your calculations show that mine are right and thus that the results from my larger data sample (analyzed similarly) are probably also right."

We both found no effect on Tango's OBA data. That's about all we can say.

Had I caught post #104, I would have written it up differently. I agree with you now that it doesn't contradict your findings, it only fails to validate them on a smaller sample.

Posted 4:48 p.m., February 10, 2004 (#7) - Alan Jordan
  "Agreed. I really meant:

x= .400/(1-.400) all divided by .340/(1-.340)
newOBA = x/(x+1)"

That's a lot better, and is equivalent to ln(.4/(1-.4)) - ln(.34/(1-.34)), which fits directly into the logistic function. However, when you are doing hypothesis testing, it's best to avoid doing those adjustments beforehand. By putting them in as properly specified independent variables, you avoid adding bias and imprecision (inefficiency) to the estimates and their variance-covariance matrix. You also get the correct degrees of freedom for the hypothesis tests.
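
A quick worked check of that equivalence, using the .400/.340 example from post #5 (illustrative numbers only):

import math

p_player, p_league = 0.400, 0.340

# Post #5: the player's odds divided by the league's odds, then converted back
# to a probability.
x = (p_player / (1 - p_player)) / (p_league / (1 - p_league))
new_oba = x / (x + 1)

# The log of that odds ratio is exactly the difference of the two logits.
logit_diff = math.log(p_player / (1 - p_player)) - math.log(p_league / (1 - p_league))

print(round(new_oba, 3))
print(round(math.log(x), 4), round(logit_diff, 4))   # identical by construction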

Posted 5:08 p.m., February 10, 2004 (#8) - AED
  Alan, it is customary to provide upper limits for non-detections. In other words, how large would the 'clutch effect' have to be for you to detect it? I'd guess that you're only sensitive to clutch if the standard deviation of the clutch talent distribution is 0.015 or higher. Can you quantify this more precisely?

Actually I noted my disagreement several times (#65, #69, #104), but that thread seems to have gotten hijacked by win advancement minutiae, so I fully understand how things get missed...

Posted 10:44 p.m., February 10, 2004 (#9) - Alan Jordan
  "Alan, it is customary to provide upper limits for non-detections. In other words, how large would the 'clutch effect' have to be for you to detect it? I'd guess that you're only sensitive to clutch if the standard deviation of the clutch talent distribution is 0.015 or higher. Can you quantify this more precisely?"

It's not as customary as it should be. I know how to estimate power and the necessary sample size for a single coefficient, but we are testing a group of coefficients, and I can't find a formula for that in Hosmer & Lemeshow. I took a look at using the chi-square to estimate the necessary sample size, but I'm pulling theory out of my ass to get an answer. It goes like this: chi-square is proportional to effect * sample size. Estimate the effect from chi-square/sample size, and then estimate the sample size needed to reach the critical alpha for a chi-square with 1,699 DF. The results suggest that with about two more years' worth of data, I'll have a chi-square significant at the p<=.0001 level. I find that hard to believe.
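
A rough sketch of that extrapolation, with placeholder values for the observed chi-square and case count (they are not the actual numbers from the analysis):

from scipy.stats import chi2

df_test = 1699                            # DF for the group of coefficients
chi2_crit = chi2.ppf(1 - 1e-4, df_test)   # critical value at alpha = .0001

chi2_obs = 1750.0                         # placeholder observed chi-square
n_current = 500_000                       # placeholder number of cases

# "Chi-square is proportional to effect * sample size": effect per case, then the
# sample size at which the scaled-up chi-square would reach the critical value.
effect_per_case = chi2_obs / n_current
n_needed = chi2_crit / effect_per_case
print(round(chi2_crit, 1), int(round(n_needed)))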

"Actually I noted my disagreement several times (#65, #69, #104), but that thread seems to have gotten hijacked by win advancement minutae so I fully understand how things get missed..."

There seemed to be a mosquito in that thread that no one could swat.
It was tedious to wade through because of all that net effect of a PA on defense and offense.

It was unnerving to think that your methodology wasn't working. It's somewhat of a relief to see it didn't find anything on Tango's data.

I still have doubts about clutch hitting because even with 24 years' worth of data, you still couldn't find an effect significant at the p<=.0001 level. 24 years of data is the statistical equivalent of an electron microscope: you can see effects that are too small to be of any practical value to anyone. Conceivably, we could collect data for another 24 years and not replicate your findings.

That said, I just finished a test run of a logistic regression of the 1999-2002 data at the PA level. I used your criterion of clutch being the 6th inning or later... Anyway, it doesn't find clutch hitting or clutch pitching, and the average drop in OBA in clutch situations is almost, but not entirely, explained by pitching. It's a test run, and I'll probably post some of the code on fanhome this weekend to verify that I'm doing it right before I do the final run.
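
A minimal sketch of what such a PA-level model could look like in statsmodels; the file name, column names, and the exact clutch definition are assumptions, and fitting full batter/pitcher dummies on millions of PAs would be slow in practice:

import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Hypothetical PA-level file: on_base (0/1), clutch (0/1, e.g. 6th inning or
# later plus whatever score/base-out conditions apply), batter and pitcher IDs.
pa = pd.read_csv("pa_1999_2002.csv")

# Base model: batter and pitcher main effects plus a clutch main effect.
base = smf.logit("on_base ~ C(batter) + C(pitcher) + clutch", data=pa).fit(disp=0)

# Adding batter-by-clutch terms tests clutch hitting while controlling for who
# was pitching; pitcher-by-clutch terms would test clutch pitching the same way.
full = smf.logit("on_base ~ C(batter) + C(pitcher) + clutch + C(batter):clutch",
                 data=pa).fit(disp=0)

# Likelihood-ratio test for the group of batter-by-clutch coefficients.
lr = 2 * (full.llf - base.llf)
df_diff = full.df_model - base.df_model
print(lr, df_diff, chi2.sf(lr, df_diff))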

Posted 2:42 a.m., February 11, 2004 (#10) - AED
  Well, I didn't claim a p<=0.0001 level; I claimed a p<=0.009 level. Requiring 0.0001 is equivalent to a 3.7 sigma detection...

Granted that you won't measure a single player's clutch skill to arbitrarily high precision, but you can regress it to estimate the true talent level. Since we're talking a few tenths of a win, it's not going to be a huge difference but isn't negligible either, thanks to the fact that those situations are high leverage.

Posted 12:44 p.m., February 11, 2004 (#11) - Alan Jordan
Nobody accused you of making that claim. I just think that with the massive amounts of data involved, we should hold it to a higher standard than .009. I routinely ignore effects of p<.009 at my job, and I only have at most 30,000 cases to work with. I sometimes ignore effects of p<.0001 if the increase in the area under the receiver operating characteristic curve is less than .005. When you have small sample sizes, you have to be more liberal. You are probably working with 3 million cases, and that should be enough to get us p<.0001.

I have to admit that because it's a high-leverage effect, it doesn't have to be as big to affect the outcome of a game. That makes it different from other effects. Still, I question whether we can estimate the effect for a player with enough precision or accuracy to justify putting it into a model. Given that we bastardize park factors (express them as linear multiplicative factors instead of as odds ratios, among other shortcuts), I just can't see that this is big enough to warrant inclusion in a model.

The last problem I have is that OBA is a really incomplete measure of clutch hitting. To address it fully, we need to look at all the outcomes of a PA; OBA treats a walk the same as a HR. I think linear weights are a better way of looking at it. Even better would be a multinomial model. I don't have the computer power at home to estimate that for 24 years' worth of data. I would need access to a university's Unix system. It would literally take days, assuming the job didn't explode.

Enough of that. I'll make some changes to the write up so that Tango can replace the one here. It should be fixed.

Posted 1:28 p.m., February 11, 2004 (#12) - tangotiger
  Can you give us a small primer as to how a multinomial model would work?

And, do you think that my lwtsOBA, coupled with the fudge factor that Andy suggested, would be an excellent approximation?

Posted 2:17 p.m., February 11, 2004 (#13) - Alan Jordan
  By multinomial, I mean multinomial logistic regression. In regular binomial logistic regression you have two outcomes: yes or no, on base or not on base, etc. In multinomial, you can have more than two. The basic difference is that if you have k outcomes, you need k-1 equations, and the odds ratios are defined differently. Instead of the ratios being defined as p/(1-p), you have p(i=j)/p(i=k). That is, one outcome is always defined as the reference, and it's the denominator in the odds ratios. For example, if you have 3 outcomes and the probabilities are .3, .5, and .2, then one outcome gets picked to be the reference category, let's say the last outcome. So the odds ratios are:

outcome1 = p(i=1)/p(i=k) = .3/.2
outcome2 = p(i=2)/p(i=k) = .5/.2

Models are estimated using the natural log of these odds ratios as dependent variables.
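
Worked through in a short sketch with the .3/.5/.2 example above (the last outcome as the reference):

import numpy as np

p = np.array([0.3, 0.5, 0.2])   # probabilities of the three outcomes
ref = p[-1]                     # reference category

odds = p[:-1] / ref             # .3/.2 and .5/.2 -> the k-1 = 2 odds ratios
log_odds = np.log(odds)         # the quantities the k-1 equations model
print(odds, log_odds)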

Tango, this should look familiar from the match up method I posted a couple of months ago. This is what I based it on.

I don't know how well the lwts would work as an approximation. By the time you combine those outcomes together in linear combinations, you have to treat them as a continuous variable. That's a standard least squares problem, which would run (relatively) quickly. Any bias introduced by treating a nonlinear relation as a linear one contaminates the results. The key question is how much, and I don't know the answer to that.

Posted 2:33 p.m., February 11, 2004 (#14) - tangotiger
  Yes, it does look familiar.

In this case, then, you would have, say:
ln(1b/out),
ln(2b/out),
ln(3b/out),
ln(hr/out),
ln(bb/out),
ln(hbp/out)

For each player/LI category, right?

In terms of least-squares, since the BB is worth less than the HR, would you weight the ln(bb/out) less than ln(hr/out)?

Posted 2:45 p.m., February 11, 2004 (#15) - Alan Jordan
  "For each player/LI category, right?"

Yes

"In terms of least-squares, since the BB is worth less than the HR, would you weight the ln(bb/out) less than ln(hr/out)?"

For the linear regression, you take the lwts weight for a BB, and that's literally the value of the dependent variable for that PA. The same goes for a HR or any other outcome. The idea is that the weight represents the average runs produced by that PA.
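
A small sketch of building that dependent variable, with illustrative round-number lwts values rather than the ones actually used:

import pandas as pd

# Illustrative run values per outcome (placeholders only).
lwts = {"1B": 0.47, "2B": 0.78, "3B": 1.09, "HR": 1.40,
        "BB": 0.33, "HBP": 0.34, "OUT": -0.27}

# Hypothetical PA-level rows with an 'outcome' column keyed to the table above.
pa = pd.DataFrame({"batter":  ["A", "A", "B"],
                   "clutch":  [1, 0, 1],
                   "outcome": ["BB", "HR", "OUT"]})

# The dependent variable for the least-squares model is the lwts weight of the PA.
pa["y"] = pa["outcome"].map(lwts)
print(pa)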

Posted 2:56 p.m., February 11, 2004 (#16) - Alan Jordan
  For the least squares model you have to have a row for each PA. You can't group them the way you did on your posted data.

Posted 5:06 p.m., February 11, 2004 (#17) - AED
  If I'm reading this correctly, breaking it into a multinomial model and testing the various factors independently would require you to search for "clutch" factors independently in all factors, right? If sample size is killing you now, won't it be worse trying to measure clutch changes in triples rates?

As the data are posted, I think the best you can do is estimate the standard deviation of lwtsOBA as:
0.883*sqrt(lwtsOBA/PA).
Although this approximation is imperfect on a player-by-player basis (a high-HR player will have a larger random variance than a low-HR player with the same lwtsOBA), it is correct for the overall league and thus the overall chi^2 model should work.
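
A sketch of how that approximation could feed the overall chi^2 model, assuming hypothetical per-player clutch and non-clutch lwtsOBA/PA splits:

import numpy as np
from scipy.stats import chi2

# Hypothetical per-player values.
lwts_clutch, pa_clutch = np.array([.36, .31, .34]), np.array([120, 150, 90])
lwts_other,  pa_other  = np.array([.34, .33, .35]), np.array([1400, 1600, 1100])

# The approximation above, applied to each split.
sd_clutch = 0.883 * np.sqrt(lwts_clutch / pa_clutch)
sd_other  = 0.883 * np.sqrt(lwts_other / pa_other)

# One standardized difference per player, summed into a chi^2 across players.
z = (lwts_clutch - lwts_other) / np.sqrt(sd_clutch**2 + sd_other**2)
chisq = np.sum(z**2)
print(chisq, chi2.sf(chisq, df=len(z)))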

Posted 9:06 p.m., February 11, 2004 (#18) - Alan Jordan
  "If sample size is killing you now, won't it be worse trying to measure clutch changes in triples rates?"

I prefer the term f*cking computationally prohibitive.

I have a proposal. Your method is binomial. However, I believe your method can be made multinomial in the following way. Divide outcomes into:

Singles
xtra base (doubles & triples)
HR
Strike out
Walk
Out from BIP (ground outs, fly outs, double plays, triple plays & fielder's choice)

Use either strikeouts or out from BIP as the reference category.

Calculate a singles rate as singles/(singles + outs from BIP), then an xtrabase rate as xtrabase/(xtrabase+outs from BIP), etc...

Divide PAs into clutch and nonclutch, and do the same analysis you did with OBA. You will then end up with 5 separate variances estimated with 5 separate chi-squares (of 1 DF).

As long as these chi-squares are independent, you can add them up into one chi-square with 5 degrees of freedom. This should give you a more powerful test. It will also allow you to isolate where any effects are.
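
The last step of that proposal in a short sketch (the five component chi-squares are placeholders):

from scipy.stats import chi2

# Placeholder 1-DF chi-squares for singles, extra-base hits, HR, strikeouts, walks.
components = [1.3, 0.4, 2.1, 0.2, 1.0]

# If the five tests are independent, their sum is chi-square with 5 DF.
total = sum(components)
print(total, chi2.sf(total, df=5))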

If we can form a correlation or covariance matrix of these five effects, we could estimate a model that we could plug into lwts or BSR or something to quantify value. My nonprescribed drugs are wearing off, and things become fuzzy about here.