Tango on Baseball Archives

© Tangotiger

Forecasting Pitchers - Adjacent Seasons (January 30, 2004)

The following table presents the performance of pitchers, year-to-year, based on age, from 1955 to 2003.
--posted by TangoTiger at 11:09 AM EDT


Posted 11:18 a.m., January 30, 2004 (#1) - Craig B
  Wow, great stuff, Tango! I'm building a projection system as we speak :)

Posted 11:50 a.m., January 30, 2004 (#2) - tangotiger
  Please be careful that I have listed things as RATIOS and not rates. So, there's going to be some "interdependence". Just be careful!

I'll let the stat-savvy's speak their mind on the regression towards the mean and selective sampling issue. Following that, I'll be happy to create an Excel file (something like Brock2.... I don't want to use Marcel, because this is too convoluted for that... how about Rodin1?).

Posted 11:54 a.m., January 30, 2004 (#3) - Mike Green
  Yes, this is great stuff. Control improving to the late 30s isn't really that surprising I suppose, and it confirms that the late career success of Ryan and Randy Johnson should not really have shocked us.

Posted 11:54 a.m., January 30, 2004 (#4) - tangotiger(e-mail)
  Oh, and the other thing that I haven't done (and I'm not sure that I will do in the foreseeable future) is playing time.

So, in conjunction with the above, you want to have something else that also includes playing time. Pitchers are a pain, because of the start/relief issue, and the high number of injuries.

Posted 11:56 a.m., January 30, 2004 (#5) - Chuck Oliveros
  I was surprised that SO rates peaked at such an early age. I wonder why that is? I find it hard to believe that a physical decline would begin so early. Could it be that young pitchers learn to pace themselves and take a little something off their fastballs in the interests of pitching more effectively? Is there a study that could be constructed to explain that?

Posted 12:26 p.m., January 30, 2004 (#6) - studes (homepage)
  Great job, Tango. Two simple questions: did you correct for ballpark at all? (I assume you did).

Also, why did you use the ratio of performance to the league, instead of straight performance metrics? Was that to isolate the quality of batters faced?

Posted 1:29 p.m., January 30, 2004 (#7) - VoiceOfUnreason
  Would using the various DIPS components break up the interdependences?

Posted 1:31 p.m., January 30, 2004 (#8) - tangotiger
  Actually, I should have noted that I only looked at pitchers who faced at least 250 batters in consecutive years while playing for the same team and in the same league. So, in the rare cases where a team changed park, this was not a good choice. (Perhaps I should have limited by that too.) However, seeing that it is rare, it would hardly cause a dent. Remember, I've got several hundred pitchers per age pair. One or two pitchers won't affect that.

As for ratios and not rates, I've talked about this a few times in a few places. It's the only way to get the chaining to work fine. I'll explain in a separate post.

As for the "binary" approach, this is also necessary to chain, and to get the best "opportunity" factor for every event, and, so far, the best way to isolate all the events (probably).

Posted 1:32 p.m., January 30, 2004 (#9) - tangotiger
  Would using the various DIPS components break up the interdependences

This is what I used. From the article, I noted:

HBP: HBP/(PA-IBB-HBP)
BB: (BB-IBB)/(PA-HBP-BB)
SO: SO/(PA-HBP-BB-SO)
HR: HR/(PA-HBP-BB-SO-HR)
xH: (H-HR)/(PA-HBP-BB-SO-H)

As you can see, the numerator becomes more and more isolated.

Posted 1:32 p.m., January 30, 2004 (#10) - tangotiger
  Ughh... denominator.
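
For anyone who wants to try the binary split on their own data, here is a minimal sketch of the five ratios listed in post #9. The function name `binary_ratios` and the sample season line are made up for illustration; only the formulas come from the post.

```python
def binary_ratios(pa, ibb, hbp, bb, so, hr, h):
    """Return the five chained ratios for one pitcher-season.

    Each denominator strips out the events already accounted for,
    which is what makes the denominator more and more isolated.
    """
    return {
        "HBP": hbp / (pa - ibb - hbp),
        "BB": (bb - ibb) / (pa - hbp - bb),
        "SO": so / (pa - hbp - bb - so),
        "HR": hr / (pa - hbp - bb - so - hr),
        "xH": (h - hr) / (pa - hbp - bb - so - h),
    }


# Hypothetical season line, not taken from the study.
season = dict(pa=800, ibb=5, hbp=5, bb=65, so=150, hr=20, h=180)
ratios = binary_ratios(**season)

for name, value in ratios.items():
    print(f"{name}: {value:.4f}")
```

Note that these are odds-style ratios (event versus the remaining non-events), not rates per PA, so they should not be compared directly with K/PA or BB/PA.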

Posted 1:45 p.m., January 30, 2004 (#11) - tangotiger
  I posted this at fanhome 3 years ago, and it's important to those who want to understand the difference between rates and ratios (odds).

=========================

Let's say that at age 31, the typical player has 5 triples, and 25 doubles. And let's say that that same guy, playing at age 36 would have 2 triples and 20 doubles.

So, as a "ratio", the age 31 3b/2b is 5:25, or .20. At age 36, the ratio is 2:20, or .10. So, comparing age 36 to age 31 shows that the ratio should be .10/.20 or .50.

Now, let's say you have a guy that has 10 triples, and 20 doubles at age 31. That ratio is .50. At age 36, we would expect this guy, if he follows the same aging pattern as the typical example from above, to have a ratio of .25. (Now, this is where you need more info. You need to figure out his $BB, so that you can get his new AB and BB. Then you need his $K, $HR, $H, $XBH to get his new K, HR, H, XBH. Let's say his XBH comes out to 21.) So, if his ratio is .25, and we know that his 2b+3b is 21, then we can say that his 3B will be about 4, and his 2B will be about 17.

Now, let's try it the other way, and work with percentages. The typical player at age 31 will have 17% of his XBH as triples, and 83% as doubles. At age 36, the triples% will be 9% and his doubles% will be 91%. As you can see, his triples rate at age 36 is 55% of age 31, and his doubles rate of age 36 is 109% of his age 31 rate.

Ok, so, now, let's take our guy with 10 triples and 20 doubles at age 31 again. His triples rate is 33% and his doubles rate is 67%. Applying the factors from above, at age 36 his triples rate will be .333 * .545 = .182. His doubles rate will be .667 * 1.09 = .727. As you can see, his 2b+3b will equal .909. And this is impossible: the two rates have to add up to 1.
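
The same comparison can be run in a few lines. This is only a sketch: the function names are invented, and everything is computed straight from the raw counts, so the decimals may differ a little from the rounded figures quoted in the post.

```python
def age_by_ratio(triples, doubles, ratio_factor):
    """Age the 3B:2B odds, then convert the odds back into shares of 2B+3B."""
    aged_odds = (triples / doubles) * ratio_factor
    triple_share = aged_odds / (1 + aged_odds)
    return triple_share, 1 - triple_share


def age_by_rate(triples, doubles, triple_factor, double_factor):
    """Age the triples% and doubles% separately, as in the 'percentages' method."""
    total = triples + doubles
    return (triples / total) * triple_factor, (doubles / total) * double_factor


# Typical player from the example: 5 3B / 25 2B at age 31, 2 3B / 20 2B at age 36.
ratio_factor = (2 / 20) / (5 / 25)    # change in the 3B:2B ratio
triple_factor = (2 / 22) / (5 / 30)   # change in triples as a share of 2B+3B
double_factor = (20 / 22) / (25 / 30) # change in doubles as a share of 2B+3B

# The atypical player: 10 triples and 20 doubles at age 31.
ratio_shares = age_by_ratio(10, 20, ratio_factor)
rate_shares = age_by_rate(10, 20, triple_factor, double_factor)

print("ratio method shares sum to", sum(ratio_shares))
print("rate method shares sum to", sum(rate_shares))
```

The ratio method always returns shares that add up to 1, because it ages a single odds value and derives both shares from it. The rate method ages two numbers independently, so nothing forces them to add up.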

Posted 2:19 p.m., January 30, 2004 (#12) - MGL
  Tango,

Can you explain the "chaining" in more detail? I didn't understand that before and I still don't understand it. When I do my age curves, as I did for the Superlwts age curves, I simply plot the average delta added to some constant, for each age pair, and that's it. I have no idea what the differences are in the numbers in your pre-chaining and post-chaining tables.

Also, no one wants to address my concern about age versus experience. We've discussed this before and I posted something about it in the other thread. If these patterns are NOT caused primarily by aging, but by ML experience, they will look identical, but an aging pattern will NOT be useful for projecting pitcher performance. I suspect that there is more of an experience influence than age influence, at least until the later ages (mid and late 30's). Certainly we have to check that out before declaring a cause/effect relationship between age and performance!

What I would do is to first do the same exact charts, but simply substitute years of ML service for age. The charts should look similar if not exactly the same. Then you have to do more research to see which is the more prominent cause/effect relationship - age/perform. or ML service/perform! You can't assume that it is age, especially with pitchers!

The best way to attack that, initially at least, is to establish 2 or 3 groups of pitchers. Group I debuted in the ML's at say age 24 or less, group II 25 to 28, and group III, older than 28. Or whatever. Then do the age and service charts for each of those groups.

I suspect you will find that service time is at least as important as age, but I could be wrong. As I said, the distinction is critical if you want to apply these curves to any out of sample pitcher or year (as for projections)...

Posted 2:27 p.m., January 30, 2004 (#13) - tangotiger
  MGL, I looked at age/experience a few years ago at Fanhome. I'll try to find it, and post it here for you.

==============================

As for the pre-chaining, and post-chaining.

Say you have
Age Change in Performance
26-27 +5%
27-28 +0%
28-29 -5%
29-30 -10%
30-31 -15%

What do you do?

Now, the simple way would be to do:
26 100
27 105 (or 100 + 5)
28 105 (or 105 + 0)
29 100 (or 105 - 5)
30 90 (or 100 - 10)
31 75 (or 90 - 15)

This is probably the way MGL does it. But, percentages are not meant to be added like this. It's close, but not mathematically accurate.

What you want is:

26 100
27 105 (or 100 + 5%)
28 105 (or 105 + 0%)
29 99.75 (or 105 - 5% or 105 * .95)
30 89.775 (or 99.75 - 10% or 99.75 * .90)
31 76.31 (or 89.775 - 15% or 89.775 * .85)

I understand there's not much difference (especially since we are really only talking about 2 or 3% change, and not the 5 or 10% in my example).

But, I see no reason not to do it right.
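
As a quick check of the two approaches, here is a small sketch using the example changes from this post. The variable names are invented, and `accumulate(..., initial=...)` assumes Python 3.8 or later.

```python
from itertools import accumulate

# Year-to-year changes for the age pairs 26-27 through 30-31 in the example.
changes = [0.05, 0.00, -0.05, -0.10, -0.15]

# Simple way: add the percentage points to a base of 100.
additive = list(accumulate(changes, lambda level, c: level + 100 * c, initial=100.0))

# Chaining: multiply the running level by (1 + change).
chained = list(accumulate(changes, lambda level, c: level * (1 + c), initial=100.0))

for age, a, m in zip(range(26, 32), additive, chained):
    print(f"{age}  additive={a:7.3f}  chained={m:7.3f}")
```

The two lines agree through age 28 and then drift apart, with the gap growing as more changes are stacked on top of each other.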

Posted 2:53 p.m., January 30, 2004 (#14) - tangotiger (homepage)
  MGL, the above homepage link contains a file I had prepared for you at fanhome 2 years ago. I don't remember the time period but I think it was 1979-1999.

It shows you the unregressed year-to-year change, for ERA, given: prior years experience and age.

Let's look at age 27, and let me cut and paste the relevant portion:
PriorExp Age PA1 $ER
0 27 9135 1.06
1 27 19121 1.05
2 27 33890 1.06
3 27 44862 1.04
4 27 38466 1.06
5 27 29669 1.02
6 27 22825 1.06
7 27 8268 0.96

So, what does this show us? For players aged 27, their performance worsened by about 5%, regardless of number of previous years of experience.

****

Running a regression analysis of prior years against change in performance, and the r is .02. A regression of age against change in performance, and the r is .18. Age and prior years, and the r is .31. So, it seems that number of prior years seems to be important.

However...

How about running a regression of age when first made the majors (age minus prior exp) against change in performance? r is .28.

When a pitcher is brought up at age 21, it includes with it a certain amount of scouting information.

Let's look at the 10 best change in performance year-to-year:

PriorExp Age First Year PA1 $ER
18 37 19 2260 0.82
15 35 20 2449 0.85
6 25 19 2346 0.87
10 30 20 6319 0.87
16 35 19 2776 0.89
4 24 20 10130 0.92
17 37 20 2321 0.92
4 25 21 24857 0.92
13 34 21 7089 0.92
7 34 27 2892 0.92

Wow. In 9 of the 10 groups with the best change in performance, the pitchers first entered the league at age 19 through 21.

Of course, we've got small sample-itis. Redoing this, but limiting it to at least 10,000 PAs for each year-to-year change, and we have:
PriorExp Age First Year PA1 $ER
4 25 21 24857 0.92
4 24 20 10130 0.92
9 30 21 14527 0.94
2 23 21 20635 0.96
2 28 26 16468 0.98
4 30 26 14764 0.98
1 24 23 40812 0.99
8 29 21 19124 0.99
1 25 24 45688 1.01
5 28 23 38200 1.01

Again, the pitchers who started their careers at age 20/21 dominate the most.

Here are the 5 worst year-to-year change in performance:
PriorExp Age First Year PA1 $ER
3 28 25 27264 1.13
3 29 26 17358 1.13
9 33 24 18345 1.14
0 26 26 19150 1.16
6 32 26 11438 1.17

Pitchers who first started at age 25/26.

Let me reiterate: the biggest indicator is NOT based on how many prior years experience you have, but rather the age at which you first entered the league.

The younger you entered the league, the better your year-to-year changes will be, regardless of how many prior years experience you have had to that point.

So, to maximize the forecast for your pitchers, you want to know:
- how old is he now
- how old was he when he first entered the league

That second portion "carries" information about your pitcher. A pitcher who makes MLB at age 21 probably has much much more true talent than a pitcher who makes MLB at age 28. What this does is that rather than regressing all your pitchers towards the same pop mean, your pitchers who started off at age 21 will regress to a much higher (in terms of true talent) pop mean than the pitcher who makes the bigs at age 28.

That is, knowing the age a pitcher starts off, is kind of like using scouting information.

And, as we see here, an incredible amount of information can be gained from that.

Posted 4:16 p.m., January 30, 2004 (#15) - MGL
  Tango,

Fantastic stuff! Absolutely fantastic! There can be no one that can do this stuff more quickly than you can!

First of all, I think that much more detailed research needs to be done with the age/experience stuff.

That second portion "carries" information about your pitcher. A pitcher who makes MLB at age 21 probably has much much more true talent than a pitcher who makes MLB at age 28. What this does is that rather than regressing all your pitchers towards the same pop mean, your pitchers who started off at age 21 will regress to a much higher (in terms of true talent) pop mean than the pitcher who makes the bigs at age 28.

What you are actually suggesting is that there IS a significant aging pattern with pitchers, independent of their experience, and that the only reason we see either an experience influence or an "age of debut" influence is because of an "illusion" in the year to year data pairs created by the fact that the earlier a pitcher is called up, the better his true talent, regardless of his performance, such that we want to regress a pitcher's year X stats LESS for a pitcher who debuts early, than for a pitcher who debuts late. I agree, but I think there is more to it than that.

I have to think about this some more. It is indeed fascinating, but I'm afraid that we have only scratched the surface. As I said before, there is no particular reason NOT to think that major league experience improves a pitcher's talent regardless of age, or that age, especially in the middle (22-32), in and of itself, should affect a pitcher's talent very much. The opposite is true with hitting, and hitting is much more of a physical thing. Given that reasonable assumption, I would be a little surprised if there is a strong aging/performance (talent) cause/effect relationship, and if there weren't somewhat of an experience/performance (talent) relationship.

Remember that your initial r's (age and experience) are very misleading, which is why you got a higher r when using both variables. One, if performance improves with experience, but declines with age, of course a regression of experience on performance, without controlling for age, is going to yield an r of near zero, as those two things will cancel each other out. Secondly, if what I am saying about age is true, that it may be somewhat irrelevant in the middle age categories, then an r generated from a linear regression is not even valid if the relationship is non-linear! Remember that "r" is only useful for certain linear relationships. Even if there is a nice relationship, if it is not linear, "r" essentially is meaningless.

Plus, the age that a player debuts is almost inextricably related to his year of experience. For a player to have, say 10 years of experience, in order for him to NOT have debuted at an early age, he would have to be old, such that his year to year changes might reflect his age AND his experience rather than his age of debut. IOW, I am suggesting that this age of debut thing might be an illusion - that the real relationship might be age AND experience only, and they might even be independent, although your chart of the 27 yo pitchers with different years of experience tends to refute that. But then again, the problem with that chart is that there are different regression and selective sampling problems with each of those groups. Pitchers who are 27 with 5 or 6 years of experience already will tend to be the much more talented pitchers...

Posted 4:18 p.m., January 30, 2004 (#16) - tangotiger
  One final word on this.

If I start to create "classes", I get better regressions using:
Prior Year's Experience: 0,1,2+
First Year in MLB: 21 or under, 22, 23, 24, 25 or later
Age: (no classes, each age is its own class)

So, for prior year's experience, once you have been in the league 2+ years, experience doesn't count for anything.

For first year, treat all the 21 and under pitchers as carrying the same amount of scouting information, and the 25 and older pitchers as well.

Have fun!

Posted 4:22 p.m., January 30, 2004 (#17) - tangotiger
  experience doesn't count for anything.

that should read, "count for anything more". That is, 2 or 5 years of prior experience doesn't make any difference.

I agree, lots more to do, but I don't see much anyway. The changes, year-to-year will still be pretty tiny.

I suppose if you were trying to forecast for the next 5 years, it would be important. But, for year-to-year? I don't see it.

Posted 4:43 p.m., January 30, 2004 (#18) - tangotiger
  Ok, one more thing to think about.

Say that the year-to-year changes showed that the ERA got worse by 2% for EVERY age pair. The regression would show that age would have a correlation of ZERO.

That is, suppose you have:
21 to 22 102%
22 to 23 102%
23 to 24 102%
24 to 25 102%
25 to 26 102%
26 to 27 102%
27 to 28 102%
28 to 29 102%
29 to 30 102%
30 to 31 102%
31 to 32 102%
32 to 33 102%

Clearly, what we have is that for every age pair, there is a constant decrease in performance. However, the regression shows that the age pair would have zero effect.

What the age thing is actually showing is that there's no ADDITIONAL effect at that age, but that age does have an effect.

The problem is that I'm doing age pairs, and then running a regression on the age.

What I REALLY should do (but won't) is to first chain the age performances (1.02, 1.04, 1.06, etc, etc), and then run a regression of that performance line. However, it won't be a straight line anymore, but some parabolic curve.

So, really, I need to, somehow, run a regression against some sort of quadratic equation.
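
A small sketch of the point being made here, under the assumption of a constant 1.02 factor for every age pair from 21-22 through 32-33. The names are invented for illustration. The least-squares slope of the year-to-year factor against age comes out at zero, yet the chained level keeps climbing.

```python
from itertools import accumulate
from operator import mul

start_ages = list(range(21, 33))      # 21-22, 22-23, ..., 32-33
factors = [1.02] * len(start_ages)    # same 2% worsening for every age pair

# Ordinary least-squares slope of factor on age, written out by hand.
mean_age = sum(start_ages) / len(start_ages)
mean_factor = sum(factors) / len(factors)
slope = sum((a - mean_age) * (f - mean_factor) for a, f in zip(start_ages, factors)) / sum(
    (a - mean_age) ** 2 for a in start_ages
)

# Chain the factors to get the cumulative level relative to age 21.
cumulative = list(accumulate(factors, mul, initial=1.0))

print("slope of factor on age:", slope)
print("cumulative level at age 33:", round(cumulative[-1], 3))
```

Strictly speaking, the correlation coefficient itself is undefined when one variable never changes, so the zero slope is the cleaner way to see that the regression finds no age effect in the pairs even though the chained curve is anything but flat.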

Anyway, I posted the file with all the data. The stat-savvy's among you probably know what I'm talking about, and can figure it out better than I could.

Hopefully, we'll see some results.

Posted 5:16 p.m., January 30, 2004 (#19) - MGL
  Just want to make sure I understand the data in the table above:

The $ER (don't you mean "delta" and not "$"? We use "$" to mean some kind of "rate" like $H) is the simple ratio of year X-1 ERA to year X ERA? There is no chaining in this data, right? What is the min number of BF for each year? The TPA1 column is the actual total number of PA in year X, and not the "min PA of year X and X-1"? Finally, "age 20" means from 20 to 21?

So your final conclusion is that if we just use an aging curve like we do for batters, we are not going to be costing ourselves much by not paying any attention to years of experience or age of debut?

Posted 6:35 p.m., January 30, 2004 (#20) - Stephen
  Brief Hijack

MGL, do you happen to have your UZR results from 2000-2003 by year in a spreadsheet? Thanks.

Posted 8:08 p.m., January 30, 2004 (#21) - Blixa Bargeld (homepage)
  uzr

Posted 9:43 p.m., January 30, 2004 (#22) - MGL
  Thanks Blixa. Tango, BTW, I finally got what you mean by "chaining." After all these years, literally. I can be very dense sometimes! I agree that although it doesn't make that much difference, you might as well do it right. For example, in the other thread AED (I think) pointed out that the correct way to do the weightings when you have matched pairs with different numbers of PA's (for example) is to use 1/(1/PA1+1/PA2), rather than the "lesser of the two PA's." It's almost the same thing, but, as you said, you might as well do it right! And yes, I have done it the way you thought I did it, by just adding or subtracting one from the other. That's how the graphs are constructed. It won't make any difference in the shape of the graphs of course, but I will correct it.
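
To see how close the two weightings are, here is a minimal sketch comparing 1/(1/PA1+1/PA2) with the lesser of the two PA's. The function name and the sample PA pairs are hypothetical.

```python
def pair_weight(pa1, pa2):
    """Weight for a matched pair of seasons: 1 / (1/PA1 + 1/PA2)."""
    return 1 / (1 / pa1 + 1 / pa2)


# Hypothetical matched pairs of (PA in year 1, PA in year 2).
pairs = [(600, 600), (250, 1000), (300, 900)]

for pa1, pa2 in pairs:
    print(pa1, pa2, "weight:", round(pair_weight(pa1, pa2), 1), "lesser PA:", min(pa1, pa2))
```

The weight never exceeds the lesser PA, and it falls between half the lesser PA (two equal seasons) and nearly the full lesser PA (one season much longer than the other). Since only relative weights matter, the two schemes differ by at most a factor of two from pair to pair.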

Funny, on BPro, Davenport had an article about constructing MLE's for the Winter Leagues. He weighted the matched pairs by using the "lesser of the two PA's." I chuckled when I read that, because I never heard of anyone using it other than Tango and me, and it turns out that technically it is not correct!

Posted 11:41 p.m., January 30, 2004 (#23) - tangotiger
  Voros also used the lesser of 2 PAs for his minor league MLEs. It's always nice to have someone with AED's background to set us stat-amateurs straight.

MGL, your post 19 is almost accurate, except that PA column IS the min(PA1,PA2).

Posted 10:14 a.m., January 31, 2004 (#24) - Blixa Bargeld
  Tango - are you planning on doing the same thing for batters (say, *OPS+)?

Posted 11:25 a.m., January 31, 2004 (#25) - tangotiger (homepage)
  Not really. Regression towards the mean affects hitters much less than pitchers (because their stats are a more reliable indicator).

However, my first article at Primer 2 years ago had a rather extensive look at hitters. And, if you go to my site above, you'll get a great chart on aging patterns by components (similar to what I did for pitching).

Posted 12:52 p.m., January 31, 2004 (#26) - MGL
  Chaining assumes that the y-t-y changes occur on a percentage basis for all pitchers (and hitters) regardless of the underlying rate (high or low). Do we know for sure that this is true - that it doesn't change by some constant or some combination of a constant and a percentage (something non-linear)?

For example, if a 22 yo hitter has a true HR rate of 5 per 500 PA and another one has 10 per 500 PA, do we know that the average change in HR rate for these 2 players is not a constant, like +1 per 500?

If you did a study to check this, you can't look at players who had above or below average rates at any age of course, because that would be their observed rates and not their actual rates, and the players with high and low rates will regress the following year, giving you false values for their true changes due to age.

You would have to maybe group players by defensive position and look at the age curves for, say 1st basemen, LF, and RF, versus the age curves for SS, catchers, and 2B man.

For pitchers, I'm not sure how you would get around the huge selective sampling issues I'm talking about. I guess you could group the pitchers by career rates, but again, pitchers with high or low career rates in any category probably had a weird aging pattern, independent of whether all aging patterns are indeed arithmetic or geometric (adding a constant from y-t-y or a percentage, in which case you have to do the chaining)....

Posted 1:04 p.m., January 31, 2004 (#27) - tangotiger
  MGL, I'm fairly confident that if you use the RATIO process, you will get the best year-to-year change that can be applied to various levels of performance.

Posted 1:06 p.m., January 31, 2004 (#28) - Anonymous
  .

Posted 4:12 p.m., January 31, 2004 (#29) - FJM
  I ran a quick multiple regression on your data set. Since the average pitcher debuts at age 23 I defined X1 as DebutAge-23 and X2 was years of experience. The results came out pretty much as you expected.

Y = 1.042 + 0.012*X1 + 0.003*X2.

In other words, a pitcher who debuted at age 23 (X1=0) can expect to experience a 4.2% decline in his second year, 4.5% in his third year, then 4.8% and so on. If he debuts at age 24, the first year decline increases to 5.4%, then 5.7% and 6.0% and so on. For each year his debut is delayed, his rate of decline increases by 1.2% every year.

So if his debut is delayed until 27, his second year decline is expected to be 4.2%+4*1.2%=9.0%, more than double the rate of decline experienced by those who break in at 23. No wonder pitchers who arrive late don't last very long!

Of course, for those who break in before age 23, the rate of decline is reduced by 1.2% per year. So a pitcher who debuts at 20 can expect a first year decline of only 4.2%-3*1.2%=0.6%, hardly any dropoff at all.
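
The worked cases above can be reproduced with a short sketch of the fitted equation. The function name is invented, and the coefficients are simply the unweighted ones quoted in this post, with prior experience 0 standing for the change going into the second year.

```python
def yearly_era_factor(debut_age, prior_exp, b0=1.042, b1=0.012, b2=0.003):
    """Year-to-year ERA factor: Y = b0 + b1*(DebutAge - 23) + b2*PriorExp."""
    return b0 + b1 * (debut_age - 23) + b2 * prior_exp


for debut_age in (20, 23, 24, 27):
    factors = [yearly_era_factor(debut_age, exp) for exp in range(3)]
    declines = ", ".join(f"{(f - 1) * 100:.1f}%" for f in factors)
    print(f"debut at {debut_age}: {declines}")
```

These are single-season changes. To get a cumulative effect over several seasons they would still have to be multiplied together.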

The model produces a fairly good fit, although R^2 is only 10%. The F-statistic is 7.59. X1 is definitely significant (t-statistic of 3.88) while X2 is only marginally significant (t-statistic 1.86). It's interesting that the model never projects any improvement, regardless of debut age or years of experience.

But wait. There is something seriously wrong with this model. It weights all observations the same, even though some are based on only a little over 2,000 PA's and others are based on 40,000+. I will rerun it next week with the observations weighted from 1 to 12. The weights will be determined by PA/4000, rounded off to the nearest whole number. It will be interesting to see how it changes.

Posted 9:37 p.m., January 31, 2004 (#30) - MGL
  I'm not sure what your dependent variable Y is (is it year to year decline or total running decline from the debut year). For a pitcher who debuts at age 23, he experiences a 4.2% decline in the second year. In the third year, is that a 4.5% decline from the first year or the second year? IOW, after the second year are we expecting his ERA to be 8.7% (a little different if we "chain") worse than his debut, or 4.5% worse than his debut?

Posted 11:32 p.m., January 31, 2004 (#31) - FJM
  I didn't chain before running the regression.

Posted 11:36 p.m., January 31, 2004 (#32) - MGL
  FJM, that wasn't my question....

Posted 12:05 a.m., February 1, 2004 (#33) - RossCW
  Here is the result of all that for
HBP: HBP/(PA-IBB-HBP)
BB: (BB-IBB)/(PA-HBP-BB)
SO: SO/(PA-HBP-BB-SO)
HR: HR/(PA-HBP-BB-SO-HR)
xH: (H-HR)/(PA-HBP-BB-SO-H)

Perhaps someone can explain why the relationship of home runs to plate appearances that are not strikeouts, walks, home runs or hit by pitch is significant, or the relationship of strikeouts to plate appearances that are not walks, hit by pitch or strikeouts?

Posted 7:56 a.m., February 1, 2004 (#34) - David Smyth
  Tango, I don't quite understand why you gave the last chart, with the forced-in regression values. I mean, if your study of regression gave the values in the middle chart, why not just use those? Why confuse us as to which chart to use?

So the general pattern is of pitchers coming into the lg with their best stuff, and learning to compensate for losing it due to aging, by increasing their control. How does your study deal with the sampling problem of pitchers who were not able to improve their control much, and therefore dropped out of the lg well before age 39? Wouldn't that make control seem like it peaks later than it really does?

And I do find the chart a bit surprising. It always seemed to me, by casual observation, that pitchers in general tended to improve their K rates and HR rates for a few years after coming into the lg.

Posted 2:04 p.m., February 1, 2004 (#35) - RossCW
  It always seemed to me, by casual observation, that pitchers in general tended to improve their K rates and HR rates for a few years after coming into the lg.

That's because they do. The numbers above measure K rate to non-walks and hit by pitch. So if a pitcher strikes out 5 and walks 4 in 10 plate appearances one year and strikes out 5 and walks 1 in 7 plate appearances the next the "K rate" remains the same even though the pitcher struck out over 70% of the batters faced the second year and 50% the first.
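
The arithmetic in this example is easy to verify. A minimal sketch, with invented function names, comparing the SO ratio from the article with a plain strikeouts-per-PA rate:

```python
def so_ratio(pa, bb, so, hbp=0):
    """SO ratio as defined in the article: SO / (PA - HBP - BB - SO)."""
    return so / (pa - hbp - bb - so)


def so_per_pa(pa, so):
    """Plain strikeout rate: SO / PA."""
    return so / pa


year1 = dict(pa=10, bb=4, so=5)
year2 = dict(pa=7, bb=1, so=5)

for label, line in (("year 1", year1), ("year 2", year2)):
    print(label, "SO ratio:", so_ratio(**line), "SO per PA:", round(so_per_pa(line["pa"], line["so"]), 3))
```

The SO ratio is identical in both years because the drop in walks is removed from the denominator, while strikeouts per PA rise from 50% to about 71%.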

Posted 3:45 p.m., February 1, 2004 (#36) - tangotiger
  if your study of regression gave the values in the middle chart, why not just use those? Why confuse us as to which chart to use?

Because of the amount of selective sampling, I'm not sure that my regression values apply directly here.

Since I did not spend much effort in creating those charts, I did not want to leave the impression that any of the charts are final, and then having to justify something that I did not research extensively.

Posted 3:40 p.m., February 2, 2004 (#37) - FJM
  My regression was on the year-to-year changes, not the cumulative effect. They still need to be "chained" together for that.

I reran the regression using the weighting procedure outlined in my previous post. I was glad to see the model didn't change very much. The R^2 improved only slightly to 14%, although the F-statistic jumped from 7.59 to 41.31. The t-statistic for debut age more than doubled, from 3.88 to 8.72, and the t-stat for years of experience nearly tripled, from 1.86 to 5.24. But the actual coefficients changed very little. The new regression equation is Y=1.037+0.0127*X1+0.0036*X2.

Posted 4:01 p.m., February 2, 2004 (#38) - tangotiger
  By the way, I'm not sold that the binary split that I have is fine. They have never been verified. These splits were first brought forth (to me) by Voros, and making it binary does give us some benefits.

They just seem logical, but you can come up with different ways to split it up. For example, you can make it nonContactPAs / contactPAs, so that you have (BB+K+HBP)/(PA-BB-K-HBP) as one ratio. Then, you can do HBP/(BB+K), and then, K/BB.

In the end, the way to break it down has to be supported by how baseball really works. I'm pretty sure that we can't just break it down into these binary measures, but I'm not sure the impact of doing so.

Posted 7:26 p.m., February 2, 2004 (#39) - J Cross
  So, FJM, would it be fair to say that a pitcher's age for the purposes of projection is:

Age - 3*(23-debutage) ???

Pedro, age 32, debut 20 --- proj. age... 23!!!
Colon, age 28, debut 24 --- proj. age... 31

Looks like Pedro's got another decade left :)
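
The rule of thumb proposed in this post, written out as a one-line function. The name `projection_age` is invented, and the formula is just the tongue-in-cheek one above, not a tested model.

```python
def projection_age(age, debut_age, baseline_debut=23, speed=3):
    """Age - 3*(23 - debut age), as suggested in the post."""
    return age - speed * (baseline_debut - debut_age)


print("Pedro:", projection_age(32, 20))
print("Colon:", projection_age(28, 24))
```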

Posted 6:08 p.m., February 3, 2004 (#40) - FJM
  It's true that pitchers who break in at 24 "age" about 3 times faster than those who debut at 20. But you are ignoring X2, the Years of Experience factor. Although its coefficient is small relative to the X1 coefficient, it becomes very significant for pitchers who have been around a long time, like Pedro.

Looking at Tango's raw data for pitchers who debut at 20, it looks like Pedro has 5 years left, at best. Age 37 is the last year in his data base where the total PA's exceed 2,000. And remember, that's for all pitchers in that category combined. I don't know how many of them made it to 37, but I can tell you that in their peak years (age 24 and 26) the same group was recording over 10,000 PA's per year. Interestingly, age 37 seems to be the end of the line for pitchers who debut at 24 (like Colon) as well. But since there were a lot more of them, their decline is a lot more spectacular: at their peak, they were posting over 45,000 PA's per year.

While I'm on the subject, take a look at this chart which I developed from Tango's data by "chaining", multiplying the year-to-year factors together to see the cumulative effect of aging.

PriorExp  DA20  DA21  DA22  DA23  DA24  DA25  DA26
 0        0.93  0.95  1.02  1.04  1.07  1.10  1.16
 1        0.94  1.02  1.05  1.03  1.08  1.18  1.22
 2        1.02  0.98  1.09  1.05  1.12  1.25  1.19
 3        1.05  1.01  1.15  1.13  1.17  1.41  1.35
 4        0.97  0.93  1.23  1.20  1.23  1.49  1.32
 5        1.07  1.04  1.25  1.21  1.30  1.52  1.45
 6        1.17  1.10  1.29  1.27  1.43  1.68  1.70
 7        1.13  1.21  1.34  1.36  1.50  1.79  1.82
 8        1.12  1.20  1.50  1.53  1.59  2.12  1.71
 9        1.19  1.12  1.56  1.67  1.82  2.31  1.68
10        1.04  1.24  1.62  1.72  2.03  2.77  1.81
11        1.17  1.34  1.72  1.92  2.07  2.58  2.25
12        1.29  1.52  1.81  2.11  2.51  3.30
13        1.36  1.40  1.84  2.26  2.71  3.23
14        1.57  1.30  2.10  2.31        3.62
15        1.34  1.47  2.52  2.58
16        1.59  1.78  2.47  2.69
17        1.46  1.87  2.84  2.66
18              1.93
19              2.04

Look at the "12" line, for example. After 12 years a pitcher who debuted at 20 (e.g., Pedro) has declined in effectiveness by 29% on average. That's not bad at all. But notice almost all that decline can be attributed to the last 2 years. (The apparent dramatic improvement from 9 to 10 I attribute to small sample size. Remember, the number of pitchers in each group keeps changing over time.)

Returning to the "12" line, notice how much greater the decline in effectiveness is for those who debut later. For the 21-year-old phenoms, the dropoff is nearly twice as much, 52%. For the 22's, it's almost 3 times as much, 81%. For the 23's, the average pitchers, it's nearly 4 times as much, 111%. Pitchers like Colon who start at 24 can expect a 151% worsening, more than 5 times as much. And those who don't make it until 25 can look forward to a catastrophic 230% decline over 12 years, nearly 8 times as bad as their 20-y.o. counterparts. That's assuming they make it that far. Those who debut after 25 have little hope of being around after 12 years.

I must say I find this data shocking. How could any pitcher have his ERA more than triple and still see significant playing time? Even doubling an ERA seems highly unlikely. Perhaps Tango can shed some light on this.

Posted 11:45 p.m., February 3, 2004 (#41) - tangotiger
  I think regression comes into play here.

If my sample is continually made up of better-than-average performances, we expect those pitchers to post a worse ERA the next year.

So, a 3.8 ERA will be followed by a 4.0 ERA the next year, even though the true talent level stayed the same.

Now, the next year, your group of pitchers has an ERA of 3.8 (those guys in the 4.0 group, plus a new batch of pitchers). They follow that up with an ERA of 4.0.

When you chain, that makes it 3.8, 4.0, 4.2. In actual fact, they should probably be 4.0, 4.0, 4.0.

Regression makes a world of difference. Without it, you get the incorrect conclusion that pitchers peak at age 24.
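
A small simulation can illustrate the chaining effect described in this post. It is a sketch under assumed conditions, not a reconstruction of the actual study: every pitcher has the same constant true ERA of 4.0, observed ERA is true ERA plus normal noise, and a pitcher only enters a year-pair if his first-year ERA beats a cutoff. The noise level, cutoff, and function name are all hypothetical.

```python
import random
import statistics

def selected_pair_means(n_pitchers=20000, true_era=4.0, noise_sd=0.6,
                        cutoff=4.2, seed=0):
    """Mean observed ERA in year 1 and year 2 for pitchers selected on year 1."""
    rng = random.Random(seed)
    year1, year2 = [], []
    for _ in range(n_pitchers):
        era1 = rng.gauss(true_era, noise_sd)
        if era1 <= cutoff:  # selective sampling: only good year-1 lines qualify
            year1.append(era1)
            year2.append(rng.gauss(true_era, noise_sd))  # talent unchanged
    return statistics.mean(year1), statistics.mean(year2)

y1, y2 = selected_pair_means()
ratio = y2 / y1  # apparent year-to-year "decline" with no change in talent

# Chaining the same ratio twice manufactures a decline that is not there.
chained = [y1, y1 * ratio, y1 * ratio ** 2]
print(f"year 1 {y1:.2f}, year 2 {y2:.2f}, ratio {ratio:.3f}")
print("chained:", [round(v, 2) for v in chained])
```

The selected group looks better than 4.0 in year 1 and drifts back to about 4.0 in year 2, so the ratio comes out above 1 even though nobody's talent moved.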

Posted 12:47 p.m., February 4, 2004 (#42) - FJM
  I understand what you are saying, but I'm not sure it applies here. In life insurance terminology this analysis involves looking at cohorts; i.e., groups of pitchers sorted by debut age. So the population we are dealing with is fixed; no new pitchers can join a cohort after the debut age. The only changes in the cohort over time come from pitchers dropping out due to injury, aging or just plain poor performance. Thus it would seem to me any bias would be in the direction of retaining the better pitchers while purging the marginal ones. If that is the case, we would expect to see less deterioration in this data base than there is in reality. Imagine how bad it could get if every pitcher was forced to continue pitching until age 37!

The easiest way I can think of to test for this bias is to create new cohorts based not only on debut age but also on how old each pitcher was in his final qualifying year. Do you have the data to do that?

Posted 1:08 p.m., February 4, 2004 (#43) - tangotiger
  Let's concentrate on your last column: players who debuted at age 26.

Those are not the same players each year, as you will have attrition. Who goes? Those guys at the (performance) bottom of the barrel.

So, in their rookie year, they'd have an ERA of 4.0, and, of those guys who made it to the soph year, they have an ERA of say 4.4. But, what about the guys who didn't make it into their soph years? The guys at the 4.0 level are a subset of a larger group. This larger group really had an ERA of (say) 4.5. Now, how did they do (or how would they have done) in their soph years? Well, the guys who stuck around did 4.4, and the guys who didn't... well, we don't know, but say they would have done a 5.0, such that the overall average is 4.5.

So, track the exact same group of pitchers from year-to-year-to-year, and we see we've got a problem with attrition. If you've got attrition (i.e., selective sampling in most cases), then you need regression towards the mean.
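
The numbers in this example can be laid out as a short calculation. The ERA figures are the "say" values from the post. The survivor share is not stated there; it is simply what those averages imply, and the variable names are hypothetical.

```python
# Illustrative ERAs taken from the post.
survivors_rookie = 4.0   # rookie-year ERA of those who reached a second year
survivors_soph = 4.4     # their second-year ERA
full_rookie = 4.5        # rookie-year ERA of the whole debut group
dropouts_soph = 5.0      # assumed second-year ERA of those who dropped out
full_soph = 4.5          # second-year ERA of the whole group, had all pitched

# What the survivor-only sample shows vs. what the whole group did.
survivor_ratio = survivors_soph / survivors_rookie
full_ratio = full_soph / full_rookie

# Share of survivors implied by 4.4 and 5.0 averaging out to 4.5.
survivor_share = (dropouts_soph - full_soph) / (dropouts_soph - survivors_soph)

print(f"survivors only: {survivor_ratio:.2f}, whole group: {full_ratio:.2f}")
print(f"implied survivor share: {survivor_share:.2f}")
```
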

Posted 1:12 p.m., February 4, 2004 (#44) - J Cross
  Okay, but let's assume that three players from the same cohort all decrease in true talent by 10% one year. One of the players had good luck and his stats showed no decrease. One had even luck and showed the 10% decrease. The third had bad luck, his stats showed a 20% decrease, and he's forced to retire. Assuming the two remaining players would be expected to decrease in true talent by another 10% the following year, their stats would actually be expected to decrease by 15%, overstating their decline.
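
The three-player example works out as follows. This sketch assumes the 10% declines are percentage points of the original baseline (an additive reading, which is what reproduces the 15% figure) and that luck averages out to zero in the second year. The names and the baseline of 100 are hypothetical.

```python
baseline = 100.0   # arbitrary starting level for all three players
true_drop = 10.0   # true-talent decline per year, in points of baseline

# Year 1: same true talent for everyone, different luck.
true_year1 = baseline - true_drop
luck = {"lucky": 10.0, "even": 0.0, "unlucky": -10.0}
observed_year1 = {name: true_year1 + bump for name, bump in luck.items()}

# The player whose stats showed a 20-point drop retires.
survivors = [name for name, obs in observed_year1.items()
             if baseline - obs < 20.0]
survivors_observed = sum(observed_year1[n] for n in survivors) / len(survivors)

# Year 2: true talent drops another 10 points; expected stats equal true talent.
true_year2 = true_year1 - true_drop
apparent_decline = survivors_observed - true_year2

print(f"survivors looked like {survivors_observed:.0f}, expected next year {true_year2:.0f}")
print(f"apparent decline {apparent_decline:.0f} points vs. true {true_drop:.0f}")
```
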

Posted 1:28 p.m., February 4, 2004 (#45) - J Cross
  Yeah, what tango said.

Posted 2:28 p.m., February 4, 2004 (#46) - FJM
  I think we're saying the same thing. The point is, we can all speculate on the effect that attrition has on our data base. But if you define your cohorts based on both debut age AND final age we won't need to speculate; we'll know the answer.
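
One way the two-key cohort definition might be set up is sketched below. It assumes each pitcher's career is available as a list of ages at which he had a qualifying season. The function name and the toy records are invented placeholders, not real pitchers or real data.

```python
from collections import defaultdict

def build_cohorts(careers):
    """Group pitchers by (debut age, final qualifying age)."""
    cohorts = defaultdict(list)
    for pitcher, ages in careers.items():
        if ages:  # skip anyone with no qualifying seasons
            cohorts[(min(ages), max(ages))].append(pitcher)
    return dict(cohorts)

# Made-up careers, purely to show the grouping.
toy_careers = {
    "A": [22, 23, 24, 25],
    "B": [22, 23, 25],      # missed age 24 (say, injured) but stays in the cohort
    "C": [22, 23],
    "D": [24, 25, 26],
}

cohorts = build_cohorts(toy_careers)
for key in sorted(cohorts):
    print(key, cohorts[key])
```

Because membership is fixed by both endpoints, nobody drops out of a cohort partway through, and a missed season does not remove a pitcher from it.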

Posted 2:38 p.m., February 4, 2004 (#47) - tangotiger
  Yes, one thing that I wanted to do is take players over the same age span. Say, all pitchers who pitched at least 250 PA each year from age 23 to age 33. Then, look only at those pitchers from age 24 to 32. I would apply this last condition only because of the selective sampling that happens in the last year, and maybe the first year.
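
The selection rule in the paragraph above could be sketched like this, assuming each pitcher's PA totals are stored in a dict keyed by age. The helper names and the toy pitcher are hypothetical.

```python
def qualifies(pa_by_age, x, y, min_pa=250):
    """True if the pitcher reached min_pa in every season from age x through y."""
    return all(pa_by_age.get(age, 0) >= min_pa for age in range(x, y + 1))

def interior_ages(x, y):
    """Ages actually studied: the first and last seasons are dropped."""
    return list(range(x + 1, y))

# Made-up pitcher with 300 PA at every age from 23 through 33.
toy_pitcher = {age: 300 for age in range(23, 34)}

print(qualifies(toy_pitcher, 23, 33))   # full span is covered
print(qualifies(toy_pitcher, 22, 33))   # no age-22 season on record
print(interior_ages(23, 33))
```
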

Of course, if I looked at pitchers aged 22 to 40, I'd get a different aging pattern (likely that the decline phase would not be so bad... think Clemens).

Essentially, any combination of age x to age y, such that you look at the performances of x+1 to y-1. If you make x from 21 to 26, and you make y from 26 to 36, that gives you 66 different combinations to look at.

It's a good idea, one which I've wanted to do, but, it looks like a big pain in the butt. If someone wants to contract me out :), I'd be happy to do it.

Posted 2:52 p.m., February 4, 2004 (#48) - tangotiger
  For the sharp thinkers, that would be 60, since the span from x to y must cover at least 4 seasons (y-x of at least 3).
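
The count of combinations is easy to enumerate. This sketch assumes x runs from 21 to 26 and y from 26 to 36, as described above, and shows how the total depends on the minimum gap required between them. The function name is hypothetical.

```python
def count_spans(min_gap):
    """Count (x, y) pairs with x in 21..26, y in 26..36, and y - x >= min_gap."""
    return sum(1
               for x in range(21, 27)
               for y in range(26, 37)
               if y - x >= min_gap)

unrestricted = count_spans(0)   # every pairing of the two ranges
four_seasons = count_spans(3)   # span of at least 4 seasons, x through y inclusive
gap_of_four = count_spans(4)    # y at least 4 years after x

print(unrestricted, four_seasons, gap_of_four)
```

Under these ranges, a total of 60 corresponds to requiring y - x of at least 3, which leaves at least two interior seasons (x+1 and y-1) to compare.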

Posted 7:22 p.m., February 4, 2004 (#49) - FJM
  The study you are proposing is somewhat different than what I suggested. Requiring that a pitcher have a minimum of 250 PA's every year strikes me as too restrictive. It essentially requires that he be free of major injury for most of his career, something very few pitchers achieve. For some of the smaller cohorts (e.g., debut age 20) you could be down to just a handful of pitchers at some ages, perhaps none at all in the advanced stages of their careers. Moreover, as you point out, you still have to deal with the problem of attrition in year x-y. If y>>x, the attrition problem isn't too serious, but the injury-free requirement is; if y=x+4 or 5, attrition is still a major issue. If you try my approach, attrition is not a problem at all, and major injuries are a minor complication that doesn't require throwing a pitcher out of the study.