SABER All I Want To: Starting Pitcher Ratings: Motivation

Around this time last year, I wrote a series of posts introducing the Weighted WAR framework for evaluating position players. At the time, I specifically excluded pitching value from consideration. At the risk of being a bit self-indulgent, I’ll quote my own post from last year on this topic:

“The question still remains: which form of WAR to use? This is a thorny question for pitchers, because unlike the batting and fielding components of different WAR systems (which may differ in outcome but are generally trying to accomplish the same thing), pitching WAR varies philosophically from system to system. To hugely simplify a complicated topic, Baseball Reference WAR starts from pitcher runs allowed and makes a blanket adjustment for the general quality of the team’s fielders. Fangraphs WAR uses fielding independent numbers (strikeouts, walks and home runs) to measure a pitcher’s entire contribution and assumes any variance outside of that is attributable to the team’s fielders. And RA9 WAR (also presented by Fangraphs these days, and used in other sources as well, including Seamheads for NeL stats) takes the pitcher’s park adjusted runs allowed and ignores any fielding effects entirely.

I don’t particularly like any of the freely available WAR systems for pitchers; I think they all simplify the pitcher-fielder dynamic too much and produce extreme outcomes as a result. As such, both because of the extra work that would be involved and a lack of faith in a reasonable outcome, I’m not considering pitching value as part of this project. Again, this is by no means intended to imply that pitching value shouldn’t count; I may circle back to this topic in the future.”

Welcome to “in the future.” In my next several posts, I’ll be introducing a new method of evaluating starting pitchers, one which I think provides a compelling alternative to any of the readily available WAR systems. But before we get into that, it’s probably a good idea to further explore those WAR systems and explain why I think an alternative is worth introducing to begin with.

As mentioned above, there are three versions of pitching WAR that can be easily found on major baseball websites. I’ll take them on in reverse order of my own preference.

My least-favorite option is also the simplest: RA9 WAR. RA9 stands for runs allowed per 9 innings; it’s similar to the more-familiar ERA, but with unearned runs included (a choice that we’ll discuss later in the series, but one that I agree with). RA9 WAR simply park adjusts the pitcher’s rate of runs allowed (different sources may use different styles of park adjustment, either general to the team or specific to the pitcher), then compares it to the league’s replacement level for the year. Outside of the park adjustment, no other modifications to the pitcher’s performance are made. In particular, there is no attempt to separate the pitcher’s performance from the contributions of the fielders behind him (whether helpful or harmful), or from his bullpen’s success or failure in stranding runners who were on base when the pitcher was removed.
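For readers who like to see the arithmetic, here is a minimal Python sketch of the rate stats just described. The function names, the sample stat line, and the park factor are all hypothetical. Dividing by a single run park factor is only one simple convention, assumed here for illustration; as noted, actual sources handle park adjustment in different ways.

```python
def per_nine(runs: float, innings_pitched: float) -> float:
    """Scale a run total to a per-9-innings rate."""
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    return 9.0 * runs / innings_pitched


def ra9(runs_allowed: float, innings_pitched: float) -> float:
    """All runs allowed (earned plus unearned) per 9 innings."""
    return per_nine(runs_allowed, innings_pitched)


def era(earned_runs: float, innings_pitched: float) -> float:
    """Earned runs only, per 9 innings."""
    return per_nine(earned_runs, innings_pitched)


def park_adjusted_ra9(runs_allowed: float, innings_pitched: float,
                      park_factor: float = 1.0) -> float:
    """ASSUMPTION: a single run park factor, where 1.00 is neutral and
    values above 1.00 mean a hitter's park. Real WAR systems use more
    detailed adjustments than this."""
    if park_factor <= 0:
        raise ValueError("park_factor must be positive")
    return ra9(runs_allowed, innings_pitched) / park_factor


# Hypothetical stat line (not a real pitcher): 200 IP, 80 runs, 70 earned.
sample_ra9 = ra9(80, 200.0)
sample_era = era(70, 200.0)
sample_adjusted = park_adjusted_ra9(80, 200.0, park_factor=1.05)
print(f"RA9 {sample_ra9:.2f}, ERA {sample_era:.2f}, "
      f"park-adjusted RA9 {sample_adjusted:.2f}")
```

The gap between the first two numbers is the unearned runs; the third shows how a hitter’s park pulls the adjusted rate down.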

To take an extreme example: Say the starting pitcher records two outs in the seventh inning, then strikes out the next hitter, but the ball gets away from the catcher and the hitter reaches base. The starter is pulled from the game, and the new pitcher immediately allows a line drive into the gap. The center fielder has a chance at the ball, but can’t quite reach it; the right fielder picks it up and tries to throw out the runner at home, but his throw is slightly up the line and the runner scores. At minimum, five players contributed to allowing that run, and one could argue that the starter bears the smallest share of that responsibility. But the official stats charge him with the run, and therefore so does RA9 WAR.

While most runs allowed are not this heavily skewed away from the pitcher’s own contributions, it’s still obvious that the efforts of the fielders and relievers are crucial pieces of context for the starter’s performance. If you compare Jim Palmer and Rick Reuschel’s 1974 seasons, but fail to account for the fact that Palmer was pitching in front of brilliant fielders such as Mark Belanger and Brooks Robinson and Paul Blair while Reuschel suffered through the butchery of Bill Madlock and Rick Monday, you won’t get the full picture of their relative effectiveness. Ignoring fielding and relief work entirely strikes me as eminently counterproductive.

Next up, we’ll go to the other extreme with FIP WAR, primarily presented on Fangraphs (and therefore usually abbreviated as fWAR). Per Fangraphs’ own explanation of their system, fWAR uses five components on a per-inning basis: home runs, walks, batters hit by pitch, strikeouts, and infield popups. (Walks and HBP are treated as equivalent, as are strikeouts and popouts.) The theory behind this is that these components are the best representation of the pitcher’s actual ability, as they are the events for which the pitcher is most directly responsible. As such, they also tend to have better year-to-year correlations than rates of hits or runs allowed.
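As a rough sketch of how those five components combine, here is the formula in Python. The 13/3/2 weights are the commonly published FIP weights rather than anything stated in this post, the league constant is a placeholder you would need to look up for the season in question, and the stat line is hypothetical.

```python
def fip_with_infield_flies(hr: int, bb: int, hbp: int, k: int, iffb: int,
                           innings_pitched: float,
                           league_constant: float) -> float:
    """FIP variant that counts infield fly balls like strikeouts.

    Walks and hit batters share one weight; strikeouts and infield
    popups share another, matching the equivalences described above.
    ASSUMPTION: the 13 / 3 / 2 weights are the commonly published FIP
    weights; league_constant puts the result on a runs-per-9 scale and
    varies by season.
    """
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    numerator = 13 * hr + 3 * (bb + hbp) - 2 * (k + iffb)
    return numerator / innings_pitched + league_constant


# Hypothetical stat line and a placeholder constant (not real data).
sample_fip = fip_with_infield_flies(
    hr=20, bb=50, hbp=5, k=200, iffb=20,
    innings_pitched=200.0, league_constant=3.10,
)
print(f"Sample FIP: {sample_fip:.2f}")
```

Notice that hits on balls in play and runs allowed appear nowhere in the inputs.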

As a predictive measure of pitching ability, FIP is almost certainly preferable to RA or ERA; it does in fact correlate better year-to-year, both with itself and with future ERA. But WAR is not generally advertised as a forward-looking projection system; it’s presented as a measure of backward-looking value. A pitcher who posts a solid FIP but also allows a high number of line drives to remote areas of the park might still have a bright future, but the actual hits and runs he allows still hurt the team.

FIP is also overly reductive in that it ignores less-measurable but still existent pitcher abilities, including things like wild pitch/passed ball tendencies, ground ball rate, and performance with runners on base (some pitchers work better from the stretch than others). Just to pick one, let’s focus on the pitcher’s own fielding, specifically with regard to four-time Gold Glove winner Mark Buehrle. Buehrle was one of the best-fielding pitchers in the game on batted balls, and also a master of controlling baserunners. He allowed 59 total stolen bases over his 16-year career (fewer than four per season); by comparison, he picked off an even 100 runners (just over six per year). Given that FIP by definition ignores the pitcher’s own fielding, it is not a coincidence that Buehrle’s career RA9 WAR (59.9) exceeds his fWAR (52.3) by a substantial margin, and this is clearly a case in which fWAR gets it wrong. There are enough similar errors (in both directions) to reduce the utility of fWAR for purposes of historical evaluation, which is my primary interest.

That brings us to Baseball Reference WAR (generally called bWAR), which is my preference among these options. Like RA9 WAR, bWAR starts with the pitcher’s park adjusted runs allowed. However, it then adjusts for the quality of the fielders behind the pitcher based on his team’s performance in the fielding metric used in position player WAR calculations (which changes depending on the level of data available for the year in question). This effect is calibrated for the number of balls in play the pitcher allows; it will typically be worth between 0.5 and -0.5 runs per 9 innings on all but the most extreme teams.

Sounds great, right? Unlike fWAR, it doesn’t ignore actual runs allowed, but unlike RA9 WAR, it accounts for the team’s fielders. And in practice, this usually works reasonably well; there’s a reason it’s my preferred option among the three WAR systems. But just as with the other two, there are edge cases that cause trouble. The primary problem here arises from the assumption that all of the pitchers on a team get the same level of support from the team’s fielders – an assumption that is not ridiculous on its face, but can be cast into doubt by a few examples.

In this case, let’s look at a quartet of Texan aces in 2019: Justin Verlander and Gerrit Cole of the Astros, and Mike Minor and Lance Lynn of the Rangers. All four put up very good seasons, and all four received Cy Young votes. By conventional numbers, Cole (20-5, 2.50 ERA) and Verlander (21-6, 2.58) were solidly better than Lynn (16-11, 3.67) or Minor (14-10, 3.59). Unsurprisingly, given the difference in their ERAs, RA9 WAR prefers the Astros pair as well; the gap narrows somewhat thanks to Texas’s status as a hitter’s park, but the AL’s top four still goes Verlander 9.0, Cole 8.0, Minor 6.5, Lynn 6.3.

Houston, however, was a much better team than Texas (or almost anyone else; the Astros won 107 games), so it’s not a surprise to learn that they had better fielders. You might think, then, that FIP would close the gap for the Rangers. And it does… for one of them. Lynn’s FIP of 3.13 is a marked improvement on his 3.67 ERA, while Verlander’s 2.58 ERA balloons into a 3.27 FIP. Cole, on the other hand, posted a 2.64 FIP (thanks to an absurd 13.8 strikeouts per 9 innings), very comparable to his ERA. And Minor? His 3.59 ERA climbs to a 4.25 FIP. As such, fWAR has a top 3 of Cole-Lynn-Verlander, and Minor doesn’t show up until the #8 position.

And what does bWAR make of this group of pitchers? Well, as noted, bWAR evaluates Houston’s fielders as very good (bWAR’s fielding adjustment for Cole is +0.46 runs per 9) and Texas’s as very bad (Lynn’s adjustment is -0.30). So, starting from the baseline of RA9 WAR and accounting for quality of fielding, bWAR gives a result of Minor 8.0, Lynn 7.7, Verlander 7.4, Cole 6.7.

This is a bit surprising, given the other results, so what’s going on? As noted, B-R gives Minor substantial credit for pitching effectively in front of a bad defense. As a sanity check for this, let’s look at batting average on balls in play (BABIP for short). The AL average BABIP in 2019 was .299; the Rangers allowed a mark of .314, 15 points higher than average and consistent with a poor fielding team. (By contrast, Houston allowed a .272 BABIP.) Lance Lynn’s BABIP allowed was slightly worse than his team’s, at .322; adjusting for this seems totally fair in his case. As for Minor? His BABIP allowed was .288 – lower than the AL average. There are three possible interpretations of this. First, hitters had a tougher time hitting Minor and his balls in play were easier for the Rangers’ subpar fielders to handle. Second, the Rangers’ fielders performed better with Minor on the mound than they did otherwise, for whatever reason. Or third, both factors had an effect, and the relative extent of those effects is unknowable.
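To make the sanity check concrete, here is a small Python sketch. The `babip` function uses the standard public definition of the stat, which this post does not spell out, and its sample inputs are hypothetical. The rates in the dictionary are the 2019 figures quoted above.

```python
def babip(hits: int, home_runs: int, at_bats: int,
          strikeouts: int, sac_flies: int) -> float:
    """Batting average on balls in play (standard public definition):
    (H - HR) / (AB - K - HR + SF)."""
    balls_in_play = at_bats - strikeouts - home_runs + sac_flies
    if balls_in_play <= 0:
        raise ValueError("no balls in play")
    return (hits - home_runs) / balls_in_play


# Hypothetical stat line, only to show the formula in use.
sample_babip = babip(hits=150, home_runs=20, at_bats=600,
                     strikeouts=150, sac_flies=5)

# 2019 BABIP figures quoted in the text.
AL_AVERAGE_BABIP = 0.299
babip_allowed = {
    "Rangers (team)": 0.314,
    "Astros (team)": 0.272,
    "Lance Lynn": 0.322,
    "Mike Minor": 0.288,
}


def points_vs_league(rate: float, league: float = AL_AVERAGE_BABIP) -> int:
    """Gap from league average in BABIP 'points' (thousandths)."""
    return round((rate - league) * 1000)


gaps = {name: points_vs_league(rate) for name, rate in babip_allowed.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap:+d} points vs. AL average")
```

Lynn lands on the same side of average as his team; Minor lands on the opposite side, which is the discrepancy the three interpretations are trying to explain.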

Any time a discrepancy of this type arises, bWAR uses the first interpretation. As such, it is also intermittently prone to startling swings toward or against certain pitchers. And that is how you get 2019 Mike Minor (3.59 ERA, 4.25 FIP, .288 BABIP) rated substantially ahead of Gerrit Cole (2.50 ERA, 2.64 FIP, .276 BABIP). For an even more extreme example, Aaron Nola in 2018 allowed a .254 BABIP for a team that bWAR thinks was below average in the field by 0.41 runs per 9 innings; as a result, bWAR rates Nola’s admittedly very good campaign as the single best season any pitcher has had in the last 15 years, narrowly outpacing Jacob deGrom’s 1.70 ERA/1.98 FIP effort the same year.

Where to go from here? We have three options for pitching WAR, but all three seem to have the same failure point: the pitcher-fielder interaction. In my opinion, all three systems handle this interaction too rigidly, either by ignoring it entirely, by assuming the pitcher can only affect a few specific outcomes, or by assuming that all pitchers on the same team have the same fielding performance behind them. It would probably be possible to amalgamate all three systems into some sort of WAR hydra, allowing each system’s extremes to be smoothed out by the other two.

Or we could abandon the WAR framework entirely and do something completely different. Next time, we’ll start to explore the basis for a new option.

Monday, September 22, 2025