Around this time last year, I wrote a series of posts introducing the Weighted WAR framework for evaluating position players. At the time, I specifically excluded pitching value from consideration. At the risk of being a bit self-indulgent, I’ll quote my own post from last year on this topic:
“The question still remains: which form of WAR to use? This
is a thorny question for pitchers, because unlike the batting and fielding
components of different WAR systems (which may differ in outcome but are
generally trying to accomplish the same thing), pitching WAR varies
philosophically from system to system. To hugely simplify a complicated topic,
Baseball Reference WAR starts from pitcher runs allowed and makes a blanket
adjustment for the general quality of the team’s fielders. Fangraphs WAR uses fielding
independent numbers (strikeouts, walks and home runs) to measure a pitcher’s
entire contribution and assumes any variance outside of that is attributable to
the team’s fielders. And RA9 WAR (also presented by Fangraphs these days, and
used in other sources as well, including Seamheads for NeL stats) takes the
pitcher’s park adjusted runs allowed and ignores any fielding effects entirely.
I don’t particularly like any of the freely available WAR
systems for pitchers; I think they all simplify the pitcher-fielder dynamic too
much and produce extreme outcomes as a result. As such, both because of the
extra work that would be involved and a lack of faith in a reasonable outcome,
I’m not considering pitching value as part of this project. Again, this is by
no means intended to imply that pitching value shouldn’t count; I may circle
back to this topic in the future.”
Welcome to “in the future.” In my next several posts, I’ll
be introducing a new method of evaluating starting pitchers, one which I think
provides a compelling alternative to any of the readily available WAR systems.
But before we get into that, it’s probably a good idea to further explore those WAR
systems and explain why I think an alternative is worth introducing to begin
with.
As mentioned above, there are three versions of pitching WAR
that can be easily found on major baseball websites. I’ll take them on in
reverse order of my own preference.
My least-favorite option is also the simplest: RA9 WAR. RA9 stands for runs allowed per 9 innings; it’s similar to the more-familiar
ERA, but with unearned runs included (a choice that we’ll discuss later in the series, but one that I agree with). RA9 WAR simply park adjusts the
pitcher’s rate of runs allowed (different sources may use different styles of park adjustment, either general to the team or specific to the pitcher), then compares it to the league’s replacement
level for the year. Outside of the park adjustment, no other modifications to the pitcher’s performance are made. In
particular, there is no attempt to separate the pitcher’s performance from
the contributions of the fielders behind him (whether helpful or harmful), or
from his bullpen’s success or failure in stranding runners who were on base when the pitcher was
removed.
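To make the arithmetic concrete, here is a minimal Python sketch of the calculation described above. The function names, the simple multiplicative `park_factor`, and the caller-supplied `replacement_ra9` are illustrative assumptions rather than any site's published method, and the stat line at the bottom is invented.

```python
def ra9(runs_allowed: float, innings_pitched: float) -> float:
    """All runs allowed (earned and unearned) per nine innings.

    innings_pitched is a true decimal (200 and 1/3 innings is 200.333...),
    not the box-score notation "200.1".
    """
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    return 9.0 * runs_allowed / innings_pitched


def era(earned_runs: float, innings_pitched: float) -> float:
    """Same rate as ra9(), but counting earned runs only."""
    return ra9(earned_runs, innings_pitched)


def park_adjusted_ra9(raw_ra9: float, park_factor: float) -> float:
    # ASSUMPTION: park_factor is a plain multiplier where 1.00 is neutral
    # and values above 1.00 favor hitters. Real sources differ in detail.
    return raw_ra9 / park_factor


def runs_above_replacement(adjusted_ra9: float,
                           replacement_ra9: float,
                           innings_pitched: float) -> float:
    # ASSUMPTION: replacement_ra9 is the league's replacement-level rate
    # for the year, supplied by the caller.
    return (replacement_ra9 - adjusted_ra9) * innings_pitched / 9.0


# Invented stat line: 80 runs (72 earned) in 200 innings.
example_ra9 = ra9(80, 200.0)
example_era = era(72, 200.0)
example_adjusted = park_adjusted_ra9(example_ra9, park_factor=1.05)
```

Note what is missing from the inputs: nothing about the fielders or the bullpen appears anywhere, which is exactly the limitation at issue.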
To take an extreme example: Say the starting pitcher records
two outs in the seventh inning, then strikes out the next hitter, but the ball
gets away from the catcher and the hitter reaches base. The starter is pulled
from the game, and the new pitcher immediately allows a line drive into the gap. The center
fielder has a chance at the ball, but can’t quite reach it; the right fielder
picks it up and tries to throw out the runner at home, but his throw is
slightly up the line and the runner scores. At minimum, five players
contributed to allowing that run, and one could argue that the starter bears the smallest share of that responsibility. But the official stats charge him with
the run, and therefore so does RA9 WAR.
While most runs allowed are not this heavily skewed away from the pitcher’s own contributions, it’s still obvious that the efforts of the fielders and relievers are crucial pieces of context for the starter’s performance. If you compare Jim Palmer’s and Rick Reuschel’s 1974 seasons, but fail to account for the fact that Palmer was pitching in front of brilliant fielders such as Mark Belanger, Brooks Robinson, and Paul Blair while Reuschel suffered through the butchery of Bill Madlock and Rick Monday, you won’t get the full picture of their relative effectiveness. Ignoring fielding and relief work entirely strikes me as eminently counterproductive.
Next up, we’ll go to the other extreme with FIP WAR, primarily presented on Fangraphs (and therefore usually abbreviated as fWAR). Per Fangraphs’ own explanation of their system, fWAR uses five components on a per-inning basis: home runs, walks, batters hit by pitch, strikeouts, and infield popups. (Walks and HBP are treated as equivalent, as are strikeouts and popups.) The theory behind this is that these components are the best representation of the pitcher’s actual ability, as they are the events for which the pitcher is most directly responsible. As such, they also tend to have better year-to-year correlations than rates of hits or runs allowed.
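For readers who want to see the shape of that calculation, here is a rough Python sketch of an FIP-style rate built from those five components. The 13/3/2 weights are the commonly published FIP weights, not something taken from this post, and `league_constant` (the yearly value that puts FIP on the ERA scale) is left to the caller; the stat line is invented.

```python
def fip_with_popups(home_runs: int,
                    walks: int,
                    hit_by_pitch: int,
                    strikeouts: int,
                    infield_popups: int,
                    innings_pitched: float,
                    league_constant: float) -> float:
    """FIP-style rate in which popups count the same as strikeouts.

    ASSUMPTION: the weights (13 per HR, 3 per BB or HBP, 2 per K or popup)
    are the commonly published FIP weights. league_constant varies by
    season and must be supplied by the caller.
    """
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    free_passes = walks + hit_by_pitch          # treated as equivalent
    sure_outs = strikeouts + infield_popups     # treated as equivalent
    numerator = 13 * home_runs + 3 * free_passes - 2 * sure_outs
    return numerator / innings_pitched + league_constant


# Invented stat line with a placeholder league constant of 3.10.
example_fip = fip_with_popups(
    home_runs=20, walks=50, hit_by_pitch=5,
    strikeouts=200, infield_popups=20,
    innings_pitched=200.0, league_constant=3.10,
)
```

Hits on balls in play and actual runs scored never enter the function, which is the whole point of the approach and also the source of the objections that follow.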
As a predictive measure of pitching ability, FIP is almost
certainly preferable to RA or ERA; it does in fact correlate better
year-to-year, both with itself and with future ERA. But WAR is not generally
advertised as a forward-looking projection system; it’s presented as a measure
of backward-looking value. A pitcher who posts a solid FIP but also allows a
high number of line drives to remote areas of the park might still have a
bright future, but the actual hits and runs he allows still hurt the team.
FIP is also overly reductive in that it ignores harder-to-measure but still real pitcher abilities, including things like wild pitch/passed
ball tendencies, ground ball rate, and performance with runners on base (some pitchers work better from the stretch than others). Just to pick one, let’s focus on the
pitcher’s own fielding, specifically with regard to four-time Gold Glove winner Mark Buehrle. Buehrle was one of the
best-fielding pitchers in the game on batted balls, and also a master of
controlling baserunners. He allowed 59 total stolen bases over his 16-year
career (fewer than four per season); by comparison, he picked off an even 100 runners
(just over six per year). Given that FIP by definition ignores the pitcher’s
own fielding, it is not a coincidence that Buehrle’s career RA9 WAR (59.9)
exceeds his fWAR (52.3) by a substantial margin, and this is clearly a case in
which fWAR gets it wrong. There are enough similar errors (in both directions) to
reduce the utility of fWAR for purposes of historical evaluation, which is my
primary interest.
That brings us to Baseball Reference WAR (generally called
bWAR), which is my preference among these options. Like RA9 WAR, bWAR starts
with the pitcher’s park adjusted runs allowed. However, it then adjusts for the
quality of the fielders behind the pitcher based on his team’s performance in
the fielding metric used in position player WAR calculations (which changes
depending on the level of data available for the year in question). This effect is calibrated for the
number of balls in play the pitcher allows; it will typically be worth between
-0.5 and +0.5 runs per 9 innings on all but the most extreme teams.
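The mechanics can be sketched in a few lines of Python. This is only one plausible reading of the description above, not Baseball Reference's published formula: the proportional split by balls in play, the sign convention (positive means the fielders saved runs, so that amount is added back to the pitcher's rate), and all function names are assumptions. The 3.00 baseline is a placeholder so that only the adjustment differs; the +0.46 and -0.30 figures are the ones bWAR lists for Gerrit Cole and Lance Lynn in 2019.

```python
def fielding_adjustment_per9(team_fielding_runs: float,
                             team_balls_in_play: int,
                             pitcher_balls_in_play: int,
                             pitcher_innings: float) -> float:
    # ASSUMPTION: the team's fielding runs are shared out in proportion
    # to the balls in play each pitcher allowed.
    runs_per_ball_in_play = team_fielding_runs / team_balls_in_play
    pitcher_share = runs_per_ball_in_play * pitcher_balls_in_play
    return 9.0 * pitcher_share / pitcher_innings


def defense_neutral_ra9(park_adjusted_ra9: float,
                        fielding_adj_per9: float) -> float:
    # ASSUMPTION: positive adjustment = good fielders, so those saved
    # runs are charged back to the pitcher; negative = bad fielders.
    return park_adjusted_ra9 + fielding_adj_per9


PLACEHOLDER_RA9 = 3.00  # not a real stat line; isolates the adjustment

behind_good_fielders = defense_neutral_ra9(PLACEHOLDER_RA9, 0.46)
behind_bad_fielders = defense_neutral_ra9(PLACEHOLDER_RA9, -0.30)

# Two pitchers with identical runs allowed end up about three quarters
# of a run per 9 innings apart once the team adjustment is applied.
swing_per9 = behind_good_fielders - behind_bad_fielders
```

The key design point is that the adjustment comes from team-level fielding numbers, so every pitcher on a staff is assumed to get the same quality of support per ball in play.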
Sounds great, right? Unlike fWAR, it doesn’t ignore actual
runs allowed, but unlike RA9 WAR, it accounts for the team’s fielders. And in practice, this usually works reasonably well; there’s a reason it’s my preferred option
among the three WAR systems. But just as with the other two, there are edge
cases that cause trouble. The primary problem here arises from the assumption
that all of the pitchers on a team get the same level of support from the
team’s fielders – an assumption that is not ridiculous on its face, but can be
cast into doubt by a few examples.
In this case, let’s look at a quartet of Texan aces in 2019:
Justin Verlander and Gerrit Cole of the Astros, and Mike Minor and Lance Lynn
of the Rangers. All four put up very good seasons, and all four received Cy
Young votes. By conventional numbers, Cole (20-5, 2.50 ERA) and Verlander
(21-6, 2.58) were solidly better than Lynn (16-11, 3.67) or Minor (14-10,
3.59). Unsurprisingly, given the difference in their ERAs, RA9 WAR prefers the
Astros pair as well; the gap narrows somewhat thanks to Texas’s status as a
hitter’s park, but the AL’s top four still goes Verlander 9.0, Cole 8.0, Minor
6.5, Lynn 6.3.
Houston, however, was a much better team than Texas (or
almost anyone else; the Astros won 107 games), so it’s not a surprise to learn
that they had better fielders. You might think, then, that FIP would close the
gap for the Rangers. And it does… for one of them. Lynn’s FIP of 3.13 is a
marked improvement on his 3.67 ERA, while Verlander’s 2.58 ERA balloons into a
3.27 FIP. Cole, on the other hand, posted a 2.64 FIP (thanks to an absurd 13.8
strikeouts per 9 innings), very comparable to his ERA. And Minor? His 3.59 ERA
climbs to a 4.25 FIP. As such, fWAR has a top 3 of Cole-Lynn-Verlander, and
Minor doesn’t show up until the #8 position.
And what does bWAR make of this group of pitchers? Well, as
noted, bWAR evaluates Houston’s fielders as very good (bWAR’s fielding
adjustment for Cole is +0.46 runs per 9) and Texas’s as very bad (Lynn’s
adjustment is -0.30). So, starting from the baseline of RA9 WAR and accounting
for quality of fielding, bWAR gives a result of Minor 8.0, Lynn 7.7, Verlander
7.4, Cole 6.7.
This is a bit surprising, given the other results, so what’s
going on? As noted, B-R gives Minor substantial credit for pitching effectively
in front of a bad defense. As a sanity check for this, let’s look at batting
average on balls in play (BABIP for short). The AL average BABIP
in 2019 was .299; the Rangers allowed a mark of .314, 15 points higher than
average and consistent with a poor fielding team. (By contrast, Houston allowed a .272 BABIP.) Lance Lynn’s BABIP
allowed was slightly worse than his team’s, at .322; adjusting for this seems
totally fair in his case. As for Minor? His BABIP allowed was .288 – lower than
the AL average. There are three possible interpretations of this. First,
hitters had a tougher time hitting Minor and his balls in play were easier for
the Rangers’ subpar fielders to handle. Second, the Rangers’ fielders performed
better with Minor on the mound than they did otherwise, for whatever reason. Or
third, both factors had an effect, and the relative extent of those effects is
unknowable.
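As a quick check on that arithmetic, here is a small Python sketch. The `babip` function uses the standard definition (hits minus home runs, over at-bats minus strikeouts minus home runs plus sacrifice flies), which is not spelled out in this post, and its example inputs are invented; the BABIP values in the comparison are the 2019 figures quoted above.

```python
def babip(hits: int, home_runs: int, at_bats: int,
          strikeouts: int, sac_flies: int) -> float:
    """Batting average on balls in play, standard definition."""
    balls_in_play = at_bats - strikeouts - home_runs + sac_flies
    if balls_in_play <= 0:
        raise ValueError("no balls in play")
    return (hits - home_runs) / balls_in_play


def points_vs(reference: float, value: float) -> int:
    """Gap in BABIP 'points' (thousandths); positive = more hits allowed."""
    return round((value - reference) * 1000)


AL_AVERAGE = 0.299
RANGERS = 0.314
ASTROS = 0.272
LYNN = 0.322
MINOR = 0.288

rangers_vs_league = points_vs(AL_AVERAGE, RANGERS)
astros_vs_league = points_vs(AL_AVERAGE, ASTROS)
lynn_vs_league = points_vs(AL_AVERAGE, LYNN)
minor_vs_league = points_vs(AL_AVERAGE, MINOR)

# The discrepancy in question: Minor against his own team's mark.
minor_vs_rangers = points_vs(RANGERS, MINOR)
```

Minor comes out 11 points better than the league and 26 points better than his own team, and the numbers alone cannot say how much of that gap belongs to him and how much to the fielders on the days he pitched.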
Any time a discrepancy of this type arises, bWAR uses the first interpretation. As such, it is also intermittently prone to startling swings toward or against certain pitchers. And that is how you get 2019 Mike Minor (3.59 ERA, 4.25 FIP, .288 BABIP) rated substantially ahead of Gerrit Cole (2.50 ERA, 2.64 FIP, .276 BABIP). For an even more extreme example, Aaron Nola in 2018 allowed a .254 BABIP for a team that bWAR thinks was below average in the field by 0.41 runs per 9 innings; as a result, bWAR rates Nola’s admittedly very good campaign as the single best season any pitcher has had in the last 15 years, narrowly outpacing Jacob deGrom’s 1.70 ERA/1.98 FIP effort the same year.
Where to go from here? We have three options for pitching
WAR, but all three seem to have the same failure point: the pitcher-fielder
interaction. In my opinion, all three systems handle this interaction too
rigidly, either by ignoring it entirely, assuming the pitcher can only affect
a few specific outcomes, or assuming
that all pitchers on the same team have the same fielding performance behind
them. It would probably be possible to amalgamate all three systems into some
sort of WAR hydra, allowing each system’s extremes to be smoothed out by the
other two.
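A bare-bones version of that blend might look like the Python sketch below, assuming an unweighted mean (any other weighting would be just as defensible). The `blend_war` helper is hypothetical. Only RA9 WAR and bWAR are included, because those are the two systems with 2019 values quoted above for all four pitchers; the fWAR values would slot in as a third dictionary.

```python
from statistics import mean

# 2019 values as quoted earlier in this post.
RA9_WAR = {"Verlander": 9.0, "Cole": 8.0, "Minor": 6.5, "Lynn": 6.3}
BWAR = {"Minor": 8.0, "Lynn": 7.7, "Verlander": 7.4, "Cole": 6.7}


def blend_war(*systems: dict) -> dict:
    """Unweighted mean across systems, for pitchers present in all of them.

    ASSUMPTION: equal weights. This is an illustration of the idea,
    not a recommended method.
    """
    if not systems:
        raise ValueError("need at least one WAR system")
    shared_names = set.intersection(*(set(s) for s in systems))
    return {name: mean(s[name] for s in systems) for name in shared_names}


blended = blend_war(RA9_WAR, BWAR)

# Pitchers ordered from highest blended WAR to lowest.
ranking = sorted(blended, key=blended.get, reverse=True)
```

With just these two inputs, the four pitchers land within about a win of each other, which shows the smoothing effect: each system's extreme is pulled back toward the other's.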
Or we could abandon the WAR framework entirely and do
something completely different. Next time, we’ll start to explore the basis for
a new option.