To introduce this series, I discussed the three major WAR systems for pitchers and the issues I have with each of them. Having pointed out the problem, let's get to work on a solution.
Whatever issues
I may have with it, one of WAR’s benefits is very compelling: it reduces the
player’s complex contributions to a single number, which can be compared
to a similarly derived single number for other players. How does a pitcher with a 2.60 ERA and 220
strikeouts in 280 innings in 1968 compare to a pitcher with a 3.40 ERA and 250
strikeouts in 210 innings in 1999? After sorting through all the complications
under the surface, WAR can give you a simple answer. If I’m suggesting an
alternative to WAR, it should ideally come in a similarly straightforward
package. Fortunately, just as in many other statistical problems
in baseball, we can start by borrowing from the work of Bill James. In this case, we’ll use the Game Score.
(Before anyone
else says it: Yes, Game Score is only set up for starting pitchers; it is not
generally applied to relief outings, even relief outings made by pitchers who
are usually starters. We’ll look more at the effects of this on the rankings
once we get far enough to actually discuss rankings, but there definitely is an
effect. It is also worth pointing out that we can only pull Game Scores for
pitchers who have box score records available. Since I’m pulling my data from
Baseball Reference, that means I have nothing before 1901, and nothing for the
Negro Leagues. So we’re working with half of Cy Young’s career, under 20% of
Kid Nichols, and none of Smokey Joe Williams.)
James
introduced Game Score in the ’80s as a fun way to evaluate the effectiveness of
a starting pitcher’s effort in a particular game. His original formulation is:
GS = 50 + Outs + Strikeouts – Walks – 2*Hits – 4*(Earned Runs) – 2*(Unearned Runs) + 2*(Innings completed after the fourth)
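For anyone who would rather script this than build it in a spreadsheet, here is a minimal Python sketch of the formula above. The function and parameter names are placeholders chosen for illustration (they come from neither James nor Baseball Reference), and it assumes the pitcher's workload is supplied as a count of outs rather than innings.

```python
def game_score_v1(outs, strikeouts, walks, hits, earned_runs, unearned_runs):
    """Original Game Score, following the formula as written above.

    Illustrative helper only; the name and signature are assumptions.
    `outs` is the number of outs recorded (3 per full inning).
    """
    innings_completed = outs // 3
    # The bonus applies to each full inning completed after the fourth.
    late_innings = max(0, innings_completed - 4)
    return (50 + outs + strikeouts - walks - 2 * hits
            - 4 * earned_runs - 2 * unearned_runs + 2 * late_innings)


# Nine innings, no baserunners, no strikeouts: 50 + 27 outs + 2 * 5 late innings.
print(game_score_v1(outs=27, strikeouts=0, walks=0, hits=0,
                    earned_runs=0, unearned_runs=0))  # 87
```

Working in outs sidesteps the box-score convention in which "7.1" means seven and one-third innings.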
The description
is a little clunky (and so is implementing it in a spreadsheet of game logs),
but the results are pretty reasonable. League average will usually be around
50. 60 is a solid start, 70 is a very good one, 80 is excellent, and 90 will
usually be among the 25-50 best overall starts of the season. The same is true in the
other direction; 40 is shaky, 30 is pretty bad, 20 is lousy, etc. My initial
foray into starting pitcher evaluation used the original formulation of Game
Score, and it generally performed well… until it didn’t.
The major issue
I ran into involves the earned/unearned run distinction. Whether this
distinction has value in general is a bit of a rabbit hole (most of the
research I’ve seen suggests it does not), but even if you think it’s worth
using earned runs in some circumstances, it becomes a huge problem trying to
use them for game-by-game evaluation of baseball seasons from over a century
ago.
Here is a
sample of the Baseball Reference game log page for Christy Mathewson’s
outstanding 1905 season. See if you can spot the issue:
[Screenshot: Baseball Reference game log, Christy Mathewson, 1905]
Yeah, that’s
not going to work. We know Matty’s overall total of earned runs allowed for the
year (48, out of 85 runs total), but game-to-game we have no idea of his
earned/unearned run split.
Having
encountered this problem, I went looking for an alternative Game Score
formulation. Fortunately, Fangraphs already had one, courtesy of the inimitable
Tangotiger. Game Score 2 is laid out here, and the formula is as follows:
GS2 = 40 + 2*Outs + Strikeouts – 2*Hits – 2*Walks – 3*Runs – 6*(Home Runs)
To clarify, the
home run is also a hit (and will result in at least one run scoring); the 6
points subtracted are in addition to the other penalties, just as the point for
a strikeout is added on top of the points already given for outs in general. If
you’re curious about where the specific coefficients come from, Tango explains
it further here (link is to part 3, which in turn links to parts 1 and 2).
The short version is that it’s a combination of the run value of each event
type and the pitcher’s estimated share of responsibility for those events.
To take the obvious example, a hit is more damaging than a walk, but the hit is
partly the responsibility of the fielders whereas the walk is completely the
pitcher’s fault, so the pitcher is penalized the same amount for each of them.
As noted
earlier, the GS2 formulation resolves the issue of missing data by removing the
earned/unearned run distinction. Conveniently, it also addresses another problem with
Original Recipe Game Score, which is the handling of short starts. Usually if
your pitcher gets pulled after one inning, you’re going to consider that a
below-average result. As such, rather than starting each game at the expected average of 50, it makes more sense to start from a below-average 40 and build up from
there. (Sorry,
openers! But not all that sorry. We’ll come back to them later in the series.)
For additional detail, here is a table of the baseline score earned by a
pitcher who completes each whole number of innings 1-9 in both Game Score
versions:
IP | GS1 | GS2
1.0 | 53 | 46
2.0 | 56 | 52
3.0 | 59 | 58
4.0 | 62 | 64
5.0 | 67 | 70
6.0 | 72 | 76
7.0 | 77 | 82
8.0 | 82 | 88
9.0 | 87 | 94
The modified formula places a higher premium on lasting longer in the game, but also includes larger penalties for negative outcomes; the 5-point baseline difference for a 7-inning start can be erased by allowing one solo home run (worth -6 points in GS1, 2 for the hit and 4 for the earned run; -11 in GS2, 2 for the hit, 3 for the run, 6 extra for the homer).
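The arithmetic in that comparison is easy to check. The sketch below uses two throwaway helpers, named here purely for illustration, to compute the score for a given number of complete innings with no other events, then applies the solo home run penalties described above.

```python
def gs1_baseline(innings):
    """GS1 for `innings` complete innings and nothing else (illustrative helper)."""
    return 50 + 3 * innings + 2 * max(0, innings - 4)


def gs2_baseline(innings):
    """GS2 for `innings` complete innings and nothing else (illustrative helper)."""
    return 40 + 2 * 3 * innings


# Cost of one solo home run under each formula.
GS1_SOLO_HR = 2 + 4      # hit + earned run
GS2_SOLO_HR = 2 + 3 + 6  # hit + run + extra home run penalty

print(gs2_baseline(7) - gs1_baseline(7))  # 5
print(gs1_baseline(7) - GS1_SOLO_HR, gs2_baseline(7) - GS2_SOLO_HR)  # 71 71
```

After the home run, both versions land on 71 for the seven-inning start, which is the sense in which the 5-point gap is erased.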
The wider
variance in GS2 can be more fully demonstrated by looking at a bit of seasonal
data. In 2016, the original Game Score formula produced no scores over 100, and
five negative scores; GS2 (with my modifications, to be revealed shortly) saw
six starts clear 100, and a whopping 31 dip below 0. The standard deviations
differ accordingly (16.8 for GS1, 19.5 for GS2). These results are pretty
typical.
The improvements
in handling unearned runs and truncated outings combine to make GS2 my clear choice
of metric moving forward in this project. Being who I am, though, I couldn’t
resist making a couple of small modifications. You may have noticed in the
screenshot above that Baseball Reference has hit batters and wild pitches
recorded on a game-by-game basis even as far back as 1905, and I can think of no reason not to account
for those when I already have the numbers handy. (Other data, such as stolen
bases allowed, pickoffs, passed balls, or intentional walks, might also provide
value, but aren’t available for the oldest games in the sample. Balks are available, but are so rare
that it’s impossible to imagine them making a difference in an overall
evaluation. To pick a season at random, MLB in 1955 saw 506 total hit batters,
477 wild pitches, and only 36 balks.)
How do we adjust
for hit batters and wild pitches? HBP are easy; they are exactly equivalent to walks and are therefore
penalized equally (-2 points). WP are more complicated. They’re similar to
stolen bases, which run estimators generally consider to be worth less than
half as much as a walk. But compared to stolen bases, wild pitches are more likely to score a runner from
third, sometimes advance multiple runners at once, and, when they come on a third strike, can occasionally even put a runner on base, all of which should make them proportionately more damaging. We’ll value them at -1
point. The final formula is:
GS2 = 40 + 2*Outs + Strikeouts – 2*(Hits + Walks + Hit Batters) – (Wild Pitches) – 3*Runs – 6*(Home Runs)
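As a rough sketch of how that final formula might look in code, here is a small Python function. The name `game_score_2` and its parameter names are assumptions made for illustration, and outs are passed directly rather than innings pitched.

```python
def game_score_2(outs, strikeouts, hits, walks, runs, home_runs,
                 hit_batters=0, wild_pitches=0):
    """Modified Game Score 2, following the final formula above.

    Illustrative helper only; the name and signature are assumptions.
    Home runs are also counted in `hits` and `runs`, so the 6 points
    here are an additional penalty on top of those.
    """
    return (40 + 2 * outs + strikeouts
            - 2 * (hits + walks + hit_batters)
            - wild_pitches - 3 * runs - 6 * home_runs)


# Seven innings with one solo home run and nothing else: 82 - (2 + 3 + 6).
print(game_score_2(outs=21, strikeouts=0, hits=1, walks=0, runs=1, home_runs=1))  # 71
```

Leaving `hit_batters` and `wild_pitches` at their default of zero reduces this to the unmodified GS2 formula, which makes it easy to compare the two versions on the same game log.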
To get a sense
of Game Score’s scale, here are examples of various Game Scores, each of them
an actual start from the 2016 season:
Pitcher | Date | IP | H | R | BB | K | Other | GS2
Danny Duffy | 8/1/16 | 8 | 1 | 0 | 1 | 16 | | 100
Matt Shoemaker | 7/16/16 | 7.1 | 3 | 0 | 0 | 12 | | 90
Jake Arrieta | 5/14/16 | 8 | 3 | 2 | 2 | 11 | 1 HBP, 1 WP | 80
Hector Santiago | 6/15/16 | 6 | 2 | 1 | 2 | 5 | | 70
Chris Archer | 8/1/16 | 7.1 | 6 | 3 | 1 | 6 | 1 HR, 1 WP | 60
Jharel Cotton | 9/13/16 | 5.2 | 7 | 3 | 1 | 2 | 1 WP | 50
Robert Gsellman | 9/9/16 | 5 | 7 | 4 | 2 | 6 | 1 HR | 40
Joel De La Cruz | 8/10/16 | 4 | 7 | 4 | 2 | 2 | 1 HR | 30
Jordan Lyles | 5/23/16 | 2.1 | 5 | 6 | 3 | 3 | 1 HBP, 1 WP | 20
Jacob deGrom | 8/24/16 | 4.2 | 12 | 5 | 2 | 3 | 3 HR | 10
Chris Young | 6/25/16 | 2.1 | 7 | 7 | 4 | 2 | 2 HR, 1 WP | 0
However you
feel about the details, the broad strokes are reasonable; those starts
generally get worse as you go down the list.
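To show how those scores fall out of the formula, the sketch below recomputes each listed start from its box-score line. The helper names are made up for this example, and the only wrinkle is the innings notation, where ".1" and ".2" mean one and two extra outs rather than decimal fractions.

```python
def ip_to_outs(ip):
    """Convert box-score innings notation ('7.1' = 7 1/3 innings) to outs."""
    whole, _, extra = str(ip).partition(".")
    return 3 * int(whole) + int(extra or 0)


def gs2(outs, k, h, bb, r, hr=0, hbp=0, wp=0):
    """Modified Game Score 2 (illustrative helper; names are assumptions)."""
    return 40 + 2 * outs + k - 2 * (h + bb + hbp) - wp - 3 * r - 6 * hr


# (pitcher, IP, H, R, BB, K, HR, HBP, WP, listed GS2), copied from the list above.
starts = [
    ("Danny Duffy",     "8",   1, 0, 1, 16, 0, 0, 0, 100),
    ("Matt Shoemaker",  "7.1", 3, 0, 0, 12, 0, 0, 0, 90),
    ("Jake Arrieta",    "8",   3, 2, 2, 11, 0, 1, 1, 80),
    ("Hector Santiago", "6",   2, 1, 2, 5,  0, 0, 0, 70),
    ("Chris Archer",    "7.1", 6, 3, 1, 6,  1, 0, 1, 60),
    ("Jharel Cotton",   "5.2", 7, 3, 1, 2,  0, 0, 1, 50),
    ("Robert Gsellman", "5",   7, 4, 2, 6,  1, 0, 0, 40),
    ("Joel De La Cruz", "4",   7, 4, 2, 2,  1, 0, 0, 30),
    ("Jordan Lyles",    "2.1", 5, 6, 3, 3,  0, 1, 1, 20),
    ("Jacob deGrom",    "4.2", 12, 5, 2, 3, 3, 0, 0, 10),
    ("Chris Young",     "2.1", 7, 7, 4, 2,  2, 0, 1, 0),
]

for name, ip, h, r, bb, k, hr, hbp, wp, listed in starts:
    score = gs2(ip_to_outs(ip), k, h, bb, r, hr, hbp, wp)
    print(f"{name:16} {score:4d}  matches listed score: {score == listed}")
```

Passing innings as strings avoids any floating-point surprises when splitting "7.1" into seven innings plus one out.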
Let’s consider
the characteristics of this formula. It includes aspects of pitching
performance beyond simply counting innings and runs allowed, but also reflects actual hits and
runs rather than completely disclaiming the pitcher’s responsibility for events
to which other players contribute. It indirectly accounts for fielding by
putting extra weight on things that are clearly the pitcher’s responsibility
(strikeouts, walks, and homers) compared to things that have shared
responsibility (hits and runs), but doesn’t assume a fixed level of fielding
contribution based on a team’s overall performance. Which is to say that, to
varying extents, it addresses the issues I have with all three versions of
pitcher WAR. That doesn’t make it a perfect metric – but it does make it an
intriguing option to build around.
Next time,
we’ll continue the building process by going through a central question of any
comprehensive baseball measure: how do you adjust for context?