Monday, September 29, 2025

Starting Pitcher Ratings: Game Score

To introduce this series, I discussed the three major WAR systems for pitchers and the issues I have with each of them. Having pointed out the problem, let's get to work on a solution.

Whatever issues I may have with it, one of WAR’s benefits is very compelling: it reduces the player’s complex contributions to a single number, which can be compared to a similarly derived single number for other players. How does a pitcher with a 2.60 ERA and 220 strikeouts in 280 innings in 1968 compare to a pitcher with a 3.40 ERA and 250 strikeouts in 210 innings in 1999? After sorting through all the complications under the surface, WAR can give you a simple answer. If I’m suggesting an alternative to WAR, it should ideally come in a similarly straightforward package. Fortunately, just as in many other statistical problems in baseball, we can start by borrowing from the work of Bill James. In this case, we’ll use the Game Score.

(Before anyone else says it: Yes, Game Score is only set up for starting pitchers; it is not generally applied to relief outings, even relief outings made by pitchers who are usually starters. We’ll look more at the effects of this on the rankings once we get far enough to actually discuss rankings, but there definitely is an effect. It is also worth pointing out that we can only pull Game Scores for pitchers who have box score records available. Since I’m pulling my data from Baseball Reference, that means I have nothing before 1901, and nothing for the Negro Leagues. So we’re working with half of Cy Young’s career, under 20% of Kid Nichols, and none of Smokey Joe Williams.)

James introduced Game Score in the ’80s as a fun way to evaluate the effectiveness of a starting pitcher’s effort in a particular game. His original formulation is:

GS = 50 + Outs + Strikeouts – Walks – 2*Hits – 4*(Earned Runs) – 2*(Unearned Runs) + 2*(Innings completed after the fourth)
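For anyone who would rather not wrestle with a spreadsheet, here is a minimal Python sketch of the formula above. The function name, argument names, and the sample stat line are illustrative assumptions rather than anything from James; the only part taken from the formula is the arithmetic itself.

```python
def game_score_v1(outs, strikeouts, walks, hits, earned_runs, unearned_runs):
    """Original Game Score, following the formula given above."""
    # "Innings completed after the fourth" counts only full innings.
    innings_completed = outs // 3
    innings_after_fourth = max(0, innings_completed - 4)
    return (
        50
        + outs
        + strikeouts
        - walks
        - 2 * hits
        - 4 * earned_runs
        - 2 * unearned_runs
        + 2 * innings_after_fourth
    )


# Hypothetical start: 7 IP (21 outs), 5 H, 2 ER, 0 unearned, 2 BB, 6 K
print(game_score_v1(outs=21, strikeouts=6, walks=2, hits=5,
                    earned_runs=2, unearned_runs=0))  # 63
```

The integer division is what makes the bonus awkward in a spreadsheet: a pitcher pulled after 6.2 innings gets credit for only two completed innings past the fourth.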

The description is a little clunky (and so is implementing it in a spreadsheet of game logs), but the results are pretty reasonable. League average will usually be around 50. 60 is a solid start, 70 is a very good one, 80 is excellent, and 90 will usually be among the 25-50 best overall starts of the season. The same is true in the other direction; 40 is shaky, 30 is pretty bad, 20 is lousy, etc. My initial foray into starting pitcher evaluation used the original formulation of Game Score, and it generally performed well… until it didn’t.

The major issue I ran into deals with the earned/unearned run distinction. Whether this distinction has value in general is a bit of a rabbit hole (most of the research I’ve seen suggests it does not), but even if you think it’s worth using earned runs in some circumstances, it becomes a huge problem trying to use them for game-by-game evaluation of baseball seasons from over a century ago.

Here is a sample of the Baseball Reference game log page for Christy Mathewson’s outstanding 1905 season. See if you can spot the issue:

Yeah, that’s not going to work. We know Matty’s overall total of earned runs allowed for the year (48, out of 85 runs total), but game-to-game we have no idea of his earned/unearned run split.

Having encountered this problem, I went looking for an alternative Game Score formulation. Fortunately, Fangraphs already had one, courtesy of the inimitable Tangotiger. Game Score 2 is laid out here, and the formula is as follows:

GS2 = 40 + 2*Outs + Strikeouts – 2*Hits – 2*Walks – 3*Runs – 6*(Home Runs)

To clarify, the home run is also a hit (and will result in at least one run scoring); the 6 points subtracted are in addition to the other penalties, just as the point for a strikeout is added on top of the points already given for outs in general. If you’re curious about where the specific coefficients come from, Tango explains it further here (link is to part 3, which in turn links to parts 1 and 2). The short version is that it’s a combination of the run value of each event type and the pitcher’s estimated share of responsibility for those events. To take the obvious example, a hit is more damaging than a walk, but the hit is partly the responsibility of the fielders whereas the walk is completely the pitcher’s fault, so the pitcher is penalized the same amount for each of them.

As noted earlier, the GS2 formulation resolves the issue of missing data by removing the earned/unearned run distinction. Conveniently, it also addresses another problem with Original Recipe Game Score, which is the handling of short starts. Usually if your pitcher gets pulled after one inning, you’re going to consider that a below average result. As such, rather than starting each game at the expected average of 50, it makes more sense to start from a below-average 40 and build up from there. (Sorry, openers! But not all that sorry. We’ll come back to them later in the series.) For additional detail, here is a table of the baseline score earned by a pitcher who completes each whole number of innings 1-9 in both Game Score versions:

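| IP  | GS1 | GS2 |
|-----|-----|-----|
| 1.0 | 53  | 46  |
| 2.0 | 56  | 52  |
| 3.0 | 59  | 58  |
| 4.0 | 62  | 64  |
| 5.0 | 67  | 70  |
| 6.0 | 72  | 76  |
| 7.0 | 77  | 82  |
| 8.0 | 82  | 88  |
| 9.0 | 87  | 94  |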

The modified formula places a higher premium on lasting longer in the game, but also includes larger penalties for negative outcomes; the 5-point baseline difference for a 7-inning start can be erased by allowing one solo home run (worth -6 points in GS1, 2 for the hit and 4 for the earned run; -11 in GS2, 2 for the hit, 3 for the run, 6 extra for the homer).
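The baseline figures and the solo-homer comparison can be reproduced with a few lines of Python. This is only a sketch; the function and constant names are made up for illustration, and "baseline" here means a start with nothing but outs recorded (no hits, walks, strikeouts, or runs).

```python
def gs1_baseline(innings):
    """Original Game Score for a start of whole innings with outs only."""
    return 50 + 3 * innings + 2 * max(0, innings - 4)


def gs2_baseline(innings):
    """Game Score 2 for the same start: 40 plus 2 points per out."""
    return 40 + 2 * 3 * innings


# Penalty for one solo home run under each version.
GS1_SOLO_HR = -(2 + 4)      # hit + earned run
GS2_SOLO_HR = -(2 + 3 + 6)  # hit + run + home run surcharge

for ip in range(1, 10):
    print(ip, gs1_baseline(ip), gs2_baseline(ip))

# Seven clean innings plus one solo homer lands on the same score either way.
print(gs1_baseline(7) + GS1_SOLO_HR, gs2_baseline(7) + GS2_SOLO_HR)  # 71 71
```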

The wider variance in GS2 can be more fully demonstrated by looking at a bit of seasonal data. In 2016, the original Game Score formula produced no scores over 100, and five negative scores; GS2 (with my modifications, to be revealed shortly) saw six starts clear 100, and a whopping 31 dip below 0. The standard deviations differ accordingly (16.8 for GS1, 19.5 for GS2). These results are pretty typical.

The improvements in handling unearned runs and truncated outings combine to make GS2 my clear choice of metric moving forward in this project. Being who I am, though, I couldn’t resist making a couple of small modifications. You may have noticed in the screenshot above that Baseball Reference has hit batters and wild pitches recorded on a game-by-game basis even as far back as 1905, and I can think of no reason not to account for those when I already have the numbers handy. (Other data, such as stolen bases allowed, pickoffs, passed balls, or intentional walks, might also provide value, but aren’t available for the oldest games in the sample. Balks are available, but are so rare that it’s impossible to imagine them making a difference in an overall evaluation. To pick a season at random, MLB in 1955 saw 506 total hit batters, 477 wild pitches, and only 36 balks.)

How do we adjust for hit batters and wild pitches? HBP are easy; they are exactly equivalent to walks and are therefore penalized equally (-2 points). WP are more complicated. They’re similar to stolen bases, which run estimators generally consider to be worth less than half as much as a walk. But in comparison to stolen bases, wild pitches are more likely to score a run from third, sometimes advance multiple runners at once, and when coming at the end of a strikeout, can even occasionally put a runner on base, which should make them proportionately more damaging. We’ll value them at -1 point. The final formula is:

GS2 = 40 + 2*Outs + Strikeouts – 2*(Hits + Walks + Hit Batters) – (Wild Pitches) – 3*Runs – 6*(Home Runs)
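As a sketch of how the final formula might be applied to a box-score line, here is a small Python version. The function names and the string handling of innings are assumptions made for illustration; the two sample lines are 2016 starts that also appear among this post's examples (Jake Arrieta on 5/14/16 and Chris Young on 6/25/16).

```python
def ip_to_outs(ip):
    """Convert box-score innings notation ('7.1' means 7 1/3) to outs."""
    whole, _, frac = str(ip).partition(".")
    thirds = int(frac) if frac else 0
    if thirds not in (0, 1, 2):
        raise ValueError(f"unexpected innings notation: {ip!r}")
    return 3 * int(whole) + thirds


def game_score_2(ip, hits, runs, walks, strikeouts, hbp=0, wp=0, hr=0):
    """Game Score 2 with the hit-batter and wild-pitch terms added above."""
    outs = ip_to_outs(ip)
    return (
        40
        + 2 * outs
        + strikeouts
        - 2 * (hits + walks + hbp)
        - wp
        - 3 * runs
        - 6 * hr
    )


# Jake Arrieta, 5/14/16: 8 IP, 3 H, 2 R, 2 BB, 11 K, 1 HBP, 1 WP
print(game_score_2("8", hits=3, runs=2, walks=2, strikeouts=11, hbp=1, wp=1))  # 80

# Chris Young, 6/25/16: 2.1 IP, 7 H, 7 R, 4 BB, 2 K, 2 HR, 1 WP
print(game_score_2("2.1", hits=7, runs=7, walks=4, strikeouts=2, wp=1, hr=2))  # 0
```

Passing innings as a string avoids any floating-point trouble with values like 5.2, which in box-score notation means five and two-thirds rather than five and two-tenths.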

To get a sense of Game Score’s scale, here are examples of various Game Scores, each of them an actual start from the 2016 season:

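| Pitcher         | Date    | IP  | H  | R | BB | K  | Other       | GS2 |
|-----------------|---------|-----|----|---|----|----|-------------|-----|
| Danny Duffy     | 8/1/16  | 8   | 1  | 0 | 1  | 16 |             | 100 |
| Matt Shoemaker  | 7/16/16 | 7.1 | 3  | 0 | 0  | 12 |             | 90  |
| Jake Arrieta    | 5/14/16 | 8   | 3  | 2 | 2  | 11 | 1 HBP, 1 WP | 80  |
| Hector Santiago | 6/15/16 | 6   | 2  | 1 | 2  | 5  |             | 70  |
| Chris Archer    | 8/1/16  | 7.1 | 6  | 3 | 1  | 6  | 1 HR, 1 WP  | 60  |
| Jharel Cotton   | 9/13/16 | 5.2 | 7  | 3 | 1  | 2  | 1 WP        | 50  |
| Robert Gsellman | 9/9/16  | 5   | 7  | 4 | 2  | 6  | 1 HR        | 40  |
| Joel De La Cruz | 8/10/16 | 4   | 7  | 4 | 2  | 2  | 1 HR        | 30  |
| Jordan Lyles    | 5/23/16 | 2.1 | 5  | 6 | 3  | 3  | 1 HBP, 1 WP | 20  |
| Jacob deGrom    | 8/24/16 | 4.2 | 12 | 5 | 2  | 3  | 3 HR        | 10  |
| Chris Young     | 6/25/16 | 2.1 | 7  | 7 | 4  | 2  | 2 HR, 1 WP  | 0   |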

However you feel about the details, the broad strokes are reasonable; those starts generally get worse as you go down the list.

Let’s consider the characteristics of this formula. It includes aspects of pitching performance beyond simply counting innings and runs allowed, but also reflects actual hits and runs rather than completely disclaiming the pitcher’s responsibility for events to which other players contribute. It indirectly accounts for fielding by putting extra weight on things that are clearly the pitcher’s responsibility (strikeouts, walks, and homers) compared to things that have shared responsibility (hits and runs), but doesn’t assume a fixed level of fielding contribution based on a team’s overall performance. Which is to say that, to varying extents, it addresses the issues I have with all three versions of pitcher WAR. That doesn’t make it a perfect metric – but it does make it an intriguing option to build around.

Next time, we’ll continue the building process by going through a central question of any comprehensive baseball measure: how do you adjust for context?

Monday, September 22, 2025

Starting Pitcher Ratings: Motivation

Around this time last year, I wrote a series of posts introducing the Weighted WAR framework for evaluating position players. At the time, I specifically excluded pitching value from consideration. At the risk of being a bit self-indulgent, I’ll quote my own post from last year on this topic:

“The question still remains: which form of WAR to use? This is a thorny question for pitchers, because unlike the batting and fielding components of different WAR systems (which may differ in outcome but are generally trying to accomplish the same thing), pitching WAR varies philosophically from system to system. To hugely simplify a complicated topic, Baseball Reference WAR starts from pitcher runs allowed and makes a blanket adjustment for the general quality of the team’s fielders. Fangraphs WAR uses fielding independent numbers (strikeouts, walks and home runs) to measure a pitcher’s entire contribution and assumes any variance outside of that is attributable to the team’s fielders. And RA9 WAR (also presented by Fangraphs these days, and used in other sources as well, including Seamheads for NeL stats) takes the pitcher’s park adjusted runs allowed and ignores any fielding effects entirely.

I don’t particularly like any of the freely available WAR systems for pitchers; I think they all simplify the pitcher-fielder dynamic too much and produce extreme outcomes as a result. As such, both because of the extra work that would be involved and a lack of faith in a reasonable outcome, I’m not considering pitching value as part of this project. Again, this is by no means intended to imply that pitching value shouldn’t count; I may circle back to this topic in the future.”

Welcome to “in the future.” In my next several posts, I’ll be introducing a new method of evaluating starting pitchers, one which I think provides a compelling alternative to any of the readily available WAR systems. But before we get into that, it’s probably a good idea to further explore those WAR systems and explain why I think an alternative is worth introducing to begin with.

As mentioned above, there are three versions of pitching WAR that can be easily found on major baseball websites. I’ll take them on in reverse order of my own preference.

My least-favorite option is also the simplest: RA9 WAR. RA9 stands for runs allowed per 9 innings; it’s similar to the more-familiar ERA, but with unearned runs included (a choice that we’ll discuss later in the series, but one that I agree with). RA9 WAR simply park adjusts the pitcher’s rate of runs allowed (different sources may use different styles of park adjustment, either general to the team or specific to the pitcher), then compares it to the league’s replacement level for the year. Outside of the park adjustment, no other modifications to the pitcher’s performance are made. In particular, there is no attempt to separate the pitcher’s performance from the contributions of the fielders behind him (whether helpful or harmful), or from his bullpen’s success or failure in stranding runners who were on base when the pitcher was removed.
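The RA9/ERA distinction is simple enough to show in a couple of lines. This sketch covers only the raw rates; the park adjustment and replacement-level comparison are not shown, and the season line used is hypothetical.

```python
def per_nine(runs, outs):
    """Runs per nine innings, with innings expressed as outs recorded."""
    return 9 * runs / (outs / 3)


# Hypothetical season: 200 IP (600 outs), 80 runs allowed, 70 of them earned
ra9 = per_nine(80, 600)  # counts every run
era = per_nine(70, 600)  # counts earned runs only
print(round(ra9, 2), round(era, 2))  # 3.6 3.15
```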

To take an extreme example: Say the starting pitcher records two outs in the seventh inning, then strikes out the next hitter, but the ball gets away from the catcher and the hitter reaches base. The starter is pulled from the game, and the new pitcher immediately allows a line drive into the gap. The center fielder has a chance at the ball, but can’t quite reach it; the right fielder picks it up and tries to throw out the runner at home, but his throw is slightly up the line and the runner scores. At minimum, five players contributed to allowing that run, and one could argue that the starter bears the smallest share of that responsibility. But the official stats charge him with the run, and therefore so does RA9 WAR.

While most runs allowed are not this heavily skewed away from the pitcher’s own contributions, it’s still obvious that the efforts of the fielders and relievers are crucial pieces of context for the starter’s performance. If you compare Jim Palmer and Rick Reuschel’s 1974 seasons, but fail to account for the fact that Palmer was pitching in front of brilliant fielders such as Mark Belanger and Brooks Robinson and Paul Blair while Reuschel suffered through the butchery of Bill Madlock and Rick Monday, you won’t get the full picture of their relative effectiveness. Ignoring fielding and relief work entirely strikes me as eminently counterproductive.

Next up, we’ll go to the other extreme with FIP WAR, primarily presented on Fangraphs (and therefore usually abbreviated as fWAR). Per Fangraphs’ own explanation of their system, fWAR uses five components on a per-inning basis: home runs, walks, batters hit by pitch, strikeouts, and infield popups. (Walks and HBP are treated as equivalent, as are strikeouts and popouts.) The theory behind this is that these components are the best representation of the pitcher’s actual ability, as they are the events for which the pitcher is most directly responsible. As such, they also tend to have better year-to-year correlations than rates of hits or runs allowed.
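For reference, the commonly published FIP formula can be sketched as follows. The 13/3/2 weights are the standard public ones; the constant is reset each season so that league FIP matches league ERA, and the 3.10 default below is only a placeholder assumption. Counting infield popups alongside strikeouts follows the description above, but the rest of the fWAR calculation (park and league adjustments, conversion to wins) is not shown, and the stat line is hypothetical.

```python
def fip(home_runs, walks, hbp, strikeouts, outs, infield_flies=0, constant=3.10):
    """Fielding Independent Pitching on an ERA-like scale.

    `constant` is a placeholder; the real value varies by league and season.
    `infield_flies` is treated like strikeouts, per the description above.
    """
    innings = outs / 3
    numerator = (
        13 * home_runs
        + 3 * (walks + hbp)
        - 2 * (strikeouts + infield_flies)
    )
    return numerator / innings + constant


# Hypothetical season: 200 IP (600 outs), 20 HR, 50 BB, 5 HBP, 200 K
print(round(fip(20, 50, 5, 200, outs=600), 3))  # 3.225
```

Note that hits on balls in play and runs allowed never enter the calculation, which is the whole point of the metric and also the source of the objections that follow.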

As a predictive measure of pitching ability, FIP is almost certainly preferable to RA or ERA; it does in fact correlate better year-to-year, both with itself and with future ERA. But WAR is not generally advertised as a forward-looking projection system; it’s presented as a measure of backward-looking value. A pitcher who posts a solid FIP but also allows a high number of line drives to remote areas of the park might still have a bright future, but the actual hits and runs he allows still hurt the team.

FIP is also overly reductive in that it ignores less-measurable but still existent pitcher abilities, including things like wild pitch/passed ball tendencies, ground ball rate, and performance with runners on base (some pitchers work better from the stretch than others). Just to pick one, let’s focus on the pitcher’s own fielding, specifically with regard to four-time Gold Glove winner Mark Buehrle. Buehrle was one of the best-fielding pitchers in the game on batted balls, and also a master of controlling baserunners. He allowed 59 total stolen bases over his 16-year career (fewer than four per season); by comparison, he picked off an even 100 runners (just over six per year). Given that FIP by definition ignores the pitcher’s own fielding, it is not a coincidence that Buehrle’s career RA9 WAR (59.9) exceeds his fWAR (52.3) by a substantial margin, and this is clearly a case in which fWAR gets it wrong. There are enough similar errors (in both directions) to reduce the utility of fWAR for purposes of historical evaluation, which is my primary interest.

That brings us to Baseball Reference WAR (generally called bWAR), which is my preference among these options. Like RA9 WAR, bWAR starts with the pitcher’s park adjusted runs allowed. However, it then adjusts for the quality of the fielders behind the pitcher based on his team’s performance in the fielding metric used in position player WAR calculations (which changes depending on the level of data available for the year in question). This effect is calibrated for the number of balls in play the pitcher allows; it will typically be worth between 0.5 and -0.5 runs per 9 innings on all but the most extreme teams.

Sounds great, right? Unlike fWAR, it doesn’t ignore actual runs allowed, but unlike RA9 WAR, it accounts for the team’s fielders. And in practice, this usually works reasonably well; there’s a reason it’s my preferred option among the three WAR systems. But just as with the other two, there are edge cases that cause trouble. The primary problem here arises from the assumption that all of the pitchers on a team get the same level of support from the team’s fielders – an assumption that is not ridiculous on its face, but can be cast into doubt by a few examples.

In this case, let’s look at a quartet of Texan aces in 2019: Justin Verlander and Gerrit Cole of the Astros, and Mike Minor and Lance Lynn of the Rangers. All four put up very good seasons, and all four received Cy Young votes. By conventional numbers, Cole (20-5, 2.50 ERA) and Verlander (21-6, 2.58) were solidly better than Lynn (16-11, 3.67) or Minor (14-10, 3.59). Unsurprisingly, given the difference in their ERAs, RA9 WAR prefers the Astros pair as well; the gap narrows somewhat thanks to Texas’s status as a hitter’s park, but the AL’s top four still goes Verlander 9.0, Cole 8.0, Minor 6.5, Lynn 6.3.

Houston, however, was a much better team than Texas (or almost anyone else; the Astros won 107 games), so it’s not a surprise to learn that they had better fielders. You might think, then, that FIP would close the gap for the Rangers. And it does… for one of them. Lynn’s FIP of 3.13 is a marked improvement on his 3.67 ERA, while Verlander’s 2.58 ERA balloons into a 3.27 FIP. Cole, on the other hand, posted a 2.64 FIP (thanks to an absurd 13.8 strikeouts per 9 innings), very comparable to his ERA. And Minor? His 3.59 ERA climbs to a 4.25 FIP. As such, fWAR has a top 3 of Cole-Lynn-Verlander, and Minor doesn’t show up until the #8 position.

And what does bWAR make of this group of pitchers? Well, as noted, bWAR evaluates Houston’s fielders as very good (bWAR’s fielding adjustment for Cole is +0.46 runs per 9) and Texas’s as very bad (Lynn’s adjustment is -0.30). So, starting from the baseline of RA9 WAR and accounting for quality of fielding, bWAR gives a result of Minor 8.0, Lynn 7.7, Verlander 7.4, Cole 6.7.

This is a bit surprising, given the other results, so what’s going on? As noted, B-R gives Minor substantial credit for pitching effectively in front of a bad defense. As a sanity check for this, let’s look at batting average on balls in play (BABIP for short). The AL average BABIP in 2019 was .299; the Rangers allowed a mark of .314, 15 points higher than average and consistent with a poor fielding team. (By contrast, Houston allowed a .272 BABIP.) Lance Lynn’s BABIP allowed was slightly worse than his team’s, at .322; adjusting for this seems totally fair in his case. As for Minor? His BABIP allowed was .288 – lower than the AL average. There are three possible interpretations of this. First, hitters had a tougher time hitting Minor and his balls in play were easier for the Rangers’ subpar fielders to handle. Second, the Rangers’ fielders performed better with Minor on the mound than they did otherwise, for whatever reason. Or third, both factors had an effect, and the relative extent of those effects is unknowable.
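For readers who want to run the same sanity check, here is a rough sketch. The BABIP function uses the standard public definition (hits minus home runs, divided by at-bats minus strikeouts minus home runs plus sacrifice flies); the function names are illustrative, and the counting stats in the first example are hypothetical. The 2019 BABIP figures in the comparison are the ones quoted above.

```python
def babip(hits, home_runs, at_bats, strikeouts, sac_flies=0):
    """Batting average on balls in play, using the standard definition."""
    balls_in_play = at_bats - strikeouts - home_runs + sac_flies
    return (hits - home_runs) / balls_in_play


def points_vs_league(babip_allowed, league_babip):
    """Difference from league average in 'points' (thousandths)."""
    return round((babip_allowed - league_babip) * 1000)


# Hypothetical pitcher line: 180 H, 20 HR, 750 AB, 200 K, 5 SF
print(round(babip(180, 20, 750, 200, 5), 3))  # 0.299

# 2019 figures quoted above, relative to the AL average of .299
AL_2019 = 0.299
for name, mark in [("Rangers", 0.314), ("Lynn", 0.322),
                   ("Minor", 0.288), ("Astros", 0.272)]:
    print(name, points_vs_league(mark, AL_2019))
```

The comparison puts the Rangers at +15 and Lynn at +23, while Minor comes out at -11: below league average despite pitching in front of the same fielders.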

Any time a discrepancy of this type arises, bWAR uses the first interpretation. As such, it is also intermittently prone to startling swings toward or against certain pitchers. And that is how you get 2019 Mike Minor (3.59 ERA, 4.25 FIP, .288 BABIP) rated substantially ahead of Gerrit Cole (2.50 ERA, 2.64 FIP, .276 BABIP). For an even more extreme example, Aaron Nola in 2018 allowed a .254 BABIP for a team that bWAR thinks was below average in the field by 0.41 runs per 9 innings; as a result, bWAR rates Nola’s admittedly very good campaign as the single best season any pitcher has had in the last 15 years, narrowly outpacing Jacob deGrom’s 1.70 ERA/1.98 FIP effort the same year.

Where to go from here? We have three options for pitching WAR, but all three seem to have the same failure point: the pitcher-fielder interaction. In my opinion, all three systems handle this interaction too rigidly, whether by ignoring it entirely, assuming the pitcher can only affect a few specific outcomes, or assuming that all pitchers on the same team have the same fielding performance behind them. It would probably be possible to amalgamate all three systems into some sort of WAR hydra, allowing each system’s extremes to be smoothed out by the other two.

Or we could abandon the WAR framework entirely and do something completely different. Next time, we’ll start to explore the basis for a new option.