Monday, September 29, 2025

Starting Pitcher Ratings: Game Score

To introduce this series, I discussed the three major WAR systems for pitchers and the issues I have with each of them. Having pointed out the problem, let's get to work on a solution.

Whatever issues I may have with it, one of WAR’s benefits is very compelling: it reduces the player’s complex contributions to a single number, which can be compared to a similarly derived single number for other players. How does a pitcher with a 2.60 ERA and 220 strikeouts in 280 innings in 1968 compare to a pitcher with a 3.40 ERA and 250 strikeouts in 210 innings in 1999? After sorting through all the complications under the surface, WAR can give you a simple answer. If I’m suggesting an alternative to WAR, it should ideally come in a similarly straightforward package. Fortunately, just as in many other statistical problems in baseball, we can start by borrowing from the work of Bill James. In this case, we’ll use the Game Score.

(Before anyone else says it: Yes, Game Score is only set up for starting pitchers; it is not generally applied to relief outings, even relief outings made by pitchers who are usually starters. We’ll look more at the effects of this on the rankings once we get far enough to actually discuss rankings, but there definitely is an effect. It is also worth pointing out that we can only pull Game Scores for pitchers who have box score records available. Since I’m pulling my data from Baseball Reference, that means I have nothing before 1901, and nothing for the Negro Leagues. So we’re working with half of Cy Young’s career, under 20% of Kid Nichols, and none of Smokey Joe Williams.)

James introduced Game Score in the ’80s as a fun way to evaluate the effectiveness of a starting pitcher’s effort in a particular game. His original formulation is:

GS = 50 + Outs + Strikeouts – Walks – 2*Hits – 4*(Earned Runs) – 2*(Unearned Runs) + 2*(Innings completed after the fourth)
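For anyone who would rather not build this in a spreadsheet, here is a minimal Python sketch of the formula above. The function and parameter names are illustrative assumptions, not part of any library, and "innings completed after the fourth" is read as full innings beyond four.

```python
def game_score_v1(outs, strikeouts, walks, hits, earned_runs, unearned_runs):
    """Original Game Score, following the formula above term by term."""
    # Only whole innings count toward the bonus: 14 outs is 4 completed innings.
    innings_completed = outs // 3
    bonus_innings = max(0, innings_completed - 4)
    return (
        50
        + outs
        + strikeouts
        - walks
        - 2 * hits
        - 4 * earned_runs
        - 2 * unearned_runs
        + 2 * bonus_innings
    )


# A made-up line for illustration: 9 IP, 2 H, 0 R, 1 BB, 10 K.
print(game_score_v1(outs=27, strikeouts=10, walks=1, hits=2,
                    earned_runs=0, unearned_runs=0))  # 92
```

Note that the function needs the earned/unearned split for every single start, which is exactly the input that turns out to be a problem below.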

The description is a little clunky (and so is implementing it in a spreadsheet of game logs), but the results are pretty reasonable. League average will usually be around 50. 60 is a solid start, 70 is a very good one, 80 is excellent, and 90 will usually be among the 25-50 best overall starts of the season. The same is true in the other direction; 40 is shaky, 30 is pretty bad, 20 is lousy, etc. My initial foray into starting pitcher evaluation used the original formulation of Game Score, and it generally performed well… until it didn’t.

The major issue I ran into deals with the earned/unearned run distinction. Whether this distinction has value in general is a bit of a rabbit hole (most of the research I’ve seen suggests it does not), but even if you think it’s worth using earned runs in some circumstances, it becomes a huge problem trying to use them for game-by-game evaluation of baseball seasons from over a century ago.

Here is a sample of the Baseball Reference game log page for Christy Mathewson’s outstanding 1905 season. See if you can spot the issue:

Yeah, that’s not going to work. We know Matty’s overall total of earned runs allowed for the year (48, out of 85 runs total), but game-to-game we have no idea of his earned/unearned run split.

Having encountered this problem, I went looking for an alternative Game Score formulation. Fortunately, Fangraphs already had one, courtesy of the inimitable Tangotiger. Game Score 2 is laid out here, and the formula is as follows:

GS2 = 40 + 2*Outs + Strikeouts – 2*Hits – 2*Walks – 3*Runs – 6*(Home Runs)

To clarify, the home run is also a hit (and will result in at least one run scoring); the 6 points subtracted are in addition to the other penalties, just as the point for a strikeout is added on top of the points already given for outs in general. If you’re curious about where the specific coefficients come from, Tango explains it further here (link is to part 3, which in turn links to parts 1 and 2). The short version is that it’s a combination of the run value of each event type and the pitcher’s estimated share of responsibility for those events. To take the obvious example, a hit is more damaging than a walk, but the hit is partly the responsibility of the fielders whereas the walk is completely the pitcher’s fault, so the pitcher is penalized the same amount for each of them.
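The stacking described above is easier to see in code. This is a minimal sketch of the GS2 formula as written; the function and parameter names are illustrative assumptions rather than anything from Fangraphs or Tango.

```python
def game_score_v2(outs, strikeouts, hits, walks, runs, home_runs):
    """Game Score 2, following the formula above term by term."""
    return (
        40
        + 2 * outs
        + strikeouts
        - 2 * hits
        - 2 * walks
        - 3 * runs
        - 6 * home_runs
    )


empty = game_score_v2(outs=0, strikeouts=0, hits=0, walks=0, runs=0, home_runs=0)

# A strikeout is also an out, so it adds 2 + 1 = 3 points.
one_strikeout = game_score_v2(outs=1, strikeouts=1, hits=0, walks=0, runs=0, home_runs=0)
print(one_strikeout - empty)  # 3

# A hit and a walk carry the same penalty.
one_hit = game_score_v2(outs=0, strikeouts=0, hits=1, walks=0, runs=0, home_runs=0)
one_walk = game_score_v2(outs=0, strikeouts=0, hits=0, walks=1, runs=0, home_runs=0)
print(one_hit - empty, one_walk - empty)  # -2 -2
```

Because the inputs are plain counting stats, a home run has to be entered in hits, runs, and home_runs alike for the penalties to stack as described.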

As noted earlier, the GS2 formulation resolves the issue of missing data by removing the earned/unearned run distinction. Conveniently, it also addresses another problem with Original Recipe Game Score, which is the handling of short starts. Usually if your pitcher gets pulled after one inning, you’re going to consider that a below average result. As such, rather than starting each game at the expected average of 50, it makes more sense to start from a below-average 40 and build up from there. (Sorry, openers! But not all that sorry. We’ll come back to them later in the series.) For additional detail, here is a table of the baseline score earned by a pitcher who completes each whole number of innings 1-9 in both Game Score versions:

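| IP  | GS1 | GS2 |
|-----|-----|-----|
| 1.0 | 53  | 46  |
| 2.0 | 56  | 52  |
| 3.0 | 59  | 58  |
| 4.0 | 62  | 64  |
| 5.0 | 67  | 70  |
| 6.0 | 72  | 76  |
| 7.0 | 77  | 82  |
| 8.0 | 82  | 88  |
| 9.0 | 87  | 94  |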

The modified formula places a higher premium on lasting longer in the game, but also includes larger penalties for negative outcomes; the 5-point baseline difference for a 7-inning start can be erased by allowing one solo home run (worth -6 points in GS1, 2 for the hit and 4 for the earned run; -11 in GS2, 2 for the hit, 3 for the run, 6 extra for the homer).
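The baseline figures and the solo-homer arithmetic can be reproduced with a few lines of Python. This is a sketch under the assumption that "baseline" means a start with nothing but outs recorded (no hits, walks, strikeouts, or runs); the helper names are made up for illustration.

```python
def gs1_baseline(innings):
    """GS1 for a start of whole innings with nothing but outs recorded."""
    return 50 + 3 * innings + 2 * max(0, innings - 4)


def gs2_baseline(innings):
    """GS2 for a start of whole innings with nothing but outs recorded."""
    return 40 + 2 * (3 * innings)


for innings in range(1, 10):
    print(f"{innings}.0  GS1={gs1_baseline(innings)}  GS2={gs2_baseline(innings)}")

# Add one solo home run to a 7-inning start.
gs1_after_homer = gs1_baseline(7) - 2 - 4      # hit, earned run
gs2_after_homer = gs2_baseline(7) - 2 - 3 - 6  # hit, run, home run
print(gs1_after_homer, gs2_after_homer)  # 71 71
```

The two versions cross at four innings: below that GS1 is more generous, above it GS2 is, which is the short-start behavior discussed earlier.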

The wider variance in GS2 can be more fully demonstrated by looking at a bit of seasonal data. In 2016, the original Game Score formula produced no scores over 100, and five negative scores; GS2 (with my modifications, to be revealed shortly) saw six starts clear 100, and a whopping 31 dip below 0. The standard deviations differ accordingly (16.8 for GS1, 19.5 for GS2). These results are pretty typical.

The improvements in handling unearned runs and truncated outings combine to make GS2 my clear choice of metric moving forward in this project. Being who I am, though, I couldn’t resist making a couple of small modifications. You may have noticed in the screenshot above that Baseball Reference has hit batters and wild pitches recorded on a game-by-game basis even as far back as 1905, and I can think of no reason not to account for those when I already have the numbers handy. (Other data, such as stolen bases allowed, pickoffs, passed balls, or intentional walks, might also provide value, but aren’t available for the oldest games in the sample. Balks are available, but are so rare that it’s impossible to imagine them making a difference in an overall evaluation. To pick a season at random, MLB in 1955 saw 506 total hit batters, 477 wild pitches, and only 36 balks.)

How do we adjust for hit batters and wild pitches? HBP are easy; they are exactly equivalent to walks and are therefore penalized equally (-2 points). WP are more complicated. They’re similar to stolen bases, which run estimators generally consider to be worth less than half as much as a walk. But in comparison to stolen bases, wild pitches are more likely to score a run from third, sometimes advance multiple runners at once, and when coming at the end of a strikeout, can even occasionally put a runner on base, which should make them proportionately more damaging. We’ll value them at -1 point. The final formula is:

GS2 = 40 + 2*Outs + Strikeouts – 2*(Hits + Walks + Hit Batters) – (Wild Pitches) – 3*Runs – 6*(Home Runs)
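As a rough illustration of the final formula, here is a minimal Python sketch. The `StartLine` container and `modified_gs2` function are hypothetical names invented for this example, and the pitching line used is made up rather than taken from a real game.

```python
from dataclasses import dataclass


@dataclass
class StartLine:
    """One start's counting stats (a hypothetical container for illustration)."""
    outs: int
    hits: int
    runs: int
    walks: int
    strikeouts: int
    home_runs: int = 0
    hit_batters: int = 0
    wild_pitches: int = 0


def modified_gs2(s: StartLine) -> int:
    """GS2 with hit batters counted like walks and wild pitches at -1 each."""
    return (
        40
        + 2 * s.outs
        + s.strikeouts
        - 2 * (s.hits + s.walks + s.hit_batters)
        - s.wild_pitches
        - 3 * s.runs
        - 6 * s.home_runs
    )


# Same made-up 6-inning line, with and without a hit batter and a wild pitch.
clean = StartLine(outs=18, hits=5, runs=2, walks=2, strikeouts=6)
messy = StartLine(outs=18, hits=5, runs=2, walks=2, strikeouts=6,
                  hit_batters=1, wild_pitches=1)
print(modified_gs2(clean), modified_gs2(messy))  # 62 59
```

Defaulting the three less common columns to zero means a game log that lacks them still produces plain GS2.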

To get a sense of Game Score’s scale, here are examples of various Game Scores, each of them an actual start from the 2016 season:

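| Pitcher         | Date    | IP  | H  | R | BB | K  | Other        | GS2 |
|-----------------|---------|-----|----|---|----|----|--------------|-----|
| Danny Duffy     | 8/1/16  | 8   | 1  | 0 | 1  | 16 |              | 100 |
| Matt Shoemaker  | 7/16/16 | 7.1 | 3  | 0 | 0  | 12 |              | 90  |
| Jake Arrieta    | 5/14/16 | 8   | 3  | 2 | 2  | 11 | 1 HBP, 1 WP  | 80  |
| Hector Santiago | 6/15/16 | 6   | 2  | 1 | 2  | 5  |              | 70  |
| Chris Archer    | 8/1/16  | 7.1 | 6  | 3 | 1  | 6  | 1 HR, 1 WP   | 60  |
| Jharel Cotton   | 9/13/16 | 5.2 | 7  | 3 | 1  | 2  | 1 WP         | 50  |
| Robert Gsellman | 9/9/16  | 5   | 7  | 4 | 2  | 6  | 1 HR         | 40  |
| Joel De La Cruz | 8/10/16 | 4   | 7  | 4 | 2  | 2  | 1 HR         | 30  |
| Jordan Lyles    | 5/23/16 | 2.1 | 5  | 6 | 3  | 3  | 1 HBP, 1 WP  | 20  |
| Jacob deGrom    | 8/24/16 | 4.2 | 12 | 5 | 2  | 3  | 3 HR         | 10  |
| Chris Young     | 6/25/16 | 2.1 | 7  | 7 | 4  | 2  | 2 HR, 1 WP   | 0   |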

However you feel about the details, the broad strokes are reasonable; those starts generally get worse as you go down the list.
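To check the arithmetic on those sample starts, here is a short Python sketch that recomputes each score from the pitching line. It assumes the IP column uses the usual box-score notation, where ".1" and ".2" mean one and two outs rather than decimal fractions; the helper names and the tuple layout are illustrative choices.

```python
def ip_to_outs(ip: str) -> int:
    """Convert box-score innings ('7.1' = 7 innings plus 1 out) to total outs."""
    whole, _, partial = ip.partition(".")
    return 3 * int(whole) + int(partial or 0)


def modified_gs2(outs, hits, runs, walks, strikeouts, home_runs, hit_batters, wild_pitches):
    """GS2 with the hit-batter and wild-pitch terms from the final formula."""
    return (40 + 2 * outs + strikeouts
            - 2 * (hits + walks + hit_batters)
            - wild_pitches - 3 * runs - 6 * home_runs)


# (pitcher, IP, H, R, BB, K, HR, HBP, WP, listed GS2), copied from the table above.
SAMPLE_STARTS = [
    ("Danny Duffy", "8", 1, 0, 1, 16, 0, 0, 0, 100),
    ("Matt Shoemaker", "7.1", 3, 0, 0, 12, 0, 0, 0, 90),
    ("Jake Arrieta", "8", 3, 2, 2, 11, 0, 1, 1, 80),
    ("Hector Santiago", "6", 2, 1, 2, 5, 0, 0, 0, 70),
    ("Chris Archer", "7.1", 6, 3, 1, 6, 1, 0, 1, 60),
    ("Jharel Cotton", "5.2", 7, 3, 1, 2, 0, 0, 1, 50),
    ("Robert Gsellman", "5", 7, 4, 2, 6, 1, 0, 0, 40),
    ("Joel De La Cruz", "4", 7, 4, 2, 2, 1, 0, 0, 30),
    ("Jordan Lyles", "2.1", 5, 6, 3, 3, 0, 1, 1, 20),
    ("Jacob deGrom", "4.2", 12, 5, 2, 3, 3, 0, 0, 10),
    ("Chris Young", "2.1", 7, 7, 4, 2, 2, 0, 1, 0),
]

computed = {}
for name, ip, h, r, bb, k, hr, hbp, wp, listed in SAMPLE_STARTS:
    score = modified_gs2(ip_to_outs(ip), h, r, bb, k, hr, hbp, wp)
    computed[name] = score
    print(f"{name:<16} {ip:>4}  computed={score:>3}  listed={listed:>3}")
```

Every computed score matches the listed one, so the table can double as a small regression test for any implementation of the formula.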

Let’s consider the characteristics of this formula. It includes aspects of pitching performance beyond simply counting innings and runs allowed, but also reflects actual hits and runs rather than completely disclaiming the pitcher’s responsibility for events to which other players contribute. It indirectly accounts for fielding by putting extra weight on things that are clearly the pitcher’s responsibility (strikeouts, walks, and homers) compared to things that have shared responsibility (hits and runs), but doesn’t assume a fixed level of fielding contribution based on a team’s overall performance. Which is to say that, to varying extents, it addresses the issues I have with all three versions of pitcher WAR. That doesn’t make it a perfect metric – but it does make it an intriguing option to build around.

Next time, we’ll continue the building process by going through a central question of any comprehensive baseball measure: how do you adjust for context?
