Monday, October 13, 2025

Starting Pitcher Ratings: Deviations

So far in this series, we’ve discussed the reason to develop a new rating system, introduced the version of Game Score we’re using, and worked out the adjustments necessary for Game Score on a game-to-game level. This time, we’ll work on converting average Game Score into a more robust metric that balances excellence and playing time.

The immediate temptation here is to attempt a conversion between Game Score points and runs, and transition into yet another WAR system. Wanting to avoid the confusion and redundancy of adding a fourth version of pitching WAR to the conversation, I instead decided to try a different approach, one based around the idea of statistical confidence.

Every baseball fan understands statistical confidence on some level. Any player can go 2 for 5 in a game; it’s quite a bit harder to go 20 for 50 over a longer period, and nearly impossible these days to go 200 for 500. In any statistical measure you can name, the larger the sample over which the performance is sustained, the more confidence you can have that said performance is a reliable reflection of the theoretical true talent of the player who produced it.

This is readily understandable in theory; the practice is trickier, and we’ll discuss the exact details of the sample size adjustment later. For now, I’ll just say that I’ve modeled the metric on the general concept of standard deviations away from average, a very common approach to statistical confidence. But before 80% of the audience leaves (and the other 20% has their hopes raised too high), I’ll say that I’m not aiming for a 100% rigorous standard-deviations-above-average metric. There are two major differences between that approach and the one I’m introducing here: I’m not using the actual standard deviation (either for the individual pitcher or for the league), and I’m not comparing to average.

We’ll start with the issue of the baseline for comparison. Average is always a tempting option, since it’s generally both easy to calculate and easy to explain. An average team wins about half of its games; an average player therefore contributes roughly equal numbers of wins and losses to his team. But one of the fundamental insights around which WAR is based is that average players are neither easily nor cheaply obtainable. A starting pitcher who reliably posts league average numbers over 25-30 starts per year will easily command an eight-figure annual salary as a free agent.

So what to use as the baseline? I tried a few approaches and settled on the aggregate performance of pitchers who make less than 1/15 of their team’s starts (which is to say, less than a third of the starts a full-season pitcher would make in a five-man rotation). For a full-length schedule of either 154 or 162 games, this is 10 starts or fewer. The seasons for which this cutoff is below 10 (either due to war, labor dispute, pandemic, or simply shorter standard schedule) are listed here:

Year   Cutoff Starts
1901   9
1902   9
1903   9
1918   8
1919   9
1981   7
1994   7
1995   9
2020   4
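If you’d rather see the cutoff rule as code, here’s a minimal sketch – just the scheduled games divided by 15 and rounded down, which reproduces the values in the table:

```python
def scrub_cutoff(scheduled_games: int) -> int:
    """Maximum number of starts a pitcher can make and still count toward
    the low-usage ("scrub") group: 1/15 of the team's schedule, rounded down."""
    return scheduled_games // 15

# Spot checks against the table above:
print(scrub_cutoff(162))  # 10 -- full modern schedule
print(scrub_cutoff(154))  # 10 -- pre-expansion schedule
print(scrub_cutoff(140))  # 9  -- the 140-game schedules of 1901-03
print(scrub_cutoff(60))   # 4  -- 2020
```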

In theory, this method should provide a sample composed of four groups: pitchers whose seasons were shortened by injury, pitchers who lost their spots in the rotation due to poor performance, pitchers who replaced either an injured pitcher or a poor performer, and prospects who earned late-season callups. (Nothing stops a pitcher from occupying multiple groups at once, of course – a prospect could get called up to replace a poor performer and then get injured, for example.) Two of those groups (replacements and prospects) were likely not being counted on as primary options in the team’s original plans for the season, and a third (poor performers) weren’t trusted enough to stay in the rotation through a bad stretch. As such, the hope is that the overall production of this collection of pitchers should be solidly below average, and therefore usable as a lower performance baseline.

How does this baseline fare in practice? Frankly, about as well as can reasonably be hoped for. From 1901-2024, pitchers below the cutoff (let’s call them “scrubs”) produced an average adjusted Game Score of 43.2. Pitchers above the cutoff (“regulars”) were at 51.8. In every single one of the 124 seasons in the sample, regulars produced substantially higher Game Scores than scrubs. The average difference was 8.6, with the differences ranging from 5.5 to 14.6. 96 of the 124 seasons produced differences between 7 and 10. Let’s look at the outliers on either side, because both are instructive.

10 largest differences between average adjusted Game Scores for regulars and scrubs:

Year   Scrub   Regular   Difference
1902   42.7    57.3      14.6
1901   43.5    57.5      13.9
1919   43.1    54.7      11.6
1950   39.9    51.3      11.4
1904   46.3    57.7      11.4
1912   44.7    55.9      11.2
1903   46.8    57.6      10.8
1934   42.3    52.9      10.7
2004   39.5    50.0      10.5
1920   44.2    54.7      10.4

Yes, that is 7 of the top 10 differences coming between 1901-20, including the first four years of the period. Given that the founding of the AL in 1901 doubled the size of the major leagues, I don’t think that’s a coincidence. (The other three seasons show no obvious pattern.)

10 smallest differences between average adjusted Game Scores for regulars and scrubs:

Year   Scrub   Regular   Difference
1945   46.6    52.0      5.5
1981   44.3    49.9      5.6
1982   43.6    49.5      5.9
1937   46.0    52.2      6.2
1978   44.1    50.3      6.3
1907   49.5    55.8      6.3
1922   46.3    52.8      6.5
1918   48.3    54.8      6.5
2011   43.5    50.2      6.7
2023   43.4    50.1      6.7

1945 and 1918 are a problem here. Remember the four types of pitchers you’d expect to fall below the cutoff? These seasons add a fifth – pitchers whose campaigns were artificially shortened by military service. To pick the most famous examples, 1918 Grover Cleveland Alexander and 1945 Bob Feller both came in under the cutoff because they either entered the service shortly after the year started (Alexander) or returned from it near the end (Feller).

Also, the late ‘70s and early ‘80s are interesting, especially when you learn that 1979 was eleventh-lowest, and 1980 was fourteenth. Outside of that, I don’t see much of a pattern; there are in fact several of these years that are pretty close to seasons that made the top 10 (1904 and 1907, or 1920 and 1922, or 1934 and 1937).

Why so much focus on the exact difference between the Game Scores for regulars and scrubs? I’ll posit that the difference between regulars and scrubs can also serve as a good choice of deviation to use in a stab at a confidence-based measure, since seasons that are prone to extremes in one sense (performance of replacement pitchers) are often prone to extremes in others (performance of standout pitchers). I did look for correlations between these two factors; the results were not unassailable, but there was a reasonable positive connection, enough that I feel comfortable proceeding.

However, we’ve seen that in its raw form, the regular-scrub difference is prone to pretty wild annual variations, so it probably needs to be toned down at least a bit. I’m using a rolling five-year average, with a historically normal 8.5 added as a regression to the mean to mute the 1945-style extremes a bit more. (Note that for 1901-02, we don’t have a full five years of surrounding data, so the regression term has a bit more weight. This is also true for the most recent seasons, which is why I’m not finalizing the numbers from those years yet.) As you would likely guess based on the above tables, 1901 has the highest rolling average deviation (11.97), and 1980 has the lowest (6.71). Using the weighted and regressed average deviation, 111 of the 122 finalized years score between 7.5 and 9.5; 89 are between 8 and 9, and 58 are between 8.25 and 8.75.
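Here’s a rough sketch of that smoothing in code – a simplified version that treats the 8.5 as one extra value averaged in with whatever years the window actually contains; fed the rounded differences from the tables above, it lands within a few hundredths of the 11.97 quoted for 1901:

```python
def smoothed_scrub_deviation(raw_dev: dict[int, float], year: int,
                             prior: float = 8.5, window: int = 2) -> float:
    """Rolling five-year average of the regular-minus-scrub gap, with a
    'historically normal' prior averaged in as one extra value. Years with
    no data simply drop out, which gives the prior more weight at the
    edges of the dataset."""
    values = [raw_dev[y] for y in range(year - window, year + window + 1)
              if y in raw_dev]
    return (sum(values) + prior) / (len(values) + 1)

# Rough check using the rounded 1901-03 differences from the tables above.
raw = {1901: 13.9, 1902: 14.6, 1903: 10.8}
print(round(smoothed_scrub_deviation(raw, 1901), 2))  # 11.95, vs. the quoted 11.97
```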

Once you take a rolling average of the deviation, you’re suddenly no longer using the measured single-season performance of scrub pitchers, so the baseline has to be reoriented in comparison to league average. The relevant equation here is:

Scrub GS2 = Average GS2 – (Scrub Deviation) * (1 – Scrub Fraction),

where Scrub Fraction, unsurprisingly, is the percentage of the league’s total starts made by scrubs. This leads us to one final complication: scrub fraction, like many of the other values we’ve encountered, varies from year to year. The average across our seasonal data is 11.2%; 101 of the 124 seasons fall between 8.2% and 14.1%. Scrub fraction does tend to be low in the very oldest seasons (six of the bottom nine were recorded between 1901-07); after that, it jumps around quite a bit… until recently. Every season since 2015 has been above average in scrub fraction, and every season since 2019 has been at 13.2% or higher. The two most recent years with completed data (2023-24) have the second- and third-highest scrub fractions, at 16.3% and 16.1%. (They trail only 1946, which is heavily influenced by pitchers making their way back from war.)

Anyone who’s watched a significant amount of recent baseball can probably figure out the reason for this increase: the opener. In 2013, there were three pitchers who averaged less than 2 innings per start, each of whom made exactly one start. In 2018, there were 27 such pitchers making a total of 92 starts (29 by proto-opener Ryne Stanek). In 2023? 58 pitchers, 144 starts, and that’s without a Stanek inflating the numbers (the highest individual total was 13). Nearly 3% of the league’s starts in ’23 were taken by openers, and I don’t anticipate the trend reversing any time soon.

I pondered the handling of the opener for a bit, and decided to make a simple compensation for its presence in order not to over-penalize modern starters for working in an era of higher strategic optimization. Rather than using either the single-season value or a rolling average for scrub fraction, I’m using an approximation for the historical average (11%) across the board. So if the rolling average scrub deviation is 9 points, the scrub pitcher baseline would be estimated to be 8.01 points below league average.
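In code form, the baseline estimate is a one-liner (a stripped-down sketch, not the production implementation):

```python
SCRUB_FRACTION = 0.11  # fixed historical approximation, used for every season

def scrub_baseline(league_avg_gs2: float, scrub_deviation: float) -> float:
    """Estimated average adjusted Game Score for scrub-level pitchers."""
    return league_avg_gs2 - scrub_deviation * (1 - SCRUB_FRACTION)

# The worked example from the text: a 9-point deviation puts scrubs
# 8.01 points below league average.
print(round(50.0 - scrub_baseline(50.0, 9.0), 2))  # 8.01
```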

We now have the baseline to which we can compare a pitcher’s average adjusted Game Score, but that still leaves the problem of sample size. We’ll borrow from more exacting statistical approaches here. When measuring standard deviations from average, the standard error of the average of a particular sample shrinks as the sample size increases, and does so according to the following relationship:

Standard error of sample mean = (Standard deviation of population) / sqrt(Sample size)

We’ll apply the same inverse square root relationship to the number of starts made by a particular pitcher in calculating “scrub deviations above scrub level.” So if the scrub deviation is 9 points and a pitcher makes 16 starts, we reduce the expected deviation of his average to 9/4 = 2.25. (Really, for arcane technical reasons, since our population – overall starts in the league for the season in question – is finite, the proper denominator is sqrt(Sample size – 1), so the pitcher would have to make 17 starts rather than 16 for the deviation to be cut down to exactly a quarter of its original value. The most obvious effect of this is that for pitchers with only one start, the expected deviation of the mean is infinite, so all of their scores reflect no statistical confidence. Sorry, 1930 Dizzy Dean!)

That gives us three components of the metric: a context-adjusted rate performance for the pitcher (average adjusted Game Score), a baseline for comparison (estimated average Game Score for scrubs), and a way of compensating for playing time (divide by the scrub deviation reduced by a factor of the inverse square root of games started). Having worked through all of the components, let’s bring them together to form Game Score Deviations. Here’s the raw formula:

GSDev = (Pitcher’s Average GS2 – Scrub GS2) / (Scrub Deviation / sqrt(Starts – 1))

Which is to say, the difference between the pitcher’s average Game Score and the estimated baseline for a scrub pitcher, divided by the deviation between scrubs and regulars, which has in turn been divided by the square root of the pitcher’s total starts (minus one) to account for the confidence level given by the sample size.
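For those who prefer code to prose, here’s a bare-bones sketch of the whole calculation, with variable names invented for readability:

```python
import math

def game_score_deviations(avg_gs2: float, starts: int, league_avg_gs2: float,
                          scrub_deviation: float,
                          scrub_fraction: float = 0.11) -> float:
    """Scrub deviations above scrub level for a single pitcher-season."""
    baseline = league_avg_gs2 - scrub_deviation * (1 - scrub_fraction)
    if starts <= 1:
        return 0.0  # expected deviation is infinite, so no statistical confidence
    expected_dev = scrub_deviation / math.sqrt(starts - 1)
    return (avg_gs2 - baseline) / expected_dev

# Hypothetical example: a pitcher averaging a 60 adjusted Game Score over
# 33 starts, in a league averaging 50 with a scrub deviation of 8.5.
print(round(game_score_deviations(60.0, 33, 50.0, 8.5), 2))  # roughly 11.7
```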

So that’s the formula we’ll be testing as an evaluation tool for starting pitchers. It is also an enormous amount of math. Next time, I promise, we'll get to some of the fun stuff: what do the actual results of GSDev look like on a seasonal basis?

Monday, October 6, 2025

Starting Pitcher Ratings: Adjusted Game Score

So far in this series, we’ve discussed why an alternative to pitching WAR might be beneficial, and introduced Game Score as the basis on which an alternative might be built. Now, let’s put Game Score through a few basic adjustments so we can make it more viable for historical comparisons.

We’ve seen already how Game Score takes a variety of results (innings, hits, runs, strikeouts, etc.) and combines them into a single number reflecting the effectiveness of a pitcher’s performance in a particular start. But there are still a few obvious factors in the pitcher’s results that have yet to be accounted for: what team was he facing, and when and where did the game take place?

The approach here has two steps. First, find the expected runs per game by the opponent in the park and year in question. To do this, start with the opponent’s raw scoring rate in runs per game, then adjust for their overall park factor (including both home and road games) to get their normalized scoring rate, then multiply that by the park factor of the stadium that hosted the game in question.
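A rough sketch of step one in code (park factors here are centered on 1.00, and the real pipeline naturally has more plumbing):

```python
def park_adjusted_opponent_scoring(opp_runs_per_game: float,
                                   opp_overall_park_factor: float,
                                   game_park_factor: float) -> float:
    """Expected runs per game by the opponent in the park hosting the game:
    normalize the opponent's raw scoring by their own overall (home plus
    road) park factor, then scale by the factor of the host park."""
    return (opp_runs_per_game / opp_overall_park_factor) * game_park_factor

# Hypothetical illustration: a team scoring 5.2 R/G whose own schedule of
# parks inflates offense by 4%, visiting a park with a 1.12 factor.
print(round(park_adjusted_opponent_scoring(5.2, 1.04, 1.12), 2))  # 5.6
```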

If the gritty details of park adjustments don’t interest you, skip past the formulas a couple of paragraphs down. For anyone still reading, I’m using a regressed park factor built from up to three years of data, depending on how many of the surrounding seasons the team spent in the same park. (Yes, it’s probably more rigorous to use more seasons of park data, but that also adds more complications when teams move to a new park and would delay finalization of seasonal results even further. As it is, I already had to wait for the 2025 regular season to end so I could run preliminary rankings for 2024.)

The formulas below, which apply to both the overall park factor and the home park factor, use APF for the adjusted park factor, PF(0) for the raw park factor of the season under examination, and PF(-1) and PF(1) for the previous and next seasons, respectively:

One year in park: APF = 0.67*PF(0) + 0.33

Two years: APF = 0.5*PF(0) + 0.25*PF(±1) + 0.25

Three years: APF = 0.4*PF(0) + 0.21*(PF(1) + PF(-1)) + 0.18
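Those three cases fold into a single small function – a sketch, again assuming park factors are scaled so that 1.00 is neutral, which is what the regression constants imply:

```python
def adjusted_park_factor(pf_current: float,
                         pf_prev: float | None = None,
                         pf_next: float | None = None) -> float:
    """Regressed park factor from up to three seasons in the same park,
    using the weights above; the constant terms regress toward a neutral 1.00."""
    if pf_prev is None and pf_next is None:            # one year in park
        return 0.67 * pf_current + 0.33
    if pf_prev is None or pf_next is None:             # two years
        neighbor = pf_next if pf_prev is None else pf_prev
        return 0.5 * pf_current + 0.25 * neighbor + 0.25
    return 0.4 * pf_current + 0.21 * (pf_prev + pf_next) + 0.18  # three years

# Example: a park playing 10% hitter-friendly this season, 6% and 8%
# in the surrounding seasons.
print(round(adjusted_park_factor(1.10, 1.06, 1.08), 3))  # ~1.069
```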

Park-adjusted opponent scoring can have a pretty wide range even within a single season. If you compare the figures across history, the gap is often prodigious. The 1999 Giants have an enormous expected output of 8.36 runs per game in Coors Field, while the 1968 Mets in Dodger Stadium would anticipate an anemic 2.47. (Just for fun – the ’99 Giants actually scored 58 runs in their 6 games in Coors, averaging 9.67; the ’68 Mets put up 21 in 9 games in Dodger Stadium, or 2.33 per.)

None of the above is original; park and league context adjustments were among the earliest topics of sabermetric study. The question for our purposes is, how do you apply this to Game Score, which is not laid out in run-based units?

This leads us to step 2: develop a Game Score adjustment based on scoring environment. Early in my pitching-based work, I decided to use a single formula across history for this purpose rather than modifying it year-to-year. This approach has benefits and drawbacks. One of the primary benefits at the time was making the numbers easier to work with, but it’s also nice to have a fixed formula that can capture changes in average starting pitcher production over time. The downside is a fairly marginal loss in accuracy in the adjustment, which would primarily show up in extremes that most pitchers won’t see very often (even the 1968 Dodgers don’t face the Mets at home in every start, after all).

To generate this adjustment, I pulled the GS2 numbers for three seasons: 1965, 1978, and 1994. This gave a wide range of scoring environments, but kept the league context reasonably modern without being influenced by the wonkier tendencies of current-day pitching (such as the opener). I ran a regression of GS2 against total runs scored in the game for each year, which gave the following results:

1965: GS2 = 75.7 – 2.79*R

1978: GS2 = 74.6 – 2.68*R

1994: GS2 = 72.6 – 2.46*R

Note that for any relatively normal scoring context (say, total runs per game between 7 and 11), there is less than one point of difference between the outcome of any of these formulas; for R=9 and R=10, the differences are 0.2 or less. For simplicity’s sake, I ended up using a rounded compromise formula:

Expected GS2 = 75 – 2.7*R

If you want the average to end up as 50, you adjust the pitcher’s actual Game Score by adding the difference between 50 and this value. Also, since R in the regression was total runs per game for both teams (rather than runs for the opponent alone), we double its coefficient when applying it to park-adjusted opponent scoring:

GS2 Adjustment = 5.4*(PAOS) – 25

For a pitcher facing the aforementioned ’99 Giants in Colorado, this adjustment is +20.2 points of Game Score; for the ’68 Mets visiting LA, it’s -11.7.
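For anyone who wants to check the arithmetic, here’s the adjustment as a tiny function; the 20.1 below versus the +20.2 quoted above is just the effect of rounding PAOS to two decimal places:

```python
def gs2_adjustment(paos: float) -> float:
    """Game Score adjustment for a given park-adjusted opponent scoring (PAOS):
    derived from Expected GS2 = 75 - 2.7*R with R = 2*PAOS, re-centered on 50."""
    return 5.4 * paos - 25

print(round(gs2_adjustment(8.36), 1))  # 20.1 for the '99 Giants in Coors
print(round(gs2_adjustment(2.47), 1))  # -11.7 for the '68 Mets in Dodger Stadium
```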

What do these adjustments look like over a full season? We’ll stick with ’68 and ’99 as extremes in each direction. In 1968, the pitcher with the most hitter-friendly set of opponents and environments (with at least 25 starts) was Joe Niekro of the Cubs, whose Game Scores are adjusted an average of -4.7 points; on the pitcher-friendly side, it’s LA’s Don Drysdale with a -7.9. In 1999, the most pitcher-friendly conditions belonged to San Francisco’s Shawn Estes with a -0.1, while the most hostile were unsurprisingly suffered by the poor souls condemned to Colorado, with Brian Bohanon taking the crown at +8.8. Note the complete lack of overlap between the two seasons; in fact, there’s less distance between the extremes in 1968 than there is between the highest ’68 adjustment and the lowest ’99 adjustment.

How does the formula fare over time? Average opponent-park adjusted GS2 (OPAGS2) by decade:

Years     OPAGS2
1901-09   56.1
1910-19   54.0
1920-29   52.1
1930-39   51.6
1940-49   50.8
1950-59   49.6
1960-69   50.2
1970-79   50.0
1980-89   49.4
1990-99   49.6
2000-09   49.0
2010-19   49.4
2020-24   48.9

That stabilizes reasonably quickly; once the early modern game sets in (integration and expansion), the average only varies by a point or so. There’s still an obvious difference in the earliest part of our dataset, but that’s understandable – and ultimately we’ll be adjusting for league context on a yearly basis, so the final numbers won’t be polluted by this change.

Speaking of polluting the numbers, one additional note I should make before talking results: I am including postseason starts in the consideration set. Yes, this is different from my approach in the weighted WAR system, but in that case I was constrained by what Baseball Reference factors into its WAR calculations. Here, I’m building my own system and can use whatever data I choose. And I choose to include playoff starts for two reasons. First, the postseason is considered MORE important than the regular season by the teams, players, and fans. For pitchers in particular, good or bad postseason performance can have an enormous impact on how the player is regarded. (Ask Madison Bumgarner... or Clayton Kershaw.) Second, postseason participation often costs pitchers regular season starts, either in the same year (as the team manages workload and sets the playoff rotation) or in the future (due to wear and tear, and sometimes due to immediate injury). Counting postseason stats does raise some issues of fairness to pitchers on bad teams, but it strikes me as being a more reasonable option than ignoring them.

All right, you’ve stuck with me through three posts (and a few dozen repetitions of the word “adjustment”), so it’s time for our first actual fun table of numbers. Here are the top 50 single seasons in average adjusted Game Score (25-start minimum):

Rank   Year   Pitcher             Starts   OPAGS2
1      2000   Pedro Martinez      29       79.1
2      1999   Pedro Martinez      31       76.2
3      1913   Walter Johnson      36       74.9
4      1910   Ed Walsh            36       74.3
5      1912   Walter Johnson      37       74.0
6      1994   Greg Maddux         25       73.5
7      1918   Walter Johnson      29       73.4
8      1901   Cy Young            41       73.0
9      1997   Roger Clemens       34       72.9
10     1968   Bob Gibson          37       72.6
11     1905   Ed Reulbach         29       72.3
12     1997   Pedro Martinez      31       72.2
13     1931   Lefty Grove         33       72.1
14     1995   Greg Maddux         33       71.9
15     1924   Dazzy Vance         34       71.8
16     1915   Pete Alexander      44       71.6
17     1910   Russ Ford           33       71.5
18     1909   Mordecai Brown      34       71.4
19     1919   Walter Johnson      29       71.3
20     1905   Christy Mathewson   40       71.3
21     2001   Randy Johnson       39       71.3
22     1902   Rube Waddell        27       71.0
23     1995   Randy Johnson       33       70.7
24     1912   Smoky Joe Wood      41       70.6
25     1946   Hal Newhouser       34       70.1
26     1928   Dazzy Vance         32       69.8
27     1908   Mordecai Brown      32       69.8
28     1985   Dwight Gooden       35       69.7
29     1910   Walter Johnson      42       69.7
30     1911   Ed Walsh            37       69.6
31     1965   Sandy Koufax        44       69.5
32     1908   Cy Young            33       69.5
33     1940   Bob Feller          37       69.4
34     1936   Lefty Grove         30       69.3
35     1915   Walter Johnson      39       69.3
36     1902   Cy Young            43       69.3
37     1932   Lefty Grove         30       69.2
38     1999   Randy Johnson       36       69.2
39     1972   Steve Carlton       41       69.2
40     1936   Carl Hubbell        36       69.1
41     1969   Bob Gibson          35       69.1
42     1914   Russ Ford           26       69.0
43     1971   Tom Seaver          35       68.9
44     1914   Walter Johnson      40       68.9
45     1986   Mike Scott          39       68.9
46     1911   Vean Gregg          26       68.8
47     1914   Claude Hendrix      37       68.8
48     1902   Bill Bernhard       25       68.7
49     1937   Lefty Gomez         36       68.6
50     1935   Cy Blanton          30       68.6

If I may say so, that’s a pretty fun list. You can definitely see the effects of the change in pitcher usage – Russ Ford makes as many appearances as Roger Clemens and Sandy Koufax combined. But I don’t mind letting the deadball pitchers have a moment in the sun; you can already infer from the table of league averages that their numbers are going to have some serious air let out moving forward.

Also, even with the deadball advantage on the table, the difference between Pedro Martinez’s highest average and anyone else’s highest average is the same as the gap between #3 and #23 on the list. Pedro, it turns out, was pretty good.

Pedro was not, however, especially durable; note that his exceptional 2000 season included only 29 starts, while several of the seasons behind him had totals in the high 30s and a few even cleared 40. Next time, we’ll take our context-adjusted Game Score and work on turning it into a more robust measure of seasonal performance, one that balances excellence and availability.