Monday, October 13, 2025

Starting Pitcher Ratings: Deviations

So far in this series, we’ve discussed the reason to develop a new rating system, introduced the version of Game Score we’re using, and worked out the adjustments necessary for Game Score on a game-to-game level. This time, we’ll work on converting average Game Score into a more robust metric that balances excellence and playing time.

The immediate temptation here is to attempt a conversion between Game Score points and runs, and transition into yet another WAR system. Wanting to avoid the confusion and redundancy of adding a fourth version of pitching WAR to the conversation, I instead decided to try a different approach, one based around the idea of statistical confidence.

Every baseball fan understands statistical confidence on some level. Any player can go 2 for 5 in a game; it’s quite a bit harder to go 20 for 50 over a longer period, and nearly impossible these days to go 200 for 500. In any statistical measure you can name, the larger the sample over which the performance is sustained, the more confidence you can have that said performance is a reliable reflection of the theoretical true talent of the player who produced it.

This is readily understandable in theory; the practice is trickier, and we’ll discuss the exact details of the sample size adjustment later. For now, I’ll just say that I’ve modeled the metric on the general concept of standard deviations away from average, a very common approach to statistical confidence. But before 80% of the audience leaves (and the other 20% has their hopes raised too high), I’ll say that I’m not aiming for a 100% rigorous standard-deviations-above-average metric. There are two major differences between that approach and the one I’m introducing here: I’m not using the actual standard deviation (either for the individual pitcher or for the league), and I’m not comparing to average.

We’ll start with the issue of the baseline for comparison. Average is always a tempting option, since it’s generally both easy to calculate and easy to explain. An average team wins about half of its games; an average player therefore contributes roughly equal numbers of wins and losses to his team. But one of the fundamental insights on which WAR is based is that average players are neither easily nor cheaply obtainable. A starting pitcher who reliably posts league-average numbers over 25-30 starts per year will easily command an eight-figure annual salary as a free agent.

So what to use as the baseline? I tried a few approaches and settled on the aggregate performance of pitchers who make less than 1/15 of their team’s starts (which is to say, less than a third of the starts a full-season pitcher would make in a five-man rotation). For a full-length schedule of either 154 or 162 games, this is 10 starts or fewer. The seasons for which this cutoff falls below 10 (whether due to war, labor dispute, pandemic, or simply a shorter standard schedule) are listed here:

Year    Cutoff Starts
1901    9
1902    9
1903    9
1918    8
1919    9
1981    7
1994    7
1995    9
2020    4
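To make the cutoff rule concrete, here is a minimal sketch (in Python) of how the per-season cutoff could be derived from the number of games a team actually played. The function name and the rounding convention are my assumptions, not the author’s published code; taking the floor of one fifteenth of the team’s games happens to reproduce every value in the table above, including the 4-start cutoff for the 60-game 2020 season.

def scrub_cutoff(team_games: int) -> int:
    """Largest number of starts that still counts as a 'scrub' season,
    assuming the 'less than 1/15 of the team's starts' rule rounds down."""
    return team_games // 15

# Illustrative schedule lengths, roughly matching the full and shortened
# seasons behind the table above
for games in (162, 154, 140, 126, 110, 60):
    print(f"{games} team games -> cutoff of {scrub_cutoff(games)} starts")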

In theory, this method should provide a sample composed of four groups: pitchers whose seasons were shortened by injury, pitchers who lost their spots in the rotation due to poor performance, pitchers who replaced either an injured pitcher or a poor performer, and prospects who earned late-season callups. (Nothing stops a pitcher from occupying multiple groups at once, of course – a prospect could get called up to replace a poor performer and then get injured, for example.) Two of those groups (replacements and prospects) were likely not being counted on as primary options in the team’s original plans for the season, and a third (the poor performers) wasn’t trusted enough to stay in the rotation through a bad stretch. As such, the hope is that the overall production of this collection of pitchers should be solidly below average, and therefore usable as a lower performance baseline.

How does this baseline fare in practice? Frankly, about as well as can reasonably be hoped for. From 1901-2024, pitchers below the cutoff (let’s call them “scrubs”) produced an average adjusted Game Score of 43.2, while pitchers above the cutoff (“regulars”) came in at 51.8. In every one of the 124 seasons in the sample, regulars produced substantially higher Game Scores than scrubs: the average difference was 8.6, the single-season differences ranged from 5.5 to 14.6, and 96 of the 124 seasons produced differences between 7 and 10. Let’s look at the outliers on either side, because both are instructive.

10 largest differences between average adjusted Game Scores for regulars and scrubs:

Year    Scrub    Regular    Difference
1902    42.7     57.3       14.6
1901    43.5     57.5       13.9
1919    43.1     54.7       11.6
1950    39.9     51.3       11.4
1904    46.3     57.7       11.4
1912    44.7     55.9       11.2
1903    46.8     57.6       10.8
1934    42.3     52.9       10.7
2004    39.5     50.0       10.5
1920    44.2     54.7       10.4

Yes, that’s 7 of the top 10 differences coming from 1901-20, including the first four years of the period. Given that the founding of the AL in 1901 doubled the size of the major leagues, I don’t think that’s a coincidence. (The other three seasons show no obvious pattern.)

10 smallest differences between average adjusted Game Scores for regulars and scrubs:

Year    Scrub    Regular    Difference
1945    46.6     52.0       5.5
1981    44.3     49.9       5.6
1982    43.6     49.5       5.9
1937    46.0     52.2       6.2
1978    44.1     50.3       6.3
1907    49.5     55.8       6.3
1922    46.3     52.8       6.5
1918    48.3     54.8       6.5
2011    43.5     50.2       6.7
2023    43.4     50.1       6.7

1945 and 1918 are a problem here. Remember the four types of pitchers you’d expect to fall below the cutoff? These seasons add a fifth – pitchers whose campaigns were artificially shortened by military service. To pick the most famous examples, 1918 Grover Cleveland Alexander and 1945 Bob Feller both made fewer starts than the cutoff: Alexander because he went into the army shortly after the season started, Feller because he returned from it near the end.

Also, the late ’70s and early ’80s are interesting, especially when you learn that 1979 was eleventh-lowest and 1980 was fourteenth. Outside of that, I don’t see much of a pattern; in fact, several of these years sit only a few seasons away from years that made the top 10 (1904 and 1907, or 1920 and 1922, or 1934 and 1937).

Why so much focus on the exact difference between the Game Scores for regulars and scrubs? I’ll posit that the difference between regulars and scrubs can also serve as a good choice of deviation to use in a stab at a confidence-based measure, since seasons that are prone to extremes in one sense (performance of replacement pitchers) are often prone to extremes in others (performance of standout pitchers). I did look for correlations between these two factors; the results were not unassailable, but there was a reasonable positive connection, enough that I feel comfortable proceeding.

However, we’ve seen that in its raw form, the regular-scrub difference is prone to pretty wild annual variations, so it probably needs to be toned down at least a bit. I’m using a rolling five-year average, with a historically normal 8.5 added as a regression to the mean to mute the 1945-style extremes a bit more. (Note that for 1901-02, we don’t have a full five years of surrounding data, so the regression term has a bit more weight. This is also true for the most recent seasons, which is why I’m not finalizing the numbers from those years yet.) As you would likely guess based on the above tables, 1901 has the highest rolling average deviation (11.97), and 1980 has the lowest (6.71). Using the weighted and regressed average deviation, 111 of the 122 finalized years score between 7.5 and 9.5; 89 are between 8 and 9, and 58 are between 8.25 and 8.75.
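As a rough illustration of the smoothing step, here is a sketch of one way that rolling average could be implemented. The post doesn’t spell out exactly how much weight the 8.5 regression term carries, so this version treats it as one extra pseudo-season mixed into a centered five-year window (an assumption on my part, not the author’s published method). Seasons near the edges of the data have fewer real neighbors, which automatically gives the 8.5 term more relative weight there, consistent with the note about 1901-02.

HISTORICAL_NORM = 8.5  # the "historically normal" regular-minus-scrub gap

def smoothed_deviation(raw_gap_by_year, year, window=2):
    """Centered rolling average of the raw regular-minus-scrub gap over
    year +/- window, plus one pseudo-observation at the historical norm."""
    values = [raw_gap_by_year[y]
              for y in range(year - window, year + window + 1)
              if y in raw_gap_by_year]
    values.append(HISTORICAL_NORM)  # assumed weight: one extra observation
    return sum(values) / len(values)

# Sanity check with the rounded 1901-03 gaps from the tables above: this
# yields about 11.95, close to the 11.97 quoted in the text, though the
# author's exact weighting may well differ.
print(smoothed_deviation({1901: 13.9, 1902: 14.6, 1903: 10.8}, 1901))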

Once you take a rolling average of the deviation, you’re no longer using the measured single-season performance of scrub pitchers, so the baseline has to be re-expressed relative to league average. The relevant equation here is:

Scrub GS2 = Average GS2 – (Scrub Deviation) * (1 – Scrub Fraction),

where scrub fraction, unsurprisingly, is the percentage of the league’s total starts made by scrubs. This leads us to one final complication: scrub fraction, like many of the other values we’ve encountered, varies from year to year. The average across our seasonal data is 11.2%, and 101 of the 124 seasons fall between 8.2% and 14.1%. Scrub fraction does tend to be low in the very oldest seasons (six of the bottom nine come from 1901-07); after that, it jumps around quite a bit… until recently. Every season since 2015 has been above average in scrub fraction, and every season since 2019 has been at 13.2% or higher. The two most recent years with completed data (2023-24) have the second- and third-highest scrub fractions, at 16.3% and 16.1%. (They trail only 1946, which is heavily influenced by pitchers making their way back from war.)

Anyone who’s watched a significant amount of recent baseball can probably figure out the reason for this increase: the opener. In 2013, there were three pitchers who averaged less than 2 innings per start, each of whom made exactly one start. In 2018, there were 27 such pitchers making a total of 92 starts (29 by proto-opener Ryne Stanek). In 2023? 58 pitchers, 144 starts, and that’s without a Stanek inflating the numbers (the highest individual total was 13). Nearly 3% of the league’s starts in ’23 were taken by openers, and I don’t anticipate the trend reversing any time soon.

I pondered the handling of the opener for a bit, and decided to make a simple compensation for its presence in order not to over-penalize modern starters for working in an era of higher strategic optimization. Rather than using either the single-season value or a rolling average for scrub fraction, I’m using an approximation for the historical average (11%) across the board. So if the rolling average scrub deviation is 9 points, the scrub pitcher baseline would be estimated to be 8.01 points below league average.
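Putting the baseline equation and that fixed 11% scrub fraction together, a quick sketch looks like the following; the variable names are hypothetical placeholders of mine, and the example reproduces the 8.01-points-below-average figure above.

SCRUB_FRACTION = 0.11  # fixed historical approximation used for every season

def scrub_baseline(league_avg_gs, scrub_deviation, scrub_fraction=SCRUB_FRACTION):
    """Scrub GS2 = Average GS2 - (Scrub Deviation) * (1 - Scrub Fraction)."""
    return league_avg_gs - scrub_deviation * (1 - scrub_fraction)

# With a league average of 50 and a rolling-average scrub deviation of 9,
# the estimated scrub baseline is 41.99, i.e. 8.01 points below average.
print(scrub_baseline(50.0, 9.0))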

We now have the baseline to which we can compare a pitcher’s average adjusted Game Score, but that still leaves the problem of sample size. We’ll borrow from more exacting statistical approaches here. When measuring standard deviations from average, the standard error of the average of a particular sample shrinks as the sample size increases, and does so according to the following relationship:

Standard error of sample mean = (Standard deviation of population) / sqrt(Sample size)

We’ll apply the same inverse square root relationship to the number of starts made by a particular pitcher in calculating “scrub deviations above scrub level.” So if the scrub deviation is 9 points and a pitcher makes 16 starts, we reduce the expected deviation of his average to 9/4 = 2.25. (Really, for arcane technical reasons, since our population – overall starts in the league for the season in question – is finite, the proper denominator is sqrt(Sample size – 1), so the pitcher would have to make 17 starts rather than 16 for the deviation to be cut down to exactly a quarter of its original value. The most obvious effect of this is that for pitchers with only one start, the expected deviation of the mean is infinite, so all of their scores reflect no statistical confidence. Sorry, 1930 Dizzy Dean!)
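In code, that sample-size shrink might look like the sketch below (the function name is mine); with a 9-point scrub deviation, 17 starts brings the expected deviation of the mean down to exactly 2.25.

import math

def expected_deviation_of_mean(scrub_deviation, starts):
    """Expected deviation of a pitcher's average Game Score after a given
    number of starts, using the sqrt(sample size - 1) denominator."""
    if starts <= 1:
        return math.inf  # one start: no statistical confidence at all
    return scrub_deviation / math.sqrt(starts - 1)

print(expected_deviation_of_mean(9.0, 17))  # prints 2.25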

That gives us three components of the metric: a context-adjusted rate performance for the pitcher (average adjusted Game Score), a baseline for comparison (estimated average Game Score for scrubs), and a way of compensating for playing time (divide by the scrub deviation reduced by a factor of the inverse square root of games started). Having worked through all of the components, let’s bring them together to form Game Score Deviations. Here’s the raw formula:

GSDev = (Average GS2 – Scrub GS2) / (Scrub Deviation / sqrt(Starts – 1))

Which is to say: the difference between the pitcher’s average Game Score and the estimated baseline for a scrub pitcher, divided by the deviation between scrubs and regulars, which has in turn been divided by the square root of the pitcher’s total starts (minus one) to account for the confidence level given by the sample size.
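For completeness, here is a sketch that assembles all three pieces into a single function. The function and parameter names are my own, and the fixed 11% scrub fraction is the approximation adopted above; treat this as an illustration of the formula rather than the author’s production code.

import math

def game_score_deviations(avg_gs, league_avg_gs, scrub_deviation, starts,
                          scrub_fraction=0.11):
    """GSDev: how far a pitcher's average adjusted Game Score sits above the
    estimated scrub baseline, in sample-size-shrunk scrub deviations."""
    if starts <= 1:
        return 0.0  # expected deviation is infinite, so no confidence at all
    baseline = league_avg_gs - scrub_deviation * (1 - scrub_fraction)
    expected_dev = scrub_deviation / math.sqrt(starts - 1)
    return (avg_gs - baseline) / expected_dev

# e.g. a pitcher averaging a 55 adjusted Game Score over 33 starts, in a
# league averaging 50 with a 9-point scrub deviation (prints about 8.18)
print(round(game_score_deviations(55.0, 50.0, 9.0, 33), 2))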

So that’s the formula we’ll be testing as an evaluation tool for starting pitchers. It is also an enormous amount of math. Next time, I promise, we'll get to some of the fun stuff: what do the actual results of GSDev look like on a seasonal basis?
