Welcome back. This is the third part in a series on answering the question "Which teams are deserving of a playoff invitation?" In it I'll outline a model I'll refer to as "Extended Standings" and over the course of the series I'll provide the exact details so anyone interested can independently verify the results.
Is A preferred to B?
At the root of it, that’s the only question in life. Whether we’re choosing our toothpaste, car, home, job, mate, secretly favorite child, or soft drink, the only question is "When presented with two options, which is preferred?"
The question of preference has been studied since at least 1927 when L.L. Thurstone published his seminal article on the subject in Psychological Review, "A law of comparative judgment". Thurstone demonstrated how a collection of items could be scaled and ordered (i.e., rated and ranked) based on comparisons between two items at a time. He used examples of gray colors and cylindrical weights, which can be readily checked against quantifiable attributes, but he also demonstrated how the model could be applied to subjective comparisons, such as grading handwriting specimens or children's drawings.
Thurstone never used the term in the article, but the general method is known today as “pairwise” comparison. I'll use a different mathematical model than Thurstone's for the Extended Standings, but for now let’s just look conceptually at how pairwise comparison works.
Imagine we are judging the quality of soft drinks. While it may be difficult to rank numerous brands of soft drinks directly, performing pairwise comparisons would be a feasible approach. We could then use a pairwise comparison model to derive a complete scale and order (again, rating and ranking) between all the items.
Pairwise comparison has many significant uses, and it can be, and indeed has been, applied to sports by framing a season as a series of pairwise comparisons between teams.
Let’s look at Major League Baseball as an example. In 1960 there were 616 American League games played between the league’s eight teams. In each game, two teams were compared and a winner identified. For example, the New York Yankees defeated the Boston Red Sox on April 19th, lost to them on April 20th, defeated them on April 21st, defeated the Baltimore Orioles on April 22nd, etc.
Running the outcomes of these comparisons through a pairwise comparison model produces a rating and ranking for all the teams.
In their purest form, pairwise comparison models treat every outcome as binary, meaning either A or B is preferred with no distinction as to the degree of preference. This lack of distinction gives the impression there was a 100% preference for one option and a corresponding zero percent preference for the other.
However, we can use the aggregation of binary outcomes to discern the degree of preference. To continue with the 1960 American League example, when the Yankees defeated the Red Sox on April 19th, it indicated a 100% preference for the Yankees in that comparison, but on the season the Yankees were 15-7 against the Red Sox, indicating they were actually preferred to the Red Sox by 15 wins / 22 games = 68.2%. Correspondingly the Red Sox were preferred to the Yankees by 7 wins / 22 games = 31.8%.
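That aggregation is simple enough to sketch in a few lines of Python (the win counts come from the 1960 season example above; the function name is my own):

```python
def aggregate_preference(wins_a, wins_b):
    """Derive degrees of preference from a collection of binary outcomes."""
    games = wins_a + wins_b
    return wins_a / games, wins_b / games

# 1960 season series: Yankees won 15, Red Sox won 7
yankees, red_sox = aggregate_preference(15, 7)
print(f"Yankees preferred: {yankees:.1%}")  # prints 68.2%
print(f"Red Sox preferred: {red_sox:.1%}")  # prints 31.8%
```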
In general, this works well enough for dense datasets like the 1960 American League, where each team played every other team in the league 20-plus times over the course of the season.
But what about sparse datasets where multiple pairwise comparisons between items simply don't exist or aren't practical?
Let's return to our soft drink taste test. If we had 12 soft drinks to consider and enough samples (and bladder capacity) to make all 132 possible comparisons (counting each pair in both orders) hundreds of times each, we would be perfectly fine using a binary outcome of either A or B. However, realistically we may only have enough to compare each of the soft drinks to three other soft drinks just once.
Under these constraints, using only a binary outcome obviously makes the resulting ratings and rankings less useful.
However, we can improve upon this by changing the outcome from binary to “stepwise”. In other words, instead of simply stating we preferred one soft drink over another, we instead state to what degree it was preferred, such as “Much Preferred”, “Preferred”, or “Slightly Preferred”.
We might then grade a “Much Preferred” outcome as an 85% preference for the favored soft drink with a corresponding 15% preference for the other. “Preferred” and “Slightly Preferred” might be graded as 70% and 55% preferences for the favored soft drink with corresponding 30% and 45% preferences for the other.
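As a sketch, that stepwise grading could be expressed as a simple lookup. The grade names and the 85/70/55 splits are the illustrative values from this example, not part of any standard:

```python
# Hypothetical stepwise grades and the preference each implies for the
# favored soft drink (illustrative values from the taste test example).
STEPWISE_GRADES = {
    "Much Preferred": 0.85,
    "Preferred": 0.70,
    "Slightly Preferred": 0.55,
}

def split_preference(grade):
    """Return (favored, other) preference shares for a stepwise grade."""
    favored = STEPWISE_GRADES[grade]
    return favored, 1.0 - favored

favored, other = split_preference("Much Preferred")
print(f"{favored:.0%} / {other:.0%}")  # prints 85% / 15%
```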
Hopefully it's clear: greater degrees of distinction provide more precise results.
Even better than stepwise outcomes would be outcomes of a continuous range, from 100% (an absolute preference for one soft drink over the other) to 50% (no preference between the two soft drinks) to zero percent (an absolute non-preference for one soft drink over the other) and anywhere in between. For example, perhaps we could imagine a device that measured your physiological reaction to determine you have a 79.627% preference for the favored soft drink and a 20.373% preference for the other.
Now we've moved our soft drink taste test from binary to stepwise to continuous outcomes. Why? To repeat the statement from above, greater degrees of distinction provide more precise results.
Sparse datasets are the rule in football. The 417 GHSA teams will play roughly 2,100 regular season games out of the 173,472 possible pairings, forcing us to draw inferences about the quality of all teams off of about 1.2% of the total possible combinations.
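The sparsity figures above are easy to verify. The team and game counts come from this paragraph; note that the count treats A-vs-B and B-vs-A as separate pairings:

```python
teams = 417
games = 2100                              # approximate regular season games
ordered_pairings = teams * (teams - 1)    # A-vs-B and B-vs-A counted separately
print(ordered_pairings)                   # prints 173472
print(f"{games / ordered_pairings:.1%}")  # prints 1.2%
```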
Again, binary outcomes (i.e., “Who won?”) are limiting in this case. But, just as in our soft drink taste test, what if we could apply a more precise distinction between the two teams instead of simply naming a winner and a loser? We could use a stepwise outcome where the winning team played “Much Better”, “Better”, or “Slightly Better” than the losing team. This approach would allow us to make the same assumption as in the soft drink taste test and credit the “Slightly Better” team with 55% of the win while crediting the losing team with the corresponding 45%.
And once again, our mantra becomes greater degrees of distinction provide more precise results.
Finally, let’s follow the pattern of our soft drink taste test and take this example to its logical conclusion: a continuous outcome, where we could credit each team with anywhere from 100% of a win down to nothing in a way that reflects how close they played the game.
Clearly greater degrees of distinction provide more precise results, but now the question becomes: how can we measure the "closeness" of a football game between its two participants?
All good sports fans intuitively know the answer already . . .
In the early 1970s, David Rothman developed what I’ll refer to as "Rothman Grading", a method to apportion the win for a game between the two participants using the margin of victory as a measure of the closeness of the game.
Although presented in a different form, Rothman effectively averaged the binary outcome (1 for the winning team, 0 for the losing team, and 0.5 for each team in the case of a tie) with the result of a logistic function using the margin of victory as the input and 8.25 for the steepness of the curve.
Now that last sentence may have glazed over some eyes, but please bear with me.
Interested readers can research the logistic function on their own, but for the rest of us let’s look at how simple this actually is by assuming a team won by 7 points.
In Excel, our standard tool for exercises in this series, type the following formula (which is the logistic function with 8.25 as the steepness) into cell C3:
=1/(1+exp(-A3/8.25))
Notice the formula refers to cell A3, which is currently empty. Now, enter the margin of victory, 7, in cell A3 and cell C3 should change to 0.700249208.
Since 7 is the margin of victory, the team obviously won and so the 0.700249208 is averaged with their binary outcome of 1 to get (1 + 0.700249208) / 2 = 0.850125.
So, using Rothman Grading, we credit the winning team of a 7 point game with 85.0125% of the win.
We can get the losing team’s portion of the win in one of two ways. The first is simply subtracting the winning team’s percentage of the win from 100% to get 14.9875%. However, it's more instructive to enter -7 into cell A3 and see that cell C3 changes to 0.299751. Averaging this with the losing team's binary outcome of 0 gives us (0.299751 + 0) / 2 = 0.149875, or 14.9875%.
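For anyone who would rather check the arithmetic outside of Excel, here is a minimal sketch of the same calculation in Python. The function names are my own; the 8.25 steepness and the averaging with the binary outcome come from the description of Rothman Grading above:

```python
import math

def logistic(margin, steepness=8.25):
    """The logistic function from the Excel formula =1/(1+EXP(-A3/8.25))."""
    return 1 / (1 + math.exp(-margin / steepness))

def rothman_grade(margin):
    """Average the binary outcome with the logistic of the signed margin.

    `margin` is positive if the team in question won, negative if they
    lost, and zero for a tie (binary outcome 0.5 for each team).
    """
    if margin > 0:
        binary = 1.0
    elif margin < 0:
        binary = 0.0
    else:
        binary = 0.5
    return (binary + logistic(margin)) / 2

print(f"Winner of a 7-point game: {rothman_grade(7):.6f}")   # prints 0.850125
print(f"Loser of a 7-point game:  {rothman_grade(-7):.6f}")  # prints 0.149875
```

Note that the two teams' shares always sum to 100% of the win, just as in the Excel walkthrough.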
Now, it’s prudent to recognize using margin of victory to rate teams is often controversial as it is believed to encourage poor sportsmanship between mismatched teams.
But here's the strongest case for including margin of victory – greater degrees of distinction provide more precise results.
In short, using margin of victory makes the model economical, one of the five characteristics of a good playoff invitation model. When we simplify the margin of victory to a binary outcome, we discard fundamental data required to perform meaningful comparisons between teams.
However, margin of victory is indeed subject to abuse. But rather than oversimplify the model by throwing out the baby with the bathwater, let's address this potential abuse by testing Rothman Grading against our equitable principles.
That'll be the subject next time. In the meantime take some time to play around with other margins of victory if you’d like. Rothman Grading is an important component of the Extended Standings and understanding it will help us have productive discussions on the merits of the system.