Soren P. Sorensen
University of Tennessee
Knoxville, TN 37996-1200
sorensorensen@utk.edu
The purpose of this report is to give a detailed, reproducible description of the ranking system I use for NCAA football teams.
To do this, extensive use of mathematics is made, which might make the text more difficult to read, but it ensures that the method is well documented and reproducible by others who might want to use it or derive another ranking method from it. The report is, on the other hand, also more detailed than a ''typical'' scientific paper and discusses details that would be omitted in a scientific paper intended for publication.
We will in this report focus on NCAA 1A football, but the methods described here are very general and can be applied to most other sports with only minor modifications.
In general most ranking systems fall into one of two categories: predictive or earned rankings. The goal of an earned ranking is to rank the teams according to their past performance in the season in order to provide a method for selecting either a champion or a set of teams that should participate in a playoff (or bowl games). The goal of a predictive ranking method, on the other hand, is to provide the best possible prediction of the outcome of a future game between two teams.
In an earned system objective and well publicized criteria should be used to rank the teams, like who won or the score difference or a combination of both. By using well defined criteria for the ranking, teams know exactly what the consequences of a win or a loss will be. This is done in most football conferences to select the conference champion or at least the two teams selected for a championship game. In general an earned ranking system allocates a number (often called the power ranking), which is used to order the teams in a linear sequence.
Most systems found on the WWW are predictive, even many of the BCS systems. In order to make a predictive system as accurate as possible, it is allowed to include any information deemed useful, like the strength of the quarterback, yards earned, number of fumbles etc. In particular it is very common to put more weight on recent games than on older games. This allows a more precise extrapolation to the next week's games. In a more advanced predictive system the teams are not necessarily linearly ordered. One can easily imagine situations where a good predictive system will predict that A beats B, B beats C, but C beats A. This is not possible in a pure earned system.
Unfortunately most WWW ranking systems are a strange mix of the two types of systems described here. The BCS ranking, which determines which teams have earned the right to play in the various bowls, in particular the bowl determining the national championship, should be a pure earned ranking system. But many, if not most, of the BCS systems seem to be predictive. It is even so bad that a particular web site ranks the BCS systems according to how well they predict the next week's games!
The methods described below are all intended as earned systems, and no effort is spent on trying to optimize their predictive capabilities.
Currently the Bowl Championship Series (BCS) system [] is based on four components: 1) two subjective polls, by coaches and by journalists, 2) eight computer programs, 3) a well defined, but primitive, method for calculating the strength of schedule, and 4) the number of losses of each team.
Here we will concern ourselves with the eight computer-based BCS methods for ranking the teams.
Of all these systems only Rothman offers the code to others. However, Massey has a fairly detailed description of his system, so it should be possible to recreate his rankings. Since all the other systems seem to be well guarded secrets, it is not possible for others to check the calculations, or to estimate what effect wins or losses in the coming weeks will have. It is especially problematic in the case of Jeff Sagarin, who is making a living publishing ratings for various sports in USA Today. Since Mr. Sagarin has an economic interest in keeping his system proprietary, it will be very difficult to obtain a detailed description or the actual source code of his system.
This secrecy hurts the computer rankings tremendously, since it creates the impression that they are incomprehensible, unfair, and reward ''running up the score''.
It is also a very uncommon situation. In other sports, where rankings play an important role (chess, golf, tennis, etc.) the ranking method is completely public and can be checked by any interested person.
I will therefore suggest that the current system be replaced with an open system based on publicly accessible code with a detailed mathematical description. Ideally the whole package should be available for download from the BCS WWW site, so coaches, players, journalists and fans can perform the rankings themselves. The criteria the teams are ranked on should be well publicized and should be quantifiable in terms of either simple formulas or tables.
For this reason I have chosen to give a detailed report on the ranking system I use. As will be seen, it is based on the work of others. It does, however, contain a few original, minor ideas.
I will also propose that the NCAA should be responsible for the WWW posting of all NCAA results in a standard ASCII-based format that can be used by everybody ranking sports teams.
Often a ranking between teams is obtained by letting them play in tournaments with the purpose of either establishing a ranking between them or at least defining ''the best team''. This tournament can either be stretched over the whole season or, more commonly, is used in a playoff at the end of the season.
All teams in the tournament play each other and the ranking is determined by the number of points each team accumulates (see the section on accumulative point systems). This is a very ''fair'' system, but requires a large number of games.
The tournament is played in ''rounds''. Only winners are allowed to play in the next round. Eventually only one team is left and is declared the winner. The theme can be varied by introducing a losers bracket, so a team is only eliminated after two losses. This is a very popular system, since it will find an undisputed ''best'' team using the least number of games. However, due to the unavoidable fluctuations in performance of each team from game to game, it happens very often that a better team loses a match to an inferior opponent and is then eliminated. This is not fair from a ranking point of view, but has a lot of appeal from a spectator point of view.
The Monrad system is a very interesting variation of the cup system, which to my knowledge is only used on a regular basis in chess tournaments. In the first round all teams are paired randomly. The winner gets 1 point and the loser zero. In each successive round all teams with the same number of points are paired randomly (except that teams which have played each other earlier cannot be paired if there are other pairing possibilities). This system has the advantage that all teams keep playing, in contrast to the cup system, and as the season (or tournament) advances teams of equal strength will be meeting each other. There are no limitations to the number of rounds that can be played, but eventually teams have to be paired if they have similar, but not necessarily identical, numbers of points. The team with the largest number of points after a predefined set of rounds is the winner.
This system would be ideally suited for NCAA football, if only tradition and logistics would not interfere. Early in the season teams would play 5-6 Monrad games within their conference and the rest of the season they would be paired nationally, but with a pairing preference for teams geographically close. This would result in some very exciting games in November and would create optimally matched bowl games.
Most sports use ranking systems based on the idea that for each match or tournament the team (or player) acquires a certain number of points depending on their performance, and the team's ranking is based on the total number of points they have accumulated during the season. If several teams have the same number of points, additional objective criteria are used, like the winner of the mutual game or the accumulated score difference.
In most soccer leagues the winning team gets 2 (or 3) points and the loser none, and each team gets 1 point for a tie. If all teams play each other (round robin as described above) this provides a very simple and effective method for evaluating the integrated performance of each team. Typical sizes of leagues range from 6 to 24 teams. Usually the best teams from a league will move up to a higher league next season and the lowest ranked teams will be moved to a lower league. The winner of the highest league will be the overall winner.
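The point allocation above can be sketched as a small C function; the win/tie/loss record in the usage note is invented purely for illustration.

```c
/* Accumulated league points for one team under the common soccer
   convention: points_per_win (2 or 3) per win, 1 per tie, 0 per loss. */
int league_points(int wins, int ties, int losses, int points_per_win)
{
    (void)losses;   /* losses contribute nothing to the point total */
    return wins * points_per_win + ties;
}
```

For a hypothetical record of 10 wins, 4 ties and 6 losses this gives 24 points under the 2-point rule and 34 under the 3-point rule.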
Golf and tennis use systems where each player accumulates points based on their placement in each tournament according to published tables. Prestigious tournaments provide more points and small local tournaments of course provide fewer points. Usually the points are accumulated over a sliding time interval of one year.
The advantage of accumulative point systems is their simplicity. It is easy for each team or the spectators to figure out their accumulated score and therefore their ranking. All participants know what they gain or lose by winning or losing a game. The disadvantage is that the point allocation, especially in tournaments for large systems like golf and tennis, becomes somewhat arbitrary and is not based on the actual strength of the participants. It is also sometimes possible to rack up a lot of points by carefully selecting weak tournaments or by playing a large number of tournaments.
The Elo rating system was first used by the International Chess Federation in 1970 to rank chess players. The system was proposed by Arpad E. Elo. It is partly based on earlier work done by Anton Hoesslinger. The official description of the system as it is used in chess can be found at []. Jones has given a nice overview of the system in [].
The basic idea in the system is to continuously change a player's rating R_{p} based on whether she performs better or worse than expected in tournaments or matches. For a new player with a total of N matches, where N ≤ N_{cut} (N_{cut} = 20), the rating R_{p} is calculated as
R_{p} = ⟨R_{c}⟩ + a (N_{w} − N_{l}) / N
where ⟨R_{c}⟩ is the arithmetic average of the competitors' ratings at the time of the match, N_{w} and N_{l} are the number of wins and losses, respectively, and a = 400 is an initial scale factor. This is the basic Ingo system, named after the place of origin, Ingolstadt in Germany, of its inventor Anton Hoesslinger.
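The provisional-rating formula is simple enough to state directly in code; the record in the usage note is an invented example.

```c
/* Provisional (Ingo-style) rating for a player with N <= N_cut games:
   R_p = <R_c> + a * (N_w - N_l) / N, with scale factor a = 400.
   avg_opp is the arithmetic average of the opponents' ratings. */
double provisional_rating(double avg_opp, int n_wins, int n_losses, int n_games)
{
    const double a = 400.0;   /* initial scale factor */
    return avg_opp + a * (double)(n_wins - n_losses) / (double)n_games;
}
```

A player scoring 7 wins and 3 losses in 10 games against opposition averaging 1500 would receive a provisional rating of 1500 + 400·4/10 = 1660.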
Elo's important improvement to this system was to introduce the Win Expectancy Function W_{e}, which is defined as
W_{e} = 1 / (1 + 10^{−ΔR/a})
where
ΔR = R_{p} − R_{c}
For a tournament with M matches played by a player with a rating R_{p}, the new rating R_{p,new} after the tournament becomes
R_{p,new} = R_{p} + K ( S − ∑_{i = 1}^{M} W_{e,i} )
where the score is defined as (N_{t} is the number of tied games)
S = N_{w} + N_{t}/2
and the sum over i runs over each of the games the player played in the tournament. K is in principle a constant, but in reality it is varied slightly depending on the rating of the player.
The Elo system seems especially well suited to sports with a large number of participants (> 10,000), where methods based on linear algebra have problems due to memory limitations in computers. However, the idea of introducing the probability function W_{e} is very powerful and can be used in other ranking systems. In particular Massey has used part of these ideas in his very interesting BCS ranking system.
Select a ranking that minimizes the number of violations. A violation is a game where a team with a lower ranking defeats a team with a higher ranking.
.......
Let us consider a set of teams T consisting of N_{T} teams playing a total of N_{G} games between each other. Depending on the nature of the sport the term ''team'' can refer to either an individual (chess, boxing, singles tennis etc.) or a set of individuals (football, baseball, basketball, doubles tennis etc.). We will only consider games consisting of a set of two teams, but the method outlined in this paper can easily be generalized to consider games consisting of n > 2 teams (common in track and field, swimming etc.).
In game g (g = 1,...,N_{G}) the home team is denoted t_{h} (h = 1,...,N_{T}) and the away team is t_{a} (a = 1,...,N_{T}). In this game the home team obtains a score of S_{h} and the away team a score of S_{a}. The game is played at time T_{g} within a given season y. Results of games are assumed to be available from a total of N_{y} seasons (y = 1,...,N_{y}). The score of the winner is S_{w} and the score of the loser is S_{l}. We will assume for simplicity that the winner of a game is the team with the larger score. The margin of victory or point spread ΔS can be defined in two different ways.
ΔS_{wl} = S_{w} − S_{l}
or
ΔS_{ha} = S_{h} − S_{a}
ΔS_{wl} will always be nonnegative, whereas ΔS_{ha} will be positive if the home team wins and negative if the away team wins. The relation between them is
ΔS_{wl} = |ΔS_{ha}|
The assumptions of the current ranking model are:
In the following the discussion will be based on examples from football, but the formalism is completely general and could be used for any binary game resulting in a final set of scores. It follows from assumption 4 that the current ranking method does not take into account other ''unofficial'' statistics from a game, like half time score, yardage gained or lost, fumbles etc. While these variables might very well be of importance for a prediction algorithm, we consider them irrelevant and unfair to use for a ranking algorithm, whose purpose is to evaluate which team is the best or which set of teams is the best. In this latter case the teams need to know exactly what they are being evaluated on.
Let us now consider in more detail the result R_{g} of game g. As stated in assumption 3, the result R_{g}(S_{h},S_{a}) is a real function of the two scores S_{h} and S_{a}. R_{g} will be considered a measurement of the strength of the two teams. As usual with any physical measurement, the result R_{g} does not provide an exact measurement of the strengths of the two teams, but an uncertainty σ_{g} is associated with each measurement g. There is, unfortunately, not a universally accepted result function. Some commonly used functions are discussed below.
It only matters which team wins (= which team has the higher score), but the actual scores do not matter.
R_{g}^{WL} = sign(S_{h} − S_{a})
This system is often referred to as the JWB system (Just Win Baby). Effectively this is how many fans view the game outcome, since the only thing that matters is whether you win or not. Not how you win or by how much.
The result of the game is defined as the difference between the scores of the two teams:
R_{g}^{SD} = S_{h} − S_{a} = ΔS_{ha}
In football this system is often referred to as the BOMB Index (Bowden-Osborne Memorial Blowout Index), since it is perceived that Florida State and Nebraska used to run up the score against weak opponents, which helps their ranking in this system.
A modification of the SD system, where the score difference is truncated at some value ΔS_{max} in order to avoid too heavy an emphasis on games where one team ''runs up'' the score:
R_{g}^{SDt} = sign(ΔS_{ha}) min(|ΔS_{ha}|, ΔS_{max})
In football typical values of the maximum point spread ΔS_{max} range from 21-35. ΔS_{t} is called the truncated point spread. Please note that the game outcome now also depends on the game-independent parameter ΔS_{max}.
A mathematically more elegant way of providing this cutoff is to use the hyperbolic tangent function

with

and

Since both the pure WL and SD system tend to favor a particular, but maybe extreme, view of the game outcome, other systems have introduced a simple linear combination of the two systems:
R_{g}^{WD} = B_{w} R_{g}^{WL} + R_{g}^{SD}
In football typical values for the ''bonus'' B_{w} for a win are 50-100. For B_{w} = 0 the system reduces to the SD system, and for B_{w} >> ΔS it is identical to the WL system.
Instead of forming the difference between the scores, as in the SD system, the ratio between the scores defines the game outcome.
R_{g}^{SR} = (S_{h} − S_{a}) / (S_{h} + S_{a})
In this case −1 ≤ R_{g}^{SR} ≤ 1, with |R_{g}^{SR}| = 1 if the losing team does not score any points (a shutout). The rare case of S_{h} = S_{a} = 0 is not defined in this system, and it can therefore not be used in sports where this result is possible.
All the above mentioned methods for evaluating the outcome of a game have their virtues. It is therefore natural to form an outcome function that combines all of them. The simplest way of doing this is by forming a linear combination of the three types of outcome:
R_{g}^{WDR} = B_{w} R_{g}^{WL} + R_{g}^{SDt} + B_{r} R_{g}^{SR}
There are three parameters in this approach: 1) the win bonus B_{w}, 2) the maximum point spread ΔS_{max}, and 3) the scoring ratio weight factor B_{r}. They are, however, not independent, since a scaling of all of them will lead to the same ranking. Since the sum of the three parameters is equal to the maximum value of the game outcome function R_{g}^{max}, it is convenient to constrain the three parameters by choosing R_{g}^{max} to have a convenient value like 100.
B_{w} + ΔS_{max} + B_{r} = R_{g}^{max} = 100
In this paper we will choose the following values

A subsequent paper will discuss techniques for choosing optimal values for (B_{w}, ΔS_{max}, B_{r}).
The outcome prediction function P_{g}(r_{h}, r_{a}; a) estimates the outcome of a game g between the two teams t_{h} and t_{a} based on their strength ratings r_{h} and r_{a}, respectively. In addition P can depend on a number of additional parameters, depending on the choice of game outcome function R_{g}. In the case where the WDR game outcome function R_{g}^{WDR} is used, the parameter vector is a = (B_{w}, ΔS_{max}, B_{r}). In addition other parameters, like the home field advantage B_{h}, can be added to the parameter set, if needed. Actual choices of P will be discussed later in this paper, but first we will outline the method for determining the strength ratings r_{i}, independent of the functional expression of P.
During a season with N_{g} games, a total of N_{g} measurements of the N_{t} strength ratings r_{i} are performed, and we want to determine the vector r so that the difference between the game result R_{g} and the prediction (or hypothesis) P_{g} is as small as possible for as many games as possible. This can be done in a number of ways. We choose to minimize the sum of the squares of the differences between the game results and the predictions. This is the so-called Least Squares Method.
minimize: χ²(r) = ∑_{g = 1}^{N_{g}} [ (R_{g}(a) − P_{g}(r; a)) / σ_{g} ]²
The measurement error σ_{g} for each game will be discussed later.
Other methods are based on the maximum norm ......
If we furthermore assume that the outcome prediction function depends linearly on the strength ratings r_{i}, we can use the general framework for linear least squares fits. This is a very strong assumption, and a later paper will discuss nonlinear approaches. We will furthermore assume that P_{g} does not depend on a, but only depends on the relative difference between the strength ratings of the two teams participating in the game
P_{g} = r_{h} − r_{a}
This reduces the problem to the minimization of
χ²(r) = ∑_{g = 1}^{N_{g}} [ (R_{g} − (r_{h} − r_{a})) / σ_{g} ]²
Following the standard χ² nomenclature we introduce the design matrix A with elements
A_{g,i} = (δ_{i,h(g)} − δ_{i,a(g)}) / σ_{g}
the weighted result vector b with elements
b_{g} = R_{g} / σ_{g}
and the strength parameter vector r. In matrix notation the problem can now be written as
minimize: χ²(r) = ||A r − b||²
where the || ||² symbol indicates the squared Euclidean norm in the vector space spanned by all the games.
Using the Singular Value Decomposition (SVD) algorithm, the vector r that minimizes χ² is
r = V diag(v_{j}) U^{T} b
where U and V are the SVD matrices, w_{j} are the singular values of A, and v is the singular value vector, whose elements are defined as
v_{j} = 1/w_{j} if w_{j} ≠ 0
or
v_{j} = 0 if w_{j} = 0
We obtain the U and V matrices and the v vector using the routines described in Press et al. [1992].
The advantage of using the linear χ² method is speed, since it only takes a few seconds on a normal PC to solve for the rankings. The SVD algorithm is the standard tool for solving linear χ² problems due to its robustness. For further details see Press et al. [1992].
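A tiny worked example may make the least squares step concrete. For brevity this sketch solves the normal equations with a sum-to-zero constraint by Gaussian elimination rather than SVD; the three-team schedule and point spreads are invented, and unit weights (σ_{g} = 1) are used throughout.

```c
#define NT 3   /* number of teams in the toy example */
#define NG 3   /* number of games */

/* Solve the 3x3 linear system M x = y by Gaussian elimination
   (no pivoting; adequate for this well-conditioned example). */
static void solve3(double m[NT][NT], double y[NT], double x[NT])
{
    for (int k = 0; k < NT; k++)
        for (int i = k + 1; i < NT; i++) {
            double f = m[i][k] / m[k][k];
            for (int j = k; j < NT; j++) m[i][j] -= f * m[k][j];
            y[i] -= f * y[k];
        }
    for (int i = NT - 1; i >= 0; i--) {
        double s = y[i];
        for (int j = i + 1; j < NT; j++) s -= m[i][j] * x[j];
        x[i] = s / m[i][i];
    }
}

/* Least-squares ratings for a toy schedule.  Each game contributes a
   design-matrix row with +1 in the home column and -1 in the away column.
   The normal equations A^T A r = A^T b are singular (adding a constant to
   all ratings changes nothing), so the all-ones matrix is added, which
   pins the solution to sum to zero without disturbing it otherwise. */
void rate_teams(const int home[], const int away[], const double result[],
                double r[NT])
{
    double ata[NT][NT] = {{0}}, atb[NT] = {0};
    for (int g = 0; g < NG; g++) {
        int h = home[g], a = away[g];
        ata[h][h] += 1.0; ata[a][a] += 1.0;
        ata[h][a] -= 1.0; ata[a][h] -= 1.0;
        atb[h] += result[g]; atb[a] -= result[g];
    }
    for (int i = 0; i < NT; i++)      /* sum-to-zero constraint */
        for (int j = 0; j < NT; j++)
            ata[i][j] += 1.0;
    solve3(ata, atb, r);
}
```

With games A beats B by 10, B beats C by 3 and A beats C by 7 (an inconsistent triangle, since 10 + 3 ≠ 7), the least squares compromise is r = (17/3, −7/3, −10/3).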
The game weight factors, defined as
w_{g} = 1/σ_{g}
provide the option to let various games influence the rankings differently. It is very common in other ranking models to make the weights time-dependent

where T is the time at which the ranking is performed. This will put more emphasis on games played recently. This method is, however, more appropriate for prediction models than for ranking models. The current ranking model does not incorporate any time-dependence in the weights, with the exception of the initial period, as explained later.
If two teams of very similar strength are playing each other, we consider the outcome of this game a ''good'' measurement of the relative strengths of the two teams. In contrast we consider mismatched games between opponents with very different strengths as ''bad'' measurements. In mathematical terms this implies that we will associate a larger weight w_{g} (or equivalently a smaller uncertainty σ_{g}) to games between teams where the difference between the ratings r_{h} and r_{a} is small. This can, however, only be done in an iterative procedure, since the ratings r_{h} and r_{a} are not known initially. We are therefore using the following method:
Initial values of the rating vector r^{(1)} are determined by assuming w_{g}^{(1)} = 1. Afterwards the χ² system is solved again, but this time with the following weights

where a_{w} is determined from the condition

In other words, we consider a game between the best and the worst team to have a measurement error b_{w} times larger than when two completely equal teams play. b_{w} is a free parameter in this ranking model. The numerical examples shown later have used the value b_{w} = 3.
Early in the season, when only few games have been played, the design matrix A has a rank smaller than N_{t}. This has the effect that the relative ranking of many subsets of teams is undetermined. For NCAA 1A division football teams, where N_{t} = 112, the number of independent subsets of teams as a function of the number of weeks played is shown in the table below (based on the 1996-98 seasons):
        Number of independent sets
  week     1996     1997     1998
    1       109      104      104
    2        78       78       65
    3        36       36       18
    4         3        2        1
    5         1        1        1
Only after 5 playing weeks will it be possible to obtain a solution where all teams are ''connected''. For NCAA 1A football the onset of full connectedness seems empirically to coincide with the condition that the average number of games per team is 2.5-3.0, or a total of 139 to 170 games between the 112 teams (only games where both teams are in division 1A count).
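Counting the independent subsets amounts to counting the connected components of a graph in which teams are nodes and games are edges. A standard union-find sketch (the schedule in the usage note is invented):

```c
/* Count independent ("disconnected") subsets of teams via union-find
   with path compression; the result is the number of connected
   components of the schedule graph. */
#define MAX_TEAMS 112

static int parent[MAX_TEAMS];

static int find_root(int i)
{
    while (parent[i] != i) {
        parent[i] = parent[parent[i]];   /* path compression */
        i = parent[i];
    }
    return i;
}

int count_components(int n_teams, const int home[], const int away[],
                     int n_games)
{
    for (int i = 0; i < n_teams; i++) parent[i] = i;
    for (int g = 0; g < n_games; g++) {
        int a = find_root(home[g]), b = find_root(away[g]);
        if (a != b) parent[a] = b;       /* merge the two subsets */
    }
    int count = 0;
    for (int i = 0; i < n_teams; i++)
        if (find_root(i) == i) count++;  /* one root per component */
    return count;
}
```

For five teams with games (0,1), (1,2) and (3,4), the teams split into two independent subsets, {0,1,2} and {3,4}, so no relative ranking between the subsets is determined.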
It is, however, often required to obtain a ranking early in the season, before full connectedness has been obtained. Since we require that only the final results of games can be used in the ranking, this can only be achieved by using game results from the previous season(s). So early in the season, results from the previous season are also included in the ranking, but with a decreasing weight factor. The total weight factor on each game is
W_{g} = w_{g} w_{s}
where w_{g} is defined above and
w_{s} = 1 for games in the current season
w_{s} = 1 − ⟨N_{gt}⟩/R_{cut} for games in previous seasons, if ⟨N_{gt}⟩ < R_{cut}
w_{s} = 0 for games in previous seasons, if ⟨N_{gt}⟩ ≥ R_{cut}
where ⟨N_{gt}⟩ is the average number of games played by each team in the current season and R_{cut} = 4.5 represents a cutoff value for the inclusion of previous seasons' games.
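The season weight can be sketched as below. This is one plausible reading of the piecewise definition (the exact form did not survive in this copy): current-season games get full weight, while previous-season games ramp linearly down to zero as ⟨N_{gt}⟩ approaches R_{cut}.

```c
/* Season weight factor w_s, ASSUMED linear ramp: full weight for
   current-season games; previous-season games fade out as the average
   number of games played this season, avg_games, approaches r_cut. */
double season_weight(int is_current_season, double avg_games, double r_cut)
{
    if (is_current_season)   return 1.0;
    if (avg_games >= r_cut)  return 0.0;
    return 1.0 - avg_games / r_cut;
}
```

With R_{cut} = 4.5, previous-season games carry half weight when each team has played 2.25 games on average, and no weight at all once the season is fully connected.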
Currently it is cumbersome to obtain NCAA results, especially for the lower divisions. I would therefore like to propose that the NCAA post results of all NCAA games (football, basketball, tennis, etc.) in a standard ASCII-based format that can be used directly as input to ranking programs.
For each sport two files should be created for each season: a team file and a game file.
The Team file should contain the following information on each team:
The format of the file should be (generic C printf statement):
fprintf( TeamFile, " %5d %25s %35s %25s\n",
         Index, Name, Conference, Division );
The Game file should contain the following information on each game:
The format of the file should be (generic C printf statement):
fprintf( GameFile,
         " %4d %2d %2d %25s %3d - %25s %3d %5d %5d %1d\n",
         Year, Month, Day,
         AwayTeamName, AwayTeamScore,
         HomeTeamName, HomeTeamScore,
         AwayTeamIndex, HomeTeamIndex, FieldCode );
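Wrapping the fprintf specification above in a small helper makes the fixed-width layout easy to test; the team names and scores in the usage note are invented placeholders, not real results.

```c
#include <stdio.h>

/* Format one game record in the proposed fixed-width ASCII layout.
   Mirrors the fprintf specification above; returns the record length
   as reported by snprintf. */
int format_game(char *buf, size_t len,
                int year, int month, int day,
                const char *away, int away_score,
                const char *home, int home_score,
                int away_idx, int home_idx, int field_code)
{
    return snprintf(buf, len,
                    " %4d %2d %2d %25s %3d - %25s %3d %5d %5d %1d\n",
                    year, month, day, away, away_score,
                    home, home_score, away_idx, home_idx, field_code);
}
```

Because every field has a fixed width, each record is exactly 88 characters (including the newline) as long as no field overflows its width, which makes the file trivially parseable by column position.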
==============================================
Created Sun Nov 07 11:13:54 1999
by Soren Sorensen
==============================================