Message #28771 at http://games.groups.yahoo.com/group/crossword-games-pro/messages/28771
From: Steven Alexander 
Date: Thu Dec 12, 2002  10:52 am
Subject: Ratings systems 

The ratings system's drawbacks will be overcome, whether within the current 
framework or by creating a wholly new one, only with serious statistical study.  
While what's been posted here recently (which I'll read more thoroughly than I 
have already[1]) is valuable, anyone who really wants to think about ratings should 
read up on some very good published work.

Before the constructive work noted below, please absorb the attached excerpt from 
Chapter 5 of John D. Beasley's 1989 book "The Mathematics of Games" (Oxford Univ 
Pr, ISBN 0-19-286107-7), entitled "If A Beats B, and B Beats C ..."  (Actually, 
the section of Ch. 5 about non-transitivity is the only one not reproduced.)

Though I am open to other possibilities, I currently consider the strongest ones 
(1) a system pairing (rating; uncertainty), where the rating component is similar 
to current ratings (perhaps enough to use the current scale) and the uncertainty 
measures how likely the first component is to be close to the "true" rating; and 
(2) a score-based system, in the form (score, defense) [comma-separated because 
these two numbers are in the same units: points].

The first kind has been developed for the US Chess Federation, the "Glicko" 
system, and for table tennis, the Marcus system. While I am collecting references 
of essential reading for Rating Committee members and other arguers, for now 
starting from www.glicko.com and www.davidmarcus.com should lead to extensive 
details on these two.

The second kind is, of course, examined in Robert Parker's writings.  (I'll 
assemble these with other available links and some copies of papers for all 
concerned.)

While I very much like the Parker-type system, its adoption or other changes will 
depend on evidence -- of how good the new system would be, not how bad the old 
one was, if I have anything to say about it.  This will involve running 
historical data (win-loss records for modifications of the current system; as Joe 
Edley noted, score data not yet collected will be necessary for the Parker type) to 
test at least how predictive a system would have been had it been in effect 
before the predictions were made.  Also to be evaluated are deserved stability of 
players' ratings and the degree of any undesired incentives; and with a Parker- 
type system, how much is gained by adding factors other than offense and defense 
(that arise not from the inherent meaning of the measures, but from imperfect 
match of the otherwise elegant system with reality).

Enjoy reading.

Steven Alexander
     NSA Ratings Committee member
   


[1] Those who just criticize, many assuming that their desires
     are both consistent among themselves and consistent with
     others' priorities, might benefit most by reading and
     learning.  Those publishing data and experiments here
     already are thinking concretely about the problem.

   ----------

The Mathematics of Games

John D. Beasley
Oxford Univ Pr 1989
ISBN 0-19-286107-7

Chapter 5   If A Beats B, and B Beats C ...
   [all but the last section of Ch. 5; pages 47-61]


In the previous chapter, we looked at some of the pseudo-random effects which 
appear to affect the results of games.  We now attempt to measure the actual 
skill of performers.  There is no difficulty in finding apparently suitable 
mathematical formulae; textbooks are full of them.  Our primary aim here is to 
examine the circumstances in which a particular formula may be valid, and to note 
any difficulties which may attend its use.

The assessment of a single player in isolation
----------------------------------------------

We start by considering games such as golf, in which each player records an 
independent score.  In practice, of course, few competitive games are completely 
free from interactions between the players; a golfer believing himself to be two 
strokes behind the tournament leader may take risks that he would not take if he 
believed himself to be two strokes ahead of the field.  But for present purposes, 
we assume that any such interactions can be ignored.  We also ignore any effects 
that external circumstances may have on our data.  In Chapter 4, we were able to 
adjust our scores to allow for the general conditions pertaining to each round, 
because the pooling of the scores of all the players allowed the effect of these 
conditions to be assessed with reasonable confidence.  A sequence of scores from 
one player alone does not allow such assessments to be made, and we have little 
alternative but to accept the scores at face value.

To fix our ideas, let us suppose that a player has returned four separate scores, 
say 73, 71, 70, and 68 (Figure 5.1).  If these scores were recorded at 
approximately the same time, we might conclude that a reasonable estimate of his 
skill is given by the unweighted mean 70.5 (U in Figure 5.1).  This is 
effectively the basis on which tournament results are calculated.  On the other 
hand, if the scores were returned over a long period, we might prefer to give 
greater weight to the more recent of them.  For example, if we assign weights 
1:2:3:4 in order, we obtain a weighted mean of 69.7 (W in Figure 5.1).  More 
sophisticated weighting, taking account of the actual dates of the scores, is 
also possible.

             Figure 5.1  Weighted and unweighted means
             [the scores 73, 71, 70, and 68 plotted in order of play;
              U marks the unweighted mean (70.5), W the weighted mean (69.7)]
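The two estimates in Figure 5.1 can be reproduced in a few lines.  This sketch 
(Python; an illustration added to the text, not from the book) simply applies 
the unweighted mean and the weights 1:2:3:4 described above:

```python
# Reproducing the two estimates of Figure 5.1.
scores = [73, 71, 70, 68]          # oldest score first

# Unweighted mean U: every score counts equally.
u = sum(scores) / len(scores)      # (73 + 71 + 70 + 68) / 4 = 70.5

# Weighted mean W with weights 1:2:3:4, most recent weighted heaviest.
weights = [1, 2, 3, 4]
w = sum(wt * s for wt, s in zip(weights, scores)) / sum(weights)
# (73 + 2*71 + 3*70 + 4*68) / 10 = 697 / 10 = 69.7

print(u, w)   # 70.5 69.7
```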

So, we see, right from the start, that our primary need is not a knowledge of 
abstruse formulae, but a commonsense understanding of the circumstances in which 
the data have been generated.

Now let us assume that we already have an estimate, and that the player returns 
an additional score.  Specifically, let us suppose that our estimate has been 
based on n scores s_1, ..., s_n, and that the player has now returned an 
additional score s_{n+1}.  If we are using an unweighted mean based on the n most 
recent scores, we must now replace our previous estimate

         (s_1+...+s_n)/n

by a new estimate

         (s_2+...+s_{n+1})/n;

the contribution of s_1 vanishes, the contributions from s_2,...,s_n remain 
unchanged, and a new contribution appears from s_{n+1}.  In other words, the 
contribution of a particular score to an unweighted mean remains constant until n 
more scores have been recorded, and then suddenly vanishes.  On the other hand, 
if we use a weighted mean with weights 1:2:...:n, the effect of a new score 
s_{n+1} is to replace the previous estimate

         2(s_1 + 2s_2 + ... + n s_n)/n(n+1)

by a new estimate

         2(s_2 + 2s_3 + ... + n s_{n+1})/n(n+1);

not only does the contribution from s_1 vanish, but the contributions from
s_2,...,s_n are all decreased.  This seems rather more satisfactory.

Nevertheless, anomalies may still arise.  Let us go back to the scores in Figure 
5.1, which yielded a mean of 69.7 using weights 1:2:3:4 , and let us suppose that 
an additional score of 70 is recorded.  If we form a new estimate by discarding 
the earliest score and applying the same weights 1:2:3:4 to the remainder, we 
obtain 69.5, which is less than either the previous estimate or the additional 
score.  So we check our arithmetic, suspecting a mistake, but we find the value 
indeed to be correct.  Such an anomaly is always possible when the mean of the 
previous scores differs from the mean of the contributions discarded.  It is 
rarely large, but it may be disconcerting to the inexperienced.
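The anomaly is easy to reproduce.  In this sketch (Python; an illustration, not 
from the book), dropping the earliest score and reapplying the weights 1:2:3:4 
yields 69.5, below both the previous estimate (69.7) and the additional score (70):

```python
# The anomaly described in the text: after an additional score of 70,
# the weighted-mean estimate falls below both the previous estimate
# and the new score itself.
def weighted_mean(scores, weights=(1, 2, 3, 4)):
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

old = weighted_mean([73, 71, 70, 68])   # the Figure 5.1 scores: 69.7
new = weighted_mean([71, 70, 68, 70])   # drop 73, append the new 70: 69.5
print(old, new)                         # 69.7 69.5
```

The arithmetic is indeed correct: the discarded contributions (73 and the old 
weighting of 71) had a mean above the previous estimate, which is exactly the 
condition under which the anomaly can occur.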

If we are to avoid anomalies of this kind, we must ensure that the updated 
estimate always lies between the previous estimate and the additional score. This 
is easily done; if E_n is the estimate after n scores s_1,...,s_n, all we need is 
to ensure that

         E_{n+1} = w_n E_n + (1-w_n) s_{n+1}

where w_n is some number satisfying 0 < w_n < 1.  More ambitious procedures 
attempt to estimate trends; we might, for example, fit a straight line to the 
scores and extrapolate it to predict the next score.  But such a procedure can 
respond perversely to its data: an increase in one of the earlier scores 
flattens the fitted line, and the extrapolated estimate may actually fall even 
though a score has improved as a result.

This contravenes common sense, and suggests that we should confine our attention 
to estimates which respond conformably to all constituent scores: a decrease in 
any score should decrease the estimate, and an increase in any score should 
increase it.  But it turns out that such an estimate cannot lie outside the 
bounds of the constituent scores, and this greatly reduces the scope for 
estimation of trends.  The proof is simple and elegant.  Let S be the largest of 
the constituent scores.  If each score actually equals S, the estimate must equal 
S also.  If any score s does not equal S and the estimating procedure is 
conformable, the replacement of S by s must cause a reduction in the estimate.  
So a conformable estimate cannot exceed 
the largest of the constituent scores; and similarly, it cannot be less than the 
smallest of them.\fn{1}

\fn{1} It follows that economic estimates which attempt to project current trends 
are in general not conformable; and while this is unlikely to be the 
whole reason for their apparent unreliability, it is not an encouraging thought.

In practice, therefore, we have little choice.  Given that common sense demands 
conformable behaviour, we cannot use an estimating procedure which predicts a 
future score outside the bounds of previous scores; we can merely give the 
greatest weight to the most recent of them.  If this is unwelcome news to 
improving youngsters, it is likely to gratify old stagers who do not like being 
reminded too forcibly of their declining prowess.  In fact, the case which most 
commonly causes difficulty is that of the player who has recently entered 
top-class competition and whose first season's performance is appreciably below 
the standard which he subsequently establishes; and the best way to handle this 
case is not to use a clever formula to estimate the improvement, but to ignore 
the first year's results when calculating subsequent estimates.
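A conformable update of the form E_{n+1} = w E_n + (1-w) s_{n+1} can be 
sketched as follows (Python; the starting estimate and the weight w = 0.75 are 
illustrative choices, not from the book).  The assertion inside the loop checks 
the property derived above: with 0 < w < 1 the new estimate always lies between 
the previous estimate and the additional score.

```python
def update(estimate, score, w=0.75):
    # Conformable update E_{n+1} = w*E_n + (1-w)*s_{n+1}: with 0 < w < 1
    # the result is a convex combination, so it always lies between the
    # previous estimate and the new score.
    assert 0 < w < 1
    return w * estimate + (1 - w) * score

e = 70.5                                     # starting estimate (illustrative)
for s in [73, 71, 70, 68, 70]:
    new_e = update(e, s)
    assert min(e, s) <= new_e <= max(e, s)   # never outside the bounds
    e = new_e
print(round(e, 2))                           # 70.09
```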

Interactive games
-----------------

We now turn to games in which the result is recorded only as a win for a 
particular player, or perhaps as a draw.  These games present a much more 
difficult problem.  The procedure usually adopted is to assume that the 
performance of a player can be represented by a single number, called his 
grade or rating, and to calculate this grade so as to reflect 
his actual results.  For anything other than a trivial game, the assumption is a 
gross over-simplification, so anomalies are almost inevitable and controversy 
must be expected.  In the case of chess, which is the game for which grading has 
been most widely adopted, a certain amount of controversy has indeed arisen; some 
players and commentators appear to regard grades with excessive reverence, most 
assume them to be tolerable approximations to the truth, a few question the 
detailed basis of the calculations, and a few regard them as a complete waste of 
ink.  The resolution of such controversy is beyond the scope of this book, but at 
least we can illuminate the issues.

The basic computational procedure is to assume that the mean expected result of a 
game between two players is given by an 'expectation function' which depends only 
on their grades a and b, and then to calculate these grades so as to reflect the 
actual results.  It might seem that the accuracy of the expectation function is 
crucial, but we shall see in due course that it is actually among the least of 
our worries; provided that the function is reasonably sensible, the errors 
introduced by its inaccuracy are likely to be small compared with those resulting 
from other sources.  In particular, if the game offers no advantage to either 
player, it may be sufficient to calculate the grading difference d=a-b and to use 
a simple smooth function f(d) such as that shown in Figure 5.3.  For a game such 
as chess, the function should be offset to allow for the first player's 
advantage, but this is a detail easily accommodated.\fn{2}

\fn{2} Figure 5.3 adopts the chess player's scaling of results: 1 for a win, 0 
for a loss, and 0.5 for a draw.  The scaling of the d-axis is arbitrary.


             Figure 5.3  A typical expectation function
             [an S-shaped curve rising from near 0 at d = -100,
              through (0, 0.5), to near 1.0 at d = +100]

Once the function f(d) has been chosen, the calculation of grades is 
straightforward.  Suppose for a moment that two players already have grades which 
differ by d, and that they now play another game, the player with the higher 
grade winning.  Before the game, we assessed his expectation as f(d); after the 
game, we might reasonably assess it as a weighted mean of the previous 
expectation and the new result.  Since a win has value 1, this suggests that his 
new expectation should be given by a formula such as

         w + (1-w)f(d)

where w is a weighting factor, and this is equivalent to

         f(d) + w(1-f(d)).

More generally, if the stronger player achieves a result of value r, the same 
argument suggests that his new expectation should be given by the formula

         f(d) + w(r-f(d)).

Now if the expectation function is scaled as in Figure 5.3 and the grading 
difference is small, we see that a change of \delta in d produces a change of 
approximately \delta/100 in f(d).  It follows that the required change in 
expectation can be obtained, approximately, by increasing the grading difference by 
100w(r-f(d)).  As the grading difference becomes larger, the curve flattens, and 
a given change in the grading difference produces a smaller change in the 
expectation.  In principle, this can be accomplished by increasing the scaling 
factor 100, but it is probably better to keep this factor constant, since always 
to make the same change in the expectation may demand excessive changes in the 
grades.  The worst case occurs when a player unexpectedly fails to beat a much 
weaker opponent; the change in grading difference needed to reduce an expectation 
of 0.99 to 0.985 may be great indeed.  To look at the matter another way, keeping 
the scaling factor constant amounts to giving reduced weight to games between 
opponents of widely differing ability, which is plainly reasonable since the ease 
with which a player beats a much weaker opponent does not necessarily say a great 
deal about his ability against his approximate peers.
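The updating procedure can be sketched as follows (Python).  The logistic 
function used for f and the equal split of the change between the two grades 
are assumptions for illustration only: the text requires merely a reasonably 
sensible smooth f, and notes later that the division of an update between the 
players is an administrative choice.

```python
import math

def f(d):
    # Assumed logistic expectation function, scaled (like Figure 5.3) so
    # that near d = 0 a change delta in d changes f by about delta/100.
    return 1 / (1 + math.exp(-d / 25))

def play(a, b, r, w=0.1):
    # After the player graded a scores r (1 win, 0.5 draw, 0 loss) against
    # the player graded b, the grading difference is increased by
    # 100*w*(r - f(d)); here the change is split equally between the grades.
    delta = 100 * w * (r - f(a - b))
    return a + delta / 2, b - delta / 2

a, b = 60.0, 40.0
a, b = play(a, b, 1)              # the stronger player wins
print(round(a, 2), round(b, 2))   # 61.55 38.45
```

Note that the constant scaling factor 100 gives games between widely separated 
players little effect on the grades, as the text argues is reasonable.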

A simple modification of this procedure can be used to assign a grade to a 
previously ungraded player.  Once he has played a reasonable number of games, he 
can be assigned that grade which would be left unchanged if adjusted according to 
his actual results.  The same technique can also be used if it is desired to ignore 
ancient history and grade a player only on the basis of recent games.
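The grade "left unchanged by the results" can be found numerically.  This 
sketch (Python) bisects for it, using an assumed logistic expectation function 
and made-up results; neither is from the book.

```python
import math

def f(d):
    # an assumed logistic expectation function, slope ~1/100 near d = 0
    return 1 / (1 + math.exp(-d / 25))

def performance_grade(results, lo=-400.0, hi=400.0):
    # Bisect for the grade g that the results would leave unchanged, i.e.
    # the g at which sum over games of (r_i - f(g - b_i)) is zero.
    def excess(g):
        return sum(r - f(g - b) for b, r in results)
    for _ in range(60):
        mid = (lo + hi) / 2
        if excess(mid) > 0:        # scoring better than expected: g too low
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# hypothetical newcomer: (opponent grade, result) pairs, two wins and a loss
games = [(50, 1), (60, 1), (70, 0)]
print(round(performance_grade(games), 1))
```

Bisection works here because the excess score is a decreasing function of the 
trial grade g: the higher the grade, the more it is expected to score.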

Grades calculated on this basis can be expected to provide at least a rough 
overall measure of each regular player's performance.  However, certain practical 
matters must be decided by the grading administrator, and these may have a 
perceptible effect on the figures.  Examples are the interval at which grades are 
updated, the value of the weighting parameter w, the relative division of an 
update between grades of the players (in particular, when one player is well 
established whereas the other is a relative newcomer), the criteria by which less 
than fully competitive games are excluded, and the circumstances in which a 
player's grade is recalculated to take account only of his most recent games.  
Grades are therefore not quite the objective measures that their more uncritical 
admirers like to maintain.

Grades as measures of ability
-----------------------------

Although grading practitioners usually stress that their grades are merely 
measures of performance, players are interested in them primarily as 
measures of ability.  A grading system defines an expectation between 
every pair of graded players, and the grades are of interest only in so far as 
these expectations correspond to reality.

A little thought suggests that this correspondence is unlikely to be exact.  If 
two players A and B have the same grade, their expectations against any third 
player C are asserted to be exactly equal.  Alternatively, suppose that A, B, Y, 
and Z have grades such that A's expectation against B is asserted to equal Y's 
against Z, and that expectations are calculated using a function which depends 
only on the grading difference.  If these grades are a, b, y, and z, then they 
must satisfy a - b = y - z, from which it follows that a - y = b - z, and hence A's 
expectation against Y is asserted to equal B's against Z.  Assertions as precise 
as this are unlikely to be true for other than very simple games, and it follows 
that grades cannot be expected to yield exact expectations; the most for which we 
can hope is that they form a reasonable average measure whose deficiencies are 
small compared with the effects of chance fluctuation.

These chance effects can easily be estimated.  If A's expectation against B is p 
and there is a probability h that they draw, the standard deviation s of a single 
result is \sqrt{p(1-p) - h/4}.  If they now play a sufficiently long series of 
n games, the distribution of the discrepancy between mean result and expectation 
can be taken as a normal distribution with standard deviation s/\sqrt n, and a 
simple rule of thumb gives the approximate probability that any particular 
discrepancy would have arisen by chance: a discrepancy exceeding the standard 
deviation can be expected on about one trial in three, and a discrepancy 
exceeding twice the standard deviation on about one trial in twenty.  What 
constitutes a sufficiently large value of n depends on the expectation p.  If p 
lies between 0.4 and 0.6, n should be at least 10; if p is smaller than 0.4 or 
greater than 0.6, n should be at least 4/p or 4/(1-p) respectively.  More 
detailed calculations, taking into account the incidence of each specific 
combination of results, are obviously possible, but they are unlikely to be 
worthwhile.
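The rule of thumb is easy to put into code (Python sketch; the values of p, h, 
and n in the example are illustrative):

```python
import math

def single_result_sd(p, h):
    # s = sqrt(p(1-p) - h/4): sd of one result when the expectation is p
    # and the probability of a draw is h
    return math.sqrt(p * (1 - p) - h / 4)

def mean_result_sd(p, h, n):
    # sd of the mean of n such results: s / sqrt(n)
    return single_result_sd(p, h) / math.sqrt(n)

# evenly matched players (p = 0.5) who draw half their games (h = 0.5):
s = single_result_sd(0.5, 0.5)
print(round(s, 3), round(mean_result_sd(0.5, 0.5, 16), 3))   # 0.354 0.088
```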

A practicable testing procedure now suggests itself.  Every time a new set of 
grades is calculated, the results used to calculate the new grades can be used 
also to test the old ones.  If two particular opponents play each other 
sufficiently often, their results provide a particularly convenient test; 
otherwise, results must be grouped, though this must be done with care since the 
grouping of inhomogeneous results may lead to wrong conclusions.  The mean of the 
new results can be compared with the expectation predicted by the previous 
grades, and large discrepancies can be highlighted: one star if the discrepancy 
exceeds the standard deviation, and two if it exceeds twice the standard 
deviation.  The rule of thumb above gives the approximate frequency with which 
stars are to be expected if chance fluctuations are the sole source of error.
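The starring scheme itself might be sketched as follows (Python; the numbers in 
the example call are illustrative, not data from the book):

```python
def stars(mean_result, expectation, sd_of_mean):
    # Flag a discrepancy between observed mean result and predicted
    # expectation: one star beyond one sd (about 1 in 3 by chance alone),
    # two stars beyond two sds (about 1 in 20 by chance alone).
    disc = abs(mean_result - expectation)
    if disc > 2 * sd_of_mean:
        return '**'
    if disc > sd_of_mean:
        return '*'
    return ''

print(stars(0.70, 0.55, 0.06))   # discrepancy 0.15 > 2 * 0.06, so '**'
```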

In practice, of course, chance fluctuations are not the only source of error. 
Players improve when they are young, they decline as they approach old age, and 
they sometimes suffer temporary loss of form due to illness or domestic 
disturbance.  The interpretation of stars therefore demands common sense. 
Nevertheless, if the proportions of stars and double stars greatly exceed those 
attributable to chance fluctuation, the usefulness of the grades is clearly 
limited.

If grades do indeed constitute acceptable measures of ability, regular testing 
such as this should satisfy all but the most extreme and blinkered of critics. 
However, grading administrator and critic alike must always remember that 
around one discrepancy in three should be starred, and around one in twenty 
doubly starred, on account of chance fluctuations, even if there is no other 
source of error.  If a grading administrator performs a hundred tests without 
finding any doubly starred discrepancies, he should not congratulate himself on 
the success of his grading system; he should check the correctness of his 
testing.

The self-fulfilling nature of grading systems
---------------------------------------------

We now come to one of the most interesting mathematical aspects of grading 
systems: their self-fulfilling nature.  It might seem that a satisfactory 
expectation function must closely reflect the true nature of the game, but in 
fact this is not so.  Regarded as measures of ability, grades are subject to 
errors from two sources: (i) discrepancies between ability and actual 
performance, and (ii) errors in the calculated expectations due to the use of an 
incorrect expectation function.  In practice, the latter are likely to be much 
smaller than the former.

Table 5.1 illustrates this.  It relates to a very simple game in which each 
player throws a single object at a target, scoring a win if he hits and his 
opponent misses, and the game being drawn if both hit or if both miss.  If the 
probability that player j hits is p_j, the expectation of player j against player 
k can be shown to be (1+p_j-p_k)/2, so we can calculate expectations exactly by 
setting the grade of player j to 50p_j and using the expectation function f(d) = 
0.5 + d/100.  Now let us suppose that we have nine players whose probabilities 
p_1,...,p_9 range linearly from 0.1 to 0.9, that they play each other with equal 
frequency, and that we deliberately use the incorrect expectation function f(d) = 
N(d\sqrt (2\pi)/100) where N(x) is the normal distribution function.  The first 
column of Table 5.1 shows the grades that are produced if the results of the 
games agree strictly with expectation, and the entries for each pair of players 
show (i) the discrepancy between the true and the calculated expectations, and 
(ii) the standard deviation of a single result between the players.  The latter 
is always large compared with the former, which means that a large number of 
games are needed before the discrepancy can be detected against the background of 
chance fluctuation.  The standard deviation of a mean result decreases only with 
the inverse square root of the number of games played, so we can expect to 
require well over a hundred sets of all-play-all results before even the worst 
discrepancy (player 1 against player 9) can be diagnosed with confidence.

Table 5.1  Throwing one object: the effect of an incorrect expectation 
function
------------------------------------------------------------------------------------------------------
                                        Opponent
      Grade       -------------------------------------------------
Player           1     2     3     4     5     6     7     8     9
------------------------------------------------------------------
1      5.5       - -.009 -.013 -.014 -.011 -.005  .004  .017  .032
                  -  .250  .274  .287  .292  .287  .274  .250  .212

2     17.3   0.009     - -.006 -.009 -.009 -.007 -.002  .006  .017
               .250     -  .304  .287  .292  .287  .274  .250  .212

3     28.5   0.013  .006     - -.004 -.006 -.007 -.005 -.002  .004
               .274  .304     -  .335  .339  .316  .324  .304  .274

4     39.3   0.014  .009  .004     - -.003 -.006 -.007 -.007 -.005
               .287  .316  .335     -  .350  .346  .335  .316  .287

5     50.0   0.011  .009  .006  .004     - -.003 -.006 -.009 -.011
               .292  .320  .339  .335     -  .350  .339  .320  .292

6     60.7   0.005  .007  .007  .006  .003     - -.004 -.009 -.014
               .287  .316  .335  .346  .350     -  .335  .316  .287

7     71.5   0.004  .002  .005  .007  .006  .004     -  .006 -.013
               .274  .304  .324  .335  .339  .335     -  .304  .274

8     82.7   0.017 -.006  .002  .007  .009  .009  .006     - -.009
               .250  .283  .304  .316  .320  .316  .304     -  .250

9     94.5   0.032 -.017 -.004  .005  .011  .014  .013  .009     -
               .212  .250  .274  .287  .292  .287  .274  .250     -
==================================================================

The grades are calculated using an incorrect expectation function as described in 
the text.  The tabular values show (i) the discrepancy between the calculated and 
true expectations, and (ii) the standard deviation of a single result.
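The worst discrepancy can be checked directly.  This sketch (Python) applies the 
deliberately incorrect function to the exact grades 50 p_j rather than to the 
self-adjusted grades of Table 5.1, so the discrepancy it prints is somewhat 
larger than the tabulated .032; either way it is small beside the standard 
deviation of a single result (.212).

```python
import math

def N(x):
    # the standard normal distribution function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def true_expectation(pj, pk):
    # exact expectation in the throwing game: (1 + p_j - p_k)/2
    return (1 + pj - pk) / 2

def wrong_expectation(pj, pk):
    # the deliberately incorrect function of the text, applied here to
    # the exact grades 50*p_j rather than the self-adjusted grades
    d = 50 * pj - 50 * pk
    return N(d * math.sqrt(2 * math.pi) / 100)

t = true_expectation(0.9, 0.1)     # player 9 against player 1
w = wrong_expectation(0.9, 0.1)
print(round(t, 3), round(w, 3), round(t - w, 3))
```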


Experiment bears this out.  Table 5.2 records a computer simulation of a hundred 
sets of all-play-all results, the four rows for each player showing (i) his true 
expectation against each opponent, (ii) the mean of his actual results against 
each opponent, (iii) his grade as calculated from these results using the correct 
expectation function 0.5 + d/100, together with his expectation against each 
opponent as calculated from their respective grades, and (iv) the same as 
calculated using the incorrect expectation function N(d\sqrt(2\pi)/100).  The 
differences between rows (i) and (iii) are caused by the differences between the 
theoretical expectations and the actual results, and the differences between rows 
(iii) and (iv) are caused by the difference between the expectation functions.  
In over half the cases, the former difference is greater than the latter, so on 
this occasion even a hundred sets of all-play-all results have not sufficed to 
betray the incorrect expectation function with reasonable certainty. Nor are the 
differences between actual results and theoretical expectations in Table 5.2 in 
any way abnormal.  If the experiment were to be performed again, it is slightly 
more likely than not that the results in row (ii) would differ from expectation 
more widely than those which appear here.\fn{3}

\fn{3} In practice, of course, we do not know the true expectation function, so 
rows (i) and (iii) are hidden from us, and all we can do is assess whether the 
discrepancies between rows (ii) and (iv) might reasonably be attributable to 
chance.  Such a test is far from sensitive; for example, the discrepancies in 
Table 5.2 are so close to the median value which can be expected from chance 
fluctuations alone that nothing untoward can be discerned in them.  We omit the 
proof of this, because the analysis is not straightforward; the simple rules of 
thumb which we used in the previous section cannot be applied, because we are now 
looking at the spread of results around expectations to whose calculation 
they themselves have contributed (whereas the rules apply to the spread of 
results about independently calculated expectations) and we must take the 
dependence into account.  Techniques exist for doing this, but the details are 
beyond the scope of this book.


Table 5.2  Throwing one object: grading systems compared
------------------------------------------------------------------
                                        Opponent
      Grade       -------------------------------------------------
Player           1     2     3     4     5     6     7     8     9
------------------------------------------------------------------
1                -  .450  .400  .350  .300  .250  .200  .150  .100
                  -  .455  .435  .350  .335  .230  .200  .150  .125
       11.8       -  .471  .400  .355  .314  .250  .197  .182  .110
        7.8       -  .466  .388  .342  .304  .247  .203  .191  .139

2             .550     -  .450  .400  .350  .300  .250  .200  .150
               .545     -  .395  .395  .330  .290  .245  .210  .130
       17.6    .529     -  .429  .384  .344  .280  .226  .211  .139
       14.6    .534     -  .422  .374  .334  .275  .228  .215  .159

3             .600  .550     -  .450  .400  .350  .300  .250  .200
               .565  .605     -  .450  .390  .380  .315  .285  .185
       31.7    .600  .570     -  .455  .414  .350  .297  .281  .209
       30.4    .612  .578     -  .451  .409  .344  .292  .277  .212

4             .650  .600  .550     -  .450  .400  .350  .300  .250
               .650  .605  .550     -  .435  .430  .365  .310  .240
       40.8    .645  .616  .546     -  .459  .395  .343  .327  .254
       40.2    .658  .626  .549     -  .457  .390  .336  .320  .249

5             .700  .650  .600  .550     -  .450  .400  .350  .300
               .665  .670  .610  .565     -  .370  .395  .370  .305
       48.9    .685  .657  .586  .540     -  .436  .343  .368  .295
       48.8    .696  .666  .591  .543     -  .432  .336  .360  .284

6             .750  .700  .650  .600  .550     -  .450  .400  .350
               .770  .710  .620  .570  .630     -  .395  .435  .395
       61.7    .750  .721  .650  .604  .564     -  .447  .432  .359
       62.4    .753  .725  .656  .610  .568     -  .442  .425  .345

7             .800  .750  .700  .650  .600  .550     -  .450  .400
               .800  .755  .685  .635  .605  .605     -  .520  .400
       72.3    .803  .773  .703  .685  .617  .553     -  .484  .412
       74.0    .797  .772  .708  .664  .624  .558     -  .483  .400

8             .850  .800  .750  .700  .650  .600  .550     -  .450
               .850  .790  .715  .690  .630  .565  .480     -  .425
       75.4    .818  .789  .718  .673  .633  .569  .516     -  .427
       77.5    .809  .785  .723  .680  .640  .575  .517     -  .417

9             .900  .850  .800  .750  .700  .650  .600  .550     -
               .875  .870  .815  .760  .695  .605  .600  .575     -
       89.9    .891  .861  .791  .745  .705  .641  .588  .575     -
       94.3    .861  .841  .788  .751  .716  .655  .600  .583     -
==================================================================

For each player, the four rows show (i) the true expectation against each 
opponent; (ii) the average result of a hundred games against each opponent, 
simulated by computer; (iii) the grade calculated from the simulated games, using 
the correct expectation function, and the resulting expectations against each 
opponent; and (iv) the same using an incorrect expectation function as described 
in the text.

This is excellent news for grading secretaries, since it suggests that any 
reasonable expectation function can be used; the spacing of grades may differ 
from that which a correct expectation function would have generated, but the 
expectations will be adjusted in approximate compensation, and any residual 
errors will be small compared with the effects of chance fluctuation on the 
actual results.  But there is an obvious corollary:  the apparently 
successful calculation of expectations by a grading system throws no real light 
on the underlying nature of the game.  Chess grades are currently calculated 
using a system, due to A. E. Elo, in which expectations are calculated by the 
normal distribution function, and the general acceptance of this system by chess 
players has fostered the belief that the normal distribution provides the most 
appropriate expectation for chess.  In fact it is by no means obvious that this 
is so.  The normal distribution function is not a magic formula of universal 
applicability; its validity as an estimator of unknown chance effects depends on 
the Central Limit Theorem, which states that the sum of a large number of 
independent samples from the same distribution can be regarded as a sample from 
a normal distribution.  It can therefore reasonably be adopted as a model for a 
game's behaviour only if the chance factors affecting the result are equivalent 
to a large number of independent events which combine additively.  
Chess may well not satisfy this condition, since many a game appears to be 
decided not by an accumulation of small blunders but by a few large ones.  But 
while the question is of some theoretical interest, it hardly matters from the 
viewpoint of practical grading.  Chess gradings are of greatest interest at 
master level, and the great majority of games at this level are played within an 
expectation range of 0.3 to 0.7.  Over this range, the normal distribution is 
almost linear, but so is any simple alternative candidate, and so in all 
probability is the unknown 'true' function which most closely approximates to the 
actual behaviour of the game.  In such circumstances, the errors resulting from 
an incorrect choice of expectation function are likely to be even smaller than 
those which appear in Table 5.1.
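The near-interchangeability of expectation functions over the middle of the range is easy to check numerically. The sketch below compares the normal curve against a logistic alternative and against a straight line; the scale constant 1.702 is the standard choice for matching a logistic curve to the normal ogive and is my addition, not a figure from the text.

```python
import math

def normal_cdf(x):
    """Normal-distribution expectation function (standard normal CDF)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic(x, k=1.702):
    """Logistic alternative; k = 1.702 matches it closely to the normal CDF."""
    return 1.0 / (1.0 + math.exp(-k * x))

# Maximum disagreement between the two curves anywhere on the axis
xs = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(normal_cdf(x) - logistic(x)) for x in xs)

# Deviation of the normal curve from a straight line over the expectation
# range 0.3 to 0.7, i.e. roughly -0.52 <= x <= 0.52 for the standard normal
slope = 1.0 / math.sqrt(2.0 * math.pi)          # normal density at zero
mid = [i / 100.0 for i in range(-52, 53)]
max_lin = max(abs(normal_cdf(x) - (0.5 + slope * x)) for x in mid)
```

Both maxima come out at a little under 0.01 of expectation — smaller than the chance fluctuation in a season's results — which is the practical point the text is making.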

The limitations of grading
--------------------------

Grades help tournament organizers to group players of approximately equal 
strength, and they provide the appropriate authorities with a convenient basis 
for the awarding of honorific titles such as 'master' and 'grandmaster'. However, 
it is very easy to become drunk with figures, and it is appropriate that this 
discussion should end with some cautionary remarks.

(a) Grades calculated from only a few results are unlikely to be reliable.

(b) The assumption underlying all grades is that a player's performance against 
one opponent casts light on his expectation against another.  If this assumption 
is unjustified, no amount of mathematical sophistication will provide a remedy. 
In particular, a grade calculated only from results against much weaker opponents 
is unlikely to place a player accurately among his peers.

(c) There are circumstances in which grades are virtually meaningless.  For an 
artificial but instructive example, suppose that we have a set of players in 
London and another in Moscow.  If we try to calculate grades embracing both sets, 
the placing of players within each set may well be determined, but the placing of 
the sets as a whole will depend on the results of the few games between players 
in different cities.  Furthermore, these games are likely to have been between 
the leading players in each city, and little can be inferred from them about the 
relative abilities of more modest performers.  Grading administrators are well 
aware of these problems and refrain from publishing composite lists in such 
circumstances, but players sometimes try to make inferences by combining lists 
which administrators have been careful to keep separate.

(d) A grade is merely a general measure of a player's performance relative to 
that of certain other players over a particular period.  It is not an 
absolute measure of anything at all.  The average ability of a pool of 
players is always changing, through study, practice, and ageing, but grading 
provides no mechanism by which the average grade can be made to reflect these 
changes; indeed, if the pool of players remains constant and every game causes 
equal and opposite changes to the grades of the affected players, the average 
grade never changes at all.  What does change the average grade of a pool is the 
arrival and departure of players, and if a player has a different grade when he 
leaves from the one he received when he arrived, then his sojourn will have 
disturbed the average grade of the other players; but this change is merely an 
artificial 
consequence of the grading calculations, and it does not represent any change in 
average ability.  It is of course open to a grading administrator to adjust the 
average grade of his pool to conform to any overall change in ability which he 
believes to have occurred, but the absence of an external standard of comparison 
means that any such adjustment is conjectural.
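The fixed-pool invariance described above holds for any equal-and-opposite update rule. The sketch below uses the familiar Elo-style chess formula (K-factor 32, spread 400) purely as one example of such a rule; the pool sizes and game counts are arbitrary.

```python
import random

def elo_update(ra, rb, score_a, k=32.0, spread=400.0):
    """Return both players' new grades after one game; score_a is 1, 0.5,
    or 0.  The two changes are equal and opposite by construction."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((rb - ra) / spread))
    delta = k * (score_a - expected_a)
    return ra + delta, rb - delta

rng = random.Random(1)
pool = [rng.uniform(1200, 2200) for _ in range(20)]
start_mean = sum(pool) / len(pool)

# Many games inside a fixed pool: the average grade cannot move.
for _ in range(5000):
    i, j = rng.sample(range(len(pool)), 2)
    result = rng.choice([0.0, 0.5, 1.0])
    pool[i], pool[j] = elo_update(pool[i], pool[j], result)

end_mean = sum(pool) / len(pool)   # equals start_mean up to rounding error

# A departure, by contrast, moves the remaining players' average.
departed = pool.pop(0)
remaining_mean = sum(pool) / len(pool)
```

After 5,000 games the pool average matches the starting average to floating-point precision; only arrivals and departures move it, exactly as the text says, and that movement reflects the grading arithmetic rather than any change in ability.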

It is this last limitation that is most frequently overlooked.  Students of all 
games like to imagine how players of different periods would have compared with 
each other, and long-term grading has been hailed as providing an answer.  This 
is wishful thinking.  Grades may appear to be pure numbers, but they are actually 
measures relative to ill-defined and changing reference levels, and they cannot 
answer questions about the relative abilities of players when the reference 
levels are not the same.  The absolute level represented by a particular grade 
is not fixed over time: it is doubtful whether a player's grade ten years before 
his peak can properly be compared with that ten years after, and quite certain 
that his peak cannot be compared with somebody else's peak in a different era 
altogether.  Morphy in 1857-8 and Fischer in 
1970-2 were outstanding among their chess contemporaries, and it is natural to 
speculate how they would have fared against each other; but such speculations are 
not answered by calculating grades through chains of intermediaries spanning over 
a hundred years.\fn{4}

\fn{4} Chess enthusiasts may be surprised that the name of Elo has not figured 
more prominently in this discussion, since the Elo rating system has been in use 
internationally since 1970.  However, Elo's work as described in his book 
The rating of chessplayers, past and present (Batsford 1978) is 
open to serious criticism.  His statistical testing is unsatisfactory to the 
point of being meaningless; he calculates standard deviations without allowing 
for draws, he does not always appear to allow for the extent to which his tests 
have contributed to the ratings which they purport to be testing, and he fails to 
make the important distinction between proving a proposition true and merely 
failing to prove it false.  In particular, an analysis of 4795 games from 
Milwaukee Open tournaments, which he represents as demonstrating the normal 
distribution function to be the appropriate expectation function for chess, is 
actually no more than an incorrect analysis of the variation within his data. He 
also appears not to realize that changes in the overall strength of a pool cannot 
be detected, and that his 'deflation control', which claims to stabilize the 
implied reference level, is a delusion.  Administrators of other sports (for 
example tennis) currently publish only rankings.  The limitations of rankings 
are obvious, but at least they do not encourage illusory comparisons between 
today's champions and those of the past.