by Rod Edwards (edo - dot - chess - at - yahoo - dot - ca)
Rating historical chess players
There has always been a certain fascination with objective comparisons between strengths of chess players. Who was the greatest of them all? Was Rubinstein in his day the equal of Korchnoi in his? Just how strong was Baron Tassilo von der Lasa in relation to the other top players of the 1840s and early 1850s? Trying to answer such questions might be just idle speculation, or at best subjective assessment, if it weren't for chess rating systems, which hold the promise of objectivity and accuracy. But ratings are bound to be limited in their ability to answer questions like those above, especially for comparisons between different historical periods, and we cannot claim more than is possible. There is inevitably a sense in which we can only compare the strength of players in relation to the other players of their own period. After all, chess knowledge advances over time and in other respects the nature of the game as it is played has changed.
But somehow we are still tempted to believe that there is an innate aptitude for chess held by different players in different degrees. And it is surely not just a matter of innate potential. The extent to which that potential is fulfilled may depend on life circumstances, amount of study, etc. This potential, to the extent that it is realized, is manifested in results against other players over the chessboard, and the premise of rating systems is that 'playing strength' can be summarized by a number reflecting these results. If the compression of all the subtle skills required to be a good chess player into a single number seems naive, we may still think of it as a prediction of how well a player will do in future contests.
Arpad Elo put ratings on the map when he introduced his rating system first in the United States in 1960 and then internationally in 1970. There were, of course, earlier rating systems but Elo was the first to attempt to put them on a sound statistical footing. Richard Eales says in Chess: The History of a Game (1985) regarding Murray's definitive 1913 volume, A History of Chess that "The very excellence of his work has had a dampening effect on the subject," since historians felt that Murray had had the last word. The same could be said of Elo and his contribution to rating theory and practice. However, Elo, like Murray, is not perfect and there are many reasons for exploring improvements. The most obvious at present is the steady inflation in international Elo ratings [though this should probably not be blamed on Elo, as the inflation started after Elo stopped being in charge of F.I.D.E.'s ratings (added Jan. 2010)]. Another is that the requirements of a rating system for updating current players' ratings on a day-to-day basis are different from those of a rating system for players in some historical epoch. Retroactive rating is a different enterprise than the updating of current ratings.
In fact, when Elo attempted to calculate ratings of players in history, he did not use the Elo rating system at all! Instead, he applied an iterative method to tournament and match results over five-year periods to get what are essentially performance ratings for each period and then smoothed the resulting ratings over time. This procedure and its results are summarized in his book, The Rating of Chessplayers Past and Present (1978), though neither the actual method of calculation nor full results are laid out in detail. We get only a list of peak ratings of 475 players and a series of graphs indicating the ratings over time of a few of the strongest players, done by fitting a smooth curve to the annual ratings of players with results over a long enough period. When it came to initializing the rating of modern players, Elo collected results of international events over the period 1966-1969 and applied a similar iterative method. Only then could the updating system everyone knows come into effect - there had to be a set of ratings to start from.
Iterative methods rate a pool of players simultaneously, rather than adjusting each individual player's rating sequentially, after each event or rating period. The idea is to find the set of ratings for which the observed results are collectively most likely. But they are basically applicable only to static comparisons, giving the most accurate assessment possible of relative strengths at a given time. Elo's idea in his historical rating attempt was to smooth these static annual ratings over time.
While we can safely bet that Elo did a careful job of rating historical players, inevitably many choices have to be made in such an attempt, and other approaches could be taken. Indeed, Clarke made an attempt at rating players in history before Elo. Several other approaches have recently appeared, including the Chessmetrics system by Jeff Sonas, the Glicko system(s) of Mark Glickman, a version of which has been adopted by the US Chess Federation, and an unnamed rating method applied to results from 1836-1863 and published online on the Avler Chess Forum by Jeremy Spinrad. By and large these others have applied sequential updating methods, though the new (2005) incarnation of the Chessmetrics system is an interesting exception (see below) and Spinrad achieved a kind of simultaneous rating (at least more symmetric in time) by running an update algorithm alternately forwards and backwards in time.
There are pros and cons to all of these. Some years ago, I had a novel idea for the retroactive rating of players over a fixed period, exploiting as much as possible the benefits of simultaneous methods, in a way different than that used by Elo. This has developed into what I've called the 'Edo historical chess ratings,' which are the subject of this website. I emphasize that Edo ratings are not particularly suitable to the regular updating of current ratings. [But see comments below on Rémi Coulom's 'Whole-History Rating' (added Jan. 2010).] The problem the Edo ratings attempt to deal with is that of making the best estimate of the ratings of historical players, in a statistically justifiable way, and of the reliability of those estimates, even when results are scanty. I have tried to use as much information as possible on as many players as possible for the period 1809 - 1902 (so far). [Now more selective information from strong events up to 1921 is included (added Feb. 2016).]
[(Added Jan. 2010) A note on the name: I chose to call my rating method 'Edo' partly as a slightly tongue in cheek play on 'Elo', but also following Glickman's example, with his 'Glicko' ratings, of tacking 'o' onto the first syllable of his surname. We've had Ingo ratings, Elo ratings, and Glicko ratings, so Edo ratings seemed natural. But, aside from being flippant, it was perhaps also an unfortunate choice because the name 'Edo' is already a proper name in several other cultures, which confuses searches. I am leaving the name intact, however, because it is cited in so many places now, including published articles. Changing it now would just create more confusion.]
How Edo ratings work
Edo ratings are calculated simultaneously over an entire period of history by an iterative method similar to that used by Elo to initialize the international ratings around 1970, but with an important difference: each player in each year is treated as a separate player.
The main problem with rating chess players is that their abilities change over time. This is what makes it more difficult than standard static 'paired comparison' experiments as used for example in psychology and marketing, for which statistical methods are well developed. (A good reference for these methods, particularly the Bradley-Terry model, is the book The Method of Paired Comparisons by Herbert A. David, published in the 60's but reprinted in 1988.)
Static methods can't be applied to results over a long period of time, since players' abilities change and no direct comparison can be made between players at different times. If results over a long period are simply lumped together, only an average rating for each player over their period of play can be obtained. A static method can be applied separately to each rating period, but this only gives performance ratings for each period, and doesn't take into account the evidence of earlier (or later) results. This is a problem especially when a player has few results in a given period: A single draw against Morphy, say, by itself indicates a rating equal to Morphy's but if the player's rating prior to this game were 300 points lower, we would not believe that the player's true ability had increased that much. There are, after all, statistical fluctuations in results even if true playing strength is fairly consistent or smoothly changing over time. What is needed is a mathematically sound method for guaranteeing this sort of consistency over time, a balance between performance ratings in a given rating period and ratings in neighbouring rating periods. Update systems attempt to do this, allowing performance ratings only a certain amount of weight in departing from the previous rating. There should be some inertia in ratings to attenuate the unrealistically dramatic changes of raw performance ratings.
So here's the key idea: Each player in each year is considered a separate player. To introduce the required inertia, weight is given to previous and subsequent ratings by introducing hypothetical tied matches between a player in one year and the same player in neighbouring years. For example, the 'player' Staunton-1845 is hypothesized to have obtained a 50% score in a 30-game match against Staunton-1844 and in another 30-game match against Staunton-1846. Then a static iterative method (known as the Bradley-Terry method) is applied to the entire collection of 'players', as if they all played in one big tournament. Staunton-1845 for example played matches with Williams-1845, Spreckley-1845 and Mongredien-1845, but also against Staunton-1844 and Staunton-1846. These hypothetical 'self-matches' account for the fact that a player's rating generally does not change by a huge amount from one year to the next, though they don't prevent such large changes when there is strong evidence for them in competition scores.
There is a statistical basis for all this. The Bradley-Terry model makes the 'best' (in the maximum-likelihood sense) compromise between conflicting rating differences between pairs of players determined by the familiar 'logistic' formula that relates scores to rating differences:
$$\frac{s}{n} = \frac{1}{1 + 10^{-(R - r)/400}}$$
where R and r are the ratings of two players who play each other, and s is the score of the first player out of n games. Thus, if player A's performance against player B indicates a 200 point rating advantage, and player B's performance against player C puts B 100 points ahead of C, then player A should be 300 points ahead of C. But if A actually played C, they might have had a score indicating some other difference, say 225 points. There is a best compromise in this situation which picks the three ratings most likely to have produced the observed scores. The Bradley-Terry model allows calculation of these optimum ratings for any set of games between players (or comparisons between objects) by a relatively simple iteration. One requirement is that all players (or objects) be connected by some set of games (A played B who played C who played D, etc.). The Philidor-Stamma match from 1747, for example, could not be included as we have no results linking them to the 19th century players under consideration.
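As a rough illustration of the logistic relation, the following Python sketch converts a match score into the rating difference it implies, and a rating difference back into an expected score. The function names are illustrative only and are not taken from the Edo implementation.

```python
import math

def expected_score(rating_diff):
    """Expected fractional score for a player rated rating_diff points above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

def implied_difference(score, games):
    """Rating difference implied by scoring `score` out of `games` (needs 0 < score < games)."""
    return 400.0 * math.log10(score / (games - score))

# A 4-1 result (an 80% score) and the difference it implies.
print(round(implied_difference(4, 5)))

# Differences along a chain add up: A over B by 200 and B over C by 100
# predict the score A would make against C at a 300-point difference.
print(round(expected_score(200 + 100), 3))
```

Note that a perfect or zero score implies an infinite difference under this formula, which is one reason isolated lopsided results need the extra care described later.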
In the Edo system, each historical player becomes a whole set of players for the application of the Bradley-Terry algorithm, one for each year from the first year for which they have results to the last year for which they have results in our data set. The self-matches amount to what is known in Statistics as a Bayesian approach to the underlying probability distribution of rating changes for a player from year to year. Technically, the self-matches constitute a particular choice of 'conjugate prior' (for background on the use of conjugate priors in paired comparison experiments, see papers by Davidson and Solomon cited in David's The Method of Paired Comparisons).
Without worrying about the statistical justification, the idea is pretty simple. Once you know how to run the Bradley-Terry algorithm, which is not very difficult (though it can take a while with the large data set I'm working with), you simply apply it to the set of 'player-years' with both their real results against other players and their hypothetical results against their neighbouring year counterparts.
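The recipe can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the author's program: it uses the standard fixed-point (Zermelo/Ford) iteration for the Bradley-Terry model, counts a draw as half a win, and uses invented match scores. The names `add_self_matches`, `bradley_terry` and `careers` are hypothetical. It also assumes every player-year has scored some points and dropped some points overall, which the tied self-matches ensure for any career longer than one year.

```python
import math
from collections import defaultdict

def add_self_matches(results, careers, games=30):
    """Append a tied `games`-game match between each player-year and the next year."""
    out = list(results)
    for name, (first, last) in careers.items():
        for year in range(first, last):
            out.append(((name, year), (name, year + 1), games / 2, games))
    return out

def bradley_terry(results, iterations=2000):
    """Fit ratings (400*log10 scale, arbitrary zero) to rows of (a, b, score_a, games)."""
    score = defaultdict(float)
    opponents = defaultdict(lambda: defaultdict(float))
    for a, b, s, n in results:
        score[a] += s
        score[b] += n - s
        opponents[a][b] += n
        opponents[b][a] += n
    gamma = {p: 1.0 for p in opponents}
    for _ in range(iterations):
        new = {
            p: score[p] / sum(n / (gamma[p] + gamma[q]) for q, n in opponents[p].items())
            for p in opponents
        }
        # Only ratios matter, so fix the scale by setting the geometric mean to 1.
        log_mean = sum(math.log(g) for g in new.values()) / len(new)
        gamma = {p: g / math.exp(log_mean) for p, g in new.items()}
    return {p: 400.0 * math.log10(g) for p, g in gamma.items()}

# Invented scores, for illustration only.
real_results = [
    (("Staunton", 1844), ("Williams", 1844), 2.5, 5),
    (("Staunton", 1845), ("Williams", 1845), 4.0, 5),
]
careers = {"Staunton": (1844, 1845), "Williams": (1844, 1845)}
ratings = bradley_terry(add_self_matches(real_results, careers))
for player_year in sorted(ratings):
    print(player_year, round(ratings[player_year]))
```

In this toy data set the 1845 gap between the two players comes out smaller than the 4-1 score alone would suggest, and the 1844 gap comes out larger than the even score alone would suggest, because the self-matches tie each player's neighbouring years together.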
The Bradley-Terry algorithm produces a set of numbers that are not in the familiar range of chess ratings, where 2200 is master level, and 2700 is world championship level. The scale is not really fixed, since the method can only estimate rating differences between players, not absolute ratings. Thus, adding 100 points or 1000 to everyone's rating changes nothing, as far as the method goes, since the differences between ratings remain the same. Thus, we have to apply an offset to the numbers produced by the iteration to get them into the right range. This is in a sense arbitrary and may seem immaterial, but it is important to make a reasonable choice here for the adjustment step (see below).
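A small Python sketch of the offset step follows. How the offset is actually chosen is not spelled out here, so the anchoring rule below (pin one chosen player to a target value) is purely an assumption for illustration, as are the names and numbers.

```python
def apply_offset(raw_ratings, anchor, anchor_rating):
    """Shift every rating by the same constant so that `anchor` lands on `anchor_rating`."""
    shift = anchor_rating - raw_ratings[anchor]
    return {player: rating + shift for player, rating in raw_ratings.items()}

# Hypothetical raw Bradley-Terry output, centred near zero.
raw = {"A": 120.0, "B": -80.0, "C": -40.0}
scaled = apply_offset(raw, "A", 2700.0)  # hypothetical anchor choice

# Differences between players are untouched by the shift.
print(scaled["A"] - scaled["B"], raw["A"] - raw["B"])
```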
The Bradley-Terry method also allows us to calculate the variances of our rating estimates. Our estimate is only a best guess, and the 'true' rating could be larger or smaller, but by how much? The variance gives us a measure of this uncertainty. A rating estimate is really an estimate of the mean of a distribution of playing strengths which we expect to vary somewhat from event to event or game to game. We are not particularly interested in this inevitable fluctuation; rather, we want to home in on the mean, which is our real idea of the player's underlying strength. Thus, the measure of uncertainty in our estimate is actually the standard error of the mean. We call this the 'rating deviation'. Technically, the Bradley-Terry method gives us an approximation to the standard error which is more accurate the larger the number of games played.
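To give a feel for the size of these uncertainties, here is a Python sketch for the simplest possible case only: a single match between two players, with games treated as decisive comparisons. It uses the usual large-sample (Fisher information) approximation for the logistic model. In the full system the rating deviations come from the whole results matrix rather than from one match at a time, so this is an illustration of scale, not the Edo calculation; the function name is hypothetical.

```python
import math

K = math.log(10) / 400.0  # converts rating points to natural-log odds units

def difference_standard_error(score, games):
    """Approximate standard error of the rating difference estimated from one match."""
    p = score / games
    information = K * K * games * p * (1.0 - p)
    return 1.0 / math.sqrt(information)

# The same 80% score is far less informative over 5 games than over 50.
print(round(difference_standard_error(4, 5)))
print(round(difference_standard_error(40, 50)))
```

The standard error shrinks with the square root of the number of games, which is why a handful of games says so little about a player's rating.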
The Edo method has the practical drawback of having to deal with the enormous size of the 'crosstable' for the whole data set, with results between not only a large number of players, but multiple versions of each player for every year of their career. My current data set with all the results I could use up to 1902 has 1757 players [5494 players up to 1921 as of Jan. 2017] and 9522 'player-years' [37557 as of Jan. 2017]. Thus, the overall 'crosstable' or 'result matrix' is 9522 x 9522 [37557 x 37557 in Jan. 2017] (actually bigger [38834 x 38834 in Jan. 2017], since there are a few additional 'ghost' players standing in for players who give odds -- see below).
[Technical part: To get rating deviations I should, in principle, have to invert this matrix. However, I have developed an efficient algorithm to find the diagonal of the matrix inverse (which is all I need for the variances) when the matrix is block-tridiagonal, which makes this feasible even on a normal PC.]
This approach works quite well, except that when we have few results for a player, these results could be anomalous, and the method as outlined above gives them undue weight and some rather unlikely ratings can emerge. If there was a player, for example, for whom we had the result of only a single game, and that was a draw against Morphy, then the best estimate of that player's rating is that it is equal to Morphy's if no other information is taken into account. But we would be justified in objecting to this. Some other information must be taken into account. The solution is essentially to use what we know about the underlying distribution of ratings of all chess players - some ratings we know are less likely than others before we even look at game results.
The adjustment step
Lecrivain won a match against Dubois in 1855 with a score of 4-1 (according to Spinrad's collection of results) [This was an error and has been changed - see below (Jan. 2010) - but similar comments could be made about Cochrane's purported win over Kolisch in a series of games in 1869 (Oct. 2016)]. Dubois had a fairly well-established rating placing him in the world's top ten at the time. According to the above formula, an 80% score indicates a rating difference of 241 points. So, if no other results were taken into account for Lecrivain, we would predict that he was 241 points above the highly rated Dubois (Edo rating 2524, so Lecrivain would be 2765). Nobody believes that Lecrivain in 1855 was comparable in strength to Morphy a few years later. The result must have been a lucky one for Lecrivain. He also played a short odds match with Morphy in 1858, getting only 2 out of 6 at odds of pawn and two moves. This will indirectly bring down his 1855 rating a fair bit (because of the tied self-matches against himself from year to year), but the high score against Dubois still weighs heavily. The raw results of our Bradley-Terry algorithm put Lecrivain at 2599 in 1855.
We cannot just arbitrarily change the answers given by our method, which we believe to be a reasonable one. But there is a principle behind our objection. Players rated 2600 or more are rare and there is always a certain amount of random fluctuation in contest scores: they don't exactly match the predictions based on rating differences. So although it is not very common for a lower or even similarly rated player to get an 80% score, we are only talking about 5 games. And the alternative is even less likely, namely that Lecrivain is really rated close to 2765 and the score reflected the real rating difference (or even that combined with the other result, Lecrivain was 2599 in 1855).
This is really a description of the Bayesian idea in statistics. There is an underlying distribution of playing strengths existing in the background, considering the chess-playing population as a whole, before we even consider results of games. There are many more players near the middle of this rating distribution than at the extremes. Most of the players in the historical record we have access to are certainly in the upper parts of this distribution, since we mainly have results of international tournaments, or at least major national tournaments, and matches between the better-known players. Nevertheless, if the raw results indicate a rating of 2500 but with a standard error of, say, 60, then there is a better chance that the true rating is a bit lower than 2500 than that it is higher simply because fewer players have higher ratings. In fact, considering the range of feasible ratings around the raw estimate of 2500, it is actually more likely that the rating is a bit lower than 2500 than exactly 2500. How much lower depends on the standard error of the raw estimate. If the raw estimate is unreliable (large standard error) then the best guess at the true rating may be considerably lower than the raw estimate would suggest. How to make the best estimate considering these two sources of information, raw estimates based on scores, and underlying rating probabilities, requires knowledge of the underlying distribution and the appropriate statistical method.
If we look at modern statistics on ratings of the mass of chess players, the overall distribution of ratings, on the traditional scale introduced by Elo, is roughly a normal distribution, with a mean of about 1500 and a standard deviation of about 300 (which implies that about 1% of players have ratings over 2200). Now, our historical data is not going to be centred on this mean since the players whose results get into the historical record tend to be the stronger players, although as we get toward the end of the nineteenth century especially, we have records of more tournaments, including weaker ones. Thus, even if most of the players in our sample are rated over 2000, say, we should not be surprised to find some players in the record with ratings much lower, especially as we include more national and local tournaments as well as the main international ones.
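The "about 1%" figure can be checked directly from the normal distribution. A minimal Python sketch, using only the mean and standard deviation quoted above (the function name is illustrative):

```python
import math

def fraction_above(rating, mean=1500.0, sd=300.0):
    """Fraction of a normal(mean, sd) population rated above `rating`."""
    z = (rating - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))

print(f"{fraction_above(2200):.4f}")  # share of players above 2200
print(f"{fraction_above(2600):.6f}")  # share of players above 2600
```

The second line shows how quickly the tail thins out: ratings of 2600 and above are rarer by a couple of orders of magnitude than ratings above 2200.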
The Edo system accounts for this background information by performing an adjustment step on the raw results produced by the Bradley-Terry algorithm. The Bradley-Terry method gives the set of rating differences that makes the observed scores most likely. An offset is then applied to put the scale in the range of familiar Elo ratings, with a top player like Lasker rising over 2700. Then, the adjustment is done by calculating the maximum combined probability of the rating given the scores and the rating based on the overall distribution, i.e., the product of these probabilities. This gives a more conservative but more sensible estimate of each player's true strength. Players with ratings above 1500 will have their ratings brought down by the adjustment, while players (there are a few in my sample) rated below 1500 have their ratings brought up. The size of this adjustment depends in part on how far from 1500 the rating is, but is limited more strongly by the standard error of the Bradley-Terry rating. The more reliably a player is rated based on scores, the less effect the adjustment will have (reliability of a rating depends on numbers of games played and reliability of ratings of opponents). The rating deviation is also adjusted as the combined distribution has a variance lower than that of either distribution individually, though when the standard error of a raw rating is small, the adjustment decreases it only very slightly.
[A technical point: in principle, the adjustment should be applied to all years for each player including those in which they had no results. However, it is a characteristic of the Bradley-Terry method that ratings across such a gap always lie on a straight line between the endpoints determined by years with actual game results. Rating deviations inevitably increase in these gaps, so that the adjustment would cause the rating estimates in the gap to 'sag' below the straight line. From experience, we know that rating curves are generally concave downwards rather than upwards - they typically rise in youth, peak in middle age and then decline. I consider it a small weakness in the method that the estimates in gaps do not have this typical curvature - this is a result of the assumed simple zero-mean distribution of rating changes over years. To compromise between the 'sagging' that would be induced by the adjustment in the gaps and the 'bulging' that we think is more realistic, I opted to apply the rating adjustment only to years in which players actually have results and then to re-impose the straight line interpolation across gaps. The rating deviation adjustment is still applied to all years, including gaps.]
As an example, the adjustment brings Lecrivain's rating in 1855 down from 2599 to 2507 and his rating deviation down from 91 to 87. Not all players are affected this much by the adjustment of course. Most reliably rated players are adjusted downwards by 20 to 70 rating points. But there are some with very large rating deviations indeed and thus large adjustments.
[On rereading Spinrad's presumed web source, La Grande Storia degli Scacchi, it seems clearly to assert that Dubois was ahead of Lécrivain in the ratio of 4 games to 1, though how many games were played is not specified. I have changed this to a 'soft' result of 4-1 for Dubois so the above discussion of Lécrivain's rating no longer applies. The principle still holds, however (Aug. 2011).]
An extreme case is that of Petrov, the first great Russian player. We have some results of his matches with other players in the 1850's and 60's, as well as a two-game match with von Jaenisch in 1844 (though for reasons described below, I have added an extra game here) [Now I include only a 1-0 'soft' result, with Petrov giving Pawn and Move odds: see below (Feb. 2012)]. But Petrov was playing chess as early as the age of 15 and I have a match result with a player named Baranov in 1809 [this match was actually against Kopev (June 2013)]. Petrov in 1809 is connected to the rest of our players only very tenuously through a long chain of self-matches from one year to the next between 1809 and 1844, then only three games [Now one game (Feb. 2012)] with von Jaenisch [and a couple of results against otherwise unconnected players (July 2013)] and another series of self-matches up to 1853, when the other matches begin. This makes his rating in 1809 extremely unreliable. His raw rating in 1809 is 2628 [2819 (Jan. 2017)], but the raw rating deviation is 293 [299 (Jan. 2017)]. This means essentially that his true rating could plausibly be anywhere between 600 points lower and 600 points higher than 2628 [2819 (Jan. 2017)]. Of course, 600 points higher is unlikely on the basis of the prior distribution of ratings - nobody has a rating over 3200 [3400 (Sep. 2015)]! That's where our adjustment comes in. But since our Bradley-Terry rating estimate is so loose, the combined probabilities are highest for much lower ratings and the required adjustment is calculated to be 551 points [657 (Jan. 2017)], bringing Petrov's rating in 1809 down to 2077 [2162 (Jan. 2017)]. This is still, of course, unreliable - we cannot get more reliability than the scanty game results allow - but the adjusted rating deviation does drop to 210 [212 (Jan. 2017)]. 
In this case, there is a good argument to exclude the early Petrov from the entire project, but I set out to seek best estimates of rating strengths (including an assessment of how uncertain these estimates are) for all early players whenever we have results for them. Even though the ratings for the early Petrov are really not much better than wild guesses, including him reminds us at least that he was there in the following years though largely isolated from the main European chess scene.
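The arithmetic of the adjustment can be sketched in Python under one simplifying assumption: that the raw estimate is treated as a normal distribution centred on the raw rating with standard deviation equal to the raw rating deviation, and multiplied by the normal(1500, 300) background distribution. The product of two normal densities peaks at a precision-weighted average of the two means. Whether this is exactly how the Edo adjustment is computed is not stated, but this simple combination lands within a point of the adjusted ratings and deviations quoted above for Lecrivain and Petrov. The function name is hypothetical.

```python
import math

PRIOR_MEAN = 1500.0
PRIOR_SD = 300.0

def adjust(raw_rating, raw_deviation, prior_mean=PRIOR_MEAN, prior_sd=PRIOR_SD):
    """Peak and spread of the product of the raw-estimate and prior normal densities."""
    w_raw = 1.0 / raw_deviation ** 2      # precision of the raw estimate
    w_prior = 1.0 / prior_sd ** 2         # precision of the background distribution
    rating = (w_raw * raw_rating + w_prior * prior_mean) / (w_raw + w_prior)
    deviation = math.sqrt(1.0 / (w_raw + w_prior))
    return rating, deviation

# Raw figures as quoted in the text (original, pre-revision values).
for name, raw, dev in [("Lecrivain 1855", 2599, 91), ("Petrov 1809", 2628, 293)]:
    rating, deviation = adjust(raw, dev)
    print(name, round(rating), round(deviation))
```

The weights make the behaviour described above explicit: a small raw deviation gives the raw rating nearly all the weight, while a raw deviation close to 300 splits the weight almost evenly between the raw rating and 1500.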
Summary of the Edo system
In summary, the Edo system is done by
Comparison with the previous version of the Edo system
In my earlier (2004) version of the Edo Rating system, I tried to address the issue of adjustment for prior information of player ratings by introducing additional hypothetical games between a player and a hypothetical average or 'reference' player. I did this only where players were less strongly connected by other games, namely at the ends of their careers (in their first and last years of play) when the number of games they played that year was less than 25. All such reference games were taken to have resulted in draws, or 50% scores. Thus, the hypothetical reference player would have a rating that averages those of players who played against him.
This sounded like a good idea, but there were three problems. One is that it was not clear who to assign the reference games to, and arbitrary decisions about how many reference games to apply and to which players, like the ones described above, were required. One might suppose that all players (in each career year) should be assigned a reference game, but this fails because of a second problem: The effect of these reference games spreads. To the extent that a player's rating is changed by the reference games, other closely connected players are also changed. Thus players feel the effects of reference games applied to other players as well as themselves, so that if every player has reference games a huge accumulation of effect occurs and dominates the effects of actual game scores. Reducing the weighting of reference games (i.e., applying a small fraction of a game against the reference player to each player in the system) helps but then the effects are too close to uniform to make any real difference at all. The third problem is that this method implicitly pulls all players towards a rating that is a kind of average of players in the system. Thus, I ran into problems with weak tournaments, where reference games caused all players to be brought up in rating towards the mean, sometimes putting the winner of the tournament at an unrealistically high rating. As discussed above, we really don't think that weaker tournaments are unrealistic, so they shouldn't be pulled up. And the new adjustment method handles this appropriately. In fact, in the previous version the weaker players were universally rated too strongly, since their ratings were pulled up towards an average value, around 2200 or 2300. This has been corrected.
Another aspect of the previous system has been modified: the calculation of rating deviations (standard errors). Previously, I calculated the variances of the Bradley-Terry estimates in a straightforward way, but this is based on an assumption of decisive results of 'comparisons' or, in our case, games. Chess games, of course, can result in a draw. Naturally, the presence of draws increases the chances of even scores in a match between equal players, for example. Thus, the variability in overall scores for any given rating difference between players is less than it would be if draws were impossible. The simplest way to view this is that if a draw is as likely as a win (this seems a fair estimate of the situation in chess - that roughly half of all games end in draws overall), then the result of a game (0, 1/2 or 1), is equivalent to the result of two decisive 'comparisons' in the sense of paired comparison experiments. A draw corresponds to one of the two comparisons going each way. A win or loss corresponds to both comparisons going the same way. Thus, the effective number of comparisons is double the number of games. Since variance decreases proportionally with the number of comparisons, the variance is actually half what is suggested by the straightforward application of the Bradley-Terry method (so the rating deviation is divided by the square root of 2 and is thus about 70% of the straightforward value). Another way to look at it is that the result of each game (0, 1/2 or 1) actually gives finer comparative information between players than a 1-0 comparison would, so there is less uncertainty in scores for given rating differences. If the probability of draws is greater or less than one half, the above would have to be adjusted a bit, but we take it as a reasonable estimate.
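The halving of the variance is easy to verify for the case of two equal players. The Python sketch below compares the variance of a single game result when draws are impossible with the variance when half of all games are drawn; it covers only this equal-strength case, and the function name is illustrative.

```python
import math

def result_variance(p_win, p_draw, p_loss):
    """Variance of one game result scored 1, 1/2 or 0."""
    mean = p_win * 1.0 + p_draw * 0.5
    return (p_win * (1.0 - mean) ** 2
            + p_draw * (0.5 - mean) ** 2
            + p_loss * (0.0 - mean) ** 2)

no_draws = result_variance(0.5, 0.0, 0.5)       # every game decisive
half_draws = result_variance(0.25, 0.5, 0.25)   # half of all games drawn

print(half_draws / no_draws)             # ratio of variances
print(math.sqrt(half_draws / no_draws))  # ratio of standard deviations
```

The win, draw and loss probabilities of 1/4, 1/2 and 1/4 are exactly what two independent fair comparisons would produce, which is the equivalence the argument above relies on.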
Long term consistency (the holy grail of historical rating)
Inflation and deflation of ratings are an inevitable concern in any rating system. How can we be sure that a 2600 rating in one year is equivalent to a 2600 rating 50 or 100 years later? All rating systems have some explicit or more usually implicit method of seeking consistency in ratings over time. As discussed in the introduction above, this is an inherently impossible task. The nature of the game changes, as more knowledge of chess is accumulated. The number of active players has consistently increased over time - does this reflect simply a larger random sample from the same underlying global distribution, or is the increase in numbers more at the lower end of the playing strength distribution? For example, if the few players for whom we have recorded results in the first half of the 19th century were really the strongest players in the world at the time, then were they equal in strength to a similar number of top players in the second half of the 19th century, so that the bulk of later 19th century players were weaker, or was the average of the early players the same as the average of active players in the later part of the 19th century, in which case the top later players would be considerably stronger than the top early players? How can we know?
We have to accept that we cannot detect whether there is, for example, an overall gradual rise in the strength of the entire chess-playing world over the years. But we can aspire to assess strength in relation to the overall level of chess ability of the time, or even some kind of innate chess-playing ability, rather than expertise in actual play, which depends on how much chess knowledge is available in a given era of history.
There are three ways to handle this 'consistency over time' problem:
The Elo system uses local consistency: Rating points are shifted around amongst a group of players after a tournament, but the average rating of these players stays the same. Thus, from rating period to rating period, the established players in the system maintain a common average. The Glicko system, on the other hand, is really of the second type, though not rigidly so. Gains and losses of rating points do not always balance out, since they depend on rating deviations (or uncertainties), but over time the mean rating of the entire set of players must stay about the same since all new players start from the middle of an unvarying distribution.
A previous version of the Chessmetrics system had a fairly sophisticated distribution adjustment procedure to maintain consistency in a sensible way from the mid-nineteenth century to the end of the twentieth. Rating systems for current players in some rating pools have had to combine periodic adjustments with the Elo system to maintain consistency because of gradual deflation or inflation.
The Edo system is naturally of the local consistency type. The 50% scores in the hypothetical games between each player and their future and past counterparts ensure this. The adjustment step, which involves a fixed rating distribution with a relatively low mean (1500), does not impose a fixed mean in the pool of players at hand (which is after all much higher than the underlying mean). Although the adjustment provides a pull towards the underlying mean, it is applied on an individual basis after the raw ratings are calculated. If inflation or deflation occurs, it will be present in the raw ratings and will not be removed by the adjustment.
I believe that this method is appropriate because of the incomplete nature of the data we have to work with. We simply have more data on events in the later nineteenth century than the earlier part. As the century goes on, we have results from an increasing number of lesser-known tournaments with players far weaker than the top players of the day. The average rating of players is obviously lowered in those years for which these weaker events are included - we are sampling lower levels in the distribution of player strengths. If I insisted on maintaining a globally constant average over the years, then in the later years where more results are available, the average rating would be artificially raised to keep it in line with that of previous years, causing a general inflation in ratings. Including results for additional players at the lower end of the rating distribution should not be allowed to raise everyone's ratings.
It is possible that deflation over time or even some inflation could occur in this system, as occurs in the Elo system, too. There is an implicit assumption in the local consistency approach that on average players tend to leave the pool with the same rating as they entered with. If on average players finish their careers at higher ratings than they start with, then the average change in players' ratings from year to year is greater than zero, but our system will try to keep it at zero and will thus cause a deflation over time. The presence of such deflation (or inflation) can be determined by checking, for example, that the mean of the top 10 or 20 players remains more or less constant. Since we have relatively few players in our system in a given year, especially the earlier years, I chose to check the average of the top 10, rather than 20 or more. A marked trend in the top 10 mean is an indication of deflation (if it decreases) or inflation (if it increases). We cannot know, of course, whether such an inflation or deflation is real (more likely the former), but if we believe that a top group of players should be on average similar in strength in any era, then we should avoid inflation or deflation. If an inflation is found and is not considered real then the implication is that players on average actually finish their careers with lower ratings than they start.
The Edo system could be made to compensate for either inflation or deflation by using a score slightly different from 50% in the self-matches. The presence of deflation, for example, would mean that individual players were on average increasing in strength slightly from year to year and the tied self-matches were thus pulling them down slightly from their 'true' ratings, so the self-match scores could be set at, say, 51% in favour of the later year.
In practice, I have found no real evidence of deflation or inflation in the results I've calculated so far, up to the year 1902 [1921 (Jan. 2017)]. Between 1859 and 1901 [1857 and 1921 (Jan. 2017)], the top 10 average hovers between about 2580 and 2615 [2592 and 2645. There is no increasing or decreasing trend in these small oscillations (Jan. 2017)]. Before 1860, there is certainly an increase over time in the average of the top 10 players but we have so few players in the system in these years that the top 10 is actually a fairly large proportion of all players in the system, whereas later it is a much smaller proportion. It might even be better to compare the mean of the top 5 in an early year to the mean of the top 10 in a later year. In fact, we know that strong players are missing (because we have no firm results for them), so we may be getting lower than top 10 averages anyway. Rather than trying to tweak the comparison for the pre-1860 period, I felt that the lack of trend in the post-1860 period was reasonable evidence for long-term consistency. No measures were taken to change the self-match scores.
Advantages of the Edo system
The balance of inertia and responsiveness
All rating systems try to produce ratings that reflect the scores in matches and tournaments, according to a formula like the one given above. However, performance ratings also reflect statistical variations in individual events. If a player gets a 2400 performance rating in one tournament and then a 2600 performance rating in the next followed by another at 2400, nobody believes that their true playing strength really jumped up by 200 points for that one event. So all rating systems attempt to smooth these fluctuations out in some way.
In the Elo system, for example, a new result is allowed only to modify the previous rating by a small amount if it consists of just a few games, and the relative weight of the new results increases with the number of games. The K parameter in the Elo system determines the relative influence of the new results. In fact, it is not exactly a weighted average between the previous rating and the new performance rating, as is sometimes claimed, but rather an adjustment factor that is scaled according to the number of games. This distinction becomes more noticeable when a really large number of games is played in a rating period. The Elo system's K factor (usually 16 or 32, but Sonas argues for 24) simply multiplies the discrepancy between the expected score of the player (based on his previous rating) against the average of the opponents' ratings and the actual score (reflected by the performance rating). If the previous rating was 2400, for example, and the performance rating was 2432, then a discrepancy in actual vs. expected scores of more than 2 will result in an adjusted rating actually higher than the performance rating, even if K is only 16.
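The overshoot described above follows directly from the form of the Elo adjustment. The sketch below uses the usual update rule (new rating = old rating + K times the difference between actual and expected score); the scores of 12.5 and 10.0 are made-up numbers chosen only to give a discrepancy of 2.5 points.

```python
def elo_update(old_rating, actual_score, expected_score, k=16):
    """Elo adjustment: K multiplies the actual-minus-expected score."""
    return old_rating + k * (actual_score - expected_score)

performance_rating = 2432  # the performance rating from the example in the text

# Hypothetical rating period: expected 10.0 points, scored 12.5.
overshoot = elo_update(2400, actual_score=12.5, expected_score=10.0, k=16)

# A discrepancy of exactly 2 points lands exactly on the performance rating.
boundary = elo_update(2400, actual_score=12.0, expected_score=10.0, k=16)

print(overshoot, boundary)  # 2440.0 2432.0
```

Any discrepancy above 2 points therefore pushes the new rating past 2432 when K is 16, which is why the adjustment is not a true weighted average.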
The Glicko system has a similar adjustment calculation, though the value of K is no longer fixed, but depends on the 'rating deviations' of the players. However, the same overshooting of performance ratings is possible, especially when large numbers of games are played. The Glicko and Elo systems have a natural 'self-correction' property, in that if a player's rating is overadjusted, as in the above example, it will tend to be brought back into line by later results, which would not be expected to reflect the overadjusted rating. Errors are damped out of the system, though of course new errors are constantly introduced.
The Edo system also has a natural balance between inertia and responsiveness to new results, with a parameter that determines the relative influence of the two - namely, the number of hypothetical self-games between adjacent years for a single player. The more self-games, the greater the inertia, and the fewer self-games, the more responsive the rating is to current performance. In fact, the inertia not only operates forwards in time but backwards as well - a player's rating is influenced by both the previous year's and the subsequent year's rating. And here, it really is more of a weighted average between performance rating and the rating of neighbouring years. In the example described above, the player who has a 2600 performance in one year between years with a 2400 rating cannot jump up over the performance rating. There is also naturally a greater influence of current results when they are based on a larger number of games. For example, if a player has 60 games in the current year at a performance rating of 2600, but 30 self-games with each of the neighbouring years at 2400 (50% scores against players rated 2400 - actually the same player, of course), then the two sets of 60 games pull equally hard and the current rating will be 2500. If fewer than 60 games were played in the current year, then the current rating will remain closer to 2400.
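The 60-versus-60 example can be reproduced with a small calculation. This is a simplified sketch, not the full Edo computation: it holds the neighbouring years fixed at 2400 (in the real system all years are estimated together), it models a 2600 performance as a 50% score against 2600-rated opposition, and it assumes the logistic expected-score formula on the usual 400-point scale. The function names are illustrative only.

```python
def expected_score(rating, opponent):
    """Logistic expected score per game on the 400-point scale."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - rating) / 400.0))

def balanced_rating(real_games, performance, self_games, neighbour_rating):
    """Rating at which expected total score equals actual total score.

    Real games: 50% against opposition rated `performance`.
    Self-games: 50% against the (fixed) neighbouring-year rating.
    """
    actual_total = 0.5 * real_games + 0.5 * self_games
    low, high = 0.0, 4000.0
    for _ in range(100):  # bisection; expected total rises with rating
        mid = (low + high) / 2.0
        expected_total = (real_games * expected_score(mid, performance)
                          + self_games * expected_score(mid, neighbour_rating))
        if expected_total < actual_total:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0

# 60 real games at 2600 against 30 + 30 self-games at 2400.
equal_pull = balanced_rating(60, 2600, 60, 2400)

# Only 20 real games: the self-games dominate and the rating stays nearer 2400.
fewer_games = balanced_rating(20, 2600, 60, 2400)

print(round(equal_pull), round(fewer_games))
```

With equal numbers of real games and self-games the balance point is exactly midway, at 2500, and with fewer real games it falls between 2400 and 2500.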
Time symmetry
The influence of both preceding and subsequent years' ratings on the current rating is a natural consequence of the Edo method. If Staunton-1845 played 30 hypothetical games against Staunton-1844 then Staunton-1844 also played 30 games against Staunton-1845, obviously. Thus, there is a time symmetry built into the Edo method. This seems a distinct advantage for a historical retrospective rating system (though not for a continual update rating system for current players). If we have very few results for a certain player in a given year, but we know that in the following year this player played a great many games indicating a rating of 2400, then we have good reason to expect that in the given year the rating should be close to 2400. In fact, if the few results of the given year by themselves indicate a rating of 2100, this should not be given too much weight in the light of the strong evidence from the following year of a much higher rating. The Edo system handles this automatically. In themselves, update rating systems do not. No use is made of information from later in time, only earlier information.
Some attempts have been made to introduce additional processing into update systems to address this lack. In Mark Glickman's application of the Glicko system to historical data, for example, he has a post-processing step that smooths the rating results backwards in time from the final rating period. But this is somewhat ad hoc, and does not really fit into the original model on which the Glicko system is built. A more explicit attempt to introduce time symmetry into an update system was the approach taken by Jerry Spinrad, where he simply applied the updating algorithm forwards in time, and then applied the same algorithm backwards in time, as if each backwards time step was actually forwards. He repeated this a number of times, but then it is not clear how many such repetitions is appropriate - there is no natural way to decide when to stop. The Edo system, on the other hand, uses future and past information in an entirely equivalent way, all in one go. (It should be remarked that Glickman's complete theory as detailed in his PhD thesis, actually is time-symmetric, though the approximate methods he has developed from his theory for practical use are not.)
No early career provisional rating period
Another difficulty with update systems, especially for sometimes sparse historical data, is that it can take some time for the number of games to be sufficient for a rating to reflect ability. In the application of the Elo system, for example, there is a period in which a player's rating is considered 'provisional' before there is enough confidence that it is reasonably accurate. In the Glicko system, all players start initially with the same rating, but with a large rating deviation. The uncertainty implied by this large deviation allows the rating to change rapidly (as though the Elo K parameter were large), but nevertheless there is a period of time in which the rating does not reflect true ability, a period which could last a long time if few results were available for many years. In the Glicko system, this is tolerated because the uncertainty is explicitly recorded in the rating deviation.
In the Edo system, however, the most probable rating is always assigned based on the information at hand (including information for later results, as well as earlier). If there are few results over a long period, then ratings still cannot be assigned with great certainty of course, but still, the best guess possible (given the available data) will always be made. There is also a natural measure of uncertainty associated with every rating, that comes simply from calculating the variances in the Bradley-Terry scheme. There is no early 'provisional' period in which ratings have not yet caught up with true ability.
Measuring uncertainty - rating deviations
Some of the recently developed rating systems were motivated by the obvious advantage in taking into account not only a best guess at each player's rating in each rating period, but also a measure of the uncertainty in each player's rating. We never know exactly what a player's rating must be. There is some distribution of plausible ratings, from which we can extract a mean as a 'best guess' and a standard error (standard deviation of the mean), indicating how uncertain the best guess is. The more uncertain a player's rating, the less weight it should have in determining an opponent's rating. The Elo system does not use rating uncertainties. The Glicko system is based on a well-founded mathematical theory of the weighting effect of these rating deviations. Even there, however, there are parameters to choose, which increase or decrease these effects in different ways (the initial rating deviation, which also serves as a maximum possible deviation, and another parameter that governs the rate at which the deviation increases over time while a player is inactive). These choices can significantly alter the resulting ratings. An earlier version of Jeff Sonas' Chessmetrics system similarly attempted to account for rating uncertainties.
I believe that the Edo system has distinct advantages in measuring rating uncertainties. It comes along with a natural measure of uncertainty, since it is really just a large implementation of the Bradley-Terry model, for which variances can easily be calculated. The correct assessment of variances has to take into account the possibility of drawn games, which reduces variance in comparison to standard paired comparison experiments where 'comparisons' always have decisive results (see comparison above to previous version of the Edo system for details). Since the Bradley-Terry model finds an optimum balance between all interactions or comparisons (i.e., game results) between players, it automatically optimizes the effects of uncertainty and allows these uncertainties to be quantified. The Bradley-Terry algorithm, of course, does not know which game results are hypothetical and which are real, and the choice of the number of self-games will also affect the rating deviations.
However, there is another, perhaps subtler, advantage of the way uncertainties are handled in the Edo system over the way they are handled in other systems. Consider a situation, which certainly arises in our historical data set, in which one group of players largely played in isolation from the main bulk of players. There must be some links to the main group, of course, or there would be no way to compare one group with the other and we would have two separate rating pools. Now, a player in the main bulk of players who starts his career at the same time as a player in the smaller, partially isolated group and plays a similar number of games will normally have a similar rating deviation in, say, the Glicko system. However, if the number of games between the smaller group and the main group is very small, then the rating of every player in the small group is tenuous in the sense that a minor change in the result of one of the connecting games would radically change the ratings of the entire small group. In the Edo system, this is all automatically accounted for, whereas the Glicko system, or any similar system, only considers numbers of games and rating deviations of opponents, not more global topological information about the connectivity of the network of players.
An example of this in my historical data set is a group of Italian players who played each other in tournaments in Italy in the 1870's and 1880's. For few of the players in this set of tournaments do we have results in other events linking them to the rest of the chess world and these connections are fairly weak. This group of players was largely isolated. In experiments with the Glicko system for my data set, a player like d'Aumiller, who played in 4 of the Italian tournaments and for whom we have results of 63 games, is assigned rating deviations that are not at all large, but the Edo system recognizes the tenuousness of the ratings of the entire group within which he played and assigns him quite high rating deviations.
Parameter optimization for the Edo system
There is one main parameter in my implementation of the Edo system, and its value needs to be chosen appropriately, namely the number of hypothetical self-games between a player and himself in adjacent years. The testing I did to determine an optimum for this parameter is detailed in the following paragraph. In principle, another parameter is the bias in the self-matches, in other words the score attributed to the player of the earlier and later year. However, as mentioned above, I found no evidence for a need to use anything other than the obvious 50% score here. The offset to set the scale of the entire rating system is also a parameter, though it has no effect on the differences between raw, unadjusted, ratings of any two players. However, it does act, in combination with the mean and standard deviation of the underlying distribution of ratings in general, to determine the size of the adjustment made to the rating of each individual player. I took an offset which set Zukertort's raw rating in 1882 to be 2665. This was a somewhat arbitrary choice, but was made so that after the adjustment, the Edo ratings overall would be reasonably consistent with other historical rating attempts, such as Elo's, which put, for example, von der Lasa's and Anderssen's peaks at about 2600 or so [now considerably above 2600, and thus somewhat higher than Elo's estimates for these two players (Feb. 2014)], and Lasker well over 2700. I have taken the mean and standard deviation of the underlying rating distribution to be 1500 and 300 respectively, consistent with Glickman's information, for example, on the very large pool of the US Chess Federation's rated players.
I conducted considerable testing to determine an optimal size of the self-matches. Too few self-games and performances in individual years cause wild fluctuations in ratings, and too many keep a player's rating too flat for too long. As the number of self-games shrinks to zero, the rating system becomes strictly a reflection of performance in each separate year. As the number of self-games increases towards infinity, the ratings for each player become flat across time, and their values reflect an average across the years for which we have data. The predictive value is optimized for intermediate numbers of self-games. Prediction was evaluated in two different ways. One is to measure discrepancies between real scores and those predicted by ratings in the previous year, and by ratings in the subsequent year. Another is to generate an artificial, but plausible, set of ratings over time for made-up players, generate scores in made-up matches and tournaments between these players, using the usual logistic formula for probability of a win depending on rating differences, and then run the rating system on these data. Since the 'true' ratings are known beforehand in this case, it is possible to measure discrepancies between ratings calculated by the Edo system, and the 'true ratings'. In both cases, the number of self-games that produced the minimum discrepancies (either in scores or in ratings) was around 30, though the minimum was rather broad on the higher end, so that discrepancies were not much worse out to 60 self-games or so. I opted to use 30 self-games.
The adjustment step can only make prediction errors worse, but this is expected since it imposes a different source of information than that provided by the game results alone, a kind of prior common sense knowledge about the relative likelihood of any rating. This background distribution of ratings is assumed to be more-or-less universal.
Data selection
It is probably true that the most significant factor affecting historical chess rating attempts is the choice of data to include. The rather extreme differences for some players between the old version of Sonas' Chessmetrics ratings and those of Jeremy Spinrad or my own Edo ratings were certainly primarily the result of a radically different choice of data. Sonas had the brilliant idea of using the millions of games listed in ChessBase, supplemented at the modern end from other sources (The Week in Chess). ChessBase has an enormous collection of historical games already organized and ready for use, so a huge amount of effort is avoided in collecting the data to feed into the rating algorithm. For later periods, ChessBase may have a quite representative sampling of actual results, but as Jerry Spinrad has pointed out, this may not be true as you go back into the 19th century. One glaring example is the rating of Howard Staunton, who had an old Chessmetrics rating that placed him well above all competition well into the 1850's. Spinrad's ratings don't even put him top in the world after 1846 and by 1851 there is a whole raft of players above him. The reason, Spinrad asserts, is that the set of recorded games available is skewed by the biases of chess journalists of the time, prominent among them being Staunton himself. ChessBase in a sense has an inordinate number of Staunton wins because Staunton liked to record those more than his losses.
Thus, it is better, at least for 19th century ratings, to use only entire match scores and tournament scores rather than individually recorded games, the selection of which may have been biased, and this is the approach I have taken for my own rating attempt. In fact, there is no need to be restricted to games whose moves were recorded, as long as we have a reliable statement of the outcomes. (Sonas' new Chessmetrics system also follows this approach at least for the 19th century, and also uses Spinrad's collected results for 1836-1863.)
A statement such as 'A won a match against B' I do not consider sufficient to include, nor 'A played a match against B and won most of the games'. Neither do I include partial results of just a few games of a match - we have then no idea what the results of the remaining games were, though I have occasionally included them when they include enough games to give a fair idea of the outcome. Otherwise, I include only final numerical match scores. ['Matches' consisting of a single game are included where we know only one game was played (added Jan. 2010)]. Tournament scores I include if a full crosstable is available. If only final scores are available for each player, I include it if the scores appear to add up to a complete single round (or a complete double round, or triple...) all-play-all. In the latter case, the results of individual pairings need not be known, since it is a property of the Bradley-Terry method that only the final scores matter, as long as everyone played the same number of games against everyone else. [Since the last implementation of the Edo ratings, I realized that it is possible in many cases to use tournaments for which we have only partial information. Even partial results often contain information on relative strengths of players that can be incorporated in a mathematically correct way into the results matrix. For example, among all-play-all tournaments for which we have only total scores and no crosstable, I now include those with scores only for some of the players. This still allows us to derive a usable list of subtotals for games among the players for whom scores are known (but not against the other players), as long as we know or can surmise how many players there were in total and how many rounds there were (i.e., how many games between each pair of players). Furthermore, we can even use tournaments like the above when some games were forfeited if we know how many were forfeited and against whom. 
This applies to a number of Philadelphia tournaments from Chess in Philadelphia by Reichhelm, for example, where we can usually assume that the forfeited games were against the lowest rated player(s). We can also partially use results for odds tournaments with players in classes by strength and for which we only have total scores for some of the players. In such cases, however, we can only treat the results as subtournaments between players in the same class (added Jan. 2010).]
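The property relied on above, that in an all-play-all only the final scores matter to a Bradley-Terry fit, can be illustrated with a small sketch. The two crosstables below are invented for four hypothetical players, and the fixed-point iteration used is a standard way of fitting the Bradley-Terry model with draws counted as half a win; it is not claimed to be the exact procedure used for the Edo ratings.

```python
import math

def bradley_terry_ratings(scores, games, iterations=500):
    """Fit Bradley-Terry strengths; return ratings on a 400*log10 scale.

    scores[i][j] is player i's score against player j,
    games[i][j] is the number of games they played.
    """
    n = len(scores)
    totals = [sum(row) for row in scores]  # only row totals enter the fit
    strength = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            denom = sum(games[i][j] / (strength[i] + strength[j])
                        for j in range(n) if j != i)
            updated.append(totals[i] / denom)
        # Normalise so the geometric mean strength is 1 (mean rating 0).
        geo = math.exp(sum(math.log(s) for s in updated) / n)
        strength = [s / geo for s in updated]
    return [400.0 * math.log10(s) for s in strength]

# Single round-robin: every pair plays one game.
games = [[0 if i == j else 1 for j in range(4)] for i in range(4)]

# Two different invented crosstables with identical totals 2.5, 1.5, 1, 1.
crosstable_a = [[0, 1, 1, 0.5], [0, 0, 1, 0.5], [0, 0, 0, 1], [0.5, 0.5, 0, 0]]
crosstable_b = [[0, 0.5, 1, 1], [0.5, 0, 0.5, 0.5], [0, 0.5, 0, 0.5], [0, 0.5, 0.5, 0]]

ratings_a = bradley_terry_ratings(crosstable_a, games)
ratings_b = bradley_terry_ratings(crosstable_b, games)
print([round(r) for r in ratings_a])
print([round(r) for r in ratings_b])
```

Because the fitting equations involve each player's total score and the number of games per pairing, but not the individual pairing results, both crosstables yield the same ratings. This shortcut no longer applies when players have unequal numbers of games against each other, which is why the text restricts it to complete all-play-all events.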
There are a few cases in which I was less strict about my inclusion of results, especially in early years when results are really quite thin. I have only partial results for the Amsterdam 1851 knockout tournament but included them because they do help to evaluate some of these Dutch players at an early date. The same situation applies to the San Francisco tournament of 1858. The London club tournament of 1851 (not the international tournament) was never completed - I include the results of the games that were played. At the Dublin handicap tournament of 1889, we know which players received odds but not what the odds were. Rather than exclude it completely, as a compromise guess between the possibilities I have assumed that these three players all received pawn and two moves from the five non-handicapped players and played even games against each other [I have now omitted the scores of the 3 players who received the unknown odds (Jan. 2010)]. There seems to be an indication that a Berlin club tournament in 1890 was a handicap tournament, but if it was, there is no indication of what the handicaps were or who received them. I have included this tournament without any handicapping information despite the possibility of some inaccuracy if some players received odds [but this inaccuracy is too large to tolerate so this tournament has been removed (Jan. 2010)].
There is one instance [actually several - now marked as 'soft' results (Jan. 2010)] in which it seemed necessary to make up a reasonable-looking numerical match score that was consistent with available information. In Paris in 1821, Deschapelles beat Cochrane 6.5/7 [6/7 in games whose results are known (Jan. 2010)] despite giving him odds of pawn and two moves. According to many sources, Cochrane then challenged Deschapelles to an even match (of an unspecified number of games) in which, to win the stakes, Deschapelles had to win 2/3 of the games. We are told that he failed to do so. While I would not usually include such a result as we don't know how many games were played or what the final score was, it so clearly leaves a distorted picture if it is excluded and precious few results are available in this early period. A score of 6.5/7 [6/7 (Jan. 2010)] normally indicates a rating difference of 446 [311 (Jan. 2010)], but after the adjustment for the odds (see below), we would approximate the difference (in the absence of other information) at 663 [537 (Jan. 2010)]. If Deschapelles failed to get a 2/3 score in an even match, that in itself would indicate a rating difference of less than 120. Because of this huge discrepancy, I wanted to include both matches. So as not to weight the second match too highly, I assumed a score of 3/6 for Deschapelles, which is less than two thirds but not too many games.
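The rating differences quoted for these match scores are consistent with the usual logistic formula on a 400-point scale, under which a fractional score p corresponds to a difference of 400 times the base-10 logarithm of p/(1-p). The sketch below assumes that formula (inferred here from the quoted figures); the function name is illustrative only, and the odds adjustment is not included.

```python
import math

def rating_difference(score, games):
    """Rating difference implied by a match score under the logistic model."""
    p = score / games
    return 400.0 * math.log10(p / (1.0 - p))

diff_original = rating_difference(6.5, 7)   # Deschapelles-Cochrane, 6.5/7
diff_revised = rating_difference(6.0, 7)    # revised score, 6/7
diff_two_thirds = rating_difference(2.0, 3) # the 2/3 threshold of the even match

print(round(diff_original), round(diff_revised), round(diff_two_thirds))
# 446 311 120
```

The same calculation gives zero for the assumed 3/6 score, so including that match pulls the two players' ratings towards each other without the weight a longer match would carry.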
In the case of a few other early matches, I have taken somewhat vague information about the score literally: apparently in 1842 Staunton and Cochrane played about 600 games, Staunton winning about 400 of them. I have taken this to mean that he got a 400/600 score. It could, of course, be meant that he won 400 and drew some of the remaining 200 so that his score would be higher [I have now used a soft result of 4-1 for Staunton - see notes to the matches (Nov. 2010)]. Similarly, in Löwenthal's memoir of Paul Morphy, he says: "In 1849, 1850, and 1851, Mr. Morphy achieved a series of triumphs over the stoutest player in the Union, among whom were Messrs. Ernest Morphy, Stanley, and Rousseau. It is said that out of above fifty games found during these years with Mr. Eugene Rousseau, his young antagonist won fully nine-tenths." We have no idea of scores in the first two cases but I have taken the latter literally and included a 45/50 score for Morphy in a match with Rousseau in 1851 [now split between the years 1849 and 1850 (Jan. 2010)]. This kind of information is too valuable to leave out entirely.
We have only one match result for Petrov in the 43-year gap between his match with Baranov [actually Kopev (June 2013)] in 1809 and his match with Urusov in 1853, a tied two-game match with von Jaenisch in 1844 (according to Di Felice). But there are good indications that Petrov won matches with von Jaenisch in this period, though we don't have scores. The Oxford Companion to Chess by Hooper and Whyld (1st edition, pp. 246-247) says "Petroff won matches against Jaenisch in the 1840s..." Since the data is so sparse, the tied match, despite being only two games, carries a lot of weight in equalizing their ratings. So to give a better representation of the situation, I have included the won matches of the 1840s as a single additional 1-0 result for Petrov in 1844. This effectively gives Petrov a 2-1 advantage over von Jaenisch, which I hope is closer to an accurate reflection of the true results. [Now only 1-0 for Petrov giving odds of Pawn and Move. (Feb. 2012)]
Tournaments (or matches) that took place over the end of one calendar year and the beginning of the next, I have taken as a result for the year in which it finished. This is somewhat of an arbitrary choice, but I did have in mind that if Christmas was in any way a break, we could consider a year to go from Christmas to Christmas. However, there were tournaments that started before Christmas and extended into the new year.
Since I believe that only games actually played out between two players can be taken to indicate playing strength, I have endeavored to exclude all games won by default. On the other hand, in many early tournaments draws were not counted in the scoring, but I have included them as they do provide information on playing strength. The tournament and match scores listed on the player pages reflect this choice and therefore in some cases do not correspond to the published scores that were used to determine placement and prizes.
In one case at least, there is evidence that a game was 'fixed': Grundy's win over Preston Ware at New York, 1880. Grundy was accused of cheating at the time, but it couldn't be made to stick. According to Andy Soltis (Chess Lists, 2nd ed., p.61), however, he "eventually admitted his guilt and was tossed out of American chess." Jeremy Spinrad, in a ChessCafé column called The Scandal of 1880, Part One, reveals that both parties probably had some share of culpability and finds at least hints that other games may have been thrown at this tournament, but these charges seem less certain. Thus, I have excluded the win by Grundy over Ware, but no others.
Odds games
Another difficulty with the early 19th century is the frequency with which matches were played at odds. Whole tournaments were designed with handicaps of various odds for players of different classes, even as late as 1889 or 1890. Jerry Spinrad argues for the inclusion of these results, with an adjustment in the indicated rating difference depending on the odds. Since they were so common in the early period, we have to do something like this or the available results would be too thin to be useful. Hence, I have taken the same approach, though my rating offsets for the various types of odds differ from Spinrad's. He suggested, without justification but as a starting point for discussion, an advantage of 100 rating points when given pawn and move, 200 points when given pawn and two moves, and 300 points when given knight odds. This has the feature that an extra knight makes three times the rating difference of an extra pawn, corresponding to the generally accepted ratio of piece values, if this ratio can simply be translated into ratings.
Rather than accept these suggested rating offsets, however, I tried to estimate the effects from the data itself. The idea is that whenever a pair of players played games both at odds and on even terms, the difference in percentage scores can be interpreted as reflecting the difference in rating advantage. This proved difficult, however, as there were few occasions on which the same pair of players played with and without odds, especially in the same year, and sometimes when they did, the results were contrary to what one would expect.
For pawn and move there is a reasonable case to be made, especially due to the long matches between Dubois and Wyvill in 1846. They played 81 games on even terms and 69 games in which Wyvill was given pawn and move. The score for the even match suggests that Dubois' rating should be 130 points above Wyvill's. The score for the odds match corresponds to a 46 point lower rating for Dubois, without accounting for the advantage. This suggests a 130+46=176 point effect of the odds. Other situations with fewer games, considered collectively, suggest a similar rating effect for pawn and move, though perhaps slightly less. I adopted a 174 point effect for pawn and move. [Nineteenth century sources, by Staunton for example, who was expert at playing with these odds, assessed the disadvantage of giving pawn and move as somewhat less than this (e.g. approximately equivalent to 2-to-1 odds, which is itself equivalent to a rating difference of about 120 points). I have therefore decided to lower my estimate of the rating disadvantage of giving pawn and move to 148 points as a compromise between the indications mainly of the Dubois-Wyvill matches and the statements from contemporary sources (added Jan. 2010).]
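The arithmetic in this paragraph can be checked with a short sketch. It assumes the conventional logistic relationship between percentage score and rating difference (400 points per factor of ten in the odds of winning), which agrees with the equivalence quoted above between 2-to-1 odds and about 120 points. The function name is my own, chosen for illustration, and the Dubois-Wyvill figures are simply the rating differences stated in the text, not recomputed from game scores.

```python
import math

def rating_diff_from_score(score_fraction):
    """Rating difference implied by a fractional score (assumed logistic curve)."""
    if not 0.0 < score_fraction < 1.0:
        raise ValueError("a 0% or 100% score implies an unbounded difference")
    return 400.0 * math.log10(score_fraction / (1.0 - score_fraction))

# Staunton-style assessment: pawn and move is roughly worth 2-to-1 odds,
# i.e. the odds-giver scores 1 point for every 2 scored by the opponent.
two_to_one = rating_diff_from_score(2.0 / 3.0)

# Dubois-Wyvill 1846, using the differences quoted in the text:
even_diff = 130    # Dubois above Wyvill on even terms
odds_diff = -46    # Dubois "below" Wyvill when giving pawn and move
pawn_and_move_effect = even_diff - odds_diff

print(round(two_to_one), pawn_and_move_effect)
```

The two printed numbers are the two ends of the range discussed above: the contemporary assessment and the estimate from the long matches.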
The comparative results for pawn and two moves are much scarcer and sometimes contradictory (there is more than one instance of a player doing better when giving pawn and two, than when giving pawn and one to the same opponent). The rating advantage seems to be considerably less than double that of pawn and move. I adopted a value of 226. For knight odds, I have very little to go on, but the comparisons I do have, together with some guesswork, led me to adopt a 347 point rating advantage. This is not triple the effect of pawn and move, of course, but I don't think that such a ratio is necessarily appropriate. I have very few results for rook odds games but guess at a 500-point effect. In one instance, I have included a match at odds of the exchange (Wayte-Löwenthal, 1852) and have guessed at an effect of 300 points [now lowered to 148 points, taking into account Staunton's assessment of these odds as equal to pawn and move odds, and applied also to Löwenthal-Brien, 1853 (added Jan. 2010) and other matches (added Oct. 2012)]. In these cases, having the inaccuracy of guesses at the magnitude of the effects seemed to be preferable to the effective misrepresentation of results by not including these matches at all.
There are some handicap tournaments in which odds of the move were given, which I take to mean that the player given the odds simply played white in every game. This gives a very minor advantage, that of white over black, which has been estimated by Jeff Sonas to be about 28 points, so I adopt this value as the offset for odds of the move.
[The above considerations have induced me to include results of matches played with even rarer odds, estimating the rating effect as best I can based on statements in contemporary sources where possible. Thus, I have now rated matches and tournaments played with odds of 'rook and knight' (869 points), 'the queen' (1086), 'two moves' (61), 'rook, pawn and move' (651), 'rook, pawn and two moves' (730), and 'pawn and three moves' (261), where my estimated rating disadvantages are given in parentheses (added Jan. 2010). I have now also included odds of giving mate on a specified square, considered equivalent to knight odds (347) (added Oct. 2012).]
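For reference, the offsets adopted so far can be gathered into a small lookup table. The sketch below is only an illustration: the values are the current ones quoted in the text (including the later revisions to 148 points for pawn and move and for the exchange), while the function names, the example ratings, and the idea of simply subtracting the offset from the odds-giver's rating before applying the usual logistic expected-score formula are my assumptions, not a description of the actual Edo calculation.

```python
# Rating disadvantage (in points) of GIVING each type of odds, as quoted in the text.
ODDS_OFFSETS = {
    "move": 28,
    "two moves": 61,
    "pawn and move": 148,
    "exchange": 148,
    "pawn and two moves": 226,
    "pawn and three moves": 261,
    "knight": 347,
    "mate on a specified square": 347,
    "rook": 500,
    "rook, pawn and move": 651,
    "rook, pawn and two moves": 730,
    "rook and knight": 869,
    "queen": 1086,
}

def expected_score(rating_diff):
    """Expected fractional score for a given rating advantage (assumed logistic curve)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

def expected_score_giving_odds(giver_rating, receiver_rating, odds):
    """Expected score of the odds-giver, treating the odds as a rating handicap."""
    handicapped = giver_rating - ODDS_OFFSETS[odds]
    return expected_score(handicapped - receiver_rating)

# Hypothetical ratings: a 148-point favourite giving pawn and move
# is expected to break even.
print(expected_score_giving_odds(2500, 2352, "pawn and move"))
```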
[In addition, matches played by Deschapelles against de Saint-Amant and de la Bourdonnais at even stranger odds could be useful for the determination of his strength in the 1830s and 1840s. These involve playing without a rook or queen, but receiving in compensation some number of additional pawns. Rating disadvantages (or advantages) of giving these odds can again be estimated from statements in contemporary sources about the relative effects of these odds. Thus, giving 'rook for 2 pawns and move' is said to be approximately equivalent to giving pawn and two moves and so I assess it as a 226 point disadvantage. Giving 'rook for 5 pawns and move' is said to be equivalent to receiving pawn and move odds, so I assess it as a disadvantage of -148 points (i.e., an advantage of 148 points). Similarly, giving 'queen for 6 pawns and move' I assess as a 304 point disadvantage, 'queen for 8 pawns and move' I assess as a 0 point disadvantage (i.e., equivalent to playing even), and 'queen for 9 pawns and move' I assess as a disadvantage of -104 points (i.e., an advantage of 104 points). (Added Jan. 2010) I no longer use games played with these last odds (called 'the game of pawns') because it seems a bit too different from regular chess. (Jan. 2011)]
I have implemented the odds effect by introducing an additional 'ghost player' into my huge crosstable for each player giving odds (one for each year they give odds and one for each type of odds given). For example, Staunton in 1845 played matches with Williams, Spreckley and Mongredien, giving in each case pawn and two moves. Thus, I not only have a player 'Staunton-1845' in my crosstable, but a player 'Staunton-1845-giving-pawn-and-two'. The above three matches are counted as being between this 'ghost' Staunton and the three opponents. If we had results of even encounters between Staunton and other players in 1845, they would be attributed to the 'real' Staunton-1845, but we don't have any other results, so the 'real' Staunton-1845 is assigned only the self-matches with Staunton-1844 and Staunton-1846. Then to fix the odds disadvantage of the 'ghost' Staunton-1845 in relation to the 'real' Staunton-1845, we posit a long (100-game) hypothetical match between these two, setting the score to produce the 226-point difference that we have taken for odds of pawn and two moves (the 'real' Staunton-1845 scores about 78.5-21.5 against the handicapped Staunton-1845). In fact, this does not completely fix the difference at the desired value, but carries the weight of 100 games over the usually much smaller number of actual games played, so the final difference will in any case be close to 226 points.
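As a rough illustration of this construction, the sketch below builds the names of the 'real' and 'ghost' players and the score of the 100-game hypothetical match that links them. It assumes the usual logistic expected-score formula; the function names and the returned tuple layout are hypothetical conveniences of mine, not the data structures of the actual Edo program.

```python
def expected_score(rating_diff):
    """Expected fractional score for a given rating advantage (assumed logistic curve)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

def ghost_link(player, year, odds_label, offset, games=100):
    """Return (real_name, ghost_name, real_score, ghost_score) for the linking match."""
    real_name = f"{player}-{year}"
    ghost_name = f"{player}-{year}-giving-{odds_label}"
    real_score = games * expected_score(offset)
    return real_name, ghost_name, real_score, games - real_score

real, ghost, real_score, ghost_score = ghost_link("Staunton", 1845, "pawn-and-two", 226)
print(real, "vs", ghost)
print(round(real_score, 1), "-", round(ghost_score, 1))
```

Under these assumptions the linking score comes out at roughly 78.6-21.4, in line with the approximate figure quoted above. Actual odds games are then entered against the ghost name rather than the real one.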
Casual games
The distinction between casual and 'official' matches becomes less and less distinct the further back into the past you go. So many of the results we have for the earliest periods were not official in any real sense, that to exclude them would leave us with too little information. I agree with Jerry Spinrad that these results do, in fact, give us an indication of playing strength, so I have included them. However, Spinrad argues that in the case of Adolf Anderssen, there is strong evidence that he played in quite a different way in offhand games than in official ones. He would try out wild ideas when nothing was at stake that he would not do when it counted. Thus, Spinrad suggested excluding Anderssen's offhand games. If the same were clearly true for other players, we'd have to do the same for them, but Anderssen seems to be an extreme case. In my calculation of the Edo ratings, I counted Anderssen the 'official' player to be a different person from Anderssen the 'offhand' player and included all the results, though for some matches I don't have any indication of whether or not they were offhand, and have counted them as official in that case. Indeed, the offhand Anderssen is consistently rated 190 to 280 [82 to 288 (Jan. 2017)] rating points below the official Anderssen, which confirms the discrepancy in playing style. 'Anderssen (offhand)' appears in my rating lists as well as 'Anderssen, K.E.A.'.
Correspondence, cable, team and women's events
I have not included any correspondence events, since the nature of this type of play is so different from over-the-board play. However, I see no reason to exclude cable matches, which became quite popular late in the nineteenth century, as long as they consisted of games between individuals played in real time. These include many 'matches' between city or national teams, in which usually players were matched up one to one for a single game each. I do not include any event in which many players cooperated to play each side of a game. It would seem impossible in that case to disentangle the strengths of the individual players involved. I set out to rate as many early players as possible, and there is no reason to exclude women from that enterprise. We have very few results for women players in the nineteenth century, and only two tournament crosstables, but at least one woman turns up in tournaments not specifically designated for women (i.e., "men's" events), a Mrs. E.M. Thorold in 1888 and 1890, or in matches against men, Mrs. H.J. Worrall in 1859 and Mrs. N.L. Showalter in 1894. [Several other women who played against men are now included (Oct. 2013).] I have used the modern 'Ms.' to identify all women players as such.
Unconnected players
The Bradley-Terry algorithm does not produce ratings for a player with a 0% score overall or 100% score overall. Technically, this is because the rating difference with maximum likelihood in such cases is an infinite one. Or rather, the greater the rating difference, the more likely is the observed perfect score. This is entirely reasonable. If a player scores 0/4 against a player with a known rating, we have evidence that the rating difference is more than 338 (which is what it would have been with a 0.5/4 score) but how much more than that, we can't say. With the Edo system's hypothetical self-matches between years for the same player, a score over 0% or under 100% in any year allows that player to be rated in all years. So the only players we cannot rate are those for whom we have nothing but 0% or 100% scores in all years (in many cases we only have a result in one year anyway). These players are removed before processing.
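The removal rule can be sketched as a simple filter over yearly score totals. This is a minimal illustration of the rule as stated above, not the actual Edo preprocessing: the function names, the tuple layout and the sample data are all hypothetical, and a full implementation might need to repeat the pass, since removing one player changes the totals of that player's opponents. The sketch also reproduces the 338-point figure for a 0.5/4 score under the assumed logistic curve.

```python
import math

def rating_diff_from_score(score, games):
    """Rating difference implied by score/games (assumed logistic curve)."""
    p = score / games
    return 400.0 * math.log10(p / (1.0 - p))

def unratable_players(yearly_totals):
    """Players whose total in every recorded year is exactly 0% or 100%.

    yearly_totals: iterable of (player, year, score, games) tuples.
    """
    seen, ratable = set(), set()
    for player, _year, score, games in yearly_totals:
        seen.add(player)
        if 0 < score < games:      # at least one year with a mixed score
            ratable.add(player)
    return seen - ratable

# Hypothetical sample data
totals = [
    ("A", 1850, 0.0, 4),   # only a 0% score: cannot be rated
    ("B", 1850, 2.5, 6),   # mixed score in 1850 ...
    ("B", 1851, 3.0, 3),   # ... so the perfect 1851 score is no obstacle
    ("C", 1860, 5.0, 5),   # nothing but 100% and 0% years
    ("C", 1861, 0.0, 2),
]
print(sorted(unratable_players(totals)))
print(round(rating_diff_from_score(3.5, 4)))   # opponent's edge after a 0.5/4 result
```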
Sources
The results of matches and tournaments I used came from a number of sources. I relied heavily on Jerry Spinrad's collection of results, which he posted on the web, originally on the Avler Chess Forum, but now available here. (In case this becomes unavailable, I have taken the liberty of saving a copy of this document and it is available here.) These cover the period 1836-1863. He used several sources which are in many cases contradictory. In such cases, I usually took the more conservative result (the one closer to a 50-50 score) unless it seemed pretty clear that one was right (in some cases for example, drawn games might sometimes be omitted from the score, in which case, we wish to include them). In a few cases, the results were so wildly different in different sources that I excluded them altogether.
The other primary sources were Vols. 1 & 2 [& 3 & 4 (Feb. 2016)] of Jeremy Gaige's Chess Tournament Crosstables (along with Anders Thulin's extremely useful name index for Gaige's volumes) and the recent Chess Results, 1747-1900 and Chess Results, 1901-1920 [and Chess Results, 1921-1930 (Feb. 2016)] by Gino Di Felice. Gaige's was for many years the most definitive collection of tournament results for the nineteenth and early twentieth centuries. Gino Di Felice has attempted to improve on it, including even more tournament crosstables as well as match results. He has been partially successful, and this is a very valuable addition to the field of chess history, but he has apparently been a little less careful in some respects than might have been hoped, so that some results have to be treated with caution. An interesting (and largely encouraging) critique by Taylor Kingston appeared at the Chess Café site on the internet and I have found other errors or questionable information. A number of supposed 'matches' between Anderssen and other German players are included, some of which may not have been genuine matches, even of a casual kind. For example, the infamous Eichborn-Anderssen games are listed, totalling 28 games with a score of 27.5 - 0.5 in Eichborn's favour. But these games were recorded by Eichborn and it seems highly likely that he simply kept his few wins out of a great many casual games they played over a long period. I have not included these 'matches', nor several other dubious-looking match results of Anderssen's.
Another interesting case is that of the Warsaw tournament in 1884. Taylor Kingston remarks that here Di Felice seems to have outdone the experts, since Szymon Winawer appears amongst the participants, and the one biography of Winawer makes no mention of it. Winawer scored only 10/22 in this local tournament against otherwise mostly unknown players. This seems surprising. In fact, when I first ran the Edo rating program on my full data set, the winner of Warsaw 1884, one Jozef Zabinski, came out with a rating over 2900. Once attention was drawn to the situation, it was easy to see that if Winawer, who was at that time about fourth in the world, scored under 50%, then the winner, who scored 19.5/22, would have to be a great deal stronger, especially since we have no other results for him to temper the effect. A little further exploration revealed the most probable resolution. In his review of Di Felice's book, Taylor Kingston points out that Szymon Winawer had many chess-playing brothers who are often difficult to distinguish in Polish historical records. Thus, the conclusion seems almost incontrovertible: the Winawer who played at Warsaw 1884 was one of Szymon's brothers. After reattributing this result, the Edo ratings returned to a reasonable state.
I have made considerable efforts in revising the Edo rating system to identify players and avoid as much as possible the two kinds of mistakes that are possible: mistakenly identifying distinct players as the same person, and less seriously, mistakenly attributing results of a single player to two or more distinct players. Jeremy Gaige's Chess Personalia was a great help for the identification of players and for their full names and dates, and resolved many difficulties, but many of the more obscure nineteenth century players I was attempting to rate are not listed even here. A considerable number of players in events whose crosstables appear in Gaige's own Chess Tournament Crosstables are not listed in Chess Personalia. In fact, it is often the case that we cannot tell whether a player is listed or not, as we may only have a last name, and there may be multiple players with this last name listed in Chess Personalia with consistent dates. More thorough historical research would be needed to sort out the numerous uncertain identities of players, and it is very likely that I have made errors. I would be happy to hear from anyone who can correct any of this data.
I also consulted Kuiper's Hundert Jahre Schachturniere and Hundert Jahre Schachzweikämpfe. The results of the former are almost all covered by Gaige, who is probably more accurate, and thence Di Felice. The match results of the latter are also included in Di Felice. The Italian web site, 'La grande storia degli scacchi', I found useful as a check, and a source for the occasional minor event not covered in the main sources.
Then I consulted a number of standard reference works, including the encyclopaediae of Sunnocks and Golombek, as well as both editions of the excellent Oxford Companion to Chess by Hooper and Whyld. I also used a few other miscellaneous books, such as Staunton's book on the London 1851 tournament and Wade's Soviet Chess (where the Petrov-Baranov match of 1809 is mentioned [actually Petrov-Kopev (June 2013)]).
Finally, I used reputable-looking internet sites to add some more obscure results here and there, including sites for American, Irish and German chess history for example. Particularly notable is the fantastically rich 'The Life and Chess of Paul Morphy' site by 'Sarah' and several of the columns at ChessCafé.
Comparison to other historical rating results
In the earlier version of this web site (2004), I included for comparison results obtained from the same data set using my own implementation of the Glicko rating method of Mark Glickman. I was not satisfied, however, that this comparison was completely fair. There are difficulties in estimating appropriate parameters for any given data set to use the Glicko system and I may not have made optimal choices. But even if these were to be tuned, there are other difficulties in the implementation of the Glicko system for sometimes sparse historical data. It does not respond well, for example, to the occasionally huge number of games recorded for certain matches, such as between Kieseritzky and Schulten in 1850 or Staunton and Cochrane in 1842. It is certainly better suited to handle regular play between current players, and Glickman's more sophisticated approach to the problem outlined in his PhD thesis would likely do a better job of historical data. I have not attempted this.
I also originally included Spinrad's rating results, but these only cover the 28-year period from 1836 to 1863. As I extended my ratings to the end of the nineteenth century and beyond, it seemed less useful to include this comparison for just a few years.
Thus, I now concentrate on my own rating system. It is, after all, the Edo Historical Chess Rating site. The most interesting comparison might be that between my ratings and the Chessmetrics ratings, which constitute a comparable historical rating effort, in fact a more ambitious one, as they continue up to the present era. However, it would be laborious to transfer this information, which might change with updates to the Chessmetrics site, and in any case one can compare by looking at the two sites if one wishes.
A note on comparison to the new Chessmetrics ratings
As of March 2005, there is an entirely new Chessmetrics rating system that Jeff Sonas has implemented. Not only has he taken a similar data collection approach to mine (using Jeremy Spinrad's data set and Gaige's Vol. 1 as a basis for the nineteenth century), but he has been thinking along somewhat similar lines with the theory as well.
His system is still a 'forwards' system, rather than a time-symmetric system like Edo (i.e., no later results are used in calculating a rating at a particular time). However, he does use an iterative 'simultaneous' calculation, applied to results over the 4-year period leading up to a particular rating period, so it is not strictly an 'update' system either. The results over that 4-year period are weighted so that the further back in time they are the weaker their effect on the current rating. Because of this 4-year memory of his system, each player gets a rating for 4 years after their last result, whereas the Edo system cuts them off immediately. However, the Edo system will give a rating over a gap of inactive years between active ones, no matter how long (with larger uncertainties in the middle of the gap, of course), whereas the Chessmetrics ratings will stop 4 years into a gap and restart when later results begin again.
Sonas has managed to refine the dating of results to the point where his rating period is a month, not a year! So each month, a new 4-year (48-month) period is used to calculate new ratings for each active or recently active player. For some early results, the month is not known, and in those cases the results are spread out over the 12 months of the calendar year.
There is another point of similarity between the new Chessmetrics system and the Edo system: the use of hypothetical reference games (like the earlier version of the Edo system, but not the current one) to provide continuity in player ratings over time. As well as the actual match and tournament results over the 4-year period leading up to a given month, his formula includes two factors (like additional hypothetical game results), one which pulls a player's rating towards the average opponent's rating (during the simultaneous calculation) and another which pulls the rating toward a rating of 2300. The implementation of these last factors, however, gives them a much stronger effect in the Chessmetrics system than the similar factor did in the old Edo system. Every player in every rating period is considered to have this hypothetical even score against the 2300-rated reference player. In periods where the player has a large number of results, the reference score has only a very small effect, of course, and when there are few results, it has more pull, as it should to counteract the larger sensitivity to random fluctuations in scores. However, when a player has no results at all over a period of time, the effect is to pull the rating increasingly strongly back towards 2300. Sonas argues that this could be useful in a rating system as a way to encourage current players to remain active, but is certainly less ideal in an attempt to make an optimal assessment of the strength of a historical player at a given moment. Note, for example, the periodic dramatic drops in Chessmetrics ratings for Lasker, in periods where he was not active. On resuming active play, his Chessmetrics rating always rapidly pulls up again to a level similar to where it was before the gap, suggesting that in reality, his strength had been at that level all through the gap. 
[Now that the Edo ratings extend into the 20th century, however, it is interesting to note that Lasker's Edo rating drops significantly between 1899 and 1903, only to rise again by 1914 to a level similar to his 1899 level. Thus, at least one of the drops in Lasker's rating seen in Chessmetrics also appears here. This actually reflects weaker results in this period, especially at Cambridge Springs 1904. (Feb. 2014)]
The reference game against a 2300-level player has another questionable effect, acknowledged by Sonas, that lower rated players tend to be pulled up towards 2300 when there is little information from actual games. This was also a problem with the earlier version of the Edo system, but I have avoided it in the new version by making the adjustment step pull players towards a mean rating of 1500, rather than 2300, and doing this individually after the initial raw rating calculations, so that the adjustments do not pull the average of the entire pool down to 1500.
One other quibble with the Chessmetrics ratings is the occurrence of anomalies like the inordinately high peak of Kieseritzky in January 1851. It seems likely that this is caused by a too-large effect of a 151-game match with Schulten in 1850.
By and large, the new Chessmetrics system makes a very interesting comparison to my Edo system, with some similar ideas implemented in a different way. The site is very impressive with a fantastic amount of statistical detail, and well worth a visit.
New developments - 'TrueSkill Through Time' and 'Whole History Rating'
[(Added Jan. 2010) Since this site reached a stable form in 2006, two additional historical chess rating attempts deserving mention have appeared.
Pierre Dangauthier developed a rigorous Bayesian theory for historical chess ratings as part of his doctoral thesis at INRIA, in Grenoble, France, completed in 2007 (where I had the good fortune to be asked to serve on the jury). His method, called 'TrueSkill Through Time', uses different assumptions about the distribution of rating changes for a given player from year to year and different algorithms for calculating the ratings, but otherwise works in a very similar way to my Edo Ratings. In his thesis, Dangauthier applied his method to two sets of data, one from Divinsky (which is limited to 64 players) and the other from ChessBase, where, unfortunately, his model assumption that the mean rating of players is always the same every year is simply inconsistent with the data set, and thus the results are severely distorted. The method and the application to data from ChessBase were published in the proceedings of the Neural Information Processing Systems conference in 2007.
The basic idea behind the Edo Historical Chess Ratings, that of using hypothetical match scores between time periods for a given player, was independently reinvented by Rémi Coulom, who called his method 'Whole-History Rating'. He acknowledges the Edo Ratings as an earlier implementation of this idea, but focusses on algorithmic efficiencies to allow rapid calculation of the ratings for use at online game servers. My initial hesitation about the utility of this approach to updating current ratings was simply a concern that current players would not be happy with all their previous ratings changing as they played new games. However, this has not deterred Coulom and he has found a way to implement the approach for millions of players with updates being done in real time, though at the expense of losing accurate calculation of rating deviations. For my purposes speed of calculation is less important and I wanted to have reasonable estimates of rating deviations, so I am content with slower algorithms.
Both Dangauthier and Coulom connect players with the prior distribution of all players at the time of that player's first appearance, and thus their methods are not fully time-symmetric. In my first implementation of the Edo ratings, I spread this correction over the first and last years of a player's (known) career, so that it was time-symmetric at least, but why should we single out the first appearance or first and last appearance of a player to receive this Bayesian correction? The effect is to pull these endpoint ratings for a player toward the global mean. My global adjustment step avoids this problem, but is also not entirely satisfactory from a theoretical point of view, since it is not integrated into the main maximum likelihood calculation.
One additional point worth mentioning here is that neither Dangauthier nor Coulom has grappled with the issue of non-stationary data, which is unavoidable in the context of using an uneven historical record of results. In both cases, their implementations assume a constant mean rating over all players in the system. But, at least in the historical record, this is not likely to be true, particularly as we dig back deeper into the nineteenth century, where we mostly only hear about the very strongest players, while by the beginning of the twentieth century, and much more so today, we have records of chess events at many levels, from international tournaments among top masters down to local club events. Simply put, the problem with assuming a constant mean is that if we later add a lot of results from weak players in a given period, then the ratings of everyone above them rise over that period to keep the mean constant. This is unreasonable, and leads to the huge distortions in Dangauthier's estimates of historical ratings, where for example, Anand is rated well over 1000 points above Capablanca and Steinitz 200 points above Capablanca. This is not a criticism of Dangauthier's mathematics, which is solid, just the assumption of a stationary mean for the ChessBase data set. When he used Divinsky's data set for 64 strong players over history, he obtained much more reasonable results. For Coulom, this may not be so much of an issue, as long as the distribution of playing strengths of players making up the rating pool for an online server remains fairly consistent over time. The Edo Ratings, as described above, avoid the assumption of stationary data, and implement the necessary correction for the overall distribution of players as a global adjustment.]
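[The constant-mean problem can be seen in a toy calculation. The sketch below is purely illustrative and is not Dangauthier's or Coulom's algorithm: it assumes that the relative strengths within a period are already known, and merely anchors them so that the pool mean equals a fixed value. The player names, the relative strengths and the target mean of 2000 are all hypothetical.

```python
def anchor_to_mean(relative_strengths, target_mean):
    """Shift all ratings by one constant so that the pool mean equals target_mean."""
    mean_rel = sum(relative_strengths.values()) / len(relative_strengths)
    shift = target_mean - mean_rel
    return {name: rel + shift for name, rel in relative_strengths.items()}

# A period in which only three strong masters are on record
masters = {"master_a": 0, "master_b": -100, "master_c": -200}
before = anchor_to_mean(masters, 2000)

# The same period after results for two much weaker club players are added
club_players = {"club_x": -800, "club_y": -900}
after = anchor_to_mean({**masters, **club_players}, 2000)

rise = after["master_a"] - before["master_a"]
print(before["master_a"], after["master_a"], rise)
```

Nothing about the masters' results has changed, yet in this toy example each of them gains 300 points once the weaker players enter the pool, simply because the mean is held fixed.]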
Results
(Updated Jan. 2017) The results speak largely for themselves, but here are a few general observations.
Some interesting cases
Future enhancements
(Updated Jan., May, Aug., Nov. 2010, Jan., Mar., June, Aug., Nov., Dec. 2011, Feb., Oct. 2012, Jan., Aug., Oct. 2013, Jan., Feb., Apr., May, Aug., Nov. 2014, Mar., Jul., Aug., Dec. 2015, Feb., May, Jul. 2016, Jan. 2017) I have managed to extend my ratings to the year 1921, using all of the data I was able to get my hands on up to 1902 (though there is much more still to go through), and easily accessible data for important events in the years 1903-1921. I thought originally that the Edo method would run into computational difficulties in trying to extend the data set still further, partly because of memory requirements to store all the results, but mainly because it involves a massive matrix inversion. However, I have found an efficient way to invert large matrices with sparse block structure like the one that occurs here, with years as blocks, and this also gave me a way to store and handle the data in a much more memory-efficient way. Thus, it now looks possible to go considerably further into the 20th century, still processing the entire data set simultaneously. But if I keep using all results available, the number of players in the system will keep growing by hundreds for every additional year rated, and I will run into computational limits again. So far I have started to be more selective about what results to include beyond 1902. Also, there is the very real question as to whether I will ever find time to enter all the data. The second volume of Di Felice's Chess Results series, covering 1901-1920, is heftier than the first volume covering the century and a half from 1747 to 1900, and the rest of the series up to 1980 (so far) takes up a significant swath of shelf space.
Despite recent improvements to my data set (see below), there is still the need to make the often tricky identification of players more reliable, and to accurately locate events (matches and tournaments) in time (and place). It is often difficult to distinguish between players with similar names, particularly common names like Brown, Taylor, Smith, Schwarz, etc. There are errors in source material that compound the problem. The number of such difficulties also increases dramatically as the time period is extended, and the number of players grows. There are also a great many discrepancies among sources in information about matches and tournaments, especially, but not only, the more obscure events (dates, numbers of games, results, and even players involved).
I have begun to do more historical research, in an attempt to improve the accuracy and reliability of the information used to calculate the Edo Historical Chess Ratings. Secondary sources are extremely variable in quality, so I have tried to note discrepancies in information between the sources I've consulted and I have included references for all the information I've used (on the player and event pages). However, it is clear that a significant improvement in accuracy will require going through primary sources (as close to contemporary with the events as possible). The web is becoming a valuable source of nineteenth century material, making such research considerably easier than it used to be, and I have access to a few other sources. I have begun to go through some of this material but don't know how much time I'll be able to devote to it. One way to look at the problem is that, unlike a biographer, who primarily needs to look for material on a single individual, I am trying to do historical research on over 5000 (so far) early chess players. It is probably foolish to think I can do more than scratch the surface. The experience makes me respect the work of previous chess data collectors, like Gaige and Di Felice, and forgive them their errors.
Previously, I mainly used such 19th century sources as are available to me for making sporadic checks of results of events I already knew about, while picking up other events occasionally. I have now begun to mine these sources more systematically, though so far I have reached only 1866 for most of them and 1878 for the important Illustrated London News column.
I plan to continue deep coverage of the 19th century mainly from primary sources, but may continue to extend shallow coverage of early 20th century events. I thought previously that 1914 seemed a natural stopping point, because of the decrease in chess activity during WWI. I have now, however, extended to 1921, the year of the Lasker-Capablanca match, which makes a nice round century from the de la Bourdonnais-Cochrane-Deschapelles event of 1821, the first major event I have (not counting a few less important and more tentative earlier matches).
While my attention has been focussed for the last few years now on historical issues, I still consider the theoretical problem of calculating historical ratings not to be satisfactorily solved. Dangauthier and Coulom have gone to the trouble to lay a rigorous foundation for the Bayesian approach I initiated, but I don't believe any of us have yet figured out the 'right' way to implement the Bayesian correction for the prior distribution of all players, so as to avoid bias (e.g. first appearance treated specially) but to respect the continuity of strengths of individual players over time (one should not consider a player to be re-selected from the prior distribution at each appearance).