OKbridge

An introduction to the "Lehman Rating" System for OKbridge

© Bradley Lehman, 1997
latest update: Section 3a, suggestions, and a few other clarifications added 18-Jun-97; simplified explanation added as a separate page 26-Jan-98

The Lehman rating system for the game of OKbridge is a product of the summer of 1993. In the OKbridge online discussion group, Matthew Clegg invited members to propose methods of recording player-performance statistics, in addition to the scratch average system already in use. I set up experiments using 17,000 hand records supplied by Matt, devised a ranking system, submitted an initial proposal to the discussion group, processed the suggestions that came back, ran more tests, and reformulated the system into its present form. Another OKbridge player, Craig Chase, later translated the analysis program from Pascal into C, and Matt did the final implementation in the early commercial versions of OKbridge.

The explanation below is intended as a brief introduction to how the rating system works. For further details, see the official specification article from 1993.

See also the new explanation, which attempts to show the calculation process through a simple analogy.



Goals

An equitable personal player-rating system should attempt to compare a player's results fairly with his/her own past performance, adjusting for the variables of different partners and opponents. If the player is now playing better bridge than before, the rating should go up; if the player is playing worse than before, the rating should go down. Comparison of the change in one's own rating is more useful than direct comparison of the ratings of two different players (although that is also a goal, to the extent that it is possible).

The rating system, as an improvement over "scratch" scoring (direct averages of all results), should encourage a player to enter into pick-up sessions, with the assurance that he/she will be rated as fairly as possible on his/her own performance at the table. One should not be able to inflate one's own rating artificially simply by always playing with a good partner against easy opponents, as can be done in "scratch" scoring. Nor should one fear an artificial deflation of one's own rating by "playing up" in stronger company, or by playing with a weaker partner.

The rating system should above all tend to reward a player for being a good partner, if possible, rather than rewarding individualistic masterminding. Bridge is a game of partnership cooperation and communication. The only aspect of the game which is independent of partner's ability is declarer play. And it is far more difficult to bid and defend accurately with a bad partner than with a good one. The worse one's partner is, the more bridge ceases to be a partnership game. Therefore, a player's personal rating should reflect that player's demonstrated ability to play well with any partner, superior or inferior.


Calculation

Each player's ratings are recalculated once per week, in the same cycle that generates the next week's set of duplicate boards. Every board which a player has played yields rating points, which resemble matchpoints adjusted against the relative ratings of the four players at the table (i.e., factoring out the effects of partner and opponents). All these ratings per board are then averaged together with a player's backlog of boards from past weeks. In short, this system resembles a scratch-average system, but with handicap adjustments on each board. (IMPs are translated to a similar percentage scale, for a separate IMP Lehman rating.) There is no minimum number of boards per week; all boards played count toward the ratings.

The system calculates the handicap adjustments for each board by comparing the statistically expected result against the actual result. The "expected result" is a percentage score based on the relative strengths of the two partnerships at the table: the amount by which the stronger pair should win in a scratch-scored match of reasonable length. A matchpoint reward is adjusted against this percentage, rather than against the 50% average of scratch scoring. If the actual result is above the expected result, the stronger pair "wins" the board; if below, the weaker pair wins. Whichever pair wins the board, a rating score is assigned so that that pair's ratings go up, and the other pair's ratings go down. Within each pair, then, that score is also weighted by the relative ratings of the two partners.

A simple analogy: Essentially, it is like a gambling pot to which each player at the table contributes an ante equal to his/her own rating. The ante total by each partnership determines the percentage by which that pair expects to outscore the other. (If N-S contribute 90 points together, and E-W contribute 100, N-S expect to get 90/190 = 47.4% of the matchpoints.) If the actual score of the board is equal to the expected score (in this case, 52.6% to E-W and 47.4% to N-S), everyone takes back the amount of his/her own ante, and no one's rating is affected. If instead one pair gets a higher score than expected, each of those players gets back his/her own ante plus a bit more, leaving the rest to be redivided proportionately among the other two players. The amount received back by each player after a board becomes that player's earned rating value for that board. All these individual board ratings are then averaged together at the end of the week. (In a scratch system, by parallel analogy, every player contributes 50 to the pot, and gets back exactly the percentage amount of the actual score. A scratch system assumes that all players have an equal 50 rating, and an equal chance of winning an above-average score.)
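For readers who prefer code to prose, here is a minimal sketch of that pot analogy in Python. It follows the description above literally: each player antes his/her own rating, each pair's pool is the pot multiplied by that pair's actual share, and partners split their pool in proportion to their antes. The function and variable names are mine, and the actual OKbridge implementation may differ in its details.

    def board_ratings(ns, ew, actual_ns_share):
        # ns, ew: the two ratings in each pair; actual_ns_share: N-S's
        # actual matchpoint fraction on the board, between 0.0 and 1.0.
        pot = sum(ns) + sum(ew)              # everyone antes his/her rating
        expected_ns = sum(ns) / pot          # e.g. 90 / 190 = 47.4%
        ns_pool = pot * actual_ns_share      # N-S's share of the pot...
        ew_pool = pot - ns_pool              # ...and E-W takes the rest
        # Each pool is split in proportion to the antes, so both partners
        # take back the same fraction of their own ratings.
        ns_back = tuple(r * ns_pool / sum(ns) for r in ns)
        ew_back = tuple(r * ew_pool / sum(ew) for r in ew)
        return expected_ns, ns_back, ew_back

    # N-S ante 45+45 = 90, E-W ante 50+50 = 100, and N-S actually score
    # their expected 90/190 = 47.4%: everyone takes back exactly his/her
    # own ante, and no one's rating changes.
    e, ns_back, ew_back = board_ratings((45, 45), (50, 50), 90/190)
    # e = 0.4737..., ns_back = (45.0, 45.0), ew_back = (50.0, 50.0)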

The Lehman system further weights the results so that the current week's boards count slightly more heavily than all previous boards. This is done so that the rating will better reflect recent performance, not only a player's mathematical reputation. When a player is relatively unknown to the system, either through being new to OKbridge or through not having played for a long time, that player's ratings will fluctuate more rapidly than those of an established player.
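A sketch of what such a recursive, recency-weighted average could look like, continuing in Python; the decay factor of 0.99 and the bookkeeping are purely illustrative assumptions, since this article does not specify the actual weighting:

    def weekly_update(old_avg, old_weight, week_boards, decay=0.99):
        # old_avg: the player's running rating; old_weight: the (already
        # discounted) count of past boards; week_boards: this week's
        # per-board values from board_ratings() above.  A decay below 1
        # shrinks the backlog's weight, so the current week's boards
        # count slightly more heavily than older ones.
        w = old_weight * decay
        new_weight = w + len(week_boards)
        new_avg = (old_avg * w + sum(week_boards)) / new_weight
        return new_avg, new_weight

Note that only the cumulative rating and an adjusted board count need to be stored from week to week, which is what makes the system recursive and one-pass (see assumption 7 below and the section "Ratings of exclusive partnerships").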


Interpretation of ratings

The meaning of a personal rating percentage in this system, by definition: a player with rating R% should expect to score R% of the matchpoints, playing a match with an equally skilled partner against two players who each expect to average (100-R)%. This is best illustrated by a simple example. If two 54% players sit down to play against two 46% players, the 54% pair should expect to win 54% of the matchpoints in the long run. (For an IMP-scored game rather than matchpoints, a proportional scale is used to predict similar IMP results.) If this percentage of victory actually happens at the table, the ratings of those four players remain constant, because every player did as well as expected.
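In terms of the pot sketch above, the definition checks out arithmetically: the two 54% players ante 108 of the 200-point pot, so their expected share is 108/200 = 54%.

    # two 54% players against two 46% players
    e, ns_back, ew_back = board_ratings((54, 54), (46, 46), 0.54)
    # e = 0.54; if they actually score 54%, everyone takes back exactly
    # his/her own ante: ns_back = (54.0, 54.0), ew_back = (46.0, 46.0)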

Ratings are most reliable for players who play regularly, and with a number of different partners (which allows a representative sample of results). At this point, no separate "reliability index" is calculated to indicate the margin of error of a player's rating. See also the section "Ratings of exclusive partnerships."

A player's rating naturally reflects only how that player has done individually in actual play in the OKbridge field. It does not necessarily indicate a level of general bridge ability. OKbridge involves quite a bit of luck: finding a compatible partner without much opportunity for detailed style discussion, being dealt types of boards that match one's own strengths or style, etc. Often a player gets into situations where partner's or opponents' actions overrule any good or bad decisions one has made; one gets results without really "earning" them. That is inherent in the luck of the game, perhaps magnified by the OKbridge format (where one often has never met any of the other players at the table). No system can factor out this randomness entirely.


Mathematical assumptions of the rating system

1. A good result should be worth more against good opponents than bad ones. If a bad defender makes a silly mistake and hands you the overtricks, you have not earned your top. If you find the only defense to thwart a competent declarer, you have indeed earned your top.

2. A good result is statistically more likely to be attributable to good decisions by the better player in a partnership. Therefore the better player should get slightly more of the credit.

3. A bad result is statistically more likely to be attributable to mistakes by the weaker player in a partnership. Therefore the weaker player should again get a slightly lower score than partner.

3a, added 18-Jun-97. Points 2 and 3 as worded above have apparently caused gross misunderstandings in some players' minds, judging from comments on the discussion group. The confused attitude is usually stated as a complaint such as this: "So the better player automatically gets most of the reward for good boards, and the weaker player takes the bigger hit on bad boards? Not fair!" Clarification: this interpretation/complaint is simply incorrect! Here is an additional true mathematical assumption, stated here for the record, and perhaps more helpful than either 2 or 3: at the end of any hand, it is of course impossible for the computer to determine any relative assignment of merit or blame between the two partners; therefore, the credit must be distributed between them in such a way that the two players' ratings remain in a constant ratio to each other. Both players get back the same fraction of their OWN ratings. For example, if N is getting from the board a matchpoint score that is 113% of her own rating, S is also getting 113% of his own rating: the rating points for the board are distributed so that this happens.

This is analogous to the following: N and S get together to form a company [partnership, board by board], and each invests some amount of money in the stock. When the company is dissolved [at the end of each board], the profit or loss is distributed back in the same ratio as their investments. The system can't determine between them who caused which percentage of the result, when they were functioning as a team, so the best it can do is to pay off equally per the shares they bought. The higher-ranked player, having invested more, either wins more or loses more (arithmetically, in terms of amount above or below his/her own investment)...but the amount of money each goes home with is in the same ratio as the amounts they brought in. If they always play only together, the system can never change the ratio between their ratings on any given board, because there is no way to assess their performance as individuals. And as both gain or lose simultaneously, the player who invested more always gets back a higher score than partner (unless both get 0, of course), having brought more resources (i.e. rated skill) to the company. [That higher score divided by that player's rating is equal to partner's score divided by partner's rating.]
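In code, using the board_ratings sketch from the Calculation section (the ratings 60 and 40 here are illustrative, not drawn from the actual implementation):

    # an unbalanced pair (60 and 40) against a balanced pair (50 and 50):
    # expected N-S share is 100/200 = 50%; suppose they actually score 56.5%.
    e, ns_back, ew_back = board_ratings((60, 40), (50, 50), 0.565)
    # the N-S pool is 200 * 0.565 = 113, so each partner takes back 113%
    # of his/her own ante: ns_back = (67.8, 45.2).  The ratio 67.8 : 45.2
    # is still 60 : 40; the board cannot change the ratio between them.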

4. If a player achieves his or her expected score for a board or a match, the personal rating should remain constant. If, for example, four 55% players play together at a table, the rating system should not lower each player's rating in the event that the match score is 50%. Similarly, four 40% players should not all go up in rating simply because they play together. (These phenomena occur in "scratch" scoring.) Also, the system must be equitable to all players, regardless of whether the opponents are a balanced or unbalanced pair.

5. All these adjustments must be balanced accurately, so that each individual player receives a reward statistically commensurate with his/her own performance on that board.

6. A player's recent results should have slightly more weight in determining personal rating, than past results. The rating should be an indicator of how well a player is playing presently, with a view to predicting how well that player will do in the future.

7. The adjustment system must be relatively simple and easy to implement. A recursive system is by far the simplest, as past boards do not need to be re-analyzed, and only one pass through the results is required. Over a reasonably short period of time, a player's numeric rating should converge quickly to be an accurate indicator of that player's true level of ability.

8. The resultant rating system should encourage players of all levels to play together, with the expectation that the scoring system equitably rewards personal and partnership excellence. Players at all levels should not be made to feel that they should avoid one another.

9. The rating system should never detract from the enjoyment of the game, and it should not affect the strategy of either playing the game or choosing one's partner and opponents.

10. A player's mathematical expectation of winning should still be highest by choosing the best possible partner, but the gap should not be as wide as in a "scratch" system. (Mathematical expectation is (reward * chance of winning).) The "best possible partner" for any given player may be an expert, or it may be a lesser player whose style fits well with one's own, producing a partnership which is better than the sum of the two players' abilities. The system must, of course, assume that the expert is the best possible partner. And the expert should not fear being paired with a novice; to compensate for the reduced chance of winning, the rewards are increased.

11. A match which contains many "swing" boards (tops and bottoms) should be more beneficial to the weaker pair than to the stronger pair. That is, it should be possible for the weaker pair to win more points above their own ratings than they can lose for a bad result (which will presumably happen more frequently). As in high-level tournament bridge, good players should not have to resort to "shooting" unless in desperate need of a short-range surge of points. A policy of sound results, avoiding disasters and accepting the opponents' gifts, wins in the long run. (A worked example of this point, in terms of the earlier sketch, follows this list.)

12. The system should be most accurate for players who generate a useful sample of results with a variety of different partners and opponents. It is designed to measure individual performance in the field, more than the performance of regular partnerships. The basic format of OKbridge (one can pick up a partnership at any time of day, for a session of any length) makes it resemble an individual event more closely than it resembles a tournament field of established partnerships.
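Here is the worked example promised in point 11, again in terms of the board_ratings sketch above (the ratings are illustrative):

    # a (40, 40) pair against a (60, 60) pair expects 80/200 = 40% of the
    # matchpoints.  On an outright top each weaker player takes back
    # 40 * 200/80 = 100, a gain of 60 over the ante; on an outright bottom
    # each takes back 0, a loss of only 40.  Swing boards thus offer the
    # weaker pair more upside than downside.
    e, weak_back, strong_back = board_ratings((40, 40), (60, 60), 1.0)
    # weak_back = (100.0, 100.0), strong_back = (0.0, 0.0)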


Ratings of exclusive partnerships

Any player who plays almost exclusively in one or two regular partnerships will tend to have a less accurate Lehman rating than players who play with a variety of partners in the field. In particular, the ratings of the two players in an exclusive partnership may become almost meaningless with regard to one another (because the system has no way of noticing if one player is improving while the other is not). The partnership's rating, as a sum of two individual ratings, remains accurate within the field, as both players move up or down together with their results. (In fact, the rating of such a regular partnership is more accurate than the rating of a pick-up partnership, because it measures the actual results they have achieved together, rather than predicted results.)

Explanation: The rating system is built upon the assumption that most players will play a representative sample of boards with a number of different partners. It is not designed to reward such behavior, but simply to measure it as accurately as possible.

Any difference between the ratings of two players A and B is of course determined by how they have done when not playing together as partners. Therefore, if A plays a very high proportion of his/her OKbridge sessions with B, the relative ratings of those two players will not necessarily reflect accurately any skill level differences between them. This unreliability develops because the sample of results outside the partnership becomes too small to be useful. When the program attempts to distribute the partnership result percentages fairly between the two players, the only information it has to go on for its weighting is the small sample of results they obtained elsewhere. Therefore, it may artificially magnify any rating differences within the partnership.

Presently, the system has no way to adjust for this, because it does not record the percentage of boards which each player plays in specific partnerships. It simply records cumulative individual rating, and number of boards played (adjusted weekly by the usual weighting factor for more recent boards, see above).


Suggestions for future improvement

If I were to change anything in the system (and right now I certainly don't have the volunteer time to do any of that, so I'll leave it to more available parties), it would be:


Again, for further details about anything explained here, see the official specification article from 1993.

Or, if this present explanation seems too complicated, try the simpler non-mathematical version.


© 1997, Bradley Lehman (and see also my character sketch at the OKbridge picture gallery)