Here are some comparisons of actual outcomes (smoothed using a local linear regression, or kernel regression) to theoretical outcomes, where only the difference in ratings is used to predict the probability of winning a single game. The solid lines in these graphs can be thought of as level curves of a three-dimensional kernel-based estimate of the surface described in my sketch of a possible solution to the inaccuracy of the Elo system.
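To make the smoothing step concrete, here is a minimal sketch in Python. It uses a Nadaraya-Watson (local constant) kernel estimate rather than the local linear fit mentioned above, and the Gaussian kernel, the 50-point bandwidth, and the function and variable names are illustrative assumptions, not the actual analysis.

```python
import numpy as np

def kernel_smooth_win_rate(rating_diff, won, grid, bandwidth=50.0):
    """Kernel-regression estimate of win rate as a function of rating difference.

    rating_diff : array of (player rating - opponent rating), one entry per game
    won         : array of 0/1 outcomes for the same games
    grid        : rating differences at which to evaluate the smoothed win rate
    bandwidth   : Gaussian kernel bandwidth in rating points (assumed value)
    """
    rating_diff = np.asarray(rating_diff, dtype=float)
    won = np.asarray(won, dtype=float)
    grid = np.asarray(grid, dtype=float)
    smoothed = np.empty_like(grid)
    for i, x in enumerate(grid):
        w = np.exp(-0.5 * ((rating_diff - x) / bandwidth) ** 2)  # Gaussian weights
        smoothed[i] = np.sum(w * won) / np.sum(w)                # weighted mean outcome
    return smoothed

# e.g. grid = np.arange(-300, 301, 10); empirical = kernel_smooth_win_rate(diffs, wins, grid)
```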

I used data on 415,662 games from just over 7 years of tournament play (from cross-tables.com; goddamn those guys are good), which indicates that the ogive ratings curve used by the NSA seems to systematically underestimate the winning chances of the lower-rated player and systematically overestimate the winning chances of the higher-rated player. This may be due to the total absence of any model of luck in the theory underlying the Elo system, which was developed for chess, where there is essentially no luck component.
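For reference, the ogive curve is just a normal CDF applied to the rating difference. A minimal sketch, assuming the classical Elo scale (a per-player performance standard deviation of 200 rating points, so about 283 for the difference); the constant the NSA actually uses may differ:

```python
from math import erf, sqrt

def ogive_expected_win(rating_diff, sigma_diff=200.0 * sqrt(2.0)):
    """Theoretical probability of winning under a normal-ogive ratings curve.

    rating_diff : your rating minus your opponent's rating
    sigma_diff  : standard deviation of the performance difference; 200*sqrt(2)
                  is the classical Elo assumption, used here as a stand-in for
                  whatever constant the NSA tables actually use.
    """
    return 0.5 * (1.0 + erf(rating_diff / (sigma_diff * sqrt(2.0))))
```

Comparing the kernel-smoothed empirical curve to this theoretical one is what shows the systematic over- and underestimates.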

Note, however, that any alternative ratings curve (estimated from past data, say) would have to be applied to past events in some way, to see whether the proposed alternative continued to match past performance once "revised" ratings were calculated; bear in mind that the pairings in past tournaments would have been different under the counterfactual hypothesis of using a different ratings curve. Tricky stuff.
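For what it's worth, here is a rough sketch of what one such re-rating pass might look like: replay the games in chronological order and update ratings using the alternative curve. The constant-K update, the starting ratings, and the data layout are assumptions for illustration, not the NSA's actual update rule, and it ignores the caveat above that the pairings themselves would have been different.

```python
def rerate(games, initial_ratings, expected_win, k=20.0):
    """Replay past games in order, recomputing ratings under an alternative curve.

    games           : iterable of (player_a, player_b, a_won) in chronological order,
                      with a_won equal to 1 if player_a won and 0 otherwise
    initial_ratings : dict mapping each player to a starting rating
    expected_win    : function from rating difference to expected win probability
                      (e.g. ogive_expected_win, or an empirically estimated curve)
    k               : per-game update factor (assumed constant; real systems vary it)
    """
    ratings = dict(initial_ratings)
    for a, b, a_won in games:
        e_a = expected_win(ratings[a] - ratings[b])   # expected score for player a
        ratings[a] += k * (a_won - e_a)               # actual minus expected
        ratings[b] -= k * (a_won - e_a)               # zero-sum counterpart
    return ratings
```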

The ridiculousness that creeps in from the left in the graphs for players above 1700 comes from the fact that players rated 2000 just don't play any games where they are the lower-rated player by a margin of 300 points. A similar ridiculousness begins to creep in from the right for players rated below 900.
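One crude way to keep that ridiculousness out of the graphs is to blank out the smoothed estimate wherever too few games land near a grid point. The effective-sample-size threshold of 30 below is an arbitrary illustration, not a value from the analysis:

```python
import numpy as np

def mask_sparse_estimates(rating_diff, grid, estimates, bandwidth=50.0, min_weight=30.0):
    """Replace smoothed estimates with NaN where too little data is nearby.

    The sum of kernel weights at a grid point serves as an effective sample
    size; points below min_weight are dropped from the plot.
    """
    rating_diff = np.asarray(rating_diff, dtype=float)
    masked = np.array(estimates, dtype=float)
    for i, x in enumerate(np.asarray(grid, dtype=float)):
        w = np.exp(-0.5 * ((rating_diff - x) / bandwidth) ** 2)
        if w.sum() < min_weight:
            masked[i] = np.nan  # not enough nearby games to trust the estimate
    return masked
```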