Fear and Loathing on the Benchmark Trail

A recent Datamation article, “MIPS, Dhrystones, and Other Tales” (June 1, 1986, p. 112) contains a lot about various benchmarking systems. It’s a good article, but it doesn’t prepare the reader for dealing with the performance data supplied by the manufacturers.

Over the past couple of years I’ve spent a lot of time evaluating small UNIX computers. While I didn’t become an expert on benchmarking, I did become an expert on reading between the lines of manufacturers’ claims. Some of my experiences are listed below. WARNING: details have been re-arranged to hide the guilty.

One manufacturer brought me an evaluation unit and a lot of benchmark data that looked great. When I tried to duplicate the benchmarks, the unix consistently ran about 20% slower than their figures. When pressed to the wall to account for the differences (it was a potential $30 million dollar order), the manufacturer admitted the tests had been done on a “hot box” that had been cobbled up - i.e, next year’s model.

Whetstones are good for measuring CPU and memory but lousy for systems that use a lot of disk. One manufacturer who cited high Whetstones was very embarrassed when a 8088 with a RAM disk could compile a five-line C program faster than his 68020 with a very slow hard disk. It seems that C compilers in UNIX use a lot of temporary files, and this manufacturers system had been optimized for high CPU speed. In order to save money he bought a cheap disk, and . . .

Beware of raw disk speed as well. A 37ms seek time looks a lot better than 85ms, but it depends on what you do with it. One manufacturer is in the embarrassing position of having a large, fast (expensive) disk that is hidden behind so many layers of controllers and servers and operating systems that it’s about halfway between a slow Winchester and a floppy. Since their UNIX is as disk-intensive as everyone else’s, this manufacturer has problems. But its raw disk figures look good.

Another manufacturer points with pride to its UNIX disk performance figures and goes on at great length about how much tuning was done in order to get that performance. Unfortunately, what’s optimal for one memory/disk combination is not optimal for another. Worse, the manufacturers rarely make it possible for you to tune your own system. Still worse, your sales and maintenance people can’t tune it either. Since they can’t adjust it in the field or at the sales office, they ship version that are optimized for minimum memory/disk configuration. The source license you need so you can tune it yourself is $60,000 - for an $8,000 computer. And you still can’t duplicate their benchmark figures.

One manufacturer gleefully touted its processor’s speed over a competitors, show how much faster integer arithmetic was its own system. It turns out that its C compiler has 16-bit integers, and the competitors has 32-bit integers. When I ran the test comparing 16-bit to 16-bit and 32-bit to 32-bit, results were considerably different. Two manufacturers provided systems that used the same processor, running at the same speed, using the same speed memory. A test that simply added two integers a million times should perform in exactly the same way on each machine. It didn’t - machine A was about 30% faster than machine B. When my colleagues and I opened up the machines to see if there were other hardware differences, we found that both machines used the same processor and memory boards, provided by a third manufacturer. So we swapped the boards form A into B and vice-versa. A was still faster. The difference turned out to be the C compiler on A generated much more efficient code. When we wrote the tests in assembler, they performed identically.

One software vendor has discovered that his compiler stinks. So it uses a competitor’s compiler to build its system. Of course, it ships its own compiler to its OEMS. We did not discover this until we could not get some of the applications to compile, paid the manufacturer a visit, and a programmer whispered the secret over a beer. We bought a copy of the other compiler to verify what we’d been told. Then we dropped the vendor.

Some of the benchmark programs we used came from Byte. One manufacturer pointed out (correctly) that the programs as shown weren’t very efficient. If we used its modified versions, the advantages of its machine would be even greater.

Well . . . sort of. We found the changes optimized performance for this particular processor. But a different set of changes improved the performance of a competing machine even more.

Another vendor touted its floating point performance on an 8086 system, saying the only way to improve performance further was to use a floating point co-processor. Unfortunately, when you add the co-processor performance doesn’t get any better. It seems that in order to improve the floating point emulations, the floating point instruction set had been bypassed, and the co-processor never got used. Oops!

After all this, I’ve come up with a few relatively simple rules for my own benchmarking:

Don’t use the manufacturers figures.
Don’t use the manufacturers benchmark programs unless you get source code and can run it on all the machines you’re testing.
Do demand a sample unit and try to duplicate the benchmark data. If they won’t duplicate, say (politely) that this looks rather bad. Would the manufacturer care to explain?
When evaluating across different processors, make sure you are testing the same thing. In C and UNIX, watch carefully for the size of data element and the use of registers.
Consider the quality of the compilers. Some programs should be written both in assembler and C, and each version should be tested.
Check the size of generated programs on all machines, and compare the generated code on similar processors.
Give the most weight to the tests that would reflect the customers’ usage, not the developers’.
Ignore raw hardware numbers, especially cycle rates, MIPS, transfer speeds, seek times, and memory wait states. What counts is system performance as a whole. One of the fastest UNIX boxes I’ve used has a rather slow and old processor, but the system was superbly designed and implemented.

Even if you follow all these rules, there is still no guarantee that you will end up with the “best” system for your needs. Often that decision has already been made on other grounds (like price), and data are only sought to rubber-stamp the choice. In that case, you’ll find that external vendors aren’t the only ones who pick and choose among the numbers.

Originally published in Datamation, Oct 15, 1986.

Back to Steve’s home page.
Contact, License and Copy Issues.