[r-t] PB6 search completes
Mark Davies
mark at snowtiger.net
Sat Feb 6 14:35:57 UTC 2010
RAS asks,
> Is that excluding rotations but including reflections?
Yes, that's right.
> 25 CPU cycles per node on average.
That doesn't seem that brilliant to me. The inner loop should need
substantially less than that; you should be able to squeeze out around
50 operations in 25 cycles, maybe more these days, which should be
enough to generate and check three or four nodes. Of course you are
going to be hit on backtracks, not least because the branch predictors
are probably going to get it wrong and your pipeline will be emptied.
SMC32 will be a bit quicker on a pure cps search, because no falseness
checks at all need be made, but this indicates to me that the inner loop
is no longer the best it could be on a modern core. The inner loop was
hand-optimised for the Pentium MMX architecture, so it manually
interleaves instructions in order to keep both integer units busy, and
makes use of MMX counters to get some mileage out of the FPU where it
can, too. Of course you must also use simple x86 instructions that map
directly to the underlying RISC microcode ops.
On the P6 and later architectures things are in theory a bit easier,
because the out-of-order scheduling means careful hand-ordering of
instructions is no longer needed to keep the ALUs busy. Plus, caching
got a lot better - memory access is now basically identical to register
use if you stay in L1 cache. However, I suspect more effort needs to be
put in to keep the pipelines full and the branch predictors happy to
make optimal use of recent cores.
Of course, as previously mentioned, the biggest improvement on a modern
chip would be the ability to multithread.
MBD
More information about the ringing-theory
mailing list