[r-t] PB6 search completes

Mark Davies mark at snowtiger.net
Sat Feb 6 14:35:57 UTC 2010


RAS asks,

> Is that excluding rotations but including reflections?

Yes, that's right.

> 25 CPU cycles per node on average.

That doesn't seem that brilliant to me. The inner loop should need 
substantially less than that; you should be able to squeeze out around 
50 operations in 25 cycles, maybe more these days, which should be 
enough to generate and check three or four nodes. Of course you are 
going to be hit on backtracks, not least because the branch predictors 
are probably going to get it wrong and your pipeline will be emptied.

SMC32 will be a bit quicker on a pure cps search, because no falseness 
checks at all need be made, but this indicates to me that the inner loop 
is no longer the best it could be on a modern core. The inner loop was 
hand-optimised for the Pentium MMX architecture, so it manually 
interleaves instructions in order to keep both integer units busy, and 
makes use of MMX counters to get some mileage out of the FPU where it 
can, too. Of course you must also use simple x86 instructions that map 
directly to the underlying RISC microcode ops.

On the P6 and later architectures things are in theory a bit easier, 
because the out-of-order scheduling means careful hand-ordering of 
instructions is no longer needed to keep the ALUs busy. Plus, caching 
got a lot better - memory access is now basically identical to register 
use if you stay in L1 cache. However, I suspect more effort needs to be 
put in to keep the pipelines full and the branch predictors happy to 
make optimal use of recent cores.

Of course, as previously mentioned, the biggest improvement on a modern 
chip would be the ability to multithread.

MBD




More information about the ringing-theory mailing list