Why pad by 128 bytes?

Posted on 2019-03-11

False sharing occurs when two fields occupy the same cache line and are accessed by separate threads, at least one of which writes. The net effect is that every write invalidates that cache line on the other cores, so a thread that only touches the neighbouring field still pays for the coherence traffic. This is where padding comes in.
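To make the definition concrete, here is a minimal Java sketch of the access pattern that causes false sharing. The class and field names are made up for illustration, and it assumes the JVM lays the two fields out next to each other, which is likely but not guaranteed.

```java
public class FalseSharingDemo {
    // Two hot fields declared together. Assumption: the JVM places them
    // side by side, so they can end up on the same cache line.
    static final class Counters {
        volatile long a; // written only by thread 1
        volatile long b; // written only by thread 2
    }

    // Each thread hammers its own field; neither ever touches the other's.
    static Counters run(long iterations) {
        final Counters c = new Counters();
        Thread t1 = new Thread(() -> { for (long i = 0; i < iterations; i++) c.a++; });
        Thread t2 = new Thread(() -> { for (long i = 0; i < iterations; i++) c.b++; });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return c;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        Counters c = run(10_000_000L);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("a=" + c.a + " b=" + c.b + " took " + elapsedMs + " ms");
    }
}
```

The threads share no data logically, yet each write to one field can invalidate the line holding the other. A naive timing loop like this is only a rough indicator; a proper benchmark harness would be needed to measure the effect reliably.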

Everyone knows that a cache line is 64 bytes (on most x86 processors, at least), so surely padding by 64 bytes is enough to put two fields on different cache lines? Not quite.

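Intel has this great feature called Adjacent Cache-Line Prefetch, documented on the Intel developer zone, that in effect loads two cache lines rather than one. Two fields that sit 64 bytes apart can therefore still interfere with each other, which is why the padding needs to cover 128 bytes.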
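The arithmetic can be sketched in a few lines of Java. This assumes 64-byte lines that the prefetcher treats as 128-byte aligned pairs; the class and method names are hypothetical and the addresses are just example offsets.

```java
public class LinePairMath {
    static final long LINE = 64;  // assumed cache line size in bytes
    static final long PAIR = 128; // assumed size of an aligned pair of lines

    // True if both (non-negative) addresses fall in the same aligned block
    // of the given size.
    static boolean sameBlock(long addrA, long addrB, long blockSize) {
        return addrA / blockSize == addrB / blockSize;
    }

    public static void main(String[] args) {
        long a = 0, b = 64, c = 128;
        System.out.println(sameBlock(a, b, LINE)); // false: different 64-byte lines
        System.out.println(sameBlock(a, b, PAIR)); // true: same 128-byte pair
        System.out.println(sameBlock(a, c, PAIR)); // false: 128 bytes apart
    }
}
```

Fields 64 bytes apart are on different lines but can share a pair; fields at least 128 bytes apart cannot.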

Some numbers can be found in Nitsan's post SPSC revisited part III - FastFlow + Sparse Data. The key measurements are shown under the "Applying lessons learnt" section: experiments Y6 and Y7 differ only in that Y7 is double padded to the full 128 bytes.

There's a nice thread on the mechanical sympathy group with multiple observations. Of particular note is the original poster, Duarte Nunes, who initially saw no difference until he ran his experiment on a NUMA system, which suggests the effect may not be observable on a single-node system.

We can see this double padding in action in JCTools.
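The following is not copied from JCTools. It is a minimal sketch of the general technique of padding through a class hierarchy, with hypothetical class and field names. It relies on the assumption that the JVM places superclass fields ahead of subclass fields, which is typical HotSpot behaviour but not something the language guarantees.

```java
public class PaddingSketch {
    // 16 longs x 8 bytes = 128 bytes of padding ahead of the hot field.
    static abstract class PadBefore {
        long p00, p01, p02, p03, p04, p05, p06, p07;
        long p08, p09, p10, p11, p12, p13, p14, p15;
    }

    // The hot field lives in its own class so that, under the layout
    // assumption above, it sits between the two padding blocks.
    static abstract class HotField extends PadBefore {
        volatile long value;
    }

    // Another 128 bytes of padding after the hot field.
    static final class PaddedCounter extends HotField {
        long q00, q01, q02, q03, q04, q05, q06, q07;
        long q08, q09, q10, q11, q12, q13, q14, q15;

        // Intended for a single writer thread; value++ is not atomic.
        void increment() { value++; }

        long get() { return value; }
    }

    public static void main(String[] args) {
        PaddedCounter counter = new PaddedCounter();
        counter.increment();
        counter.increment();
        System.out.println(counter.get()); // 2
    }
}
```

Splitting the fields across classes is what keeps the padding in place; fields declared in a single class may be reordered by the JVM. A layout inspection tool can confirm where the fields actually land on a given JVM.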