Supercomputing 2007: "Data Streaming"

Michael Wolfe, PhD, is a well-known compiler writer and parallelization guru, and currently works for The Portland Group. He gave a vendor talk at Supercomputing this week (Reno, NV) on what he calls “Data Streaming”. He started by showing a timeline of CPU clock rates from 1997 ... 2007.  There was a clear trend:  from 1997 .. 2002, the clock rates followed Moore's law and grew by a factor of 10 --- from 300MHz to 3GHz.  But from 2002 to 2006, the rates barely grew at all (3.8GHz), and in 2007 the clock rates actually dropped back to 3GHz.  So we appear to have hit the wall in terms of CPU clock rates.  As you all know, what chip manufacturers are doing now is going multi-core to offer increased performance.

But forget about multi-core for a moment.  Let's look at just one core running at 3GHz.  In the good 'ole days of compiler optimization, we optimized by reducing instruction count / instruction cycles.  We have become pretty good at that, to the point that we can't feed the CPU with data fast enough --- Mike did some back-of-the-envelope calculations that I won't repeat here, but basically showed that optimized, compute-intensive code needs 16-24GB/Sec bandwidth to memory.  CPUs today don't have anywhere near that bandwidth to RAM.

Okay, so add another core.  Michael showed this very simple picture of a two-headed straw: “Imagine you're at your favorite dinner sharing a milkshake with your significant other” (my words).  Both of you are pulling through the same base --- i.e. with two cores you still have only 1 pipe to memory.  This problem is going to get bad very quickly, since the # of cores will grow exponentially (quad-cores are getting common, 8-cores by next year, etc.).  But the # of pins to memory will not grow exponentially, though it will grow (AMD's new Barcelona quad-core chip has 2 pipes to memory).  Think of the multi-headed straw as you ponder the purchase of your next quad-core box...

The answer is to (a) throw more and more cache at the CPU (L1, L2, L3), and (b) start programming & optimizing for the cache --- “to the point that you consider the cost of executing an instruction to be $0.00” (Michael's words).  So “Data Streaming” is the idea that instead of always agressively doing computation, you instead aggressively fill the cache with data you will need, and then do computation once the data arrives.  If you have 2 cores, then core #1 can be reading data while core #2 is computing; when core #2 runs out of data, it starts reading data while core #1 executes. This sounds crazy --- to have an execution core basically spend its time waiting for data --- but it turns out that you get much better performance (in compute-intensive programs) than having both cores go full-speed ahead executing instructions, and essentially both waiting for data.  Michael and a colleague did this for a spec benchmark, and got around 1.5-1.7 speedup on a 2-core CPU (vs. essentially no speedup when the code was optimized for instruction counts). 

Here's a simple example of what you can do today to optimize for better cache performance.  First, imagine a collection of objects or structs.  Second, suppose you have to run through the collection, touching a field of each object / struct:

for (int i = 0; i < N; i++)
  a[i].field++;

Every time you read from memory, the hardware reads a cache line --- 8 or 16 bytes (at least).  By using an array of objects / structs, what you just did was pollute the cache with things you are never going to use (the other fields of the struct), to get at the one field you will use.  The solution is to move back to the old days of parallel arrays, i.e. using separate arrays to hold the various fields.  The loop then becomes:

for (int i = 0; i < N; i++)
   afield[i]++;

Now when you read a cache line, you'll use those other values in the cache line in future iterations.  The next step is to add prefetch instructions (which VS supports) to get ahead:

for (int i = 0; i < N; i++) {
   prefetch( afield[i+8] );  // or something, this is very HW dependent
   afield[i]++;
}

This is starting to approximate the “Data Streaming” idea, in that instead of doing computation, we think first about streaming in the data we're going to need.  The point of Michael's talk is that we need to start thinking about this more deeply.  I suspect Michael has some good ideas about how the compiler can help us.

And here's a related item:  the TLB (translation lookaside buffer), which is responsible for translating addresses from virtual memory to physical memory, is another potential area of concern.  The TLB is in essence another cache, and is getting killed as we go multi-core --- if the cores are running different instruction streams, the TLB miss rate is growing and this means more trips to RAM, and this means more waiting.  Rough rule of thumb: the cost of each memory level is a factor of 10, so CPU to L1 is 10 cycles, CPU to L2 is 100, CPU to L3 is 1,000, and CPU to RAM is 10,000 cycles.  So you can see that waiting 10,000 cycles to read something from RAM is a BAD thing for performance.

The bottom line:  getting good performance from today's multi-core chips is not as easy as just creating multiple threads and handing them off to the operating system.  Fun times!


Posted Nov 16 2007, 02:24 AM by joe-hummel

Comments

Michael Wolfe wrote re: Supercomputing 2007: &quot;Data Streaming&quot;
on 11-19-2007 3:36 PM
I'm glad you enjoyed the talk. The only correction I'd make in your summary is that optimized, compute-intensive code needs 16-24GB/sec bandwidth to memory (not GHz).
Joe Hummel wrote re: Supercomputing 2007: &quot;Data Streaming&quot;
on 11-20-2007 5:56 PM
Oops, thanks Mike, getting my units wrong. How about I update the original post? Thanks!

Add a Comment

(required)  
(optional)
(required)  
Remember Me?