
Thursday, March 13, 2008
Microsoft has just released a CTP of its upcoming HPC Server 2008 product (the next version of Windows Compute Cluster Server 2003). The MPI library has undergone a series of performance enhancements (e.g. Network Direct for more efficient message communication), and the scheduler now provides an easier and more interactive way to execute kernels of work (based on the scheduler's new SOA capabilities for job submission and execution). I'm just starting to kick the tires, and existing MPI and OpenMP apps run just fine on the new version. Next up is to start playing with the new scheduler...
The CTP also includes numerous other improvements, including better setup and administration support. To download, click here to join the beta program.
Wednesday, February 20, 2008
During the first run of our Windows Compute Cluster Server class last week in Irvine, we discovered that when you install the SDK for Compute Cluster Pack, the wrong version of msmpe.lib is installed. In particular, the 64-bit msmpe.lib file, which gets installed into C:\Program Files\Microsoft Compute Cluster Pack\Lib\amd64, is actually the 32-bit library. So when you compile in 64-bit mode and link against the 64-bit msmpe.lib, you get unresolved symbols for the MPI functions: MPI_Init, MPI_Send, MPI_Recv, MPI_Finalize, etc. Microsoft has built a correct version of the library, which I've made available here for download.
Some background... The SDK for Compute Cluster Pack contains the MPI programming libraries for Windows Compute Cluster Server. There are 2 libraries each for 32-bit and 64-bit. For normal MPI programming, you link against msmpi.lib. For MPI programming with MPE-based event tracing, you link against msmpe.lib. The 32-bit msmpe.lib file is fine; the 64-bit msmpe.lib file is the one that's incorrect.
Monday, November 19, 2007
At Supercomputing last week, Microsoft announced the next version of its HPC product: HPC Server 2008. The current version (“Windows Compute Cluster Server” or WCCS) is built upon Server 2003, while this next version is built upon Server 2008. I'll be blogging about v2 in the coming weeks; for now you can find additional information here. In particular, there's technical information available as a Word .docx, and you can download beta1 if you want to play.
Speaking of playing... I mentioned in an earlier post that you can build your own cluster with very little hardware, e.g. I'm using 2 dual-core Mac Minis. Phil Pennington just posted how to set up a cluster on your laptop / desktop using virtualization, very cool. Just add lots of RAM, and hopefully you have a few cores free on your CPU :-)
Friday, November 16, 2007
Michael Wolfe, PhD, is a well-known compiler writer and parallelization guru, and currently works for The Portland Group. He gave a vendor talk at Supercomputing this week (Reno, NV) on what he calls “Data Streaming”. He started by showing a timeline of CPU clock rates from 1997 to 2007. There was a clear trend: from 1997 to 2002, the clock rates followed Moore's law and grew by a factor of 10 --- from 300MHz to 3GHz. But from 2002 to 2006, the rates barely grew at all (3.8GHz), and in 2007 the clock rates actually dropped back to 3GHz. So we appear to have hit the wall in terms of CPU clock rates. As you all know, what chip manufacturers are doing now is going multi-core to offer increased performance.
But forget about multi-core for a moment. Let's look at just one core running at 3GHz. In the good ol' days of compiler optimization, we optimized by reducing instruction count / instruction cycles. We have become pretty good at that, to the point that we can't feed the CPU with data fast enough --- Mike did some back-of-the-envelope calculations that I won't repeat here, but basically showed that optimized, compute-intensive code needs 16-24 GB/s of bandwidth to memory. CPUs today don't have anywhere near that bandwidth to RAM.
Okay, so add another core. Michael showed this very simple picture of a two-headed straw: “Imagine you're at your favorite diner sharing a milkshake with your significant other” (my words). Both of you are pulling through the same base --- i.e. with two cores you still have only 1 pipe to memory. This problem is going to get bad very quickly, since the # of cores will grow exponentially (quad-cores are getting common, 8-cores by next year, etc.). But the # of pins to memory will not grow exponentially, though it will grow (AMD's new Barcelona quad-core chip has 2 pipes to memory). Think of the multi-headed straw as you ponder the purchase of your next quad-core box...
The answer is to (a) throw more and more cache at the CPU (L1, L2, L3), and (b) start programming & optimizing for the cache --- “to the point that you consider the cost of executing an instruction to be $0.00” (Michael's words). So “Data Streaming” is the idea that instead of always aggressively doing computation, you instead aggressively fill the cache with data you will need, and then do computation once the data arrives. If you have 2 cores, then core #1 can be reading data while core #2 is computing; when core #2 runs out of data, it starts reading data while core #1 executes. This sounds crazy --- to have an execution core basically spend its time waiting for data --- but it turns out that you get much better performance (in compute-intensive programs) than having both cores go full-speed ahead executing instructions, and essentially both waiting for data. Michael and a colleague did this for a SPEC benchmark, and got a speedup of around 1.5-1.7x on a 2-core CPU (vs. essentially no speedup when the code was optimized for instruction counts).
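To make the "fetch first, compute second" idea a bit more concrete, here is a minimal single-threaded C++ sketch. It is not the code from the talk: the talk's version splits the two phases across cores, while this one only shows the separation of a fetch phase from a compute phase on one core. The function name, the block size of 4096 doubles (32 KB), and the 64-byte cache line are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed block size: 4096 doubles = 32 KB, picked to roughly fit an L1 data cache.
constexpr std::size_t kBlock = 4096;
// Assumed cache line: 64 bytes = 8 doubles.
constexpr std::size_t kDoublesPerLine = 8;

// Hypothetical example kernel: sum of squares, processed block by block.
double sum_of_squares_streamed(const std::vector<double>& data) {
    double total = 0.0;
    volatile double sink = 0.0;  // volatile so the compiler keeps the "touch" loads
    const std::size_t n = data.size();

    for (std::size_t start = 0; start < n; start += kBlock) {
        const std::size_t end = (start + kBlock < n) ? start + kBlock : n;

        // Phase 1: stream the block in by touching one element per cache line.
        for (std::size_t i = start; i < end; i += kDoublesPerLine) {
            sink = data[i];
        }

        // Phase 2: compute on data that should now be sitting in cache.
        for (std::size_t i = start; i < end; i++) {
            total += data[i] * data[i];
        }
    }
    return total;
}
```

Calling sum_of_squares_streamed on a vector of 10,000 values of 2.0 returns 40000. Whether the split actually pays off on one core depends heavily on the hardware (the prefetcher may already be doing this for a simple sequential scan), so treat it as a picture of the structure, not a tuned implementation.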
Here's a simple example of what you can do today to optimize for better cache performance. First, imagine a collection of objects or structs. Second, suppose you have to run through the collection, touching a field of each object / struct:
for (int i = 0; i < N; i++) a[i].field++;
Every time you read from memory, the hardware reads a whole cache line --- typically 64 bytes on current x86 CPUs. By using an array of objects / structs, what you just did was pollute the cache with things you are never going to use (the other fields of the struct), to get at the one field you will use. The solution is to move back to the old days of parallel arrays, i.e. using separate arrays to hold the various fields. The loop then becomes:
for (int i = 0; i < N; i++) afield[i]++;
Now when you read a cache line, you'll use those other values in the cache line in future iterations. The next step is to add prefetch instructions (which VS supports) to get ahead:
for (int i = 0; i < N; i++) {
  prefetch( afield[i+8] ); // or something, this is very HW dependent
  afield[i]++;
}
This is starting to approximate the “Data Streaming” idea, in that instead of doing computation, we think first about streaming in the data we're going to need. The point of Michael's talk is that we need to start thinking about this more deeply. I suspect Michael has some good ideas about how the compiler can help us.
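Here is a compilable C++ sketch of the three loops above, for anyone who wants to experiment. The Particle struct, the function names, and the prefetch distance of 16 elements are all made up for illustration. The prefetch call uses GCC/Clang's __builtin_prefetch; with Visual Studio the analogous intrinsic is _mm_prefetch from xmmintrin.h, which this sketch does not use.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record: the hot loop only ever touches `count`.
struct Particle {
    double x, y, z;
    int count;
};

// Array of structs: every cache line fetched also drags in x, y, z,
// which this loop never uses.
void bump_aos(std::vector<Particle>& a) {
    for (std::size_t i = 0; i < a.size(); i++) a[i].count++;
}

// Parallel arrays: the counts are contiguous, so each cache line
// holds nothing but values the loop will use.
void bump_soa(std::vector<int>& counts) {
    for (std::size_t i = 0; i < counts.size(); i++) counts[i]++;
}

// Same loop with an explicit prefetch hint. `distance` is an assumed
// tuning knob; the right value is very hardware dependent.
void bump_soa_prefetch(std::vector<int>& counts, std::size_t distance = 16) {
    const std::size_t n = counts.size();
    for (std::size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        // Hint only: ask for the element `distance` ahead, for writing (1),
        // with high temporal locality (3). Stay inside the array bounds.
        if (i + distance < n) __builtin_prefetch(&counts[i + distance], 1, 3);
#endif
        counts[i]++;
    }
}
```

All three functions produce the same increments; only the memory traffic differs. On a simple sequential scan like this the hardware prefetcher may already hide most of the latency, so measure before assuming the explicit hint helps.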
And here's a related item: the TLB (translation lookaside buffer), which is responsible for translating addresses from virtual memory to physical memory, is another potential area of concern. The TLB is in essence another cache, and is getting killed as we go multi-core --- if the cores are running different instruction streams, the TLB miss rate is growing and this means more trips to RAM, and this means more waiting. Rough rule of thumb: the cost of each memory level is a factor of 10, so CPU to L1 is 10 cycles, CPU to L2 is 100, CPU to L3 is 1,000, and CPU to RAM is 10,000 cycles. So you can see that waiting 10,000 cycles to read something from RAM is a BAD thing for performance.
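To see how quickly those trips to RAM add up, here is a small C++ sketch that computes the average cost of a memory access from the rule-of-thumb figures above (10 / 100 / 1,000 / 10,000 cycles). The function name and the hit fractions in the usage note are hypothetical, picked only to illustrate the arithmetic; they are not measurements.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Rule-of-thumb costs quoted in the post, in cycles: L1, L2, L3, RAM.
constexpr std::array<double, 4> kCost = {10.0, 100.0, 1000.0, 10000.0};

// fraction[i] is the share of accesses served by level i (L1, L2, L3, RAM).
// The shares are expected to sum to 1.
double average_cycles(const std::array<double, 4>& fraction) {
    double share_sum = 0.0;
    double average = 0.0;
    for (std::size_t i = 0; i < fraction.size(); i++) {
        share_sum += fraction[i];
        average += fraction[i] * kCost[i];  // weighted cost of this level
    }
    assert(share_sum > 0.999 && share_sum < 1.001);
    return average;
}
```

With made-up shares of 95% L1, 4% L2, 0.9% L3 and 0.1% RAM, the weighted terms are 9.5 + 4 + 9 + 10, or about 32.5 cycles per access. In other words, under these figures the one access in a thousand that goes all the way to RAM costs as much as the 95% that hit L1.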
The bottom line: getting good performance from today's multi-core chips is not as easy as just creating multiple threads and handing them off to the operating system. Fun times!
Thursday, November 15, 2007
My PhD is in parallelizing compilers, back in the 1990s when supercomputers were hot. Then the field cooled off, as the machines were too costly, the programming models were too difficult, and the auto-parallelizing compilers failed to live up to (unrealistic) expectations. But the field has come roaring back to life in the last couple of years, especially with cheap high-speed interconnects and multi-core CPUs. Microsoft has a number of technologies in the HPC (high performance computing) arena: Parallel LINQ, Task Parallel Library, F#, and Compute Cluster Server.
Microsoft Compute Cluster Server, or CCS, is Microsoft's foray into the HPC cluster arena --- imagine a rack of blades with quad-core CPUs and a high-speed interconnect. CCS was released in 2006, runs on Windows Server 2003 64-bit, and is currently at v1.0 with SP1; v2 was just announced at Supercomputing this week in Reno, and will run on Windows Server 2008 64-bit.
I'm really excited to be working in the field again, finally putting all that PhD work to use :-) Right now I'm part of a team here at Pluralsight developing a 5-day course on CCS: what it offers, how it works, and its supported programming models (basically OpenMP and MPI). Our first course offering is slated for the last week of November, 26-30. For more info on the course, click here.
Surprisingly, it doesn't take that much hardware to get into CCS and experiment. A CCS cluster consists of a head node (that manages the cluster and schedules jobs) and compute nodes (which do the actual work). Since a head node can also serve as a compute node, you can build a cluster with one computer --- any computer capable of running Windows Server 2003 64-bit. Of course, life is more interesting if you have 2 or more compute nodes, which forces you into the world of distributed programming and MPI.
In fact, take a peek at my portable cluster --- I'm using 2 Mac Minis as compute nodes and a Thinkpad as the head node (the other thinkpad is my normal working machine, and acts as a client to the cluster):
The Mac Minis make excellent portable Windows servers. I booted to Mac OS, ran Boot Camp, inserted the Windows Server 2003 CD, wiped the disk clean when it rebooted, and installed. Some of the drivers are missing (e.g. Bluetooth and wireless), but the server hardware I needed --- monitor, hard disk and Ethernet --- was found and installed fine. Once installed, I run them headless (you have to boot with a DVI to S-video adapter attached because the Mini is looking for a monitor, but once booted you can remove the adapter and remote into the machine).
Right now I'm attending the annual Supercomputing conference, which is a great conference and getting better every year. 9,000 attendees, 87 miles of fiber optic cable, and more computing power than most countries. Tomorrow I'll blog about one of the things I learned that blew me away: the cost of executing an instruction is essentially 0, so optimization now is all about data locality and caching. The optimizing compiler world has been turned on its head.
Friday, November 03, 2006
I'm sure you've heard about LINQ, but just in case you haven't, it stands for Language Integrated Query. I just finished a 60-page overview of LINQ, which is available as a PDF from O'Reilly:
http://www.oreilly.com/catalog/language1/?CMP=ILC-2RQ886833906&ATT=language1
The PDF represents the first in a 3-part series on LINQ; part 2 will focus specifically on LINQ for SQL, and part 3 will focus on LINQ for XML. Along with the PDFs I'll be presenting a series of MSDN webcasts on LINQ; I'll let you know when the live presentation dates become official.
LINQ offers SQL-like query support in C# and VB, allowing you to write queries --- against objects, XML documents, relational databases, and more --- with IntelliSense and strict type-checking. For example, given a set of Doctor objects, here's a query to select all the doctors living in Chicago:
var chicago = from d in doctors where d.City == "Chicago" select d;
There's a CTP (May 2006) you can download in order to play with LINQ [1]. The technology is slated to appear sometime in 2007 with the 3.next release of .NET (i.e. the release to *follow* the upcoming 3.0 release). It's very interesting technology, and I encourage you to learn about LINQ if you haven't already. Cheers!
[1] http://msdn.microsoft.com/data/ref/linq/
Thursday, August 17, 2006
For the academics in the crowd...
I was under the impression that Microsoft's Academic Alliance was providing Visual Studio Team System as trial software that expires after 180 days. Turns out that if you contact MSDNAA, you can get a volume licensing key to distribute the full version of VSTS (“Visual Studio Team Suite”) to your students and install on your lab machines. VSTS includes software engineering tools and processes for the entire software lifecycle.
I'm not sure exactly how to go about getting the licensing key from MSDNAA; that's the next step :-)
Saturday, August 05, 2006
I've been a big fan of NDoc for generating professional-looking documentation from XML comment files. Unfortunately, it seems development on NDoc has stalled of late, and support for generating documentation for .NET 2.0 code is still in beta. The good news is that Microsoft has just released a CTP of the tool they use internally, called Sandcastle. This is an early (i.e. rough) release, but a good step in the right direction. Here's a blog entry from MS with more info:
http://blogs.msdn.com/sandcastle/default.aspx
See the bottom of that blog page for the download link. Note it's an early beta with a command-line interface. Cheers!
Sunday, April 02, 2006
Webcast listeners know I'm a big fan of Rocky's work on distributed design. He's designed a framework (CSLA) for distributed N-tier systems, and recently updated his framework for .NET 2.0. If you ever get a chance to hear him speak at TechEd or PDC, do so. In fact, turns out Rocky was recently on .NET Rocks, and you can hear him talk about CSLA 2.0. Here's the link:
http://www.dotnetrocks.com/default.aspx?showID=172
Kudos to Andy M. for the heads up and link. Cheers!
Wednesday, February 08, 2006
During today's webcast on web services (Wed Feb 8), there was a question related to moving your web service from your dev box to the production box. When the web service moves, the URL will change, so what's the best way to protect your client code from this predictable change? We shouldn't need to re-reference and recompile the client app just because the web service moved. The answer of course is that the URL should be a .config setting. The detail I forgot was exactly what property to set at run-time after you read the URL from the .config file.
Duh, it's the .Url property! Let me finish this story, and then tell you an even better one :-) But first, the .Url property. The client starts by creating the web service object (which is really the proxy), and then sets the URL like this:
this.server = new EmployeeWebService.Employees();
this.server.Url = Properties.Settings.Default.EmployeeWebServiceURL;
This assumes you have defined a .config setting named EmployeeWebServiceURL. It may seem backwards to create the web service object first and set the URL second, since don't you need the URL to create the web service object? Nope, because you're really just creating the proxy --- the web service isn't contacted until you make a method call, and that's when you need the URL.
So that's the first part of the story: create a .config setting, and set the proxy's Url property before you call it. So off I go to update my demo code in VS 2005, I bring up the Properties page, click the Settings tab, and behold, the .config setting is already there! Turns out Visual Studio 2005 automatically defines an application-level setting for the project whenever you add a web reference. So in my demo code, in the BusinessTierClient project, there's a setting called “BusinessTierClient_EmployeeWebService_Employees” that contains the URL for the web service. And the proxy is already coded to read this setting, so if you change it, the proxy does the right thing. Very cool.
The only problem is that this .config setting is stored in the component's app.config file, which for a DLL, isn't around at run-time. So to make this work the way you want it to --- i.e. to expose the .config setting in the client-side .exe's config file --- you have to merge the DLL's app.config file with the client-side .exe's config file. We've done this already with other settings, e.g. the connection string needed by the Data Access Tier has to be merged into web.config (for a web service) or remotingserver.exe.config (for a remoting server host). I'll update my demo and repost the demo + slides to the webcasts page; the only change in this case is to the app.config file associated with EmployeeClientGUI.
Learning something new every day... Cheers,