And not just one cluster mind you, but two! Anyway, I was fortunate enough to be able to present a half-day tutorial on "Windows HPC Server 2008 --- a developer's perspective". It went reasonably well I think, a good crowd with great questions. One mistake I made was not anticipating my audience well enough: it's obvious to me now, since I used to live in this world as a grad student. But I've been away from it too long, and while GUIs and Visual Studio are my tools of choice today, I forgot that for a majority of the HPC community, the command line and its plethora of tools are the norm. Next time (if there is a next time :-) I'll bridge those worlds better.
The other mistake was not anticipating the nuances of a travelling cluster --- in particular the nuances of cluster networking. Cluster nodes typically have at least 2 NICs: one for the public ("enterprise") network, and one for the internal cluster ("private") network. The private network is often a high-speed networking technology, e.g. InfiniBand. When you install HPC Server 2008, one of the first things you do is select your networking topology for the cluster, designating which NIC to use for the enterprise network, which for the private network, disabling the firewall on the private network, etc. As nodes come under control of HPC Server, their network settings are configured to match the selected topology.
The first complication is that machines now have at least 2 IP addresses, so what does the hosts table look like? Second, what if you have the cluster running in location X with enterprise addresses a.b.c.d, then unplug, fly to Austin for supercomputing, and plug into their local network and get new addresses e.f.g.h? Or worse, what happens if you aren't given local IP addresses at all, so the enterprise network of the cluster appears disconnected?
Well, this is why, the night before my presentation, I got the cluster out of the exhibit hall, wheeled it up to the 4th floor, went into my room, and got the cluster ready. Plugged it in, booted the nodes, and ran my demos. Worked like a champ! Until the last demo, which failed because I forgot to turn the head node into a broker node (to enable WCF-based HPC apps). I tried changing the head node into a broker node, and got an error about the enterprise network not being found. Hmm, strange, but then again, I'm in a new location. So I reconfigured the cluster's networking, telling HPC Server which NIC was the enterprise network and which was the private network. The WCF-based demo worked, I was happy, and went back to the hotel, confident the cluster was ready (I even left it running all night so it would be ready to go in the morning).
I was so confident all was well that I didn't retest my demos (bad idea). I also didn't regression test them after changing the network configuration, which I should have done (bad idea). So of course, what happens? I start the presentation, things are going well, and the first demo (the easiest one!) tanks with a behavior I've never seen before. The processes can't write to the public network share for the client to harvest the results. What's up with that? And I had tested this just 12 hours earlier! Wow, every presenter's nightmare. (Luckily I had a backup cluster, which I eventually started using once someone in the audience reminded me I had one :-)
Here's the issue, and I believe this is new in the final RTM of HPC Server 2008. It turns out that HPC Server stores machine names and IP addresses in the hosts file: C:\Windows\system32\drivers\etc\hosts. On my cluster, with 2 NICs, the file looks like this:
# ManageFile = true
127.0.0.1 localhost
10.20.30.1 HEADNODE #HPC
10.20.30.2 COMPUTE1 #HPC
10.20.30.3 COMPUTE2 #HPC
10.20.30.4 COMPUTE3 #HPC
10.20.30.1 Enterprise.HEADNODE #HPC
10.20.30.2 Enterprise.COMPUTE1 #HPC
10.20.30.3 Enterprise.COMPUTE2 #HPC
10.20.30.4 Enterprise.COMPUTE3 #HPC
192.168.1.128 Private.HEADNODE #HPC
192.168.1.129 Private.COMPUTE1 #HPC
192.168.1.130 Private.COMPUTE2 #HPC
192.168.1.131 Private.COMPUTE3 #HPC
The "true" means HPC Server is managing this file. So if you unplug the cluster and move it, when you boot in a new location the hosts file is wrong (on every node). So anything involving the enterprise network fails (which many apps rely upon for data deployment and aggregation). Likewise, if you move the cluster and reconfigure the enterprise network --- but the local DHCP doesn't give you IP addresses --- then HPC Server can't find a valid public network and again the hosts file is incorrect (since now there is no enterprise network).
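If you want to check for this after a move, here is a minimal Python sketch of the idea: read the hosts file, pull out the lines tagged #HPC, and flag any Enterprise.* entry whose address is not on the network you are plugged into now. The function names, the sample text, and the /24 subnets are my own assumptions for illustration; this is not an HPC Server tool, and the parsing is based only on the listing above.

```python
import ipaddress

# Trimmed copy of the listing above, used as sample input (illustrative only).
SAMPLE_HOSTS = """\
# ManageFile = true
127.0.0.1 localhost
10.20.30.1 HEADNODE #HPC
10.20.30.2 COMPUTE1 #HPC
10.20.30.1 Enterprise.HEADNODE #HPC
10.20.30.2 Enterprise.COMPUTE1 #HPC
192.168.1.128 Private.HEADNODE #HPC
192.168.1.129 Private.COMPUTE1 #HPC
"""


def parse_hpc_hosts(text):
    """Return (ip, name) pairs for every line tagged with a trailing #HPC."""
    entries = []
    for line in text.splitlines():
        body, _, comment = line.partition("#")
        fields = body.split()
        # Pure comment lines have no fields; untagged lines have no HPC comment.
        if len(fields) >= 2 and comment.strip() == "HPC":
            entries.append((fields[0], fields[1]))
    return entries


def stale_enterprise_entries(text, current_network):
    """List Enterprise.* entries whose address is outside current_network.

    current_network is a CIDR string such as "10.20.30.0/24"; the prefix
    length is an assumption you would replace with your site's real value.
    """
    net = ipaddress.ip_network(current_network)
    return [
        (ip, name)
        for ip, name in parse_hpc_hosts(text)
        if name.startswith("Enterprise.") and ipaddress.ip_address(ip) not in net
    ]


if __name__ == "__main__":
    # Pretend the cluster was moved to a site that hands out 172.16.5.x.
    for ip, name in stale_enterprise_entries(SAMPLE_HOSTS, "172.16.5.0/24"):
        print(f"stale: {name} still points at {ip}")
```

On a real node you would read the text from C:\Windows\system32\drivers\etc\hosts instead of the sample string, and pass in whatever network the enterprise NIC actually landed on.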
So the moral of the story? If your apps are just hanging (trying to write results to a network share, or waiting for results back from WCF-based services), then check your networking. Seems obvious now of course, but it's hard to know where to start. I'd start with the hosts file, since this was *also* wrong on my backup cluster, and fixing it solved a problem I'd been having for 2 weeks. In that case the hosts file was missing the Private.* entries (not sure why), which caused WCF to fail while everything else (including MPI) worked.
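The missing Private.* case is easy to check mechanically. Here is a rough Python sketch, assuming the layout shown in the listing above: each node has a bare name plus an Enterprise.* and a Private.* alias, and bare node names contain no dots. The function name and sample text are hypothetical, and the check only reports gaps; it does not explain why HPC Server left them out.

```python
def missing_prefixed_entries(text, prefixes=("Enterprise.", "Private.")):
    """Return the prefixed host names that the #HPC entries should have but lack."""
    names = set()
    for line in text.splitlines():
        body, _, comment = line.partition("#")
        fields = body.split()
        # Keep only lines of the form "<ip> <name> #HPC".
        if len(fields) >= 2 and comment.strip() == "HPC":
            names.add(fields[1])
    # Assumption: bare node names (HEADNODE, COMPUTE1, ...) contain no dots.
    nodes = {name for name in names if "." not in name}
    return sorted(
        prefix + node
        for node in nodes
        for prefix in prefixes
        if prefix + node not in names
    )


if __name__ == "__main__":
    # Sample resembling my backup cluster's problem: no Private.* lines at all.
    broken = """\
# ManageFile = true
127.0.0.1 localhost
10.20.30.1 HEADNODE #HPC
10.20.30.2 COMPUTE1 #HPC
10.20.30.1 Enterprise.HEADNODE #HPC
10.20.30.2 Enterprise.COMPUTE1 #HPC
"""
    for name in missing_prefixed_entries(broken):
        print("missing:", name)
```

Running something like this on every node before a demo would have saved me two weeks of head scratching.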
The cluster of travelling with a cluster, or two :-) Thanks to the attendees who suffered while I tried to figure this out on the fly, you were good sports!
Posted Nov 19 2008, 12:16 AM by joe-hummel