xCAT Cluster Management Tools

xCAT is a very good cluster management toolkit for experienced administrators. Configuring a cluster with it takes some time and can get a little tedious, but the amount of control you get over the cluster is fantastic. It works well with IBM's proposed cluster solution, and with the relevant hardware the cluster is really easy to manage. A very good set of documentation is already available, so I will not attempt to recreate it.

Please refer to http://www.alphaworks.ibm.com/tech/xCAT for detailed instructions.

xCAT Notes

Things to check when problems arise:

NIS Problems

  1. Ensure “ypbind” and “ypserv” are enabled on the server
  2. Ensure “ypbind” is enabled on the client
  3. Ensure the kickstart file is correctly configured with the NIS master
  4. Configure NIS with “authconfig”
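Item 3 is easy to check mechanically. The sketch below assumes the kickstart file enables NIS through the standard kickstart `auth` (or `authconfig`) line, e.g. `auth --enablenis --nisdomain=... --nisserver=...`. The function names are hypothetical helpers for illustration, not part of xCAT.

```python
def kickstart_nis_settings(ks_text):
    """Return (nisdomain, nisserver) from the first auth/authconfig line
    that enables NIS, or None if no such line exists."""
    for raw in ks_text.splitlines():
        tokens = raw.split()
        # Skip blank lines and comments.
        if not tokens or tokens[0].startswith("#"):
            continue
        # Only the auth/authconfig command carries the NIS options.
        if tokens[0] not in ("auth", "authconfig") or "--enablenis" not in tokens:
            continue
        options = {}
        for tok in tokens[1:]:
            # Collect --key=value options such as --nisserver=master.
            if tok.startswith("--") and "=" in tok:
                key, value = tok[2:].split("=", 1)
                options[key] = value
        return options.get("nisdomain"), options.get("nisserver")
    return None


def nis_master_ok(ks_text, expected_master):
    """True if the kickstart file enables NIS and points at expected_master."""
    settings = kickstart_nis_settings(ks_text)
    return settings is not None and settings[1] == expected_master
```

For items 1 and 2, `chkconfig --list ypbind` (and `chkconfig --list ypserv` on the server) shows on Red Hat-style systems whether the service is enabled in each runlevel.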

File Problems

  1. Ensure “homefs” is configured in “site.tab”
  2. Ensure “localfs” is configured in “site.tab”
  3. Ensure “/etc/exports” is correctly configured
  4. Ensure “/etc/fstab” is correctly configured
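The four checks above can be cross-checked against each other. This is a minimal sketch under stated assumptions: `site.tab` is treated as whitespace-separated `key value` lines, and the `homefs`/`localfs` values are assumed to look like `server:/path` (e.g. `master:/home`). All function names are hypothetical.

```python
def parse_site_tab(text):
    """Parse 'key value' lines into a dict (assumed site.tab layout; '#' starts a comment)."""
    entries = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line:
            parts = line.split(None, 1)
            entries[parts[0]] = parts[1] if len(parts) > 1 else ""
    return entries


def exported_paths(text):
    """First field of each non-comment /etc/exports line: the exported directory."""
    paths = set()
    for raw in text.splitlines():
        fields = raw.split("#", 1)[0].split()
        if fields:
            paths.add(fields[0])
    return paths


def nfs_mounts(text):
    """Map NFS devices ('server:/path') to mount points from /etc/fstab text."""
    mounts = {}
    for raw in text.splitlines():
        fields = raw.split("#", 1)[0].split()
        # fstab fields: device, mount point, type, options, dump, pass
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            mounts[fields[0]] = fields[1]
    return mounts


def file_problems(site_text, exports_text, fstab_text):
    """List mismatches between site.tab, /etc/exports and /etc/fstab."""
    site = parse_site_tab(site_text)
    exported = exported_paths(exports_text)
    mounts = nfs_mounts(fstab_text)
    problems = []
    for key in ("homefs", "localfs"):
        value = site.get(key)
        if not value:
            problems.append(f"{key} missing from site.tab")
            continue
        # Assumed value format 'server:/path'; the path must be exported...
        _server, _, path = value.partition(":")
        if path not in exported:
            problems.append(f"{path} not exported in /etc/exports")
        # ...and the same 'server:/path' must be NFS-mounted on the node.
        if value not in mounts:
            problems.append(f"{value} not NFS-mounted in /etc/fstab")
    return problems
```

In practice `/etc/exports` lives on the master and `/etc/fstab` on each node, so the function takes file contents as text; that way it can be run against copies gathered from different machines.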

SSH Problems

  1. Ensure the user account exists
  2. Ensure SSH is started

Make sure that the public keys, i.e. “identity.pub” and “id_dsa.pub”, are copied into “.ssh/”

  • This should not be much of an issue if the /home directory is correctly shared across all nodes, since every node then sees the same “.ssh/” directory.
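The note does not say which file the keys must end up in. With OpenSSH, key-based logins are checked against `~/.ssh/authorized_keys` (older releases also read `authorized_keys2`), so the sketch below assumes that is the target. `missing_public_keys` is a hypothetical helper, and it does a plain line comparison, so entries carrying option prefixes will not match.

```python
from pathlib import Path

# Key files named in the note, and the OpenSSH files that list accepted keys.
PUBLIC_KEYS = ("identity.pub", "id_dsa.pub")
AUTH_FILES = ("authorized_keys", "authorized_keys2")


def missing_public_keys(ssh_dir):
    """Report public key files in ssh_dir that are absent or not authorized."""
    ssh_dir = Path(ssh_dir)
    authorized = set()
    for name in AUTH_FILES:
        path = ssh_dir / name
        if path.exists():
            # Each non-empty line holds one key; compare whole stripped lines.
            authorized.update(
                line.strip() for line in path.read_text().splitlines() if line.strip()
            )
    problems = []
    for name in PUBLIC_KEYS:
        path = ssh_dir / name
        if not path.exists():
            problems.append(f"{name} not found")
        elif path.read_text().strip() not in authorized:
            problems.append(f"{name} not listed in authorized_keys")
    return problems
```

Run it as the affected user, e.g. `missing_public_keys(Path.home() / ".ssh")`. With a shared /home, checking once on the master covers every node.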

Resetting Hardware MRV

The MRV is essentially a serial console switch (terminal server) that xCAT or CSM uses for out-of-band management. The catch with this switch is that it needs to be reset whenever you change its configuration. Also, the IP settings on a newly delivered unit are sometimes not what you need, and you have to change them. Here is how to reset the MRV when problems arise.

  1. Using a paper clip, push the reset switch in the pin-size hole for about 1 second. All the front panel LEDs will light up.
  2. Push the reset switch in again and hold it (about 3 to 4 seconds). LEDs 1-10 will light up from right to left, then left to right, until this sweeping pattern stops. Then LEDs 7 and 8 will remain lit. At this point, release the reset switch.
  3. The LEDs will then light up in a countdown pattern to 1 (running some tests). Then all LEDs will flash once and go out, except the RUN LED, which blinks very fast.
  4. Connect a terminal to the highest-numbered port (e.g. with cu -l ttyS0). Press ENTER until you see a message like:
          Terminal Server  .......
          (can't remember what's on the second line) ...
          Configuration in progress .....
  5. Type “access” and press ENTER. A text-based configuration (initialization) menu will be displayed. The menu is quite self-explanatory; exploring it will lead you to a menu where you can load some default initialization values.

I think I have also figured out what the numbers on the front panel mean:

  • 1 = ports 1, 11, 21, etc.
  • 2 = ports 2, 12, 22, 32, etc.

I have been staring at the front panel long enough to see this pattern. I think the front panel numbers are also used to show error codes: e.g. if 45 lights up in a repeating fashion, that means error code 45.
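The port-to-LED pattern guessed above amounts to counting ports modulo 10. The sketch below encodes that guess; it additionally assumes that LED 10 covers ports 10, 20, 30, and so on, which the observation does not confirm. Both function names and the `max_port` parameter are hypothetical.

```python
def led_for_port(port):
    """Front-panel LED (1-10) for a port, per the observed modulo-10 pattern."""
    if port < 1:
        raise ValueError("port numbers start at 1")
    # Ports 1, 11, 21 -> LED 1; ports 2, 12, 22 -> LED 2; ports 10, 20 -> LED 10 (assumed).
    return (port - 1) % 10 + 1


def ports_for_led(led, max_port):
    """All ports up to max_port that would share the given LED."""
    return [p for p in range(1, max_port + 1) if led_for_port(p) == led]
```

For example, `ports_for_led(2, 32)` lists the ports that LED 2 would stand for on a unit with 32 ports.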

Linux HPC and XCAT Redbook

Linux HPC and XCAT Redbook