On your frontend, execute:
# insert-ethers --remove="[your compute node name]"
For example, if the compute node's name is compute-0-1, you'd execute:
# insert-ethers --remove="compute-0-1"
The compute node has been removed from the cluster.
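If several nodes must be removed, the same command can be looped. A minimal sketch, written as a dry run: `echo` prints each command instead of executing it, so nothing is actually removed until you drop the `echo`.

```shell
# Dry run: print the insert-ethers command for compute-0-0 .. compute-0-3.
# Remove the `echo` to actually remove the nodes.
for i in 0 1 2 3; do
    echo insert-ethers --remove="compute-0-$i"
done
```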
Before you can run startx, you need to configure XFree86 for your video card. This is done just as on standard Red Hat machines, using the system-config-display program. If you do not know anything about your video card, just select "4MB" of video RAM and 16-bit color at 800x600. This video mode should work on any modern VGA card.
Here's how to configure your Dell Powerconnect 5224:
You need to set the edge port flag for all ports (on some Dell switches this is labeled fast link).
First, you'll need to set up an IP address on the switch:
Plug in the serial cable that came with the switch.
Connect to the switch over the serial cable.
The username/password is: admin/admin.
Assign the switch an IP address:
# config
# interface vlan 1
# ip address 10.1.2.3 255.0.0.0
Now you should be able to access the switch via the ethernet.
Plug an ethernet cable into the switch and to your laptop.
Configure the IP address on your laptop to be:
IP: 10.20.30.40, netmask: 255.0.0.0
Point your web browser on your laptop to 10.1.2.3
Username/password is: admin/admin.
Set the edge port flag for all ports. This is found under the menu item: System->Spanning Tree->Port Settings.
Save the configuration.
This is accomplished by going to System->Switch->Configuration and typing 'rocks.cfg' in the last field 'Copy Running Config to File'. In the field above it, you should see 'rocks.cfg' as the 'File Name' in the 'Start-Up Configuration File'.
We use High-Performance Linpack (HPL), the program used to rank computers on the Top500 Supercomputer lists, to debug Myrinet. HPL is installed on all compute nodes by default.
To run HPL on the compute nodes, see Interactive Mode.
Then it is just a matter of methodically testing the compute nodes: start with compute-0-0 and compute-0-1 and make sure they are functioning, then move to compute-0-2 and compute-0-3, and so on.
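The pairwise sweep described above can be scripted. A minimal sketch that only prints the pairs to test; the node count is an example, and launching HPL on each pair is left to the Interactive Mode procedure:

```shell
# Print compute-node pairs to test two at a time:
# compute-0-0 + compute-0-1, then compute-0-2 + compute-0-3, ...
NODES=8    # example: 8 compute nodes in cabinet 0
i=0
while [ "$i" -lt "$NODES" ]; do
    echo "test pair: compute-0-$i compute-0-$((i + 1))"
    i=$((i + 2))
done
```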
When you find a suspected malfunctioning compute node, the first thing to do is verify the Myrinet map (this contains the routes from this compute node to all the other Myrinet-connected compute nodes).
Examine the map by logging into the compute node and executing:
# /usr/sbin/gm_board_info
This will display something like:
GM build ID is "1.5_Linux @compute-0-1 Fri Apr 5 21:08:29 GMT 2002."

Board number 0:
  lanai_clockval    = 0x082082a0
  lanai_cpu_version = 0x0900 (LANai9.0)
  lanai_board_id    = 00:60:dd:7f:9b:1d
  lanai_sram_size   = 0x00200000 (2048K bytes)
  max_lanai_speed   = 134 MHz
  product_code      = 88
  serial_number     = 66692
    (should be labeled: "M3S-PCI64B-2-66692")
LANai time is 0x1de6ae70147 ticks, or about 15309 minutes since reset.
This is node 86 (compute-0-1) node_type=0
Board has room for 8 ports, 3000 nodes/routes, 32768 cache entries
  Port token cnt: send=29, recv=248
Port: Status PID
   0:   BUSY 12160 (this process [gm_board_info])
   2:   BUSY 12552
   4:   BUSY 12552
   5:   BUSY 12552
   6:   BUSY 12552
   7:   BUSY 12552
Route table for this node follows:
The mapper 48-bit ID was: 00:60:dd:7f:96:1b
gmID MAC Address       gmName                           Route
---- ----------------- -------------------------------- ---------------------
   1 00:60:dd:7f:9a:d4 compute-0-10                     b7 b9 89
   2 00:60:dd:7f:9a:d1 compute-1-15                     b7 bf 86
   3 00:60:dd:7f:9b:15 compute-0-16                     b7 81 84
   4 00:60:dd:7f:80:ea compute-1-16                     b7 b5 88
   5 00:60:dd:7f:9a:ec compute-0-9                      b7 b9 84
   6 00:60:dd:7f:96:79 compute-2-13                     b7 b8 83
   8 00:60:dd:7f:80:d4 compute-1-1                      b7 be 83
   9 00:60:dd:7f:9b:0c compute-1-0                      b7 be 84
Now, login to a known good compute node and execute /usr/sbin/gm_board_info on it. If the gmIDs and gmNames are not the same on both, then there is probably a bad Myrinet component.
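One way to make this comparison less error-prone is to diff just the gmID and gmName columns of the two route tables. A sketch; the awk filter assumes the column layout shown in the sample output above, and the here-document stands in for real /usr/sbin/gm_board_info output:

```shell
# Keep only the gmID and gmName columns of the route table
# (i.e., lines whose first field is a plain number).
extract_map() {
    awk '$1 ~ /^[0-9]+$/ { print $1, $3 }'
}

# Sample input abbreviated from the output above; on a real node,
# pipe /usr/sbin/gm_board_info into extract_map instead.
extract_map <<'EOF'
   1  00:60:dd:7f:9a:d4  compute-0-10  b7 b9 89
   2  00:60:dd:7f:9a:d1  compute-1-15  b7 bf 86
EOF
```

Saving the filtered output from two nodes (e.g., to /tmp/map-a and /tmp/map-b) and running diff on the two files shows any mismatched entries at a glance.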
Start replacing components to see if you can clear the problem. Try each procedure in the list below.
Replace the cable
Move the cable to a different port on the switch
Replace the Myrinet card in the compute node
After each procedure, make sure to rerun the mapper on the compute node and then verify the map (with /usr/sbin/gm_board_info). To rerun the mapper, execute:
# /etc/rc.d/init.d/gm-mapper start
The mapper will run for a few seconds, then exit. Wait for the mapper to complete before you run gm_board_info (that is, run ps auwx | grep mapper and make sure the mapper has completed).
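The wait can be automated by polling until the mapper process disappears from the process table. A sketch of the polling pattern; a short background `sleep` plays the role of the mapper here, since gm-mapper only exists on Myrinet-equipped nodes:

```shell
# Start a stand-in "mapper" (a 2-second sleep) in the background.
sleep 2 &

# Poll until the process no longer appears in ps output.
# The [s] bracket keeps grep from matching its own command line.
while ps auwx | grep '[s]leep 2' > /dev/null; do
    sleep 1
done

echo "mapper finished; safe to run gm_board_info"
```

On a real compute node, replace the stand-in with /etc/rc.d/init.d/gm-mapper start and grep for '[m]apper' instead.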
This is only an issue for machines that support network booting (also called PXE). In this case, the boot order should be: CD-ROM, floppy, hard disk, network. This means that on bare hardware, the first boot will network boot, as no OS is installed on the hard disk. This PXE boot will load the Red Hat installation kernel and install the node just as if the node were booted with the Rocks Boot CD. If you set the boot order to place PXE before the hard disk, the node will repeatedly reinstall itself.
Execute this procedure:
Add the directory you want to export to the file /etc/exports.
For example, if you want to export the directory /export/disk1, add the following to /etc/exports:
/export/disk1 10.0.0.0/255.0.0.0(rw)
This exports the directory only to nodes that are on the internal network (in the above example, the internal network is configured to be 10.0.0.0)
Restart NFS:
# /etc/rc.d/init.d/nfs restart
Add an entry to /etc/auto.home.
For example, say you want /export/disk1 on the frontend machine (named frontend-0) to be mounted as /home/scratch on each compute node.
Add the following entry to /etc/auto.home:
scratch frontend-0:/export/disk1
Inform 411 of the change:
# make -C /var/411
Now when you login to any compute node and change your directory to /home/scratch, it will be automounted.
When compute nodes experience a hard reboot (e.g., when the compute node is reset by pushing the power button or after a power failure), they will reformat the root file system and reinstall their base operating environment.
To disable this feature:
Login to the frontend
Create a file that will override the default:
# cd /home/install
# cp rocks-dist/lan/arch/build/nodes/auto-kickstart.xml \
    site-profiles/4.2/nodes/replace-auto-kickstart.xml
Where arch is "i386", "x86_64" or "ia64".
Edit the file site-profiles/4.2/nodes/replace-auto-kickstart.xml
Remove the line:
Rebuild the distribution:
# cd /home/install
# rocks-dist dist
Reinstall all your compute nodes
An alternative to reinstalling all your compute nodes is to login to each compute node and execute: