Overview

CIToolkit Release II - Test Entire System

The goal of these tests is to get the system to run as a whole and shake out problems with the myrinet network. Key points of failure include the myrinet card, the motherboard (PCI bus), the myrinet cable (which may just be loose), and the myrinet switch.


Start the GM Myrinet Drivers

  The GM myrinet driver binaries should already have been installed into
  /cluster/vms/<vmname>/gm.  ('make myrinet', 'make myrinet_gm')

One Time Configuration

  1. Set up IP over myrinet (optional):
      IP over myrinet is started by default, using '10.1.X.Y', where X and Y
      are the last two parts of the IP address of the primary interface.
      Using IP over myrinet with a specific IP scheme is NOT required.
      If you want to configure nodes for specific IPs:

Load the Drivers

Once GM is compiled and installed, and the node's vm is set, GM will load on reboot of the compute nodes. You can also start GM without rebooting:
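
A minimal sketch of loading the driver by hand, assuming the gm kernel module
was installed into the node's standard module path (if the toolkit installed
its own init script, prefer that instead):

  [on the compute node]

  1. modprobe gm      (or: insmod gm)
  2. Check /var/log/messages (or dmesg) for GM startup messages.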

Unload the Drivers

You can also stop the drivers as follows:
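
A minimal sketch of unloading the driver by hand (again, prefer the toolkit's
init script if you know its path); make sure no MPI jobs or the mapper are
still using the card first:

  [on the compute node]

  1. rmmod gm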


Check that the Myrinet Drivers Are Loaded OK

One Node at a Time

Quick and dirty test:
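
One hedged possibility (not necessarily the exact command intended here) is
simply to confirm that the gm kernel module is loaded:

  1. lsmod | grep gm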

Other, more thorough, commands:
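
For example (the gm_board_info path is an assumption; adjust it to wherever
the GM binaries live on your compute nodes):

  1. /cluster/rte/gm-1.4/bin/gm_board_info
  2. grep 'GM:' /var/log/messages

If the driver is not loaded, gm_board_info will report 'No boards found or
all ports busy' (see the gm_board_info notes near the end of this document).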

On Several Nodes

Follow the procedure below to check many nodes at a time. If errors are reported they should be checked one node at a time.

[from the root node]
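
A hedged sketch using ccmd (the node range 32-47 is just an example, and the
gm_board_info path is an assumption):

  1. ccmd -t 32-47 /cluster/rte/gm-1.4/bin/gm_board_info
  2. Scan the output for nodes reporting 'No boards found or all ports busy'.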


Check the Myrinet LAN Cables

Oftentimes the link between a myrinet card and the myrinet switch can fail, even if the cable is mostly good. Myrinet cable connector pins can easily be bent or otherwise damaged, and connections that are not completely tightened can also cause problems. Before replacing a questionable cable, unplug both ends, inspect and re-plug them, tighten the connectors, and re-test. Obviously, you should also make sure that the GM myrinet driver is loaded.

One Node at a Time

On Several Nodes

[from the root node]


Run the GM Mapper

Configure Mapping Options

The GM mapper's configuration file should be OK as is, but if you want to make changes (such as using shortest-path routing), edit the following file:

  /cluster/vms/<vmname>/config/active.args

where <vmname> is something like ``gm-1.4''

Start the Mapper

Use the run.mapper script to save off old log files if they exist, and start the mapping program.

  1. rsh to the mappernode selected in the initial setup. If you don't remember which node is the mapper node, look in /cluster/vms/gm-1.4/config/mappernode.
  2. cd /cluster/bin
  3. ./run.mapper

If you are re-mapping the network after major changes to it or you have selected a new mapper node, you may want to restart the GM driver on the mapper node to clear the old route information. This may slow down the mapping process a little but can sometimes correct mapping errors.

  1. ./run.mapper --restart

Monitor the Mapper's Progress

The length of time it takes for the mapper to complete depends upon several things, so be patient:

For Sandia NM: 1024+ Nodes and gm-1.1.3.10 it took over 45 minutes. For Sandia CA: 256 Nodes and gm-1.4pre-28 it took about 10 minutes the first time, less than 5 after that.

To watch the progress and error messages, open a new connection to the mappernode and run the following:

  1. tail -f /tmp/gm_mapper/mapper.log.full
  2. ctrl-c to get out when done or you give up.

For the first several runs you should watch mapper.log.full (verbose mode) to see what is going on. First the mapper will go through the 'scout' stage, and you will see 'new host' and 'new switch' messages. Then it will decide how to configure the hosts and go through the 'configuration' stage. As long as the log is showing 'handling a configure reply', it is still working fine. If it is repeatedly 'sending configure message' to the same node, you should probably kill the mapping process, inspect that node, and try again. The mapper log will show 'Done mapping.' when it is complete.

Verify That All Nodes Are Mapped

When the mapper completes, look for nodes missing from the routing table.

  1. Print a list of the nodes (from the database) that did NOT get mapped,
     and the total number of hosts the mapper saw. Do this on the mappernode
     (a generic sketch is given after this list):
  2. If all nodes are mapped, that is great! Otherwise, do the following:
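
For step 1, if you do not remember the toolkit's database report command, a
generic sketch (not the toolkit's own method) is to compare your full node
list against the hosts the mapper found. The file name all.nodes and the awk
field used for the hostname are assumptions; adjust them for your site and
for the actual column layout of mapper.hosts:

  1. awk '{print $NF}' /tmp/gm_mapper/mapper.hosts | sort > /tmp/mapped
  2. sort all.nodes | comm -23 - /tmp/mapped   (nodes that did NOT get mapped)
  3. wc -l /tmp/mapped                         (total hosts the mapper saw)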

Create the gmpiconf File For Every Mapping Cycle

The GM environment needs a configuration file that lists all the working nodes that can participate in running MPI jobs. This file is $CIT_HOME/vms/<vmname>/config/gmpiconf.

To generate the file, run a command like the following after each mapping. On the admin node:
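
A hedged sketch (the exact arguments may differ on your installation; the
hand-mapping section below shows the --file form of create_gmpi_conf):

  1. cd /cluster/vms/gm-1.4/config
  2. create_gmpi_conf > gmpiconf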

This will capture the routing table information from the Myrinet card on the mapper node, parse it, and order the nodes according to the Local.pm sort routines to improve performance.

It is OK to edit the gmpiconf file by hand.

[deprecated] Optimize the Node Order in mapper.routes for use by Cplant

The Cplant runtime environment software allocates NIDs (node IDs) based on the order of the nodes in the GM mapper.routes file. Use the order_routes program to sort this file into a more optimal order for improved MPI job performance.

This script takes a LONG time to run. Be patient.

NOTE: The use of this method to order NIDs is deprecated in favor of using the 'hand generated' routes, described below. order_routes may not work quite right outside of the Sandia Antarctica center section. If you still want to use it:

Save the Map, Routes and Hosts files for file mapping

Once you have all the nodes mapped as you like, save off the mapper.map, mapper.routes, and mapper.hosts configuration files. Copy them from <mappernode>:/tmp/gm_mapper/, either with rcp or by digging in the diskless hierarchy. Rename them to change the prefix from ``mapper.'' to ``my.'' and place them in /cluster/vms/gm-1.4/config/
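
For example, using rcp from the admin node (substitute your mapper node's
hostname for <mappernode>):

  1. cd /cluster/vms/gm-1.4/config
  2. rcp <mappernode>:/tmp/gm_mapper/mapper.map    my.map
  3. rcp <mappernode>:/tmp/gm_mapper/mapper.routes my.routes
  4. rcp <mappernode>:/tmp/gm_mapper/mapper.hosts  my.hosts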

These files will be used by each compute node when they boot, to load the myrinet routes.


Map By Hand

The ``mesh_mapper'' tool can be used to generate a myrinet map and routes for most myrinet mesh topologies, such as the Antarctica cluster. An overview of the process is described below. See /cluster/bin/mesh_mapper/mesh_mapper --help for more info.

Create Map, Routes, and Hosts files

  1. Gather the myrinet MAC addresses. This is required only for IP over
     myrinet, but it may help later with debugging.
  2. cd /cluster/vms/gm-1.4/config
  3. Create a mapper.conf.pl file for your cluster.
  4. mesh_mapper --conf your.mapper.conf.pl [--macfile myri.macs] --allfiles

Verify that you have a valid, deadlock-free, route file:

This will also verify that the map and routes files are a valid pair. For a large route file ( > 500 nodes) this will take a while. Be patient.

  1. ../bin/deadlock my.map my.routes

Generate the gmpiconf File

This creates the file /cluster/vms/gm-1.4/config/gmpiconf, which lists all the nodes that are available for running MPI jobs. It can be edited by hand if necessary.

  1. create_gmpi_conf --file my.hosts > gmpiconf

Create Routes with simple_routes

Apply Myricom's generic deadlock free routing algorithm to the manually generated map.

  1. ../bin/simple_routes my.map my.simple.routes -spread

If you get an error like 'assertion failed. sc_Calculator.c:693 (argv)', see the previous instructions on compiling the mapper tools. You will need to edit the file and recompile.

Configure the nodes with the new routes

On a single node:

[from the node you want to configure]

  1. cd /cluster/rte/config
  2. /cluster/bin/gm_file_map my.map my.routes my.hosts

On a set of nodes:

  1. ccmd -t 32-47 /cluster/bin/gm_file_map /cluster/rte/config/my.map /cluster/rte/config/my.routes /cluster/rte/config/my.hosts

NOTE: The gm-1.4/rte init script is configured to run this command after loading the myrinet driver. If the three files do not exist, the file mapper will silently fail; otherwise it will run in the background and configure the myrinet card (this takes about a minute).


Test the Myrinet Links

Test the Myrinet SAN cables adjacent to a node.

In the Antarctica Cluster, and other mesh architectures, no SAN cable is more than one hop from a node. To perform a self ping over adjacent SAN cables, use the petal command.

[on the compute node]
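
A hedged example (the path, and the assumption that petal needs no arguments,
may not match your installation; check the script itself for its options):

  1. /cluster/bin/petal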

You can edit the script to define where the edges of the myrinet mesh are. Petal will determine what switch port the current node is on by doing a lookup in the mapinfo.host file. Ports listed as 'edges' and other hosts will not be tested.

Run mpi_init

The mpi_init program is a very simple MPI job that will make sure the first node in the gmpiconf file ('rank 0') can see all the other nodes via the myrinet.

mpi_init is also a very good test to make sure you can launch MPI programs correctly. There are two startup scripts you can use in the MPICH/GM environment. Try them both and see what works best for you:

If you get an error like ``cannot open /root/.gmpi/conf'' verify that the GMPICONF variable is correct in /cluster/config/cluster.env, and that you have sourced the file.

  1.   # cd /cluster/rte/gm-1.4/bin
      # mpirun -np 4 mpi_init

    This is the ``standard'' version of mpirun. It will assume that mpi_init has the same path on the compute nodes as it does on the admin node, which it often doesn't. However, if you have a small cluster and have a common mount point on all nodes (/home for example) that holds your mpi programs, then it may be your best option.

  2.   # mpirun.cit -np 4 /cluster/rte/bin/mpi_init

    This is a new version of mpirun, adapted to leverage the configuration database, and distribute work through 'leaders' if they are available. It is much more flexible, including the ability to copy the binary before launching, and support for mixed executable runs. The path given is the full path of the executable on the compute node. See 'mpirun.cit --help' and 'mpirun.cit --man' for more info.

The '-np 4' option specifies how many nodes you want to run on. mpi_init should print out 'Initializing...' once for every node, and 'Finalizing...' and 'Done.' a single time.

If an MPI job does not finish completely, there may be stale processes running on some nodes, which will cause future jobs to fail. In this case, do the following:

  1. Kill the parent startup process (mpirun)
  2. Run /cluster/bin/gm_mpi_kill on all nodes.
  3. Try to determine why the job failed. See the 'Troubleshoot and Recover from Errors' section below for more tips.

Run mpi_routecheck

The mpi_routecheck program is a fairly simple MPI job that will make sure all nodes in the gmpiconf file can see all the other nodes via the myrinet.

  1. cd /cluster/vms/gm-1.4/bin
  2. mpirun -np 512 mpi_routecheck
  3. The '-np 512' option specifies how many nodes you want to run on.

mpi_routecheck accepts the following options:

[-c]                              check for CRC errors
[-no_v]                           do not print each loop announcement
[-v1]                             print more info
[-interval loop time in seconds]  default is disabled
[-min min_msg_size]               default = 8 bytes
[-max max_msg_size]               default = 8 bytes
[-m msg_size]                     min == max; this is the default mode

Note that checking for CRC errors slows down the process A LOT!

Sample output from mpi_routecheck:

# mpirun.ch_gm -np 3 mpi_routecheck
Timing resolution is 0.000001 seconds
dealer node is now rank 1
dealer node is now rank 2
Message size = 8 bytes
dealer node is now rank 0
Total time = 0.001126
Avg loop time = 0.000375

Run mpptest

The mpptest program should already have been compiled from the source in the /examples/perftest directory under the mpich install directory. The script /cluster/bin/run.mpptest will launch mpptest with the correct arguments. If you determined above that you should use a different version of mpirun, edit run.mpptest accordingly.

Usage for run.mpptest:

run.mpptest [-n num_nodes] [-r numreps_per_run] [-i iterations]

Where num_nodes is the number of nodes to run on (starting from the top of gmpiconf), numreps_per_run is the number of repetitions to pass to the mpptest program (this determines the running time), and iterations is the number of times to start up mpptest.

Suggested steps:

  1. run.mpptest
  2. run.mpptest -n 1024
  3. run.mpptest -n 1024 -r 70 -i 20

Run xhpl (high performance linpack)

This test is not all that useful for debugging the system, but it is a nice number to get.

First, run with a simple configuration to make sure the test will work:

  1. cd /cluster/vms/gm-1.4/bin
  2. cp $CIT_DIST/myrinet/3rdparty/hpl/bin/Linux_Alpha/HPL.dat .
  3. mpirun -np 4 xhpl

Now, modify the HPL.dat file to use more nodes and optimize performance:

  1. view $CIT_DIST/myrinet/3rdparty/hpl/TUNING
  2. edit HPL.dat to your liking and save it.
  3. mpirun.master -np 1048 xhpl

Test Individual Node to Node Connectivity

gm_ping

To troubleshoot myrinet problems, use the gm_ping command from one compute node to another:
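
For example (t-33 is a placeholder hostname):

  1. gm_ping t-33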

Note that the hostname is that of the primary interface, not the myrinet interface. The test will still go via the myrinet.

The node does not have to have its route table configured in order to use gm_ping, but both the current and destination nodes must be in the map file.

You can also use the gm_ping_all command to check connectivity to ALL nodes in the map file:
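
For example (assuming gm_ping_all takes no arguments and simply walks the map
file):

  1. gm_ping_all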

gm_probe_node

If gm_ping indicates a valid link, but MPI jobs have problems running, myrinet cables may be miswired. There is probably a host at the end of the ping route, just not the one expected. Check this with the gm_probe_node command.

This script requires that the nodes are both already configured with the 'hand generated' routes. A 'ping' is performed over the myrinet and the script makes sure the expected hostname is returned. It gets the route to the node from gm_board_info and sends a mapper probe packet to request the hostname from the destination node. If the destination node is unreachable or not in the route table then an error message is returned. Otherwise, no news is good news!
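
For example (t-33 is a placeholder for the host you expect at the other end
of the route):

  1. gm_probe_node t-33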

gm_crawl

If gm_ping indicates a failed link, use the gm_crawl command to locate the bad cable:
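
For example (t-45 is a placeholder for the unreachable host):

  1. gm_crawl t-45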

This will perform a self ping over the route to the given host, going one hop further each time until the host is reached. gm_crawl will also accept a defined route:

See 'gm_crawl -help' for more options.


Check for CRC Errors

Once MPI jobs have been running for several hours, make sure they finished cleanly and then check the packet counters on the compute nodes. Note that the packet counters are reset only when the gm driver is re-loaded (this includes a reboot).

One Node at a Time

'By Hand'
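
One hedged way to read the counters is GM's gm_counters utility, assuming it
is installed alongside the other GM binaries (the path is an assumption):

  1. /cluster/rte/gm-1.4/bin/gm_counters | egrep 'netrecv_cnt|badcrc_cnt'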

netrecv_cnt (total packets) should be close to 10^8 to get really valid results. Even if it is low, CRC errors still indicate a hardware problem.

A badcrc_cnt value > 0 indicates that there was packet loss, recovered by error checking. Unless total packets are way over 10^8, any packet loss probably indicates a loose or damaged cable. It is also possible, but less likely, that a bad or badly seated card may be the cause.

To get additional debugging information, try one of the following:
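
Two hedged possibilities (not necessarily the commands originally intended
here): dump the full board state, and check the kernel log for GM messages:

  1. /cluster/rte/gm-1.4/bin/gm_board_info
  2. grep 'GM:' /var/log/messages     (or: dmesg | grep -i gm)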

'By Script'

[from the admin node]
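
A hedged guess based on the multi-node usage shown below (the single-node
form of gm_crc may differ):

  1. gm_crc --check t-32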

On Multiple Nodes

The gm_crc script will automate this process for multiple compute nodes.

[on the admin node]

  1. gm_crc --check --mash t-32 t-33 t-34
      (This will check all the counters, save them to files in
      /cluster/tmp/gm_crc, analyze the results, and print the 'bad' nodes at
      the end of the output.)

  2. mkdir /cluster/tmp/gm_crc.<date>
  3. mv /cluster/tmp/gm_crc/CRC* /cluster/tmp/gm_crc.<date>
      (Archive the results for later comparison.)

  4. mpirun -np 512 mpi_routecheck
      (Or run any other myrinet stress test you like.)

  5. gm_crc --check --mash t-0 t-1 t-2...
      (Gather the counters again.)

  6. diff /cluster/tmp/gm_crc.<date>/CRC.d /cluster/tmp/gm_crc/CRC.d
      (This will diff the results to highlight new CRC errors.)

  7. more /cluster/tmp/gm_crc/CRC.all
      (To look at all the results again.)

  8. more /cluster/tmp/gm_crc/CRC.bad
      (To see just the nodes with errors.)

gm_crc also captures the CRC results in a form that can more easily be exported into a spreadsheet for tracking and graphing. /cluster/tmp/gm_crc/CRC.tab is a tab-delimited file.

Use the '--help' option for more usage information.


Troubleshoot and Recover from Errors

Some of this has already been covered, but here it is all together in more detail.

Failed MPI Job

If an MPI job does not finish completely, there may be stale processes running on some nodes, which will cause future jobs to fail.

  1. Run /cluster/bin/gm_mpi_kill on all nodes.
  2. This can be automated with ccmd (a sketch is given after this list).
  3. If the job ran for a while and then died:
  4. If the job never really got started, look at the error messages produced
     on the terminal (and also sent to /var/log/messages), and see if you get
     one of the error messages described in the following subsections.
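
A hedged sketch of the automation mentioned in item 2, using ccmd as
elsewhere in this document (the node range is a placeholder for your
cluster's range):

  1. ccmd -t 0-255 /cluster/bin/gm_mpi_kill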

'Send failed to complete' or 'send failed: send timed out'

If you get a GM 'send failed to complete' error when starting an MPI job, then the node listed (reporting the error) could not reach some other node. If no node is listed on stdout, look at the log file (/cluster/tmp/mpptest/mpptest.N).

Clean up the system as described in 'Failed MPI Job' above.

Look at the syslog for the reported nodes. Those nodes were OK, but the errors listed in the syslog should help you track down the problem nodes.

'Port In Use'

If you try to start a job on a node with a busy Myrinet card, you will get an error like:

GM: NOTICE: User tried to claim a port number <2> that is in use.
GM: NOTICE: Could not open port state.

Chances are very good that there was already an MPI job running on that node. If you didn't mean to use that node, don't worry about it; the existing job might just continue unharmed. Otherwise, it is probably a stale process from a failed job.

Use gm_mpi_kill or kill the process by hand. If all else fails, reboot.

'LANai interface not responding'

GM: WARNING: LANai interface not responding
GM: DMA len=4096 lar=0x44168 ear=0x04f284000
GM: current handler is none()

These errors will show up in /var/log/messages on the node and indicate that the Myrinet card may be seated poorly in the PCI slot, or may be bad.

Drop the Bad Nodes and Push On

When you encounter nodes that are causing an MPI job to fail, debugging and solving the problem usually requires a restart of the GM driver. This will remove the routing information from the card, and the node will no longer be able to talk to the other nodes. Instead of going through the process of re-mapping (which may take a while), you can simply remove the node from the pool of available nodes and continue testing the others.

Perform the following steps, but don't forget that if removing the node from the list actually fixes the problem, you still have to go back and troubleshoot that node:

  1. Edit /cluster/vms/gm-1.4/config/gmpiconf
  2. Remove the line for the bad node.
  3. Decrease the number at the top of the file by 1 (or by however many nodes you removed).

Diagnose a Myrinet Card by Sight

(this information is only marginally helpful)

Green light to the right of the black myrinet LAN cable:

On steady   = ok, but not necessarily.
On flashing = probably a bad cable (partial open circuit).
Off         = no cable connected, or gm driver not loaded.

Verify that a Card Got Configured by the Mapper

Rsh to the node and run:
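
For example (the path is an assumption; gm_board_info is installed with the
other GM binaries):

  1. /cluster/rte/gm-1.4/bin/gm_board_info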

This shows the routes, written by the mapper, and other information stored on the card. If the node has been mapped successfully, then there will be a large table of routes. You can compare the number of lines in this table on the mapper node with other nodes to verify that the other nodes were fully configured.

If the gm driver is not loaded at all then gm_board_info will return 'No boards found or all ports busy'.

If the card was not found or not configured by the mapper, or if the gm driver was restarted or the node was rebooted since the mapper was run, then gm_board_info will return a message like 'No routes. Mapper not yet run?'


What's Next?

At this point, you should have fully functional myrinet hardware, valid map and route files, and a basic MPICH/GM runtime environment. Additional steps to perform include: