
Host Processor Overhead



Method

There are multiple methods an application can use to overlap computation and communication with MPI. The method used by this routine is a post-work-wait loop: the non-blocking calls MPI_Isend() and MPI_Irecv() initiate the respective transfer, some work is performed, and MPI_Wait() is then used to wait for the transfer to complete. This pattern is typical of most applications and hence makes for the most realistic micro-benchmark measure. Periodic polling methods have also been analyzed [1], but that approach only makes sense if the application knows that progress will not be made without periodic MPI calls during the transfer. Overhead is defined to be [2]:

"… the overhead, defined as the length of time that a processor is engaged in the transmission or reception of each message; during this time, the processor cannot perform other operations. "

Application availability is defined to be the fraction of total transfer time during which the application is free to perform non-MPI related work.
 
Application Availability = 1 – (overhead / transfer time)    (1)
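
The post-work-wait pattern described above can be sketched as follows. This is an illustrative fragment only, not the benchmark source; work() stands in for host-only computation and all names are placeholders.

/* One post-work-wait iteration (illustrative sketch, not the benchmark
 * source); work() is a placeholder for host-only computation. */
#include <mpi.h>

void work(long n);   /* placeholder: n units of computation with no MPI calls */

void post_work_wait_send(void *buf, int count, int peer, long work_units)
{
    MPI_Request req;
    MPI_Isend(buf, count, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req);
    work(work_units);                    /* overlapped with the transfer */
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* returns once the send is complete */
}

void post_work_wait_recv(void *buf, int count, int peer, long work_units)
{
    MPI_Request req;
    MPI_Irecv(buf, count, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req);
    work(work_units);                    /* overlapped with the transfer */
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* returns once the receive is complete */
}

Any time the host processor spends inside MPI_Isend()/MPI_Irecv() and MPI_Wait() is the overhead being measured; the time spent in work() is available to the application.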

Figure 1 illustrates the method used for determining the overhead time and the message transfer time. For each iteration of the post-work-wait loop, the amount of work performed (work_t), which is overlapped in time with the message transfer, is increased and the total time for the loop to complete (iter_t) is measured. If the work interval is small, it completes before the message transfer does. At some point the work interval becomes greater than the message transfer time and the message transfer completes first. At this point, the loop time becomes the time required to perform the work plus the overhead time required by the host processor to complete the transfer. The overhead can then be calculated by measuring the time needed to perform the same amount of work without overlapping a message transfer and subtracting this value from the loop time.
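
A sketch of the per-iteration measurement follows; the function and variable names (timed_loop, timed_work) are assumptions for illustration, not identifiers from the benchmark source.

/* Average loop time (iter_t) for one work interval (illustrative only). */
#include <mpi.h>

void   work(long n);                    /* placeholder host-only work          */
double timed_work(long n, int iters);   /* times the same work with no message */

double timed_loop(void *buf, int count, int peer, long work_units, int iters)
{
    MPI_Request req;
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        MPI_Isend(buf, count, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req);
        work(work_units);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    return (MPI_Wtime() - t0) / iters;
}

/* After the final (largest) work interval has been sampled:
 *   overhead = iter_t - work_t              (work_t from timed_work())
 *   avail    = 1.0 - overhead / base_t      (base_t = averaged transfer time)
 */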

The message transfer time is equal to the loop time before the work interval becomes the dominant factor. In order to get an accurate estimate of the transfer time, the loop time values are accumulated and averaged, but only those values measured before the work interval starts to contribute to the loop time. Those values used in the average calculation are determined by comparing the iteration time to a given threshold (base_t). This threshold must be set sufficiently high to avoid a premature stop in the accumulation of the values used for the average calculation, but not so high as to use values measured after the work becomes a factor. The method does not automatically determine the threshold value. It is best to determine it empirically for a given system by trying different values and observing the results in verbose mode. A typical value is 1.02 to 1.05 times the message transfer time.

Figure 1 also shows an iteration loop stop threshold. This threshold is not critical and can be any value that ensures the total loop time is significantly larger than the transfer time. A typical value is 1.5 to 2 times the transfer time. In theory, the method could stop when the base_t threshold is exceeded, but in practice this point can be too close to the knee of the curve to provide a reliable measurement. In addition, the work interval without messaging (work_t) only needs to be measured once, after the final sample has been taken.
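
The two thresholds described above might be applied per work interval roughly as sketched below. This builds on the timed_loop() sketch above; base_thresh and stop_thresh are assumed names corresponding to the -b/--bthresh and -t/--thresh options documented under Usage below.

/* Illustrative sketch of the threshold logic (names are assumptions). */
double timed_loop(void *buf, int count, int peer, long work_units, int iters);

double measure(void *buf, int count, int peer, int iters,
               double base_thresh, double stop_thresh, double *base_t_out)
{
    double base_sum = 0.0, base_t = 0.0, iter_t = 0.0;
    int    base_n = 0, base_done = 0;

    for (long work_units = 1; ; work_units *= 2) {
        iter_t = timed_loop(buf, count, peer, work_units, iters);

        if (!base_done) {
            if (base_n == 0 || iter_t < base_thresh * base_t) {
                base_sum += iter_t;               /* work not yet a factor:       */
                base_t    = base_sum / ++base_n;  /* fold iter_t into the average */
            } else {
                base_done = 1;   /* knee reached: freeze the transfer-time average */
            }
        }

        if (base_n > 0 && iter_t > stop_thresh * base_t)
            break;               /* loop time is well past the knee; stop sampling */
    }
    *base_t_out = base_t;        /* averaged message transfer time */
    return iter_t;               /* final loop time, used in the overhead calculation */
}
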
Figure 1. A conceptual illustration of the message transfer time (iter_t) of a given message size for each iteration of the algorithm, with the work performed (work_t) increasing each iteration. The message transfer time calculation threshold (base_t) and the iteration stop threshold are also shown, along with the point at which the overhead calculation is taken.

Usage

The source is distributed as a gzip'd tarball. To install,

% cd installdir
% tar xzf smb.tar.gz

To build,

% cd smb/src/mpi_overhead
% make

The makefile assumes that mpicc is the compiler and is in the user's PATH. If this is not the case, edit Makefile to define the appropriate compiler and switches. The routine is nominally executed on two nodes. A single runtime instance measures overhead and availability for a given message size. For example:

% mpirun -np 2 ./mpi_overhead -v
Calculating send overhead
Iterations are being calculated automatically
Message size is 8 bytes
Overhead comparison threshold is 1.500000 (50.0%)
Base time comparison threshold is 1.020000 (2.0%)
Timing resolution is 0.10 uS
Using 1000 iterations per work value
work    iter_t       base_t      
1       3.992        3.992       
2       3.991        3.991       
4       3.991        3.991       
8       3.993        3.992       
16      3.985        3.990       
32      3.986        3.990       
64      4.002        3.991       
128     3.978        3.990       
256     4.002        3.991       
512     3.975        3.990       
1024    4.172        3.990       
2048    5.933        3.990       
4096    9.465        3.990       
msgsize iterations  iter_t      work_t      overhead    base_t      avail(%)   
8       1000        9.465       8.608       0.858       3.990       78.5       
mpi_overhead: done

In this example, the -v switch prints the intermediate results as the work interval is increased. When the iter_t threshold is exceeded, the iteration loop terminates, the final calculations are made, and the results are printed to stdout. In this case, the overhead = 9.465 - 8.608 = 0.858 uSec, the average loop message transfer time is 3.990 uSec, and the availability = 1 - 0.858/3.990 = 78.5%. It should be pointed out that the base_t value is a running average of the iter_t values until the base_t threshold is exceeded. In the above example this occurs when the work interval is 1024.

In general, the usage of the routine is

% mpirun -np 2 ./mpi_overhead -q
  Usage:./mpi_overhead
  [-r | --recv] measure receive overhead, default is send
  [-i | --iterations num_iterations] default = autocalculate
  [-m | --msgsize size] default = 8 bytes
  [-t | --thresh threshold] default = 1.500000
  [-b | --bthresh base time threshold] default = 1.020000
  [-n | --nohdr] don't print header information, default == off
  [-v | --verbose] print partial results, default == off

The default operation is a send. To measure receive performance, specify the --recv switch. The iter_t and base_t thresholds can also be specified on the command line. As discussed in the Method section, the base_t threshold for a given system may need to be determined empirically. If the base_t threshold is set too small, the base_t average calculation will terminate too early. This can occur if there is significant noise in the system and hence a large variation in the measured iter_t values for each iteration of the work loop.
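
For example, to measure receive overhead for a 64 KiB message with a slightly larger base_t threshold (the values shown are illustrative, not recommendations):

% mpirun -np 2 ./mpi_overhead --recv -m 65536 -b 1.05 -v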

In order to run the routine for a series of message sizes, an example script is provided. The script assumes mpirun is the command used to launch an MPI job; edit the script if your system uses a different command. Running the script prints a series of results, which can then be imported into your favorite spreadsheet or plotting routine for analysis.

% ./run_script
######### START #########
Running on n107
Dir is /scratch2/dwdoerf/smb/source/mpi_overhead
msgsize iterations  iter_t      work_t      overhead    base_t      avail(%)   
0       1000        5.708       4.932       0.776       3.765       79.4       
2       1000        9.308       8.444       0.863       3.881       77.8       
4       1000        5.681       4.969       0.712       3.773       81.1       
8       1000        5.700       4.974       0.726       3.785       80.8       
16      1000        9.320       8.485       0.835       3.890       78.5       
32      1000        9.324       8.442       0.882       3.944       77.6       
64      1000        9.428       8.445       0.983       4.507       78.2       
128     1000        9.491       8.484       1.007       4.863       79.3       
256     1000        9.591       8.482       1.110       5.487       79.8       
512     1000        16.452      15.499      0.952       6.624       85.6       
1024    1000        16.495      15.529      0.965       7.216       86.6       
2048    1000        16.509      15.521      0.988       8.509       88.4       
4096    1000        16.510      15.480      1.030       10.674      90.4       
8192    1000        30.629      29.546      1.083       15.365      93.0       
16384   1000        58.887      57.682      1.205       24.688      95.1       
32768   1000        115.392     113.956     1.436       43.360      96.7       
65536   100         228.098     226.690     1.408       82.308      98.3       
131072  100         453.242     451.644     1.598       156.938     99.0       
262144  100         903.974     901.958     2.016       306.440     99.3       
524288  100         1804.092    1802.496    1.596       604.897     99.7       
1048576 100         1804.090    1802.560    1.530       1202.453    99.9       
2097152 100         3605.904    3603.544    2.360       2396.718    99.9       
4194304 100         7208.664    7205.820    2.844       4786.473    99.9       
######### DONE! #########

By default, the script measures the send operation. To measure receive performance, specify --recv.

% ./run_script --recv
######### START #########
Running on n107
Dir is /scratch2/dwdoerf/smb/source/mpi_overhead
msgsize iterations  iter_t      work_t      overhead    base_t      avail(%)   
0       1000        9.392       8.490       0.902       3.859       76.6       
2       1000        9.526       8.493       1.033       4.173       75.3       
4       1000        9.501       8.451       1.049       4.044       74.1       
8       1000        9.502       8.453       1.050       4.054       74.1       
16      1000        9.518       8.499       1.019       4.082       75.0       
                                  *
                                  *
                                  *

Results

Several platforms were used in the development of this method. Table 1 summarizes the platforms, and Figures 2 through 5 graph results for MPI_Isend() and MPI_Irecv() operations.

Table 1: Platform Summary


                          Red Storm    Thunderbird      CBC-B       Odin         Red Squall
Interconnect              Seastar 1.2  InfiniBand       InfiniBand  Myrinet 10G  QsNetII
Manufacturer              Cray         Cisco/Topspin    PathScale   Myricom      Quadrics
Adapter                   Custom       PCI-Express HCA  InfiniPath  Myri-10G     Elan4
Host Interface            HT 1.0       PCI-Express      HT 1.0      PCI-Express  PCI-X
Programmable coprocessor  Yes          No               No          Yes          Yes
MPI                       MPICH-1      MVAPICH          InfiniPath  MPICH-MX     MPICH QsNet

Figure 2. Overhead as a function of message size for MPI_Isend().
Figure 3. Application availability as a function of message size for MPI_Isend().
Figure 4. Overhead as a function of message size for MPI_Irecv().
Figure 5. Application availability as a function of message size for MPI_Irecv().


References

1.  W. Lawry, C. Wilson, A. Maccabe, R. Brightwell. COMB: A Portable Benchmark Suite for Assessing MPI Overlap. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER 2002), p. 472, 2002.

2.  D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian and T. von Eicken. LogP: Towards a Realistic Model of Parallel Computation. In Fourth ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, pp. 262-273, 1993.

Contact Information

Douglas Doerfler
Sandia National Laboratories
Albuquerque, NM
dwdoerf@sandia.gov
(505)844-9528