Real-World Message Rate Benchmark



Method

* Goals:

  Unlike traditional message rate benchmarks, which attempt to discover
  peak message rate in ideal conditions, our message rate benchmark is
  more concerned with sustained message throughput in application
  scenarios.  We are concerned with how three different
  characteristics of the message pattern influence message rate:

  1) Cold Cache Start-up:  Unlike traditional microbenchmarks, which
     work hard to warm both the cache and network before execution of
     the test, the msgrate benchmark attempts to invalidate the cache
     at the start of each iteration.  This cache invalidation seeks to
     mimic the effect of the "real work" part of a scientific code,
     which is likely to include a large enough computation step to
     result in a data cache devoid of both the receive buffers and all
     MPI-related structures.

     As send buffers are generally touched just before the MPI send is
     performed, those buffers are written to at the completion of the
     cache invalidation in order to bring them into cache.

  2) Simultaneous Send and Receive: We are concerned with the ability
     of networks and MPI implementations to support a high rate of
     messaging when a given node is both sending and receiving
     messages.  Our benchmark counts the total number of messages
     processed, both sending and receiving.

  3) Multiple Communication Peers: The benchmark will simultaneously
     communicate with a specified number of peers.  All receives will
     be posted (all from peer a, then all from peer b, and so on),
     then all sends will be started (first all to peer a, then all to
     peer b, and so on).
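
  As a rough illustration of the counting rule in item 2, the sketch
  below computes a per-process message rate from the number of peers,
  messages per peer, iterations, and elapsed time.  The function name
  and the exact formula are assumptions made for illustration; the
  text above only states that both sent and received messages are
  counted.

```c
#include <assert.h>

/* Hypothetical helper (not taken from the benchmark source):
 * messages per second as seen by one process.  In each iteration the
 * process sends nmsgs messages to each of npeers peers and receives
 * the same number back, so both directions are counted. */
double msgrate_per_sec(int npeers, int nmsgs, int niters, double elapsed_sec)
{
    assert(elapsed_sec > 0.0);
    double total = 2.0 * npeers * nmsgs * niters;  /* sends + receives */
    return total / elapsed_sec;
}
```

  For example, 6 peers, 128 messages per peer, and 10 iterations
  completed in 2 seconds would be reported as 7680 messages per second
  under this assumed formula.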

* Methodology:

There are a number of different communication patterns of interest to
us, based on codes of interest.  Assume that peers is a list of length
npeers, ordered to contain the npeers/2 "lower" peers in ascending
order, followed by the npeers/2 "higher" peers.
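
One way such a peers list could be built is sketched below.  It
assumes, beyond what the text states, that the peers are the nearest
ranks on either side of the calling rank, that ranks wrap around at the
ends of the rank space, and that processes-per-node placement is
ignored.  The function name build_peers is hypothetical.

```c
#include <assert.h>

/* Fill peers[0..npeers-1] for `rank` in a job of `world_size` ranks.
 * Assumption (not from the text): peers are the nearest ranks on each
 * side, wrapping around modulo world_size. */
void build_peers(int rank, int world_size, int npeers, int *peers)
{
    assert(npeers % 2 == 0);      /* need npeers/2 peers on each side */
    assert(world_size > npeers);  /* keeps all peers distinct */
    int half = npeers / 2;
    for (int i = 0; i < half; ++i) {
        /* "lower" peers in ascending order: rank-half, ..., rank-1 */
        peers[i] = (rank - half + i + world_size) % world_size;
        /* "higher" peers in ascending order: rank+1, ..., rank+half */
        peers[half + i] = (rank + 1 + i) % world_size;
    }
}
```

With 7 ranks and 6 peers, rank 0 would get the list 4, 5, 6, 1, 2, 3
under these assumptions.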

- Single direction: This test approximates the behavior of traditional
  message rate benchmarks, with a given peer communicating with
  exactly one other peer (and only in one direction).  The cache
  invalidation phase, which mimics the effect of an application
  working set, is the only notable addition.  The kernel looks
  something like:

  if (odd)
    for number of iterations:
      invalidate cache
      start timer
      post N sends to peer
      wait all
      stop timer
  else
    for number of iterations:
      invalidate cache
      start timer
      post N receives from peer
      wait all
      stop timer

- Pair-based: Each process communicates with a number of peers, so
  that a given process is both sending and receiving messages with
  exactly one other process at a time.  Synchronization cannot be
  guaranteed, so the test may result in a number of unexpected
  messages.  The kernel looks something like:

  for number of iterations:
    invalidate cache
    start timer
    for peers
      post N receives from peer[i]
      post N sends to peer[i]
      waitall
    stop timer

- Pre-posted: This test extends the pair-based test by pre-posting
  receives before cache invalidation, mimicking applications which
  pre-post receives at the completion of the previous computation
  phase.  The kernel looks something like:

  start timer
  <pre-post receives>
  stop timer

  for number of iterations
    invalidate cache
    barrier
    start timer
    for peers
      post N sends to peer[i]
    wait all
    for peers
      post N receives from peer[i] (for the next iteration)
    stop timer

  start timer
  <post final sends>
  stop timer

- All-Start: Similar to the pre-posted test, but does not guarantee
  that receives are pre-posted.  Simulates an application which
  finishes a computation phase, then issues all communication calls at
  once with a single MPI_WAITALL.  The kernel looks something like:

  for number of iterations
    invalidate cache
    barrier

    start timer
    for peers
      post N receives from peer[i]
      post N sends to peer[i]
    waitall
    stop timer

Cache invalidation presents a number of challenges, as there's no way
to guarantee a cache is completely invalidated.  Our current approach
is to create an array of size 16MB (larger than most currently
available L3 cache structures) and iterate through the memory with an
algorithm similar to:

    a[i] = a[i - 1] + 1

Assuming a write-through cache structure, this should ensure that only
the array is in any layer of caching.
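
A minimal C sketch of such a walk is shown below.  The function name,
the int element type, and the use of volatile (to discourage the
compiler from optimizing the loop away) are assumptions made for
illustration, not details taken from the benchmark source.

```c
#include <assert.h>
#include <stddef.h>

/* Walk the whole buffer with a loop-carried dependence, so every
 * element is written and its predecessor is read.  `volatile` is an
 * assumption added here to keep the stores from being elided.
 * n is expected to stay well below INT_MAX elements (a 16MB buffer
 * of 4-byte ints holds about 4 million). */
void cache_invalidate(volatile int *a, size_t n)
{
    assert(n > 0);
    a[0] = 1;
    for (size_t i = 1; i < n; ++i) {
        a[i] = a[i - 1] + 1;
    }
}
```

A caller would typically allocate the buffer once, sized from the
cache size in bytes divided by sizeof(int), and call the function at
the top of every iteration before starting the timer.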

Usage

* Build and Execution:

  The source is distributed as a gzip'd tarball. To install,

   % cd installdir
   % tar xzf smb.tar.gz

  To build,

   % cd smb/src/msgrate
   % make

  If the MPI to be used in testing provides an mpicc wrapper compiler,
  building should be as simple as running 'make'.  Setting CC, CFLAGS,
  and LDFLAGS should all work as expected.  Users are free to
  experiment with optimization flags through CFLAGS; by default, -O3
  is used.

  By default, the test requires at least seven processes to be used in
  testing, although as few as three can be used if the number of peers
  is set sufficiently low.  A number of parameters control the
  experiment:

    -p <num>     Number of peers used in communication
    -i <num>     Number of iterations per test
    -m <num>     Number of messages per peer per iteration
    -s <size>    Number of bytes per message
    -c <size>    Cache size in bytes
    -n <ppn>     Number of procs per node
    -o           Format output to be machine readable

Results

To be supplied.

References

To be supplied.

Contact Information

Brian Barrett
Sandia National Laboratories
Albuquerque, NM
bwbarre@sandia.gov