Rationale and Plan for the Red Storm I/O Mitigation Project

Goals

The stated goals of this project are as follows:

  1. Identify major I/O bottlenecks when accessing Lustre file systems
  2. Propose a follow-on plan to address the identified I/O bottlenecks
  3. Generalize the performance analysis methodology for subsequent use
  4. Consider applicability of other Sandia I/O initiatives as work progresses

Introduction

This effort was begun in response to two recognized issues. First, the parallel efficiency of user applications attempting n:f file accesses, where n is the number of compute nodes and f is the number of files, appears highly sub-optimal. Second, the end-to-end efficiency of even a single-node application on the compute side can demonstrate very poor scaling with respect to stripe-width.

Parallel Efficiency of the User Application

The parallel efficiency of user application I/O is directly related to the manner in which it uses the underlying, system-supplied file system. By "manner", we mean the constraints under which the application requires the file system to function. For instance, will the application want to assume that the file system is fully POSIX-correct? Does it only ever read, or must it also write? Is it distributed? If distributed and it writes, does it require that other, independent applications have a coherent view? If distributed and writing a single file from all the sub-tasks, will it demand that the file system maintain a globally shared value for the end of the file by requesting append writes? Will the application be able to request access to file content at well-aligned addresses and with sizes that are even multiples of the system constraints imposed on the file system implementation, allowing multiple, simultaneous accesses to the file content to be decoupled? Does the distributed application use a suitable stripe width? In short, just as with the compute problem it is trying to solve, the amount of thought and attention the programmer puts into the I/O problem has a direct impact on the ability of the system to execute a solution.
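
To make the append-write question concrete, the minimal C sketch below contrasts a coupled access, where O_APPEND obliges the file system to maintain a globally shared end-of-file, with a decoupled access at an explicit, aligned, non-overlapping offset. The transfer size and the notion of a task "rank" are illustrative assumptions, not taken from any particular application.

    #include <fcntl.h>
    #include <unistd.h>

    #define XFER (1u << 20)   /* assumed transfer size, aligned to the stripe */

    /* Coupled: O_APPEND obliges the file system to serialize writes
     * against a single, globally shared end-of-file value. */
    static void append_write(const char *path, const char *buf)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0)
            return;
        write(fd, buf, XFER);
        close(fd);
    }

    /* Decoupled: each task writes at an explicit, aligned,
     * non-overlapping offset, so no shared file state need be
     * consulted for each write. */
    static void aligned_write(const char *path, const char *buf, int rank)
    {
        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        if (fd < 0)
            return;
        pwrite(fd, buf, XFER, (off_t)rank * XFER);
        close(fd);
    }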

Poor Scaling With Respect to Stripe-Width

Even the best behaved application is not going to demonstrate good performance if either the file system or the communication infrastructure prevents use of all the available bandwidth. We have data from at least one application, Sage, indicating that something less than 50% of the measured end-to-end bandwidth is achievable, at best. Is this because Sage is not well behaved? Is it because Sage is not well-enough behaved? Is it because of some improper coupling in the parallel file system? Is it because of a topology problem in the communications mesh? Is it because of improper firmware handling of highly contended links in the I/O service section? We don't know.

Proposed Work

The ultimate goals in the system requirements are to demonstrate:

In order to achieve these goals, regardless of the application, we intend to:

End-to-End Efficiency

The first step in our study is to identify the major components involved in the data stream. For the pipeline between the compute node and the IO service section, in the case where the stripe-width of the associated file is one, it is:

For the pipeline between the service node and the IO node, the list above is different only in the top portion. It changes to:

In all of the tasks described below, similar testing using the same load generation, suitably modified for a Linux client, will be used. Similar methods and procedures will be used to expose bottlenecks and localize them.

As a first-pass, higher-level cut at determining capability and demonstrating potential bottlenecks, we will group many of the items above into major sub-systems for measurement. We will:

The above four tasks will give us well-supported ideas as to the capabilities and limitations of the IO pipeline from application to storage. In the event that a bottleneck becomes apparent, we must find ways to localize it and measure on either side. This may prove difficult to measure exactly, because we probably will not be able to account for sub-system interactions on either side of the demarcation point where the bottleneck occurs.

However, the procedure described should expose all of the bottlenecks, localize them, and, barring unanticipated impedance mismatches due to coupling across the bottleneck, develop measurements that give us confidence in the pipeline later, when the bottlenecks are eliminated or mitigated.
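
As an illustration of the kind of measurement we have in mind for each sub-system, the C sketch below times a fixed number of transfers through one segment of the pipeline and reports sustained bandwidth. The path, transfer size, and transfer count are placeholders, not parameters of the actual test plan.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t xfer  = 1 << 20;   /* transfer size under test (placeholder) */
        const int    count = 1024;      /* number of transfers (placeholder)      */
        char *buf = malloc(xfer);
        memset(buf, 0xaa, xfer);

        /* Hypothetical target path on the file system under test. */
        int fd = open("/lustre/testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < count; i++)
            if (write(fd, buf, xfer) != (ssize_t)xfer) { perror("write"); return 1; }
        fsync(fd);                      /* include the time to drain to the servers */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.1f MB/s sustained\n", (double)count * xfer / sec / 1e6);
        close(fd);
        free(buf);
        return 0;
    }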

Parallel Efficiency

Once an idea about the efficiency of the pipeline is established, we may compare scaling studies for various cases using that pipeline in order to establish a baseline and estimate impact, or predict scalable performance.
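
Purely as an illustrative definition, rather than a committed metric, the comparison we have in mind can be stated as the measured aggregate bandwidth at n nodes against n times the single-node baseline:

    /* Illustrative definition only: parallel I/O efficiency at n compute
     * nodes, taken as measured aggregate bandwidth divided by n times the
     * single-node baseline established in the efficiency phase. */
    static double io_efficiency(double aggregate_bw, double baseline_bw, int n)
    {
        return aggregate_bw / ((double)n * baseline_bw);
    }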

We cannot trivially claim that a true measure of scaling performance will be achieved without also guaranteeing that no efficiency bottleneck appears in the efficiency measures. This is because the service side is coupled in all cases, and it is impossible within the scope of this project to enumerate and account for the many and varied ways in which this happens, or for how a particular bottleneck might compound.

However, if the I/O pipe can demonstrate reasonable performance in the data path between the compute node and the echo client/server, then we foresee no problem generating scaling curves that reasonably predict true behavior absent any bottlenecks at all.

As a Distributed File System (n:f parallelism)

We must measure two cases of parallel IO for applications that wish to perform independent IOs from many compute nodes: first, the case where the number of compute nodes is less than or equal to the number of IO service nodes, and second, the case where the number of compute nodes exceeds the number of IO service nodes. In both cases, the file system is leveraged as a distributed file system. In short, each file has at most one compute node making accesses against it.

For the case where the number of desired files is less than the number of IO nodes, there is no contention for the NICs on the IO service partition, or none need exist if the placement is carefully arranged in the worst case. This measurement allows us to develop an understanding of how the communication infrastructure might or might not impact scalability.

For the case where the number of compute nodes, and therefore the number of desired files, exceeds the number of IO nodes, while files are still not shared, there is useful information as well. In this case, we will be searching for data that might tell us how quickly the IO service partition saturates and, once saturated, how well it responds under heavy contention.

In all cases, though, the opportunity to expose improper and sub-optimal coupling within the file system is present. Once again, if found, we will work to localize and explain any such issue. The goal being to develop material for the follow-on plan in order to eliminate or mitigate any such issues.

Common to both tests will be the optimal transfer size discovered previously, in the efficiency phase.
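
For illustration only, the sketch below shows the n:f access pattern these tests exercise: each compute node writes a private file derived from its rank, so no file is ever shared. The path, naming, and helper are hypothetical; the actual runs will use the synthetic benchmarks described below rather than this fragment.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Each node works only against its own file (n:f, no sharing).
     * "rank", "xfer", and the path are placeholders for the sketch. */
    static void write_own_file(int rank, const char *buf, size_t xfer, int count)
    {
        char path[256];
        snprintf(path, sizeof(path), "/lustre/testdir/file.%d", rank);

        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return;
        for (int i = 0; i < count; i++)
            write(fd, buf, xfer);       /* sequential within the private file */
        close(fd);
    }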

When n <= f

For the case where the number of compute nodes attempting to access files in the IO service section is less than the number of IO servers, we will once again use a synthetic workload. We will choose one or both of Sandia's piob and LLNL's IOR programs. Both are capable and instrumented to carefully measure bandwidth at scales up to, and including, the size of the Red Storm compute section. The number of compute nodes used in the study will range from 1 to the largest number of Lustre Object Storage Server (OSS) nodes we can find in a single file system. This test will also be run using the service partition.

When n > f

For this case, in reality, we simply extend the number of compute nodes to and beyond the point where the file system must place multiple files on each Lustre OSS. The same benchmark will be used as for the n <= f case, above. We might also extend this test to the service nodes, though this would have to be accomplished by running multiple, simultaneous processes per service node. Strictly speaking, this latter test, overloading the service nodes, might be considered optional if the service section of the machine already meets or exceeds the stated performance goals.

As a Parallel File System (n:1 parallelism)

A truly parallel file system contains mechanisms and support for simultaneous, shared access to a single file by many readers and writers. Lustre is one such file system.

Unfortunately, the semantics of file IO as mandated by the POSIX standard in all versions are incompatible with any file system implementation that would decouple the servers providing access to the storage. However, with a few explicitly stated announcements by the application to the file system, a contract may be established that does allow the necessary decoupling. We propose to alter the same benchmark used for the distributed file system tests to liberate the file system from the limiting POSIX semantics to the fullest extent that is supported.
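
As a rough illustration of what such a contract enables, assume each task restricts itself to a disjoint, stripe-aligned region of the shared file; the servers then never need to coordinate on overlapping extents or a common end-of-file. The stripe size and helper below are assumptions made for the sketch, not details of the benchmark modification itself.

    #include <fcntl.h>
    #include <unistd.h>

    #define STRIPE_SIZE (1 << 20)   /* assumed stripe size for the sketch */

    /* Relaxed n:1 pattern: each task writes only within its own
     * stripe-aligned, disjoint region of the already-open shared file. */
    static void write_my_region(int fd, int rank, const char *buf,
                                int stripes_per_rank)
    {
        off_t base = (off_t)rank * stripes_per_rank * STRIPE_SIZE;
        for (int i = 0; i < stripes_per_rank; i++)
            pwrite(fd, buf, STRIPE_SIZE, base + (off_t)i * STRIPE_SIZE);
    }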

Once that is done, a scaling study will be made using the modified benchmark, utilizing from 1 to however many compute nodes are needed to clearly display asymptotic behavior in the curve. In fact, we will increase the number of compute nodes beyond that point in order to demonstrate whether or not Lustre loses efficiency when oversubscription continues past the saturation point.

Again, just as with all of the other metrics, any bottleneck or improper coupling in the file system will be localized. A follow-on plan designed to mitigate or eliminate such problems will be proposed.

Sage and Other Applications

Of particular interest on this machine is the Los Alamos application known as Sage. This application is a case of the n:1 method of parallelism, and data exists indicating that the best it can achieve is something less than 50% of the available file system bandwidth.

Why this might be so is unknown. It could be that Sage is not properly utilizing the file system, or that it requires some semantic of the file system that forces the system to partially, or totally, order IO from the application. It could be that the file system is not responding appropriately.

In the case where the file system might be culpable, the parallel n:1 scaling study described in the previous section should expose the same issues. The tests there should be much more amenable to localizing any such issues, as well. In short, the synthetic workload should be able to help explain any such limitation in the file system that Sage might be encountering far more readily than Sage itself.

If the parallel n:1 study does not indicate a problem, then Sage itself must be examined. It is up to the application to "sign the contract" with the file system that enables the file system to drop the coherency guarantees imposed by POSIX on behalf of the application. Sage will be examined for proper application of these extensions.

Once again, this time for Sage, a follow-on proposal for what might be done within the file system to correct issues that only Sage demonstrates will be worked up and documented in the follow-on plan.

For deficiencies in, or a lack of application of, the parallel file system extensions, the follow-on plan will include what might be done for Sage. Given the type and kind of these extensions, the plan would almost certainly include a partnership between the Sage developers and Sandia IO experts to accomplish the task.

Other applications might become interesting during the course of this work. They will be considered individually for inclusion in the final result.

Glossary

POSIX: The IEEE POSIX standard, introduced to enhance the portability of user and programming environments. Among other things, it defines the application programmer interface and the semantics of that interface for I/O-related functions.

Coherent Views: Simply, a view of file or directory name-space content that is maintained in a synchronized fashion. For instance, if two independent, non-cooperating processes are able to freely read and write a particular location in a file and always, reliably, view each other's updates, then the file system is said to be coherent with respect to file content.

SCSI: The Small Computer System Interface standard for I/O supporting, among other things, disk transfers.