
Failure of application

The application process executes some startup code before entering the user's code. In this code it sets up its node map and other information needed to function as a parallel application. The application process may encounter an error at this point, in which case one of the messages below will appear. It may be necessary to recompile and relink the code. If this does not fix the problem, a developer may need to run the application under a debugger to see whether there is a bug in one of the Cplant libraries linked with it.

yod will display one of the following messages, depending on where the application failed.




\fbox{\parbox{5.5 in}{\tt
Failure while hosting application process on node...\...
...r code on compute node.\linebreak[4]
Perhaps the executable file is corrupt?
}}

\fbox{\parbox{5.5 in}{\tt
Failure while hosting application process on node...\linebreak[4]
PCT report: application process detected error before user code
}}

\fbox{\parbox{5.5 in}{\tt
Failure while hosting application process on node...\...
...inated before user code\linebreak[4]
App process terminating signal: SIGSEGV
}}
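
When yod output is captured to a log, the three messages above can be told apart by their second line. The following is a minimal Python sketch, not part of Cplant: the function name `classify_startup_failure` and the category labels are hypothetical, and the matching relies only on the message text shown above, which may differ between yod versions.

```python
import re

# Second-line text of each message, as shown above. The labels are
# illustrative names chosen for this sketch, not Cplant terminology.
_PATTERNS = [
    ("corrupt_executable",
     re.compile(r"Perhaps the executable file is corrupt\?")),
    ("error_before_user_code",
     re.compile(r"PCT report: application process detected error before user code")),
    ("killed_by_signal",
     re.compile(r"App process terminating signal: (?P<signal>SIG\w+)")),
]


def classify_startup_failure(log_text):
    """Return (label, signal_name_or_None) for the first pattern that matches.

    Patterns are tried in the order listed above. Returns None when the
    text contains none of the known startup-failure messages.
    """
    for label, pattern in _PATTERNS:
        match = pattern.search(log_text)
        if match:
            # Only the signal pattern defines a named group; for the
            # others groupdict() is empty and .get() yields None.
            return label, match.groupdict().get("signal")
    return None
```

Calling `classify_startup_failure` on captured yod output that contains the third message would return the label `killed_by_signal` together with the signal name, which tells a developer whether a debugger session is the likely next step.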

Recall that when the application writes to the stdout stream or does file I/O without using a parallel I/O facility, the operation goes through yod. Another error can occur when yod encounters a problem processing application I/O. For example, if the application process or the compute node it is running on crashes midway through an I/O operation being carried out by yod, the I/O operation will fail. It may be necessary to kill the parallel application at this point. (That is the owner's call.) The system administrator should investigate the compute node to determine why the application running on it did not complete the I/O protocol. This is the message displayed in this case:




\fbox{\parbox{5.5 in}{\tt
I/O error (node 27 portal ID 4 rank 12): timeout on message wait\linebreak[4]
Interrupt yod to kill the parallel application.
}}
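
The I/O error message identifies the node, portal ID, and rank involved, which is what an administrator needs in order to know which compute node to investigate. As a rough illustration, the Python sketch below extracts those fields from a captured line. The function name `parse_io_error` and the returned key names are hypothetical, and the field layout is assumed from the single sample message above.

```python
import re

# Field layout taken from the sample message above; other yod versions
# may format this line differently.
_IO_ERROR = re.compile(
    r"I/O error \(node (?P<node>\d+) portal ID (?P<portal_id>\d+) "
    r"rank (?P<rank>\d+)\): (?P<reason>.+)"
)


def parse_io_error(line):
    """Return a dict with node, portal_id, rank, and reason, or None if no match."""
    match = _IO_ERROR.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "node": int(fields["node"]),
        "portal_id": int(fields["portal_id"]),
        "rank": int(fields["rank"]),
        "reason": fields["reason"].strip(),
    }
```

For the sample message, the returned dictionary would carry node 27, portal ID 4, rank 12, and the reason text "timeout on message wait".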


Lee Ann Fisk 2001-06-25