
Failure of a compute node

When yod contacts the PCTs it has been allocated, the PCTs form a group so that they can perform scalable collective communication. If one of the PCTs fails to execute the protocol, another PCT reports this to yod. yod normally displays only the first error it receives from the compute partition. It may then display a "waiting for" relationship, showing which PCTs are waiting for which. The PCT at the end of this chain is on a malfunctioning compute node, and a system administrator should log in to that node and try to determine the problem.




\fbox{\parbox{5.5 in}{\tt
Failure while hosting application process on node 0 w...
...for a message from): \linebreak[4]
(job 101) Waiting for node: 0 <- 18 <- 19
}}

Above is a sample load error message. yod received several messages from PCTs indicating that they timed out waiting for another PCT. The first message received was from the rank 0 node; yod does not print the rest. However, yod prints a summary showing that node 0 is waiting for a message from node 18, and node 18 is waiting for a message from node 19. (These are physical node numbers, not process ranks.) Examine physical node 19 for message-passing problems. If it repeatedly fails during application loads, remove it from the virtual machine with "pingd -gone".
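
When many load failures have to be triaged, the waiting-for summary can be read mechanically. The following is a minimal sketch in Python, not part of yod or any Cplant tool. It assumes the summary line has exactly the form shown in the sample message, and the function name parse_wait_chain is hypothetical. It returns the physical node numbers in chain order, so the last entry is the node to examine.

```python
import re


def parse_wait_chain(summary_line):
    """Return the physical node numbers in a "Waiting for node" summary.

    Assumes the line looks like the sample in this section, e.g.
    "(job 101) Waiting for node: 0 <- 18 <- 19".
    Returns an empty list if no chain is found.
    """
    match = re.search(r"Waiting for node:\s*(\d+(?:\s*<-\s*\d+)*)", summary_line)
    if match is None:
        return []
    # Split "0 <- 18 <- 19" on the arrows and convert each piece to int.
    return [int(node) for node in re.split(r"\s*<-\s*", match.group(1))]


if __name__ == "__main__":
    chain = parse_wait_chain("(job 101) Waiting for node: 0 <- 18 <- 19")
    if chain:
        # The node at the end of the chain is the one to inspect.
        print("suspect physical node:", chain[-1])
```

For the sample line, the chain is 0, 18, 19, and the suspect is physical node 19, matching the reading given above.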


Lee Ann Fisk 2001-06-25