Monday, August 8, 2011

Diagnosing stuck OpenMPI jobs with mpirun

I was testing a simple OpenMPI job and somehow it hung without producing any output. To diagnose why mpirun was stuck, I used this handy flag to make the remote daemons report what they are doing:

 $ mpirun --debug-daemons -np 24 -host c18,c19,c20  hello_world_mpi

..........
[c18.cluster.spms.ntu.edu.sg:09182] [[21720,0],1] orted_cmd: received exit
[c19.cluster.spms.ntu.edu.sg:13800] [[21720,0],2] orted_cmd: received exit
[c19.cluster.spms.ntu.edu.sg:13800] [[21720,0],2] orted: finalizing
[c18.cluster.spms.ntu.edu.sg:09182] [[21720,0],1] orted: finalizing
[c20.cluster.spms.ntu.edu.sg:08614] [[21720,0],3] orted_cmd: received exit
[c20.cluster.spms.ntu.edu.sg:08614] [[21720,0],3] orted: finalizing 
..........

Finally realised that my mpirun had "run away" to an "IBM RNDIS/CDC ETHER" interface, which I promptly shut down with ifdown usb0

After that it ran smoothly.
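
If you would rather not bring the interface down system-wide, OpenMPI also lets you tell mpirun which interfaces to avoid via MCA parameters. This is a sketch assuming the stray interface is named usb0, as in my case; adjust the interface name for your own setup:

```shell
# Exclude usb0 (plus loopback) from both the out-of-band startup
# channel and the TCP message-passing layer, so mpirun and the MPI
# traffic stay on the real cluster network.
mpirun --mca oob_tcp_if_exclude lo,usb0 \
       --mca btl_tcp_if_exclude lo,usb0 \
       -np 24 -host c18,c19,c20 hello_world_mpi
```

Alternatively, the `_if_include` variants pin OpenMPI to a named interface (e.g. `--mca btl_tcp_if_include eth0`) instead of listing everything to avoid.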
