Ticket #10 (assigned defect)

Opened 3 years ago

Last modified 3 years ago

Valgrind error in TestInterCollObjWorld.testReduce with openmpi 1.1 on FC6

Reported by: albertstrasheim
Assigned to: dalcinl (accepted)
Priority: major
Milestone:
Component: component1
Version:
Keywords:
Cc:

Description

I'm using Valgrind 3.2.1 to run test_collobj.py from mpi4py 0.4.0rc2 on Fedora Core 6 with the included openmpi 1.1-7. I use Valgrind as follows:

mpiexec -n 3 \
        valgrind \
        --tool=memcheck \
        --leak-check=yes \
        --error-limit=no \
        --suppressions=valgrind-python.supp \
        --num-callers=20 \
        --freelist-vol=536870912 \
        -v \
        python test_collobj.py -v

valgrind-python.supp can be found in the Python SVN repository. Some lines have to be uncommented to suppress most of the warnings caused by Python (see README.valgrind for more info).

The following errors show up when running TestInterCollObjWorld.testReduce:

==23269== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 7698 from 7)
==23269== 
==23269== 1 errors in context 1 of 2:
==23269== Invalid read of size 4
==23269==    at 0x4F9CC92: mca_pml_ob1_recv_frag_match (pml_ob1_recvfrag.c:454)
==23269==    by 0x4F9E30E: mca_pml_ob1_recv_frag_callback (pml_ob1_recvfrag.c:101)
==23269==    by 0x50B1519: mca_btl_sm_component_progress (btl_sm_component.c:392)
==23269==    by 0x50A7BE5: mca_bml_r2_progress (bml_r2.c:102)
==23269==    by 0x4CD1A59: opal_progress (opal_progress.c:288)
==23269==    by 0x4F9AEDA: mca_pml_ob1_recv (condition.h:81)
==23269==    by 0x50C38E9: mca_coll_basic_gather_intra (coll_basic_gather.c:77)
==23269==    by 0x50C13ED: mca_coll_basic_allgather_intra (coll_basic_allgather.c:74)
==23269==    by 0x4B9D29C: ompi_comm_split (comm.c:401)
==23269==    by 0x4BC7269: MPI_Comm_split (comm_split.c:58)
==23269==    by 0x4B3D0DF: comm_split (mpi.c:3565)
==23269==    by 0xB2654C: PyCFunction_Call (in /usr/lib/libpython2.4.so.1.0)
...
==23269==  Address 0x5C is not stack'd, malloc'd or (recently) free'd
==23269== 
==23269== 1 errors in context 2 of 2:
==23269== Invalid read of size 4
==23269==    at 0x4F9CC6F: mca_pml_ob1_recv_frag_match (pml_ob1_recvfrag.c:450)
==23269==    by 0x4F9E30E: mca_pml_ob1_recv_frag_callback (pml_ob1_recvfrag.c:101)
==23269==    by 0x50B1519: mca_btl_sm_component_progress (btl_sm_component.c:392)
==23269==    by 0x50A7BE5: mca_bml_r2_progress (bml_r2.c:102)
==23269==    by 0x4CD1A59: opal_progress (opal_progress.c:288)
==23269==    by 0x4F9AEDA: mca_pml_ob1_recv (condition.h:81)
==23269==    by 0x50C38E9: mca_coll_basic_gather_intra (coll_basic_gather.c:77)
==23269==    by 0x50C13ED: mca_coll_basic_allgather_intra (coll_basic_allgather.c:74)
==23269==    by 0x4B9D29C: ompi_comm_split (comm.c:401)
==23269==    by 0x4BC7269: MPI_Comm_split (comm_split.c:58)
==23269==    by 0x4B3D0DF: comm_split (mpi.c:3565)
==23269==    by 0xB2654C: PyCFunction_Call (in /usr/lib/libpython2.4.so.1.0)
...
==23269==  Address 0xA8 is not stack'd, malloc'd or (recently) free'd

The test runs in all three processes, but the errors show up only once, so maybe only the root process hits this error?
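One way to check whether only a single process reports errors is to tally the ERROR SUMMARY lines by PID, since Valgrind prefixes every line it prints with ==PID==. The following is only a rough sketch: the function name errors_per_pid is made up for illustration, and in the sample input only the 23269 line comes from the output above; the lines for PIDs 23270 and 23271 are hypothetical placeholders for the other two processes.

```python
import re

# Valgrind prefixes each of its output lines with "==PID==".
PID_LINE = re.compile(r"^==(\d+)==\s?(.*)$")
SUMMARY = re.compile(r"ERROR SUMMARY: (\d+) errors from (\d+) contexts")


def errors_per_pid(lines):
    """Map each PID to the error count from its ERROR SUMMARY line.

    If a PID prints more than one summary, the last one wins.
    """
    counts = {}
    for line in lines:
        match = PID_LINE.match(line)
        if not match:
            continue  # not a Valgrind line (e.g. unittest output)
        pid, rest = int(match.group(1)), match.group(2)
        summary = SUMMARY.search(rest)
        if summary:
            counts[pid] = int(summary.group(1))
    return counts


SAMPLE = [
    # Taken from the output in this ticket:
    "==23269== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 7698 from 7)",
    # Hypothetical lines standing in for the other two processes:
    "==23270== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 7698 from 7)",
    "==23271== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 7698 from 7)",
]

for pid, count in sorted(errors_per_pid(SAMPLE).items()):
    print(pid, count)
```

On a real run, the lines would come from the saved mpiexec output instead of SAMPLE. Note that a PID alone does not say which MPI rank it belongs to, so this only shows how many processes report errors, not whether the one that does is the root.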

If this isn't an mpi4py issue, maybe it's a problem with openmpi. Unfortunately, openmpi 1.1.2 segfaulted when I tried to use mpiexec to run Valgrind.

Change History

11/10/06 16:23:08 changed by dalcinl

  • status changed from new to assigned.
  • owner changed from somebody to dalcinl.

This is a bug in OMPI. I've reported it.