
#1 2011-04-19 13:51:51

pierre_j
Member
Registered: 2010-01-19
Posts: 773

[bug report] EXECUTION_CODE_ASTER_EXIT_31552=139: coredump

Hi,

We are encountering an error with this simulation that is difficult to understand, and would like to know which track could be followed to see the problem more clearly.
Has anyone already met such a problem?

We thank you in advance.
Bests,

Pierre

[ferrand03:31592] *** Process received signal ***
[ferrand03:31592] Signal: Segmentation fault (11)
[ferrand03:31592] Signal code: Address not mapped (1)
[ferrand03:31592] Failing at address: 0xffffffff37059580
[ferrand03:31592] [ 0] /lib64/libpthread.so.0 [0x32caa0de70]
[ferrand03:31592] [ 1] ./asteru_mpi(__Compute2WayPartitionParams+0xc7) [0x1f9c337]
[ferrand03:31592] [ 2] ./asteru_mpi(__MlevelNodeBisectionMultiple+0x4b6) [0x1fa8926]
[ferrand03:31592] [ 3] ./asteru_mpi(__MlevelNestedDissection+0xba) [0x1fa7c2a]
[ferrand03:31592] [ 4] ./asteru_mpi(METIS_NodeND+0x2a9) [0x1fa9329]
[ferrand03:31592] [ 5] ./asteru_mpi(metis_nodend_+0x12) [0x1f71bf2]
[ferrand03:31592] [ 6] ./asteru_mpi(dmumps_195_+0x48c0) [0x18aa400]
[ferrand03:31592] [ 7] ./asteru_mpi(dmumps_26_+0xde83) [0x193fb43]
[ferrand03:31592] [ 8] ./asteru_mpi(dmumps_+0xbb0) [0x187c5e0]
[ferrand03:31592] [ 9] ./asteru_mpi(amumpd_+0xd36) [0x997496]
[ferrand03:31592] [10] ./asteru_mpi(amumph_+0x159d) [0x63bdbd]
[ferrand03:31592] [11] ./asteru_mpi(tldlg3_+0x1e06) [0x9bc1d6]
[ferrand03:31592] [12] ./asteru_mpi(preres_+0xdee) [0x69f8ae]
[ferrand03:31592] [13] ./asteru_mpi(nmcoma_+0xbc7) [0x770e67]
[ferrand03:31592] [14] ./asteru_mpi(nmdesc_+0x53f) [0x5b8dcf]
[ferrand03:31592] [15] ./asteru_mpi(op0070_+0x24c1) [0x51ec91]
[ferrand03:31592] [16] ./asteru_mpi(ex0000_+0x197) [0x511237]
[ferrand03:31592] [17] ./asteru_mpi(execop_+0x12e) [0x4fb6ee]
[ferrand03:31592] [18] ./asteru_mpi(expass_+0xc1) [0x4bfd31]
[ferrand03:31592] [19] ./asteru_mpi [0x4a131f]
[ferrand03:31592] [20] /soft/libraries/Python/2.7.1/gnu/x86_64/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x66da) [0x2addb622179a]
[ferrand03:31592] [21] /soft/libraries/Python/2.7.1/gnu/x86_64/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5fd7) [0x2addb6221097]
[ferrand03:31592] [22] /soft/libraries/Python/2.7.1/gnu/x86_64/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5fd7) [0x2addb6221097]
[ferrand03:31592] [23] /soft/libraries/Python/2.7.1/gnu/x86_64/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5fd7) [0x2addb6221097]
[ferrand03:31592] [24] /soft/libraries/Python/2.7.1/gnu/x86_64/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5fd7) [0x2addb6221097]
[ferrand03:31592] [25] /soft/libraries/Python/2.7.1/gnu/x86_64/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5fd7) [0x2addb6221097]
[ferrand03:31592] [26] /soft/libraries/Python/2.7.1/gnu/x86_64/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x8c9) [0x2addb6222d89]
[ferrand03:31592] [27] /soft/libraries/Python/2.7.1/gnu/x86_64/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5c12) [0x2addb6220cd2]
[ferrand03:31592] [28] /soft/libraries/Python/2.7.1/gnu/x86_64/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x5fd7) [0x2addb6221097]
[ferrand03:31592] [29] /soft/libraries/Python/2.7.1/gnu/x86_64/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x8c9) [0x2addb6222d89]
[ferrand03:31592] *** End of error message ***
/tmp/jwang-ferrand03-batch.31552/mpi_script.sh: line 38: 31592 Segmentation fault      (core dumped) ./asteru_mpi Python/Execution/E_SUPERV.py -eficas_path ./Python -commandes fort.1 -rep none -num_job 31552 -mode batch -rep_outils /soft/aster/10.3.0-3/intel/outils -rep_mat /soft/aster/10.3.0-3/intel/STA10.3/materiau -rep_dex /soft/aster/10.3.0-3/intel/STA10.3/datg -suivi_batch -memjeveux 812.500000 -tpmax 100000
EXECUTION_CODE_ASTER_EXIT_31552=139
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 31586 on
node ferrand03 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
EXIT_COMMAND_31552_00000014=1

Last edited by pierre_j (2011-07-06 12:20:16)


Attachments:
reducedU1.zip, Size: 771.17 KiB, Downloads: 263


#2 2011-04-19 16:44:01

mathieu.courtois
Administrator
From: France
Registered: 2007-11-21
Posts: 1,170

Re: [bug report] EXECUTION_CODE_ASTER_EXIT_31552=139: coredump

Hello,

It seems that the metis linked against mumps is not the correct version.
Check whether the build was correct: mumps must be linked against metis-4.0.
metis-edf-4.1 is only used as an executable called from Code_Aster.

Have a look at step 3 on this page: http://www.code-aster.org/wiki/doku.php … er_mpi_par

Workaround: use another renumbering, for example by setting RENUM='PORD'.
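For instance, in the SOLVEUR keyword (only a minimal sketch: the surrounding operator and the other solver options of your study are not shown and should be kept as they are):

    # keep MUMPS but move the renumbering away from METIS,
    # which is where the crash occurs in the backtrace above
    SOLVEUR=_F(
        METHODE='MUMPS',
        RENUM='PORD',
    ),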

MC


Code_Aster release : last unstable on Ubuntu 16.04 64 bits - GNU Compilers

Please do not forget to tag your first post as *SOLVED* when it is!


#3 2011-04-26 02:34:47

dlc
Member
Registered: 2011-02-24
Posts: 13

Re: [bug report] EXECUTION_CODE_ASTER_EXIT_31552=139: coredump

Hi,

Having done this install myself, and having rerun it just to check:

>It seems that the metis linked against mumps is not the correct version.
>Check whether the build was correct: mumps must be linked against metis-4.0.
>metis-edf-4.1 is only used as an executable called from Code_Aster.

Mumps is linked against metis-4.0

See the attached config.txt and the mumps link line below:

mpif90 -o ssimpletest -O3 ssimpletest.o ../lib/libsmumps.a ../lib/libmumps_common.a  -L/opt/libraries/metis/4.0/intel/x86_64/lib -lmetis -L../PORD/lib/ -lpord   -L/opt/intel/Compiler/11.1/073/mkl/lib/em64t/ -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 -lguide -lpthread -lpthread

This was plain Aster 10.3, with no updates applied.
I'm afraid the problem is somewhere else.

Bests.
DLC


Attachments:
config.txt, Size: 7.15 KiB, Downloads: 331


#4 2011-04-26 12:14:44

Thomas DE SOZA
Guru
From: EDF
Registered: 2007-11-23
Posts: 3,066

Re: [bug report] EXECUTION_CODE_ASTER_EXIT_31552=139: coredump

dlc wrote:

Having done this install myself, and having rerun it just to check.

Mumps is linked against metis-4.0

The package aster-full had a faulty metis-4.0 for a while. Either download the package again and re-install or download metis-4.0 directly from its web site.

TdS


#5 2011-04-26 23:23:44

dlc
Member
Registered: 2011-02-24
Posts: 13

Re: [bug report] EXECUTION_CODE_ASTER_EXIT_31552=139: coredump

TdS wrote:

The package aster-full had a faulty metis-4.0 for a while. Either download the package again and re-install or download metis-4.0 directly from its web site.

I did not compile with the metis provided by aster.
I used metis from their site:
[root@hpdlc OLD]# md5sum metis-4.0.tar.gz
0aa546419ff7ef50bd86ce1ec7f727c7  metis-4.0.tar.gz

Their site changed dramatically on Mar 19; metis-4.0 is no longer available, even in the OLD section.
I will use metis-4.0.3.tar.gz from now on.

I don't think, however, that it will fix the problem; see the attached log file.
I ran into a different error (oh boy, I have seen it so many times and never had a clue about it).
This was with RedHat 6; I will retry with RH 5.5 to be sure.
But since I hit it at the exact same place in the computation, I fear the problem is somewhere else:

forrtl: severe (59): list-directed I/O syntax error, unit -5, file Internal List-Directed Read

Bests
DLC


Attachments:
reducedU1.log, Size: 64.94 KiB, Downloads: 285


#6 2011-04-27 12:25:04

pierre_j
Member
Registered: 2010-01-19
Posts: 773

Re: [bug report] EXECUTION_CODE_ASTER_EXIT_31552=139: coredump

Dear all,

This may be bad news or good news, but we have not succeeded in reproducing the error with the same model...

I enclose 2 models with 2 execution logs each:

- contact by DISCRETE/PENALISATION: the very same model as the one I posted earlier in the thread, with:
    - the log I already posted, where a definite failure message appears about MPI and a segmentation fault
    - a new log from a fresh computation run this morning, which terminates with a more standard NonConvergenceError

- contact by CONTINUE/PENALISATION: a new model with the same mesh but a different contact method (CONTINUE, see the sketch of both formulations below), with:
    - a log from yesterday: again the same puzzling failure message about MPI and a segmentation fault
    - and a new log from a computation carried out this morning that again terminates with a more understandable error message (but different from the other model): "échec de la boucle des contraintes actives lors du traitement du contact" (failure of the active-constraint loop during contact treatment)

In the end, neither case runs to completion.
But each time, one termination can be considered "normal" (which we will refer to as termination "O"), while the other raises questions about a segmentation fault (which we will refer to as termination "X")...

I was thinking that maybe the node was at fault, or that the failures occurred at the same simulation time, but this does not seem to be the case. Here is what we get:
- case DISCRETE, node ferrand03, termination X, failure at simulation time 1.496875
- case DISCRETE, node ferrand07, termination O, failure at simulation time 1.55859375
- case CONTINUE, node ferrand07, termination X, failure at simulation time 3.125E-02
- case CONTINUE, node ferrand04, termination O, failure at simulation time 3.125E-02

I enclose an archive of the files:
- one mesh, common to both models,
- 2 comm files, one for each contact method (DISCRETE or CONTINUE),
- 4 mess and job files covering terminations X and O for the DISCRETE and CONTINUE methods.
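
For reference, here is roughly how the two formulations are selected in the command files (only a sketch, written in recent DEFI_CONTACT syntax: the group names and penalty coefficients are placeholders, and the attached .comm files contain the actual definitions):

    # DISCRETE / PENALISATION (placeholder groups and penalty stiffness)
    CONT_D = DEFI_CONTACT(
        MODELE=MO,
        FORMULATION='DISCRETE',
        ZONE=_F(
            GROUP_MA_MAIT='MAIT',
            GROUP_MA_ESCL='ESCL',
            ALGO_CONT='PENALISATION',
            E_N=1.0e6,
        ),
    )

    # CONTINUE / PENALISATION (same mesh and zones, different formulation)
    CONT_C = DEFI_CONTACT(
        MODELE=MO,
        FORMULATION='CONTINUE',
        ZONE=_F(
            GROUP_MA_MAIT='MAIT',
            GROUP_MA_ESCL='ESCL',
            ALGO_CONT='PENALISATION',
            COEF_PENA_CONT=100.0,
        ),
    )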

If anyone sees a hidden meaning in this, please do not hesitate to tell us.

Bests,

Pierre

Last edited by pierre_j (2011-04-27 12:32:46)


Attachments:
disturbingCases.7z, Size: 538.04 KiB, Downloads: 216


#7 2011-04-28 00:34:43

dlc
Member
Registered: 2011-02-24
Posts: 13

Re: [bug report] EXECUTION_CODE_ASTER_EXIT_31552=139: coredump

Hi, All

Still pursuing your initial reducedU1 model, and following TdS's advice (many thanks for that, TdS), I recompiled with metis-4.0.3 from their site.
It did not work yesterday (probably for some other reason linked to RedHat 6).
It did work today, on CentOS 5.5, with:
- aster 10.3.21-1 on 1 core (I stopped it around time step 1.6)
- plain aster-full, same as yours (the job is still running on 12 cores, with a long way to go until the final time step 2); see the attached log file, which goes up to 1.524218750E+00.

At some point, aster starts refining the time steps (more and more) and goes back in time.
I'm puzzled that this happens at time step
- 1.5 with aster 10.3.21-1 on 1 core,
- but 1.4 with plain aster-full on 12 cores.
Is that normal? Due to the version? Due to parallelism? Some numerical instability?

Performance-wise:
- I could not run it on 12 cores with MPI; I got a very clear aster alarm, but I am not yet skilled enough to interpret it.
- 12-core SMP does not seem much faster than 1 core; I'd say slower (just a feeling, no precise analysis).

pierre_j, let's discuss offline how to provide you with the executable (16 MB, too big to fit in an attachment).

Bests
DLC


Attachments:
reducedU1.10.3.0-3.12core.shm.log, Size: 366.18 KiB, Downloads: 274
