
#1 2020-06-20 18:25:05

mf
Member
Registered: 2019-06-18
Posts: 55

'Bus Error' during execution of parallel version 14.4

Hello,

during testing of new hardware I encountered a strange error with the parallel version. I am using the Docker image of 14.4_MPI. The error shows up right after the first iteration of a nonlinear simulation whenever I set mpi_nbcpu to a value greater than the number of cores of a single CPU in this system; up to that point everything is fine (partitioning etc.). I would understand the error on a single-CPU system, but that is not the case here: the system has 4 CPUs, each with 12 cores (48 cores in total). Apparently, I cannot enter, for example:

mpi_nbcpu 24
mpi_nbnoeud 1
ncpus 2

From my experience, ncpus should always be 2, and mpi_nbcpu should be TOTAL_CORES/ncpus for the fastest runs.
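That rule of thumb can be sketched as follows (my own arithmetic, not an official Code_Aster recommendation; the variable names are mine):

```shell
# Keep mpi_nbcpu x ncpus <= physical cores, with ncpus fixed at 2.
TOTAL_CORES=48                       # 4 CPUs x 12 cores on this machine
ncpus=2                              # OpenMP threads per MPI process
mpi_nbcpu=$((TOTAL_CORES / ncpus))   # -> 24 MPI processes
echo "mpi_nbcpu $mpi_nbcpu"
echo "mpi_nbnoeud 1"
echo "ncpus $ncpus"
```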

I have to say that the model is otherwise fine; the error seems to depend entirely on what I enter for mpi_nbcpu. Here is the output with the above parameters (24/1/2) in my export file:

Instant de calcul:  1.000000000000e+00
----------------------------------------------------------------------------------------------------------------------------------------------------------
|     CONTACT    |     CONTACT    |     NEWTON     |     RESIDU     |     RESIDU     |     OPTION     |     CONTACT    |     CONTACT    |     CONTACT    |
|    BCL. GEOM.  |    BCL. CONT.  |    ITERATION   |     RELATIF    |     ABSOLU     |   ASSEMBLAGE   |    PRESSURE    |     CRITERE    |  PENETRATION   |
|    ITERATION   |    ITERATION   |                | RESI_GLOB_RELA | RESI_GLOB_MAXI |                |    ERROR       |    VALEUR      |                |
----------------------------------------------------------------------------------------------------------------------------------------------------------
|     CONTACT    |     CONTACT    |     NEWTON     |     RESIDU     |     RESIDU     |     OPTION     |     CONTACT    |     CONTACT    |     CONTACT    |
|    BCL. GEOM.  |    BCL. CONT.  |    ITERATION   |     RELATIF    |     ABSOLU     |   ASSEMBLAGE   |    PRESSURE    |     CRITERE    |  PENETRATION   |
|    ITERATION   |    ITERATION   |                | RESI_GLOB_RELA | RESI_GLOB_MAXI |                |    ERROR       |    VALEUR      |                |
----------------------------------------------------------------------------------------------------------------------------------------------------------
|     CONTACT    |     CONTACT    |     NEWTON     |     RESIDU     |     RESIDU     |     OPTION     |     CONTACT    |     CONTACT    |     CONTACT    |
|    BCL. GEOM.  |    BCL. CONT.  |    ITERATION   |     RELATIF    |     ABSOLU     |   ASSEMBLAGE   |    PRESSURE    |     CRITERE    |  PENETRATION   |
|    ITERATION   |    ITERATION   |                | RESI_GLOB_RELA | RESI_GLOB_MAXI |                |    ERROR       |    VALEUR      |                |
----------------------------------------------------------------------------------------------------------------------------------------------------------
/tmp/aster/global/global/global/global/global/global/global/mpi_script.sh: line 37:  1837 Bus error               (core dumped) /home/aster/aster/14.4_mpi/bin/aster /home/aster/aster/14.4_mpi/lib/aster/Execution/E_SUPERV.py -commandes fort.1 --max_base=250000 --num_job=1560 --mode=interactif --rep_outils=/home/aster/aster/outils --rep_mat=/home/aster/aster/14.4_mpi/share/aster/materiau --rep_dex=/home/aster/aster/14.4_mpi/share/aster/datg --numthreads=2 --suivi_batch --tpmax=357900.0 --memjeveux=8192.0
EXECUTION_CODE_ASTER_EXIT_1560=135
EXIT_COMMAND_1560_00000022=0
<INFO> restore bases from /tmp/aster/global/global/global/global/global/global/global/BASE_PREC

<A>_ALARM          no glob/bhdf base to restore


<E>_ABNORMAL_ABORT execution aborted (comm file #1)

<INFO> Code_Aster run ended, diagnostic : <E>_ABNORMAL_ABORT

--------------------------------------------------------------------------------
Content of /tmp/aster/global/global/global/global/global/global/global after execution

.:
total 60332
drwx------  4 aster aster     4096 Jun 20 17:08 .
drwxr-xr-x 26 aster aster     4096 Jun 20 17:08 ..
-rw-r--r--  1 aster aster      901 Jun 20 17:03 1560.export
drwxr-xr-x  2 aster aster     4096 Jun 20 17:03 BASE_PREC
drwxr-xr-x  2 aster aster     4096 Jun 20 17:03 REPE_OUT
-rw-r--r--  1 aster aster     2756 Jun 20 17:03 config.txt
-rw-r--r--  1 aster aster    25023 Jun 20 17:03 fort.1
-rw-r--r--  1 aster aster    25023 Jun 20 17:03 fort.1.1
-rw-r--r--  1 aster aster     1843 Jun 20 17:03 fort.1.2
-rw-r--r--  1 aster aster 34291971 Jun 20 17:03 fort.2
-rw-r--r--  1 aster aster 23914368 Jun 20 17:03 fort.20
-rw-r--r--  1 aster aster  3444719 Jun 20 17:03 fort.3
-rw-r--r--  1 aster aster    26189 Jun 20 17:03 fort.4
-rw-r--r--  1 aster aster       21 Jun 20 17:08 fort.6
-rwxr-xr-x  1 aster aster     2256 Jun 20 17:03 mpi_script.sh

REPE_OUT:
total 8
drwxr-xr-x 2 aster aster 4096 Jun 20 17:03 .
drwx------ 4 aster aster 4096 Jun 20 17:08 ..


--------------------------------------------------------------------------------
Size of bases


--------------------------------------------------------------------------------
Copying results


<A>_COPYFILE       no such file or directory: fort.80


<A>_COPYFILE       no such file or directory: fort.5

copying .../fort.6...                                                   [  OK  ]

<E>_ABNORMAL_ABORT Code_Aster run ended



---------------------------------------------------------------------------------
                                            cpu     system    cpu+sys    elapsed
---------------------------------------------------------------------------------
   Preparation of environment              0.00       0.00       0.00       0.00
   Copying datas                           0.05       0.09       0.14       0.22
   Code_Aster run                        278.98      31.83     310.81     311.31
   Copying results                         0.00       0.00       0.00       0.01
---------------------------------------------------------------------------------
   Total                                 279.15      31.98     311.13     311.91
---------------------------------------------------------------------------------
   (*) cpu and system times may be not correctly counted using mpirun.

as_run 2019.0

------------------------------------------------------------
--- DIAGNOSTIC JOB : <F>_ABNORMAL_ABORT
------------------------------------------------------------

A 'bus error' is a hardware-level signal, similar to a segmentation fault, so Code_Aster itself does not report any reason for the failure.

I looked at line 37 of mpi_script.sh, but I do not see anything wrong there. The script looks like this for each proc.X (X being the MPI process number):

#!/bin/bash
#
# script template to run Code_Aster using MPI
#
#
# This template contains following Python strings formatting keys :
#
#     cmd_to_run         : Code_Aster command line
#     mpi_get_procid_cmd : command to retrieve processor ID
#
# automatically generated for job number #1560
#

ASRUN_PROCID=`echo $PMI_RANK`

if [ -z "$ASRUN_PROCID" ]; then
   echo "Processor ID is not defined !"
   exit 4
fi

ASRUN_WRKDIR=/tmp/aster/global/global/global/global/global/global/proc.$ASRUN_PROCID

if [ -e $ASRUN_WRKDIR ]; then
   rm -rf $ASRUN_WRKDIR
fi
if [ ! -d /tmp/aster/global/global/global/global/global/global ]; then
   mkdir -p /tmp/aster/global/global/global/global/global/global
fi
cp -r /tmp/aster/global/global/global/global/global/global/global $ASRUN_WRKDIR
if [ $? -ne 0 ]; then
    echo "non zero exit status for : cp -r /tmp/aster/global/global/global/global/global/global/global $ASRUN_WRKDIR"
    exit 4
fi
chmod 0700 $ASRUN_WRKDIR

cd $ASRUN_WRKDIR
( . /home/aster/aster/14.4_mpi/share/aster/profile.sh ; /home/aster/aster/14.4_mpi/bin/aster /home/aster/aster/14.4_mpi/lib/aster/Execution/E_SUPERV.py -commandes fort.1  --max_base=250000 --num_job=1560 --mode=interactif --rep_outils=/home/aster/aster/outils --rep_mat=/home/aster/aster/14.4_mpi/share/aster/materiau --rep_dex=/home/aster/aster/14.4_mpi/share/aster/datg --numthreads=2 --suivi_batch --tpmax=357900.0 --memjeveux=8192.0 ; echo EXECUTION_CODE_ASTER_EXIT_1560=$? ) | tee fort.6
iret=$?

if [ -f info_cpu ]; then
   infos=`cat info_cpu`
   echo "PROC=$ASRUN_PROCID INFO_CPU=$infos"
fi

if [ $ASRUN_PROCID -eq 0 ]; then
   echo "Content after execution of $ASRUN_WRKDIR :"
   ls -la . REPE_OUT

   rm -f /tmp/aster/global/global/global/global/global/global/global/glob.* /tmp/aster/global/global/global/global/global/global/global/bhdf.* /tmp/aster/global/global/global/global/global/global/global/pick.*
   rm -rf /tmp/aster/global/global/global/global/global/global/global/REPE_OUT
   # to save time during the following copy
   rm -rf $ASRUN_WRKDIR/REPE_IN $ASRUN_WRKDIR/Python
   cp -rf $ASRUN_WRKDIR/* /tmp/aster/global/global/global/global/global/global/global/
   kret=$?
   if [ $kret -gt $iret ]; then
      iret=$kret
   fi
fi
rm -rf $ASRUN_WRKDIR

I did the following additional tests on this system:
1) OpenFOAM runs fine with 48 MPI processes. Of course, that has nothing to do with the Code_Aster Docker image, which by design ships its own encapsulated MPI installation; nevertheless, it shows that the hardware is OK.
2) I tested several small OpenMPI C programs (an MPI hello-world etc.); these also run with 48 MPI processes (or even more) without any problems. Again, the hardware seems OK.
3) I compiled simple MPI C programs inside the Docker container; these also run with 48 MPI processes, so the MPI installation inside the container looks OK.
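For reference, the kind of sanity check I mean in 3) looks roughly like this (file names and the rank count are mine; the guard simply skips the compile/run steps if no MPI toolchain is present):

```shell
# Write a minimal MPI hello-world (illustrative, not from the thread).
cat > hello_mpi.c <<'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
# Compile and run with 48 ranks, if an MPI toolchain is available.
if command -v mpicc >/dev/null 2>&1 && command -v mpirun >/dev/null 2>&1; then
    mpicc hello_mpi.c -o hello_mpi
    mpirun -np 48 ./hello_mpi || echo "mpirun with 48 ranks failed"
fi
```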

To summarize, I am not sure what is wrong. The problem is that if I cannot enter 24/1/2, this 4-CPU system is no faster than a comparable 2-CPU system, and I cannot exploit the advantage of the two additional CPUs.

I'd be glad for any advice; perhaps I only have to change some parameters, or perhaps the installation is faulty?

Please let me know if you need more data to judge this error.

Thank you,

Mario.

Last edited by mf (2020-06-23 13:53:58)


#2 2020-06-22 19:56:58

mf
Member
Registered: 2019-06-18
Posts: 55

Re: 'Bus Error' during execution of parallel version 14.4

Anyone?


#3 2020-06-23 08:39:44

chenghui62000
Member
From: Norway
Registered: 2018-06-19
Posts: 95

Re: 'Bus Error' during execution of parallel version 14.4

Hi, Mario,
I am not sure about the MPI problem, but could you please share your experience of installing the MPI version of Code_Aster? I can hardly find any useful information on the Internet.

Best regards,
Hui Cheng


#4 2020-06-23 09:51:15

mf
Member
Registered: 2019-06-18
Posts: 55

Re: 'Bus Error' during execution of parallel version 14.4

Hi,

I never managed to do the installation myself. As you say, it is quite a tedious task (or, quite frankly, a nightmarish experience).

I use this Docker image instead; this is where the error comes up. It is an encapsulated, virtual environment (I am not allowed to post links, so just prepend https://):

github.com/tianyikillua/code_aster_on_docker

Just install Docker on your OS (I recommend Docker on Linux, it is faster), follow the instructions there, and you are good to go with the parallel version. No compiling is needed.

Cheers,

Mario.


#5 2020-06-23 10:35:29

chenghui62000
Member
From: Norway
Registered: 2018-06-19
Posts: 95

Re: 'Bus Error' during execution of parallel version 14.4

Hi, Mario, Thank you very much! Hui


#6 2020-06-23 13:58:47

tianyikillua
Member
From: Paris
Registered: 2017-11-06
Posts: 68

Re: 'Bus Error' during execution of parallel version 14.4

I'll have a look if I can find some time.

Could you please try the previous v13 version, if possible, using

docker pull quay.io/tianyikillua/code_aster:v13

If v13 works for you, it means there is some regression in my v14 image.


#7 2020-06-23 14:28:21

mf
Member
Registered: 2019-06-18
Posts: 55

Re: 'Bus Error' during execution of parallel version 14.4

Hi,

thank you. I tried the v13 image on the same machine, but I can't run the current simulation because of this (different) error:

!--------------------------------------------------------------------!
   ! <A> <MED_24>                                                       !
   !                                                                    !
   !   -> The file was not built with the same version of MED.          !
   !   -> Risk & Advice:                                                !
   !      Reading the file may fail!                                    !
   !                                                                    !
   !                                                                    !
   !    MED library version used by Code_Aster:  3 3 1                  !
   !                                                                    !
   !    MED library version used to create the file:  4 0 0             !
   !                                                                    !
   !   -> Version inconsistency detected.                               !
   !                                                                    !
   !                                                                    !
   ! This is an alarm. If you do not understand the meaning of this     !
   ! alarm, you may obtain unexpected results!                          !
   !--------------------------------------------------------------------!

Basically, it means that the mesh files are too new... :-( I'd have to create an 'older' example or use a different one.

Maybe one of the test cases will provoke the same errors. I don't know yet.

This will take a while. I will come back to you.


#8 2020-06-23 14:40:19

tianyikillua
Member
From: Paris
Registered: 2017-11-06
Posts: 68

Re: 'Bus Error' during execution of parallel version 14.4

You can use HDFView (among others) to simply change the version stamp of MED files; it is in the INFO_GENERALES section, I think.


#9 2020-06-23 15:05:25

mf
Member
Registered: 2019-06-18
Posts: 55

Re: 'Bus Error' during execution of parallel version 14.4

Ok, I managed to convert to MED 3.2 with the latest Salome-Meca; that seems to be close enough to 3.3.1.

Here is the result I get with V13 and MED 3.2 and

mpi_nbcpu 24
mpi_nbnoeud 1
ncpus 2

within the export file and the same simulation:

!-------------------------------------------------------------!
   ! <EXCEPTION> <APPELMPI_5>                                    !
   !                                                             !
   !  Error during a call to an MPI function.                    !
   !  The details of the error should be displayed above.        !
   !-------------------------------------------------------------!
   

<F> MPI Error code 138008847:
    Other MPI error, error stack:
MPI_Recv(200).........................: MPI_Recv(buf=0x7fe72c82f010, count=160001, MPI_INTEGER, src=5, tag=29, comm=0x84000004, status=0x7ffdb06d79b0) failed
MPID_Recv(132)........................:
MPID_nem_lmt_RndvRecv(168)............:
do_cts(562)...........................:
MPID_nem_lmt_shm_start_recv(181)......:
MPID_nem_allocate_shm_region(886).....:
MPIU_SHMW_Seg_create_and_attach(897)..:
MPIU_SHMW_Seg_create_attach_templ(620): write failed

   
   !-------------------------------------------------------------!
   ! <EXCEPTION> <APPELMPI_5>                                    !
   !                                                             !
   !  Error during a call to an MPI function.                    !
   !  The details of the error should be displayed above.        !
   !-------------------------------------------------------------!
   
/tmp/aster/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/mpi_script.sh: line 37:   221 Bus error               (core dumped) /home/aster/aster/13.6_mpi/bin/aster /home/aster/aster/13.6_mpi/lib/aster/Execution/E_SUPERV.py -commandes fort.1 --max_base=250000 --num_job=72 --mode=interactif --rep_outils=/home/aster/aster/outils --rep_mat=/home/aster/aster/13.6_mpi/share/aster/materiau --rep_dex=/home/aster/aster/13.6_mpi/share/aster/datg --numthreads=2 --suivi_batch --memjeveux=16384.0 --tpmax=357900.0
EXECUTION_CODE_ASTER_EXIT_72=135
/tmp/aster/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/mpi_script.sh: line 37:   248 Bus error               (core dumped) /home/aster/aster/13.6_mpi/bin/aster /home/aster/aster/13.6_mpi/lib/aster/Execution/E_SUPERV.py -commandes fort.1 --max_base=250000 --num_job=72 --mode=interactif --rep_outils=/home/aster/aster/outils --rep_mat=/home/aster/aster/13.6_mpi/share/aster/materiau --rep_dex=/home/aster/aster/13.6_mpi/share/aster/datg --numthreads=2 --suivi_batch --memjeveux=16384.0 --tpmax=357900.0
EXECUTION_CODE_ASTER_EXIT_72=135
EXIT_COMMAND_72_00000022=0
<INFO> restore bases from /tmp/aster/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/BASE_PREC

<A>_ALARM          no glob/bhdf base to restore


<E>_ABNORMAL_ABORT execution aborted (comm file #1)

<INFO> Code_Aster run ended, diagnostic : <E>_ABNORMAL_ABORT

--------------------------------------------------------------------------------
Content of /tmp/aster/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global/global after execution

.:
total 60352
drwx------  4 aster aster     4096 Jun 23 14:00 .
drwxr-xr-x 25 aster aster     4096 Jun 23 14:00 ..
-rw-r--r--  1 aster aster     1037 Jun 23 13:55 72.export
drwxr-xr-x  2 aster aster     4096 Jun 23 13:55 BASE_PREC
drwxr-xr-x  2 aster aster     4096 Jun 23 13:55 REPE_OUT
-rw-r--r--  1 aster aster     2653 Jun 23 13:55 config.txt
-rw-r--r--  1 aster aster    25023 Jun 23 13:55 fort.1
-rw-r--r--  1 aster aster    25023 Jun 23 13:55 fort.1.1
-rw-r--r--  1 aster aster     1843 Jun 23 13:55 fort.1.2
-rw-r--r--  1 aster aster 34298146 Jun 23 13:55 fort.2
-rw-r--r--  1 aster aster 23920390 Jun 23 13:55 fort.20
-rw-r--r--  1 aster aster  3451256 Jun 23 13:55 fort.3
-rw-r--r--  1 aster aster    29630 Jun 23 13:55 fort.4
-rw-r--r--  1 aster aster       21 Jun 23 14:00 fort.6
-rwxr-xr-x  1 aster aster     3581 Jun 23 13:55 mpi_script.sh

REPE_OUT:
total 8
drwxr-xr-x 2 aster aster 4096 Jun 23 13:55 .
drwx------ 4 aster aster 4096 Jun 23 14:00 ..


--------------------------------------------------------------------------------
Size of bases


--------------------------------------------------------------------------------
Copying results


<A>_COPYFILE       no such file or directory: fort.80


<A>_COPYFILE       no such file or directory: fort.5

copying .../fort.6...                                                   [  OK  ]

<E>_ABNORMAL_ABORT Code_Aster run ended



---------------------------------------------------------------------------------
                                            cpu     system    cpu+sys    elapsed
---------------------------------------------------------------------------------
   Preparation of environment              0.00       0.00       0.00       0.00
   Copying datas                           0.05       0.09       0.14       0.33
   Code_Aster run                        554.91      61.90     616.81     308.76
   Copying results                         0.01       0.02       0.03       0.01
---------------------------------------------------------------------------------
   Total                                 555.06      62.06     617.12     309.48
---------------------------------------------------------------------------------
   (*) cpu and system times may be not correctly counted using mpirun.

as_run 2018.0

------------------------------------------------------------
--- DIAGNOSTIC JOB : <F>_ABNORMAL_ABORT
------------------------------------------------------------


EXIT_CODE=4

The MPI output looks a bit more detailed at first glance, but I think it is essentially the same error.

Hope that helps.
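One detail in that v13 stack that might be worth checking (my interpretation, not something the log states outright): the line MPIU_SHMW_Seg_create_attach_templ(620): write failed means MPI could not write to a shared-memory segment, and a full /dev/shm is a classic cause of SIGBUS ('Bus error'). Docker containers get only 64 MB of /dev/shm by default, which 24 ranks exchanging large buffers could easily exhaust. A sketch of the check and workaround (the image tag is assumed by analogy with the v13 pull command):

```shell
# Inside the running container: how big is the shared-memory mount?
df -h /dev/shm

# If it shows Docker's 64M default, restart the container with more:
docker run --shm-size=8g -it quay.io/tianyikillua/code_aster:v14
```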

=================================================================================

EDIT: in v14 I also tried altering the mpirun call in asrun, with combinations of the rank-by, map-by and bind-to options (numa, socket, etc.), but did not succeed; the error persists. I also tried without MATR_DISTRIBUEE='OUI'; the error persists.
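For completeness, these are the kinds of placement options I mean (Open MPI syntax; whether the container's mpirun is actually Open MPI, and where asrun assembles the call, are assumptions on my part, and ./solver is a placeholder):

```shell
# Examples of the placement variants tried (illustrative command lines):
mpirun -np 24 --map-by numa   --bind-to core   ./solver
mpirun -np 24 --map-by socket --bind-to socket ./solver
mpirun -np 24 --rank-by core  --map-by socket  ./solver
```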

Last edited by mf (2020-06-23 15:38:05)


#10 2020-06-24 10:58:53

mf
Member
Registered: 2019-06-18
Posts: 55

Re: 'Bus Error' during execution of parallel version 14.4

Hi,

can anyone with a dual- or quad-CPU system please check whether it is possible (or just answer without checking, if you already know), on an otherwise working parallel installation, to calculate without error on, for example, a dual 10-core system with:

mpi_nbcpu 20
mpi_nbnoeud 1
ncpus 1 or 2 (2 only for a quad-CPU system; not feasible on a dual-CPU system)

I know calculating without OpenMP is not the fastest way, but it would show whether this works at all. If it does, it would indicate a problem with the configuration/compilation of this Docker version.

Thank you,

Mario.

Last edited by mf (2020-06-24 10:59:14)


#11 2020-07-02 08:30:55

mathieu.courtois
Administrator
From: France
Registered: 2007-11-21
Posts: 1,169

Re: 'Bus Error' during execution of parallel version 14.4

Hello,

There is no reason you cannot use all cores of your system, but it will probably not be very efficient.

Each MPI process allocates its own memory space plus its own disk space.
If a process needs 4 GB of RAM, you need 24 x 4 GB on the node, plus 24 x the required disk space, with 24 processes reading/writing the disk concurrently...
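The arithmetic above, sketched with the illustrative 4 GB per-process figure (variable names are mine):

```shell
# Each MPI process allocates its own memory and its own scratch files.
PER_PROC_GB=4                         # illustrative per-process RAM need
NPROC=24                              # mpi_nbcpu from the failing setup
TOTAL_GB=$((PER_PROC_GB * NPROC))
echo "node RAM needed: ${TOTAL_GB} GB, plus ${NPROC}x the scratch space"
```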

You should test your problem with fewer MPI processes (mpi_nbcpu of 4, 8 or 12) and probably 2, 4 or 6 OpenMP threads (the ncpus parameter) to find the best configuration for your problem.
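The suggested sweep can be sketched like this (the pairings and the <= 48 cores constraint are my reading of the advice, not an official procedure):

```shell
# Try fewer MPI ranks with more OpenMP threads, keeping
# mpi_nbcpu x ncpus within the 48 physical cores.
TOTAL_CORES=48
for mpi_nbcpu in 4 8 12; do
  for ncpus in 2 4 6; do
    if [ $((mpi_nbcpu * ncpus)) -le "$TOTAL_CORES" ]; then
      echo "candidate: mpi_nbcpu=$mpi_nbcpu ncpus=$ncpus"
    fi
  done
done
```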

The error message in the first post may be due to a problem in the MPI communications, the hardware, the build of the software, or a source-code mistake...

MC


Code_Aster release : last unstable on Ubuntu 16.04 64 bits - GNU Compilers

Please do not forget to tag your first post as *SOLVED* when it is!


#12 2020-07-02 14:58:53

mf
Member
Registered: 2019-06-18
Posts: 55

Re: 'Bus Error' during execution of parallel version 14.4

Hi,

thank you for taking the time to answer, I really appreciate that.

I also tested smaller problems with <1M DOFs; the error occurs there too. I therefore also rule out insufficient RAM or disk space; both are plentiful on this machine (I always follow RAM usage with htop and disk usage with watch -d df, and I never get anywhere close to the limits).

Of course I tried:

mpi_nbcpu 12
mpi_nbnoued 1
ncpus 4

but as I mentioned, with these parameters, the dual CPU machine next to it is equally as fast (dual 8-core that I use with the following parameters to achieve a minimum of computation time:

mpi_nbcpu 8
mpi_nbnoued 1
ncpus 2.)

The MPI installation is OK; I checked it with small programs compiled with mpicc.

The hardware is OK; it runs flawlessly 24/7 with other software.

I don't suspect the source code, because then I wouldn't be the only one with this problem.

The only two possibilities left are:
-) a faulty configuration/compilation (I suspect MUMPS_MPI; the run never reaches the matrix-decomposition stage)
-) this error being normal (I suspect it is not, as Code_Aster would not scale very well in that case)

So, I guess I have to live with that for the moment...

Thank you anyway,

Mario.

Last edited by mf (2020-07-02 15:00:15)
