GPU-MHDdecay

The goal is to study the resistive limitation of the inverse MHD cascade both with and without net magnetic helicity. The case without net magnetic helicity was studied previously (see Brandenburg et al. 2024a,b) at resolutions up to 2048³. It was found that the quantity t·vA(t)/ξM(t) approaches a constant during the turbulent decay, provided the decay remains nearly self-similar. Surprisingly, the asymptotic value of t·vA(t)/ξM(t) was found to increase with increasing Lundquist number. We found a similar behavior in 2-D as in 3-D, although this agreement may be suspect. In 2-D, the Lundquist number dependence saturates near 10⁴. To verify the Lundquist number dependence and to examine the possibility of saturation in 3-D, we require a resolution of up to 8192³ meshpoints.
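
As a minimal sketch (not the production analysis), the diagnostic can be evaluated as follows, assuming the standard definitions vA = Brms (for unit mean density) and ξM(t) = ∫ k⁻¹ EM(k,t) dk / ∫ EM(k,t) dk; the arrays t, brms, k, and EM are placeholders for data read beforehand, e.g. with the Pencil Code python module.

    import numpy as np

    def t_vA_over_xiM(t, brms, k, EM):
        """t, brms: times and Brms values at the spectral output times;
        k: wavenumbers; EM: magnetic energy spectra, shape (len(t), len(k))."""
        xiM = (EM / k).sum(axis=1) / EM.sum(axis=1)  # integral of k^-1 E_M dk over integral of E_M dk
        vA = brms                                    # Alfven speed for unit mean density
        return t * vA / xiM                          # expected to level off during self-similar decay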

  • Detailed project description, including timing results.
  • A problem right now is that the GPU and CPU runs yield different results. The error might well be very trivial! Below are the relevant test directories using 128³ meshpoints.

    Run directories:

  • GPU run k600_nu5em7_k4_Pm5_128a.
  • CPU run k600_nu5em7_k4_Pm5_128a_CPU with 128 CPUs.
  • CPU run k600_nu5em7_k4_Pm5_128a_8CPU with only 8 CPUs.

  • Essential steps to get the gputestv6 branch (with axel as the user):
    git clone -b gputestv6 --recurse-submodules https://AxelBrandenburg@pencil-code.org/git/ pencil-code
    cd pencil-code/src/astaroth/submodule
    git checkout PCinterface_2019-8-12
    
  • Then, before we can build, we need to load some modules:
    ml rocm
    ml cmake
    
  • Next, for now, we compile with:
    pc_build -f compilers/Cray_MPI FFLAGS+=" -g -O0" LDFLAGS+='-Wl,--no-relax -L /opt/cray/pe/lib64 -lmpi_gtl_hsa'
    
    This is because the compiler has a bug, so we need to disable optimization with -O0. Later, we could use
    pc_build -f compilers/Cray_MPI FFLAGS+=" -g" LDFLAGS+='-Wl,--no-relax'
    
    or
    pc_build -f compilers/Cray_MPI LDFLAGS+='-Wl,--no-relax'
    
    The linker part is needed in all cases.

  • Make sure you have: src/astaroth/DSL/local/equations.h.

  • To work with the gnu compiler, we start by saying (as usual):
    module swap PrgEnv-cray PrgEnv-gnu
    
    To compile, we must suppress the "FSTD_95=-std=f95" option. (But how?) We would then use
    pc_build -f compilers/GNU-GCC_MPI FFLAGS+=" -g -mcmodel=large" LDFLAGS+='-L /opt/cray/pe/lib64 -lmpi_gtl_hsa'
    
    The plan is to work in:
    /cfs/klemming/home/b/brandenb/data/GPU/axel/decay/reconnection/k60_nu5em6_k4_Pm5_128b_gnu
    


  • To compare the results of the 3 runs above, we use pcomp_GPU.pro in the idl directory. The resulting pcomp_GPU.pdf file in the idl directory shows that the GPU run (black lines, solid for Brms and dashed for urms) results in a slightly stronger magnetic field than the CPU runs with either 128 CPUs (red line) or just 8 CPUs (dashed orange line on top).

    The figure pcomp_GPU.pdf shows that Brms starts off at a value of around 0.003, while urms is driven by the Lorentz force and quickly reaches comparable values. The GPU runs reproduce the CPU runs qualitatively, and the initial condition exactly.
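
    As a rough Python alternative to pcomp_GPU.pro (a sketch only, assuming the Pencil Code python module and that print.in contains urms and brms), the three runs could be compared like this:

    import matplotlib.pyplot as plt
    import pencil as pc

    # run label -> data directory (paths relative to the reconnection directory)
    runs = {
        'GPU':           'k600_nu5em7_k4_Pm5_128a/data',
        'CPU, 128 proc': 'k600_nu5em7_k4_Pm5_128a_CPU/data',
        'CPU, 8 proc':   'k600_nu5em7_k4_Pm5_128a_8CPU/data',
    }
    for label, datadir in runs.items():
        ts = pc.read.ts(datadir=datadir)
        plt.semilogy(ts.t, ts.brms, label=label + ': Brms')        # solid: Brms
        plt.semilogy(ts.t, ts.urms, '--', label=label + ': urms')  # dashed: urms
    plt.xlabel('t')
    plt.legend()
    plt.savefig('pcomp_GPU_py.pdf')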

    Fixed time step: run directories:

  • GPU run k600_nu5em5_k4_Pm5_128a.
  • CPU run k600_nu5em5_k4_Pm5_128a_CPU with 128 CPUs.



    The two runs now agree.

    Open issues:

  • In the CPU version, when spec_start=T, we output spectra at the very first timestep plus the next one. The reason is that the gravitational wave module (where all evolution happens at the level of auxiliaries!) does not yet have data at the first time. But even without gravitational waves this was useful, because velocity spectra also only exist after the first time step. Output at the timestep after the first one should therefore still be implemented in the GPU version.
  • At the moment, we have export OMP_NUM_THREADS=8. Maybe we should set it to 7? This may be LUMI-specific.
  • Even after disabling all VAR-file and slice outputs, and with it1=100, isave=1000, and a spectrum output interval of 1 (corresponding to 21 spectra during the full run), the GPU code needs about 16 times more wallclock time for the same task; see the timing results below and the quick check after them.

    Run directories:

  • GPU run k60_nu2em5_k4_Pm5_128c_dt_GPU.
  • CPU run k60_nu2em5_k4_Pm5_128c_dt with 128 CPUs.

  • Performance for 128³ tests:
    GPU:
     Wall clock time [hours] =  0.850     (+/-  8.3333E-12)
     Wall clock time/timestep/meshpoint [microsec] = 0.1620545    
     Maximum used memory per cpu [MBytes] =  1904.918
     Maximum used memory [GBytes] =       14.003
    
    Compared to:
    CPU:
     Wall clock time [hours] =  5.454E-02 (+/-  5.5556E-12)
     Wall clock time/timestep/meshpoint [microsec] = 1.0402362E-02
     Maximum used memory per cpu [MBytes] =    43.348
     Maximum used memory [GBytes] =        4.776
    
  • When we also put isave=10000 (as opposed to 1000) and dspec=10 (as opposed to 1), we find (in k60_nu2em5_k4_Pm5_128c_dt_GPU_lspec):
     Wall clock time [hours] =  0.624     (+/-  5.5556E-12)
     Wall clock time/timestep/meshpoint [microsec] = 0.1190291
     Maximum used memory per cpu [MBytes] =  1519.820
     Maximum used memory [GBytes] =       11.492
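
    As a quick sanity check of the quoted factor of about 16, using the numbers from the first GPU/CPU pair above:

    # ratio of wallclock time per timestep per meshpoint, GPU vs. CPU (128 procs)
    gpu = 0.1620545      # microsec, from the GPU run above
    cpu = 1.0402362e-02  # microsec, from the CPU run above
    print(gpu / cpu)     # ~15.6, i.e. about 16 times slower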
    


    Magnetic energy (upper two lines) and kinetic energy (lower two lines). Red lines: GPU run; blue lines: CPU run. (The dashed orange line is for the CPU code without upwinding of the vector potential; see below.) Both runs use eta=4e-6 and nu=2e-5. The initial spectra are essentially the same (the slight differences are probably explained by the slightly different output times). The magnetic and velocity fields appear to experience slightly less diffusion in the GPU code. This difference becomes much bigger at smaller viscosities (e.g., 4 times smaller).

    The difference is caused by the use of upwinding for the vector potential in the CPU version, which is not yet implemented in the GPU version; see

    src/astaroth/DSL/magnetic/induction.h 
    
    To confirm this, we ran the CPU version without upwinding of the vector potential; see the dashed orange line in the spectra above.
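
    As a generic illustration of why upwinding acts like extra diffusion (a much simplified first-order example, not the actual higher-order Pencil Code scheme), note that the upwind one-sided derivative equals the centered derivative plus a numerical-diffusion correction:

    import numpy as np

    nx = 256
    dx = 2*np.pi / nx
    x = np.arange(nx) * dx
    a = np.sin(3*x) + 0.3*np.cos(7*x)    # arbitrary smooth periodic test field

    upwind   = (a - np.roll(a, 1)) / dx                        # one-sided derivative (for u > 0)
    centered = (np.roll(a, -1) - np.roll(a, 1)) / (2*dx)       # centered derivative
    d2adx2   = (np.roll(a, -1) - 2*a + np.roll(a, 1)) / dx**2  # second derivative

    # identity for these stencils: upwind = centered - (dx/2) * d2a/dx2,
    # i.e. advecting with the upwind derivative is equivalent to the centered
    # scheme plus a diffusion term with coefficient |u|*dx/2
    print(np.abs(upwind - (centered - 0.5*dx*d2adx2)).max())   # at roundoff level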
  • The problem with the spectra not being output at the first and second time steps is seen in the following comparison, which shows first the incorrect output from the GPU version (k60_nu2em5_k4_Pm5_128c_dt_GPU) and then the output from the CPU version (k60_nu2em5_k4_Pm5_128c_dt):
    brandenb@login1:/cfs/klemming/home/b/brandenb/data/GPU/axel/decay/reconnection/k60_nu2em5_k4_Pm5_128c_dt_GPU_lspec> head -20 ../k60_nu2em5_k4_Pm5_128c_dt_GPU/data/power_kin.dat
     3.36215353875403758E-2
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
     1.0323241484733396
      1.05E-09  1.03E-06  8.25E-06  1.30E-05  2.44E-05  2.36E-05  2.42E-05  2.37E-05
      2.17E-05  2.32E-05  1.87E-05  1.70E-05  1.63E-05  1.57E-05  1.45E-05  1.22E-05
      1.23E-05  1.12E-05  1.01E-05  8.78E-06  8.73E-06  8.30E-06  7.53E-06  7.00E-06
      6.58E-06  6.55E-06  6.02E-06  5.59E-06  5.44E-06  5.10E-06  4.90E-06  4.68E-06
      4.29E-06  4.34E-06  4.09E-06  3.88E-06  3.80E-06  3.55E-06  3.51E-06  3.26E-06
      3.22E-06  3.17E-06  2.97E-06  2.90E-06  2.80E-06  2.79E-06  2.63E-06  2.57E-06
      2.49E-06  2.45E-06  2.41E-06  2.32E-06  2.30E-06  2.25E-06  2.18E-06  2.14E-06
      2.11E-06  2.12E-06  2.05E-06  2.01E-06  2.02E-06  1.99E-06  1.97E-06  1.89E-06
     2.0274802674034622
      5.33E-09  1.41E-06  8.09E-06  1.06E-05  1.71E-05  1.69E-05  1.38E-05  1.30E-05
    brandenb@login1:/cfs/klemming/home/b/brandenb/data/GPU/axel/decay/reconnection/k60_nu2em5_k4_Pm5_128c_dt_GPU_lspec> head -20 ../k60_nu2em5_k4_Pm5_128c_dt/data/power_kin.dat
       0.0000000000000000     
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
       2.3946539533068810E-003
      1.87E-14  3.46E-10  3.50E-09  1.13E-08  3.50E-08  8.70E-08  1.42E-07  2.20E-07
      3.32E-07  5.47E-07  6.85E-07  7.50E-07  9.60E-07  1.11E-06  1.23E-06  1.21E-06
      1.47E-06  1.55E-06  1.62E-06  1.56E-06  1.74E-06  1.83E-06  1.80E-06  1.83E-06
      1.88E-06  2.00E-06  2.00E-06  2.01E-06  2.01E-06  2.01E-06  2.03E-06  2.09E-06
      2.02E-06  2.13E-06  2.15E-06  2.12E-06  2.14E-06  2.12E-06  2.13E-06  2.05E-06
      2.12E-06  2.12E-06  2.07E-06  2.10E-06  2.04E-06  2.08E-06  2.02E-06  1.99E-06
      2.01E-06  2.01E-06  1.98E-06  1.91E-06  1.96E-06  1.92E-06  1.90E-06  1.87E-06
      1.86E-06  1.86E-06  1.82E-06  1.80E-06  1.81E-06  1.78E-06  1.79E-06  1.74E-06
       1.0028751054320695     
      4.45E-09  1.78E-06  1.84E-05  1.95E-05  4.41E-05  4.40E-05  3.87E-05  3.47E-05
    

    See also: readme.txt.

    Other discussion points:

  • Merging the branch into master.
  • Auto-tuning to be put into pc_newrun when -s is invoked.

    Fixed time step (agreement with lupw_aa=F): run directories:

  • GPU run k600_nu5em5_k4_Pm5_128a (jrms=2.958E-02).
  • CPU run k600_nu5em5_k4_Pm5_128a_8CPU_lupw_aaF with 8 CPUs and lupw_aa=F: agrees perfectly (jrms=2.958E-02)!
  • CPU run k600_nu5em5_k4_Pm5_128a_8CPU with 8 CPUs, but still lupw_aa=T: does not agree well enough (jrms=2.957E-02).
  • CPU run k600_nu5em5_k4_Pm5_128a_CPU with 128 CPUs and lupw_aa=T: does not agree well enough (jrms=2.961E-02).

    where the jrms values are taken from the time series at the last time of each run. Doing a diff on the spectra indicates that the time stamps are wrong, but the data themselves are correct; a comparison sketch follows below.
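
    A sketch of such a comparison, ignoring the time stamps (assuming the power_*.dat layout shown further up, i.e. one time value followed by the spectrum, with nk=64 bins at 128³):

    import numpy as np

    def read_power(fname, nk=64):
        """Read a power_*.dat file: blocks of one time stamp followed by nk spectral values."""
        with open(fname) as f:
            vals = np.array(f.read().split(), dtype=float)
        nblock = nk + 1
        vals = vals[:len(vals)//nblock*nblock].reshape(-1, nblock)
        return vals[:, 0], vals[:, 1:]               # times, spectra

    t1, s1 = read_power('k600_nu5em5_k4_Pm5_128a/data/power_kin.dat')
    t2, s2 = read_power('k600_nu5em5_k4_Pm5_128a_8CPU_lupw_aaF/data/power_kin.dat')
    n = min(len(t1), len(t2))
    print('max time-stamp difference:', np.abs(t1[:n] - t2[:n]).max())  # nonzero if the stamps are wrong
    print('max spectrum difference:  ', np.abs(s1[:n] - s2[:n]).max())  # ~0 if the data agree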

    References:

    Presentation by Matthias Rheinhardt (Aalto University) at the Pencil Code User Meeting 2024 in Barcelona about the GPU acceleration in the Pencil Code using Astaroth: Introduction to PC-A [pptx] (25 Sep 2024)

    Brandenburg, A., Neronov, A., & Vazza, F.: 2024a, "Resistively controlled primordial magnetic turbulence decay," Astron. Astrophys., in press (arXiv:2401.08569)

    Brandenburg, A., Neronov, A., & Vazza, F.: 2024b, Datasets for "Resistively controlled primordial magnetic turbulence decay," v2024.01.18, Zenodo, DOI:10.5281/zenodo.10527437



  • The runs above had wav1=10. When I repeat the GPU run with wav1=1 (k60_nu5em6_k4_Pm5_128a), I get "error while loading shared libraries: libamdhip64.so.5: cannot open shared object file: No such file or directory". I suspect that the "pc_newrun -s" command is still incomplete. I then see only
    brandenb@login1:/cfs/klemming/home/b/brandenb/data/GPU/axel/decay/reconnection/k60_nu5em6_k4_Pm5_128a> ls -l src/astaroth/submodule/acc-runtime/
    total 8
    drwxr-sr-x 2 brandenb pg_snic2020-4-12 4096 dec 30 02:03 built-in
    drwxr-sr-x 5 brandenb pg_snic2020-4-12 4096 dec 30 02:03 samples
    
    while in the old directory we have
    brandenb@login1:/cfs/klemming/home/b/brandenb/data/GPU/axel/decay/reconnection/k600_nu5em7_k4_Pm5_128a> ls -l src/astaroth/submodule/acc-runtime/
    total 24
    lrwxrwxrwx 1 brandenb pg_snic2020-4-12   89 dec 28 12:15 acc -> /cfs/klemming/home/b/brandenb/data/GPU/pencil-code/src/astaroth/submodule/acc-runtime/acc
    lrwxrwxrwx 1 brandenb pg_snic2020-4-12   89 dec 28 12:15 api -> /cfs/klemming/home/b/brandenb/data/GPU/pencil-code/src/astaroth/submodule/acc-runtime/api
    drwxr-sr-x 2 brandenb pg_snic2020-4-12 4096 dec 28 12:15 built-in
    lrwxrwxrwx 1 brandenb pg_snic2020-4-12  100 dec 28 12:15 CMakeLists.txt -> /cfs/klemming/home/b/brandenb/data/GPU/pencil-code/src/astaroth/submodule/acc-runtime/CMakeLists.txt
    drwxr-sr-x 2 brandenb pg_snic2020-4-12 4096 dec 28 12:15 dynamic
    drwxr-sr-x 5 brandenb pg_snic2020-4-12 4096 dec 28 11:23 samples
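
    Until "pc_newrun -s" is fixed, one possible (untested) workaround is to recreate by hand the links that exist in the old directory, as in the hypothetical sketch below; whether this also resolves the libamdhip64.so.5 error has not been verified, and the real 'dynamic' subdirectory is not recreated here.

    import os

    # source tree and new run directory; adjust to your own paths
    SRC = '/cfs/klemming/home/b/brandenb/data/GPU/pencil-code/src/astaroth/submodule/acc-runtime'
    DST = 'src/astaroth/submodule/acc-runtime'   # inside the new run directory

    for name in ('acc', 'api', 'CMakeLists.txt'):
        link = os.path.join(DST, name)
        if not os.path.lexists(link):
            os.symlink(os.path.join(SRC, name), link)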
    



    Axel Brandenburg
    $Date: 2025/01/08 09:45:13 $, $Author: brandenb $, $Revision: 1.17 $