GPU-MHDdecay

The goal is to study the resistive limitation of the inverse MHD cascade both with and without net magnetic helicity. The case without net magnetic helicity was studied previously (see Brandenburg et al. 2024a,b) at resolutions up to 2048³. It was found that the quantity t·vA(t)/ξM(t) approaches a constant during the turbulent decay, provided the decay remains nearly self-similar. Surprisingly, the asymptotic value of t·vA(t)/ξM(t) was found to increase with increasing Lundquist number. We found a similar behavior in 2-D as in 3-D, although this agreement may be suspect. In 2-D, the Lundquist number dependence saturates near 10⁴. To verify the Lundquist number dependence and to examine the possibility of saturation in 3-D, we require a resolution of up to 8192³ meshpoints.
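
As a minimal sketch (not the production analysis), the diagnostic can be evaluated as follows, assuming the standard definitions vA = Brms (for unit mean density) and ξM(t) = ∫ k⁻¹ EM(k,t) dk / ∫ EM(k,t) dk; the arrays t, brms, k, and EM are placeholders for data read beforehand, e.g. with the Pencil Code python module.

    import numpy as np

    def t_vA_over_xiM(t, brms, k, EM):
        """t, brms: times and Brms values at the spectral output times;
        k: wavenumbers; EM: magnetic energy spectra, shape (len(t), len(k))."""
        xiM = (EM / k).sum(axis=1) / EM.sum(axis=1)  # integral of k^-1 E_M dk over integral of E_M dk
        vA = brms                                    # Alfven speed for unit mean density
        return t * vA / xiM                          # expected to level off during self-similar decay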

  • Detailed project description, including timing results.
  • A problem right now is that the GPU and CPU runs yield different results. The error might well be very trivial! Below are the relevant test directories using 128³ meshpoints.

    Run directories:

  • GPU run k600_nu5em7_k4_Pm5_128a.
  • CPU run k600_nu5em7_k4_Pm5_128a_CPU with 128 CPUs.
  • CPU run k600_nu5em7_k4_Pm5_128a_8CPU with only 8 CPUs.

  • Essential steps to get the gputestv6 branch (with axel as the user):
    git clone -b gputestv6 --recurse-submodules https://AxelBrandenburg@pencil-code.org/git/ pencil-code
    cd pencil-code/src/astaroth/submodule
    git checkout PCinterface_2019-8-12
    
  • Then, before we can build, we need to load some modules:
    ml rocm
    ml cmake
    
  • Next, for now, we compile with:
    pc_build -f compilers/Cray_MPI FFLAGS+=" -g -O0" LDFLAGS+='-Wl,--no-relax -L /opt/cray/pe/lib64 -lmpi_gtl_hsa'
    
    This is because the compiler has a bug, so we need to disable optimization with -O0. Later, we could use
    pc_build -f compilers/Cray_MPI FFLAGS+=" -g" LDFLAGS+='-Wl,--no-relax'
    
    or
    pc_build -f compilers/Cray_MPI LDFLAGS+='-Wl,--no-relax'
    
    The linker part is needed in all cases.

  • Make sure you have: src/astaroth/DSL/local/equations.h.

  • To work with the gnu compiler, we start by saying (as usual):
    module swap PrgEnv-cray PrgEnv-gnu
    
    To compile, we must suppress the "FSTD_95=-std=f95" option. (But how?) We would then use
    pc_build -f compilers/GNU-GCC_MPI FFLAGS+=" -g -mcmodel=large" LDFLAGS+='-L /opt/cray/pe/lib64 -lmpi_gtl_hsa'
    
    The plan is to work in:
    /cfs/klemming/home/b/brandenb/data/GPU/axel/decay/reconnection/k60_nu5em6_k4_Pm5_128b_gnu
    


  • To compare the results of the 3 runs above, we use pcomp_GPU.pro in the idl directory. The resulting pcomp_GPU.pdf file in the idl directory shows that the GPU run (black lines, solid for Brms and dashed for urms) results in a slightly stronger magnetic field than the CPU runs with either 128 CPUs (red line) or just 8 CPUs (dashed orange line on top).

    The figure pcomp_GPU.pdf shows that Brms starts off at a value of around 0.003, while urms is driven by the Lorentz force and quickly reaches comparable values. The GPU runs reproduce the CPU runs qualitatively, and the initial condition exactly.
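
    As a rough Python alternative to pcomp_GPU.pro (a sketch only, assuming the Pencil Code python module and that print.in contains urms and brms), the three runs could be compared like this:

    import matplotlib.pyplot as plt
    import pencil as pc

    # run label -> data directory (paths relative to the reconnection directory)
    runs = {
        'GPU':           'k600_nu5em7_k4_Pm5_128a/data',
        'CPU, 128 proc': 'k600_nu5em7_k4_Pm5_128a_CPU/data',
        'CPU, 8 proc':   'k600_nu5em7_k4_Pm5_128a_8CPU/data',
    }
    for label, datadir in runs.items():
        ts = pc.read.ts(datadir=datadir)
        plt.semilogy(ts.t, ts.brms, label=label + ': Brms')        # solid: Brms
        plt.semilogy(ts.t, ts.urms, '--', label=label + ': urms')  # dashed: urms
    plt.xlabel('t')
    plt.legend()
    plt.savefig('pcomp_GPU_py.pdf')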

    Fixed time step: run directories:

  • GPU run k600_nu5em5_k4_Pm5_128a.
  • CPU run k600_nu5em5_k4_Pm5_128a_CPU with 128 CPUs.



    The two runs now agree.

    Open issues:

  • In the CPU version, when spec_start=T, we output spectra at the very first timestep plus the next one. The reason is that the gravitational wave module (where all evolution happens at the level of auxiliaries!) does not yet have data at the first time. But even without gravitational waves this was useful, because velocity spectra also only exist after the first time step. Output at the timestep after the first one should therefore still be implemented in the GPU version.
  • At the moment, we have export OMP_NUM_THREADS=8. Maybe we should set it to 7? This may be LUMI-specific.
  • Even after disabling all VAR-file and slice outputs, and with it1=100, isave=1000, and a spectrum output interval of 1 (corresponding to 21 spectra during the full run), the GPU code needs about 16 times more wallclock time for the same task; see the timing results below and the quick check after them.

    Run directories:

  • GPU run k60_nu2em5_k4_Pm5_128c_dt_GPU.
  • CPU run k60_nu2em5_k4_Pm5_128c_dt with 128 CPUs.

  • Performance for 128³ tests:
    GPU:
     Wall clock time [hours] =  0.850     (+/-  8.3333E-12)
     Wall clock time/timestep/meshpoint [microsec] = 0.1620545    
     Maximum used memory per cpu [MBytes] =  1904.918
     Maximum used memory [GBytes] =       14.003
    
    Compared to:
    CPU:
     Wall clock time [hours] =  5.454E-02 (+/-  5.5556E-12)
     Wall clock time/timestep/meshpoint [microsec] = 1.0402362E-02
     Maximum used memory per cpu [MBytes] =    43.348
     Maximum used memory [GBytes] =        4.776
    
  • When we also put isave=10000 (as opposed to 1000) and dspec=10 (as opposed to 1), we find (in k60_nu2em5_k4_Pm5_128c_dt_GPU_lspec):
     Wall clock time [hours] =  0.624     (+/-  5.5556E-12)
     Wall clock time/timestep/meshpoint [microsec] = 0.1190291
     Maximum used memory per cpu [MBytes] =  1519.820
     Maximum used memory [GBytes] =       11.492
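
    As a quick sanity check of the quoted factor of about 16, using the numbers from the first GPU/CPU pair above:

    # ratio of wallclock time per timestep per meshpoint, GPU vs. CPU (128 procs)
    gpu = 0.1620545      # microsec, from the GPU run above
    cpu = 1.0402362e-02  # microsec, from the CPU run above
    print(gpu / cpu)     # ~15.6, i.e. about 16 times slower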
    


    Magnetic energy (upper two lines) and kinetic energy (lower two lines). Red lines: GPU run; blue lines: CPU run. (The dashed orange line is for the CPU code without upwinding of the vector potential; see below.) Both runs use eta=4e-6 and nu=2e-5. The initial spectra are essentially the same (the slight differences are probably explained by the slightly different output times). The magnetic and velocity fields appear to experience slightly less diffusion in the GPU code. This difference becomes much bigger at smaller viscosities (e.g., 4 times smaller).

    The difference is caused by the use of upwinding for the vector potential in the CPU version, which is not yet implemented in the GPU version; see

    src/astaroth/DSL/magnetic/induction.h 
    
    To confirm this, we ran the CPU version without upwinding of the vector potential; see the dashed orange line in the spectra above.
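
    As a generic illustration of why upwinding acts like extra diffusion (a much simplified first-order example, not the actual higher-order Pencil Code scheme), note that the upwind one-sided derivative equals the centered derivative plus a numerical-diffusion correction:

    import numpy as np

    nx = 256
    dx = 2*np.pi / nx
    x = np.arange(nx) * dx
    a = np.sin(3*x) + 0.3*np.cos(7*x)    # arbitrary smooth periodic test field

    upwind   = (a - np.roll(a, 1)) / dx                        # one-sided derivative (for u > 0)
    centered = (np.roll(a, -1) - np.roll(a, 1)) / (2*dx)       # centered derivative
    d2adx2   = (np.roll(a, -1) - 2*a + np.roll(a, 1)) / dx**2  # second derivative

    # identity for these stencils: upwind = centered - (dx/2) * d2a/dx2,
    # i.e. advecting with the upwind derivative is equivalent to the centered
    # scheme plus a diffusion term with coefficient |u|*dx/2
    print(np.abs(upwind - (centered - 0.5*dx*d2adx2)).max())   # at roundoff level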
  • The problem with the spectra not being output at the first and second time steps is seen in the following comparison, which shows first the incorrect output from the GPU version (k60_nu2em5_k4_Pm5_128c_dt_GPU) and then the output from the CPU version (k60_nu2em5_k4_Pm5_128c_dt):
    brandenb@login1:/cfs/klemming/home/b/brandenb/data/GPU/axel/decay/reconnection/k60_nu2em5_k4_Pm5_128c_dt_GPU_lspec> head -20 ../k60_nu2em5_k4_Pm5_128c_dt_GPU/data/power_kin.dat
     3.36215353875403758E-2
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
     1.0323241484733396
      1.05E-09  1.03E-06  8.25E-06  1.30E-05  2.44E-05  2.36E-05  2.42E-05  2.37E-05
      2.17E-05  2.32E-05  1.87E-05  1.70E-05  1.63E-05  1.57E-05  1.45E-05  1.22E-05
      1.23E-05  1.12E-05  1.01E-05  8.78E-06  8.73E-06  8.30E-06  7.53E-06  7.00E-06
      6.58E-06  6.55E-06  6.02E-06  5.59E-06  5.44E-06  5.10E-06  4.90E-06  4.68E-06
      4.29E-06  4.34E-06  4.09E-06  3.88E-06  3.80E-06  3.55E-06  3.51E-06  3.26E-06
      3.22E-06  3.17E-06  2.97E-06  2.90E-06  2.80E-06  2.79E-06  2.63E-06  2.57E-06
      2.49E-06  2.45E-06  2.41E-06  2.32E-06  2.30E-06  2.25E-06  2.18E-06  2.14E-06
      2.11E-06  2.12E-06  2.05E-06  2.01E-06  2.02E-06  1.99E-06  1.97E-06  1.89E-06
     2.0274802674034622
      5.33E-09  1.41E-06  8.09E-06  1.06E-05  1.71E-05  1.69E-05  1.38E-05  1.30E-05
    brandenb@login1:/cfs/klemming/home/b/brandenb/data/GPU/axel/decay/reconnection/k60_nu2em5_k4_Pm5_128c_dt_GPU_lspec> head -20 ../k60_nu2em5_k4_Pm5_128c_dt/data/power_kin.dat
       0.0000000000000000     
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
      0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
       2.3946539533068810E-003
      1.87E-14  3.46E-10  3.50E-09  1.13E-08  3.50E-08  8.70E-08  1.42E-07  2.20E-07
      3.32E-07  5.47E-07  6.85E-07  7.50E-07  9.60E-07  1.11E-06  1.23E-06  1.21E-06
      1.47E-06  1.55E-06  1.62E-06  1.56E-06  1.74E-06  1.83E-06  1.80E-06  1.83E-06
      1.88E-06  2.00E-06  2.00E-06  2.01E-06  2.01E-06  2.01E-06  2.03E-06  2.09E-06
      2.02E-06  2.13E-06  2.15E-06  2.12E-06  2.14E-06  2.12E-06  2.13E-06  2.05E-06
      2.12E-06  2.12E-06  2.07E-06  2.10E-06  2.04E-06  2.08E-06  2.02E-06  1.99E-06
      2.01E-06  2.01E-06  1.98E-06  1.91E-06  1.96E-06  1.92E-06  1.90E-06  1.87E-06
      1.86E-06  1.86E-06  1.82E-06  1.80E-06  1.81E-06  1.78E-06  1.79E-06  1.74E-06
       1.0028751054320695     
      4.45E-09  1.78E-06  1.84E-05  1.95E-05  4.41E-05  4.40E-05  3.87E-05  3.47E-05
    

    See also: readme.txt.

    Other discussion points:

  • Merging the branch into master.
  • Auto-tuning to be put into pc_newrun when -s is invoked.

    Fixed time step (agreement with lupw_aa=F): run directories:

  • GPU run k600_nu5em5_k4_Pm5_128a (jrms=2.958E-02).
  • CPU run k600_nu5em5_k4_Pm5_128a_8CPU_lupw_aaF with 8 CPUs and lupw_aa=F: agrees perfectly (jrms=2.958E-02)!
  • CPU run k600_nu5em5_k4_Pm5_128a_8CPU with 8 CPUs, but still lupw_aa=T: does not agree well enough (jrms=2.957E-02).
  • CPU run k600_nu5em5_k4_Pm5_128a_CPU with 128 CPUs and lupw_aa=T: does not agree well enough (jrms=2.961E-02).

    where the jrms values are taken from the time series at the last time of each run. Doing a diff on the spectra indicates that the time stamps are wrong, but the data themselves are correct; a comparison sketch follows below.
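
    A sketch of such a comparison, ignoring the time stamps (assuming the power_*.dat layout shown further up, i.e. one time value followed by the spectrum, with nk=64 bins at 128³):

    import numpy as np

    def read_power(fname, nk=64):
        """Read a power_*.dat file: blocks of one time stamp followed by nk spectral values."""
        with open(fname) as f:
            vals = np.array(f.read().split(), dtype=float)
        nblock = nk + 1
        vals = vals[:len(vals)//nblock*nblock].reshape(-1, nblock)
        return vals[:, 0], vals[:, 1:]               # times, spectra

    t1, s1 = read_power('k600_nu5em5_k4_Pm5_128a/data/power_kin.dat')
    t2, s2 = read_power('k600_nu5em5_k4_Pm5_128a_8CPU_lupw_aaF/data/power_kin.dat')
    n = min(len(t1), len(t2))
    print('max time-stamp difference:', np.abs(t1[:n] - t2[:n]).max())  # nonzero if the stamps are wrong
    print('max spectrum difference:  ', np.abs(s1[:n] - s2[:n]).max())  # ~0 if the data agree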

    References:

    Presentation by Matthias Rheinhardt (Aalto University) at the Pencil Code User Meeting 2024 in Barcelona about the GPU acceleration in the Pencil Code using Astaroth: Introduction to PC-A [pptx] (25 Sep 2024)

    Brandenburg, A., Neronov, A., & Vazza, F.: 2024a, "Resistively controlled primordial magnetic turbulence decay," Astron. Astrophys., in press (arXiv:2401.08569)

    Brandenburg, A., Neronov, A., & Vazza, F.: 2024b, Datasets for "Resistively controlled primordial magnetic turbulence decay," v2024.01.18, Zenodo, DOI:10.5281/zenodo.10527437



  • The runs above had wav1=10. When I repeat the GPU run with wav1=1 (k60_nu5em6_k4_Pm5_128a), I get "error while loading shared libraries: libamdhip64.so.5: cannot open shared object file: No such file or directory". I suspect that the "pc_newrun -s" command is still incomplete. I then see only
    brandenb@login1:/cfs/klemming/home/b/brandenb/data/GPU/axel/decay/reconnection/k60_nu5em6_k4_Pm5_128a> ls -l src/astaroth/submodule/acc-runtime/
    total 8
    drwxr-sr-x 2 brandenb pg_snic2020-4-12 4096 dec 30 02:03 built-in
    drwxr-sr-x 5 brandenb pg_snic2020-4-12 4096 dec 30 02:03 samples
    
    while in the old directory we have
    brandenb@login1:/cfs/klemming/home/b/brandenb/data/GPU/axel/decay/reconnection/k600_nu5em7_k4_Pm5_128a> ls -l src/astaroth/submodule/acc-runtime/
    total 24
    lrwxrwxrwx 1 brandenb pg_snic2020-4-12   89 dec 28 12:15 acc -> /cfs/klemming/home/b/brandenb/data/GPU/pencil-code/src/astaroth/submodule/acc-runtime/acc
    lrwxrwxrwx 1 brandenb pg_snic2020-4-12   89 dec 28 12:15 api -> /cfs/klemming/home/b/brandenb/data/GPU/pencil-code/src/astaroth/submodule/acc-runtime/api
    drwxr-sr-x 2 brandenb pg_snic2020-4-12 4096 dec 28 12:15 built-in
    lrwxrwxrwx 1 brandenb pg_snic2020-4-12  100 dec 28 12:15 CMakeLists.txt -> /cfs/klemming/home/b/brandenb/data/GPU/pencil-code/src/astaroth/submodule/acc-runtime/CMakeLists.txt
    drwxr-sr-x 2 brandenb pg_snic2020-4-12 4096 dec 28 12:15 dynamic
    drwxr-sr-x 5 brandenb pg_snic2020-4-12 4096 dec 28 11:23 samples
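
    Until "pc_newrun -s" is fixed, one possible (untested) workaround is to recreate by hand the links that exist in the old directory, as in the hypothetical sketch below; whether this also resolves the libamdhip64.so.5 error has not been verified, and the real 'dynamic' subdirectory is not recreated here.

    import os

    # source tree and new run directory; adjust to your own paths
    SRC = '/cfs/klemming/home/b/brandenb/data/GPU/pencil-code/src/astaroth/submodule/acc-runtime'
    DST = 'src/astaroth/submodule/acc-runtime'   # inside the new run directory

    for name in ('acc', 'api', 'CMakeLists.txt'):
        link = os.path.join(DST, name)
        if not os.path.lexists(link):
            os.symlink(os.path.join(SRC, name), link)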
    



    Axel Brandenburg
    $Date: 2025/01/08 09:45:13 $, $Author: brandenb $, $Revision: 1.17 $