= STREAM =
== A few words about numactl ==
NUMA is an acronym for Non-Uniform Memory Access, and numactl is the tool used to bind processes and their memory to NUMA nodes. The following are a few important keywords one should know before embarking on the numactl mission:
+ | <source lang=make> | ||
+ | physcpubind = ID of the cores | ||
+ | cpunodebind = ID of the nodes | ||
+ | membind = ID of the node that the memory is assigned to | ||
+ | </source> | ||
+ | For example, on an AMD machine with 16 cores, or in the terminology of NUMA, 4 nodes with 4 cores on each node, the command line | ||
+ | <source lang=make> | ||
--membind=0 --physcpubind=0-3
+ | </source> | ||
assigns four threads running on cores 0 to 3 (node 0), with the memory also allocated on node 0. However, the command line
+ | <source lang=make> | ||
--membind=1 --physcpubind=0-3
+ | </source> | ||
assigns four threads to cores 0 to 3 (node 0), but the memory is allocated on node 1. Because this memory is not local to the node the threads run on, performance suffers. Memory can also be allocated locally to the executing node with the "-l" (--localalloc) option of numactl.
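For example, a minimal sketch of a local-allocation run (the "./stream" binary name is just illustrative):
<source lang=make>
# -l (--localalloc) allocates memory on the node(s) the process is bound to
numactl -l --physcpubind=0-3 ./stream
</source>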
+ | |||
Alternatively, the command lines above can be shortened by using "cpunodebind". For example,
+ | <source lang=make> | ||
--membind=0 --cpunodebind=0
+ | </source> | ||
+ | |||
means that the memory is allocated on node 0 and the threads also run on node 0. One should note that with
"cpunodebind" the number of threads is tied to the number of cores on the node, so in this case the number of threads has to be four. However, if we wish to run only two threads on node 0, that is only possible with "physcpubind". "physcpubind" gives you more control over where your threads run, since you can choose the exact cores for your jobs (see the sketch below).
For a detailed description, please consult the numactl manual page.
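A sketch of the two-thread case mentioned above (assuming an OpenMP build of STREAM, where OMP_NUM_THREADS sets the thread count):
<source lang=make>
export OMP_NUM_THREADS=2
# two threads pinned to cores 0 and 1 of node 0, memory also on node 0
numactl --membind=0 --physcpubind=0,1 ./stream
</source>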
+ | |||
+ | == Intel (2 x E5-2643 @ 3.30GHz)== | ||
+ | |||
STREAM is a well-known memory bandwidth benchmark. Before attempting to find the maximum bandwidth, it is necessary to establish the architecture of the machine. The command "numactl --hardware" on this machine produces:
+ | |||
+ | <source lang=make> | ||
+ | |||
+ | available: 2 nodes (0-1) | ||
+ | node 0 cpus: 0 1 2 3 | ||
+ | node 0 size: 32739 MB | ||
+ | node 0 free: 30624 MB | ||
+ | node 1 cpus: 4 5 6 7 | ||
+ | node 1 size: 32768 MB | ||
+ | node 1 free: 31280 MB | ||
+ | node distances: | ||
+ | node 0 1 | ||
+ | 0: 10 21 | ||
+ | 1: 21 10 | ||
+ | |||
+ | </source> | ||
+ | |||
From the output above, we can conclude that there are two NUMA nodes with four cores each, eight cores in total.
+ | |||
+ | Before measuring the maximum memory bandwidth of the server, we first determine the number of threads required to achieve the maximum bandwidth of a given NUMA node. Results are summarized in the following table: | ||
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | |- | ||
+ | !Number <br> of threads!!Bandwidth <br>(GB/s) | ||
+ | |- | ||
+ | |1||9.5 | ||
+ | |- | ||
+ | |2||18.8 | ||
+ | |- | ||
+ | |3||21.4 | ||
+ | |- | ||
+ | |4||34.0 | ||
+ | |} | ||
+ | |||
From the table above, we conclude that four threads per node are enough to reach the node's maximum bandwidth. The table was obtained by running the threads on node 0 with the memory allocated on the same node; the result can be reproduced on the other node as well.
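The per-node numbers above can be reproduced with runs of the following form (a sketch; the thread count was varied from 1 to 4):
<source lang=make>
export OMP_NUM_THREADS=3
# three threads on node 0 with memory allocated locally on node 0
numactl --membind=0 --physcpubind=0-2 ./stream
</source>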
+ | |||
The following table shows the effect on memory bandwidth of varying the memory allocation relative to the node whose processors run the threads (number of threads is four):
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | |- | ||
+ | !MEM<br>CPU|||0||1 | ||
+ | |- | ||
+ | |'''0'''||34.0||17.4 | ||
+ | |- | ||
+ | |'''1'''||18.9||33.5 | ||
+ | |} | ||
+ | |||
In the table above, the memory node varies along the rows while the CPU node varies along the columns. The effect of memory binding relative to the cores running the threads is clearly visible. Note that the table resembles the node distance table reported earlier by "numactl --hardware".
+ | |||
+ | ==AMD (2 x 6220 @ 3.0 GHz)== | ||
+ | |||
This is an Interlagos machine with 16 cores (in NUMA terminology, 4 nodes with 4 cores each). Each core has 4 GB of memory, giving 64 GB for the machine. The code was compiled with the Open64 compiler (a build sketch follows the topology listing below). It is noteworthy that gcc gives about half the bandwidth of Open64, while the Intel compiler results vary on this machine (from about 64 GB/s down to 40 GB/s). "numactl --hardware" produces:
+ | |||
+ | <source lang=make> | ||
+ | |||
+ | available: 4 nodes (0-3) | ||
+ | node 0 cpus: 0 1 2 3 | ||
+ | node 0 size: 16382 MB | ||
+ | node 0 free: 2930 MB | ||
+ | node 1 cpus: 4 5 6 7 | ||
+ | node 1 size: 16384 MB | ||
+ | node 1 free: 5082 MB | ||
+ | node 2 cpus: 8 9 10 11 | ||
+ | node 2 size: 16384 MB | ||
+ | node 2 free: 2281 MB | ||
+ | node 3 cpus: 12 13 14 15 | ||
+ | node 3 size: 16368 MB | ||
+ | node 3 free: 550 MB | ||
+ | node distances: | ||
+ | node 0 1 2 3 | ||
+ | 0: 10 16 16 16 | ||
+ | 1: 16 10 16 16 | ||
+ | 2: 16 16 10 16 | ||
+ | 3: 16 16 16 10 | ||
+ | |||
+ | </source> | ||
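As noted above, STREAM was built with the Open64 compiler. A minimal build-and-run sketch, not the exact commands used (the -mp OpenMP flag is an assumption about the Open64 toolchain; check "opencc -help" on your system):
<source lang=make>
opencc -O3 -mp stream.c -o stream
export OMP_NUM_THREADS=4
numactl --membind=0 --cpunodebind=0 ./stream
</source>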
+ | |||
+ | Following table describes memory bandwidth on a single node by varying number of threads: | ||
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | |- | ||
+ | !Number <br> of threads!!Bandwidth <br>(GB/s) | ||
+ | |- | ||
+ | |1||14.0 | ||
+ | |- | ||
+ | |2||15.0 | ||
+ | |- | ||
+ | |3||17.8 | ||
+ | |- | ||
+ | |4||18.5 | ||
+ | |} | ||
+ | |||
Again, as on the Intel machine, four threads per node suffice to reach the maximum bandwidth.
+ | |||
The following table shows the effect on memory bandwidth of varying the memory allocation relative to the node whose processors run the threads (number of threads is four):
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | |- | ||
+ | !MEM<br>CPU|||0||1||2||3 | ||
+ | |- | ||
+ | |'''0'''||18.1||11.8||6.5||5.6 | ||
+ | |- | ||
+ | |'''1'''||11.8||18.7||5.5||6.5 | ||
+ | |- | ||
+ | |'''2'''||6.5||5.5||18.5||11.6 | ||
+ | |- | ||
+ | |'''3'''||5.6||6.5||11.8||18.5 | ||
+ | |} | ||
+ | |||
Contrary to the Intel machine, the table above does not agree with the node distances reported by "numactl --hardware"!
+ | |||
+ | ==AMD (4 x 6378 @ 2.4 GHz)== | ||
+ | In NUMA terminology, this server has 8 nodes with 8 cores on each. | ||
+ | <source lang=make> | ||
+ | numactl --hardware | ||
+ | |||
+ | available: 8 nodes (0-7) | ||
+ | node 0 cpus: 0 1 2 3 4 5 6 7 | ||
+ | node 0 size: 32765 MB | ||
+ | node 0 free: 29324 MB | ||
+ | node 1 cpus: 8 9 10 11 12 13 14 15 | ||
+ | node 1 size: 32768 MB | ||
+ | node 1 free: 31892 MB | ||
+ | node 2 cpus: 16 17 18 19 20 21 22 23 | ||
+ | node 2 size: 32768 MB | ||
+ | node 2 free: 31900 MB | ||
+ | node 3 cpus: 24 25 26 27 28 29 30 31 | ||
+ | node 3 size: 32768 MB | ||
+ | node 3 free: 31911 MB | ||
+ | node 4 cpus: 32 33 34 35 36 37 38 39 | ||
+ | node 4 size: 32768 MB | ||
+ | node 4 free: 31964 MB | ||
+ | node 5 cpus: 40 41 42 43 44 45 46 47 | ||
+ | node 5 size: 32768 MB | ||
+ | node 5 free: 31942 MB | ||
+ | node 6 cpus: 48 49 50 51 52 53 54 55 | ||
+ | node 6 size: 32768 MB | ||
+ | node 6 free: 31866 MB | ||
+ | node 7 cpus: 56 57 58 59 60 61 62 63 | ||
+ | node 7 size: 32752 MB | ||
+ | node 7 free: 31960 MB | ||
+ | node distances: | ||
+ | node 0 1 2 3 4 5 6 7 | ||
+ | 0: 10 16 16 22 16 22 16 22 | ||
+ | 1: 16 10 22 16 22 16 22 16 | ||
+ | 2: 16 22 10 16 16 22 16 22 | ||
+ | 3: 22 16 16 10 22 16 22 16 | ||
+ | 4: 16 22 16 22 10 16 16 22 | ||
+ | 5: 22 16 22 16 16 10 22 16 | ||
+ | 6: 16 22 16 22 16 22 10 16 | ||
+ | 7: 22 16 22 16 22 16 16 10 | ||
+ | </source> | ||
+ | |||
+ | Memory bandwidth on a single node by varying number of threads: | ||
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | |- | ||
+ | !Number <br> of threads!!Bandwidth <br>(GB/s) | ||
+ | |- | ||
+ | |1||13.0 | ||
+ | |- | ||
+ | |2||14.1 | ||
+ | |- | ||
+ | |3||17.1 | ||
+ | |- | ||
+ | |4||17.4 | ||
+ | |- | ||
+ | |5||17.1 | ||
+ | |- | ||
+ | |6||16.7 | ||
+ | |- | ||
+ | |7||16.6 | ||
+ | |- | ||
+ | |8||16.1 | ||
+ | |} | ||
+ | |||
The following table shows the variation of memory bandwidth as the memory allocation is changed relative to the cores running the threads (number of threads = 4):
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | |- | ||
+ | !MEM<br>CPU|||0||1||2||3||4||5||6||7 | ||
+ | |- | ||
+ | |'''0'''||17.3||8.0||5.6||4.1||5.7||4.1||5.5||4.0 | ||
+ | |- | ||
+ | |'''1'''||8.2||17.6||6.5||6.5||4.0||5.5||4.0||5.4 | ||
+ | |- | ||
+ | |'''2'''||5.7||6.5||17.9||7.9||5.6||4.1||5.6||4.1 | ||
+ | |- | ||
+ | |'''3'''||4.1||6.5||8.1||17.8||4.1||5.6||4.1||5.7 | ||
+ | |- | ||
+ | |'''4'''||5.6||4.0||5.7||4.2||17.7||7.9||5.7||4.1 | ||
+ | |- | ||
+ | |'''5'''||4.0||5.6||4.1||5.6||8.1||17.7||4.0||5.5 | ||
+ | |- | ||
+ | |'''6'''||5.4||4.0||5.6||4.1||5.7||4.1||17.8||7.9 | ||
+ | |- | ||
+ | |'''7'''||3.9||5.4||4.0||5.6||4.2||5.6||8.1||17.7 | ||
+ | |} | ||
+ | |||
+ | == Bandwidth in terms of Socket== | ||
+ | |||
On the AMD 6200- and 6300-series machines, a socket consists of two NUMA nodes. A socket therefore has 16 cores on the 6378 server and 8 cores on the 6220 server. Since the memory bandwidth of each NUMA node peaks at about 4 threads, the next question is the maximum bandwidth of a socket.
A reasonable guess from the previous results is to use 8 threads for the socket, 4 on each NUMA node. Running STREAM on 8 cores as follows:
+ | |||
+ | <source lang=make> | ||
+ | numactl --physcpubind=0,1,2,3,8,9,10,11 --membind=0,1 ./stream | ||
+ | </source> | ||
+ | we get 34.7 GB/s memory bandwidth. | ||
+ | |||
Similarly, running
+ | <source lang=make> | ||
+ | numactl --physcpubind=0,2,4,6,8,10,12,14 --membind=0,1 ./stream | ||
+ | </source> | ||
also yields about 35 GB/s of bandwidth.
+ | |||
+ | By varying the membind to different sockets as follows: | ||
+ | |||
+ | <source lang=make> | ||
+ | |||
+ | numactl --physcpubind=0,2,4,6,8,10,12,14 --membind=0,1 ./stream | ||
+ | numactl --physcpubind=0,2,4,6,8,10,12,14 --membind=2,3 ./stream | ||
+ | numactl --physcpubind=0,2,4,6,8,10,12,14 --membind=4,5 ./stream | ||
+ | numactl --physcpubind=0,2,4,6,8,10,12,14 --membind=6,7 ./stream | ||
+ | numactl --physcpubind=16,18,20,22,24,26,28,30 --membind=0,1 ./stream | ||
+ | numactl --physcpubind=16,18,20,22,24,26,28,30 --membind=2,3 ./stream | ||
+ | numactl --physcpubind=16,18,20,22,24,26,28,30 --membind=4,5 ./stream | ||
+ | numactl --physcpubind=16,18,20,22,24,26,28,30 --membind=6,7 ./stream | ||
+ | numactl --physcpubind=32,34,36,38,40,42,44,46 --membind=0,1 ./stream | ||
+ | numactl --physcpubind=32,34,36,38,40,42,44,46 --membind=2,3 ./stream | ||
+ | numactl --physcpubind=32,34,36,38,40,42,44,46 --membind=4,5 ./stream | ||
+ | numactl --physcpubind=32,34,36,38,40,42,44,46 --membind=6,7 ./stream | ||
+ | numactl --physcpubind=48,50,52,54,56,58,60,62 --membind=0,1 ./stream | ||
+ | numactl --physcpubind=48,50,52,54,56,58,60,62 --membind=2,3 ./stream | ||
+ | numactl --physcpubind=48,50,52,54,56,58,60,62 --membind=4,5 ./stream | ||
+ | numactl --physcpubind=48,50,52,54,56,58,60,62 --membind=6,7 ./stream | ||
+ | |||
+ | </source> | ||
we get the following table (in terms of sockets: nodes 0-1 form socket 1, nodes 2-3 form socket 2, and so on):
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | |- | ||
+ | !MEM<br>CPU|||1||2||3||4 | ||
+ | |- | ||
+ | |'''1'''||35.2||11.2||11.0||10.7 | ||
+ | |- | ||
+ | |'''2'''||11.3||35.3||11.2||11.1 | ||
+ | |- | ||
+ | |'''3'''||10.9||11.2||35.2||11.0 | ||
+ | |- | ||
+ | |'''4'''||10.7||11.1||11.1||35.4 | ||
+ | |} | ||
+ | |||
+ | = VASP = | ||
This section describes benchmarking of the Vienna Ab-initio Simulation Package (VASP), a plane-wave density functional theory code used to study the electronic structure of materials.
+ | |||
+ | ==Intel (2 x E5-2643 @ 3.30GHz)== | ||
+ | |||
+ | ===Native FFT Library=== | ||
+ | Following libraries and flags were used: | ||
+ | |||
+ | <source lang=make> | ||
MKLDIR = $(HPC_MKL_DIR)
+ | MKLLIBS = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core | ||
+ | MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64 | ||
+ | FFTLIB = -lfftw3xf | ||
+ | INCS = -I$(MKLDIR)/include/fftw | ||
+ | FFT_OBJS = fftmpi.o fftmpi_map.o fftw3d.o fft3dlib.o | ||
+ | FFLAGS = -free -names lowercase -assume byterecl | ||
+ | OFLAG = -O2 -xsse2 -unroll-aggressive -warn general | ||
+ | </source> | ||
As a first check, the Streaming SIMD Extensions (SSE) instruction set was varied; the following table gives the run time of a self-consistent field (SCF) calculation for MgMOS (for input files, please ask Charles Taylor or Manoj Srivastava):
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | !align="Center"|SIMD Instruction!!Time(s) | ||
+ | |- | ||
+ | |align="Center"|sse2||align="Center"|158 | ||
+ | |- | ||
+ | |align="Center"|sse4.1||align="Center"|156 | ||
+ | |- | ||
+ | |align="Center"|sse4.2||align="Center"|155 | ||
+ | |- | ||
+ | |align="Center"|avx||align="Center"|155 | ||
+ | |- | ||
+ | |align="Center"|ssse3||align="Center"|156 | ||
+ | |} | ||
+ | |||
+ | There does not seem to be a significant impact of SSE sets on the run time of VASP. | ||
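For reference, the SIMD set in the table above is selected through the OFLAG line of the makefile; a sketch, with the flag spelled in the same lower-case style as the listing above (only the -x value changes between builds):
<source lang=make>
# one build per row of the table; e.g. for the avx row:
OFLAG = -O2 -xavx -unroll-aggressive -warn general
</source>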
===MKL FFTs (via FFTW wrappers)===
Profiling showed that the code spends most of its time in the FFT routines, so the next step was to change the FFT library. The following change was made:
<source lang=make>
FFT_OBJS = fftmpi_map.o fftmpiw.o fftw3d.o fft3dlib.o
</source>
(The change here is the replacement of "fftmpi.o" in the original VASP makefile with "fftmpiw.o".)
+ | <source lang=make> | ||
+ | MKLDIR = $(HPC_MKL_DIR) | ||
+ | MKLLIBS = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core | ||
+ | MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64 | ||
+ | FFTLIB = -lfftw3xf | ||
+ | INCS = -I$(MKLDIR)/include/fftw | ||
+ | FFLAGS = -free -names lowercase -assume byterecl | ||
OFLAG = -O2 -xsse2 -unroll-aggressive -warn general
</source>
The following table shows the run-time variation with the SIMD instruction sets:

{| border=3 align="Center" style="text-align: center;"
!SIMD Instruction!!Time(s)
|-
|sse2||97
|-
|sse4.1||95
|-
|sse4.2||94
|-
|avx||94
|-
|ssse3||94
|}
In conclusion, replacing "fftmpi.o" with "fftmpiw.o" (the MKL FFTW wrappers) reduced the run time from 158 s to 97 s on the Intel machine (E5-2643 @ 3.30GHz), a significant speedup of roughly 1.6x (about 60%).
=== FFTW FFTs ===
We also compiled VASP against the FFT library from the FFTW package, with the following flags:
<source lang=make>
MKLDIR = $(HPC_MKL_DIR)
+ | MKLLIBS = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core | ||
+ | MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64 | ||
+ | FFTWDIR = /apps/fftw/3.3.2 | ||
+ | FFTLIB = -L$(FFTWDIR)/lib -lfftw3 | ||
+ | INCS = -I$(FFTWDIR)/include | ||
+ | FFT_OBJS = fftmpi_map.o fftmpiw.o fftw3d.o fft3dlib.o | ||
+ | FFLAGS = -free -names lowercase -assume byterecl | ||
+ | OFLAG = -O2 -xsse2 -unroll-aggressive -warn general | ||
+ | |||
+ | </source> | ||
+ | From our previous experience, we concluded that the performance of VASP did not depend substantially on the SIMD instruction sets, so for FFTW library, we only tried one set. Following is the result: | ||
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | |- | ||
+ | !SIMD Instruction!!Time(s) | ||
+ | |- | ||
+ | |sse2||118 | ||
+ | |} | ||
+ | |||
+ | We conclude that FFTs from MKL library are better than the ones from FFTW. | ||
+ | |||
+ | == AMD (2 x 6220 @ 3.0 GHz) == | ||
+ | |||
This machine has 16 cores, or in numactl terminology, 4 NUMA nodes with 4 cores each. Since VASP performance depends heavily on the choice of FFT library, we checked the performance of this machine with different FFTs: the native FFT provided with the VASP package, MKL, and FFTW. We also built the FFTW libraries with various flags to see whether a better choice of FFT could be found.
The libraries and flags used to compile VASP are as follows (the FFT libraries were changed depending on which FFT was being tested):
+ | <source lang=make> | ||
+ | |||
+ | MKLDIR = $(HPC_MKL_DIR) | ||
MKLLIBS = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
+ | MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64 | ||
+ | FFTWDIR = /apps/fftw/3.3.2 | ||
+ | FFTLIB = -L$(FFTWDIR)/lib -lfftw3 | ||
+ | INCS = -I$(FFTWDIR)/include | ||
+ | FFT_OBJS = fftmpi_map.o fftmpiw.o fftw3d.o fft3dlib.o | ||
+ | FFLAGS = -free -names lowercase -assume byterecl | ||
+ | OFLAG = -O2 -xsse2 -unroll-aggressive -warn general | ||
+ | |||
+ | </source> | ||
+ | |||
From a computer architecture point of view, the Bulldozer core (aka module) of the AMD server lies between a true dual-core processor and a single core with simultaneous multithreading. The two cores of a module share resources such as the L2 cache and the floating point unit (FPU), so a code's performance is affected by whether its threads run on cores that share a module ("shared") or on cores in separate modules ("exclusive").
For detailed information about the Bulldozer core, see
+ | http://en.wikipedia.org/wiki/Bulldozer_%28microarchitecture%29 | ||
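The two placements can be expressed as numactl core lists; a sketch for an 8-process VASP run on the 6220, assuming cores 2n and 2n+1 form one module and that mpirun/numactl were driven this way (the text does not give the exact launch commands):
<source lang=make>
# "shared": 8 ranks packed onto 4 modules, so pairs of ranks share an FPU and L2
mpirun -np 8 numactl --physcpubind=0-7 ./vasp
# "exclusive": 8 ranks spread one per module, no FPU or L2 sharing
mpirun -np 8 numactl --physcpubind=0,2,4,6,8,10,12,14 ./vasp
</source>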
+ | |||
+ | |||
+ | The results are summarized in the following table (8 processor runs): | ||
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | !Run Scheme!! Native!! MKL!!FFTW!!FFTW!!FFTW!!FFTW!!FFTW!!FFTW | ||
+ | |- | ||
+ | |Shared <br>time(s)||399||261||333||319||334||336||315||319 | ||
+ | |- | ||
+ | |Exclusive <br> time (s)||274||159||217||219||215||217||213||211 | ||
+ | |- | ||
+ | |Notes||-||-||1||2||3||4||5||6 | ||
+ | |} | ||
+ | |||
+ | In the above table, "shared" represents cores that share resources such as L2-Cache and FPU, while "exclusive" stands for cores which do not share any resources. | ||
+ | |||
You can clearly see the effect of FPU sharing on this server for all the FFT libraries. Also, as on the Intel servers, the FFTs from the MKL libraries work better than any of the other libraries.
+ | |||
+ | <sup>1</sup> Default compiler Flags were used to build FFT.<br> | ||
+ | <sup>2</sup> CFLAGS=-O3, FFLAGS=-O3, -enable sse2 <br> | ||
+ | <sup>3</sup> enable-mpi CFLAGS=-O3, FFLAGS=-O3, -enable sse2 <br> | ||
+ | <sup>4</sup> CC='opencc -march=bdver1' F77='openf90 -march=bdver1' CFLAGS='-msse3 -msse4.1 -msse4.2 -msse4a -mfma4 -O2' FFLAGS='-msse3 -msse4.1 -msse4.2 -msse4a -mfma4 -O2' --enable-fma --enable-mpi <br> | ||
+ | <sup>5</sup> FFLAGS/ CFLAGS="-OPT:Ofast -mavx -mfma4 -march=bdver1 -O3 -fomit-frame-pointer -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4-malign-double -fstrict-aliasing -fno-schedule-insns -ffast-math"<br> | ||
+ | <sup>6</sup> ufhpc compiler options. FFTWDIR = /apps/fftw/3.3.2 | ||
+ | |||
+ | == Performance Comparison == | ||
+ | |||
Following is a summary of the results for the MgMOS test case run on the Intel and AMD servers with 8 processors.
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | !Server!!Native!!MKL!!FFTW | ||
+ | |- | ||
+ | |Intel ||158||97||118 | ||
+ | |- | ||
+ | |Intel (Scaled)||174||106||130 | ||
+ | |- | ||
+ | |AMD (Shared)||399||261||319 | ||
+ | |- | ||
+ | |AMD (Exclusive)||274||159||211 | ||
+ | |- | ||
+ | |AMD Shared/AMD Exc.||1.46||1.64||1.51 | ||
+ | |- | ||
+ | |AMD Exc./Intel (scaled)||1.57||1.50||1.62 | ||
+ | |- | ||
+ | |Notes||-||-||1 | ||
+ | |} | ||
+ | |||
+ | In the above table, "shared" represents cores that share resources such as L2-Cache and FPU, while "exclusive" stands for cores which do not share any resources. | ||
+ | |||
+ | In summary-- | ||
+ | |||
1. VASP runs roughly 1.5x faster on the Intel server than on the AMD server, even when the AMD run uses exclusive (non-shared) cores.

2. FPU sharing costs the AMD server a further factor of about 1.5-1.6x.

3. The FFTs from MKL give the fastest builds.
+ | |||
+ | |||
+ | <sup>1</sup> Compiled by UFHPC (Charles Taylor or Craig Prescott)<br> | ||
+ | |||
+ | = LAMMPS = | ||
+ | ==Scaling with Number of Processors== | ||
+ | LAMMPS is compiled with the following flags: | ||
+ | |||
+ | <source lang=make> | ||
+ | module load intel openmpi | ||
+ | CC = mpiCC | ||
+ | CCFLAGS = -O2 -xsse2 | ||
+ | FFT_INC = -I$(HPC_MKL_DIR)/include/fftw | ||
+ | FFT_PATH = | ||
+ | FFT_LIB = -L$(HPC_MKL_DIR)/lib/intel64 -lfftw3xc -lmkl_intel_lp64 -lmkl_sequential -lmkl_core | ||
+ | </source> | ||
+ | |||
The benchmark runs use the input files provided with the package (lj = atomic fluid, Lennard-Jones potential with a 2.5 sigma cutoff, about 55 neighbors per atom, NVE integration).
+ | Following table describes the variation of run time with number of processors on the Intel server: | ||
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | ! # processors!! Time(s) | ||
+ | |- | ||
+ | |8||158 | ||
+ | |- | ||
+ | |4||309 | ||
+ | |- | ||
+ | |1||1139 | ||
+ | |} | ||
We find essentially linear scaling with the number of processors on the Intel machine.
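For reference, a typical launch line for these runs (a sketch; the executable name and exact invocation are assumptions, not taken from the original scripts):
<source lang=make>
# in.lj is the Lennard-Jones input shipped in the LAMMPS bench/ directory
mpirun -np 8 ./lmp_openmpi -in in.lj
</source>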
+ | |||
+ | We also ran the "lj" benchmark on the AMD server (for comparison, we provide results on the Intel server as well): | ||
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | ! # processors!!lj<br>Time(s)!!chain<br>Time(s)!!eam<br>Time(s)!!rhodo<br>Time(s)!!Notes | ||
+ | |- | ||
+ | |16||180||84||476||2877||- | ||
+ | |- | ||
+ | |8||329||149||908||5506||- | ||
+ | |- | ||
+ | |4||547||248||1509||9398||- | ||
+ | |- | ||
+ | |1||1651||724||4708||-||- | ||
+ | |- | ||
+ | |Intel (8 proc)||158||67||396||2361||1 | ||
+ | |- | ||
+ | |Scaled <br>Intel (8 proc)||217||92||545||3246||2 | ||
+ | |- | ||
+ | |Scaled Intel(8 proc)/ <br>AMD (16 proc)||1.20||1.15||1.14||1.13||3 | ||
+ | |} | ||
+ | |||
For all the test cases, the clock-scaled Intel runs (8 threads) are slower than the AMD runs (16 threads) by about 15%.
+ | |||
<sup>1</sup> Intel: E5 2643 @ 3.3 GHz, AMD: Opteron 6378 @ 2.4 GHz<br>
+ | <sup>2</sup> Scaling was done by the factor of 3.3/2.4=1.375 <br> | ||
+ | <sup>3</sup> Comparison of 8 processors run on Intel vs 16 processors run on AMD <br> | ||
+ | |||
+ | == Comparison of Intel, Open64 and GNU Builds == | ||
+ | |||
+ | LAMMPS with Intel compiler: | ||
+ | |||
+ | <source lang=make> | ||
+ | module load intel openmpi | ||
+ | CC = mpiCC | ||
+ | CCFLAGS = -O2 -msse2 | ||
+ | FFT_INC = -I$(HPC_MKL_DIR)/include/fftw | ||
+ | FFT_PATH = | ||
+ | FFT_LIB = -L$(HPC_MKL_DIR)/lib/intel64 -lfftw3xc -lmkl_intel_lp64 -lmkl_sequential -lmkl_core | ||
+ | </source> | ||
+ | |||
+ | LAMMPS with open64 compiler: | ||
+ | <source lang=make> | ||
+ | module load open64/.4.5.2 openmpi | ||
+ | CC = mpiCC | ||
+ | CCFLAGS = -O2 -msse2 | ||
+ | MPI_DIR = /usr/mpi/open64/openmpi-1.6 | ||
+ | MPI_INC = -I$(MPI_DIR)/include | ||
+ | MPI_LIB = -L$(MPI_DIR)/lib64 -lmpi | ||
+ | MPI_PATH = | ||
+ | FFT_DIR = /home/manoj/FFTW/charlie/3.3.2 | ||
+ | FFT_INC = -I$(FFT_DIR)/include/fftw3 | ||
+ | FFT_PATH = | ||
+ | FFT_LIB = -L$(FFT_DIR)/lib -lfftw3 | ||
+ | </source> | ||
+ | |||
+ | LAMMPS with gnu compiler: | ||
+ | <source lang=make> | ||
+ | module load gcc/.4.7.2 openmpi | ||
+ | CC = g++ | ||
+ | CCFLAGS = -O2 -msse2 | ||
+ | MPI_DIR = /usr/mpi/gnu/openmpi-1.6 | ||
+ | MPI_INC = -I$(MPI_DIR)/include | ||
+ | MPI_LIB = -L$(MPI_DIR)/lib64 -lmpi -lmpi_cxx | ||
+ | MPI_PATH = | ||
+ | FFT_DIR = /home/manoj/FFTW/gnu/3.3.2 | ||
+ | FFT_INC = -I$(FFT_DIR)/include/fftw3 | ||
+ | FFT_PATH = | ||
+ | FFT_LIB = -L$(FFT_DIR)/lib -lfftw3 | ||
+ | </source> | ||
+ | |||
+ | For testing, we only ran "lj" benchmark and found: | ||
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | |- | ||
+ | !Compiler!!Intel<br>Time(s)!!Intel<br>Time(s)!!AMD<br>Time(s)!!AMD<br>Time(s) | ||
+ | |- | ||
+ | |Intel||158||151||329||321 | ||
+ | |- | ||
+ | |Open64||173||-||352||337 | ||
+ | |- | ||
+ | |GNU||152||145||341||320 | ||
+ | |- | ||
+ | |NOTES||1||2||1,3||2,3 | ||
+ | |} | ||
+ | |||
+ | <sup>1</sup> '''Basic Flags:'''<br>'''Intel:''' -O2 -msse2 <br> | ||
+ | '''Open64:''' -O2 -msse2 <br> | ||
+ | '''GNU:''' -O2 -msse2 <br> | ||
+ | |||
+ | <sup>2</sup> '''Fancy Flags:'''<br> | ||
+ | '''Intel:''' -O2 -mavx -unroll-aggresive -ipo -opt-prefetch -use-intel-optimized-headers <br> | ||
+ | '''Open64:''' CCFLAGS =-OPT:Ofast -mavx -mfma4 -march=bdver1 -O2 -fomit-frame-pointer -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4-malign-double -fstrict-aliasing -fno-schedule-insns -ffast-math <br> | ||
+ | '''GNU:''' CCFLAGS= -O2 -mavx -fsched-pressure -flto -funroll-all-loops -fprefetch-loop-arrays -minline-all-stringops -fno-tree-pre -ftree-vectorize | ||
+ | |||
<sup>3</sup> Runs on AMD servers are "naive": caches are shared and so are FPUs (floating point units)
+ | |||
+ | == Intel (2 x E5-2643 @ 3.30GHz) == | ||
+ | === Intel Compiler and SIMD Sets=== | ||
+ | We used Intel compiler as follows: | ||
+ | |||
+ | <source lang=make> | ||
+ | module load intel openmpi | ||
+ | CC = mpiCC | ||
+ | CCFLAGS = -O2 -xSSE2 | ||
+ | FFT_INC = -I$(HPC_MKL_DIR)/include/fftw | ||
+ | FFT_PATH = | ||
+ | FFT_LIB = -L$(HPC_MKL_DIR)/lib/intel64 -lfftw3xc -lmkl_intel_lp64 -lmkl_sequential -lmkl_core | ||
+ | </source> | ||
+ | |||
+ | Following table shows variation of Streaming SIMD Extension (SSE) sets(# threads=8): | ||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | |- | ||
+ | |Extension<br>set|| Time(s) | ||
+ | |- | ||
+ | |align="Center"|sse2||align="Center"|158 | ||
+ | |- | ||
+ | |align="Center"|sse3||align="Center"|157 | ||
+ | |- | ||
+ | |align="Center"|ssse3||align="Center"|157 | ||
+ | |- | ||
+ | |align="Center"|sse4.1||align="Center"|158 | ||
+ | |- | ||
+ | |align="Center"|sse4.2||align="Center"|157 | ||
+ | |- | ||
+ | |align="Center"|avx||align="Center"| 152 | ||
+ | |} | ||
+ | |||
+ | "avx" instruction set is slightly better than the other sets! | ||
+ | |||
The binaries above were built with the compiler's "-x" option, which on the AMD server does not work for any instruction set other than "-xsse2". As the next step we therefore built the binaries with the "-m" option and ran them on both the Intel and AMD servers to check that the same binary runs successfully on both. The following table shows the result for the "lj" benchmark:
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | !align="Center"|SIMD <br>Instruction!!Intel<br>Time(s)!!AMD<br>Time(s) | ||
+ | |- | ||
+ | |align="Center"|sse2||align="Center"|158||align="Center"|329 | ||
+ | |- | ||
+ | |align="Center"|sse3||align="Center"|157||align="Center"|329 | ||
+ | |- | ||
+ | |align="Center"|ssse3||align="Center"|157||align="Center"|329 | ||
+ | |- | ||
+ | |align="Center"|sse4.1||align="Center"|158||align="Center"|330 | ||
+ | |- | ||
+ | |align="Center"|sse4.2||align="Center"|157||align="Center"|329 | ||
+ | |- | ||
+ | |align="Center"|avx|| align="Center"|152||align="Center"|319 | ||
+ | |- | ||
+ | |align="Center"|Notes||align="Center"|-||align="Center"|1 | ||
+ | |} | ||
+ | |||
<sup>1</sup> Runs on AMD servers are "naive": caches are shared and so are FPUs (floating point units).<br>
+ | |||
Clearly, on both the Intel and the AMD server, the "avx" instructions are the better choice for the "lj" benchmark. We also ran the other benchmarks across the SIMD sets:
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | |- | ||
+ | !SIMD <br>Instruction!!colspan="2"|chain!!colspan="2"|eam!!colspan="2"|rhodo | ||
+ | |- | ||
+ | !-!!Intel!!AMD!!Intel!!AMD!!Intel!!AMD | ||
+ | |- | ||
+ | |sse2||67||149||396||908||2361||5506 | ||
+ | |- | ||
+ | |sse3||67||149||398||908||2355||5486 | ||
+ | |- | ||
+ | |ssse3||66||149||399||907||2359||5485 | ||
+ | |- | ||
+ | |sse4.1||68||148||395||908||2351||5420 | ||
+ | |- | ||
+ | |sse4.2||66||148||396||909||2346||5479 | ||
+ | |- | ||
+ | |avx||65||145||387||897||2290||5360 | ||
+ | |- | ||
+ | |Notes||-||1||-||1||-||1 | ||
+ | |} | ||
+ | |||
+ | For all the benchmarks, "avx" seems to be a better choice compared to other instruction sets. | ||
+ | |||
<sup>1</sup> Runs on AMD servers are "naive": caches are shared and so are FPUs (floating point units).
+ | |||
+ | == MKL vs FFTW FFTs == | ||
We profiled the code to see where it spends most of its time. Below is a summary of the time spent in the FFT routines for each benchmark we tried.
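The flat profiles below are in gprof format; a minimal sketch of how such a profile is gathered (assuming the binary was rebuilt with -pg; with MPI, each rank writes its own gmon.out in its working directory and may overwrite the others):
<source lang=make>
# rebuild with -pg added to CCFLAGS and the link line, run a benchmark, then:
mpirun -np 8 ./lmp_openmpi -in in.lj
gprof ./lmp_openmpi gmon.out
</source>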
+ | |||
+ | '''lj:''' | ||
+ | |||
+ | time seconds seconds calls s/call s/call name | ||
+ | 0.00 185.61 0.00 1 0.00 0.00 LAMMPS_NS::FFT3d::timing1d(double*, int, int) | ||
+ | 0.00 185.61 0.00 1 0.00 0.00 fft_1d_only | ||
+ | |||
+ | '''chain:''' | ||
+ | |||
+ | time seconds seconds calls s/call s/call name | ||
+ | 0.00 70.62 0.00 1 0.00 0.00 LAMMPS_NS::FFT3d::timing1d(double*, int, int) | ||
+ | 0.00 70.62 0.00 1 0.00 0.00 fft_1d_only | ||
+ | |||
+ | '''eam:''' | ||
+ | |||
+ | time seconds seconds calls s/call s/call name | ||
+ | 0.00 381.72 0.00 1 0.00 0.00 LAMMPS_NS::FFT3d::timing1d(double*, int, int) | ||
+ | 0.00 381.72 0.00 1 0.00 0.00 fft_1d_only | ||
+ | |||
+ | '''rhodo:''' | ||
+ | |||
+ | time seconds seconds calls s/call s/call name | ||
+ | 0.81 118.18 1.04 10423040 0.00 0.00 kf_work(FFT_DATA*, FFT_DATA const*, unsigned long, int, int*, kiss_fft_state*) | ||
+ | 0.72 119.10 0.92 31269120 0.00 0.00 kf_bfly4(FFT_DATA*, unsigned long, kiss_fft_state*, unsigned long) | ||
+ | |||
We can clearly see that the code does not spend any significant time in the FFT routines for any of the benchmarks. So changing the FFT from MKL to FFTW should not change the performance at all. As a check, we built LAMMPS with the FFTW FFTs using:
+ | <source lang=make> | ||
+ | module load intel openmpi fftw | ||
+ | CC = mpiCC | ||
+ | CCFLAGS = -O2 -mavx | ||
+ | FFT_DIR = /apps/fftw/3.3.2 | ||
+ | FFT_INC = -I$(FFT_DIR)/include/fftw3 | ||
+ | FFT_PATH = | ||
+ | FFT_LIB = -L$(FFT_DIR)/lib -lfftw3 | ||
+ | </source> | ||
+ | |||
+ | For the test, we ran "lj" benchmark on the Intel server and found: | ||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | |- | ||
+ | !FFT!!Time(s) | ||
+ | |- | ||
+ | |MKL||152 | ||
+ | |- | ||
+ | |FFTW||152 | ||
+ | |} | ||
+ | |||
+ | As expected, there is no difference between the FFTs from FFTW or MKL on the performance of LAMMPS. | ||
+ | |||
+ | == AMD (4 x 6378 @ 2.4 GHz) == | ||
+ | |||
In this section, we describe the effect of the AMD server's shared FPUs and caches on the performance of LAMMPS.
+ | The results are summarized in the following table (# threads=8): | ||
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | !Run Scheme!! lj!!chain!!eam!!rhodo | ||
+ | |- | ||
+ | |Shared<br> time(s)||319||145||897||5360 | ||
+ | |- | ||
+ | |Exclusive<br>time (s)||277||126||778||4426 | ||
+ | |} | ||
+ | |||
+ | In the above table, "shared" represents cores that share resources such as L2-Cache and FPU, while "exclusive" stands for cores which do not share any resources. | ||
+ | |||
+ | == Performance Comparison == | ||
+ | |||
+ | Following is a table for performance comparison of Intel and AMD servers when the job was run using 8 threads. | ||
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | !Server!!lj!!chain!!eam!!rhodo | ||
+ | |- | ||
+ | |Intel ||158||65||387||2290 | ||
+ | |- | ||
+ | |Intel (Scaled)||217||89||532||3149 | ||
+ | |- | ||
+ | |AMD (Shared)||319||145||897||5360 | ||
+ | |- | ||
+ | |AMD (Exclusive)||277||126||778||4426 | ||
+ | |- | ||
+ | |AMD Shared/AMD Exc.||1.15||1.15||1.15||1.21 | ||
+ | |- | ||
+ | |AMD Exc./Intel (scaled)||1.28||1.42||1.46||1.41 | ||
+ | |} | ||
+ | |||
+ | In the above table, "shared" represents cores that share resources such as L2-Cache and FPU, while "exclusive" stands for cores which do not share any resources. | ||
+ | |||
+ | In summary-- | ||
+ | |||
1. LAMMPS runs roughly 1.3-1.5x faster on the Intel server than on the AMD server, even when the AMD run uses exclusive (non-shared) cores.

2. FPU sharing costs the AMD server a further factor of about 1.15-1.2x.
+ | |||
+ | =GROMACS= | ||
+ | |||
+ | == Comparison of Intel, Open64 and GNU Builds == | ||
+ | |||
+ | Intel compiler: | ||
+ | |||
+ | <source lang=make> | ||
+ | module load intel openmpi | ||
+ | export F77=mpif77 | ||
+ | export F90=mpif90 | ||
+ | export CC=mpicc | ||
+ | export CFLAGS="-O2 -msse2" | ||
+ | export FFLAGS="-O2 -msse2" | ||
./configure --prefix=/home/manoj/profile/gromacs/gromacs-4.5.5 --enable-shared=yes --enable-mpi --without-x --disable-float --with-fft=mkl LIBS="-L/opt/intel/composerxe/lib/intel64 -lfftw3xc -lmkl_intel_lp64 -lmkl_sequential -lmkl_core"
+ | make | ||
+ | make install | ||
+ | </source> | ||
+ | |||
+ | Open64 compiler: | ||
+ | <source lang=make> | ||
+ | module load open64/.4.5.2 openmpi | ||
+ | export F77=openf90 | ||
+ | export F90=openf90 | ||
+ | export CC=opencc | ||
+ | export CFLAGS="-O2 -msse2" | ||
+ | export FFLAGS="-O2 -msse2" | ||
+ | export CPPFLAGS="-I/home/manoj/FFTW/fpic-charlie/3.3.2/include" | ||
+ | export LDFLAGS="-L/home/manoj/FFTW/fpic-charlie/3.3.2/lib" | ||
+ | ./configure --prefix=/home/manoj/profile/gromacs/gromacs-4.5.5 --enable-shared=yes --enable-mpi --without-x --disable-float | ||
+ | make | ||
+ | make install | ||
+ | </source> | ||
+ | |||
+ | GNU compiler: | ||
+ | |||
+ | <source lang=make> | ||
+ | module load gcc/.4.7.2 openmpi | ||
+ | export F77=gfortran | ||
+ | export F90=f95 | ||
+ | export CC=gcc | ||
+ | export CFLAGS="-O2 -msse2" | ||
+ | export FFLAGS="-O2 -msse2" | ||
+ | export CPPFLAGS="-I/home/manoj/FFTW/gnu/3.3.2/include" | ||
+ | export LDFLAGS="-L/home/manoj/FFTW/gnu/3.3.2/lib" | ||
+ | ./configure --prefix=/home/manoj/profile/gromacs/gromacs-4.5.5 --enable-shared=yes --enable-mpi --without-x --disable-float | ||
+ | make | ||
+ | make install | ||
+ | </source> | ||
+ | |||
There are some test cases in the "gromacs-4.5.5/share/tutor" directory; however, not all of them work. So far, only "water",
"methane", and "mixed" could be made to work. Instructions for running the MD simulations are on the http://manual.gromacs.org/online/water.html page.
You first need to create a ".tpr" file using
+ | <source lang=make> | ||
+ | ./grompp_d -v | ||
+ | </source> | ||
+ | After this, you can run "mdrun_d" for the molecular dynamics simulation. | ||
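A minimal sketch of the two-step run (the _d suffix comes from the double-precision --disable-float build above; the MPI launch line is an assumption, not taken from the original scripts):
<source lang=make>
./grompp_d -v                            # writes topol.tpr from grompp.mdp and the coordinates
mpirun -np 8 ./mdrun_d -v -s topol.tpr   # run the MD simulation on 8 processes
</source>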
+ | |||
There is a mistake in the "grompp.mdp" file provided by GROMACS: the line starting with "bd-temp" has to be commented out. Some internet searching revealed that this "grompp.mdp" is an input file for an older version of GROMACS, and some of its parameters have since become obsolete.
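For reference, the fix is simply to turn that line into a comment (";" starts a comment in .mdp files); the value shown here is illustrative, not the one from the distributed file:
<source lang=make>
; bd-temp   = 300    ; obsolete parameter in GROMACS 4.5.x, must be commented out
</source>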
+ | |||
The following results are for the MD simulation of water using 8 processors on the Intel (E5-2643) and AMD (Opteron 6378) servers:
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | |- | ||
+ | !Compiler!!Intel<br>Time(s)!!Intel<br>Time(s)!!AMD<br>Time(s)!!AMD<br>Time(s) | ||
+ | |- | ||
+ | |Intel||157||157||361||363 | ||
+ | |- | ||
+ | |Open64||167||-||392||383 | ||
+ | |- | ||
+ | |GNU||160||-||377||368 | ||
+ | |- | ||
+ | |NOTES||1||2||1,3||2,3 | ||
+ | |} | ||
+ | |||
+ | <sup>1</sup> '''Basic Flags:'''<br>'''Intel:''' -O2 -msse2 <br> | ||
+ | '''Open64:''' -O2 -msse2 <br> | ||
+ | '''GNU:''' -O2 -msse2 <br> | ||
+ | |||
+ | <sup>2</sup> '''Fancy Flags:'''<br> | ||
+ | '''Intel:''' -O2 -mavx -unroll-aggresive -opt-prefetch -use-intel-optimized-headers <br> | ||
+ | '''Open64:''' CCFLAGS =-OPT:Ofast -mavx -mfma4 -march=bdver1 -O2 -fomit-frame-pointer -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4-malign-double -fstrict-aliasing -fno-schedule-insns <br> | ||
+ | '''GNU:''' CCFLAGS= -O2 -mavx -fsched-pressure -flto -funroll-all-loops -fprefetch-loop-arrays -minline-all-stringops -fno-tree-pre -ftree-vectorize | ||
+ | |||
<sup>3</sup> Runs on AMD servers are "naive": caches are shared and so are FPUs (floating point units)
+ | |||
+ | ==Scaling with Number of Processors== | ||
+ | |||
+ | "Water" benchmark was run using gromacs compiled with the intel and openmpi (fancy flags) as shown on above section. | ||
+ | |||
+ | Following table describes the variation of run time with number of processors on the Intel (E5-2643) server: | ||
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | ! # processors!! Time(s)!! Factor | ||
+ | |- | ||
+ | |1||950||1.00 | ||
+ | |- | ||
+ | |4||252||3.76 | ||
+ | |- | ||
+ | |8||157||6.05 | ||
+ | |} | ||
Scaling with the number of processors on the Intel server is close to linear up to 4 processes and falls off somewhat at 8 (a factor of 6.05).
+ | |||
We also ran the same benchmark on the AMD (Opteron 6378) server (for comparison, we provide results on the Intel server as well):
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | ! # processors!!water<br>Time(s)!!Notes | ||
+ | |- | ||
+ | |16||241||- | ||
+ | |- | ||
+ | |8||363||- | ||
+ | |- | ||
+ | |4||532||- | ||
+ | |- | ||
+ | |1||1288||- | ||
+ | |- | ||
+ | |Intel (8 proc)||157||1 | ||
+ | |- | ||
+ | |Scaled <br>Intel (8 proc)||216||2 | ||
+ | |- | ||
+ | |AMD (16 proc)/ <br>Scaled Intel(8 proc)||1.12||3 | ||
+ | |} | ||
+ | |||
<sup>1</sup> Intel: E5 2643 @ 3.3 GHz, AMD: Opteron 6378 @ 2.4 GHz<br>
+ | <sup>2</sup> Scaling was done by the factor of 3.3/2.4=1.375 <br> | ||
+ | <sup>3</sup> Comparison of 8 processors run on Intel vs 16 processors run on AMD <br> | ||
+ | |||
+ | ==Instruction Set Dependence== | ||
+ | |||
+ | GROMACS is compiled with the following flags: | ||
+ | |||
+ | <source lang=make> | ||
+ | module load intel openmpi mkl | ||
+ | export F77=mpif77 | ||
+ | export F90=mpif90 | ||
+ | export CC=mpicc | ||
+ | export CFLAGS="-O2 -msse2 -unroll-aggresive -opt-prefetch -use-intel-optimized-headers" | ||
+ | export FFLAGS="-O2 -msse2 -unroll-aggresive -opt-prefetch -use-intel-optimized-headers" | ||
./configure --prefix=/home/manoj/profile/gromacs/gromacs-4.5.5 --enable-shared=yes --enable-mpi --without-x --disable-float --with-fft=mkl LIBS="-L/opt/intel/composerxe/lib/intel64 -lfftw3xc -lmkl_intel_lp64 -lmkl_sequential -lmkl_core"
+ | make | ||
+ | make install | ||
+ | </source> | ||
+ | |||
The following table captures the dependence on instruction set for the Intel (E5-2643) and AMD (Opteron 6378) machines with 8 processes:
+ | |||
{| border=3 align="Center" style="text-align: center;"
+ | |- | ||
+ | !SIMD <br>Instruction!!Intel!!AMD | ||
+ | |- | ||
+ | |sse2||158||364 | ||
+ | |- | ||
+ | |sse3||158||362 | ||
+ | |- | ||
+ | |ssse3||157||362 | ||
+ | |- | ||
+ | |sse4.1||159||360 | ||
+ | |- | ||
+ | |sse4.2||157||362 | ||
+ | |- | ||
+ | |avx||157||363 | ||
+ | |- | ||
+ | |Notes||-||1 | ||
+ | |} | ||
+ | |||
+ | GROMACS seems to be instruction set independent.<br> | ||
<sup>1</sup> Runs on AMD servers are "naive": caches are shared and so are FPUs (floating point units)
+ | |||
+ | ==MKL vs FFTW== | ||
+ | |||
There is a problem with profiling this code: we do not see as many subroutines as we would like. The Intel compiler output is still usable, but the GNU compiler is worse: it shows only one subroutine. We do not see any FFT routine in the test case we are using.
+ | |||
+ | '''Intel Compiler''' | ||
+ | |||
+ | time seconds seconds calls s/call s/call name | ||
+ | 68.42 0.13 0.13 6 21.67 28.33 do_md | ||
+ | 15.79 0.16 0.03 2160054 0.00 0.00 copy_rvec | ||
+ | 5.26 0.17 0.01 600018 0.00 0.00 clear_mat | ||
+ | 5.26 0.18 0.01 _intel_fast_memcpy | ||
+ | 5.26 0.19 0.01 _intel_fast_memcpy.P | ||
+ | 0.00 0.19 0.00 720018 0.00 0.00 copy_mat | ||
+ | 0.00 0.19 0.00 6 0.00 28.33 mdrunner | ||
+ | 0.00 0.19 0.00 3 0.00 0.00 copy_rvec | ||
+ | 0.00 0.19 0.00 1 0.00 0.00 copy_mat | ||
+ | 0.00 0.19 0.00 1 0.00 0.00 get_nthreads | ||
+ | 0.00 0.19 0.00 1 0.00 0.00 mdrunner_start_threads | ||
+ | |||
+ | |||
+ | |||
+ | '''GNU Compiler''' | ||
+ | |||
+ | time seconds seconds calls s/call s/call name | ||
+ | 100.01 0.02 0.02 1 20.00 20.00 do_md | ||
+ | |||
+ | |||
+ | To see the FFT dependence, we built GROMACS with FFTW FFTs using: | ||
+ | <source lang=make> | ||
+ | module load intel openmpi fftw | ||
+ | export F77=mpif77 | ||
+ | export F90=mpif90 | ||
+ | export CC=mpicc | ||
+ | export CFLAGS="-O2 -mavx -unroll-aggresive -opt-prefetch -use-intel-optimized-headers" | ||
+ | export FFLAGS="-O2 -mavx -unroll-aggresive -opt-prefetch -use-intel-optimized-headers" | ||
+ | #export CFLAGS="-O2 -msse2" | ||
+ | #export FFLAGS="-O2 -msse2" | ||
+ | export CPPFLAGS="-I/apps/fftw/3.3.2/include" | ||
+ | export LDFLAGS="-L/apps/fftw/3.3.2/lib -lfftw3" | ||
+ | ./configure --prefix=/home/manoj/profile/gromacs/gromacs-4.5.5 --enable-shared=yes --enable-mpi --without-x --disable-float | ||
+ | make | ||
+ | make install | ||
+ | </source> | ||
+ | |||
+ | We ran "water" benchmark on the Intel(E5-2643) and AMD (Opetran-6378) servers using 8 processors and found: | ||
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | |- | ||
+ | !FFT!!Intel<br>Time(s)!!AMD<br>Time(s) | ||
+ | |- | ||
+ | |MKL||157||363 | ||
+ | |- | ||
+ | |FFTW||157||360 | ||
+ | |} | ||
+ | |||
+ | There seems to be no difference between the FFTs from FFTW or MKL on the performance of GROMACS. | ||
+ | |||
+ | == Shared vs Exclusive run on AMD servers (4 x 6378 @ 2.4 GHz) == | ||
+ | |||
We describe the effect of the AMD server's shared FPU and L2 cache.
+ | The results are summarized in the following table (# processes=8): | ||
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | !Run Scheme!! Time(s) | ||
+ | |- | ||
+ | |Shared<br> time(s)||363 | ||
+ | |- | ||
+ | |Exclusive<br>time (s)||266 | ||
+ | |} | ||
+ | |||
+ | In the above table, "shared" represents cores that share resources such as L2-Cache and FPU, while "exclusive" stands for cores which do not share any resources. | ||
+ | |||
+ | == Performance Comparison == | ||
+ | |||
+ | Following is a table for performance comparison of Intel and AMD servers when the job was run using 8 processors: | ||
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | !Server!!Time(s)!!Notes | ||
+ | |- | ||
+ | |Intel ||157||1 | ||
+ | |- | ||
+ | |Intel (Scaled)||216||2 | ||
+ | |- | ||
+ | |AMD (Shared)||363||3 | ||
+ | |- | ||
+ | |AMD (Exclusive)||266||4 | ||
+ | |- | ||
+ | |AMD Shared/AMD Exc.||1.36||- | ||
+ | |- | ||
+ | |AMD Exc./Intel (scaled)||1.23||- | ||
+ | |} | ||
+ | |||
<sup>1</sup> Intel: E5 2643 @ 3.3 GHz, AMD: Opteron 6378 @ 2.4 GHz<br>
+ | <sup>2</sup> Scaling was done by the factor of 3.3/2.4=1.375 <br> | ||
+ | <sup>3</sup> "Shared" represents cores that share resources such as L2-Cache and FPU <br> | ||
+ | <sup>4</sup> "Exclusive" represents cores that don't share resources such as L2-Cache and FPU | ||
+ | |||
+ | =DL_POLY= | ||
+ | |||
+ | == Comparison of Intel, Open64 and GNU Builds == | ||
+ | |||
+ | '''Intel compiler:''' | ||
+ | |||
+ | <source lang=make> | ||
+ | module load intel openmpi | ||
+ | $(MAKE) LD="mpif90 -v -o " \ | ||
+ | LDFLAGS="-shared-intel" \ | ||
+ | FC="mpif90 -c" \ | ||
+ | FCFLAGS="-O3 -mavx -opt-prefetch -use-intel-optimized-headers" \ | ||
+ | EX=$(EX) BINROOT=$(BINROOT) $(TYPE) | ||
+ | </source> | ||
+ | |||
+ | '''GNU compiler:''' | ||
+ | |||
+ | <source lang=make> | ||
+ | module load gcc/.4.7.2 openmpi | ||
+ | $(MAKE) LD="mpif90 -v -o " \ | ||
+ | LDFLAGS=" " \ | ||
+ | FC="mpif90 -c" \ | ||
+ | FCFLAGS="-O3 -mavx -fsched-pressure -flto -funroll-all-loops -fprefetch-loop-arrays -minline-all-stringops -fno-tree- | ||
+ | pre -ftree-vectorize" \ | ||
+ | EX=$(EX) BINROOT=$(BINROOT) $(TYPE) | ||
+ | </source> | ||
+ | |||
+ | |||
+ | '''Open64 compiler:''' | ||
+ | |||
Open64 has a problem with a subroutine in config_module.f90, at line 62. This subroutine resizes the length of an array. The line
+ | |||
+ | <source lang=make> | ||
+ | Character( Len = * ), Allocatable, Intent( InOut ) :: a(:) | ||
+ | </source> | ||
+ | |||
+ | makes "a" an allocatable array of strings, but the length of string is not defined. Intel and GNU compiler can handle this, however the open64 compiler can not. We made one assignment modification at this line (Interested readers should look up the code) and made the open64 compiler work. | ||
+ | <source lang=make> | ||
+ | module load open64/.4.5.2 openmpi | ||
+ | $(MAKE) LD="mpif90 -v -o " \ | ||
+ | LDFLAGS=" " \ | ||
+ | FC="mpif90 -c" \ | ||
+ | FCFLAGS="-OPT:Ofast -mavx -mfma4 -march=bdver1 -O2 -fomit-frame-pointer -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4-malign-double -fstrict-aliasing -fno-schedule-insns -ffast-math " \ | ||
+ | EX=$(EX) BINROOT=$(BINROOT) $(TYPE) | ||
+ | </source> | ||
+ | |||
We downloaded 42 benchmarks from the DL_POLY website. Since this is a large number, we looked at the profiling results and found that only about 15 cases actually exercise different code paths inside the DL_POLY package; our results are based on those.
+ | |||
+ | Following results are for the various benchmarks using '''8 processors on Intel (E5-2643)''' servers: | ||
+ | {| | ||
+ | | STYLE="top"| | ||
{| border=3 align="Center" style="text-align: center;"
+ | |- | ||
+ | |+'''Wall Time''' | ||
+ | |- | ||
+ | !Compiler!!Intel-Basic<br>Time(s)!!Intel-Fancy<br>Time(s)!!GNU-Basic<br>Time(s)!!GNU-Fancy<br>Time(s)!!Open64-Basic<br>Time(s) | ||
+ | |- | ||
+ | |Test1||66||62||71||83||75 | ||
+ | |- | ||
+ | |Test3||69||65||76||75||79 | ||
+ | |- | ||
+ | |Test4||74||87||77||78||87 | ||
+ | |- | ||
+ | |Test5||70||65||81||75||87 | ||
+ | |- | ||
+ | |Test7||95||88||105||99||106 | ||
+ | |- | ||
+ | |Test9||67||57||85||69||117 | ||
+ | |- | ||
+ | |Test11||72||61||85||83||91 | ||
+ | |- | ||
+ | |Test13||63||58||65||62||72 | ||
+ | |- | ||
+ | |Test14||106||114||113||136||91 | ||
+ | |- | ||
+ | |Test17||67||65||74||72||77 | ||
+ | |- | ||
+ | |Test18||102||89||104||99||115 | ||
+ | |- | ||
+ | |Test27||136||131||154||-||158 | ||
+ | |- | ||
+ | |Test31||114||96||121||121||134 | ||
+ | |- | ||
+ | |Test35||80||81||90||90||93 | ||
+ | |} | ||
+ | | STYLE="top"| | ||
{| border=3 align="Center" style="text-align: center;"
+ | |- | ||
+ | |+'''CPU Time''' | ||
+ | |- | ||
+ | !Compiler!!Intel-Basic<br>Time(s)!!Intel-Fancy<br>Time(s)!!GNU-Basic<br>Time(s)!!GNU-Fancy<br>Time(s)!!Open64-Basic<br>Time(s) | ||
+ | |- | ||
+ | |Test1||65||60||-||-||74 | ||
+ | |- | ||
+ | |Test3||68||64||-||-||79 | ||
+ | |- | ||
+ | |Test4||70||64||-||-||79 | ||
+ | |- | ||
+ | |Test5||61||59||-||-||86 | ||
+ | |- | ||
+ | |Test7||88||84||-||-||104 | ||
+ | |- | ||
+ | |Test9||59||57||-||-||116 | ||
+ | |- | ||
+ | |Test11||72||60||-||-||90 | ||
+ | |- | ||
+ | |Test13||57||56||-||-||67 | ||
+ | |-style="color: red;" | ||
+ | |'''Test14'''||'''39'''||'''42'''||-||-||'''43''' | ||
+ | |- | ||
+ | |Test17||64||63||-||-||75 | ||
+ | |- | ||
+ | |Test18||94||88||-||-||112 | ||
+ | |- | ||
+ | |Test27||136||116||-||-||157 | ||
+ | |- | ||
+ | |Test31||113||94||-||-||134 | ||
+ | |- | ||
+ | |Test35||80||80||-||-||92 | ||
+ | |} | ||
+ | |} | ||
+ | |||
From the table above, we conclude that the Intel build with fancy flags is the better choice for this code.
+ | |||
We also compared the Intel and Open64 builds on the AMD server; the following table shows that comparison:
+ | {| | ||
+ | | STYLE="top"| | ||
{| border=3 align="Center" style="text-align: center;"
|+'''Wall Time'''
|-
!Compiler!!Intel-Fancy<br>Time(s)!!Open64-Basic<br>Time(s)!!Open64-Fancy<br>Time(s)
|-
|Test1||95||91||90
|-
|Test3||221||235||219
+ | |- | ||
+ | |Test4||213||225||211 | ||
+ | |- | ||
+ | |Test5||142||173||161 | ||
+ | |- | ||
+ | |Test7||300||310||295 | ||
+ | |- | ||
+ | |Test9||144||219||209 | ||
+ | |- | ||
+ | |Test11||158||202||183 | ||
+ | |- | ||
+ | |Test13||247||262||245 | ||
+ | |- | ||
+ | |Test14||186||165||149 | ||
+ | |- | ||
+ | |Test17||190||201||189 | ||
+ | |- | ||
+ | |Test18||356||353||385 | ||
+ | |- | ||
+ | |Test27||323||377||334 | ||
+ | |- | ||
+ | |Test31||269||313||290 | ||
+ | |- | ||
+ | |Test35||228||251||240 | ||
+ | |} | ||
+ | | STYLE="top"| | ||
{| border=3 align="Center" style="text-align: center;"
+ | |- | ||
+ | |+'''CPU Time''' | ||
+ | |- | ||
+ | !Compiler!!Intel-Fancy<br>Time(s)!!Open64-Basic<br>Time(s)!!Open64-Fancy<br>Time(s) | ||
+ | |- | ||
+ | |Test1||91||90||90 | ||
+ | |- | ||
+ | |Test3||219||233||218 | ||
+ | |- | ||
+ | |Test4||204||218||203 | ||
+ | |- | ||
+ | |Test5||141||171||160 | ||
+ | |- | ||
+ | |Test7||298||309||293 | ||
+ | |- | ||
+ | |Test9||137||219||207 | ||
+ | |- | ||
+ | |Test11||157||202||183 | ||
+ | |- | ||
+ | |Test13||245||259||242 | ||
+ | |-style="color: red;" | ||
+ | |'''Test14'''||'''149'''||'''147'''||'''143''' | ||
+ | |- | ||
+ | |Test17||189||201||189 | ||
+ | |- | ||
+ | |Test18||348||350||*383 | ||
+ | |- | ||
+ | |Test27||323||370||332 | ||
+ | |- | ||
+ | |Test31||268||312||290 | ||
+ | |- | ||
+ | |Test35||227||251||239 | ||
+ | |} | ||
+ | |} | ||
+ | |||
+ | |||
+ | |||
+ | <sup>1</sup> '''Basic Flags:'''<br>'''Intel:''' -O2 -msse2 <br> | ||
+ | '''Open64:''' -O2 -msse2 <br> | ||
+ | '''GNU:''' -O2 -msse2 <br> | ||
+ | |||
+ | <sup>2</sup> '''Fancy Flags:'''<br> | ||
+ | '''Intel:''' -O2 -mavx -opt-prefetch -use-intel-optimized-headers <br> | ||
+ | '''GNU:''' CCFLAGS= -O2 -mavx -fsched-pressure -flto -funroll-all-loops -fprefetch-loop-arrays -minline-all-stringops -fno-tree-pre -ftree-vectorize <br> | ||
+ | '''Open64''' -OPT:Ofast -mavx -mfma4 -march=bdver1 -O2 -fomit-frame-pointer -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4-malign-double -fstrict-aliasing -fno-schedule-insns -ffast-math | ||
+ | |||
+ | == More studies on Test 14 == | ||
+ | |||
+ | '''intel compiler:''' | ||
+ | |||
+ | Intel server: CPU time = 39s, Wall Time= 49s | ||
+ | |||
+ | % cumulative self self total | ||
+ | time seconds seconds calls s/call s/call name | ||
+ | 13.56 4.85 4.85 182 0.03 0.03 constraints_shake_vv_ | ||
+ | 9.09 8.10 3.25 14 0.23 0.27 link_cell_pairs_ | ||
+ | 7.94 10.94 2.84 78 0.04 0.04 deport_atomic_data_ | ||
+ | 5.65 12.96 2.02 __intel_ssse3_rep_memcpy | ||
+ | 5.56 14.95 1.99 14 0.14 0.14 ewald_spme_forces_IP_spme_forces_ | ||
+ | 5.29 16.84 1.89 14 0.14 0.63 ewald_spme_forces_ | ||
+ | |||
+ | AMD server: CPU Time=151s Wall Time =190s | ||
+ | |||
+ | % cumulative self self total | ||
+ | time seconds seconds calls s/call s/call name | ||
+ | 19.85 23.87 23.87 91 0.26 0.27 constraints_shake_vv_ | ||
+ | 12.33 38.70 14.83 78 0.19 0.19 deport_atomic_data_ | ||
+ | 6.82 46.90 8.20 13 0.63 0.64 constraints_rattle_ | ||
+ | 6.06 54.19 7.29 14 0.52 0.60 link_cell_pairs_ | ||
+ | 5.61 60.94 6.75 3836 0.00 0.00 gpfa_module_mp_gpfa3f_ | ||
+ | 5.32 67.34 6.40 14 0.46 0.46 bspgen_ | ||
+ | 5.15 73.53 6.19 52 0.12 0.77 npt_b0_vv_ | ||
+ | |||
+ | ''' Open64 Compiler ''' | ||
+ | |||
+ | Intel server: CPU Time= 45s, Wall Time=290s | ||
+ | |||
+ | % cumulative self self total | ||
+ | time seconds seconds calls s/call s/call name | ||
+ | 15.64 5.64 5.64 182 0.03 0.03 constraints_shake_vv__ | ||
+ | 12.06 9.99 4.35 14 0.31 0.69 ewald_spme_forces__ | ||
+ | 11.37 14.09 4.10 14 0.29 0.34 link_cell_pairs__ | ||
+ | 9.45 17.50 3.41 78 0.04 0.04 deport_atomic_data__ | ||
+ | 5.52 19.49 1.99 3378329 0.00 0.00 images_ | ||
+ | 5.16 21.35 1.86 3808 0.00 0.00 GPFA2F.in.GPFA_MODULE | ||
+ | 5.16 23.21 1.86 26 0.07 0.07 constraints_rattle__ | ||
+ | |||
+ | AMD server: CPU Time= 156s, Wall Time=164s | ||
+ | |||
+ | % cumulative self self total | ||
+ | time seconds seconds calls s/call s/call name | ||
+ | 18.38 22.58 22.58 91 0.25 0.25 constraints_shake_vv__ | ||
+ | 13.07 38.64 16.06 78 0.21 0.21 deport_atomic_data__ | ||
+ | 8.29 48.82 10.18 14 0.73 2.73 ewald_spme_forces__ | ||
+ | 7.54 58.09 9.27 14 0.66 0.76 link_cell_pairs__ | ||
+ | 5.99 65.45 7.36 13 0.57 0.58 constraints_rattle__ | ||
+ | 5.86 72.65 7.20 3836 0.00 0.00 GPFA3F.in.GPFA_MODULE | ||
+ | 5.34 79.21 6.56 14 0.47 0.47 bspgen_ | ||
+ | |||
+ | ==Scaling with Number of Processors== | ||
+ | |||
From the data in the section above, we conclude that the Intel build with fancy flags is the better choice for this code, so further runs use only the Intel build.
+ | |||
+ | Following table describes the variation of run time with number of processors on the '''Intel (E5-2643)''' server: | ||
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | !#proc!!Test1!!Test3!!Test4!!Test5!!Test7!!Test9!!Test11!!Test13!!Test14!!Test17!!Test18!!Test27!!Test31!!Test35 | ||
+ | |- | ||
+ | |1<br>(Factor)||91<br>(1.00)||389<br>(1.00)||541<br>(1.00)||511<br>(1.00)||502<br>(1.00)||755<br>(1.00)||397<br>(1.00)||292<br>(1.00)||255<br>(1.00)||302<br>(1.00)||445<br>(1.00)||651<br>(1.00)||612<br>(1.00)||- | ||
+ | |- | ||
+ | |4<br>(Factor)||90<br>(1.01)||122<br>(3.18)||160<br>(3.38)||119<br>(4.29)||173<br>(2.90)||119<br>(6.34)||119<br>(3.34)||116<br>(2.52)||147<br>(1.73)||105<br>(2.87)||147<br>(3.03)||229<br>(2.84)||184<br>(3.33)||136<br>(-) | ||
+ | |- | ||
+ | |8<br>(Factor)||62<br>(1.47)||65<br>(5.98)||87<br>(6.22)||65<br>(7.86)||88<br>(5.70)||57<br>(13.2)||61<br>(6.51)||58<br>(5.03)||114<br>(2.23)||65<br>(4.65)||89<br>(5.00)||131<br>(4.96)||96<br>(6.38)||81<br>(-) | ||
+ | |} | ||
+ | |||
The following table shows the same results expressed as scaling factors relative to the single-processor run; it is derived from the table above.
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | !#proc!!Test1!!Test3!!Test4!!Test5!!Test7!!Test9!!Test11!!Test13!!Test14!!Test17!!Test18!!Test27!!Test31!!Test35 | ||
+ | |- | ||
+ | |1||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||- | ||
+ | |- | ||
+ | |4||1.01||3.18||3.38||4.29||2.90||6.34||3.34||2.52||1.73||2.87||3.03||2.84||3.33||- | ||
+ | |- | ||
+ | |8||1.47||5.98||6.22||7.86||5.70||13.2||6.51||5.03||2.23||4.65||5.00||4.96||6.38||- | ||
+ | |} | ||
+ | |||
We also ran the same benchmarks on the '''AMD (Opteron 6378)''' server (for comparison, we provide results on the Intel server as well):
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | !#proc!!Test1!!Test3 !!Test4!!Test5!!Test7!!Test9!!Test11!!Test13!!Test14!!Test17!!Test18!!Test27!!Test31!!Test35!!Notes | ||
+ | |- | ||
+ | |1||91||395||619||788||1406||1292||825||393||441||882||898||651||651||-||- | ||
+ | |- | ||
+ | |4||102||362||375||277||521||285||307||376||267||318||497||554||477||381||- | ||
+ | |- | ||
+ | |8||95||221||213||142||300||144||158||247||186||190||356||323||269||228||- | ||
+ | |- | ||
+ | |16||90||127||112||78||167||74||90||170||116||145||193||192||152||170||- | ||
+ | |- | ||
+ | |Intel(8-proc)||62||65||87||65||88||57||61||58||114||65||89||131||96||81||1 | ||
+ | |- | ||
+ | |Scaled Intel||85||89||120||89||121||78||84||80||157||89||122||180||132||111||2 | ||
+ | |- | ||
+ | |AMD (16 proc)/<br>Intel(8 proc)||1.06||1.43||0.93||0.88||1.38||.95||1.07||2.14||0.74||1.63||1.58||1.07||1.15||1.53||3 | ||
+ | |} | ||
+ | |||
+ | Table for the scaling with respect to 1 processor runs: | ||
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | !#proc!!Test1!!Test3!!Test4!!Test5!!Test7!!Test9!!Test11!!Test13!!Test14!!Test17!!Test18!!Test27!!Test31!!Test35 | ||
+ | |- | ||
+ | |1||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||- | ||
+ | |- | ||
+ | |4||0.89||1.09||1.65||2.84||2.70||4.53||2.69||1.05||1.65||2.77||1.81||1.18||1.36||- | ||
+ | |- | ||
+ | |8||0.96||1.79||2.91||5.55||4.69||8.97||5.22||1.59||2.37||4.64||2.52||2.02||2.42||- | ||
+ | |- | ||
+ | |16||1.01||3.11||5.53||10.1||8.42||17.46||9.17||2.31||3.80||6.08||4.65||3.39||4.28||- | ||
+ | |} | ||
+ | |||
+ | There is a wide variety of scaling behaviour: some test cases scale nearly linearly while others do not. In most cases the scaling is better on the Intel server than on the AMD server, although a few cases show about the same scaling on both. | ||
+ | |||
+ | <sup>1</sup> Intel: E5 2643 @ 3.3 GHz, AMD: Opteron 6378 @ 2.4 GHz<br> | ||
+ | <sup>2</sup> Scaling was done by the factor of 3.3/2.4=1.375 <br> | ||
+ | <sup>3</sup> Comparison of 8 processors run on Intel vs 16 processors run on AMD <br> | ||
+ | |||
+ | == MKL vs FFTW FFTs == | ||
+ | We profiled the code to see where it spends most of its time. Below is a summary of the profiles for all the benchmarks that we tried. | ||
+ | |||
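+ | The flat profiles below are in the format produced by gprof. A minimal sketch of how such a profile can be collected (the executable name and the extra flags are illustrative assumptions, not taken from the builds above): | ||
+ | <source lang=make> | ||
+ | # rebuild with profiling enabled by adding -pg to the compile and link flags | ||
+ | FFLAGS  += -pg | ||
+ | LDFLAGS += -pg | ||
+ | # run the benchmark; the run writes gmon.out, which gprof then reads | ||
+ | mpirun -np 8 ./DLPOLY.Z | ||
+ | gprof ./DLPOLY.Z gmon.out > profile.txt | ||
+ | </source> | ||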
+ | '''Test1:''' | ||
+ | |||
+ | '''Intel''' | ||
+ | |||
+ | time seconds seconds calls s/call s/call name | ||
+ | 28.26 7.16 7.16 675309 0.00 0.00 vdw_forces_ | ||
+ | 25.57 13.64 6.48 201 0.03 0.03 link_cell_pairs_ | ||
+ | 17.72 18.13 4.49 675309 0.00 0.00 ewald_real_forces_ | ||
+ | 7.22 19.96 1.83 201 0.01 0.01 ewald_spme_forces_IP_spme_forces_ | ||
+ | 5.17 21.27 1.31 201 0.01 0.02 ewald_spme_forces_ | ||
+ | 3.95 22.27 1.00 201 0.00 0.12 two_body_forces_ | ||
+ | 3.83 23.24 0.97 675309 0.00 0.00 images_ | ||
+ | 3.08 24.02 0.78 201 0.00 0.00 bspgen_ | ||
+ | 1.42 24.38 0.36 11658 0.00 0.00 gpfa_module_mp_gpfa3f_ | ||
+ | 1.07 24.65 0.27 202 0.00 0.00 shellsort2_ | ||
+ | 0.67 24.82 0.17 1206 0.00 0.00 export_atomic_data_ | ||
+ | |||
+ | '''AMD''' | ||
+ | time seconds seconds calls s/call s/call name | ||
+ | 23.43 18.45 18.45 3379737 0.00 0.00 vdw_forces_ | ||
+ | 21.47 35.36 16.91 1000 0.02 0.02 link_cell_pairs_ | ||
+ | 14.23 46.57 11.21 3379737 0.00 0.00 ewald_real_forces_ | ||
+ | 11.77 55.84 9.27 500 0.02 0.02 bspgen_ | ||
+ | 5.78 60.39 4.55 1000 0.00 0.00 ewald_spme_forces_IP_spme_forces_ | ||
+ | 4.57 63.99 3.60 1000 0.00 0.02 ewald_spme_forces_ | ||
+ | 4.36 67.42 3.43 deport_atomic_data_ | ||
+ | 3.34 70.05 2.63 3379737 0.00 0.00 images_ | ||
+ | 2.76 72.22 2.17 500 0.00 0.14 two_body_forces_ | ||
+ | 2.13 73.90 1.68 spec_dexp2 | ||
+ | 1.24 74.88 0.98 29000 0.00 0.00 gpfa_module_mp_gpfa3f_ | ||
+ | 1.05 75.71 0.83 998 0.00 0.00 nvt_b0_vv_ | ||
+ | 0.86 76.39 0.68 501 0.00 0.00 shellsort2_ | ||
+ | |||
+ | '''Test3:''' | ||
+ | '''Intel''' | ||
+ | |||
+ | time seconds seconds calls s/call s/call name | ||
+ | 17.53 26.21 26.21 1298431 0.00 0.00 vdw_forces_ | ||
+ | 14.82 48.37 22.16 201 0.11 0.13 link_cell_pairs_ | ||
+ | 14.22 69.62 21.25 26532 0.00 0.00 gpfa_module_mp_gpfa2f_ | ||
+ | 11.73 87.16 17.54 1298431 0.00 0.00 ewald_real_forces_ | ||
+ | 6.62 97.06 9.90 1200 0.01 0.01 deport_atomic_data_ | ||
+ | 4.04 103.10 6.04 201 0.03 0.21 ewald_spme_forces_ | ||
+ | 3.80 108.78 5.68 200 0.03 0.03 constraints_shake_vv_ | ||
+ | 3.37 113.82 5.04 43159044 0.00 0.00 local_index_ | ||
+ | 3.32 118.79 4.97 2355005 0.00 0.00 images_ | ||
+ | 3.06 123.36 4.57 200 0.02 0.03 constraints_rattle_ | ||
+ | 2.99 127.82 4.47 326609967 0.00 0.00 match_ | ||
+ | 2.57 131.66 3.84 201 0.02 0.02 ewald_spme_forces_IP_spme_forces_ | ||
+ | 2.01 134.67 3.01 201 0.01 0.60 two_body_forces_ | ||
+ | 1.10 136.32 1.65 201 0.01 0.01 parallel_fft_mp_forward_3d_fft_z_ | ||
+ | 1.04 137.87 1.55 201 0.01 0.06 parallel_fft_mp_forward_3d_fft_y_ | ||
+ | 0.98 139.33 1.46 201 0.01 0.01 bspgen_ | ||
+ | |||
+ | '''AMD''' | ||
+ | |||
+ | time seconds seconds calls s/call s/call name | ||
+ | 15.14 28.44 28.44 1299553 0.00 0.00 vdw_forces_ | ||
+ | 12.46 51.85 23.41 201 0.12 0.14 link_cell_pairs_ | ||
+ | 11.54 73.53 21.68 26532 0.00 0.00 gpfa_module_mp_gpfa2f_ | ||
+ | 9.90 92.13 18.60 1299553 0.00 0.00 ewald_real_forces_ | ||
+ | 7.30 105.85 13.72 200 0.07 0.07 constraints_rattle_ | ||
+ | 7.01 119.03 13.18 200 0.07 0.07 constraints_shake_vv_ | ||
+ | 6.23 130.73 11.70 1200 0.01 0.01 deport_atomic_data_ | ||
+ | 3.69 137.66 6.93 201 0.03 0.03 bspgen_ | ||
+ | 3.17 143.61 5.95 201 0.03 0.24 ewald_spme_forces_ | ||
+ | 2.65 148.59 4.98 2367053 0.00 0.00 images_ | ||
+ | 2.45 153.19 4.60 39319511 0.00 0.00 local_index_ | ||
+ | 2.25 157.41 4.23 328218229 0.00 0.00 match_ | ||
+ | 2.14 161.43 4.02 201 0.02 0.02 ewald_spme_forces_IP_spme_forces_ | ||
+ | 1.81 164.84 3.41 spec_dexp2 | ||
+ | 1.59 167.82 2.98 201 0.01 0.66 two_body_forces_ | ||
+ | 1.30 170.26 2.44 201 0.01 0.01 parallel_fft_mp_forward_3d_fft_z_ | ||
+ | 0.80 171.76 1.50 201 0.01 0.01 parallel_fft_mp_back_3d_fft_x_ | ||
+ | |||
+ | '''Test4:''' | ||
+ | |||
+ | '''Intel''' | ||
+ | time seconds seconds calls s/call s/call name | ||
+ | 18.66 29.94 29.94 1604385 0.00 0.00 vdw_forces_ | ||
+ | 15.84 55.35 25.41 31 0.82 0.98 link_cell_pairs_ | ||
+ | 13.21 76.55 21.20 1604385 0.00 0.00 ewald_real_forces_ | ||
+ | 12.69 96.92 20.37 30 0.68 0.71 constraints_shake_vv_ | ||
+ | 7.68 109.24 12.32 30 0.41 0.44 constraints_rattle_ | ||
+ | 7.32 120.99 11.75 180 0.07 0.07 deport_atomic_data_ | ||
+ | 3.79 127.07 6.08 2880181 0.00 0.00 images_ | ||
+ | 3.23 132.25 5.18 40320168 0.00 0.00 local_index_ | ||
+ | 3.10 137.22 4.98 364097691 0.00 0.00 match_ | ||
+ | 2.72 141.58 4.36 31 0.14 0.14 ewald_spme_forces_IP_spme_forces_ | ||
+ | 2.16 145.05 3.47 31 0.11 3.38 two_body_forces_ | ||
+ | 2.05 148.34 3.29 4092 0.00 0.00 gpfa_module_mp_gpfa2f_ | ||
+ | 1.95 151.47 3.13 31 0.10 0.42 ewald_spme_forces_ | ||
+ | 0.67 152.54 1.07 31 0.03 0.03 bspgen_ | ||
+ | |||
+ | '''AMD''' | ||
+ | |||
+ | time seconds seconds calls s/call s/call name | ||
+ | 18.03 33.20 33.20 1603402 0.00 0.00 vdw_forces_ | ||
+ | 13.13 57.37 24.17 31 0.78 0.90 link_cell_pairs_ | ||
+ | 12.76 80.87 23.50 30 0.78 0.81 constraints_shake_vv_ | ||
+ | 11.25 101.59 20.72 1603402 0.00 0.00 ewald_real_forces_ | ||
+ | 9.40 118.90 17.31 30 0.58 0.60 constraints_rattle_ | ||
+ | 7.53 132.76 13.86 180 0.08 0.08 deport_atomic_data_ | ||
+ | 3.27 138.78 6.02 2878846 0.00 0.00 images_ | ||
+ | 2.65 143.65 4.87 31 0.16 0.16 bspgen_ | ||
+ | 2.60 148.44 4.79 40437733 0.00 0.00 local_index_ | ||
+ | 2.45 152.96 4.52 spec_dexp2 | ||
+ | 2.36 157.31 4.35 31 0.14 0.14 ewald_spme_forces_IP_spme_forces_ | ||
+ | 2.07 161.13 3.82 62 0.06 1.77 two_body_forces_ | ||
+ | 2.02 164.84 3.72 363897922 0.00 0.00 match_ | ||
+ | 1.87 168.28 3.44 31 0.11 0.55 ewald_spme_forces_ | ||
+ | 1.82 171.64 3.36 4092 0.00 0.00 gpfa_module_mp_gpfa2f_ | ||
+ | 0.75 173.03 1.39 150 0.01 0.30 nve_0_vv_ | ||
+ | |||
+ | |||
+ | '''Test5:''' | ||
+ | |||
+ | time seconds seconds calls s/call s/call name | ||
+ | 44.99 50.53 50.53 101 0.50 0.50 three_body_forces_ | ||
+ | 16.21 68.74 18.21 873523 0.00 0.00 ewald_real_forces_ | ||
+ | 15.19 85.80 17.06 101 0.17 0.17 link_cell_pairs_ | ||
+ | 10.25 97.31 11.51 873523 0.00 0.00 vdw_forces_ | ||
+ | 3.71 101.48 4.17 873523 0.00 0.00 images_ | ||
+ | 3.33 105.22 3.74 101 0.04 0.59 two_body_forces_ | ||
+ | 2.08 107.56 2.34 101 0.02 0.02 ewald_spme_forces_IP_spme_forces_ | ||
+ | 1.41 109.14 1.58 101 0.02 0.05 ewald_spme_forces_ | ||
+ | 0.97 110.23 1.09 101 0.01 0.01 bspgen_ | ||
+ | |||
+ | '''Test7:''' | ||
+ | |||
+ | time seconds seconds calls s/call s/call name | ||
+ | 16.97 13.50 13.50 101 0.13 0.15 link_cell_pairs_ | ||
+ | 16.75 26.82 13.32 1326 0.01 0.01 constraints_shake_vv_ | ||
+ | 11.59 36.04 9.22 600 0.02 0.02 deport_atomic_data_ | ||
+ | 10.69 44.54 8.50 1249646 0.00 0.00 ewald_real_forces_ | ||
+ | 10.59 52.96 8.42 1249646 0.00 0.00 vdw_forces_ | ||
+ | 4.39 56.45 3.49 2122099 0.00 0.00 images_ | ||
+ | 4.34 59.91 3.46 23122212 0.00 0.00 local_index_ | ||
+ | 4.14 63.20 3.29 101 0.03 0.03 ewald_spme_forces_IP_spme_forces_ | ||
+ | 3.34 65.86 2.66 200 0.01 0.11 npt_b0_vv_ | ||
+ | 2.87 68.14 2.28 101 0.02 0.09 ewald_spme_forces_ | ||
+ | 2.78 70.35 2.21 100 0.02 0.03 constraints_rattle_ | ||
+ | 2.21 72.11 1.76 101 0.02 0.46 two_body_forces_ | ||
+ | 1.89 73.61 1.51 154519604 0.00 0.00 match_ | ||
+ | 1.85 75.08 1.47 101 0.01 0.01 bspgen_ | ||
+ | 0.65 75.60 0.52 202 0.00 0.00 shellsort2_ | ||
+ | |||
+ | '''Test9:''' | ||
+ | |||
+ | 95.21 75.98 75.98 401 0.19 0.19 tersoff_forces_ | ||
+ | 1.09 76.85 0.87 2400 0.00 0.00 deport_atomic_data_ | ||
+ | 0.96 77.62 0.77 402 0.00 0.00 shellsort2_ | ||
+ | |||
+ | '''Test11:''' | ||
+ | |||
+ | 36.75 52.13 52.13 4004908 0.00 0.00 metal_forces_ | ||
+ | 22.05 83.41 31.28 1001 0.03 0.03 link_cell_pairs_ | ||
+ | 21.82 114.36 30.95 4004908 0.00 0.00 metal_ld_collect_fst_ | ||
+ | 10.26 128.91 14.55 8009816 0.00 0.00 images_ | ||
+ | 2.59 132.59 3.68 1001 0.00 0.14 two_body_forces_ | ||
+ | 2.44 136.05 3.46 1001 0.00 0.04 metal_ld_compute_ | ||
+ | 1.35 137.97 1.92 1002 0.00 0.00 shellsort2_ | ||
+ | 0.62 138.85 0.88 6000 0.00 0.00 deport_atomic_data_ | ||
+ | |||
+ | '''Test13:''' | ||
+ | |||
+ | 18.91 22.56 22.56 22348 0.00 0.00 gpfa_module_mp_gpfa2f_ | ||
+ | 12.43 37.39 14.83 900 0.02 0.02 deport_atomic_data_ | ||
+ | 11.80 51.47 14.08 1950 0.01 0.01 constraints_shake_vv_ | ||
+ | 8.38 61.47 10.00 151 0.07 0.08 link_cell_pairs_ | ||
+ | 5.49 68.02 6.55 300 0.02 0.09 npt_b0_vv_ | ||
+ | 5.02 74.01 5.99 302 0.02 0.02 gpfa_module_mp_gpfa3f_ | ||
+ | 4.86 79.81 5.80 151 0.04 0.33 ewald_spme_forces_ | ||
+ | 4.38 85.04 5.23 151 0.03 0.04 ewald_spme_forces_IP_spme_forces_ | ||
+ | 3.53 89.26 4.22 34695506 0.00 0.00 local_index_ | ||
+ | 3.47 93.40 4.14 150 0.03 0.03 constraints_rattle_ | ||
+ | 2.94 96.91 3.51 2364786 0.00 0.00 ewald_real_forces_ | ||
+ | 2.84 100.30 3.39 4097851 0.00 0.00 images_ | ||
+ | 2.36 103.11 2.81 2364786 0.00 0.00 vdw_forces_ | ||
+ | 1.32 104.68 1.57 151 0.01 0.01 bspgen_ | ||
+ | 1.27 106.19 1.51 151 0.01 0.01 parallel_fft_mp_forward_3d_fft_z_ | ||
+ | 1.18 107.60 1.41 151 0.01 0.01 parallel_fft_mp_back_3d_fft_x_ | ||
+ | 1.16 108.98 1.38 151 0.01 0.10 parallel_fft_mp_forward_3d_fft_y_ | ||
+ | 1.14 110.34 1.36 151 0.01 0.48 two_body_forces_ | ||
+ | 1.11 111.66 1.33 92125754 0.00 0.00 match_ | ||
+ | 1.09 112.96 1.30 151 0.01 0.01 parallel_fft_mp_forward_3d_fft_x_ | ||
+ | 0.99 114.14 1.18 151 0.01 0.10 parallel_fft_mp_back_3d_fft_y_ | ||
+ | |||
+ | '''Test14:''' | ||
+ | |||
+ | 21.33 16.80 16.80 130 0.13 0.13 constraints_shake_vv_ | ||
+ | 10.48 25.05 8.25 60 0.14 0.14 deport_atomic_data_ | ||
+ | 7.24 30.75 5.70 11 0.52 0.60 link_cell_pairs_ | ||
+ | 6.41 35.80 5.05 3014 0.00 0.00 gpfa_module_mp_gpfa3f_ | ||
+ | 5.75 40.33 4.53 20 0.23 1.35 npt_b0_vv_ | ||
+ | 5.30 44.50 4.17 10 0.42 0.43 constraints_rattle_ | ||
+ | 5.27 48.65 4.15 11 0.38 2.20 ewald_spme_forces_ | ||
+ | 5.13 52.69 4.04 2992 0.00 0.00 gpfa_module_mp_gpfa2f_ | ||
+ | 4.28 56.06 3.37 11 0.31 0.31 ewald_spme_forces_IP_spme_forces_ | ||
+ | 3.19 58.57 2.51 20333039 0.00 0.00 local_index_ | ||
+ | 2.79 60.77 2.20 1534827 0.00 0.00 ewald_real_forces_ | ||
+ | 2.78 62.96 2.19 2646191 0.00 0.00 images_ | ||
+ | 2.30 64.77 1.81 1534827 0.00 0.00 vdw_forces_ | ||
+ | 2.29 66.57 1.80 11 0.16 0.16 bspgen_ | ||
+ | 1.92 68.08 1.51 22 0.07 0.07 gpfa_module_mp_gpfa5f_ | ||
+ | 1.80 69.50 1.42 11 0.13 0.60 parallel_fft_mp_forward_3d_fft_y_ | ||
+ | 1.79 70.91 1.41 11 0.13 0.60 parallel_fft_mp_back_3d_fft_y_ | ||
+ | 1.09 71.77 0.86 57924127 0.00 0.00 match_ | ||
+ | 0.95 72.52 0.75 1 0.75 0.75 dihedrals_14_check_ | ||
+ | |||
+ | '''Test17:''' | ||
+ | |||
+ | 20.71 10.65 10.65 201 0.05 0.06 link_cell_pairs_ | ||
+ | 13.88 17.79 7.14 984116 0.00 0.00 ewald_real_forces_ | ||
+ | 11.43 23.67 5.88 984116 0.00 0.00 vdw_forces_ | ||
+ | 9.72 28.67 5.00 3200 0.00 0.00 constraints_shake_vv_ | ||
+ | 6.68 32.11 3.44 2260717 0.00 0.00 images_ | ||
+ | 5.84 35.11 3.01 23232840 0.00 0.00 local_index_ | ||
+ | 5.06 37.71 2.60 201 0.01 0.01 ewald_spme_forces_IP_spme_forces_ | ||
+ | 4.53 40.04 2.33 200 0.01 0.02 constraints_rattle_ | ||
+ | 3.46 41.82 1.78 201 0.01 0.03 ewald_spme_forces_ | ||
+ | 2.51 43.11 1.29 4235 0.00 0.00 pmf_coms_ | ||
+ | 2.25 44.27 1.16 118620664 0.00 0.00 match_ | ||
+ | 2.18 45.39 1.12 1200 0.00 0.00 deport_atomic_data_ | ||
+ | 2.10 46.47 1.08 201 0.01 0.17 two_body_forces_ | ||
+ | 2.10 47.55 1.08 201 0.01 0.01 bspgen_ | ||
+ | 1.50 48.32 0.77 400 0.00 0.03 npt_b0_vv_ | ||
+ | 1.11 48.89 0.57 402 0.00 0.00 shellsort2_ | ||
+ | 0.80 49.30 0.41 201 0.00 0.00 pass_shared_units_ | ||
+ | |||
+ | '''Test18:''' | ||
+ | |||
+ | 34.45 69.38 69.38 1800 0.04 0.04 constraints_shake_vv_ | ||
+ | 9.78 89.07 19.69 101 0.19 0.22 link_cell_pairs_ | ||
+ | 8.63 106.45 17.38 100 0.17 0.20 constraints_rattle_ | ||
+ | 6.61 119.77 13.32 2041857 0.00 0.00 ewald_real_forces_ | ||
+ | 5.95 131.75 11.98 2041857 0.00 0.00 vdw_forces_ | ||
+ | 5.38 142.59 10.84 2718 0.00 0.01 pmf_coms_ | ||
+ | 5.29 153.24 10.65 4883071 0.00 0.00 images_ | ||
+ | 4.95 163.22 9.98 66656800 0.00 0.00 local_index_ | ||
+ | 3.18 169.62 6.40 200 0.03 0.60 npt_b0_vv_ | ||
+ | 2.74 175.14 5.52 101 0.05 0.06 ewald_spme_forces_IP_spme_forces_ | ||
+ | 1.84 178.84 3.70 101 0.04 0.16 ewald_spme_forces_ | ||
+ | 1.71 182.28 3.44 600 0.01 0.01 deport_atomic_data_ | ||
+ | 1.51 185.33 3.05 101 0.03 0.03 bspgen_ | ||
+ | 1.44 188.24 2.91 101 0.03 0.74 two_body_forces_ | ||
+ | 1.18 190.61 2.38 231279151 0.00 0.00 match_ | ||
+ | 0.76 192.14 1.53 25304 0.00 0.00 update_shared_units_ | ||
+ | |||
+ | '''Test27:''' | ||
+ | |||
+ | 38.45 91.00 91.00 12076288 0.00 0.00 metal_forces_ | ||
+ | 24.32 148.56 57.56 3501 0.02 0.02 link_cell_pairs_ | ||
+ | 17.06 188.93 40.37 12076288 0.00 0.00 metal_ld_collect_eam_ | ||
+ | 8.90 210.00 21.07 24152576 0.00 0.00 images_ | ||
+ | 2.64 216.24 6.24 3501 0.00 0.06 two_body_forces_ | ||
+ | 2.39 221.90 5.66 3501 0.00 0.02 metal_ld_compute_ | ||
+ | 1.46 225.35 3.45 3502 0.00 0.00 shellsort2_ | ||
+ | 1.11 227.98 2.63 21000 0.00 0.00 deport_atomic_data_ | ||
+ | 0.90 230.11 2.13 21006 0.00 0.00 export_atomic_data_ | ||
+ | |||
+ | '''Test31:''' | ||
+ | |||
+ | 31.17 66.30 66.30 12004000 0.00 0.00 metal_forces_ | ||
+ | 29.18 128.38 62.08 3001 0.02 0.02 link_cell_pairs_ | ||
+ | 13.22 156.50 28.12 12004000 0.00 0.00 metal_ld_collect_eam_ | ||
+ | 10.09 177.96 21.46 24008000 0.00 0.00 images_ | ||
+ | 5.35 189.35 11.39 18000 0.00 0.00 deport_atomic_data_ | ||
+ | 2.99 195.71 6.36 3001 0.00 0.06 two_body_forces_ | ||
+ | 2.77 201.60 5.89 3001 0.00 0.02 metal_ld_compute_ | ||
+ | 1.29 204.34 2.75 3002 0.00 0.00 shellsort2_ | ||
+ | 0.70 205.83 1.49 6000 0.00 0.00 npt_b0_vv_ | ||
+ | |||
+ | '''Test35:''' | ||
+ | |||
+ | 27.22 27.64 27.64 562 0.05 0.06 link_cell_pairs_ | ||
+ | 19.94 47.89 20.25 1878757 0.00 0.00 ewald_real_forces_ | ||
+ | 9.75 57.79 9.90 1878757 0.00 0.00 vdw_forces_ | ||
+ | 5.52 63.40 5.61 3227549 0.00 0.00 images_ | ||
+ | 5.46 68.95 5.55 2008 0.00 0.00 constraints_shake_vv_ | ||
+ | 4.44 73.46 4.51 562 0.01 0.01 ewald_spme_forces_IP_spme_forces_ | ||
+ | 3.43 76.95 3.49 354007894 0.00 0.00 match_ | ||
+ | 3.38 80.38 3.43 562 0.01 0.15 two_body_forces_ | ||
+ | 3.30 83.73 3.35 562 0.01 0.01 bspgen_ | ||
+ | 3.23 87.01 3.29 28441741 0.00 0.00 local_index_ | ||
+ | 2.87 89.92 2.91 500 0.01 0.01 constraints_rattle_ | ||
+ | 2.85 92.81 2.89 562 0.01 0.02 ewald_spme_forces_ | ||
+ | 2.05 94.89 2.08 3366 0.00 0.00 deport_atomic_data_ | ||
+ | 1.71 96.63 1.74 1124 0.00 0.00 shellsort2_ | ||
+ | 1.44 98.09 1.46 1000 0.00 0.01 npt_m1_vv_ | ||
+ | 0.49 98.59 0.50 562 0.00 0.00 set_halo_particles_ | ||
+ | |||
+ | From the above data, we can clearly see that the code's dependence on the FFTs is very small (where the FFT routines show up at all, they account for roughly one percent of the time), so there is no need to try different FFT libraries for the sake of efficiency. | ||
+ | |||
+ | == Shared vs Exclusive run on AMD servers (4 x 6378 @ 2.4 GHz) == | ||
+ | |||
+ | We describe the effect of the shared FPU and L2 cache of the AMD server. | ||
+ | The results are summarized in the following table '''(# processes=8)''': | ||
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | !Run Scheme!!Test1!!Test3 !!Test4!!Test5!!Test7!!Test9!!Test11!!Test13!!Test14!!Test17!!Test18!!Test27!!Test31!!Test35 | ||
+ | |- | ||
+ | |Shared<br> time(s)||95||221||213||142||300||144||158||247||186||190||356||323||269||228 | ||
+ | |- | ||
+ | |Exclusive<br>time (s)||90||168||161||109||226||116||128||185||157||151||282||265||209||173 | ||
+ | |} | ||
+ | |||
+ | In the above table, "shared" represents cores that share resources such as L2-Cache and FPU, while "exclusive" stands for cores which do not share any resources. | ||
+ | |||
+ | == Performance Comparison == | ||
+ | |||
+ | Following is a table for performance comparison of Intel and AMD servers when the job was run using '''8 processors''': | ||
+ | |||
+ | {| border= 3 align="Center" style="text-align: center;" | ||
+ | !Server!!Test1!!Test3 !!Test4!!Test5!!Test7!!Test9!!Test11!!Test13!!Test14!!Test17!!Test18!!Test27!!Test31!!Test35!!Notes | ||
+ | |- | ||
+ | |Intel||62||65||87||65||88||57||61||58||114||65||89||131||96||81||1 | ||
+ | |- | ||
+ | |Scaled Intel||85||89||120||89||121||78||84||80||157||89||122||180||132||111||2 | ||
+ | |- | ||
+ | |AMD (Shared)||95||221||213||142||300||144||158||247||186||190||356||323||269||228||1,3 | ||
+ | |- | ||
+ | |AMD(Exclusive)||90||168||161||109||226||116||128||185||157||151||282||265||209||173||1,4 | ||
+ | |- | ||
+ | |AMD Shared/AMD Exc.||1.05||1.31||1.32||1.30||1.33||1.24||1.23||1.34||1.18||1.26||1.26||1.22||1.29||1.32||- | ||
+ | |- | ||
+ | |AMD Exc./Intel (scaled)||1.06||1.89||1.34||1.22||1.87||1.49||1.52||2.31||1.00||1.70||2.31||1.47||1.58||1.56||- | ||
+ | |- | ||
+ | |AMD Shared/Intel (scaled)||1.12||2.48||1.78||1.60||2.48||1.85||1.88||3.09||1.18||2.13||2.92||1.79||2.03||2.05||- | ||
+ | |- | ||
+ | |AMD (16 proc)/<br>Intel(8 proc)||1.06||1.43||0.93||0.88||1.38||.95||1.07||2.14||0.74||1.63||1.58||1.07||1.15||1.53||3 | ||
+ | |} | ||
+ | |||
+ | In summary-- | ||
+ | |||
+ | 1. In the majority of cases, DL_POLY runs better (~1.2-2.3X) on the Intel server than on the AMD server, even with the interleaved cores. However, there are a few cases (Test1 and Test14) where the performance difference between the two servers is negligible. | ||
+ | |||
+ | 2. FPU sharing slows the AMD runs down by a factor of about 1.2-1.3. | ||
+ | |||
+ | |||
+ | <sup>1</sup> Intel: E5 2643 @ 3.3 GHz, AMD: Opteron 6378 @ 2.4 GHz<br> | ||
+ | <sup>2</sup> Scaling was done by the factor of 3.3/2.4=1.375 <br> | ||
+ | <sup>3</sup> "Shared" represents cores that share resources such as L2-Cache and FPU <br> | ||
+ | <sup>4</sup> "Exclusive" represents cores that don't share resources such as L2-Cache and FPU | ||
+ | |||
+ | =BLAST= | ||
+ | |||
+ | A couple of things before we dive into the benchmarking: | ||
+ | |||
+ | 1. NCBI-BLAST is not an MPI-enabled code. The only parallelization available is intra-node, which can be achieved through OpenMP. If you would like a distributed-memory parallel BLAST, mpiBLAST is an option, but an internet search turned up reports of it being unstable. | ||
+ | |||
+ | 2. BLAST must be compiled with thread safety enabled, otherwise the code crashes when run with more than one thread. Thread safety is enabled with the "--with-mt" option of the configure script. | ||
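+ | For reference, here is a sketch of how a multi-threaded run can be launched once the thread-safe build is in place (the query, database and thread count are illustrative, and the option name assumes the BLAST+ style command line): | ||
+ | <source lang=make> | ||
+ | # illustrative blastx run using 8 threads on a single node | ||
+ | blastx -query query.fa -db nr -num_threads 8 -out query_vs_nr.blastx | ||
+ | </source> | ||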
+ | |||
+ | |||
+ | == Comparison of Intel, Open64 and GNU Builds == | ||
+ | |||
+ | |||
+ | |||
+ | '''Intel compiler:''' | ||
+ | |||
+ | <source lang=make> | ||
+ | module load intel | ||
+ | export CFLAGS="-Wall -O2 -msse2 " | ||
+ | export CXXFLAGS="-Wall -O2 -msse2 " | ||
+ | export CPPFLAGS="-Wall -O2 -msse2 " | ||
+ | ./configure --with-bin-release --without-debug --with-mt | ||
+ | </source> | ||
+ | |||
+ | |||
+ | |||
+ | '''GNU compiler:''' | ||
+ | |||
+ | <source lang=make> | ||
+ | module load gcc/.4.7.2 | ||
+ | export CFLAGS="-Wall -O2 -msse2 " | ||
+ | export CXXFLAGS="-Wall -O2 -msse2 " | ||
+ | export CPPFLAGS="-Wall -O2 -msse2 " | ||
+ | ./configure --with-bin-release --without-debug --with-mt --with-64 | ||
+ | </source> | ||
+ | |||
+ | |||
+ | '''Open64 compiler:''' | ||
+ | |||
+ | <source lang=make> | ||
+ | module load open64/.4.5.2 | ||
+ | export CFLAGS="-Wall -O2 -msse2 " | ||
+ | export CXXFLAGS="-Wall -O2 -msse2 " | ||
+ | export CPPFLAGS="-Wall -O2 -msse2 " | ||
+ | ./configure --with-bin-release --without-debug --with-mt --with-64 | ||
+ | </source> | ||
+ | |||
+ | Following is a comparison for a test case run with the blastx executable of BLAST. | ||
+ | |||
+ | {| | ||
+ | | STYLE="top"| | ||
+ | {|| border=3 align="Center" style="text-align: center;" | ||
+ | |- | ||
+ | |+'''Wall Time on Intel Sandy-Bridge''' | ||
+ | |- | ||
+ | !#processor!!Intel-Basic<br>Time(s)!!Intel-Fancy<br>Time(s)!!GNU-Basic<br>Time(s)!!GNU-Fancy<br>Time(s)!!Open64-Basic<br>Time(s)!!Open64-Fancy<br>Time(s) | ||
+ | |- | ||
+ | |1||400||412||416||451||475||- | ||
+ | |- | ||
+ | |2||209||216||219||235||251||- | ||
+ | |- | ||
+ | |4||115||120||122||130||137||- | ||
+ | |- | ||
+ | |8||67||69||73||77||79||- | ||
+ | |} | ||
+ | |||
+ | The executable built with Open64 and fancy flags crashes on the Intel Sandy-Bridge server. | ||
+ | |||
+ | Clearly, the Intel compiler with basic flags performs better than the other compiler+flag combinations. Interestingly, the fancy flags hurt performance for both the Intel and GNU builds. | ||
+ | |||
+ | |||
+ | On the AMD server, we obtained the following (these runs share resources (FPU, L2-cache)): | ||
+ | |||
+ | {| | ||
+ | | STYLE="top"| | ||
+ | {|| border=3 align="Center" style="text-align: center;" | ||
+ | |- | ||
+ | |+'''Wall Time on AMD Abu-Dhabi''' | ||
+ | |- | ||
+ | !#processor!!Intel<br>Time(s)!!Open64<br>Time(s) | ||
+ | |- | ||
+ | |1||557||615 | ||
+ | |- | ||
+ | |2||348||380 | ||
+ | |- | ||
+ | |4||199||216 | ||
+ | |- | ||
+ | |8||126||136 | ||
+ | |- | ||
+ | |16||83||88 | ||
+ | |} | ||
+ | |||
+ | |||
+ | <sup>1</sup> '''Basic Flags:'''<br>'''Intel:'''-Wall -O2 -msse2 <br> | ||
+ | '''Open64:'''-Wall -O2 -msse2 <br> | ||
+ | '''GNU:''' -Wall -O2 -msse2 <br> | ||
+ | |||
+ | <sup>2</sup> '''Fancy Flags:'''<br> | ||
+ | '''Intel:''' -Wall -O3 -mavx -unroll-aggresive -opt-prefetch -use-intel-optimized-headers <br> | ||
+ | '''GNU:''' CCFLAGS= -Wall -O3 -mavx -fsched-pressure -flto -funroll-all-loops -fprefetch-loop-arrays -minline-all-stringops -fno-tree-pre -ftree-vectorize <br> | ||
+ | '''Open64''' -Wall -OPT:Ofast -mavx -mfma4 -march=bdver1 -O3 -fomit-frame-pointer -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4-malign-double -fstrict-aliasing -fno-schedule-insns -ffast-math | ||
+ | |||
+ | ==Scaling== | ||
+ | |||
+ | In the following table, we compare scaling of different executables of BLAST on Intel Sandy-Bridge server: | ||
+ | {| | ||
+ | | STYLE="top"| | ||
+ | {|| border=3 align="Center" style="text-align: center;" | ||
+ | |- | ||
+ | |+'''Scaling on Intel Sandy-Bridge''' | ||
+ | |- | ||
+ | !#processor!!blastx!!tblastx!!blastp | ||
+ | |- | ||
+ | |1||412||768||248 | ||
+ | |- | ||
+ | |2||216||574||172 | ||
+ | |- | ||
+ | |4||120||268||131 | ||
+ | |- | ||
+ | |8||69||98||113 | ||
+ | |} | ||
+ | |||
+ | The above table can be displayed as below, where the single-processor run is normalized to 1.00 and the speedup relative to it is calculated for the other runs. | ||
+ | |||
+ | {| | ||
+ | | STYLE="top"| | ||
+ | {|| border=3 align="Center" style="text-align: center;" | ||
+ | |- | ||
+ | |+'''Scaling on Intel Sandy-Bridge''' | ||
+ | |- | ||
+ | !#processor!!blastx!!tblastx!!blastp | ||
+ | |- | ||
+ | |1||1.00||1.00||1.00 | ||
+ | |- | ||
+ | |2||1.91||1.34||1.44 | ||
+ | |- | ||
+ | |4||3.43||2.87||1.89 | ||
+ | |- | ||
+ | |8||5.97||7.84||2.19 | ||
+ | |} | ||
+ | |||
+ | Scaling on the AMD Abu Dhabi server is displayed below- | ||
+ | |||
+ | {| | ||
+ | | STYLE="top"| | ||
+ | {|| border=3 align="Center" style="text-align: center;" | ||
+ | |- | ||
+ | |+'''Scaling on AMD Abu Dhabi''' | ||
+ | |- | ||
+ | !#processor!!blastx!!tblastx!!blastp | ||
+ | |- | ||
+ | |1||557||1243||- | ||
+ | |- | ||
+ | |2||348||943||- | ||
+ | |- | ||
+ | |4||199||437 ||- | ||
+ | |- | ||
+ | |8||126||199||- | ||
+ | |- | ||
+ | |16||83||-||- | ||
+ | |} | ||
+ | |||
+ | The normalized table corresponding to the above table is: | ||
+ | |||
+ | {| | ||
+ | | STYLE="top"| | ||
+ | {|| border=3 align="Center" style="text-align: center;" | ||
+ | |- | ||
+ | |+'''Scaling on AMD Abu Dhabi''' | ||
+ | |- | ||
+ | !#processor!!blastx!!tblastx!!blastp | ||
+ | |- | ||
+ | |1||1.00||-||- | ||
+ | |- | ||
+ | |2||1.60||267||- | ||
+ | |- | ||
+ | |4||2.80|| ||- | ||
+ | |- | ||
+ | |8||4.42||-||- | ||
+ | |- | ||
+ | |16||6.71||-||- | ||
+ | |} | ||
+ | |||
+ | == With boost and python Libraries == | ||
+ | |||
+ | I read online that the Boost libraries might improve efficiency, so I compiled the code with Boost to see whether there is any performance improvement. | ||
+ | |||
+ | <source lang=make> | ||
+ | module load boost | ||
+ | ./configure --with-bin-release --without-debug --with-mt --with-boost=/apps/boost/1.53.0 | ||
+ | </source> | ||
+ | |||
+ | {| border= 3 align="Center" style="text-align: center;" | ||
+ | !Server!!with<br>boost!!without<br>boost | ||
+ | |- | ||
+ | |Intel||70||67 | ||
+ | |- | ||
+ | |AMD||125||126 | ||
+ | |} | ||
+ | |||
+ | I tried the Python libraries as well: | ||
+ | |||
+ | <source lang=make> | ||
+ | module load python | ||
+ | ./configure --with-bin-release --without-debug --with-mt --with-python=/apps/python/2.7.3 | ||
+ | </source> | ||
+ | |||
+ | The table below depicts the performance: | ||
+ | |||
+ | {| border= 3 align="Center" style="text-align: center;" | ||
+ | !Server!!with<br>python!!without<br>python | ||
+ | |- | ||
+ | |Intel||70||67 | ||
+ | |- | ||
+ | |AMD||126||126 | ||
+ | |} | ||
+ | |||
+ | Neither of these libraries seems to improve the performance of BLAST. | ||
+ | |||
+ | == Profiling == | ||
+ | |||
+ | For the blastx executable, we found the following profile: | ||
+ | |||
+ | |||
+ | % cumulative self self total | ||
+ | time seconds seconds calls s/call s/call name | ||
+ | 48.33 206.29 206.29 6838800 0.00 0.00 s_BlastAaWordFinder_TwoHit | ||
+ | 17.73 281.95 75.66 12644388 0.00 0.00 s_BlastSmallAaScanSubject | ||
+ | 11.13 329.45 47.50 801659273 0.00 0.00 BSearchContextInfo | ||
+ | 7.16 360.00 30.55 800757705 0.00 0.00 s_BlastAaExtendLeft | ||
+ | 4.90 380.92 20.92 800757705 0.00 0.00 s_BlastAaExtendTwoHit | ||
+ | 2.47 391.47 10.55 2311127628 0.00 0.00 ComputeTableIndexIncremental | ||
+ | 1.77 399.01 7.54 48008 0.00 0.00 BlastKarlinLHtoK | ||
+ | 1.43 405.12 6.12 211437430 0.00 0.00 s_BlastAaExtendRight | ||
+ | 1.19 410.21 5.09 1835792 0.00 0.00 Blast_SemiGappedAlign | ||
+ | |||
+ | |||
+ | In the "s_BlastAaWordFinder_TwoHit" subroutine, we have | ||
+ | |||
+ | index % time self children called name | ||
+ | 206.29 191.90 6838800/6838800 BlastAaWordFinder [9] | ||
+ | [10] 93.3 206.29 191.90 6838800 s_BlastAaWordFinder_TwoHit [10] | ||
+ | 75.66 11.13 12644388/12644388 s_BlastSmallAaScanSubject [15] | ||
+ | 20.92 36.67 800757705/800757705 s_BlastAaExtendTwoHit [16] | ||
+ | 47.44 0.00 800757705/801659273 BSearchContextInfo [17] | ||
+ | 0.08 0.00 6838800/6838800 Blast_UngappedStatsUpdate [194] | ||
+ | 0.00 0.01 901568/901568 BlastSaveInitHsp [880] | ||
+ | 0.01 0.00 6838800/6838800 Blast_ExtendWordExit [1357] | ||
+ | |||
+ | It seems that the blastx executable spends its time mostly in internal routines, not in any library routines such as FFTs. | ||
+ | |||
+ | == Shared vs Exclusive run on AMD servers (4 x 6378 @ 2.4 GHz) == | ||
+ | |||
+ | We descirbe the effect of shared FPU and L2-cache of the AMD server. | ||
+ | The results are summarized in the following table '''(# processes=8)''': | ||
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | !Run Scheme!!blastx | ||
+ | |- | ||
+ | |Shared<br> time(s)||127 | ||
+ | |- | ||
+ | |Exclusive<br>time (s)||112 | ||
+ | |} | ||
+ | |||
+ | In the above table, "shared" represents cores that share resources such as L2-Cache and FPU, while "exclusive" stands for cores which do not share any resources. | ||
+ | |||
+ | == Performance Comparison == | ||
+ | {| border= 3 align="Center" style="text-align: center;" | ||
+ | !Server!!blastx!!Notes | ||
+ | |- | ||
+ | |Intel||67||1 | ||
+ | |- | ||
+ | |Scaled Intel||92||2 | ||
+ | |- | ||
+ | |AMD (Shared)||127||1,3 | ||
+ | |- | ||
+ | |AMD(Exclusive)||112||1,4 | ||
+ | |- | ||
+ | |AMD Shared/AMD Exc.||1.13||- | ||
+ | |- | ||
+ | |AMD Exc./Intel (scaled)||1.22||- | ||
+ | |- | ||
+ | |AMD Shared/Intel (scaled)||1.38||- | ||
+ | |- | ||
+ | |AMD (16 proc)/<br>Intel(8 proc)||0.90|| | ||
+ | |} | ||
+ | |||
+ | <sup>1</sup> Intel: E5 2643 @ 3.3 GHz, AMD: Opteron 6378 @ 2.4 GHz<br> | ||
+ | <sup>2</sup> Scaling was done by the factor of 3.3/2.4=1.375 <br> | ||
+ | <sup>3</sup> "Shared" represents cores that share resources such as L2-Cache and FPU <br> | ||
+ | <sup>4</sup> "Exclusive" represents cores that don't share resources such as L2-Cache and FPU | ||
+ | |||
+ | <!-- = QUANTUM ESPRESSO = | ||
+ | |||
+ | Quantum Espresso is a plane wave density functional theory code used for the electronic structure calculations of materials. | ||
+ | |||
+ | ==Intel (2 x E5-2643 @ 3.30GHz)== | ||
+ | |||
+ | The test case for these runs is a self consistent calculation for energy of bulk copper. (For input file please ask Manoj Srivastava) | ||
+ | Following table demonstrates the scaling of the code with number of processors: | ||
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | ! 8<br>processors!! 4 proc<br> numanode=0!! 4 proc<br> numanode=0,1 | ||
+ | |- | ||
+ | |156||293||289 | ||
+ | |} | ||
+ | Clearly, we can see that there is no shared cache effect on the intel machine. | ||
+ | From our experience with VASP as well as profiling we know that the code spends most of its time in FFT libraries. Following table captures result of variation of FFT libraries: | ||
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | !FFTW-FFT!!MKL-FFT | ||
+ | |- | ||
+ | |156||140 | ||
+ | |} | ||
+ | |||
+ | ==AMD (2 x 6220 @ 3.0 GHz)== | ||
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | ! 16<br>processors!! 8 proc<br> Non-shared!! 8 proc<br> Shared | ||
+ | |- | ||
+ | |252||250||345 | ||
+ | |} | ||
+ | |||
+ | {| border=3 align="Center" style="text-align: center;" | ||
+ | !FFTW-FFT!!MKL-FFT | ||
+ | |- | ||
+ | |250||232 | ||
+ | |} | ||
+ | --> | ||
+ | |||
+ | =Compilation Table= | ||
+ | This table summarizes which compilers can be used to build each package. | ||
+ | {| border= 3 align="Center" style="text-align: center;" | ||
+ | !Package!!Intel!!GNU!!Open64 | ||
+ | |- | ||
+ | |VASP ||Y||-||N | ||
+ | |- | ||
+ | |LAMMPS||Y||Y||Y | ||
+ | |- | ||
+ | |GROMACS||Y||Y||Y | ||
+ | <!-- |- | ||
+ | |QE||Y||-||- --> | ||
+ | |- | ||
+ | |DL_POLY||Y||Y||Y | ||
+ | |- | ||
+ | |BLAST||Y||Y||N | ||
+ | |} |
Latest revision as of 08:47, 25 July 2013
STREAM
A few words about numactl
NUMA is an acronym for Non Uniform Memory Access, and numactl is a tool to assign memory to the node. Following are a few important keywords one should know before embarking on the numactl mission:
physcpubind = ID of the cores
cpunodebind = ID of the nodes
membind = ID of the node that the memory is assigned to
For example, on an AMD machine with 16 cores, or in the terminology of NUMA, 4 nodes with 4 cores on each node, the command line
–membind=0 –physcpubind=0-3
asigns four threads running on cores 0 to 3 (node 0) with the memory also assigned to the node 0. However, the command line
–membind=1 –physcpubind=0-3
assigns four threads on the cores 0 to 3 (node 0) but the memory is assigned to the node 1. As this memory is not local to the node that the threads are running on, the performance will be affected. Assigning memory locally to the node can also be done by ”-l” option of the numactl.
Alternatively, above command lines can be shortened by using "cpunodebind". For example,
–membind=0 –cpunodebind=0
means that the memory is assigned to node 0 and the threads are also running on node 0. One should note that with the use of "cpunodebind" the number of threads will be equal to the number of cores on the node, so in this case number of threads has to be equal to four. However, if we wish to run two threads on node 0, its only possible with "physcpubind". You have more control of running your threads with "physcpubind" as you can choose the cores that you wish to run your jobs on. For detail description please follow the manual page of numactl.
Intel (2 x E5-2643 @ 3.30GHz)
Streams is a well-known memory bandwidth benchmark. Before we attempt to find the maximum bandwidth, it's necessary to find out the architecture of the machine. The command "numactl --hardware" on this machine produces:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 32739 MB
node 0 free: 30624 MB
node 1 cpus: 4 5 6 7
node 1 size: 32768 MB
node 1 free: 31280 MB
node distances:
node 0 1
0: 10 21
1: 21 10
From the above result, we can conclude that there are two numa nodes with four cores on each: in total eight cores.
Before measuring the maximum memory bandwidth of the server, we first determine the number of threads required to achieve the maximum bandwidth of a given NUMA node. Results are summarized in the following table:
Number of threads |
Bandwidth (GB/s) |
---|---|
1 | 9.5 |
2 | 18.8 |
3 | 21.4 |
4 | 34.0 |
From the above table, we conclude that the maximum number of threads that we need to run on each node is four. Above table was obtained by running the threads on node 0 and assigning the memory on the same node as well. This result can be reproduced on other nodes as well.
The following table describes how the memory bandwidth changes as the memory allocation is varied with respect to the node on which the threads run (number of threads is four):
MEM \ CPU | 0 | 1 |
---|---|---|
0 | 34.0 | 17.4 |
1 | 18.9 | 33.5 |
In the above table, the memory nodes vary along the rows while the CPU nodes vary along the columns. You can clearly see the effect of memory binding with respect to the cores on which the threads are running. Note that this table resembles the node distance table obtained earlier with "numactl --hardware".
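For reference, a sketch of the kind of commands used to fill in one row of the table above (assuming an OpenMP build of STREAM; core and node IDs follow the "numactl --hardware" output shown earlier):

export OMP_NUM_THREADS=4
numactl --membind=0 --physcpubind=0-3 ./stream    # memory local to node 0  (~34 GB/s)
numactl --membind=1 --physcpubind=0-3 ./stream    # memory on remote node 1 (~17 GB/s)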
AMD (2 x 6220 @ 3.0 GHz)
This is an Interlagos machine with 16 cores (in NUMA terms, 4 nodes with 4 cores each). Each core has 4 GB of memory, giving the machine 64 GB in total. I compiled the code with the Open64 compiler. It is noteworthy that the GCC compiler gives about half the bandwidth of Open64, while the Intel compiler results on this machine vary (64 GB/s to 40 GB/s). "numactl --hardware" produces:
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 16382 MB
node 0 free: 2930 MB
node 1 cpus: 4 5 6 7
node 1 size: 16384 MB
node 1 free: 5082 MB
node 2 cpus: 8 9 10 11
node 2 size: 16384 MB
node 2 free: 2281 MB
node 3 cpus: 12 13 14 15
node 3 size: 16368 MB
node 3 free: 550 MB
node distances:
node 0 1 2 3
0: 10 16 16 16
1: 16 10 16 16
2: 16 16 10 16
3: 16 16 16 10
The following table shows the memory bandwidth on a single node as the number of threads is varied:
Number of threads | Bandwidth (GB/s) |
---|---|
1 | 14.0 |
2 | 15.0 |
3 | 17.8 |
4 | 18.5 |
Again, similar to the Intel machine, the maximum number of threads we need to run on each node is four.
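A sketch of the per-node thread sweep behind the table above (assuming an OpenMP build of STREAM; node 0 is used here):

for n in 1 2 3 4; do
  OMP_NUM_THREADS=$n numactl --physcpubind=0-3 --membind=0 ./stream
done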
The following table describes how the memory bandwidth changes as the memory allocation is varied with respect to the node on which the threads run (number of threads is four):
MEM \ CPU | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 18.1 | 11.8 | 6.5 | 5.6 |
1 | 11.8 | 18.7 | 5.5 | 6.5 |
2 | 6.5 | 5.5 | 18.5 | 11.6 |
3 | 5.6 | 6.5 | 11.8 | 18.5 |
Contrary to the Intel machine, the above table does not agree with the node distances reported by "numactl --hardware"!
AMD (4 x 6378 @ 2.4 GHz)
In NUMA terminology, this server has 8 nodes with 8 cores on each.
numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32765 MB
node 0 free: 29324 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 31892 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32768 MB
node 2 free: 31900 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32768 MB
node 3 free: 31911 MB
node 4 cpus: 32 33 34 35 36 37 38 39
node 4 size: 32768 MB
node 4 free: 31964 MB
node 5 cpus: 40 41 42 43 44 45 46 47
node 5 size: 32768 MB
node 5 free: 31942 MB
node 6 cpus: 48 49 50 51 52 53 54 55
node 6 size: 32768 MB
node 6 free: 31866 MB
node 7 cpus: 56 57 58 59 60 61 62 63
node 7 size: 32752 MB
node 7 free: 31960 MB
node distances:
node 0 1 2 3 4 5 6 7
0: 10 16 16 22 16 22 16 22
1: 16 10 22 16 22 16 22 16
2: 16 22 10 16 16 22 16 22
3: 22 16 16 10 22 16 22 16
4: 16 22 16 22 10 16 16 22
5: 22 16 22 16 16 10 22 16
6: 16 22 16 22 16 22 10 16
7: 22 16 22 16 22 16 16 10
Memory bandwidth on a single node by varying number of threads:
Number of threads | Bandwidth (GB/s) |
---|---|
1 | 13.0 |
2 | 14.1 |
3 | 17.1 |
4 | 17.4 |
5 | 17.1 |
6 | 16.7 |
7 | 16.6 |
8 | 16.1 |
The following table describes the variation in memory bandwidth as the memory allocation is changed with respect to the cores on which the threads run (number of threads = 4):
MEM \ CPU | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
0 | 17.3 | 8.0 | 5.6 | 4.1 | 5.7 | 4.1 | 5.5 | 4.0 |
1 | 8.2 | 17.6 | 6.5 | 6.5 | 4.0 | 5.5 | 4.0 | 5.4 |
2 | 5.7 | 6.5 | 17.9 | 7.9 | 5.6 | 4.1 | 5.6 | 4.1 |
3 | 4.1 | 6.5 | 8.1 | 17.8 | 4.1 | 5.6 | 4.1 | 5.7 |
4 | 5.6 | 4.0 | 5.7 | 4.2 | 17.7 | 7.9 | 5.7 | 4.1 |
5 | 4.0 | 5.6 | 4.1 | 5.6 | 8.1 | 17.7 | 4.0 | 5.5 |
6 | 5.4 | 4.0 | 5.6 | 4.1 | 5.7 | 4.1 | 17.8 | 7.9 |
7 | 3.9 | 5.4 | 4.0 | 5.6 | 4.2 | 5.6 | 8.1 | 17.7 |
Bandwidth in terms of Socket
For the AMD 6200 and 6300 series, a socket is two NUMA nodes combined. A socket has 16 cores on the 6378 server and 8 cores on the 6220 server. The memory bandwidth of a NUMA node peaks at about 4 threads, so we ask what the maximum bandwidth of a socket is. A reasonable guess from the previous results is to use 8 threads per socket, with 4 distributed over each NUMA node. If we run STREAM on 8 cores as follows:
numactl --physcpubind=0,1,2,3,8,9,10,11 --membind=0,1 ./stream
we get 34.7 GB/s memory bandwidth.
By running,
numactl --physcpubind=0,2,4,6,8,10,12,14 --membind=0,1 ./stream
also yields 35 GB/s bandwidth.
By varying the membind to different sockets as follows:
numactl --physcpubind=0,2,4,6,8,10,12,14 --membind=0,1 ./stream
numactl --physcpubind=0,2,4,6,8,10,12,14 --membind=2,3 ./stream
numactl --physcpubind=0,2,4,6,8,10,12,14 --membind=4,5 ./stream
numactl --physcpubind=0,2,4,6,8,10,12,14 --membind=6,7 ./stream
numactl --physcpubind=16,18,20,22,24,26,28,30 --membind=0,1 ./stream
numactl --physcpubind=16,18,20,22,24,26,28,30 --membind=2,3 ./stream
numactl --physcpubind=16,18,20,22,24,26,28,30 --membind=4,5 ./stream
numactl --physcpubind=16,18,20,22,24,26,28,30 --membind=6,7 ./stream
numactl --physcpubind=32,34,36,38,40,42,44,46 --membind=0,1 ./stream
numactl --physcpubind=32,34,36,38,40,42,44,46 --membind=2,3 ./stream
numactl --physcpubind=32,34,36,38,40,42,44,46 --membind=4,5 ./stream
numactl --physcpubind=32,34,36,38,40,42,44,46 --membind=6,7 ./stream
numactl --physcpubind=48,50,52,54,56,58,60,62 --membind=0,1 ./stream
numactl --physcpubind=48,50,52,54,56,58,60,62 --membind=2,3 ./stream
numactl --physcpubind=48,50,52,54,56,58,60,62 --membind=4,5 ./stream
numactl --physcpubind=48,50,52,54,56,58,60,62 --membind=6,7 ./stream
we get the following table (in terms of sockets, i.e. nodes 0-1 form socket 1, nodes 2-3 form socket 2, and so on):
MEM \ CPU | 1 | 2 | 3 | 4 |
---|---|---|---|---|
1 | 35.2 | 11.2 | 11.0 | 10.7 |
2 | 11.3 | 35.3 | 11.2 | 11.1 |
3 | 10.9 | 11.2 | 35.2 | 11.0 |
4 | 10.7 | 11.1 | 11.1 | 35.4 |
VASP
This page describes benchmarking of the Vienna Ab-initio Simulation Package (VASP), a plane-wave density functional theory code used to study the electronic structure of materials.
Intel (2 x E5-2643 @ 3.30GHz)
Native FFT Library
Following libraries and flags were used:
MKLDIR = $(HPC_MKL_DIR)
MKLLIBS = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTLIB = -lfftw3xf
INCS = -I$(MKLDIR)/include/fftw
FFT_OBJS = fftmpi.o fftmpi_map.o fftw3d.o fft3dlib.o
FFLAGS = -free -names lowercase -assume byterecl
OFLAG = -O2 -xsse2 -unroll-aggressive -warn general
As a first check, the Streaming SIMD Extension (SSE) instruction set was varied; the following are the results of a self-consistent field (SCF) calculation for MgMOS (for input files, please ask Charles Taylor or Manoj Srivastava):
SIMD Instruction | Time(s) |
---|---|
sse2 | 158 |
sse4.1 | 156 |
sse4.2 | 155 |
avx | 155 |
ssse3 | 156 |
There does not seem to be a significant impact of SSE sets on the run time of VASP.
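For reference, only the instruction-set switch in the OFLAG line presumably needs to change between these builds; for example (flags illustrative):

OFLAG = -O2 -xSSE4.2 -unroll-aggressive -warn general
OFLAG = -O2 -xAVX -unroll-aggressive -warn general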
MKL FFTs (via FFTW wrappers)
Upon profiling the code, we found that it spends most of its time in the FFT libraries, so the next step was to change the FFT library. The following changes were made:
FFT_OBJS = fftmpi_map.o fftmpiw.o fftw3d.o fft3dlib.o
(The change here is replacement of "fftmpi.o" in the original VASP makefile with "fftmpiw.o")
MKLDIR = $(HPC_MKL_DIR)
MKLLIBS = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTLIB = -lfftw3xf
INCS = -I$(MKLDIR)/include/fftw
FFLAGS = -free -names lowercase -assume byterecl
OFLAG = -O2 -xsse2 -unroll-aggressive -warn general
Following table depicts the run time variation with SIMD instruction sets:
SIMD Instruction | Time(s) |
---|---|
sse2 | 97 |
sse4.1 | 95 |
sse4.2 | 94 |
avx | 94 |
ssse3 | 94 |
In conclusion, upon making this change (using "fftmpiw.o" as opposed to "fftmpi.o"), a significant improvement of about 60% in the run time of the code was found on the Intel machine (E5-2643 @ 3.30GHz).
FFTW FFTs
We further compiled VASP by using FFT library from the FFTW package with following flags:
MKLDIR = $(HPC_MKL_DIR)
MKLLIBS = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTWDIR = /apps/fftw/3.3.2
FFTLIB = -L$(FFTWDIR)/lib -lfftw3
INCS = -I$(FFTWDIR)/include
FFT_OBJS = fftmpi_map.o fftmpiw.o fftw3d.o fft3dlib.o
FFLAGS = -free -names lowercase -assume byterecl
OFLAG = -O2 -xsse2 -unroll-aggressive -warn general
From our previous experience, we concluded that the performance of VASP does not depend substantially on the SIMD instruction set, so for the FFTW library we only tried one set. Following is the result:
SIMD Instruction | Time(s) |
---|---|
sse2 | 118 |
We conclude that the FFTs from the MKL library are faster than the ones from FFTW.
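A quick way to confirm which FFT implementation a given vasp binary actually links against (illustrative; the binary name is an assumption):

ldd ./vasp | grep -i -E 'fftw|mkl'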
AMD (2 x 6220 @ 3.0 GHz)
This machine has 16 cores (in numactl terminology, 4 NUMA nodes with 4 cores on each node). As the performance of VASP depends heavily on the choice of FFT library, we checked the performance of this machine with different FFTs, namely the FFT provided with the VASP package, MKL, and FFTW. We built the FFTW libraries with various flags to see whether we could find a better choice of FFT. The libraries and flags used to compile VASP are as follows (the FFT libraries were changed depending on which FFT we wanted to use):
MKLDIR = $(HPC_MKL_DIR)
MKLLIBS = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTWDIR = /apps/fftw/3.3.2
FFTLIB = -L$(FFTWDIR)/lib -lfftw3
INCS = -I$(FFTWDIR)/include
FFT_OBJS = fftmpi_map.o fftmpiw.o fftw3d.o fft3dlib.o
FFLAGS = -free -names lowercase -assume byterecl
OFLAG = -O2 -xsse2 -unroll-aggressive -warn general
From a computer architecture point of view, the Bulldozer core (aka module) of the AMD server lies between a true dual-core processor and a single-core processor with simultaneous multithreading. The two cores of a module share some resources, such as the L2 cache and the floating point unit (FPU), so the performance of a code is affected by whether its threads run on shared cores or on exclusive cores. For detailed information about the Bulldozer core, please have a look at http://en.wikipedia.org/wiki/Bulldozer_%28microarchitecture%29
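A sketch of how the two placements can be requested on this 16-core machine (core IDs are illustrative and assume, as the socket runs later on this page do, that even-numbered cores fall on distinct modules; rank-to-core pinning details are left to the MPI runtime):

# "shared": 8 processes on 4 modules, both cores of each module in use
mpirun -np 8 numactl --physcpubind=0-7 --membind=0,1 ./vasp
# "exclusive": 8 processes on 8 modules, one core per module
mpirun -np 8 numactl --physcpubind=0,2,4,6,8,10,12,14 --membind=0,1,2,3 ./vasp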
The results are summarized in the following table (8 processor runs):
Run Scheme | Native | MKL | FFTW | FFTW | FFTW | FFTW | FFTW | FFTW |
---|---|---|---|---|---|---|---|---|
Shared time(s) | 399 | 261 | 333 | 319 | 334 | 336 | 315 | 319 |
Exclusive time (s) | 274 | 159 | 217 | 219 | 215 | 217 | 213 | 211 |
Notes | - | - | 1 | 2 | 3 | 4 | 5 | 6 |
In the above table, "shared" represents cores that share resources such as L2-Cache and FPU, while "exclusive" stands for cores which do not share any resources.
You can clearly see the effect of FPU sharing on this server for all the FFT libraries. Also, as on the Intel server, the FFTs from the MKL library work better than any of the other libraries.
1 Default compiler Flags were used to build FFT.
2 CFLAGS=-O3, FFLAGS=-O3, -enable sse2
3 enable-mpi CFLAGS=-O3, FFLAGS=-O3, -enable sse2
4 CC='opencc -march=bdver1' F77='openf90 -march=bdver1' CFLAGS='-msse3 -msse4.1 -msse4.2 -msse4a -mfma4 -O2' FFLAGS='-msse3 -msse4.1 -msse4.2 -msse4a -mfma4 -O2' --enable-fma --enable-mpi
5 FFLAGS/ CFLAGS="-OPT:Ofast -mavx -mfma4 -march=bdver1 -O3 -fomit-frame-pointer -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4-malign-double -fstrict-aliasing -fno-schedule-insns -ffast-math"
6 ufhpc compiler options. FFTWDIR = /apps/fftw/3.3.2
Performance Comparison
Following is a summary of the results for the MgMOS test case run on the Intel and AMD servers with 8 processors.
Server | Native | MKL | FFTW |
---|---|---|---|
Intel | 158 | 97 | 118 |
Intel (Scaled) | 174 | 106 | 130 |
AMD (Shared) | 399 | 261 | 319 |
AMD (Exclusive) | 274 | 159 | 211 |
AMD Shared/AMD Exc. | 1.46 | 1.64 | 1.51 |
AMD Exc./Intel (scaled) | 1.57 | 1.50 | 1.62 |
Notes | - | - | 1 |
In the above table, "shared" represents cores that share resources such as L2-Cache and FPU, while "exclusive" stands for cores which do not share any resources.
In summary--
1. VASP runs better (~1.5x) on the Intel server than on the AMD server, even with the interleaved cores.
2. FPU sharing slows the AMD runs down by a factor of about 1.5.
3. The FFTs from MKL give the best-performing builds.
1 Compiled by UFHPC (Charles Taylor or Craig Prescott)
LAMMPS
Scaling with Number of Processors
LAMMPS is compiled with the following flags:
module load intel openmpi
CC = mpiCC
CCFLAGS = -O2 -xsse2
FFT_INC = -I$(HPC_MKL_DIR)/include/fftw
FFT_PATH =
FFT_LIB = -L$(HPC_MKL_DIR)/lib/intel64 -lfftw3xc -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
The benchmarking runs use the input files provided with the package (LJ = atomic fluid, Lennard-Jones potential with 2.5 sigma cutoff, 55 neighbors per atom, NVE integration). The following table describes the variation of run time with the number of processors on the Intel server:
# processors | Time(s) |
---|---|
8 | 158 |
4 | 309 |
1 | 1139 |
We find nearly linear scaling with the number of processors on the Intel machine.
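For reference, a sketch of how these benchmark runs are launched (the executable name is illustrative; in.lj is the input file shipped in the LAMMPS bench directory):

mpirun -np 8 ./lmp_openmpi < in.lj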
We also ran the "lj" benchmark on the AMD server (for comparison, we provide results on the Intel server as well):
# processors | lj Time(s) | chain Time(s) | eam Time(s) | rhodo Time(s) | Notes |
---|---|---|---|---|---|
16 | 180 | 84 | 476 | 2877 | - |
8 | 329 | 149 | 908 | 5506 | - |
4 | 547 | 248 | 1509 | 9398 | - |
1 | 1651 | 724 | 4708 | - | - |
Intel (8 proc) | 158 | 67 | 396 | 2361 | 1 |
Scaled Intel (8 proc) | 217 | 92 | 545 | 3246 | 2 |
Scaled Intel(8 proc)/ AMD (16 proc) | 1.20 | 1.15 | 1.14 | 1.13 | 3 |
For all the test cases, the (frequency-scaled) 8-thread runs on the Intel server are slower than the 16-thread runs on the AMD server by about 15%.
1 Intel: E5 2643 @ 3.3 GHz, AMD: Opteron 6378 @ 2.4 GHz
2 Scaling was done by the factor of 3.3/2.4=1.375
3 Comparison of 8 processors run on Intel vs 16 processors run on AMD
Comparison of Intel, Open64 and GNU Builds
LAMMPS with Intel compiler:
module load intel openmpi
CC = mpiCC
CCFLAGS = -O2 -msse2
FFT_INC = -I$(HPC_MKL_DIR)/include/fftw
FFT_PATH =
FFT_LIB = -L$(HPC_MKL_DIR)/lib/intel64 -lfftw3xc -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
LAMMPS with open64 compiler:
module load open64/.4.5.2 openmpi
CC = mpiCC
CCFLAGS = -O2 -msse2
MPI_DIR = /usr/mpi/open64/openmpi-1.6
MPI_INC = -I$(MPI_DIR)/include
MPI_LIB = -L$(MPI_DIR)/lib64 -lmpi
MPI_PATH =
FFT_DIR = /home/manoj/FFTW/charlie/3.3.2
FFT_INC = -I$(FFT_DIR)/include/fftw3
FFT_PATH =
FFT_LIB = -L$(FFT_DIR)/lib -lfftw3
LAMMPS with gnu compiler:
module load gcc/.4.7.2 openmpi
CC = g++
CCFLAGS = -O2 -msse2
MPI_DIR = /usr/mpi/gnu/openmpi-1.6
MPI_INC = -I$(MPI_DIR)/include
MPI_LIB = -L$(MPI_DIR)/lib64 -lmpi -lmpi_cxx
MPI_PATH =
FFT_DIR = /home/manoj/FFTW/gnu/3.3.2
FFT_INC = -I$(FFT_DIR)/include/fftw3
FFT_PATH =
FFT_LIB = -L$(FFT_DIR)/lib -lfftw3
For testing, we only ran "lj" benchmark and found:
Compiler | Intel Basic Time(s) | Intel Fancy Time(s) | AMD Basic Time(s) | AMD Fancy Time(s) |
---|---|---|---|---|
Intel | 158 | 151 | 329 | 321 |
Open64 | 173 | - | 352 | 337 |
GNU | 152 | 145 | 341 | 320 |
NOTES | 1 | 2 | 1,3 | 2,3 |
1 Basic Flags:
Intel: -O2 -msse2
Open64: -O2 -msse2
GNU: -O2 -msse2
2 Fancy Flags:
Intel: -O2 -mavx -unroll-aggresive -ipo -opt-prefetch -use-intel-optimized-headers
Open64: CCFLAGS =-OPT:Ofast -mavx -mfma4 -march=bdver1 -O2 -fomit-frame-pointer -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4-malign-double -fstrict-aliasing -fno-schedule-insns -ffast-math
GNU: CCFLAGS= -O2 -mavx -fsched-pressure -flto -funroll-all-loops -fprefetch-loop-arrays -minline-all-stringops -fno-tree-pre -ftree-vectorize
3 Runs on AMD servers are "naive":caches are shared and so are FPUs (Floating point unit)
Intel (2 x E5-2643 @ 3.30GHz)
Intel Compiler and SIMD Sets
We used Intel compiler as follows:
module load intel openmpi
CC = mpiCC
CCFLAGS = -O2 -xSSE2
FFT_INC = -I$(HPC_MKL_DIR)/include/fftw
FFT_PATH =
FFT_LIB = -L$(HPC_MKL_DIR)/lib/intel64 -lfftw3xc -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
Following table shows variation of Streaming SIMD Extension (SSE) sets(# threads=8):
Extension set | Time(s) |
---|---|
sse2 | 158 |
sse3 | 157 |
ssse3 | 157 |
sse4.1 | 158 |
sse4.2 | 157 |
avx | 152 |
"avx" instruction set is slightly better than the other sets!
The binaries for the SIMD sets above were built with the "-x" option, which on the AMD server does not work for any instruction set other than sse2. So, as the next step, we built the binaries with the "-m" option and ran them on the Intel and AMD servers to check whether the same binaries run successfully on both. The following table shows the results for the "lj" benchmark:
SIMD Instruction | Intel Time(s) | AMD Time(s) |
---|---|---|
sse2 | 158 | 329 |
sse3 | 157 | 329 |
ssse3 | 157 | 329 |
sse4.1 | 158 | 330 |
sse4.2 | 157 | 329 |
avx | 152 | 319 |
Notes | - | 1 |
1 Runs on AMD servers are "naive":caches are shared and so are FPUs (Floating point unit).
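For reference, the -x and -m builds above differ only in how the SIMD switch is spelled in CCFLAGS; for example (illustrative):

CCFLAGS = -O2 -xAVX    # -x: Intel-specific code paths (these did not run on the AMD server)
CCFLAGS = -O2 -mavx    # -m: portable code for the given instruction set (runs on both servers)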
Clearly, on both the Intel and the AMD server, the "avx" instructions are the better choice for the "lj" benchmark. We also ran the other benchmarks across the SIMD sets:
SIMD Instruction | chain (Intel) | chain (AMD) | eam (Intel) | eam (AMD) | rhodo (Intel) | rhodo (AMD) |
---|---|---|---|---|---|---|
sse2 | 67 | 149 | 396 | 908 | 2361 | 5506 |
sse3 | 67 | 149 | 398 | 908 | 2355 | 5486 |
ssse3 | 66 | 149 | 399 | 907 | 2359 | 5485 |
sse4.1 | 68 | 148 | 395 | 908 | 2351 | 5420 |
sse4.2 | 66 | 148 | 396 | 909 | 2346 | 5479 |
avx | 65 | 145 | 387 | 897 | 2290 | 5360 |
Notes | - | 1 | - | 1 | - | 1 |
For all the benchmarks, "avx" seems to be a better choice compared to other instruction sets.
1 Runs on AMD servers are "naive":caches are shared and so are FPUs (Floating point unit).
MKL vs FFTW FFTs
We profiled the code to see where it spends most of its time. Below is a summary of the time spent in the FFT routines for each benchmark that we tried.
lj:
time  seconds  seconds  calls  s/call  s/call  name
0.00   185.61     0.00      1    0.00    0.00  LAMMPS_NS::FFT3d::timing1d(double*, int, int)
0.00   185.61     0.00      1    0.00    0.00  fft_1d_only
chain:
time  seconds  seconds  calls  s/call  s/call  name
0.00    70.62     0.00      1    0.00    0.00  LAMMPS_NS::FFT3d::timing1d(double*, int, int)
0.00    70.62     0.00      1    0.00    0.00  fft_1d_only
eam:
time  seconds  seconds  calls  s/call  s/call  name
0.00   381.72     0.00      1    0.00    0.00  LAMMPS_NS::FFT3d::timing1d(double*, int, int)
0.00   381.72     0.00      1    0.00    0.00  fft_1d_only
rhodo:
time  seconds  seconds     calls  s/call  s/call  name
0.81   118.18     1.04  10423040    0.00    0.00  kf_work(FFT_DATA*, FFT_DATA const*, unsigned long, int, int*, kiss_fft_state*)
0.72   119.10     0.92  31269120    0.00    0.00  kf_bfly4(FFT_DATA*, unsigned long, kiss_fft_state*, unsigned long)
We can clearly see that the code does not spend any significant time in the FFT routines for any of the benchmarks, so changing the FFT from MKL to FFTW should not change the performance at all. As a check, we built LAMMPS with the FFTW FFTs using:
module load intel openmpi fftw
CC = mpiCC
CCFLAGS = -O2 -mavx
FFT_DIR = /apps/fftw/3.3.2
FFT_INC = -I$(FFT_DIR)/include/fftw3
FFT_PATH =
FFT_LIB = -L$(FFT_DIR)/lib -lfftw3
For the test, we ran "lj" benchmark on the Intel server and found:
FFT | Time(s) |
---|---|
MKL | 152 |
FFTW | 152 |
As expected, there is no difference between the FFTs from FFTW and MKL in the performance of LAMMPS.
AMD (4 x 6378 @ 2.4 GHz)
In this section, we describe the effect of the shared FPUs and cache of the AMD server on the performance of LAMMPS. The results are summarized in the following table (# threads = 8):
Run Scheme | lj | chain | eam | rhodo |
---|---|---|---|---|
Shared time(s) | 319 | 145 | 897 | 5360 |
Exclusive time (s) | 277 | 126 | 778 | 4426 |
In the above table, "shared" represents cores that share resources such as L2-Cache and FPU, while "exclusive" stands for cores which do not share any resources.
Performance Comparison
Following is a table for performance comparison of Intel and AMD servers when the job was run using 8 threads.
Server | lj | chain | eam | rhodo |
---|---|---|---|---|
Intel | 158 | 65 | 387 | 2290 |
Intel (Scaled) | 217 | 89 | 532 | 3149 |
AMD (Shared) | 319 | 145 | 897 | 5360 |
AMD (Exclusive) | 277 | 126 | 778 | 4426 |
AMD Shared/AMD Exc. | 1.15 | 1.15 | 1.15 | 1.21 |
AMD Exc./Intel (scaled) | 1.28 | 1.42 | 1.46 | 1.41 |
In the above table, "shared" represents cores that share resources such as L2-Cache and FPU, while "exclusive" stands for cores which do not share any resources.
In summary--
1. LAMMPS runs better (~1.3-1.5X) on the Intel server than on the AMD server, even with the interleaved cores.
2. FPU sharing slows the AMD runs down by a factor of about 1.2.
GROMACS
Comparison of Intel, Open64 and GNU Builds
Intel compiler:
module load intel openmpi
export F77=mpif77
export F90=mpif90
export CC=mpicc
export CFLAGS="-O2 -msse2"
export FFLAGS="-O2 -msse2"
./configure --prefix=/home/manoj/profile/gromacs/gromacs-4.5.5 --enable-shared=yes --enable-mpi --without-x --disable-float --with-fft=mkl LIBS="-L/opt/intel/composerxe
/lib/intel64 -lfftw3xc -lmkl_intel_lp64 -lmkl_sequential -lmkl_core"
make
make install
Open64 compiler:
module load open64/.4.5.2 openmpi
export F77=openf90
export F90=openf90
export CC=opencc
export CFLAGS="-O2 -msse2"
export FFLAGS="-O2 -msse2"
export CPPFLAGS="-I/home/manoj/FFTW/fpic-charlie/3.3.2/include"
export LDFLAGS="-L/home/manoj/FFTW/fpic-charlie/3.3.2/lib"
./configure --prefix=/home/manoj/profile/gromacs/gromacs-4.5.5 --enable-shared=yes --enable-mpi --without-x --disable-float
make
make install
GNU compiler:
module load gcc/.4.7.2 openmpi
export F77=gfortran
export F90=f95
export CC=gcc
export CFLAGS="-O2 -msse2"
export FFLAGS="-O2 -msse2"
export CPPFLAGS="-I/home/manoj/FFTW/gnu/3.3.2/include"
export LDFLAGS="-L/home/manoj/FFTW/gnu/3.3.2/lib"
./configure --prefix=/home/manoj/profile/gromacs/gromacs-4.5.5 --enable-shared=yes --enable-mpi --without-x --disable-float
make
make install
There are some test cases in the "gromacs-4.5.5/share/tutor" directory; however, not all of them work. So far, I could only get "water", "methane", and "mixed" to work. Instructions for running the MD simulations are on the http://manual.gromacs.org/online/water.html page. You first need to create a ".tpr" file using
./grompp_d -v
After this, you can run "mdrun_d" for the molecular dynamics simulation.
There is a mistake in the "grompp.mdp" input file provided by GROMACS: the line starting with "bd-temp" has to be commented out. An internet search suggests that this "grompp.mdp" is an input file for an older version of GROMACS, and some of its parameters have apparently become obsolete.
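A sketch of the fix and of a typical run (the ";" is the comment character in .mdp files; file names assume the GROMACS defaults such as topol.tpr):

; in grompp.mdp, prefix the obsolete line with ";" so that it is ignored:
; bd-temp = ...

./grompp_d -v                 # writes topol.tpr from grompp.mdp, conf.gro and topol.top
mpirun -np 8 ./mdrun_d -v     # runs the MD simulation from topol.tpr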
The following results are for the MD simulation of water using 8 processors on the Intel (E5-2643) and AMD (Opteron 6378) servers:
Compiler | Intel Basic Time(s) | Intel Fancy Time(s) | AMD Basic Time(s) | AMD Fancy Time(s) |
---|---|---|---|---|
Intel | 157 | 157 | 361 | 363 |
Open64 | 167 | - | 392 | 383 |
GNU | 160 | - | 377 | 368 |
NOTES | 1 | 2 | 1,3 | 2,3 |
1 Basic Flags:
Intel: -O2 -msse2
Open64: -O2 -msse2
GNU: -O2 -msse2
2 Fancy Flags:
Intel: -O2 -mavx -unroll-aggressive -opt-prefetch -use-intel-optimized-headers
Open64: CCFLAGS= -OPT:Ofast -mavx -mfma4 -march=bdver1 -O2 -fomit-frame-pointer -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4 -malign-double -fstrict-aliasing -fno-schedule-insns
GNU: CCFLAGS= -O2 -mavx -fsched-pressure -flto -funroll-all-loops -fprefetch-loop-arrays -minline-all-stringops -fno-tree-pre -ftree-vectorize
3 Runs on the AMD server are "naive": caches are shared and so are the FPUs (floating point units).
== Scaling with Number of Processors ==
"Water" benchmark was run using gromacs compiled with the intel and openmpi (fancy flags) as shown on above section.
Following table describes the variation of run time with number of processors on the Intel (E5-2643) server:
{| border=3 align="Center" style="text-align: center;"
|-
!# processors!!Time(s)!!Speed-up factor
|-
|1||950||1.00
|-
|4||252||3.76
|-
|8||157||6.05
|}
We find close-to-linear scaling with the number of processors on the Intel server.
We also ran the same benchmark on the AMD (Opteron 6378) server (for comparison, we include the Intel results as well):
{| border=3 align="Center" style="text-align: center;"
|-
!# processors!!water <br> Time(s)!!Notes
|-
|16||241||-
|-
|8||363||-
|-
|4||532||-
|-
|1||1288||-
|-
|Intel (8 proc)||157||1
|-
|Scaled Intel (8 proc)||216||2
|-
|AMD (16 proc)/Scaled Intel (8 proc)||1.12||3
|}
1 Intel: E5-2643 @ 3.3 GHz, AMD: Opteron 6378 @ 2.4 GHz
2 Scaling was done by the factor of 3.3/2.4 = 1.375
3 Comparison of the 8-processor run on Intel vs the 16-processor run on AMD
== Instruction Set Dependence ==
GROMACS is compiled with the following flags:
<source lang=make>
module load intel openmpi mkl
export F77=mpif77
export F90=mpif90
export CC=mpicc
export CFLAGS="-O2 -msse2 -unroll-aggressive -opt-prefetch -use-intel-optimized-headers"
export FFLAGS="-O2 -msse2 -unroll-aggressive -opt-prefetch -use-intel-optimized-headers"
./configure --prefix=/home/manoj/profile/gromacs/gromacs-4.5.5 --enable-shared=yes --enable-mpi --without-x --disable-float --with-fft=mkl LIBS="-L/opt/intel/composerxe/lib/intel64 -lfftw3xc -lmkl_intel_lp64 -lmkl_sequential -lmkl_core"
make
make install
</source>
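The instruction-set scan itself can be scripted; the loop below is our illustration (the install prefixes and the make invocation are assumptions), not the exact commands used to produce the table.
<source lang=make>
for simd in sse2 sse3 ssse3 sse4.1 sse4.2 avx; do
    export CFLAGS="-O2 -m$simd -unroll-aggressive -opt-prefetch -use-intel-optimized-headers"
    export FFLAGS="$CFLAGS"
    ./configure --prefix=$HOME/gromacs-$simd --enable-mpi --without-x --disable-float --with-fft=mkl
    make clean && make && make install
done
</source>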
The following table captures the dependence of run time (in seconds) on the SIMD instruction set for the Intel (E5-2643) and AMD (Opteron 6378) machines with 8 processes:
{| border=3 align="Center" style="text-align: center;"
|-
!SIMD Instruction!!Intel <br> Time(s)!!AMD <br> Time(s)
|-
|sse2||158||364
|-
|sse3||158||362
|-
|ssse3||157||362
|-
|sse4.1||159||360
|-
|sse4.2||157||362
|-
|avx||157||363
|-
|Notes||-||1
|}
GROMACS performance appears to be independent of the SIMD instruction set chosen at compile time.
1 Runs on the AMD server are "naive": caches are shared and so are the FPUs (floating point units).
== MKL vs FFTW ==
There is a problem with profiling the code: we do not see as many subroutines as we would like. The Intel compiler is still OK, but the GNU compiler is worse: it shows only one subroutine. We do not see any FFT routine in the test case we are using.
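The flat profiles below are gprof-style output. A minimal sketch of how such profiles can be generated is given here; the "-pg" flags and the single-rank run are assumptions, since the exact profiling setup is not recorded.
<source lang=make>
export CFLAGS="-O2 -msse2 -pg"
export FFLAGS="-O2 -msse2 -pg"
./configure --enable-mpi --without-x --disable-float --with-fft=mkl
make && make install
mpirun -np 1 mdrun_d -v                 # writes gmon.out in the working directory
gprof `which mdrun_d` gmon.out | head -40
</source>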
=== Intel Compiler ===
<source lang=make>
% time  cumulative s  self s  calls  self s/call  total s/call  name
 68.42    0.13   0.13        6   21.67   28.33  do_md
 15.79    0.16   0.03  2160054    0.00    0.00  copy_rvec
  5.26    0.17   0.01   600018    0.00    0.00  clear_mat
  5.26    0.18   0.01                           _intel_fast_memcpy
  5.26    0.19   0.01                           _intel_fast_memcpy.P
  0.00    0.19   0.00   720018    0.00    0.00  copy_mat
  0.00    0.19   0.00        6    0.00   28.33  mdrunner
  0.00    0.19   0.00        3    0.00    0.00  copy_rvec
  0.00    0.19   0.00        1    0.00    0.00  copy_mat
  0.00    0.19   0.00        1    0.00    0.00  get_nthreads
  0.00    0.19   0.00        1    0.00    0.00  mdrunner_start_threads
</source>
=== GNU Compiler ===
<source lang=make>
% time  cumulative s  self s  calls  self s/call  total s/call  name
100.01    0.02   0.02        1   20.00   20.00  do_md
</source>
To see the FFT dependence, we built GROMACS with FFTW FFTs using:
<source lang=make>
module load intel openmpi fftw
export F77=mpif77
export F90=mpif90
export CC=mpicc
export CFLAGS="-O2 -mavx -unroll-aggressive -opt-prefetch -use-intel-optimized-headers"
export FFLAGS="-O2 -mavx -unroll-aggressive -opt-prefetch -use-intel-optimized-headers"
#export CFLAGS="-O2 -msse2"
#export FFLAGS="-O2 -msse2"
export CPPFLAGS="-I/apps/fftw/3.3.2/include"
export LDFLAGS="-L/apps/fftw/3.3.2/lib -lfftw3"
./configure --prefix=/home/manoj/profile/gromacs/gromacs-4.5.5 --enable-shared=yes --enable-mpi --without-x --disable-float
make
make install
</source>
We ran "water" benchmark on the Intel(E5-2643) and AMD (Opetran-6378) servers using 8 processors and found:
{| border=3 align="Center" style="text-align: center;"
|-
!FFT!!Intel <br> Time(s)!!AMD <br> Time(s)
|-
|MKL||157||363
|-
|FFTW||157||360
|}
The choice of FFT library (MKL or FFTW) appears to make no difference to the performance of GROMACS.
We now describe the effect of the AMD server's shared FPUs and L2 cache. The results are summarized in the following table (# processes = 8):
{| border=3 align="Center" style="text-align: center;"
|-
!Run Scheme!!Time(s)
|-
|Shared||363
|-
|Exclusive||266
|}
In the above table, "shared" represents cores that share resources such as L2-Cache and FPU, while "exclusive" stands for cores which do not share any resources.
== Performance Comparison ==
The following table compares the performance of the Intel and AMD servers when the job was run using 8 processors:
{| border=3 align="Center" style="text-align: center;"
|-
!Server!!Time(s)!!Notes
|-
|Intel||157||1
|-
|Intel (Scaled)||216||2
|-
|AMD (Shared)||363||3
|-
|AMD (Exclusive)||266||4
|-
|AMD Shared/AMD Exc.||1.36||-
|-
|AMD Exc./Intel (Scaled)||1.23||-
|}
1 Intel: E5-2643 @ 3.3 GHz, AMD: Opteron 6378 @ 2.4 GHz
2 Scaling was done by the factor of 3.3/2.4 = 1.375 (e.g., 157 s × 1.375 ≈ 216 s)
3 "Shared" represents cores that share resources such as L2-Cache and FPU
4 "Exclusive" represents cores that don't share resources such as L2-Cache and FPU
= DL_POLY =
== Comparison of Intel, Open64 and GNU Builds ==
Intel compiler:
<source lang=make>
module load intel openmpi
$(MAKE) LD="mpif90 -v -o " \
        LDFLAGS="-shared-intel" \
        FC="mpif90 -c" \
        FCFLAGS="-O3 -mavx -opt-prefetch -use-intel-optimized-headers" \
        EX=$(EX) BINROOT=$(BINROOT) $(TYPE)
</source>
GNU compiler:
<source lang=make>
module load gcc/.4.7.2 openmpi
$(MAKE) LD="mpif90 -v -o " \
        LDFLAGS=" " \
        FC="mpif90 -c" \
        FCFLAGS="-O3 -mavx -fsched-pressure -flto -funroll-all-loops -fprefetch-loop-arrays -minline-all-stringops -fno-tree-pre -ftree-vectorize" \
        EX=$(EX) BINROOT=$(BINROOT) $(TYPE)
</source>
Open64 compiler:
Open64 shows a problem in the subroutine config_module.f90 at line 62. This subroutine resizes the length of an array. The line
<source lang=fortran>
Character( Len = * ), Allocatable, Intent( InOut ) :: a(:)
</source>
makes "a" an allocatable array of strings, but the length of the strings is not defined. The Intel and GNU compilers can handle this, but the Open64 compiler cannot. We made one modification to the assignment at this line (interested readers should look at the code) to get the Open64 build to work.
<source lang=make>
module load open64/.4.5.2 openmpi
$(MAKE) LD="mpif90 -v -o " \
        LDFLAGS=" " \
        FC="mpif90 -c" \
        FCFLAGS="-OPT:Ofast -mavx -mfma4 -march=bdver1 -O2 -fomit-frame-pointer -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4 -malign-double -fstrict-aliasing -fno-schedule-insns -ffast-math " \
        EX=$(EX) BINROOT=$(BINROOT) $(TYPE)
</source>
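For completeness, a sketch of how these Makefile fragments are typically invoked is shown below; the makefile name, its location and the platform target ("hpc") are assumptions about this particular DL_POLY source tree.
<source lang=make>
cd dl_poly/source
cp ../build/Makefile_MPI Makefile   # parallel makefile containing the rules above
make hpc                            # ends up in the $(MAKE) ... $(TYPE) rule shown above
</source>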
We downloaded 42 benchmarks from the DL_POLY website. Since this is a large number, we looked into the profiling results and found that only 15 of the cases actually exercise distinct code paths inside the DL_POLY package, so we present our results for those.
The following results are for the various benchmarks using 8 processors on the Intel (E5-2643) server:
From the above table, we conclude that the Intel build with fancy flags is the better choice for this code.
We also compared the Intel and Open64 builds on the AMD server; the following table depicts that comparison:
1 Basic Flags:
Intel: -O2 -msse2
Open64: -O2 -msse2
GNU: -O2 -msse2
2 Fancy Flags:
Intel: -O2 -mavx -opt-prefetch -use-intel-optimized-headers
GNU: CCFLAGS= -O2 -mavx -fsched-pressure -flto -funroll-all-loops -fprefetch-loop-arrays -minline-all-stringops -fno-tree-pre -ftree-vectorize
Open64: -OPT:Ofast -mavx -mfma4 -march=bdver1 -O2 -fomit-frame-pointer -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4 -malign-double -fstrict-aliasing -fno-schedule-insns -ffast-math
=== More studies on Test 14 ===
Intel compiler:
Intel server: CPU time = 39 s, wall time = 49 s
<source lang=make>
% time  cumulative s  self s  calls  self s/call  total s/call  name
 13.56    4.85   4.85      182    0.03    0.03  constraints_shake_vv_
  9.09    8.10   3.25       14    0.23    0.27  link_cell_pairs_
  7.94   10.94   2.84       78    0.04    0.04  deport_atomic_data_
  5.65   12.96   2.02                           __intel_ssse3_rep_memcpy
  5.56   14.95   1.99       14    0.14    0.14  ewald_spme_forces_IP_spme_forces_
  5.29   16.84   1.89       14    0.14    0.63  ewald_spme_forces_
</source>
AMD server: CPU time = 151 s, wall time = 190 s
<source lang=make>
% time  cumulative s  self s  calls  self s/call  total s/call  name
 19.85   23.87  23.87       91    0.26    0.27  constraints_shake_vv_
 12.33   38.70  14.83       78    0.19    0.19  deport_atomic_data_
  6.82   46.90   8.20       13    0.63    0.64  constraints_rattle_
  6.06   54.19   7.29       14    0.52    0.60  link_cell_pairs_
  5.61   60.94   6.75     3836    0.00    0.00  gpfa_module_mp_gpfa3f_
  5.32   67.34   6.40       14    0.46    0.46  bspgen_
  5.15   73.53   6.19       52    0.12    0.77  npt_b0_vv_
</source>
Open64 compiler:
Intel server: CPU time = 45 s, wall time = 290 s
<source lang=make>
% time  cumulative s  self s  calls  self s/call  total s/call  name
 15.64    5.64   5.64      182    0.03    0.03  constraints_shake_vv__
 12.06    9.99   4.35       14    0.31    0.69  ewald_spme_forces__
 11.37   14.09   4.10       14    0.29    0.34  link_cell_pairs__
  9.45   17.50   3.41       78    0.04    0.04  deport_atomic_data__
  5.52   19.49   1.99  3378329    0.00    0.00  images_
  5.16   21.35   1.86     3808    0.00    0.00  GPFA2F.in.GPFA_MODULE
  5.16   23.21   1.86       26    0.07    0.07  constraints_rattle__
</source>
AMD server: CPU time = 156 s, wall time = 164 s
<source lang=make>
% time  cumulative s  self s  calls  self s/call  total s/call  name
 18.38   22.58  22.58       91    0.25    0.25  constraints_shake_vv__
 13.07   38.64  16.06       78    0.21    0.21  deport_atomic_data__
  8.29   48.82  10.18       14    0.73    2.73  ewald_spme_forces__
  7.54   58.09   9.27       14    0.66    0.76  link_cell_pairs__
  5.99   65.45   7.36       13    0.57    0.58  constraints_rattle__
  5.86   72.65   7.20     3836    0.00    0.00  GPFA3F.in.GPFA_MODULE
  5.34   79.21   6.56       14    0.47    0.47  bspgen_
</source>
== Scaling with Number of Processors ==
From the data in the above section, we conclude that the Intel build with fancy flags is the better choice for this code, so all further runs use the Intel build only.
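The processor sweep itself can be scripted as sketched below; the executable name (DLPOLY.Z), the per-test directory layout and the mpirun launch line are assumptions rather than the recorded procedure.
<source lang=make>
for np in 1 4 8; do
    ( cd TEST14 && mpirun -np $np ../execute/DLPOLY.Z )   # timings are reported in the OUTPUT file
done
</source>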
The following table shows the variation of run time (in seconds) with the number of processors on the Intel (E5-2643) server; the speed-up factor relative to the 1-processor run is given in parentheses:
{| border=3 align="Center" style="text-align: center;"
|-
!#proc!!Test1!!Test3!!Test4!!Test5!!Test7!!Test9!!Test11!!Test13!!Test14!!Test17!!Test18!!Test27!!Test31!!Test35
|-
|1 (Factor)||91 (1.00)||389 (1.00)||541 (1.00)||511 (1.00)||502 (1.00)||755 (1.00)||397 (1.00)||292 (1.00)||255 (1.00)||302 (1.00)||445 (1.00)||651 (1.00)||612 (1.00)||-
|-
|4 (Factor)||90 (1.01)||122 (3.18)||160 (3.38)||119 (4.29)||173 (2.90)||119 (6.34)||119 (3.34)||116 (2.52)||147 (1.73)||105 (2.87)||147 (3.03)||229 (2.84)||184 (3.33)||136 (-)
|-
|8 (Factor)||62 (1.47)||65 (5.98)||87 (6.22)||65 (7.86)||88 (5.70)||57 (13.2)||61 (6.51)||58 (5.03)||114 (2.23)||65 (4.65)||89 (5.00)||131 (4.96)||96 (6.38)||81 (-)
|}
The following table shows only the speed-up factors; it is essentially the same data as above.
{| border=3 align="Center" style="text-align: center;"
|-
!#proc!!Test1!!Test3!!Test4!!Test5!!Test7!!Test9!!Test11!!Test13!!Test14!!Test17!!Test18!!Test27!!Test31!!Test35
|-
|1||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||-
|-
|4||1.01||3.18||3.38||4.29||2.90||6.34||3.34||2.52||1.73||2.87||3.03||2.84||3.33||-
|-
|8||1.47||5.98||6.22||7.86||5.70||13.2||6.51||5.03||2.23||4.65||5.00||4.96||6.38||-
|}
We also ran the same benchmarks on the AMD (Opteron 6378) server (for comparison, we include the Intel results as well; times in seconds):
{| border=3 align="Center" style="text-align: center;"
|-
!#proc!!Test1!!Test3!!Test4!!Test5!!Test7!!Test9!!Test11!!Test13!!Test14!!Test17!!Test18!!Test27!!Test31!!Test35!!Notes
|-
|1||91||395||619||788||1406||1292||825||393||441||882||898||651||651||-||-
|-
|4||102||362||375||277||521||285||307||376||267||318||497||554||477||381||-
|-
|8||95||221||213||142||300||144||158||247||186||190||356||323||269||228||-
|-
|16||90||127||112||78||167||74||90||170||116||145||193||192||152||170||-
|-
|Intel (8 proc)||62||65||87||65||88||57||61||58||114||65||89||131||96||81||1
|-
|Scaled Intel (8 proc)||85||89||120||89||121||78||84||80||157||89||122||180||132||111||2
|-
|AMD (16 proc)/Scaled Intel (8 proc)||1.06||1.43||0.93||0.88||1.38||0.95||1.07||2.14||0.74||1.63||1.58||1.07||1.15||1.53||3
|}
The following table shows the scaling on the AMD server with respect to the 1-processor runs:
{| border=3 align="Center" style="text-align: center;"
|-
!#proc!!Test1!!Test3!!Test4!!Test5!!Test7!!Test9!!Test11!!Test13!!Test14!!Test17!!Test18!!Test27!!Test31!!Test35
|-
|1||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||1.00||-
|-
|4||0.89||1.09||1.65||2.84||2.70||4.53||2.69||1.05||1.65||2.77||1.81||1.18||1.36||-
|-
|8||0.96||1.79||2.91||5.55||4.69||8.97||5.22||1.59||2.37||4.64||2.52||2.02||2.42||-
|-
|16||1.01||3.11||5.53||10.1||8.42||17.46||9.17||2.31||3.80||6.08||4.65||3.39||4.28||-
|}
There is a wide variety of scaling behaviour: some test cases scale almost linearly while others do not. In most cases the scaling is better on the Intel server than on the AMD server, although a few cases show about the same scaling on both.
1 Intel: E5-2643 @ 3.3 GHz, AMD: Opteron 6378 @ 2.4 GHz
2 Scaling was done by the factor of 3.3/2.4 = 1.375
3 Comparison of the 8-processor run on Intel vs the 16-processor run on AMD
== MKL vs FFTW FFTs ==
We profiled the code to see where it spends most of its time. Below is a summary of the profiles for all the benchmarks we tried.
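The per-routine breakdowns below are gprof flat profiles. A hedged sketch of how they can be reproduced: add "-pg" to FCFLAGS (and to the link step) in the Makefile fragments above, rerun a test, and read the profile. The executable name DLPOLY.Z and the single-rank run are assumptions.
<source lang=make>
mpirun -np 1 ./DLPOLY.Z              # writes gmon.out alongside the usual OUTPUT
gprof ./DLPOLY.Z gmon.out | head -30
</source>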
Test1:
Intel
<source lang=make>
% time  cumulative s  self s  calls  self s/call  total s/call  name
 28.26    7.16   7.16   675309    0.00    0.00  vdw_forces_
 25.57   13.64   6.48      201    0.03    0.03  link_cell_pairs_
 17.72   18.13   4.49   675309    0.00    0.00  ewald_real_forces_
  7.22   19.96   1.83      201    0.01    0.01  ewald_spme_forces_IP_spme_forces_
  5.17   21.27   1.31      201    0.01    0.02  ewald_spme_forces_
  3.95   22.27   1.00      201    0.00    0.12  two_body_forces_
  3.83   23.24   0.97   675309    0.00    0.00  images_
  3.08   24.02   0.78      201    0.00    0.00  bspgen_
  1.42   24.38   0.36    11658    0.00    0.00  gpfa_module_mp_gpfa3f_
  1.07   24.65   0.27      202    0.00    0.00  shellsort2_
  0.67   24.82   0.17     1206    0.00    0.00  export_atomic_data_
</source>
AMD
<source lang=make>
% time  cumulative s  self s  calls  self s/call  total s/call  name
 23.43   18.45  18.45  3379737    0.00    0.00  vdw_forces_
 21.47   35.36  16.91     1000    0.02    0.02  link_cell_pairs_
 14.23   46.57  11.21  3379737    0.00    0.00  ewald_real_forces_
 11.77   55.84   9.27      500    0.02    0.02  bspgen_
  5.78   60.39   4.55     1000    0.00    0.00  ewald_spme_forces_IP_spme_forces_
  4.57   63.99   3.60     1000    0.00    0.02  ewald_spme_forces_
  4.36   67.42   3.43                           deport_atomic_data_
  3.34   70.05   2.63  3379737    0.00    0.00  images_
  2.76   72.22   2.17      500    0.00    0.14  two_body_forces_
  2.13   73.90   1.68                           spec_dexp2
  1.24   74.88   0.98    29000    0.00    0.00  gpfa_module_mp_gpfa3f_
  1.05   75.71   0.83      998    0.00    0.00  nvt_b0_vv_
  0.86   76.39   0.68      501    0.00    0.00  shellsort2_
</source>
Test3:
Intel
<source lang=make>
% time  cumulative s  self s  calls  self s/call  total s/call  name
 17.53   26.21  26.21    1298431    0.00    0.00  vdw_forces_
 14.82   48.37  22.16        201    0.11    0.13  link_cell_pairs_
 14.22   69.62  21.25      26532    0.00    0.00  gpfa_module_mp_gpfa2f_
 11.73   87.16  17.54    1298431    0.00    0.00  ewald_real_forces_
  6.62   97.06   9.90       1200    0.01    0.01  deport_atomic_data_
  4.04  103.10   6.04        201    0.03    0.21  ewald_spme_forces_
  3.80  108.78   5.68        200    0.03    0.03  constraints_shake_vv_
  3.37  113.82   5.04   43159044    0.00    0.00  local_index_
  3.32  118.79   4.97    2355005    0.00    0.00  images_
  3.06  123.36   4.57        200    0.02    0.03  constraints_rattle_
  2.99  127.82   4.47  326609967    0.00    0.00  match_
  2.57  131.66   3.84        201    0.02    0.02  ewald_spme_forces_IP_spme_forces_
  2.01  134.67   3.01        201    0.01    0.60  two_body_forces_
  1.10  136.32   1.65        201    0.01    0.01  parallel_fft_mp_forward_3d_fft_z_
  1.04  137.87   1.55        201    0.01    0.06  parallel_fft_mp_forward_3d_fft_y_
  0.98  139.33   1.46        201    0.01    0.01  bspgen_
</source>
AMD
<source lang=make>
% time  cumulative s  self s  calls  self s/call  total s/call  name
 15.14   28.44  28.44    1299553    0.00    0.00  vdw_forces_
 12.46   51.85  23.41        201    0.12    0.14  link_cell_pairs_
 11.54   73.53  21.68      26532    0.00    0.00  gpfa_module_mp_gpfa2f_
  9.90   92.13  18.60    1299553    0.00    0.00  ewald_real_forces_
  7.30  105.85  13.72        200    0.07    0.07  constraints_rattle_
  7.01  119.03  13.18        200    0.07    0.07  constraints_shake_vv_
  6.23  130.73  11.70       1200    0.01    0.01  deport_atomic_data_
  3.69  137.66   6.93        201    0.03    0.03  bspgen_
  3.17  143.61   5.95        201    0.03    0.24  ewald_spme_forces_
  2.65  148.59   4.98    2367053    0.00    0.00  images_
  2.45  153.19   4.60   39319511    0.00    0.00  local_index_
  2.25  157.41   4.23  328218229    0.00    0.00  match_
  2.14  161.43   4.02        201    0.02    0.02  ewald_spme_forces_IP_spme_forces_
  1.81  164.84   3.41                            spec_dexp2
  1.59  167.82   2.98        201    0.01    0.66  two_body_forces_
  1.30  170.26   2.44        201    0.01    0.01  parallel_fft_mp_forward_3d_fft_z_
  0.80  171.76   1.50        201    0.01    0.01  parallel_fft_mp_back_3d_fft_x_
</source>
Test4:
Intel
<source lang=make>
% time  cumulative s  self s  calls  self s/call  total s/call  name
 18.66   29.94  29.94    1604385    0.00    0.00  vdw_forces_
 15.84   55.35  25.41         31    0.82    0.98  link_cell_pairs_
 13.21   76.55  21.20    1604385    0.00    0.00  ewald_real_forces_
 12.69   96.92  20.37         30    0.68    0.71  constraints_shake_vv_
  7.68  109.24  12.32         30    0.41    0.44  constraints_rattle_
  7.32  120.99  11.75        180    0.07    0.07  deport_atomic_data_
  3.79  127.07   6.08    2880181    0.00    0.00  images_
  3.23  132.25   5.18   40320168    0.00    0.00  local_index_
  3.10  137.22   4.98  364097691    0.00    0.00  match_
  2.72  141.58   4.36         31    0.14    0.14  ewald_spme_forces_IP_spme_forces_
  2.16  145.05   3.47         31    0.11    3.38  two_body_forces_
  2.05  148.34   3.29       4092    0.00    0.00  gpfa_module_mp_gpfa2f_
  1.95  151.47   3.13         31    0.10    0.42  ewald_spme_forces_
  0.67  152.54   1.07         31    0.03    0.03  bspgen_
</source>
AMD
<source lang=make>
% time  cumulative s  self s  calls  self s/call  total s/call  name
 18.03   33.20  33.20    1603402    0.00    0.00  vdw_forces_
 13.13   57.37  24.17         31    0.78    0.90  link_cell_pairs_
 12.76   80.87  23.50         30    0.78    0.81  constraints_shake_vv_
 11.25  101.59  20.72    1603402    0.00    0.00  ewald_real_forces_
  9.40  118.90  17.31         30    0.58    0.60  constraints_rattle_
  7.53  132.76  13.86        180    0.08    0.08  deport_atomic_data_
  3.27  138.78   6.02    2878846    0.00    0.00  images_
  2.65  143.65   4.87         31    0.16    0.16  bspgen_
  2.60  148.44   4.79   40437733    0.00    0.00  local_index_
  2.45  152.96   4.52                            spec_dexp2
  2.36  157.31   4.35         31    0.14    0.14  ewald_spme_forces_IP_spme_forces_
  2.07  161.13   3.82         62    0.06    1.77  two_body_forces_
  2.02  164.84   3.72  363897922    0.00    0.00  match_
  1.87  168.28   3.44         31    0.11    0.55  ewald_spme_forces_
  1.82  171.64   3.36       4092    0.00    0.00  gpfa_module_mp_gpfa2f_
  0.75  173.03   1.39        150    0.01    0.30  nve_0_vv_
</source>
Test5:
<source lang=make>
% time  cumulative s  self s  calls  self s/call  total s/call  name
 44.99   50.53  50.53      101    0.50    0.50  three_body_forces_
 16.21   68.74  18.21   873523    0.00    0.00  ewald_real_forces_
 15.19   85.80  17.06      101    0.17    0.17  link_cell_pairs_
 10.25   97.31  11.51   873523    0.00    0.00  vdw_forces_
  3.71  101.48   4.17   873523    0.00    0.00  images_
  3.33  105.22   3.74      101    0.04    0.59  two_body_forces_
  2.08  107.56   2.34      101    0.02    0.02  ewald_spme_forces_IP_spme_forces_
  1.41  109.14   1.58      101    0.02    0.05  ewald_spme_forces_
  0.97  110.23   1.09      101    0.01    0.01  bspgen_
</source>
Test7:
<source lang=make>
% time  cumulative s  self s  calls  self s/call  total s/call  name
 16.97   13.50  13.50        101    0.13    0.15  link_cell_pairs_
 16.75   26.82  13.32       1326    0.01    0.01  constraints_shake_vv_
 11.59   36.04   9.22        600    0.02    0.02  deport_atomic_data_
 10.69   44.54   8.50    1249646    0.00    0.00  ewald_real_forces_
 10.59   52.96   8.42    1249646    0.00    0.00  vdw_forces_
  4.39   56.45   3.49    2122099    0.00    0.00  images_
  4.34   59.91   3.46   23122212    0.00    0.00  local_index_
  4.14   63.20   3.29        101    0.03    0.03  ewald_spme_forces_IP_spme_forces_
  3.34   65.86   2.66        200    0.01    0.11  npt_b0_vv_
  2.87   68.14   2.28        101    0.02    0.09  ewald_spme_forces_
  2.78   70.35   2.21        100    0.02    0.03  constraints_rattle_
  2.21   72.11   1.76        101    0.02    0.46  two_body_forces_
  1.89   73.61   1.51  154519604    0.00    0.00  match_
  1.85   75.08   1.47        101    0.01    0.01  bspgen_
  0.65   75.60   0.52        202    0.00    0.00  shellsort2_
</source>
Test9:
<source lang=make>
 95.21   75.98  75.98      401    0.19    0.19  tersoff_forces_
  1.09   76.85   0.87     2400    0.00    0.00  deport_atomic_data_
  0.96   77.62   0.77      402    0.00    0.00  shellsort2_
</source>
Test11:
<source lang=make>
 36.75   52.13  52.13  4004908    0.00    0.00  metal_forces_
 22.05   83.41  31.28     1001    0.03    0.03  link_cell_pairs_
 21.82  114.36  30.95  4004908    0.00    0.00  metal_ld_collect_fst_
 10.26  128.91  14.55  8009816    0.00    0.00  images_
  2.59  132.59   3.68     1001    0.00    0.14  two_body_forces_
  2.44  136.05   3.46     1001    0.00    0.04  metal_ld_compute_
  1.35  137.97   1.92     1002    0.00    0.00  shellsort2_
  0.62  138.85   0.88     6000    0.00    0.00  deport_atomic_data_
</source>
Test13:
<source lang=make>
 18.91   22.56  22.56     22348    0.00    0.00  gpfa_module_mp_gpfa2f_
 12.43   37.39  14.83       900    0.02    0.02  deport_atomic_data_
 11.80   51.47  14.08      1950    0.01    0.01  constraints_shake_vv_
  8.38   61.47  10.00       151    0.07    0.08  link_cell_pairs_
  5.49   68.02   6.55       300    0.02    0.09  npt_b0_vv_
  5.02   74.01   5.99       302    0.02    0.02  gpfa_module_mp_gpfa3f_
  4.86   79.81   5.80       151    0.04    0.33  ewald_spme_forces_
  4.38   85.04   5.23       151    0.03    0.04  ewald_spme_forces_IP_spme_forces_
  3.53   89.26   4.22  34695506    0.00    0.00  local_index_
  3.47   93.40   4.14       150    0.03    0.03  constraints_rattle_
  2.94   96.91   3.51   2364786    0.00    0.00  ewald_real_forces_
  2.84  100.30   3.39   4097851    0.00    0.00  images_
  2.36  103.11   2.81   2364786    0.00    0.00  vdw_forces_
  1.32  104.68   1.57       151    0.01    0.01  bspgen_
  1.27  106.19   1.51       151    0.01    0.01  parallel_fft_mp_forward_3d_fft_z_
  1.18  107.60   1.41       151    0.01    0.01  parallel_fft_mp_back_3d_fft_x_
  1.16  108.98   1.38       151    0.01    0.10  parallel_fft_mp_forward_3d_fft_y_
  1.14  110.34   1.36       151    0.01    0.48  two_body_forces_
  1.11  111.66   1.33  92125754    0.00    0.00  match_
  1.09  112.96   1.30       151    0.01    0.01  parallel_fft_mp_forward_3d_fft_x_
  0.99  114.14   1.18       151    0.01    0.10  parallel_fft_mp_back_3d_fft_y_
</source>
Test14:
<source lang=make>
 21.33   16.80  16.80       130    0.13    0.13  constraints_shake_vv_
 10.48   25.05   8.25        60    0.14    0.14  deport_atomic_data_
  7.24   30.75   5.70        11    0.52    0.60  link_cell_pairs_
  6.41   35.80   5.05      3014    0.00    0.00  gpfa_module_mp_gpfa3f_
  5.75   40.33   4.53        20    0.23    1.35  npt_b0_vv_
  5.30   44.50   4.17        10    0.42    0.43  constraints_rattle_
  5.27   48.65   4.15        11    0.38    2.20  ewald_spme_forces_
  5.13   52.69   4.04      2992    0.00    0.00  gpfa_module_mp_gpfa2f_
  4.28   56.06   3.37        11    0.31    0.31  ewald_spme_forces_IP_spme_forces_
  3.19   58.57   2.51  20333039    0.00    0.00  local_index_
  2.79   60.77   2.20   1534827    0.00    0.00  ewald_real_forces_
  2.78   62.96   2.19   2646191    0.00    0.00  images_
  2.30   64.77   1.81   1534827    0.00    0.00  vdw_forces_
  2.29   66.57   1.80        11    0.16    0.16  bspgen_
  1.92   68.08   1.51        22    0.07    0.07  gpfa_module_mp_gpfa5f_
  1.80   69.50   1.42        11    0.13    0.60  parallel_fft_mp_forward_3d_fft_y_
  1.79   70.91   1.41        11    0.13    0.60  parallel_fft_mp_back_3d_fft_y_
  1.09   71.77   0.86  57924127    0.00    0.00  match_
  0.95   72.52   0.75         1    0.75    0.75  dihedrals_14_check_
</source>
Test17:
<source lang=make>
 20.71   10.65  10.65        201    0.05    0.06  link_cell_pairs_
 13.88   17.79   7.14     984116    0.00    0.00  ewald_real_forces_
 11.43   23.67   5.88     984116    0.00    0.00  vdw_forces_
  9.72   28.67   5.00       3200    0.00    0.00  constraints_shake_vv_
  6.68   32.11   3.44    2260717    0.00    0.00  images_
  5.84   35.11   3.01   23232840    0.00    0.00  local_index_
  5.06   37.71   2.60        201    0.01    0.01  ewald_spme_forces_IP_spme_forces_
  4.53   40.04   2.33        200    0.01    0.02  constraints_rattle_
  3.46   41.82   1.78        201    0.01    0.03  ewald_spme_forces_
  2.51   43.11   1.29       4235    0.00    0.00  pmf_coms_
  2.25   44.27   1.16  118620664    0.00    0.00  match_
  2.18   45.39   1.12       1200    0.00    0.00  deport_atomic_data_
  2.10   46.47   1.08        201    0.01    0.17  two_body_forces_
  2.10   47.55   1.08        201    0.01    0.01  bspgen_
  1.50   48.32   0.77        400    0.00    0.03  npt_b0_vv_
  1.11   48.89   0.57        402    0.00    0.00  shellsort2_
  0.80   49.30   0.41        201    0.00    0.00  pass_shared_units_
</source>
Test18:
<source lang=make>
 34.45   69.38  69.38       1800    0.04    0.04  constraints_shake_vv_
  9.78   89.07  19.69        101    0.19    0.22  link_cell_pairs_
  8.63  106.45  17.38        100    0.17    0.20  constraints_rattle_
  6.61  119.77  13.32    2041857    0.00    0.00  ewald_real_forces_
  5.95  131.75  11.98    2041857    0.00    0.00  vdw_forces_
  5.38  142.59  10.84       2718    0.00    0.01  pmf_coms_
  5.29  153.24  10.65    4883071    0.00    0.00  images_
  4.95  163.22   9.98   66656800    0.00    0.00  local_index_
  3.18  169.62   6.40        200    0.03    0.60  npt_b0_vv_
  2.74  175.14   5.52        101    0.05    0.06  ewald_spme_forces_IP_spme_forces_
  1.84  178.84   3.70        101    0.04    0.16  ewald_spme_forces_
  1.71  182.28   3.44        600    0.01    0.01  deport_atomic_data_
  1.51  185.33   3.05        101    0.03    0.03  bspgen_
  1.44  188.24   2.91        101    0.03    0.74  two_body_forces_
  1.18  190.61   2.38  231279151    0.00    0.00  match_
  0.76  192.14   1.53      25304    0.00    0.00  update_shared_units_
</source>
Test27:
<source lang=make>
 38.45   91.00  91.00  12076288    0.00    0.00  metal_forces_
 24.32  148.56  57.56      3501    0.02    0.02  link_cell_pairs_
 17.06  188.93  40.37  12076288    0.00    0.00  metal_ld_collect_eam_
  8.90  210.00  21.07  24152576    0.00    0.00  images_
  2.64  216.24   6.24      3501    0.00    0.06  two_body_forces_
  2.39  221.90   5.66      3501    0.00    0.02  metal_ld_compute_
  1.46  225.35   3.45      3502    0.00    0.00  shellsort2_
  1.11  227.98   2.63     21000    0.00    0.00  deport_atomic_data_
  0.90  230.11   2.13     21006    0.00    0.00  export_atomic_data_
</source>
Test31:
<source lang=make>
 31.17   66.30  66.30  12004000    0.00    0.00  metal_forces_
 29.18  128.38  62.08      3001    0.02    0.02  link_cell_pairs_
 13.22  156.50  28.12  12004000    0.00    0.00  metal_ld_collect_eam_
 10.09  177.96  21.46  24008000    0.00    0.00  images_
  5.35  189.35  11.39     18000    0.00    0.00  deport_atomic_data_
  2.99  195.71   6.36      3001    0.00    0.06  two_body_forces_
  2.77  201.60   5.89      3001    0.00    0.02  metal_ld_compute_
  1.29  204.34   2.75      3002    0.00    0.00  shellsort2_
  0.70  205.83   1.49      6000    0.00    0.00  npt_b0_vv_
</source>
Test35:
<source lang=make>
 27.22   27.64  27.64        562    0.05    0.06  link_cell_pairs_
 19.94   47.89  20.25    1878757    0.00    0.00  ewald_real_forces_
  9.75   57.79   9.90    1878757    0.00    0.00  vdw_forces_
  5.52   63.40   5.61    3227549    0.00    0.00  images_
  5.46   68.95   5.55       2008    0.00    0.00  constraints_shake_vv_
  4.44   73.46   4.51        562    0.01    0.01  ewald_spme_forces_IP_spme_forces_
  3.43   76.95   3.49  354007894    0.00    0.00  match_
  3.38   80.38   3.43        562    0.01    0.15  two_body_forces_
  3.30   83.73   3.35        562    0.01    0.01  bspgen_
  3.23   87.01   3.29   28441741    0.00    0.00  local_index_
  2.87   89.92   2.91        500    0.01    0.01  constraints_rattle_
  2.85   92.81   2.89        562    0.01    0.02  ewald_spme_forces_
  2.05   94.89   2.08       3366    0.00    0.00  deport_atomic_data_
  1.71   96.63   1.74       1124    0.00    0.00  shellsort2_
  1.44   98.09   1.46       1000    0.00    0.01  npt_m1_vv_
  0.49   98.59   0.50        562    0.00    0.00  set_halo_particles_
</source>
From the above data, we can clearly see that the code's dependence on the FFT is very small (where the FFT routines show up at all, they account for roughly one percent of the time), so there is no need to try different FFT libraries for efficiency.
We again describe the effect of the AMD server's shared FPUs and L2 cache, this time for DL_POLY. The results are summarized in the following table (# processes = 8, times in seconds):
{| border=3 align="Center" style="text-align: center;"
|-
!Run Scheme!!Test1!!Test3!!Test4!!Test5!!Test7!!Test9!!Test11!!Test13!!Test14!!Test17!!Test18!!Test27!!Test31!!Test35
|-
|Shared time (s)||95||221||213||142||300||144||158||247||186||190||356||323||269||228
|-
|Exclusive time (s)||90||168||161||109||226||116||128||185||157||151||282||265||209||173
|}
In the above table, "shared" represents cores that share resources such as L2-Cache and FPU, while "exclusive" stands for cores which do not share any resources.
== Performance Comparison ==
The following table compares the performance of the Intel and AMD servers when the job was run using 8 processors (times in seconds):
{| border=3 align="Center" style="text-align: center;"
|-
!Server!!Test1!!Test3!!Test4!!Test5!!Test7!!Test9!!Test11!!Test13!!Test14!!Test17!!Test18!!Test27!!Test31!!Test35!!Notes
|-
|Intel||62||65||87||65||88||57||61||58||114||65||89||131||96||81||1
|-
|Scaled Intel||85||89||120||89||121||78||84||80||157||89||122||180||132||111||2
|-
|AMD (Shared)||95||221||213||142||300||144||158||247||186||190||356||323||269||228||1,3
|-
|AMD (Exclusive)||90||168||161||109||226||116||128||185||157||151||282||265||209||173||1,4
|-
|AMD Shared/AMD Exc.||1.05||1.31||1.32||1.30||1.33||1.24||1.23||1.34||1.18||1.26||1.26||1.22||1.29||1.32||-
|-
|AMD Exc./Intel (Scaled)||1.06||1.89||1.34||1.22||1.87||1.49||1.52||2.31||1.00||1.70||2.31||1.47||1.58||1.56||-
|-
|AMD Shared/Intel (Scaled)||1.12||2.48||1.78||1.60||2.48||1.85||1.88||3.09||1.18||2.13||2.92||1.79||2.03||2.05||-
|-
|AMD (16 proc)/Scaled Intel (8 proc)||1.06||1.43||0.93||0.88||1.38||0.95||1.07||2.14||0.74||1.63||1.58||1.07||1.15||1.53||3
|}
In summary:
1. In the majority of cases, DL_POLY runs ~1.2-2.3X faster on the Intel server than on the AMD server, even when the AMD cores are interleaved so that no resources are shared (the exclusive scheme). However, in a few cases (Test1 and Test14) the performance difference between the two servers is negligible.
2. FPU sharing costs a further ~1.2-1.3X on the AMD server.
1 Intel: E5-2643 @ 3.3 GHz, AMD: Opteron 6378 @ 2.4 GHz
2 Scaling was done by the factor of 3.3/2.4=1.375
3 "Shared" represents cores that share resources such as L2-Cache and FPU
4 "Exclusive" represents cores that don't share resources such as L2-Cache and FPU
= BLAST =
A couple of notes before we dive into the benchmarking:
1. NCBI BLAST is not an MPI-enabled code; the only parallelization available is intra-node, which can be achieved through OpenMP. If you would like a distributed-memory parallel BLAST, mpiBLAST is an option, but from an internet search I found people reporting it to be unstable.
2. Compilation of BLAST with OpenMP must be done with thread safety enabled, otherwise the code crashes for more than one thread. This is enabled with the "--with-mt" option of the configure script; a sketch of a threaded run is shown after this list.
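A minimal sketch of a threaded blastx run is shown below; the BLAST+-style "-num_threads" option is assumed for this build, and the query, database and output names are placeholders rather than the actual benchmark inputs.
<source lang=make>
./blastx -query test_query.fa -db nr -num_threads 8 -out test_query.blastx.out
</source>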
== Comparison of Intel, Open64 and GNU Builds ==
Intel compiler:
<source lang=make>
module load intel
export CFLAGS="-Wall -O2 -msse2"
export CXXFLAGS="-Wall -O2 -msse2"
export CPPFLAGS="-Wall -O2 -msse2"
./configure --with-bin-release --without-debug --with-mt
</source>
GNU compiler:
<source lang=make>
module load gcc/.4.7.2
export CFLAGS="-Wall -O2 -msse2"
export CXXFLAGS="-Wall -O2 -msse2"
export CPPFLAGS="-Wall -O2 -msse2"
./configure --with-bin-release --without-debug --with-mt --with-64
</source>
Open64 compiler:
<source lang=make>
module load open64/.4.5.2
export CFLAGS="-Wall -O2 -msse2"
export CXXFLAGS="-Wall -O2 -msse2"
export CPPFLAGS="-Wall -O2 -msse2"
./configure --with-bin-release --without-debug --with-mt --with-64
</source>
Following is a comparison of a test case for the blastx executable of BLAST.
The executable built with Open64 and fancy flags crashes on the Intel Sandy Bridge server. The Intel compiler with basic flags clearly performs better than the other compiler+flag combinations. Interestingly, the fancy flags hurt performance for both the Intel and GNU builds.