STREAM

A few words about numactl

NUMA is an acronym for Non-Uniform Memory Access, and numactl is a tool that controls on which cores a process runs and on which node its memory is allocated. The following are a few important keywords one should know before embarking on the numactl mission:

physcpubind = ID of the cores
cpunodebind = ID of the nodes
membind = ID of the node that the memory is assigned to

For example, on an AMD machine with 16 cores, or in the terminology of NUMA, 4 nodes with 4 cores on each node, the command line

--membind=0 --physcpubind=0-3

assigns four threads to cores 0 to 3 (node 0), with the memory also allocated on node 0. However, the command line

--membind=1 --physcpubind=0-3

assigns four threads to cores 0 to 3 (node 0), but the memory is allocated on node 1. As this memory is not local to the node that the threads are running on, the performance will suffer. Memory can also be allocated locally to the node with the "-l" option of numactl.

Alternatively, the above command lines can be shortened by using "cpunodebind". For example,

--membind=0 --cpunodebind=0

means that the memory is allocated on node 0 and the threads also run on node 0. Note that "cpunodebind" makes all cores of the node available, so in this case the number of threads should be four to use each core once. If we wish to run only two threads on chosen cores of node 0, this is only possible with "physcpubind". You have more control over where your threads run with "physcpubind", as you can choose the exact cores that you wish to run your jobs on. For a detailed description, please see the manual page of numactl.

Intel (2 x E5-2643 @ 3.30GHz)

STREAM is a well-known memory bandwidth benchmark. Before we attempt to find the maximum bandwidth, it is necessary to find out the architecture of the machine. The command "numactl --hardware" on this machine produces:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 32739 MB
node 0 free: 30624 MB
node 1 cpus: 4 5 6 7
node 1 size: 32768 MB
node 1 free: 31280 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

From the above result, we can conclude that there are two NUMA nodes with four cores each, eight cores in total.
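The node and core counts can also be read programmatically. The following is a minimal sketch, assuming the "node N cpus:" line format shown above; the function name parse_numactl_hardware and the shortened sample text are illustrative and not part of the original benchmark scripts.

```python
import re

def parse_numactl_hardware(text):
    """Return {node_id: [cpu ids]} from `numactl --hardware` output."""
    nodes = {}
    for line in text.splitlines():
        # Only the "node N cpus: ..." lines carry the core lists.
        match = re.match(r"node (\d+) cpus:(.*)", line.strip())
        if match:
            nodes[int(match.group(1))] = [int(c) for c in match.group(2).split()]
    return nodes

# Shortened copy of the output shown above (sizes and distances omitted).
sample = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 32739 MB
node 1 cpus: 4 5 6 7
node 1 size: 32768 MB
"""

topology = parse_numactl_hardware(sample)
total_cores = sum(len(cpus) for cpus in topology.values())

for node, cpus in sorted(topology.items()):
    print(f"node {node}: {len(cpus)} cores -> {cpus}")
print("total cores:", total_cores)
```

On a live machine, the same function could be fed the text captured from running "numactl --hardware" instead of the sample string.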

Before measuring the maximum memory bandwidth of the server, we first determine the number of threads required to achieve the maximum bandwidth of a given NUMA node. Results are summarized in the following table:

Number of threads	Bandwidth (GB/s)
1	9.5
2	18.8
3	21.4
4	34.0

From the above table, we conclude that four threads, one per core, are needed to reach the maximum bandwidth of a node. The table was obtained by running the threads on node 0 with the memory bound to the same node. This result can be reproduced on the other node as well.

The following table shows how the memory bandwidth (GB/s) changes with the node the memory is bound to, relative to the node where the threads run (number of threads is four):

MEM CPU	0	1
0	34.0	17.4
1	18.9	33.5

In the above table, the memory nodes vary along the rows and the CPU nodes along the columns. You can clearly see the effect of memory binding with respect to the cores where the threads are running: binding the memory to the remote node roughly halves the bandwidth. Please note that the above table resembles the "node distance" table obtained using "numactl --hardware" earlier.

AMD (2 x 6220 @ 3.0 GHz)

This is an Interlagos machine with 16 cores (4 NUMA nodes with 4 cores each). Each core has 4 GB of memory, which gives the machine 64 GB in total. I compiled the code with the Open64 compiler. It is noteworthy that the gcc compiler gives about half the bandwidth of Open64, while the Intel compiler results on this machine vary (64GB to 40 GB). "numactl --hardware" produces:

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 16382 MB
node 0 free: 2930 MB
node 1 cpus: 4 5 6 7
node 1 size: 16384 MB
node 1 free: 5082 MB
node 2 cpus: 8 9 10 11
node 2 size: 16384 MB
node 2 free: 2281 MB
node 3 cpus: 12 13 14 15
node 3 size: 16368 MB
node 3 free: 550 MB
node distances:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10

The following table shows the memory bandwidth on a single node as the number of threads varies:

Number of threads	Bandwidth (GB/s)
1	14.0
2	15.0
3	17.8
4	18.5

Again, similar to the Intel machine, the maximum number of threads we need to run on each node is four.

The following table shows how the memory bandwidth (GB/s) changes with the node the memory is bound to, relative to the node where the threads run (number of threads is four):

MEM CPU	0	1	2	3
0	18.1	11.8	6.5	5.6
1	11.8	18.7	5.5	6.5
2	6.5	5.5	18.5	11.6
3	5.6	6.5	11.8	18.5

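Contrary to the Intel machine, the above table does not agree with the "node distance" table produced by "numactl --hardware": the distance table reports the same distance (16) to every remote node, whereas the measured remote bandwidth ranges from 5.5 to 11.8 GB/s, with the node pairs (0,1) and (2,3) clearly closer to each other than to the rest.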
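One way to make this comparison concrete is to group the measured bandwidths by the reported node distance and look at the spread inside each group. The sketch below only re-uses the two tables of this section; the helper name bandwidth_by_distance is illustrative.

```python
from collections import defaultdict

# Node distances and measured bandwidths (GB/s), copied from the tables above.
# Rows are memory nodes, columns are CPU nodes.
distance = [
    [10, 16, 16, 16],
    [16, 10, 16, 16],
    [16, 16, 10, 16],
    [16, 16, 16, 10],
]
bandwidth = [
    [18.1, 11.8, 6.5, 5.6],
    [11.8, 18.7, 5.5, 6.5],
    [6.5, 5.5, 18.5, 11.6],
    [5.6, 6.5, 11.8, 18.5],
]

def bandwidth_by_distance(distance, bandwidth):
    """Map each reported distance to the (min, max) bandwidth measured at it."""
    groups = defaultdict(list)
    for dist_row, bw_row in zip(distance, bandwidth):
        for dist, bw in zip(dist_row, bw_row):
            groups[dist].append(bw)
    return {dist: (min(values), max(values)) for dist, values in groups.items()}

spread = bandwidth_by_distance(distance, bandwidth)
for dist in sorted(spread):
    low, high = spread[dist]
    print(f"distance {dist}: {low} to {high} GB/s")
```

If the distance table described the machine faithfully, each distance value would map to a narrow bandwidth range; a wide range at a single distance is the disagreement noted above.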

AMD (4 x 6378 @ 2.4 GHz)

In NUMA terminology, this server has 8 nodes with 8 cores on each.

numactl --hardware  

available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32765 MB
node 0 free: 29324 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 31892 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32768 MB
node 2 free: 31900 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32768 MB
node 3 free: 31911 MB
node 4 cpus: 32 33 34 35 36 37 38 39
node 4 size: 32768 MB
node 4 free: 31964 MB
node 5 cpus: 40 41 42 43 44 45 46 47
node 5 size: 32768 MB
node 5 free: 31942 MB
node 6 cpus: 48 49 50 51 52 53 54 55
node 6 size: 32768 MB
node 6 free: 31866 MB
node 7 cpus: 56 57 58 59 60 61 62 63
node 7 size: 32752 MB
node 7 free: 31960 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  22  16  22  16  22
  1:  16  10  22  16  22  16  22  16
  2:  16  22  10  16  16  22  16  22
  3:  22  16  16  10  22  16  22  16
  4:  16  22  16  22  10  16  16  22
  5:  22  16  22  16  16  10  22  16
  6:  16  22  16  22  16  22  10  16
  7:  22  16  22  16  22  16  16  10

Memory bandwidth on a single node by varying number of threads:

Number of threads	Bandwidth (GB/s)
1	13.0
2	14.1
3	17.1
4	17.4
5	17.1
6	16.7
7	16.6
8	16.1

The following table shows the variation of the memory bandwidth (GB/s) when we change the memory allocation with respect to the cores where the threads are running (number of threads = 4):

MEM CPU	0	1	2	3	4	5	6	7
0	17.3	8.0	5.6	4.1	5.7	4.1	5.5	4.0
1	8.2	17.6	6.5	6.5	4.0	5.5	4.0	5.4
2	5.7	6.5	17.9	7.9	5.6	4.1	5.6	4.1
3	4.1	6.5	8.1	17.8	4.1	5.6	4.1	5.7
4	5.6	4.0	5.7	4.2	17.7	7.9	5.7	4.1
5	4.0	5.6	4.1	5.6	8.1	17.7	4.0	5.5
6	5.4	4.0	5.6	4.1	5.7	4.1	17.8	7.9
7	3.9	5.4	4.0	5.6	4.2	5.6	8.1	17.7

Bandwidth in terms of Socket

A socket of an AMD 6200 or 6300 machine consists of two NUMA nodes. A socket has 16 cores on the 6378 server and 8 cores on the 6220 server. The memory bandwidth of each NUMA node is at its maximum with about 4 threads, which raises the question of the maximum bandwidth of a socket. A reasonable guess from our previous results is to use 8 threads per socket, with 4 on each NUMA node. If we run STREAM on 8 cores as follows:

numactl --physcpubind=0,1,2,3,8,9,10,11 --membind=0,1 ./stream

we get 34.7 GB/s memory bandwidth.

Running

numactl --physcpubind=0,2,4,6,8,10,12,14 --membind=0,1 ./stream

also yields 35 GB/s bandwidth.

By varying the membind to different sockets as follows:

numactl --physcpubind=0,2,4,6,8,10,12,14 --membind=0,1 ./stream
numactl --physcpubind=0,2,4,6,8,10,12,14 --membind=2,3 ./stream
numactl --physcpubind=0,2,4,6,8,10,12,14 --membind=4,5 ./stream
numactl --physcpubind=0,2,4,6,8,10,12,14 --membind=6,7 ./stream
numactl --physcpubind=16,18,20,22,24,26,28,30 --membind=0,1 ./stream
numactl --physcpubind=16,18,20,22,24,26,28,30 --membind=2,3 ./stream
numactl --physcpubind=16,18,20,22,24,26,28,30 --membind=4,5 ./stream
numactl --physcpubind=16,18,20,22,24,26,28,30 --membind=6,7 ./stream
numactl --physcpubind=32,34,36,38,40,42,44,46 --membind=0,1 ./stream
numactl --physcpubind=32,34,36,38,40,42,44,46 --membind=2,3 ./stream
numactl --physcpubind=32,34,36,38,40,42,44,46 --membind=4,5 ./stream
numactl --physcpubind=32,34,36,38,40,42,44,46 --membind=6,7 ./stream
numactl --physcpubind=48,50,52,54,56,58,60,62 --membind=0,1 ./stream
numactl --physcpubind=48,50,52,54,56,58,60,62 --membind=2,3 ./stream
numactl --physcpubind=48,50,52,54,56,58,60,62 --membind=4,5 ./stream
numactl --physcpubind=48,50,52,54,56,58,60,62 --membind=6,7 ./stream

we get the following table (in terms of sockets, i.e. nodes 0-1 form socket 1, nodes 2-3 form socket 2, and so on):

MEM CPU	1	2	3	4
1	35.2	11.2	11.0	10.7
2	11.3	35.3	11.2	11.1
3	10.9	11.2	35.2	11.0
4	10.7	11.1	11.1	35.4
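The sixteen command lines of this sweep follow a regular pattern, so they can be generated instead of typed. The following is a sketch under the layout of the 6378 server described above (4 sockets, 2 NUMA nodes per socket, 8 cores per node); the function name and its parameters are illustrative.

```python
def socket_sweep_commands(sockets=4, nodes_per_socket=2, cores_per_node=8,
                          binary="./stream"):
    """Build one numactl command per (CPU socket, memory socket) pair."""
    commands = []
    cores_per_socket = nodes_per_socket * cores_per_node
    for cpu_socket in range(sockets):
        first_core = cpu_socket * cores_per_socket
        # Every second core of the socket: 4 cores on each of its two nodes.
        cores = range(first_core, first_core + cores_per_socket, 2)
        cpu_list = ",".join(str(core) for core in cores)
        for mem_socket in range(sockets):
            first_node = mem_socket * nodes_per_socket
            nodes = range(first_node, first_node + nodes_per_socket)
            mem_list = ",".join(str(node) for node in nodes)
            commands.append(
                f"numactl --physcpubind={cpu_list} --membind={mem_list} {binary}"
            )
    return commands

commands = socket_sweep_commands()
for command in commands:
    print(command)
```

The commands are printed rather than executed, so the list can be inspected or pasted into a job script before anything is run.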

VASP

This page describes benchmarking of Vienna Ab-initio Simulation Package (VASP), a plane wave density functional theory code, used in studying electronic structure of materials.

Intel (2 x E5-2643 @ 3.30GHz)

Native FFT Library

Following libraries and flags were used:

MKLDIR    = $(HPC_MKL_DIR)
MKLLIBS   = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTLIB = -lfftw3xf
INCS = -I$(MKLDIR)/include/fftw
FFT_OBJS = fftmpi.o fftmpi_map.o fftw3d.o fft3dlib.o
FFLAGS =  -free -names lowercase -assume byterecl
OFLAG  = -O2 -xsse2 -unroll-aggressive -warn general

As a first check, the SIMD instruction set selected at compile time was varied. The following are the run times of a self-consistent field (SCF) calculation for MgMOS (for input files, please ask Charles Taylor or Manoj Srivastava):

SIMD Instruction	Time(s)
sse2	158
sse4.1	156
sse4.2	155
avx	155
ssse3	156

MKL FFTs (via FFTW wrappers)

Upon profiling the code, we found that it spent most of its time in the FFT routines, so the next step was to change the FFT library. The following changes were made:

FFT_OBJS = fftmpi_map.o fftmpiw.o fftw3d.o fft3dlib.o

(The change here is replacement of "fftmpi.o" in the original VASP makefile with "fftmpiw.o")

MKLDIR    = $(HPC_MKL_DIR)
MKLLIBS   = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTLIB = -lfftw3xf
INCS = -I$(MKLDIR)/include/fftw
FFLAGS = -free -names lowercase -assume byterecl
OFLAG  = -O2 -xsse2 -unroll-aggressive -warn general

After making the above changes, the run time on the Intel machine (E5-2643 @ 3.30GHz) dropped from 158 s to 97 s with sse2, i.e. the code ran about 60% faster. The following table shows the run time for the different SIMD instruction sets:

SIMD Instruction	Time(s)
sse2	97
sse4.1	95
sse4.2	94
avx	94
ssse3	94

FFTW FFTs

We further compiled VASP using the FFT library from the FFTW package with the following flags:

MKLDIR    = $(HPC_MKL_DIR)
MKLLIBS   = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTWDIR = /apps/fftw/3.3.2
FFTLIB  = -L$(FFTWDIR)/lib -lfftw3
INCS = -I$(FFTWDIR)/include
FFT_OBJS = fftmpi_map.o fftmpiw.o fftw3d.o fft3dlib.o
FFLAGS =  -free -names lowercase -assume byterecl
OFLAG  = -O2 -xsse2 -unroll-aggressive -warn general

From our previous experience, we concluded that the performance of VASP did not depend substantially on the SIMD instruction sets, so for FFTW library, we only tried one set. Following is the result:

SIMD Instruction	Time(s)
sse2	118

AMD (2 x 6220 @ 3.0 GHz)

This machine has 16 cores, in numactl terminology 4 NUMA nodes with 4 cores on each node. As the performance of VASP depends heavily on the choice of FFT library, we checked this machine with different FFTs, namely the FFT provided with the VASP package, MKL, and FFTW. We built the FFTW library with various flags to see if we could find a better choice. The libraries and flags used to compile VASP are as follows (the FFT libraries were changed depending on which FFT we wanted to use):

MKLDIR    = $(HPC_MKL_DIR)
MKLLIBS   = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTWDIR = /apps/fftw/3.3.2
FFTLIB  = -L$(FFTWDIR)/lib -lfftw3
INCS = -I$(FFTWDIR)/include
FFT_OBJS = fftmpi_map.o fftmpiw.o fftw3d.o fft3dlib.o
FFLAGS =  -free -names lowercase -assume byterecl
OFLAG  = -O2 -xsse2 -unroll-aggressive -warn general

From the computer architecture point of view, the Bulldozer module of the AMD server lies in between a true dual-core processor and a single-core processor with simultaneous multithreading capability. The two cores of a module share some resources, such as the L2 cache and the floating point unit (FPU), so the performance of a code depends on whether its threads run on cores that share a module or on cores that have a module to themselves. For detailed information about the Bulldozer core, please have a look at http://en.wikipedia.org/wiki/Bulldozer_%28microarchitecture%29

The results are summarized in the following table:

Run Scheme	Native	MKL	FFTW	FFTW	FFTW	FFTW	FFTW	FFTW
Shared time(s)	399	261	333	319	334	336	315	319
Exclusive time (s)	274	159	217	219	215	217	213	211
Notes	-	-	1	2	3	4	5	6

In the above table, "shared" represents cores that share resources such as L2-Cache and FPU, while "exclusive" stands for cores which do not share any resources.

¹ Default compiler flags were used to build FFTW.
² CFLAGS=-O3, FFLAGS=-O3, --enable-sse2
³ --enable-mpi, CFLAGS=-O3, FFLAGS=-O3, --enable-sse2
⁴ CC='opencc -march=bdver1' F77='openf90 -march=bdver1' CFLAGS='-msse3 -msse4.1 -msse4.2 -msse4a -mfma4 -O2' FFLAGS='-msse3 -msse4.1 -msse4.2 -msse4a -mfma4 -O2' --enable-fma --enable-mpi
⁵ FFLAGS/ CFLAGS="-OPT:Ofast -mavx -mfma4 -march=bdver1 -O3 -fomit-frame-pointer -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4 -malign-double -fstrict-aliasing -fno-schedule-insns -ffast-math"
⁶ ufhpc compiler options. FFTWDIR = /apps/fftw/3.3.2
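The "shared" and "exclusive" run schemes differ only in which core IDs the threads are pinned to. The sketch below shows one way to build the two core lists. It assumes that cores 2k and 2k+1 belong to the same Bulldozer module, which is consistent with the core lists used in the STREAM section but should be verified on the actual machine; the function name pick_cores is illustrative.

```python
def pick_cores(n_threads, scheme, first_core=0):
    """Core IDs for n_threads threads under a 'shared' or 'exclusive' placement."""
    # ASSUMPTION: cores 2k and 2k+1 sit in the same module (shared L2 and FPU).
    steps = {"shared": 1, "exclusive": 2}
    if scheme not in steps:
        raise ValueError(f"unknown scheme: {scheme!r}")
    step = steps[scheme]
    # shared: consecutive cores fill both halves of each module;
    # exclusive: every second core, so each thread has a module to itself.
    return [first_core + i * step for i in range(n_threads)]

shared_cores = pick_cores(8, "shared")
exclusive_cores = pick_cores(8, "exclusive")

for name, cores in (("shared", shared_cores), ("exclusive", exclusive_cores)):
    print(f"{name:9s} --physcpubind=" + ",".join(str(core) for core in cores))
```

The printed strings are only the physcpubind part of a numactl command line; how they are combined with the MPI launcher depends on the local setup.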

Performance Comparison

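The following is a summary of results for the MgMOS test case run on the Intel and AMD servers with 8 processors. The "Intel (Scaled)" row appears to be the Intel time multiplied by the clock ratio 3.3/3.0 = 1.1.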

Server	Native	MKL	FFTW
Intel	158	97	118
Intel (Scaled)	174	106	130
AMD (Shared)	399	261	319
AMD (Exclusive)	274	159	211
AMD Shared/AMD Exc.	1.46	1.64	1.51
AMD Exc./Intel (scaled)	1.57	1.50	1.62
Notes	-	-	1

In the above table, "shared" represents cores that share resources such as L2-Cache and FPU, while "exclusive" stands for cores which do not share any resources.

¹ Compiled by UFHPC (Charles Taylor or Craig Prescott)

LAMMPS

Scaling with Number of Processors

LAMMPS is compiled with the following flags:

module load intel openmpi
CC = mpiCC
CCFLAGS = -O2 -xsse2
FFT_INC =  -I$(HPC_MKL_DIR)/include/fftw
FFT_PATH =
FFT_LIB = -L$(HPC_MKL_DIR)/lib/intel64 -lfftw3xc -lmkl_intel_lp64 -lmkl_sequential -lmkl_core

The benchmark runs use the input files provided with the package (LJ = atomic fluid, Lennard-Jones potential with 2.5 sigma cutoff (55 neighbors per atom), NVE integration). The following table shows the variation of run time with the number of processors on the Intel server:

# processors	Time(s)
8	158
4	309
1	1139

We find close to linear scaling with the number of processors on the Intel machine: the speedup over one processor is about 3.7 on 4 processors and 7.2 on 8.
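The scaling statement can be checked by computing the speedup and the parallel efficiency from the run times in the table. This is a small sketch using only those three timings; the function name scaling is illustrative.

```python
# Run times in seconds from the table above: {number of processors: time}.
timings = {1: 1139, 4: 309, 8: 158}

def scaling(timings):
    """Return {n: (speedup, efficiency)} relative to the single-processor run."""
    t1 = timings[1]
    result = {}
    for n, t in sorted(timings.items()):
        speedup = t1 / t          # how many times faster than one processor
        efficiency = speedup / n  # 1.0 would be perfectly linear scaling
        result[n] = (speedup, efficiency)
    return result

result = scaling(timings)
for n, (speedup, efficiency) in result.items():
    print(f"{n} processors: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

An efficiency close to 1 at every processor count is what "linear scaling" means in practice.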

We also ran the benchmarks on the AMD server (for comparison, we provide results on the Intel server as well):

# processors	lj Time(s)	chain Time(s)	eam Time(s)	rhodo Time(s)	Notes
16	180	84	476	2877	-
8	329	149	908	5506	-
4	547	248	1509	9398	-
1	1651	724	4708	-	-
Intel (8 proc)	158	67	396	2361	1
Scaled Intel (8 proc)	217	92	545	3246	2
Scaled Intel(8 proc)/ AMD (16 proc)	1.20	1.15	1.14	1.13	3

¹ Intel: E5 2643 @ 3.3 GHz, AMD: Opteron 6378 @ 2.4 GHz
² Scaling was done by the factor of 3.3/2.4=1.375
³ Comparison of 8 processors run on Intel vs 16 processors run on AMD
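The clock scaling in notes 2 and 3 is a short piece of arithmetic. It is sketched here for the "lj" column only, with the clock speeds from note 1. The variable names are illustrative, and the table entries appear to be rounded or truncated, so the last digit of the computed values may differ from them.

```python
INTEL_GHZ = 3.3   # E5-2643
AMD_GHZ = 2.4     # Opteron 6378

# Factor by which the Intel time is stretched to mimic the AMD clock speed.
factor = INTEL_GHZ / AMD_GHZ

intel_lj_8proc = 158   # seconds, Intel server, 8 processors
amd_lj_16proc = 180    # seconds, AMD server, 16 processors

scaled_intel = intel_lj_8proc * factor
ratio = scaled_intel / amd_lj_16proc

print(f"scaling factor: {factor:.3f}")
print(f"scaled Intel time (8 proc): {scaled_intel:.0f} s")
print(f"scaled Intel (8 proc) / AMD (16 proc): {ratio:.2f}")
```

A ratio above 1 means that 16 AMD cores finish sooner than 8 Intel cores would at the AMD clock speed.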

Comparison of Intel, Open64 and GNU Builds

LAMMPS with Intel compiler:

module load intel openmpi
CC = mpiCC
CCFLAGS = -O2 -msse2
FFT_INC = -I$(HPC_MKL_DIR)/include/fftw
FFT_PATH =
FFT_LIB = -L$(HPC_MKL_DIR)/lib/intel64 -lfftw3xc -lmkl_intel_lp64 -lmkl_sequential -lmkl_core

LAMMPS with open64 compiler:

module load open64/.4.5.2 openmpi
CC = mpiCC
CCFLAGS = -O2 -msse2
MPI_DIR = /usr/mpi/open64/openmpi-1.6
MPI_INC = -I$(MPI_DIR)/include
MPI_LIB = -L$(MPI_DIR)/lib64 -lmpi
MPI_PATH =
FFT_DIR = /home/manoj/FFTW/charlie/3.3.2
FFT_INC = -I$(FFT_DIR)/include/fftw3
FFT_PATH = 
FFT_LIB = -L$(FFT_DIR)/lib -lfftw3

LAMMPS with gnu compiler:

module load gcc/.4.7.2 openmpi
CC = g++
CCFLAGS = -O2 -msse2
MPI_DIR = /usr/mpi/gnu/openmpi-1.6
MPI_INC = -I$(MPI_DIR)/include
MPI_LIB = -L$(MPI_DIR)/lib64 -lmpi -lmpi_cxx
MPI_PATH =
FFT_DIR = /home/manoj/FFTW/gnu/3.3.2
FFT_INC = -I$(FFT_DIR)/include/fftw3
FFT_PATH = 
FFT_LIB = -L$(FFT_DIR)/lib -lfftw3

For testing, we only ran "lj" benchmark and found:

Compiler	Intel Time(s)	Intel Time(s)	AMD Time(s)	AMD Time(s)
Intel	158	151	329	321
Open64	173	-	352	337
GNU	152	145	341	320
NOTES	1	2	1,3	2,3

¹ Basic Flags:
Intel: -O2 -msse2
Open64: -O2 -msse2
GNU: -O2 -msse2

² Fancy Flags:
Intel: -O2 -mavx -unroll-aggressive -ipo -opt-prefetch -use-intel-optimized-headers
Open64: CCFLAGS =-OPT:Ofast -mavx -mfma4 -march=bdver1 -O2 -fomit-frame-pointer -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4 -malign-double -fstrict-aliasing -fno-schedule-insns -ffast-math
GNU: CCFLAGS= -O2 -mavx -fsched-pressure -flto -funroll-all-loops -fprefetch-loop-arrays -minline-all-stringops -fno-tree-pre -ftree-vectorize

³ Runs on AMD servers are "naive": caches are shared and so are FPUs (floating point units)

Intel (2 x E5-2643 @ 3.30GHz)

Intel Compiler and SIMD Sets

We used Intel compiler as follows:

module load intel openmpi
CC = mpiCC
CCFLAGS = -O2 -xSSE2
FFT_INC =  -I$(HPC_MKL_DIR)/include/fftw
FFT_PATH =
FFT_LIB = -L$(HPC_MKL_DIR)/lib/intel64 -lfftw3xc -lmkl_intel_lp64 -lmkl_sequential -lmkl_core

The following table shows the run time for different Streaming SIMD Extension (SSE) sets (# threads=8):

Extension set	Time(s)
sse2	158
sse3	157
ssse3	157
sse4.1	158
sse4.2	157
avx	152

"avx" instruction set is slightly better than the other sets!

The binaries for the above SIMD sets were built with the "-x" option, which does not work on the AMD server for instruction sets other than sse2. For the next step we therefore built our binaries with the "-m" option and ran them on the Intel and AMD servers to check that the same binaries run on both. The following table shows the results for the "lj" benchmark:

SIMD Instruction	Intel Time(s)	AMD Time(s)
sse2	158	329
sse3	157	329
ssse3	157	329
sse4.1	158	330
sse4.2	157	329
avx	152	319
Notes	-	1

¹ Runs on AMD servers are "naive": caches are shared and so are FPUs (floating point units).

Clearly, on both the Intel and the AMD servers, "avx" instructions are the better choice for the "lj" benchmark. We also ran the other benchmarks with the different SIMD sets (times in seconds):

SIMD Instruction	chain		eam		rhodo
-	Intel	AMD	Intel	AMD	Intel	AMD
sse2	67	149	396	908	2361	5506
sse3	67	149	398	908	2355	5486
ssse3	66	149	399	907	2359	5485
sse4.1	68	148	395	908	2351	5420
sse4.2	66	148	396	909	2346	5479
avx	65	145	387	897	2290	5360
Notes	-	1	-	1	-	1

For all the benchmarks, "avx" seems to be a better choice compared to other instruction sets.

¹ Runs on AMD servers are "naive": caches are shared and so are FPUs (floating point units).

MKL vs FFTW FFTs

We profiled the code to see where it spends most of its time. Below is a summary of the time spent in the FFTs for all the benchmarks that we tried.

lj:

time   seconds   seconds    calls   s/call   s/call  name
0.00    185.61     0.00        1     0.00     0.00  LAMMPS_NS::FFT3d::timing1d(double*, int, int)
0.00    185.61     0.00        1     0.00     0.00  fft_1d_only

chain:

time   seconds   seconds    calls   s/call   s/call  name
0.00     70.62     0.00        1     0.00     0.00  LAMMPS_NS::FFT3d::timing1d(double*, int, int)
0.00     70.62     0.00        1     0.00     0.00  fft_1d_only

eam:

time   seconds   seconds    calls   s/call   s/call  name
0.00    381.72     0.00        1     0.00     0.00  LAMMPS_NS::FFT3d::timing1d(double*, int, int)
0.00    381.72     0.00        1     0.00     0.00  fft_1d_only

rhodo:

time   seconds   seconds    calls   s/call   s/call  name
0.81    118.18     1.04 10423040     0.00     0.00  kf_work(FFT_DATA*, FFT_DATA const*, unsigned long, int, int*, kiss_fft_state*)
0.72    119.10     0.92 31269120     0.00     0.00  kf_bfly4(FFT_DATA*, unsigned long, kiss_fft_state*, unsigned long)

We can clearly see that the code does not spend any significant time in the FFT routines for any of the benchmarks. So, if we change the FFT from MKL to FFTW, it should not change the performance at all. As a check, we built LAMMPS with FFTW FFTs using:

module load intel openmpi fftw
CC = mpiCC
CCFLAGS = -O2 -mavx
FFT_DIR = /apps/fftw/3.3.2
FFT_INC = -I$(FFT_DIR)/include/fftw3
FFT_PATH = 
FFT_LIB = -L$(FFT_DIR)/lib -lfftw3

For the test, we ran "lj" benchmark on the Intel server and found:

FFT	Time(s)
MKL	152
FFTW	152

As expected, there is no difference between the FFTs from FFTW or MKL on the performance of LAMMPS.
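The profile check that motivated this test can be automated by summing the "% time" column over the FFT-related routines of a gprof flat profile. The sketch below assumes the column layout of the listings above and treats any routine whose name (including its argument types) mentions "fft" as FFT-related; the function name fft_share and the matching rule are illustrative choices.

```python
import re

# Flat-profile line: % time, cumulative s, self s, then optionally
# calls, self s/call and total s/call, and finally the routine name.
LINE = re.compile(
    r"^\s*(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)"
    r"(?:\s+\d+\s+\d+\.\d+\s+\d+\.\d+)?\s+(\S.*)$"
)

def fft_share(profile_text):
    """Sum the '% time' column over routines whose name mentions FFT."""
    total = 0.0
    for line in profile_text.splitlines():
        match = LINE.match(line)
        if match and re.search(r"fft", match.group(4), re.IGNORECASE):
            total += float(match.group(1))
    return total

# Profile excerpts copied from the listings above.
lj_profile = """\
time   seconds   seconds    calls   s/call   s/call  name
0.00    185.61     0.00        1     0.00     0.00  LAMMPS_NS::FFT3d::timing1d(double*, int, int)
0.00    185.61     0.00        1     0.00     0.00  fft_1d_only
"""
rhodo_profile = """\
time   seconds   seconds    calls   s/call   s/call  name
0.81    118.18     1.04 10423040     0.00     0.00  kf_work(FFT_DATA*, FFT_DATA const*, unsigned long, int, int*, kiss_fft_state*)
0.72    119.10     0.92 31269120     0.00     0.00  kf_bfly4(FFT_DATA*, unsigned long, kiss_fft_state*, unsigned long)
"""

for name, text in (("lj", lj_profile), ("rhodo", rhodo_profile)):
    print(f"{name}: {fft_share(text):.2f}% of run time in FFT routines")
```

Only the excerpts shown above are parsed here, so the sums cover the listed routines and not the complete profiles.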

AMD (4 x 6378 @ 2.4 GHz)

In this section, we describe the effect of the shared cache of the AMD server on the performance of LAMMPS. The results are summarized in the following table (# threads=8):

Run Scheme	lj	chain	eam	rhodo
Shared time(s)	319	145	897	5360
Exclusive time (s)	277	126	778	4426

In the above table, "shared" represents cores that share resources such as L2-Cache and FPU, while "exclusive" stands for cores which do not share any resources.

Performance Comparison

Following is a table for performance comparison of Intel and AMD servers when the job was run using 8 threads.

Server	lj	chain	eam	rhodo
Intel	158	65	387	2290
Intel (Scaled)	217	89	532	3149
AMD (Shared)	319	145	897	5360
AMD (Exclusive)	277	126	778	4426
AMD Shared/AMD Exc.	1.15	1.15	1.15	1.21
AMD Exc./Intel (scaled)	1.28	1.42	1.46	1.41

In the above table, "shared" represents cores that share resources such as L2-Cache and FPU, while "exclusive" stands for cores which do not share any resources.

GROMACS

Comparison of Intel, Open64 and GNU Builds

Intel compiler:

module load intel openmpi
export F77=mpif77
export F90=mpif90
export CC=mpicc
export CFLAGS="-O2 -msse2"
export FFLAGS="-O2 -msse2"
./configure --prefix=/home/manoj/profile/gromacs/gromacs-4.5.5 --enable-shared=yes --enable-mpi --without-x --disable-float --with-fft=mkl LIBS="-L/opt/intel/composerxe/lib/intel64 -lfftw3xc -lmkl_intel_lp64 -lmkl_sequential -lmkl_core"
make
make install

Open64 compiler:

module load open64/.4.5.2 openmpi
export F77=openf90
export F90=openf90
export CC=opencc
export CFLAGS="-O2 -msse2"
export FFLAGS="-O2 -msse2"
export CPPFLAGS="-I/home/manoj/FFTW/fpic-charlie/3.3.2/include"
export LDFLAGS="-L/home/manoj/FFTW/fpic-charlie/3.3.2/lib"
./configure --prefix=/home/manoj/profile/gromacs/gromacs-4.5.5 --enable-shared=yes --enable-mpi --without-x --disable-float
make
make install

GNU compiler:

module load gcc/.4.7.2 openmpi
export F77=gfortran
export F90=f95
export CC=gcc
export CFLAGS="-O2 -msse2"
export FFLAGS="-O2 -msse2"
export CPPFLAGS="-I/home/manoj/FFTW/gnu/3.3.2/include"
export LDFLAGS="-L/home/manoj/FFTW/gnu/3.3.2/lib"
./configure --prefix=/home/manoj/profile/gromacs/gromacs-4.5.5 --enable-shared=yes --enable-mpi --without-x --disable-float
make
make install

There are some test cases in the "gromacs-4.5.5/share/tutor" directory; however, not all of them work. So far, I could only get "water", "methane", and "mixed" to work. Instructions to run MD simulations are on the http://manual.gromacs.org/online/water.html page. You first need to create a ".tpr" file using

./grompp_d -v

After this, you can run "mdrun_d" for the molecular dynamics simulation.

In the input provided with GROMACS, there is a mistake in the "grompp.mdp" file: the line starting with "bd-temp" has to be commented out. After some internet searching, I found that this "grompp.mdp" is an input file for an older version of GROMACS, and apparently some of its parameters have become obsolete.

The following results are for the MD simulation of water using 8 processors on the Intel (E5-2643) and AMD (Opteron 6378) servers:

Compiler	Intel Time(s)	Intel Time(s)	AMD Time(s)	AMD Time(s)
Intel	157	157	361	363
Open64	167	-	392	383
GNU	160	-	377	368
NOTES	1	2	1,3	2,3

¹ Basic Flags:
Intel: -O2 -msse2
Open64: -O2 -msse2
GNU: -O2 -msse2

² Fancy Flags:
Intel: -O2 -mavx -unroll-aggressive -opt-prefetch -use-intel-optimized-headers
Open64: CCFLAGS =-OPT:Ofast -mavx -mfma4 -march=bdver1 -O2 -fomit-frame-pointer -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4 -malign-double -fstrict-aliasing -fno-schedule-insns
GNU: CCFLAGS= -O2 -mavx -fsched-pressure -flto -funroll-all-loops -fprefetch-loop-arrays -minline-all-stringops -fno-tree-pre -ftree-vectorize

³ Runs on AMD servers are "naive": caches are shared and so are FPUs (floating point units)

Scaling with Number of Processors

"Water" benchmark was run using gromacs compiled with the intel and openmpi (fancy flags) as shown on above section.

Following table describes the variation of run time with number of processors on the Intel (E5-2643) server:

# processors	Time(s)	Factor
1	950	1.00
4	252	3.76
8	157	6.05

Scaling with the number of processors on the Intel server is close to linear up to 4 processors (factor 3.76) and falls off at 8 processors (factor 6.05).

We also ran the same benchmark on the AMD (Opteron 6378) server (for comparison, we provide results on the Intel server as well):

# processors	water Time(s)	Notes
16	241	-
8	363	-
4	532	-
1	1288	-
Intel (8 proc)	157	1
Scaled Intel (8 proc)	216	2
AMD (16 proc)/ Scaled Intel(8 proc)	1.12	3

¹ Intel: E5 2643 @ 3.3 GHz, AMD: Opteron 6378 @ 2.4 GHz
² Scaling was done by the factor of 3.3/2.4=1.375
³ Comparison of 8 processors run on Intel vs 16 processors run on AMD

Instruction Set Dependence

GROMACS is compiled with the following flags:

module load intel openmpi mkl
export F77=mpif77
export F90=mpif90
export CC=mpicc
export CFLAGS="-O2 -msse2 -unroll-aggressive -opt-prefetch -use-intel-optimized-headers"
export FFLAGS="-O2 -msse2 -unroll-aggressive -opt-prefetch -use-intel-optimized-headers"
./configure --prefix=/home/manoj/profile/gromacs/gromacs-4.5.5 --enable-shared=yes --enable-mpi --without-x --disable-float --with-fft=mkl LIBS="-L/opt/intel/composerxe/lib/intel64 -lfftw3xc -lmkl_intel_lp64 -lmkl_sequential -lmkl_core"
make
make install

The following table shows the run time (s) for different instruction sets on the Intel (E5-2643) and AMD (Opteron 6378) machines with 8 processes:

SIMD Instruction	Intel	AMD
sse2	158	364
sse3	158	362
ssse3	157	362
sse4.1	159	360
sse4.2	157	362
avx	157	363
Notes	-	1

GROMACS seems to be instruction set independent.
¹ Runs on AMD servers are "naive": caches are shared and so are FPUs (floating point units)

MKL vs FFTW

Profiling the code is problematic: we do not see as many subroutines as we would like. The Intel build is still usable, but the GNU build is worse and shows only one subroutine. We do not see any FFT routine in the test case that we are using.

Intel Compiler

time    seconds   seconds    calls   s/call   s/call  name
68.42      0.13     0.13        6    21.67    28.33  do_md
15.79      0.16     0.03  2160054     0.00     0.00  copy_rvec
 5.26      0.17     0.01   600018     0.00     0.00  clear_mat
 5.26      0.18     0.01                             _intel_fast_memcpy
 5.26      0.19     0.01                             _intel_fast_memcpy.P
 0.00      0.19     0.00   720018     0.00     0.00  copy_mat
 0.00      0.19     0.00        6     0.00    28.33  mdrunner
 0.00      0.19     0.00        3     0.00     0.00  copy_rvec
 0.00      0.19     0.00        1     0.00     0.00  copy_mat
 0.00      0.19     0.00        1     0.00     0.00  get_nthreads
 0.00      0.19     0.00        1     0.00     0.00  mdrunner_start_threads

GNU Compiler

time     seconds   seconds    calls   s/call   s/call  name
100.01     0.02     0.02        1    20.00    20.00  do_md

To see the FFT dependence, we built GROMACS with FFTW FFTs using:

module load intel openmpi fftw
export F77=mpif77
export F90=mpif90
export CC=mpicc
export CFLAGS="-O2 -mavx -unroll-aggressive -opt-prefetch -use-intel-optimized-headers"
export FFLAGS="-O2 -mavx -unroll-aggressive -opt-prefetch -use-intel-optimized-headers"
#export CFLAGS="-O2 -msse2" 
#export FFLAGS="-O2 -msse2" 
export CPPFLAGS="-I/apps/fftw/3.3.2/include"
export LDFLAGS="-L/apps/fftw/3.3.2/lib -lfftw3"
./configure --prefix=/home/manoj/profile/gromacs/gromacs-4.5.5 --enable-shared=yes --enable-mpi --without-x --disable-float
make
make install

We ran the "water" benchmark on the Intel (E5-2643) and AMD (Opteron 6378) servers using 8 processors and found:

FFT	Intel Time(s)	AMD Time(s)
MKL	157	363
FFTW	157	360

There seems to be no difference between the FFTs from FFTW or MKL on the performance of GROMACS.

Shared vs Exclusive run on AMD servers (4 x 6378 @ 2.4 GHz)

We describe the effect of the shared FPU and L2 cache of the AMD server. The results are summarized in the following table (# processes=8):

Run Scheme	Time(s)
Shared time(s)	363
Exclusive time (s)	266

In the above table, "shared" represents cores that share resources such as L2-Cache and FPU, while "exclusive" stands for cores which do not share any resources.

Performance Comparison

Following is a table for performance comparison of Intel and AMD servers when the job was run using 8 processors:

Server	Time(s)	Notes
Intel	157	1
Intel (Scaled)	216	2
AMD (Shared)	363	3
AMD (Exclusive)	266	4
AMD Shared/AMD Exc.	1.36	-
AMD Exc./Intel (scaled)	1.23	-

¹ Intel: E5 2643 @ 3.3 GHz, AMD: Opteron 6378 @ 2.4 GHz
² Scaling was done by the factor of 3.3/2.4=1.375
³ "Shared" represents cores that share resources such as L2-Cache and FPU
⁴ "Exclusive" represents cores that don't share resources such as L2-Cache and FPU

DL_POLY

Comparison of Intel, Open64 and GNU Builds

Intel compiler:

module load intel openmpi
$(MAKE) LD="mpif90 -v -o " \
                LDFLAGS="-shared-intel" \
                FC="mpif90 -c" \
                FCFLAGS="-O3 -mavx -opt-prefetch -use-intel-optimized-headers" \
                EX=$(EX) BINROOT=$(BINROOT) $(TYPE)

GNU compiler:

module load gcc/.4.7.2 openmpi
$(MAKE) LD="mpif90 -v -o " \
                LDFLAGS=" " \
                FC="mpif90 -c" \
                FCFLAGS="-O3 -mavx -fsched-pressure -flto -funroll-all-loops -fprefetch-loop-arrays -minline-all-stringops -fno-tree-pre -ftree-vectorize" \
                EX=$(EX) BINROOT=$(BINROOT) $(TYPE)

Open64 compiler:

Open64 shows a problem in the file config_module.f90 at line 62, in a routine that resizes an array. The line

Character( Len = * ), Allocatable, Intent( InOut ) :: a(:)

declares "a" as an allocatable array of strings with assumed length (Len = *), so the length of the strings is not given explicitly. The Intel and GNU compilers can handle this; the Open64 compiler cannot. We modified one assignment at this line (interested readers should look up the code) to make the Open64 build work.

 
 module load open64/.4.5.2 openmpi
 $(MAKE) LD="mpif90 -v -o " \
                LDFLAGS=" " \
                FC="mpif90 -c" \
                FCFLAGS="-OPT:Ofast -mavx -mfma4 -march=bdver1 -O2 -fomit-frame-pointer -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4 -malign-double -fstrict-aliasing -fno-schedule-insns -ffast-math " \
                EX=$(EX) BINROOT=$(BINROOT) $(TYPE)

We downloaded 42 benchmarks from the DL_POLY website. Since this number is large, we looked into the profiling results and found that there are in fact only 15 cases that exercise different code inside the DL_POLY package; we present our results based on those.

Following results are for the various benchmarks using 8 processors on Intel (E5-2643) servers:

Compiler	Intel-Basic Time(s)	Intel-Fancy Time(s)	GNU-Basic Time(s)	GNU-Fancy Time(s)	Open64-Basic Time(s)
Test1	66	62	71	83	75
Test3	69	65	76	75	79
Test4	74	87	77	78	87
Test5	70	65	81	75	87
Test7	95	88	105	99	106
Test9	67	57	85	69	117
Test11	72	61	85	83	91
Test13	63	58	65	62	72
Test14	106	114	113	136	91
Test17	67	65	74	72	77
Test18	102	89	104	99	115
Test27	136	131	154	-	158
Test31	114	96	121	121	134
Test35	80	81	90	90	93

We also compared the Intel and Open64 builds on the AMD server; the following table shows that comparison:

Compiler	Intel-Fancy Time(s)	Open64-Basic Time(s)	Open64-Fancy Time(s)
Test1	95	91	90
Test3	221	235	219
Test4	213	225	211
Test5	142	173	161
Test7	300	310	295
Test9	144	219	209
Test11	158	202	183
Test13	247	262	245
Test14	186	165	149
Test17	190	201	189
Test18	356	353	385
Test27	323	377	334
Test31	269	313	290
Test35	228	251	240

¹ Basic Flags:
Intel: -O2 -msse2
Open64: -O2 -msse2
GNU: -O2 -msse2

² Fancy Flags:
Intel: -O2 -mavx -opt-prefetch -use-intel-optimized-headers
GNU: CCFLAGS= -O2 -mavx -fsched-pressure -flto -funroll-all-loops -fprefetch-loop-arrays -minline-all-stringops -fno-tree-pre -ftree-vectorize
Open64: -OPT:Ofast -mavx -mfma4 -march=bdver1 -O2 -fomit-frame-pointer -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4 -malign-double -fstrict-aliasing -fno-schedule-insns -ffast-math

More studies on Test 14

Intel compiler:

Intel server: CPU Time = 39s, Wall Time = 49s

 %   cumulative   self              self     total
time   seconds   seconds    calls   s/call   s/call  name
13.56      4.85     4.85      182     0.03     0.03  constraints_shake_vv_
 9.09      8.10     3.25       14     0.23     0.27  link_cell_pairs_
 7.94     10.94     2.84       78     0.04     0.04  deport_atomic_data_
 5.65     12.96     2.02                             __intel_ssse3_rep_memcpy
 5.56     14.95     1.99       14     0.14     0.14  ewald_spme_forces_IP_spme_forces_
 5.29     16.84     1.89       14     0.14     0.63  ewald_spme_forces_

AMD server: CPU Time = 151s, Wall Time = 190s

 %   cumulative   self              self     total
time   seconds   seconds    calls   s/call   s/call  name
19.85     23.87    23.87       91     0.26     0.27  constraints_shake_vv_
12.33     38.70    14.83       78     0.19     0.19  deport_atomic_data_
 6.82     46.90     8.20       13     0.63     0.64  constraints_rattle_
 6.06     54.19     7.29       14     0.52     0.60  link_cell_pairs_
 5.61     60.94     6.75     3836     0.00     0.00  gpfa_module_mp_gpfa3f_
 5.32     67.34     6.40       14     0.46     0.46  bspgen_
 5.15     73.53     6.19       52     0.12     0.77  npt_b0_vv_

Open64 compiler:

Intel server: CPU Time = 45s, Wall Time = 290s

 %   cumulative   self              self     total
time   seconds   seconds    calls   s/call   s/call  name
15.64      5.64     5.64      182     0.03     0.03  constraints_shake_vv__
12.06      9.99     4.35       14     0.31     0.69  ewald_spme_forces__
11.37     14.09     4.10       14     0.29     0.34  link_cell_pairs__
 9.45     17.50     3.41       78     0.04     0.04  deport_atomic_data__
 5.52     19.49     1.99  3378329     0.00     0.00  images_
 5.16     21.35     1.86     3808     0.00     0.00  GPFA2F.in.GPFA_MODULE
 5.16     23.21     1.86       26     0.07     0.07  constraints_rattle__

AMD server: CPU Time = 156s, Wall Time = 164s

 %   cumulative   self              self     total
time   seconds   seconds    calls   s/call   s/call  name
18.38     22.58    22.58       91     0.25     0.25  constraints_shake_vv__
13.07     38.64    16.06       78     0.21     0.21  deport_atomic_data__
 8.29     48.82    10.18       14     0.73     2.73  ewald_spme_forces__
 7.54     58.09     9.27       14     0.66     0.76  link_cell_pairs__
 5.99     65.45     7.36       13     0.57     0.58  constraints_rattle__
 5.86     72.65     7.20     3836     0.00     0.00  GPFA3F.in.GPFA_MODULE
 5.34     79.21     6.56       14     0.47     0.47  bspgen_
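
The flat profiles above can also be compared programmatically. The following is a minimal Python sketch, not part of the original study, that parses gprof flat-profile rows and computes, for each routine present in both runs, the ratio of self time on the AMD server to self time on the Intel server (Intel compiler build, rows copied from above). The helper name self_seconds is illustrative. Note that the call counts differ between the two runs, so the ratios are only a rough guide.

```python
# Rows copied from the Intel-compiler profiles of Test 14 shown above.
INTEL_SERVER = """\
 %   cumulative   self              self     total
time   seconds   seconds    calls   s/call   s/call  name
13.56      4.85     4.85      182     0.03     0.03  constraints_shake_vv_
 9.09      8.10     3.25       14     0.23     0.27  link_cell_pairs_
 7.94     10.94     2.84       78     0.04     0.04  deport_atomic_data_
 5.65     12.96     2.02                             __intel_ssse3_rep_memcpy
"""

AMD_SERVER = """\
19.85     23.87    23.87       91     0.26     0.27  constraints_shake_vv_
12.33     38.70    14.83       78     0.19     0.19  deport_atomic_data_
 6.82     46.90     8.20       13     0.63     0.64  constraints_rattle_
 6.06     54.19     7.29       14     0.52     0.60  link_cell_pairs_
"""


def self_seconds(profile_text):
    """Map routine name -> self seconds for every data row of a flat profile."""
    result = {}
    for line in profile_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or malformed line
        try:
            # The first three columns are % time, cumulative s, self s.
            _percent, _cumulative, self_s = (float(x) for x in fields[:3])
        except ValueError:
            continue  # header line
        # The routine name is always the last column, even when the
        # calls and s/call columns are empty.
        result[fields[-1]] = self_s
    return result


intel = self_seconds(INTEL_SERVER)
amd = self_seconds(AMD_SERVER)

# Ratio of AMD self time to Intel self time for routines seen in both runs.
slowdown = {name: amd[name] / intel[name] for name in intel if name in amd}

for name, ratio in sorted(slowdown.items(), key=lambda item: -item[1]):
    print(f"{name:28s} {ratio:5.2f}x")
```

Routines that appear in only one of the two listings (for example constraints_rattle_ here) are skipped, because only the top rows of each profile were copied.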

Scaling with Number of Processors

From the data in the above section, we conclude that the Intel build with fancy flags is the better choice for this code, so all further runs use only the Intel build.

The following table shows how the run time (in seconds) varies with the number of processors on the Intel (E5-2643) server; the factor in parentheses is the speedup relative to the 1-processor run:

#proc	Test1	Test3	Test4	Test5	Test7	Test9	Test11	Test13	Test14	Test17	Test18	Test27	Test31	Test35
1 (Factor)	91 (1.00)	389 (1.00)	541 (1.00)	511 (1.00)	502 (1.00)	755 (1.00)	397 (1.00)	292 (1.00)	255 (1.00)	302 (1.00)	445 (1.00)	651 (1.00)	612 (1.00)	-
4 (Factor)	90 (1.01)	122 (3.18)	160 (3.38)	119 (4.29)	173 (2.90)	119 (6.34)	119 (3.34)	116 (2.52)	147 (1.73)	105 (2.87)	147 (3.03)	229 (2.84)	184 (3.33)	136 (-)
8 (Factor)	62 (1.47)	65 (5.98)	87 (6.22)	65 (7.86)	88 (5.70)	57 (13.2)	61 (6.51)	58 (5.03)	114 (2.23)	65 (4.65)	89 (5.00)	131 (4.96)	96 (6.38)	81 (-)

The following table shows the same results as scaling factors relative to the 1-processor run. It contains the same information as the table above.

#proc	Test1	Test3	Test4	Test5	Test7	Test9	Test11	Test13	Test14	Test17	Test18	Test27	Test31	Test35
1	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	-
4	1.01	3.18	3.38	4.29	2.90	6.34	3.34	2.52	1.73	2.87	3.03	2.84	3.33	-
8	1.47	5.98	6.22	7.86	5.70	13.2	6.51	5.03	2.23	4.65	5.00	4.96	6.38	-
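
The scaling factors in the tables above are the 1-processor run time divided by the n-processor run time. A minimal Python sketch of that arithmetic is given below, using two rows of the Intel table as input. The parallel efficiency (speedup divided by the number of processors) does not appear in the original tables and is added here only as an illustration; the function names are assumptions.

```python
# Run times in seconds on the Intel server, copied from the table above.
# Keys of the inner dictionaries are the number of processors.
times = {
    "Test3": {1: 389, 4: 122, 8: 65},
    "Test14": {1: 255, 4: 147, 8: 114},
}


def speedup(runtimes, nproc):
    """Speedup relative to the 1-processor run."""
    return runtimes[1] / runtimes[nproc]


def efficiency(runtimes, nproc):
    """Parallel efficiency: speedup divided by the processor count."""
    return speedup(runtimes, nproc) / nproc


for test, runtimes in times.items():
    for nproc in sorted(runtimes):
        print(f"{test:7s} nproc={nproc}  "
              f"speedup={speedup(runtimes, nproc):5.2f}  "
              f"efficiency={efficiency(runtimes, nproc):4.2f}")
```

With these inputs Test3 keeps an efficiency of roughly three quarters on 8 processors, while Test14 drops below one third, which is consistent with Test14 being singled out for further study.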

We also ran the same benchmarks on the AMD (Opteron 6378) server (for comparison, we provide the 8-processor results on the Intel server as well):

#proc	Test1	Test3	Test4	Test5	Test7	Test9	Test11	Test13	Test14	Test17	Test18	Test27	Test31	Test35	Notes
1	91	395	619	788	1406	1292	825	393	441	882	898	651	651	-	-
4	102	362	375	277	521	285	307	376	267	318	497	554	477	381	-
8	95	221	213	142	300	144	158	247	186	190	356	323	269	228	-
16	90	127	112	78	167	74	90	170	116	145	193	192	152	170	-
Intel(8-proc)	62	65	87	65	88	57	61	58	114	65	89	131	96	81	1
Scaled Intel	85	89	120	89	121	78	84	80	157	89	122	180	132	111	2
AMD (16 proc)/ Intel(8 proc)	1.06	1.43	0.93	0.88	1.38	0.95	1.07	2.14	0.74	1.63	1.58	1.07	1.15	1.53	3

Scaling on the AMD server with respect to the 1-processor runs:

#proc	Test1	Test3	Test4	Test5	Test7	Test9	Test11	Test13	Test14	Test17	Test18	Test27	Test31	Test35
1	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	-
4	0.89	1.09	1.65	2.84	2.70	4.53	2.69	1.05	1.65	2.77	1.81	1.18	1.36	-
8	0.96	1.79	2.91	5.55	4.69	8.97	5.22	1.59	2.37	4.64	2.52	2.02	2.42	-
16	1.01	3.11	5.53	10.1	8.42	17.46	9.17	2.31	3.80	6.08	4.65	3.39	4.28	-

¹ Intel: E5-2643 @ 3.3 GHz, AMD: Opteron 6378 @ 2.4 GHz
² Scaling was done by the factor of 3.3/2.4=1.375
³ Comparison of 8 processors run on Intel vs 16 processors run on AMD
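
The clock-frequency scaling in note 2 and the ratio row of the table can be reproduced with a few lines of Python. This is a sketch under the assumption that the "Scaled Intel" times were rounded to whole seconds before the ratio was taken, which appears to match the tabulated values; only three of the tests are included.

```python
INTEL_GHZ = 3.3
AMD_GHZ = 2.4
factor = INTEL_GHZ / AMD_GHZ  # the 1.375 of note 2

# Run times in seconds, copied from the table above.
intel_8proc = {"Test1": 62, "Test3": 65, "Test14": 114}
amd_16proc = {"Test1": 90, "Test3": 127, "Test14": 116}

# Intel times stretched to the AMD clock frequency (assumed rounding to seconds).
scaled_intel = {test: round(t * factor) for test, t in intel_8proc.items()}

# Ratio of the 16-processor AMD time to the scaled 8-processor Intel time.
ratio = {test: amd_16proc[test] / scaled_intel[test] for test in intel_8proc}

for test in intel_8proc:
    print(f"{test:7s} scaled Intel = {scaled_intel[test]:3d} s, "
          f"AMD(16)/Intel(8) = {ratio[test]:.2f}")
```

A ratio below 1 means that 16 AMD processors finish sooner than 8 Intel processors would at the same clock frequency.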

MKL vs FFTW FFTs

We profiled the code to see where it spends most of its time. Below is a summary of the profile for each of the benchmarks that we tried.

Test1:

time   seconds   seconds    calls   s/call   s/call  name
28.26      7.16     7.16   675309     0.00     0.00  vdw_forces_
25.57     13.64     6.48      201     0.03     0.03  link_cell_pairs_
17.72     18.13     4.49   675309     0.00     0.00  ewald_real_forces_
 7.22     19.96     1.83      201     0.01     0.01  ewald_spme_forces_IP_spme_forces_
 5.17     21.27     1.31      201     0.01     0.02  ewald_spme_forces_
 3.95     22.27     1.00      201     0.00     0.12  two_body_forces_
 3.83     23.24     0.97   675309     0.00     0.00  images_
 3.08     24.02     0.78      201     0.00     0.00  bspgen_
 1.42     24.38     0.36    11658     0.00     0.00  gpfa_module_mp_gpfa3f_
 1.07     24.65     0.27      202     0.00     0.00  shellsort2_
 0.67     24.82     0.17     1206     0.00     0.00  export_atomic_data_

Test3:

time   seconds   seconds    calls   s/call   s/call  name
17.53     26.21    26.21  1298431     0.00     0.00  vdw_forces_
14.82     48.37    22.16      201     0.11     0.13  link_cell_pairs_
14.22     69.62    21.25    26532     0.00     0.00  gpfa_module_mp_gpfa2f_
11.73     87.16    17.54  1298431     0.00     0.00  ewald_real_forces_
 6.62     97.06     9.90     1200     0.01     0.01  deport_atomic_data_
 4.04    103.10     6.04      201     0.03     0.21  ewald_spme_forces_
 3.80    108.78     5.68      200     0.03     0.03  constraints_shake_vv_
 3.37    113.82     5.04 43159044     0.00     0.00  local_index_
 3.32    118.79     4.97  2355005     0.00     0.00  images_
 3.06    123.36     4.57      200     0.02     0.03  constraints_rattle_
 2.99    127.82     4.47 326609967     0.00     0.00  match_
 2.57    131.66     3.84      201     0.02     0.02  ewald_spme_forces_IP_spme_forces_
 2.01    134.67     3.01      201     0.01     0.60  two_body_forces_
 1.10    136.32     1.65      201     0.01     0.01  parallel_fft_mp_forward_3d_fft_z_
 1.04    137.87     1.55      201     0.01     0.06  parallel_fft_mp_forward_3d_fft_y_
 0.98    139.33     1.46      201     0.01     0.01  bspgen_

Test4:

time   seconds   seconds    calls   s/call   s/call  name
18.66     29.94    29.94  1604385     0.00     0.00  vdw_forces_
15.84     55.35    25.41       31     0.82     0.98  link_cell_pairs_
13.21     76.55    21.20  1604385     0.00     0.00  ewald_real_forces_
12.69     96.92    20.37       30     0.68     0.71  constraints_shake_vv_
 7.68    109.24    12.32       30     0.41     0.44  constraints_rattle_
 7.32    120.99    11.75      180     0.07     0.07  deport_atomic_data_
 3.79    127.07     6.08  2880181     0.00     0.00  images_
 3.23    132.25     5.18 40320168     0.00     0.00  local_index_
 3.10    137.22     4.98 364097691     0.00     0.00  match_
 2.72    141.58     4.36       31     0.14     0.14  ewald_spme_forces_IP_spme_forces_
 2.16    145.05     3.47       31     0.11     3.38  two_body_forces_
 2.05    148.34     3.29     4092     0.00     0.00  gpfa_module_mp_gpfa2f_
 1.95    151.47     3.13       31     0.10     0.42  ewald_spme_forces_
 0.67    152.54     1.07       31     0.03     0.03  bspgen_

Test5:

time   seconds   seconds    calls   s/call   s/call  name
44.99     50.53    50.53      101     0.50     0.50  three_body_forces_
16.21     68.74    18.21   873523     0.00     0.00  ewald_real_forces_
15.19     85.80    17.06      101     0.17     0.17  link_cell_pairs_
10.25     97.31    11.51   873523     0.00     0.00  vdw_forces_
 3.71    101.48     4.17   873523     0.00     0.00  images_
 3.33    105.22     3.74      101     0.04     0.59  two_body_forces_
 2.08    107.56     2.34      101     0.02     0.02  ewald_spme_forces_IP_spme_forces_
 1.41    109.14     1.58      101     0.02     0.05  ewald_spme_forces_
 0.97    110.23     1.09      101     0.01     0.01  bspgen_

Test7:

time   seconds   seconds    calls   s/call   s/call  name
16.97     13.50    13.50      101     0.13     0.15  link_cell_pairs_
16.75     26.82    13.32     1326     0.01     0.01  constraints_shake_vv_
11.59     36.04     9.22      600     0.02     0.02  deport_atomic_data_
10.69     44.54     8.50  1249646     0.00     0.00  ewald_real_forces_
10.59     52.96     8.42  1249646     0.00     0.00  vdw_forces_
 4.39     56.45     3.49  2122099     0.00     0.00  images_
 4.34     59.91     3.46 23122212     0.00     0.00  local_index_
 4.14     63.20     3.29      101     0.03     0.03  ewald_spme_forces_IP_spme_forces_
 3.34     65.86     2.66      200     0.01     0.11  npt_b0_vv_
 2.87     68.14     2.28      101     0.02     0.09  ewald_spme_forces_
 2.78     70.35     2.21      100     0.02     0.03  constraints_rattle_
 2.21     72.11     1.76      101     0.02     0.46  two_body_forces_
 1.89     73.61     1.51 154519604     0.00     0.00  match_
 1.85     75.08     1.47      101     0.01     0.01  bspgen_
 0.65     75.60     0.52      202     0.00     0.00  shellsort2_

Test9:

95.21     75.98    75.98      401     0.19     0.19  tersoff_forces_
 1.09     76.85     0.87     2400     0.00     0.00  deport_atomic_data_
 0.96     77.62     0.77      402     0.00     0.00  shellsort2_

Test11:

36.75     52.13    52.13  4004908     0.00     0.00  metal_forces_
22.05     83.41    31.28     1001     0.03     0.03  link_cell_pairs_
21.82    114.36    30.95  4004908     0.00     0.00  metal_ld_collect_fst_
10.26    128.91    14.55  8009816     0.00     0.00  images_
 2.59    132.59     3.68     1001     0.00     0.14  two_body_forces_
 2.44    136.05     3.46     1001     0.00     0.04  metal_ld_compute_
 1.35    137.97     1.92     1002     0.00     0.00  shellsort2_
 0.62    138.85     0.88     6000     0.00     0.00  deport_atomic_data_

Test13:

18.91     22.56    22.56    22348     0.00     0.00  gpfa_module_mp_gpfa2f_
12.43     37.39    14.83      900     0.02     0.02  deport_atomic_data_
11.80     51.47    14.08     1950     0.01     0.01  constraints_shake_vv_
 8.38     61.47    10.00      151     0.07     0.08  link_cell_pairs_
 5.49     68.02     6.55      300     0.02     0.09  npt_b0_vv_
 5.02     74.01     5.99      302     0.02     0.02  gpfa_module_mp_gpfa3f_
 4.86     79.81     5.80      151     0.04     0.33  ewald_spme_forces_
 4.38     85.04     5.23      151     0.03     0.04  ewald_spme_forces_IP_spme_forces_
 3.53     89.26     4.22 34695506     0.00     0.00  local_index_
 3.47     93.40     4.14      150     0.03     0.03  constraints_rattle_
 2.94     96.91     3.51  2364786     0.00     0.00  ewald_real_forces_
 2.84    100.30     3.39  4097851     0.00     0.00  images_
 2.36    103.11     2.81  2364786     0.00     0.00  vdw_forces_
 1.32    104.68     1.57      151     0.01     0.01  bspgen_
 1.27    106.19     1.51      151     0.01     0.01  parallel_fft_mp_forward_3d_fft_z_
 1.18    107.60     1.41      151     0.01     0.01  parallel_fft_mp_back_3d_fft_x_
 1.16    108.98     1.38      151     0.01     0.10  parallel_fft_mp_forward_3d_fft_y_
 1.14    110.34     1.36      151     0.01     0.48  two_body_forces_
 1.11    111.66     1.33 92125754     0.00     0.00  match_
 1.09    112.96     1.30      151     0.01     0.01  parallel_fft_mp_forward_3d_fft_x_
 0.99    114.14     1.18      151     0.01     0.10  parallel_fft_mp_back_3d_fft_y_

Test14:

21.33     16.80    16.80      130     0.13     0.13  constraints_shake_vv_
10.48     25.05     8.25       60     0.14     0.14  deport_atomic_data_
 7.24     30.75     5.70       11     0.52     0.60  link_cell_pairs_
 6.41     35.80     5.05     3014     0.00     0.00  gpfa_module_mp_gpfa3f_
 5.75     40.33     4.53       20     0.23     1.35  npt_b0_vv_
 5.30     44.50     4.17       10     0.42     0.43  constraints_rattle_
 5.27     48.65     4.15       11     0.38     2.20  ewald_spme_forces_
 5.13     52.69     4.04     2992     0.00     0.00  gpfa_module_mp_gpfa2f_
 4.28     56.06     3.37       11     0.31     0.31  ewald_spme_forces_IP_spme_forces_
 3.19     58.57     2.51 20333039     0.00     0.00  local_index_
 2.79     60.77     2.20  1534827     0.00     0.00  ewald_real_forces_
 2.78     62.96     2.19  2646191     0.00     0.00  images_
 2.30     64.77     1.81  1534827     0.00     0.00  vdw_forces_
 2.29     66.57     1.80       11     0.16     0.16  bspgen_
 1.92     68.08     1.51       22     0.07     0.07  gpfa_module_mp_gpfa5f_
 1.80     69.50     1.42       11     0.13     0.60  parallel_fft_mp_forward_3d_fft_y_
 1.79     70.91     1.41       11     0.13     0.60  parallel_fft_mp_back_3d_fft_y_
 1.09     71.77     0.86 57924127     0.00     0.00  match_
 0.95     72.52     0.75        1     0.75     0.75  dihedrals_14_check_

Test17:

20.71     10.65    10.65      201     0.05     0.06  link_cell_pairs_
13.88     17.79     7.14   984116     0.00     0.00  ewald_real_forces_
11.43     23.67     5.88   984116     0.00     0.00  vdw_forces_
 9.72     28.67     5.00     3200     0.00     0.00  constraints_shake_vv_
 6.68     32.11     3.44  2260717     0.00     0.00  images_
 5.84     35.11     3.01 23232840     0.00     0.00  local_index_
 5.06     37.71     2.60      201     0.01     0.01  ewald_spme_forces_IP_spme_forces_
 4.53     40.04     2.33      200     0.01     0.02  constraints_rattle_
 3.46     41.82     1.78      201     0.01     0.03  ewald_spme_forces_
 2.51     43.11     1.29     4235     0.00     0.00  pmf_coms_
 2.25     44.27     1.16 118620664     0.00     0.00  match_
 2.18     45.39     1.12     1200     0.00     0.00  deport_atomic_data_
 2.10     46.47     1.08      201     0.01     0.17  two_body_forces_
 2.10     47.55     1.08      201     0.01     0.01  bspgen_
 1.50     48.32     0.77      400     0.00     0.03  npt_b0_vv_
 1.11     48.89     0.57      402     0.00     0.00  shellsort2_
 0.80     49.30     0.41      201     0.00     0.00  pass_shared_units_

Test18:

34.45     69.38    69.38     1800     0.04     0.04  constraints_shake_vv_
 9.78     89.07    19.69      101     0.19     0.22  link_cell_pairs_
 8.63    106.45    17.38      100     0.17     0.20  constraints_rattle_
 6.61    119.77    13.32  2041857     0.00     0.00  ewald_real_forces_
 5.95    131.75    11.98  2041857     0.00     0.00  vdw_forces_
 5.38    142.59    10.84     2718     0.00     0.01  pmf_coms_
 5.29    153.24    10.65  4883071     0.00     0.00  images_
 4.95    163.22     9.98 66656800     0.00     0.00  local_index_
 3.18    169.62     6.40      200     0.03     0.60  npt_b0_vv_
 2.74    175.14     5.52      101     0.05     0.06  ewald_spme_forces_IP_spme_forces_
 1.84    178.84     3.70      101     0.04     0.16  ewald_spme_forces_
 1.71    182.28     3.44      600     0.01     0.01  deport_atomic_data_
 1.51    185.33     3.05      101     0.03     0.03  bspgen_
 1.44    188.24     2.91      101     0.03     0.74  two_body_forces_
 1.18    190.61     2.38 231279151     0.00     0.00  match_
 0.76    192.14     1.53    25304     0.00     0.00  update_shared_units_

Test27:

38.45     91.00    91.00 12076288     0.00     0.00  metal_forces_
24.32    148.56    57.56     3501     0.02     0.02  link_cell_pairs_
17.06    188.93    40.37 12076288     0.00     0.00  metal_ld_collect_eam_
 8.90    210.00    21.07 24152576     0.00     0.00  images_
 2.64    216.24     6.24     3501     0.00     0.06  two_body_forces_
 2.39    221.90     5.66     3501     0.00     0.02  metal_ld_compute_
 1.46    225.35     3.45     3502     0.00     0.00  shellsort2_
 1.11    227.98     2.63    21000     0.00     0.00  deport_atomic_data_
 0.90    230.11     2.13    21006     0.00     0.00  export_atomic_data_

Test31:

31.17     66.30    66.30 12004000     0.00     0.00  metal_forces_
29.18    128.38    62.08     3001     0.02     0.02  link_cell_pairs_
13.22    156.50    28.12 12004000     0.00     0.00  metal_ld_collect_eam_
10.09    177.96    21.46 24008000     0.00     0.00  images_
 5.35    189.35    11.39    18000     0.00     0.00  deport_atomic_data_
 2.99    195.71     6.36     3001     0.00     0.06  two_body_forces_
 2.77    201.60     5.89     3001     0.00     0.02  metal_ld_compute_
 1.29    204.34     2.75     3002     0.00     0.00  shellsort2_
 0.70    205.83     1.49     6000     0.00     0.00  npt_b0_vv_

Test35:

27.22     27.64    27.64      562     0.05     0.06  link_cell_pairs_
19.94     47.89    20.25  1878757     0.00     0.00  ewald_real_forces_
 9.75     57.79     9.90  1878757     0.00     0.00  vdw_forces_
 5.52     63.40     5.61  3227549     0.00     0.00  images_
 5.46     68.95     5.55     2008     0.00     0.00  constraints_shake_vv_
 4.44     73.46     4.51      562     0.01     0.01  ewald_spme_forces_IP_spme_forces_
 3.43     76.95     3.49 354007894     0.00     0.00  match_
 3.38     80.38     3.43      562     0.01     0.15  two_body_forces_
 3.30     83.73     3.35      562     0.01     0.01  bspgen_
 3.23     87.01     3.29 28441741     0.00     0.00  local_index_
 2.87     89.92     2.91      500     0.01     0.01  constraints_rattle_
 2.85     92.81     2.89      562     0.01     0.02  ewald_spme_forces_
 2.05     94.89     2.08     3366     0.00     0.00  deport_atomic_data_
 1.71     96.63     1.74     1124     0.00     0.00  shellsort2_
 1.44     98.09     1.46     1000     0.00     0.01  npt_m1_vv_
 0.49     98.59     0.50      562     0.00     0.00  set_halo_particles_

From the above data, we can see that the code depends very little on FFTs (in the cases where it shows up at all, it is about one percent), so there is no need to try different FFT libraries for the sake of efficiency.

Shared vs Exclusive run on AMD servers (4 x 6378 @ 2.4 GHz)

We describe the effect of the shared FPU and L2 cache on the AMD server. The results are summarized in the following table (number of processes = 8):

Run Scheme	Test1	Test3	Test4	Test5	Test7	Test9	Test11	Test13	Test14	Test17	Test18	Test27	Test31	Test35
Shared time(s)	95	221	213	142	300	144	158	247	186	190	356	323	269	228
Exclusive time (s)	90	168	161	109	226	116	128	185	157	151	282	265	209	173

In the above table, "shared" represents cores that share resources such as L2-Cache and FPU, while "exclusive" stands for cores which do not share any resources.
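
One way to set up the two placements is to pass numactl an explicit core list through its physcpubind option. The Python sketch below builds such a list. It assumes, hypothetically, that cores 2k and 2k+1 form a pair sharing an FPU and L2 cache; the actual numbering on a given machine should be checked (for example with "numactl --hardware") before use. The function names are illustrative, and memory binding is not addressed.

```python
def core_list(nproc, shared, cores_per_pair=2):
    """Return the core IDs for nproc processes.

    shared=True  -> fill both cores of each pair: 0, 1, 2, 3, ...
    shared=False -> use one core per pair:        0, 2, 4, 6, ...

    ASSUMPTION: consecutive core IDs belong to the same pair.
    """
    step = 1 if shared else cores_per_pair
    return [i * step for i in range(nproc)]


def numactl_prefix(nproc, shared):
    """Build a numactl command prefix that pins to the chosen cores."""
    cores = ",".join(str(core) for core in core_list(nproc, shared))
    return f"numactl --physcpubind={cores}"


print("shared:   ", numactl_prefix(8, shared=True))
print("exclusive:", numactl_prefix(8, shared=False))
```

The resulting prefix would be placed in front of the program to be run; how it combines with a particular MPI launcher depends on the installation.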

Performance Comparison

The following table compares the performance of the Intel and AMD servers when the job was run using 8 processors:

Server	Test1	Test3	Test4	Test5	Test7	Test9	Test11	Test13	Test14	Test17	Test18	Test27	Test31	Test35	Notes
Intel	62	65	87	65	88	57	61	58	114	65	89	131	96	81	1
Scaled Intel	85	89	120	89	121	78	84	80	157	89	122	180	132	111	2
AMD (Shared)	95	221	213	142	300	144	158	247	186	190	356	323	269	228	1,3
AMD(Exclusive)	90	168	161	109	226	116	128	185	157	151	282	265	209	173	1,4
AMD Shared/AMD Exc.	1.05	1.31	1.32	1.30	1.33	1.24	1.23	1.34	1.18	1.26	1.26	1.22	1.29	1.32	-
AMD Exc./Intel (scaled)	1.06	1.89	1.34	1.22	1.87	1.49	1.52	2.31	1.00	1.70	2.31	1.47	1.58	1.56	-
AMD Shared/Intel (scaled)	1.12	2.48	1.78	1.60	2.48	1.85	1.88	3.09	1.18	2.13	2.92	1.79	2.03	2.05	-
AMD (16 proc)/ Intel(8 proc)	1.06	1.43	0.93	0.88	1.38	0.95	1.07	2.14	0.74	1.63	1.58	1.07	1.15	1.53	3

¹ Intel: E5-2643 @ 3.3 GHz, AMD: Opteron 6378 @ 2.4 GHz
² Scaling was done by the factor of 3.3/2.4=1.375
³ "Shared" represents cores that share resources such as L2-Cache and FPU
⁴ "Exclusive" represents cores that don't share resources such as L2-Cache and FPU

QUANTUM ESPRESSO

Quantum ESPRESSO is a plane-wave density functional theory code used for electronic structure calculations of materials.

Intel (2 x E5-2643 @ 3.30GHz)

The test case for these runs is a self-consistent calculation of the energy of bulk copper (for the input file, please ask Manoj Srivastava). The following table shows how the code scales with the number of processors:

8 processors	4 proc numanode=0	4 proc numanode=0,1
156	293	289

The two 4-processor runs take nearly the same time, so there is no visible shared-cache effect on the Intel machine. From our experience with VASP, as well as from profiling, we know that the code spends most of its time in the FFT libraries. The following table shows the effect of changing the FFT library:

FFTW-FFT	MKL-FFT
156	140

AMD (2 x 6220 @ 3.0 GHz)

16 processors	8 proc Non-shared	8 proc Shared
252	250	345

FFTW-FFT	MKL-FFT
250	232

Compilation Table

This table shows which compilers can be used to build each package.

Package	Intel	GNU	Open64
VASP	Y	-	-
LAMMPS	Y	Y	Y
GROMACS	Y	Y	Y
QE	Y	-	-
DL_POLY	Y	Y	Y

User:Manoj
