User:Manoj
VASP BENCHMARKING
This page describes benchmarking of Vienna Ab-initio Simulation Package (VASP), a plane wave density functional theory code, used in studying electronic structure of materials.
Intel (2 x E5-2643 @ 3.30GHz)
Native FFT Library
Following libraries and flags were used:
MKLDIR = $(HPC_MKL_DIR)
MKLLIBS = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTLIB = -lfftw3xf
INCS = -I$(MKLDIR)/include/fftw
FFT_OBJS = fftmpi.o fftmpi_map.o fftw3d.o fft3dlib.o
FFLAGS = -free -names lowercase -assume byterecl
OFLAG = -O2 -xsse2 -unroll-aggressive -warn general
As a first check, Streaming SIMD Extension (SSE) was changed and following is the result of a self consistent field (SCF) calculation for MgMOS (For input files, please ask Charles Taylor or Manoj Srivastava):
SIMD Instruction | Time(s) |
---|---|
sse2 | 158 |
sse4.1 | 156 |
sse4.2 | 155 |
avx | 155 |
ssse3 | 156 |
MKL FFTs (via FFTW wrappers)
Upon profiling the code, we found that the code spent most of its time in the FFT libraries, so the next step was to change the FFT libraries. Following changes were made:
FFT_OBJS = fftmpi_map.o fftmpiw.o fftw3d.o fft3dlib.o
(The change here is replacement of "fftmpi.o" in the original VASP makefile with "fftmpiw.o")
MKLDIR = $(HPC_MKL_DIR)
MKLLIBS = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTLIB = -lfftw3xf
INCS = -I$(MKLDIR)/include/fftw
FFLAGS = -free -names lowercase -assume byterecl
OFLAG = -O2 -xsse2 -unroll-aggressive -warn general
Upon making above changes, about 60% improvement on run time of the code was found on the Intel machine (E5-2643 @ 3.30GHz). Following table depicts the run time variation with SIMD instruction sets:
SIMD Instruction | Time(s) |
---|---|
sse2 | 97 |
sse4.1 | 95 |
sse4.2 | 94 |
avx | 94 |
ssse3 | 94 |
FFTW FFTs
We further compiled VASP by using FFT library from the FFTW package with following flags:
MKLDIR = $(HPC_MKL_DIR)
MKLLIBS = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTWDIR = /apps/fftw/3.3.2
FFTLIB = -L$(FFTWDIR)/lib -lfftw3
INCS = -I$(FFTWDIR)/include
FFT_OBJS = fftmpi_map.o fftmpiw.o fftw3d.o fft3dlib.o
FFLAGS = -free -names lowercase -assume byterecl
OFLAG = -O2 -xsse2 -unroll-aggressive -warn general
From our previous experience, we concluded that the performance of VASP did not depend substantially on the SIMD instruction sets, so for FFTW library, we only tried one set. Following is the result:
SIMD Instruction | Time(s) |
---|---|
sse2 | 118 |
AMD (2 x 6220 @ 3.0 GHz)
This machine has 16 cores, in numactl terminology 4 NUMA nodes with 4 cores on each nodes. As the result of VASP depends heavily on the choice of FFT libraries, we checked performance of this machine with different FFTs, namely, FFT provided by VASP package, MKL, and FFTW. We built FFTW libraries with various flags to see if we could find a better choice for FFTs. The libraries and flags used to compile VASP are as follows (FFT libraries were changed depending on which FFT we wanted to use):
MKLDIR = $(HPC_MKL_DIR)
MKLLIBS = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTWDIR = /apps/fftw/3.3.2
FFTLIB = -L$(FFTWDIR)/lib -lfftw3
INCS = -I$(FFTWDIR)/include
FFT_OBJS = fftmpi_map.o fftmpiw.o fftw3d.o fft3dlib.o
FFLAGS = -free -names lowercase -assume byterecl
OFLAG = -O2 -xsse2 -unroll-aggressive -warn general
The results are summarized in the following table:
Run Scheme | Native | MKL | FFTW | FFTW | FFTW | FFTW | FFTW |
---|---|---|---|---|---|---|---|
Shared L2-Cache time(s) | 399 | 261 | 333 | - | - | - | - |
Exclusive L2-Cache time (s) | 274 | 159 | 217 | 203 | 210 | 217 | 213 |
Notes | - | - | 1 | 2 | 3 | 4 | 5 |
1 Default compiler Flags were used to build FFT.
2 CFLAGS=-O3, FFLAGS=-O3, -enable sse2
3 enable-mpi CFLAGS=-O3, FFLAGS=-O3, -enable sse2
4 CC='opencc -march=bdver1' F77='openf90 -march=bdver1' CFLAGS='-msse3 -msse4.1 -msse4.2 -msse4a -mfma4 -O2' FFLAGS='-msse3 -msse4.1 -msse4.2 -msse4a -mfma4 -O2' --enable-fma --enable-mpi
5 FFLAGS/ CFLAGS="-OPT:Ofast -mavx -mfma4 -march=bdver1 -O3 -fomit-frame-pointer -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4-malign-double -fstrict-aliasing -fno-schedule-insns -ffast-math"
Performance Comparison
Following is a summary of results for the test case of MgMOS ran on the Intel and AMD servers with 8 processors.
Server | Native | MKL | FFTW | FFTW |
---|---|---|---|---|
Intel | 158 | 97 | 118 | - |
Intel (Scaled) | 174 | 106 | 130 | - |
AMD (Shared L2 Cache) | 399 | 261 | - | - |
AMD (Exc. L2 Cache) | 274 | 159 | - | 203 |
AMD Shared/AMD Exc. | 1.46 | 1.64 | - | - |
AMD Exc./Intel (scaled) | 1.57 | 1.50 | - | - |
Notes | - | - | 1 | 2 |
1 Compiled by UFHPC (Charles Taylor or Craig Prescott)
2 CFLAG=-O3, FFLAG=-O3, -enable sse2
STREAM BENCHMARKING
A few words about numactl
NUMA is an acronym for Non Uniform Memory Access, and numactl is a tool to assign memory to the node. Following are a few important keywords one should know before embarking on the numactl mission:
physcpubind = ID of the cores
cpunodebind = ID of the nodes
membind = ID of the node that memory is assigned to
For example, on an AMD machine with 16 cores, or in the terminology of NUMA, 4 nodes with 4 cores on each node, the command line
–membind=0 –physcpubind=0-3
asigns four threads running on cores 0 to 3 (node 0) with the memory also assigned to the node 0. However, the command line
–membind=1 –physcpubind=0-3
assigns four threads on the cores 0 to 3 (node 0) but the memory is assigned to the node 1. As this memory is not local to the node that the threads are running on, the performance will be affected. Assigning memory locally to the node can also be done by ”-l” option of the numactl.
Alternatively, above command lines can be shortened by using "cpunodebind". For example,
–membind=0 –cpunodebind=0
means that the memory is assigned to node 0 and the threads are also running on node 0. One should note that with the use of "cpunodebind" the number of threads will be equal to the number of cores on the node, so in this case number of threads has to be equal to four. However, if we wish to run two threads on node 0, its only possible with "physcpubind". You have more control of running your threads with "physcpubind" as you can choose the cores that you wish to run your jobs on. For detail description please follow the manual page of numactl.
Intel (2 x E5-2643 @ 3.30GHz)
Stream is the program to evaluate the memory bandwidth of the node. Before we attempt to find the maximum bandwidth, it's necessary to find out the architecture of the machine.
the command "numactl --hardware" will produce:
<source lang=make>
available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 node 0 size: 32739 MB node 0 free: 30624 MB node 1 cpus: 4 5 6 7 node 1 size: 32768 MB node 1 free: 31280 MB node distances: node 0 1
0: 10 21 1: 21 10
<\source>
Before finding out the maximum bandwidth on the entire node, it’s a good idea to discover how many threads one needs to run on each numa node to get the maximum bandwidth. Following is a table describing that Above table was obtained by running the threads on node 0 and assigning the memory on the same node as well. This resultcan be reproduced on other nodes as well.