This page describes benchmarking of the Vienna Ab-initio Simulation Package (VASP), a plane-wave density functional theory code used to study the electronic structure of materials.
Intel (2 x E5-2643 @ 3.30GHz)
Native FFT Library
The following libraries and flags were used:
MKLDIR = $(HPC_MKL_DIR)
MKLLIBS = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTLIB = -lfftw3xf
INCS = -I$(MKLDIR)/include/fftw
FFT_OBJS = fftmpi.o fftmpi_map.o fftw3d.o fft3dlib.o
FFLAGS = -free -names lowercase -assume byterecl
OFLAG = -O2 -xsse2 -unroll-aggressive -warn general
As a first check, the Streaming SIMD Extensions (SSE) level was varied. The following is the result of a self-consistent field (SCF) calculation for MgMOS (for input files, please ask Charles Taylor or Manoj Srivastava):
MKL FFTs (via FFTW wrappers)
Profiling showed that the code spent most of its time in the FFT libraries, so the next step was to change the FFT libraries. The following changes were made:
FFT_OBJS = fftmpi_map.o fftmpiw.o fftw3d.o fft3dlib.o
(The change here is the replacement of "fftmpi.o" in the original VASP makefile with "fftmpiw.o".)
MKLDIR = $(HPC_MKL_DIR)
MKLLIBS = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTLIB = -lfftw3xf
INCS = -I$(MKLDIR)/include/fftw
FFLAGS = -free -names lowercase -assume byterecl
OFLAG = -O2 -xsse2 -unroll-aggressive -warn general
With the above changes, the run time improved by about 60% on the Intel machine (E5-2643 @ 3.30GHz). The following table shows how the run time varies with the SIMD instruction set:
We also compiled VASP against the FFT library from the FFTW package, with the following flags:
MKLDIR = $(HPC_MKL_DIR)
MKLLIBS = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTWDIR = /apps/fftw/3.3.2
FFTLIB = -L$(FFTWDIR)/lib -lfftw3
INCS = -I$(FFTWDIR)/include
FFT_OBJS = fftmpi_map.o fftmpiw.o fftw3d.o fft3dlib.o
FFLAGS = -free -names lowercase -assume byterecl
OFLAG = -O2 -xsse2 -unroll-aggressive -warn general
Based on our previous experience, we concluded that the performance of VASP does not depend substantially on the SIMD instruction set, so for the FFTW library we tried only one setting. The following is the result:
AMD (2 x 6220 @ 3.0 GHz)
This machine has 16 cores; in numactl terminology, it has 4 NUMA nodes with 4 cores on each node. Because the performance of VASP depends heavily on the choice of FFT library, we checked the performance of this machine with different FFTs, namely the FFT provided by the VASP package, MKL, and FFTW. We built the FFTW libraries with various flags to see whether we could find a better choice of FFT. The libraries and flags used to compile VASP are as follows (the FFT libraries were changed depending on which FFT we wanted to use):
The results are summarized in the following table:
|Shared L2-Cache time (s)||399||261||333||-||-||-||-|
|Exclusive L2-Cache time (s)||274||159||217||203||210||217||213|
1 Default compiler flags were used to build the FFT library.
2 CFLAGS=-O3, FFLAGS=-O3, --enable-sse2
3 --enable-mpi, CFLAGS=-O3, FFLAGS=-O3, --enable-sse2
4 CC='opencc -march=bdver1' F77='openf90 -march=bdver1' CFLAGS='-msse3 -msse4.1 -msse4.2 -msse4a -mfma4 -O2' FFLAGS='-msse3 -msse4.1 -msse4.2 -msse4a -mfma4 -O2' --enable-fma --enable-mpi
5 FFLAGS/CFLAGS="-OPT:Ofast -mavx -mfma4 -march=bdver1 -O3 -fomit-frame-pointer -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4 -malign-double -fstrict-aliasing -fno-schedule-insns -ffast-math"
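To make the table above easier to compare, the run times can be expressed as a percentage reduction relative to the first column. The following is a minimal Python sketch of that arithmetic, not part of the original benchmark scripts. The column header row is not shown on this page, so the columns are indexed by position only; the names `shared_l2`, `exclusive_l2`, `reduction_vs_first`, and `fastest_column` are hypothetical and introduced here for illustration.

```python
# Run times in seconds, copied from the table above; None marks a "-" entry.
# Columns are indexed by position because the header row is not shown here.
shared_l2 = [399, 261, 333, None, None, None, None]
exclusive_l2 = [274, 159, 217, 203, 210, 217, 213]


def reduction_vs_first(times):
    """Percent run-time reduction of each column relative to column 1."""
    base = times[0]
    return [None if t is None else round(100.0 * (base - t) / base, 1)
            for t in times]


def fastest_column(times):
    """1-based position of the smallest measured time."""
    measured = [(t, i + 1) for i, t in enumerate(times) if t is not None]
    return min(measured)[1]


if __name__ == "__main__":
    print("shared   :", reduction_vs_first(shared_l2))
    print("exclusive:", reduction_vs_first(exclusive_l2))
    print("fastest exclusive column:", fastest_column(exclusive_l2))
```

With these numbers, the second column is the fastest in the exclusive L2-cache row, at roughly 42% less run time than the first column.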
The following is a summary of results for the MgMOS test case run on the machines with 8 processors.
|AMD (Shared L2 Cache)||399||261||-||-|
|AMD (Exc. L2 Cache)||274||159||-||203|
|AMD Shared/AMD Exc.||1.46||1.64||-||-|
|AMD Exc./Intel (scaled)||1.57||1.50||-||-|
1 Compiled by UFHPC (Charles Taylor or Craig Prescott)
2 CFLAGS=-O3, FFLAGS=-O3, --enable-sse2
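The "AMD Shared/AMD Exc." row in the summary table is the element-wise ratio of the two AMD run-time rows. As a quick check, the arithmetic can be sketched in Python as follows; the variable and function names (`amd_shared`, `amd_exclusive`, `ratios`) are hypothetical, and only the first two columns are used because the remaining entries are not filled in for both rows.

```python
# Run times in seconds for the first two columns of the summary table above.
amd_shared = [399, 261]
amd_exclusive = [274, 159]


def ratios(numerators, denominators):
    """Element-wise ratio rounded to two decimals, as in the table."""
    return [round(n / d, 2) for n, d in zip(numerators, denominators)]


if __name__ == "__main__":
    # Reproduces the "AMD Shared/AMD Exc." row: 1.46 and 1.64.
    print(ratios(amd_shared, amd_exclusive))
```

A ratio above 1 means the shared L2-cache placement was slower than the exclusive placement for that column.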