Difference between revisions of "User:Manoj"

From UFRC

Revision as of 23:51, 13 December 2012

VASP BENCHMARKING

Intel (2 x E5-2643 @ 3.30GHz)

Native FFT Library

The following libraries and flags were used:

MKLDIR    = $(HPC_MKL_DIR)
MKLLIBS   = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTLIB = -lfftw3xf
INCS = -I$(MKLDIR)/include/fftw
FFT_OBJS = fftmpi.o fftmpi_map.o fftw3d.o fft3dlib.o
FFLAGS =  -free -names lowercase -assume byterecl
OFLAG  = -O2 -xsse2 -unroll-aggressive -warn general

As a first check, the SIMD instruction set (the -x option in OFLAG) was varied. The following are the run times for the MgMOS test case (for the input files, please ask Charles Taylor or Manoj Srivastava):

SIMD Instruction Time(s)
sse2 158
sse4.1 156
sse4.2 155
avx 155
ssse3 156
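
The sweep over instruction sets amounts to rebuilding with a different -x option in OFLAG each time. A minimal sketch of generating the variants, assuming that only the -x token changes and the rest of OFLAG stays as listed (the helper name oflag_for is hypothetical, not part of the VASP build system):

```python
import re

# OFLAG line as it appears in the makefile fragment for this section.
BASE_OFLAG = "OFLAG  = -O2 -xsse2 -unroll-aggressive -warn general"

# Instruction sets that were benchmarked.
SIMD_SETS = ["sse2", "sse4.1", "sse4.2", "avx", "ssse3"]


def oflag_for(simd, base=BASE_OFLAG):
    """Return the OFLAG line with its -x<set> token replaced by -x<simd>."""
    return re.sub(r"-x\S+", "-x" + simd, base, count=1)


# Print one OFLAG line per benchmarked instruction set.
for simd in SIMD_SETS:
    print(oflag_for(simd))
```

Each printed line can be pasted into the makefile before a clean rebuild.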

MKL FFTs (via FFTW wrappers)

Profiling showed that VASP spends most of its run time in the FFT routines, so the next step was to switch to the MKL FFTs through the FFTW wrappers. The following changes were made:

FFT_OBJS = fftmpi_map.o fftmpiw.o fftw3d.o fft3dlib.o

(The change is the replacement of "fftmpi.o" in the original VASP makefile with "fftmpiw.o".)

MKLDIR    = $(HPC_MKL_DIR)
MKLLIBS   = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTLIB = -lfftw3xf
INCS = -I$(MKLDIR)/include/fftw
FFLAGS = -free -names lowercase -assume byterecl
OFLAG  = -O2 -xsse2 -unroll-aggressive -warn general

With these changes, the run time on the Intel machine (E5-2643 @ 3.30GHz) dropped from 158 s to 97 s with sse2. That is a speedup of about 1.6x (roughly 60% more work per unit time, or a 39% shorter run time). The following table shows how the run time varies with the SIMD instruction set:

SIMD Instruction Time(s)
sse2 97
sse4.1 95
sse4.2 94
avx 94
ssse3 94
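
The size of the improvement can be stated in two ways that are easy to confuse: as a ratio of run times (speedup) or as the fraction of run time saved. A short sketch using the sse2 timings reported for the native and MKL builds (the function names are illustrative only):

```python
# Reported sse2 run times in seconds on the Intel machine.
native_s = 158.0
mkl_s = 97.0


def speedup(t_old, t_new):
    """Ratio of old to new run time (1.0 means no change)."""
    return t_old / t_new


def runtime_reduction(t_old, t_new):
    """Fraction of the old run time that was saved."""
    return (t_old - t_new) / t_old


print(f"speedup: {speedup(native_s, mkl_s):.2f}x")
print(f"run time reduction: {runtime_reduction(native_s, mkl_s):.0%}")
```

This prints a speedup of 1.63x and a run time reduction of 39%, so the "about 60%" figure refers to the speedup, not to the time saved.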

FFTW FFTs

We also compiled VASP against the FFT library from FFTW, using the following flags:

MKLDIR    = $(HPC_MKL_DIR)
MKLLIBS   = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTWDIR = /apps/fftw/3.3.2
FFTLIB  = -L$(FFTWDIR)/lib -lfftw3
INCS = -I$(FFTWDIR)/include
FFT_OBJS = fftmpi_map.o fftmpiw.o fftw3d.o fft3dlib.o
FFLAGS =  -free -names lowercase -assume byterecl
OFLAG  = -O2 -xsse2 -unroll-aggressive -warn general

Since the results above show that VASP performance does not depend substantially on the SIMD instruction set, we tried only one setting with the FFTW library. The result is as follows:

SIMD Instruction Time(s)
sse2 118

AMD (2 x 6220 @ 3.0 GHz)

This machine has 16 cores; in numactl terminology, it has 4 NUMA nodes with 4 cores each. Because VASP performance depends heavily on the choice of FFT library, we measured this machine with three FFT implementations: the FFT provided with the VASP package, MKL, and FFTW. We also built FFTW with various flags to see whether a better-performing build could be found. The libraries and flags used to compile VASP are as follows (the FFT settings were changed depending on which FFT library was being tested):

MKLDIR    = $(HPC_MKL_DIR)
MKLLIBS   = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTWDIR = /apps/fftw/3.3.2
FFTLIB  = -L$(FFTWDIR)/lib -lfftw3
INCS = -I$(FFTWDIR)/include
FFT_OBJS = fftmpi_map.o fftmpiw.o fftw3d.o fft3dlib.o
FFLAGS =  -free -names lowercase -assume byterecl
OFLAG  = -O2 -xsse2 -unroll-aggressive -warn general

The results are summarized in the following table:

FFT Library     Shared L2 cache time (s)   Exclusive L2 cache time (s)   Notes
Native (VASP)   399
MKL             261                        159
FFTW            333                        217                           1
FFTW            -                          203                           2
FFTW            -                          210                           3
FFTW            -                          217                           4
FFTW            -                          213                           5

1 No flags were used to build FFTW.
2 CFLAGS=-O3, FFLAGS=-O3, --enable-sse2
3 --enable-mpi, CFLAGS=-O3, FFLAGS=-O3, --enable-sse2
4 CC='opencc -march=bdver1' F77='openf90 -march=bdver1' CFLAGS='-msse3 -msse4.1 -msse4.2 -msse4a -mfma4 -O2' FFLAGS='-msse3 -msse4.1 -msse4.2 -msse4a -mfma4 -O2' --enable-fma --enable-mpi
5 FFLAGS/CFLAGS="-OPT:Ofast -mavx -mfma4 -march=bdver1 -O3 -fomit-frame-pointer -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4 -malign-double -fstrict-aliasing -fno-schedule-insns -ffast-math"
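
The page does not say how the shared and exclusive L2 cache runs were pinned. As a sketch only, the two 8-thread layouts could be expressed as CPU lists for a binding tool such as numactl --physcpubind. This assumes that each pair of consecutive core IDs (0-1, 2-3, ...) shares one L2 cache; that numbering is an assumption and should be checked on the actual machine before use.

```python
TOTAL_CORES = 16          # from the machine description above
CORES_PER_L2 = 2          # assumption: two cores share one L2 cache
                          # assumption: they have consecutive IDs (2k, 2k+1)


def shared_l2_cores(n_threads):
    """Fill both cores of each L2 pair before moving to the next pair."""
    if n_threads > TOTAL_CORES:
        raise ValueError("more threads than cores")
    return list(range(n_threads))


def exclusive_l2_cores(n_threads):
    """Use only the first core of each L2 pair, so no L2 is shared."""
    if n_threads > TOTAL_CORES // CORES_PER_L2:
        raise ValueError("not enough L2 caches for exclusive placement")
    return [pair * CORES_PER_L2 for pair in range(n_threads)]


def cpu_list(cores):
    """Format core IDs as a comma-separated list for a binding tool."""
    return ",".join(str(c) for c in cores)


print("shared L2:   ", cpu_list(shared_l2_cores(8)))
print("exclusive L2:", cpu_list(exclusive_l2_cores(8)))
```

Under the stated assumption, 8 threads with shared L2 occupy four cache pairs, while 8 threads with exclusive L2 spread across all eight.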

Performance Comparison

The following is a summary of results for the MgMOS test case, run on each machine with 8 threads.

FFT Library   Intel   Intel (Scaled)   AMD (Shared L2 Cache)   AMD (Exc. L2 Cache)   Shared/Exc.   Exc./Intel (scaled)   Notes
Native        158     174              399                     274                   1.46          1.57
MKL           97      106              261                     159                   1.64          1.50
FFTW          118     130              -                       -                     -             -                     1
FFTW          -       -                -                       203                   -             -                     2

1 Compiled by UFHPC (Charles Taylor or Craig Prescott)
2 CFLAGS=-O3, FFLAGS=-O3, --enable-sse2
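
The derived columns can be recomputed from the raw timings. The page does not define "Intel (Scaled)"; the values are consistent, to within about one second, with the Intel time multiplied by the clock ratio 3.30/3.0, and the sketch below assumes that definition. The two ratio columns follow directly from their names.

```python
INTEL_GHZ = 3.30
AMD_GHZ = 3.0

# Reported run times in seconds, taken from the summary table.
intel = {"Native": 158, "MKL": 97, "FFTW": 118}
intel_scaled_reported = {"Native": 174, "MKL": 106, "FFTW": 130}
amd_shared = {"Native": 399, "MKL": 261}
amd_exclusive = {"Native": 274, "MKL": 159}


def scale_to_amd_clock(seconds):
    """Assumed definition: Intel time scaled by the clock-frequency ratio."""
    return seconds * INTEL_GHZ / AMD_GHZ


for lib, t in intel.items():
    print(f"{lib}: scaled {scale_to_amd_clock(t):.1f} s, "
          f"reported {intel_scaled_reported[lib]} s")

for lib in amd_shared:
    shared_over_exc = amd_shared[lib] / amd_exclusive[lib]
    exc_over_intel = amd_exclusive[lib] / intel_scaled_reported[lib]
    print(f"{lib}: Shared/Exc. = {shared_over_exc:.2f}, "
          f"Exc./Intel (scaled) = {exc_over_intel:.2f}")
```

Read this way, the last column says that even with an exclusive L2 cache per thread, the AMD machine takes about 1.5 times as long as the Intel machine at equal clock speed.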

The same results, arranged by server:

Server                  Native   MKL    FFTW (note 1)   FFTW (note 2)
Intel                   158      97     118             -
Intel (Scaled)          174      106    130             -
AMD (Shared L2 Cache)   399      261    -               -
AMD (Exc. L2 Cache)     274      159    -               203
Shared/Exc.             1.46     1.64   -               -
Exc./Intel (scaled)     1.57     1.50   -               -