Difference between revisions of "User:Manoj"

Revision as of 00:03, 13 December 2012

VASP BENCHMARKING

Intel Machine (E5-2643 @ 3.30GHz)

VASP Native FFT Library

The following libraries and flags were used:

MKLDIR    = $(HPC_MKL_DIR)
MKLLIBS   = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTLIB = -lfftw3xf
INCS = -I$(MKLDIR)/include/fftw
FFT_OBJS = fftmpi.o fftmpi_map.o fftw3d.o fft3dlib.o
FFLAGS =  -free -names lowercase -assume byterecl
OFLAG  = -O2 -xsse2 -unroll-aggressive -warn general

As a first check, we varied the SIMD instruction set (the -xsse2 option in OFLAG above) and timed the MgMOS test case with each setting (for the input files, please ask Charles Taylor or Manoj Srivastava):

SIMD instruction set	Time (s)
sse2	158
sse4.1	156
sse4.2	155
avx	155
ssse3	156
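
To put the spread in the table above in perspective, the relative difference between the slowest and fastest setting can be computed directly. The following is a minimal sketch in Python that uses only the timings listed above; the variable names are illustrative and not part of the benchmark setup.

```python
# Run times (s) for the native-FFT build, copied from the table above.
native_times = {"sse2": 158, "sse4.1": 156, "sse4.2": 155, "avx": 155, "ssse3": 156}

slowest = max(native_times.values())
fastest = min(native_times.values())

# Relative spread between the slowest and fastest SIMD setting.
spread_pct = 100.0 * (slowest - fastest) / fastest
print(f"spread between slowest and fastest setting: {spread_pct:.1f}%")
```

The spread is about 2%, so for this test case the choice of SIMD instruction set has very little effect on the run time.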

MKL FFTs (via FFTW wrappers)

Profiling showed that the code spends most of its time in the FFT routines, so the next step was to switch the FFT library. The following changes were made:

FFT_OBJS = fftmpi_map.o fftmpiw.o fftw3d.o fft3dlib.o

(The change here is the replacement of "fftmpi.o" in the original VASP makefile with "fftmpiw.o".)

MKLDIR    = $(HPC_MKL_DIR)
MKLLIBS   = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTLIB = -lfftw3xf
INCS = -I$(MKLDIR)/include/fftw
FFLAGS = -free -names lowercase -assume byterecl
OFLAG  = -O2 -xsse2 -unroll-aggressive -warn general

After these changes, the run time on the Intel machine (E5-2643 @ 3.30GHz) dropped from about 158 s to about 97 s with sse2. This is a speedup of roughly 1.6x (about 60% more work per unit time, or a run time about 39% shorter). The following table shows the run time for each SIMD instruction set:

SIMD instruction set	Time (s)
sse2	97
sse4.1	95
sse4.2	94
avx	94
ssse3	94
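
The gain from switching FFT libraries can be expressed either as a speedup factor or as a reduction in run time, and the two numbers differ. The sketch below, in Python, computes both for every instruction set from the two tables in this section; the function name `compare` is only illustrative.

```python
# Run times (s) copied from the native-FFT and MKL tables in this section.
native = {"sse2": 158, "sse4.1": 156, "sse4.2": 155, "avx": 155, "ssse3": 156}
mkl = {"sse2": 97, "sse4.1": 95, "sse4.2": 94, "avx": 94, "ssse3": 94}


def compare(before, after):
    """Return (speedup factor, percent reduction in run time)."""
    speedup = before / after
    reduction_pct = 100.0 * (before - after) / before
    return speedup, reduction_pct


results = {simd: compare(native[simd], mkl[simd]) for simd in native}
for simd, (speedup, reduction_pct) in results.items():
    print(f"{simd:7s} speedup {speedup:.2f}x, run time reduced by {reduction_pct:.0f}%")
```

With these timings the speedup is about 1.6x for every instruction set, which corresponds to a run time roughly 39 to 40% shorter.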

FFTW FFTs

We also compiled VASP against the FFT library from FFTW itself (version 3.3.2), with the following flags:

MKLDIR    = $(HPC_MKL_DIR)
MKLLIBS   = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTWDIR = /apps/fftw/3.3.2
FFTLIB  = -L$(FFTWDIR)/lib -lfftw3
INCS = -I$(FFTWDIR)/include
FFT_OBJS = fftmpi_map.o fftmpiw.o fftw3d.o fft3dlib.o
FFLAGS =  -free -names lowercase -assume byterecl
OFLAG  = -O2 -xsse2 -unroll-aggressive -warn general

The runs above show that the performance of VASP does not depend substantially on the SIMD instruction set, so for the FFTW library we tried only one. The result is:

SIMD instruction set	Time (s)
sse2	118
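
For a side-by-side view of the three builds on the Intel machine, the sse2 timings reported in this section can be ranked against the fastest one. This is a small Python sketch based only on those three numbers; the labels are shorthand for the section titles.

```python
# sse2 run times (s) for the three builds reported in this section.
sse2_times = {
    "VASP native FFT": 158,
    "MKL (FFTW wrappers)": 97,
    "FFTW 3.3.2": 118,
}

best = min(sse2_times, key=sse2_times.get)
ranking = sorted(sse2_times.items(), key=lambda item: item[1])

# Print each build with its slowdown relative to the fastest build.
for name, seconds in ranking:
    slowdown_pct = 100.0 * (seconds - sse2_times[best]) / sse2_times[best]
    print(f"{name:20s} {seconds:4d} s  (+{slowdown_pct:.0f}% vs. fastest)")
```

For this test case the MKL build is the fastest, the FFTW build is about 22% slower, and the native FFT build is about 63% slower.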

Performance on AMD Machine (Opteron 6220 @ 3.0 GHz)

This machine has 16 cores, arranged (in numactl terminology) as 4 NUMA nodes with 4 cores on each node. As the run time of VASP depends heavily on the choice of FFT library, we com
