Revision as of 09:05, 29 December 2012

STREAM BENCHMARKING

A few words about numactl

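NUMA is an acronym for Non-Uniform Memory Access, and numactl is a tool that runs a process with a chosen NUMA policy: it controls which cores the process may run on and which node its memory is allocated from. The following are a few important options one should know before embarking on the numactl mission: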

physcpubind = ID of the cores
cpunodebind = ID of the nodes
membind = ID of the node that the memory is assigned to

For example, on an AMD machine with 16 cores, or in the terminology of NUMA, 4 nodes with 4 cores on each node, the command line

--membind=0 --physcpubind=0-3

assigns four threads to cores 0 to 3 (node 0), with the memory also allocated on node 0. However, the command line

--membind=1 --physcpubind=0-3

assigns four threads to cores 0 to 3 (node 0) but allocates the memory on node 1. As this memory is not local to the node that the threads are running on, the performance will be affected. Memory can also be allocated locally to the node with the "-l" option of numactl.

Alternatively, the above command lines can be shortened by using "cpunodebind". For example,

--membind=0 --cpunodebind=0

means that the memory is allocated on node 0 and the threads also run on node 0. One should note that "cpunodebind" makes every core of the node available to the job, so to use the whole node the number of threads should be four in this case. If we wish to pin two threads to two specific cores of node 0, that is only possible with "physcpubind". You have more control over where your threads run with "physcpubind", as you can choose the cores that you wish to run your jobs on. For a detailed description, please see the numactl manual page.
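As a minimal sketch, the option combinations discussed above can be written out as complete command lines. The helper name numa_cmd is hypothetical, and the binary name ./stream is borrowed from the commands used later on this page. The function only prints the command lines, so it can be tried on a machine without numactl installed.

```shell
#!/bin/sh
# Print (do not execute) numactl command lines for the cases discussed above.
# numa_cmd is a hypothetical helper; ./stream is the assumed benchmark binary.
numa_cmd() {
    printf 'numactl %s %s\n' "$1" "${2:-./stream}"
}

numa_cmd "--membind=0 --physcpubind=0-3"    # threads on node 0, memory on node 0 (local)
numa_cmd "--membind=1 --physcpubind=0-3"    # threads on node 0, memory on node 1 (remote)
numa_cmd "--localalloc --physcpubind=0-3"   # long form of -l: allocate on the local node
numa_cmd "--membind=0 --cpunodebind=0"      # any core of node 0, memory on node 0
numa_cmd "--membind=0 --physcpubind=0,1"    # only two chosen cores of node 0
```

Removing the printf wrapper and running the printed lines directly is what actually applies the policy to the benchmark.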

Intel (2 x E5-2643 @ 3.30GHz)

STREAM is a well-known memory bandwidth benchmark. Before we attempt to find the maximum bandwidth, it is necessary to find out the architecture of the machine. The command "numactl --hardware" on this machine produces:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 32739 MB
node 0 free: 30624 MB
node 1 cpus: 4 5 6 7
node 1 size: 32768 MB
node 1 free: 31280 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

From the above result, we can conclude that there are two NUMA nodes with four cores each, eight cores in total.

Before measuring the maximum memory bandwidth of the server, we first determine the number of threads required to achieve the maximum bandwidth of a given NUMA node. Results are summarized in the following table:

Number of threads	Bandwidth (GB/s)
1	9.5
2	18.8
3	21.4
4	34.0

From the above table, we conclude that four threads are needed on each node to reach the maximum bandwidth. The table was obtained by running the threads on node 0 and allocating the memory on the same node. This result can be reproduced on the other node as well.
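A sweep like the one in the table could be scripted along the following lines. This is only a sketch under stated assumptions: the page does not give the exact commands, so it assumes an OpenMP build of STREAM named ./stream (so that OMP_NUM_THREADS sets the thread count) and cores numbered consecutively within a node, as in the "numactl --hardware" output above. The helper name thread_sweep is hypothetical, and the command lines are printed rather than executed.

```shell
#!/bin/sh
# Print (do not execute) one command line per thread count for a single-node sweep.
# Assumptions: an OpenMP build of STREAM called ./stream; the cores of the node
# are numbered consecutively starting at first_core.
thread_sweep() (
    node="$1"; first_core="$2"; max_threads="$3"
    n=1
    while [ "$n" -le "$max_threads" ]; do
        if [ "$n" -eq 1 ]; then
            cpus="$first_core"
        else
            cpus="$first_core-$((first_core + n - 1))"
        fi
        printf 'OMP_NUM_THREADS=%d numactl --membind=%d --physcpubind=%s ./stream\n' \
            "$n" "$node" "$cpus"
        n=$((n + 1))
    done
)

thread_sweep 0 0 4   # node 0 of the Intel machine: cores 0 to 3
```

Calling thread_sweep 1 4 4 would print the same sweep for node 1 (cores 4 to 7).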

The following table shows how the memory bandwidth (in GB/s) changes with the node on which the memory is allocated, relative to the node on which the threads run (number of threads is four):

MEM \ CPU	0	1
0	34.0	17.4
1	18.9	33.5

In the above table, the memory nodes vary along the rows while the CPU nodes vary along the columns. You can clearly see the effect of memory binding with respect to the cores where the threads are running: remote memory gives roughly half the local bandwidth. Please note that the above table resembles the "node distances" table obtained using "numactl --hardware" earlier.

AMD (2 x 6220 @ 3.0 GHz)

This is an Interlagos machine with 16 cores (4 NUMA nodes with 4 cores each). Each core has 4 GB of memory, which gives the machine 64 GB in total. I compiled the code with the Open64 compiler. It is noteworthy that the gcc compiler gives about half the bandwidth of Open64, while Intel compiler results on this machine vary (64GB to 40 GB). "numactl --hardware" produces:

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 16382 MB
node 0 free: 2930 MB
node 1 cpus: 4 5 6 7
node 1 size: 16384 MB
node 1 free: 5082 MB
node 2 cpus: 8 9 10 11
node 2 size: 16384 MB
node 2 free: 2281 MB
node 3 cpus: 12 13 14 15
node 3 size: 16368 MB
node 3 free: 550 MB
node distances:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10

The following table shows the memory bandwidth on a single node for a varying number of threads:

Number of threads	Bandwidth (GB/s)
1	14.0
2	15.0
3	17.8
4	18.5

Again, similar to the Intel machine, the maximum number of threads we need to run on each node is four.

The following table shows how the memory bandwidth (in GB/s) changes with the node on which the memory is allocated, relative to the node on which the threads run (number of threads is four):

MEM \ CPU	0	1	2	3
0	18.1	11.8	6.5	5.6
1	11.8	18.7	5.5	6.5
2	6.5	5.5	18.5	11.6
3	5.6	6.5	11.8	18.5

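Contrary to the Intel machine, the above table does not agree with the "node distances" produced by "numactl --hardware"! The distance table rates all remote nodes as equally far away (16), whereas the measured remote bandwidth falls into distinct groups: about 11.6 to 11.8 GB/s between nodes 0 and 1 and between nodes 2 and 3, and only 5.5 to 6.5 GB/s for the other pairs.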
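To put numbers on this disagreement, the short sketch below compares the ratio implied by the distance table (remote distance divided by local distance) with the measured local-to-remote bandwidth ratios for memory on node 0. The helper name bw_ratio is hypothetical; the input values are taken from the two tables above.

```shell
#!/bin/sh
# Compare the "node distances" ratio with measured local/remote bandwidth ratios.
# bw_ratio is a hypothetical helper that prints a / b with two decimals.
bw_ratio() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

echo "distance ratio, remote/local (16/10):     $(bw_ratio 16 10)"
echo "memory on node 0, threads on node 1:      $(bw_ratio 18.1 11.8)"
echo "memory on node 0, threads on node 2:      $(bw_ratio 18.1 6.5)"
echo "memory on node 0, threads on node 3:      $(bw_ratio 18.1 5.6)"
```

The distance table suggests a single penalty factor of 1.60 for every remote node, while the measured factors range from about 1.5 to about 3.2.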

AMD (4 x 6378 @ 2.4 GHz)

In NUMA terminology, this server has 8 nodes with 8 cores on each.

numactl --hardware  

available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32765 MB
node 0 free: 29324 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 31892 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32768 MB
node 2 free: 31900 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32768 MB
node 3 free: 31911 MB
node 4 cpus: 32 33 34 35 36 37 38 39
node 4 size: 32768 MB
node 4 free: 31964 MB
node 5 cpus: 40 41 42 43 44 45 46 47
node 5 size: 32768 MB
node 5 free: 31942 MB
node 6 cpus: 48 49 50 51 52 53 54 55
node 6 size: 32768 MB
node 6 free: 31866 MB
node 7 cpus: 56 57 58 59 60 61 62 63
node 7 size: 32752 MB
node 7 free: 31960 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  22  16  22  16  22
  1:  16  10  22  16  22  16  22  16
  2:  16  22  10  16  16  22  16  22
  3:  22  16  16  10  22  16  22  16
  4:  16  22  16  22  10  16  16  22
  5:  22  16  22  16  16  10  22  16
  6:  16  22  16  22  16  22  10  16
  7:  22  16  22  16  22  16  16  10

Memory bandwidth on a single node by varying number of threads:

Number of threads	Bandwidth (GB/s)
1	13.0
2	14.1
3	17.1
4	17.4
5	17.1
6	16.7
7	16.6
8	16.1

The following table shows the variation of memory bandwidth when we change the memory allocation with respect to the cores where the threads are running (number of threads = 4):

MEM \ CPU	0	1	2	3	4	5	6	7
0	17.3	8.0	5.6	4.1	5.7	4.1	5.5	4.0
1	8.2	17.6	6.5	6.5	4.0	5.5	4.0	5.4
2	5.7	6.5	17.9	7.9	5.6	4.1	5.6	4.1
3	4.1	6.5	8.1	17.8	4.1	5.6	4.1	5.7
4	5.6	4.0	5.7	4.2	17.7	7.9	5.7	4.1
5	4.0	5.6	4.1	5.6	8.1	17.7	4.0	5.5
6	5.4	4.0	5.6	4.1	5.7	4.1	17.8	7.9
7	3.9	5.4	4.0	5.6	4.2	5.6	8.1	17.7

Bandwidth in terms of Socket

A socket on the AMD 6200 and 6300 machines consists of two NUMA nodes. A socket therefore has 16 cores on the 6378 server and 8 cores on the 6220 server. The memory bandwidth of each NUMA node peaks at about 4 threads, which raises the question of the maximum bandwidth of a socket. A reasonable guess from our previous results is to use 8 threads per socket, with 4 on each NUMA node. If we run STREAM with 8 cores as follows:

numactl --physcpubind=0,1,2,3,8,9,10,11 --membind=0,1 ./stream

we get 34.7 GB/s memory bandwidth.

Running

numactl --physcpubind=0,2,4,6,8,10,12,14 --membind=0,1 ./stream

also yields 35 GB/s bandwidth.

By varying both the core binding and the memory binding across sockets as follows:

numactl --physcpubind=0,2,4,6,8,10,12,14 --membind=0,1 ./stream
numactl --physcpubind=0,2,4,6,8,10,12,14 --membind=2,3 ./stream
numactl --physcpubind=0,2,4,6,8,10,12,14 --membind=4,5 ./stream
numactl --physcpubind=0,2,4,6,8,10,12,14 --membind=6,7 ./stream
numactl --physcpubind=16,18,20,22,24,26,28,30 --membind=0,1 ./stream
numactl --physcpubind=16,18,20,22,24,26,28,30 --membind=2,3 ./stream
numactl --physcpubind=16,18,20,22,24,26,28,30 --membind=4,5 ./stream
numactl --physcpubind=16,18,20,22,24,26,28,30 --membind=6,7 ./stream
numactl --physcpubind=32,34,36,38,40,42,44,46 --membind=0,1 ./stream
numactl --physcpubind=32,34,36,38,40,42,44,46 --membind=2,3 ./stream
numactl --physcpubind=32,34,36,38,40,42,44,46 --membind=4,5 ./stream
numactl --physcpubind=32,34,36,38,40,42,44,46 --membind=6,7 ./stream
numactl --physcpubind=48,50,52,54,56,58,60,62 --membind=0,1 ./stream
numactl --physcpubind=48,50,52,54,56,58,60,62 --membind=2,3 ./stream
numactl --physcpubind=48,50,52,54,56,58,60,62 --membind=4,5 ./stream
numactl --physcpubind=48,50,52,54,56,58,60,62 --membind=6,7 ./stream

we get the following table (in terms of sockets, i.e. nodes 0-1 are socket 1, nodes 2-3 are socket 2, and so on):

MEM \ CPU	1	2	3	4
1	35.2	11.2	11.0	10.7
2	11.3	35.3	11.2	11.1
3	10.9	11.2	35.2	11.0
4	10.7	11.1	11.1	35.4
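The sixteen command lines above follow a regular pattern, so they can be generated with two nested loops. The sketch below prints them rather than running them; the function name socket_sweep is hypothetical, and sockets are numbered 0 to 3 inside the script (1 to 4 in the table).

```shell
#!/bin/sh
# Print the 16 command lines of the socket sweep on the 4 x 6378 server.
# Socket s (0-3 here, 1-4 in the table) holds NUMA nodes 2s and 2s+1
# and cores 16s to 16s+15.
socket_sweep() (
    cpu_socket=0
    while [ "$cpu_socket" -lt 4 ]; do
        # Every second core of the socket: 8 threads, 4 per NUMA node.
        core=$((cpu_socket * 16))
        last=$((core + 14))
        cpus="$core"
        while [ "$core" -lt "$last" ]; do
            core=$((core + 2))
            cpus="$cpus,$core"
        done
        mem_socket=0
        while [ "$mem_socket" -lt 4 ]; do
            printf 'numactl --physcpubind=%s --membind=%d,%d ./stream\n' \
                "$cpus" $((mem_socket * 2)) $((mem_socket * 2 + 1))
            mem_socket=$((mem_socket + 1))
        done
        cpu_socket=$((cpu_socket + 1))
    done
)

socket_sweep
```

Piping the output into sh would execute the sweep in the same order as the list above, one table column at a time.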

VASP BENCHMARKING

This page describes benchmarking of the Vienna Ab-initio Simulation Package (VASP), a plane-wave density functional theory code used to study the electronic structure of materials.

Intel (2 x E5-2643 @ 3.30GHz)

Native FFT Library

The following libraries and flags were used:

MKLDIR    = $(HPC_MKL_DIR)
MKLLIBS   = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTLIB = -lfftw3xf
INCS = -I$(MKLDIR)/include/fftw
FFT_OBJS = fftmpi.o fftmpi_map.o fftw3d.o fft3dlib.o
FFLAGS =  -free -names lowercase -assume byterecl
OFLAG  = -O2 -xsse2 -unroll-aggressive -warn general

As a first check, the SIMD instruction set selected by the "-x" flag in OFLAG was varied. The following are the results of a self-consistent field (SCF) calculation for MgMOS (for input files, please ask Charles Taylor or Manoj Srivastava):

SIMD Instruction	Time(s)
sse2	158
sse4.1	156
sse4.2	155
avx	155
ssse3	156

MKL FFTs (via FFTW wrappers)

Upon profiling the code, we found that it spent most of its time in the FFT libraries, so the next step was to change the FFT libraries. The following changes were made:

FFT_OBJS = fftmpi_map.o fftmpiw.o fftw3d.o fft3dlib.o

(The change here is the replacement of "fftmpi.o" in the original VASP makefile with "fftmpiw.o".)

MKLDIR    = $(HPC_MKL_DIR)
MKLLIBS   = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTLIB = -lfftw3xf
INCS = -I$(MKLDIR)/include/fftw
FFLAGS = -free -names lowercase -assume byterecl
OFLAG  = -O2 -xsse2 -unroll-aggressive -warn general

With the above changes, the run time on the Intel machine (E5-2643 @ 3.30GHz) dropped from 158 s to 97 s, i.e. the code ran about 60% faster. The following table shows the run time variation with SIMD instruction sets:

SIMD Instruction	Time(s)
sse2	97
sse4.1	95
sse4.2	94
avx	94
ssse3	94

FFTW FFTs

We further compiled VASP using the FFT library from the FFTW package with the following flags:

MKLDIR    = $(HPC_MKL_DIR)
MKLLIBS   = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTWDIR = /apps/fftw/3.3.2
FFTLIB  = -L$(FFTWDIR)/lib -lfftw3
INCS = -I$(FFTWDIR)/include
FFT_OBJS = fftmpi_map.o fftmpiw.o fftw3d.o fft3dlib.o
FFLAGS =  -free -names lowercase -assume byterecl
OFLAG  = -O2 -xsse2 -unroll-aggressive -warn general

From our previous experience, we concluded that the performance of VASP does not depend substantially on the SIMD instruction set, so for the FFTW library we tried only one. The result is:

SIMD Instruction	Time(s)
sse2	118

AMD (2 x 6220 @ 3.0 GHz)

This machine has 16 cores, in numactl terminology 4 NUMA nodes with 4 cores on each node. As the performance of VASP depends heavily on the choice of FFT libraries, we checked the performance of this machine with different FFTs, namely the FFT provided by the VASP package, MKL, and FFTW. We built the FFTW libraries with various flags to see if we could find a better choice for FFTs. The libraries and flags used to compile VASP are as follows (the FFT libraries were changed depending on which FFT we wanted to use):

MKLDIR    = $(HPC_MKL_DIR)
MKLLIBS   = -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
MKLLIBDIR = $(HPC_MKL_DIR)/lib/intel64
FFTWDIR = /apps/fftw/3.3.2
FFTLIB  = -L$(FFTWDIR)/lib -lfftw3
INCS = -I$(FFTWDIR)/include
FFT_OBJS = fftmpi_map.o fftmpiw.o fftw3d.o fft3dlib.o
FFLAGS =  -free -names lowercase -assume byterecl
OFLAG  = -O2 -xsse2 -unroll-aggressive -warn general

The results are summarized in the following table:

Run Scheme	Native	MKL	FFTW	FFTW	FFTW	FFTW	FFTW	FFTW
Shared L2-Cache time(s)	399	261	333	319	334	336	315	319
Exclusive L2-Cache time (s)	274	159	217	219	215	217	213	211
Notes	-	-	1	2	3	4	5	6

¹ Default compiler flags were used to build FFT.
² CFLAGS=-O3, FFLAGS=-O3, --enable-sse2
³ --enable-mpi, CFLAGS=-O3, FFLAGS=-O3, --enable-sse2
⁴ CC='opencc -march=bdver1' F77='openf90 -march=bdver1' CFLAGS='-msse3 -msse4.1 -msse4.2 -msse4a -mfma4 -O2' FFLAGS='-msse3 -msse4.1 -msse4.2 -msse4a -mfma4 -O2' --enable-fma --enable-mpi
⁵ FFLAGS/CFLAGS="-OPT:Ofast -mavx -mfma4 -march=bdver1 -O3 -fomit-frame-pointer -LNO:simd=2 -WOPT:sib=on -LNO:prefetch=2:pf2=0 -CG:use_prefetchnta=on -LNO:prefetch_ahead=4 -malign-double -fstrict-aliasing -fno-schedule-insns -ffast-math"
⁶ ufhpc compiler options. FFTWDIR = /apps/fftw/3.3.2

Performance Comparison

The following is a summary of results for the MgMOS test case run on the Intel and AMD servers with 8 processors.

Server	Native	MKL	FFTW
Intel	158	97	118
Intel (Scaled)	174	106	130
AMD (Shared L2 Cache)	399	261	319
AMD (Exc. L2 Cache)	274	159	211
AMD Shared/AMD Exc.	1.46	1.64	1.51
AMD Exc./Intel (scaled)	1.57	1.50	1.62
Notes	-	-	1

¹ Compiled by UFHPC (Charles Taylor or Craig Prescott)
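The derived rows of the comparison table can be recomputed from the measured times. The sketch below assumes that "Intel (Scaled)" means the Intel time multiplied by the clock ratio 3.3/3.0 of the two servers. The page does not state this, but the assumption reproduces the scaled values to within a second (173.8, 106.7 and 129.8 against 174, 106 and 130 in the table). The helper names scale_time and ratio are hypothetical.

```shell
#!/bin/sh
# Recompute the derived rows of the VASP comparison table.
# Assumption: "Intel (Scaled)" = Intel time * (Intel clock / AMD clock).
scale_time() {
    LC_ALL=C awk -v t="$1" -v f_intel="$2" -v f_amd="$3" \
        'BEGIN { printf "%.1f\n", t * f_intel / f_amd }'
}
ratio() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

echo "Intel scaled (Native, MKL, FFTW): $(scale_time 158 3.3 3.0) $(scale_time 97 3.3 3.0) $(scale_time 118 3.3 3.0)"
echo "AMD Shared / AMD Exc.:            $(ratio 399 274) $(ratio 261 159) $(ratio 319 211)"
echo "AMD Exc. / Intel (scaled):        $(ratio 274 174) $(ratio 159 106) $(ratio 211 130)"
```

The two ratio rows come out as 1.46, 1.64, 1.51 and 1.57, 1.50, 1.62, matching the table.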

LAMMPS BENCHMARKING

Intel (2 x E5-2643 @ 3.30GHz)

Intel Compiler

LAMMPS was compiled with the following flags:

module load intel openmpi
CC = mpiCC
CCFLAGS = -O2 -xSSE2
FFT_INC =  -I$(HPC_MKL_DIR)/include/fftw
FFT_PATH =
FFT_LIB = -L$(HPC_MKL_DIR)/lib/intel64 -lfftw3xc -lmkl_intel_lp64 -lmkl_sequential -lmkl_core

The benchmark runs use the input file provided with the package (LJ = atomic fluid, Lennard-Jones potential with 2.5 sigma cutoff (55 neighbors per atom), NVE integration). The following table shows the variation of run time with the number of processors:

# processors	Time(s)
8	158
4	309
1	1139

We find close to linear scaling with the number of processors on the Intel machine: the run on 8 processors is about 7.2 times faster than the run on one.
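The scaling can be quantified as speedup (one-processor time divided by n-processor time) and parallel efficiency (speedup divided by n). The following sketch computes both from the times in the table above; the helper name speedup is hypothetical.

```shell
#!/bin/sh
# Speedup and parallel efficiency relative to the 1-processor run.
# Arguments: time on 1 processor, time on n processors, n.
speedup() {
    LC_ALL=C awk -v t1="$1" -v tn="$2" -v n="$3" 'BEGIN {
        s = t1 / tn
        printf "%d processors: speedup %.2f, efficiency %.0f%%\n", n, s, 100 * s / n
    }'
}

speedup 1139 309 4   # prints: 4 processors: speedup 3.69, efficiency 92%
speedup 1139 158 8   # prints: 8 processors: speedup 7.21, efficiency 90%
```

An efficiency of 100% would correspond to perfectly linear scaling.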

The following table shows the variation of run time with the SIMD instruction set (# threads = 8):

SIMD Instruction	Time(s)
sse2	158
sse4.1	158
sse4.2	157
avx	152
sse3	157
ssse3	157

"AVX" instruction sets are slightly better than all the other ones!

The binaries for the above SIMD sets were built with the "-x" option, which on the AMD server does not work for instruction sets other than sse2. For the next step we therefore built our binaries with the "-m" option and ran them on both the Intel and AMD servers to see whether the same binaries run successfully on both. The following table shows the results:

SIMD Instruction	Intel Time(s)	AMD Time(s)
sse2	158	329
sse4.1	158	330
sse4.2	157	329
avx	152	319
sse3	157	329
ssse3	157	329
Notes	-	1

Clearly, on both the Intel and the AMD servers the "avx" instructions are the better choice, and we are going to use them in further benchmarks.

¹ Runs on AMD servers are "naive": caches are shared and so are FPUs (floating point units).

MKL vs FFTW FFTs

In the above section we used MKL FFTs; in this section we make a comparison with FFTW FFTs.

module load intel openmpi fftw
CC = mpiCC
CCFLAGS = -O2 -mavx
FFT_DIR = /apps/fftw/3.3.2
FFT_INC = -I$(FFT_DIR)/include/fftw3
FFT_PATH = 
FFT_LIB = -L$(FFT_DIR)/lib -lfftw3

FFT	Time(s)
MKL	152
FFTW	152

There seems to be no difference between the FFTs from FFTW or MKL in the performance of LAMMPS.

Open64 Compiler

We built LAMMPS with the Open64 compiler with the following flags (the "-mavx" flag does not compile because the Open64 compiler does not support it, so we had to use "-msse2"):

module load open64/.4.5.2 openmpi
CC =            mpiCC
CCFLAGS =       -O2 -msse2
MPI_DIR        = /usr/mpi/open64/openmpi-1.6
MPI_INC        = -I$(MPI_DIR)/include
MPI_LIB        = -L$(MPI_DIR)/lib64 -lmpi
MPI_PATH =
FFT_DIR = /home/manoj/FFTW/charlie/3.3.2
FFT_INC =  -I$(FFT_DIR)/include/fftw3
FFT_PATH = 
FFT_LIB = -L$(FFT_DIR)/lib -lfftw3

The following table gives the run times:

Compiler	Intel Time(s)	AMD Time(s)
Intel	158	329
Open64	173	352
GNU	-	-
NOTES	-	1

¹ Runs on AMD servers are "naive": caches are shared and so are FPUs (floating point units).

AMD (4 x 6378 @ 2.4 GHz)

In this section, we describe the effect of shared cache on the AMD cluster.

The results are summarized in the following table:

Run Scheme	MKL
Shared L2-Cache time(s)	319
Exclusive L2-Cache time (s)	277
Notes	-

Performance Comparison

The following table compares the performance of the Intel and AMD servers when the job was run with 8 threads.

Server	MKL
Intel	158
Intel (Scaled)	217
AMD (Shared L2 Cache)	319
AMD (Exc. L2 Cache)	277
AMD Shared/AMD Exc.	1.15
AMD Exc./Intel (scaled)	1.27
Notes	-

QUANTUM ESPRESSO BENCHMARKING

Quantum Espresso is a plane wave density functional theory code used for the electronic structure calculations of materials.

Intel (2 x E5-2643 @ 3.30GHz)

The test case for these runs is a self-consistent calculation of the energy of bulk copper (for the input file, please ask Manoj Srivastava). The following table shows the scaling of the code with the number of processors:

8 processors	4 proc numanode=0	4 proc numanode=0,1
156	293	289

Clearly, there is no shared cache effect on the Intel machine. From our experience with VASP as well as from profiling, we know that the code spends most of its time in the FFT libraries. The following table shows the effect of changing the FFT library:

FFTW-FFT	MKL-FFT
156	140

AMD (2 x 6220 @ 3.0 GHz)

16 processors	8 proc Non-shared	8 proc Shared
252	250	345

FFTW-FFT	MKL-FFT
250	232


