CUDA Examples

From UFRC

CUBLAS vs. Optimized BLAS

Note that double-precision linear algebra is a less-than-ideal application for GPUs, since they typically deliver much lower double-precision than single-precision throughput. Still, it serves as a functional example of using one of the available CUDA libraries, CUBLAS.

cuda_bm.c

Download the raw source of cuda_bm.c.

resuse.c

Download the raw source of resuse.c.

Makefile for Building cuda_bm

Download the raw source of the Makefile.

Compiling and Linking

Download the three files above into a single directory. Then load the compiler and CUDA modules and build:

[taylor@c11a-s15 Bench]$ module load intel/2013
[taylor@c11a-s15 Bench]$ module load cuda/5.5
[taylor@c11a-s15 Bench]$ make clean
rm -f *.o core.* bm cuda_bm

[taylor@c11a-s15 Bench]$ make cuda_bm
icc -O3 -openmp -I/opt/intel/composer_xe_2013_sp1.0.080/mkl/include   -c -o resuse.o resuse.c
icc -O3 -openmp -I/opt/intel/composer_xe_2013_sp1.0.080/mkl/include -I/opt/cuda/5.5/include -I/opt/intel/composer_xe_2013_sp1.0.080/mkl/include -o cuda_bm cuda_bm.c resuse.o -L/opt/cuda/5.5/lib64 -lcublas -L/opt/intel/composer_xe_2013_sp1.0.080/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lpthread -lrt

Output

 
# Host: Single, Hex-Core Intel 5675 @ 3.0 GHz
# GPU: Kepler K20c
$ export OMP_NUM_THREADS=4
$ ./cuda_bm
Initializing Matrices...Done.    Elapsed Time = 2.3768 secs
Starting parallel DGEMM...Done.  Elapsed Time = 33.2302 secs
                                 GFlOP Rate   =  45.3546
Initializing CUDA...Done.
Re-initializing Matrices...Done.  Elapsed Time = 2.1733 secs
Starting CUDA DGEMM...Done.       Elapsed Time = 2.3522 secs
                                  GFlOP Rate   = 640.7458