COT 5: CUDA Optimization Tutorial


Parallelizing and Optimizing Programs for GPU Acceleration using CUDA


Morning of September 23, 2012
Minneapolis, MN

Held in conjunction with the
21st International Conference on
Parallel Architectures and Compilation Techniques

Intended audience

Anyone who is curious about GPU programming as well as people with prior CUDA experience who want to improve the performance of their codes.

Schedule

GPU hardware overview and CUDA introduction
Serial CPU C code example and porting to CUDA
Parallelization and step-by-step performance tuning
Break
Introduction of irregular tree algorithm
Detailed optimization of four irregular kernels
Summary, conclusion, and outlook

Abstract

GPUs can offer an order of magnitude higher performance, energy efficiency, and price/performance than multicore CPUs, but it is substantially harder to write efficient programs for GPUs, especially if the programs are not very regular. In this tutorial, I will first talk about the key hardware features of GPUs and give an overview of the CUDA C programming language. Then I will introduce a regular n-body code and show how to port it to CUDA and parallelize it. Next, I will illustrate, in ten steps, how to optimize this code until it runs at close to peak performance on a high-end GPU. After the break, I will present a more sophisticated but irregular n-body code that repeatedly builds an unbalanced tree data structure and performs complex traversals on it. Then I will discuss how this irregular algorithm can be implemented in CUDA to exploit the GPU hardware as fully as possible. The final version of this code running on one GPU is over twenty times faster than an optimized OpenMP implementation running on a high-end hex-core CPU. I will conclude the tutorial with a summary of general optimization strategies and an outlook on future developments.

Presenter

Martin Burtscher, Texas State University-San Marcos, burtscher@txstate.edu

Speaker biography

Martin Burtscher is Associate Professor in the Department of Computer Science at Texas State University-San Marcos. He received the combined BS/MS degree in computer science from the Swiss Federal Institute of Technology (ETH) Zurich in 1996 and the Ph.D. degree in computer science from the University of Colorado at Boulder in 2000. Martin's research interests include efficient parallelization of programs for GPUs and multicore CPUs as well as automatic performance assessment and optimization of HPC applications. He is a senior member of the IEEE, its Computer Society, and the ACM. Martin has co-authored over 60 peer-reviewed publications, including a book chapter in NVIDIA's GPU Computing Gems, is the recipient of an NVIDIA Academic Partnership award, and is the PI of a CUDA Teaching Center.