Graphics Processing Units (GPUs) have traditionally been game- and video-centric devices.
Due to the increasing computational requirements of graphics-processing applications, GPUs have become
very powerful parallel processors, which in turn has stimulated research interest in computing outside
the graphics community. Until recently, however, programming GPUs was limited to graphics libraries
such as OpenGL and Direct3D, and for many applications, especially those based on integer arithmetic,
the performance gains over CPUs were minimal or even negative. The release of NVIDIA’s G80
series and ATI’s HD2000 series GPUs (which implemented the unified shader architecture), along with the
companies’ release of higher-level language support with Compute Unified Device Architecture (CUDA),
Close to Metal (CTM) and the more recent Open Computing Language (OpenCL), has, however, facilitated the
development of massively parallel general-purpose applications for GPUs. These general-purpose GPUs have
become a common target for numerically intensive applications given their ease of programming
(compared to previous-generation GPUs) and their ability to outperform CPUs in data-parallel applications,
commonly by orders of magnitude.
In addition to the common floating-point processing capabilities of previous-generation GPUs, starting
with the G80 series, NVIDIA’s GPU architecture added support for integer arithmetic, including 32-bit
addition/subtraction and bitwise operations, scatter/gather memory access and different memory spaces.
Each GPU contains between 10 and 30 streaming multiprocessors (SMs), each equipped with: eight scalar
processor (SP) cores, fast 16-way banked on-chip shared memory (16 KB/SM), a multithreaded instruction unit,
a large register file (8192 registers for G80-based GPUs, 16384 for the newer GT200 series), read-only caches for
constant (8 KB/SM) and texture memories (varying between 6 and 8 KB/SM), and two special function units.
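
Several of the per-device resources listed above can be inspected at run time. The following is a minimal sketch, assuming the CUDA runtime API and its `cudaDeviceProp` structure; the helper name `report_device_resources` is hypothetical and not part of CUDA. Note that the runtime reports the total constant memory of the device rather than the per-SM constant cache size quoted above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints some of the resources discussed in the text
// for device `dev`. Returns the number of SMs, or -1 if the query fails
// (for example, when no CUDA-capable device is present).
int report_device_resources(int dev)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
        return -1;

    printf("device:                  %s\n", prop.name);
    printf("multiprocessors (SMs):   %d\n", prop.multiProcessorCount);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("registers per block:     %d\n", prop.regsPerBlock);
    // Total constant memory on the device, not the per-SM constant cache.
    printf("constant memory:         %zu bytes\n", prop.totalConstMem);
    printf("max threads per block:   %d\n", prop.maxThreadsPerBlock);
    return prop.multiProcessorCount;
}
```

Calling `report_device_resources(0)` from a host program queries the first device; the values printed depend on the installed GPU.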
CUDA is an extension of the C language that employs a new massively parallel programming model, single-instruction,
multiple-thread (SIMT). SIMT differs from SIMD in that the underlying vector size is hidden and the
programmer is restricted to writing scalar code that is parallel at the thread level. The programmer defines
kernel functions, which are compiled for and executed on the SPs of each SM, in parallel: each light-weight
thread executes the same code, operating on different data. A number of threads (at most 512) are grouped
into a thread block, which is scheduled on a single SM and whose threads time-share the SPs. This additional
hierarchy allows threads within the same block to communicate using the on-chip shared memory and
synchronize their execution using barriers. Moreover, multiple thread blocks can be executed simultaneously
on the GPU as part of a grid; a maximum of eight thread blocks can be scheduled per SM and, in order to hide
instruction and memory (among other) latencies, it is important that at least two blocks be scheduled on each SM.
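
The thread, block and grid hierarchy described above can be illustrated with a small integer kernel. The following is a minimal sketch, not taken from any particular implementation: the names `block_sum`, `gpu_sum` and `THREADS_PER_BLOCK` are hypothetical, and the block size of 256 is an assumption (a power of two within the 512-thread limit, which the reduction loop relies on). Each thread is written as scalar code, threads within a block cooperate through shared memory and barriers, and the grid contains as many blocks as are needed to cover the input.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define THREADS_PER_BLOCK 256  // assumed block size; must be a power of two

// Kernel: every thread runs this same scalar code on a different element.
// Each block writes the sum of its elements to out[blockIdx.x].
__global__ void block_sum(const unsigned int *in, unsigned int *out, int n)
{
    __shared__ unsigned int partial[THREADS_PER_BLOCK];  // on-chip shared memory

    int tid = threadIdx.x;                            // index within the block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // index within the grid

    partial[tid] = (gid < n) ? in[gid] : 0u;
    __syncthreads();  // barrier: all loads are visible to the whole block

    // Tree reduction in shared memory; the barrier separates the rounds.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = partial[0];
}

// Host helper: launches a grid of blocks and adds the per-block sums on the
// CPU. Returns false if any CUDA call fails (e.g. no device is available).
bool gpu_sum(const std::vector<unsigned int> &data, unsigned int *total)
{
    int n = (int)data.size();
    *total = 0;
    if (n == 0)
        return true;

    int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    unsigned int *d_in = nullptr, *d_out = nullptr;
    std::vector<unsigned int> sums(blocks);

    bool ok =
        cudaMalloc((void **)&d_in, n * sizeof(unsigned int)) == cudaSuccess &&
        cudaMalloc((void **)&d_out, blocks * sizeof(unsigned int)) == cudaSuccess &&
        cudaMemcpy(d_in, data.data(), n * sizeof(unsigned int),
                   cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        block_sum<<<blocks, THREADS_PER_BLOCK>>>(d_in, d_out, n);
        ok = cudaMemcpy(sums.data(), d_out, blocks * sizeof(unsigned int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);   // cudaFree accepts a null pointer
    cudaFree(d_out);

    if (ok)
        for (int i = 0; i < blocks; ++i)
            *total += sums[i];
    return ok;
}
```

With 256 threads per block, up to two such blocks fit within a 512-thread budget per SM, which is in line with the recommendation above to keep at least two blocks resident; the actual number of resident blocks also depends on register and shared memory usage.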