General-purpose computing on GPU with CUDA
Authors: Komarichev R. E., Girovskaya I. V.
Source: Young scientists' researches and achievements in science: proceedings of the scientific and technical conference for young scientists (Donetsk, April 16, 2020) / resp. ed. E. N. Kushnirenko. Donetsk: DonNTU, 2020. pp. 56-62.
Abstract
Komarichev R. E., Girovskaya I. V. General-purpose computing on GPU with CUDA. The article describes the main differences between the CPU and the GPU and reviews the basics of how CUDA works. Two common GPGPU tasks are analyzed and explained with examples.
Everyone who has ever tried to program started with small scripts that solved simple tasks. Such programs do not need much computing power and run perfectly well on a single CPU core. You can go a long way without ever running into a lack of power, but there are many cases where a sequential solution is not fast enough for real-world use.
As you might know, the CPU (central processing unit) and the GPU (graphics processing unit) are both processors, but they are optimized for different kinds of workloads.
The main task of a CPU is to execute chains of instructions as quickly as possible. It is designed to run several chains at the same time, or to split one chain of instructions into many and merge them back together afterwards. Because each instruction typically depends on a previous one, a CPU has only a few computing cores. All the emphasis is on execution speed and on reducing idle time, which is achieved with caches and pipelines.
The main task of a GPU is rendering graphics and visual effects. Its work essentially consists of a huge number of independent tasks, so it has much more memory, although not as fast as the CPU's. A modern GPU also has thousands of computing cores, whereas a CPU most often has 2-8.
There are also many differences in multithreading support. A CPU executes 1-2 computing threads per core, while a GPU can launch several thousand threads per multiprocessor. Switching between threads costs a CPU hundreds of clock cycles; a GPU switches several threads within a single clock cycle. In a CPU, most of the chip area is occupied by instruction buffers, hardware branch prediction, and large caches, while in a GPU most of the area is taken by execution units (Fig. 1) [1].
One of the most common operations that runs much faster on a GPU than on a CPU is matrix multiplication. Say we have two square matrices A and B, and their product is matrix C. By the rules of matrix multiplication, each element of C is the sum of products of a row of A and a column of B.
For N = 100 we have to perform (100 + 99) * 100 * 100 = 1,990,000 arithmetic operations, not counting index increments. That is a lot to compute sequentially, and the larger N is, the longer it takes. The key point is that each element of C is independent of every other, so we do not have to wait for the result of C1,1 before computing C1,2 or any other element. These tasks can therefore be parallelized efficiently.
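In symbols, each element of C is the dot product of a row of A and a column of B, which is exactly where the operation count above comes from (N multiplications and N - 1 additions per element, over N^2 elements):

```latex
C_{i,j} = \sum_{k=1}^{N} A_{i,k} B_{k,j},
\qquad
\underbrace{\bigl(N + (N-1)\bigr)}_{\text{ops per element}} \cdot N^2
= (2N-1)\,N^2
\;\overset{N=100}{=}\; 1\,990\,000 .
```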
The most popular technology for such purposes today is CUDA, a parallel computing architecture developed by NVIDIA Corporation that can significantly increase computing performance through the use of GPUs. It is widely used by software developers, scientists, and researchers in fields such as video and image processing, computational biology and chemistry, fluid dynamics simulation, reconstruction of computed tomography images, seismic analysis, and more. To use this technology you need any NVIDIA GeForce 400 series video card or later and C/C++ programming skills. Specifications of recent cards are listed below (Table 1) [2][3][4].
Model | Number of CUDA cores | Base clock (MHz) | Memory |
---|---|---|---|
GTX 1050 | 640 / 768 | 1354 / 1392 | 2 GB GDDR5 / 3 GB GDDR5 |
GTX 1050 Ti | 768 | 1290 | 4 GB GDDR5 |
GTX 1060 | 1152 / 1280 | 1506 / 1506 | 3 GB GDDR5 / 6 GB GDDR5X |
GTX 1070 | 1920 | 1506 | 8 GB GDDR5 |
GTX 1070 Ti | 2432 | 1607 | 8 GB GDDR5 |
GTX 1080 | 2560 | 1607 | 8 GB GDDR5X |
GTX 1080 Ti | 3584 | 1481 | 11 GB GDDR5X |
GTX 1650 | 896 / 896 | 1485 / 1410 | 4 GB GDDR5 / 4 GB GDDR6 |
GTX 1650 SUPER | 1280 | 1530 | 4 GB GDDR6 |
GTX 1660 | 1408 | 1530 | 6 GB GDDR5 |
GTX 1660 Ti | 1536 | 1500 | 6 GB GDDR6 |
GTX 1660 SUPER | 1408 | 1530 | 6 GB GDDR6 |
RTX 2060 | 1920 | 1365 | 6 GB GDDR6 |
RTX 2070 | 2304 | 1410 | 8 GB GDDR6 |
RTX 2070 SUPER | 2560 | 1605 | 8 GB GDDR6 |
RTX 2080 | 2944 | 1515 | 8 GB GDDR6 |
RTX 2080 SUPER | 3072 | 1650 | 8 GB GDDR6 |

Rows with two values separated by "/" list two memory variants of the same model.
CUDA C++ extends C++ and allows us to define C++ functions, called kernels, that are executed N times in parallel by N different CUDA threads. To define the number of parallel threads and distribute data among them, CUDA uses the concepts of blocks and grids: a call to a kernel launches a grid of blocks, and each block contains many parallel threads (Fig. 2).
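As a minimal sketch (our example, not code from the article; the kernel name and sizes are illustrative), a kernel is defined with the __global__ qualifier and launched with the <<<blocks, threads>>> syntax:

```cuda
// A kernel is an ordinary function marked __global__;
// it runs once in every CUDA thread of the launched grid.
__global__ void fillIndices(int *out, int n) {
    // Global thread index: block offset plus position inside the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // the grid may be slightly larger than n
        out[i] = i;
}

int main() {
    const int n = 1024;
    int *d_out;
    cudaMalloc(&d_out, n * sizeof(int));

    // Grid of blocks: 256 threads per block, enough blocks to cover all n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    fillIndices<<<blocks, threadsPerBlock>>>(d_out, n);
    cudaDeviceSynchronize();    // wait for the asynchronous launch to finish

    cudaFree(d_out);
    return 0;
}
```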
Let's run some experiments and see how much benefit we get from CUDA. The hardware is an Intel Core i7-8750H versus an NVIDIA GeForce RTX 2060. In each experiment we multiply two square matrices of size NxN on the CPU (single thread) and on the GPU and measure the elapsed time (Table 2). Blocks always have the dimensions 32x32, and the grid size is calculated from this and from the matrix size so that each element of C is computed in its own thread; a sketch of such a kernel is given right after the table.
N | CPU, ms | GPU, ms | Acceleration |
---|---|---|---|
256 | 101 | 2 | x50.5 |
512 | 542 | 9 | x60.2 |
768 | 1789 | 29 | x61.7 |
1024 | 5013 | 70 | x71.6 |
1280 | 10278 | 177 | x77.3 |
1536 | 20925 | 220 | x95.1 |
1792 | 38214 | 314 | x121.7 |
2048 | 59879 | 420 | x145.3 |
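A minimal sketch of such a multiplication kernel (our reconstruction under the stated 32x32 block layout, assuming row-major float matrices already resident in device memory; not the authors' exact code):

```cuda
// Each thread computes a single element C[row][col] of C = A * B.
__global__ void matMul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)   // dot product of a row of A and a column of B
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

// Launch configuration: 32x32 blocks, grid sized to cover the N x N result.
// dim3 block(32, 32);
// dim3 grid((N + 31) / 32, (N + 31) / 32);
// matMul<<<grid, block>>>(d_A, d_B, d_C, N);
```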
As Table 2 shows, the operation takes much less time on the GPU. Figure 3 illustrates the resulting acceleration.
To be honest, though, it should be noted that computing on the GPU carries additional memory-exchange overhead. Every time we want to perform a GPU computation, we have to send the input data from host memory to device memory and, once the computation is done, send the output data back, whereas with the CPU all the data stays on the host the whole time. Luckily for us, there is very little data in this experiment, its transfer took almost no time, and the results are still valid. But we should keep this overhead in mind.
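The round trip that every GPU computation pays for looks roughly like this (a minimal sketch; someKernel is a hypothetical placeholder for whatever kernel does the actual work):

```cuda
#include <vector>

// Hypothetical kernel standing in for the actual computation.
__global__ void someKernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Copy in, compute, copy out: the two cudaMemcpy calls are the
// overhead a CPU-only version never pays.
void runOnGpu(const std::vector<float> &h_in, std::vector<float> &h_out) {
    int n = (int)h_in.size();
    size_t bytes = n * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);    // host -> device

    int threads = 256;
    someKernel<<<(n + threads - 1) / threads, threads>>>(d_in, d_out, n);

    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);  // device -> host

    cudaFree(d_in);
    cudaFree(d_out);
}
```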
To see the influence of memory exchange, let's run another experiment. We take two enormous arrays of numbers, A and B, and want to find an array C of the same size where Ci = Ai + Bi. Here the number of operations per element is 1 (a single addition), unlike the previous experiment where it was 2N-1 (N products plus N-1 additions to sum them). Each element is therefore faster to compute, and we can take a much larger N without increasing the duration of the experiment (Table 3). A sketch of the addition kernel is given right after the table.
N | CPU, ms | GPU w/o transfers, ms | One transfer of N elements, ms | GPU with transfers, ms | Acceleration |
---|---|---|---|---|---|
2^30 = 1073741824 | 2374 | 42 | 165 | 537 | x4.4 |
2^31 = 2147483648 | 4714 | 90 | 316 | 1038 | x4.5 |
2^32 = 4294967296 | 9431 | 176 | 640 | 2096 | x4.5 |
2^33 = 8589934592 | 18907 | 366 | 1290 | 4236 | x4.5 |
2^34 = 17179869184 | 37678 | 706 | 2650 | 8656 | x4.4 |
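A minimal sketch of the addition kernel (our version, not the authors' code), written with a grid-stride loop so that a grid of bounded size can cover arrays far longer than one thread per element would allow:

```cuda
// C[i] = A[i] + B[i]: exactly one arithmetic operation per element.
// The grid-stride loop lets a fixed-size grid walk an array of any length,
// which matters for the huge N used in this experiment.
__global__ void vecAdd(const float *A, const float *B, float *C, size_t n) {
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        C[i] = A[i] + B[i];
}

// Example launch: vecAdd<<<4096, 256>>>(d_A, d_B, d_C, n);
```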
As Table 3 shows, the execution time grows linearly with N (Fig. 4), because N does not affect the amount of work per element. The acceleration is now many times lower because of the need to transfer a huge amount of data between the host and the device.
The conclusion is that the efficiency of GPU-accelerated computing is directly proportional to the number of operations performed on each individual piece of data independently of the others.
References
1. CUDA C++ Programming Guide // CUDA Toolkit Documentation. URL: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html (accessed 15.04.2020).
2. GeForce 10 Series Graphics Cards // NVIDIA. URL: https://www.nvidia.com/en-us/geforce/10-series/ (accessed 15.04.2020).
3. GeForce GTX 16 Series Graphics Cards // NVIDIA. URL: https://www.nvidia.com/en-us/geforce/graphics-cards/16-series/ (accessed 15.04.2020).
4. GeForce RTX 20 Series and 20 SUPER Graphics Cards // NVIDIA. URL: https://www.nvidia.com/en-us/geforce/20-series/ (accessed 15.04.2020).