
Introduction to Parallel Programming and CUDA with Sample Code

Author: D. Bolella
Source: D. Bolella. Introduction to Parallel Programming and CUDA with Sample Code. http://www.packtpub.com/article/introduction-parallel-programming-and-cuda-sample-code

Parallel programming is becoming one of the hottest topics in software today as multi-core CPUs decrease in price and increase in power. Parallelism in programs allows multiple processes to be executed concurrently using separate threads and processing units. This is appealing to developers and users alike, because it can help reduce runtimes while still producing the same results as a serial run. This article by Daniel Bolella gives an overview of parallel programming and CUDA and is accompanied by sample code to illustrate the concepts.

To give an example, let’s say we have an array that contains thousands of floating-point values and each value needs to be run through a lengthy algorithm. Instead of running each value through the algorithm consecutively (i.e. one at a time), parallelism allows multiple values to be processed simultaneously (i.e. running many values through the algorithm at the same time), reducing overall processing time while still producing fast and accurate results.

There are some restrictions on using parallelism, and not every program can be parallelized. For instance, let’s say we have that same program from before, but this time, after processing each value, we want to check the newly processed value against all the previously calculated values in the array before moving on to the next one. Run serially, we can confidently say that all previous values in the array have already been processed and are available for the check. If we tried to do this in parallel, we could get incorrect data, because multiple values are calculated at the same time and some may be ready for checking while others are not. Extra checks and steps are needed to prevent these types of concurrency issues. However, the results could still prove to be worth the extra steps.
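To make that dependency concrete, here is a minimal serial sketch of the scenario described above; heavy_algorithm() is a hypothetical stand-in for the lengthy computation and is not part of the original article:

#include <stdio.h>

/* Hypothetical stand-in for the lengthy per-element computation. */
static float heavy_algorithm(float v)
{
    return v * v + 1.0f;
}

int main()
{
    float in[8] = {0, 1, 2, 3, 2, 5, 1, 7};
    float out[8];
    int n = 8;
    int duplicates = 0;
    int i, j;

    for(i = 0; i < n; i++)
    {
        out[i] = heavy_algorithm(in[i]);    /* process the current value */

        /* Compare against every previously processed value. This relies on
           out[0..i-1] already being finished, which is only guaranteed when
           the loop runs serially. */
        for(j = 0; j < i; j++)
        {
            if(out[i] == out[j])
                duplicates++;
        }
    }

    printf("Found %d duplicate results\n", duplicates);
    return 0;
}

Because the inner loop reads results from earlier iterations, naively running the outer loop in parallel could read entries of out that have not been written yet; extra synchronization would be needed to make that safe.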

One of the major breakthroughs in parallel programming technology today goes beyond the scope of multi-core CPUs alone. Although multi-core CPUs offer a lot more power and potential than single-core units, another common computer component, the GPU, offers even more power, and NVIDIA’s CUDA platform makes this technology available to all developers easily and for free.

CUDA was developed by NVIDIA to provide simple access to GPGPU (General-Purpose computation on Graphics Processing Units) and parallel computing on their own GPUs. The logic behind the idea is that GPUs have much more processing power than CPUs and have numerous cores that operate in parallel to run intensive graphics operations. By allowing developers to utilize this power for their own projects, CUDA can provide fast solutions for heavy, time-consuming programs, specifically those that run the same process repeatedly and independently of other processes.

The learning curve is not very steep for most developers. CUDA makes GPGPU easy to use by adding functionality to the standard C and C++ programming languages. This allows for fast adoption by almost any programmer and helps with cross-platform integration.

To get started with CUDA, you will need a recent NVIDIA GPU (GeForce 8 series and beyond; you can check the NVIDIA website to see which GPUs are CUDA-enabled). CUDA works on Windows, Mac OS X, and certain Linux distributions. You will need to download and install the developer drivers, the CUDA toolkit, and the CUDA SDK from the NVIDIA website, in that order.

NVIDIA provides an installation guide on its website that gives more details about the installation process, as well as a way to verify that the installation is working.

Now to write and run our first CUDA program in C! Open a text editor and copy and paste the following code:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

/* Kernel: each thread multiplies one pair of elements. */
__global__ void mult_vect(float *x, float *y, float *z, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if(idx < n)
    {
        z[idx] = x[idx] * y[idx];
    }
}

int main()
{
    float *x_h, *y_h, *z_h;    /* host arrays   */
    float *x_d, *y_d, *z_d;    /* device arrays */

    int n = 20;
    size_t size = n * sizeof(float);

    /* Allocate host memory */
    x_h = (float *)malloc(size);
    y_h = (float *)malloc(size);
    z_h = (float *)malloc(size);

    /* Allocate device memory */
    cudaMalloc((void **)&x_d, size);
    cudaMalloc((void **)&y_d, size);
    cudaMalloc((void **)&z_d, size);

    int i;
    for(i = 0; i < n; i++)
    {
        x_h[i] = (float) i;
        y_h[i] = (float) i;
    }

    /* Copy the input arrays from the host to the device */
    cudaMemcpy(x_d, x_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(y_d, y_h, size, cudaMemcpyHostToDevice);

    int block_size = 4;
    int num_blocks = (n + block_size - 1) / block_size;

    /* Launch the kernel: num_blocks blocks of block_size threads each */
    mult_vect<<<num_blocks, block_size>>>(x_d, y_d, z_d, n);

    /* Copy the result array from the device back to the host */
    cudaMemcpy(z_h, z_d, size, cudaMemcpyDeviceToHost);

    for(i = 0; i < n; i++)
    {
        printf("%f ", z_h[i]);

        if(z_h[i] == (x_h[i] * y_h[i]))
            printf("Success\n");
        else
            printf("Fail\n");
    }

    /* Free host and device memory */
    free(x_h);
    free(y_h);
    free(z_h);

    cudaFree(x_d);
    cudaFree(y_d);
    cudaFree(z_d);

    return 0;
}

You will notice that there are a lot of similarities to regular C, with a few exceptions. Let’s first look at the main. Here, we declare arrays for both the host computer and the device (indicated by the _h and _d suffixes), allocating the host arrays with malloc() and the device arrays with cudaMalloc(). Because this is a simple example, we then fill the x_h and y_h arrays with floats from 0 to n - 1. Notice we only fill the host arrays. Code running on the device cannot directly access host memory: the values must first be created in host memory and then copied into the device’s memory before device code can use them, and the processed results must later be copied from the device back to the host before the host can use them. So, to move the input values from the host arrays to the device arrays, we use cudaMemcpy() with the cudaMemcpyHostToDevice direction.
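As a side note that is not part of the listing above, every CUDA runtime call returns a cudaError_t that can be translated to a message with cudaGetErrorString(). A minimal, self-contained sketch of the allocate-and-copy step with explicit error checking might look like this (the printed messages and variable names are illustrative, not from the article):

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

int main()
{
    int n = 20;
    size_t size = n * sizeof(float);
    float *x_h = (float *)malloc(size);
    float *x_d = NULL;
    cudaError_t err;
    int i;

    for(i = 0; i < n; i++)
        x_h[i] = (float) i;

    /* Allocate device memory and check the result. */
    err = cudaMalloc((void **)&x_d, size);
    if(err != cudaSuccess)
    {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* Copy the host array to the device and check the result. */
    err = cudaMemcpy(x_d, x_h, size, cudaMemcpyHostToDevice);
    if(err != cudaSuccess)
    {
        printf("cudaMemcpy failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Copied %d floats to the device\n", n);

    cudaFree(x_d);
    free(x_h);
    return 0;
}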

Next, the block size and the number of blocks to be used during the GPU run must be chosen. The block size specifies how many threads to use per block, and the number of blocks determines how many blocks of threads are launched to cover the array. For more advanced tuning, the number of blocks and their sizes can be refined to produce even faster results. However, for this simple exercise, we will just specify a block size of 4 threads and calculate the number of blocks from that: (20 + 4 - 1) / 4 = 23 / 4 = 5. Because integer division rounds down, adding block_size - 1 before dividing effectively rounds the result up, so no element is left without a thread; the sketch below shows how the formula behaves for a few nearby array lengths.
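The following small, self-contained sketch (with illustrative array lengths, not taken from the article) shows the rounding-up formula in action and why the if(idx < n) guard in the kernel matters: any threads beyond n are simply skipped.

#include <stdio.h>

int main()
{
    int block_size = 4;
    int n;

    for(n = 18; n <= 21; n++)
    {
        /* Rounding-up integer division: enough blocks so every element gets a thread. */
        int num_blocks = (n + block_size - 1) / block_size;
        int total_threads = num_blocks * block_size;

        /* Threads with idx >= n fail the if(idx < n) test in the kernel and do nothing. */
        printf("n=%d: %d blocks, %d threads, %d idle\n",
               n, num_blocks, total_threads, total_threads - n);
    }

    return 0;
}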

Now we call our CUDA function to process the array using the GPU. Let’s examine what the function looks like and how it works.

You’ll notice that our CUDA function is declared with "__global__ void". This states that it is a CUDA function called by the host to then be run by the GPU. (Note: A function with a "__device__" prefix is called by the device to be run on the device. I will expand upon this later on.) Our function takes in the three float arrays and the array length, n. The function first calculates the current index by multiplying the block ID by the block dimension (the number of threads per block) and adding the thread ID; for example, with a block size of 4, thread 3 of block 2 handles index 2 * 4 + 3 = 11. If the calculated index is within n, we multiply x and y at that index and store the result in z. After all the threads have finished, control returns from the device to the host’s main.

Back in the main, we copy the new values back from the device to the host using cudaMemcpy() again, this time with the cudaMemcpyDeviceToHost direction. Then, we run a loop to display the results from the device and check whether the values are correct. Lastly, we free all of the allocated memory, using free() for the host arrays and cudaFree() for the device arrays.

Now to run the code! Save the file with a .cu extension, then open the command prompt/terminal and navigate to the directory where you saved it (nvcc, installed with the CUDA toolkit, needs to be on your path). Compile the code by typing "nvcc <filename>.cu -o <output name>", and after it has compiled, run "<output name>.exe" (Windows) or "./<output name>" (Linux).

If the program runs and every value prints Success, then that’s it! You are now set up for GPGPU and parallel programming on your computer!

Now that we’ve gone over the basics of how CUDA looks and works, the last thing to cover is CUDA functions. Functions in CUDA are similar to functions in C or C++, but carry prefixes that identify what type of CUDA function they are. As mentioned earlier, they are distinguished by the "__global__" and "__device__" prefixes. Both kinds run on the device. The difference is that "__global__" functions are called by the host (usually from the main), while "__device__" functions are called by other functions already running on the device, whether those carry the "__global__" or the "__device__" prefix. So in our example, we could actually create a "__device__" function that multiplies x and y and returns the value, and then have our "__global__" function set z equal to the result, as sketched below.
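A brief sketch of that idea (this variant is not part of the original listing) could look like the following; the helper name mult is just illustrative:

/* Helper that runs on the device and is callable only from device code. */
__device__ float mult(float a, float b)
{
    return a * b;
}

/* Kernel called from the host; each thread delegates its multiplication
   to the __device__ helper above. */
__global__ void mult_vect(float *x, float *y, float *z, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if(idx < n)
    {
        z[idx] = mult(x[idx], y[idx]);
    }
}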

Summary

There are many more things to talk about and discuss with CUDA, but they are beyond the scope of this article. My hope for this introduction is that you have learned what CUDA is and how it can be used, and that it will come in handy for any intensive programs you need to run in the future. For more source code, examples, and documentation, I recommend visiting the NVIDIA website, where there is plenty of support and an elaborate showcase of what others have done with the power of CUDA. Enjoy!