Artificial Intelligence (AI) has become ubiquitous in the tech industry, with companies rapidly integrating AI into their workflows to boost developer productivity and reduce costs. However, as someone without deep AI expertise, I lacked the motivation to wade through the complex mathematics and theoretical foundations of machine learning. My intuition told me this wasn't the path I wanted to pursue — a feeling reinforced by previous deep learning internships, where I found the cycle of model construction, overnight training, and constant re-evaluation frustrating rather than fulfilling.

So where should I direct my time and curiosity? I realized that behind every AI/ML application lies critical infrastructure designed to support it. Most notably, AI models depend heavily on GPUs for their computational needs. Unlike traditional CPUs, GPUs excel at the parallel processing required for machine learning — handling thousands of simultaneous calculations that would bottleneck a CPU. This infrastructure layer represents a fascinating intersection of hardware optimization and software engineering that directly enables the AI revolution we're witnessing today.

Therefore, I've decided to compile essential GPU information to build a foundational understanding, with a particular focus on running CUDA applications on NVIDIA hardware. This post won't dive deep into CUDA programming concepts, but will provide sufficient detail to get you up and running with your first application.

Introduction (CPU vs GPU)

Before programming on such hardware, we need to understand some of the details underlying it.

[Figure: CPU vs GPU architecture diagram]

From this diagram, we can observe that both CPUs and GPUs share similar cache architectures with L1/L2 levels. However, their core designs differ fundamentally: CPUs feature fewer but more powerful cores optimized for complex, sequential tasks, while GPUs contain hundreds or thousands of smaller, simpler cores. This abundance of lightweight cores enables GPUs to execute the same instruction across many data elements simultaneously — a capability known as parallel processing that makes them ideal for computationally intensive workloads like machine learning.

CUDA Device Architecture

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model designed to leverage the massive parallel processing power of their GPUs. At the heart of CUDA-enabled GPUs lies a unified architecture where hundreds or thousands of cores are organized into groups called Streaming Multiprocessors (SMs). Each SM contains multiple cores that execute instructions in lockstep, enabling efficient parallel computation across large datasets—a fundamental requirement for machine learning, scientific computing, and graphics processing.

We'll explore warps — groups of 32 threads that execute instructions simultaneously — and how they enable CUDA's massive parallel execution capabilities in later blog posts.
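To make this execution model concrete, here is a minimal, illustrative vector-addition kernel (the array size, block size, and function names are arbitrary choices for this sketch): every thread runs the same instruction on a different array element.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes one element: the same instruction, applied to different data.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Allocate and initialise host data.
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Allocate device buffers and copy the inputs over.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch enough blocks of 256 threads to cover all n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocks, threadsPerBlock>>>(da, db, dc, n);
    cudaDeviceSynchronize();

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);  // expect 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

Each thread derives its own global index from its block and thread IDs — the pattern most CUDA kernels start from.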

Using CUDA

To take advantage of the massive parallel computing capabilities of NVIDIA GPUs, we need matching software. That software stack is the CUDA Toolkit.

You can install it from the CUDA downloads page. Follow the steps to install the right version for your OS, and take note of the GCC version you are running: older toolkits may require an older GCC compiler in order to work.

Older toolkit versions can be found in this archive.

1. CUDA Compiler (NVCC)

nvcc is the executable that you will use to compile CUDA code. Once you have installed the toolkit, this tool will be available to you.

2. CUDA Libraries

In addition to the compiler, you also get libraries (a non-exhaustive list below) that provide optimized functions:

  1. cuBLAS: GPU-accelerated Basic Linear Algebra Subprograms
  2. cuFFT: Fast Fourier Transform Library for GPUs
  3. cuRAND: Random number generation library
  4. cuDNN: Deep Neural Network library used for deep learning applications
See the full list of available libraries. These libraries allow us to perform complex computations on the GPU without having to hand-roll optimized versions ourselves.
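As a taste of what these libraries look like in practice, here is a minimal sketch of using cuBLAS to compute a SAXPY (y = alpha * x + y). The file name saxpy.cu is just an example; the sketch assumes you link against cuBLAS with -lcublas.

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 4;
    float hx[n] = {1, 2, 3, 4};
    float hy[n] = {10, 20, 30, 40};
    const float alpha = 2.0f;

    // Copy the input vectors to the device.
    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

    // cuBLAS runs the optimized GPU kernel for y = alpha * x + y.
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);
    cublasDestroy(handle);

    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%f ", hy[i]);  // expect 12 24 36 48
    printf("\n");

    cudaFree(dx);
    cudaFree(dy);
    return 0;
}
```

Compile with something like nvcc saxpy.cu -o saxpy -lcublas. Notice that we never wrote a kernel ourselves; the library supplies the tuned implementation.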

Quick Example

With the CUDA toolkit installed (make sure nvcc is on your path), we're ready to run our first CUDA application. The code below queries and displays your GPU's device properties—crucial information for understanding your hardware's capabilities and optimizing performance (though we'll save optimization techniques for future posts).
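A minimal sketch of such a device-query program, saved as main.cu, could look like the following. It uses the runtime API's cudaGetDeviceCount and cudaGetDeviceProperties; the exact set of properties printed is just a selection.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess || deviceCount == 0) {
        printf("No CUDA-capable device found: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);

        printf("Device %d: %s\n", i, prop.name);
        printf("  Compute capability:        %d.%d\n", prop.major, prop.minor);
        printf("  Streaming multiprocessors: %d\n", prop.multiProcessorCount);
        printf("  Warp size:                 %d\n", prop.warpSize);
        printf("  Max threads per block:     %d\n", prop.maxThreadsPerBlock);
        printf("  Global memory:             %.2f GiB\n",
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```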

Run nvcc -ccbin g++-11 main.cu -o main && ./main to compile and run the above code. Swap out the g++ version for one supported by the toolkit version you have installed.

When you compile CUDA applications with nvcc, the CUDA runtime header is included automatically: nvcc implicitly pulls in cuda_runtime.h, so kernel launches, memory management functions, and other CUDA runtime API calls work without an explicit include.

Conclusion

This introduction barely scratches the surface of CUDA programming and NVIDIA GPU architecture. There's a vast landscape of concepts — from memory hierarchy and thread synchronization to kernel optimization and performance tuning — that we'll need to explore before building efficient, production-ready applications.