Nvidia, CUDA, and you…

The future of high-throughput computing!

Ape Inago
Scat Sense
Dec 28, 2016


As we see a rise in dedicated GPU-like processors for applications such as AR/VR and AI, I thought it might be useful to rewrite this paper on the early history of GPUs and the rise of CUDA as a Medium post for future developers and other interested parties. Though originally written in December 2008 as the final paper for my computing architectures coursework, its short history of GPUs and cute M&M explanation of SIMD/MIMD are still quite relevant today.

If you are curious and want a much deeper overview of how massively parallel programming works in theory, I found Principles of Parallel Programming to be a wonderful high-level survey of the various techniques. If you want to dive deeper into the history of GPUs and graphics, I found The 8-bit Guy’s YouTube series to be quite illuminating. As for working with GPUs on a more practical level, I recommend the “GPU Gems” series of books that Nvidia has published online for free!

Happy programming!

— Nicholas (@ultimape) Perry

Parallel Programming

Graphics Processing Units, such as Nvidia’s 8xxx and 9xxx series of graphics cards, have become quite formidable parallel processing systems. No longer focused solely on graphics rendering for video games, Nvidia’s graphics co-processor is now capable of doing arbitrary computation. In this document I hope to inform you about the origins of CUDA, explain what it is, and show how it may impact the future of computing.

A) State of the Industry of Parallel Programming

i. Mindset & Reasoning

Early programming was very simple: assembly is literally a “list of instructions” fed into a processor one at a time, sequentially. Each instruction was evaluated as it was received, updating the internal state of the machine. As the instructions were fed in, the machine progressed through a series of steps, and eventually an end product was produced according to the pattern.

Largely, programming languages have followed suit with this step-by-step processing routine. Granted, jumps have been replaced with functions, and structured programming and flow-control blocks mean GOTOs have largely gone the way of the dinosaur. But once compiled, the program still ends up as pure assembly, run one instruction at a time.

There has been some advancement to this setup. Processors have been getting more efficient, smart enough to look ahead a few instructions and realize that some things can be done earlier or out of order. Executing two instructions at once in disjoint sections of the processing logic was a big advancement. Combined with pipelining, these superscalar CPUs are essentially parallelizing low-level code.

Recently, however, a lot of interest has been put into truly parallel computation… Not only are individual instructions being reordered to run in parallel, but entire programs are being run in parallel on separate CPUs. Being able to do two things at once has the potential to nearly double your processing ability.

ii. Thought Experiment 1: Spear Fishing for M&Ms

Figure 1 — 56oz of M&M’s

Here’s a little thought experiment. Technically, you could even try it in real life. Get a large bag of M&M’s and dump them out on a clean table. The M&M’s are your data. Pretend your hands are processors, and your mouth is where processed data goes.

To account for superscalar execution, green M&M’s can be grabbed three at a time if they are close together, but otherwise you can only grab one at a time per processor.

Which gives you more satisfaction… one processor or two?

Now, this is a simplistic model, and it doesn’t account for times when both hands want to grab the same M&M, which could be a problem if one of your hands is actually being run by another person. That overhead slows things down a bit, and optimization takes on a whole other level of complexity in that regard. But you get the idea.

Obviously, two processors let you get nearly twice as many M&M’s as one. The more arms the better in this case. Imagine if you were some kind of Spider-Man with 6+ arms… that’s a boatload of M&M’s!

iii. Thought Experiment 2: The Shovel

Another interesting way to eat M&M’s: Start like before, with the M&M’s laid out in front of you… Only this time, put your mouth toward the edge of the table and use one hand like a shovel to load them in 128/256/512 at a time.

This is how graphics cards handle displaying data to the user: the data is processed in parallel over arrays of vertices. You can imagine it like having a bunch of processors dedicated to individual pixels, iterating over the screen until all the M&M’s are the correct color and position.

The main difference here is that by using a shovel, you are going to have a hard time singling out individual M&M’s; you are pretty much forced to treat every M&M the same way. Anything more complicated than just shoveling them toward your mouth ends up being a lot harder. With the other method of using your hands as processors, you can manipulate the M&M’s in more individual ways, like maybe sharing that large bag with friends.

At this point, the bottleneck isn’t how fast you can process the data, it’s the size of your mouth. If you weren’t sick of M&Ms after the last experiment, you might just be after a few rounds of this.

iv. SIMD vs MIMD

The multiprocessor method is a hand, doing “multiple instructions on multiple data” a.k.a. MIMD. The graphics card is a shovel, doing a “single instruction on multiple data” a.k.a. SIMD. Depending on the task, graphics cards dominate performance just like your shovel hand dominated M&Ms.
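To make the hand/shovel distinction concrete, here is a rough sketch of the same “process every M&M” job written both ways in CUDA C. This is purely illustrative; the halving operation and the function names are made up for the example.

// CPU style (one hand): a single instruction stream walks the data sequentially.
void eat_cpu(float *candy, int n) {
    for (int i = 0; i < n; ++i)
        candy[i] *= 0.5f;                       // "process" one M&M at a time
}

// GPU style (the shovel): one instruction applied across many data elements.
// Each thread handles one M&M, and thousands of threads run at once.
__global__ void eat_gpu(float *candy, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        candy[i] *= 0.5f;                       // same instruction, many M&M's at once
}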

An interesting comparison: Windows 7 includes an implementation of the DirectX 10 graphics API that runs on multicore CPUs. Using that version gets a paltry 7 frames per second in the popular computer game Crysis, at 0.4 megapixels on low settings. [DirectX-Software] From personal experience, the standard GPU path with a mid-range graphics card can get well over 100 fps at 1.9 megapixels on low settings.

If you do the math, you get the CPU at 3,360,000 pixels/second versus the GPU at 192,000,000 pixels/second, which means the graphics co-processor gives you roughly 57x (5,700%) the performance. With rate increases like that, you’d get indigestion from all the M&M’s you’d be shoveling into your mouth!

v. GPU vs CPU

Mythbusters, a popular television show on the Discovery Channel, did their own version of this thought experiment. Their presentation gives a nice comparison of the Central Processing Unit (CPU) and the Graphics Processing Unit (GPU) using painting machines. They show one robot, a servo-controlled paintball gun they call “Leonardo”, painting a smiley face on a piece of paper by shooting paintballs one at a time. It takes forever to do this simple smiley face. Then they talk about wanting to paint the most famous smile in the world: the Mona Lisa…

So, in classic Mythbusters bravado, they unveil a giant “Leonardo v.2” which, instead of having one barrel, uses 1,100 “individually addressed” paintball barrels, over a mile of pneumatic tubing, and 220 gallons of compressed air. Instead of taking roughly a minute to do something simple, it takes 80 ms to produce a reasonably detailed facsimile of the Mona Lisa.

Figure 2 — Leonardo v1

They describe Leonardo v1 as: “How a CPU might do it: perform a series of discrete actions performed sequentially, one after another. Re-addressing itself one after another for each pixel it has to lay.”

Figure 3 — Leonardo v2

Leonardo v2 is described as: “Kind of like a parallel processor, or a GPU.”

Figure 4 — Results

B) Graphics Cards

i. History

One of the first “graphics” cards in widespread use was the IBM MDA. The “Monochrome Display Adapter” was a giant of a card that sat in an ISA slot. It managed the display of characters on the monochrome monitors in use at the time. It had 4 KB of RAM on which to store its buffer, and it could also output to a dot-matrix printer. The device seems very similar to the LCD display modules we’ve used in our coursework here at VTC (44780). [MDA]

As pixel-based displays such as the VGA format came to market, more advanced graphics cards capable of handling pixels were required. Manufacturers competed on larger memory sizes to handle more pixels, and eventually on the ability to handle more colors and higher resolutions. Aside from implementation details, this progression was for the most part more of the same in terms of what the graphics card was tasked with… it just went from rendering letters to rendering pixels.

The game changed with the advent of 3D acceleration. This gave a whole new aspect to the duties of a graphics card. Initially there were quite a few companies producing 3D graphics technology, including Matrox, Creative, S3, ATi, 3dfx, and others. Originally, features were exposed through proprietary APIs; however, OpenGL quickly became the dominant platform due to its portability and extensibility. [OpenGL]

Figure 5 — Quake II

ii. Trends

Video games have been the driving force behind the adoption of graphics card technology. OpenGL and 3D acceleration hit it big with the release of Quake II in 1997. It sold over a million units, and everyone was clamoring for a graphics card to play it at its intended quality.

As games continued to embrace 3D acceleration, demand for graphics cards steadily increased. The strong market had a lot of competing manufacturers, each trying to out-do the previous generation of cards to encourage consumers to buy theirs over the others. Eventually, after a couple of closings and a few buyouts, only two major competitors were really left in the market: ATi and Nvidia. [Nvidia 3dfx] The past five or so years have been a tumultuous back-and-forth between the two companies, each one trying desperately to gain market share.

Figure 6 — Crysis has amazing shaders. Something only a picture can explain.

One result of this competition was the advancement of the “shader”…

A shader is a small program (algorithm) that mathematically describes how an individual material is rendered and how light interacts with its overall appearance. As in the M&M example, shaders used to be calculated by the CPU every time a frame (image) was rendered. But a few years ago, companies like ATI and Nvidia released programmable graphics hardware capable of running these shader programs on the video card’s GPU. The GPU calculates these complex shaders much faster than your computer’s CPU, which in turn gives you incredible performance and flexibility. Just like the shovel example, this allowed them to render many more pixels while still keeping up with the demand for faster graphics. The graphics cards that did this include ATI’s RADEON 8x00 / 9x00 series and Nvidia’s GeForce 3 / 4 / FX series. [Shaders]
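Real shaders are written in dedicated languages such as GLSL or HLSL, but the underlying model is easy to sketch: run the same tiny program once per pixel. Here is an illustrative CUDA kernel in that per-pixel style; the brightness adjustment is an assumption made up for the example, not an actual shader.

// Illustration of the per-pixel model: every thread runs the same small
// program on its own pixel, much like a pixel shader would.
__global__ void brighten(unsigned char *pixels, int width, int height, float gain) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        int idx = y * width + x;                // one thread per pixel
        float v = pixels[idx] * gain;           // the same tiny program everywhere
        pixels[idx] = v > 255.0f ? 255 : (unsigned char)v;
    }
}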

C) CUDA?

CUDA is a natural extension of the programmable shader pipeline architecture of Nvidia’s latest GPUs. It lets programmers access these pipelines for their own purposes, allowing the processing power of the card to be used for more than just displaying graphics.

“Nvidia’s Compute Unified Device Architecture (CUDA) is a software platform for massively parallel high-performance computing on the company’s powerful GPUs.”

“A few years ago, pioneering programmers discovered that GPUs could be re-harnessed for tasks other than graphics. However, their improvised programming model was clumsy, and the programmable pixel shaders on the chips weren’t the ideal engines for general purpose computing. Nvidia has seized upon this opportunity to create a better programming model and to improve the shaders. In fact, for the high-performance computing market, Nvidia now prefers to call the shaders “stream processors” or “thread processors.” It’s not just marketing hype.” [CUDA-report]

Since one of the major competitive factors is the ability for games to implement new features, Nvidia released this system with a C-based API along with some pre-built libraries. At the time of writing, they have an entire SDK complete with reference examples, a profiler, and even a debugger. [SDK] This allows developers to put CPU and GPU code in the same source files and work with familiar tooling.
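To give a feel for what that looks like, here is a minimal sketch of the classic vector-add example in CUDA C. Host code and the GPU kernel live in the same .cu source file, and nvcc splits them apart at compile time (e.g. nvcc vec_add.cu -o vec_add).

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Device code: each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Host code: allocate, copy, launch, copy back.
int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);              // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}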

One of the libraries is “cuBLAS”, or CUDA Basic Linear Algebra Subprograms. It contains an API for accessing the vector and matrix abilities of the graphics card. Matrix operations are done as multi-step procedures on normal CPUs, but by using this library you can leverage the graphics card’s ability to run these calculations in parallel, over many pieces of data at once. [CUBLAS]
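As a rough sketch of how that looks (using cuBLAS’s newer handle-based interface rather than the 2008-era one), a single cublasSgemm call takes the place of the usual triple-nested matrix-multiply loop:

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: C = alpha * A * B + beta * C on the GPU via cuBLAS.
// d_A, d_B, d_C are column-major n x n matrices already in device memory.
void gemm_on_gpu(const float *d_A, const float *d_B, float *d_C, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,
                &alpha, d_A, n,
                d_B, n,
                &beta, d_C, n);

    cublasDestroy(handle);
}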

Another provided library offers FFT capabilities. Fast Fourier Transforms are used heavily in image processing and signal processing. Being able to speed up these kinds of calculations can mean minutes instead of hours of computation on super high-resolution images.
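The pattern with the FFT library (cuFFT) is much the same. A hedged sketch, assuming the complex samples are already sitting in device memory: build a plan, execute it, tear it down.

#include <cufft.h>

// Sketch: in-place forward FFT of N complex samples already on the device.
void fft_on_gpu(cufftComplex *d_signal, int N) {
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);                    // one batch of N points
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // in-place transform
    cufftDestroy(plan);
}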

Examples on the main Nvidia CUDA site show applications and uses of CUDA that cover the entire spectrum of the “seven dwarfs” discussed in David Patterson’s presentation on the future of computer architecture [Patterson]. These include things like ray tracing, speech recognition, quantum chemistry, particle physics, material stresses, fluid dynamics, soft-body physics, etc. [CUDA-demos]

Summary:

Nvidia’s CUDA is a unique platform that is well suited to computational tasks that follow the SIMD mindset. Instead of waiting years for many-core processors, only guessing at what they’ll be capable of, we can look at a card that has already saturated the market. Graphics cards are everywhere now. They are high powered, low cost, and relatively easy to program for.

What better excuse to get a top-of-the-range graphics card?

Sources

[DirectX-Software]
“Windows 7 allows DirectX 10 acceleration on the CPU”, Nov 28th, 2008
Ben Hardwidge

[MDA]
“Monochrome Display Adapter: Notes”, Nov 6th, 2005
John Elliott

[OpenGL]
“Thread on the history of OpenGL”, Mar 2nd, 2001
Various Authors

[Nvidia 3dfx]
“NVIDIA To acquire 3dfx core graphics assets”, Dec 15th, 2001

[Shaders]
“Support”
Mad Software Inc

[CUDA-report]
“Parallel Processing With CUDA”, Jan 1st, 2008
Tom R. Halfhill

[SDK]
“Learn more about CUDA”
Nvidia Corp

[CUBLAS]
“CUDA CUBLAS Library”, Mar 2008
Nvidia Corp

[CUDA-demos]
“What is CUDA?”
Nvidia Corp

[Patterson]
“Future of Computer Architecture”, Jan 2007
David A. Patterson
