This is a response I wrote to a Reddit r/ELI5 question asking how exactly a 3D world is rendered to a screen. The original post has been deleted, but can be viewed here.

It’s long and a lot of stuff is intentionally omitted for the ELI5 aspect, but a few kind commenters asked if I would re-host it somewhere else — so here it is.


Hi, game engine developer here.

It’s probably best to explain the entire process, not just what happens on the graphics card as you’re asking — especially since this is a hard subject to ‘ELI5’. That’s also why this will be a bit long, so buckle up. Also, apologies for any holes in the answer — 3D/graphics theory isn’t necessarily hard, it’s just broad, so covering all of it while I have my morning coffee isn’t the easiest thing to do.

Also, I went over the 10,000 character limit so the second half of this response was added as a comment reply. If you understand what mesh data, textures and shaders are already, skip to that comment. Otherwise, keep reading.

Let’s start with the computer itself. There are two major pieces to consider when talking about the “graphics pipeline”, or the steps in drawing a single “frame” in a video game: the CPU (your computer) and the GPU (your graphics card), which stand for Central Processing Unit and Graphics Processing Unit, respectively.

In all games/game engines, the vast majority of game logic is happening on the CPU — that is, the part of the computer that “thinks” about your keyboard/mouse input, talks to the internet, decides what to do with that button you just pressed, how much damage to deal to the enemy you just shot, etc.

This sort of logic usually happens in what is commonly called an “update loop”, or some code that runs over and over again in order to decide what’s actually happening in the game:

Did an enemy just see you? What should it do next? Are you holding the W key? Does that make you move forward? How far? Did that bomb just explode and shoot a car into the air? What’s the car’s new position? Is it rotating? Are there physics involved?

All of these questions are part of the game logic, which generally runs dozens to hundreds of times per second in order to keep the game feeling responsive and up to date.

Meanwhile, there’s the “render loop”. Of the two common loops (update and render), this loop is fairly straightforward. Depending on the player’s graphics settings (e.g. some games offer V-Sync support, which limits framerate), this loop runs once for each “frame” in a game.

A frame is a single, motionless picture: one still image out of a video game, a movie, or a GIF, for example. In video games in particular, each frame is re-calculated, or “re-rendered”, from scratch each time one needs to be displayed.

Game engines generally take the current “state” of a game (what is calculated by the update loop I mentioned earlier) — usually the positions, or “transforms”, of each thing in the game — and use those positions, rotations, and scaling (how big it is) to instruct the GPU (your graphics card) to “draw” or render that frame using all of that data.
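To make that concrete, here’s a heavily simplified sketch of the two loops in Python (the state, input names, and timing are made up for illustration; real engines are far more involved):

```python
import time

def update(state, dt, inputs):
    # Game logic: react to input, move things, run physics, deal damage, etc.
    state["player_x"] += inputs.get("move_x", 0.0) * state["speed"] * dt
    return state

def render(state):
    # Hand the current state (positions, rotations, scales) to the GPU to draw.
    # In a real engine this issues draw calls through OpenGL/DirectX/Vulkan.
    print(f"drawing frame: player at x = {state['player_x']:.2f}")

state = {"player_x": 0.0, "speed": 5.0}
last_time = time.time()

for _ in range(3):                     # a real game loops until you quit
    now = time.time()
    dt = now - last_time               # time elapsed since the last iteration
    last_time = now

    state = update(state, dt, inputs={"move_x": 1.0})  # "W key held down"
    render(state)                      # draw one frame from the current state
```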

Further, everything that is sent to the graphics card has some data attached to it. There are many different types of data, but the most common types (at least for 3D games) are:

  • mesh data (3D models)
  • texture data (images, though not only images — many of which are added to “skin” a mesh)
  • shader programs (actual programming code that can manipulate the graphics directly on the GPU instead of bogging down the CPU)

Mesh data is a bunch of 3D model data, usually created by a 3D artist in a 3D modeling program such as Blender, Maya, 3DS Max, ZBrush, etc. I won’t go much into how it’s created, but instead leave it at this: the data holds what are called “vertices” (singular “vertex”), which have positions in 3D space relative to an “origin point” (the vertices’ positions are “local” to the origin point). The origin point is what is used to calculate the 3D model’s position in the game world itself.

For example, if a Terrorist from CS:GO is positioned at {10,0,20} in the map, and the tip of his nose has a vertex at {5, 100, 0}, then the actual position (or what’s called the “global position” or “world position”) is {10+5, 0+100, 20+0} or {15, 100, 20}. Note that these are {X, Y, Z} positions for 3D space.
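That same arithmetic, as a few lines of Python (purely illustrative names):

```python
# The model's origin point in the map, and one vertex local to that origin.
terrorist_origin = (10, 0, 20)   # {X, Y, Z} position in the world
nose_vertex_local = (5, 100, 0)  # position relative to the origin point

# World ("global") position = origin + local position, axis by axis.
nose_vertex_world = tuple(o + l for o, l in zip(terrorist_origin, nose_vertex_local))
print(nose_vertex_world)  # (15, 100, 20)
```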

Multiply that whole process by thousands, if not millions or (in modern days) billions of vertices.

Most game engines put mesh data in a hierarchical structure — for example, the world has a car, the car has a dashboard, the dashboard has a steering wheel. In order to find the world position of the steering wheel, you add the world’s position (oftentimes simply at {0, 0, 0}, usually implicitly so by the game engine), the car’s local position relative to the world, the dashboard’s local position relative to the car, and the steering wheel’s local position relative to the dashboard, arriving at the actual position of the steering wheel.
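Here’s a rough sketch of that walk up the hierarchy, ignoring rotation and scale for now (the objects and numbers are made up for illustration):

```python
# Each object stores its local position relative to its parent.
world          = {"parent": None,      "local": (0, 0, 0)}
car            = {"parent": world,     "local": (50, 0, 120)}
dashboard      = {"parent": car,       "local": (0, 1, 2)}
steering_wheel = {"parent": dashboard, "local": (-0.4, 0.1, 0.3)}

def world_position(obj):
    # Walk up the parent chain, summing local positions along each axis.
    x, y, z = obj["local"]
    parent = obj["parent"]
    while parent is not None:
        px, py, pz = parent["local"]
        x, y, z = x + px, y + py, z + pz
        parent = parent["parent"]
    return (x, y, z)

print(world_position(steering_wheel))  # roughly (49.6, 1.1, 122.3)
```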

The same applies to the object’s rotation and scale. All of this data (position, rotation and scale) is what’s called a “transform” and is represented in what’s called a “matrix” (plural “matrices”). This isn’t entirely important to the ELI5, but it’s worth mentioning. Also, you don’t “add” matrices together; you multiply them. This is where stuff gets a bit confusing, and I’ll leave some of that out. For the inquisitive, you can read more about that here.

It’s mostly important to remember that to ‘combine’ matrices, you must multiply them, not add them. From here on out, I’ll refer to all position/rotation/scaling as simply a “transform matrix” or “transform”, and refer to combining them as “multiplying [the] transforms” (since that is the proper terminology).
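For the curious, here’s a small sketch of what “multiplying transforms” looks like, using numpy and only translation for simplicity; a real engine also bakes rotation and scale into these 4x4 matrices:

```python
import numpy as np

def translation(x, y, z):
    """A 4x4 transform matrix that only translates (rotation/scale would live in the upper 3x3)."""
    m = np.identity(4)
    m[:3, 3] = [x, y, z]
    return m

car_transform       = translation(50, 0, 120)    # car relative to the world
dashboard_transform = translation(0, 1, 2)       # dashboard relative to the car
wheel_transform     = translation(-0.4, 0.1, 0.3)

# Combining transforms is matrix multiplication, not addition.
wheel_world = car_transform @ dashboard_transform @ wheel_transform

point = np.array([0, 0, 0, 1])   # the wheel's own origin, as a homogeneous point
print(wheel_world @ point)       # roughly [49.6, 1.1, 122.3, 1]
```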

Here’s something that generally surprises newcomers to 3D game development: cameras aren’t really that special. They simply apply a transform to everything in the scene (transforming the “world” before anything is drawn), but inverted: for example, if the camera wants to look to the right, it rotates the world to the left. That’s all there is to it.
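A tiny sketch of that idea, reusing the same kind of 4x4 matrix (the camera position is made up for illustration):

```python
import numpy as np

def translation(x, y, z):
    m = np.identity(4)
    m[:3, 3] = [x, y, z]   # translation lives in the last column of a 4x4 matrix
    return m

camera_transform = translation(0, 0, 10)        # the camera sits 10 units along +Z
view_matrix = np.linalg.inv(camera_transform)   # moving the camera = moving the world the other way

tree = np.array([0, 0, 0, 1])                   # a tree at the world origin (homogeneous point)
print(view_matrix @ tree)                       # roughly [0, 0, -10, 1]: the whole world slid back instead
```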

Next, you have texture data. Textures are really just blobs of data that can be used by shaders (explained below) in order to do things like add color to a mesh (3D model), specify where on a 3D model something glows/doesn’t glow, specify which parts of a model should be raised or sunken in, which parts to actually show or hide, which parts are “shiny” or dull, etc. I’d argue that most texture data is created in Photoshop or some other image editor, but the real takeaway here is that texture data is not only for coloring.
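As a rough sketch, you can think of a texture as a grid of values that a shader looks up; the normalized (u, v) lookup coordinates below are my own illustration, not something covered above:

```python
# A tiny 2x2 "texture": each entry is an (R, G, B) color here, but it could just
# as easily be glow strength, shininess, or height data.
texture = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

def sample(tex, u, v):
    """Look up the texel nearest to normalized coordinates (u, v) in [0, 1]."""
    height, width = len(tex), len(tex[0])
    x = min(int(u * width), width - 1)
    y = min(int(v * height), height - 1)
    return tex[y][x]

print(sample(texture, 0.9, 0.1))  # (0, 255, 0): the top-right texel
```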

Textures are typically generated or loaded at game start/level start and sent to be stored on the graphics card for later use. This is because textures can be massive and the communication between the CPU and GPU isn’t ‘free’ (as in, it takes time to transfer all of that data, which is part of why bigger games have such long loading times, even between levels). They are then re-used many times over, potentially by thousands of copies of a 3D model in the game, for example.

This is also why graphics card memory is so important: storing lots of highly detailed textures happens on the graphics card itself, not in your computer’s RAM.

Not really ELI5, but it’s worth mentioning that older or lower-budget machines might opt to ‘fall back’ to software rendering, where texture data is in fact stored in RAM and the rendering is all done on the CPU. I won’t be talking about software rendering as it’s a ‘gotcha’ that isn’t entirely relevant anymore, especially not to the OP, but I’m including it for the sake of providing search terms for the curious.

Finally, you have “shaders”, or “shader programs” as they’re more technically referred to. Shaders are a broadly recognized concept, even among non-developers, but exactly what they do is less widely understood.

There are many different types of shaders, and as things like DirectX and OpenGL (I’ll discuss those shortly) advance and new versions are created/features are added, more types of shaders emerge that do different things previously difficult or impossible to do with just the earlier, more primitive shader types.

Just a bit of a warning: some of the terminology in the shader world is a bit egregious — that’s to say, shaders are (arguably) simple concepts shrouded behind a whirlwind of complicated jargon. Don’t think too hard about them.

Firstly, shaders are programs that are compiled specifically for each graphics card. Shader code looks similar to the C programming language, but provides a very limited set of functionality specific to working with graphics data. Shaders are generally distributed in their source form (as opposed to compiling them before shipping them off to customers) since just about every graphics card will compile and use the shader differently (under the hood, something most developers never have to worry about).

Note that shaders are executed very differently from conventional programs that run on your CPU. A shader can be run once for many data points at once. This is called “single instruction, multiple data” execution, or SIMD, and is a huge part of why video games can run as quickly as they do on modern machines. The GPU is built in a way that allows these simplified and highly restricted programs to run once for many, many separate inputs at once.
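A loose CPU-side analogy (this is numpy on the CPU, not an actual GPU): describe one operation once and apply it to many data points in a single call, instead of looping over them one at a time:

```python
import numpy as np

vertices = np.array([
    [10.0,   0.0, 20.0],
    [15.0, 100.0, 20.0],
    [-3.0,   2.5,  7.0],
])  # one row per vertex; a real mesh has thousands to millions of these

offset = np.array([0.0, 1.0, 0.0])

# One instruction, many data points: every vertex is shifted up by 1 unit in a
# single vectorized operation, which is conceptually what the GPU does across its cores.
moved = vertices + offset

# The equivalent one-at-a-time loop a plain CPU program might write:
moved_slow = np.array([v + offset for v in vertices])

assert np.allclose(moved, moved_slow)
```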

Part of the larger graphics pipeline (which we mentioned before — the process of rendering a single frame) is the shader pipeline, which defines which types of shaders are executed when, and what they are supposed to do/what data they work with/what data they provide.

For example, here is OpenGL’s render pipeline overview, which shows all of the shaders that are run and in which order.

For the sake of this explanation, we’ll only discuss the two most common shader types: vertex and fragment shaders.

A vertex shader works with the 3D data passed in from the CPU (from the game engine). This is usually all of the world vertices of the data, though some game engines get super fancy with this for the sake of performance and apply all sorts of different transformations that one must take into account. AAA game engines (Unreal, Unity, Lumberyard, CryEngine, etc.) all generally do some wicked crazy batching and positional voodoo in order to squeeze performance out of complex games — we won’t be worrying about that for now as the math and concepts are the exact same.

Among the various things you can do with a vertex shader are modifying the position of the vertex (commonly based on some texture data, such as a displacement or height map) and converting the 3D position to a 2D screen position by multiplying the vertex by a series of transform matrices (often combined into a single “world-view-projection” matrix), though that’s not always required (sometimes the graphics card does this math for you).

If you didn’t catch it, that’s the succinct and direct answer to your question: you multiply a position in 3D space by a projection matrix (combined with the world and camera transforms) in order to find its 2D position on the screen. The math is too long to walk through in an ELI5, but it’s actually quite simple. This process is called “3D projection” and can be done very quickly. Hopefully the build-up to the answer now makes sense; the concept is harder to explain without the backing terminology.
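If you want to see roughly what that looks like in code, here’s a stripped-down numpy sketch: build a perspective projection matrix, multiply a (camera-space) position by it, then divide by w and map the result onto the screen. The field of view and screen size are made up for illustration:

```python
import math
import numpy as np

def perspective(fov_y_deg, aspect, near, far):
    """A standard OpenGL-style perspective projection matrix (column-vector convention)."""
    f = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    return np.array([
        [f / aspect, 0.0,  0.0,                         0.0],
        [0.0,        f,    0.0,                         0.0],
        [0.0,        0.0,  (far + near) / (near - far), 2 * far * near / (near - far)],
        [0.0,        0.0, -1.0,                         0.0],
    ])

projection = perspective(fov_y_deg=60, aspect=16 / 9, near=0.1, far=1000)

# A point already in camera space (the camera looks down -Z in this convention).
point = np.array([1.0, 2.0, -10.0, 1.0])

clip = projection @ point
ndc = clip[:3] / clip[3]                 # the "perspective divide"

# Map from -1..1 "normalized device coordinates" onto a 1920x1080 screen.
screen_x = (ndc[0] + 1) / 2 * 1920
screen_y = (1 - ndc[1]) / 2 * 1080       # flip Y: screens count rows top-down
print(screen_x, screen_y)
```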

If you’re interested in the math itself, this is a cool explanation of how it works. Each graphics card manufacturer implements their own version of this math that works best with the hardware they created, which also means they re-implement OpenGL and DirectX; that’s why you have very specific graphics card drivers for each model of card.

After the vertex shader runs, the 2D ‘pixel’ data is passed to a fragment shader. Why call them “fragments”? Because a fragment includes not only a 2D position, but also extra information passed from the vertex shader. This could include the original 3D position of the originating object at that fragment’s 2D position, or even some custom data calculated inside the vertex shader.

Further, a fragment could very well exist between pixels (e.g. {20.4, 14.9}, not rounded to the nearest whole pixel), which makes calling them ‘pixels’ even less fitting.

The output of a fragment shader is simply the color that should be drawn at the fragment’s position. For example, if you simply tell it “draw red” regardless of the input data, the model that is being drawn with that fragment shader will be a completely flat red color with no shading, shadows, lights, etc.
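Here’s that “just draw red” case, sketched as a plain Python function standing in for a fragment shader (the fragment’s fields are invented for illustration):

```python
def flat_red_fragment_shader(fragment):
    """Ignore every input and output the same color for every fragment."""
    return (1.0, 0.0, 0.0, 1.0)   # RGBA: fully red, fully opaque

# A fragment carries its 2D position plus whatever the vertex shader passed along.
fragment = {"screen_pos": (20.4, 14.9), "world_pos": (15, 100, 20)}
print(flat_red_fragment_shader(fragment))  # (1.0, 0.0, 0.0, 1.0): flat red, no lighting
```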

Non-ELI5 warning: the following paragraph isn’t exactly ELI5.

Texture data is very commonly applied in the fragment shader. For example, a common texture is the “emissive” data, which when rendered has a sort of “glow” effect where the color shows up even in darkly lit rooms. Here is an example of emission (specifically, the white of the Unity logo on the wall). Note that the surrounding glow and lighting is not emission; the emission is only the fact that the white shows up clearly in a dark room. Emission is a good example because it’s relatively easy to implement: simply calculate the color of the fragment as if it had no emission, then multiply or add (depending on the effect you want) the emission value right before you return it.
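A sketch of that emission step (the lighting factor and colors here are placeholders; a real shader would sample them from textures):

```python
def shade_fragment(base_color, light_amount, emission):
    """Compute the lit color, then add the emissive contribution right before returning."""
    lit = tuple(c * light_amount for c in base_color)             # darker in darker rooms
    return tuple(min(l + e, 1.0) for l, e in zip(lit, emission))  # emission added at the end

white_logo = (1.0, 1.0, 1.0)
emissive   = (0.8, 0.8, 0.8)   # the logo's emission value (would come from an emission texture)

print(shade_fragment(white_logo, light_amount=0.05, emission=(0.0, 0.0, 0.0)))  # nearly black without emission
print(shade_fragment(white_logo, light_amount=0.05, emission=emissive))         # still clearly visible in the dark
```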

Finally, now that the 2D position (or “screen position”/“screen coordinate”) and its corresponding color have been calculated, the graphics card simply puts it on the screen via its own display hardware, transmitting that pixel data to your monitor.

Qix