Exploring OpenGL, PhysX and PyTorch, all in C++

Mihai Anca
19 min read · May 19, 2023


Introduction

As a way to explore the capabilities of AI-assisted coding in C++, I decided to start a side project that involves creating a small physics-based environment for training a reinforcement learning (RL) agent. For rendering I chose OpenGL; for simulation, PhysX; and for training, Torch (PyTorch's C++ library). I picked these options out of curiosity and because of the resources available, which inspired me.

If you’re interested only in the code, take a look at the repo.

This blog post is more about the journey of getting all these libraries working together and less about how each of them works individually. However, I will link the resources I used to gain a better understanding of each at the end.

I will discuss the key components that make up the project, including the setup, integration, and features added. I hope you will be able to learn from my mistakes. Let’s get started with the end result:

Setting up the project

I am using CLion as my IDE, which means I had to set up my C++ project using CMake. This turned out to be more difficult than expected. Fortunately, Microsoft has developed a tool called vcpkg that is essentially the “pip” (package manager for Python) of C++. After installing it, following their setup guide, I just had to add -DCMAKE_TOOLCHAIN_FILE=/path/to/vcpkg/scripts/buildsystems/vcpkg.cmake as a CMake option and create the following vcpkg.json file in the root of the project:

{
  "name": "learn-opengl",
  "version": "0.1.0",
  "dependencies": [
    "glfw3",
    "glad",
    "glm",
    "stb",
    {
      "name": "imgui",
      "features": ["glfw-binding", "opengl3-binding"]
    }
  ]
}

This ensures that every time I build the project, all the specified dependencies are met.
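For completeness, the matching find_package calls in CMakeLists.txt look roughly like this. The target names below come from the usage notes vcpkg prints after installing each port, so treat this as a sketch rather than a verbatim copy of my file:

find_package(glfw3 CONFIG REQUIRED)
find_package(glad CONFIG REQUIRED)
find_package(glm CONFIG REQUIRED)
find_package(Stb REQUIRED)
find_package(imgui CONFIG REQUIRED)

target_link_libraries(${PROJECT_NAME} PRIVATE glfw glad::glad glm::glm imgui::imgui)
target_include_directories(${PROJECT_NAME} PRIVATE ${Stb_INCLUDE_DIR})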

As you can see, none of this mentions Torch or PhysX. I wanted to use the latest versions, so I had to add them to the project manually.

Torch was easy. I downloaded their C++ library, set Torch_DIR inside my CMakeLists.txt file to /path/to/libs/libtorch/share/cmake/Torch, and it worked!
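Concretely, that boiled down to a few lines like these in CMakeLists.txt (a sketch, with the path pointing to wherever libtorch was extracted):

set(Torch_DIR /path/to/libs/libtorch/share/cmake/Torch)
find_package(Torch REQUIRED)

target_link_libraries(${PROJECT_NAME} PRIVATE ${TORCH_LIBRARIES})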

On the other hand, setting up PhysX required quite a few steps. First, I had to install clang to ensure proper compilation. Following the instructions in the PhysX documentation, I navigated to a folder called something like linux-checked inside the compiler folder of their repository. After copying all of its contents to my libs folder, I created two new CMakeLists.txt files that tell the compiler about all the files available within this library. Here are the resulting files:

### /libs/PhysX
# Include directories
include_directories(
    ${CMAKE_CURRENT_SOURCE_DIR}/include
    ${CMAKE_CURRENT_SOURCE_DIR}/pvdruntime/include
    ${CMAKE_CURRENT_SOURCE_DIR}/source)

set(PhysXLIBS
    ${CMAKE_CURRENT_SOURCE_DIR}/bin/linux.clang/release/libPhysXExtensions_static_64.a
    ${CMAKE_CURRENT_SOURCE_DIR}/bin/linux.clang/release/libPhysX_static_64.a
    ${CMAKE_CURRENT_SOURCE_DIR}/bin/linux.clang/release/libPhysXPvdSDK_static_64.a
    ${CMAKE_CURRENT_SOURCE_DIR}/bin/linux.clang/release/libPhysXVehicle_static_64.a
    ${CMAKE_CURRENT_SOURCE_DIR}/bin/linux.clang/release/libPhysXCharacterKinematic_static_64.a
    ${CMAKE_CURRENT_SOURCE_DIR}/bin/linux.clang/release/libPhysXCooking_static_64.a
    ${CMAKE_CURRENT_SOURCE_DIR}/bin/linux.clang/release/libPhysXCommon_static_64.a
    ${CMAKE_CURRENT_SOURCE_DIR}/bin/linux.clang/release/libPhysXFoundation_static_64.a
)

add_library(PhysX INTERFACE)
target_link_libraries(PhysX INTERFACE ${PhysXLIBS})
target_include_directories(PhysX INTERFACE include)

### /libs
add_subdirectory(PhysX)

add_library(third_party INTERFACE)

target_link_libraries(third_party INTERFACE PhysX)
target_include_directories(third_party INTERFACE PhysX)

The remaining details should be straightforward and can be found in the main CMakeLists.txt inside the repository.

Humble beginnings

Most of the code in this section comes from Victor Gordan and his amazing series on OpenGL. I started by following his code and setting up a simple rendering loop. At the same time, I’ve initialised a sphere and a surface (table) in PhysX.

The vertices for the sphere were generated using the following function:

// Function to generate sphere vertices and indices using the specified parameters
void sphereGeneration(unsigned int indices[], float vertices[], int numSlices, int numStacks, float radius, const float *color) {
    int vertexIndex = 0;
    int indexIndex = 0;

    for (int stack = 0; stack <= numStacks; ++stack) {
        // Phi for the current stack: the angle between the positive z-axis
        // and the vector from the origin to the current point on the sphere
        float phi = stack * PxPi / numStacks;
        for (int slice = 0; slice <= numSlices; ++slice) {
            // Theta for the current slice: the angle between the positive x-axis and the projection
            // of the vector from the origin to the current point onto the xy-plane
            float theta = slice * 2 * PxPi / numSlices;
            // Calculate x, y, and z for the current stack and slice
            vertices[vertexIndex++] = radius * sin(phi) * cos(theta);
            vertices[vertexIndex++] = radius * sin(phi) * sin(theta);
            vertices[vertexIndex++] = radius * cos(phi);
            // Set the color for the current vertex
            vertices[vertexIndex++] = color[0];
            vertices[vertexIndex++] = color[1];
            vertices[vertexIndex++] = color[2];

            if (stack != numStacks && slice != numSlices) {
                // Two triangles per quad, indexed into the (numSlices + 1)-wide vertex grid
                int nextStack = stack + 1;
                int nextSlice = slice + 1;
                indices[indexIndex++] = stack * (numSlices + 1) + slice;
                indices[indexIndex++] = nextStack * (numSlices + 1) + slice;
                indices[indexIndex++] = nextStack * (numSlices + 1) + nextSlice;
                indices[indexIndex++] = stack * (numSlices + 1) + slice;
                indices[indexIndex++] = nextStack * (numSlices + 1) + nextSlice;
                indices[indexIndex++] = stack * (numSlices + 1) + nextSlice;
            }
        }
    }
}

In the rendering loop, I step the simulator, get the ball position and use it to update the rendering:

scene->simulate(1.0f / 60.0f);
scene->fetchResults(true);

PxVec3 ballPosition = ball->getGlobalPose().p;
PxQuat ballRotation = ball->getGlobalPose().q;
...
// Render the ball
glBindVertexArray(ballVAO);
glm::mat4 model = glm::mat4(1.0f);
// Translate the model to the ball's position
model = glm::translate(model, glm::vec3(ballPosition.x, ballPosition.y, ballPosition.z));
// Create a rotation matrix from the ball's quaternion rotation
glm::mat4 ballRotationMatrix = glm::mat4_cast(glm::quat(ballRotation.w, ballRotation.x, ballRotation.y, ballRotation.z));
// Apply the rotation matrix to the model
model = model * ballRotationMatrix;
// Scale the model to the appropriate size
model = glm::scale(model, glm::vec3(0.25f, 0.25f, 0.25f));
// Set the "model" uniform variable in the shader program to the model matrix
glUniformMatrix4fv(glGetUniformLocation(shaderProgram, "model"), 1, GL_FALSE, glm::value_ptr(model));
// Draw the ball using the appropriate vertex array and number of indices
glDrawElements(GL_TRIANGLES, numSlices * numStacks * 6, GL_UNSIGNED_INT, nullptr);

The result:

As I progressed through the tutorials for improving my rendering, I converted everything to classes to make them easier to manage. For example, after adding a Shader class that reads .vert and .frag files, I also created a Camera class.

class Camera {
public:
    // Stores the main vectors of the camera
    glm::vec3 Position;
    glm::vec3 Orientation = glm::vec3(0.0f, 0.0f, -1.0f);
    glm::vec3 Up = glm::vec3(0.0f, 1.0f, 0.0f);

    // Prevents the camera from jumping around when first clicking left click
    bool firstClick = true;

    // Stores the width and height of the window
    int width;
    int height;

    // Adjusts the speed of the camera and its sensitivity when looking around
    float speed = 0.1f;
    float sensitivity = 100.0f;

    // Camera constructor to set up initial values
    Camera(int width, int height, glm::vec3 position);

    // Updates and exports the camera matrix to the vertex shader
    void Matrix(float FOVdeg, float nearPlane, float farPlane, Shader &shader, const char *uniform);

    // Handles camera inputs
    void Inputs(GLFWwindow *window);
};

At this point, I not only had controls for moving the camera but also partially implemented shadows:

I wanted to have a more appealing sphere as the main actor, so I started to play with the sphere vertices code. By having an extra check, I managed to add a stripe:

// Add color to vertices: black for the stripe slices, the base color otherwise
if (slice == 0 || slice == numSlices / 2) {
    vertices.push_back(0.0f);
    vertices.push_back(0.0f);
    vertices.push_back(0.0f);
} else {
    vertices.push_back(color[0]);
    vertices.push_back(color[1]);
    vertices.push_back(color[2]);
}

To improve the visual aesthetics, I considered adding a skybox. Luckily, Victor’s tutorial provided all the necessary information. I found a texture on HDRI Haven (now Poly Haven) and converted it to a cube map using this tool.

After some more work, I managed to even add reflections on the ball and arrived at this:

The main change was made in default.frag. The new code calculates the reflection of the skybox on the surface of the ball, considering the current position of the camera, the position of the current pixel, and the surface normal. By using shouldReflect, I was able to use the same fragment shader for both the table and the ball, and choose which one is reflective.

// reflection
vec4 reflectionColor = vec4(1.0f, 1.0f, 1.0f, 1.0f);
if (shouldReflect) {
    vec3 I = normalize(crntPos - camPos);
    vec3 R = reflect(I, normal);
    reflectionColor = vec4(texture(skybox, R).rgb, 1.0f);
}

Graduating

At this point, I had finished integrating everything I needed from Victor’s tutorials. My contribution so far was the reflective sphere and the choice of texture. I wanted to test my new knowledge with a challenge. The first thing I did was add some cooler stripes to the ball!

To enable my Reinforcement Learning (RL) agent to control the ball, I needed the camera to be positioned as if in a first-person game. To achieve this, I created a new camera class called SpringArmCamera.

Making the camera follow the ball when it moves straight is relatively straightforward. I simply keep track of the current position and update it based on the ball’s delta position:

Position = Position + objectPos - _lastObjPos;

Dealing with rotations turned out to be trickier than expected. While figuring out how to apply the rotation, I took a break to add some debugging tools. During this break, I discovered Dear ImGui, a fantastic C++ library for adding menus to the screen with minimal code changes. This allowed me to swap between the old camera and the new one with the click of a button while the application was running:

Back to fixing the control issue. The trick to getting the camera to always stay behind the ball is to calculate a delta angle around the GLOBAL Y axis. This is then used to rotate the camera w.r.t. the OBJECT’s position. The following function demonstrates how to achieve this:

// Calculates and exports the camera's view and projection matrices to the vertex shader,
// such that the camera follows the object's position and orientation while maintaining
// the specified field of view and viewing frustum
void SpringArmCamera::Matrix(glm::vec3 objectPos, float FOVdeg, float nearPlane, float farPlane, Shader &shader, const char *uniform) {
    // Initialize matrices since otherwise they will be the null matrix
    glm::mat4 view = glm::mat4(1.0f);
    glm::mat4 projection = glm::mat4(1.0f);

    // Update position based on the object's movement
    glm::vec3 deltaPos = objectPos - _lastObjPos;
    Position += deltaPos;
    _lastObjPos = objectPos;

    // Rotate Position around the global Y axis, pivoting on the object's position
    float x = Position.x - objectPos.x;
    float z = Position.z - objectPos.z;
    float deltaAngle = angle - _lastAngle;
    Position.x = objectPos.x + x * cos(deltaAngle) - z * sin(deltaAngle);
    Position.z = objectPos.z + x * sin(deltaAngle) + z * cos(deltaAngle);
    _lastAngle = angle;

    // Makes the camera look at the object from its current position
    view = glm::lookAt(Position, objectPos, Up);

    // Adds perspective to the scene
    projection = glm::perspective(glm::radians(FOVdeg), (float) width / height, nearPlane, farPlane);

    // Exports the camera matrix to the vertex shader
    glUniformMatrix4fv(glGetUniformLocation(shader.ID, uniform), 1, GL_FALSE, glm::value_ptr(projection * view));
}

Loading a complex model

Having worked with Blender before, I thought it would be great to design my scene with it. I created a simple scene and saved it as .obj and .mtl files. I imported the scene using the assimp library, which processes meshes one by one. To mirror the same hierarchy in my code, I created two classes: Mesh and Model. The Mesh class is a collection of vertices and indices. Initially, however, the shadows did not look correct; I quickly realized that I needed to triangulate all faces before exporting.
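Stripped down, the Mesh class looks roughly like this (a sketch; the Normal field is an assumption based on the lighting setup, and the full version lives in the repo):

struct Vertex {
    glm::vec3 Position;
    glm::vec3 Normal;
    glm::vec3 Color;
};

class Mesh {
public:
    std::vector<Vertex> vertices;
    std::vector<unsigned int> indices;
    std::string name;

    Mesh(std::vector<Vertex> vertices, std::vector<unsigned int> indices, std::string name);

    // Binds the VAO and issues the draw call (full version shown later in this post)
    void Draw(unsigned int shaderProgram, glm::vec3 ballPosition, glm::vec3 cameraPosition);

private:
    // OpenGL buffer handles set up in the constructor
    unsigned int VAO, VBO, EBO;
};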

Moreover, I wanted to choose the color of the objects in my scene directly in Blender. To achieve this, I extracted the material and ticked the export-materials option when saving the .obj files. The following code takes care of that:

aiMaterial *material = nullptr;

if (mesh->mMaterialIndex >= 0) {
    material = scene->mMaterials[mesh->mMaterialIndex];
}

for (unsigned int i = 0; i < mesh->mNumVertices; i++) {
    ...
    if (material != nullptr) {
        aiColor4D diffuse;
        if (AI_SUCCESS == aiGetMaterialColor(material, AI_MATKEY_COLOR_DIFFUSE, &diffuse))
            vertex.Color = glm::vec3(diffuse.r, diffuse.g, diffuse.b);
    } else {
        vertex.Color = glm::vec3(0.1373f, 0.2235f, 0.3647f); // dark blue fallback
    }
}

One trick I found is that I can add the following line to my CMakeLists.txt file to automatically copy resource files (such as .obj and .mtl) to the compiled program folder. This lets me keep the files in Git and update the program whenever I make a change, without copying the files manually each time. (Note that file(COPY) runs at configure time, so the copy happens when CMake reconfigures the project rather than on every build.)

file(COPY ${CMAKE_CURRENT_SOURCE_DIR}/resources DESTINATION ${CMAKE_CURRENT_BINARY_DIR})

Creating the PhysX actor was a bit tricky. I struggled to find the appropriate documentation, and without the help of Copilot, I probably would not have been able to do it. It turns out that you need to create a PxTriangleMesh, then a rigid static actor, and generate a PxTriangleMeshGeometry from the PxTriangleMesh. Finally, attach the geometry to the actor and add it to the scene. Here is what the code looks like:

physx::PxTriangleMesh *worldSceneTriangleMesh = createTriangleMesh(physics, cooking, mesh.vertices, mesh.indices);
// Create a rigid static actor
physx::PxTransform transform(physx::PxVec3(0.0f));
physx::PxRigidStatic *worldActor = physics->createRigidStatic(transform);
// Create a triangle mesh geometry
physx::PxTriangleMeshGeometry geometry(worldSceneTriangleMesh);
// Create and attach a shape
physx::PxRigidActorExt::createExclusiveShape(*worldActor, geometry, *material);
// Add the actor to the scene
scene->addActor(*worldActor);

The createTriangleMesh function is a helper function that I created to convert imported vertices and indices into a mesh using the PxTriangleMeshDesc. Copilot was incredibly helpful in this process.

physx::PxTriangleMesh *Model::createTriangleMesh(physx::PxPhysics *physics, physx::PxCooking *cooking,
                                                 const std::vector<Vertex> &vertices,
                                                 const std::vector<unsigned int> &indices) {
    // Convert vertices to PxVec3
    std::vector<physx::PxVec3> pxVertices(vertices.size());
    for (size_t i = 0; i < vertices.size(); ++i)
        pxVertices[i] = physx::PxVec3(vertices[i].Position.x, vertices[i].Position.y, vertices[i].Position.z);

    // Convert indices to PxU32
    std::vector<physx::PxU32> pxIndices(indices.begin(), indices.end());

    // Set up the triangle mesh descriptor
    physx::PxTriangleMeshDesc meshDesc;
    meshDesc.points.count = (physx::PxU32) pxVertices.size();
    meshDesc.points.stride = sizeof(physx::PxVec3);
    meshDesc.points.data = pxVertices.data();
    meshDesc.triangles.count = (physx::PxU32) (pxIndices.size() / 3);
    meshDesc.triangles.stride = 3 * sizeof(physx::PxU32);
    meshDesc.triangles.data = pxIndices.data();

    // Use PxCooking to create the triangle mesh
    physx::PxDefaultMemoryOutputStream writeBuffer;
    bool status = cooking->cookTriangleMesh(meshDesc, writeBuffer);
    if (!status)
        return nullptr;
    physx::PxDefaultMemoryInputData readBuffer(writeBuffer.getData(), writeBuffer.getSize());
    return physics->createTriangleMesh(readBuffer);
}

Generating the triangle mesh on a per-model basis proved to be crucial. Initially, I placed all vertices and indices together in two long vectors, but the collisions were all over the place. As soon as I created a separate actor for each mesh, however, it worked like a charm!

Bonus trick

One thing that got in the way while debugging was walls being rendered between the camera and the ball. There were three main steps to get this to work:

  1. Separate each object into different meshes in Blender.
  2. Give the ground plane a convenient name such as Ground.
  3. When drawing the mesh with OpenGL, launch a ray starting from the camera origin directed towards the ball. If it intersects anything in between, skip rendering it, unless it is the ground plane.

The draw function inside the Mesh class:

void Mesh::Draw(unsigned int shaderProgram, glm::vec3 ballPosition, glm::vec3 cameraPosition) {
    // Calculate the ray direction from the camera to the ball
    glm::vec3 rayOrigin = cameraPosition;
    glm::vec3 rayDirection = glm::normalize(ballPosition - cameraPosition);

    // Check for intersection between the ray and each triangle in the mesh (skipped for the ground plane)
    if (name.find("Ground") == std::string::npos) {
        bool intersects = false;
        // Loop through each triangle in the mesh
        for (size_t i = 0; i < indices.size(); i += 3) {
            // Get the vertices of the triangle
            glm::vec3 vertex0 = vertices[indices[i]].Position;
            glm::vec3 vertex1 = vertices[indices[i + 1]].Position;
            glm::vec3 vertex2 = vertices[indices[i + 2]].Position;
            // Check for intersection between the ray and the triangle
            float distance;
            if (rayIntersectsTriangle(rayOrigin, rayDirection, vertex0, vertex1, vertex2, distance)) {
                // Only count intersections closer than the minimum distance
                if (distance <= minDistance) {
                    intersects = true;
                    break;
                }
            }
        }
        // If the mesh blocks the view of the ball, return without drawing it
        if (intersects) {
            return;
        }
    }
    // Draw the mesh
    glBindVertexArray(VAO);
    glm::mat4 model = glm::mat4(1.0f);
    glUniformMatrix4fv(glGetUniformLocation(shaderProgram, "model"), 1, GL_FALSE, glm::value_ptr(model));
    glDrawElements(GL_TRIANGLES, (GLsizei) indices.size(), GL_UNSIGNED_INT, nullptr);
}

I am particularly proud of this accomplishment because I managed to complete it without major help from AI. The rayIntersectsTriangle function uses the Möller–Trumbore intersection algorithm to determine intersections between the ray and each triangle. For more information on the algorithm, see the Wikipedia link in the resources below.
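For reference, a typical implementation matching the signature used above looks like this (a sketch of the standard algorithm, not necessarily line-for-line what is in the repo):

bool rayIntersectsTriangle(const glm::vec3 &rayOrigin, const glm::vec3 &rayDirection,
                           const glm::vec3 &v0, const glm::vec3 &v1, const glm::vec3 &v2,
                           float &distance) {
    const float EPSILON = 1e-7f;
    glm::vec3 edge1 = v1 - v0;
    glm::vec3 edge2 = v2 - v0;
    // If the ray is parallel to the triangle plane, the determinant is ~0
    glm::vec3 h = glm::cross(rayDirection, edge2);
    float det = glm::dot(edge1, h);
    if (det > -EPSILON && det < EPSILON)
        return false;
    float invDet = 1.0f / det;
    glm::vec3 s = rayOrigin - v0;
    // First barycentric coordinate
    float u = invDet * glm::dot(s, h);
    if (u < 0.0f || u > 1.0f)
        return false;
    glm::vec3 q = glm::cross(s, edge1);
    // Second barycentric coordinate
    float v = invDet * glm::dot(rayDirection, q);
    if (v < 0.0f || u + v > 1.0f)
        return false;
    // Distance along the ray to the intersection point
    distance = invDet * glm::dot(edge2, q);
    return distance > EPSILON;
}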

Agent training

Before explaining the environment, I want to mention how I pass information between the main function and the classes. I decided to use structures that contain all the necessary inputs. Although there may be better methods to do this, since this code will not be used in a very large project, I believe this method suits my needs.

For instance, the AgentConfig is defined as follows:

typedef struct {
    int num_epochs;
    int horizon_length;
    int mini_batch_size;
    int mini_epochs;
    float learning_rate;
    float clip_param;
    float value_loss_coef;
    float bound_loss_coef;
    float gamma;
    float tau;
    float reward_multiplier;
} AgentConfig;

However, I recommend using external config files (YAML, JSON, or even plain text) in your projects. This way, you do not need to recompile every time you change a value.

The most popular way to create environments for training RL agents is to use the OpenAI Gym wrapper. This requires the environment to provide several functions, such as:

  • step: takes in an action and returns the observation, reward, and a done flag
  • reset: returns the observation
  • render
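For this project, those signatures ended up looking roughly like this (a sketch; StepResult is a hypothetical name for the bundle of tensors returned by step):

struct StepResult {
    torch::Tensor observation;
    torch::Tensor reward;
    torch::Tensor done;
};

class Environment {
public:
    // Applies the action, steps the physics, and returns the new observation, reward, and done flag
    StepResult Step(const torch::Tensor &action);
    // Resets the ball and goal, returning the initial observation
    torch::Tensor Reset();
    // Draws the current state with OpenGL (skipped in headless mode)
    void Render();
};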

Initially, everything was placed inside the step function. Afterward, I implemented the headless mode, which allows running just the physics part of the environment, resulting in faster execution. Additionally, I added a feature where pressing V pauses rendering, allowing me to switch between the two modes on the fly. I don’t think there’s much value in showing the code here because I only refactored the code into smaller functions that are called when needed. The relevant code can be found in Environment.cpp for those interested.

The observations consist of the ball position (x, y, z) and the angle of rotation around the Z axis in radians. The angle is wrapped to [-π, π]. The actions are continuous in [-1, 1]. They control the angle around the Z axis and the force applied at that angle to push the ball forward.
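The wrapping itself can be done with the fmod trick from the Stack Overflow answer linked in the resources; something like this (wrapAngle is a hypothetical helper name):

#include <cmath>

// Wraps an angle in radians to the interval [-pi, pi]
float wrapAngle(float angle) {
    angle = std::fmod(angle + PxPi, 2.0f * PxPi);
    if (angle < 0.0f)
        angle += 2.0f * PxPi;
    return angle - PxPi;
}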

Reward function

The initial reward I went for was a simple mean squared error between the actor (ball) and the goal position. This provides a dense signal on top of which I’ve added a sparse bonus reward for getting within 0.1 of the goal position.

double Environment::ComputeReward() {
    // Compute the distance as the mean squared error between goal and ball position
    double dist = 0.0;
    for (int i = 0; i < 3; i++) {
        dist += pow(goalPosition[i] - ballPosition[i], 2);
    }
    dist = dist / 3.0;

    double reward = -dist;
    // If within threshold of the target, add a bonus reward
    if (dist < threshold) {
        reward += bonusAchievedReward;
    }

    return reward;
}

After a few more iterations, I decided to switch to the Euclidean distance, i.e., the square root of the summed squared errors. This distance can be read as the length of a straight line between the ball and the goal, so I can determine its maximum value by taking the ball to the farthest corner and use that to scale the reward back to [-1, 0]. Having a normalized reward makes it easier to find the right hyperparameters.
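The updated reward then looks something like this (a sketch; maxDistance is an assumed member holding the distance to the farthest corner):

double Environment::ComputeReward() {
    // Euclidean (straight-line) distance between goal and ball
    double dist = 0.0;
    for (int i = 0; i < 3; i++) {
        dist += pow(goalPosition[i] - ballPosition[i], 2);
    }
    dist = sqrt(dist);

    // Scale to [-1, 0] using the maximum possible distance
    double reward = -dist / maxDistance;
    // Sparse bonus when close enough to the goal
    if (dist < threshold) {
        reward += bonusAchievedReward;
    }

    return reward;
}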

Multiple environments

Even before starting the training loop, I knew that I wanted to have multiple actors running in parallel to speed up the training process. Adding multiple balls to the scene was easy enough. However, I had to refactor most of the environment code because I needed to use Tensors directly. Tensors are just like arrays/vectors but can run on GPUs. This was crucial because I wanted to pass all actions at the same time and return all new observations as a batch. Alternatively, this could be achieved by using for loops, but at a high cost in speed.

TimeMe

At this point, the code was running nicely with a small number of environments. I wanted to make sure everything was optimized so I added the following class (trick learned from The Cherno) to help me time everything:

#include <chrono>
#include <iostream>
#include <string>
#include <utility>

class TimeMe {
public:
    // Starts the clock and remembers the label to print
    explicit TimeMe(std::string name) : name(std::move(name)) {
        start = std::chrono::high_resolution_clock::now();
    }

    // Prints the elapsed time when the object goes out of scope
    ~TimeMe() {
        auto end = std::chrono::high_resolution_clock::now();
        auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
        std::cout << name + " took " << elapsed.count() << " milliseconds\n";
    }

private:
    std::string name;
    std::chrono::time_point<std::chrono::high_resolution_clock> start;
};

All you have to do is wrap the piece of code you want to time in curly brackets and create a TimeMe object:

{
    TimeMe t("function name");
    functionToTime();
}

The destructor is called automatically when exiting the local scope created by the {} and you get something like this: “function name took 1000 milliseconds”.

Collision

It was at this point that I noticed that whenever I added more than twelve environments, the balls started colliding with each other. ChatGPT was not helpful in resolving the issue. After a thorough search and multiple attempts at fixing the problem by changing the PxFilterData of each actor, I discovered that the default filter shader set within the scene descriptor was not actually taking this data into account. To resolve the issue, I wrote a custom filter shader by extending the default function with an extra kill condition (the last if-statement below):

PxFilterFlags MyFilterShader(
    PxFilterObjectAttributes attributes0, PxFilterData filterData0,
    PxFilterObjectAttributes attributes1, PxFilterData filterData1,
    PxPairFlags &pairFlags, const void *constantBlock, PxU32 constantBlockSize)
{
    // let triggers through
    if (PxFilterObjectIsTrigger(attributes0) || PxFilterObjectIsTrigger(attributes1)) {
        pairFlags = PxPairFlag::eTRIGGER_DEFAULT;
        return PxFilterFlag::eDEFAULT;
    }

    // generate contacts for all that were not filtered above
    pairFlags = PxPairFlag::eCONTACT_DEFAULT;

    // trigger the contact callback for pairs (A,B) where
    // the filtermask of A contains the ID of B and vice versa
    if ((filterData0.word0 & filterData1.word1) && (filterData1.word0 & filterData0.word1))
        pairFlags |= PxPairFlag::eNOTIFY_TOUCH_FOUND;

    // kill the pair (A,B) when the collision group of A is included in the filtermask of B,
    // and likewise when filterData0 is identical to filterData1 (balls from different environments)
    if (filterData0.word0 & filterData1.word1 || (filterData0.word1 == filterData1.word1 && filterData0.word0 == filterData1.word0)) {
        return PxFilterFlag::eKILL;
    }

    return PxFilterFlag::eDEFAULT;
}

PPO

I would recommend heading over to SpinningUp by OpenAI for a detailed explanation of how Proximal Policy Optimisation (PPO) works. As a starting point, I looked at the code in the RL Games repository.

Before moving on, I want to mention how useful I found it to create TensorOptions as variables (probably should have made them constant now that I think of it) and then use them with ease every time I create a Tensor:

TensorOptions floatOptions = torch::TensorOptions().dtype(torch::kFloat32).device(device).layout(torch::kStrided).requires_grad(false);
Tensor test = torch::zeros(10, floatOptions);

The training loop (one epoch) looks something like this:

  1. Reset memory for storing transitions
  2. Play H steps in the environment while collecting data
  3. Prepare the batch by computing advantages and returns using GAE (see the sketch after this list)
  4. Compute and propagate loss for each mini batch
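For step 3, Generalized Advantage Estimation can be sketched as follows in libtorch. Here rewards, values and dones are assumed to be the [horizon_length, num_envs] buffers collected during the rollout, next_value is the bootstrap value for the state after the last step, and gamma and tau come from AgentConfig:

// Iterate backwards over the horizon, accumulating the GAE term
Tensor advantages = torch::zeros_like(rewards);
Tensor last_gae = torch::zeros({num_envs}, floatOptions);
for (int t = horizon_length - 1; t >= 0; t--) {
    Tensor next = (t == horizon_length - 1) ? next_value : values[t + 1];
    Tensor not_done = 1.0f - dones[t];
    Tensor delta = rewards[t] + gamma * next * not_done - values[t];
    last_gae = delta + gamma * tau * not_done * last_gae;
    advantages[t] = last_gae;
}
// Returns are the advantages plus the value baseline
Tensor returns = advantages + values;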

Warning! One thing that confused me initially was the order of operations for the ratio: Tensor ratio = (old_log_prob - new_log_prob).exp();. If using negative log probabilities, make sure the ratio is computed as old - new.
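To make that concrete, the ratio feeds into the standard clipped surrogate objective roughly like this (a sketch; old_log_prob and new_log_prob are negative log probabilities, and clip_param comes from AgentConfig):

// ratio = pi_new(a|s) / pi_old(a|s), computed from negative log probabilities
Tensor ratio = (old_log_prob - new_log_prob).exp();
Tensor surr1 = ratio * advantages;
Tensor surr2 = torch::clamp(ratio, 1.0f - clip_param, 1.0f + clip_param) * advantages;
// Maximizing the clipped surrogate is minimizing its negation
Tensor actor_loss = torch::max(-surr1, -surr2).mean();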

All of this can be implemented in about 290 lines of code. Personally, when it comes to reinforcement learning, I find it easier to read code than equations, so I won’t bore you with the math; there are blog posts out there that explain it much better than I ever could.

Instead, I will go over the interesting things I did on top to make it work in the end.

Logging

Let’s start with logging! To determine whether your agent is learning without visually inspecting the training process, you need graphs. In Python, I typically use TensorBoard or the W&B platform. Fortunately, someone has already written a C++ library for TensorBoard that can be used.

Unfortunately, there is no documentation on how to use it, but reading the header file provided everything needed to get it to work.

TensorBoardLoggerOptions options{1000000, 5, false}; // max_queue_size, flush_period_s, resume
TensorBoardLogger logger("path", options);
...
// logging loss
logger.add_scalar("Loss/actor_loss", _steps, actor_loss.item<float>());

The things I logged include actor, critic and bound loss, reward obtained, epoch number and learning rate. All this can be found in Agent.cpp in the repo.

Normalisation

It cannot be emphasized enough how important it is to normalize everything. Neural networks learn much better when the input and output values are within the range of [-1, 1]. This includes the value predicted by the critic. In this section, I will go through all the things I normalized and how I achieved that.

The observation is easiest to normalize because we know the dimensions of the workspace. I simply divided the x, y, and z dimensions by the appropriate numbers. The angle is already wrapped around [-π, π], so we just need to divide by π.

The actions should not be clamped or normalized anywhere else but inside the environment class. The calculation of log probability must be made using the output of the network, which will sometimes exceed the [-1, 1] range.

Dealing with returns and values requires some additional setup. Since the maximum value of a state depends on reward magnitude and hyperparameters such as horizon length, we cannot assume we know the range. Therefore, we need a running mean that updates continuously. I have chosen to implement this as a network, which allows me to turn training on and off when updating the mean and variance. For example, when preparing the batch:

// flatten returns, pass through value_mean_std
SetTrain();
returns = returns.flatten(0, 1);
returns = value_mean_std->forward(returns);
SetEval();

The output of the network is assumed to be normalized. Therefore, the values gathered while playing an episode are unnormalized before calculating the returns. As shown above, the same forward pass that normalizes the returns also updates the mean and variance of the running-mean object.
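A sketch of what such a running mean-std module can look like, assuming the standard parallel mean/variance update (the version in the repo may differ in details, and the unnorm flag is just one way to expose de-normalization):

struct RunningMeanStdImpl : torch::nn::Module {
    torch::Tensor mean, var, count;

    explicit RunningMeanStdImpl(int size) {
        mean = register_buffer("mean", torch::zeros({size}));
        var = register_buffer("var", torch::ones({size}));
        count = register_buffer("count", torch::ones({1}));
    }

    torch::Tensor forward(const torch::Tensor &x, bool unnorm = false) {
        if (is_training()) {
            // Fold the batch statistics into the running mean and variance
            auto batch_mean = x.mean(0);
            auto batch_var = x.var(0, /*unbiased=*/false);
            auto batch_count = x.size(0);
            auto delta = batch_mean - mean;
            auto total = count + batch_count;
            auto new_mean = mean + delta * batch_count / total;
            auto new_var = (var * count + batch_var * batch_count
                            + delta.pow(2) * count * batch_count / total) / total;
            mean.copy_(new_mean);
            var.copy_(new_var);
            count.copy_(total);
        }
        if (unnorm)
            // Map normalized values back to the original scale
            return x * (var + 1e-5).sqrt() + mean;
        return (x - mean) / (var + 1e-5).sqrt();
    }
};
TORCH_MODULE(RunningMeanStd);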

The advantages are normalised only after the returns are calculated:

memory.advantages = (memory.advantages - memory.advantages.mean()) / (memory.advantages.std() + 1e-8);

Adaptive learning rate

The Adaptive KL Penalty Coefficient was used to update the learning rate, based on the KL divergence computed between the old and new policies. This approach prevents the update from straying too far from the old policy, while still allowing for change. Below is an example of what the update function and KL calculation look like:

double Agent::update_lr(const double &kl) {
    if (kl > (2.0f * kl_threshold)) {
        learning_rate = std::max(min_lr, learning_rate / learning_rate_decay);
    } else if (kl < (0.5f * kl_threshold)) {
        learning_rate = std::min(max_lr, learning_rate * learning_rate_decay);
    }

    // Update lr in the optimizer
    for (auto &param_group : optimizer.param_groups()) {
        param_group.options().set_lr(learning_rate);
    }

    return learning_rate;
}

Tensor Agent::policy_kl(const Tensor &mu, const Tensor &sigma, const Tensor &mu_old, const Tensor &sigma_old) {
    // KL divergence between two diagonal Gaussian policies
    auto sigma_ratio = (sigma_old / sigma).log();
    auto mu_diff = (sigma.pow(2) + (mu_old - mu).pow(2)) / (2 * sigma_old.pow(2));
    auto kl = (sigma_ratio + mu_diff - 0.5).sum(1);
    return kl.mean();
}

Small improvement for random batch selection

I have one final improvement regarding the batch sampling code. Initially, I used a for loop to randomly generate indices. However, I found that taking a random permutation of an array from 0 to the number of samples and then iterating through it is much faster.

Tensor batch_idx = torch::randperm(horizon_length * num_envs, longOptions);

// Update the agent using PPO
for (int i = 0; i < num_steps / mini_batch_size; i++) {
    // Sample a mini-batch of transitions and convert the required samples to tensors
    Tensor obs = memory.obs.index({batch_idx.slice(0, last_idx, last_idx + mini_batch_size)});
    ...
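Here, last_idx is advanced by mini_batch_size at the end of each iteration, so every transition is visited exactly once per pass over the data.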

Final notes

One selling point of PhysX is its ability to run on a GPU for even greater speed. I attempted to follow the documentation to do so, but couldn’t get it to recognize my device. If you know how to solve this problem, please see my posts here and here.

I understand that this post only briefly touches on the code I’ve written and doesn’t provide much detail. My goal is to let you know that this code exists in case you encounter any issues while trying to create something similar. If you have any questions, please don’t hesitate to contact me at mihai.anca@bristol.ac.uk. I would be happy to provide more detailed explanations of any part of the code or assist you with your own project!

Resources

https://learnopengl.com

http://www.opengl-tutorial.org

https://www.youtube.com/@VictorGordan

https://en.wikipedia.org/wiki/Möller–Trumbore_intersection_algorithm

https://github.com/Denys88/rl_games

https://spinningup.openai.com/en/latest/

https://github.com/RustingSword/tensorboard_logger

https://polyhaven.com

https://matheowis.github.io/HDRI-to-CubeMap/

https://registry.khronos.org

https://nvidia-omniverse.github.io/PhysX/physx/5.1.3

https://stackoverflow.com/questions/4633177/c-how-to-wrap-a-float-to-the-interval-pi-pi

https://academo.org/demos/rotation-about-point/

https://stackoverflow.com/questions/20923232/how-to-rotate-a-vector-by-a-given-direction
