Taichi Cookbook 001: Five practical tips on how to master Taichi, a handy parallel programming language embedded in Python

Taichi Lang
Parallel-Programming-in-Python
9 min read · Aug 23, 2022

Yuanming Hu

Hey guys! Welcome to my first Taichi cooking session!👋

From time to time, I hear our users ask questions like “how can I make my code cleaner and more straightforward” or “how can I further optimize the performance of my Taichi programs”. So, I decided to share some very practical tips I myself often use when coding with Taichi, as well as a new feature ti.dataclass. Hopefully, you can make good use of them the next time you ti.init.

We have released Taichi v1.0.4. Upgrade to the latest version and view classy demos:
pip install --upgrade taichi
ti gallery

We recommend running the following code snippets online on Colab (https://sourl.cn/GnGEEm) so you can get first-hand experience!

Tip 1: Auto-debug out-of-bound array accesses

The array access violation issue is quite common in low-level programming (such as C++ and CUDA), and more often than not, the program proceeds regardless. You would not even realize it until you ended up with a wrong result. Even if, with a stroke of luck, you saw a segmentation fault triggered, you would find it hard to debug. Taichi solves this problem by providing an auto-debugging mode: just set debug=True when initializing Taichi. For example:

And you will see an error appear:

Out-of-bound array access

To sum up:

  1. Bounds checks are available only when you enable debug=True.
  2. Only ti.cpu and ti.cuda are supported (switch to CPU/CUDA for bounds checking if you are using another backend).
  3. Program performance may degrade after debug=True is turned on.

Tip 2: Access a high-dimensional field by indexing integer vectors

It can be cumbersome to use val[i, j, k, l] to access an element in a high-dimensional field. Is there an easier way to do that?

Well, we can index an integer vector directly (and conduct math operations based on such vectors) like this:

And run the program:

Access a high-dimensional field by indexing integer vectors

To sum up:

  1. for I in ti.grouped(img): make sure you use ti.grouped to pack the indices into a ti.Vector I.
  2. If I is a floating-point vector, make sure you use I.cast(ti.i32) to cast it to an integer vector before indexing; otherwise, Taichi reports a warning.
  3. The point of this tip is that your code becomes dimension-independent: the same code works for both 2D and 3D.

Tip 3: Serialize the outermost for loop

By default, Taichi automatically parallelizes the for loop at the outermost scope, but sometimes some programs need to be serialized. In this case, you just need ti.loop_config(serialize=True):

And you will get the right result:

Serialize the outermost for loop

To sum up:

  1. ti.loop_config(serialize=True) decorates the outermost for loop that immediately follows it.
  2. ti.loop_config works only for the range-for loop at the outermost scope.
  3. Inner for loops are serialized by default.

In addition, you can try warp-level intrinsics to accelerate prefix sum if you are using CUDA: https://github.com/taichi-dev/taichi/issues/4631

Tip 4: Interact with Python libraries, such as NumPy

“I really want to convert the output to the data types supported by NumPy so I can paint with Matplotlib or develop deep learning models with PyTorch!”

Taichi provides a solution:

I tried it out with Matplotlib and it went well:

Interact with Python libraries, such as NumPy

Tip 5: Analyze performance with Taichi Profiler

“It takes a long time to run my program, but how can I figure out which Taichi kernel is the most time-consuming?”

Well begun is half done. It is crucial to locate the bottleneck before you start optimization, and Taichi’s Profiler can do that for you:

To give you an idea as to what the profiling report would look like:

Profiling report

To sum up:

  1. A kernel that has been fully optimized away by the compiler does not generate profiling records (the bar kernel mentioned above is a fully optimized one).
  2. One kernel may generate multiple records because its parallel for loops are compiled into separate tasks, each offloaded to the device individually.
  3. Make sure you call ti.sync() before performance profiling if the program is running on GPU.
  4. jit_evaluator_xxx can be ignored because it is automatically generated by the system.
  5. Currently, kernel_profiler only supports CPU and CUDA (but you are very encouraged to make contributions and add more backends!).
  6. We recommend running performance profiling several times and looking at the minimum or average execution time.

Recent feature: ti.dataclass

This new feature is contributed by bsavery. It resembles dataclasses.dataclass from the Python standard library but works inside Taichi kernels.

A simple example of how to use this feature:

Result:

ti.dataclass example

Hope you have fun with the following mpm99 demo written with this new feature! If the demo does not run on Colab, you can save it as a Python file (.py) and run it locally, provided that you have installed the latest version of Taichi.

If you encounter any problems when implementing the code above, or if you have any advice to help us improve Taichi's features, you are most welcome to DM me or contact our community team at community@taichi.graphics.

And we look forward to your contribution or genuine opinions! Submit a PR or participate in discussions on GitHub: https://github.com/taichi-dev/taichi

About Taichi Lang:
Taichi Lang is an open-source parallel programming language designed for high-performance numerical computation. It is embedded in Python and uses just-in-time (JIT) compiler frameworks (such as LLVM) to offload the compute-intensive Python code to native GPU or CPU instructions.
View our GitHub project and become a contributor 👉🏻: https://github.com/taichi-dev/taichi
To familiarize yourself with Taichi’s attributes or solve any technical issues, visit our doc site 👉🏻: https://www.taichi-lang.org/
