Taichi Cookbook 001: Five practical tips on how to master Taichi, a handy parallel programming language embedded in Python

Taichi Lang
Parallel-Programming-in-Python
9 min read · Aug 23, 2022

Yuanming Hu

Hey guys! Welcome to my first Taichi cooking session!👋

From time to time, I hear our users ask questions like “how can I make my code cleaner and more straightforward” or “how can I further optimize the performance of my Taichi programs”. So, I decided to share some very practical tips I myself often use when coding with Taichi, as well as a new feature ti.dataclass. Hopefully, you can make good use of them the next time you ti.init.

We have released Taichi v1.0.4. Upgrade to the latest version and view classy demos:
pip install --upgrade taichi
ti gallery

We recommend running the following code snippets online on Colab (https://sourl.cn/GnGEEm) so you can get first-hand experience!

Tip 1: Auto-debug out-of-bound array accesses

The array access violation issue is quite common in low-level programming (such as C++ and CUDA), and more often than not, the program proceeds regardless. You would not even realize it until you ended up with a wrong result. Even if, with a stroke of luck, you saw a segmentation fault triggered, you would find it hard to debug. Taichi solves this problem by providing an auto-debugging mode: just set debug=True when initializing Taichi. For example:

And you will see an error appear:

Out-of-bound array access

To sum up:

  1. Bounds checks are available only when you enable debug=True.
  2. Only ti.cpu and ti.cuda are supported (switch to CPU/CUDA for bounds checking if you are using another backend).
  3. Program performance may degrade after debug=True is turned on.

Tip 2: Access a high-dimensional field by indexing integer vectors

It can be cumbersome to use val[i, j, k, l] to access an element in a high-dimensional field. Is there an easier way to do that?

Well, we can index an integer vector directly (and conduct math operations based on such vectors) like this:

And run the program:

Access a high-dimensional field by indexing integer vectors

To sum up:

  1. for I in ti.grouped(img): make sure you use ti.grouped to pack the indices into a ti.Vector I.
  2. If I is a floating-point vector, make sure you use I.cast(ti.i32) to cast it to an integer vector before indexing; otherwise, Taichi reports a warning.
  3. The point of this tip is that your code becomes dimension-independent: the same code works for both 2D and 3D.

Tip 3: Serialize the outermost for loop

By default, Taichi automatically parallelizes the for loop at the outermost scope, but sometimes some programs need to be serialized. In this case, you just need ti.loop_config(serialize=True):

And you will get the right result:

Serialize the outermost for loop

To sum up:

  1. ti.loop_config(serialize=True) decorates the outermost for loop that immediately follows it.
  2. ti.loop_config works only for the range-for loop at the outermost scope.
  3. Inner for loops are serialized by default.

In addition, you can try warp-level intrinsics to accelerate prefix sum if you are using CUDA: https://github.com/taichi-dev/taichi/issues/4631

Tip 4: Interact with Python libraries, such as NumPy

“I really want to convert the output to the data types supported by NumPy so I can paint with Matplotlib or develop deep learning models with PyTorch!”

Taichi provides a solution:

I tried it out with Matplotlib and it went well:

Interact with Python libraries, such as NumPy

Tip 5: Analyze performance with Taichi Profiler

“It takes a long time to run my program, but how can I figure out which Taichi kernel is the most time-consuming?”

Well begun is half done. It is crucial to locate the bottleneck before you start optimization, and Taichi’s Profiler can do that for you:

To give you an idea as to what the profiling report would look like:

Profiling report

To sum up:

  1. A kernel that has been fully optimized away by the compiler does not generate profiling records (the bar kernel mentioned above is a fully optimized one).
  2. One kernel may generate multiple records because its parallel for loops are compiled into separate tasks, each offloaded to the device individually.
  3. Make sure you call ti.sync() before performance profiling if the program is running on GPU.
  4. jit_evaluator_xxx can be ignored because it is automatically generated by the system.
  5. Currently, kernel_profiler only supports CPU and CUDA (but you are very encouraged to make contributions and add more backends!).
  6. We recommend running performance profiling several times and looking at the minimum or average execution time.

Recent feature: ti.dataclass

This new feature is contributed by bsavery. It resembles dataclasses.dataclass from the Python standard library but works inside Taichi kernels.

A simple example of how to use this feature:

Result:

ti.dataclass example

Hope you have fun with the following mpm99 demo written with this new feature! If the demo does not run on Colab, you can save it as a Python file (.py) and run it locally, provided that you have installed the latest version of Taichi.

If you encounter any problems when implementing the code above, or if you have any advice to help us improve Taichi's features, you are most welcome to DM me or contact our community team at community@taichi.graphics.

And we look forward to your contribution or genuine opinions! Submit a PR or participate in discussions on GitHub: https://github.com/taichi-dev/taichi

About Taichi Lang:
Taichi Lang is an open-source parallel programming language designed for high-performance numerical computation. It is embedded in Python and uses just-in-time (JIT) compiler frameworks (such as LLVM) to offload the compute-intensive Python code to native GPU or CPU instructions.
View our GitHub project and become a contributor 👉🏻: https://github.com/taichi-dev/taichi
To familiarize yourself with Taichi’s attributes or solve any technical issues, visit our doc site 👉🏻: https://www.taichi-lang.org/
