Released CuPy v12

Kenichi Maehashi
Published in CuPy · 3 min read · Mar 30, 2023

The CuPy team is excited to announce the release of CuPy v12! This major release focuses on expanding NumPy/SciPy API coverage, including a new interpolation module (cupyx.scipy.interpolate) and ufunc methods such as cupy.add.at.

Highlights

SciPy Interpolation Module

A new module, cupyx.scipy.interpolate, has been introduced to provide GPU-optimized implementations of SciPy’s interpolation APIs. The following 18 APIs are available, covering univariate/multivariate interpolators and 1-D splines.

  • Akima1DInterpolator
  • barycentric_interpolate (*)
  • BarycentricInterpolator (*)
  • BPoly
  • BSpline
  • CubicHermiteSpline
  • interpn
  • krogh_interpolate (*)
  • KroghInterpolator (*)
  • make_interp_spline
  • pchip
  • pchip_interpolate
  • PchipInterpolator
  • PPoly
  • RBFInterpolator
  • RegularGridInterpolator
  • splantider
  • splder
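
The cupyx.scipy.interpolate APIs follow SciPy’s interfaces, so existing SciPy code ports over by swapping the import and moving arrays to the GPU. As a CPU sketch using SciPy’s make_interp_spline (on GPU, the same call takes cupy arrays via cupyx.scipy.interpolate):

```python
import numpy as np
from scipy.interpolate import make_interp_spline

# Fit a cubic B-spline through 16 samples of sin(x).
x = np.linspace(0, 2 * np.pi, 16)
y = np.sin(x)
spl = make_interp_spline(x, y, k=3)

# Evaluate on a finer grid; the spline closely tracks the true function.
xs = np.linspace(0, 2 * np.pi, 100)
max_err = float(np.max(np.abs(spl(xs) - np.sin(xs))))
```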

Acknowledgment: This work was done by Evgeni Burovski and Edgar Andrés Margffoy Tuay from Quansight, with the support of CZI’s EOSS program. The functions denoted by (*) were implemented by Khushi Agrawal as part of Google Summer of Code 2022.

Enhanced NumPy/SciPy API Coverage

In addition to the interpolation module, CuPy v12 adds the following 16 NumPy/SciPy-compatible APIs.

  • cupy.byte_bounds
  • cupy.heaviside
  • cupy.min_scalar_type
  • cupy.ndarray.searchsorted
  • cupyx.scipy.special.cosm1
  • cupyx.scipy.special.exp1
  • cupyx.scipy.special.expi
  • cupyx.scipy.special.k0
  • cupyx.scipy.special.k0e
  • cupyx.scipy.special.k1
  • cupyx.scipy.special.k1e
  • cupyx.scipy.special.logsumexp
  • cupyx.scipy.special.softmax
  • cupyx.scipy.stats.boxcox_llf
  • cupyx.scipy.stats.zmap
  • cupyx.scipy.stats.zscore
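
These APIs follow their SciPy counterparts. For example, cupyx.scipy.special.logsumexp computes log(sum(exp(a))) in a numerically stable way; a CPU sketch of the underlying max-shift trick, written here in plain NumPy for illustration:

```python
import numpy as np

def logsumexp(a):
    # Shift by the max so the largest term becomes exp(0) == 1,
    # avoiding overflow in exp().
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

a = np.array([1000.0, 1000.0])
with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(a)))  # exp(1000) overflows to inf
stable = logsumexp(a)                  # 1000 + log(2)
```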

ufunc Methods

CuPy now provides ufunc methods for selected ufuncs. In particular, you can now write scatter operations in the same way as NumPy, e.g., cupy.add.at(arr, slices, value) instead of cupyx.scatter_add(arr, slices, value).

  • cupy.ufunc.reduce
  • cupy.ufunc.accumulate
  • cupy.ufunc.reduceat
  • cupy.ufunc.outer
  • cupy.ufunc.at
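
cupy.add.at follows NumPy’s semantics: the operation is applied once per index, so duplicate indices accumulate rather than collapse. A CPU sketch of that distinction in NumPy (replacing np with cupy gives the GPU version):

```python
import numpy as np

idx = np.array([0, 1, 1, 3])

# ufunc.at performs an unbuffered in-place operation:
# index 1 appears twice, so it is incremented twice.
arr = np.zeros(4)
np.add.at(arr, idx, 1.0)   # arr -> [1., 2., 0., 1.]

# Plain fancy-index assignment buffers the result,
# so the duplicate write to index 1 collapses into one.
arr2 = np.zeros(4)
arr2[idx] += 1.0           # arr2 -> [1., 1., 0., 1.]
```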

Support for the latest platforms — CUDA 12, H100, and Jetson Orin

CuPy v12 adds official support for the latest NVIDIA GPU platforms.

NVIDIA CUDA 12 is the first new CUDA major release in several years, bringing performance optimizations and support for the NVIDIA Hopper and Ada Lovelace architectures. CuPy now offers the cupy-cuda12x binary package, which includes kernels optimized for the H100 GPU and the Jetson Orin platform.

Thanks to CUDA’s minor version compatibility, the binary package works with CUDA 12.0 and all future 12.x releases, allowing users to upgrade to the latest CUDA minor version without waiting for a new CuPy release.
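
Installation follows the usual pip flow; a single wheel covers every 12.x minor release:

```shell
# One binary package targets all CUDA 12.x releases.
pip install cupy-cuda12x
```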

Acknowledgment: This work has been done by Preferred Networks and NVIDIA. Part of the work is supported by CZI’s EOSS program.

Thrust & CUB Support in CuPy JIT

You can now use Thrust functions and CUB reduction classes in kernels defined with CuPy JIT. The following examples illustrate using Thrust to sort and CUB to compute a warp-level sum.

import cupy, cupyx

@cupyx.jit.rawkernel()
def sort_by_key(x, y):
    # Each thread sorts one row of x, applying the same
    # permutation to the corresponding row of y.
    i = cupyx.jit.threadIdx.x
    x_array = x[i]
    y_array = y[i]
    cupyx.jit.thrust.sort_by_key(
        cupyx.jit.thrust.device,
        x_array.begin(),
        x_array.end(),
        y_array.begin(),
    )

h, w = (256, 256)
x = cupy.arange(h * w, dtype=cupy.int32)
cupy.random.shuffle(x)
x = x.reshape(h, w)
y = cupy.arange(h * w, dtype=cupy.int32)
cupy.random.shuffle(y)
y = y.reshape(h, w)
sort_by_key[1, 256](x, y)
import cupy, cupyx
from cupy.cuda import runtime
from cupyx import jit

@jit.rawkernel()
def warp_reduce_sum(x, y):
    # Each block reduces one row of x; each lane of the warp
    # contributes one element.
    WarpReduce = jit.cub.WarpReduce[cupy.int32]
    temp_storage = jit.shared_memory(
        dtype=WarpReduce.TempStorage, size=1)
    i, j = jit.blockIdx.x, jit.threadIdx.x
    value = x[i, j]
    aggregator = WarpReduce(temp_storage[0])
    aggregate = aggregator.Reduce(value, jit.cub.Sum())
    if j == 0:
        y[i] = aggregate

warp_size = 64 if runtime.is_hip else 32
h, w = (32, warp_size)
x = cupy.arange(h * w, dtype=cupy.int32).reshape(h, w)
cupy.random.shuffle(x)
y = cupy.zeros(h, dtype=cupy.int32)
warp_reduce_sum[h, w](x, y)

Acknowledgment: This work is based on an implementation by Tsutsui Masayoshi during the 2022 summer internship program at Preferred Networks.

Upgrading from v11 or earlier

Please check the Upgrade Guide for the list of potentially breaking changes since the last major release. If you work with multiple GPUs, note that the cupy.cuda.Device() context manager now restores the previously active device on exit, for better interoperability with other CUDA-powered libraries such as PyTorch.

Join the CuPy community!

CuPy has an active and growing community of developers and users. If you are interested in joining, start by talking to the core developers on Gitter, filing an issue on GitHub for feature requests or bug reports, or sending a pull request to contribute code or documentation. If you are unsure where to start, take a look at the good first issues or try implementing one of the NumPy/SciPy APIs not yet available in CuPy.

Finally, don’t forget to star us on GitHub and follow us on Twitter and Medium to stay updated on the latest news!

Kenichi Maehashi works on CuPy & PyTorch development in the Deep Learning Ecosystem Team at Preferred Networks.