We will now switch gears to GPU arrays, whose operation is markedly different from that of the other previously discussed Enoki array types.
The first major change when working with GPU arrays is that Enoki is no longer a pure header file library. A compilation step becomes necessary, which produces shared libraries against which applications must be linked. The reason for this is to minimize template bloat: GPU arrays involve a certain amount of additional machinery, and it would be wasteful to have to include the underlying implementation in every file of an application that relies on GPU arrays.
Compiling Enoki produces up to three libraries that are potentially of interest:
libenoki-cuda.so
: a just-in-time compiler that is used to realize the
Enoki backend of the CUDAArray<T>
array type discussed in this section.libenoki-autodiff.so
: a library for maintaining a computation graph for
automatic differentiation discussed in the next section.enoki.cpyhon-37m-x86_64-linux-gnu.so
(platform-dependent filename): a
Python binding library that provides interoperability with Enoki’s GPU
arrays and differentiable GPU arrays.Enter the following CMake command to compile all of them:
cd <path-to-enoki>
mkdir build
cd build
cmake -DENOKI_CUDA=ON -DENOKI_AUTODIFF=ON -DENOKI_PYTHON=ON ..
make
For educational reasons, it is instructive to compile in Enoki in debug mode, which enables a number of log messages that we will refer to in the remainder of this section. Use the following CMake command to do so:
cmake -DCMAKE_BUILD_TYPE=Debug -DENOKI_CUDA=ON -DENOKI_AUTODIFF=ON -DENOKI_PYTHON=ON ..
We find it easiest to introduce Enoki’s GPU arrays from within an interactive
Python interpreter and postpone the discussion of the C++ interface to the end
of this section. We’ll start by importing the Enoki extension into an
interactive Python session and set the log level to a high value via
cuda_set_log_level()
, which will be helpful in the subsequent
discussion.
>>> from enoki import *
>>> cuda_set_log_level(4)
Note
The first time that Enoki is imported on a new machine, it will trigger a kernel pre-compilation step that takes a few seconds.
The Enoki python bindings expose a number of types in the enoki.cuda
submodule that
correspond to GPU-resident arrays. The following example initializes such an array with a
constant followed by a simple addition operation.
>>> from enoki.cuda import Float32 as FloatC
>>> a = FloatC(1)
cuda_trace_append(10): mov.$t1 $r1, 0f3f800000
>>> a = a + a
cuda_trace_append(11 <- 10, 10): add.rn.ftz.$t1 $r1, $r2, $r3
Observe the two cuda_trace_append
log messages, which begin to reveal the
mechanics underlying the GPU backend: neither of these two operations has
actually occurred at this point. Instead, Enoki simply queued this computation
for later execution using an assembly-like intermediate language named NVIDIA
PTX.
For instance, the first line cuda_trace_append(10): mov.$t1 $r1, 0f3f800000
indicates the creation of a new variable with index 10
, and the associated
line of PTX will eventually be used to initialize this variable with a binary
constant representing the floating point value 1.0
. The next
cuda_trace_append
command introduces a new variable 11
that will record
the result of the addition, while keeping track of the dependence on the
original variable 10
, etc. More complex numerical operations (e.g. a
hyperbolic tangent) result in a longer sequence of steps that are similarly
enqueued:
>>> a = tanh(a)
cuda_trace_append(12 <- 11): abs.ftz.$t1 $r1, $r2
cuda_trace_append(13): mov.$t1 $r1, 0f3f200000
... 25 lines skipped ...
cuda_trace_append(39 <- 38, 37): sub.rn.ftz.$t1 $r1, $r2, $r3
cuda_trace_append(40 <- 39, 29, 14): selp.$t1 $r1, $r2, $r3, $r4
Eventually, numerical evaluation can no longer be postponed, e.g. when we try to print the array contents:
>>> print(a)
cuda_eval(): launching kernel (n=1, in=0, out=1, ops=31)
.... many lines skipped ...
cuda_jit_run(): cache miss, jit: 541 us, ptx compilation: 43.534, 10 registers
[0.964028]
At this point, Enoki’s JIT backend compiles and launches a kernel that contains all of the computation queued thus far.
.version 6.3
.target sm_75
.address_size 64
.visible .entry enoki_8a163272(.param .u64 ptr,
.param .u32 size) {
.reg.b8 %b<41>;
.reg.b16 %w<41>;
.reg.b32 %r<41>;
.reg.b64 %rd<41>;
.reg.f32 %f<41>;
.reg.f64 %d<41>;
.reg.pred %p<41>;
// Grid-stride loop setup
ld.param.u64 %rd0, [ptr];
ld.param.u32 %r1, [size];
mov.u32 %r4, %tid.x;
mov.u32 %r5, %ctaid.x;
mov.u32 %r6, %ntid.x;
mad.lo.u32 %r2, %r5, %r6, %r4;
setp.ge.u32 %p0, %r2, %r1;
@%p0 bra L0;
mov.u32 %r7, %nctaid.x;
mul.lo.u32 %r3, %r6, %r7;
L1:
// Loop body
mov.f32 %f10, 0f3f800000;
add.rn.ftz.f32 %f11, %f10, %f10;
mul.rn.ftz.f32 %f12, %f11, %f11;
mul.rn.ftz.f32 %f13, %f12, %f12;
mul.rn.ftz.f32 %f14, %f13, %f13;
mov.f32 %f15, 0fbbbaf0ea;
mul.rn.ftz.f32 %f16, %f15, %f14;
mov.f32 %f17, 0f3e088393;
mov.f32 %f18, 0fbeaaaa99;
fma.rn.ftz.f32 %f19, %f12, %f17, %f18;
add.rn.ftz.f32 %f20, %f19, %f16;
mov.f32 %f21, 0f3ca9134e;
mov.f32 %f22, 0fbd5c1e2d;
fma.rn.ftz.f32 %f23, %f12, %f21, %f22;
fma.rn.ftz.f32 %f24, %f13, %f23, %f20;
mul.rn.ftz.f32 %f25, %f12, %f11;
fma.rn.ftz.f32 %f26, %f24, %f25, %f11;
add.rn.ftz.f32 %f27, %f11, %f11;
mov.f32 %f28, 0f3fb8aa3b;
mul.rn.ftz.f32 %f29, %f28, %f27;
ex2.approx.ftz.f32 %f30, %f29;
mov.f32 %f31, 0f3f800000;
add.rn.ftz.f32 %f32, %f30, %f31;
rcp.approx.ftz.f32 %f33, %f32;
add.rn.ftz.f32 %f34, %f33, %f33;
mov.f32 %f35, 0f3f800000;
sub.rn.ftz.f32 %f36, %f35, %f34;
abs.ftz.f32 %f37, %f11;
mov.f32 %f38, 0f3f200000;
setp.ge.f32 %p39, %f37, %f38;
selp.f32 %f40, %f36, %f26, %p39;
// Store register %f40
ldu.global.u64 %rd8, [%rd0 + 0];
st.global.f32 [%rd8], %f40;
add.u32 %r2, %r2, %r3;
setp.ge.u32 %p0, %r2, %r1;
@!%p0 bra L1;
L0:
ret;
}
Internally, Enoki hands the PTX code over to CUDA’s runtime compiler (NVRTC), which performs a second pass that translates from PTX to the native GPU instruction set SASS.
enoki_8a163272:
MOV R1, c[0x0][0x28];
S2R R0, SR_TID.X;
S2R R3, SR_CTAID.X;
IMAD R0, R3, c[0x0][0x0], R0;
ISETP.GE.U32.AND P0, PT, R0, c[0x0][0x168], PT;
@P0 EXIT;
BSSY B0, `(.L_2);
ULDC.64 UR4, c[0x0][0x160];
.L_3:
LDG.E.64.SYS R2, [UR4];
MOV R5, 0x3f76ca83;
MOV R7, c[0x0][0x0];
IMAD R0, R7, c[0x0][0xc], R0;
ISETP.GE.U32.AND P0, PT, R0, c[0x0][0x168], PT;
STG.E.SYS [R2], R5;
@!P0 BRA `(.L_3);
BSYNC B0;
.L_2:
EXIT ;
.L_4:
BRA `(.L_4);
This second phase is a full-fledged optimizing compiler with constant propagation and common subexpression elimination. You can observe this in the previous example because the second snippet is much smaller—in fact, almost all of the computation was optimized away and replaced by a simple constant (\(\tanh(2)\approx 0.964028\)).
Enoki’s approach is motivated by efficiency considerations: most array operations are individually very simple and do not involve a sufficient amount of computation to outweigh overheads related to memory accesses and GPU kernel launches. Enoki therefore accumulates larger amounts of work (potentially hundreds of thousands of individual operations) before creating and launching an optimized GPU kernel. Once evaluated, array contents can be accessed without triggering further computation:
>>> print(a)
[0.964028]
GPU kernel compilation consists of two steps: the first generates a PTX kernel from the individual operations—this is essentially just string concatenation and tends to be very fast (541 µs in the above example, most of which is caused by printing assembly code onto the console due to the high log level).
The second step (ptx compilation
) that converts the PTX intermediate
representation into concrete machine code that can be executed on the installed
graphics card is orders of magnitude slower (43 ms in the above example) but
only needs to happen once: whenever the same computation occurs again (e.g. in
subsequent iterations of an optimization algorithm), the previously generated
kernel is reused:
>>> b = FloatC(1)
>>> b = b + b
>>> b = tanh(b)
>>> print(b)
cuda_eval(): launching kernel (n=1, in=0, out=1, ops=31)
.... many lines skipped ...
cuda_jit_run(): cache hit, jit backend: 550 us
[0.964028]
We now turn to a more complex example: computing the three-dimensional volume of a sphere using Monte Carlo integration. To do so, we create a random number generator RNG that will generate 1 million samples:
>>> from enoki.cuda import PCG32 as PCG32C, UInt64 as UInt64C
>>> rng = PCG32C(UInt64C.arange(1000000))
Here, PCG32 refers to a linear congruential generator from the section on
random number generation. We use it to sample three random
number vectors from the RNG and create a dynamic array of 3D vectors
(Vector3fC
).
>>> from enoki.cuda import Vector3f as Vector3fC
>>> v = Vector3fC([rng.next_float32() * 2 - 1 for _ in range(3)])
Finally, we compute a mask that determines which of the uniformly distributed vectors on the set \([-1, 1]^3\) lie within the unit sphere:
>>> inside = norm(v) < 1
At this point, seeding of the random number generator and subsequent sampling steps touching its internal state have produced over a hundred different operations generating various intermediate results along with the output variable of interest.
To understand the specifics of this process, we assign a label to this variable
and enter the command cuda_whos()
, which is analogous to whos
in
IPython and MATLAB and generates a listing of all variables that are currently
registered (with the JIT compiler, in this case).
>>> set_label(inside, 'inside')
>>> cuda_whos()
ID Type E/I Refs Size Memory Ready Label
=================================================================
10 u32 0 / 1 1000000 3.8147 MiB [ ]
11 u64 0 / 1 1000000 7.6294 MiB [ ]
... 126 lines skipped ...
178 f32 0 / 1 1 4 B [ ]
179 msk 1 / 0 1000000 976.56 KiB [ ] inside
=================================================================
Memory usage (ready) : 0 B
Memory usage (scheduled) : 0 B + 20.027 MiB = 20.027 MiB
Memory savings : 350.95 MiB
The resulting output lists variables of many types (single precision floating
point values, 32/64 bit unsigned integers, masks, etc..), of which the last one
corresponds to the inside
variable named above.
Note how each variable lists two reference counts (in the column E/I
refs
): the first (external) specifies how many times the variable is
referenced from an external application like the interactive Python prompt,
while the second (internal) counts how many times it is referenced as part of
queued arithmetic expressions. Variables with zero references in both categories
are automatically purged from the list.
Most of the variables are only referenced internally—these correspond to temporaries created during a computation. Because they can no longer be “reached” through external references, it would be impossible to ask the system for the contents of such a temporary variable. Enoki relies on this observation to perform an important optimization: rather than storing temporaries in global GPU memory, their contents can be represented using cheap temporary GPU registers. This yields significant storage and memory traffic savings: over 350 MiB of storage can be elided in the last example, leaving only roughly 20 MiB of required storage.
In fact, these numbers can still change: we have not actually executed the
computation yet, and Enoki currently conservatively assumes that we plan to
continue using the random number generator rng
and list of 3D vectors v
later on. If we instruct Python to garbage-collect these two variables, the
required storage drops to less than a megabyte:
>>> del v, rng
>>> cuda_whos()
ID Type E/I Refs Size Memory Ready Label
=================================================================
10 u32 0 / 1 1000000 3.8147 MiB [ ]
11 u64 0 / 1 1000000 7.6294 MiB [ ]
... 126 lines skipped ...
178 f32 0 / 1 1 4 B [ ]
179 msk 1 / 0 1000000 976.56 KiB [ ] inside
=================================================================
Memory usage (ready) : 0 B
Memory usage (scheduled) : 0 B + 976.56 KiB = 976.56 KiB
Memory savings : 324.25 MiB
Finally, we can “peek” into the inside
array to compute the fraction of
points that lie within the sphere, which approximates the expected value
\(\frac{4}{3\cdot 2^3}\pi\approx0.523599\).
>>> count(inside) / len(inside)
... many lines skipped ...
0.523946
It is sometimes desirable to manually force Enoki’s JIT compiler to generate a
kernel containing the computation queued thus far. For instance, rather than
compiling a long-running iterative algorithm into a single huge kernel, a
single kernel per iteration may be preferable. This can be accomplished by
explicitly invoking the cuda_eval()
function periodically. An example:
>>> from enoki.cuda import UInt32 as UInt32C
>>> a = UInt32C.arange(1234)
>>> cuda_eval()
cuda_eval(): launching kernel (n=1234, in=0, out=1, ops=1)
>>> cuda_whos()
ID Type E/I Refs Size Memory Ready Label
=================================================================
10 u32 1 / 0 1234 4.8203 KiB [x]
=================================================================
Memory usage (ready) : 4.8203 KiB
Memory usage (scheduled) : 4.8203 KiB + 0 B = 4.8203 KiB
Memory savings : 0 B
The array is now marked “ready”, which means that its contents were evaluated
and reside in GPU memory at an address that can be queried via the data
field.
>>> a.data
140427428626432
Actually, that is not entirely accurate: kernels are always launched
asynchronously, which means that the function cuda_eval()
may have
returned before the GPU finished executing the kernel. Nonetheless, is
perfectly safe to begin using the variable immediately as asynchronous
communication with the GPU still observes a linear ordering guarantee.
In very rare cases (e.g. kernel benchmarking), it may be desirable to wait
until all currently running kernels have terminated. For this, invoke
cuda_sync()
following cuda_eval()
.
Recall the difference between vertical and horizontal operations: vertical operations are applied independently to each element of a vector, while horizontal ones combine the different elements of a vector. Enoki’s GPU arrays are designed to operate very efficiently when working with vertical operations that can be parallelized over the entire chip.
Horizontal operations (e.g. hsum()
, all()
,
count()
, etc.) are best avoided whenever possible, because they
require that all prior computation has finished. In other words: each time
Enoki encounters a horizontal operation involving an unevaluated array, it
triggers a call to cuda_eval()
. That said, horizontal reductions are
executed in parallel using NVIDIA’s CUB
library, which is a highly performant implementation of these primitives.
Enoki GPU arrays support bidirectional conversion from/to NumPy arrays, which will of course involve some communication between the CPU and GPU:
>>> x = FloatC.linspace(0, 1, 5)
>>> # Enoki -> NumPy
>>> y = Vector3fC(x, x*2, x*3).numpy()
cuda_eval(): launching kernel (n=5, in=1, out=6, ops=36)
>>> print(y)
array([[0. , 0. , 0. ],
[0.25, 0.5 , 0.75],
[0.5 , 1. , 1.5 ],
[0.75, 1.5 , 2.25],
[1. , 2. , 3. ]], dtype=float32)
>>> # NumPy -> Enoki
>>> Vector3fC(y)
cuda_eval(): launching kernel (n=5, in=1, out=3, ops=27)
[[0, 0, 0],
[0.25, 0.5, 0.75],
[0.5, 1, 1.5],
[0.75, 1.5, 2.25],
[1, 2, 3]]
PyTorch GPU tensors are supported as well. In this case, copying occurs on the GPU (but is still necessary, as the two frameworks use different memory layouts for tensors).
>>> x = FloatC.linspace(0, 1, 5)
>>> # Enoki -> PyTorch
>>> y = Vector3fC(x, x*2, x*3).torch()
cuda_eval(): launching kernel (n=5, in=2, out=5, ops=31)
>>> y
tensor([[0.0000, 0.0000, 0.0000],
[0.2500, 0.5000, 0.7500],
[0.5000, 1.0000, 1.5000],
[0.7500, 1.5000, 2.2500],
[1.0000, 2.0000, 3.0000]], device='cuda:0')
>>> # PyTorch -> Enoki
>>> Vector3fC(y)
cuda_eval(): launching kernel (n=5, in=1, out=3, ops=27)
[[0, 0, 0],
[0.25, 0.5, 0.75],
[0.5, 1, 1.5],
[0.75, 1.5, 2.25],
[1, 2, 3]]
Note how the .numpy()
and .torch()
function calls both triggered a
mandatory kernel launch to ensure that that the array contents were ready
before returning a representation in the other framework. This can be wasteful
when converting many variables at an interface between two frameworks. For this
reason, both .numpy()
and .torch()
functions take an optional eval
argument that is set to True
by default. Passing False
causes the
operation to return an uninitialized NumPy or PyTorch array, while at the same
time scheduling Enoki code that will eventually fill this memory with valid
contents the next time that cuda_eval()
is triggered. An example is
shown below. This feature is to be used with caution.
>>> x = FloatC.linspace(0, 1, 5)
>>> y = Vector3fC(x, x*2, x*3).numpy(False)
>>> y
array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]], dtype=float32)
>>> cuda_eval()
cuda_eval(): launching kernel (n=5, in=1, out=4, ops=36)
>>> y
array([[0. , 0. , 0. ],
[0.25, 0.5 , 0.75],
[0.5 , 1. , 1.5 ],
[0.75, 1.5 , 2.25],
[1. , 2. , 3. ]], dtype=float32)
The GPU backend also supports scatter and gather operations involving GPU arrays as targets.
>>> a = FloatC.zero(10)
>>> b = UInt32C.arange(5)
>>> scatter(target=a, source=FloatC(b), index=b*2)
>>> a
cuda_eval(): launching kernel (n=5, in=1, out=2, ops=9)
[0, 0, 1, 0, 2, 0, 3, 0, 4, 0]
Note that gathering from an unevaluated Enoki array is not guaranteed to be a
vertical operation, hence it triggers a call to cuda_eval()
.
Similar to the PyTorch memory allocator,
Enoki uses a caching scheme to avoid very costly device synchronizations when
releasing memory. This means that freeing a large GPU variable doesn’t cause
the associated memory region to become available for use by the operating
system or other frameworks like Tensorflow or PyTorch. Use the function
cuda_malloc_trim()
to fully purge all unused memory. The function is
only relevant when working with other frameworks and does not need to be called
to free up memory for use by Enoki itself.
Everything demonstrated in the above sections can be directly applied to
C++ programs as well. To use the associated (dynamic array) type CUDAArray
,
include the header
#include <enoki/cuda.h>
Furthermore, applications must be linked against the cuda
and
enoki-cuda
libraries. The following snippet contains a C++ translation of
the Monte Carlo integration Python example shown earlier.
#include <enoki/cuda.h>
#include <enoki/random.h>
using namespace enoki;
using FloatC = CUDAArray<float>;
using Vector3fC = Array<FloatC, 3>;
using PCG32C = PCG32<FloatC>;
using MaskC = mask_t<FloatC>;
int main(int argc, char **argv) {
PCG32C rng(PCG32_DEFAULT_STATE, arange<FloatC>(1000000));
Vector3fC v(
rng.next_float32() * 2.f - 1.f,
rng.next_float32() * 2.f - 1.f,
rng.next_float32() * 2.f - 1.f
);
MaskC inside = norm(v) < 1.f;
std::cout << count(inside) / (float) inside.size() << std::endl;
}
When vectorizing code, we may sometimes want to skip an expensive computation
when it is not actually needed by any elements in the array being processed.
This is usually done with the any()
function and yields good
performance in when targeting the CPU (e.g. with the AVX512 backend). An
example:
auto condition = variable > 1.f;
if (any(condition))
result[condition] = /* expensive-to-evaluate expression */;
However, recall the discussion earlier in this section, which explained how
horizontal operations tend to be fairly expensive in conjunction with the GPU
backend because they flush the JIT compiler. This effectively breaks up the
program into smaller kernels, increasing memory traffic and missing potential
optimization opportunities. Arrays processed by the GPU backend tend to be much
larger, and from a probabilistic viewpoint it is often likely that the
any()
function call will in any case evaluate to true
. For these
reasons, skipping test and always evaluating the expression often leads to
better performance on the GPU.
Enoki provides alternative horizontal reductions of masks named
any_or()
, all_or()
, none_or()
that do exactly
this: they skip evaluation when compiling for GPU targets and simply return the
supplied template argument. For other targets, they behave as usual. With this
change, the example looks as follows:
auto condition = variable > 1.f;
if (any_or<true>(condition))
result[condition] = /* expensive-to-evaluate expression */;
Enoki was designed as a numerical foundation for differentiable physical simulations, specifically the Mitsuba renderer, though it is significantly more general and should be a trusty tool for a variety of simulation and optimization problems.
Its GPU and Autodiff backends are related to well-known frameworks like TensorFlow and PyTorch that have become standard tools for training and evaluating neural networks. In the following, we outline the main differences between these frameworks and Enoki.
Both PyTorch and Tensorflow provide two main operational modes: eager mode
directly evaluates arithmetic operations on the GPU, which yields excellent
performance in conjunction with arithmetically intensive operations like
convolutions and large matrix-vector multiplications, both of which are
building blocks of neural networks. When evaluating typical simulation code
that mainly consists of much simpler arithmetic (e.g. additions,
multiplications, etc.), the resulting memory traffic and scheduling overheads
induce severe bottlenecks. An early prototype of Enoki provided a
TorchArray<T>
type that carried out operations using PyTorch’s eager mode,
and the low performance of this combination eventually motivated us to develop
the technique based on JIT compilation introduced in the previous section.
The second operational mode requires an up-front specification of the complete
computation graph to generate a single optimized GPU kernel (e.g. via XLA in
TensorFlow and jit.trace
in PyTorch). This is feasible for neural networks,
whose graph specification is very regular and typically only consists of a few
hundred operations. Simulation code, on the other hand, involves much larger
graphs, whose structure is unpredictable: program execution often involves
randomness, which could cause jumps to almost any part of the system. The full
computation graph would simply be the entire codebase (potentially on the order
of hundreds of thousands lines of code), which is of course far too big.
Enoki’s approach could be interpreted as a middle ground between the two extremes discussed above. Graphs are created on the fly during a simulation, and can be several orders of magnitude larger compared to typical neural networks. They consist mostly of unstructured and comparably simple arithmetic that is lazily fused into optimized CUDA kernels. Since our system works without an up-front specification of the full computation graph, it must support features like dynamic indirection via virtual function calls that can simultaneously branch to multiple different implementations. The details of this are described in the section on function calls.
Note that that there are of of course many use cases where PyTorch, Tensorflow, etc. are vastly superior to Enoki, and it is often a good idea to combine the two in such cases (e.g. to feed the output of a differentiable simulation into a neural network).
One last related framework is ArrayFire, which provides a JIT compiler that
lazily fuses instructions similar to our CUDAArray<T>
type. ArrayFire
targets a higher-level language (C), but appears to be limited to fairly small
kernels (100 operations by default), and does not support a mechanism for
automatic differentiation. In contrast, Enoki emits an intermediate
representation (PTX) and fuses instructions into comparatively larger kernels
that often exceed 100K instructions.