Enoki: structured vectorization and differentiation on modern processor architectures


Introduction

Enoki is a C++ template library that enables automatic transformations of numerical code, for instance to create a “wide” vectorized variant of an algorithm that runs on the CPU or GPU, or to compute gradients via transparent forward/reverse-mode automatic differentiation.

The core parts of the library are implemented as a set of header files with no dependencies other than a sufficiently C++17-capable compiler (GCC >= 8.2, Clang >= 7.0, Visual Studio >= 2017). Enoki code reduces to efficient SIMD instructions available on modern CPUs and GPUs—in particular, Enoki supports:

  • Intel: AVX512, AVX2, AVX, and SSE4.2,
  • ARM: NEON/VFPV4 on armv7-a, Advanced SIMD on 64-bit armv8-a,
  • NVIDIA: CUDA via a Parallel Thread Execution (PTX) just-in-time compiler,
  • Fallback: a scalar fallback mode ensures that programs still run even if none of the above are available.

Deploying a program on top of Enoki usually serves three goals:

  1. Enoki ships with a convenient library of special functions and data structures that facilitate implementation of numerical code (vectors, matrices, complex numbers, quaternions, etc.).

  2. Programs built using these can be instantiated as wide versions that process many arguments at once (either on the CPU or the GPU).

    Enoki is also structured in the sense that it handles complex programs with custom data structures, lambda functions, loadable modules, virtual method calls, and many other modern C++ features.

  3. If derivatives are desired (e.g. for stochastic gradient descent), Enoki performs transparent forward or reverse-mode automatic differentiation of the entire program.

Finally, Enoki can do all of the above simultaneously: if desired, it can compile the same source code to multiple different implementations (e.g. scalar, AVX512, and CUDA+autodiff).

Motivation

The development of this library was prompted by the author’s frustration with the current vectorization landscape:

  1. Auto-vectorization in state-of-the-art compilers is inherently local. A computation whose call graph spans separate compilation units (e.g. multiple shared libraries) simply can’t be vectorized.

  2. Data structures must be converted into a Structure of Arrays (SoA) layout to be eligible for vectorization.

    _images/intro-01.svg

    This is analogous to performing a matrix transpose of an application’s entire memory layout—an intrusive change that is likely to touch almost every line of code.

  3. Parts of the application likely have to be rewritten using intrinsic instructions, which is going to look something like this:

    _images/intro-02.svg

    Intrinsics-heavy code is challenging to read and modify once written, and it is inherently non-portable. CUDA provides a nice language environment for programming GPUs but does nothing to help with the other requirements (vectorization on CPUs, automatic differentiation).

  4. Vectorized transcendental functions (exp, cos, erf, etc.) are not widely available. Intel, AMD, and CUDA provide proprietary implementations, but many compilers don’t include them by default.

  5. It is desirable to retain both scalar and vector versions of an algorithm, but ensuring their consistency throughout the development cycle becomes a maintenance nightmare.

  6. Domain-specific languages (DSLs) for vectorization such as ISPC address many of the above issues but assume that the main computation underlying an application can be condensed into a compact kernel that is implementable using the limited language subset of the DSL (e.g. plain C in the case of ISPC).

    This is not the case for complex applications, where the “kernel” may be spread out over many separate modules involving high-level language features such as functional or object-oriented programming.

What Enoki does differently

Enoki addresses these issues and provides a complete solution for vectorizing and differentiating modern C++ applications with nontrivial control flow and data structures, dynamic memory allocation, virtual method calls, and vector calls across module boundaries. It has the following design goals:

  1. Unobtrusive. Only minor modifications are necessary to convert existing C++ code into its Enoki-vectorized equivalent, which remains readable and maintainable.

  2. No code duplication. It is generally desirable to provide both scalar and vectorized versions of an API, e.g. for debugging, and to preserve compatibility with legacy code. Enoki code extensively relies on class and function templates to achieve this goal without any code duplication—the same code template can be leveraged to create scalar, CPU SIMD, and GPU implementations, and each variant can provide gradients via automatic differentiation if desired.

  3. Custom data structures. Enoki can also vectorize custom data structures. All the hard work (e.g. conversion to SoA format) is handled by the C++17 type system.

  4. Function calls. Vectorized calls to functions in other compilation units (e.g. a dynamically loaded plugin) are possible. Enoki can even vectorize method or virtual method calls (e.g. instance->my_function(arg1, arg2, ...); when instance turns out to be an array containing many different instances).

  5. Mathematical library. Enoki includes an extensive mathematical support library with complex numbers, matrices, quaternions, and related operations (determinants, matrix inversion, etc.). A set of transcendental and special functions supports real, complex, and quaternion-valued arguments in single and double precision using polynomial or rational polynomial approximations, generally with an average error of \(<\!\frac{1}{2}\) ULP on their full domain. These include exponentials, logarithms, and trigonometric and hyperbolic functions, as well as their inverses. Enoki also provides real-valued versions of error function variants, Bessel functions, the Gamma function, and various elliptic integrals.

    _images/intro-03.png

    Importantly, all of this functionality is realized using the abstractions of Enoki, which means that it transparently composes with vectorization, the JIT compiler for generating CUDA kernels, automatic differentiation, etc.

  6. Portability. When creating vectorized CPU code, Enoki supports arbitrary array sizes that don’t necessarily match what is supported by the underlying hardware (e.g. 16 x single precision on a machine whose SSE vector units only support 4 x single precision operands). The library uses template metaprogramming techniques to efficiently map array expressions onto the available hardware resources. This greatly simplifies development because it’s enough to write a single implementation of a numerical algorithm that can then be deployed on any target architecture. There are non-vectorized fallbacks for everything, thus programs will run even on unsupported architectures (albeit without the performance benefits of vectorization).

  7. Modular architecture. Enoki is split into two major components: the front-end provides various high-level array operations, while the back-end provides the basic ingredients that are needed to realize these operations using the SIMD instruction set(s) supported by the target architecture. Backends can also transform arithmetic, e.g. to perform automatic differentiation.

    The CPU vector back-ends, for example, make heavy use of SIMD intrinsics to ensure that compilers generate efficient machine code. The intrinsics are contained in separate back-end header files (e.g. array_avx.h for AVX intrinsics), which provide rudimentary arithmetic and bit-level operations. Fancier operations (e.g. atan2) use the back-ends as an abstract interface to the hardware, which means that it’s simple to support other instruction sets such as a hypothetical future AVX1024 or even an entirely different architecture (e.g. a DSP chip) by just adding a new back-end.

  8. License. Enoki is available under a non-viral open source license (3-clause BSD).

About

This project was created by Wenzel Jakob. It is named after Enokitake, a type of mushroom with many long and parallel stalks reminiscent of data flow in vectorized arithmetic.

Enoki is the numerical foundation of version 2 of the Mitsuba renderer, though it is significantly more general and should be a trusty tool for a variety of simulation and optimization problems.

When using Enoki in academic projects, please cite

@misc{Enoki,
   author = {Wenzel Jakob},
   year = {2019},
   note = {https://github.com/mitsuba-renderer/enoki},
   title = {Enoki: structured vectorization and differentiation on modern processor architectures}
}

A quick demonstration

This section contains a quick and somewhat contrived demonstration that illustrates a number of basic Enoki features. The remainder of the documentation contains a more systematic introduction of this functionality.

Consider the function

\[\begin{split}f(x)=\begin{cases} 12.92x, &x \le 0.0031308\\ 1.055x^{1/2.4} -0.055, &\mathrm{otherwise,} \end{cases}\end{split}\]

which converts a linear color value into one that can be displayed on a modern display following the sRGB standard—this is known as gamma correction. A standard C++ implementation of this function might look as follows:

float srgb_gamma(float x) {
   if (x <= 0.0031308f)
      return x * 12.92f;
   else
      return 1.055f * std::pow(x, 1.f / 2.4f) - 0.055f;
}

An Enoki implementation of the same computation first replaces float by a generic Value type. The second difference is that scalar conditionals are replaced by generalized expressions involving masks.

template <typename Value> Value srgb_gamma(Value x) {
   return enoki::select(
      x <= 0.0031308f,
      x * 12.92f,
      1.055f * enoki::pow(x, 1.f / 2.4f) - 0.055f
   );
}

Vectorization

The simple generalization from normal C++ code to an Enoki function template enables a number of interesting applications. For instance, the function automatically extends to cases where Value is a color type with three components, in which case the arithmetic operations recursively thread through the array.

using Color3f = enoki::Array<float, 3>;

Color3f input = /* ... */;
Color3f output = srgb_gamma(input);

Arrays can be nested arbitrarily: the following snippet declares a 16-wide FloatP “packet” type (hence the “P” suffix) and uses it to construct a new type storing 16 colors that will all be processed in parallel.

using FloatP   = enoki::Array<float, 16>;
using Color3fP = enoki::Array<FloatP, 3>;

Color3fP input = /* ... */;
Color3fP output = srgb_gamma(input);

If this code is compiled on a machine supporting the SSE4.2, AVX, AVX2, or AVX512 instruction set extensions, these vector instructions will be leveraged to carry out the computation more efficiently.
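To make the data parallelism explicit, here is a minimal sketch that gamma-corrects a large buffer stored in SoA layout, 16 colors per loop iteration. The buffers data_r, data_g, data_b and the count n (assumed to be a multiple of 16) are hypothetical, and the load/store functions used here are introduced later in this document:

/* Hypothetical SoA buffers: 'data_r', 'data_g', 'data_b' each point
   to 'n' floats, where 'n' is a multiple of the packet width (16) */
for (size_t i = 0; i < n; i += 16) {
    Color3fP c(enoki::load_unaligned<FloatP>(data_r + i),
               enoki::load_unaligned<FloatP>(data_g + i),
               enoki::load_unaligned<FloatP>(data_b + i));

    c = srgb_gamma(c);

    enoki::store_unaligned(data_r + i, c.x());
    enoki::store_unaligned(data_g + i, c.y());
    enoki::store_unaligned(data_b + i, c.z());
}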

Execution on the GPU

Vectorization is not restricted to the CPU—for instance, the following type declarations create a special array that is resident in GPU memory. In this mode of operation, Enoki relies on an internal just-in-time compiler to generate efficient CUDA kernels on the fly.

using FloatC   = enoki::CUDAArray<float>;
using Color3fC = enoki::Array<FloatC, 3>;

Color3fC input = /* ... */;
Color3fC output = srgb_gamma(input);

Enoki’s CUDAArray<T> type applies an important optimization that leads to significantly improved performance: in contrast to the previous examples, the function call srgb_gamma(input) now merely records the sequence of computations that is needed to determine the value of output but does not yet execute it.

Eventually, this evaluation can no longer be postponed (e.g. when we try to access or print the array contents). At this point, Enoki’s JIT backend compiles and executes a kernel that contains all queued computations using NVIDIA’s PTX intermediate representation. All of this happens automatically: in particular, no CUDA-specific rewrite (e.g. to nvcc compatible kernels) of the program is necessary!

Automatic differentiation

Enoki can also apply transparent forward or reverse-mode automatic differentiation to a program using a special enoki::DiffArray<T> array that wraps a number type or another Enoki array T.

For instance, the following example computes the gradient of a loss function that measures the L2 distance from a given gamma-corrected color value. Both primal and gradient-related computations involve GPU-resident arrays, and the resulting computation is queued up as in the previous example using Enoki’s just-in-time compiler.

using FloatC   = enoki::CUDAArray<float>;
using FloatD   = enoki::DiffArray<FloatC>;
using Color3fD = enoki::Array<FloatD, 3>;

Color3fD input = /* ... */;
enoki::set_requires_gradient(input);

Color3fD output = srgb_gamma(input);

FloatD loss = enoki::norm(output - Color3fD(.1f, .2f, .3f));
enoki::backward(loss);

std::cout << enoki::gradient(input) << std::endl;
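Forward-mode differentiation follows the same pattern, except that derivatives are propagated from an input towards the outputs. A minimal sketch for a single scalar input (reusing the FloatD alias from above):

FloatD x = 1.f;
enoki::set_requires_gradient(x);

FloatD y = srgb_gamma(x);

/* Propagate derivatives forward from 'x' ... */
enoki::forward(x);

/* ... and read off dy/dx */
std::cout << enoki::gradient(y) << std::endl;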

The scalar case

All Enoki functions also accept non-array arguments, hence the original scalar implementation remains available:

float input = /* ... */;
float output = srgb_gamma(input);

Python bindings

Modern C++ systems often strive to provide fine-grained Python bindings to facilitate rapid prototyping and interoperability with other software. Enoki is designed to work with the widely used pybind11 library (itself based on template metaprogramming) to facilitate this. Exposing an Enoki function on the Python side is usually a 1-liner, even for the “fancy” GPU+autodiff variants, as in the following example:

/// Create python bindings with 2 overloads (here, 'm' is a py::module)
m.def("srgb_gamma", &srgb_gamma<float>);
m.def("srgb_gamma", &srgb_gamma<Color3fD>);

Summary

In summary, Enoki together with a generalized template implementation of a computation enables several powerful transformations:

  1. A simple type substitution yields an equivalent vectorized computation that leverages vector instructions on modern processor architectures.
  2. Symbolic execution of the computation using a just-in-time compiler yields efficient kernels that run on NVIDIA GPUs.
  3. Further type transformations enable tracking of derivatives through a calculation, either on the CPU or the GPU.
  4. The above transformations can all be deduced from the type of the resulting functions. This is an ideal fit for metaprogramming-based libraries like pybind11 which inspect the type of a function to generate high-quality binding code.

There are many missing pieces that weren’t discussed in this basic overview: how to handle more complex control flow, types and data structures, virtual method calls, and so on. The remainder of this documentation provides a more systematic overview of these topics.

Basics

The remainder of this document provides a basic overview of the Enoki library. All code snippets assume that the following lines are present:

#include <iostream>
#include <enoki/array.h>

/* Don't forget to include the 'enoki' namespace */
using namespace enoki;

Note

The next few sections will introduce the basics of Enoki in the context of vectorization for CPUs. Later sections will discuss vectorization via GPU arrays and Automatic differentiation. It is recommended that you read the following sections even if you are mainly interested in the latter two topics.

Static arrays

The most important data structure in this library is enoki::Array, a generic container that stores a fixed-size array of an arbitrary data type. This is somewhat similar to the standard template library class std::array. The main distinction between the two is that enoki::Array forwards all C++ operators (and other standard mathematical functions) to the contained elements.

_images/basics-01.svg

For instance, the snippet

using StrArray = Array<std::string, 2>;

StrArray x("Hello ", "How are "),
         y("world!", "you?");

// Prints: "[Hello world!,  How are you?]"
std::cout << x + y << std::endl;

is valid because Array::operator+() can carry out the addition by invoking std::string::operator+() (this would not compile if std::array was used, since it has no such semantics).

Enoki arrays also have support for a convenient feature that is commonly known as broadcasting—in this simple example, broadcasting is triggered if we add a raw string corresponding to an array of dimension zero. Fancier applications of broadcasting are discussed later.

// Prints: "[Hello you,  How are you]"
std::cout << x + "you" << std::endl;

The real use case of Enoki arrays, however, involves arrays of integer or floating point values, for which arithmetic operations can often be reduced to fast vectorized instructions provided by current processor architectures.

The library ships with partial template overloads that become active when the Type and Size parameters supplied to the enoki::Array<Type, Size> template correspond to combinations that are natively supported by the targeted hardware. For instance, the template overloads for single precision arrays look as follows:

_images/basics-02.svg

Altogether, Enoki currently supports the ARM NEON, SSE4.2, AVX, AVX2, and AVX512 instruction sets and vectorizes arithmetic involving single and double precision floating point values as well as signed and unsigned 32-bit and 64-bit integers.

It is worth pointing out that enoki::Array does not require Size to exactly match what is supported by the hardware to benefit from vectorization. Enoki relies on template metaprogramming techniques to ensure optimal code generation even in such challenging situations. For instance, arithmetic operations involving a hypothetical Array<float, 27> type will generate one AVX512 instruction [1], one AVX instruction, and one 4-wide SSE instruction that leaves the last entry unused.

_images/basics-03.svg

A perhaps more sensible use of this feature is to instantiate packed arrays with a Size that is an integer multiple of what is supported natively as a way of aggressively unrolling the underlying computations. If Size is not specified, the implementation chooses a reasonable value (4 on machines with SSE/NEON, 8 on machines with AVX, and 16 on AVX512).
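For example, the following sketch relies on the default to obtain a packet type matching the hardware and then doubles it to unroll each operation (it assumes that the static member Size exposes the chosen width):

/* Let Enoki pick the native packet size (e.g. 8 on an AVX machine) */
using FloatP  = Array<float>;

/* Twice the native width: each operation unrolls into two instructions */
using FloatP2 = Array<float, 2 * FloatP::Size>;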

Initializing, reading, and writing data

Arrays can be initialized by broadcasting a scalar value, or by specifying the values of individual entries. The snippets below use a 4-wide array type named MyFloat:

using MyFloat = Array<float, 4>;

/* Initialize all entries with a constant */
MyFloat f1(1.f);

/* Initialize the entries individually */
MyFloat f2(1.f, 2.f, 3.f, 4.f);

The enoki namespace also contains a large number of global functions that create or manipulate Enoki arrays in various ways. One example is the enoki::load() function, which is the method of choice to initialize an array with data that is currently stored in memory:

float *mem = /* ... pointer to floating point data ... */;
MyFloat f3;

/* Load entries of 'f3' from 'mem' */
f3 = load<MyFloat>(mem);           /* if known that 'mem' is aligned */
f3 = load_unaligned<MyFloat>(mem); /* otherwise */

Both enoki::load() and enoki::load_unaligned() are template functions that load an array of the specified type (MyFloat in this case) from a given address in memory. The first indicates that the memory address is aligned to a multiple of alignof(MyFloat), which is equal to 16 bytes in this example. It is a good idea to align data and use aligned versions of operations, since this reduces the number of cache lines that must be accessed.
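For illustration, here is one way of obtaining suitably aligned storage using only standard C++17 facilities; this is a sketch with a hypothetical buffer 'buf', not Enoki-specific code (note that std::aligned_alloc requires the allocation size to be a multiple of the alignment):

#include <cstdlib>

/* Allocate 1024 floats whose address is a multiple of alignof(MyFloat) */
float *buf = static_cast<float *>(
    std::aligned_alloc(alignof(MyFloat), 1024 * sizeof(float)));

f3 = load<MyFloat>(buf); /* aligned load is now safe */

std::free(buf);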

Warning

Performing an aligned load from an unaligned memory address will cause a general protection fault that immediately terminates the application.

Note

It is generally desirable to use high-level Enoki template functions like enoki::load() whenever possible since they are designed to produce the most efficient instruction sequences for the specified target architecture. In this case, we could also have written

f3 = MyFloat(mem[0], mem[1], mem[2], mem[3]);

which is functionally equivalent—however, it is not guaranteed that the compiler will be able to exploit the equivalence to generate optimal code in this case.

An analogous pair of operations stores the contents of arrays in memory:

/* Store entries of 'f3' to 'mem' */
store(mem, f3);                    /* if known that 'mem' is aligned */
store_unaligned(mem, f3);          /* otherwise */

Note that load() and load_unaligned() require the target type as a template parameter, while the stores do not, since the type can be inferred from the provided value.

Gather, scatter, and scatter-add operations are also supported using a similar pattern:

/* 32 and 64 bit integers are supported as indices for scatter/gather operations */
Array<int, 4> idx(1, 2, 3, 4);

/* Gather f3 from mem -- this is equivalent to
   setting f3[i] = mem[idx[i]] (i = 0, 1, ..) */
f3 = gather<MyFloat>(mem, idx);

/* Scatter f3 to mem -- this is equivalent to
   setting mem[idx[i]] = f3[i] (i = 0, 1, ..) */
scatter(mem, f3, idx);

/* Atomic scatter-add of f3 to mem -- this is equivalent to
   setting mem[idx[i]] += f3[i] (i = 0, 1, ..). The operation
   is atomic in the sense that it gives the correct results
   if 'idx' refers to the same index twice. */
scatter_add(mem, f3, idx);
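As a small illustration of the atomic property, the following sketch accumulates a histogram; the duplicate index '1' is handled correctly:

float hist[4] = { 0.f, 0.f, 0.f, 0.f };
Array<int, 4> bins(1, 1, 2, 3);

/* Add 1 to each target bin -- 'hist' becomes [0, 2, 1, 1] */
scatter_add(hist, MyFloat(1.f), bins);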

Finally, the following initialization methods also exist:

/* Efficient way to create an array of any type filled with zero entries */
f1 = zero<MyFloat>();

/* Initialize entries with index sequence 0, 1, 2, ... */
f1 = arange<MyFloat>();

/* Initialize entries with a linearly increasing sequence with endpoints 0 and 1 */
f1 = linspace<MyFloat>(0.f, 1.f);

Element access

The components of Array can be accessed via operator[]. If you find yourself using this operator frequently, your code is likely not making good use of the vector units.

f2[2] = 1.f;

Alternatively, the functions x(), y(), z(), and w() can be used to access the first four components. The following line is equivalent to the one above.

f2.z() = 1.f;

Components of a vector can be efficiently reordered using the following syntax:

// f1: [0, 10, 20, 30]
f2 = shuffle<0, 2, 1, 3>(f1);
// f2: [0, 20, 10, 30]

Finally, Enoki provides an overloaded operator<<(std::ostream&, ...) stream insertion operator to facilitate the inspection of array contents:

/* The line below prints: [1, 2, 3, 4] */
std::cout << MyFloat(1.f, 2.f, 3.f, 4.f) << std::endl;

Vertical operations

Enoki provides the following vertical operations. The word vertical implies that they are independently applied to all array elements.

/* Basic arithmetic operations */
f1 *= (f2 + 1.f) / (f2 - 1.f);

/* Basic math library functions */
f2 = ceil(f1); f2 = floor(f1); f2 = round(f1); f2 = trunc(f1);
f2 = abs(f1);  f2 = sqrt(f1); f2 = cbrt(f1); f2 = sign(f1);
f2 = min(f1, f2); f2 = max(f1, f2);

/* Fused multiply-add/subtract */
f1 = fmadd(f1, f2, f3); /* f1 * f2 + f3 */
f1 = fmsub(f1, f2, f3); /* f1 * f2 - f3 */
f1 = fnmadd(f1, f2, f3); /* -f1 * f2 + f3 */
f1 = fnmsub(f1, f2, f3); /* -f1 * f2 - f3 */

/* Efficient reciprocal and reciprocal square root */
f1 = rcp(f1); f1 = rsqrt(f1);

/* Trigonometric and inverse trigonometric functions */
f2 = sin(f1);   f2 = cos(f1);    f2 = tan(f1);
f2 = csc(f1);   f2 = sec(f1);    f2 = cot(f1);
f2 = asin(f1);  f2 = acos(f1);   f2 = atan(f1);
f2 = atan2(f1, f2);
auto [s, c] = sincos(f1);

/* Hyperbolic and inverse hyperbolic functions */
f2 = sinh(f1);  f2 = cosh(f1);  f2 = tanh(f1);
f2 = csch(f1);  f2 = sech(f1);  f2 = coth(f1);
f2 = asinh(f1); f2 = acosh(f1); f2 = atanh(f1);
auto [sh, ch] = sincosh(f1);

/* Exponential function, natural logarithm, power function */
f2 = exp(f1);   f2 = log(f1);   f2 = pow(f1, f2);

/* Exponent/mantissa manipulation */
f1 = ldexp(f1, f2);
auto [mant, exp] = frexp(f1);

/* Special functions */
f2 = erf(f1); f2 = erfinv(f1); f2 = erfi(f1);
f2 = i0e(f1); f2 = dawson(f1);

f1 = comp_ellint_1(f1);     f1 = ellint_1(f1, f2);
f1 = comp_ellint_2(f1);     f1 = ellint_2(f1, f2);
f1 = comp_ellint_3(f1, f2); f1 = ellint_3(f1, f2, f3);

/* Bit shifts and rotations (only for integer arrays) */
i1 = sl<3>(i1);   i1 = sr<3>(i1);   /* Shift by a compile-time constant ("immediate") */
i1 = i1 >> i2;    i1 = i1 << i2;    /* Element-wise shift by a variable amount */
i1 = rol<3>(i1);  i1 = ror<3>(i1);  /* Rotate by a compile-time constant ("immediate") */
i1 = rol(i1, i2); i1 = ror(i1, i2); /* Element-wise rotation by a variable amount */

/* Trailing/leading zero count, population count (only for integer arrays) */
i1 = lzcnt(i1);  i1 = tzcnt(i1);  i1 = popcnt(i1);

Casting

A cast is another type of vertical operation. Enoki performs efficient conversion between any pair of types using native conversion instructions whenever possible:

using Source = Array<int64_t, 32>;
using Target = Array<double, 32>;

Source source = ...;
Target target(source);

Horizontal operations

In contrast to the above vertical operations, the following horizontal operations consider the entries of a packed array jointly and return a scalar.

_images/basics-04.svg

Depending on the size of the array, these are implemented using between \(\log_2(N)\) and \(N-1\) vertical reduction operations and shuffles. Horizontal operations should generally be avoided since they don’t fully utilize the hardware vector units (ways of avoiding them are discussed later).

/* Horizontal sum, equivalent to f1[0] + f1[1] + f1[2] + f1[3] */
float s0 = hsum(f1);

/* Horizontal product, equivalent to f1[0] * f1[1] * f1[2] * f1[3] */
float s1 = hprod(f1);

/* Horizontal minimum, equivalent to std::min({ f1[0], f1[1], f1[2], f1[3] }) */
float s2 = hmin(f1);

/* Horizontal maximum, equivalent to std::max({ f1[0], f1[1], f1[2], f1[3] }) */
float s3 = hmax(f1);

/* Horizontal mean, equivalent to (f1[0] + f1[1] + f1[2] + f1[3]) / 4.f */
float s4 = hmean(f1);

The following linear algebra primitives are also realized in terms of horizontal operations:

/* Dot product of two arrays */
float dp = dot(f1, f2);

/* For convenience: absolute value of the dot product */
float adp = abs_dot(f1, f2);

/* Squared 2-norm of a vector */
float sqn = squared_norm(f1);

/* 2-norm of a vector */
float nrm = norm(f1);

Working with masks

Comparisons involving Enoki types are generally applied component-wise and produce a mask representing the outcome of the comparison. The internal representation of a mask is an implementation detail that varies widely from architecture to architecture – an overview is given in the section on Architectural differences handled by Enoki.

Masks enable powerful branchless logic in combination with a range of other bit-level operations. The following snippets show some example usage of mask types:

auto mask = f1 > 1.f;

/* Bit-level AND operation: Zero out entries where the comparison was false */
f1 &= mask;

Masks can be combined in various ways:

mask ^= (f1 > cos(f2)) | ~(f2 <= f1);

The following range tests also generate masks:

mask = isnan(f1);    /* Per-component NaN test */
mask = isinf(f1);    /* Per-component +/- infinity test */
mask = isfinite(f1); /* Per-component test for finite values */

Note

Using the -ffast-math compiler option may break detection of NaN values, and so is typically not recommended.

Enoki provides a number of helpful trait classes to access array-related types. For instance, enoki::mask_t determines the mask type associated with an array, which permits replacing the auto statement above.

using FloatMask = mask_t<MyFloat>;
FloatMask mask = f1 > 1.f;
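Two other frequently used traits are value_t and scalar_t, sketched below (assuming the nested Color3fP alias from the earlier demonstration):

using Value  = value_t<Color3fP>;   /* = FloatP: type of one component */
using Scalar = scalar_t<Color3fP>;  /* = float: innermost scalar type */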

A comprehensive list of type traits is available in the reference. Similar to the horizontal operations for addition and multiplication involving arithmetic arrays, mask arrays also provide a set of horizontal operations:

/* Do *all* entries have a mask value corresponding to 'true'? */
bool mask_all_true  = all(mask);

/* Do *some* entries have a mask value corresponding to 'true'? */
bool mask_some_true = any(mask);

/* Do *none* of the entries have a mask value corresponding to 'true'? */
bool mask_none_true = none(mask);

/* Count *how many* entries have a mask value corresponding to 'true'? */
size_t true_count = count(mask);

Note

Equality and inequality comparisons are a special case: following the principle of least surprise, enoki::operator==() and enoki::operator!=() return a boolean value (i.e. they internally perform a horizontal reduction). Vertical comparison operators named eq() and neq() are also available. The following pairs of operations are thus equivalent:

MyFloat f1 = ..., f2 = ...;

bool b1 = (f1 == f2);
bool b2 = all(eq(f1, f2));

bool b3 = (f1 != f2);
bool b4 = any(neq(f1, f2));

One of the most useful bit-level operations is select(), which chooses between two arguments using a mask. This is extremely useful for writing branch-free code. The argument order matches the C ternary operator, i.e. condition ? true_value : false_value maps to select(condition, true_value, false_value).

f1 = select(f1 < 0.f, f1, f2);

/* The above select() statement is equivalent to the following less efficient expression */
f1 = ((f1 < 0.f) & f1) | (~(f1 < 0.f) & f2);

Enoki also provides a special masked assignment operator, which updates entries of an array matching the given mask:

f1[f1 > 0.f] = f2;
f1[f1 < 0.f] += 1.f;

Compared to select(), a masked update may generate slightly more efficient code on some platforms. Apart from this, the two approaches can be used interchangeably. An alternative syntax involving the function enoki::masked() also exists:

masked(f1, f1 > 0.f) = f2;
masked(f1, f1 < 0.f) += 1.f;

This is functionally equivalent to the previous example. The enoki::masked() syntax exists because it extends to cases where f1 is scalar, i.e. not an Enoki array. Using Enoki functions with scalar arguments will be discussed later.
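For instance, the following sketch applies masked updates to a plain float:

float x = -1.f;
masked(x, x < 0.f) += 1.f; /* 'x' is now 0.f */
masked(x, x > 0.f) = 2.f;  /* no-op: the condition is false */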

The special case of 3D arrays

Because data of dimension 3 occurs frequently (spatial coordinates, color information, …) and generally also benefits from vectorization, Enoki represents 3-vectors in packed arrays of size 4, leaving the last component unused. Vertical operations are applied to the entire array including the fourth component, while horizontal operations skip the last component.

An efficient cross product operation realized using shuffles is available for 3-vectors:

f1 = cross(f1, f2);

Generally, a better way to work with 3D data while achieving much greater instruction level parallelism is via nested arrays and the Structure of Arrays (SoA) approach discussed next.

Footnotes

[1] Different combinations are used when not all of these instruction sets are available.

Nested arrays

Motivation

Application data is often not arranged in a way that is conducive to efficient vectorization. For instance, vectors in a 3D dataset have too few dimensions to fully utilize the SIMD lanes of modern hardware. Scalar or horizontal operations like dot products lead to similar inefficiencies if used frequently. In such situations, it is preferable to switch to a Structure of Arrays (SoA) representation based on nested arrays, which provides a way of converting scalar, horizontal, and low-dimensional vector arithmetic into vertical operations that fully utilize all SIMD lanes.

To understand the fundamental problem, consider the following basic example code, which computes the normalized cross product of a pair of 3D vectors. Without Enoki, this might be done as follows:

struct Vector3f {
    float x;
    float y;
    float z;
};

Vector3f normalize(const Vector3f &v) {
    float scale = 1.0f / std::sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
    return Vector3f{v.x * scale, v.y * scale, v.z * scale};
}

Vector3f cross(const Vector3f &v1, const Vector3f &v2) {
    return Vector3f{
        (v1.y * v2.z) - (v1.z * v2.y),
        (v1.z * v2.x) - (v1.x * v2.z),
        (v1.x * v2.y) - (v1.y * v2.x)
    };
}

Vector3f normalized_cross(Vector3f a, Vector3f b) {
    return normalize(cross(a, b));
}

Clang compiles the normalized_cross() function into the following fairly decent scalar assembly (clang -O3 -msse4.2 -mfma -ffp-contract=fast -fomit-frame-pointer):

__Z16normalized_cross8Vector3fS_:
        vmovshdup   xmm4, xmm0           ; xmm4 = xmm0[1, 1, 3, 3]
        vmovshdup   xmm5, xmm2           ; xmm5 = xmm2[1, 1, 3, 3]
        vinsertps   xmm6, xmm3, xmm1, 16 ; xmm6 = xmm3[0], xmm1[0], xmm3[2, 3]
        vblendps    xmm7, xmm2, xmm0, 2  ; xmm7 = xmm2[0], xmm0[1], xmm2[2, 3]
        vpermilps   xmm7, xmm7, 225      ; xmm7 = xmm7[1, 0, 2, 3]
        vinsertps   xmm1, xmm1, xmm3, 16 ; xmm1 = xmm1[0], xmm3[0], xmm1[2, 3]
        vblendps    xmm3, xmm0, xmm2, 2  ; xmm3 = xmm0[0], xmm2[1], xmm0[2, 3]
        vpermilps   xmm3, xmm3, 225      ; xmm3 = xmm3[1, 0, 2, 3]
        vmulps      xmm1, xmm1, xmm3
        vfmsub231ps xmm1, xmm6, xmm7
        vmulss      xmm2, xmm4, xmm2
        vfmsub231ss xmm2, xmm0, xmm5
        vmovshdup   xmm0, xmm1           ; xmm0 = xmm1[1, 1, 3, 3]
        vmulss      xmm0, xmm0, xmm0
        vfmadd231ss xmm0, xmm1, xmm1
        vfmadd231ss xmm0, xmm2, xmm2
        vsqrtss     xmm0, xmm0, xmm0
        vmovss      xmm3, dword ptr [rip + LCPI0_0] ; xmm3 = 1.f, 0.f, 0.f, 0.f
        vdivss      xmm3, xmm3, xmm0
        vmovsldup   xmm0, xmm3           ; xmm0 = xmm3[0, 0, 2, 2]
        vmulps      xmm0, xmm1, xmm0
        vmulss      xmm1, xmm2, xmm3
        ret

However, note that only 50% of the above 22 instructions perform actual arithmetic (which is scalar, i.e. low throughput), with the remainder being spent on unpacking and re-shuffling data.

A first attempt

Simply rewriting this code using Enoki leads to considerable improvements:

/* Enoki version */
using Vector3f = Array<float, 3>;

Vector3f normalized_cross(Vector3f a, Vector3f b) {
    return normalize(cross(a, b));
}

; Assembly for Enoki version
__Z16normalized_cross8Vector3fS_:
    vpermilps   xmm2, xmm0, 201 ; xmm2 = xmm0[1, 2, 0, 3]
    vpermilps   xmm3, xmm1, 210 ; xmm3 = xmm1[2, 0, 1, 3]
    vpermilps   xmm0, xmm0, 210 ; xmm0 = xmm0[2, 0, 1, 3]
    vpermilps   xmm1, xmm1, 201 ; xmm1 = xmm1[1, 2, 0, 3]
    vmulps      xmm0, xmm0, xmm1
    vfmsub231ps xmm0, xmm2, xmm3
    vdpps       xmm1, xmm0, xmm0, 113
    vsqrtss     xmm1, xmm1, xmm1
    vmovss      xmm2, dword ptr [rip + LCPI0_0] ; xmm2 = 1.f, 0.f, 0.f, 0.f
    vdivss      xmm1, xmm2, xmm1
    vpermilps   xmm1, xmm1, 0   ; xmm1 = xmm1[0, 0, 0, 0]
    vmulps      xmm0, xmm0, xmm1
    ret

Here, Enoki has organized the 3D vectors as 4D arrays that “waste” the last component, allowing for a more compact sequence of SSE4.2 instructions with fewer shuffles. This is better but still not ideal: of the 12 instructions (a reduction by 50% compared to the previous example), 3 are vectorized, 2 are scalar, and 1 is a (slow) horizontal reduction. The remaining 6 are shuffle and move instructions.

A better solution

The key idea that enables further vectorization of this code is to work on 3D arrays, whose components are themselves arrays. This is known as SoA-style data organization. One group of multiple 3D vectors represented in this way is referred to as a packet.

Since Enoki arrays support arbitrary nesting, it’s straightforward to wrap an existing Array representing a packet of data into another array while preserving the semantics of a 3-dimensional vector at the top level.

_images/nested-01.svg

As before, all mathematical operations discussed so far are trivially supported due to the fundamental behavior of an Enoki array: all operations are simply forwarded to the contained entries (which are themselves arrays now: the procedure continues recursively). The following snippet demonstrates the basic usage of such an approach.

/* Declare an underlying packet type with 4 floats */
using FloatP = Array<float, 4>;

/* NEW: Packet of 3D vectors containing four separate directions */
using Vector3fP = Array<FloatP, 3>;

Vector3fP vec(
   FloatP(1, 2, 3, 4),    /* X components */
   FloatP(5, 6, 7, 8),    /* Y components */
   FloatP(9, 10, 11, 12)  /* Z components */
);

/* Enoki's stream insertion operator detects the recursive array and
   prints the contents as a list of 3D vectors
   "[[1, 5, 9],
     [2, 6, 10],
     [3, 7, 11],
     [4, 8, 12]]" */
std::cout << vec << std::endl;

/* Element access using operator[] and x()/y()/z()/w() now return size-4 packets.
   The statement below prints "[1, 2, 3, 4]" */
std::cout << vec.x() << std::endl;

/* Transcendental functions are applied to all (nested) components independently */
Vector3fP vec2 = sin(vec);

The behavior of horizontal operations changes as well: for instance, the dot product

FloatP dp = dot(vec, vec2);

now creates a size-4 packet of dot products: one for each pair of input 3D vectors. This is simply a consequence of applying the definition of the dot product to the components of the array (which are now arrays).

_images/nested-02.svg

Note how this is a major performance improvement since relatively inefficient horizontal operations have now turned into a series of vertical operations that make better use of the processor’s vector units.

With the above type aliases, the normalized_cross() function now looks as follows:

Vector3fP normalized_cross(Vector3fP a, Vector3fP b) {
    return normalize(cross(a, b));
}

Disregarding the loads and stores that are needed to fetch the operands and write the results, this generates the following assembly:

; Assembly for SoA-style version
__Z16normalized_cross8Vector3fS_:
    vmulps       xmm6, xmm2, xmm4
    vfmsub231ps  xmm6, xmm1, xmm5
    vmulps       xmm5, xmm0, xmm5
    vfmsub213ps  xmm2, xmm3, xmm5
    vmulps       xmm1, xmm1, xmm3
    vfmsub231ps  xmm1, xmm0, xmm4
    vmulps       xmm0, xmm2, xmm2
    vfmadd231ps  xmm0, xmm6, xmm6
    vfmadd231ps  xmm0, xmm1, xmm1
    vsqrtps      xmm0, xmm0
    vbroadcastss xmm3, dword ptr [rip + LCPI0_0]
    vdivps       xmm0, xmm3, xmm0
    vmulps       xmm3, xmm6, xmm0
    vmulps       xmm2, xmm2, xmm0
    vmulps       xmm0, xmm1, xmm0

This is much better: 15 vectorized operations which process four vectors at the same time, while fully utilizing the underlying SSE4.2 vector units. If wider arithmetic is available, it’s of course possible to process many more vectors at the same time.

Enoki will also avoid costly high-latency operations like division and square root if the user indicates that minor approximations are tolerable. The following snippet demonstrates code generation on an ARMv8 NEON machine. Note the use of the frsqrte and frsqrts instructions for the reciprocal square root.

; Assembly for ARM NEON (armv8a) version
__Z16normalized_cross8Vector3fS_:
    fmul    v6.4s, v2.4s, v3.4s
    fmul    v7.4s, v0.4s, v4.4s
    fmls    v6.4s, v0.4s, v5.4s
    fmul    v16.4s, v1.4s, v5.4s
    fmls    v7.4s, v1.4s, v3.4s
    fmul    v0.4s, v6.4s, v6.4s
    fmls    v16.4s, v2.4s, v4.4s
    fmla    v0.4s, v7.4s, v7.4s
    fmla    v0.4s, v16.4s, v16.4s
    frsqrte v1.4s, v0.4s
    fmul    v2.4s, v1.4s, v1.4s
    frsqrts v2.4s, v2.4s, v0.4s
    fmul    v1.4s, v1.4s, v2.4s
    fmul    v2.4s, v1.4s, v1.4s
    frsqrts v0.4s, v2.4s, v0.4s
    fmul    v2.4s, v1.4s, v0.4s
    fmul    v0.4s, v6.4s, v2.4s
    fmul    v1.4s, v7.4s, v2.4s
    fmul    v2.4s, v16.4s, v2.4s

Unrolling the computation further

On current processor architectures, most floating point operations have a latency of \(\sim4-6\) clock cycles. This means that instructions depending on the preceding instruction’s result will generally stall for at least that long.

To alleviate the effects of latency, it can be advantageous to use an array size that is an integer multiple of the system’s SIMD width (e.g. \(2\times\)). This leads to longer instruction sequences with fewer interdependencies, which can improve performance noticeably.

using FloatP = Array<float, 32>;

With the above type definition on an AVX512-capable machine, Enoki would e.g. unroll every 32-wide operation into a pair of 16-wide instructions.

Nested horizontal operations

Horizontal operations (e.g. hsum()) perform a reduction across the outermost dimension, which means that they return arrays instead of scalars when given a nested array as input (the same is also true for horizontal mask operations such as any()).

Sometimes this is not desirable, and Enoki thus also provides nested versions of all horizontal operations, accessible via the _nested suffix. These functions recursively apply horizontal reductions until the result ceases to be an array. For instance, the following function ensures that no element of a packet of 3-vectors contains a Not-a-Number floating point value.

bool check(const Vector3fP &x) {
    return none_nested(isnan(x));
}
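For example, the following sketch contrasts the two variants on the vec packet of 3D vectors declared earlier in this section:

FloatP per_vector = hsum(vec);        /* one sum per 3D vector -> 4-wide packet */
float  total      = hsum_nested(vec); /* recursive reduction -> a single scalar */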

Broadcasting

Enoki performs an automatic broadcast operation whenever a higher-dimensional array is initialized from a lower-dimensional array, or when types of mixed dimension occur in an arithmetic expression.

The figure below illustrates a broadcast of a 3D vector to a rank 4 tensor during an arithmetic operation involving both (e.g. addition).

_images/nested-03.svg

Enoki works its way through the shape descriptors of both arrays, moving from right to left in the above figure (i.e. from the outermost to the innermost nesting level). At each iteration, it checks if the current pair of shape entries match – if they do, the iteration advances to the next entry. Otherwise, a broadcast is performed over that dimension, which is analogous to inserting a 1 into the lower-dimensional array’s shape. When the dimensions of the lower-dimensional array are exhausted, Enoki appends further ones until the dimensions match. In the above example, the array is thus copied to the second dimension, with a broadcast taking place over the last and first two dimensions.

Broadcasting nested types works in the same way. Here, the entries of a \(3\times 3\) array are copied to the first and third dimensions of the output array:

_images/nested-05.svg

Automatic broadcasting is convenient whenever a computation combines both vectorized and non-vectorized data. For instance, it is legal to mix instances of type Vector4f and Vector4fP, defined below, in the same expression.

/* Packet data type */
using FloatP    = Array<float>;

/* 4D vector and packets of 4D vectors */
using Vector4f  = Array<float, 4>;
using Vector4fP = Array<FloatP, 4>;

Vector4f  data1 = ...;
Vector4fP data2 = ...;

/* This is legal -- 'result' is of type Vector4fP */
auto result = data1 + data2;

Warning

Broadcasting can sometimes lead to unexpected and undesirable behavior. The following section explains how this can be avoided.

GPU Arrays

We will now switch gears to GPU arrays, whose operation is markedly different from that of the other previously discussed Enoki array types.

The first major change when working with GPU arrays is that Enoki is no longer a pure header file library. A compilation step becomes necessary, which produces shared libraries against which applications must be linked. The reason for this is to minimize template bloat: GPU arrays involve a certain amount of additional machinery, and it would be wasteful to have to include the underlying implementation in every file of an application that relies on GPU arrays.

Compiling Enoki produces up to three libraries that are potentially of interest:

  1. libenoki-cuda.so: a just-in-time compiler that is used to realize the Enoki backend of the CUDAArray<T> array type discussed in this section.
  2. libenoki-autodiff.so: a library for maintaining a computation graph for automatic differentiation discussed in the next section.
  3. enoki.cpython-37m-x86_64-linux-gnu.so (platform-dependent filename): a Python binding library that provides interoperability with Enoki’s GPU arrays and differentiable GPU arrays.

Enter the following commands to compile all of them:

cd <path-to-enoki>
mkdir build
cd build
cmake -DENOKI_CUDA=ON -DENOKI_AUTODIFF=ON -DENOKI_PYTHON=ON ..
make

For educational reasons, it is instructive to compile Enoki in debug mode, which enables a number of log messages that we will refer to in the remainder of this section. Use the following CMake command to do so:

cmake -DCMAKE_BUILD_TYPE=Debug -DENOKI_CUDA=ON -DENOKI_AUTODIFF=ON -DENOKI_PYTHON=ON ..

Using GPU Arrays in Python

We find it easiest to introduce Enoki’s GPU arrays from within an interactive Python interpreter and postpone the discussion of the C++ interface to the end of this section. We’ll start by importing the Enoki extension into an interactive Python session and setting the log level to a high value via cuda_set_log_level(), which will be helpful in the subsequent discussion.

>>> from enoki import *
>>> cuda_set_log_level(4)

Note

The first time that Enoki is imported on a new machine, it will trigger a kernel pre-compilation step that takes a few seconds.

The Enoki python bindings expose a number of types in the enoki.cuda submodule that correspond to GPU-resident arrays. The following example initializes such an array with a constant followed by a simple addition operation.

>>> from enoki.cuda import Float32 as FloatC
>>> a = FloatC(1)
cuda_trace_append(10): mov.$t1 $r1, 0f3f800000

>>> a = a + a
cuda_trace_append(11 <- 10, 10): add.rn.ftz.$t1 $r1, $r2, $r3

Observe the two cuda_trace_append log messages, which begin to reveal the mechanics underlying the GPU backend: neither of these two operations has actually occurred at this point. Instead, Enoki simply queued this computation for later execution using an assembly-like intermediate language named NVIDIA PTX.

For instance, the first line cuda_trace_append(10): mov.$t1 $r1, 0f3f800000 indicates the creation of a new variable with index 10, and the associated line of PTX will eventually be used to initialize this variable with a binary constant representing the floating point value 1.0. The next cuda_trace_append command introduces a new variable 11 that will record the result of the addition, while keeping track of the dependence on the original variable 10, etc. More complex numerical operations (e.g. a hyperbolic tangent) result in a longer sequence of steps that are similarly enqueued:

>>> a = tanh(a)
cuda_trace_append(12 <- 11): abs.ftz.$t1 $r1, $r2
cuda_trace_append(13): mov.$t1 $r1, 0f3f200000
... 25 lines skipped ...
cuda_trace_append(39 <- 38, 37): sub.rn.ftz.$t1 $r1, $r2, $r3
cuda_trace_append(40 <- 39, 29, 14): selp.$t1 $r1, $r2, $r3, $r4

Eventually, numerical evaluation can no longer be postponed, e.g. when we try to print the array contents:

>>> print(a)
cuda_eval(): launching kernel (n=1, in=0, out=1, ops=31)
.... many lines skipped ...
cuda_jit_run(): cache miss, jit: 541 us, ptx compilation: 43.534, 10 registers
[0.964028]

At this point, Enoki’s JIT backend compiles and launches a kernel that contains all of the computation queued thus far.

The resulting PTX code:
.version 6.3
.target sm_75
.address_size 64

.visible .entry enoki_8a163272(.param .u64 ptr,
                               .param .u32 size) {
    .reg.b8 %b<41>;
    .reg.b16 %w<41>;
    .reg.b32 %r<41>;
    .reg.b64 %rd<41>;
    .reg.f32 %f<41>;
    .reg.f64 %d<41>;
    .reg.pred %p<41>;


    // Grid-stride loop setup
    ld.param.u64 %rd0, [ptr];
    ld.param.u32 %r1, [size];
    mov.u32 %r4, %tid.x;
    mov.u32 %r5, %ctaid.x;
    mov.u32 %r6, %ntid.x;
    mad.lo.u32 %r2, %r5, %r6, %r4;
    setp.ge.u32 %p0, %r2, %r1;
    @%p0 bra L0;

    mov.u32 %r7, %nctaid.x;
    mul.lo.u32 %r3, %r6, %r7;

L1:
    // Loop body

    mov.f32 %f10, 0f3f800000;
    add.rn.ftz.f32 %f11, %f10, %f10;
    mul.rn.ftz.f32 %f12, %f11, %f11;
    mul.rn.ftz.f32 %f13, %f12, %f12;
    mul.rn.ftz.f32 %f14, %f13, %f13;
    mov.f32 %f15, 0fbbbaf0ea;
    mul.rn.ftz.f32 %f16, %f15, %f14;
    mov.f32 %f17, 0f3e088393;
    mov.f32 %f18, 0fbeaaaa99;
    fma.rn.ftz.f32 %f19, %f12, %f17, %f18;
    add.rn.ftz.f32 %f20, %f19, %f16;
    mov.f32 %f21, 0f3ca9134e;
    mov.f32 %f22, 0fbd5c1e2d;
    fma.rn.ftz.f32 %f23, %f12, %f21, %f22;
    fma.rn.ftz.f32 %f24, %f13, %f23, %f20;
    mul.rn.ftz.f32 %f25, %f12, %f11;
    fma.rn.ftz.f32 %f26, %f24, %f25, %f11;
    add.rn.ftz.f32 %f27, %f11, %f11;
    mov.f32 %f28, 0f3fb8aa3b;
    mul.rn.ftz.f32 %f29, %f28, %f27;
    ex2.approx.ftz.f32 %f30, %f29;
    mov.f32 %f31, 0f3f800000;
    add.rn.ftz.f32 %f32, %f30, %f31;
    rcp.approx.ftz.f32 %f33, %f32;
    add.rn.ftz.f32 %f34, %f33, %f33;
    mov.f32 %f35, 0f3f800000;
    sub.rn.ftz.f32 %f36, %f35, %f34;
    abs.ftz.f32 %f37, %f11;
    mov.f32 %f38, 0f3f200000;
    setp.ge.f32 %p39, %f37, %f38;
    selp.f32 %f40, %f36, %f26, %p39;

    // Store register %f40
    ldu.global.u64 %rd8, [%rd0 + 0];
    st.global.f32 [%rd8], %f40;

    add.u32     %r2, %r2, %r3;
    setp.ge.u32 %p0, %r2, %r1;
    @!%p0 bra L1;

L0:
    ret;
}

Internally, Enoki hands the PTX code over to CUDA’s runtime compiler (NVRTC), which performs a second pass that translates from PTX to the native GPU instruction set SASS.

The resulting SASS code:
enoki_8a163272:
    MOV R1, c[0x0][0x28];
    S2R R0, SR_TID.X;
    S2R R3, SR_CTAID.X;
    IMAD R0, R3, c[0x0][0x0], R0;
    ISETP.GE.U32.AND P0, PT, R0, c[0x0][0x168], PT;
@P0 EXIT;
    BSSY B0, `(.L_2);
    ULDC.64 UR4, c[0x0][0x160];
.L_3:
     LDG.E.64.SYS R2, [UR4];
     MOV R5, 0x3f76ca83;
     MOV R7, c[0x0][0x0];
     IMAD R0, R7, c[0x0][0xc], R0;
     ISETP.GE.U32.AND P0, PT, R0, c[0x0][0x168], PT;
     STG.E.SYS [R2], R5;
@!P0 BRA `(.L_3);
     BSYNC B0;
.L_2:
     EXIT ;
.L_4:
     BRA `(.L_4);

This second phase involves a full-fledged optimizing compiler with constant propagation and common subexpression elimination. You can observe this in the previous example because the second snippet is much smaller—in fact, almost all of the computation was optimized away and replaced by a simple constant (\(\tanh(2)\approx 0.964028\)).

Enoki’s approach is motivated by efficiency considerations: most array operations are individually very simple and do not involve a sufficient amount of computation to outweigh overheads related to memory accesses and GPU kernel launches. Enoki therefore accumulates larger amounts of work (potentially hundreds of thousands of individual operations) before creating and launching an optimized GPU kernel. Once evaluated, array contents can be accessed without triggering further computation:

>>> print(a)
[0.964028]

Kernel caching

GPU kernel compilation consists of two steps: the first generates a PTX kernel from the individual operations—this is essentially just string concatenation and tends to be very fast (541 µs in the above example, most of which is caused by printing assembly code onto the console due to the high log level).

The second step (ptx compilation), which converts the PTX intermediate representation into concrete machine code for the installed graphics card, is orders of magnitude slower (43 ms in the above example) but only needs to happen once: whenever the same computation occurs again (e.g. in subsequent iterations of an optimization algorithm), the previously generated kernel is reused:

>>> b = FloatC(1)
>>> b = b + b
>>> b = tanh(b)
>>> print(b)
cuda_eval(): launching kernel (n=1, in=0, out=1, ops=31)
.... many lines skipped ...
cuda_jit_run(): cache hit, jit backend: 550 us
[0.964028]

A more complex example

We now turn to a more complex example: computing the three-dimensional volume of a sphere using Monte Carlo integration. To do so, we create a random number generator RNG that will generate 1 million samples:

>>> from enoki.cuda import PCG32 as PCG32C, UInt64 as UInt64C
>>> rng = PCG32C(UInt64C.arange(1000000))

Here, PCG32 refers to a linear congruential generator from the section on random number generation. We use it to sample three random number vectors from the RNG and create a dynamic array of 3D vectors (Vector3fC).

>>> from enoki.cuda import Vector3f as Vector3fC
>>> v = Vector3fC([rng.next_float32() * 2 - 1 for _ in range(3)])

Finally, we compute a mask that determines which of the uniformly distributed vectors on the set \([-1, 1]^3\) lie within the unit sphere:

>>> inside = norm(v) < 1

At this point, seeding of the random number generator and subsequent sampling steps touching its internal state have produced over a hundred different operations generating various intermediate results along with the output variable of interest.

To understand the specifics of this process, we assign a label to this variable and enter the command cuda_whos(), which is analogous to whos in IPython and MATLAB and generates a listing of all variables that are currently registered (with the JIT compiler, in this case).

>>> set_label(inside, 'inside')
>>> cuda_whos()

  ID        Type   E/I Refs   Size        Memory     Ready    Label
  =================================================================
  10        u32    0 / 1      1000000     3.8147 MiB  [ ]
  11        u64    0 / 1      1000000     7.6294 MiB  [ ]
  ... 126 lines skipped ...
  178       f32    0 / 1      1           4 B         [ ]
  179       msk    1 / 0      1000000     976.56 KiB  [ ]     inside
  =================================================================

  Memory usage (ready)     : 0 B
  Memory usage (scheduled) : 0 B + 20.027 MiB = 20.027 MiB
  Memory savings           : 350.95 MiB

The resulting output lists variables of many types (single precision floating point values, 32/64 bit unsigned integers, masks, etc.), of which the last one corresponds to the inside variable named above.

Note how each variable lists two reference counts (in the column E/I refs): the first (external) specifies how many times the variable is referenced from an external application like the interactive Python prompt, while the second (internal) counts how many times it is referenced as part of queued arithmetic expressions. Variables with zero references in both categories are automatically purged from the list.

Most of the variables are only referenced internally—these correspond to temporaries created during a computation. Because they can no longer be “reached” through external references, it would be impossible to ask the system for the contents of such a temporary variable. Enoki relies on this observation to perform an important optimization: rather than storing temporaries in global GPU memory, their contents can be represented using cheap temporary GPU registers. This yields significant storage and memory traffic savings: over 350 MiB of storage can be elided in the last example, leaving only roughly 20 MiB of required storage.

In fact, these numbers can still change: we have not actually executed the computation yet, and Enoki currently conservatively assumes that we plan to continue using the random number generator rng and list of 3D vectors v later on. If we instruct Python to garbage-collect these two variables, the required storage drops to less than a megabyte:

>>> del v, rng
>>> cuda_whos()

  ID        Type   E/I Refs   Size        Memory     Ready    Label
  =================================================================
  10        u32    0 / 1      1000000     3.8147 MiB  [ ]
  11        u64    0 / 1      1000000     7.6294 MiB  [ ]
  ... 126 lines skipped ...
  178       f32    0 / 1      1           4 B         [ ]
  179       msk    1 / 0      1000000     976.56 KiB  [ ]     inside
  =================================================================

  Memory usage (ready)     : 0 B
  Memory usage (scheduled) : 0 B + 976.56 KiB = 976.56 KiB
  Memory savings           : 324.25 MiB

Finally, we can “peek” into the inside array to compute the fraction of points that lie within the sphere, which approximates the expected value \(\frac{4}{3\cdot 2^3}\pi\approx0.523599\).

>>> count(inside) / len(inside)
... many lines skipped ...
0.523946

Manually triggering JIT compilation

It is sometimes desirable to manually force Enoki’s JIT compiler to generate a kernel containing the computation queued thus far. For instance, rather than compiling a long-running iterative algorithm into a single huge kernel, a single kernel per iteration may be preferable. This can be accomplished by explicitly invoking the cuda_eval() function periodically. An example:

>>> from enoki.cuda import UInt32 as UInt32C
>>> a = UInt32C.arange(1234)

>>> cuda_eval()
cuda_eval(): launching kernel (n=1234, in=0, out=1, ops=1)

>>> cuda_whos()

  ID        Type   E/I Refs   Size        Memory     Ready    Label
  =================================================================
  10        u32    1 / 0      1234        4.8203 KiB  [x]
  =================================================================

  Memory usage (ready)     : 4.8203 KiB
  Memory usage (scheduled) : 4.8203 KiB + 0 B = 4.8203 KiB
  Memory savings           : 0 B

The array is now marked “ready”, which means that its contents were evaluated and reside in GPU memory at an address that can be queried via the data field.

>>> a.data
140427428626432

Actually, that is not entirely accurate: kernels are always launched asynchronously, which means that the function cuda_eval() may have returned before the GPU finished executing the kernel. Nonetheless, it is perfectly safe to begin using the variable immediately, as asynchronous communication with the GPU still observes a linear ordering guarantee.

In very rare cases (e.g. kernel benchmarking), it may be desirable to wait until all currently running kernels have terminated. For this, invoke cuda_sync() following cuda_eval().
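For instance, a minimal timing sketch along these lines (assuming some computation has been queued beforehand):

>>> import time
>>> t0 = time.time()
>>> cuda_eval()   # returns immediately (asynchronous launch)
>>> cuda_sync()   # blocks until the GPU has actually finished
>>> print('%.3f ms' % ((time.time() - t0) * 1000))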

Parallelization and horizontal operations

Recall the difference between vertical and horizontal operations: vertical operations are applied independently to each element of a vector, while horizontal ones combine the different elements of a vector. Enoki’s GPU arrays are designed to operate very efficiently when working with vertical operations that can be parallelized over the entire chip.

Horizontal operations (e.g. hsum(), all(), count(), etc.) are best avoided whenever possible, because they require that all prior computation has finished. In other words: each time Enoki encounters a horizontal operation involving an unevaluated array, it triggers a call to cuda_eval(). That said, horizontal reductions are executed in parallel using NVIDIA’s CUB library, which is a highly performant implementation of these primitives.
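The following sketch illustrates this behavior (the kernel launch message and output formatting are abbreviated, and the values assume this hypothetical input):

>>> x = FloatC.linspace(1, 4, 4)   # queued: [1, 2, 3, 4]
>>> y = x * x                      # still queued
>>> hsum(y)                        # horizontal operation: triggers cuda_eval()
cuda_eval(): launching kernel (...)
[30]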

Interfacing with NumPy

Enoki GPU arrays support bidirectional conversion from/to NumPy arrays, which will of course involve some communication between the CPU and GPU:

>>> x = FloatC.linspace(0, 1, 5)

>>> # Enoki -> NumPy
>>> y = Vector3fC(x, x*2, x*3).numpy()
cuda_eval(): launching kernel (n=5, in=1, out=6, ops=36)

>>> y
array([[0.  , 0.  , 0.  ],
       [0.25, 0.5 , 0.75],
       [0.5 , 1.  , 1.5 ],
       [0.75, 1.5 , 2.25],
       [1.  , 2.  , 3.  ]], dtype=float32)

>>> # NumPy -> Enoki
>>> Vector3fC(y)
cuda_eval(): launching kernel (n=5, in=1, out=3, ops=27)
[[0, 0, 0],
 [0.25, 0.5, 0.75],
 [0.5, 1, 1.5],
 [0.75, 1.5, 2.25],
 [1, 2, 3]]

Interfacing with PyTorch

PyTorch GPU tensors are supported as well. In this case, copying occurs on the GPU (but is still necessary, as the two frameworks use different memory layouts for tensors).

>>> x = FloatC.linspace(0, 1, 5)

>>> # Enoki -> PyTorch
>>> y = Vector3fC(x, x*2, x*3).torch()
cuda_eval(): launching kernel (n=5, in=2, out=5, ops=31)

>>> y
tensor([[0.0000, 0.0000, 0.0000],
        [0.2500, 0.5000, 0.7500],
        [0.5000, 1.0000, 1.5000],
        [0.7500, 1.5000, 2.2500],
        [1.0000, 2.0000, 3.0000]], device='cuda:0')

>>> # PyTorch -> Enoki
>>> Vector3fC(y)
cuda_eval(): launching kernel (n=5, in=1, out=3, ops=27)
[[0, 0, 0],
 [0.25, 0.5, 0.75],
 [0.5, 1, 1.5],
 [0.75, 1.5, 2.25],
 [1, 2, 3]]

Note how the .numpy() and .torch() function calls both triggered a mandatory kernel launch to ensure that the array contents were ready before returning a representation in the other framework. This can be wasteful when converting many variables at an interface between two frameworks. For this reason, both the .numpy() and .torch() functions take an optional eval argument that is set to True by default. Passing False causes the operation to return an uninitialized NumPy or PyTorch array, while at the same time scheduling Enoki code that will eventually fill this memory with valid contents the next time that cuda_eval() is triggered. An example is shown below. This feature should be used with caution.

>>> x = FloatC.linspace(0, 1, 5)

>>> y = Vector3fC(x, x*2, x*3).numpy(False)

>>> y
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]], dtype=float32)

>>> cuda_eval()
cuda_eval(): launching kernel (n=5, in=1, out=4, ops=36)

>>> y
array([[0.  , 0.  , 0.  ],
       [0.25, 0.5 , 0.75],
       [0.5 , 1.  , 1.5 ],
       [0.75, 1.5 , 2.25],
       [1.  , 2.  , 3.  ]], dtype=float32)

Scatter/gather operations

The GPU backend also supports scatter and gather operations involving GPU arrays as targets.

>>> a = FloatC.zero(10)
>>> b = UInt32C.arange(5)
>>> scatter(target=a, source=FloatC(b), index=b*2)
>>> a
cuda_eval(): launching kernel (n=5, in=1, out=2, ops=9)
[0, 0, 1, 0, 2, 0, 3, 0, 4, 0]

Note that gathering from an unevaluated Enoki array is not guaranteed to be a vertical operation, hence it triggers a call to cuda_eval().
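The sketch below illustrates this (kernel launch message abbreviated): the source array is still unevaluated, so the gather forces a kernel launch first.

>>> a = FloatC.linspace(10, 40, 4)   # queued: [10, 20, 30, 40]
>>> gather(a, UInt32C([3, 1]))
cuda_eval(): launching kernel (...)
[40, 20]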

Caching memory allocations

Similar to the PyTorch memory allocator, Enoki uses a caching scheme to avoid very costly device synchronizations when releasing memory. This means that freeing a large GPU variable doesn’t cause the associated memory region to become available for use by the operating system or other frameworks like Tensorflow or PyTorch. Use the function cuda_malloc_trim() to fully purge all unused memory. The function is only relevant when working with other frameworks and does not need to be called to free up memory for use by Enoki itself.
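A short sketch of the intended call sequence when handing memory back to another framework:

>>> x = FloatC.zero(100 * 1024 * 1024)   # request a large array (400 MiB)
>>> cuda_eval()          # materialize it in GPU memory
>>> del x                # memory returns to Enoki's cache, not to the system
>>> cuda_malloc_trim()   # release all cached memory back to the system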

C++ interface

Everything demonstrated in the above sections can be directly applied to C++ programs as well. To use the associated (dynamic array) type CUDAArray, include the header

#include <enoki/cuda.h>

Furthermore, applications must be linked against the cuda and enoki-cuda libraries. The following snippet contains a C++ translation of the Monte Carlo integration Python example shown earlier.

#include <enoki/cuda.h>
#include <enoki/random.h>
#include <iostream>

using namespace enoki;

using FloatC    = CUDAArray<float>;
using Vector3fC = Array<FloatC, 3>;
using PCG32C    = PCG32<FloatC>;
using MaskC     = mask_t<FloatC>;

int main(int argc, char **argv) {
    PCG32C rng(PCG32_DEFAULT_STATE, arange<FloatC>(1000000));

    Vector3fC v(
        rng.next_float32() * 2.f - 1.f,
        rng.next_float32() * 2.f - 1.f,
        rng.next_float32() * 2.f - 1.f
    );

    MaskC inside = norm(v) < 1.f;

    std::cout << count(inside) / (float) inside.size() << std::endl;
}

Suggestions regarding horizontal operations

When vectorizing code, we may sometimes want to skip an expensive computation when it is not actually needed by any elements in the array being processed. This is usually done with the any() function and yields good performance when targeting the CPU (e.g. with the AVX512 backend). An example:

auto condition = variable > 1.f;
if (any(condition))
    result[condition] = /* expensive-to-evaluate expression */;

However, recall the discussion earlier in this section, which explained how horizontal operations tend to be fairly expensive in conjunction with the GPU backend because they flush the JIT compiler. This effectively breaks up the program into smaller kernels, increasing memory traffic and missing potential optimization opportunities. Arrays processed by the GPU backend tend to be much larger, and from a probabilistic viewpoint it is then likely that the any() function call evaluates to true anyway. For these reasons, skipping the test and always evaluating the expression often leads to better performance on the GPU.

Enoki provides alternative horizontal reductions of masks named any_or(), all_or(), none_or() that do exactly this: they skip evaluation when compiling for GPU targets and simply return the supplied template argument. For other targets, they behave as usual. With this change, the example looks as follows:

auto condition = variable > 1.f;
if (any_or<true>(condition))
    result[condition] = /* expensive-to-evaluate expression */;

Differences between Enoki and existing frameworks

Enoki was designed as a numerical foundation for differentiable physical simulations, specifically the Mitsuba renderer, though it is significantly more general and should be a trusty tool for a variety of simulation and optimization problems.

Its GPU and Autodiff backends are related to well-known frameworks like TensorFlow and PyTorch that have become standard tools for training and evaluating neural networks. In the following, we outline the main differences between these frameworks and Enoki.

Both PyTorch and TensorFlow provide two main operational modes. The first, eager mode, directly evaluates arithmetic operations on the GPU, which yields excellent performance in conjunction with arithmetically intensive operations like convolutions and large matrix-vector multiplications, both of which are building blocks of neural networks. When evaluating typical simulation code that mainly consists of much simpler arithmetic (e.g. additions, multiplications, etc.), the resulting memory traffic and scheduling overheads induce severe bottlenecks. An early prototype of Enoki provided a TorchArray<T> type that carried out operations using PyTorch's eager mode, and the low performance of this combination eventually motivated us to develop the technique based on JIT compilation introduced in the previous section.

The second operational mode requires an up-front specification of the complete computation graph, from which a single optimized GPU kernel is generated (e.g. via XLA in TensorFlow and jit.trace in PyTorch). This is feasible for neural networks, whose graph specification is very regular and typically only consists of a few hundred operations. Simulation code, on the other hand, involves much larger graphs whose structure is unpredictable: program execution often involves randomness, which can cause jumps to almost any part of the system. The full computation graph would simply be the entire codebase (potentially on the order of hundreds of thousands of lines of code), which is of course far too big.

Enoki’s approach can be interpreted as a middle ground between the two extremes discussed above. Graphs are created on the fly during a simulation and can be several orders of magnitude larger than those of typical neural networks. They consist mostly of unstructured and comparably simple arithmetic that is lazily fused into optimized CUDA kernels. Since our system works without an up-front specification of the full computation graph, it must support features like dynamic indirection via virtual function calls that can simultaneously branch to multiple different implementations. The details of this are described in the section on function calls.

Note that there are of course many use cases where PyTorch, TensorFlow, etc. are vastly superior to Enoki, and it is often a good idea to combine the two in such cases (e.g. to feed the output of a differentiable simulation into a neural network).

One last related framework is ArrayFire, which provides a JIT compiler that lazily fuses instructions in a manner similar to our CUDAArray<T> type. ArrayFire targets a higher-level language (C), but appears to be limited to fairly small kernels (100 operations by default), and does not provide a mechanism for automatic differentiation. In contrast, Enoki emits an intermediate representation (PTX) and fuses instructions into comparatively larger kernels that often exceed 100K instructions.

Automatic differentiation

Automatic differentiation (AD) broadly refers to a set of techniques that numerically evaluate the gradient of a computer program. Two variants of AD are widely used:

  1. Forward mode. Starting with a set of inputs (e.g. a and b) and associated derivatives (da and db), forward-mode AD instruments every operation (e.g. c = a * b) with an additional step that tracks the evolution of derivatives (dc = a*db + b*da). Forward mode is ideal for functions with a single input and many outputs. When the function has multiple inputs, a separate propagation pass is needed per parameter, which can become very costly.

  2. Reverse mode. On the other hand, reverse-mode AD specifically targets the case where the function to be differentiated has one (or few) outputs and potentially many inputs. It traverses the computational graph from outputs to inputs, repeatedly evaluating the chain rule in reverse. This approach is also known as backpropagation in the context of neural networks.

    One tricky aspect of reverse-mode is that the backward traversal can only begin once the output has been computed. A partial record of intermediate computations must furthermore be kept in memory, which can become costly for long-running computations.

The implementation in Enoki is realized via the special DiffArray<T> array type and supports both of the above variants, though it is particularly optimized for reverse mode operation. The template argument T refers to an arithmetic type (e.g. float or an Enoki array) that is used to carry out the underlying primal and derivative computation.

Due to the need to maintain a computation graph for the reverse-mode traversal, AD tends to involve massive amounts of memory traffic, making it an ideal fit for GPU-resident arrays that provide a higher amount of memory bandwidth. For this reason, combinations such as DiffArray<CUDAArray<float>> should be considered the default choice. Enoki’s Python bindings expose this type as FloatD in the enoki.cuda_autodiff submodule. As in the previous section, we will stick to the interactive Python interface and postpone a discussion of the C++ side until the end of this section.

A FloatD instance consists of two parts: a floating point array that is used during the original computation (of type FloatC), and an index that refers to a node in a separately maintained directed acyclic graph capturing the structure of the differentiable portion of the computation. By default, the index is set to zero, which indicates that the variable does not participate in automatic differentiation.

The following two example snippets demonstrate usage of automatic differentiation. In both cases, the set_requires_gradient() function is used to mark a variable as being part of a differentiable computation.

>>> from enoki import *
>>> from enoki.cuda_autodiff import Float32 as FloatD

>>> # Create a single differentiable variable
>>> a = FloatD(2.0)
>>> set_requires_gradient(a)

>>> # Arithmetic with one input ('a') and multiple outputs ('b', 'c')
>>> b = a * a
>>> c = sqrt(a)

>>> # Forward-propagate gradients from single input to multiple outputs
>>> forward(a)
autodiff: forward(): processed 3/5 nodes.

>>> gradient(b), gradient(c)
([4], [0.353553])

The forward() and backward() functions realize the two previously discussed AD variants, and gradient() extracts the gradient associated with a differentiable variable. An example of reverse-mode traversal is shown next:

>>> from enoki import *
>>> from enoki.cuda_autodiff import Float32 as FloatD

>>> # Create multiple differentiable input variables
>>> a, b = FloatD(2.0), FloatD(3.0)
>>> set_requires_gradient(a)
>>> set_requires_gradient(b)

>>> # Arithmetic with two inputs ('a', 'b') and a single output ('c')
>>> c = a * sqrt(b)

>>> # Backward-propagate gradients from single output to multiple inputs
>>> backward(c)
autodiff: backward(): processed 3/4 nodes.

>>> gradient(a), gradient(b)
([1.73205], [0.57735])

Note that gradient() returns the gradient using the wrapped arithmetic type, which is a FloatC instance in this case. Another function named detach() can be used to extract the value using the underlying (non-differentiable) array type. Using these two operations, a gradient descent step on a parameter a would be realized as follows:

>>> a = FloatD(detach(a) - step_size * gradient(a))

Note that practical applications of Enoki likely involve large arrays with many millions of entries rather than scalars used in the two examples above.
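For instance, the same machinery applies unchanged to arrays; the following sketch differentiates a sum of squares (diagnostic messages omitted):

>>> x = FloatD.linspace(0, 1, 5)
>>> set_requires_gradient(x)
>>> y = hsum(x * x)
>>> backward(y)
>>> gradient(x)   # equals 2*x
[0, 0.5, 1, 1.5, 2]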

Visualizing computation graphs

It is possible to visualize the graph of the currently active computation using the graphviz() function. You may also want to assign explicit variable names via set_label() to make the visualization easier to parse. An example is shown below:

>>> a = FloatD(1.0)
>>> set_requires_gradient(a)
>>> b = erf(a)
>>> set_label(a, 'a')
>>> set_label(b, 'b')

>>> print(graphviz(b))
digraph {
  rankdir=RL;
  fontname=Consolas;
  node [shape=record fontname=Consolas];
  1 [label="'a' [s]\n#1 [E/I: 1/5]" fillcolor=salmon style=filled];
  3 [label="mul [s]\n#3 [E/I: 0/4]"];
  ... 111 lines skipped ...
  46 -> 12;
  46 [fillcolor=cornflowerblue style=filled];
}

The resulting string can be visualized via Graphviz, which reveals the numerical approximation used to evaluate the error function erf().

_images/autodiff-01.svg

The combination of Enoki’s JIT compiler and AD has interesting consequences: computation related to derivatives is queued up along with primal arithmetic and can thus be compiled into a joint GPU kernel.

For example, if a forward computation evaluates the expression \(\sin(x)\), the weight of the associated backward edge in the computation graph is given by \(\cos(x)\). The computation of both of these quantities is automatically merged into a single joint kernel, leveraging subexpression elimination and constant folding to further improve efficiency.
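The following sketch queues such a computation; once evaluation is triggered, \(\sin(x)\) and the \(\cos(x)\) edge weight are emitted together (diagnostic messages omitted):

>>> x = FloatD.linspace(0, 1, 10)
>>> set_requires_gradient(x)
>>> y = sin(x)              # queues sin(x); the graph edge carries cos(x)
>>> backward(hsum(y))
>>> g = gradient(x)         # cos(x), computed jointly with the primal kernel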

For the previous example involving the error function, cuda_whos() introduced in the last section reveals that many variables relating to both primal and gradient computations have been scheduled (but not executed yet).

>>> cuda_whos()

  ID        Type   E/I Refs   Size        Memory     Ready    Label
  =================================================================
  10        f32    3 / 11     1           4 B         [ ]     a
  11        f32    1 / 0      1           4 B         [ ]     a.grad
  16        f32    0 / 1      1           4 B         [ ]
  17        f32    0 / 1      1           4 B         [ ]
  ... 117 lines skipped ...
  150       f32    1 / 0      1           4 B         [ ]     b
  151       f32    0 / 1      1           4 B         [ ]
  152       f32    0 / 1      1           4 B         [ ]
  153       f32    1 / 0      1           4 B         [ ]
  154       f32    0 / 1      1           4 B         [ ]
  155       f32    0 / 1      1           4 B         [ ]
  156       f32    1 / 0      1           4 B         [ ]
  =================================================================

  Memory usage (ready)     : 0 B
  Memory usage (scheduled) : 0 B + 268 B = 268 B
  Memory savings           : 235 B

Graph simplification

An important goal of Enoki’s autodiff backend is a significant reduction in memory usage when running simulation code that produces computation graphs with long sequences of relatively simple arithmetic operations. Existing frameworks like PyTorch do not fare very well in such cases. For instance, consider the following simple PyTorch session where an array is repeatedly multiplied by itself:

>>> # ----- GPU memory usage: 0 MiB -----
>>> import torch

>>> # Create a tensor with 1 million floats (4 MiB of GPU memory)
>>> a = torch.zeros(1024 * 1024, device='cuda')
>>> # ----- GPU memory usage: 809 MiB (mostly overhead) -----

>>> # Perform a simple differentiable computation
>>> b = a.requires_grad_()
>>> for i in range(1000):
...     b = b * b
>>> # ----- GPU memory usage: 4803 MiB -----

The issue here is that PyTorch keeps the entire computation graph (including intermediate results) in memory to be able to perform a reverse-mode traversal later on. This is costly and unnecessary when working with simple arithmetic operations.

To avoid this problem, Enoki periodically simplifies the computation graph by eagerly evaluating the chain rule at interior nodes to reduce storage requirements. Consequently, it does not follow a strict reverse- or forward-mode graph traversal, making it an instance of mixed-mode, or hybrid AD [GrWa08]. When working with differentiable GPU arrays, simplification occurs before each JIT compilation pass. The fundamental operation of the simplification process is known as vertex elimination [Yoshi87], [GrSh91] and collapses an interior node with \(d_i\) in-edges and \(d_o\) out-edges, creating \(d_i\cdot d_o\) new edges, whose weights are products of the original edge weights. These are then merged with existing edges, if applicable:

_images/autodiff-02.svg

Although this operation may increase the density of the graph connectivity if \(d_i,d_o>1\), collapsing such nodes is often worthwhile since it enables later simplifications that can reduce an entire subgraph to a single edge. For instance, eliminating the interior node of a chain \(x\to y\to z\) replaces the two edges weighted \(\partial y/\partial x\) and \(\partial z/\partial y\) with a single edge carrying their product \(\partial z/\partial x\), which is simply the chain rule. Compared to direct traversal of the original graph, simplification increases the required amount of arithmetic in exchange for lower memory usage. In conjunction with the GPU backend, this optimization is particularly effective: removals often target nodes whose primal computation has not yet taken place. Since edge weights of collapsed nodes are no longer directly reachable, they can be promoted to cheap register storage.

The order of collapse operations has a significant effect on the efficiency and size of the resulting kernels. Unfortunately, propagating derivatives in a way that results in a minimal number of operations is known to be NP-hard [Naum07]. Enoki uses a greedy scheme that organizes nodes in a priority queue ordered by the number of edges \(d_i\cdot d_o\) that would be created by a hypothetical collapse operation, issuing collapses from cheapest to most expensive until the cost exceeds an arbitrary threshold that we set to 10 edges.

Graph simplification can be manually triggered by the FloatD.simplify_graph() operation. Returning to our earlier example of the error function, we can observe that it collapses the graph to just the input and output node.

>>> a = FloatD(1.0)
>>> set_requires_gradient(a)
>>> b = erf(a)
>>> set_label(a, 'a')
>>> set_label(b, 'b')
>>> FloatD.simplify_graph()
>>> print(graphviz(b))

digraph {
  rankdir=RL;
  fontname=Consolas;
  node [shape=record fontname=Consolas];
  1 [label="'a' [s]\n#1 [E/I: 1/1]" fillcolor=salmon style=filled];
  46 [label="'b' [s]\n#46 [E/I: 1/0]" fillcolor=salmon style=filled];
  46 -> 1;
  46 [fillcolor=cornflowerblue style=filled];
}
_images/autodiff-03.svg

If automatic graph simplification as part of cuda_eval() is not desired, it can be completely disabled by calling FloatD.set_graph_simplification(False).
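A sketch combining the two controls:

>>> FloatD.set_graph_simplification(False)   # disable the automatic pass
>>> # ... record a differentiable computation ...
>>> FloatD.simplify_graph()                  # run one simplification manually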

References

[GrSh91] Andreas Griewank and Shawn Reese. 1991. On the calculation of Jacobian matrices by the Markowitz rule. Technical Report. Argonne National Lab., IL (United States).
[GrWa08] Andreas Griewank and Andrea Walther. 2008. Evaluating derivatives: principles and techniques of algorithmic differentiation. Vol. 105. SIAM.
[Yoshi87] Toshinobu Yoshida. 1987. Derivation of a computational process for partial derivatives of functions using transformations of a graph. Transactions of Information Processing Society of Japan 11, 19.
[Naum07] Uwe Naumann. 2007. Optimal Jacobian accumulation is NP-complete. Mathematical Programming 112 (2007).

A more complex example

We will now look at a complete optimization example: our objective will be to find a matrix that rotates one vector onto another using gradient descent. This problem is of course contrived because a simple explicit solution exists, and because we won’t be using the vectorization aspect of Enoki, but it provides an opportunity to use a few more Enoki constructions. The annotated source code is given below:

from enoki import *
from enoki.cuda_autodiff import Float32 as FloatD, Vector3f as Vector3fD, \
    Vector4f as Vector4fD, Matrix4f as Matrix4fD

cuda_set_log_level(2)

# Initialize two 3D vectors. We want to rotate 'a' onto 'b'
a = normalize(Vector3fD(2, 1, 3))
b = normalize(Vector3fD(-1, 2, 3))

# Our rotation matrix will be parameterized by an axis and an angle
axis = Vector3fD(1, 0, 0)
angle = FloatD(1)

# Learning rate for stochastic gradient descent
lr = 0.2

for i in range(20):
   # Label and mark input variables as differentiable
   set_requires_gradient(axis)
   set_requires_gradient(angle)
   set_label(axis, "axis")
   set_label(angle, "angle")

   # Define a nested scope (only for visualization/debugging purposes)
   with FloatD.Scope("rotation"):
      # Compute a rotation matrix with the given axis and angle
      rot_matrix = Matrix4fD.rotate(axis=normalize(axis), angle=angle)

      # Label the entries of the rotation matrix
      set_label(rot_matrix, "rot_matrix")

   # Define a nested scope (only for visualization/debugging purposes)
   with FloatD.Scope("loss"):
      # Apply the rotation matrix to 'a' and compute the L2 difference to 'b'
      loss = norm(rot_matrix * Vector4fD(a.x, a.y, a.z, 0) - Vector4fD(b.x, b.y, b.z, 0))

      # Label the resulting loss
      set_label(loss, "loss")

   # Dump a GraphViz plot of the computation graph
   with open("out_%i.dot" %i, "w") as f:
      f.write(graphviz(loss))

   # Reverse-mode traversal of the computation graph
   backward(loss)
   print("err: %s" % str(loss))

   # Gradient descent
   axis = Vector3fD(normalize(detach(axis) - gradient(axis) * lr))
   angle = FloatD(detach(angle) - gradient(angle) * lr)

Running the above program prints a message of the form

autodiff: backward(): processed 58/58 nodes.
cuda_eval(): launching kernel (n=1, in=14, out=25, ops=521)
cuda_jit_run(): cache hit, jit: 535 us
err: [1.12665]

for each iteration. After a few iterations, the error is reduced from an initial value of 1.34 to 0.065. Note that kernels are only created at the beginning—later iterations indicate cache hits because the overall structure of the computation is repetitive.

Observe also that the print() command that quantifies the loss value in each iteration has an interesting side effect: it flushes the queued computation and waits for it to finish (waiting for the computation to finish is clearly necessary, otherwise how could we know the loss?). Moving this statement to the last line causes all iterations to be merged into a single kernel that is much larger (see the ops=10444 part of the debug message, which specifies the number of PTX instructions):

cuda_eval(): launching kernel (n=1, in=0, out=27, ops=10444)
cuda_jit_run(): cache miss, jit: 22.763 ms, ptx compilation: 299.26 ms, 73 registers
err: [0.0653574]

This is likely not desired; when such a situation arises, issuing a call to cuda_eval() per iteration is generally advisable.
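In sketch form, the loop body then ends as follows:

for i in range(20):
   # ... forward computation and backward(loss) as above ...

   # Compile and launch one kernel per iteration rather than letting
   # all iterations accumulate into a single large kernel
   cuda_eval()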

Finally, we visualize the GraphViz files that were written to disk by the optimization steps. Boxes in red highlight named variables (“axis”, “angle”, “rot_matrix”), and the blue box is the loss. You may also have wondered what the with FloatD.Scope(...): statements above do: these collect all computation in the nested scope, causing it to be arranged within a labeled box.

_images/autodiff-04.svg

Differentiable scatter and gather operations

Enoki arrays provide scatter, gather, and atomic scatter-add primitives, which constitute a special case during automatic differentiation. Consider the following differentiable calculation, which selects a subset of an input array:

>>> a = FloatD.linspace(0, 1, 10)
>>> set_requires_gradient(a)

>>> c = gather(a, UInt32D([1, 4, 8, 4]))
>>> backward(hsum(c))
autodiff: backward(): processed 3/3 nodes.

>>> print(gradient(a))
[0, 1, 0, 0, 2, 0, 0, 0, 1, 0]

Here, reverse-mode propagation of a derivative of c with respect to the input parameter a requires a suitable scatter_add() operation during the reverse-mode traversal. Analogously, scatters turn into gathers under reverse-mode AD. The differentiable array backend recognizes these operations and inserts a special type of edge into the graph to enable the necessary transformations.

One current limitation of Enoki is that such special edges cannot be merged into ordinary edges during graph simplification. Handling this case could further reduce memory usage and is an interesting topic for future work.

Interfacing with PyTorch

It is possible to insert a differentiable computation realized using Enoki into a larger PyTorch program and subsequently back-propagate gradients through the combination of these systems. The following annotated example shows how to expose a differentiable Enoki function (enoki.atan2) to PyTorch. The page on Extending PyTorch is a helpful reference regarding the torch.autograd.Function construction used in the example.

import torch
import enoki

class EnokiAtan2(torch.autograd.Function):
    @staticmethod
    def forward(ctx, arg1, arg2):
        # Convert input parameters to Enoki arrays
        ctx.in1 = enoki.cuda_autodiff.FloatD(arg1)
        ctx.in2 = enoki.cuda_autodiff.FloatD(arg2)

        # Inform Enoki if PyTorch wants gradients for one/both of them
        enoki.set_requires_gradient(ctx.in1, arg1.requires_grad)
        enoki.set_requires_gradient(ctx.in2, arg2.requires_grad)

        # Perform a differentiable computation in Enoki
        ctx.out = enoki.atan2(ctx.in1, ctx.in2)

        # Convert the result back into a PyTorch array
        out_torch = ctx.out.torch()

        # Optional: release any cached memory from Enoki back to PyTorch
        enoki.cuda_malloc_trim()

        return out_torch

    @staticmethod
    def backward(ctx, grad_out):
        # Attach gradients received from PyTorch to the output
        # variable of the forward pass
        enoki.set_gradient(ctx.out, enoki.FloatC(grad_out))

        # Perform a reverse-mode traversal. Note that the static
        # version of the backward() function is being used, see
        # the following subsection for details on this
        enoki.cuda_autodiff.FloatD.backward()

        # Fetch gradients from the input variables and pass them on
        result = (enoki.gradient(ctx.in1).torch()
                  if enoki.requires_gradient(ctx.in1) else None,
                  enoki.gradient(ctx.in2).torch()
                  if enoki.requires_gradient(ctx.in2) else None)

        # Garbage-collect Enoki arrays that are now no longer needed
        del ctx.out, ctx.in1, ctx.in2

        # Optional: release any cached memory from Enoki back to PyTorch
        enoki.cuda_malloc_trim()

        return result

# Create 'enoki_atan2(y, x)' function
enoki_atan2 = EnokiAtan2.apply

# Let's try it!
y = torch.tensor(1.0, device='cuda')
x = torch.tensor(2.0, device='cuda')
y.requires_grad_()
x.requires_grad_()

o = enoki_atan2(y, x)
print(o)

o.backward()
print(y.grad)
print(x.grad)

Running this program yields the following output

cuda_eval(): launching kernel (n=1, in=1, out=8, ops=61)
tensor([0.4636], device='cuda:0', grad_fn=<EnokiAtan2Backward>)
autodiff: backward(): processed 3/3 nodes.
cuda_eval(): launching kernel (n=1, in=6, out=3, ops=20)
cuda_eval(): launching kernel (n=1, in=2, out=1, ops=9)
tensor(0.4000, device='cuda:0')
tensor(-0.2000, device='cuda:0')

Custom forward and reverse-mode traversals

The default forward() and backward() traversal functions require an input or output variable for which gradients should be propagated. Following the traversal, the autodiff graph data structure is immediately torn down. These assumptions are usually fine when the function being differentiated has 1 input and n outputs, or when it has m inputs and 1 output.

However, for a function with n inputs and m outputs, we may want to perform multiple reverse or forward-mode traversals while retaining the computation graph. This is simple to do via an extra argument:

backward(my_variable, free_graph=False)
# or
forward(my_variable, free_graph=False)

We may want to initialize the input/output variables with specific gradients before each traversal.

set_gradient(out1, 2.0)
set_gradient(out2, 3.0)
FloatD.backward(free_graph=False)
# or
set_gradient(in1, 2.0)
set_gradient(in2, 3.0)
FloatD.forward(free_graph=False)

This functionality is particularly useful when implementing a partial reverse-mode traversal in the context of a larger differentiable computation realized using another framework (e.g. PyTorch). See the previous subsection for an example.

C++ interface

As in the previous section, the C++ and Python interfaces behave in exactly the same way. To use the DiffArray<T> type, include the header

#include <enoki/autodiff.h>

Furthermore, applications must be linked against the enoki-autodiff library (and against cuda and enoki-cuda if differentiable GPU arrays are used). The following snippet contains a C++ translation of the error function example shown earlier.

#include <enoki/cuda.h>
#include <enoki/autodiff.h>
#include <enoki/special.h> // for erf()
#include <iostream>

using namespace enoki;

using FloatC    = CUDAArray<float>;
using FloatD    = DiffArray<FloatC>;

int main(int argc, char **argv) {
    FloatD a = 1.f;
    set_requires_gradient(a);

    FloatD b = erf(a);
    set_label(a, "a");
    set_label(b, "b");

    std::cout << graphviz(b) << std::endl;

    backward(b);
    std::cout << gradient(a) << std::endl;
}

Method calls

Method calls and virtual method calls are important building blocks of modern object-oriented C++ applications. When vectorization enters the picture, it is not immediately clear how they should be dealt with. This section introduces Enoki’s method call vectorization support, focusing on a hypothetical Sensor class that decodes a measurement performed by a sensor.

Note that the examples will refer to CPU SIMD-style vectorization, but everything in this section also applies to other kinds of Enoki arrays (GPU arrays, differentiable arrays).

Suppose that the interface of the Sensor class originally looks as follows:

class Sensor {
public:
    /// Decode a measurement based on the sensor's response curve
    virtual float decode(float input) = 0;

    /// Return sensor's serial number
    virtual uint32_t serial_number() = 0;
};

It is trivial to add a second method that takes vector inputs, like so:

using FloatP = Packet<float, 8>;
using MaskP  = mask_t<FloatP>;

class Sensor {
public:
    /// Scalar version
    virtual float decode(float input) = 0;

    /// Vector version
    virtual FloatP decode(FloatP input) = 0;

    /// Return sensor's serial number
    virtual uint32_t serial_number() = 0;
};

This will work fine if there is just a single Sensor instance. But what if there are many of them, e.g. when each FloatP array of measurements also comes with a SensorP structure whose entries reference the sensor that produced the measurement?

class Sensor;
using SensorP = Array<Sensor *, 8>;

Ideally, we’d still be able to write the following code, but this sort of thing is clearly not supported by standard C++.

SensorP sensor = ...;
FloatP data = ...;

data = sensor->decode(data);

Enoki provides a support layer that can handle such vectorized method calls. It performs as many method calls as there are unique instances in the sensor array, and an optional mask is forwarded to the callee indicating the associated active SIMD lanes. Null pointers in the sensor array are legal and are treated as masked entries. The return value of masked entries is always zero (or a zero-filled array/structure, depending on the method’s return type).

The ENOKI_CALL_SUPPORT_METHOD macro is required to support the above syntax. This generates the Enoki support layer that intercepts and carries out the function call:

class Sensor {
public:
    // Scalar version
    virtual float decode(float input) = 0;

    // Vector version with optional mask argument
    virtual FloatP decode(FloatP input, MaskP mask) = 0;

    /// Return sensor's serial number
    virtual uint32_t serial_number() = 0;
};

ENOKI_CALL_SUPPORT_BEGIN(Sensor)
ENOKI_CALL_SUPPORT_METHOD(decode)
ENOKI_CALL_SUPPORT_METHOD(serial_number)
/// .. potentially other methods ..
ENOKI_CALL_SUPPORT_END(Sensor)

The macro supports functions taking an arbitrary number of arguments but assumes that results are provided to the caller via the return value only (i.e. no writing to arguments passed by reference). The mask, if present, must be the last argument of the function.

Here is a hypothetical implementation of the Sensor interface:

class MySensor : public Sensor {
public:
    /// Vector version
    virtual FloatP decode(FloatP input, MaskP active) override {
        /// Keep track of invalid samples
        n_invalid += count(isnan(input) && active);

        /* Transform e.g. from log domain. */
        return log(input);
    }

    /// Return sensor's serial number
    uint32_t serial_number() { return 363436u; }

    // ...

    size_t n_invalid = 0;
};

With this interface, the following vectorized expressions are now valid:

SensorP sensor = ...;
FloatP data = ...;

/* Unmasked version */
data = sensor->decode(data);

/* Masked version */
auto mask = sensor->serial_number() > 1000;
data = sensor->decode(data, mask);

Note how both functions with scalar and vector return values are vectorized automatically.

The implementation of vector method calls depends on the array type and hardware capabilities.

  • On machines with the AVX512 instruction set, the vpconflictq instruction is used to efficiently extract the unique set of instance pointers.
  • The CUDA backend performs a parallel radix sort and run-length encoding of the pointer array using NVIDIA’s CUB library to obtain the list of unique pointers and indices referring to them. It then gathers the argument values corresponding to a particular pointer, evaluates the function, and scatters the result into an output array.
  • In all other cases, the unique elements are found using a linear sweep.

Supporting scalar getter functions

The above way of vectorizing a scalar getter function may involve multiple virtual method calls, which is not particularly efficient when the invoked function is very simple (e.g. a getter). Enoki provides an alternative macro ENOKI_CALL_SUPPORT_GETTER that turns any such attribute lookup into a gather operation. The macro takes the getter name and field name as arguments. The macro ENOKI_CALL_SUPPORT_FRIEND is needed if the field in question is a private member.

class Sensor {
    ENOKI_CALL_SUPPORT_FRIEND()
public:
    /// ...

    /// Return sensor's serial number
    uint32_t serial_number() { return m_serial_number; }

private:
    uint32_t m_serial_number;
};

ENOKI_CALL_SUPPORT_BEGIN(Sensor)
ENOKI_CALL_SUPPORT_GETTER(serial_number, m_serial_number)
ENOKI_CALL_SUPPORT_END(Sensor)

The usage is identical to before, i.e.:

using UInt32P = Packet<uint32_t, 8>;

SensorP sensor = ...;
UInt32P serial = sensor->serial_number();

Note that this trick even works for GPU arrays! In this case, the GPU will directly fetch the value of the m_serial_number field from the CPU via shared memory. However, this only works when the Sensor instance has been allocated in host-pinned address space that is reachable on the GPU. To do so, add the ENOKI_PINNED_OPERATOR_NEW annotation, which overrides the new and delete operators to ensure that this is always the case for Sensor instances.

class Sensor {
    ENOKI_CALL_SUPPORT_FRIEND()
    ENOKI_PINNED_OPERATOR_NEW(UInt32P)
public:
    // ...
};

Custom data structures

The previous sections introduced Enoki arrays as a tool for designing vectorized algorithms that are capable of processing multiple inputs at the same time. However, in many cases, vectorizing an algorithm will also require a corresponding change to the data structures that underlie it. This section demonstrates how C++ templates provide a natural framework for achieving this goal while satisfying the desiderata of Enoki (readability, portability, no code duplication, etc.).

We will focus on a simple example data structure that represents a position record acquired by a GPS tracker along with auxiliary information (time, and a flag stating whether the data is considered reliable). The examples will refer to CPU SIMD-style vectorization, but everything in this section also applies to other kinds of Enoki arrays (GPU arrays, differentiable arrays).

using Vector2f = Array<float, 2>;

/// Simple 2D GPS record tagged with auxiliary information
struct GPSCoord2f {
    /// UNIX time when the data point was acquired (seconds since Jan 1970)
    uint64_t time;

    /// Latitude, longitude as a 2D vector
    Vector2f pos;

    /// Is the data point reliable (e.g. enough GPS satellites in sensor's field of view)
    bool reliable;
};

Next, consider the following function, which computes the distance between two GPS coordinates using the haversine formula. When either of the two positions is deemed unreliable, it returns a NaN value to inform the caller about this.

/// Calculate the distance in kilometers between 'r1' and 'r2' using the haversine formula
float distance(const GPSCoord2f &r1, const GPSCoord2f &r2) {
    const float deg_to_rad = (float) (M_PI / 180.0);

    if (!r1.reliable || !r2.reliable)
        return std::numeric_limits<float>::quiet_NaN();

    Vector2f sin_diff_h = sin(deg_to_rad * .5f * (r2.pos - r1.pos));
    sin_diff_h *= sin_diff_h;

    float a = sin_diff_h.x() + sin_diff_h.y() *
              std::cos(r1.pos.x() * deg_to_rad) *
              std::cos(r2.pos.x() * deg_to_rad);

    return 6371.f * 2.f * std::atan2(std::sqrt(a), std::sqrt(1.f - a));
}

Suppose that we would like to add a second vectorized version of this function, which works with arrays of GPS coordinates. Instead of defining yet another data structure for coordinates arrays in addition to the existing GPSCoord2f, our approach will be to rely on a single template data structure that subsumes both cases. It is parameterized by the type of a GPS position component (e.g. latitude) named Value.

template <typename Value> struct GPSCoord2 {
    using Vector2 = Array<Value, 2>;
    using UInt64  = uint64_array_t<Value>;
    using Mask    = mask_t<Value>;

    UInt64 time;
    Vector2 pos;
    Mask reliable;
};

The using declarations at the beginning require an explanation: they involve the type traits enoki::uint64_array_t and enoki::mask_t, which “compute” the type of an Enoki array that has the same configuration as their input parameter, but with uint64_t- and bool-valued entries, respectively. Both are specializations of the more general enoki::replace_scalar_t trait that works for any type.

With these declarations, we can now create a packet type GPSCoord2fP that stores 16 GPS positions in a convenient SoA representation.

using FloatP      = Packet<float, 16>;
using GPSCoord2fP = GPSCoord2<FloatP>;

An important aspect of the type calculations mentioned above is that they also generalize to non-array arguments. In particular, uint64_array_t<float> and mask_t<float> simply turn into uint64_t and bool, respectively, hence the type alias

using GPSCoord2f  = GPSCoord2<float>;

perfectly reproduces the original (scalar) GPS record definition. Having defined the GPS record type, it is time to update the function definition as well. Once more, we will rely on C++ templates to do so.

The new distance function shown below is similarly templated with respect to the Value type, and it works for both scalar and vector arguments.

/// Calculate the distance in kilometers between 'r1' and 'r2' using the haversine formula
template <typename Value>
Value distance(const GPSCoord2<Value> &r1, const GPSCoord2<Value> &r2) {
    using Scalar = scalar_t<Value>;
    const Value deg_to_rad = Scalar(M_PI / 180.0);

    auto sin_diff_h = sin(deg_to_rad * .5f * (r2.pos - r1.pos));
    sin_diff_h *= sin_diff_h;

    Value a = sin_diff_h.x() + sin_diff_h.y() *
              cos(r1.pos.x() * deg_to_rad) *
              cos(r2.pos.x() * deg_to_rad);

    return select(
        r1.reliable && r2.reliable,
        (6371.f * 2.f) * atan2(sqrt(a), sqrt(1.f - a)),
        std::numeric_limits<Scalar>::quiet_NaN()
    );
}

Note how the overall structure is preserved. There are three noteworthy changes:

  1. Control flow such as if statements must be replaced by branchless code involving masks (see the enoki::select() call at the end of the function). Separate array entries may undergo a different control flow, which is not possible with standard C++ language constructs, hence the need for masks.

    If desired, the early-out optimization from the previous snippet can be preserved for the special case that all records are unreliable:

    if (none(r1.reliable && r2.reliable))
        return std::numeric_limits<Scalar>::quiet_NaN();
    
  2. Standard mathematical functions such as std::sin are replaced by their Enoki equivalents, which generalize to both array and non-array arguments.

  3. The enoki::scalar_t type alias near the top of the function is used to extract the elementary arithmetic type underlying an Enoki array—this results in the type float in our example, which is used to cast a constant to the right precision.

    It is sometimes useful to be able to work with higher precision. Our templated distance function can nicely accommodate this need simply by switching to the following types:

    using GPSCoord2d   = GPSCoord2<double>;
    using DoubleP      = Packet<double, 16>;
    using GPSCoord2dP  = GPSCoord2<DoubleP>;
    

    The distance function requires no changes. When working with double precision GPS records, the deg_to_rad constant automatically adapts to the higher precision due to the cast to the Scalar type.

Dynamic CPU arrays

Arrays and nested arrays facilitate the development of vectorized code that processes multiple values at once. However, it can be awkward to work with small fixed packet sizes when the underlying application must process millions or billions of data points. The remainder of this document discusses infrastructure that can be used to realize computations on the CPU involving dynamically allocated arrays of arbitrary length.

One of the core ingredients is enoki::DynamicArray, which is a smart pointer that manages the lifetime of a dynamically allocated memory region. It is the exclusive owner of this data and is also responsible for its destruction when the dynamic array goes out of scope (similar to std::unique_ptr). Dynamic arrays can be used to realize arithmetic involving data that is much larger than the maximum SIMD width supported by the underlying hardware.

Note that its implementation is contained in a separate header file that must be included:

#include <enoki/dynamic.h>

The class requires a single template argument, which can be any kind of enoki::Array. This is the packet type that will be used to realize vectorized computations involving the array contents. The following code snippet illustrates the creation of a dynamic floating point array that vectorizes using 4-wide SSE arithmetic.

/* Static float array (the suffix "P" indicates that this is a fixed-size packet) */
using FloatP = Packet<float, 4>;

/* Dynamic float array (vectorized via FloatP, the suffix "X" indicates arbitrary length) */
using FloatX = DynamicArray<FloatP>;

Note

In contrast to all array types discussed so far, an enoki::DynamicArray instance should not be part of an arithmetic expression. For instance, the following will compile and yield the expected result, but this style of using dynamic arrays is discouraged.

FloatX in1 = ... , in2 = ... , in3 = ... , in4 = ...;

/* Add the dynamic arrays using operator+() */
FloatX out = in1 + in2 + in3 + in4;

Why that is the case requires a longer explanation of the design of this library.

At a high level, there are two “standard” ways of implementing arithmetic for dynamically allocated memory regions.

  1. A common approach (used e.g. by most std::valarray implementations) is to evaluate partial expressions in place, creating a large number of temporaries in the process.

    This is unsatisfactory since the amount of computation is very small compared to the resulting memory traffic.

  2. The second is a technique named expression templates that is used in libraries such as Eigen. Expression templates construct complex graphs describing the inputs and operations of a mathematical expression using C++ templates. The underlying motivation is to avoid numerous memory allocations for temporaries by postponing evaluation until the point where the expression template is assigned to a storage container.

    Unfortunately, pushed to the scale of entire programs, this approach tends to produce intermediate code with an extremely large number of common subexpressions that exceeds the capabilities of the common subexpression elimination (CSE) stage of current compilers. The first version of Enoki in fact used expression templates, and it was due to the difficulties with them that an alternative was developed.

The key idea of vectorizing over dynamic Enoki arrays is to iterate over packets (i.e. static arrays) that represent a sliding window into the dynamic array’s contents. Packets, in turn, are easily supported using the tools discussed in the previous sections. Enoki provides a powerful operation named enoki::vectorize(), discussed later, that implements this sliding window technique automatically.

That said, for convenience, arithmetic operations like operator+ are implemented for dynamic arrays, and they are realized using approach 1 of the above list (i.e. with copious amounts of memory allocation for temporaries). Using them in performance-critical code is inadvisable.

Allocating dynamic arrays

When allocating dynamic arrays, the underlying memory region is always fully aligned according to the requirements of the packet type. Enoki may sometimes allocate a partially used packet at the end, which eliminates the need for special end-of-array handling. The following code snippet allocates an array of size 5 using 4-wide packets, which means that 3 entries at the end are unused.

_images/dynamic-01.svg
/* Creates a dynamic array that is initially empty */
FloatX x;

/* Allocate memory for at least 5 entries */
set_slices(x, 5);

/* Query the size (a.k.a. number of "slices") of the dynamic array */
size_t slice_count = slices(x);
assert(slice_count == 5);

/* Query the number of packets */
size_t packet_count = packets(x);
assert(packet_count == 2);

A few convenience initialization methods also exist:

/* Efficient way to create an array filled with zero entries */
x = zero<FloatX>(size);

/* .. or an uninitialized array */
x = empty<FloatX>(size);

/* Initialize entries with index sequence 0, 1, 2, ... */
x = arange<FloatX>(size);

/* Initialize entries with a linearly increasing sequence with endpoints 0 and 1 */
x = linspace<FloatX>(0.f, 1.f, size);

Custom dynamic data structures

The previous section used the example of a GPS record to show how Enoki can create packet versions of a type. The same approach also generalizes to dynamic arrays, allowing an arbitrarily long sequence of records to be represented. This requires two small additions to the original type declaration:

template <typename Value> struct GPSCoord2 {
    using Vector2 = Array<Value, 2>;
    using UInt64  = uint64_array_t<Value>;
    using Bool    = bool_array_t<Value>;

    UInt64 time;
    Vector2 pos;
    Bool reliable;

    ENOKI_STRUCT(GPSCoord2,           /* <- name of this class */
                 time, pos, reliable  /* <- list of all attributes in layout order */)
};

ENOKI_STRUCT_SUPPORT(GPSCoord2, time, pos, reliable)

The two additions do the following:

  1. The ENOKI_STRUCT macro declares copy and assignment constructors that are able to convert between different types of records.
  2. The ENOKI_STRUCT_SUPPORT macro at the end makes Enoki aware of GPSCoord2 for the purposes of dynamic vectorization.

It is possible but fairly tedious to write these declarations by hand, hence the code generation macros should generally be used.

With these declarations, we can now allocate a dynamic array of 1000 coordinates that will be processed in packets of 4 (or more, depending on the definition of FloatP):

using GPSCoord2fX = GPSCoord2<FloatX>;

GPSCoord2fX coord;
set_slices(coord, 1000);

In memory, this data will be arranged as follows:

_images/dynamic-02.svg

In other words: each field references a dynamic array that contiguously stores the contents in a SoA organization.

Accessing array packets

The enoki::packet() function can be used to create a reference to the \(i\)-th packet of a dynamic array or a custom dynamic data structure. For instance, the following code iterates over all packets and resets their time values:

/* Reset the time value of all records */
for (size_t i = 0; i < packets(coord); ++i) {
    GPSCoord2<FloatP&> ref = packet(coord, i);
    ref.time = 0;
}

The packet() function is interesting because it returns an instance of a new type GPSCoord2<FloatP&> that was not discussed yet (note the ampersand in the template argument). Instead of directly storing data, all fields of a GPSCoord2<FloatP&> are references pointing to packets of data elsewhere in memory. In this case, assigning (writing) to a field of this structure of references will change the corresponding entry of the dynamic array! Conceptually, this looks as follows:

_images/dynamic-03.svg

References can also be cast into their associated packet types and vice versa:

/* Read a GPSCoord2<FloatP&> and convert to GPSCoord2<FloatP> */
GPSCoord2fP cp = packet(coord, i);

/* Assign a GPSCoord2<FloatP> to a GPSCoord2<FloatP&> */
packet(coord, i + 1) = cp;

Note

For non-nested dynamic arrays such as FloatX = DynamicArray<FloatP>, packet() simply returns a reference to the selected FloatP entry in that array of packets. We generally encourage using universal references (auto &&) to hold the result of packet() so that both cases are handled in the same way:

auto   ref = packet(coord, i);   // Only works for dynamic structures
auto  &ref = packet(numbers, i); // Only works for non-nested arrays
auto &&ref = packet(coord, i);   // Works for both

Accessing array slices

Enoki provides a second way of indexing into dynamic arrays: the enoki::slice() function creates a reference to the \(i\)-th slice of a dynamic array or a custom dynamic data structure. Elements of a slice store references to scalar elements representing a vertical slice through the data structure.

The following code iterates over all slices and initializes the time values to an increasing sequence:

/* Set the i-th time value to 'i' */
for (size_t i = 0; i < slices(coord); ++i) {
    auto ref = slice(coord, i);
    ref.time = i;
}

Here, the enoki::slice() function returns an instance of a new type GPSCoord2<float&> (again, note the ampersand). Conceptually, this looks as follows:

_images/dynamic-06.svg

Slice reference types can also be cast into their associated scalar data types and vice versa:

/* Read a GPSCoord2<float&> and convert to GPSCoord2<float> */
GPSCoord2f c = slice(coord, n);

/* Assign a GPSCoord2<float> to a GPSCoord2<float&> */
slice(coord, n + 1) = c;

Dynamic vectorization

Now suppose that we’d like to compute the pairwise distance between records organized in two dynamically allocated lists. Direct application of the discussed ingredients leads to the following overall structure:

GPSCoord2fX coord1;
GPSCoord2fX coord2;
FloatX result;

// Allocate memory and fill input arrays with contents (e.g. using slice(...))
...

// Call SIMD-vectorized function for each packet
for (size_t i = 0; i < packets(coord1); ++i)
    packet(result, i) = distance(packet(coord1, i),
                                 packet(coord2, i));

This does not quite compile (yet)—a minor modification of the distance() function is required:

/// Calculate the distance in kilometers between 'r1' and 'r2' using the haversine formula
template <typename Value_, typename Value = expr_t<Value_>>
Value distance(const GPSCoord2<Value_> &r1, const GPSCoord2<Value_> &r2) {
    using Scalar = scalar_t<Value>;
    const Value deg_to_rad = Scalar(M_PI / 180.0);

    auto sin_diff_h = sin(deg_to_rad * .5f * (r2.pos - r1.pos));
    sin_diff_h *= sin_diff_h;

    Value a = sin_diff_h.x() + sin_diff_h.y() *
              cos(r1.pos.x() * deg_to_rad) *
              cos(r2.pos.x() * deg_to_rad);

    return select(
        r1.reliable & r2.reliable,
        Scalar(6371.f * 2.f) * atan2(sqrt(a), sqrt(1.f - a)),
        std::numeric_limits<Scalar>::quiet_NaN()
    );
}

The modified version above uses the enoki::expr_t type trait to determine a suitable type that is able to hold the result of an expression involving its argument (which turns FloatP& into FloatP in this case).

Note

The issue with the original code was that it was called with a GPSCoord2<FloatP&> instance, i.e. with a template parameter Value = FloatP&. However, the Value type is also used for the return value as well as various intermediate computations, which is illegal since these temporaries are not associated with an address in memory.

With these modifications, we are now finally able to vectorize over the dynamic array:

// Call SIMD-vectorized function for each packet -- yay!
for (size_t i = 0; i < packets(coord1); ++i)
    packet(result, i) = distance(packet(coord1, i),
                                 packet(coord2, i));

Shorthand notation

Extracting individual packets as shown in the snippet above can become fairly tedious when a function takes many arguments. Enoki offers a convenient helper function named enoki::vectorize() that automates this process. It takes a function and a number of dynamic arrays as input and calls the function once for each set of input packets.

FloatX result = vectorize(
    distance<FloatP>, // Function to call
    coord1,           // Input argument 1
    coord2            // Input argument 2
                      // ...
);

Here, the returned float packets are stored in a dynamic array of type FloatX.

When the output array is already allocated, it is also possible to write the results directly into it. The snippet below shows how to do this by calling enoki::vectorize() with a lambda function.

vectorize(
    [](auto&& result, auto&& coord1, auto &&coord2) {
        result = distance<FloatP>(coord1, coord2);
    },
    result,
    coord1,
    coord2
);

Note the use of a generic lambda with auto&& arguments: it would be redundant to specify the argument types since they are automatically inferred from the function inputs.

Naturally, we could also perform the complete calculation within the lambda function:

vectorize(
    [](auto&& result, auto&& coord1, auto&& coord2) {
        using Value = FloatP;
        using Scalar = float;

        const Value deg_to_rad = Scalar(M_PI / 180.0);

        auto sin_diff_h = sin(deg_to_rad * .5f * (coord2.pos - coord1.pos));
        sin_diff_h *= sin_diff_h;

        Value a = sin_diff_h.x() + sin_diff_h.y() *
                  cos(coord1.pos.x() * deg_to_rad) *
                  cos(coord2.pos.x() * deg_to_rad);

        result = select(
            coord1.reliable & coord2.reliable,
            (6371.f * 2.f) * atan2(sqrt(a), sqrt(1.f - a)),
            std::numeric_limits<Scalar>::quiet_NaN()
        );
    },

    result,
    coord1,
    coord2
);

It is not necessary to “route” all parameters through enoki::vectorize(). Auxiliary data structures or constants are easily accessible via the lambda capture object using the standard [&] notation.
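
For instance, the following sketch (with a hypothetical scale constant) accesses an auxiliary value by reference without routing it through enoki::vectorize():

float scale = 0.5f; /* hypothetical auxiliary constant */

vectorize(
    [&](auto&& result, auto&& coord1, auto&& coord2) {
        result = distance<FloatP>(coord1, coord2) * scale;
    },
    result, coord1, coord2
);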

A benchmark

We now turn to the results of a microbenchmark which runs the previously discussed GPS record distance function on a dynamic array with 10 million entries.

 /* Compilation flags:
    $ clang++ benchmark.cpp -o benchmark -std=c++17 -I include -O3
              -march=native -fomit-frame-pointer -fno-stack-protector -DNDEBUG
  */

 #include <enoki/array.h>
 #include <enoki/random.h>
 #include <chrono>
 #include <iostream>

 using namespace enoki;

 auto clk() { return std::chrono::high_resolution_clock::now(); }

 template <typename T> float clkdiff(T a, T b) {
     return std::chrono::duration<float>(b - a).count() * 1000;
 }

 template <typename Value> struct GPSCoord2 {
     using Vector2 = Array<Value, 2>;
     using UInt64  = uint64_array_t<Value>;
     using Bool    = mask_t<Value>;

     UInt64 time;
     Vector2 pos;
     Bool reliable;

     ENOKI_STRUCT(GPSCoord2, time, pos, reliable)
 };

 ENOKI_STRUCT_SUPPORT(GPSCoord2, time, pos, reliable)

 /* Packet size being benchmarked; typically set on the command line via -DSIMD_WIDTH=.. */
 #if !defined(SIMD_WIDTH)
 #  define SIMD_WIDTH 16
 #endif

 using FloatP       = Packet<float, SIMD_WIDTH>;
 using FloatX       = DynamicArray<FloatP>;
 using GPSCoord2fX  = GPSCoord2<FloatX>;
 using GPSCoord2fP  = GPSCoord2<FloatP>;
 using GPSCoord2f   = GPSCoord2<float>;

 using RNG = PCG32<FloatP>;

 /// Calculate the distance in kilometers between 'r1' and 'r2' using the haversine formula
 template <typename Value_, typename Value = expr_t<Value_>>
 ENOKI_INLINE Value distance(const GPSCoord2<Value_> &r1, const GPSCoord2<Value_> &r2) {
     using Scalar = scalar_t<Value>;

     const Value deg_to_rad = Scalar(M_PI / 180.0);

     auto sin_diff_h = sin(deg_to_rad * .5f * (r2.pos - r1.pos));
     sin_diff_h *= sin_diff_h;

     Value a = sin_diff_h.x() + sin_diff_h.y() *
               cos(r1.pos.x() * deg_to_rad) *
               cos(r2.pos.x() * deg_to_rad);

     return select(
         r1.reliable & r2.reliable,
         (6371.f * 2.f) * atan2(sqrt(a), sqrt(1.f - a)),
         Value(std::numeric_limits<Scalar>::quiet_NaN())
     );
 }

 int main(int argc, char *argv[]) {
     for (int i = 0; i < 3; ++i) {
         GPSCoord2fX coord1;
         GPSCoord2fX coord2;
         FloatX result;

         auto clk0 = clk();

         size_t size = 10000000;
         set_slices(coord1, size);
         set_slices(coord2, size);
         set_slices(result, size);

         auto clk1 = clk();

         RNG rng;

         for (size_t j = 0; j < packets(coord1); ++j) {
             packet(coord1, j) = GPSCoord2fP {
                 0,
                 { rng.next_float32() * 180.f - 90, rng.next_float32() * 360.f - 180.f},
                 true
             };
             packet(coord2, j) = GPSCoord2fP {
                 0,
                 { rng.next_float32() * 180.f - 90, rng.next_float32() * 360.f - 180.f},
                 true
             };
         }

         auto clk2 = clk();

         vectorize([](auto &&result, auto &&coord1, auto &&coord2) {
                       result = distance<FloatP>(coord1, coord2);
                   },
                   result, coord1, coord2);

         auto clk3 = clk();
         std::cout << clkdiff(clk2, clk3) << " (alloc = " << clkdiff(clk0, clk1)
                   << ", fill = " << clkdiff(clk1, clk2) << ")" << std::endl;
     }

     return 0;
 }

The plots show the measured speedup relative to a scalar baseline implementation. We consider two different microarchitectures:

Knight’s Landing microarchitecture (Xeon Phi 7210)

The Knight’s Landing architecture provides hardware support for SIMD arithmetic on 16 single-precision floating point values. Interestingly, the best performance is reached when working with arrays of 32 entries, which can be interpreted as a type of loop unrolling. The ability to issue wide memory operations, perform branchless arithmetic using vector registers, and keep two independent instructions in flight for each arithmetic operation leads to a total speedup of 23.5x (i.e. considerably exceeding the expected maximum speedup of 16 from the vectorized instructions alone!).

Relative to the C math library, Enoki obtains an even larger speedup of 38.7x. Using the standard C math library on this platform is fairly expensive, presumably because of function call penalties on Xeon Phi (Enoki generally inlines functions), and because it is compiled for a generic x86_64 machine rather than the native architecture.

Platform details: clang trunk rev. 304711 on Linux 64 bit (RHEL 7.3)

_images/dynamic-04.svg

Skylake microarchitecture (i7-6920HQ)

The Skylake architecture provides hardware support for SIMD arithmetic on 8 single-precision floating point values. Significant speedups are observed for packets of 8 and 16 entries. It is likely that more involved functions (i.e. with a higher register pressure) will have a sharper performance drop after \(n=16\) due to the relatively small number of registers on this platform. Enoki single-precision transcendentals are only slightly faster than the standard C math library on this platform. The max. speedup relative to the standard C math library is 10.0x.

Platform details: clang trunk rev. 304711 on macOS 10.12.5

_images/dynamic-05.svg

Advanced topics

This section discusses a number of advanced operations and ways of extending Enoki.

Improving introspection of Enoki types in LLDB/GDB

When debugging programs built on top of Enoki using LLDB or GDB, the stringified versions of arrays are needlessly verbose and reveal private implementation details. For instance, printing a simple statically sized 3D vector like Array<float, 3>(1, 2, 3) in LLDB yields

$0 = {
  enoki::StaticArrayImpl<float, 3, false, enoki::Array<float, 3>, int> = {
    enoki::StaticArrayImpl<float, 4, false, enoki::Array<float, 3>, int> = {
      m = (1, 2, 3, 0)
    }
  }
}

Dynamic arrays (e.g. FloatX(1, 2, 3)) are even worse, as the values are obscured behind a pointer:

$0 = {
  enoki::DynamicArrayImpl<enoki::Packet<float, 8>, enoki::DynamicArray<enoki::Packet<float, 8> > > = summary {
    m_packets = {
      __ptr_ = {
        std::__1::__compressed_pair_elem<enoki::Packet<float, 8> *, 0, false> = {
          __value_ = 0x0000000100300200
        }
      }
    }
    m_size = 1
    m_packets_allocated = 1
  }
}

To improve readability, this repository includes scripts that improve GDB and LLDB’s understanding of Enoki types. With these scripts, both of the above turn into

$0 = [1, 2, 3]

To install it in LLDB, copy the file resources/enoki_lldb.py to ~/.lldb (creating the directory if not present) and then append the following line to the file ~/.lldbinit (again, creating it if not already present):

command script import ~/.lldb/enoki_lldb.py

To install it in GDB, copy the file resources/enoki_gdb.py to ~/.gdb (creating the directory if not present) and then append the following two lines to the file ~/.gdbinit (again, creating it if not already present):

set print pretty
source ~/.gdb/enoki_gdb.py

Compressing arrays

A common design pattern in vectorized code involves compressing arrays, i.e. selectively writing only masked parts of an array so that the selected entries become densely packed in memory (e.g. to improve resource usage when only parts of an array participate in a computation).

The function enoki::compress() efficiently maps this operation onto the targeted hardware (SSE4.2, AVX2, and AVX512 implementations are provided). The function also automatically advances the output pointer by the number of written entries.

Array<float, 16> input = ...;
auto mask = input < 0;

float output[16], *ptr = output;
size_t count = compress(ptr, input, mask);
std::cout << count << " entries were written." << std::endl;
...

Custom data structures such as the GPS record class discussed in previous chapters are transparently supported by enoki::compress()—in this case, the mask applies to each vertical slice through the data structure as illustrated in the following figure:

_images/advanced-01.svg

The slice_ptr() function is used to acquire pointers to the beginning of the output arrays (there is one for each field of the data structure). It returns a value of type GPSCoord2<float *>, which holds these pointers. The following snippet illustrates how an arbitrarily long list of records can be compressed:

GPSCoord2fX input = /* .. input data to be compressed .. */;

/* Make sure there is enough space to store all data */
GPSCoord2fX output;
set_slices(output, slices(input));

/* Structure composed of pointers to the output arrays */
GPSCoord2<float *> ptr = slice_ptr(output, 0);

/* Counter used to keep track of the number of collected elements */
size_t final_size = 0;

/* Go through all packets, compress, and append */
for (size_t i = 0; i < packets(input); ++i) {
    /* Let's filter out the records with input.reliable == false */
    auto input_p = packet(input, i);
    final_size += compress(ptr, input_p, input_p.reliable);
}

/* Now that the final number of slices is known, adjust the output array size */
set_slices(output, final_size);

Warning

The writes performed by enoki::compress() are at the granularity of entire packets, which means that some extra scratch space generally needs to be allocated at the end of the output array.

For instance, even if it is known that a compression operation will find exactly N elements, you are required to reserve memory for N + Packet::Size elements to avoid undefined behavior.

Note that enoki::compress() will never require more memory than the input array, hence this provides a safe upper bound.
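
The following sketch illustrates the required padding, assuming that exactly n_expected entries are known to survive compression:

/* Pad the allocation by one packet so that compress() has scratch space */
FloatX output;
set_slices(output, n_expected + FloatP::Size);

/* ... perform the compression as shown above ... */

/* Shrink to the actual number of entries afterwards */
set_slices(output, n_expected);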

Vectorized for loops

Enoki provides a powerful enoki::range() iterator that enables for loops with index vectors. The following somewhat contrived piece of code computes \(\sum_{i=0}^{999}i^2\) using brute force addition (but with only \(1000/8 = 125\) loop iterations).

using Index = Packet<uint32_t, 8>;

Index result(0);

for (auto [index, mask] : range<Index>(0, 1000)) {
    result += select(
        mask,
        index * index,
        Index(0)
    );
}

assert(hsum(result) == 332833500);

The mask is necessary to communicate the fact that the last loop iteration has several disabled entries. An extended version of the range statement enables iteration over multi-dimensional grids.

using Index3 = Array<Packet<uint32_t, 8>, 3>;

for (auto [index, mask] : range<Index3>(4u, 5u, 6u))
    std::cout << index << std::endl;


/* Output:
    [[0, 0, 0],
     [1, 0, 0],
     [2, 0, 0],
     [3, 0, 0],
     [0, 1, 0],
     [1, 1, 0],
     [2, 1, 0],
     [3, 1, 0]]
     ....
*/

Here, it is assumed that the range along each dimension starts from zero.

Scatter, gather, and prefetches for SoA ↔ AoS conversion

Enoki’s scatter() and gather() functions can transparently convert between SoA and AoS representations. This case is triggered when the supplied index array has a smaller nesting level than that of the data array. For instance, consider the following example:

/* Packet types */
using FloatP    = Packet<float, 4>;
using UInt32P   = Packet<uint32_t, 4>;

/* SoA and AoS 3x3 matrix types */
using Matrix3f  = Matrix<float, 3>;
using Matrix3fP = Matrix<FloatP, 3>;

Matrix3f *data = ...;
UInt32P indices(5, 1, 3, 0);

/* Gather AoS matrix data into SoA representation */
Matrix3fP mat = gather<Matrix3fP>(data, indices);

/* Modify SoA matrix data and scatter back into AoS-based 'data' array */
mat *= 2.f;
scatter(data, mat, indices);

The same syntax also works for prefetch(), which is convenient to ensure that a given set of memory addresses is in cache (preferably a few hundred cycles before the actual use).

/* Prefetch into L2 cache for write access */
prefetch<Matrix3fP, /* Write = */ true, /* Level = */ 2>(data, indices);

Enoki is smart enough to realize that the non-vectorized form of gather and scatter statements can be reduced to standard loads and stores, e.g.

using Matrix3f  = Matrix<float, 3>;
float *ptr = ...;
int32_t index = ...;

/* The statement below reduces to load<Matrix3f>(ptr + index); */
Matrix3f mat = gather<Matrix3f>(ptr, index);

Scatter and gather operations are also permitted for dynamic arrays, e.g.:

using FloatX    = DynamicArray<FloatP>;
using UInt32X   = DynamicArray<UInt32P>;
using Matrix3fX = Matrix<FloatX, 3>;

Matrix3f *data = ...;
UInt32X indices = ...;

/* Gather AoS matrix data into SoA representation */
Matrix3fX mat = gather<Matrix3fX>(data, indices);

Vectorized integer division by constants

Integer division is a surprisingly expensive operation on current processor architectures: for instance, the Knight’s Landing architecture requires up to a whopping 108 cycles (95 cycles on Skylake) to perform a single 64-bit signed integer division with remainder. The hardware unit implementing the division cannot accept any new inputs until it is done with the current input (in other words, it is not pipelined in contrast to most other operations). Given the challenges of efficiently realizing integer division in hardware, current processors don’t even provide a vector instruction to perform multiple divisions at once.

Although Enoki can’t do anything clever to provide an efficient array division instruction given these constraints, it does provide a highly efficient division operation for a special case that is often applicable: dividing by an integer constant. The following snippet falls under this special case because all array entries are divided by the same constant, which is furthermore known at compile time.

using UInt32 = enoki::Array<uint32_t, 8>;

UInt32 div_43(UInt32 a) {
    return a / 43;
}

This generates the following AVX2 assembly code (with comments):

_div_43:
    ; Load magic constant into 'ymm1'
    vpbroadcastd  ymm1, dword ptr [rip + LCPI0_0]

    ; Compute high part of 64 bit multiplication with 'ymm1'
    vpmuludq      ymm2, ymm1, ymm0
    vpsrlq        ymm2, ymm2, 32
    vpsrlq        ymm3, ymm0, 32
    vpmuludq      ymm1, ymm1, ymm3
    vpblendd      ymm1, ymm2, ymm1, 170

    ; Correction & shift
    vpsubd        ymm0, ymm0, ymm1
    vpsrld        ymm0, ymm0, 1
    vpaddd        ymm0, ymm0, ymm1
    vpsrld        ymm0, ymm0, 5
    ret

We’ve effectively turned the division into a sequence of 2 multiplies, 4 shifts, and 2 additions/subtractions. Needless to say, this is going to be much faster than a sequence of high-latency/low-throughput scalar divisions.

In cases where the constant is not known at compile time, an enoki::divisor instance can be precomputed and efficiently applied using enoki::divisor::operator(), as shown in the following example:

using Int32P = enoki::Packet<int32_t, 8>;

void divide(Int32P *a, int32_t b, size_t n) {
    /* Precompute magic constants */
    divisor<int32_t> prec_div = b;

    /* Now apply the precomputed division efficiently */
    for (size_t i = 0; i < n; ++i)
        a[i] = prec_div(a[i]);
}

The following plots show the speedup compared to scalar division when dividing 100 million integer packets of size 16 by a compile-time constant. As can be seen, the difference is fairly significant on consumer processors (up to 13.2x on Skylake) and huge on the simple cores found on a Xeon Phi (up to 61.2x on Knight’s Landing).

_images/advanced-03.svg
_images/advanced-02.svg

Enoki’s implementation of division by constants is based on the excellent libdivide library.

Note

As can be seen, unsigned divisions are generally cheaper than signed division, and 32 bit division is considerably cheaper than 64 bit divisions. The reason for this is that a 64 bit high multiplication instruction required by the algorithm does not exist and must be emulated.

Warning

Enoki’s precomputed integer division operator does not support divisors equal to \(\pm 1\) (all other values are permissible). This is an inherent limitation of the magic number & shift-based algorithm used internally, which simply cannot represent this divisor. Enoki will throw an exception when a divisor equal to \(\pm 1\) is detected in an application compiled in debug mode.

Reinterpreting the contents of arrays

The function reinterpret_array() can be used to reinterpret the bit-level representation as a different type when both source and target types have matching sizes and layouts.

using UInt32P = Packet<uint32_t, 4>;
using FloatP = Packet<float, 4>;

UInt32P source = 0x3f800000;
FloatP target = reinterpret_array<FloatP>(source);

// Prints: [1, 1, 1, 1]
std::cout << target << std::endl;

The histogram problem and conflict detection

Consider vectorizing a function that increments the bins of a histogram given an array of bin indices. It is impossible to do this kind of indirect update using a normal pair of gather and scatter operations, since incorrect updates occur whenever the indices array contains an index multiple times:

using FloatP = Packet<float, 16>;
using IndexP = Packet<int32_t, 16>;

float hist[1000] = { 0.f }; /* Histogram entries */

IndexP indices = /* .. bin indices whose value should be increased .. */;

/* Ooops, don't do this. Some entries may have to be incremented multiple times.. */
scatter(hist, gather<FloatP>(hist, indices) + 1, indices);

Enoki provides a function named enoki::transform(), which modifies an indirect memory location in a way that is not susceptible to conflicts. The function takes an arbitrary function as a parameter and applies it to the specified memory locations, which allows this approach to generalize to situations other than just building histograms.

/* Unmasked version */
transform<FloatP>(hist, indices, [](auto& x, auto& /* mask */) { x += 1; });

/* Masked version */
transform<FloatP>(hist, indices, [](auto& x, auto& /* mask */) { x += 1; }, mask);

A mask argument is always provided to the lambda function. This is useful, e.g. to mask memory operations or function calls performed inside it. Note that any extra arguments will also be passed to the lambda function:

FloatP amount = ...;

/* Unmasked version */
transform<FloatP>(hist, indices,
                  [](auto& x, auto& y, auto& /* mask */) { x += y; },
                  amount);

/* Masked version */
transform<FloatP>(hist, indices,
                  [](auto& x, auto& y, auto& /* mask */) { x += y; },
                  amount, mask);

Internally, enoki::transform() detects and processes conflicts using the AVX512CD (conflict detection) instruction set. When conflicts are present, the function provided as an argument may be invoked multiple times in a row. When AVX512CD is not available, a slower scalar fallback implementation is used.

Since the above use pattern—adding values to an array—is so common, a special short-hand notation exists:

FloatP amount = ...;

/* Unmasked version */
scatter_add(hist, amount, indices);

/* Masked version */
scatter_add(hist, amount, indices, mask);

Defining custom array types

Enoki provides a mechanism for declaring custom array types using the Curiously Recurring Template Pattern (CRTP). The following snippet shows a declaration of a hypothetical type named Spectrum representing a discretized color spectrum. Spectrum generally behaves the same way as Array and supports all regular Enoki operations.

template <typename Value, size_t Size>
struct Spectrum : enoki::StaticArrayImpl<Value, Size, false,
                                         Spectrum<Value, Size>> {

    /// Base class
    using Base = enoki::StaticArrayImpl<Value, Size, false,
                                        Spectrum<Value, Size>>;

    /// Helper alias used to implement type promotion rules
    template <typename T> using ReplaceValue = Spectrum<T, Size>;

    /// Mask type associated with this custom type
    using MaskType = enoki::Mask<Value, Size>;

    /// Import constructors, assignment operators, etc.
    ENOKI_IMPORT_ARRAY(Base, Spectrum)
};

The main reason for declaring custom arrays is to tag (and preserve) the type of arrays within expressions. For instance, the type of value2 in the following snippet is Spectrum<float, 8> rather than a generic enoki::Array<...>.

Spectrum<float, 8> value = { ... };
auto value2 = exp(-value);

Note

A further declaration is needed in a rare corner case. This only applies

  1. if you plan to use the new array type within custom data structures, and
  2. if masked assignments to the custom data structure are used, and
  3. if the custom data structure creates instances of the new array type from its template parameters.

An example:

// 1. Custom data structure
template <typename Value> struct PolarizedSpectrum {
    // 3. Spectrum type constructed from template parameters
    using Spectrum8 = Spectrum<Value, 8>;
    Spectrum8 r_s;
    Spectrum8 r_p;
    ENOKI_STRUCT(PolarizedSpectrum, r_s, r_p)
};
ENOKI_STRUCT_SUPPORT(PolarizedSpectrum, r_s, r_p)

using PolarizedSpectrumfP = PolarizedSpectrum<FloatP>;

PolarizedSpectrumfP some_function(FloatP theta) {
    PolarizedSpectrumfP result = ...;
    // 2. Masked assignment
    masked(result, theta < 0.f) = zero<PolarizedSpectrumfP>();
    return result;
}

In this case, the following declaration is needed:

template <typename Value, size_t Size>
struct Spectrum<enoki::detail::MaskedArray<Value>, Size> : enoki::detail::MaskedArray<Spectrum<Value, Size>> {
    using Base = enoki::detail::MaskedArray<Spectrum<Value, Size>>;
    using Base::Base;
    using Base::operator=;
    Spectrum(const Base &b) : Base(b) { }
};

This partial specialization propagates an internal Enoki masking data structure to the top level (so that Spectrum<MaskedArray> becomes MaskedArray<Spectrum>).

Architectural differences handled by Enoki

In addition to mapping vector operations on the available instruction sets, Enoki’s abstractions hide a number of tedious platform-related details. This is a partial list:

  1. The representation of masks is highly platform-dependent. For instance, the AVX512 back-end uses eight dedicated mask registers to store masks compactly (allocating only a single bit per mask entry).

    Older machines use a redundant representation based on regular vector registers that have all bits set to 1 for entries where the comparison was true and 0 elsewhere.

  2. Machines with AVX (but no AVX2) don’t have an 8-wide integer vector unit. This means that an Array<float, 8> can be represented using a single AVX ymm register, but casting it to an Array<int32_t, 8> entails switching to a pair of half width SSE4.2 xmm integer registers, etc.

  3. Vector instruction sets are generally fairly incomplete in the sense that they are missing many entries in the full data type / operation matrix. Enoki emulates such operations using other vector instructions whenever possible.

  4. Various operations that work with 64 bit registers aren’t available when Enoki is compiled on a 32-bit platform and must be emulated.

Adding backends for new instruction sets

Adding a new Enoki array type involves creating a new partial overload of the StaticArrayImpl<> template that derives from StaticArrayBase. To support the full feature set of Enoki, overloads must provide at least a set of core methods shown below. The underscores in the function names indicate that this is considered non-public API that should only be accessed indirectly via the routing templates in enoki/enoki_router.h.

  • The following core operations must be provided by every implementation.

    • Loads and stores: store_, store_unaligned_, load_, load_unaligned_.
    • Arithmetic and bit-level operations: add_, sub_, mul_, mulhi_ (signed/unsigned high integer multiplication), div_, mod_, and_, or_, xor_.
    • Unary operators: neg_, not_.
    • Comparison operators that produce masks: ge_, gt_, lt_, le_, eq_, neq_.
    • Other elementary operations: abs_, ceil_, floor_, max_, min_, round_, sqrt_.
    • Shift operations for integers: sl_, sr_.
    • Horizontal operations: all_, any_, hprod_, hsum_, hmax_, hmin_, count_.
    • Masked blending operation: select_.
    • Access to low and high part (if applicable): high_, low_.
    • Zero-valued array creation: zero_.
  • The following operations all have default implementations in Enoki’s mathematical support library, hence overriding them is optional.

However, doing so may be worthwhile if efficient hardware-level support exists on the target platform.

    • Shuffle operation (emulated using scalar operations by default): shuffle_.
    • Compressed stores (emulated using scalar operations by default): compress_.
    • Extracting an element based on a mask (emulated using scalar operations by default): extract_.
    • Scatter/gather operations (emulated using scalar operations by default): scatter_, gather_.
    • Prefetch operations (no-op by default): prefetch_.
    • Fused multiply-add routines (reduced to add_/sub_ and mul_ by default): fmadd_, fmsub_, fnmadd_, fnmsub_, fmaddsub_, fmsubadd_.
    • Reciprocal and reciprocal square root (reduced to div_ and sqrt_ by default): rcp_, rsqrt_.
    • Dot product (reduced to mul_ and hsum_ by default): dot_.
    • Optional bit-level rotation operations (reduced to shifts by default): rol_, ror_.
    • Optional array rotation operations (reduced to shuffles by default): rol_array_, ror_array_.

Change log

Version 0.1.0 (Sep 2, 2019)

  • First public release of Enoki

Reference

Warning

This section is a work in progress.

How to read this section

The function signatures provided in this section are simplified to aid readability. For instance, the array addition operator

template<typename Array>
Array operator+(Array x, Array y)

hides the fact that

  1. The function supports mixed arguments with built-in broadcasting. For instance, the following is legal

    Array<float, 4> x = ...;
    Array<Array<double, 8>, 4> y = ...;
    
    auto z = x + y;
    

    The entries of x are broadcast and promoted in the above example, and the type of z is thus Array<Array<double, 8>, 4>. See the section on broadcasting rules for details.

  2. The operator uses SFINAE (std::enable_if) so that it only becomes active when x or y are Enoki arrays.

  3. The operator replicates the C++ typing rules. Recall that adding two float& references in standard C++ produces a float temporary (i.e. not a reference) holding the result of the operation.

    Analogously, z is of type Array<float, 4> in the example below.

    Array<float&, 4> x = ...;
    Array<float&, 4> y = ...;
    
    auto z = x + y;
    

Writing out these type transformation rules in every function definition would make for tedious reading, hence the simplifications. If in doubt, please look at the source code of Enoki.

Global macro definitions

ENOKI_VERSION_MAJOR

Integer value denoting the major version of the Enoki release.

ENOKI_VERSION_MINOR

Integer value denoting the minor version of the Enoki release.

ENOKI_VERSION_PATCH

Integer value denoting the patch version of the Enoki release.

ENOKI_VERSION

Enoki version string (e.g. "0.1.2").

ENOKI_LIKELY(condition)

Signals that the branch is almost always taken, which can be used for improved code layout if supported by the compiler. An example is shown below:

if (ENOKI_LIKELY(x > 0)) {
    /// ....
}
ENOKI_UNLIKELY(condition)

Signals that the branch is rarely taken, analogous to ENOKI_LIKELY().

ENOKI_UNROLL

Cross-platform mechanism for asking the compiler to unroll a loop. The macro should be placed before the for statement.
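
An example is shown below:

ENOKI_UNROLL for (int i = 0; i < 4; ++i) {
    /// ....
}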

ENOKI_NOUNROLL

Cross-platform mechanism for asking the compiler to never unroll a loop, analogous to ENOKI_UNROLL().

ENOKI_INLINE

Cross-platform mechanism for asking the compiler to always inline a function. The macro should be placed in front of the function declaration.

ENOKI_INLINE void foo() { ... }
ENOKI_NOINLINE

Cross-platform mechanism for asking the compiler to never inline a function, analogous to ENOKI_INLINE().

Global variable definitions

static constexpr bool has_avx512dq

Specifies whether AVX512DQ instructions are available on the target architecture.

static constexpr bool has_avx512vl

Specifies whether AVX512VL instructions are available on the target architecture.

static constexpr bool has_avx512bw

Specifies whether AVX512BW instructions are available on the target architecture.

static constexpr bool has_avx512cd

Specifies whether AVX512CD instructions are available on the target architecture.

static constexpr bool has_avx512pf

Specifies whether AVX512PF instructions are available on the target architecture.

static constexpr bool has_avx512er

Specifies whether AVX512ER instructions are available on the target architecture.

static constexpr bool has_avx512vpopcntdq

Specifies whether AVX512VPOPCNTDQ instructions are available on the target architecture.

static constexpr bool has_avx512f

Specifies whether AVX512F instructions are available on the target architecture.

static constexpr bool has_avx2

Specifies whether AVX2 instructions are available on the target architecture.

static constexpr bool has_avx

Specifies whether AVX instructions are available on the target architecture.

static constexpr bool has_fma

Specifies whether fused multiply-add instructions are available on the target architecture (ARM & x86).

static constexpr bool has_f16c

Specifies whether F16C instructions are available on the target architecture.

static constexpr bool has_sse42

Specifies whether SSE 4.2 instructions are available on the target architecture.

static constexpr bool has_x86_32

Specifies whether the target architecture is x86, 32 bit.

static constexpr bool has_x86_64

Specifies whether the target architecture is x86, 64 bit.

static constexpr bool has_arm_neon

Specifies whether ARM NEON instructions are available on the target architecture.

static constexpr bool has_arm_32

Specifies whether the target architecture is a 32-bit ARM processor (armv7).

static constexpr bool has_arm_64

Specifies whether the target architecture is aarch64 (armv8+).

static constexpr size_t max_packet_size

Denotes the maximal packet size (in bytes) that can be mapped to native vector registers. It is equal to 64 if AVX512 is present, 32 if AVX is present, and 16 for machines with SSE 4.2 or ARM NEON.

static constexpr size_t array_default_size

Denotes the default size of Enoki arrays. Equal to max_packet_size / 4.

Static arrays

template <typename Value, size_t Size = array_default_size>
class Array : StaticArrayImpl<Value, Size, Array<Value, Size>>

The default Enoki array class – a generic container that stores a fixed-size array of an arbitrary data type similar to the standard template library class std::array. The main distinction between the two is that enoki::Array forwards all arithmetic operations (and other standard mathematical functions) to the contained elements.

It has several template parameters:

  • typename Value: the type of individual array entries.
  • size_t Size: the number of packed array entries.

This class is just a small wrapper that instantiates enoki::StaticArrayImpl using the Curiously Recurring Template Pattern (CRTP). The latter provides the actual machinery that is needed to evaluate array expressions. See Defining custom array types for details.
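
A minimal sketch of this forwarding behavior:

using Vector3f = Array<float, 3>;

Vector3f a(1.f, 2.f, 3.f),
         b(4.f, 9.f, 16.f);

/* '+' and 'sqrt' are applied entry-by-entry */
Vector3f c = a + sqrt(b); /* == [3, 5, 7] */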

template <typename Value, size_t Size = array_default_size>
class Packet : StaticArrayImpl<Value, Size, Packet<Value, Size>>

The Packet type is identical to enoki::Array except for its broadcasting behavior (see the section on broadcasting rules for details).

template<typename Value, size_t Size, typename Derived>
class StaticArrayImpl

This base class provides the core implementation of an Enoki array. It cannot be instantiated directly and is used via the Curiously Recurring Template Pattern (CRTP). See Array and Defining custom array types for details on how to create custom array types.

StaticArrayImpl()

Create an uninitialized array. Floating point arrays are initialized using std::numeric_limits<Value>::quiet_NaN() when the application is compiled in debug mode.

StaticArrayImpl(Value value)

Broadcast a constant value to all entries of the array.

template<typename ...Args>
StaticArrayImpl(Args... args)

Initialize the individual array entries with args (where sizeof...(args) == Size).

template<typename Value2, typename Derived2>
StaticArrayImpl(const StaticArrayImpl<Value2, Size, Derived2> &other)

Initialize the array with the contents of another given array that potentially has a different underlying type. Enoki will perform a vectorized type conversion if this is supported by the target processor.
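
For example, the following sketch performs an integer-to-float conversion:

Array<int32_t, 4> i(1, 2, 3, 4);
Array<float, 4> f(i); /* converted using vectorized instructions if available */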

size_t size() const

Returns the size of the array.

const Value &operator[](size_t index) const

Return a reference to an array element (const version). When the application is compiled in debug mode, the function performs a range check and throws std::out_of_range in case of an out-of-range access. This behavior can be disabled by defining ENOKI_DISABLE_RANGE_CHECK.

Value &operator[](size_t index)

Return a reference to an array element. When the application is compiled in debug mode, the function performs a range check and throws std::out_of_range in case of an out-of-range access. This behavior can be disabled by defining ENOKI_DISABLE_RANGE_CHECK.

const Value &coeff(size_t index) const

Just like operator[](), but without the range check (const version).

Value &coeff(size_t index)

Just like operator[](), but without the range check.

Value &x()

Access the first component.

const Value &x() const

Access the first component (const version).

Value &y()

Access the second component.

const Value &y() const

Access the second component (const version).

Value &z()

Access the third component.

const Value &z() const

Access the third component (const version).

Value &w()

Access the fourth component.

const Value &w() const

Access the fourth component (const version).

Memory allocation

void *alloc(size_t size)

Allocates size bytes of memory that are sufficiently aligned so that any Enoki array can be safely stored at the returned address.

template<typename T>
T *alloc(size_t count)

Typed convenience alias for alloc(). Allocates count * sizeof(T) bytes of memory that are sufficiently aligned so that any Enoki array can be safely stored at the returned address.

void dealloc(void *ptr)

Release the given memory region previously allocated by alloc().
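
A usage sketch:

/* Allocate suitably aligned storage for 100 floats, then release it */
float *buf = alloc<float>(100);
/* ... */
dealloc(buf);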

Memory operations

template<typename Array>
Array load(const void *mem, mask_t<Array> mask = true)

Loads an array of type Array from the memory address mem (which is assumed to be aligned on a multiple of alignof(Array) bytes). No loads are performed for entries whose mask bit is false—instead, these entries are initialized with zero.

Warning

Performing an aligned load from an unaligned memory address will cause a general protection fault that immediately terminates the application.

Note

When the mask parameter is specified, the function implements a masked load, which is fairly slow on machines without the AVX512 instruction set.
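
A usage sketch, assuming an 8-wide packet (the buffer alignment is the caller’s responsibility):

using FloatP = Packet<float, 8>;

alignas(FloatP) float buf[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
FloatP value = load<FloatP>(buf); /* aligned load of all 8 entries */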

template<typename Array>
Array load_unaligned(const void *mem, mask_t<Array> mask = true)

Loads an array of type Array from the memory address mem (which is not required to be aligned). No loads are performed for entries whose mask bit is false—instead, these entries are initialized with zero.

Note

When the mask parameter is specified, the function implements a masked load, which is fairly slow on machines without the AVX512 instruction set.

template<typename Array>
void store(void *mem, Array array, mask_t<Array> mask = true)

Stores an array of type Array at the memory address mem (which is assumed to be aligned on a multiple of alignof(Array) bytes). No stores are performed for entries whose mask bit is false.

Warning

Performing an aligned store to an unaligned memory address will cause a general protection fault that immediately terminates the application.

Note

When the mask parameter is specified, the function implements a masked store, which is fairly slow on machines without the AVX512 instruction set.

template<typename Array>
void store_unaligned(void *mem, Array array, mask_t<Array> mask = true)

Stores an array of type Array at the memory address mem (which is not required to be aligned). No stores are performed for entries whose mask bit is false.

Note

When the mask parameter is specified, the function implements a masked store, which is fairly slow on machines without the AVX512 instruction set.

template<typename Array, size_t Stride = sizeof(scalar_t<Array>), typename Index>
Array gather(const void *mem, Index index, mask_t<Array> mask = true)

Loads an array of type Array using a masked gather operation. This is equivalent to the following scalar loop (which is mapped to efficient hardware instructions if supported by the target hardware).

Array result;
for (size_t i = 0; i < Array::Size; ++i)
    if (mask[i])
        result[i] = ((Value *) mem)[index[i]];
    else
        result[i] = Value(0);

The index parameter must be a 32 or 64 bit integer array having the same number of entries. It will be interpreted as a signed array regardless of whether the provided array is signed or unsigned.

The default value of the Stride parameter indicates that the data at mem uses a packed memory layout (i.e. a stride value of sizeof(Value)); other values override this behavior.
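
For instance, the following sketch gathers the x field from an AoS array of hypothetical Point3f records (reusing the FloatP and UInt32P aliases from earlier sections), whose entries are sizeof(Point3f) = 12 bytes apart:

struct Point3f { float x, y, z; }; /* hypothetical AoS record */

Point3f *records = ...;
UInt32P indices = ...;

/* Gather the 'x' fields using a 12-byte stride */
FloatP xs = gather<FloatP, sizeof(Point3f)>(&records->x, indices);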

template<size_t Stride = 0, typename Array, typename Index>
void scatter(void *mem, Array array, Index index, mask_t<Array> mask = true)

Stores an array of type Array using a scatter operation. This is equivalent to the following scalar loop (which is mapped to efficient hardware instructions if supported by the target hardware).

for (size_t i = 0; i < Array::Size; ++i)
    if (mask[i])
        ((Value *) mem)[index[i]] = array[i];

The index parameter must be a 32 or 64 bit integer array having the same number of entries. It will be interpreted as a signed array regardless of whether the provided array is signed or unsigned.

The default value of the Stride parameter indicates that the data at mem uses a packed memory layout (i.e. a stride value of sizeof(Value)); other values override this behavior.

template<typename Array, bool Write = false, size_t Level = 2, size_t Stride = sizeof(scalar_t<Array>), typename Index>
void prefetch(const void *mem, Index index, mask_t<Array> mask = true)

Pre-fetches an array of type Array into the L1 or L2 cache (as indicated via the Level template parameter) to reduce the latency of a future gather or scatter operation. If Write = true, the associated cache line should be acquired for write access (i.e. for a scatter rather than a gather operation).

The index parameter must be a 32 or 64 bit integer array having the same number of entries. It will be interpreted as a signed array regardless of whether the provided array is signed or unsigned.

If provided, the mask parameter specifies which of the pre-fetches should actually be performed.

The default value of the Stride parameter indicates that the data at mem uses a packed memory layout (i.e. a stride value of sizeof(Value)); other values override this behavior.

template<typename Array, typename Index>
void scatter_add(void *mem, Array array, Index index, mask_t<Array> mask = true)

Performs a scatter-add operation that is equivalent to the following scalar loop (mapped to efficient hardware instructions if supported by the target hardware).

for (size_t i = 0; i < Array::Size; ++i)
    if (mask[i])
        ((Value *) mem)[index[i]] += array[i];

The implementation avoids conflicts in case multiple indices refer to the same entry.

template<typename Output, typename Input, typename Mask>
size_t compress(Output output, Input input, Mask mask)

Tightly packs the input values selected by a provided mask and writes them to output, which must be a pointer or a structure of pointers. See the advanced topics section regarding usage. The function returns count(mask) and also advances the pointer by this amount.

template<typename Array, typename Index, typename Mask, typename Func, typename ...Args>
void transform(scalar_t<Array> *mem, Index index, Mask mask, Func func, Args&&... args)

Transforms referenced entries at mem by the function func while avoiding potential conflicts. The variadic template arguments args are forwarded to the function. The pseudocode for this operation is

for (size_t i = 0; i < Array::Size; ++i) {
    if (mask[i])
        func(mem[index[i]], args...);
}

See the section on the histogram problem and conflict detection on how to use this function.

Note

To efficiently perform the transformation at the hardware level, the Index and Array types should have the same size. The implementation ensures that this is the case by performing an explicit cast of the index parameter to int_array_t<Array>.

template<typename Array, typename Index, typename Func, typename ...Args>
void transform(scalar_t<Array> *mem, Index index, Func func, Args&&... args)

Unmasked version of transform().

Miscellaneous initialization

template<typename Array>
Array empty()

Returns an uninitialized static array.

template<typename DArray>
DArray empty(size_t size)

Allocates and returns a dynamic array of type DArray of size size. The array contents are uninitialized.

template<typename Array>
Array zero()

Returns a static array filled with zeros. This is analogous to writing Array(0) but makes it more explicit to the compiler that a specific efficient instruction sequence should be used for zero-initialization.

template<typename DArray>
DArray zero(size_t size)

Allocates and returns a dynamic array of type DArray that is filled with zeros.

template<typename Array>
Array arange()

Return an array initialized with an index sequence, i.e. 0, 1, .., Array::Size-1.

template<typename DArray>
DArray arange(size_t size)

Allocates and returns a dynamic array of type DArray that is filled with the index sequence 0..size-1.

template<typename Array>
Array linspace(scalar_t<Array> min, scalar_t<Array> max)

Return an array initialized with linearly spaced entries including the endpoints min and max.

template<typename DArray>
DArray linspace(scalar_t<DArray> min, scalar_t<DArray> max, size_t size)

Allocates and returns a dynamic array initialized with size linearly spaced entries including the endpoints min and max.

template<typename DArray>
Array<DArray, 2> meshgrid(const DArray &x, const DArray &y)

Creates a 2D coordinate array containing all pairs of entries from the x and y arrays. Analogous to the meshgrid function in NumPy.

using FloatP = Array<float>;
using FloatX = DynamicArray<FloatP>;

auto x = linspace<FloatX>(0.f, 1.f, 4);
auto y = linspace<FloatX>(2.f, 3.f, 4);
Array<FloatX, 2> grid = meshgrid(x, y);

std::cout << grid << std::endl;

/* Prints:

    [[0, 2],
     [0.333333, 2],
     [0.666667, 2],
     [1, 2],
     [0, 2.33333],
     [0.333333, 2.33333],
     [0.666667, 2.33333],
     [1, 2.33333],
     [0, 2.66667],
     [0.333333, 2.66667],
     [0.666667, 2.66667],
     [1, 2.66667],
     [0, 3],
     [0.333333, 3],
     [0.666667, 3],
     [1, 3]]
*/
template<typename Predicate, typename Index>
Index binary_search(scalar_t<Index> start, scalar_t<Index> end, Predicate pred)

Perform a binary search over a range given a predicate pred, which monotonically decreases over this range (i.e. at most one true -> false transition).

Given a (scalar) start and end index of a range, this function evaluates a predicate floor(log2(end-start) + 1) times with index values on the interval [start, end] (inclusive) to find the first index that no longer satisfies it. Note that the template parameter Index is automatically inferred from the supplied predicate. Specifically, the predicate takes an index or an index vector of type Index as input argument. In the vectorized case, each vector lane can thus evaluate a different predicate. When pred is false for all entries, the function returns start, and when it is true for all cases, it returns end.

The following code example shows a typical use case: array contains a sorted list of floating point numbers, and the goal is to map floating point entries of x to the first index j such that array[j] >= x (and all of this of course in parallel for each vector element).

std::vector<float> array = ...;
Float32P x = ...;

UInt32P j = binary_search(
   0,
   array.size() - 1,
   [&](UInt32P index) {
       return gather<Float32P>(array.data(), index) < x;
   }
);

Elementary Arithmetic Operators

template<typename Array>
Array operator+(Array x, Array y)

Binary addition operator.

template<typename Array>
Array operator-(Array x, Array y)

Binary subtraction operator.

template<typename Array>
Array operator-(Array x)

Unary minus operator.

template<typename Array>
Array operator*(Array x, Array y)

Binary multiplication operator.

template<typename Array>
Array &operator*=(Array &x, Array y)

Compound assignment operator for multiplication. For Array types with non-commutative multiplication, the operation expands to x = x * y;.
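
For instance, for matrix types (a sketch, assuming the Matrix3f alias from earlier sections):

Matrix3f A = ...;
Matrix3f B = ...;

A *= B; /* right multiplication: A = A * B, not A = B * A */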

template<typename Array>
Array mulhi(Array x, Array y)

Returns the high part of an integer multiplication. For 32-bit scalar input, this is e.g. equivalent to the following expression

(int32_t) (((int64_t) x * (int64_t) y) >> 32);
template<typename Array>
Array operator/(Array x, Array y)

Binary division operator. A special overload multiplies by the reciprocal when the second argument is a scalar.

Integer division is handled specially, see Vectorized integer division by constants for details.

template<typename Array>
Array operator|(Array x, Array y)

Binary bitwise OR operator.

template<typename Array>
Array operator||(Array x, Array y)

Binary logical OR operator (identical to operator|, as no short-circuiting is supported in operator overloads).

template<typename Array>
Array operator&(Array x, Array y)

Binary bitwise AND operator.

template<typename Array>
Array operator&&(Array x, Array y)

Binary logical AND operator (identical to operator&, as no short-circuiting is supported in operator overloads).

template<typename Array>
Array andnot(Array x, Array y)

Binary AND NOT operator (identical to x & ~y).

template<typename Array>
Array operator^(Array x, Array y)

Binary bitwise XOR operator.

template<typename Array>
Array operator<<(Array x, Array y)

Left shift operator. See also: sl() and rol().

template<typename Array>
Array operator>>(Array x, Array y)

Right shift operator. See also: sr() and ror().

template<typename Array>
mask_t<Array> operator<(Array x, Array y)

Less-than comparison operator.

template<typename Array>
mask_t<Array> operator<=(Array x, Array y)

Less-than-or-equal comparison operator.

template<typename Array>
mask_t<Array> operator>(Array x, Array y)

Greater-than comparison operator.

template<typename Array>
mask_t<Array> operator>=(Array x, Array y)

Greater-than-or-equal comparison operator.

template<typename Array>
mask_t<Array> eq(Array x, Array y)

Equality operator (vertical operation).

template<typename Array>
mask_t<Array> neq(Array x, Array y)

Inequality operator (vertical operation).

template<size_t Imm, typename Array>
Array sl(Array x)

Left shift by an immediate amount Imm.

template<size_t Imm, typename Array>
Array sr(Array x)

Right shift by an immediate amount Imm.

template<typename Array>
Array rol(Array x, Array y)

Left shift with rotation.

template<typename Array>
Array ror(Array x, Array y)

Right shift with rotation.

template<size_t Imm, typename Array>
Array rol(Array x)

Left shift with rotation by an immediate amount Imm.

template<size_t Imm, typename Array>
Array ror(Array x)

Right shift with rotation by an immediate amount Imm.

template<size_t Imm, typename Array>
Array ror_array(Array x)

Rotate the entire array by Imm entries towards the right, i.e. coeff[0] becomes coeff[Imm], etc.

template<size_t Imm, typename Array>
Array rol_array(Array x)

Rotate the entire array by Imm entries towards the left, i.e. coeff[Imm] becomes coeff[0], etc.

template<typename Target, typename Source>
Target reinterpret_array(Source x)

Reinterprets the bit-level representation of an array (e.g. from uint32_t to float). See the section on reinterpreting array contents for further details.

Elementary Arithmetic Functions

template<typename Array>
Array rcp(Array x)

Computes the reciprocal \(\frac{1}{x}\). A slightly less accurate (but more efficient) implementation is used when approximate mode is enabled for Array. Relies on AVX512ER instructions if available.

template<typename Array>
Array rsqrt(Array x)

Computes the reciprocal square root \(\frac{1}{\sqrt{x}}\). A slightly less accurate (but more efficient) implementation is used when approximate mode is enabled for Array. Relies on AVX512ER instructions if available.

template<typename Array>
Array abs(Array x)

Computes the absolute value \(|x|\) (analogous to std::abs).

template<typename Array>
Array max(Array x, Array y)

Returns the maximum of \(x\) and \(y\) (analogous to std::max).

template<typename Array>
Array min(Array x, Array y)

Returns the minimum of \(x\) and \(y\) (analogous to std::min).

template<typename Array>
Array sign(Array x)

Computes the signum function \(\begin{cases}1,&\mathrm{if}\ x\ge 0\\-1,&\mathrm{otherwise}\end{cases}\)

Analogous to std::copysign(1.f, x).

template<typename Array>
Array copysign(Array x, Array y)

Copies the sign of the array y to x (analogous to std::copysign).

template<typename Array>
Array copysign_neg(Array x, Array y)

Copies the sign of the array -y to x.

template<typename Array>
Array mulsign(Array x, Array y)

Efficiently multiplies x by the sign of y.

template<typename Array>
Array mulsign_neg(Array x, Array y)

Efficiently multiplies x by the sign of -y.

template<typename Array>
Array sqr(Array x)

Computes the square of \(x\) (analogous to x*x)

template<typename Array>
Array sqrt(Array x)

Computes the square root of \(x\) (analogous to std::sqrt).

template<typename Array>
Array cbrt(Array x)

Computes the cube root of \(x\) (analogous to std::cbrt).

template<typename Array>
Array hypot(Array x, Array y)

Computes \(\sqrt{x^2+y^2}\) while avoiding overflow and underflow.

template<typename Array>
Array ceil(Array x)

Computes the ceiling of \(x\) (analogous to std::ceil).

template<typename Array>
Array floor(Array x)

Computes the floor of \(x\) (analogous to std::floor).

template<typename Target, typename Array>
Array ceil2int(Array x)

Computes the ceiling of \(x\) and converts the result to an integer. If supported by the hardware, the combined operation is more efficient than the analogous expression Target(ceil(x)).

template<typename Target, typename Array>
Array floor2int(Array x)

Computes the floor of \(x\) and converts the result to an integer. If supported by the hardware, the combined operation is more efficient than the analogous expression Target(floor(x)).

template<typename Array>
Array round(Array x)

Rounds \(x\) to the nearest integer using Banker’s rounding for half-way values.

Note

This is analogous to std::rint, not std::round.
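
For instance, half-way cases round to the nearest even integer:

Array<float, 4> x(0.5f, 1.5f, 2.5f, 3.5f);

/* Prints: [0, 2, 2, 4] */
std::cout << round(x) << std::endl;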

template<typename Array>
Array trunc(Array x)

Rounds \(x\) towards zero (analogous to std::trunc).

template<typename Array>
Array fmod(Array x, Array y)

Computes the floating-point remainder of the division operation x/y

template<typename Array>
Array fmadd(Array x, Array y, Array z)

Performs a fused multiply-add operation if supported by the target hardware. Otherwise, the operation is emulated using conventional multiplication and addition (i.e. x * y + z).

template<typename Array>
Array fnmadd(Array x, Array y, Array z)

Performs a fused negative multiply-add operation if supported by the target hardware. Otherwise, the operation is emulated using conventional multiplication and addition (i.e. -x * y + z).

template<typename Array>
Array fmsub(Array x, Array y, Array z)

Performs a fused multiply-subtract operation if supported by the target hardware. Otherwise, the operation is emulated using conventional multiplication and subtraction (i.e. x * y - z).

template<typename Array>
Array fnmsub(Array x, Array y, Array z)

Performs a fused negative multiply-subtract operation if supported by the target hardware. Otherwise, the operation is emulated using conventional multiplication and subtraction (i.e. -x * y - z).

template<typename Array>
Array fmaddsub(Array x, Array y, Array z)

Performs a fused multiply-add and multiply-subtract operation for alternating elements. The pseudocode for this operation is

Array result;
for (size_t i = 0; i < Array::Size; ++i) {
    if (i % 2 == 0)
        result[i] = x[i] * y[i] - z[i];
    else
        result[i] = x[i] * y[i] + z[i];
}
template<typename Array>
Array fmsubadd(Array x, Array y, Array z)

Performs a fused multiply-add and multiply-subtract operation for alternating elements. The pseudocode for this operation is

Array result;
for (size_t i = 0; i < Array::Size; ++i) {
    if (i % 2 == 0)
        result[i] = x[i] * y[i] + z[i];
    else
        result[i] = x[i] * y[i] - z[i];
}
template<typename Array>
Array ldexp(Array x, Array n)

Multiplies \(x\) by \(2^n\). Analogous to std::ldexp except that n is a floating point argument.

template<typename Array>
std::pair<Array, Array> frexp(Array x)

Breaks the floating-point number \(x\) into a normalized fraction and power of 2. Analogous to std::frexp except that both return values are floating point values.

template<typename Array>
Array lerp(Array a, Array b, Array t)

Blends between the values \(a\) and \(b\) using the expression \(a(1-t) + t*b\).

template<typename Array>
Array clip(Array a, Array min, Array max)

Clips \(a\) to the specified interval \([\texttt{min}, \texttt{max}]\).

Horizontal operations

template<typename Array>
bool operator==(Array x, Array y)

Equality operator.

Warning

Following the principle of least surprise, enoki::operator==() is a horizontal operation that returns a boolean value; a vertical alternative named eq() is also available. The following pair of operations is equivalent:

bool b1 = (f1 == f2);
bool b2 = all(eq(f1, f2));
template<typename Array>
bool operator!=(Array x, Array y)

Warning

Following the principle of least surprise, enoki::operator!=() is a horizontal operation that returns a boolean value; a vertical alternative named neq() is also available. The following pair of operations is equivalent:

bool b1 = (f1 != f2);
bool b2 = any(neq(f1, f2));
template<typename Array>
Array reverse(Array value)

Returns the input array with all components reversed, i.e.

value[Array::Size-1], .., value[0]

For multidimensional arrays, the outermost dimension of Array is reversed. Note that this operation is currently not efficiently vectorized on 1D CPU arrays (though GPU and/or multi-dimensional arrays are fine).

template<typename Array>
Array psum(Array value)

Computes the inclusive prefix sum of the components of value, i.e.

value[0], value[0] + value[1], .., value[0] + .. + value[Array::Size-1];

For multidimensional arrays, the horizontal reduction is performed over the outermost dimension of Array. Note that this operation is currently not efficiently vectorized on 1D CPU arrays (though GPU and/or multi-dimensional arrays are fine).

template<typename Array>
value_t<Array> hsum(Array value)

Efficiently computes the horizontal sum of the components of value, i.e.

value[0] + .. + value[Array::Size-1];

For 1D arrays, hsum() returns a scalar result. For multidimensional arrays, the horizontal reduction is performed over the outermost dimension of Array, and the result is of type value_t<Array>.

template<typename Array>
auto hsum_inner(Array value)

Analogous to hsum(), except that the horizontal reduction is performed over the innermost dimension of Array (which is only relevant in the case of a multidimensional input array).

template<typename Array>
scalar_t<Array> hsum_nested(Array value)

Recursive version of hsum(), which nests through all dimensions and always returns a scalar.

template<typename Array>
value_t<Array> hprod(Array value)

Efficiently computes the horizontal product of the components of value, i.e.

value[0] * .. * value[Array::Size-1];

For 1D arrays, hprod() returns a scalar result. For multidimensional arrays, the horizontal reduction is performed over the outermost dimension of Array, and the result is of type value_t<Array>.

template<typename Array>
auto hprod_inner(Array value)

Analogous to hprod(), except that the horizontal reduction is performed over the innermost dimension of Array (which is only relevant in the case of a multidimensional input array).

template<typename Array>
scalar_t<Array> hprod_nested(Array value)

Recursive version of hprod(), which nests through all dimensions and always returns a scalar.

template<typename Array>
value_t<Array> hmax(Array value)

Efficiently computes the horizontal maximum of the components of value, i.e.

max(value[0], max(value[1], ...))

For 1D arrays, hmax() returns a scalar result. For multidimensional arrays, the horizontal reduction is performed over the outermost dimension of Array, and the result is of type value_t<Array>.

template<typename Array>
auto hmax_inner(Array value)

Analogous to hmax(), except that the horizontal reduction is performed over the innermost dimension of Array (which is only relevant in the case of a multidimensional input array).

template<typename Array>
scalar_t<Array> hmax_nested(Array value)

Recursive version of hmax(), which nests through all dimensions and always returns a scalar.

template<typename Array>
value_t<Array> hmin(Array value)

Efficiently computes the horizontal minimum of the components of value, i.e.

min(value[0], min(value[1], ...))

For 1D arrays, hmin() returns a scalar result. For multidimensional arrays, the horizontal reduction is performed over the outermost dimension of Array, and the result is of type value_t<Array>.

template<typename Array>
auto hmin_inner(Array value)

Analogous to hmin(), except that the horizontal reduction is performed over the innermost dimension of Array (which is only relevant in the case of a multidimensional input array).

template<typename Array>
scalar_t<Array> hmin_nested(Array value)

Recursive version of hmin(), which nests through all dimensions and always returns a scalar.
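
A minimal sketch of these reductions applied to a 4-wide packet:

using FloatP = Packet<float, 4>;

FloatP x(1.f, 2.f, 3.f, 4.f);

float s = hsum(x);    // 10
float p = hprod(x);   // 24
float hi = hmax(x);   // 4
float lo = hmin(x);   // 1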

template<typename Mask>
auto all(Mask value)

Efficiently computes the horizontal AND (i.e. logical conjunction) of the components of the mask value, i.e.

value[0] & ... & value[Size-1]

For 1D arrays, all() returns a boolean result. For multidimensional arrays, the horizontal reduction is performed over the outermost dimension of Mask, and the result is of type mask_t<value_t<Mask>>.

template<typename Mask>
auto all_inner(Mask value)

Analogous to all(), except that the horizontal reduction is performed over the innermost dimension of Mask (which is only relevant in the case of a multidimensional input array).

template<typename Mask>
bool all_nested(Mask value)

Recursive version of all(), which nests through all dimensions and always returns a boolean value.

template<typename Mask, bool Default>
bool all_or(Mask value)

This function returns the Default template argument when Mask is a GPU array. Otherwise, it falls back to all(). See the section on horizontal operations on the GPU for details.

template<typename Mask>
auto any(Mask value)

Efficiently computes the horizontal OR (i.e. logical disjunction) of the components of the mask value, i.e.

value[0] | ... | value[Size-1]

For 1D arrays, any() returns a boolean result. For multidimensional arrays, the horizontal reduction is performed over the outermost dimension of Mask, and the result is of type mask_t<value_t<Mask>>.

template<typename Mask>
auto any_inner(Mask value)

Analogous to any(), except that the horizontal reduction is performed over the innermost dimension of Mask (which is only relevant in the case of a multidimensional input array).

template<typename Mask>
bool any_nested(Mask value)

Recursive version of any(), which nests through all dimensions and always returns a boolean value.

template<typename Mask, bool Default>
bool any_or(Mask value)

This function returns the Default template argument when Mask is a GPU array. Otherwise, it falls back to any(). See the section on horizontal operations on the GPU for details.

template<typename Mask>
auto none(Mask value)

Efficiently computes the negated horizontal OR of the components of the mask value, i.e.

~(value[0] | ... | value[Size-1])

For 1D arrays, none() returns a boolean result. For multidimensional arrays, the horizontal reduction is performed over the outermost dimension of Mask, and the result is of type mask_t<value_t<Mask>>.

template<typename Mask>
auto none_inner(Mask value)

Analogous to none(), except that the horizontal reduction is performed over the innermost dimension of Mask (which is only relevant in the case of a multidimensional input array).

template<typename Mask>
bool none_nested(Mask value)

Recursive version of none(), which nests through all dimensions and always returns a boolean value.

template<typename Mask, bool Default>
bool none_or(Mask value)

This function returns the Default template argument when Mask is a GPU array. Otherwise, it falls back to none(). See the section on horizontal operations on the GPU for details.

template<typename Mask>
auto count(Mask value)

Efficiently computes the number of components whose mask bits are turned on, i.e.

(value[0] ? 1 : 0) + ... + (value[Size - 1] ? 1 : 0)

For 1D arrays, count() returns a result of type size_t. For multidimensional arrays, the horizontal reduction is performed over the outermost dimension of Mask, and the result is of type size_array_t<value_t<Mask>>.

template<typename Mask>
auto count_inner(Mask value)

Analogous to count(), except that the horizontal reduction is performed over the innermost dimension of Mask (which is only relevant in the case of a multidimensional input array).

template<typename Mask>
size_t count_nested(Mask value)

Recursive version of count(), which nests through all dimensions and always returns a value of type size_t.
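
The mask reductions above compose naturally with comparison operators. A minimal sketch:

using FloatP = Packet<float, 4>;

FloatP x(1.f, -2.f, 3.f, -4.f);
auto m = x > 0.f;          // mask: [true, false, true, false]

bool   a = all(m);         // false
bool   b = any(m);         // true
bool   c = none(m);        // false
size_t n = count(m);       // 2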

template<typename Array>
value_t<Array> dot(Array value1, Array value2)

Efficiently computes the dot product of value1 and value2, i.e.:

value1[0]*value2[0] + .. + value1[Array::Size-1]*value2[Array::Size-1];

The return value is of type value_t<Array>, which is a scalar (e.g. float) for ordinary inputs and an array for nested array inputs.

template<typename Array>
value_t<Array> norm(Array value)

Computes the 2-norm of the input array. The return value is of type value_t<Array>, which is a scalar (e.g. float) for ordinary inputs and an array for nested array inputs.

template<typename Array>
value_t<Array> squared_norm(Array value)

Computes the squared 2-norm of the input array. The return value is of type value_t<Array>, which is a scalar (e.g. float) for ordinary inputs and an array for nested array inputs.

template<typename Array>
Array normalize(Array value)

Normalizes the input array by multiplying by the inverse of norm(value).
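
A minimal sketch of these operations applied to a 3D array:

using Vector3f = Array<float, 3>;

Vector3f v(1.f, 2.f, 2.f);

float    d  = dot(v, v);        // 9
float    n  = norm(v);          // 3
float    n2 = squared_norm(v);  // 9
Vector3f u  = normalize(v);     // [0.333333, 0.666667, 0.666667]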

Transcendental functions

Accuracy of transcendental function approximations

Most approximations of transcendental functions are based on routines in the CEPHES math library. The table below provides some statistics on their absolute and relative error.

The CEPHES approximations are only used when approximate mode is enabled; otherwise, the functions below will invoke the corresponding non-vectorized standard C library routines.

Note

The forward trigonometric functions (sin, cos, tan) are optimized for low error on the domain \(|x| < 8192\) and don’t perform as well beyond this range.

Single precision
Function Tested domain Abs. error (mean) Abs. error (max) Rel. error (mean) Rel. error (max)
\(\mathrm{sin}()\) \(-8192 < x < 8192\) \(1.2 \cdot 10^{-8}\) \(1.2 \cdot 10^{-7}\) \(1.9 \cdot 10^{-8}\,(0.25\,\mathrm{ulp})\) \(1.8 \cdot 10^{-6}\,(19\,\mathrm{ulp})\)
\(\mathrm{cos}()\) \(-8192 < x < 8192\) \(1.2 \cdot 10^{-8}\) \(1.2 \cdot 10^{-7}\) \(1.9 \cdot 10^{-8}\,(0.25\,\mathrm{ulp})\) \(3.1 \cdot 10^{-6}\,(47\,\mathrm{ulp})\)
\(\mathrm{tan}()\) \(-8192 < x < 8192\) \(4.7 \cdot 10^{-6}\) \(8.1 \cdot 10^{-1}\) \(3.4 \cdot 10^{-8}\,(0.42\,\mathrm{ulp})\) \(3.1 \cdot 10^{-6}\,(30\,\mathrm{ulp})\)
\(\mathrm{cot}()\) \(-8192 < x < 8192\) \(2.6 \cdot 10^{-6}\) \(0.11 \cdot 10^{1}\) \(3.5 \cdot 10^{-8}\,(0.42\,\mathrm{ulp})\) \(3.1 \cdot 10^{-6}\,(47\,\mathrm{ulp})\)
\(\mathrm{asin}()\) \(-1 < x < 1\) \(2.3 \cdot 10^{-8}\) \(1.2 \cdot 10^{-7}\) \(2.9 \cdot 10^{-8}\,(0.33\,\mathrm{ulp})\) \(2.3 \cdot 10^{-7}\,(2\,\mathrm{ulp})\)
\(\mathrm{acos}()\) \(-1 < x < 1\) \(4.7 \cdot 10^{-8}\) \(2.4 \cdot 10^{-7}\) \(2.9 \cdot 10^{-8}\,(0.33\,\mathrm{ulp})\) \(1.2 \cdot 10^{-7}\,(1\,\mathrm{ulp})\)
\(\mathrm{atan}()\) \(-1 < x < 1\) \(1.8 \cdot 10^{-7}\) \(6 \cdot 10^{-7}\) \(4.2 \cdot 10^{-7}\,(4.9\,\mathrm{ulp})\) \(8.2 \cdot 10^{-7}\,(12\,\mathrm{ulp})\)
\(\mathrm{sinh}()\) \(-10 < x < 10\) \(2.6 \cdot 10^{-5}\) \(2 \cdot 10^{-3}\) \(2.8 \cdot 10^{-8}\,(0.34\,\mathrm{ulp})\) \(2.7 \cdot 10^{-7}\,(3\,\mathrm{ulp})\)
\(\mathrm{cosh}()\) \(-10 < x < 10\) \(2.9 \cdot 10^{-5}\) \(2 \cdot 10^{-3}\) \(2.9 \cdot 10^{-8}\,(0.35\,\mathrm{ulp})\) \(2.5 \cdot 10^{-7}\,(4\,\mathrm{ulp})\)
\(\mathrm{tanh}()\) \(-10 < x < 10\) \(4.8 \cdot 10^{-8}\) \(4.2 \cdot 10^{-7}\) \(5 \cdot 10^{-8}\,(0.76\,\mathrm{ulp})\) \(5 \cdot 10^{-7}\,(7\,\mathrm{ulp})\)
\(\mathrm{csch}()\) \(-10 < x < 10\) \(5.7 \cdot 10^{-8}\) \(7.8 \cdot 10^{-3}\) \(4.4 \cdot 10^{-8}\,(0.54\,\mathrm{ulp})\) \(3.1 \cdot 10^{-7}\,(5\,\mathrm{ulp})\)
\(\mathrm{sech}()\) \(-10 < x < 10\) \(6.7 \cdot 10^{-9}\) \(1.8 \cdot 10^{-7}\) \(4.3 \cdot 10^{-8}\,(0.54\,\mathrm{ulp})\) \(3.2 \cdot 10^{-7}\,(4\,\mathrm{ulp})\)
\(\mathrm{coth}()\) \(-10 < x < 10\) \(1.2 \cdot 10^{-7}\) \(7.8 \cdot 10^{-3}\) \(6.9 \cdot 10^{-8}\,(0.61\,\mathrm{ulp})\) \(5.7 \cdot 10^{-7}\,(8\,\mathrm{ulp})\)
\(\mathrm{asinh}()\) \(-30 < x < 30\) \(2.8 \cdot 10^{-8}\) \(4.8 \cdot 10^{-7}\) \(1 \cdot 10^{-8}\,(0.13\,\mathrm{ulp})\) \(1.7 \cdot 10^{-7}\,(2\,\mathrm{ulp})\)
\(\mathrm{acosh}()\) \(1 < x < 10\) \(2.9 \cdot 10^{-8}\) \(2.4 \cdot 10^{-7}\) \(1.5 \cdot 10^{-8}\,(0.18\,\mathrm{ulp})\) \(2.4 \cdot 10^{-7}\,(3\,\mathrm{ulp})\)
\(\mathrm{atanh}()\) \(-1 < x < 1\) \(9.9 \cdot 10^{-9}\) \(2.4 \cdot 10^{-7}\) \(1.5 \cdot 10^{-8}\,(0.18\,\mathrm{ulp})\) \(1.2 \cdot 10^{-7}\,(1\,\mathrm{ulp})\)
\(\mathrm{exp}()\) \(-20 < x < 30\) \(0.72 \cdot 10^{4}\) \(0.1 \cdot 10^{7}\) \(2.4 \cdot 10^{-8}\,(0.27\,\mathrm{ulp})\) \(1.2 \cdot 10^{-7}\,(1\,\mathrm{ulp})\)
\(\mathrm{log}()\) \(10^{-20} < x < 2\cdot 10^{30}\) \(9.6 \cdot 10^{-9}\) \(7.6 \cdot 10^{-6}\) \(1.4 \cdot 10^{-10}\,(0.0013\,\mathrm{ulp})\) \(1.2 \cdot 10^{-7}\,(1\,\mathrm{ulp})\)
\(\mathrm{erf}()\) \(-1 < x < 1\) \(3.2 \cdot 10^{-8}\) \(1.8 \cdot 10^{-7}\) \(6.4 \cdot 10^{-8}\,(0.78\,\mathrm{ulp})\) \(3.3 \cdot 10^{-7}\,(4\,\mathrm{ulp})\)
\(\mathrm{erfc}()\) \(-1 < x < 1\) \(3.4 \cdot 10^{-8}\) \(2.4 \cdot 10^{-7}\) \(6.4 \cdot 10^{-8}\,(0.79\,\mathrm{ulp})\) \(1 \cdot 10^{-6}\,(11\,\mathrm{ulp})\)
Double precision
Function Tested domain Abs. error (mean) Abs. error (max) Rel. error (mean) Rel. error (max)
\(\mathrm{sin}()\) \(-8192 < x < 8192\) \(2.2 \cdot 10^{-17}\) \(2.2 \cdot 10^{-16}\) \(3.6 \cdot 10^{-17}\,(0.25\,\mathrm{ulp})\) \(3.1 \cdot 10^{-16}\,(2\,\mathrm{ulp})\)
\(\mathrm{cos}()\) \(-8192 < x < 8192\) \(2.2 \cdot 10^{-17}\) \(2.2 \cdot 10^{-16}\) \(3.6 \cdot 10^{-17}\,(0.25\,\mathrm{ulp})\) \(3 \cdot 10^{-16}\,(2\,\mathrm{ulp})\)
\(\mathrm{tan}()\) \(-8192 < x < 8192\) \(6.8 \cdot 10^{-16}\) \(1.2 \cdot 10^{-10}\) \(5.4 \cdot 10^{-17}\,(0.35\,\mathrm{ulp})\) \(4.1 \cdot 10^{-16}\,(3\,\mathrm{ulp})\)
\(\mathrm{cot}()\) \(-8192 < x < 8192\) \(4.9 \cdot 10^{-16}\) \(1.2 \cdot 10^{-10}\) \(5.5 \cdot 10^{-17}\,(0.36\,\mathrm{ulp})\) \(4.4 \cdot 10^{-16}\,(3\,\mathrm{ulp})\)
\(\mathrm{asin}()\) \(-1 < x < 1\) \(1.3 \cdot 10^{-17}\) \(2.2 \cdot 10^{-16}\) \(1.5 \cdot 10^{-17}\,(0.098\,\mathrm{ulp})\) \(2.2 \cdot 10^{-16}\,(1\,\mathrm{ulp})\)
\(\mathrm{acos}()\) \(-1 < x < 1\) \(5.4 \cdot 10^{-17}\) \(4.4 \cdot 10^{-16}\) \(3.5 \cdot 10^{-17}\,(0.23\,\mathrm{ulp})\) \(2.2 \cdot 10^{-16}\,(1\,\mathrm{ulp})\)
\(\mathrm{atan}()\) \(-1 < x < 1\) \(4.3 \cdot 10^{-17}\) \(3.3 \cdot 10^{-16}\) \(1 \cdot 10^{-16}\,(0.65\,\mathrm{ulp})\) \(7.1 \cdot 10^{-16}\,(5\,\mathrm{ulp})\)
\(\mathrm{sinh}()\) \(-10 < x < 10\) \(3.1 \cdot 10^{-14}\) \(1.8 \cdot 10^{-12}\) \(3.3 \cdot 10^{-17}\,(0.22\,\mathrm{ulp})\) \(4.3 \cdot 10^{-16}\,(2\,\mathrm{ulp})\)
\(\mathrm{cosh}()\) \(-10 < x < 10\) \(2.2 \cdot 10^{-14}\) \(1.8 \cdot 10^{-12}\) \(2 \cdot 10^{-17}\,(0.13\,\mathrm{ulp})\) \(2.9 \cdot 10^{-16}\,(2\,\mathrm{ulp})\)
\(\mathrm{tanh}()\) \(-10 < x < 10\) \(5.6 \cdot 10^{-17}\) \(3.3 \cdot 10^{-16}\) \(6.1 \cdot 10^{-17}\,(0.52\,\mathrm{ulp})\) \(5.5 \cdot 10^{-16}\,(3\,\mathrm{ulp})\)
\(\mathrm{csch}()\) \(-10 < x < 10\) \(4.5 \cdot 10^{-17}\) \(1.8 \cdot 10^{-12}\) \(3.3 \cdot 10^{-17}\,(0.21\,\mathrm{ulp})\) \(5.1 \cdot 10^{-16}\,(4\,\mathrm{ulp})\)
\(\mathrm{sech}()\) \(-10 < x < 10\) \(3 \cdot 10^{-18}\) \(2.2 \cdot 10^{-16}\) \(2 \cdot 10^{-17}\,(0.13\,\mathrm{ulp})\) \(4.3 \cdot 10^{-16}\,(2\,\mathrm{ulp})\)
\(\mathrm{coth}()\) \(-10 < x < 10\) \(1.2 \cdot 10^{-16}\) \(3.6 \cdot 10^{-12}\) \(6.2 \cdot 10^{-17}\,(0.3\,\mathrm{ulp})\) \(6.7 \cdot 10^{-16}\,(5\,\mathrm{ulp})\)
\(\mathrm{asinh}()\) \(-30 < x < 30\) \(5.1 \cdot 10^{-17}\) \(8.9 \cdot 10^{-16}\) \(1.9 \cdot 10^{-17}\,(0.13\,\mathrm{ulp})\) \(4.4 \cdot 10^{-16}\,(2\,\mathrm{ulp})\)
\(\mathrm{acosh}()\) \(1 < x < 10\) \(4.9 \cdot 10^{-17}\) \(4.4 \cdot 10^{-16}\) \(2.6 \cdot 10^{-17}\,(0.17\,\mathrm{ulp})\) \(6.6 \cdot 10^{-16}\,(5\,\mathrm{ulp})\)
\(\mathrm{atanh}()\) \(-1 < x < 1\) \(1.8 \cdot 10^{-17}\) \(4.4 \cdot 10^{-16}\) \(3.2 \cdot 10^{-17}\,(0.21\,\mathrm{ulp})\) \(3 \cdot 10^{-16}\,(2\,\mathrm{ulp})\)
\(\mathrm{exp}()\) \(-20 < x < 30\) \(4.7 \cdot 10^{-6}\) \(2 \cdot 10^{-3}\) \(2.5 \cdot 10^{-17}\,(0.16\,\mathrm{ulp})\) \(3.3 \cdot 10^{-16}\,(2\,\mathrm{ulp})\)
\(\mathrm{log}()\) \(10^{-20} < x < 2\cdot 10^{30}\) \(1.9 \cdot 10^{-17}\) \(1.4 \cdot 10^{-14}\) \(2.7 \cdot 10^{-19}\,(0.0013\,\mathrm{ulp})\) \(2.2 \cdot 10^{-16}\,(1\,\mathrm{ulp})\)
\(\mathrm{erf}()\) \(-1 < x < 1\) \(4.7 \cdot 10^{-17}\) \(4.4 \cdot 10^{-16}\) \(9.6 \cdot 10^{-17}\,(0.63\,\mathrm{ulp})\) \(5.9 \cdot 10^{-16}\,(5\,\mathrm{ulp})\)
\(\mathrm{erfc}()\) \(-1 < x < 1\) \(4.8 \cdot 10^{-17}\) \(4.4 \cdot 10^{-16}\) \(9.6 \cdot 10^{-17}\,(0.64\,\mathrm{ulp})\) \(2.5 \cdot 10^{-15}\,(16\,\mathrm{ulp})\)

Trigonometric functions

template<typename Array>
Array sin(Array x)

Sine function approximation based on the CEPHES library.

template<typename Array>
Array cos(Array x)

Cosine function approximation based on the CEPHES library.

template<typename Array>
std::pair<Array, Array> sincos(Array x)

Simultaneous sine and cosine function approximation based on the CEPHES library.

template<typename Array>
Array tan(Array x)

Tangent function approximation based on the CEPHES library.

template<typename Array>
Array csc(Array x)

Cosecant convenience function implemented as rcp(sin(x)).

template<typename Array>
Array sec(Array x)

Secant convenience function implemented as rcp(cos(x)).

template<typename Array>
Array cot(Array x)

Cotangent convenience function implemented as rcp(tan(x)).

template<typename Array>
Array asin(Array x)

Arcsine function approximation based on the CEPHES library.

template<typename Array>
Array acos(Array x)

Arccosine function approximation based on the CEPHES library.

template<typename Array>
Array atan(Array x)

Arctangent function approximation based on the CEPHES library.

template<typename Array>
Array atan2(Array y, Array x)

Arctangent function of two variables.

Hyperbolic functions

template<typename Array>
Array sinh(Array x)

Hyperbolic sine function approximation based on the CEPHES library.

template<typename Array>
Array cosh(Array x)

Hyperbolic cosine function approximation based on the CEPHES library.

template<typename Array>
std::pair<Array, Array> sincosh(Array x)

Simultaneous hyperbolic sine and cosine function approximation based on the CEPHES library.

template<typename Array>
Array tanh(Array x)

Hyperbolic tangent function approximation based on the CEPHES library.

template<typename Array>
Array csch(Array x)

Hyperbolic cosecant convenience function implemented as rcp(sinh(x)).

template<typename Array>
Array sech(Array x)

Hyperbolic secant convenience function implemented as rcp(cosh(x)).

template<typename Array>
Array coth(Array x)

Hyperbolic cotangent convenience function implemented as rcp(tanh(x)).

template<typename Array>
Array asinh(Array x)

Hyperbolic arcsine function approximation based on the CEPHES library.

template<typename Array>
Array acosh(Array x)

Hyperbolic arccosine function approximation based on the CEPHES library.

template<typename Array>
Array atanh(Array x)

Hyperbolic arctangent function approximation based on the CEPHES library.

Exponential, logarithm, and others

template<typename Array>
Array exp(Array x)

Natural exponential function approximation based on the CEPHES library. Relies on AVX512ER instructions if available.

template<typename Array>
Array log(Array x)

Natural logarithm approximation based on the CEPHES library.

template<typename Array>
Array pow(Array x, Array y)

Computes the power function \(x^y\).

“Safe” versions of mathematical functions

template<typename Array>
Array safe_sqrt(Array x)

Computes sqrt(max(0, x)) to avoid issues with negative inputs (e.g. due to roundoff error in a prior calculation).

template<typename Array>
Array safe_rsqrt(Array x)

Computes rsqrt(max(0, x)) to avoid issues with negative inputs (e.g. due to roundoff error in a prior calculation).

template<typename Array>
Array safe_asin(Array x)

Computes asin(min(1, max(-1, x))) to avoid issues with out-of-range inputs (e.g. due to roundoff error in a prior calculation).

template<typename Array>
Array safe_acos(Array x)

Computes acos(min(1, max(-1, x))) to avoid issues with out-of-range inputs (e.g. due to roundoff error in a prior calculation).
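
For example, quantities such as dot products of unit vectors may land slightly outside their mathematical range due to roundoff, and the safe variants clamp before evaluating. A minimal sketch:

using FloatP = Packet<float, 4>;

FloatP x(-1e-7f, 0.f, 0.25f, 1.f);          // tiny negative entry from roundoff
FloatP r = safe_sqrt(x);                    // [0, 0, 0.5, 1] -- no NaN for x[0]
FloatP t = safe_acos(FloatP(1.f + 1e-6f));  // clamped to 1, evaluates to 0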

Special functions

The following special functions require including the header enoki/special.h.

template<typename Array>
Array erf(Array x)

Evaluates the error function defined as

\[\mathrm{erf}(x)=\frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,\mathrm{d}t.\]

Requires a real-valued input array x.

template<typename Array>
Array erfi(Array x)

Evaluates the imaginary error function defined as

\[\mathrm{erfi}(x)=-i\,\mathrm{erf}(ix).\]

Requires a real-valued input array x.

template<typename Array>
Array erfinv(Array x)

Evaluates the inverse of the error function erf().

template<typename Array>
Array i0e(Array x)

Evaluates the exponentially scaled modified Bessel function of order zero defined as

\[I_0^{(e)}(x) = e^{-|x|} I_0(x),\]

where

\[I_0(x) = \frac{1}{\pi} \int_{0}^\pi e^{x\cos \theta}\mathrm{d}\theta.\]
template<typename Array>
Array gamma(Array x)

Evaluates the Gamma function defined as

\[\Gamma(x)=\int_0^\infty t^{x-1} e^{-t}\,\mathrm{d}t.\]
template<typename Array>
Array lgamma(Array x)

Evaluates the natural logarithm of the Gamma function.

template<typename Array>
Array dawson(Array x)

Evaluates Dawson’s integral defined as

\[D(x)=e^{-x^2}\int_0^x e^{t^2}\,\mathrm{d}t.\]
template<typename Array>
Array ellint_1(Array phi, Array k)

Evaluates the incomplete elliptic integral of the first kind

\[F(\phi, k)=\int_0^\phi (1-k^2\sin^2\theta)^{-\frac{1}{2}}\,\mathrm{d}\theta\]
template<typename Array>
Array comp_ellint_1(Array k)

Evaluates the complete elliptic integral of the first kind

\[F(k)=\int_0^\frac{\pi}{2} (1-k^2\sin^2\theta)^{-\frac{1}{2}}\,\mathrm{d}\theta\]
template<typename Array>
Array ellint_2(Array phi, Array k)

Evaluates the incomplete elliptic integral of the second kind

\[E(\phi, k)=\int_0^\phi (1-k^2\sin^2\theta)^{\frac{1}{2}}\,\mathrm{d}\theta\]
template<typename Array>
Array comp_ellint_2(Array k)

Evaluates the complete elliptic integral of the second kind

\[E(k)=\int_0^\frac{\pi}{2} (1-k^2\sin^2\theta)^{\frac{1}{2}}\,\mathrm{d}\theta\]
template<typename Array>
Array ellint_3(Array phi, Array k, Array nu)

Evaluates the incomplete elliptic integral of the third kind

\[\Pi(\phi, k, \nu)=\int_0^\phi (1+\nu\sin^2\theta)^{-1}(1-k^2\sin^2\theta)^{-\frac{1}{2}}\,\mathrm{d}\theta\]
template<typename Array>
Array comp_ellint_3(Array k, Array nu)

Evaluates the complete elliptic integral of the third kind

\[\Pi(k, \nu)=\int_0^\frac{\pi}{2} (1+\nu\sin^2\theta)^{-1}(1-k^2\sin^2\theta)^{-\frac{1}{2}}\,\mathrm{d}\theta\]

Miscellaneous operations

template<typename Array>
mask_t<Array> isnan(Array x)

Checks for NaN values and returns a mask, analogous to std::isnan.

template<typename Array>
mask_t<Array> isinf(Array x)

Checks for infinite values and returns a mask, analogous to std::isinf.

template<typename Array>
mask_t<Array> isfinite(Array x)

Checks for finite values and returns a mask, analogous to std::isfinite.

template<typename Array>
mask_t<Array> isdenormal(Array x)

Checks for denormalized values and returns a mask.

template<typename Array>
Array deg_to_rad(Array array)

Convenience function which multiplies the input array by \(\frac{\pi}{180}\).

template<typename Array>
Array rad_to_deg(Array array)

Convenience function which multiplies the input array by \(\frac{180}{\pi}\).

template<typename Array>
Array prev_float(Array array)

Return the previous representable floating point value for each element of array, analogous to std::nextafter(array, -∞). Special values (infinities & not-a-number values) are returned unchanged.

template<typename Array>
Array next_float(Array array)

Return the next representable floating point value for each element of array, analogous to std::nextafter(array, ∞). Special values (infinities & not-a-number values) are returned unchanged.

template<typename Array>
Array tzcnt(Array array)

Return the number of trailing zero bits (assumes that Array is an integer array).

template<typename Array>
Array lzcnt(Array array)

Return the number of leading zero bits (assumes that Array is an integer array).

template<typename Array>
Array popcnt(Array array)

Return the number of nonzero bits (assumes that Array is an integer array).

template<typename Array>
Array log2i(Array array)

Return the floor of the base-two logarithm (assumes that Array is an integer array).
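
A minimal sketch of these bit-level operations on a packet of 32-bit integers:

using UInt32P = Packet<uint32_t, 4>;

UInt32P x(8u, 12u, 1u, 255u);

UInt32P t = tzcnt(x);    // [3, 2, 0, 0]
UInt32P l = lzcnt(x);    // [28, 28, 31, 24]
UInt32P p = popcnt(x);   // [1, 2, 1, 8]
UInt32P g = log2i(x);    // [3, 3, 0, 7]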

template<typename Index>
std::pair<Index, mask_t<Index>> range(scalar_t<Index> size)

Returns an iterable, which generates linearly increasing index vectors from 0 to size-1. This function is meant to be used with the C++11 range-based for loop:

for (auto pair : range<Index>(1000)) {
    Index index = pair.first;
    mask_t<Index> mask = pair.second;

    // ...
}

The mask specifies which index vector entries are active: unless the number of iterations is exactly divisible by the packet size, the last loop iteration will generally have several disabled entries.

The implementation also supports iteration over multidimensional arrays:

using UInt32P = Packet<uint32_t>;
using Vector3uP = Array<UInt32P, 3>;

for (auto pair : range<Vector3uP>(10, 20, 30)) {
    Vector3uP index = pair.first;
    auto mask = pair.second;

    // ...
}

which will generate indices (0, 0, 0), (0, 0, 1), etc. As before, the last loop iteration will generally have several disabled entries.

void set_flush_denormals(bool value)

Arithmetic involving denormalized floating point numbers triggers a slow microcode handler on most current architectures, which leads to severe performance penalties. This function can be used to specify whether denormalized floating point values are simply flushed to zero, which sidesteps the performance issues.

Enoki also provides a tiny RAII wrapper named scoped_flush_denormals which sets (and later resets) this parameter.
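
A minimal usage sketch (assuming the wrapper's constructor takes the desired mode as a boolean, mirroring set_flush_denormals()):

{
    scoped_flush_denormals guard(true);  // flush denormals to zero within this scope

    // ... performance-critical arithmetic ...

}   // the previous mode is restored when 'guard' is destroyed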

bool flush_denormals()

Returns whether denormals are flushed to zero (see set_flush_denormals()).

template<typename Array>
auto unit_angle(const Array &a, const Array &b)

Numerically well-behaved routine for computing the angle between two unit direction vectors. This should be used wherever one is tempted to compute the arc cosine of a dot product, i.e. acos(dot(a, b)).

Proposed by Don Hatch.
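
A minimal sketch; the accuracy advantage matters most for nearly parallel (or nearly opposite) directions:

using Vector3f = Array<float, 3>;

Vector3f a = normalize(Vector3f(1.f, 0.f,   0.f));
Vector3f b = normalize(Vector3f(1.f, 1e-4f, 0.f));

float theta = unit_angle(a, b);  // accurate; acos(dot(a, b)) loses precision here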

template<typename Array>
auto unit_angle_z(const Array &v)

Numerically well-behaved routine for computing the angle between the unit vector v and the Z axis [0, 0, 1]. This should be used wherever one is tempted to compute the arc cosine, i.e. acos(v.z()).

Proposed by Don Hatch.

Rearranging contents of arrays

template<size_t... Index, typename Array>
Array shuffle(Array a)

Shuffles the contents of an array given indices specified as a template parameter. The pseudocode for this operation is

Array out;
for (size_t i = 0; i<Array::Size; ++i)
    out[i] = a[Index[i]];
return out;
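
For instance, the template version can reverse a 4-wide array (a minimal sketch):

using Vector4f = Array<float, 4>;

Vector4f a(1.f, 2.f, 3.f, 4.f);
Vector4f b = shuffle<3, 2, 1, 0>(a);  // [4, 3, 2, 1]
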
template<typename Array, typename Indices>
Array shuffle(Array a, Indices ind)

Shuffles the contents of an array given indices specified via an integer array. The pseudocode for this operation is

Array out;
for (size_t i = 0; i<Array::Size; ++i)
    out[i] = a[ind[i]];
return out;
template<typename Array1, typename Array2>
auto concat(Array1 a1, Array2 a2)

Concatenates the contents of two arrays a1 and a2. The pseudocode for this operation is

Array<value_t<Array1>, Array1::Size + Array2::Size> out;
for (size_t i = 0; i<Array1::Size; ++i)
    out[i] = a1[i];
for (size_t i = 0; i<Array2::Size; ++i)
    out[i + Array1::Size] = a2[i];
return out;
template<typename Array>
auto low(Array a)

Returns the low part of the input array a. The length of the low part is defined as the largest power of two that is smaller than Array::Size. For power-of-two sized input, this function simply returns the low half.

template<typename Array>
auto high(Array a)

Returns the high part of the input array a. The length of the high part is equal to Array::Size minus the size of the low part. For power-of-two sized input, this function simply returns the high half.
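
For a non-power-of-two size such as Size == 3, the low part thus has two elements and the high part one. A minimal sketch:

using Vector3f = Array<float, 3>;

Vector3f v(1.f, 2.f, 3.f);
auto lo = low(v);    // 2 elements: [1, 2]
auto hi = high(v);   // 1 element:  [3]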

template<size_t Size, typename Array>
auto head(Array a)

Returns a new array containing the leading Size elements of a.

template<size_t Size, typename Array>
auto tail(Array a)

Returns a new array containing the trailing Size elements of a.

template<typename Outer, typename Inner>
replace_scalar_t<Outer, Inner> full(const Inner &inner)

Given an array type Outer and a value of type Inner, this function returns a new composite type of shape [<shape of Outer>, <shape of Inner>] that is filled with the value of the inner argument.

In the simplest case, this can be used to create a (potentially nested) array that is filled with constant values.

using Vector4f = Array<float, 4>;
using MyMatrix = Array<Vector4f, 4>;
MyMatrix result = full<MyMatrix>(10.f);
std::cout << result << std::endl;

/* Prints:

     [[10, 10, 10, 10],
      [10, 10, 10, 10],
      [10, 10, 10, 10],
      [10, 10, 10, 10]]
*/

Another use case entails replicating an array over the trailing dimensions of a new array:

result = full<Vector4f>(Vector4f(1, 2, 3, 4));
std::cout << result << std::endl;

/* Prints:

    [[1, 1, 1, 1],
     [2, 2, 2, 2],
     [3, 3, 3, 3],
     [4, 4, 4, 4]]
*/

Note how this is different from the default broadcasting behavior of arrays. In this case, Vector4f and MyMatrix have the same size in the leading dimension, which would replicate the vector over that axis instead:

result = MyMatrix(Vector4f(1, 2, 3, 4));
std::cout << result << std::endl;

/* Prints:

    [[1, 2, 3, 4],
     [1, 2, 3, 4],
     [1, 2, 3, 4],
     [1, 2, 3, 4]]
*/
template<typename Array>
std::array<size_t, array_depth_v<Array>> shape(const Array &a)

Returns a std::array, whose entries describe the shape of the (potentially multi-dimensional) input array along each dimension. It works for both static and dynamic arrays.
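
For instance (a minimal sketch):

using FloatP    = Packet<float, 8>;
using Vector3fP = Array<FloatP, 3>;

Vector3fP v;
auto s = shape(v);   // std::array<size_t, 2> holding {3, 8}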

Operations for dynamic arrays

template<typename DArray>
auto packet(DArray &&a, size_t i)

Extracts the \(i\)-th packet from a dynamic array or data structure. See the chapter on dynamic arrays on how to use this function.

template<typename DArray>
size_t packets(const DArray &a)

Return the number of packets stored in the given dynamic array or data structure.

template<typename DArray>
auto slice(DArray &&a, size_t i)

Extracts the \(i\)-th slice from a dynamic array or data structure. See the chapter on dynamic arrays on how to use this function.

template<typename DArray>
size_t slices(const DArray &a)

Return the number of slices stored in the given dynamic array or data structure.

template<typename DArray>
void set_slices(DArray &a, size_t size)

Resize the given dynamic array or data structure so that there is space for size slices. When reducing the size of the array, any memory allocated so far is kept. Since there’s no exponential allocation mechanism, it is not recommended to call set_slices repeatedly to append elements.

Unlike std::vector::resize(), previous values are not preserved when enlarging the array.
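
A minimal sketch combining these operations (see the chapter on dynamic arrays for details):

using FloatP = Packet<float, 8>;
using FloatX = DynamicArray<FloatP>;

FloatX data;
set_slices(data, 100);                    // reserve space for 100 slices

for (size_t i = 0; i < packets(data); ++i)
    packet(data, i) = 1.f;                // initialize packet by packet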

Type traits

The following type traits are available to query the properties of arrays at compile time.

Replacing the scalar type of an array

The enoki::replace_scalar_t type trait and various aliases construct arrays matching a certain layout, but with different-flavored data. This is often helpful when defining custom data structures or function inputs. See the section on custom data structures for an example usage.

template<typename Array, typename Scalar>
using replace_scalar_t

Replaces the scalar type underlying an array. For instance, replace_scalar_t<Array<Array<float, 16>, 32>, int> is equal to Array<Array<int, 16>, 32>.

The type trait also works for scalar arguments. Pointers and reference arguments are copied—for instance, replace_scalar_t<const float *, int> is equal to const int *.

template<typename Array>
using uint32_array_t = replace_scalar_t<Array, uint32_t>

Create a 32-bit unsigned integer array matching the layout of Array.

template<typename Array>
using int32_array_t = replace_scalar_t<Array, int32_t>

Create a 32-bit signed integer array matching the layout of Array.

template<typename Array>
using uint64_array_t = replace_scalar_t<Array, uint64_t>

Create a 64-bit unsigned integer array matching the layout of Array.

template<typename Array>
using int64_array_t = replace_scalar_t<Array, int64_t>

Create a 64-bit signed integer array matching the layout of Array.

template<typename Array>
using int_array_t

Create a signed integer array (with the same number of bits per entry as the input) matching the layout of Array.

template<typename Array>
using uint_array_t

Create an unsigned integer array (with the same number of bits per entry as the input) matching the layout of Array.

template<typename Array>
using float16_array_t = replace_scalar_t<Array, half>

Create a half precision array matching the layout of Array.

template<typename Array>
using float32_array_t = replace_scalar_t<Array, float>

Create a single precision array matching the layout of Array.

template<typename Array>
using float64_array_t = replace_scalar_t<Array, double>

Create a double precision array matching the layout of Array.

template<typename Array>
using float_array_t

Create a floating point array (with the same number of bits per entry as the input) matching the layout of Array.

template<typename Array>
using bool_array_t = replace_scalar_t<Array, bool>

Create a boolean array matching the layout of Array.

template<typename Array>
using size_array_t = replace_scalar_t<Array, size_t>

Create a size_t-valued array matching the layout of Array.

template<typename Array>
using ssize_array_t = replace_scalar_t<Array, ssize_t>

Create a ssize_t-valued array matching the layout of Array.
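
For instance (a minimal sketch):

using FloatP  = Packet<float, 8>;

using UInt32P = uint32_array_t<FloatP>;  // == Packet<uint32_t, 8>
using Int64P  = int64_array_t<FloatP>;   // == Packet<int64_t, 8>
using BoolP   = bool_array_t<FloatP>;    // == Packet<bool, 8>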

SFINAE helper types

The following section discusses helper types that can be used to selectively enable or disable template functions for Enoki arrays, e.g. like so:

template <typename Value, enable_if_array_t<Value> = 0>
void f(Value value) {
    /* Invoked if 'Value' is an Enoki array */
}

template <typename Value, enable_if_not_array_t<Value> = 0>
void f(Value value) {
    /* Invoked if 'Value' is *not* an Enoki array */
}

Detecting Enoki arrays

template<typename T>
class is_array
static constexpr bool value

Equal to true iff T is a static or dynamic Enoki array type.

template<typename T>
using enable_if_array_t = std::enable_if_t<is_array_v<T>, int>

SFINAE alias to selectively enable a class or function definition for Enoki array types.

template<typename T>
using enable_if_not_array_t = std::enable_if_t<!is_array_v<T>, int>

SFINAE alias to selectively enable a class or function definition for types that are not Enoki arrays.

Detecting Enoki masks

template<typename T>
class is_mask
static constexpr bool value

Equal to true iff T is a static or dynamic Enoki mask type.

template<typename T>
using enable_if_mask_t = std::enable_if_t<is_mask_v<T>, int>

SFINAE alias to selectively enable a class or function definition for Enoki mask types.

template<typename T>
using enable_if_not_mask_t = std::enable_if_t<!is_mask_v<T>, int>

SFINAE alias to selectively enable a class or function definition for types that are not Enoki masks.

Detecting static Enoki arrays

template<typename T>
class is_static_array
static constexpr bool value

Equal to true iff T is a static Enoki array type.

template<typename T>
using enable_if_static_array_t = std::enable_if_t<is_static_array_v<T>, int>

SFINAE alias to selectively enable a class or function definition for static Enoki array types.

template<typename T>
using enable_if_not_static_array_t = std::enable_if_t<!is_static_array_v<T>, int>

SFINAE alias to selectively enable a class or function definition for types that are not static Enoki arrays.

Detecting dynamic Enoki arrays

template<typename T>
class is_dynamic_array
static constexpr bool value

Equal to true iff T is a dynamic Enoki array type.

template<typename T>
using enable_if_dynamic_array_t = std::enable_if_t<is_dynamic_array_v<T>, int>

SFINAE alias to selectively enable a class or function definition for dynamic Enoki array types.

template<typename T>
using enable_if_not_dynamic_array_t = std::enable_if_t<!is_dynamic_array_v<T>, int>

SFINAE alias to selectively enable a class or function definition for types that are not dynamic Enoki arrays.

template<typename T>
class is_dynamic
static constexpr bool value

Equal to true iff T (which could be a nested Enoki array) contains a dynamic array at any level.

This is different from is_dynamic_array, which only considers the outermost level – for instance, given a static array T containing a nested dynamic array, is_dynamic_array_v<T> == false, while is_dynamic_v<T> == true.
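
A minimal sketch of the distinction:

using FloatP    = Packet<float, 8>;
using FloatX    = DynamicArray<FloatP>;
using Vector3fX = Array<FloatX, 3>;   // static outer array, dynamic inner array

static_assert(!is_dynamic_array_v<Vector3fX>); // outermost level is static
static_assert(is_dynamic_v<Vector3fX>);        // but a dynamic array is nested inside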

Random number generation

Enoki ships with a vectorized implementation of the PCG32 random number generator developed by Melissa O’Neill. To use it, include the following header:

#include <enoki/random.h>

The following reference is based on the original PCG32 documentation.

Usage

The enoki::PCG32 class takes a single template parameter T that denotes the “shape” of the output. This can be any scalar type like uint32_t, in which case the implementation generates scalar variates:

/* Scalar RNG */
using RNG_1x = PCG32<uint32_t>;

RNG_1x my_rng;
float value = my_rng.next_float32();

Alternatively, it can be an Enoki array, in which case the implementation produces arrays of variates.

using FloatP = Packet<float, 16>;

/* Vector RNG -- generates 16 independent variates at once */
using RNG_16x = PCG32<FloatP>;

RNG_16x my_rng;
FloatP value = my_rng.next_float32();

PCG32 is fast: on a Skylake i7-6920HQ processor, the vectorized implementation provided here generates around 1.4 billion single precision variates per second.

Reference

template<typename T>
class PCG32

This class implements the PCG32 pseudorandom number generator. It has a period of \(2^{64}\) and supports \(2^{63}\) separate streams. Each stream produces a different unique sequence of random numbers, which is particularly useful in the context of vectorized computations.

Member types

constexpr size_t Size

Denotes the SIMD width of the random number generator (i.e. how many pseudorandom variates are generated at the same time)

using Int64 = int64_array_t<T>

Type alias for a signed 64-bit integer (or an array thereof).

using UInt64 = uint64_array_t<T>

Type alias for an unsigned 64-bit integer (or an array thereof).

using UInt32 = uint32_array_t<T>

Type alias for an unsigned 32-bit integer (or an array thereof).

using Float32 = float32_array_t<T>

Type alias for a single precision float (or an array thereof).

using Float64 = float64_array_t<T>

Type alias for a double precision float (or an array thereof).

Member variables

UInt64 state

Stores the RNG state. All values are possible.

UInt64 inc

Controls which RNG sequence (stream) is selected. Must always be odd, which is ensured by the constructor and seed() method.

Constructors

PCG32(const UInt64 &initstate = PCG32_DEFAULT_STATE, const UInt64 &initseq = PCG32_DEFAULT_STREAM + arange<UInt64>())

Seeds the PCG32 with the default state. When T is an array, every entry by default uses a different stream index, which yields an uncorrelated and non-overlapping set of sequences.

Methods

void seed(const UInt64 &initstate, const UInt64 &initseq)

This function initializes (a.k.a. “seeds”) the random number generator, a required initialization step before the generator can be used. The provided arguments are defined as follows:

  • initstate is the starting state for the RNG. Any 64-bit value is permissible.
  • initseq selects the output sequence for the RNG. Any 64-bit value is permissible, although only the low 63 bits are used.

For this generator, there are \(2^{63}\) possible sequences of pseudorandom numbers. Each sequence is entirely distinct and has a period of \(2^{64}\). The initseq argument selects which stream is used. The initstate argument specifies the location within the \(2^{64}\) period.

Calling PCG32::seed() with the same arguments produces the same output, allowing programs to use random number sequences repeatably.

UInt32 next_uint32(const mask_t<UInt64> &mask = true)

Generate a uniformly distributed unsigned 32-bit random number (i.e. \(x\), where \(0\le x< 2^{32}\))

If a mask parameter is provided, only the pseudorandom number generators of active SIMD lanes are advanced.

UInt64 next_uint64(const mask_t<UInt64> &mask = true)

Generate a uniformly distributed unsigned 64-bit random number (i.e. \(x\), where \(0\le x< 2^{64}\))

If a mask parameter is provided, only the pseudorandom number generators of active SIMD lanes are advanced.

Note

This function performs two internal calls to next_uint32().

UInt32 next_uint32_bound(uint32_t bound, const mask_t<UInt64> &mask = true)

Generate a uniformly distributed unsigned 32-bit random number less than bound (i.e. \(x\), where \(0\le x<\) bound)

If a mask parameter is provided, only the pseudorandom number generators of active SIMD lanes are advanced.

Note

This may involve multiple internal calls to next_uint32(), in which case the RNG advances by several steps. This is only relevant when using the advance() or operator-() method.

UInt64 next_uint64_bound(uint64_t bound, const mask_t<UInt64> &mask = true)

Generate a uniformly distributed unsigned 64-bit random number less than bound (i.e. \(x\), where \(0\le x<\) bound)

If a mask parameter is provided, only the pseudorandom number generators of active SIMD lanes are advanced.

Note

This may involve multiple internal calls to next_uint64(), in which case the RNG advances by several steps. This is only relevant when using the advance() or operator-() method.

Float32 next_float32(const mask_t<UInt64> &mask = true)

Generate a single precision floating point value on the interval \([0, 1)\)

If a mask parameter is provided, only the pseudorandom number generators of active SIMD lanes are advanced.

Float64 next_float64(const mask_t<UInt64> &mask = true)

Generate a double precision floating point value on the interval \([0, 1)\)

If a mask parameter is provided, only the pseudorandom number generators of active SIMD lanes are advanced.

Warning

Since the underlying random number generator produces 32 bit output, only the first 32 mantissa bits will be filled (however, the resolution is still finer than in next_float32(), which only uses 23 mantissa bits)

void advance(const Int64 &delta)

This operation provides jump-ahead; it advances the RNG by delta steps, doing so in \(\log(\texttt{delta})\) time. Because of the periodic nature of generation, advancing by \(2^{64}-d\) (i.e., passing \(-d\)) is equivalent to backstepping the generator by \(d\) steps.
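
A minimal sketch of jump-ahead and backstepping:

PCG32<uint32_t> rng;

uint32_t x = rng.next_uint32();
rng.advance(-1);                  // step back by one
uint32_t y = rng.next_uint32();   // replays the same variate, i.e. x == y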

Int64 operator-(const PCG32 &other)

Compute the distance between two PCG32 pseudorandom number generators

bool operator==(const PCG32 &other)

Equality operator

bool operator!=(const PCG32 &other)

Inequality operator

Macros

The following macros are defined in enoki/random.h:

uint64_t PCG32_DEFAULT_STATE = 0x853c49e6748fea9bULL

Default initialization passed to PCG32::seed().

uint64_t PCG32_DEFAULT_STREAM = 0xda3e39cb94b95bdbULL

Default stream index passed to PCG32::seed().

Morton/Z-order indexing

Enoki provides efficient support for encoding and decoding of Morton/Z-order indices of arbitrary dimension. Both scalar indices and index vectors are supported. Z-order indexing can improve the locality of memory accesses when two- or higher-dimensional data is arranged in memory.

_images/morton-01.svg

(Figure by David Eppstein)

To use this feature, include the following header:

#include <enoki/morton.h>

Usage

The following shows a round trip, encoding a 32-bit position as a Morton index and decoding it again.

using Vector2u = Array<uint32_t, 2>;

Vector2u pos(123u, 456u);
uint32_t encoded = morton_encode(pos);
Vector2u decoded = morton_decode<Vector2u>(encoded);

std::cout << "Original : " << pos << std::endl;
std::cout << "Encoded  : " << encoded << std::endl;
std::cout << "Decoded  : " << decoded << std::endl;

/* Prints:
    Original : [123, 456]
    Encoded  : 177605
    Decoded  : [123, 456]
*/

Depending on hardware support, Enoki implements these operations using BMI2 instructions or bit shifts with precomputed magic constants.

The same also works for nested vectors:

using UInt32P = Packet<uint32_t, 8>;
using Vector2uP = Array<UInt32P, 2>;

Vector2uP pos(123u, 456u);
UInt32P encoded = morton_encode(pos);
Vector2uP decoded = morton_decode<Vector2uP>(encoded);

std::cout << "Original : " << pos << std::endl;
std::cout << "Encoded  : " << encoded << std::endl;
std::cout << "Decoded  : " << decoded << std::endl;

/* Prints:
    Original : [[123, 456], [123, 456], [123, 456], [123, 456], [123, 456], [123, 456], [123, 456], [123, 456]]
    Encoded  : [177605, 177605, 177605, 177605, 177605, 177605, 177605, 177605]
    Decoded  : [[123, 456], [123, 456], [123, 456], [123, 456], [123, 456], [123, 456], [123, 456], [123, 456]]
*/

Reference

template<typename Array>
value_t<Array> morton_encode(Array array)

Converts a potentially nested N-dimensional array into its corresponding Morton/Z-order index by interleaving the bits of the component values. The array must have an unsigned integer as its underlying scalar type.

template<typename Array>
Array morton_decode(value_t<Array> value)

Converts a Morton/Z-order index or index array into a potentially nested N-dimensional array. The array must have an unsigned integer as its underlying scalar type.

Complex numbers

Enoki provides a vectorizable type for complex number arithmetic analogous to std::complex<T>. To use it, include the following header file:

#include <enoki/complex.h>

Usage

The following example shows how to define and perform basic arithmetic using enoki::Complex vectorized over 4-wide packets.

/* Declare underlying packet type, could just be 'float' for scalar arithmetic */
using FloatP   = Packet<float, 4>;

/* Define complex number type */
using ComplexP = Complex<FloatP>;

const ComplexP I(0.f, 1.f);
const ComplexP z(0.2f, 0.3f);

/* Two different ways of computing the tangent function */
ComplexP t0 = (exp(I * z) - exp(-I * z)) / (I * (exp(I * z) + exp(-I * z)));
ComplexP t1 = tan(z);

std::cout << "t0 = " << t0 << std::endl << std::endl;
std::cout << "t1 = " << t1 << std::endl;

/* Prints

    t0 = [0.184863 + 0.302229i,
     0.184863 + 0.302229i,
     0.184863 + 0.302229i,
     0.184863 + 0.302229i]

    t1 = [0.184863 + 0.302229i,
     0.184863 + 0.302229i,
     0.184863 + 0.302229i,
     0.184863 + 0.302229i]
 */

Reference

template<typename Type>
class Complex : StaticArrayImpl<Type, 2>

The class enoki::Complex is a 2D Enoki array whose components are of type Type. Various arithmetic operators (e.g. multiplication) and transcendental functions are overloaded so that they provide the correct behavior for complex-valued inputs.

Complex(Type real, Type imag)

Creates a enoki::Complex instance from the given real and imaginary inputs.

Complex(Type f)

Creates a real-valued enoki::Complex instance from f. This constructor effectively changes the broadcasting behavior of non-complex inputs—for instance, the snippet

auto value_a = zero<Array<float, 2>>();
auto value_c = zero<Complex<float>>();

value_a += 1.f; value_c += 1.f;

std::cout << "value_a = "<< value_a << ", value_c = " << value_c << std::endl;

prints value_a = [1, 1], value_c = 1 + 0i, which is the desired behavior for complex numbers. For standard Enoki arrays, the number 1.f is broadcast to both components.

Elementary operations

template<typename T>
T real(Complex<T> z)

Extracts the real part of z.

template<typename T>
T imag(Complex<T> z)

Extracts the imaginary part of z.

template<typename T>
T arg(Complex<T> z)

Evaluates the complex argument of z.

template<typename T>
T abs(Complex<T> z)

Compute the absolute value of z.

template<typename T>
Complex<T> sqrt(Complex<T> z)

Compute the square root of z.

template<typename T>
Complex<T> conj(Complex<T> z)

Evaluates the complex conjugate of z.

template<typename T>
Complex<T> rcp(Complex<T> z)

Evaluates the complex reciprocal of z.

Arithmetic operators

Only a few arithmetic operators need to be overridden to support complex arithmetic. The rest are automatically provided by Enoki’s existing operators and broadcasting rules.

template<typename T>
Complex<T> operator*(Complex<T> z0, Complex<T> z1)

Evaluates the complex product of z0 and z1.

template<typename T>
Complex<T> operator/(Complex<T> z0, Complex<T> z1)

Evaluates the complex division of z0 by z1.

Stream operators

template<typename T>
std::ostream &operator<<(std::ostream &os, const Complex<T> &z)

Sends the complex number z to the stream os using the format 1 + 2i.

Exponential, logarithm, and power function

template<typename T>
Complex<T> exp(Complex<T> z)

Evaluates the complex exponential of z.

template<typename T>
Complex<T> log(Complex<T> z)

Evaluates the complex logarithm of z.

template<typename T>
Complex<T> pow(Complex<T> z0, Complex<T> z1)

Evaluates z0 raised to the complex power z1.

Trigonometric functions

template<typename T>
Complex<T> sin(Complex<T> z)

Evaluates the complex sine function for z.

template<typename T>
Complex<T> cos(Complex<T> z)

Evaluates the complex cosine function for z.

template<typename T>
Complex<T> tan(Complex<T> z)

Evaluates the complex tangent function for z.

template<typename T>
std::pair<Complex<T>, Complex<T>> sincos(Complex<T> z)

Jointly evaluates the complex sine and cosine function for z.

template<typename T>
Complex<T> asin(Complex<T> z)

Evaluates the complex arc sine function for z.

template<typename T>
Complex<T> acos(Complex<T> z)

Evaluates the complex arc cosine function for z.

template<typename T>
Complex<T> atan(Complex<T> z)

Evaluates the complex arc tangent function for z.

Hyperbolic functions

template<typename T>
Complex<T> sinh(Complex<T> z)

Evaluates the complex hyperbolic sine function for z.

template<typename T>
Complex<T> cosh(Complex<T> z)

Evaluates the complex hyperbolic cosine function for z.

template<typename T>
Complex<T> tanh(Complex<T> z)

Evaluates the complex hyperbolic tangent function for z.

template<typename T>
std::pair<Complex<T>, Complex<T>> sincosh(Complex<T> z)

Jointly evaluates the complex hyperbolic sine and cosine function for z.

template<typename T>
Complex<T> asinh(Complex<T> z)

Evaluates the complex hyperbolic arc sine function for z.

template<typename T>
Complex<T> acosh(Complex<T> z)

Evaluates the complex hyperbolic arc cosine function for z.

template<typename T>
Complex<T> atanh(Complex<T> z)

Evaluates the complex hyperbolic arc tangent function for z.

Miscellaneous functions

template<typename T>
std::pair<T, T> sincos_arg_diff(const Complex<T> &z1, const Complex<T> &z2)

Efficiently evaluates sin(arg(z1) - arg(z2)) and cos(arg(z1) - arg(z2)).

template<typename T>
Complex<T> sqrtz(T x)

Compute the complex square root of a real-valued argument x (which may be negative). This is considerably more efficient than the general complex square root above.
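
For instance (a minimal sketch):

Complex<float> r = sqrtz(-4.f);   // 0 + 2i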

Quaternions

Enoki provides a vectorizable type for quaternion arithmetic. To use it, include the following header:

#include <enoki/quaternion.h>

Usage

The following example shows how to define and perform basic arithmetic using enoki::Quaternion vectorized over 4-wide packets.

/* Declare underlying packet type, could just be 'float' for scalar arithmetic */
using FloatP = Packet<float, 4>;

/* Define vectorized quaternion type */
using QuaternionP = Quaternion<FloatP>;

QuaternionP a = QuaternionP(1.f, 0.f, 0.f, 0.f);
QuaternionP b = QuaternionP(0.f, 1.f, 0.f, 0.f);

/* Compute several rotations that interpolate between 'a' and 'b' */
FloatP t = linspace<FloatP>(0.f, 1.f);
QuaternionP c = slerp(a, b, t);

std::cout << "Interpolated quaternions:" << std::endl;
std::cout << c << std::endl << std::endl;

/* Turn into a 4x4 homogeneous coordinate rotation matrix packet */
using MatrixP = Matrix<FloatP, 4>;
MatrixP c_rot = quat_to_matrix<MatrixP>(c);

std::cout << "Rotation matrices:" << std::endl;
std::cout << c_rot << std::endl << std::endl;

/* Round trip: turn the rotation matrices back into rotation quaternions */
QuaternionP c2 = matrix_to_quat(c_rot);

if (hsum(abs(c-c2)) < 1e-6f)
    std::cout << "Test passed." << std::endl;
else
    std::cout << "Test failed." << std::endl;

/* Prints:

    Interpolated quaternions:
    [0 + 1i + 0j + 0k,
     0 + 0.866025i + 0.5j + 0k,
     0 + 0.5i + 0.866025j + 0k,
     0 - 4.37114e-08i + 1j + 0k]

    Rotation matrices:
    [[[1, 0, 0, 0],
      [0, -1, 0, 0],
      [0, 0, -1, 0],
      [0, 0, 0, 1]],
     [[0.5, 0.866025, 0, 0],
      [0.866025, -0.5, 0, 0],
      [0, 0, -1, 0],
      [0, 0, 0, 1]],
     [[-0.5, 0.866025, 0, 0],
      [0.866025, 0.5, 0, 0],
      [0, 0, -1, 0],
      [0, 0, 0, 1]],
     [[-1, -8.74228e-08, 0, 0],
      [-8.74228e-08, 1, 0, 0],
      [-0, 0, -1, 0],
      [0, 0, 0, 1]]]

     Test passed.
*/

Reference

template<typename Type>
class Quaternion : StaticArrayImpl<Type, 4>

The class enoki::Quaternion is a 4D Enoki array whose components are of type Type. Various arithmetic operators (e.g. multiplication) and transcendental functions are overloaded so that they provide the correct behavior for quaternion-valued inputs.

Quaternion(Type x, Type y, Type z, Type w)

Initialize a new enoki::Quaternion instance with the value \(x\mathbf{i} + y\mathbf{j} + z\mathbf{k} + w\)

Warning

Note the different order convention compared to Complex::Complex().

Quaternion(Array<Type, 3> imag, Type real)

Creates a enoki::Quaternion instance from the given imaginary and real inputs.

Warning

Note the different order convention compared to Complex::Complex().

Quaternion(Type f)

Creates a real-valued enoki::Quaternion instance from f. This constructor effectively changes the broadcasting behavior of non-quaternion inputs—for instance, the snippet

auto value_a = zero<Array<float, 4>>();
auto value_q = zero<Quaternion<float>>();

value_a += 1.f; value_q += 1.f;

std::cout << "value_a = "<< value_a << ", value_q = " << value_q << std::endl;

prints value_a = [1, 1, 1, 1], value_q = 1 + 0i + 0j + 0k, which is the desired behavior for quaternions. For standard Enoki arrays, the number 1.f is broadcast to all four components.

Elementary operations

template<typename Quat>
Quat identity()

Returns the identity quaternion.

template<typename T>
T real(Quaternion<T> q)

Extracts the real part of q.

template<typename T>
Array<T, 3> imag(Quaternion<T> q)

Extracts the imaginary part of q.

template<typename T>
T abs(Quaternion<T> q)

Compute the absolute value of q.

template<typename T>
Quaternion<T> sqrt(Quaternion<T> q)

Compute the square root of q.

template<typename T>
Quaternion<T> conj(Quaternion<T> q)

Evaluates the quaternion conjugate of q.

template<typename T>
Quaternion<T> rcp(Quaternion<T> q)

Evaluates the quaternion reciprocal of q.

Arithmetic operators

Only a few arithmetic operators need to be overridden to support quaternion arithmetic. The rest are automatically provided by Enoki’s existing operators and broadcasting rules.

template<typename T>
Quaternion<T> operator*(Quaternion<T> q0, Quaternion<T> q1)

Evaluates the quaternion product of q0 and q1.

template<typename T>
Quaternion<T> operator/(Quaternion<T> q0, Quaternion<T> q1)

Evaluates the quaternion division of q0 by q1.

Stream operators

template<typename T>
std::ostream &operator<<(std::ostream &os, const Quaternion<T> &q)

Sends the quaternion q to the stream os using the format 1 + 2i + 3j + 4k.

Exponential, logarithm, and power function

template<typename T>
Quaternion<T> exp(Quaternion<T> q)

Evaluates the quaternion exponential of q.

template<typename T>
Quaternion<T> log(Quaternion<T> q)

Evaluates the quaternion logarithm of q.

template<typename T>
Quaternion<T> pow(Quaternion<T> q0, Quaternion<T> q1)

Evaluates q0 raised to the quaternion power q1.

Fixed-size matrices

Enoki provides a vectorizable type for small fixed-size square matrices (e.g. \(2\times 2\), \(3\times 3\), or \(4\times 4\)) that are commonly found in computer graphics and vision applications.

Note

Enoki attempts to store matrix entries in processor registers—it is unwise to work with large matrices since this will cause considerable pressure on the register allocator. Instead, consider using specialized libraries for large-scale linear algebra (e.g. Eigen or Intel MKL) in such cases.

To use this feature, include the following header:

#include <enoki/matrix.h>

Usage

The following example shows how to define and perform basic arithmetic using enoki::Matrix to construct a \(4\times 4\) homogeneous coordinate “look-at” camera matrix and its inverse:

using Vector3f = Array<float, 3>;
using Vector4f = Array<float, 4>;
using Matrix4f = Matrix<float, 4>;

std::pair<Matrix4f, Matrix4f> look_at(const Vector3f &origin,
                                      const Vector3f &target,
                                      const Vector3f &up) {
    Vector3f dir = normalize(target - origin);
    Vector3f left = normalize(cross(up, dir));
    Vector3f new_up = cross(dir, left);

    Matrix4f result = Matrix4f::from_cols(
        concat(left, 0.f),
        concat(new_up, 0.f),
        concat(dir, 0.f),
        concat(origin, 1.f)
    );

    /* The following two statements efficiently compute
       the inverse. Alternatively, we could simply have
       written

        Matrix4f inv = inverse(result);
    */
    Matrix4f inv = Matrix4f::from_rows(
        concat(left, 0.f),
        concat(new_up, 0.f),
        concat(dir, 0.f),
        Vector4f(0.f, 0.f, 0.f, 1.f)
    );

    inv[3] = inv * concat(-origin, 1.f);

    return std::make_pair(result, inv);
}

Enoki can also work with fixed-size packets and dynamic arrays of matrices. The following example shows how to compute the inverse of a matrix array.

using FloatP = Packet<float, 4>;     // 4-wide packet type
using FloatX = DynamicArray<FloatP>; // arbitrary-length array

using MatrixP = Matrix<FloatP, 4>;   // a packet of 4x4 matrices
using MatrixX = Matrix<FloatX, 4>;   // arbitrarily many 4x4 matrices

MatrixX matrices;
set_slices(matrices, 1000);

// .. fill matrices with contents ..

// Invert all matrices
vectorize(
    [](auto &&m) {
        m = inverse(MatrixP(m));
    },
    matrices
);

Reference

template<typename Value, size_t Size>
class Matrix : StaticArrayImpl<Array<Value, Size>, Size>

The class enoki::Matrix represents a dense square matrix of fixed size as a Size \(\times\) Size Enoki array whose components are of type Value. The implementation relies on a column-major storage order to enable a particularly efficient implementation of vectorized matrix multiplication.

type Value

Denotes the type of matrix elements.

type Column

Denotes the Enoki array type of a matrix column.

template<typename ...Values>
Matrix(Values... values)

Creates a new enoki::Matrix instance with the given set of entries (where sizeof...(Values) == Size*Size).

template<typename ...Columns>
Matrix(Columns... columns)

Creates a new enoki::Matrix instance with the given set of columns (where sizeof...(Columns) == Size).

template<size_t Size2>
Matrix(Matrix<Value, Size2> m)

Construct a matrix from another matrix of the same type, but with a different size. If Size2 > Size, the constructor copies the top left part of m. Otherwise, it copies all of m and fills the rest of the matrix with the identity.

Matrix(Value f)

Creates an enoki::Matrix instance that has the value f on the diagonal and zeroes elsewhere.

Value &operator()(size_t i, size_t j)

Returns a reference to the matrix entry \((i, j)\).

const Value &operator()(size_t i, size_t j) const

Returns a const reference to the matrix entry \((i, j)\).

Column &col(size_t i)

Returns a reference to the \(i\)-th column.

const Column &col(size_t i) const

Returns a const reference to the \(i\)-th column.

Column row(size_t i)

Returns the \(i\)-th row by value.

template<typename ...Columns>
static Matrix from_cols(Columns... columns)

Creates a new enoki::Matrix instance with the given set of columns (where Size == sizeof...(Columns)). This is identical to the Matrix::Matrix() constructor but makes it more explicit that the inputs are columns. The usage example above relies on this function.

template<typename ...Rows>
static Matrix from_rows(Rows... rows)

Creates a new enoki::Matrix instance with the given set of rows (where Size == sizeof...(Rows)).

Supported operations

template<typename T, size_t Size>
Matrix<T, Size> operator*(Matrix<T, Size> m0, Matrix<T, Size> m1)

Efficient vectorized matrix-matrix multiplication operation. On AVX512VL, a \(4\times 4\) matrix multiplication reduces to 4 multiplications and 12 fused multiply-adds with embedded broadcasts.

template<typename T, size_t Size>
Array<T, Size> operator*(Matrix<T, Size> m, Array<T, Size> v)

Matrix-vector multiplication operation.

template<typename T, size_t Size>
T trace(Matrix<T, Size> m)

Computes the trace (i.e. sum of the diagonal elements) of the given matrix.

template<typename T, size_t Size>
T frob(Matrix<T, Size> m)

Computes the Frobenius norm of the given matrix.

template<typename Matrix>
Matrix identity()

Returns the identity matrix.

template<typename Matrix>
Matrix diag(typename Matrix::Column v)

Returns a diagonal matrix whose entries are copied from v.

template<typename Matrix>
typename Matrix::Column diag(Matrix m)

Extracts the diagonal from a matrix m and returns it as a vector.

template<typename T, size_t Size>
Matrix<T, Size> transpose(Matrix<T, Size> m)

Computes the transpose of m using an efficient set of shuffles.

template<typename T, size_t Size>
Matrix<T, Size> inverse(Matrix<T, Size> m)

Computes the inverse of m using a branchless vectorized form of Cramer’s rule.

Warning

This function is only implemented for \(1\times 1\), \(2\times 2\), \(3\times 3\), and \(4\times 4\) matrices (which are allowed to be packets of matrices).

template<typename T, size_t Size>
Matrix<T, Size> inverse_transpose(Matrix<T, Size> m)

Computes the inverse transpose of m using a branchless vectorized form of Cramer’s rule. (This function is more efficient than transpose(inverse(m)).)

Warning

This function is only implemented for \(1\times 1\), \(2\times 2\), \(3\times 3\), and \(4\times 4\) matrices (which are allowed to be packets of matrices).

template<typename T, size_t Size>
T det(Matrix<T, Size> m)

Computes the determinant of m.

Warning

This function is only implemented for \(1\times 1\), \(2\times 2\), \(3\times 3\), and \(4\times 4\) matrices (which are allowed to be packets of matrices).

template<typename Matrix>
std::pair<Matrix, Matrix> polar_decomp(Matrix M, size_t it = 10)

Given a nonsingular input matrix \(\mathbf{M}\), polar_decomp computes the polar decomposition \(\mathbf{M} = \mathbf{Q}\mathbf{P}\), where \(\mathbf{Q}\) is an orthogonal matrix and \(\mathbf{P}\) is a symmetric and positive definite matrix. The computation relies on an accelerated version of Heron’s method that converges rapidly. it denotes the iteration count—a value of \(10\) should be plenty.

Homogeneous transformations

Enoki provides a number of convenience functions to construct 3D homogeneous coordinate transformations (rotations, translations, scales, perspective transformation matrices, etc.). To use them, include the following header file:

#include <enoki/transform.h>

Reference

template<typename Matrix, typename Vector3>
Matrix translate(Vector3 v)

Constructs a homogeneous coordinate transformation, which translates points by v.

template<typename Matrix, typename Vector3>
Matrix scale(Vector3 v)

Constructs a homogeneous coordinate transformation, which scales points by v.

template<typename Matrix, typename Vector3, typename Float>
Matrix rotate(Vector3 v, Float angle)

Constructs a homogeneous coordinate transformation, which rotates by angle radians around the axis v. The function requires v to be normalized.

template<typename Matrix>
auto transform_decompose(Matrix m)

Performs a polar decomposition of a non-perspective 4x4 homogeneous coordinate matrix and returns a tuple of

  1. A positive definite 3x3 matrix containing an inhomogeneous scaling operation
  2. A rotation quaternion
  3. A 3D translation vector

This representation is helpful for keyframe animation, since the three components can be interpolated independently.

The function also handles singular inputs m, in which case the rotation component is set to the identity quaternion and the scaling part simply copies the input matrix.

template<typename Matrix3, typename Quaternion, typename Vector3>
auto transform_compose(Matrix3 scale, Quaternion rotation, Vector3 translate)

This function composes a 4x4 homogeneous coordinate transformation from the given scale, rotation, and translation. It performs the reverse of transform_decompose.

template<typename Matrix3, typename Quaternion, typename Vector3>
auto transform_compose_inverse(Matrix3 scale, Quaternion rotation, Vector3 translate)

This function composes a 4x4 homogeneous inverse coordinate transformation from the given scale, rotation, and translation. It is the equivalent to (but more efficient than) the expression inverse(transform_compose(...)).

template<typename Matrix, typename Point3, typename Vector3>
Matrix look_at(Point3 origin, Point3 target, Vector3 up)

Constructs a homogeneous coordinate transformation, which translates to \(\mathrm{origin}\), maps the negative \(z\) axis to \(\mathrm{target}-\mathrm{origin}\) (normalized) and the positive \(y\) axis to \(\mathrm{up}\) (if orthogonal to \(\mathrm{target}-\mathrm{origin}\)). The algorithm performs Gram-Schmidt orthogonalization to ensure that the returned matrix is orthonormal.

template<typename Matrix, typename Float>
Matrix perspective(Float fov, Float near, Float far)

Constructs an OpenGL-compatible perspective projection matrix with the specified field of view (in radians) and near and far clip planes. The returned matrix performs the transformation

\[\begin{split}\begin{pmatrix} x\\y\\z\end{pmatrix} \mapsto \begin{pmatrix} -c\,x/z\\ -c\,y/z\\ \frac{2\,\mathrm{far}\,\mathrm{near}\,+\,z\,(\mathrm{far}+\mathrm{near})}{z\, (\mathrm{far}-\mathrm{near})} \end{pmatrix},\end{split}\]

where

\[c = \mathrm{cot}\!\left(0.5\, \textrm{fov}\right),\]

which maps \((0, 0, -\mathrm{near})^T\) to \((0, 0, -1)^T\) and \((0, 0, -\mathrm{far})^T\) to \((0, 0, 1)^T\).

template<typename Matrix, typename Float>
Matrix frustum(Float left, Float right, Float bottom, Float top, Float near, Float far)

Constructs an OpenGL-compatible perspective projection matrix. The provided parameters specify the intersection of the camera frustum with the near clipping plane. Specifically, the returned transformation maps \((\mathrm{left}, \mathrm{bottom}, -\mathrm{near})\) to \((-1, -1, -1)\) and \((\mathrm{right}, \mathrm{top}, -\mathrm{near})\) to \((1, 1, -1)\).

template<typename Matrix, typename Float>
Matrix ortho(Float left, Float right, Float bottom, Float top, Float near, Float far)

Constructs an OpenGL-compatible orthographic projection matrix. The provided parameters specify the intersection of the camera frustum with the near clipping plane. Specifically, the returned transformation maps \((\mathrm{left}, \mathrm{bottom}, -\mathrm{near})\) to \((-1, -1, -1)\) and \((\mathrm{right}, \mathrm{top}, -\mathrm{near})\) to \((1, 1, -1)\).

Spherical harmonics

Enoki can efficiently evaluate real spherical harmonics basis functions for both scalar and vector arguments. To use this feature, include the following header file:

#include <enoki/sh.h>

The evaluation routines rely on efficient pre-generated branch-free code processed using aggressive constant folding and common subexpression elimination passes. Evaluation routines are provided up to order 10.

The generated code is based on the paper Efficient Spherical Harmonic Evaluation, Journal of Computer Graphics Techniques (JCGT), vol. 2, no. 2, 84-90, 2013 by Peter-Pike Sloan.

Note

The directions provided to sh_eval_* must be normalized 3D vectors (i.e. using Cartesian instead of spherical coordinates).

The Mathematica equivalent of the real spherical harmonic basis implemented in enoki/sh.h is given by the following definition:

SphericalHarmonicQ[l_, m_, d_] := Block[{θ, ϕ},
  θ = ArcCos[d[[3]]];
  ϕ = ArcTan[d[[1]], d[[2]]];
  Piecewise[{
    {SphericalHarmonicY[l, m, θ, ϕ], m == 0},
    {Sqrt[2] * Re[SphericalHarmonicY[l,  m, θ, ϕ]], m > 0},
    {Sqrt[2] * Im[SphericalHarmonicY[l, -m, θ, ϕ]], m < 0}
  }]
]

Usage

The following example shows how to evaluate the spherical harmonics basis up to and including order 2, producing a total of 9 function values.

using Vector3f = Array<float, 3>;
Vector3f d = normalize(Vector3f(1, 2, 3));

float coeffs[9];
sh_eval(d, 2, coeffs);

// Prints: [0.282095, -0.261169, 0.391754, -0.130585, 0.156078, -0.468235, 0.292864, -0.234118, -0.117059]
std::cout << load<Array<float, 9>>(coeffs) << std::endl;

Reference

template<typename Array>
void sh_eval(const Array &d, size_t order, expr_t<value_t<Array>> *out)

Evaluates the real spherical harmonics basis functions up to and including order order. The output array must have room for (order + 1)*(order + 1) entries. This function dispatches to one of the sh_eval_* implementations and throws an exception if order > 9.

template<typename Array>
void sh_eval_0(const Array &d, expr_t<value_t<Array>> *out)

Evaluates the real spherical harmonics basis functions up to and including order 0. The output array must have room for 1 entry.

template<typename Array>
void sh_eval_1(const Array &d, expr_t<value_t<Array>> *out)

Evaluates the real spherical harmonics basis functions up to and including order 1. The output array must have room for 4 entries.

template<typename Array>
void sh_eval_2(const Array &d, expr_t<value_t<Array>> *out)

Evaluates the real spherical harmonics basis functions up to and including order 2. The output array must have room for 9 entries.

template<typename Array>
void sh_eval_3(const Array &d, expr_t<value_t<Array>> *out)

Evaluates the real spherical harmonics basis functions up to and including order 3. The output array must have room for 16 entries.

template<typename Array>
void sh_eval_4(const Array &d, expr_t<value_t<Array>> *out)

Evaluates the real spherical harmonics basis functions up to and including order 4. The output array must have room for 25 entries.

template<typename Array>
void sh_eval_5(const Array &d, expr_t<value_t<Array>> *out)

Evaluates the real spherical harmonics basis functions up to and including order 5. The output array must have room for 36 entries.

template<typename Array>
void sh_eval_6(const Array &d, expr_t<value_t<Array>> *out)

Evaluates the real spherical harmonics basis functions up to and including order 6. The output array must have room for 49 entries.

template<typename Array>
void sh_eval_7(const Array &d, expr_t<value_t<Array>> *out)

Evaluates the real spherical harmonics basis functions up to and including order 7. The output array must have room for 64 entries.

template<typename Array>
void sh_eval_8(const Array &d, expr_t<value_t<Array>> *out)

Evaluates the real spherical harmonics basis functions up to and including order 8. The output array must have room for 81 entries.

template<typename Array>
void sh_eval_9(const Array &d, expr_t<value_t<Array>> *out)

Evaluates the real spherical harmonics basis functions up to and including order 9. The output array must have room for 100 entries.

Color space transformations

Enoki provides a set of helper functions for color space transformations. For now, only sRGB and inverse sRGB gamma correction are available. To use them, include the following header file:

#include <enoki/color.h>

Functions

template<typename Value>
Value linear_to_srgb(Value value)

Efficiently applies the sRGB gamma correction

\[\begin{split}x\mapsto\begin{cases}12.92x,&x\leq 0.0031308\\1.055x^{1/2.4}-0.055,&x>0.0031308\end{cases}\end{split}\]

to an input value in the interval \((0, 1)\).

template<typename Value>
Value srgb_to_linear(Value value)

Efficiently applies the inverse sRGB gamma correction

\[\begin{split}x\mapsto{\begin{cases}{\frac {x}{12.92}},&x\leq 0.04045\\\left({\frac {x+0.055}{1.055}}\right)^{2.4},&x>0.04045\end{cases}}\end{split}\]

to an input value in the interval \((0, 1)\).

Half-precision floats

Enoki provides a compact implementation of a 16 bit half-precision floating point type that is compatible with the FP16 format on GPUs and high dynamic range image libraries such as OpenEXR. To use this feature, include the following header:

#include <enoki/half.h>

Current processors don’t natively implement half precision arithmetic, hence mathematical operations involving this type always involve a half \(\to\) float \(\to\) half roundtrip. For this reason, it is unwise to rely on it for expensive parts of a computation.

The main reason for including a dedicated half precision type in Enoki is that it provides an ideal storage format for floating point data that does not require the full accuracy of the single precision representation, which leads to an immediate storage savings of \(2\times\).

Note

If supported by the target architecture, Enoki uses the F16C instruction set to perform efficient vectorized conversion between half and single precision variables (however, this only affects conversion and no other arithmetic operations). ARM NEON also provides native conversion instructions.

Usage

The following example shows how to use the enoki::half type in a typical use case.

using Color4f = Array<float, 4>;
using Color4h = Array<half, 4>;

uint8_t *image_ptr = ...;

Color4f pixel(load<Color4h>(image_ptr)); // <- conversion vectorized using F16C

/* ... update 'pixel' using single-precision arithmetic ... */

store(image_ptr, Color4h(pixel)); // <- conversion vectorized using F16C

Reference

class half

A half instance encodes a sign bit, an exponent width of 5 bits, and 10 explicitly stored mantissa bits.

All standard mathematical operators are overloaded and implemented using the processor’s floating point unit after a conversion to IEEE 754 single precision. The result of the operation is then converted back to half precision.

uint16_t value

Stores the represented half precision value as an unsigned 16-bit integer.

half(float value)

Constructs a half-precision value from the given single precision argument.

operator float() const

Implicit half to float conversion operator.

static half from_binary(uint16_t value)

Reinterpret a 16-bit unsigned integer as a half-precision variable.

half operator+(half h) const

Addition operator.

half &operator+=(half h)

Addition compound assignment operator.

half operator-() const

Unary minus operator

half operator*(half h) const

Multiplication operator.

half &operator*=(half h)

Multiplication compound assignment operator.

half operator/(half h) const

Division operator.

half &operator/=(half h)

Division compound assignment operator.

bool operator<(half h) const

Less-than comparison operator.

bool operator<=(half h) const

Less-than-or-equal comparison operator.

bool operator>(half h) const

Greater-than comparison operator.

bool operator>=(half h) const

Greater-than-or-equal comparison operator.

bool operator==(half h) const

Equality operator.

bool operator!=(half h) const

Inequality operator.

friend std::ostream &operator<<(std::ostream &os, const half &h)

Stream insertion operator.

Standard Template Library

When Enoki extracts packets from or slices through custom data structures, it also handles STL containers including std::array, std::pair, and std::tuple. Please review the section on dynamic arrays for general details on vectorizing over dynamic arrays and working with slices.

To use this feature, include the following header file:

#include <enoki/stl.h>

Usage

Consider the following example, where a function returns a std::tuple containing a 3D position and a mask specifying whether the computation was successful. When the enoki/stl.h header file is included, Enoki’s dynamic vectorization machinery can be applied to vectorize such functions over arbitrarily large inputs.

/// Return value of 'my_function'
template <typename T>
using Return = std::tuple<
    Array<T, 3>,
    mask_t<T>
>;

template <typename T> Return<T> my_function(T theta, T phi) {
    /* Turn spherical -> cartesian coordinates */
    Array<T, 3> pos(
        sin(theta) * cos(phi),
        sin(theta) * sin(phi),
        cos(theta)
    );

    /* Only points on the top hemisphere are 'valid' */
    return std::make_tuple(pos, pos.z() > 0);
}

/// Packet of floats
using FloatP  = Packet<float>;

/// Arbitrarily large sequence of floats
using FloatX  = DynamicArray<FloatP>;

/// Tuple containing a packet of results
using ReturnP = Return<FloatP>;

/// Tuple containing dynamic arrays with arbitrarily many results
using ReturnX = Return<FloatX>;

int main(int argc, char *argv[]) {
    FloatX theta = linspace<FloatX>(-10.f, 10.f, 10);
    FloatX phi = linspace<FloatX>(0.f, 60.f, 10);

    ReturnX result = vectorize(my_function<FloatP>, theta, phi);

    /* Prints:
        [[0.544021, 0, -0.839072],
         [-0.924676, -0.373065, 0.0761302],
         [0.478888, 0.461548, 0.746753],
         [0.0777672, 0.173978, -0.981674],
         [-0.0330365, -0.895583, 0.443666],
         [-0.304446, 0.842896, 0.443666],
         [0.127097, -0.141995, -0.981674],
         [0.596782, -0.293616, 0.746753],
         [-0.994388, 0.0734624, 0.0761293],
         [0.518133, 0.165823, -0.839072]]
    */

    std::cout << std::get<0>(result) << std::endl;

    /* Prints:
        [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    */
    std::cout << std::get<1>(result) << std::endl;
}

Python bindings

Enoki provides support for pybind11, a lightweight header-only binding library that is used to expose C++ types in Python and vice versa.

To use this feature, include the following header file in the extension module:

#include <enoki/python.h>

Usage

The example below shows how to create bindings for a simple vector computation that converts spherical to Cartesian coordinates. A CMake build system file is provided at the bottom of this page.

Extension module

#include <enoki/python.h>

/* Import pybind11 and Enoki namespaces */
namespace py = pybind11;
using namespace enoki;
using namespace py::literals; // Enables the ""_a parameter tags used below

/* Define a packet type used for vectorization */
using FloatP    = Packet<float>;

/* Dynamic type for arbitrary-length arrays */
using FloatX    = DynamicArray<FloatP>;

/* Various flavors of 3D vectors */
using Vector3f  = Array<float, 3>;
using Vector3fP = Array<FloatP, 3>;
using Vector3fX = Array<FloatX, 3>;

/* The function we want to expose in Python */
template <typename Float>
Array<Float, 3> sph_to_cartesian(Float theta, Float phi) {
    auto sc_theta = sincos(theta);
    auto sc_phi   = sincos(phi);

    return {
        sc_theta.first * sc_phi.second,
        sc_theta.first * sc_phi.first,
        sc_theta.second
    };
}

/* The function below is called when the extension module is loaded. It performs a
   sequence of m.def(...) calls which define functions in the module namespace 'm' */
PYBIND11_MODULE(pybind11_test /* <- name of extension module */, m) {
    m.doc() = "Enoki & pybind11 test plugin"; // Set a docstring

    /* 1. Bind the scalar version of the function */
    m.def(
          /* Name of the function in the Python extension module */
          "sph_to_cartesian",

          /* Function that should be exposed */
          sph_to_cartesian<float>,

          /* Function docstring */
          "Convert from spherical to cartesian coordinates [scalar version]",

          /* Parameter names for function signature in docstring */
          "theta"_a, "phi"_a
    );

    /* 2. Bind the packet version of the function */
    m.def("sph_to_cartesian",
           /* The only difference is the FloatP template argument */
           sph_to_cartesian<FloatP>,
          "Convert from spherical to cartesian coordinates [packet version]",
          "theta"_a, "phi"_a);

    /* 3. Bind dynamic version of the function */
    m.def("sph_to_cartesian",
           /* Note the use of 'vectorize_wrapper', which is described below */
           vectorize_wrapper(sph_to_cartesian<FloatP>),
          "Convert from spherical to cartesian coordinates [dynamic version]",
          "theta"_a, "phi"_a);
}

pybind11 infers the necessary binding code from the type of the function provided to the def() calls. Including the enoki/python.h header is all it takes to make the pybind11 library fully Enoki-aware—arbitrarily nested dynamic and static arrays will be converted automatically.

In practice, one would usually skip the packet version since it is subsumed by the dynamic case.

Using the extension from Python

The following interactive session shows how to load the extension module and query its automatically generated help page.

Python 3.5.2 |Anaconda 4.2.0 (x86_64)| (default, Jul  2 2016, 17:52:12)
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> import pybind11_test
>>> help(pybind11_test)

Help on module pybind11_test

NAME
    pybind11_test - Enoki & pybind11 test plugin

FUNCTIONS
    sph_to_cartesian(...)
        sph_to_cartesian(*args, **kwargs)
        Overloaded function.

        1. sph_to_cartesian(theta: float, phi: float)
               -> numpy.ndarray[dtype=float32, shape=(3)]

        Convert from spherical to cartesian coordinates [scalar version]

        2. sph_to_cartesian(theta: numpy.ndarray[dtype=float32, shape=(8)],
                            phi: numpy.ndarray[dtype=float32, shape=(8)])
               -> numpy.ndarray[dtype=float32, shape=(8, 3)]

        Convert from spherical to cartesian coordinates [packet version]

        3. sph_to_cartesian(theta: numpy.ndarray[dtype=float32, shape=(n)],
                            phi: numpy.ndarray[dtype=float32, shape=(n)])
               -> numpy.ndarray[dtype=float32, shape=(n, 3)]

        Convert from spherical to cartesian coordinates [dynamic version]

FILE
    /Users/wjakob/pybind11_test/pybind11_test.cpython-35m-darwin.so

As can be seen, the help describes all three overloads along with the name and shape of their input arguments. Let’s try calling one of them:

>>> from pybind11_test import sph_to_cartesian
>>> sph_to_cartesian(theta=1, phi=2)
array([-0.35017547,  0.76514739,  0.54030228], dtype=float32)

Note how the returned Enoki array was automatically converted into a NumPy array.

Let’s now call the dynamic version of the function. We will use np.linspace to generate inputs, whose np.float64 dtype does not match the expected single precision layout. The binding layer detects this and automatically creates a temporary single precision input array before performing the function call.

>>> import numpy as np
>>> sph_to_cartesian(theta=np.linspace(0.0, 1.0, 10),
...                  phi=np.linspace(1.0, 2.0, 10))
array([[ 0.        ,  0.        ,  1.        ],
       [ 0.04919485,  0.09937215,  0.99383354],
       [ 0.07527862,  0.20714317,  0.9754101 ],
       [ 0.07696848,  0.31801295,  0.9449569 ],
       [ 0.05418137,  0.42652887,  0.90284967],
       [ 0.00803789,  0.52735412,  0.84960753],
       [-0.05919253,  0.61553025,  0.7858873 ],
       [-0.14420365,  0.68672061,  0.71247464],
       [-0.24281444,  0.73742425,  0.63027501],
       [-0.35017547,  0.76514739,  0.54030228]], dtype=float32)

Build system

The following CMakeLists.txt file can be used to build the module on various platforms.

cmake_minimum_required (VERSION 2.8.12)
project(pybind11_test CXX)
include(CheckCXXCompilerFlag)

# Set a default build configuration (Release)
if (NOT CMAKE_BUILD_TYPE AND NOT CMAKE_CONFIGURATION_TYPES)
  message(STATUS "Setting build type to 'Release' as none was specified.")
  set(CMAKE_BUILD_TYPE Release CACHE STRING "Choose the type of build." FORCE)
  set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS "Debug" "Release"
    "MinSizeRel" "RelWithDebInfo")
endif()

# Enable C++17 support (required by Enoki)
if (CMAKE_CXX_COMPILER_ID MATCHES "GNU|Clang|Intel")
  CHECK_CXX_COMPILER_FLAG("-std=c++17" HAS_CPP17_FLAG)
  if (HAS_CPP17_FLAG)
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++17")
  else()
    message(FATAL_ERROR "Unsupported compiler -- C++17 support is needed!")
  endif()
endif()

# Assumes that pybind11 is located in the 'pybind11' subdirectory
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/pybind11)

# Assumes that enoki is located in the 'enoki' subdirectory
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/enoki)

# Enable some helpful vectorization-related compiler flags
enoki_set_compile_flags()
enoki_set_native_flags()

include_directories(enoki/include pybind11/include)

# Compile our pybind11 module
pybind11_add_module(pybind11_test pybind11_test.cpp)

Reference

Please refer to pybind11’s extensive documentation for details on using it in general. The enoki/python.h API only provides one public function:

template<typename Func>
auto vectorize_wrapper(Func func)

“Converts” a function that takes a set of packets and structures of packets as inputs into a new function that processes dynamic versions of these parameters. Non-array arguments are not transformed. For instance, it would turn the following hypothetical signature

FloatP my_func(Array<FloatP, 3> position, GPSRecord2<FloatP> record, int scalar);

into

FloatX my_func(Array<FloatX, 3> position, GPSRecord2<FloatX> record, int scalar);

where

using FloatX = DynamicArray<FloatP>;

This is handy because a one-liner like vectorize_wrapper(sph_to_cartesian<FloatP>) in the above example is all it takes to turn the packet version of a function into a dynamic version that can process arbitrarily large NumPy arrays.