Enoki is a C++ template library that enables automatic transformations of numerical code, for instance to create a “wide” vectorized variant of an algorithm that runs on the CPU or GPU, or to compute gradients via transparent forward/reverse-mode automatic differentiation.
The core parts of the library are implemented as a set of header files with no dependencies other than a sufficiently C++17-capable compiler (GCC >= 8.2, Clang >= 7.0, Visual Studio >= 2017). Enoki code reduces to efficient SIMD instructions available on modern CPUs and GPUs—in particular, Enoki supports the ARM NEON, SSE4.2, AVX, AVX2, and AVX512 instruction sets on the CPU, as well as CUDA-capable NVIDIA GPUs via a just-in-time compiler.
Deploying a program on top of Enoki usually serves three goals:

1. Enoki ships with a convenient library of special functions and data structures that facilitate implementation of numerical code (vectors, matrices, complex numbers, quaternions, etc.).

2. Programs built using these can be instantiated as wide versions that process many arguments at once (either on the CPU or the GPU). Enoki is also structured in the sense that it handles complex programs with custom data structures, lambda functions, loadable modules, virtual method calls, and many other modern C++ features.

3. If derivatives are desired (e.g. for stochastic gradient descent), Enoki performs transparent forward or reverse-mode automatic differentiation of the entire program.

Finally, Enoki can do all of the above simultaneously: if desired, it can compile the same source code to multiple different implementations (e.g. scalar, AVX512, and CUDA+autodiff).
The development of this library was prompted by the author’s frustration with the current vectorization landscape:
Auto-vectorization in state-of-the-art compilers is inherently local. A computation whose call graph spans separate compilation units (e.g. multiple shared libraries) simply can’t be vectorized.
Data structures must be converted into a Structure of Arrays (SoA) layout to be eligible for vectorization.
This is analogous to performing a matrix transpose of an application’s entire memory layout—an intrusive change that is likely to touch almost every line of code.
Parts of the application likely have to be rewritten using intrinsic instructions, which is going to look something like this:
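For illustration, a hypothetical AVX/FMA fragment that evaluates y = (x <= t) ? x*a : x*b + c for eight floats at once might read as follows (all names below are made up for this sketch):

#include <immintrin.h>

/* The scalar branch must be rewritten by hand as a comparison mask
   followed by a blend */
static __m256 example_kernel(__m256 x) {
    const __m256 t = _mm256_set1_ps(0.5f),
                 a = _mm256_set1_ps(2.f),
                 b = _mm256_set1_ps(3.f),
                 c = _mm256_set1_ps(1.f);

    __m256 mask = _mm256_cmp_ps(x, t, _CMP_LE_OQ); /* x <= t */
    __m256 r0   = _mm256_mul_ps(x, a);             /* x * a */
    __m256 r1   = _mm256_fmadd_ps(x, b, c);        /* x * b + c */

    return _mm256_blendv_ps(r1, r0, mask);         /* mask ? r0 : r1 */
}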
Intrinsics-heavy code is challenging to read and modify once written, and it is inherently non-portable. CUDA provides a nice language environment for programming GPUs but does nothing to help with the other requirements (vectorization on CPUs, automatic differentiation).
Vectorized transcendental functions (exp, cos, erf, ...) are not widely available. Intel, AMD, and CUDA provide proprietary implementations, but many compilers don’t include them by default.
It is desirable to retain both scalar and vector versions of an algorithm, but ensuring their consistency throughout the development cycle becomes a maintenance nightmare.
Domain-specific languages (DSLs) for vectorization such as ISPC address many of the above issues but assume that the main computation underlying an application can be condensed into a compact kernel that is implementable using the limited language subset of the DSL (e.g. plain C in the case of ISPC).
This is not the case for complex applications, where the “kernel” may be spread out over many separate modules involving high-level language features such as functional or object-oriented programming.
Enoki addresses these issues and provides a complete solution for vectorizing and differentiating modern C++ applications with nontrivial control flow and data structures, dynamic memory allocation, virtual method calls, and vector calls across module boundaries. It has the following design goals:
Unobtrusive. Only minor modifications are necessary to convert existing C++ code into its Enoki-vectorized equivalent, which remains readable and maintainable.
No code duplication. It is generally desirable to provide both scalar and vectorized versions of an API, e.g. for debugging, and to preserve compatibility with legacy code. Enoki code extensively relies on class and function templates to achieve this goal without any code duplication—the same code template can be leveraged to create scalar, CPU SIMD, and GPU implementations, and each variant can provide gradients via automatic differentiation if desired.
Custom data structures. Enoki can also vectorize custom data structures. All the hard work (e.g. conversion to SoA format) is handled by the C++17 type system.
Function calls. Vectorized calls to functions in other compilation units (e.g. a dynamically loaded plugin) are possible. Enoki can even vectorize method or virtual method calls (e.g. instance->my_function(arg1, arg2, ...); when instance turns out to be an array containing many different instances).
Mathematical library. Enoki includes an extensive mathematical support library with complex numbers, matrices, quaternions, and related operations (determinants, matrix inversion, etc.). A set of transcendental and special functions supports real, complex, and quaternion-valued arguments in single and double precision using polynomial or rational polynomial approximations, generally with an average error of \(<\!\frac{1}{2}\) ULP on their full domain. These include exponentials, logarithms, and trigonometric and hyperbolic functions, as well as their inverses. Enoki also provides real-valued versions of error function variants, Bessel functions, the Gamma function, and various elliptic integrals.
Importantly, all of this functionality is realized using the abstractions of Enoki, which means that it transparently composes with vectorization, the JIT compiler for generating CUDA kernels, automatic differentiation, etc.
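For instance, a minimal sketch of the matrix functionality (the Matrix3f alias is defined here for illustration):

#include <enoki/matrix.h>

using Matrix3f = enoki::Matrix<float, 3>;

Matrix3f m(2.f, 0.f, 0.f,
           0.f, 3.f, 0.f,
           0.f, 0.f, 4.f);

Matrix3f mi = enoki::inverse(m);   /* matrix inverse */
float    d  = enoki::det(m);       /* determinant (24 here) */
Matrix3f mt = enoki::transpose(m); /* transpose */

Because Matrix3f could equally well be instantiated with a packet or GPU array as its component type, the same snippet can process many matrices at once.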
Portability. When creating vectorized CPU code, Enoki supports arbitrary array sizes that don’t necessarily match what is supported by the underlying hardware (e.g. 16 x single precision on a machine whose SSE vector units only provide hardware support for 4 x single precision operands). The library uses template metaprogramming techniques to efficiently map array expressions onto the available hardware resources. This greatly simplifies development because it’s enough to write a single implementation of a numerical algorithm that can then be deployed on any target architecture. There are non-vectorized fallbacks for everything, thus programs will run even on unsupported architectures (albeit without the performance benefits of vectorization).
Modular architecture. Enoki is split into two major components: the front-end provides various high-level array operations, while the back-end provides the basic ingredients that are needed to realize these operations using the SIMD instruction set(s) supported by the target architecture. Backends can also transform arithmetic, e.g. to perform automatic differentiation.
The CPU vector back-ends, for example, make heavy use of SIMD intrinsics to ensure that compilers generate efficient machine code. The intrinsics are contained in separate back-end header files (e.g. array_avx.h for AVX intrinsics), which provide rudimentary arithmetic and bit-level operations. Fancier operations (e.g. atan2) use the back-ends as an abstract interface to the hardware, which means that it’s simple to support other instruction sets such as a hypothetical future AVX1024 or even an entirely different architecture (e.g. a DSP chip) by just adding a new back-end.
License. Enoki is available under a non-viral open source license (3-clause BSD).
This project was created by Wenzel Jakob. It is named after Enokitake, a type of mushroom with many long and parallel stalks reminiscent of data flow in vectorized arithmetic.
Enoki is the numerical foundation of version 2 of the Mitsuba renderer, though it is significantly more general and should be a trusty tool for a variety of simulation and optimization problems.
When using Enoki in academic projects, please cite
@misc{Enoki,
    author = {Wenzel Jakob},
    year = {2019},
    note = {https://github.com/mitsuba-renderer/enoki},
    title = {Enoki: structured vectorization and differentiation on modern processor architectures}
}
This section contains a quick and somewhat contrived demonstration that illustrates a number of basic Enoki features. The remainder of the documentation contains a more systematic introduction of this functionality.
Consider the function

\[ f(x) = \begin{cases} 12.92\,x, & x \le 0.0031308 \\ 1.055\,x^{1/2.4} - 0.055, & \text{otherwise,} \end{cases} \]

which converts a linear color value into one that can be displayed on a modern display following the sRGB standard—this is known as gamma correction. A standard C++ implementation of this function might look as follows:
float srgb_gamma(float x) {
    if (x <= 0.0031308f)
        return x * 12.92f;
    else
        return 1.055f * std::pow(x, 1.f / 2.4f) - 0.055f;
}
An Enoki implementation of the same computation first replaces float by a generic Value type. The second difference is that scalar conditionals are replaced by generalized expressions involving masks.
template <typename Value> Value srgb_gamma(Value x) {
    return enoki::select(
        x <= 0.0031308f,
        x * 12.92f,
        1.055f * enoki::pow(x, 1.f / 2.4f) - 0.055f
    );
}
The simple generalization from normal C++ code to an Enoki function template enables a number of interesting applications. For instance, the function automatically extends to cases where Value is a color type with three components, in which case the arithmetic operations recursively thread through the array.
using Color3f = enoki::Array<float, 3>;
Color3f input = /* ... */;
Color3f output = srgb_gamma(input);
Arrays can be nested arbitrarily: the following snippet declares a 16-wide FloatP “packet” type (hence the “P” suffix) and uses it to construct a new type storing 16 colors that will all be processed in parallel.
using FloatP = enoki::Array<float, 16>;
using Color3fP = enoki::Array<FloatP, 3>;
Color3fP input = /* ... */;
Color3fP output = srgb_gamma(input);
If this code is compiled on a machine supporting the SSE4.2, AVX, AVX2, or AVX512 instruction set extensions, these vector instructions will be leveraged to carry out the computation more efficiently.
Vectorization is not restricted to the CPU—for instance, the following type declarations create a special array that is resident in GPU memory. In this mode of operation, Enoki relies on an internal just-in-time compiler to generate efficient CUDA kernels on the fly.
using FloatC = enoki::CUDAArray<float>;
using Color3fC = enoki::Array<FloatC, 3>;
Color3fC input = /* ... */;
Color3fC output = srgb_gamma(input);
Enoki’s CUDAArray<T> type applies an important optimization that leads to significantly improved performance: in contrast to the previous examples, the function call srgb_gamma(input) now merely records the sequence of computations that is needed to determine the value of output but does not yet execute it.
Eventually, this evaluation can no longer be postponed (e.g. when we try to access or print the array contents). At this point, Enoki’s JIT backend compiles and executes a kernel that contains all queued computations using NVIDIA’s PTX intermediate representation. All of this happens automatically: in particular, no CUDA-specific rewrite (e.g. to nvcc-compatible kernels) of the program is necessary!
Enoki can also apply transparent forward or reverse-mode automatic differentiation to a program using a special enoki::DiffArray<T> array that wraps a number type or another Enoki array T.
For instance, the following example computes the gradient of a loss function that measures the L2 distance from a given gamma-corrected color value. Both primal and gradient-related computations involve GPU-resident arrays, and the resulting computation is queued up as in the previous example using Enoki’s just-in-time compiler.
using FloatC = enoki::CUDAArray<float>;
using FloatD = enoki::DiffArray<FloatC>;
using Color3fD = enoki::Array<FloatD, 3>;
Color3fD input = /* ... */;
enoki::set_requires_gradient(input);
Color3fD output = srgb_gamma(input);
FloatD loss = enoki::norm(output - Color3fD(.1f, .2f, .3f));
enoki::backward(loss);
std::cout << enoki::gradient(input) << std::endl;
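Forward-mode differentiation follows the same pattern; a minimal sketch reusing the FloatD alias from above (the exact values are illustrative):

FloatD x = 1.f;
enoki::set_requires_gradient(x);

FloatD y = x * x + 2.f;

/* Propagate derivatives forward from 'x' ... */
enoki::forward(x);

/* ... and read off dy/dx = 2 */
std::cout << enoki::gradient(y) << std::endl;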
All Enoki functions also accept non-array arguments, hence the original scalar implementation remains available:
float input = /* ... */;
float output = srgb_gamma(input);
Modern C++ systems often strive to provide fine-grained Python bindings to facilitate rapid prototyping and interoperability with other software. Enoki is designed to work with the widely used pybind11 library (itself based on template metaprogramming) to facilitate this. Exposing an Enoki function on the Python side is usually a 1-liner, even for the “fancy” GPU+autodiff variants, as in the following example:
/// Create python bindings with 2 overloads (here, 'm' is a py::module)
m.def("srgb_gamma", &srgb_gamma<float>);
m.def("srgb_gamma", &srgb_gamma<Color3fD>);
In summary: given a generalized template implementation of a computation, Enoki enables several powerful transformations: plain scalar evaluation, vectorized CPU evaluation, JIT-compiled GPU evaluation, and transparent forward/reverse-mode differentiation of any of these variants.
There are many missing pieces that weren’t discussed in this basic overview: how to handle more complex control flow, types and data structures, virtual method calls, and so on. The remainder of this documentation provides a more systematic overview of these topics.
The remainder of this document provides a basic overview of the Enoki library. All code snippets assume that the following lines are present:
#include <iostream>
#include <enoki/array.h>
/* Don't forget to include the 'enoki' namespace */
using namespace enoki;
Note
The next few sections will introduce the basics of Enoki in the context of vectorization for CPUs. Later sections will discuss vectorization via GPU arrays and Automatic differentiation. It is recommended that you read the following sections even if you are mainly interested in the latter two topics.
The most important data structure in this library is enoki::Array, a generic container that stores a fixed-size array of an arbitrary data type. This is somewhat similar to the standard template library class std::array. The main distinction between the two is that enoki::Array forwards all C++ operators (and other standard mathematical functions) to the contained elements.
For instance, the snippet
using StrArray = Array<std::string, 2>;

StrArray x("Hello ", "How are "),
         y("world!", "you?");

// Prints: "[Hello world!, How are you?]"
std::cout << x + y << std::endl;
is valid because Array::operator+() can carry out the addition by invoking std::string::operator+() (this would not compile if std::array was used, since it has no such semantics).
Enoki arrays also have support for a convenient feature that is commonly known as broadcasting—in this simple example, broadcasting is triggered if we add a raw string corresponding to an array of dimension zero. Fancier applications of broadcasting are discussed later.
// Prints: "[Hello you, How are you]"
std::cout << x + "you" << std::endl;
The real use case of Enoki arrays, however, involves arrays of integer or floating point values, for which arithmetic operations can often be reduced to fast vectorized instructions provided by current processor architectures.
The library ships with partial template overloads that become active when the Type and Size parameters supplied to the enoki::Array<Type, Size> template correspond to combinations that are natively supported by the targeted hardware. For instance, in the case of single precision arrays, Array<float, 4> is backed by an SSE __m128 register, Array<float, 8> by an AVX __m256 register, and Array<float, 16> by an AVX512 __m512 register.
Altogether, Enoki currently supports the ARM NEON, SSE4.2, AVX, AVX2, and AVX512 instruction sets and vectorizes arithmetic involving single and double precision floating point values as well as signed and unsigned 32-bit and 64-bit integers.
It is worth pointing out that enoki::Array does not require Size to exactly match what is supported by the hardware to benefit from vectorization. Enoki relies on template metaprogramming techniques to ensure optimal code generation even in such challenging situations. For instance, arithmetic operations involving a hypothetical Array<float, 27> type will generate one AVX512 instruction [1], one AVX instruction, and one 4-wide SSE instruction that leaves the last entry unused.
A perhaps more sensible use of this feature is to instantiate packed arrays with a Size that is an integer multiple of what is supported natively as a way of aggressively unrolling the underlying computations. If Size is not specified, the implementation chooses a reasonable value (4 on machines with SSE/NEON, 8 on machines with AVX, and 16 on machines with AVX512).
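For instance (a brief sketch; the assumption of an AVX2 build is purely for illustration):

/* 'Array<float>' picks the default width of the compilation target */
using FloatP = Array<float>;

/* On a machine compiled with AVX2 support, this equals Array<float, 8> */
static_assert(FloatP::Size == 8, "example assumes an AVX2 build");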
Arrays can be initialized by broadcasting a scalar value, or by specifying the values of individual entries.
/* The examples below use a 4-wide float array */
using MyFloat = Array<float, 4>;

/* Initialize all entries with a constant */
MyFloat f1(1.f);

/* Initialize the entries individually */
MyFloat f2(1.f, 2.f, 3.f, 4.f);
The enoki namespace also contains a large number of global functions that create or manipulate Enoki arrays in various ways. One example is the enoki::load() function, which is the method of choice to initialize an array with data that is currently stored in memory:
float *mem = /* ... pointer to floating point data ... */;
MyFloat f3;
/* Load entries of 'f3' from 'mem' */
f3 = load<MyFloat>(mem); /* if known that 'mem' is aligned */
f3 = load_unaligned<MyFloat>(mem); /* otherwise */
Both enoki::load() and enoki::load_unaligned() are template functions that load an array of the specified type (MyFloat in this case) from a given address in memory. The former indicates that the memory address is aligned to a multiple of alignof(MyFloat), which is equal to 16 bytes in this example. It is a good idea to align data and use aligned versions of operations, since this reduces the number of cache lines that must be accessed.
Warning
Performing an aligned load from an unaligned memory address will cause a general protection fault that immediately terminates the application.
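One simple way to guarantee the required alignment for small buffers is the alignas specifier, as in the following sketch:

/* Storage whose alignment matches that of 'MyFloat' (16 bytes here) */
alignas(MyFloat) float mem2[4] = { 1.f, 2.f, 3.f, 4.f };

/* The aligned load is now guaranteed to be safe */
MyFloat f4 = load<MyFloat>(mem2);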
Note
It is generally desirable to use high-level Enoki template functions like enoki::load() whenever possible since they are designed to produce the most efficient instruction sequences for the specified target architecture. In this case, we could also have written
f3 = MyFloat(mem[0], mem[1], mem[2], mem[3]);
which is functionally equivalent—however, it is not guaranteed that the compiler will be able to exploit the equivalence to generate optimal code in this case.
An analogous pair of operations stores the contents of arrays in memory:
/* Store entries of 'f3' to 'mem' */
store(mem, f3);           /* if known that 'mem' is aligned */
store_unaligned(mem, f3); /* otherwise */
Note that load() and load_unaligned() require the target type as a template parameter, while the stores do not, since it can be inferred from the provided value.
Gather, scatter, and scatter-add operations are also supported using a similar pattern:
/* 32 and 64 bit integers are supported as indices for scatter/gather operations */
Array<int, 4> idx(1, 2, 3, 4);
/* Gather f3 from mem -- this is equivalent to
setting f3[i] = mem[idx[i]] (i = 0, 1, ..) */
f3 = gather<MyFloat>(mem, idx);
/* Scatter f3 to mem -- this is equivalent to
setting mem[idx[i]] = f3[i] (i = 0, 1, ..) */
scatter(mem, f3, idx);
/* Atomic scatter-add of f3 to mem -- this is equivalent to
setting mem[idx[i]] += f3[i] (i = 0, 1, ..). The operation
is atomic in the sense that it gives the correct results
if 'idx' refers to the same index twice. */
scatter_add(mem, f3, idx);
Finally, the following initialization methods also exist:
/* Efficient way to create an array of any type filled with zero entries */
f1 = zero<MyFloat>();
/* Initialize entries with index sequence 0, 1, 2, ... */
f1 = arange<MyFloat>();
/* Initialize entries with a linearly increasing sequence with endpoints 0 and 1 */
f1 = linspace<MyFloat>(0.f, 1.f);
The components of Array can be accessed via operator[]. If you find yourself using this much, your code is likely not making good use of the vector units.
f2[2] = 1.f;
Alternatively, the functions x(), y(), z(), and w() can be used to access the first four components. The following line is equivalent to the one above.
f2.z() = 1.f;
Components of a vector can be efficiently reordered using the following syntax:
// f1: [0, 10, 20, 30]
f2 = shuffle<0, 2, 1, 3>(f1);
// f2: [0, 20, 10, 30]
Finally, Enoki provides an overloaded operator<<(std::ostream&, ...) stream insertion operator to facilitate the inspection of array contents:
/* The line below prints: [1, 2, 3, 4] */
std::cout << MyFloat(1.f, 2.f, 3.f, 4.f) << std::endl;
Enoki provides the following vertical operations. The word vertical implies that they are independently applied to all array elements.
/* Basic arithmetic operations */
f1 *= (f2 + 1.f) / (f2 - 1.f);
/* Basic math library functions */
f2 = ceil(f1); f2 = floor(f1); f2 = round(f1); f2 = trunc(f1);
f2 = abs(f1); f2 = sqrt(f1); f2 = cbrt(f1); f2 = sign(f1);
f2 = min(f1, f2); f2 = max(f1, f2);
/* Fused multiply-add/subtract */
f1 = fmadd(f1, f2, f3); /* f1 * f2 + f3 */
f1 = fmsub(f1, f2, f3); /* f1 * f2 - f3 */
f1 = fnmadd(f1, f2, f3); /* -f1 * f2 + f3 */
f1 = fnmsub(f1, f2, f3); /* -f1 * f2 - f3 */
/* Efficient reciprocal and reciprocal square root */
f1 = rcp(f1); f1 = rsqrt(f1);
/* Trigonometric and inverse trigonometric functions */
f2 = sin(f1); f2 = cos(f1); f2 = tan(f1);
f2 = csc(f1); f2 = sec(f1); f2 = cot(f1);
f2 = asin(f1); f2 = acos(f1); f2 = atan(f1);
f2 = atan2(f1, f2);
auto [s, c] = sincos(f1);
/* Hyperbolic and inverse hyperbolic functions */
f2 = sinh(f1); f2 = cosh(f1); f2 = tanh(f1);
f2 = csch(f1); f2 = sech(f1); f2 = coth(f1);
f2 = asinh(f1); f2 = acosh(f1); f2 = atanh(f1);
auto [sh, ch] = sincosh(f1);
/* Exponential function, natural logarithm, power function */
f2 = exp(f1); f2 = log(f1); f2 = pow(f1, f2);
/* Exponent/mantissa manipulation */
f1 = ldexp(f1, f2);
auto [mant, exp] = frexp(f1);
/* Special functions */
f2 = erf(f1); f2 = erfinv(f1); f2 = erfi(f1);
f2 = i0e(f1); f2 = dawson(f1);
f1 = comp_ellint_1(f1); f1 = ellint_1(f1, f2);
f1 = comp_ellint_2(f1); f1 = ellint_2(f1, f2);
f1 = comp_ellint_3(f1, f2); f1 = ellint_3(f1, f2, f3);
/* Bit shifts and rotations (only for integer arrays) */
i1 = sl<3>(i1); i1 = sr<3>(i1); /* Shift by a compile-time constant ("immediate") */
i1 = i1 >> i2; i1 = i1 << i2; /* Element-wise shift by a variable amount */
i1 = rol<3>(i1); i1 = ror<3>(i1); /* Rotate by a compile-time constant ("immediate") */
i1 = rol(i1, i2); i1 = ror(i1, i2); /* Element-wise rotation by a variable amount */
/* Trailing/leading zero count, population count (only for integer arrays) */
i1 = lzcnt(i1); i1 = tzcnt(i1); i1 = popcnt(i1);
A cast is another type of vertical operation. Enoki performs efficient conversion between any pair of types using native conversion instructions whenever possible:
using Source = Array<int64_t, 32>;
using Target = Array<double, 32>;
Source source = ...;
Target target(source);
In contrast to the above vertical operations, the following horizontal operations consider the entries of a packed array jointly and return a scalar.
Depending on the size of the array, these are implemented using between \(\log_2(N)\) and \(N-1\) vertical reduction operations and shuffles. Horizontal operations should generally be avoided since they don’t fully utilize the hardware vector units (ways of avoiding them are discussed later).
/* Horizontal sum, equivalent to f1[0] + f1[1] + f1[2] + f1[3] */
float s0 = hsum(f1);
/* Horizontal product, equivalent to f1[0] * f1[1] * f1[2] * f1[3] */
float s1 = hprod(f1);
/* Horizontal minimum, equivalent to std::min({ f1[0], f1[1], f1[2], f1[3] }) */
float s2 = hmin(f1);
/* Horizontal maximum, equivalent to std::max({ f1[0], f1[1], f1[2], f1[3] }) */
float s3 = hmax(f1);
/* Horizontal mean, equivalent to (f1[0] + f1[1] + f1[2] + f1[3]) / 4.f */
float s4 = hmean(f1);
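To see where the \(\log_2(N)\) cost comes from, consider the following conceptual sketch (not Enoki’s actual implementation), which realizes a 4-wide horizontal sum using two shuffle/add steps:

using Float4 = Array<float, 4>;

float hsum_sketch(Float4 v) {
    Float4 t = v + shuffle<2, 3, 0, 1>(v); /* [v0+v2, v1+v3, v2+v0, v3+v1] */
    t = t + shuffle<1, 0, 3, 2>(t);        /* every entry now holds the sum */
    return t[0];
}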
The following linear algebra primitives are also realized in terms of horizontal operations:
/* Dot product of two arrays */
float dp = dot(f1, f2);
/* For convenience: absolute value of the dot product */
float adp = abs_dot(f1, f2);
/* Squared 2-norm of a vector */
float sqn = squared_norm(f1);
/* 2-norm of a vector */
float nrm = norm(f1);
Comparisons involving Enoki types are generally applied component-wise and produce a mask representing the outcome of the comparison. The internal representation of a mask is an implementation detail that varies widely from architecture to architecture – an overview is given in the section on Architectural differences handled by Enoki.
Masks enable powerful branchless logic in combination with a range of other bit-level operations. The following snippets show some example usage of mask types:
auto mask = f1 > 1.f;
/* Bit-level AND operation: Zero out entries where the comparison was false */
f1 &= mask;
Masks can be combined in various ways:
mask ^= (f1 > cos(f2)) | ~(f2 <= f1);
The following range tests also generate masks:
mask = isnan(f1); /* Per-component NaN test */
mask = isinf(f1); /* Per-component +/- infinity test */
mask = isfinite(f1); /* Per-component test for finite values */
Note
Using the -ffast-math compiler option may break detection of NaN values, and so is typically not recommended.
Enoki provides a number of helpful trait classes to access array-related types. For instance, enoki::mask_t determines the mask type associated with an array, which permits replacing the auto statement above.
using FloatMask = mask_t<MyFloat>;
FloatMask mask = f1 > 1.f;
A comprehensive list of type traits is available in the reference. Similar to the horizontal operations for addition and multiplication involving arithmetic arrays, mask arrays also provide a set of horizontal operations:
/* Do *all* entries have a mask value corresponding to 'true'? */
bool mask_all_true = all(mask);
/* Do *some* entries have a mask value corresponding to 'true'? */
bool mask_some_true = any(mask);
/* Do *none* of the entries have a mask value corresponding to 'true'? */
bool mask_none_true = none(mask);
/* Count *how many* entries have a mask value corresponding to 'true'? */
size_t true_count = count(mask);
Note
The special case of the equality and inequality comparison operators: following the principle of least surprise, enoki::operator==() and enoki::operator!=() return a boolean value (i.e. they internally perform a horizontal reduction). Vertical comparison operators named eq() and neq() are also available. The following pairs of operations are thus equivalent:
MyFloat f1 = ..., f2 = ...;
bool b1 = (f1 == f2);
bool b2 = all(eq(f1, f2));
bool b3 = (f1 != f2);
bool b4 = any(neq(f1, f2));
One of the most useful bit-level operations is select(), which chooses between two arguments using a mask. This is extremely useful for writing branch-free code. Argument order matches the C ternary operator, i.e. condition ? true_value : false_value maps to select(condition, true_value, false_value).
f1 = select(f1 < 0.f, f1, f2);
/* The above select() statement is equivalent to the following less efficient expression */
f1 = ((f1 < 0.f) & f1) | (~(f1 < 0.f) & f2);
Enoki also provides a special masked assignment operator, which updates entries of an array matching the given mask:
f1[f1 > 0.f] = f2;
f1[f1 < 0.f] += 1.f;
Compared to select(), a masked update may generate slightly more efficient code on some platforms. Apart from this, the two approaches can be used interchangeably. An alternative syntax involving the function enoki::masked() also exists:
masked(f1, f1 > 0.f) = f2;
masked(f1, f1 < 0.f) += 1.f;
This is functionally equivalent to the previous example. The enoki::masked() syntax exists because it extends to cases where f1 is scalar, i.e. not an Enoki array. Using Enoki functions with scalar arguments will be discussed later.
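For instance, the following sketch applies the scalar case just mentioned:

/* masked() also accepts a plain scalar and a boolean mask */
float x = 1.f;
masked(x, x > 0.f) += 1.f; /* x is now 2.f */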
Because information of dimension 3 occurs frequently (spatial coordinates, color information, …) and generally also benefits slightly from vectorization, Enoki represents 3-vectors in packed arrays of size 4, leaving the last component unused. Vertical operations are applied to the entire array including the fourth component, while horizontal operations skip the last component.
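For example (assuming a Vector3f alias for Array<float, 3>):

using Vector3f = Array<float, 3>;

Vector3f v(1.f, 2.f, 3.f);

/* The hidden fourth (padding) component is skipped: prints "6" */
std::cout << hsum(v) << std::endl;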
An efficient cross product operation realized using shuffles is available for 3-vectors:
f1 = cross(f1, f2);
Generally, a better way to work with 3D data while achieving much greater instruction level parallelism is via nested arrays and the Structure of Arrays (SoA) approach discussed next.
Footnotes
[1] Different combinations are used when not all of these instruction sets are available.
Application data is often not arranged in a way that is conducive to efficient vectorization. For instance, vectors in a 3D dataset have too few dimensions to fully utilize the SIMD lanes of modern hardware. Scalar or horizontal operations like dot products lead to similar inefficiencies if used frequently. In such situations, it is preferable to use nested arrays using a technique that is known as a Structure of Arrays (SoA) representation, which provides a way of converting scalar, horizontal, and low-dimensional vector arithmetic into vertical operations that fully utilize all SIMD lanes.
To understand the fundamental problem, consider the following basic example code, which computes the normalized cross product of a pair of 3D vectors. Without Enoki, this might be done as follows:
struct Vector3f {
float x;
float y;
float z;
};
Vector3f normalize(const Vector3f &v) {
float scale = 1.0f / std::sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
return Vector3f{v.x * scale, v.y * scale, v.z * scale};
}
Vector3f cross(const Vector3f &v1, const Vector3f &v2) {
return Vector3f{
(v1.y * v2.z) - (v1.z * v2.y),
(v1.z * v2.x) - (v1.x * v2.z),
(v1.x * v2.y) - (v1.y * v2.x)
};
}
Vector3f normalized_cross(Vector3f a, Vector3f b) {
return normalize(cross(a, b));
}
Clang compiles the normalized_cross() function into the following fairly decent scalar assembly (clang -O3 -msse4.2 -mfma -ffp-contract=fast -fomit-frame-pointer):
__Z16normalized_cross8Vector3fS_:
vmovshdup xmm4, xmm0 ; xmm4 = xmm0[1, 1, 3, 3]
vmovshdup xmm5, xmm2 ; xmm5 = xmm2[1, 1, 3, 3]
vinsertps xmm6, xmm3, xmm1, 16 ; xmm6 = xmm3[0], xmm1[0], xmm3[2, 3]
vblendps xmm7, xmm2, xmm0, 2 ; xmm7 = xmm2[0], xmm0[1], xmm2[2, 3]
vpermilps xmm7, xmm7, 225 ; xmm7 = xmm7[1, 0, 2, 3]
vinsertps xmm1, xmm1, xmm3, 16 ; xmm1 = xmm1[0], xmm3[0], xmm1[2, 3]
vblendps xmm3, xmm0, xmm2, 2 ; xmm3 = xmm0[0], xmm2[1], xmm0[2, 3]
vpermilps xmm3, xmm3, 225 ; xmm3 = xmm3[1, 0, 2, 3]
vmulps xmm1, xmm1, xmm3
vfmsub231ps xmm1, xmm6, xmm7
vmulss xmm2, xmm4, xmm2
vfmsub231ss xmm2, xmm0, xmm5
vmovshdup xmm0, xmm1 ; xmm0 = xmm1[1, 1, 3, 3]
vmulss xmm0, xmm0, xmm0
vfmadd231ss xmm0, xmm1, xmm1
vfmadd231ss xmm0, xmm2, xmm2
vsqrtss xmm0, xmm0, xmm0
vmovss xmm3, dword ptr [rip + LCPI0_0] ; xmm3 = 1.f, 0.f, 0.f, 0.f
vdivss xmm3, xmm3, xmm0
vmovsldup xmm0, xmm3 ; xmm0 = xmm3[0, 0, 2, 2]
vmulps xmm0, xmm1, xmm0
vmulss xmm1, xmm2, xmm3
ret
However, note that only 50% of the above 22 instructions perform actual arithmetic (which is scalar, i.e. low throughput), with the remainder being spent on unpacking and re-shuffling data.
Simply rewriting this code using Enoki leads to considerable improvements:
/* Enoki version */
using Vector3f = Array<float, 3>;
Vector3f normalized_cross(Vector3f a, Vector3f b) {
return normalize(cross(a, b));
}
; Assembly for Enoki version
__Z16normalized_cross8Vector3fS_:
vpermilps xmm2, xmm0, 201 ; xmm2 = xmm0[1, 2, 0, 3]
vpermilps xmm3, xmm1, 210 ; xmm3 = xmm1[2, 0, 1, 3]
vpermilps xmm0, xmm0, 210 ; xmm0 = xmm0[2, 0, 1, 3]
vpermilps xmm1, xmm1, 201 ; xmm1 = xmm1[1, 2, 0, 3]
vmulps xmm0, xmm0, xmm1
vfmsub231ps xmm0, xmm2, xmm3
vdpps xmm1, xmm0, xmm0, 113
vsqrtss xmm1, xmm1, xmm1
vmovss xmm2, dword ptr [rip + LCPI0_0] ; xmm2 = 1.f, 0.f, 0.f, 0.f
vdivss xmm1, xmm2, xmm1
vpermilps xmm1, xmm1, 0 ; xmm1 = xmm1[0, 0, 0, 0]
vmulps xmm0, xmm0, xmm1
ret
Here, Enoki has organized the 3D vectors as 4D arrays that “waste” the last component, allowing for a more compact sequence of SSE4.2 instructions with fewer shuffles. This is better but still not ideal: of the 12 instructions (a reduction by 50% compared to the previous example), 3 are vectorized, 2 are scalar, and 1 is a (slow) horizontal reduction. The remaining 6 are shuffle and move instructions.
The key idea that enables further vectorization of this code is to work on 3D arrays, whose components are themselves arrays. This is known as SoA-style data organization. One group of multiple 3D vectors represented in this way is referred to as a packet.
Since Enoki arrays support arbitrary nesting, it’s straightforward to wrap an existing Array representing a packet of data into another array while preserving the semantics of a 3-dimensional vector at the top level.
As before, all mathematical operations discussed so far are trivially supported due to the fundamental behavior of an Enoki array: all operations are simply forwarded to the contained entries (which are themselves arrays now: the procedure continues recursively). The following snippet demonstrates the basic usage of such an approach.
/* Declare an underlying packet type with 4 floats */
using FloatP = Array<float, 4>;
/* NEW: Packet of 3D vectors containing four separate directions */
using Vector3fP = Array<FloatP, 3>;
Vector3fP vec(
FloatP(1, 2, 3, 4), /* X components */
FloatP(5, 6, 7, 8), /* Y components */
FloatP(9, 10, 11, 12) /* Z components */
);
/* Enoki's stream insertion operator detects the recursive array and
prints the contents as a list of 3D vectors
"[[1, 5, 9],
[2, 6, 10],
[3, 7, 11],
[4, 8, 12]]" */
std::cout << vec << std::endl;
/* Element access using operator[] and x()/y()/z()/w() now return size-4 packets.
The statement below prints "[1, 2, 3, 4]" */
std::cout << vec.x() << std::endl;
/* Transcendental functions are applied to all (nested) components independently */
Vector3fP vec2 = sin(vec);
The behavior of horizontal operations changes as well – for instance, the dot product
FloatP dp = dot(vec, vec2);
now creates a size-4 packet of dot products: one for each pair of input 3D vectors. This is simply a consequence of applying the definition of the dot product to the components of the array (which are now arrays).
Note how this is a major performance improvement since relatively inefficient horizontal operations have now turned into a series of vertical operations that make better use of the processor’s vector units.
With the above type aliases, the normalized_cross() function now looks as follows:
Vector3fP normalized_cross(Vector3fP a, Vector3fP b) {
return normalize(cross(a, b));
}
Disregarding the loads and stores that are needed to fetch the operands and write the results, this generates the following assembly:
; Assembly for SoA-style version
__Z16normalized_cross8Vector3fS_:
vmulps xmm6, xmm2, xmm4
vfmsub231ps xmm6, xmm1, xmm5
vmulps xmm5, xmm0, xmm5
vfmsub213ps xmm2, xmm3, xmm5
vmulps xmm1, xmm1, xmm3
vfmsub231ps xmm1, xmm0, xmm4
vmulps xmm0, xmm2, xmm2
vfmadd231ps xmm0, xmm6, xmm6
vfmadd231ps xmm0, xmm1, xmm1
vsqrtps xmm0, xmm0
vbroadcastss xmm3, dword ptr [rip + LCPI0_0]
vdivps xmm0, xmm3, xmm0
vmulps xmm3, xmm6, xmm0
vmulps xmm2, xmm2, xmm0
vmulps xmm0, xmm1, xmm0
This is much better: 15 vectorized operations which process four vectors at the same time, while fully utilizing the underlying SSE4.2 vector units. If wider arithmetic is available, it’s of course possible to process many more vectors at the same time.
Enoki will also avoid costly high-latency operations like division and square root if the user indicates that minor approximations are tolerable. The following snippet demonstrates code generation on an ARMv8 NEON machine. Note the use of the frsqrte and frsqrts instructions for the reciprocal square root.
; Assembly for ARM NEON (armv8a) version
__Z16normalized_cross8Vector3fS_:
fmul v6.4s, v2.4s, v3.4s
fmul v7.4s, v0.4s, v4.4s
fmls v6.4s, v0.4s, v5.4s
fmul v16.4s, v1.4s, v5.4s
fmls v7.4s, v1.4s, v3.4s
fmul v0.4s, v6.4s, v6.4s
fmls v16.4s, v2.4s, v4.4s
fmla v0.4s, v7.4s, v7.4s
fmla v0.4s, v16.4s, v16.4s
frsqrte v1.4s, v0.4s
fmul v2.4s, v1.4s, v1.4s
frsqrts v2.4s, v2.4s, v0.4s
fmul v1.4s, v1.4s, v2.4s
fmul v2.4s, v1.4s, v1.4s
frsqrts v0.4s, v2.4s, v0.4s
fmul v2.4s, v1.4s, v0.4s
fmul v0.4s, v6.4s, v2.4s
fmul v1.4s, v7.4s, v2.4s
fmul v2.4s, v16.4s, v2.4s
On current processor architectures, most floating point operations have a latency of \(\sim4-6\) clock cycles. This means that instructions depending on the preceding instruction’s result will generally stall for at least that long.
To alleviate the effects of latency, it can be advantageous to choose an array Size that is an integer multiple of the system’s SIMD width (e.g. \(2\times\)). This leads to longer instruction sequences with fewer interdependencies, which can improve performance noticeably.
using FloatP = Array<float, 32>;
With the above type definition on an AVX512-capable machine, Enoki would e.g. unroll every 32-wide operation into a pair of 16-wide instructions.
Horizontal operations (e.g. hsum()) perform a reduction across the outermost dimension, which means that they return arrays instead of scalars when given a nested array as input (the same is also true for horizontal mask operations such as any()).
Sometimes this is not desirable, and Enoki thus also provides nested versions of all horizontal operations, which can be accessed via the _nested suffix. These functions recursively apply horizontal reductions until the result ceases to be an array. For instance, the following function ensures that no element of a packet of 3-vectors contains a Not-a-Number floating point value.
bool check(const Vector3fP &x) {
return none_nested(isnan(x));
}
Enoki performs an automatic broadcast operation whenever a higher-dimensional array is initialized from a lower-dimensional array, or when types of mixed dimension occur in an arithmetic expression.
The figure below illustrates a broadcast of a 3D vector to a rank 4 tensor during an arithmetic operation involving both (e.g. addition).
Enoki works its way through the shape descriptors of both arrays, moving from right to left in the above figure (i.e. from the outermost to the innermost nesting level). At each iteration, it checks if the current pair of shape entries match – if they do, the iteration advances to the next entry. Otherwise, a broadcast is performed over that dimension, which is analogous to inserting a 1 into the lower-dimensional array’s shape. When the dimensions of the lower-dimensional array are exhausted, Enoki appends further ones until the dimensions match. In the above example, the array is thus copied to the second dimension, with a broadcast taking place over the last and first two dimensions.
Broadcasting nested types works in the same way. Here, the entries of a \(3\times 3\) array are copied to the first and third dimensions of the output array:
Automatic broadcasting is convenient whenever a computation combines both vectorized and non-vectorized data. For instance, it is legal to mix instances of type Vector4f and Vector4fP, defined below, in the same expression.
/* Packet data type */
using FloatP = Array<float>;

/* 4D vector and packets of 4D vectors */
using Vector4f  = Array<float, 4>;
using Vector4fP = Array<FloatP, 4>;

Vector4f  data1 = ..;
Vector4fP data2 = ..;

/* This is legal -- 'result' is of type Vector4fP */
auto result = data1 + data2;
Warning
Broadcasting can sometimes lead to unexpected and undesirable behavior. The following section explains how this can be avoided.
We will now switch gears to GPU arrays, whose operation is markedly different from that of the other previously discussed Enoki array types.
The first major change when working with GPU arrays is that Enoki is no longer a pure header file library. A compilation step becomes necessary, which produces shared libraries against which applications must be linked. The reason for this is to minimize template bloat: GPU arrays involve a certain amount of additional machinery, and it would be wasteful to have to include the underlying implementation in every file of an application that relies on GPU arrays.
Compiling Enoki produces up to three libraries that are potentially of interest:

libenoki-cuda.so: a just-in-time compiler that is used to realize the Enoki backend of the CUDAArray<T> array type discussed in this section.

libenoki-autodiff.so: a library for maintaining a computation graph for automatic differentiation, discussed in the next section.

enoki.cpython-37m-x86_64-linux-gnu.so (platform-dependent filename): a Python binding library that provides interoperability with Enoki’s GPU arrays and differentiable GPU arrays.

Enter the following CMake commands to compile all of them:
cd <path-to-enoki>
mkdir build
cd build
cmake -DENOKI_CUDA=ON -DENOKI_AUTODIFF=ON -DENOKI_PYTHON=ON ..
make
For educational reasons, it is instructive to compile Enoki in debug mode, which enables a number of log messages that we will refer to in the remainder of this section. Use the following CMake command to do so:
cmake -DCMAKE_BUILD_TYPE=Debug -DENOKI_CUDA=ON -DENOKI_AUTODIFF=ON -DENOKI_PYTHON=ON ..
We find it easiest to introduce Enoki’s GPU arrays from within an interactive Python interpreter and postpone the discussion of the C++ interface to the end of this section. We’ll start by importing the Enoki extension into an interactive Python session and set the log level to a high value via cuda_set_log_level(), which will be helpful in the subsequent discussion.
>>> from enoki import *
>>> cuda_set_log_level(4)
Note
The first time that Enoki is imported on a new machine, it will trigger a kernel pre-compilation step that takes a few seconds.
The Enoki python bindings expose a number of types in the enoki.cuda submodule that correspond to GPU-resident arrays. The following example initializes such an array with a constant followed by a simple addition operation.
>>> from enoki.cuda import Float32 as FloatC
>>> a = FloatC(1)
cuda_trace_append(10): mov.$t1 $r1, 0f3f800000
>>> a = a + a
cuda_trace_append(11 <- 10, 10): add.rn.ftz.$t1 $r1, $r2, $r3
Observe the two cuda_trace_append log messages, which begin to reveal the mechanics underlying the GPU backend: neither of these two operations has actually occurred at this point. Instead, Enoki simply queued this computation for later execution using an assembly-like intermediate language named NVIDIA PTX.
For instance, the first line cuda_trace_append(10): mov.$t1 $r1, 0f3f800000 indicates the creation of a new variable with index 10, and the associated line of PTX will eventually be used to initialize this variable with a binary constant representing the floating point value 1.0. The next cuda_trace_append command introduces a new variable 11 that will record the result of the addition, while keeping track of the dependence on the original variable 10, etc. More complex numerical operations (e.g. a hyperbolic tangent) result in a longer sequence of steps that are similarly enqueued:
>>> a = tanh(a)
cuda_trace_append(12 <- 11): abs.ftz.$t1 $r1, $r2
cuda_trace_append(13): mov.$t1 $r1, 0f3f200000
... 25 lines skipped ...
cuda_trace_append(39 <- 38, 37): sub.rn.ftz.$t1 $r1, $r2, $r3
cuda_trace_append(40 <- 39, 29, 14): selp.$t1 $r1, $r2, $r3, $r4
Eventually, numerical evaluation can no longer be postponed, e.g. when we try to print the array contents:
>>> print(a)
cuda_eval(): launching kernel (n=1, in=0, out=1, ops=31)
.... many lines skipped ...
cuda_jit_run(): cache miss, jit: 541 us, ptx compilation: 43.534 ms, 10 registers
[0.964028]
At this point, Enoki’s JIT backend compiles and launches a kernel that contains all of the computation queued thus far.
.version 6.3
.target sm_75
.address_size 64
.visible .entry enoki_8a163272(.param .u64 ptr,
.param .u32 size) {
.reg.b8 %b<41>;
.reg.b16 %w<41>;
.reg.b32 %r<41>;
.reg.b64 %rd<41>;
.reg.f32 %f<41>;
.reg.f64 %d<41>;
.reg.pred %p<41>;
// Grid-stride loop setup
ld.param.u64 %rd0, [ptr];
ld.param.u32 %r1, [size];
mov.u32 %r4, %tid.x;
mov.u32 %r5, %ctaid.x;
mov.u32 %r6, %ntid.x;
mad.lo.u32 %r2, %r5, %r6, %r4;
setp.ge.u32 %p0, %r2, %r1;
@%p0 bra L0;
mov.u32 %r7, %nctaid.x;
mul.lo.u32 %r3, %r6, %r7;
L1:
// Loop body
mov.f32 %f10, 0f3f800000;
add.rn.ftz.f32 %f11, %f10, %f10;
mul.rn.ftz.f32 %f12, %f11, %f11;
mul.rn.ftz.f32 %f13, %f12, %f12;
mul.rn.ftz.f32 %f14, %f13, %f13;
mov.f32 %f15, 0fbbbaf0ea;
mul.rn.ftz.f32 %f16, %f15, %f14;
mov.f32 %f17, 0f3e088393;
mov.f32 %f18, 0fbeaaaa99;
fma.rn.ftz.f32 %f19, %f12, %f17, %f18;
add.rn.ftz.f32 %f20, %f19, %f16;
mov.f32 %f21, 0f3ca9134e;
mov.f32 %f22, 0fbd5c1e2d;
fma.rn.ftz.f32 %f23, %f12, %f21, %f22;
fma.rn.ftz.f32 %f24, %f13, %f23, %f20;
mul.rn.ftz.f32 %f25, %f12, %f11;
fma.rn.ftz.f32 %f26, %f24, %f25, %f11;
add.rn.ftz.f32 %f27, %f11, %f11;
mov.f32 %f28, 0f3fb8aa3b;
mul.rn.ftz.f32 %f29, %f28, %f27;
ex2.approx.ftz.f32 %f30, %f29;
mov.f32 %f31, 0f3f800000;
add.rn.ftz.f32 %f32, %f30, %f31;
rcp.approx.ftz.f32 %f33, %f32;
add.rn.ftz.f32 %f34, %f33, %f33;
mov.f32 %f35, 0f3f800000;
sub.rn.ftz.f32 %f36, %f35, %f34;
abs.ftz.f32 %f37, %f11;
mov.f32 %f38, 0f3f200000;
setp.ge.f32 %p39, %f37, %f38;
selp.f32 %f40, %f36, %f26, %p39;
// Store register %f40
ldu.global.u64 %rd8, [%rd0 + 0];
st.global.f32 [%rd8], %f40;
add.u32 %r2, %r2, %r3;
setp.ge.u32 %p0, %r2, %r1;
@!%p0 bra L1;
L0:
ret;
}
Internally, Enoki hands the PTX code over to CUDA’s runtime compiler (NVRTC), which performs a second pass that translates from PTX to the native GPU instruction set SASS.
enoki_8a163272:
MOV R1, c[0x0][0x28];
S2R R0, SR_TID.X;
S2R R3, SR_CTAID.X;
IMAD R0, R3, c[0x0][0x0], R0;
ISETP.GE.U32.AND P0, PT, R0, c[0x0][0x168], PT;
@P0 EXIT;
BSSY B0, `(.L_2);
ULDC.64 UR4, c[0x0][0x160];
.L_3:
LDG.E.64.SYS R2, [UR4];
MOV R5, 0x3f76ca83;
MOV R7, c[0x0][0x0];
IMAD R0, R7, c[0x0][0xc], R0;
ISETP.GE.U32.AND P0, PT, R0, c[0x0][0x168], PT;
STG.E.SYS [R2], R5;
@!P0 BRA `(.L_3);
BSYNC B0;
.L_2:
EXIT ;
.L_4:
BRA `(.L_4);
This second phase is a full-fledged optimizing compiler with constant propagation and common subexpression elimination. You can observe this in the previous example because the second snippet is much smaller—in fact, almost all of the computation was optimized away and replaced by a simple constant (\(\tanh(2)\approx 0.964028\)).
Enoki’s approach is motivated by efficiency considerations: most array operations are individually very simple and do not involve a sufficient amount of computation to outweigh overheads related to memory accesses and GPU kernel launches. Enoki therefore accumulates larger amounts of work (potentially hundreds of thousands of individual operations) before creating and launching an optimized GPU kernel. Once evaluated, array contents can be accessed without triggering further computation:
>>> print(a)
[0.964028]
GPU kernel compilation consists of two steps: the first generates a PTX kernel from the individual operations—this is essentially just string concatenation and tends to be very fast (541 µs in the above example, most of which is caused by printing assembly code onto the console due to the high log level).
The second step (ptx compilation) that converts the PTX intermediate representation into concrete machine code that can be executed on the installed graphics card is orders of magnitude slower (43 ms in the above example) but only needs to happen once: whenever the same computation occurs again (e.g. in subsequent iterations of an optimization algorithm), the previously generated kernel is reused:
>>> b = FloatC(1)
>>> b = b + b
>>> b = tanh(b)
>>> print(b)
cuda_eval(): launching kernel (n=1, in=0, out=1, ops=31)
.... many lines skipped ...
cuda_jit_run(): cache hit, jit backend: 550 us
[0.964028]
We now turn to a more complex example: computing the three-dimensional volume of a sphere using Monte Carlo integration. To do so, we create a random number generator (RNG) that will generate 1 million samples:
>>> from enoki.cuda import PCG32 as PCG32C, UInt64 as UInt64C
>>> rng = PCG32C(UInt64C.arange(1000000))
Here, PCG32 refers to a permuted congruential generator from the section on random number generation. We use it to sample three random number vectors from the RNG and create a dynamic array of 3D vectors (Vector3fC).
>>> from enoki.cuda import Vector3f as Vector3fC
>>> v = Vector3fC([rng.next_float32() * 2 - 1 for _ in range(3)])
Finally, we compute a mask that determines which of the uniformly distributed vectors on the set \([-1, 1]^3\) lie within the unit sphere:
>>> inside = norm(v) < 1
At this point, seeding of the random number generator and subsequent sampling steps touching its internal state have produced over a hundred different operations generating various intermediate results along with the output variable of interest.
To understand the specifics of this process, we assign a label to this variable and enter the command cuda_whos(), which is analogous to whos in IPython and MATLAB and generates a listing of all variables that are currently registered (with the JIT compiler, in this case).
>>> set_label(inside, 'inside')
>>> cuda_whos()
ID Type E/I Refs Size Memory Ready Label
=================================================================
10 u32 0 / 1 1000000 3.8147 MiB [ ]
11 u64 0 / 1 1000000 7.6294 MiB [ ]
... 126 lines skipped ...
178 f32 0 / 1 1 4 B [ ]
179 msk 1 / 0 1000000 976.56 KiB [ ] inside
=================================================================
Memory usage (ready) : 0 B
Memory usage (scheduled) : 0 B + 20.027 MiB = 20.027 MiB
Memory savings : 350.95 MiB
The resulting output lists variables of many types (single precision floating point values, 32/64 bit unsigned integers, masks, etc.), of which the last one corresponds to the inside variable named above.
Note how each variable lists two reference counts (in the column E/I Refs): the first (external) specifies how many times the variable is referenced from an external application like the interactive Python prompt, while the second (internal) counts how many times it is referenced as part of queued arithmetic expressions. Variables with zero references in both categories are automatically purged from the list.
Most of the variables are only referenced internally—these correspond to temporaries created during a computation. Because they can no longer be “reached” through external references, it would be impossible to ask the system for the contents of such a temporary variable. Enoki relies on this observation to perform an important optimization: rather than storing temporaries in global GPU memory, their contents can be represented using cheap temporary GPU registers. This yields significant storage and memory traffic savings: over 350 MiB of storage can be elided in the last example, leaving only roughly 20 MiB of required storage.
In fact, these numbers can still change: we have not actually executed the computation yet, and Enoki currently conservatively assumes that we plan to continue using the random number generator rng and list of 3D vectors v later on. If we instruct Python to garbage-collect these two variables, the required storage drops to less than a megabyte:
>>> del v, rng
>>> cuda_whos()
ID Type E/I Refs Size Memory Ready Label
=================================================================
10 u32 0 / 1 1000000 3.8147 MiB [ ]
11 u64 0 / 1 1000000 7.6294 MiB [ ]
... 126 lines skipped ...
178 f32 0 / 1 1 4 B [ ]
179 msk 1 / 0 1000000 976.56 KiB [ ] inside
=================================================================
Memory usage (ready) : 0 B
Memory usage (scheduled) : 0 B + 976.56 KiB = 976.56 KiB
Memory savings : 324.25 MiB
Finally, we can “peek” into the inside array to compute the fraction of points that lie within the sphere, which approximates the expected value \(\frac{4}{3\cdot 2^3}\pi\approx0.523599\).
>>> count(inside) / len(inside)
... many lines skipped ...
0.523946
It is sometimes desirable to manually force Enoki’s JIT compiler to generate a kernel containing the computation queued thus far. For instance, rather than compiling a long-running iterative algorithm into a single huge kernel, a single kernel per iteration may be preferable. This can be accomplished by explicitly invoking the cuda_eval() function periodically. An example:
>>> from enoki.cuda import UInt32 as UInt32C
>>> a = UInt32C.arange(1234)
>>> cuda_eval()
cuda_eval(): launching kernel (n=1234, in=0, out=1, ops=1)
>>> cuda_whos()
ID Type E/I Refs Size Memory Ready Label
=================================================================
10 u32 1 / 0 1234 4.8203 KiB [x]
=================================================================
Memory usage (ready) : 4.8203 KiB
Memory usage (scheduled) : 4.8203 KiB + 0 B = 4.8203 KiB
Memory savings : 0 B
The array is now marked “ready”, which means that its contents were evaluated and reside in GPU memory at an address that can be queried via the data field.
>>> a.data
140427428626432
Actually, that is not entirely accurate: kernels are always launched asynchronously, which means that the function cuda_eval() may have returned before the GPU finished executing the kernel. Nonetheless, it is perfectly safe to begin using the variable immediately, as asynchronous communication with the GPU still observes a linear ordering guarantee.
In very rare cases (e.g. kernel benchmarking), it may be desirable to wait until all currently running kernels have terminated. For this, invoke cuda_sync() following cuda_eval().
Recall the difference between vertical and horizontal operations: vertical operations are applied independently to each element of a vector, while horizontal ones combine the different elements of a vector. Enoki’s GPU arrays are designed to operate very efficiently when working with vertical operations that can be parallelized over the entire chip.
Horizontal operations (e.g. hsum(), all(), count(), etc.) are best avoided whenever possible, because they require that all prior computation has finished. In other words: each time Enoki encounters a horizontal operation involving an unevaluated array, it triggers a call to cuda_eval(). That said, horizontal reductions are executed in parallel using NVIDIA’s CUB library, which is a highly performant implementation of these primitives.
Enoki GPU arrays support bidirectional conversion from/to NumPy arrays, which will of course involve some communication between the CPU and GPU:
>>> x = FloatC.linspace(0, 1, 5)
>>> # Enoki -> NumPy
>>> y = Vector3fC(x, x*2, x*3).numpy()
cuda_eval(): launching kernel (n=5, in=1, out=6, ops=36)
>>> print(y)
array([[0. , 0. , 0. ],
[0.25, 0.5 , 0.75],
[0.5 , 1. , 1.5 ],
[0.75, 1.5 , 2.25],
[1. , 2. , 3. ]], dtype=float32)
>>> # NumPy -> Enoki
>>> Vector3fC(y)
cuda_eval(): launching kernel (n=5, in=1, out=3, ops=27)
[[0, 0, 0],
[0.25, 0.5, 0.75],
[0.5, 1, 1.5],
[0.75, 1.5, 2.25],
[1, 2, 3]]
PyTorch GPU tensors are supported as well. In this case, copying occurs on the GPU (but is still necessary, as the two frameworks use different memory layouts for tensors).
>>> x = FloatC.linspace(0, 1, 5)
>>> # Enoki -> PyTorch
>>> y = Vector3fC(x, x*2, x*3).torch()
cuda_eval(): launching kernel (n=5, in=2, out=5, ops=31)
>>> y
tensor([[0.0000, 0.0000, 0.0000],
[0.2500, 0.5000, 0.7500],
[0.5000, 1.0000, 1.5000],
[0.7500, 1.5000, 2.2500],
[1.0000, 2.0000, 3.0000]], device='cuda:0')
>>> # PyTorch -> Enoki
>>> Vector3fC(y)
cuda_eval(): launching kernel (n=5, in=1, out=3, ops=27)
[[0, 0, 0],
[0.25, 0.5, 0.75],
[0.5, 1, 1.5],
[0.75, 1.5, 2.25],
[1, 2, 3]]
Note how the .numpy() and .torch() function calls both triggered a mandatory kernel launch to ensure that the array contents were ready before returning a representation in the other framework. This can be wasteful when converting many variables at an interface between two frameworks. For this reason, both the .numpy() and .torch() functions take an optional eval argument that is set to True by default. Passing False causes the operation to return an uninitialized NumPy or PyTorch array, while at the same time scheduling Enoki code that will eventually fill this memory with valid contents the next time that cuda_eval() is triggered. An example is shown below. This feature is to be used with caution.
>>> x = FloatC.linspace(0, 1, 5)
>>> y = Vector3fC(x, x*2, x*3).numpy(False)
>>> y
array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]], dtype=float32)
>>> cuda_eval()
cuda_eval(): launching kernel (n=5, in=1, out=4, ops=36)
>>> y
array([[0. , 0. , 0. ],
[0.25, 0.5 , 0.75],
[0.5 , 1. , 1.5 ],
[0.75, 1.5 , 2.25],
[1. , 2. , 3. ]], dtype=float32)
The GPU backend also supports scatter and gather operations involving GPU arrays as targets.
>>> a = FloatC.zero(10)
>>> b = UInt32C.arange(5)
>>> scatter(target=a, source=FloatC(b), index=b*2)
>>> a
cuda_eval(): launching kernel (n=5, in=1, out=2, ops=9)
[0, 0, 1, 0, 2, 0, 3, 0, 4, 0]
Note that gathering from an unevaluated Enoki array is not guaranteed to be a vertical operation, hence it triggers a call to cuda_eval().
Similar to the PyTorch memory allocator, Enoki uses a caching scheme to avoid very costly device synchronizations when releasing memory. This means that freeing a large GPU variable doesn’t cause the associated memory region to become available for use by the operating system or other frameworks like TensorFlow or PyTorch. Use the function cuda_malloc_trim() to fully purge all unused memory. The function is only relevant when working with other frameworks and does not need to be called to free up memory for use by Enoki itself.
Everything demonstrated in the above sections can be directly applied to C++ programs as well. To use the associated (dynamic array) type CUDAArray, include the header

#include <enoki/cuda.h>

Furthermore, applications must be linked against the cuda and enoki-cuda libraries. The following snippet contains a C++ translation of the Monte Carlo integration Python example shown earlier.
#include <enoki/cuda.h>
#include <enoki/random.h>
#include <iostream>

using namespace enoki;

using FloatC    = CUDAArray<float>;
using Vector3fC = Array<FloatC, 3>;
using PCG32C    = PCG32<FloatC>;
using MaskC     = mask_t<FloatC>;

int main(int argc, char **argv) {
    PCG32C rng(PCG32_DEFAULT_STATE, arange<FloatC>(1000000));

    Vector3fC v(
        rng.next_float32() * 2.f - 1.f,
        rng.next_float32() * 2.f - 1.f,
        rng.next_float32() * 2.f - 1.f
    );

    MaskC inside = norm(v) < 1.f;

    std::cout << count(inside) / (float) inside.size() << std::endl;
}
When vectorizing code, we may sometimes want to skip an expensive computation when it is not actually needed by any elements in the array being processed. This is usually done with the any() function and yields good performance when targeting the CPU (e.g. with the AVX512 backend). An example:

auto condition = variable > 1.f;

if (any(condition))
    result[condition] = /* expensive-to-evaluate expression */;
However, recall the discussion earlier in this section, which explained how horizontal operations tend to be fairly expensive in conjunction with the GPU backend because they flush the JIT compiler. This effectively breaks up the program into smaller kernels, increasing memory traffic and missing potential optimization opportunities. Arrays processed by the GPU backend tend to be much larger, and from a probabilistic viewpoint it is often likely that the any() function call will evaluate to true in any case. For these reasons, skipping the test and always evaluating the expression often leads to better performance on the GPU.
Enoki provides alternative horizontal reductions of masks named any_or(), all_or(), and none_or() that do exactly this: they skip evaluation when compiling for GPU targets and simply return the supplied template argument. For other targets, they behave as usual. With this change, the example looks as follows:
auto condition = variable > 1.f;

if (any_or<true>(condition))
    result[condition] = /* expensive-to-evaluate expression */;
Enoki was designed as a numerical foundation for differentiable physical simulations, specifically the Mitsuba renderer, though it is significantly more general and should be a trusty tool for a variety of simulation and optimization problems.
Its GPU and Autodiff backends are related to well-known frameworks like TensorFlow and PyTorch that have become standard tools for training and evaluating neural networks. In the following, we outline the main differences between these frameworks and Enoki.
Both PyTorch and TensorFlow provide two main operational modes: eager mode directly evaluates arithmetic operations on the GPU, which yields excellent performance in conjunction with arithmetically intensive operations like convolutions and large matrix-vector multiplications, both of which are building blocks of neural networks. When evaluating typical simulation code that mainly consists of much simpler arithmetic (e.g. additions, multiplications, etc.), the resulting memory traffic and scheduling overheads induce severe bottlenecks. An early prototype of Enoki provided a TorchArray<T> type that carried out operations using PyTorch's eager mode, and the low performance of this combination eventually motivated us to develop the technique based on JIT compilation introduced in the previous section.
The second operational mode requires an up-front specification of the complete computation graph, which is used to generate a single optimized GPU kernel (e.g. via XLA in TensorFlow and jit.trace in PyTorch). This is feasible for neural networks, whose graph specification is very regular and typically only consists of a few hundred operations. Simulation code, on the other hand, involves much larger graphs, whose structure is unpredictable: program execution often involves randomness, which could cause jumps to almost any part of the system. The full computation graph would simply be the entire codebase (potentially on the order of hundreds of thousands of lines of code), which is of course far too big.
Enoki’s approach could be interpreted as a middle ground between the two extremes discussed above. Graphs are created on the fly during a simulation, and can be several orders of magnitude larger compared to typical neural networks. They consist mostly of unstructured and comparably simple arithmetic that is lazily fused into optimized CUDA kernels. Since our system works without an up-front specification of the full computation graph, it must support features like dynamic indirection via virtual function calls that can simultaneously branch to multiple different implementations. The details of this are described in the section on function calls.
Note that there are of course many use cases where PyTorch, TensorFlow, etc. are vastly superior to Enoki, and it is often a good idea to combine the two in such cases (e.g. to feed the output of a differentiable simulation into a neural network).
One last related framework is ArrayFire, which provides a JIT compiler that lazily fuses instructions similar to our CUDAArray<T> type. ArrayFire targets a higher-level language (C), but appears to be limited to fairly small kernels (100 operations by default), and does not support a mechanism for automatic differentiation. In contrast, Enoki emits an intermediate representation (PTX) and fuses instructions into comparatively larger kernels that often exceed 100K instructions.
Automatic differentiation (AD) broadly refers to a set of techniques that numerically evaluate the gradient of a computer program. Two variants of AD are widely used:
Forward mode. Starting with a set of inputs (e.g. a and b) and associated derivatives (da and db), forward-mode AD instruments every operation (e.g. c = a * b) with an additional step that tracks the evolution of derivatives (dc = a*db + b*da); a minimal sketch of this bookkeeping follows below. Forward mode is ideal for functions with a single input and many outputs. When the function has multiple inputs, a separate propagation pass is needed per parameter, which can become very costly.
Reverse mode. On the other hand, reverse-mode AD specifically targets the case where the function to be differentiated has one (or few) outputs and potentially many inputs. It traverses the computational graph from outputs to inputs, repeatedly evaluating the chain rule in reverse. This approach is also known as backpropagation in the context of neural networks.
One tricky aspect of reverse mode is that the backward traversal can only begin once the output has been computed. A partial record of intermediate computations must furthermore be kept in memory, which can become costly for long-running computations.
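To make the forward-mode bookkeeping concrete, here is a minimal self-contained C++ sketch based on dual numbers (purely illustrative; the Dual type is hypothetical and unrelated to Enoki's actual implementation, which is discussed next). It reproduces the numbers of the forward-mode Python example shown further below:

#include <cmath>
#include <cstdio>

// Each quantity carries its value and its derivative with respect to the input
struct Dual {
    float value, deriv;
};

Dual operator*(Dual a, Dual b) {
    // Product rule: d(a*b) = a*db + b*da
    return { a.value * b.value, a.value * b.deriv + b.value * a.deriv };
}

Dual sqrt(Dual a) {
    // Chain rule: d(sqrt(a)) = da / (2*sqrt(a))
    float r = std::sqrt(a.value);
    return { r, a.deriv / (2.f * r) };
}

int main() {
    Dual a { 2.f, 1.f };  // input with associated derivative da = 1
    Dual b = a * a;       // b = 4,       db = 4
    Dual c = sqrt(a);     // c = 1.41421, dc = 0.353553
    std::printf("db=%f, dc=%f\n", b.deriv, c.deriv);
}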
The implementation in Enoki is realized via the special DiffArray<T> array type and supports both of the above variants, though it is particularly optimized for reverse-mode operation. The template argument T refers to an arithmetic type (e.g. float or an Enoki array) that is used to carry out the underlying primal and derivative computation.
Due to the need to maintain a computation graph for the reverse-mode traversal, AD tends to involve massive amounts of memory traffic, making it an ideal fit for GPU-resident arrays that provide a higher amount of memory bandwidth. For this reason, combinations such as DiffArray<CUDAArray<float>> should be considered the default choice. Enoki's Python bindings expose this type as enoki.cuda_autodiff.Float32, which the examples below import as FloatD. As in the previous section, we will stick to the interactive Python interface and postpone a discussion of the C++ side until the end of this section.
A FloatD instance consists of two parts: a floating point array that is used during the original computation (of type FloatC), and an index that refers to a node in a separately maintained directed acyclic graph capturing the structure of the differentiable portion of the computation. By default, the index is set to zero, which indicates that the variable does not participate in automatic differentiation.
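Conceptually (a simplified sketch for intuition only, not the actual implementation), one can picture a FloatD as the following pair:

#include <cstdint>

// Simplified mental model of DiffArray<T> (illustration only)
template <typename T> struct DiffArraySketch {
    T value;        // wrapped array used for the primal computation (e.g. FloatC)
    uint32_t index; // node in the AD graph; 0 means "not differentiable"
};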
The following two example snippets demonstrate usage of automatic differentiation. In both cases, the set_requires_gradient() function is used to mark a variable as being part of a differentiable computation.
>>> from enoki import *
>>> from enoki.cuda_autodiff import Float32 as FloatD
>>> # Create a single differentiable variable
>>> a = FloatD(2.0)
>>> set_requires_gradient(a)
>>> # Arithmetic with one input ('a') and multiple outputs ('b', 'c')
>>> b = a * a
>>> c = sqrt(a)
>>> # Forward-propagate gradients from single input to multiple outputs
>>> forward(a)
autodiff: forward(): processed 3/5 nodes.
>>> gradient(b), gradient(c)
([4], [0.353553])
The forward() and backward() functions realize the two previously discussed AD variants, and gradient() extracts the gradient associated with a differentiable variable. An example of reverse-mode traversal is shown next:
>>> from enoki import *
>>> from enoki.cuda_autodiff import Float32 as FloatD
>>> # Create multiple differentiable input variables
>>> a, b = FloatD(2.0), FloatD(3.0)
>>> set_requires_gradient(a)
>>> set_requires_gradient(b)
>>> # Arithmetic with two inputs ('a', 'b') and a single output ('c')
>>> c = a * sqrt(b)
>>> # Backward-propagate gradients from single output to multiple inputs
>>> backward(c)
autodiff: backward(): processed 3/4 nodes.
>>> gradient(a), gradient(b)
([1.73205], [0.57735])
Note that gradient() returns the gradient using the wrapped arithmetic type, which is a FloatC instance in this case. Another function named detach() can be used to extract the value using the underlying (non-differentiable) array type. Using these two operations, a gradient descent step on a parameter a would be realized as follows:
>>> a = FloatD(detach(a) - step_size * gradient(a))
Note that practical applications of Enoki likely involve large arrays with many millions of entries rather than scalars used in the two examples above.
It is possible to visualize the graph of the currently active computation using the graphviz() function. You may also want to assign explicit variable names via set_label() to make the visualization easier to parse. An example is shown below:
>>> a = FloatD(1.0)
>>> set_requires_gradient(a)
>>> b = erf(a)
>>> set_label(a, 'a')
>>> set_label(b, 'b')
>>> print(graphviz(b))
digraph {
rankdir=RL;
fontname=Consolas;
node [shape=record fontname=Consolas];
1 [label="'a' [s]\n#1 [E/I: 1/5]" fillcolor=salmon style=filled];
3 [label="mul [s]\n#3 [E/I: 0/4]"];
... 111 lines skipped ...
46 -> 12;
46 [fillcolor=cornflowerblue style=filled];
}
The resulting string can be visualized via Graphviz, which reveals the numerical approximation used to evaluate the error function erf().
The combination of Enoki’s JIT compiler and AD has interesting consequences: computation related to derivatives is queued up along with primal arithmetic and can thus be compiled into a joint GPU kernel.
For example, if a forward computation evaluates the expression \(\sin(x)\), the weight of the associated backward edge in the computation graph is given by \(\cos(x)\). The computation of both of these quantities is automatically merged into a single joint kernel, leveraging subexpression elimination and constant folding to further improve efficiency.
For the previous example involving the error function, the cuda_whos() function introduced in the last section reveals that many variables relating to both primal and gradient computations have been scheduled (but not executed yet).
>>> cuda_whos()
ID Type E/I Refs Size Memory Ready Label
=================================================================
10 f32 3 / 11 1 4 B [ ] a
11 f32 1 / 0 1 4 B [ ] a.grad
16 f32 0 / 1 1 4 B [ ]
17 f32 0 / 1 1 4 B [ ]
... 117 lines skipped ...
150 f32 1 / 0 1 4 B [ ] b
151 f32 0 / 1 1 4 B [ ]
152 f32 0 / 1 1 4 B [ ]
153 f32 1 / 0 1 4 B [ ]
154 f32 0 / 1 1 4 B [ ]
155 f32 0 / 1 1 4 B [ ]
156 f32 1 / 0 1 4 B [ ]
=================================================================
Memory usage (ready) : 0 B
Memory usage (scheduled) : 0 B + 268 B = 268 B
Memory savings : 235 B
An important goal of Enoki’s autodiff backend is a significant reduction of memory usage when differentiating simulation code, which tends to produce computation graphs with long sequences of relatively simple arithmetic operations. Existing frameworks like PyTorch do not fare very well in such cases. For instance, consider the following simple PyTorch session where an array is repeatedly multiplied by itself:
>>> # ----- GPU memory usage: 0 MiB -----
>>> import torch
>>> # Create a tensor with 1 million floats (4 MiB of GPU memory)
>>> a = torch.zeros(1024 * 1024, device='cuda')
>>> # ----- GPU memory usage: 809 MiB (mostly overhead) -----
>>> # Perform a simple differentiable computation
>>> b = a.requires_grad_()
>>> for i in range(1000):
...     b = b * b
>>> # ----- GPU memory usage: 4803 MiB -----
The issue here is that PyTorch keeps the entire computation graph (including intermediate results) in memory to be able to perform a reverse-mode traversal later on. This is costly and unnecessary when working with simple arithmetic operations.
To avoid this problem, Enoki periodically simplifies the computation graph by eagerly evaluating the chain rule at interior nodes to reduce storage requirements. Consequently, it does not follow a strict reverse- or forward-mode graph traversal, making it an instance of mixed-mode, or hybrid AD [GrWa08]. When working with differentiable GPU arrays, simplification occurs before each JIT compilation pass. The fundamental operation of the simplification process is known as vertex elimination [Yoshi87], [GrSh91] and collapses an interior node with \(d_i\) in-edges and \(d_o\) out-edges, creating \(d_i\cdot d_o\) new edges, whose weights are products of the original edge weights. These are then merged with existing edges, if applicable.
Although this operation may increase the density of the graph connectivity if \(d_i,d_o>1\), collapsing such nodes is often worthwhile since it enables later simplifications that can reduce an entire subgraph to a single edge. Compared to direct traversal of the original graph, simplification increases the required amount of arithmetic in exchange for lower memory usage. In conjunction with the GPU backend, this optimization is particularly effective: removals often target nodes whose primal computation has not yet taken place. Since edge weights of collapsed nodes are no longer directly reachable, they can be promoted to cheap register storage.
The order of collapse operations has a significant effect on the efficiency and size of the resulting kernels. Unfortunately, propagating derivatives in a way that results in a minimal number of operations is known to be NP-hard [Naum07]. Enoki uses a greedy scheme that organizes nodes in a priority queue ordered by the number of edges \(d_i\cdot d_o\) that would be created by a hypothetical collapse operation, issuing collapses from cheapest to most expensive until the cost exceeds an arbitrary threshold that we set to 10 edges.
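The following self-contained sketch illustrates just this greedy ordering using a toy graph representation (the Node type and data are hypothetical; Enoki's internal data structures differ):

#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct Node { std::vector<int> in, out; }; // in/out edges of an interior node

int main() {
    std::vector<Node> nodes = { {{0}, {1, 2}}, {{3, 4}, {5}}, {{6, 7}, {8, 9}} };

    // Min-heap ordered by the collapse cost d_i * d_o
    using Entry = std::pair<size_t, size_t>; // (cost, node index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> queue;
    for (size_t i = 0; i < nodes.size(); ++i)
        queue.push({ nodes[i].in.size() * nodes[i].out.size(), i });

    const size_t threshold = 10;
    while (!queue.empty() && queue.top().first <= threshold) {
        auto [cost, i] = queue.top(); queue.pop();
        std::printf("collapse node %zu (creates %zu edges)\n", i, cost);
        // A real implementation would now form the d_i * d_o products of edge
        // weights, merge duplicate edges, and update the costs of neighbors.
    }
}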
Graph simplification can be manually triggered via the FloatD.simplify_graph() operation. Returning to our earlier example of the error function, we can observe that it collapses the graph to just the input and output node.
>>> a = FloatD(1.0)
>>> set_requires_gradient(a)
>>> b = erf(a)
>>> set_label(a, 'a')
>>> set_label(b, 'b')
>>> FloatD.simplify_graph()
>>> print(graphviz(b))
digraph {
rankdir=RL;
fontname=Consolas;
node [shape=record fontname=Consolas];
1 [label="'a' [s]\n#1 [E/I: 1/1]" fillcolor=salmon style=filled];
46 [label="'b' [s]\n#46 [E/I: 1/0]" fillcolor=salmon style=filled];
46 -> 1;
46 [fillcolor=cornflowerblue style=filled];
}
If automatic graph simplification as part of cuda_eval() is not desired, it can be completely disabled by calling FloatD.set_graph_simplification(False).
References
[GrSh91] Andreas Griewank and Shawn Reese. 1991. On the calculation of Jacobian matrices by the Markowitz rule. Technical Report, Argonne National Laboratory, IL (United States).
[GrWa08] Andreas Griewank and Andrea Walther. 2008. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Vol. 105. SIAM.
[Yoshi87] Toshinobu Yoshida. 1987. Derivation of a computational process for partial derivatives of functions using transformations of a graph. Transactions of the Information Processing Society of Japan 11, 19.
[Naum07] Uwe Naumann. 2007. Optimal Jacobian accumulation is NP-complete. Mathematical Programming 112 (2007).
We will now look at a complete optimization example: our objective will be to find a matrix that rotates one vector onto another using gradient descent. This problem is of course contrived because a simple explicit solution exists, and because we won’t be using the vectorization aspect of Enoki, but it provides an opportunity to use a few more Enoki constructions. The annotated source code is given below:
from enoki import *
from enoki.cuda_autodiff import Float32 as FloatD, Vector3f as Vector3fD, \
    Vector4f as Vector4fD, Matrix4f as Matrix4fD

cuda_set_log_level(2)

# Initialize two 3D vectors. We want to rotate 'a' onto 'b'
a = normalize(Vector3fD(2, 1, 3))
b = normalize(Vector3fD(-1, 2, 3))

# Our rotation matrix will be parameterized by an axis and an angle
axis = Vector3fD(1, 0, 0)
angle = FloatD(1)

# Learning rate for gradient descent
lr = 0.2

for i in range(20):
    # Label and mark input variables as differentiable
    set_requires_gradient(axis)
    set_requires_gradient(angle)
    set_label(axis, "axis")
    set_label(angle, "angle")

    # Define a nested scope (only for visualization/debugging purposes)
    with FloatD.Scope("rotation"):
        # Compute a rotation matrix with the given axis and angle
        rot_matrix = Matrix4fD.rotate(axis=normalize(axis), angle=angle)

        # Label the entries of the rotation matrix
        set_label(rot_matrix, "rot_matrix")

    # Define a nested scope (only for visualization/debugging purposes)
    with FloatD.Scope("loss"):
        # Apply the rotation matrix to 'a' and compute the L2 difference to 'b'
        loss = norm(rot_matrix * Vector4fD(a.x, a.y, a.z, 0) -
                    Vector4fD(b.x, b.y, b.z, 0))

        # Label the resulting loss
        set_label(loss, "loss")

    # Dump a GraphViz plot of the computation graph
    with open("out_%i.dot" % i, "w") as f:
        f.write(graphviz(loss))

    # Reverse-mode traversal of the computation graph
    backward(loss)
    print("err: %s" % str(loss))

    # Gradient descent
    axis = Vector3fD(normalize(detach(axis) - gradient(axis) * lr))
    angle = FloatD(detach(angle) - gradient(angle) * lr)
Running the above program prints a message of the form
autodiff: backward(): processed 58/58 nodes.
cuda_eval(): launching kernel (n=1, in=14, out=25, ops=521)
cuda_jit_run(): cache hit, jit: 535 us
err: [1.12665]
for each iteration. After a few iterations, the error is reduced from an initial value of 1.34 to 0.065. Note that kernels are only created at the beginning—later iterations indicate cache hits because the overall structure of the computation is repetitive.
Observe also that the print() statement that reports the loss value in each iteration has an interesting side effect: it flushes the queued computation and waits for it to finish (waiting for the computation to finish is clearly necessary, otherwise how could we know the loss?). Moving this statement to the last line causes all iterations to be merged into a single kernel that is much larger (see the ops=10444 part of the debug message, which specifies the number of PTX instructions):
cuda_eval(): launching kernel (n=1, in=0, out=27, ops=10444)
cuda_jit_run(): cache miss, jit: 22.763 ms, ptx compilation: 299.26 ms, 73 registers
err: [0.0653574]
This is likely not desired, and a call to cuda_flush() per iteration would generally be advisable when such a situation arises.
Finally, we visualize the GraphViz files that were written to disk by the optimization steps. Boxes in red highlight named variables (“axis”, “angle”, “rot_matrix”), and the blue box is the loss. You may also have wondered what the with FloatD.Scope(...): statements above do: these collect all computation in the nested scope, causing it to be arranged within a labeled box.
Enoki arrays provide scatter, gather, and atomic scatter-add primitives, which constitute a special case during automatic differentiation. Consider the following differentiable calculation, which selects a subset of an input array:
>>> a = FloatD.linspace(0, 1, 10)
>>> set_requires_gradient(a)
>>> c = gather(a, UInt32D([1, 4, 8, 4]))
>>> backward(hsum(c))
autodiff: backward(): processed 3/3 nodes.
>>> print(gradient(a))
[0, 1, 0, 0, 2, 0, 0, 0, 1, 0]
Here, reverse-mode propagation of a derivative of c with respect to the input parameter a requires a suitable scatter_add() operation during the reverse-mode traversal (observe that entry 4 of the gradient equals 2, since index 4 was gathered twice). Analogously, scatters turn into gathers under reverse-mode AD. The differentiable array backend recognizes these operations and inserts a special type of edge into the graph to enable the necessary transformations.
One current limitation of Enoki is that such special edges cannot be merged into ordinary edges during graph simplification. Handling this case could further reduce memory usage and is an interesting topic for future work.
It is possible to insert a differentiable computation realized using Enoki into a larger PyTorch program and subsequently back-propagate gradients through the combination of these systems. The following annotated example shows how to expose a differentiable Enoki function (enoki.atan2) to PyTorch. The page on Extending PyTorch is a helpful reference regarding the torch.autograd.Function construction used in the example.
import torch
import enoki

class EnokiAtan2(torch.autograd.Function):
    @staticmethod
    def forward(ctx, arg1, arg2):
        # Convert input parameters to Enoki arrays
        ctx.in1 = enoki.cuda_autodiff.FloatD(arg1)
        ctx.in2 = enoki.cuda_autodiff.FloatD(arg2)

        # Inform Enoki if PyTorch wants gradients for one/both of them
        enoki.set_requires_gradient(ctx.in1, arg1.requires_grad)
        enoki.set_requires_gradient(ctx.in2, arg2.requires_grad)

        # Perform a differentiable computation in Enoki
        ctx.out = enoki.atan2(ctx.in1, ctx.in2)

        # Convert the result back into a PyTorch array
        out_torch = ctx.out.torch()

        # Optional: release any cached memory from Enoki back to PyTorch
        enoki.cuda_malloc_trim()

        return out_torch

    @staticmethod
    def backward(ctx, grad_out):
        # Attach gradients received from PyTorch to the output
        # variable of the forward pass
        enoki.set_gradient(ctx.out, enoki.FloatC(grad_out))

        # Perform a reverse-mode traversal. Note that the static
        # version of the backward() function is being used, see
        # the following subsection for details on this
        enoki.cuda_autodiff.FloatD.backward()

        # Fetch gradients from the input variables and pass them on
        result = (enoki.gradient(ctx.in1).torch()
                  if enoki.requires_gradient(ctx.in1) else None,
                  enoki.gradient(ctx.in2).torch()
                  if enoki.requires_gradient(ctx.in2) else None)

        # Garbage-collect Enoki arrays that are now no longer needed
        del ctx.out, ctx.in1, ctx.in2

        # Optional: release any cached memory from Enoki back to PyTorch
        enoki.cuda_malloc_trim()

        return result

# Create 'enoki_atan2(y, x)' function
enoki_atan2 = EnokiAtan2.apply

# Let's try it!
y = torch.tensor(1.0, device='cuda')
x = torch.tensor(2.0, device='cuda')
y.requires_grad_()
x.requires_grad_()

o = enoki_atan2(y, x)
print(o)
o.backward()
print(y.grad)
print(x.grad)
Running this program yields the following output
cuda_eval(): launching kernel (n=1, in=1, out=8, ops=61)
tensor([0.4636], device='cuda:0', grad_fn=<EnokiAtan2Backward>)
autodiff: backward(): processed 3/3 nodes.
cuda_eval(): launching kernel (n=1, in=6, out=3, ops=20)
cuda_eval(): launching kernel (n=1, in=2, out=1, ops=9)
tensor(0.4000, device='cuda:0')
tensor(-0.2000, device='cuda:0')
The default forward() and backward() traversal functions require an input or output variable for which gradients should be propagated. Following the traversal, the autodiff graph data structure is immediately torn down. These assumptions are usually fine when the function being differentiated has 1 input and n outputs, or when it has m inputs and 1 output.
However, for a function with n inputs and m outputs, we may want to perform multiple reverse- or forward-mode traversals while retaining the computation graph. This is simple to do via an extra argument:
backward(my_variable, free_graph=False)
# or
forward(my_variable, free_graph=False)
Before each traversal, we may want to initialize the input or output variables with specific gradients; in this case, the static versions of the traversal functions are used:
set_gradient(out1, 2.0)
set_gradient(out2, 3.0)
FloatD.backward(free_graph=False)
# or
set_gradient(in1, 2.0)
set_gradient(in2, 3.0)
FloatD.forward(free_graph=False)
This functionality is particularly useful when implementing a partial reverse-mode traversal in the context of a larger differentiable computation realized using another framework (e.g. PyTorch). See the previous subsection for an example.
As in the previous section, the C++ and Python interfaces behave in exactly the same way. To use the DiffArray<T> type, include the header

#include <enoki/autodiff.h>

Furthermore, applications must be linked against the enoki-autodiff library (and against cuda and enoki-cuda if differentiable GPU arrays are used). The following snippet contains a C++ translation of the error function example shown earlier.
#include <enoki/cuda.h>
#include <enoki/autodiff.h>
#include <enoki/special.h> // for erf()
#include <iostream>

using namespace enoki;

using FloatC = CUDAArray<float>;
using FloatD = DiffArray<FloatC>;

int main(int argc, char **argv) {
    FloatD a = 1.f;
    set_requires_gradient(a);

    FloatD b = erf(a);
    set_label(a, "a");
    set_label(b, "b");

    std::cout << graphviz(b) << std::endl;

    backward(b);
    std::cout << gradient(a) << std::endl;
}
Method calls and virtual method calls are important building blocks of modern object-oriented C++ applications. When vectorization enters the picture, it is not immediately clear how they should be dealt with. This section introduces Enoki's method call vectorization support, focusing on a hypothetical Sensor class that decodes a measurement performed by a sensor.

Note that the examples will refer to CPU SIMD-style vectorization, but everything in this section also applies to other kinds of Enoki arrays (GPU arrays, differentiable arrays).

Suppose that the interface of the Sensor class originally looks as follows:
class Sensor {
public:
    /// Decode a measurement based on the sensor's response curve
    virtual float decode(float input) = 0;

    /// Return sensor's serial number
    virtual uint32_t serial_number() = 0;
};
It is trivial to add a second method that takes vector inputs, like so:
using FloatP = Packet<float, 8>;
using MaskP  = mask_t<FloatP>;

class Sensor {
public:
    /// Scalar version
    virtual float decode(float input) = 0;

    /// Vector version
    virtual FloatP decode(FloatP input) = 0;

    /// Return sensor's serial number
    virtual uint32_t serial_number() = 0;
};
This will work fine if there is just a single Sensor instance. But what if there are many of them, e.g. when each FloatP array of measurements also comes with a SensorP structure whose entries reference the sensor that produced the measurement?
class Sensor;
using SensorP = Array<Sensor *, 8>;
Ideally, we’d still be able to write the following code, but this sort of thing is clearly not supported by standard C++.
SensorP sensor = ...;
FloatP data = ...;
data = sensor->decode(data);
Enoki provides a support layer that can handle such vectorized method calls. It performs as many method calls as there are unique instances in the sensor array, and an optional mask is forwarded to the callee indicating the associated active SIMD lanes. Null pointers in the sensor array are legal and are treated as masked entries. The return value of masked entries is always zero (or a zero-filled array/structure, depending on the method's return type).
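Schematically, the support layer behaves like the following sketch (illustration only, assuming the mask-taking decode() variant shown below; the actual generated code is optimized, e.g. it invokes each unique instance exactly once):

FloatP decode_dispatch(const SensorP &sensor, const FloatP &input) {
    FloatP result = zero<FloatP>();      // lanes with null pointers remain zero
    for (size_t i = 0; i < SensorP::Size; ++i) {
        Sensor *ptr = sensor.coeff(i);
        if (!ptr)
            continue;                    // null pointer == masked entry
        MaskP active = eq(sensor, ptr);  // lanes referencing this instance
        result[active] = ptr->decode(input, active);
    }
    return result;
}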
The ENOKI_CALL_SUPPORT_METHOD macro is required to support the above syntax. This generates the Enoki support layer that intercepts and carries out the function call:
class Sensor {
public:
    // Scalar version
    virtual float decode(float input) = 0;

    // Vector version with optional mask argument
    virtual FloatP decode(FloatP input, MaskP mask) = 0;

    /// Return sensor's serial number
    virtual uint32_t serial_number() = 0;
};

ENOKI_CALL_SUPPORT_BEGIN(Sensor)
ENOKI_CALL_SUPPORT_METHOD(decode)
ENOKI_CALL_SUPPORT_METHOD(serial_number)
/// .. potentially other methods ..
ENOKI_CALL_SUPPORT_END(Sensor)
The macro supports functions taking an arbitrary number of arguments but assumes that results are provided to the caller via the return value only (i.e. no writing to arguments passed by reference). The mask, if present, must be the last argument of the function.
Here is a hypothetical implementation of the Sensor interface:

class MySensor : public Sensor {
public:
    /// Vector version
    virtual FloatP decode(FloatP input, MaskP active) override {
        /// Keep track of invalid samples
        n_invalid += count(isnan(input) && active);

        /* Transform e.g. from log domain. */
        return log(input);
    }

    /// Return sensor's serial number
    uint32_t serial_number() override { return 363436u; }

    // ...

    size_t n_invalid = 0;
};
With this interface, the following vectorized expressions are now valid:
SensorP sensor = ...;
FloatP data = ...;
/* Unmasked version */
data = sensor->decode(data);
/* Masked version */
auto mask = sensor->serial_number() > 1000;
data = sensor->decode(data, mask);
Note how both functions with scalar and vector return values are vectorized automatically.
The implementation of vector method calls depends on the array type and hardware capabilities; for CPU arrays, for instance, the vpextractq instruction is used to efficiently extract the unique set of instance pointers.

The above way of vectorizing a scalar getter function may involve multiple virtual method calls, which is not particularly efficient when the invoked function is very simple (e.g. a getter). Enoki provides an alternative macro ENOKI_CALL_SUPPORT_GETTER that turns any such attribute lookup into a gather operation. The macro takes the getter name and field name as arguments. The macro ENOKI_CALL_SUPPORT_FRIEND is needed if the field in question is a private member.
class Sensor {
    ENOKI_CALL_SUPPORT_FRIEND()
public:
    /// ...

    /// Return sensor's serial number
    uint32_t serial_number() { return m_serial_number; }
private:
    uint32_t m_serial_number;
};

ENOKI_CALL_SUPPORT_BEGIN(Sensor)
ENOKI_CALL_SUPPORT_GETTER(serial_number, m_serial_number)
ENOKI_CALL_SUPPORT_END(Sensor)
The usage is identical to before, i.e.:
using UInt32P = Packet<uint32_t, 8>;
SensorP sensor = ...;
UInt32P serial = sensor->serial_number();
Note that this trick even works for GPU arrays! In this case, the GPU will directly fetch the value of the m_serial_number field from the CPU via shared memory. However, this only works when the Sensor instance has been allocated in a host-pinned address space that is reachable from the GPU. To ensure this, add the ENOKI_PINNED_OPERATOR_NEW annotation, which overrides the new and delete operators so that this is always the case for Sensor instances.
class Sensor {
    ENOKI_CALL_SUPPORT_FRIEND()
    ENOKI_PINNED_OPERATOR_NEW(UInt32P)
public:
    // ...
};
The previous sections introduced Enoki arrays as a tool for designing vectorized algorithms that are capable of processing multiple inputs at the same time. However, in many cases, vectorizing an algorithm will also require a corresponding change to the data structures that underlie it. This section demonstrates how C++ templates provide a natural framework for achieving this goal while satisfying the desiderata of Enoki (readability, portability, no code duplication, etc.).
We will focus on a simple example data structure that represents a position record acquired by a GPS tracker along with auxiliary information (time, and a flag stating whether the data is considered reliable). The examples will refer to CPU SIMD-style vectorization, but everything in this section also applies to other kinds of Enoki arrays (GPU arrays, differentiable arrays).
using Vector2f = Array<float, 2>;

/// Simple 2D GPS record tagged with auxiliary information
struct GPSCoord2f {
    /// UNIX time when the data point was acquired (seconds since Jan 1970)
    uint64_t time;

    /// Latitude, longitude as a 2D vector
    Vector2f pos;

    /// Is the data point reliable (e.g. enough GPS satellites in sensor's field of view)
    bool reliable;
};
Next, consider the following function, which computes the distance between two GPS coordinates using the haversine formula. When either of the two positions is deemed unreliable, it returns a NaN value to inform the caller about this.
/// Calculate the distance in kilometers between 'r1' and 'r2' using the haversine formula
float distance(const GPSCoord2f &r1, const GPSCoord2f &r2) {
    const float deg_to_rad = (float) (M_PI / 180.0);

    if (!r1.reliable || !r2.reliable)
        return std::numeric_limits<float>::quiet_NaN();

    /* Enoki's sin() is applied elementwise to the 2D vector */
    Vector2f sin_diff_h = sin(deg_to_rad * .5f * (r2.pos - r1.pos));
    sin_diff_h *= sin_diff_h;

    float a = sin_diff_h.x() + sin_diff_h.y() *
              std::cos(r1.pos.x() * deg_to_rad) *
              std::cos(r2.pos.x() * deg_to_rad);

    return 6371.f * 2.f * std::atan2(std::sqrt(a), std::sqrt(1.f - a));
}
Suppose that we would like to add a second, vectorized version of this function, which works with arrays of GPS coordinates. Instead of defining yet another data structure for coordinate arrays in addition to the existing GPSCoord2f, our approach will be to rely on a single template data structure that subsumes both cases. It is parameterized by the type of a GPS position component (e.g. latitude) named Value.
template <typename Value> struct GPSCoord2 {
    using Vector2 = Array<Value, 2>;
    using UInt64  = uint64_array_t<Value>;
    using Mask    = mask_t<Value>;

    UInt64 time;
    Vector2 pos;
    Mask reliable;
};
The using declarations at the beginning require an explanation: they involve the type traits enoki::uint64_array_t and enoki::mask_t, which “compute” the type of an Enoki array that has the same configuration as their input parameter, but with uint64_t- and bool-valued entries, respectively. Both are specializations of the more general enoki::replace_scalar_t trait that works for any type.
With these declarations, we can now create a packet type GPSCoord2fP that stores 16 GPS positions in a convenient SoA representation.
using FloatP = Packet<float, 16>;
using GPSCoord2fP = GPSCoord2<FloatP>;
An important aspect of the type calculations mentioned above is that they also generalize to non-array arguments. In particular, uint64_array_t<float> and mask_t<float> simply turn into uint64_t and bool, respectively, hence the type alias

using GPSCoord2f = GPSCoord2<float>;

perfectly reproduces the original (scalar) GPS record definition. Having defined the GPS record type, it is time to update the function definition as well. Once more, we will rely on C++ templates to do so.
The new distance function shown below is similarly templated with respect to the Value type, and it works for both scalar and vector arguments.
/// Calculate the distance in kilometers between 'r1' and 'r2' using the haversine formula
template <typename Value>
Value distance(const GPSCoord2<Value> &r1, const GPSCoord2<Value> &r2) {
    using Scalar = scalar_t<Value>;
    const Value deg_to_rad = Scalar(M_PI / 180.0);

    auto sin_diff_h = sin(deg_to_rad * .5f * (r2.pos - r1.pos));
    sin_diff_h *= sin_diff_h;

    Value a = sin_diff_h.x() + sin_diff_h.y() *
              cos(r1.pos.x() * deg_to_rad) *
              cos(r2.pos.x() * deg_to_rad);

    return select(
        r1.reliable && r2.reliable,
        (6371.f * 2.f) * atan2(sqrt(a), sqrt(1.f - a)),
        std::numeric_limits<Scalar>::quiet_NaN()
    );
}
Note how the overall structure is preserved. There are three noteworthy changes:

Control flow such as if statements must be replaced by branchless code involving masks (see the enoki::select() statement at the end of the function). Separate array entries may undergo a different control flow, which is not possible with standard C++ language constructs, hence the need for masks.
If desired, the early-out optimization from the previous snippet can be preserved for the special case that all records are unreliable:

if (none(r1.reliable && r2.reliable))
    return std::numeric_limits<Scalar>::quiet_NaN();
Standard mathematical functions such as std::cos are replaced by their Enoki equivalents, which generalize to both array and non-array arguments.

The enoki::scalar_t type alias at the beginning of the function is used to extract the elementary arithmetic type underlying an Enoki array—this results in the type float in our example, which is used to cast a constant to the right precision.
It is sometimes useful to be able to work with higher precision. Our templated distance function can nicely accommodate this need simply by switching to the following types:
using GPSCoord2d = GPSCoord2<double>;
using DoubleP = Packet<double, 16>;
using GPSCoord2dP = GPSCoord2<DoubleP>;
The distance function requires no changes. When working with double precision GPS records, the deg_to_rad constant automatically adapts to the higher precision due to the cast to the Scalar type.
Arrays and nested arrays facilitate the development of vectorized code that processes multiple values at once. However, it can be awkward to work with small fixed packet sizes when the underlying application must process millions or billions of data points. The remainder of this document discusses infrastructure that can be used to realize computations on the CPU involving dynamically allocated arrays of arbitrary length.
One of the core ingredients is enoki::DynamicArray, which is a smart pointer that manages the lifetime of a dynamically allocated memory region. It is the exclusive owner of this data and is also responsible for its destruction when the dynamic array goes out of scope (similar to std::unique_ptr). Dynamic arrays can be used to realize arithmetic involving data that is much larger than the maximum SIMD width supported by the underlying hardware.
Note that its implementation is contained in a separate header file that must be included:
#include <enoki/dynamic.h>
The class requires a single template argument, which can be any kind of enoki::Array. This is the packet type that will be used to realize vectorized computations involving the array contents. The following code snippet illustrates the creation of a dynamic floating point array that vectorizes using 4-wide SSE arithmetic.
/* Static float array (the suffix "P" indicates that this is a fixed-size packet) */
using FloatP = Packet<float, 4>;
/* Dynamic float array (vectorized via FloatP, the suffix "X" indicates arbitrary length) */
using FloatX = DynamicArray<FloatP>;
Note

In contrast to all array types discussed so far, an enoki::DynamicArray instance should not be part of an arithmetic expression. For instance, the following will compile and yield the expected result, but this style of using dynamic arrays is discouraged.
FloatX in1 = ... , in2 = ... , in3 = ... , in4 = ...;
/* Add the dynamic arrays using operator+() */
FloatX out = in1 + in2 + in3 + in4;
Why that is the case requires a longer explanation of the design of this library.
At a high level, there are two “standard” ways of implementing arithmetic for dynamically allocated memory regions.
1. A common approach (used e.g. by most std::valarray implementations) is to eagerly evaluate partial expressions, creating a large number of temporaries in the process. This is unsatisfactory since the amount of computation is very small compared to the resulting memory traffic.
2. The second is a technique named expression templates that is used in libraries such as Eigen. Expression templates construct complex graphs describing the inputs and operations of a mathematical expression using C++ templates. The underlying motivation is to avoid numerous memory allocations for temporaries by postponing evaluation until the point where the expression template is assigned to a storage container.
Unfortunately, pushed to the scale of entire programs, this approach tends to produce intermediate code with an extremely large number of common subexpressions that exceeds the capabilities of the common subexpression elimination (CSE) stage of current compilers. The first version of Enoki in fact used expression templates, and it was due to the difficulties with them that an alternative was developed.
The key idea of vectorizing over dynamic Enoki arrays is to iterate over packets (i.e. static arrays) that represent a sliding window into the dynamic array's contents. Packets, in turn, are easily supported using the tools discussed in the previous sections. Enoki provides a powerful operation named enoki::vectorize(), discussed later, that implements this sliding window technique automatically.
That said, for convenience, arithmetic operations like operator+ are implemented for dynamic arrays, and they are realized using approach 1 of the above list (i.e. with copious amounts of memory allocation for temporaries). Using them in performance-critical code is inadvisable.
When allocating dynamic arrays, the underlying memory region is always fully aligned according to the requirements of the packet type. Enoki may sometimes allocate a partially used packet at the end, which eliminates the need for special end-of-array handling. The following code snippet allocates an array of size 5 using 4-wide packets, which means that 3 entries at the end are unused.
/* Creates a dynamic array that is initially empty */
FloatX x;
/* Allocate memory for at least 5 entries */
set_slices(x, 5);
/* Query the size (a.k.a number of "slices") of the dynamic array */
size_t slice_count = slices(x);
assert(slice_count == 5);
/* Query the number of packets */
size_t packet_count = packets(x);
assert(packet_count == 2);
A few convenience initialization methods also exist:
/* Efficient way to create an array filled with zero entries */
x = zero<FloatX>(size);
/* .. or an uninitialized array */
x = empty<FloatX>(size);
/* Initialize entries with index sequence 0, 1, 2, ... */
x = arange<FloatX>(size);
/* Initialize entries with a linearly increasing sequence with endpoints 0 and 1 */
x = linspace<FloatX>(0.f, 1.f, size);
The previous section used the example of a GPS record to show how Enoki can create packet versions of a type. The same approach also generalizes to dynamic arrays, allowing an arbitrarily long sequence of records to be represented. This requires two small additions to the original type declaration:
template <typename Value> struct GPSCoord2 {
    using Vector2 = Array<Value, 2>;
    using UInt64  = uint64_array_t<Value>;
    using Bool    = bool_array_t<Value>;

    UInt64 time;
    Vector2 pos;
    Bool reliable;

    ENOKI_STRUCT(GPSCoord2,          /* <- name of this class */
                 time, pos, reliable /* <- list of all attributes in layout order */)
};

ENOKI_STRUCT_SUPPORT(GPSCoord2, time, pos, reliable)
These two additions do the following: the ENOKI_STRUCT macro generates boilerplate code (constructors, assignment operators, etc.) from the supplied list of attributes, and the ENOKI_STRUCT_SUPPORT macro registers GPSCoord2 for the purposes of dynamic vectorization. It is possible but fairly tedious to write these declarations by hand, hence the code generation macros should generally be used.
With these declarations, we can now allocate a dynamic array of 1000 coordinates that will be processed in packets of 4 (or more, depending on the definition of FloatP):
using GPSCoord2fX = GPSCoord2<FloatX>;
GPSCoord2fX coord;
set_slices(coord, 1000);
In memory, this data is arranged in a Structure of Arrays (SoA) organization: each field references a dynamic array that contiguously stores its contents.
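Schematically, for the 1000 slices allocated above (a sketch of the layout just described; the trailing padding stems from the packet-granular allocation discussed earlier):

time:     [ t0   t1   t2   ...  t999  (padding) ]   contiguous uint64 storage
pos.x:    [ x0   x1   x2   ...  x999  (padding) ]   contiguous float storage
pos.y:    [ y0   y1   y2   ...  y999  (padding) ]   contiguous float storage
reliable: [ r0   r1   r2   ...  r999  (padding) ]   contiguous mask storage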
The enoki::packet() function can be used to create a reference to the \(i\)-th packet of a dynamic array or a custom dynamic data structure. For instance, the following code iterates over all packets and resets their time values:
/* Reset the time value of all records */
for (size_t i = 0; i < packets(coord); ++i) {
    GPSCoord2<FloatP&> ref = packet(coord, i);
    ref.time = 0;
}
The packet() function is interesting because it returns an instance of a new type GPSCoord2<FloatP&> that was not discussed yet (note the ampersand in the template argument). Instead of directly storing data, all fields of a GPSCoord2<FloatP&> are references pointing to packets of data elsewhere in memory. Assigning (writing) to a field of this structure of references will thus change the corresponding entry of the dynamic array.
References can also be cast into their associated packet types and vice versa:
/* Read a GPSCoord2<FloatP&> and convert to GPSCoord2<FloatP> */
GPSCoord2fP cp = packet(coord, i);

/* Assign a GPSCoord2<FloatP> to a GPSCoord2<FloatP&> */
packet(coord, i + 1) = cp;
Note

For non-nested dynamic arrays such as FloatX = DynamicArray<FloatP>, packet() simply returns a reference to the selected FloatP entry in that array of packets. We generally encourage using universal references (auto &&) to hold the result of packet() so that both cases are handled in the same way:

auto ref   = packet(coord, i);   // Only works for dynamic structures
auto &ref  = packet(numbers, i); // Only works for non-nested arrays
auto &&ref = packet(coord, i);   // Works for both
Enoki provides a second way of indexing into dynamic arrays: the enoki::slice() function creates a reference to the \(i\)-th slice of a dynamic array or a custom dynamic data structure. Elements of a slice store references to scalar elements representing a vertical slice through the data structure.
The following code iterates over all slices and initializes the time values to an increasing sequence:
/* Set the i-th time value to 'i' */
for (size_t i = 0; i < slices(coord); ++i) {
    auto ref = slice(coord, i);
    ref.time = i;
}
Here, the enoki::slice() function returns an instance of a new type GPSCoord2<float&> (again, note the ampersand), whose fields reference the scalar entries of the selected slice.
Slice reference types can also be cast into their associated scalar data types and vice versa:

/* Read a GPSCoord2<float&> and convert to GPSCoord2<float> */
GPSCoord2f c = slice(coord, n);

/* Assign a GPSCoord2<float> to a GPSCoord2<float&> */
slice(coord, n + 1) = c;
Now suppose that we’d like to compute the pairwise distance between records organized in two dynamically allocated lists. Direct application of the discussed ingredients leads to the following overall structure:
GPSCoord2fX coord1;
GPSCoord2fX coord2;
FloatX result;

// Allocate memory and fill input arrays with contents (e.g. using slice(...))
...

// Call SIMD-vectorized function for each packet
for (size_t i = 0; i < packets(coord1); ++i)
    packet(result, i) = distance(packet(coord1, i),
                                 packet(coord2, i));
This does not quite compile (yet)—a minor modification of the distance() function is required:
/// Calculate the distance in kilometers between 'r1' and 'r2' using the haversine formula
template <typename Value_, typename Value = expr_t<Value_>>
Value distance(const GPSCoord2<Value_> &r1, const GPSCoord2<Value_> &r2) {
    using Scalar = scalar_t<Value>;
    const Value deg_to_rad = Scalar(M_PI / 180.0);

    auto sin_diff_h = sin(deg_to_rad * .5f * (r2.pos - r1.pos));
    sin_diff_h *= sin_diff_h;

    Value a = sin_diff_h.x() + sin_diff_h.y() *
              cos(r1.pos.x() * deg_to_rad) *
              cos(r2.pos.x() * deg_to_rad);

    return select(
        r1.reliable & r2.reliable,
        Scalar(6371.f * 2.f) * atan2(sqrt(a), sqrt(1.f - a)),
        std::numeric_limits<Scalar>::quiet_NaN()
    );
}
The modified version above uses the enoki::expr_t type trait to determine a suitable type that is able to hold the result of an expression involving its argument (which turns FloatP& into FloatP in this case).
Note

The issue with the original code was that it was called with a GPSCoord2<FloatP&> instance, i.e. with a template parameter Value = FloatP&. However, the Value type is also used for the return value as well as various intermediate computations, which is illegal since these temporaries are not associated with an address in memory.
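The following compile-time check summarizes the transformation performed by the trait (based on the behavior stated above):

#include <enoki/array.h>
#include <type_traits>

using namespace enoki;
using FloatP = Packet<float, 4>;

// expr_t strips the reference, yielding a type that can hold temporaries
static_assert(std::is_same_v<expr_t<FloatP&>, FloatP>, "expr_t strips references");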
With these modifications, we are now finally able to vectorize over the dynamic array:
// Call SIMD-vectorized function for each packet -- yay!
for (size_t i = 0; i < packets(coord1); ++i)
    packet(result, i) = distance(packet(coord1, i),
                                 packet(coord2, i));
Extracting individual packets as shown in the snippet above can become fairly tedious when a function takes many arguments. Enoki offers a convenient helper function named enoki::vectorize() that automates this process. It takes a function and a number of dynamic arrays as input and calls the function once for each set of input packets.
FloatX result = vectorize(
distance<FloatP>, // Function to call
coord1, // Input argument 1
coord2 // Input argument 2
// ...
);
Here, the returned float packets are stored in a dynamic array of type FloatX.
When the output array is already allocated, it is also possible to write the results directly into the array. The snippet below shows how to do this by calling enoki::vectorize() with a lambda function.
vectorize(
    [](auto &&result, auto &&coord1, auto &&coord2) {
        result = distance<FloatP>(coord1, coord2);
    },
    result,
    coord1,
    coord2
);
Note the use of a generic lambda with auto&& arguments: it would be redundant to specify the argument types since they are automatically inferred from the function inputs.
Naturally, we could also perform the complete calculation within the lambda function:
vectorize(
    [](auto &&result, auto &&coord1, auto &&coord2) {
        using Value  = FloatP;
        using Scalar = float;

        const Value deg_to_rad = Scalar(M_PI / 180.0);

        auto sin_diff_h = sin(deg_to_rad * .5f * (coord2.pos - coord1.pos));
        sin_diff_h *= sin_diff_h;

        Value a = sin_diff_h.x() + sin_diff_h.y() *
                  cos(coord1.pos.x() * deg_to_rad) *
                  cos(coord2.pos.x() * deg_to_rad);

        result = select(
            coord1.reliable & coord2.reliable,
            (6371.f * 2.f) * atan2(sqrt(a), sqrt(1.f - a)),
            std::numeric_limits<Scalar>::quiet_NaN()
        );
    },
    result,
    coord1,
    coord2
);
It is not necessary to “route” all parameters through enoki::vectorize(). Auxiliary data structures or constants are easily accessible via the lambda capture object using the standard [&] notation.
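For instance, an auxiliary constant can simply be captured by reference (a hypothetical variation of the previous example):

float scale = 0.5f; // auxiliary constant; no need to route it through vectorize()

vectorize(
    [&](auto &&result, auto &&coord1, auto &&coord2) {
        result = distance<FloatP>(coord1, coord2) * scale;
    },
    result, coord1, coord2
);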
We now turn to the results of a microbenchmark which runs the previously discussed GPS record distance function on a dynamic array with 10 million entries.
/* Compilation flags:
   $ clang++ benchmark.cpp -o benchmark -std=c++14 -I include -O3
     -march=native -fomit-frame-pointer -fno-stack-protector -DNDEBUG
*/

#include <enoki/array.h>
#include <enoki/random.h>
#include <chrono>
#include <iostream>
#include <limits>

using namespace enoki;

auto clk() { return std::chrono::high_resolution_clock::now(); }

template <typename T> float clkdiff(T a, T b) {
    return std::chrono::duration<float>(b - a).count() * 1000;
}

template <typename Value> struct GPSCoord2 {
    using Vector2 = Array<Value, 2>;
    using UInt64  = uint64_array_t<Value>;
    using Bool    = mask_t<Value>;

    UInt64 time;
    Vector2 pos;
    Bool reliable;

    ENOKI_STRUCT(GPSCoord2, time, pos, reliable)
};

ENOKI_STRUCT_SUPPORT(GPSCoord2, time, pos, reliable)

/* SIMD_WIDTH is supplied externally, e.g. via -DSIMD_WIDTH=16 */
using FloatP = Packet<float, SIMD_WIDTH>;
using FloatX = DynamicArray<FloatP>;

using GPSCoord2fX = GPSCoord2<FloatX>;
using GPSCoord2fP = GPSCoord2<FloatP>;
using GPSCoord2f  = GPSCoord2<float>;

using RNG = PCG32<FloatP>;

/// Calculate the distance in kilometers between 'r1' and 'r2' using the haversine formula
template <typename Value_, typename Value = expr_t<Value_>>
ENOKI_INLINE Value distance(const GPSCoord2<Value_> &r1, const GPSCoord2<Value_> &r2) {
    using Scalar = scalar_t<Value>;
    const Value deg_to_rad = Scalar(M_PI / 180.0);

    auto sin_diff_h = sin(deg_to_rad * .5f * (r2.pos - r1.pos));
    sin_diff_h *= sin_diff_h;

    Value a = sin_diff_h.x() + sin_diff_h.y() *
              cos(r1.pos.x() * deg_to_rad) *
              cos(r2.pos.x() * deg_to_rad);

    return select(
        r1.reliable & r2.reliable,
        (6371.f * 2.f) * atan2(sqrt(a), sqrt(1.f - a)),
        Value(std::numeric_limits<Scalar>::quiet_NaN())
    );
}

int main(int argc, char *argv[]) {
    for (int i = 0; i < 3; ++i) {
        GPSCoord2fX coord1;
        GPSCoord2fX coord2;
        FloatX result;

        auto clk0 = clk();

        size_t size = 10000000;
        set_slices(coord1, size);
        set_slices(coord2, size);
        set_slices(result, size);

        auto clk1 = clk();

        RNG rng;
        for (size_t j = 0; j < packets(coord1); ++j) {
            packet(coord1, j) = GPSCoord2fP {
                0,
                { rng.next_float32() * 180.f - 90.f,
                  rng.next_float32() * 360.f - 180.f },
                true
            };
            packet(coord2, j) = GPSCoord2fP {
                0,
                { rng.next_float32() * 180.f - 90.f,
                  rng.next_float32() * 360.f - 180.f },
                true
            };
        }

        auto clk2 = clk();

        vectorize([](auto &&result, auto &&coord1, auto &&coord2) {
                      result = distance<FloatP>(coord1, coord2);
                  },
                  result, coord1, coord2);

        auto clk3 = clk();

        std::cout << clkdiff(clk2, clk3) << " (alloc = " << clkdiff(clk0, clk1)
                  << ", fill = " << clkdiff(clk1, clk2) << ")" << std::endl;
    }

    return 0;
}
The plots show the measured speedup relative to a scalar baseline implementation. We consider two different microarchitectures:
The Knight’s Landing architecture provides hardware support for SIMD arithmetic on 16 single-precision floating point values. Interestingly, the best performance is reached when working with arrays of 32 entries, which can be interpreted as a type of loop unrolling. The ability to issue wide memory operations, perform branchless arithmetic using vector registers, and keep two independent instructions in flight for each arithmetic operation leads to a total speedup of 23.5x (i.e. considerably exceeding the expected maximum speedup of 16 from the vectorized instructions alone!).
Relative to the C math library, Enoki obtains an even larger speedup of 38.7x. Using the standard C math library on this platform is fairly expensive, presumably because of function call penalties on Xeon Phi (Enoki generally inlines functions), and because it is compiled for a generic x86_64 machine rather than the native architecture.
Platform details: clang trunk rev. 304711 on Linux 64 bit (RHEL 7.3)
The Skylake architecture provides hardware support for SIMD arithmetic on 8 single-precision floating point values. Significant speedups are observed for packets of 8 and 16 entries. It is likely that more involved functions (i.e. with a higher register pressure) will see a sharper performance drop after \(n=16\) due to the relatively small number of registers on this platform. Enoki single-precision transcendentals are only slightly faster than the standard C math library on this platform. The maximum speedup relative to the standard C math library is 10.0x.
Platform details: clang trunk rev. 304711 on macOS 10.12.5
This section discusses a number of advanced operations and ways of extending Enoki.
When debugging programs built on top of Enoki using LLDB or GDB, the stringified versions of arrays are needlessly verbose and reveal private implementation details. For instance, printing a simple statically sized 3D vector like Array<float, 3>(1, 2, 3) in LLDB yields
$0 = {
  enoki::StaticArrayImpl<float, 3, false, enoki::Array<float, 3>, int> = {
    enoki::StaticArrayImpl<float, 4, false, enoki::Array<float, 3>, int> = {
      m = (1, 2, 3, 0)
    }
  }
}
Dynamic arrays (e.g. FloatX(1, 2, 3)) are even worse, as the values are obscured behind a pointer:
$0 = {
  enoki::DynamicArrayImpl<enoki::Packet<float, 8>, enoki::DynamicArray<enoki::Packet<float, 8> > > = summary {
    m_packets = {
      __ptr_ = {
        std::__1::__compressed_pair_elem<enoki::Packet<float, 8> *, 0, false> = {
          __value_ = 0x0000000100300200
        }
      }
    }
    m_size = 1
    m_packets_allocated = 1
  }
}
To improve readability, this repository includes scripts that improve GDB and LLDB's understanding of Enoki types. With these scripts, both of the above turn into
$0 = [1, 2, 3]
To install it in LLDB, copy the file resources/enoki_lldb.py to ~/.lldb (creating the directory if not present) and then append the following line to the file ~/.lldbinit (again, creating it if not already present):

command script import ~/.lldb/enoki_lldb.py
To install it in GDB, copy the file resources/enoki_gdb.py
to ~/.gdb
(creating the directory, if not present) and then apppend the following two lines to
the file ~/.gdbinit
(again, creating it if not already present):
set print pretty
source ~/.gdb/enoki_gdb.py
A common design pattern in vectorized code involves compressing arrays, i.e. selectively writing only masked parts of an array so that the selected entries become densely packed in memory (e.g. to improve resource usage when only parts of an array participate in a computation).
The function enoki::compress()
efficiently maps this operation onto
the targeted hardware (SSE4.2, AVX2, and AVX512 implementations are provided).
The function also automatically advances the pointer by the number of written
entries.
Array<float, 16> input = ...;
auto mask = input < 0;
float output[16], *ptr = output;
size_t count = compress(ptr, input, mask);
std::cout << count << " entries were written." << std::endl;
...
Custom data structures such as the GPS record class discussed in previous
chapters are transparently supported by enoki::compress()
—in this
case, the mask applies to each vertical slice through the data structure as
illustrated in the following figure:
The slice_ptr()
function is used to acquire pointers to the
beginning of the output arrays (there is one for each field of the
data structure).
It returns a value of type GPSRecord2<float *>
, which holds these
pointers. The following snippet illustrates how an arbitrarily long list
of records can be compressed:
GPSCoord2fX input = /* .. input data to be compressed .. */;
/* Make sure there is enough space to store all data */
GPSCoord2fX output;
set_slices(output, slices(input));
/* Structure composed of pointers to the output arrays */
GPSRecord2<float *> ptr = slice_ptr(output, 0);
/* Counter used to keep track of the number of collected elements */
size_t final_size = 0;
/* Go through all packets, compress, and append */
for (size_t i = 0; i < packets(input); ++i) {
/* Let's filter out the records with input.reliable == false */
auto input_p = packet(input, i);
final_size += compress(ptr, input_p, input_p.reliable);
}
/* Now that the final number of slices is known, adjust the output array size */
set_slices(output, final_size);
Warning
The writes performed by enoki::compress()
are at the granularity
of entire packets, which means that some extra scratch space generally
needs to be allocated at the end of the output array.
For instance, even if it is known that a compression operation will find
exactly N
elements, you are required to reserve memory for N +
Packet::Size
elements to avoid undefined behavior.
Note that enoki::compress()
will never require more memory than
the input array, hence this provides a safe upper bound.
Enoki provides a powerful enoki::range()
iterator that enables for
loops with index vectors. The following somewhat contrived piece of code
computes \(\sum_{i=0}^{999}i^2\) using brute-force addition (but with only
\(\lceil 1000/16\rceil = 63\) loop iterations).
using Index = Packet<uint32_t, 16>;
Index result(0);
for (auto [index, mask]: range<Index>(0, 1000)) {
result += select(
mask,
index * index,
Index(0)
);
}
assert(hsum(result) == 332833500);
The mask is necessary to communicate the fact that the last loop iteration has several disabled entries. An extended version of the range statement enables iteration over multi-dimensional grids.
using Index3 = Array<Packet<uint32_t, 8>, 3>;
for (auto [index, mask] : range<Index3>(4u, 5u, 6u))
std::cout << index << std::endl;
/* Output:
[[0, 0, 0],
[1, 0, 0],
[2, 0, 0],
[3, 0, 0],
[0, 1, 0],
[1, 1, 0],
[2, 1, 0],
[3, 1, 0]]
....
*/
Here, it is assumed that the range along each dimension starts from zero.
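For instance, the following sketch visits every pixel of a hypothetical 64x64 image (the names are purely illustrative):
using Index2 = Array<Packet<uint32_t, 8>, 2>;
for (auto [xy, mask] : range<Index2>(64u, 64u)) {
    /* xy[0] holds the column index and xy[1] the row index;
       'mask' disables padding entries in the final iteration */
}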
Enoki’s scatter()
and gather()
functions can transparently
convert between SoA and AoS representations. This case is triggered when the
supplied index array has a smaller nesting level than that of the data array.
For instance, consider the following example:
/* Packet types */
using FloatP = Packet<float, 4>;
using UInt32P = Packet<uint32_t, 4>;
/* SoA and AoS 3x3 matrix types */
using Matrix3f = Matrix<float, 3>;
using Matrix3fP = Matrix<FloatP, 3>;
Matrix3f *data = ...;
UInt32P indices(5, 1, 3, 0);
/* Gather AoS matrix data into SoA representation */
Matrix3fP mat = gather<Matrix3fP>(data, indices);
/* Modify SoA matrix data and scatter back into AoS-based 'data' array */
mat *= 2.f;
scatter(data, mat, indices);
The same syntax also works for prefetch()
, which is convenient to
ensure that a given set of memory addresses are in cache (preferably a few
hundred cycles before the actual usage).
/* Prefetch into L2 cache for write access */
prefetch<Matrix3fP, /* Write = */ true, /* Level = */ 2>(data, indices);
Enoki is smart enough to realize that the non-vectorized form of gather and scatter statements can be reduced to standard loads and stores, e.g.
using Matrix3f = Matrix<float, 3>;
float *ptr = ...;
int32_t index = ...;
/* The statement below reduces to load<Matrix3f>(ptr + index); */
Matrix3f mat = gather<Matrix3f>(ptr, index);
Scatter and gather operations are also permitted for dynamic arrays, e.g.:
using FloatX = DynamicArray<FloatP>;
using UInt32X = DynamicArray<UInt32P>;
using Matrix3fX = Matrix<FloatX, 3>;
Matrix3f *data = ...;
UInt32X indices = ...;
/* Gather AoS matrix data into SoA representation */
Matrix3fX mat = gather<Matrix3fX>(data, indices);
Integer division is a surprisingly expensive operation on current processor architectures: for instance, the Knight’s Landing architecture requires up to a whopping 108 cycles (95 cycles on Skylake) to perform a single 64-bit signed integer division with remainder. The hardware unit implementing the division cannot accept any new inputs until it is done with the current one (in other words, it is not pipelined, in contrast to most other operations). Given the challenges of efficiently realizing integer division in hardware, current processors don’t even provide a vector instruction to perform multiple divisions at once.
Although Enoki can’t do anything clever to provide an efficient array division instruction given these constraints, it does provide a highly efficient division operation for a special case that is often applicable: dividing by an integer constant. The following snippet falls under this special case because all array entries are divided by the same constant, which is furthermore known at compile time.
using UInt32 = enoki::Array<uint32_t, 8>;
UInt32 div_43(UInt32 a) {
return a / 43;
}
This generates the following AVX2 assembly code (with comments):
_div_43:
; Load magic constant into 'ymm1'
vpbroadcastd ymm1, dword ptr [rip + LCPI0_0]
; Compute high part of 64 bit multiplication with 'ymm1'
vpmuludq ymm2, ymm1, ymm0
vpsrlq ymm2, ymm2, 32
vpsrlq ymm3, ymm0, 32
vpmuludq ymm1, ymm1, ymm3
vpblendd ymm1, ymm2, ymm1, 170
; Correction & shift
vpsubd ymm0, ymm0, ymm1
vpsrld ymm0, ymm0, 1
vpaddd ymm0, ymm0, ymm1
vpsrld ymm0, ymm0, 5
ret
We’ve effectively turned the division into a sequence of 2 multiplies, 4 shifts, and 2 additions/subtractions. Needless to say, this is going to be much faster than a sequence of high-latency/low-throughput scalar divisions.
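The underlying identity (due to Granlund and Montgomery, and also used by libdivide) replaces division by a constant \(d\) with multiplication by a precomputed magic number \(m\approx 2^k/d\) followed by a shift, exploiting \(\lfloor n/d\rfloor=\lfloor m\,n/2^k\rfloor\) for suitably chosen \(m\) and \(k\); the subtract/shift/add correction in the listing above handles the case where \(m\) would require 33 bits of precision.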
In cases where the constant is not known at compile time, an
enoki::divisor
instance can be precomputed and efficiently applied
using enoki::divisor::operator()
, as shown in the following example:
using Int32P = enoki::Packet<int32_t, 8>;
void divide(Int32P *a, int32_t b, size_t n) {
/* Precompute magic constants */
divisor<int32_t> prec_div = b;
/* Now apply the precomputed division efficiently */
for (size_t i = 0; i < n; ++i)
a[i] = prec_div(a[i]);
}
The following plots show the speedup compared to scalar division when dividing 100 million integer packets of size 16 by a compile-time constant. As can be seen, the difference is fairly significant on consumer processors (up to 13.2x on Skylake) and huge on the simple cores found on a Xeon Phi (up to 61.2x on Knight’s Landing).
Enoki’s implementation of division by constants is based on the excellent libdivide library.
Note
As can be seen, unsigned division is generally cheaper than signed division, and 32-bit division is considerably cheaper than 64-bit division. The reason for the latter is that the 64-bit high multiplication instruction required by the algorithm does not exist and must be emulated.
Warning
Enoki’s precomputed integer division operator does not support divisors equal to \(\pm 1\) (all other values are permissible). This is an inherent limitation of the magic number & shift-based algorithm used internally, which simply cannot represent these divisors. Enoki will throw an exception when a divisor equal to \(\pm 1\) is detected in an application compiled in debug mode.
The function reinterpret_array()
can be used to reinterpret the
bit-level representation as a different type when both source and target types
have matching sizes and layouts.
using UInt32P = Packet<uint32_t, 4>;
using FloatP = Packet<float, 4>;
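/* 0x3f800000 is the IEEE 754 single-precision bit pattern of 1.0f */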
UInt32P source = 0x3f800000;
FloatP target = reinterpret_array<FloatP>(source);
// Prints: [1, 1, 1, 1]
std::cout << target << std::endl;
Consider vectorizing a function that increments the bins of a histogram given
an array of bin indices. It is impossible to do this kind of indirect update
using a normal pair of gather and scatter operations, since incorrect
updates occur whenever the indices
array contains an index multiple times:
using FloatP = Packet<float, 16>;
using IndexP = Packet<int32_t, 16>;
float hist[1000] = { 0.f }; /* Histogram entries */
IndexP indices = /* .. bin indices whose value should be increased .. */;
/* Ooops, don't do this. Some entries may have to be incremented multiple times.. */
scatter(hist, gather<FloatP>(hist, indices) + 1, indices);
Enoki provides a function named enoki::transform()
, which modifies an
indirect memory location in a way that is not susceptible to conflicts. The
function takes an arbitrary function as parameter and applies it to the
specified memory location, which allows this approach to generalize to
situations other than just building histograms.
/* Unmasked version */
transform<FloatP>(hist, indices, [](auto& x, auto& /* mask */) { x += 1; });
/* Masked version */
transform<FloatP>(hist, indices, [](auto& x, auto& /* mask */) { x += 1; }, mask);
A mask argument is always provided to the lambda function. This is useful, e.g. to mask memory operations or function calls performed inside it. Note that any extra arguments will also be passed to the lambda function:
FloatP amount = ...;
/* Unmasked version */
transform<FloatP>(hist, indices,
[](auto& x, auto&y, auto & /* mask */) { x += y; },
amount);
/* Masked version */
transform<FloatP>(hist, indices,
[](auto& x, auto&y, auto& /* mask */) { x += y; },
amount, mask);
Internally, enoki::transform()
detects and processes conflicts using
the AVX512CDI instruction set. When conflicts are present, the function
provided as an argument may be invoked multiple times in a row. When AVX512CDI
is not available, a slower scalar fallback implementation is used.
Since the above use pattern—adding values to an array—is so common, a special short-hand notation exists:
FloatP amount = ...;
/* Unmasked version */
scatter_add(hist, amount, indices);
/* Masked version */
scatter_add(hist, amount, indices, mask);
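Putting the pieces together, a complete histogram kernel might look as follows. This is only a sketch: it assumes the samples lie in the interval [0, 1) and ignores the handling of a partial final packet.
using FloatP = Packet<float, 16>;
using IndexP = Packet<int32_t, 16>;
using FloatX = DynamicArray<FloatP>;
void histogram(const FloatX &samples, float *hist /* 1000 zero-initialized bins */) {
    for (size_t i = 0; i < packets(samples); ++i) {
        FloatP value = packet(samples, i);
        /* Map each sample to its bin index */
        IndexP index = floor2int<IndexP>(value * 1000.f);
        /* Conflict-free increment of the selected bins */
        scatter_add(hist, FloatP(1.f), index);
    }
}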
Enoki provides a mechanism for declaring custom array types using the
Curiously recurring template pattern. The
following snippet shows a declaration of a hypothetical type named Spectrum
representing a discretized color spectrum. Spectrum
generally behaves the
same way as Array
and supports all regular Enoki operations.
template <typename Value, size_t Size>
struct Spectrum : enoki::StaticArrayImpl<Value, Size, false,
Spectrum<Value, Size>> {
/// Base class
using Base = enoki::StaticArrayImpl<Value, Size, false,
Spectrum<Value, Size>>;
/// Helper alias used to implement type promotion rules
template <typename T> using ReplaceValue = Spectrum<T, Size>;
/// Mask type associated with this custom type
using MaskType = enoki::Mask<Value, Size>;
/// Import constructors, assignment operators, etc.
ENOKI_IMPORT_ARRAY(Base, Spectrum)
};
The main reason for declaring custom arrays is to tag (and preserve) the type
of arrays within expressions. For instance, the type of value2
in the
following snippet is Spectrum<float, 8>
rather than a generic
enoki::Array<...>
.
Spectrum<float, 8> value = { ... };
auto value2 = exp(-value);
Note
A further declaration is needed in a rare corner case. This only applies when a custom data structure (1.) contains fields of the custom array type that are constructed from the structure’s template parameters (3.), and masked assignments (2.) are performed on instances of that structure; the numbers refer to the example below.
An example:
// 1. Custom data structure
template <typename Value> struct PolarizedSpectrum {
// 3. Spectrum type constructed from template parameters
using Spectrum8 = Spectrum<Value, 8>;
Spectrum8 r_s;
Spectrum8 r_p;
ENOKI_STRUCT(PolarizedSpectrum, r_s, r_p)
};
ENOKI_STRUCT_SUPPORT(PolarizedSpectrum, r_s, r_p)
using PolarizedSpectrumfP = PolarizedSpectrum<FloatP>;
PolarizedSpectrumfP some_function(FloatP theta) {
PolarizedSpectrumfP result = ...;
// 2. Masked assignment
masked(result, theta < 0.f) = zero<PolarizedSpectrumfP>();
return result;
}
In this case, the following declaration is needed:
template <typename Value, size_t Size>
struct Spectrum<enoki::detail::MaskedArray<Value>, Size> : enoki::detail::MaskedArray<Spectrum<Value, Size>> {
using Base = enoki::detail::MaskedArray<Spectrum<Value, Size>>;
using Base::Base;
using Base::operator=;
Spectrum(const Base &b) : Base(b) { }
};
This partial overload has the purpose of propagating an internal Enoki
masking data structure to the top level (so that Spectrum<MaskedArray>
becomes MaskedArray<Spectrum>
).
In addition to mapping vector operations on the available instruction sets, Enoki’s abstractions hide a number of tedious platform-related details. This is a partial list:
The representation of masks is highly platform-dependent. For instance, the AVX512 back-end uses eight dedicated mask registers to store masks compactly (allocating only a single bit per mask entry).
Older machines use a redundant representation based on regular vector
registers that have all bits set to 1
for entries where the comparison
was true and 0
elsewhere.
Machines with AVX (but no AVX2) don’t have an 8-wide integer vector unit.
This means that an Array<float, 8>
can be represented using a single AVX
ymm
register, but casting it to an Array<int32_t, 8>
entails
switching to a pair of half width SSE4.2 xmm
integer registers, etc.
Vector instruction sets are generally fairly incomplete in the sense that they are missing many entries in the full data type / operation matrix. Enoki emulates such operations using other vector instructions whenever possible.
Various operations that work with 64 bit registers aren’t available when Enoki is compiled on a 32-bit platform and must be emulated.
Adding a new Enoki array type involves creating a new partial overload of the
StaticArrayImpl<>
template that derives from StaticArrayBase
. To
support the full feature set of Enoki, overloads must provide at least a set of
core methods shown below. The underscores in the function names indicate that
this is considered non-public API that should only be accessed indirectly via
the routing templates in enoki/enoki_router.h
.
The following core operations must be provided by every implementation.
Memory operations: store_, store_unaligned_, load_, load_unaligned_.
Arithmetic and bitwise operations: add_, sub_, mul_, mulhi_ (signed/unsigned high integer multiplication), div_, mod_, and_, or_, xor_.
Unary operations: neg_, not_.
Comparisons: ge_, gt_, lt_, le_, eq_, neq_.
Elementary mathematical functions: abs_, ceil_, floor_, max_, min_, round_, sqrt_.
Shifts: sl_, sr_.
Horizontal operations: all_, any_, hprod_, hsum_, hmax_, hmin_, count_.
Blending: select_.
High and low array halves: high_, low_.
Zero initialization: zero_.
The following operations all have default implementations in Enoki’s mathematical support library, hence overriding them is optional.
However, doing so may be worthwhile if efficient hardware-level support exists on the target platform.
Shuffles: shuffle_.
Compression: compress_.
Extraction: extract_.
Scatter/gather operations: scatter_, gather_.
Prefetching: prefetch_.
Fused multiply-add operations (implemented using add_/sub_ and mul_ by default): fmadd_, fmsub_, fnmadd_, fnmsub_, fmaddsub_, fmsubadd_.
Reciprocal and reciprocal square root (implemented using div_ and sqrt_ by default): rcp_, rsqrt_.
Dot product (implemented using mul_ and hsum_ by default): dot_.
Bit-level rotations: rol_, ror_.
Array-level rotations: rol_array_, ror_array_.
Warning
This section is a work in progress.
The function signatures provided in this section are simplified to aid readability. For instance, the signature of the array addition operator
hides the following facts:
The function supports mixed arguments with built-in broadcasting. For instance, the following is legal:
Array<float, 4> x = ...;
Array<Array<double, 8>, 4> y = ...;
auto z = x + y;
The entries of x
are broadcasted and promoted in the above example, and
the type of z
is thus Array<Array<double, 8>, 4>
. See the section on
broadcasting rules for details.
The operator uses SFINAE (std::enable_if
) so that it only becomes active
when x
or y
are Enoki arrays.
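A sketch of such a constrained declaration (the trait enoki::is_array_v is assumed here purely for illustration):
template <typename T1, typename T2,
          std::enable_if_t<enoki::is_array_v<T1> || enoki::is_array_v<T2>, int> = 0>
auto operator+(const T1 &x, const T2 &y);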
The operator replicates the C++ typing rules. Recall that adding two
float&
references in standard C++ produces a float
temporary (i.e.
not a reference) holding the result of the operation.
Analogously, z
is of type Array<float, 4>
in the example below.
Array<float&, 4> x = ...;
Array<float&, 4> y = ...;
auto z = x + y;
Writing out these type transformation rules in every function definition would make for tedious reading, hence the simplifications. If in doubt, please look at the source code of Enoki.
ENOKI_VERSION_MAJOR
¶Integer value denoting the major version of the Enoki release.
ENOKI_VERSION_MINOR
¶Integer value denoting the minor version of the Enoki release.
ENOKI_VERSION_PATCH
¶Integer value denoting the patch version of the Enoki release.
ENOKI_VERSION
¶Enoki version string (e.g. "0.1.2"
).
ENOKI_LIKELY
(condition)¶Signals that the branch is almost always taken, which can be used for improved code layout if supported by the compiler. An example is shown below:
if (ENOKI_LIKELY(x > 0)) {
/// ....
}
ENOKI_UNLIKELY
(condition)¶Signals that the branch is rarely taken, analogous to
ENOKI_LIKELY()
.
ENOKI_UNROLL
¶Cross-platform mechanism for asking the compiler to unroll a loop. The
macro should be placed before the for
statement.
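For instance:
ENOKI_UNROLL for (size_t i = 0; i < 4; ++i) {
    /* ... loop body replicated four times by the compiler ... */
}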
ENOKI_NOUNROLL
¶Cross-platform mechanism for asking the compiler to never unroll a loop,
analogous to ENOKI_UNROLL()
.
ENOKI_INLINE
¶Cross-platform mechanism for asking the compiler to always inline a function. The macro should be placed in front of the function declaration.
ENOKI_INLINE void foo() { ... }
ENOKI_NOINLINE
¶Cross-platform mechanism for asking the compiler to never inline a
function, analogous to ENOKI_INLINE()
.
has_avx512dq
¶Specifies whether AVX512DQ instructions are available on the target architecture.
has_avx512vl
¶Specifies whether AVX512VL instructions are available on the target architecture.
has_avx512bw
¶Specifies whether AVX512BW instructions are available on the target architecture.
has_avx512cd
¶Specifies whether AVX512CD instructions are available on the target architecture.
has_avx512pf
¶Specifies whether AVX512PF instructions are available on the target architecture.
has_avx512er
¶Specifies whether AVX512ER instructions are available on the target architecture.
has_avx512vpopcntdq
¶Specifies whether AVX512VPOPCNTDQ instructions are available on the target architecture.
has_avx512f
¶Specifies whether AVX512F instructions are available on the target architecture.
has_avx2
¶Specifies whether AVX2 instructions are available on the target architecture.
has_avx
¶Specifies whether AVX instructions are available on the target architecture.
has_fma
¶Specifies whether fused multiply-add instructions are available on the target architecture (ARM & x86).
has_f16c
¶Specifies whether F16C instructions are available on the target architecture.
has_sse42
¶Specifies whether SSE 4.2 instructions are available on the target architecture.
has_x86_32
¶Specifies whether the target architecture is x86, 32 bit.
has_x86_64
¶Specifies whether the target architecture is x86, 64 bit.
has_arm_neon
¶Specifies whether ARM NEON instructions are available on the target architecture.
has_arm_32
¶Specifies whether the target architecture is a 32-bit ARM processor (armv7).
has_arm_64
¶Specifies whether the target architecture is aarch64 (armv8+).
max_packet_size
¶Denotes the maximal packet size (in bytes) that can be mapped to native vector registers. It is equal to 64 if AVX512 is present, 32 if AVX is present, and 16 for machines with SSE 4.2 or ARM NEON.
array_default_size
¶Denotes the default size of Enoki arrays. Equal to max_packet_size / 4
.
template <typename Value, size_t Size = array_default_size>
Array
: StaticArrayImpl<Value, Size, Array<Value, Size>>¶The default Enoki array class – a generic container that stores a
fixed-size array of an arbitrary data type similar to the standard template
library class std::array
. The main distinction between the two is that
enoki::Array
forwards all arithmetic operations (and other
standard mathematical functions) to the contained elements.
It has several template parameters:
typename Value: the type of individual array entries.
size_t Size: the number of packed array entries.
This class is just a small wrapper that instantiates
enoki::StaticArrayImpl
using the Curiously Recurring Template
Pattern (CRTP). The latter provides the actual machinery that is needed to
evaluate array expressions. See Defining custom array types for details.
template <typename Value, size_t Size = array_default_size>
Packet
: StaticArrayImpl<Value, Size, Packet<Value, Size>>¶The Packet
type is identical to enoki::Array
except for
its broadcasting behavior.
template <typename Value, size_t Size, typename Derived>
StaticArrayImpl
¶This base class provides the core implementation of an Enoki array. It
cannot be instantiated directly and is used via the Curiously Recurring
Template Pattern (CRTP). See Array
and Defining custom array types
for details on how to create custom array types.
StaticArrayImpl
()¶Create an uninitialized array. Floating point arrays are initialized
using std::numeric_limits<Value>::quiet_NaN()
when the application
is compiled in debug mode.
template <typename... Args>
StaticArrayImpl
(Args... args)¶Initialize the individual array entries with args
(where
sizeof...(args) == Size
).
template <typename Value2, typename Derived2>
StaticArrayImpl
(const StaticArrayImpl<Value2, Size, Derived2> &other)¶Initialize the array with the contents of another given array that potentially has a different underlying type. Enoki will perform a vectorized type conversion if this is supported by the target processor.
size
() const¶Returns the size of the array.
operator[]
(size_t index) const¶Return a reference to an array element (const version). When the
application is compiled in debug mode, the function performs a range
check and throws std::out_of_range
in case of an out-of-range
access. This behavior can be disabled by defining
ENOKI_DISABLE_RANGE_CHECK
.
operator[]
(size_t index)¶Return a reference to an array element. When the application is
compiled in debug mode, the function performs a range check and throws
std::out_of_range
in case of an out-of-range access. This behavior
can be disabled by defining ENOKI_DISABLE_RANGE_CHECK
.
coeff
(size_t index) const¶Just like operator[]()
, but without the range check (const
version).
coeff
(size_t index)¶Just like operator[]()
, but without the range check.
alloc
(size_t size)¶Allocates size
bytes of memory that are sufficiently aligned so that
any Enoki array can be safely stored at the returned address.
Array
>load
(const void *mem, mask_t<Array> mask = true)¶Loads an array of type Array
from the memory address mem
(which is
assumed to be aligned on a multiple of alignof(Array)
bytes). No loads
are performed for entries whose mask bit is false
—instead, these
entries are initialized with zero.
Warning
Performing an aligned load from an unaligned memory address will cause a general protection fault that immediately terminates the application.
Note
When the mask
parameter is specified, the function implements a
masked load, which is fairly slow on machines without the AVX512
instruction set.
Array
>load_unaligned
(const void *mem, mask_t<Array> mask = true)¶Loads an array of type Array
from the memory address mem
(which is
not required to be aligned). No loads are performed for entries whose mask
bit is false
—instead, these entries are initialized with zero.
Note
When the mask
parameter is specified, the function implements a
masked load, which is fairly slow on machines without the AVX512
instruction set.
Array
>store
(const void *mem, Array array, mask_t<Array> mask = true)¶Stores an array of type Array
at the memory address mem
(which is
assumed to be aligned on a multiple of alignof(Array)
bytes). No stores
are performed for entries whose mask bit is false
.
Warning
Performing an aligned store to an unaligned memory address will cause a general protection fault that immediately terminates the application.
Note
When the mask
parameter is specified, the function implements a
masked store, which is fairly slow on machines without the AVX512
instruction set.
Array
>store_unaligned
(const void *mem, Array array, mask_t<Array> mask = true)¶Stores an array of type Array
at the memory address mem
(which is
not required to be aligned). No stores are performed for entries whose mask
bit is false
.
Note
When the mask
parameter is specified, the function implements a
masked store, which is fairly slow on machines without the AVX512
instruction set.
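A minimal usage sketch (assuming a packet size that matches the target’s native vector width):
using FloatP = Packet<float, 8>;
alignas(alignof(FloatP)) float data[8] = { /* ... */ };
FloatP value = load<FloatP>(data);   /* aligned load */
store_unaligned(data, value + 1.f);  /* store without alignment requirement */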
template <typename Array, size_t Stride = sizeof(scalar_t<Array>), typename Index>
gather
(const void *mem, Index index, mask_t<Array> mask = true)¶Loads an array of type Array
using a masked gather operation. This is
equivalent to the following scalar loop (which is mapped to efficient
hardware instructions if supported by the target hardware).
Array result;
for (size_t i = 0; i < Array::Size; ++i)
if (mask[i])
result[i] = ((Value *) mem)[index[i]];
else
result[i] = Value(0);
The index
parameter must be a 32 or 64 bit integer array having the
same number of entries. It will be interpreted as a signed array regardless
of whether the provided array is signed or unsigned.
The default value of the Stride
parameter indicates that the data at
mem
uses a packed memory layout (i.e. a stride value of
sizeof(Value)
); other values override this behavior.
template <size_t Stride = 0, typename Array, typename Index>
scatter
(const void *mem, Array array, Index index, mask_t<Array> mask = true)¶Stores an array of type Array
using a scatter operation. This is
equivalent to the following scalar loop (which is mapped to efficient
hardware instructions if supported by the target hardware).
for (size_t i = 0; i < Array::Size; ++i)
if (mask[i])
((Value *) mem)[index[i]] = array[i];
The index
parameter must be a 32 or 64 bit integer array having the
same number of entries. It will be interpreted as a signed array regardless
of whether the provided array is signed or unsigned.
The default value of the Stride
parameter indicates that the data at
mem
uses a packed memory layout (i.e. a stride value of
sizeof(Value)
); other values override this behavior.
template <typename Array, bool Write = false, size_t Level = 2, size_t Stride = sizeof(scalar_t<Array>), typename Index>
prefetch
(const void *mem, Index index, mask_t<Array> mask = true)¶Pre-fetches an array of type Array
into the L1 or L2 cache (as
indicated via the Level
template parameter) to reduce the latency of a
future gather or scatter operation. If Write = true
, the
associated cache line should be acquired for write access (i.e. a
scatter rather than a gather operation).
The index
parameter must be a 32 or 64 bit integer array having the
same number of entries. It will be interpreted as a signed array regardless
of whether the provided array is signed or unsigned.
If provided, the mask parameter specifies which of the pre-fetches should actually be performed.
The default value of the Stride
parameter indicates that the data at
mem
uses a packed memory layout (i.e. a stride value of
sizeof(Value)
); other values override this behavior.
template <typename Array, typename Index>
scatter_add
(const void *mem, Array array, Index index, mask_t<Array> mask = true)¶Performs a scatter-add operation that is equivalent to the following scalar loop (mapped to efficient hardware instructions if supported by the target hardware).
for (size_t i = 0; i < Array::Size; ++i)
if (mask[i])
((Value *) mem)[index[i]] += array[i];
The implementation avoids conflicts in case multiple indices refer to the same entry.
template <typename Output, typename Input, typename Mask>
compress
(Output output, Input input, Mask mask)¶Tightly packs the input values selected by a provided mask and writes them
to output
, which must be a pointer or a structure of pointers. See the
advanced topics section with regards to usage. The
function returns count(mask)
and also advances the pointer by this
amount.
template <typename Array, typename Index, typename Mask, typename Func, typename... Args>
transform
(scalar_t<Array> *mem, Index index, Mask mask, Func func, Args&&... args)¶Transforms referenced entries at mem
by the function func
while
avoiding potential conflicts. The variadic template arguments args
are
forwarded to the function. The pseudocode for this operation is
for (size_t i = 0; i < Array::Size; ++i) {
if (mask[i])
func(mem[index[i]], args...);
}
See the section on the histogram problem and conflict detection on how to use this function.
Note
To efficiently perform the transformation at the hardware level, the
Index
and Array
types should occupy the same size. The
implementation ensures that this is the case by performing an explicit
cast of the index parameter to int_array_t<Array>
.
DArray
>empty
(size_t size)¶Allocates and returns a dynamic array of type DArray
of size size
.
The array contents are uninitialized.
Array
>zero
()¶Returns a static array filled with zeros. This is analogous to writing
Array(0)
but makes it more explicit to the compiler that a specific
efficient instruction sequence should be used for zero-initialization.
DArray
>zero
(size_t size)¶Allocates and returns a dynamic array of type DArray
that is filled
with zeros.
Array
>arange
()¶Return an array initialized with an index sequence, i.e. 0, 1, .., Array::Size-1
.
DArray
>arange
(size_t size)¶Allocates and returns a dynamic array of type DArray
that is filled with an
index sequence 0..size-1
.
Array
>linspace
(scalar_t<Array> min, scalar_t<Array> max)¶Return an array initialized with linearly spaced entries including
the endpoints min
and max
.
DArray
>linspace
(scalar_t<DArray> min, scalar_t<DArray> max, size_t size)¶Allocates and returns a dynamic array initialized with size
linearly spaced entries including the endpoints min
and max
.
DArray
>meshgrid
(const DArray &x, const DArray &y)¶Creates a 2D coordinate array containing all pairs of entries from the
x
and y
arrays. Analogous to the meshgrid
function in NumPy.
using FloatP = Array<float>;
using FloatX = DynamicArray<FloatP>;
auto x = linspace<FloatX>(0.f, 1.f, 4);
auto y = linspace<FloatX>(2.f, 3.f, 4);
Array<FloatX, 2> grid = meshgrid(x, y);
std::cout << grid << std::endl;
/* Prints:
[[0, 2],
[0.333333, 2],
[0.666667, 2],
[1, 2],
[0, 2.33333],
[0.333333, 2.33333],
[0.666667, 2.33333],
[1, 2.33333],
[0, 2.66667],
[0.333333, 2.66667],
[0.666667, 2.66667],
[1, 2.66667],
[0, 3],
[0.333333, 3],
[0.666667, 3],
[1, 3]]
*/
template <typename Predicate, typename Index>
binary_search
(scalar_t<Index> start, scalar_t<Index> end, Predicate pred)¶Perform binary search over a range given a predicate pred
, which
monotonically decreases over this range (i.e. at most one true
-> false
transition).
Given a (scalar) start
and end
index of a range, this function
evaluates a predicate floor(log2(end-start) + 1)
times with index
values on the interval [start, end] (inclusive) to find the first index
that no longer satisfies it. Note that the template parameter Index
is
automatically inferred from the supplied predicate. Specifically, the
predicate takes an index or an index vector of type Index
as input
argument. In the vectorized case, each vector lane can thus evaluate a
different predicate. When pred
is false
for all entries, the
function returns start
, and when it is true
for all cases, it
returns end
.
The following code example shows a typical use case: array
contains a
sorted list of floating point numbers, and the goal is to map floating
point entries of x
to the first index j
such that array[j] >= x
(and all of this of course in parallel for each vector element).
std::vector<float> array = ...;
Float32P x = ...;
UInt32P j = binary_search(
0,
array.size() - 1,
[&](UInt32P index) {
return gather<Float32P>(array.data(), index) < x;
}
);
Array
>operator*=
(Array &x, Array y)¶Compound assignment operator for multiplication. Concerning Array
types
with non-commutative multiplication: the operation expands to x = x * y;
.
Array
>mulhi
(Array x, Array y)¶Returns the high part of an integer multiplication. For 32-bit scalar input, this is e.g. equivalent to the following expression
(int32_t) (((int64_t) x * (int64_t) y) >> 32);
Array
>operator/
(Array x, Array y)¶Binary division operator. A special overload multiplies by the reciprocal when the second argument is a scalar.
Integer division is handled specially, see Vectorized integer division by constants for details.
Array
>operator||
(Array x, Array y)¶Binary logical OR operator (identical to operator|
, as no
short-circuiting is supported in operator overloads).
Array
>operator&&
(Array x, Array y)¶Binary logical AND operator. (identical to operator&
, as no
short-circuiting is supported in operator overloads).
Array
>andnot
(Array x, Array y)¶Binary logical AND NOT operator. (identical to x & ~y
).
Array
>operator<=
(Array x, Array y)¶Less-than-or-equal comparison operator.
Array
>operator>
(Array x, Array y)¶Greater-than comparison operator.
Array
>operator>=
(Array x, Array y)¶Greater-than-or-equal comparison operator.
Array
>neq
(Array x, Array y)¶Inequality operator (vertical operation).
template <size_t Imm, typename Array>
rol
(Array x)¶Left shift with rotation by an immediate amount Imm
.
template <size_t Imm, typename Array>
ror
(Array x)¶Right shift with rotation by an immediate amount Imm
.
template <size_t Imm, typename Array>
ror_array
(Array x)¶Rotate the entire array by Imm
entries towards the right, i.e.
coeff[0]
becomes coeff[Imm]
, etc.
template <size_t Imm, typename Array>
rol_array
(Array x)¶Rotate the entire array by Imm
entries towards the left, i.e.
coeff[Imm]
becomes coeff[0]
, etc.
template <typename Target, typename Source>
reinterpret_array
(Source x)¶Reinterprets the bit-level representation of an array (e.g. from
uint32_t
to float
). See the section on reinterpreting array
contents for further details.
Array
>rcp
(Array x)¶Computes the reciprocal \(\frac{1}{x}\). A slightly less accurate (but
more efficient) implementation is used when approximate mode is enabled for
Array
. Relies on AVX512ER instructions if available.
Array
>rsqrt
(Array x)¶Computes the reciprocal square root \(\frac{1}{\sqrt{x}}\). A slightly
less accurate (but more efficient) implementation is used when approximate
mode is enabled for Array
. Relies on AVX512ER instructions if available.
Array
>abs
(Array x)¶Computes the absolute value \(|x|\) (analogous to std::abs
).
Array
>max
(Array x, Array y)¶Returns the maximum of \(x\) and \(y\) (analogous to std::max
).
Array
>min
(Array x, Array y)¶Returns the minimum of \(x\) and \(y\) (analogous to std::min
).
Array
>sign
(Array x)¶Computes the signum function \(\begin{cases}1,&\mathrm{if}\ x\ge 0\\-1,&\mathrm{otherwise}\end{cases}\)
Analogous to std::copysign(1.f, x)
.
Array
>copysign
(Array x, Array y)¶Copies the sign of the array y
to x
(analogous to std::copysign
).
Array
>mulsign_neg
(Array x, Array y)¶Efficiently multiplies x
by the sign of -y
.
Array
>sqrt
(Array x)¶Computes the square root of \(x\) (analogous to std::sqrt
).
Array
>cbrt
(Array x)¶Computes the cube root of \(x\) (analogous to std::cbrt
).
Array
>hypot
(Array x, Array y)¶Computes \(\sqrt{x^2+y^2}\) while avoiding overflow and underflow.
Array
>ceil
(Array x)¶Computes the ceiling of \(x\) (analogous to std::ceil
).
Array
>floor
(Array x)¶Computes the floor of \(x\) (analogous to std::floor
).
template <typename Target, typename Array>
ceil2int
(Array x)¶Computes the ceiling of \(x\) and converts the result to an integer. If
supported by the hardware, the combined operation is more efficient than
the analogous expression Target(ceil(x))
.
template <typename Target, typename Array>
floor2int
(Array x)¶Computes the floor of \(x\) and converts the result to an integer. If
supported by the hardware, the combined operation is more efficient than
the analogous expression Target(floor(x))
.
Array
>round
(Array x)¶Rounds \(x\) to the nearest integer using Banker’s rounding for half-way values.
Note
This is analogous to std::rint
, not std::round
.
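For instance (a sketch):
Array<float, 4> x(2.5f, 3.5f, -2.5f, 1.4f);
auto r = round(x); /* == [2, 4, -2, 1] under Banker's rounding */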
Array
>fmod
(Array x, Array y)¶Computes the floating-point remainder of the division operation x/y.
Array
>fmadd
(Array x, Array y, Array z)¶Performs a fused multiply-add operation if supported by the target
hardware. Otherwise, the operation is emulated using conventional
multiplication and addition (i.e. x * y + z
).
Array
>fnmadd
(Array x, Array y, Array z)¶Performs a fused negative multiply-add operation if supported by the target
hardware. Otherwise, the operation is emulated using conventional
multiplication and addition (i.e. -x * y + z
).
Array
>fmsub
(Array x, Array y, Array z)¶Performs a fused multiply-subtract operation if supported by the target
hardware. Otherwise, the operation is emulated using conventional
multiplication and subtraction (i.e. x * y - z
).
Array
>fnmsub
(Array x, Array y, Array z)¶Performs a fused negative multiply-subtract operation if supported by the
target hardware. Otherwise, the operation is emulated using conventional
multiplication and subtraction (i.e. -x * y - z
).
Array
>fmaddsub
(Array x, Array y, Array z)¶Performs a fused multiply-add and multiply-subtract operation for alternating elements. The pseudocode for this operation is
Array result;
for (size_t i = 0; i < Array::Size; ++i) {
if (i % 2 == 0)
result[i] = x[i] * y[i] - z[i];
else
result[i] = x[i] * y[i] + z[i];
}
Array
>fmsubadd
(Array x, Array y, Array z)¶Performs a fused multiply-add and multiply-subtract operation for alternating elements. The pseudocode for this operation is
Array result;
for (size_t i = 0; i < Array::Size; ++i) {
if (i % 2 == 0)
result[i] = x[i] * y[i] + z[i];
else
result[i] = x[i] * y[i] - z[i];
}
Array
>ldexp
(Array x, Array n)¶Multiplies \(x\) by \(2^n\). Analogous to std::ldexp
except
that n
is a floating point argument.
Array
>frexp
(Array x)¶Breaks the floating-point number \(x\) into a normalized fraction and
power of 2. Analogous to std::frexp
except that both return values are
floating point values.
Array
>operator==
(Array x, Array y)¶Equality operator.
Warning
Following the principle of least surprise,
enoki::operator==()
is a horizontal operation that returns a
boolean value; a vertical alternative named eq()
is also
available. The following pair of operations is equivalent:
bool b1 = (f1 == f2);
bool b2 = all(eq(f1, f2));
Array
>operator!=
(Array x, Array y)¶Warning
Following the principle of least surprise,
enoki::operator!=()
is a horizontal operation that returns a
boolean value; a vertical alternative named neq()
is also
available. The following pair of operations is equivalent:
bool b1 = (f1 != f2);
bool b2 = any(neq(f1, f2));
Array
>reverse
(Array value)¶Returns the input array with all components reversed, i.e.
value[Array::Size-1], .., value[0]
For multidimensional arrays, the outermost dimension of Array
is
reversed. Note that this operation is currently not efficiently vectorized
on 1D CPU arrays (though GPU and/or multi-dimensional arrays are fine).
Array
>psum
(Array value)¶Computes the inclusive prefix sum of the components of value
, i.e.
value[0], value[0] + value[1], .., value[0] + .. + value[Array::Size-1];
For multidimensional arrays, the horizontal reduction is performed over the
outermost dimension of Array
. Note that this operation is currently
not efficiently vectorized on 1D CPU arrays (though GPU and/or
multi-dimensional arrays are fine).
Array
>hsum
(Array value)¶Efficiently computes the horizontal sum of the components of value
, i.e.
value[0] + .. + value[Array::Size-1];
For 1D arrays, hsum()
returns a scalar result. For multidimensional
arrays, the horizontal reduction is performed over the outermost dimension
of Array
, and the result is of type value_t<Array>
.
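For example (a sketch):
Array<float, 4> x(1.f, 2.f, 3.f, 4.f);
float sum = hsum(x); /* == 10 */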
Array
>hsum_inner
(Array value)¶Analogous to hsum()
, except that the horizontal reduction is
performed over the innermost dimension of Array
(which is only
relevant in the case of a multidimensional input array).
Array
>hsum_nested
(Array value)¶Recursive version of hsum()
, which nests through all dimensions
and always returns a scalar.
Array
>hprod
(Array value)¶Efficiently computes the horizontal product of the components of value
, i.e.
value[0] * .. * value[Array::Size-1];
For 1D arrays, hprod()
returns a scalar result. For multidimensional
arrays, the horizontal reduction is performed over the outermost dimension
of Array
, and the result is of type value_t<Array>
.
Array
>hprod_inner
(Array value)¶Analogous to hprod()
, except that the horizontal reduction is
performed over the innermost dimension of Array
(which is only
relevant in the case of a multidimensional input array).
Array
>hprod_nested
(Array value)¶Recursive version of hprod()
, which nests through all dimensions
and always returns a scalar.
Array
>hmax
(Array value)¶Efficiently computes the horizontal maximum of the components of value
, i.e.
max(value[0], max(value[1], ...))
For 1D arrays, hmax()
returns a scalar result. For multidimensional
arrays, the horizontal reduction is performed over the outermost dimension
of Array
, and the result is of type value_t<Array>
.
Array
>hmax_inner
(Array value)¶Analogous to hmax()
, except that the horizontal reduction is
performed over the innermost dimension of Array
(which is only
relevant in the case of a multidimensional input array).
Array
>hmax_nested
(Array value)¶Recursive version of hmax()
, which nests through all dimensions
and always returns a scalar.
Array
>hmin
(Array value)¶Efficiently computes the horizontal minimum of the components of value
, i.e.
min(value[0], min(value[1], ...))
For 1D arrays, hmin()
returns a scalar result. For multidimensional
arrays, the horizontal reduction is performed over the outermost dimension
of Array
, and the result is of type value_t<Array>
.
Array
>hmin_inner
(Array value)¶Analogous to hmin()
, except that the horizontal reduction is
performed over the innermost dimension of Array
(which is only
relevant in the case of a multidimensional input array).
Array
>hmin_nested
(Array value)¶Recursive version of hmin()
, which nests through all dimensions
and always returns a scalar.
Mask
>all
(Mask value)¶Efficiently computes the horizontal AND (i.e. logical conjunction) of the
components of the mask value
, i.e.
value[0] & ... & value[Size-1]
For 1D arrays, all()
returns a boolean result. For multidimensional
arrays, the horizontal reduction is performed over the outermost dimension
of Mask
, and the result is of type mask_t<value_t<Mask>>
.
Mask
>all_inner
(Mask value)¶Analogous to all()
, except that the horizontal reduction is
performed over the innermost dimension of Mask
(which is only
relevant in the case of a multidimensional input array).
Mask
>all_nested
(Mask value)¶Recursive version of all()
, which nests through all dimensions
and always returns a boolean value.
template <typename Mask, bool Default>
all_or
(Mask value)¶This function returns the Default
template argument when Mask
is
a GPU array. Otherwise, it falls back to all()
. See the section on
horizontal operations on the GPU for
details.
Mask
>any
(Mask value)¶Efficiently computes the horizontal OR (i.e. logical disjunction) of the
components of the mask value
, i.e.
value[0] | ... | value[Size-1]
For 1D arrays, any()
returns a boolean result. For multidimensional
arrays, the horizontal reduction is performed over the outermost dimension
of Mask
, and the result is of type mask_t<value_t<Mask>>
.
Mask
>any_inner
(Mask value)¶Analogous to any()
, except that the horizontal reduction is
performed over the innermost dimension of Mask
(which is only
relevant in the case of a multidimensional input array).
Mask
>any_nested
(Mask value)¶Recursive version of any()
, which nests through all dimensions
and always returns a boolean value.
template <typename Mask, bool Default>
any_or
(Mask value)¶This function returns the Default
template argument when Mask
is
a GPU array. Otherwise, it falls back to any()
. See the section on
horizontal operations on the GPU for
details.
Mask
>none
(Mask value)¶Efficiently computes the negated horizontal OR of the components of the
mask value
, i.e.
~(value[0] | ... | value[Size-1])
For 1D arrays, none()
returns a boolean result. For multidimensional
arrays, the horizontal reduction is performed over the outermost dimension
of Mask
, and the result is of type mask_t<value_t<Mask>>
.
Mask
>none_inner
(Mask value)¶Analogous to none()
, except that the horizontal reduction is
performed over the innermost dimension of Mask
(which is only
relevant in the case of a multidimensional input array).
Mask
>none_nested
(Mask value)¶Recursive version of none()
, which nests through all dimensions
and always returns a boolean value.
template <typename Mask, bool Default>
none_or
(Mask value)¶This function returns the Default
template argument when Mask
is
a GPU array. Otherwise, it falls back to none()
. See the section on
horizontal operations on the GPU for
details.
Mask
>count
(Mask value)¶Efficiently computes the number of components whose mask bits are turned on, i.e.
(value[0] ? 1 : 0) + ... (value[Size - 1] ? 1 : 0)
For 1D arrays, count()
returns a result of type size_t
. For
multidimensional arrays, the horizontal reduction is performed over the
outermost dimension of Mask
, and the result is of type
size_array_t<value_t<Mask>>
.
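For example (a sketch):
Array<float, 4> x(1.f, -2.f, 3.f, -4.f);
size_t n = count(x > 0.f); /* == 2 */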
Mask
>count_inner
(Mask value)¶Analogous to count()
, except that the horizontal reduction is
performed over the innermost dimension of Mask
(which is only
relevant in the case of a multidimensional input array).
Mask
>count_nested
(Mask value)¶Recursive version of count()
, which nests through all dimensions
and always returns a scalar count of type size_t.
Array
>dot
(Array value1, Array value2)¶Efficiently computes the dot products of value1
and value2
, i.e.:
value1[0]*value2[0] + .. + value1[Array::Size-1]*value2[Array::Size-1];
The return value is of type value_t<Array>
, which is a scalar (e.g.
float
) for ordinary inputs and an array for nested array inputs.
Array
>norm
(Array value)¶Computes the 2-norm of the input array. The return value is of type
value_t<Array>
, which is a scalar (e.g. float
) for ordinary inputs
and an array for nested array inputs.
Most approximations of transcendental functions are based on routines in the CEPHES math library. The table below provides some statistics on their absolute and relative error.
The CEPHES approximations are only used when approximate mode is enabled; otherwise, the functions below will invoke the corresponding non-vectorized standard C library routines.
Note
The forward trigonometric functions (sin, cos, tan) are optimized for low error on the domain \(|x| < 8192\) and don’t perform as well beyond this range.
Single precision:
Function | Tested domain | Abs. error (mean) | Abs. error (max) | Rel. error (mean) | Rel. error (max) |
---|---|---|---|---|---|
\(\mathrm{sin}()\) | \(-8192 < x < 8192\) | \(1.2 \cdot 10^{-8}\) | \(1.2 \cdot 10^{-7}\) | \(1.9 \cdot 10^{-8}\,(0.25\,\mathrm{ulp})\) | \(1.8 \cdot 10^{-6}\,(19\,\mathrm{ulp})\) |
\(\mathrm{cos}()\) | \(-8192 < x < 8192\) | \(1.2 \cdot 10^{-8}\) | \(1.2 \cdot 10^{-7}\) | \(1.9 \cdot 10^{-8}\,(0.25\,\mathrm{ulp})\) | \(3.1 \cdot 10^{-6}\,(47\,\mathrm{ulp})\) |
\(\mathrm{tan}()\) | \(-8192 < x < 8192\) | \(4.7 \cdot 10^{-6}\) | \(8.1 \cdot 10^{-1}\) | \(3.4 \cdot 10^{-8}\,(0.42\,\mathrm{ulp})\) | \(3.1 \cdot 10^{-6}\,(30\,\mathrm{ulp})\) |
\(\mathrm{cot}()\) | \(-8192 < x < 8192\) | \(2.6 \cdot 10^{-6}\) | \(0.11 \cdot 10^{1}\) | \(3.5 \cdot 10^{-8}\,(0.42\,\mathrm{ulp})\) | \(3.1 \cdot 10^{-6}\,(47\,\mathrm{ulp})\) |
\(\mathrm{asin}()\) | \(-1 < x < 1\) | \(2.3 \cdot 10^{-8}\) | \(1.2 \cdot 10^{-7}\) | \(2.9 \cdot 10^{-8}\,(0.33\,\mathrm{ulp})\) | \(2.3 \cdot 10^{-7}\,(2\,\mathrm{ulp})\) |
\(\mathrm{acos}()\) | \(-1 < x < 1\) | \(4.7 \cdot 10^{-8}\) | \(2.4 \cdot 10^{-7}\) | \(2.9 \cdot 10^{-8}\,(0.33\,\mathrm{ulp})\) | \(1.2 \cdot 10^{-7}\,(1\,\mathrm{ulp})\) |
\(\mathrm{atan}()\) | \(-1 < x < 1\) | \(1.8 \cdot 10^{-7}\) | \(6 \cdot 10^{-7}\) | \(4.2 \cdot 10^{-7}\,(4.9\,\mathrm{ulp})\) | \(8.2 \cdot 10^{-7}\,(12\,\mathrm{ulp})\) |
\(\mathrm{sinh}()\) | \(-10 < x < 10\) | \(2.6 \cdot 10^{-5}\) | \(2 \cdot 10^{-3}\) | \(2.8 \cdot 10^{-8}\,(0.34\,\mathrm{ulp})\) | \(2.7 \cdot 10^{-7}\,(3\,\mathrm{ulp})\) |
\(\mathrm{cosh}()\) | \(-10 < x < 10\) | \(2.9 \cdot 10^{-5}\) | \(2 \cdot 10^{-3}\) | \(2.9 \cdot 10^{-8}\,(0.35\,\mathrm{ulp})\) | \(2.5 \cdot 10^{-7}\,(4\,\mathrm{ulp})\) |
\(\mathrm{tanh}()\) | \(-10 < x < 10\) | \(4.8 \cdot 10^{-8}\) | \(4.2 \cdot 10^{-7}\) | \(5 \cdot 10^{-8}\,(0.76\,\mathrm{ulp})\) | \(5 \cdot 10^{-7}\,(7\,\mathrm{ulp})\) |
\(\mathrm{csch}()\) | \(-10 < x < 10\) | \(5.7 \cdot 10^{-8}\) | \(7.8 \cdot 10^{-3}\) | \(4.4 \cdot 10^{-8}\,(0.54\,\mathrm{ulp})\) | \(3.1 \cdot 10^{-7}\,(5\,\mathrm{ulp})\) |
\(\mathrm{sech}()\) | \(-10 < x < 10\) | \(6.7 \cdot 10^{-9}\) | \(1.8 \cdot 10^{-7}\) | \(4.3 \cdot 10^{-8}\,(0.54\,\mathrm{ulp})\) | \(3.2 \cdot 10^{-7}\,(4\,\mathrm{ulp})\) |
\(\mathrm{coth}()\) | \(-10 < x < 10\) | \(1.2 \cdot 10^{-7}\) | \(7.8 \cdot 10^{-3}\) | \(6.9 \cdot 10^{-8}\,(0.61\,\mathrm{ulp})\) | \(5.7 \cdot 10^{-7}\,(8\,\mathrm{ulp})\) |
\(\mathrm{asinh}()\) | \(-30 < x < 30\) | \(2.8 \cdot 10^{-8}\) | \(4.8 \cdot 10^{-7}\) | \(1 \cdot 10^{-8}\,(0.13\,\mathrm{ulp})\) | \(1.7 \cdot 10^{-7}\,(2\,\mathrm{ulp})\) |
\(\mathrm{acosh}()\) | \(1 < x < 10\) | \(2.9 \cdot 10^{-8}\) | \(2.4 \cdot 10^{-7}\) | \(1.5 \cdot 10^{-8}\,(0.18\,\mathrm{ulp})\) | \(2.4 \cdot 10^{-7}\,(3\,\mathrm{ulp})\) |
\(\mathrm{atanh}()\) | \(-1 < x < 1\) | \(9.9 \cdot 10^{-9}\) | \(2.4 \cdot 10^{-7}\) | \(1.5 \cdot 10^{-8}\,(0.18\,\mathrm{ulp})\) | \(1.2 \cdot 10^{-7}\,(1\,\mathrm{ulp})\) |
\(\mathrm{exp}()\) | \(-20 < x < 30\) | \(0.72 \cdot 10^{4}\) | \(0.1 \cdot 10^{7}\) | \(2.4 \cdot 10^{-8}\,(0.27\,\mathrm{ulp})\) | \(1.2 \cdot 10^{-7}\,(1\,\mathrm{ulp})\) |
\(\mathrm{log}()\) | \(10^{-20} < x < 2\cdot 10^{30}\) | \(9.6 \cdot 10^{-9}\) | \(7.6 \cdot 10^{-6}\) | \(1.4 \cdot 10^{-10}\,(0.0013\,\mathrm{ulp})\) | \(1.2 \cdot 10^{-7}\,(1\,\mathrm{ulp})\) |
\(\mathrm{erf}()\) | \(-1 < x < 1\) | \(3.2 \cdot 10^{-8}\) | \(1.8 \cdot 10^{-7}\) | \(6.4 \cdot 10^{-8}\,(0.78\,\mathrm{ulp})\) | \(3.3 \cdot 10^{-7}\,(4\,\mathrm{ulp})\) |
\(\mathrm{erfc}()\) | \(-1 < x < 1\) | \(3.4 \cdot 10^{-8}\) | \(2.4 \cdot 10^{-7}\) | \(6.4 \cdot 10^{-8}\,(0.79\,\mathrm{ulp})\) | \(1 \cdot 10^{-6}\,(11\,\mathrm{ulp})\) |
Double precision:
Function | Tested domain | Abs. error (mean) | Abs. error (max) | Rel. error (mean) | Rel. error (max) |
---|---|---|---|---|---|
\(\mathrm{sin}()\) | \(-8192 < x < 8192\) | \(2.2 \cdot 10^{-17}\) | \(2.2 \cdot 10^{-16}\) | \(3.6 \cdot 10^{-17}\,(0.25\,\mathrm{ulp})\) | \(3.1 \cdot 10^{-16}\,(2\,\mathrm{ulp})\) |
\(\mathrm{cos}()\) | \(-8192 < x < 8192\) | \(2.2 \cdot 10^{-17}\) | \(2.2 \cdot 10^{-16}\) | \(3.6 \cdot 10^{-17}\,(0.25\,\mathrm{ulp})\) | \(3 \cdot 10^{-16}\,(2\,\mathrm{ulp})\) |
\(\mathrm{tan}()\) | \(-8192 < x < 8192\) | \(6.8 \cdot 10^{-16}\) | \(1.2 \cdot 10^{-10}\) | \(5.4 \cdot 10^{-17}\,(0.35\,\mathrm{ulp})\) | \(4.1 \cdot 10^{-16}\,(3\,\mathrm{ulp})\) |
\(\mathrm{cot}()\) | \(-8192 < x < 8192\) | \(4.9 \cdot 10^{-16}\) | \(1.2 \cdot 10^{-10}\) | \(5.5 \cdot 10^{-17}\,(0.36\,\mathrm{ulp})\) | \(4.4 \cdot 10^{-16}\,(3\,\mathrm{ulp})\) |
\(\mathrm{asin}()\) | \(-1 < x < 1\) | \(1.3 \cdot 10^{-17}\) | \(2.2 \cdot 10^{-16}\) | \(1.5 \cdot 10^{-17}\,(0.098\,\mathrm{ulp})\) | \(2.2 \cdot 10^{-16}\,(1\,\mathrm{ulp})\) |
\(\mathrm{acos}()\) | \(-1 < x < 1\) | \(5.4 \cdot 10^{-17}\) | \(4.4 \cdot 10^{-16}\) | \(3.5 \cdot 10^{-17}\,(0.23\,\mathrm{ulp})\) | \(2.2 \cdot 10^{-16}\,(1\,\mathrm{ulp})\) |
\(\mathrm{atan}()\) | \(-1 < x < 1\) | \(4.3 \cdot 10^{-17}\) | \(3.3 \cdot 10^{-16}\) | \(1 \cdot 10^{-16}\,(0.65\,\mathrm{ulp})\) | \(7.1 \cdot 10^{-16}\,(5\,\mathrm{ulp})\) |
\(\mathrm{sinh}()\) | \(-10 < x < 10\) | \(3.1 \cdot 10^{-14}\) | \(1.8 \cdot 10^{-12}\) | \(3.3 \cdot 10^{-17}\,(0.22\,\mathrm{ulp})\) | \(4.3 \cdot 10^{-16}\,(2\,\mathrm{ulp})\) |
\(\mathrm{cosh}()\) | \(-10 < x < 10\) | \(2.2 \cdot 10^{-14}\) | \(1.8 \cdot 10^{-12}\) | \(2 \cdot 10^{-17}\,(0.13\,\mathrm{ulp})\) | \(2.9 \cdot 10^{-16}\,(2\,\mathrm{ulp})\) |
\(\mathrm{tanh}()\) | \(-10 < x < 10\) | \(5.6 \cdot 10^{-17}\) | \(3.3 \cdot 10^{-16}\) | \(6.1 \cdot 10^{-17}\,(0.52\,\mathrm{ulp})\) | \(5.5 \cdot 10^{-16}\,(3\,\mathrm{ulp})\) |
\(\mathrm{csch}()\) | \(-10 < x < 10\) | \(4.5 \cdot 10^{-17}\) | \(1.8 \cdot 10^{-12}\) | \(3.3 \cdot 10^{-17}\,(0.21\,\mathrm{ulp})\) | \(5.1 \cdot 10^{-16}\,(4\,\mathrm{ulp})\) |
\(\mathrm{sech}()\) | \(-10 < x < 10\) | \(3 \cdot 10^{-18}\) | \(2.2 \cdot 10^{-16}\) | \(2 \cdot 10^{-17}\,(0.13\,\mathrm{ulp})\) | \(4.3 \cdot 10^{-16}\,(2\,\mathrm{ulp})\) |
\(\mathrm{coth}()\) | \(-10 < x < 10\) | \(1.2 \cdot 10^{-16}\) | \(3.6 \cdot 10^{-12}\) | \(6.2 \cdot 10^{-17}\,(0.3\,\mathrm{ulp})\) | \(6.7 \cdot 10^{-16}\,(5\,\mathrm{ulp})\) |
\(\mathrm{asinh}()\) | \(-30 < x < 30\) | \(5.1 \cdot 10^{-17}\) | \(8.9 \cdot 10^{-16}\) | \(1.9 \cdot 10^{-17}\,(0.13\,\mathrm{ulp})\) | \(4.4 \cdot 10^{-16}\,(2\,\mathrm{ulp})\) |
\(\mathrm{acosh}()\) | \(1 < x < 10\) | \(4.9 \cdot 10^{-17}\) | \(4.4 \cdot 10^{-16}\) | \(2.6 \cdot 10^{-17}\,(0.17\,\mathrm{ulp})\) | \(6.6 \cdot 10^{-16}\,(5\,\mathrm{ulp})\) |
\(\mathrm{atanh}()\) | \(-1 < x < 1\) | \(1.8 \cdot 10^{-17}\) | \(4.4 \cdot 10^{-16}\) | \(3.2 \cdot 10^{-17}\,(0.21\,\mathrm{ulp})\) | \(3 \cdot 10^{-16}\,(2\,\mathrm{ulp})\) |
\(\mathrm{exp}()\) | \(-20 < x < 30\) | \(4.7 \cdot 10^{-6}\) | \(2 \cdot 10^{-3}\) | \(2.5 \cdot 10^{-17}\,(0.16\,\mathrm{ulp})\) | \(3.3 \cdot 10^{-16}\,(2\,\mathrm{ulp})\) |
\(\mathrm{log}()\) | \(10^{-20} < x < 2\cdot 10^{30}\) | \(1.9 \cdot 10^{-17}\) | \(1.4 \cdot 10^{-14}\) | \(2.7 \cdot 10^{-19}\,(0.0013\,\mathrm{ulp})\) | \(2.2 \cdot 10^{-16}\,(1\,\mathrm{ulp})\) |
\(\mathrm{erf}()\) | \(-1 < x < 1\) | \(4.7 \cdot 10^{-17}\) | \(4.4 \cdot 10^{-16}\) | \(9.6 \cdot 10^{-17}\,(0.63\,\mathrm{ulp})\) | \(5.9 \cdot 10^{-16}\,(5\,\mathrm{ulp})\) |
\(\mathrm{erfc}()\) | \(-1 < x < 1\) | \(4.8 \cdot 10^{-17}\) | \(4.4 \cdot 10^{-16}\) | \(9.6 \cdot 10^{-17}\,(0.64\,\mathrm{ulp})\) | \(2.5 \cdot 10^{-15}\,(16\,\mathrm{ulp})\) |
Array
>sin
(Array x)¶Sine function approximation based on the CEPHES library.
Array
>cos
(Array x)¶Cosine function approximation based on the CEPHES library.
Array
>sincos
(Array x)¶Simultaneous sine and cosine function approximation based on the CEPHES library.
Array
>tan
(Array x)¶Tangent function approximation based on the CEPHES library.
Array
>csc
(Array x)¶Cosecant convenience function implemented as rcp(sin(x))
.
Array
>sec
(Array x)¶Secant convenience function implemented as rcp(cos(x))
.
Array
>cot
(Array x)¶Cotangent convenience function implemented as rcp(tan(x))
.
Array
>asin
(Array x)¶Arcsine function approximation based on the CEPHES library.
Array
>acos
(Array x)¶Arccosine function approximation based on the CEPHES library.
Array
>sinh
(Array x)¶Hyperbolic sine function approximation based on the CEPHES library.
Array
>cosh
(Array x)¶Hyperbolic cosine function approximation based on the CEPHES library.
Array
>sincosh
(Array x)¶Simultaneous hyperbolic sine and cosine function approximation based on the CEPHES library.
Array
>tanh
(Array x)¶Hyperbolic tangent function approximation based on the CEPHES library.
Array
>csch
(Array x)¶Hyperbolic cosecant convenience function implemented as rcp(sinh(x))
.
Array
>coth
(Array x)¶Hyperbolic cotangent convenience function implemented as rcp(tanh(x))
.
Array
>asinh
(Array x)¶Hyperbolic arcsine function approximation based on the CEPHES library.
Array
>exp
(Array x)¶Natural exponential function approximation based on the CEPHES library. Relies on AVX512ER instructions if available.
Array
>safe_sqrt
(Array x)¶Computes sqrt(max(0, x))
to avoid issues with negative inputs
(e.g. due to roundoff error in a prior calculation).
Array
>safe_rsqrt
(Array x)¶Computes rsqrt(max(0, x))
to avoid issues with negative inputs
(e.g. due to roundoff error in a prior calculation).
The following special functions require including the header
enoki/special.h
.
Array
>erf
(Array x)¶Evaluates the error function defined as
\(\mathrm{erf}(x)=\frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,\mathrm{d}t.\)
Requires a real-valued input array x
.
Array
>erfi
(Array x)¶Evaluates the imaginary error function defined as
\(\mathrm{erfi}(x)=-i\,\mathrm{erf}(ix)=\frac{2}{\sqrt{\pi}}\int_0^x e^{t^2}\,\mathrm{d}t.\)
Requires a real-valued input array x
.
Array
>i0e
(Array x)¶Evaluates the exponentially scaled modified Bessel function of order zero defined as
\(\mathrm{i0e}(x)=e^{-|x|}\,I_0(x),\)
where \(I_0\) denotes the modified Bessel function of the first kind of order zero.
Array
>gamma
(Array x)¶Evaluates the Gamma function defined as
\(\Gamma(x)=\int_0^\infty t^{x-1}e^{-t}\,\mathrm{d}t.\)
Array
>lgamma
(Array x)¶Evaluates the natural logarithm of the Gamma function.
Array
>dawson
(Array x)¶Evaluates Dawson’s integral defined as
Array
>ellint_1
(Array phi, Array k)¶Evaluates the incomplete elliptic integral of the first kind
Array
>comp_ellint_1
(Array k)¶Evaluates the complete elliptic integral of the first kind
Array
>ellint_2
(Array phi, Array k)¶Evaluates the incomplete elliptic integral of the second kind
Array
>comp_ellint_2
(Array k)¶Evaluates the complete elliptic integral of the second kind
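As a usage sketch, the error function can evaluate the CDF of a standard normal distribution (the helper normal_cdf below is not part of Enoki):

#include <enoki/special.h>

using FloatP = Packet<float, 8>;

/* Standard normal CDF: Phi(x) = (1 + erf(x / sqrt(2))) / 2 */
FloatP normal_cdf(const FloatP &x) {
    return 0.5f * (1.f + erf(x * 0.70710678f /* = 1/sqrt(2) */));
}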
template <typename Array> mask_t<Array> isnan(Array x)
    Checks for NaN values and returns a mask, analogous to std::isnan.

template <typename Array> mask_t<Array> isinf(Array x)
    Checks for infinite values and returns a mask, analogous to std::isinf.

template <typename Array> mask_t<Array> isfinite(Array x)
    Checks for finite values and returns a mask, analogous to std::isfinite.

template <typename Array> mask_t<Array> isdenormal(Array x)
    Checks for denormalized values and returns a mask.
template <typename Array> Array deg_to_rad(Array array)
    Convenience function which multiplies the input array by \(\frac{\pi}{180}\).

template <typename Array> Array rad_to_deg(Array array)
    Convenience function which multiplies the input array by \(\frac{180}{\pi}\).

template <typename Array> Array prev_float(Array array)
    Returns the previous representable floating point value for each element of array, analogous to std::nextafter(array, -∞). Special values (infinities & not-a-number values) are returned unchanged.

template <typename Array> Array next_float(Array array)
    Returns the next representable floating point value for each element of array, analogous to std::nextafter(array, ∞). Special values (infinities & not-a-number values) are returned unchanged.

template <typename Array> Array tzcnt(Array array)
    Returns the number of trailing zero bits (assumes that Array is an integer array).

template <typename Array> Array lzcnt(Array array)
    Returns the number of leading zero bits (assumes that Array is an integer array).

template <typename Array> Array popcnt(Array array)
    Returns the number of nonzero bits (assumes that Array is an integer array).

template <typename Array> Array log2i(Array array)
    Returns the floor of the base-two logarithm (assumes that Array is an integer array).
template <typename Index> auto range(scalar_t<Index> size)
    Returns an iterable, which generates linearly increasing index vectors from 0 to size-1. This function is meant to be used with the C++11 range-based for loop:

for (auto pair : range<Index>(1000)) {
    Index index = pair.first;
    mask_t<Index> mask = pair.second;
    // ...
}

The mask specifies which index vector entries are active: unless the number of iterations is exactly divisible by the packet size, the last loop iteration will generally have several disabled entries.
The implementation also supports iteration over multidimensional arrays:

using UInt32P = Packet<uint32_t>;
using Vector3uP = Array<UInt32P, 3>;

for (auto pair : range<Vector3uP>(10, 20, 30)) {
    Vector3uP index = pair.first;
    mask_t<UInt32P> mask = pair.second;
    // ...
}
which will generate indices (0, 0, 0), (0, 0, 1), etc. As before, the last loop iteration will generally have several disabled entries.
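The per-iteration mask combines naturally with masked memory operations such as gather(). The following sketch (the helper masked_sum is not part of Enoki, and it assumes that masked-off gather lanes yield zero) sums an array whose length need not be a multiple of the packet size:

using FloatP  = Packet<float, 8>;
using UInt32P = Packet<uint32_t, 8>;

float masked_sum(const float *data, uint32_t size) {
    FloatP accum = zero<FloatP>();
    for (auto pair : range<UInt32P>(size)) {
        /* Only active lanes read from memory in the final iteration */
        accum += gather<FloatP>(data, pair.first, pair.second);
    }
    return hsum(accum); /* horizontal reduction over the 8 lanes */
}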
void set_flush_denormals(bool value)
    Arithmetic involving denormalized floating point numbers triggers a slow microcode handler on most current architectures, which leads to severe performance penalties. This function can be used to specify whether denormalized floating point values are simply flushed to zero, which sidesteps the performance issues.

    Enoki also provides a tiny RAII wrapper named scoped_flush_denormals which sets (and later resets) this parameter.

bool flush_denormals()
    Returns whether denormals are flushed to zero (see set_flush_denormals()).
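A hypothetical usage sketch of the RAII wrapper (assuming it takes the desired mode as its constructor argument):

void critical_section() {
    /* Flush denormals to zero for the duration of this scope (assumed
       constructor signature) */
    scoped_flush_denormals guard(true);

    /* ... performance-sensitive numerical code ... */

} /* destructor restores the previous setting */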
template <size_t... Index, typename Array> Array shuffle(Array a)
    Shuffles the contents of an array given indices specified as a template parameter. The pseudocode for this operation is

Array out;
for (size_t i = 0; i < Array::Size; ++i)
    out[i] = a[Index[i]];
return out;
template <typename Array, typename Indices> Array shuffle(Array a, Indices ind)
    Shuffles the contents of an array given indices specified via an integer array. The pseudocode for this operation is

Array out;
for (size_t i = 0; i < Array::Size; ++i)
    out[i] = a[ind[i]];
return out;
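For instance, either overload can reverse a 4-element array (a small sketch; the type aliases are placeholder choices):

using Vector4f = Array<float, 4>;
using Index4   = Array<int, 4>;

Vector4f a(1.f, 2.f, 3.f, 4.f);

/* Compile-time indices ... */
Vector4f r1 = shuffle<3, 2, 1, 0>(a);          // [4, 3, 2, 1]

/* ... or run-time indices stored in an integer array */
Vector4f r2 = shuffle(a, Index4(3, 2, 1, 0));  // [4, 3, 2, 1]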
template <typename Array1, typename Array2> auto concat(Array1 a1, Array2 a2)
    Concatenates the contents of two arrays a1 and a2. The pseudocode for this operation is

Array<value_t<Array1>, Array1::Size + Array2::Size> out;
for (size_t i = 0; i < Array1::Size; ++i)
    out[i] = a1[i];
for (size_t i = 0; i < Array2::Size; ++i)
    out[i + Array1::Size] = a2[i];
return out;
template <typename Array> auto low(Array a)
    Returns the low part of the input array a. The length of the low part is defined as the largest power of two that is smaller than Array::Size. For power-of-two sized input, this function simply returns the low half.

template <typename Array> auto high(Array a)
    Returns the high part of the input array a. The length of the high part is equal to Array::Size minus the size of the low part. For power-of-two sized input, this function simply returns the high half.

template <size_t Size, typename Array> auto head(Array a)
    Returns a new array containing the leading Size elements of a.

template <size_t Size, typename Array> auto tail(Array a)
    Returns a new array containing the trailing Size elements of a.
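The following sketch illustrates these operations (together with concat()) on a small array:

using Vector4f = Array<float, 4>;

Vector4f v(1.f, 2.f, 3.f, 4.f);

auto lo = low(v);          // [1, 2]
auto hi = high(v);         // [3, 4]
auto h  = head<3>(v);      // [1, 2, 3]
auto t  = tail<1>(v);      // [4]
auto c  = concat(lo, hi);  // [1, 2, 3, 4]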
template <typename Outer, typename Inner> auto full(const Inner &inner)
    Given an array type Outer and a value of type Inner, this function returns a new composite type of shape [<shape of Outer>, <shape of Inner>] that is filled with the value of the inner argument.

In the simplest case, this can be used to create a (potentially nested) array that is filled with constant values.
using Vector4f = Array<float, 4>;
using MyMatrix = Array<Vector4f, 4>;
MyMatrix result = full<MyMatrix>(10.f);
std::cout << result << std::endl;
/* Prints:
[[10, 10, 10, 10],
[10, 10, 10, 10],
[10, 10, 10, 10],
[10, 10, 10, 10]]
*/
Another use case entails replicating an array over the trailing dimensions of a new array:
result = full<Vector4f>(Vector4f(1, 2, 3, 4));
std::cout << result << std::endl;
/* Prints:
[[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3],
[4, 4, 4, 4]]
*/
Note how this is different from the default broadcasting behavior of arrays. In this case, Vector4f and MyMatrix have the same size in the leading dimension, which would replicate the vector over that axis instead:
result = MyMatrix(Vector4f(1, 2, 3, 4));
std::cout << result << std::endl;
/* Prints:
[[1, 2, 3, 4],
[1, 2, 3, 4],
[1, 2, 3, 4],
[1, 2, 3, 4]]
*/
template <typename DArray> auto packet(DArray &&a, size_t i)
    Extracts the \(i\)-th packet from a dynamic array or data structure. See the chapter on dynamic arrays on how to use this function.

template <typename DArray> size_t packets(const DArray &a)
    Returns the number of packets stored in the given dynamic array or data structure.

template <typename DArray> auto slice(DArray &&a, size_t i)
    Extracts the \(i\)-th slice from a dynamic array or data structure. See the chapter on dynamic arrays on how to use this function.

template <typename DArray> size_t slices(const DArray &a)
    Returns the number of slices stored in the given dynamic array or data structure.

template <typename DArray> void set_slices(DArray &a, size_t size)
    Resizes the given dynamic array or data structure so that there is space for size slices. When reducing the size of the array, any memory allocated so far is kept. Since there is no exponential allocation mechanism, it is not recommended to call set_slices repeatedly to append elements. Unlike std::vector::resize(), previous values are not preserved when enlarging the array.
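The sketch below allocates a dynamic array and then visits it packet by packet (see the dynamic array chapter for the full discussion):

using FloatP = Packet<float, 8>;
using FloatX = DynamicArray<FloatP>;

FloatX data;
set_slices(data, 1000);  /* room for 1000 slices */

/* Initialize the contents one packet at a time */
for (size_t i = 0; i < packets(data); ++i)
    packet(data, i) = zero<FloatP>();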
The following type traits are available to query the properties of arrays at compile time.
The enoki::replace_scalar_t type trait and various aliases construct arrays matching a certain layout, but with different-flavored data. This is often helpful when defining custom data structures or function inputs. See the section on custom data structures for an example usage.

template <typename Array, typename Scalar> using replace_scalar_t
    Replaces the scalar type underlying an array. For instance, replace_scalar_t<Array<Array<float, 16>, 32>, int> is equal to Array<Array<int, 16>, 32>.

    The type trait also works for scalar arguments. Pointer and reference qualifiers are preserved—for instance, replace_scalar_t<const float *, int> is equal to const int *.
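For instance (a compile-time sketch; the aliases are placeholder names):

#include <type_traits>

using FloatP  = Packet<float, 8>;
using UInt32P = replace_scalar_t<FloatP, uint32_t>;  // Packet<uint32_t, 8>

/* The convenience aliases below are defined in terms of replace_scalar_t */
static_assert(std::is_same_v<UInt32P, uint32_array_t<FloatP>>, "");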
template <typename Array> using uint32_array_t = replace_scalar_t<Array, uint32_t>
    Create a 32-bit unsigned integer array matching the layout of Array.

template <typename Array> using int32_array_t = replace_scalar_t<Array, int32_t>
    Create a 32-bit signed integer array matching the layout of Array.

template <typename Array> using uint64_array_t = replace_scalar_t<Array, uint64_t>
    Create a 64-bit unsigned integer array matching the layout of Array.

template <typename Array> using int64_array_t = replace_scalar_t<Array, int64_t>
    Create a 64-bit signed integer array matching the layout of Array.

template <typename Array> using int_array_t
    Create a signed integer array (with the same number of bits per entry as the input) matching the layout of Array.

template <typename Array> using uint_array_t
    Create an unsigned integer array (with the same number of bits per entry as the input) matching the layout of Array.

template <typename Array> using float16_array_t = replace_scalar_t<Array, half>
    Create a half precision array matching the layout of Array.

template <typename Array> using float32_array_t = replace_scalar_t<Array, float>
    Create a single precision array matching the layout of Array.

template <typename Array> using float64_array_t = replace_scalar_t<Array, double>
    Create a double precision array matching the layout of Array.

template <typename Array> using float_array_t
    Create a floating point array (with the same number of bits per entry as the input) matching the layout of Array.

template <typename Array> using bool_array_t = replace_scalar_t<Array, bool>
    Create a boolean array matching the layout of Array.

template <typename Array> using size_array_t = replace_scalar_t<Array, size_t>
    Create a size_t-valued array matching the layout of Array.

template <typename Array> using ssize_array_t = replace_scalar_t<Array, ssize_t>
    Create a ssize_t-valued array matching the layout of Array.
The following section discusses helper types that can be used to selectively enable or disable template functions for Enoki arrays, e.g. like so:
template <typename Value, enable_if_array_t<Value> = 0>
void f(Value value) {
/* Invoked if 'Value' is an Enoki array */
}
template <typename Value, enable_if_not_array_t<Value> = 0>
void f(Value value) {
/* Invoked if 'Value' is *not* an Enoki array */
}
template <typename T> struct is_array
    static constexpr bool value
        Equal to true iff T is a static or dynamic Enoki array type.

template <typename T> struct is_mask
    static constexpr bool value
        Equal to true iff T is a static or dynamic Enoki mask type.

template <typename T> struct is_static_array
    static constexpr bool value
        Equal to true iff T is a static Enoki array type.

template <typename T> struct is_dynamic_array
    static constexpr bool value
        Equal to true iff T is a dynamic Enoki array type.

template <typename T> using enable_if_dynamic_array_t = std::enable_if_t<is_dynamic_array_v<T>, int>
    SFINAE alias to selectively enable a class or function definition for dynamic Enoki array types.

template <typename T> using enable_if_not_dynamic_array_t = std::enable_if_t<!is_dynamic_array_v<T>, int>
    SFINAE alias to selectively enable a class or function definition for types that are not dynamic Enoki array types.
template <typename T> struct is_dynamic
    static constexpr bool value
        Equal to true iff T (which could be a nested Enoki array) contains a dynamic array at any level. This is different from is_dynamic_array, which only cares about the outermost level—for instance, given a static array T containing a nested dynamic array, is_dynamic_array_v<T> == false, while is_dynamic_v<T> == true.
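To make this concrete:

using FloatP    = Packet<float, 8>;
using FloatX    = DynamicArray<FloatP>;
using Vector3fX = Array<FloatX, 3>;

static_assert(!is_dynamic_array_v<Vector3fX>, "outermost level is static");
static_assert(is_dynamic_v<Vector3fX>, "but a dynamic array is nested inside");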
Enoki ships with a vectorized implementation of the PCG32 random number generator developed by Melissa O’Neill. To use it, include the following header:
#include <enoki/random.h>
The following reference is based on the original PCG32 documentation.
The enoki::PCG32 class takes a single template parameter T that denotes the “shape” of the output. This can be any scalar type like uint32_t, in which case the implementation generates scalar variates:
/* Scalar RNG */
using RNG_1x = PCG32<uint32_t>;
RNG_1x my_rng;
float value = my_rng.next_float32();
Alternatively, it can be an Enoki array, in which case the implementation produces arrays of variates.
using FloatP = Packet<float, 16>;
/* Vector RNG -- generates 16 independent variates at once */
using RNG_16x = PCG32<FloatP>;
RNG_16x my_rng;
FloatP value = my_rng.next_float32();
PCG32 is fast: on a Skylake i7-6920HQ processor, the vectorized implementation provided here generates around 1.4 billion single precision variates per second.
template <typename T> class PCG32
    This class implements the PCG32 pseudorandom number generator. It has a period of \(2^{64}\) and supports \(2^{63}\) separate streams. Each stream produces a different unique sequence of random numbers, which is particularly useful in the context of vectorized computations.

    static constexpr size_t Size
        Denotes the SIMD width of the random number generator (i.e. how many pseudorandom variates are generated at the same time).

    using Int64 = int64_array_t<T>
        Type alias for a signed 64-bit integer (or an array thereof).

    using UInt64 = uint64_array_t<T>
        Type alias for an unsigned 64-bit integer (or an array thereof).

    using UInt32 = uint32_array_t<T>
        Type alias for an unsigned 32-bit integer (or an array thereof).

    using Float32 = float32_array_t<T>
        Type alias for a single precision float (or an array thereof).

    using Float64 = float64_array_t<T>
        Type alias for a double precision float (or an array thereof).
    void seed(const UInt64 &initstate, const UInt64 &initseq)
        This function initializes (a.k.a. “seeds”) the random number generator, a required initialization step before the generator can be used. The provided arguments are defined as follows:

        - initstate is the starting state for the RNG. Any 64-bit value is permissible.
        - initseq selects the output sequence for the RNG. Any 64-bit value is permissible, although only the low 63 bits are used.

        For this generator, there are \(2^{63}\) possible sequences of pseudorandom numbers. Each sequence is entirely distinct and has a period of \(2^{64}\). The initseq argument selects which stream is used. The initstate argument specifies the location within the \(2^{64}\) period.

        Calling PCG32::seed() with the same arguments produces the same output, allowing programs to use random number sequences repeatably.
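For vectorized use, one way to decorrelate the SIMD lanes is to give each lane its own stream via the initseq argument. The sketch below assumes the arange() index-generation function covered elsewhere in the documentation:

using FloatP  = Packet<float, 8>;
using RNG_8x  = PCG32<FloatP>;
using UInt64P = uint64_array_t<FloatP>;

RNG_8x rng;

/* Same starting state, but a distinct stream index for each of the 8 lanes */
rng.seed(PCG32_DEFAULT_STATE, PCG32_DEFAULT_STREAM + arange<UInt64P>());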
    UInt32 next_uint32(const mask_t<UInt64> &mask = true)
        Generate a uniformly distributed unsigned 32-bit random number (i.e. \(x\), where \(0\le x< 2^{32}\)). If a mask parameter is provided, only the pseudorandom number generators of active SIMD lanes are advanced.

    UInt64 next_uint64(const mask_t<UInt64> &mask = true)
        Generate a uniformly distributed unsigned 64-bit random number (i.e. \(x\), where \(0\le x< 2^{64}\)). If a mask parameter is provided, only the pseudorandom number generators of active SIMD lanes are advanced.

        Note: This function performs two internal calls to next_uint32().

    UInt32 next_uint32_bound(uint32_t bound, const mask_t<UInt64> &mask = true)
        Generate a uniformly distributed unsigned 32-bit random number less than bound (i.e. \(x\), where \(0\le x<\) bound). If a mask parameter is provided, only the pseudorandom number generators of active SIMD lanes are advanced.

        Note: This may involve multiple internal calls to next_uint32(), in which case the RNG advances by several steps. This is only relevant when using the advance() or operator-() method.

    UInt64 next_uint64_bound(uint64_t bound, const mask_t<UInt64> &mask = true)
        Generate a uniformly distributed unsigned 64-bit random number less than bound (i.e. \(x\), where \(0\le x<\) bound). If a mask parameter is provided, only the pseudorandom number generators of active SIMD lanes are advanced.

        Note: This may involve multiple internal calls to next_uint64(), in which case the RNG advances by several steps. This is only relevant when using the advance() or operator-() method.
    Float32 next_float32(const mask_t<UInt64> &mask = true)
        Generate a single precision floating point value on the interval \([0, 1)\). If a mask parameter is provided, only the pseudorandom number generators of active SIMD lanes are advanced.

    Float64 next_float64(const mask_t<UInt64> &mask = true)
        Generate a double precision floating point value on the interval \([0, 1)\). If a mask parameter is provided, only the pseudorandom number generators of active SIMD lanes are advanced.

        Warning: Since the underlying random number generator produces 32 bit output, only the first 32 mantissa bits will be filled (however, the resolution is still finer than in next_float32(), which only uses 23 mantissa bits).

    void advance(const Int64 &delta)
        This operation provides jump-ahead; it advances the RNG by delta steps, doing so in \(\log(\texttt{delta})\) time. Because of the periodic nature of generation, advancing by \(2^{64}-d\) (i.e., passing \(-d\)) is equivalent to backstepping the generator by \(d\) steps.
The following macros are defined in enoki/random.h:

uint64_t PCG32_DEFAULT_STATE = 0x853c49e6748fea9bULL
    Default initial state passed to PCG32::seed().

uint64_t PCG32_DEFAULT_STREAM = 0xda3e39cb94b95bdbULL
    Default stream index passed to PCG32::seed().
Enoki provides efficient support for encoding and decoding of Morton/Z-order indices of arbitrary dimension. Both scalar indices and index vectors are supported. Z-order indexing can improve the locality of memory accesses when two- or higher-dimensional data is arranged in memory.
[Figure: illustration of the Z-order curve, by David Eppstein]
To use this feature, include the following header:
#include <enoki/morton.h>
The following shows a round trip, encoding a 32-bit position as a Morton index and decoding it again.
using Vector2u = Array<uint32_t, 2>;
Vector2u pos(123u, 456u);
uint32_t encoded = morton_encode(pos);
Vector2u decoded = morton_decode<Vector2u>(encoded);
std::cout << "Original : " << pos << std::endl;
std::cout << "Encoded : " << encoded << std::endl;
std::cout << "Decoded : " << decoded << std::endl;
/* Prints:
Original : [123, 456]
Encoded : 177605
Decoded : [123, 456]
*/
Depending on hardware support, Enoki implements these operations using BMI2 instructions or bit shifts with precomputed magic constants.
The same also works for nested vectors:
using UInt32P = Packet<uint32_t, 8>;
using Vector2uP = Array<UInt32P, 2>;
Vector2uP pos(123u, 456u);
UInt32P encoded = morton_encode(pos);
Vector2uP decoded = morton_decode<Vector2uP>(encoded);
std::cout << "Original : " << pos << std::endl;
std::cout << "Encoded : " << encoded << std::endl;
std::cout << "Decoded : " << decoded << std::endl;
/* Prints:
Original : [[123, 456], [123, 456], [123, 456], [123, 456], [123, 456], [123, 456], [123, 456], [123, 456]]
Encoded : [177605, 177605, 177605, 177605, 177605, 177605, 177605, 177605]
Decoded : [[123, 456], [123, 456], [123, 456], [123, 456], [123, 456], [123, 456], [123, 456], [123, 456]]
*/
Enoki provides a vectorizable type for complex number arithmetic analogous to std::complex<T>. To use it, include the following header file:

#include <enoki/complex.h>

The following example shows how to define and perform basic arithmetic using enoki::Complex vectorized over 4-wide packets.
/* Declare underlying packet type, could just be 'float' for scalar arithmetic */
using FloatP = Packet<float, 4>;
/* Define complex number type */
using ComplexP = Complex<FloatP>;
const ComplexP I(0.f, 1.f);
const ComplexP z(0.2f, 0.3f);
/* Two different ways of computing the tangent function */
ComplexP t0 = (exp(I * z) - exp(-I * z)) / (I * (exp(I * z) + exp(-I * z)));
ComplexP t1 = tan(z);
std::cout << "t0 = " << t0 << std::endl << std::endl;
std::cout << "t1 = " << t1 << std::endl;
/* Prints
t0 = [0.184863 + 0.302229i,
0.184863 + 0.302229i,
0.184863 + 0.302229i,
0.184863 + 0.302229i]
t1 = [0.184863 + 0.302229i,
0.184863 + 0.302229i,
0.184863 + 0.302229i,
0.184863 + 0.302229i]
*/
template <typename Type> struct Complex : StaticArrayImpl<Type, 2>
    The class enoki::Complex is a 2D Enoki array whose components are of type Type. Various arithmetic operators (e.g. multiplication) and transcendental functions are overloaded so that they provide the correct behavior for complex-valued inputs.

    Complex(Type real, Type imag)
        Creates an enoki::Complex instance from the given real and imaginary inputs.

    Complex(Type f)
        Creates a real-valued enoki::Complex instance from f. This constructor effectively changes the broadcasting behavior of non-complex inputs—for instance, the snippet

auto value_a = zero<Array<float, 2>>();
auto value_c = zero<Complex<float>>();

value_a += 1.f; value_c += 1.f;

std::cout << "value_a = "<< value_a << ", value_c = " << value_c << std::endl;

        prints value_a = [1, 1], value_c = 1 + 0i, which is the desired behavior for complex numbers. For standard Enoki arrays, the number 1.f is broadcast to both components.
Only a few arithmetic operators need to be overridden to support complex arithmetic. The rest are automatically provided by Enoki’s existing operators and broadcasting rules.
template <typename T> Complex<T> sinh(Complex<T> z)
    Evaluates the complex hyperbolic sine function for z.

template <typename T> Complex<T> cosh(Complex<T> z)
    Evaluates the complex hyperbolic cosine function for z.

template <typename T> Complex<T> tanh(Complex<T> z)
    Evaluates the complex hyperbolic tangent function for z.

template <typename T> std::pair<Complex<T>, Complex<T>> sincosh(Complex<T> z)
    Jointly evaluates the complex hyperbolic sine and cosine function for z.

template <typename T> Complex<T> asinh(Complex<T> z)
    Evaluates the complex hyperbolic arcsine function for z.
Enoki provides a vectorizable type for quaternion arithmetic. To use it, include the following header:
#include <enoki/quaternion.h>
The following example shows how to define and perform basic arithmetic using enoki::Quaternion vectorized over 4-wide packets.
/* Declare underlying packet type, could just be 'float' for scalar arithmetic */
using FloatP = Packet<float, 4>;
/* Define vectorized quaternion type */
using QuaternionP = Quaternion<FloatP>;
QuaternionP a = QuaternionP(1.f, 0.f, 0.f, 0.f);
QuaternionP b = QuaternionP(0.f, 1.f, 0.f, 0.f);
/* Compute several rotations that interpolate between 'a' and 'b' */
FloatP t = linspace<FloatP>(0.f, 1.f);
QuaternionP c = slerp(a, b, t);
std::cout << "Interpolated quaternions:" << std::endl;
std::cout << c << std::endl << std::endl;
/* Turn into a 4x4 homogeneous coordinate rotation matrix packet */
using MatrixP = Matrix<FloatP, 4>;
MatrixP c_rot = quat_to_matrix<MatrixP>(c);
std::cout << "Rotation matrices:" << std::endl;
std::cout << c_rot << std::endl << std::endl;
/* Round trip: turn the rotation matrices back into rotation quaternions */
QuaternionP c2 = matrix_to_quat(c_rot);
if (hsum(abs(c-c2)) < 1e-6f)
std::cout << "Test passed." << std::endl;
else
std::cout << "Test failed." << std::endl;
/* Prints:
Interpolated quaternions:
[0 + 1i + 0j + 0k,
0 + 0.866025i + 0.5j + 0k,
0 + 0.5i + 0.866025j + 0k,
0 - 4.37114e-08i + 1j + 0k]
Rotation matrices:
[[[1, 0, 0, 0],
[0, -1, 0, 0],
[0, 0, -1, 0],
[0, 0, 0, 1]],
[[0.5, 0.866025, 0, 0],
[0.866025, -0.5, 0, 0],
[0, 0, -1, 0],
[0, 0, 0, 1]],
[[-0.5, 0.866025, 0, 0],
[0.866025, 0.5, 0, 0],
[0, 0, -1, 0],
[0, 0, 0, 1]],
[[-1, -8.74228e-08, 0, 0],
[-8.74228e-08, 1, 0, 0],
[-0, 0, -1, 0],
[0, 0, 0, 1]]]
Test passed.
*/
template <typename Type> struct Quaternion : StaticArrayImpl<Type, 4>
    The class enoki::Quaternion is a 4D Enoki array whose components are of type Type. Various arithmetic operators (e.g. multiplication) and transcendental functions are overloaded so that they provide the correct behavior for quaternion-valued inputs.

    Quaternion(Type x, Type y, Type z, Type w)
        Initializes a new enoki::Quaternion instance with the value \(x\mathbf{i} + y\mathbf{j} + z\mathbf{k} + w\).

        Warning: Note the different order convention compared to Complex::Complex().

    Quaternion(Array<Type, 3> imag, Type real)
        Creates an enoki::Quaternion instance from the given imaginary and real inputs.

        Warning: Note the different order convention compared to Complex::Complex().

    Quaternion(Type f)
        Creates a real-valued enoki::Quaternion instance from f. This constructor effectively changes the broadcasting behavior of non-quaternion inputs—for instance, the snippet

auto value_a = zero<Array<float, 4>>();
auto value_q = zero<Quaternion<float>>();

value_a += 1.f; value_q += 1.f;

std::cout << "value_a = "<< value_a << ", value_q = " << value_q << std::endl;

        prints value_a = [1, 1, 1, 1], value_q = 1 + 0i + 0j + 0k, which is the desired behavior for quaternions. For standard Enoki arrays, the number 1.f is broadcast to all four components.
template <typename T> T real(Quaternion<T> q)
    Extracts the real part of q.

template <typename T> Array<T, 3> imag(Quaternion<T> q)
    Extracts the imaginary part of q.

template <typename T> T abs(Quaternion<T> q)
    Computes the absolute value of q.

template <typename T> Quaternion<T> sqrt(Quaternion<T> q)
    Computes the square root of q.

template <typename T> Quaternion<T> conj(Quaternion<T> q)
    Evaluates the quaternion conjugate of q.

template <typename T> Quaternion<T> rcp(Quaternion<T> q)
    Evaluates the quaternion reciprocal of q.
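For instance, a unit quaternion can rotate a 3D vector via the conjugation \(q\,v\,q^*\). The helper rotate_vector below is not part of Enoki; it is a sketch built from the documented constructors and accessors:

using Quat4f   = Quaternion<float>;
using Vector3f = Array<float, 3>;

/* Rotate 'v' by the unit quaternion 'q'; the imaginary part of the
   product q * (v, 0) * conj(q) holds the rotated vector */
Vector3f rotate_vector(const Quat4f &q, const Vector3f &v) {
    return imag(q * Quat4f(v, 0.f) * conj(q));
}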
Only a few arithmetic operators need to be overridden to support quaternion arithmetic. The rest are automatically provided by Enoki’s existing operators and broadcasting rules.
template <typename T> Quaternion<T> operator*(Quaternion<T> q0, Quaternion<T> q1)
    Evaluates the quaternion product of q0 and q1.

template <typename T> Quaternion<T> operator/(Quaternion<T> q0, Quaternion<T> q1)
    Evaluates the quaternion division of q0 by q1.

std::ostream &operator<<(std::ostream &os, const Quaternion<T> &q)
    Sends the quaternion q to the stream os using the format 1 + 2i + 3j + 4k.

template <typename T> Quaternion<T> exp(Quaternion<T> q)
    Evaluates the quaternion exponential of q.

template <typename T> Quaternion<T> log(Quaternion<T> q)
    Evaluates the quaternion logarithm of q.

template <typename T> Quaternion<T> pow(Quaternion<T> q0, Quaternion<T> q1)
    Evaluates the quaternion q0 raised to the power q1.
Enoki provides a vectorizable type for small fixed-size square matrices (e.g. \(2\times 2\), \(3\times 3\), or \(4\times 4\)) that are commonly found in computer graphics and vision applications.
Note: Enoki attempts to store matrix entries in processor registers—it is unwise to work with large matrices since this will cause considerable pressure on the register allocator. Instead, consider using specialized libraries for large-scale linear algebra (e.g. Eigen or Intel MKL) in such cases.
To use this feature, include the following header:
#include <enoki/matrix.h>
The following example shows how to define and perform basic arithmetic using enoki::Matrix to construct a \(4\times 4\) homogeneous coordinate “look-at” camera matrix and its inverse:
using Vector3f = Array<float, 3>;
using Vector4f = Array<float, 4>;
using Matrix4f = Matrix<float, 4>;
std::pair<Matrix4f, Matrix4f> look_at(const Vector3f &origin,
const Vector3f &target,
const Vector3f &up) {
Vector3f dir = normalize(target - origin);
Vector3f left = normalize(cross(up, dir));
Vector3f new_up = cross(dir, left);
Matrix4f result = Matrix4f::from_cols(
concat(left, 0.f),
concat(new_up, 0.f),
concat(dir, 0.f),
concat(origin, 1.f)
);
/* The following two statements efficiently compute
the inverse. Alternatively, we could have written
Matrix4f inverse = inverse(result);
*/
Matrix4f inverse = Matrix4f::from_rows(
concat(left, 0.f),
concat(new_up, 0.f),
concat(dir, 0.f),
Vector4f(0.f, 0.f, 0.f, 1.f)
);
inverse[3] = inverse * concat(-origin, 1.f);
return std::make_pair(result, inverse);
}
Enoki can also work with fixed-size packets and dynamic arrays of matrices. The following example shows how to compute the inverse of a matrix array.
using FloatP = Packet<float, 4>; // 4-wide packet type
using FloatX = DynamicArray<FloatP>; // arbitrary-length array
using MatrixP = Matrix<FloatP, 4>; // a packet of 4x4 matrices
using MatrixX = Matrix<FloatX, 4>; // arbitrarily many 4x4 matrices
MatrixX matrices;
set_slices(matrices, 1000);
// .. fill matrices with contents ..
// Invert all matrices
vectorize(
[](auto &&m) {
m = inverse(MatrixP(m));
},
matrices
);
template <typename Value, size_t Size> struct Matrix : StaticArrayImpl<Array<Value, Size>, Size>
    The class enoki::Matrix represents a dense square matrix of fixed size as a Size \(\times\) Size Enoki array whose components are of type Value. The implementation relies on a column-major storage order to enable a particularly efficient implementation of vectorized matrix multiplication.

    Value
        Denotes the type of matrix elements.

    Column
        Denotes the Enoki array type of a matrix column.

    template <typename... Values> Matrix(Values... values)
        Creates a new enoki::Matrix instance with the given set of entries (where sizeof...(Values) == Size*Size).

    template <typename... Columns> Matrix(Columns... columns)
        Creates a new enoki::Matrix instance with the given set of columns (where sizeof...(Columns) == Size).

    template <size_t Size2> Matrix(Matrix<Value, Size2> m)
        Constructs a matrix from another matrix of the same type, but with a different size. If Size2 > Size, the constructor copies the top left part of m. Otherwise, it copies all of m and fills the rest of the matrix with the identity.

    Matrix(Value f)
        Creates an enoki::Matrix instance which has the value f on the diagonal and zeroes elsewhere.

    operator()(size_t i, size_t j) const
        Returns a const reference to the matrix entry \((i, j)\).

    template <typename... Columns> static Matrix from_columns(Columns... columns)
        Creates a new enoki::Matrix instance with the given set of columns (where Size == sizeof...(Columns)). This is identical to the Matrix::Matrix() constructor but makes it more explicit that the inputs are columns.

    template <typename... Rows> static Matrix from_rows(Rows... rows)
        Creates a new enoki::Matrix instance with the given set of rows (where Size == sizeof...(Rows)).
template <typename T, size_t Size> Matrix<T, Size> operator*(Matrix<T, Size> m0, Matrix<T, Size> m1)
    Efficient vectorized matrix-matrix multiplication operation. On AVX512VL, a \(4\times 4\) matrix multiplication reduces to 4 multiplications and 12 fused multiply-adds with embedded broadcasts.

template <typename T, size_t Size> Array<T, Size> operator*(Matrix<T, Size> m, Array<T, Size> v)
    Matrix-vector multiplication operation.

template <typename T, size_t Size> T trace(Matrix<T, Size> m)
    Computes the trace (i.e. the sum of the diagonal elements) of the given matrix.

template <typename T, size_t Size> T frob(Matrix<T, Size> m)
    Computes the Frobenius norm of the given matrix.

template <typename Matrix> Matrix diag(typename Matrix::Column v)
    Returns a diagonal matrix whose entries are copied from v.

template <typename Matrix> typename Matrix::Column diag(Matrix m)
    Extracts the diagonal from a matrix m and returns it as a vector.

template <typename T, size_t Size> Matrix<T, Size> transpose(Matrix<T, Size> m)
    Computes the transpose of m using an efficient set of shuffles.

template <typename T, size_t Size> Matrix<T, Size> inverse(Matrix<T, Size> m)
    Computes the inverse of m using a branchless vectorized form of Cramer’s rule.

    Warning: This function is only implemented for \(1\times 1\), \(2\times 2\), \(3\times 3\), and \(4\times 4\) matrices (which are allowed to be packets of matrices).

template <typename T, size_t Size> Matrix<T, Size> inverse_transpose(Matrix<T, Size> m)
    Computes the inverse transpose of m using a branchless vectorized form of Cramer’s rule. (This function is more efficient than transpose(inverse(m)).)

    Warning: This function is only implemented for \(1\times 1\), \(2\times 2\), \(3\times 3\), and \(4\times 4\) matrices (which are allowed to be packets of matrices).

template <typename T, size_t Size> T det(Matrix<T, Size> m)
    Computes the determinant of m.

    Warning: This function is only implemented for \(1\times 1\), \(2\times 2\), \(3\times 3\), and \(4\times 4\) matrices (which are allowed to be packets of matrices).

template <typename Matrix> auto polar_decomp(Matrix M, size_t it = 10)
    Given a nonsingular input matrix \(\mathbf{M}\), polar_decomp computes the polar decomposition \(\mathbf{M} = \mathbf{Q}\mathbf{P}\), where \(\mathbf{Q}\) is an orthogonal matrix and \(\mathbf{P}\) is a symmetric and positive definite matrix. The computation relies on an accelerated version of Heron’s method that converges rapidly. The parameter it denotes the iteration count—a value of \(10\) should be plenty.
Enoki provides a number of convenience functions to construct 3D homogeneous coordinate transformations (rotations, translations, scales, perspective transformation matrices, etc.). To use them, include the following header file:

#include <enoki/transform.h>

template <typename Matrix, typename Vector3> Matrix translate(Vector3 v)
    Constructs a homogeneous coordinate transformation, which translates points by v.

template <typename Matrix, typename Vector3> Matrix scale(Vector3 v)
    Constructs a homogeneous coordinate transformation, which scales points by v.

template <typename Matrix, typename Vector3, typename Float> Matrix rotate(Vector3 v, Float angle)
    Constructs a homogeneous coordinate transformation, which rotates by angle radians around the axis v. The function requires v to be normalized.
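These builders compose via matrix multiplication. The following sketch builds a model matrix (applied right to left: scale, then rotate, then translate):

using Vector3f = Array<float, 3>;
using Matrix4f = Matrix<float, 4>;

Matrix4f model =
    translate<Matrix4f>(Vector3f(1.f, 0.f, 0.f)) *
    rotate<Matrix4f>(Vector3f(0.f, 1.f, 0.f), 0.7853982f /* 45 deg */) *
    scale<Matrix4f>(Vector3f(2.f, 2.f, 2.f));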
template <typename Matrix> auto transform_decompose(Matrix m)
    Performs a polar decomposition of a non-perspective 4x4 homogeneous coordinate matrix and returns a tuple consisting of a 3x3 scale/shear matrix, a rotation quaternion, and a 3D translation vector. This representation is helpful when animating keyframe animations.

    The function also handles singular inputs m, in which case the rotation component is set to the identity quaternion and the scaling part simply copies the input matrix.

template <typename Matrix3, typename Quaternion, typename Vector3> auto transform_compose(Matrix3 scale, Quaternion rotation, Vector3 translate)
    This function composes a 4x4 homogeneous coordinate transformation from the given scale, rotation, and translation. It performs the reverse of transform_decompose.

template <typename Matrix3, typename Quaternion, typename Vector3> auto transform_compose_inverse(Matrix3 scale, Quaternion rotation, Vector3 translate)
    This function composes a 4x4 homogeneous inverse coordinate transformation from the given scale, rotation, and translation. It is equivalent to (but more efficient than) the expression inverse(transform_compose(...)).

template <typename Matrix, typename Point3, typename Vector3> Matrix look_at(Point3 origin, Point3 target, Vector3 up)
    Constructs a homogeneous coordinate transformation, which translates to \(\mathrm{origin}\), maps the negative \(z\) axis to \(\mathrm{target}-\mathrm{origin}\) (normalized) and the positive \(y\) axis to \(\mathrm{up}\) (if orthogonal to \(\mathrm{target}-\mathrm{origin}\)). The algorithm performs Gram-Schmidt orthogonalization to ensure that the returned matrix is orthonormal.
template <typename Matrix, typename Float> Matrix perspective(Float fov, Float near, Float far)
    Constructs an OpenGL-compatible perspective projection matrix with the specified field of view (in radians) and near and far clip planes. The returned matrix performs the transformation

    \[\begin{pmatrix}x\\y\\z\\1\end{pmatrix}\mapsto\begin{pmatrix}c\,x\\c\,y\\-\frac{\mathrm{far}+\mathrm{near}}{\mathrm{far}-\mathrm{near}}\,z-\frac{2\,\mathrm{far}\,\mathrm{near}}{\mathrm{far}-\mathrm{near}}\\-z\end{pmatrix},\]

    where

    \[c=\cot\left(\tfrac{1}{2}\,\mathrm{fov}\right),\]

    which (after the homogeneous division) maps \((0, 0, -\mathrm{near})^T\) to \((0, 0, -1)^T\) and \((0, 0, -\mathrm{far})^T\) to \((0, 0, 1)^T\).

template <typename Matrix, typename Float> Matrix frustum(Float left, Float right, Float bottom, Float top, Float near, Float far)
    Constructs an OpenGL-compatible perspective projection matrix. The provided parameters specify the intersection of the camera frustum with the near clipping plane. Specifically, the returned transformation maps \((\mathrm{left}, \mathrm{bottom}, -\mathrm{near})\) to \((-1, -1, -1)\) and \((\mathrm{right}, \mathrm{top}, -\mathrm{near})\) to \((1, 1, -1)\).

template <typename Matrix, typename Float> Matrix ortho(Float left, Float right, Float bottom, Float top, Float near, Float far)
    Constructs an OpenGL-compatible orthographic projection matrix. The provided parameters specify the intersection of the camera frustum with the near clipping plane. Specifically, the returned transformation maps \((\mathrm{left}, \mathrm{bottom}, -\mathrm{near})\) to \((-1, -1, -1)\) and \((\mathrm{right}, \mathrm{top}, -\mathrm{near})\) to \((1, 1, -1)\).
Enoki can efficiently evaluate real spherical harmonics basis functions for both scalar and vector arguments. To use this feature, include the following header file:
#include <enoki/sh.h>
The evaluation routines rely on efficient pre-generated branch-free code processed using aggressive constant folding and common subexpression elimination passes. Evaluation routines are provided up to order 10.
The generated code is based on the paper Efficient Spherical Harmonic Evaluation, Journal of Computer Graphics Techniques (JCGT), vol. 2, no. 2, 84-90, 2013 by Peter-Pike Sloan.
Note: The directions provided to sh_eval_* must be normalized 3D vectors (i.e. using Cartesian instead of spherical coordinates).

The Mathematica equivalent of the real spherical harmonic basis implemented in enoki/sh.h is given by the following definition:
SphericalHarmonicQ[l_, m_, d_] := Block[{θ, ϕ},
θ = ArcCos[d[[3]]];
ϕ = ArcTan[d[[1]], d[[2]]];
Piecewise[{
{SphericalHarmonicY[l, m, θ, ϕ], m == 0},
{Sqrt[2] * Re[SphericalHarmonicY[l, m, θ, ϕ]], m > 0},
{Sqrt[2] * Im[SphericalHarmonicY[l, -m, θ, ϕ]], m < 0}
}]
]
The following example shows how to evaluate the spherical harmonics basis up to and including order 2 producing a total of 9 function evaluations.
using Vector3f = Array<float, 3>;
Vector3f d = normalize(Vector3f(1, 2, 3));
float coeffs[9];
sh_eval(d, 2, coeffs);
// Prints: [0.282095, -0.261169, 0.391754, -0.130585, 0.156078, -0.468235, 0.292864, -0.234118, -0.117059]
std::cout << load<Array<float, 9>>(coeffs) << std::endl;
template <typename Array> void sh_eval(const Array &d, size_t order, expr_t<value_t<Array>> *out)
    Evaluates the real spherical harmonics basis functions up to and including order order. The output array must have room for (order + 1)*(order + 1) entries. This function dispatches to one of the sh_eval_* implementations and throws an exception if order > 9.

template <typename Array> void sh_eval_0(const Array &d, expr_t<value_t<Array>> *out)
    Evaluates the real spherical harmonics basis functions up to and including order 0. The output array must have room for 1 entry.

template <typename Array> void sh_eval_1(const Array &d, expr_t<value_t<Array>> *out)
    Evaluates the real spherical harmonics basis functions up to and including order 1. The output array must have room for 4 entries.

template <typename Array> void sh_eval_2(const Array &d, expr_t<value_t<Array>> *out)
    Evaluates the real spherical harmonics basis functions up to and including order 2. The output array must have room for 9 entries.

template <typename Array> void sh_eval_3(const Array &d, expr_t<value_t<Array>> *out)
    Evaluates the real spherical harmonics basis functions up to and including order 3. The output array must have room for 16 entries.

template <typename Array> void sh_eval_4(const Array &d, expr_t<value_t<Array>> *out)
    Evaluates the real spherical harmonics basis functions up to and including order 4. The output array must have room for 25 entries.

template <typename Array> void sh_eval_5(const Array &d, expr_t<value_t<Array>> *out)
    Evaluates the real spherical harmonics basis functions up to and including order 5. The output array must have room for 36 entries.

template <typename Array> void sh_eval_6(const Array &d, expr_t<value_t<Array>> *out)
    Evaluates the real spherical harmonics basis functions up to and including order 6. The output array must have room for 49 entries.

template <typename Array> void sh_eval_7(const Array &d, expr_t<value_t<Array>> *out)
    Evaluates the real spherical harmonics basis functions up to and including order 7. The output array must have room for 64 entries.
Enoki provides a set of helper functions for color space transformations. For now, only sRGB and inverse sRGB gamma correction are available. To use them, include the following header file:

#include <enoki/color.h>

template <typename Value> Value linear_to_srgb(Value value)
    Efficiently applies the sRGB gamma correction

    \[f(x)=\begin{cases}12.92\,x,&x\le 0.0031308\\1.055\,x^{1/2.4}-0.055,&\mathrm{otherwise}\end{cases}\]

    to an input value in the interval \((0, 1)\).

template <typename Value> Value srgb_to_linear(Value value)
    Efficiently applies the inverse sRGB gamma correction

    \[f^{-1}(y)=\begin{cases}y\,/\,12.92,&y\le 0.04045\\\left(\frac{y+0.055}{1.055}\right)^{2.4},&\mathrm{otherwise}\end{cases}\]

    to an input value in the interval \((0, 1)\).
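A round-trip sketch over a packet of values:

using FloatP = Packet<float, 8>;

FloatP x       = linspace<FloatP>(0.f, 1.f);
FloatP encoded = linear_to_srgb(x);
FloatP decoded = srgb_to_linear(encoded);  // matches 'x' up to round-off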
Enoki provides a compact implementation of a 16 bit half-precision floating point type that is compatible with the FP16 format on GPUs and high dynamic range image libraries such as OpenEXR. To use this feature, include the following header:
#include <enoki/half.h>
Current processors don’t natively implement half precision arithmetic, hence mathematical operations involving this type always involve a half \(\to\) float \(\to\) half roundtrip. For this reason, it is unwise to rely on it for expensive parts of a computation.
The main reason for including a dedicated half precision type in Enoki is that it provides an ideal storage format for floating point data that does not require the full accuracy of the single precision representation, which leads to an immediate storage savings of \(2\times\).
Note: If supported by the target architecture, Enoki uses the F16C instruction set to perform efficient vectorized conversion between half and single precision variables (however, this only affects conversion and no other arithmetic operations). ARM NEON also provides native conversion instructions.
The following example shows how to use the enoki::half type in a typical use case.
using Color4f = Array<float, 4>;
using Color4h = Array<half, 4>;
uint8_t *image_ptr = ...;
Color4f pixel(load<Color4h>(image_ptr)); // <- conversion vectorized using F16C
/* ... update 'pixel' using single-precision arithmetic ... */
store(image_ptr, Color4h(pixel)); // <- conversion vectorized using F16C
struct half
    A half instance encodes a sign bit, an exponent width of 5 bits, and 10 explicitly stored mantissa bits. All standard mathematical operators are overloaded and implemented using the processor’s floating point unit after a conversion to an IEEE 754 single precision value. The result of the operation is then converted back to half precision.

    uint16_t value
        Stores the represented half precision value as an unsigned 16-bit integer.

    half(float value)
        Constructs a half-precision value from the given single precision argument.

    operator float() const
        Implicit half to float conversion operator.
When Enoki extracts packets or slices through custom data structures, it also handles STL data structures including std::array, std::pair, and std::tuple. Please review the section on dynamic arrays for general details on vectorizing over dynamic arrays and working with slices.
To use this feature, include the following header file:
#include <enoki/stl.h>
Consider the following example, where a function returns a std::tuple containing a 3D position and a mask specifying whether the computation was successful. When the enoki/stl.h header file is included, Enoki’s dynamic vectorization machinery can be applied to vectorize such functions over arbitrarily large inputs.
/// Return value of 'my_function'
template <typename T>
using Return = std::tuple<
Array<T, 3>,
mask_t<T>
>;
template <typename T> Return<T> my_function(T theta, T phi) {
/* Turn spherical -> cartesian coordinates */
Array<T, 3> pos(
sin(theta) * cos(phi),
sin(theta) * sin(phi),
cos(theta)
);
/* Only points on the top hemisphere are 'valid' */
return std::make_tuple(pos, pos.z() > 0);
}
/// Packet of floats
using FloatP = Packet<float>;
/// Arbitrarily large sequence of floats
using FloatX = DynamicArray<FloatP>;
/// Tuple containing a packet of results
using ReturnP = Return<FloatP>;
/// Tuple containing dynamic arrays with arbitrarily many results
using ReturnX = Return<FloatX>;
int main(int argc, char *argv[]) {
FloatX theta = linspace<FloatX>(-10.f, 10.f, 10);
FloatX phi = linspace<FloatX>(0.f, 60.f, 10);
ReturnX result = vectorize(my_function<FloatP>, theta, phi);
/* Prints:
[[0.544021, 0, -0.839072],
[-0.924676, -0.373065, 0.0761302],
[0.478888, 0.461548, 0.746753],
[0.0777672, 0.173978, -0.981674],
[-0.0330365, -0.895583, 0.443666],
[-0.304446, 0.842896, 0.443666],
[0.127097, -0.141995, -0.981674],
[0.596782, -0.293616, 0.746753],
[-0.994388, 0.0734624, 0.0761293],
[0.518133, 0.165823, -0.839072]]
*/
std::cout << std::get<0>(result) << std::endl;
/* Prints:
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
*/
std::cout << std::get<1>(result) << std::endl;
}
Enoki provides support for pybind11, a lightweight header-only binding library that is used to expose C++ types in Python and vice versa.
To use this feature, include the following header file in the extension module:
#include <enoki/python.h>
The example below shows how to create bindings for a simple vector computation that converts spherical to Cartesian coordinates. A CMake build system file is provided at the bottom of this page.
#include <enoki/python.h>
/* Import pybind11 and Enoki namespaces */
namespace py = pybind11;
using namespace enoki;
using namespace py::literals; // Enables the ""_a parameter tags used below
/* Define a packet type used for vectorization */
using FloatP = Packet<float>;
/* Dynamic type for arbitrary-length arrays */
using FloatX = DynamicArray<FloatP>;
/* Various flavors of 3D vectors */
using Vector3f = Array<float, 3>;
using Vector3fP = Array<FloatP, 3>;
using Vector3fX = Array<FloatX, 3>;
/* The function we want to expose in Python */
template <typename Float>
Array<Float, 3> sph_to_cartesian(Float theta, Float phi) {
auto sc_theta = sincos(theta);
auto sc_phi = sincos(phi);
return {
sc_theta.first * sc_phi.second,
sc_theta.first * sc_phi.first,
sc_theta.second
};
}
/* The function below is called when the extension module is loaded. It performs a
sequence of m.def(...) calls which define functions in the module namespace 'm' */
PYBIND11_MODULE(pybind11_test /* <- name of extension module */, m) {
m.doc() = "Enoki & pybind11 test plugin"; // Set a docstring
/* 1. Bind the scalar version of the function */
m.def(
/* Name of the function in the Python extension module */
"sph_to_cartesian",
/* Function that should be exposed */
sph_to_cartesian<float>,
/* Function docstring */
"Convert from spherical to cartesian coordinates [scalar version]",
/* Parameter names for function signature in docstring */
"theta"_a, "phi"_a
);
/* 2. Bind the packet version of the function */
m.def("sph_to_cartesian",
/* The only difference is the FloatP template argument */
sph_to_cartesian<FloatP>,
"Convert from spherical to cartesian coordinates [packet version]",
"theta"_a, "phi"_a);
/* 3. Bind dynamic version of the function */
m.def("sph_to_cartesian",
/* Note the use of 'vectorize_wrapper', which is described below */
vectorize_wrapper(sph_to_cartesian<FloatP>),
"Convert from spherical to cartesian coordinates [dynamic version]",
"theta"_a, "phi"_a);
}
pybind11 infers the necessary binding code from the type of the function provided to the def() calls. Including the enoki/python.h header is all it takes to make the pybind11 library fully Enoki-aware—arbitrarily nested dynamic and static arrays will be converted automatically. In practice, one would usually skip the packet version since it is subsumed by the dynamic case.
The following interactive session shows how to load the extension module and query its automatically generated help page.
Python 3.5.2 |Anaconda 4.2.0 (x86_64)| (default, Jul 2 2016, 17:52:12)
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pybind11_test
>>> help(pybind11_test)
Help on module pybind11_test
NAME
pybind11_test - Enoki & pybind11 test plugin
FUNCTIONS
sph_to_cartesian(...)
sph_to_cartesian(*args, **kwargs)
Overloaded function.
1. sph_to_cartesian(theta: float, phi: float)
-> numpy.ndarray[dtype=float32, shape=(3)]
Convert from spherical to cartesian coordinates [scalar version]
2. sph_to_cartesian(theta: numpy.ndarray[dtype=float32, shape=(8)],
phi: numpy.ndarray[dtype=float32, shape=(8)])
-> numpy.ndarray[dtype=float32, shape=(8, 3)]
Convert from spherical to cartesian coordinates [packet version]
3. sph_to_cartesian(theta: numpy.ndarray[dtype=float32, shape=(n)],
phi: numpy.ndarray[dtype=float32, shape=(n)])
-> numpy.ndarray[dtype=float32, shape=(n, 3)]
Convert from spherical to cartesian coordinates [dynamic version]
FILE
/Users/wjakob/pybind11_test/pybind11_test.cpython-35m-darwin.so
As can be seen, the help describes all three overloads along with the name and shape of their input arguments. Let’s try calling one of them:
>>> from pybind11_test import sph_to_cartesian
>>> sph_to_cartesian(theta=1, phi=2)
array([-0.35017547, 0.76514739, 0.54030228], dtype=float32)
Note how the returned Enoki array was automatically converted into a NumPy array.
Let’s now call the dynamic version of the function. We will use np.linspace to generate inputs, which actually have an incorrect dtype of np.float64. The binding layer detects this and automatically creates a temporary single precision input array before performing the function call.
>>> import numpy as np
>>> sph_to_cartesian(theta=np.linspace(0.0, 1.0, 10),
... phi=np.linspace(1.0, 2.0, 10))
array([[ 0. , 0. , 1. ],
[ 0.04919485, 0.09937215, 0.99383354],
[ 0.07527862, 0.20714317, 0.9754101 ],
[ 0.07696848, 0.31801295, 0.9449569 ],
[ 0.05418137, 0.42652887, 0.90284967],
[ 0.00803789, 0.52735412, 0.84960753],
[-0.05919253, 0.61553025, 0.7858873 ],
[-0.14420365, 0.68672061, 0.71247464],
[-0.24281444, 0.73742425, 0.63027501],
[-0.35017547, 0.76514739, 0.54030228]], dtype=float32)
The following CMakeLists.txt file can be used to build the module on various platforms.
cmake_minimum_required (VERSION 2.8.12)
project(pybind11_test CXX)
include(CheckCXXCompilerFlag)
# Set a default build configuration (Release)
if (NOT CMAKE_BUILD_TYPE AND NOT CMAKE_CONFIGURATION_TYPES)
message(STATUS "Setting build type to 'Release' as none was specified.")
set(CMAKE_BUILD_TYPE Release CACHE STRING "Choose the type of build." FORCE)
set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS "Debug" "Release"
"MinSizeRel" "RelWithDebInfo")
endif()
# Enable C++17 support (Enoki requires a sufficiently C++17-capable compiler)
if (CMAKE_CXX_COMPILER_ID MATCHES "GNU|Clang|Intel")
  CHECK_CXX_COMPILER_FLAG("-std=c++17" HAS_CPP17_FLAG)
  if (HAS_CPP17_FLAG)
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++17")
  else()
    message(FATAL_ERROR "Unsupported compiler -- C++17 support is needed!")
  endif()
endif()
# Assumes that pybind11 is located in the 'pybind11' subdirectory
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/pybind11)
# Assumes that enoki is located in the 'enoki' subdirectory
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/enoki)
# Enable some helpful vectorization-related compiler flags
enoki_set_compile_flags()
enoki_set_native_flags()
include_directories(enoki/include pybind11/include)
# Compile our pybind11 module
pybind11_add_module(pybind11_test pybind11_test.cpp)
Please refer to pybind11’s extensive documentation for details on using it in general. The enoki/python.h API only provides one public function:
template <typename Func> auto vectorize_wrapper(Func func)
    “Converts” a function that takes a set of packets and structures of packets as inputs into a new function that processes dynamic versions of these parameters. Non-array arguments are not transformed. For instance, it would turn the following hypothetical signature

FloatP my_func(Array<FloatP, 3> position, GPSRecord2<FloatP> record, int scalar);

    into

FloatX my_func(Array<FloatX, 3> position, GPSRecord2<FloatX> record, int scalar);

    where

using FloatX = DynamicArray<FloatP>;

    This is handy because a one-liner like vectorize_wrapper(sph_to_cartesian<FloatP>) in the above example is all it takes to take a packet version of a function and expose a dynamic version that can process arbitrarily large NumPy arrays.