---
sidebar_position: 3
---

# Using quantized data types

High-resolution simulations can deliver great visual quality, but they are often
limited by available memory, especially on GPUs. To save memory, Taichi provides
low-precision ("quantized") data types: you can define your own integers,
fixed-point numbers, or floating-point numbers with a non-standard number of
bits, so that you can choose a proper setting with minimal memory cost for your
application. Taichi also provides a suite of tailored domain-specific
optimizations to keep the runtime performance of quantized data types close to
that of full-precision data types.

:::note
Quantized data types are only supported on the CPU and CUDA backends for now.
:::

## Quantized data types

### Quantized integers

Modern computers represent integers using the [two's complement](https://en.wikipedia.org/wiki/Two%27s_complement)
format. *Quantized integers* in Taichi adopt the same format but can use a
non-standard number of bits:

```python
i10 = ti.types.quant.int(bits=10)              # 10-bit signed (default) integer type
u5 = ti.types.quant.int(bits=5, signed=False)  # 5-bit unsigned integer type
```

### Quantized fixed-point numbers

[Fixed-point numbers](https://en.wikipedia.org/wiki/Fixed-point_arithmetic) are
a classic way to represent real numbers. The internal representation of a
fixed-point number is simply an integer, and its actual value equals that
integer multiplied by a predefined scaling factor. Based on its support for
quantized integers, Taichi provides *quantized fixed-point numbers* as follows:

```python
fixed_type_a = ti.types.quant.fixed(bits=10, max_value=20.0)                # 10-bit signed (default) fixed-point type within [-20.0, 20.0]
fixed_type_b = ti.types.quant.fixed(bits=5, signed=False, max_value=100.0)  # 5-bit unsigned fixed-point type within [0.0, 100.0]
fixed_type_c = ti.types.quant.fixed(bits=6, signed=False, scale=1.0)        # 6-bit unsigned fixed-point type within [0.0, 64.0]
```

`scale` is the scaling factor mentioned above. Because fixed-point numbers are
especially useful when you know that the actual value is guaranteed to stay
within a certain range, Taichi also lets you set `max_value` instead and
calculates the scaling factor for you.

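
For example, judging from the ranges annotated in the snippet above, the scaling
factor appears to follow the relation sketched below. Treat this as an
illustrative assumption; the exact rule is implemented inside Taichi:

```python
# Hypothetical derivation of the scaling factor from max_value.
bits, signed, max_value = 10, True, 20.0   # parameters of fixed_type_a
value_bits = bits - 1 if signed else bits  # one bit is reserved for the sign
scale = max_value / 2 ** value_bits        # 20.0 / 512 = 0.0390625
```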

### Quantized floating-point numbers

[Floating-point numbers](https://en.wikipedia.org/wiki/Floating-point_arithmetic)
are the standard way to represent real numbers on modern computers. A
floating-point number is composed of exponent bits, fraction bits, and a sign
bit. Various floating-point formats exist; they differ in how many bits are
devoted to the exponent and how many to the fraction.

In Taichi, you can define a *quantized floating-point number* with an arbitrary
combination of exponent bits and fraction bits (the sign bit is counted as one
of the fraction bits):

```python
float_type_a = ti.types.quant.float(exp=5, frac=10)               # 15-bit signed (default) floating-point type with 5 exponent bits
float_type_b = ti.types.quant.float(exp=6, frac=9, signed=False)  # 15-bit unsigned floating-point type with 6 exponent bits
```

### Compute types

All the parameters you've seen above specify the *storage type* of a quantized
data type. However, most quantized data types have no native hardware support,
so a value of such a type has to be converted to a primitive type (its
"*compute type*") whenever it is involved in computation.

The default compute type for quantized integers is `ti.i32`, while the default
compute type for quantized fixed-point/floating-point numbers is `ti.f32`. You
can change the compute type by specifying the `compute` parameter:

```python
i21 = ti.types.quant.int(bits=21, compute=ti.i64)
bfloat16 = ti.types.quant.float(exp=8, frac=8, compute=ti.f32)
```

## Data containers for quantized data types

Because the storage types are not primitive types, you may wonder how quantized
data types can work with the data containers that Taichi provides. Taichi
introduces two dedicated constructs, bitpacked fields and quant arrays, to
bridge this gap.

### Bitpacked fields

`ti.BitpackedFields` packs a group of fields whose `dtype`s are quantized data
types so that they are stored together in a single primitive type. You can then
place a `ti.BitpackedFields` instance under any SNode as if each member field
were placed individually.

```python
a = ti.field(float_type_a)  # 15 bits
b = ti.field(fixed_type_b)  # 5 bits
c = ti.field(fixed_type_c)  # 6 bits
d = ti.field(u5)            # 5 bits
bitpack = ti.BitpackedFields(max_num_bits=32)
bitpack.place(a, b, c, d)   # 31 out of 32 bits occupied
ti.root.dense(ti.i, 10).place(bitpack)
```

#### Shared exponent

When multiple fields with quantized floating-point types are packed together,
there is a chance that they can share a common exponent. For example, in a 3D
velocity vector, if you know that the x-component has a much larger absolute
value than the y- and z-components, you probably do not care about the exact
values of the y- and z-components. In this case, a shared exponent leaves more
bits for the components with larger absolute values. You can use
`place(x, y, z, shared_exponent=True)` to make fields `x, y, z` share a common
exponent.

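
The following is a minimal sketch of this idea; the field names, bit widths, and
layout are illustrative and not taken from the examples above (8 shared exponent
bits plus 3 x 8 fraction bits fit exactly into 32 bits):

```python
component_type = ti.types.quant.float(exp=8, frac=8)
x = ti.field(component_type)
y = ti.field(component_type)
z = ti.field(component_type)

pack = ti.BitpackedFields(max_num_bits=32)
pack.place(x, y, z, shared_exponent=True)  # the three fields share one 8-bit exponent
ti.root.dense(ti.i, 1024).place(pack)
```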

#### Your first program

You probably cannot wait to write your first Taichi program with quantized data
types. The easiest way is to modify the data definitions of an existing example.
Assume you want to save memory for
[examples/simulation/euler.py](https://github.com/taichi-dev/taichi/blob/master/python/taichi/examples/simulation/euler.py).
Because most data definitions in the example are similar, here only field `Q` is
used for illustration:

```python
Q = ti.Vector.field(4, dtype=ti.f32, shape=(N, N))
```

An element of `Q` now occupies 4 x 32 = 128 bits. If you can fit it into 64
bits, the memory usage is halved. A direct first attempt is to use quantized
floating-point numbers with a shared exponent:

```python
float_type_c = ti.types.quant.float(exp=8, frac=14)
Q = ti.Vector.field(4, dtype=float_type_c)  # 8 shared exponent bits + 4 x 14 fraction bits = 64 bits
bitpack = ti.BitpackedFields(max_num_bits=64)
bitpack.place(Q, shared_exponent=True)
ti.root.dense(ti.ij, (N, N)).place(bitpack)
```

Surprisingly, you will find no obvious difference in visual effects after the
change, and you have now finished your first Taichi program with quantized data
types! Further experiments are left to you.

#### More complicated quantization schemes

Here comes a more complicated scenario. In a 3D Eulerian fluid simulation, a
voxel may need to store a 3D vector for velocity and an integer value for the
cell category, which has three possible values: "source", "Dirichlet boundary",
and "Neumann boundary". You can store all of this information in a single
32-bit `ti.BitpackedFields`:

```python
velocity_component_type = ti.types.quant.float(exp=6, frac=8, compute=ti.f32)
velocity = ti.Vector.field(3, dtype=velocity_component_type)

# Since there are only three cell categories, 2 bits are enough.
cell_category_type = ti.types.quant.int(bits=2, signed=False, compute=ti.i32)
cell_category = ti.field(dtype=cell_category_type)

voxel = ti.BitpackedFields(max_num_bits=32)
# Place the three components of velocity into the voxel, and let them share the exponent.
voxel.place(velocity, shared_exponent=True)
# Place the 2-bit cell category.
voxel.place(cell_category)
# Create 512 x 512 x 256 voxels.
ti.root.dense(ti.ijk, (512, 512, 256)).place(voxel)
```

The compression scheme above squeezes 13 bytes (4B x 3 + 1B) of full-precision
data into just 4 bytes. Note that you can still use `velocity` and
`cell_category` in computation code as if they were ordinary `ti.f32` and
`ti.i32` values, because all arithmetic happens in their compute types.

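As a hedged illustration, the kernel below reads and writes these packed fields;
the category encoding (`0` standing for a "source" cell) and the damping
operation are made up for this example only:

```python
@ti.kernel
def damp_sources(dt: ti.f32):
    for i, j, k in cell_category:
        if cell_category[i, j, k] == 0:  # hypothetical: 0 marks a "source" cell
            # Loaded as ti.f32, scaled, then stored back in quantized form.
            velocity[i, j, k] *= ti.exp(-dt)
```
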
### Quant arrays

Bitpacked fields are laid out in an array-of-structures (AOS) order. However,
there are also cases where many values of a single quantized type need to be
laid out as an array. For example, you may want to store 8 x u4 values in a
single u32, to represent the bin values of a histogram.

A quant array is exactly what you need here. A `quant_array` is an SNode that
reinterprets a primitive type as an array of a quantized type:

```python
bin_value_type = ti.types.quant.int(bits=4, signed=False)
bin_value = ti.field(dtype=bin_value_type)

# The quant array for 512 x 512 bin values: each 32-bit word holds 8 bins along ti.j.
array = ti.root.dense(ti.ij, (512, 64)).quant_array(ti.j, 8, max_num_bits=32)
# Place the unsigned 4-bit bin value into the quant array.
array.place(bin_value)
```

:::note
1. Only one field can be placed under a `quant_array`.
2. Only quantized integer types and quantized fixed-point types are supported as
the `dtype` of the field under a `quant_array`.
3. The number of bits of the field's `dtype` multiplied by the number of elements
of the `quant_array` must not exceed the `max_num_bits` of the `quant_array`.
:::

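In a kernel, elements of the quant array can then be read and written like any
other field elements. The fill pattern below is purely illustrative and assumes
the `bin_value` field defined above:

```python
@ti.kernel
def fill_bins():
    for i, j in bin_value:
        # Each 4-bit bin can only hold values in [0, 15].
        bin_value[i, j] = (i + j) % 16
```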

#### Bit vectorization

For quant arrays of 1-bit quantized integer types ("booleans"), Taichi provides
an additional optimization, bit vectorization, which vectorizes operations on
such quant arrays under struct-fors:

```python
u1 = ti.types.quant.int(1, False)
N = 512
M = 32
x = ti.field(dtype=u1)
y = ti.field(dtype=u1)
# x and y are 2D fields of shape (N // M, M); each row is stored in one 32-bit word.
ti.root.dense(ti.i, N // M).quant_array(ti.j, M, max_num_bits=M).place(x)
ti.root.dense(ti.i, N // M).quant_array(ti.j, M, max_num_bits=M).place(y)

@ti.kernel
def assign_vectorized():
    ti.loop_config(bit_vectorize=True)
    for i, j in x:
        y[i, j] = x[i, j]  # 32 bits are handled at a time

assign_vectorized()
```

## Advanced examples

The following examples are picked from the
[QuanTaichi paper](https://yuanming.taichi.graphics/publication/2021-quantaichi/quantaichi.pdf);
you can dig into the details there.

### [Game of Life](https://github.com/taichi-dev/quantaichi/tree/main/gol)

### [Eulerian Fluid](https://github.com/taichi-dev/quantaichi/tree/main/eulerian_fluid)

### [MLS-MPM](https://github.com/taichi-dev/taichi_elements/blob/master/demo/demo_quantized_simulation_letters.py)