Commit 7cf6056

Merge branch 'master' into link_libraries
2 parents 8c886a3 + d096f15 commit 7cf6056

38 files changed, +914 -304 lines changed

.github/workflows/testing.yml

+1 -1
@@ -450,7 +450,7 @@ jobs:
 fi
 docker create --user dev --name taichi_build_desktop --gpus all -v /tmp/.X11-unix:/tmp/.X11-unix \
   -e PY -e GPU_BUILD -e PROJECT_NAME -e TAICHI_CMAKE_ARGS -e DISPLAY -e EXPORT_CORE\
-  registry.taichigraphics.com/taichidev-ubuntu18.04:v0.2.1 \
+  registry.taichigraphics.com/taichidev-ubuntu18.04:v0.3.0 \
   /home/dev/taichi/.github/workflows/scripts/unix_build.sh
 # A tarball is needed because sccache needs some permissions that only the file owner has.
 # 1000 is the uid and gid of user "dev" in the container.

cmake/TaichiExportCore.cmake

+2
@@ -4,3 +4,5 @@ set(TAICHI_EXPORT_CORE_NAME taichi_export_core)
 
 add_library(${TAICHI_EXPORT_CORE_NAME} SHARED)
 target_link_libraries(${TAICHI_EXPORT_CORE_NAME} PRIVATE taichi_isolated_core)
+set_target_properties(${TAICHI_EXPORT_CORE_NAME} PROPERTIES
+    CMAKE_LIBRARY_OUTPUT_DIRECTORY "${CMAKE_CURRENT_SOURCE_DIR}/build")

docs/lang/articles/advanced/performance.md

+7 -3
@@ -153,14 +153,18 @@ Additionally, the last atomic add to the global memory `s[None]` is optimized us
 CUDA's warp-level intrinsics, further reducing the number of required atomic adds.
 
 Currently, Taichi supports TLS optimization for these reduction operators: `add`,
-`sub`, `min` and `max`. [Here](https://github.com/taichi-dev/taichi/pull/2956) is
-a benchmark comparison when running a global max reduction on a 1-D Taichi field
+`sub`, `min` and `max` on **0D** scalar/vector/matrix `ti.field`s. It is not yet
+supported on `ti.ndarray`s. [Here](https://github.com/taichi-dev/taichi/pull/2956)
+is a benchmark comparison when running a global max reduction on a 1-D Taichi field
 of 8M floats on an Nvidia GeForce RTX 3090 card:
 
 * TLS disabled: 5.2 x 1e3 us
 * TLS enabled: 5.7 x 1e1 us
 
-TLS has led to an approximately 100x speedup.
+TLS has led to an approximately 100x speedup. We also show that TLS reduction sum
+achieves comparable performance with CUDA implementations, see
+[benchmark](https://github.com/taichi-dev/taichi_benchmark/tree/main/reduce_sum) for
+details.
 
 ### Block Local Storage (BLS)
 
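
The idea behind the TLS reduction described in the performance.md hunk can be illustrated with a minimal plain-Python simulation. This is only a sketch, not Taichi code and not how Taichi implements the optimization: the function names `reduce_sum_naive` and `reduce_sum_tls`, the `num_threads` parameter, and the strided work split are all assumptions made for illustration. It counts how many writes to the shared accumulator each strategy would need.

```python
def reduce_sum_naive(values):
    """Every element issues one simulated atomic add to the shared accumulator."""
    total = 0
    atomic_adds = 0
    for v in values:
        total += v          # stands in for an atomic add to global memory
        atomic_adds += 1
    return total, atomic_adds


def reduce_sum_tls(values, num_threads):
    """Each simulated thread sums its own slice locally, then adds once."""
    total = 0
    atomic_adds = 0
    for t in range(num_threads):
        local = 0           # thread-local accumulator (hypothetical stand-in for TLS)
        for v in values[t::num_threads]:
            local += v
        total += local      # a single simulated atomic add per thread
        atomic_adds += 1
    return total, atomic_adds


if __name__ == "__main__":
    data = list(range(1000))
    print(reduce_sum_naive(data))
    print(reduce_sum_tls(data, num_threads=8))
```

Both functions return the same sum, but the second touches the shared accumulator once per simulated thread instead of once per element. As a sanity check on the quoted timings, 5.2 x 1e3 us divided by 5.7 x 1e1 us is roughly 91, which is consistent with the "approximately 100x" figure.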

python/taichi/ui/staging_buffer.py

+1 -1
@@ -100,7 +100,7 @@ def copy_image_u8_to_u8(src: ti.template(), dst: ti.template(),
                         num_components: ti.template()):
     for i, j in src:
         for k in ti.static(range(num_components)):
-            dst[i, j][k] = src[i, j][k]
+            dst[i, j][k] = ti.cast(src[i, j][k], ti.u8)
         if num_components < 4:
             # alpha channel
             dst[i, j][3] = u8(255)
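
The copy logic in the staging_buffer.py hunk (copy each channel with an explicit conversion to an 8-bit value, then fill the alpha channel when the source has fewer than four components) can be sketched in plain Python. This is a hedged illustration, not the Taichi kernel: `copy_image_to_u8` is a hypothetical helper, images are nested lists rather than Taichi fields, and the `& 0xFF` wraparound is an assumption standing in for `ti.cast(..., ti.u8)`, whose exact conversion semantics are not stated in the diff.

```python
def copy_image_to_u8(src, num_components):
    """Copy an H x W x C nested-list image into an H x W x 4 image of 0..255 ints."""
    dst = []
    for row in src:
        dst_row = []
        for pixel in row:
            out = [0, 0, 0, 0]
            for k in range(num_components):
                # Assumed stand-in for the explicit cast to u8: keep the low 8 bits.
                out[k] = int(pixel[k]) & 0xFF
            if num_components < 4:
                # alpha channel: fully opaque when the source does not provide one
                out[3] = 255
            dst_row.append(out)
        dst.append(dst_row)
    return dst


if __name__ == "__main__":
    rgb = [[[1, 2, 3], [261, 0, 255]]]
    print(copy_image_to_u8(rgb, 3))
```

Making the conversion explicit at the assignment keeps the destination values in the 0..255 range regardless of the source element type, which is the property the one-line change in the diff appears to be after.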
