
Use WebAssembly bulk memory opcodes #263

Merged
merged 7 commits into from
May 20, 2022

Conversation

TerrorJack
Contributor

Closes #262.

@TerrorJack TerrorJack marked this pull request as ready for review December 4, 2021 03:59
@TerrorJack
Contributor Author

Ready for first pass of review.

I haven't written microbenchmarks to gather statistics yet, but IIUIC the only scenario where a wasm loop could beat the native implementation is frequent calls operating on very small chunks, since some engines have overhead when calling into native methods.

@TerrorJack
Contributor Author

I also eyeballed the wasm2wat CI artifacts, compiled with LLVM 10 & 13. memcpy, memmove, and memset all correctly use bulk memory opcodes, but when I attempted to use __builtin_wmemcpy, LLVM 10 emits:

```wat
(module
  (type (;0;) (func (param i32 i32 i32) (result i32)))
  (import "env" "__linear_memory" (memory (;0;) 0))
  (import "env" "__indirect_function_table" (table (;0;) 0 funcref))
  (func $wmemcpy (type 0) (param i32 i32 i32) (result i32)
    (local i32)
    local.get 3))
```

And LLVM 13 emits:

```wat
(module
  (type (;0;) (func (param i32 i32 i32) (result i32)))
  (import "env" "__linear_memory" (memory (;0;) 0))
  (func $wmemcpy (type 0) (param i32 i32 i32) (result i32)
    loop (result i32)  ;; label = @1
      br 0 (;@1;)
    end))
```

So the builtin intrinsics don't work for the wide-char versions of these functions. This might be an upstream bug.
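
For context, here's a minimal sketch of the kind of source under discussion (illustrative, not wasi-libc's actual code). The narrow builtin lowers to the bulk-memory `memory.copy` opcode on wasm32; `__builtin_wmemcpy` is the clang builtin whose lowering appears broken in the wasm2wat output above, so the wide copy is sketched here as a plain loop instead:

```c
#include <stddef.h>
#include <wchar.h>

/* Illustrative, not wasi-libc's actual code. On wasm32 with bulk memory
 * enabled, LLVM lowers __builtin_memcpy to the memory.copy opcode
 * instead of emitting a byte loop. */
void *copy_bytes(void *restrict dst, const void *restrict src, size_t n) {
    return __builtin_memcpy(dst, src, n);
}

/* The wide-char version via __builtin_wmemcpy is what the wasm2wat
 * output above shows miscompiling under LLVM 10/13; a plain loop is the
 * portable fallback. */
wchar_t *copy_wchars(wchar_t *restrict dst, const wchar_t *restrict src,
                     size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
    return dst;
}
```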

@TerrorJack TerrorJack requested a review from sbc100 December 11, 2021 15:05
@sbc100
Member

sbc100 commented Dec 11, 2021

Might be worth measuring the actual performance in shipping runtimes, as @kripken suggests in the bug. Otherwise LGTM.

@sunfishcode
Member

Benchmarks would be nice to do, though I'm inclined to merge this either way. LLVM optimizes small fixed-size memcpy/etc. into inline loads and stores already, and beyond that, this is one of the main use cases that bulk-memory was added for, so in theory it shouldn't be slow anywhere. If it does turn out to be slow somewhere, and it isn't just a missing optimization in a particular engine, we can of course revisit this.

@kripken
Member

kripken commented Dec 14, 2021

Just as a warning, the benchmarks I saw were quite slow, so you might be making wasi-libc 2x slower or so on commonly-called functions. There was also debate back then as to whether this was an intended use of bulk memory or not (i.e. should the VM emit fast code for both small and large operations, both aligned and unaligned, etc.). All that was a while ago though, so hopefully it's not a problem any more!

@kripken
Member

kripken commented Jan 5, 2022

Link to some recent discussion and data on this (with no clear conclusions):

WebAssembly/binaryen#4403

@sunfishcode
Member

I've also now done some benchmarking and instrumentation, and these small unaligned memcpys are more frequent than I had guessed they'd be. For example, they come up when doing formatted I/O, to copy all the individual strings into the I/O buffer.

@TerrorJack Would you be interested in extending this PR to have fast paths for small lengths?

@TerrorJack
Contributor Author

@sunfishcode Hi, of course, but I'd first like to know how to choose the heuristic small-length threshold.

@sunfishcode
Member

@TerrorJack Going by this data I think the optimal number will be somewhere between 8 and 100. Perhaps we could try starting with 32? That's long enough to easily cover most formatted-I/O use cases, but perhaps short enough to keep the code simple?
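
The fast path being discussed can be sketched roughly as follows (names and structure are illustrative, not wasi-libc's actual code): copy byte-by-byte below the threshold to avoid per-call engine overhead, and fall back to the compiler builtin above it, which LLVM lowers to the bulk-memory `memory.copy` opcode on wasm32. The threshold of 32 is the starting value suggested above.

```c
#include <stddef.h>

#define SMALL_THRESHOLD 32  /* starting value suggested in this thread */

/* Hypothetical sketch: a simple byte loop for small lengths, the
 * builtin (hence memory.copy on wasm32 with bulk memory) otherwise. */
void *small_fast_memcpy(void *restrict dst, const void *restrict src,
                        size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    if (n < SMALL_THRESHOLD) {
        while (n--)
            *d++ = *s++;
        return dst;
    }
    return __builtin_memcpy(dst, src, n);
}
```

The exact threshold is a tunable: too low and the per-call overhead of `memory.copy` dominates small formatted-I/O copies; too high and the byte loop loses to the engine's optimized bulk implementation.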

@TerrorJack
Contributor Author

@sunfishcode Done.

@sunfishcode
Member

Thanks! We can experiment with the threshold and see how it works out in practice.

@sunfishcode sunfishcode merged commit ba81b40 into WebAssembly:main May 20, 2022
@TerrorJack TerrorJack deleted the bulk-memory branch May 20, 2022 20:58
@kripken
Member

kripken commented Jul 21, 2022

Btw, it might be good to document this somewhere (not sure where, though?). I saw someone run into a problem because they used wasi-libc and the code couldn't run in a VM that didn't support bulk memory. The error message "invalid section 12" doesn't really help point people toward rebuilding wasi-libc, unfortunately...

@sunfishcode
Member

If you have any ideas for where we could document this so that users would find it when they need it, I'm open to that.

Development

Successfully merging this pull request may close these issues.

Use WebAssembly bulk memory opcodes for memcpy/memset?
4 participants