Use WebAssembly bulk memory opcodes #263
Conversation
Ready for a first pass of review. I haven't made microbenchmarks to gather statistics yet, but IIUIC the only scenario where a wasm loop can be faster than the native implementation is frequent calls operating on very small chunks; some engines may have per-call overhead for native methods, who knows.
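For context (this is not from the PR), a microbenchmark for that scenario could look roughly like the C sketch below; the iteration count, buffer size, and the volatile-length trick to defeat fixed-size inlining are all illustrative choices:

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void) {
  static char src[64], dst[64];
  /* volatile so the compiler cannot see a constant length and inline
     the copy; we want to measure the real memcpy call path. */
  volatile size_t len = 8;
  clock_t start = clock();
  for (long i = 0; i < 50000000L; i++)
    memcpy(dst, src, len);
  double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
  printf("50M 8-byte copies: %.3f s\n", secs);
  return 0;
}
```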
I also eyeballed the emitted code for `wmemcpy`:

```wat
(module
  (type (;0;) (func (param i32 i32 i32) (result i32)))
  (import "env" "__linear_memory" (memory (;0;) 0))
  (import "env" "__indirect_function_table" (table (;0;) 0 funcref))
  (func $wmemcpy (type 0) (param i32 i32 i32) (result i32)
    (local i32)
    local.get 3))
```

And LLVM 13 emits:

```wat
(module
  (type (;0;) (func (param i32 i32 i32) (result i32)))
  (import "env" "__linear_memory" (memory (;0;) 0))
  (func $wmemcpy (type 0) (param i32 i32 i32) (result i32)
    loop (result i32)  ;; label = @1
      br 0 (;@1;)
    end))
```

So the builtin intrinsics don't work for the wide-char versions of the functions. Might be an upstream bug.
Might be worth measuring the actual performance in shipping runtimes, as @kripken suggests in the bug... otherwise LGTM.
Benchmarks would be nice to do, though I'm inclined to merge this either way. LLVM optimizes small fixed-size memcpy/etc. into inline loads and stores already, and beyond that, this is one of the main use cases that bulk-memory was added for, so in theory it shouldn't be slow anywhere. If it does turn out to be slow somewhere, and it isn't just a missing optimization in a particular engine, we can of course revisit this.
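To illustrate the first point (example mine, not from the thread): with a length known at compile time, LLVM expands the call inline, so those cases never reach libc's memcpy at all:

```c
#include <string.h>

/* At -O1 and above, LLVM lowers this fixed-size memcpy to a single
   i64 load/store pair instead of a call into libc. */
void copy8(void *dst, const void *src) {
  memcpy(dst, src, 8);
}
```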
Just as a warning, the benchmarks I saw were quite slow, so you might be making wasi-libc 2x slower or so on commonly-called functions. There was also debate back then as to whether this was an intended use of bulk memory or not (i.e. should the VM emit fast code for both small and large operations, both aligned and unaligned, etc.). All that was a while ago though, so hopefully it's not a problem any more!
Link to some recent discussion and data on this (with no clear conclusions): |
I've also now done some benchmarking and instrumentation, and these small unaligned memcpys are more frequent than I had guessed they'd be. For example, they come up when doing formatted I/O, to copy all the individual strings into the I/O buffer. @TerrorJack Would you be interested in extending this PR to have fast paths for small lengths? |
@sunfishcode Hi, of course, but I'd first like to know how the small-length threshold should be chosen.
@TerrorJack Going by this data, I think the optimal number will be somewhere between 8 and 100. Perhaps we could try starting with 32? That's long enough to easily cover most formatted-I/O use cases, and perhaps short enough to keep the code simple.
@sunfishcode Done. |
Thanks! We can experiment with the threshold and see how it works out in practice. |
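For concreteness, here is a sketch of the shape being discussed: a byte loop below the threshold, and the bulk-memory path above it. The threshold value and the use of `__builtin_memcpy` (which lowers to a `memory.copy` instruction when bulk memory is enabled) are illustrative, not necessarily the PR's exact code:

```c
#include <stddef.h>

#define SMALL_COPY_THRESHOLD 32 /* starting point suggested above; to be tuned */

void *memcpy(void *restrict dst, const void *restrict src, size_t n) {
  /* Fast path: for small copies, a plain byte loop avoids whatever
     per-call overhead memory.copy has in a given engine. */
  if (n < SMALL_COPY_THRESHOLD) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
      *d++ = *s++;
    return dst;
  }
  /* Large copies: with bulk memory enabled, this lowers to a single
     memory.copy instruction rather than a libcall (so no recursion). */
  return __builtin_memcpy(dst, src, n);
}
```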
Btw, it might be good to document this somewhere (not sure where though?). I saw someone run into a problem because they used wasi-libc and the code couldn't run in a VM that didn't support bulk memory. The error message "invalid section 12" (section 12 is the DataCount section that bulk memory introduces) doesn't really help point people toward rebuilding wasi-libc, unfortunately...
If you have any ideas about where we could document this such that a user would find it when they need it, I'm open to doing so.
Closes #262.