Adaptive vector lengths without global state #21
This is a good idea, definitely better than what we were able to come up with in #4. I think it should definitely be investigated as a potential solution. I would still want to establish a baseline with the "exposed length" style, to have something to compare to, especially since the rest of the operations (aside from the length) would not depend on this too much. The obvious challenge from a spec perspective would be introducing a new kind of non-determinism. But that should not be a deterrent, since any proposal hiding the length would have to be non-deterministic. Another thing worth mentioning is that some work would be needed on tools producing wasm. Existing backends support vector instructions, even variable-length ones, but not the control flow we would need for this. This should not be a deterrent either - there are precedents of wasm operations diverging from what native targets do.
Such a design seems too simple to handle most SIMD/vector algorithms because of the lack of inter-lane operations. The prohibition of nested parallelism is also problematic for quite a few algorithms. Matrix multiplication is a really simple example that requires both inter-lane operations (either broadcast or reduce) and nested parallelism, and I fail to see how one would implement it with such an interface.
The nested-SIMD restriction is mainly motivated by my understanding of the limitations of hardware platforms. It should be straightforward to lift, if we wish, since there's no global state. The question is, what algorithms would use nested SIMD, and how would they map to various architectures? Either way, nesting with regular broadcast and reduce would be straightforward -- broadcast would be an instruction which takes a scalar operand and produces a vec. Even shuffles could work: for example, if you set the …
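For concreteness, here is a plain C model of what broadcast and an add-reduction would mean for a vector of vl lanes. This is my own illustration with made-up function names, not proposed wasm semantics; it just spells out the per-lane behavior being discussed.

```c
#include <stdint.h>
#include <stddef.h>

/* Broadcast: every one of the vl lanes receives the scalar operand. */
static void broadcast_u32(uint32_t *lanes, size_t vl, uint32_t x) {
    for (size_t i = 0; i < vl; ++i) {
        lanes[i] = x;
    }
}

/* Reduce (add): fold all vl lanes into a single scalar result. */
static uint32_t reduce_add_u32(const uint32_t *lanes, size_t vl) {
    uint32_t s = 0;
    for (size_t i = 0; i < vl; ++i) {
        s += lanes[i];
    }
    return s;
}
```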
Since the length is tied to lane size, how would this work for widen or narrow instructions?
The loop could look like this:

```wat
(local $A i32) (local $C i32) (local $n i32)
(local $vl vec_len.32)
...
local.get $n
vec_loop 32 $vl                                             ;; start vector loop processing (at most) 32-bit elements
  (local.set $t0 (vec_load $vl 16 (local.get $A)))          ;; Load 16-bit elements; return type vec<16>
  (local.set $t2 (vec_extend_s 16 32 $vl (local.get $t0)))  ;; vec<16> -> vec<32>
  (vec_store $vl 32 (local.get $C) (local.get $t2))         ;; Store with 32-bit elements
  (local.set $A (vec_step $vl 2 (local.get $A)))            ;; Step by 2 bytes for the input array
  (local.set $C (vec_step $vl 4 (local.get $C)))            ;; Step by 4 bytes for the output array
  (local.set $n (vec_step $vl -1 (local.get $n)))           ;; Decrement the loop counter
  (br_if 0 (local.get $n) (local.get $n))                   ;; pass the count back to the top
end
```
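For readers less familiar with the wat syntax, this is a minimal scalar C reference of what the loop above computes (my own sketch; the function name is made up): it sign-extends 16-bit elements to 32 bits.

```c
#include <stdint.h>

/* Scalar reference for the vec_loop sketch above: load 16-bit elements,
 * sign-extend them to 32 bits, and store them, n elements in total. */
void widen_s16_to_s32(const int16_t *A, int32_t *C, int n) {
    for (int i = 0; i < n; ++i) {
        C[i] = (int32_t)A[i]; /* sign extension, as vec_extend_s would do per lane */
    }
}
```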
The issue I see here is that you lose parallelism when processing smaller elements, as your registers are not full. How would you deal with functions? If you have to pass …
I think it would be super useful to have intra 128-bit shuffles. But if you only deal with shuffles inside 128-bit "lanes", then there is a bunch of algorithms that would not really be implementable. Prefix sum is an example. If you want a toy program to experiment with your design, I would suggest the following one:

```c
static const uint8_t CNT_LUT[16] = {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4}; // LUT can fit into a 128-bit register

void popcount_prefixsum(const uint8_t*restrict A, uint16_t*restrict B, int n) {
  uint16_t sum = 0;
  for (int i = 0; i < n; ++i) {
    uint8_t a = A[i];
    uint8_t cnt_low = CNT_LUT[a & 0xf];
    uint8_t cnt_high = CNT_LUT[a >> 4];
    uint8_t cnt = cnt_low + cnt_high; // cnt = popcount(a)
    sum += cnt;
    B[i] = sum;
  }
}
```

I consider this to be a good primary test for the usefulness of a SIMD ISA, as it exposes a lot of features (mixed type lengths, LUT, reduction, rotation) that people would expect from a SIMD ISA. There are a few other features missing, like masks, compress/expand, gather/scatter, and conflict detection, but that is a good start. In fact, SSE2 does not pass this test, which reflects the shortcomings of that ISA.
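For reference, the kernel above is just a popcount prefix sum; a plain scalar restatement using the compiler's popcount builtin (GCC/Clang assumed; my own sketch) makes the expected result explicit.

```c
#include <stdint.h>

/* Scalar restatement of the toy kernel: B[i] holds the running total of
 * set bits in A[0..i]. The LUT version above computes the same thing. */
void popcount_prefixsum_ref(const uint8_t *restrict A, uint16_t *restrict B, int n) {
    uint16_t sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += (uint16_t)__builtin_popcount(A[i]);
        B[i] = sum;
    }
}
```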
One possible way to model this would be to add a half operation on vec_len values:

```wat
(vec_loop 16 $vl
  (local.set $half_vl (half (local.get $vl)))
  (local.set $t0 (vec_load $half_vl 32 (local.get $A)))                ;; Load 32-bit elements; return type vec<32>
  (local.set $t1 (vec_load $half_vl 32 (local.get $A)))                ;; Load more 32-bit elements; return type vec<32>
  (local.set $t2 (vec_wrap 32 16 $vl (local.get $t0) (local.get $t1))) ;; vec<32> -> vec<16>
  ...
```

The key is that …
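For reference, a scalar C sketch of the narrowing this fragment is driving at (my own illustration; the function name is made up):

```c
#include <stdint.h>

/* Scalar reference: read 32-bit elements and wrap (truncate) them to
 * 16 bits, n elements in total, as vec_wrap would do per lane. */
void narrow_u32_to_u16(const uint32_t *A, uint16_t *B, int n) {
    for (int i = 0; i < n; ++i) {
        B[i] = (uint16_t)A[i]; /* modular truncation */
    }
}
```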
I'm not very familiar with these kinds of algorithms. Has anyone prototyped this kind of prefix-sum algorithm with the other flexible SIMD proposals? I'd be curious to see how this looks.
I finally found time to give it a thought, and your design might just work. The syntax looks a bit weird, but it seems to solve the issues we are trying to solve. The problem I see with your design is the necessity to have a remainder for some ISAs (so for all WASM code in practice). If you implement a masked remainder to alleviate this issue, what is the actual gain of your vec_loop? To nuance my comment, I think your main goal of avoiding global state is really good, and it is one goal you and I have in common. It's just that I think masked operations are pretty much necessary for other reasons, and could also be used for loops.
The … At the wasm level, …
In theory, I would agree. But in practice, it seems impractical to require WASM engines to change the meaning of the code depending on whether it is the remainder or not. This is better achieved by compilers, because they have time to reason about the code.
This is not compatible with something you said earlier about having …
This looks super complex to implement in a WASM engine that has basically no time to handle those. It would indeed need to rebuild the semantics of the code (building the AST and the SSA form) in order to change the meaning of the code and make the actual decision. Unless there is something I am not aware of, this looks very impractical. Also, there are some hardware solutions that cannot fit with this model. The one I have in mind is the global vector length, because you would not be able to handle nested loops.
In general in wasm, engine compile time is an important concern. For wasm SIMD though, there's a good argument for letting engines take (somewhat) more time. On one hand, achieving portability, efficiency, and practicality in a single SIMD design is already very difficult, and the current designs are already making some major tradeoffs. Adding stringent compile-time restrictions risks making these worse. On the other hand, SIMD is typically a very small portion of most applications (in static code terms), so making the SIMD compiler slower won't affect most code. So if taking additional compile time on SIMD code gives us better performance, portability, or practicality, it's worth considering. The design proposed here would likely require JITs to build an IR to perform many of the optimizations we want here. However, all major JIT-based wasm engines' top-tier backends already build IRs.
Yes. Here's an overview of how …
* Other implementation strategies are possible; I just picked some strategies to serve as examples.
I wrote a bit about this in the "Nested vec_loops" section of the proposal.
Ok, that is much clearer to me. So you would indeed have dynamic … This makes total sense, and the translation overhead looks reasonable. I can now totally see how WASM code would be translated into machine code (except maybe RISC-V V, but that's because I don't know enough about it). About nesting: yes, I was thinking about nesting vec_loops. Re-reading your proposal, there is something a bit odd: "Using this local outside of a vec_loop is prohibited."
Indeed. And an implementation which unrolls loops could also use lengths greater than the target architecture's nominal length on some loops. Forbidding …
The thing is, if you explicitly forbid …
You could want to build an in-register LUT (like for emulating popcount), and you just happen to have a function that takes a … If …
That's another problem. It is interesting and must be solved at some point, but it is orthogonal to the discussion we have now. Let me explore something with you. It seems that forbidding … Having multiple "active" … Now, this design is problematic when the target architecture handles … As I mentioned earlier, we need a way to process types with different sizes in the same loop, and you proposed … In fact, this is not different than having multiple … Also, I've looked back at your example with multiple type sizes:
And this example only works when … And processing multiple type sizes at the same time is really common in signal processing, and also in AI if I'm not mistaken. In conclusion, I really believe we need to have multiple "active" …
Note that RVV's LMUL (ganging together 2, 4, or 8 registers, which are all affected by a single instruction) avoids the need for many (most?) setvl instructions. Even after promoting u16->u32, the number of lanes has not changed, only LMUL has doubled. There is also mf2..8 for halves/quarters/eighths, but I have not yet seen these supported in intrinsics.
You still need … In the documentation (risc-v-spec p.24), they explicitly call … That being said, it is true that with LMUL, the need for more registers after promotion vanishes. But that is basically the only ISA I know of that can do that.
Looks like there have been some changes. VLH from your link no longer exists in 1.0-draft, and vtype (including LMUL) can indeed only be set by vsetvli (the special form with x0 in/out that changes only vtype, not VL). Further changes are afoot: intrinsics seem to be changing to accept an AVL parameter and the compiler (in non-wasm usage) would be in charge of emitting vsetvli when it has changed. And to add yet more unclarity: store ops encode an EEW (effective element width), so it seems we could have LMUL=4, widen implicitly to LMUL=8, then store without vsetvli in between?
Agreed.
A pre-computed vector value would need to have a particular length, which would require the implementation to use that length for all iterations of all loops which could potentially use that value. One of the goals of … For LUT-like use cases, some possible options which would work with …
You seem to want that … A really simple piece of code that should work is a reduction (written in Neon for the sake of simplicity):

```c
uint32_t scalar_sum(const uint32_t * A, int n) {
  uint32_t s = 0;
  for (int i = 0; i < n; ++i) {
    s += A[i];
  }
  return s;
}

uint32_t neon_sum(const uint32_t * A, int n) {
  uint32x4_t s = vdupq_n_u32(0); // s has a length of 4
  int i = 0;
  for (; i < (n & -4); i += 4) {
    uint32x4_t a = vld1q_u32(&A[i]);
    s = vaddq_u32(s, a); // s has a length of 4
  }
  if (i < n) { // remainder
    // generate mask
    uint32x4_t I = {0, 1, 2, 3};
    I = vaddq_u32(I, vdupq_n_u32(i));
    uint32x4_t mask = vcltq_u32(I, vdupq_n_u32(n));
    // masked load
    uint32x4_t a = vld1q_u32(&A[i]); // assume A is aligned to 16
    a = vandq_u32(a, mask); // a has a "length" of 1, 2, or 3
    s = vaddq_u32(s, a); // s has a length of 4
  }
  // s has a length of 4
  return vaddvq_u32(s);
}
```

As you can see here, we need to define … This can work with your design only if … That's why I proposed that you play with scan algorithms: they have a loop-carried dependency, and they are super common in signal and image processing. With your limitations, I fail to see how one could implement them with your design.
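For comparison, here is one way a reduction can be written without a loop-carried vector at all, by keeping the accumulator as a scalar and horizontally reducing each iteration's elements. This is a sketch of one possible strategy, not necessarily what either participant has in mind; it assumes AArch64 NEON and the function name is made up.

```c
#include <arm_neon.h>
#include <stdint.h>

/* Reduction with a scalar accumulator: each iteration's partial sum is
 * folded into s with a horizontal add, so no vector value has to live
 * across iterations (and thus no fixed vector length is required). */
uint32_t neon_sum_scalar_acc(const uint32_t *A, int n) {
    uint32_t s = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        uint32x4_t a = vld1q_u32(&A[i]);
        s += vaddvq_u32(a);      /* horizontal add of this iteration's lanes */
    }
    for (; i < n; ++i) {         /* scalar remainder */
        s += A[i];
    }
    return s;
}
```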
That is actually compatible with …
First, you simply cannot have specialized instructions for all the use cases, so my example is popcount, but it could have been anything else. And no matter what kind of specialized instruction you could come up with to generate LUTs and whatnot, there will always be some use cases not handled by your specialized instructions. If you provide a way to "algorithmically" generate any kind of vector, nice, you have just invented a new language within the language, with a new level of complexity. Also, LUTs are kind of special because the length of the LUT vector is not correlated to the length of the data it is applied on. For instance, with the popcount LUT I gave as an example, even if you only have a vector of 8 byte elements, you would still need the 16-element LUT. Also, you seem to have disregarded half of my comment, but this other half is at least as important as the one you replied to. All in all, your proposal looks half-baked: the starting point looks good, but you should really try to see all the implications of your model, and try to "implement" common SIMD algorithms with it to see if it can work in practice. So please, take back the example algorithm I gave you about the prefix sum of popcounts with mixed element sizes, and see how you can make it work with your design. You could even replace the prefix sum with a reduction; that would already be a good test.
When you mentioned scan algorithms earlier, I replied:
I'd be interested in your answer here. You're right that my proposal here as it stands does not support loop-carried dependencies. The neon reduction example is indeed something that … Otherwise, let's talk about how these popcount/scan/mixed-type examples look in other flexible-length-vector programming models. Precomputing a vector of arbitrary values, and loop-carried dependencies, will be "interesting" in any flexible-length proposal. I didn't reply about mixed-element-type loops yet because we still seem to be discussing the basic mechanism of …
Here would be an implementation in SVE:

```c
#include <arm_sve.h>
#include <stdint.h>

void prefix_sum(uint32_t const* A, uint32_t* B, int n) {
  svuint32_t last = svdup_n_u32(0);
  svuint32_t zero = svdup_n_u32(0);
  uint64_t vl = svcntw();
  for (int i = 0; i < n; i += vl) {
    svbool_t mask = svwhilelt_b32(i, n);
    svuint32_t a = svld1_u32(mask, &A[i]);
    // in-register prefix-sum
    for (int shift = 1; shift < vl; shift <<= 1) { // this loop could be unrolled on fixed-size ISAs
      svbool_t splice_mask = svwhilelt_b32(0, shift);
      splice_mask = svbnot(svptrue_b32(), splice_mask);
      svuint32_t sa = svsplice_u32(splice_mask, zero, a); // shift elements by "shift" amount
      a = svadd_m_u32(mask, a, sa);
    }
    // propagate previous sum
    a = svadd_m_u32(mask, a, last);
    last = svclastb(mask, a); // broadcast last element to all lanes
    svst1_u32(mask, &B[i], a);
  }
}
```

I cannot tell you exactly how one would implement such an algorithm with the other flexible-vectors proposals because there are operations missing. But my mental model around flexible vectors is, and has always been, close to what SVE is, and I thought that was commonly admitted. To make my view explicit: the target architecture has a constant vector width. This width is not known at compile time, but it is known at translation time. However, you can have an "effective" vector length (possibly implemented with masks) to apply operations on fewer elements. This view is actually compatible with your proposal as stated in the first post of this issue (even if I did not notice it from the start).
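For readers who don't know the SVE intrinsics, the scalar computation being vectorized above is just an inclusive prefix sum (my own reference sketch, for clarity):

```c
#include <stdint.h>

/* Scalar reference: B[i] = A[0] + A[1] + ... + A[i] (inclusive prefix sum). */
void prefix_sum_scalar(const uint32_t *A, uint32_t *B, int n) {
    uint32_t sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += A[i];
        B[i] = sum;
    }
}
```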
This is a big issue as many algorithms require loop-carried dependencies. And to me, it is essential.
This can be implemented in a …
I'm not opposed to a runtime …
Concerning LUTs and scans, as soon as you have a single internal vector length, those are "just" a matter of adding new instructions, nothing that really needs a new conceptual view. Also, loop-carried dependencies just work when you have a single internal vector length. Mixed types are a bit more complex, but if you can have multiple "effective lengths" at the same time, it should not be an issue either. Precomputing a vector of arbitrary values at runtime should not be an issue in any design because it should just be calling SIMD instructions. SIMD can exist outside loops. Precomputing at compile time or translation time needs more thought, but is not tied to this proposal.
Even if I think it is directly related to the core issue I mentioned, I agree to drop it for now. I think all the issues I pointed out revolve around the single fact that you really want to have vectors with really different lengths at run time. And I think having …
Thanks for that example. One thing I should be clearer about is my interest in seeing whether it's possible to avoid implicit state. As I mentioned at the outset, "one thing this proposal does do is avoid having nondeterminism which has to be constant for the duration of the program." As I understand it, an SVE-style VL is not mutable, but it is state that needs to be available and shared between all wasm instances that a SIMD computation might span. This places constraints on suspending and resuming programs on machines with different computing resources, on distributing programs across networks, on tiering execution strategies, and other things. As an example, an implementation with a powerful and long but power-hungry vector accelerator can't decide whether to use that accelerator based on runtime power-usage conditions; it has to run all loops at the same length, which might make the fallback inefficient. While thinking about this, it occurs to me that the SVE model could perhaps be improved in this respect by adding an instruction which allows programs to declare when they've finished a SIMD kernel, meaning that implementations could pick a new VL the next time they enter a SIMD kernel.
Besides implicit state, here's an attempt to summarize the high-level differences between the SVE model and …
Indeed, this view was implicit up until now. In your proposal, you mentioned global state, not implicit state.
It depends on what you call VL: if it's the actual number of elements stored in a SIMD register, then yes, it is constant. But if VL means the number of elements on which instructions operate, then VL can be mutated (but still has a constant maximum value). To me, there is no real way around it if we want to support out-of-loop SIMD initialization and loop-carried dependencies. But if you find a way around it, I would be glad to hear it. BTW, RISC-V V also has such implicit state: "Elements in any destination vector register group with indices ≥ …"
This concern is interesting, but it is far beyond WASM's scope and looks like a research topic on its own. In fact, as far as I'm aware, there is no runtime in any language that can do this kind of "ISA" hot swap. Sure, there are some runtimes that have multiple versions of the same kernel for different targets and can select where to run a kernel at runtime, but this still requires one manually specialized version per "accelerator". This would be completely different from what you propose (and might actually be much simpler to implement in the engine). Also, how is the situation any different from native code? The goal of WASM is to be close to native. Native cannot do that, so I wonder why WASM would require it.
I don't get how it is an improvement, but yes, it could be done. The question is: how do you deal with nesting SIMD contexts?
While I mostly agree that …
With SVE, you cannot really have different code for different VL, except with runtime ifs, but that would be the same for all "flexible" vector ISAs.
Ok, yes, you got me; when I said loop-carried dependencies, I should have clarified that reductions and dot products and other useful cases of loop reductions can be handled. I even described how reductions would work earlier in this thread. And yes, I'm aware that RISC-V has state. The proposal here wouldn't expose that state to wasm directly. And yes, I'm aware that …
https://github.com/bytecodealliance/wizer for example.
If you want native code, use native code.
I checked the thread again just to be sure, and no, you never mentioned how reductions would work. You even said: "The neon reduction example is indeed something that …" I am genuinely curious to know how you would handle reductions, because I have no clue, and reductions are an important part of many SIMD algorithms.
So how …
You missed my point: an approach that is less flexible for programmers would be OK if said programmers could still use it. But as it stands, most will not be able to use it, because it lacks basic features, which makes most SIMD algorithms just impossible. I am still waiting for you to show me how one could handle even a simple reduction.
This one is interesting, but it is much more specific than what you explained earlier. It is not some general "ISA hot-swap"; it is basically two programs (with the same source code), one that generates constant data and runs once, and one that uses those data and runs multiple times.
From the webassembly.org main page:
One goal of WASM is to have a portable binary format that is as fast as native code. Automatic and transparent offloading of code onto accelerators is not a WASM goal.
Firstly, please do keep discussions productive. Second, getting rid of global state may not be a goal on its own, but avoiding exposing the native vector length definitely is, which is what this idea strives to achieve, and so far this is our best shot at that. Whether or not a program can infer global state is very important from the upstream spec point of view, so we have to attempt to evaluate that - in some sense, that is required for moving up the stage ladder.
Right, I should stop replying to every comment part that I want to reply to and keep the discussion focused. So let me expose my point of view on this in a single block. The goal of such a thread is to explore this new model to see if it can work in practice. However, "ISA hot-swapping" and off-loading are non-issues. They try to solve a problem we don't have (at least for the very large majority of us). So we could either keep the runtime-constant VL visible and implement most SIMD algorithms, or we could go the route where no target VL is visible at all and many SIMD algorithms won't fit. The latter also comes with a more complex WASM engine. Therefore, the critical issues I exposed must be addressed. Otherwise, this vector model will simply not be used. As a side note, I believed that an SVE-like model was consensual here, and I explained all my arguments in light of this. This thread proved me wrong, so I will write an SVE-like model proposal in the future, but don't expect anything from me for the next week at least.
Here are some more ideas that cover several topics: nested SIMD, scalar library routines, LUTs, different forms of reductions, and a complete …
@sampsyo has added examples of benchmarks in #5. @sunfishcode and @lemaitre, feel free to take a look - it would be great to see what the new operations would help with, compared to this proposal's baseline. @lemaitre - an SVE-like proposal would be very welcome.
@sunfishcode I finally found some time to look at your suggestion.
Have there been any more thoughts on this? This kind of loop construct looks like a concise stepping stone from fixed to flexible, and also flexible from a compiler-strategy POV. The first proposal here is also similar to how we handle vector loops in Arm's MVE. I've just begun trying to implement flexible types in cranelift, likely targeting NEON first, and this loop proposal looks very appealing. As a side question, I'm not sure if I just failed to read the spec properly or if it hasn't been updated... but what are the rules around what 'flexible' means? Is there a minimum width (128 bits), with the minimum being a factor of any other supported size?
It looks like we have moved in the other direction; there is a PR (which needs changes) to introduce more vector-friendly behavior, as opposed to loop instructions: #27. The idea expressed in this issue solves the same problem but at a much higher level. Yeah, the minimum width is 128 bits; it is (sort of) implied by backwards compatibility with 128-bit SIMD, but I think we should make it explicit.
To be clear, the proposal above does not have a 128-bit minimum or required factor. It intends for VMs to insert remainder loops or use masking as needed, and gives them the flexibility to do so, in whichever manner is best for the target architecture. If there's interest in this proposal, I believe the various objections raised above can be answered. It does require significantly more complex language features, but it provides more flexibility to VMs.
Exactly, it's the flexibility (portability) that I'm really interested in here. I'm thinking about this from a purely compiler/runtime engineer perspective, and generating a remainder loop for an architecture that doesn't support efficient masking sounds much easier than converting a predicated vector loop into a non-predicated one along with a scalar remainder. As does unrolling on a core with multiple vector pipes. Getting performance-portable SIMD is already hard enough without predication, and I'm not sure I'd want to be writing the wasm cost model for LLVM's vectorizer :) What is the expectation of how users will use fixed vs flexible vectors? As an autovec target, I'm presuming flexible would be used any time the trip count is unknown at compile time, and I'm concerned that would mean that fixed SIMD is barely used, potentially reducing the performance of traditional SIMD engines to that of a scalar loop. Though feature detection and loop versioning could solve this problem.
What do you mean by traditional SIMD engine? Flexible vector operations are meant to lower to regular SIMD instructions at the time the module is compiled by the engine. TBH, maybe the spec is not clear on some of those things, and those of us who have been working on Wasm SIMD may take some of the features for granted; therefore, your feedback would be very valuable.
The tradeoff is in shifting work between producer and consumer. For vector-like operations it is on the producer (the toolchain that produces Wasm), while for first-class vector loops it is on the consumer (the runtime). Historically we have been leaning towards making the producer do more work; the motivation is that the producer runs once when the module is built, but the consumer runs every time it is used (on the Web that usually means on every page view).
Even calculating whether to use masked loads/stores, in the target-specific compiler, is a non-trivial task. So my gut feeling is that it will be hard to support effectively, on the majority of current CPUs, via an agnostic ISA. The fixed-width proposal introduced instructions that could be mapped efficiently to the popular vector extensions, but the same cannot be said if we introduce explicit masking into wasm. I'd say it's too early to call NEON legacy when there isn't a consumer device that supports SVE, and LLVM will, in some cases, still use NEON even when SVE is supported. This proposal doesn't look as powerful as introducing SVE-like operations, but it would enable a good subset of cases where we just want to vectorize and take advantage of the varying widths of each vector extension. That's why I see it as a good stepping stone for today's cores, before we get to being able to take advantage of a full-blown vector machine.
Can you elaborate a little bit, what calculations about using masks are you referring to and how do they apply here? Is this purely about NEON/SSE vs SVE/AVX?
To provide background, the point about SVE being a valid target came from @akirilov-arm (#27 (comment)):
Some clarity on this would be greatly appreciated 😉 BTW, supporting instructions beyond 128 bits has similar challenges on x86, where SSE also has no masks, and there is still some hardware that only has SSE. That said, I feel like there are two problems discussed here: efficient support for "advanced" ISAs (AVX*, SVE) and fallbacks for less advanced 128-bit ISAs like SSE and Neon.
Again, from a purely auto-vec compiler engineer perspective, it takes some effort to model the costs of different vectorization strategies - including whether using the more advanced vector operations, such as masks and gather/scatter, would be beneficial. Here is the X86 backend implementation in LLVM for masked operations, and here is the costing for gather/scatter. It should be noted that, even though AVX provides these operations, it doesn't mean a compiler would use them regardless of the situation. The backend implementation for AArch64 is significantly different, because NEON just can't do these things efficiently and SVE code generation isn't good enough yet. Arm supports these features in microcontrollers, but again, a lot of effort has to go into making a reasonable decision about the real costs of using them.
Yes, there have (finally!) been product announcements, but my point is that it will probably be 5 years until the majority of people browsing the web on an Arm device have SVE support. My assumption here is that it's a goal of wasm to support a common subset of user CPU features, but currently all phones, tablets, and most (Intel and Arm) Chromebooks wouldn't meet this criterion. Again, feature detection would solve this problem if we're happy with the increased binary size.
Definitely. My suggestion would be to reduce the scope of the flexible proposal, focusing on how we can support wider vectors in a performance-portable manner. Another future proposal could introduce the more advanced operations of masking and gather/scatter, when these are supported by the majority of devices. Or we could make the proposal dependent upon feature detection, but it feels like we could still decouple sizeless vectors from 'advanced' support.
Just to follow up on this... I've been made aware that LLVM costs gather/scatters so high that they're almost disabled for X86, unless AVX-512 is supported.
Using the vector loop approach would move vectorization logic into runtimes; I don't think that is desirable or in line with how WebAssembly has been approaching similar problems. Some mask support is desirable for ISAs that have them, mainly because there are comparison operations that return them. This does not mean that we have to unlock the full power of masked operations right away.
I am not sure what you mean by that; I don't think there are any scatters or gathers among the operations listed in the spec (keep in mind that instructions in higher tiers are not 'in' until we can test them).
Sorry, I have been conflating masked memory operations and gather/scatter, but I wanted to show that, just because an architecture has some instructions, it doesn't mean it's a good idea to use them. My general observation of the wasm instruction proposals is that it's shown how an instruction maps to each target ISA, which is probably fine most of the time, but evidently not in more complicated cases, such as vector memory ops. A DSP engineer with good knowledge of the target (micro)architecture is generally going to be much better at writing a kernel using assembly/intrinsics than what a compiler can manage, because it's still frustratingly difficult to evaluate a loop as a whole. And this is with target-specific information.
Agreed. To reiterate, I only have an issue with masked memory instructions, but how do you intend on controlling the 'unlocking'? I have, naively, assumed that breaking the spec up is the only way to do this... Or do you plan to formalize the current different tiers at different times?
But the wasm code here is still representing a vectorized loop - it's just up to the target backend to select a width and decide what it wants to do to handle a remainder... A high-level vec_loop structure makes this easy/fast and doesn't require vectorization, but it would likely need a pass to operate on the loop.
This has been the dilemma of Wasm performance extensions from the beginning - it is true that somebody with knowledge of microarchitecture would use intrinsics or write assembly, but the approaches on different architectures end up somewhat far from one another. There is a project, Highway, that shows that compromises are possible, and performance-critical code can be written in this manner.
The idea is to have a good vision about non-masking operations (or at least make headway) before trying to work masking ops out. The tiers (which don't include any masking ops at all at the moment) are meant to be collapsed into one eventually, with the features not making the cut becoming "future work".
That is precisely what I would like to avoid adding to the runtime. It won't be just a simple instruction selection pass either, particularly because loops can be nested and multiple kinds of such constructs are necessary.
Context: #4, #20. Global state is problematic, but even if we make it explicit, the ability to have arbitrary length switches is hard to implement on many architectures. The following is a sketch of a design which avoids both of these, while preserving some key advantages of flexible vectors.
Similar to the explicit loop variant of @aardappel's post here, it has a vec_loop, but this proposal is lower-level, making state explicit and avoiding unnecessary restrictions, while still hiding information as needed by implementations. I haven't prototyped this, but in theory it should be implementable on RISC-V V, SVE, AVX-512, as well as simd128-style architectures.

Implementations with dynamic vectors or masking could use them. But implementations could also choose to emit multiple loops, to handle alignments or remainders. Such implementations could also choose to run some vector loops at twice or more the hardware length, making a register-pressure vs. speed tradeoff as they see fit.
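To make the "multiple loops" strategy concrete, here is a rough C sketch of the code shape an engine without masking support might emit for a vec_loop over i32 additions. This is my own illustration: the block size 4 stands in for the hardware vector length, and a real engine would emit SIMD instructions rather than an inner scalar loop.

```c
#include <stdint.h>

/* Main loop processes a full hardware-width block per iteration (vl = 4 here);
 * the remainder loop processes the leftover elements one at a time (vl = 1). */
void add_arrays_blocked(const int32_t *A, const int32_t *B, int32_t *C, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {      /* "vector" loop: vl is 4 on every iteration */
        for (int j = 0; j < 4; ++j) {
            C[i + j] = A[i + j] + B[i + j];
        }
    }
    for (; i < n; ++i) {              /* remainder loop: vl is 1 */
        C[i] = A[i] + B[i];
    }
}
```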
New Types:
- vec<n>: a flexible vector of element bitwidth n
- vec_len<n>: a scalar but opaque element count, limited to n as the widest bitwidth

New Opcodes:
- vec_loop<vl, n>: like loop, but
  - vl: immediate identifying a local variable to set to the number of elements being processed within an iteration of the loop, with type vec_len<n>. Using this local outside of a vec_loop is prohibited.
  - n: immediate which declares the bitwidth of the widest element type which will be processed in the loop
- vec_step<vl, scale>: (i32) -> (i32)
  - vl: immediate identifying a vec_len local variable
  - scale: immediate scaling factor
  - computes operand + vl * scale
- vec_load<vl, n>, vec_store<vl, n>: like load and store, but
  - vl: immediate identifying a vec_len local variable
  - n: immediate specifying the vector element bitwidth to use
- The arithmetic operations from simd128, with a similar vl immediate added.

Example
Add two arrays of length $n starting at $A and $B and store the result in an array starting at $C.
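As a scalar reference for this example (my own sketch, assuming i32 elements), the computation a vec_loop version would perform, vl elements at a time, is:

```c
#include <stdint.h>

/* Scalar reference for the example: C[i] = A[i] + B[i] for n elements. */
void add_arrays(const int32_t *A, const int32_t *B, int32_t *C, int n) {
    for (int i = 0; i < n; ++i) {
        C[i] = A[i] + B[i];
    }
}
```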
Nondeterminism

The length in each iteration of a vec_loop is nondeterministic. It's visible in vec_step, in the number of elements loaded and stored in vec_load and vec_store, and in any side effects of the loop. It's expensive to completely hide the hardware vector length with any flexible-vector proposal, so whether or not there should be nondeterminism is an independent question.

One thing this proposal does do is avoid having nondeterminism which has to be constant for the duration of the program. Wasm doesn't currently have any nondeterminism that behaves like that, and it would have implications for suspending a program and resuming it on another architecture, or for distributing a program over multiple architectures.
Vector subroutines

A call within a vec_loop could make sense if the callsite passes the vec_len value to the callee. Since vec_len is a type, it'd be part of the function signature, so implementations could compile the callee to be called from within a vector loop without whole-program analysis.

Nested vec_loops

Some architectures wouldn't be able to implement arbitrary nested vector parallelism, so one possible approach would be to prohibit it. Lexical nesting is easy to detect, and dynamic nesting -- a call inside a vec_loop calling a function that contains another vec_loop -- could be prohibited if we require calls in vec_loops to have exactly one vec_len argument, and prohibit vec_loop inside functions with a vec_len argument.

Masks, shuffles, extracts and inserts, strided loads/stores
There are ways this design could be extended to include these, but I left them out to keep things simple to start with. It's mostly an independent question whether these can be implemented efficiently.