What are the requirements for a benchmark line item? #1
In CG meetings, including the face-to-face in La Coruna, we've discussed what the requirements of a benchmark line item should be. I'm filing this issue to attract and distill discussion around the topic and build consensus for what those criteria should be.
Some ideas that have been floated:
Other thoughts / ideas? |
Comments
I propose to also require that the benchmark results can be programmatically fetched/consumed (in the case where the benchmark produces its own results and we're not measuring its performance in an external way). I am thinking about one particular benchmark we've used where the results would be rendered into a canvas, failing this criterion and making it really hard to get insightful information.
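As a minimal sketch of what "programmatically consumable" could mean in practice, assuming a hypothetical convention where each benchmark emits its results as JSON (the results.json name and its fields are illustrative only, not an agreed format):
import json

# Hypothetical results file written by a benchmark run; the schema below is
# only an example of a machine-readable format, not a proposal of record.
with open("results.json") as f:
    results = json.load(f)

for case in results["cases"]:
    # e.g. {"name": "factorial-recursive", "iterations": 1000,
    #       "mean_ms": 1.92, "stddev_ms": 0.05}
    print(f'{case["name"]}: {case["mean_ms"]:.3f} ms +/- {case["stddev_ms"]:.3f} ms '
          f'over {case["iterations"]} iterations')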
Require build instructions / build scripts
I think it's implied that builds should be entirely deterministic, that is, they should record the exact versions of the compilers / toolchains that created them, so it's easy to reproduce them on different machines, if not to compute a hash of the produced binaries and compare it against an expected hash. (Containers to the rescue!)
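As a rough illustration of that hash check (a sketch only; the file name and expected digest below are placeholders):
import hashlib

# Hypothetical map from produced benchmark binaries to their expected SHA-256
# digests, as they might be checked in alongside the benchmark sources.
EXPECTED = {
    "fac.wasm": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def verify(path: str) -> bool:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest() == EXPECTED[path]

for name in EXPECTED:
    print(name, "OK" if verify(name) else "MISMATCH: build is not reproducible")
|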
This is a great idea, and there's already precedent for consumable testing through wast scripts. Many engines, such as wasmi and cranelift-wasm, programmatically consume tests this way. A single benchmark definition could look like this:
(benchmark
(kernel ;; micro / kernel / application? / domain?
(name "factorial-recursive")
(description "Recursive factorial implementation. Benchmarks ... and ...")
(complexity (time "O(n)") (space "O(n)"))
(source "fac.watb") ;; new benchmark filename?
;; only recommendations
(warmup_iter 200)
(bench_iter 1000)
)
;; The function to benchmark
(assert_return
(invoke
"fac-rec"
(i64.const 20))
(i64.const 2432902008176640000)
)
)
The binary source could be formatted similarly to a binary module in wast files:
(module "fac-rec" binary "\00asm" "\01\00\00\00\01\04\01\60 ...")
And the written source code would be inside a textual module. The entire file would then look something like this:
(module "fac-rec" binary "\00asm" "\01\00\00\00\01\04\01\60 ...")
(module
;; Recursive factorial
(func (export "fac-rec") (param i64) (result i64)
(if (result i64) (i64.eq (local.get 0) (i64.const 0))
(then (i64.const 1))
(else
(i64.mul (local.get 0) (call 0 (i64.sub (local.get 0) (i64.const 1))))
)
)
)
)
(benchmark
(kernel ;; micro / kernel / application? / domain?
(name "factorial-recursive")
(description "Recursive factorial implementation. Benchmarks ... and ...")
(complexity (time "O(n)") (space "O(n)"))
(source "fac.watb") ;; new benchmark filename? keep .wast?
;; only recommendations
(warmup_iter 200)
(bench_iter 1000)
)
;; The function to benchmark
(assert_return
(invoke
"fac-rec"
(i64.const 20))
(i64.const 2432902008176640000)
)
)
Multiple benchmarks of the same kind could be described within the same file, e.g. different implementations of factorial. This would allow current engines to reuse the same code they test with, and then add logic for executing a benchmark on top of it (see the harness sketch below). It would be hard to model this approach for applications, though it may work for micro, kernel, and domain-specific benchmarks. Would it be appropriate to open an issue?
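As a minimal sketch of how a harness could act on the warmup_iter / bench_iter hints above, assuming the engine hands it some invoke callable for the exported function (nothing here is an agreed API):
import statistics
import time

def run_benchmark(invoke, args, warmup_iter=200, bench_iter=1000):
    # Warm-up iterations let tiering / JIT settle; their timings are discarded.
    for _ in range(warmup_iter):
        invoke(*args)
    # Timed iterations.
    samples = []
    for _ in range(bench_iter):
        start = time.perf_counter()
        invoke(*args)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

# Hypothetical usage, where fac_rec is the engine's handle to the exported
# "fac-rec" function:
#   mean_s, stddev_s = run_benchmark(fac_rec, (20,))
|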
I don't like the idea of a text representation of the binary. There should be a reference to an original binary file created by any toolchain. I also expect that the original sources are not in the WAT format: they can be in any language, and there can be multiple source files. I think a subfolder for the sources of every test seems more practical. |
Builds should be entirely deterministic
A docker registry could ensure that the same compilers/toolchains are used
every time, allowing anyone to exactly reproduce the build
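A sketch of what driving such a pinned build could look like, assuming a hypothetical wasm-bench-toolchain image published to a registry and pinned by digest (the image name, digest, and paths are placeholders):
import os
import subprocess

# Hypothetical toolchain image pinned by digest so that every run uses exactly
# the same compilers; the registry name and digest are placeholders.
IMAGE = "registry.example.org/wasm-bench-toolchain@sha256:0123456789abcdef..."

def build(benchmark_dir: str) -> None:
    # Mount the benchmark sources read-only, run its build script inside the
    # pinned container, and collect the outputs in <benchmark_dir>/out.
    src = os.path.abspath(benchmark_dir)
    subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{src}:/src:ro",
         "-v", f"{src}/out:/out",
         IMAGE, "/src/build.sh", "/out"],
        check=True,
    )

# build("benchmarks/factorial")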
|
I'm thinking that variance in startup time may also need our attention. Disabling tiering in WASM engines can make time measurements more stable, but it hides the real startup time from the user's perspective. In our previous experiments on Spec2k6, PolyBench, and OpenCV.js, we observed large variance in the startup time of WASM workloads, so maybe we have to find some way to handle it properly. Besides, since there are many candidates for benchmark cases, I'd like to limit the overall run time of the benchmark; a time-consuming benchmark is unfriendly to users. Maybe we can group the cases and allow people to run a single case or a subgroup.
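One way to surface that startup variance instead of hiding it, as a sketch (the instantiate callable stands in for whatever engine API compiles and instantiates a module; this is not a settled methodology):
import statistics
import time

def measure_startup(instantiate, module_bytes, runs=30):
    # Compile + instantiate from scratch on every run so that tiering and
    # caching effects show up as variance rather than being averaged away.
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        instantiate(module_bytes)
        samples.append(time.perf_counter() - start)
    mean = statistics.mean(samples)
    cv = statistics.stdev(samples) / mean  # coefficient of variation
    return mean, cv
|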
Agreed, the stability of the benchmark will be important; otherwise it may not be usable for fair comparison.
The binary release should also contain toolchain information, such as versions and build options, to be included as part of the benchmark result for performance comparison.
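For instance, the metadata shipped with a binary release might be captured like this (a sketch; which fields belong in such a record is exactly what would need to be agreed on, and the tools queried here are just examples):
import json
import subprocess

def toolchain_record():
    # Record the versions of the tools that produced the binaries, plus the
    # build options that were used; all fields here are illustrative.
    return {
        "clang": subprocess.run(["clang", "--version"], capture_output=True,
                                text=True).stdout.splitlines()[0],
        "wasm-opt": subprocess.run(["wasm-opt", "--version"], capture_output=True,
                                   text=True).stdout.strip(),
        "build_options": ["-O3", "--target=wasm32-wasi"],
    }

print(json.dumps(toolchain_record(), indent=2))
|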