
What are the requirements for a benchmark line item? #1

Open · titzer opened this issue Jul 10, 2019 · 7 comments
titzer (Collaborator) commented Jul 10, 2019

In CG meetings, including the face-to-face in La Coruna, we've discussed what the requirements of a benchmark line item should be. I'm filing this issue to attract and distill discussion around the topic and build consensus for what those criteria should be.

titzer (Collaborator, Author) commented Jul 10, 2019

Some ideas that have been floated:

  • Require source code
  • Require licensing of source code under a (set of) approved licenses
  • Require build instructions / build scripts
  • Require algorithmic description / code documentation
  • Require multiple workloads if applicable
  • For each line item, require both source and binary form, with the binary form being updated (relatively infrequently) in response to changes in the respective toolchains
  • Require each line item to perform self-validation of its outputs (i.e. correctness criteria)

Other thoughts / ideas?

bnjbvr (Member) commented Jul 10, 2019

I propose to also require that the benchmark results can be programmatically fetched/consumed (in the case where the benchmark produces its own results and we're not measuring its performance externally). I am thinking of one particular benchmark we've used where the results were rendered into a canvas, which fails this criterion and makes it really hard to extract useful information.
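
As a rough sketch of what "programmatically consumable" could mean, assuming a hypothetical results.json file whose name and field names are only illustrative, a harness could then do something like:

import json

# Hypothetical machine-readable output; the file name and schema (benchmark
# name plus per-iteration times in milliseconds) are only illustrative.
with open("results.json") as f:
    results = json.load(f)

for entry in results["benchmarks"]:
    times = entry["iteration_times_ms"]
    mean = sum(times) / len(times)
    print(f"{entry['name']}: mean {mean:.2f} ms over {len(times)} iterations")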

> Require build instructions / build scripts

I think this implies that builds should be entirely deterministic, that is, the exact versions of the compilers / toolchains that created them should be recorded, so that the binaries are easy to reproduce on different machines, ideally to the point of computing a hash of the produced binaries and comparing it against an expected hash. (Containers to the rescue!)
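
As a minimal sketch of that last point, assuming a hypothetical pinned container image and an expected-hash file checked in next to the benchmark (all names below are placeholders):

import hashlib
import os
import subprocess
import sys

# Hypothetical pinned build: the container image tag, build script and file
# names are only illustrative of the idea, not a concrete proposal.
subprocess.run(
    ["docker", "run", "--rm", "-v", f"{os.getcwd()}:/src",
     "pinned-wasm-toolchain:1.2.3", "/src/build.sh"],
    check=True,
)

with open("fac.wasm", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()
with open("expected_sha256.txt") as f:
    expected = f.read().strip()

if actual != expected:
    sys.exit(f"non-reproducible build: got {actual}, expected {expected}")
print("build reproduced bit-for-bit")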

fisherdarling commented Jul 10, 2019

> I propose to also require that the benchmark results can be programmatically fetched/consumed

This is a great idea, and there is already precedent for consumable testing through wast scripts. Many engines, such as wasmi and cranelift-wasm, programmatically consume tests through wat2wasm bindings. Wast scripts could be extended with a syntax for benchmarking that requires these line items to be defined as well.

A single benchmark definition could look like this:

(benchmark
    (kernel ;; micro / kernel / application? / domain?
        (name "factorial-recursive")
        (description "Recursive factorial implementation. Benchmarks ... and ...")
        (complexity (time "O(n)") (space "O(n)"))
        (source "fac.watb") ;; new benchmark filename?
         ;; only recommendations
        (warmup_iter 200)
        (bench_iter 1000)
    )
    ;; The function to benchmark
    (assert_return
        (invoke
            "fac-rec"
            (i64.const 20))
        (i64.const 2432902008176640000)
    )
)

The binary source could be formatted similarly to a binary module in wast files:

(module "fac-rec" binary "\00asm" "\01\00\00\00\01\04\01\60 ...")

And the written source code would be inside a textual module. The entire file would then look something like this:

(module "fac-rec" binary "\00asm" "\01\00\00\00\01\04\01\60 ...")

(module
  ;; Recursive factorial
  (func (export "fac-rec") (param i64) (result i64)
    (if (result i64) (i64.eq (local.get 0) (i64.const 0))
      (then (i64.const 1))
      (else
        (i64.mul (local.get 0) (call 0 (i64.sub (local.get 0) (i64.const 1))))
      )
    )
  )
)

(benchmark
    (kernel ;; micro / kernel / application? / domain?
        (name "factorial-recursive")
        (description "Recursive factorial implementation. Benchmarks ... and ...")
        (complexity (time "O(n)") (space "O(n)"))
        (source "fac.watb") ;; new benchmark filename? keep .wast?
         ;; only recommendations
        (warmup_iter 200)
        (bench_iter 1000)
    )
    ;; The function to benchmark
    (assert_return
        (invoke
            "fac-rec"
            (i64.const 20))
        (i64.const 2432902008176640000)
    )
)

Multiple benchmarks of the same kind could be described within the same file, e.g. different implementations of factorial.

This would allow current engines to reuse the same code they test with and then add logic for executing a benchmark instead. It would be hard to model this approach for applications, though for micro, kernel, and domain-specific benchmarks it may work.

Would it be appropriate to open an issue?

Horcrux7 commented

I do not like the idea of a text representation of the binary. There should be a reference to an original binary file created by any toolchain.

I also expect that the original sources will not be in the WAT format. They can be in any language, and there can be multiple source files. I think a subfolder for the sources of every test seems more practical.
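
For example, a layout along these lines (all names purely illustrative) would keep each line item's sources, build script and reference binary together:

benchmarks/
  fac-rec/
    src/            original sources, in whatever language they were written in
    build.sh        build script pinning toolchain versions and flags
    fac-rec.wasm    reference binary produced by that toolchain
    README.md       description, license, and expected outputs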

Warfields commented Jul 10, 2019 via email

jing-bao (Contributor) commented

I'm thinking that the variance of startup time may also need our attention. Disabling tiering in Wasm engines can make time measurements more stable, but it hides the real startup time from the user's perspective. In our previous experiments on Spec2k6, PolyBench and OpenCV.js, we observed large variance in the startup time of Wasm workloads, so maybe we have to find some way to handle it properly?

Besides, since there are many candidate benchmark cases, I'd like to limit the overall run time of the benchmark suite. A time-consuming benchmark is unfriendly to users. Maybe we can group the cases and allow people to run a single case or a subgroup.
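
On the startup-variance point, a rough way to quantify it across repeated runs could look like the following sketch (the engine command and module name are placeholders, not a specific proposal):

import statistics
import subprocess
import time

# Placeholder engine command and module name; any standalone Wasm engine CLI
# could be substituted here. Note this times the whole process, so it includes
# process startup as well as module compilation/instantiation.
CMD = ["wasm-engine", "benchmark.wasm"]

samples = []
for _ in range(30):
    start = time.perf_counter()
    subprocess.run(CMD, check=True, capture_output=True)
    samples.append(time.perf_counter() - start)

mean = statistics.mean(samples)
cv = statistics.stdev(samples) / mean  # coefficient of variation across runs
print(f"mean {mean * 1000:.1f} ms, coefficient of variation {cv:.1%}")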

TianyouLi commented

> I'm thinking that the variance of startup time may also need our attention. Disabling tiering in Wasm engines can make time measurements more stable, but it hides the real startup time from the user's perspective. In our previous experiments on Spec2k6, PolyBench and OpenCV.js, we observed large variance in the startup time of Wasm workloads, so maybe we have to find some way to handle it properly?
>
> Besides, since there are many candidate benchmark cases, I'd like to limit the overall run time of the benchmark suite. A time-consuming benchmark is unfriendly to users. Maybe we can group the cases and allow people to run a single case or a subgroup.

Agreed, the stability of the benchmark will be important; otherwise it may not be usable for fair comparison.

> Require build instructions / build scripts
> For each line item, require both source and binary form, with the binary form being updated (relatively infrequently) in response to changes in the respective toolchains

The binary release should also contain toolchain information, such as versions and build options, and that information should be part of the benchmark result for performance comparison.
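
For example, each result record could carry that metadata alongside the timings; a minimal sketch with purely illustrative field names and values:

import json
import platform

# Illustrative only: field names and values are placeholders for whatever
# toolchain metadata the group decides to record with each binary release.
record = {
    "benchmark": "factorial-recursive",
    "toolchain": {
        "compiler": "example-compiler",       # e.g. the C/C++ or Rust toolchain used
        "version": "<exact pinned version>",
        "flags": ["-O3"],                     # build options used to produce the binary
    },
    "host": platform.platform(),
    "iteration_times_ms": [1.92, 1.88, 1.90],  # placeholder timings
}
print(json.dumps(record, indent=2))

That way the toolchain provenance travels with the numbers it produced.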
