
What are the requirements for a benchmark line item? #1

Open · titzer opened this issue Jul 10, 2019 · 7 comments
titzer (Collaborator) commented Jul 10, 2019

In CG meetings, including the face-to-face in La Coruna, we've discussed what the requirements of a benchmark line item should be. I'm filing this issue to attract and distill discussion around the topic and build consensus for what those criteria should be.

titzer (Collaborator, Author) commented Jul 10, 2019

Some ideas that have been floated:

  • Require source code
  • Require licensing of source code under a (set of) approved licenses
  • Require build instructions / build scripts
  • Require algorithmic description / code documentation
  • Require multiple workloads if applicable
  • For each line item, require both source and binary form, with the binary form being updated (relatively infrequently) in response to changes in the respective toolchains
  • Require each line item to perform self-validation of its outputs (i.e. correctness criteria)

Other thoughts / ideas?

bnjbvr (Member) commented Jul 10, 2019

I propose to also require that the benchmark results can be programmatically fetched/consumed (in the case where the benchmark produces its own results and we're not measuring its performance externally). I am thinking of one particular benchmark we've used where the results were rendered into a canvas, which fails this criterion and makes it really hard to extract useful information.
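
As a rough sketch of what "programmatically consumable" could mean, assuming a hypothetical results.json file whose name and field names are only illustrative, a harness could then do something like:

import json

# Hypothetical machine-readable output; the file name and schema (benchmark
# name plus per-iteration times in milliseconds) are only illustrative.
with open("results.json") as f:
    results = json.load(f)

for entry in results["benchmarks"]:
    times = entry["iteration_times_ms"]
    mean = sum(times) / len(times)
    print(f"{entry['name']}: mean {mean:.2f} ms over {len(times)} iterations")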

> Require build instructions / build scripts

I think this implies that builds should be entirely deterministic, that is, the exact versions of the compilers / toolchains that created them should be recorded, so that the binaries are easy to reproduce on different machines, ideally to the point of computing a hash of the produced binaries and comparing it against an expected hash. (Containers to the rescue!)
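
As a minimal sketch of that last point, assuming a hypothetical pinned container image and an expected-hash file checked in next to the benchmark (all names below are placeholders):

import hashlib
import os
import subprocess
import sys

# Hypothetical pinned build: the container image tag, build script and file
# names are only illustrative of the idea, not a concrete proposal.
subprocess.run(
    ["docker", "run", "--rm", "-v", f"{os.getcwd()}:/src",
     "pinned-wasm-toolchain:1.2.3", "/src/build.sh"],
    check=True,
)

with open("fac.wasm", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()
with open("expected_sha256.txt") as f:
    expected = f.read().strip()

if actual != expected:
    sys.exit(f"non-reproducible build: got {actual}, expected {expected}")
print("build reproduced bit-for-bit")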

fisherdarling commented Jul 10, 2019

> I propose to also require that the benchmark results can be programmatically fetched/consumed

This is a great idea, and there is already precedent for consumable testing through wast scripts. Many engines, such as wasmi and cranelift-wasm, programmatically consume tests through wat2wasm bindings. Wast scripts could be extended with a syntax for benchmarking that requires these line items to be defined as well.

A single benchmark definition could look like this:

(benchmark
    (kernel ;; micro / kernel / application? / domain?
        (name "factorial-recursive")
        (description "Recursive factorial implementation. Benchmarks ... and ...")
        (complexity (time "O(n)") (space "O(n)"))
        (source "fac.watb") ;; new benchmark filename?
         ;; only recommendations
        (warmup_iter 200)
        (bench_iter 1000)
    )
    ;; The function to benchmark
    (assert_return
        (invoke
            "fac-rec"
            (i64.const 20))
        (i64.const 2432902008176640000)
    )
)

The binary source could be formatted similarly to a binary module in wast files:

(module "fac-rec" binary "\00asm" "\01\00\00\00\01\04\01\60 ...")

And the written source code would be inside a textual module. The entire file would then look something like this:

(module "fac-rec" binary "\00asm" "\01\00\00\00\01\04\01\60 ...")

(module
  ;; Recursive factorial
  (func (export "fac-rec") (param i64) (result i64)
    (if (result i64) (i64.eq (local.get 0) (i64.const 0))
      (then (i64.const 1))
      (else
        (i64.mul (local.get 0) (call 0 (i64.sub (local.get 0) (i64.const 1))))
      )
    )
  )
)

(benchmark
    (kernel ;; micro / kernel / application? / domain?
        (name "factorial-recursive")
        (description "Recursive factorial implementation. Benchmarks ... and ...")
        (complexity (time "O(n)") (space "O(n)"))
        (source "fac.watb") ;; new benchmark filename? keep .wast?
         ;; only recommendations
        (warmup_iter 200)
        (bench_iter 1000)
    )
    ;; The function to benchmark
    (assert_return
        (invoke
            "fac-rec"
            (i64.const 20))
        (i64.const 2432902008176640000)
    )
)

Multiple benchmarks of the same kind could be described within the same file, e.g. different implementations of factorial.

This would allow current engines to reuse the same code they test with and then add logic for executing a benchmark instead. It would be hard to model this approach for applications, though for micro, kernel, and domain-specific benchmarks it may work.

Would it be appropriate to open an issue?

Horcrux7 commented

I do not like the idea of a text representation of the binary. There should be a reference to an original binary file created by any toolchain.

I also expect that the original sources will not be in the WAT format. They can be in any language, and there can be multiple source files. I think a subfolder for the sources of every test seems more practical.
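
For example, a layout along these lines (all names purely illustrative) would keep each line item's sources, build script and reference binary together:

benchmarks/
  fac-rec/
    src/            original sources, in whatever language they were written in
    build.sh        build script pinning toolchain versions and flags
    fac-rec.wasm    reference binary produced by that toolchain
    README.md       description, license, and expected outputs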

Warfields commented Jul 10, 2019 via email

jing-bao (Contributor) commented

I'm thinking that the variance of startup time may also need our attention. Disabling tiering in Wasm engines can make time measurements more stable, but it hides the real startup time from the user's perspective. In our previous experiments on Spec2k6, PolyBench and OpenCV.js, we observed large variance in the startup time of Wasm workloads, so maybe we have to find some way to handle it properly?

Besides, since there are many candidate benchmark cases, I'd like to limit the overall run time of the benchmark suite. A time-consuming benchmark is unfriendly to users. Maybe we can group the cases and allow people to run a single case or a subgroup.
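
On the startup-variance point, a rough way to quantify it across repeated runs could look like the following sketch (the engine command and module name are placeholders, not a specific proposal):

import statistics
import subprocess
import time

# Placeholder engine command and module name; any standalone Wasm engine CLI
# could be substituted here. Note this times the whole process, so it includes
# process startup as well as module compilation/instantiation.
CMD = ["wasm-engine", "benchmark.wasm"]

samples = []
for _ in range(30):
    start = time.perf_counter()
    subprocess.run(CMD, check=True, capture_output=True)
    samples.append(time.perf_counter() - start)

mean = statistics.mean(samples)
cv = statistics.stdev(samples) / mean  # coefficient of variation across runs
print(f"mean {mean * 1000:.1f} ms, coefficient of variation {cv:.1%}")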

TianyouLi commented

> I'm thinking that the variance of startup time may also need our attention. Disabling tiering in Wasm engines can make time measurements more stable, but it hides the real startup time from the user's perspective. In our previous experiments on Spec2k6, PolyBench and OpenCV.js, we observed large variance in the startup time of Wasm workloads, so maybe we have to find some way to handle it properly?
>
> Besides, since there are many candidate benchmark cases, I'd like to limit the overall run time of the benchmark suite. A time-consuming benchmark is unfriendly to users. Maybe we can group the cases and allow people to run a single case or a subgroup.

Agreed, the stability of the benchmark will be important; otherwise it may not be usable for fair comparison.

> Require build instructions / build scripts
> For each line item, require both source and binary form, with the binary form being updated (relatively infrequently) in response to changes in the respective toolchains

The binary release should also contain toolchain information, such as versions and build options, and that information should be part of the benchmark result for performance comparison.
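
For example, each result record could carry that metadata alongside the timings; a minimal sketch with purely illustrative field names and values:

import json
import platform

# Illustrative only: field names and values are placeholders for whatever
# toolchain metadata the group decides to record with each binary release.
record = {
    "benchmark": "factorial-recursive",
    "toolchain": {
        "compiler": "example-compiler",       # e.g. the C/C++ or Rust toolchain used
        "version": "<exact pinned version>",
        "flags": ["-O3"],                     # build options used to produce the binary
    },
    "host": platform.platform(),
    "iteration_times_ms": [1.92, 1.88, 1.90],  # placeholder timings
}
print(json.dumps(record, indent=2))

That way the toolchain provenance travels with the numbers it produced.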
