Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible wrong timer behavoir in run_wasm.js #59666

Closed
yurydelendik opened this issue Dec 5, 2024 · 9 comments
Closed

Possible wrong timer behavoir in run_wasm.js #59666

yurydelendik opened this issue Dec 5, 2024 · 9 comments
Labels
area-dart2wasm Issues for the dart2wasm compiler. type-bug Incorrect behavior (everything from a crash to more subtle misbehavior)

Comments

@yurydelendik
Copy link

TL;DR I suspect a defect in the implementation of the addTimer code:

if (ms == 0 && !isNextTimerDue()) {

It is connected with isNextTimerDue logic: when ms == 0 and isNextTimerDue() == true, the Task.run may create two separate entries in the timerHeap (in comparison with isNextTimerDue() == false, they will be just added to zeroTimerQueue) when engine runs slow.

The flute engine produces timers with the following cadence: 16 0 0 16 0 0 .. in the scheduleFrame. The defect above causing dart benchmark to fail "Null check operator used on a null value" because the draw frame can be executed out of sequence with corresponding begin frame.

I'm temporary patching it with:

    if (ms == 0){
      if (!isNextTimerDue()) {
        zeroTimerQueue.push(f);
      } else {
        timerHeap[0].push(f); // keeps Timer.run together
      } 
    } else {
@dart-github-bot
Copy link
Collaborator

Summary: The run_wasm.js timer implementation might be faulty. When a zero-delay timer is scheduled while the engine is slow, duplicate timer entries are created, leading to benchmark failures.

@dart-github-bot dart-github-bot added area-dart2wasm Issues for the dart2wasm compiler. triage-automation See https://github.com/dart-lang/ecosystem/tree/main/pkgs/sdk_triage_bot. type-bug Incorrect behavior (everything from a crash to more subtle misbehavior) labels Dec 5, 2024
@osa1
Copy link
Member

osa1 commented Dec 6, 2024

Thanks for reporting. I don't understand how these callbacks can execute out of order given that earlier callbacks schedule the later ones:

https://github.com/dart-lang/flute/blob/c44fba23c57bfc55031a823424112f77690aa00a/engine/lib/src/platform_dispatcher.dart#L567-L593

void scheduleFrame() {
  _frameTimer ??= Timer(
    const Duration(milliseconds: 16),
    () {
      if (_continue) {
        _frameTimer = null;
      }
      final int microseconds = _frameTime.inMicroseconds;
      _frameTime += const Duration(milliseconds: 16);
      Timer.run(() {
        final Stopwatch beginSw = Stopwatch()..start();
        _beginFrame(microseconds);
        beginSw.stop();
        Timer.run(() {
          final Stopwatch drawSw = Stopwatch()..start();
          _drawFrame();
          drawSw.stop();
          _continue = frameCallback(
              _frameCount,
              beginSw.elapsedMicroseconds / 1000,
              drawSw.elapsedMicroseconds / 1000);
          _frameCount += 1;
        });
      });
    },
  );
}

Here _drawFrame is scheduled by the callback that calls _beginFrame, after calling _beginFrame.

Since there will be one callback scheduled by scheduleFrame at any point in time, do you mean callbacks of different scheduleFrames interacting in a weird way maybe?

@yurydelendik
Copy link
Author

I observe the failure in the benchmark only when I run it under valgrind. It is reproducible with debug builds of the JS engine.

do you mean callbacks of different scheduleFrames interacting in a weird way maybe?

perhaps _beginFrame,_beginFrame,_drawFrame,_drawFrame

The timerHeap becomes something like [...30,...47,...31] in terms of timecodes,

Sorry, I wish I could be more helpful, I'm mostly working with wasm and js artifacts.

@mkustermann
Copy link
Member

Thanks, @yurydelendik !

I can reproduce this quite easily by adding a busy loop after the build frame.

The root cause is not a bug in the timer implementation. The timer implementation is just running already expired timers (i.e. timers with expiration time before the current time) before new callbacks registered with ms == 0.

The reason this happens is in the flute code. When it schedules a frame callback it will run the _beginFrame() callback code and use Timer.run() to schedule a drawFrame() callback afterwards. Now the problem is that _beginFrame() itself will trigger scheduleFrame() again, which will arm another timer 16 ms in the future. Now if _beginFrame() itself takes longer than 16 ms, that "next frame" timer will have earlier expiration date as Timer.run() making it run before the _drawFrame callback.

tl;dr The next frame is scheduled in the _beginFrame() not in the _drawFrame() (the final step of a frame) and they are separated by async boundary.

The reason flute is written in that way is that between _beginFrame() and _drawFrame() it needs to ensure to drain any scheduled microtasks - which we do automatically by yielding to the event loop.

So I think we can fix this by scheduling the _drawFrame() timer before even executing _beginFrame(). This will ensure microtasks will be drained then we _drawFrame() before doing work for the next frame. This then also aligns with real world flutter engine C++ code which does the same.

diff --git a/engine/lib/src/platform_dispatcher.dart b/engine/lib/src/platform_dispatcher.dart
index c8cc92c..64af62a 100644
--- a/engine/lib/src/platform_dispatcher.dart
+++ b/engine/lib/src/platform_dispatcher.dart
@@ -574,9 +574,7 @@ class PlatformDispatcher {
         final int microseconds = _frameTime.inMicroseconds;
         _frameTime += const Duration(milliseconds: 16);
         Timer.run(() {
-          final Stopwatch beginSw = Stopwatch()..start();
-          _beginFrame(microseconds);
-          beginSw.stop();
+          late final Stopwatch beginSw;
           Timer.run(() {
             final Stopwatch drawSw = Stopwatch()..start();
             _drawFrame();
@@ -587,6 +585,9 @@ class PlatformDispatcher {
                 drawSw.elapsedMicroseconds / 1000);
             _frameCount += 1;
           });
+          beginSw = Stopwatch()..start();
+          _beginFrame(microseconds);
+          beginSw.stop();
         });
       },
     );

@mkustermann
Copy link
Member

mkustermann commented Dec 6, 2024

For reference, here's the real C++ implementation

  tonic::DartInvoke(begin_frame_.Get(), ...);
  UIDartState::Current()->FlushMicrotasksNow();
  tonic::DartInvokeVoid(draw_frame_.Get());

which doesn't allow any timer events to come in-between beginFrame and endFrame code (neither expired timers, nor Timer.run() timers)

@lrhn
Copy link
Member

lrhn commented Dec 6, 2024

That heap looks fine. It's a heap priority queue, so it implicitly represents subtrees through the indices. The child nodes of the node at position i are at positions i×2+1 and i×2+2. Here it makes 30 the root, with children 31 and 47, both of which are larger numbers, so the heap is well-ordered.
When the root node is removed, the 31-node will become the new root.

The way the timer queue implementation works is that it has a "due timers" list that it takes events from, and a heap priority queue of later timers, each a list containing the time they are due and a number of timers due at that time.
The heap queue root is always the timers with the next due time.

When the due timers list becomes empty, and a the next timer time in the heap is due, it's popped out of the heap, which is then normalized to put the next smallest time at the root.
The due time is removed from the front of the list and that list then becomes the new due-timer list, and event running can continue.

This means that all timers in the due-timer list are scheduled for before the current time. No new timer can be scheduled before them, because no new timer can be scheduled before the current time, it's no earlier than zero ms after the current now.

So when a new timer is added, it's added to the delayed timers priority heap, which puts it into the position it needs to be relative to all other existing scheduled timers. If any of those scheduled timers are also already due (but originally scheduled later than the ones in the due-timer list), the new event won't become the root of the priority heap.

The text f (ms == 0 && !isNextTimerDue()) { is an optimization, it's not needed.
It checks whether the timer is a zero-delay timer, which means it will be scheduled for "now" and therefore already be due and that there is no due timer in the delayed timer heap, which means that it will definitely run before any of those timers, and definitely before any timers scheduled in the future.
In that case, it's added to the current due-timer list, to be run after all of those have been run.
Just an optimization, which shouldn't change when the event will run, it just avoids allocating a new list for and adding it to the heap (where it would become the root entry).

Without the !isNextTimerDue() check, adding the event to the due-timer list would make it run before the already due timers in the heap, which would be wrong.

The fix of

if (ms == 0){
      if (!isNextTimerDue()) {
        zeroTimerQueue.push(f);
      } else {
        timerHeap[0].push(f); // keeps Timer.run together
      } 
    } else {

is wrong too. It puts the zero-timer with the first due timers in the heap, but if there is more than one due timer-time in the heap, that will run the new timer before the second group of already due timers, which is incorrect.

If your timer events come from a 16ms interval timer, which do some work and schedule a zero-timer event to do more work and schedule a second zero-timer event, then it's possible that the last zero-timer event has been delayed so much that it's scheduled after the next 16 ms interval timer.
(The VM's timers are isochronous, so an interval trimer tries to run at multiples of 16ms from the start point. If one event gets delayed, the next will still happen when it should, as long as you don't skip more than one multiple of the duration, and the next event will represent multiple ticks, as visible in the tick property of the Timer. They also schedule the next event when a timer event starts.
In comparison, the JS timers schedule the next event when a timer event's callback ends, and doesn't try to to be isochronous, so the the duration is a delay between the end of the former timer event execution, and the start of the next, and a slow or delayed execution will shift all later events as well.
That means that the VM is more likley to have less than 16 ms between two timer events, if the former was delayed, and therefore it's easier to schedule a zero-timer event after the already scheduled next interval timer.

mkustermann added a commit to mkustermann/flute that referenced this issue Dec 6, 2024
A `_beginFrame()` seems to schedule the next frame interanally. If the
`_beginFrame()` takes a long time, it will finish after the expiration
date of the next frame timer.

That in return makes a `Timer.run(() => _drawFrame())` be scheduled
*after* the next frame callback is already run. Though we want to ensure
the `_beginFrame()` and `_drawFrame()` are together before starting to
work on the next frame. We only want microtasks to be drained in-between
those two.

(This is how C++ engine works)

We can get similar semantics by scheduling the `_drawFrame()` timer even
before running the `_beginFrame()` code.

Issue dart-lang#23
Issue dart-lang/sdk#59666
@lrhn
Copy link
Member

lrhn commented Dec 6, 2024

(As a solution, you may want to not use an interval timer, and to schedule the next event yourself, when the former event is done running. That way you know that there won't be an existing timer with a lower due time already in the timer queue. You'll have to decide when to run the next event then, if the current event has taken more than 16 ms to run.)

@mkustermann
Copy link
Member

Have sent out dart-lang/flute#24 which I think is sufficient to fix this issue.

copybara-service bot pushed a commit to dart-lang/flute that referenced this issue Dec 10, 2024
…e `_beginFrame()` takes a long time, it will finish after the expiration date of the next frame timer.

That in return makes a `Timer.run(() => _drawFrame())` be scheduled *after* the next frame callback is already run. Though we want to ensure the `_beginFrame()` and `_drawFrame()` are together before starting to work on the next frame. We only want microtasks to be drained in-between those two.

(This is how C++ engine works)

We can get similar semantics by scheduling the `_drawFrame()` timer even before running the `_beginFrame()` code.

Issue #23
Issue dart-lang/sdk#59666

GitOrigin-RevId: 71e3edb
Change-Id: Ib2eb54d159d59a4e8e78a71682628160b66ff897
Reviewed-on: https://dart-review.googlesource.com/c/flute/+/399380
Reviewed-by: Lasse Nielsen <[email protected]>
Reviewed-by: Alexander Thomas <[email protected]>
Commit-Queue: Martin Kustermann <[email protected]>
@lrhn lrhn removed the triage-automation See https://github.com/dart-lang/ecosystem/tree/main/pkgs/sdk_triage_bot. label Dec 10, 2024
mkustermann added a commit to mkustermann/wasm_gc_benchmarks that referenced this issue Jan 28, 2025
@mkustermann
Copy link
Member

The flute benchmark was fixed some time ago. We've rolled that fix into our CI & benchmarking infrastructure some time ago.

Today I've also rolled it into https://github.com/mkustermann/wasm_gc_benchmarks, so this should close this bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-dart2wasm Issues for the dart2wasm compiler. type-bug Incorrect behavior (everything from a crash to more subtle misbehavior)
Projects
None yet
Development

No branches or pull requests

5 participants