Skip to content

Commit 341e314

Browse files
nickrobinson251vilterpkpamnanyNHDalyvchuravy
committed
Add per-task metrics (JuliaLang#56320)
Close JuliaLang#47351 (builds on top of JuliaLang#48416) Adds two per-task metrics: - running time = amount of time the task was actually running (according to our scheduler). Note: currently inclusive of GC time, but would be good to be able to separate that out (in a future PR) - wall time = amount of time between the scheduler becoming aware of this task and the task entering a terminal state (i.e. done or failed). We record running time in `wait()`, where the scheduler stops running the task as well as in `yield(t)`, `yieldto(t)` and `throwto(t)`, which bypass the scheduler. Other places where a task stops running (for `Channel`, `ReentrantLock`, `Event`, `Timer` and `Semaphore` are all implemented in terms of `wait(Condition)`, which in turn calls `wait()`. `LibuvStream` similarly calls `wait()`. This should capture everything (albeit, slightly over-counting task CPU time by including any enqueuing work done before we hit `wait()`). The various metrics counters could be a separate inlined struct if we think that's a useful abstraction, but for now i've just put them directly in `jl_task_t`. They are all atomic, except the `metrics_enabled` flag itself (which we now have to check on task start/switch/done even if metrics are not enabled) which is set on task construction and marked `const` on the julia side. In future PRs we could add more per-task metrics, e.g. compilation time, GC time, allocations, potentially a wait-time breakdown (time waiting on locks, channels, in the scheduler run queue, etc.), potentially the number of yields. Perhaps in future there could be ways to enable this on a per-thread and per-task basis. And potentially in future these same timings could be used by `@time` (e.g. writing this same timing data to a ScopedValue like in JuliaLang#55103 but only for tasks lexically scoped to inside the `@time` block). Timings are off by default but can be turned on globally via starting Julia with `--task-metrics=yes` or calling `Base.Experimental.task_metrics(true)`. Metrics are collected for all tasks created when metrics are enabled. In other words, enabling/disabling timings via `Base.Experimental.task_metrics` does not affect existing `Task`s, only new `Task`s. The other new APIs are `Base.Experimental.task_running_time_ns(::Task)` and `Base.Experimental.task_wall_time_ns(::Task)` for retrieving the new metrics. These are safe to call on any task (including the current task, or a task running on another thread). All these are in `Base.Experimental` to give us room to change up the APIs as we add more metrics in future PRs (without worrying about release timelines). cc @NHDaly @kpamnany @d-netto --------- Co-authored-by: Pete Vilter <[email protected]> Co-authored-by: K Pamnany <[email protected]> Co-authored-by: Nathan Daly <[email protected]> Co-authored-by: Valentin Churavy <[email protected]>
1 parent 9344ba8 commit 341e314

19 files changed

+564
-37
lines changed

base/boot.jl

+23-5
Original file line numberDiff line numberDiff line change
@@ -156,15 +156,33 @@
156156
#end
157157

158158
#mutable struct Task
159-
# parent::Task
159+
# next::Any
160+
# queue::Any
160161
# storage::Any
161-
# state::Symbol
162162
# donenotify::Any
163163
# result::Any
164-
# exception::Any
165-
# backtrace::Any
166-
# logstate::Any
164+
# scope::Any
167165
# code::Any
166+
# @atomic _state::UInt8
167+
# sticky::UInt8
168+
# priority::UInt16
169+
# @atomic _isexception::UInt8
170+
# pad00::UInt8
171+
# pad01::UInt8
172+
# pad02::UInt8
173+
# rngState0::UInt64
174+
# rngState1::UInt64
175+
# rngState2::UInt64
176+
# rngState3::UInt64
177+
# rngState4::UInt64
178+
# const metrics_enabled::Bool
179+
# pad10::UInt8
180+
# pad11::UInt8
181+
# pad12::UInt8
182+
# @atomic first_enqueued_at::UInt64
183+
# @atomic last_started_running_at::UInt64
184+
# @atomic running_time_ns::UInt64
185+
# @atomic finished_at::UInt64
168186
#end
169187

170188
export

base/experimental.jl

+83
Original file line numberDiff line numberDiff line change
@@ -368,4 +368,87 @@ adding them to the global method table.
368368
"""
369369
:@MethodTable
370370

371+
"""
372+
Base.Experimental.disable_new_worlds()
373+
374+
Mark that no new worlds (methods additions, deletions, etc) are permitted to be created at
375+
any future time, allowing for lower latencies for some operations and slightly lower memory
376+
usage, by eliminating the tracking of those possible invalidation.
377+
"""
378+
disable_new_worlds() = ccall(:jl_disable_new_worlds, Cvoid, ())
379+
380+
### Task metrics
381+
382+
"""
383+
Base.Experimental.task_metrics(::Bool)
384+
385+
Enable or disable the collection of per-task metrics.
386+
A `Task` created when `Base.Experimental.task_metrics(true)` is in effect will have
387+
[`Base.Experimental.task_running_time_ns`](@ref) and [`Base.Experimental.task_wall_time_ns`](@ref)
388+
timing information available.
389+
390+
!!! note
391+
Task metrics can be enabled at start-up via the `--task-metrics=yes` command line option.
392+
"""
393+
function task_metrics(b::Bool)
394+
if b
395+
ccall(:jl_task_metrics_enable, Cvoid, ())
396+
else
397+
ccall(:jl_task_metrics_disable, Cvoid, ())
398+
end
399+
return nothing
400+
end
401+
402+
"""
403+
Base.Experimental.task_running_time_ns(t::Task) -> Union{UInt64, Nothing}
404+
405+
Return the total nanoseconds that the task `t` has spent running.
406+
This metric is only updated when `t` yields or completes unless `t` is the current task, in
407+
which it will be updated continuously.
408+
See also [`Base.Experimental.task_wall_time_ns`](@ref).
409+
410+
Returns `nothing` if task timings are not enabled.
411+
See [`Base.Experimental.task_metrics`](@ref).
412+
413+
!!! note "This metric is from the Julia scheduler"
414+
A task may be running on an OS thread that is descheduled by the OS
415+
scheduler, this time still counts towards the metric.
416+
417+
!!! compat "Julia 1.12"
418+
This method was added in Julia 1.12.
419+
"""
420+
function task_running_time_ns(t::Task=current_task())
421+
t.metrics_enabled || return nothing
422+
if t == current_task()
423+
# These metrics fields can't update while we're running.
424+
# But since we're running we need to include the time since we last started running!
425+
return t.running_time_ns + (time_ns() - t.last_started_running_at)
426+
else
427+
return t.running_time_ns
428+
end
371429
end
430+
431+
"""
432+
Base.Experimental.task_wall_time_ns(t::Task) -> Union{UInt64, Nothing}
433+
434+
Return the total nanoseconds that the task `t` was runnable.
435+
This is the time since the task first entered the run queue until the time at which it
436+
completed, or until the current time if the task has not yet completed.
437+
See also [`Base.Experimental.task_running_time_ns`](@ref).
438+
439+
Returns `nothing` if task timings are not enabled.
440+
See [`Base.Experimental.task_metrics`](@ref).
441+
442+
!!! compat "Julia 1.12"
443+
This method was added in Julia 1.12.
444+
"""
445+
function task_wall_time_ns(t::Task=current_task())
446+
t.metrics_enabled || return nothing
447+
start_at = t.first_enqueued_at
448+
start_at == 0 && return UInt64(0)
449+
end_at = t.finished_at
450+
end_at == 0 && return time_ns() - start_at
451+
return end_at - start_at
452+
end
453+
454+
end # module

base/options.jl

+1
Original file line numberDiff line numberDiff line change
@@ -60,6 +60,7 @@ struct JLOptions
6060
heap_size_hint::UInt64
6161
trace_compile_timing::Int8
6262
safe_crash_log_file::Ptr{UInt8}
63+
task_metrics::Int8
6364
end
6465

6566
# This runs early in the sysimage != is not defined yet

base/task.jl

+50-3
Original file line numberDiff line numberDiff line change
@@ -810,7 +810,11 @@ function enq_work(t::Task)
810810
return t
811811
end
812812

813-
schedule(t::Task) = enq_work(t)
813+
function schedule(t::Task)
814+
# [task] created -scheduled-> wait_time
815+
maybe_record_enqueued!(t)
816+
enq_work(t)
817+
end
814818

815819
"""
816820
schedule(t::Task, [val]; error=false)
@@ -857,6 +861,8 @@ function schedule(t::Task, @nospecialize(arg); error=false)
857861
t.queue === nothing || Base.error("schedule: Task not runnable")
858862
setfield!(t, :result, arg)
859863
end
864+
# [task] created -scheduled-> wait_time
865+
maybe_record_enqueued!(t)
860866
enq_work(t)
861867
return t
862868
end
@@ -888,9 +894,15 @@ A fast, unfair-scheduling version of `schedule(t, arg); yield()` which
888894
immediately yields to `t` before calling the scheduler.
889895
"""
890896
function yield(t::Task, @nospecialize(x=nothing))
891-
(t._state === task_state_runnable && t.queue === nothing) || error("yield: Task not runnable")
897+
ct = current_task()
898+
t === ct && throw(ConcurrencyViolationError("Cannot yield to currently running task!"))
899+
(t._state === task_state_runnable && t.queue === nothing) || throw(ConcurrencyViolationError("yield: Task not runnable"))
900+
# [task] user_time -yield-> wait_time
901+
record_running_time!(ct)
902+
# [task] created -scheduled-> wait_time
903+
maybe_record_enqueued!(t)
892904
t.result = x
893-
enq_work(current_task())
905+
enq_work(ct)
894906
set_next_task(t)
895907
return try_yieldto(ensure_rescheduled)
896908
end
@@ -904,13 +916,18 @@ call to `yieldto`. This is a low-level call that only switches tasks, not consid
904916
or scheduling in any way. Its use is discouraged.
905917
"""
906918
function yieldto(t::Task, @nospecialize(x=nothing))
919+
ct = current_task()
907920
# TODO: these are legacy behaviors; these should perhaps be a scheduler
908921
# state error instead.
909922
if t._state === task_state_done
910923
return x
911924
elseif t._state === task_state_failed
912925
throw(t.result)
913926
end
927+
# [task] user_time -yield-> wait_time
928+
record_running_time!(ct)
929+
# [task] created -scheduled-unfairly-> wait_time
930+
maybe_record_enqueued!(t)
914931
t.result = x
915932
set_next_task(t)
916933
return try_yieldto(identity)
@@ -924,6 +941,10 @@ function try_yieldto(undo)
924941
rethrow()
925942
end
926943
ct = current_task()
944+
# [task] wait_time -(re)started-> user_time
945+
if ct.metrics_enabled
946+
@atomic :monotonic ct.last_started_running_at = time_ns()
947+
end
927948
if ct._isexception
928949
exc = ct.result
929950
ct.result = nothing
@@ -937,6 +958,11 @@ end
937958

938959
# yield to a task, throwing an exception in it
939960
function throwto(t::Task, @nospecialize exc)
961+
ct = current_task()
962+
# [task] user_time -yield-> wait_time
963+
record_running_time!(ct)
964+
# [task] created -scheduled-unfairly-> wait_time
965+
maybe_record_enqueued!(t)
940966
t.result = exc
941967
t._isexception = true
942968
set_next_task(t)
@@ -989,6 +1015,9 @@ checktaskempty = Partr.multiq_check_empty
9891015
end
9901016

9911017
function wait()
1018+
ct = current_task()
1019+
# [task] user_time -yield-or-done-> wait_time
1020+
record_running_time!(ct)
9921021
GC.safepoint()
9931022
W = workqueue_for(Threads.threadid())
9941023
poptask(W)
@@ -1003,3 +1032,21 @@ if Sys.iswindows()
10031032
else
10041033
pause() = ccall(:pause, Cvoid, ())
10051034
end
1035+
1036+
# update the `running_time_ns` field of `t` to include the time since it last started running.
1037+
function record_running_time!(t::Task)
1038+
if t.metrics_enabled && !istaskdone(t)
1039+
@atomic :monotonic t.running_time_ns += time_ns() - t.last_started_running_at
1040+
end
1041+
return t
1042+
end
1043+
1044+
# if this is the first time `t` has been added to the run queue
1045+
# (or the first time it has been unfairly yielded to without being added to the run queue)
1046+
# then set the `first_enqueued_at` field to the current time.
1047+
function maybe_record_enqueued!(t::Task)
1048+
if t.metrics_enabled && t.first_enqueued_at == 0
1049+
@atomic :monotonic t.first_enqueued_at = time_ns()
1050+
end
1051+
return t
1052+
end

doc/man/julia.1

+8
Original file line numberDiff line numberDiff line change
@@ -270,6 +270,14 @@ Print precompile statements for methods compiled during execution or save to a p
270270
--trace-compile-timing=
271271
If --trace-compile is enabled show how long each took to compile in ms
272272

273+
.TP
274+
--trace-dispatch={stderr|name}
275+
Print precompile statements for methods dispatched during execution or save to stderr or a path.
276+
277+
.TP
278+
--task-metrics={yes|no*}
279+
Enable the collection of per-task metrics.
280+
273281
.TP
274282
-image-codegen
275283
Force generate code in imaging mode

doc/src/base/multi-threading.md

+8
Original file line numberDiff line numberDiff line change
@@ -71,3 +71,11 @@ These building blocks are used to create the regular synchronization objects.
7171
```@docs
7272
Base.Threads.SpinLock
7373
```
74+
75+
## Task metrics (Experimental)
76+
77+
```@docs
78+
Base.Experimental.task_metrics
79+
Base.Experimental.task_running_time_ns
80+
Base.Experimental.task_wall_time_ns
81+
```

doc/src/manual/command-line-interface.md

+1
Original file line numberDiff line numberDiff line change
@@ -130,6 +130,7 @@ The following is a complete list of command-line switches available when launchi
130130
|`--code-coverage=tracefile.info` |Append coverage information to the LCOV tracefile (filename supports format tokens).|
131131
|`--track-allocation[={none*\|user\|all}]` |Count bytes allocated by each source line (omitting setting is equivalent to "user")|
132132
|`--track-allocation=@<path>` |Count bytes but only in files that fall under the given file path/directory. The `@` prefix is required to select this option. A `@` with no path will track the current directory.|
133+
|`--task-metrics={yes\|no*}` |Enable the collection of per-task metrics|
133134
|`--bug-report=KIND` |Launch a bug report session. It can be used to start a REPL, run a script, or evaluate expressions. It first tries to use BugReporting.jl installed in current environment and falls back to the latest compatible BugReporting.jl if not. For more information, see `--bug-report=help`.|
134135
|`--compile={yes*\|no\|all\|min}` |Enable or disable JIT compiler, or request exhaustive or minimal compilation|
135136
|`--output-o <name>` |Generate an object file (including system image data)|

src/init.c

+4
Original file line numberDiff line numberDiff line change
@@ -851,6 +851,10 @@ JL_DLLEXPORT void julia_init(JL_IMAGE_SEARCH rel)
851851
#if defined(_COMPILER_GCC_) && __GNUC__ >= 12
852852
#pragma GCC diagnostic ignored "-Wdangling-pointer"
853853
#endif
854+
if (jl_options.task_metrics == JL_OPTIONS_TASK_METRICS_ON) {
855+
// enable before creating the root task so it gets timings too.
856+
jl_atomic_fetch_add(&jl_task_metrics_enabled, 1);
857+
}
854858
// warning: this changes `jl_current_task`, so be careful not to call that from this function
855859
jl_task_t *ct = jl_init_root_task(ptls, stack_lo, stack_hi);
856860
#pragma GCC diagnostic pop

src/jlapi.c

+30
Original file line numberDiff line numberDiff line change
@@ -509,6 +509,36 @@ JL_DLLEXPORT uint64_t jl_cumulative_recompile_time_ns(void)
509509
return jl_atomic_load_relaxed(&jl_cumulative_recompile_time);
510510
}
511511

512+
/**
513+
* @brief Enable per-task timing.
514+
*/
515+
JL_DLLEXPORT void jl_task_metrics_enable(void)
516+
{
517+
// Increment the flag to allow reentrant callers.
518+
jl_atomic_fetch_add(&jl_task_metrics_enabled, 1);
519+
}
520+
521+
/**
522+
* @brief Disable per-task timing.
523+
*/
524+
JL_DLLEXPORT void jl_task_metrics_disable(void)
525+
{
526+
// Prevent decrementing the counter below zero
527+
uint8_t enabled = jl_atomic_load_relaxed(&jl_task_metrics_enabled);
528+
while (enabled > 0) {
529+
if (jl_atomic_cmpswap(&jl_task_metrics_enabled, &enabled, enabled-1))
530+
break;
531+
}
532+
}
533+
534+
/**
535+
* @brief Retrieve floating-point environment constants.
536+
*
537+
* Populates an array with constants related to the floating-point environment,
538+
* such as rounding modes and exception flags.
539+
*
540+
* @param ret An array of integers to be populated with floating-point environment constants.
541+
*/
512542
JL_DLLEXPORT void jl_get_fenv_consts(int *ret)
513543
{
514544
ret[0] = FE_INEXACT;

0 commit comments

Comments
 (0)