runtime/metrics: add contention metrics #49881
Comments
Interesting... Looks like we already measure time spent blocked for starvation purposes anyway. I suppose this wouldn't be too hard to expose. Is cumulative wait time the right metric? It's useful for comparison, but it doesn't really tell you much on its own. An approximate distribution of latencies might be more useful for this, because you can correlate that with e.g. request latency, but it's slightly more expensive and it requires a bit more post-processing. Its other downside is that it's approximate -- I'm not sure if there's some special use case enabled by having a precise cumulative wait time.
To be clear, I'm not opposed to cumulative wait time. If that's the convention, we can move forward with that. Just want to make sure we considered our options.
Latency distribution would be even better. Approximate times would be good enough for all use cases I can think of.
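As a sketch of what consuming such a metric could look like through the runtime/metrics API: the metric name below is an assumption (it matches what the CLs attached to this issue appear to add), and on a toolchain that doesn't have it, the sample simply reports KindBad rather than failing.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// Assumed metric name for cumulative mutex wait time; treat as
	// hypothetical on toolchains that predate this issue's CLs.
	const name = "/sync/mutex/wait/total:seconds"

	samples := []metrics.Sample{{Name: name}}
	metrics.Read(samples)

	// Unknown metric names come back as KindBad rather than an error.
	if samples[0].Value.Kind() == metrics.KindBad {
		fmt.Println("metric not supported by this Go toolchain")
		return
	}
	fmt.Printf("cumulative mutex wait: %f seconds\n", samples[0].Value.Float64())
}
```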
Moving to Backlog. Please recategorize as appropriate. Thanks.
Would it be a lot of work to get this into 1.20? It would help a lot in understanding performance issues in production jobs, and many Google-internal users have expressed interest in this.
Given that there's known demand for this and that I think it wouldn't be hard to add (though I might be wrong), I'll move this to 1.20, but given our focus on core project health and PGO for 1.20, I'm not sure we'll get to this. |
I'll probably do some other |
Change https://go.dev/cl/427618 mentions this issue: |
Change https://go.dev/cl/427617 mentions this issue: |
Change https://go.dev/cl/427616 mentions this issue: |
This change adds 3 new waitReasons that correspond to sync.Mutex.Lock, sync.RWMutex.RLock, and sync.RWMutex.Lock that are plumbed down into semacquire1 by exporting new functions to the sync package from the runtime. Currently these three functions show up as "semacquire" in backtraces which isn't very clear, though the stack trace itself should reveal what's really going on. This represents a minor improvement to backtrace readability, though blocking on an RWMutex.w.Lock will still show up as blocking on a regular mutex (I suppose technically it is). This is a step toward helping the runtime identify when a goroutine is blocked on a mutex of some kind. For #49881. Change-Id: Ia409b4d27e117fe4bfdc25fa541e9c58d6d587b9 Reviewed-on: https://go-review.googlesource.com/c/go/+/427616 TryBot-Result: Gopher Robot <[email protected]> Auto-Submit: Michael Knyszek <[email protected]> Reviewed-by: Michael Pratt <[email protected]> Run-TryBot: Michael Knyszek <[email protected]>
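For illustration only (not part of the CL itself), a small program that provokes the situation the commit message describes: on a toolchain with this change, the blocked goroutine's wait reason in the dump should read "sync.Mutex.Lock" instead of the generic "semacquire".

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex
	mu.Lock()

	go func() {
		mu.Lock() // blocks; this goroutine's wait reason is what the CL changes
		mu.Unlock()
	}()

	time.Sleep(100 * time.Millisecond) // give the goroutine time to block

	buf := make([]byte, 1<<16)
	n := runtime.Stack(buf, true) // dump all goroutines, wait reasons included
	fmt.Printf("%s\n", buf[:n])

	mu.Unlock()
}
```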
Currently, wait reasons are set somewhat inconsistently. In a follow-up CL, we're going to want to rely on the wait reason being there for casgstatus, so the status quo isn't really going to work for that. Plus this inconsistency means there are a whole bunch of cases where we could be more specific about the G's status but aren't. So, this change adds a new function, casGToWaiting, which is like casgstatus but also sets the wait reason. The goal is that by using this API it'll be harder to forget to set a wait reason (or the lack thereof will at least be explicit). This change then updates all casgstatus(gp, ..., _Gwaiting) calls to casGToWaiting(gp, ..., waitReasonX) instead. For a number of these cases, we're missing a wait reason, and it wouldn't hurt to add one, so this change also adds those wait reasons. For #49881. Change-Id: Ia95e06ecb74ed17bb7bb94f1a362ebfe6bec1518 Reviewed-on: https://go-review.googlesource.com/c/go/+/427617 Reviewed-by: Michael Pratt <[email protected]> Run-TryBot: Michael Knyszek <[email protected]> Auto-Submit: Michael Knyszek <[email protected]> TryBot-Result: Gopher Robot <[email protected]>
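The types involved are runtime-internal, so here is a self-contained analog only (the names below are illustrative, not the runtime's actual code) showing the shape of the API; the point is that pairing the status transition with the wait reason in one helper makes the reason hard to forget.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type status uint32

const (
	running status = iota
	waiting
)

type gState struct {
	status     atomic.Uint32
	waitReason string
}

// casToWaiting mirrors the casGToWaiting idea: the wait reason is set
// before the status transition, so anything that observes the waiting
// state also sees a valid reason.
func (g *gState) casToWaiting(old status, reason string) bool {
	g.waitReason = reason
	return g.status.CompareAndSwap(uint32(old), uint32(waiting))
}

func main() {
	var g gState
	if g.casToWaiting(running, "sync.Mutex.Lock") {
		fmt.Println("now waiting:", g.waitReason)
	}
}
```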
Right now, lock contention can only be debugged with pprof, using the block profile. This is very useful once contention has been identified as the issue, but since it has to be turned on manually, it doesn't help in identifying that contention is an issue in the first place. Exporting the cumulative wait time via runtime/metrics would allow continuous monitoring of contention and help in debugging Go programs.
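For context, the manual step mentioned above looks roughly like this; the block profile records nothing until a sampling rate is set, so contention that occurred before enabling it is lost.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/ handlers, including /debug/pprof/block
	"runtime"
)

func main() {
	// Rate 1 records every blocking event; real deployments usually pick
	// a coarser rate because of the overhead.
	runtime.SetBlockProfileRate(1)

	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... application work; fetch the profile with:
	//   go tool pprof http://localhost:6060/debug/pprof/block
	select {}
}
```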