runtime/metrics: add contention metrics #49881
Comments
Interesting... Looks like we already measure time spent blocked for starvation purposes anyway. I suppose this wouldn't be too hard to expose. Is cumulative wait time the right metric? It's useful for comparison, but it doesn't really tell you much on its own. An approximate distribution of latencies might be more useful for this, because you can correlate that with e.g. request latency, but it's slightly more expensive and it requires a bit more post-processing. Its other downside is that it's approximate -- I'm not sure if there's some special use case enabled by having a precise cumulative wait time.
To be clear, I'm not opposed to cumulative wait time. If that's the convention, we can move forward with that. Just want to make sure we considered our options.
Latency distribution would be even better. Approximate times would be good enough for all use cases I can think of.
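As a sketch of what consuming such a metric could look like through the runtime/metrics API: the metric name below is an assumption (it matches what the CLs attached to this issue appear to add), and on a toolchain that doesn't have it, the sample simply reports KindBad rather than failing.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// Assumed metric name for cumulative mutex wait time; treat as
	// hypothetical on toolchains that predate this issue's CLs.
	const name = "/sync/mutex/wait/total:seconds"

	samples := []metrics.Sample{{Name: name}}
	metrics.Read(samples)

	// Unknown metric names come back as KindBad rather than an error.
	if samples[0].Value.Kind() == metrics.KindBad {
		fmt.Println("metric not supported by this Go toolchain")
		return
	}
	fmt.Printf("cumulative mutex wait: %f seconds\n", samples[0].Value.Float64())
}
```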
Moving to Backlog. Please recategorize as appropriate. Thanks.
Would it be a lot of work to get this into 1.20? It would help a lot in understanding performance issues in production jobs, and many Google-internal users have expressed interest in this.
Given that there's known demand for this and that I think it wouldn't be hard to add (though I might be wrong), I'll move this to 1.20, but given our focus on core project health and PGO for 1.20, I'm not sure we'll get to this. |
I'll probably do some other |
Change https://go.dev/cl/427618 mentions this issue: |
Change https://go.dev/cl/427617 mentions this issue: |
Change https://go.dev/cl/427616 mentions this issue: |
This change adds 3 new waitReasons that correspond to sync.Mutex.Lock, sync.RWMutex.RLock, and sync.RWMutex.Lock that are plumbed down into semacquire1 by exporting new functions to the sync package from the runtime. Currently these three functions show up as "semacquire" in backtraces which isn't very clear, though the stack trace itself should reveal what's really going on. This represents a minor improvement to backtrace readability, though blocking on an RWMutex.w.Lock will still show up as blocking on a regular mutex (I suppose technically it is). This is a step toward helping the runtime identify when a goroutine is blocked on a mutex of some kind. For #49881. Change-Id: Ia409b4d27e117fe4bfdc25fa541e9c58d6d587b9 Reviewed-on: https://go-review.googlesource.com/c/go/+/427616 TryBot-Result: Gopher Robot <[email protected]> Auto-Submit: Michael Knyszek <[email protected]> Reviewed-by: Michael Pratt <[email protected]> Run-TryBot: Michael Knyszek <[email protected]>
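For illustration only (not part of the CL itself), a small program that provokes the situation the commit message describes: on a toolchain with this change, the blocked goroutine's wait reason in the dump should read "sync.Mutex.Lock" instead of the generic "semacquire".

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex
	mu.Lock()

	go func() {
		mu.Lock() // blocks; this goroutine's wait reason is what the CL changes
		mu.Unlock()
	}()

	time.Sleep(100 * time.Millisecond) // give the goroutine time to block

	buf := make([]byte, 1<<16)
	n := runtime.Stack(buf, true) // dump all goroutines, wait reasons included
	fmt.Printf("%s\n", buf[:n])

	mu.Unlock()
}
```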
Currently, wait reasons are set somewhat inconsistently. In a follow-up CL, we're going to want to rely on the wait reason being there for casgstatus, so the status quo isn't really going to work for that. Plus this inconsistency means there are a whole bunch of cases where we could be more specific about the G's status but aren't. So, this change adds a new function, casGToWaiting, which is like casgstatus but also sets the wait reason. The goal is that by using this API it'll be harder to forget to set a wait reason (or the lack thereof will at least be explicit). This change then updates all casgstatus(gp, ..., _Gwaiting) calls to casGToWaiting(gp, ..., waitReasonX) instead. For a number of these cases, we're missing a wait reason, and it wouldn't hurt to add one, so this change also adds those wait reasons. For #49881. Change-Id: Ia95e06ecb74ed17bb7bb94f1a362ebfe6bec1518 Reviewed-on: https://go-review.googlesource.com/c/go/+/427617 Reviewed-by: Michael Pratt <[email protected]> Run-TryBot: Michael Knyszek <[email protected]> Auto-Submit: Michael Knyszek <[email protected]> TryBot-Result: Gopher Robot <[email protected]>
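The types involved are runtime-internal, so here is a self-contained analog only (the names below are illustrative, not the runtime's actual code) showing the shape of the API; the point is that pairing the status transition with the wait reason in one helper makes the reason hard to forget.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type status uint32

const (
	running status = iota
	waiting
)

type gState struct {
	status     atomic.Uint32
	waitReason string
}

// casToWaiting mirrors the casGToWaiting idea: the wait reason is set
// before the status transition, so anything that observes the waiting
// state also sees a valid reason.
func (g *gState) casToWaiting(old status, reason string) bool {
	g.waitReason = reason
	return g.status.CompareAndSwap(uint32(old), uint32(waiting))
}

func main() {
	var g gState
	if g.casToWaiting(running, "sync.Mutex.Lock") {
		fmt.Println("now waiting:", g.waitReason)
	}
}
```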
Right now, lock contention can only be debugged with pprof, using the block profile. This is very useful once contention has been identified as the issue, but since it has to be turned on manually, it doesn't help in identifying that contention is an issue in the first place. Exporting the cumulative wait time via runtime/metrics would allow continuous monitoring of contention and help in debugging Go programs.
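For context, the manual step mentioned above looks roughly like this; the block profile records nothing until a sampling rate is set, so contention that occurred before enabling it is lost.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/ handlers, including /debug/pprof/block
	"runtime"
)

func main() {
	// Rate 1 records every blocking event; real deployments usually pick
	// a coarser rate because of the overhead.
	runtime.SetBlockProfileRate(1)

	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... application work; fetch the profile with:
	//   go tool pprof http://localhost:6060/debug/pprof/block
	select {}
}
```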