-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/compile: use context-sensitivity to prevent inlining of cold callsites in PGO #72107
Comments
Change https://go.dev/cl/654935 mentions this issue: |
Hi @yukatan1701, thanks for filing this. If this moves forward, I think it would be viewed as an internal implementation improvement and as such would not need to go through the proposal process (https://go.dev/s/proposal), so I'll take it out of the proposal process. |
Hi @yukatan1701, ah, from a quick look, I see your draft CL includes adding a new In general, proposals take more time to work through (both for you, as well as for the core Go team), so my question to you is whether a user-facing flag like that or other proposal-level change is critical, vs. maybe more of a "nice to have". My advice would be that if your change is always or almost always an improvement, then a new cmd/go build flag probably would not be needed, and it would probably be better to do something like add a compiler debug flag, or use one that exists. That said, I don't know if this change is something that fits that category. You can look at some of the existing PGO debug flags here: go/src/cmd/compile/internal/base/debug.go Lines 68 to 73 in 583d586
Also, if you haven't seen it yet, the "Tips" section of the cmd/compile README has an overview of how to use those types of debug flags (including which package a debug flag affects is something someone might not just guess on their own): https://github.com/golang/go/tree/master/src/cmd/compile#8-tips |
Thanks for working on this @yukatan1701, the results look nice (thanks for running sweet!). Just to clarify, am I understanding correctly that I agree that ultimately this doesn't need to be a proposal or add a top-level flag to It looks like this is currently using a I didn't quite understand the new serialization format from a quick look (an overview comment would be helpful), but if feasible, I think the right approach would be to have [1] For future reference, your life would likely have been easier by adding a new GOEXPERIMENT (example). GOEXPERIMENTS are already handled by the |
Hi @thepudds! Initially, I had no plans to add a new build flag. I used a PGO debug flag but encountered the following issue. Context-sensitive inlining requires a special CS-graph that allows the compiler to quickly check whether an inlining path is present in any sample. Building this graph can take some time, depending on the volume of profile data. To speed up this process, I needed a pre-profile. The only way to leverage this was to introduce a new build flag. Since CS-inlining may slow down compilation, I believe this optimization should be optional. It is particularly useful for those looking to balance code size and performance, especially in projects with large codebases. Given the potentially long compilation times, the need for pre-profiling is justified. I have no objections to using the debug flag if there is a way to incorporate the pre-profile. @prattmic, thanks for the recommendations. I'll try to come up with workarounds to load the pre-profile for CS-inlining.
Yes, it is. |
CC @golang/compiler |
Thanks for looking into this. I actually tried something simple along this line (very simple heuristics for not inlining cold calls) in the early days working on PGO. But result was not good. There were actually quite some regression. So we decided to not penalize cold functions. I think a reason could be due to the sampling nature of CPU profiles. Some functions may still be important to performance, but just didn't get enough samples. An extreme case would be that a function is inlined and optimized out to very few instructions (even a constant), which would get very few CPU samples, but this optimization is actually important for performance. It is possible that the new algorithm in the paper is better. I haven't read it. Does it have mechanisms to handle such cases? Thanks. |
Proposal Details
proposal: cmd/compile: use context-sensitivity to prevent inlining of cold callsites in PGO
Abstract
This proposal aims to support context-sensitive inline in the Go Compiler. Context-sensitive inline extends the current inlining approach. It considers both an inlining path and profile information when deciding whether a function is worth being inlined in a particular callsite. It helps reduce code size without any significant performance changes. This proposal is based on Wenlei He, Hongtao Yu, Lei Wang, Taewook Oh, "Revamping Sampling-Based PGO with Context-Sensitivity and Pseudo-instrumentation" paper. Section III.B of the paper ("Context-sensitive Sampling-based PGO") contains details.
Problem
This example demonstrates in which cases this optimization can be useful:
As can be seen,
scalarSub
is never called when invokingscalarOp
fromAddVectorHead
. The Go Compiler's inliner can't handle such cases and inlines bothscalarAdd
andscalarSub
inscalarOp
by default. In this simple example, unreachable calls will be eliminated later, after constant propagation and dead code elimination. However, this approach does not work in more complex cases where specific variable values are unknown. In such situations, we can use profile information to prevent the inlining of cold call sites, which helps reduce code size.A callsite may be cold on one execution path and hot on another. In the example above, a call to
scalarAdd
is hot whenscalarOp
is invoked fromAddVectorHead
while a call toscalarSub
is cold. After inliningscalarOp
inAddVectorHead
, the nodes of the inlined function body are traversed. Since we know the inlining pathAddVectorHead -> scalarOp
, we can check whether the pathAddVectorHead -> scalarOp -> scalarSub
is cold before inliningscalarSub
.Proposed changes
One possible solution is to use profile information. A collected profile contains a list of samples representing call stacks along with additional data. If there are no samples that match the inlining path (i.e. if this path didn't appear in the samples), the inlining of
scalarSub
can be prevented in this context. In this particular example, it may be best to inlinescalarSub
everywhere as it would need only a singlesub
instruction instead of a function call. However, preventing inlining can be beneficial when applied to relatively large functions. Context-sensitivity involves considering the path through a call graph that corresponds to the call stack leading to the current call site.Collect a profile for the example:
Compile with PGO enabled and look at inlining decisions:
As we can see on the last four lines, both
scalarAdd
andscalarSub
were inlined intoAddVectorHead
andSubVectorHead
.Context-Sensitive Inline prevents cold callsites from being inlined:
Implementation details
For this example, there are only 7 samples in the profile:
Each sample represents a call stack and its occurrence count in the profile. Stack frames are sorted from leaf to root.
AddVectorHead:15 -> scalarOp:25 -> scalarSub
doesn't appear in the profile, so inlining ofscalarSub
may be prevented in this case. However,scalarAdd
is worth being inlined after finding theAddVectorHead:15 -> scalarOp:25 -> scalarAdd
path in the graph.Results
Benchmarking with Sweet (threshold = 40) shows that this approach can reduce code size by up to 5.11% without any significant performance changes. Some internal tests show a code size reduction of 10% (threshold = 30).
ARMv8 Kunpeng920
Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
The text was updated successfully, but these errors were encountered: