Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: inlining confuses escape analysis with interface{} params #53465

Open
mvdan opened this issue Jun 20, 2022 · 14 comments
Open

cmd/compile: inlining confuses escape analysis with interface{} params #53465

mvdan opened this issue Jun 20, 2022 · 14 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@mvdan
Copy link
Member

mvdan commented Jun 20, 2022

Take a look at https://go.dev/play/p/mPqA0RaybuQ.

On go version devel go1.19-9068c6844d Thu Jun 16 21:58:40 2022 +0000 linux/amd64 I get the following benchmark results:

name                        time/op
ConditionalDebugf/false-16  69.4ns ± 3%
ConditionalDebugf/true-16    280ns ± 0%
InlinedDebugf/false-16      0.53ns ± 1%
InlinedDebugf/true-16        281ns ± 4%

name                        alloc/op
ConditionalDebugf/false-16   24.0B ± 0%
ConditionalDebugf/true-16    40.0B ± 0%
InlinedDebugf/false-16       0.00B     
InlinedDebugf/true-16        40.0B ± 0%

name                        allocs/op
ConditionalDebugf/false-16    2.00 ± 0%
ConditionalDebugf/true-16     3.00 ± 0%
InlinedDebugf/false-16        0.00     
InlinedDebugf/true-16         3.00 ± 0%

First, note that both debugf functions are inlined:

$ go test -gcflags='-m' |& grep Debugf
./go_test.go:12:6: can inline conditionalDebugf
./go_test.go:18:6: can inline inlinedDebugf
./go_test.go:30:22: inlining call to conditionalDebugf
./go_test.go:45:19: inlining call to inlinedDebugf

And, in the true sub-benchmarks where debug logging is enabled and we actually call Sprintf, we can see three allocations - two corresponding to the two interface{} parameters, as the string and int variables escape to the heap via the interface, and one for the actual call to Sprintf constructing a string.

So far so good. However, when debug logging is disabled, I want to avoid those allocations, because this debug logging happens in a tight loop that runs millions of times in a program, so I want to avoid millions of allocations made in a very short amount of time.

This works fine for the "manually inlined" case where I wrap the call to Debugf with a conditional. Note the zero allocations on InlinedDebugf/false. However, that means that every time I want to use Debugf (a dozen times in said program), I need three lines rather than one, and it's pretty repetitive and verbose.

One good alternative is to move the conditional inside the Debugf function - and since it's still inlined, it should be the same result. However, ConditionalDebugf/false allocates twice, presumably because the two arguments do still allocate, as if I was actually calling Sprintf when in fact I am not.

It seems to me like the two benchmarks should behave the same; the compiler's escape analysis and inliner should work well enough together to detect this scenario.

Note that my real use case involves log.Printf, which is a no-op when one has called log.SetOutput(io.Discard). The benchmark above is a simplification which still reproduces the same problem.

@randall77
Copy link
Contributor

This will be a hard one to fix. The conversion to interface happens when building the [2]interface{} to back the ...interface{} slice. That happens before the call to conditionalDebugf. Inlining doesn't really make a difference, it still gets built before the inlined body of conditionalDebugf executes. We'd have to realize that the thing being built is only used inside a conditional, and move that building down inside the conditional.

Similar to the code below, but for a [2]interface{} instead of just an interface{}.

var c bool
var sink interface{}

func f(x string) {
	var y interface{} = x
	if c {
		sink = y
	}
}

The compiler has a tighten pass which does this kind of movement, but it only works for SSA-able values, which these aren't.

@josharian
Copy link
Contributor

Could this be fixed as a special case? Recognize exactly these cases:

func f {
  if safeexpr {
    // …
  } 
}
func f {
  if !safeexpr {
    return
  } 
  // …
}

These are very common wrappers for a very common problem with significant readability impact; it might be worth the code weight in the compiler.

@mvdan
Copy link
Member Author

mvdan commented Jun 20, 2022

Intuitively, I'm hinting at @josharian's suggestion: as a programmer I'm inclined to move the conditional from the function body to each call site, because there are clearly some gains to be had. But that sort of manual partial inlining is what the compiler should be able to do, not me :)

We'd have to realize that the thing being built is only used inside a conditional, and move that building down inside the conditional.
The compiler has a tighten pass which does this kind of movement, but it only works for SSA-able values, which these aren't.

That does sound like a more generic solution, but I'm not in a position to judge how difficult it is compared to Josh's suggestion.

@beoran
Copy link

beoran commented Jun 23, 2022

Fyi, Zerolog fixed this by not having interface{} parameters in the first place. It does mean the api has to support all types with separate methods. https://github.com/rs/zerolog

@cagedmantis cagedmantis added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jun 23, 2022
@cagedmantis cagedmantis added this to the Backlog milestone Jun 23, 2022
@mvdan
Copy link
Member Author

mvdan commented Jul 2, 2022

@beoran that is indeed a possible solution, but a drastically different one that I'm not considering at the moment. It is more verbose too, so it doesn't exactly win against if debug { log.Printf(...) either :) It is likely better if what one wants is logging at maximum efficiency, but that is not what I'm after - I just want near-zero overhead when logging is disabled.

Regardless, I think it would be best for the Go compiler to gain this optimization, because ...any parameters will remain commonplace due to their usefulness.

@mvdan
Copy link
Member Author

mvdan commented Jul 2, 2022

Here's my inserting of the if debug { ... } conditionals in the downstream project, for a 5% reduction of total allocations: burrowers/garble#563

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022
@mknyszek mknyszek moved this to Triage Backlog in Go Compiler / Runtime Jul 15, 2022
@thockin
Copy link

thockin commented Nov 21, 2022

I think this is the same as I see in github.com/go-log/logr (xref #56873). I am seeing people adopt unnatural caller-side checking which feels like what mvdan said (#53465 (comment))

package main

import (
	"fmt"
	"testing"

	"github.com/go-logr/logr"
)

//go:noinline
func doV9(b *testing.B, log logr.Logger) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		log.V(9).Info("multi",
			"bool", true, "string", "str", "int", 42,
			"float", 3.14, "struct", struct{ X, Y int }{93, 76})
	}
}

//go:noinline
func doManualV9(b *testing.B, log logr.Logger) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		// This manual optimization should not be needed?
		if logV := log.V(9); logV.Enabled() {
			logV.Info("multi",
				"bool", true, "string", "str", "int", 42,
				"float", 3.14, "struct", struct{ X, Y int }{93, 76})
		}
	}
}

// This is a simple repro all in one package.
type Repro struct {
	enabled bool
}

//go:noinline
func (r Repro) Info(msg string, keysAndValues ...interface{}) {
	if r.enabled {
		r.Impl(msg, keysAndValues...)
	}
}

//go:noinline
func (r Repro) Impl(msg string, keysAndValues ...interface{}) {
	panic(fmt.Sprintln("this should not get called", msg, keysAndValues))
}

//go:noinline
func doLocalRepro(b *testing.B, repro Repro) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		repro.Info("multi",
			"bool", true, "string", "str", "int", 42,
			"float", 3.14, "struct", struct{ X, Y int }{93, 76})
	}
}

func BenchmarkLocalReproDisabled(b *testing.B) {
	r := Repro{false}
	doLocalRepro(b, r)
}

func BenchmarkDiscardV9(b *testing.B) {
	var log logr.Logger = logr.Discard()
	doV9(b, log)
}

func BenchmarkDiscardManualV9(b *testing.B) {
	var log logr.Logger = logr.Discard()
	doManualV9(b, log)
}

@pohly
Copy link

pohly commented Nov 21, 2022

Should the basic problem get addressed, would it also eliminate function calls that are needed to construct the parameter slice?

In Kubernetes, we often have:

logger.V(5).Info("created", "object", klog.KObj(pod))

where KObj is https://github.com/kubernetes/klog/blob/9ae4992afb54bb4686590fff003828c30c84c92c/k8s_references.go#L56. It shouldn't have any side effects, but I guess the compiler would have to be able to figure that out, otherwise it cannot skip the call?

@mvdan
Copy link
Member Author

mvdan commented Jul 15, 2023

@pohly this issue is just for the escape analysis. That is, if the logger is disabled, the log arguments should not be escaping. A few allocations per log call add up.

For the work involved in preparing the logged values, see #37739. For pure functions maybe the compiler could be smart enough to do this for you, but I'd follow that issue regardless.

@mvdan
Copy link
Member Author

mvdan commented Nov 20, 2024

Perhaps interesting to @dr2chase given his recent work on better inlining for iterator funcs, e.g. #69411. I still think something like #53465 (comment) would be worthwhile.

@thepudds
Copy link
Contributor

FWIW, it seems to me this might be solved by #8618 (comment) and #62653. (However, I find escape analysis to be subtle, so I probably should have thought longer about this example before speaking here 😅).

Using one the CLs from that stack, the ConditionalDebugf/false case goes to 0 allocations (which if I understood seems to be @mvdan's main goal here).

Also, the ConditionalDebugf/true case with that same CL goes from 3 allocations to 1 allocation. (The heap allocation of the Sprintf string result remains, which might be required/expected given the Sprintf result is assigned to a global by @mvdan's benchmark).

$ go install golang.org/dl/gotip@latest
$ gotip download 528538

                          │    go1.23    │              cl-528538                  │
                          │  allocs/op   │ allocs/op   vs base                     │
ConditionalDebugf/false-4   2.000 ± 0%     0.000 ± 0%  -100.00% (p=0.000 n=10)
ConditionalDebugf/true-4    3.000 ± 0%     1.000 ± 0%   -66.67% (p=0.000 n=10)
InlinedDebugf/false-4       0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) 
InlinedDebugf/true-4        3.000 ± 0%     1.000 ± 0%   -66.67% (p=0.000 n=10)

(That particular CL 528538 is abandoned mainly because it relied on me having created a new pragma to solve a reflect.Value.Interface allocation in an earlier CL, but I later concluded a more general approach was likely possible, which is now spread across a few other CLs in that stack. #8618 (comment) has a brief overview of that).

All that said, changes along the lines suggested by Josh above or Keith above could also be useful in their own right, or maybe there is a good inining change to make here...

@mvdan
Copy link
Member Author

mvdan commented Nov 20, 2024

Interesting, thank you. When your changes land in master I'll try again with my real project to see whether the allocations there are still present. The example in the original post here is a simplification, so the funcs involved are a lot smaller.

@thepudds
Copy link
Contributor

thepudds commented Mar 2, 2025

Hi @mvdan, FWIW I now have a different stack of CLs that use a different set of approaches that seemingly avoid the unwanted allocations in your specific benchmark.

In #18704, Josh made it so that the compiler can recognize the 1000 here as a literal and use read-only global storage for the underlying interface value:

       fmt.Println(1000)

My CLs extend that in the compiler to work better across inlining (as in your benchmark here) and also to work with examples like:

       v := 1000
       fmt.Println(v)

The compiler also now creates and uses read-only global storage for certain composite literals that get used in an interface conversion, like:

       s := S{1, 1}
       fmt.Println(s)

I put a write-up of the overall approach in #71359, and you can see the current WIP stack via the relation chain in https://go.dev/cl/649555 for example. I had experimented with this approach a while ago, but seeing your issue here a few months ago nudged me to try to get it closer to reviewable shape, though clearly it's still a WIP.

Of course, your simplified benchmark is likely an easier case than your actual more complex application usage. I suspect this stack would help in some of your actual cases, but probably not all.

#62653 and #8618 (comment) I think are still the more general solutions which I think would apply even for most complex uses, and I'm still moving that forward.


Benchmark results (via gotip):

$ go install golang.org/dl/gotip@latest
$ gotip download 649555


                          │   go1.24.0   │              cl-649555               │
                          │    sec/op    │    sec/op     vs base                │
ConditionalDebugf/false-4   45.115n ± 1%    1.193n ± 0%  -97.36% (p=0.000 n=10)
ConditionalDebugf/true-4     178.5n ± 0%    116.1n ± 0%  -34.96% (p=0.000 n=10)
InlinedDebugf/false-4       0.3726n ± 0%   0.3728n ± 0%        ~ (p=0.754 n=10)
InlinedDebugf/true-4         177.6n ± 0%    115.4n ± 1%  -35.05% (p=0.000 n=10)
geomean                      27.02n         8.784n       -67.49%

                          │   go1.24.0   │                cl-649555                │
                          │  allocs/op   │ allocs/op   vs base                     │
ConditionalDebugf/false-4   2.000 ± 0%     0.000 ± 0%  -100.00% (p=0.000 n=10)
ConditionalDebugf/true-4    3.000 ± 0%     1.000 ± 0%   -66.67% (p=0.000 n=10)
InlinedDebugf/false-4       0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
InlinedDebugf/true-4        3.000 ± 0%     1.000 ± 0%   -66.67% (p=0.000 n=10)

@thepudds
Copy link
Contributor

thepudds commented Mar 3, 2025

Hi @mvdan, hmm, while #71359 does avoid the undesired allocs in your benchmark at the top of here, if this issue was specifically filed for garble, I suspect #71359 would not help those specific uses (e.g., for the examples in burrowers/garble#563).

On the other hand, I think #62653 and #8618 (comment) would resolve those examples from that PR (with a caveat that's based just on a quick look).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
Status: Triage Backlog
Development

No branches or pull requests

9 participants