proposal: cmd/internal/obj/arm64, cmd/link/internal/arm64: introduce GOARM64 ",func_align_32" suffix #72130

Open
alexanius opened this issue Mar 6, 2025 · 4 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Proposal

@alexanius

Proposal Details

Abstract

This proposal introduces a func_align_32 value for the GOARM64 environment variable. This value makes the compiler align the entry points of both Go and assembly functions to 32-byte boundaries instead of the current 16-byte boundaries.

Background and example

Code alignment affects how the CPU's instruction fetcher and branch predictor work, and therefore influences application execution speed. Currently the Go specification has no requirements on code alignment, and the effective function alignment on ARM64 systems is 16 bytes.

Depending on the ARM CPU model, the optimization guides may state the following:

Consider aligning subroutine entry points and branch targets to 32B boundaries, within the bounds of the code-density requirements of the program. This will ensure that the subsequent fetch can maximize bandwidth following the taken branch by bringing in all useful instructions

Arm® Cortex®-A76 Software Optimization Guide.

Consider aligning subroutine entry points and branch targets to 16-byte boundaries, within the bounds of the code-density requirements of the program. This ensures that the subsequent fetch can maximize bandwidth following the taken branch by bringing in all useful instructions

Arm® Cortex®-A75 Software Optimization Guide

Because of the alignment capabilities, software alignment of branch targets is generally unnecessary and sometimes detrimental. Rather, software should retain unaligned branch targets in favor of reduced code size

Apple Silicon CPU Optimization Guide: 3.0

On the one hand, such alignment can improve performance.
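To make the fetch-bandwidth argument from the Cortex-A76 quotation concrete, here is a minimal sketch. It assumes the CPU fetches one aligned 32-byte window per cycle; the window size, the helper names, and the example address are assumptions made for illustration, not values taken from the guides or from the Go toolchain.

```go
package main

import "fmt"

// usefulBytes reports how many bytes of the first fetch after a taken branch
// belong to the branch target, assuming the CPU fetches one aligned window
// of `window` bytes at a time (an assumption of this sketch).
func usefulBytes(entry, window uint64) uint64 {
	return window - entry%window
}

// alignUp rounds addr up to the next multiple of align.
// align must be a power of two.
func alignUp(addr, align uint64) uint64 {
	return (addr + align - 1) &^ (align - 1)
}

func main() {
	const window = 32

	// Hypothetical function entry: 16-byte aligned but not 32-byte aligned.
	entry := uint64(0x10010)
	fmt.Println(usefulBytes(entry, window)) // 16: half of the first fetch is wasted

	// The same function after padding to a 32-byte boundary.
	aligned := alignUp(entry, 32)
	fmt.Println(usefulBytes(aligned, window)) // 32: the whole fetch is useful
	fmt.Println(aligned - entry)              // 16 bytes of padding paid in code size
}
```

The last line shows the trade-off the guides mention: the better first fetch is bought with padding bytes, which is the code-density cost.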

The other aspect is the stability of performance measurements. Suppose we want to measure the performance effect of removing bounds checks (-gcflags=all=-B). We use the ethereum_bitutil tests from the bent benchmark suite on a Kunpeng920 CPU (Go 1.21 release):

$ ~/go/bin/benchstat base.log no_bound_check.log
                        │  base.log   │          no_bound_check.log          │
                        │   sec/op    │    sec/op     vs base                │
BaseTest2KB-4             992.2n ± 0%   1584.0n ± 0%  +59.64% (p=0.000 n=10)
Encoding4KBVerySparse-4   19.78µ ± 1%    16.47µ ± 0%  -16.73% (p=0.000 n=10)
FastTest2KB-4             992.4n ± 0%   1583.0n ± 0%  +59.51% (p=0.000 n=10)
geomean                   2.691µ         3.457µ       +28.47%

The results show that disabling bounds checks makes the BaseTest2KB and FastTest2KB tests ~60% slower. But these benchmarks do not contain any bounds checks:

func BenchmarkBaseTest2KB(b *testing.B) { benchmarkBaseTest(b, 2048) }
 
func benchmarkBaseTest(b *testing.B, size int) {
        p := make([]byte, size)
        a := false
        for i := 0; i < b.N; i++ {
                a = a != safeTestBytes(p)
        }
        GloBool = a // Use of benchmark "result" to prevent total dead code elimination.
}
 
func safeTestBytes(p []byte) bool {
        for i := 0; i < len(p); i++ {
                if p[i] != 0 {
                        return true
                }
        }
        return false
}

These tests are very sensitive to code alignment, and we want to avoid such induced effects on performance measurements. 32-byte alignment helps stabilize the measurement results before and after the optimization:

$ ~/go/bin/benchstat base_32_align.log no_bound_check_32_align.log
                        │ base_32_align.log │    no_bound_check_32_align.log     │
                        │      sec/op       │   sec/op     vs base               │
BaseTest2KB-4                   1.583µ ± 0%   1.583µ ± 0%       ~ (p=0.474 n=10)
Encoding4KBVerySparse-4         19.99µ ± 0%   18.48µ ± 0%  -7.56% (p=0.000 n=10)
FastTest2KB-4                   1.584µ ± 0%   1.583µ ± 0%       ~ (p=0.370 n=10)
geomean                         3.687µ        3.591µ       -2.61%

Here we can see that simply aligning the code to a 32-byte boundary gives a more realistic picture of the effect of disabling bounds checks: the tests that contain no such checks show the same performance before and after.
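One way to check how a given build actually placed a function is to print its entry address modulo 16 and 32. The following is a rough sketch using only the standard reflect and runtime packages; the helper names entryOf and misalignment are made up for this example, and the addresses printed depend on the toolchain, platform, and layout. It only inspects the function entry, not the alignment of loops inside the function.

```go
package main

import (
	"fmt"
	"reflect"
	"runtime"
)

// misalignment returns how far addr lies past the previous multiple of
// boundary; 0 means addr is aligned to boundary.
func misalignment(addr, boundary uintptr) uintptr {
	return addr % boundary
}

// entryOf returns the entry address of a top-level function value.
func entryOf(fn interface{}) uintptr {
	pc := reflect.ValueOf(fn).Pointer()
	return runtime.FuncForPC(pc).Entry()
}

// safeTestBytes is the function from the benchmark above; noinline keeps it
// as a separate function with its own entry point.
//
//go:noinline
func safeTestBytes(p []byte) bool {
	for i := 0; i < len(p); i++ {
		if p[i] != 0 {
			return true
		}
	}
	return false
}

func main() {
	e := entryOf(safeTestBytes)
	fmt.Printf("entry=%#x mod16=%d mod32=%d\n",
		e, misalignment(e, 16), misalignment(e, 32))
}
```

Running this for the base and the modified build shows whether an unrelated change shifted the entry of the measured function between a 32-byte-aligned and a merely 16-byte-aligned address.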

Summary

Such alignment is a recommended optimization for certain ARM64 CPU models, at the cost of increased code size. Therefore, we propose introducing a "func_align_32" value for the GOARM64 environment variable, which can be enabled conditionally depending on the target CPU model. This option can also be used to improve the stability of performance measurements on CPUs that are sensitive to branch target alignment.

The proposal is already implemented in CL 615736.

@gopherbot gopherbot added this to the Proposal milestone Mar 6, 2025
@randall77
Contributor

Such alignment is a recommended optimization for ARM64 CPUs, for certain ARM64 CPU models

Which models? Where does that recommendation come from?

Can you demonstrate performance improvements with your CL?

Note that you can mark individual assembly functions with PCALIGN directives to get higher alignment.

@alexanius
Author

Such alignment is a recommended optimization for ARM64 CPUs, for certain ARM64 CPU models

Which models? Where does that recommendation come from?

The recommendation comes from the optimization guides for particular CPU models. For example, the first quotation comes from the Arm Cortex-A76 Software Optimization Guide, section 4.8, "Branch instruction alignment". The second quotation comes from the Arm Cortex-A75 Software Optimization Guide, section 4.6, "Branch instruction alignment".

Can you demonstrate performance improvements with your CL?

For my part, I use it to stabilize benchmarking. As I showed in the proposal text, on ARM platforms (Kunpeng920 in my case), alignment changes caused by optimizations unrelated to the measured functions can change their performance by almost 60%. In less synthetic benchmarks I have observed similar performance changes of about 10%, caused solely by changes that do not touch the code of the measured functions but do shift their alignment.

Note that you can mark individual assembly functions with PCALIGN directives to get higher alignment.

Yes, but this proposal automatically aligns all functions, not only assembly ones. So manually adding PCALIGN is not equivalent to this proposal.

@ianlancetaylor ianlancetaylor moved this to Incoming in Proposals Mar 6, 2025
@ianlancetaylor ianlancetaylor added the compiler/runtime Issues related to the Go compiler and/or runtime. label Mar 6, 2025
@ianlancetaylor
Member

I'll just note that GCC has a -falign-functions option that applies to all processors. It's complicated (of course) so I'll quote the docs:

-falign-functions
-falign-functions=n
-falign-functions=n:m
-falign-functions=n:m:n2
-falign-functions=n:m:n2:m2

Align the start of functions to the next power-of-two greater than or equal to n, skipping up to m-1 bytes. This ensures that at least the first m bytes of the function can be fetched by the CPU without crossing an n-byte alignment boundary. This is an optimization of code performance and alignment is ignored for functions considered cold. If alignment is required for all functions, use -fmin-function-alignment.

If m is not specified, it defaults to n.

Examples: -falign-functions=32 aligns functions to the next 32-byte boundary, -falign-functions=24 aligns to the next 32-byte boundary only if this can be done by skipping 23 bytes or less, -falign-functions=32:7 aligns to the next 32-byte boundary only if this can be done by skipping 6 bytes or less.

The second pair of n2:m2 values allows you to specify a secondary alignment: -falign-functions=64:7:32:3 aligns to the next 64-byte boundary if this can be done by skipping 6 bytes or less, otherwise aligns to the next 32-byte boundary if this can be done by skipping 2 bytes or less. If m2 is not specified, it defaults to n2.

Some assemblers only support this flag when n is a power of two; in that case, it is rounded up.

-fno-align-functions and -falign-functions=1 are equivalent and mean that functions are not aligned.

If n is not specified or is zero, use a machine-dependent default. The maximum allowed n option value is 65536.

Enabled at levels -O2, -O3.
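The n:m and n:m:n2:m2 rules quoted above can be modeled in a few lines. This is only a sketch of the rule as the documentation text describes it, not GCC's implementation; the function names are invented for illustration, and the case where n is zero or omitted (machine-dependent default) is not modeled.

```go
package main

import "fmt"

// nextPow2 returns the smallest power of two that is >= n, for n >= 1.
func nextPow2(n uint64) uint64 {
	p := uint64(1)
	for p < n {
		p <<= 1
	}
	return p
}

// alignSkip models -falign-functions=n:m as described in the quoted text:
// align addr to the next power-of-two boundary >= n, but only if that can be
// done by skipping at most m-1 bytes. m == 0 means "not specified" and
// defaults to n. n must be >= 1. The bool reports whether alignment happened.
func alignSkip(addr, n, m uint64) (uint64, bool) {
	if m == 0 {
		m = n
	}
	b := nextPow2(n)
	pad := (b - addr%b) % b
	if pad <= m-1 {
		return addr + pad, true
	}
	return addr, false
}

// alignTwoLevel models -falign-functions=n:m:n2:m2: try the primary
// alignment first and fall back to the secondary one if too much padding
// would be needed.
func alignTwoLevel(addr, n, m, n2, m2 uint64) uint64 {
	if a, ok := alignSkip(addr, n, m); ok {
		return a
	}
	a, _ := alignSkip(addr, n2, m2)
	return a
}

func main() {
	a, _ := alignSkip(0x1a, 32, 7) // 6 bytes of padding: allowed
	fmt.Printf("%#x\n", a)         // 0x20
	a, _ = alignSkip(0x19, 32, 7)  // 7 bytes of padding: too many
	fmt.Printf("%#x\n", a)         // 0x19
	// 34 bytes to the 64-byte boundary is too many, 2 bytes to the
	// 32-byte boundary is allowed.
	fmt.Printf("%#x\n", alignTwoLevel(0x1e, 64, 7, 32, 3)) // 0x20
}
```

The skip limit is what distinguishes this scheme from an unconditional alignment: it caps the padding spent per function, which bounds the code-size cost.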

@cherrymui
Member

What is the binary size impact if we always increase the alignment to 32? If the binary size increase is small, we could consider always doing it.

Have you tried randomizing function order, both with the current alignment and with 32-byte alignment? You can do this with the -ldflags=-randlayout=N flag, where N is a number used to seed the random number generator. It would be interesting to know how sensitive your benchmark is to layout, and whether increased alignment makes it more stable across various layouts.

If we decide to do this, I'd propose we use a linker command line flag, like -textalign or -funcalign.
