proposal: cmd/internal/obj/arm64, cmd/link/internal/arm64: introduce GOARM64 ",func_align_32" suffix #72130

alexanius · 2025-03-06T13:24:21Z

Proposal Details

Abstract

This proposal introduces a func_align_32 value for the GOARM64 environment variable. This value forces the compiler to align the entries of both Go and assembly functions on 32-byte boundaries instead of the current 16 bytes.

Background and example

Code alignment affects how the CPU's instruction fetcher and branch predictor work, and therefore influences application execution speed. Currently the Go specification places no requirements on code alignment, and the effective function alignment on ARM64 systems is 16 bytes.
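To make the difference between 16-byte and 32-byte alignment concrete, here is a minimal sketch of the round-up arithmetic involved. The address `prevEnd` is a hypothetical value chosen for illustration, and `alignUp` is not a function from the Go toolchain.

```go
package main

import "fmt"

// alignUp rounds addr up to the next multiple of align.
// align must be a power of two.
func alignUp(addr, align uint64) uint64 {
	return (addr + align - 1) &^ (align - 1)
}

func main() {
	// Hypothetical address at which the previous function ends.
	const prevEnd = 0x10064
	for _, align := range []uint64{16, 32} {
		entry := alignUp(prevEnd, align)
		fmt.Printf("align=%d entry=%#x padding=%d bytes\n", align, entry, entry-prevEnd)
	}
}
```

With this particular address, the 32-byte case needs more padding than the 16-byte case, which is the code-size cost the optimization guides mention.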

Depending on the ARM CPU model, the optimization guides may state the following:

Consider aligning subroutine entry points and branch targets to 32B boundaries, within the bounds of the code-density requirements of the program. This will ensure that the subsequent fetch can maximize bandwidth following the taken branch by bringing in all useful instructions

Arm® Cortex®-A76 Software Optimization Guide.

Consider aligning subroutine entry points and branch targets to 16-byte boundaries, within the bounds of the code-density requirements of the program. This ensures that the subsequent fetch can maximize bandwidth following the taken branch by bringing in all useful instructions

Arm® Cortex®-A75 Software Optimization Guide

Because of the alignment capabilities, software alignment of branch targets is generally unnecessary and sometimes detrimental. Rather, software should retain unaligned branch targets in favor of reduced code size

Apple Silicon CPU Optimization Guide: 3.0

On the one hand, such alignment can improve performance on the CPU models whose guides recommend it.

The other aspect is the stability of performance measurements. Suppose we want to measure the effect of removing bounds checks (-gcflags=all=-B). We use the ethereum_bitutil tests from the bent benchmark suite on a Kunpeng920 CPU (Go 1.21 release):

$ ~/go/bin/benchstat base.log no_bound_check.log
                        │  base.log   │          no_bound_check.log          │
                        │   sec/op    │    sec/op     vs base                │
BaseTest2KB-4             992.2n ± 0%   1584.0n ± 0%  +59.64% (p=0.000 n=10)
Encoding4KBVerySparse-4   19.78µ ± 1%    16.47µ ± 0%  -16.73% (p=0.000 n=10)
FastTest2KB-4             992.4n ± 0%   1583.0n ± 0%  +59.51% (p=0.000 n=10)
geomean                   2.691µ         3.457µ       +28.47%

The results suggest that disabling bounds checks makes BaseTest2KB and FastTest2KB about 60% slower. But these benchmarks do not have any bounds checks:

func BenchmarkBaseTest2KB(b *testing.B) { benchmarkBaseTest(b, 2048) }
 
func benchmarkBaseTest(b *testing.B, size int) {
        p := make([]byte, size)
        a := false
        for i := 0; i < b.N; i++ {
                a = a != safeTestBytes(p)
        }
        GloBool = a // Use of benchmark "result" to prevent total dead code elimination.
}
 
func safeTestBytes(p []byte) bool {
        for i := 0; i < len(p); i++ {
                if p[i] != 0 {
                        return true
                }
        }
        return false
}

These tests are very sensitive to code alignment, and we want to keep such induced effects out of performance measurements. 32-byte alignment stabilizes the results measured before and after the optimization:

$ ~/go/bin/benchstat base_32_align.log no_bound_check_32_align.log
                        │ base_32_align.log │    no_bound_check_32_align.log     │
                        │      sec/op       │   sec/op     vs base               │
BaseTest2KB-4                   1.583µ ± 0%   1.583µ ± 0%       ~ (p=0.474 n=10)
Encoding4KBVerySparse-4         19.99µ ± 0%   18.48µ ± 0%  -7.56% (p=0.000 n=10)
FastTest2KB-4                   1.584µ ± 0%   1.583µ ± 0%       ~ (p=0.370 n=10)
geomean                         3.687µ        3.591µ       -2.61%

Here we can see that simply aligning the code to a 32-byte boundary gives a more realistic picture of the effect of disabling bounds checks. The tests that contain no such checks show the same performance in both runs.
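One way to see which alignment a hot function actually received in a given build is to inspect its entry address at run time. The following is a rough sketch only: `entryOf` and `maxAlign` are hypothetical helpers, and the address returned by `reflect.Value.Pointer` is assumed to be the function's entry point, which holds on common platforms but is not something the proposal itself relies on.

```go
package main

import (
	"fmt"
	"reflect"
)

//go:noinline
func safeTestBytes(p []byte) bool {
	for i := 0; i < len(p); i++ {
		if p[i] != 0 {
			return true
		}
	}
	return false
}

// entryOf returns the code address of fn, which must be a func value.
func entryOf(fn any) uintptr {
	return reflect.ValueOf(fn).Pointer()
}

// maxAlign returns the largest power of two, up to limit, that divides addr.
func maxAlign(addr, limit uintptr) uintptr {
	a := uintptr(1)
	for a < limit && addr%(a*2) == 0 {
		a *= 2
	}
	return a
}

func main() {
	pc := entryOf(safeTestBytes)
	fmt.Printf("entry=%#x alignment=%d bytes\n", pc, maxAlign(pc, 64))
}
```

Running this in two builds that differ only in an unrelated optimization would show whether the entry of the measured function moved between a 16-byte and a 32-byte boundary.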

Summary

Such alignment is a recommended optimization for certain ARM64 CPU models, at the cost of increased code size. Therefore, we propose introducing a "func_align_32" suffix for the GOARM64 environment variable, which can be enabled depending on the target CPU model. This option can also be used to improve the stability of performance measurements on CPUs that are sensitive to branch target alignment.

The proposal is already implemented in CL 615736.

randall77 · 2025-03-06T15:47:21Z

Such alignment is a recommended optimization for ARM64 CPUs, for certain ARM64 CPU models

Which models? Where does that recommendation come from?

Can you demonstrate performance improvements with your CL?

Note that you can mark individual assembly functions with PCALIGN directives to get higher alignment.

alexanius · 2025-03-06T17:00:09Z

Such alignment is a recommended optimization for ARM64 CPUs, for certain ARM64 CPU models

Which models? Where does that recommendation come from?

The recommendation comes from the optimization guides for particular CPU models. For example, the first quotation comes from the Arm Cortex-A76 Software Optimization Guide, section 4.8, "Branch instruction alignment". The second comes from the Arm Cortex-A75 Software Optimization Guide, section 4.6, "Branch instruction alignment".

Can you demonstrate performance improvements with your CL?

For my part, I use it to stabilize benchmarking. As I showed in the proposal text, on ARM platforms (Kunpeng920 in my case), alignment changes caused by optimizations unrelated to the measured functions may change their performance by almost 60%. In less synthetic benchmarks I have observed similar performance changes of about 10%, caused solely by changes that do not touch the code of the measured functions but do shift their alignment.

Note that you can mark individual assembly functions with PCALIGN directives to get higher alignment.

Yes, but this proposal automatically aligns all functions, not only assembly ones. So manually adding PCALIGN is not equivalent to this proposal.

ianlancetaylor · 2025-03-06T17:56:33Z

I'll just note that GCC has a -falign-functions option that applies to all processors. It's complicated (of course) so I'll quote the docs:

-falign-functions
-falign-functions=n
-falign-functions=n:m
-falign-functions=n:m:n2
-falign-functions=n:m:n2:m2

Align the start of functions to the next power-of-two greater than or equal to n, skipping up to m-1 bytes. This ensures that at least the first m bytes of the function can be fetched by the CPU without crossing an n-byte alignment boundary. This is an optimization of code performance and alignment is ignored for functions considered cold. If alignment is required for all functions, use -fmin-function-alignment.

If m is not specified, it defaults to n.

Examples: -falign-functions=32 aligns functions to the next 32-byte boundary, -falign-functions=24 aligns to the next 32-byte boundary only if this can be done by skipping 23 bytes or less, -falign-functions=32:7 aligns to the next 32-byte boundary only if this can be done by skipping 6 bytes or less.

The second pair of n2:m2 values allows you to specify a secondary alignment: -falign-functions=64:7:32:3 aligns to the next 64-byte boundary if this can be done by skipping 6 bytes or less, otherwise aligns to the next 32-byte boundary if this can be done by skipping 2 bytes or less. If m2 is not specified, it defaults to n2.

Some assemblers only support this flag when n is a power of two; in that case, it is rounded up.

-fno-align-functions and -falign-functions=1 are equivalent and mean that functions are not aligned.

If n is not specified or is zero, use a machine-dependent default. The maximum allowed n option value is 65536.

Enabled at levels -O2, -O3.
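The n:m rule quoted above can be modelled in a few lines. This is a simplified sketch of the documented behaviour for a single n:m pair, not GCC's implementation; `alignFunc` and `nextPow2` are hypothetical names, and passing m as 0 is this sketch's own convention for "m not specified".

```go
package main

import "fmt"

// nextPow2 returns the smallest power of two that is >= n.
func nextPow2(n uint64) uint64 {
	p := uint64(1)
	for p < n {
		p <<= 1
	}
	return p
}

// alignFunc models -falign-functions=n:m for one function start address.
// The address is moved to the next boundary only if that skips at most
// m-1 bytes. m == 0 means "not specified" and defaults to n.
func alignFunc(addr, n, m uint64) uint64 {
	if m == 0 {
		m = n
	}
	a := nextPow2(n)
	aligned := (addr + a - 1) &^ (a - 1)
	if aligned-addr <= m-1 {
		return aligned
	}
	return addr // too much padding required: leave the function where it is
}

func main() {
	fmt.Printf("%#x\n", alignFunc(0x101c, 32, 7)) // 4 bytes of padding needed
	fmt.Printf("%#x\n", alignFunc(0x1004, 32, 7)) // 28 bytes of padding needed
	fmt.Printf("%#x\n", alignFunc(0x100c, 24, 0)) // n=24 rounds up to 32
}
```

The m limit is what lets this scheme trade alignment against code size, which the fixed 32-byte setting in the proposal does not attempt.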

cherrymui · 2025-03-06T19:13:28Z

What is the binary size impact if we always increase the alignment to 32? If the binary size increase is small, we could consider always doing it.
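As a rough way to reason about the size question, the padding added by a given alignment can be estimated from a list of function sizes. The sketch below uses made-up sizes and a hypothetical helper, `totalPadding`; real numbers would have to come from the symbol table of an actual binary.

```go
package main

import "fmt"

// totalPadding lays out functions of the given sizes back to back,
// aligning each entry to align (a power of two), and returns the
// number of padding bytes inserted between them.
func totalPadding(sizes []uint64, align uint64) uint64 {
	var addr, pad uint64
	for _, s := range sizes {
		aligned := (addr + align - 1) &^ (align - 1)
		pad += aligned - addr
		addr = aligned + s
	}
	return pad
}

func main() {
	// Hypothetical function sizes in bytes (multiples of the 4-byte
	// ARM64 instruction width).
	sizes := []uint64{20, 36, 64, 8, 100}
	for _, align := range []uint64{16, 32} {
		fmt.Printf("align=%d padding=%d bytes\n", align, totalPadding(sizes, align))
	}
}
```

Small functions dominate the overhead in this model, so the impact on a real binary depends heavily on its distribution of function sizes.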

Have you tried randomizing function order, both with the current alignment and with 32-byte alignment? You can do this with the -ldflags=-randlayout=N flag, where N is a number used to seed the random number generator. It would be interesting to know how sensitive your benchmark is to layout, and whether increased alignment makes it more stable across various layouts.
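A layout experiment of this kind could be scripted by generating one benchmark invocation per seed. The sketch below only prints the command lines; `benchArgs` is a hypothetical helper, the seed range and `-count` value are arbitrary, and the flag is spelled on the assumption that the linker option is `-randlayout` passed through `-ldflags`.

```go
package main

import (
	"fmt"
	"strings"
)

// benchArgs returns the arguments for a `go` invocation that runs all
// benchmarks in the current package with the given layout seed.
func benchArgs(seed int) []string {
	return []string{
		"test",
		"-run=^$", // skip ordinary tests
		"-bench=.",
		"-count=10",
		fmt.Sprintf("-ldflags=-randlayout=%d", seed),
	}
}

func main() {
	// Arbitrary seeds; each one yields a different function order.
	for seed := 1; seed <= 5; seed++ {
		fmt.Println("go " + strings.Join(benchArgs(seed), " "))
	}
}
```

Collecting the output of each run into its own log file and comparing the logs with benchstat would show how much of the variation is due to layout alone.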

If we decide to do this, I'd propose we use a linker command line flag, like -textalign or -funcalign.

alexanius added the Proposal label Mar 6, 2025

gopherbot added this to the Proposal milestone Mar 6, 2025

ianlancetaylor added this to Proposals Mar 6, 2025

ianlancetaylor moved this to Incoming in Proposals Mar 6, 2025

ianlancetaylor added the compiler/runtime Issues related to the Go compiler and/or runtime. label Mar 6, 2025
