proposal: cmd/internal/obj/arm64, cmd/link/internal/arm64: introduce GOARM64 ",func_align_32" suffix #72130
Comments
Which models? Where does that recommendation come from? Can you demonstrate performance improvements with your CL? Note that you can mark individual assembly functions with |
The recommendation comes from the optimization guides for particular CPU models. For example, the first quotation comes from the Arm Cortex-A76 Software Optimization Guide, section 4.8, "Branch instruction alignment". The second quotation comes from the Arm Cortex-A75 Software Optimization Guide, section 4.6, "Branch instruction alignment".
For my part, I use it to stabilize benchmarking. As I showed in the proposal text, on ARM platforms (Kunpeng920 in my case), alignment changes caused by optimizations unrelated to the measured functions can affect their performance by almost 60%. In less synthetic benchmarks I have observed similar performance changes of about 10%, caused solely by changes that do not touch the code of the measured functions but do shift their alignment.
Yes, but this proposal automatically aligns all functions, not only assembly ones. So adding PCALIGN manually is not equivalent to this proposal. |
I'll just note that GCC has a
|
What is the binary size impact if we always increase the alignment to 32? If the binary size increase is small, we could consider always doing it. Have you tried randomizing function order, both with the current alignment and with 32-byte alignment? You can do this by using the If we decide to do this, I'd propose we use a linker command line flag, like |
Proposal Details
Abstract
This proposal introduces a func_align_32 value for the GOARM64 environment variable. This value forces the compiler to align the entries of both Go and assembly functions on 32-byte boundaries instead of the current 16 bytes.
Background and example
Code alignment affects how the CPU's fetch unit and branch predictor work, and so influences application execution speed. Currently the Go specification places no requirements on code alignment, and the effective code alignment on ARM64 systems is 16 bytes.
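To make the 16-byte versus 32-byte distinction concrete, here is a minimal Go sketch that inspects the entry address of one function in the running binary. The helpers `alignUp` and `isAligned` and the function `sample` are hypothetical names introduced only for illustration; they are not part of the proposal or of the CL. The printed values depend on the toolchain, platform, and build, so no particular output is claimed.

```go
package main

import (
	"fmt"
	"reflect"
)

// alignUp rounds addr up to the next multiple of align.
// align is assumed to be a power of two.
func alignUp(addr, align uintptr) uintptr {
	return (addr + align - 1) &^ (align - 1)
}

// isAligned reports whether addr is a multiple of align.
// align is assumed to be a power of two.
func isAligned(addr, align uintptr) bool {
	return addr&(align-1) == 0
}

// sample is a hypothetical function whose entry address we inspect.
//
//go:noinline
func sample() int { return 42 }

func main() {
	// For a func value, reflect's Pointer method returns the
	// underlying code pointer, i.e. the function's entry address.
	entry := reflect.ValueOf(sample).Pointer()

	fmt.Printf("entry=%#x aligned16=%v aligned32=%v\n",
		entry, isAligned(entry, 16), isAligned(entry, 32))
	fmt.Printf("next 32-byte boundary: %#x\n", alignUp(entry, 32))
}
```

Running this before and after an unrelated code change is one way to see whether the entry of a measured function has moved relative to a 32-byte boundary.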
Depending on the ARM CPU model, the optimization guides may state the following:
Arm® Cortex®-A76 Software Optimization Guide.
Arm® Cortex®-A75 Software Optimization Guide
Apple Silicon CPU Optimization Guide: 3.0
On the one hand, such alignment can improve performance.
The other aspect is the stability of performance measurements. Let's assume we want to measure the performance improvement from removing bounds checks (-gcflags=all=-B). We will use the ethereum_bitutil tests from the bent benchmark suite on a Kunpeng920 CPU (Go 1.21 release):
The results show that disabling bounds checks makes the BaseTest2KB and FastTest2KB tests ~60% slower. But these benchmarks do not contain any bounds checks:
These tests are very sensitive to code alignment, and we want to avoid such induced effects on performance measurements. 32-byte alignment helps stabilize the measurement results before and after the optimization:
Here we can see that simply aligning the code on a 32-byte boundary shows a more realistic effect of disabling bounds checks. The tests that contain no such checks show the same performance in both configurations.
Summary
Such alignment is a recommended optimization for certain ARM64 CPU models, at the cost of increased code size. Therefore, we propose introducing a "func_align_32" value for the GOARM64 environment variable, which can be enabled conditionally depending on the target CPU model. This option can also be used to improve the stability of performance measurement results on CPUs that are sensitive to branch target alignment.
The proposal is already implemented in CL 615736.