runtime: process hangs when built with Go 1.23 for mips hardware #71591
cc @golang/mips |
The initial issue had the wrong values for |
Given that you've identified potential commits, is it possible for you to build Go with either of the commits reverted and test it? |
I rebuilt the application based on go1.23.6 with reverted 9623a35, branch. After some hours the running application stopped responding. gdb output:
(gdb) thread apply all backtrace
Thread 11 (LWP 2025):
#0 internal/runtime/atomic.spinLock () at internal/runtime/atomic/atomic_mipsx.s:250
#1 0x004c10ac in internal/runtime/atomic.lockAndCheck (addr=0x1740fb0 <runtime.sched+16>) at internal/runtime/atomic/atomic_mipsx.go:43
#2 0x004c1e5c in internal/runtime/atomic.Store64 (addr=0x1740fb0 <runtime.sched+16>, val=0) at internal/runtime/atomic/atomic_mipsx.go:99
#3 0x004c1238 in internal/runtime/atomic.(*Int64).Store (i=0x1740fb0 <runtime.sched+16>, value=0) at internal/runtime/atomic/types.go:81
#4 0x0046a92c in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at runtime/proc.go:3584
#5 0x0046c42c in runtime.schedule () at runtime/proc.go:3996
#6 0x0046d154 in runtime.goexit0 (gp=0x1d04d88) at runtime/proc.go:4269
#7 0x004b85c0 in runtime.mcall () at runtime/asm_mipsx.s:141
#8 0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC
Thread 10 (LWP 30401):
#0 runtime.futex () at runtime/sys_linux_mipsx.s:380
#1 0x00459f6c in runtime.futexsleep (addr=0x1c805d8, val=0, ns=-1) at runtime/os_linux.go:69
#2 0x0041b260 in runtime.notesleep (n=0x1c805d8) at runtime/lock_futex.go:170
#3 0x00466eb0 in runtime.mPark () at runtime/proc.go:1865
#4 0x00468eb0 in runtime.stopm () at runtime/proc.go:2886
#5 0x00469db0 in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at runtime/proc.go:3623
#6 0x0046c42c in runtime.schedule () at runtime/proc.go:3996
#7 0x0046c978 in runtime.park_m (gp=0x1f56128) at runtime/proc.go:4103
#8 0x004b85c0 in runtime.mcall () at runtime/asm_mipsx.s:141
#9 0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC
Thread 9 (LWP 30397):
#0 runtime.futex () at runtime/sys_linux_mipsx.s:380
#1 0x00459f6c in runtime.futexsleep (addr=0x1c499d8, val=0, ns=-1) at runtime/os_linux.go:69
#2 0x0041b260 in runtime.notesleep (n=0x1c499d8) at runtime/lock_futex.go:170
#3 0x00466eb0 in runtime.mPark () at runtime/proc.go:1865
#4 0x00468eb0 in runtime.stopm () at runtime/proc.go:2886
#5 0x00469db0 in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at runtime/proc.go:3623
#6 0x0046c42c in runtime.schedule () at runtime/proc.go:3996
#7 0x0046d154 in runtime.goexit0 (gp=0x1d04d88) at runtime/proc.go:4269
#8 0x004b85c0 in runtime.mcall () at runtime/asm_mipsx.s:141
#9 0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC
Thread 8 (LWP 30396):
#0 runtime.futex () at runtime/sys_linux_mipsx.s:380
#1 0x00459f6c in runtime.futexsleep (addr=0x1c494d8, val=0, ns=-1) at runtime/os_linux.go:69
#2 0x0041b260 in runtime.notesleep (n=0x1c494d8) at runtime/lock_futex.go:170
#3 0x00466eb0 in runtime.mPark () at runtime/proc.go:1865
#4 0x00468eb0 in runtime.stopm () at runtime/proc.go:2886
#5 0x0046eaf0 in runtime.exitsyscall0 (gp=0x1dc1c28) at runtime/proc.go:4829
#6 0x004b85c0 in runtime.mcall () at runtime/asm_mipsx.s:141
#7 0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC
Thread 7 (LWP 30389):
#0 runtime.futex () at runtime/sys_linux_mipsx.s:380
#1 0x00459f6c in runtime.futexsleep (addr=0x1c80ad8, val=0, ns=-1) at runtime/os_linux.go:69
#2 0x0041b260 in runtime.notesleep (n=0x1c80ad8) at runtime/lock_futex.go:170
#3 0x00466eb0 in runtime.mPark () at runtime/proc.go:1865
#4 0x00468eb0 in runtime.stopm () at runtime/proc.go:2886
#5 0x00469db0 in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at runtime/proc.go:3623
#6 0x0046c42c in runtime.schedule () at runtime/proc.go:3996
#7 0x0046cb1c in runtime.goschedImpl (gp=0x1c82248, preempted=false) at runtime/proc.go:4137
#8 0x0046cba0 in runtime.gosched_m (gp=0x1c82248) at runtime/proc.go:4142
#9 0x004b85c0 in runtime.mcall () at runtime/asm_mipsx.s:141
#10 0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC
Thread 6 (LWP 30386):
#0 runtime.futex () at runtime/sys_linux_mipsx.s:380
#1 0x00459f6c in runtime.futexsleep (addr=0x1749e2c <runtime.newmHandoff+12>, val=0, ns=-1) at runtime/os_linux.go:69
#2 0x0041b260 in runtime.notesleep (n=0x1749e2c <runtime.newmHandoff+12>) at runtime/lock_futex.go:170
#3 0x00468d50 in runtime.templateThread () at runtime/proc.go:2864
#4 0x00466d88 in runtime.mstart1 () at runtime/proc.go:1834
#5 0x00466c9c in runtime.mstart0 () at runtime/proc.go:1791
#6 0x004b854c in runtime.mstart () at runtime/asm_mipsx.s:89
Thread 5 (LWP 30385):
#0 runtime.futex () at runtime/sys_linux_mipsx.s:380
#1 0x00459f6c in runtime.futexsleep (addr=0x1c800d8, val=0, ns=-1) at runtime/os_linux.go:69
#2 0x0041b260 in runtime.notesleep (n=0x1c800d8) at runtime/lock_futex.go:170
#3 0x00466eb0 in runtime.mPark () at runtime/proc.go:1865
#4 0x004697ac in runtime.stoplockedm () at runtime/proc.go:3141
#5 0x0046c388 in runtime.schedule () at runtime/proc.go:3975
#6 0x0046c978 in runtime.park_m (gp=0x213a368) at runtime/proc.go:4103
#7 0x004b85c0 in runtime.mcall () at runtime/asm_mipsx.s:141
#8 0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC
Thread 4 (LWP 30384):
#0 runtime.futex () at runtime/sys_linux_mipsx.s:380
#1 0x00459f6c in runtime.futexsleep (addr=0x174a420 <runtime.sig>, val=0, ns=-1) at runtime/os_linux.go:69
#2 0x0041b588 in runtime.notetsleep_internal (n=0x174a420 <runtime.sig>, ns=-1, ~r0=<optimized out>) at runtime/lock_futex.go:193
#3 0x0041b6e0 in runtime.notetsleepg (n=0x174a420 <runtime.sig>, ns=-1, ~r0=<optimized out>) at runtime/lock_futex.go:247
#4 0x004b42d4 in os/signal.signal_recv (~r0=<optimized out>) at runtime/sigqueue.go:152
#5 0x00964454 in os/signal.loop () at os/signal/signal_unix.go:23
#6 0x004bab34 in runtime.goexit () at runtime/asm_mipsx.s:664
Thread 3 (LWP 30383):
#0 internal/runtime/atomic.spinLock () at internal/runtime/atomic/atomic_mipsx.s:250
#1 0x004c10ac in internal/runtime/atomic.lockAndCheck (addr=0x1c3b1c8) at internal/runtime/atomic/atomic_mipsx.go:43
#2 0x004c1e04 in internal/runtime/atomic.Load64 (addr=0x1c3b1c8, val=<optimized out>) at internal/runtime/atomic/atomic_mipsx.go:89
#3 0x004c11e8 in internal/runtime/atomic.(*Int64).Load (i=0x1c3b1c8, ~r0=<optimized out>) at internal/runtime/atomic/types.go:74
#4 0x0049113c in runtime.(*timers).wakeTime (ts=0x1c3b1a0, ~r0=<optimized out>) at runtime/time.go:877
#5 0x0049124c in runtime.(*timers).check (ts=0x1c3b1a0, now=0, rnow=<optimized out>, pollUntil=<optimized out>, ran=<optimized out>) at runtime/time.go:899
#6 0x00469e38 in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at runtime/proc.go:3271
#7 0x0046c42c in runtime.schedule () at runtime/proc.go:3996
#8 0x0046d154 in runtime.goexit0 (gp=0x1f54008) at runtime/proc.go:4269
#9 0x004b85c0 in runtime.mcall () at runtime/asm_mipsx.s:141
#10 0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC
Thread 2 (LWP 30382):
#0 internal/runtime/atomic.spinLock () at internal/runtime/atomic/atomic_mipsx.s:251
#1 0x004c10ac in internal/runtime/atomic.lockAndCheck (addr=0x17515e0 <runtime.memstats+7712>) at internal/runtime/atomic/atomic_mipsx.go:43
#2 0x004c1e04 in internal/runtime/atomic.Load64 (addr=0x17515e0 <runtime.memstats+7712>, val=<optimized out>) at internal/runtime/atomic/atomic_mipsx.go:89
#3 0x0042d34c in runtime.gcTrigger.test (t=..., ~r0=<optimized out>) at runtime/mgc.go:614
#4 0x00472cbc in runtime.sysmon () at runtime/proc.go:6175
#5 0x00466d88 in runtime.mstart1 () at runtime/proc.go:1834
#6 0x00466c9c in runtime.mstart0 () at runtime/proc.go:1791
#7 0x004b854c in runtime.mstart () at runtime/asm_mipsx.s:89
Thread 1 (LWP 30380):
#0 internal/runtime/atomic.spinLock () at internal/runtime/atomic/atomic_mipsx.s:251
#1 0x004c10ac in internal/runtime/atomic.lockAndCheck (addr=0x1c3c4c8) at internal/runtime/atomic/atomic_mipsx.go:43
#2 0x004c1e04 in internal/runtime/atomic.Load64 (addr=0x1c3c4c8, val=<optimized out>) at internal/runtime/atomic/atomic_mipsx.go:89
#3 0x004c11e8 in internal/runtime/atomic.(*Int64).Load (i=0x1c3c4c8, ~r0=<optimized out>) at internal/runtime/atomic/types.go:74
#4 0x0049113c in runtime.(*timers).wakeTime (ts=0x1c3c4a0, ~r0=<optimized out>) at runtime/time.go:877
#5 0x0049124c in runtime.(*timers).check (ts=0x1c3c4a0, now=0, rnow=<optimized out>, pollUntil=<optimized out>, ran=<optimized out>) at runtime/time.go:899
#6 0x00469e38 in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at runtime/proc.go:3271
#7 0x0046c42c in runtime.schedule () at runtime/proc.go:3996
#8 0x0046cb1c in runtime.goschedImpl (gp=0x1dc1c28, preempted=true) at runtime/proc.go:4137
#9 0x0046cc80 in runtime.gopreempt_m (gp=0x1dc1c28) at runtime/proc.go:4154
#10 0x004b85c0 in runtime.mcall () at runtime/asm_mipsx.s:141
#11 0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC |
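For readers unfamiliar with the frames above: threads 1, 2, 3, and 11 all reach spinLock through Load64 or Store64, which indicates that 64-bit atomics on this target are serialized through a lock word rather than a single hardware instruction. The following is a minimal user-level sketch of that general pattern, not the runtime's actual code. The type lockedUint64 and the helper countConcurrently are hypothetical names invented for illustration, and the call to runtime.Gosched is an assumption made so the sketch behaves well in ordinary user code.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// lockedUint64 is a hypothetical illustration type, not a runtime type.
// It guards a 64-bit value with a one-word spinlock, the general shape
// used when the hardware has no native 64-bit atomic instructions.
type lockedUint64 struct {
	state uint32 // 0 = unlocked, 1 = locked
	val   uint64
}

// spinLock busy-waits until it flips state from 0 to 1.
// If the current holder never reaches spinUnlock, this loop never ends.
func (l *lockedUint64) spinLock() {
	for !atomic.CompareAndSwapUint32(&l.state, 0, 1) {
		runtime.Gosched() // yield in user code; a low-level lock may simply spin
	}
}

// spinUnlock releases the lock word.
func (l *lockedUint64) spinUnlock() { atomic.StoreUint32(&l.state, 0) }

// Load reads the 64-bit value while holding the lock.
func (l *lockedUint64) Load() uint64 {
	l.spinLock()
	v := l.val
	l.spinUnlock()
	return v
}

// Add adds delta while holding the lock and returns the new value.
func (l *lockedUint64) Add(delta uint64) uint64 {
	l.spinLock()
	l.val += delta
	v := l.val
	l.spinUnlock()
	return v
}

// countConcurrently has `workers` goroutines each add 1 `perWorker` times
// and returns the final value.
func countConcurrently(workers, perWorker int) uint64 {
	var c lockedUint64
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perWorker; j++ {
				c.Add(1)
			}
		}()
	}
	wg.Wait()
	return c.Load()
}

func main() {
	fmt.Println(countConcurrently(4, 1000))
}
```

The property that matters for this report is that every caller depends on the lock holder reaching the unlock. If the holder is stopped between spinLock and spinUnlock and does not resume, all other callers spin indefinitely, which would look like several threads parked in spinLock at high CPU.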
In triage: this is a long shot, but in these weird cases it's always good to rule out async preemption (not that anything changed there either recently, though it could be kernel related). Try |
We're at a loss here. How hard would it be to try bisecting between Go 1.22 and Go 1.23? |
Is this how you want us to set up git bisect?
git bisect start
git bisect good go1.22.12
git bisect bad go1.23.0
The results can take some time to gather, as the problem can take hours or days to trigger. |
In response to the request for git bisect testing between Go 1.22 and 1.23: We've performed a systematic git bisect between go1.22.12 (good) and go1.23.0 (bad) to identify the commit causing our MIPS high CPU/hang issue. After testing the 11 intermediate versions, we've hopefully pinpointed the exact problematic commit:
This commit moved atomic operations from runtime/internal/atomic to internal/runtime/atomic, which might affect the MIPS spinlock implementation. We determined good vs. bad commits by deploying our application to 6 MIPSLE devices and monitoring them for at least 30 hours. bad: when at least one device showed the characteristic high CPU usage and application hang, with threads stuck in atomic operations (confirmed via strace and gdb).
|
Excellent analysis. Thanks very much. CC @panjf2000 @golang/runtime |
Thank you, and a small FYI: we've confirmed this issue persists in go1.24.0 as well. |
Theory: we avoid async preemption on runtime/internal/, but not internal/runtime: https://cs.opensource.google/go/go/+/master:src/runtime/preempt.go;l=420-434 That is a bug, but maybe not this bug. If you test with |
Tested with If it's of any use, we don't observe this problem on devices running other architectures, such as amd64 (Linux, Windows, Mac), arm64 (Linux), or armv7hf (Linux). |
The compiler recognizes all runtime packages and marks all of their functions as unsafe for async preemption: https://cs.opensource.google/go/go/+/master:src/cmd/compile/internal/liveness/plive.go;l=503 So async preemption is probably not enabled for internal/runtime, although it is still good to fix the inconsistency. |
We ran some experiments on the master branch at commit d31c805 and, as expected, the process hung. We then applied the following patch to the same commit, as suggested by @prattmic in #71591 (comment):
diff --git a/src/runtime/preempt.go b/src/runtime/preempt.go
index 45b1b5e9c7..3f8a4ed135 100644
--- a/src/runtime/preempt.go
+++ b/src/runtime/preempt.go
@@ -419,6 +419,7 @@ func isAsyncSafePoint(gp *g, pc, sp, lr uintptr) (bool, uintptr) {
name := u.srcFunc(uf).name()
if stringslite.HasPrefix(name, "runtime.") ||
stringslite.HasPrefix(name, "runtime/internal/") ||
+ stringslite.HasPrefix(name, "internal/runtime/") ||
stringslite.HasPrefix(name, "reflect.") {
// For now we never async preempt the runtime or
// anything closely tied to the runtime. Known issues
The six MIPSLE devices have now been running for more than 48 hours without any issues. |
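The effect of the one-line patch can be illustrated with a small standalone sketch. This is not the runtime's code: it uses strings.HasPrefix from the standard library in place of the runtime-internal stringslite package, and the names prefixesBefore, prefixesAfter, and excludedFromAsyncPreempt are hypothetical. The symbol names are taken from the backtraces earlier in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// prefixesBefore lists the symbol-name prefixes checked before the patch.
var prefixesBefore = []string{"runtime.", "runtime/internal/", "reflect."}

// prefixesAfter adds the prefix introduced by the patch.
var prefixesAfter = []string{"runtime.", "runtime/internal/", "internal/runtime/", "reflect."}

// excludedFromAsyncPreempt reports whether a function name matches any of
// the given prefixes. Hypothetical helper, written only for illustration.
func excludedFromAsyncPreempt(name string, prefixes []string) bool {
	for _, p := range prefixes {
		if strings.HasPrefix(name, p) {
			return true
		}
	}
	return false
}

func main() {
	names := []string{
		"internal/runtime/atomic.spinLock",
		"runtime.findRunnable",
		"os/signal.loop",
	}
	for _, name := range names {
		fmt.Printf("%-34s before=%-5v after=%v\n",
			name,
			excludedFromAsyncPreempt(name, prefixesBefore),
			excludedFromAsyncPreempt(name, prefixesAfter))
	}
}
```

Under these assumptions, only the internal/runtime/atomic symbol changes classification: it is not matched by the old prefix list and is matched by the new one, which is consistent with the package move identified by the bisect.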
Change https://go.dev/cl/654916 mentions this issue: |
Oh, thanks for testing this, and sending the CL @panjf2000. I actually decided these symbol name checks shouldn't matter because of #71591 (comment), went to remove them, only to discover they are load bearing because of #72031, but then forgot to come back here to update. 🤦 |
@gopherbot Please backport to 1.23 and 1.24. This is a regression for mips causing random deadlocks. |
Backport issue(s) opened: #72114 (for 1.23), #72115 (for 1.24). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases. |
suggests that it is not async preemption? But #71591 (comment) suggests the change to Sorry, I'm a bit confused. |
For #71591 Relevant CL 560155 Change-Id: Iebc497d56b36d50c13a6dd88e7bca4578a03cf63 Reviewed-on: https://go-review.googlesource.com/c/go/+/654916 LUCI-TryBot-Result: Go LUCI <[email protected]> Reviewed-by: Cherry Mui <[email protected]> Auto-Submit: Michael Pratt <[email protected]> Reviewed-by: Michael Pratt <[email protected]>
Go version
go version go1.23.4 linux/amd64
Output of go env in your module/workspace:
Background
At Axis Communications we make network cameras and also software that runs in these cameras. Axis cameras run Linux and have been made with mipsle, armv7hf, and arm64 hardware. I'm part of a team that builds a server in Go that can run on computers and on Axis cameras. We build this server for Linux, Windows, and macOS, and for amd64, x86, arm64, armv7hf, and mipsle hardware. This server software is installed and running in thousands of instances.
In the same department there's another team that maintains an earlier generation of a solution that accomplishes a similar thing. Its code is independent of the server my team builds. This earlier solution runs on Linux in Axis cameras on arm64, armv7hf, and mipsle hardware, and also runs as a server.
Since both of these pieces of software run as servers, they're supposed to run all the time and respond to requests. Both server binaries are normally compressed with UPX on mips hardware. We have built and deployed these servers for several years and have never seen the issue reported in this bug report. The source code is proprietary and can't be shared.
What did you do?
We upgraded the version of Go that we build our server software with from 1.22.9 to 1.23.4.
What did you see happen?
A while after updating to Go 1.23 and deploying our software, we started getting reports that our server had problems in Axis cameras with mips hardware. The problems we have observed with both servers are:
And for the server we make in our team, also:
We now know that it takes several hours to a few days for these problems to happen after the process starts.
These problems have only been observed when running on Linux in an Axis camera with mipsle hardware and when building with Go 1.23.x. We have not seen them with any other combination of OS, hardware, or Go version.
As a next step we built a firmware version with strace and gdb and built the server software with debug info and didn't compress the binary with UPX. After a while the problems happened and we examined the process with strace and gdb.
strace output:
gdb output:
Only one of the servers (the one my team makes) has been examined with strace and gdb. For the other (the older solution) we have only observed that the symptoms seem similar from the outside.
After reproducing the problem when building with Go 1.23.4 we rolled back the Go version to 1.22 (we used 1.22.11) for both servers and haven't noticed any problems in production.
We have also reproduced the problem using Go 1.23.0 in a test environment.
What did you expect to see?
Our process not hanging.
Other relevant info
We build the two servers with inlining disabled, except for a few select packages where we have noticed it makes a big difference in performance on mips. The purpose of disabling inlining is to reduce the size of the executable.
gcflags for mipsle:
-gcflags="all=-l" -gcflags="crypto/...=-l=false" -gcflags="vendor/golang.org/x/crypto/...=-l=false" -gcflags="math/...=-l=false"
Speculation/guesses
Since two independent pieces of software display the same kind of problems on mips hardware when using Go 1.23 to build with, but not with earlier Go versions, my guess is that there is some bug in the Go runtime for mips and that this was merged during the development of Go 1.23.
We have looked at the git diff comparing the 1.22.0 and 1.23.0 tags and looked at the changes containing the string mips. We found two commits that look like candidates to look more closely at:
9623a35 runtime/internal/atomic: add mips operators for And/Or
ff0bc46 runtime: add crash stack support for mips/mipsle