Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: process hangs when built with Go 1.23 for mips hardware #71591

Open
hansgylling opened this issue Feb 6, 2025 · 20 comments
Open

runtime: process hangs when built with Go 1.23 for mips hardware #71591

hansgylling opened this issue Feb 6, 2025 · 20 comments
Labels
arch-mips BugReport Issues describing a possible bug in the Go implementation. compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@hansgylling
Copy link

hansgylling commented Feb 6, 2025

Go version

go version go1.23.4 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='mipsle'
GOBIN=''
GOCACHE='/workspace/.cache/go-build/acap-acap/mipsle'
GOENV='/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/gomodcache'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.23.4'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/.config/go/telemetry'
GCCGO='gccgo'
GOMIPS='softfloat'
AR='mipsisa32r2el-axis-linux-gnu-ar'
CC='mipsisa32r2el-axis-linux-gnu-gcc  -mel -mabi=32 -msoft-float   -mno-synci --sysroot=/opt/axis/acapsdk/2.12.0/sysroots/mips32r2el-nf-poky-linux'
CXX='mipsisa32r2el-axis-linux-gnu-g++  -mel -mabi=32 -msoft-float  -march=34kc -msynci --sysroot=/opt/axis/acapsdk/2.12.0/sysroots/mips32r2el-nf-poky-linux'
CGO_ENABLED='1'
GOMOD='/workspace/go.mod'
GOWORK=''
CGO_CFLAGS='-Os -pipe -feliminate-unused-debug-types -mel -mabi=32 -msoft-float -mno-synci --sysroot=/opt/axis/acapsdk/2.12.0/sysroots/mips32r2el-nf-poky-linux -I/opt/axis/acapsdk/2.12.0//sysroots/mips32r2el-nf-poky-linux/usr/include/axsdk -I/opt/axis/acapsdk/2.12.0//sysroots/mips32r2el-nf-poky-linux/usr/include/glib-2.0 -I/opt/axis/acapsdk/2.12.0//sysroots/mips32r2el-nf-poky-linux/usr/lib/glib-2.0/include'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='--sysroot=/opt/axis/acapsdk/2.12.0//sysroots/mips32r2el-nf-poky-linux -Wl,-O1 -Wl,--as-needed'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-msoft-float -mno-synci --sysroot=/opt/axis/acapsdk/2.12.0/sysroots/mips32r2el-nf-poky-linux -I . -fPIC -mabi=32 -march=mips32 -msoft-float -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build4227944803=/tmp/go-build -gno-record-gcc-switches'

Background

At Axis Communications we make network cameras and also software that run in these cameras. Axis cameras have Linux as OS and have been made with mipsle, armv7hf, and arm64 hardware. I'm part of one team that builds a server in Go that can run on computers and on Axis cameras. We build this server for Linux, Windows, and Mac OS and for amd64, x86, arm64, armv7hf, and mipsle hardware. This server software is installed and is running in thousands of instances.

In the same department there's another team that maintain an earlier generation of a solution that accomplish a similar thing. It is independent code from the server my team does. This earlier solution run on Linux in Axis cameras on arm64, armv7hf, and mipsle hardware and also run as a server.

Since both of these pieces of software run as a server they're supposed to run all the time and respond to request. Both server binaries are normally compressed with UPX on mips hardware. We have built and deployed these servers for several years and have never seen the issue reported in this bug report. The source code is proprietary and can't be shared.

What did you do?

We upgraded the version of Go that we build our server software with from 1.22.9 to 1.23.4.

What did you see happen?

A while after updating to using Go 1.23 and deploying our software, we started getting reports that our server had problems in Axis cameras with mips hardware. The problems we have observed for both of the server software are:

  • Stopped sending data to cloud services.
  • Stopped logging to syslog.
  • Stopped responding to requests from cloud services.

And for the server we make in our team, also:

  • Stopped resonding to local APIs (the earlier generation server doesn't have a local API).
  • When running top, our server uses 70% to 90% CPU. (I'm not sure if this has been checked for the older generation server).

We now know that it takes several hours to a few days for these problems to happen after the process starts.

These problems have only been observed when running on Linux in an Axis camera with mipsle hardware and when building with Go 1.23.x. Not any other combination of OS, hardware or Go version.

As a next step we built a firmware version with strace and gdb and built the server software with debug info and didn't compress the binary with UPX. After a while the problems happened and we examined the process with strace and gdb.

strace output:

strace: Process 19765 attached with 11 threads
[pid 19787] futex(0x2080ad8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 19778] futex(0x2080fd8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 19775] futex(0x1749bec, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 19774] futex(0x174a1e0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 19773] futex(0x2048fd8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 19765] futex(0x173ec90, FUTEX_WAIT_PRIVATE, 0, NULL

gdb output:

(gdb) thread apply all backtrace

Thread 11 (LWP 19897):
#0  internal/runtime/atomic.spinLock () at internal/runtime/atomic/atomic_mipsx.s:287
#1  0x004c11e8 in internal/runtime/atomic.lockAndCheck (addr=0x203e260) at internal/runtime/atomic/atomic_mipsx.go:45
#2  0x004c1db8 in internal/runtime/atomic.Load64 (addr=0x203e260, val=<optimized out>) at internal/runtime/atomic/atomic_mipsx.go:91
#3  0x004c178c in internal/runtime/atomic.(*Uint64).Load (u=0x203e260, ~r0=<optimized out>) at internal/runtime/atomic/types.go:309
#4  0x00432704 in runtime.(*limiterEvent).stop (e=0x203e260, typ=1 '\001', now=119587286975287) at runtime/mgclimit.go:452
#5  0x0042fe34 in runtime.gcBgMarkWorker (ready=0x248a340) at runtime/mgc.go:1518
#6  0x0042fafc in runtime.gcBgMarkStartWorkers.gowrap1 () at runtime/mgc.go:1328
#7  0x004bac70 in runtime.goexit () at runtime/asm_mipsx.s:664

Thread 10 (LWP 19793):
#0  internal/runtime/atomic.spinLock () at internal/runtime/atomic/atomic_mipsx.s:287
#1  0x004c11e8 in internal/runtime/atomic.lockAndCheck (addr=0x203b1c8) at internal/runtime/atomic/atomic_mipsx.go:45
#2  0x004c1db8 in internal/runtime/atomic.Load64 (addr=0x203b1c8, val=<optimized out>) at internal/runtime/atomic/atomic_mipsx.go:91
#3  0x004c1324 in internal/runtime/atomic.(*Int64).Load (i=0x203b1c8, ~r0=<optimized out>) at internal/runtime/atomic/types.go:74
#4  0x00491278 in runtime.(*timers).wakeTime (ts=0x203b1a0, ~r0=<optimized out>) at runtime/time.go:877
#5  0x00491388 in runtime.(*timers).check (ts=0x203b1a0, now=0, rnow=<optimized out>, pollUntil=<optimized out>, ran=<optimized out>) at runtime/time.go:899
#6  0x00469f70 in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at runtime/proc.go:3270
#7  0x0046c568 in runtime.schedule () at runtime/proc.go:3995
#8  0x0046cc58 in runtime.goschedImpl (gp=0x22fc7e8, preempted=true) at runtime/proc.go:4136
#9  0x0046cdbc in runtime.gopreempt_m (gp=0x22fc7e8) at runtime/proc.go:4153
#10 0x004b86fc in runtime.mcall () at runtime/asm_mipsx.s:141
#11 0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC

Thread 9 (LWP 19789):
#0  internal/runtime/atomic.spinLock () at internal/runtime/atomic/atomic_mipsx.s:287
#1  0x004c11e8 in internal/runtime/atomic.lockAndCheck (addr=0x174b240 <runtime.gcController+160>) at internal/runtime/atomic/atomic_mipsx.go:45
#2  0x004c1c48 in internal/runtime/atomic.Xadd64 (addr=0x174b240 <runtime.gcController+160>, delta=2068, new=<optimized out>) at internal/runtime/atomic/atomic_mipsx.go:55
#3  0x004c145c in internal/runtime/atomic.(*Int64).Add (i=0x174b240 <runtime.gcController+160>, delta=2068, ~r0=<optimized out>) at internal/runtime/atomic/types.go:109
#4  0x00435f78 in runtime.gcDrain (gcw=0x203bc74, flags=2) at runtime/mgcmark.go:1236
#5  0x00435bc4 in runtime.gcDrainMarkWorkerDedicated (gcw=0x203bc74, untilPreempt=false) at runtime/mgcmark.go:1112
#6  0x0043028c in runtime.gcBgMarkWorker.func2 () at runtime/mgc.go:1504
#7  0x004b87a8 in runtime.systemstack () at runtime/asm_mipsx.s:188
#8  0x004b8688 in runtime.mstart () at runtime/asm_mipsx.s:89

Thread 8 (LWP 19787):
#0  runtime.futex () at runtime/sys_linux_mipsx.s:380
#1  0x0045a0a8 in runtime.futexsleep (addr=0x2080ad8, val=0, ns=-1) at runtime/os_linux.go:69
#2  0x0041b39c in runtime.notesleep (n=0x2080ad8) at runtime/lock_futex.go:170
#3  0x00466fec in runtime.mPark () at runtime/proc.go:1865
#4  0x00468fe8 in runtime.stopm () at runtime/proc.go:2885
#5  0x00469ee8 in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at runtime/proc.go:3622
#6  0x0046c568 in runtime.schedule () at runtime/proc.go:3995
#7  0x0046cab4 in runtime.park_m (gp=0x23c8128) at runtime/proc.go:4102
#8  0x004b86fc in runtime.mcall () at runtime/asm_mipsx.s:141
#9  0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC

Thread 7 (LWP 19778):
#0  runtime.futex () at runtime/sys_linux_mipsx.s:380
#1  0x0045a0a8 in runtime.futexsleep (addr=0x2080fd8, val=0, ns=-1) at runtime/os_linux.go:69
#2  0x0041b39c in runtime.notesleep (n=0x2080fd8) at runtime/lock_futex.go:170
#3  0x00466fec in runtime.mPark () at runtime/proc.go:1865
#4  0x00468fe8 in runtime.stopm () at runtime/proc.go:2885
#5  0x00469ee8 in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at runtime/proc.go:3622
#6  0x0046c568 in runtime.schedule () at runtime/proc.go:3995
#7  0x0046d290 in runtime.goexit0 (gp=0x23caea8) at runtime/proc.go:4268
#8  0x004b86fc in runtime.mcall () at runtime/asm_mipsx.s:141
#9  0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC

Thread 6 (LWP 19775):
#0  runtime.futex () at runtime/sys_linux_mipsx.s:380
#1  0x0045a0a8 in runtime.futexsleep (addr=0x1749bec <runtime.newmHandoff+12>, val=0, ns=-1) at runtime/os_linux.go:69
#2  0x0041b39c in runtime.notesleep (n=0x1749bec <runtime.newmHandoff+12>) at runtime/lock_futex.go:170
#3  0x00468e88 in runtime.templateThread () at runtime/proc.go:2863
#4  0x00466ec4 in runtime.mstart1 () at runtime/proc.go:1834
#5  0x00466dd8 in runtime.mstart0 () at runtime/proc.go:1791
#6  0x004b8688 in runtime.mstart () at runtime/asm_mipsx.s:89

Thread 5 (LWP 19774):
#0  runtime.futex () at runtime/sys_linux_mipsx.s:380
#1  0x0045a0a8 in runtime.futexsleep (addr=0x174a1e0 <runtime.sig>, val=0, ns=-1) at runtime/os_linux.go:69
#2  0x0041b6c4 in runtime.notetsleep_internal (n=0x174a1e0 <runtime.sig>, ns=-1, ~r0=<optimized out>) at runtime/lock_futex.go:193
#3  0x0041b81c in runtime.notetsleepg (n=0x174a1e0 <runtime.sig>, ns=-1, ~r0=<optimized out>) at runtime/lock_futex.go:247
#4  0x004b4410 in os/signal.signal_recv (~r0=<optimized out>) at runtime/sigqueue.go:152
#5  0x009644a8 in os/signal.loop () at os/signal/signal_unix.go:23
#6  0x004bac70 in runtime.goexit () at runtime/asm_mipsx.s:664

Thread 4 (LWP 19773):
#0  runtime.futex () at runtime/sys_linux_mipsx.s:380
#1  0x0045a0a8 in runtime.futexsleep (addr=0x2048fd8, val=0, ns=-1) at runtime/os_linux.go:69
#2  0x0041b39c in runtime.notesleep (n=0x2048fd8) at runtime/lock_futex.go:170
#3  0x00466fec in runtime.mPark () at runtime/proc.go:1865
#4  0x00468fe8 in runtime.stopm () at runtime/proc.go:2885
#5  0x00469ee8 in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at runtime/proc.go:3622
#6  0x0046c568 in runtime.schedule () at runtime/proc.go:3995
#7  0x0046cab4 in runtime.park_m (gp=0x2082fc8) at runtime/proc.go:4102
#8  0x004b86fc in runtime.mcall () at runtime/asm_mipsx.s:141
#9  0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC

Thread 3 (LWP 19771):
#0  internal/runtime/atomic.spinLock () at internal/runtime/atomic/atomic_mipsx.s:287
#1  0x004c11e8 in internal/runtime/atomic.lockAndCheck (addr=0x174b240 <runtime.gcController+160>) at internal/runtime/atomic/atomic_mipsx.go:45
#2  0x004c1c48 in internal/runtime/atomic.Xadd64 (addr=0x174b240 <runtime.gcController+160>, delta=1392, new=<optimized out>) at internal/runtime/atomic/atomic_mipsx.go:55
#3  0x004c145c in internal/runtime/atomic.(*Int64).Add (i=0x174b240 <runtime.gcController+160>, delta=1392, ~r0=<optimized out>) at internal/runtime/atomic/types.go:109
#4  0x00436110 in runtime.gcDrain (gcw=0x203cf74, flags=7) at runtime/mgcmark.go:1256
#5  0x00435b54 in runtime.gcDrainMarkWorkerIdle (gcw=0x203cf74) at runtime/mgcmark.go:1102
#6  0x004301bc in runtime.gcBgMarkWorker.func2 () at runtime/mgc.go:1508
#7  0x004b87a8 in runtime.systemstack () at runtime/asm_mipsx.s:188
#8  0x004b8688 in runtime.mstart () at runtime/asm_mipsx.s:89

Thread 2 (LWP 19770):
#0  internal/runtime/atomic.spinLock () at internal/runtime/atomic/atomic_mipsx.s:287
#1  0x004c11e8 in internal/runtime/atomic.lockAndCheck (addr=0x1740d68 <runtime.sched+8>) at internal/runtime/atomic/atomic_mipsx.go:45
#2  0x004c1db8 in internal/runtime/atomic.Load64 (addr=0x1740d68 <runtime.sched+8>, val=<optimized out>) at internal/runtime/atomic/atomic_mipsx.go:91
#3  0x004c1324 in internal/runtime/atomic.(*Int64).Load (i=0x1740d68 <runtime.sched+8>, ~r0=<optimized out>) at internal/runtime/atomic/types.go:74
#4  0x00472c28 in runtime.sysmon () at runtime/proc.go:6124
#5  0x00466ec4 in runtime.mstart1 () at runtime/proc.go:1834
#6  0x00466dd8 in runtime.mstart0 () at runtime/proc.go:1791
#7  0x004b8688 in runtime.mstart () at runtime/asm_mipsx.s:89

Thread 1 (LWP 19765):
#0  runtime.futex () at runtime/sys_linux_mipsx.s:380
#1  0x0045a0a8 in runtime.futexsleep (addr=0x173ec90 <runtime.m0+208>, val=0, ns=-1) at runtime/os_linux.go:69
#2  0x0041b39c in runtime.notesleep (n=0x173ec90 <runtime.m0+208>) at runtime/lock_futex.go:170
#3  0x00466fec in runtime.mPark () at runtime/proc.go:1865
#4  0x004698e4 in runtime.stoplockedm () at runtime/proc.go:3140
#5  0x0046c4c4 in runtime.schedule () at runtime/proc.go:3974
#6  0x0046cab4 in runtime.park_m (gp=0x2082a28) at runtime/proc.go:4102
#7  0x004b86fc in runtime.mcall () at runtime/asm_mipsx.s:141
#8  0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC

Only one of the server software have been examined with strace and gdb (the one that my team makes). For the other we have only observed that the symptoms from the outside seems similar (the older solution).

After reproducing the problem when building with Go 1.23.4 we rolled back the Go version to 1.22 (we used 1.22.11) for both servers and haven't noticed any problems in production.

We have also reproduced the problem using Go 1.23.0 in a test environment.

What did you expect to see?

Our process not hanging.

Other relevant info

We build the two servers with inling disabled, except for a few select packages where we have noticed it makes a big difference in performance on mips. The purpose of disabling inlining is to get a smaller size of the executable.

gcflags for mipsle:

-gcflags="all=-l" -gcflags="crypto/...=-l=false" -gcflags="vendor/golang.org/x/crypto/...=-l=false" -gcflags="math/...=-l=false"

Speculation/guesses

Since two independent pieces of software display the same kind of problems on mips hardware when using Go 1.23 to build with, but not with earlier Go versions, my guess is that there is some bug in the Go runtime for mips and that this was merged during the development of Go 1.23.

We have looked at the git diff comparing the 1.22.0 and 1.23.0 tags and looked at the changes containing the string mips. We found two commits that look like candidates to look more closely at:

9623a35 runtime/internal/atomic: add mips operators for And/Or
ff0bc46 runtime: add crash stack support for mips/mipsle

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Feb 6, 2025
@seankhliao seankhliao added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. arch-mips labels Feb 6, 2025
@seankhliao
Copy link
Member

cc @golang/mips

@gabyhelp gabyhelp added the BugReport Issues describing a possible bug in the Go implementation. label Feb 6, 2025
@hansgylling
Copy link
Author

The initial issue had the wrong values for go env. I realized why after I made the issue, but have fixed it now.

@seankhliao seankhliao changed the title runtime: Process hangs when built with Go 1.23 for mips hardware runtime: process hangs when built with Go 1.23 for mips hardware Feb 9, 2025
@seankhliao
Copy link
Member

Given that you've identified potential commits, is it possible for you to build Go with either of the commits reverted and test it?

@hallgren
Copy link

I rebuilt the application based on go1.23.6 with reverted 9623a35, branch.

After some hours the running application stopped responding.

gdb output:

(gdb) thread apply all backtrace

Thread 11 (LWP 2025):
#0  internal/runtime/atomic.spinLock () at internal/runtime/atomic/atomic_mipsx.s:250
#1  0x004c10ac in internal/runtime/atomic.lockAndCheck (addr=0x1740fb0 <runtime.sched+16>) at internal/runtime/atomic/atomic_mipsx.go:43
#2  0x004c1e5c in internal/runtime/atomic.Store64 (addr=0x1740fb0 <runtime.sched+16>, val=0) at internal/runtime/atomic/atomic_mipsx.go:99
#3  0x004c1238 in internal/runtime/atomic.(*Int64).Store (i=0x1740fb0 <runtime.sched+16>, value=0) at internal/runtime/atomic/types.go:81
#4  0x0046a92c in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at runtime/proc.go:3584
#5  0x0046c42c in runtime.schedule () at runtime/proc.go:3996
#6  0x0046d154 in runtime.goexit0 (gp=0x1d04d88) at runtime/proc.go:4269
#7  0x004b85c0 in runtime.mcall () at runtime/asm_mipsx.s:141
#8  0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC

Thread 10 (LWP 30401):
#0  runtime.futex () at runtime/sys_linux_mipsx.s:380
#1  0x00459f6c in runtime.futexsleep (addr=0x1c805d8, val=0, ns=-1) at runtime/os_linux.go:69
#2  0x0041b260 in runtime.notesleep (n=0x1c805d8) at runtime/lock_futex.go:170
#3  0x00466eb0 in runtime.mPark () at runtime/proc.go:1865
#4  0x00468eb0 in runtime.stopm () at runtime/proc.go:2886
#5  0x00469db0 in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at runtime/proc.go:3623
#6  0x0046c42c in runtime.schedule () at runtime/proc.go:3996
#7  0x0046c978 in runtime.park_m (gp=0x1f56128) at runtime/proc.go:4103
#8  0x004b85c0 in runtime.mcall () at runtime/asm_mipsx.s:141
#9  0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC

Thread 9 (LWP 30397):
#0  runtime.futex () at runtime/sys_linux_mipsx.s:380
#1  0x00459f6c in runtime.futexsleep (addr=0x1c499d8, val=0, ns=-1) at runtime/os_linux.go:69
#2  0x0041b260 in runtime.notesleep (n=0x1c499d8) at runtime/lock_futex.go:170
#3  0x00466eb0 in runtime.mPark () at runtime/proc.go:1865
#4  0x00468eb0 in runtime.stopm () at runtime/proc.go:2886
#5  0x00469db0 in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at runtime/proc.go:3623
#6  0x0046c42c in runtime.schedule () at runtime/proc.go:3996
#7  0x0046d154 in runtime.goexit0 (gp=0x1d04d88) at runtime/proc.go:4269
#8  0x004b85c0 in runtime.mcall () at runtime/asm_mipsx.s:141
#9  0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC

Thread 8 (LWP 30396):
#0  runtime.futex () at runtime/sys_linux_mipsx.s:380
#1  0x00459f6c in runtime.futexsleep (addr=0x1c494d8, val=0, ns=-1) at runtime/os_linux.go:69
#2  0x0041b260 in runtime.notesleep (n=0x1c494d8) at runtime/lock_futex.go:170
#3  0x00466eb0 in runtime.mPark () at runtime/proc.go:1865
#4  0x00468eb0 in runtime.stopm () at runtime/proc.go:2886
#5  0x0046eaf0 in runtime.exitsyscall0 (gp=0x1dc1c28) at runtime/proc.go:4829
#6  0x004b85c0 in runtime.mcall () at runtime/asm_mipsx.s:141
#7  0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC

Thread 7 (LWP 30389):
#0  runtime.futex () at runtime/sys_linux_mipsx.s:380
#1  0x00459f6c in runtime.futexsleep (addr=0x1c80ad8, val=0, ns=-1) at runtime/os_linux.go:69
#2  0x0041b260 in runtime.notesleep (n=0x1c80ad8) at runtime/lock_futex.go:170
#3  0x00466eb0 in runtime.mPark () at runtime/proc.go:1865
#4  0x00468eb0 in runtime.stopm () at runtime/proc.go:2886
#5  0x00469db0 in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at runtime/proc.go:3623
#6  0x0046c42c in runtime.schedule () at runtime/proc.go:3996
#7  0x0046cb1c in runtime.goschedImpl (gp=0x1c82248, preempted=false) at runtime/proc.go:4137
--Type <RET> for more, q to quit, c to continue without paging--
#8  0x0046cba0 in runtime.gosched_m (gp=0x1c82248) at runtime/proc.go:4142
#9  0x004b85c0 in runtime.mcall () at runtime/asm_mipsx.s:141
#10 0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC

Thread 6 (LWP 30386):
#0  runtime.futex () at runtime/sys_linux_mipsx.s:380
#1  0x00459f6c in runtime.futexsleep (addr=0x1749e2c <runtime.newmHandoff+12>, val=0, ns=-1) at runtime/os_linux.go:69
#2  0x0041b260 in runtime.notesleep (n=0x1749e2c <runtime.newmHandoff+12>) at runtime/lock_futex.go:170
#3  0x00468d50 in runtime.templateThread () at runtime/proc.go:2864
#4  0x00466d88 in runtime.mstart1 () at runtime/proc.go:1834
#5  0x00466c9c in runtime.mstart0 () at runtime/proc.go:1791
#6  0x004b854c in runtime.mstart () at runtime/asm_mipsx.s:89

Thread 5 (LWP 30385):
#0  runtime.futex () at runtime/sys_linux_mipsx.s:380
#1  0x00459f6c in runtime.futexsleep (addr=0x1c800d8, val=0, ns=-1) at runtime/os_linux.go:69
#2  0x0041b260 in runtime.notesleep (n=0x1c800d8) at runtime/lock_futex.go:170
#3  0x00466eb0 in runtime.mPark () at runtime/proc.go:1865
#4  0x004697ac in runtime.stoplockedm () at runtime/proc.go:3141
#5  0x0046c388 in runtime.schedule () at runtime/proc.go:3975
#6  0x0046c978 in runtime.park_m (gp=0x213a368) at runtime/proc.go:4103
#7  0x004b85c0 in runtime.mcall () at runtime/asm_mipsx.s:141
#8  0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC

Thread 4 (LWP 30384):
#0  runtime.futex () at runtime/sys_linux_mipsx.s:380
#1  0x00459f6c in runtime.futexsleep (addr=0x174a420 <runtime.sig>, val=0, ns=-1) at runtime/os_linux.go:69
#2  0x0041b588 in runtime.notetsleep_internal (n=0x174a420 <runtime.sig>, ns=-1, ~r0=<optimized out>) at runtime/lock_futex.go:193
#3  0x0041b6e0 in runtime.notetsleepg (n=0x174a420 <runtime.sig>, ns=-1, ~r0=<optimized out>) at runtime/lock_futex.go:247
#4  0x004b42d4 in os/signal.signal_recv (~r0=<optimized out>) at runtime/sigqueue.go:152
#5  0x00964454 in os/signal.loop () at os/signal/signal_unix.go:23
#6  0x004bab34 in runtime.goexit () at runtime/asm_mipsx.s:664

Thread 3 (LWP 30383):
#0  internal/runtime/atomic.spinLock () at internal/runtime/atomic/atomic_mipsx.s:250
#1  0x004c10ac in internal/runtime/atomic.lockAndCheck (addr=0x1c3b1c8) at internal/runtime/atomic/atomic_mipsx.go:43
#2  0x004c1e04 in internal/runtime/atomic.Load64 (addr=0x1c3b1c8, val=<optimized out>) at internal/runtime/atomic/atomic_mipsx.go:89
#3  0x004c11e8 in internal/runtime/atomic.(*Int64).Load (i=0x1c3b1c8, ~r0=<optimized out>) at internal/runtime/atomic/types.go:74
#4  0x0049113c in runtime.(*timers).wakeTime (ts=0x1c3b1a0, ~r0=<optimized out>) at runtime/time.go:877
#5  0x0049124c in runtime.(*timers).check (ts=0x1c3b1a0, now=0, rnow=<optimized out>, pollUntil=<optimized out>, ran=<optimized out>) at runtime/time.go:899
#6  0x00469e38 in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at runtime/proc.go:3271
#7  0x0046c42c in runtime.schedule () at runtime/proc.go:3996
#8  0x0046d154 in runtime.goexit0 (gp=0x1f54008) at runtime/proc.go:4269
#9  0x004b85c0 in runtime.mcall () at runtime/asm_mipsx.s:141
#10 0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC

Thread 2 (LWP 30382):
#0  internal/runtime/atomic.spinLock () at internal/runtime/atomic/atomic_mipsx.s:251
#1  0x004c10ac in internal/runtime/atomic.lockAndCheck (addr=0x17515e0 <runtime.memstats+7712>) at internal/runtime/atomic/atomic_mipsx.go:43
#2  0x004c1e04 in internal/runtime/atomic.Load64 (addr=0x17515e0 <runtime.memstats+7712>, val=<optimized out>) at internal/runtime/atomic/atomic_mipsx.go:89
#3  0x0042d34c in runtime.gcTrigger.test (t=..., ~r0=<optimized out>) at runtime/mgc.go:614
#4  0x00472cbc in runtime.sysmon () at runtime/proc.go:6175
#5  0x00466d88 in runtime.mstart1 () at runtime/proc.go:1834
#6  0x00466c9c in runtime.mstart0 () at runtime/proc.go:1791
#7  0x004b854c in runtime.mstart () at runtime/asm_mipsx.s:89

--Type <RET> for more, q to quit, c to continue without paging--
Thread 1 (LWP 30380):
#0  internal/runtime/atomic.spinLock () at internal/runtime/atomic/atomic_mipsx.s:251
#1  0x004c10ac in internal/runtime/atomic.lockAndCheck (addr=0x1c3c4c8) at internal/runtime/atomic/atomic_mipsx.go:43
#2  0x004c1e04 in internal/runtime/atomic.Load64 (addr=0x1c3c4c8, val=<optimized out>) at internal/runtime/atomic/atomic_mipsx.go:89
#3  0x004c11e8 in internal/runtime/atomic.(*Int64).Load (i=0x1c3c4c8, ~r0=<optimized out>) at internal/runtime/atomic/types.go:74
#4  0x0049113c in runtime.(*timers).wakeTime (ts=0x1c3c4a0, ~r0=<optimized out>) at runtime/time.go:877
#5  0x0049124c in runtime.(*timers).check (ts=0x1c3c4a0, now=0, rnow=<optimized out>, pollUntil=<optimized out>, ran=<optimized out>) at runtime/time.go:899
#6  0x00469e38 in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at runtime/proc.go:3271
#7  0x0046c42c in runtime.schedule () at runtime/proc.go:3996
#8  0x0046cb1c in runtime.goschedImpl (gp=0x1dc1c28, preempted=true) at runtime/proc.go:4137
#9  0x0046cc80 in runtime.gopreempt_m (gp=0x1dc1c28) at runtime/proc.go:4154
#10 0x004b85c0 in runtime.mcall () at runtime/asm_mipsx.s:141
#11 0x00000000 in ?? ()
Backtrace stopped: frame did not save the PC

@mknyszek mknyszek added this to the Backlog milestone Feb 12, 2025
@mknyszek
Copy link
Contributor

In triage: this is a long shot, but in these weird cases it's always good to rule out async preemption (not that anything changed there either recently, though it could be kernel related). Try GODEBUG=asyncpreemptoff=1, maybe.

@mknyszek
Copy link
Contributor

We're at a loss here. How hard would it be to try bisecting between Go 1.22 and Go 1.23?

@hallgren
Copy link

We're at a loss here. How hard would it be to try bisecting between Go 1.22 and Go 1.23?

Is this how you want us to setup git bisect?

git bisect start
git bisect god go1.22.12
git bisect bad go1.23.0

The result can take some time to gather as the problem can take hours/days to trigger.

@alexekdahl
Copy link

alexekdahl commented Feb 26, 2025

In response to the request for git bisect testing between Go 1.22 and 1.23:

We've performed a systematic git bisect between go1.22.12 (good) and go1.23.0 (bad) to identify the commit causing our MIPS high CPU/hang issue. After testing the 11 intermediate versions, we've hopefully pinpointed the exact problematic commit:

4c2b1e0feb3d3112da94fa4cd11ebe995003fa89
Author: Andy Pan <[email protected]>
Date: Thu Feb 1 10:21:14 2024 +0800
  runtime: migrate internal/atomic to internal/runtime
  For #65355

This commit moved atomic operations from runtime/internal/atomic to internal/runtime/atomic, might affect the MIPS spinlock implementation.

We determined good vs. bad commits by deploying our application to 6 MIPSLE devices and monitoring them for at least 30 hours.

bad: when at least one device showed the characteristic high CPU usage and application hang, with threads stuck in atomic operations (confirmed via strace and gdb).
good: when all devices ran stably for at least 30 hours with no issues observed.

git bisect log
git bisect start
# good: [5817e650946aaa0ac28956de96b3f9aa1de4b299] [release-branch.go1.22] go1.22.12
git bisect good 5817e650946aaa0ac28956de96b3f9aa1de4b299
# bad: [6885bad7dd86880be6929c02085e5c7a67ff2887] [release-branch.go1.23] go1.23.0
git bisect bad 6885bad7dd86880be6929c02085e5c7a67ff2887
# good: [66d34c7d08d7c536c3165dc49ed318e73ea5acc2] cmd/go/internal/modfetch: set protocol.version=2 for shallow git fetches
git bisect good 66d34c7d08d7c536c3165dc49ed318e73ea5acc2
# bad: [d4395ebc0c90a11a5f31fab0808baf3cb25d6ba8] runtime: remove overlap in the GC CPU pause time metrics
git bisect bad d4395ebc0c90a11a5f31fab0808baf3cb25d6ba8
# good: [6e49ccc7dbeda9b17bb816c6ca4a3018ff93d681] runtime,runtime/pprof: avoid tiny allocations in finalizer-related tests
git bisect good 6e49ccc7dbeda9b17bb816c6ca4a3018ff93d681
# good: [b822f098c557ea3d9200fd2d8a2e2b4d641e83e4] os: don't try to make the directory FD non-blocking in os.ReadDir
git bisect good b822f098c557ea3d9200fd2d8a2e2b4d641e83e4
# bad: [1e12eab8705d1d8d7472be9147a39caa1c8380db] all: fix a large number of comments
git bisect bad 1e12eab8705d1d8d7472be9147a39caa1c8380db
# good: [cff7267e0d77f02d582c613c272b6f8ebf1c0412] cmd/internal/osinfo,runtime,syscall: use RtlGetVersion instead of RtlGetNtVersionNumbers
git bisect good cff7267e0d77f02d582c613c272b6f8ebf1c0412
# good: [27fdef6168d2802dcf9ef382407fb6b16db3e150] test: put type declaration back inside the function
git bisect good 27fdef6168d2802dcf9ef382407fb6b16db3e150
# good: [b1182f22c0e557840239dfa80259d6b8c67fb559] cmd/go/internal/modget: remove duplicate exit
git bisect good b1182f22c0e557840239dfa80259d6b8c67fb559
# bad: [aab837dd46797b88d9ca827809fd7db663a5dd8d] net/http: panic early in do when Client is nil
git bisect bad aab837dd46797b88d9ca827809fd7db663a5dd8d
# bad: [3fefec96adae7e0ea66e2b624d4bef54a3a8b7df] runtime: fix the failures from longtest builders
git bisect bad 3fefec96adae7e0ea66e2b624d4bef54a3a8b7df
# bad: [e7bdc8819a19f06321d300719224abb2d8567b4a] internal/trace/v2: fix reader version for dumping text output
git bisect bad e7bdc8819a19f06321d300719224abb2d8567b4a
# bad: [4c2b1e0feb3d3112da94fa4cd11ebe995003fa89] runtime: migrate internal/atomic to internal/runtime
git bisect bad 4c2b1e0feb3d3112da94fa4cd11ebe995003fa89
# first bad commit: [4c2b1e0feb3d3112da94fa4cd11ebe995003fa89] runtime: migrate internal/atomic to internal/runtime

@ianlancetaylor
Copy link
Member

Excellent analysis. Thanks very much.

CC @panjf2000 @golang/runtime

@alexekdahl
Copy link

Thank you and a small fyi;

We've confirmed this issue persists in go1.24.0 as well.

@prattmic
Copy link
Member

Theory: we avoid async preemption on runtime/internal/, but not internal/runtime: https://cs.opensource.google/go/go/+/master:src/runtime/preempt.go;l=420-434

That is a bug, but maybe not this bug. If you test with GODEBUG=asyncpreemptoff=1 and find that the issue does not reproduce, that would support this theory.

@alexekdahl
Copy link

Tested with GODEBUG=asyncpreemptoff=1 on commit (devel go1.23-4c2b1e0feb), and the issue still occurs.
Is there anything else we could try to help you?

If it's of any use, we don't observe this problem on any devices running architectures as amd64 (Linux, Windows, Mac), arm64 (Linux), or armv7hf (Linux).

@cherrymui
Copy link
Member

Theory: we avoid async preemption on runtime/internal/, but not internal/runtime:

The compiler recognizes all runtime packages and mark the all functions as unsafe for async preemption https://cs.opensource.google/go/go/+/master:src/cmd/compile/internal/liveness/plive.go;l=503

So async preemption is probably not enabled for internal/runtime, although it is still good to fix the inconsistency.

@hallgren
Copy link

hallgren commented Mar 5, 2025

We conducted some experiments by checking out the master branch at commit d31c805, and as expected, it resulted in a hanging process.

We applied the following patch on the same commit as noted by @prattmic in #71591 (comment):

diff --git a/src/runtime/preempt.go b/src/runtime/preempt.go
index 45b1b5e9c7..3f8a4ed135 100644
--- a/src/runtime/preempt.go
+++ b/src/runtime/preempt.go
@@ -419,6 +419,7 @@ func isAsyncSafePoint(gp *g, pc, sp, lr uintptr) (bool, uintptr) {
        name := u.srcFunc(uf).name()
        if stringslite.HasPrefix(name, "runtime.") ||
                stringslite.HasPrefix(name, "runtime/internal/") ||
+               stringslite.HasPrefix(name, "internal/runtime/") ||
                stringslite.HasPrefix(name, "reflect.") {
                // For now we never async preempt the runtime or
                // anything closely tied to the runtime. Known issues

The six MIPSLE devices have now been running for more than 48 hours without any issues.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/654916 mentions this issue: runtime: explicitly disable async preempt for internal/runtime

@prattmic
Copy link
Member

prattmic commented Mar 5, 2025

Oh, thanks for testing this, and sending the CL @panjf2000. I actually decided these symbol name checks shouldn't matter because of #71591 (comment), went to remove them, only to discover they are load bearing because of #72031, but then forgot to come back here to update. 🤦

@prattmic
Copy link
Member

prattmic commented Mar 5, 2025

@gopherbot Please backport to 1.23 and 1.24. This is a regression for mips causing random deadlocks.

@gopherbot
Copy link
Contributor

Backport issue(s) opened: #72114 (for 1.23), #72115 (for 1.24).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@cherrymui
Copy link
Member

#71591 (comment)

Tested with GODEBUG=asyncpreemptoff=1 on commit (devel go1.23-4c2b1e0feb), and the issue still occurs.

suggests that it is not async preemption?

But #71591 (comment) suggests the change to isAsyncSafePoint actually makes a different?

Sorry, I'm a bit confused.

@dmitshur dmitshur modified the milestones: Backlog, Go1.25 Mar 5, 2025
@dmitshur dmitshur moved this from Todo to In Progress in Go Compiler / Runtime Mar 5, 2025
gopherbot pushed a commit that referenced this issue Mar 6, 2025
For #71591
Relevant CL 560155

Change-Id: Iebc497d56b36d50c13a6dd88e7bca4578a03cf63
Reviewed-on: https://go-review.googlesource.com/c/go/+/654916
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
Auto-Submit: Michael Pratt <[email protected]>
Reviewed-by: Michael Pratt <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch-mips BugReport Issues describing a possible bug in the Go implementation. compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Status: In Progress
Development

No branches or pull requests