Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*: close TCP listener after breaking the accept() loop #1135

Closed
wants to merge 1 commit into from

Conversation

paper777
Copy link

We got an error from accept() loop:
[nsqd] xx ERROR: listener.Accept() - accept tcp [::]:4750: accept4: too many open files in systeme

but we found that the tcp listener was not closed properly (like net.httpserver). And connections from clinets got a lot of io timeout instead of connection refused

@mdh67899
Copy link
Contributor

interesting.

too many open files means the nsqd can't accept new connection because the file description has been exhausted, trying use ulimit -n 655350 add the max open files limit before start nsqd, it's a temporary solution

when the listener accept new connection failed, he will break the for loop and he will never process the new tcp connection, in your fix, we stop the tcp listener, but other http listener is still alive, this process is also alived, perhaps use defer os.Exit() ?

func TCPServer(listener net.Listener, handler TCPHandler, logf lg.AppLogFunc) {
logf(lg.INFO, "TCP: listening on %s", listener.Addr())
for {
clientConn, err := listener.Accept()
if err != nil {
if nerr, ok := err.(net.Error); ok && nerr.Temporary() {
logf(lg.WARN, "temporary Accept() failure - %s", err)
runtime.Gosched()
continue
}
// theres no direct way to detect this error because it is not exposed
if !strings.Contains(err.Error(), "use of closed network connection") {
logf(lg.ERROR, "listener.Accept() - %s", err)
}
break
}
go handler.Handle(clientConn)
}
logf(lg.INFO, "TCP: closing %s", listener.Addr())
}

@paper777
Copy link
Author

hi, @mdh67899
we set file-max 4M but still overflowed :(

when this error occurred, the http listener will be closed, refer to https://golang.org/src/net/http/server.go?s=93229:93284#L2811

os.Exit() may be not a good idea in my opinion, there are still many connected clients publishing and consuming.

@mreiferson mreiferson changed the title close tcp listener after breaking the accept() loop *: close TCP listener after breaking the accept() loop Feb 18, 2019
@mreiferson
Copy link
Member

Something here needs to be fixed, so thanks for raising the issue!

I tend to lean towards @mdh67899's suggestion:

when the listener accept new connection failed, he will break the for loop and he will never process the new tcp connection, in your fix, we stop the tcp listener, but other http listener is still alive, this process is also alived, perhaps use defer os.Exit() ?

The fact that the Accept loop has failed is "fatal", it won't be restarted and the process will be in an effectively broken state. The only reasonable thing to do is to fail loudly and exit. How we do that (ideally we'd persist whatever valid state we can) is the question...

@mreiferson mreiferson added the bug label Feb 18, 2019
@mdh67899
Copy link
Contributor

mdh67899 commented Feb 23, 2019

@mreiferson I think NSQD struct shoud add fatalChan, when some internal fatal error happens, NSQD should close it

the svc.Service interface add GetExitChan() function, this function will return program's fatal channel, if the program want exit, just close this channel.

the svc.Run function should using GetExitChan() function to get program's exit channel, when the signalChan get system signal or exit channel has been closed, we will use service.Stop() to stop program.

if you think this is a good idea, I am glad to submit a pull request to implement this function, thanks.

@ploxiln
Copy link
Member

ploxiln commented Feb 23, 2019

I think we would want to use the existing NSQD.Exit() which takes care of graceful shut-down. I think we would just want to add the ability for that to exit with a non-zero exit-status at the end.

@andyxning
Copy link
Member

The fact that the Accept loop has failed is "fatal", it won't be restarted and the process will be in an effectively broken state. The only reasonable thing to do is to fail loudly and exit.

Agreed.

I think we would want to use the existing NSQD.Exit() which takes care of graceful shut-down. I think we would just want to add the ability for that to exit with a non-zero exit-status at the end.

What we need to do is for nsqd to do a graceful shutdown whereas for nsqlookupd, a simple exit will be ok. As we use systemd, or some other process monitor system, to run nsqd, nsqd will be restarted after an exit.

@mreiferson
Copy link
Member

Let's continue this discussion in #1138 where we're close to consensus on the approach, thanks everyone!

@mreiferson mreiferson closed this Feb 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants