nsqd/nsqlookupd: exit when tcp server accept error #1138

andyxning · 2019-02-25T04:36:05Z

This is a follow-up PR for #1135.

ploxiln · 2019-02-25T05:54:30Z

internal/protocol/tcp_server.go

@@ -33,4 +34,5 @@ func TCPServer(listener net.Listener, handler TCPHandler, logf lg.AppLogFunc) {
 	}

 	logf(lg.INFO, "TCP: closing %s", listener.Addr())
+	return errors.New("accept new tcp connection error")


There's not much point in always returning the same error. Consider that this also happens for a normal exit (when it closes the Listener) and then because this returned an error, the wrapper func will call NSQD.Exit() again concurrently.

I suggest an NSQD "exiting" bit. In the wrapper (below in nsqd.go), when this function TCPServer finishes, if the exiting flag is already set, that's normal. If the exiting flag is not set, that is unexpected, and it should call NSQD.Exit().

ploxiln · 2019-02-25T05:57:43Z

nsqd/nsqd.go

@@ -263,7 +263,9 @@ func (n *NSQD) Main() {

 	tcpServer := &tcpServer{ctx: ctx}
 	n.waitGroup.Wrap(func() {
-		protocol.TCPServer(n.tcpListener, tcpServer, n.logf)
+		if err := protocol.TCPServer(n.tcpListener, tcpServer, n.logf); err != nil {
+			n.Exit()


... and if this wrapper function calls NSQD.Exit() it has to launch a separate goroutine for it, because it will deadlock with the waitGroup waiting for this wrapper function to finish.

Good catch. Will investigate

andyxning · 2019-02-25T06:51:51Z

@ploxiln PTAL.

ploxiln · 2019-02-25T07:32:30Z

internal/protocol/tcp_server.go

+		return errors.New("accept new tcp connection error")
+	}
+
+	return nil


You could just return fatalErr in all cases, it will be nil if never set.

But actually, I suggest reverting this function to not return an error at all.

well, thinking about it, it might still be a good idea to return fatalErr, for nsqlookupd to do the ungraceful abort of os.Exit(1)

ploxiln · 2019-02-25T07:34:06Z

nsqd/nsqd.go

@@ -263,7 +265,9 @@ func (n *NSQD) Main() {

 	tcpServer := &tcpServer{ctx: ctx}
 	n.waitGroup.Wrap(func() {
-		protocol.TCPServer(n.tcpListener, tcpServer, n.logf)
+		if err := protocol.TCPServer(n.tcpListener, tcpServer, n.logf); err != nil {
+			go n.Exit()


I suggest checking the exiting flag here, and if it is not already set, then go Exit().

The flag variable could be an atomic int, to avoid the need to take a lock.

(this suggestion would require Exit() to set the flag as the first thing it does)

I suggest keeping the check inside the Exit function. That way, callers elsewhere do not need to check the internal exit flag before calling Exit, and the Exit logic still runs only once. Without lock protection, even with atomic values, separating the set from the check leaves a race condition.

With the set and the check separated, a race can occur even with atomic values. For example, if a normal exit and an accept-error exit happen simultaneously, the accept-error path can check the flag before the normal exit sets it, and then call Exit again. nsqd still exits correctly, but Exit is called more than once.

With a lock like this, the Exit logic, apart from closing the listeners, runs only once.

BTW, the cost of the lock should be fine and will not introduce any new performance problems, because all of this happens only during the exit phase.

The reason I suggested the exiting flag early on is because deciding to call Exit here could be as simple as:

--- a/nsqd/nsqd.go
+++ b/nsqd/nsqd.go
@@ -71,6 +71,7 @@ type NSQD struct {
 	notifyChan chan interface{}
 	optsNotificationChan chan struct{}
+	exitFlag int32
 	exitChan chan int
 	waitGroup util.WaitGroupWrapper
@@ -264,6 +265,10 @@ func (n *NSQD) Main() {
 	tcpServer := &tcpServer{ctx: ctx}
 	n.waitGroup.Wrap(func() {
 		protocol.TCPServer(n.tcpListener, tcpServer, n.logf)
+		if atomic.LoadInt32(&n.exitFlag) == 0 {
+			// abnormal listen loop exit
+			go n.Exit()
+		}
 	})
 	httpServer := newHTTPServer(ctx, false, n.getOpts().TLSRequired == TLSRequired)
 	n.waitGroup.Wrap(func() {
@@ -423,6 +428,10 @@ func (n *NSQD) PersistMetadata() error {
 }

 func (n *NSQD) Exit() {
+	if !atomic.CompareAndSwapInt32(&n.exitFlag, 0, 1) {
+		return
+	}
+
 	if n.tcpListener != nil {
 		n.tcpListener.Close()
 	}

(but there still is the issue of exiting with an error code)

Agreed that the atomic would be slightly cleaner, but I prefer directing flow through the existing machinery that's set up to manage the lifecycle of the service, to ensure that whatever svc.Run is doing gets done (rather than introducing a new exceptional path).

yeah ... calling NSQD.Exit() cleans up but does not actually exit ... go-svc needs to be involved anyway

one more cheesy idea to avoid needing to deal with go-svc:

--- a/nsqd/nsqd.go
+++ b/nsqd/nsqd.go
@@ -264,6 +264,12 @@ func (n *NSQD) Main() {
 	tcpServer := &tcpServer{ctx: ctx}
 	n.waitGroup.Wrap(func() {
 		protocol.TCPServer(n.tcpListener, tcpServer, n.logf)
+		if atomic.LoadInt32(&n.exitFlag) == 0 {
+			go func() {
+				n.Exit()
+				os.Exit(1)
+			}()
+		}
 	})

Meh, let's just do it the "right" way.

mreiferson · 2019-02-25T15:23:38Z

An alternative approach, which would avoid having to protect Exit() from multiple calls, would be to create a fatalErrCh channel that gets passed to svc.Run. We could close() it here.
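One way the channel idea might look, as a sketch under assumptions: `daemon`, `signalFatal`, and `waitForExit` are hypothetical names, and `waitForExit` merely stands in for whatever supervising loop receives the channel. It does not reflect the go-svc API. The `sync.Once` is an addition not mentioned above, included so that two failing listeners cannot close the channel twice.

```go
package main

import (
	"fmt"
	"sync"
)

// daemon is a hypothetical stand-in for the service.
type daemon struct {
	fatalErrCh chan struct{}
	closeOnce  sync.Once // assumption: guards against a double close
}

func newDaemon() *daemon {
	return &daemon{fatalErrCh: make(chan struct{})}
}

// signalFatal is what an accept loop would call on an unrecoverable error.
func (d *daemon) signalFatal() {
	d.closeOnce.Do(func() { close(d.fatalErrCh) })
}

// waitForExit stands in for the supervising loop. It returns the process
// exit code: 0 for a requested stop, 1 for a fatal error.
func waitForExit(stop <-chan struct{}, fatalErrCh <-chan struct{}) int {
	select {
	case <-stop:
		return 0
	case <-fatalErrCh:
		return 1
	}
}

func main() {
	d := newDaemon()
	stop := make(chan struct{}) // never closed in this example
	d.signalFatal()
	fmt.Println(waitForExit(stop, d.fatalErrCh)) // prints 1
}
```

With this shape the listener goroutine never calls Exit itself. It only reports the failure, and the existing lifecycle code decides how to shut down and which exit code to use.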

mreiferson · 2019-02-25T15:24:25Z

internal/protocol/tcp_server.go

@@ -26,11 +27,13 @@ func TCPServer(listener net.Listener, handler TCPHandler, logf lg.AppLogFunc) {
 			// theres no direct way to detect this error because it is not exposed
 			if !strings.Contains(err.Error(), "use of closed network connection") {
 				logf(lg.ERROR, "listener.Accept() - %s", err)


If we take the approach I proposed, this log line should communicate that the error is FATAL.

ploxiln · 2019-02-25T15:35:02Z

nsqd/nsqd.go

@@ -436,6 +440,12 @@ func (n *NSQD) Exit() {
 	}

 	n.Lock()
+
+	if n.shuttingDown {
+		return


should unlock if returning here

Good catch.

mreiferson · 2019-02-28T23:27:21Z

Let's finish this up in #1140, thanks @andyxning

ploxiln reviewed Feb 25, 2019

andyxning force-pushed the exit_when_tcp_server_accept_error branch 2 times, most recently from 84cb70e to b3faf08 Compare February 25, 2019 06:46

ploxiln reviewed Feb 25, 2019

andyxning force-pushed the exit_when_tcp_server_accept_error branch from b3faf08 to 20be5b1 Compare February 25, 2019 08:06

mreiferson reviewed Feb 25, 2019

mreiferson mentioned this pull request Feb 25, 2019

*: close TCP listener after breaking the accept() loop #1135

Closed

mreiferson added the bug label Feb 25, 2019

ploxiln reviewed Feb 25, 2019

mdh67899 mentioned this pull request Feb 26, 2019

nsqd/nsqlookupd: properly handle fatal accept errors #1140

Merged

exit when tcp server accept error

f315467

andyxning force-pushed the exit_when_tcp_server_accept_error branch from 20be5b1 to f315467 Compare February 26, 2019 06:48

mreiferson closed this Feb 28, 2019

andyxning mentioned this pull request Mar 1, 2019

nsqd/nsqlookupd: exit when tcp server accept error #1145

Closed
