Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsqd/nsqlookupd: exit when tcp server accept error #1138

Closed

Conversation

andyxning
Copy link
Member

This is a follow up PR for #1135 .

@@ -33,4 +34,5 @@ func TCPServer(listener net.Listener, handler TCPHandler, logf lg.AppLogFunc) {
}

logf(lg.INFO, "TCP: closing %s", listener.Addr())
return errors.New("accept new tcp connection error")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's not much point in always returning the same error. Consider that this also happens for a normal exit (when it closes the Listener) and then because this returned an error, the wrapper func will call NSQD.Exit() again concurrently.

I suggest an NSQD "exiting" bit. In the wrapper (below in nsqd.go), when this function TCPServer finishes, if the exiting flag is already set, that's normal. If the exiting flag is not set, that is unexpected, and it should call NSQD.Exit().

nsqd/nsqd.go Outdated
@@ -263,7 +263,9 @@ func (n *NSQD) Main() {

tcpServer := &tcpServer{ctx: ctx}
n.waitGroup.Wrap(func() {
protocol.TCPServer(n.tcpListener, tcpServer, n.logf)
if err := protocol.TCPServer(n.tcpListener, tcpServer, n.logf); err != nil {
n.Exit()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and if this wrapper function calls NSQD.Exit() it has to launch a separate goroutine for it, because it will deadlock with the waitGroup waiting for this wrapper function to finish.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Will investigate

@andyxning andyxning force-pushed the exit_when_tcp_server_accept_error branch 2 times, most recently from 84cb70e to b3faf08 Compare February 25, 2019 06:46
@andyxning
Copy link
Member Author

@ploxiln PTAL.

return errors.New("accept new tcp connection error")
}

return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could just return fatalErr in all cases, it will be nil if never set.

But actually, I suggest reverting this function to not return an error at all.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, thinking about it, it might still be a good idea to return fatalErr, for nsqlookupd to do the ungraceful abort of os.Exit(1)

@@ -263,7 +265,9 @@ func (n *NSQD) Main() {

tcpServer := &tcpServer{ctx: ctx}
n.waitGroup.Wrap(func() {
protocol.TCPServer(n.tcpListener, tcpServer, n.logf)
if err := protocol.TCPServer(n.tcpListener, tcpServer, n.logf); err != nil {
go n.Exit()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest checking the exiting flag here, and if it is not already set, then go Exit().

The flag variable could be an atomic int, to avoid the need to take a lock.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this suggestion would require Exit() to set the flag as the first thing it does)

Copy link
Member Author

@andyxning andyxning Feb 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we keep the check logic be within the Exit function. With this we should not take care about the internal exit status flag check before Exit is called in other places. And we should keep Exit function to be called only once. Without lock protection, even with atomic values, the set and check logic exists race condition.

With setting and checking separated with aotmic values, race condition may happens. For example, a normal exit and accept error exit happens simultaneously, before the normal exit call set the Exit flag, accept error exit checks it and then call Exit again. Although nsqd will exit correctly but Exit is called more than once.

With lock like this. Exit function logic except closing listeners will only be called once.

Copy link
Member Author

@andyxning andyxning Feb 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, lock consumption should be ok and will not invoke any new performance problems, because all this happens under exit phase.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I suggested the exiting flag early on is because deciding to call Exit here could be as simple as:

--- a/nsqd/nsqd.go
+++ b/nsqd/nsqd.go
@@ -71,6 +71,7 @@ type NSQD struct {
 
        notifyChan           chan interface{}
        optsNotificationChan chan struct{}
+       exitFlag             int32
        exitChan             chan int
        waitGroup            util.WaitGroupWrapper
 
@@ -264,6 +265,10 @@ func (n *NSQD) Main() {
        tcpServer := &tcpServer{ctx: ctx}
        n.waitGroup.Wrap(func() {
                protocol.TCPServer(n.tcpListener, tcpServer, n.logf)
+               if atomic.LoadInt32(&t.exitFlag) == 0 {
+                       // abnormal listen loop exit
+                       go n.Exit()
+               }
        })
        httpServer := newHTTPServer(ctx, false, n.getOpts().TLSRequired == TLSRequired)
        n.waitGroup.Wrap(func() {
@@ -423,6 +428,10 @@ func (n *NSQD) PersistMetadata() error {
 }
 
 func (n *NSQD) Exit() {
+       if !atomic.CompareAndSwapInt32(&t.exitFlag, 0, 1) {
+               return
+       }
+
        if n.tcpListener != nil {
                n.tcpListener.Close()
        }

(but there still is the issue of exiting with an error code)

Copy link
Member

@mreiferson mreiferson Feb 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that the atomic would be slightly cleaner, but I prefer directing flow through the existing machinery that's setup to manage the lifecycle of the service to ensure that, whatever svc.Run is doing, it gets done (rather than introducing a new exceptional path).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah ... calling NSQD.Exit() cleans up but does not actually exit ... go-svc needs to be involved anyway

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one more cheesy idea to avoid needing to deal with go-svc:

--- a/nsqd/nsqd.go
+++ b/nsqd/nsqd.go
@@ -264,6 +264,12 @@ func (n *NSQD) Main() {
        tcpServer := &tcpServer{ctx: ctx}
        n.waitGroup.Wrap(func() {
                protocol.TCPServer(n.tcpListener, tcpServer, n.logf)
+               if atomic.LoadInt32(&t.exitFlag) == 0 {
+                       go func() {
+                               n.Exit()
+                               os.Exit(1)
+                       }()
+               }
        })

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meh, let's just do it the "right" way.

@andyxning andyxning force-pushed the exit_when_tcp_server_accept_error branch from b3faf08 to 20be5b1 Compare February 25, 2019 08:06
@mreiferson
Copy link
Member

An alternative way to approach this, that would avoid having to protect Exit() from multiple calls would be to create a fatalErrCh channel, that gets passed to svc.Run. We could close() it here.

@@ -26,11 +27,13 @@ func TCPServer(listener net.Listener, handler TCPHandler, logf lg.AppLogFunc) {
// theres no direct way to detect this error because it is not exposed
if !strings.Contains(err.Error(), "use of closed network connection") {
logf(lg.ERROR, "listener.Accept() - %s", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we take the approach I proposed, this log line should communicate that the error is FATAL.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -436,6 +440,12 @@ func (n *NSQD) Exit() {
}

n.Lock()

if n.shuttingDown {
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should unlock if returning here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

@andyxning andyxning force-pushed the exit_when_tcp_server_accept_error branch from 20be5b1 to f315467 Compare February 26, 2019 06:48
@mreiferson
Copy link
Member

Let's finish this up in #1140, thanks @andyxning

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants