[BUG] Oplog connection sleeps forever after disconnect; Preprod servers get into a bad state where client does not receive "stream-room-messages" #1289
UPDATE: The issue appears to be that the mongodb-0 pod frequently fails the Kubernetes liveness probe and occasionally restarts. After a restart, the server's oplog connection sometimes does not reconnect and instead ends up in some kind of sleep state. I believe this happens when the pod gets scheduled onto an overloaded node, and I am working with the infra team to resolve it. TODO: Can we make the oplog connection more robust so that it does not permanently go into a sleep state?
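One possible direction for the TODO, as a rough sketch only: track when the oplog connection last showed activity, and if it has been silent for too long, tear it down and reconnect with exponential backoff. Everything below is hypothetical (the `OplogWatchdog` class, its method names, and the thresholds); it is not Rocket.Chat or Meteor API, and it assumes something (for example a periodic heartbeat write) guarantees regular oplog traffic, since an idle oplog is not necessarily a dead one.

```typescript
// Hypothetical sketch; none of these names come from Rocket.Chat or Meteor.
type Clock = () => number;

class OplogWatchdog {
  private lastActivityMs: number;
  private attempts = 0;

  constructor(
    private readonly staleAfterMs: number, // silence allowed before we call it stale
    private readonly baseDelayMs: number,  // first reconnect delay
    private readonly maxDelayMs: number,   // cap for the backoff
    private readonly now: Clock = Date.now,
  ) {
    this.lastActivityMs = this.now();
  }

  // Call whenever an oplog entry (or an assumed heartbeat write) is observed.
  recordActivity(): void {
    this.lastActivityMs = this.now();
    this.attempts = 0; // a healthy connection resets the backoff
  }

  // True when nothing has been seen for longer than the allowed window.
  isStale(): boolean {
    return this.now() - this.lastActivityMs > this.staleAfterMs;
  }

  // Delay before the next reconnect attempt: base * 2^attempts, capped.
  nextReconnectDelayMs(): number {
    const delay = Math.min(
      this.baseDelayMs * Math.pow(2, this.attempts),
      this.maxDelayMs,
    );
    this.attempts += 1;
    return delay;
  }
}
```

The server would call `recordActivity()` from wherever oplog entries are handled, check `isStale()` on a timer, and on a stale result close and reopen the tailing connection after `nextReconnectDelayMs()`. The clock is injectable so the logic can be tested without real timers.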
From @bhardwajaditya
Thanks. I consider this stale connection issue to be a pretty bad bug. Kubernetes is designed to swap pods in and out under various circumstances, which is a good thing when needed and makes for a more robust deployment. But if RC can't consistently maintain the oplog connection when MongoDB pods get swapped around, then that breaks the model.
Could we use Redis oplog (https://github.com/cult-of-coders/redis-oplog), or a managed service like Atlas or AWS?
It's possible that a managed MongoDB service is our best solution. @bhardwajaditya please investigate whether that would be compatible with Rocket.Chat, and look into the AWS MongoDB offering. Thanks.
We've been seeing this frequently again on our test RC instance in the preprod VAP account. When the server is in this bad state the visitor will not see any of the messages from DF unless they refresh the page.
Restarting the RC server fixes the issue.
How can we debug this? Do we need to instrument areas in the pub/sub RC server code?
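If we do go the instrumentation route, one lightweight option might be to count hits at each stage a message passes through and see which stage goes quiet when the server is in the bad state. The sketch below is only an illustration: `StageCounters` and the stage names (`oplog-entry`, `stream-emit`, `client-send`) are made up and do not correspond to actual Rocket.Chat code paths.

```typescript
// Hypothetical debugging helper; stage names are invented for illustration.
class StageCounters {
  private counts: Record<string, number> = {};
  private lastSeenMs: Record<string, number> = {};

  constructor(private readonly now: () => number = Date.now) {}

  // Call at each point a message passes through, e.g. hit("oplog-entry").
  hit(stage: string): void {
    this.counts[stage] = (this.counts[stage] || 0) + 1;
    this.lastSeenMs[stage] = this.now();
  }

  // Total number of hits recorded for a stage (0 if never hit).
  count(stage: string): number {
    return this.counts[stage] || 0;
  }

  // Given the stages in pipeline order, return the first one that has not
  // been hit since `sinceMs`. That is roughly where delivery stalls.
  firstSilentStage(order: string[], sinceMs: number): string | undefined {
    for (const stage of order) {
      const seen = this.lastSeenMs[stage];
      if (seen === undefined || seen < sinceMs) {
        return stage;
      }
    }
    return undefined;
  }
}
```

Logging `firstSilentStage([...], Date.now() - 60000)` periodically would show whether the server has stopped receiving oplog entries or is receiving them but not emitting "stream-room-messages" to clients.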