[BUG] Oplog connection sleeps forever after disconnect; Preprod servers get into a bad state where client does not receive "stream-room-messages" #1289

Open
ear-dev opened this issue Aug 29, 2022 · 5 comments


ear-dev commented Aug 29, 2022

We've been seeing this frequently again on our test RC instance in the preprod VAP account. When the server is in this bad state, the visitor does not see any of the messages from DF unless they refresh the page.

Restarting the RC server fixes the issue.

How can we debug this? Do we need to instrument areas in the pub/sub RC server code?


ear-dev commented Aug 29, 2022

NOTE: Looking at the Rocketchat Metrics logs in Kibana, I notice that there is no oplog activity at all during the periods when we are experiencing this issue, whereas a healthy server always shows at least some activity.

image
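
One way to turn the "no oplog activity" observation into an explicit signal is a small silence detector that records when the last oplog entry was seen and flags the connection as stalled once a threshold passes. This is only a sketch: `OplogActivityMonitor`, `recordActivity`, `isStalled`, and the threshold value are hypothetical names made up for illustration, not part of Rocket.Chat or Meteor, and the code that would call `recordActivity` from the real oplog handler is not shown.

```typescript
// Hypothetical helper (not a Rocket.Chat or Meteor API): tracks the time of the
// last observed oplog entry and reports when the stream has been silent too long.
class OplogActivityMonitor {
  private lastSeenMs: number;
  private maxSilenceMs: number;

  // maxSilenceMs: how long the oplog may stay quiet before we call it stalled.
  // startMs: timestamp at which monitoring begins (e.g. Date.now()).
  constructor(maxSilenceMs: number, startMs: number) {
    this.maxSilenceMs = maxSilenceMs;
    this.lastSeenMs = startMs;
  }

  // Call this from wherever oplog entries are processed (assumed hook).
  recordActivity(nowMs: number): void {
    this.lastSeenMs = nowMs;
  }

  // True once the silence has lasted strictly longer than the threshold.
  isStalled(nowMs: number): boolean {
    return nowMs - this.lastSeenMs > this.maxSilenceMs;
  }

  // Milliseconds since the last observed entry, useful as a metric or log field.
  silenceMs(nowMs: number): number {
    return nowMs - this.lastSeenMs;
  }
}
```

A periodic check could log or export `silenceMs(Date.now())` so that the gap we currently spot by eye in Kibana becomes something we can alert on. The right threshold depends on how quiet a healthy preprod server normally is, which we would need to measure first.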


ear-dev commented Aug 30, 2022

UPDATE: It seems the issue is that the mongodb-0 pod frequently fails the Kubernetes liveness probe and occasionally restarts. After a restart, the oplog connection from the server sometimes does not reconnect and instead ends up in some kind of sleep state. I believe this happens when the pod gets scheduled onto an overloaded node, and I am working with the infra team to resolve that.

TODO: Can we make the oplog connection more robust so that it does not permanently go into a sleep state?
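
For discussion, here is a minimal sketch of what "more robust" could mean: instead of giving up or sleeping indefinitely, keep retrying with a capped exponential backoff. Everything here is an assumption for illustration. `backoffDelayMs` and `reconnectWithBackoff` are hypothetical names, the `connect` and `sleep` callbacks are placeholders for whatever actually re-opens the oplog connection, and the 500 ms base and 30 s cap are arbitrary. This is not how Meteor or Rocket.Chat handle the oplog today.

```typescript
// Hypothetical sketch: capped exponential backoff for re-establishing a connection.
// Delay doubles with each failed attempt but never exceeds capMs.
function backoffDelayMs(attempt: number, baseMs: number = 500, capMs: number = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Keeps calling connect() until it succeeds, waiting between failed attempts.
// connect: placeholder for the real "open the oplog connection" call (assumed).
// sleep:   injected so the loop can be driven by real timers or by tests.
// Resolves with the number of attempts it took to connect.
async function reconnectWithBackoff(
  connect: () => Promise<void>,
  sleep: (ms: number) => Promise<void>,
): Promise<number> {
  for (let attempt = 0; ; attempt++) {
    try {
      await connect();
      return attempt + 1;
    } catch {
      // The wait is bounded by the cap, so the loop never sleeps forever.
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

The cap is the important part: however long mongodb-0 takes to come back, the next attempt is never more than `capMs` away. A real fix would also need to detect that the existing connection has gone stale in the first place, which this sketch does not cover.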

ear-dev changed the title from [BUG] Preprod servers get into a bad state where client does not receive "stream-room-messages" to [BUG] Oplog connection sleeps forever after disconnect; Preprod servers get into a bad state where client does not receive "stream-room-messages" on Aug 30, 2022

ear-dev commented Aug 31, 2022

From @bhardwajaditya

It's basically a stale connection issue; on restart, Meteor re-establishes the connection. I read a few articles about this and will try to find them again to see whether they mention a solution.

Thanks. I consider this stale connection issue to be a pretty bad bug. Kubernetes is designed to swap pods in and out under various circumstances, which is a good thing when needed and makes for a more robust deployment. But if RC can't consistently maintain the oplog connection when MongoDB pods get swapped around, that breaks the model.


ear-dev commented Sep 1, 2022

Options to consider:

Redis oplog: https://github.com/cult-of-coders/redis-oplog

or

Use a managed service like Atlas or AWS?


ear-dev commented Sep 1, 2022

It's possible that a managed MongoDB service is our best solution. @bhardwajaditya, please investigate whether that would be compatible with Rocket.Chat. In particular, please look into the AWS MongoDB offering. Thanks.

2 participants