[BUG] Oplog connection sleeps forever after disconnect; Preprod servers get into a bad state where client does not receive "stream-room-messages" #1289

ear-dev · 2022-08-29T17:09:40Z

We've been seeing this frequently again on our test RC instance in the preprod VAP account. When the server is in this bad state, the visitor does not see any messages from DF unless they refresh the page.

Restarting the RC server fixes the issue.

How can we debug this? Do we need to instrument areas in the pub/sub RC server code?

ear-dev · 2022-08-29T17:49:44Z

NOTE: Looking at the Rocketchat Metrics logs in Kibana, I notice that there is no oplog activity during the periods when we are experiencing this issue, whereas on a healthy server there is always at least some activity.

ear-dev · 2022-08-30T12:11:57Z

UPDATE: It seems the issue is that the mongodb-0 pod frequently fails the Kubernetes liveness probe and is occasionally restarted. After a restart, the oplog connection from the server sometimes does not reconnect and instead ends up in some kind of sleep state. I believe this happens when the pod gets scheduled onto an overloaded node, and I am working with the infra team to resolve it.

TODO: Can we make the oplog connection more robust so that it does not permanently go into a sleep state?
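
One possible direction for the TODO above, as a minimal sketch only: a watchdog that tracks when oplog activity was last seen and triggers a recovery action once the connection has been idle for too long. Everything here (`OplogWatchdog`, `maxIdleMs`, `recordActivity`, `check`) is a hypothetical name for illustration, not an existing Rocket.Chat or Meteor API. The recovery action itself (re-opening the oplog tail, or exiting so Kubernetes restarts the RC pod) is left as an assumption. The sketch also assumes that some oplog traffic is always expected, as observed on healthy servers.

```typescript
// Hypothetical watchdog: none of these names come from Rocket.Chat or Meteor.
type Clock = () => number;

class OplogWatchdog {
  private readonly maxIdleMs: number;
  private readonly now: Clock;
  private lastActivityMs: number;

  // The clock is injectable so the idle logic can be exercised without real timers.
  constructor(maxIdleMs: number, now: Clock = () => Date.now()) {
    this.maxIdleMs = maxIdleMs;
    this.now = now;
    this.lastActivityMs = now();
  }

  // Call whenever an oplog entry is observed.
  recordActivity(): void {
    this.lastActivityMs = this.now();
  }

  // True when no activity has been seen for longer than maxIdleMs.
  isStale(): boolean {
    return this.now() - this.lastActivityMs > this.maxIdleMs;
  }

  // Runs onStale if the connection looks stale and reports whether it fired.
  // The idle timer is reset afterwards so the action does not fire on every check.
  check(onStale: () => void): boolean {
    if (!this.isStale()) {
      return false;
    }
    onStale();
    this.lastActivityMs = this.now();
    return true;
  }
}
```

In use, `recordActivity` would be called from wherever oplog entries are handled, and `check` from a periodic timer, with `onStale` doing whatever recovery we settle on.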

ear-dev · 2022-08-31T15:00:04Z

From @bhardwajaditya:

> It's basically a stale connection issue; on restart, Meteor re-establishes the connection. I read a few articles on this and will try to find them again to see whether a solution is mentioned there.

Thanks. I consider this stale connection issue to be a pretty bad bug. Kubernetes is designed to swap pods in and out under various circumstances, which is a good thing when needed and makes for a more robust deployment. But if RC can't consistently maintain the oplog connection when mongodb pods get swapped around, that breaks the model.

ear-dev · 2022-09-01T15:30:46Z

Redis oplog: https://github.com/cult-of-coders/redis-oplog

Or use a managed service like Atlas or AWS?

ear-dev · 2022-09-01T15:43:22Z

It's possible that a managed MongoDB service is our best solution. @bhardwajaditya, please investigate whether that would be compatible with Rocketchat, and look into the AWS MongoDB offering in particular. Thanks.

ear-dev changed the title ~~[BUG] Preprod servers get into a bad state where client does not receive "stream-room-messages"~~ [BUG] Oplog connection sleeps forever after disconnect; Preprod servers get into a bad state where client does not receive "stream-room-messages" Aug 30, 2022

ear-dev assigned bhardwajaditya Sep 1, 2022
