Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Treat any statsd metric driver related errors as non-fatal (just log the errors) #4597

Merged
merged 7 commits into from
Mar 14, 2019

Conversation

Kami
Copy link
Member

@Kami Kami commented Mar 14, 2019

This pull request updates our statsd driver to treat any exception which may be thrown by the driver as non-fatal.

Metrics driver related error should never break or degrade application performance and this pull request ensures that is indeed the case.

Background / Context

One of our users reported an issue with statd driving breaking / hanging the application when a metric backend server goes down. Initially I wasn't able to replicate it (statsd uses UDP so it should be fire and forget and there should be no exceptions).

Today the user provided log files. It turns out the problem lies in DNS resolution errors which can happen when establishing a connection to the backend server.

If a hostname and not IP address is used for metric backend and DNS resolution fails, statsd will propagate this exception all the way up and this will break the application since we don't catch and ignore / log those exceptions.

I wasn't able to reproduce the error because we use IP and not the hostname everywhere and even if backend server goes down, UDP send won't result in any exception due to the nature of the protocol.

Proposed Solution

In this solution I simply use a decorator which just logs any exception thrown by the driver and doesn't propagate it.

Eventually we could perhaps move that logic to the statsd client library, but it's still safer to have such safe guard in our code. As mentioned above, we don't want any such error / exception to break or degrade application performance.

TODO

  • Tests
  • Changelog entry

Underneath, statsd uses UDP protocol which is "fire and forget" which
means all the errors are non-fatal and don't propagate back to the
application. There is an edge case with how statsd library establishes
connection though. If hostname is used instead of an IP address and this
hostname can't be resolved, this DNS resolution error will propagate all
the way up and break the application.

This behavior is of course not desired (all metric related operation
errors should be non-fatal) and this PR fixes that. We simply ignore
those errors and log them.
@Kami
Copy link
Member Author

Kami commented Mar 14, 2019

Added tests.

I also needed to refactor the code a bit so now we only catch and ignore socket.error, IOError and OSError. We still want actual AssertionErrors to propagate since this would indicate a programmer error during development (and those exceptions should also be caught and fixed during development).

@Kami Kami merged commit f067176 into master Mar 14, 2019
@Kami Kami deleted the statsd_exception_shouldnt_be_fatal branch March 14, 2019 14:09
@arm4b
Copy link
Member

arm4b commented Mar 17, 2019

👍 Cool stuff!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants