Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reload nameserver information on lookup failure #41582

Merged
merged 1 commit into from
May 6, 2017

Conversation

jonhoo
Copy link
Contributor

@jonhoo jonhoo commented Apr 27, 2017

As discussed in #41570, UNIX systems often cache the contents of /etc/resolv.conf, which can cause lookup failures to persist even after a network connection becomes available. This patch modifies lookup_host to force a reload of the nameserver entries following a lookup failure. This is in line with what many C programs already do (see #41570 for details). On systems with nscd, this should not be necessary, but not all systems run nscd.

Fixes #41570.
Depends on rust-lang/libc#585.

r? @alexcrichton

@alexcrichton
Copy link
Member

I wonder, are there performance implications about this? E.g. does a bunch of failing queries now take much longer?

@jonhoo
Copy link
Contributor Author

jonhoo commented Apr 27, 2017

@alexcrichton yes, I believe that that is true. This is why I first proposed in #41570 that we might instead want a connect_uncached, or some other way of converting from str to SocketAddr. That said, with this approach, applications for which this is a problem can manually call getaddrinfo, and then use the resulting SocketAddr. Such a workaround does not exist if you do want this behavior, because you can't force the cache to be ignored except by calling res_init.

// The lookup failure could be caused by using a stale /etc/resolv.conf.
// See https://github.com/rust-lang/rust/issues/41570.
// We therefore force a reload of the nameserver information.
c::res_init();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is in sys_common so it is compiled on all platforms, but I don't see a res_init on Windows.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what the if cfg!(unix) above was intended for?

Copy link
Contributor Author

@jonhoo jonhoo Apr 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh, of course, it's still compiled. I need #[cfg(..)] {} instead, right?

@alexcrichton alexcrichton added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 27, 2017
@arielb1 arielb1 added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label May 2, 2017
@arielb1
Copy link
Contributor

arielb1 commented May 2, 2017

@alexcrichton - are you looking at this PR? Friendly ping to keep this on your radar.

@alexcrichton
Copy link
Member

@arielb1 ah yes I am, mostly just caught up in travel!

@jonhoo would you mind benchmarking the effect of calling ::res_init on failed queries? Just to get an idea of how much slower it is.

@jonhoo
Copy link
Contributor Author

jonhoo commented May 2, 2017

@alexcrichton I'd say pretty negligible:

#![feature(test)]
extern crate libc;
extern crate test;

fn main() {}

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;
    use std::net::ToSocketAddrs;

    #[bench]
    fn bench_plain(b: &mut Bencher) {
        let addr = ("google.com", 80);
        b.iter(|| addr.to_socket_addrs().map(|a| a.count()).unwrap_or(0));
    }

    #[bench]
    fn bench_reinit(b: &mut Bencher) {
        let addr = ("google.com", 80);
        b.iter(|| {
                   addr.to_socket_addrs().map(|a| a.count()).unwrap_or(0);
                   unsafe { libc::res_init() };
               });
    }
}
$ cargo +nightly bench
    Finished release [optimized] target(s) in 0.0 secs
     Running target/release/deps/res_init_bench-94665f8bc2ebf0bc

running 3 tests
test tests::bench_plain  ... bench:   4,419,043 ns/iter (+/- 331,978)
test tests::bench_reinit ... bench:   4,420,291 ns/iter (+/- 297,022)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

$ sudo netctl stop-all
$ cargo +nightly bench
    Finished release [optimized] target(s) in 0.0 secs
     Running target/release/deps/res_init_bench-94665f8bc2ebf0bc

running 3 tests
test tests::bench_plain  ... bench:     267,152 ns/iter (+/- 18,776)
test tests::bench_reinit ... bench:     267,891 ns/iter (+/- 9,855)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

@alexcrichton
Copy link
Member

@jonhoo awesome thanks for testing!

@jonhoo jonhoo force-pushed the reread-nameservers-on-lookup-fail branch from 65fcd0d to 1309652 Compare May 4, 2017 01:13
@jonhoo
Copy link
Contributor Author

jonhoo commented May 4, 2017

@alexcrichton I believe that with rust-lang/libc#585 merged, 1309652 should now work fine. What's the process for building with a more recent libc?

@jonhoo
Copy link
Contributor Author

jonhoo commented May 4, 2017

@alexcrichton not sure if this is the right way to do it, but I bumped the submodule revision for liblibc, and everything seems to be working as it should. r?

@alexcrichton
Copy link
Member

@bors: r+

Looks good! Let's see what CI says

@bors
Copy link
Contributor

bors commented May 4, 2017

📌 Commit db36fc8 has been approved by alexcrichton

@bors
Copy link
Contributor

bors commented May 4, 2017

⌛ Testing commit db36fc8 with merge 31b2a0a...

bors added a commit that referenced this pull request May 4, 2017
…excrichton

Reload nameserver information on lookup failure

As discussed in #41570, UNIX systems often cache the contents of `/etc/resolv.conf`, which can cause lookup failures to persist even after a network connection becomes available. This patch modifies lookup_host to force a reload of the nameserver entries following a lookup failure. This is in line with what many C programs already do (see #41570 for details). On systems with nscd, this should not be necessary, but not all systems run nscd.

Fixes #41570.
Depends on rust-lang/libc#585.

r? @alexcrichton
@bors
Copy link
Contributor

bors commented May 4, 2017

💔 Test failed - status-travis

@TimNN
Copy link
Contributor

TimNN commented May 4, 2017

Looks legitimate (Edit: this is on macOS, eg: https://travis-ci.org/rust-lang/rust/jobs/228616549):

[00:02:40]   = note: Undefined symbols for architecture x86_64:
[00:02:40]             "_res_9_init", referenced from:
[00:02:40]                 std::net::lookup_host::hd3d9eaad793f8abc in std-438eba4cd7d88a45.0.o
[00:02:40]           ld: symbol(s) not found for architecture x86_64
[00:02:40]           clang: error: linker command failed with exit code 1 (use -v to see invocation)

@kennytm
Copy link
Member

kennytm commented May 4, 2017

Problem on libc side. Using res_init on macOS requires linking with -lresolv also.

@jonhoo
Copy link
Contributor Author

jonhoo commented May 4, 2017

@kennytm yeah, that's what I was worried about. @alexcrichton, looks like we can't rely on std to pull in resolv. How do you want us to fix this? Add linking with resolv on macOS in libc?

@alexcrichton
Copy link
Member

Ah yeah sorry to be clear I think adding linkage directives to libstd is ok for now, I think we just want to avoid modifying libc for now due to the impact it'll have.

@jonhoo
Copy link
Contributor Author

jonhoo commented May 4, 2017

@alexcrichton how would I go about doing that?

@alexcrichton
Copy link
Member

I think the best way would likely be to modify libc-shim's build script in this repo to print out rustc-link-lib=resolv on OSX perhaps?

@jonhoo
Copy link
Contributor Author

jonhoo commented May 4, 2017

Ah, okay -- gave that a shot in 912da9b.

@alexcrichton
Copy link
Member

Looks pretty good to me, although on second though I think this may actually be best in src/libstd/build.rs as the function's actually used in libstd, sorry about that! While you're at it as well, could you squash the commits into one?

@jonhoo jonhoo force-pushed the reread-nameservers-on-lookup-fail branch from 912da9b to 43acba5 Compare May 4, 2017 19:26
@jonhoo
Copy link
Contributor Author

jonhoo commented May 4, 2017

Done in 43acba5fa62ac0018aa8dd498d5709f07a68bd43

@alexcrichton
Copy link
Member

@bors: r+

Thanks!

@Mark-Simulacrum
Copy link
Member

macOS failed again with the same error, I think, though not sure:

[01:28:35]   "_res_9_init", referenced from:
[01:28:35]       std::net::lookup_host::hd3d9eaad793f8abc in libbar.a(libbar.0.o)

@jonhoo
Copy link
Contributor Author

jonhoo commented May 5, 2017

That's so strange... The log even says that it's linking with libresolv. At this point I think I'd need someone running macOS to take a look and track this symbol down? Debugging this is really tricky without direct access to a macOS environment. The only thing that comes to mind is that the symbol may actually be called res_9_init (note the lack of the _ prefix) in recent macOS deployments? If so, I guess we'd need to update libc too @alexcrichton ? Could someone running macOS take a stab at confirming this locally?

@alexcrichton
Copy link
Member

Oh -lresolv is correct but both of these tests are performing manual linking, so the libs linked just need to be updated (they're run-make tests I believe)

Verified

This commit was signed with the committer’s verified signature. The key has expired.
jonhoo Jon Gjengset
As discussed in rust-lang#41570, UNIX systems often cache the contents of
/etc/resolv.conf, which can cause lookup failures to persist even after
a network connection becomes available. This patch modifies lookup_host
to force a reload of the nameserver entries following a lookup failure.
This is in line with what many C programs already do (see rust-lang#41570 for
details). On systems with nscd, this should not be necessary, but not
all systems run nscd.

Introduces an std linkage dependency on libresolv on macOS/iOS (which
also makes it necessary to update run-make/tools.mk).

Fixes rust-lang#41570.
Depends on rust-lang/libc#585.
@jonhoo jonhoo force-pushed the reread-nameservers-on-lookup-fail branch from 43acba5 to 68ae617 Compare May 5, 2017 04:02
@jonhoo
Copy link
Contributor Author

jonhoo commented May 5, 2017

Updated src/test/run-make/tools.mk, squashed, and pushed in 68ae617.

@alexcrichton
Copy link
Member

@bors: r+

@bors
Copy link
Contributor

bors commented May 5, 2017

📌 Commit 68ae617 has been approved by alexcrichton

@jonhoo
Copy link
Contributor Author

jonhoo commented May 5, 2017

I'll take that as you agreeing that 68ae617#diff-bff523d3aff3a86f367f9d199a559b71R75 is the right change to make :p

@alexcrichton
Copy link
Member

Indeed!

@jonhoo
Copy link
Contributor Author

jonhoo commented May 5, 2017

Hmm, seems like @bors is taking a while to pick this up -- @alexcrichton ?

@Mark-Simulacrum
Copy link
Member

You are currently 4th in the queue: https://buildbot2.rust-lang.org/homu/queue/rust. It can take a while, especially now--we're actively trying to land a patch that fixes nightly, so that has higher priority so whenever we retry it, it jumps to the top.

@jonhoo
Copy link
Contributor Author

jonhoo commented May 5, 2017

Ah, thanks! Didn't know there was a way of looking at the queue. Never mind me :)

frewsxcv added a commit to frewsxcv/rust that referenced this pull request May 5, 2017
…-fail, r=alexcrichton

Reload nameserver information on lookup failure

As discussed in rust-lang#41570, UNIX systems often cache the contents of `/etc/resolv.conf`, which can cause lookup failures to persist even after a network connection becomes available. This patch modifies lookup_host to force a reload of the nameserver entries following a lookup failure. This is in line with what many C programs already do (see rust-lang#41570 for details). On systems with nscd, this should not be necessary, but not all systems run nscd.

Fixes rust-lang#41570.
Depends on rust-lang/libc#585.

r? @alexcrichton
bors added a commit that referenced this pull request May 5, 2017
Rollup of 9 pull requests

- Successful merges: #41064, #41307, #41512, #41582, #41678, #41722, #41734, #41761, #41763
- Failed merges:
@bors bors merged commit 68ae617 into rust-lang:master May 6, 2017
@bors
Copy link
Contributor

bors commented May 6, 2017

⌛ Testing commit 68ae617 with merge 42a4f37...

@retep998
Copy link
Member

retep998 commented May 6, 2017

Is bors drunk? This PR was just merged in a rollup, so why is bors trying to test it?

@jonhoo
Copy link
Contributor Author

jonhoo commented May 11, 2017

It is worth noting that if https://sourceware.org/bugzilla/show_bug.cgi?id=984 ever lands, we might want to revert this change.

kennytm added a commit to kennytm/rust-ios-android that referenced this pull request May 21, 2017

Verified

This commit was signed with the committer’s verified signature.
// The lookup failure could be caused by using a stale /etc/resolv.conf.
// See https://github.com/rust-lang/rust/issues/41570.
// We therefore force a reload of the nameserver information.
c::res_init();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this still result in surprising behaviour if e.g. the contents of /etc/resolv.conf change without the old resolver becoming unusable?

For instance, if I change my DNS resolver without making the old resolver unreachable, I'll never hit this error and any running rust applications will continue to use the old resolver...indefinitely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Though if the resolution happens successfully, what is the problem? It's also quite hard to get around that particular case. We could always call res_init, but that seems a little wasteful. The real solution to this is to fix libc (most libcs do not have this problem — glibc is the major exception). Applications that want to be robust against this could always call libc::res_init directly though of course.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though if the resolution happens successfully, what is the problem?

Playing devil's advocate, "successful" doesn't imply "correct".

We could always call res_init, but that seems a little wasteful.

How wasteful? Perhaps this is worth measuring.

The real solution to this is to fix libc (most libcs do not have this problem — glibc is the major exception).

What do you mean? What would "fixing" libc look like? What do other libcs do in contrast to glibc?

Copy link
Contributor Author

@jonhoo jonhoo May 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though if the resolution happens successfully, what is the problem?

Playing devil's advocate, "successful" doesn't imply "correct".

True, though that sounds like a very weird setup indeed. One in which you can connect using the resolution information from the old server, but you need to instead connect to the server provided by a new resolver?

We could always call res_init, but that seems a little wasteful.

How wasteful? Perhaps this is worth measuring.

I did some benchmarks above (#41582 (comment)), and it's not terrible (especially because it doesn't require a syscall), but if we can avoid doing something...

The real solution to this is to fix libc (most libcs do not have this problem — glibc is the major exception).

What do you mean? What would "fixing" libc look like? What do other libcs do in contrast to glibc?

No other libcs have this issue. Some of them don't cache /etc/resolv.conf, some integrate with NSS or similar services, which know when the cache should be flushed. I haven't looked into it too carefully. It is unclear what the "right" solution is given that glibc wants to be both fast (i.e., don't do a file read on every connect), and not rely on other services (like NSS).

@oconnor663
Copy link
Contributor

The glibc man page for res_init notes:

The traditional resolver interfaces such as res_init() and res_query() use some static (global) state stored in the _res structure, rendering these functions non-thread-safe.

Is it safe to call res_init like we're doing now? If not, what are the options for making it safer? We could take a global lock, but I don't think that would help if e.g. we're linking against code in other languages that doesn't know about our lock. The same man page notes that there are more recent functions like res_ninit that can use per-thread state. Those might be safer if they're widespread enough to depend on them.

@jonhoo
Copy link
Contributor Author

jonhoo commented Jul 14, 2017

That's a good point, though through some digging it appears as though res_init is in fact thread-safe. It's unfortunately somewhat tricky to use res_ninit and have it affect the same structure as is used by glibc. There's some discussion of it further down in the glibc man page:

In glibc, when you link with -lpthread, such a per-thread resolver state is already present. It can be accessed using _res', which has been redefined as a macro, in a similar way to what has been done for the errno' and h_errno' variables. This per-thread resolver state is also used for the gethostby*' family of functions, which means that for example `gethostbyname_r' is now fully thread-safe and re-entrant.

This suggests that we could (and probably should) use res_ninit, but we'd need to figure out which per-thread symbol to use. The aliasing magic seems to reside here, but I can't quite tell how it works to provide a per-thread _res symbol through a macro?

@oconnor663
Copy link
Contributor

@jonhoo should we file an issue somewhere for this? I'm not sure I would call it "unsound" so much as "maybe unsound under certain very specific conditions and versions of glibc" :p

@jonhoo
Copy link
Contributor Author

jonhoo commented Jul 17, 2017

Well, https://sourceware.org/bugzilla/show_bug.cgi?id=984 is now marked as RESOLVED, so it could be that we can now revert this commit entirely (though I haven't tested). I don't know of a version of glibc where our use is unsound, though you may be right that there is one.

@oconnor663
Copy link
Contributor

Existing versions of glibc that don't contain the fix will be around for years though. Probably we can never revert this workaround? :(

@oconnor663
Copy link
Contributor

oconnor663 commented Aug 1, 2017

@jonhoo I'm close to convinced that we've actually started doing something unsafe here. See #43592.

Edit: A fix has landed: #44965

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants